## We can create ease in accessing information from newspapers through web scraping. Here we use popular web scraping tools in Python to generate a CSV file containing requisite information from different sections of the web edition of *Indian Express*, a popular national daily in India. 

## This notebook can be scheduled to run daily to access info on a regular basis. While news websites can often distract us through information overload, seeking information like this can help us save time and even organize our news sources in a methodical manner. 

### Importing necessary packages

In [1]:
import pandas as pd
import os
from datetime import datetime 
from datetime import timedelta
from datetime import date
from array import *
import pdb

### Getting articles from UPSC Essentials page

#### A page highlighting the most important news sources for the UPSC examination, one of the toughest exams in the world, selecting an elite cohort of officers to the higher echelons of the Indian bureaucracy

#### To scrape the articles, we will be using the popular Python package of BeautifulSoup 

In [2]:
%%capture
!pip install lxml
!pip install html5lib
!pip install requests
!pip install beautifulsoup4

In [3]:
from bs4 import BeautifulSoup as bs
import requests
from urllib.request import Request, urlopen

In [4]:
url = "https://indianexpress.com/section/upsc-current-affairs/"      #URL of the requisite webpage containing UPSC news sources
text = requests.get(url)
soup = bs(text.content, 'html.parser')                               #Parsing text content for extraction

In [5]:
#soup.prettify()

In [6]:
sec_cont = []
sec_cont = soup.find('div', {'class':'nation'})                    #isolating the required section within the parse text

In [7]:
#sec_cont

#### We will be using the following method to scrape text from each section:
##### 1. Identify and list links of news articles in the body of the webpage
##### 2. Use links to source the articles
##### 3. Elicit requisite information from each article like Title, Date, Subheadings etc. (from the web elements of the page)
##### 4. Sort them by dates and filter the articles of the day.
##### 5. Create a dataframe to store the information. 

In [8]:
upsc_ca_premium_list = []
links=[]

for link in sec_cont.findAll('a', href=True):                         #function to automate scraping of each article in the list
    
    upsc_ca = {}
    page_url = link.get('href')
    
    try:

        # URL

        upsc_ca['URL'] = page_url

        # Invoke URL

        page = requests.get(page_url)
        page_soup = bs(page.content, 'lxml')

        # Title

        upsc_ca['Title'] = page_soup.find('title').text

        # Content

        page_content = ''
        page_soup_div = page_soup.find_all('h2', {'class': 'synopsis'})
        for p_content in page_soup_div:
            page_content = page_content + p_content.text

        # Content

        upsc_ca['Content'] = page_content

        # Date Time

        page_soup_span = page_soup.find_all('span',{'itemprop': 'dateModified'})
        upsc_ca['Publish Date'] = page_soup_span[0].text
        
        links.append(link.get('href'))
        upsc_ca_premium_list.append(upsc_ca)
    
    except:
        print ('ERROR!', page_url)
        
print ('Extracted', len(upsc_ca_premium_list), 'articles from UPSC Current Affairs')
print ('Done')

ERROR! https://indianexpress.com/section/upsc-current-affairs/page/2/
ERROR! https://indianexpress.com/section/upsc-current-affairs/page/3/
ERROR! https://indianexpress.com/section/upsc-current-affairs/page/12/
ERROR! https://indianexpress.com/section/upsc-current-affairs/page/2/
Extracted 50 articles from UPSC Current Affairs
Done


In [9]:
upsc_ca_premium_list = pd.DataFrame.from_dict(upsc_ca_premium_list)

In [10]:
upsc_ca = upsc_ca_premium_list.copy()

In [11]:
upsc_ca['URL']= upsc_ca['URL'].astype('string')
upsc_ca['Title']= upsc_ca['Title'].astype('string')
upsc_ca['Content']= upsc_ca['Content'].astype('string')
upsc_ca['Publish Date']= upsc_ca['Publish Date'].astype('string')

upsc_ca = upsc_ca.iloc[::2]    #Dropping duplicate rows

upsc_ca['Publish Date'] = upsc_ca['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
upsc_ca['Publish Date'] = pd.to_datetime(upsc_ca['Publish Date']).dt.strftime("%Y-%m-%d")

today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = yesterday.strftime("%Y-%m-%d")

try:
    
    upsc_current_affairs = upsc_ca.loc[upsc_ca['Publish Date']==today]
    
except:
    print("No UPSC Current Affairs today")

In [12]:
upsc_current_affairs

Unnamed: 0,URL,Title,Content,Publish Date


# Getting articles from Opinion page

In [13]:
url = "https://indianexpress.com/section/opinion/"
opeds = requests.get(url)
soup = bs(opeds.content, 'html.parser')

soup.prettify

op_cont = []
op_cont = soup.find('div', {'class':'o-opin'})

oped_list = []
articles=[]

for article in op_cont.findAll('a', href=True):
    
    opinion = {}
    op_url = article.get('href')
    
    try:

        # URL

        opinion['URL'] = op_url

        # Invoke URL

        page = requests.get(op_url)
        page_soup = bs(page.content, 'html.parser')

        # Title

        page_soup_title = page_soup.find('h1', {'class':'native_story_title'})
        opinion['Title']= page_soup_title.text

        # Content

        page_content = ''
        page_soup_div = page_soup.find_all('h2',{"class":"synopsis"})
        for p_content in page_soup_div:
            page_content = page_content + p_content.text

        # Content

        opinion['Content'] = page_content 

        # Date Time
        
        dates=[]
        dates = page_soup.find('div',{'id':'storycenterbyline'})
        for date in dates.select_one('span'):
            opinion['Publish Date'] = date.text
        
        articles.append(article.get('href'))
        oped_list.append(opinion)

    
    except Exception as e:
        print(e, op_url)

print ('Extracted', len(oped_list),'opinion articles')
print ('Done with extracting all op-eds')

'NoneType' object has no attribute 'text' https://indianexpress.com/profile/author/radha-kumar/
'NoneType' object has no attribute 'text' https://indianexpress.com/profile/columnist/d-sivanandhan/
'NoneType' object has no attribute 'text' https://indianexpress.com/profile/columnist/c-raja-mohan/
'NoneType' object has no attribute 'text' https://indianexpress.com/agency/editorial/
'NoneType' object has no attribute 'text' https://indianexpress.com/profile/author/ashok-gulati/
'NoneType' object has no attribute 'text' https://indianexpress.com/profile/columnist/manish-kumar-prasad/
'NoneType' object has no attribute 'text' https://indianexpress.com/agency/editorial/
'NoneType' object has no attribute 'text' https://indianexpress.com/profile/columnist/vikram-s-mehta/
'NoneType' object has no attribute 'text' https://indianexpress.com/profile/columnist/rajiv-pratap-rudy/
'NoneType' object has no attribute 'text' https://indianexpress.com/profile/columnist/monali-chowdhurie-aziz/
'NoneType'

In [14]:
opinion_list = pd.DataFrame.from_dict(oped_list)
opinion_list

opinion_list.info()

opinion_list['URL']= opinion_list['URL'].astype('string')
opinion_list['Title']= opinion_list['Title'].astype('string')
opinion_list['Content']= opinion_list['Content'].astype('string')
opinion_list['Publish Date']= opinion_list['Publish Date'].astype('string')

opinion_list = opinion_list.iloc[::2]    #Dropping duplicate rows


opinion_list['Publish Date'] = opinion_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
opinion_list['Publish Date'] = pd.to_datetime(opinion_list['Publish Date']).dt.strftime("%Y-%m-%d")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           50 non-null     object
 1   Title         50 non-null     object
 2   Content       50 non-null     object
 3   Publish Date  50 non-null     object
dtypes: object(4)
memory usage: 1.7+ KB


In [15]:
from datetime import date
today = date.today().strftime("%Y-%m-%d")
today

try:
    op_columns = opinion_list.loc[opinion_list['Publish Date']==today]
except:
    op_columns=[]
    print("No Opinion Columns today")

In [16]:
op_columns

Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/opinion/colu...,India’s response to Sri Lanka and Myanmar cris...,"Given our land and sea borders with Myanmar, a...",2022-08-01
2,https://indianexpress.com/article/opinion/colu...,The powerful and ubiquitous ED,ED's prominence points to a shift: Central age...,2022-08-01
4,https://indianexpress.com/article/opinion/colu...,Pakistan@75: Navigating the way forward,Pakistan needs a fresh start. Ending the stale...,2022-08-01
6,https://indianexpress.com/article/opinion/40-y...,"August 1, 1982, Forty Years Ago: Bill To Curb ...","History was made in the Bihar Assembly when, h...",2022-08-01
8,https://indianexpress.com/article/opinion/colu...,"On food inflation, the humble tomato has chall...",Ashok Gulati and Manish Kumar Prasad write: Mo...,2022-08-01
10,https://indianexpress.com/article/opinion/edit...,Jharkhand’s cash scandal coupled with the shad...,"Clearly, coalition politics, especially in the...",2022-08-01
12,https://indianexpress.com/article/opinion/colu...,"Naysayers are wrong, India does have success s...",Vikram S Mehta writes: Too many people believe...,2022-08-01
14,https://indianexpress.com/article/opinion/colu...,Financial health of airline sector is the real...,With multiple low-cost airline operators const...,2022-08-01
16,https://indianexpress.com/article/opinion/colu...,Friend-shoring: The medium-term response for i...,"Monali Chowdhurie Aziz: Friend-shoring, as the...",2022-08-01
18,https://indianexpress.com/article/opinion/edit...,US-China tensions may turn for the worse if bo...,Biden and Xi have apparently agreed to explore...,2022-08-01


# Getting articles from Explained pages

In [17]:
def explained(x):
    url = (("https://indianexpress.com/about/explained-") + (x))
    expl_cont = requests.get(url)
    soup = bs(expl_cont.content, 'html.parser')

    soup.prettify

    sec_exp = []
    sec_exp = soup.find('div', {'class':'search-result'})

    links =[]

    for link in sec_exp.findAll('a', href=True):
        section_explained = {}
        sec_exp_url = link.get('href')
    
        try:
            # URL

            section_explained['URL'] = sec_exp_url

            # Invoke URL

            page = requests.get(sec_exp_url)
            page_soup = bs(page.content, 'html.parser')

            # Title

            page_soup_title = page_soup.find('h1', {'class':'native_story_title'})
            section_explained['Title']= page_soup_title.text

            # Content

            page_content = ''
            page_soup_div = page_soup.find_all('h2',{"class":"synopsis"})
            for p_content in page_soup_div:
                page_content = page_content + p_content.text

            # Content

            section_explained['Content'] = page_content 

            # Date Time

            dates=[]
            dates = page_soup.find('div',{'id':'storycenterbyline'})
            for date in dates.select_one('span'):
                section_explained['Publish Date'] = date.text

            links.append(link.get('href'))
            sec_exp_list.append(section_explained)

        except Exception as e:
            print(e, sec_exp_url)

    print ('Extracted', len(sec_exp_list), 'explained articles in ' + (x))
    print ('Done with extracting all explained articles in ' + (x))

### Economics

In [18]:
sec_exp_list = []
explained('economics')

No connection adapters were found for 'javascript:void(0);' javascript:void(0);
Extracted 20 explained articles in economics
Done with extracting all explained articles in economics


In [19]:
sec_exp_list = pd.DataFrame.from_dict(sec_exp_list)
sec_exp_list

sec_exp_list.info()

sec_exp_list['URL']= sec_exp_list['URL'].astype('string')
sec_exp_list['Title']= sec_exp_list['Title'].astype('string')
sec_exp_list['Content']= sec_exp_list['Content'].astype('string')
sec_exp_list['Publish Date']= sec_exp_list['Publish Date'].astype('string')


sec_exp_list = sec_exp_list.iloc[::2]    #Dropping duplicate rows
sec_exp_list

sec_exp_list['Publish Date'] = sec_exp_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
sec_exp_list['Publish Date']

sec_exp_list['Publish Date'] = pd.to_datetime(sec_exp_list['Publish Date']).dt.strftime("%Y-%m-%d")
sec_exp_list['Publish Date']

try:
    econ_explained = sec_exp_list.loc[sec_exp_list['Publish Date']==today]
except:
    econ_explained=[]
    print("No Explained Economics today")
    
econ_explained

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           20 non-null     object
 1   Title         20 non-null     object
 2   Content       20 non-null     object
 3   Publish Date  20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/explained/ex...,Explained: 3 reasons why GST collections conti...,"Experts say that action against tax evaders, i...",2022-08-01
2,https://indianexpress.com/article/explained/ex...,ExplainSpeaking | Global bright spot or the on...,Employment data both from MGNREGA and the annu...,2022-08-01


### Health

In [20]:
sec_exp_list = []
explained('health')

No connection adapters were found for 'javascript:void(0);' javascript:void(0);
Extracted 20 explained articles in health
Done with extracting all explained articles in health


In [21]:
sec_exp_list = pd.DataFrame.from_dict(sec_exp_list)
sec_exp_list

sec_exp_list.info()

sec_exp_list['URL']= sec_exp_list['URL'].astype('string')
sec_exp_list['Title']= sec_exp_list['Title'].astype('string')
sec_exp_list['Content']= sec_exp_list['Content'].astype('string')
sec_exp_list['Publish Date']= sec_exp_list['Publish Date'].astype('string')


sec_exp_list = sec_exp_list.iloc[::2]    #Dropping duplicate rows
sec_exp_list

sec_exp_list['Publish Date'] = sec_exp_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
sec_exp_list['Publish Date']

sec_exp_list['Publish Date'] = pd.to_datetime(sec_exp_list['Publish Date']).dt.strftime("%Y-%m-%d")
sec_exp_list['Publish Date']

try:
    health_explained = sec_exp_list.loc[sec_exp_list['Publish Date']==today]
except:
    health_explained=[]
    print("No Explained Health today")
    
health_explained

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           20 non-null     object
 1   Title         20 non-null     object
 2   Content       20 non-null     object
 3   Publish Date  20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/explained/ex...,Explained: A man who had monkeypox has died in...,It is not yet known what comorbidities this pa...,2022-08-01


### Politics

In [22]:
sec_exp_list = []
explained('politics')

No connection adapters were found for 'javascript:void(0);' javascript:void(0);
Extracted 20 explained articles in politics
Done with extracting all explained articles in politics


In [23]:
sec_exp_list = pd.DataFrame.from_dict(sec_exp_list)
sec_exp_list

sec_exp_list.info()

sec_exp_list['URL']= sec_exp_list['URL'].astype('string')
sec_exp_list['Title']= sec_exp_list['Title'].astype('string')
sec_exp_list['Content']= sec_exp_list['Content'].astype('string')
sec_exp_list['Publish Date']= sec_exp_list['Publish Date'].astype('string')


sec_exp_list = sec_exp_list.iloc[::2]    #Dropping duplicate rows
sec_exp_list

sec_exp_list['Publish Date'] = sec_exp_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
sec_exp_list['Publish Date']

sec_exp_list['Publish Date'] = pd.to_datetime(sec_exp_list['Publish Date']).dt.strftime("%Y-%m-%d")
sec_exp_list['Publish Date']

try:
    politics_explained = sec_exp_list.loc[sec_exp_list['Publish Date']==today]
except:
    politics_explained=[]
    print("No Explained Politics today")
    
politics_explained

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           20 non-null     object
 1   Title         20 non-null     object
 2   Content       20 non-null     object
 3   Publish Date  20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


Unnamed: 0,URL,Title,Content,Publish Date


### Culture

In [24]:
sec_exp_list = []
explained('culture')

No connection adapters were found for 'javascript:void(0);' javascript:void(0);
Extracted 20 explained articles in culture
Done with extracting all explained articles in culture


In [25]:
sec_exp_list = pd.DataFrame.from_dict(sec_exp_list)
sec_exp_list

sec_exp_list.info()

sec_exp_list['URL']= sec_exp_list['URL'].astype('string')
sec_exp_list['Title']= sec_exp_list['Title'].astype('string')
sec_exp_list['Content']= sec_exp_list['Content'].astype('string')
sec_exp_list['Publish Date']= sec_exp_list['Publish Date'].astype('string')


sec_exp_list = sec_exp_list.iloc[::2]    #Dropping duplicate rows
sec_exp_list

sec_exp_list['Publish Date'] = sec_exp_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
sec_exp_list['Publish Date']

sec_exp_list['Publish Date'] = pd.to_datetime(sec_exp_list['Publish Date']).dt.strftime("%Y-%m-%d")
sec_exp_list['Publish Date']

try:
    culture_explained = sec_exp_list.loc[sec_exp_list['Publish Date']==today]
except:
    culture_explained=[]
    print("No Explained Culture today")
    
culture_explained

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           20 non-null     object
 1   Title         20 non-null     object
 2   Content       20 non-null     object
 3   Publish Date  20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


Unnamed: 0,URL,Title,Content,Publish Date


### Global

In [26]:
sec_exp_list = []
explained('global')

No connection adapters were found for 'javascript:void(0);' javascript:void(0);
Extracted 20 explained articles in global
Done with extracting all explained articles in global


In [27]:
sec_exp_list = pd.DataFrame.from_dict(sec_exp_list)
sec_exp_list

sec_exp_list.info()

sec_exp_list['URL']= sec_exp_list['URL'].astype('string')
sec_exp_list['Title']= sec_exp_list['Title'].astype('string')
sec_exp_list['Content']= sec_exp_list['Content'].astype('string')
sec_exp_list['Publish Date']= sec_exp_list['Publish Date'].astype('string')


sec_exp_list = sec_exp_list.iloc[::2]    #Dropping duplicate rows
sec_exp_list

sec_exp_list['Publish Date'] = sec_exp_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
sec_exp_list['Publish Date']

sec_exp_list['Publish Date'] = pd.to_datetime(sec_exp_list['Publish Date']).dt.strftime("%Y-%m-%d")
sec_exp_list['Publish Date']

try:
    global_explained = sec_exp_list.loc[sec_exp_list['Publish Date']==yesterday]
except:
    global_explained=[]
    print("No Explained Global today")
    
global_explained

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           20 non-null     object
 1   Title         20 non-null     object
 2   Content       20 non-null     object
 3   Publish Date  20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/explained/ex...,Explained: What’s driving the power struggle i...,The tussle over who gets to form the next gove...,2022-07-31


### Sci-tech

In [28]:
sec_exp_list = []
explained('sci-tech')

No connection adapters were found for 'javascript:void(0);' javascript:void(0);
Extracted 20 explained articles in sci-tech
Done with extracting all explained articles in sci-tech


In [29]:
sec_exp_list = pd.DataFrame.from_dict(sec_exp_list)
sec_exp_list

sec_exp_list.info()

sec_exp_list['URL']= sec_exp_list['URL'].astype('string')
sec_exp_list['Title']= sec_exp_list['Title'].astype('string')
sec_exp_list['Content']= sec_exp_list['Content'].astype('string')
sec_exp_list['Publish Date']= sec_exp_list['Publish Date'].astype('string')


sec_exp_list = sec_exp_list.iloc[::2]    #Dropping duplicate rows
sec_exp_list

sec_exp_list['Publish Date'] = sec_exp_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
sec_exp_list['Publish Date']

sec_exp_list['Publish Date'] = pd.to_datetime(sec_exp_list['Publish Date']).dt.strftime("%Y-%m-%d")
sec_exp_list['Publish Date']

try:
    scitech_explained = sec_exp_list.loc[sec_exp_list['Publish Date']==today]
except:
    scitech_explained=[]
    print("No Explained Sci-Tech today")
    
scitech_explained

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           20 non-null     object
 1   Title         20 non-null     object
 2   Content       20 non-null     object
 3   Publish Date  20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


Unnamed: 0,URL,Title,Content,Publish Date


### Climate

In [30]:
sec_exp_list = []
explained('climate')

No connection adapters were found for 'javascript:void(0);' javascript:void(0);
Extracted 20 explained articles in climate
Done with extracting all explained articles in climate


In [31]:
sec_exp_list = pd.DataFrame.from_dict(sec_exp_list)
sec_exp_list

sec_exp_list.info()

sec_exp_list['URL']= sec_exp_list['URL'].astype('string')
sec_exp_list['Title']= sec_exp_list['Title'].astype('string')
sec_exp_list['Content']= sec_exp_list['Content'].astype('string')
sec_exp_list['Publish Date']= sec_exp_list['Publish Date'].astype('string')


sec_exp_list = sec_exp_list.iloc[::2]    #Dropping duplicate rows
sec_exp_list

sec_exp_list['Publish Date'] = sec_exp_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))
sec_exp_list['Publish Date']

sec_exp_list['Publish Date'] = pd.to_datetime(sec_exp_list['Publish Date']).dt.strftime("%Y-%m-%d")
sec_exp_list['Publish Date']

try:
    climate_explained = sec_exp_list.loc[sec_exp_list['Publish Date']==today]
except:
    climate_explained=[]
    print("No Explained Climate today")
    
climate_explained

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           20 non-null     object
 1   Title         20 non-null     object
 2   Content       20 non-null     object
 3   Publish Date  20 non-null     object
dtypes: object(4)
memory usage: 768.0+ bytes


Unnamed: 0,URL,Title,Content,Publish Date


In [32]:
explained_list = [econ_explained,health_explained,politics_explained, culture_explained, global_explained, scitech_explained, climate_explained]

In [33]:
explained = pd.concat(explained_list)
explained

Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/explained/ex...,Explained: 3 reasons why GST collections conti...,"Experts say that action against tax evaders, i...",2022-08-01
2,https://indianexpress.com/article/explained/ex...,ExplainSpeaking | Global bright spot or the on...,Employment data both from MGNREGA and the annu...,2022-08-01
0,https://indianexpress.com/article/explained/ex...,Explained: A man who had monkeypox has died in...,It is not yet known what comorbidities this pa...,2022-08-01
0,https://indianexpress.com/article/explained/ex...,Explained: What’s driving the power struggle i...,The tussle over who gets to form the next gove...,2022-07-31


# Science section

In [34]:
url = "https://indianexpress.com/section/science/"
markup = requests.get(url)
soup = bs(markup.content, 'html.parser')

soup.prettify

sci_cont = []
sci_cont = soup.find('div', {'class':'nation'})

science_list = []
links=[]

for link in sci_cont.findAll('a', href=True):
    
    science = {}
    sci_url = link.get('href')
    
    try:

        # URL

        science['URL'] = sci_url

        # Invoke URL

        page = requests.get(sci_url)
        page_soup = bs(page.content, 'html.parser')

        # Title

        page_soup_title = page_soup.find('h1', {'class':'native_story_title'})
        science['Title']= page_soup_title.text

        # Content

        page_content = ''
        page_soup_div = page_soup.find_all('h2',{"class":"synopsis"})
        for p_content in page_soup_div:
            page_content = page_content + p_content.text

        # Content

        science['Content'] = page_content 

        # Date Time
        
        dates=[]
        dates = page_soup.find('div',{'id':'storycenterbyline'})
        for date in dates.select_one('span'):
            science['Publish Date'] = date.text
        
        links.append(link.get('href'))
        science_list.append(science)
    
    except Exception as e:
        print(e, sci_url)
        
print ('Extracted', len(science_list), 'science articles')
print ('Done with extracting all science articles')

'NoneType' object has no attribute 'text' https://indianexpress.com/section/science/page/2/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/science/page/3/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/science/page/224/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/science/page/2/
Extracted 50 science articles
Done with extracting all science articles


In [35]:
science = pd.DataFrame.from_dict(science_list)
science

science.info()

science['URL']= science['URL'].astype('string')
science['Title']= science['Title'].astype('string')
science['Content']= science['Content'].astype('string')
science['Publish Date']= science['Publish Date'].astype('string')

science = science.iloc[::2]    #Dropping duplicate rows


science['Publish Date'] = science['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))

science['Publish Date'] = pd.to_datetime(science['Publish Date']).dt.strftime("%Y-%m-%d")
science

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           50 non-null     object
 1   Title         50 non-null     object
 2   Content       50 non-null     object
 3   Publish Date  50 non-null     object
dtypes: object(4)
memory usage: 1.7+ KB


Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/technology/s...,NASA will send more helicopters to Mars,The trip back to Earth would take a few more y...,2022-08-01
2,https://indianexpress.com/article/technology/s...,AI predicts the shape of nearly every protein ...,DeepMind has released predictions for nearly e...,2022-07-31
4,https://indianexpress.com/article/technology/s...,Chinese Space rocket debris crashes back to Ea...,The US Space Command said that the Long March ...,2022-07-31
6,https://indianexpress.com/article/technology/s...,Science news weekly recap: Chinese rocket debr...,From Chinese rocket debris to robots in the In...,2022-07-31
8,https://indianexpress.com/article/technology/s...,Digging Deep: How mangroves are affected by cl...,Mangroves are tropical forest ecosystems that ...,2022-07-29
10,https://indianexpress.com/article/technology/s...,"There are holes on the ocean floor, scientists...","The question the scientists are posing, to the...",2022-07-29
12,https://indianexpress.com/article/technology/s...,Hidden Menace: Massive methane leaks speed up ...,Massive amounts of methane is venting into the...,2022-07-28
14,https://indianexpress.com/article/technology/s...,China closely tracking debris of its most powe...,"Debris from a large, newly launched Chinese ro...",2022-07-28
16,https://indianexpress.com/article/technology/s...,NASA robots work together for the first time o...,"For the first time ever, two Astrobee robots h...",2022-07-28
18,https://indianexpress.com/article/technology/s...,Astronaut Buzz Aldrin’s Apollo 11 flight jacke...,The jacket front displays NASA's logo and the ...,2022-07-27


In [36]:
from datetime import date
today = date.today().strftime("%Y-%m-%d")
today

try:
    science = science.loc[science['Publish Date']==today]
except:
    science=[]
    print("No Science Columns today")

In [37]:
science

Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/technology/s...,NASA will send more helicopters to Mars,The trip back to Earth would take a few more y...,2022-08-01


# Business section: Economy + Banking and Finance

In [38]:
biz_list=[]
def biz(x):
    url = (("https://indianexpress.com/section/business/") + (x))
    print(url)
    business_cont = requests.get(url)
    soup = bs(business_cont.content, 'html.parser')

    soup.prettify

    business_list=[]
    business_list = soup.find('div', {'class':'nation'})

    links =[]

    for link in business_list.find_all('a', href=True):
        
        business_page = {}
        business_url = link.get('href')
    
        try:
            # URL

            business_page['URL'] = business_url

            # Invoke URL

            page = requests.get(business_url)
            page_soup = bs(page.content, 'html.parser')

            # Title

            page_soup_title = page_soup.find('h1', {'class':'native_story_title'})
            business_page['Title']= page_soup_title.text

            # Content

            page_content = ''
            page_soup_div = page_soup.find_all('h2',{"class":"synopsis"})
            for p_content in page_soup_div:
                page_content = page_content + p_content.text

            # Content

            business_page['Content'] = page_content 

            # Date Time

            dates=[]
            dates = page_soup.find('div',{'id':'storycenterbyline'})
            for date in dates.select_one('span'):
                business_page['Publish Date'] = date.text

            links.append(link.get('href'))
            biz_list.append(business_page)


        except Exception as e:
            print(e, business_url)
            
    print ("Extracted", len(biz_list), "pages in " + (x))
    print ('Done with extracting all business articles in ' + (x))

In [39]:
biz_list=[]
biz('economy')

https://indianexpress.com/section/business/economy
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/economy/page/2/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/economy/page/3/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/economy/page/307/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/economy/page/2/
Extracted 50 pages in economy
Done with extracting all business articles in economy


In [40]:
biz_list = pd.DataFrame.from_dict(biz_list)
biz_list

biz_list.info()

biz_list['URL']= biz_list['URL'].astype('string')
biz_list['Title']= biz_list['Title'].astype('string')
sec_exp_list['Content']= biz_list['Content'].astype('string')
biz_list['Publish Date']= biz_list['Publish Date'].astype('string')


biz_list = biz_list.iloc[::2]    #Dropping duplicate rows

biz_list['Publish Date'] = biz_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))

biz_list['Publish Date'] = pd.to_datetime(biz_list['Publish Date']).dt.strftime("%Y-%m-%d")

try:
    economy = biz_list.loc[biz_list['Publish Date']==today]
except:
    economy=[]
    print("No Business:Economy today")
    
economy

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           50 non-null     object
 1   Title         50 non-null     object
 2   Content       50 non-null     object
 3   Publish Date  50 non-null     object
dtypes: object(4)
memory usage: 1.7+ KB


Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/business/eco...,GST collection rises 28% to Rs 1.49 lakh crore...,This is the sixth time that the monthly GST co...,2022-08-01
2,https://indianexpress.com/article/business/eco...,RBI likely to raise key policy rate by at leas...,The Reserve Bank of India's rate-setting panel...,2022-08-01
4,https://indianexpress.com/article/business/eco...,India’s manufacturing activity touches 8-month...,The seasonally adjusted S&P Global India Manuf...,2022-08-01
6,https://indianexpress.com/article/business/eco...,"States, Centre to meet on labour: Modalities o...",Implementing the labour codes not later than 2...,2022-08-01
8,https://indianexpress.com/article/business/eco...,"Raghuram Rajan lauds RBI, says India not facin...","The former RBI governor, Raghuram Rajan, said ...",2022-08-01


In [41]:
biz_list=[]
biz('banking-and-finance')

https://indianexpress.com/section/business/banking-and-finance
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/banking-and-finance/page/2/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/banking-and-finance/page/3/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/banking-and-finance/page/183/
'NoneType' object has no attribute 'text' https://indianexpress.com/section/business/banking-and-finance/page/2/
Extracted 50 pages in banking-and-finance
Done with extracting all business articles in banking-and-finance


In [42]:
biz_list = pd.DataFrame.from_dict(biz_list)
biz_list

biz_list.info()

biz_list['URL']= biz_list['URL'].astype('string')
biz_list['Title']= biz_list['Title'].astype('string')
sec_exp_list['Content']= biz_list['Content'].astype('string')
biz_list['Publish Date']= biz_list['Publish Date'].astype('string')


biz_list = biz_list.iloc[::2]    #Dropping duplicate rows

biz_list['Publish Date'] = biz_list['Publish Date'].map(lambda x: x.lstrip('Updated: '.rstrip('')))

biz_list['Publish Date'] = pd.to_datetime(biz_list['Publish Date']).dt.strftime("%Y-%m-%d")

try:
    bfsi = biz_list.loc[biz_list['Publish Date']==today]
except:
    bfsi=[]
    print("No Business:BFSI today")
    
bfsi

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           50 non-null     object
 1   Title         50 non-null     object
 2   Content       50 non-null     object
 3   Publish Date  50 non-null     object
dtypes: object(4)
memory usage: 1.7+ KB


Unnamed: 0,URL,Title,Content,Publish Date
0,https://indianexpress.com/article/business/ban...,"Electoral bonds: Parties mop up over Rs 10,000...",Electoral bonds are purchased anonymously by d...,2022-08-01


In [43]:
read_list = [upsc_current_affairs,op_columns,explained,science,economy,bfsi]

In [44]:
bfsi.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   URL           1 non-null      string
 1   Title         1 non-null      string
 2   Content       1 non-null      object
 3   Publish Date  1 non-null      object
dtypes: object(2), string(2)
memory usage: 40.0+ bytes


In [45]:
read = pd.concat(read_list)
read.reset_index(inplace = True, drop = True)

In [46]:
read = read[['Publish Date','Title','Content','URL']]
read

Unnamed: 0,Publish Date,Title,Content,URL
0,2022-08-01,India’s response to Sri Lanka and Myanmar cris...,"Given our land and sea borders with Myanmar, a...",https://indianexpress.com/article/opinion/colu...
1,2022-08-01,The powerful and ubiquitous ED,ED's prominence points to a shift: Central age...,https://indianexpress.com/article/opinion/colu...
2,2022-08-01,Pakistan@75: Navigating the way forward,Pakistan needs a fresh start. Ending the stale...,https://indianexpress.com/article/opinion/colu...
3,2022-08-01,"August 1, 1982, Forty Years Ago: Bill To Curb ...","History was made in the Bihar Assembly when, h...",https://indianexpress.com/article/opinion/40-y...
4,2022-08-01,"On food inflation, the humble tomato has chall...",Ashok Gulati and Manish Kumar Prasad write: Mo...,https://indianexpress.com/article/opinion/colu...
5,2022-08-01,Jharkhand’s cash scandal coupled with the shad...,"Clearly, coalition politics, especially in the...",https://indianexpress.com/article/opinion/edit...
6,2022-08-01,"Naysayers are wrong, India does have success s...",Vikram S Mehta writes: Too many people believe...,https://indianexpress.com/article/opinion/colu...
7,2022-08-01,Financial health of airline sector is the real...,With multiple low-cost airline operators const...,https://indianexpress.com/article/opinion/colu...
8,2022-08-01,Friend-shoring: The medium-term response for i...,"Monali Chowdhurie Aziz: Friend-shoring, as the...",https://indianexpress.com/article/opinion/colu...
9,2022-08-01,US-China tensions may turn for the worse if bo...,Biden and Xi have apparently agreed to explore...,https://indianexpress.com/article/opinion/edit...


In [47]:
today = date.today()
yesterday = today - timedelta(days = 1)

yesterday = yesterday.strftime("%b-%d-%Y")
today = today.strftime("%b-%d-%Y")

In [48]:
read.to_csv(today + ' IE extracts.csv', index=False)

In [49]:
print("File created for daily IE reading")

File created for daily IE reading
