# Google News Crawler

This notebook crawls the main and sub news articles from:
https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB?hl=en-IN&gl=IN&ceid=IN%3Aen

Import all the dependencies.

In [1]:
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
import time
import pandas as pd
from newspaper import Article
import nltk
from IPython.display import display, HTML
import warnings
from tabulate import tabulate
nltk.download('punkt')
warnings.filterwarnings("ignore") 
pd.options.display.max_colwidth = 2000

[nltk_data] Downloading package punkt to /home/gopal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


A class that holds all the news-related methods.

In [2]:
class News():
    def __init__(self):
        self.base_url = "https://news.google.com"

        
    # Method for getting HTML content
    # Returns BeautifulSoup object
    
    def bs_html(self, url):
        response = requests.get(url)
        
        # HTML of the page
        html = response.text
        # BeautifulSoup object of the HTML
        bs_html = BeautifulSoup(html, features = 'lxml')

        return bs_html
   
    # Get article data using newspaper library
    
    def get_article_data(self, url):
        
        # language="en" tells newspaper to treat the article as English
        article = Article(url, language="en")
        try:
            article.download() 
            article.parse() 
            article.nlp() 

            article_data = list() 

            article_data.append(article.title) 
            article_data.append(article.summary)       
            article_data.append(article.publish_date)
            article_data.append(url)
        except Exception:
            # Download or parsing failed; return None so the caller can filter it out
            article_data = None
            
        return article_data
    
    # Gets the data for all the articles, using multiprocessing for better speed
    
    def multi_proccess_article_data(self, article_urls):
        
        # 20 worker processes; chunksize=1 hands the URLs out one at a time
        with Pool(processes=20) as pool:
            all_articles_data = pool.map(self.get_article_data, article_urls, chunksize=1)

        return all_articles_data
    
    # Returns the complete URLs for a list of BeautifulSoup elements
    
    def get_urls(self, list_news):
        
        # Finds the first 'a' tag inside each element
        list_news = [news.find('a') for news in list_news]

        news_urls = list()

        for news in list_news:
            
            # Extracting 'href' attribute
            sub_url = news['href']
            
            # Concatenating the base and partial URLs to make the URL complete;
            # only the leading '.' of the relative href is stripped, so dots elsewhere survive
            n_url = self.base_url + sub_url.lstrip('.')

            news_urls.append(n_url)
        

        return news_urls

    # Displays a DataFrame as an HTML table with a title

    def display_table(self, df, header):
        
        # Setting the indices from 1 to the length of the DataFrame
        df.index = range(1, len(df)+1)
        
        # Wrapping each URL in an 'a' tag so that it renders as a clickable link
        df['URL'] = df['URL'].apply(lambda x: '<a href="'+x+'">link</a>')
        
        # Converting the DataFrame into HTML and adding the title
        html_table = '<h1 align = "Center">{}</h1>'.format(header)+df.to_html(escape=False)
        
        # Left-aligning the header cells
        html_table = html_table.replace('<th>','<th style="text-align: left;">')
        # Left-aligning the data cells
        html_table = html_table.replace('<td>','<td style="text-align: left;">')
        
        # Displays the HTML, replacing each escaped '\\n' with '<br>' for a line break
        display(HTML(html_table.replace('\\n','<br>')))
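
`get_urls` builds complete URLs by concatenating `base_url` with each relative `href`. As an alternative, the standard library's `urllib.parse.urljoin` resolves relative links without any string manipulation. The following is a minimal sketch, not part of the notebook: the helper name `absolute_urls` and the sample `href` are hypothetical, and it assumes the links are relative to the site root, as the concatenation in `get_urls` implies.

```python
from urllib.parse import urljoin

# Assumed base; the trailing slash makes "./path" resolve against the site root
BASE_URL = "https://news.google.com/"

def absolute_urls(hrefs, base=BASE_URL):
    """Resolve a list of relative hrefs against the base URL (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]

# Illustrative href, not taken from a real page; note the dot inside the path
sample_hrefs = ["./articles/abc.def?hl=en-IN"]
resolved = absolute_urls(sample_hrefs)
print(resolved)
```

Dots inside the path or query string are preserved, since only the leading `./` segment is resolved away.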

In [3]:
# URL of the main news page
url = "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB?hl=en-IN&gl=IN&ceid=IN%3Aen"

# Creating a News object
news_obj = News()

# Getting the BeautifulSoup object of the main news page's HTML
html = news_obj.bs_html(url)

In [4]:
# Extracting the main news URLs
main_news = html.find_all('h3',class_ = 'ipQwMb ekueJc gEATFF RD0gLb')
main_news_urls = news_obj.get_urls(main_news)

# Extracting the sub news URLs
sub_news = html.find_all('div',class_ ="xrnccd F6Welf R7GTQ keNKEd j7vNaf")
sub_news = [news.find('h4') for news in sub_news]
sub_news_urls = news_obj.get_urls(sub_news)

Multiprocessing is used for faster extraction.

In [5]:
%%time
main_news_data = news_obj.multi_proccess_article_data(main_news_urls)
sub_news_data = news_obj.multi_proccess_article_data(sub_news_urls)

# Columns for the DataFrame of Articles data
columns = ['Title','Summary', 'Published Date', 'URL']

CPU times: user 101 ms, sys: 203 ms, total: 303 ms
Wall time: 16.2 s
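
The wall time above is dominated by network downloads rather than CPU work, so a thread pool is a possible alternative to the process pool used in the notebook. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor`, which is not what the notebook uses. The names `safe_call`, `map_in_threads`, and `fake_fetch` are hypothetical, and `fake_fetch` is only a stand-in for a real download such as `get_article_data`.

```python
from concurrent.futures import ThreadPoolExecutor

def safe_call(func, arg):
    """Run func(arg) and return None on any exception, as get_article_data does."""
    try:
        return func(arg)
    except Exception:
        return None

def map_in_threads(func, items, max_workers=20):
    """Apply func to every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda item: safe_call(func, item), items))

def fake_fetch(url):
    # Hypothetical stand-in for a network download
    if "bad" in url:
        raise ValueError("download failed")
    return [url.upper()]

results = map_in_threads(fake_fetch, ["a", "bad", "c"])
# Drop the failed downloads, as the notebook does before building the DataFrame
articles = [result for result in results if result]
print(articles)
```

Threads avoid the cost of pickling arguments and starting worker processes, which can matter when each task is a short I/O-bound request.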


Displaying the main news articles:

In [6]:
# Dropping the entries for articles that failed to download (None)
main_news_data = [data for data in main_news_data if data]

# Converting into a pandas DataFrame with the columns defined above
df_main_news = pd.DataFrame(main_news_data, columns = columns)

# Displays Main News
news_obj.display_table(df_main_news.head(), header = "Main News")

,Title,Summary,Published Date,URL
1,"Facebook in talks to acquire stake in top Indian telco Reliance Jio, report says – TechCrunch","Reliance Jio, a three-and-a-half-year-old subsidiary of India’s most valued firm Reliance Industries, may have attracted the attention of an American giant: Facebook. Mukesh Ambani, India’s richest man who runs Reliance Industries, has poured over $25 billion into Reliance Jio over the years. Reliance Jio also owns a suite of services including music streaming service JioSaavn, on-demand live television service JioTV and payments service JioPay. Earlier this year, Reliance Industries announced JioMart, a joint venture between Reliance Jio and Reliance Retail, the nation’s largest retail chain, to soft-launch an e-commerce business. Facebook and Reliance Jio declined to comment.",2020-03-24 08:30:42,link
2,"Taking Stock: FM comments soothe nerves; Nifty off high but holds 7,800","Stocks and Sectors:Sectorally, the S&P BSE IT index rose 6.9 percent, followed by the BSE Energy index rose 4.2 percent, and the S&P FMCG index gained 3.1 percent. On the losing front, the S&P BSE Realty index fell 2.01 percent, followed by the S&P BSE Capital Goods index that fell 0.73 percent, and the Public Sector was down 0.23 percent. On the broader markets front, the S&P BSE Midcap index rose 1.5 percent, while the S&P BSE Smallcap index rose 0.05 percent. IndusInd Bank: The share price of IndusInd Bank fell over 7 percent after the bank's MD & CEO Romesh Sobti retired. Dr Reddy’s Lab: Dr Reddy’s Laboratories share price rose 3 percent on March 24 after the company said it is going to consider fundraising.",,link
3,View: India's virus-stricken economy is in a dire need of a vaccine,"The latter will have a long-term effect on the financial and mental health for generations to come. The next 90 days will dictate whether or not the US heads toward a depression-like era where unemployment rose towards 25%-plus. Venture capitalists (VCs) will say they are open for business, but they are the most risk-averse class in town. Expect investors to push for companies to shut down and return capital.Startup and small business owners should check their business insurance policies and see if they have any ‘business interruption insurance’. It will define how India can come out as a warrior by taking on both viral and financial infections head on.P.S.",2020-03-24 23:00:00+00:00,link
4,"Kishore Biyani in a spot as lenders invoke shares amid stock crash, lockdown","On Tuesday, group company Future Retail said that certain lenders who held NCDs through IDBI Trusteeship Services Ltd have invoked promoter pledged shares worth 8%. “Despite monetization of investments across various Group entities, the total Group debt has increased as on 31 December 2019, as against 31 March 2019. The money from the sale proceeds is to be used primarily to repay debts of Future Group, said the first person. Future group was caught in a debt trap when India’s economic growth started slowing down in 2010. That episode saw Biyani sell his apparel store business Pantaloons to Aditya Birla group and his financial services business Future Capital to Warburg Pincus.",2020-03-25 00:26:21+05:30,link
5,"Nation under lockdown for 21 days, but stock market will stay open","Prime Minister Narendra Modi on March 24 announced a lockdown in the entire country starting midnight from Tuesday for the next 21 days, but stock market operations will continue, as usual, Ashish Chauhan, CEO, BSE said in a tweet. In another tweet, he said the daily operations of BSE will continue as usual putting speculation to rest on whether the market will function during the lockdown. That's why the PM said, if we don't do this 21-day lockdown, we will go back to 21 years. Giving assurance to the market, FM assured investors and market participants said that the government is closely monitoring the situation. We have provided all required digital tools to our employees so they can continue working seamlessly even from home,” he said.",,link
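
The `link` entries in the URL column come from `display_table`, which wraps each URL in an `a` tag by plain string concatenation. If a URL contains characters such as `&` or `"`, escaping it first keeps the generated HTML well-formed. The following is a minimal sketch using the standard library's `html.escape`; the helper name `make_link` and the example URL are hypothetical.

```python
import html

def make_link(url, text="link"):
    """Build an HTML anchor with the URL and link text escaped (hypothetical helper)."""
    safe_url = html.escape(url, quote=True)
    safe_text = html.escape(text)
    return f'<a href="{safe_url}">{safe_text}</a>'

# Example URL is illustrative only
anchor = make_link("https://example.com/?a=1&b=2")
print(anchor)
```

In the notebook this could stand in for the lambda passed to `df['URL'].apply`.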


In [7]:
# Remove the None type data
sub_news_data = [data for data in sub_news_data if data]

# Creating a pandas DataFrame for the sub news articles data
df_sub_news = pd.DataFrame(sub_news_data, columns = columns)

# Displays the Sub News articles data
news_obj.display_table(df_sub_news.head(), header = "Sub News")

,Title,Summary,Published Date,URL
1,"Reliance to pay twice to those employees who earn below ₹ 30,000","Reliance Industries Limited (RIL) announced a slew of measures to fight against deadly coronavirus infection spread across the country. Those who earn below ₹30,000 per month, will get their salaries twice a month in the wake of coronavirus outbreak. ""RIL has deployed the combined strengths of Reliance Foundation, Reliance Retail, Jio, Reliance Life Sciences, Reliance Industries, and all the 6,00,000 members of the Reliance Family on this action plan against COVID-19,"" the company said. Reliance Industries also extended the work-from-home platform to most of its employees except those who handle critical roles in maintaining Reliance Jio network. Reliance Retail across the country will remain functional during the lockdown.",2020-03-24 15:51:49+05:30,link
2,"Sensex, Nifty hold gains as investors pin hope to govt's fiscal package","Sensex, Nifty end in the greenIndian equity markets managed to hold decent gains, after rising as much as 5% intraday. Key indices were volatile in opening trade, but traded higher later in the day buoyed by the rise in IT and fast moving consumer goods (FMCG) stocks. The markets rose the most when finance minister Nirmala Sitharaman said on Twitter that the government plans to soon announce a fiscal stimulus package. The Sensex rose 692.79 points, or 2.67%, to end today's session at 26,674.03, while the Nifty 50 index settled 190.80 points, or 2.51%, higher at 7,801.05. Shares of Infosys led gains on Nifty, followed by Adani Ports, Britannia, Bajaj Finance and Maruti Suzuki, while Mahindra & Mahindra, Grasim Industries, IndusInd Bank, Power Grid Corporation and Bharti Infratel were the biggest laggards in today's session.",2020-03-24 08:32:22+05:30,link
3,Stock market to remain open despite 21-day India lockdown,"Mumbai: The country’s leading stock exchanges NSE and BSE will remain open despite the 21-day nationwide lockdown announced by Prime Minister Narendra Modi in the wake of the coronavirus pandemic.“We will remain open,” a NSE spokesperson told ETMarkets.com.Ashishkumar Chauhan, CEO of BSE also confirmed that the stock exchange will remain open. “BSE day to day operations @BSEIndia will continue,” Chauhan tweeted.Addressing the nation for the second time in less than a week, Modi called for a nationwide lockdown starting midnight to contain the coronavirus spread.The duration of the lockdown will be 21 days, he added. Providing a rationale behind this major step, Modi said that it was necessitated due to the severity of the situation.Reiterating the importance of social distancing, he said, we can only prevent new cases and contain the virus through it. India may have to pay a big price due to the negligence of a few. ""To stop coronavirus, stay at a distance from each other and stay inside your houses,"" he said.The government clarified that all essential services will remain functional.",2020-03-24 21:19:00+00:00,link
4,"Gold prices today jump ₹ 600 per 10 gram, silver rates surge about ₹ 2,000","Gold prices in India rose sharply today tracking an uptick in global rates after US Federal Reserve launched unlimited quantitative easing. On MCX, gold futures were up 1.5% or about ₹600 per 10 gram to ₹41,780, extending their ₹700 gain of the previous session. Silver futures on MCX rebounded 5% or about ₹2,000 per kg to ₹39,861 per kg, extending their 6% gain of the previous session. Gold prices in India have seen a big swing this month, hitting a record high of about ₹45,000 per 10 gram and thereafter correcting to about ₹39,500 levels. Spot gold climbed 2% to $1,583.53 per ounce after rising 4% in the previous session.",2020-03-24 09:23:37+05:30,link
5,Infosys Shares Jump Nearly 13 Percent as US SEC Concludes Probe in Whistleblower Case,"The company's scrip zoomed 12.69 per cent to close at Rs 593.55 on the BSE. On the National Stock Exchange (NSE), it climbed 12 per cent to close at Rs 589.80. In terms of traded volume, 6.29 lakh shares were traded on the BSE and over 2 crore shares on the NSE during the day. The company's market valuation zoomed Rs 28,489.15 crore to Rs 2,52,786.15 crore on the BSE. In a regulatory filing on Tuesday, Infosys said it has received a notification from the SEC stating that its investigation has concluded.",2020-03-24 19:03:33+05:30,link
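
The Published Date column in the tables above mixes timestamps without an offset, timestamps in UTC (`+00:00`), and timestamps in Indian time (`+05:30`), and some rows have no date at all. The sketch below shows one way to bring them onto a common UTC scale with the standard library. It works on the strings as printed in the tables, the helper name `to_utc` is hypothetical, and treating offset-free timestamps as UTC is an assumption that the source does not confirm.

```python
from datetime import datetime, timezone

def to_utc(timestamp):
    """Parse an ISO-style timestamp string into an aware UTC datetime.

    Returns None for empty values. Timestamps without an offset are
    ASSUMED to be UTC; adjust this if the publisher's zone is known.
    """
    if not timestamp:
        return None
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# Sample values copied from the tables above, plus an empty entry
samples = [
    "2020-03-24 08:30:42",
    "2020-03-24 23:00:00+00:00",
    "2020-03-25 00:26:21+05:30",
    "",
]
normalized = [to_utc(sample) for sample in samples]
for value in normalized:
    print(value)
```

With a common time zone, the articles can be sorted or compared by publication time.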
