# Google News Crawler

Crawls news from https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB?hl=en-IN&gl=IN&ceid=IN%3Aen

Importing dependencies

In [1]:
import urllib.request as request
from bs4 import BeautifulSoup
import pandas as pd

Here, base_url is prepended to the partial URLs found on the Google News page to make them complete,
and target_url is the URL of the page we are going to scrape.

In [2]:
base_url = 'https://news.google.com'
target_url = "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB?hl=en-IN&gl=IN&ceid=IN%3Aen"
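As a side note on completing partial URLs, here is a minimal sketch that uses the standard library's urljoin instead of plain string concatenation. The relative href below is a made-up placeholder (an assumption about how the links on the page look), and the base is written with a trailing slash.

```python
from urllib.parse import urljoin

# Site root used as the base for relative links (trailing slash kept on purpose).
base_url = "https://news.google.com/"

# Hypothetical relative link, assumed to resemble the hrefs on the page.
relative_href = "./articles/EXAMPLE_ID?hl=en-IN&gl=IN&ceid=IN%3Aen"

# urljoin resolves the leading "./" and keeps the query string untouched.
full_url = urljoin(base_url, relative_href)
print(full_url)
```

Because urljoin resolves the relative part the way a browser would, dots elsewhere in the link are left alone.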

Send a GET request to the server and check the status code to confirm that it worked.
A response code of 200 means our request was successful.

In [3]:
req = request.Request(target_url, headers={'User-Agent': 'Mozilla/5.0'})  # request object with a browser-like User-Agent
content_obj = request.urlopen(req)  # the actual HTTP response
print('Response Code :', content_obj.getcode())

Response Code : 200
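The request will not always succeed, so here is a hedged sketch of the same step with basic error handling. The helper names build_request and fetch and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import urllib.error
import urllib.request as request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying a browser-like User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with request.urlopen(build_request(url), timeout=timeout) as resp:
            print("Response Code :", resp.getcode())
            return resp.read()
    except urllib.error.HTTPError as err:  # the server answered with 4xx/5xx
        print("HTTP error :", err.code, err.reason)
    except urllib.error.URLError as err:   # DNS failure, refused connection, ...
        print("Connection error :", err.reason)
    return None
```

HTTPError is caught first because it is a subclass of URLError. A call such as fetch(target_url) would then return the page bytes or None.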


Grab the HTML content from the response object and have a look at the first 500 bytes of the HTML

In [4]:
html_content = content_obj.read()
print(html_content[:500])

b'<!doctype html><html lang="en" dir="ltr"><head><base href="https://news.google.com/"><meta name="referrer" content="origin"><link rel="canonical" href="https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB"><meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui"><meta name="apple-itunes-app" content="app-id=459182288"><meta name="google-site-verification" content="AcBy5YFny2HQgVUCR18tO5YUTf6MpVlcJqGTd-a9-SI"><meta name="mobile-web-app-capab'
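The b'...' prefix shows that read() returns bytes, not text. BeautifulSoup accepts bytes directly, but if a string is needed, a small sketch like the following could be used. The helper decode_body is hypothetical, and UTF-8 is an assumed default.

```python
def decode_body(raw, charset=None):
    """Decode response bytes to str, falling back to UTF-8 (an assumption)."""
    return raw.decode(charset or "utf-8", errors="replace")


# Sample bytes copied from the start of the output above.
sample = b'<!doctype html><html lang="en" dir="ltr">'
text = decode_body(sample)
print(type(text).__name__, text[:15])

# With a live response, the declared charset could be passed in, for example:
# decode_body(html_content, content_obj.headers.get_content_charset())
```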


Converting the HTML content to a BeautifulSoup object to parse elements

In [5]:
soup_html_content = BeautifulSoup(html_content,features="lxml")

In the HTML layout of the page, the 'c-wiz' element with class 'Vlf0vb' contains all the news, so let's grab just that; it will make parsing easier.
Every h3 and h4 tag with class 'ipQwMb ekueJc gEATFF RD0gLb' contains the information for a main news item or a sub-news item, so let's grab those elements

In [6]:
news_wrapper = soup_html_content.find('c-wiz', class_ ='Vlf0vb')
all_news = news_wrapper.find_all(['h3','h4'], class_ = 'ipQwMb ekueJc gEATFF RD0gLb') # h3 for main news and h4 for sub-news

Let's iterate through all the news elements, grab each headline and URL, and store them in the news_data list

In [7]:
news_data = list() # An Empty list
for news in all_news:
    news_url_tag = news.find('a')
    news_headline = news_url_tag.text
    news_url = base_url + news_url_tag.get('href').lstrip('.')  # strip only the leading '.', not every dot in the link
    news_data.append([news_headline,news_url])
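The selection and extraction steps can be tried offline on a tiny snippet. The markup below is hypothetical and only imitates the structure described above. The sketch also skips headings without a link and uses the built-in html.parser rather than lxml, both of which are choices made for this example.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page structure; IDs and headlines are made up.
sample_html = """
<c-wiz class="Vlf0vb">
  <h3 class="ipQwMb ekueJc gEATFF RD0gLb"><a href="./articles/MAIN_ID?hl=en-IN">Main story</a></h3>
  <h4 class="ipQwMb ekueJc gEATFF RD0gLb"><a href="./articles/SUB_ID?hl=en-IN">Related story</a></h4>
  <h4 class="ipQwMb ekueJc gEATFF RD0gLb">Heading without a link</h4>
</c-wiz>
"""

soup = BeautifulSoup(sample_html, "html.parser")
wrapper = soup.find("c-wiz", class_="Vlf0vb")

rows = []
if wrapper is not None:  # the wrapper class may change or disappear
    for heading in wrapper.find_all(["h3", "h4"], class_="ipQwMb ekueJc gEATFF RD0gLb"):
        link = heading.find("a")
        if link is None or not link.get("href"):
            continue  # skip headings that carry no link
        rows.append([link.get_text(strip=True),
                     urljoin("https://news.google.com/", link["href"])])

print(rows)
```

Guarding against a missing wrapper or anchor matters here because the class names look auto-generated and may change without notice.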

Let's have a look at the first 20 news items from the page

In [8]:
index = [i+1 for i in range(len(news_data))]
pd_news = pd.DataFrame(news_data, columns = ['Headline', 'URL'], index = index)
dfStyler = pd_news.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
pd_news.head(20)

| | Headline | URL |
|---|---|---|
| 1 | Maruti Suzuki, Hero, Tata, Bajaj Announces Pla... | https://news.google.com/articles/CBMiXWh0dHBzO... |
| 2 | Maruti Suzuki suspends production at Gurgaon, ... | https://news.google.com/articles/CBMiigFodHRwc... |
| 3 | Coronavirus Pandemic: Maruti Suzuki Halts Vehi... | https://news.google.com/articles/CAIiEKPD8Dxuo... |
| 4 | Coronavirus impact: Maruti Suzuki suspends pro... | https://news.google.com/articles/CAIiEKqaOcgoW... |
| 5 | Coronavirus \| Maruti Suzuki suspends productio... | https://news.google.com/articles/CAIiECtmqQTaR... |
| 6 | Banks Should Declare 1-Year Moratorium on All ... | https://news.google.com/articles/CAIiEMmF_bYW8... |
| 7 | Hero MotoCorp halts production at all plants w... | https://news.google.com/articles/CBMingFodHRwc... |
| 8 | Hero MotoCorp shuts down all plants globally a... | https://news.google.com/articles/CBMie2h0dHBzO... |
| 9 | Coronavirus impact: Hero MotoCorp, Fiat, Tata ... | https://news.google.com/articles/CAIiEP0M21Wuv... |
| 10 | Hero MotoCorp To Advance Payments During COVID... | https://news.google.com/articles/CAIiEGxUQbEmM... |
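The headlines and URLs in the table are cut off with "..." because pandas limits the displayed column width. A minimal sketch of showing full cell contents follows. The rows are placeholders standing in for news_data, and passing None for display.max_colwidth assumes pandas 1.0 or newer.

```python
import pandas as pd

# Hypothetical rows standing in for news_data.
sample_rows = [
    ["First headline", "https://news.google.com/articles/EXAMPLE_1"],
    ["Second headline", "https://news.google.com/articles/EXAMPLE_2"],
]

df = pd.DataFrame(sample_rows, columns=["Headline", "URL"])
df.index = pd.RangeIndex(start=1, stop=len(df) + 1)  # 1-based row numbers

# Temporarily lift the column-width limit so cells are not truncated.
with pd.option_context("display.max_colwidth", None):
    print(df.head(20))
```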


A function to search the headlines for a query (a case-sensitive substring match)

In [9]:
def search_query(query):
    # Collect every [headline, url] pair whose headline contains the query
    matched_news = [news for news in news_data if query in news[0]]
    if not matched_news:
        print('No news found related to ' + query + '!')
    else:
        print('These are the news related to ' + query + ':\n')
        for headline, url in matched_news:
            print('Headline : ' + headline)
            print('URL : ' + url)
            print('*' * 100)
            print()
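Since the match above is case-sensitive, a query such as 'goair' would find nothing. One possible variant, sketched under the assumption that ignoring case is wanted, returns the matches instead of printing them. The name find_news and the sample rows are hypothetical.

```python
def find_news(query, news_rows):
    """Return rows whose headline contains query, ignoring case."""
    needle = query.casefold()
    return [row for row in news_rows if needle in row[0].casefold()]


# Hypothetical rows standing in for news_data.
sample_rows = [
    ["GoAir announces salary cut", "https://news.google.com/articles/EXAMPLE_1"],
    ["Hero MotoCorp halts production", "https://news.google.com/articles/EXAMPLE_2"],
]

for headline, url in find_news("goair", sample_rows):
    print("Headline :", headline)
    print("URL :", url)
```

Returning a list keeps the search separate from the printing, so the results can also be loaded into a DataFrame or saved.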

Let's try to call the search_query function

In [10]:
search_query('GoAir')

These are the news related to GoAir:

Headline : Coronavirus Pandemic | GoAir announces 50% salary cut for top management
URL : https://news.google.com/articles/CAIiEPWOAM2vB6VvRUJkRfW0Tf4qGQgEKhAIACoHCAowlbCPCzCrzaIDMOjUrgY?hl=en-IN&gl=IN&ceid=IN%3Aen
****************************************************************************************************

Headline : Coronavirus Effect: GoAir Announces 50 per cent Pay Cut for its Top Employees
URL : https://news.google.com/articles/CBMifmh0dHBzOi8vd3d3Lm5ld3MxOC5jb20vbmV3cy9idXNpbmVzcy9jb3JvbmF2aXJ1cy1lZmZlY3QtZ29haXItYW5ub3VuY2VzLTUwLXBlci1jZW50LXBheS1jdXQtZm9yLWl0cy10b3AtZW1wbG95ZWVzLTI1NDU5ODEuaHRtbNIBggFodHRwczovL3d3dy5uZXdzMTguY29tL2FtcC9uZXdzL2J1c2luZXNzL2Nvcm9uYXZpcnVzLWVmZmVjdC1nb2Fpci1hbm5vdW5jZXMtNTAtcGVyLWNlbnQtcGF5LWN1dC1mb3ItaXRzLXRvcC1lbXBsb3llZXMtMjU0NTk4MS5odG1s?hl=en-IN&gl=IN&ceid=IN%3Aen
****************************************************************************************************

Headline : COVID-19 impact: GoAir 
