# Google News Crawler

### Crawls news from https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB?hl=en-IN&gl=IN&ceid=IN%3Aen

Importing dependencies

In [1]:
import urllib.request as request
from bs4 import BeautifulSoup
import pandas as pd

Here, base_url is the prefix we prepend to the partial URLs found on the Google News page to make them complete.
And target_url is the URL of the page we are going to scrape.

In [2]:
base_url = 'https://news.google.com'
target_url = "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB?hl=en-IN&gl=IN&ceid=IN%3Aen"
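As a small illustration of how a partial URL becomes a complete one, here is a sketch using urljoin from the standard library. It assumes the links on the page are relative and start with './'; the href below is a made-up placeholder, not a real article link.

```python
from urllib.parse import urljoin

base_url = 'https://news.google.com'
# Hypothetical relative link in the style of the hrefs on the page (not a real article)
relative_href = './articles/EXAMPLE_ID?hl=en-IN&gl=IN&ceid=IN%3Aen'

# urljoin resolves the leading './' against the base and keeps the query string intact
full_url = urljoin(base_url, relative_href)
print(full_url)
```

Plain string concatenation also works for this page; urljoin is just one way to avoid handling the leading dot by hand.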

Sending a GET request to the server and checking the status code to see whether it worked.
A response code of 200 means our request was successful.

In [3]:
# Build the request with a browser-like User-Agent header, then send it
req = request.Request(target_url, headers={'User-Agent': 'Mozilla/5.0'})
content_obj = request.urlopen(req)
print('Response Code :', content_obj.getcode())

Response Code : 200
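The request will not always come back with 200, so it can help to catch failures explicitly. The following is a minimal sketch, not part of the original notebook: fetch_html is a hypothetical helper, and its opener parameter is an assumption added only so the function can be tried without network access.

```python
import urllib.request as request
from urllib.error import HTTPError, URLError


def fetch_html(url, timeout=10, opener=request.urlopen):
    """Return (status_code, body_bytes); the body is b'' when the request fails."""
    req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.getcode(), resp.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print('HTTP error:', err.code)
        return err.code, b''
    except URLError as err:
        # No usable answer at all: DNS failure, refused connection, and so on.
        print('Connection problem:', err.reason)
        return None, b''
```

With this helper, the cell above could be written as `status, html_content = fetch_html(target_url)`, followed by a check that status equals 200 before parsing.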


Grabbing the HTML content from the response object. Let's have a look at the first 500 bytes of the HTML

In [4]:
html_content = content_obj.read()
print(html_content[:500])

b'<!doctype html><html lang="en" dir="ltr"><head><base href="https://news.google.com/"><meta name="referrer" content="origin"><link rel="canonical" href="https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pKVGlnQVAB"><meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui"><meta name="apple-itunes-app" content="app-id=459182288"><meta name="google-site-verification" content="AcBy5YFny2HQgVUCR18tO5YUTf6MpVlcJqGTd-a9-SI"><meta name="mobile-web-app-capab'
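The b'...' prefix in the output shows that read() returns bytes rather than text. BeautifulSoup accepts bytes directly, but a decoded string is handier for printing or string searches. A minimal sketch, assuming the page is UTF-8 encoded (an assumption; the actual charset is declared in the response headers). to_text is a hypothetical helper name.

```python
def to_text(raw, charset='utf-8'):
    """Decode response bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(charset, errors='replace')


# A short prefix of the bytes printed above, used here as sample input
sample = b'<!doctype html><html lang="en" dir="ltr"><head>'
text = to_text(sample)
print(type(text).__name__, text[:15])
```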


Converting the HTML content to a BeautifulSoup object to parse elements

In [5]:
soup_html_content = BeautifulSoup(html_content,features="lxml")

In the HTML layout of the page, the 'c-wiz' element with class 'Vlf0vb' contains all the news, so let's grab just that; it will make parsing easier.
Every h3 and h4 tag with class 'ipQwMb ekueJc gEATFF RD0gLb' contains the information for a main news item or a sub-news item, so let's grab those elements.

In [6]:
news_wrapper = soup_html_content.find('c-wiz', class_ ='Vlf0vb')
all_news = news_wrapper.find_all(['h3','h4'], class_ = 'ipQwMb ekueJc gEATFF RD0gLb') # h3 for main news and h4 for sub-news
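One thing to keep in mind with a multi-class string like the one above: when class_ is given several class names in one string, BeautifulSoup compares it against the exact value of the class attribute, so a tag listing the same classes in a different order is not matched. The sketch below shows this on a tiny hand-written HTML snippet (hypothetical markup, shortened class lists) and compares it with a CSS selector, which does not depend on class order. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the structure described above
html = (
    '<c-wiz class="Vlf0vb">'
    '<h3 class="ipQwMb ekueJc"><a href="./articles/a1">Main story</a></h3>'
    '<h4 class="ekueJc ipQwMb"><a href="./articles/a2">Sub story</a></h4>'
    '</c-wiz>'
)
soup = BeautifulSoup(html, features='html.parser')
wrapper = soup.find('c-wiz', class_='Vlf0vb')

# Exact string match: only the h3 has the classes in this order
exact = wrapper.find_all(['h3', 'h4'], class_='ipQwMb ekueJc')

# CSS selector: matches any tag carrying both classes, in any order
flexible = wrapper.select('h3.ipQwMb.ekueJc, h4.ipQwMb.ekueJc')

print([tag.get_text() for tag in exact])
print([tag.get_text() for tag in flexible])
```

Either approach depends on Google's generated class names, which can change at any time.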

Let's iterate through all the news elements, grab each headline and URL, and store them in the news_data list

In [7]:
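news_data = list() # An Empty list
for news in all_news:
    news_url_tag = news.find('a')
    news_headline = news_url_tag.text
    # The href is relative and starts with '.', so strip only that leading dot;
    # replace('.', '') would also delete any dot elsewhere in the link
    news_url = base_url + news_url_tag.get('href').lstrip('.')
    news_data.append([news_headline, news_url])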

Let's have a look at the first 10 news items from the page

In [8]:
# Number the rows from 1 instead of 0
index = range(1, len(news_data) + 1)
pd_news = pd.DataFrame(news_data, columns=['Headline', 'URL'], index=index)
# Left-align cells and headers; this styling shows only when dfStyler itself is displayed
dfStyler = pd_news.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
pd_news.head(10)

| | Headline | URL |
|---|---|---|
| 1 | Government approves a package of Rs 13,760 cro... | https://news.google.com/articles/CAIiEE-KNB5sl... |
| 2 | Union Cabinet Approves Three Schemes To Enable... | https://news.google.com/articles/CAIiENxdjwu_0... |
| 3 | Cabinet approves schemes to boost electronics ... | https://news.google.com/articles/CAIiEKfHD6NvC... |
| 4 | Cabinet approves scheme to make India electron... | https://news.google.com/articles/CBMiemh0dHBzO... |
| 5 | Govt announces incentive for electronics to ma... | https://news.google.com/articles/CAIiEOGS0cGin... |
| 6 | Former Vistara CSCO Sanjiv Kapoor joins GoAir | https://news.google.com/articles/CAIiENmANklaX... |
| 7 | Coronavirus Impact: GoAir senior management to... | https://news.google.com/articles/CBMipwFodHRwc... |
| 8 | Coronavirus impact: GoAir announces 50% pay cu... | https://news.google.com/articles/CBMif2h0dHBzO... |
| 9 | GoAir appoints Sanjiv Kapoor as advisor | https://news.google.com/articles/CAIiELgKGRztv... |
| 10 | RBI extends restrictions on PMC Bank for three... | https://news.google.com/articles/CAIiEG7T1_E8c... |


### Function to search query in the news

In [9]:
def search_query(query):
    """Print every headline in news_data that contains query (case-sensitive)."""
    # Each entry of news_data is [headline, url]
    matched_news = [news for news in news_data if query in news[0]]
    if not matched_news:
        print('No news found related to ' + query + '!')
    else:
        print('These are the news related to ' + query + ':')
        for headline, url in matched_news:
            print('Headline : ' + headline)
            print('URL : ' + url)
            print('*' * 100)
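Note that the `in` test in search_query is case-sensitive, so a query such as 'goair' would not match 'GoAir'. A possible variant is sketched below; find_news is a hypothetical name, and it returns the matches instead of printing them so the result can be reused. The sample rows are headlines from the table above, with the URLs left truncated as they were displayed.

```python
def find_news(news_data, query):
    """Return the [headline, url] pairs whose headline contains query, ignoring case."""
    needle = query.casefold()
    return [item for item in news_data if needle in item[0].casefold()]


# Sample rows taken from the table above; the URLs stay truncated as displayed
sample_news = [
    ['Former Vistara CSCO Sanjiv Kapoor joins GoAir', 'https://news.google.com/articles/CAIiENmANklaX...'],
    ['GoAir appoints Sanjiv Kapoor as advisor', 'https://news.google.com/articles/CAIiELgKGRztv...'],
    ['RBI extends restrictions on PMC Bank for three...', 'https://news.google.com/articles/CAIiEG7T1_E8c...'],
]

matches = find_news(sample_news, 'goair')
for headline, url in matches:
    print(headline)
```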

In [10]:
search_query('GoAir')

These are the news related to GoAir:
Headline : Former Vistara CSCO Sanjiv Kapoor joins GoAir
URL : https://news.google.com/articles/CAIiENmANklaXSs8q9ygAupfbwIqGQgEKhAIACoHCAowlbCPCzCrzaIDMOjUrgY?hl=en-IN&gl=IN&ceid=IN%3Aen
****************************************************************************************************
Headline : Coronavirus Impact: GoAir senior management to take 50% pay cut
URL : https://news.google.com/articles/CBMipwFodHRwczovL2Vjb25vbWljdGltZXMuaW5kaWF0aW1lcy5jb20vaW5kdXN0cnkvdHJhbnNwb3J0YXRpb24vYWlybGluZXMtLy1hdmlhdGlvbi9jb3JvbmF2aXJ1cy1pbXBhY3QtZ29haXItc2VuaW9yLW1hbmFnZW1lbnQtdG8tdGFrZS01MC1wYXktY3V0L2FydGljbGVzaG93Lzc0NzUyMzk4LmNtc9IBogFodHRwczovL20uZWNvbm9taWN0aW1lcy5jb20vaW5kdXN0cnkvdHJhbnNwb3J0YXRpb24vYWlybGluZXMtLy1hdmlhdGlvbi9jb3JvbmF2aXJ1cy1pbXBhY3QtZ29haXItc2VuaW9yLW1hbmFnZW1lbnQtdG8tdGFrZS01MC1wYXktY3V0L2FtcF9hcnRpY2xlc2hvdy83NDc1MjM5OC5jbXM?hl=en-IN&gl=IN&ceid=IN%3Aen
********************************************************************************