
# Google News

PyGoogleNews, created by the NewsCatcher Team, acts like a Python wrapper for Google News or an unofficial Google News API. It is based on one simple trick: it exploits a lightweight Google News RSS feed.

What data points can it fetch for you?

* Top stories
* Topic-related news feeds
* Geolocation-specific news feeds
* An extensive query-based search feed
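
To make the RSS idea concrete, here is a minimal sketch of how such feed URLs might be assembled. It assumes the public `https://news.google.com/rss` endpoints and the `hl`, `gl`, and `ceid` query parameters; the helper name `google_news_rss_url` is hypothetical, and PyGoogleNews may build its requests differently.

```python
from urllib.parse import urlencode

BASE_URL = "https://news.google.com/rss"  # assumed public RSS endpoint


def google_news_rss_url(query=None, lang="en", country="US"):
    """Build a Google News RSS URL (hypothetical helper, illustration only).

    Without a query the URL points at the top-stories feed;
    with a query it points at the search feed.
    """
    params = {}
    url = BASE_URL
    if query is not None:
        url = BASE_URL + "/search"
        params["q"] = query
    # Language and region parameters (assumed names: hl, gl, ceid)
    params["hl"] = lang
    params["gl"] = country
    params["ceid"] = f"{country}:{lang}"
    # safe=":" keeps the colon in ceid readable instead of percent-encoding it
    return url + "?" + urlencode(params, safe=":")


print(google_news_rss_url())
print(google_news_rss_url("boeing OR airbus"))
```

Any RSS parser can read a URL like this, which is why the approach stays lightweight.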

Install PyGoogleNews

In [4]:
!pip install pygooglenews

Collecting pygooglenews
  Downloading pygooglenews-0.1.2-py3-none-any.whl (10 kB)
Collecting dateparser<0.8.0,>=0.7.6
  Downloading dateparser-0.7.6-py2.py3-none-any.whl (362 kB)
Collecting feedparser<6.0.0,>=5.2.1
  Downloading feedparser-5.2.1.zip (1.2 MB)
Collecting requests<3.0.0,>=2.24.0
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
Collecting beautifulsoup4<5.0.0,>=4.9.1
  Downloading beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
Building wheels for collected packages: feedparser
  Building wheel for feedparser (setup.py) ... done
  Created wheel for feedparser: filename=feedparser-5.2.1-py3-none-any.whl size=44952 sha256=4d3182129f74e85e17633c277273a636a7ca8208da95e1e48770bd2b8a0f7169
  Stored in d

### Example 1:

Pull Top Headlines from Google News

In [5]:
from pygooglenews import GoogleNews
import time

gn = GoogleNews()
top = gn.top_news()  # fetch the top-stories feed

# Print a numbered list; title and publication date are joined with no separator
for count, entry in enumerate(top["entries"], start=1):
    print(f"{count}. {entry['title']}{entry['published']}")
    time.sleep(0.25)  # brief pause between printed lines

1. Which states would restrict or protect abortion rights if Roe v. Wade is struck down? - CBS NewsWed, 04 May 2022 19:40:31 GMT
2. Roe leak may impact how Supreme Court decides gun rights, climate and immigration cases this spring - CNNThu, 05 May 2022 11:50:00 GMT
3. Attacks on Mariupol steelworks intensify as Russia looks to end standoff; fate of Ukraine's Donbas in the balance - CNBCThu, 05 May 2022 14:07:00 GMT
4. Abortion Pills Will Be the Next Battleground in a Post-Roe America - The New York TimesThu, 05 May 2022 09:00:37 GMT
5. More than a show vote? Senate Dems weigh their Roe plans - POLITICOWed, 04 May 2022 21:30:44 GMT
6. New audio: McCarthy said 25th Amendment 'takes too long' and wanted to reach out to Biden after January 6 attack - CNNThu, 05 May 2022 14:14:00 GMT
7. New Mexico Wildfire Rips Through a Hispanic Bastion - The New York TimesThu, 05 May 2022 09:00:36 GMT
8. Fiji seizes $300 million superyacht belonging to Russian oligarch Suleiman Kerimov - CNBCThu, 05 May 

The code above shows how to extract specific data points from the top news articles in the Google News RSS feed. Replace “gn.top_news()” with “gn.topic_headlines('business')” to get the top business headlines, or with “gn.geo_headlines('San Fran')” to get the top news for the San Francisco area.
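
Because the three calls return feeds of the same shape, switching between them can be wrapped in a small dispatcher. The sketch below is only an illustration: `fetch_feed` and `headline_lines` are hypothetical helper names, and `gn` is assumed to be a `GoogleNews` instance exposing the `top_news`, `topic_headlines`, and `geo_headlines` methods used in this notebook.

```python
def fetch_feed(gn, kind, value=None):
    """Return a feed from a GoogleNews-like object `gn`.

    `kind` is "top", "topic", or "geo"; `value` is the topic or place name.
    """
    if kind == "top":
        return gn.top_news()
    if kind == "topic":
        return gn.topic_headlines(value)
    if kind == "geo":
        return gn.geo_headlines(value)
    raise ValueError(f"unknown feed kind: {kind!r}")


def headline_lines(feed, limit=None):
    """Format entries as numbered 'N. title (published)' strings."""
    entries = feed["entries"][:limit]
    return [
        f"{i}. {entry['title']} ({entry['published']})"
        for i, entry in enumerate(entries, start=1)
    ]


# Usage with the real library (requires network access):
#   from pygooglenews import GoogleNews
#   gn = GoogleNews()
#   for line in headline_lines(fetch_feed(gn, "topic", "business"), limit=5):
#       print(line)
```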

You can also use complex queries such as “gn.search('boeing OR airbus')” to find news articles mentioning Boeing or Airbus or “gn.search('boeing -airbus')” to find all news articles that mention Boeing but not Airbus.
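
Query strings like these can also be assembled programmatically. The following is a rough sketch; `build_query` is a hypothetical helper, not part of PyGoogleNews, and wrapping multi-word terms in double quotes assumes Google's usual exact-phrase syntax.

```python
def _quote(term):
    """Wrap multi-word terms in double quotes (assumed exact-phrase syntax)."""
    return f'"{term}"' if " " in term else term


def build_query(any_of=(), exclude=()):
    """Combine terms with OR and prefix excluded terms with a minus sign."""
    included = " OR ".join(_quote(term) for term in any_of)
    excluded = " ".join("-" + _quote(term) for term in exclude)
    # Skip empty parts so there are no stray spaces
    return " ".join(part for part in (included, excluded) if part)


print(build_query(["boeing", "airbus"]))
print(build_query(["boeing"], exclude=["airbus"]))
```

The resulting string can be passed straight to `gn.search(...)`.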

For every news entry you capture with this library, you get the following data points, which you can use for data processing, training a machine learning model, or running NLP scripts:

* Title - contains the Headline for the article
* Link - the original link for the article
* Published - the date on which it was published
* Summary - the article summary
* Source - the website on which it was published
* Sub-Articles - list of titles, publishers, and links that are on the same topic
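
As a rough sketch, the fields listed above can be flattened into plain dictionaries for later processing. The key names `source` (assumed to be a mapping with a `title`) and `sub_articles` are assumptions about the entry structure, so the code reads every field defensively with `.get`. The helper name `entry_to_record` is hypothetical, and the sample entry is hand-made from the first headline printed earlier.

```python
from email.utils import parsedate_to_datetime


def entry_to_record(entry):
    """Flatten one feed entry into a plain dict (key names are assumptions)."""
    source = entry.get("source") or {}
    published = entry.get("published")
    return {
        "title": entry.get("title", ""),
        "link": entry.get("link", ""),
        # RSS dates such as "Wed, 04 May 2022 19:40:31 GMT" follow RFC 2822
        "published": parsedate_to_datetime(published) if published else None,
        "summary": entry.get("summary", ""),
        "source": source.get("title", ""),
        "sub_articles": entry.get("sub_articles") or [],
    }


# Hand-made sample entry; real entries come from e.g. gn.top_news()["entries"]
sample = {
    "title": "Which states would restrict or protect abortion rights "
             "if Roe v. Wade is struck down? - CBS News",
    "published": "Wed, 04 May 2022 19:40:31 GMT",
    "source": {"title": "CBS News"},
}
record = entry_to_record(sample)
print(record["published"].isoformat(), record["source"])
```

Parsing the date into a `datetime` makes it possible to sort or filter entries by publication time.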

The example above extracts only a few of the available data points, but you can extract the others as needed. The cells below show the results of the topic, geolocation, and search queries described above:

In [6]:
from pygooglenews import GoogleNews
import time

gn = GoogleNews()
top = gn.topic_headlines('business')  # top headlines for the "business" topic

# Print title, publication date, and link, joined with no separator
for count, entry in enumerate(top["entries"], start=1):
    print(f"{count}. {entry['title']}{entry['published']}{entry['link']}")
    time.sleep(0.25)  # brief pause between printed lines

1. Elon Musk expected to serve as temporary Twitter CEO after deal closes - CNBCThu, 05 May 2022 13:13:20 GMThttps://news.google.com/__i/rss/rd/articles/CBMia2h0dHBzOi8vd3d3LmNuYmMuY29tLzIwMjIvMDUvMDUvZWxvbi1tdXNrLWV4cGVjdGVkLXRvLXNlcnZlLWFzLXRlbXBvcmFyeS10d2l0dGVyLWNlby1hZnRlci1kZWFsLWNsb3Nlcy5odG1s0gEA?oc=5
2. Fed interest rate hike: Here's the impact on credit rates, mortgages, and the rest of your wallet - CBS NewsThu, 05 May 2022 11:20:00 GMThttps://news.google.com/__i/rss/rd/articles/CBMiYWh0dHBzOi8vd3d3LmNic25ld3MuY29tL25ld3MvZmVkZXJhbC1yZXNlcnZlLWludGVyZXN0LXJhdGUtaW5jcmVhc2UtbW9ydGdhZ2UtY3JlZGl0LXNhdmluZ3MtY29zdC_SAQA?oc=5
3. British pound plummets as Bank of England warns of recession risk - CNBCThu, 05 May 2022 12:00:26 GMThttps://news.google.com/__i/rss/rd/articles/CBMicWh0dHBzOi8vd3d3LmNuYmMuY29tLzIwMjIvMDUvMDUvYnJpdGlzaC1wb3VuZC1zaW5rcy1hZ2FpbnN0LWRvbGxhci1zZXQtZm9yLXdvcnN0LWRhaWx5LWRyb3Atc2luY2UtbWFyY2gtMjAyMC5odG1s0gEA?oc=5
4. Starlink's new Portability feature brings i

In [7]:
from pygooglenews import GoogleNews
import time

gn = GoogleNews()
top = gn.geo_headlines('Saint Louis')  # top headlines for a location

# Print a numbered list; title and publication date are joined with no separator
for count, entry in enumerate(top["entries"], start=1):
    print(f"{count}. {entry['title']}{entry['published']}")
    time.sleep(0.25)  # brief pause between printed lines

1. Innocent driver killed in crash involving car fleeing from police in south St. Louis - KSDK.comThu, 05 May 2022 00:00:00 GMT
2. Former Bridgeton councilman charged with stealing by deceit - KMOV4Wed, 04 May 2022 19:31:00 GMT
3. St. Louis dad frustrated with no-show school bus as son misses class - KTVI Fox 2 St. LouisThu, 05 May 2022 02:52:50 GMT
4. St. Louis County health officials warn of counterfeit COVID-19 tests - KMOV4Wed, 04 May 2022 22:36:00 GMT
5. Comic Industry Superstar Jim Lee Got His Start in St. Louis - Riverfront TimesWed, 04 May 2022 14:30:00 GMT
6. Engine ripped from minivan in crash on Kingshighway in St. Louis - KSDK.comThu, 05 May 2022 11:43:00 GMT
7. Temperatures could hit 90 degrees near St. Louis next week - KTVI Fox 2 St. LouisWed, 04 May 2022 18:53:15 GMT
8. Report: AT&T building in downtown St. Louis has new owner - KMOV4Wed, 04 May 2022 11:44:00 GMT
9. Cherokee Street in south St. Louis is lined with diversity and paved with opportunity - KSDK.comWed, 04 M

In [8]:
from pygooglenews import GoogleNews

gn = GoogleNews()
# A leading minus excludes a term: articles mentioning "russia" but not "putin"
s = gn.search('russia -putin')

for entry in s["entries"]:
    print(entry["title"])

Russia-Ukraine war, MMIW day of awareness, Cinco de Mayo: 5 things to know Thursday - USA TODAY
Russia's economy is back on its feet - The Economist
China's alumina exports to Russia surge after Ukraine invasion - Quartz
Russia weighs on Credit Agricole's Q1, capital miss hits shares - Reuters.com
Ukrainian Commander Says Russia ‘Violated’ Truce At Mariupol Steel Plant, Though Kremlin Still Denies Storming The Area - Forbes
Finland is prepared for Russia cutting its gas supplies, says minister - Reuters.com
UniCredit surprises with buyback as it tackles Russia exit - Reuters.com
U.S. intel helping Ukraine target, kill generals won’t stop Russia, officials say - Global News
Japan says difficult to immediately follow Russia oil embargo - Reuters.com
Russia Finds New Use for Nord Stream 2 as Europe Shuns Its Gas - Bloomberg
NSA cyber boss seeks to discourage vigilante hacking against Russia - C4ISRNet
Russia Just Lost Its Most Advanced Operational Tank In Ukraine - The War Zone
Russia's N

### Example 2:

Search Google News and Save to CSV File

In [11]:
import csv
from pygooglenews import GoogleNews

gn = GoogleNews(lang='en', country='UK')

# Articles with "russia" in the title, dated between 2022-01-01 and 2022-12-31
russiasearch = gn.search('intitle:russia', helper=True, from_='2022-01-01', to_='2022-12-31')

print(russiasearch['feed'].title)

for item in russiasearch['entries']:
    print(item['title'])

# Write one title per row; newline='' lets the csv module control line endings
with open('/tmp/russia_search.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['russiasearch'])  # header row
    for item in russiasearch['entries']:
        writer.writerow([item['title']])

"intitle:russia after:2022-01-01 before:2022-12-31" - Google News
Russia-Ukraine war: what we know on day 71 of the invasion - The Guardian
Russia-Ukraine war news: Live updates - The Washington Post
‘A second Afghanistan’: Doubts over Russia’s war prosecution - Al Jazeera English
Humiliated Russia faces an epoch-defining defeat - The Telegraph
May 4, 2022: Russia-Ukraine news - CNN
Will the energy war hurt Europe more than Russia? - The Week UK
U.S. says Russian forces are largely stalled in Ukraine; Biden to discuss more sanctions with G-7 - CNBC
Russia cut off from UK services - GOV.UK
Why May 9 is a big day for Russia, and what a declaration of war would mean - CNN
Russia-Ukraine Latest News: May 4, 2022 - Bloomberg
Putin is failing in Ukraine but succeeding at oppressing Russia - The Economist
May 3, 2022 Russia-Ukraine news: Live updates - CNN
Russia-Ukraine Latest News: May 5, 2022 - Bloomberg
Head of Russian Orthodox church on draft EU sanctions list - The Guardian
Ukraine war:
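
The cell above stores only the title of each result. A variant that keeps several fields per row could use `csv.DictWriter`, as in the sketch below. The helper name `write_entries` and the chosen field list are illustrative, and the sample entry contains placeholder values rather than real search results.

```python
import csv
import io

FIELDS = ("title", "published", "link")  # illustrative choice of columns


def write_entries(entries, file_obj, fields=FIELDS):
    """Write one CSV row per entry, with a header row of field names."""
    writer = csv.DictWriter(file_obj, fieldnames=fields)
    writer.writeheader()
    for entry in entries:
        # Missing fields become empty strings instead of raising KeyError
        writer.writerow({name: entry.get(name, "") for name in fields})


# Placeholder entry (not real data) written to an in-memory buffer
sample_entries = [
    {
        "title": "Example headline, with a comma",
        "published": "Thu, 05 May 2022 00:00:00 GMT",
        "link": "https://example.com/article",
    }
]
buffer = io.StringIO()
write_entries(sample_entries, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the file with `open(path, 'w', newline='')` and pass it as `file_obj`, for example with `russiasearch['entries']` as the entries. The `csv` module quotes values that contain commas, so headlines and dates survive the round trip.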

In [10]:
import csv

# Read the CSV file back and print each row as a list of strings
with open('/tmp/russia_search.csv', newline='') as file_obj:
    reader_obj = csv.reader(file_obj)
    for row in reader_obj:
        print(row)


['russiasearch']
['Russia-Ukraine war: what we know on day 71 of the invasion - The Guardian']
['Russia-Ukraine war news: Live updates - The Washington Post']
['‘A second Afghanistan’: Doubts over Russia’s war prosecution - Al Jazeera English']
['Humiliated Russia faces an epoch-defining defeat - The Telegraph']
['May 4, 2022: Russia-Ukraine news - CNN']
['Will the energy war hurt Europe more than Russia? - The Week UK']
['U.S. says Russian forces are largely stalled in Ukraine; Biden to discuss more sanctions with G-7 - CNBC']
['Russia cut off from UK services - GOV.UK']
['Why May 9 is a big day for Russia, and what a declaration of war would mean - CNN']
['Russia-Ukraine Latest News: May 4, 2022 - Bloomberg']
['Putin is failing in Ukraine but succeeding at oppressing Russia - The Economist']
['May 3, 2022 Russia-Ukraine news: Live updates - CNN']
['Russia-Ukraine Latest News: May 5, 2022 - Bloomberg']
['Head of Russian Orthodox church on draft EU sanctions list - The Guardian']
["Ukr