Documentation: https://github.com/kotartemiy/pygooglenews?tab=readme-ov-file

In [None]:
# Install pygooglenews using pip
# !pip install pygooglenews --upgrade

# 1. Setup

In [9]:
from pygooglenews import GoogleNews
import pprint

In [None]:
# Create a GoogleNews object
gn = GoogleNews(lang='en', country='US')  # Available languages and regions are listed on the Google News website: https://news.google.com/

# 2. Top news

In [13]:
top = gn.top_news()  # Get the top news articles

print(f'Type of the top variable: {type(top)}\n')

pprint.pprint(top)

Type of the top variable: <class 'dict'>

{'entries': [{'guidislink': False,
              'id': 'CBMifEFVX3lxTFBWX2lhRl84MFhFcENuWW5Hd19YR2kwQnFtUkR6blhLOVFfbkdTaVBjZWNlTzFfMFgzUzQ1OGwxUjQ2ZE9VM181eVBVZjdPRllNVjZiRmNNYWc5Mkl2WEcyRVBuU0FFX0FTODZsd2dHMmRsOEtycFRJU21Jd04',
              'link': 'https://news.google.com/rss/articles/CBMifEFVX3lxTFBWX2lhRl84MFhFcENuWW5Hd19YR2kwQnFtUkR6blhLOVFfbkdTaVBjZWNlTzFfMFgzUzQ1OGwxUjQ2ZE9VM181eVBVZjdPRllNVjZiRmNNYWc5Mkl2WEcyRVBuU0FFX0FTODZsd2dHMmRsOEtycFRJU21Jd04?oc=5',
              'links': [{'href': 'https://news.google.com/rss/articles/CBMifEFVX3lxTFBWX2lhRl84MFhFcENuWW5Hd19YR2kwQnFtUkR6blhLOVFfbkdTaVBjZWNlTzFfMFgzUzQ1OGwxUjQ2ZE9VM181eVBVZjdPRllNVjZiRmNNYWc5Mkl2WEcyRVBuU0FFX0FTODZsd2dHMmRsOEtycFRJU21Jd04?oc=5',
                         'rel': 'alternate',
                         'type': 'text/html'}],
              'published': 'Thu, 08 May 2025 10:34:00 GMT',
              'published_parsed': time.struct_time(tm_year=2025, tm_mon=5, tm_mday=8, 

According to the documentation, "The returned object contains feed (FeedParserDict) and entries list of articles found with all data parsed." Note, however, that the object does not contain the full text of the articles, only a summary. To extract the actual text, we would have to request each of these links.

Specifically, the returned object is the following:

> All 4 functions return the dictionary that has 2 sub-objects:
>   - feed - contains the information on the feed metadata
>   - entries - contains the parsed articles
> 
> Both are inherited from the `Feedparser`. The only change is that each dictionary under entries also contains sub_articles which are the similar articles found in the description. Usually, it is non-empty for top_news() and topic_headlines() feeds.
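
As a rough illustration of that structure, the sketch below walks a result dictionary and counts the related articles attached to each entry. The helper name `summarize_result` and the sample data are made up for this example; the sample only imitates the shape described above and is not real feed output. Key access (`entry["title"]`) is used so the helper works on plain dictionaries as well as on the parsed entries.

```python
def summarize_result(result):
    """Return (feed title, [(entry title, number of sub-articles), ...])."""
    feed_title = result["feed"].get("title", "")
    rows = []
    for entry in result["entries"]:
        # sub_articles may be absent or None, so fall back to an empty list
        related = entry.get("sub_articles") or []
        rows.append((entry.get("title", ""), len(related)))
    return feed_title, rows


# Stand-in data shaped like the description above (hypothetical values)
sample = {
    "feed": {"title": "Example feed"},
    "entries": [
        {"title": "Story A", "sub_articles": [{"title": "Related A1"}]},
        {"title": "Story B", "sub_articles": None},
    ],
}

feed_title, rows = summarize_result(sample)
print(feed_title)
for title, n_related in rows:
    print(f"{title}: {n_related} related article(s)")
```

In the notebook, the same helper could be called as `summarize_result(top)`.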

In [15]:
# Print the top news articles
for i, article in enumerate(top['entries']):
    print(f"Article {i+1}:")
    print(f"Title: {article.title}")
    print(f"Link: {article.link}")
    print(f"Published: {article.published}")
    print(f"Source: {article.source}")
    print()

Article 1:
Title: Trump announces trade deal with the UK, first since his tariffs sent markets reeling - NPR
Link: https://news.google.com/rss/articles/CBMifEFVX3lxTFBWX2lhRl84MFhFcENuWW5Hd19YR2kwQnFtUkR6blhLOVFfbkdTaVBjZWNlTzFfMFgzUzQ1OGwxUjQ2ZE9VM181eVBVZjdPRllNVjZiRmNNYWc5Mkl2WEcyRVBuU0FFX0FTODZsd2dHMmRsOEtycFRJU21Jd04?oc=5
Published: Thu, 08 May 2025 10:34:00 GMT
Source: {'href': 'https://www.npr.org', 'title': 'NPR'}

Article 2:
Title: Trump nominates Dr. Casey Means, wellness influencer close to RFK Jr., for surgeon general - PBS
Link: https://news.google.com/rss/articles/CBMiwgFBVV95cUxOYTZSUnhad3hydEpHZ3JaUmtHRm8tbFNseXJZdGhucmozdk0wdDJXM0FOY2M5aVM5ZzhrczBSeEE2ZDVyTnFScFhlZ1lxYjhhWmJmQUpaR3BKdlMtYmppZmtfNjBjNHl3VG1MTFI2a0x4bkFia19aOU9pVlE3UkpoZVQ4TzMtSk5oVGdTRHBxNEt2NENfWWU2SXBFTi1VbjJiVnBxRl93VXBHVFJxX2xfd1NRdm1sUGFIdzY0MUtKZVJ4QQ?oc=5
Published: Wed, 07 May 2025 22:33:17 GMT
Source: {'href': 'https://www.pbs.org', 'title': 'PBS'}

Article 3:
Title: Economists warn Trump's res

# 3. Stories by location

According to the documentation, the location search is "language agnostic", with one caveat: "it does not guarantee that the feed for any specific place will exist".
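
Because of that caveat, it may be worth checking the result before looping over it. The sketch below assumes that a missing feed shows up as an empty or absent `entries` list rather than as an exception; this has not been verified here. The helper name `entries_or_warn` is made up for this example.

```python
def entries_or_warn(result, place):
    """Return the entries list, printing a note when it is empty or missing."""
    entries = result.get("entries") or []
    if not entries:
        print(f"No feed entries returned for {place!r}")
    return entries


# Stand-in results (hypothetical): one empty feed and one with a single entry
empty_result = {"feed": {}, "entries": []}
one_result = {"feed": {}, "entries": [{"title": "Some headline"}]}

print(len(entries_or_warn(empty_result, "Nowhere")))
print(len(entries_or_warn(one_result, "Harare")))
```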

In [16]:
harare = gn.geo_headlines('Harare')  # Get news articles for a specific location (e.g., Harare - Zimbabwe's capital)

# Print the news articles referring to Harare
for i, article in enumerate(harare['entries']):
    print(f"Article {i+1}:")
    print(f"Title: {article.title}")
    print(f"Link: {article.link}")
    print(f"Published: {article.published}")
    print(f"Source: {article.source}")
    print()

Article 1:
Title: Zimbabwe protest: Harare shuts down as Blessed Geza calls for Emmerson Mnangagwa to resign - BBC
Link: https://news.google.com/rss/articles/CBMiWkFVX3lxTE9tcl9yeVFWVnVLQXVxYmJtWWRCT1dDYy00cDZoQWZ1VlZWbHh3REJqeTBqWDFJbnpZSlU1blhZOHJRQW9TVEo5aXNXblFLQklKaDU3Q0w4N012QdIBX0FVX3lxTE5INWtCazN3SXhnSGY1OUVYOE5KVXhaZmRFaFJQeldKeW5UYzMtM1ZnRGxCbjBDbDh2Zmt0N3BXNXVjQnVkWUhHWXhCYk9wVktfc0RobGFJREEybzByWkEw?oc=5
Published: Mon, 31 Mar 2025 07:00:00 GMT
Source: {'href': 'https://www.bbc.com', 'title': 'BBC'}

Article 2:
Title: SADC Ministers Unite in Harare to Help Shape Africa’s Digital Future - TechAfrica News
Link: https://news.google.com/rss/articles/CBMiqgFBVV95cUxNYnhQVVIxVF9hYWExTllNMHpXbjZzaUh4aFNQbWIxZ3h0NzNRbnNWVGFWdXZQdktXcHIyOEgtTzYtbER1YjIwaG5pRVZaRkhWYWJ6aXdwZFlVaWFQcmZKUmcxMFJFRDh5d0h0X2ttNU9qOFBDRlV0OVJETWtUUlRGcXpFY2pyYVlFVDhBbG5xSjZGQ3dmb1lDZ1M4b2V1aTBvOGV2ZG1LSnA0QQ?oc=5
Published: Tue, 18 Mar 2025 07:00:00 GMT
Source: {'href': 'https://techafricanews.com', 'title

How many different sources do these articles come from?

In [17]:
# Check the number of unique sources in the Harare news articles
sources = set()
for article in harare['entries']:
    sources.add(article.source)
print(f"Number of unique sources: {len(sources)}")

# Print the unique sources
print("Unique sources:")
for source in sources:
    print(source)

Number of unique sources: 27
Unique sources:
{'href': 'https://www.voanews.com', 'title': 'VOA - Voice of America English News'}
{'href': 'https://www.techzim.co.zw', 'title': 'Techzim'}
{'href': 'https://www.cricbuzz.com', 'title': 'Cricbuzz.com'}
{'href': 'https://africanarguments.org', 'title': 'African Arguments'}
{'href': 'https://m.economictimes.com', 'title': 'The Economic Times'}
{'href': 'https://www.bbc.com', 'title': 'BBC'}
{'href': 'https://www.musicinafrica.net', 'title': 'Music In Africa |'}
{'href': 'https://m.economictimes.com', 'title': 'The Economic Times'}
{'href': 'https://www.travelnews.co.za', 'title': 'travelnews.co.za'}
{'href': 'https://www.unicef.org', 'title': 'Unicef'}
{'href': 'https://www.dailymaverick.co.za', 'title': 'Daily Maverick'}
{'href': 'https://techafricanews.com', 'title': 'TechAfrica News'}
{'href': 'https://www.musicinafrica.net', 'title': 'Music In Africa |'}
{'href': 'https://www.eeas.europa.eu', 'title': 'EEAS'}
{'href': 'https://www.aircar
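
Note that the list above contains repeats (The Economic Times and Music In Africa each appear twice), so 27 overstates the number of distinct sources. A likely explanation, which is an assumption about how the parsed source objects behave inside a set, is that they are compared by identity rather than by content. A minimal sketch of a content-based count follows; the helper name `unique_sources` is made up, and the sample entries only reuse a few source values from the output above.

```python
def unique_sources(entries):
    """Return the distinct (href, title) pairs, sorted by title."""
    seen = set()
    for entry in entries:
        source = entry.get("source") or {}
        # Tuples of strings are hashed by content, so repeats collapse
        seen.add((source.get("href"), source.get("title")))
    return sorted(seen, key=lambda pair: (pair[1] or "", pair[0] or ""))


# Stand-in entries with one repeated source
sample_entries = [
    {"source": {"href": "https://www.bbc.com", "title": "BBC"}},
    {"source": {"href": "https://m.economictimes.com", "title": "The Economic Times"}},
    {"source": {"href": "https://m.economictimes.com", "title": "The Economic Times"}},
]

distinct = unique_sources(sample_entries)
print(f"Number of unique sources: {len(distinct)}")
for href, title in distinct:
    print(f"{title} ({href})")
```

In the notebook this would be called as `unique_sources(harare['entries'])`.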

# 4. Stories by query

In [23]:
query = gn.search(query = 'conflict in Harare', from_ = '2014-02-09', to_ = '2020-07-19')  # Get news articles for a specific query (e.g., conflict in Harare)

# Print the news articles referring to the query
for i, article in enumerate(query['entries']):
    print(f"Article {i+1}:")
    print(f"Title: {article.title}")
    print(f"Link: {article.link}")
    print(f"Published: {article.published}")
    print(f"Source: {article.source}")
    print()

Article 1:
Title: The Zimbabwe-Mozambique border conflict - Foreign Brief
Link: https://news.google.com/rss/articles/CBMibkFVX3lxTE9paHZNSWZNSDBFS1BwMXZVc3VwdVBqaWlDRUpNOHI4eUd6VEdWVkUwbHdpZldycXlubFFjMnVWWWM4VGE0akZHT0NNbUFYM1c5U2JjSHViWERYTE9rMEJ6bk81anp6bVRub2ZFX3p3?oc=5
Published: Tue, 07 Mar 2017 08:00:00 GMT
Source: {'href': 'https://foreignbrief.com', 'title': 'Foreign Brief'}

Article 2:
Title: Zimbabwe’s Coup Did Not Create Democracy from Dictatorship - Carnegie Endowment for International Peace
Link: https://news.google.com/rss/articles/CBMirgFBVV95cUxNZElEd21NSFE2ZEZ4YVdBS08zb2xXQUE4dUNVVWhBTG9VTllqZ3NoY3hVUHhRNVFYS3JpV3YzZVFiSVQ3RG1UOGpkMWYtUGF2OS1zcTFuV2FFcVZpeXpKTXlibE1CN1JzMEpCMGFQVGRrSXVGaU5FLURIbEViNU1fZDd0RWZoU2J1UUtUaGNwNF8tZV8xVzFaTUlxdVFzQWFmWTlwal9rZXFfZVY5MlE?oc=5
Published: Thu, 16 Aug 2018 07:00:00 GMT
Source: {'href': 'https://carnegieendowment.org', 'title': 'Carnegie Endowment for International Peace'}

Article 3:
Title: Political Instability in Zimbabwe - C

More information about the options for querying can be found here: https://github.com/kotartemiy/pygooglenews?tab=readme-ov-file#stories-by-a-query
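
Since `from_` and `to_` are passed as plain strings, a small wrapper can catch malformed or reversed dates before the request is made. This is only a sketch: `search_between` is a made-up helper, it assumes the `YYYY-MM-DD` format used in the cell above, and `fake_search` stands in for `gn.search` so the example runs without network access.

```python
from datetime import date


def search_between(search_fn, query, start, end):
    """Validate an ISO date range, then delegate to search_fn."""
    start_day = date.fromisoformat(start)  # raises ValueError if malformed
    end_day = date.fromisoformat(end)
    if start_day > end_day:
        raise ValueError(f"start {start} is after end {end}")
    return search_fn(query=query, from_=start, to_=end)


# Stand-in for gn.search that records the keyword arguments it receives
calls = []


def fake_search(**kwargs):
    calls.append(kwargs)
    return {"feed": {}, "entries": []}


result = search_between(fake_search, "conflict in Harare", "2014-02-09", "2020-07-19")
print(calls[0])
```

In the notebook, the call would be `search_between(gn.search, 'conflict in Harare', '2014-02-09', '2020-07-19')`.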

# 5. But can we get the content of the articles?

In [24]:
import requests
from bs4 import BeautifulSoup

In [30]:
# Let's try to make a request to one of the links
url = query['entries'][0].link
print(url)

# Make a request to the URL (redirects are followed by default; the timeout avoids waiting indefinitely)
response = requests.get(url, allow_redirects=True, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    print(soup.prettify())  # Print the parsed HTML content

    # # Extract the title
    # title = soup.title.string if soup.title else 'No title found'
    
    # # Extract the first paragraph
    # first_paragraph = soup.find('p').text if soup.find('p') else 'No paragraph found'
    
    # print(f"Title: {title}")
    # print(f"First paragraph: {first_paragraph}")

https://news.google.com/rss/articles/CBMibkFVX3lxTE9paHZNSWZNSDBFS1BwMXZVc3VwdVBqaWlDRUpNOHI4eUd6VEdWVkUwbHdpZldycXlubFFjMnVWWWM4VGE0akZHT0NNbUFYM1c5U2JjSHViWERYTE9rMEJ6bk81anp6bVRub2ZFX3p3?oc=5
<!DOCTYPE html>
<html dir="ltr" lang="en-US">
 <head>
  <base href="https://news.google.com/"/>
  <link href="//www.gstatic.com" rel="preconnect"/>
  <meta content="origin" name="referrer"/>
  <link href="https://news.google.com/rss/articles/CBMibkFVX3lxTE9paHZNSWZNSDBFS1BwMXZVc3VwdVBqaWlDRUpNOHI4eUd6VEdWVkUwbHdpZldycXlubFFjMnVWWWM4VGE0akZHT0NNbUFYM1c5U2JjSHViWERYTE9rMEJ6bk81anp6bVRub2ZFX3p3" rel="canonical"/>
  <meta content="width=device-width,initial-scale=1,minimal-ui" name="viewport"/>
  <meta content="AcBy5YFny2HQgVUCR18tO5YUTf6MpVlcJqGTd-a9-SI" name="google-site-verification"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="News" name="application-name"/>
  <meta content="News" name="apple-mobile-web-app-

This does not look good: the response is a Google News page, not the article itself. The links in the feed are not the publishers' URLs; each one points to an intermediate Google News page that is supposed to redirect to the actual article.

A possible solution, adapted from [this](https://www.w3resource.com/python-exercises/urllib3/python-urllib3-exercise-7.php) article, is to follow the redirects explicitly with urllib3:

In [32]:
# Import the urllib3 library
import urllib3

def handle_redirects(initial_url):
    # Create a PoolManager instance
    http = urllib3.PoolManager()

    try:
        # Make a GET request, following redirects (redirect=True)
        response = http.request('GET', initial_url, redirect=True)

        # Check if the request was successful (status code 200)
        if response.status == 200:
            # Print the final response URL after following redirects
            print("Final Response URL:")
            print(response.geturl())
        else:
            # Print an error message if the request was not successful
            print(f"Error: Unable to fetch data. Status Code: {response.status}")

    except urllib3.exceptions.RequestError as e:
        print(f"Error: {e}")

In [33]:
if __name__ == "__main__":
    handle_redirects(url)

Final Response URL:
https://news.google.com/rss/articles/CBMibkFVX3lxTE9paHZNSWZNSDBFS1BwMXZVc3VwdVBqaWlDRUpNOHI4eUd6VEdWVkUwbHdpZldycXlubFFjMnVWWWM4VGE0akZHT0NNbUFYM1c5U2JjSHViWERYTE9rMEJ6bk81anp6bVRub2ZFX3p3?oc=5&ucbcb=1&hl=en-US&gl=US&ceid=US:en


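Nope. The final URL is still on news.google.com; only a few query parameters were appended. Following HTTP redirects alone does not reach the publisher's page.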
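
To detect this failure without reading the URL by eye, the host of the final URL can be compared against the Google News host. The sketch below uses only the standard library; `still_on_google_news` is a made-up helper, and the first URL is a shortened placeholder rather than a real article link.

```python
from urllib.parse import urlparse


def still_on_google_news(final_url):
    """Return True when the URL's host is news.google.com."""
    return urlparse(final_url).hostname == "news.google.com"


# Placeholder URL shaped like the final response URL above
stuck = "https://news.google.com/rss/articles/EXAMPLE?oc=5&ucbcb=1&hl=en-US"
# Publisher host taken from the Source field printed earlier
resolved = "https://foreignbrief.com/"

print(still_on_google_news(stuck))
print(still_on_google_news(resolved))
```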