Documentation: https://github.com/kotartemiy/pygooglenews?tab=readme-ov-file

In [3]:
# Install pygooglenews using pip
# !pip install pygooglenews --upgrade

# 1. Setup

In [4]:
from pygooglenews import GoogleNews
import pprint

In [5]:
# Create a GoogleNews object
language = 'en'  # English
country = 'US'  # United States

# Available languages and regions are listed on the Google News website: https://news.google.com/
gn = GoogleNews(lang=language, country=country)

# 2. Top news

In [6]:
from itertools import islice

top = gn.top_news()  # Get the top news articles

print(f'Type of the top variable: {type(top)}\n')

# Pretty print only the first x lines of the dictionary
x = 10  # Number of lines to print
for line in islice(pprint.pformat(top).splitlines(), x):
    print(line)

Type of the top variable: <class 'dict'>

{'entries': [{'guidislink': False,
              'id': 'CBMiowFBVV95cUxNeURuSlNSR2lveWF0ZWoyRjZZRXlnaGZ2X0hHYXhxeGZtaUFpRVI5QmNnWDJGdENmbXVqZkl5RTE3MnRGb0YyTnlFMWxBZXZDYS05emUtOE5RSkNhSklSRGYwdlotSERld0t6V1lGNjE2MDZXMXlCdDRvQmZlLW9MWVhXOXRYenBNMEVVVE9qQV9HckM5NEd4dUV6QkNGenNOZ3JV',
              'link': 'https://news.google.com/rss/articles/CBMiowFBVV95cUxNeURuSlNSR2lveWF0ZWoyRjZZRXlnaGZ2X0hHYXhxeGZtaUFpRVI5QmNnWDJGdENmbXVqZkl5RTE3MnRGb0YyTnlFMWxBZXZDYS05emUtOE5RSkNhSklSRGYwdlotSERld0t6V1lGNjE2MDZXMXlCdDRvQmZlLW9MWVhXOXRYenBNMEVVVE9qQV9HckM5NEd4dUV6QkNGenNOZ3JV?oc=5',
              'links': [{'href': 'https://news.google.com/rss/articles/CBMiowFBVV95cUxNeURuSlNSR2lveWF0ZWoyRjZZRXlnaGZ2X0hHYXhxeGZtaUFpRVI5QmNnWDJGdENmbXVqZkl5RTE3MnRGb0YyTnlFMWxBZXZDYS05emUtOE5RSkNhSklSRGYwdlotSERld0t6V1lGNjE2MDZXMXlCdDRvQmZlLW9MWVhXOXRYenBNMEVVVE9qQV9HckM5NEd4dUV6QkNGenNOZ3JV?oc=5',
                         'rel': 'alternate',
                         'type': 't

According to the documentation, "The returned object contains feed (FeedParserDict) and entries list of articles found with all data parsed." Note, however, that the returned object does not contain the full text of each article, only a summary. To get the actual text, we would have to request each of these links and extract it from the page.

Specifically, the returned object is the following:

> All 4 functions return the dictionary that has 2 sub-objects:
>   - feed - contains the information on the feed metadata
>   - entries - contains the parsed articles
> 
> Both are inherited from the `Feedparser`. The only change is that each dictionary under entries also contains sub_articles which are the similar articles found in the description. Usually, it is non-empty for top_news() and topic_headlines() feeds.
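
To make this structure concrete, here is a minimal sketch that walks the `entries` list and counts the `sub_articles` attached to each entry. `summarize_entries` is a hypothetical helper and `sample` is made-up stand-in data, not real output. The sketch assumes `sub_articles` is a list when present and treats anything else as empty.

```python
def summarize_entries(result, limit=5):
    """Return (title, number_of_sub_articles) pairs for the first `limit` entries."""
    summary = []
    for entry in result.get("entries", [])[:limit]:
        sub_articles = entry.get("sub_articles")
        # Assumption: sub_articles is a list when parsing succeeded; otherwise count it as 0.
        count = len(sub_articles) if isinstance(sub_articles, list) else 0
        summary.append((entry.get("title", ""), count))
    return summary


# Hypothetical stand-in for the object returned by gn.top_news()
sample = {
    "feed": {"title": "Example feed"},
    "entries": [
        {"title": "Story A", "sub_articles": [{"title": "Related A1"}, {"title": "Related A2"}]},
        {"title": "Story B", "sub_articles": None},
    ],
}

print(sample["feed"]["title"])
print(summarize_entries(sample))
```

With the real object, the same call would be `summarize_entries(top)`.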

In [11]:
# Print the top news articles
for i, article in enumerate(top['entries'][:5]):
    print(f"Article {i+1}:")
    print(f"Title: {article.title}")
    print(f"Link: {article.link}")
    print(f"Published: {article.published}")
    print(f"Source: {article.source}")
    print()

Article 1:
Title: Harvard's president speaks out against Trump. And, an analysis of DEI job losses - NPR
Link: https://news.google.com/rss/articles/CBMiowFBVV95cUxNeURuSlNSR2lveWF0ZWoyRjZZRXlnaGZ2X0hHYXhxeGZtaUFpRVI5QmNnWDJGdENmbXVqZkl5RTE3MnRGb0YyTnlFMWxBZXZDYS05emUtOE5RSkNhSklSRGYwdlotSERld0t6V1lGNjE2MDZXMXlCdDRvQmZlLW9MWVhXOXRYenBNMEVVVE9qQV9HckM5NEd4dUV6QkNGenNOZ3JV?oc=5
Published: Tue, 27 May 2025 11:30:37 GMT
Source: {'href': 'https://www.npr.org', 'title': 'NPR'}

Article 2:
Title: King Charles's Throne Speech to Canada's Parliament - follow live - BBC
Link: https://news.google.com/rss/articles/CBMiVEFVX3lxTE94Y2U4eXBCYkVGRTZ4alU1OHhJamhXX0NmaFhOdjNvY19MZERTQkVPcGs0R056cGs5eVUweE1VZUxPMlkzUWhHZ3dPY0Vpekc5dktPOQ?oc=5
Published: Tue, 27 May 2025 12:56:15 GMT
Source: {'href': 'https://www.bbc.com', 'title': 'BBC'}

Article 3:
Title: NPR sues Trump over executive order cutting federal funding - CNBC
Link: https://news.google.com/rss/articles/CBMid0FVX3lxTE9hVG1Kelltd2NZc1luOTN2RVpjV

# 3. Stories by location

According to the documentation, the geolocation search is "language agnostic", but "it does not guarantee that the feed for any specific place will exist".

In [12]:
harare = gn.geo_headlines('Harare')  # Get news articles for a specific location (e.g., Harare - Zimbabwe's capital)

# Print the news articles referring to Harare
for i, article in enumerate(harare['entries'][:5]):
    print(f"Article {i+1}:")
    print(f"Title: {article.title}")
    print(f"Link: {article.link}")
    print(f"Published: {article.published}")
    print(f"Source: {article.source}")
    print()

Article 1:
Title: Harare Hits the High Notes: Analog Africa Digs Deep with ‘Roots Rocking Zimbabwe’ - World Music Central
Link: https://news.google.com/rss/articles/CBMivAFBVV95cUxOWGhwb0dyOG8tU1FXRXlEVzdSSm9GMV9SWWpCOUlOclE3VHZCcnRVb3lrNmRyY1dTckIzWU96RTZhYXRzdlJid1lQTDhTVFllZVpkNTQwTDQxd1Q1YktibktfMUVYS3lKNEhNNC02MFVhci1VbUQ3WEpZYko1LV80OGpHT2g4VzlIdlczSEtkYnVtNzJOV0Q2RGRXcnBEOUtVMHNIbURVeDZRR2FZMVBrNzBJUlRqZlIxcTFCNA?oc=5
Published: Sun, 25 May 2025 08:53:12 GMT
Source: {'href': 'https://worldmusiccentral.org', 'title': 'World Music Central'}

Article 2:
Title: AF KLM Cargo’s network realignment sees Harare freighter suspension - Air Cargo News
Link: https://news.google.com/rss/articles/CBMiywFBVV95cUxOaXFJLVo3SEVlczdaSkxNeVRTUUNRQ2l1c083WjZuN2l1aEFJalVRR2d0S1BzNV9RMVFtVEJtNm1aUVBJSzQwOWdaNjJUTktha3pZT3ZSdlF6YlhBeDVIdk9iOWdXeEYzWXphUFBkaWFheEZBckJ4NlhqN0prQndodGhWRHQtdVAtVDhHTFhnTEMxUHg3TUlmc0hYaGhkN2loN1g2RVFmWjR4bGFRaWRlWWdxNW0wQWN3c25zby1RNWJ1N3FDQ3RFQ012SQ?oc=5
Published: Wed, 2

How many different sources do these articles come from?

In [17]:
# Check the number of unique sources in the Harare news articles
sources = set()
for article in harare['entries']:
    sources.add(article.source)
print(f"Number of unique sources: {len(sources)}")

# Print the unique sources
print("Unique sources:")
for source in sources:
    print(source)

Number of unique sources: 27
Unique sources:
{'href': 'https://www.voanews.com', 'title': 'VOA - Voice of America English News'}
{'href': 'https://www.techzim.co.zw', 'title': 'Techzim'}
{'href': 'https://www.cricbuzz.com', 'title': 'Cricbuzz.com'}
{'href': 'https://africanarguments.org', 'title': 'African Arguments'}
{'href': 'https://m.economictimes.com', 'title': 'The Economic Times'}
{'href': 'https://www.bbc.com', 'title': 'BBC'}
{'href': 'https://www.musicinafrica.net', 'title': 'Music In Africa |'}
{'href': 'https://m.economictimes.com', 'title': 'The Economic Times'}
{'href': 'https://www.travelnews.co.za', 'title': 'travelnews.co.za'}
{'href': 'https://www.unicef.org', 'title': 'Unicef'}
{'href': 'https://www.dailymaverick.co.za', 'title': 'Daily Maverick'}
{'href': 'https://techafricanews.com', 'title': 'TechAfrica News'}
{'href': 'https://www.musicinafrica.net', 'title': 'Music In Africa |'}
{'href': 'https://www.eeas.europa.eu', 'title': 'EEAS'}
{'href': 'https://www.aircar
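
Note that the list above contains repeated entries (The Economic Times and Music In Africa each appear twice), so 27 overstates the number of distinct sources. A likely cause, stated here as an assumption, is that the `source` objects are hashed by identity rather than by content. A minimal sketch that deduplicates by value instead, keyed on the `(href, title)` pair; `unique_sources` is a hypothetical helper and `sample_entries` is illustrative data:

```python
def unique_sources(entries):
    """Collect distinct (href, title) pairs from the `source` field of each entry."""
    seen = set()
    for entry in entries:
        source = entry.get("source")
        if not source:
            continue  # skip entries without source information
        # Tuples of strings compare by value, so equal sources collapse into one.
        seen.add((source.get("href"), source.get("title")))
    return seen


# Illustrative data shaped like the `source` field printed above
sample_entries = [
    {"source": {"href": "https://www.bbc.com", "title": "BBC"}},
    {"source": {"href": "https://m.economictimes.com", "title": "The Economic Times"}},
    {"source": {"href": "https://m.economictimes.com", "title": "The Economic Times"}},
]

print(f"Number of unique sources: {len(unique_sources(sample_entries))}")
```

On the real data this would be called as `unique_sources(harare['entries'])`.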

# 4. Stories by query

In [9]:
# Get news articles for a specific query (e.g., conflict in Harare) within a date range
query = gn.search(query='conflict in Harare', from_='2014-02-09', to_='2020-07-19')

# Print the news articles referring to the query
for i, article in enumerate(query['entries'][:5]):
    print(f"Article {i+1}:")
    print(f"Title: {article.title}")
    print(f"Link: {article.link}")
    print(f"Published: {article.published}")
    print(f"Source: {article.source}")
    print()

Article 1:
Title: The Zimbabwe-Mozambique border conflict - Foreign Brief
Link: https://news.google.com/rss/articles/CBMibkFVX3lxTE9paHZNSWZNSDBFS1BwMXZVc3VwdVBqaWlDRUpNOHI4eUd6VEdWVkUwbHdpZldycXlubFFjMnVWWWM4VGE0akZHT0NNbUFYM1c5U2JjSHViWERYTE9rMEJ6bk81anp6bVRub2ZFX3p3?oc=5
Published: Tue, 07 Mar 2017 08:00:00 GMT
Source: {'href': 'https://foreignbrief.com', 'title': 'Foreign Brief'}

Article 2:
Title: Five Issues to Watch as the Zimbabwe Crisis Unfolds - Africa Center for Strategic Studies
Link: https://news.google.com/rss/articles/CBMikAFBVV95cUxPT3Q5QUNxcEU3enBmUTFMaG9tZ29BR0MzWlV1M0Q5T2dUSU9IV0hwS1IzVjRpTkpVX1VBUzcya25jMEtJVXBfRGFrR3gxZC1kbGJkTnVsaEFWUi04QXZpYzUzdzJYRjMyV3R4SVpfY2k2ZEd4LU96eWFjTnBRUUljZEFtVy1SN2cxLVc1U2s2LXU?oc=5
Published: Thu, 16 Nov 2017 08:00:00 GMT
Source: {'href': 'https://africacenter.org', 'title': 'Africa Center for Strategic Studies'}

Article 3:
Title: Zimbabwe’s Coup Did Not Create Democracy from Dictatorship - Carnegie Endowment for International Peace

More information about the options for querying can be found here: https://github.com/kotartemiy/pygooglenews?tab=readme-ov-file#stories-by-a-query
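
The search above passes `from_` and `to_` as `YYYY-MM-DD` strings. A small sketch, assuming that string format is what `gn.search` expects (as in the call above), that builds those two arguments from `datetime.date` objects and rejects a reversed range. `date_range_kwargs` is a hypothetical helper, not part of pygooglenews.

```python
from datetime import date


def date_range_kwargs(start, end):
    """Build the from_/to_ keyword arguments as YYYY-MM-DD strings."""
    if start > end:
        raise ValueError("start date must not be after end date")
    # date.isoformat() produces the YYYY-MM-DD form used in the search call above.
    return {"from_": start.isoformat(), "to_": end.isoformat()}


kwargs = date_range_kwargs(date(2014, 2, 9), date(2020, 7, 19))
print(kwargs)  # {'from_': '2014-02-09', 'to_': '2020-07-19'}

# Usage with the GoogleNews object created earlier (requires network access):
# results = gn.search('conflict in Harare', **kwargs)
```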

# 5. But can we get the content of the articles?

In [7]:
import requests
from bs4 import BeautifulSoup

In [None]:
# Let's try to make a request to one of the links
url = query['entries'][0].link
print(url)

# Make a request to the URL (requests follows redirects by default; the timeout avoids hanging)
response = requests.get(url, allow_redirects=True, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    
    print(soup.prettify())  # Print the parsed HTML content

    # # Extract the title
    # title = soup.title.string if soup.title else 'No title found'
    
    # # Extract the first paragraph
    # first_paragraph = soup.find('p').text if soup.find('p') else 'No paragraph found'
    
    # print(f"Title: {title}")
    # print(f"First paragraph: {first_paragraph}")

This does not look good: the link that Google News exposes is not the actual link to the article. It is an intermediate Google News URL that is meant to redirect to the article, so the page we fetched is not the article itself.

An attempted solution, adapted from [this](https://www.w3resource.com/python-exercises/urllib3/python-urllib3-exercise-7.php) article:

In [32]:
# Import the urllib3 library
import urllib3

def handle_redirects(initial_url):
    # Create a PoolManager instance
    http = urllib3.PoolManager()

    try:
        # Make a GET request that follows redirects (urllib3's flag is `redirect`)
        response = http.request('GET', initial_url, redirect=True)

        # Check if the request was successful (status code 200)
        if response.status == 200:
            # Print the final response URL after following redirects
            print("Final Response URL:")
            print(response.geturl())
        else:
            # Print an error message if the request was not successful
            print(f"Error: Unable to fetch data. Status Code: {response.status}")

    except urllib3.exceptions.HTTPError as e:  # base class for urllib3 errors, including MaxRetryError
        print(f"Error: {e}")

In [33]:
if __name__ == "__main__":
    handle_redirects(url)

Final Response URL:
https://news.google.com/rss/articles/CBMibkFVX3lxTE9paHZNSWZNSDBFS1BwMXZVc3VwdVBqaWlDRUpNOHI4eUd6VEdWVkUwbHdpZldycXlubFFjMnVWWWM4VGE0akZHT0NNbUFYM1c5U2JjSHViWERYTE9rMEJ6bk81anp6bVRub2ZFX3p3?oc=5&ucbcb=1&hl=en-US&gl=US&ceid=US:en


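Nope: the final URL is still on news.google.com, so following HTTP redirects alone does not lead to the publisher's page.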
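
One way to make this check explicit is to compare the host of the final URL with `news.google.com`. A minimal sketch using only the standard library; `is_google_news_url` is a hypothetical helper, and the article ID in the first example URL is a placeholder.

```python
from urllib.parse import urlparse


def is_google_news_url(url):
    """Return True if `url` points at news.google.com rather than a publisher's site."""
    host = urlparse(url).hostname or ""
    return host == "news.google.com"


# Placeholder article ID, for illustration only
print(is_google_news_url("https://news.google.com/rss/articles/EXAMPLE_ID?oc=5"))
print(is_google_news_url("https://foreignbrief.com/zimbabwe-mozambique-border-conflict/"))
```

A redirect-following approach has only worked if this check returns `False` for the final URL.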

# 6. Trying the googlenewsdecoder package
Repository: https://github.com/SSujitX/google-news-url-decoder/tree/main

In [None]:
# pip install googlenewsdecoder

In [19]:
from googlenewsdecoder import gnewsdecoder

source_urls = [article.link for article in query['entries']]

def decode_urls(source_urls):
    interval_time = 1  # interval is optional, default is None
    #proxy = "http://user:pass@localhost:8080" # proxy is optional, default is None

    results = []

    for url in source_urls:
        try:
            decoded_url = gnewsdecoder(url, 
                                       interval=interval_time, 
                                       # proxy=proxy
                                       )
            if decoded_url.get("status"):
                clean_url = decoded_url["decoded_url"]
                print("Decoded URL:", clean_url)
                results.append(clean_url)
            else:
                print("Error:", decoded_url["message"])
        except Exception as e:
            print(f"Error occurred: {e}")

    return results

decoded_urls = decode_urls(source_urls)

Decoded URL: https://foreignbrief.com/zimbabwe-mozambique-border-conflict/
Decoded URL: https://africacenter.org/spotlight/five-issues-to-watch-as-the-zimbabwe-crisis-unfolds/
Decoded URL: https://carnegieendowment.org/posts/2018/08/zimbabwes-coup-did-not-create-democracy-from-dictatorship?lang=en
Decoded URL: https://www.seattletimes.com/nation-world/zimbabwes-flag-center-of-social-media-war-over-frustrations/
Decoded URL: https://www.aljazeera.com/economy/2019/9/6/robert-mugabe-leaves-a-legacy-of-economic-mismanagement
Decoded URL: https://thediplomat.com/2016/01/zimbabwe-chinas-all-weather-friend-in-africa/
Decoded URL: https://www.icrc.org/en/document/family-war-zimbabwe-children-idp-conflict
Decoded URL: https://www.globalwitness.org/en/campaigns/conflict-diamonds/leave-no-stone-unturned/
Decoded URL: https://www.newsweek.com/zimbabwes-mugabe-tells-war-vets-im-not-dead-yet-445400
Decoded URL: https://www.pewresearch.org/short-reads/2017/11/17/egypts-coup-is-first-in-2013-as-takeov

In [20]:
decoded_urls = list(set(decoded_urls))  # Remove duplicates
print(f"Number of unique decoded URLs: {len(decoded_urls)}")

Number of unique decoded URLs: 48
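
Note that `list(set(...))` removes duplicates but does not keep the original order of the results. If the order of the search results matters, a sketch that preserves first-seen order with `dict.fromkeys` (dictionaries keep insertion order in Python 3.7+) could look like this; `dedupe_keep_order` is a hypothetical helper and the sample URLs are placeholders.

```python
def dedupe_keep_order(urls):
    """Remove duplicate URLs while keeping the first occurrence of each in place."""
    return list(dict.fromkeys(urls))


# Placeholder URLs, for illustration only
sample_urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
]

print(dedupe_keep_order(sample_urls))  # ['https://example.com/a', 'https://example.com/b']
```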
