<a target="_blank" href="https://colab.research.google.com/github/mcosarinsky/TP-GoogleNews/blob/main/news_download.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Documentation

[gnews](https://github.com/ranahaani/GNews/blob/master/README.md)

[googlenewsdecoder](https://github.com/SSujitX/google-news-url-decoder)

In [1]:
!pip install gnews googlenewsdecoder newspaper3k lxml_html_clean tqdm


In [49]:
import requests
import datetime

from gnews import GNews
from googlenewsdecoder import new_decoderv1
from bs4 import BeautifulSoup
from tqdm import tqdm


class News:
    def __init__(self, start_date: datetime.date, end_date: datetime.date):
        self.start_date = start_date
        self.end_date = end_date
        self.google_news = GNews(language='es', country='Argentina', start_date=start_date, end_date=end_date)

    def get_google_news(self, site):
        """Fetch news articles for the specified site and date range."""
        results = self.google_news.get_news_by_site(site)
        return results

    def update_google_news_dates(self, start_date, end_date):
        """Updates the google_news object with new start_date and end_date."""
        self.google_news.start_date = start_date
        self.google_news.end_date = end_date

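    def get_article_content(self, article, time_interval=5):
        """Decode the article's Google News URL and add its description and full text.

        Returns the updated article dict, or a dict with an 'error' key on failure.
        """
        url = article['url']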
        try:
            # Decode the Google News URL
            decoded_data = new_decoderv1(url, interval=time_interval)
            if decoded_data.get("status"):
                decoded_url = decoded_data["decoded_url"]
                # Use a timeout so a stalled server cannot hang the whole run
                response = requests.get(decoded_url, timeout=30)

                # Check if the request was successful
                if response.status_code == 200:
                    # Parse the HTML content
                    soup = BeautifulSoup(response.text, 'html.parser')

                    # Extract the description
                    description_tag = soup.find('meta', property='og:description')
                    description = description_tag.get('content') if description_tag else "Description not found."

                    # Extract article text
                    text = self.google_news.get_full_article(decoded_url).text

                    # Update article dictionary with description, text and url
                    article['url'] = decoded_url
                    article['description'] = description
                    article['text'] = text
                    return article
                else:
                    return {"error": f"Failed to retrieve the article from the decoded URL. Status code: {response.status_code}"}
            else:
                return {"error": f"Error decoding URL: {decoded_data['message']}"}
        except Exception as e:
            return {"error": str(e)}

    def fetch_articles(self, sites, time_interval=1):
        """Fetch the articles of every site and extract their content.

        Failed downloads are kept in the result as dicts with an 'error' key.
        """
        all_articles = []

        for site in sites:
            # Extract the site name, e.g. 'lanacion' from 'lanacion.com.ar/economia'
            site_name = site.split('.com')[0].split('/')[-1]
            articles = self.get_google_news(site)

            for article in tqdm(articles, desc=f"Processing articles from {site}"):
                article_content = self.get_article_content(article, time_interval=time_interval)
                article_content['site'] = site_name
                all_articles.append(article_content)

        return all_articles
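
To see what the `site_name` expression in `fetch_articles` produces, the sketch below applies it to a few of the site strings used later in this notebook. The helper `split_site` is hypothetical (it is not part of the `News` class); it also returns the section path, which the original expression discards, so that `lanacion.com.ar/economia` and `lanacion.com.ar/politica` can still be told apart.

```python
def site_name_from(site: str) -> str:
    """Same expression used in News.fetch_articles."""
    return site.split('.com')[0].split('/')[-1]


def split_site(site: str) -> tuple[str, str]:
    """Hypothetical helper: return (site_name, section) for a 'domain/section' string."""
    domain, _, section = site.partition('/')
    return site_name_from(domain), section.strip('/')


for s in ['lanacion.com.ar/economia', 'perfil.com/noticias/politica', 'clarin.com/politica']:
    print(s, '->', split_site(s))
```

Storing the section in its own column would be one way to keep the economy and politics articles of the same outlet distinguishable in the final CSV.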

## Data download

In [18]:
start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 17)

# Create an instance of News
news_fetcher = News(start_date=start, end_date=end)

# Fetch Google News articles
articles = news_fetcher.get_google_news(site='lanacion.com.ar/economia')
print("Articles fetched")
print("Extracting content")
article_content = news_fetcher.get_article_content(articles[1], time_interval=1)

print(article_content)

Articles fetched
Extracting content
{'title': 'La inflación del Indec cerró en 23,9 por ciento en 2014 - LA NACION', 'description': 'El instituto de estadísticas oficial informó el aumento promedio de precios de la economía del año que pasó; las consultoras privadas habían calculado un 38,5 por ciento', 'published date': 'Fri, 16 Jan 2015 08:00:00 GMT', 'url': 'https://www.lanacion.com.ar/economia/la-inflacion-del-indec-cerro-en-239-por-ciento-en-2014-nid1760688/', 'publisher': {'href': 'https://www.lanacion.com.ar', 'title': 'LA NACION'}, 'text': 'La inflación oficial de 2014 fue de 23,9 por ciento, según informó hoy el Instituto Nacional de Estadística y Censos ( INDEC ). La evaluación anual del organismo resultó 14,6 puntos inferior al 38,5 por ciento estimado por las consultoras privadas.\n\nEn lo que respecta a diciembre, la medición del INDEC alcanzó el 1 por ciento frente al 1,87 por ciento del sector privado.\n\nLa diferencia entre ambas mediciones resulta relevante si se tiene
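
The `published date` field in the record above is an RFC 2822 style string. A minimal sketch of converting it to a `datetime` with the standard library, for example to check that an article falls inside the requested window. The variable names are illustrative, and the sample value is copied from the output above.

```python
import datetime
from email.utils import parsedate_to_datetime

published = 'Fri, 16 Jan 2015 08:00:00 GMT'  # value copied from the sample output
published_dt = parsedate_to_datetime(published)  # timezone-aware datetime (UTC)

# Same window used when creating the News instance above
start = datetime.date(2015, 1, 15)
end = datetime.date(2015, 1, 17)
in_window = start <= published_dt.date() <= end
print(published_dt.isoformat(), in_window)
```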

In [56]:
sites = ['lanacion.com.ar/economia', 'lanacion.com.ar/politica',
         'perfil.com/noticias/politica', 'perfil.com/noticias/economia',
         'clarin.com/economia', 'clarin.com/politica']

news_fetcher = News(start_date=None, end_date=None)
all_articles = []
start = datetime.date(2022, 1, 1)
end_of_year = datetime.date(2022, 12, 31)
delta_days = 15

# Fetch news for the entire year in consecutive windows of delta_days + 1 days
while start <= end_of_year:
    end = min(start + datetime.timedelta(days=delta_days), end_of_year)
    news_fetcher.update_google_news_dates(start, end)

    print(f"Fetching articles starting at {start.strftime('%d %B %Y')}\n")

    articles = news_fetcher.fetch_articles(sites, time_interval=1)
    all_articles.extend(articles)

    start = start + datetime.timedelta(days=delta_days + 1)
    print('\n')

Fetching articles starting at 01 January 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 93/93 [03:46<00:00,  2.44s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 51/51 [01:36<00:00,  1.90s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 35/35 [01:06<00:00,  1.90s/it]
Processing articles from clarin.com/politica: 100%|██████████| 41/41 [01:18<00:00,  1.91s/it]




Fetching articles starting at 17 January 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 96/96 [03:03<00:00,  1.91s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 63/63 [01:58<00:00,  1.88s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 44/44 [01:23<00:00,  1.90s/it]
Processing articles from clarin.com/politica: 100%|██████████| 45/45 [01:26<00:00,  1.91s/it]




Fetching articles starting at 02 February 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 90/90 [02:49<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 75/75 [02:20<00:00,  1.87s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 37/37 [01:10<00:00,  1.90s/it]
Processing articles from clarin.com/politica: 100%|██████████| 75/75 [02:21<00:00,  1.89s/it]




Fetching articles starting at 18 February 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 93/93 [02:55<00:00,  1.88s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 56/56 [01:45<00:00,  1.89s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 49/49 [01:31<00:00,  1.86s/it]
Processing articles from clarin.com/politica: 100%|██████████| 58/58 [01:48<00:00,  1.87s/it]




Fetching articles starting at 06 March 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 78/78 [02:26<00:00,  1.88s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 69/69 [02:10<00:00,  1.89s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 51/51 [01:37<00:00,  1.91s/it]
Processing articles from clarin.com/politica: 100%|██████████| 82/82 [02:34<00:00,  1.89s/it]




Fetching articles starting at 22 March 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:07<00:00,  1.87s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 97/97 [03:04<00:00,  1.91s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 54/54 [01:41<00:00,  1.88s/it]
Processing articles from clarin.com/politica: 100%|██████████| 100/100 [03:08<00:00,  1.88s/it]




Fetching articles starting at 07 April 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:10<00:00,  1.90s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 88/88 [02:46<00:00,  1.90s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 47/47 [01:30<00:00,  1.92s/it]
Processing articles from clarin.com/politica: 100%|██████████| 90/90 [02:52<00:00,  1.91s/it]




Fetching articles starting at 23 April 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:09<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 84/84 [02:37<00:00,  1.88s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 52/52 [01:38<00:00,  1.90s/it]
Processing articles from clarin.com/politica: 100%|██████████| 89/89 [02:49<00:00,  1.90s/it]




Fetching articles starting at 09 May 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:07<00:00,  1.88s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 88/88 [02:44<00:00,  1.86s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 57/57 [01:46<00:00,  1.87s/it]
Processing articles from clarin.com/politica: 100%|██████████| 91/91 [02:55<00:00,  1.93s/it]




Fetching articles starting at 25 May 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:08<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 90/90 [02:51<00:00,  1.91s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 60/60 [01:51<00:00,  1.86s/it]
Processing articles from clarin.com/politica: 100%|██████████| 84/84 [02:36<00:00,  1.86s/it]




Fetching articles starting at 10 June 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:09<00:00,  1.90s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 83/83 [02:36<00:00,  1.89s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 51/51 [01:36<00:00,  1.89s/it]
Processing articles from clarin.com/politica: 100%|██████████| 75/75 [02:20<00:00,  1.87s/it]




Fetching articles starting at 26 June 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:08<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 100/100 [03:09<00:00,  1.90s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 42/42 [01:20<00:00,  1.91s/it]
Processing articles from clarin.com/politica: 100%|██████████| 86/86 [02:42<00:00,  1.89s/it]




Fetching articles starting at 12 July 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:05<00:00,  1.86s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 94/94 [02:57<00:00,  1.89s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 55/55 [01:43<00:00,  1.89s/it]
Processing articles from clarin.com/politica: 100%|██████████| 70/70 [02:13<00:00,  1.91s/it]




Fetching articles starting at 28 July 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:08<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 100/100 [03:10<00:00,  1.90s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 47/47 [01:29<00:00,  1.91s/it]
Processing articles from clarin.com/politica: 100%|██████████| 79/79 [02:28<00:00,  1.88s/it]




Fetching articles starting at 13 August 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:08<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 100/100 [03:07<00:00,  1.87s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 48/48 [01:30<00:00,  1.88s/it]
Processing articles from clarin.com/politica: 100%|██████████| 92/92 [02:53<00:00,  1.88s/it]




Fetching articles starting at 29 August 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:05<00:00,  1.85s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 100/100 [03:09<00:00,  1.90s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 53/53 [01:41<00:00,  1.92s/it]
Processing articles from clarin.com/politica: 100%|██████████| 97/97 [03:06<00:00,  1.93s/it]




Fetching articles starting at 14 September 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:09<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 100/100 [03:08<00:00,  1.88s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 52/52 [01:38<00:00,  1.89s/it]
Processing articles from clarin.com/politica: 100%|██████████| 73/73 [02:18<00:00,  1.89s/it]




Fetching articles starting at 30 September 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:09<00:00,  1.89s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 79/79 [02:30<00:00,  1.91s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 50/50 [01:33<00:00,  1.88s/it]
Processing articles from clarin.com/politica: 100%|██████████| 79/79 [02:28<00:00,  1.88s/it]




Fetching articles starting at 16 October 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:08<00:00,  1.88s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 100/100 [03:07<00:00,  1.87s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 72/72 [02:14<00:00,  1.86s/it]
Processing articles from clarin.com/politica: 100%|██████████| 86/86 [02:43<00:00,  1.90s/it]




Fetching articles starting at 01 November 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:09<00:00,  1.90s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 100/100 [03:08<00:00,  1.88s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 66/66 [02:04<00:00,  1.89s/it]
Processing articles from clarin.com/politica: 100%|██████████| 68/68 [02:07<00:00,  1.88s/it]




Fetching articles starting at 17 November 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:10<00:00,  1.91s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 92/92 [02:53<00:00,  1.88s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 71/71 [02:12<00:00,  1.87s/it]
Processing articles from clarin.com/politica: 100%|██████████| 54/54 [01:41<00:00,  1.89s/it]




Fetching articles starting at 03 December 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:07<00:00,  1.88s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 77/77 [02:25<00:00,  1.89s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 62/62 [01:56<00:00,  1.88s/it]
Processing articles from clarin.com/politica: 100%|██████████| 56/56 [01:45<00:00,  1.88s/it]




Fetching articles starting at 19 December 2022



Processing articles from lanacion.com.ar/economia: 100%|██████████| 100/100 [03:08<00:00,  1.88s/it]
Processing articles from lanacion.com.ar/politica/: 100%|██████████| 70/70 [02:12<00:00,  1.89s/it]
Processing articles from perfil.com/seccion/economia: 0it [00:00, ?it/s]
Processing articles from perfil.com/seccion/politica: 0it [00:00, ?it/s]
Processing articles from clarin.com/economia: 100%|██████████| 44/44 [01:22<00:00,  1.88s/it]
Processing articles from clarin.com/politica: 100%|██████████| 54/54 [01:41<00:00,  1.88s/it]
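
The loop above walks through the year in consecutive, non-overlapping windows. The same windowing logic can be sketched as a standalone generator, which makes the boundaries easy to inspect before any network request is made. `date_windows` is a hypothetical helper, not part of the original notebook.

```python
import datetime
from typing import Iterator


def date_windows(start: datetime.date, stop: datetime.date,
                 delta_days: int = 15) -> Iterator[tuple[datetime.date, datetime.date]]:
    """Yield (window_start, window_end) pairs covering start..stop.

    Each window spans delta_days + 1 calendar days (both ends included),
    except the last one, which is clipped to `stop`.
    """
    step = datetime.timedelta(days=delta_days)
    one_day = datetime.timedelta(days=1)
    while start <= stop:
        end = min(start + step, stop)
        yield start, end
        start = end + one_day


windows = list(date_windows(datetime.date(2022, 1, 1), datetime.date(2022, 12, 31)))
print(len(windows), windows[0], windows[-1])
```

Each pair could then be passed to `news_fetcher.update_google_news_dates(start, end)`. Several windows in the log above report exactly 100 articles, which may indicate that the result count is being capped; a smaller `delta_days` might recover more articles, though that is an assumption not verified here.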








In [58]:
import pandas as pd

articles_df = pd.DataFrame(all_articles)
articles_df.to_csv('articulos.csv', index=False)

articles_df.head()

| | title | description | published date | url | publisher | text | site | error |
|---|---|---|---|---|---|---|---|---|
| 0 | La resolución del Senasa que tiene en vilo a l... | Una disposición del organismo sanitario dio pl... | Thu, 13 Jan 2022 08:00:00 GMT | https://www.lanacion.com.ar/economia/campo/gan... | {'href': 'https://www.lanacion.com.ar', 'title... | Al doble estándar de las plantas frigoríficas,... | lanacion | |
| 1 | Apagón masivo: esta es la casa que dejó sin lu... | La antena de internet de una casa produjo un c... | Wed, 12 Jan 2022 08:00:00 GMT | https://www.lanacion.com.ar/economia/apagon-ma... | {'href': 'https://www.lanacion.com.ar', 'title... | Sobre la avenida Eva Perón al 6900, en el part... | lanacion | |
| 2 | Desembarco: finalmente Juan Valdez llega a la ... | El jueves la cadena colombiana inaugurará su p... | Tue, 11 Jan 2022 08:00:00 GMT | https://www.lanacion.com.ar/economia/negocios/... | {'href': 'https://www.lanacion.com.ar', 'title... | “¡Finalmente estamos acá!”. Así lo proclama el... | lanacion | |
| 3 | ¿Quién inventó los precios? ¿Por qué tienen qu... | Cuando rige la competencia, y el Estado ni sub... | Sun, 09 Jan 2022 08:00:00 GMT | https://www.lanacion.com.ar/economia/quien-inv... | {'href': 'https://www.lanacion.com.ar', 'title... | En un programa de televisión, a un niño que fo... | lanacion | |
| 4 | Qué riesgo esconde el boom de las inversiones ... | Durante a pandemia se incrementó el número de ... | Fri, 14 Jan 2022 08:00:00 GMT | https://www.lanacion.com.ar/economia/que-riesg... | {'href': 'https://www.lanacion.com.ar', 'title... | La aparición de apps y plataformas que permite... | lanacion | |
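
When decoding or downloading fails, `get_article_content` returns a dictionary with only an `error` key (plus `site`, added by `fetch_articles`), so `all_articles` mixes successful and failed records; that is why the output above has an `error` column. A minimal sketch of separating the two before saving. `split_by_error` and the sample records are hypothetical.

```python
def split_by_error(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition records into (successful, failed) based on the 'error' key."""
    ok = [r for r in records if 'error' not in r]
    failed = [r for r in records if 'error' in r]
    return ok, failed


# Made-up sample records shaped like the output of News.fetch_articles
sample = [
    {'title': 'Example title', 'url': 'https://example.com/a', 'text': '...', 'site': 'lanacion'},
    {'error': 'Failed to retrieve the article', 'site': 'clarin'},
]
ok, failed = split_by_error(sample)
print(len(ok), 'ok,', len(failed), 'failed')
```

Keeping the failed records in a separate list makes it possible to retry them later without re-downloading the articles that already succeeded.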
