In [1]:
from google.colab import drive
drive.mount('/content/drive/')
project_data_path='/content/drive/MyDrive/project_peruvian_news/data'

ModuleNotFoundError: No module named 'google.colab'

# Notebook Abstract:

Note: This is one of two notebooks provided for our project. This notebook must be completed before moving on to the second notebook, "Analysis.ipynb".

In this notebook, we showcase the data aggregation method used in our project. We used the BeautifulSoup library to collect the title, category, date, and href of every news article published on our news website (https://gestion.pe) between a specified start date and end date.

After aggregation, we also incorporated a translation step that converts the original Spanish titles to English for topic modeling and sentiment analysis.

# Imports

This single large cell holds all of the necessary imports that our project uses for:

- scraping websites
- creating dataframes
- filtering data
- processing translations


In [2]:
import os
import datetime
import json
import bs4
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
!pip install deep-translator
from deep_translator import GoogleTranslator

Collecting deep-translator
  Downloading deep_translator-1.8.3-py3-none-any.whl (29 kB)
Installing collected packages: deep-translator
Successfully installed deep-translator-1.8.3


# Initializing the scraping URL

This is arguably the most important part of the entire project. Here we set the location our data comes from and the date range we want to collect.

In [3]:
# Base URL for web scraping
url_base = "https://gestion.pe"
# start and end dates for web scraping
start_date = '2020-01-01'
end_date = '2021-12-31'


# Defining functions for the purpose of Web Scraping

In this section of the notebook, we define a collection of functions that demonstrate our use of web scraping with BeautifulSoup.

In [4]:
def dates(begin, end):
  """
  Description: 
  The dates function returns a list of date strings ('%Y-%m-%d') for every day between the begin and end string parameters, inclusive

  Parameters:
  begin (str) - the first day in the list of dates format('%Y-%m-%d')
  end (str) - the last day in the list of dates format('%Y-%m-%d')
  """
  start = datetime.datetime.strptime(begin, '%Y-%m-%d')
  final = datetime.datetime.strptime(end, '%Y-%m-%d')
  step = datetime.timedelta(days=1)
  vector = []
  while(start <= final):
      vector.append(start.date())
      start += step
  new_vector = [date_obj.strftime('%Y-%m-%d') for date_obj in vector]
  return  new_vector
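
As a quick sanity check of the date-range logic above, here is a minimal standalone sketch using only the standard library. The name `date_strings` is hypothetical and not part of the notebook; it is meant to mirror what `dates` does, computing the number of days up front instead of looping with a `while`.

```python
import datetime

def date_strings(begin, end, fmt='%Y-%m-%d'):
    """Return every day from begin to end (inclusive) as a formatted string."""
    start = datetime.datetime.strptime(begin, fmt).date()
    final = datetime.datetime.strptime(end, fmt).date()
    # Number of days in the inclusive range; zero or negative yields an empty list
    n_days = (final - start).days + 1
    return [(start + datetime.timedelta(days=i)).strftime(fmt) for i in range(n_days)]

# 2020 is a leap year, so the range below crosses February 29
sample = date_strings('2020-02-27', '2020-03-01')
print(sample)
```

If the end date falls before the begin date, the sketch returns an empty list, which matches the behavior of the `while` loop in `dates`.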

In [5]:
def catching_news_date(date):
  """
  Description: 
  The catching_news_date function returns a list of dicts, one for every news article published on the parameter date

  Parameters:
  date (str) - the date ('%Y-%m-%d') that the web scraper will collect news article titles from
  """
  # archive page listing every article published on the given date
  url = f"{url_base}/archivo/todas/{date}/"
  try:
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    news = []
    for div in soup.find('div', {'role':'main'}).div.find_all('div', recursive=False):
        try:
            a = div.find('a', {'class': 'story-item__section text-sm text-black md:mb-15'})
            category = a.text
            title = div.h2.a.text
            href = div.h2.a['href']
            news.append({'categoria': category, 'titulo': title, 'href': href, 'fecha': date})
        except Exception:
            # skip blocks that do not have the expected article structure
            continue
    return news

  # catch Exception rather than using a bare except, so that interrupting the kernel still stops the scrape
  except Exception:
    print('error with', url)
    return []
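
The import cell brings in `Request`, but the function above passes a plain URL string to `urlopen`. As a hedged sketch, the request for one archive day could be built explicitly so that a header such as `User-Agent` can be attached. Whether the site needs such a header is an assumption; `build_archive_request` and the user-agent string are hypothetical names chosen for illustration. The sketch only builds the request and does not send it.

```python
from urllib.request import Request

URL_BASE = "https://gestion.pe"  # same site as url_base in the notebook

def build_archive_request(date, base=URL_BASE, user_agent="Mozilla/5.0 (research notebook)"):
    """Build (but do not send) the request for one day's archive page."""
    url = f"{base}/archivo/todas/{date}/"
    # The User-Agent value is an illustrative assumption, not a site requirement
    return Request(url, headers={"User-Agent": user_agent})

req = build_archive_request("2020-01-01")
print(req.full_url)
```

The resulting object can be passed to `urlopen(req, timeout=30)` in place of the URL string; the timeout value is likewise only a suggestion.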
  
  

# Conducting Web Scrape

After defining our functions, we run the scrape to extract the necessary information for every news article between our specified start and end dates:
- category
- title
- date
- href

In [6]:
today = datetime.datetime.today().strftime('%Y-%m-%d')
castillo = []
dates_c = dates(start_date, end_date)
for date in dates_c:
    print("\r", f'scraping news for date {date}', end="")
    news_content = catching_news_date(date)
    castillo += news_content
df = pd.DataFrame(castillo)

 scraping news for date 2021-12-31

# Forming a database and filtering articles by category

Surprise: we actually created the pandas DataFrame in the last cell, but that was just for safe coding purposes. We wouldn't want to forget to save the data in a more robust data structure after all that web scraping.

In [7]:
# a quick look at the information
print('# of rows',df.shape[0])

df.head() 

# of rows 18275


Unnamed: 0,categoria,titulo,href,fecha
0,Economía,"David Stern, el hombre que transformó la NBA e...",/economia/david-stern-el-hombre-que-transformo...,2020-01-01
1,Perú,Elecciones 2020: Conoce el local de votación p...,/peru/onpe-lugar-de-votacion-mesa-de-votacion-...,2020-01-01
2,Internacional,Cómo un país con vinos desconocidos conquistó ...,/mundo/internacional/como-un-pais-con-vinos-de...,2020-01-01
3,EEUU,Manifestantes pro-Irán levantan campamento ant...,/mundo/eeuu/manifestantes-pro-iran-levantan-ca...,2020-01-01
4,Tendencias,Una mujer asume por primera vez la subadminist...,/tendencias/una-mujer-asume-por-primera-vez-la...,2020-01-01


In [8]:
# displaying all of the collected categories
df['categoria'].unique()

array(['Economía', 'Perú', 'Internacional', 'EEUU', 'Tendencias',
       'Empresas', 'Política', 'Mundo', 'Management & Empleo',
       'Finanzas Personales', 'Tu Dinero', 'Inmobiliarias', 'Gestión TV',
       'Mercados', 'Fotogalerías', 'Consultorio de Negocios',
       'Tecnología', 'Opinión', 'Estilos', 'Publirreportaje', 'Viajes',
       'Lujo', 'Consultorio Tributario', 'Con Las Cuentas Claras',
       'México', 'España', 'Hablemos más Simple', 'Moda',
       '20 en Empleabilidad', 'Pregunta de hoy', 'Finanzas personales',
       'Tecnología y Ciencia', 'Banca Lab', ''], dtype=object)

In [9]:
# applying a filter to the dataframe to select only the rows whose category is listed below
mask = ['Economía', 'Internacional', 'Empresas', 'Política', 'Mundo','Mercados']


# filtering data
df=df[df['categoria'].isin(mask)]
df=df.reset_index(drop=True)
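
To make the filtering step concrete, here is a small sketch on toy data. The rows and the `keep` list are invented for illustration and are not the scraped data. It shows `isin` keeping only the listed categories and the index being renumbered afterwards.

```python
import pandas as pd

# Toy rows standing in for the scraped data (invented for illustration)
toy = pd.DataFrame({
    "categoria": ["Economía", "Viajes", "Mundo", "Moda"],
    "titulo": ["a", "b", "c", "d"],
})

keep = ["Economía", "Mundo"]

# Keep rows whose category is in the list, then renumber the index from 0
filtered = toy[toy["categoria"].isin(keep)].reset_index(drop=True)
print(filtered)
```

Renumbering the index matters here because the surviving rows would otherwise keep their original positions (0 and 2 in the toy data).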

# Translating Articles into English

Because the libraries for topic modeling and sentiment analysis largely support English, we are going to translate all of the titles. It should be noted that this could introduce a significant bias in the results, since we are performing natural language processing on translated text rather than in the original language.

In [10]:
# init progress start
count=0

def displayCount():
  """
  Description:
  the function displayCount is a simplistic progress function
  """
  global count
  count=count+1
  print("\r", 'translating title #',count, end='')

In [11]:
g=GoogleTranslator(source='auto', target='en')
def translateTitle(title):
  """
  Description:
  translateTitle uses an instance of Google Translate to turn Spanish titles into English titles

  Parameters:
  title (str) - a string of Spanish words, preferably :)
  """
  global g
  displayCount()
  return g.translate(title)

In [12]:
# now we make a new column and process the titles one at a time; this is not the optimal way to do so, and other methods may be faster, but it is guaranteed to work accurately
df['title']=df['titulo'].apply(translateTitle)

# the optimized method
# df['title'] = g.translate_batch(batch=[text for text in df['titulo'].to_numpy()])

 translating title # 8401ranslating title # 797

KeyboardInterrupt: 
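
Since translating one title at a time is slow and the run above was interrupted, one possible mitigation is to cache translations so that repeated titles are translated only once and an interrupted run can pick up where it left off. The sketch below is an assumption-laden illustration, not the notebook's method: `translate_with_cache` is a hypothetical helper, and `fake_translate` is a stand-in so the example runs offline.

```python
def translate_with_cache(titles, translate_fn, cache=None):
    """Translate titles, calling translate_fn once per distinct title.

    cache maps source title -> translation and is updated in place, so a
    caller that keeps the dict can resume after an interruption.
    """
    if cache is None:
        cache = {}
    results = []
    for title in titles:
        if title not in cache:
            cache[title] = translate_fn(title)
        results.append(cache[title])
    return results

# Stand-in translator so the sketch runs offline (hypothetical, not a real API)
calls = []
def fake_translate(text):
    calls.append(text)
    return text.upper()

cache = {}
out = translate_with_cache(["hola", "mundo", "hola"], fake_translate, cache)
print(out, len(calls))
```

In the notebook, `translate_fn` could be `g.translate`, and the `cache` dict could be written to disk periodically (for example with the already imported `json` module) so that a restart does not repeat finished work.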

In [None]:
df

Unnamed: 0,categoria,titulo,href,fecha,title
0,Economía,"David Stern, el hombre que transformó la NBA e...",/economia/david-stern-el-hombre-que-transformo...,2020-01-01,"David Stern, the man who transformed the NBA i..."
1,Internacional,Cómo un país con vinos desconocidos conquistó ...,/mundo/internacional/como-un-pais-con-vinos-de...,2020-01-01,How a country with unknown wines conquered New...
2,Empresas,Hombre más rico de Asia reta a Amazon con nuev...,/economia/empresas/hombre-mas-rico-de-asia-ret...,2020-01-01,Asia's richest man challenges Amazon with new ...
3,Empresas,Italia inicia el camino para suspender las con...,/economia/empresas/italia-inicia-el-camino-par...,2020-01-01,Italy begins the path to suspend concessions t...
4,Política,Elecciones 2020: JNE confirma dos candidaturas...,/peru/politica/elecciones-2020-jne-confirma-do...,2020-01-01,Elections 2020: JNE confirms two candidacies a...
...,...,...,...,...,...
11922,Política,Comisión del Congreso citará al alcalde de Lim...,/peru/politica/incendio-en-mesa-redonda-congre...,2021-12-31,Congress Commission will summon the mayor of L...
11923,Mundo,Dakar echa raíces en el desierto de Arabia Sau...,/mundo/dakar-echa-raices-en-el-desierto-de-ara...,2021-12-31,Dakar takes root in the desert of Saudi Arabia...
11924,Mundo,"Ómicron crece en Latinoamérica, países descart...",/mundo/covid-ops-omicron-crece-en-latinoameric...,2021-12-31,"Ómicron grows in Latin America, countries rule..."
11925,Economía,MEF flexibiliza deducción de gastos para el pa...,/economia/mef-flexibiliza-deduccion-de-gastos-...,2021-12-31,MEF makes deduction of expenses for the paymen...


# Saving the dataframe for Analysis

Once the translation of all the titles is finished, you will have completed all the necessary content to save the data into the project repository and continue onto the analysis.

But do not leave before you save; that would suck.

In [None]:
df.to_csv("../data/data_news.csv")