In [1]:
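# Mount Google Drive when running on Colab; fall back to the local data folder elsewhere
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    project_data_path = '/content/drive/MyDrive/project_peruvian_news/data'
except ModuleNotFoundError:
    # not running on Colab: use the repository's relative data directory
    project_data_path = '../data'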

# Notebook Abstract:

Note: This is one of two notebooks provided for our project. This notebook must be completed before moving on to the second notebook, "Analysis.ipynb".

In the following notebook, we showcase the data aggregation method we used in our project. We used the BeautifulSoup library to collect the title, category, date, and href of every news article on our news website (https://gestion.pe) published between a specified start date and end date.

After aggregation, we also incorporated a translation step that converts the original Spanish titles to English for topic modeling and sentiment analysis.

# Imports

Here is one large cell holding all of the necessary imports that our project uses for:

- scraping websites
- creating dataframes
- filtering data
- processing translations


In [1]:
import os
import datetime
import json
import bs4
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
!pip install deep-translator
from deep_translator import GoogleTranslator



# Initializing the scraping href

This is arguably the most important part of the entire project. Here we initialize the location our data comes from and the range of dates we would like to collect information for.

In [2]:
# Base href for web scraping
url_base = "https://gestion.pe"
# start and end date for web scraping
start_date = '2018-01-1'
end_date = '2022-05-1'


# Defining functions for the purpose of Web Scraping

In this section of the notebook we define a collection of functions that demonstrate our use case of web scraping with Beautiful Soup.

In [3]:
def dates(begin, end):
  """
  Description: 
  The dates function returns a list of date strings (format '%Y-%m-%d') for every day between the begin and end string parameters, inclusive

  Parameters:
  begin (str) - the first day in the list of dates format('%Y-%m-%d')
  end (str) - the last day in the list of dates format('%Y-%m-%d')
  """
  start = datetime.datetime.strptime(begin, '%Y-%m-%d')
  final = datetime.datetime.strptime(end, '%Y-%m-%d')
  step = datetime.timedelta(days=1)
  vector = []
  while(start <= final):
      vector.append(start.date())
      start += step
  new_vector = [date_obj.strftime('%Y-%m-%d') for date_obj in vector]
  return  new_vector
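
As a quick check of what `dates` produces, the sketch below rebuilds the same inclusive range with a list comprehension. `date_strings` is a hypothetical name used only for illustration and is not part of the notebook.

```python
import datetime

def date_strings(begin, end, fmt='%Y-%m-%d'):
    """Return every day from begin to end (inclusive) as a formatted string."""
    start = datetime.datetime.strptime(begin, fmt).date()
    final = datetime.datetime.strptime(end, fmt).date()
    n_days = (final - start).days
    # range(n_days + 1) makes the end date inclusive; a negative span yields []
    return [(start + datetime.timedelta(days=i)).strftime(fmt)
            for i in range(n_days + 1)]

print(date_strings('2018-01-30', '2018-02-02'))
# ['2018-01-30', '2018-01-31', '2018-02-01', '2018-02-02']
```

Note that an input such as '2018-01-1' is accepted by `strptime`, and the returned strings are always zero-padded.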

In [4]:
def catching_news_date(date):
  """
  Description: 
  The catching_news_date function returns a list of dictionaries, one for every news article published on the given date

  Parameters:
  date (str) - the date, formatted as '%Y-%m-%d', that the web scraper will collect news article titles from
  """
  # url_base is only read here, so no global declaration is needed
  url = f"{url_base}/archivo/todas/{date}/"
  try:
    html = urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')
    news = []
    for div in soup.find('div', {'role':'main'}).div.find_all('div', recursive=False):
        try:
            a = div.find('a', {'class': 'story-item__section text-sm text-black md:mb-15'})
            category = a.text
            title = div.h2.a.text
            href = div.h2.a['href']
            news.append({'categoria': category, 'titulo': title, 'href': href, 'fecha': date})
        except Exception:
            # skip blocks that are not article entries (missing link, heading, or href)
            continue
    return news
  except Exception as error:
    # network failures and unexpected page layouts both yield an empty result for this date
    print('error with', url, '-', error)
    return []
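
The imports cell brings in `Request`, but the scraper above calls `urlopen` on a bare URL. The sketch below shows one way `Request` could be used to attach headers to the archive URL. It assumes the site may want a browser-like User-Agent, which the notebook does not state; `build_archive_request` and the header value are hypothetical.

```python
from urllib.request import Request

url_base = "https://gestion.pe"

def build_archive_request(date, base=url_base):
    """Build (but do not send) the request for one day's archive page."""
    url = f"{base}/archivo/todas/{date}/"
    # Assumption: a browser-like User-Agent; the value is illustrative only
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (news-scraper sketch)"})

req = build_archive_request("2018-01-01")
print(req.full_url)
print(req.get_header("User-agent"))
```

The resulting object can be passed to `urlopen(req, timeout=30)` in place of the plain URL; the timeout value is likewise only a suggestion.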
  
  

# Conducting Web Scrape

With our functions defined, we can now extract the necessary information for every news article published between our specified start and end dates:
- category
- title
- date
- href

In [5]:
today = datetime.datetime.today().strftime('%Y-%m-%d')
castillo = []
dates_c = dates(start_date, end_date)
for date in dates_c:
    print("\r", f'scraping news for date {date}', end="")
    news_content = catching_news_date(date)
    castillo += news_content
df = pd.DataFrame(castillo)

 scraping news for date 2022-05-01
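
The loop above requests more than 1,500 archive pages back to back. A minimal sketch of a gentler variant follows: it pauses between requests and records dates that returned nothing so they can be retried later. `scrape_range`, `fake_fetch`, and the pause length are assumptions for illustration; in the notebook, `catching_news_date` would take the place of the stub.

```python
import time

def scrape_range(date_list, fetch, pause_seconds=1.0):
    """Collect rows for each date, pausing between requests."""
    rows, empty_dates = [], []
    for i, date in enumerate(date_list):
        day_rows = fetch(date)
        if day_rows:
            rows.extend(day_rows)
        else:
            empty_dates.append(date)  # candidates for a later retry
        if i < len(date_list) - 1:
            time.sleep(pause_seconds)
    return rows, empty_dates

# Stub standing in for catching_news_date so the sketch runs offline
def fake_fetch(date):
    return [] if date == '2018-01-02' else [{'titulo': 'ejemplo', 'fecha': date}]

rows, empty_dates = scrape_range(
    ['2018-01-01', '2018-01-02', '2018-01-03'], fake_fetch, pause_seconds=0)
print(len(rows), empty_dates)
# 2 ['2018-01-02']
```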

# Forming a database and filtering articles by category

Surprise: we actually created the pandas dataframe in the last cell, but that was just for safe coding purposes. We wouldn't want to forget to store the data in a more durable structure after all that web scraping.

In [6]:
# a quick look at the information
print('# of rows',df.shape[0])

df.head() 

# of rows 39502


Unnamed: 0,categoria,titulo,href,fecha
0,Mundo,"La vida ""verdaderamente dura"" de los habitante...",/mundo/vida-verdaderamente-dura-habitantes-teh...,2018-01-01
1,Mundo,"Ecuador: Ley para fortalecer dolarización, en ...",/mundo/ecuador-ley-fortalecer-dolarizacion-vig...,2018-01-01
2,EEUU,Trump amenaza con cortar ayuda a Pakistán en s...,/mundo/eeuu/trump-amenaza-cortar-ayuda-pakista...,2018-01-01
3,Internacional,Hasta los pingüinos se resguardan del frío ext...,/mundo/internacional/pingueinos-resguardan-fri...,2018-01-01
4,Tendencias,"""Los Últimos Jedi"" supera los US$ 1,000 millon...",/tendencias/ultimos-jedi-supera-us-1-000-millo...,2018-01-01


In [7]:
# displaying all of the collected categories
df['categoria'].unique()

array(['Mundo', 'EEUU', 'Internacional', 'Tendencias', 'Economía', 'Perú',
       'Política', 'Gestión TV', 'Fotogalerías', 'Mercados', 'Tu Dinero',
       'Empresas', 'Tecnología', 'Inmobiliarias', 'Management & Empleo',
       '', 'Estilos', 'Consultorio de Negocios', 'Opinión', 'Moda',
       'Publirreportaje', 'México', 'Finanzas Personales', 'Viajes',
       'España', 'Lujo', 'Trabajo en Acción', '20 en Empleabilidad',
       'Pregunta de hoy', 'Consultorio Tributario', 'CADE 2018',
       'Hablemos más Simple', 'Con Las Cuentas Claras',
       'Finanzas personales', 'Tecnología y Ciencia', 'Banca Lab',
       'Empresa', 'Videos'], dtype=object)

In [9]:
# applying a filter to the dataframe to select only the rows whose category is listed below
mask = ['Economía', 'Internacional', 'Empresas', 'Política', 'Mundo','Mercados']


# filtering data
df=df[df['categoria'].isin(mask)]
df=df.reset_index(drop=True)
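
To make the filtering step concrete, here is a small sketch on made-up rows (the `sample` frame is hypothetical; only the `mask` list comes from the notebook). It keeps the rows whose category appears in `mask` and renumbers the index from 0.

```python
import pandas as pd

sample = pd.DataFrame({
    'categoria': ['Mundo', 'Moda', 'Mercados', 'Viajes'],
    'titulo': ['a', 'b', 'c', 'd'],
})
mask = ['Economía', 'Internacional', 'Empresas', 'Política', 'Mundo', 'Mercados']

# keep rows whose category is in the mask, then renumber the rows from 0
filtered = sample[sample['categoria'].isin(mask)].reset_index(drop=True)
print(filtered)
```

Because `isin` matches exact strings, near-duplicate labels in the scraped data such as 'Empresa' (next to 'Empresas') are not selected by this mask.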

In [11]:
df.to_csv("../data/data_es.csv")

# Translating Articles into English

Because most libraries for topic modeling and sentiment analysis primarily support English, we are going to translate all of the titles. It should be noted that this could introduce a significant bias in the results, since we are performing natural language processing on translated text rather than in the original language.

In [12]:
# init progress start
count=0

def displayCount():
  """
  Description:
  displayCount is a simple progress counter that prints how many titles have been translated so far
  """
  global count
  count=count+1
  print("\r", 'translating title #',count, end='')

In [13]:
g=GoogleTranslator(source='auto', target='en')
def translateTitle(title):
  """
  Description:
  translateTitle uses an instance of Google Translate to turn Spanish titles into English titles

  Parameters:
  title (str) - a string of Spanish words, preferably :)
  """
  global g
  displayCount()
  return g.translate(title)

In [14]:
# now we make a new column and process the titles one at a time; other methods may be faster, but this one is guaranteed to work accurately
df['title']=df['titulo'].apply(translateTitle)

# a batch alternative
# df['title'] = g.translate_batch(batch=[text for text in df['titulo'].to_numpy()])

 translating title # 26898
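
Translating one title per call is slow for roughly 27,000 rows. If some titles repeat (an assumption; the notebook does not say whether they do), a small cache avoids translating the same string twice. The sketch below uses a stub translator so it runs offline; `make_cached_translator` and `fake_translate` are hypothetical names, and in the notebook `g.translate` would be the wrapped callable.

```python
def make_cached_translator(translate):
    """Wrap a translate(text) callable so repeated titles are translated once."""
    cache = {}

    def cached(text):
        if text not in cache:
            cache[text] = translate(text)
        return cache[text]

    return cached

# Stub translator so the sketch runs offline; it records every call it receives
calls = []
def fake_translate(text):
    calls.append(text)
    return text.upper()

translate_once = make_cached_translator(fake_translate)
titles = ['hola', 'mundo', 'hola']
results = [translate_once(t) for t in titles]
print(results, len(calls))
# ['HOLA', 'MUNDO', 'HOLA'] 2
```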

In [15]:
df

Unnamed: 0,categoria,titulo,href,fecha,title
0,Mundo,"La vida ""verdaderamente dura"" de los habitante...",/mundo/vida-verdaderamente-dura-habitantes-teh...,2018-01-01,"The ""truly hard"" life of the inhabitants of Te..."
1,Mundo,"Ecuador: Ley para fortalecer dolarización, en ...",/mundo/ecuador-ley-fortalecer-dolarizacion-vig...,2018-01-01,"Ecuador: Law to strengthen dollarization, in f..."
2,Internacional,Hasta los pingüinos se resguardan del frío ext...,/mundo/internacional/pingueinos-resguardan-fri...,2018-01-01,Even the penguins take shelter from the extrem...
3,Internacional,"Los 25 años de la ""separación de tercipelo"" de...",/mundo/internacional/25-anos-separacion-tercip...,2018-01-01,"The 25 years of the ""velvet separation"" of the..."
4,Economía,Midis:Pensión 65 llega al 99.9% de ejecución p...,/economia/midis-pension-65-llega-al-99-9-ejecu...,2018-01-01,Midis: Pension 65 reaches 99.9% of budget exec...
...,...,...,...,...,...
26893,Economía,"Se proyecta el ingreso a Tacna de 7,000 chilen...",/economia/se-proyecta-el-ingreso-a-tacna-de-70...,2022-05-01,"The entry to Tacna of 7,000 Chileans is projec..."
26894,Economía,Estos son las 15 ciudades que registraron una ...,/economia/estos-son-las-15-ciudades-que-regist...,2022-05-01,These are the 15 cities that registered an inf...
26895,Mundo,Gobierno boliviano decreta un incremento salar...,/mundo/gobierno-boliviano-decreta-un-increment...,2022-05-01,Bolivian government decrees a wage increase re...
26896,Mundo,Chile reabre mayoría de sus fronteras terrestr...,/mundo/chile-reabre-mayoria-de-sus-fronteras-t...,2022-05-01,Chile reopens most of its land borders to tour...


# Saving the dataframe for Analysis

Once the translation of all the titles is finished, everything needed is in place to save the data into the project repository and continue on to the analysis.

But do not leave before you save; that would suck.

In [17]:
df.to_csv("../data/data_en.csv")
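
The `Unnamed: 0` column shown in the tables above is what pandas produces when a CSV written with the default index is read back. The sketch below reproduces this on a one-row, made-up frame and shows the effect of `index=False`. Whether Analysis.ipynb expects the extra column is not known here, so treat this as an option rather than a required change.

```python
import io
import pandas as pd

sample = pd.DataFrame({'categoria': ['Mundo'], 'titulo': ['ejemplo']})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```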