# Exploring News Articles from Página 12: An NLP Project

In this NLP project I scrape news articles from the Argentine digital newspaper Página 12 using Python's Requests, BeautifulSoup, and pandas packages. These tools let me extract the relevant fields from each page and arrange them in a tabular format ready for analysis and modeling.

### Package imports

In [29]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime


In [30]:
today = datetime.date.today()

The URL of Página 12 is https://www.pagina12.com.ar

In [31]:
url = 'https://www.pagina12.com.ar/'
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}


Two helper functions, `obtener_sopa` and `crear_linkp12`, handle downloading and link building.

The `obtener_sopa` function takes two inputs, `link` and `headers`. It uses the Requests package to send a GET request to the given link with the given headers, then asserts that the response status code is 200 (success); any other status code raises an `AssertionError`. Finally, it returns the response body parsed as a `BeautifulSoup` object, which is used to extract information from HTML or XML documents.

The `crear_linkp12` function takes one input, `link`, and returns a full URL by prepending `'https://www.pagina12.com.ar'` to it.

In [32]:
def obtener_sopa(link, headers=headers):
    # Download the page and parse it; stop on any non-200 response.
    req = requests.get(link, headers=headers)
    assert req.status_code == 200, 'Status code is not 200, something failed'
    return BeautifulSoup(req.text, 'lxml')


def crear_linkp12(link):
    # Prepend the site root to a relative path.
    return 'https://www.pagina12.com.ar' + link
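
Whether the `href` values found later are relative paths or full URLs depends on the page, and plain concatenation only works for relative ones. A more defensive variant could use `urllib.parse.urljoin`, which handles both cases. This is a minimal sketch; the names `crear_link_absoluto` and `BASE_URL` are hypothetical and not part of the original notebook:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.pagina12.com.ar'  # same site root used by crear_linkp12


def crear_link_absoluto(link, base=BASE_URL):
    # Hypothetical helper: urljoin leaves absolute URLs untouched and
    # resolves relative paths against the site root.
    return urljoin(base + '/', link)


print(crear_link_absoluto('/secciones/economia'))
print(crear_link_absoluto('https://www.pagina12.com.ar/secciones/economia'))
```

Both calls print the same absolute URL, so the helper is safe to apply to links of either kind.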

In [33]:
sopa = obtener_sopa(url)

### Key information extraction

In [34]:
# find returns the first element that matches the given tag
lista_secciones = sopa.find('div', attrs={'class':'p12-dropdown-content'}).find_all('a', attrs={'class': 'p12-dropdown-item'})

These are the sections the newspaper offers:

In [35]:
lista_secciones

[<a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/el-pais">El país</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/economia">Economía</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/sociedad">Sociedad</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/suplementos/cultura-y-espectaculos">Espectáculos</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/deportes">Deportes</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/ciencia">Ciencia</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/el-mundo">El mundo</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/edicion-impresa">Edición impresa</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/universidad-diario">Universidad</a>,
 <a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/ajedrez">Aj

Not all sections interest me: some are very specific and have no counterpart in other newspapers, so I keep only the ones listed below.
The user then picks one section from the list:



['El país', 'Economía', 'Sociedad', 'Espectáculos', 'Deportes', 'Ciencia', 'Cultura', 'Turismo']

In [36]:
seccion_deseada = input("Enter one of the following sections: Cultura (Culture), Espectáculos (Entertainment), Ciencia (Science), El país (The country), Sociedad (Society), Deportes (Sports), El mundo (The World) or Economía (Economy): ")
pags = int(input('Enter the number of pages you want:   '))
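
Because the section name is typed by hand, a small check before scraping can catch typos early. The sketch below assumes the eight section names offered in the prompt; `validar_seccion` and `SECCIONES_VALIDAS` are hypothetical names, not part of the original notebook:

```python
SECCIONES_VALIDAS = ['Cultura', 'Espectáculos', 'Ciencia', 'El país',
                     'Sociedad', 'Deportes', 'El mundo', 'Economía']


def validar_seccion(texto, validas=SECCIONES_VALIDAS):
    # Compare case-insensitively and ignore surrounding whitespace,
    # then return the spelling used in the list of valid sections.
    limpio = texto.strip().casefold()
    for nombre in validas:
        if nombre.casefold() == limpio:
            return nombre
    raise ValueError(f'Unknown section: {texto!r}')


print(validar_seccion('  economía '))
```

Returning the canonical spelling matters here because the later filter compares the user's choice against the exact text of each menu link.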

In [37]:
secciones_deseadas = [seccion_deseada]


In [38]:
lista_secciones_f = [seccion for seccion in lista_secciones if seccion.get_text() in secciones_deseadas]
lista_secciones_f

[<a class="p12-dropdown-item" href="https://www.pagina12.com.ar/secciones/economia">Economía</a>]

I build a list of dictionaries, one per section page. Each dictionary holds the section name and its link; the parsed page and its articles are added in later steps.

In [39]:
l = []

In [40]:
links_secciones = [seccion.get('href') for seccion in lista_secciones_f]
text_secciones = [seccion.get_text() for seccion in lista_secciones_f]

In [41]:
links_secciones

['https://www.pagina12.com.ar/secciones/economia']

The function `generar_paginacion_links` takes two inputs: the number of pages to generate (`cant_paginas`) and a list of section links (`lista_secc`). For each link it builds `cant_paginas` new links by appending `?page=N`, except for the 'Espectáculos' section, which is browsed by date and therefore gets one link per previous day in `DD-MM-YYYY` format. The function returns the generated list of links.

The function `generar_paginacion_nombres` takes the same two inputs, where `lista_secc` is now a list of section names. It repeats each name `cant_paginas` times, so that the returned names line up one-to-one with the generated links.

In [42]:
def generar_paginacion_links(cant_paginas, lista_secc):
    nuevos_links = []
    for i in lista_secc:
        if 'espectaculos' not in i:
            # Regular sections paginate with ?page=1, ?page=2, ...
            for pag in range(cant_paginas):
                nuevos_links.append(i + '?page=' + str(pag + 1))
        else:
            # Espectáculos is browsed by date: one link per previous day (DD-MM-YYYY).
            for pag in range(cant_paginas):
                day = today - datetime.timedelta(days=pag + 1)
                nuevos_links.append(i + '/' + day.strftime('%d-%m-%Y'))
    return nuevos_links


def generar_paginacion_nombres(cant_paginas, lista_secc):
    nuevos_nombres = []
    # Iterate over the argument rather than the global text_secciones.
    for i in lista_secc:
        for pag in range(cant_paginas):
            nuevos_nombres.append(i)
    return nuevos_nombres
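
To make the two URL patterns concrete, here is a small self-contained sketch that builds both kinds of link for a fixed reference date. The function name `links_paginados` and the explicit `hoy` parameter are assumptions introduced for illustration; the notebook itself relies on the global `today`:

```python
import datetime


def links_paginados(link, cant_paginas, hoy):
    # Date-based sections get one link per previous day (DD-MM-YYYY);
    # every other section gets ?page=1 ... ?page=cant_paginas.
    links = []
    for pag in range(1, cant_paginas + 1):
        if 'espectaculos' in link:
            dia = hoy - datetime.timedelta(days=pag)
            links.append(f"{link}/{dia.strftime('%d-%m-%Y')}")
        else:
            links.append(f'{link}?page={pag}')
    return links


hoy = datetime.date(2023, 2, 3)  # fixed date so the output is reproducible
print(links_paginados('https://www.pagina12.com.ar/secciones/economia', 2, hoy))
print(links_paginados('https://www.pagina12.com.ar/suplementos/cultura-y-espectaculos', 2, hoy))
```

Passing the date as an argument instead of reading a global makes the function easy to check against a known day.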

In [43]:
links_secciones.extend(generar_paginacion_links(pags, links_secciones))

In [44]:
text_secciones.extend(generar_paginacion_nombres(pags, text_secciones))

In [45]:
for link, text in zip(links_secciones, text_secciones):
  p12_dicc = {}
  p12_dicc['id_seccion'] = text
  p12_dicc['link_seccion'] = link
  l.append(p12_dicc)

In [46]:
for i in l:
  i['sopa_pagina'] = obtener_sopa(i['link_seccion'])

For every section page collected above, we gather the articles listed on it and download the HTML of each one.

In [47]:
for seccion in l:
    # Turismo and Espectáculos are matched with a different container class.
    if seccion['id_seccion'] not in ['Turismo', 'Espectáculos']:
        notas = seccion['sopa_pagina'].find_all('div', attrs={'class': 'article-item__content'})
    else:
        notas = seccion['sopa_pagina'].find_all('div', attrs={'class': 'article-box__container'})
    seccion['notas'] = []
    for i in notas:
        # Build the full link once and reuse it for the download.
        link = crear_linkp12(i.a.get('href'))
        seccion['notas'].append({'link': link, 'sopa': obtener_sopa(link)})


For each downloaded article, we now extract the title, the subtitle, and the first paragraph.

In [48]:
def get_data_from_note(nota, texto):
    # Extract all fields first so a missing tag never leaves `nota` half-filled.
    primer_parrafo = texto.find('div', {'class': 'article-main-content article-text'}).p.text
    titulo = texto.find('h1').text
    subtitulo = texto.find('h3').text
    nota.update(primer_parrafo=primer_parrafo, titulo=titulo, subtitulo=subtitulo)


for seccion in l:
    for nota in seccion.get('notas', []):
        try:
            get_data_from_note(nota, nota['sopa'])
        except AttributeError:
            # A tag was missing (find returned None); skip this article.
            continue
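
`find` returns `None` when a tag is missing, which is why the extraction step can raise `AttributeError`. An alternative, sketched below, is to keep articles that lack a field and store `None` for it instead of skipping them. `texto_o_none` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a BeautifulSoup tag so the snippet runs without network access:

```python
from types import SimpleNamespace


def texto_o_none(tag):
    # Return the stripped text of a tag, or None when find() found nothing.
    return tag.text.strip() if tag is not None else None


tag_falso = SimpleNamespace(text='  Emitirán un billete de 2000 pesos  ')
print(texto_o_none(tag_falso))
print(texto_o_none(None))
```

Which behavior is preferable depends on whether incomplete articles are still useful for the downstream analysis.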

  

In [49]:
df = pd.DataFrame()
campos = ('titulo', 'subtitulo', 'primer_parrafo')
for seccion in l:
    if 'notas' in seccion:
        df1 = pd.DataFrame()
        for nota in seccion['notas']:
            # Skip articles for which any field could not be extracted.
            if not all(campo in nota for campo in campos):
                continue
            dicc = {'titulo': [nota['titulo']],
                    'subtitulo': [nota['subtitulo']],
                    'primer_parrafo': [nota['primer_parrafo']],
                    'seccion': [seccion['id_seccion']]}
            df1 = pd.concat([df1, pd.DataFrame(dicc)])
        df = pd.concat([df, df1])
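
Concatenating one-row DataFrames inside a loop copies the accumulated data on every iteration. A common alternative is to collect plain dictionaries and build the DataFrame once at the end. The sketch below shows only the row-collection step, on made-up sample data; `filas_desde_secciones` is a hypothetical name:

```python
def filas_desde_secciones(secciones):
    # Flatten the nested structure into one dict per fully parsed article.
    campos = ('titulo', 'subtitulo', 'primer_parrafo')
    filas = []
    for seccion in secciones:
        for nota in seccion.get('notas', []):
            if all(campo in nota for campo in campos):
                fila = {campo: nota[campo] for campo in campos}
                fila['seccion'] = seccion['id_seccion']
                filas.append(fila)
    return filas


# Made-up sample mirroring the structure of `l`; the second note is incomplete.
muestra = [{'id_seccion': 'Economía',
            'notas': [{'titulo': 'T', 'subtitulo': 'S', 'primer_parrafo': 'P'},
                      {'link': 'https://example.com/nota'}]}]
print(filas_desde_secciones(muestra))
```

Passing the result to `pd.DataFrame(filas)` would then give the same four columns, with a regular 0 to n-1 index instead of repeated zeros.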
  


This is a preview of the data downloaded:

In [50]:
df.head()

| | titulo | subtitulo | primer_parrafo | seccion |
|---|---|---|---|---|
| 0 | 🔴 En vivo. Dólar blue y dólar hoy: todas las c... | El precio de compra y venta de la divisa | El dólar blue cotiza a $379 para la venta y $3... | Economía |
| 0 | Presentan la segunda etapa de Precios Justos | Incluye nuevos productos, la canasta escolar y... | El ministro de Economía, Sergio Massa, anunció... | Economía |
| 0 | Dólar blue hoy, dólar hoy: a cuánto cotizan el... | El dólar blue sube 3 pesos | La brecha entre el oficial y el dólar blue lle... | Economía |
| 0 | Emitirán un billete de 2000 pesos | "Mejorará el funcionamiento de cajeros automát... | El Directorio del Banco Central de la Repúblic... | Economía |
| 0 | Mal inicio de mes para las reservas | El Banco Central debió vender u$s 56 millones.... | El dólar oficial cerró con una cotización prom... | Economía |


In [51]:
df.shape

(231, 4)

In [52]:
df.seccion.value_counts()

Economía    231
Name: seccion, dtype: int64

In [53]:
date = today.year * 10000 + today.month * 100 + today.day

In [54]:
path_save = './data/'

In [55]:
seccion_deseada_ = seccion_deseada.replace(" ", "")

I save the DataFrame as a CSV file so I can continue the NLP work in another notebook.

In [56]:
df.to_csv(path_save + f'bajada_p12_{date}-{seccion_deseada_}.csv',
          index = False)
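
`to_csv` raises an error when the target folder does not exist, and the notebook never creates `./data/`, so it is assumed to be there already. A small sketch of building the output path with `pathlib` and `strftime`, which gives the same `YYYYMMDD` stamp as the arithmetic used for `date`; `ruta_salida` is a hypothetical helper:

```python
import datetime
import tempfile
from pathlib import Path


def ruta_salida(seccion, fecha, carpeta='./data'):
    # Create the folder if needed and return the CSV path for this run.
    carpeta = Path(carpeta)
    carpeta.mkdir(parents=True, exist_ok=True)
    sello = fecha.strftime('%Y%m%d')  # same digits as year*10000 + month*100 + day
    return carpeta / f"bajada_p12_{sello}-{seccion.replace(' ', '')}.csv"


# Demo in a temporary folder so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    print(ruta_salida('El país', datetime.date(2023, 2, 3), carpeta=tmp).name)
```

The returned path can be passed directly to `df.to_csv(..., index=False)`.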