# World Cup Qatar 2022 Spanish News

Data Sources: <br>
https://as.com/  <br>
https://www.europapress.es/  <br>
https://www.elmundo.es/deportes/futbol/  <br>
https://elpais.com/  <br>

### Import libraries

In [1]:
import pandas as pd
import re
import os
import demoji
import string

### Import raw data to use as our custom knowledge base

For this demo we'll use news about the World Cup 22, because ChatGPT thinks this tournament hasn't taken place yet.

In [2]:
getwc = os.getcwd()
rawdata_directory = #'/path/to/folder'
all_files = os.listdir(rawdata_directory)
csv_files = [f for f in all_files if f.endswith('.csv')]

In [3]:
csv_files

['elpais.csv', 'as-futbol.csv', 'elmundo.csv', 'europapress.csv']

In [4]:
df = pd.concat((pd.read_csv(getwc + '/Data_Raw/' + f) for f in csv_files), ignore_index=True)
df.head()

Unnamed: 0,web-scraper-order,web-scraper-start-url,noticia,noticia-href,texto
0,1680120613-1805,https://elpais.com/buscador/?q=mundial%2B2022%...,Peligro al volante,https://elpais.com/opinion/2022-11-19/peligro-...,Cartas a la DirectoraiOpinión de un lector sob...
1,1680120617-1806,https://elpais.com/buscador/?q=mundial%2B2022%...,"Livakovic, el iluminado que no paraba penaltis",https://elpais.com/deportes/mundial-futbol/202...,Argentina\n ARG\n ...
2,1680120621-1807,https://elpais.com/buscador/?q=mundial%2B2022%...,El Athletic gana peso en la selección,https://elpais.com/deportes/2022-09-28/el-athl...,liga de las nacionesEl Athletic gana peso en l...
3,1680120624-1808,https://elpais.com/buscador/?q=mundial%2B2022%...,Las citas clave que necesitas saber hoy,https://cincodias.elpais.com/economia/2023-03-...,La Agenda de Cinco DíasLas citas clave que nec...
4,1680120627-1809,https://elpais.com/buscador/?q=mundial%2B2022%...,La CNMC investiga a RTVE por la posible venta ...,https://elpais.com/television/2022-10-04/la-cn...,RTVELa CNMC investiga a RTVE por la posible ve...


In [5]:
df = df.drop(columns = ['noticia', 'web-scraper-order', 'web-scraper-start-url', 'noticia-href'])
df.head()

Unnamed: 0,texto
0,Cartas a la DirectoraiOpinión de un lector sob...
1,Argentina\n ARG\n ...
2,liga de las nacionesEl Athletic gana peso en l...
3,La Agenda de Cinco DíasLas citas clave que nec...
4,RTVELa CNMC investiga a RTVE por la posible ve...


### Text cleaning

Cleaning the text before feeding it to ChatGPT is important for several reasons: <br>

- Consistency and uniformity across the entire dataset so the model learns from consistent patterns in the data, leading to better performance.<br>

- Noise Reduction: many of the news scraped contains a lot of noise such as HTML tags, special characters, and emojis. Removing this noise, could make it easier for the model to learn the underlying patterns in the data.<br>

- Efficiency: save the computational resources by reducing the amount of data that the model has to process.<br>

- Saving in cost of credits in our OpenAI API account, $ 0.02 for every 1,000 tokens. This experiment is only to test how to use custom knowledge to train ChatGPT.

In [6]:
# Drop rows where the text of the news has null values
df = df.dropna(subset = ['texto']).reset_index(drop = True)

In [7]:
def count_characters(df, column_name):
    
    # Create a new column that contains the length of each string
    # It'll be used to compare if there where changes to the original text
    df['character_count'] = df[column_name].str.len()
    
    return df

In [8]:
df = count_characters(df, 'texto')
df.head()

Unnamed: 0,texto,character_count
0,Cartas a la DirectoraiOpinión de un lector sob...,6089
1,Argentina\n ARG\n ...,9337
2,liga de las nacionesEl Athletic gana peso en l...,8425
3,La Agenda de Cinco DíasLas citas clave que nec...,5982
4,RTVELa CNMC investiga a RTVE por la posible ve...,9939


In [9]:
char_bf_transf = df['character_count'].sum()
char_bf_transf

7705546

There were news related to other tournaments/subjects that were not properly filterd by the search engine while filtering by "Qatar 2022". Let's remove rows based on wether a certain word or phrase (such as other torunaments)  appear in the text of the news.

In [10]:
words_to_filter_out = ['LaLiga', 'Santander', 'Champions', 'Copa del Rey', 'elecciones', 'Seguridad Social', 'covid' \
                       'ENERGÍA', 'EMPLEO', 'TELECOMUNICACIONES', 'IRPF', 'MOTOGP', 'BANCO MUNDIAL', 'DEPARTAMENTO DE ESTADO', \
                       'RESERVA FEDERAL', 'Coyuntura económica', 'Inditex', 'MÁLAGA', 'PIB', 'Pymes', 'DESAHUCIOS', 'CONCIERTOS', 'RESIDENCIAS ANCIANOS', 'AGRESIONES SEXUALES'] 

In [11]:
# Filter out rows that contain any word in the list
df = df[~df['texto'].str.contains('|'.join(words_to_filter_out))].reset_index(drop = True)
df

Unnamed: 0,texto,character_count
0,Cartas a la DirectoraiOpinión de un lector sob...,6089
1,liga de las nacionesEl Athletic gana peso en l...,8425
2,Cartas a la DirectoraiOpinión de un lector sob...,6278
3,Liga italianaLa Serie A más insólitaEl campeon...,7064
4,FÚTBOLEl alcalde de Valencia rechaza instalar ...,2670
...,...,...
688,0 seconds of 0 secondsVolume 90%Press shift qu...,6825
689,Europa Press Internacional\n\n\nPublicado: jue...,4035
690,Archivo - Banderas del Mundial de fútbol en un...,4977
691,"El delantero portugués Cristiano Ronaldo, en e...",6038


In [12]:
def clean_text(text):
    
    # Remove HTML tags
    clean = re.compile('<.*?>')
    text = re.sub(clean, '', text)
    
    # Remove text after 'Tags' (links & sections at the bottom of the website)
    text = text.split('Tags', 1)[0]
    
    # Remove text after 'Lee también' (recommendations to read other articles)
    text = text.split('Lee también', 1)[0]
    
    # Remove text after 'Al minuto', idem
    text = text.split('Al minuto', 1)[0]
    
    # Remove text before 'protecciondedatos.es'
    try:
        text = text.split('protecciondedatos.es', 1)[1]
        
    except:
        text

    # Remove users comments
    try:
        text = text.split('CESTWhatsappFacebookTwitterCopiar', 1)[0]
        
    except:
        text
    
    # Remove users @username
    text = re.sub(r'@\w+', '', text)
    
    # Remove text between curly braces
    text = re.sub(r'\{.*?\}', '', text)
    
    # Remove line breaks
    text = text.replace('\n', '').replace('\r', '')
    
    # Remove emojis
    text = demoji.replace(text, '')
    
    # Remove URLs
    text = re.sub(r'https\S+', '', text, flags = re.MULTILINE)
    
    # Remove text between breaklines
    text = re.sub('\n.*?\n', '\n', text, flags = re.DOTALL)
    
    # Remove timestamp from news
    pattern = r'(Actualizado.*|Publicado:.* ?-? ?\d+:\d+)'
    text = re.sub(pattern, '', text)
    
    # Remove the text to share the news on social media
    text = re.sub(r'Compartir en FacebookCompartir en TwitterEnviar por emailVer \d+ comentarios', '', text)
    
    # Remove other irrelevant text
    text = re.sub('AGENDA DEL DÍA', '', text)
    text = re.sub(r'Opinión[\w\s]+\.', '', text)
    text = re.sub('EUROPA PRESS', '', text, flags = re.I)
    text = re.sub('Archivo - ', '', text)
    text = re.sub('Image - ', '', text)
    text = re.sub('Redacción Barcelona', '', text)
    text = re.sub(' Aquí tienes[\w\s]+ selección \w+', '', text)
    text = re.sub(': [\w\s]+ del \w+ \d*', '', text)
    text = re.sub('Mundial 2022 Qatar', 'Mundial 2022 ', text)
    text = re.sub('once, estrella, convocatoria y calendario de partidos', '', text)
    text = re.sub('Enviado especial \w+', '', text)
    text = re.sub('Doha', '', text)
    
    # Remove more than two spaces
    text = re.sub(r'\s{1,}', ' ', text)
    
    return text

In [13]:
df['news'] = df['texto'].apply(clean_text).str.strip()

In [14]:
# Drop rows based on length of the string in 'text' column, if the news is short, then we drop it
df = df[df['news'].str.len() >= 80].reset_index(drop = True)

In [15]:
df = count_characters(df, 'news')
char_af_transf = df['character_count'].sum()
char_af_transf

1475747

In [16]:
df = df[['news']]
df.head()

Unnamed: 0,news
0,Cartas a la Directorai Dirigidas al director d...
1,liga de las nacionesEl Athletic gana peso en l...
2,Cartas a la Directorai Dirigidas al director d...
3,Liga italianaLa Serie A más insólitaEl campeon...
4,FÚTBOLEl alcalde de Valencia rechaza instalar ...


In [17]:
df.to_csv('train_data.txt', index = False)