# Data Cleaning and Text Preparation

In this notebook, we explore the collected data, handle missing values, and prepare the text.
To prepare the text, we remove HTML remnants, special characters, and punctuation. Then we remove stop words and lemmatize.

For lemmatization we use [spaCy](https://spacy.io/).

In [1]:
import pandas as pd
import re
import string
import spacy

We load the data and parse the publication date column as datetimes.

In [2]:
path = '/home/maggie/News_classifier/1.Data_Collection/'
df_news = pd.read_csv(path + 'news_data.csv', encoding='utf8')
# convert the whole column at once instead of element by element
df_news['Fecha'] = pd.to_datetime(df_news['Fecha'])

We explore the dataset, starting with its shape.

In [3]:
df_news.shape

(26058, 6)

A look at the first five rows of our data frame:

In [4]:
df_news.head()

Unnamed: 0,Título,Link,Descripcion,Fecha,Diario,Label
0,Vecinos de Cristina Kirchner colgaron una band...,https://www.clarin.com/politica/vecinos-cristi...,La pusieron en el piso de arriba de donde vive...,2021-11-09 13:41:23,Clarín,Política
1,"Tras un año de cortocircuitos, el Gobierno rea...",https://www.clarin.com/politica/ano-cortocircu...,Volvieron a mostrarse juntos en campaña. Inten...,2021-11-09 13:07:56,Clarín,Política
2,El domingo Javier Milei tendrá su búnker en el...,https://www.clarin.com/politica/domingo-javier...,El economista liberal de la Libertad Avanza es...,2021-11-09 12:33:57,Clarín,Política
3,Elecciones 2021: se podrá viajar gratis todo e...,https://www.clarin.com/politica/elecciones-202...,Lo confirmó el Ministerio de Transporte. En la...,2021-11-09 11:57:30,Clarín,Política
4,El drama de la inseguridad en la Provincia: se...,https://www.clarin.com/politica/drama-inseguri...,Son números oficiales de 2020. Se iniciaron 78...,2021-11-09 11:57:13,Clarín,Política


In [5]:
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26058 entries, 0 to 26057
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Título       26056 non-null  object        
 1   Link         26049 non-null  object        
 2   Descripcion  24184 non-null  object        
 3   Fecha        26058 non-null  datetime64[ns]
 4   Diario       26058 non-null  object        
 5   Label        26058 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 1.2+ MB


Some descriptions are missing. For those articles we work with the title only, so we replace the missing descriptions with empty strings.

In [6]:
df_news.Descripcion = df_news.Descripcion.fillna('')

A few news items also lack a title or a URL. We remove them from the data frame.

In [7]:
df_news=df_news.dropna().reset_index(drop=True)
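
The order of these two steps matters: filling the descriptions first means that `dropna` only removes rows that lack a title or a link. A minimal sketch on a toy frame (the data and the names `toy` and `toy_clean` are made up for illustration):

```python
import pandas as pd

# Toy frame (hypothetical data) mirroring three columns of df_news.
toy = pd.DataFrame({
    "Título": ["Title A", None, "Title C"],
    "Link": ["https://example.com/a", "https://example.com/b", None],
    "Descripcion": [None, "Some description", "Another description"],
})

# Step 1: a missing description becomes an empty string, so the row survives.
toy["Descripcion"] = toy["Descripcion"].fillna("")

# Step 2: rows that still contain a missing value (title or link) are dropped.
toy_clean = toy.dropna().reset_index(drop=True)

print(toy_clean)
```

Only the first row remains: it has a title and a link, and its missing description is now an empty string.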

We check that there are no duplicate rows.

In [8]:
df_news.duplicated().value_counts()


False    26049
dtype: int64

We create a new column, joining title and description.

In [9]:
df_news['Texto']=df_news['Título'] +'. '+ df_news['Descripcion']

### Text cleaning

By manual inspection, we see that every description from the newspaper "Perfil" ends with a "Leer más" link. For example:

In [10]:
df_news[df_news.Diario=='Perfil'].Descripcion.iloc[0]

'<p><img src="https://fotos.perfil.com/2021/11/09/trim/540/304/javier-milei-1266604.jpg" alt="Javier Milei" /></p>El candidato a diputado cruzó duramente al ministro de Seguridad a raíz del caso del kiosquero asesinado en Ramos Mejía. <a href="https://www.perfil.com/noticias/politica/milei-a-anibal-fernandez-por-que-sos-tan-h-de-p.phtml">Leer más</a>'

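We create a function that removes this trailing text. It expects lowercased text with single spaces between words, which is the form in which the cleaning function below passes it.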

In [11]:
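def remove_leer_mas(text):
    """Remove a trailing 'leer más' from lowercased, whitespace-normalized text."""
    # compare the last two words; texts with fewer words simply do not match
    if text.split()[-2:] == ['leer', 'más']:
        # drop the 9 characters of ' leer más'
        return text[:-9]
    return text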
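
As an alternative sketch, the same suffix can be removed with a regular expression, which avoids counting characters by hand. The name `strip_leer_mas` is hypothetical and not part of the notebook; the pattern requires whitespace (or the start of the string) before "leer", so a word such as "releer" is left alone.

```python
import re

# "leer más" at the very end, preceded by whitespace or the start of the string.
LEER_MAS_SUFFIX = re.compile(r"(?:^|\s+)leer más$")

def strip_leer_mas(text):
    """Hypothetical regex-based variant of remove_leer_mas."""
    return LEER_MAS_SUFFIX.sub("", text)

examples = [
    "el candidato cruzó al ministro leer más",
    "una nota sin el enlace final",
]
cleaned = [strip_leer_mas(t) for t in examples]
print(cleaned)
```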

Also, we create a function that removes HTML tags and some special entities that appear. 

In [12]:
def remove_html_tags(text):
    """Remove HTML tags and some special entity names from a string"""
    clean = re.compile(r'<.*?>')
    # some special entities, see https://www.htmlhelp.com/reference/html40/entities/special.html
    # to inspect where an entity occurs, run for example:
    # df_news[df_news['Texto'].str.contains('quot')]['Texto']
    pattern = re.compile("|".join(['&amp', 'ndash', 'mdash', 'lsquo', 'rsquo', 'ldquo', 'rdquo']))
    # delete the entity names first, then the tags
    text = pattern.sub('', text)
    return clean.sub('', text)
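
A different approach, shown here only as a sketch, is to decode entities with the standard library's `html.unescape` instead of deleting a fixed list of names. This is not what the notebook does: decoded quotes and dashes stay in the text until a later punctuation step removes them. The function name `strip_tags_and_entities` and the sample string are assumptions for illustration.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")

def strip_tags_and_entities(text):
    """Hypothetical alternative: drop tags first, then decode entities."""
    without_tags = TAG_PATTERN.sub("", text)
    # html.unescape handles named (&quot;) and numeric (&#225;) references
    return html.unescape(without_tags)

sample = '<p>Otra pelea m&#225;s &ndash; dijo &quot;no&quot;</p>'
result = strip_tags_and_entities(sample)
print(result)
```

Tags are removed before decoding so that an escaped `&lt;` in the text is not mistaken for the start of a tag.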

Some accented characters appear as numeric character references. For example:

In [13]:
df_news.Texto[298]

'Otra pelea m&#225;s del Gobierno con el peronismo de C&#243;rdoba. \n\n'

We create a dictionary that maps these references to the accented characters they stand for.

In [14]:
tildes = {"&#225;": 'á', "&#233;": 'é', "&#237;": 'í', "&#243;": 'ó', "&#250;": 'ú'}
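
Typing these codes by hand is error-prone. As a sketch, the same mapping can be derived from the characters themselves, since the number in each reference is the character's Unicode code point. The name `tildes_generated` is hypothetical.

```python
# The five accented lowercase vowels used in Spanish.
accented_vowels = "áéíóú"

# "&#<code point>;" -> character, built with ord() instead of typed by hand.
tildes_generated = {f"&#{ord(char)};": char for char in accented_vowels}

print(tildes_generated)
```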

Now, we create our text cleaning function.

In [15]:
def clean_text(text):
    # remove HTML tags
    text = remove_html_tags(text)
    # make text lowercase
    text = text.lower()
    # replace numeric character references with accented characters
    for key, value in tildes.items():
        text = text.replace(key, value)
    # replace punctuation with spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # remove control characters in the range \x7f-\x9f, see https://www.codetable.net/asciikeycodes
    text = re.sub(r'[\x7f-\x9f]', '', text)
    # replace the non-breaking space with a regular space
    text = re.sub(r'[\xa0]', ' ', text)
    # remove Latin-1 symbols such as ¡ and ¿
    text = re.sub(r'[\xa1-\xbf]', '', text)
    # remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    # remove some additional punctuation that was missed the first time around
    text = re.sub('[‘’“”…«»–]', '', text)
    # replace newlines with spaces
    text = re.sub('\n', ' ', text)
    # collapse repeated whitespace and strip leading and trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # remove "leer más" from the news from Perfil
    text = remove_leer_mas(text)
    return text
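
To make the effect of the individual substitutions concrete, here is a sketch that applies the main regular expressions from the function to one made-up sentence. The sample string is an assumption chosen to trigger each rule.

```python
import re
import string

sample = "¡Elecciones 2021! El covid19 sigue: 3er informe…"

text = sample.lower()
# ASCII punctuation becomes a space
text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
# Latin-1 symbols such as the inverted exclamation mark are removed
text = re.sub(r"[\xa1-\xbf]", "", text)
# any word containing a digit is removed entirely
text = re.sub(r"\w*\d\w*", "", text)
# typographic punctuation outside string.punctuation is removed
text = re.sub("[‘’“”…«»–]", "", text)
# whitespace runs collapse to a single space
text = re.sub(r"\s+", " ", text).strip()

print(text)
```

Note that "2021", "covid19", and "3er" disappear completely: the digit rule deletes the whole word, not only the digits.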

We create a new column with the clean text.

In [16]:
df_news['Texto_clean'] = df_news.Texto.apply(clean_text)

### Lemmatization and removal of stop words

We load a list of Spanish stop words from https://countwordsfree.com/stopwords/spanish, extended by stopwords.py. That script adds the lemmatized form of each stop word to the list.

In [17]:
# the context manager closes the file even if reading fails
with open("stopwords_spanish.txt", "r", encoding="utf8") as my_file:
    stopwords = my_file.read().split("\n")
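
Because every lemma is later checked against the stop words, a set can be a better container than a list: membership tests on a set do not scan the whole collection. A minimal sketch, using an in-memory stand-in for the file (the three-word content and the names `fake_file` and `stopword_set` are assumptions):

```python
import io

# Stand-in for stopwords_spanish.txt (hypothetical three-word content).
fake_file = io.StringIO("el\nla\nde\n")

# One stop word per line; blank lines are skipped.
stopword_set = {line.strip() for line in fake_file if line.strip()}

tokens = ["el", "virus", "de", "kerala"]
kept = [t for t in tokens if t not in stopword_set]
print(kept)
```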

We use spaCy for lemmatization. See the [Spanish models](https://spacy.io/models/es).

In [18]:
nlp = spacy.load('es_core_news_sm')

We previously found that lemmatization works poorly for some words, so we handle them by hand with two objects. Words in the list words_with_bad_lemmatization are kept exactly as they appear in the text. Words that are keys of the dictionary words_with_bad_lemmatization_2 are replaced by the corresponding values.

In [19]:
words_with_bad_lemmatization = ['lucas', 'coronavirus', 'efemérides', 'argentina', 'wanda',
                                'drive', 'core', 'sosa', 'matías']
words_with_bad_lemmatization_2 = {
    'argentino': 'argentina', 'argentinos': 'argentina',
    'ruso': 'rusia', 'rusa': 'rusia', 'rusos': 'rusia', 'rusas': 'rusia',
    'ucraniano': 'ucrania', 'ucraniana': 'ucrania',
    'ucranianas': 'ucrania', 'ucranianos': 'ucrania',
    'francés': 'francia',
    'actriz': 'actor', 'actrices': 'actor',
}

We iterate through every row in the data frame in order to lemmatize.

In [21]:
lemmatized_text_list = []
empty_rows = []

for row, texto_clean in df_news['Texto_clean'].items():
    # words kept for this article
    lemmatized_words = []
    for palabra in nlp(texto_clean):
        if palabra.text in words_with_bad_lemmatization:
            # keep the original word
            lemmatized_words.append(palabra.text)
        elif palabra.text in words_with_bad_lemmatization_2:
            # replace the word with its hand-picked form
            lemmatized_words.append(words_with_bad_lemmatization_2[palabra.text])
        else:
            # some lemmas have two words, e.g. "mostrarse" becomes "mostrar" and "él"
            lemmatized_words.extend(w for w in palabra.lemma_.split() if w not in stopwords)
    # join the list
    lemmatized_text = " ".join(lemmatized_words)
    if lemmatized_text == '':
        # very short articles can become empty after cleaning; mark them for removal
        empty_rows.append(row)
    else:
        lemmatized_text_list.append(lemmatized_text)

# drop the empty articles after the loop; the remaining rows keep their index labels
df_news.drop(index=empty_rows, inplace=True)
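
The per-token decision can be isolated in a small function and tried without loading a spaCy model. This is a sketch under stated assumptions: `normalize_token` is a hypothetical helper, the `(text, lemma)` pairs are made up rather than produced by spaCy, and an exception word is emitted exactly once regardless of how many words its lemma has.

```python
def normalize_token(text, lemma, stopwords, keep_as_is, replacements):
    """Return the words one token contributes (hypothetical helper)."""
    if text in keep_as_is:
        # badly lemmatized word: keep the surface form
        return [text]
    if text in replacements:
        # badly lemmatized word: use the hand-picked form
        return [replacements[text]]
    # regular case: keep each word of the lemma that is not a stop word
    return [w for w in lemma.split() if w not in stopwords]

# Made-up (text, lemma) pairs standing in for spaCy tokens.
fake_tokens = [
    ("mostrarse", "mostrar él"),
    ("coronavirus", "coronaviru"),
    ("rusos", "ruso"),
    ("el", "el"),
]
stopwords_demo = {"el", "él"}
keep_demo = ["coronavirus"]
replace_demo = {"rusos": "rusia"}

words = []
for text, lemma in fake_tokens:
    words.extend(normalize_token(text, lemma, stopwords_demo, keep_demo, replace_demo))

print(" ".join(words))
```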

We create a new column with the cleaned and lemmatized text, without stop words.

In [22]:
df_news['Texto_clean_lemmatized_and_stopwords']  = lemmatized_text_list



As an example, we look at one article in its original form, after text cleaning, and after lemmatization and stop word removal.

In [23]:
df_news.Texto.loc[1000]

'Por qué el virus Nipah puede convertirse en "la peor pandemia que la humanidad haya enfrentado". <p><img src="https://fotos.perfil.com/2021/09/13/trim/540/304/nipah-kerala-india-1230332.jpg" alt="nipah kerala india" /></p>Expertos creen que, debido a la alta letalidad que tiene el virus NiV, una nueva cepa "podría representar la peor pandemia" que haya vivido la humanidad. El estado indio de Kerala observa con atención el nuevo brote. <a href="https://www.perfil.com/noticias/ciencia/por-que-el-virus-nipah-puede-convertirse-en-la-peor-pandemia-que-la-humanidad-haya-enfrentado.phtml">Leer más</a>'

In [24]:
df_news.Texto_clean.loc[1000]

'por qué el virus nipah puede convertirse en la peor pandemia que la humanidad haya enfrentado expertos creen que debido a la alta letalidad que tiene el virus niv una nueva cepa podría representar la peor pandemia que haya vivido la humanidad el estado indio de kerala observa con atención el nuevo brote'

In [25]:
df_news.Texto_clean_lemmatized_and_stopwords.loc[1000]

'virus nipah convertir peor pandemia humanidad enfrentar experto alto letalidad virus niv cepa representar peor pandemia vivir humanidad indio kerala observar atención brote'

We save the modified data in a new CSV file.

In [26]:
df_news.to_csv('clean_data.csv',index=False)  