# Text Preprocessing

In the previous notebook (`0-Exploratory-Data-Analysis`), we conducted a brief exploratory data analysis. Here we'll use Natural Language Processing (NLP) techniques to build the text preprocessing pipeline that prepares our data as input for Machine Learning models.

### Importing Libraries

In [1]:
import zipfile
import numpy as np
import pandas as pd

import re
import nltk
import unidecode

### Loading Data

In [2]:
# Reading the zipfile containing the datasets
zf = zipfile.ZipFile("data/olist-datasets.zip")

# Loading the order reviews dataset
reviews_df = pd.read_csv(zf.open("olist_order_reviews_dataset.csv"),
                         parse_dates=['review_creation_date', 
                                      'review_answer_timestamp'])

# Removing Orders without a Review
reviews_df = (reviews_df[['review_comment_message', 'review_score']]
              .dropna(subset=["review_comment_message"])).reset_index(drop=True)
print(f"> We have {len(reviews_df)} written reviews.")
reviews_df.head()

> We have 40977 written reviews.


Unnamed: 0,review_comment_message,review_score
0,Recebi bem antes do prazo estipulado.,5
1,Parabéns lojas lannister adorei comprar pela I...,5
2,aparelho eficiente. no site a marca do aparelh...,4
3,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",4
4,"Vendedor confiável, produto ok e entrega antes...",5


## 1. Removing New Lines
___

Some reviews contain line-break characters (`\r`, `\n`, or the pair `\r\n`), which carry no meaning for the text. Let's write a simple regex that replaces each of them with a space:

In [3]:
def remove_newline(text):
    """ Removes new line symbols
    
    Args:
        text (str): original text
        
    Returns:
        (str): text without new line symbols
    """
    
    # Inside [...], "|" would be a literal pipe, so only the two characters are listed
    regex_pattern = r"[\r\n]"
    return re.sub(regex_pattern, " ", text)

- **Example:**

In [4]:
example_text = reviews_df.iloc[3]['review_comment_message']
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(remove_newline(example_text))

> Text before preprocessing:


'Mas um pouco ,travando...pelo valor ta Boa.\r\n'


> Text after preprocessing:


'Mas um pouco ,travando...pelo valor ta Boa.  '
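
A detail worth knowing when writing this kind of pattern: inside a character class, `|` is not an alternation operator but a literal pipe, so listing `\r` and `\n` on their own is enough. The sketch below uses a hypothetical helper, `strip_line_breaks` (not part of this notebook's pipeline), to show that pipes in the text are left alone while each line-break character becomes a space.

```python
import re

# Hypothetical helper for illustration; not part of the notebook's pipeline.
LINE_BREAKS = re.compile(r"[\r\n]")

def strip_line_breaks(text):
    """Replace every carriage return or line feed with a single space."""
    return LINE_BREAKS.sub(" ", text)

sample = "bom | barato\r\nrecomendo"
print(repr(strip_line_breaks(sample)))  # the pipe is kept, "\r\n" becomes two spaces
```

Because `\r\n` is two characters, it turns into two spaces; a later whitespace-collapsing step takes care of that.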

## 2. Replacing Dates
___

Reviews often contain dates (a customer may write the specific date on which they ordered or bought the product), and sometimes just a month or a day of the week. Let's create regexes that replace:
- dates with the word `data`;
- months with the word `mes`;
- days of the week with the word `diasemana`.


In [5]:
def replace_dates(text):
    """ Replaces dates, months and days of the week with keywords
    
    Args:
        text (str): original text
        
    Returns:
        (str): preprocessed text    
    """
    
    # Digits separated by "/" or "." (an unescaped "." outside [...] would match any character)
    date_pattern = r"\d+[/.]\d+[/.]\d+"
    new_text = re.sub(date_pattern, " data ", text)
    
    # \b restricts matches to whole words, so "maio" no longer matches inside "maior"
    month_pattern = r"\b(?:janeiro|fevereiro|março|abril|maio|junho|julho|agosto|setembro|outubro|novembro|dezembro)\b"
    new_text = re.sub(month_pattern, " mes ", new_text)
    
    day_pattern = r"\b(?:segunda|terça|quarta|quinta|sexta|sabado|sábado|domingo)\b"
    new_text = re.sub(day_pattern, " diasemana ", new_text)
    
    return new_text

- **Example:**

In [6]:
example_text = "fiz uma compra no dia 01/01/22, em janeiro, numa sexta"
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(replace_dates(example_text))

> Text before preprocessing:


'fiz uma compra no dia 01/01/22, em janeiro, numa sexta'


> Text after preprocessing:


'fiz uma compra no dia  data , em  mes , numa  diasemana '
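
Plain alternation matches anywhere in the text, including inside longer words: `maio` also occurs in `maior`. As a sketch (the helper name `replace_months_whole_word` is made up for this illustration), wrapping the alternatives in `\b` word boundaries restricts the match to whole words.

```python
import re

MONTHS = ("janeiro|fevereiro|março|abril|maio|junho|julho|"
          "agosto|setembro|outubro|novembro|dezembro")

# \b asserts a word boundary on both sides of the month name.
MONTH_WHOLE_WORD = re.compile(rf"\b(?:{MONTHS})\b")

def replace_months_whole_word(text):
    """Replace month names with ' mes ' only when they appear as whole words."""
    return MONTH_WHOLE_WORD.sub(" mes ", text)

print(repr(re.sub(MONTHS, " mes ", "o maior problema")))   # unbounded: hits "maio" in "maior"
print(repr(replace_months_whole_word("o maior problema")))  # bounded: text is unchanged
print(repr(replace_months_whole_word("comprei em maio")))   # bounded: whole word is replaced
```

The same idea applies to the day-of-week list.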

## 3. Replacing Numbers
___

We'll replace numbers with the word `numero`.

In [7]:
def replace_numbers(text):
    """ Replaces numbers with the keyword 'numero'
    
    Args:
        text (str): original text
        
    Returns:
        (str): preprocessed text    
    """
    
    return re.sub("[0-9]+", " numero ", text)

- **Example:**

In [8]:
example_text = "gastei 1500 reais neste celular"
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(replace_numbers(example_text))

> Text before preprocessing:


'gastei 1500 reais neste celular'


> Text after preprocessing:


'gastei  numero  reais neste celular'
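
Prices and measurements often include a decimal or thousands separator, in which case `[0-9]+` produces one `numero` per digit group. A possible variant, shown here as a hypothetical `replace_numbers_with_separators`, treats digit groups joined by `.` or `,` as a single number. Whether that is preferable depends on the downstream model, so treat it as a sketch rather than a recommendation.

```python
import re

# Digits, optionally followed by more digit groups joined by "." or ",".
NUMBER_WITH_SEPARATORS = re.compile(r"\d+(?:[.,]\d+)*")

def replace_numbers_with_separators(text):
    """Replace each number, including its separators, with ' numero '."""
    return NUMBER_WITH_SEPARATORS.sub(" numero ", text)

price = "paguei 19,90 reais"
print(repr(re.sub("[0-9]+", " numero ", price)))     # two keywords with a comma in between
print(repr(replace_numbers_with_separators(price)))  # a single keyword
```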

## 4. Replacing Negation Words 
___

In Portuguese, the word `não` expresses negation. Many libraries list it as a stopword, but removing it can invert the meaning of a text (for example, `não gostei` would become `gostei`).

So let's replace `não`, along with its informal spellings `ñ` and `nao`, with the keyword `negação`.

In [9]:
def replace_negation_words(text):
    """ Replaces negation words with the keyword 'negação'
    
    Args:
        text (str): original text
    
    Returns:
        (str): preprocessed text
    """
    
    # \b keeps the pattern from firing inside longer words (e.g. "senão")
    return re.sub(r"\b(?:não|nao|ñ)\b", " negação ", text)

- **Example:**

In [10]:
example_text = "não gostei do notebook que comprei nessa loja"
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(replace_negation_words(example_text))

> Text before preprocessing:


'não gostei do notebook que comprei nessa loja'


> Text after preprocessing:


' negação  gostei do notebook que comprei nessa loja'
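
To make the risk concrete, here is a small sketch using a tiny hand-picked stopword set and a whitespace split. Both are assumptions for illustration, not the `nltk` list or tokenizer, and `drop_stopwords` is a made-up helper name.

```python
# Tiny hand-picked stopword set, for illustration only (not the nltk list).
toy_stopwords = {"não", "do", "que"}

def drop_stopwords(text, stopwords):
    """Remove stopwords from a whitespace-separated text."""
    return " ".join(word for word in text.split() if word not in stopwords)

original = "não gostei do notebook"
with_keyword = original.replace("não", "negação")

print(drop_stopwords(original, toy_stopwords))      # the negation is lost
print(drop_stopwords(with_keyword, toy_stopwords))  # the negation survives as a keyword
```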

## 5. Removing additional whitespaces
___

As we can see in the examples above, some of our regex functions end up adding extra whitespace to the text. Let's create a function to eliminate the unnecessary whitespace:

In [11]:
def remove_additional_whitespaces(text):
    """ Removes additional whitespaces
    
    Args:
        text (str): original text
    
    Returns:
        (str): preprocessed text
    """
    
    new_text = re.sub(r"\s+", " ", text)
    new_text = new_text.strip()
    return new_text

- **Example:**

In [12]:
example_text = ' negação  gostei do notebook que comprei nessa loja  '
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(remove_additional_whitespaces(example_text))

> Text before preprocessing:


' negação  gostei do notebook que comprei nessa loja  '


> Text after preprocessing:


'negação gostei do notebook que comprei nessa loja'
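
For typical review text, the same result can be obtained without a regex: `str.split()` with no arguments splits on any run of whitespace and discards leading and trailing whitespace. The helper name `collapse_whitespace` below is hypothetical, and the snippet is only a sketch comparing the two approaches on one sample.

```python
import re

def collapse_whitespace(text):
    """Split on any whitespace run and rejoin with single spaces."""
    return " ".join(text.split())

sample = " negação  gostei do\tnotebook  "
regex_result = re.sub(r"\s+", " ", sample).strip()

# Both approaches agree on this sample.
assert collapse_whitespace(sample) == regex_result
print(repr(collapse_whitespace(sample)))
```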

## 6. Removing Stopwords and Punctuation
___


Many commonly used words contribute little to the meaning of a text. These are called **stopwords**, and they are usually filtered out in NLP pipelines.

Let's import the list of Portuguese stopwords available in `nltk` and create a function that removes them, together with punctuation, from our text:

In [13]:
# Importing the Portuguese stopwords (needs a one-time nltk.download("stopwords"))
stopwords_ptbr = nltk.corpus.stopwords.words("portuguese")

In [14]:
def remove_stopwords_punctuation(text, stopwords):
    """ Removes stopwords and punctuation
    
    Args:
        text (str): original text
        stopwords (list): list of stopwords
        
    Returns:
        (str): preprocessed text    
    """
    
    tokens = nltk.tokenize.word_tokenize(text)
    words = [t for t in tokens if t.isalpha() and t not in stopwords]
    return " ".join(words)

- **Example:**

In [15]:
example_text = "o pedido foi entregue muito rápido!"
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(remove_stopwords_punctuation(example_text, stopwords_ptbr))

> Text before preprocessing:


'o pedido foi entregue muito rápido!'


> Text after preprocessing:


'pedido entregue rápido'
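
Membership tests against a list scan the list for every token, so converting the stopwords to a set once is a cheap speed-up when processing tens of thousands of reviews. The sketch below shows the idea with dependency-free stand-ins: `simple_tokenize` is a rough regex tokenizer and `TOY_STOPWORDS` is a three-word set, both assumptions for illustration rather than the `nltk` tokenizer and corpus.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords_fast(text, stopword_set):
    """Keep alphabetic tokens that are not in the stopword set."""
    tokens = simple_tokenize(text)
    return " ".join(t for t in tokens if t.isalpha() and t not in stopword_set)

# Built once and reused for every review; set lookups take constant time on average.
TOY_STOPWORDS = frozenset(["o", "foi", "muito"])  # illustrative subset only

print(remove_stopwords_fast("o pedido foi entregue muito rápido!", TOY_STOPWORDS))
```

With the real list, the equivalent change would be to build `set(stopwords_ptbr)` once and pass that set to the function.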

## 7. Removing Accent Marks
___

Unlike English, Portuguese uses many accent marks and diacritics, such as the acute accent, cedilla, circumflex, diaeresis, grave accent and tilde.

However, many people leave them out when typing. So let's create a function that removes these marks, so that accented and unaccented spellings of a word map to the same token:

In [16]:
def remove_accent_marks(text):
    """ Removes accent marks
    
    Args:
        text (str): original text
    
    Returns:
        (str): preprocessed text
    """
    
    return unidecode.unidecode(text)

- **Example:**

In [17]:
example_text = "ótimo rápido péssimo não"
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(remove_accent_marks(example_text))

> Text before preprocessing:


'ótimo rápido péssimo não'


> Text after preprocessing:


'otimo rapido pessimo nao'
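
If adding a dependency is a concern, a similar effect for Latin-script text can be sketched with the standard library's `unicodedata` module: decompose each character and drop the combining marks. The helper name `strip_accents_stdlib` is made up here. Unlike `unidecode`, this approach does not transliterate non-Latin scripts; it only removes diacritics.

```python
import unicodedata

def strip_accents_stdlib(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents_stdlib("ótimo rápido péssimo não ação"))
```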

## 8. Stemming
___

Stemming is the process of reducing a word to its stem by stripping affixes such as suffixes and prefixes.

The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

Let's create a function to extract the stem of each word in a text:

In [18]:
def text_stemmer(text, stemmer):
    """ Reduces each word of the text to its stem/root
    
    Args:
        text (str): original text
        stemmer (object): stemmer instance exposing a stem(word) method
        
    Returns:
        (str): preprocessed text

    """
    return " ".join([stemmer.stem(word) for word in text.split()])

- **Example:**

In [19]:
nltk_stemmer = nltk.RSLPStemmer()

example_text = "fiz uma compra e o produto demorou muito tempo para chegar"
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(text_stemmer(example_text, nltk_stemmer))

> Text before preprocessing:


'fiz uma compra e o produto demorou muito tempo para chegar'


> Text after preprocessing:


'fiz uma compr e o produt demor muit temp par cheg'
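
To illustrate the idea that related words should map to the same stem, here is a deliberately naive suffix stripper. It is a toy with a hand-picked suffix list, not the RSLP algorithm used by `nltk`, and `toy_stem` is a hypothetical name.

```python
# Hand-picked suffixes, longest first, purely for illustration.
TOY_SUFFIXES = ("ando", "ou", "ar", "a")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

forms = ["comprar", "comprou", "comprando", "compra"]
print({form: toy_stem(form) for form in forms})  # every form maps to "compr"
```

A real stemmer applies many more rules and exceptions, which is why the pipeline relies on `nltk` rather than a list like this.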

## 9. Text Preprocessing Function
___


In the previous sections, we defined several text preprocessing functions. Now let's create a function that runs the entire pipeline:

In [35]:
def text_preprocessing(text, stopwords, stemmer):
    """ Runs the text preprocessing pipeline
    
    Args:
        text (str): original text
        stopwords (list): list of stopwords
        stemmer (object): stemmer instance exposing a stem(word) method
        
    Returns:
        (str): preprocessed text
    
    """
    
    new_text = text.lower()
    new_text = remove_newline(new_text)
    new_text = replace_dates(new_text)
    new_text = replace_numbers(new_text)
    new_text = replace_negation_words(new_text)
    new_text = remove_additional_whitespaces(new_text)
    new_text = remove_stopwords_punctuation(new_text, stopwords)
    new_text = remove_accent_marks(new_text)
    new_text = text_stemmer(new_text, stemmer)
    return new_text

- **Example:**

In [38]:
example_text = reviews_df['review_comment_message'].iloc[12]
print("> Text before preprocessing:")
display(example_text)
print("\n> Text after preprocessing:")
display(text_preprocessing(example_text, stopwords_ptbr, nltk_stemmer))

> Text before preprocessing:


'Sempre compro pela Internet e a entrega ocorre antes do prazo combinado, que acredito ser o prazo máximo. No stark o prazo máximo já se esgotou e ainda não recebi o produto.'


> Text after preprocessing:


'sempr compr internet entreg ocorr ant praz combin acredit ser praz max stark praz max esgot aind negaca receb produt'
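
Since each step takes a string and returns a string, the pipeline can also be expressed as an ordered list of functions, which makes it easy to reorder or disable steps while experimenting. The sketch below uses a hypothetical `run_steps` helper and three stand-in steps. With the notebook's functions, the two-argument steps would first need to be wrapped, for example with `functools.partial`.

```python
from functools import reduce

def run_steps(text, steps):
    """Apply each single-argument function in order, feeding its output to the next."""
    return reduce(lambda current, step: step(current), steps, text)

# Stand-in steps for illustration only.
steps = [
    str.lower,
    str.strip,
    lambda t: " ".join(t.split()),  # collapse repeated whitespace
]

print(run_steps("  Sempre  COMPRO pela Internet ", steps))
```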

## 10. Preprocessing our Data 
___

Ok. Now that we have created our preprocessing pipeline, let's preprocess all our data and create a new column called `preprocessed_review`:

In [44]:
reviews_df['preprocessed_review'] = [text_preprocessing(text, stopwords_ptbr, nltk_stemmer)
                                     for text in reviews_df['review_comment_message']]

reviews_df

Unnamed: 0,review_comment_message,review_score,preprocessed_review
0,Recebi bem antes do prazo estipulado.,5,receb bem ant praz estipul
1,Parabéns lojas lannister adorei comprar pela I...,5,parab loj lannist ador compr internet segur pr...
2,aparelho eficiente. no site a marca do aparelh...,4,aparelh efici sit marc aparelh impress numer d...
3,"Mas um pouco ,travando...pelo valor ta Boa.\r\n",4,pouc trav val ta boa
4,"Vendedor confiável, produto ok e entrega antes...",5,vend confia produt ok entreg ant praz
...,...,...,...
40972,para este produto recebi de acordo com a compr...,4,produt receb acord compr realiz
40973,Entregou dentro do prazo. O produto chegou em ...,5,entreg dentr praz produt cheg condico perfeit ...
40974,"O produto não foi enviado com NF, não existe v...",3,produt negaca envi nf negaca exist vend nf cer...
40975,"Excelente mochila, entrega super rápida. Super...",5,excel mochil entreg sup rap sup recom loj
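
If the same comment text occurs more than once in the dataset (an assumption; this notebook does not check it), caching can avoid repeating the work. The sketch below wraps a stand-in function, `slow_preprocess`, with `functools.lru_cache` and counts how often the underlying function actually runs.

```python
from functools import lru_cache

calls = 0

def slow_preprocess(text):
    """Stand-in for an expensive preprocessing function."""
    global calls
    calls += 1
    return text.lower().strip()

# Cache results by input text; repeated inputs are served from the cache.
cached_preprocess = lru_cache(maxsize=None)(slow_preprocess)

reviews = ["Muito bom", "Recomendo", "Muito bom", "Muito bom"]
processed = [cached_preprocess(text) for text in reviews]
print(processed, calls)  # four results, but only two real calls
```

This only works when every argument is hashable, so with the notebook's pipeline the stopwords and stemmer would have to be bound beforehand rather than passed as a list on each call.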
