# Cleaning/Staging of Brazilian E-Commerce Public Dataset by Olist

In the previous notebook, we performed an exploratory data analysis (EDA) of the Brazilian E-Commerce Public Dataset, made available by Olist.

As a reminder, our final goal is to develop a classification model that identifies whether a customer review is positive or negative.

While we already have the data we need to build this model, there is still much to be done. As we saw in the last notebook, the text may still contain special characters, HTML tags and other undesired artifacts. We will deal with them now.

In [18]:
# Load our checkpoint .csv file
import pandas as pd

df = pd.read_csv('customer_reviews.csv', index_col=0)
df.head(5)

Unnamed: 0,review_score,review_comment_message
0,5,Recebi bem antes do prazo estipulado.
1,5,Parabéns lojas lannister adorei comprar pela I...
2,4,aparelho eficiente. no site a marca do aparelh...
3,4,"Mas um pouco ,travando...pelo valor ta Boa.\r\n"
4,5,"Vendedor confiável, produto ok e entrega antes..."


We can already see from the fourth row (index 3) that we have special characters: '\r' and '\n'.

A practical way to deal with these is through regular expressions (regex).

A very helpful cheatsheet can be found in [this link.](https://www.analyticsvidhya.com/blog/2021/06/regex-cheatsheet-for-natural-language-processing-tasks/)

In [19]:
import re

def remove_breakline(text):
    return [re.sub(r'[\n\r]', ' ', r) for r in text]


In [20]:
def remove_hyperlinks(text):
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return [re.sub(pattern, ' link ', r) for r in text]

In [21]:
def remove_dates(text):
    pattern = r'([0-2][0-9]|(3)[0-1])(\/|\.)(((0)[0-9])|((1)[0-2]))(\/|\.)\d{2,4}'
    return [re.sub(pattern, ' data ', r) for r in text]
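
To see what the date pattern does and does not catch, here is a small check on made-up strings (the sample sentences are illustrative, not taken from the dataset). Note that the pattern expects two-digit days and months, so a date written as 1/2/2018 passes through untouched.

```python
import re

# Same pattern as in remove_dates above
pattern = r'([0-2][0-9]|(3)[0-1])(\/|\.)(((0)[0-9])|((1)[0-2]))(\/|\.)\d{2,4}'

# Hypothetical sample sentences, for illustration only
samples = [
    'comprei em 25/12/2017 e ainda nao chegou',  # dd/mm/yyyy
    'prazo era 05.03.18',                        # dd.mm.yy
    'pedido feito em 1/2/2018',                  # no leading zeros
]

replaced = [re.sub(pattern, ' data ', s) for s in samples]
for before, after in zip(samples, replaced):
    print(repr(before), '->', repr(after))
```

If single-digit days or months turn out to be common in the reviews, the pattern would need to be relaxed accordingly.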

In [22]:
def remove_emojis(text):
    # \W matches any non-word character, so punctuation is removed along with emojis
    return [re.sub(r'\W', ' ', r) for r in text]
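
Despite the function's name, the `\W` pattern matches every character that is not a letter, digit or underscore, so it strips punctuation as well as emojis. For Python `str` patterns, `\w` is Unicode-aware, which is why accented Portuguese letters survive. A quick check on a made-up string (not from the dataset):

```python
import re

# Hypothetical review text, for illustration only
sample = 'Ótimo!!! 😀 entrega rápida.'

# Replace every non-word character with a space, as remove_emojis does
no_symbols = re.sub(r'\W', ' ', sample)

# Collapse the resulting runs of spaces to make the result easier to read
tidy = ' '.join(no_symbols.split())
print(tidy)
```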

In [23]:
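def remove_trailing_whitespace(text):
    # Collapse every run of whitespace into a single space
    white_spaces = [re.sub(r'\s+', ' ', r) for r in text]
    # Strip spaces and tabs left at the end of each review
    white_spaces_end = [re.sub(r'[ \t]+$', '', r) for r in white_spaces]
    return white_spaces_end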

In [24]:
review_messages = list(df['review_comment_message'].values)

review_messages = remove_breakline(review_messages)
review_messages = remove_hyperlinks(review_messages)
review_messages = remove_dates(review_messages)
review_messages = remove_emojis(review_messages)
review_messages = remove_trailing_whitespace(review_messages)

df['comments'] = review_messages
df = df.drop(columns=['review_comment_message'])

df.head(5)

Unnamed: 0,review_score,comments
0,5,Recebi bem antes do prazo estipulado
1,5,Parabéns lojas lannister adorei comprar pela I...
2,4,aparelho eficiente no site a marca do aparelho...
3,4,Mas um pouco travando pelo valor ta Boa
4,5,Vendedor confiável produto ok e entrega antes ...


With the regex-based cleaning done, we are about halfway through the pre-processing.

Many NLP pipelines also apply text transformations such as stopword removal and stemming [(ref)](https://medium.com/@asjad_ali/understanding-the-nlp-pipeline-a-comprehensive-guide-828b2b3cd4e2).

In [25]:
import nltk
from nltk.corpus import stopwords

try:
    portuguese_stopwords = stopwords.words('portuguese')
except LookupError:
    # The corpus is not available locally yet, so download it and retry
    nltk.download('stopwords')
    portuguese_stopwords = stopwords.words('portuguese')

portuguese_stopwords[:4]

['a', 'à', 'ao', 'aos']

In [26]:
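# Build the stopword set once; set membership checks are fast
PORTUGUESE_STOPWORDS = set(stopwords.words('portuguese'))

def remove_stopwords(text):
    # Lowercase each token and drop the ones found in the stopword set
    return [c.lower() for c in text.split() if c.lower() not in PORTUGUESE_STOPWORDS]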
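
One caveat for sentiment tasks: stopword lists usually contain negation words, and NLTK's Portuguese list appears to include 'não' (worth checking against your installed version). Dropping them can turn 'não gostei' into 'gostei'. The sketch below shows one possible way to keep negations; the tiny `stopword_sample` set, the `negations` set and the helper name are hypothetical stand-ins so the example runs without the NLTK corpus.

```python
# Hypothetical stand-in for stopwords.words('portuguese'); the real list is longer
stopword_sample = {'a', 'o', 'do', 'de', 'não', 'nem', 'e'}

# Assumed set of negation words we want to preserve for sentiment analysis
negations = {'não', 'nem'}

def remove_stopwords_keep_negation(text, stop_words, keep):
    """Lowercase tokens and drop stopwords, except the ones listed in `keep`."""
    effective = set(stop_words) - set(keep)
    return [t.lower() for t in text.split() if t.lower() not in effective]

tokens = remove_stopwords_keep_negation('Não gostei do produto', stopword_sample, negations)
print(' '.join(tokens))
```

Whether keeping negations actually helps is an empirical question that the classification results would have to answer.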

In [27]:
review_messages = [' '.join(remove_stopwords(review)) for review in review_messages]

df['comments'] = review_messages

In [28]:
df.head(5)

Unnamed: 0,review_score,comments
0,5,recebi bem antes prazo estipulado
1,5,parabéns lojas lannister adorei comprar intern...
2,4,aparelho eficiente site marca aparelho impress...
3,4,pouco travando valor ta boa
4,5,vendedor confiável produto ok entrega antes prazo


You can see that our sentences are now much shorter. Although a human reader may perceive them as worse, this step often improves model accuracy. Next, we will perform text stemming.

Text stemming is the process of reducing a word to its stem (root form). As an example in English, words such as running and runs become run.

In [29]:
from nltk.stem import RSLPStemmer

def stem(text):
    stemmer = RSLPStemmer()
    return [stemmer.stem(c) for c in text.split()]

In [31]:
try:
    review_messages = [' '.join(stem(review)) for review in review_messages]
except LookupError:
    # The RSLP stemmer data is not available locally yet, so download it and retry
    nltk.download("rslp")
    review_messages = [' '.join(stem(review)) for review in review_messages]

df['comments'] = review_messages
df.head(5)

[nltk_data] Downloading package rslp to /home/heitor/nltk_data...
[nltk_data]   Package rslp is already up-to-date!


Now that the text has been pre-processed, there are many ways to approach the next step: feature extraction.

Two of the most common classical feature extractors* for NLP are Bag of Words and TF-IDF. In this example, we will apply the TF-IDF method, since it ships with scikit-learn. If the result is poor, we may try other approaches.


*Alternatives to these count-based extractors include embeddings from large neural language models such as GPTs.

The last step is defining the target column, the one our model will be trained to predict.

There are a few options here. For starters, we could define reviews rated 1 and 2 as bad, 3 as neutral, and 4 or 5 as positive. That is an approach I do not like: personally, I see three-star reviews as more negative than positive. I'm also not quite sure a 4-star review can be defined as "positive", but for simplicity, we will map it as such.

In [None]:
def map_sentiment(score):
    # 4 and 5 stars count as positive (1); 1 to 3 stars count as negative (0)
    return 1 if score >= 4 else 0


df["sentiment"] = df["review_score"].apply(map_sentiment)
df.head(5)
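
Because the threshold is a judgment call, it can help to look at the class balance it produces. The sketch below uses a vectorised comparison equivalent to `map_sentiment` on a hypothetical handful of scores (not the real data) and prints the share of each class.

```python
import pandas as pd

# Hypothetical review scores, for illustration only
scores = pd.DataFrame({'review_score': [5, 4, 3, 1, 5, 2]})

# Vectorised equivalent of map_sentiment: True/False cast to 1/0
scores['sentiment'] = (scores['review_score'] >= 4).astype(int)

# Fraction of reviews in each class; a strong imbalance may call for
# stratified splits or class weights later on
balance = scores['sentiment'].value_counts(normalize=True)
print(balance)
```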

In [36]:
df.to_csv('customer_reviews_preprocessed.csv')
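
One thing to watch for when reloading this file: comments that became empty strings during cleaning (for example, a review made up only of stopwords) are written as empty fields and read back by `pd.read_csv` as `NaN` by default. A minimal sketch with an in-memory buffer and made-up rows:

```python
import io
import pandas as pd

# Hypothetical frame where one comment is empty after cleaning
demo = pd.DataFrame({'review_score': [5, 1], 'comments': ['receb bem', '']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# By default the empty field comes back as NaN
reloaded = pd.read_csv(buffer, index_col=0)
missing = reloaded['comments'].isna().tolist()
print(missing)

# Filling missing comments restores plain strings before vectorising
reloaded['comments'] = reloaded['comments'].fillna('')
```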