# Analysis of Brazilian E-Commerce Public Dataset by Olist

Our goal with this dataset is to develop a classification model, which will identify if a customer review was positive or negative. 

Since the dataset does not include this target variable, we will derive it from the information available in the customer reviews.

When first downloading the dataset, you may notice we get quite a few files. More specifically, they are:

- `olist_customers_dataset.csv`: customers' zip codes and states
- `olist_geolocation_dataset.csv`: latitude/longitude coordinates for customers
- `olist_orders_dataset.csv`: order status and details
- `olist_order_items_dataset.csv`: customer orders, prices and seller ids
- `olist_order_payments_dataset.csv`: payment types and installments
- `olist_order_reviews_dataset.csv`: review scores and texts
- `olist_products_dataset.csv`: products and their descriptions
- `olist_sellers_dataset.csv`: seller zip codes
- `product_category_name_translation.csv`: product category names

Since our goal is to build a machine learning model that identifies the sentiment of customer reviews, we will work with `olist_order_reviews_dataset.csv`.

## Exploratory Data Analysis (EDA)

An EDA is the first step towards any successful data-related project. It consists of exploring the dataset we will work with and understanding its attributes, value distributions and feature relationships.

To build a great ML model, we must first understand and sanitize the data.

In [1]:
# We start by defining the csv path and loading the file

import os

abs_path = os.getcwd()
csv_path = 'data/datasets/olistbr/brazilian-ecommerce/versions/2/olist_order_reviews_dataset.csv'

full_csv_path = os.path.join(abs_path, csv_path)

print(full_csv_path)

/home/heitor/Bravium/Bravium-NLP/data/datasets/olistbr/brazilian-ecommerce/versions/2/olist_order_reviews_dataset.csv


In [2]:
# the .head() method is extremely useful to get a sneak peek at the dataset

import pandas as pd

df = pd.read_csv(full_csv_path)
df.head(5)

| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | NaN | NaN | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | NaN | NaN | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | NaN | NaN | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | NaN | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | NaN | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |


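Here, we can already identify a few columns that we can discard. `review_id` and `order_id` are identifiers and carry no sentiment information, and the two timestamp columns (`review_creation_date` and `review_answer_timestamp`) are not needed for this task either. The review score, title and message are the columns most relevant to our model.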

In [3]:
df = df.drop(columns=['review_id', 'order_id',
                      'review_creation_date', 'review_answer_timestamp'])

In [4]:
df.head(5)

| | review_score | review_comment_title | review_comment_message |
|---|---|---|---|
| 0 | 4 | NaN | NaN |
| 1 | 5 | NaN | NaN |
| 2 | 5 | NaN | NaN |
| 3 | 5 | NaN | Recebi bem antes do prazo estipulado. |
| 4 | 5 | NaN | Parabéns lojas lannister adorei comprar pela I... |


The first thing I want to do now is compare the number of entries in the dataset with the number of missing values. Let's have a look:

In [5]:
len(df)

99224

In [6]:
null_counts = df.isna().sum()
null_counts

review_score                  0
review_comment_title      87656
review_comment_message    58247
dtype: int64

Here, we see that out of 99,224 entries, 87,656 titles (about 88%) and 58,247 messages (about 59%) are missing.
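The missing-value counts can also be expressed directly as fractions. The sketch below is only an illustration on a small made-up frame (the `toy` data and the `missing_share` name are assumptions, not part of the notebook); the same call on `df` should give the per-column share of missing values.

```python
import pandas as pd

# Toy frame standing in for the reviews data (values are made up for illustration)
toy = pd.DataFrame({
    "review_score": [5, 4, 1, 5],
    "review_comment_title": [None, "Bom", None, None],
    "review_comment_message": ["Muito bom", None, "Não recebi", None],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values
missing_share = toy.isna().mean()
print(missing_share)
```

Using the mean instead of the sum avoids dividing by `len(df)` by hand and keeps the result aligned with the column names.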

Let's look at the most common values in these columns.

In [7]:
# the .value_counts method is very useful to understand columns' data distributions

df['review_score'].value_counts()

review_score
5    57328
4    19142
1    11424
3     8179
2     3151
Name: count, dtype: int64
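Since the target variable has to be derived from the reviews, the score is a natural starting point. The following is a minimal sketch of one possible labelling rule. The thresholds are an assumption made for illustration (4-5 as positive, 1-2 as negative, 3 dropped as neutral) and the `sentiment` column name is hypothetical; the notebook itself does not fix a rule at this point.

```python
import pandas as pd

# Toy scores standing in for the review_score column
toy = pd.DataFrame({"review_score": [5, 4, 3, 2, 1]})

# Assumed rule (not taken from the notebook):
#   4-5 -> positive (1), 1-2 -> negative (0), 3 -> neutral, dropped
labelled = toy[toy["review_score"] != 3].copy()
labelled["sentiment"] = (labelled["review_score"] >= 4).astype(int)
print(labelled)
```

Applied to the score counts above, this assumed rule would give 76,470 positive and 14,575 negative reviews before any rows are dropped for missing messages, so the classes would be noticeably imbalanced.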

In [8]:
# Here, we can already see some positive review titles (recomendo, bom, excelente)

df['review_comment_title'].value_counts()

review_comment_title
Recomendo                    423
recomendo                    345
Bom                          293
super recomendo              270
Excelente                    248
                            ... 
medidas do produto             1
Muito, entregou antes do       1
Tudo dentro do combinado.      1
Não entrega do produto         1
Tudo como previsto!            1
Name: count, Length: 4527, dtype: int64

In [9]:
# Here, we can see both positive (bom, muito bom) and negative (não informado como réplica) reviews 

df['review_comment_message'].value_counts()

review_comment_message
Muito bom                                                    230
Bom                                                          189
muito bom                                                    122
bom                                                          107
Recomendo                                                    100
                                                            ... 
qualidade.                                                     1
chegou bem antes do prazo previsto                             1
Ja respondi esse questionario.                                 1
Produto não informado como paralelo/réplica                    1
Produto um pouço maior do que na imagem, mas ficou legal.      1
Name: count, Length: 36159, dtype: int64
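The counts above treat "Muito bom" and "muito bom" as different values because `value_counts` compares strings exactly. A small sketch of how casing and surrounding whitespace could be normalized before counting is shown below; the sample `messages` series is made up for illustration.

```python
import pandas as pd

# Toy messages mirroring the casing variants seen above
messages = pd.Series(["Muito bom", "muito bom", "Bom", "bom ", "Recomendo"])

# Lower-case and strip surrounding whitespace so variants collapse into one entry
normalized = messages.str.lower().str.strip()
counts = normalized.value_counts()
print(counts)
```

This is only meant for exploration; whether lower-casing is appropriate for the final model input is a separate pre-processing decision.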

In general, `review_comment_title` seems to be redundant: it mostly restates the review score in a few words. In addition, the large majority of its records are missing. Hence, we will drop this column.

In [10]:
df = df.drop(columns=['review_comment_title'])

We also drop missing values from the `review_comment_message` column, leaving us with the scores and their respective message:

In [11]:
df = df.dropna(subset=['review_comment_message'])
df = df.reset_index(drop=True)

In [12]:
df.head(5)

| | review_score | review_comment_message |
|---|---|---|
| 0 | 5 | Recebi bem antes do prazo estipulado. |
| 1 | 5 | Parabéns lojas lannister adorei comprar pela I... |
| 2 | 4 | aparelho eficiente. no site a marca do aparelh... |
| 3 | 4 | Mas um pouco ,travando...pelo valor ta Boa.\r\n |
| 4 | 5 | Vendedor confiável, produto ok e entrega antes... |


In [13]:
len(df)

40977

After this cleaning, we now have around 40k samples to work with. Still, before using this data on our model, we must clean it up extensively.

This data pre-processing step is shown in the `data_cleaning.ipynb` file.
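As a small illustration of why further cleaning is needed, note that `dropna` only removes true missing values: a message made of spaces or line breaks survives it. The sketch below is an assumption about one check such cleaning might include, not a description of what `data_cleaning.ipynb` does; the toy data is made up.

```python
import pandas as pd

# Toy frame: one real message with a trailing line break, one blank, one missing
toy = pd.DataFrame({
    "review_score": [5, 4, 1],
    "review_comment_message": ["Muito bom\r\n", "   ", None],
})

# dropna removes the None row but keeps the whitespace-only one
no_nan = toy.dropna(subset=["review_comment_message"])

# Strip whitespace (including \r\n), then keep only non-empty messages
stripped = no_nan["review_comment_message"].str.strip()
cleaned = no_nan.assign(review_comment_message=stripped)[stripped != ""]
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```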

In [14]:
# Save the filtered scores and messages to disk
df.to_csv('customer_reviews.csv')
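One detail worth knowing when saving: by default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below compares the default with `index=False` using in-memory buffers and made-up data; whether to keep the index is a choice left to whoever loads the file next.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "review_score": [5, 1],
    "review_comment_message": ["Bom", "Ruim"],
})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

An alternative that keeps the file unchanged is to read it back with `pd.read_csv(path, index_col=0)`.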