# Analysis of Reviews on Olist

🎯 Now that you are familiar with NLP, let's analyze the reviews of Olist.

👇 Run the following cell to load the reviews dataset and install `unidecode`

In [1]:
!pip install -q unidecode

import pandas as pd

url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ml_olist_nlp_reviews.csv"
df = pd.read_csv(url, low_memory = False)

df.head()


Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,0,7bc2406110b926393aa56f80a40eba40,0,4,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,,,2018-01-18 00:00:00,2018-01-18 21:46:59,41dcb106f807e993532d446263290104,delivered,2018-01-11 15:30:49,2018-01-11 15:47:59,2018-01-12 21:57:22,2018-01-17 18:42:41,2018-02-02 00:00:00
1,1,80e641a11e56f04c1ad469d5645fdfde,0,5,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,,,2018-03-10 00:00:00,2018-03-11 03:05:13,8a2e7ef9053dea531e4dc76bd6d853e6,delivered,2018-02-28 12:25:19,2018-02-28 12:48:39,2018-03-02 19:08:15,2018-03-09 23:17:20,2018-03-14 00:00:00
2,2,228ce5500dc1d8e020d8d1322874b6f0,0,5,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,,,2018-02-17 00:00:00,2018-02-18 14:36:24,e226dfed6544df5b7b87a48208690feb,delivered,2018-02-03 09:56:22,2018-02-03 10:33:41,2018-02-06 16:18:28,2018-02-16 17:28:48,2018-03-09 00:00:00
3,3,e64fb393e7b32834bb789ff8bb30750e,37,5,658677c97b385a9be170737859d3511b,ferramentas_jardim,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,de6dff97e5f1ba84a3cd9a3bc97df5f6,delivered,2017-04-09 17:41:13,2017-04-09 17:55:19,2017-04-10 14:24:47,2017-04-20 09:08:35,2017-05-10 00:00:00
4,4,f7c4243c7fe1938f181bec41a392bdeb,100,5,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,5986b333ca0d44534a156a52a8e33a83,delivered,2018-02-10 10:59:03,2018-02-10 15:48:21,2018-02-15 19:36:14,2018-02-28 16:33:35,2018-03-09 00:00:00


In [2]:
df.shape


(98657, 17)

❓ **Question: Analyse the reviews to understand what could be the causes of the bad review scores** ❓

This challenge is not as guided as the previous ones. But here are some questions to ask yourself:

- Are all the reviews relevant?
- What about combining the title and the body of a review?
- What cleaning operations would you apply to the reviews?

In [3]:
# Customers could review an order before receiving it...
# We should consider reviews written only after receiving the order

df = df[(df['review_creation_date'] >= df['order_delivered_customer_date'])]
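
The filter above compares the two date columns as strings, which works as long as both use the same `YYYY-MM-DD HH:MM:SS` format. A slightly more defensive sketch parses the columns as datetimes first. The `toy` frame below is hypothetical and only mimics the two columns of the dataset.

```python
import pandas as pd

# Hypothetical mini-frame mimicking the two date columns of the dataset
toy = pd.DataFrame({
    "review_creation_date": ["2018-01-18 00:00:00", "2018-03-01 00:00:00", "2018-02-01 00:00:00"],
    "order_delivered_customer_date": ["2018-01-17 18:42:41", "2018-03-05 10:00:00", None],
})

# Parse both columns as datetimes; missing values become NaT
for col in ["review_creation_date", "order_delivered_customer_date"]:
    toy[col] = pd.to_datetime(toy[col])

# Comparisons with NaT evaluate to False, so undelivered orders are dropped too
after_delivery = toy[toy["review_creation_date"] >= toy["order_delivered_customer_date"]]
print(len(after_delivery))
```

Only the first row survives: the second review was written before delivery, and the third order has no delivery date.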


In [4]:
# Keep only text columns and review score
df = df[['order_id','product_category_name','review_comment_title','review_comment_message','review_score']]
df.head()


Unnamed: 0,order_id,product_category_name,review_comment_title,review_comment_message,review_score
0,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,,,4
1,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,,,5
2,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,,,5
3,658677c97b385a9be170737859d3511b,ferramentas_jardim,,Recebi bem antes do prazo estipulado.,5
4,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,,Parabéns lojas lannister adorei comprar pela I...,5


In [5]:
# Combine review title and review message.
# dropna (how='any' by default) keeps only the reviews that have both a title
# and a message, so no fillna is needed afterwards.
df = df.dropna(subset=['review_comment_title', 'review_comment_message']).copy()
df['title_comment'] = df['review_comment_title'] + " " + df['review_comment_message']
df.head()


Unnamed: 0,order_id,product_category_name,review_comment_title,review_comment_message,review_score,title_comment
9,b9bf720beb4ab3728760088589c62129,eletroportateis,recomendo,aparelho eficiente. no site a marca do aparelh...,4,recomendo aparelho eficiente. no site a marca ...
15,e51478e7e277a83743b6f9991dbfa3fb,informatica_acessorios,Super recomendo,"Vendedor confiável, produto ok e entrega antes...",5,"Super recomendo Vendedor confiável, produto ok..."
22,4fc44d78867142c627497b60a7e0228a,beleza_saude,Ótimo,Loja nota 10,5,Ótimo Loja nota 10
36,37e7875cdce5a9e5b3a692971f370151,esporte_lazer,Muito bom.,Recebi exatamente o que esperava. As demais en...,4,Muito bom. Recebi exatamente o que esperava. A...
38,e029f708df3cc108b3264558771605c6,pet_shop,Bom,"Recomendo ,",5,"Bom Recomendo ,"
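
Dropping rows where either field is missing keeps only the reviews that have both a title and a message. If reviews with only one of the two should also be analysed, one possible variant fills the gaps with empty strings and drops only the rows left with no text at all. This is a sketch on a hypothetical `toy` frame, not the choice made in this notebook.

```python
import pandas as pd

# Hypothetical reviews: one with both fields, one with a message only, one empty
toy = pd.DataFrame({
    "review_comment_title": ["Ótimo", None, None],
    "review_comment_message": ["Loja nota 10", "Recebi bem antes do prazo estipulado.", None],
})

# Replace missing parts with empty strings, join, and trim the stray space
toy["title_comment"] = (
    toy["review_comment_title"].fillna("")
    + " "
    + toy["review_comment_message"].fillna("")
).str.strip()

# Keep only the rows that still contain some text
toy = toy[toy["title_comment"] != ""]
print(toy["title_comment"].tolist())
```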


In [8]:
# Cleaning text

import string

import nltk
import unidecode
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # newer NLTK releases may also ask for 'punkt_tab'
nltk.download('stopwords')  # required by stopwords.words(); without it NLTK raises a LookupError

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('portuguese'))


def clean(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')  # Remove punctuation

    lowercased = text.lower()  # Lowercase

    unaccented_string = unidecode.unidecode(lowercased)  # Remove accents

    tokenized = word_tokenize(unaccented_string)  # Tokenize

    words_only = [word for word in tokenized if word.isalpha()]  # Remove numbers

    without_stopwords = [word for word in words_only if word not in stop_words]  # Remove stopwords

    return " ".join(without_stopwords)


df['clean_text'] = df['title_comment'].apply(clean)

df.head()
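
One detail of the cleaning order is worth checking: accents are stripped before stopwords are removed, so an accented stopword such as `não` becomes `nao` and is only filtered out if the stopword list contains the unaccented form. The sketch below illustrates the effect with the standard library only. The three-word `stop_words` set and the `strip_accents` helper are hypothetical stand-ins for NLTK's Portuguese list and for `unidecode`.

```python
import unicodedata


def strip_accents(text):
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical accented stopword list (stand-in for NLTK's Portuguese list)
stop_words = {"não", "o", "de"}

tokens = "o produto não chegou".split()

# Order used in the notebook: strip accents first, then filter stopwords
unaccent_first = [w for w in map(strip_accents, tokens) if w not in stop_words]

# Alternative order: filter stopwords first, then strip accents
filter_first = [strip_accents(w) for w in tokens if w not in stop_words]

print(unaccent_first)  # ['produto', 'nao', 'chegou']
print(filter_first)    # ['produto', 'chegou']
```

Keeping `nao` can be an advantage for this analysis, since negations carry much of the meaning of a complaint.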


In [None]:
# Looking at the review scores...
# ... about 25% of the reviews have a score <= 3
df["review_score"].value_counts(normalize=True).round(2)


5    0.59
4    0.16
1    0.14
3    0.07
2    0.04
Name: review_score, dtype: float64
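
The share of low scores can be read directly from the proportions printed above. A minimal sketch, re-entering those rounded values by hand (so the result is only as precise as the rounding):

```python
import pandas as pd

# Rounded proportions copied from the output above
score_share = pd.Series({5: 0.59, 4: 0.16, 1: 0.14, 3: 0.07, 2: 0.04})

# Sum the shares of the scores at or below 3
low_share = score_share[score_share.index <= 3].sum()
print(round(low_share, 2))  # 0.25
```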

In [None]:
# Let's focus on these bad scores
df = df[df["review_score"]<=3]


In [None]:
df.columns


Index(['order_id', 'product_category_name', 'review_comment_title',
       'review_comment_message', 'review_score', 'title_comment',
       'clean_text'],
      dtype='object')

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range = (2,2),
                             min_df=0.01,
                             max_df = 0.05).fit(df.clean_text)


In [None]:
vectors = pd.DataFrame(vectorizer.transform(df.clean_text).toarray(),
                       columns = vectorizer.get_feature_names_out())
vectors.head()


Unnamed: 0,ainda nao,antes prazo,ate agora,ate momento,bom produto,comprei dois,comprei duas,comprei produto,defeito produto,dentro prazo,...,produto errado,produto recebi,recebi apenas,recomendo produto,so chegou,so recebi,so veio,veio defeito,veio errado,veio faltando
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
sum_tfidf = vectors.sum(axis = 0)
sum_tfidf


ainda nao            59.953928
antes prazo          28.395321
ate agora            36.184216
ate momento          27.592462
bom produto          30.792640
comprei dois         27.888606
comprei duas         23.936694
comprei produto      26.983175
defeito produto      16.787267
dentro prazo         25.096925
entrar contato       19.739133
entregue nao         16.662659
errado comprei       15.477524
gostei produto       18.347935
nao chegou           33.166038
nao consigo          25.319662
nao entregue         56.795948
nao funciona         41.455319
nao gostei           41.958468
nao obtive           17.076249
nao veio             54.798881
nota fiscal          45.393544
outro produto        19.117803
pessima qualidade    30.066578
porem nao            20.680051
produto bom          36.363439
produto chegou       45.080361
produto comprei      27.615916
produto defeito      41.936776
produto diferente    36.387852
produto entregue     46.093263
produto errado       66.594133
produto 

In [None]:
tfidf_list = [(word, sum_tfidf[word]) for word in vectorizer.vocabulary_]
tfidf_list


[('produto recebi', 18.36913055412219),
 ('ate agora', 36.184216388958035),
 ('produto defeito', 41.93677588570452),
 ('ainda nao', 59.95392767842011),
 ('nao obtive', 17.076249075368263),
 ('recebi apenas', 44.91822070048678),
 ('produto errado', 66.59413315707422),
 ('veio defeito', 26.85961632946528),
 ('defeito produto', 16.787266729501095),
 ('produto chegou', 45.08036102140102),
 ('antes prazo', 28.395320981582028),
 ('pessima qualidade', 30.066577634144036),
 ('nao gostei', 41.95846774793043),
 ('produto diferente', 36.38785195017216),
 ('nao consigo', 25.31966216476006),
 ('produto entregue', 46.09326281613154),
 ('produto bom', 36.36343877398457),
 ('comprei produto', 26.983175449717525),
 ('bom produto', 30.79264033815669),
 ('so recebi', 41.67828991932885),
 ('errado comprei', 15.47752353172635),
 ('entrar contato', 19.739133325547165),
 ('ate momento', 27.59246168657807),
 ('nao chegou', 33.16603770360955),
 ('recomendo produto', 36.332676681785465),
 ('porem nao', 20.68005

In [None]:
sorted_tfidf_list = sorted(tfidf_list, key=lambda x: x[1], reverse=True)
sorted_tfidf_list


[('produto errado', 66.59413315707422),
 ('ainda nao', 59.95392767842011),
 ('nao entregue', 56.795947662676284),
 ('nao veio', 54.79888075566172),
 ('produto entregue', 46.09326281613154),
 ('nota fiscal', 45.39354367145548),
 ('produto chegou', 45.08036102140102),
 ('recebi apenas', 44.91822070048678),
 ('nao gostei', 41.95846774793043),
 ('produto defeito', 41.93677588570452),
 ('so recebi', 41.67828991932885),
 ('nao funciona', 41.45531914089406),
 ('produto diferente', 36.38785195017216),
 ('produto bom', 36.36343877398457),
 ('recomendo produto', 36.332676681785465),
 ('ate agora', 36.184216388958035),
 ('nao chegou', 33.16603770360955),
 ('bom produto', 30.79264033815669),
 ('pessima qualidade', 30.066577634144036),
 ('antes prazo', 28.395320981582028),
 ('comprei dois', 27.888605969877165),
 ('produto comprei', 27.615915917635764),
 ('ate momento', 27.59246168657807),
 ('comprei produto', 26.983175449717525),
 ('veio defeito', 26.85961632946528),
 ('nao consigo', 25.31966216476
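
Since `sum_tfidf` is a pandas Series, the same ranking can also be obtained without building a list of tuples. A short sketch, using a few of the totals printed above (rounded) as sample input:

```python
import pandas as pd

# A handful of bigram totals copied (rounded) from the output above
sum_tfidf = pd.Series({
    "ainda nao": 59.95,
    "nao entregue": 56.80,
    "nota fiscal": 45.39,
    "produto errado": 66.59,
})

# Sort from the highest to the lowest total and keep the top three
top_bigrams = sum_tfidf.sort_values(ascending=False).head(3)
print(top_bigrams.index.tolist())  # ['produto errado', 'ainda nao', 'nao entregue']
```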

🇧🇷 Some Brazilian Portuguese expressions and their translations:

- `produto errado` = wrong product
- `ainda nao` = not yet
- `nao entregue` = not delivered
- `nao veio` = did not come
- `nao gostei` = did not like it
- `produto defeito` = defective product
- `nao funciona` = does not work
- `produto diferente` = different product
- `pessima qualidade` = terrible quality
- `veio defeito` = arrived defective
- `veio faltando` = arrived with something missing
- `veio errado` = wrong item arrived

🏁 Congratulations! Instead of reading 90K+ reviews, you were able to detect the main reasons for dissatisfaction on Olist.

💾 Don't forget to `git add/commit/push`