# Analysis of Reviews on Olist

🎯 Now that you are familiar with NLP, let's analyze the reviews of Olist.

👇 Run the following cell to load the reviews dataset.

In [1]:
import pandas as pd

url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ml_olist_nlp_reviews.csv"
df = pd.read_csv(url, low_memory = False)

df.head()

Unnamed: 0.1,Unnamed: 0,review_id,length_review,review_score,order_id,product_category_name,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,0,7bc2406110b926393aa56f80a40eba40,0,4,73fc7af87114b39712e6da79b0a377eb,esporte_lazer,,,2018-01-18 00:00:00,2018-01-18 21:46:59,41dcb106f807e993532d446263290104,delivered,2018-01-11 15:30:49,2018-01-11 15:47:59,2018-01-12 21:57:22,2018-01-17 18:42:41,2018-02-02 00:00:00
1,1,80e641a11e56f04c1ad469d5645fdfde,0,5,a548910a1c6147796b98fdf73dbeba33,informatica_acessorios,,,2018-03-10 00:00:00,2018-03-11 03:05:13,8a2e7ef9053dea531e4dc76bd6d853e6,delivered,2018-02-28 12:25:19,2018-02-28 12:48:39,2018-03-02 19:08:15,2018-03-09 23:17:20,2018-03-14 00:00:00
2,2,228ce5500dc1d8e020d8d1322874b6f0,0,5,f9e4b658b201a9f2ecdecbb34bed034b,informatica_acessorios,,,2018-02-17 00:00:00,2018-02-18 14:36:24,e226dfed6544df5b7b87a48208690feb,delivered,2018-02-03 09:56:22,2018-02-03 10:33:41,2018-02-06 16:18:28,2018-02-16 17:28:48,2018-03-09 00:00:00
3,3,e64fb393e7b32834bb789ff8bb30750e,37,5,658677c97b385a9be170737859d3511b,ferramentas_jardim,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,de6dff97e5f1ba84a3cd9a3bc97df5f6,delivered,2017-04-09 17:41:13,2017-04-09 17:55:19,2017-04-10 14:24:47,2017-04-20 09:08:35,2017-05-10 00:00:00
4,4,f7c4243c7fe1938f181bec41a392bdeb,100,5,8e6bfb81e283fa7e4f11123a3fb894f1,esporte_lazer,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,5986b333ca0d44534a156a52a8e33a83,delivered,2018-02-10 10:59:03,2018-02-10 15:48:21,2018-02-15 19:36:14,2018-02-28 16:33:35,2018-03-09 00:00:00


In [2]:
df.shape

(98657, 17)

In [12]:
df = df[~df['review_comment_message'].isna()]
df = df[~df['review_score'].isna()]
df.shape

(40439, 17)

In [10]:
import string
from nltk.corpus import stopwords 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def prepocessing(sentence):
    
    # Basic cleaning
    sentence = sentence.strip() ## remove whitespaces
    sentence = sentence.lower() ## lowercase 
    sentence = ''.join(char for char in sentence if not char.isdigit()) ## remove numbers
    
    # Advanced cleaning
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '') ## remove punctuation
    
    tokenized_sentence = word_tokenize(sentence) ## tokenize 
    # stop_words = set(stopwords.words('english')) ## define stopwords
    
    # tokenized_sentence_cleaned = [ ## remove stopwords
    #     w for w in tokenized_sentence if not w in stop_words
    # ]

    lemmatized = [
        WordNetLemmatizer().lemmatize(word) 
        for word in tokenized_sentence
    ]
    
    cleaned_sentence = ' '.join(word for word in lemmatized)
    
    return cleaned_sentence

In [30]:
X = df['review_comment_message'].map(prepocessing)
y = df['review_score'].map(lambda x:0 if x>=4 else 1)

X = X[y==1]

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer(ngram_range=(2,2),max_df=0.6,max_features=10)

# Training it on the texts
weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(X).toarray(),
                 columns = tf_idf_vectorizer.get_feature_names_out())

weighted_words

Unnamed: 0,ante do,bem ante,chegou ante,dentro do,do prazo,muito bom,no prazo,produto chegou,produto de,produto entregue
0,0.470656,0.750355,0.0,0.000000,0.464166,0.0,0.0,0.000000,0.0,0.0
1,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
2,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
3,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
4,0.711998,0.000000,0.0,0.000000,0.702181,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
26413,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
26414,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
26415,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
26416,0.000000,0.000000,0.0,0.656582,0.418240,0.0,0.0,0.627675,0.0,0.0


In [35]:
from sklearn.decomposition import LatentDirichletAllocation

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vectorizer = TfidfVectorizer(ngram_range=(2,2),max_df=0.6,max_features=10)

# Training it on the texts
weighted_words = pd.DataFrame(tf_idf_vectorizer.fit_transform(X).toarray(),
                 columns = tf_idf_vectorizer.get_feature_names_out())


# vectorized_documents
# Instantiate the LDA 
n_components = 5
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter = 100)

# Fit the LDA on the vectorized documents
lda_model.fit_transform(weighted_words)

array([[0.07448302, 0.07850201, 0.074483  , 0.07639681, 0.69613516],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
       ...,
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
       [0.07400571, 0.39854797, 0.07400568, 0.07589687, 0.37754377],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ]])

In [33]:
def print_topics(lda_model, vectorizer, top_words):
    # 1. TOPIC MIXTURE OF WORDS FOR EACH TOPIC
    topic_mixture = pd.DataFrame(lda_model.components_,
                                 columns = vectorizer.get_feature_names_out())
    
    # 2. FINDING THE TOP WORDS FOR EACH TOPIC
    ## Number of topics
    n_components = topic_mixture.shape[0]
    ## Top words for each topic
    for topic in range(n_components):
        print("-"*10)
        print(f"For topic {topic}, here are the the top {top_words} words with weights:")
        topic_df = topic_mixture.iloc[topic]\
                             .sort_values(ascending = False).head(top_words)
        
        print(round(topic_df,3))

In [34]:
print_topics(lda_model, vectorizer, 5)

----------
For topic 0, here are the the top 5 words with weights:
do         6908.522
prazo      6861.191
ante       5235.192
produto    5028.258
chegou     3662.741
Name: 0, dtype: float64
----------
For topic 1, here are the the top 5 words with weights:
que    2301.311
não    1895.194
de     1703.642
um     1316.189
ma     1275.192
Name: 1, dtype: float64
----------
For topic 2, here are the the top 5 words with weights:
com      1925.561
de       1237.232
muito    1213.968
da       1102.427
loja     1067.190
Name: 2, dtype: float64
----------
For topic 3, here are the the top 5 words with weights:
produto      2796.627
entrega      2240.940
recomendo    2083.550
tudo         2055.190
ótimo        1733.190
Name: 3, dtype: float64
----------
For topic 4, here are the the top 5 words with weights:
de           2175.718
qualidade    2097.192
produto      1623.216
muito        1346.021
boa          1061.192
Name: 4, dtype: float64


❓ **Question: Analyse the reviews to understand what could be the causes of the bad review scores** ❓

This challenge is not as guided as the previous ones. But here are some questions to ask yourself:

- Are all the reviews relevant ? 
- What about combining the title and the body of a review ?
- What cleaning operations would you apply to the reviews ?

🇧🇷 Some Brazilian expressions and their translations:

- `producto errado` = wrong product
- `ainda nao` = not yet
- `nao entregue` = not delivered
- `nao veio` = did not come
- `nao gostei` = did not like it
- `produto defeito` = defective product
- `nao functiona` = not working
- `produto diferente` = different product
- `pessima qualidade` = poor quality
- `veio defeito` = came defect
- `veio faltando` = came missing
- `veio errado` = came wrong

🏁 Congratulations. Instead of reading 90K+ reviews, you were able to detect the main reasons of dissatisfactions on Olist.

💾 Don't forget to `git add/commit/push`