# Predicting the rating of a dress from online reviews

In this assignment, I predict whether dress reviews are positive (>3 stars) or neutral/negative (<4 stars). This notebook documents the investigation.

In [260]:
# Import the modules

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import numpy as np

In [261]:
# import the csv file

df = pd.read_csv('Assignment text mining - data clothing reviews.csv')

In [262]:
df.head()

,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


## Bag of words
The bag-of-words model represents each document by how often each word occurs in it, ignoring word order and grammar.
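
As a minimal sketch of the idea (not how `CountVectorizer` is implemented, and using two made-up reviews rather than the dataset), the snippet below counts the words per document with `collections.Counter`. Both reviews contain the same words in a different order, so they end up with the same bag of words.

```python
from collections import Counter

# Two made-up example reviews (hypothetical, not from the dataset)
docs = ["love this dress love the fit", "the fit love this dress love"]

# Bag of words: split on whitespace and count each word, ignoring order
bags = [Counter(doc.split()) for doc in docs]

print(bags[0])
# Same words in a different order give an identical bag of words
print(bags[0] == bags[1])
```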

## Naïve Bayes

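The Naïve Bayes model assigns each document the most probable label given the words it contains, but it treats those words as independent of one another given the label, so it does not take correlations between words into account. When two words together describe a certain item, the model does not look at the combination of those two words; it only uses the separate evidence from each word.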
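
To make the independence assumption concrete, here is a small sketch on made-up data (the four toy reviews, their labels and the `toy_` names are assumptions for illustration only). For each class, the score of a review is the log prior plus the sum of the per-word log probabilities weighted by the word counts. No term in that sum looks at two words together.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy reviews and labels (1 = positive, 0 = neutral/negative)
toy_text = ["love this cute dress", "cute but too big",
            "love the fit", "too big returned it"]
toy_y = np.array([1, 0, 1, 0])

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_text)
toy_clf = MultinomialNB().fit(toy_X, toy_y)

# Score per class = log P(class) + sum over words of count * log P(word | class).
# Each word contributes on its own; word combinations are never considered.
scores = toy_X.toarray() @ toy_clf.feature_log_prob_.T + toy_clf.class_log_prior_
manual_pred = toy_clf.classes_[scores.argmax(axis=1)]

print(manual_pred)
print(toy_clf.predict(toy_X))  # same labels as the manual computation
```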

# Pre-processing steps

In [263]:
# Keeping only the reviews for dresses

df1 = df[df['Class Name'] == 'Dresses']

In [264]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


In [265]:
# Removing unnecessary columns

df1 = df1.drop(['Unnamed: 0', 'Clothing ID'], axis=1)

In [266]:
# Removing empty values

df1 = df1.dropna()

In [267]:
# Making a distinction between positive and negative ratings, positive as 1 and neutral/negative as 0

df1.loc[df1['Rating'] < 4, 'Positive/Negative'] = '0' 
df1.loc[df1['Rating'] > 3, 'Positive/Negative'] = '1' 

df1.head()

,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Positive/Negative
2,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,0
5,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,0
8,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses,1
9,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses,1
10,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses,0



# Steps for a document feature matrix

In [269]:
text = df1['Review Text'].values.astype('U')  # Take the review text from the df and convert it to Unicode
vect = CountVectorizer(stop_words='english')  # Create the CountVectorizer object, with English stop words
vect = vect.fit(text)  # Fit the vectorizer: learn the vocabulary from the review text
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vect.get_feature_names_out().tolist()  # Get the words from the vocabulary as a list
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 7747 words in the vocabulary. A selection: ['allusion', 'allusione', 'almsot', 'alr', 'alright', 'als', 'altar', 'alter', 'alteration', 'alterations', 'altered', 'altering', 'alternate', 'alternations', 'alternative', 'althetic', 'altho', 'altogether', 'am5', 'amadi']


In [270]:
docu_feat = vect.transform(text)  # The transform method of the CountVectorizer object creates the document-feature matrix
print(docu_feat)  # Print the sparse matrix; the output is truncated to the first and last non-zero entries

  (0, 1353)	1
  (0, 1585)	1
  (0, 2074)	1
  (0, 2151)	1
  (0, 2292)	1
  (0, 2642)	1
  (0, 2782)	1
  (0, 2824)	1
  (0, 3244)	2
  (0, 3376)	1
  (0, 3443)	1
  (0, 3547)	1
  (0, 3619)	1
  (0, 3785)	1
  (0, 3921)	2
  (0, 3924)	1
  (0, 4179)	1
  (0, 4282)	1
  (0, 4554)	2
  (0, 4569)	1
  (0, 4681)	1
  (0, 4737)	1
  (0, 4776)	1
  (0, 4785)	1
  (0, 4977)	2
  :	:
  (5369, 4606)	1
  (5369, 4954)	1
  (5369, 4957)	1
  (5369, 6108)	1
  (5369, 6401)	1
  (5369, 6684)	1
  (5369, 6801)	1
  (5369, 6807)	1
  (5369, 7270)	1
  (5369, 7427)	1
  (5369, 7446)	1
  (5369, 7484)	1
  (5369, 7502)	1
  (5369, 7649)	1
  (5370, 1589)	1
  (5370, 2292)	1
  (5370, 2364)	1
  (5370, 2733)	1
  (5370, 2783)	1
  (5370, 3385)	1
  (5370, 4115)	1
  (5370, 4957)	1
  (5370, 5073)	1
  (5370, 5483)	1
  (5370, 7488)	1


In [271]:
# Final steps for the matrix (which I will not perform, to spare my laptop)

# rev_words = pd.concat([df1, pd.DataFrame(docu_feat.toarray(), index=df1.index)], axis=1)
# rev_words.head(10)

### Splitting the data into a training and testing set

In [272]:
X = docu_feat
y = df1['Positive/Negative']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

### Training a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars)


In [273]:
# Setting a variable for Naïve Bayes and fitting the data

clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB()

In [303]:
# Predicting the label of the first 6 reviews in the test set

print(clf.predict(X_test[0:6]))

['1' '0' '1' '0' '1' '0']


In [275]:
# Store the predictions for the whole test set

prediction = clf.predict(X_test)
prediction

array(['1', '0', '1', ..., '1', '1', '1'], dtype='<U1')

## Evaluation

In [276]:
y_test_p = clf.predict(X_test)
clf.score(X_test, y_test)

0.8542183622828784

The accuracy is 85.4%, but there are only two categories. What if we guessed the same category all the time?

In [277]:
df1['Positive/Negative'].value_counts(normalize=True)

1    0.753119
0    0.246881
Name: Positive/Negative, dtype: float64

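By always guessing 'Positive' we would have been right 75.3% of the time across all dress reviews (and 74.8% of the time on the test set, where 1206 of the 1612 reviews are positive). The model therefore beats this baseline by roughly 10 percentage points.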
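
This kind of baseline can also be computed with scikit-learn's `DummyClassifier`. The sketch below rebuilds a label vector with the same class counts as the test set in the classification report further down (406 neutral/negative, 1206 positive). The names `y_demo`, `X_demo` and `baseline` are made up for this example, and the feature matrix is only a placeholder because this strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the test-set class counts from the classification report:
# 406 neutral/negative ('0') and 1206 positive ('1')
y_demo = np.array(['0'] * 406 + ['1'] * 1206)
X_demo = np.zeros((len(y_demo), 1))  # placeholder; features are ignored

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

baseline_acc = baseline.score(X_demo, y_demo)
print(f"Baseline accuracy: {baseline_acc:.3f}")
```

In the notebook itself, the same classifier could be fitted on `X_train, y_train` and scored on `X_test, y_test` to compare it directly with the Naïve Bayes model.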

### Confusion matrix

In [278]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Neutral/Negative', 'Positive'], columns=['Neutral/Negative predictions', 'Positive predictions'])
cm

,Neutral/Negative predictions,Positive predictions
Neutral/Negative,238,168
Positive,67,1139


In [279]:
# Check whether the labels are correct

clf.classes_

array(['0', '1'], dtype='<U1')

In [280]:
# Classification report

print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.78      0.59      0.67       406
           1       0.87      0.94      0.91      1206

    accuracy                           0.85      1612
   macro avg       0.83      0.77      0.79      1612
weighted avg       0.85      0.85      0.85      1612



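How can we read this? The precision for the positive class is 87%: of all reviews predicted as positive, 87% were actually positive (1139 of 1307). So 13% of the time the prediction was positive but the actual rating was neutral/negative.

The recall for the positive class is 94%. This means that 94% of the actually positive reviews were predicted as positive (1139 of 1206). That is high, but because 75% of the ratings are positive, a model that always guesses 'Positive' would score even higher on this metric. The recall for the neutral/negative class is only 59%, so the model misses about four in ten of those reviews.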
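
These figures can be recomputed by hand from the confusion matrix above. A small sketch using the four counts from that matrix (the variable names are my own):

```python
# Counts from the confusion matrix (rows = actual, columns = predicted)
tn, fp = 238, 168   # actual neutral/negative: predicted negative / predicted positive
fn, tp = 67, 1139   # actual positive: predicted negative / predicted positive

precision_pos = tp / (tp + fp)   # share of positive predictions that are correct
recall_pos = tp / (tp + fn)      # share of actual positives that are found
recall_neg = tn / (tn + fp)      # share of actual neutral/negatives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision (positive): {precision_pos:.2f}")
print(f"recall (positive):    {recall_pos:.2f}")
print(f"recall (neutral/neg): {recall_neg:.2f}")
print(f"accuracy:             {accuracy:.3f}")
```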

In [327]:
# Creating a dataframe for comparing the prediction and the actual data

df2 = pd.DataFrame({'Pred': prediction, 'Actual': y_test})
df2["Comparison"] = df2["Pred"] == df2["Actual"]  # True where the prediction matches the actual label
df2 = df2.sort_values(by="Comparison")  # Misclassified reviews (False) come first
df2

,Pred,Actual,Comparison
15037,1,0,False
10669,1,0,False
5070,1,0,False
4403,0,1,False
18366,0,1,False
...,...,...,...
1448,1,1,True
5499,1,1,True
6247,1,1,True
10400,1,1,True


In [314]:
# Check what the actual rating was for the first three items of df2
# (.loc looks the row up by its index label, which is what df2 shows)

df.loc[15037, :]


Unnamed: 0                                                             15037
Clothing ID                                                             1087
Age                                                                       25
Title                                                               Runs big
Review Text                This dress looks so cute in the pictures-i lov...
Rating                                                                     2
Recommended IND                                                            0
Positive Feedback Count                                                    0
Division Name                                                        General
Department Name                                                      Dresses
Class Name                                                           Dresses
Name: 15037, dtype: object

In [317]:
# Read the full review text to see why the prediction does not match

df.loc[15037, 'Review Text']

'This dress looks so cute in the pictures-i love the style. ordered typical size and it was huge-felt like many sizes too big.'

This review was predicted as positive, but it is actually negative (2 stars). I see the words 'cute' and 'love', so that is probably where the prediction went wrong.

In [329]:
df.loc[10669, :]

Unnamed: 0                                                             10669
Clothing ID                                                             1083
Age                                                                       37
Title                                                      Beautiful idea...
Review Text                I ordered my normal size in this dress. i am 6...
Rating                                                                     3
Recommended IND                                                            1
Positive Feedback Count                                                    0
Division Name                                                        General
Department Name                                                      Dresses
Class Name                                                           Dresses
Name: 10669, dtype: object

In [330]:
df.loc[10669, 'Review Text']

"I ordered my normal size in this dress. i am 6 foot tall, but the regular sizes were too large and too long (mid-calf). i returned the dress for a size smaller in petite for a more flattering hemline. the dress is lovely, especially on the models in the pictures, but didn't quite work out for me. also, it feels like there are hundreds of closure hooks that make putting on/taking off the dress seem to take an unusually long time!"

This review was predicted as positive, but it is actually neutral/negative. I think the words 'flattering' and 'lovely' were interpreted by the algorithm as positive. The overall review is not that positive, but I can understand where the algorithm went wrong, as you have to read between the lines. The rating is also 3, which is not that bad, so the review is on the edge of being positive.

In [333]:
df.loc[5070, :]

Unnamed: 0                                                              5070
Clothing ID                                                             1095
Age                                                                       52
Title                                                             Runs small
Review Text                This dress is very cute and is made well.  i b...
Rating                                                                     3
Recommended IND                                                            1
Positive Feedback Count                                                    0
Division Name                                                 General Petite
Department Name                                                      Dresses
Class Name                                                           Dresses
Name: 5070, dtype: object

In [332]:
df.loc[5070, 'Review Text']

"This dress is very cute and is made well.  i bought up a size from what i usually where as one reviewer mentioned the dress is small in the bust. this is always a problem for me therefore i ordered a 14. i was surprised that the 14 was sung in the bust. and then understood why the 16 was sold out. i was 30/40 lbs heavier when i wore a 16..never would have thought i'd have to order up that high. i've been running a 10 or 12 depending on the level of my activity. i'm keeping the dress because i kn"

This review was predicted as positive but is actually neutral/negative. It also has 3 stars, so it is pretty close to a positive review. The words 'cute', 'well' and 'surprised' seem positive. As the review is cut off, I don't fully understand why it is not a positive review, but since it got 3 stars I don't think the customer was that unsatisfied.