The bag-of-words model treats every word as a separate feature and counts how many times it appears in a text. A pitfall is that it cannot tell when words belong together, as in "New York": it counts "New" once and "York" once, and the fact that they form a single name is lost.
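To make the "New York" pitfall concrete, here is a minimal sketch in plain Python. The example sentence is made up for illustration, and the whitespace tokenisation is a simplification of what a real vectorizer does.

```python
from collections import Counter

# Hypothetical example sentence, not taken from the dataset
text = "I moved to New York because New York never sleeps"
tokens = text.lower().split()

# Unigram bag of words: every token is counted on its own
unigrams = Counter(tokens)

# Bigrams: pairs of neighbouring tokens, so "new york" survives as one feature
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams["new"], unigrams["york"])  # 2 2
print(bigrams[("new", "york")])           # 2
```

In scikit-learn, the `ngram_range` parameter of `CountVectorizer` (for example `ngram_range=(1, 2)`) adds such word pairs to the vocabulary, at the cost of a much larger feature matrix.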

The Naïve Bayes model works differently. It estimates the probability that a text belongs to a category, such as 'spam', from the words the text contains. It is called naïve because it treats each word as independent of the others once the category is known. In practice it takes the words in, for example, the title of an email, and combines how often each of them occurs in spam and in non-spam titles into a probability that the email is spam.
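The calculation can be sketched by hand for a toy spam example. The four training titles and the helper names `log_score` and `predict` are hypothetical, and add-one smoothing is assumed so that a word unseen in one category does not force its probability to zero.

```python
import math
from collections import Counter

# Hypothetical toy training data: (email title, label)
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda today", "ham"),
    ("lunch today", "ham"),
]

labels = sorted({label for _, label in train})
word_counts = {label: Counter() for label in labels}
doc_counts = Counter()
for title, label in train:
    word_counts[label].update(title.split())
    doc_counts[label] += 1
vocab = {word for counts in word_counts.values() for word in counts}


def log_score(title, label, alpha=1.0):
    """log P(label) plus the sum of log P(word | label), with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in title.split():
        if word not in vocab:
            continue  # words never seen in training carry no information
        score += math.log(
            (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        )
    return score


def predict(title):
    """Return the label with the highest score for this title."""
    return max(labels, key=lambda label: log_score(title, label))


print(predict("free money"))    # spam
print(predict("agenda today"))  # ham
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when a text contains many words.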

In [29]:
import seaborn as sns 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
import math
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('dataset.csv')
df = df.dropna()
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
6,6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits


In [30]:
df = df.loc[df["Class Name"].isin(["Dresses"])] #Keep only the reviews of dresses


In [31]:
df_subset = df[["Review Text", "Rating", "Class Name"]]
df_subset.head(15)

Unnamed: 0,Review Text,Rating,Class Name
2,I had such high hopes for this dress and reall...,3,Dresses
5,"I love tracy reese dresses, but this one is no...",2,Dresses
8,I love this dress. i usually get an xs but it ...,5,Dresses
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,Dresses
10,Dress runs small esp where the zipper area run...,3,Dresses
12,More and more i find myself reliant on the rev...,5,Dresses
14,This is a nice choice for holiday gatherings. ...,3,Dresses
19,I love the look and feel of this tulle dress. ...,5,Dresses
21,"I'm upset because for the price of the dress, ...",4,Dresses
22,"First of all, this is not pullover styling. th...",2,Dresses


In [32]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df_subset['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text

feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 7747 words in the vocabulary. A selection: ['allusion', 'allusione', 'almsot', 'alr', 'alright', 'als', 'altar', 'alter', 'alteration', 'alterations', 'altered', 'altering', 'alternate', 'alternations', 'alternative', 'althetic', 'altho', 'altogether', 'am5', 'amadi']


In [33]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:500,0:500]) #Let's print a little part of the matrix: the first 500 documents & words, shown as (document, word) count

  (1, 7)	1
  (3, 71)	1
  (3, 214)	1
  (12, 480)	1
  (13, 200)	1
  (14, 78)	1
  (14, 408)	1
  (14, 420)	1
  (15, 37)	1
  (16, 3)	1
  (16, 44)	1
  (17, 11)	1
  (17, 236)	1
  (18, 54)	1
  (19, 39)	1
  (21, 59)	1
  (21, 216)	1
  (21, 225)	1
  (22, 229)	1
  (23, 102)	1
  (23, 178)	1
  (23, 222)	1
  (23, 243)	1
  (24, 403)	1
  (26, 11)	2
  :	:
  (475, 96)	1
  (475, 455)	2
  (476, 63)	1
  (476, 200)	1
  (476, 334)	1
  (477, 344)	1
  (478, 156)	1
  (479, 216)	1
  (479, 368)	1
  (482, 309)	2
  (484, 130)	1
  (489, 362)	1
  (490, 112)	1
  (492, 11)	2
  (492, 364)	1
  (493, 50)	1
  (493, 480)	1
  (497, 442)	1
  (498, 334)	1
  (499, 11)	1
  (499, 165)	1
  (499, 187)	1
  (499, 219)	1
  (499, 248)	1
  (499, 451)	1


In [34]:
from sklearn.model_selection import train_test_split

y = df_subset['Rating'] # defining the target variable (dependent variable) as y
X = docu_feat
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #test_size=0.3 holds out 30% of the rows for testing

In [35]:
y_train.value_counts()

5    1971
4     868
3     517
2     281
1     122
Name: Rating, dtype: int64
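The counts above show that the ratings are far from evenly spread. As a rough reference point, the sketch below computes the share of each rating and the accuracy of a trivial model that always predicts the most common one. The numbers are copied from the output above. The "always predict the majority" baseline is an addition for illustration, not part of the original analysis.

```python
# Training-set counts per rating, copied from y_train.value_counts()
train_counts = {5: 1971, 4: 868, 3: 517, 2: 281, 1: 122}

n = sum(train_counts.values())
majority, majority_count = max(train_counts.items(), key=lambda kv: kv[1])
baseline = majority_count / n  # accuracy of always predicting the majority rating

for rating, count in sorted(train_counts.items()):
    print(f"rating {rating}: {count / n:.1%}")
print(f"always predicting {majority} is right {baseline:.1%} of the time on the training set")
```

Any classifier trained on this data should be judged against that baseline rather than against the 20% that random guessing over five balanced classes would give.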

In [36]:
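clf = MultinomialNB() #The multinomial Naive Bayes classifier
clf.fit(X, y) #Note: this fits on all rows, including the ones held out in X_test
print(clf.predict(X)) #Predictions for the same rows the model was fitted on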

[3 2 5 ... 5 3 5]
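The cell above fits the classifier on `X` and `y`, which also contain the rows held out for testing. A minimal sketch of fitting on the training split only is given below. The toy reviews and ratings are hypothetical stand-ins for the dataset, and the names `X_toy`, `X_tr`, `X_te` and `model` are chosen here to avoid clashing with the notebook's variables.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data; in the notebook this would be the review text and ratings
texts = ["love this dress", "great fit love it", "runs small poor fit",
         "poor quality returned it", "love the fabric", "too small returned"] * 5
ratings = [5, 5, 2, 2, 5, 2] * 5

X_toy = CountVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, ratings, test_size=0.3, random_state=1
)

model = MultinomialNB()
model.fit(X_tr, y_tr)           # only the training rows are seen during fitting
print(model.score(X_te, y_te))  # accuracy on rows the model has not seen
```

With this setup the test-set scores measure how the model handles reviews it was not fitted on.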


In [37]:
from sklearn.metrics import confusion_matrix

y_test_pred = clf.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix" on the test set
cm

array([[ 16,   3,  24,  10,   6],
       [  0,  44,  37,  21,  25],
       [  0,   0, 149,  29,  42],
       [  0,   0,  11, 218, 116],
       [  1,   0,   6,  32, 822]])

In [40]:
conf_matrix = pd.DataFrame(cm, index=['1', '2', '3', '4', '5'], columns = ['1p', '2p', '3p', '4p', '5p']) 
conf_matrix

Unnamed: 0,1p,2p,3p,4p,5p
1,16,3,24,10,6
2,0,44,37,21,25
3,0,0,149,29,42
4,0,0,11,218,116
5,1,0,6,32,822


In [39]:
clf.classes_

array([1, 2, 3, 4, 5])

In [41]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

           1       0.94      0.27      0.42        59
           2       0.94      0.35      0.51       127
           3       0.66      0.68      0.67       220
           4       0.70      0.63      0.67       345
           5       0.81      0.95      0.88       861

    accuracy                           0.77      1612
   macro avg       0.81      0.58      0.63      1612
weighted avg       0.78      0.77      0.76      1612



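The accuracy on the test set is fairly high (77%) considering there are 5 categories (5 ratings). The precision for the 1- and 2-star ratings is very high (0.94), but their recall is low (0.27 and 0.35): the model rarely predicts these ratings, yet it is usually right when it does. The precision for the 3- and 4-star ratings is lower (0.66 and 0.70). These neighbouring ratings are probably written in similar words, which makes them harder to tell apart. Note that the classifier was fitted on all of X and y, which includes the test rows, so these scores are likely more optimistic than they would be on unseen reviews.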
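The figures in the classification report can be checked directly against the confusion matrix. The sketch below copies the matrix from the output above and recomputes accuracy, precision and recall in plain Python. The variable names are chosen for this illustration.

```python
# Confusion matrix copied from the output above:
# rows are the true ratings 1-5, columns are the predicted ratings 1-5
cm = [
    [16, 3, 24, 10, 6],
    [0, 44, 37, 21, 25],
    [0, 0, 149, 29, 42],
    [0, 0, 11, 218, 116],
    [1, 0, 6, 32, 822],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # the diagonal holds the correct predictions
accuracy = correct / total
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.77

precisions, recalls = [], []
for i, row in enumerate(cm):
    predicted_as_i = sum(r[i] for r in cm)  # column sum: everything predicted as this rating
    actually_i = sum(row)                   # row sum: everything that truly has this rating
    precisions.append(cm[i][i] / predicted_as_i)
    recalls.append(cm[i][i] / actually_i)
    print(f"rating {i + 1}: precision = {precisions[i]:.2f}, recall = {recalls[i]:.2f}")
```

Precision divides by the column sum and recall by the row sum, which is why the 1-star rating can have high precision (16 of 17 predictions correct) and low recall (16 of 59 true 1-star reviews found) at the same time.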