# Women's E-Commerce Clothing Reviews

1. The bag-of-words model ignores word order and the contextual relationships between words. It keeps only a weight for each word, which depends on how often that word occurs in the text.
   It first splits each text into words and then counts the occurrences of each word. Collecting these counts for every text sample gives the word-based features: one row per text and one column per word.


2.  Naïve Bayes works as follows:

    First, get the "prior probability" of each class (0 or 1 in this assignment) from the training set.

    Second, count the frequency of each word in each class.

    Third, multiply the "prior probability" by the likelihood of each word in the text given the class (estimated from the word frequencies in the second step) to obtain a score proportional to the "posterior probability" of each class. The class with the higher score is the predicted result.
    
$$P(Y = C_i \mid X)$$  
is the probability that a text belongs to class $C_i$, given that the combination of words $X$ has been observed.  
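
The three steps above can be sketched by hand on a toy example. This is only an illustration under stated assumptions: the three training "reviews" are made up, the helper names `train_nb` and `predict_nb` are hypothetical, and add-one (Laplace) smoothing with `alpha=1.0` is assumed so that unseen words do not produce zero probabilities. Logarithms are summed instead of multiplying probabilities, which gives the same ranking of classes.

```python
import math
from collections import Counter

# Tiny made-up training set: (tokens, label); 1 = positive, 0 = negative/neutral
train = [
    (["love", "dress"], 1),
    (["love", "fit"], 1),
    (["poor", "fit"], 0),
]

def train_nb(samples, alpha=1.0):
    """Estimate log priors and smoothed per-class log word likelihoods."""
    labels = [y for _, y in samples]
    vocab = sorted({w for toks, _ in samples for w in toks})
    log_prior, log_lik = {}, {}
    for c in set(labels):
        # Step 1: prior = share of training texts in class c
        log_prior[c] = math.log(labels.count(c) / len(labels))
        # Step 2: word frequencies within class c (with add-alpha smoothing)
        counts = Counter(w for toks, y in samples if y == c for w in toks)
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(tokens, log_prior, log_lik):
    """Step 3: pick the class with the highest unnormalised log posterior."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_lik = train_nb(train)
print(predict_nb(["love", "fit"], log_prior, log_lik))
print(predict_nb(["poor", "fit"], log_prior, log_lik))
```

Words that never appeared in training are simply skipped in `predict_nb`, which is one possible choice among several.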

    
    
    
    
    

In this notebook, I'll use text mining to predict whether a clothing item's rating is positive, based on the text of its online review.

In [40]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

### Read the data file

In [46]:
df = pd.read_csv('clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [49]:
df_new = df[['Review Text', 'Rating']]
df_new = df_new.dropna()

In [50]:
# if rating > 3 stars: positive feedback, coded as 1
# if rating < 4 stars: negative and neutral feedback, coded as 0
df_new.loc[df_new['Rating'] < 4, 'Rating'] = 0
df_new.loc[df_new['Rating'] > 3, 'Rating'] = 1
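
The two assignments above give the intended result only because of their order: after the first line the remaining values are 0, 4 and 5, so the second line touches only the 4- and 5-star rows. A single comparison avoids that order dependence. The sketch below uses a made-up frame `toy` and a hypothetical column name `Label`.

```python
import pandas as pd

# Made-up ratings covering every star value
toy = pd.DataFrame({"Rating": [1, 2, 3, 4, 5]})

# True for 4 and 5 stars, False otherwise; cast to 1/0
toy["Label"] = (toy["Rating"] > 3).astype(int)
print(toy)
```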

In [51]:
df_new.head()

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1


### Generate a document-feature matrix

In [53]:
text = df_new['Review Text'].values.astype('U') #Taking the text from the df, convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) # fit the model with the words from the review text
docu_feat = vect.transform(text) # make a matrix
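
To see what a document-feature matrix looks like on a small scale, here is a sketch on two made-up sentences (not taken from the dataset). It uses the default `CountVectorizer` settings without stop-word removal, so every token of two or more characters becomes a column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews (illustrative only)
docs = ["love this dress love it", "this dress runs small"]

toy_vect = CountVectorizer()              # default tokenizer, no stop-word removal
toy_counts = toy_vect.fit_transform(docs)

# vocabulary_ maps word -> column index; sort the words by that index
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
dense = toy_counts.toarray()              # rows = documents, columns = words

print(vocab)
print(dense)
```

Each row is one document and each column one word, so word order is lost and only the counts remain.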


In [16]:
print(docu_feat)

  (0, 581)	1
  (0, 2788)	1
  (0, 10683)	1
  (0, 10928)	1
  (0, 13630)	1
  (1, 1446)	2
  (1, 1845)	1
  (1, 3537)	1
  (1, 3701)	1
  (1, 4035)	1
  (1, 5421)	1
  (1, 5725)	1
  (1, 5930)	1
  (1, 6667)	1
  (1, 6754)	1
  (1, 6986)	1
  (1, 7137)	1
  (1, 7257)	2
  (1, 7671)	1
  (1, 8363)	1
  (1, 8431)	1
  (1, 8888)	3
  (1, 9339)	1
  (1, 11292)	1
  (1, 11630)	1
  :	:
  (22639, 8205)	1
  (22639, 8838)	1
  (22639, 8841)	1
  (22639, 10863)	1
  (22639, 11384)	1
  (22639, 11863)	1
  (22639, 12081)	1
  (22639, 12090)	1
  (22639, 12938)	1
  (22639, 13280)	1
  (22639, 13315)	1
  (22639, 13379)	1
  (22639, 13413)	1
  (22639, 13684)	1
  (22640, 2796)	1
  (22640, 4035)	1
  (22640, 4163)	1
  (22640, 4796)	1
  (22640, 4884)	1
  (22640, 5893)	1
  (22640, 7264)	1
  (22640, 8841)	1
  (22640, 9057)	1
  (22640, 9773)	1
  (22640, 13389)	1
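
Each printed entry has the form `(document row, word column)  count`. The following sketch, on made-up sentences, shows one way to map the column indices back to words through the vectorizer's `vocabulary_` attribute. The names `index_to_word` and `entries` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; "the" is an English stop word and should be dropped
docs = ["the silky dress silky", "the dress shrank"]
toy_vect = CountVectorizer(stop_words="english")
toy_matrix = toy_vect.fit_transform(docs)

# Invert the word -> column mapping so a column index can be looked up
index_to_word = {col: word for word, col in toy_vect.vocabulary_.items()}

# COO format exposes the (row, col, value) triples of the sparse matrix
coo = toy_matrix.tocoo()
entries = sorted(
    (int(r), index_to_word[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
)
for row, word, count in entries:
    print(row, word, count)
```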


### Building the model

Use the Naïve Bayes classifier from `sklearn`.

In [54]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()  
x = docu_feat #the document-feature matrix is the x matrix
y = df_new['Rating'] #creating the y vector

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)  

nb = nb.fit(x_train, y_train) #fit the model: x=features, y=class labels
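
Two details of the setup above are worth noting. The vectorizer was fitted on all reviews before the split, so the vocabulary also reflects the test texts, and the split is unseeded, so the score changes from run to run. The sketch below shows one possible alternative, not the approach used in this notebook: split the raw texts first with a fixed `random_state` and `stratify`, then fit the vectorizer and classifier together in a pipeline. The texts and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews and labels (1 = positive, 0 = negative/neutral)
texts = ["love it", "great fit", "so pretty", "love the fabric",
         "too small", "poor quality", "runs small", "returned it"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts; stratify keeps the class shares equal in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer inside the pipeline is fitted on the training texts only
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(text_train, y_tr)
preds = model.predict(text_test)
print(preds)
```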


### Evaluating the model

In [55]:
y_test_p = nb.predict(x_test)
nb.score(x_test, y_test)

0.8717797732960401

The accuracy on the test set is about 87%.  

In [57]:
df_new['Rating'].value_counts(normalize=True)

1    0.770637
0    0.229363
Name: Rating, dtype: float64
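
These class shares give a useful reference point: a model that always predicts class 1 would already be right about 77% of the time, so the 87% accuracy should be read against that baseline. The sketch below reproduces the idea with scikit-learn's `DummyClassifier` on made-up labels that have a 77/23 split. `X_toy` is a placeholder, since the dummy model ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the same 77/23 split as above
y_toy = np.array([1] * 77 + [0] * 23)
X_toy = np.zeros((len(y_toy), 1))     # features are ignored by the dummy model

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```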

Let's create a confusion matrix.

In [58]:
cm = confusion_matrix(y_test, y_test_p)
# confusion_matrix orders rows (actual) and columns (predicted) by sorted label: 0, then 1
cm = pd.DataFrame(cm, index=['0', '1'], columns=['0-pred', '1-pred'])
cm

Unnamed: 0,0-pred,1-pred
0,1057,488
1,383,4865


Let's calculate precision and recall

In [59]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           0       0.73      0.68      0.71      1545
           1       0.91      0.93      0.92      5248

    accuracy                           0.87      6793
   macro avg       0.82      0.81      0.81      6793
weighted avg       0.87      0.87      0.87      6793
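
The figures in the report can be re-derived from the four confusion-matrix counts. The sketch below copies those counts and assumes scikit-learn's default ordering, with rows as actual classes and columns as predicted classes, both sorted as 0 then 1. The row sums 1545 and 5248 match the `support` column, which is consistent with that ordering.

```python
import numpy as np

# Rows = actual class (0, 1), columns = predicted class (0, 1)
cm_values = np.array([[1057, 488],
                      [383, 4865]])

correct = cm_values.diagonal()
precision = correct / cm_values.sum(axis=0)   # correct / everything predicted as that class
recall = correct / cm_values.sum(axis=1)      # correct / everything actually in that class
accuracy = correct.sum() / cm_values.sum()

print(precision.round(2), recall.round(2), round(float(accuracy), 4))
```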



The precision for negative/neutral feedback (class 0) is 0.73; the precision for positive feedback (class 1) is 0.91.  

### Check out 3 cases where the model is off target. 

Inspect the associated texts. 

In [75]:
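shown = 0  # number of misclassified reviews printed so far
# y_test keeps the original row labels, so pair each label with its prediction by position
for i, pred in zip(y_test.index, y_test_p):
    if y_test[i] != pred:
        print(df['Review Text'][i])
        print('prediction: ' + str(pred))
        print('actual: ' + str(y_test[i]))
        print('\n')
        shown += 1
        if shown == 3:
            break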

...but it does have it's problems. the neck, although it has pretty button detail at the back, is very outsized on me. i would like to see it fit a bit more snugly as by not laying flat, necklaces bunch up and are hard to wear with it. it is just sort of 'floppy' if i'm making sense. the camisole liner is nicely made and attaches with snaps at the shoulders which gives this a definite plus. the fabric is sort of like chiffon with appliquéd flowers. the color combination is beautiful and with the
prediction: 0
actual: 1


I bought these pants in white and tan. they are comfortable and fit great.however, they tend to stretch. i just got them and they are already starting to become uncomfortably loose. i still really like them but i recommend you buy a little tight so they become comfortable.
prediction: 1
actual: 0


I got this sweater with high hopes to wear it with some leggings for christmas. it looked adorable (and long!) on the model. however, it doesn't even cover my butt, that's 

Case 1: Predicted to be negative/neutral (0), actually positive (1). The text includes many words of praise ("pretty", "nicely", "beautiful"), but I think the mistake comes from the complaints it opens with ("problems", "outsized", "hard to wear", "floppy"), which probably outweighed the praise.

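Case 2: Predicted to be positive (1), actually negative/neutral (0). I think the main reason is that the text contains several positive words ("comfortable" twice, "great", "recommend"), while the complaint rests largely on the single word "uncomfortably", and a bag-of-words model cannot see that "uncomfortably loose" qualifies the praise.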

Case 3: Predicted to be negative/neutral, actual to be positive. This is a difficult one, as the customer made it clear that the product did not match her expectations, with a lot of words like "however", "doesn't", "even" and "short". The rating is positive, probably because the customer still thinks the sweater is adorable.
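
One way to check guesses like these is to look at which words the fitted model associates with each class. The sketch below does this on a made-up toy corpus by comparing the per-class log probabilities stored in `MultinomialNB.feature_log_prob_`. It is an illustration only and was not run on the review data; the names `toy_nb`, `log_ratio` and `evidence` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up texts and labels (1 = positive, 0 = negative/neutral)
texts = ["love love comfortable", "comfortable great fit",
         "uncomfortably loose", "loose poor fit"]
labels = [1, 1, 0, 0]

toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(texts), labels)

# Words ordered by their column index in the document-feature matrix
words = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

# log P(word | class 1) - log P(word | class 0); positive values favour class 1
log_ratio = toy_nb.feature_log_prob_[1] - toy_nb.feature_log_prob_[0]
evidence = dict(zip(words, log_ratio))

for word, value in sorted(evidence.items(), key=lambda kv: kv[1]):
    print(f"{word:15s} {value:+.2f}")
```

Applied to the notebook's own `vect` and `nb`, the same difference would show how strongly each word in a misclassified review pushes toward one class or the other.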