# Women's E-Commerce Clothing Reviews

1. The bag-of-words model ignores the contextual relationships between the words in a text and considers only a weight for each word, which depends on how often the word occurs in the text.
   It first splits the text into words and then counts the number of occurrences of each word. Putting the words together with their frequencies for each text sample gives the word-based features of the text.


2.  Naïve Bayes works as follows:

    First, get the prior probability of each class (0 or 1 in this assignment) from the training set.

    Second, count the frequency of each word in each class.

    Third, multiply the prior probability by the frequency-based likelihoods of the words observed in the text, separately for each class, to obtain a score proportional to the posterior probability of that class. The class with the higher probability is the predicted result.

$$P(Y = C_i \mid X)$$  
is the probability that a text belongs to class $C_i$, given that the combination of words $X$ has been observed.
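
The three steps above can be sketched in plain Python on a toy example. This is only an illustration under stated assumptions: the three training reviews are made up, the names `fit` and `predict` are hypothetical helpers, and the sketch adds Laplace smoothing (`alpha=1.0`) and works with log probabilities, neither of which is mentioned in the description above.

```python
import math
from collections import Counter

# Hypothetical toy reviews; label 0 = positive, 1 = negative/neutral
train = [
    ("love this dress", 0),
    ("great fit love it", 0),
    ("poor fit", 1),
]

def fit(samples):
    # Step 1: class counts (they give the prior probabilities)
    class_counts = Counter(label for _, label in samples)
    # Step 2: word frequencies per class
    word_counts = {label: Counter() for label in class_counts}
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab, alpha=1.0):
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # Step 3: log prior plus the log likelihood of every known word
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    # The class with the highest score is the prediction
    return max(scores, key=scores.get), scores

model = fit(train)
label_pos, scores_pos = predict("love the fit", *model)
label_neg, scores_neg = predict("poor fit", *model)
print(label_pos, label_neg)
```

Summing logarithms instead of multiplying raw probabilities avoids numerical underflow on long texts, and the smoothing keeps a word that never occurs in one class from forcing that class's probability to zero.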

    
    
    
    
    

In this notebook, I'll use text mining to predict from the text of an online clothing review whether its rating is positive (4 or 5 stars) or negative/neutral (1 to 3 stars).

In [5]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
 

### 1. Read the data file

In [16]:
df = pd.read_csv('clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


### 2. Clean the data

In [17]:
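def clean(input_df):
    # Strip punctuation from the review text. regex=False treats every pattern
    # as a literal string, so '?' and '.' are not read as regex metacharacters.
    ret_text = input_df['Review Text'].str.replace('"', '', regex=False)
    ret_text = ret_text.str.replace(u'\u2019', '', regex=False)  # typographic apostrophe
    ret_text = ret_text.str.replace('!', '', regex=False)
    ret_text = ret_text.str.replace('-', '', regex=False)
    ret_text = ret_text.str.replace(',', ' ', regex=False)  # commas become spaces
    ret_text = ret_text.str.replace('?', '', regex=False)
    ret_text = ret_text.str.replace('.', '', regex=False)

    # Note: this overwrites the column in place, so the DataFrame passed in is modified too.
    input_df['Review Text'] = ret_text
    return input_df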
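
As a side note, the same character-level cleaning can be sketched on a single string with a translation table. This is an alternative illustration, not the method used in this notebook; `clean_text` and `REMOVED` are hypothetical names.

```python
# Characters that are deleted outright, mirroring the replacements in clean()
REMOVED = '"\u2019!-?.'

# Build one translation table: delete the characters above, turn commas into spaces
table = str.maketrans({**{ch: None for ch in REMOVED}, ',': ' '})

def clean_text(text):
    # Apply all substitutions in a single pass over the string
    return text.translate(table)

sample = "Absolutely wonderful - silky, and sexy!"
cleaned = clean_text(sample)
print(cleaned)
```

Because only single characters are removed, neighbouring spaces are left as they are, which is why double spaces can remain in the cleaned text.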

In [36]:
df_clean = clean(df)
df_clean.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful silky and sexy and comfo...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress it's sooo pretty i happened ...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,I love love love this jumpsuit it's fun fli...,5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [37]:
df_new = df_clean[['Review Text', 'Rating']]
df_new = df_new.dropna()

In [38]:
# rating > 3 stars: positive feedback, coded as 0
# rating < 4 stars: negative and neutral feedback, coded as 1
df_new['Rating'] = (df_new['Rating'] < 4).astype(int)

In [39]:
df_new.head()

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful silky and sexy and comfo...,0
1,Love this dress it's sooo pretty i happened ...,0
2,I had such high hopes for this dress and reall...,1
3,I love love love this jumpsuit it's fun fli...,0
4,This shirt is very flattering to all due to th...,0


### 3. Use the code to generate a document-feature matrix

In [40]:
text = df_new['Review Text'].values.astype('U') #Taking the text from the df, convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) # fit the model with the words from the review text
docu_feat = vect.transform(text) # make a matrix


In [41]:
#print(docu_feat)
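
To make the document-feature matrix more concrete, here is a minimal sketch on two made-up reviews. The names `toy_reviews`, `toy_vect` and `toy_matrix` are hypothetical and chosen so they do not overwrite the variables above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews, one row of the matrix each
toy_reviews = ["love this dress love it", "this dress is too small"]

toy_vect = CountVectorizer(stop_words='english')
toy_matrix = toy_vect.fit_transform(toy_reviews)

# vocabulary_ maps each kept word to its column index
toy_vocab = {word: int(col) for word, col in toy_vect.vocabulary_.items()}
print(toy_vocab)

# Dense view: rows are documents, columns are word counts
toy_counts = toy_matrix.toarray().tolist()
print(toy_counts)
```

Stop words such as "this", "it", "is" and "too" are dropped, so only "dress", "love" and "small" become columns. The real `docu_feat` has the same layout, with one row per review and one column per word in the fitted vocabulary.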

### 4. Building the model

Use the Naïve Bayes classifier from `sklearn`.

In [42]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()  
x = docu_feat #the document-feature matrix is the x matrix
y = df_new['Rating'] #creating the y vector

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)  

nb = nb.fit(x_train, y_train) #fit the model: x=features, y=class labels


### 5. Evaluating the model

In [43]:
y_test_p = nb.predict(x_test)
nb.score(x_test, y_test)

0.872515825114088

The accuracy is about 87.3%. Because the train/test split is random, this value changes slightly from run to run.

In [44]:
df_new['Rating'].value_counts(normalize=True)

0    0.770637
1    0.229363
Name: Rating, dtype: float64
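
These class shares give a useful baseline: a model that always predicts class 0 would already be right for roughly 77% of the reviews. The short sketch below compares that baseline with the accuracy reported above. It is a rough comparison only, since the class share is measured on the full data set and the accuracy on the test split; the variable names are hypothetical.

```python
# Values copied from the outputs above
majority_share = 0.770637            # share of class 0 in the full data set
model_accuracy = 0.872515825114088   # nb.score on the test split

# Error rates of the always-predict-0 baseline and of the model
baseline_error = 1 - majority_share
model_error = 1 - model_accuracy

# Fraction of the baseline's errors that the model removes
relative_error_reduction = (baseline_error - model_error) / baseline_error
print(round(relative_error_reduction, 3))
```

Seen this way, the classifier removes a bit under half of the errors the trivial baseline would make, which is a more informative reading than the raw accuracy alone.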

Let's create a confusion matrix.

In [45]:
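cm = confusion_matrix(y_test, y_test_p)
# confusion_matrix sorts the labels in ascending order: rows are the actual
# classes 0 and 1, columns are the predicted classes 0 and 1.
cm = pd.DataFrame(cm, index=['0', '1'], columns=['0-pred', '1-pred'])
cm

Unnamed: 0,0-pred,1-pred
0,4862,349
1,517,1065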


Let's calculate precision and recall

In [46]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           0       0.90      0.93      0.92      5211
           1       0.75      0.67      0.71      1582

    accuracy                           0.87      6793
   macro avg       0.83      0.80      0.81      6793
weighted avg       0.87      0.87      0.87      6793



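For positive feedback (class 0), the precision is quite good at 0.90, and the recall is also good at 0.93, meaning only about 7% of the positive reviews are predicted wrongly.
For negative/neutral feedback (class 1), the precision is weaker at 0.75 and the recall is even worse at 0.67, so about a third of these reviews are missed.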
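
The figures in the report can be recomputed by hand from the four counts in the confusion matrix. The sketch below reads the row with 4862 and 349 as actual class 0, because its sum (5211) matches the support of class 0 in the report; the variable names are hypothetical.

```python
# Counts from the confusion matrix: n_actual_predicted
n00, n01 = 4862, 349    # actual 0: predicted 0, predicted 1
n10, n11 = 517, 1065    # actual 1: predicted 0, predicted 1

# Precision: of everything predicted as a class, how much really is that class
precision_0 = n00 / (n00 + n10)
precision_1 = n11 / (n01 + n11)

# Recall: of everything that really is a class, how much was found
recall_0 = n00 / (n00 + n01)
recall_1 = n11 / (n10 + n11)

accuracy = (n00 + n11) / (n00 + n01 + n10 + n11)

print(round(precision_0, 2), round(recall_0, 2))
print(round(precision_1, 2), round(recall_1, 2))
print(round(accuracy, 4))
```

The rounded values agree with the classification report (0.90 and 0.93 for class 0, 0.75 and 0.67 for class 1) and with the accuracy from `nb.score`.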
 

### Check out 3 cases where the model is off target. 

Inspect the associated texts. 

In [47]:
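shown = 0
# y_test keeps the original row labels, while y_test_p is a plain array,
# so pair each row label with the prediction at the same position.
for idx, pred in zip(y_test.index, y_test_p):
    if y_test[idx] != pred:
        print(df.loc[idx, 'Review Text'])
        print('prediction: ' + str(pred))
        print('actual: ' + str(y_test[idx]))
        print('\n')
        shown += 1
        if shown == 3:  # stop after three misclassified reviews
            break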

I like the way the skirt of the dress swings when you move in order to get the arm holes to not be too low  i ended up with the smaller of the two sizes i usually buy i also appreciate the dress has pockets and the soft fabric the tag says you can machine wash cold gentle which is also a plus
prediction: 1
actual: 0


In the photo of the dress  you might have a hard time telling that the top half of the dress is the lace part  and is a stretchy knit fabric  while the bottom half is polyester and not stretchy at all
fyi: 36c (34d)  short waist  broad shoulders/back  usually a 10/12 in a fitted top/dress
the navy lace of the top  as i said  is knit therefore it was too stretchy  and the straps actually were already too long it was so loose  that i needed the smaller size in the top half of the dress (10)
prediction: 0
actual: 1


I remember seeing this great jackets online and knowing that i would have to have it so i traveled to the store to try it on the ruffle on the bottom was beauti

Case 1: Predicted negative/neutral, actually positive. Phrases such as "too low" and "smaller" may have made the prediction wrong.

Case 2: Predicted positive, actually negative/neutral. I'm not sure why the prediction is wrong. The negative information is quite clear, with the phrases "not stretchy at all", "too long" and "so loose".

Case 3: Predicted positive, actually negative/neutral. This one is quite difficult, with many positive phrases such as "beautiful and fluffy", "the color was perfect" and "lightweight enough".
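
One possible contributor to errors like Case 2 is the stop-word list: with `stop_words='english'`, words such as "not", "too" and "so" are removed before counting, so the negation and intensity in these phrases never reach the classifier. The sketch below checks this on the quoted phrases. It is offered as a plausible explanation rather than a verified diagnosis, and `analyzer` and `phrases` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Same settings as the vectorizer used in this notebook
analyzer = CountVectorizer(stop_words='english').build_analyzer()

phrases = ["not stretchy at all", "too long", "so loose"]

# Tokens that actually survive preprocessing for each phrase
kept = {phrase: analyzer(phrase) for phrase in phrases}
for phrase, tokens in kept.items():
    print(repr(phrase), '->', tokens)

# The words carrying the negative meaning are on the stop-word list
dropped = [w for w in ("not", "too", "so") if w in ENGLISH_STOP_WORDS]
print(dropped)
```

After filtering, "not stretchy at all" is reduced to "stretchy" alone, which on its own reads as neutral or even positive. Keeping negations in the vocabulary or adding bigrams (`ngram_range=(1, 2)`) would be one way to test whether this matters here.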