# Week 6 - Text mining

### Objective

Predict whether reviews of dresses are positive (4 or 5 stars) or neutral/negative (1 to 3 stars).

### Description of BoW & Naive Bayes

Naive Bayes models are probabilistic classifiers based on Bayes' theorem. They treat the features as separate and independent of one another given the class. The model is called 'naive' because it assumes that the presence of one feature is completely unrelated to the presence of any other feature, which is rarely exactly true for words in a text.
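
As a sketch of what this assumption buys (the symbols $c$ for a class and $x_1, \dots, x_n$ for the features of one review are notation assumed here, not taken from the notebook), Bayes' theorem combined with independence turns the class probability into a simple product:

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The predicted class $\hat{c}$ is the one with the largest product.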

The bag-of-words (BoW) model represents each text as a vector of word counts, ignoring grammar and word order, so that the text can be used as input for machine learning algorithms.
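
To make the idea concrete, here is a minimal sketch in plain Python. The three toy reviews are hypothetical, and the whitespace tokenisation is a simplification (`CountVectorizer` uses a regular expression instead), so this illustrates the representation rather than the library's implementation.

```python
from collections import Counter

# Hypothetical toy reviews (not taken from the dataset)
docs = ["love this dress", "this dress runs small", "love love love it"]

# Naive tokenisation: lowercase and split on whitespace
tokenised = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for tokens in tokenised for word in tokens})

# One row of counts per document, one column per vocabulary word
bow_rows = []
for tokens in tokenised:
    counts = Counter(tokens)
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

Word order is lost: the third review becomes a single count of 3 in the 'love' column.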

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import numpy as np

In [3]:
df = pd.read_csv("womens-clothing.csv")
df = df.dropna()
df.head()

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |
| 5 | 5 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 2 | 0 | 4 | General | Dresses | Dresses |
| 6 | 6 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 5 | 1 | 1 | General Petite | Tops | Knits |


### Pre-processing steps

In [4]:
df = df.loc[(df['Class Name'] == 'Dresses')]

In [5]:
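# Keep only the needed columns; .copy() avoids a SettingWithCopyWarning when the label column is added
df = df[['Review Text', 'Rating']].copy()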

In [6]:
# Label the ratings: '0' = neutral/negative (1-3 stars), '1' = positive (4-5 stars)
df.loc[df['Rating'] < 4, 'Positive or Negative'] = '0'
df.loc[df['Rating'] > 3, 'Positive or Negative'] = '1'

In [7]:
df.head()

| | Review Text | Rating | Positive or Negative |
|---|---|---|---|
| 2 | I had such high hopes for this dress and reall... | 3 | 0 |
| 5 | I love tracy reese dresses, but this one is no... | 2 | 0 |
| 8 | I love this dress. i usually get an xs but it ... | 5 | 1 |
| 9 | I'm 5"5' and 125 lbs. i ordered the s petite t... | 5 | 1 |
| 10 | Dress runs small esp where the zipper area run... | 3 | 0 |


In [8]:
# Converting text to unicode
text = df['Review Text'].values.astype('U')

In [9]:
# Bag-of-words vectorizer that drops English stop words, fitted on the reviews
vect = CountVectorizer(stop_words='english')
vect = vect.fit(text)

In [10]:
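# Vocabulary learned by the vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = vect.get_feature_names_out().tolist()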
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[700:720]}")

There are 7747 words in the vocabulary. A selection: ['attractive', 'attractively', 'attributed', 'atypical', 'auction', 'audrey', 'august', 'austin', 'authentic', 'authenticity', 'autobots', 'automatically', 'autumn', 'autumnal', 'avail', 'availability', 'available', 'average', 'avid', 'avoid']


### Document-feature matrix

In [11]:
doc_feat = vect.transform(text)
print(doc_feat[0:50, 0:50])

  (1, 7)	1
  (15, 37)	1
  (16, 3)	1
  (16, 44)	1
  (17, 11)	1
  (19, 39)	1
  (26, 11)	2
  (29, 30)	1
  (43, 0)	1
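
Each printed line is one non-zero cell of the sparse matrix in `(row, column)  count` form, so `(26, 11)  2` means that review 26 of the slice contains the word in vocabulary column 11 twice. Below is a minimal sketch of that storage idea in plain Python. The 3 x 4 count matrix is hypothetical, and the real object is a SciPy sparse matrix rather than a dict.

```python
# Hypothetical dense count matrix: 3 documents x 4 vocabulary words
dense = [
    [0, 2, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

# Keep only the non-zero cells, keyed by (row, column)
sparse_entries = {
    (row, col): count
    for row, counts in enumerate(dense)
    for col, count in enumerate(counts)
    if count != 0
}

# Print in the same "(row, column)  count" layout as the notebook output
for (row, col), count in sparse_entries.items():
    print(f"  ({row}, {col})\t{count}")
```

Storing only the non-zero cells matters here because most reviews use only a tiny fraction of the 7747 vocabulary words.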



### Model training

In [13]:
nb = MultinomialNB()
X = doc_feat
y = df['Positive or Negative']
# Train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Model fit
nb = nb.fit(X_train, y_train)
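
The cells above fit `CountVectorizer` on all reviews and split afterwards. A common alternative wraps the vectorizer and the classifier in a scikit-learn pipeline, so that the vocabulary is learned from the training data only. The sketch below shows this on toy data: the reviews, labels and variable names are hypothetical, and the vectorizer uses its defaults (no stop-word removal) to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training reviews with '1' = positive, '0' = neutral/negative
train_docs = [
    "love this dress",
    "great fit love it",
    "poor quality returned it",
    "runs small poor fit",
]
train_labels = ['1', '1', '0', '0']

# The pipeline fits the vectorizer and the classifier on the training data only
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(train_docs, train_labels)

# Words unseen during training (here 'the') are simply ignored at prediction time
new_docs = ["love the great dress", "poor quality"]
predictions = toy_model.predict(new_docs).tolist()
print(predictions)
```

With this setup the raw text is passed to `train_test_split` first and the pipeline is then fitted on the training portion.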

### Evaluation 

In [14]:
# Creating predictions
y_test_p = nb.predict(X_test)
# Accuracy of the predictions
nb.score(X_test, y_test)

0.8542183622828784

The accuracy on the test set is about 85%.

In [17]:
df['Positive or Negative'].value_counts(normalize=True)

1    0.753119
0    0.246881
Name: Positive or Negative, dtype: float64

Always guessing 'Positive' would be correct about 75% of the time, so the model's 85% accuracy is roughly a 10-point improvement over this baseline.
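
The baseline can be written as a small helper. This is a sketch under assumptions: `majority_baseline_accuracy` and the toy labels are hypothetical and only mimic the roughly 3:1 class imbalance of the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: 75 positive ('1') and 25 neutral/negative ('0')
toy_labels = ['1', '1', '1', '0'] * 25
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

On the real data the same function could be applied to `y_test` to compare the baseline and the model on identical rows.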

### Confusion Matrix

In [18]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Neutral or Negative', 'Positive'], columns=['Neutral or Negative predictions', 'Positive predictions'])
cm

| | Neutral or Negative predictions | Positive predictions |
|---|---|---|
| Neutral or Negative | 238 | 168 |
| Positive | 67 | 1139 |


In [19]:
# Classification report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           0       0.78      0.59      0.67       406
           1       0.87      0.94      0.91      1206

    accuracy                           0.85      1612
   macro avg       0.83      0.77      0.79      1612
weighted avg       0.85      0.85      0.85      1612



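The precision for 'Positive' is 87%: of all reviews predicted as positive, 87% really are positive.
The recall for 'Positive' is 94%: of all truly positive reviews, 94% were predicted as positive.
The model is weaker on neutral/negative reviews, where the recall is only 59%.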
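
These figures can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the matrix shown earlier; the variable names are chosen for illustration.

```python
# Counts copied from the confusion matrix (rows = true class)
tn, fp = 238, 168    # truly neutral/negative: predicted negative, predicted positive
fn, tp = 67, 1139    # truly positive: predicted negative, predicted positive

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # share of 'Positive' predictions that are correct
recall_pos = tp / (tp + fn)      # share of truly positive reviews that were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision_pos:.2f}")
print(f"recall    = {recall_pos:.2f}")
print(f"f1-score  = {f1_pos:.2f}")
```

The rounded values agree with the row for class 1 in the classification report.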

### Is the model off-target?

In [43]:
# Looking for wrong predictions
test_data = pd.DataFrame({'y_test' : y_test, 'y_test_p' : y_test_p})
test_data[test_data['y_test'] != test_data['y_test_p']].sort_index().head(3)

| | y_test | y_test_p |
|---|---|---|
| 12 | 1 | 0 |
| 383 | 0 | 1 |
| 417 | 0 | 1 |


In [34]:
# Mistaken review (Positive one, predicted as negative)
df.iloc[12]['Review Text']


'I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets.'

Words such as 'strange' and 'weird' may have pushed the model towards a negative prediction.

In [41]:
# Mistaken review (Negative one, predicted as positive)
df.iloc[383]['Review Text']

'This dress is adorable and well made. this brand runs on the large side. i ordered an xs and the tag says \'p" on it. it\'s not petite. it ran large through the hips and will need alteration. i\'m keeping it because it is really cute and is well made. the material is knit and soft - very comfortable. i\'m surprised nobody has reviewed this item. i probably will not buy this brand again only because it runs large.'

Phrases such as '...will not buy' and 'runs large' may have influenced the model.

In [38]:
# Mistaken review (Negative one, predicted as positive)
df.iloc[417]['Review Text']

'Tiny are experts at making busy bohemian shirtdresses that look casual but retain a feminine drape.  it skims over my trouble spots without adding bulk.  this is my fourth dress from this label and it does not disappoint.  for my frame and style i chose tts but those of slender & petite builds may want to size down.  for reference i am 5\'3" 140# 36dd.'

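This review does read as positive, so it may not be the review that was misclassified. The values 12, 383 and 417 are index labels of `test_data`, while `df.iloc[...]` selects rows by position, so `df.loc[417]` would be the matching lookup.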
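
Label-based and position-based lookup can be compared on a small example. This is a minimal sketch, assuming a hypothetical toy frame whose index has gaps like the filtered `df`.

```python
import pandas as pd

# Hypothetical frame whose index labels have gaps, as after filtering rows
toy = pd.DataFrame(
    {"review": ["a", "b", "c", "d"]},
    index=[2, 5, 8, 9],
)

by_label = toy.loc[2, "review"]       # row whose index label is 2 (the first row)
by_position = toy.iloc[2]["review"]   # third row by position (index label 8)

print(by_label, by_position)
```

The two lookups return different rows for the same number, which is why `loc` is the safer choice when the number comes from another frame's index.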