## Assignment-6: Text mining
## Prediction of a dress rating from online reviews
In this notebook I predict the rating of dresses from online reviews. Reviews are classified into two categories: positive (>3 stars) or neutral/negative (<4 stars).

Included in your Jupyter Notebook:

- Explain briefly in your own words how the bag-of-words model and Naïve Bayes work, and how they work together.
- Pre-processing steps (don’t forget to filter out all non-dress reviews).
- The head() of the resulting dataframe.
- Text pre-processing steps resulting in a document-feature matrix
- Split the file into a training and a test set.
- Train a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars).
- Evaluate the performance of your model on the test set.
- Check out 3 cases where your model is off target. Inspect the associated texts. Do you understand why your model trips up? Explain.
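
As a minimal sketch of how the two pieces fit together: the bag-of-words model turns each text into a vector of word counts and ignores word order, and Naïve Bayes then uses those counts to estimate how likely each word is under each class, treating the words as independent given the class. The four sentences and their labels below are made up for illustration and are not taken from the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy documents: 1 = positive, 0 = neutral/negative
toy_docs = [
    "love this dress",
    "great fit love it",
    "poor quality returned it",
    "bad fit poor fabric",
]
toy_labels = [1, 1, 0, 0]

# Bag-of-words: one column per vocabulary word, one row per document
toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)

# Naive Bayes: learns per-class word probabilities from the count matrix
toy_nb = MultinomialNB().fit(toy_X, toy_labels)

# New texts are counted with the same vocabulary; unseen words are ignored
toy_new = toy_vect.transform(["love the fit", "poor fabric"])
toy_pred = toy_nb.predict(toy_new)
print(sorted(toy_vect.vocabulary_))
print(toy_pred)
```

The word "fit" occurs in both classes, so it is words such as "love" and "poor" that tip the prediction one way or the other.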

In [110]:
import sklearn as sk
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [93]:
# Read data
df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


## Pre-processing steps

In [94]:
# Step 1: keep only the reviews in the 'Dresses' class
df_dresses = df.loc[df['Class Name'] == 'Dresses']
# Step 2: keep only the columns we need
df_dresses = df_dresses[['Review Text', 'Rating']]
df_dresses.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
5,"I love tracy reese dresses, but this one is no...",2
8,I love this dress. i usually get an xs but it ...,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


In [95]:
# Step 3: recode the rating into 2 groups
df_dresses.loc[df_dresses['Rating'] < 4, 'Rating'] = 0 #Neutral/negative (<4 stars)
df_dresses.loc[df_dresses['Rating'] > 3, 'Rating'] = 1 #Positive (>3 stars)
df_dresses.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
5,"I love tracy reese dresses, but this one is no...",0
8,I love this dress. i usually get an xs but it ...,1
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",1


In [96]:
# Number of positive and negative reviews
df_dresses['Rating'].value_counts()

1    4792
0    1527
Name: Rating, dtype: int64

## Cleaning and Pre-Processing of text

In [97]:
df_dresses['Review Text'] = df_dresses['Review Text'].str.lower() #Lowercase the text (CountVectorizer also does this by default)
text = df_dresses['Review Text'].values.astype('U') #Take the text from the df and convert it to Unicode (missing reviews become the string 'nan')

vect = CountVectorizer(stop_words='english') #Create the CountVectorizer object, with English stop words
vect = vect.fit(text) #Learn the vocabulary from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


In [98]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1


## Dense version of the matrix: it uses way too much memory!!

In [99]:
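#Make a regular (dense) matrix out of docu_feat, make it into a DataFrame and concatenate it along the columns
#Caution: pd.concat aligns on the index. df_dresses keeps its original row labels while the new DataFrame is numbered from 0,
#so the rows do not line up (hence the NaN rows in the output)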
rev_words = pd.concat([df_dresses, pd.DataFrame(docu_feat.toarray())], axis=1)
rev_words.head()

Unnamed: 0,Review Text,Rating,0,1,2,3,4,5,6,7,...,8070,8071,8072,8073,8074,8075,8076,8077,8078,8079
0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,love this dress! it's sooo pretty. i happene...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,i had such high hopes for this dress and reall...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Create a model and split the file into a training and a test set

In [100]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = df_dresses['Rating'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data and store it

## Model evaluation

In [101]:
nb = nb.fit(X_train, y_train) #fit the model: X=features, y=labels (0/1 rating)
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8554852320675106

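The accuracy is about 86%, which seems acceptable. Keep in mind that always predicting 'positive' would already be right for 1432 of the 1896 test reviews (about 76%).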

In [102]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Negative', 'Positive'], columns=['Negative pred', 'Positive pred'])
cm

Unnamed: 0,Negative pred,Positive pred
Negative,282,182
Positive,92,1340


The confusion matrix shows the quality of the predictions in more detail.

The way to read this is that of the 464 negative reviews, 282 are correctly predicted as 'Negative' and 182 are instead predicted as 'Positive'. The _recall_ and _precision_ for the 'Negative' category are:

$recall = \frac{282}{282 + 182} = .61$

$precision = \frac{282}{282 + 92} = .75$


Precision is important when predicting the rating of a product, but if you want to build a filter that only lets positive reviews show up on your webshop, recall for the negative category is probably more important: every negative review the model misses ends up on the site.
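
The arithmetic can be checked directly from the four cells of the confusion matrix. A small sketch in plain Python, with the counts copied from the matrix above; the variable names are only illustrative.

```python
# Counts from the confusion matrix (rows = actual, columns = predicted)
tn, fp = 282, 182   # actual negative: predicted negative / predicted positive
fn, tp = 92, 1340   # actual positive: predicted negative / predicted positive

# Metrics for the 'Negative' category
recall_neg = tn / (tn + fp)        # share of actual negatives that were found
precision_neg = tn / (tn + fn)     # share of 'Negative' predictions that were right

# Metrics for the 'Positive' category
recall_pos = tp / (tp + fn)
precision_pos = tp / (tp + fp)

# Overall accuracy: correct predictions over all test reviews
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"Negative: recall={recall_neg:.2f}, precision={precision_neg:.2f}")
print(f"Positive: recall={recall_pos:.2f}, precision={precision_pos:.2f}")
print(f"Accuracy: {accuracy:.4f}")
```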

In [103]:
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           0       0.75      0.61      0.67       464
           1       0.88      0.94      0.91      1432

    accuracy                           0.86      1896
   macro avg       0.82      0.77      0.79      1896
weighted avg       0.85      0.86      0.85      1896



In [104]:
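# Caution: pd.Series(y_test_p) is numbered 0..1895, while df_dresses keeps its original row labels.
# The assignment aligns on those labels, so a prediction is not matched to the test review it was made for.
df_dresses['Rating_pred'] = pd.Series(y_test_p)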
df_dresses.head()

Unnamed: 0,Review Text,Rating,Rating_pred
1,love this dress! it's sooo pretty. i happene...,1,0.0
2,i had such high hopes for this dress and reall...,0,1.0
5,"i love tracy reese dresses, but this one is no...",0,0.0
8,i love this dress. i usually get an xs but it ...,1,1.0
9,"i'm 5""5' and 125 lbs. i ordered the s petite t...",1,1.0
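
One thing to watch in the cell above: `y_test_p` only holds predictions for the test rows, in the order of `X_test`. Giving the predictions the index of `y_test` keeps each one attached to its own review. A minimal sketch on made-up toy data (the `toy` frame and its reviews are assumptions for illustration, not the real data):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for df_dresses, with a non-contiguous index as after filtering
toy = pd.DataFrame(
    {"Review Text": ["love this dress", "runs small", "not flattering",
                     "so pretty love it", "perfect fit", "poor fabric",
                     "great color", "cheap and itchy"],
     "Rating": [1, 0, 0, 1, 1, 0, 1, 0]},
    index=[1, 2, 5, 8, 9, 10, 11, 14],
)

toy_X = CountVectorizer().fit_transform(toy["Review Text"])
toy_y = toy["Rating"]
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=1)

toy_pred = MultinomialNB().fit(X_tr, y_tr).predict(X_te)

# y_te is a Series that kept the original row labels, so reuse its index
pred_series = pd.Series(toy_pred, index=y_te.index, name="Rating_pred")

# Only the test rows have a prediction; select them and attach it
test_rows = toy.loc[y_te.index].assign(Rating_pred=pred_series)
wrong_rows = test_rows[test_rows["Rating"] != test_rows["Rating_pred"]]
print(test_rows)
```

With this alignment, `wrong_rows` contains only test reviews whose actual and predicted ratings differ.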


## Check out 3 cases where your model is off target. 

In [105]:
df_wrong_pred=df_dresses.query('Rating != Rating_pred')
df_wrong_pred.head()

Unnamed: 0,Review Text,Rating,Rating_pred
1,love this dress! it's sooo pretty. i happene...,1,0.0
2,i had such high hopes for this dress and reall...,0,1.0
10,dress runs small esp where the zipper area run...,0,1.0
11,this dress is perfection! so pretty and flatte...,1,0.0
14,this is a nice choice for holiday gatherings. ...,0,1.0


In [106]:
df_dresses['Rating_pred'].value_counts()

1.0    330
0.0     85
Name: Rating_pred, dtype: int64

In [125]:
#I used the random function to pick the cases to check.
for _ in range(3):
    print(random.randint(1, 415))

171
413
165


In [120]:
# The random number was 180 when I ran this.
print(f" Case-1: {df_dresses.iloc[180,0]} \n")
print(f" Main Review result: {df_dresses.iloc[180,1]} \n")

 Case-1: this dress is so beautiful, the color is even more vibrant in person, a stunning emerald green. it is well made and doesn't go too low under your arm, which sometimes happens with halters. the halter material is a thicker band that gives nice support. the skirt does a little poof near your hips at the top, look at the back view photo online, you can see it there. at first i was surprised by it because i didn't see that on the front view, but then i decided it was fun and flirty. i received numer 

 Main Review result: 1 



## Case-1: The words used by the reviewer seem to be negative (too low, stunning, hips).

In [123]:
# The random number was 231 when I ran this.
print(f" Case-2: {df_dresses.iloc[231,0]} \n")
print(f" Main Review result: {df_dresses.iloc[231,1]} \n")

 Case-2: was so thrilled to receive my dress in the mail. its just what i was looking for, for my staple summer dress. i love how i can wear it with or without the sweater top, and even can pair just the sweater top with my high-waisted shorts! 

 Main Review result: 1 



## Case-2: The words used by the reviewer seem to be negative (short).

In [126]:
# The random number was 176 when I ran this.
print(f" Case-3: {df_dresses.iloc[176,0]} \n")
print(f" Main Review result: {df_dresses.iloc[176,1]} \n")

 Case-3: i read the reviews prior to purchasing but loved the pattern so much i had to have it. i was not disappointed however, had i not read the reviews i would have been. it is true the top of the dress is flat and does not have the detail as shown on the model. it also appears to be disproportionate. i am 5'1 and curvy the bottom fits well but the top is large however rectified by wearing a padded bra. since i ordered the petite i was surprised the dress was above the ankle (usually their items are l 

 Main Review result: 1 



## Case-3: The words used by the reviewer seem to be negative (had to, disproportionate, the top is large).

The customer does not seem completely satisfied, yet her rating falls in our positive group. Maybe we can improve the quality of the predictions by changing how the ratings are grouped into categories.