<h1> Week 6: Text Mining</h1>

In this notebook I will be looking at reviews from dresses to determine if a review is positive or negative. 

<b>'Bag-of-words' model</b>
This is a model that looks at a document or text and generates a vocabulary with a count for each individual word used in the document or text, it can then be used for an algorithm, it is often used for NLP.

<b> Naive Bayes</b>
This is an algorithm based on Bayes' theorem, it is used for classification. In the context of text classification, the frequency of a word determines the probability.

<h2> Pre-processing </h2>

In [86]:
import pandas as pd

In [87]:
# first read in dataset and look at the head
df = pd.read_csv("Assignment text mining - data clothing reviews.csv")

df.head(3)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses


In [88]:
'''we need only the reviews that are about dresses so we will need to select the right department, 
let's look at all the possible departments to make sure we don't miss anything''' 
df['Department Name'].value_counts()

Tops        10468
Dresses      6319
Bottoms      3799
Intimate     1735
Jackets      1032
Trend         119
Name: Department Name, dtype: int64

In [89]:
# now we know that selecting 'Dresses' will suffice
df = df[df['Department Name'] == 'Dresses']
df.head(3)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses


In [90]:
# we turn the ratings into the categories 'negative', 'neutral' and 'positive' because we are doing classification prediction'
df['Rating'].replace({1: 'Negative', 2: 'Negative', 3: 'Neutral', 4: 'Positive', 5: 'Positive'}, inplace = True)

# we will drop all reviews that have no text
df['Review Text'] = df['Review Text'].dropna()

df = df[['Rating', 'Review Text']] 

df.head(15)

Unnamed: 0,Rating,Review Text
1,Positive,Love this dress! it's sooo pretty. i happene...
2,Neutral,I had such high hopes for this dress and reall...
5,Negative,"I love tracy reese dresses, but this one is no..."
8,Positive,I love this dress. i usually get an xs but it ...
9,Positive,"I'm 5""5' and 125 lbs. i ordered the s petite t..."
10,Neutral,Dress runs small esp where the zipper area run...
11,Positive,This dress is perfection! so pretty and flatte...
12,Positive,More and more i find myself reliant on the rev...
14,Neutral,This is a nice choice for holiday gatherings. ...
19,Positive,I love the look and feel of this tulle dress. ...


<h2> Text pre-processing </h2>

In [91]:
from sklearn.feature_extraction.text import CountVectorizer 

# Taking the review from the df, we have to convert it to Unicode
review = df['Review Text'].values.astype('U') 

# Create the CV object, with English stop words and fit the model with the words from the review text
vect = CountVectorizer(stop_words='english') 
vect = vect.fit(review) 

# we will create the matrix with the CountVectorizer object transform 
docu_feat = vect.transform(review) 

#Let's print a little part of the matrix: the first 50 words & documents
print(docu_feat[0:50,0:50]) 

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1


<h2>Setting up the model</h2>

In [92]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Setting up the data and model
nb = MultinomialNB()

#selecting the variables to go into my X matrix and creating the y vector
X = docu_feat 
y = df['Rating'] 

# splitting the data into a test and train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

nb = nb.fit(X_train, y_train)

<h2> Evaluation </h2>

In [93]:
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8022151898734177

In [94]:
df['Rating'].value_counts(normalize=True)

Positive    0.758348
Neutral     0.132616
Negative    0.109036
Name: Rating, dtype: float64

As you can see the accuracy is 80,2% which is not great because we would get only 5% lower if we guessed positive for all the reviews.

In [95]:
# making a confusion matrix to look at the predictions
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Negative', 'Neutral', 'Positive'], columns=['Negative pred', 'Neutral pred', 'Positive pred'])
cm

Unnamed: 0,Negative pred,Neutral pred,Positive pred
Negative,58,50,104
Neutral,17,58,154
Positive,12,38,1405


In [96]:
# to calculate the precision and recall of the model we are going to use the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

    Negative       0.67      0.27      0.39       212
     Neutral       0.40      0.25      0.31       229
    Positive       0.84      0.97      0.90      1455

   micro avg       0.80      0.80      0.80      1896
   macro avg       0.64      0.50      0.53      1896
weighted avg       0.77      0.80      0.77      1896



<b> Probabilities in individual cases</b>

I will check out what the predictions were for the individual reviews to understand a bit better why the model predicts certain things and why it makes mistakes in certain cases.

In [97]:
# writing a for loop to iterate over multiple predictions and show the actual probabilities
for i in range(15):
    prob = nb.predict_proba(X[i])
    print("")
    print(f"Review: {i}. {df.iloc[i,1]}")
    print("")
    print(f"Negative: {prob[0,0]}, Neutral: {prob[0,1]}, Positive: {prob[0,2]}")


Review: 0. Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.

Negative: 3.736055562303289e-08, Neutral: 5.802706763065579e-06, Positive: 0.9999941599326692

Review: 1. I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c

Negative: 0.016830442705165087, Neutral: 0.9822788120063062, Positive: 0.00

Looking at the probabilities and the header table I printed earlier I spotted a few cases in which the model made an incorrect prediction.

In [98]:
print(df.iloc[2,1])
print(nb.predict_proba(X[2]))

I love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually wear a 0p in this brand. this dress was very pretty out of the package but its a lot of dress. the skirt is long and very full so it overwhelmed my small frame. not a stranger to alterations, shortening and narrowing the skirt would take away from the embellishment of the garment. i love the color and the idea of the style but it just did not work on me. i returned this dress.
[[1.05304602e-04 3.51672452e-01 6.48222243e-01]]


Review 0: Note how in this review, which should have been a negative. The model predicts it to be positive, this is probably because of the word 'love', which it associates with positivity. 

In [99]:
print(df.iloc[12,1])
print(nb.predict_proba(X[12]))

Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured.
[[1.37971202e-04 5.74907867e-03 9.94112950e-01]]


Review 12: This review is neutral, however the words 'like' 'love' and 'good' quickly make it into a positive review for the model.

In [100]:
print(df.iloc[13,1])
print(nb.predict_proba(X[13]))

Love the color and style, but material snags easily
[[0.02029244 0.05284075 0.92686681]]


Review 13: This is the recurring theme, people tend to comment on different parts of the item but the model can't take this into account as it is naive as the name says and does not see the relation of the words in the complete text. In this case the reviewer likes the color and style but makes a negative remark about the quality. However, the word 'love' only refers to positive for the model, but this review was neutral. 

<b> Concluding </b>

Because I split the ratings into three categories it was very hard for the model to really see what a neutral review is. A lot of mistakes are probably made here because in the case of a neutral review people tend to have some positive remarks and a few negative. 

Also in some reviews people might refer to what they do find good or what they appreciate and how the item that they recieved does not meet their standards in comparison to what their earlier mentioned standards. In this case the model will predict a positive review even though the 'positive' words were illustrative of what they expected and it was actually a negative review. 