# Text mining
We're going to predict from the review text whether a dress review is positive (> 3 stars), neutral (3 stars) or negative (< 3 stars).

## Method 
We're going to use the bag-of-words model and Naïve Bayes for this prediction. The bag-of-words model represents each review as a "bag" of its words: word order is ignored and only the number of times each word occurs in the review is kept. Collecting these counts for all reviews gives a document-feature matrix with one row per review and one column per word in the vocabulary. Each review is also labelled with one of the 3 categories (positive/neutral/negative) based on its star rating. We expect reviews with more than 3 stars to contain mostly positive words, and reviews with fewer than 3 stars to contain mostly negative words.
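As a minimal sketch of the bag-of-words idea, the snippet below turns two made-up reviews (not taken from the dataset) into word counts. The names `toy_reviews`, `toy_vect` and `toy_matrix` are only illustrative, and stop words are kept here so that every word stays visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews (illustrative only, not taken from the dataset)
toy_reviews = ["love this dress love the fit", "the fit is too small"]

# No stop-word removal here, so that every word shows up as a column
toy_vect = CountVectorizer()
toy_matrix = toy_vect.fit_transform(toy_reviews)

# CountVectorizer orders its columns alphabetically by word
vocab = sorted(toy_vect.vocabulary_)
print(vocab)
# One row per review, one column per word, each cell is a word count
print(toy_matrix.toarray())
```

The word "love" occurs twice in the first review, so its cell holds a 2. The order of the words within a review is lost.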

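The Naïve Bayes classifier is then used to predict whether a review is positive, neutral or negative based on the word counts of that review. It learns from the training reviews how often each word occurs in each category and assigns a new review to the most probable category, under the "naïve" assumption that words occur independently of each other given the category.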
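The following sketch shows the combination of both steps on a tiny, made-up training set with only two categories. The texts, labels and `toy_` names are assumptions for illustration and are not part of the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training reviews and labels (illustrative only)
toy_train_texts = [
    "love it great fit",
    "great dress love it",
    "poor quality returned it",
    "cheap fabric poor fit",
]
toy_train_labels = [
    "positive rating",
    "positive rating",
    "negative rating",
    "negative rating",
]

# Step 1: bag-of-words counts for the training reviews
toy_vect = CountVectorizer()
toy_X_train = toy_vect.fit_transform(toy_train_texts)

# Step 2: Naïve Bayes learns how often each word occurs per category
toy_nb = MultinomialNB().fit(toy_X_train, toy_train_labels)

# New reviews are transformed with the same vocabulary before predicting
toy_new = ["love this great dress", "poor cheap fabric"]
toy_pred = toy_nb.predict(toy_vect.transform(toy_new))
print(toy_pred)
```

Words that the vectorizer has not seen during fitting, such as "this" in the first new review, are simply ignored.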


In [1]:
# Import the required libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

## Pre-processing
First, we're going to take a look at the data, filter it on dresses, and generate a document-feature matrix.

In [2]:
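# Load the data
df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
# Filter by dresses
df = df.loc[df['Class Name'] == 'Dresses']
# Note: dropna() returns a new DataFrame; the result is not assigned here, so no rows are actually removed
df.dropna()
# Show the first rows
df.head()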

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


In [3]:
# Categorize the ratings as negative (1-2), neutral (3) and positive (4-5). These labels are the classes the model will predict

def rating(score):
    if score > 3:
        return "positive rating"
    elif score < 3:
        return "negative rating"
    else:
        return "neutral"
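As an aside, the same binning could also be written without a custom function. The sketch below uses `pd.cut` on a few made-up scores; it is an alternative for illustration, not the approach used in this notebook, and `toy_scores` and `toy_labels` are hypothetical names.

```python
import pandas as pd

# Made-up star ratings covering every possible score (illustrative only)
toy_scores = pd.Series([1, 2, 3, 4, 5])

# Bins are right-inclusive: (0, 2] negative, (2, 3] neutral, (3, 5] positive
toy_labels = pd.cut(
    toy_scores,
    bins=[0, 2, 3, 5],
    labels=["negative rating", "neutral", "positive rating"],
)
print(toy_labels.tolist())
```

Note that `pd.cut` returns a categorical column; `.astype(str)` converts it to plain strings if needed.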

In [4]:
df["Rating"] = df["Rating"].apply(rating)

In [5]:
# Generate the document-feature matrix
df = df[['Review Text', 'Rating']]
text = df['Review Text'].values.astype('U') # Take the review text from the df and convert it to Unicode strings
vect = CountVectorizer(stop_words='english') # Create the CountVectorizer object, with English stop words removed
vect = vect.fit(text) # Learn the vocabulary from the review text
docu_feat = vect.transform(text) # Build the document-feature matrix
df.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,positive rating
2,I had such high hopes for this dress and reall...,neutral
5,"I love tracy reese dresses, but this one is no...",negative rating
8,I love this dress. i usually get an xs but it ...,positive rating
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",positive rating


In [24]:
# Show values
df['Rating'].value_counts()

positive rating    4792
neutral             838
negative rating     689
Name: Rating, dtype: int64

## Document-feature matrix
Before we can start to build the model, we need to create a document-feature matrix.

In [25]:
# Convert the text to Unicode and save it as a NumPy array
text = df['Review Text'].values.astype('U')
# Create the CountVectorizer object with English stop words
vect = CountVectorizer(stop_words="english")
# Fit the CountVectorizer object on the text to learn the vocabulary
vect = vect.fit(text)
# Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
feature_names = vect.get_feature_names_out().tolist()
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


In [26]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat)

  (0, 855)	2
  (0, 1080)	1
  (0, 2124)	1
  (0, 2216)	1
  (0, 2405)	1
  (0, 3258)	1
  (0, 3436)	1
  (0, 3569)	1
  (0, 3969)	1
  (0, 4015)	1
  (0, 4156)	1
  (0, 4248)	1
  (0, 4311)	2
  (0, 4544)	1
  (0, 4922)	1
  (0, 4969)	1
  (0, 5215)	3
  (0, 5489)	1
  (0, 6636)	1
  (0, 6828)	1
  (0, 7438)	1
  (0, 7440)	1
  (1, 1421)	1
  (1, 1666)	1
  (1, 2177)	1
  :	:
  (6317, 4834)	1
  (6317, 5190)	1
  (6317, 5193)	1
  (6317, 6388)	1
  (6317, 6685)	1
  (6317, 6975)	1
  (6317, 7096)	1
  (6317, 7102)	1
  (6317, 7589)	1
  (6317, 7752)	1
  (6317, 7771)	1
  (6317, 7811)	1
  (6317, 7829)	1
  (6317, 7982)	1
  (6318, 1670)	1
  (6318, 2405)	1
  (6318, 2482)	1
  (6318, 2868)	1
  (6318, 2919)	1
  (6318, 3552)	1
  (6318, 4316)	1
  (6318, 5193)	1
  (6318, 5316)	1
  (6318, 5741)	1
  (6318, 7815)	1


In [27]:
docu_feat

<6319x8080 sparse matrix of type '<class 'numpy.int64'>'
	with 154583 stored elements in Compressed Sparse Row format>
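To get a feeling for how sparse this matrix is, the numbers reported above can be plugged into a small calculation. This is only arithmetic on the printed shape and number of stored elements; the variable names are illustrative.

```python
# Values copied from the matrix summary above
n_reviews, n_words = 6319, 8080
stored = 154583  # number of (review, word) pairs with a count of at least 1

# Share of cells that actually hold a count
density = stored / (n_reviews * n_words)
# Average number of distinct vocabulary words per review
avg_words_per_review = stored / n_reviews

print(f"density: {density:.2%}")
print(f"distinct vocabulary words per review: {avg_words_per_review:.1f}")
```

Only a small fraction of the cells is non-zero, which is why the matrix is stored in a sparse format instead of as a full table.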

## Building the model 
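Now, we're going to split the dataset into a training set (70%) and a test set (30%) and fit the multinomial Naïve Bayes classifier from sklearn on the training set. The rating category is the target (y) and the document-feature matrix docu_feat holds the features (X).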

In [30]:
X = docu_feat #the document-feature matrix is the X matrix
y = df['Rating'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1) #split the data and store it

nb = MultinomialNB() #create the Naïve Bayes classifier
nb = nb.fit(X_train, y_train) #fit the model: X=features (word counts), y=rating category
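The split above is a plain random split. Because the rating categories are imbalanced, one optional variation is a stratified split, which keeps the class proportions the same in the training and test set. The sketch below demonstrates this on made-up labels; it is not what this notebook does, and the `toy_` names are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 16 positive and 4 negative (illustrative only)
toy_y = ["positive rating"] * 16 + ["negative rating"] * 4
# Dummy features, one row per label
toy_X = [[i] for i in range(len(toy_y))]

# stratify=toy_y keeps the 80/20 class ratio in both parts of the split
_, _, _, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=1, stratify=toy_y
)

test_counts = Counter(toy_y_test)
print(test_counts)
```

With 5 test rows and an 80/20 class ratio, the test set receives 4 positive labels and 1 negative label.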

## Evaluating the model

Let's check how well our model predicts.

In [32]:
#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.7916666666666666

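An accuracy of 79.17% looks quite good at first sight. However, about 76% of the dress reviews are positive (4792 of 6319), so a model that always predicts "positive rating" would already reach roughly that accuracy.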

In [34]:
# Show values
df['Rating'].value_counts()

positive rating    4792
neutral             838
negative rating     689
Name: Rating, dtype: int64
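The class counts above can be used to estimate a simple baseline: the accuracy of a model that always predicts the most frequent category. The sketch below is plain arithmetic on the printed counts; note that it uses the full dataset, so the share in the test set may differ slightly.

```python
# Class counts copied from the value_counts() output above
counts = {"positive rating": 4792, "neutral": 838, "negative rating": 689}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the most frequent category
baseline = counts[majority_label] / total
print(f"Always predicting '{majority_label}' gives {baseline:.1%} accuracy")
```

The accuracy of the Naïve Bayes model should be read against this baseline rather than against 0%.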

In [37]:
# Show by class
nb.classes_

array(['negative rating', 'neutral', 'positive rating'], dtype='<U15')

In [38]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Negative rating (actual)', 'Neutral rating (actual)', 'Positive rating (actual)'], columns=['Negative rating (prediction)', 'Neutral rating (prediction)', 'Positive rating (prediction)'])
cm

Unnamed: 0,Negative rating (prediction),Neutral rating (prediction),Positive rating (prediction)
Negative rating (actual),62,35,107
Neutral rating (actual),28,49,183
Positive rating (actual),13,29,1390


In [39]:
# Use the built-in function classification_report to generate the precision, recall, f1-score and accuracy
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

                 precision    recall  f1-score   support

negative rating       0.60      0.30      0.40       204
        neutral       0.43      0.19      0.26       260
positive rating       0.83      0.97      0.89      1432

       accuracy                           0.79      1896
      macro avg       0.62      0.49      0.52      1896
   weighted avg       0.75      0.79      0.75      1896



The overall accuracy is 79%. Precision and recall show that positive ratings are predicted much better (recall 0.97) than negative ratings (recall 0.30) or neutral ratings (recall 0.19). The neutral rating is the hardest to predict. This seems logical: "neutral" is not a clear-cut category, and there are far fewer neutral and negative reviews than positive reviews to learn from.
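To make the link between the confusion matrix and the report explicit, the sketch below recomputes precision, recall and accuracy from the confusion matrix counts shown earlier. The array is typed in by hand from that table, and `cm_values` is an illustrative name.

```python
import numpy as np

# Confusion matrix counts copied from the table above
# Rows: actual negative, neutral, positive. Columns: predicted negative, neutral, positive.
cm_values = np.array([
    [62, 35, 107],
    [28, 49, 183],
    [13, 29, 1390],
])

true_pos = np.diag(cm_values)  # correctly predicted reviews per category

precision = true_pos / cm_values.sum(axis=0)  # correct / all predicted as that category
recall = true_pos / cm_values.sum(axis=1)     # correct / all actually in that category
accuracy = true_pos.sum() / cm_values.sum()   # correct / all test reviews

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("accuracy: ", round(accuracy, 4))
```

For the neutral category, for instance, only 49 of the 260 neutral test reviews are recognized, which gives the recall of 0.19.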

## Finding off-targets

Let's take a look at reviews that are not predicted correctly, and discuss 3 of them.

In [50]:
# Predict the rating for all reviews (training and test rows alike)
df["Rating_Prediction"] = nb.predict(X)
df_contradictions = df[df["Rating"] != df["Rating_Prediction"]]
df_contradictions.head(10)

Unnamed: 0,Review Text,Rating,Rating_Prediction
5,"I love tracy reese dresses, but this one is no...",negative rating,positive rating
10,Dress runs small esp where the zipper area run...,neutral,negative rating
23,Cute little dress fits tts. it is a little hig...,neutral,positive rating
52,"Love the color and style, but material snags e...",neutral,positive rating
69,"I really wanted this to work. alas, it had a s...",neutral,positive rating
194,Dress ran very large in every way. beautiful d...,neutral,positive rating
311,Looks beautiful online but has too much materi...,neutral,positive rating
381,Disappointed in the quality of the dress. love...,neutral,positive rating
383,This dress is not what i expected. the bottom ...,neutral,positive rating
406,I got this dress in hopes of having a really n...,neutral,positive rating


It seems that some reviews contain words that are strongly associated with a positive or negative rating, while the review as a whole says something else. For example, the word "cute" in review 23 is typical of positive reviews, so the predicted rating is positive, while the actual rating is neutral. Likewise, "beautiful" in review 311 points to a positive rating, while the actual rating is again neutral. The same goes for "love" in review 5, the first row of the table: this review is negative, but it is predicted as positive.