# Text mining
The bag-of-words model: this model, used for text classification, represents a text by its word counts. That is all it does: it counts the number of occurrences of each word but ignores where the words appear in the text. It therefore loses context.
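
To make the loss of word order concrete, here is a minimal sketch that uses only the Python standard library. The helper `bag_of_words` and the two example sentences are made up for illustration and are not part of the pipeline used further down.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())

# Two hypothetical reviews with opposite meaning but exactly the same words
a = bag_of_words("the dress is good not bad")
b = bag_of_words("the dress is bad not good")

print(a)
print(a == b)  # the counts are identical, so a count-based model cannot tell them apart
```

Because both sentences map to the same counts, any classifier that only sees these counts has to give them the same label.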

Naive Bayes assumes that all words are independent of one another given the category, hence the 'naive'. The model also works on word counts: it uses the number of occurrences of each word to estimate how likely a text is to belong to each category and predicts the most likely one.
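
The independence assumption can be sketched by hand on a toy example. The code below is only an illustration of the idea behind a multinomial Naive Bayes classifier with add-one smoothing; the training sentences and the helper names `log_score` and `predict` are hypothetical and this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter

# Hypothetical toy training set: (text, label)
train = [
    ("love this dress", "positive"),
    ("love the fit", "positive"),
    ("poor fit", "negative"),
    ("poor quality", "negative"),
]

# Count the words per class, the documents per class, and collect the vocabulary
word_counts = {"positive": Counter(), "negative": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1
vocab = {word for counts in word_counts.values() for word in counts}

def log_score(text, label, alpha=1.0):
    """log P(label) plus the sum of log P(word | label), with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen during training are ignored
        p = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)  # 'naive': each word contributes independently
    return score

def predict(text):
    """Return the label with the highest score."""
    return max(word_counts, key=lambda label: log_score(text, label))

print(predict("love the dress"))
print(predict("poor dress"))
```

Each word adds its own term to the score regardless of its neighbours, which is exactly why the position of the words plays no role.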



# Pre-processing

In [85]:
import pandas as pd #"as pd" lets us refer to pandas by the abbreviation pd in later commands

from sklearn.model_selection import train_test_split
df = pd.read_csv('./clothing_reviews.csv')
df.head(30)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
6,6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits
7,7,858,39,"Shimmer, surprisingly goes with lots","I ordered this in carbon for store pick up, an...",4,1,4,General Petite,Tops,Knits
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


In [86]:
#rename columns so they are easier to address in the next lines
df = df.rename(columns={'Department Name': 'department_name', 'Review Text': 'review_text'})
#map the ratings to 'positive' (4 or 5) or 'negative' (1 to 3) so they can serve as class labels later
df['Rating'] = df['Rating'].map({1: 'negative', 2: 'negative', 3: 'negative', 4: 'positive', 5: 'positive'})
#keep only the reviews about dresses
df = df[df.department_name == "Dresses"]
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,review_text,Rating,Recommended IND,Positive Feedback Count,Division Name,department_name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,positive,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,negative,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",negative,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,positive,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",positive,1,0,General,Dresses,Dresses


In [87]:
#make a subset of all relevant columns
df = df[['review_text', 'Rating']]

#drop NaN
df = df.dropna()
df.head()

Unnamed: 0,review_text,Rating
1,Love this dress! it's sooo pretty. i happene...,positive
2,I had such high hopes for this dress and reall...,negative
5,"I love tracy reese dresses, but this one is no...",negative
8,I love this dress. i usually get an xs but it ...,positive
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",positive


# Text pre-processing 

In [88]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df['review_text'].values.astype('U') #Take the review text from the df and convert it to Unicode strings

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words removed
vect = vect.fit(text) #Fitting builds the vocabulary from the words in the review text
#Get the words from the vocabulary
#Note: scikit-learn 1.0+ renamed this method to get_feature_names_out(); the old name was removed in 1.2
feature_names = vect.get_feature_names()
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8079 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


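The document-feature matrix is stored in sparse form, so only the non-zero counts are printed. In each line of the output below, the first number between the brackets is the document (row) index, the second is the word (column) index, and the value after the brackets is how often that word occurs in that document.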

In [89]:
docu_feat = vect.transform(text) #The transform method of the CountVectorizer object creates the matrix
print(docu_feat[0:50,0:50]) #Print a small part of the matrix: the first 50 documents and the first 50 words

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (25, 40)	1
  (34, 12)	2
  (38, 31)	1
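
To show how such `(document, word)  count` lines come about, here is a minimal pure-Python sketch. The three documents and the one-word stop list are hypothetical, and the code only imitates the idea (a sorted vocabulary plus storage of non-zero cells); it is not the CountVectorizer implementation. The regular expression mirrors the default token pattern of CountVectorizer, which keeps tokens of two or more characters.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only
docs = ["Love this dress", "This dress runs small", "Small small zipper"]
stop_words = {"this"}

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index
all_tokens = {t for doc in docs for t in tokenize(doc)}
vocabulary = {word: col for col, word in enumerate(sorted(all_tokens))}

# Sparse storage: only non-zero cells are kept, as (document, word) -> count
sparse = {}
for row, doc in enumerate(docs):
    for word, count in Counter(tokenize(doc)).items():
        sparse[(row, vocabulary[word])] = count

print(vocabulary)
for (row, col), count in sorted(sparse.items()):
    print(f"  ({row}, {col})\t{count}")
```

A pair such as `(2, 3)` with value 2 then reads as: document 2 contains the word in column 3 twice.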


In [90]:
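#Turn docu_feat into a dense array, wrap it in a DataFrame and concatenate it with df along the columns
#Caution: df still carries its filtered index, while the new DataFrame is numbered 0..n-1.
#concat aligns on index labels, so counts can land next to a different review and unmatched rows are filled with NaN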
rev_words = pd.concat([df, pd.DataFrame(docu_feat.toarray())], axis=1)
rev_words.head(6)

Unnamed: 0,review_text,Rating,0,1,2,3,4,5,6,7,...,8069,8070,8071,8072,8073,8074,8075,8076,8077,8078
0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Love this dress! it's sooo pretty. i happene...,positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,I had such high hopes for this dress and reall...,negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"I love tracy reese dresses, but this one is no...",negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [91]:
#drop the rows that contain NaN (the rows that concat could not match on the index)
rev_words = rev_words.dropna()
rev_words.head(6)

Unnamed: 0,review_text,Rating,0,1,2,3,4,5,6,7,...,8069,8070,8071,8072,8073,8074,8075,8076,8077,8078
1,Love this dress! it's sooo pretty. i happene...,positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,I had such high hopes for this dress and reall...,negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"I love tracy reese dresses, but this one is no...",negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,I love this dress. i usually get an xs but it ...,positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",positive,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,Dress runs small esp where the zipper area run...,negative,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
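
The NaN rows in the concatenated table come from index alignment: the filtered reviews keep their original index labels, while a DataFrame built from a plain array is numbered from 0. The sketch below reproduces this on hypothetical stand-in data and shows one possible way to keep the rows together, namely passing the reviews' own index to the new DataFrame. The names `reviews` and `counts` are made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for the filtered reviews: the index has gaps (1, 2, 5)
reviews = pd.DataFrame({"review_text": ["a", "b", "c"]}, index=[1, 2, 5])
# Stand-in for the dense count matrix: one row per review, in the same order
counts = [[1, 0], [0, 2], [3, 0]]

# Default index 0, 1, 2: concat matches rows by label, not by position
misaligned = pd.concat([reviews, pd.DataFrame(counts)], axis=1)

# Reusing the reviews' index keeps every count row next to its own text
aligned = pd.concat([reviews, pd.DataFrame(counts, index=reviews.index)], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned result, review "a" is paired with the counts of the second review, and dropping the NaN rows afterwards does not repair that pairing.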


# Training the model

In [92]:
y = df['Rating'] #The labels (positive/negative) are the y-variable
X = docu_feat #The X is the document-feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data into 70% training and 30% test data

In [93]:
from sklearn.naive_bayes import MultinomialNB #using the Naive Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# Evaluating the model

In [94]:
from sklearn.metrics import confusion_matrix
y_test_pred = nb.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred)
cm

array([[ 301,  168],
       [ 107, 1268]])

In [95]:
#get the class labels in the order the model uses them (this is also the row/column order of the confusion matrix)
nb.classes_

array(['negative', 'positive'], dtype='<U8')

In [96]:
#To make it easier to read, turn it into a DataFrame with labels: rows are the actual ratings, columns the predicted ratings
conf_matrix = pd.DataFrame(cm, index=['negative', 'positive'], columns=['predicted negative', 'predicted positive'])
conf_matrix

Unnamed: 0,predicted negative,predicted positive
negative,301,168
positive,107,1268


In [121]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

    negative       0.74      0.64      0.69       469
    positive       0.88      0.92      0.90      1375

    accuracy                           0.85      1844
   macro avg       0.81      0.78      0.79      1844
weighted avg       0.85      0.85      0.85      1844



As the table above shows, the accuracy is 85%. For positive reviews the precision is 88%, which means that 88% of the reviews predicted to be positive really are positive. The recall is 92%, which means that 92% of all actual positive reviews were also predicted to be positive. The model does noticeably worse on negative reviews, with a precision of 74% and a recall of 64%.
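
The figures in the classification report can be recomputed by hand from the four cells of the confusion matrix. The sketch below does this with plain arithmetic; the variable names are chosen for this example, and 'positive' is treated as the positive class. The last line adds the accuracy of always predicting 'positive' as a point of comparison.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted
tn, fp = 301, 168    # actual negative: predicted negative / predicted positive
fn, tp = 107, 1268   # actual positive: predicted negative / predicted positive

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)  # share of predicted positives that are truly positive
recall_pos = tp / (tp + fn)     # share of actual positives that were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
baseline = (fn + tp) / total    # accuracy of always predicting 'positive'

print(f"accuracy      {accuracy:.2f}")
print(f"positive      precision {precision_pos:.2f}  recall {recall_pos:.2f}")
print(f"negative      precision {precision_neg:.2f}  recall {recall_neg:.2f}")
print(f"baseline      {baseline:.2f}")
```

Since about three quarters of the test reviews are positive, the accuracy is best read next to this baseline rather than on its own.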

# Examples
I will inspect some cases in which the model is wrong and evaluate why.

In [120]:
#loop over the reviews at positions 1 to 499 (X holds all reviews, including those the model was trained on)
for i in range(1, 500):
    pred = nb.predict(X[i]) #predict once and reuse the result
    #print the reviews where the predicted rating is not equal to the actual rating
    if pred[0] != df.Rating.iloc[i]:
        print('predicted to be:', pred)
        print('actually was:', df.Rating.iloc[i])
        print('text content:', df.review_text.iloc[i])
    

predicted to be: ['positive']
actually was: negative
text content: Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured.
predicted to be: ['positive']
actually was: negative
text content: Love the color and style, but material snags easily
predicted to be: ['positive']
actually was: negative
text content: Looks beautiful online but has too much material and the zipper catches on the lace. also runs very large, i am normally a small but would need and xs in this dress
predicted to be: ['positive']
actually was: negative
text content: This dress is not what i expected. the bottom half is wool-like material-looks like someone has worn it. the top snags easily so you must be careful when wearing jewelry. when i received the dress i noticed there were two small holes under the arms. i wouldn't of paid full price but for the amount, i sewed up t

What the examples above show is that reviews containing 'love' are readily marked as 'positive', even when the word is negated, as in "i'm just not in love with it". Many negative reviews also open on a positive note and turn after the word 'but'; because the model only counts words, it does not take this into account.
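
One contributing factor can be shown in a few lines: once stop words are removed, a negated sentence can end up with exactly the same counts as its positive counterpart. The stop-word set below is a small hypothetical list for illustration (scikit-learn's built-in English list also contains 'not' and 'but'), and the helper name is made up.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only
stop_words = {"i", "it", "do", "not", "but", "the", "is"}

def counts_without_stop_words(text):
    """Bag-of-words counts after dropping the stop words."""
    return Counter(w for w in text.lower().split() if w not in stop_words)

liked = counts_without_stop_words("I love it")
disliked = counts_without_stop_words("I do not love it")

print(liked)
print(disliked)
print(liked == disliked)  # the negation disappears together with the stop words
```

With both sentences reduced to a single count for 'love', the classifier has no signal left to separate them.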