# Text Mining

Marissa Berk

###### Bag of words: 
The bag-of-words model is a way of extracting features from text. It is very simple because it ignores semantics, syntax, morphology and pragmatics; however, it is still effective for many languages. The model treats a document as an unordered collection of words and measures how often each known word occurs in it.
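
As a minimal sketch of the idea, the snippet below counts the words in two made-up sentences using only the Python standard library. The sentences and the helper name `bag_of_words` are assumptions made for illustration; they are not part of the dataset or of sklearn.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two made-up example documents (not taken from the reviews dataset)
doc_a = "I love this dress, love the fit"
doc_b = "The fit is not great"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

print(bag_a)
print(bag_b)

# Word order is discarded: both phrases produce the same bag
print(bag_of_words("love the fit") == bag_of_words("fit the love"))
```

Because only the counts are kept, two texts with the same words in a different order are indistinguishable to the model.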
###### Naive Bayes:
The naive Bayes model is based on Bayes' theorem, which describes conditional probability: the probability of an event given prior knowledge of conditions that might be related to it. The model is called "naive" because it assumes that all features are independent of each other within a category. So the probability of one word appearing in a text is treated as independent of whether any other word appears, even though this is rarely the case in practice.
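
Written out for a text with words $w_1, \dots, w_n$ and a category $c$ (symbols introduced here for illustration only), Bayes' theorem and the naive independence assumption look roughly like this:

$$
P(c \mid w_1, \dots, w_n) = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)},
\qquad
P(w_1, \dots, w_n \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c)
$$

The first equation is Bayes' theorem itself; the second is the simplifying assumption that makes the model "naive".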

###### How they work together:
Bayes' theorem can be used to calculate the probability that a text belongs to a certain category. The bag of words supplies the input: how often each word occurs in the texts of each category determines this probability.
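
To make this concrete, here is a small hand-rolled sketch of the calculation on made-up data. The training texts, the category names and the helper functions are all hypothetical, and the add-one smoothing is an assumption chosen to avoid zero probabilities; this is not how sklearn is implemented internally.

```python
import math
from collections import Counter

# Made-up training data: (text, category) pairs, not taken from the reviews dataset
training = [
    ("love love pretty", "positive"),
    ("pretty fit", "positive"),
    ("tight cheap", "negative"),
    ("cheap flaw tight", "negative"),
]

# Bag of words per category: how often each word occurs in each category
word_counts = {"positive": Counter(), "negative": Counter()}
doc_counts = Counter()
for text, category in training:
    word_counts[category].update(text.split())
    doc_counts[category] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(text, category, alpha=1.0):
    """Unnormalised log posterior: log P(category) + sum of log P(word | category)."""
    total_words = sum(word_counts[category].values())
    score = math.log(doc_counts[category] / sum(doc_counts.values()))
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen in training
        # Add-alpha smoothing keeps unseen word/category pairs from giving probability 0
        p = (word_counts[category][word] + alpha) / (total_words + alpha * len(vocabulary))
        score += math.log(p)
    return score

def predict_proba(text):
    """Normalise the per-category scores so that they sum to 1."""
    scores = {c: log_score(text, c) for c in word_counts}
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

probs = predict_proba("love pretty fit")
print(probs)
```

The words "love", "pretty" and "fit" occur only in the positive training texts, so the positive category receives by far the larger share of the probability.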

In [4]:
import pandas as pd

In [21]:
reviews = pd.read_csv('Clothing_reviews.csv')
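reviews.dropna() #Note: dropna() returns a new DataFrame; the result is not assigned here, so reviews itself is unchanged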
reviews.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


Since we only need the reviews of dresses, we filter the dataframe on the department name.

In [36]:
dresses = reviews['Department Name']=='Dresses'
dress_reviews = reviews[dresses]
dress_reviews.dropna() #Note: the result is not assigned, so dress_reviews still contains rows with missing values
dress_reviews.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


To read the text and use it for our analysis, we need an object from sklearn called a _CountVectorizer_. Essentially, it builds a vocabulary (a dictionary of words) from a series of texts. It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words.

I use a list of frequent English words (_stop words_) that will not be counted, because they are not informative enough.
We will need to convert the text to Unicode strings, which is the text format the vectorizer expects. We do so by using .values.astype('U').

In [37]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = dress_reviews['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
vect
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
#feature_names
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")


There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']



Now that we have the dictionary, we can count the occurrences of each word in each review. This way, we can create a document-feature matrix, with documents (reviews) in the rows and features (words) in the columns.

In [38]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents


  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1



As you can see, no zeroes are printed. The matrix consists mostly of zeroes, so they are left out to save memory. Instead, only the positions of the non-zero cells are listed, together with their values. This is a so-called sparse matrix. We can convert it to a regular (dense) matrix with .toarray(), but the model below works on the sparse matrix directly.
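
To illustrate what the printed entries mean, the sketch below rebuilds the dense 50 by 50 block by hand in plain Python, using the eight non-zero cells copied from the output above. It only mimics the idea behind .toarray(); the variable names are my own and this is not how the sparse matrix is implemented internally.

```python
# Non-zero entries copied from the printed output above: (row, column) -> count
entries = {
    (2, 8): 1, (20, 38): 1, (21, 4): 1, (21, 45): 1,
    (22, 12): 1, (26, 40): 1, (35, 12): 2, (39, 31): 1,
}

n_rows, n_cols = 50, 50  # the printed slice covers 50 reviews and 50 words

# Start from an all-zero matrix and fill in only the stored cells
dense = [[0] * n_cols for _ in range(n_rows)]
for (row, col), count in entries.items():
    dense[row][col] = count

n_cells = n_rows * n_cols
n_nonzero = sum(1 for row in dense for value in row if value != 0)
print(f"{n_nonzero} of {n_cells} cells are non-zero ({n_nonzero / n_cells:.2%})")
```

Storing eight entries instead of 2500 cells is where the memory saving comes from, and the effect grows with the full vocabulary of 8080 words.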


# Building the Model

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

In [50]:
nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = dress_reviews['Rating'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model: X=features (word counts), y=ratings

# Evaluating the Model

In [51]:
#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.5891350210970464


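The accuracy is 58.9%, which does not seem great. With five categories, random guessing would give about 20%, so it is not too bad. However, the ratings are very unbalanced: 1045 of the 1896 test reviews have 5 stars, so always predicting 5 stars would already give about 55.1% accuracy. The model improves on that baseline only slightly.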
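
A quick way to put the accuracy in context is to compare it with a trivial baseline. The sketch below uses the per-rating test counts from the "support" column of the classification report printed further down in this notebook; the variable names are my own.

```python
# Number of test reviews per rating, taken from the "support" column
# of the classification report in this notebook
support = {1: 59, 2: 130, 3: 253, 4: 409, 5: 1045}

total = sum(support.values())
majority_rating = max(support, key=support.get)
baseline_accuracy = support[majority_rating] / total

print(f"Always predicting {majority_rating} stars gives {baseline_accuracy:.1%} accuracy")
print(f"Guessing uniformly at random gives {1 / len(support):.1%} accuracy")
```

The majority-class baseline comes out at about 55.1%, which is a fairer yardstick for the 58.9% than the 20% of random guessing.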

In [71]:
nb.classes_

array([1, 2, 3, 4, 5])

In [72]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        59
           2       0.20      0.04      0.06       130
           3       0.39      0.28      0.33       253
           4       0.34      0.28      0.31       409
           5       0.69      0.89      0.77      1045

    accuracy                           0.59      1896
   macro avg       0.32      0.30      0.29      1896
weighted avg       0.52      0.59      0.54      1896



The model performs far better on 5-star reviews than on any other rating. For a rating of 1, both precision and recall are 0.00, which means that none of the 59 one-star reviews in the test set were predicted correctly. For ratings 2 to 4 the precision lies between 0.20 and 0.39 and the recall between 0.04 and 0.28. For a rating of 5, however, the precision is 69% and the recall is 89%, which is significantly better.
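
For readers unfamiliar with these two measures, here is a minimal sketch of how precision and recall are computed for a single rating. The true and predicted ratings are made up for illustration and the helper `precision_recall` is hypothetical; in the notebook itself, `classification_report` does this work.

```python
# Hypothetical true and predicted ratings for ten reviews (made up for illustration)
y_true = [5, 5, 5, 5, 4, 4, 3, 2, 1, 1]
y_pred = [5, 5, 5, 4, 5, 4, 5, 3, 5, 2]

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)  # how often the label was predicted
    actual = sum(1 for t, _ in pairs if t == label)     # how often the label really occurs
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

for label in sorted(set(y_true)):
    p, r = precision_recall(y_true, y_pred, label)
    print(f"{label} stars: precision={p:.2f}, recall={r:.2f}")
```

In this toy data the label 1 is never predicted at all, so its precision and recall are both 0, the same pattern the report shows for one-star reviews.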

In [73]:
df= dress_reviews[['Rating','Review Text']]
df.head()

Unnamed: 0,Rating,Review Text
1,5,Love this dress! it's sooo pretty. i happene...
2,3,I had such high hopes for this dress and reall...
5,2,"I love tracy reese dresses, but this one is no..."
8,5,I love this dress. i usually get an xs but it ...
9,5,"I'm 5""5' and 125 lbs. i ordered the s petite t..."


In [74]:
print(df.iloc[0,1])
print(nb.predict_proba(X[0]))

Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
[[5.24593569e-13 7.05443377e-08 2.03491938e-05 9.88314584e-02
  9.01148122e-01]]


In [75]:
for i in range(10):
    prob = nb.predict_proba(X[i])
    print(f"Review Text: {i}. {df.iloc[i,1]}")
    print(f"1 star: {prob[0,0]}, 2 star: {prob[0,1]}, 3 star: {prob[0,2]}, 4 star: {prob[0,3]}, 5 star: {prob[0,4]}")



Review Text: 0. Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
1 star: 5.245935688524858e-13, 2 star: 7.054433768073361e-08, 3 star: 2.034919381850184e-05, 4 star: 0.098831458399072, 5 star: 0.9011481218622508
Review Text: 1. I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
1 star: 1.201685

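At first glance it looks as if some probabilities are greater than 1, but this is only how they are displayed. Very small numbers are printed in scientific notation, so 5.245935688524858e-13 means about 5.25 × 10^-13, which is practically zero. I am not receiving any errors because nothing is wrong: the five probabilities for each review add up to 1.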
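
The sketch below prints the probabilities of the first review in fixed-point notation, which is easier to read. The values are copied from the `predict_proba` output above; the variable names are my own.

```python
# Probabilities for the first review, copied from the predict_proba output above
probs = [5.24593569e-13, 7.05443377e-08, 2.03491938e-05, 9.88314584e-02, 9.01148122e-01]

for stars, p in enumerate(probs, start=1):
    # :.6f forces fixed-point notation instead of scientific notation
    print(f"{stars} star: {p:.6f}")

print(f"Largest value: {max(probs):.6f}")
print(f"Sum of all five: {sum(probs):.6f}")
```

Using a format such as `{prob[0,0]:.6f}` in the loop above would give the same readable output for every review.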

For review text (0) the comments are quite positive: the customer says she loves the dress and that it is "sooo pretty". The predictions for the higher ratings should therefore be higher than those for the lower ratings, and they are. The model assigns a probability of about 0.90 to 5 stars and about 0.10 to 4 stars, while the lower ratings get almost nothing. This matches the actual rating of 5.

For review text (7) the customer mentions that she thought the dress was beautifully made, so the prediction should indicate a positive review, but this is not the case. Perhaps the discrepancy arises because the user mentions 'doubt'.

For review text (8) the review is confusing even for me, so I can understand why the algorithm could not predict it correctly. The user makes both positive and negative comments, so it is difficult to gauge her rating.