# Text modeling

In this notebook, we apply text mining techniques to a corpus of women's clothing e-commerce reviews (`Womens Clothing E-Commerce Reviews.csv`). We create the document-feature matrix from the review text and use it to predict the rating of each review.

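In natural language processing, the bag-of-words model simply counts how often each word occurs; it does not capture meaning or word order. Naïve Bayes is a probabilistic machine learning classifier that is often used for text mining. For each class C_i it estimates P(Y = C_i | X), the probability that a document with features X belongs to that class, and it predicts the class with the highest probability. The prior probability of a class reflects how often that class occurs in the training data, so sampling the classes evenly gives them equal priors. When evaluating the classifier, look for a balance between precision and recall; a confusion matrix shows which classes the algorithm predicts correctly and which ones it confuses.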
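To make P(Y = C_i | X) concrete, here is a minimal sketch of a multinomial Naïve Bayes calculation in plain Python. The four training documents, the labels `pos` and `neg`, and the add-one (Laplace) smoothing are assumptions chosen for illustration; they are not taken from the review data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, class label)
train = [
    (["love", "dress", "pretty"], "pos"),
    (["love", "fit", "perfect"], "pos"),
    (["cheap", "material", "returned"], "neg"),
    (["dress", "runs", "small", "returned"], "neg"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def posterior(tokens):
    """Return P(class | tokens) for each class, with add-one smoothing."""
    log_scores = {}
    for label in class_counts:
        # Log prior: fraction of training documents in this class
        log_p = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                log_p += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        log_scores[label] = log_p
    # Normalise so that the probabilities sum to 1
    norm = math.log(sum(math.exp(s) for s in log_scores.values()))
    return {label: math.exp(s - norm) for label, s in log_scores.items()}

probs = posterior(["love", "dress"])
print(probs)
```

The class with the highest posterior is the prediction. Working in log space avoids numerical underflow when a document contains many words.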

In [2]:
import pandas as pd

First, let's read in the data file.

In [3]:
reviews = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
reviews.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [4]:
subset = reviews[(reviews["Department Name"] == "Dresses")]   
subset.head(20)
#reviews.iloc[2,0]

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses
10,10,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses
11,11,1095,39,,This dress is perfection! so pretty and flatte...,5,1,2,General Petite,Dresses,Dresses
12,12,1095,53,Perfect!!!,More and more i find myself reliant on the rev...,5,1,2,General Petite,Dresses,Dresses
14,14,1077,50,Pretty party dress with some issues,This is a nice choice for holiday gatherings. ...,3,1,1,General,Dresses,Dresses
19,19,1077,47,Stylish and comfortable,I love the look and feel of this tulle dress. ...,5,1,0,General,Dresses,Dresses


In [5]:
df_subset = subset[["Rating", "Review Text", "Department Name"]].dropna()
df_subset.head()

Unnamed: 0,Rating,Review Text,Department Name
1,5,Love this dress! it's sooo pretty. i happene...,Dresses
2,3,I had such high hopes for this dress and reall...,Dresses
5,2,"I love tracy reese dresses, but this one is no...",Dresses
8,5,I love this dress. i usually get an xs but it ...,Dresses
9,5,"I'm 5""5' and 125 lbs. i ordered the s petite t...",Dresses


In [6]:
subset['Rating'].value_counts()

5    3397
4    1395
3     838
2     461
1     228
Name: Rating, dtype: int64

As we can see, the ratings are far from balanced: 3,397 of the 6,319 dress reviews have a rating of 5, while only 228 have a rating of 1.
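The rating counts above also give a useful reference point for any classifier. A minimal sketch, assuming the counts printed by `value_counts()` (note that they come from `subset`, before rows with missing review text are dropped, so the proportions in `df_subset` may differ slightly):

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5: 3397, 4: 1395, 3: 838, 2: 461, 1: 228}

total = sum(rating_counts.values())
for rating, count in rating_counts.items():
    print(f"rating {rating}: {count / total:.1%}")

# Accuracy of a "classifier" that always predicts the most common rating
majority_baseline = max(rating_counts.values()) / total
print(f"majority-class baseline: {majority_baseline:.3f}")
```

A model is only informative if it does clearly better than this baseline of always guessing the most frequent rating.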

To read the text and use it for our analysis, we need an object from `sklearn` called a `CountVectorizer`. Essentially, it creates a dictionary (vocabulary) from a series of texts. It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words. We also pass a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.

We will need to convert the text to Unicode, which is a standard text format. We do so by using `.values.astype('U')`.
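The tokenization step described above can be sketched in plain Python. This is a simplified illustration, not the `sklearn` implementation: the regular expression mirrors the default token pattern of `CountVectorizer` as far as I know (runs of two or more word characters), and `STOP_WORDS` is a hypothetical mini list that is much shorter than the built-in English one.

```python
import re

# Hypothetical mini stop-word list; the built-in English list is much longer
STOP_WORDS = {"this", "is", "it", "and", "the", "so"}

def tokenize(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Toy sentences for illustration
docs = [
    "Love this dress! It's sooo pretty.",
    "This dress is so pretty and the fit is perfect",
]

tokens_first = tokenize(docs[0])
vocabulary = sorted(
    {tok for doc in docs for tok in tokenize(doc) if tok not in STOP_WORDS}
)
print(tokens_first)
print(vocabulary)
```

Note how the apostrophe in "It's" acts as a separator and the single-character token "s" is dropped.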

In [7]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df_subset['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
feature_names = list(vect.get_feature_names_out()) #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")


There are 8079 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


Now that we have the dictionary, we can count the occurrences of each word in each review. This way, we can create a document-feature matrix, with documents (reviews) in the rows and features (words) in the columns.
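As a minimal sketch of what such a matrix looks like, the snippet below builds one by hand for three toy documents. The documents are hypothetical and already tokenized; the real matrix is built by `CountVectorizer` in the next cell.

```python
from collections import Counter

# Toy documents, already tokenized (hypothetical examples)
docs = [
    ["love", "dress", "love", "fit"],
    ["dress", "small"],
    ["love", "fit"],
]

# The sorted vocabulary fixes the column order
vocabulary = sorted({tok for doc in docs for tok in doc})

# One row per document, one column per word, each cell a count
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row describes one document purely by its word counts, which is exactly the information the classifier gets to see.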

In [8]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[600:950,600:950]) #Let's print a little part of the matrix: documents 600-949 and words 600-949

  (0, 33)	1
  (0, 74)	1
  (0, 217)	1
  (0, 324)	1
  (1, 208)	1
  (3, 343)	1
  (4, 156)	1
  (4, 221)	1
  (5, 191)	1
  (6, 140)	1
  (7, 343)	1
  (10, 191)	1
  (11, 240)	1
  (11, 296)	1
  (12, 50)	1
  (12, 64)	1
  (16, 276)	1
  (17, 343)	2
  (18, 77)	1
  (20, 64)	2
  (23, 303)	1
  (23, 341)	1
  (24, 276)	1
  (26, 158)	1
  (29, 324)	1
  :	:
  (315, 276)	1
  (315, 277)	1
  (316, 64)	1
  (316, 305)	1
  (323, 341)	1
  (325, 171)	1
  (325, 180)	1
  (327, 277)	1
  (327, 327)	1
  (329, 117)	1
  (330, 191)	1
  (334, 50)	1
  (336, 276)	1
  (336, 293)	1
  (338, 327)	1
  (340, 276)	1
  (342, 74)	1
  (344, 327)	1
  (346, 74)	1
  (347, 213)	1
  (347, 240)	1
  (348, 71)	1
  (348, 180)	1
  (348, 270)	1
  (349, 240)	1


As you can see, the output contains no zeroes. Because the matrix consists mostly of zeroes, they are left out; only the positions of the cells that _don't_ hold a zero are listed, together with their values. This is a so-called _sparse matrix_, which saves a lot of memory. We can convert it to a regular (dense) matrix with `.toarray()`.

In an application or Big Data analysis, you would have to be careful with that conversion, so as not to use up too much memory or computing power. The classifier used below accepts the sparse matrix directly, so we keep it as it is.
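The idea behind sparse storage can be sketched with a plain dictionary. This is only an illustration with hypothetical counts; the matrix returned by `CountVectorizer` is a SciPy sparse matrix with a more compact internal layout.

```python
# A small dense document-feature matrix (hypothetical counts)
dense = [
    [0, 0, 1, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
]

# Sparse form: store only (row, column) -> value for the non-zero cells
sparse = {
    (r, c): v
    for r, row in enumerate(dense)
    for c, v in enumerate(row)
    if v != 0
}
print(sparse)

# Converting back to a dense matrix, comparable to what .toarray() does
n_rows, n_cols = len(dense), len(dense[0])
restored = [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]

density = len(sparse) / (n_rows * n_cols)
print(f"{len(sparse)} of {n_rows * n_cols} cells are non-zero ({density:.0%})")
```

With thousands of reviews and thousands of words, the fraction of non-zero cells is typically far smaller than in this toy example, which is why the dense form gets expensive.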

In [9]:
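from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

X = docu_feat #The sparse document-feature matrix
y = df_subset['Rating'] #A 1-d Series; a one-column DataFrame triggers a column_or_1d warning in fit

#Hold out 30% of the reviews for testing. The split is random, so the score varies between runs
X_train_k, X_test_k, y_train_k, y_test_k = train_test_split(X, y, test_size=0.3)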

In [10]:
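clf = MultinomialNB()
clf.fit(X_train_k, y_train_k) #Learn the word counts per rating from the training set
print(clf.predict(X[0:20])) #Predicted ratings for the first 20 reviews (some of them are in the training set)
clf.score(X_test_k, y_test_k) #Accuracy on the held-out test set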

[5 3 5 5 5 3 5 5 3 5 3 2 5 5 4 5 3 5 4 4]




0.5845986984815619
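The score is a single accuracy figure; a confusion matrix breaks it down per class. A minimal sketch in plain Python, assuming the ratings of the first ten dress reviews shown earlier and the first ten predictions printed above. Ten rows, some of which may have been in the training set, are far too few for real evaluation, so treat this only as an illustration of the bookkeeping.

```python
from collections import Counter

# First ten reviews: actual ratings and the predictions printed above
actual    = [5, 3, 2, 5, 5, 3, 5, 5, 3, 5]
predicted = [5, 3, 5, 5, 5, 3, 5, 5, 3, 5]

# Confusion matrix as (actual, predicted) -> count
confusion = Counter(zip(actual, predicted))
for (a, p), n in sorted(confusion.items()):
    print(f"actual {a}, predicted {p}: {n}")

# Precision and recall for the class "rating 5"
true_pos = confusion[(5, 5)]
precision = true_pos / predicted.count(5)  # of all predicted 5s, how many are right
recall = true_pos / actual.count(5)        # of all real 5s, how many were found
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Here every real 5 is found (perfect recall), but one review with rating 2 is also labelled 5, which lowers the precision.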

In [18]:
df_subset["Review Text"].head(15)

1     Love this dress!  it's sooo pretty.  i happene...
2     I had such high hopes for this dress and reall...
5     I love tracy reese dresses, but this one is no...
8     I love this dress. i usually get an xs but it ...
9     I'm 5"5' and 125 lbs. i ordered the s petite t...
10    Dress runs small esp where the zipper area run...
11    This dress is perfection! so pretty and flatte...
12    More and more i find myself reliant on the rev...
14    This is a nice choice for holiday gatherings. ...
19    I love the look and feel of this tulle dress. ...
21    I'm upset because for the price of the dress, ...
22    First of all, this is not pullover styling. th...
23    Cute little dress fits tts. it is a little hig...
52    Love the color and style, but material snags e...
58    I got this in the petite length, size o, and i...
Name: Review Text, dtype: object

Review 3 (index 5): rating 2, predicted rating 5
Review 13 (index 23): rating 3, predicted rating 5
Review 14 (index 52): rating 3, predicted rating 5

The classifier trips up on these reviews because of words such as "love" and "cute". A phrase like "not in love with it" is seen as a good thing, because the bag-of-words model counts every word separately and ignores word order.
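One common way to keep some word order is to count word n-grams in addition to single words. The sketch below uses a hypothetical helper, `ngrams`, written in plain Python. In `sklearn`, the `ngram_range` parameter of `CountVectorizer` plays this role, for example `ngram_range=(1, 3)`. As far as I know, "not" is on the built-in English stop-word list, so it would be removed before counting unless the stop words are adjusted.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = "not in love with it".split()

unigrams = ngrams(tokens, 1, 1)        # what plain bag of words sees
up_to_trigrams = ngrams(tokens, 1, 3)  # also keeps short word sequences

print(unigrams)
print(up_to_trigrams)
```

With trigrams, "not in love" becomes a feature of its own that can be associated with low ratings, at the cost of a much larger vocabulary.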