In [87]:
import pandas as pd
import numpy as np

df = pd.read_csv('Reviews.csv')
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [88]:
df.shape

(50000, 10)

The Score column is scaled from 1 to 5, and we will remove all Scores equal to 3 because we assume these are neutral and did not provide us any useful information. We then add a new column called “Positivity”, where any score above 3 is encoded as a 1, indicating it was positively rated. Otherwise, it’ll be encoded as a 0, indicating it was negatively rated.

In [89]:
df.dropna(inplace=True)
df[df['Score'] != 3]
df['Positivity'] = np.where(df['Score'] > 3, 1, 0)
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Positivity
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,1
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,0
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,1
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,0
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,1


In [90]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Positivity'], random_state = 0)
print('X_train first entry: \n\n', X_train[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry: 

 I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


X_train shape:  (37496,)


In [91]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(X_train)

The default configuration tokenizes the string, by extracting words of at least 2 letters or numbers, separated by word boundaries, it then converts everything to lowercase and builds a vocabulary using these tokens. We can get some of the vocabularies by using the get_feature_names method like so:

In [92]:
vect.get_feature_names()[::2000]

['00',
 'allayed',
 'baby',
 'caffee',
 'comprise',
 'detox',
 'est',
 'galactooligosaccharides',
 'homogenized',
 'kiddin',
 'medicine',
 'oblige',
 'plantationbell',
 'redeemed',
 'seemingly',
 'startup',
 'tile',
 'vodkas']

In [93]:
len(vect.get_feature_names())

35424

Next, we transform the documents in X_train to a document term matrix, which gives us the bags-of-word representation of X_train. The result is stored in a SciPy sparse matrix, where each row corresponds to a document, and each column is a word from our training vocabulary.

## Vectorized Features
![alt](./images/bagofword.png "Bag of Words")


![alt](./images/bagofword2.png "Bag of Words")

In [94]:
X_train_vectorized = vect.transform(X_train)

In [95]:
X_train_vectorized.shape

(37496, 35424)

In [96]:
X_train_vectorized[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

The entries in this matrix are the number of times each word appears in each document. Because the number of words in the vocabulary is so much larger than the number of words that might appear in a single text, most entries of this matrix are zero.

## Logistic Regression

Now, we will train the Logistic Regression classifier based on this feature matrix X_ train_ vectorized because Logistics Regression works well for high dimensional sparse data.

In [97]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Next, we’ll make predictions using X_test, and compute the area under the curve score.

In [98]:
from sklearn.metrics import roc_auc_score
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.810651771925


### Area Under Curve (AUC)
![alt](./images/auc.png "AUC")

In order to better understand how our model makes these predictions, we can use the coefficients for each feature (a word) to determine its weight in terms of positivity and negativity.

In [99]:
feature_names = np.array(vect.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs: 
['worst' 'disappointing' 'flavorless' 'disappointment' 'expired' 'grounds'
 'trash' 'awful' 'unfortunately' 'terrible']

Largest Coefs: 
['hooked' 'proteins' 'addictive' 'excellent' 'carrying' 'wonderful'
 'amazing' 'pleasantly' 'awesome' 'skeptical']



Sorting the 10 smallest and 10 largest coefficients, we can see the model has predicted words like “worst”, “disappointing” and “horrible” to negative reviews, and words like “hooked”, “bright”, and “delicious” to positive reviews.

However, our model can be improved.

# Tf–idf term weighting
(Tf-idf : term frequency-inverse document frequency)

In a large text corpus, some words will be present very often but will carry very little meaningful information about the actual contents of the document (such as “the”, “a” and “is”). If we were to feed the count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms. Tf-idf allows us to weight terms based on how important they are to a document.

So, we will instantiate the tf–idf vectorizer and fit it to our training data. We specify min_df = 5, which will remove any words from our vocabulary that appear in fewer than five documents.

![alt](./images/tf-idf.png "Tf-idf")

In [100]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df = 5).fit(X_train)
len(vect.get_feature_names())

11551

In [101]:
X_train_vectorized = vect.transform(X_train)
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.783420849973


Using the following code, we are able to obtain a list of features with the smallest tf-idf that either commonly appeared across all reviews or only appeared rarely in very long reviews and a list of features with the largest tf–idf contains words which appeared frequently in a review, but did not appear commonly across all reviews.

In [102]:
feature_names = np.array(vect.get_feature_names())
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()
print('Smallest Tfidf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest Tfidf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest Tfidf: 
['4thd' 'nations' 'biochemical' '350mgs' '300mgs' 'nas' 'fnb' 'committee'
 'faecium' 'annette']

Largest Tfidf: 
['mmm' 'nom' 'mustard' 'quinoa' 'br' 'jell' 'rhubarb' 'jello' 'thanks'
 'agar']



In [103]:
print(model.predict(vect.transform(['The candy is not good, I will never buy them again',
                                    'The candy is not bad, I will buy them again'])))

[1 0]


Our current model misclassified the document “The candy is not good, I will never buy them again” as a positive review, and it also misclassified the document “The candy is not bad, I will buy them again” as a negative review.

# n-grams

One way to fix this misclassification is to add n-grams. For example, bigrams count pairs of adjacent words and could give us features such as bad versus not bad. Thus, we are refitting our training set specifying a minimum document frequency of 5 and extracting 1-grams and 2-grams.

![alt](./images/ngram.png "N-Grams")

In [104]:
vect = CountVectorizer(min_df = 5, ngram_range = (1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
len(vect.get_feature_names())

82497

In [105]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.84120818323


In [106]:
feature_names = np.array(vect.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coef: \n{}\n'.format(feature_names[sorted_coef_index][:10]))
print('Largest Coef: \n{}\n'.format(feature_names[sorted_coef_index][:-11:-1]))

Smallest Coef: 
['disappointing' 'worst' 'not recommend' 'unfortunately'
 'very disappointed' 'awful' 'terrible' 'won buy' 'not good' 'disappointed']

Largest Coef: 
['be disappointed' 'delicious' 'excellent' 'not too' 'amazing' 'wonderful'
 'awesome' 'love this' 'even better' 'great']



In [107]:
print(model.predict(vect.transform(['The candy is not good, I will never buy them again',
                                    'The candy is not bad, I will buy them again'])))

[0 1]


In [108]:
print(model.predict(vect.transform(['I fed this to my Golden Retriever and he hated it.  He wouldnt eat it, and when he did, it gave him terrible diarrhea.  We will not be buying this again.  It\'s also super expensive.'])))

[0]


In [109]:
print(model.predict(vect.transform(['These are excellent bars at an outstanding price.  The whole family, including our finicky 1 1/2 year-old, loves them.  Greens Plus bars are nutritious, filling, and convenient.'])))

[1]
