# Sentiment Analysis
Using Amazon reviews of mobile devices, we are going to train a model that classifies reviews as positive or negative.

## Data Prep
We are going to use a subset of data to speed things up.

In [51]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Mobile_reviews.csv')
print(df.shape)
# Sample the data to speed up computation
# Comment out this line to match with lecture
df = df.sample(frac=0.4, random_state=10)
print(df.shape)

df.head()

(206920, 7)
(82768, 7)


Unnamed: 0.1,Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
71610,264323,Nokia C6 Unlocked GSM Phone with Easy E-mail S...,Nokia,99.95,2,"The phone's fine, im just disappointed when i ...",1.0
148450,331118,Samsung Galaxy Note II N7100 16GB Gray-Unlocke...,,116.99,5,great,0.0
51115,20512,Apple iPhone 5 32GB Factory Unlocked GSM Cell ...,Apple,179.99,5,Exceptional ... works perfectly.,0.0
31935,172253,CNPGD [U.S. Warranty] All-in-1 Smartwatch and ...,CNPGD,49.99,2,Bluetooth kept disconnecting. I was not happy ...,0.0
107601,6654,"Apple Iphone 4 - 8gb Sprint (CDMA) White, Smar...",,,1,Same day I got it stopped working,1.0


In [52]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
# (.copy() avoids a SettingWithCopyWarning when the new column is added below)
df = df[df['Rating'] != 3].copy()

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df.Rating >= 4, 1, 0)
df.head(15)

Unnamed: 0.1,Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
71610,264323,Nokia C6 Unlocked GSM Phone with Easy E-mail S...,Nokia,99.95,2,"The phone's fine, im just disappointed when i ...",1.0,0
51115,20512,Apple iPhone 5 32GB Factory Unlocked GSM Cell ...,Apple,179.99,5,Exceptional ... works perfectly.,0.0,1
31935,172253,CNPGD [U.S. Warranty] All-in-1 Smartwatch and ...,CNPGD,49.99,2,Bluetooth kept disconnecting. I was not happy ...,0.0,0
161957,270262,Nokia Lumia 820 8GB GSM 4G LTE Windows 8 Smart...,Nokia,99.99,1,This phone was defective. Was very unhappy tha...,0.0,0
40717,226895,"LG Google Nexus 5 Unlocked Phone D821, 16 GB, ...",LG,373.75,4,Nice phone. Worked very well while traveling o...,0.0,1
27583,314599,Samsung Galaxy Grand Prime G531H/DS Internatio...,Samsung,189.9,5,i love it,0.0,1
18837,249668,Motorola KRZR K1 Unlocked Cell Phone with 2 MP...,Motorola,83.99,1,This is not a new phone as advertised. It had ...,1.0,0
130302,226943,"LG Google Nexus 5 Unlocked Phone D821, 16 GB, ...",LG,373.75,5,Reached on time. Yet to open the package... It...,0.0,1
110148,135100,BLU Studio 5.0 C HD - Unlocked Cell Phones - R...,BLU,2000.0,5,Guys.......this phone is awesome for the price...,0.0,1
17546,150957,BLU Studio C Super Camera -Unlocked Smartphone...,BLU,99.0,2,So far the phone is great. But it is a mislead...,0.0,0


In [53]:
# Most ratings are positive
df['Positively Rated'].mean()

0.7473840095038162

In [54]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [55]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train first entry\'s label is:', y_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 Came as advertised. No trouble connecting to the t-mobile network.


X_train first entry's label is: 1


X_train shape:  (46086,)


## CountVectorizer
CountVectorizer lets us use the bag-of-words approach, which ignores sentence structure and only counts how many times each word occurs in a document. It converts a collection of text documents (reviews) into a matrix of token counts.

We first instantiate CountVectorizer and fit it to our training data. Fitting converts the text to lowercase, finds all tokens (runs of at least two letters or digits, separated by word boundaries), and builds a vocabulary from these tokens.

In [56]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [57]:
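# Look at every 2000th token in the vocabulary
# Note: scikit-learn 1.0 added get_feature_names_out(), and get_feature_names() was removed in 1.2
vect.get_feature_names()[::2000]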

['00',
 'admitir',
 'blacklisted',
 'comprehensible',
 'dispel',
 'feasible',
 'headphonesblue',
 'kook',
 'msrp',
 'phonefeb',
 'reeboting',
 'sii',
 'tend',
 'virgin']

In [58]:
# we are working with over 27,000 features
len(vect.get_feature_names())

27279

In [59]:
# Use the transform method to turn X_train into its bag-of-words matrix:
# each row is a document and each column is a word
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<46086x27279 sparse matrix of type '<class 'numpy.int64'>'
	with 1229266 stored elements in Compressed Sparse Row format>
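
To make the bag-of-words matrix concrete, here is a small pure-Python sketch of what fitting and transforming amount to. It is not scikit-learn's implementation: the three toy reviews and the helper names `tokenize` and `to_counts` are made up for illustration, and the regular expression mirrors the token rule described above (lowercased runs of two or more word characters).

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

# Toy reviews, invented for this sketch
docs = ["The phone is great", "great phone, great price", "I love it"]

# "Fit": collect every distinct token and give it a column index
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocabulary)}

def to_counts(text):
    # "Transform": one row of counts; tokens outside the vocabulary are ignored
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in index:
            row[index[tok]] = n
    return row

matrix = [to_counts(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print(row)
```

Note that the single-character token "I" is dropped, and that a word unseen during fitting (for example "camera" in a new review) simply contributes nothing to the row.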

In [61]:
from sklearn.linear_model import LogisticRegression

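# The default max_iter=100 is too low for lbfgs to converge here (see the warning below);
# a larger value, e.g. LogisticRegression(max_iter=1000), should avoid it
log_reg = LogisticRegression()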
log_reg.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [64]:
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

preds = log_reg.predict(vect.transform(X_test))

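# Note: preds holds hard 0/1 labels; scores from log_reg.predict_proba(...)[:, 1]
# would give the usual ranking-based AUC
print('AUC: ', roc_auc_score(y_test, preds))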
print('Accuracy: ', accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

AUC:  0.9098377775180254
Accuracy:  0.9379027533684827
[[ 3330   575]
 [  379 11079]]
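
Because `roc_auc_score` receives hard 0/1 predictions here, the ROC curve has a single interior point and the AUC reduces to the average of the true-positive and true-negative rates. The sketch below checks this against the confusion matrix printed above and, for context, computes the accuracy of always predicting the majority class. The variable names are illustrative.

```python
# Counts from the confusion matrix above:
# rows are the true class (0, 1), columns the predicted class (0, 1)
tn, fp = 3330, 575
fn, tp = 379, 11079
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
true_positive_rate = tp / (tp + fn)   # share of positive reviews caught
true_negative_rate = tn / (tn + fp)   # share of negative reviews caught

# With hard labels, AUC equals the mean of the two rates (balanced accuracy)
auc_from_labels = (true_positive_rate + true_negative_rate) / 2

# Accuracy of a model that always predicts "positive"
majority_baseline = (tp + fn) / total

print(f"accuracy:          {accuracy:.4f}")
print(f"AUC (hard labels): {auc_from_labels:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

The majority baseline of about 0.746 is the figure the reported accuracy should be compared against.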


Let us now examine the coefficients of the model.

In [69]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = log_reg.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'junk' 'horrible' 'garbage' 'crashed' 'unable' 'empty' 'ui'
 'stopped' 'poor']

Largest Coefs: 
['excelente' 'excelent' 'loves' 'excellent' 'love' 'perfect' 'exelente'
 'perfectly' 'awesome' 'amazing']
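
The slice `[:-11:-1]` used above walks backwards from the last element and stops before position -11, so it returns the last ten entries in reverse order. A small sketch on a made-up list standing in for the argsort result:

```python
# Stand-in for an ascending argsort result with 25 entries
sorted_index = list(range(25))

smallest_10 = sorted_index[:10]       # first ten, ascending
largest_10 = sorted_index[:-11:-1]    # last ten, from largest to smallest

print(smallest_10)
print(largest_10)
```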


## TfIdf
Now let us use a different approach: term frequency-inverse document frequency (tf-idf). It weights each token not only by how often it appears in a document but also by how rare it is across the corpus. A high weight goes to tokens that are used a lot in a particular document but seldom in the corpus as a whole.
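
As a rough illustration of the weighting, the following pure-Python sketch computes tf-idf by hand. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as `TfidfVectorizer` defaults; the toy reviews and the helper names `idf` and `tfidf` are invented for this example.

```python
import math
from collections import Counter

# Three toy reviews, already tokenized (made up for illustration)
docs = [
    ["great", "phone", "great", "price"],
    ["great", "battery"],
    ["phone", "stopped", "working"],
]
n_docs = len(docs)

# Document frequency: number of reviews each token appears in
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def idf(token):
    # Smoothed inverse document frequency: rarer tokens get larger values
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1

def tfidf(doc):
    # Term count times idf, then scale the row to unit Euclidean length
    raw = {tok: count * idf(tok) for tok, count in Counter(doc).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {tok: v / norm for tok, v in raw.items()}

for doc in docs:
    print({tok: round(w, 3) for tok, w in tfidf(doc).items()})
```

In the third review, "stopped" and "working" outweigh "phone" because "phone" also occurs in another review. In the first, "great" still gets the largest weight because it appears twice.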

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df sets the minimum number of documents a token must appear in to become part of the vocabulary

vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

7871

Setting min_df=5 has also shrunk the feature space from 27,279 to 7,871 tokens, a factor of roughly 3.5. Let's see how well this model performs.
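
The effect of `min_df` can be sketched in a few lines of plain Python: count how many documents each token occurs in, and keep only the tokens that reach the threshold. The helper `prune_vocabulary` and the toy reviews are hypothetical; the vectorizer applies the same document-frequency cut internally.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df):
    """Keep tokens that occur in at least min_df documents."""
    # set(doc) so a token repeated within one document is counted once
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(tok for tok, df in doc_freq.items() if df >= min_df)

toy_docs = [
    ["great", "phone", "great"],
    ["great", "battery"],
    ["phone", "headphonesblue"],   # a one-off token, like some seen earlier
]

print(prune_vocabulary(toy_docs, min_df=1))
print(prune_vocabulary(toy_docs, min_df=2))
```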

In [73]:
# Use the transform method to turn X_train into a tf-idf weighted document-term matrix
X_train_vectorized = vect.transform(X_train)

log_reg_tfidf = LogisticRegression()
log_reg_tfidf.fit(X_train_vectorized, y_train)

preds = log_reg_tfidf.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, preds))
print('Accuracy: ', accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


AUC:  0.9103901483768924
Accuracy:  0.9399856798802317
[[ 3320   585]
 [  337 11121]]


Great! With the smaller feature space the model is cheaper to train, and it does slightly better on the test set (accuracy 0.9400 vs. 0.9379, AUC 0.9104 vs. 0.9098).

In [76]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = log_reg_tfidf.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'disappointed' 'worst' 'return' 'horrible' 'slow' 'doesn' 'stopped'
 'poor' 'waste']

Largest Coefs: 
['great' 'love' 'excellent' 'perfect' 'amazing' 'good' 'best' 'far' 'easy'
 'perfectly']


Let us see which features have the smallest and largest tf-idf values, taking each feature's maximum over the training reviews:

In [86]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:20]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-21:-1]]))

Smallest tfidf:
['spoken' 'reactions' 'fidelity' 'bokeh' 'adreno' 'controlled'
 'synchronization' 'candybar' 'comparisons' 'nav' 'rooms' 'damp' '480p'
 'ultimo' 'warmth' '1920x1080' 'ingress' 'dragging' 'conversion'
 'breakers']

Largest tfidf: 
['real' 'don' 'excelent' 'windows' 'aaa' 'pin' 'genial' 'very' 'god'
 'verygood' 'loveit' 'a1' 'it' 'done' 'recommended' 'excelente' 'tiempo'
 'thx' 'fire' 'goo']


The features with the largest tf-idf values are words that appeared frequently in a particular review but did not appear commonly across all reviews.

The features with the smallest tf-idf values either appeared commonly across all reviews or appeared only rarely, in very long reviews.
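
The expression `X_train_vectorized.max(0).toarray()[0].argsort()` takes the maximum of each column and returns the column indices ordered by that maximum. Here is a plain-Python sketch of the same two steps on a made-up 3x4 matrix of tf-idf-like weights (the values and feature names are invented for illustration):

```python
features = ["bokeh", "great", "phone", "screen"]

# Each row is a review, each column a feature (invented weights)
weights = [
    [0.0, 1.0, 0.0, 0.0],   # one-word review: all weight on "great"
    [0.0, 0.0, 0.6, 0.8],
    [0.2, 0.0, 0.3, 0.0],   # excerpt of a long review: weight spread thin
]

# Step 1: maximum of each column, like .max(0)
col_max = [max(col) for col in zip(*weights)]

# Step 2: column indices sorted by that maximum, like .argsort()
order = sorted(range(len(features)), key=lambda j: col_max[j])

ranked = [features[j] for j in order]
print("smallest max weight:", ranked[0])
print("largest max weight: ", ranked[-1])
```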

## Word Order with n-grams
One problem is that our bag-of-words models ignore sentence structure and word order, so 'is working' and 'not working' both contribute the same 'working' token. One way to address this is to add n-grams: sequences of n adjacent tokens. With bigrams included, the phrases 'is working' and 'not working' become separate features.
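
A minimal sketch of how unigrams and bigrams can be generated from a token list. The helper `word_ngrams` is hypothetical, and its `min_n` and `max_n` arguments play the role of the two numbers in `ngram_range`.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams(["phone", "is", "not", "working"]))
```

Four tokens yield four unigrams plus three bigrams, which hints at how quickly the vocabulary grows once every adjacent pair becomes a candidate feature.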

In [89]:
vect = CountVectorizer(min_df = 5, ngram_range = (1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

52065

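Just by adding bigrams we have grown the feature space to 52,065 features: nearly double the 27,279 unigrams of the first model, and more than six times the 7,871 unigrams that pass min_df=5. N-grams can therefore be very computationally expensive.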

In [90]:
log_reg_2gram = LogisticRegression()
log_reg_2gram.fit(X_train_vectorized, y_train)

preds = log_reg_2gram.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, preds))
print('Accuracy: ', accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


AUC:  0.9322673644814028
Accuracy:  0.9522228731367571
[[ 3482   423]
 [  311 11147]]


Great! Adding bigrams improves the model further: accuracy rises from 0.9400 to 0.9522 and AUC from 0.9104 to 0.9323.

In [91]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = log_reg_2gram.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'junk' 'not good' 'horrible' 'worst' 'broken' 'not very' 'poor'
 'terrible' 'garbage']

Largest Coefs: 
['excellent' 'excelente' 'excelent' 'perfect' 'not bad' 'great' 'love'
 'no problems' 'awesome' 'amazing']
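
The coefficient lists suggest why bigrams help: a phrase like 'not bad' gets its own positive weight instead of being scored as two separate words. The sketch below reproduces that effect with a hand-rolled logistic score. The coefficient values are invented for illustration and are not taken from the fitted models, and `positive_probability` is a hypothetical helper.

```python
import math

def positive_probability(feature_counts, coef, intercept=0.0):
    """Logistic-regression score: sigmoid of intercept + sum(coef * count)."""
    z = intercept + sum(coef.get(f, 0.0) * n for f, n in feature_counts.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented coefficients, NOT the fitted values from the models above
unigram_coef = {"not": -1.5, "bad": -2.0}
bigram_coef = {"not": -1.5, "bad": -2.0, "not bad": 4.5}

# Features extracted from the review text "not bad"
unigram_features = {"not": 1, "bad": 1}
bigram_features = {"not": 1, "bad": 1, "not bad": 1}

p_unigram = positive_probability(unigram_features, unigram_coef)
p_bigram = positive_probability(bigram_features, bigram_coef)
print(f"unigrams only: {p_unigram:.3f}")
print(f"with bigrams:  {p_bigram:.3f}")
```

With these made-up weights, the unigram-only score falls below 0.5 while the bigram feature pushes the same review above it.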
