# Sentiment Analysis with Machine Learning Models

In [3]:
import numpy as np
import pandas as pd

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline, make_pipeline

import functions as f

In [2]:
M_bow = f.load_pickle('data/M_bow_100k.pickle')
M_tfidf = f.load_pickle('data/M_tfidf_100k.pickle')
M_svd = f.load_pickle('data/M_svd_100k.pickle')
M_nmf = f.load_pickle('data/M_nmf_100k.pickle')
M_word2vec = f.load_pickle('data/M_word2vec_100k.pickle')
y = f.load_pickle('data/sentiment_100k.pickle')

## Machine Learning Models Comparison

No single combination of text representation and machine learning model is guaranteed to perform best, so each configuration has to be tested. Five text representations are compared: bag of words (BOW), TF-IDF, a co-occurrence matrix decomposed with SVD, a co-occurrence matrix decomposed with NMF, and Word2Vec. Each of them is used with the following predictive models:
- Multinomial Naive Bayes
- SGD classifier
- Logistic Regression

Model performance is evaluated with cross-validation, measuring test accuracy, F1 score, precision and recall. Models are compared primarily on accuracy and F1 score. Note, however, that the dataset is highly imbalanced: the majority of reviews are positive, so labelling every review as positive is enough to reach 91.5% accuracy. Accuracy alone is therefore not a sufficient metric, and it is also valuable to look at precision and recall.
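
The helper `f.model_cv` used below is not shown in this notebook, so the following is only a minimal sketch of how such an evaluation could be written with scikit-learn's `cross_validate`. The name `cv_scores`, the 5-fold setting and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# The four metrics reported in this notebook, as scikit-learn scorer names.
SCORING = ['accuracy', 'f1', 'precision', 'recall']

def cv_scores(model, X, y, cv=5):
    """Hypothetical helper: mean test-fold value of every metric in SCORING."""
    results = cross_validate(model, X, y, cv=cv, scoring=SCORING)
    return {name: float(np.mean(results[f'test_{name}'])) for name in SCORING}

# Synthetic imbalanced data standing in for one embedding matrix and its labels.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.1, 0.9], random_state=9)

scores = cv_scores(LogisticRegression(max_iter=500), X_demo, y_demo)
print(scores)
```

Running the same function once per embedding matrix would give one row of metrics per text representation.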

Since Multinomial Naive Bayes accepts only non-negative feature values, the SVD and Word2Vec embeddings are rescaled with MinMaxScaler.

In [4]:
baseline = sum(y)/len(y)
print('Baseline accuracy: ', baseline)

Baseline accuracy:  0.9148


In [5]:
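# MultinomialNB rejects negative features, so rescale each embedding to [0, 1].
# A separate scaler per matrix avoids overwriting the fitted state of a shared one.
M_svd_positive = MinMaxScaler().fit_transform(M_svd)
M_word2vec_positive = MinMaxScaler().fit_transform(M_word2vec)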
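
To illustrate why the rescaling is needed, here is a small sketch on random data (not the notebook's embeddings). It shows `MultinomialNB` refusing a matrix with negative entries and accepting the same matrix after `MinMaxScaler`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(9)
X_dense = rng.normal(size=(100, 5))   # dense features containing negative values
y_toy = rng.integers(0, 2, size=100)  # random binary labels, for illustration only

# Fitting on the raw matrix fails because of the negative entries.
try:
    MultinomialNB().fit(X_dense, y_toy)
    raised = False
except ValueError:
    raised = True
print('raw matrix rejected:', raised)

# After scaling every feature to [0, 1] the model can be fitted.
X_scaled = MinMaxScaler().fit_transform(X_dense)
nb = MultinomialNB().fit(X_scaled, y_toy)
```

One possible refinement is to place the scaler inside a pipeline, so that during cross-validation it is fitted on the training folds only.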

In [6]:
embeddings_names = ['BOW', 'TFIDF', 'SVD', 'NMF', 'Word2Vec']
embeddings = [M_bow, M_tfidf, M_svd, M_nmf, M_word2vec]
embeddings_positive = [M_bow, M_tfidf, M_svd_positive, M_nmf, M_word2vec_positive]
results_names = ['test_acc', 'f1', 'precision', 'recall']

### Multinomial Naive Bayes

The Naive Bayes classifier is a simple model used in a wide variety of classification tasks, both binary and multiclass. It is called "naive" because it assumes that the features are independent of each other given the class. Using Bayes' theorem, it combines the prior probability of each class with the likelihood of the observed features to estimate the probability that a piece of data belongs to a given class.
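
Written out, the independence assumption turns Bayes' theorem into a product of per-feature terms. For a class $c$ and features $x_1, \dots, x_n$ (notation introduced here for illustration):

$$
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

Here $P(c)$ is the prior (the share of class $c$ in the training data) and $\hat{c}$ is the predicted class.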

It performs best with BOW word vectors (the highest test accuracy and F1 score). With the SVD and Word2Vec embeddings, Naive Bayes returns exactly the baseline accuracy and a recall of 1, which means that it assigned all reviews to the "positive" class.

In [7]:
naive = MultinomialNB()

naive_cv = f.model_cv(naive, embeddings_positive, y)
f.df_model_cv(naive_cv, embeddings_names, results_names)

| embedding | test_acc | f1 | precision | recall |
|---|---|---|---|---|
| BOW | 0.94656 | 0.970859 | 0.96857 | 0.973164 |
| TFIDF | 0.93515 | 0.96568 | 0.935993 | 0.997311 |
| SVD | 0.9148 | 0.955504 | 0.9148 | 1.0 |
| NMF | 0.72811 | 0.841202 | 0.903112 | 0.787243 |
| Word2Vec | 0.9148 | 0.955504 | 0.9148 | 1.0 |
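
The rows that report accuracy 0.9148 with recall 1.0 can be checked against a classifier that always predicts "positive". The sketch below uses synthetic labels with the same positive share as the baseline; the sample size of 10,000 is an assumption chosen for convenience.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Synthetic labels: 91.48% positive, matching the baseline accuracy above.
y_true = np.array([1] * 9148 + [0] * 852)
y_all_pos = np.ones_like(y_true)  # a "model" that labels everything positive

acc = accuracy_score(y_true, y_all_pos)
prec = precision_score(y_true, y_all_pos)
rec = recall_score(y_true, y_all_pos)
f1 = f1_score(y_true, y_all_pos)
print(acc, prec, rec, round(f1, 6))
```

Accuracy and precision both equal the positive share, recall is 1, and the F1 score follows from F1 = 2PR / (P + R), which reproduces the 0.955504 seen in those rows.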


### SGD Classifier

SGDClassifier implements regularized linear models trained with stochastic gradient descent. The first example below fits a linear SVM (the default hinge loss), and the second fits a logistic regression (log loss with an elastic net penalty).

An SVM represents the examples as points in space and separates the categories with a boundary whose margin is as wide as possible.
Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable.
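
The difference between the two models comes down to the loss that gradient descent minimizes. The following sketch writes both losses as functions of the margin y·f(x) with labels in {-1, +1}. It is a plain NumPy illustration, not scikit-learn's internal code, and the function names are chosen here.

```python
import numpy as np

def hinge_loss(margin):
    """Linear SVM loss: zero once an example is beyond the margin."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic regression loss: always positive, decays smoothly."""
    return np.log1p(np.exp(-margin))

# margin = y * f(x); negative values are misclassified examples.
margins = np.array([-1.0, 0.0, 1.0, 3.0])
for m, h, l in zip(margins, hinge_loss(margins), logistic_loss(margins)):
    print(f'margin={m:+.1f}  hinge={h:.4f}  logistic={l:.4f}')
```

Both losses penalize misclassified points in a similar way, which is consistent with the two SGD variants giving close results.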

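Both SGD classifiers return similar results. The best results are achieved with BOW word vectors, followed by TF-IDF and Word2Vec. SVD scores below the baseline accuracy, and with NMF both classifiers assign every review to the "positive" class (recall = 1).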

In [13]:
sgd = SGDClassifier(random_state=9, n_jobs=-1)

sgd_cv = f.model_cv(sgd, embeddings, y)
f.df_model_cv(sgd_cv, embeddings_names, results_names)

| embedding | test_acc | f1 | precision | recall |
|---|---|---|---|---|
| BOW | 0.95621 | 0.976397 | 0.963097 | 0.990074 |
| TFIDF | 0.9496 | 0.9731 | 0.950783 | 0.996491 |
| SVD | 0.8577 | 0.921848 | 0.925756 | 0.918135 |
| NMF | 0.9148 | 0.955504 | 0.9148 | 1.0 |
| Word2Vec | 0.94956 | 0.972803 | 0.960062 | 0.985931 |


In [14]:
# Note: scikit-learn 1.1 renamed this loss to 'log_loss'; 'log' was removed in 1.3.
sgd_log = SGDClassifier(loss='log', penalty='elasticnet', random_state=9, n_jobs=-1)

sgd_log_cv = f.model_cv(sgd_log, embeddings, y)
f.df_model_cv(sgd_log_cv, embeddings_names, results_names)

| embedding | test_acc | f1 | precision | recall |
|---|---|---|---|---|
| BOW | 0.95324 | 0.974855 | 0.95937 | 0.99085 |
| TFIDF | 0.94032 | 0.968344 | 0.940585 | 0.997792 |
| SVD | 0.84823 | 0.915755 | 0.928278 | 0.904143 |
| NMF | 0.9148 | 0.955504 | 0.9148 | 1.0 |
| Word2Vec | 0.9495 | 0.972703 | 0.962119 | 0.983548 |


### Logistic Regression

Logistic regression, despite its name, is a linear model for classification rather than regression. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
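
As a small illustration of the logistic function at work, the sketch below fits a binary `LogisticRegression` on synthetic data (an assumption, not the review dataset) and checks that the predicted probability of the positive class is the logistic function applied to the linear score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=9)
clf = LogisticRegression(max_iter=500).fit(X_demo, y_demo)

# Linear score w.x + b for every sample.
z = clf.decision_function(X_demo)

# Logistic (sigmoid) function applied by hand.
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
p_sklearn = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

The class label returned by `predict` then corresponds to whether this probability is above 0.5, that is, whether the linear score is positive.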

It returns its best results with TF-IDF word vectors, which is also the best result achieved so far. With the SVD and NMF word vectors, Logistic Regression assigned all observations to the "positive" class (recall = 1).

In [15]:
logreg = LogisticRegression(max_iter=500, random_state=9, n_jobs=-1)

logreg_cv = f.model_cv(logreg, embeddings, y)
f.df_model_cv(logreg_cv, embeddings_names, results_names)

| embedding | test_acc | f1 | precision | recall |
|---|---|---|---|---|
| BOW | 0.95678 | 0.9766 | 0.967468 | 0.985909 |
| TFIDF | 0.95808 | 0.97744 | 0.962655 | 0.992687 |
| SVD | 0.9148 | 0.955504 | 0.9148 | 1.0 |
| NMF | 0.9148 | 0.955504 | 0.9148 | 1.0 |
| Word2Vec | 0.95048 | 0.973322 | 0.959593 | 0.987451 |


## Verification of model performance

Logistic Regression with TF-IDF word vectors reached 95.8% accuracy and an F1 score of 0.9774. It's the best result achieved so far.

Let's analyze the results of this particular model:
- try to predict the sentiment of two example reviews
- analyze errors (true positives, false positives, etc.)
- browse reviews that were assigned to the incorrect class

In [17]:
vectorizer = f.load_pickle('data/tfidf_vectorizer_100k.pickle')

In [18]:
logreg.fit(M_tfidf, y)

SGDClassifier(n_jobs=-1, random_state=9)

In [19]:
review_test_pos = 'This game is amazing ^^, my son plays with it all the time!'
review_test_neg = 'I\'m really disappointed with this game. My son doesn\'t like playing with it.'

In [20]:
review_tokens_test_pos = f.normalize_single_text(review_test_pos)
tfidf_vector_test_pos = vectorizer.transform([' '.join(review_tokens_test_pos)])
logreg.predict(tfidf_vector_test_pos)

array([1])

In [21]:
review_tokens_test_neg = f.normalize_single_text(review_test_neg)
tfidf_vector_test_neg = vectorizer.transform([' '.join(review_tokens_test_neg)])
logreg.predict(tfidf_vector_test_neg)

array([0])
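
The key point in the cells above is that a new review has to pass through the same fitted vectorizer as the training data. A minimal sketch of that idea, using a toy corpus and a scikit-learn pipeline instead of the notebook's pickled vectorizer and `f.normalize_single_text` (both replaced here for illustration), could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the review dataset (1 = positive, 0 = negative).
toy_reviews = [
    'my son loves this game',
    'great toy, amazing quality',
    'fun game for the whole family',
    'really disappointed with this toy',
    'broke after one day, terrible quality',
    'my son does not like this game',
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline keeps the vectorizer and the classifier together, so new text
# is transformed with the vocabulary learned during fit.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=500))
text_clf.fit(toy_reviews, toy_labels)

# Class probabilities give more detail than the hard 0/1 label.
proba = text_clf.predict_proba(['this game is amazing'])
print(proba)
```

Looking at `predict_proba` rather than `predict` shows how confident the model is, which can help when inspecting borderline reviews.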

### Train test split & check reviews with incorrect labels

In [31]:
df = pd.read_csv('data/reviews_toys_games_100k.csv')

In [23]:
X_train, X_test, y_train, y_test = train_test_split(M_tfidf, np.array(y), test_size=0.33, random_state=9)

In [24]:
logreg.fit(X_train, y_train)

SGDClassifier(n_jobs=-1, random_state=9)

In [25]:
y_pred = logreg.predict(X_test)

In [32]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
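
The four counts unpacked above are enough to recompute the metrics used throughout this notebook. The sketch below does so on a tiny hand-made example (the toy arrays are assumptions, not the notebook's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_toy = np.array([1, 1, 1, 0, 0, 1])
y_pred_toy = np.array([1, 1, 0, 1, 0, 1])

# For binary labels {0, 1}, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```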

In [21]:
y_diff = y_test - y_pred
np.where(y_diff != 0)

(array([   35,    66,    69, ..., 32961, 32978, 32983], dtype=int64),)

In [24]:
i = 69
print(df['review'][i])
print('\nreal label:', y_test[i], '\npredicted label:', y_pred[i])

Son really loves it.

real label: 0 
predicted label: 1


In [25]:
i = 32978
print(df['review'][i])
print('\nreal label:', y_test[i], '\npredicted label:', y_pred[i])

Arrived quickly,  just as described.

real label: 0 
predicted label: 1


In [26]:
i = 32983
print(df['review'][i])
print('\nreal label:', y_test[i], '\npredicted label:', y_pred[i])

Great product and quality. Fast shipping

real label: 0 
predicted label: 1
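
One caveat about the cells above: `i` is a position inside the shuffled test split, while `df['review'][i]` looks up row `i` of the original data frame, so the printed review may not be the one that `y_test[i]` and `y_pred[i]` describe. A possible way to keep them aligned is to split the row positions together with the data. The sketch below demonstrates this on toy data; all names ending in `_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for df, the feature matrix and the labels.
df_toy = pd.DataFrame({'review': [f'review {k}' for k in range(10)]})
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Split the row positions alongside X and y so that all three stay aligned.
idx = np.arange(len(y_toy))
X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    X_toy, y_toy, idx, test_size=0.33, random_state=9)

i = 2            # position inside the test split
row = idx_te[i]  # matching row in the original data frame
print(df_toy['review'].iloc[row], '| real label:', y_te[i])
```

With this mapping, the review text, the real label and the predicted label all refer to the same observation.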


## Over and under sampling

In [28]:
under_sampler_pipeline = make_pipeline(RandomUnderSampler(random_state=9),
                              LogisticRegression(max_iter=500, random_state=9, n_jobs=-1))

under_sampler_cv = f.model_cv(under_sampler_pipeline, embeddings, y)
f.df_model_cv(under_sampler_cv, embeddings_names, results_names)

| embedding | train_acc | test_acc | precision | recall | f1 | roc_auc |
|---|---|---|---|---|---|---|
| BOW | 0.89673 | 0.89106 | 0.988827 | 0.890982 | 0.937357 | 0.943244 |
| TFIDF | 0.81419 | 0.80703 | 0.994465 | 0.793474 | 0.882665 | 0.956293 |
| SVD | 0.596615 | 0.59652 | 0.979323 | 0.570999 | 0.721385 | 0.769999 |
| NMF | 0.46547 | 0.46552 | 0.963136 | 0.432291 | 0.596725 | 0.580237 |
| Word2Vec | 0.77846 | 0.77864 | 0.987666 | 0.76761 | 0.86384 | 0.909421 |


In [32]:
smote_pipeline = make_pipeline(SMOTE(random_state=9),
                              LogisticRegression(max_iter=500, random_state=9, n_jobs=-1))

smote_cv = f.model_cv(smote_pipeline, embeddings_positive, y)
f.df_model_cv(smote_cv, embeddings_names, results_names)

| embedding | train_acc | test_acc | precision | recall | f1 | roc_auc |
|---|---|---|---|---|---|---|
| BOW | 0.921895 | 0.9152 | 0.977209 | 0.928968 | 0.952478 | 0.900934 |
| TFIDF | 0.92335 | 0.90324 | 0.987509 | 0.905684 | 0.944825 | 0.959009 |
| SVD | 0.59761 | 0.59743 | 0.979358 | 0.571994 | 0.722188 | 0.770017 |
| NMF | 0.466335 | 0.46619 | 0.962883 | 0.433177 | 0.597522 | 0.580255 |
| Word2Vec | 0.772965 | 0.77274 | 0.98813 | 0.760713 | 0.859634 | 0.909639 |
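
For intuition about what the resampling step does to the training data, here is a plain NumPy sketch of random under-sampling. It is an illustration of the idea only, not imbalanced-learn's implementation, and the function name is made up.

```python
import numpy as np

def random_under_sample(X, y, seed=9):
    """Illustrative only: keep as many rows per class as the rarest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 90 positive and 10 negative samples.
y_toy = np.array([1] * 90 + [0] * 10)
X_toy = np.arange(200).reshape(100, 2)

X_bal, y_bal = random_under_sample(X_toy, y_toy)
print(np.bincount(y_bal))  # class counts after balancing
```

In the imbalanced-learn pipelines above, the sampler is applied only while fitting, so the test folds keep their original class ratio. Training on balanced data makes the classifier predict "negative" more often, which fits the pattern in the tables: precision for the positive class rises while recall and accuracy drop.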
