## Sentiment Analysis on Movie Reviews

This exercise was developed as part of the Supervised Learning course (DSCI 571) in the Master of Data Science & Computational Linguistics program at the University of British Columbia.

The goal of this exercise was to compare different classifiers for sentiment analysis on the [IMDB Movie Reviews dataset](https://www.kaggle.com/utathya/imdb-review-dataset).

In [1]:
# Import libraries
import re
import sys
from hashlib import sha1

# plotting and data handling
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# train test split and cross validation
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)

# sklearn objects
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

### Data load

In [2]:
imdb_df = pd.read_csv("data/imdb_master.csv", encoding="ISO-8859-1", index_col="Unnamed: 0")
imdb_df = imdb_df.query('label == "neg" | label == "pos"')
train_df = imdb_df.query('type == "train"')
test_df = imdb_df.query('type == "test"')

In [3]:
train_df.shape

(25000, 4)

In [4]:
test_df.shape

(25000, 4)

In [5]:
# Create X_train, X_test, y_train, y_test
X_train, y_train = train_df.drop(columns = ["label"]), train_df["label"]
X_test, y_test = test_df.drop(columns = ["label"]), test_df["label"]

### Initial EDA

In [6]:
# class balance (relative frequencies)
y_train.value_counts(normalize=True)

neg    0.5
pos    0.5
Name: label, dtype: float64

> Classes are equally distributed (50% / 50%)

In [7]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 25000 to 49999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    25000 non-null  object
 1   review  25000 non-null  object
 2   file    25000 non-null  object
dtypes: object(3)
memory usage: 781.2+ KB


> There are no missing values in the training set.

### Model building and hyperparameter optimization

In [8]:
# Credit: This function was adapted from Varada Kolhatkar (MDS instructor)
def store_cross_val_results(model_name, scores, results_dict):
    """
    Stores mean scores from cross_validate in results_dict for
    the given model model_name.

    Parameters
    ----------
    model_name : str
        name of the scikit-learn classification model
    scores : dict
        object returned by `cross_validate`
    results_dict : dict
        dictionary to store results

    Returns
    -------
        None

    """
    results_dict[model_name] = {
        "mean_train_accuracy": "{:0.4f}".format(np.mean(scores["train_score"])),
        "mean_valid_accuracy": "{:0.4f}".format(np.mean(scores["test_score"])),
        "mean_fit_time (s)": "{:0.4f}".format(np.mean(scores["fit_time"])),
        "mean_score_time (s)": "{:0.4f}".format(np.mean(scores["score_time"])),
        "std_train_score": "{:0.4f}".format(scores["train_score"].std()),
        "std_valid_score": "{:0.4f}".format(scores["test_score"].std()),
    }
    
results_dict = {}

In [9]:
# Set baseline model
pipe_dclf = make_pipeline(DummyClassifier())
scores = cross_validate(pipe_dclf, X_train, y_train, return_train_score = True)
store_cross_val_results('Dummy Classifier', scores, results_dict)
pd.DataFrame(results_dict)



| | Dummy Classifier |
|---|---|
| mean_fit_time (s) | 0.0114 |
| mean_score_time (s) | 0.0067 |
| mean_train_accuracy | 0.5003 |
| mean_valid_accuracy | 0.4984 |
| std_train_score | 0.0017 |
| std_valid_score | 0.0076 |


In [10]:
# Set other classifiers along with a pipeline consisting of a CountVectorizer object
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "RBF SVM": SVC(),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=2000),
}

for key, model in models.items():
    print(f"Training model {key}...")
    pipe = make_pipeline(CountVectorizer(), model)
    pipe_scores = cross_validate(pipe, X_train["review"], y_train, cv=5, return_train_score=True)
    store_cross_val_results("vectorizer + {}".format(key), pipe_scores, results_dict)
    print(f"Training model {key}...finished")
    
pd.DataFrame(results_dict).T

Training model Decision Tree...
Training model Decision Tree...finished
Training model RBF SVM...
Training model RBF SVM...finished
Training model Naive Bayes...
Training model Naive Bayes...finished
Training model Logistic Regression...
Training model Logistic Regression...finished


| | mean_train_accuracy | mean_valid_accuracy | mean_fit_time (s) | mean_score_time (s) | std_train_score | std_valid_score |
|---|---|---|---|---|---|---|
| Dummy Classifier | 0.5003 | 0.4984 | 0.0114 | 0.0067 | 0.0017 | 0.0076 |
| vectorizer + Decision Tree | 1.0 | 0.709 | 24.7664 | 0.7601 | 0.0 | 0.0034 |
| vectorizer + RBF SVM | 0.9166 | 0.8509 | 409.159 | 77.0441 | 0.0013 | 0.0036 |
| vectorizer + Naive Bayes | 0.9091 | 0.7788 | 3.2498 | 0.9361 | 0.0045 | 0.0113 |
| vectorizer + Logistic Regression | 0.9991 | 0.8406 | 19.2095 | 0.8026 | 0.0001 | 0.0119 |


> From the results above, `RBF SVM`, `Logistic Regression` and `Naive Bayes` all perform reasonably well on this task, but only `RBF SVM` (0.8509) and `Logistic Regression` (0.8406) reached a mean validation accuracy above 0.80. Of these three models, `RBF SVM` took by far the longest to train (mean fit time of about 409 s), while `Naive Bayes` was the fastest (about 3 s). Judging by the gap between training and validation scores, `Decision Tree` is the most overfit of all (1.0 vs. 0.709). However, even the best-performing models overfit to some degree.
>
> Taking the trade-off between training time and score into account, I would choose `Logistic Regression` as the most appropriate model for this classification task.
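
The comparison above can be made explicit with a few lines of plain Python. This is only a minimal sketch: the numbers are copied from the results table, and `overfit_gap` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
# Mean scores copied from the cross-validation results table above.
results = {
    "vectorizer + Decision Tree": {"train": 1.0, "valid": 0.709, "fit_time": 24.7664},
    "vectorizer + RBF SVM": {"train": 0.9166, "valid": 0.8509, "fit_time": 409.159},
    "vectorizer + Naive Bayes": {"train": 0.9091, "valid": 0.7788, "fit_time": 3.2498},
    "vectorizer + Logistic Regression": {"train": 0.9991, "valid": 0.8406, "fit_time": 19.2095},
}


def overfit_gap(row):
    """Hypothetical helper: train/validation gap as a rough overfitting indicator."""
    return row["train"] - row["valid"]


gaps = {name: round(overfit_gap(row), 4) for name, row in results.items()}
most_overfit = max(gaps, key=gaps.get)
best_valid = max(results, key=lambda name: results[name]["valid"])
fastest = min(results, key=lambda name: results[name]["fit_time"])

# Print models from largest to smallest train/validation gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: gap = {gap:.4f}, fit time = {results[name]['fit_time']:.1f} s")
```

The gap is a coarse indicator only; it says nothing about whether the validation score itself is acceptable, which is why both columns are considered together.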

In [11]:
# Hyperparameter optimization using RandomizedSearchCV with Logistic Regression
pipe_lr = make_pipeline(CountVectorizer(), LogisticRegression(max_iter = 1000))

param_grid = {
    "logisticregression__C": 10.0 ** np.arange(-3, 3),
    "countvectorizer__max_features": [100, 2500, 5000, 10000, 20000, X_train.shape[0]]
}
random_search = RandomizedSearchCV(pipe_lr, param_distributions=param_grid, n_jobs=-2, n_iter=20, cv=5)
random_search.fit(X_train['review'], y_train)

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('countvectorizer',
                                              CountVectorizer()),
                                             ('logisticregression',
                                              LogisticRegression(max_iter=1000))]),
                   n_iter=20, n_jobs=-2,
                   param_distributions={'countvectorizer__max_features': [100,
                                                                          2500,
                                                                          5000,
                                                                          10000,
                                                                          20000,
                                                                          25000],
                                        'logisticregression__C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])})

In [12]:
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_}")

Best parameters: {'logisticregression__C': 0.01, 'countvectorizer__max_features': 5000}
Best score: 0.8662000000000001


> The best cross-validation score (0.8662) was found at `C = 0.01` and `max_features = 5000`. With this configuration, the mean validation accuracy of `Logistic Regression` improved from 0.8406 to 0.8662 and exceeded that of the untuned `RBF SVM` (0.8509), which had been the best-performing model so far.
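
To put the search in context, the sketch below enumerates the discrete search space and draws 20 distinct candidates from it. It assumes, as the scikit-learn documentation describes for parameters given as lists, that `RandomizedSearchCV` samples such candidates without replacement; the seed and the use of the standard-library `random` module are illustrative choices, not what the search above actually ran.

```python
import itertools
import random

# Candidate values copied from the search above (25000 is X_train.shape[0]).
C_values = [10.0 ** k for k in range(-3, 3)]
max_features_values = [100, 2500, 5000, 10000, 20000, 25000]

# Full grid: every (C, max_features) combination.
grid = list(itertools.product(C_values, max_features_values))

# Draw 20 distinct candidates, analogous to n_iter=20.
# The seed is an arbitrary choice so that this sketch is reproducible.
rng = random.Random(0)
sampled = rng.sample(grid, k=20)

coverage = len(sampled) / len(grid)
print(f"{len(sampled)} of {len(grid)} combinations tried ({coverage:.0%})")
```

Because only part of the grid is evaluated, the reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid.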

### Model interpretation

In [13]:
# Set pipeline with the tuned C and run CV (note: max_features is 20000 here, not the 5000 reported by the search)
pipe_lr_be = make_pipeline(CountVectorizer(max_features=20000), LogisticRegression(max_iter=1000, C=0.01))
lr_scores = cross_validate(pipe_lr_be, X_train["review"], y_train, return_train_score=True)
pd.DataFrame(lr_scores)

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 5.657449 | 0.86083 | 0.8696 | 0.9217 |
| 1 | 5.002771 | 0.847228 | 0.8592 | 0.92535 |
| 2 | 5.290339 | 0.820774 | 0.8572 | 0.9247 |
| 3 | 5.414927 | 1.077567 | 0.8622 | 0.923 |
| 4 | 5.100554 | 0.85357 | 0.8762 | 0.921 |


In [14]:
pipe_lr_be.fit(X_train["review"], y_train)

Pipeline(steps=[('countvectorizer', CountVectorizer(max_features=20000)),
                ('logisticregression',
                 LogisticRegression(C=0.01, max_iter=1000))])

In [15]:
# Get top 20 words indicative of negative and positive reviews

### Credit: Code adapted from lecture notes ###
# get_feature_names_out() requires scikit-learn >= 1.0; older versions used get_feature_names()
vocab = pipe_lr_be['countvectorizer'].get_feature_names_out()
weights = pipe_lr_be['logisticregression'].coef_.flatten()
coef = np.argsort(weights)  # indices sorted from most negative to most positive weight

# 20 words indicative of negative reviews
neg_words = [vocab[idx] for idx in coef[:20]]

# 20 words indicative of positive reviews
pos_words = [vocab[idx] for idx in coef[-20:]]

neg_words_weights = [(weights[idx]) for idx in coef[:20]]
pos_words_weights = [(weights[idx]) for idx in coef[-20:]]

pd.DataFrame(
    {
        "neg words": neg_words,
        "neg weights": neg_words_weights,
        "pos words": pos_words,
        "pos weights": pos_words_weights,
    }
)

| | neg words | neg weights | pos words | pos weights |
|---|---|---|---|---|
| 0 | worst | -0.862892 | job | 0.257614 |
| 1 | waste | -0.682865 | simple | 0.258293 |
| 2 | awful | -0.616819 | surprised | 0.262982 |
| 3 | boring | -0.549721 | fantastic | 0.267735 |
| 4 | poor | -0.466777 | definitely | 0.275765 |
| 5 | bad | -0.458941 | enjoyable | 0.280504 |
| 6 | terrible | -0.438812 | brilliant | 0.303954 |
| 7 | worse | -0.437115 | enjoyed | 0.314259 |
| 8 | poorly | -0.430238 | highly | 0.315074 |
| 9 | horrible | -0.412052 | fun | 0.322172 |


> From these results, we can see that in both cases the words are consistent with the type of review given to the movie. As a side note, it is interesting that the negative coefficients are larger in magnitude than the positive ones (for example, -0.86 for "worst" versus 0.32 for "fun").
>
> The fact that Logistic Regression models expose this information is important because it gives us a way to understand the results produced by the model. By looking at how coefficients are mapped to words, we can see the impact that each feature has on the model's predictions.
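
One way to read these coefficients is as changes in log-odds: in a logistic regression, each additional occurrence of a word adds its weight to the log-odds of the positive class, so `exp(weight)` is the factor by which the odds are multiplied. The sketch below applies this to a few weights copied from the table above; it assumes the coefficients refer to the `pos` class (which the signs in the table suggest), and the example review counts are made up.

```python
import math

# Weights copied from the coefficient table above (log-odds of the "pos" class
# per occurrence of the word, assuming classes_ is ordered ['neg', 'pos']).
weights = {"worst": -0.862892, "waste": -0.682865, "fun": 0.322172, "highly": 0.315074}

# exp(weight) is the factor by which one extra occurrence multiplies the odds of "pos".
odds_multipliers = {word: math.exp(w) for word, w in weights.items()}

for word, factor in odds_multipliers.items():
    print(f"{word}: weight = {weights[word]:+.3f}, odds multiplier = {factor:.3f}")

# Hypothetical review containing "worst" once and "fun" twice, all else being equal:
log_odds_shift = weights["worst"] * 1 + weights["fun"] * 2
print(f"Net shift in log-odds: {log_odds_shift:+.3f}")
```

Under these assumptions, a single "worst" outweighs two occurrences of "fun", which matches the observation that the negative coefficients are more pronounced.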

### Test score and model evaluation

In [16]:
# Test set score
score = random_search.best_estimator_.score(X_test['review'], y_test)
print(f"Model score on test set: {score}")
print(f"Mean cross-validation score: {lr_scores['test_score'].mean()}")

Model score on test set: 0.87944
Mean cross-validation score: 0.8648800000000001


> Even though the test score (0.879) is in line with the mean cross-validation score (0.865) obtained while training the model, it should not be blindly trusted for the following reasons:
> * The training and test sets were split equally (50% / 50%), so there may be room to improve training with a different split that gives the model more training data (e.g. 70-30 or 80-20).
> * The `CountVectorizer` approach of converting a collection of text documents into a vector of term/token counts might not be the best approach to word encoding, as it cannot capture the relationships between words. This could limit the classifier's capacity to make more accurate predictions.
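
The second point can be illustrated with a small sketch. The `bag_of_words` function below is a simplified stand-in for `CountVectorizer` (it lowercases the text and uses the same kind of token pattern, words of two or more characters), and the two reviews are made-up examples rather than rows from the dataset.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Simplified stand-in for CountVectorizer: lowercase, tokenize, count."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


# Two made-up reviews with opposite meaning but identical word counts.
review_a = "The acting was good, not bad at all"
review_b = "The acting was bad, not good at all"

counts_a = bag_of_words(review_a)
counts_b = bag_of_words(review_b)

# Word order is discarded, so both reviews map to the same representation.
print(counts_a == counts_b)
```

Since both reviews produce the same counts, any classifier trained on this representation must give them the same prediction. Counting word pairs as well (for instance through the `ngram_range` parameter of `CountVectorizer`) is one possible way to retain some local word order.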

In [17]:
# Most representative example from negative and positive reviews
probs = pipe_lr_be.predict_proba(X_test['review'])  # columns follow classes_: ['neg', 'pos']
neg_probs = probs[:, 0]
pos_probs = probs[:, 1]
most_neg = np.argmax(neg_probs)
most_pos = np.argmax(pos_probs)
print("Most positive review:\n")
print(f"Probability score: {pos_probs[most_pos]}\n")
print(X_test["review"].iloc[most_pos])  # positional lookup, independent of index labels
print("\nMost negative review:\n")
print(f"Probability score: {neg_probs[most_neg]}\n")
print(X_test["review"].iloc[most_neg])

Most positive review:

Probability score: 0.999999933105387

Universal Studios version of "Flipper" (1996) is a great heartwarming film for the entire family with good values and sentimentality. It is the story of Sandy Ricks, a teenager from Chicago who reluctantly spends his vacation with his Uncle Porter Ricks in the Bahamas. This ultimately changes the teenagers life and he grows up in the process. He learns to appreciate nature and to have a respect for the environment. I grew up in the 1960's and the NBC television show "Flipper" was my favorite childhood show. Elijah Wood is perfectly cast as a 1990's Sandy Ricks and gives an excellent performance. As much as I liked the NBC television show and MGM theatrical feature films with Luke Halpin as Sandy in the 1960's I liked this feature the best! I feel Elijah Wood is the best Sandy Ricks. With respect to Luke Halpin I feel Elijah Wood has more of a range of acting talent and emotes more as an actor which makes his performance excel