## Sentiment Analysis: Movie Reviews

In this notebook, supervised machine learning methods are applied to a text classification problem. The problem uses a popular dataset of 50,000 movie reviews extracted from IMDb.

> Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher. Learning Word Vectors for Sentiment Analysis. In Association for Computational Linguistics, pp. 142-150, Portland, Oregon, USA, June 2011. [[Web Link]](http://www.aclweb.org/anthology/P11-1015)

### Dataset Information (from the attached README)

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with the observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.
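The labelling rule described above can be sketched as a small function. This is an illustration only: `label_from_rating` and `label_from_filename` are hypothetical helpers that this notebook does not use, and the `<id>_<rating>.txt` filename pattern is an assumption about how the downloaded files are named.

```python
import re
from typing import Optional

# Assumed filename pattern "<id>_<rating>.txt" (an assumption, not taken from this notebook)
_FILENAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")


def label_from_rating(rating: int) -> Optional[int]:
    """Map a 1-10 rating to 0 (negative), 1 (positive), or None (neutral, excluded)."""
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None  # ratings 5 and 6 are left out of the labeled sets


def label_from_filename(filename: str) -> Optional[int]:
    """Hypothetical helper: read the rating from the filename and map it to a label."""
    match = _FILENAME_RE.match(filename)
    if match is None:
        raise ValueError("Unexpected filename: {}".format(filename))
    return label_from_rating(int(match.group(2)))


for rating in (1, 4, 5, 6, 7, 10):
    print(rating, label_from_rating(rating))
```

In the notebook itself the label comes from the `pos` / `neg` folder names, so no rating parsing is needed.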

Due to the size of the dataset, please download it from [[here]](https://ai.stanford.edu/~amaas/data/sentiment/).

### Objective
To classify the sentiment of a movie review as positive or negative.
 
### Approach
In this exercise, we use the **TfidfVectorizer** class from scikit-learn. TfidfVectorizer performs feature extraction: it transforms text data into feature vectors that can be used as input for estimators. The weight of a word in a document is proportional to how often it occurs in that document, and is scaled down by the number of documents in the corpus (a collection of texts) that contain it. This is needed because a word that appears frequently across many documents (e.g. we, the, I) is of little use in telling them apart.
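To make the weighting concrete, here is a minimal sketch on a toy corpus (the three sentences are made up for illustration and are not taken from the dataset). With scikit-learn's default `smooth_idf=True`, the inverse document frequency of a term $t$ is $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
corpus = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# Pair every vocabulary term with its inverse-document-frequency weight
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

# Terms found in every document get the lowest weight, rare terms the highest
for term, weight in sorted(idf.items(), key=lambda item: item[1]):
    print("{:<10s} {:.3f}".format(term, weight))
```

Words such as "the" and "was", which occur in all three documents, receive the smallest weight, while "terrible" and "acting", which occur in only one, receive the largest.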

### Models
In this exercise, the following classifiers are used: **Multinomial Naive Bayes** (MNB) and **Logistic Regression** (Log).

As far as possible, default parameters are used (e.g. C, max_iter, alpha), with the exception of *random_state = 0* for Logistic Regression.
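As a quick check, the defaults in effect can be read back from the estimators with `get_params()`. This is only a sketch; the values printed depend on the installed scikit-learn version.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Instantiate the two classifiers the same way the notebook does
log_params = LogisticRegression(random_state=0).get_params()
mnb_params = MultinomialNB().get_params()

# Show only the parameters mentioned in the text
print("LogisticRegression:", {k: log_params[k] for k in ("C", "max_iter", "random_state")})
print("MultinomialNB:", {k: mnb_params[k] for k in ("alpha",)})
```

Note that `MultinomialNB` has no `random_state` parameter, so the seed only affects Logistic Regression.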

#### Measures 
For each model, the following scores are calculated:

 - Precision Scores
 - Recall Scores
 - Cross-validated accuracy scores for training set, and accuracy scores for test set
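As a reminder of what these measures capture, here is a small sketch on made-up labels (not results from this notebook). Precision is the share of predicted positives that are truly positive, recall is the share of true positives that were found, and accuracy is the share of all predictions that are correct.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
# Confusion counts: TP = 3, FN = 1, FP = 2, TN = 2

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 5 / 8

print("precision={:.3f} recall={:.3f} accuracy={:.3f}".format(precision, recall, accuracy))
```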
 
### Conclusion 
Both models classify the test reviews well. MultinomialNB achieves a test accuracy of 0.85424, a precision of 0.851206, and a recall of 0.85856. Logistic Regression scores higher on all three, with an accuracy of 0.88012, a precision of 0.873047, and a recall of 0.88960. Overall, Logistic Regression performed better than MultinomialNB. Neither model appears to overfit the training data: the cross-validated training accuracies (0.84424 for MultinomialNB and 0.86632 for Logistic Regression) are close to the test accuracies, so both models generalise well. However, this may be attributed to the extensive set of training records for only two classes (12,500 instances each).

In [1]:
# Libraries related to importing of data, cleaning, storing, and manipulation.
import os
import numpy as np
import pandas as pd

# Libraries related to text processing
import unidecode
from bs4 import BeautifulSoup
import string

# Libraries related to Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import precision_score, recall_score

# Other Libraries
import time ## Used to measure the time taken for some processes (e.g. Importing, Training, Evaluating)

In [2]:
mypath = "C:/Users/XXX/XXX/XXX/XXXX/XXXX/aclImdb/{}/{}" # Masked for privacy

d = {}
for path in ['test', 'train']:
    start = time.time()
    for typ in ['neg', 'pos']:
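        folder = mypath.format(path, typ)
        reviews = []
        for f in os.listdir(folder):
            full_path = os.path.join(folder, f)
            if os.path.isfile(full_path) and f.endswith('.txt'):
                # The context manager closes each file handle as soon as it has been read
                with open(full_path, 'r', encoding = 'utf8') as fh:
                    reviews.append(fh.read())
        d["{}_{}".format(path, typ)] = reviews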
    print("Import completed for {}ing dataset in {:.5f} seconds.".format(path, time.time() - start))

Import completed for testing dataset in 68.84991 seconds.
Import completed for training dataset in 54.38756 seconds.


In [3]:
# Dataframe used for training purposes
train_pos = pd.DataFrame(d['train_pos'])
train_pos['sentiment'] = 1
train_neg = pd.DataFrame(d['train_neg'])
train_neg['sentiment'] = 0

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
train_df = pd.concat([train_pos, train_neg], ignore_index = True)

# Dataframe used for testing purposes
test_pos = pd.DataFrame(d['test_pos'])
test_pos['sentiment'] = 1
test_neg = pd.DataFrame(d['test_neg'])
test_neg['sentiment'] = 0

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
test_df = pd.concat([test_pos, test_neg], ignore_index = True)

In [4]:
def pre_processing(series):
    lst = []
    PUNCT_TO_REMOVE = string.punctuation
    for text in series:
        # Strip HTML tags (e.g. <br />), then transliterate non-ASCII characters to ASCII
        text = unidecode.unidecode(BeautifulSoup(text, "html.parser").get_text(separator = " "))
        # No need to call lower() here: TfidfVectorizer lowercases the text itself (lowercase = True)
        text = text.translate(str.maketrans('', '', PUNCT_TO_REMOVE)) # Remove all punctuation
        lst.append(text)
    return lst

train_df[0], test_df[0] = pre_processing(train_df[0]), pre_processing(test_df[0])

In [5]:
X = train_df[0]
y = train_df['sentiment']

"""
Dataset count: 25,000
Minimum document frequency (min_df) = 100: ignore terms that appear in fewer than 100 documents of the dataset
> To remove words that are used very infrequently (e.g. misspellings, shorthand, made-up words)
"""
vect = TfidfVectorizer(min_df = 100, stop_words = 'english', ngram_range = (1, 3), lowercase = True).fit(X)
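The effect of `min_df` can be seen on a toy corpus (four invented snippets, with a threshold of 2 instead of 100). The sketch below assumes nothing beyond scikit-learn's `TfidfVectorizer`. Note that `min_df` counts the documents a term appears in, not its total number of occurrences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration; "dul" stands in for a misspelling
docs = [
    "great film film",
    "great plot",
    "dull plot",
    "dul plot",
]

# min_df = 1 keeps every term; min_df = 2 keeps terms found in at least 2 documents
vocab_all = set(TfidfVectorizer(min_df=1).fit(docs).vocabulary_)
vocab_min2 = set(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

print("min_df=1:", sorted(vocab_all))
print("min_df=2:", sorted(vocab_min2))
print("dropped :", sorted(vocab_all - vocab_min2))
```

Here "film" is dropped even though it occurs twice, because both occurrences are in the same document.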

In [6]:
models = [MultinomialNB(), LogisticRegression(random_state = 0)]

In [7]:
cv_entries = []
for model in models:
    start = time.time()
    m_name = model.__class__.__name__
    accuracies = cross_val_score(model, vect.transform(X), y, scoring = 'accuracy', cv = 5)
    for idx, accuracy in enumerate(accuracies):
        cv_entries.append([m_name, idx, accuracy])
    print("Cross-validation completed for {} in {:.5f} seconds".format(m_name, time.time() - start))
cv_dF = pd.DataFrame(cv_entries, columns = ['Model','Job','Accuracy'])
cv_dF.groupby('Model')['Accuracy'].mean()  # Mean accuracy across the 5 folds, selected by column name

Cross-validation completed for MultinomialNB in 10.65864 seconds
Cross-validation completed for LogisticRegression in 12.13979 seconds


Model
LogisticRegression    0.86632
MultinomialNB         0.84424
Name: Accuracy, dtype: float64

In [8]:
eval_entries = []
for model in models:
    start = time.time()
    m_name = model.__class__.__name__
    model.fit(vect.transform(X), y)
    score = model.score(vect.transform(test_df[0]), test_df['sentiment'])
    precision = precision_score(test_df['sentiment'], model.predict(vect.transform(test_df[0])))
    recall = recall_score(test_df['sentiment'], model.predict(vect.transform(test_df[0])))
    eval_entries.append([m_name, score, precision, recall])
    print("Evaluation completed for {} in {:.5f} seconds".format(m_name, time.time() - start))
score_dF = pd.DataFrame(eval_entries, columns = ['Model', 'Score', 'Precision', 'Recall']).set_index('Model')
score_dF

Evaluation completed for MultinomialNB in 43.08900 seconds
Evaluation completed for LogisticRegression in 43.59243 seconds


| Model | Score | Precision | Recall |
|---|---|---|---|
| MultinomialNB | 0.85424 | 0.851206 | 0.85856 |
| LogisticRegression | 0.88012 | 0.873047 | 0.8896 |
