# Final Project NLP: Binary Sentiment Analysis

Members: Jonatan Piñol and Peter Weber

The attached final_project_lib.py file contains the class that implements all the logic behind the analysis presented in this notebook.

For data preprocessing, we implement the following steps:
- Read the documents and store the reviews and the corresponding sentiments separately.
- Clean the reviews and convert the sentiments to binary 0/1 labels.

The cleaning step removes punctuation, non-alphanumeric tokens, stopwords, and tokens of length one.
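
The cleaning rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in final_project_lib.py: the function names `clean_review` and `encode_sentiment`, the tiny stopword set, the lowercasing, and the `"pos"`/`"neg"` label strings are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; the project's actual list is not shown.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "this", "was"}


def clean_review(text, stopwords=STOPWORDS):
    """Lowercase a review, strip punctuation, and drop unwanted tokens."""
    # Replace every character that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    return [
        tok for tok in text.split()
        if tok.isalnum()            # keep alphanumeric tokens only
        and tok not in stopwords    # drop stopwords
        and len(tok) > 1            # drop tokens of length one
    ]


def encode_sentiment(label):
    """Map an assumed textual label to 1 (positive) or 0 (negative)."""
    return 1 if label.strip().lower() in {"pos", "positive", "1"} else 0
```

For instance, `clean_review("This movie was great, a 10/10!")` keeps only `movie`, `great`, and the two `10` tokens.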

To prepare the data for the classifier, we use three feature representations:
1. n-gram tf-idf, using uni-gram, bi-gram, and tri-gram features. We limit the features to the 50000 most frequent ones.
2. n-gram counts, using uni-gram, bi-gram, and tri-gram features. We again limit the features to the 50000 most frequent ones.
3. gensim doc2vec embeddings, where each word is represented by a 2000-element vector. We build a feature matrix by describing every review by the min, mean, and max of each vector element over its words, so that we obtain a matrix with 6000 features.
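
The min/mean/max pooling in the third method can be sketched with NumPy as below. The helper names `pool_review` and `build_embedding_matrix` are hypothetical, and the sketch assumes every review has at least one word vector; how the library looks up the vectors is not shown here.

```python
import numpy as np


def pool_review(word_vectors):
    """Concatenate element-wise min, mean, and max over a review's word vectors."""
    vecs = np.asarray(word_vectors, dtype=float)   # shape: (n_words, dim)
    return np.concatenate([vecs.min(axis=0), vecs.mean(axis=0), vecs.max(axis=0)])


def build_embedding_matrix(reviews_as_vectors):
    """Stack one pooled row per review into a (n_reviews, 3 * dim) matrix."""
    return np.vstack([pool_review(v) for v in reviews_as_vectors])


# Toy example with 2-dimensional vectors instead of 2000-dimensional ones.
toy_reviews = [
    [[1, 2], [3, 0]],            # review with two words
    [[0, 0], [2, 2], [4, 4]],    # review with three words
]
toy_matrix = build_embedding_matrix(toy_reviews)
```

With 2000-element vectors the same pooling yields 3 x 2000 = 6000 columns, independent of review length.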

For the predictions, we use a random forest with 1000 estimators and validate it with 4-fold cross-validation.
We first evaluate each of the three feature representations separately, and then take a majority vote between the three resulting models to get the final predictions.
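
A rough sketch of this evaluation with scikit-learn is shown below. It is not the library's `cross_validate_random_forest` method: the synthetic data from `make_classification` stands in for the real feature matrix, the forest is shrunk to 50 trees to keep the example fast, and the use of `cross_val_predict` to collect out-of-fold predictions is an assumption about how such predictions could be obtained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for a feature matrix and 0/1 sentiment labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# n_estimators is reduced from the 1000 used in the text to keep the sketch fast.
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Each sample is predicted by a model trained on the other three folds.
y_hat = cross_val_predict(clf, X, y, cv=4)
cv_accuracy = accuracy_score(y, y_hat)
print("4-fold CV accuracy: %.2f" % cv_accuracy)
```

Keeping the out-of-fold predictions, rather than only the fold scores, is what makes a later majority vote across models possible.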

As a next step (not implemented here), one would ideally feed the embeddings word by word into a convolutional neural network or an LSTM, to take advantage of both the embedding representation of the reviews and the word order.

### Load the sentiment analysis class from the library

In [1]:
import final_project_lib as lib
# autoreload is an IPython extension: it is enabled by the magics below, not by a plain import
%load_ext autoreload
%autoreload 2

import numpy as np
from sklearn.metrics import accuracy_score

### Class instantiation and data manipulation

In [2]:
instance = lib.SentimentAnalysis()
instance.read_documents()
instance.clean_documents()

Documents loaded!
Documents cleaned!


## Compute cross-validation scores for different feature representations
### tf-idf

In [3]:
X_tfidf, y_tfidf = instance.create_tfidf_matrix(max_feat = 50000, n = 3)
score, y_hat_tfidf = instance.cross_validate_random_forest(X_tfidf, y_tfidf)
print "Cross validation (cv) accuracy using 4-fold cv, a random forest classifier and tri-gram tfidf", \
        round(np.mean(score)*100), "%"

Cross validation (cv) accuracy using 4-fold cv, a random forest classifier and tri-gram tfidf 81.0 %
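
For orientation, a tf-idf matrix over uni-, bi-, and tri-grams with a feature cap can be built with scikit-learn roughly as follows. Whether `create_tfidf_matrix` uses `TfidfVectorizer` internally is an assumption, and the three toy reviews are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned reviews.
toy_docs = [
    "great movie loved it",
    "terrible movie hated it",
    "great acting great plot",
]

# ngram_range=(1, 3) extracts uni-, bi-, and tri-grams;
# max_features keeps only the terms with the highest corpus-wide frequency.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=50000)
X_toy = vectorizer.fit_transform(toy_docs)   # sparse matrix, one row per review
print(X_toy.shape)
```

Replacing `TfidfVectorizer` with `CountVectorizer` and the same arguments would give raw n-gram counts instead of tf-idf weights.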


### tri-gram counts

In [4]:
X_ngram, y_ngram = instance.create_ngram_matrix(max_feat = 50000, n = 3)
score, y_hat_ngram = instance.cross_validate_random_forest(X_ngram, y_ngram)
print "Cross validation (cv) accuracy using 4-fold cv, a random forest classifier and tri-gram counts", \
        round(np.mean(score)*100), "%"

Cross validation (cv) accuracy using 4-fold cv, a random forest classifier and tri-gram counts 84.0 %
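
To make explicit what the n-gram count features contain, here is a small pure-Python sketch. The helpers `extract_ngrams` and `build_count_matrix` are hypothetical and only illustrate the idea; for the real corpus a sparse representation would be needed.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=3):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_count_matrix(docs, max_n=3, max_features=50000):
    """Count n-grams per document, keeping only the most frequent ones overall."""
    per_doc = [Counter(extract_ngrams(doc, max_n)) for doc in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    # Vocabulary: the max_features n-grams with the highest corpus-wide count.
    vocab = [gram for gram, _ in totals.most_common(max_features)]
    matrix = [[counts[gram] for gram in vocab] for counts in per_doc]
    return matrix, vocab


toy_tokens = [["good", "movie"], ["good", "film", "good"]]
count_matrix, vocab = build_count_matrix(toy_tokens, max_features=2)
```

Each row of `count_matrix` is one review and each column one retained n-gram, which is the shape of input the random forest expects.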


### Doc2vec embeddings
Background on how word2vec handles sentence boundaries: https://stackoverflow.com/questions/45170589/how-word2vec-deal-with-the-end-of-a-sentence

In [5]:
model = instance.train_doc2vec()

In [6]:
X_d2v, y_d2v = instance.create_embedding_matrix(model)
score, y_hat_d2v = instance.cross_validate_random_forest(X_d2v, y_d2v)
print "Cross validation (cv) accuracy using 4-fold cv, a random forest classifier and doc2vec embeddings", \
        round(np.mean(score)*100), "%"

Cross validation (cv) accuracy using 4-fold cv, a random forest classifier and doc2vec embeddings 75.0 %


### Majority vote between the three models

In [7]:
y_hat = instance.vote_majority(y_hat_tfidf, y_hat_ngram, y_hat_d2v)

In [8]:
print "Verify that all target vectors are equal, n_gram == tfidf:", y_ngram == y_tfidf, \
      "! tfidf == doc2vec:", y_tfidf == y_d2v, "!\n", "So that any of the three can be chosen for comparison with predictions!"
# accuracy_score expects (y_true, y_pred); accuracy itself is symmetric, so the value is unchanged
accuracy = accuracy_score(y_ngram, y_hat)
print "\nMajority vote accuracy:", round(accuracy*100), "%"

Verify that all target vectors are equal, n_gram == tfidf: True ! tfidf == doc2vec: True !
So that any of the three can be chosen for comparison with predictions!

Majority vote accuracy: 83.0 %
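
A majority vote over binary predictions can be sketched as below. This is an illustrative reimplementation under the assumption that all predictions are 0/1 vectors of equal length; the helper name `majority_vote` is hypothetical and is not the library's `vote_majority` method.

```python
import numpy as np


def majority_vote(*predictions):
    """Return, per sample, the 0/1 label predicted by more than half of the models."""
    votes = np.asarray(predictions)            # shape: (n_models, n_samples)
    # With 0/1 labels, the majority label is 1 when the mean vote exceeds 0.5.
    return (votes.mean(axis=0) > 0.5).astype(int)


combined = majority_vote(
    [1, 0, 1, 0],   # e.g. predictions from the tf-idf model
    [1, 1, 0, 0],   # e.g. predictions from the n-gram count model
    [0, 1, 1, 0],   # e.g. predictions from the doc2vec model
)
```

Using an odd number of models guarantees that no sample ends in a tie.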
