### Sentiment analysis of movie (IMDB) reviews using dataset provided by the ACL 2011 paper, see http://ai.stanford.edu/~amaas/data/sentiment/.

#### Dataset can be downloaded separately from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz, but wont be necessary as the download process has been embedded in the notebook and source file.

In [2]:
# !pip install nltk
# !pip install --upgrade gensim

import numpy as np
import os
import os.path

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import nltk
#nltk.download('punkt')>>> import nltk


import glob
from gensim.models import Word2Vec

import time



In [63]:
# MacOSX: See https://www.mkyong.com/mac/wget-on-mac-os-x/ for wget
print('On the MacOSX, you will need to install wget, see https://www.mkyong.com/mac/wget-on-mac-os-x/')

if not os.path.isfile('aclImdb_v1.tar.gz'):
  !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz 

if not os.path.isfile('aclImdb'):  
  !tar -xf aclImdb_v1.tar.gz 


On the MacOSX, you will need to install wget, see https://www.mkyong.com/mac/wget-on-mac-os-x/


In [3]:
time_beginning_of_notebook = time.time()
SAMPLE_SIZE=1000
positive_sample_file_list = glob.glob(os.path.join('aclImdb/train/pos', "*.txt"))
positive_sample_file_list = positive_sample_file_list[:SAMPLE_SIZE]

negative_sample_file_list = glob.glob(os.path.join('aclImdb/train/neg', "*.txt"))
negative_sample_file_list = negative_sample_file_list[:SAMPLE_SIZE]

import re

# load doc into memory
# regex to clean markup elements 
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r', encoding='utf8')
    # read all text
    text = re.sub('<[^>]*>', ' ', file.read())
    #text = file.read()
    # close the file
    file.close()
    return text


# Data exploration

In [4]:
positive_strings = [load_doc(x) for x in positive_sample_file_list]
#print('\n Positive reviews \n ',positive_strings[:5])

negative_strings = [load_doc(x) for x in negative_sample_file_list]
#print('\n Negative reviews \n ', negative_strings[:5])
    

In [5]:
positive_tokenized = [word_tokenize(s) for s in positive_strings]
#print('\n Positive tokenized 1 \n {} \n\n Positive tokenized 2 \n {}'. format(positive_tokenized[1], positive_tokenized[2]))


In [6]:
negative_tokenized = [word_tokenize(s) for s in negative_strings]
#print('\n Negative tokenized 1 \n {} \n\n  Negative tokenized 2 \n {}'. format(negative_tokenized[1], negative_tokenized[2]))

In [7]:
# load doc into memory
with open('aclImdb/imdb.vocab', encoding='utf8') as f:
    #content = f.readlines()
    universe_vocabulary = [x.strip() for x in f.readlines()]

print("Word count across all reviews (before stripping tokens):", sum([len(token) for token in positive_tokenized]))

#Checking the not alphanumeric characters in vocabulary
non_alphanumeric_set = set()
for word in universe_vocabulary:
    non_alphanumeric_set |= set(re.findall('\W', word))
print('Non alphanumeric characters found in universe vocabulary', non_alphanumeric_set)


stripped_positive_tokenized = []
for tokens in positive_tokenized:
  stripped_positive_tokenized.append([token.lower() for token in tokens if token.lower() in universe_vocabulary])

print("Word count across all reviews (after stripping tokens):", sum([len(token) for token in stripped_positive_tokenized]))

Word count across all reviews (before stripping tokens): 255821
Non alphanumeric characters found in universe vocabulary {'?', '=', '}', '-', '(', '[', ';', ')', ']', "'", '!', ':'}
Word count across all reviews (after stripping tokens): 221986


In [8]:
print("Word count across all reviews (before stripping tokens):", sum([len(token) for token in positive_tokenized]))
stripped_negative_tokenized = []
for tokens in negative_tokenized:
  stripped_negative_tokenized.append([token.lower() for token in tokens if token.lower() in universe_vocabulary])

print("Word count across all reviews (after stripping tokens):", sum([len(token) for token in stripped_negative_tokenized]))

Word count across all reviews (before stripping tokens): 255821
Word count across all reviews (after stripping tokens): 217639


## Modelling 

We have decided to do the use the below models and vectorisation techniques to test our their accuracy / score, the idea is to use a one model and one vectorization technique and plot a score.

**Simple models**

- Logistic Regression
- Random Forst
- LSTM
- GRU
- CNN

**Vectorisation techniques**
- Bag of Words
- Word2Vec
- TFIDF (probability scores)
- FastText
- Glove

## Logistic Regression.
## Introducing Pipeline: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

## Introducing TfdfVectorizer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

## Introducing cross_val_score http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html


<br>


In [10]:
import pandas as pd
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

df_positives = pd.DataFrame({'reviews':[load_doc(x) for x in positive_sample_file_list], 'sentiment': np.ones(SAMPLE_SIZE)})
df_negatives = pd.DataFrame({'reviews':[load_doc(x) for x in negative_sample_file_list], 'sentiment': np.zeros(SAMPLE_SIZE)})

df = pd.concat([df_positives, df_negatives], ignore_index=True)

df = shuffle(df)

X_train, X_test, y_train, y_test = train_test_split(df['reviews'], df['sentiment'], test_size=0.25)



## Logistic Regress model using Bag of Words vectorisation technique

In [11]:

CountVec = CountVectorizer()
lr_CV = Pipeline([('vect', CountVec), ('clf', LogisticRegression(random_state=0))])
lr_CV.fit(X_train, y_train)
print('Train accuracy {}'.format(lr_CV.score(X_train, y_train)))
print('Test accuracy {}'.format(lr_CV.score(X_test, y_test)))

# # Trying with cross_val_score
lr = LogisticRegression()
k_folds = 10
X_train_CV = CountVec.fit_transform(X_train)
type(X_train_CV)
print('Train accuracy list {} '.format(cross_val_score(lr, X_train_CV, y_train, cv= k_folds))) 
print('Train accuracy mean {} '.format(cross_val_score(lr, X_train_CV, y_train, cv= k_folds).mean()))

Train accuracy 1.0
Test accuracy 0.858
Train accuracy list [0.86092715 0.86754967 0.87417219 0.83333333 0.83333333 0.82666667
 0.87333333 0.83892617 0.8590604  0.91946309] 
Train accuracy mean 0.8586765337718714 


## Logistic Regress model using TfidfVectorizer vectorisation technique

In [12]:

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
print('Train accuracy {}'.format(lr_tfidf.score(X_train, y_train)))
print('Test accuracy {}'.format(lr_tfidf.score(X_test, y_test)))

# Trying with cross_val_score
lr = LogisticRegression()
k_folds = 10
X_train_tfidf = tfidf.fit_transform(X_train)
print('Train accuracy list {} '.format(cross_val_score(lr, X_train_tfidf, y_train, cv= k_folds))) 
print('Train accuracy mean {} '.format(cross_val_score(lr, X_train_tfidf, y_train, cv= k_folds).mean()))


Train accuracy 0.982
Test accuracy 0.88
Train accuracy list [0.89403974 0.89403974 0.86754967 0.88666667 0.85333333 0.83333333
 0.89333333 0.86577181 0.89261745 0.89932886] 
Train accuracy mean 0.8780013926544884 


## Logistic Regress model using TfidfVectorizer and different values for C hyperparameter

In [22]:
C_values = np.arange(1,2,0.1)
results = []

for value in C_values:   
    lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0, C=value))])
    lr_tfidf.fit(X_train, y_train)
    train_score = lr_tfidf.score(X_train, y_train)
    score = lr_tfidf.score(X_test, y_test)
    print('C_value {} Test Score {} Train_score {}'.format(value, score, train_score))
    results.append(score)

time_end_of_notebook = time.time()

C_value 1.0 Test Score 0.88 Train_score 0.982
C_value 1.1 Test Score 0.886 Train_score 0.9853333333333333
C_value 1.2000000000000002 Test Score 0.886 Train_score 0.9866666666666667
C_value 1.3000000000000003 Test Score 0.886 Train_score 0.9873333333333333
C_value 1.4000000000000004 Test Score 0.886 Train_score 0.9886666666666667
C_value 1.5000000000000004 Test Score 0.886 Train_score 0.9886666666666667
C_value 1.6000000000000005 Test Score 0.888 Train_score 0.9893333333333333
C_value 1.7000000000000006 Test Score 0.888 Train_score 0.99
C_value 1.8000000000000007 Test Score 0.888 Train_score 0.99
C_value 1.9000000000000008 Test Score 0.89 Train_score 0.99


In [23]:

table_models_vectorization = pd.DataFrame(
     {'Models':                   ["Logistic Regression", "Logistic Regression", "Logistic Regression"], 
      'Vectorisation techniques': ["Bag of Words",        "Word2Vec", "TFIDF"], 
      'Score':                    [score,                 "Pending", lr_tfidf.score(X_train, y_train) ]},
    columns=['Models','Vectorisation techniques','Score']
)
print("Sample size:", SAMPLE_SIZE)

duration = time_end_of_notebook - time_beginning_of_notebook

print("Full notebook execution duration:", duration, "seconds")
print("Full notebook execution duration:", duration / 60, "minutes")

table_models_vectorization

Sample size: 1000
Full notebook execution duration: 1743.947651386261 seconds
Full notebook execution duration: 29.065794189771015 minutes


Unnamed: 0,Models,Vectorisation techniques,Score
0,Logistic Regression,Bag of Words,0.89
1,Logistic Regression,Word2Vec,Pending
2,Logistic Regression,TFIDF,0.99


In [None]:
#