### Sentiment analysis of movie (IMDB) reviews using dataset provided by the ACL 2011 paper, see http://ai.stanford.edu/~amaas/data/sentiment/.

#### Dataset can be downloaded separately from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz, but wont be necessary as the download process has been embedded in the notebook and source file.

In [62]:
!pip install nltk
!pip install --upgrade gensim

import numpy as np
import os
import os.path

from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

import glob
from gensim.models import Word2Vec

import time

Requirement already up-to-date: gensim in /usr/local/lib/python3.6/dist-packages (3.6.0)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [63]:
# MacOSX: See https://www.mkyong.com/mac/wget-on-mac-os-x/ for wget
print('On the MacOSX, you will need to install wget, see https://www.mkyong.com/mac/wget-on-mac-os-x/')

if not os.path.isfile('aclImdb_v1.tar.gz'):
  !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz 

if not os.path.isfile('aclImdb'):  
  !tar -xf aclImdb_v1.tar.gz 


On the MacOSX, you will need to install wget, see https://www.mkyong.com/mac/wget-on-mac-os-x/


In [0]:
time_beginning_of_notebook = time.time()
SAMPLE_SIZE=600
positive_sample_file_list = glob.glob(os.path.join('aclImdb/train/pos', "*.txt"))
positive_sample_file_list = positive_sample_file_list[:SAMPLE_SIZE]

negative_sample_file_list = glob.glob(os.path.join('aclImdb/train/neg', "*.txt"))
negative_sample_file_list = negative_sample_file_list[:SAMPLE_SIZE]

import re

# load doc into memory
# regex to clean markup elements 
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r', encoding='utf8')
	# read all text
	text = re.sub('<[^>]*>', ' ', file.read())
	# close the file
	file.close()
	return text


# New Section

In [80]:
positive_strings = [load_doc(x) for x in positive_sample_file_list]
print(positive_strings[:10])

negative_strings = [load_doc(x) for x in negative_sample_file_list]
print(negative_strings[:10])
    

['Great entertainment from start to the end. Wonderful performances by Belushi, Beach, Dalton & Railsback. Some twists and many action scenes. The movie was made for me! Funny lines in the screenplay, good music. Dalton as the tough sheriff and Railsback as "redneck-villain". I must recommend this film to every action-adventure fan! 10/10', '"Kalifornia"is a great film that makes us look at ourselves.The film has a great cast,Brad Pitt(Johnny Suede,A River Runs Through It,and The Legends Of The Fall)as Early Grayce,David Duchovny(The X Files)as Brian Kessler,Michelle Forbes(Star Trek:The Next Generation,Homicide:Life On The Street,and Escape From L.A.)as Carrie Loughlin,Brian\'s girlfriend,and Juliette Lewis(Natural Born Killers,Cape Fear,and What\'s Eating Gilbert Grape)as Adele Corners,Early\'s girlfriend.  Brian Kessler is a writer who is a Liberal,is getting ready to write a book about serial killers.Brian and his girlfriend,Carrie decide they want to move to California,so Brian pl

In [81]:
positive_labels = np.array(SAMPLE_SIZE * [1])
print(positive_labels)

[1 1 1 ... 1 1 1]


In [82]:
negative_labels = np.array(SAMPLE_SIZE * [0])
print(negative_labels)

[0 0 0 ... 0 0 0]


In [83]:
positive_tokenized = [word_tokenize(s) for s in positive_strings]
print(positive_tokenized[1])
print(positive_tokenized[2])

['``', 'Kalifornia', "''", 'is', 'a', 'great', 'film', 'that', 'makes', 'us', 'look', 'at', 'ourselves.The', 'film', 'has', 'a', 'great', 'cast', ',', 'Brad', 'Pitt', '(', 'Johnny', 'Suede', ',', 'A', 'River', 'Runs', 'Through', 'It', ',', 'and', 'The', 'Legends', 'Of', 'The', 'Fall', ')', 'as', 'Early', 'Grayce', ',', 'David', 'Duchovny', '(', 'The', 'X', 'Files', ')', 'as', 'Brian', 'Kessler', ',', 'Michelle', 'Forbes', '(', 'Star', 'Trek', ':', 'The', 'Next', 'Generation', ',', 'Homicide', ':', 'Life', 'On', 'The', 'Street', ',', 'and', 'Escape', 'From', 'L.A.', ')', 'as', 'Carrie', 'Loughlin', ',', 'Brian', "'s", 'girlfriend', ',', 'and', 'Juliette', 'Lewis', '(', 'Natural', 'Born', 'Killers', ',', 'Cape', 'Fear', ',', 'and', 'What', "'s", 'Eating', 'Gilbert', 'Grape', ')', 'as', 'Adele', 'Corners', ',', 'Early', "'s", 'girlfriend', '.', 'Brian', 'Kessler', 'is', 'a', 'writer', 'who', 'is', 'a', 'Liberal', ',', 'is', 'getting', 'ready', 'to', 'write', 'a', 'book', 'about', 'serial'

In [84]:
negative_tokenized = [word_tokenize(s) for s in negative_strings]
print(negative_tokenized[1])
print(negative_tokenized[2])

['**May', 'Contain', 'Spoilers**', 'A', 'dude', 'in', 'a', 'dopey-looking', 'Kong', 'suit', '(', 'the', 'same', 'one', 'used', 'in', 'KING', 'KONG', 'VS.', 'GODZILLA', 'in', '1962', ')', 'provides', 'much', 'of', 'the', 'laffs', 'in', 'this', 'much-mocked', 'monster', 'flick', '.', 'Kong', 'is', 'resurrected', 'on', 'Mondo', 'Island', 'and', 'helps', 'out', 'the', 'lunkhead', 'hero', 'and', 'other', 'good', 'guys', 'this', 'time', 'around', '.', 'The', 'vampire-like', 'villain', 'is', 'named', 'Dr.', 'Who\x96-funny', ',', 'he', 'does', "n't", 'look', 'like', 'Peter', 'Cushing', '!', 'Kong', 'finally', 'dukes', 'it', 'out', 'with', 'Who', "'s", 'pride', 'and', 'joy', ',', 'a', 'giant', 'robot', 'ape', 'that', 'looks', 'like', 'a', 'bad', 'metal', 'sculpture', 'of', 'Magilla', 'Gorilla', '.', 'Like', 'many', 'of', 'Honda', "'s", 'flicks', 'this', 'may', 'have', 'had', 'some', 'merit', 'before', 'American', 'audiences', 'diddled', 'around', 'with', 'it', 'and', 'added', 'new', 'footage', 

In [85]:
# load doc into memory
with open('aclImdb/imdb.vocab') as f:
  content = f.readlines()
universe_vocabulary = [x.strip() for x in content]


print("Word count across all reviews (before stripping tokens):", sum([len(token) for token in positive_tokenized]))
stripped_positive_tokenized = []
for tokens in positive_tokenized:
  stripped_positive_tokenized.append([token.lower() for token in tokens if token.lower() in universe_vocabulary])

print("Word count across all reviews (after stripping tokens):", sum([len(token) for token in stripped_positive_tokenized]))

Word count across all reviews (before stripping tokens):  2696212
Word count across all reviews (after stripping tokens):  2335068


In [86]:
print(positive_tokenized[0:5])
print(stripped_positive_tokenized[0:5])

[['Great', 'entertainment', 'from', 'start', 'to', 'the', 'end', '.', 'Wonderful', 'performances', 'by', 'Belushi', ',', 'Beach', ',', 'Dalton', '&', 'Railsback', '.', 'Some', 'twists', 'and', 'many', 'action', 'scenes', '.', 'The', 'movie', 'was', 'made', 'for', 'me', '!', 'Funny', 'lines', 'in', 'the', 'screenplay', ',', 'good', 'music', '.', 'Dalton', 'as', 'the', 'tough', 'sheriff', 'and', 'Railsback', 'as', '``', 'redneck-villain', "''", '.', 'I', 'must', 'recommend', 'this', 'film', 'to', 'every', 'action-adventure', 'fan', '!', '10/10'], ['``', 'Kalifornia', "''", 'is', 'a', 'great', 'film', 'that', 'makes', 'us', 'look', 'at', 'ourselves.The', 'film', 'has', 'a', 'great', 'cast', ',', 'Brad', 'Pitt', '(', 'Johnny', 'Suede', ',', 'A', 'River', 'Runs', 'Through', 'It', ',', 'and', 'The', 'Legends', 'Of', 'The', 'Fall', ')', 'as', 'Early', 'Grayce', ',', 'David', 'Duchovny', '(', 'The', 'X', 'Files', ')', 'as', 'Brian', 'Kessler', ',', 'Michelle', 'Forbes', '(', 'Star', 'Trek', ':

In [87]:
print("Word count across all reviews (before stripping tokens):", sum([len(token) for token in positive_tokenized]))
stripped_negative_tokenized = []
for tokens in negative_tokenized:
  stripped_negative_tokenized.append([token.lower() for token in tokens if token.lower() in universe_vocabulary])

print("Word count across all reviews (after stripping tokens):", sum([len(token) for token in stripped_negative_tokenized]))

2696212
2282105


In [88]:
print(negative_tokenized[0:5])
print(stripped_negative_tokenized[0:5])

[['``', 'Metamorphosis', "''", 'hold', 'a', 'tiny', 'bit', 'of', 'cult-value', ',', 'simply', 'because', 'it', 'was', 'written', 'and', 'directed', 'by', 'George', 'Eastman', '.', 'This', 'Italian', 'bloke', 'is', 'more', 'or', 'less', 'the', 'personification', 'of', 'male', 'sleaze', 'and', 'starred', 'in', 'pretty', 'much', 'every', 'rancid', 'Joe', "D'Amato", 'production', 'during', 'the', 'late', "70's/early", '80', "'s", '.', 'Would', "n't", 'it', 'be', 'interesting', 'for', 'avid', 'Euro-cult', 'purchasers', 'to', 'own', 'the', 'only', 'movie', 'directed', 'by', 'the', 'guy', 'who', 'walked', 'around', 'bare-butted', 'in', '``', 'Erotic', 'Nights', 'of', 'the', 'Living', 'Dead', "''", 'all', 'the', 'time', '?', 'I', 'thought', 'so', '!', 'Now', ',', 'unlike', 'the', 'movies', 'he', 'starred', 'in', ',', 'Eastman', "'s", 'own', '``', 'Metamorphosis', "''", 'is', 'kind', 'of', 'disappointing', 'in', 'the', 'gore', '&', 'sleaze', 'departments', '.', 'There', 'are', 'a', 'handful', '

In [89]:
#### Commenting out this bit as it is adding to the time to load the notebook, we can uncomment it when we need to reuse it again

# model_ted = Word2Vec(sentences=positive_tokenized, size=100, window=5, min_count=5, workers=1, sg=0, seed=42)
# model_ted.wv.most_similar("brother")

# print(np.linalg.norm(model_ted.wv['man'] - model_ted.wv['woman']))
# print(np.linalg.norm(model_ted.wv['father'] - model_ted.wv['mother']))
# print(np.linalg.norm(model_ted.wv['brother'] - model_ted.wv['sister']))
# print(np.linalg.norm(model_ted.wv['house'] - model_ted.wv['road']))  ### boat or ship does not exist in the corpus so we get an error if we use them

# print(np.linalg.norm(model_ted.wv['father'] - model_ted.wv['mother']))
# print(np.linalg.norm(model_ted.wv['sister'] - model_ted.wv['mother']))

5.961282
3.7747576
3.1723378
5.432142
3.7747576
3.964881


  if np.issubdtype(vec.dtype, np.int):


In [90]:
features = np.array(stripped_positive_tokenized + stripped_negative_tokenized)
labels = np.concatenate([positive_labels, negative_labels])
# print(features.shape)
# print(features)
# print(labels.shape)
# print(labels)

from keras.preprocessing import text


# GitHub reference: https://github.com/tensorflow/workshops/blob/master/extras/keras-bag-of-words/keras-bow-model.ipynb
# Blog: https://cloud.google.com/blog/products/gcp/intro-to-text-classification-with-keras-automatically-tagging-stack-overflow-posts

vocab_size = 1000
tokenize = text.Tokenizer(num_words=vocab_size, char_level=False)
tokenize.fit_on_texts(features)
tokenized_features = tokenize.texts_to_matrix(features)


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(tokenized_features, labels, test_size=0.25)

print(x_train[1])
# print(x_train.shape)
# print(x_test.shape)
# print(y_train.shape)
# print(y_test.shape)

[0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0.
 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0.
 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

We have decided to do the use the below models and vectorisation techniques to test our their accuracy / score, the idea is to use a one model and one vectorization technique and plot a score.

**Simple models**

- Logistic Regression
- Random Forst
- LSTM
- GRU
- CNN

**Vectorisation techniques**
- Bag of Words
- Word2Vec
- TFIDF (probability scores)
- FastText
- Glove

## Logistic Regress model using Bag of Words vectorisation technique

In [91]:
from sklearn.linear_model import LogisticRegression

# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression()

logisticRegr.fit(x_train, y_train)

score = logisticRegr.score(x_test, y_test)
print("Score: ", score)
y_test = logisticRegr.predict(x_test)
time_end_of_notebook = time.time()

Score:  0.8494


In [92]:
import pandas as pd
table_models_vectorization = pd.DataFrame(
     {'Models':                   ["Logistic Regression", "Logistic Regression"], 
      'Vectorisation techniques': ["Bag of Words",        "Word2Vec"], 
      'Score':                    [score,                 "Pending"]},
    columns=['Models','Vectorisation techniques','Score']
)
print("Sample size:", SAMPLE_SIZE)

duration = time_end_of_notebook - time_beginning_of_notebook

print("Full notebook execution duration:", duration, "seconds")
print("Full notebook execution duration:", duration / 60, "minutes")

table_models_vectorization

Sample size: 10000
Full notebook execution duration: 889.98082447052 seconds
Full notebook execution duration: 14.833013741175334 minutes


Unnamed: 0,Models,Vectorisation techniques,Score
0,Logistic Regression,Bag of Words,0.8494
1,Logistic Regression,Word2Vec,Pending
