<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IMDB-Movie-Review-Sentiment-Classification" data-toc-modified-id="IMDB-Movie-Review-Sentiment-Classification-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>IMDB Movie Review Sentiment Classification</a></span></li><li><span><a href="#Purpose" data-toc-modified-id="Purpose-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href="#Process" data-toc-modified-id="Process-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Process</a></span></li><li><span><a href="#Configure-notebook,-import-libraries,-and-import-dataset" data-toc-modified-id="Configure-notebook,-import-libraries,-and-import-dataset-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Configure notebook, import libraries, and import dataset</a></span></li><li><span><a href="#Examine-the-data" data-toc-modified-id="Examine-the-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Examine the data</a></span></li><li><span><a href="#Cleaning-and-preprocessing" data-toc-modified-id="Cleaning-and-preprocessing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Cleaning and preprocessing</a></span></li><li><span><a href="#Bag-of-words-feature-creation" data-toc-modified-id="Bag-of-words-feature-creation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Bag-of-words feature creation</a></span></li><li><span><a href="#Baseline-Model-development" data-toc-modified-id="Baseline-Model-development-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Baseline Model development</a></span></li></ul></div>

<h1>IMDB Movie Review Sentiment Classification</h1>

<img style="float: left; margin-right: 15px; width: 30%; height: 30%;" src="images/imdb.jpg" />

# Purpose

The overall goal of this set of write-ups is to explore several machine learning algorithms that use natural language processing (NLP) to classify the sentiment of a set of IMDB movie reviews.

The specific goals of this write-up include:
1. Create a sparser feature set by removing words not directly related to sentiment
2. Run the models from the [last write-up](./Model-06.ipynb) against the new feature set
3. Determine if the new feature set improves our ability to correctly classify movie review sentiment

This series of write-ups is inspired by the Kaggle [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial) competition.

Dataset source:  [IMDB Movie Reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

# Process

Previously covered [here](./Model-06.ipynb#Process).

# Configure notebook, import libraries, and import dataset

##### Import libraries

In [105]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
from pandas import set_option

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# http://www.nltk.org/index.html
# pip install nltk
import nltk
from nltk.corpus import stopwords

# https://www.crummy.com/software/BeautifulSoup/bs4/doc/
# pip install BeautifulSoup4
from bs4 import BeautifulSoup

# https://pypi.org/project/gensim/
# pip install gensim
from gensim.models import word2vec

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level = logging.INFO)



##### Define global variables

In [3]:
seed = 10
np.random.seed(seed)

# Opens a GUI that allows us to download the NLTK data
# nltk.download()

dataPath = os.path.join('.', 'datasets', 'imdb_movie_reviews')
labeledTrainData = os.path.join(dataPath, 'labeledTrainData.tsv')

##### Import dataset

In [4]:
df = pd.read_csv(labeledTrainData, sep = '\t', header = 0, quoting = 3)

# Examine the data

Previously covered [here](./Model-06.ipynb#Examine-the-data).

# Cleaning and preprocessing

Process justification and methodology previously covered [here](./Model-06.ipynb#Cleaning-and-preprocessing).

Define a 'cleaning' function, and clean the training set:

In [5]:
# Convert the stop words to a set
stopWords = set(stopwords.words("english"))

# Clean IMDB review text
def cleanReview(review, stopWords):
    # Remove HTML (name the parser explicitly to avoid BeautifulSoup's parser warning)
    clean = BeautifulSoup(review, "html.parser")

    # Remove non-alpha chars
    clean = re.sub("[^a-zA-Z]", ' ', clean.get_text())

    # Convert to lower case and "tokenize"
    clean = clean.lower().split()

    # Remove stop words
    clean = [x for x in clean if x not in stopWords]

    # Prepare final, cleaned review
    clean = " ".join(clean)

    # Return results
    return clean
    

In [6]:
cleanReviews = [cleanReview(x, stopWords) for x in df['review']]
assert len(df) == len(cleanReviews)

# Bag-of-words feature creation

Initial discussion of the `bag-of-words` algorithm was previously covered [here](./Model-06.ipynb#Bag-of-words-feature-creation).

In the [first write-up](./Model-06.ipynb) of this series we examined a sample review (index 108) during the analysis, cleaning, and preprocessing.  Here it is again for reference:

In [7]:
cleanReviews[108]

'question one sees movie bad necessarily movie bad get made even see awful first place learned experience learned rules horror movies catalogued satirized countless times last ten years mean someone go ahead make movie uses without shred humor irony movie described loosely based video game script problems black character may always die first asian character always know kung fu may proud figured matrix effect budget necessarily mean use ad nausea ron howard brother guarantee choice roles whenever scene edit together use footage video game one notice cousin rap metal band offers write movie theme free politely decline zombie movies people killing zombies zombies killing people preferably gruesome way possible makes scary white people pay get rave deserve die find old book tell everything need know anything else figure two lines someone asks bare breasts horror movie panacea helicopter boom shot licensing deal sega magically transforms movie student film major studio release try name drop

Because the bag-of-words creation is just a word count, I wanted to explore what would happen if we removed the 'noise' from the reviews.  (By 'noise' I mean words that are unlikely to signal sentiment in either direction.)  For instance, the review text above contains this string:

```
whenever scene edit together use footage video game one notice cousin rap metal band offers
```

It is doubtful this series of words will give the model any 'insight' into whether this is a positive or negative review.  This next sample, however, does seem to indicate the review's sentiment:

```
question one sees movie bad necessarily movie bad get made even see awful
```

To explore this idea, let's load a sentiment lexicon into the notebook and remove from the review texts any 'noise' word not found in it.  We'll then run the 'de-noised' review texts through the same models as we did in the [previous write-up](./Model-06.ipynb), and see whether we gain any improvement in speed and/or accuracy.
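Before pulling in the real lexicon, here is a minimal sketch of the filtering idea on the two string samples above.  The two-word `toyLexicon` and the `keepSentimentWords` helper are made up for illustration; they are not part of the notebook's pipeline.

```python
# Toy lexicon for illustration only; the real lexicon is loaded from disk later
toyLexicon = {"bad", "awful"}

def keepSentimentWords(text, lexicon):
    """Return the tokens of `text` that appear in `lexicon`, preserving order."""
    return [word for word in text.split() if word in lexicon]

noisy = "whenever scene edit together use footage video game one notice cousin rap metal band offers"
opinionated = "question one sees movie bad necessarily movie bad get made even see awful"

print(keepSentimentWords(noisy, toyLexicon))        # []
print(keepSentimentWords(opinionated, toyLexicon))  # ['bad', 'bad', 'awful']
```

The 'noisy' sample disappears entirely, while the 'opinionated' sample keeps only its sentiment-bearing words.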

##### Download the sentiment lexicon

The sentiment lexicon we'll utilize can be found here:  https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

Using a few commands we can download and extract it:

```
wget https://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
unrar e opinion-lexicon-English.rar
```

##### Applying the sentiment lexicon  - Single observation

Next we'll want to combine all the positive and negative sentiment words into a single list, and then remove any words from the reviews not found in the sentiment lexicon:

In [8]:
# Combine the positive and negative lists of sentiment lexicon words

with open(os.path.join('.', 'datasets', 'positive-words.txt')) as f:
    _positive = f.read().splitlines()

with open(os.path.join('.', 'datasets', 'negative-words.txt')) as f:
    _negative = f.read().splitlines()

# Skip the first 35 lines of each file (the header block that precedes the word list)
allWords = _positive[35:] + _negative[35:]

assert len(allWords) == len(_positive[35:]) + len(_negative[35:])
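The hard-coded offset of 35 works only as long as the header length never changes.  A slightly more defensive loader could skip header lines by content instead.  This is a sketch that assumes header lines begin with `;` (worth confirming against the downloaded files).  The `loadLexicon` name is hypothetical, and `latin-1` is chosen only because it can decode any byte sequence.

```python
import os
import tempfile

def loadLexicon(path, encoding="latin-1"):
    """Read one word per line, skipping blank lines and ';' comment lines."""
    words = []
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith(";"):
                words.append(line)
    return words

# Tiny stand-in file so the sketch runs without the real lexicon
sample = "; header comment\n;\n\nabound\nabounds\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

lexiconWords = loadLexicon(tmp.name)
os.remove(tmp.name)

print(lexiconWords)  # ['abound', 'abounds']
```

With the real files this would be called once per file, for example `loadLexicon(os.path.join('.', 'datasets', 'positive-words.txt'))`.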

In [9]:
# Preview our sample review before sentiment lexicon parsing
cleanReviews[108]

'question one sees movie bad necessarily movie bad get made even see awful first place learned experience learned rules horror movies catalogued satirized countless times last ten years mean someone go ahead make movie uses without shred humor irony movie described loosely based video game script problems black character may always die first asian character always know kung fu may proud figured matrix effect budget necessarily mean use ad nausea ron howard brother guarantee choice roles whenever scene edit together use footage video game one notice cousin rap metal band offers write movie theme free politely decline zombie movies people killing zombies zombies killing people preferably gruesome way possible makes scary white people pay get rave deserve die find old book tell everything need know anything else figure two lines someone asks bare breasts horror movie panacea helicopter boom shot licensing deal sega magically transforms movie student film major studio release try name drop

In [10]:
# Apply the sentiment lexicon parsing
_tmp = [x for x in cleanReviews[108].split() if x in allWords]

In [11]:
# Examine the 'de-noised' list of remaining words
_tmp

['bad',
 'bad',
 'awful',
 'humor',
 'irony',
 'problems',
 'die',
 'proud',
 'guarantee',
 'free',
 'decline',
 'zombie',
 'killing',
 'killing',
 'preferably',
 'gruesome',
 'scary',
 'die',
 'boom',
 'dead',
 'worse',
 'annihilation']

##### Applying the sentiment lexicon  - All observations

Everything looks good so far, so let's 'de-noise' the entire dataset:

In [12]:
# Membership tests against a set are much faster than against a list
lexicon = set(allWords)
sparseCleanReviews = []

for review in cleanReviews:
    _tmp = [x for x in review.split() if x in lexicon]
    sparseCleanReviews.append(" ".join(_tmp))

In [13]:
# Sanity check examination

sparseCleanReviews[108]

'bad bad awful humor irony problems die proud guarantee free decline zombie killing killing preferably gruesome scary die boom dead worse annihilation'
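Since the goal is a sparser feature set, it may help to quantify how much text the 'de-noising' removed.  The sketch below computes the fraction of tokens dropped between two parallel lists of texts.  The `tokenReduction` helper and the two tiny sample lists are illustrative assumptions, not results from the dataset.

```python
def tokenReduction(before, after):
    """Fraction of whitespace-separated tokens removed between two parallel lists of texts."""
    totalBefore = sum(len(text.split()) for text in before)
    totalAfter = sum(len(text.split()) for text in after)
    if totalBefore == 0:
        return 0.0
    return 1.0 - totalAfter / totalBefore

# Made-up samples; note that a review with no lexicon words becomes an empty string
beforeSample = ["movie bad get made even see awful", "use footage video game"]
afterSample = ["bad awful", ""]

print(round(tokenReduction(beforeSample, afterSample), 3))
```

In the notebook the equivalent call would be `tokenReduction(cleanReviews, sparseCleanReviews)`.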

##### CountVectorizer application

We'll now repeat the CountVectorizer steps from the [first write-up](./Model-06.ipynb) to create the 'bag-of-words' numeric representation of the 'de-noised' reviews, suitable for the machine learning models.

In [14]:
# Utilize the defaults for the object instantiation other than max_features
vec = CountVectorizer(max_features = 5000)

# As with most Scikit-Learn objects, we fit and transform the data (here in a single fit_transform() call)
features = vec.fit_transform(sparseCleanReviews)

# And finally we'll convert to a np.array
features = features.toarray()

print("Features shape: ", features.shape)

Features shape:  (25000, 5000)
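To make the shape above concrete: each row is a review, each column is a vocabulary word, and each cell is a count.  The pure-Python sketch below is only a rough analogue of what `CountVectorizer(max_features = ...)` produces; it is not the Scikit-Learn implementation, and its tokenization and tie-breaking are simplifying assumptions.  The `bagOfWords` helper is hypothetical.

```python
from collections import Counter

def bagOfWords(docs, maxFeatures):
    """Count-vectorize `docs`, keeping only the `maxFeatures` most frequent words."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Most frequent words first; ties broken alphabetically for determinism
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    vocab = sorted(ranked[:maxFeatures])
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["bad bad awful", "great love", "bad great"]
vocab, rows = bagOfWords(docs, maxFeatures=3)

print(vocab)  # ['awful', 'bad', 'great']
print(rows)   # [[1, 2, 0], [0, 0, 1], [0, 1, 1]]
```

The word 'love' falls outside the three most frequent words, so it gets no column, just as words beyond the 5000 most frequent are dropped in the cell above.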


##### Examine vocabulary

We'll examine what the 'de-noising' did to the top ten and bottom ten vocabulary listings:

In [15]:
# Take a look at the first 10 words in the vocabulary
# Note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
vocab = vec.get_feature_names()
print(vocab[:10])

['abnormal', 'abolish', 'abominable', 'abominably', 'abomination', 'abort', 'aborted', 'aborts', 'abound', 'abounds']


In [16]:
_df = pd.DataFrame(data = features, columns = vocab).sum()
_df.sort_values(ascending = False, inplace = True)

In [17]:
print("Top 10:\n")
print(_df.head(10))

Top 10:

like      20274
good      15140
well      10662
bad        9301
great      9058
plot       6585
love       6454
best       6416
better     5737
work       4372
dtype: int64


Original 'Top 10' before 'de-noising':

```
Top 10:

movie     44031
film      40147
one       26788
like      20274
good      15140
time      12724
even      12646
would     12436
story     11983
really    11736
```

In [18]:
print("Bottom 10:\n")
print(_df.tail(10))

Bottom 10:

hothead          1
pillory          1
immorally        1
immodest         1
beckoned         1
beckoning        1
immoderate       1
horrify          1
hotbeds          1
overbearingly    1
dtype: int64


Original 'Bottom 10' before 'de-noising':

```
Bottom 10:

skull       78
sopranos    78
premiere    78
bunny       78
flair       78
fishing     78
awhile      78
stumbled    78
amused      78
cream       78
```

# Baseline Model development

We are finally ready to develop the baseline model on the data we've explored, cleaned, and processed.  Because the IMDB data set doesn't include a validation set we'll create one from a portion of the training data.  The process is similar to our work in previous write-ups such as the [Iris classifier](./Model-01.ipynb).

In [25]:
# Pull in the labeled data
labeledTrainData = os.path.join(dataPath, 'labeledTrainData.tsv')
df = pd.read_csv(labeledTrainData, sep = '\t', header = 0, quoting = 3)

# Pull in the unlabeled data since it can be utilized by Word2Vec
unlabeledTrainData = os.path.join(dataPath, 'unlabeledTrainData.tsv')
dfUn = pd.read_csv(unlabeledTrainData, sep = '\t', header = 0, quoting = 3)

In [27]:
# Validation
print('df.shape :', df.shape)
print('dfUn.shape :', dfUn.shape)

df.shape : (25000, 3)
dfUn.shape : (50000, 2)


Word2Vec expects single sentences, each one as a list of words. In other words, the input format is a list of lists.
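As a quick illustration of that 'list of lists' shape, here is a minimal sketch that uses a naive regular-expression sentence splitter.  The `naiveSentences` helper is a simplified stand-in for illustration only; the notebook itself relies on NLTK's punkt tokenizer, which handles abbreviations and other edge cases far better.

```python
import re

def naiveSentences(review):
    """Split on ., ! or ? followed by whitespace, then lower-case and tokenize each sentence."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    sentences = []
    for part in parts:
        # Same cleaning idea as cleanReview(): keep letters only, lower-case, split on whitespace
        words = re.sub("[^a-zA-Z]", " ", part).lower().split()
        if words:
            sentences.append(words)
    return sentences

example = "I found this film. It was awful!"
print(naiveSentences(example))
# [['i', 'found', 'this', 'film'], ['it', 'was', 'awful']]
```

Each inner list is one sentence, and the outer list is what gets handed to Word2Vec.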

In [54]:
# Update stop word helper function to output a list of words

# Build the stop word set once instead of on every call
_stopWords = set(stopwords.words("english"))

# Clean IMDB review text
def cleanReview(review, removeStopWords = False):
    # Remove HTML (name the parser explicitly to avoid BeautifulSoup's parser warning)
    clean = BeautifulSoup(review, "html.parser")

    # Remove non-alpha chars
    clean = re.sub("[^a-zA-Z]", ' ', clean.get_text())

    # Convert to lower case and "tokenize"
    clean = clean.lower().split()

    # Remove stop words
    if removeStopWords:
        clean = [x for x in clean if x not in _stopWords]

    # Return results
    return clean

In [62]:
# Examine
cleanReview(df.iloc[25,2])[:12]

['looking',
 'for',
 'quo',
 'vadis',
 'at',
 'my',
 'local',
 'video',
 'store',
 'i',
 'found',
 'this']

In [64]:
# Examine
cleanReview(dfUn.iloc[0,1])[:12]

['watching',
 'time',
 'chasers',
 'it',
 'obvious',
 'that',
 'it',
 'was',
 'made',
 'by',
 'a',
 'bunch']

Create a function to break a review into a list of sentences, each of which is a list of words (i.e. a list of lists):

In [65]:
# Creating function implementing punkt tokenizer for sentence splitting
import nltk.data

# Only need this the first time...
# nltk.download('punkt')

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Define a function to split a review into parsed sentences
def createSentences(review, tokenizer, remove_stopwords = False):
    # Init container to hold results
    sentences = []
    
    # Split review string into sentences
    tokenSentences = tokenizer.tokenize(review.strip())

    # Clean the sentences via cleanReview() function
    for s in tokenSentences:
        # If a sentence is empty, skip it
        if len(s) > 0:
            # Clean sentence
            sentences.append( cleanReview( s, remove_stopwords ))
    
    # Return list of clean sentences
    return sentences

In [74]:
# Examine
_ = createSentences(df.iloc[25,2], tokenizer)
print(_[0][:12])
print(len(_))

['looking', 'for', 'quo', 'vadis', 'at', 'my', 'local', 'video', 'store', 'i', 'found', 'this']
8


In [75]:
# Examine
_ = createSentences(dfUn.iloc[0,1], tokenizer)
print(_[0][:12])
print(len(_))

['watching', 'time', 'chasers', 'it', 'obvious', 'that', 'it', 'was', 'made', 'by', 'a', 'bunch']
5


Now combine the labeled and unlabeled list of lists:

In [87]:
combined = []

for s in df.iloc[:,2]:
    combined += createSentences(s, tokenizer)

In [94]:
for s in dfUn.iloc[:,1]:
    combined += createSentences(s, tokenizer)

Quick examination:

In [103]:
print('len(combined): ', len(combined))
print("\nSample sentence:")
print(combined[0])

len(combined):  795538

Sample sentence:
['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again']


Train the Word2Vec model:

In [95]:
# Set Word2Vec params
# (numFeatures rather than features, so the bag-of-words `features` array is not overwritten)
numFeatures = 300    # Word vector dimensionality
minWordCount = 40    # Minimum word count
workers = 4          # Number of threads to run in parallel
context = 10         # Context window size
downSampling = 1e-3  # Downsample setting for frequent words

# Note: gensim 4.x renamed the `size` argument to `vector_size`
model = word2vec.Word2Vec(combined,
                          workers = workers,
                          size = numFeatures,
                          min_count = minWordCount,
                          window = context,
                          sample = downSampling)

# https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.init_sims.html
# If replace is set, forget the original vectors and only keep the normalized ones = saves lots of memory!
# Note that you cannot continue training after doing a replace. 
# The model becomes effectively read-only = you can call most_similar, similarity etc., but not train.
model.init_sims(replace = True)

# Save model to disk
model.save("300features_40minwords_10context")

2018-10-17 15:39:10,579 : INFO : collecting all words and their counts
2018-10-17 15:39:10,580 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-10-17 15:39:10,626 : INFO : PROGRESS: at sentence #10000, processed 225803 words, keeping 17776 word types
2018-10-17 15:39:10,672 : INFO : PROGRESS: at sentence #20000, processed 451892 words, keeping 24948 word types
2018-10-17 15:39:10,719 : INFO : PROGRESS: at sentence #30000, processed 671315 words, keeping 30034 word types
2018-10-17 15:39:10,769 : INFO : PROGRESS: at sentence #40000, processed 897815 words, keeping 34348 word types


Training model...


2018-10-17 15:39:10,811 : INFO : PROGRESS: at sentence #50000, processed 1116963 words, keeping 37761 word types
2018-10-17 15:39:10,861 : INFO : PROGRESS: at sentence #60000, processed 1338404 words, keeping 40723 word types
2018-10-17 15:39:10,909 : INFO : PROGRESS: at sentence #70000, processed 1561580 words, keeping 43333 word types
2018-10-17 15:39:10,959 : INFO : PROGRESS: at sentence #80000, processed 1780887 words, keeping 45714 word types
2018-10-17 15:39:11,008 : INFO : PROGRESS: at sentence #90000, processed 2004996 words, keeping 48135 word types
2018-10-17 15:39:11,055 : INFO : PROGRESS: at sentence #100000, processed 2226966 words, keeping 50207 word types
2018-10-17 15:39:11,102 : INFO : PROGRESS: at sentence #110000, processed 2446580 words, keeping 52081 word types
2018-10-17 15:39:11,151 : INFO : PROGRESS: at sentence #120000, processed 2668775 words, keeping 54119 word types
2018-10-17 15:39:11,205 : INFO : PROGRESS: at sentence #130000, processed 2894303 words, keep

2018-10-17 15:39:14,509 : INFO : PROGRESS: at sentence #770000, processed 17217759 words, keeping 121703 word types
2018-10-17 15:39:14,567 : INFO : PROGRESS: at sentence #780000, processed 17447905 words, keeping 122402 word types
2018-10-17 15:39:14,624 : INFO : PROGRESS: at sentence #790000, processed 17674981 words, keeping 123066 word types
2018-10-17 15:39:14,652 : INFO : collected 123504 word types from a corpus of 17798082 raw words and 795538 sentences
2018-10-17 15:39:14,653 : INFO : Loading a fresh vocabulary
2018-10-17 15:39:14,747 : INFO : effective_min_count=40 retains 16490 unique words (13% of original 123504, drops 107014)
2018-10-17 15:39:14,748 : INFO : effective_min_count=40 leaves 17238940 word corpus (96% of original 17798082, drops 559142)
2018-10-17 15:39:14,797 : INFO : deleting the raw counts dictionary of 123504 items
2018-10-17 15:39:14,801 : INFO : sample=0.001 downsamples 48 most-common words
2018-10-17 15:39:14,802 : INFO : downsampling leaves estimated 1

2018-10-17 15:40:06,733 : INFO : EPOCH 4 - PROGRESS: at 47.56% examples, 856704 words/s, in_qsize 7, out_qsize 0
2018-10-17 15:40:07,733 : INFO : EPOCH 4 - PROGRESS: at 54.15% examples, 855010 words/s, in_qsize 7, out_qsize 0
2018-10-17 15:40:08,734 : INFO : EPOCH 4 - PROGRESS: at 60.18% examples, 846510 words/s, in_qsize 6, out_qsize 1
2018-10-17 15:40:09,740 : INFO : EPOCH 4 - PROGRESS: at 65.85% examples, 833630 words/s, in_qsize 8, out_qsize 3
2018-10-17 15:40:10,742 : INFO : EPOCH 4 - PROGRESS: at 72.10% examples, 830640 words/s, in_qsize 6, out_qsize 0
2018-10-17 15:40:11,749 : INFO : EPOCH 4 - PROGRESS: at 78.61% examples, 830056 words/s, in_qsize 8, out_qsize 0
2018-10-17 15:40:12,754 : INFO : EPOCH 4 - PROGRESS: at 84.94% examples, 827971 words/s, in_qsize 6, out_qsize 0
2018-10-17 15:40:13,767 : INFO : EPOCH 4 - PROGRESS: at 91.26% examples, 825788 words/s, in_qsize 7, out_qsize 2
2018-10-17 15:40:14,780 : INFO : EPOCH 4 - PROGRESS: at 98.11% examples, 828199 words/s, in_qsiz
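One portability note on the training call: gensim 4.x renamed the `size` argument of `Word2Vec` to `vector_size`, so the cell as written targets the 3.x releases.  A small helper along these lines could pick the right keyword from the installed version.  The `word2vecKwargs` name and the simple major-version check are my own assumptions, sketched here without importing gensim.

```python
def word2vecKwargs(gensimVersion, dim, minCount, window, sample, workers):
    """Build Word2Vec keyword arguments, naming the dimensionality per gensim version."""
    major = int(gensimVersion.split(".")[0])
    # gensim 4.x uses `vector_size`; earlier releases use `size`
    dimName = "vector_size" if major >= 4 else "size"
    return {
        dimName: dim,
        "min_count": minCount,
        "window": window,
        "sample": sample,
        "workers": workers,
    }

print(word2vecKwargs("3.6.0", 300, 40, 10, 1e-3, 4))
print(word2vecKwargs("4.3.2", 300, 40, 10, 1e-3, 4))
```

In the notebook this could be used as `word2vec.Word2Vec(combined, **word2vecKwargs(gensim.__version__, 300, 40, 10, 1e-3, 4))` after an `import gensim`.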

Explore the results:

In [109]:
model.wv.most_similar("great")

[('fantastic', 0.744966983795166),
 ('terrific', 0.7254248857498169),
 ('wonderful', 0.7222756743431091),
 ('superb', 0.6488494873046875),
 ('fine', 0.6268933415412903),
 ('brilliant', 0.6259881854057312),
 ('excellent', 0.6108428835868835),
 ('good', 0.6105794310569763),
 ('marvelous', 0.5988600254058838),
 ('fabulous', 0.5846982598304749)]

In [101]:
model.wv.most_similar("awful")

[('terrible', 0.7702630758285522),
 ('atrocious', 0.7353411912918091),
 ('horrible', 0.7253411412239075),
 ('abysmal', 0.6938037276268005),
 ('dreadful', 0.6896001100540161),
 ('horrid', 0.6868517994880676),
 ('horrendous', 0.6746305823326111),
 ('appalling', 0.6509989500045776),
 ('lousy', 0.6151514053344727),
 ('laughable', 0.610934853553772)]

In [110]:
# One of the reviews we referred to often in previous write-ups was of a zombie movie,
# so let's see what words are similar to or associated with the word 'zombie'
model.wv.most_similar("zombie")

[('cannibal', 0.6876082420349121),
 ('horror', 0.6336225867271423),
 ('slasher', 0.6316038370132446),
 ('splatter', 0.6017206907272339),
 ('fulci', 0.578565239906311),
 ('mummy', 0.5776978731155396),
 ('vampire', 0.5766961574554443),
 ('zombies', 0.5759209394454956),
 ('monster', 0.5732545256614685),
 ('werewolf', 0.5728244185447693)]
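The scores returned by `most_similar` are cosine similarities between word vectors, so values closer to 1 mean the vectors point in nearly the same direction.  Here is a minimal sketch of that calculation; the three-dimensional vectors are made-up stand-ins, not values from the trained model (which uses 300 dimensions).

```python
import math

def cosineSimilarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    normA = math.sqrt(sum(x * x for x in a))
    normB = math.sqrt(sum(y * y for y in b))
    return dot / (normA * normB)

# Hypothetical vectors for illustration only
great = [0.9, 0.1, 0.3]
fantastic = [0.8, 0.2, 0.4]
zombie = [-0.2, 0.9, -0.1]

print(cosineSimilarity(great, fantastic))  # close to 1: similar direction
print(cosineSimilarity(great, zombie))     # much lower: different direction
```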