<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IMDB-Movie-Review-Sentiment-Classification" data-toc-modified-id="IMDB-Movie-Review-Sentiment-Classification-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>IMDB Movie Review Sentiment Classification</a></span></li><li><span><a href="#Purpose" data-toc-modified-id="Purpose-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href="#Process" data-toc-modified-id="Process-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Process</a></span></li><li><span><a href="#Configure-notebook,-import-libraries,-and-import-dataset" data-toc-modified-id="Configure-notebook,-import-libraries,-and-import-dataset-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Configure notebook, import libraries, and import dataset</a></span><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Define-global-variables" data-toc-modified-id="Define-global-variables-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Define global variables</a></span></li></ul></li><li><span><a href="#Helper-Functions" data-toc-modified-id="Helper-Functions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Helper Functions</a></span></li><li><span><a href="#Examine-the-data" data-toc-modified-id="Examine-the-data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Examine the data</a></span></li><li><span><a href="#Cleaning-and-preprocessing" data-toc-modified-id="Cleaning-and-preprocessing-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Cleaning and preprocessing</a></span><ul class="toc-item"><li><span><a href="#Load-labeled-training-data" data-toc-modified-id="Load-labeled-training-data-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Load labeled training data</a></span></li><li><span><a href="#Clean/Process-reviews" 
data-toc-modified-id="Clean/Process-reviews-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Clean/Process reviews</a></span><ul class="toc-item"><li><span><a href="#Option-A:--Read-previous-text-cleaning-work-from-disk" data-toc-modified-id="Option-A:--Read-previous-text-cleaning-work-from-disk-7.2.1"><span class="toc-item-num">7.2.1&nbsp;&nbsp;</span>Option A:  Read previous text cleaning work from disk</a></span></li><li><span><a href="#Option-B:--Perform-text-cleaning-work-and-write-to-disk" data-toc-modified-id="Option-B:--Perform-text-cleaning-work-and-write-to-disk-7.2.2"><span class="toc-item-num">7.2.2&nbsp;&nbsp;</span>Option B:  Perform text cleaning work and write to disk</a></span></li></ul></li></ul></li><li><span><a href="#Create-Models" data-toc-modified-id="Create-Models-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Create Models</a></span><ul class="toc-item"><li><span><a href="#Tokenize-the-cleaned-data" data-toc-modified-id="Tokenize-the-cleaned-data-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Tokenize the cleaned data</a></span></li><li><span><a href="#Algorithm-v1:--LSTM" data-toc-modified-id="Algorithm-v1:--LSTM-8.2"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>Algorithm v1:  LSTM</a></span><ul class="toc-item"><li><span><a href="#Comments" data-toc-modified-id="Comments-8.2.1"><span class="toc-item-num">8.2.1&nbsp;&nbsp;</span>Comments</a></span></li></ul></li><li><span><a href="#Algorithm-v2:--CNN" data-toc-modified-id="Algorithm-v2:--CNN-8.3"><span class="toc-item-num">8.3&nbsp;&nbsp;</span>Algorithm v2:  CNN</a></span><ul class="toc-item"><li><span><a href="#Comments" data-toc-modified-id="Comments-8.3.1"><span class="toc-item-num">8.3.1&nbsp;&nbsp;</span>Comments</a></span></li></ul></li><li><span><a href="#Algorithm-v3:--CNN-with-tuning" data-toc-modified-id="Algorithm-v3:--CNN-with-tuning-8.4"><span class="toc-item-num">8.4&nbsp;&nbsp;</span>Algorithm v3:  CNN with tuning</a></span><ul 
class="toc-item"><li><span><a href="#Comments" data-toc-modified-id="Comments-8.4.1"><span class="toc-item-num">8.4.1&nbsp;&nbsp;</span>Comments</a></span></li></ul></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

<h1>IMDB Movie Review Sentiment Classification</h1>

<img style="float: left; margin-right: 15px; width: 30%; height: 30%;" src="images/imdb.jpg" />

# Purpose

The overall goal of this set of write-ups is to explore a number of machine learning algorithms that use natural language processing (NLP) to classify the sentiment of IMDB movie reviews.

The specific goals of this write-up include:
1. Create a feature set of document vectors from the IMDb movie review text utilizing [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
2. Utilize LSTM and CNN algorithms to create predictive models from the feature set(s) created above
3. Obtain Kaggle scores on the outputs of the models
4. Determine if utilizing LSTM and CNN networks on feature sets created with Doc2Vec improves our ability to correctly classify movie review sentiment

This series of write-ups is inspired by the Kaggle [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial) competition.

References:
* [Gensim Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)

Dataset source:  [IMDB Movie Reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

# Process

Previously covered [here](./Model-06.ipynb#Process).

# Configure notebook, import libraries, and import dataset

## Import libraries

In [3]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import os
import re
import numpy as np
import matplotlib.pyplot as plt
import pickle
from collections import Counter

from random import shuffle

import pandas as pd
from pandas import set_option

from tqdm import tqdm
from keras_tqdm import TQDMNotebookCallback
tqdm.pandas(desc="progress-bar")

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization

# http://www.nltk.org/index.html
# pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


# Creating function implementing punkt tokenizer for sentence splitting
import nltk.data

# Only need this the first time...
# nltk.download('punkt')


# https://www.crummy.com/software/BeautifulSoup/bs4/doc/
# pip install BeautifulSoup4
from bs4 import BeautifulSoup


# https://pypi.org/project/gensim/
# pip install gensim
import gensim.models.doc2vec
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

#import multiprocessing

#cores = multiprocessing.cpu_count()
#assert(gensim.models.doc2vec.FAST_VERSION > -1, "Going to be slow!")



## Define global variables

In [4]:
seed = 10
np.random.seed(seed)

# Opens a GUI that allows us to download the NLTK data
# nltk.download()

dataPath = os.path.join('.', 'datasets', 'imdb_movie_reviews')
labeledTrainData = os.path.join(dataPath, 'labeledTrainData.tsv')

In [5]:
contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

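# Try longer keys first so that e.g. "can't've" is matched before its prefix "can't"
contractionsObj = re.compile('(%s)' % '|'.join(
    re.escape(k) for k in sorted(contractions.keys(), key = len, reverse = True)
))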

# Helper Functions

In [6]:
def expandContractions(txt, contractions = contractions):
    def replace(match):
        return contractions[match.group(0)]
    return contractionsObj.sub(replace, txt)
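
A minimal sketch of the same idea on a hypothetical three-entry dictionary may make the behavior easier to see. The names `demoContractions`, `demoPattern`, and `expandDemo` are illustrative only and are not part of the notebook. The sketch sorts the keys longest-first because Python tries regex alternatives in order, so a short key such as "can't" would otherwise win over "can't've".

```python
import re

# Hypothetical three-entry subset of the contractions dictionary
demoContractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "wasn't": "was not",
}

# Longest keys first, so "can't've" is tried before its prefix "can't"
demoPattern = re.compile(
    "|".join(re.escape(k) for k in sorted(demoContractions, key=len, reverse=True))
)

def expandDemo(txt):
    # Replace every matched contraction with its expansion
    return demoPattern.sub(lambda m: demoContractions[m.group(0)], txt)

print(expandDemo("It wasn't bad, but I can't've seen the same film."))
```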

---
Function to create predictions on the test set from a trained model and then prepare a Kaggle submission file in the correct format. It relies on the global `vocab`, `tokenizer`, and `seqLength` objects defined later in this write-up.

In [7]:
def createKaggleSubmission(model, writeData = False, readData = False):
    
    # Pull in the test data
    print("** Loading test data.")
    dataPath = os.path.join('.', 'datasets', 'imdb_movie_reviews')
    testData = os.path.join(dataPath, 'testData.tsv')

    testDF = pd.read_csv(testData, sep = '\t', header = 0, quoting = 3)
    testDF['id'] = testDF['id'].str.replace('"', '')
        
    if readData:
        print("\n** Reading processed test data from disk.")
        with open('Model-06.p5.finalTest.pkl','rb') as f:
            finalTest =  pickle.load(f) 
    
    else:
        # Sanity check
        print("** Test data loaded into dataframe.")
        print('testDF.shape :', testDF.shape)
        print("")
        print(testDF.head())

        # Clean the test data
        cleanTest = []

        # Clean the reviews
        print("\n** Cleaning the test data.")
        for i, s in tqdm(enumerate(testDF.iloc[:,1])):
            cleanTest.append(cleanReview(s, True))

        # Examine a portion of the first clean review
        print("Examine a portion of the first clean review:\n")
        print(cleanTest[0][:10])


        print("\n** Processing the clean test data.")
        finalTest = []
        for doc in tqdm(cleanTest):
            finalTest.append(processCleanReview(doc, vocab))

        print("** Processing complete.")
        print('Records:', len(finalTest))
        print("First record:")
        print(finalTest[0])

        if writeData:
            print("\n** Writing processed test data to disk.")
            with open('Model-06.p5.finalTest.pkl','wb') as f:
                pickle.dump(finalTest, f)


    # Create test data sequences for use by the Keras CNN model
    print("\n** Creating Keras sequences.")
    sequences = tokenizer.texts_to_sequences(finalTest)
    data = pad_sequences(sequences, maxlen = seqLength, padding = "post")

    print("\n** Predicting classes.")
    yHat = model.predict(data)
    yHat = np.round(yHat).astype(int)
    print("** First 10 predictions:")
    print(yHat[:10])


    # Add the predictions to the test data frame
    print("\n** Adding predictions to test data frame.")
    testDF['sentiment'] = yHat
    testDF.head()


    # Create the Kaggle submission file
    print("\n** Writing submission CSV file.")
    import csv
    header = ['id', 'sentiment']
    testDF.to_csv('submission.csv', columns = header, index = False, quoting = csv.QUOTE_NONE)
    
    print("\n** Finished!\n")
    
    return testDF
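
The last two steps of the function (rounding the sigmoid outputs to class labels and writing an `id,sentiment` file) can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's code: `writeSubmission` is a hypothetical helper, and the ids and probabilities are made up.

```python
import csv
import io

def writeSubmission(ids, probabilities, handle):
    # Round each probability to a 0/1 class label and write id,sentiment rows
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "sentiment"])
    for reviewId, p in zip(ids, probabilities):
        writer.writerow([reviewId, int(round(p))])

buffer = io.StringIO()
writeSubmission(["12311_10", "8348_2"], [0.91, 0.08], buffer)  # made-up ids and scores
submissionText = buffer.getvalue()
print(submissionText)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing an open file handle instead would produce the file on disk.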

# Examine the data

Previously covered [here](./Model-06.ipynb#Examine-the-data).

# Cleaning and preprocessing

## Load labeled training data

(The justification and methodology for this process were covered [here](./Model-06.ipynb#Cleaning-and-preprocessing).)

We need to load the labeled training data exactly as we've done in previous write-ups:

In [8]:
# Pull in the labeled data
df = pd.read_csv(labeledTrainData, sep = '\t', header = 0, quoting = 3)

# Sanity check
print('df.shape :', df.shape)

df.shape : (25000, 3)


## Clean/Process reviews

Take a given sentence and process/clean it (i.e. remove HTML and other cruft, lower case the text, etc.).

In [9]:
# Update stop word helper function to output a list of words

# Clean IMDB review text
def cleanReview(review, removeStopWords = False, applyStemming = False):
    
    # Convert the stop words to a set
    stopWords = set(stopwords.words("english"))
    
    # Remove HTML
    clean = BeautifulSoup(review, "html.parser").get_text()
    
    # Expand contractions (i.e. wasn't => was not)
    clean = expandContractions(clean)
    
    # Remove non-alpha chars
    clean = re.sub("[^a-zA-Z]", ' ', clean)
    
    # Convert to lower case and "tokenize"
    clean = clean.lower().split()
       
    # Remove stop words
    if removeStopWords:
        clean = [x for x in clean if not x in stopWords]
        
    # Stemming (reuse a single stemmer instance)
    if applyStemming:
        stemmer = PorterStemmer()
        clean = [stemmer.stem(x) for x in clean]
    
    # Return results
    return clean
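
To make the order of the cleaning steps concrete, here is a dependency-free sketch of the same pipeline. It is only an approximation: a crude regex stands in for BeautifulSoup, `DEMO_STOP_WORDS` is a tiny made-up list rather than NLTK's English stop words, and contraction expansion and stemming are left out.

```python
import re

# Tiny illustrative stop word list; the notebook uses NLTK's English list
DEMO_STOP_WORDS = {"the", "was", "a", "and", "it", "not"}

def cleanReviewSketch(review, removeStopWords=False):
    # Crude tag stripping with a regex (stand-in for BeautifulSoup's get_text)
    text = re.sub(r"<[^>]+>", " ", review)
    # Replace anything that is not a letter with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lower-case and split on whitespace
    words = text.lower().split()
    if removeStopWords:
        words = [w for w in words if w not in DEMO_STOP_WORDS]
    return words

demoTokens = cleanReviewSketch("The plot was <br />GREAT, and it had 2 twists!", True)
print(demoTokens)
```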

---
1. Remove any words from the cleaned review not found in the vocab tokens list
2. Remove any words with length less than 2

In [10]:

def processCleanReview(review, vocab):
    results = []
    
    for x in review:
        if (x in vocab and len(x) > 1):
            results.append(x)
            
    return results

### Option A:  Read previous text cleaning work from disk

__Run the block below if you've already processed the data and saved the vocab to disk__

In [12]:
with open('Model-06.p5.finalDocs.pkl','rb') as f:
    finalDocs =  pickle.load(f)
    
with open('Model-06.p5.vocab.pkl','rb') as f:
    vocab =  pickle.load(f)

with open('Model-06.p5.counts.pkl','rb') as f:
    counts = pickle.load(f)

### Option B:  Perform text cleaning work and write to disk

A quick examination of the output:

In [11]:
# Examine
cleanReview(df.iloc[25,2], True, True)[:12]

['look',
 'quo',
 'vadi',
 'local',
 'video',
 'store',
 'found',
 'version',
 'look',
 'interest',
 'wow',
 'amaz']

Clean the reviews and create the vocab:

In [11]:
cleanDocs = []
vocab = Counter()

# Clean the reviews
for i, s in tqdm(enumerate(df.iloc[:,2])):
    _ = cleanReview(s, True)
    cleanDocs.append(_)
    vocab.update(_)

25000it [00:51, 488.16it/s]


Examine some vocab metrics:

In [117]:
print(len(vocab))
print(list(vocab)[:5])

46199
['stuff', 'going', 'moment', 'mj', 'started']


How big is the vocab if we only consider words appearing at least N times?

In [13]:
occurring = 2
vocab = [k for k,c in vocab.items() if c >= occurring]
print(len(vocab))

46199
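
The vocab-building and filtering steps can be sketched end to end on toy data. Everything here (`toyDocs`, `minCount`, `keptWords`) is illustrative; the one deliberate difference from the cells above is that the filtered vocab is kept as a set, which we assume is preferable because membership tests on a set do not scan the whole collection.

```python
from collections import Counter

# Three toy "cleaned reviews" standing in for cleanDocs
toyDocs = [
    ["great", "film", "great", "cast"],
    ["terrible", "film", "x"],
    ["great", "effects"],
]

toyVocab = Counter()
for doc in toyDocs:
    toyVocab.update(doc)

# Keep words seen at least minCount times; a set gives fast membership tests
minCount = 2
keptWords = {word for word, count in toyVocab.items() if count >= minCount}

# Drop out-of-vocab words and one-character words
toyFinalDocs = [[w for w in doc if w in keptWords and len(w) > 1] for doc in toyDocs]
print(sorted(keptWords))
print(toyFinalDocs)
```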


Quick sanity check:

In [None]:
_ = cleanReview(df.iloc[25,2], True, True)
processCleanReview(_, vocab)

Now we want to do final processing of the cleaned reviews:

In [None]:
finalDocs = []

for doc in tqdm(cleanDocs):
    finalDocs.append(processCleanReview(doc, vocab))
    
print(len(finalDocs))
print(finalDocs[0])

Next create a collection of review sizes in words:

In [14]:
counts = []

for i, d in tqdm(enumerate(finalDocs)):
    counts.append(len(d))

25000it [00:00, 1666655.01it/s]


What are the max and min review sizes in words?

In [15]:
print("Max review size in words:", max(counts))
print("Min review size in words:", min(counts))

Max review size in words: 1418
Min review size in words: 4


Visually inspect the max review:

In [16]:
" ".join(finalDocs[counts.index(max(counts))][:100])

'match tag team table match bubba ray spike dudley vs eddie guerrero chris benoit bubba ray spike dudley started things tag team table match eddie guerrero chris benoit according rules match opponents go tables order get win benoit guerrero heated early taking turns hammering first spike bubba ray german suplex benoit bubba took wind dudley brother spike tried help brother referee restrained benoit guerrero ganged corner benoit stomping away bubba guerrero set table outside spike dashed ring somersaulted top rope onto guerrero outside recovering taking care spike guerrero slipped table ring helped wolverine set tandem set double superplex middle rope'

Visually inspect the min review:

In [17]:
" ".join(finalDocs[counts.index(min(counts))])

'movie terrible good effects'

View the 10 largest and 10 smallest reviews in words:

In [18]:
counts.sort()
print("10 largest reviews in words:", counts[-10:])
print("10 smallest reviews in words:", counts[:10])

10 largest reviews in words: [700, 707, 735, 796, 797, 812, 874, 913, 924, 1418]
10 smallest reviews in words: [4, 6, 6, 7, 7, 8, 8, 8, 9, 9]
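
One thing to keep in mind: `counts.sort()` sorts the list in place, so after this cell `counts` no longer lines up index-for-index with `finalDocs`, and re-running the earlier `counts.index(...)` lookups would return the wrong reviews. A small sketch on toy data (all names hypothetical) of how `sorted()` avoids that:

```python
toyDocs = [["a", "b", "c"], ["d"], ["e", "f"]]
toyCounts = [len(d) for d in toyDocs]        # [3, 1, 2], aligned with toyDocs

# sorted() returns a new list and leaves the aligned list untouched
sortedCounts = sorted(toyCounts)             # [1, 2, 3]

# Index lookups on the aligned list still point at the right document
shortestDoc = toyDocs[toyCounts.index(min(toyCounts))]
secondLargest = sortedCounts[-2]
print(shortestDoc, secondLargest)
```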


Pickle finalDocs, vocab, and counts to save time if/when we run this again:

In [20]:
with open('Model-06.p5.finalDocs.pkl','wb') as f:
    pickle.dump(finalDocs, f)
    
with open('Model-06.p5.vocab.pkl','wb') as f:
    pickle.dump(vocab, f)
    
with open('Model-06.p5.counts.pkl','wb') as f:
    pickle.dump(counts, f)
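
Options A and B together amount to a simple cache: read the pickle if it exists, otherwise do the work and write it. A hedged sketch of that pattern follows; `loadOrCompute` is a hypothetical helper, and the file lives in a throwaway temp directory rather than next to the notebook.

```python
import os
import pickle
import tempfile

def loadOrCompute(path, compute):
    # Return the pickled object at path if present; otherwise compute and save it
    if os.path.exists(path):
        with open(path, "rb") as f:      # "rb" when loading, "wb" only when saving
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

demoPath = os.path.join(tempfile.mkdtemp(), "demo.counts.pkl")  # throwaway location
first = loadOrCompute(demoPath, lambda: [4, 6, 6, 7])   # computes and writes
second = loadOrCompute(demoPath, lambda: [])            # reads the cached list
print(first == second)
```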

# Create Models

## Tokenize the cleaned data

The largest review looks like junk, so we'll ignore it and set the maximum sequence length to the size, in words, of the second-largest review. Because `counts` is sorted in ascending order, that is `counts[-2]`:

In [None]:
vocabSize = len(vocab)
seqLength = counts[-2]

print("Vocab size:", vocabSize)
print("seqLength:", seqLength)

Tokenize the vocab:

In [184]:
tokenizer = Tokenizer(num_words = vocabSize)
tokenizer.fit_on_texts(vocab)

sequences = tokenizer.texts_to_sequences(finalDocs)
data = pad_sequences(sequences, maxlen = seqLength, padding = "post")

In [225]:
print(len(tokenizer.word_index))
print(len(sequences))
print(len(data))

46199
25000
25000


Let's confirm that Keras accepted the pre-tokenized reviews and produced one sequence per review:

In [195]:
assert(len(sequences) == len(df))

Let's also examine what the final product looks like:

In [196]:
data

array([[    1,     2,     3, ...,     0,     0,     0],
       [  167,   168,   169, ...,     0,     0,     0],
       [   32,    76,   225, ...,     0,     0,     0],
       ...,
       [   19,  2353,    16, ...,     0,     0,     0],
       [ 1235,    10, 19681, ...,     0,     0,     0],
       [ 1063,    66,  3099, ...,     0,     0,     0]])
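
The trailing zeros come from `padding = "post"`. A pure-Python sketch of roughly what the tokenize-and-pad step does is shown below. It rests on assumptions: ids are assigned in first-seen order here (the Keras `Tokenizer` orders by word frequency), unknown words are skipped, and over-long sequences lose words from the front, which we believe matches the `pad_sequences` default of `truncating = "pre"`.

```python
def buildWordIndex(words):
    # Assign 1-based ids in first-seen order; 0 is reserved for padding
    index = {}
    for w in words:
        if w not in index:
            index[w] = len(index) + 1
    return index

def toPaddedSequence(doc, wordIndex, maxLen):
    # Map words to ids, skipping unknown words
    seq = [wordIndex[w] for w in doc if w in wordIndex]
    # Keep the last maxLen ids if too long, then pad with zeros at the end
    seq = seq[-maxLen:]
    return seq + [0] * (maxLen - len(seq))

toyIndex = buildWordIndex(["movie", "terrible", "good", "effects"])
toyData = [
    toPaddedSequence(["movie", "terrible"], toyIndex, 3),
    toPaddedSequence(["good", "effects", "movie", "terrible"], toyIndex, 3),
]
print(toyData)
```

If that assumption about truncation holds, the one review longer than `seqLength` keeps its final `seqLength` words rather than its first.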

## Algorithm v1:  LSTM

We'll utilize an LSTM algorithm to create the first model and assess its predictive power:

In [200]:
model = Sequential()
model.add(Embedding(vocabSize, 100, input_length = seqLength))
model.add(LSTM(100, dropout = 0.2, recurrent_dropout = 0.2))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [201]:
model.fit(
    data, 
    df.iloc[:, 1], 
    validation_split = 0.4, 
    epochs = 3, 
    verbose = 0, 
    callbacks = [TQDMNotebookCallback(leave_inner = True, leave_outer = True)]
)

<keras.callbacks.History at 0x1b082cbe0>

### Comments

Final epoch reported loss: 0.693, acc: 0.495, val_loss: 0.693, val_acc: 0.501.

Kaggle score model v1.0:  0.84780

Pretty poor results: a validation accuracy of roughly 0.50 is no better than a coin flip on a two-class problem.  Let's try another model architecture.
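
The reported loss of 0.693 is worth a second look. It is the binary cross-entropy of a model that outputs 0.5 for every review, since -ln(0.5) = ln 2 ≈ 0.693. The short check below is only an arithmetic illustration; `binaryCrossEntropy` is a hypothetical helper, not a Keras function.

```python
import math

def binaryCrossEntropy(yTrue, pPred):
    # Cross-entropy of a single prediction pPred for a 0/1 label yTrue
    return -(yTrue * math.log(pPred) + (1 - yTrue) * math.log(1 - pPred))

# A model that always outputs 0.5 scores the same loss on either class
chanceLoss = binaryCrossEntropy(1, 0.5)
print(round(chanceLoss, 3))
```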

## Algorithm v2:  CNN

We'll switch over to a CNN for our second attempt:

In [205]:
# Create, compile, and return the CNN model
def createModelv2(vocabSize, seqLength):
    model = Sequential()
    model.add(Embedding(vocabSize, 100, input_length = seqLength))
    model.add(Conv1D(filters = 32, kernel_size = 8, activation = 'relu'))
    model.add(MaxPooling1D(pool_size = 2))
    model.add(Flatten())
    model.add(Dense(10, activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    model.summary()

    return model

In [206]:
m2 = createModelv2(vocabSize, seqLength)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_66 (Embedding)     (None, 924, 100)          4619900   
_________________________________________________________________
conv1d_64 (Conv1D)           (None, 917, 32)           25632     
_________________________________________________________________
max_pooling1d_64 (MaxPooling (None, 458, 32)           0         
_________________________________________________________________
flatten_64 (Flatten)         (None, 14656)             0         
_________________________________________________________________
dense_129 (Dense)            (None, 10)                146570    
_________________________________________________________________
dense_130 (Dense)            (None, 1)                 11        
Total params: 4,792,113
Trainable params: 4,792,113
Non-trainable params: 0
_________________________________________________________________
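
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the Keras defaults of unpadded ("valid") convolution with stride 1 and a pooling stride equal to the pool size of 2; `cnnParamCount` is a hypothetical helper.

```python
def cnnParamCount(vocabSize, seqLength, embedDim=100, filters=32, kernelSize=8, denseUnits=10):
    counts = {}
    counts["embedding"] = vocabSize * embedDim
    # Each filter has kernelSize * embedDim weights plus one bias
    counts["conv"] = kernelSize * embedDim * filters + filters
    convLength = seqLength - kernelSize + 1      # unpadded convolution
    pooledLength = convLength // 2               # max pooling with pool size 2
    flat = pooledLength * filters
    counts["dense1"] = flat * denseUnits + denseUnits
    counts["dense2"] = denseUnits + 1
    counts["total"] = sum(counts.values())
    return flat, counts

flatSize, paramCounts = cnnParamCount(46199, 924)
print(flatSize, paramCounts)
```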


In [207]:
m2.fit(
    data, 
    df.iloc[:, 1], 
    epochs = 3, 
    verbose = 0, 
    callbacks = [TQDMNotebookCallback(leave_inner = True, leave_outer = True)]
)

<keras.callbacks.History at 0x1b483aef0>

In [208]:
m2.save('Model-06.p5.CNN-v2.h5')

##### Prepare Kaggle submission

In [210]:
m2DF = createKaggleSubmission(m2, writeData = False, readData = True)

** Loading test data.

** Reading processed test data from disk.

** Creating Keras sequences.

** Predicting classes.
** First 10 predictions:
[[1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]]

** Adding predictions to test data frame.

** Writing submission CSV file.

** Finished!



### Comments

Kaggle score model v2.0:  0.85424

The CNN provided some improvement in the Kaggle test set score.  Next we'll tune the hyperparameters and see if we can increase accuracy further.

## Algorithm v3:  CNN with tuning

Let's run a grid search over the algorithm's hyperparameters and see if we can reduce the variance issues (i.e. overfitting):

In [89]:
# Create, compile, and return the CNN model
def createModelv3(vocabSize, seqLength, filters, kernel_size, debug = False):    
    if debug:
        print("\n*****")
        print("filters:", filters)
        print("kernel_size:", kernel_size)
        print("*****\n")
    
    model = Sequential()
    model.add(Embedding(vocabSize, 100, input_length = seqLength))
    model.add(Conv1D(filters = filters, kernel_size = kernel_size, activation = 'relu'))
    model.add(MaxPooling1D(pool_size = 2))
    model.add(Flatten())
    model.add(Dense(10, activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    
    if debug:
        model.summary()
    
    return model

In [93]:
# Keyword names must match the createModelv3 parameters;
# filters and kernel_size are supplied by the grid search below
m3 = KerasClassifier(
    build_fn = createModelv3, 
    vocabSize = vocabSize, 
    seqLength = seqLength, 
    debug = True,  # print the per-model summaries shown below
    verbose = 0,
    epochs = 3
)

In [94]:
param_grid = dict(
    filters=[32, 100, 200], 
    kernel_size = [6, 8, 10]
)
grid = GridSearchCV(
    estimator = m3, 
    param_grid = param_grid, 
    n_jobs = 1
)
grid_result = grid.fit(data, df.iloc[:, 1])


*****
filters: 32
kernel_size: 6
*****

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_32 (Embedding)     (None, 924, 100)          4619900   
_________________________________________________________________
conv1d_32 (Conv1D)           (None, 919, 32)           19232     
_________________________________________________________________
max_pooling1d_32 (MaxPooling (None, 459, 32)           0         
_________________________________________________________________
flatten_32 (Flatten)         (None, 14688)             0         
_________________________________________________________________
dense_63 (Dense)             (None, 10)                146890    
_________________________________________________________________
dense_64 (Dense)             (None, 1)                 11        
Total params: 4,786,033
Trainable params: 4,786,033
Non-trainable params: 0
_________________________________________________________________

...

Trainable params: 5,158,021
Non-trainable params: 0
_________________________________________________________________

*****
filters: 100
kernel_size: 8
*****

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_46 (Embedding)     (None, 924, 100)          4619900   
_________________________________________________________________
conv1d_46 (Conv1D)           (None, 917, 100)          80100     
_________________________________________________________________
max_pooling1d_46 (MaxPooling (None, 458, 100)          0         
_________________________________________________________________
flatten_46 (Flatten)         (None, 45800)             0         
_________________________________________________________________
dense_91 (Dense)             (None, 10)                458010    
_________________________________________________________________
dense_92 (Dense)             (None, 1)          

...

dense_118 (Dense)            (None, 1)                 11        
Total params: 5,139,021
Trainable params: 5,139,021
Non-trainable params: 0
_________________________________________________________________


In [211]:
# Examine the best score and model params
print(grid_result.best_score_)
print(grid_result.best_params_)

0.8776800000119209
{'filters': 100, 'kernel_size': 6}
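
For a sense of how much training the search implies, the grid can be enumerated by hand. The fold count below is an assumption: older scikit-learn releases default to 3-fold cross-validation and newer ones to 5, and the notebook does not set `cv` explicitly. The final `+ 1` reflects the default refit of the best combination on all the data.

```python
from itertools import product

paramGrid = {"filters": [32, 100, 200], "kernel_size": [6, 8, 10]}

# Every combination of the listed values
combos = [dict(zip(paramGrid, values)) for values in product(*paramGrid.values())]

cvFolds = 3  # assumption: the default in older scikit-learn; newer releases use 5
totalFits = len(combos) * cvFolds + 1  # +1 for the final refit on all the data
print(len(combos), totalFits)
```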


In [224]:
# Attach to the best model
bestModel = grid_result.best_estimator_ 

# Save it
bestModel.model.save('Model-06.p5.CNN-v3.h5')

##### Prepare Kaggle submission

In [173]:
testDF = createKaggleSubmission(bestModel, writeData = False, readData = True)

** Loading test data.

** Reading processed test data from disk.

** Creating Keras sequences.

** Predicting classes.
** First 10 predictions:
[[1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]]

** Adding predictions to test data frame.

** Writing submission CSV file.

** Finished!



### Comments

Kaggle score model v3.0:  0.86752

Again we see an improvement, this time as a result of the hyperparameter tuning.

# Summary

In [None]:
%%html
<style>
table {float:left}
</style>

As I continued to work on CNN hyperparameter fine tuning I came across a number of papers detailing transfer learning on NLP tasks.  I felt this was a more promising line of exploration, because further non-trivial improvements on the CNN were slow in materializing.  As such I decided to stop work on this write-up, and instead spend some time on utilizing pre-trained networks in a separate write-up.

In this write-up we accomplished the following:


* Created a feature set of document vectors from the IMDb movie review text utilizing Doc2Vec
* Utilized LSTM and CNN algorithms to create predictive models from the feature set(s) created above
* Obtained Kaggle scores on the outputs of the models

The best Kaggle score we achieved was 0.86752 on the tuned version of the CNN algorithm.  This score isn't bad for our first serious Kaggle submission; however, there is clearly room for improvement which hopefully we'll gain utilizing NLP transfer learning techniques in the next write-up.

Performance metrics so far:

|Model     |Kaggle Score|
|----------|------------|
|LSTM      | 0.84780    |
|CNN       | 0.85424    |
|Tuned CNN | 0.86752    |
<div style="clear:both"></div>
