<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IMDB-Movie-Review-Sentiment-Classification" data-toc-modified-id="IMDB-Movie-Review-Sentiment-Classification-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>IMDB Movie Review Sentiment Classification</a></span></li><li><span><a href="#Purpose" data-toc-modified-id="Purpose-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href="#Process" data-toc-modified-id="Process-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Process</a></span><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Import libraries</a></span></li></ul></li><li><span><a href="#Examine-data" data-toc-modified-id="Examine-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Examine data</a></span><ul class="toc-item"><li><span><a href="#Descriptive-statistics" data-toc-modified-id="Descriptive-statistics-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Descriptive statistics</a></span></li></ul></li><li><span><a href="#Baseline-model-development" data-toc-modified-id="Baseline-model-development-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Baseline model development</a></span></li><li><span><a href="#Init-vars" data-toc-modified-id="Init-vars-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Init vars</a></span></li><li><span><a href="#Build-the-computational-graph" data-toc-modified-id="Build-the-computational-graph-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Build the computational graph</a></span><ul class="toc-item"><li><span><a href="#Static-vs-Dynamic-TensorFlow-RNNs" data-toc-modified-id="Static-vs-Dynamic-TensorFlow-RNNs-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Static vs Dynamic TensorFlow RNNs</a></span></li><li><span><a href="#LSTM-v1" data-toc-modified-id="LSTM-v1-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>LSTM v1</a></span></li><li><span><a 
href="#LSTM-v2" data-toc-modified-id="LSTM-v2-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>LSTM v2</a></span></li><li><span><a href="#LSTM-v3" data-toc-modified-id="LSTM-v3-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>LSTM v3</a></span></li></ul></li></ul></div>

<h1>IMDB Movie Review Sentiment Classification</h1>

<img style="float: left; margin-right: 15px; width: 30%; height: 30%;" src="images/imdb.jpg" />

# Purpose

The purpose of this write-up is to create a predictive classification model that uses natural language processing (NLP) to classify the sentiment of IMDB movie reviews.  The write-up is inspired by the Kaggle [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial) competition.

Goals include:
* TODO
* TODO
* TODO

Dataset source:  [IMDB Movie Reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

# Process

We'll use the following process to guide us through this and the following write-ups on the IMDB movie review dataset:

1. Problem definition
2. Evaluation Strategy
3. Baseline model(s)
4. Data validation
5. Model development

# Configure notebook and import libraries

In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import os
import re
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
from pandas import set_option

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# http://www.nltk.org/index.html
# pip install nltk
import nltk
from nltk.corpus import stopwords

# https://www.crummy.com/software/BeautifulSoup/bs4/doc/
# pip install BeautifulSoup4
from bs4 import BeautifulSoup

In [3]:
seed = 10
np.random.seed(seed)

dataPath = os.path.join('.', 'datasets', 'imdb_movie_reviews')
labeledTrainData = os.path.join(dataPath, 'labeledTrainData.tsv')

# Examine data

If we open the training data file in a text editor we can see that:
* A header row exists with the values 'id	sentiment	review'
* The values appear to be separated by tabs
* There are double quotes around the review text as well as within the contents of the review text

Based on the last point we'll tell Pandas to avoid quoting with the parameter `quoting = 3`.

Let's go ahead and read the training data file into a Pandas DataFrame and then explore the raw data.

In [4]:
df = pd.read_csv(labeledTrainData, sep = '\t', header = 0, quoting = 3)
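To make the effect of `quoting = 3` concrete, here is a minimal sketch on an in-memory stand-in for the file. The single data row is made up for illustration and only mimics the layout described above; `3` is the value of `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters.

```python
import csv
import io

import pandas as pd

# Made-up one-row sample with the same tab-separated layout as the training file
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A \\"great\\" film"\n'

# quoting = csv.QUOTE_NONE (== 3): quotes are kept as literal characters
toy = pd.read_csv(io.StringIO(sample), sep='\t', header=0, quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)
print(toy.loc[0, 'id'])
print(toy.loc[0, 'review'])
```

Because the quotes survive the read, both the `id` and `review` values still carry their surrounding double quotes and need to be cleaned later.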

## Descriptive statistics

##### Shape and data types

In [23]:
df.shape

(25000, 3)

In [25]:
df.dtypes

id           object
sentiment     int64
review       object
dtype: object

##### Inspect a few rows of raw data

In [46]:
# Don't truncate long review text
# (pandas >= 1.0 expects None here instead of -1)
pd.set_option('display.max_colwidth', -1)
df[8:11]

Unnamed: 0,id,sentiment,review
8,"""319_1""",0,"""A friend of mine bought this film for £1, and even then it was grossly overpriced. Despite featuring big names such as Adam Sandler, Billy Bob Thornton and the incredibly talented Burt Young, this film was about as funny as taking a chisel and hammering it straight through your earhole. It uses tired, bottom of the barrel comedic techniques - consistently breaking the fourth wall as Sandler talks to the audience, and seemingly pointless montages of 'hot girls'.<br /><br />Adam Sandler plays a waiter on a cruise ship who wants to make it as a successful comedian in order to become successful with women. When the ship's resident comedian - the shamelessly named 'Dickie' due to his unfathomable success with the opposite gender - is presumed lost at sea, Sandler's character Shecker gets his big break. Dickie is not dead, he's rather locked in the bathroom, presumably sea sick.<br /><br />Perhaps from his mouth he just vomited the worst film of all time."""
9,"""8713_10""",1,"""<br /><br />This movie is full of references. Like \""Mad Max II\"", \""The wild one\"" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."""
10,"""2486_3""",0,"""What happens when an army of wetbacks, towelheads, and Godless Eastern European commies gather their forces south of the border? Gary Busey kicks their butts, of course. Another laughable example of Reagan-era cultural fallout, Bulletproof wastes a decent supporting cast headed by L Q Jones and Thalmus Rasulala."""


It appears there is a lot of noise in the `review` column we are going to have to deal with:  punctuation, html, escaped double quotes, currency symbols, and so forth.  
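As a rough illustration of what dealing with this noise could look like, the sketch below strips the kinds of artifacts listed above using only the standard library. The function name `clean_review` is hypothetical, and the letters-only rule is an assumption rather than a settled design choice; BeautifulSoup, imported earlier, could replace the regular expression used for HTML.

```python
import re

def clean_review(raw):
    """Hypothetical helper: reduce a raw review to lowercase words."""
    text = raw.strip('"')                    # outer quotes kept by quoting = 3
    text = re.sub(r'<[^>]+>', ' ', text)     # drop HTML tags such as <br />
    text = text.replace('\\"', ' ')          # drop escaped double quotes
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # drop punctuation, digits, currency symbols
    return ' '.join(text.lower().split())    # lowercase and collapse whitespace

sample = '"<br /><br />This movie is full of references. Like \\"Mad Max II\\"."'
cleaned = clean_review(sample)
print(cleaned)
```

Note that the letters-only rule also discards accented characters and symbols such as `£`, which may or may not be desirable for sentiment modelling.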

Two of the reviews have a clear sentiment, which will hopefully give the model a clean signal to learn from:
* Row 9 :: "This movie is a masterpiece." --> Clearly positive
* Row 8 :: "... the worst film of all time." --> Clearly negative

And then we have Row 10, where even as a human I wouldn't be 100% sure whether the reviewer was being purely negative or sarcastic in a snarky, half-approving way.  I would assume this type of review is going to give our learning algorithm some issues.

##### Label distribution

In [31]:
df.groupby('sentiment').size()

sentiment
0    12500
1    12500
dtype: int64

We have an even split of likes and dislikes; no one classification has a skewed representation in the data set.

##### ID distribution

Kaggle's site has this to say about the ID column:
* id - Unique ID of each review

It isn't clear, however, whether each review is from a unique author or whether we potentially have multiple reviews written by the same person.

It appears that perhaps the first part of the ID before the underscore might identify the author, and the second part of the ID after the underscore might be the Nth review from that author.

We can explore this theory using Pandas:

In [5]:
# Check for dupes against the raw ID values
df['id'].value_counts().shape

(25000,)

In [None]:
# Split the ID on the underscore
split = df['id'].str.replace('"', '').str.split('_')
split.head(1)

In [29]:
# Pull out the first part of the ID values using a list comprehension, and place results into a Pandas Series object
ids = pd.Series([x[0] for x in split])

# Let's see if the number of records has changed
print("Shape:\n", ids.value_counts(ascending = False).shape, "\n")
print("First five:\n", ids.value_counts(ascending = False).head(5), "\n")
print("Last five:\n", ids.value_counts(ascending = False).tail(5), "\n")

Shape:
 (12500,) 

First five:
 3981     2
7046     2
8771     2
10085    2
10021    2
dtype: int64 

Last five:
 6437     2
2014     2
12019    2
10907    2
9770     2
dtype: int64 



If our theory is correct--which it may not be--then each review author has exactly two entries present in the training observations.  This still provides us with a wide range (12500 in fact) of writing styles, word compositions, and so forth.  It also mitigates the possibility that we might have a few authors with a large number of reviews that would skew the algorithm's ability to generalize to unseen observations.

If our theory is incorrect then we simply have 25,000 unique reviews each written by a different author, and this can only help the model to generalize.

Just for fun, however, let's pick out two reviews by the same author and see if the writing styles are similar:

In [28]:
samples = df[ df['id'].str.contains('12486_') ]
pd.set_option('display.max_colwidth', -1)
samples

Unnamed: 0,id,sentiment,review
10936,"""12486_2""",0,"""Rich ditzy Joan Winfield (a woefully miscast Bette Davis) is engaged to be married to stupid egotistical Allen Brice (Jack Carson looking lost). Her father (Eugene Palette) is determined to stop the marriage and has her kidnapped by pilot Steve Collins (James Cagney. Seriously). They crash land in the desert and hate each other but (sigh) start falling in love.<br /><br />This seems to be getting a high rating from reviewers here only because Cagney and Davis are in it. They were both brilliant actors but they were known for dramas NOT comedy and this movie shows why! The script is just horrible--there's not one genuine laugh in the entire movie. The running joke in this has Cagney and Davis falling rump first in a cactus (this is done THREE TIMES!). Only their considerable talents save them from being completely humiliated. As it is they both do their best with the lousy material. Cagney tries his best with his lines and Davis screeches every line full force but it doesn't work. Carson has this \""what the hell\"" look on his face throughout the entire movie (probably because his characters emotions change in seconds). Only Palette with his distinctive voice and over the top readings manges to elicit a few smiles. But, all in all, this was dull and laughless--a real chore to sit through. This gets two stars only for Cagney and Davis' acting and some beautiful cinematography but really--it's not worth seeing. Cagney and Davis hated this film in later years and you can see why."""
11427,"""12486_7""",1,"""Well, this film is a difficult one really. To be straight with you, this film doesn't contain much of a riveting story, nore does it make u 'want' to know how it'll end...but I'll tell you something now...never have I been as tense and jumped up before in my life! This film sure does deliver the jumps and thrills! To be fair, I did watch it at almost midnight so I was kinda sleepy anyway, so maybe that explains why I was jumpy...or maybe it's because this film does deliver in that aspect! It's basically about a couple who lose their child in a tragic event. They decide to move away and rent a cabin looking thing in the mountains...all looks peaceful and calm until they have their first visitors (i think it's it's the sister of the main character, and she brings along her husband)...during the night, the husband hears noises...checks it out, and thats when things start to go really really wrong...they don't stay for another day and tell the couple they should leave asap as something isn't right...to cut a long story short...eventually they find out what has happened in that house in the past few years and decide it needs to be taken care of.<br /><br />It's not a Hollywood blockbuster, nore does it have a huge budget, but please don't let that put you off. It's creepy, tense and very very jumpy! Just give it a try :)"""


Based on the writing styles present in the two samples above the theory that these were written by the same author seems to be weakened.  For example, the second sample utilizes '...' a number of times, but we don't see that present in the first sample.  Likewise the first sample uses all uppercase characters for emphasis, but there are none present in the second sample.  And finally, the use (or misuse) of grammar does not match between the two entries either.
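Since the author theory looks shaky, it may be worth checking what else the ID parts could encode. The sketch below uses only the nine IDs and labels visible in the outputs above and asks whether the suffix after the underscore tracks the sentiment label. The column names `prefix` and `suffix` are made up for illustration, and nine rows prove nothing on their own; the same check would need to be run against the full `df`.

```python
import pandas as pd

# IDs and labels copied from the rows displayed earlier in this write-up
toy = pd.DataFrame({
    'id': ['"5814_8"', '"2381_9"', '"7759_3"', '"3630_4"', '"319_1"',
           '"8713_10"', '"2486_3"', '"12486_2"', '"12486_7"'],
    'sentiment': [1, 1, 0, 0, 0, 1, 0, 0, 1],
})

# Split "12486_7" into a prefix ("12486") and a numeric suffix (7)
parts = toy['id'].str.strip('"').str.split('_', expand=True)
toy['prefix'] = parts[0]
toy['suffix'] = parts[1].astype(int)

# Range of suffix values observed for each sentiment label
summary = toy.groupby('sentiment')['suffix'].agg(['min', 'max'])
print(summary)
```

In this small sample every negative review has a suffix of 4 or lower and every positive review has a suffix of 7 or higher, which hints that the suffix may be tied to the label rather than counting reviews per author.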

In [17]:
# DataFrame has no stats() method; describe() provides summary statistics
df.describe()

In [18]:
# info is a method; without the parentheses the notebook only prints the bound method's repr
df.info()

In [10]:
df = pd.read_csv(labeledTrainData, sep = '\t', header = 0, quoting = 3)

print(df.columns)
print(len(df))

Index(['id', 'sentiment', 'review'], dtype='object')
25000


# Baseline model development

Next we'll create a number of baseline models in order to obtain a benchmark to compare further efforts against.

# Init vars

The LSTM wants inputs of shape `[samples, timeSteps, features]`, and we have several thousand MNIST images of size 28 x 28 pixels.  

One way to think of this is a complete image is comprised of 28 rows of 28 pixels each.  If we were to step through the rows one by one and stack them up then the image would be more and more complete as time went by.  So our units of "time" will be the rows stacking together to create a complete image, and the number of features will be the number of pixels in the image row at that step in time (i.e. 28).  This gives us:

* samples     = number of observations (i.e. number of images in the mini batch)
* timeSteps   = number of rows we need to step through/stack up to make a complete image
* features    = the number of features in each row we are stepping through (i.e. also 28)

Additionally, we only care about the final output of the LSTM network which should give us the prediction of which numeral the image represents.  Other LSTM networks do care about the outputs of each LSTM cell (translating each word in a sentence for example), but that doesn't apply in our case.
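The row-by-row view described above can be sketched with plain NumPy. The array below is filled with placeholder values rather than real MNIST pixels; the point is only the shapes involved.

```python
import numpy as np

samples, timeSteps, features = 50, 28, 28

# Placeholder batch: each image arrives as a flat vector of 28 * 28 = 784 values
flat = np.arange(samples * timeSteps * features, dtype=np.float32).reshape(samples, -1)

# Reshape to [samples, timeSteps, features] so each image row becomes one time step
batch = flat.reshape(samples, timeSteps, features)

# What the LSTM sees at the first and last time step: one image row per sample
firstStep = batch[:, 0, :]
lastStep = batch[:, -1, :]

print(flat.shape, batch.shape, firstStep.shape)
```

Time step `t` for a given image is simply pixels `t * 28` through `t * 28 + 27` of its flat vector.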

Having said this we can continue with initializing the various variables we'll need:

In [9]:
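# TF 1.x imports needed by this and the following cells
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Setup vars for the MNIST data set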
timeSteps = 28
features = 28

lstmUnits = 128
lr = 0.001
epochs = 10
samples = 50

classes = 10

# Allow results to be reproduced
seed = 10

# Notice we are pulling in the labels as one hot encodings!
mninst = input_data.read_data_sets("./datasets/mnist", one_hot = True)

# For use when we create the LSTM network below
testShape = mninst.test.images.shape

# Note the one hot encoding on the label:
print("\n", "Example label: ", mninst.test.labels[0])

Extracting ./datasets/mnist\train-images-idx3-ubyte.gz
Extracting ./datasets/mnist\train-labels-idx1-ubyte.gz
Extracting ./datasets/mnist\t10k-images-idx3-ubyte.gz
Extracting ./datasets/mnist\t10k-labels-idx1-ubyte.gz

 Example label:  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
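The example label printed above is a one hot vector: ten slots, one per digit, with a single 1 marking the class. A small NumPy sketch of converting between that vector and the digit it represents:

```python
import numpy as np

# The example label shown above
label = np.array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.])

# One hot vector -> class index
digit = int(np.argmax(label))

# Class index -> one hot vector (row `digit` of a 10 x 10 identity matrix)
oneHot = np.eye(10)[digit]

print(digit)
print(oneHot)
```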


# Build the computational graph

## Static vs Dynamic TensorFlow RNNs

> Dynamic RNN's allow for variable sequence lengths. You might have an input shape (batch_size, max_sequence_length), but this will allow you to run the RNN for the correct number of time steps on those sequences that are shorter than max_sequence_length.
 
> In contrast, there are static RNNs, which expect to run the entire fixed RNN length. There are cases where you might prefer to do this, such as if you are padding your inputs to max_sequence_length anyway.

>In short, dynamic_rnn is usually what you want for variable length sequential data. It has a sequence_length parameter, and it is your friend.  [Source](https://stackoverflow.com/questions/43100981/what-is-a-dynamic-rnn-in-tensorflow)
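To illustrate what the quoted `sequence_length` parameter describes, the NumPy sketch below pads three made-up sequences to a common length and records their true lengths. This only shows the data layout; it does not call TensorFlow.

```python
import numpy as np

# Three made-up sequences of different lengths
sequences = [[1, 2, 3], [4, 5], [6]]

# True length of each sequence: the values a sequence_length argument would carry
lengths = np.array([len(s) for s in sequences])
maxLen = lengths.max()

# Pad every sequence with zeros up to max_sequence_length
padded = np.zeros((len(sequences), maxLen), dtype=np.int64)
for i, s in enumerate(sequences):
    padded[i, :len(s)] = s

# Mask of real (non-padding) time steps
mask = np.arange(maxLen)[None, :] < lengths[:, None]

print(padded)
print(lengths)
```

The MNIST rows used below all have the same length (28 steps), so padding is not needed there; it becomes relevant for variable-length text such as movie reviews.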


## LSTM v1

* Utilize tf.contrib.rnn.BasicLSTMCell
* Utilize tf.contrib.rnn.static_rnn
* Manual weight and bias definitions with tf.random_normal for initialization
* Track training and validation loss and accuracy in TensorBoard

In [84]:
# Reset the TF CG
resetGraph()

# Clean away any old log files
cleanLogs()

# Set the seed
tf.set_random_seed(seed)

# Set the TB logdir - We want two log dirs since we are going to be plotting two values on the same plot
logDirTrain = './logs/mnistLSTM/runOne/train'
logDirValidation = './logs/mnistLSTM/runOne/validation'


# Create place holders
x = tf.placeholder(tf.float32, shape = [None, timeSteps, features], name = 'x')
# Give 2nd dimension arg to shape since we are using one hot encodings
y = tf.placeholder(tf.int64, shape = [None, classes], name = 'y')

# Create weights and bias tensors
with tf.name_scope("weightBias"):
    w = tf.Variable(tf.random_normal([lstmUnits, classes]))
    b = tf.Variable(tf.random_normal([classes]))


# Add the LSTM cells
with tf.name_scope("LSTM"):
    
    # Later in the code we'll make a call to tf.contrib.rnn.static_rnn
    # tf.contrib.rnn.static_rnn expects a length T list of inputs, each a Tensor of shape [batch_size, input_size]
    # So we need to convert our inputs of shape [batchSize, timeSteps, numberOfInputs] into a list of timeSteps tensors, each of shape [batch_size, input_size]
    #
    # https://www.tensorflow.org/versions/r1.1/api_docs/python/tf/contrib/rnn/static_rnn
    
    # https://www.tensorflow.org/api_docs/python/tf/unstack
    inputs = tf.unstack(x, num = timeSteps, axis = 1)
    
    # Create the basic LSTM cell
    # It does not allow cell clipping, a projection layer, and does not use peep-hole connections: it is the basic baseline.
    # https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell
    cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
    
    # Add the cell to the RNN
    # https://www.tensorflow.org/versions/r1.1/api_docs/python/tf/contrib/rnn/static_rnn
    output, state = tf.contrib.rnn.static_rnn(cell, inputs, dtype = tf.float32)
    
    # We only care about the final output which should be the model's prediction
    yH = tf.matmul(output[-1], w) + b
    
# Add loss function
with tf.name_scope("loss"):
    # We don't use "tf.nn.sparse_softmax_cross_entropy_with_logits" here since we have one hot encodings
    entropy = tf.nn.softmax_cross_entropy_with_logits(logits = yH, labels = y)
    loss = tf.reduce_mean(entropy, name = "loss")
    # Capture loss
    tf.summary.scalar("loss", loss)
    
with tf.name_scope("optimizer"):
    opt = tf.train.AdamOptimizer(learning_rate = lr).minimize(loss)
    
# Eval the model's accuracy
with tf.name_scope("eval"):
    # We don't use "tf.nn.in_top_k(yH, y, 1)" here since we aren't using "tf.nn.sparse_softmax_cross_entropy_with_logits"
    correct = tf.equal(tf.argmax(yH, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    # Capture accuracy
    tf.summary.scalar("accuracy", accuracy)

init = tf.global_variables_initializer()
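The `tf.unstack(x, num = timeSteps, axis = 1)` call in the cell above turns one 3-D tensor into a list with one 2-D tensor per time step. A NumPy analogue on a tiny made-up array shows the resulting layout:

```python
import numpy as np

batchSize, timeSteps, features = 2, 3, 4
x = np.arange(batchSize * timeSteps * features).reshape(batchSize, timeSteps, features)

# NumPy analogue of unstacking along axis 1: one [batchSize, features] array per time step
inputs = [x[:, t, :] for t in range(timeSteps)]

print(len(inputs), inputs[0].shape)
print(inputs[-1])
```

The last list element corresponds to the final time step, which is why the graph only feeds `output[-1]` into the prediction layer.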

In [85]:
# Execute the TF CG
counter = 0

with tf.Session() as sess:
    init.run()
    
    # Create the TB writer and init
    trainWriter = tf.summary.FileWriter(logDirTrain, sess.graph)
    validationWriter = tf.summary.FileWriter(logDirValidation)
    merge = tf.summary.merge_all()
    
    for e in range(epochs + 1):
        for i in range(mninst.train.num_examples // samples):      
            counter += 1     
            
            # Grab the next minibatch
            xBatch, yBatch = mninst.train.next_batch(samples)
            
            # Reshape x to [samples, timeSteps, features] for the LSTM:
            #   The image is given to us as a single vector of dimensionality 784
            #   So to use them we need to gather the number of rows together to be the timeSteps
            xBatch = xBatch.reshape(samples, timeSteps, features)
            
            # Train the model
            summary, _ = sess.run([merge, opt], feed_dict = {x: xBatch, y: yBatch})
            
            # Capture summary data every N steps
            if counter % 10 == 0:
                # Manually add to the train accuracy summary value
                summary, accTrain = sess.run([merge, accuracy], feed_dict = {x: xBatch, y: yBatch})
                trainWriter.add_summary(summary, counter) 
                
                # Manually add to the validation accuracy summary value
                
                # If validation accuracy calcs are causing speed issues you can reduce the number evaluated via the following:
                #summary, accValidation = sess.run([merge, accuracy], feed_dict = {
                #    x: mninst.validation.images[:450].reshape(-1, timeSteps, features), 
                #    y: mninst.validation.labels[:450]})
                summary, accValidation = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.validation.images.reshape(-1, timeSteps, features), 
                    y: mninst.validation.labels})
                validationWriter.add_summary(summary, counter)
                
        if e % 10 == 0:
            print(e, "Train Acc: ", accTrain, "Validation Acc: ", accValidation)
        
    print(" ")
    # Compute test set accuracy rating
    summary, accTest = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.test.images.reshape(-1, timeSteps, features), 
                    y: mninst.test.labels})
    print("FINAL :: ", "Train Acc: ", accTrain, "Validation Acc: ", accValidation, "Test Acc: ", accTest)

0 Train Acc:  0.96 Validation Acc:  0.9686
10 Train Acc:  1.0 Validation Acc:  0.9898
 
FINAL ::  Train Acc:  1.0 Validation Acc:  0.9898 Test Acc:  0.9882


<img style="float: left; margin-right: 15px;" src="images/mnist-run-one.png" />

Although a little slower than some other models we've looked at, the LSTM has excellent accuracy on this problem.
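For readers who want to see what the "loss" and "eval" scopes of LSTM v1 compute, here is a NumPy sketch of softmax cross entropy with one hot labels and of the argmax-based accuracy. The logits and labels are made up, and the function is an illustrative reimplementation, not TensorFlow's code.

```python
import numpy as np

def softmax_cross_entropy(logits, oneHotLabels):
    """Per-sample cross entropy between softmax(logits) and one hot labels."""
    # Subtract the row max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    logProbs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(oneHotLabels * logProbs).sum(axis=1)

# Made-up logits for two samples and three classes
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
labels = np.array([[1., 0., 0.],
                   [0., 1., 0.]])

loss = softmax_cross_entropy(logits, labels).mean()

# Accuracy: compare the predicted class index with the labelled class index
accuracy = (logits.argmax(axis=1) == labels.argmax(axis=1)).mean()

print(loss, accuracy)
```

The first sample is classified correctly and the second is not, so the accuracy is 0.5, while the loss is dominated by the confidently wrong second sample.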

## LSTM v2

* Enclose the CG architecture in a graph object; pass to the training session
* Utilize tf.contrib.rnn.BasicLSTMCell
* Utilize tf.nn.dynamic_rnn, so we don't need to unstack the 'x' tensor
* Remove manual weight and bias definitions and replace with a dense layer
* Utilize He initialization
* Track training and validation loss and accuracy in TensorBoard
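As background for the He initialization bullet, the NumPy sketch below draws weights from a normal distribution with variance `2 / fan_in`, which is the rule usually meant by that name. The layer sizes mirror `lstmUnits` and `classes`. Whether `tf.variance_scaling_initializer()` with its default arguments reproduces exactly this scale is worth checking against the TensorFlow documentation.

```python
import numpy as np

rng = np.random.default_rng(10)

fanIn, fanOut = 128, 10   # e.g. lstmUnits inputs feeding 10 class outputs

# He initialization: zero-mean normal with standard deviation sqrt(2 / fan_in)
heStd = np.sqrt(2.0 / fanIn)
w = rng.normal(0.0, heStd, size=(fanIn, fanOut))

print(heStd)
print(w.std())
```

Scaling by the number of inputs keeps the variance of the layer's outputs roughly stable as signals pass through it, which is the motivation for this family of initializers.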

In [88]:
# Reset the TF CG
resetGraph()

# Clean away any old log files
cleanLogs()

# Set the seed
tf.set_random_seed(seed)

# Set the TB logdir - We want two log dirs since we are going to be plotting two values on the same plot
logDirTrain = './logs/mnistLSTM/runTwo/train'
logDirValidation = './logs/mnistLSTM/runTwo/validation'

# Create the graph object and populate it
graph = tf.Graph()

with graph.as_default():
    # Create place holders
    x = tf.placeholder(tf.float32, shape = [None, timeSteps, features], name = 'x')
    # Give 2nd dimension arg to shape since we are using one hot encodings
    y = tf.placeholder(tf.int64, shape = [None, classes], name = 'y')

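    # Add the LSTM cells with a variance scaling initializer (we'll let TF worry about the "w" and "b" values)
    # Note: with default arguments tf.variance_scaling_initializer uses scale = 1.0; He initialization corresponds to scale = 2.0
    # Notice to do this we switch from "tf.name_scope" to "tf.variable_scope" and add the "initializer" param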
    with tf.variable_scope("LSTM", initializer = tf.variance_scaling_initializer()):

        # Create the basic LSTM cell
        # It does not allow cell clipping, a projection layer, and does not use peep-hole connections: it is the basic baseline.
        # https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell
        cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)

        # Add the cell to the RNN
        # https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn
        # Notice we don't have to unstack x as in the previous model
        output, state = tf.nn.dynamic_rnn(cell, x, dtype = tf.float32)

        # We only care about the final output which should be the model's prediction
        # state is an LSTMStateTuple (c, h), so state[-1] is the final hidden state h
        yH = tf.layers.dense(state[-1], classes)

    # Add loss function
    with tf.name_scope("loss"):
        # We don't use "tf.nn.sparse_softmax_cross_entropy_with_logits" here since we have one hot encodings
        entropy = tf.nn.softmax_cross_entropy_with_logits(logits = yH, labels = y)
        loss = tf.reduce_mean(entropy, name = "loss")
        # Capture loss
        tf.summary.scalar("loss", loss)

    with tf.name_scope("optimizer"):
        opt = tf.train.AdamOptimizer(learning_rate = lr).minimize(loss)

    # Eval the model's accuracy
    with tf.name_scope("eval"):
        # We don't use "tf.nn.in_top_k(yH, y, 1)" here since we aren't using "tf.nn.sparse_softmax_cross_entropy_with_logits"
        correct = tf.equal(tf.argmax(yH, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
        # Capture accuracy
        tf.summary.scalar("accuracy", accuracy)

    init = tf.global_variables_initializer()

In [89]:
# Execute the TF CG
counter = 0

with tf.Session(graph = graph) as sess:
    init.run()
    
    # Create the TB writer and init
    trainWriter = tf.summary.FileWriter(logDirTrain, sess.graph)
    validationWriter = tf.summary.FileWriter(logDirValidation)
    merge = tf.summary.merge_all()
    
    for e in range(epochs + 1):
        for i in range(mninst.train.num_examples // samples):      
            counter += 1     
            
            # Grab the next minibatch
            xBatch, yBatch = mninst.train.next_batch(samples)
            
            # Reshape x to [samples, timeSteps, features] for the LSTM:
            #   The image is given to us as a single vector of dimensionality 784
            #   So to use them we need to gather the number of rows together to be the timeSteps
            xBatch = xBatch.reshape(samples, timeSteps, features)
            
            # Train the model
            summary, _ = sess.run([merge, opt], feed_dict = {x: xBatch, y: yBatch})
            
            # Capture summary data every N steps
            if counter % 10 == 0:
                # Manually add to the train accuracy summary value
                summary, accTrain = sess.run([merge, accuracy], feed_dict = {x: xBatch, y: yBatch})
                trainWriter.add_summary(summary, counter) 
                
                # Manually add to the validation accuracy summary value
                
                # If validation accuracy calcs are causing speed issues you can reduce the number evaluated via the following:
                #summary, accValidation = sess.run([merge, accuracy], feed_dict = {
                #    x: mninst.validation.images[:450].reshape(-1, timeSteps, features), 
                #    y: mninst.validation.labels[:450]})
                summary, accValidation = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.validation.images.reshape(-1, timeSteps, features), 
                    y: mninst.validation.labels})
                validationWriter.add_summary(summary, counter)
                
        if e % 10 == 0:
            print(e, "Train Acc: ", accTrain, "Validation Acc: ", accValidation)
        
    print(" ")
    # Compute test set accuracy rating
    summary, accTest = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.test.images.reshape(-1, timeSteps, features), 
                    y: mninst.test.labels})
    print("FINAL :: ", "Train Acc: ", accTrain, "Validation Acc: ", accValidation, "Test Acc: ", accTest)

0 Train Acc:  0.9 Validation Acc:  0.9606
10 Train Acc:  1.0 Validation Acc:  0.9884
 
FINAL ::  Train Acc:  1.0 Validation Acc:  0.9884 Test Acc:  0.9877


<img style="float: left; margin-right: 15px;" src="images/mnist-run-two.png" />

## LSTM v3

* Enclose the CG architecture in a graph object; pass to the training session
* Utilize tf.contrib.rnn.LSTMBlockCell
* Utilize tf.nn.dynamic_rnn, so we don't need to unstack the 'x' tensor
* Apply [Batch normalization](https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization)
* Dense layer with He initialization for weights and biases
* Using [tf.nn.softmax_cross_entropy_with_logits_v2](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits_v2) in the 'loss' calculations
* Perform gradient clipping during optimization (the code below clips each gradient element with tf.clip_by_value; [tf.clip_by_global_norm](https://www.tensorflow.org/versions/r1.2/api_docs/python/tf/clip_by_global_norm) is an alternative)
* Track training and validation loss and accuracy in TensorBoard

LSTM types and benchmarks:  https://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html
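Batch normalization, listed above, behaves differently during training and inference, which is why the graph below carries an `isTraining` flag. The NumPy sketch here is a simplified illustration (no learned scale or offset) and not TensorFlow's implementation; the function name is made up, and the momentum and epsilon values are the defaults I believe `tf.layers.batch_normalization` uses.

```python
import numpy as np

def batch_norm(x, movingMean, movingVar, training, momentum=0.99, eps=1e-3):
    """Simplified batch normalization over axis 0 (no gamma/beta)."""
    if training:
        # Training: normalize with the statistics of the current batch
        mean, var = x.mean(axis=0), x.var(axis=0)
        # ...and update the moving averages used later at inference time
        movingMean = momentum * movingMean + (1 - momentum) * mean
        movingVar = momentum * movingVar + (1 - momentum) * var
    else:
        # Inference: normalize with the stored moving averages
        mean, var = movingMean, movingVar
    return (x - mean) / np.sqrt(var + eps), movingMean, movingVar

x = np.array([[1., 2.], [3., 6.]])
trainOut, mm, mv = batch_norm(x, np.zeros(2), np.ones(2), training=True)
testOut, _, _ = batch_norm(x, np.zeros(2), np.ones(2), training=False)

print(trainOut)
print(mm, mv)
```

If the moving averages are never updated during training, inference keeps using their initial values, so the update step matters as much as the flag itself.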

In [None]:
# Reset the TF CG
resetGraph()

# Clean away any old log files
cleanLogs()

# Set the seed
tf.set_random_seed(seed)

# Set the TB logdir - We want two log dirs since we are going to be plotting two values on the same plot
logDirTrain = './logs/mnistLSTM/runThree/train'
logDirValidation = './logs/mnistLSTM/runThree/validation'

# Create the graph object and populate it
graph = tf.Graph()

with graph.as_default():
    # We need a way to track if we are training or not for the batch normalization layer
    isTraining = tf.placeholder_with_default(False, shape = (), name = 'isTraining')
    
    # Create place holders
    x = tf.placeholder(tf.float32, shape = [None, timeSteps, features], name = 'x')
    # Give 2nd dimension arg to shape since we are using one hot encodings
    y = tf.placeholder(tf.int64, shape = [None, classes], name = 'y')

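    # Add the LSTM cells with a variance scaling initializer (we'll let TF worry about the "w" and "b" values)
    # Note: with default arguments tf.variance_scaling_initializer uses scale = 1.0; He initialization corresponds to scale = 2.0
    # Notice to do this we switch from "tf.name_scope" to "tf.variable_scope" and add the "initializer" param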
    with tf.variable_scope("LSTM", initializer = tf.variance_scaling_initializer()):

        # Create LSTMBlockCell which should be faster
        # https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LSTMBlockCell
        # LSTM types and benchmarks:  https://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html
        cell = tf.contrib.rnn.LSTMBlockCell(lstmUnits)

        # Add the LSTMBlockCell to the RNN
        # https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn
        # Notice we don't have to unstack x as in the previous model
        output, state = tf.nn.dynamic_rnn(cell, x, dtype = tf.float32)
        
        # Return the last output for each sample and apply batch normalization
        # Ex: 
        #   x = np.arange(24)
        #   x = x.reshape((2,3,4))
        #   x
        #   >>>
        #   array([[[ 0,  1,  2,  3],
        #           [ 4,  5,  6,  7],
        #           [ 8,  9, 10, 11]],
        #   
        #          [[12, 13, 14, 15],
        #           [16, 17, 18, 19],
        #           [20, 21, 22, 23]]])
        #
        #   x[:,-1,:]
        #   >>>
        #   array([[ 8,  9, 10, 11],
        #          [20, 21, 22, 23]])
        #
        # Don't forget to enable/disable training!
        bnormOutput = tf.layers.batch_normalization(output[:, -1, :], training = isTraining)
        
        # Apply the dense layer to output one logit per class (the softmax is applied inside the loss)
        yH = tf.layers.dense(bnormOutput, classes)

    # Add loss function
    with tf.name_scope("loss"):
        # We don't use "tf.nn.sparse_softmax_cross_entropy_with_logits" here since we have one hot encodings
        entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits = yH, labels = y)
        loss = tf.reduce_mean(entropy, name = "loss")
        # Capture loss
        tf.summary.scalar("loss", loss)

    with tf.name_scope("optimizer"):
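        # Since we want to apply gradient clipping we need to compute the gradients,
        # clip them, and then apply the update ourselves instead of calling "minimize"
        # Here each gradient element is clipped to [-1, 1] with "tf.clip_by_value";
        # "tf.clip_by_global_norm" is an alternative that rescales all gradients together
        # https://stackoverflow.com/questions/36498127/how-to-apply-gradient-clipping-in-tensorflow
        # https://www.tensorflow.org/api_docs/python/tf/clip_by_global_norm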
        _opt = tf.train.AdamOptimizer(learning_rate = lr)
        gvs = _opt.compute_gradients(loss)
        cappedGvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
        opt = _opt.apply_gradients(cappedGvs)       

    # Eval the model's accuracy
    with tf.name_scope("eval"):
        # We don't use "tf.nn.in_top_k(yH, y, 1)" here since we aren't using "tf.nn.sparse_softmax_cross_entropy_with_logits"
        correct = tf.equal(tf.argmax(yH, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
        # Capture accuracy
        tf.summary.scalar("accuracy", accuracy)

    init = tf.global_variables_initializer()
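
The optimizer block above clips each gradient element with `tf.clip_by_value`, while its comments also link to `tf.clip_by_global_norm`. The following is a minimal NumPy sketch of how the two strategies differ, not the notebook's own implementation. The helper names `clip_by_value` and `clip_by_global_norm` and the example gradients are hypothetical and chosen only for illustration.

```python
import numpy as np

def clip_by_value(grads, low, high):
    # Clip every element independently; this can change the gradient's direction
    return [np.clip(g, low, high) for g in grads]

def clip_by_global_norm(grads, clip_norm):
    # Rescale all gradients by one shared factor so their joint L2 norm
    # is at most clip_norm; this preserves the overall direction
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Hypothetical gradients for two variables
grads = [np.array([3.0, -4.0]), np.array([0.5, 12.0])]

by_value = clip_by_value(grads, -1.0, 1.0)
by_norm, norm_before = clip_by_global_norm(grads, 1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in by_norm))

print("clipped by value:", by_value)
print("clipped by global norm:", by_norm)
print("global norm before / after:", norm_before, norm_after)
```

Element-wise clipping is simple and bounds every component, but large and small components end up with similar magnitudes. Global-norm clipping keeps the ratio between components intact, which is one reason it is often suggested for recurrent networks.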

In [None]:
# Execute the TF computational graph
counter = 0

with tf.Session(graph = graph) as sess:
    init.run()
    
    # Create the TB writer and init
    trainWriter = tf.summary.FileWriter(logDirTrain, sess.graph)
    validationWriter = tf.summary.FileWriter(logDirValidation)
    merge = tf.summary.merge_all()
    
    for e in range(epochs + 1):
        for i in range(mninst.train.num_examples // samples):      
            counter += 1     
            
            # Grab the next minibatch
            xBatch, yBatch = mninst.train.next_batch(samples)
            
            # Reshape x to [samples, timeSteps, features] for the LSTM:
            #   Each image is given to us as a single flat vector of dimensionality 784,
            #   so we split it into timeSteps rows of features pixels, one row per time step
            xBatch = xBatch.reshape(samples, timeSteps, features)
            
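            # Train the model
            # Batch normalization needs isTraining = True to use the batch statistics, and its
            # moving average update ops must run with the optimizer so evaluation sees trained statistics
            updateOps = graph.get_collection(tf.GraphKeys.UPDATE_OPS)
            summary, _, _ = sess.run([merge, opt, updateOps], feed_dict = {x: xBatch, y: yBatch, isTraining: True})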
            
            # Capture summary data every N steps
            if counter % 10 == 0:
                # Record the train summaries (loss and accuracy) for the current minibatch
                summary, accTrain = sess.run([merge, accuracy], feed_dict = {x: xBatch, y: yBatch})
                trainWriter.add_summary(summary, counter) 
                
                # Record the validation summaries (loss and accuracy)
                # If validation accuracy calcs are causing speed issues you can reduce the number tested via the following:
                #summary, accValidation = sess.run([merge, accuracy], feed_dict = {
                #    x: mninst.validation.images[:450].reshape(-1, timeSteps, features), 
                #    y: mninst.validation.labels[:450]})
                summary, accValidation = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.validation.images.reshape(-1, timeSteps, features), 
                    y: mninst.validation.labels})
                validationWriter.add_summary(summary, counter)
                
        if e % 10 == 0:
            print(e, "Train Acc: ", accTrain, "Validation Acc: ", accValidation)
        
    print(" ")
    # Compute test set accuracy rating
    summary, accTest = sess.run([merge, accuracy], feed_dict = {
                    x: mninst.test.images.reshape(-1, timeSteps, features), 
                    y: mninst.test.labels})
    print("FINAL :: ", "Train Acc: ", accTrain, "Validation Acc: ", accValidation, "Test Acc: ", accTest)
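
The training loop reshapes each flat 784-value vector into `[samples, timeSteps, features]` before feeding the LSTM. The NumPy sketch below shows what that reshape does to the data. It assumes `timeSteps = 28` and `features = 28` (values not shown in this cell, though their product must be 784) and uses a made-up batch of consecutive integers so the mapping is easy to follow.

```python
import numpy as np

# Assumed sizes for illustration: 28 time steps (rows) of 28 features (pixels)
time_steps, features = 28, 28
samples = 4

# Hypothetical minibatch of flattened images, shape [samples, 784]
flat_batch = np.arange(samples * time_steps * features, dtype=np.float32)
flat_batch = flat_batch.reshape(samples, time_steps * features)

# Row-major reshape: each consecutive run of `features` values becomes one time step
sequences = flat_batch.reshape(-1, time_steps, features)

# The second time step of the first image is pixels 28..55 of its flat vector
second_step = sequences[0, 1, :]

# The last time step is what the model keeps via output[:, -1, :]
last_step = sequences[:, -1, :]

print(sequences.shape)
print(second_step[:5])
print(last_step.shape)
```

Passing `-1` as the first dimension, as the validation and test feeds do, lets the same reshape handle any number of samples.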