## Practical Machine Learning
### Assignment 8 - Language Modeling with a Recurrent Neural Network (RNN)

Description:
This assignment involves working with language models developed with pretrained word vectors. We use sentences (sequences of words) to train language models for predicting movie review sentiment (thumbs up versus thumbs down). We study effects of word vector size, vocabulary size, and neural network structure (hyperparameters) on classification performance. We build on resources for recurrent neural networks (RNNs) as implemented in TensorFlow. RNNs are well suited to the analysis of sequences, as needed for natural language processing (NLP). 
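
To illustrate why recurrent networks suit sequences, the following minimal sketch runs a single recurrent unit over a short sequence in plain Python. It is not the TensorFlow implementation used later in this assignment; the function name `rnn_forward`, the weights, and the inputs are all made up for illustration. The hidden state at each step depends on the current input and on the previous state, so the same inputs in a different order give a different final state.

```python
import math

def rnn_forward(sequence, w_x, w_h, b):
    """Run a one-unit recurrent cell over a sequence of input vectors.

    h_t = tanh(w_x . x_t + w_h * h_{t-1} + b), starting from h_0 = 0.
    """
    h = 0.0
    states = []
    for x_t in sequence:
        pre_activation = sum(w * x for w, x in zip(w_x, x_t)) + w_h * h + b
        h = math.tanh(pre_activation)
        states.append(h)
    return states

# Made-up weights and a three-step sequence of two-dimensional inputs
w_x, w_h, b = [0.5, -0.3], 0.8, 0.1
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = rnn_forward(sequence, w_x, w_h, b)
reversed_states = rnn_forward(sequence[::-1], w_x, w_h, b)
print("final state, original order:", states[-1])
print("final state, reversed order:", reversed_states[-1])
```

In a sentiment model, each input vector would be a word embedding and the final state would feed a classification layer.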

Initial background reading for this assignment comes from Chapter 14 (pp. 379–411) of the Géron textbook:
Géron, A. (2017). Hands-on machine learning with Scikit-Learn & TensorFlow: Concepts, tools, and techniques to build intelligent systems. Sebastopol, CA: O’Reilly. [ISBN-13 978-1-491-96229-9]. 

Specialized RNN models have been developed to accommodate the needs of many language processing tasks. Larger relevant vocabularies are usually associated with more accurate models, but training with larger vocabularies requires more memory and longer processing times. We can speed up the training process by using pretrained word vectors and subsets of pretrained word vectors.

Technologies such as word2vec, GloVe (global vectors), and fastText provide ways of representing words as numeric vectors. These numeric vectors or neural network embeddings capture the meaning of words as well as their common usage as parts of speech. Word embeddings have extensive applications in natural language processing.
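
As a rough illustration of how numeric vectors can reflect word meaning, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are invented for this example; pretrained GloVe or fastText vectors have 25 to 300 dimensions and would be read from disk.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not taken from any pretrained file)
toy_vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, 0.2],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_awful = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])
print("good vs great:", round(sim_good_great, 3))
print("good vs awful:", round(sim_good_awful, 3))
```

Words used in similar contexts tend to have vectors pointing in similar directions, which gives a cosine similarity close to 1.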

This assignment requires the use of two pretrained word embeddings selected from a list of supported vectors. That is, we replace each word in a sentence or sequence with a vector of numbers. Methods for downloading word embeddings are provided in the Python package chakin { https://github.com/chakki-works/chakin } 

Early work on word2vec embeddings is cited in these references:
- Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space
https://arxiv.org/pdf/1301.3781.pdf 

- Mikolov, T., et al. (2013). Linguistic Regularities in Continuous Space Word Representations
https://www.aclweb.org/anthology/N13-1090 

GloVe, a method for estimating pretrained word vectors, was developed at Stanford 
https://nlp.stanford.edu/projects/glove/

A tutorial and code for using GloVe embeddings is available here. 
https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer

A third set of pretrained vectors is fastText, described here 
https://arxiv.org/pdf/1607.04606.pdf

Word embeddings are an active area of research, as shown in recent developments in probabilistic fastText 
http://aclweb.org/anthology/P18-1001

We can also test vocabulary sizes associated with pretrained word vectors, defined as part of the language model. For example, we might compare a vocabulary of the top 10,000 words in English versus a vocabulary of the top 30,000 words.

(Optional) We can test alternative RNN structures and settings of hyperparameters.

Assignment Requirements
- Install the Python chakin package, obtain GloVe (and perhaps non-GloVe) embeddings.
- Load and run jump-start code for the assignment, which uses pretrained word vectors from GloVe.6B.50d, a vocabulary of 10,000 words, and movie review data. 
- Revise the jump-start code to accommodate two pretrained word vectors and two vocabulary sizes. These represent the cells of a completely crossed 2-by-2 experimental design, defining four distinct language models.
- (Optional) Test two or more alternative RNN structures or hyperparameter settings.
- Build and evaluate at least four language models of the experimental design. For each cell in the design, compute classification accuracy in the test set.
- Evaluate the four language models and make recommendations to management.
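
The completely crossed 2-by-2 design described above can be sketched as follows. The two embedding names and the two vocabulary sizes are example choices only (any two supported embeddings and any two sizes would do), and `classification_accuracy` is a hypothetical helper, not part of the jump-start code.

```python
from itertools import product

# Two embedding sources and two vocabulary sizes (example choices only)
embedding_sources = ["GloVe.6B.50d", "GloVe.6B.300d"]
vocabulary_sizes = [10000, 30000]

# Each (embedding, vocabulary size) pair is one cell of the 2-by-2 design
design_cells = list(product(embedding_sources, vocabulary_sizes))

def classification_accuracy(true_labels, predicted_labels):
    """Proportion of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Placeholder for results; test-set accuracies are filled in after training
results = {cell: None for cell in design_cells}

for embedding_name, vocab_size in design_cells:
    print("Model: {} with vocabulary of {} words".format(
        embedding_name, vocab_size))
```

Each of the four cells corresponds to one language model to train and evaluate on the same test set.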

Management Problem
- Suppose management is thinking about using a language model to classify written customer reviews and call and complaint logs. If the most critical customer messages can be identified, then customer support personnel can be assigned to contact those customers.
- How would you advise senior management? What kinds of systems and methods would be most relevant to the customer services function? Considering the results of this assignment in particular, what is needed to make an automated customer support system that is capable of identifying negative customer feelings? What can data scientists do to make language models more useful in a customer service function?

Grading Guidelines (50 points)
(1) Data preparation, exploration, visualization (10 points)
(2) Review research design and modeling methods (10 points)
(3) Review results, evaluate models (10 points)
(4) Implementation and programming (10 points)
(5) Exposition, problem description, and management recommendations (10 points)

Deliverables and File Formats
Create a folder or directory with all supplementary files with your last name at the beginning of the folder name, compress that folder with zip compression, and post the zip-archived folder under the assignment link in Canvas. The following files should be included in an archive folder/directory that is uploaded as a single zip-compressed file. (Use zip, not StuffIt or any 7z or other compression method.)

Provide a double-spaced paper with a two-page maximum for the text. The paper should include (1) a summary and problem definition for management; (2) discussion of the research design, measurement and statistical methods, traditional and machine learning methods employed; (3) overview of programming work; and (4) review of results with recommendations for management. (The paper must be provided as an Adobe Acrobat pdf file. MS Word files are NOT acceptable.)

Files or links to files should be provided in the format as used by the Python program.

Because only minor revisions to the jump-start code are needed to implement the language models being tested in this assignment, it is not necessary to provide Python code for the language models.

Output from the program, such as console listing/logs, text files, and graphics output for visualizations.
List file names and descriptions of files in the zip-compressed folder/directory.

Appendix: chakin-supported vectors
Here are a few of the chakin-supported pretrained word vectors as of August 12, 2018:
[index number] Name	Dimension	Corpus	Vocab Size	Method	Language
[0] fastText(ar)	300	Wikipedia	610K	fastText	Arabic
[1] fastText(de)	300	Wikipedia	2.3M	fastText	German
[2] fastText(en)	300	Wikipedia	2.5M	fastText	English
[3] fastText(es)	300	Wikipedia	985K	fastText	Spanish
[4] fastText(fr)	300	Wikipedia	1.2M	fastText	French
[5] fastText(it)	300	Wikipedia	871K	fastText	Italian
[6] fastText(ja)	300	Wikipedia	580K	fastText	Japanese
[7] fastText(ko)	300	Wikipedia	880K	fastText	Korean
[8] fastText(pt)	300	Wikipedia	592K	fastText	Portuguese
[9] fastText(ru)	300	Wikipedia	1.9M	fastText	Russian
[10] fastText(zh)	300	Wikipedia	330K	fastText	Chinese
[11] GloVe.6B.50d	50	Wikipedia+Gigaword 5 (6B)	400K	GloVe	English
[12] GloVe.6B.100d	100	Wikipedia+Gigaword 5 (6B)	400K	GloVe	English
[13] GloVe.6B.200d	200	Wikipedia+Gigaword 5 (6B)	400K	GloVe	English
[14] GloVe.6B.300d	300	Wikipedia+Gigaword 5 (6B)	400K	GloVe	English
[15] GloVe.42B.300d	300	Common Crawl(42B)	1.9M	GloVe	English
[16] GloVe.840B.300d	300	Common Crawl(840B)	2.2M	GloVe	English
[17] GloVe.Twitter.25d	25	Twitter(27B)	1.2M	GloVe	English
[18] GloVe.Twitter.50d	50	Twitter(27B)	1.2M	GloVe	English
[19] GloVe.Twitter.100d	100	Twitter(27B)	1.2M	GloVe	English
[20] GloVe.Twitter.200d	200	Twitter(27B)	1.2M	GloVe	English
[21] word2vec.GoogleNews	300	Google News(100B)	3.0M	word2vec	English
[22] word2vec.Wiki-NEologd.50d	50	Wikipedia	335K	word2vec + NEologd	Japanese

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

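# Ignore TensorFlow-related warnings
# Set TF_CPP_MIN_LOG_LEVEL before importing TensorFlow so that the setting
# also applies to messages logged at import time
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any of {'0', '1', '2'}

import tensorflow as tf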

In [2]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

##### Data Collection
##### RNN - Movie Review Sentiment  PART 1 : run chakin to get embeddings

In [3]:
# Common imports for our work
import os  # operating system functions
import re  # regular expressions
import time
import json
from collections import defaultdict
from datetime import datetime

import numpy as np  # arrays and math functions
import pandas as pd  # data frame operations
import scipy
import sklearn
import matplotlib.pyplot as plt  # static plotting
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
import cv2
import tensorflow as tf
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Python chakin package previously installed by 
#    pip install chakin
import chakin  

In [4]:
chakin.search(lang='English')  # lists available indices in English

                   Name  Dimension                     Corpus VocabularySize  \
2          fastText(en)        300                  Wikipedia           2.5M   
11         GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)           400K   
12        GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)           400K   
13        GloVe.6B.200d        200  Wikipedia+Gigaword 5 (6B)           400K   
14        GloVe.6B.300d        300  Wikipedia+Gigaword 5 (6B)           400K   
15       GloVe.42B.300d        300          Common Crawl(42B)           1.9M   
16      GloVe.840B.300d        300         Common Crawl(840B)           2.2M   
17    GloVe.Twitter.25d         25               Twitter(27B)           1.2M   
18    GloVe.Twitter.50d         50               Twitter(27B)           1.2M   
19   GloVe.Twitter.100d        100               Twitter(27B)           1.2M   
20   GloVe.Twitter.200d        200               Twitter(27B)           1.2M   
21  word2vec.GoogleNews        300      

In [5]:
# Specify English embeddings file to download and install by index number, number of dimensions, and subfolder name
# Note that GloVe 50-, 100-, 200-, and 300-dimensional folders are downloaded with a single zip download
CHAKIN_INDEX = 11
NUMBER_OF_DIMENSIONS = 50
SUBFOLDER_NAME = "gloVe.6B"

In [6]:
DATA_FOLDER = "embeddings"
ZIP_FILE = os.path.join(DATA_FOLDER, "{}.zip".format(SUBFOLDER_NAME))
# sometimes the downloaded zip file name is lowercase only
ZIP_FILE_ALT = os.path.join(DATA_FOLDER, "{}.zip".format(SUBFOLDER_NAME.lower()))
UNZIP_FOLDER = os.path.join(DATA_FOLDER, SUBFOLDER_NAME)
if SUBFOLDER_NAME[-1] == "d":
    GLOVE_FILENAME = os.path.join(
        UNZIP_FOLDER, "{}.txt".format(SUBFOLDER_NAME))
else:
    GLOVE_FILENAME = os.path.join(UNZIP_FOLDER, "{}.{}d.txt".format(
        SUBFOLDER_NAME, NUMBER_OF_DIMENSIONS))


if not os.path.exists(ZIP_FILE) and not os.path.exists(UNZIP_FOLDER):
    # GloVe by Stanford is licensed Apache 2.0:
    #     https://github.com/stanfordnlp/GloVe/blob/master/LICENSE
    #     http://nlp.stanford.edu/data/glove.twitter.27B.zip
    #     Copyright 2014 The Board of Trustees of The Leland Stanford Junior University
    print("Downloading embeddings to '{}'".format(ZIP_FILE))
    chakin.download(number=CHAKIN_INDEX, save_dir='./{}'.format(DATA_FOLDER))
else:
    print("Embeddings already downloaded.")

if not os.path.exists(UNZIP_FOLDER):
    import zipfile
    if not os.path.exists(ZIP_FILE) and os.path.exists(ZIP_FILE_ALT):
        ZIP_FILE = ZIP_FILE_ALT
    with zipfile.ZipFile(ZIP_FILE, "r") as zip_ref:
        print("Extracting embeddings to '{}'".format(UNZIP_FOLDER))
        zip_ref.extractall(UNZIP_FOLDER)
else:
    print("Embeddings already extracted.")

print('\nRun complete')

# After this step there should be:
#   an embeddings folder in the current working directory
#   a directory called gloVe.6B within the embeddings directory
#   four files within that directory

Embeddings already downloaded.
Embeddings already extracted.

Run complete


##### Data Preparation & Modelling

In [7]:
# Common imports for our work
from __future__ import absolute_import  # absolute imports by default
from __future__ import division  # true division for the / operator
from __future__ import print_function  # make print a function
import numpy as np # import numpy
import os  # operating system functions
import os.path  # for manipulation of file path names
import re  # regular expressions
from collections import defaultdict # dict subclass that calls a factory function to supply missing values
import nltk # Natural Language Toolkit
from nltk.tokenize import TreebankWordTokenizer #tokenize text
import tensorflow as tf #TensorFlow
import time # Record processing time

In [8]:
# seed value for random number generators to obtain reproducible results
RANDOM_SEED = 9999

In [9]:
# Check current working directory
os.getcwd() 

'C:\\Users\\myl94\\OneDrive\\Desktop\\ds projects\\422 exercises\\Exercises 8\\Assignments'

In [10]:

# To make output stable across runs
def reset_graph(seed= RANDOM_SEED):
#     tf.compat.v1.get_default_graph()
#     tf.random.set_seed(seed)
#     np.random.seed(seed)
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [11]:
REMOVE_STOPWORDS = False  # no stopword removal 

In [12]:
EVOCABSIZE = 10000  # specify desired size of pre-defined embedding vocabulary 

In [13]:
# ------------------------------------------------------------- 
# Select the pre-defined embeddings source        
# Define vocabulary size for the language model    
# Create a word_to_embedding_dict for GloVe.6B.50d
embeddings_directory = 'embeddings/gloVe.6B'
filename = 'glove.6B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
# ------------------------------------------------------------- 

In [14]:
# Utility function for loading embeddings follows methods described in
# https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# Creates the Python defaultdict dictionary word_to_embedding_dict
# for the requested pre-trained word embeddings

# Note the use of defaultdict data structure from the Python Standard Library
# collections_defaultdict.py lets the caller specify a default value up front
# The default value is returned if the key is not a known dictionary key
# For word embeddings, this default value is a vector of zeros
# That is, unknown words are represented by a vector of zeros
# Documentation for the Python standard library:
#   Hellmann, D. 2017. The Python 3 Standard Library by Example. Boston: 
#     Addison-Wesley. [ISBN-13: 978-0-13-429105-5]


# Load an embeddings text file
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    """
    Read an embeddings txt file. If `with_indexes=True`, we return a tuple of two objects,
    `(word_to_index_dict, index_to_embedding_array)`; otherwise we return only a direct 
    `word_to_embedding_dict` dictionary mapping from a string to a numpy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Pass the loaded mapping so that known words keep their vectors
        word_to_embedding_dict = defaultdict(
            lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict
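
The loading logic above can be tried without downloading anything. The sketch below parses a tiny in-memory sample laid out like a GloVe text file (a word followed by its vector components on each line). The words and two-dimensional values are made up for illustration, and the variable names mirror those used in this notebook.

```python
import io
from collections import defaultdict

# Made-up two-dimensional vectors in GloVe's "word v1 v2 ..." text layout
sample_file = io.StringIO("the 0.1 0.2\nmovie 0.3 -0.4\ngood 0.5 0.6\n")

word_to_index = {}
index_to_embedding = []
for i, line in enumerate(sample_file):
    parts = line.rstrip().split(" ")
    word_to_index[parts[0]] = i
    index_to_embedding.append([float(val) for val in parts[1:]])

# Unknown words map to one extra row of zeros appended at the end
unknown_index = len(index_to_embedding)
index_to_embedding.append([0.0] * len(index_to_embedding[0]))
word_to_index = defaultdict(lambda: unknown_index, word_to_index)

print(word_to_index["movie"], index_to_embedding[word_to_index["movie"]])
print(word_to_index["zzz"], index_to_embedding[word_to_index["zzz"]])
```

Note that looking up a missing key in a `defaultdict` also stores that key, so the dictionary grows as unknown words are encountered.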


In [15]:
# Check that the glove.6B.50d embeddings file loads successfully.
# The file was obtained through "run-chakin-to-get-embeddings-v001.py"
print('\nLoading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
print("Embedding loaded from disks.")

# Note: unknown words have representations with values [0, 0, ..., 0]


Loading embeddings from embeddings/gloVe.6B\glove.6B.50d.txt
Embedding loaded from disks.


In [16]:
# Additional background code from
# https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# shows the general structure of the data structures for word embeddings
# This code is modified for our purposes in language modeling 

# Check vocabulary size and embedding dimension
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
print("This means (number of words, number of dimensions per word)\n")
print("The first words are words that tend occur more often.")

#Check embedding data
print("Note: for unknown words, the representation is an empty vector,\n"
      "and the index is the last one. The dictionnary has a limit:")
print("    {} --> {} --> {}".format("A word", "Index in embedding", 
      "Representation"))
word = "worsdfkljsdf"  # a word obviously not in the vocabulary
idx = word_to_index[word] # index for word obviously not in the vocabulary
complete_vocabulary_size = idx 
embd = list(np.array(index_to_embedding[idx], dtype=int))  # "int" for compact print only
print("    {} --> {} --> {}".format(word, idx, embd))
word = "the"
idx = word_to_index[word]
embd = list(index_to_embedding[idx])  # full floating-point values
print("    {} --> {} --> {}".format(word, idx, embd))

Embedding is of shape: (400001, 50)
This means (number of words, number of dimensions per word)

The first words are words that tend occur more often.
Note: for unknown words, the representation is an empty vector,
and the index is the last one. The dictionnary has a limit:
    A word --> Index in embedding --> Representation
    worsdfkljsdf --> 400000 --> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    the --> 0 --> [0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862, -0.00066023, -0.6566, 0.27843, -0.14767, -0.55677, 0.14658, -0.0095095, 0.011658, 0.10204, -0.12792, -0.8443, -0.12181, -0.016801, -0.33279, -0.1552, -0.23131, -0.19181, -1.8823, -0.76746, 0.099051, -0.42125, -0.19526, 4.0071, -0.18594, -0.52287, -0.31681, 0.00059213, 0.0074449, 0.17778, -0.15897, 0.012041, -0.054223, -0.29871, -0.15749, -0.34758, -0.045637, -0.44251, 0.18785, 0.0027849, -0.18

In [17]:
# Show how to use embeddings dictionaries with a test sentence
# This is a famous typing exercise with all letters of the alphabet
# https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog
a_typing_test_sentence = 'The quick brown fox jumps over the lazy dog'
print('\nTest sentence: ', a_typing_test_sentence, '\n')
words_in_test_sentence = a_typing_test_sentence.split()

print('Test sentence embeddings from complete vocabulary of', 
      complete_vocabulary_size, 'words:\n')
for word in words_in_test_sentence:
    word_ = word.lower()
    embedding = index_to_embedding[word_to_index[word_]]
    print(word_ + ": ", embedding)


Test sentence:  The quick brown fox jumps over the lazy dog 

Test sentence embeddings from complete vocabulary of 400000 words:

the:  [ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
quick:  [ 0.13967   -0.53798   -0.18047   -0.25142    0.16203   -0.13868
 -0.24637    0.75111    0.27264    0.61035   -0.82548    0.038647
 -0.32361    0.30373   -0.14598   -0.23551    0.39267   -1.1287
 -0.23636   -1.0629     0.046277   0.29143   -0.25

In [18]:
# ------------------------------------------------------------- 
# Define vocabulary size for the language model    
# To reduce the size of the vocabulary to the n most frequently used words

def default_factory():
    return EVOCABSIZE  # last/unknown-word row in limited_index_to_embedding
# dictionary has the items() function, returns list of (key, value) tuples
limited_word_to_index = defaultdict(default_factory, \
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})


In [19]:
# Select the first EVOCABSIZE rows of index_to_embedding
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]

# Set the unknown-word row to be all zeros as previously
limited_index_to_embedding = np.append(limited_index_to_embedding, 
    index_to_embedding[index_to_embedding.shape[0] - 1, :].\
        reshape(1,embedding_dim), 
    axis = 0)
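
The effect of limiting the vocabulary can be checked on a toy example. The sketch below uses plain Python lists instead of NumPy arrays, a made-up four-word vocabulary with one-dimensional vectors, and hypothetical `toy_` names. It assumes, as in the GloVe files, that lower indices correspond to more frequent words.

```python
from collections import defaultdict

# Toy "full" vocabulary ordered by frequency, with made-up 1-d vectors
full_word_to_index = {"the": 0, "movie": 1, "good": 2, "cinematography": 3}
full_embeddings = [[0.1], [0.2], [0.3], [0.4], [0.0]]  # last row: unknown word

TOY_VOCAB_SIZE = 2  # keep only the two most frequent words

# Keep words whose index falls inside the limit; others map to the unknown row
toy_limited_word_to_index = defaultdict(
    lambda: TOY_VOCAB_SIZE,
    {word: idx for word, idx in full_word_to_index.items()
     if idx < TOY_VOCAB_SIZE})

# First TOY_VOCAB_SIZE rows plus the all-zero unknown-word row
toy_limited_embeddings = full_embeddings[:TOY_VOCAB_SIZE] + [full_embeddings[-1]]

for word in ["the", "good"]:
    print(word, toy_limited_embeddings[toy_limited_word_to_index[word]])
```

A word that falls outside the limit, such as "good" here, is treated exactly like a word that never appeared in the pretrained file: it receives the zero vector.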

In [20]:
# Verify the new vocabulary: should get same embeddings for test sentence
# Note that a small EVOCABSIZE may yield some zero vectors for embeddings
print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE, 'words:\n')
for word in words_in_test_sentence:
    word_ = word.lower()
    embedding = limited_index_to_embedding[limited_word_to_index[word_]]
    print(word_ + ": ", embedding)


Test sentence embeddings from vocabulary of 10000 words:

the:  [ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
quick:  [ 0.13967   -0.53798   -0.18047   -0.25142    0.16203   -0.13868
 -0.24637    0.75111    0.27264    0.61035   -0.82548    0.038647
 -0.32361    0.30373   -0.14598   -0.23551    0.39267   -1.1287
 -0.23636   -1.0629     0.046277   0.29143   -0.25819   -0.094902
  0.79478   -1.2095    -0.01039   -0.092086   0.84322   

In [21]:
# ------------------------------------------------------------
# code for working with movie reviews data 
# Source: Miller, T. W. (2016). Web and Network Data Science.
#    Upper Saddle River, N.J.: Pearson Education.
#    ISBN-13: 978-0-13-388644-3
# This original study used a simple bag-of-words approach
# to sentiment analysis, along with pre-defined lists of
# negative and positive words.        
# Code available at:  https://github.com/mtpa/wnds       
# ------------------------------------------------------------

# Utility function to get file names within a directory
def listdir_no_hidden(path):
    # keep every directory entry whose name does not start with a dot
    return [name for name in os.listdir(path) if not name.startswith('.')]

In [22]:
# define list of codes to be dropped from document
# carriage-returns, line-feeds, tabs
codelist = ['\r', '\n', '\t']   

In [23]:
# We will not remove stopwords in this exercise because they are
# important to keeping sentences intact
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))

    # previous analysis of a list of top terms showed a number of words, along 
    # with contractions and other word strings to drop from further analysis; add
    # these to the usual English stopwords to be dropped from a document collection
    more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br',\
        've', 're', 'vs'] 

    some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

    # start with the initial list and add to it for movie text work 
    stoplist = nltk.corpus.stopwords.words('english') + more_stop_words +\
        some_proper_nouns_to_remove

In [24]:
# text parsing function for creating text documents 
# there is more we could do for data preparation 
# stemming... looking for contractions... possessives... 
# but we will work with what we have in this parsing function
# if we want to do stemming at a later time, we can use
#     porter = nltk.PorterStemmer()  
# in a construction like this
#     words_stemmed =  [porter.stem(word) for word in initial_words]  
def text_parse(string):
    # replace non-alphabetic characters (digits and punctuation included) with space 
    temp_string = re.sub(r'[^a-zA-Z]', '  ', string)    
    # replace codes with space
    for code in codelist:
        stopstring = ' ' + code + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)      
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)   
    # convert uppercase to lowercase
    temp_string = temp_string.lower()    
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for stopword in stoplist:
            stopstring = ' ' + str(stopword) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)        
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)    
    return temp_string
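
To see what this kind of cleaning does to raw review text, here is a simplified stand-alone sketch. The helper `simple_clean` and the sample string are hypothetical; unlike `text_parse`, the helper does not drop single-character words or stopwords.

```python
import re

def simple_clean(text):
    """Simplified cleaning: letters only, lowercase, single spaces."""
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # non-letters become spaces
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)         # collapse runs of whitespace
    return text.strip()

sample = "It's a GREAT film!<br />10/10,\twould watch again."
cleaned = simple_clean(sample)
print(cleaned)
```

Removing punctuation splits contractions ("It's" becomes "it s") and leaves HTML residue such as "br" behind as a token, which is one reason strings like 'br' and 'dont' can appear in a stopword list.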

In [25]:
os.getcwd()

'C:\\Users\\myl94\\OneDrive\\Desktop\\ds projects\\422 exercises\\Exercises 8\\Assignments'

In [26]:
# -----------------------------------------------
# Download and install both negative and positive movie reviews from the Technology Resources section 
# within the Jump Start Program for the Assignment 8 module. This entails saving the 'movie-reviews-negative'
# and 'movie-reviews-positive' directories from the run-jump-start-rnn-sentiment-v002.zip 
# or run-jump-start-rnn-sentiment-big-v002.zip file to your working directory
# -----------------------------------------------

# -----------------------------------------------
# gather data for 500 negative movie reviews
# -----------------------------------------------

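# Set path to the negative reviews directory, "movie-reviews-negative"
# os.path.join avoids the malformed 'Assignments./train' path that results
# from concatenating os.getcwd() with './train/...'
dir_name = os.path.join(os.getcwd(), 'train', 'movie-reviews-negative')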
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: C:\Users\myl94\OneDrive\Desktop\ds projects\422 exercises\Exercises 8\Assignments./train/movie-reviews-negative
500 files found


In [27]:
# Read data for negative movie reviews
# Data will be stored in a list of lists, where each inner list represents 
# a document as a list of words.
# We then break the text into words.

def read_data(filename):

    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

    return data

negative_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    negative_documents.append(words)
    # print('Data size (Characters) (Document %d) %d' %(i,len(words)))
    # print('Sample string (Document %d) %s'%(i,words[:50]))



Processing document files under C:\Users\myl94\OneDrive\Desktop\ds projects\422 exercises\Exercises 8\Assignments./train/movie-reviews-negative


In [28]:
# -----------------------------------------------
# gather data for 500 positive movie reviews
# -----------------------------------------------

# Set path to the positive reviews directory, "movie-reviews-positive"
dir_name = './train/movie-reviews-positive'  
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: ./train/movie-reviews-positive
500 files found


In [29]:
# Read data for positive movie reviews
# Data will be stored in a list of lists, where each inner list 
# represents a document as a list of words.
# We then break the text into words.

def read_data(filename):

    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

    return data

positive_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    positive_documents.append(words)
    # print('Data size (Characters) (Document %d) %d' %(i,len(words)))
    # print('Sample string (Document %d) %s'%(i,words[:50]))



Processing document files under ./train/movie-reviews-positive


In [30]:
# -----------------------------------------------------
# convert positive/negative documents into numpy array
# note that reviews vary from 22 to 1052 words   
# so we use the first 20 and last 20 words of each review 
# as our word sequences for analysis
# -----------------------------------------------------

max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length) 

max_review_length: 1052
min_review_length: 22


In [31]:
# construct list of 1000 lists with 40 words in each list
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
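
The head-and-tail construction above can be checked on synthetic documents. In the sketch below, the helper `head_and_tail` and the token lists are hypothetical stand-ins; it uses negative slicing rather than `chain`, which gives the same result for documents of at least 20 words.

```python
def head_and_tail(tokens, n=20):
    """Return the first n and last n tokens of a list as one sequence."""
    return tokens[:n] + tokens[-n:]

# Synthetic documents: one long, one at the minimum review length of 22
long_doc = ["w{}".format(i) for i in range(100)]
short_doc = ["w{}".format(i) for i in range(22)]

long_seq = head_and_tail(long_doc)
short_seq = head_and_tail(short_doc)

print(len(long_seq), long_seq[19], long_seq[20])
print(len(short_seq), short_seq[19], short_seq[20])
```

Every sequence has 40 tokens, which gives the fixed length needed for model input. For a 22-word review the two slices overlap, so 18 of its words appear twice in the sequence.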

In [32]:
# create list of lists of lists for embeddings
embeddings = []    
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]]) 
    embeddings.append(embedding)

In [33]:
# -----------------------------------------------------    
# Check on the embeddings list of list of lists 
# -----------------------------------------------------

# Show the first word in the first document
test_word = documents[0][0]    
print('First word in first document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[0][0][:])

First word in first document: story
Embedding for this word:
 [ 0.48251    0.87746   -0.23455    0.0262     0.79691    0.43102
 -0.60902   -0.60764   -0.42812   -0.012523  -1.2894     0.52656
 -0.82763    0.30689    1.1972    -0.47674   -0.46885   -0.19524
 -0.28403    0.35237    0.45536    0.76853    0.0062157  0.55421
  1.0006    -1.3973    -1.6894     0.30003    0.60678   -0.46044
  2.5961    -1.2178     0.28747   -0.46175   -0.25943    0.38209
 -0.28312   -0.47642   -0.059444  -0.59202    0.25613    0.21306
 -0.016129  -0.29873   -0.19468    0.53611    0.75459   -0.4112
  0.23625    0.26451  ]
Corresponding embedding from embeddings list of list of lists
 [ 0.48251    0.87746   -0.23455    0.0262     0.79691    0.43102
 -0.60902   -0.60764   -0.42812   -0.012523  -1.2894     0.52656
 -0.82763    0.30689    1.1972    -0.47674   -0.46885   -0.19524
 -0.28403    0.35237    0.45536    0.76853    0.0062157  0.55421
  1.0006    -1.3973    -1.6894     0.30003    0.60678   -0.46044
  2.596

In [34]:
# Show the tenth word in the seventh document
test_word = documents[6][9]    
print('Tenth word in seventh document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[6][9][:])

Tenth word in seventh document: but
Embedding for this word:
 [ 0.35934   -0.2657    -0.046477  -0.2496     0.54676    0.25924
 -0.64458    0.1736    -0.53056    0.13942    0.062324   0.18459
 -0.75495   -0.19569    0.70799    0.44759    0.27031   -0.32885
 -0.38891   -0.61606   -0.484      0.41703    0.34794   -0.19706
  0.40734   -2.1488    -0.24284    0.33809    0.43993   -0.21616
  3.7635     0.19002   -0.12503   -0.38228    0.12944   -0.18272
  0.076803   0.51579    0.0072516 -0.29192   -0.27523    0.40593
 -0.040394   0.28353   -0.024724   0.10563   -0.32879    0.10673
 -0.11503    0.074678 ]
Corresponding embedding from embeddings list of list of lists
 [ 0.35934   -0.2657    -0.046477  -0.2496     0.54676    0.25924
 -0.64458    0.1736    -0.53056    0.13942    0.062324   0.18459
 -0.75495   -0.19569    0.70799    0.44759    0.27031   -0.32885
 -0.38891   -0.61606   -0.484      0.41703    0.34794   -0.19706
  0.40734   -2.1488    -0.24284    0.33809    0.43993   -0.21616
  3.7635

In [35]:
# Show the last word in the last document
test_word = documents[999][39]    
print('Last word in last document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[999][39][:])        

Last word in last document: from
Embedding for this word:
 [ 0.41037   0.11342   0.051524 -0.53833  -0.12913   0.22247  -0.9494
 -0.18963  -0.36623  -0.067011  0.19356  -0.33044   0.11615  -0.58585
  0.36106   0.12555  -0.3581   -0.023201 -1.2319    0.23383   0.71256
  0.14824   0.50874  -0.12313  -0.20353  -1.82      0.22291   0.020291
 -0.081743 -0.27481   3.7343   -0.01874  -0.084522 -0.30364   0.27959
  0.043328 -0.24621   0.015373  0.49751   0.15108  -0.01619   0.40132
  0.23067  -0.10743  -0.36625  -0.051135  0.041474 -0.36064  -0.19616
 -0.81066 ]
Corresponding embedding from embeddings list of list of lists
 [ 0.41037   0.11342   0.051524 -0.53833  -0.12913   0.22247  -0.9494
 -0.18963  -0.36623  -0.067011  0.19356  -0.33044   0.11615  -0.58585
  0.36106   0.12555  -0.3581   -0.023201 -1.2319    0.23383   0.71256
  0.14824   0.50874  -0.12313  -0.20353  -1.82      0.22291   0.020291
 -0.081743 -0.27481   3.7343   -0.01874  -0.084522 -0.30364   0.27959
  0.043328 -0.24621   0.

In [36]:
# -----------------------------------------------------    
# Make embeddings a numpy array for use in an RNN 
# Create training and test sets with Scikit Learn
# -----------------------------------------------------

# Apply embeddings to numpy
embeddings_array = np.array(embeddings)

In [37]:
# Define the labels to be used: 500 negative (0) followed by 500 positive (1)
thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                      np.ones((500), dtype = np.int32)), axis = 0)

In [38]:
# Scikit Learn for random splitting of the data  
from sklearn.model_selection import train_test_split

In [39]:
# Set training and test data
# Random splitting of the data into training (80%) and test (20%)  
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, 
                     random_state = RANDOM_SEED)

print("Shape of Training data: ", X_train.shape)
print("Shape of Test data: ", X_test.shape)

print("\nShape of Training data: ", y_train.shape)
print("Shape of Test data: ", y_test.shape)

Shape of Training data:  (800, 40, 50)
Shape of Test data:  (200, 40, 50)

Shape of Training data:  (800,)
Shape of Test data:  (200,)
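
What a seeded 80/20 split does can be sketched without scikit-learn. This is an illustrative stand-in, not the algorithm `train_test_split` uses internally: the helper `seeded_split` and the seed value are hypothetical. It shows that a fixed seed makes the split reproducible and that the class balance of the test set is whatever the shuffle happens to produce.

```python
import random

def seeded_split(n_items, test_fraction=0.20, seed=1):
    """Shuffle indices 0..n_items-1 with a fixed seed and split them (hypothetical helper)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test_fraction))
    return indices[n_test:], indices[:n_test]

# Labels laid out as in the notebook: 500 negatives followed by 500 positives
labels = [0] * 500 + [1] * 500
train_idx, test_idx = seeded_split(len(labels))

print(len(train_idx), len(test_idx))
print('positives in test set:', sum(labels[i] for i in test_idx))
```

If an exact 50/50 class balance in both subsets matters, one option (not used in this notebook) is the `stratify` argument of `train_test_split`.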


##### Base Model: Simple RNN with BPTT
##### GloVe.6B.50d - Embedding Vocabulary Size 10,000

In [40]:
# --------------------------------------------------------------------------      
# We use a very simple Recurrent Neural Network for this assignment
# Géron, A. 2017. Hands-On Machine Learning with Scikit-Learn & TensorFlow: 
#    Concepts, Tools, and Techniques to Build Intelligent Systems. 
#    Sebastopol, Calif.: O'Reilly. [ISBN-13 978-1-491-96229-9] 
#    Chapter 14 Recurrent Neural Networks, pages 390-391
#    Source code available at https://github.com/ageron/handson-ml
#    Jupyter notebook file 14_recurrent_neural_networks.ipynb
#    See section on Training a Sequence Classifier, # In [34]:
#    which uses the MNIST case data...  we revise to accommodate
#    the movie review data in this assignment    
# --------------------------------------------------------------------------  

#Training base model by using RNN backpropagation through time (BPTT).
reset_graph() # Refresh graph to make output stable across runs

In [41]:
### Construction Phase ###


tf.compat.v1.disable_eager_execution()

n_steps = embeddings_array.shape[1]  # number of words per document 
n_inputs = embeddings_array.shape[2]  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001 # Learning rate = 0.001

#Set placeholder for X and Y
X = tf.compat.v1.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.compat.v1.placeholder(tf.int32, [None])


In [42]:
#Set basic cell and dynamic unrolling through time
basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

# Compute logits and cross entropy for cost function
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
# Set Cost function
loss = tf.reduce_mean(xentropy)

# Training with AdamOptimizer
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

# Evaluation       
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

# Initialize all variables
init = tf.global_variables_initializer()

# Saver
saver = tf.train.Saver()

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Use keras.layers.dense instead.
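
To make the construction phase concrete, the forward pass that `BasicRNNCell`, `dynamic_rnn`, and the dense output layer set up can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the TensorFlow implementation: the weights are random rather than trained, the names `X_toy`, `W_x`, `W_h`, and `W_out` are hypothetical, and tanh is assumed as the cell activation (the `BasicRNNCell` default). Backpropagation through time is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, n_steps, n_inputs = 4, 40, 50   # small toy batch; other sizes follow the notebook
n_neurons, n_outputs = 20, 2

X_toy = rng.normal(size=(batch, n_steps, n_inputs))

# Randomly initialised weights stand in for the trained parameters
W_x = rng.normal(scale=0.1, size=(n_inputs, n_neurons))
W_h = rng.normal(scale=0.1, size=(n_neurons, n_neurons))
b_h = np.zeros(n_neurons)
W_out = rng.normal(scale=0.1, size=(n_neurons, n_outputs))
b_out = np.zeros(n_outputs)

h = np.zeros((batch, n_neurons))       # initial state
for t in range(n_steps):               # unroll through the 40 word positions
    h = np.tanh(X_toy[:, t, :] @ W_x + h @ W_h + b_h)

logits_toy = h @ W_out + b_out         # dense layer applied to the final state

# Softmax over the two classes, then the predicted class per document
exp = np.exp(logits_toy - logits_toy.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
print(h.shape, logits_toy.shape, probs.argmax(axis=1))
```

Under these assumptions the recurrent layer has (50 + 20) × 20 + 20 = 1,420 parameters and the output layer has 20 × 2 + 2 = 42, which is far smaller than the LSTM and GRU models later in the notebook.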


In [43]:
### Execution Phase ###

# Set number of epochs and batch size for training model.
n_epochs = 50
batch_size = 100

# Record start time for neural network training
start_time_base = time.clock()

In [44]:

#Training model by RNN
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        print('\n  ---- Epoch ', epoch, ' ----\n')
        for iteration in range(y_train.shape[0] // batch_size):          
            X_batch = X_train[iteration*batch_size:(iteration + 1)*batch_size,:]
            y_batch = y_train[iteration*batch_size:(iteration + 1)*batch_size]
            print('  Batch ', iteration, ' training observations from ',  
                  iteration*batch_size, ' to ', (iteration + 1)*batch_size-1,)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train_base = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test_base = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print('\n  Train accuracy:', acc_train_base, 'Test accuracy:', acc_test_base)
        
        save_path = saver.save(sess, "./my_catdog_model")


  ---- Epoch  0  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.48 Test accuracy: 0.515

  ---- Epoch  1  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.5 Test accuracy: 0.505

  ---- Epoch  2  ---


  ---- Epoch  17  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.59 Test accuracy: 0.58

  ---- Epoch  18  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.59 Test accuracy: 0.59

  ---- Epoch  19  -


  ---- Epoch  34  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.78 Test accuracy: 0.67

  ---- Epoch  35  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.78 Test accuracy: 0.665

  ---- Epoch  36  
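
The batch indexing in the training loop can be isolated in a few lines. The sketch below is illustrative only; the generator `batch_slices` is a hypothetical helper that mirrors the `iteration*batch_size` arithmetic used above. It also shows what happens when the batch size does not divide the number of training reviews.

```python
def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs for consecutive full batches (hypothetical helper)."""
    for iteration in range(n_samples // batch_size):
        yield iteration * batch_size, (iteration + 1) * batch_size

slices_100 = list(batch_slices(800, 100))
print(len(slices_100), slices_100[0], slices_100[-1])

# With a batch size that does not divide 800, the trailing observations are skipped
slices_64 = list(batch_slices(800, 64))
print(len(slices_64), 800 - slices_64[-1][1], 'observations left over')
```

Note also that the per-epoch train accuracy printed in the log is evaluated on the last batch only (100 reviews), so it is noisier than the test accuracy, which uses all 200 test reviews.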

In [45]:
# Make prediction by using test data
with tf.Session() as sess:
    saver.restore(sess, "./my_catdog_model") # or better, use save_path
    X_new_scaled = X_test[:50]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred_base = np.argmax(Z, axis=1)
    accuracy_base = accuracy.eval(feed_dict={X: X_test, y: y_test})
    accuracy_base_y = accuracy.eval(feed_dict={X: X_train, y: y_train})

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./my_catdog_model


In [46]:
# Print prediction classes and actual classes.
print("-------- Base Model --------")
print("RNN backpropagation through time (BPTT)")
print("\nPredicted classes:", y_pred_base)
# Note: predictions cover the first 50 test reviews; only the first 25 labels are printed
print("Actual classes:", y_test[:25])
print("Test Set Accuracy:", accuracy_base)
print("Train Set Accuracy:", accuracy_base_y)
# Record end time for neural network training
stop_time_base = time.clock()

#Total processing time
runtime_base = stop_time_base - start_time_base 

print("\nStart time:", start_time_base)
print("Stop time:", stop_time_base)
print("processing time:", runtime_base)



-------- Base Model --------
RNN backpropagation through time (BPTT)

Predicted classes: [0 1 1 1 0 1 1 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 1 1 1 1 0
 0 0 0 1 0 1 0 0 0 1 0 1 0]
Actual classes: [0 1 1 1 0 1 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 0 0 1 0]
Test Set Accuracy: 0.68
Train Set Accuracy: 0.77625

Start time: 21.982894
Stop time: 40.0879267
processing time: 18.105032699999995
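
The printed predictions can be compared with the printed labels directly. The sketch below copies the first 25 predicted classes and the 25 actual classes from the output above (only 25 labels were printed, so the remaining 25 predictions cannot be checked here) and counts agreements. The variable names are illustrative.

```python
# First 25 predicted classes and the 25 actual classes, copied from the output above
predicted = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
actual    = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0]

n_correct = sum(p == a for p, a in zip(predicted, actual))
print('correct:', n_correct, 'of', len(actual))
print('accuracy on these 25 reviews:', n_correct / len(actual))

# 2x2 confusion counts: confusion[actual][predicted]
confusion = [[0, 0], [0, 0]]
for p, a in zip(predicted, actual):
    confusion[a][p] += 1
print('confusion matrix (rows = actual, columns = predicted):', confusion)
```

The accuracy on this small subset is close to the reported test set accuracy of 0.68, as would be expected for a sample of 25 out of 200 reviews.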


#### Model 1: RNN with LSTM cells and 3 Layers
#### GloVe.6B.50d - Embedding Vocabulary Size 10,000

In [47]:
from tensorflow import keras
from keras.layers import Dense, Dropout, Flatten, Activation, Input, Add
from keras.layers import LSTM, GRU
from keras.models import Sequential, Model
def generateModel1():
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, activation='elu')(model)
    outputModel=Dense(1)(model)
    model=Model(inputLayer,outputModel)
    return model
start_time_base_1 = time.clock()

Using TensorFlow backend.


In [48]:
model = generateModel1()
print(model.summary())

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 40, 32)            10624     
_________________________________________________________________
lstm_2 (LSTM)                (None, 40, 32)            8320      
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 27,297
Trainable params: 27,297
Non-trainable params: 0
_________________________________________________________________
None
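
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and one bias vector; the helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim, units):
    """4 gates, each with an input kernel, a recurrent kernel, and one bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

layer_counts = [
    lstm_params(50, 32),   # lstm_1: 50-dimensional embeddings in
    lstm_params(32, 32),   # lstm_2: 32 features from lstm_1
    lstm_params(32, 32),   # lstm_3
    dense_params(32, 1),   # dense_1
]
print(layer_counts, sum(layer_counts))
```

Only the first layer depends on the embedding dimension, so switching to larger word vectors would change the first count and leave the others as they are.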


In [49]:
# The optimizer is selected by name in compile(), so no optimizer import is needed
model.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1
count = 0
hist = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64, verbose=1)




Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [50]:
accuracy_1 = model.evaluate(X_test, y_test, verbose = 0)
accuracy_base_y_1 = model.evaluate(X_train, y_train,verbose = 0)
stop_time_base_1 = time.clock()
time_1 = stop_time_base_1 - start_time_base_1
# evaluate() returns [loss, accuracy]; index 1 is the accuracy
print("Test Set Accuracy:", accuracy_1[1])
print("Train Set Accuracy:", accuracy_base_y_1[1])
print("processing time", time_1)

Test Set Accuracy: 0.6850000023841858
Train Set Accuracy: 0.9975000023841858
processing time 31.783718400000005


#### Model 2: RNN with LSTM cells and 3 Layers
#### GloVe.6B.50d - Embedding Vocabulary Size 30,000

In [51]:
EVOCABSIZE_1 = 30000

In [52]:
def default_factory():
    return EVOCABSIZE_1  # last/unknown-word row in limited_index_to_embedding
# dictionary has the items() function, returns list of (key, value) tuples
limited_word_to_index_1 = defaultdict(default_factory, \
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE_1})

In [53]:
limited_index_to_embedding_1 = index_to_embedding[0:EVOCABSIZE_1,:]
print(len(limited_index_to_embedding_1))
# Set the unknown-word row to be all zeros as previously
limited_index_to_embedding_1 = np.append(limited_index_to_embedding_1, 
    index_to_embedding[index_to_embedding.shape[0] - 1, :].\
        reshape(1,embedding_dim), 
    axis = 0)


30000
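
The mechanism used here, a `defaultdict` whose default value is the index of an extra final row, can be seen on a tiny vocabulary. The sketch below is illustrative only: the four-word vocabulary, the 2-dimensional vectors, and the `toy_*` names are made up, and a zero vector is assumed for the unknown-word row.

```python
from collections import defaultdict

# Toy vocabulary ordered by frequency rank (hypothetical data)
toy_word_to_index = {'the': 0, 'movie': 1, 'great': 2, 'soporific': 3}
toy_index_to_embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

TOY_VOCAB_SIZE = 3  # keep only the 3 most frequent words

# Any word outside the limited vocabulary maps to index TOY_VOCAB_SIZE
limited = defaultdict(lambda: TOY_VOCAB_SIZE,
                      {w: i for w, i in toy_word_to_index.items()
                       if i < TOY_VOCAB_SIZE})

# Rows 0..2 are real embeddings; row 3 is the shared unknown-word vector
limited_embedding = toy_index_to_embedding[:TOY_VOCAB_SIZE] + [[0.0, 0.0]]

for word in ['great', 'soporific', 'never-seen']:
    print(word, limited_embedding[limited[word]])
```

One side effect worth knowing: looking up a missing key in a `defaultdict` inserts it, so the dictionary grows as unknown words are encountered.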


In [54]:
# Verify the new vocabulary: should get same embeddings for test sentence
# Note that a small EVOCABSIZE may yield some zero vectors for embeddings
print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE_1, 'words:\n')
for word in words_in_test_sentence:
    word_ = word.lower()
    embedding_1 = limited_index_to_embedding_1[limited_word_to_index_1[word_]]
    print(word_ + ": ", embedding_1)


Test sentence embeddings from vocabulary of 30000 words:

the:  [ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
quick:  [ 0.13967   -0.53798   -0.18047   -0.25142    0.16203   -0.13868
 -0.24637    0.75111    0.27264    0.61035   -0.82548    0.038647
 -0.32361    0.30373   -0.14598   -0.23551    0.39267   -1.1287
 -0.23636   -1.0629     0.046277   0.29143   -0.25819   -0.094902
  0.79478   -1.2095    -0.01039   -0.092086   0.84322   

In [55]:
embeddings_1 = []    
for doc in documents:
    embedding_1 = []
    for word in doc:
        embedding_1.append(limited_index_to_embedding_1[limited_word_to_index_1[word]]) 
    embeddings_1.append(embedding_1)

In [56]:
embeddings_array_1 = np.array(embeddings_1)

In [57]:
# Set training and test data
# Random splitting of the data into training (80%) and test (20%)  
X1_train, X1_test, y1_train, y1_test = \
    train_test_split(embeddings_array_1, thumbs_down_up, test_size=0.20, 
                     random_state = RANDOM_SEED)

print("Shape of Training data: ", X1_train.shape)
print("Shape of Test data: ", X1_test.shape)

print("\nShape of Training data: ", y1_train.shape)
print("Shape of Test data: ", y1_test.shape)

Shape of Training data:  (800, 40, 50)
Shape of Test data:  (200, 40, 50)

Shape of Training data:  (800,)
Shape of Test data:  (200,)


In [58]:
start_time_base_2 = time.clock()
def generateModel2():
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, activation='elu')(model)
    outputModel=Dense(1)(model)
    model=Model(inputLayer,outputModel)
    return model

In [59]:
model_2 = generateModel2()
print(model_2.summary())

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 40, 32)            10624     
_________________________________________________________________
lstm_5 (LSTM)                (None, 40, 32)            8320      
_________________________________________________________________
lstm_6 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 27,297
Trainable params: 27,297
Non-trainable params: 0
_________________________________________________________________
None


In [60]:

model_2.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_1 = model_2.fit(X1_train, y1_train, validation_data=(X1_test, y1_test), epochs=50, batch_size=64, verbose=1)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [61]:
accuracy_2 = model_2.evaluate(X1_test, y1_test, verbose = 0)
accuracy_base_y_2 = model_2.evaluate(X1_train, y1_train, verbose = 0)
stop_time_base_2 = time.clock()
time_2 = stop_time_base_2 - start_time_base_2
# evaluate() returns [loss, accuracy]; index 1 is the accuracy
print("Test Set Accuracy:", accuracy_2[1])
print("Train Set Accuracy:", accuracy_base_y_2[1])
print("processing time", time_2)

Test Set Accuracy: 0.7250000238418579
Train Set Accuracy: 1.0
processing time 29.868344699999994


#### Model 3: RNN with LSTM cells and 5 Layers
#### GloVe.6B.50d - Embedding Vocabulary Size 10,000

In [62]:
start_time_base_3 = time.clock()
def generateModel3():
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, activation='elu')(model)
    outputModel=Dense(1)(model)
    model=Model(inputLayer,outputModel)
    return model

In [63]:
model_3 = generateModel3()
print(model_3.summary())

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 40, 32)            10624     
_________________________________________________________________
lstm_8 (LSTM)                (None, 40, 32)            8320      
_________________________________________________________________
lstm_9 (LSTM)                (None, 40, 32)            8320      
_________________________________________________________________
lstm_10 (LSTM)               (None, 40, 32)            8320      
_________________________________________________________________
lstm_11 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33  

In [64]:
model_3.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_3 = model_3.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64, verbose=1)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [65]:
accuracy_3 = model_3.evaluate(X_test, y_test, verbose = 0)
accuracy_base_y_3 = model_3.evaluate(X_train, y_train, verbose = 0)
stop_time_base_3 = time.clock()
time_3 = stop_time_base_3 - start_time_base_3
print("Test Set Accuracy:", accuracy_3[1])
print("Train Set Accuracy:", accuracy_base_y_3[1])
print("processing time", time_3)

Test Set Accuracy: 0.7300000190734863
Train Set Accuracy: 0.9737499952316284
processing time 53.52331629999999


#### Model 4: RNN with LSTM cells and 5 Layers
#### GloVe.6B.50d - Embedding Vocabulary Size 30,000

In [66]:
start_time_base_4 = time.clock()
def generateModel4():
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    model=LSTM(units=unitNum, activation='elu')(model)
    outputModel=Dense(1)(model)
    model=Model(inputLayer,outputModel)
    return model

In [67]:
model_4 = generateModel4()
print(model_4.summary())

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 40, 32)            10624     
_________________________________________________________________
lstm_13 (LSTM)               (None, 40, 32)            8320      
_________________________________________________________________
lstm_14 (LSTM)               (None, 40, 32)            8320      
_________________________________________________________________
lstm_15 (LSTM)               (None, 40, 32)            8320      
_________________________________________________________________
lstm_16 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33  

In [68]:
model_4.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_4 = model_4.fit(X1_train, y1_train, validation_data=(X1_test, y1_test), epochs=50, batch_size=64, verbose=1)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [69]:
accuracy_4 = model_4.evaluate(X1_test, y1_test, verbose = 0)
accuracy_base_y_4 = model_4.evaluate(X1_train, y1_train, verbose = 0)
stop_time_base_4 = time.clock()
time_4 = stop_time_base_4 - start_time_base_4
# evaluate() returns [loss, accuracy]; index 1 is the accuracy
print("Test Set Accuracy:", accuracy_4[1])
print("Train Set Accuracy:", accuracy_base_y_4[1])
print("processing time", time_4)

Test Set Accuracy: 0.7099999785423279
Train Set Accuracy: 0.9987499713897705
processing time 57.5640368


#### Model 5: RNN with GRU cells and 3 Layers
#### GloVe.6B.50d - Embedding Vocabulary Size 10,000

In [70]:
from keras.layers import GRU
start_time_base_5 = time.clock()
def generateModel5():
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    model=GRU(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    model=GRU(units=unitNum, return_sequences=True, activation='elu')(model)
    model=GRU(units=unitNum, activation='elu')(model)
    outputModel=Dense(1)(model)
    model=Model(inputLayer,outputModel)
    return model

In [71]:
model_5 = generateModel5()
print(model_5.summary())

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 40, 32)            7968      
_________________________________________________________________
gru_2 (GRU)                  (None, 40, 32)            6240      
_________________________________________________________________
gru_3 (GRU)                  (None, 32)                6240      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 33        
Total params: 20,481
Trainable params: 20,481
Non-trainable params: 0
_________________________________________________________________
None
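
The GRU parameter counts in the summary can be checked the same way as for any gated layer. The sketch below assumes three gates, each with an input kernel, a recurrent kernel, and a single bias vector, which reproduces the figures shown; the helper `gru_params` and its `biases_per_gate` argument are hypothetical.

```python
def gru_params(input_dim, units, biases_per_gate=1):
    """3 gates, each with an input kernel, a recurrent kernel, and bias vector(s)."""
    return 3 * (units * (input_dim + units) + biases_per_gate * units)

gru_counts = [gru_params(50, 32), gru_params(32, 32), gru_params(32, 32)]
total = sum(gru_counts) + (32 * 1 + 1)   # plus the Dense(1) layer
print(gru_counts, total)

# Assumption: a GRU variant with two bias vectors per gate gives a larger count
print(gru_params(50, 32, biases_per_gate=2))
```

With three gates instead of four, each GRU layer has three quarters of the parameters of the matching LSTM layer. Some Keras versions use a GRU variant with a second bias vector per gate (the `reset_after` option), so the same architecture may report slightly larger counts in a different environment.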


In [72]:
model_5.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_5 = model_5.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64, verbose=1)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [73]:
accuracy_5 = model_5.evaluate(X_test, y_test, verbose = 0)
accuracy_base_y_5 = model_5.evaluate(X_train, y_train)
stop_time_base_5 = time.clock()
time_5 = stop_time_base_5 - start_time_base_5

# accuracy_5 holds the test-set metrics; accuracy_base_y_5 holds the training-set metrics
print("Test Set Accuracy:", accuracy_5[1] )
print("Train Set Accuracy:", accuracy_base_y_5[1] )
print("processing time", time_5)

Test Set Accuracy: 0.7099999785423279
Train Set Accuracy: 0.9912499785423279
processing time 41.97401099999999
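
`time.clock()` was removed in Python 3.8, so the timing cells in this notebook fail on current interpreters. A minimal sketch of the same start/stop measurement using `time.perf_counter()` follows. The `timed` helper is a hypothetical name, and the `sum(...)` call merely stands in for the model build, fit, and evaluate steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, key):
    """Store the wall-clock duration of the enclosed block in results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start


timings = {}
with timed(timings, "model_5"):
    # Placeholder workload; in the notebook this would be the
    # generateModel5() / compile / fit / evaluate sequence.
    sum(i * i for i in range(100_000))

print("processing time", timings["model_5"])
```

On Unix, `time.clock()` reported processor time rather than wall-clock time, so `time.process_time()` is the closer match if CPU time is what should be compared.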


#### Model 6: RNN with GRU cells and 3 Layers
#### GloVe.6B.100d - Embedding Vocabulary Size 30,000

In [74]:
start_time_base_6 = time.clock()
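def generateModel6():
    # Input: 40 time steps (words) per review, each a 50-dimensional word vector
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    # The first two GRU layers return the full sequence so the next GRU layer can consume it
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(x)
    # The last GRU layer returns only its final state
    x=GRU(units=unitNum, activation='elu')(x)
    # Single linear output unit, trained against the 0/1 sentiment labels
    outputModel=Dense(1)(x)
    model=Model(inputLayer,outputModel)
    return model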

In [75]:
model_6 = generateModel6()
print(model_6.summary())

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
gru_4 (GRU)                  (None, 40, 32)            7968      
_________________________________________________________________
gru_5 (GRU)                  (None, 40, 32)            6240      
_________________________________________________________________
gru_6 (GRU)                  (None, 32)                6240      
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 33        
Total params: 20,481
Trainable params: 20,481
Non-trainable params: 0
_________________________________________________________________
None
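
The parameter counts in the summary above can be checked by hand. The sketch below assumes the classic GRU formulation in which each of the three gate blocks has an input kernel, a recurrent kernel, and a single bias vector (the `reset_after=False` layout in Keras; with `reset_after=True` there is a second bias vector per gate and the counts would be larger). The helper names are hypothetical.

```python
def gru_param_count(input_dim, units):
    """Parameters in a GRU layer with one bias vector per gate (reset_after=False)."""
    # 3 gate blocks (update, reset, candidate), each with:
    #   input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return 3 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    """Parameters in a fully connected layer: weights plus one bias per unit."""
    return input_dim * units + units


# Layer sizes taken from the summary: 50-dimensional inputs, 32 units per GRU layer.
first_gru = gru_param_count(50, 32)    # gru_4
stacked_gru = gru_param_count(32, 32)  # gru_5 and gru_6
dense = dense_param_count(32, 1)       # dense_6
total = first_gru + 2 * stacked_gru + dense

print(first_gru, stacked_gru, dense, total)  # 7968 6240 33 20481
```

The first GRU layer is larger than the others only because its input is 50-dimensional, while the stacked layers receive the 32-dimensional output of the layer before them.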


In [76]:
model_6.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_6 = model_6.fit(X1_train, y1_train, validation_data=(X1_test, y1_test), epochs=50, batch_size=64, verbose=1)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [77]:
accuracy_6 = model_6.evaluate(X1_test, y1_test, verbose = 0)
accuracy_base_y_6 = model_6.evaluate(X1_train, y1_train)
stop_time_base_6 = time.clock()
time_6 = stop_time_base_6 - start_time_base_6

# accuracy_6 holds the test-set metrics; accuracy_base_y_6 holds the training-set metrics
print("Test Set Accuracy:", accuracy_6[1] )
print("Train Set Accuracy:", accuracy_base_y_6[1] )
print("processing time", time_6)

Test Set Accuracy: 0.6850000023841858
Train Set Accuracy: 0.9975000023841858
processing time 49.41685900000002


#### Model 7: RNN with GRU cells and 5 Layers
#### GloVe.6B.100d - Embedding Vocabulary Size 10,000

In [78]:
start_time_base_7 = time.clock()
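def generateModel7():
    # Input: 40 time steps (words) per review, each a 50-dimensional word vector
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    # The first four GRU layers return the full sequence so the next GRU layer can consume it
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(x)
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(x)
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(x)
    # The last GRU layer returns only its final state
    x=GRU(units=unitNum, activation='elu')(x)
    # Single linear output unit, trained against the 0/1 sentiment labels
    outputModel=Dense(1)(x)
    model=Model(inputLayer,outputModel)
    return model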

In [79]:
model_7 = generateModel7()
print(model_7.summary())

Model: "model_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
gru_7 (GRU)                  (None, 40, 32)            7968      
_________________________________________________________________
gru_8 (GRU)                  (None, 40, 32)            6240      
_________________________________________________________________
gru_9 (GRU)                  (None, 40, 32)            6240      
_________________________________________________________________
gru_10 (GRU)                 (None, 40, 32)            6240      
_________________________________________________________________
gru_11 (GRU)                 (None, 32)                6240      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 33  

In [80]:
model_7.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_7 = model_7.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64, verbose=1)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [81]:
accuracy_7 = model_7.evaluate(X_test, y_test, verbose = 0)
accuracy_base_y_7 = model_7.evaluate(X_train, y_train)
stop_time_base_7 = time.clock()
time_7 = stop_time_base_7 - start_time_base_7

# accuracy_7 holds the test-set metrics; accuracy_base_y_7 holds the training-set metrics
print("Test Set Accuracy:", accuracy_7[1] )
print("Train Set Accuracy:", accuracy_base_y_7[1])
print("processing time", time_7)

Test Set Accuracy: 0.6850000023841858
Train Set Accuracy: 0.9987499713897705
processing time 75.27545650000002


#### Model 8: RNN with GRU cells and 5 Layers
#### GloVe.6B.100d - Embedding Vocabulary Size 30,000

In [82]:
start_time_base_8 = time.clock()
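def generateModel8():
    # Input: 40 time steps (words) per review, each a 50-dimensional word vector
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    # The first four GRU layers return the full sequence so the next GRU layer can consume it
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(x)
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(x)
    x=GRU(units=unitNum, return_sequences=True, activation='elu')(x)
    # The last GRU layer returns only its final state
    x=GRU(units=unitNum, activation='elu')(x)
    # Single linear output unit, trained against the 0/1 sentiment labels
    outputModel=Dense(1)(x)
    model=Model(inputLayer,outputModel)
    return model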

In [83]:
model_8 = generateModel8()
print(model_8.summary())

Model: "model_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         (None, 40, 50)            0         
_________________________________________________________________
gru_12 (GRU)                 (None, 40, 32)            7968      
_________________________________________________________________
gru_13 (GRU)                 (None, 40, 32)            6240      
_________________________________________________________________
gru_14 (GRU)                 (None, 40, 32)            6240      
_________________________________________________________________
gru_15 (GRU)                 (None, 40, 32)            6240      
_________________________________________________________________
gru_16 (GRU)                 (None, 32)                6240      
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 33  

In [84]:
model_8.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_8 = model_8.fit(X1_train, y1_train, validation_data=(X1_test, y1_test), epochs=50, batch_size=64, verbose=1)

Train on 800 samples, validate on 200 samples
Epoch 1/50
...
Epoch 50/50


In [85]:
accuracy_8 = model_8.evaluate(X1_test, y1_test, verbose = 0)
accuracy_base_y_8 = model_8.evaluate(X1_train, y1_train)
stop_time_base_8 = time.clock()
time_8 = stop_time_base_8 - start_time_base_8

# accuracy_8 holds the test-set metrics; accuracy_base_y_8 holds the training-set metrics
print("Test Set Accuracy:", accuracy_8[1] )
print("Train Set Accuracy:", accuracy_base_y_8[1])
print("processing time", time_8)

Test Set Accuracy: 0.7450000047683716
Train Set Accuracy: 0.9987499713897705
processing time 73.20194680000003


In [87]:
#### Summary Table
from tabulate import tabulate

col_labels = ['# - Model', 'Number of Layers', 'Embedding Vocabulary Size','Processing Time', 'Test Set Accuracy', 'Training Set Accuracy']

table_vals = [['1 - Simple RNN with BPTT','3','10000',runtime_base, accuracy_base,accuracy_base_y],
              ['2 - RNN with LSTM cells','3','10000',time_1,accuracy_1[1],accuracy_base_y_1[1]],
              ['3 - RNN with LSTM cells','3','30000',time_2,accuracy_2[1],accuracy_base_y_2[1]],
              ['4 - RNN with LSTM cells','5','10000',time_3,accuracy_3[1],accuracy_base_y_3[1]],
              ['5 - RNN with LSTM cells','5','30000',time_4,accuracy_4[1],accuracy_base_y_4[1]],
              ['6 - RNN with GRU','3','10000',time_5,accuracy_5[1],accuracy_base_y_5[1]],
              ['7 - RNN with GRU','3','30000',time_6,accuracy_6[1],accuracy_base_y_6[1]],
              ['8 - RNN with GRU','5','10000',time_7,accuracy_7[1],accuracy_base_y_7[1]],
              ['9 - RNN with GRU','5','30000',time_8,accuracy_8[1],accuracy_base_y_8[1]],]

print('------------------------------- Summary Table -------------------------------')

table = tabulate(table_vals, headers=col_labels, tablefmt="simple",numalign="left")
print(table)

------------------------------- Summary Table -------------------------------
# - Model                 Number of Layers    Embedding Vocabulary Size    Processing Time    Test Set Accuracy    Training Set Accuracy
------------------------  ------------------  ---------------------------  -----------------  -------------------  -----------------------
1 - Simple RNN with BPTT  3                   10000                        18.105             0.68                 0.77625
2 - RNN with LSTM cells   3                   10000                        31.7837            0.685                0.9975
3 - RNN with LSTM cells   3                   30000                        29.8683            0.725                1
4 - RNN with LSTM cells   5                   10000                        53.5233            0.73                 0.97375
5 - RNN with LSTM cells   5                   30000                        57.564             0.71                 0.99875
6 - RNN with GRU          3           

Management Problem
- Suppose management is thinking about using a language model to classify written customer reviews and call and complaint logs. If the most critical customer messages can be identified, then customer support personnel can be assigned to contact those customers.
- How would you advise senior management? What kinds of systems and methods would be most relevant to the customer services function? Considering the results of this assignment in particular, what is needed to make an automated customer support system that is capable of identifying negative customer feelings? What can data scientists do to make language models more useful in a customer service function?


#### REPORT/FINDINGS: 
(1) A summary and problem definition for management; 

    This week's assignment focuses on building RNN models that classify movie reviews by sentiment, with the aim of applying the same approach to written customer reviews. The goal is to identify the most critical customer messages so that customer support personnel can contact those customers.

(2) Discussion of the research design, measurement and statistical methods, traditional and machine learning methods employed 
    
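    I created nine models: a simple RNN trained with BPTT (model 1), four LSTM models (models 2-5), and four GRU models (models 6-9). With the same number of layers and embedding vocabulary size, the GRU models took longer to run than the LSTM models (for example, 42.0 versus 31.8 for 3 layers and a 10,000-word vocabulary). Accuracy did not consistently favor either cell type: the GRU had higher test accuracy with 3 layers and 10,000 words (0.71 versus 0.685) and with 5 layers and 30,000 words (0.745 versus 0.71), while the LSTM had higher test accuracy in the other two configurations. Within the LSTM models, the 5-layer models took longer to run than the 3-layer models, while vocabulary size had no consistent effect on processing time. With a 30,000-word vocabulary, the 5-layer LSTM was less accurate than the 3-layer LSTM on both the training and test sets. Within the GRU models, with a 10,000-word vocabulary, the 5-layer model had lower test accuracy than the 3-layer model (0.685 versus 0.71), although its training accuracy was slightly higher. Every LSTM and GRU model reached a training accuracy above 0.97 while test accuracy stayed between 0.685 and 0.745, which suggests that the models overfit the 800 training samples.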
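
The gap between training and test accuracy can be made explicit. The sketch below takes the LSTM values from the summary table and the GRU values from the `evaluate` results on the test and training arrays in the cells above, rounded to five decimals. The `results` list and the `gaps` name are illustrative and not part of the original notebook.

```python
# (model, layers, vocabulary size, test accuracy, training accuracy)
results = [
    ("LSTM", 3, 10000, 0.685, 0.9975),
    ("LSTM", 3, 30000, 0.725, 1.0),
    ("LSTM", 5, 10000, 0.73, 0.97375),
    ("LSTM", 5, 30000, 0.71, 0.99875),
    ("GRU", 3, 10000, 0.71, 0.99125),
    ("GRU", 3, 30000, 0.685, 0.9975),
    ("GRU", 5, 10000, 0.685, 0.99875),
    ("GRU", 5, 30000, 0.745, 0.99875),
]

# Train-minus-test gap per configuration; a large gap suggests overfitting.
gaps = [(name, layers, vocab, round(train - test, 4))
        for name, layers, vocab, test, train in results]

for row in gaps:
    print(row)

# Configuration with the highest test accuracy.
best = max(results, key=lambda r: r[3])
print("best test accuracy:", best)
```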
    
(3) Overview of programming work; 

    First, I installed the chakin package to download the pretrained embeddings and set the embedding size. The most important step was converting the review text to NumPy arrays, labeling negative reviews as 0 and positive reviews as 1. I then split the data into 80% training and 20% test sets and built LSTM and GRU models with different numbers of layers and embedding vocabulary sizes.
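
The labeling and 80/20 split described above can be sketched with the standard library alone. The `label_and_split` function, the toy review lists (500 per class is an assumption chosen to match the 800/200 sample counts in the training logs), and the fixed seed are all illustrative; the notebook's own split code is not shown in this section.

```python
import random


def label_and_split(negative_docs, positive_docs, test_fraction=0.2, seed=0):
    """Label negative reviews 0 and positive reviews 1, shuffle, then split."""
    labeled = [(doc, 0) for doc in negative_docs] + [(doc, 1) for doc in positive_docs]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labeled)
    n_test = int(len(labeled) * test_fraction)
    return labeled[n_test:], labeled[:n_test]  # (train, test)


# Toy stand-ins for the negative and positive reviews.
negative = [f"negative review {i}" for i in range(500)]
positive = [f"positive review {i}" for i in range(500)]

train, test = label_and_split(negative, positive)
print(len(train), len(test))  # 800 200
```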
    
(4) Review of results with recommendations for management.

    After comparing all the models I created, I think the GRU with 5 layers and a 30,000-word embedding vocabulary performs best, with the highest test accuracy (0.745). I think the system can ask customers to rate the movies and write reviews. For example, if a customer leaves a 1-star review, the system can identify that customer as dissatisfied.
    The system also needs to create target customer personas, understand and build emotional connections, anticipate customer needs and be proactive, and then collect feedback and perform analytics.
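
One way to turn model output into a contact list is to threshold the predicted scores. The sketch below assumes that scores near 0 mean negative and scores near 1 mean positive, matching the 0/1 labels used in training. The `flag_negative` function, the 0.5 cutoff, and the sample scores are hypothetical.

```python
def flag_negative(message_ids, scores, cutoff=0.5):
    """Return (id, score) pairs predicted negative, most negative first."""
    flagged = [(mid, s) for mid, s in zip(message_ids, scores) if s < cutoff]
    return sorted(flagged, key=lambda pair: pair[1])


# Made-up scores standing in for model.predict(...) output.
ids = ["msg-1", "msg-2", "msg-3", "msg-4"]
scores = [0.91, 0.12, 0.47, 0.03]

for message_id, score in flag_negative(ids, scores):
    print(message_id, score)
```

Sorting by score puts the messages the model considers most negative at the top, so support staff can work through the list in priority order.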
    
    resource: https://www.proprofs.com/c/customer-support/emotions-impact-customer-engagement/
