# Assignment 8: Language Modeling w/ an RNN - Oscar Hernandez 

## Purpose

Our management team is considering using a language model to classify written customer reviews, call and complaint logs. If certain customer messages are very critical, management would want a customer support personnel to contact the customer who sent the message. This report will outline the work that went into building several classification models while using a 2x2 completely crossed experimental design. Specifically, we will be training/testing recurrent neural networks (RNNs) using TensorFlow. Our goal is build multiple RNNs that can classify the sentiment of movie reviews which will allow us to provide the management team some takeaways and answer their questions regarding the usefulness of language models. We will be utilizing pretrained word vectors and data from movie reviews (negative/positive) to develop our RNNs. We will be looking to see the impact that using word vectors with different dimensions and vocabulary sizes has on the accuracy of the RNNs.   

#### This report will be broken up into Sections that cover the specific methodology that went into arriving at the final recommendation. The report will include and/or cover, at minimum, the following specific items and/or tasks:
* Loading the pretrained word vectors from our working directory and their embeddings 
* Define the vocabulary size for the model 
* Define text preprocessing functions (e.g. add stopwords) 
* Gather negative/positive movie review text and convert into NumPy array used for embedding (list of 1000 lists with 40 words in each list)
* Create y labels and do train/test data split
* Create 4 RNNs with the exact same hyperparameters (only difference is the pretrained word vector and vocabulary size of each vector that was used as part of the model development process)
* Make a final recommendation to management about what systems are relevant to the described customer service function, what is needed to create such a system and what data scientists can do to make language models useful for the described customer service function. 


## Final Report Explanation: Please note that to carry out the experiment, we will make changes to the "filename" variable and "EVOCABSIZE" to train 4 different RNNs but the output from the doc will only show the final variable changes made and final RNN accuracy scores. We will capture the outputs from each iteration in a table output at the bottom of the document. 

### Section 1 -  Vector Loading & Vocabulary Size Definition 

#### Section 1 covers loading the data sets that will be used to train our CNN along with defining the size of the vocabulary. We will utilize two pretrained word vectors that have different dimensions (50 and 100). Each vector initially has a vocabulary size of 400K but will decrease it to two different sizes (10000 and 15000). These two changes compose the 2x2 cross design experiment. 

In [9]:
# CODE SOURCE: Program by Thomas W. Miller, August 16, 2018
# previous work used also cited as well where necessary 

# Previous work involved gathering embeddings via chakin
# Following methods described in
#    https://github.com/chakki-works/chakin
# The previous program, run-chakin-to-get-embeddings-v001.py
# downloaded pre-trained GloVe embeddings, saved them in a zip archive,
# and unzipped that archive to create the two word-to-embeddings

#Load all the necessary packages for this exercise 
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import os  
import os.path 
import re  
from collections import defaultdict
import nltk
from nltk.tokenize import TreebankWordTokenizer
import nbconvert

In [23]:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [96]:
RANDOM_SEED = 9999

# To make output stable across runs
def reset_graph(seed= RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

REMOVE_STOPWORDS = False  # no stopword removal 

In [25]:
import tensorflow as tf

In [97]:
# Select the pre-defined embeddings source        
# Define vocabulary size for the language model    
# Create a word_to_embedding_dict for GloVe.6B.50d
embeddings_directory = 'embeddings/gloVe.6B'
filename = 'glove.6B.100d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)

In [108]:
#Utility function for loading embeddings follows methods described in
# https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# Creates the Python defaultdict dictionary word_to_embedding_dict
# for the requested pre-trained word embeddings 

def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    """
    Read a embeddings txt file. If `with_indexes=True`, 
    we return a tuple of two dictionnaries
    `(word_to_index_dict, index_to_embedding_array)`, 
    otherwise we return only a direct 
    `word_to_embedding_dict` dictionary mapping 
    from a string to a numpy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[51:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND)
        return word_to_embedding_dict

print('\nLoading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
print("Embedding loaded from disks.")




Loading embeddings from embeddings/gloVe.6B\glove.6B.50d.txt
Embedding loaded from disks.


In [99]:
#This variable will specify the size of pre-defined embedding vocabulary
EVOCABSIZE = 15000

In [100]:
# Define vocabulary size for the language model    
# To reduce the size of the vocabulary to the n most frequently used words

def default_factory():
    return EVOCABSIZE  # last/unknown-word row in limited_index_to_embedding
# dictionary has the items() function, returns list of (key, value) tuples
limited_word_to_index = defaultdict(default_factory, \
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

# Select the first EVOCABSIZE rows to the index_to_embedding
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
# Set the unknown-word row to be all zeros as previously
limited_index_to_embedding = np.append(limited_index_to_embedding, 
    index_to_embedding[index_to_embedding.shape[0] - 1, :].\
        reshape(1,embedding_dim), 
    axis = 0)

# Delete large numpy array to clear some CPU RAM
del index_to_embedding

# Verify the new vocabulary: should get same embeddings for test sentence
# Note that a small EVOCABSIZE may yield some zero vectors for embeddings
print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE, 'words:\n')


Test sentence embeddings from vocabulary of 15000 words:



### Section 2 - Text Preprocessing and Movie Sentiment Embedding

#### Section 2 covers the many ways that text preprocessing can be completed. Preprocessing includes adding/dropping stopwords for example. Also, this section covers how the positive and negative movie reviews were created along with their final conversion to a NumPy array for embedding. 

In [101]:
# code for working with movie reviews data 
# Source: Miller, T. W. (2016). Web and Network Data Science.
#    Upper Saddle River, N.J.: Pearson Education.
#    ISBN-13: 978-0-13-388644-3
# This original study used a simple bag-of-words approach
# to sentiment analysis, along with pre-defined lists of
# negative and positive words.        
# Code available at:  https://github.com/mtpa/wnds 
# Utility function to get file names within a directory
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)

# define list of codes to be dropped from document
# carriage-returns, line-feeds, tabs
codelist = ['\r', '\n', '\t']   

# We will not remove stopwords in this exercise because they are
# important to keeping sentences intact
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))

# previous analysis of a list of top terms showed a number of words, along 
# with contractions and other word strings to drop from further analysis, add
# these to the usual English stopwords to be dropped from a document collection
    more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br'\
        've', 're', 'vs'] 

    some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

    # start with the initial list and add to it for movie text work 
    stoplist = nltk.corpus.stopwords.words('english') + more_stop_words +\
        some_proper_nouns_to_remove

# text parsing function for creating text documents 
def text_parse(string):
    # replace non-alphanumeric with space 
    temp_string = re.sub('[^a-zA-Z]', '  ', string)    
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)      
    # replace single-character words with space
    temp_string = re.sub('\s.\s', ' ', temp_string)   
    # convert uppercase to lowercase
    temp_string = temp_string.lower()    
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for i in range(len(stoplist)):
            stopstring = ' ' + str(stoplist[i]) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)        
    # replace multiple blank characters with one blank character
    temp_string = re.sub('\s+', ' ', temp_string)    
    return(temp_string)

In [102]:
# gather data for 500 negative movie reviews
dir_name = 'movie-reviews-negative'
    
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))

# Read data for negative movie reviews
# Data will be stored in a list of lists where the each list represents 
# a document and document is a list of words.
# We then break the text into words.

def read_data(filename):

  with open(filename, encoding='utf-8') as f:
    data = tf.compat.as_str(f.read())
    data = data.lower()
    data = text_parse(data)
    data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

  return data

negative_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    negative_documents.append(words)
    # print('Data size (Characters) (Document %d) %d' %(i,len(words)))
    # print('Sample string (Document %d) %s'%(i,words[:50]))



Directory: movie-reviews-negative
500 files found

Processing document files under movie-reviews-negative


In [103]:
# gather data for 500 positive movie reviews

dir_name = 'movie-reviews-positive'  
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))

# Read data for positive movie reviews
# Data will be stored in a list of lists where the each list 
# represents a document and document is a list of words.
# We then break the text into words.

def read_data(filename):

  with open(filename, encoding='utf-8') as f:
    data = tf.compat.as_str(f.read())
    data = data.lower()
    data = text_parse(data)
    data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

  return data

positive_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    positive_documents.append(words)
    # print('Data size (Characters) (Document %d) %d' %(i,len(words)))
    # print('Sample string (Document %d) %s'%(i,words[:50]))


Directory: movie-reviews-positive
500 files found

Processing document files under movie-reviews-positive


In [104]:
# convert positive/negative documents into numpy array
# note that reviews vary from 22 to 1052 words   
# so we use the first 20 and last 20 words of each review 
# as our word sequences for analysis
max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length) 

# construct list of 1000 lists with 40 words in each list
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    

# create list of lists of lists for embeddings
embeddings = []    
for doc in documents:
    embedding = []
    for word in doc:
       embedding.append(limited_index_to_embedding[limited_word_to_index[word]]) 
    embeddings.append(embedding)

max_review_length: 1052
min_review_length: 22


### Section 3 - Train/Test Split and Model Creation/Validation

#### Section 3 covers how the y labels were created along with splitting the data into training/test sets. Also, this section includes training/validation each RNN using the same hyperparameters.  

In [105]:
# Make embeddings a numpy array for use in an RNN 
# Create training and test sets with Scikit Learn

embeddings_array = np.array(embeddings)

# Define the labels to be used 500 negative (0) and 500 positive (1)
thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                      np.ones((500), dtype = np.int32)), axis = 0)

# Scikit Learn for random splitting of the data  
from sklearn.model_selection import train_test_split

# Random splitting of the data in to training (80%) and test (20%)  
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, 
                     random_state = RANDOM_SEED)

In [107]:
reset_graph()


n_steps = embeddings_array.shape[1]  # number of words per document 
n_inputs = embeddings_array.shape[2]  # dimension of  pre-trained embeddings
n_neurons = 20  # specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        print('\n  ---- Epoch ', epoch, ' ----\n')
        for iteration in range(y_train.shape[0] // batch_size):          
            X_batch = X_train[iteration*batch_size:(iteration + 1)*batch_size,:]
            y_batch = y_train[iteration*batch_size:(iteration + 1)*batch_size]
            print('  Batch ', iteration, ' training observations from ',  
                  iteration*batch_size, ' to ', (iteration + 1)*batch_size-1,)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print('\n  Train accuracy:', acc_train, 'Test accuracy:', acc_test)

### Key Takeaways
* In terms of training each RNN, the computational time was relatively fast compared to CNNs. Although no CNNs were trained in this exercise, this is just a simple reflection from a prior exercise. It may have to do with the type of data that was used. 
* The pretrained word vectors with 50 dimensions performed better on training accuracy but not on test accuracy compared to the pretrained word vectors with 100 dimensions. 
* With regards to the impact of vocabulary size, it there was no difference in the training accuracy using vectors with 50 dimensions. There was a minimal decrease in training accuracy when increasing the vocabulary size within the 100 dimension vectors. 
* It would seem that increasing the dimension size lead to lower training accuracy but better test accuracy. 
* The difference in test accuracy and training accuracy was more pronounced (difference of 0.15) using the vectors with 50 dimensions compared to the vectors with 100 dimensions (difference of 0.09). 
* Overall, test accuracy was lower compared to training accuracy across every RNN which may signal potential underfitting. 

![title](Capture.png)

### Section 4 - Benchmark Experiment Results and Final Recommendation

We would advice management to consider using RNNs as ways to classify customer messaging sentiment. The simple RNNs developed here showed promise when referring to their test accuracy. This accuracy can certainly be improved by perhaps using a different pretrained word vector with higher dimension and higher vocabulary size. Also, there was no hyperparameter optimization that was completed during this exercise which can only serve to improve accuracy. 

In terms of methods that are relevant to the customer services function, we would want a model that can accurately predict when a customer is upset based on the combination/type of words used in their message. A binary response variable can be used at the output layer which would tell management that a customer is upset. More sophisticated models would perhaps output the degree to which a customer is upset. If we can accurately predict if a customer is upset, we would reduce costs by not reaching out to non-upset customers as well. 

In order to achieve this accurate prediction, the word vector used to train the model would need to be changed. It would be ideal to have a pretrained word vector that relates to customer satisfaction related to the company's product or industry. For example, using word vector that was pulled from Twitter may not be appropriate to train a model that is expected to predict upset bank customers. Data scientists could develop proprietary word vectors that are more recent and contain updated words that belong to certain classifications (perhaps certain phrases have changed meaning over the years). This can be done through web scraping new sources and/or going through the arduous task of classifying a long list of words by hand. 