Management is considering using a language model to classify written customer reviews, call logs, and complaint logs. The goal is to improve customer service by identifying critical customer messages and assigning them to support staff. Our project is to develop a sentiment classification system using recurrent neural networks (RNNs). Movie reviews are used as training data in order to select the best method of training a model to distinguish positive from negative language. Pretrained embeddings are used as model inputs; the two embeddings used in this study are glove.6B.50d and glove.6B.100d. Several types of RNN were tried. A BasicRNNCell in TensorFlow was tried first, but the best results obtained were a train accuracy of 0.83 and a test accuracy of 0.66 with 50 epochs, one RNN layer, and the Adam optimizer. The study therefore progressed to LSTM layers under Keras; the methods and results are discussed in this document.

In [32]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import os  # operating system functions
import os.path  # for manipulation of file path names

import re  # regular expressions

from collections import defaultdict

import nltk
from nltk.tokenize import TreebankWordTokenizer

import tensorflow as tf

RANDOM_SEED = 42

# To make output stable across runs
def reset_graph(seed= RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

REMOVE_STOPWORDS = False  # no stopword removal 

EVOCABSIZE = 10000  # specify desired size of pre-defined embedding vocabulary 

The cell above loads the libraries, initializes the constants, and defines the reset_graph function.

In [33]:
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Keep the loaded entries; only unknown words fall back to zeros
        word_to_embedding_dict = defaultdict(
            lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict

#Load embeddings glove.6B.50d.txt and glove.6B.100d.txt
embeddings_directory = 'embeddings/gloVe.6B'
glove_6B_50dfilename = 'glove.6B.50d.txt'
embeddings1_filename = os.path.join(embeddings_directory, glove_6B_50dfilename)

glove_6B_100dfilename = 'glove.6B.100d.txt'
embeddings2_filename = os.path.join(embeddings_directory, glove_6B_100dfilename)

word1_to_index, index_to_embedding1 = \
    load_embedding_from_disks(embeddings1_filename, with_indexes=True)
word2_to_index, index_to_embedding2 = \
    load_embedding_from_disks(embeddings2_filename, with_indexes=True)

The previous cell defines the load_embedding_from_disks function and loads the two embeddings used in this study: glove.6B.50d and glove.6B.100d.
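
As a minimal sketch of what the loader returns, the example below parses three made-up GloVe-style lines from an in-memory string instead of a file. The toy vectors and the helper name `load_toy_embedding` are assumptions for illustration only; the point is that unknown words fall through to an extra all-zero row appended at the end of the array.

```python
import io
from collections import defaultdict

import numpy as np

# Three made-up lines in GloVe text format: a word followed by its vector.
TOY_GLOVE = "the 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\ngood 0.7 0.8 0.9\n"


def load_toy_embedding(text_file):
    """Hypothetical helper mirroring the indexed branch of the loader above."""
    word_to_index = {}
    vectors = []
    for i, line in enumerate(text_file):
        parts = line.split()
        word_to_index[parts[0]] = i
        vectors.append([float(val) for val in parts[1:]])
    unknown_index = len(vectors)              # one past the last real word
    vectors.append([0.0] * len(vectors[0]))   # all-zero row for unknown words
    lookup = defaultdict(lambda: unknown_index, word_to_index)
    return lookup, np.array(vectors)


word_to_index_toy, index_to_embedding_toy = load_toy_embedding(
    io.StringIO(TOY_GLOVE))
print(index_to_embedding_toy.shape)                            # (4, 3)
print(index_to_embedding_toy[word_to_index_toy["movie"]])      # [0.4 0.5 0.6]
print(index_to_embedding_toy[word_to_index_toy["zzzunseen"]])  # [0. 0. 0.]
```

Note that looking up a missing word in a defaultdict also stores that word as a new key, which is harmless here because every such key maps to the same unknown-word index.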

In [35]:
def define_dim_tests(word_to_index, index_to_embedding):

    vocab_size, embedding_dim = index_to_embedding.shape
    word = "worsdfkljsdf"  # a word obviously not in the vocabulary
    idx = word_to_index[word] # index for word obviously not in the vocabulary
    complete_vocabulary_size = idx 
    # "int" for compact print only
    embd = list(np.array(index_to_embedding[idx], dtype=int))
    word = "the"
    idx = word_to_index[word]
    embd = list(index_to_embedding[idx])

    # Show how to use embeddings dictionaries with a test sentence
    # This is a famous typing exercise with all letters of the alphabet
    # https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog
    a_typing_test_sentence = 'The quick brown fox jumps over the lazy dog'
    words_in_test_sentence = a_typing_test_sentence.split()

    for word in words_in_test_sentence:
        word_ = word.lower()
        embedding = index_to_embedding[word_to_index[word_]]
    return words_in_test_sentence, embedding, word_to_index, \
           index_to_embedding, embedding_dim

In [36]:
# ------------------------------------------------------------- 
# Helper function to define vocabulary size

def default_factory():
    return EVOCABSIZE  # last/unknown-word row in limited_index_to_embedding

# Define vocabulary size for the language model    
# To reduce the size of the vocabulary to the n most frequently used words

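# dictionary has the items() function, returns (key, value) pairs
# Note: this function also reads the globals word_to_index, embedding_dim
# and words_in_test_sentence, which the training loop sets before calling it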
def define_vocabulary_size(index_to_embedding):
    limited_word_to_index = defaultdict(default_factory, \
        {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

    # Select the first EVOCABSIZE rows to the index_to_embedding
    limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
    # Set the unknown-word row to be all zeros as previously
    limited_index_to_embedding = np.append(limited_index_to_embedding, 
        index_to_embedding[index_to_embedding.shape[0] - 1, :].\
            reshape(1,embedding_dim), 
        axis = 0)

    # Drop the local reference to the large numpy array
    # (the caller's reference keeps the array itself in memory)
    del index_to_embedding

    # Verify the new vocabulary: should get same embeddings for test sentence
    # Note that a small EVOCABSIZE may yield some zero vectors for embeddings
    for word in words_in_test_sentence:
        word_ = word.lower()
        embedding = limited_index_to_embedding[limited_word_to_index[word_]]
    return embedding, limited_index_to_embedding, limited_word_to_index


The previous two cells prepare the embeddings used to convert the negative and positive reviews into model inputs.
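
The vocabulary-limiting step can be pictured on a toy scale. In the sketch below the five-word vocabulary, the 2-dimensional vectors, and the cutoff `TOY_VOCAB_SIZE = 3` are all made up; it assumes, as the comments in the cell above do, that rows are ordered from most to least frequent word, so keeping the first rows keeps the most common words.

```python
from collections import defaultdict

import numpy as np

TOY_VOCAB_SIZE = 3  # made-up cutoff standing in for EVOCABSIZE

# Made-up full vocabulary, ordered from most to least frequent.
full_word_to_index = {"the": 0, "a": 1, "film": 2, "tepid": 3, "oeuvre": 4}
full_embeddings = np.arange(10, dtype=float).reshape(5, 2) + 1.0

# Keep only words below the cutoff; every other word maps to the unknown index.
toy_word_to_index = defaultdict(
    lambda: TOY_VOCAB_SIZE,
    {w: i for w, i in full_word_to_index.items() if i < TOY_VOCAB_SIZE},
)

# Keep the first rows and append one all-zero row for unknown words.
toy_embeddings = np.vstack(
    [full_embeddings[:TOY_VOCAB_SIZE], np.zeros((1, full_embeddings.shape[1]))]
)

print(toy_embeddings.shape)                          # (4, 2)
print(toy_embeddings[toy_word_to_index["film"]])     # [5. 6.]
print(toy_embeddings[toy_word_to_index["oeuvre"]])   # [0. 0.]
```

A word such as "oeuvre" exists in the full vocabulary but sits past the cutoff, so it is treated exactly like a word that was never seen.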

In [37]:
# Utility function to get file names within a directory
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)
# define list of codes to be dropped from document
# carriage-returns, line-feeds, tabs
codelist = ['\r', '\n', '\t']   

# We will not remove stopwords in this exercise because they are
# important to keeping sentences intact
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))

# previous analysis of a list of top terms showed a number of words, along 
# with contractions and other word strings to drop from further analysis, add
# these to the usual English stopwords to be dropped from a document collection
    more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br',\
        've', 're', 'vs'] 

    some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

    # start with the initial list and add to it for movie text work 
    stoplist = nltk.corpus.stopwords.words('english') + more_stop_words +\
        some_proper_nouns_to_remove

# text parsing function for creating text documents
def text_parse(string):
    # replace non-alphanumeric with space 
    temp_string = re.sub('[^a-zA-Z]', '  ', string)    
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)      
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)
    # convert uppercase to lowercase
    temp_string = temp_string.lower()    
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for i in range(len(stoplist)):
            stopstring = ' ' + str(stoplist[i]) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)        
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)
    return(temp_string)    

These helper functions retrieve the negative and positive reviews and convert them to documents. listdir_no_hidden lists the files in a directory, skipping hidden files. text_parse cleans the text of each file and, when REMOVE_STOPWORDS is set, drops the listed stopwords.
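
To make the cleaning steps concrete, here is a simplified stand-in for text_parse applied to one made-up review sentence. The function name `clean_review` and the sample text are assumptions for illustration; unlike the original, the sketch skips stopword removal and trims leading and trailing whitespace.

```python
import re


def clean_review(text):
    """Simplified stand-in for text_parse, without stopword removal."""
    text = re.sub(r"[^a-zA-Z]", " ", text)    # non-letters become spaces
    text = re.sub(r"\s.\s", " ", text)        # drop isolated single characters
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


sample = "Great cast, but the plot wasn't good.<br />Two stars!"
print(clean_review(sample))
# great cast but the plot wasn good br two stars
```

The output shows two side effects of this kind of cleaning: the contraction "wasn't" loses its "t", and the HTML line break leaves a stray "br" token, which is why 'br' appears in the optional stopword list above.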

In [38]:
# -----------------------------------------------
# gather data for 500 negative movie reviews
# -----------------------------------------------
dir_name = 'movie-reviews-negative'
    
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists

# Read data for negative movie reviews

def read_data(filename):

    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

    return data

negative_documents = []

for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    negative_documents.append(words)


In [39]:
# -----------------------------------------------
# gather data for 500 positive movie reviews
# -----------------------------------------------
dir_name = 'movie-reviews-positive'  
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists

# Read data for positive movie reviews,
# reusing read_data() as defined in the previous cell

positive_documents = []

for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    positive_documents.append(words)

The previous two cells retrieve the negative and positive reviews. Every file in the negative and positive folders is read, cleaned, and tokenized, and the resulting word list is appended to negative_documents or positive_documents respectively.

In [3]:
# -----------------------------------------------------
# convert positive/negative documents into numpy array
# -----------------------------------------------------
def cvt_pos_neg_npa(embedding, limited_index_to_embedding, 
                    index_to_embedding, limited_word_to_index):
    max_review_length = 0  # initialize
    for doc in negative_documents:
        max_review_length = max(max_review_length, len(doc))    
    for doc in positive_documents:
        max_review_length = max(max_review_length, len(doc)) 

    min_review_length = max_review_length  # initialize
    for doc in negative_documents:
        min_review_length = min(min_review_length, len(doc))    
    for doc in positive_documents:
        min_review_length = min(min_review_length, len(doc)) 

    # construct list of 1000 lists with 40 words in each list
    from itertools import chain
    documents = []
    for doc in negative_documents:
        doc_begin = doc[0:20]
        doc_end = doc[len(doc) - 20: len(doc)]
        documents.append(list(chain(*[doc_begin, doc_end])))    
    for doc in positive_documents:
        doc_begin = doc[0:20]
        doc_end = doc[len(doc) - 20: len(doc)]
        documents.append(list(chain(*[doc_begin, doc_end])))    

    # create list of lists of lists for embeddings
    embeddings = []    
    for doc in documents:
        embedding = []
        for word in doc:
            embedding.append(limited_index_to_embedding[limited_word_to_index\
                                                        [word]]) 
        embeddings.append(embedding)
    return embeddings, documents

The above cell reduces each review to its first 20 and last 20 words and looks up the embedding of each word, producing the embeddings list used to train the model. Conversion to a NumPy array happens in the next cell.
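
The head-and-tail selection can be sketched in isolation. The helper name `head_and_tail` and the placeholder tokens are made up for illustration; the slicing is the same as in the cell above.

```python
def head_and_tail(tokens, n=20):
    """Keep the first n and the last n tokens of a document."""
    return tokens[:n] + tokens[len(tokens) - n:]


long_doc = ["w%d" % i for i in range(100)]
print(head_and_tail(long_doc, n=3))
# ['w0', 'w1', 'w2', 'w97', 'w98', 'w99']

# A document between n and 2n tokens long still yields 2n tokens (with overlap),
# but a document shorter than n tokens yields fewer than 2n.
print(len(head_and_tail(list(range(25)), n=20)))  # 40
print(len(head_and_tail(list(range(5)), n=20)))   # 10
```

The last line matters for the RNN input: a review with fewer than 20 tokens would produce a shorter row and a ragged array. The cell above computes min_review_length but does not use it as a guard, so the approach relies on every review being at least 20 tokens long.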

In [41]:
# -----------------------------------------------------    
# Make embeddings a numpy array for use in an RNN 
# Create training and test sets with Scikit Learn
# -----------------------------------------------------
def create_train_test(embeddings):
    embeddings_array = np.array(embeddings)

    # Define the labels to be used 500 negative (0) and 500 positive (1)
    thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                          np.ones((500), dtype = np.int32)), axis = 0)

    # Scikit Learn for random splitting of the data  
    from sklearn.model_selection import train_test_split

    # Random splitting of the data in to training (80%) and test (20%)  
    X_train, X_test, y_train, y_test = \
        train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, 
                         random_state = RANDOM_SEED)
    return X_train, X_test, y_train, y_test, embeddings_array

The embeddings list is converted to a NumPy array and split into training (80%) and test (20%) sets for the model.
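
The shapes involved in the split can be checked with a small sketch. It uses random numbers in place of real embeddings and a plain NumPy shuffle as a stand-in for scikit-learn's train_test_split, so the names with the `_demo` suffix and the data itself are assumptions for illustration only.

```python
import numpy as np

rng = np.random.RandomState(42)

# Made-up stand-in for the embeddings array:
# 1000 reviews x 40 words x 50 embedding dimensions.
X_demo = rng.rand(1000, 40, 50)
y_demo = np.concatenate((np.zeros(500, dtype=np.int32),
                         np.ones(500, dtype=np.int32)))

# Shuffle the row indices, then hold out the last 20% as the test set.
order = rng.permutation(len(X_demo))
cut = int(0.8 * len(X_demo))
train_idx, test_idx = order[:cut], order[cut:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape)   # (800, 40, 50) (200, 40, 50)
print(y_train_demo.mean(), y_test_demo.mean())  # share of positive labels
```

A purely random split does not guarantee the same share of positive reviews in both sets. If that matters, train_test_split accepts a `stratify` argument that keeps the class proportions equal across the two sets.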

In [43]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

n_neurons = 20  # analyst specified number of neurons (not used by the Keras model below)
n_outputs = 2  # thumbs-down or thumbs-up
EmbeddingSet = []
Optimizer = []
Training_Set_Accuracy = []
Test_Set_Accuracy = []

learning_rate = 0.001

adam = tf.keras.optimizers.Adam(lr=learning_rate, decay=1e-6)
rmsprop = tf.keras.optimizers.RMSprop(lr=learning_rate, decay=1e-6)
sgd = tf.keras.optimizers.SGD(lr=learning_rate, decay=1e-6)

for i in range(0,2):
    if(i==0):
        words_in_test_sentence, embedding, \
          word_to_index, index_to_embedding, embedding_dim = \
          define_dim_tests(word1_to_index, index_to_embedding1)
        es='glove.6B.50d.txt'
    else:
        words_in_test_sentence, embedding, \
          word_to_index, index_to_embedding, embedding_dim = \
          define_dim_tests(word2_to_index, index_to_embedding2)
        es='glove.6B.100d.txt'

    embedding, limited_index_to_embedding, limited_word_to_index = \
      define_vocabulary_size(index_to_embedding)
    embeddings, documents = cvt_pos_neg_npa(embedding, 
                                        limited_index_to_embedding, 
                                        index_to_embedding, 
                                        limited_word_to_index)   
    X_train, X_test, y_train, y_test, embeddings_array = \
    create_train_test(embeddings)
    n_steps = embeddings_array.shape[1]  # number of words per document 
    n_inputs = embeddings_array.shape[2]  # dimension of  pre-trained embeddings

    for opt in [adam, rmsprop, sgd]:
        #Referenced 
        #https://pythonprogramming.net/recurrent-neural-network-deep-learning-python-tensorflow-keras/
        model = Sequential()
        model.add(LSTM(128, 
            input_shape=(X_train.shape[1:]), activation='relu', 
                       return_sequences=True))
        model.add(Dropout(0.2))

        model.add(LSTM(128, 
            input_shape=(X_train.shape[1:]), activation='relu', 
                       return_sequences=True))
        model.add(Dropout(0.1))

        model.add(LSTM(128, activation='relu'))
        model.add(Dropout(0.1))

        model.add(Dense(32, activation='relu'))
        model.add(Dropout(0.2))

        model.add(Dense(2, activation='softmax'))

        model.compile(
               loss='sparse_categorical_crossentropy',
               optimizer=opt,
               metrics=['accuracy'],
        )

        model.fit(X_train,
                  y_train,
                  epochs=10,
                  validation_data=(X_test, y_test),
                  verbose=0)

        score, acc_train = model.evaluate(X_train, y_train)
        score, acc_test = model.evaluate(X_test, y_test)
        Training_Set_Accuracy.append(acc_train)
        Test_Set_Accuracy.append(acc_test)
        EmbeddingSet.append(es)
        if(opt==adam):
            Optimizer.append('adam')
        elif(opt==rmsprop):
            Optimizer.append('rmsprop')
        else:
            Optimizer.append('sgd')




The preceding code cell configures, trains, and evaluates the LSTM model. The model stacks three LSTM layers followed by two dense layers. Two nested for loops train the model over a 2 x 3 grid of embeddings and optimizers. The outer loop selects the embedding and builds the training and test sets. The inner loop selects an optimizer from adam, rmsprop, and sgd, then configures, trains, and evaluates the model and records the accuracy figures. Note that these training runs needed only 10 epochs to reach accuracy levels that the BasicRNNCell could not reach in 50 epochs. A learning rate of 0.001 was used with a decay rate of 1e-6 so that the steps would get smaller as gradient descent approached the minimum.
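
How much the steps actually shrink can be estimated with a short calculation. The sketch assumes the time-based rule applied by the `decay` argument of the legacy Keras optimizers, lr / (1 + decay * iteration), and the Keras default batch size of 32; both are assumptions about the installed Keras version rather than something stated in the notebook, and the function name is made up.

```python
import math


def decayed_learning_rate(initial_lr, decay, iteration):
    """Time-based decay: initial_lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)


train_size = 800   # 80% of the 1000 reviews
batch_size = 32    # assumed: Keras default when fit() gets no batch_size
epochs = 10

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

final_lr = decayed_learning_rate(0.001, 1e-6, total_steps)
print(total_steps)        # 250
print("%.8f" % final_lr)  # 0.00099975
```

Under these assumptions the learning rate falls by only about 0.025% over the whole run, so a decay of 1e-6 has almost no effect in 10 epochs; a larger decay value or a longer run would be needed for the step size to shrink noticeably.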

In [31]:

from prettytable import PrettyTable

table = PrettyTable(['Embedding', 'Optimizer', 
                     'Train_Set_Accuracy', 'Test_Set_Accuracy'])
for x in range(0, 6):
    table.add_row([EmbeddingSet[x], Optimizer[x], 
                   round(Training_Set_Accuracy[x], 3), 
                   round(Test_Set_Accuracy[x], 3)])
print(table)

+-------------------+-----------+--------------------+-------------------+
|     Embedding     | Optimizer | Train_Set_Accuracy | Test_Set_Accuracy |
+-------------------+-----------+--------------------+-------------------+
|  glove.6B.50d.txt |    adam   |       0.786        |        0.68       |
|  glove.6B.50d.txt |  rmsprop  |       0.744        |       0.675       |
|  glove.6B.50d.txt |    sgd    |       0.558        |        0.54       |
| glove.6B.100d.txt |    adam   |       0.924        |       0.715       |
| glove.6B.100d.txt |  rmsprop  |       0.842        |        0.68       |
| glove.6B.100d.txt |    sgd    |       0.496        |       0.475       |
+-------------------+-----------+--------------------+-------------------+


In [44]:
from prettytable import PrettyTable

table = PrettyTable(['Embedding', 'Optimizer', 
                     'Train_Set_Accuracy', 'Test_Set_Accuracy'])
for x in range(0, 6):
    table.add_row([EmbeddingSet[x], Optimizer[x], 
                   round(Training_Set_Accuracy[x], 3), 
                   round(Test_Set_Accuracy[x], 3)])
print(table)

+-------------------+-----------+--------------------+-------------------+
|     Embedding     | Optimizer | Train_Set_Accuracy | Test_Set_Accuracy |
+-------------------+-----------+--------------------+-------------------+
|  glove.6B.50d.txt |    adam   |       0.671        |       0.615       |
|  glove.6B.50d.txt |  rmsprop  |       0.672        |       0.615       |
|  glove.6B.50d.txt |    sgd    |       0.495        |       0.515       |
| glove.6B.100d.txt |    adam   |       0.808        |        0.65       |
| glove.6B.100d.txt |  rmsprop  |       0.795        |       0.685       |
| glove.6B.100d.txt |    sgd    |       0.495        |        0.52       |
+-------------------+-----------+--------------------+-------------------+


Here are the results of the study. Two tables are displayed above because accuracy was inconsistent from run to run during testing and development. The first table shows the most accurate run and the second table shows the last run. In both runs glove.6B.100d produced higher train and test accuracy than glove.6B.50d with the adam and rmsprop optimizers, while sgd stayed near chance (about 0.5). The best single result, a train accuracy of 0.924 and a test accuracy of 0.715, came from glove.6B.100d with the adam optimizer in the first run. In the last run rmsprop edged out adam on test accuracy (0.685 versus 0.65) with the same embedding, so the choice between those two optimizers is less clear-cut than the choice of embedding. The gap between train and test accuracy also suggests that the model overfits the training reviews.

Per this study, senior management should choose an RNN model with three LSTM layers and two dense layers, using the glove.6B.100d embedding and the adam optimizer. The learning rate should be set to 0.001 with a decay rate of 1e-6 and 10 epochs, with 128 units per LSTM layer as in the code above (the n_neurons = 20 constant is defined but not used by the model). To improve on the model, use a computer with Nvidia GPUs and build the model with a cuDNN-backed LSTM layer, such as CuDNNLSTM in Keras, in place of the standard LSTM layer. This would allow the model to train faster.

After the model is trained, a pipeline should be created to route logged reviews to the model for prediction. If a customer review, call log, or complaint log is predicted to be negative, a customer service representative should be notified, and the notification triggers action by that representative. The model should be retrained on a regular basis, or whenever prediction accuracy falls below a minimum threshold, and incorrect predictions should be added to the training data to improve future predictions. Regular updates keep the customer service model from going out of date: word choice changes over time, and newly coined terms, some with negative connotations, will be unfamiliar to the model.
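
Returning to the two results tables: because the runs differ, one hedged way to compare configurations is to average the test accuracy over both tables. The sketch below only reuses the numbers printed above; the dictionary layout is an assumption for illustration, and two runs are far too few to draw firm conclusions.

```python
# Test-set accuracy copied from the two results tables (first run, last run).
test_accuracy = {
    ("glove.6B.50d", "adam"): (0.680, 0.615),
    ("glove.6B.50d", "rmsprop"): (0.675, 0.615),
    ("glove.6B.50d", "sgd"): (0.540, 0.515),
    ("glove.6B.100d", "adam"): (0.715, 0.650),
    ("glove.6B.100d", "rmsprop"): (0.680, 0.685),
    ("glove.6B.100d", "sgd"): (0.475, 0.520),
}

# Mean test accuracy per (embedding, optimizer) configuration.
mean_test_accuracy = {
    config: sum(runs) / len(runs) for config, runs in test_accuracy.items()
}

for (embedding, optimizer), mean_acc in mean_test_accuracy.items():
    print("%-14s %-8s %.4f" % (embedding, optimizer, mean_acc))
```

Averaged this way, adam and rmsprop with glove.6B.100d come out level at about 0.68, ahead of every glove.6B.50d configuration. Repeating the training with several random seeds and reporting the mean and spread would give a firmer basis for choosing between the two optimizers.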