# Ramia_Assignment8

**Introduction:** This assignment involves working with language models developed with pretrained word vectors. We use sentences (sequences of words) to train language models for predicting movie review sentiment (thumbs up versus thumbs down). We study effects of word vector size, vocabulary size, and source on classification performance. We build on resources for recurrent neural networks (RNNs) as implemented in TensorFlow. RNNs are well suited to the analysis of sequences, as needed for natural language processing (NLP).

Specialized RNN models have been developed to accommodate the needs of many language processing tasks. Larger relevant vocabularies are usually associated with more accurate models, but training with larger vocabularies requires more memory and longer processing times. We can speed up the training process by using pretrained word vectors and subsets of pretrained word vectors.

Technologies such as word2vec, GloVe (global vectors), and fastText provide ways of representing words as numeric vectors. These numeric vectors or neural network embeddings capture the meaning of words as well as their common usage as parts of speech. Word embeddings have extensive applications in natural language processing.

Previous work involved gathering embeddings via chakin. Following methods described in https://github.com/chakki-works/chakin, the previous program, run-chakin-to-get-embeddings-v001.py, downloaded pre-trained GloVe embeddings, saved them in a zip archive, and unzipped that archive to create the four word-to-embeddings text files for use in language models.

In a 2x2 experimental design, we test classification performance of an RNN trained on four different GloVe embeddings. Two embeddings were sourced from Wikipedia, while the other two were sourced from Twitter. The Wikipedia embeddings have a vocabulary of 400K and dimensions of either 50 or 100. The Twitter embeddings have a vocabulary of 1.2M and dimensions of either 50 or 100. In the end we found that the model trained on Twitter embeddings with 50 dimensions performed the best on the test set, although the results did not follow a consistent pattern, which likely reflects the under-optimized RNN.

In [0]:
import numpy as np
import os  # operating system functions
import os.path  # for manipulation of file path names
import re  # regular expressions
from collections import defaultdict
import nltk
from nltk.tokenize import TreebankWordTokenizer
import tensorflow as tf

In [0]:
RANDOM_SEED = 9999
REMOVE_STOPWORDS = False  # no stopword removal

In [0]:
# To make output stable across runs
def reset_graph(seed=RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Define the paths to the four embedding files.

In [0]:
glove_6B_50d = '/content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.6B.50d.txt'
glove_6B_100d = '/content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.6B.100d.txt'
glove_twitter_27B_50d = '/content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.twitter.27B.50d.txt'
glove_twitter_27B_100d = '/content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.twitter.27B.100d.txt'

The utility function for loading embeddings follows methods described in https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer. It reads the requested pre-trained word embeddings from a text file into Python data structures.

If with_indexes=True, we return a tuple of two objects: the dictionary "word_to_index_dict" and the numpy array "index_to_embedding_array". Otherwise we return only a direct "word_to_embedding_dict" dictionary mapping from a string to a numpy array.

Note the use of the defaultdict data structure from the collections module of the Python Standard Library, which lets the caller specify a default value up front. The default value will be returned if the key is not a known dictionary key. For word embeddings, this default value is a vector of zeros. That is, unknown words are represented by a vector of zeros.
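
As a minimal sketch of this default-value behavior, the toy example below uses made-up three-dimensional vectors and hypothetical names (`known_vectors`, `zero_vector`); it is not the loader used in this notebook. Passing the populated dictionary as the second argument to `defaultdict` is what keeps the known words available alongside the default.

```python
from collections import defaultdict

# Toy vectors with made-up values; real GloVe vectors have 50 or 100 dimensions.
known_vectors = {
    'good': [0.2, -0.1, 0.7],
    'bad': [-0.3, 0.4, -0.6],
}
zero_vector = [0.0] * 3

# The first argument supplies the default; the second seeds the known entries.
word_to_embedding = defaultdict(lambda: zero_vector, known_vectors)

print(word_to_embedding['good'])         # a known word returns its stored vector
print(word_to_embedding['blockbuster'])  # an unknown word returns the zero vector
```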

In [0]:
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Seed the defaultdict with the loaded entries so known words are kept
        word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict

In [7]:
print('\nLoading embeddings from', glove_6B_50d)
word_to_index1, index_to_embedding1 = load_embedding_from_disks(glove_6B_50d, with_indexes=True)
print("Embedding loaded from disk")
print("Embedding is of shape: {}".format(index_to_embedding1.shape))

print('\nLoading embeddings from', glove_6B_100d)
word_to_index2, index_to_embedding2 = load_embedding_from_disks(glove_6B_100d, with_indexes=True)
print("Embedding loaded from disk")
print("Embedding is of shape: {}".format(index_to_embedding2.shape))

print('\nLoading embeddings from', glove_twitter_27B_50d)
word_to_index3, index_to_embedding3 = load_embedding_from_disks(glove_twitter_27B_50d, with_indexes=True)
print("Embedding loaded from disk")
print("Embedding is of shape: {}".format(index_to_embedding3.shape))

print('\nLoading embeddings from', glove_twitter_27B_100d)
word_to_index4, index_to_embedding4 = load_embedding_from_disks(glove_twitter_27B_100d, with_indexes=True)
print("Embedding loaded from disk")
print("Embedding is of shape: {}".format(index_to_embedding4.shape))


Loading embeddings from /content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.6B.50d.txt
Embedding loaded from disk
Embedding is of shape: (400001, 50)

Loading embeddings from /content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.6B.100d.txt
Embedding loaded from disk
Embedding is of shape: (400001, 100)

Loading embeddings from /content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.twitter.27B.50d.txt
Embedding loaded from disk
Embedding is of shape: (1193515, 50)

Loading embeddings from /content/gdrive/My Drive/MSDS 422/Week 8/embeddings/glove.twitter.27B.100d.txt
Embedding loaded from disk
Embedding is of shape: (1193515, 100)


**Note:** The shape of an embedding is the number of words followed by the number of dimensions per word.

The first words in an embedding are words that tend to occur more often. For unknown words, the representation is a vector of zeros, and their index is the last row of the embedding array.
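
The index-based layout can be illustrated with a toy vocabulary. This is only a sketch under assumed values: the three words, their four-dimensional vectors, and the name `unknown_index` are made up for illustration.

```python
from collections import defaultdict
import numpy as np

# Toy 3-word vocabulary with 4-dimensional vectors (made-up values).
toy_words = ['the', 'movie', 'great']
toy_vectors = [[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8],
               [0.9, 1.0, 1.1, 1.2]]

unknown_index = len(toy_words)  # one past the last real word
word_to_index = defaultdict(lambda: unknown_index,
                            {word: i for i, word in enumerate(toy_words)})

# Append one extra row of zeros that all unknown words share.
index_to_embedding = np.array(toy_vectors + [[0.0] * 4])

print(index_to_embedding.shape)                       # (words + 1, dimensions)
print(word_to_index['movie'], word_to_index['zzzz'])  # known index, unknown index
print(index_to_embedding[word_to_index['zzzz']])      # the shared zero row
```

Indexing a `defaultdict` with a missing key also stores that key, so the dictionary grows as unknown words are looked up. Calling `word_to_index.get(word, unknown_index)` instead would leave the dictionary unchanged.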

To demonstrate how to use embeddings dictionaries we use the following test sentence.

In [8]:
a_typing_test_sentence = 'The quick brown fox jumps over the lazy dog'
print('\nTest sentence: ', a_typing_test_sentence, '\n')

print('Test sentence embeddings from complete vocabulary of', 
      len(word_to_index1), 'words:\n')

words_in_test_sentence = a_typing_test_sentence.split()
for word in words_in_test_sentence:
    word_ = word.lower()
    idx = word_to_index1[word_]
    embedding = index_to_embedding1[idx]
    print(word_ + ':', '\n', 'index =', idx, '\n', 'embedding =', embedding)


Test sentence:  The quick brown fox jumps over the lazy dog 

Test sentence embeddings from complete vocabulary of 400000 words:

the: 
 index = 0 
 embedding = [ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
quick: 
 index = 2582 
 embedding = [ 0.13967   -0.53798   -0.18047   -0.25142    0.16203   -0.13868
 -0.24637    0.75111    0.27264    0.61035   -0.82548    0.038647
 -0.32361    0.30373   -0.14598   -0.23551    0.39267   -1.12

The following code is for working with movie reviews data. 

Utility function to get file names within a directory:

In [0]:
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)

Define list of codes to be dropped from the document (carriage-returns, line-feeds, tabs).

In [0]:
codelist = ['\r', '\n', '\t']

Previous analysis of a list of top terms showed a number of words, along with contractions and other word strings to drop from further analysis. Add these to the usual English stopwords to be dropped from a document collection.

We will not remove stopwords in this exercise because they are important to keeping sentences intact.

In [0]:
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))
    
    more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br',\
        've', 're', 'vs'] 

    some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

    # start with the initial list and add to it for movie text work 
    stoplist = nltk.corpus.stopwords.words('english') + more_stop_words +\
        some_proper_nouns_to_remove

Text parsing function for creating text documents.

There is more we could do for data preparation (stemming, looking for contractions, possessives, etc.) but we will work with what we have in this parsing function. If we want to do stemming at a later time, we can use porter = nltk.PorterStemmer() in a construction like this:

words_stemmed =  [porter.stem(word) for word in initial_words]  

In [0]:
def text_parse(string):
    # replace non-alphabetic characters (including digits) with spaces
    temp_string = re.sub('[^a-zA-Z]', '  ', string)    
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)      
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)
    # convert uppercase to lowercase
    temp_string = temp_string.lower()    
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for i in range(len(stoplist)):
            stopstring = ' ' + str(stoplist[i]) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)        
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)
    return(temp_string)

Gather data for 500 negative movie reviews:

In [13]:
dir_name = '/content/gdrive/My Drive/MSDS 422/Week 8/movie-reviews-negative'
    
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: /content/gdrive/My Drive/MSDS 422/Week 8/movie-reviews-negative
500 files found


Read data for negative movie reviews.

Data will be stored in a list of lists, where each inner list represents a document as a list of words. Each file is read, cleaned, and then broken into words.

In [14]:
def read_data(filename):

  with open(filename, encoding='utf-8') as f:
    data = tf.compat.as_str(f.read())
    data = data.lower()
    data = text_parse(data)
    data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

  return data

negative_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    negative_documents.append(words)


Processing document files under /content/gdrive/My Drive/MSDS 422/Week 8/movie-reviews-negative


Gather data for 500 positive movie reviews:

In [15]:
dir_name = '/content/gdrive/My Drive/MSDS 422/Week 8/movie-reviews-positive'  

filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: /content/gdrive/My Drive/MSDS 422/Week 8/movie-reviews-positive
500 files found


Read data for positive movie reviews.

Data will be stored in a list of lists, where each inner list represents a document as a list of words. Each file is read, cleaned, and then broken into words.


In [16]:
def read_data(filename):

  with open(filename, encoding='utf-8') as f:
    data = tf.compat.as_str(f.read())
    data = data.lower()
    data = text_parse(data)
    data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

  return data

positive_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    words = read_data(os.path.join(dir_name, filenames[i]))
    positive_documents.append(words)


Processing document files under /content/gdrive/My Drive/MSDS 422/Week 8/movie-reviews-positive


Convert positive/negative documents into numpy array.

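Note that reviews vary from 22 to 1052 words, so we use the first 20 and last 20 words of each review as our word sequences for analysis. For reviews shorter than 40 words the two windows overlap, so some words appear twice in the sequence.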

In [17]:
max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length) 

# construct list of 1000 lists with 40 words in each list
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))

max_review_length: 1052
min_review_length: 22
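
The windowing above can be checked on toy documents. The sketch below uses a hypothetical helper, `head_tail`, and placeholder tokens to show that every review yields 40 tokens, and that a 22-word review repeats its middle words because the two windows overlap.

```python
def head_tail(tokens, n=20):
    """Return the first n and the last n tokens, concatenated (hypothetical helper)."""
    return tokens[:n] + tokens[-n:]

# Placeholder tokens standing in for a long and a short review.
long_doc = ['w%d' % i for i in range(100)]
short_doc = ['w%d' % i for i in range(22)]

window_long = head_tail(long_doc)
window_short = head_tail(short_doc)

print(len(window_long), len(window_short))  # both sequences have 40 tokens
print(len(set(window_short)))               # distinct tokens in the short review
```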


Create list of lists of lists for embeddings.

In [0]:
def embed_documents(docs, word_to_index, index_to_embedding):
    # Replace every word of every document with its embedding vector
    return [[index_to_embedding[word_to_index[word]] for word in doc]
            for doc in docs]

embeddings1 = embed_documents(documents, word_to_index1, index_to_embedding1)
embeddings2 = embed_documents(documents, word_to_index2, index_to_embedding2)
embeddings3 = embed_documents(documents, word_to_index3, index_to_embedding3)
embeddings4 = embed_documents(documents, word_to_index4, index_to_embedding4)

Check on the embeddings list of list of lists.

In [19]:
# Show the first word in the first document
test_word = documents[0][0]    
print('First word in first document:', test_word)    
print('Embedding for this word:\n', 
      index_to_embedding1[word_to_index1[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings1[0][0][:])

First word in first document: this
Embedding for this word:
 [ 5.3074e-01  4.0117e-01 -4.0785e-01  1.5444e-01  4.7782e-01  2.0754e-01
 -2.6951e-01 -3.4023e-01 -1.0879e-01  1.0563e-01 -1.0289e-01  1.0849e-01
 -4.9681e-01 -2.5128e-01  8.4025e-01  3.8949e-01  3.2284e-01 -2.2797e-01
 -4.4342e-01 -3.1649e-01 -1.2406e-01 -2.8170e-01  1.9467e-01  5.5513e-02
  5.6705e-01 -1.7419e+00 -9.1145e-01  2.7036e-01  4.1927e-01  2.0279e-02
  4.0405e+00 -2.4943e-01 -2.0416e-01 -6.2762e-01 -5.4783e-02 -2.6883e-01
  1.8444e-01  1.8204e-01 -2.3536e-01 -1.6155e-01 -2.7655e-01  3.5506e-02
 -3.8211e-01 -7.5134e-04 -2.4822e-01  2.8164e-01  1.2819e-01  2.8762e-01
  1.4440e-01  2.3611e-01]
Corresponding embedding from embeddings list of list of lists
 [ 5.3074e-01  4.0117e-01 -4.0785e-01  1.5444e-01  4.7782e-01  2.0754e-01
 -2.6951e-01 -3.4023e-01 -1.0879e-01  1.0563e-01 -1.0289e-01  1.0849e-01
 -4.9681e-01 -2.5128e-01  8.4025e-01  3.8949e-01  3.2284e-01 -2.2797e-01
 -4.4342e-01 -3.1649e-01 -1.2406e-01 -2.8170e-0

To demonstrate how the embeddings can vary by file, we do the same as above using the second embedding file.

In [20]:
test_word = documents[0][0]    
print('First word in first document:', test_word)    
print('Embedding for this word:\n', 
      index_to_embedding2[word_to_index2[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings2[0][0][:])

First word in first document: this
Embedding for this word:
 [-0.57058   0.44183   0.70102  -0.41713  -0.34058   0.02339  -0.071537
  0.48177  -0.013121  0.16834  -0.13389   0.040626  0.15827  -0.44342
 -0.019403 -0.009661 -0.046284  0.093228 -0.27331   0.2285    0.33089
 -0.36474   0.078741  0.3585    0.44757  -0.2299    0.18077  -0.6265
  0.053852 -0.29154  -0.4256    0.62903   0.14393  -0.046004 -0.21007
  0.48879  -0.057698  0.37431  -0.030075 -0.34494  -0.29702   0.15095
  0.28248  -0.16578   0.076131 -0.093016  0.79365  -0.60489  -0.18874
 -1.0173    0.31962  -0.16344   0.54177   1.1725   -0.47875  -3.3842
 -0.081301 -0.3528    1.8372    0.44516  -0.52666   0.99786  -0.32178
  0.033462  1.1783   -0.072905  0.39737   0.26166   0.33111  -0.35629
 -0.16558  -0.44382  -0.14183  -0.37976   0.28994  -0.029114 -0.35169
 -0.27694  -1.344     0.19555   0.16887   0.040237 -0.80212   0.23366
 -1.3837   -0.023132  0.085395 -0.74051  -0.073934 -0.58838  -0.085735
 -0.10525  -0.51571   0.15038

Make embeddings a numpy array for use in an RNN.

In [0]:
embeddings_array1 = np.array(embeddings1)
embeddings_array2 = np.array(embeddings2)
embeddings_array3 = np.array(embeddings3)
embeddings_array4 = np.array(embeddings4)

Define the labels to be used: 500 negative (0) and 500 positive (1)

In [0]:
thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                      np.ones((500), dtype = np.int32)), axis = 0)

Create training and test sets with scikit-learn.

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(embeddings_array1,
                                                        thumbs_down_up,
                                                        test_size=0.20,
                                                        random_state = RANDOM_SEED)

X_train2, X_test2, y_train2, y_test2 = train_test_split(embeddings_array2,
                                                        thumbs_down_up,
                                                        test_size=0.20,
                                                        random_state = RANDOM_SEED)

X_train3, X_test3, y_train3, y_test3 = train_test_split(embeddings_array3,
                                                        thumbs_down_up,
                                                        test_size=0.20,
                                                        random_state = RANDOM_SEED)

X_train4, X_test4, y_train4, y_test4 = train_test_split(embeddings_array4,
                                                        thumbs_down_up,
                                                        test_size=0.20,
                                                        random_state = RANDOM_SEED)

We use a very simple recurrent neural network for this assignment. Source code is available at https://github.com/ageron/handson-ml.

In [26]:
reset_graph()

n_steps = embeddings_array1.shape[1]  # number of words per document 
n_inputs = embeddings_array1.shape[2]  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train1.shape[0] // batch_size):          
            X_batch = X_train1[iteration*batch_size:(iteration + 1)*batch_size,:]
            y_batch = y_train1[iteration*batch_size:(iteration + 1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # Training accuracy is measured on the last mini-batch only
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test1, y: y_test1})
    print('\n  Train accuracy:', acc_train, 'Test accuracy:', acc_test)


  Train accuracy: 0.83 Test accuracy: 0.595
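
The inner loop above walks the training set in fixed, contiguous slices without shuffling. A sketch of the same index arithmetic, using a hypothetical generator `iter_batches` and assuming the 800 training examples that an 80/20 split of 1000 reviews produces:

```python
def iter_batches(n_examples, batch_size):
    """Yield (start, stop) pairs for full batches only (hypothetical helper)."""
    for iteration in range(n_examples // batch_size):
        yield iteration * batch_size, (iteration + 1) * batch_size

batches = list(iter_batches(800, 100))
print(len(batches), batches[0], batches[-1])

# Integer division drops any incomplete final batch.
print(len(list(iter_batches(850, 100))))
```

Because the reported training accuracy is evaluated on `X_batch` after the loop, it reflects only the final slice of 100 examples rather than the whole training set.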


In [27]:
reset_graph()

n_steps = embeddings_array2.shape[1]  # number of words per document 
n_inputs = embeddings_array2.shape[2]  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train2.shape[0] // batch_size):          
            X_batch = X_train2[iteration*batch_size:(iteration + 1)*batch_size,:]
            y_batch = y_train2[iteration*batch_size:(iteration + 1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # Training accuracy is measured on the last mini-batch only
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test2, y: y_test2})
    print('\n  Train accuracy:', acc_train, 'Test accuracy:', acc_test)


  Train accuracy: 0.92 Test accuracy: 0.605


In [28]:
reset_graph()

n_steps = embeddings_array3.shape[1]  # number of words per document 
n_inputs = embeddings_array3.shape[2]  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train3.shape[0] // batch_size):          
            X_batch = X_train3[iteration*batch_size:(iteration + 1)*batch_size,:]
            y_batch = y_train3[iteration*batch_size:(iteration + 1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # Training accuracy is measured on the last mini-batch only
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test3, y: y_test3})
    print('\n  Train accuracy:', acc_train, 'Test accuracy:', acc_test)


  Train accuracy: 0.78 Test accuracy: 0.715


In [29]:
reset_graph()

n_steps = embeddings_array4.shape[1]  # number of words per document 
n_inputs = embeddings_array4.shape[2]  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(y_train4.shape[0] // batch_size):          
            X_batch = X_train4[iteration*batch_size:(iteration + 1)*batch_size,:]
            y_batch = y_train4[iteration*batch_size:(iteration + 1)*batch_size]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # Training accuracy is measured on the last mini-batch only
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test4, y: y_test4})
    print('\n  Train accuracy:', acc_train, 'Test accuracy:', acc_test)


  Train accuracy: 0.94 Test accuracy: 0.635


**Conclusion:** As mentioned earlier, the model trained on Twitter embeddings with 50 dimensions performed the best on the test set, with a test accuracy of 0.715. The other three models appear to overfit the training set and underperform on the test set: their training accuracies range from 0.83 to 0.94, while their test accuracies stay between 0.595 and 0.635. Oddly, a larger vocabulary alone does not seem to help, as the model using Twitter embeddings with 100 dimensions (0.635) performs only slightly better than the models using Wikipedia embeddings (0.595 and 0.605).

Dimensionality also has a somewhat confusing effect. While a greater number of dimensions appears to improve test accuracy slightly between the models using Wikipedia embeddings (0.595 versus 0.605), the opposite holds between the models using Twitter embeddings (0.715 versus 0.635).
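
These comparisons can be tabulated directly from the accuracies printed by the four runs. The sketch below is only a bookkeeping aid with a hypothetical helper, `mean_test_accuracy`; each figure comes from a single run on one train/test split, and the training accuracies are measured on the last mini-batch of 100 reviews.

```python
# Accuracies copied from the four runs: (train on last mini-batch, test).
results = {
    ('Wikipedia', 50): (0.83, 0.595),
    ('Wikipedia', 100): (0.92, 0.605),
    ('Twitter', 50): (0.78, 0.715),
    ('Twitter', 100): (0.94, 0.635),
}

# Gap between training and test accuracy as a rough indicator of overfitting.
for (source, dim), (train_acc, test_acc) in results.items():
    gap = train_acc - test_acc
    print('%-9s %3dd  train=%.3f  test=%.3f  gap=%.3f'
          % (source, dim, train_acc, test_acc, gap))

def mean_test_accuracy(position, level):
    """Average test accuracy over runs whose key equals level at the given position."""
    values = [test for key, (_, test) in results.items() if key[position] == level]
    return sum(values) / len(values)

# Marginal means of the 2x2 design: by source, then by dimension.
for source in ('Wikipedia', 'Twitter'):
    print(source, round(mean_test_accuracy(0, source), 3))
for dim in (50, 100):
    print(dim, round(mean_test_accuracy(1, dim), 3))
```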

These confusing effects would likely be reduced by optimizing the RNN. We used a simple model here for demonstration purposes, but there are a number of ways to improve the results of a neural network. These include hyperparameter tuning and regularization techniques that would help ameliorate the overfitting.