#   Homework 3
## Sentiment analysis using Neural Networks

Total: 50 Points


In this homework we will perform sentiment analysis using a few simple Neural Network based architectures.
For this problem we use the IMDB Large Movie Review Dataset. It contains 25,000 highly polar movie reviews in each of the train and test sets; each set has 12,500 positive reviews (rating of 7/10 or higher) and 12,500 negative reviews (rating of 4/10 or lower).
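
As a small illustration of that labelling rule, the sketch below maps a review file name to a binary label. It assumes the files are named `<id>_<rating>.txt`, which is what the loading code in this notebook relies on; the helper name `label_from_filename` is made up for this example.

```python
def label_from_filename(filename):
    """Return 1 for a positive review (rating >= 7), else 0.

    Assumes names of the form '<id>_<rating>.txt', e.g. '123_8.txt'.
    """
    stem = filename.split('.')[0]      # '123_8'
    rating = int(stem.split('_')[1])   # 8
    return 1 if rating >= 7 else 0


print(label_from_filename("123_8.txt"))   # positive review
print(label_from_filename("456_3.txt"))   # negative review
```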

Use "https://keras.io/" for keras documentation. Please use Python 3. GPU is not required but it will help improve the training speed for each problem.

Please save the notebook with your cell outputs. You will not be graded if your outputs are not present below the homework cell. Also note that your outputs will be unique, since you will be using the last numbers of your UNI as your random seed (in the third cell). Make sure you submit this IPython file with the saved outputs. The submission format must be 'hw3/hw3.ipynb'. You will not submit any other files. If you do save your model weights, you will not submit them. You will, however, make sure your model weights get saved in the 'weights' folder and can be retrieved from there as well.

Please fill your details below.



Name: Hugh Krogh-Freeman

Uni: hk2903

Email: hk2903@columbia.edu


In [1]:
from os import listdir
import random
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Dense, Dropout, Reshape, Merge, BatchNormalization, TimeDistributed, Lambda, Activation, LSTM, Flatten, Convolution1D, GRU, MaxPooling1D
from keras.regularizers import l2
from keras.callbacks import Callback, ModelCheckpoint, EarlyStopping
#from keras import initializers
from keras import backend as K
from keras.optimizers import SGD
from keras.optimizers import Adadelta
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras import optimizers
import numpy as np

Using TensorFlow backend.


In [2]:
#we retrieve train and test file names

train_dir = "./aclImdb/train/"
test_dir = "./aclImdb/test/"
tr_review = [re_filename for re_filename in listdir(train_dir)]
te_review = [re_filename for re_filename in listdir(test_dir)]

#we initialize the train and test arrays

tr_X = []
tr_Y = []
te_X = []
te_Y = []

#we arrange the reviews into the train and test arrays 

for review_file in tr_review:
    # read the first line, which this notebook treats as the full review;
    # the with-block closes the file handle afterwards
    with open(train_dir+review_file, "r", encoding="utf8") as f_review:
        str_review = f_review.readline()
    tr_X.append(str_review)
    # file names look like <id>_<rating>.txt; ratings of 7 or more are positive
    y_truth = int(review_file.split('.')[0].split('_')[1])
    if y_truth>=7:
        tr_Y.append(1)
    else:
        tr_Y.append(0)

for review_file in te_review:
    with open(test_dir+review_file, "r", encoding="utf8") as f_review:
        str_review = f_review.readline()
    te_X.append(str_review)
    y_truth = int(review_file.split('.')[0].split('_')[1])
    if y_truth>=7:
        te_Y.append(1)
    else:
        te_Y.append(0)
        

We will now create the validation set from the train set.

Use the last 4 digits of your UNI as the seed value to ensure all answers remain unique.

In [3]:
#replace 2 (SEED) with the last 4 numbers of your Uni
#Uni: 
SEED = 2903
seed_counter = 0
while(1):

    shuffle_combine = list(zip(tr_X, tr_Y))
    random.seed(SEED+seed_counter)
    seed_counter+=1
    random.shuffle(shuffle_combine)

    tr_X, tr_Y = zip(*shuffle_combine)

    val_X = tr_X[:5000]
    val_Y = tr_Y[:5000]

    counter = 0
    for label in val_Y:
        counter+=label

    print (counter)
    print (seed_counter)
    if(counter>2400 and counter <2600):
        tr_X = tr_X[5000:]
        tr_Y = tr_Y[5000:]
        break

2523
1
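
The retry logic in the cell above (shuffle, check that the validation slice is roughly balanced, otherwise bump the seed) can be sketched on toy data as follows. This is a simplified stand-in, not the notebook's exact loop: the function name `split_with_balance_check`, the toy reviews, and the bounds are assumptions, and it restarts from the original order on every attempt rather than reshuffling the previous shuffle.

```python
import random


def split_with_balance_check(texts, labels, seed, val_size, low, high):
    """Shuffle with seed, seed+1, ... until the validation slice is balanced.

    Returns (train_pairs, val_pairs, seed_used); each pair is (text, label).
    """
    attempt = 0
    while True:
        pairs = list(zip(texts, labels))
        random.seed(seed + attempt)
        random.shuffle(pairs)
        val_pairs = pairs[:val_size]
        # labels are 0/1, so the sum is the number of positive reviews
        positives = sum(label for _, label in val_pairs)
        if low < positives < high:
            return pairs[val_size:], val_pairs, seed + attempt
        attempt += 1


# Toy data: 100 fake reviews with alternating labels (purely illustrative).
toy_texts = ["review %d" % i for i in range(100)]
toy_labels = [i % 2 for i in range(100)]

train_pairs, val_pairs, seed_used = split_with_balance_check(
    toy_texts, toy_labels, seed=2903, val_size=20, low=6, high=14)
print(len(train_pairs), len(val_pairs), seed_used)
```

Because the seed is fixed, rerunning the split gives the same partition, which is what makes each student's outputs reproducible.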


In [4]:


print("Length of Train review set : " + str(len(tr_X)))
print("Length of Train label set : " + str(len(tr_Y)))
print("Length of Validation review set : " + str(len(val_X)))
print("Length of Validation label set : " + str(len(val_Y)))
print("Length of Test review set : " + str(len(te_X)))
print("Length of Test label set : " + str(len(te_Y)))
print("*****************************************")
print("Some sample Reviews Train sets and their labels")
print(tr_X[0][:150])
print(tr_Y[0])
print(tr_X[1][:150])
print(tr_Y[1])
print(tr_X[2][:150])
print(tr_Y[2])
print(tr_X[3][:150])
print(tr_Y[3])
print(tr_X[4][:150])
print(tr_Y[4])

Length of Train review set : 20000
Length of Train label set : 20000
Length of Validation review set : 5000
Length of Validation label set : 5000
Length of Test review set : 25000
Length of Test label set : 25000
*****************************************
Some sample Reviews Train sets and their labels
I'm watching this on the Sci-Fi channel right now. It's so horrible I can't stop watching it! I'm a Videographer and this movie makes me sad. I feel b
0
Hello, this little film is interesting especially for an artist, film-maker or music creator or a visual artist, for:<br /><br />One can feel and exam
1
(NOTE: I thought I'd be the only one writing what I did below, but I see the others here agree. I guess it was pretty obvious - this was overdoing the
0
This was the first televised episode of the Columbo series (although it was filmed after "Death Lends a Hand")and it heralded one of the most successf
1
Hawked as THE MOST OFFENSIVE MOVIE EVER, GUARANTEED TO OFFEND EVERYONE- Guess what? I

In [5]:
#we collect all the reviews from the train, validation and test sets to build the tokenizer vocabulary
texts = []
texts += tr_X 
texts += te_X 
texts += val_X
len(texts)



#we clip the sentence length to first 250 words. 
MAX_SEQUENCE_LENGTH = 250

#length of vocab, Tokenizer will only use vocab_len most common words
vocab_len = 25000

#we tokenize the texts and convert all the words to tokens
tokenizer = Tokenizer(num_words=vocab_len)
tokenizer.fit_on_texts(texts)

token_tr_X = tokenizer.texts_to_sequences(tr_X)
token_te_X = tokenizer.texts_to_sequences(te_X)
token_val_X = tokenizer.texts_to_sequences(val_X)

#to ensure all reviews have the same length, we pad the smaller reviews with 0, 
#and cut the larger reviews to a max length 
#(we clip from the top, as the end of the reviews generally have a conclusion which provides better features)
x_train = sequence.pad_sequences(token_tr_X, maxlen=MAX_SEQUENCE_LENGTH)
x_test = sequence.pad_sequences(token_te_X, maxlen=MAX_SEQUENCE_LENGTH)
x_val = sequence.pad_sequences(token_val_X, maxlen=MAX_SEQUENCE_LENGTH)


#changes the labels to one-hot encoding
y_train = np_utils.to_categorical(tr_Y)
y_test = np_utils.to_categorical(te_Y)
y_val = np_utils.to_categorical(val_Y)


In [6]:
print('X_train shape:', x_train.shape)
print('X_test shape:', x_test.shape)
print('X_val shape:', x_val.shape)

print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
print('y_val shape:', y_val.shape)


print("*****************************************")
print("Tokenized Reviews Train sets and their labels")
print(x_train[0][:20])
print(y_train[0])
print()
print(x_train[1][:20])
print(y_train[1])
print()
print(x_train[2][:20])
print(y_train[2])
print()
print(x_train[3][:20])
print(y_train[3])
print()
print(x_train[4][:20])
print(y_train[4])
print()

X_train shape: (20000, 250)
X_test shape: (25000, 250)
X_val shape: (5000, 250)
y_train shape: (20000, 2)
y_test shape: (25000, 2)
y_val shape: (5000, 2)
*****************************************
Tokenized Reviews Train sets and their labels
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 1.  0.]

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 5103   11  120   19]
[ 0.  1.]

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 1.  0.]

[ 7769  4941     3  2970     5    87     8     1   952   100    28  8848
 11649     3  1576   654     5 10285   325    10]
[ 0.  1.]

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 1.  0.]



********************************************

As you can see, the reviews have now been transformed into sequences of vocabulary indices and the labels have been converted to one-hot encoding. We can now feed these sequences to neural network models.
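
To make the preprocessing concrete, here is a plain-Python sketch of the three steps: rank-based word indices, left padding/truncation, and one-hot labels. It is a simplified imitation, not the Keras implementation: the helper names are made up, and it only lowercases and splits on whitespace, whereas the real `Tokenizer` also filters punctuation.

```python
from collections import Counter


def fit_vocab(texts, num_words):
    """Map words to rank-based indices (1 = most frequent); 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # only indices below num_words are kept, so rarer words are dropped
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}


def to_sequence(text, vocab):
    """Replace each known word by its index; unknown words are skipped."""
    return [vocab[word] for word in text.lower().split() if word in vocab]


def pad_pre(seq, maxlen):
    """Keep the last maxlen tokens and left-pad shorter sequences with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


def one_hot(label, num_classes=2):
    """Label 0 becomes [1.0, 0.0]; label 1 becomes [0.0, 1.0]."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


vocab = fit_vocab(["good good good movie movie bad"], num_words=3)
print(vocab)                                              # 'bad' is too rare to keep
print(pad_pre(to_sequence("bad good movie", vocab), 5))   # zeros on the left
print(pad_pre([1, 2, 3, 4, 5, 6], 4))                     # start of the review is cut
print(one_hot(0), one_hot(1))
```

Truncating from the front matches the comment in the preprocessing cell: the end of a review is kept because it usually carries the verdict.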

********************************************

# Part A

Building your first model (5 Points)

Construct this sequential model using Keras :

![title](img/model1.jpg)

In [7]:
print('Build model...')

## implement model here

model = Sequential()
model.add(Embedding(vocab_len, 128, input_length=MAX_SEQUENCE_LENGTH)) 
model.add(Flatten())
model.add(Dense(200, activation='relu'))
# model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

## compile it here according to instructions
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

print("Model Built")

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 250, 128)          3200000   
_________________________________________________________________
flatten_1 (Flatten)          (None, 32000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 200)               6400200   
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 402       
Total params: 9,600,602
Trainable params: 9,600,602
Non-trainable params: 0
_________________________________________________________________
Model Built
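
The parameter counts in the summary above can be checked by hand. The short calculation below reproduces them from the layer sizes used in this notebook; `dense_params` is a helper name introduced only for this check. The last line adds the extra 200-unit layer of Part B.

```python
VOCAB_LEN = 25000   # rows in the embedding table
EMBED_DIM = 128     # embedding size used in Parts A to C
SEQ_LEN = 250       # tokens per padded review


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


embedding = VOCAB_LEN * EMBED_DIM         # one vector per vocabulary index
flattened = SEQ_LEN * EMBED_DIM           # Flatten has no parameters of its own
hidden = dense_params(flattened, 200)
output = dense_params(200, 2)

total_part_a = embedding + hidden + output
total_part_b = total_part_a + dense_params(200, 200)

print(embedding, hidden, output)   # 3200000 6400200 402
print(total_part_a)                # 9600602
print(total_part_b)                # 9640802
```

Almost all trainable weights sit in the first dense layer, because flattening turns every review into a 32,000-dimensional input.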


In [8]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=32,
          epochs=4,
          validation_data=(x_val, y_val),
          verbose = 1,
         shuffle = True)

Train...
Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fed90ded940>

# Part B

Stacking Fully Connected Layers (5 points)

Construct this sequential model using Keras :

![title](img/model2.jpg)

In [9]:
print('Build model...')

## implement model here

model = Sequential()
model.add(Embedding(vocab_len, 128, input_length=MAX_SEQUENCE_LENGTH)) 
model.add(Flatten())
model.add(Dense(200, activation='relu'))
# model.add(Dropout(0.5))
model.add(Dense(200, activation='relu'))
# model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

## compile it here according to instructions
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

print("Model Built")

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 250, 128)          3200000   
_________________________________________________________________
flatten_2 (Flatten)          (None, 32000)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 200)               6400200   
_________________________________________________________________
dense_4 (Dense)              (None, 200)               40200     
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 402       
Total params: 9,640,802
Trainable params: 9,640,802
Non-trainable params: 0
_________________________________________________________________
Model Built


In [10]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=32,
          epochs=4,
          validation_data=(x_val, y_val),
          verbose = 1,
         shuffle = True)

Train...
Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fed88b2a8d0>

# Part C

Using LSTM-based networks (5 Points)

Construct this sequential model using Keras :

![title](img/model3.jpg)

In [11]:
print('Build model...')

## implement model here

model = Sequential()
model.add(Embedding(vocab_len, 128, input_length=MAX_SEQUENCE_LENGTH)) 
model.add(LSTM(128))
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

## compile it here according to instructions
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

print("Model Built")

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 250, 128)          3200000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_6 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 258       
Total params: 3,348,354
Trainable params: 3,348,354
Non-trainable params: 0
_________________________________________________________________
Model Built
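
The LSTM row in this summary follows from the standard LSTM parameter formula: four weight sets (three gates plus the candidate cell update), each with input weights, recurrent weights, and a bias. The sketch below checks the figures printed above; the helper names are introduced only for this check.

```python
def lstm_params(input_dim, units):
    """4 weight sets, each with (input_dim + units) * units weights and units biases."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


embedding = 25000 * 128
lstm = lstm_params(128, 128)
total_part_c = embedding + lstm + dense_params(128, 128) + dense_params(128, 2)

print(lstm)                    # 131584
print(total_part_c)            # 3348354
print(lstm_params(300, 300))   # 721200, the 300-unit LSTM used with GloVe vectors
```

Although the LSTM model has far fewer parameters than the flattened dense models, it has to step through the 250 positions of each review in order, which is why it trains more slowly.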


In [12]:

print('Train...')
model.fit(x_train, y_train,
          batch_size=32,
          epochs=5,
          validation_data=(x_val, y_val),
          verbose = 1,
         shuffle = True)

Train...
Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fed858f32e8>

# Part D

Adding Pretrained Word Embeddings (10 Points)

Construct this sequential model using Keras :

Correction: The Embedding Layer Dimension (1st box) is 300, not 128.

![title](img/model4.jpg)

In [7]:
import codecs

#dimension of Glove Embeddings.
EMBEDDING_DIM = 300

word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))

#load glove embeddings
gembeddings_index = {}
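# each line holds a word followed by the components of its vector
with codecs.open('glove.42B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        gembedding = np.asarray(values[1:], dtype='float32')
        gembeddings_index[word] = gembedding
# the with-block closes the file, so no explicit f.close() is needed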
print('G Word embeddings:', len(gembeddings_index))

# nb_words contains the total length of vocab
nb_words = len(word_index) +1

#get glove embeddings for each word in tokenizer.
#g_word_embedding_matrix holds the embeddings dictionary
g_word_embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))

for word, i in word_index.items():
    gembedding_vector = gembeddings_index.get(word)
    if gembedding_vector is not None:
        g_word_embedding_matrix[i] = gembedding_vector
        
#total words in the tokenizer not in Embedding matrix
print('G Null word embeddings: %d' % np.sum(np.sum(g_word_embedding_matrix, axis=1) == 0))



Found 124252 unique tokens
G Word embeddings: 1917494
G Null word embeddings: 35772
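
The matrix-building step can be illustrated on a toy vocabulary. The sketch below uses plain lists instead of NumPy and invented 3-dimensional vectors; the function name `build_embedding_matrix` and the sample words are assumptions for illustration only. Note that the all-zero count includes row 0, which is reserved for padding.

```python
def build_embedding_matrix(word_index, pretrained, dim):
    """Row i holds the pretrained vector of the word with index i, or zeros."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # +1 for index 0
    for word, i in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[i] = list(vector)
    return matrix


# Toy tokenizer index and toy "pretrained" vectors (made-up values).
word_index = {"movie": 1, "great": 2, "zzyzx": 3}
pretrained = {"movie": [0.1, 0.2, 0.3], "great": [0.4, 0.5, 0.6]}

matrix = build_embedding_matrix(word_index, pretrained, dim=3)
null_rows = sum(1 for row in matrix if sum(row) == 0)
print(null_rows)             # 2: the padding row and the unknown word

# Size of the real matrix in this notebook: (unique tokens + 1) rows of 300 values.
print((124252 + 1) * 300)    # 37275900
```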


In [8]:
print('Build model...')

## implement model here

model = Sequential()

## to use the GloVe embeddings, your embedding layer takes the vocab size as the input dimension,
## the GloVe embedding dimension as the output dimension,
## and you provide the embedding matrix as the 'weights' parameter (!important) to the embedding layer.

model.add(Embedding(nb_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, trainable=False, 
    weights=[g_word_embedding_matrix])) 
model.add(LSTM(EMBEDDING_DIM, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(2, activation='softmax'))

## compile it here according to instructions
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

print("Model Built")

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 250, 300)          37275900  
_________________________________________________________________
lstm_1 (LSTM)                (None, 300)               721200    
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 300)               90300     
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 602       
Total params: 38,088,002
Trainable params: 812,102
Non-trainable params: 37,275,900
_________________________________________________________________
Model Built


In [37]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=32,
          epochs=5,
          validation_data=(x_val, y_val),
          verbose = 1,
         shuffle = True)

Train...
Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fed43098ef0>

# Don't attempt this

Stacking LSTM layers

Unfortunately this takes very long to train, but be aware that we can stack LSTMs on top of each other like this.
This requires the bottom LSTM to return a sequence instead of a single vector; that sequence becomes the input for the top LSTM.


![title](img/model5.jpg)
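
To make the shape requirement concrete, here is a shape-only sketch in plain Python (no Keras involved). The helper `lstm_output_shape` and the layer sizes are assumptions for illustration, since the exact sizes in the diagram are not reproduced here. In Keras the corresponding switch is the `return_sequences=True` argument of the `LSTM` layer.

```python
def lstm_output_shape(input_shape, units, return_sequences=False):
    """Shape bookkeeping only; input_shape is (timesteps, features)."""
    timesteps, _features = input_shape
    if return_sequences:
        return (timesteps, units)   # one vector per time step
    return (units,)                 # only the final vector


embedded = (250, 300)   # 250 tokens, 300-dimensional embeddings (assumed sizes)

# Bottom layer must emit the whole sequence so the top layer has time steps to read.
bottom = lstm_output_shape(embedded, units=300, return_sequences=True)
top = lstm_output_shape(bottom, units=300)

print(bottom)   # (250, 300)
print(top)      # (300,)
```

If the bottom layer returned only its final vector, the top LSTM would receive a single vector per review with no time dimension left to iterate over.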

# Part E

Using Convolutional Networks (10 points)

Construct the model shown below. Use the same loss function and optimizer as before.

Correction: The Embedding Layer Dimension (1st box) is 300, not 128.

![title](img/model6.jpg)

In [46]:
print('Build model...')

## implement model here

model = Sequential()
model.add(Embedding(nb_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, trainable=False, 
                    weights=[g_word_embedding_matrix])) 
model.add(Convolution1D(filters=EMBEDDING_DIM, kernel_size=3))
model.add(Dropout(0.2))
model.add(Dense(EMBEDDING_DIM, activation='relu'))

model.add(Convolution1D(filters=int(EMBEDDING_DIM/2), kernel_size=3))
model.add(Dropout(0.2))
model.add(Dense(EMBEDDING_DIM, activation='relu'))

model.add(Convolution1D(filters=int(EMBEDDING_DIM/4), kernel_size=3))
model.add(Dropout(0.2))
model.add(Dense(EMBEDDING_DIM, activation='relu'))

model.add(Flatten())
model.add(Dense(2*EMBEDDING_DIM))
model.add(Dropout(0.2))
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(2, activation='softmax'))

## compile it here according to instructions
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

print("Model Built")

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_25 (Embedding)     (None, 250, 300)          37275900  
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 248, 300)          270300    
_________________________________________________________________
dropout_7 (Dropout)          (None, 248, 300)          0         
_________________________________________________________________
dense_17 (Dense)             (None, 248, 300)          90300     
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 246, 150)          135150    
_________________________________________________________________
dropout_8 (Dropout)          (None, 246, 150)          0         
_________________________________________________________________
dense_18 (Dense)             (None, 246, 300)          45300 
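
The shapes and parameter counts in the (truncated) summary above can be reproduced with a little arithmetic. The sketch assumes no padding around the input ('valid' padding, the Keras default for `Conv1D`) and a stride of 1; the helper names are made up for this check.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Sequence length after a convolution without padding."""
    return (length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Each filter spans kernel_size positions across all input channels, plus a bias."""
    return kernel_size * in_channels * filters + filters


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


print(conv1d_output_length(250, 3))   # 248
print(conv1d_output_length(248, 3))   # 246
print(conv1d_params(300, 300, 3))     # 270300
print(conv1d_params(300, 150, 3))     # 135150
print(dense_params(150, 300))         # 45300
```

Each kernel of size 3 shortens the sequence by two positions, which is why the length drops from 250 to 248 to 246.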

In [47]:

print('Train...')
model.fit(x_train, y_train,
          batch_size=32,
          epochs=5,
          validation_data=(x_val, y_val),
          verbose = 1,
         shuffle = True)

Train...
Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fec195c4550>

# Part F

Model constructed : (5 points)

Test Accuracy Over 87.5%: (5 Points)

Bonus: Min(10, Square of (test_score - 88%))

Create your best model. Use the validation score to choose it, then check its accuracy on the test set.


In [9]:
print('Build model...')

## implement model here

model = Sequential()

## to use the GloVe embeddings, your embedding layer takes the vocab size as the input dimension,
## the GloVe embedding dimension as the output dimension,
## and you provide the embedding matrix as the 'weights' parameter (!important) to the embedding layer.

model.add(Embedding(nb_words, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH, trainable=False, 
    weights=[g_word_embedding_matrix])) 
model.add(LSTM(EMBEDDING_DIM, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(2, activation='softmax'))

## compile it here according to instructions
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

print("Model Built")

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 250, 300)          37275900  
_________________________________________________________________
lstm_2 (LSTM)                (None, 300)               721200    
_________________________________________________________________
dropout_2 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 300)               90300     
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 602       
Total params: 38,088,002
Trainable params: 812,102
Non-trainable params: 37,275,900
_________________________________________________________________
Model Built


You can keep saving models under different names in `model_name`, so that you can retrieve their weights later for testing without retraining (you would still have to define the model again).

In [10]:
wt_dir = "./weights/"
model_name = 'model_best'
early_stopping =EarlyStopping(monitor='val_acc', patience=2)
bst_model_path = wt_dir + model_name + '.h5'
model_checkpoint = ModelCheckpoint(bst_model_path, monitor='val_acc', save_best_only=True, save_weights_only=True)

print('Train...')
model.fit(x_train, y_train,
          batch_size=32,
          epochs=7,
          validation_data=(x_val, y_val),
          verbose = 1,
         shuffle = True,
         callbacks=[early_stopping, model_checkpoint])



Train...
Train on 20000 samples, validate on 5000 samples
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7


<keras.callbacks.History at 0x7f6ec4aea438>

If you plan on using Ensemble averaging, feel free to edit the code below or add multiple models.

Make sure they get saved and can be retrieved when executing serially.
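
One simple form of ensemble averaging is to average the class probabilities of several models and take the arg-max of the mean. The sketch below uses hand-written probabilities for two hypothetical models; in practice each list would come from something like `model.predict(x_test)` on a separately trained model. All names and numbers here are illustrative assumptions.

```python
def average_predictions(per_model_probs):
    """Average [p_negative, p_positive] rows across models, sample by sample."""
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    averaged = []
    for i in range(n_samples):
        rows = [probs[i] for probs in per_model_probs]
        averaged.append([sum(column) / n_models for column in zip(*rows)])
    return averaged


def accuracy(probs, one_hot_labels):
    """Fraction of samples whose arg-max class matches the one-hot label."""
    correct = sum(1 for p, y in zip(probs, one_hot_labels)
                  if p.index(max(p)) == y.index(max(y)))
    return correct / len(one_hot_labels)


# Made-up outputs of two models on three reviews.
model_a = [[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]]
model_b = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
labels = [[0, 1], [0, 1], [1, 0]]

ensemble = average_predictions([model_a, model_b])
print(accuracy(model_a, labels))
print(accuracy(ensemble, labels))
```

In this toy case the first model gets one review wrong, while the averaged prediction gets all three right because the second model is confident where the first is not.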

In [11]:
model.load_weights(bst_model_path)
scores = model.evaluate(x_test, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 89.19%
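
The training cell pairs `EarlyStopping(patience=2)` with a checkpoint that keeps only the best weights, and the evaluation cell reloads those weights. The sketch below imitates that bookkeeping in plain Python. The validation accuracies are made up (the notebook output does not record them), and the function is a simplified stand-in, not the Keras callback code.

```python
def run_with_early_stopping(val_scores, patience):
    """Return (epoch at which training stops, epoch whose weights were kept)."""
    best = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            # improvement: a save-best-only checkpoint would be written here
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch


# Hypothetical validation accuracies for a 7-epoch budget.
made_up_val_acc = [0.85, 0.88, 0.89, 0.88, 0.87, 0.90, 0.91]
stopped_at, kept_epoch = run_with_early_stopping(made_up_val_acc, patience=2)
print(stopped_at, kept_epoch)   # 5 3
```

With these numbers training halts after epoch 5 even though later epochs would have improved, which is the trade-off a small patience value makes; the weights restored for testing are those from epoch 3.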


# Part G

Explain how Dense, LSTM and Convolution Layers work.

Explain how ReLU, Dropout, and Softmax work.

Analyze the architectures you constructed, with the accuracies you achieved and the training time it took. 

What are some insights you gained with these experiments? 

(5 Points)


* In a dense layer every node is connected to every node in the next layer [1]. A convolutional layer slides a set of learned filters over the input and applies a convolution operation at each position [2]. LSTM stands for "Long Short-Term Memory"; an LSTM layer is part of a recurrent neural network and is able to learn "long-term dependencies" [3].
* ReLU is the rectifier function, `g(x) = max(0, x)`, and serves as an activation function in a neural network (Goldberg, 44). Neural networks are prone to overfitting, which necessitates a form of regularization called "dropout training." Dropout training prevents the network from relying on specific weights and works by randomly setting a fraction of the neurons to zero for each training example; Goldberg describes dropping half of them (Goldberg, 47), while the dropout layers in Parts D and F use a rate of 0.2.
* In Part A, the embedding layer comes first because it provides the features; nothing can happen before it. The next step is to flatten its 2D output into a 1D vector [5]. This prepares the data for the following dense layer, which is fully connected and where the insights happen. The hidden dense layer needs an activation function; hence the rectified linear units. The final dense layer with dimension 2 maps its input to a binary output, which the softmax activation turns into two class probabilities that sum to 1. Part A took about 10 minutes to train. The big addition in Part B is a second fully connected dense layer; it took about the same amount of time to train. Part C involves a recurrent network, which took 40 minutes to train. It takes longer because the recurrent layer loops over the 250 positions of each review one after another. Part D adds pretrained word embeddings and dropout training to the previous network, making it the most accurate so far, with an average of 87.6%, but it took 2 hours and 18 minutes. I chose this architecture for Part F because of its high accuracy. Part E uses three 1-dimensional convolutional layers instead of the LSTM layer in Part D. This network took 2.5 hours to train but achieved only about 50% accuracy, which is chance level for two balanced classes and a clue that perhaps I implemented it incorrectly.
* I understand how basic neural networks work, and I can construct them in Python. I learned that neural networks, although they require parameter tuning, can learn features, thereby eliminating the need for feature engineering. I also learned to install CUDA, Keras, and the accompanying stack on my laptop, on which I did some of the training.
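
The three functions named in the question can be written in a few lines of plain Python. This is a minimal sketch for intuition, not how Keras implements them; in particular, the dropout shown here is the rescaling variant (often called inverted dropout), which is an assumption about the flavour worth illustrating rather than something stated in the cited text.

```python
import math
import random


def relu(values):
    """g(x) = max(0, x), applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shift = max(values)                       # subtracting the max avoids overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(values, rate, rng):
    """Training-time dropout: zero each unit with probability `rate`
    and rescale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


print(relu([-1.0, 0.0, 2.5]))
print(softmax([1.0, 2.0, 3.0]))
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0], rate=0.2, rng=random.Random(0)))
```

At test time dropout is switched off, so the rescaling during training keeps the expected activation the same in both modes.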

[1] `https://www.quora.com/Is-there-a-difference-between-hidden-layer-and-dense-layer-in-neural-networks` <br>
[2] `https://en.wikipedia.org/wiki/Convolutional_neural_network#Convolutional` <br>
[3] `http://colah.github.io/posts/2015-08-Understanding-LSTMs/` <br>
[4] Yoav Goldberg and Graeme Hirst. 2017. Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers. <br>
[5] `https://keras.io/layers/embeddings/`