Dylan Hastings

# 1. Sentiment analysis

Using the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), we want to do a regression model that predict the ratings are on a 1-10 scale. You have an example train and test set in the `dataset` folder.

### 1.1 Regression Model

Use a feedforward neural network and NLP techniques we've seen up to now to train the best model you can on this dataset



In [112]:
import glob
import numpy as np
from numpy.random import seed
import pandas as pd
import zipfile
import os
from os import listdir
import matplotlib.pyplot as plt
from datetime import datetime
import nltk
from nltk.corpus import stopwords
from sklearn.decomposition import PCA
import sklearn.feature_extraction.text as text
from sklearn.metrics import r2_score
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout, Input, Embedding, GRU, LSTM
from keras.callbacks import EarlyStopping
from tensorflow.random import set_seed
import random

In [2]:
files_train_neg = listdir('data/aclImdb/train/neg')
files_train_pos = listdir('data/aclImdb/train/pos')

files_test_neg = listdir('data/aclImdb/test/neg')
files_test_pos = listdir('data/aclImdb/test/pos')

In [3]:
def get_reviews(target, reviews, directory):
    x = []
    x_line = []

    for file in directory:
        with open(f'data/aclImdb/{target}/{reviews}/{file}', encoding='utf8') as opened_file:
            rating = file.split("_")[1].split(".")[0]
            for line in opened_file:
                x_line = []
                x_line.append(line)
                x_line.append(rating)
                x.append(x_line)
                
    return x

In [4]:
train_neg = pd.DataFrame(columns = ['review', 'rating'], data=get_reviews("train", "neg", files_train_neg))
train_pos = pd.DataFrame(columns = ['review', 'rating'], data=get_reviews("train", "pos", files_train_pos))

test_neg = pd.DataFrame(columns = ['review', 'rating'], data=get_reviews("test", "neg", files_test_neg))
test_pos = pd.DataFrame(columns = ['review', 'rating'], data=get_reviews("test", "pos", files_test_pos))

In [5]:
train_df = pd.concat([train_pos, train_neg], ignore_index=True)
test_df = pd.concat([test_pos, test_neg], ignore_index=True)

In [6]:
train_df

Unnamed: 0,review,rating
0,Bromwell High is a cartoon comedy. It ran at t...,9
1,Homelessness (or Houselessness as George Carli...,8
2,Brilliant over-acting by Lesley Ann Warren. Be...,10
3,This is easily the most underrated film inn th...,7
4,This is not the typical Mel Brooks film. It wa...,8
...,...,...
24995,"Towards the end of the movie, I felt it was to...",4
24996,This is the kind of movie that my enemies cont...,3
24997,I saw 'Descent' last night at the Stockholm Fi...,3
24998,Some films that you pick up for a pound turn o...,1


In [11]:
stop = stopwords.words('english')
pca = PCA(n_components = 1000)

In [12]:
df = train_df.sample(n=1000, random_state = 42).reset_index(drop=True)
df.rating = df.rating.astype('float')

In [15]:
df.review = df.review.apply(lambda t: " ".join([t for t in t.replace("<br /> ", "").lower().split(" ") if not t in stop]))

In [20]:
tf = text.TfidfVectorizer()
X = tf.fit_transform(df['review'])
X = X.toarray()

In [21]:
X = pca.fit_transform(X)

In [23]:
df['rev_tfidf'] = [x for x in X]
df.head(3)

Unnamed: 0,review,rating,rev_tfidf
0,panic streets richard widmark plays u.s. navy ...,8.0,"[0.09635181945360419, -0.06352997017281335, 0...."
1,ask first one really better one. look sarah m....,1.0,"[-0.0896921085977124, 0.05355772542352921, 0.0..."
2,big fan faerie tale theatre i've seen one best...,10.0,"[-0.11287535096311527, 0.040346814693369364, -..."


In [35]:
loss_stopper = EarlyStopping(monitor = 'loss', patience = 1)

In [29]:
model = Sequential()

model.add(Input(shape = X.shape[-1]))
model.add(Dropout(0.2))

model.add(Dense(50))
model.add(Dropout(0.2))

model.add(Dense(50))
model.add(Dropout(0.2))

model.add(Dense(1))

model.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics = ['accuracy'])

In [36]:
seed(42)
set_seed(42)
model.fit(x = X, y= df.rating, batch_size = 1, epochs = 25, callbacks = [loss_stopper])

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25


<tensorflow.python.keras.callbacks.History at 0x1fc85b59d30>

In [39]:
df_test = test_df.sample(n = 1000, random_state = 42).reset_index(drop = True)
df_test.rating = df_test.rating.astype('float')

In [40]:
Xt = tf.fit_transform(df_test['review'])
Xt = Xt.toarray()

In [41]:
Xt = pca.fit_transform(Xt)

In [42]:
preds = model.predict(Xt)
preds = preds.flatten()

In [46]:
r2_score(preds, df_test.rating.values)

-1.636244051395198

### 1.2 RNN model

Train a RNN to do the sentiment analysis regression. The RNN should consist simply of an embedding layer (to make word IDs into word vectors) a recurrent blocks (GRU or LSTM) feeding into an output layer.

In [47]:
def get_tag(token):
    tags = []
    
    for tag in nltk.pos_tag(token):
        tags.append(tag[1])
        
    return tags

In [83]:
df = train_df.sample(n = 1000, random_state = 42).reset_index(drop = True)
df.rating = df.rating.astype('float')

In [84]:
df['rev_token'] = df['review'].apply(lambda x: nltk.word_tokenize(x))

In [85]:
df

Unnamed: 0,review,rating,rev_token
0,In Panic In The Streets Richard Widmark plays ...,8.0,"[In, Panic, In, The, Streets, Richard, Widmark..."
1,If you ask me the first one was really better ...,1.0,"[If, you, ask, me, the, first, one, was, reall..."
2,I am a big fan a Faerie Tale Theatre and I've ...,10.0,"[I, am, a, big, fan, a, Faerie, Tale, Theatre,..."
3,I just finished reading a book about Dillinger...,1.0,"[I, just, finished, reading, a, book, about, D..."
4,Greg Davis and Bryan Daly take some crazed sta...,2.0,"[Greg, Davis, and, Bryan, Daly, take, some, cr..."
...,...,...,...
995,"According to IMDb, as well as to every other w...",4.0,"[According, to, IMDb, ,, as, well, as, to, eve..."
996,In Cold Blood was one of several 60s films tha...,4.0,"[In, Cold, Blood, was, one, of, several, 60s, ..."
997,I work in a library and expected to like this ...,7.0,"[I, work, in, a, library, and, expected, to, l..."
998,"This is one of the first films I can remember,...",7.0,"[This, is, one, of, the, first, films, I, can,..."


In [86]:
def make_lexicon(token_seqs, min_freq=1):
    '''Create a lexicon for the words in the sentences as well as the tags'''
    # First, count how often each word appears in the text.
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Indices start at 1. 0 is reserved for padding, and 1 is reserved for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))
    
    return lexicon

print("WORDS:")
words_lexicon = make_lexicon(df['rev_token'])

WORDS:
LEXICON SAMPLE (22518 total items):
{'In': 2, 'Panic': 3, 'The': 4, 'Streets': 5, 'Richard': 6, 'Widmark': 7, 'plays': 8, 'U.S.': 9, 'Navy': 10, 'doctor': 11, 'who': 12, 'has': 13, 'his': 14, 'week': 15, 'rudely': 16, 'interrupted': 17, 'with': 18, 'a': 19, 'corpse': 20, 'that': 21}


In [87]:
'''Make a dictionary where the string representation of a lexicon item can be retrieved from its numerical index'''

def get_lexicon_lookup(lexicon):
    '''Make a dictionary where the string representation 
        of a lexicon item can be retrieved 
        from its numerical index
    '''
    lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
    print("LEXICON LOOKUP SAMPLE:")
    print(dict(list(lexicon_lookup.items())[:20]))
    return lexicon_lookup

def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs

df['Sentence_Idxs'] = tokens_to_idxs(df['rev_token'], words_lexicon)
df[['rev_token', 'Sentence_Idxs']][:10]

Unnamed: 0,rev_token,Sentence_Idxs
0,"[In, Panic, In, The, Streets, Richard, Widmark...","[2, 3, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
1,"[If, you, ask, me, the, first, one, was, reall...","[215, 259, 260, 261, 32, 155, 262, 65, 178, 26..."
2,"[I, am, a, big, fan, a, Faerie, Tale, Theatre,...","[95, 315, 19, 314, 316, 19, 317, 318, 319, 56,..."
3,"[I, just, finished, reading, a, book, about, D...","[95, 276, 362, 363, 19, 364, 238, 365, 24, 165..."
4,"[Greg, Davis, and, Bryan, Daly, take, some, cr...","[420, 421, 56, 422, 423, 424, 72, 425, 426, 67..."
5,"[This, really, is, an, incredible, film, ., No...","[165, 178, 148, 51, 495, 156, 24, 496, 398, 11..."
6,"[If, you, lived, through, the, 60s, ,, this, f...","[215, 259, 610, 611, 32, 612, 45, 162, 156, 96..."
7,"[As, a, writer, I, find, films, this, bad, mak...","[25, 19, 657, 95, 54, 658, 162, 659, 660, 115,..."
8,"[I, 'm, 14, years, old, and, I, love, this, ca...","[95, 203, 746, 747, 345, 56, 95, 350, 162, 748..."
9,"[This, film, would, usually, classify, as, the...","[165, 156, 175, 791, 792, 106, 32, 311, 290, 4..."


In [88]:
from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; 
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

max_seq_len = max([len(idx_seq) for idx_seq in df['Sentence_Idxs']]) # Get length of longest sequence
train_padded_words = pad_idx_seqs(df['Sentence_Idxs'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1

print("WORDS:\n", train_padded_words)
print("SHAPE:", train_padded_words.shape, "\n")

WORDS:
 [[   0    0    0 ...   80  258   24]
 [   0    0    0 ...  314  280   24]
 [   0    0    0 ...  213  361   24]
 ...
 [   0    0    0 ...  417  263   24]
 [   0    0    0 ...  152 7135 1172]
 [   0    0    0 ...   56 6176   24]]
SHAPE: (1000, 1458) 



In [89]:
'''Create the model'''

def create_model(seq_input_len, n_input_nodes, n_embedding_nodes,
                 n_hidden_nodes, stateful=False, batch_size=20):
    
    #Layer 1
    input_layer = Input(shape=(None,))
    
    # Layer 2
    embedding_layer = Embedding(input_dim=n_input_nodes,
                                output_dim=n_embedding_nodes,
                                mask_zero=True)(input_layer) #mask_zero tells the model to ignore 0 values (padding)
    #Output shape = (batch_size, input_matrix_length, n_embedding_nodes)
    
    # Layer 3
    gru_layer = GRU(units=n_hidden_nodes)(embedding_layer)
    #Output shape = (batch_size, n_hidden_nodes)
    #Layer 4
    
    output_layer = Dense(units=1)(gru_layer)
    #Output shape = (batch_size, 1)
    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[input_layer], outputs=output_layer)
    model.compile(loss="mean_squared_error", optimizer='adam')
    
    return model

In [90]:
model = create_model(seq_input_len=train_padded_words.shape[-1] - 1, #substract 1 from matrix length because of offset
                     n_input_nodes=len(words_lexicon) + 1, #Add one for 0 padding
                     n_embedding_nodes=300,
                     n_hidden_nodes=500)

In [91]:
model.fit(x = train_padded_words[:,1:],
          y = df.rating,
          batch_size=20,
          epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1fceac251c0>

In [95]:
test_df = pd.concat([test_pos, test_neg], ignore_index = True)

In [97]:
test_df = test_df.sample(n=1000, random_state = 42).reset_index(drop = True)
test_df['rev_token'] = test_df['review'].apply(lambda x: nltk.word_tokenize(x))

In [98]:
test_rev_lexicon = make_lexicon(test_df['rev_token'])

test_df['Sentence_Idxs'] = tokens_to_idxs(test_df['rev_token'], test_rev_lexicon)

LEXICON SAMPLE (22283 total items):
{'Wow': 2, '!': 3, 'What': 4, 'a': 5, 'movie': 6, 'if': 7, 'you': 8, 'want': 9, 'to': 10, 'blow': 11, 'your': 12, 'budget': 13, 'on': 14, 'the': 15, 'title': 16, 'and': 17, 'have': 18, 'it': 19, 'look': 20, 'real': 21}


In [99]:
max_seq_len = max([len(idx_seq) for idx_seq in test_df['Sentence_Idxs']])

test_padded_words = pad_idx_seqs(test_df['Sentence_Idxs'], max_seq_len + 1)

In [100]:
preds = model.predict(test_padded_words)

In [101]:
r2_score(preds, test_df.rating)

-6.174605884589652

# 2. (evil) XOR Problem

Train an LSTM to solve the XOR problem: that is, given a sequence of bits, determine its parity. The LSTM should consume the sequence, one bit at a time, and then output the correct answer at the sequence’s end. Test the two approaches below:

### 2.1 

Generate a dataset of random <=100,000 binary strings of equal length <= 50. Train the LSTM; what is the maximum length you can train up to with precisison?


In [131]:
SEQ_LEN = 50
COUNT = 100_000

In [132]:
bin_pair = lambda x: [x, not(x)]
training = np.array([[bin_pair(random.choice([0, 1])) for _ in range(SEQ_LEN)] for _ in range(COUNT)])
target = np.array([[bin_pair(x) for x in np.cumsum(example[:,0]) % 2] for example in training])
print('shape check:', training.shape, '=', target.shape)

shape check: (100000, 50, 2) = (100000, 50, 2)


In [133]:
model = Sequential()

model.add(Input(shape = (SEQ_LEN, 2), dtype = 'float32'))

model.add(LSTM(1, return_sequences = True))

model.add(Dense(2, activation = 'softmax'))

In [134]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.fit(training, target, epochs = 10, batch_size = 128)
model.summary()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 50, 1)             16        
_________________________________________________________________
dense_9 (Dense)              (None, 50, 2)             4         
Total params: 20
Trainable params: 20
Non-trainable params: 0
_________________________________________________________________


In [135]:
predictions = model.predict(training)
i = random.randint(0, COUNT)
chance = predictions[i, -1, 0]
print('randomly selected sequence:', training[i, :, 0])
print('prediction:', int(chance > 0.5))
print('confidence: {:0.2f}%'.format((chance if chance > 0.5 else 1 - chance) * 100))
print('actual:', np.sum(training[i, :, 0]) %2)

randomly selected sequence: [0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0
 0 1 1 1 0 0 1 1 0 0 0 1 0]
prediction: 1
confidence: 99.98%
actual: 1


### 2.2

Generate a dataset of random <=200,000 binary strings, where the length of each string is independently and randomly chosen between 1 and 50. Train the LSTM. Does it succeed? What explains the difference?

In [126]:
COUNT = 200_000

In [127]:
bin_pair = lambda x: [x, not(x)]
training = np.array([[bin_pair(random.choice([0, 1])) for _ in range(SEQ_LEN)] for _ in range(COUNT)])
target = np.array([[bin_pair(x) for x in np.cumsum(example[:,0]) % 2] for example in training])
print('shape check:', training.shape, '=', target.shape)

shape check: (200000, 50, 2) = (200000, 50, 2)


In [128]:
model = Sequential()

model.add(Input(shape = (SEQ_LEN, 2), dtype = 'float32'))

model.add(LSTM(1, return_sequences = True))

model.add(Dense(2, activation = 'softmax'))

In [129]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.fit(training, target, epochs = 10, batch_size = 128)
model.summary()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 50, 1)             16        
_________________________________________________________________
dense_8 (Dense)              (None, 50, 2)             4         
Total params: 20
Trainable params: 20
Non-trainable params: 0
_________________________________________________________________


In [130]:
predictions = model.predict(training)
i = random.randint(0, COUNT)
chance = predictions[i, -1, 0]
print('randomly selected sequence:', training[i, :, 0])
print('prediction:', int(chance > 0.5))
print('confidence: {:0.2f}%'.format((chance if chance > 0.5 else 1 - chance) * 100))
print('actual:', np.sum(training[i, :, 0]) %2)

randomly selected sequence: [1 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1
 0 1 1 1 0 1 0 0 0 0 1 0 0]
prediction: 1
confidence: 100.00%
actual: 1


With a dataset of 200,000 binary strings, the model now predicts with 100% confidence.