## Movie review sentiment classifier implemented using a bidirectional LSTM

Run this code as a Jupyter notebook at notebooks.csc.fi.

Solve Tasks 1 – 4 below by adding necessary Python code and answering follow-up questions. The tasks combined are worth 5 points.

You run the code in the cells by selecting a cell and pressing Ctrl-Enter. If a program is split into multiple cells (as it is below), later cells still "remember" the values of all variables and other state set in earlier cells. You need to run the cells one by one, in the right order.

The split into multiple cells is practical, because you can modify the code in some cell(s) and _only rerun the affected cells_, which can save a lot of time. Remember, however, that when you modify a cell that comes before other cells, you typically need to rerun all the cells that follow the modified cell. Otherwise your changes won't be reflected in the later cells.

In [1]:
'''Trains a Bidirectional LSTM on the IMDB sentiment classification task.
Output after 4 epochs on CPU: ~0.8146
Time per epoch on CPU (Core i7): ~150s.
'''

from __future__ import print_function
import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from tensorflow.keras.datasets import imdb


max_features = 20000 # (Use the "max_features" most common words as features)
maxlen = 100
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences


**Task 1** (1 point):

In the cell below, write code that prints the first review in the training set and the first review in the test set as plain numeric vectors (= lists of numbers).

Add a comment where you explain how information is encoded in these vectors.

Hint: https://keras.io/api/datasets/imdb/

In [3]:
# Answer to Task 1 goes here:
print(x_train[0])
print(x_test[0])

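# Each review is a list of integers, one per word, in the order the words appear in the text.
# Each integer identifies a word by its frequency rank in the dataset (shifted by a small offset),
# so smaller numbers mean more frequent words. The smallest values are reserved markers:
# 1 marks the start of a review and 2 replaces any word outside the max_features most common ones.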

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
[1, 591, 202, 14, 31, 6, 717, 10, 10, 18142, 106
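The integer lists above can be easier to interpret with a toy version of the same encoding. The sketch below is not the code Keras runs. It is a minimal illustration that assumes the documented defaults of `imdb.load_data` (start marker 1, out-of-vocabulary marker 2, real words offset by 3). The helper names `build_word_index` and `encode` and the three-sentence corpus are made up for this example.

```python
from collections import Counter

# Reserved indices, assumed to match the defaults of imdb.load_data
PAD, START, UNK = 0, 1, 2
INDEX_FROM = 3  # real words are shifted past the reserved range


def build_word_index(texts):
    """Rank words by frequency: the most common word gets rank 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index, num_words):
    """Turn a text into a list of integers, one per word, after a start marker."""
    ids = [START]
    for word in text.split():
        if word in word_index and word_index[word] + INDEX_FROM < num_words:
            ids.append(word_index[word] + INDEX_FROM)
        else:
            # Unseen words and words too rare to fit under num_words become UNK
            ids.append(UNK)
    return ids


toy_corpus = ["the film was good", "the film was bad", "the plot was the problem"]
toy_index = build_word_index(toy_corpus)
encoded = encode("the film was surprisingly good", toy_index, num_words=7)
print(encoded)  # [1, 4, 6, 5, 2, 2]
```

Here "the" is the most frequent word, so it gets the smallest non-reserved index (4). Both "surprisingly" (never seen) and "good" (too rare for `num_words=7`) collapse to the unknown marker 2, which is why 2 shows up inside the real reviews as well.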

**Task 2** (1.5 points):

In the cell below, write code that prints the first review in the training set and the first review in the test set as strings of words rather than numeric values. You should be able to read English text if you do this correctly. Additionally, print for each of the two reviews what sentiment they have been tagged with in the data (positive or negative).

Hint: https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset

In [19]:
# Answer to Task 2 goes here:
word_to_id = imdb.get_word_index()
INDEX_FROM = 3

word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items() if v + INDEX_FROM < max_features}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

id_to_word = {value:key for key,value in word_to_id.items()}

labels_to_sentiment = ["negative", "positive"] # the list position matches the label: 0 = negative, 1 = positive

print('First sentence in training set:', ' '.join(id_to_word[word_id] for word_id in x_train[0]))
print('First training sentence is tagged as:', labels_to_sentiment[y_train[0]])
print()
print('First sentence in testing set:', ' '.join(id_to_word[word_id] for word_id in x_test[0]))
print('First testing sentence is tagged as:', labels_to_sentiment[y_test[0]])

First sentence in training set: cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
First training sentence is tagged as: positive

First sentence in testing set: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madis

In [20]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Pad sequences (samples x time)
x_train shape: (25000, 100)
x_test shape: (25000, 100)
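To see what happens to an individual review in the padding step, here is a plain-Python sketch of the behaviour. It assumes the default settings of `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`). The function name `pad_or_truncate` is made up for this illustration and is not part of Keras.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly maxlen items (maxlen > 0), keeping the end of seq."""
    if len(seq) >= maxlen:
        # Too long: drop tokens from the beginning, keep the last maxlen tokens
        return list(seq[-maxlen:])
    # Too short: fill with pad_value at the beginning
    return [pad_value] * (maxlen - len(seq)) + list(seq)


long_review = [1, 14, 22, 16, 43, 530]
short_review = [1, 591, 202]

print(pad_or_truncate(long_review, maxlen=4))   # [22, 16, 43, 530]
print(pad_or_truncate(short_review, maxlen=4))  # [0, 1, 591, 202]
```

Note that a truncated review loses its start marker (1), while a padded review keeps it after the zeros.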


**Task 3** (0.5 points): In the cell below, write code that again prints the first review in the training set and the first review in the test set as strings of words. 

Add a comment where you explain what has happened to the reviews since Task 2 and why this is necessary.

In [21]:
# Answer to Task 3 goes here:

print('First sentence in training set:', ' '.join(id_to_word[word_id] for word_id in x_train[0]))
print()
print('First sentence in testing set:', ' '.join(id_to_word[word_id] for word_id in x_test[0]))

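# Both reviews are now exactly 100 tokens long (maxlen). By default, pad_sequences cuts reviews that
# are longer than maxlen from the beginning, so only the last 100 tokens of the first review remain.
# Shorter reviews are filled with <PAD> (0) at the beginning, as in the second review. This is
# necessary because the network processes the reviews in batches stored as fixed-shape arrays,
# so every review must have the same length.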

First sentence in training set: cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

First sentence in testing set: <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he 

In [22]:
y_train = np.array(y_train)
y_test = np.array(y_test)

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# it is possible to use different optimizers and different optimizer configs
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
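To get a feeling for the size of this model, you can count its trainable weights by hand. The sketch below is a back-of-the-envelope calculation, assuming Keras defaults (bias terms enabled, and `Bidirectional` concatenating the forward and backward outputs). You can compare the result with what `model.summary()` reports.

```python
max_features = 20000   # vocabulary size
embedding_dim = 128    # size of each word vector
lstm_units = 64        # hidden units per LSTM direction

# Embedding: one vector of size embedding_dim per word in the vocabulary
embedding_params = max_features * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Bidirectional: one LSTM reading forwards and one reading backwards
bidirectional_params = 2 * lstm_params

# Dense: the concatenated 2 * lstm_units outputs feed one sigmoid unit, plus a bias
dense_params = 2 * lstm_units * 1 + 1

# Dropout has no trainable weights
total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
# 2560000 98816 129 2658945
```

Most of the weights sit in the embedding layer, which is why `max_features` has such a large effect on model size.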

The actual model training takes place in the cell below. This is rather slow, so avoid running this too often.

Make sure that your code in the cells above is in order, and then you can run the training. If you modify the code above, you probably need to rerun the training as well: 

In [23]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=4, # number of full passes over the training data
          validation_data=(x_test, y_test))

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f918a9e7b90>
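Part of the reason training is slow is the number of weight updates involved. As a rough sketch, assuming one update per batch and a smaller final batch when the data does not divide evenly, the settings used above give:

```python
import math

num_train_samples = 25000
batch_size = 32
epochs = 4

# Each epoch walks through the whole training set once, batch by batch
steps_per_epoch = math.ceil(num_train_samples / batch_size)
last_batch_size = num_train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)  # 782 8 3128
```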

After you have a properly trained model, you can use it to predict the sentiment on some movie review data that you invent yourself:

**Task 4** (2 points): Apply the trained model on some test reviews that you invent yourself. Add code to the cell below to make the neural network predict the sentiment of your reviews (or the ones already suggested below).

Add a comment in which you discuss how well the prediction works.

Hint: You can use the predict_on_batch method (https://www.tensorflow.org/api_docs/python/tf/keras/Sequential?hl=en#predict_on_batch)

In [31]:
test_reviews = [ "<START> this was an awesome movie with all of my favorite actors",
                 "<START> i fell asleep during the first minute of this film",
                 "<START> i was not convinced by this movie but i still liked parts of it",
                 "<START> the story was a bit too sentimental for my taste",
                 "<START> aki karismäki is phenomenal",
                 "<START> i was surprised how weird it was",
                 "<START> best movie ever",
                 "<START> such a good movie but i was so sad about the ending",
                 "<START> the movie was horribly awesome",
                 "<START> the worst movie i have ever seen waste of time do not watch it",]

# Answer to Task 4 goes here:

# Convert words to numbers; words that are not in the vocabulary become <UNK>
x_pred = [[word_to_id.get(w, word_to_id["<UNK>"]) for w in test_review.split()] for test_review in test_reviews]

# Pad and truncate sentences to be 100 tokens exactly
x_pred_padded = sequence.pad_sequences(x_pred, maxlen=maxlen)

# Print each sentence and its predicted output label
for sentiment, review in zip(model.predict_on_batch(x_pred_padded), test_reviews):
    print("Predicted degree of positiveness", float(sentiment[0]), "for review:", review)
    
# The prediction works best for reviews that express a single sentiment, for example the first or the last
# review. When we 'mix feelings' in a review, as in the 9th review, the predictions aren't as precise. It also
# helps when a review is longer than just three words (compare the 1st and 7th reviews). Predicting sentiment
# also becomes harder if we don't use typically 'bad' or 'good' words, as in the 6th review.

Predicted degree of positiveness 0.958768367767334 for review: <START> this was an awesome movie with all of my favorite actors
Predicted degree of positiveness 0.08071503043174744 for review: <START> i fell asleep during the first minute of this film
Predicted degree of positiveness 0.5852519869804382 for review: <START> i was not convinced by this movie but i still liked parts of it
Predicted degree of positiveness 0.6277394890785217 for review: <START> the story was a bit too sentimental for my taste
Predicted degree of positiveness 0.873368501663208 for review: <START> aki karismäki is phenomenal
Predicted degree of positiveness 0.7664064168930054 for review: <START> i was surprised how weird it was
Predicted degree of positiveness 0.7850537300109863 for review: <START> best movie ever
Predicted degree of positiveness 0.09434488415718079 for review: <START> such a good movie but i was so sad about the ending
Predicted degree of positiveness 0.5303440093994141 for review: <START> th
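The sigmoid output is a score between 0 and 1, not a label. One common way to read it, sketched below, is to call everything at or above 0.5 positive and to treat the distance from 0.5 as a rough indication of how decided the model is. The 0.5 threshold and the helper names are assumptions made for this illustration. The scores are the nine values visible in the output above, rounded to three decimals.

```python
# Scores copied (rounded) from the output above, in the order of the reviews
scores = [0.959, 0.081, 0.585, 0.628, 0.873, 0.766, 0.785, 0.094, 0.530]


def to_label(score, threshold=0.5):
    """Map a sigmoid score to a sentiment label using a fixed threshold."""
    return "positive" if score >= threshold else "negative"


def margin(score, threshold=0.5):
    """Distance from the threshold: small values mean an undecided prediction."""
    return abs(score - threshold)


labels = [to_label(s) for s in scores]
for number, (score, label) in enumerate(zip(scores, labels), start=1):
    print(f"Review {number}: {label} (score {score:.3f}, margin {margin(score):.3f})")

# The review whose score lies closest to the threshold
most_uncertain = min(range(len(scores)), key=lambda i: margin(scores[i])) + 1
print("Least decided review:", most_uncertain)  # 9
```

Read this way, the mixed-feelings review 9 is the least decided one, while reviews 1, 2 and 8 get the most clear-cut scores.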

When you are done, download this file as a Jupyter notebook (File -> Download as -> Notebook (.ipynb)). Submit the notebook file on Moodle.

PS: It is also a good idea to **take backups of your notebook** by downloading it regularly onto your own computer. The environment will expire after 10 hours and then you will lose your code. If there are connection problems, you may be disconnected even earlier. So remember to take backups!