# Recurrent Neural Networks
## for sentiment analysis

Sources:
- Keras.io documentation
- "Sentiment Analysis using Keras & LSTM" @ Medium

Title: Bidirectional LSTM on IMDB

Description: Train a 2-layer bidirectional LSTM on the IMDB movie review sentiment classification dataset.

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

In [None]:
from keras import Input
from keras.datasets import imdb
from keras.layers import BatchNormalization, Bidirectional, Dense, Dropout, Embedding, Flatten, LSTM
from keras.models import Sequential
from keras.optimizers import Adam, SGD, RMSprop
from keras.regularizers import l1, l2
from keras.utils import plot_model, to_categorical, pad_sequences

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Setup

In [None]:
top_words = 5000  # Only consider the top 5k words
input_length = 200  # Pad or truncate each review to 200 words (pad_sequences keeps the last 200 by default)

In [None]:
def plot_history(train_hist):
    pd.DataFrame(train_hist).plot(figsize=(8, 5))
    plt.grid(True)
    plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
    plt.show()

### Loading dataset

In [None]:
## Load the IMDB movie review sentiment data
# Load the data, already split between train and test sets
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=top_words)

In [None]:
print(f"Train size: {x_train.shape[0]}")
print(f"Test size:  {x_test.shape[0]}")

Labels:
- 0 (negative review) or
- 1 (positive review).

In [None]:
np.unique(y_train)

In [None]:
REVIEW_INDEX = 2

print(x_train[REVIEW_INDEX])

Each review is a sequence of word indexes. The model needs inputs of equal length, so we pad (or truncate) every sequence to 200 words (input_length). Keras provides the pad_sequences function for this:

In [None]:
# Use pad_sequences to standardize sequence length:
# sequences longer than 200 words are truncated and shorter ones are zero-padded.
# By default both happen at the start, so the last 200 words of a long review are kept.
x_train = pad_sequences(x_train, maxlen=input_length)
x_test = pad_sequences(x_test, maxlen=input_length)
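
To make the default behaviour concrete, here is a minimal pure-Python sketch of what pad_sequences does to a single sequence with its default settings (padding and truncating both set to 'pre'). The helper name `pad_pre` and the toy sequences are made up for illustration; this is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper mimicking the default pad_sequences behaviour for one sequence."""
    seq = list(seq)[-maxlen:]                    # truncating='pre': drop tokens from the front
    return [value] * (maxlen - len(seq)) + seq   # padding='pre': zeros are added in front

short_demo = pad_pre([1, 14, 22], maxlen=5)
long_demo = pad_pre([1, 14, 22, 16, 43, 530, 973], maxlen=5)

print(short_demo)  # [0, 0, 1, 14, 22]
print(long_demo)   # [22, 16, 43, 530, 973]
```

Short sequences gain leading zeros, and long sequences lose their first tokens, which is why the padded reviews printed in this notebook may start with a run of 0s.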

In [None]:
print(x_train[REVIEW_INDEX])

Next, we build a word_to_id dictionary so that these indexes can be mapped back to words for inspection. The dataset reserves index 0 for the PAD token, index 1 for the START token, and index 2 for the UNK (unknown word) token, so the default word indexes have to be shifted by 3 to make room for them.

In [None]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

After building word_to_id, we invert it to get id_to_word:

In [None]:
id_to_word = {idx:word for word, idx in word_to_id.items()}
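
The shift by 3 can be checked on a toy vocabulary. The sketch below uses a made-up three-word index in place of imdb.get_word_index() (the names `toy_raw_index`, `toy_word_to_id` and `toy_id_to_word` are hypothetical) and decodes a short padded sequence.

```python
# Toy stand-in for imdb.get_word_index(): ranks start at 1 for the most frequent word
toy_raw_index = {"the": 1, "and": 2, "film": 3}

# Shift every rank by 3 to free indexes 0, 1 and 2 for the special tokens
toy_word_to_id = {word: rank + 3 for word, rank in toy_raw_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})

# Invert the mapping so indexes can be decoded back into words
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

encoded = [0, 0, 1, 4, 6, 2]
decoded = " ".join(toy_id_to_word[i] for i in encoded)
print(decoded)  # <PAD> <PAD> <START> the film <UNK>
```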

**Conversion examples**:

In [None]:
id_to_word[20]

In [None]:
decoded_text = [id_to_word[idx] for idx in x_train[REVIEW_INDEX]]  # idx avoids shadowing the built-in id()

print(" ".join(decoded_text))

### Build the model

In [None]:
embedding_vector_length = 32 

In [None]:
### Create a LSTM model for sentiment (binary) classification
### YOUR CODE HERE:

model = Sequential([
    Input((...,)),
    Embedding(top_words, embedding_vector_length),
    ...
    ...
    ...
]) 

In [None]:
model.summary()
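
As a sanity check on the summary, the Embedding layer stores one vector per vocabulary entry, so its parameter count should equal the vocabulary size times the vector length. A quick check, assuming the values from the setup cells (5000 words, 32 dimensions):

```python
# Values copied from the setup cells above
vocab_size = 5000      # top_words
vector_length = 32     # embedding_vector_length

# One trainable vector of length vector_length per vocabulary index
embedding_params = vocab_size * vector_length
print(embedding_params)  # 160000
```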

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) 

In [None]:
epochs = 6
batch_size = 64

trained = model.fit(x_train, y_train, 
                    validation_data=(x_test, y_test),
                    epochs=epochs, batch_size=batch_size)

### Evaluate the model

In [None]:
plot_history(trained.history)

### Predict sentiment

In [None]:
REVIEW_INDEX = 1

In [None]:
decoded_text = [id_to_word[idx] for idx in x_train[REVIEW_INDEX]]  # idx avoids shadowing the built-in id()

print(" ".join(decoded_text))

In [None]:
my_review = np.array([x_train[REVIEW_INDEX]])

my_review

In [None]:
### Classify this review (`my_review`) in terms of sentiment
### YOUR CODE HERE:

...
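
Once a prediction is available, it still has to be turned into a label. The sketch below assumes the model ends in a single sigmoid unit (consistent with the binary_crossentropy loss and the 0/1 labels above), so each prediction is a probability of a positive review. The helper `label_from_probability` and the example probabilities are hypothetical, not model outputs.

```python
def label_from_probability(prob, threshold=0.5):
    """Hypothetical helper: map a sigmoid output to a sentiment label."""
    return "positive" if prob >= threshold else "negative"

# Made-up probabilities, used only to show the thresholding
print(label_from_probability(0.91))  # positive
print(label_from_probability(0.12))  # negative
```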

In [None]:
### Now take the following review text, transform it into model input, and predict its sentiment

my_review_text = "One of the finest films made in recent years."

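# imdb.load_data prepends the START index to every review, so we do the same here
my_review_vec = [word_to_id["<START>"]]
for word in my_review_text.lower().split():
    word = word.strip(".,!?")  # drop surrounding punctuation
    idx = word_to_id.get(word, word_to_id["<UNK>"])  # unseen words become <UNK>
    # Indexes outside the top_words vocabulary are also <UNK>, as in the training data;
    # larger indexes would fall outside the Embedding layer's range
    my_review_vec.append(idx if idx < top_words else word_to_id["<UNK>"])

my_review_vec = pad_sequences([my_review_vec], maxlen=input_length)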


### YOUR CODE HERE:

...