## RNN models for text data

We analyse here data from the Internet Movie Database (IMDB: https://www.imdb.com/).

We use a recurrent neural network (RNN) to build a classifier for movie reviews: given the text of a review, the model predicts whether the review is positive or negative.

#### Steps

1. Load the dataset (50K IMDB Movie Review)
2. Clean the dataset
3. Encode the data
4. Split into training and testing sets
5. Tokenize and pad/truncate reviews
6. Build the RNN model
7. Train the model
8. Test the model
9. Applications


In [None]:
## import relevant libraries

import re
import nltk
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import itertools
import matplotlib.pyplot as plt

from scipy import stats
from keras.datasets import imdb

from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model

nltk.download('stopwords')

#### Reading the data

We read an extract of the IMDB reviews hosted on GitHub:

In [None]:
DATAURL = 'https://raw.githubusercontent.com/hansmichaels/sentiment-analysis-IMDB-Review-using-LSTM/master/IMDB%20Dataset.csv'

In [None]:

data = pd.read_csv(DATAURL)
print(data)

In [None]:
## alternative way of getting the data, already preprocessed
# (X_train,Y_train),(X_test,Y_test) = imdb.load_data(path="imdb.npz",num_words=None,skip_top=0,maxlen=None,start_char=1,seed=13,oov_char=2,index_from=3)

#### Preprocessing

The original reviews are "dirty": they contain HTML tags, punctuation, uppercase letters, stop words, etc., which are not useful for model training.
We therefore need to clean the dataset.

**Stop words** are very common words that are usually ignored in the analysis (e.g. "the", "a", "an", "of").
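
The cleaning steps can be illustrated on a single toy sentence. This is a minimal sketch using only the standard library; `TOY_STOPS` and `clean_review` are hypothetical names introduced here, and the tiny stop-word set is an assumption standing in for the much longer NLTK list.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer).
TOY_STOPS = {"the", "a", "an", "of", "was", "this", "is"}

def clean_review(text, stops=TOY_STOPS):
    """Return the lower-cased, stop-word-free tokens of one review."""
    text = re.sub(r"<.*?>", " ", text)       # drop HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)   # keep letters only
    tokens = text.lower().split()            # lower-case before comparing with the stop list
    return [w for w in tokens if w not in stops]

example = clean_review("The plot of this movie was <br />GREAT!")
print(example)
```

Lower-casing happens before the stop-word filter because the stop-word list is itself lower case; otherwise a capitalised "The" would slip through.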

In [None]:
english_stops = set(stopwords.words('english'))

In [None]:
list(itertools.islice(english_stops, 10))   # peek at 10 stop words

In [None]:
def prep_dataset():
    x_data = data['review']       # Reviews/Input
    y_data = data['sentiment']    # Sentiment/Output

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
    x_data = x_data.str.lower()                                   # lower case first: the stop-word list is lower case
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
    
    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.map({'positive': 1, 'negative': 0})

    return x_data, y_data

x_data, y_data = prep_dataset()

print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

#### Split dataset

We use the `train_test_split()` function to partition the data into 80% training and 20% test sets.
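
To make the idea of an 80/20 partition concrete, here is a minimal standard-library sketch. `toy_train_test_split` is a hypothetical helper written for illustration, not scikit-learn's implementation; it only shows the shuffle-then-cut logic.

```python
import random

def toy_train_test_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of `items` and cut it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)       # fixed seed -> reproducible split
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_part, test_part = toy_train_test_split(range(10))
print(len(train_part), len(test_part))
```

In scikit-learn, passing `random_state` to `train_test_split()` plays the role of `seed` here and makes the split reproducible between runs.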

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

#### A little bit of EDA

In [None]:
print("x train shape: ",x_train.shape)
print("y train shape: ",y_train.shape)

In [None]:
print("x test shape: ",x_test.shape)
print("y test shape: ",y_test.shape)

Distribution of classes in the training set

In [None]:
plt.figure();
sns.countplot(x=y_train);
plt.xlabel("Classes");
plt.ylabel("Frequency");
plt.title("Y Train");

In [None]:
# number of tokens in each review, computed separately for the two sets
review_len_train = [len(review) for review in x_train]
review_len_test = [len(review) for review in x_test]

In [None]:
print("min train: ", min(review_len_train), "max train: ", max(review_len_train))
print("min test: ", min(review_len_test), "max test: ", max(review_len_test))

#### Tokenize and pad/truncate

RNN models only accept numeric data, so we need to encode the reviews. `tensorflow.keras.preprocessing.text.Tokenizer` encodes the reviews as integers: each unique word is assigned an index (using `fit_on_texts`) based on the training data only.

`x_train` and `x_test` are then converted to sequences of integers using `texts_to_sequences`.

Each review has a different length, so we pad short reviews with zeros and truncate long ones to a common length (here the mean review length in the training set, rounded up) using `tensorflow.keras.preprocessing.sequence.pad_sequences`.
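
The three operations (indexing words, encoding, padding/truncating) can be sketched in plain Python on two toy reviews. The helper names below are hypothetical and only mimic the behaviour described above: the most frequent word gets index 1, words unseen in training are skipped, and 0 is reserved for padding.

```python
from collections import Counter

def build_word_index(tokenized_reviews):
    """Map each word to an integer; the most frequent word gets index 1."""
    counts = Counter(w for review in tokenized_reviews for w in review)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(review, word_index):
    """Replace words by their indices, skipping words not seen in training."""
    return [word_index[w] for w in review if w in word_index]

def pad_post(seq, maxlen):
    """Truncate at the end, or right-pad with zeros, to exactly `maxlen` items."""
    return seq[:maxlen] + [0] * max(0, maxlen - len(seq))

toy_train = [["great", "movie"], ["bad", "movie", "bad", "plot"]]
word_index = build_word_index(toy_train)
encoded = [pad_post(encode(r, word_index), maxlen=3) for r in toy_train]
print(word_index)
print(encoded)
```

Because the index is built from the training reviews only, a test review can contain words with no index; in this sketch they are simply dropped.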



In [None]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))

# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need to lower-case: already done in prep_dataset()
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

## size of vocabulary
total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

In [None]:
x_train[0,0]

#### Build model

**Embedding Layer**: it maps each word in the vocabulary to a dense vector (here of length 32). The vectors are learned during training, so that words that are related or used in similar contexts tend to end up with similar vectors.
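
Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below uses toy sizes and random values as an assumption; in the real layer the table has `total_words` rows of length 32 and its values are learned.

```python
import random

vocab_size, embed_dim = 5, 3          # toy sizes chosen for illustration
rng = random.Random(0)

# One row of `embed_dim` numbers per word index; in a real model these are learned.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Look up the vector for every integer in an encoded review."""
    return [embedding_table[idx] for idx in sequence]

vectors = embed([3, 1, 0])            # an encoded, padded toy review
print(len(vectors), len(vectors[0]))
```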

**LSTM Layer**: at each step it decides which information to keep or throw away by combining the current input, the previous output (hidden state), and the previous memory (cell state). Its main components are:

- *Forget Gate*: decides which information in the cell state is kept or thrown away
- *Input Gate*: passes the previous output and the current input through a sigmoid activation function to decide which new values are written to the cell state
- *Cell State*: the previous cell state is multiplied by the forget vector (values multiplied by a number near 0 are dropped), and the contribution from the input gate is added to obtain the new cell state
- *Output Gate*: decides the next hidden state, which is used for predictions
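
The gates listed above can be written out for a single time step. This is a minimal sketch of the standard LSTM update with scalar input and states; the parameter names (`wf`, `uf`, `bf`, ...) and their values are assumptions chosen for illustration, whereas a real layer uses learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step with scalar input, hidden state and cell state."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])          # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])          # input gate
    c_tilde = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # candidate values
    c = f * c_prev + i * c_tilde                                   # new cell state
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])          # output gate
    h = o * math.tanh(c)                                           # new hidden state
    return h, c

# Arbitrary illustrative weights (a trained layer would have learned these).
params = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                           "wc", "uc", "bc", "wo", "uo", "bo")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:        # a toy sequence of three inputs
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The hidden state `h` after the last step is what the next layer receives when the LSTM returns only its final output.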

**Dense Layer**: takes the output of the LSTM layer and applies a sigmoid activation function, which returns a value between 0 and 1 that can be read as the probability of a positive review.
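
As a rough illustration of this last layer, the sketch below computes a weighted sum of a toy three-value LSTM output and squashes it with a sigmoid. The feature values, weights, and bias are made up for the example; `dense_sigmoid` is a hypothetical helper, not a Keras function.

```python
import math

def dense_sigmoid(features, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

prob = dense_sigmoid([0.2, -0.4, 0.9], weights=[1.0, 0.5, 2.0], bias=-0.3)
label = "positive" if prob > 0.5 else "negative"
print(round(prob, 3), label)
```

Thresholding the probability at 0.5 turns it into a class label, which is how predictions are compared with the 0/1 sentiment labels.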



In [None]:
# ARCHITECTURE

model = Sequential()
model.add(Embedding(total_words, 32, input_length = max_length))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

model.summary()

#### Training the model

For training we fit the RNN model to the x_train (input) and y_train (output/label) data.
We use mini-batch learning with a batch_size of 128 and train for 5 epochs.
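
To see what a batch size of 128 means in practice, the sketch below slices the training set into consecutive index ranges and counts them. It assumes 40,000 training reviews (80% of 50K); `iter_batches` is a hypothetical helper, and the real `fit` call also shuffles the samples by default.

```python
import math

def iter_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering all samples once (one epoch)."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_train = 40_000                 # 80% of 50K reviews
batches = list(iter_batches(n_train, batch_size=128))
print(len(batches), batches[-1])
```

Each epoch therefore performs one weight update per batch, and the last batch is smaller whenever the number of samples is not a multiple of the batch size.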


In [None]:
num_epochs = 5
batch_size = 128

checkpoint = ModelCheckpoint(
    'models/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

history = model.fit(x_train, y_train, batch_size = batch_size, epochs = num_epochs, callbacks=[checkpoint])

In [None]:
plt.figure()
plt.plot(history.history["accuracy"],label="Train");
plt.title("Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epochs")
plt.legend()
plt.show();

#### Testing

In [None]:
from sklearn.metrics import confusion_matrix

predictions = model.predict(x_test).ravel()   # flatten (n, 1) -> (n,)
predicted_labels = np.where(predictions > 0.5, "good review", "bad review")

target_labels = np.where(y_test > 0.5, "good review", "bad review")

con_mat_df = confusion_matrix(target_labels, predicted_labels, labels=["bad review","good review"])
print(con_mat_df)

In [None]:
y_pred = np.where(predictions > 0.5, 1, 0).ravel()

# count the test reviews whose predicted label matches the true label
true = int(np.sum(y_pred == np.asarray(y_test)))

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))
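
The confusion matrix and the accuracy computed above are two views of the same counts. A minimal sketch on made-up labels (the toy vectors and the helper `confusion_counts` are assumptions for illustration):

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
accuracy = (tn + tp) / len(toy_true)   # diagonal of the matrix over the total
print(tn, fp, fn, tp, accuracy)
```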

#### A little application

Now we feed a new review to the trained RNN model, to see whether it is classified as positive or negative.

We go through the same preprocessing (cleaning, tokenizing, encoding), and then move directly to the prediction step (the RNN model has already been trained, and its accuracy was measured on the held-out test set).

In [None]:
loaded_model = load_model('models/LSTM.h5')

In [None]:
review = 'Movie Review: Nothing was typical about this. Everything was beautifully done in this movie, the story, the flow, the scenario, everything. I highly recommend it for mystery lovers, for anyone who wants to watch a good movie!'

In [None]:
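# Pre-process input (same steps as in prep_dataset)
regex = re.compile(r'[^a-zA-Z]')
review = regex.sub(' ', review)    # replace non-letters with a space
print('Cleaned: ', review)

words = review.lower().split()     # lower-case before filtering: the stop-word list is lower case
filtered = [w for w in words if w not in english_stops]
filtered = [' '.join(filtered)]    # texts_to_sequences expects a list of texts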

print('Filtered: ', filtered)

In [None]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

In [None]:
result = loaded_model.predict(tokenize_words)
print(result)

In [None]:
if result[0, 0] > 0.5:   # same threshold as in the testing step
    print('positive')
else:
    print('negative')

## Exercise

Try to write your own movie review, and then have the deep learning model classify it.

0. write your review
1. clean the text data
2. tokenize it
3. predict and evaluate