# COMM7370 AI Theories and Applications
# Tutorial: Recurrent Neural Network by Keras
## The Problem: Movie review sentiment analysis
In this tutorial, we will perform sentiment analysis on a corpus of movie reviews from Rotten Tomatoes. 
<img src="tomato.png" alt="drawing" width="400"/>

The dataset is from [Kaggle](https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/data). In this tutorial, we use 39015 phrases as training data and 7803 phrases as testing data. Each phrase is labeled on a scale of zero to four. The sentiment labels are:
- 0: negative
- 1: somewhat negative
- 2: neutral
- 3: somewhat positive
- 4: positive
## 1. Setup

In [None]:
# install the required packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install keras
!{sys.executable} -m pip install tensorflow
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib
# `os` and `sys` are part of the Python standard library and need no installation

In [None]:
import numpy as np
import pandas as pd
import os
from matplotlib import pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM

# Workaround for the macOS error about the OpenMP runtime being loaded more than once.
# If you are not on macOS, you can comment out the following line.
os.environ['KMP_DUPLICATE_LIB_OK']='True'

## 2. Preparing the Data

In [None]:
# load data
df_train = pd.read_csv('train_sentiment.csv')
df_test = pd.read_csv('test_sentiment.csv')

y_train = df_train['Sentiment']
y_test = df_test['Sentiment']

df_train.head(10)

As you can see, some phrases are incomplete and some are repeated.
### Clean text
Computers store characters as numeric codes, so to a computer ‘A’ is not the same as ‘a’. We therefore convert all characters to lowercase. Because we split the sentences into individual words on white space, a word followed directly by a period is not the same token as the word on its own (*happy. != happy*), so we put a space in front of punctuation marks. Contractions are likewise treated as different tokens from their expanded forms (*I’m != I am*), which has repercussions for the model. We handle all of these cases with the following function.

In [None]:
# Literal substrings to replace; order matters because earlier rules run first.
replace_list = {"i'm": 'i am',
                "'re": ' are',
                "let's": 'let us',
                "'scuse": 'excuse',
                "'s":  ' is',
                "'ve": ' have',
                "can't": 'can not',
                "cannot": 'can not',
                "shan't": 'shall not',
                "n't": ' not',
                "'d": ' would',
                "'ll": ' will',
                ',': ' ,',
                '.': ' .',
                '!': ' !',
                '?': ' ?'}

def clean_text(text):
    text = text.lower()
    # str.replace matches literal text, not regular expressions
    for s in replace_list:
        text = text.replace(s, replace_list[s])
    # collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text
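
As a quick check of why this normalization matters, the sketch below compares two made-up phrases (not taken from the dataset). The helper `normalize` is hypothetical and only handles case and the period; it is a stripped-down stand-in for `clean_text`, not a replacement.

```python
# Two made-up phrases that a human reads as the same sentiment word.
raw_a = "Happy."
raw_b = "happy"

# Splitting on whitespace alone keeps the capital letter and the
# trailing period, so the two token lists do not match.
tokens_a = raw_a.split()
tokens_b = raw_b.split()
print(tokens_a == tokens_b)  # False


def normalize(text):
    """Hypothetical helper: lowercase and detach periods from words."""
    text = text.lower().replace('.', ' .')
    return text.split()


# After normalization both phrases share the token 'happy'.
print(normalize(raw_a))  # ['happy', '.']
print(normalize(raw_b))  # ['happy']
shared = set(normalize(raw_a)) & set(normalize(raw_b))
print(shared)  # {'happy'}
```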

We can use the `apply` method to apply the function to every phrase in the series.

In [None]:
X_train = df_train['Phrase'].apply(lambda p: clean_text(p))
X_test = df_test['Phrase'].apply(lambda p: clean_text(p))

`lambda`: A lambda function is a small anonymous function.  
`Series.apply()` calls the passed function on each value of the series (here, each phrase) and returns a new series built from the results, instead of altering the original series.

Let’s look at the individual length of each phrase in the corpus.

In [None]:
phrase_len = X_train.apply(lambda p: len(p.split(' ')))
max_phrase_len = phrase_len.max()
print('max phrase len: {0}'.format(max_phrase_len))

plt.figure(figsize = (10, 8))
plt.hist(phrase_len, alpha = 0.2, density = True)
plt.xlabel('phrase len')
plt.ylabel('probability')
plt.grid(alpha = 0.25)

In [None]:
phrase_len = X_test.apply(lambda p: len(p.split(' ')))
if max_phrase_len < phrase_len.max():
    max_phrase_len = phrase_len.max()
print('max phrase len: {0}'.format(max_phrase_len))
plt.figure(figsize = (10, 8))
plt.hist(phrase_len, alpha = 0.2, density = True)
plt.xlabel('phrase len')
plt.ylabel('probability')
plt.grid(alpha = 0.25)

All the inputs to our neural network must be the same length. Therefore, we store the longest phrase length across the training and test sets in a variable, which we’ll use later to define the input to our model.

### Word embedding
Computers don’t understand words, let alone sentences, so we use the tokenizer to parse the phrases. Specifying `num_words` keeps only the most common words (the `num_words - 1` most frequent ones).   
The tokens are then vectorized. By vectorized we mean that they are mapped to integers.

In [None]:
max_words = 8192
tokenizer = Tokenizer(
    num_words = max_words,
    filters = '"#$%&()*+-/:;<=>@[\\]^_`{|}~'
)
# Build the vocabulary from the training phrases only, then reuse the same
# word index for the test phrases so both sets share one integer mapping.
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

`Tokenizer`: [Tokenizer](https://keras.io/preprocessing/text/) is the text tokenization utility class. It vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.

`fit_on_texts`: Updates the internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. For example, given "The cat sat on the mat.", it creates a word -> index dictionary such that word_index["the"] = 1, word_index["cat"] = 2, and so on, so every word gets a unique integer value. A lower integer means a more frequent word, and 0 is reserved for padding.

`texts_to_sequences`: Transforms each text in texts to a sequence of integers. So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Words that are not in the vocabulary are skipped. Nothing more, nothing less, certainly no magic involved.
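
The two steps can be imitated in a few lines of plain Python. This is a minimal sketch of the idea, not the Keras implementation: the names `word_index` and `to_sequence` and the toy sentences are assumptions made for illustration, and it skips lowercasing and character filtering.

```python
from collections import Counter

texts = ["the cat sat on the mat", "the dog sat"]

# fit_on_texts-like step: count every word, then rank by frequency.
counts = Counter(word for text in texts for word in text.split())
ranked = sorted(counts, key=lambda w: -counts[w])
# Index 0 is left unused (reserved for padding), so ranks start at 1.
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}


def to_sequence(text, num_words=None):
    """texts_to_sequences-like step: swap each known word for its index."""
    seq = []
    for word in text.split():
        idx = word_index.get(word)
        # Skip unknown words and words ranked at or beyond num_words.
        if idx is None or (num_words is not None and idx >= num_words):
            continue
        seq.append(idx)
    return seq


print(word_index["the"], word_index["sat"])    # 1 2
print(to_sequence("the cat sat on the mat"))   # [1, 3, 2, 4, 1, 5]
print(to_sequence("the bird sat"))             # [1, 2]
print(to_sequence("the cat sat", num_words=3)) # [1, 2]
```

The last call shows the effect of a vocabulary cap: with `num_words=3` only indices 1 and 2 survive, so "cat" (index 3) is dropped.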

### Pad sequences
In order to feed this data into our RNN, all input sequences must have the same length. We fix that length at `max_phrase_len`: shorter phrases are padded with a null value (0), and anything longer would be truncated. We can accomplish this using the `pad_sequences()` function in Keras, which by default pads at the start of each sequence.
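
To make the padding concrete, here is a small pure-Python sketch that mimics the default behaviour of `pad_sequences()` (pad and truncate at the front). The function name `pad_left` and the toy sequences are made up for this illustration.

```python
def pad_left(sequences, maxlen, value=0):
    """Pad (or truncate) every sequence at the front to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # if too long, keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


toy_sequences = [[1, 3, 2], [7], [4, 5, 6, 8, 9]]
result = pad_left(toy_sequences, maxlen=4)
for row in result:
    print(row)
# [0, 1, 3, 2]
# [0, 0, 0, 7]
# [5, 6, 8, 9]
```

Every row now has four entries, so the rows can be stacked into a single 2-D array.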

In [None]:
X_train = pad_sequences(X_train, maxlen = max_phrase_len)
X_test = pad_sequences(X_test, maxlen = max_phrase_len)

In [None]:
X_train.shape

## 3. Building the Model
Our model is a simple RNN with one embedding layer, one LSTM layer, a dropout layer and one dense output layer.
<img src="network.png" alt="drawing" width="200"/>

In [None]:
model = Sequential()
model.add(Embedding(input_dim = max_words, output_dim = 100, input_length = max_phrase_len))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(5, activation = 'softmax'))

model.summary()

- `Embedding`: The [Embedding](https://keras.io/layers/embeddings/) layer turns positive integers (indexes) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]. This layer can only be used as the first layer in a model. 
    - input_dim: int > 0. Size of the vocabulary
    - output_dim: int >= 0. Dimension of the dense embedding.
    - input_length: Length of input sequences, when it is constant. 
- [`LSTM`](https://keras.io/layers/recurrent/): Long Short-Term Memory layer
- `Dropout`: randomly sets a fraction `rate` of its input units to 0 at each update during training, which helps prevent overfitting. Here `rate` is 0.3.
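
The parameter counts reported by `model.summary()` can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and the layer sizes used in this tutorial.

```python
vocab_size = 8192   # input_dim of the Embedding layer (max_words)
embed_dim = 100     # output_dim of the Embedding layer
lstm_units = 100    # LSTM(100)
num_classes = 5     # Dense(5)

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dropout has no trainable parameters.
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 819200 80400 505 900105
```

Most of the parameters sit in the embedding table, which is why the vocabulary size has such a large effect on model size.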

## 4. Compiling the Model
Before we can begin training, we need to configure the training process. We decide three key factors during the compilation step:
- The **optimizer**. We’ll stick with a pretty good default: the gradient-based Adam optimizer (from the paper *Adam: A Method for Stochastic Optimization*). Keras has many [other optimizers](https://keras.io/optimizers/) you can look into as well.
- The **loss function**. Since we’re using a Softmax output layer, we’ll use the Cross-Entropy loss. Keras distinguishes between binary_crossentropy (2 classes) and categorical_crossentropy (>2 classes), so we’ll use the latter. It expects one-hot encoded labels, which is why the labels are passed through `to_categorical` during training. [See all Keras losses](https://keras.io/losses/).
- A list of **metrics**. Since this is a classification problem, we’ll just have Keras report on the accuracy metric.
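
To show what the loss measures, here is a hand-rolled sketch of categorical cross-entropy for a single sample. The helper names and the probability vector are invented for illustration; Keras computes the same quantity averaged over a batch, with extra numerical safeguards.

```python
import math


def one_hot(label, num_classes=5):
    """Encode an integer label as a one-hot list, as to_categorical would."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def categorical_crossentropy(y_true, y_pred):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


# A made-up softmax output that favours class 2 (neutral).
probs = [0.05, 0.10, 0.60, 0.20, 0.05]

loss_right = categorical_crossentropy(one_hot(2), probs)  # -log(0.60), about 0.51
loss_wrong = categorical_crossentropy(one_hot(0), probs)  # -log(0.05), about 3.00
print(round(loss_right, 2), round(loss_wrong, 2))
```

The loss is small when the model puts high probability on the true class and grows quickly when it does not.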

Here’s what that compilation looks like:

In [None]:
model.compile(
    loss='categorical_crossentropy',
    optimizer='Adam',
    metrics=['accuracy']
)

## 5. Training the Model

In [None]:
batch_size = 512
epochs = 5

history = model.fit(
    X_train,
    to_categorical(y_train),
    epochs = epochs,
    batch_size = batch_size,
    validation_data=(X_test, to_categorical(y_test))
)
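
As a rough guide to what the progress bar will show, the number of weight updates follows from the settings above and the 39015 training phrases mentioned in the introduction. This is simple arithmetic, sketched here under the assumption that the whole training file is used.

```python
import math

num_samples = 39015  # training phrases, as stated in the introduction
batch_size = 512
epochs = 5

# The last batch of each epoch is smaller when the sizes do not divide evenly.
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)  # 77 103 385
```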

In [None]:
# plotting the metrics
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='lower right')

plt.subplot(2,1,2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper right')

plt.tight_layout()

## 6. Using the Model
Now that we have a working, trained model, let’s put it to use. The first thing we’ll do is save its weights to disk so we can load them back up anytime:

In [None]:
model.save_weights('rnn.h5')

We can now reload the trained model whenever we want by rebuilding it and loading in the saved weights:

In [None]:
model = Sequential()
model.add(Embedding(input_dim = max_words, output_dim = 100, input_length = max_phrase_len))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(5, activation = 'softmax'))

# Load the saved weights from disk into the rebuilt model:
model.load_weights('rnn.h5')

### Prediction result on test data
Using the trained model to make predictions is easy: we pass an array of inputs to `predict()` and it returns an array of outputs. Keep in mind that the outputs of our network are probabilities (because of softmax), so we’ll use `np.argmax()` to turn those into actual classes.

In [None]:
# Predict on the first 5 test sequences.
predictions = model.predict(X_test[:5])

# Print our model's predictions.
print(np.argmax(predictions, axis=1)) 

# Check our predictions against the ground truths.
print(y_test[:5])
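
The step from probabilities to class labels can also be spelled out without NumPy. The probability rows and labels below are invented for illustration and are not real model output.

```python
def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


# Two made-up softmax outputs, one row per phrase.
toy_predictions = [
    [0.05, 0.10, 0.60, 0.20, 0.05],
    [0.02, 0.08, 0.20, 0.45, 0.25],
]
toy_labels = [2, 4]

predicted = [argmax(row) for row in toy_predictions]
accuracy = sum(p == t for p, t in zip(predicted, toy_labels)) / len(toy_labels)

print(predicted)  # [2, 3]
print(accuracy)   # 0.5
```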

## The Full Code

In [None]:
import numpy as np
import pandas as pd
import os
from matplotlib import pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM

# Workaround for the macOS error about the OpenMP runtime being loaded more than once.
# If you are not on macOS, you can comment out the following line.
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# load data
df_train = pd.read_csv('train_sentiment.csv')
df_test = pd.read_csv('test_sentiment.csv')

y_train = df_train['Sentiment']
y_test = df_test['Sentiment']

# clean data
# Literal substrings to replace; order matters because earlier rules run first.
replace_list = {"i'm": 'i am',
                "'re": ' are',
                "let's": 'let us',
                "'scuse": 'excuse',
                "'s":  ' is',
                "'ve": ' have',
                "can't": 'can not',
                "cannot": 'can not',
                "shan't": 'shall not',
                "n't": ' not',
                "'d": ' would',
                "'ll": ' will',
                ',': ' ,',
                '.': ' .',
                '!': ' !',
                '?': ' ?'}

def clean_text(text):
    text = text.lower()
    # str.replace matches literal text, not regular expressions
    for s in replace_list:
        text = text.replace(s, replace_list[s])
    # collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text

X_train = df_train['Phrase'].apply(lambda p: clean_text(p))
X_test = df_test['Phrase'].apply(lambda p: clean_text(p))

phrase_len = X_train.apply(lambda p: len(p.split(' ')))
max_phrase_len = phrase_len.max()
phrase_len = X_test.apply(lambda p: len(p.split(' ')))
if max_phrase_len < phrase_len.max():
    max_phrase_len = phrase_len.max()

# word embedding
max_words = 8192
tokenizer = Tokenizer(
    num_words = max_words,
    filters = '"#$%&()*+-/:;<=>@[\\]^_`{|}~'
)
# Build the vocabulary from the training phrases only, then reuse the same
# word index for the test phrases so both sets share one integer mapping.
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

# pad sequence
X_train = pad_sequences(X_train, maxlen = max_phrase_len)
X_test = pad_sequences(X_test, maxlen = max_phrase_len)

# build model
model = Sequential()
model.add(Embedding(input_dim = max_words, output_dim = 100, input_length = max_phrase_len))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(5, activation = 'softmax'))

# compile model
model.compile(
    loss='categorical_crossentropy',
    optimizer='Adam',
    metrics=['accuracy']
)

# train the model
batch_size = 512
epochs = 5

history = model.fit(
    X_train,
    to_categorical(y_train),
    epochs = epochs,
    batch_size = batch_size,
    validation_data=(X_test, to_categorical(y_test))
)

model.save_weights('rnn.h5')
# Load the model from disk later using:
#model.load_weights('rnn.h5')

# Predict on the first 5 test sequences.
predictions = model.predict(X_test[:5])

# Print our model's predictions.
print(np.argmax(predictions, axis=1)) 

# Check our predictions against the ground truths.
print(y_test[:5])

- The code in this notebook is modified from various sources. All code is for educational purposes only and is released under the CC1.0.