# Recurrent Neural Networks with ``BigDL Keras``


BigDL's Keras-style API makes it straightforward to train recurrent neural networks (RNNs) such as the long short-term memory (LSTM) and the gated recurrent unit (GRU). To demonstrate an end-to-end RNN training and prediction pipeline, we take a classic language modeling problem as a case study: predicting the distribution of the next word given a sequence of previous words.

## Import packages

We begin with the following imports.

In [1]:
from __future__ import print_function
import math
import os
import time
import numpy as np

## Define classes for indexing words of the input document

For language modeling, we define two classes that handle the routine work of loading document data. The ``Dictionary`` class below performs word indexing: it converts each word in the documents from its string form to an integer index.

In this example, words are indexed with consecutive integers, assigned in order of first appearance.

In [2]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)
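
To make the indexing scheme concrete, the following minimal sketch mirrors the ``word2idx`` and ``idx2word`` fields of ``Dictionary`` on a toy sentence. The helper name ``index_words`` and the sentence are illustrative and are not part of the original notebook.

```python
def index_words(words):
    """Assign consecutive integer ids to words in order of first appearance."""
    word2idx, idx2word = {}, []
    for word in words:
        if word not in word2idx:
            word2idx[word] = len(idx2word)  # next free id
            idx2word.append(word)
    return word2idx, idx2word

sentence = "the cat sat on the mat".split()
word2idx, idx2word = index_words(sentence)

# Strings -> integers; the repeated word "the" reuses its id.
ids = [word2idx[w] for w in sentence]
print(ids)  # [0, 1, 2, 3, 0, 4]

# Integers -> strings recovers the original sentence.
print([idx2word[i] for i in ids])
```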

The ``Dictionary`` class is used by the ``Corpus`` class to index the words of the input document.

In [3]:
class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(path + 'train.txt')
        self.valid = self.tokenize(path + 'valid.txt')
        self.test = self.tokenize(path + 'test.txt')

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r') as f:
            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r') as f:
            ids = np.zeros((tokens,), dtype='int32')
            token = 0
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        return ids
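
The effect of ``tokenize`` can be illustrated on in-memory lines instead of a file. The sketch below is a single-pass variant, not the two-pass implementation above; the function name ``tokenize_lines`` and the sample lines are hypothetical.

```python
import numpy as np

def tokenize_lines(lines):
    """Map whitespace-separated words to int32 ids, appending '<eos>' to each line."""
    word2idx = {}
    ids = []
    for line in lines:
        for word in line.split() + ['<eos>']:
            # setdefault assigns the next free id the first time a word is seen
            ids.append(word2idx.setdefault(word, len(word2idx)))
    return np.array(ids, dtype='int32'), word2idx

ids, word2idx = tokenize_lines(["the cat sat", "the dog sat"])
print(ids.tolist())       # [0, 1, 2, 3, 0, 4, 2, 3]
print(word2idx['<eos>'])  # 3
```

Every line contributes one extra ``<eos>`` token, so the model can learn where sentences end.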

## Load data as batches

We load the document data with the ``Corpus`` class defined above.

To speed up the subsequent data flow in the RNN model, we pre-process the loaded data into batches. This procedure is defined in the ``batchify`` function below.

Note: The dataset used below can be downloaded from http://goo.gl/vT4cEw. Extract the zip file under the "/tmp" directory; the code below expects the files ``/tmp/nlp/ptb.train.txt``, ``/tmp/nlp/ptb.valid.txt``, and ``/tmp/nlp/ptb.test.txt``.

In [4]:
data_path = "/tmp/nlp/ptb."

corpus = Corpus(data_path)

def batchify(data, batch_size):
    """Reshape data into (nbatch, 1, batch_size), dropping tokens that do not fill a full row."""
    nbatch = data.shape[0] // batch_size
    data = data[:nbatch * batch_size]
    data = data.reshape((batch_size, 1, nbatch)).T
    return data

batch_size = 32
train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, batch_size)
test_data = batchify(corpus.test, batch_size)
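
To see what ``batchify`` produces, we can apply the same reshaping to a toy array. In the sketch below, the 11-element input and ``batch_size = 2`` are made-up values chosen so the layout is easy to read.

```python
import numpy as np

def batchify(data, batch_size):
    """Same reshaping as above: the result has shape (nbatch, 1, batch_size)."""
    nbatch = data.shape[0] // batch_size
    data = data[:nbatch * batch_size]  # drop the tokens that do not fit
    return data.reshape((batch_size, 1, nbatch)).T

toy = np.arange(11)          # tokens 0..10
batches = batchify(toy, 2)

print(batches.shape)         # (5, 1, 2)
print(batches[:, 0, :].tolist())
# [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
```

Each of the ``batch_size`` columns is a contiguous stream of the original sequence, and the trailing token (10 here) that does not fill a complete row is dropped.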

## Provide an exposition of different RNN models

The single ``RNNModel`` class below makes several RNN variants available.

Users can select their preferred RNN variant, or compare variants, by setting the ``mode`` argument of the ``RNNModel`` constructor. An example follows the definition of the ``RNNModel`` class.

In [5]:
from bigdl.nn.keras.layer import *
from bigdl.nn.keras.topology import Sequential

class RNNModel(object):
    """A model with a recurrent layer followed by a dense decoder."""
    # batch_size was defined above as 32
    # Note: the dropout argument is accepted but not used in this example
    def __init__(self, mode, vocab_size, num_hidden, arg_input_shape=(1, batch_size), dropout=0.5):
        self.model = Sequential()
        if mode == 'rnn_relu':
            self.model.add(SimpleRNN(num_hidden, activation="relu", input_shape=arg_input_shape))
        elif mode == 'rnn_tanh':
            self.model.add(SimpleRNN(num_hidden, input_shape=arg_input_shape))
        elif mode == 'lstm':
            self.model.add(LSTM(num_hidden, input_shape=arg_input_shape))
        elif mode == 'gru':
            self.model.add(GRU(num_hidden, input_shape=arg_input_shape))
        else:
            raise ValueError("Invalid mode %s. Options are rnn_relu, "
                             "rnn_tanh, lstm, and gru" % mode)

        self.decoder = Dense(vocab_size, activation="tanh")
        self.model.add(self.decoder)
        self.num_hidden = num_hidden
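
For intuition about what the recurrent layer computes, here is a conceptual NumPy sketch of the standard simple-RNN recurrence $h_t = \phi(x_t W + h_{t-1} U + b)$, where $\phi$ is tanh or ReLU. It is not BigDL's implementation; the function name ``simple_rnn_forward``, the random weights, and the weight shapes are assumptions for illustration.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b, activation=np.tanh):
    """Run a simple RNN over x of shape (timesteps, input_dim).

    Returns the final hidden state of shape (num_hidden,).
    """
    h = np.zeros(U.shape[0])  # initial hidden state
    for x_t in x:
        h = activation(x_t @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
input_dim, num_hidden = 32, 100            # same sizes as in this example
x = rng.normal(size=(1, input_dim))        # one time step, as in input shape (1, 32)
W = rng.normal(scale=0.1, size=(input_dim, num_hidden))
U = rng.normal(scale=0.1, size=(num_hidden, num_hidden))
b = np.zeros(num_hidden)

h_tanh = simple_rnn_forward(x, W, U, b)    # analogous to mode 'rnn_tanh'
h_relu = simple_rnn_forward(x, W, U, b, activation=lambda z: np.maximum(z, 0.0))
print(h_tanh.shape, h_relu.shape)          # (100,) (100,)
```

LSTM and GRU layers replace this plain update with gated variants, but they consume and produce tensors of the same shapes.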

## Build the model

We now build the model. The optimizer and the loss are configured in the next section.

For demonstration, LSTM is the chosen RNN model type. For other RNN options, one can replace the 'lstm' string with 'rnn_relu', 'rnn_tanh', or 'gru'.

In [6]:
ntokens = len(corpus.dictionary)
model_type = 'lstm'
num_hid = 100
# Note: from here on, the name LSTM refers to this model and shadows the imported LSTM layer class
LSTM = RNNModel(model_type, ntokens, num_hid)

creating: createKerasSequential
creating: createKerasLSTM
creating: createKerasDense


In [7]:
print(LSTM.model.get_input_shape())
print(LSTM.model.get_output_shape())

(None, 1, 32)
(None, 10000)


## Train the model and evaluate on validation and testing data sets

Before training, we compile the model with the SGD optimizer, the cross-entropy loss, and accuracy as the evaluation metric.

In [8]:
from bigdl.nn.criterion import *

LSTM.model.compile(optimizer='sgd', loss=CrossEntropyCriterion(), metrics=['accuracy'])

creating: createCrossEntropyCriterion
creating: createDefault
creating: createSGD
creating: createTop1Accuracy
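
The loss and metric configured above can be illustrated with a small pure-Python sketch. It assumes the usual definition of cross-entropy as log-softmax followed by negative log-likelihood, and it uses 0-based target indices for simplicity; BigDL's own label convention may differ. The helper names ``cross_entropy`` and ``top1_correct`` are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def top1_correct(logits, target):
    """True if the highest-scoring class equals the target."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return best == target

logits = [2.0, 0.5, -1.0]
print(cross_entropy(logits, 0))  # small loss: class 0 has the highest score
print(cross_entropy(logits, 2))  # larger loss: class 2 has the lowest score
print(top1_correct(logits, 0))   # True
```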


## Execute Training

*Note: The loss and accuracy are reported in the terminal during training. Performance visualization is covered in later topics.*

In [10]:
LSTM.model.fit(train_data, corpus.train[:len(train_data)], batch_size=8, nb_epoch=10,
               validation_data=(val_data[:500], corpus.valid[:500]))

Recall that RNN model training is based on maximizing the likelihood of the observations. For evaluation purposes, we use the following two measures:

* Loss: the loss function is defined as the average negative log-likelihood of the target words (ground truth) under the model's predictions: $$\text{loss} = -\frac{1}{N} \sum_{i = 1}^N \log p_{\text{target}_i}, $$ where $N$ is the number of predictions and $p_{\text{target}_i}$ is the predicted likelihood of the $i$-th target word.

* Perplexity: the average per-word perplexity is $\exp(\text{loss})$.
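
As a worked example (the two probabilities are made-up values for illustration), suppose $N = 2$ with $p_{\text{target}_1} = 1/2$ and $p_{\text{target}_2} = 1/8$. Then
$$\text{loss} = -\frac{1}{2}\left(\log \frac{1}{2} + \log \frac{1}{8}\right) = \frac{1}{2} \log 16 = \log 4 \approx 1.386, \qquad \text{perplexity} = \exp(\log 4) = 4.$$
Equivalently, the perplexity is the reciprocal of the geometric mean of the target probabilities: $1 / \sqrt{\tfrac{1}{2} \cdot \tfrac{1}{8}} = 4$.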

The following three scenarios illustrate the perplexity measure.

* Consider the perfect scenario where the model always predicts the likelihood of the target word as 1. In this case, for every $i$ we have $p_{\text{target}_i} = 1$, so the loss is 0. As a result, the perplexity of the perfect model is $\exp(0) = 1$.

* Consider a baseline scenario where the model predicts the target word uniformly at random among the given word set $W$. In this case, for every $i$ we have $p_{\text{target}_i} = 1 / |W|$, so the loss is $\log |W|$. As a result, the perplexity of a uniformly random prediction model is always $|W|$.

* Consider the worst-case scenario where the model always predicts the likelihood of the target word as 0. In this case, for every $i$ we have $p_{\text{target}_i} = 0$, so the loss diverges to positive infinity. As a result, the perplexity of the worst model is positive infinity.


Therefore, a lower perplexity, closer to 1, generally indicates a more effective model. To be useful at all, a model has to beat the uniform baseline, that is, achieve a perplexity lower than $|W|$, the cardinality of the word set.
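
The three scenarios above can be checked numerically. The following sketch assumes a hypothetical helper ``perplexity`` that takes the predicted probabilities of the target words; the vocabulary size of 10000 matches the model output shape shown earlier, and the number of predictions (5) is arbitrary.

```python
import math

def perplexity(target_probs):
    """exp of the average negative log-likelihood of the target words."""
    if any(p == 0.0 for p in target_probs):
        return math.inf  # log(0) diverges, so the perplexity is infinite
    loss = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(loss)

vocab_size = 10000
perfect = perplexity([1.0] * 5)               # always certain and correct
uniform = perplexity([1.0 / vocab_size] * 5)  # uniform guess over the vocabulary
worst = perplexity([0.0] * 5)                 # target word ruled out

print(perfect)  # 1.0
print(uniform)  # approximately 10000
print(worst)    # inf
```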
 