# RNNs, LSTMs and Language Models

In [8]:
from mxnet import gluon, nd
import re

RNNs allow you to learn from sequences.

## RNN

![](support/rnn-unrolled.png)

$$\mathbf{h}_t = \mathbf{W}_{hx}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1}$$
$$\mathbf{o}_t = \mathbf{W}_{oh}\mathbf{h}_t$$

Create an RNN layer in Gluon:

In [2]:
from mxnet.gluon import rnn

num_hiddens = 256
rnn_layer = rnn.RNN(num_hiddens)
rnn_layer.initialize()

Initialize the hidden state

In [6]:
batch_size = 2
state = rnn_layer.begin_state(batch_size=batch_size)
state[0].shape

(1, 2, 256)

The input to `rnn_layer` has shape (number of time steps, batch size, number of inputs).

In [11]:
num_steps = 35
X = nd.random.uniform(shape=(num_steps, batch_size, 3))
Y, state_new = rnn_layer(X, state)
print(X.shape)
Y.shape, len(state_new), state_new[0].shape

(35, 2, 3)


((35, 2, 256), 1, (1, 2, 256))

The forward computation of the `rnn.RNN` instance returns two things: `Y`, the hidden-layer output at every time step, and `state_new`, the hidden state at the last time step only.
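
To make the shapes above concrete, here is a minimal NumPy sketch of the recurrence $\mathbf{h}_t = \mathbf{W}_{hx}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1}$. It is not Gluon's implementation: `rnn.RNN` also applies biases and an activation (ReLU by default, as far as I know), and the weight scales and the helper name `rnn_forward` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_steps, batch_size, num_inputs, num_hiddens = 35, 2, 3, 256

# Hypothetical weights; the small scale keeps the linear recurrence stable.
W_hx = rng.normal(scale=0.01, size=(num_hiddens, num_inputs))
W_hh = rng.normal(scale=0.01, size=(num_hiddens, num_hiddens))

def rnn_forward(X, h0):
    """Run the linear recurrence over the time axis of X.

    X has shape (time steps, batch size, inputs); h0 has shape
    (batch size, hidden units). Returns all hidden states and the last one.
    """
    h = h0
    outputs = []
    for x_t in X:                      # one slice of shape (batch, inputs) per step
        h = x_t @ W_hx.T + h @ W_hh.T  # h_t = W_hx x_t + W_hh h_{t-1}
        outputs.append(h)
    return np.stack(outputs), h

X = rng.uniform(size=(num_steps, batch_size, num_inputs))
h0 = np.zeros((batch_size, num_hiddens))
Y, h_last = rnn_forward(X, h0)
print(Y.shape, h_last.shape)
```

The last slice of `Y` and the returned state are the same array, which mirrors the relationship between `Y` and `state_new` above. Gluon additionally keeps a leading axis for the number of layers on the state, hence `(1, 2, 256)` rather than `(2, 256)`.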

## BPTT
![](support/rnn-bptt.svg)

Computational dependencies for a recurrent neural network model with three time steps

Gradients of the loss at time t depend on the results of the hidden layer at all previous time steps, recursively back to the first time step.


$$\nabla_{\mathbf{W}_{hh}}\mathbf{h}_t = \sum_{j=1}^{t}\left(\mathbf{W}_{hh}^\top\right)^{t-j}\mathbf{h}_{j-1}$$
$$\nabla_{\mathbf{W}_{hx}}\mathbf{h}_t = \sum_{j=1}^{t}\left(\mathbf{W}_{hh}^\top\right)^{t-j}\mathbf{x}_j$$
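
The sums can be checked numerically in the scalar case, where the weights are plain numbers and the matrix power is an ordinary power. The sketch below compares the closed-form sums against central finite differences. The values of `w_hx`, `w_hh`, `xs` and `h0` are arbitrary assumptions, and for the `w_hh` sum the code uses the state that `w_hh` multiplies at step j, that is $h_{j-1}$.

```python
def run(w_hx, w_hh, xs, h0):
    """Return [h_0, h_1, ..., h_T] for h_t = w_hx * x_t + w_hh * h_{t-1}."""
    hs = [h0]
    for x in xs:
        hs.append(w_hx * x + w_hh * hs[-1])
    return hs

w_hx, w_hh, h0 = 0.7, 0.9, 0.3
xs = [1.0, -2.0, 0.5, 3.0]          # x_1 .. x_T, stored at indices 0 .. T-1
t = len(xs)
hs = run(w_hx, w_hh, xs, h0)         # hs[j] is h_j, hs[0] is the initial state

# Closed-form sums over j = 1 .. t.
grad_hx = sum(w_hh ** (t - j) * xs[j - 1] for j in range(1, t + 1))
grad_hh = sum(w_hh ** (t - j) * hs[j - 1] for j in range(1, t + 1))

# Central finite differences of h_t with respect to each weight.
eps = 1e-6
fd_hx = (run(w_hx + eps, w_hh, xs, h0)[-1] - run(w_hx - eps, w_hh, xs, h0)[-1]) / (2 * eps)
fd_hh = (run(w_hx, w_hh + eps, xs, h0)[-1] - run(w_hx, w_hh - eps, xs, h0)[-1]) / (2 * eps)

print(grad_hx, fd_hx)
print(grad_hh, fd_hh)
```

Each printed pair should agree to several decimal places. Note how every term carries a power of `w_hh` that grows with the distance `t - j`.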
 


### Backpropagation through time

Backpropagation through time is regular backpropagation applied to the network unrolled over its time steps.

* Store intermediate results, i.e. powers of $\mathbf{W}_{hh}$ and the hidden states $\mathbf{h}_j$, during the forward pass that computes the loss.
* Truncate the sum to avoid numerical issues, i.e. detach the gradient after a fixed number of time steps.


The exponent $t-j$ of the matrix power can become arbitrarily large for long sequences. This is numerically unstable: components associated with eigenvalues of magnitude smaller than 1 vanish at large powers, and those associated with eigenvalues of magnitude larger than 1 explode. One way to address this is to truncate the sum at a computationally convenient length.
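
A small NumPy experiment illustrates the effect. The two diagonal matrices are assumptions chosen so that the eigenvalues can be read off directly; `power_norms` is a hypothetical helper, not part of any library.

```python
import numpy as np

def power_norms(W, max_power):
    """Return the spectral norm of W^k for k = 1 .. max_power."""
    norms = []
    P = np.eye(W.shape[0])
    for _ in range(max_power):
        P = P @ W
        norms.append(np.linalg.norm(P, 2))
    return norms

shrink = np.diag([0.9, 0.5])   # all eigenvalues below 1
grow = np.diag([1.1, 0.5])     # one eigenvalue above 1

small = power_norms(shrink, 100)
large = power_norms(grow, 100)
print(small[-1])   # 0.9 ** 100, roughly 2.7e-05
print(large[-1])   # 1.1 ** 100, roughly 1.4e+04
```

After 100 steps the first norm has all but vanished and the second has grown by four orders of magnitude, even though the eigenvalues are only 10% away from 1. Truncating the sum after a fixed number of steps bounds the largest power that ever appears.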

## LSTM
![](support/lstm-chain.png)

### Long Short-Term Memory

* LSTMs model long-term dependencies better than plain RNNs.
* Three types of gates control the flow of information: input, forget and output gates.
* The hidden layer of an LSTM carries two pieces of state: the hidden state and the memory cell.
* The output is computed from the hidden state only; the memory cell is entirely internal.
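
The points above can be sketched as a single LSTM step in NumPy. This is the standard textbook formulation, not Gluon's implementation: the helper `lstm_step`, the stacking order of the gates inside `W`, `U` and `b`, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step.

    x: (batch, inputs), h and c: (batch, hidden).
    W: (4 * hidden, inputs), U: (4 * hidden, hidden), b: (4 * hidden,),
    with the input, forget, candidate and output blocks stacked in that order.
    """
    H = h.shape[1]
    z = x @ W.T + h @ U.T + b
    i = sigmoid(z[:, 0:H])           # input gate: how much new content to write
    f = sigmoid(z[:, H:2 * H])       # forget gate: how much old memory to keep
    g = np.tanh(z[:, 2 * H:3 * H])   # candidate memory content
    o = sigmoid(z[:, 3 * H:4 * H])   # output gate: how much memory to expose
    c_new = f * c + i * g            # memory cell, internal to the layer
    h_new = o * np.tanh(c_new)       # hidden state, the only part passed on as output
    return h_new, c_new

rng = np.random.default_rng(0)
batch_size, num_inputs, num_hiddens = 2, 3, 4
W = rng.normal(scale=0.1, size=(4 * num_hiddens, num_inputs))
U = rng.normal(scale=0.1, size=(4 * num_hiddens, num_hiddens))
b = np.zeros(4 * num_hiddens)

h = np.zeros((batch_size, num_hiddens))
c = np.zeros((batch_size, num_hiddens))
x = rng.uniform(size=(batch_size, num_inputs))
h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The step returns two arrays of the same shape, which is why an LSTM carries a pair of states where the plain RNN carries one.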

Create an LSTM layer in Gluon:

In [12]:
lstm_layer = rnn.LSTM(num_hiddens)
lstm_layer.initialize()

Initialize the hidden state. For an LSTM, `begin_state` returns a list of two arrays: the hidden state and the memory cell.

In [16]:
state = lstm_layer.begin_state(batch_size=batch_size)
print(len(state))

2


## Language Model with LSTM

In [34]:
import warnings
warnings.filterwarnings('ignore')

import glob
import time
import math

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download

import gluonnlp as nlp

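num_gpus = 1
# Train on the first GPU; use mx.cpu() instead if no GPU is available.
context = mx.gpu(0)
# Print training statistics every `log_interval` batches.
log_interval = 200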

### Dataset

In [19]:
dataset_name = 'wikitext-2'
train_dataset, val_dataset, test_dataset = (nlp.data.WikiText2(segment=segment,
                                                               bos=None, 
                                                               eos='<eos>', 
                                                               skip_empty=False)
                                            for segment in ['train', 'val', 'test'])


### DataLoader

In [35]:
batch_size = 20
bptt = 35

vocab = nlp.Vocab(nlp.data.Counter(train_dataset), padding_token=None, bos_token=None)
print(vocab)

bptt_batchify = nlp.data.batchify.CorpusBPTTBatchify(vocab, bptt, batch_size, last_batch='discard')
train_data, val_data, test_data = (bptt_batchify(x) for x in [train_dataset, val_dataset, test_dataset])

Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
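
To show what BPTT batching does conceptually, here is a NumPy sketch that turns one long stream of token ids into `(bptt, batch_size)` inputs with targets shifted by one token. It illustrates the idea only and is not the code of `CorpusBPTTBatchify`; the helper `bptt_batches` and the toy corpus are hypothetical.

```python
import numpy as np

def bptt_batches(token_ids, bptt, batch_size):
    """Split a token stream into (data, target) pairs for truncated BPTT.

    The stream is cut into `batch_size` contiguous columns. Each batch takes
    `bptt` consecutive rows as input and the rows one step later as target.
    Incomplete trailing batches are discarded.
    """
    ids = np.asarray(token_ids)
    n_rows = len(ids) // batch_size
    # Column k holds the k-th contiguous chunk of the corpus.
    columns = ids[:n_rows * batch_size].reshape(batch_size, n_rows).T
    batches = []
    for start in range(0, n_rows - 1, bptt):
        if start + bptt > n_rows - 1:    # not enough rows left for a full batch
            break
        data = columns[start:start + bptt]
        target = columns[start + 1:start + 1 + bptt]
        batches.append((data, target))
    return batches

toy_corpus = list(range(26))             # stand-in for a sequence of token ids
batches = bptt_batches(toy_corpus, bptt=4, batch_size=2)
data, target = batches[0]
print(len(batches), data.shape)
print(data[:, 0], target[:, 0])
```

Because consecutive batches continue each column where the previous batch stopped, the hidden state from one batch is a meaningful starting state for the next.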


### Model

In [36]:
model_name = 'standard_lstm_lm_200'
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None)
print(model)

model.initialize(mx.init.Xavier(), ctx=context)

lr = 20
trainer = gluon.Trainer(model.collect_params(), 'sgd', {
    'learning_rate': lr,
    'momentum': 0,
    'wd': 0
})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

StandardRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 33278, linear)
  )
)
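
The model is trained with softmax cross-entropy, and the training loop reports perplexity as the exponential of the mean per-token loss. The NumPy sketch below shows that relationship on made-up logits; `softmax_cross_entropy` is a hypothetical helper written for this example, not the Gluon loss class.

```python
import math
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Per-token negative log-likelihood.

    logits: (tokens, vocabulary size), targets: (tokens,) integer class ids.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

vocab_size = 33278
targets = np.array([0, 1, 2, 3, 4])

# A model that knows nothing assigns equal logits to every word.
uniform_logits = np.zeros((len(targets), vocab_size))
nll = softmax_cross_entropy(uniform_logits, targets)
ppl = math.exp(nll.mean())
print(nll.mean(), ppl)
```

For a uniform prediction the perplexity equals the vocabulary size, so perplexity can be read as the effective number of words the model is choosing between at each step.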


In [24]:
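def detach(hidden):
    """Detach the hidden state (or nested list of states) from the graph
    so that gradients do not flow back past the current batch."""
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden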

def evaluate(model, data_source, batch_size, ctx):
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(
        batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        hidden = detach(hidden)
        L = loss(output.reshape(-3, -1), target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal

grad_clip = 0.25
epochs = 3
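
The `grad_clip` value is used in the training loop to clip gradients by their global norm. A minimal NumPy sketch of that idea follows; it is an illustration under the usual definition of global-norm clipping, not the code of `gluon.utils.clip_global_norm`, and the helper name `clip_by_global_norm` is hypothetical.

```python
import math
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients in place so their joint 2-norm is at most max_norm.

    Returns the global norm measured before clipping.
    """
    total_norm = math.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = max_norm / (total_norm + 1e-8)
    if scale < 1.0:                 # only shrink, never enlarge
        for g in grads:
            g *= scale
    return total_norm

grads = [np.array([3.0, 0.0]), np.array([4.0])]
norm_before = clip_by_global_norm(grads, max_norm=0.25)
norm_after = math.sqrt(sum(float((g ** 2).sum()) for g in grads))
print(norm_before, norm_after)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is bounded.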

### Training

In [39]:
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    for epoch in range(epochs):
        total_L = 0.0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        hidden = model.begin_state(batch_size, func=mx.nd.zeros, ctx=context)
        for i, (data, target) in enumerate(train_data):
            data = data.as_in_context(context)
            target = target.as_in_context(context)
            # Truncated BPTT: cut the graph so gradients stop at the batch boundary.
            hidden = detach(hidden)
            L = 0
            Ls = []
            with autograd.record():
                output, hidden = model(data, hidden)
                batch_L = loss(output.reshape(-3, -1), target.reshape(-1,))
                L = L + batch_L.as_in_context(context) / (data.size)
                Ls.append(batch_L / (data.size))
            L.backward()
            grads = [p.grad(data.context) for p in parameters]
            gluon.utils.clip_global_norm(grads, grad_clip)

            # The loss is already divided by data.size, so no further rescaling is needed.
            trainer.step(1)

            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])

            if i % log_interval == 0 and i > 0:
                cur_L = total_L / log_interval
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L),
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L = 0.0
                start_log_interval_time = time.time()

        mx.nd.waitall()

        print('[Epoch %d] throughput %.2f samples/s'%(
                    epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))
        val_L = evaluate(model, val_data, batch_size, context)
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))

        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context)
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)

    print('Total training throughput %.2f samples/s'%(
                            (batch_size * len(train_data) * epochs) /
                            (time.time() - start_train_time)))
    
train(model, train_data, val_data, test_data, epochs, lr)

[Epoch 0 Batch 200/2983] loss 5.46, ppl 235.91, throughput 1608.40 samples/s
[Epoch 0 Batch 400/2983] loss 5.45, ppl 233.75, throughput 1627.40 samples/s
[Epoch 0 Batch 600/2983] loss 5.28, ppl 197.20, throughput 1634.64 samples/s
[Epoch 0 Batch 800/2983] loss 5.30, ppl 199.86, throughput 1633.01 samples/s
[Epoch 0 Batch 1000/2983] loss 5.26, ppl 193.16, throughput 1632.73 samples/s
[Epoch 0 Batch 1200/2983] loss 5.26, ppl 191.98, throughput 1631.85 samples/s
[Epoch 0 Batch 1400/2983] loss 5.25, ppl 191.44, throughput 1631.19 samples/s
[Epoch 0 Batch 1600/2983] loss 5.32, ppl 203.73, throughput 1631.51 samples/s
[Epoch 0 Batch 1800/2983] loss 5.19, ppl 178.75, throughput 1634.95 samples/s
[Epoch 0 Batch 2000/2983] loss 5.21, ppl 182.49, throughput 1636.95 samples/s
[Epoch 0 Batch 2200/2983] loss 5.11, ppl 165.54, throughput 1633.29 samples/s
[Epoch 0 Batch 2400/2983] loss 5.15, ppl 171.94, throughput 1627.28 samples/s
[Epoch 0 Batch 2600/2983] loss 5.16, ppl 174.19, throughput 1630.15 