This tutorial is a modified version of the [Gluon-NLP Language Model Tutorial](http://gluonnlp.mxnet.io/examples/language_model/language_model.html)

In [None]:
%%bash

pip install spacy -U --quiet 
pip install nltk==3.2.5 -U --quiet
python -m spacy download en

In [2]:
import nltk
nltk.download('perluniprops')
nltk.download('nonbreaking_prefixes')

[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!


True

In [3]:
import warnings
warnings.filterwarnings('ignore')

import glob
import time
import math
import zipfile
import os

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon.utils import download

import gluonnlp as nlp
import nltk
import spacy

In [4]:
num_gpus = 1
context = [mx.gpu(0)]
log_interval = 200

In [5]:
batch_size = 20 * len(context)
lr = 20
epochs=3
seq_len = 35
grad_clip = 0.25

## Dataset: Shakespeare Works
We will use the complete works of Shakespeare as concatenated by Andrej Karpathy, available [here](http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt). We took 50% of the dataset and split it into train, validation, and test sets.
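
The exact split procedure is not shown in this tutorial. As a rough sketch of how such a split could be produced, the snippet below keeps the first half of a text and divides it into train, validation, and test parts. The function name `split_corpus` and the 80/10/10 proportions are assumptions for illustration, not the values used to build `shakespeare.zip`.

```python
def split_corpus(text, keep=0.5, train=0.8, val=0.1):
    """Keep the first `keep` fraction of `text`, then cut it into three parts.

    All fractions are illustrative assumptions.
    """
    lines = text.splitlines(keepends=True)
    lines = lines[:int(len(lines) * keep)]      # keep only part of the corpus
    n_train = int(len(lines) * train)
    n_val = int(len(lines) * val)
    train_part = lines[:n_train]
    val_part = lines[n_train:n_train + n_val]
    test_part = lines[n_train + n_val:]         # whatever remains
    return "".join(train_part), "".join(val_part), "".join(test_part)


toy = "".join("line %d\n" % i for i in range(40))
train_txt, val_txt, test_txt = split_corpus(toy)
print(len(train_txt.splitlines()), len(val_txt.splitlines()), len(test_txt.splitlines()))
```

Splitting on line boundaries keeps speaker labels and their lines intact, which matters for a corpus formatted like a play.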

In [6]:
data_path="./data"
data_url = 'https://s3.amazonaws.com/odsc-conf/shakespeare.zip'

In [7]:
data_zip = download(data_url, path=data_path)
with zipfile.ZipFile(data_zip, 'r') as zipped_data:
    zipped_data.extractall(os.path.expanduser(data_path))

Downloading ./data/shakespeare.zip from https://s3.amazonaws.com/odsc-conf/shakespeare.zip...


In [8]:
%ls "./data"

[0m[01;34mimdb[0m/                 shakespeare_train.txt  [01;34msherlock[0m/
[01;34m__MACOSX[0m/             shakespeare_val.txt    [01;31msherlock.zip[0m
shakespeare_test.txt  [01;31mshakespeare.zip[0m        timemachine.txt


In [11]:
shakespeare_data = [data_path + "/shakespeare_train.txt", 
                 data_path + "/shakespeare_val.txt",
                data_path + "/shakespeare_test.txt"]

In [14]:
with open(shakespeare_data[0]) as f:
    text = f.read()
print(text[0:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


We need a text splitter that breaks our corpus into sequences, or samples. We will use the sentence splitter provided by the [nltk package](https://www.nltk.org/).

We will use a tokenizer provided by Gluon-NLP to split our sequences into words.

The input to our model is a sequence of words.
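
To make the two stages concrete, here is a minimal sketch that uses simple regular expressions as stand-ins for the real splitter and tokenizer. `toy_sentence_split` and `toy_tokenize` are hypothetical helpers for illustration only; `nltk.tokenize.sent_tokenize` and the Gluon-NLP tokenizers handle many more cases (abbreviations, quotes, and so on).

```python
import re

def toy_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude stand-in)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def toy_tokenize(sentence):
    """Separate words from punctuation (a crude stand-in for a real tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Before we proceed any further, hear me speak. Speak, speak."
sample_sentences = toy_sentence_split(sample)
sample_tokens = [toy_tokenize(s) for s in sample_sentences]
print(sample_sentences)
print(sample_tokens)
```

The first stage decides where one training sample ends and the next begins; the second decides what counts as a single vocabulary entry.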

In [15]:
tokenizer = nlp.data.NLTKMosesTokenizer()

splitter=nltk.tokenize.sent_tokenize

Gluon-NLP provides a **[CorpusDataset](https://gluon-nlp.mxnet.io/api/data.html#gluonnlp.data.CorpusDataset)** API that takes a corpus of text along with splitter and tokenizer functions and creates a dataset object. The dataset object can then be fed to the data loader APIs to get batches of data (more on that below).

We will create datasets for the **train, validation, and test data**.

In [16]:
train_ds, val_ds, test_ds = [
    nlp.data.CorpusDataset(ds, sample_splitter=splitter, flatten=True, eos='<eos>')
    for ds in shakespeare_data
]

Now we need to create a vocabulary from our training dataset. All we need to do is build a Counter from the training dataset, which produces a map of **word : count**.

In [17]:
counter = nlp.data.Counter(train_ds)

In [18]:
print("5 most common tokens: %s\n" % counter.most_common(5))
print("unique tokens: %s" % len(counter))

5 most common tokens: [('<eos>', 20955), ('the', 8600), ('I', 7648), ('to', 6151), ('and', 6044)]

unique tokens: 34962


In [19]:
vocab = nlp.Vocab(counter, padding_token=None, bos_token=None, eos_token='<eos>')

In [20]:
print(vocab)

Vocab(size=34963, unk="<unk>", reserved="['<eos>']")
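
The vocabulary has one more entry (34963) than the counter has unique tokens (34962) because of the added `<unk>` token; `<eos>` already appears in the counted data. The following sketch mimics the idea with `collections.Counter` and a plain list. `build_index` and `lookup` are hypothetical helpers, and the ordering rules here are a simplification, not necessarily the exact ordering `nlp.Vocab` uses.

```python
from collections import Counter

def build_index(counter, unk="<unk>", reserved=("<eos>",)):
    """Return (idx_to_token, token_to_idx): unknown token first, then reserved
    tokens, then the remaining tokens by descending frequency."""
    idx_to_token = [unk] + list(reserved)
    seen = set(idx_to_token)
    for token, _ in counter.most_common():
        if token not in seen:
            idx_to_token.append(token)
            seen.add(token)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

toy_tokens = "the king is in the world <eos> the king <eos>".split()
toy_counter = Counter(toy_tokens)
idx_to_token, token_to_idx = build_index(toy_counter)

def lookup(token):
    # Tokens never seen in training fall back to the unknown index.
    return token_to_idx.get(token, token_to_idx["<unk>"])

print(len(toy_counter), len(idx_to_token))
print(lookup("king"), lookup("boston"))
```

Any word missing from the training data, such as a modern place name, maps to the unknown index rather than raising an error.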


## Batchify

Now we need to create mini-batches from the sequences of data we have.
Gluon-NLP provides a batchify function for a given sequence_length and batch_size.

The batchify function creates batches so that each sequence in a batch continues where the same position in the previous batch left off, which lets the hidden state from the previous batch be carried into the current one.

We will use the batchify function as the data loader that feeds the model training process.
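
As a rough picture of what a BPTT-style batchify does, the sketch below lays a flat list of token ids out in `batch_size` parallel streams and cuts them into chunks of `seq_len` steps, with targets shifted by one token. `toy_bptt_batchify` is a hypothetical pure-Python helper written for this explanation; it is not the `CorpusBPTTBatchify` implementation and ignores details such as padding.

```python
def toy_bptt_batchify(ids, seq_len, batch_size):
    """Yield (data, target) pairs shaped [time][batch], discarding leftovers."""
    stream_len = len(ids) // batch_size
    # Each stream is one contiguous slice of the corpus.
    streams = [ids[b * stream_len:(b + 1) * stream_len] for b in range(batch_size)]
    # Stop early enough that every input position has a next-token target.
    for start in range(0, stream_len - seq_len, seq_len):
        data = [[s[t] for s in streams] for t in range(start, start + seq_len)]
        target = [[s[t + 1] for s in streams] for t in range(start, start + seq_len)]
        yield data, target

batches = list(toy_bptt_batchify(list(range(20)), seq_len=3, batch_size=2))
for data, target in batches:
    print(data, target)
```

The first row of each batch picks up exactly where the previous batch ended in every stream, which is why the hidden state can be carried across batches.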

In [21]:
batchify = nlp.data.batchify.CorpusBPTTBatchify(vocab, 
                       seq_len, batch_size, last_batch='discard')

In [22]:
train_dl, val_dl, test_dl = [batchify(ds) for ds in [train_ds, val_ds, test_ds]]

In [23]:
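model_name = 'standard_lstm_lm_200'
# Only used further down to name the saved parameter files
dataset_name='wikitext-2'
# dataset_name=None: build the architecture around our own vocab without loading pre-trained weights
model, vocab = nlp.model.get_model(model_name, vocab=vocab, dataset_name=None, ctx=context[0])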
print(model)
print(vocab)

StandardRNN(
  (embedding): HybridSequential(
    (0): Embedding(34963 -> 200, float32)
    (1): Dropout(p = 0.2, axes=())
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2, dropout=0.2)
  (decoder): HybridSequential(
    (0): Dense(200 -> 34963, linear)
  )
)
Vocab(size=34963, unk="<unk>", reserved="['<eos>']")


In [25]:
model.initialize(mx.init.Xavier(), ctx=context)

trainer = gluon.Trainer(model.collect_params(), 'sgd', {
    'learning_rate': lr,
    'momentum': 0,
    'wd': 0
})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

We will use a slightly modified backpropagation through time algorithm called **Truncated Backpropagation Through Time (TBPTT)**. Here we truncate the BPTT algorithm after **k** steps and update the weights, since running complete BPTT over a long sequence is expensive to compute and can also result in vanishing gradients.

We truncate by detaching the hidden state after **k** steps. Let's write a method for detaching the hidden state.

References:  
1. [Understanding BPTT & TBPTT conceptually](https://machinelearningmastery.com/gentle-introduction-backpropagation-time/)  

2. [BPTT & TBPTT in detail](https://d2l.ai/chapter_recurrent-neural-networks/bptt.html?highlight=detach)


In [26]:
def detach(hidden):
    if isinstance(hidden, (tuple, list)):
        hidden = [detach(i) for i in hidden]
    else:
        hidden = hidden.detach()
    return hidden
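
An LSTM's state is a list holding the hidden and cell arrays, and with several devices the training loop keeps a list of such lists, which is why `detach` recurses. The toy below replays the same recursion on a stand-in class so the behavior can be checked without a GPU. `FakeState` is a hypothetical placeholder for an NDArray; only its `detach()` method matters here.

```python
class FakeState:
    """Hypothetical stand-in for an array that tracks gradient history."""
    def __init__(self, name, attached=True):
        self.name = name
        self.attached = attached

    def detach(self):
        # Return a copy that no longer links back to earlier computation.
        return FakeState(self.name, attached=False)

def detach(hidden):
    if isinstance(hidden, (tuple, list)):
        return [detach(h) for h in hidden]
    return hidden.detach()

# Two devices, each with an (h, c) pair, mirroring a multi-device state list.
toy_hiddens = [[FakeState("h0"), FakeState("c0")], [FakeState("h1"), FakeState("c1")]]
toy_detached = detach(toy_hiddens)
print([[s.attached for s in pair] for pair in toy_detached])
```

The values of the state are kept, so the model still remembers the preceding text; only the gradient path into earlier batches is cut.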

Let's create an evaluate function that runs the model on a dataset and measures the average loss.

In [27]:
def evaluate(model, data_source, batch_size, ctx):
    total_L = 0.0
    ntotal = 0
    hidden = model.begin_state(batch_size=batch_size, func=mx.nd.zeros, ctx=ctx)
    for i, (data, target) in enumerate(data_source):
        data = data.as_in_context(ctx)
        target = target.as_in_context(ctx)
        output, hidden = model(data, hidden)
        hidden = detach(hidden)
        
        L = loss(output.reshape(-3, -1), target.reshape(-1))
        total_L += mx.nd.sum(L).asscalar()
        ntotal += L.size
    return total_L / ntotal

Now the training loop:

In [28]:
def train(model, train_data, val_data, test_data, epochs, lr):
    best_val = float("Inf")
    start_train_time = time.time()
    parameters = model.collect_params().values()
    
    for epoch in range(epochs):
        total_L = 0.0
        start_epoch_time = time.time()
        start_log_interval_time = time.time()
        
        hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
                   for ctx in context]
        
        for i, (data, target) in enumerate(train_data):
            data_list = gluon.utils.split_and_load(data, context, 
                                                   batch_axis=1, even_split=True)
            target_list = gluon.utils.split_and_load(target, context, 
                                                     batch_axis=1, even_split=True)
            hiddens = detach(hiddens)
            
            L = 0
            Ls = []
            with autograd.record():
                for j, (X, y, h) in enumerate(zip(data_list, target_list, hiddens)):
                    output, h = model(X, h)
                    batch_L = loss(output.reshape(-3, -1), y.reshape(-1,))
                    L = L + batch_L.as_in_context(context[0]) / (len(context) * X.size)
                    Ls.append(batch_L / (len(context) * X.size))
                    hiddens[j] = h
            L.backward()
            
            grads = [p.grad(x.context) for p in parameters for x in data_list]
            gluon.utils.clip_global_norm(grads, grad_clip)

            trainer.step(1)

            total_L += sum([mx.nd.sum(l).asscalar() for l in Ls])

            if i % log_interval == 0 and i > 0:
                cur_L = total_L / log_interval
                print('[Epoch %d Batch %d/%d] loss %.2f, ppl %.2f, '
                      'throughput %.2f samples/s'%(
                    epoch, i, len(train_data), cur_L, math.exp(cur_L),
                    batch_size * log_interval / (time.time() - start_log_interval_time)))
                total_L = 0.0
                start_log_interval_time = time.time()

        mx.nd.waitall()

        print('[Epoch %d] throughput %.2f samples/s'%(
                    epoch, len(train_data)*batch_size / (time.time() - start_epoch_time)))
        val_L = evaluate(model, val_data, batch_size, context[0])
        print('[Epoch %d] time cost %.2fs, valid loss %.2f, valid ppl %.2f'%(
            epoch, time.time()-start_epoch_time, val_L, math.exp(val_L)))

        if val_L < best_val:
            best_val = val_L
            test_L = evaluate(model, test_data, batch_size, context[0])
            model.save_parameters('{}_{}-{}.params'.format(model_name, dataset_name, epoch))
            print('test loss %.2f, test ppl %.2f'%(test_L, math.exp(test_L)))
        else:
            lr = lr*0.25
            print('Learning rate now %f'%(lr))
            trainer.set_learning_rate(lr)

    print('Total training throughput %.2f samples/s'%(
                            (batch_size * len(train_data) * epochs) /
                            (time.time() - start_train_time)))
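
The call to `gluon.utils.clip_global_norm` in the loop above rescales all gradients together when their combined L2 norm exceeds `grad_clip`. A plain-Python sketch of that idea follows; `toy_clip_global_norm` is a hypothetical helper operating on lists of floats. Unlike the library function, it returns new lists instead of modifying the gradients in place, and it omits numerical safeguards.

```python
import math

def toy_clip_global_norm(grads, max_norm):
    """Scale every gradient by the same factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total_norm

clipped, norm_before = toy_clip_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=0.25)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length shrinks.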

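Let's check the performance of the model we fetched from Gluon-NLP on the Shakespeare data before any training. Because `dataset_name=None` was passed to `get_model`, no pre-trained weights were loaded, so this measures a randomly initialized model.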

In [30]:
shakespeare_L = evaluate(model, test_dl, batch_size, context[0])
print('test loss %.2f, test ppl %.2f' %
      (shakespeare_L, math.exp(shakespeare_L)))

test loss 10.46, test ppl 34962.08
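
Perplexity is the exponential of the average per-token cross-entropy, so a model that assigns equal probability to every token in a vocabulary of size V has perplexity V. A perplexity close to the vocabulary size (34963) is therefore what a uniform guess would score. A small check in plain Python, where `perplexity` is a helper written for this note:

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_losses) / len(token_losses))

vocab_size = 34963
# Loss of a uniform guess for each of 1000 hypothetical tokens.
uniform_losses = [-math.log(1.0 / vocab_size)] * 1000
uniform_ppl = perplexity(uniform_losses)
print(round(uniform_ppl, 2))
print(round(math.log(vocab_size), 2))
```

The second printed value is the corresponding loss, which is how a loss near 10.46 and a perplexity near 35,000 describe the same situation.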


In [32]:
train(model, train_dl, val_dl, test_dl, epochs=10, lr=0.1)

[Epoch 0 Batch 200/499] loss 6.99, ppl 1090.20, throughput 480.35 samples/s
[Epoch 0 Batch 400/499] loss 6.68, ppl 799.54, throughput 478.89 samples/s
[Epoch 0] throughput 476.66 samples/s
[Epoch 0] time cost 26.42s, valid loss 6.84, valid ppl 934.93
test loss 6.84, test ppl 934.93
[Epoch 1 Batch 200/499] loss 6.39, ppl 595.78, throughput 476.75 samples/s
[Epoch 1 Batch 400/499] loss 6.27, ppl 526.18, throughput 476.46 samples/s
[Epoch 1] throughput 479.42 samples/s
[Epoch 1] time cost 26.30s, valid loss 6.68, valid ppl 797.85
test loss 6.68, test ppl 797.85
[Epoch 2 Batch 200/499] loss 6.10, ppl 444.39, throughput 483.13 samples/s
[Epoch 2 Batch 400/499] loss 6.02, ppl 410.57, throughput 474.56 samples/s
[Epoch 2] throughput 481.56 samples/s
[Epoch 2] time cost 26.20s, valid loss 6.66, valid ppl 778.30
test loss 6.66, test ppl 778.30
[Epoch 3 Batch 200/499] loss 5.90, ppl 364.99, throughput 482.56 samples/s
[Epoch 3 Batch 400/499] loss 5.84, ppl 343.06, throughput 464.88 samples/s
[Ep

## Generating Text
Let's see if our model can now generate text like Shakespeare :)

In [33]:
sentence = ["boston"]
input = mx.nd.array([vocab[sentence[0]]], ctx=context[0])

In [34]:
hidden = model.begin_state(batch_size=1, ctx=context[0])

In [35]:
text_len=200

In [36]:
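for i in range(1, text_len):
    # Add a batch axis: the LSTM uses TNC layout, so the shape is (seq_len=1, batch_size=1)
    input = mx.nd.expand_dims(input, axis=1)
    output, hidden = model(input, hidden)
    # Greedy decoding: pick the highest-scoring token at every step
    output = mx.nd.argmax(output[0], axis=1)
    input = output
    sentence.append(vocab.idx_to_token[output[0].astype("int").asscalar()])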
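
The loop above is greedy decoding: at each step it feeds back the single most likely token. Because the choice is deterministic, greedy decoding can fall into repeating loops, which may explain repetitive lines in generated text. The toy below shows the mechanism with a hand-written score table; `toy_scores` and `greedy_decode` are hypothetical and unrelated to the trained model, which also conditions on its hidden state.

```python
# Hypothetical next-token scores: current token -> {candidate: score}.
toy_scores = {
    "I":    {"will": 0.6, "pray": 0.4},
    "will": {"not": 0.7, "tell": 0.3},
    "not":  {"be": 0.9, "see": 0.1},
    "be":   {"a": 0.8, "king": 0.2},
    "a":    {"king": 0.5, "man": 0.3, "gentleman": 0.2},
    "king": {"I": 0.6, "<eos>": 0.4},
}

def greedy_decode(start, steps):
    """Repeatedly append the highest-scoring next token."""
    out = [start]
    for _ in range(steps):
        candidates = toy_scores[out[-1]]
        out.append(max(candidates, key=candidates.get))
    return out

greedy_tokens = greedy_decode("I", 12)
print(" ".join(greedy_tokens))
```

Once the highest-scoring path returns to a token it has already visited, the same phrase repeats indefinitely.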

In [37]:
sentences = []

In [38]:
line = []
for each in sentence:
    if each != '<eos>':
        line.append(each)
    else:
        sentences.append(line)
        line=[]
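
Note that the loop above only stores a line when it reaches `<eos>`, so any tokens generated after the last `<eos>` are dropped. If you want to keep them, a variant along these lines would work; `split_on_eos` is a hypothetical helper name.

```python
def split_on_eos(tokens, eos="<eos>"):
    """Group tokens into lines at each `eos`, keeping a trailing partial line."""
    lines, line = [], []
    for tok in tokens:
        if tok == eos:
            lines.append(line)
            line = []
        else:
            line.append(tok)
    if line:                      # tokens after the final <eos>
        lines.append(line)
    return lines

demo = ["I", "pray", "you", "<eos>", "Well", ",", "sir"]
demo_lines = split_on_eos(demo)
print(demo_lines)
```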

In [39]:
for each in sentences:
    print(' '.join(each))

boston The king is in the world of my life, And I will tell thee what I do not see The king of my poor heart to the king.
CAPULET: I pray you, sir, I will not be a king, And I will tell you what I do not see The king of my good heart to the king.
KING RICHARD III: Well, I will not be a man to you.
QUEEN MARGARET: I will not be a king, I will not be a king.
QUEEN MARGARET: I will not be a king, I will not be a king.
QUEEN MARGARET: I will not be a gentleman to hear the king.
QUEEN MARGARET: I will not be a gentleman to hear the king to the king.
KING RICHARD III: Well, I will not be a man to hear the king.
QUEEN MARGARET: I will not be a gentleman to be a man.
ROMEO: I pray you, sir, I am a gentleman of my heart.
