# Language Modeling using LSTM on Penn Treebank

Language modeling is the task of predicting the next word of a sequence given the words that precede it. In this notebook we demonstrate how to predict the next word of a sequence using an LSTM. We use the Penn Treebank dataset, which contains 888K words for training, 70K for validation, and 79K for testing, with a vocabulary size of 10K.

In [1]:
import tempfile

import fastestimator as fe
import numpy as np
import tensorflow as tf
from fastestimator.op.numpyop import NumpyOp
from fastestimator.op.tensorop.loss import CrossEntropy
from fastestimator.op.tensorop.model import ModelOp, UpdateOp
from fastestimator.trace import Trace
from fastestimator.trace.adapt import EarlyStopping, LRScheduler
from fastestimator.trace.io import BestModelSaver

In [2]:
# Parameters
epochs=30
batch_size=128
seq_length=20
vocab_size=10000
data_dir=None
max_train_steps_per_epoch=None
save_dir=tempfile.mkdtemp()

## Building Components

### Downloading the data

First, we will download the Penn Treebank dataset via our dataset API.

In [3]:
from fastestimator.dataset.data.penn_treebank import load_data
train_data, eval_data, _, vocab = load_data(root_dir=data_dir, seq_length=seq_length + 1)

### Step 1: Create `Pipeline`

We will create a custom NumpyOp to generate input and target sequences.

In [4]:
class CreateInputAndTarget(NumpyOp):
    def forward(self, data, state):
        return data[:-1], data[1:]
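
To make the shift concrete, here is a small NumPy sketch using a made-up sequence of six token ids (the ids are illustrative and not taken from the Penn Treebank vocabulary). It applies the same slicing as `CreateInputAndTarget.forward`: dropping the last token gives the input, and dropping the first gives the target. This is presumably why `load_data` is called with `seq_length + 1`, so that both halves end up with `seq_length` tokens.

```python
import numpy as np

# Hypothetical token ids standing in for one stored sequence
tokens = np.array([11, 12, 13, 14, 15, 16])

# Same slicing as CreateInputAndTarget.forward
x, y = tokens[:-1], tokens[1:]

print(x)  # input:  every token except the last
print(y)  # target: the same tokens shifted one position to the left
```

At every position `i`, `y[i]` is the word that follows `x[i]`, which is the quantity the model is trained to predict.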

In [5]:
pipeline = fe.Pipeline(train_data=train_data,
                       eval_data=eval_data,
                       batch_size=batch_size,
                       ops=CreateInputAndTarget(inputs="x", outputs=("x", "y")),
                       drop_last=True)

### Step 2: Create `Network`

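The model is an LSTM network: an embedding layer, a single LSTM layer, dropout, and a dense layer that outputs one logit per vocabulary word.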

In [6]:
def build_model(vocab_size, embedding_dim, rnn_units, seq_length):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[None, seq_length]),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [7]:
model = fe.build(model_fn=lambda: build_model(vocab_size, embedding_dim=300, rnn_units=600, seq_length=seq_length),
                     optimizer_fn=lambda: tf.optimizers.SGD(1.0, momentum=0.9))
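
As a rough sanity check on model size, the sketch below counts the trainable parameters by hand for the values used above (`embedding_dim=300`, `rnn_units=600`). It assumes the standard Keras layer definitions, in particular a single bias vector per LSTM gate, and does not build the model itself.

```python
vocab_size, embedding_dim, rnn_units = 10000, 300, 600

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# Dense output layer: weights plus biases (Dropout has no parameters)
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under these assumptions the output projection is the largest component, because it maps every LSTM state to a score for each of the 10K words.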

We now define the `Network` object:

In [8]:
network = fe.Network(ops=[
    ModelOp(model=model, inputs="x", outputs="y_pred"),
    CrossEntropy(
        inputs=("y_pred", "y"), outputs="ce", form="sparse", from_logits=True),
    UpdateOp(model=model, loss_name="ce")
])
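
The loss configured above (`form="sparse"`, `from_logits=True`) takes integer word ids as targets and unnormalized scores as predictions. The following is a plain NumPy sketch of that computation, not FastEstimator's own implementation; `sparse_ce_from_logits` is a hypothetical helper name. It also shows that a model scoring every word equally has a loss of ln(10000), about 9.21, which is consistent with the first training loss in the log further down.

```python
import numpy as np

def sparse_ce_from_logits(logits, targets):
    """Mean negative log-likelihood of integer targets given unnormalized scores."""
    # Numerically stable log-softmax over the last (vocabulary) axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each target id
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return float(-picked.mean())

vocab_size = 10000
# Stand-in for an untrained model: every word gets the same score
logits = np.zeros((2, 20, vocab_size))
targets = np.random.default_rng(0).integers(0, vocab_size, size=(2, 20))

loss = sparse_ce_from_logits(logits, targets)
print(loss, np.log(vocab_size))
```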

In this example we will also use the following traces:

1. A custom trace to calculate perplexity, the exponential of the cross-entropy loss.
2. `LRScheduler` to apply a custom learning rate schedule.
3. `BestModelSaver` to save the model with the lowest perplexity. For illustration purposes, we save it in a temporary directory.
4. `EarlyStopping` to stop training once perplexity has stopped improving.

In [9]:
def lr_schedule(step, init_lr):
    if step <= 1725:
        lr = init_lr + init_lr * (step - 1) / 1725
    else:
        lr = max(2 * init_lr * ((6900 - step + 1725) / 6900), 1.0)
    return lr


class Perplexity(Trace):
    def on_epoch_end(self, data):
        ce = data["ce"]
        data.write_with_log(self.outputs[0], np.exp(ce))


traces = [
    Perplexity(inputs="ce", outputs="perplexity", mode="eval"),
    LRScheduler(model=model, lr_fn=lambda step: lr_schedule(step, init_lr=1.0)),
    BestModelSaver(model=model, save_dir=save_dir, metric='perplexity', save_best_mode='min', load_best_final=True),
    EarlyStopping(monitor="perplexity", patience=5)
]
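
To see the shape of the learning rate schedule, the sketch below repeats `lr_schedule` so that it runs on its own and prints the rate at a few steps. The chosen steps are illustrative. Since the training log reports 345 steps per epoch, step 1725 corresponds to the end of epoch 5.

```python
def lr_schedule(step, init_lr):
    # Same function as above, repeated so this snippet is self-contained
    if step <= 1725:
        lr = init_lr + init_lr * (step - 1) / 1725
    else:
        lr = max(2 * init_lr * ((6900 - step + 1725) / 6900), 1.0)
    return lr

# Warm-up from init_lr towards 2 * init_lr, then linear decay with a floor of 1.0
for step in (1, 300, 1725, 3450, 5175, 6900):
    print(step, round(lr_schedule(step, init_lr=1.0), 4))
```

Note that the floor in the decay branch is the constant `1.0`, not `init_lr`, so the two only coincide because `init_lr=1.0` is used here.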

### Step 3: Create `Estimator`

In [10]:
estimator = fe.Estimator(pipeline=pipeline,
                         network=network,
                         epochs=epochs,
                         traces=traces,
                         max_train_steps_per_epoch=max_train_steps_per_epoch, 
                         log_steps=300)

## Training and Testing

In [11]:
estimator.fit()

    ______           __  ______     __  _                 __            
   / ____/___ ______/ /_/ ____/____/ /_(_)___ ___  ____ _/ /_____  _____
  / /_  / __ `/ ___/ __/ __/ / ___/ __/ / __ `__ \/ __ `/ __/ __ \/ ___/
 / __/ / /_/ (__  ) /_/ /___(__  ) /_/ / / / / / / /_/ / /_/ /_/ / /    
/_/    \__,_/____/\__/_____/____/\__/_/_/ /_/ /_/\__,_/\__/\____/_/     
                                                                        

FastEstimator-Start: step: 1; num_device: 1; logging_interval: 300; 
FastEstimator-Train: step: 1; ce: 9.210202; model_lr: 1.0; 
FastEstimator-Train: step: 300; ce: 6.110634; steps/sec: 8.33; model_lr: 1.1733333; 
FastEstimator-Train: step: 345; epoch: 1; epoch_time: 43.55 sec; 
FastEstimator-BestModelSaver: Saved model to /tmp/tmptl5d8hgb/model_best_perplexity.h5
FastEstimator-Eval: step: 345; epoch: 1; ce: 5.8996396; perplexity: 364.90594; since_best_perplexity: 0; min_perplexity: 364.90594; 
FastEstimator-Train: step: 600; ce: 5.7039967; steps/sec: 8.2
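
The numbers in this log can be cross-checked by hand, since perplexity is the exponential of the cross-entropy. The short sketch below uses only values copied from the log lines above.

```python
import math

eval_ce = 5.8996396                 # "ce" on the epoch-1 Eval line
eval_perplexity = math.exp(eval_ce)
print(eval_perplexity)              # compare with the logged perplexity, 364.90594

first_train_ce = 9.210202           # "ce" at training step 1
initial_perplexity = math.exp(first_train_ce)
print(initial_perplexity)           # compare with the vocabulary size, 10000
```

A perplexity close to the vocabulary size at step 1 means the untrained model is about as uncertain as a uniform guess over all 10K words. After one epoch the model is roughly as uncertain as a uniform choice among about 365 words.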

## Inferencing

Once the training is finished, we will use the model to generate some sequences of text.

In [12]:
def get_next_word(data, vocab):
    output = network.transform(data, mode="infer")
    logits = output["y_pred"].numpy().squeeze()[-1]  # scores for the last position
    index = logits.argmax()
    if index == 44:  # id 44 is <unk>: fall back to the second most likely word
        index = logits.argsort()[-2]
    return index

def generate_sequence(inp_seq, vocab, min_paragraph_len=50):
    data = pipeline.transform({"x": inp_seq}, mode="infer")
    generated_seq = data["x"]
    counter=0
    next_entry=0
    # Stop at the first <eos> (id 43) after min_paragraph_len words, or after min_paragraph_len + 30 words
    while (counter < min_paragraph_len or next_entry != 43) and counter < min_paragraph_len + 30:
        next_entry = get_next_word(data, vocab)
        generated_seq = np.concatenate([generated_seq.squeeze(), [next_entry]])
        # Feed only the most recent seq_length tokens back into the model
        data = {"x": generated_seq[-seq_length:].reshape((1, seq_length))}
        counter+=1

    return " ".join([vocab[i] for i in generated_seq])
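
The two ideas used above, greedy selection with a fallback when `<unk>` wins and a sliding window over the most recent 20 tokens, can be tried in isolation without the trained network. The sketch below uses made-up scores; `pick_next` is a hypothetical helper, and the id 44 for `<unk>` is taken from the code above.

```python
import numpy as np

UNK_ID = 44  # id treated as <unk> in the notebook code

def pick_next(last_step_logits, unk_id=UNK_ID):
    """Greedy choice that falls back to the runner-up when <unk> scores highest."""
    order = np.argsort(last_step_logits)  # ascending, so the best id is last
    best = int(order[-1])
    return int(order[-2]) if best == unk_id else best

# Made-up scores over a toy vocabulary of 100 ids
logits = np.zeros(100)
logits[44] = 5.0  # <unk> has the highest score
logits[7] = 3.0   # runner-up
print(pick_next(logits))  # the runner-up id is returned instead of <unk>

# Sliding window: only the last 20 generated ids are fed back to the model
history = np.arange(25)
window = history[-20:].reshape((1, 20))
print(window.shape, window[0, 0], window[0, -1])
```

Because the choice is always the single most likely word, the same context always produces the same continuation.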

We will take a text sequence from the validation dataset, feed it to the model, and generate a paragraph that continues it.

In [13]:
for _ in range(2):
    idx = np.random.choice(len(eval_data))
    inp_seq = eval_data["x"][idx]
    print("Input Sequence:", " ".join([vocab[i] for i in inp_seq[:20]]))
    gen_seq = generate_sequence(inp_seq, vocab, 50)
    print("\nGenerated Sequence:", gen_seq)
    print("\n")

Input Sequence: the pictures <eos> the state <unk> noted that <unk> banking practices are grounds for removing an officer or director and

Generated Sequence: the pictures <eos> the state <unk> noted that <unk> banking practices are grounds for removing an officer or director and chief executive officer <eos> mr. guber and mr. peters have been working on the board <eos> the company said the company will be able to pay for the $ N million of the company 's common shares outstanding <eos> the company said the company 's net income rose N N to $ N million from $ N million <eos>


Input Sequence: the russians in iran the russians seem to have lost interest in the whole subject <eos> meanwhile congress is cutting

Generated Sequence: the russians in iran the russians seem to have lost interest in the whole subject <eos> meanwhile congress is cutting the capital-gains tax cut to the u.s. and the u.s. trade deficit <eos> the u.s. trade deficit has been the highest since august N <eos> the dol

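As the samples show, the network produces grammatical sentences in the financial-news style of the corpus, although always picking the most likely word makes it repeat phrases such as "the company said the company".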