In [1]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline


## (Tutorial adapted from the PyTorch tutorials for LKA_LSTM)

# Language Modeling with ``LKA_LSTM`` and torchtext

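This is a tutorial on training a model to predict the next word in a sequence. It is adapted from a tutorial built around the
[nn.Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) module; the model trained in this notebook is ``LKA_LSTM_LM``, imported from ``lka_lstm``.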

The PyTorch 1.2 release includes a standard transformer module based on the
paper [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf).
Compared to Recurrent Neural Networks (RNNs), the transformer model has proven
to be superior in quality for many sequence-to-sequence tasks while being more
parallelizable. The ``nn.Transformer`` module relies entirely on an attention
mechanism (implemented as
[nn.MultiheadAttention](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html))
to draw global dependencies between input and output. The ``nn.Transformer``
module is highly modularized such that a single component (e.g.,
[nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html))
can be easily adapted or composed.

<img src="file://../_static/img/transformer_architecture.jpg">


## Define the model




In this tutorial, we train a ``nn.TransformerEncoder`` model on a
causal language modeling task. Please note that this tutorial does not cover
the training of [nn.TransformerDecoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder), as depicted in
the right half of the diagram above. The language modeling task is to assign a
probability to a given word (or a sequence of words) following a sequence of
words. A sequence of tokens is passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the words. The
``nn.TransformerEncoder`` consists of multiple layers of
[nn.TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html).
Along with the input sequence, a square attention mask is required because the
self-attention layers in ``nn.TransformerEncoder`` are only allowed to attend
to the earlier positions in the sequence. For the language modeling task, any
tokens at future positions should be masked. This masking, combined with the fact that
the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.
To produce a probability distribution over output words, the output of the ``nn.TransformerEncoder``
model is passed through a linear layer to output unnormalized logits.
The log-softmax function isn't applied here due to the later use of
[CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html),
which requires the inputs to be unnormalized logits.
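
The square attention mask described above can be sketched in a few lines. This is a minimal illustration, not code taken from this notebook: the helper name ``causal_mask`` and the all-zero ``scores`` tensor are assumptions made for the example. The mask holds ``-inf`` above the diagonal, so that after softmax each position places zero weight on later positions.

```python
import torch

def causal_mask(sz: int) -> torch.Tensor:
    """Return a [sz, sz] float mask: 0.0 where attention is allowed,
    -inf where position i would look at a later position j > i."""
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

mask = causal_mask(4)

# Hypothetical raw attention scores (all equal) for a sequence of length 4.
scores = torch.zeros(4, 4)

# Adding the mask before softmax drives the weights on future positions to zero.
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

Row ``i`` of ``weights`` spreads its attention only over positions ``0..i``, which is the property a causal language model needs.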




In [2]:
import math
import os
from tempfile import TemporaryDirectory
from typing import Tuple

import torch
from torch import nn, Tensor
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset

from lka_lstm import LKA_LSTM_LM

## Load and batch data




This tutorial uses ``torchtext`` to load the Wikitext-2 dataset.
To access torchtext datasets, install torchdata following the instructions at https://github.com/pytorch/data.

The vocab object is built based on the train dataset and is used to numericalize
tokens into tensors. Wikitext-2 represents rare tokens as `<unk>`.
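
To make "numericalize" concrete, here is a small sketch of a vocabulary lookup with an ``<unk>`` fallback. It is a hypothetical stand-in written in plain Python: the toy corpus, the ``vocab`` dictionary, and the ``numericalize`` helper are assumptions for illustration, and the code cell further down instead loads a pretrained tokenizer from ``en_wpiece.json``.

```python
import torch

# Toy "training set" (an assumption for this example).
train_tokens = "the cat sat on the mat".split()

# Reserve index 0 for <unk>, then add training tokens in order of first appearance.
vocab = {"<unk>": 0}
for tok in train_tokens:
    vocab.setdefault(tok, len(vocab))

def numericalize(text: str) -> torch.Tensor:
    """Map whitespace-separated tokens to ids; unseen tokens map to <unk>."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]
    return torch.tensor(ids, dtype=torch.long)

ids = numericalize("the dog sat")
print(ids)  # "dog" never appears in the training tokens, so it maps to index 0
```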

Given a 1-D vector of sequential data, ``batchify()`` splits the data
into ``batch_size`` sequences of equal length. If the data does not divide
evenly, it is trimmed to fit. For instance, with
the alphabet as the data (total length of 26) and ``batch_size=4``, we would
divide the alphabet into 4 sequences of length 6, dropping ``Y`` and ``Z``.
The illustration below shows the sequences as columns; the ``batchify()`` in this
notebook stores them as rows instead, giving a tensor of shape ``[batch_size, seq_len]``.

\begin{align}\begin{bmatrix}
  \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y} & \text{Z}
  \end{bmatrix}
  \Rightarrow
  \begin{bmatrix}
  \begin{bmatrix}\text{A} \\ \text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} &
  \begin{bmatrix}\text{G} \\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} &
  \begin{bmatrix}\text{M} \\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} &
  \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
  \end{bmatrix}\end{align}

Batching enables more parallelizable processing. However, batching means that
the model treats each sequence (each column above) independently; for example,
the dependence of ``G`` on ``F`` cannot be learned in the example above.
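
The alphabet example can be reproduced directly. The sketch below is an illustration rather than the notebook's own cell: ``batchify_rows`` is a hypothetical name, and it mirrors the row-wise ``view(bsz, seq_len)`` used in the code cell that follows, with a transpose added to show the column layout from the illustration.

```python
import torch

def batchify_rows(data: torch.Tensor, bsz: int) -> torch.Tensor:
    """Trim ``data`` to a multiple of ``bsz`` and reshape to [bsz, seq_len]."""
    seq_len = data.size(0) // bsz
    return data[: seq_len * bsz].view(bsz, seq_len)

alphabet = torch.arange(26)            # 0 = A, 1 = B, ..., 25 = Z
rows = batchify_rows(alphabet, 4)      # shape [4, 6], one sequence per row
cols = rows.t().contiguous()           # shape [6, 4], one sequence per column

# Convert the ids back to letters to compare with the illustration.
letters = ["".join(chr(ord("A") + int(i)) for i in row) for row in rows]
print(letters)
```

The last two letters, ``Y`` and ``Z``, do not fit into four sequences of length six and are dropped.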




In [3]:
from wikidat import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from tokenizers import Tokenizer

#train_iter = WikiText2(split='train')
#tokenizer = get_tokenizer('basic_english')
#vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
#vocab.set_default_index(vocab['<unk>'])
tokenizer = Tokenizer.from_file("en_wpiece.json")

def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(tokenizer.encode(item).ids, dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# Load the train, validation, and test splits and encode each one
# into a flat tensor of token ids.
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into ``bsz`` separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Arguments:
        data: Tensor, shape ``[N]``
        bsz: int, batch size

    Returns:
        Tensor of shape ``[bsz, N // bsz]`` (one sequence per row)
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    # Batch-first layout: no transpose, so each row is one contiguous sequence.
    data = data.view(bsz, seq_len)
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape ``[batch_size, seq_len]``
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

### Functions to generate input and target sequence




``get_batch()`` generates a pair of input-target sequences for
the model. It subdivides the source data into chunks of
length ``bptt``. For the language modeling task, the target for each
position is the word that follows it. For example, with a ``bptt`` value of 2,
we’d get the following two tensors for ``i`` = 0:

<img src="file://../_static/img/transformer_input_target.png">

Note that in this notebook the chunks are taken along dimension 1, the
sequence dimension ``S``. The batch dimension ``N`` is along dimension 0.
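
A small worked example may help here. The sketch below assumes the batch-first slicing used by ``get_batch()`` in the next cell; the names ``bptt_demo`` and ``get_batch_demo`` and the tiny ``source`` tensor are made up for illustration.

```python
import torch

bptt_demo = 2  # chunk length for this toy example

def get_batch_demo(source: torch.Tensor, i: int):
    """Slice an input chunk and its one-step-shifted target from [batch, seq] data."""
    seq_len = min(bptt_demo, source.size(1) - 1 - i)
    data = source[:, i:i + seq_len]
    target = source[:, i + 1:i + 1 + seq_len].reshape(-1)
    return data, target

# Two sequences of six token ids each: [[0..5], [6..11]].
source = torch.arange(12).view(2, 6)

data, target = get_batch_demo(source, 0)
print(data)    # inputs at positions 0 and 1 of each sequence
print(target)  # the tokens that follow them, flattened row by row

# The final chunk is shorter, because the last token has no successor.
last_data, last_target = get_batch_demo(source, 4)
```

Each target is the token one step to the right of its input, which is why one position at the end of every sequence is reserved and never used as an input.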




In [4]:
bptt = 512
def get_batch(source: Tensor, i: int) -> Tuple[Tensor, Tensor]:
    """
    Args:
        source: Tensor, shape ``[batch_size, full_seq_len]``
        i: int, start offset along the sequence dimension

    Returns:
        tuple (data, target), where data has shape ``[batch_size, seq_len]`` and
        target has shape ``[batch_size * seq_len]``
    """
    # The last chunk may be shorter than ``bptt``; one position is reserved for the target shift.
    seq_len = min(bptt, source.shape[1] - 1 - i)
    data = source[:, i:i + seq_len]
    # Targets are the inputs shifted one step ahead, flattened for ``CrossEntropyLoss``.
    target = source[:, i + 1:i + 1 + seq_len].reshape(-1)
    return data, target

## Initiate an instance




The model hyperparameters are defined below. The vocabulary size is
read from the tokenizer with ``get_vocab_size()``.




In [5]:
ntokens = tokenizer.get_vocab_size() #len(vocab)  # size of vocabulary
print("vocab sz: ", ntokens )
emsize = 64 #200  # embedding dimension
d_hid =  8 #200  # dimension of the feedforward network model in ``nn.TransformerEncoder``
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 8 # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability
model = LKA_LSTM_LM(ntokens, emsize, d_hid, nhead ).to(device) #TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

vocab sz:  2048


## Run the model




We use [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)
with the [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)
optimizer and a learning rate of 0.01. A [StepLR](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html)
schedule is created, but the call to ``scheduler.step()`` is commented out in the
training loop below, so the learning rate stays constant. Gradient clipping with
[nn.utils.clip_grad_norm\_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html),
which prevents gradients from exploding, is likewise commented out.
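
For reference, the sketch below shows what ``StepLR`` and ``clip_grad_norm_`` do when they are enabled (in this notebook's training code both calls are commented out). It uses a single toy parameter and an artificial loss, which are assumptions for illustration only; the ``max_norm`` of 0.5 and ``gamma`` of 0.95 match the values that appear in the code cell.

```python
import torch
from torch import nn

# Toy parameter standing in for a model (an assumption for this example).
param = nn.Parameter(torch.ones(4))
optimizer = torch.optim.AdamW([param], lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

lrs = []
for epoch in range(3):
    loss = (param ** 2).sum() * 100      # artificial loss with large gradients
    optimizer.zero_grad()
    loss.backward()
    # Rescale the gradients in place so their total L2 norm is at most 0.5.
    torch.nn.utils.clip_grad_norm_([param], max_norm=0.5)
    grad_norm = param.grad.norm().item()
    optimizer.step()
    scheduler.step()                     # multiply the learning rate by gamma
    lrs.append(scheduler.get_last_lr()[0])

print(lrs)        # each entry is 0.95 times the previous learning rate
print(grad_norm)  # no larger than max_norm after clipping
```

With ``step_size=1`` the learning rate after ``n`` epochs is ``lr * gamma ** n``.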




In [6]:
import time

criterion = nn.CrossEntropyLoss()
lr = 0.01  # learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

def train(model: nn.Module) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 50 #200
    start_time = time.time()

    num_batches = train_data.shape[1] // bptt
    #print("n batches:", num_batches )
    #print("train_shape: ", train_data.shape )
    for batch, i in enumerate(range(0, train_data.size(1) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        output = model(data)
        output_flat = output.view(-1, ntokens)
        loss = criterion(output_flat, targets)

        optimizer.zero_grad()
        loss.backward()
        #torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            #lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(min(cur_loss, 100 ))
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            #f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
            total_loss = 0
            start_time = time.time()

def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    model.eval()  # turn on evaluation mode
    total_loss = 0.
    with torch.no_grad():
        for i in range(0, eval_data.size(1) - 1, bptt):
            data, targets = get_batch(eval_data, i)
            seq_len = data.size(1)
            output = model(data)
            output_flat = output.view(-1, ntokens)
            total_loss += seq_len * criterion(output_flat, targets).item()
            if i==0:
                # Decode the greedy (argmax) predictions for the first sequence as a sanity check.
                out_ids = output.argmax(dim=-1)[0].cpu().tolist()
                print(tokenizer.decode(out_ids).replace(' ##', ''))
    return total_loss / (eval_data.size(1) - 1)

Loop over epochs. Save the model if the validation loss is the best
we've seen so far. The per-epoch ``scheduler.step()`` call is commented out,
so the learning rate is not adjusted between epochs.
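
The "keep the best checkpoint" pattern can be seen in isolation with a toy model. In this sketch the ``nn.Linear`` model, the made-up ``val_losses`` list, and the weight overwrite that stands in for training are all assumptions for illustration; only the save-and-restore logic mirrors the loop below.

```python
import os
from tempfile import TemporaryDirectory

import torch
from torch import nn

model = nn.Linear(2, 1)
val_losses = [0.9, 0.5, 0.7]  # made-up validation losses, one per epoch
best_val_loss = float("inf")

with TemporaryDirectory() as tempdir:
    path = os.path.join(tempdir, "best_model_params.pt")
    for epoch, val_loss in enumerate(val_losses, start=1):
        # Stand-in for a training epoch: overwrite the weights with the epoch number.
        with torch.no_grad():
            model.weight.fill_(float(epoch))
        # Save only when the validation loss improves.
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), path)
    # Restore the weights from the epoch with the lowest validation loss.
    model.load_state_dict(torch.load(path))

print(best_val_loss, model.weight)
```

After the loop the model holds the weights from epoch 2, the epoch with the lowest validation loss, rather than those from the final epoch.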



In [7]:
best_val_loss = float('inf')
epochs = 200

with TemporaryDirectory() as tempdir:
    best_model_params_path = os.path.join(tempdir, "best_model_params.pt")

    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()
        train(model)
        val_loss = evaluate(model, val_data)
        val_ppl = math.exp(min(val_loss, 100 ))
        elapsed = time.time() - epoch_start_time
        print('-' * 89)
        print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
            f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
        print('-' * 89)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), best_model_params_path)

        #scheduler.step()
    model.load_state_dict(torch.load(best_model_params_path)) # load best model states

| epoch   1 |    50/  332 batches | loss  6.14 | ppl   463.62
| epoch   1 |   100/  332 batches | loss  5.08 | ppl   160.03
| epoch   1 |   150/  332 batches | loss  4.77 | ppl   117.96
| epoch   1 |   200/  332 batches | loss  4.66 | ppl   106.07
| epoch   1 |   250/  332 batches | loss  4.58 | ppl    97.17
| epoch   1 |   300/  332 batches | loss  4.50 | ppl    90.24
=ar and ,iansy . , =ar and ,iansy . , and as a <an <rite , < starrite , and a < of the unk > ,rite , the < <anticul 'n < anditectdctt , <r < of , the < <r the is aoser <ative the the < <rite , andoll the < , the is beth the <ntth of the th 0amby < 0 0 the , the <ive the thickitiesam , , < 9 0lack , the then of <ervolaflyira theosen , the the of the <rite , <uel and aing the .rite ,uc . theache the theern thered , the <m of ander theitheraa , was <ol to the <ale , the to the < , theirion of the the unk > ,ryana thearm ,iansy . , a <ly <im ,por toia , and the aider <ambt , therite ,et , and of < the < < a of the = =ription

KeyboardInterrupt: 

## Evaluate the best model on the test dataset
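
The perplexity reported in the logs is the exponential of the mean cross-entropy loss. As a rough reference point, the sketch below computes both for a model that predicts every token with equal probability; the all-zero logits and the arbitrary target ids are assumptions for illustration, while the vocabulary size of 2048 matches the value printed earlier.

```python
import math

import torch
from torch import nn

vocab_size = 2048
criterion = nn.CrossEntropyLoss()

# All-zero logits mean a uniform distribution over the vocabulary.
logits = torch.zeros(5, vocab_size)
targets = torch.tensor([3, 14, 15, 92, 653])  # arbitrary token ids

loss = criterion(logits, targets).item()
ppl = math.exp(loss)
print(f"loss {loss:5.2f} | ppl {ppl:8.2f}")
```

A uniform guess gives a loss of ``ln(2048)`` and a perplexity equal to the vocabulary size, so any trained model should score well below that.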




In [None]:
test_loss = evaluate(model, test_data)
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)