<a href="https://colab.research.google.com/github/liadmagen/NLP-Course/blob/master/exercises_notebooks/09_LM_LSTM_Language_Model_%26_Word_parts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN / BiLSTM Language Model using Word Vectors
In this notebook, we will train a language model using an **LSTM**.

We'll load the book *Frankenstein* and convert it into a semantic representation using a word-vector library called Byte-Pair Embeddings (BPEmb).

After you run this notebook, try changing the data source from "frankenstein.txt" to "dracula.txt" and observe the result.

### How are we going to do it?

We will build our data so that for every word used as input to the model $(X = W_n)$, the next word is the output $(Y = W_{n+1})$.

The words in the output, $Y$, will be represented as one-hot vectors.

**Q: What is the size of this Vector?**
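
To make the one-hot idea concrete, here is a minimal sketch on a made-up five-token vocabulary. The vocabulary size of 5 and the token ID 2 are illustrative assumptions, not values from this notebook. The vector has one slot per vocabulary entry and a single 1 at the index of the target token.

```python
import torch
import torch.nn.functional as F

toy_vocab_size = 5            # hypothetical vocabulary size, for illustration only
target_id = torch.tensor(2)   # hypothetical ID of the next token

# One slot per vocabulary entry, with a single 1 at the target's index.
one_hot = F.one_hot(target_id, num_classes=toy_vocab_size)
print(one_hot)        # tensor([0, 0, 1, 0, 0])
print(one_hot.shape)  # torch.Size([5])
```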


In [4]:
%%capture
!pip install bpemb

In [5]:
import tensorflow as tf

import time
import math
import unicodedata
import string

import torch
import torch.nn.functional as F
from torch import nn, tensor

from torchtext.data import get_tokenizer

from bpemb import BPEmb

In [39]:
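# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")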

## Word Vectors

Let's convert the text into vectors.

We will use a package called [BPEmb](https://nlp.h-its.org/bpemb/), which encodes words as vectors by splitting them into **subwords**: pieces of words made of characters that frequently appear together.

Q: Do you remember the name of the linguistic level that deals with the letter level?

In [6]:
bpemb_en = BPEmb(lang="en")

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|██████████| 400869/400869 [00:00<00:00, 722911.23B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz


100%|██████████| 3784656/3784656 [00:01<00:00, 3369678.16B/s]


In [None]:
bpemb_en.vectors.shape

(10000, 100)

We will use this helper function to load the corpus data (the books):

In [7]:
def get_file(filename = "frankenstein.txt"):
  path = tf.keras.utils.get_file(
      filename, origin=f"https://raw.githubusercontent.com/liadmagen/NLP-Course/master/dataset/{filename}"
  )
  with open(path, encoding="utf-8") as f:
      text = f.read() 
  text = text.replace("\n", " ")        # Remove line-breaks & newlines
  print("Corpus length:", len(text))
  return text

## RNN Model
And this is the model itself, in a very bare-bones form.

Note: In real-life projects we use helper frameworks such as [ignite](https://pytorch.org/ignite/) or [lightning](https://www.pytorchlightning.ai/).

We present the raw version here for learning purposes only.

In [55]:
class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, ninp, noutp, nhid, nlayers, dropout=0.5, tie_weights=False):
        """
        Parameters:
          ninp =  LSTM input size 
          noutp = size of the output (number of classes)
          nhid = number of neurons in the hidden layer
          nlayers = number of stacked recurrent (LSTM) layers
          dropout = dropout rate
          tie_weights = whether to use tie_weights (see note)
        """
        super(RNNModel, self).__init__()
        self.noutp = noutp
        self.drop = nn.Dropout(dropout)

        self.encoder = nn.Embedding.from_pretrained(tensor(bpemb_en.vectors))
        
        # self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity='relu', dropout=dropout)

        # LSTM (Long Short-Term Memory) is an RNN variant whose gated cell state lets it
        # retain information over many steps, so it usually copes better with longer sequences.
        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)

        self.decoder = nn.Linear(nhid, noutp)

        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to ninp (embedding size)')
            self.decoder.weight = self.encoder.weight

        self.init_weights()

        self.nhid = nhid
        self.nlayers = nlayers

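    def init_weights(self):
        initrange = 0.1
        # The encoder holds the pretrained (frozen) BPEmb vectors, so it is not re-initialized.
        nn.init.zeros_(self.decoder.bias)
        # Skip the decoder weight when it is tied to the encoder, to keep the pretrained vectors.
        if self.decoder.weight is not self.encoder.weight:
            nn.init.uniform_(self.decoder.weight, -initrange, initrange)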

    def forward(self, input, hidden):
        emb = self.drop(self.encoder(input))
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.noutp)
        return F.log_softmax(decoded, dim=1), hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, batch_size, self.nhid),
                weight.new_zeros(self.nlayers, batch_size, self.nhid))


A helper function to convert the tokens into batches:

In [37]:
def batchify(data, batch_size):
    # Work out how cleanly we can divide the dataset into batch_size parts.
    nbatch = data.size(0) // batch_size

    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * batch_size)
    
    # Evenly divide the data across the batch_size batches.
    data = data.view(batch_size, -1).t().contiguous()
    
    return data.to(device)
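
To see what this reshaping does, here is a small sketch on a toy stream of ten numbers. `batchify_cpu` is a hypothetical copy of the function above that skips the move to `device`, and the toy values stand in for token IDs. Each column of the result is a contiguous run of the original stream, and the leftover element is trimmed.

```python
import torch

def batchify_cpu(data, batch_size):
    """Same reshaping as `batchify`, without the move to `device` (illustration only)."""
    nbatch = data.size(0) // batch_size
    data = data.narrow(0, 0, nbatch * batch_size)
    return data.view(batch_size, -1).t().contiguous()

toy = torch.arange(10)           # stand-in for a stream of 10 token IDs
columns = batchify_cpu(toy, 3)   # 10 // 3 = 3 rows; the last element (9) is trimmed
print(columns)
# tensor([[0, 3, 6],
#         [1, 4, 7],
#         [2, 5, 8]])
```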

Let's load the data:

In [8]:
train_corpus = get_file('dracula.txt')
val_corpus = get_file('frankenstein.txt')

print(train_corpus[:300])
print(val_corpus[:300])

Downloading data from https://raw.githubusercontent.com/liadmagen/NLP-Course/master/dataset/dracula.txt
Corpus length: 842159
Downloading data from https://raw.githubusercontent.com/liadmagen/NLP-Course/master/dataset/frankenstein.txt
Corpus length: 420726
Dracula, by Bram Stoker  CHAPTER I  JONATHAN HARKER'S JOURNAL  (_Kept in shorthand._)   _3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse whic
Frankenstein, or, the Modern Prometheus by Mary Wollstonecraft (Godwin) Shelley  Letter 1  _To Mrs. Saville, England._   St. Petersburgh, Dec. 11th, 17—.   You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings. 


## Semantic representation + word-parts

Now let's encode the text into subword tokens and their vocabulary IDs:

In [9]:
train_encoded_text = bpemb_en.encode(train_corpus)
train_encoded_ids = bpemb_en.encode_ids(train_corpus)

val_encoded_text = bpemb_en.encode(val_corpus)
val_encoded_ids = bpemb_en.encode_ids(val_corpus)

Let's check the encoded text (we'll get to the encoded IDs in a moment).

Notice that every word is now broken into pieces.

A **'▁'** mark at the beginning of a token marks the beginning of a new word.

In [None]:
train_encoded_text[:50]

['▁dra',
 'c',
 'ula',
 ',',
 '▁by',
 '▁br',
 'am',
 '▁st',
 'oker',
 '▁chapter',
 '▁i',
 '▁jonathan',
 '▁har',
 'ker',
 "'",
 's',
 '▁journal',
 '▁(',
 '_',
 'ke',
 'pt',
 '▁in',
 '▁sh',
 'or',
 'th',
 'and',
 '.',
 '_',
 ')',
 '▁',
 '_',
 '0',
 '▁may',
 '.',
 '▁b',
 'ist',
 'rit',
 'z',
 '.',
 '_',
 '-',
 '-',
 'left',
 '▁mun',
 'ich',
 '▁at',
 '▁0:00',
 '▁p',
 '.',
 '▁m']

This method is called **word-parts** (also known as subword tokenization).

Instead of representing whole words (word2vec, GloVe) or character n-grams (fastText), this method represents slices of text: combinations of characters that frequently appear together.

It does so by starting from single characters and repeatedly merging the most frequent adjacent pair of symbols in a very large corpus, until the vocabulary reaches a chosen size K. Note that the resulting pieces may be just a letter or two (`st`, `z`) or whole words (`journal`, `chapter`).

The result is a vocabulary that is far smaller than all the words in a language (how big would that be for English, for example? How about for your native language?) but bigger than a vocabulary that includes all the characters of a language (or multiple languages):

**character-based << word-piece based << word-based**
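
The merging idea behind byte-pair encoding can be sketched in a few lines. This is a simplified toy version for illustration, not the code BPEmb itself uses; the function name `learn_bpe` and the tiny word list are assumptions made up for this example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite the corpus, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

toy_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, segmented = learn_bpe(toy_words, num_merges=3)
print(merges)           # the learned merge rules, in order
print(list(segmented))  # each word as a tuple of subword pieces
```

With more merges, frequent words collapse into single pieces while rare words stay split into several, which is the behaviour visible in the encoded *Dracula* tokens above.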

## Model Parameters

In [9]:
batch_size = 32
eval_batch_size = 32

vocab_size = bpemb_en.vocab_size
embsize = bpemb_en.vectors.shape[1]
nhidden = 256
nlayers = 2

In [None]:
model = RNNModel(embsize, vocab_size, nhidden, nlayers).to(device)

In [None]:
criterion = nn.NLLLoss()

## Division to train/validation

In [40]:
train_enc_ids = torch.tensor(train_encoded_ids).type(torch.int64)
train_data = batchify(train_enc_ids, batch_size)

val_enc_ids = torch.tensor(val_encoded_ids).type(torch.int64)
val_data = batchify(val_enc_ids, batch_size)

After `batchify`, the token IDs are arranged in `batch_size` (here 32) columns; each column is a contiguous stretch of the text, and each row holds one token per column.

In [44]:
print("training data size: ", train_data.shape)
train_data

training data size:  torch.Size([6972, 32])


tensor([[1187,    7, 4003,  ...,  107,  204, 9948],
        [9924, 3027, 9935,  ...,   42, 9937, 9940],
        [2206, 1274,    7,  ...,  619, 9920, 9940],
        ...,
        [1842, 7579,   27,  ...,    7, 1597,   72],
        [6732,   71, 4280,  ...,   91,  107,  335],
        [ 544,    7,  154,  ...,  363,   73,   10]], device='cuda:0')

In [31]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""

    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
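
Why detaching matters can be shown with a tiny sketch; the tensors here are made up for illustration. A detached tensor keeps its values but is cut off from the autograd graph, so backpropagation stops at the chunk boundary instead of running back through every earlier step.

```python
import torch

w = torch.ones(1, requires_grad=True)
h = w * 2                 # h is part of the autograd graph that leads back to w
h_detached = h.detach()   # same values, but no history

print(h.requires_grad, h_detached.requires_grad)  # True False
print(torch.equal(h, h_detached))                 # True
```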


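`get_batch` returns a chunk of consecutive rows as the input (the current tokens) and the same chunk shifted down by one row as the target (the $n+1$ tokens). Note that it reuses `batch_size` as the chunk length, and that a single row such as `curr_batch[0]` holds one token from each of the 32 columns, so it does not decode into a readable sentence.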

In [81]:
def get_batch(source, i):
    seq_len = min(batch_size, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target
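
The shift between input and target is easier to see on a toy tensor. `get_batch_toy` is a hypothetical copy of the function above with the chunk length passed in explicitly, and the values are made up.

```python
import torch

def get_batch_toy(source, i, seq_len):
    """Toy copy of `get_batch` with an explicit chunk length (illustration only)."""
    seq_len = min(seq_len, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

# Three parallel streams (columns) of four tokens each.
toy_source = torch.tensor([[0, 4, 8],
                           [1, 5, 9],
                           [2, 6, 10],
                           [3, 7, 11]])
data, target = get_batch_toy(toy_source, 0, seq_len=2)
print(data)    # rows 0-1: tensor([[0, 4, 8], [1, 5, 9]])
print(target)  # rows 1-2, flattened: tensor([ 1,  5,  9,  2,  6, 10])
```

The target is flattened so that it lines up with the model output, which is reshaped to one row of scores per token.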

In [105]:
curr_batch, next_batch = get_batch(train_data, 0)
print("Current sentence IDs: ", curr_batch[0])
print("Target token id:", next_batch[0].item())

print("Current sentence: ", bpemb_en.decode_ids(curr_batch[0]))
print("Target token: ", bpemb_en.decode_ids([next_batch[0].item()]))

Current sentence IDs:  tensor([1187,    7, 4003, 4062, 9951,   34,  142,  120, 9976, 1022, 6948,  437,
         280,    7,   25,  394, 3238,  352, 7607, 7081, 7253,   58, 9934, 1233,
        9934,  822,  335, 1375, 3677,  107,  204, 9948], device='cuda:0')
Target token id: 9924
Current sentence:  dra the table things; andies that_ life?" may can theit off professor mesing everythingumed th,let, go her inf quickly itire:
Target token:  c



## Training function

In [None]:
def train(train_data, log_interval = 100):
    # Turn on training mode - which enables dropout.
    model.train()

    total_loss = 0.

    start_time = time.time()
    ntokens = len(train_data)
    hidden = model.init_hidden(batch_size)

    for batch, i in enumerate(range(0, train_data.size(0) - 1, batch_size)):
        data, targets = get_batch(train_data, i)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()
        hidden = repackage_hidden(hidden)
        output, hidden = model(data, hidden)
        loss = criterion(output, targets)
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
        for p in model.parameters():
          if p.grad is not None:
            p.data.add_(p.grad, alpha=-lr)

        total_loss += loss.item()

        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                    'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // batch_size, lr,
                elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

In [None]:
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(data_source)

    hidden = model.init_hidden(eval_batch_size)

    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, batch_size):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    return total_loss / (len(data_source) - 1)

## Training loop:

In [None]:
# Loop over epochs.
lr = 20
best_val_loss = None
epochs = 40

for epoch in range(1, epochs+1):
    epoch_start_time = time.time()
    train(train_data)
    val_loss = evaluate(val_data)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
            'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                        val_loss, math.exp(val_loss)))
    print('-' * 89)

    if not best_val_loss or val_loss < best_val_loss:
        best_val_loss = val_loss
    else:
        # Anneal the learning rate if no improvement has been seen in the validation dataset.
        lr /= 2.0

| epoch   1 |   100/  217 batches | lr 20.00 | ms/batch 31.28 | loss  7.23 | ppl  1386.43
| epoch   1 |   200/  217 batches | lr 20.00 | ms/batch 28.21 | loss  6.63 | ppl   754.38
-----------------------------------------------------------------------------------------
| end of epoch   1 | time:  7.27s | valid loss  7.04 | valid ppl  1140.79
-----------------------------------------------------------------------------------------
| epoch   2 |   100/  217 batches | lr 20.00 | ms/batch 28.46 | loss  6.58 | ppl   722.67
| epoch   2 |   200/  217 batches | lr 20.00 | ms/batch 28.19 | loss  6.46 | ppl   636.08
-----------------------------------------------------------------------------------------
| end of epoch   2 | time:  6.99s | valid loss  6.94 | valid ppl  1032.34
-----------------------------------------------------------------------------------------
| epoch   3 |   100/  217 batches | lr 20.00 | ms/batch 28.50 | loss  6.46 | ppl   641.77
| epoch   3 |   200/  217 batches | lr 20.
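
The `ppl` column in the log is perplexity: the exponential of the average per-token negative log-likelihood, which is what `math.exp(cur_loss)` computes. A minimal sketch with made-up probabilities (the numbers are assumptions for illustration, not results from this model):

```python
import math

# Hypothetical probabilities a model assigned to the correct next token.
probs_of_correct_token = [0.25, 0.5, 0.125, 0.5]

nll = [-math.log(p) for p in probs_of_correct_token]
avg_loss = sum(nll) / len(nll)
perplexity = math.exp(avg_loss)
print(round(perplexity, 4))

# A model that guesses uniformly over a 10,000-token vocabulary
# has a perplexity of about 10,000.
uniform_perplexity = math.exp(-math.log(1 / 10000))
print(round(uniform_perplexity))
```

Read this way, a perplexity of around 700 means the model is roughly as uncertain as if it were choosing uniformly among 700 subwords at each step.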

# Text Generation example

Language models are the basis of many tasks. They can be used for classification as pretrained models that are transferred to new tasks by fine-tuning them on those downstream tasks.
They can also be used for tasks that require generating text. For example:
* Text summarization
* Getting a response from a personalized chatbot
* Annotating an image

Can you think of additional examples where text generation is needed?

Let's use the language model we've just trained to generate text.

You can play with the parameters, such as `words_to_generate` and `temperature`, to see how they affect the output.

In [None]:
model.eval()

log_interval = 100
words_to_generate = 50
temperature = 1. # higher temperature will increase diversity

# generate random start
input = torch.randint(10000, (1, 1), dtype=torch.long).to(device)

hidden = model.init_hidden(1)

generated_word_ids = []

with torch.no_grad():  # no tracking history
 for i in range(words_to_generate):
    output, hidden = model(input, hidden)
    word_weights = output.squeeze().div(temperature).exp().cpu()
    word_idx = torch.multinomial(word_weights, 1)[0]
    input.fill_(word_idx)

    generated_word_ids.append(word_idx.tolist())
    # word = bpemb_en.decode_ids([word_idx.tolist()])
    # print(word + ('\n' if i % 20 == 19 else ' '))

    # if i % log_interval == 0:
    #     print('| Generated {}/{} words'.format(i, words_to_generate))

bpemb_en.decode_ids(generated_word_ids)

'. sudden is, as a child-inateringched, and then, went into the warm g agreedply falling, which sm f was and theity rightust. the first ised into f nightly, monthly-out at'
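
The generation cell divides the log-probabilities by `temperature` before exponentiating. The effect can be sketched on a made-up three-token distribution; the helper name and the probabilities are assumptions for illustration, and the explicit normalisation is only there to make the numbers readable (`torch.multinomial` accepts unnormalised weights).

```python
import math

def softmax_with_temperature(log_probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalise."""
    scaled = [lp / temperature for lp in log_probs]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]  # made-up distribution
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(log_probs, t)])
```

A temperature below 1 sharpens the distribution toward the most likely token, while a temperature above 1 flattens it and makes rarer tokens more likely to be sampled.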

As discussed in class, RNNs/LSTMs can be used for a variety of tasks.

They can be used for sequence-to-sequence tasks, where the input and output sequences have either the same length or different lengths. Some examples include:
* Translation
* Tagging words with POS / SRL / NER labels
* Encoding a document as a vector for classification

etc.

# Train like a pro
## DataLoader & pyTorch wrappers

In real-world projects, we don't write our own `batchify`; instead we use PyTorch's ready-made tools, such as the [DataLoader](https://pytorch.org/docs/stable/data.html).

[PyTorch Ignite](https://pytorch.org/ignite/index.html) and [PyTorch Lightning](https://www.pytorchlightning.ai/) are two common libraries that speed up development with PyTorch.

PyTorch Lightning organizes the code by wrapping the model in Python classes, and separates the model from the data (and the data loading). It also has a variety of predefined and pretrained models for quick experimentation and research.

PyTorch Ignite offers a set of callbacks to be used during training.

Both libraries have helper tools for validation metrics (ROC, accuracy, confusion matrix, etc.) as well as learning-rate finder tools.

## Your turn:
Rewrite the code above to use PyTorch Lightning.

* This guide will help you convert the model into a PyTorch Lightning one:
https://pytorch-lightning.readthedocs.io/en/stable/starter/converting.html

* Use `CrossEntropyLoss` instead of `NLLLoss` on top of `log_softmax` (it combines the two, so the model should then return raw scores) https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
* Use the library's learning-rate finder to choose a learning rate for training.
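
The relationship between the two losses can be checked on random data. This is a small sanity check rather than part of the exercise solution; the shapes and target IDs are made up.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)            # made-up raw scores: 4 tokens, 10 classes
targets = torch.tensor([1, 0, 7, 3])   # made-up target IDs

# CrossEntropyLoss takes raw scores; NLLLoss expects log-probabilities.
loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_ce, loss_nll))  # True
```

In practice this means a model trained with `CrossEntropyLoss` should not apply `log_softmax` in its `forward`, otherwise the normalisation is applied twice.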

In [1]:
%%capture
! pip install pytorch-lightning

In [2]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl

In [32]:
class PLRNNModel(pl.LightningModule):
  def __init__(self, 
               ninp=bpemb_en.vectors.shape[1], 
               noutp=bpemb_en.vocab_size, 
               nhid=256, 
               nlayers=2, 
               dropout=0.5, 
               tie_weights=False):
    super().__init__()
    # --------------------------
    # Implement the RNN model here.
    # Hint: This function should define the model layers
    
    self.criterion = nn.CrossEntropyLoss()
    
  
  def forward(self, x, hidden):
    # implement the forward pass here
    pass

  def training_step(self, batch, batch_idx):
    # --------------------------
    # Implement the training step here.
    # Hint: This function should return the loss
    pass

  def validation_step(self, batch, batch_idx):
    # --------------------------
    # Implement the validation step here.
    pass

  def configure_optimizers(self):
    # this time we will be using ADAM optimizer
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    return optimizer

  def repackage_hidden(self, h):
    """Wraps hidden states in new Tensors, to detach them from their history."""

    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(self.repackage_hidden(v).to(device) for v in h)


Instead of the `batchify` function we created before, this time we will use the PyTorch `DataLoader`, which already implements this functionality for us. We will combine it with a small custom `Dataset` that yields pairs of an input token and its target (the next token).

In [42]:
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self,x, y):
        self.x = torch.tensor(x, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.long)
        self.len = self.x.shape[0]

    def __getitem__(self,idx):
        return self.x[idx],self.y[idx]
  
    def __len__(self):
        return self.len


In [43]:

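# Inputs are all tokens but the last; targets are the same tokens shifted by one,
# so both lists have the same length and the final index stays in range.
train_ds = TextDataset(train_encoded_ids[:-1], train_encoded_ids[1:])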
train_dl = DataLoader(train_ds, batch_size=batch_size)

val_ds = train_ds # Implement the correct validation dataset
val_dl = train_dl # Implement the correct validation data loader

In [45]:
# Validate that the data looks as you expect:
next(iter(train_dl))

[tensor([1187, 9924, 2206, 9934,  101,  473,   56,   66, 7468, 5468,  386, 8281,
          809, 3226, 9937, 9920, 2481,   64, 9976,  339,  233,   26,  176,   22,
          106,  102, 9935, 9976, 9941, 9912, 9976, 9925]),
 tensor([9924, 2206, 9934,  101,  473,   56,   66, 7468, 5468,  386, 8281,  809,
         3226, 9937, 9920, 2481,   64, 9976,  339,  233,   26,  176,   22,  106,
          102, 9935, 9976, 9941, 9912, 9976, 9925,  437])]

In [None]:
# init model
rnn_model = PLRNNModel()

# Initialize a trainer
trainer = pl.Trainer(gpus=0, max_epochs=3)

# Train the model ⚡
trainer.fit(rnn_model, train_dl, val_dl)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name      | Type             | Params
-----------------------------------------------
0 | drop      | Dropout          | 0     
1 | encoder   | Embedding        | 1.0 M 
2 | rnn       | LSTM             | 892 K 
3 | decoder   | Linear           | 2.6 M 
4 | criterion | CrossEntropyLoss | 0     
-----------------------------------------------
3.5 M     Trainable params
1.0 M     Non-trainable params
4.5 M     Total params
17.852    Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

In [None]:
# Start tensorboard.
%load_ext tensorboard
%tensorboard --logdir lightning_logs/