In [1]:
import random
random.seed(42)
%load_ext autoreload
%autoreload 2

import torch
import torch.nn as nn
import numpy as np
import pytorch_lightning as pl
from torch.nn import functional as F
from torch.utils.data import DataLoader
import torch.optim as optim
from torch.autograd import Variable as V
import torchtext
from torchtext import data
from pytorch_lightning_lm.data_module import QuotesDataModule
from pytorch_lightning_lm.model import RNNModel
from pytorch_lightning_lm.metrics import Perplexity
from pytorch_lightning.loggers import WandbLogger
from torchtext.datasets import LanguageModelingDataset
from argparse import ArgumentParser

import spacy
import nltk
from tqdm.autonotebook import tqdm
en = spacy.load("en")

def spacy_tokenizer(x):
    return [tok.text for tok in en.tokenizer(x)]



## Neural Language Models (NLMs)

Neural language models are continuous probability models that approach the same problem statistical models tackle, but from a different perspective. This approach has consistently beaten the benchmarks set by statistical models, both on language modeling itself and on downstream tasks.

>A fundamental problem that makes statistical language modelling difficult is the curse of dimensionality. If one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100000, there are potentially $100000^{10} - 1 = 10^{50}-1$ free parameters.

>When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties.  - [Bengio et al., 2003](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
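The parameter count in the quote can be reproduced with plain integer arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative.

```python
# Number of free parameters in a full joint table over n consecutive words
vocab_size = 100_000
n_words = 10

# One probability per possible word sequence, minus one because they must sum to 1
free_params = vocab_size ** n_words - 1

print(free_params == 10 ** 50 - 1)
print(len(str(free_params)), "decimal digits")
```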

All the smoothing and backoff techniques we saw earlier are crutches for overcoming this shortcoming. Bengio et al. go on to identify two key areas where statistical LMs can be improved:
- Trigram or four-gram models were the state of the art at that point, so the context was limited to the two or three preceding words. Even then, we had to resort to techniques like Kneser-Ney or Witten-Bell smoothing to leverage the information from those preceding words.
- Statistical LMs also do not consider semantic similarity between words. To such a model, _"A cat is walking in a bedroom"_ is unrelated to _"A dog is walking in a room"_, because it doesn't know that cat and dog, or room and bedroom, are similar.


The neural network approach to language modeling can be described using the three following model properties:

>- Associate each word in the vocabulary with a distributed word feature vector.
>- Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.
>- Learn simultaneously the word feature vector and the parameters of the probability function. 

> [Bengio et al., 2003](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

Let's pick the same dataset we have been working with and train some neural models. I have preprocessed the QuotesDB, split it into train, validation, and test sets, and saved each to a txt file. Learning from the previous models, I have also slightly refined the cleaning process: we now replace contractions with their full versions and insert spaces between punctuation and words. _(Accompanying code can be found in Appendix-quotes_to_txt.ipynb)_

We will be sticking with PyTorch and the transformers library from Hugging Face for the remainder of the exercise. To make things easier, we are going to use PyTorch Lightning, an abstraction over pure PyTorch that hides away the boilerplate code we would otherwise write to train a model. It also takes over the tedious job of making training work seamlessly on GPUs, multiple GPUs, etc. The only thing it asks in return is that you structure your code in a specific way, and that specific way is pretty much how we usually write code anyway.

## PyTorch Lightning

PyTorch Lightning is an attempt to standardize PyTorch code and to abstract away boilerplate and other slightly technical aspects of training, such as distributed training, mixed-precision training, and multi-GPU training, so that researchers can focus on what they do best and accelerate the research cycle. It also acts as a standard for production systems, which makes the code more structured and less prone to errors.

Below is a diagram from a [Medium post](https://towardsdatascience.com/supercharge-your-ai-research-with-pytorch-lightning-337948a99eec) by the author of the library. It shows which parts of the whole cycle have been automated by PyTorch Lightning.

![](images/lightning.jpeg)

The boxes in blue are the ones we need to fill in with our code; the rest are taken care of by the framework. If you have written PyTorch code before, porting to PyTorch Lightning is really easy. I strongly advise you to check out [this link](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09) to get an overview of what it can do.

PyTorch Lightning has one mandatory requirement and one optional but recommended requirement:
1. (Mandatory) We define our model by inheriting from `LightningModule` instead of `nn.Module`. In this class, we add the training, validation, and test steps by overriding predefined methods.
2. (Optional) We organize all our data processing and loading in a class inheriting from `LightningDataModule`. It holds everything related to data (downloading, loading, splitting, tokenization, batching, etc.).

So, let's define our DataModule first.

### DataModule

> A datamodule encapsulates the five steps involved in data processing
> in PyTorch:
> 
> Download / tokenize / process.
> 
> Clean and (maybe) save to disk.
> 
> Load inside Dataset.
> 
> Apply transforms (rotate, tokenize, etc…).
> 
> Wrap inside a DataLoader. 

> To define a DataModule define 5 methods:
> 
> -   prepare_data (how to download(), tokenize, etc…)
>     
> -   setup (how to split, etc…)
>     
> -   train_dataloader
>     
> -   val_dataloader(s)
>     
> -   test_dataloader(s)

_- PyTorch Lightning Docs_

_**N.B.**_ - To use DataModules, you should _**pip install pytorch-lightning==0.9.0rc2**_

A simplified DataModule is shown below, but we are going to use the full-blown one, which we import. Some of it, like the BPTTIterator, may not make sense yet; I'll explain it when we talk about LSTMs.

```python
class QuotesDataModule(pl.LightningDataModule):

    def __init__(
        self,
        train_file: str,
        valid_file: str = None,
        test_file: str = None,
        tokenizer=None,
        pretrained_vectors=None,
        batch_size=32,
        bptt=6,
    ):

        super().__init__()
        self.train_file = train_file
        self.valid_file = valid_file
        self.test_file = test_file
        self.tokenizer = spacy_tokenizer if tokenizer is None else tokenizer
        self.TEXT = data.Field(lower=True, tokenize=self.tokenizer)
        self.pretrained_vectors = pretrained_vectors
        self.batch_size = batch_size
        self.bptt = bptt
        self._load_data()
        self._build_vocab()
        self.vocab = self.TEXT.vocab

    def _load_data(self):
        # Read file and tokenize
        self.train_data = LanguageModelingDataset(self.train_file, self.TEXT)
        if self.valid_file:
            self.valid_data = LanguageModelingDataset(self.valid_file, self.TEXT)
        else:
            self.valid_data = None
        if self.test_file:
            self.test_data = LanguageModelingDataset(self.test_file, self.TEXT)
        else:
            self.test_data = None

    def _build_vocab(self):
        self.TEXT.build_vocab(self.train_data, vectors=self.pretrained_vectors)

    @classmethod
    def _make_iter(cls, dataset, batch_size, bptt_len):
        if dataset:
            _iter = data.BPTTIterator(
                dataset,
                batch_size=batch_size,
                bptt_len=bptt_len,  # this is where we specify the sequence length
                repeat=False,
                shuffle=True,
            )
        else:
            _iter = []
        return _iter

    def prepare_data(self):
        # No action here
        pass

    def setup(self, stage=None):
        # No action here
        pass

    def train_dataloader(self):
        return self._make_iter(self.train_data, self.batch_size, self.bptt)

    def val_dataloader(self):
        return self._make_iter(self.valid_data, self.batch_size, self.bptt)

    def test_dataloader(self):
        return self._make_iter(self.test_data, self.batch_size, self.bptt)
```

## LSTM

### Introduction

Although neural language models did not start with LSTMs, we will skip ahead to the good part and look at an LSTM. Before that, a quick refresher on LSTMs (or rather, RNNs). To understand the exact difference between vanilla RNNs and LSTMs, you can refer to this [amazing blog by Christopher Olah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

RNNs are neural network architectures specifically suited to sequences. They were designed to overcome the lack of memory in traditional neural networks. What you see below is a typical RNN, where you have a sequence, $x_1, x_2, ..., x_t$, where t is the number of timesteps. The core idea in an RNN is the use of a hidden state as a kind of memory to remember the elements of the sequence it has already seen.

So what an RNN does is take in the inputs one by one and spit out two outputs: a prediction and a hidden state. This hidden state and the next element in the sequence are passed to the network again, and it spits out another prediction and a new hidden state. This continues until you reach the end of the sequence.

![lstm](images/lstm.png)
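The loop described above can be sketched in a few lines of plain Python. This is a toy, scalar version of a vanilla RNN cell with made-up weights (`w_x`, `w_h`, and `w_y` are illustrative, not learned), meant only to show how the hidden state is threaded through the timesteps.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, w_y=1.2):
    """One timestep of a toy scalar RNN: returns (prediction, new hidden state)."""
    h_new = math.tanh(w_x * x + w_h * h)  # mix the current input with the memory
    y = w_y * h_new                       # the prediction is read off the hidden state
    return y, h_new

sequence = [1.0, 0.0, -1.0, 0.5]
h = 0.0  # the hidden state starts out empty
predictions = []
for x in sequence:
    y, h = rnn_step(x, h)  # the new hidden state is fed back in at the next step
    predictions.append(y)

print(predictions)
print(h)
```

An LSTM follows the same pattern, but carries an additional cell state and uses gates to decide what to keep and what to forget.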

Now let's see how this translates into the PyTorch implementation. With `batch_first=False`, the RNN implementation in PyTorch takes in a tensor of shape `(# of timesteps, batch size, input size)` and returns an output of shape `(# of timesteps, batch size, hidden size)`, i.e. one output for every timestep. In our model, the batch starts out as a matrix of token indices of shape `(# of timesteps, batch size)`, and the embedding layer adds the feature dimension. Intuitively, we may think that it need not return the intermediate outputs, just the final one, because that is the output after considering the whole sequence, right? But there are many use cases, including LMs, where it is beneficial to have access to the intermediate outputs as well. Using only the final output might be fine for something like sentiment analysis, where you accumulate information over an entire sentence and then make a classification, but for use cases like NER tagging or text generation we would use all, or several, of the outputs.

![lstm](images/lstm_pytorch.png)
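The shapes involved can be checked by running a randomly initialized `nn.LSTM` on dummy data. The sizes below are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

timesteps, batch_size, vocab_size = 6, 4, 100
emb_dim, hidden_dim, n_layers = 8, 16, 2

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, n_layers)  # batch_first=False by default

# A batch of token indices: (timesteps, batch size)
tokens = torch.randint(0, vocab_size, (timesteps, batch_size))
embedded = embedding(tokens)          # (timesteps, batch size, emb_dim)
output, (h_n, c_n) = lstm(embedded)   # the hidden state defaults to zeros

print(tokens.shape)    # torch.Size([6, 4])
print(embedded.shape)  # torch.Size([6, 4, 8])
print(output.shape)    # torch.Size([6, 4, 16]): one output per timestep
print(h_n.shape)       # torch.Size([2, 4, 16]): final hidden state of each layer
```

The output at the last timestep is the same tensor as the final hidden state of the top layer, which is why a classifier can get away with using only `h_n`, while a language model uses the whole `output`.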

Now let's think about how we formulate the data for training. Even though there are many ways of training a language model, let's choose the simplest one: next-word prediction. Just like we did with the statistical language models, we give the model a sequence of words and ask it to predict the next word. Only here, we do it in mini-batches. And how do we do that? We could always write our own iterator that offsets the sequence by one and poses it as the target, but why reinvent the wheel? `torchtext` has a BPTTIterator which does just this job. You provide a sequence of words to the iterator and specify the batch size and the maximum length of context (bptt), and it does the dirty job of batching the sequence into text-target pairs.

If the sequence of words is _the quick brown fox jumped over the lazy dog_, we can use the BPTTIterator to make the following text-target pairs (bptt=3) and then train our network.

![bptt](images/bptt_training.png)
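The idea of offsetting the sequence by one can be sketched in a few lines of plain Python. This is a simplified illustration for a single sequence (no batching), not torchtext's implementation, and the exact chunking may differ from the figure; `bptt_pairs` is a hypothetical helper name.

```python
def bptt_pairs(tokens, bptt):
    """Split a token list into (text, target) chunks, with the target shifted by one word."""
    pairs = []
    # Stop one token early: the last token can only ever be a target
    for i in range(0, len(tokens) - 1, bptt):
        seq_len = min(bptt, len(tokens) - 1 - i)
        text = tokens[i : i + seq_len]
        target = tokens[i + 1 : i + 1 + seq_len]
        pairs.append((text, target))
    return pairs

words = "the quick brown fox jumped over the lazy dog".split()
for text, target in bptt_pairs(words, bptt=3):
    print(text, "->", target)
```

Every position in `text` has its next word at the same position in `target`, so the network gets one training signal per timestep.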

![architecture](images/rnn_architecture.png)

### PyTorch Lightning Model

In PyTorch Lightning, you have to subclass `LightningModule` instead of `nn.Module`. `LightningModule` is just like `nn.Module`, but has more functionality baked in.

A LightningModule organizes your PyTorch code into 5 sections:

- Computations (init).

- Train loop (training_step)

- Validation loop (validation_step)

- Test loop (test_step)

- Optimizers (configure_optimizers)

A few awesome things about Lightning:
1. It does not abstract your PyTorch code, but rather organizes it.
2. The awesome Trainer object takes care of the heavy lifting and the boilerplate training code, i.e. no more optimizer.zero_grad() or aggregation of the loss per epoch.
3. No more .cuda() or .to() calls. Phew!
4. You want to do distributed training, multi-GPU training, TPU training? No worries, Lightning has got your back.
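For comparison, here is roughly what a hand-written loop looks like in plain PyTorch, i.e. the kind of boilerplate the Trainer takes over. The toy regression data and model are made up purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data: learn y = x @ true_w
x = torch.randn(64, 3)
y = x @ torch.tensor([[1.0], [-2.0], [0.5]])

model = nn.Linear(3, 1).to(device)                # manual device placement
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

losses = []
for epoch in range(50):
    inputs, targets = x.to(device), y.to(device)  # more manual device placement
    optimizer.zero_grad()                         # reset gradients from the previous step
    loss = criterion(model(inputs), targets)
    loss.backward()                               # backpropagate
    optimizer.step()                              # update the weights
    losses.append(loss.item())                    # manual bookkeeping of the loss

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```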

The most basic PyTorch Lightning model has a structure something like the one below:
``` python
class LitModel(pl.LightningModule):
    
    def __init__(self, hparams):
        super().__init__()
        # Set all the parameters of your model
        # Create the layers of the model
        # Just like regular PyTorch

    def forward(self, x):
        # Forward pass
        # Just like regular PyTorch
        ...

    def training_step(self, batch, batch_idx):
        # Call the forward pass and calculate the loss
        # PyTorch training code, which was floating around in your training loop
        ...

    def configure_optimizers(self):
        # Return the optimizer
        ...
```

That's it. You've got your PyTorch Lightning module, and now you can call the magical trainer.fit() to train your model. Take a look at the beautiful documentation for PyTorch Lightning [here](https://pytorch-lightning.readthedocs.io/en/latest/).

Now let's define our model. The model was heavily inspired by the language model in the [official PyTorch GitHub](https://github.com/pytorch/examples/blob/master/word_language_model/main.py).

Below is a simple, barebones version (it might not even run); the model we are actually going to use has more bells and whistles and will be imported from another file.

``` python
class RNNModel(pl.LightningModule):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(
        self,
        rnn_type,
        ntoken,
        ninp,
        nhid,
        nlayers,
        batch_size,
        lr=1e-3,
        dropout=0.5,
        criterion=nn.CrossEntropyLoss(),
        pretrained_vectors=None,
        device_type="cpu",
    ):
        super().__init__()
        # Store the hyperparameters that the other methods refer to
        self.rnn_type = rnn_type
        self.ntoken = ntoken
        self.nhid = nhid
        self.nlayers = nlayers
        self.batch_size = batch_size
        self.lr = lr
        self.criterion = criterion
        self.device_type = device_type
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type in ["LSTM", "GRU"]:
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
        else:
            raise ValueError(
                "An invalid option for `rnn_type` was supplied, "
                "options are ['LSTM', 'GRU']"
            )

        self.decoder = nn.Linear(nhid, ntoken)
        self.init_weights()
        # Copy the pretrained embeddings after init_weights() so they are not overwritten
        if pretrained_vectors is not None:
            assert pretrained_vectors.shape == torch.Size(
                [ntoken, ninp]
            ), "When using pretrained embeddings, the embedding vector should have the dimensions (ntoken, ninp)"
            self.encoder.weight.data.copy_(pretrained_vectors)
        self.hidden = self.init_hidden(self.batch_size)

    def init_weights(self):
        gain = nn.init.calculate_gain("relu")
        nn.init.xavier_uniform_(self.encoder.weight, gain)
        nn.init.xavier_uniform_(self.decoder.weight, gain)


    def forward(self, input):
        # Applying the Embedding Layer to the input
        emb = self.drop(self.encoder(input))
        # Passing the embedded matrix to the RNN
        output, self.hidden = self.rnn(emb, self.hidden)
        output = self.drop(output)
        #Passing it through a decoder, in this case a FF Network
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.ntoken)
        return decoded

    def init_hidden(self, batch_size):
        weight = next(self.parameters())
        if self.rnn_type == "LSTM":
            return (
                weight.new_zeros(self.nlayers, batch_size, self.nhid),
                weight.new_zeros(self.nlayers, batch_size, self.nhid),
            )
        else:
            return weight.new_zeros(self.nlayers, batch_size, self.nhid)
    
    # Need to reset hidden state every training step to make sure gradients 
    # are not propagated to all the previous histories as well.
    def reset_hidden(self, hidden):
        if isinstance(hidden, torch.Tensor):
            hidden = hidden.detach().to(self.device_type)
        else:
            hidden = tuple(self.reset_hidden(v) for v in hidden)
        return hidden

    def training_step(self, batch, batch_nb):
        text, targets = batch.text, batch.target
        self.hidden = self.reset_hidden(self.hidden)
        output = self(text)
        loss = self.criterion(output.view(-1, self.ntoken), targets.view(-1))
        result = pl.TrainResult(minimize=loss)
        return result

    def validation_step(self, batch, batch_nb):
        text, targets = batch.text, batch.target
        self.hidden = self.reset_hidden(self.hidden)
        output = self(text)
        val_loss = self.criterion(output.view(-1, self.ntoken), targets.view(-1))
        result = pl.EvalResult(early_stop_on=val_loss)
        return result

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```

### Model Training

In [17]:
parser = ArgumentParser()
parser.add_argument("-f", "--fff", help="a dummy argument to fool ipython", default="1")

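# add PROGRAM level args
# NOTE: argparse's type=bool treats any non-empty string (even "False") as True,
# so the boolean flags below are only reliable through their defaults.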
parser.add_argument('--project-name', type=str, default='neural_lms')
parser.add_argument('--experiment-tag', type=str, default='RNN_LM')
parser.add_argument('--use-cuda', type=bool, default=True)
parser.add_argument('--use-wandb', type=bool, default=True)
parser.add_argument('--log-gradients', type=bool, default=False)
parser.add_argument('--unk-cutoff', type=int, default=2)

# add model specific args
# parser = LitModel.add_model_specific_args(parser)
parser.add_argument('--batch_size', type=int, default=128)
parser.add_argument('--bptt', type=int, default=16)
parser.add_argument('--rnn-type', type=str, default="LSTM")
parser.add_argument('--nhid', type=int, default=64)
parser.add_argument('--nlayers', type=int, default=2)
parser.add_argument('--pretrained-vector', type=str, default="fasttext.simple.300d")

# add all the available trainer options to argparse
parser.add_argument('--max_epochs', type=int, default=25)
parser.add_argument('--fast_dev_run', type=bool, default=False)
# parser = Trainer.add_argparse_args(parser)

args = parser.parse_args()

device = torch.device('cuda') if (torch.cuda.is_available() and args.use_cuda) else torch.device('cpu')

In [18]:
dm = QuotesDataModule(
    train_file="data/quotesdb/funny_quotes.train.txt",
    valid_file="data/quotesdb/funny_quotes.val.txt",
    test_file="data/quotesdb/funny_quotes.test.txt",
    tokenizer=None,
    batch_size=args.batch_size,
    bptt=args.bptt,
    unk_limit = args.unk_cutoff,
    pretrained_vectors=args.pretrained_vector,
)

In [19]:
vocab = dm.vocab
weight_matrix = vocab.vectors
ntoken, ninp = weight_matrix.shape

pad_idx = vocab.stoi["<pad>"]

ppl = Perplexity(pad_idx)
model = RNNModel(
    rnn_type=args.rnn_type, ntoken=ntoken, ninp=ninp, nhid=args.nhid, nlayers=args.nlayers, batch_size=args.batch_size, device_type= device.type, pretrained_vectors=weight_matrix, metric=ppl
)


trainer = pl.Trainer(gpus=1 if device.type =='cuda' else 0, max_epochs=args.max_epochs, auto_lr_find=True)

trainer.fit(model, datamodule=dm)
trainer.save_checkpoint("models/LSTM_LM_unk_2.ckpt")
torch.save(dm.vocab, "models/LSTM_LM_vocab_unk_2.sav")

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | metric    | Perplexity       | 0     
2 | drop      | Dropout          | 0     
3 | encoder   | Embedding        | 6 M   
4 | rnn       | LSTM             | 126 K 
5 | decoder   | Linear           | 1 M   



Learning rate set to 0.008317637711026709







In [20]:
# # Start tensorboard.
# %load_ext tensorboard
# %tensorboard --logdir lightning_logs/

### Load Trained Model

In [2]:
from IPython.display import display, Markdown, Latex

seeds = [
    "when life hands you a lemon,",
    "life is a",
    "i'd rather be pissed",
    "all women may not be beautiful but",
    "i really need a day between",
    "it's never too late to"
]

def generate_sentences(model, vocab, tokenizer, sampler_func, seeds, sampler_kwargs=None, num_words=20, device='cpu'):
    # Avoid a mutable default argument
    sampler_kwargs = {} if sampler_kwargs is None else sampler_kwargs
    for seed in seeds:
        # BeamSearch and generate_sentence are imported from pytorch_lightning_lm in a later cell
        if isinstance(sampler_func, BeamSearch):
            gen_text = " ".join(sampler_func.generate(text_seed=seed, num_words=num_words))
        else:
            gen_text = generate_sentence(model, vocab, tokenizer=tokenizer, seed=seed, sampler=sampler_func, sampler_kwargs=sampler_kwargs, num_words=num_words, device=device)
        gen_text = gen_text.replace("<unk>", "UNK")
        display(Markdown(f"**{seed}** {gen_text}"))

In [8]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# device = torch.device('cpu')
trainer = pl.Trainer(gpus=1 if device.type =='cuda' else 0, auto_lr_find=False)
model = RNNModel.load_from_checkpoint("models/LSTM_LM.ckpt")
model.eval()
model = model.to(device)
model.hidden = model.init_hidden(1)
vocab = torch.load("models/LSTM_LM_vocab.sav")

weight_matrix = vocab.vectors
ntoken, ninp = weight_matrix.shape
assert(model.encoder.weight.data.shape == torch.Size([ntoken,ninp]))
bptt = 16

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


In [9]:
TEXT = data.Field(lower=True, tokenize=spacy_tokenizer)
test_data = LanguageModelingDataset("data/quotesdb/funny_quotes.test.txt", TEXT)

tokens = test_data.examples[0].text

### Perplexity

Even though we calculated perplexity while training the model, to compare against our previous models we need to calculate it the same way we did for them. So let's do that now.

```python

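import math

def logscore(model, vocab, word, context, device):
    # Log (base 2) probability of `word` given the preceding `context` tokens
    inp = torch.LongTensor([vocab.stoi[x] for x in context]).unsqueeze(1).to(device)
    word_idx = vocab.stoi[word]
    with torch.no_grad():
        # The model returns one row of logits per timestep; keep the last one
        out = F.log_softmax(model(inp), dim=1)[-1, :]
    # log_softmax uses natural logs; convert to base 2 to match pow(2.0, ...) below
    return out[word_idx].item() / math.log(2)

def perplexity(model, vocab, ngrams, device):
    log_score_sum = 0
    log_score_count = 0
    for ngram in tqdm(ngrams):
        log_score_sum += logscore(model, vocab, ngram[-1], ngram[:-1], device)
        log_score_count += 1
    # Cross-entropy in bits per word
    entropy = -1 * (log_score_sum / log_score_count)
    return pow(2.0, entropy)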
```
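As a sanity check on the formula, perplexity can be computed by hand from a list of per-word probabilities. The sketch below uses made-up probabilities rather than model outputs, and `perplexity_from_probs` is a hypothetical helper. It illustrates that a model assigning probability 1/4 to every word has a perplexity of 4, and that the base of the logarithm does not matter as long as the exponentiation uses the same base.

```python
import math

def perplexity_from_probs(probs, base=2.0):
    """Perplexity = base ** (average negative log-probability in that base)."""
    log_scores = [math.log(p, base) for p in probs]
    entropy = -sum(log_scores) / len(log_scores)
    return base ** entropy

uniform = [0.25, 0.25, 0.25, 0.25]
print(perplexity_from_probs(uniform))          # approximately 4.0
print(perplexity_from_probs(uniform, math.e))  # approximately 4.0 as well

mixed = [0.5, 0.1, 0.25, 0.05]
print(perplexity_from_probs(mixed))
```

Mixing bases, for example natural-log scores with a power of 2, would silently understate the perplexity.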

In [None]:
from pytorch_lightning_lm.utils import perplexity

from pytorch_lightning_lm.samplers import BeamSearch, DiverseNbestBeamSearch, DiverseBeamSearch, greedy_decoding, weighted_random_choice, topk, nucleus

from pytorch_lightning_lm.utils import generate_sentence

In [8]:
%%time
ngrams = nltk.ngrams(tokens,n=bptt+1)
test_ppl = perplexity(model, vocab, ngrams, device)
print(f"Test PPL is {test_ppl}")

Test PPL is 50.50857680006482
Wall time: 14.6 s


### Text Generation

#### Greedy Decoding

In [8]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = greedy_decoding, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** i 'm not a good thing . <eos>

**life is a** good thing . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** i 'm not a good thing . <eos>

**i really need a day between** the world . <eos>

**it's never too late to** be a same . <eos>

#### Beam Search

In [10]:
beam_search=BeamSearch(model=model, vocab=vocab, tokenizer=spacy_tokenizer, beam_width=30,verbose=False, debug_level=0, device = device)
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = beam_search, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** " he said . <eos>

**life is a** man . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** i do n't know . <eos>

**i really need a day between** you . <eos>

**it's never too late to** you . <eos>

#### Diverse N-Best Beam Search

In [11]:
div_nbest_bs=DiverseNbestBeamSearch(model=model, vocab=vocab, tokenizer=spacy_tokenizer, beam_width=30,verbose=False, debug_level=0, device = device, diversity_factor=1)
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = div_nbest_bs, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** in the world . <eos>

**life is a** man . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** in the world . <eos>

**i really need a day between** you . <eos>

**it's never too late to** you . <eos>

#### Diverse Beam Search

In [13]:
dbs=DiverseBeamSearch(model=model, vocab=vocab, tokenizer=spacy_tokenizer, beam_width=30, num_groups=15, verbose=False, debug_level=0, device = device, diversity_strength=0)
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = dbs, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** no . <eos>

**life is a** man . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** no . <eos>

**i really need a day between** you . <eos>

**it's never too late to** you . <eos>

#### Weighted Random Sampling

In [15]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=weighted_random_choice, sampler_kwargs={"temperature":1}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** the maestro is as you do n't wish to sporting out . i was stolen , but i 'm protecting

**life is a** room in yet ... i want for our pressure . <eos>

**i'd rather be pissed** , i would make me its business - claim . <eos>

**all women may not be beautiful but** i ’m comming , you can mean in . the ' answer is small 12 . it was lying to

**i really need a day between** the shore of children , he was starving before ' human exploited and she 'd think i guess , sir

**it's never too late to** get doing my heart <eos>

In [13]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=weighted_random_choice, sampler_kwargs={"temperature":0.9}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** where when you are just up in his mind , ” he says , “ is what it is hard

**life is a** sex?why is no business . i do n't seem an mind more dumb i am going to rescue a combination

**i'd rather be pissed** on me at the bed that much events , what is much people one of the grail the things are

**all women may not be beautiful but** it was n't being ? <eos>

**i really need a day between** the picture of her ... <eos>

**it's never too late to** him . <eos>

In [14]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=weighted_random_choice, sampler_kwargs={"temperature":0.5}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** the world and the i would not have already been . <eos>

**life is a** great . <eos>

**i'd rather be pissed** to think that i hoped you 're not a good thing . <eos>

**all women may not be beautiful but** i think to be a few - the opinion of the beginning of the person , he could actually have

**i really need a day between** the one of the time . <eos>

**it's never too late to** protect you . <eos>
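The implementation of `weighted_random_choice` lives in the accompanying package and is not shown here. As a rough sketch of the general technique, assuming the sampler receives a vector of logits for the next word, temperature sampling divides the logits by the temperature before the softmax and then draws from the resulting distribution. The function name and the logits below are hypothetical.

```python
import torch

def sample_with_temperature(logits, temperature=1.0):
    """Draw one token index from softmax(logits / temperature)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    index = torch.multinomial(probs, num_samples=1).item()
    return index, probs

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # made-up scores for a 4-word vocabulary

for t in (1.0, 0.9, 0.5):
    index, probs = sample_with_temperature(logits, temperature=t)
    print(t, index, [round(p, 3) for p in probs.tolist()])
```

Lower temperatures sharpen the distribution toward the most likely word, which is consistent with the more conservative sentences generated at temperature 0.5 above. As the temperature approaches zero, sampling approaches greedy decoding.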

#### Top-k Sampling

In [16]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=topk, sampler_kwargs={"temperature":1, "k":3}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** i was not going to be a same thing to be . <eos>

**life is a** lot of the same . <eos>

**i'd rather be pissed** to be . <eos>

**all women may not be beautiful but** you 're not a good thing to be a way , and the man , and you are a good

**i really need a day between** a time . <eos>

**it's never too late to** be the way of the world . <eos>

In [11]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=topk, sampler_kwargs={"temperature":0.5, "k":50}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** the world , and that 's not a only man , but the lot of the great - love ,

**life is a** last time . <eos>

**i'd rather be pissed** to be a same - <eos>

**all women may not be beautiful but** i 'm not a time to be a thing about them . <eos>

**i really need a day between** the word , that was the lot of the same , and i have to have to hold his face

**it's never too late to** be a very same of a same person . <eos>
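The `topk` sampler used above is imported from the accompanying package, so its code is not shown here. A minimal sketch of the general idea, assuming a vector of next-word logits, is to keep only the `k` highest-scoring words, renormalize, and sample among them. The function name and the logits are hypothetical.

```python
import torch

def sample_top_k(logits, k=3, temperature=1.0):
    """Sample a token index from the k highest-scoring tokens only."""
    top_values, top_indices = torch.topk(logits / temperature, k)
    probs = torch.softmax(top_values, dim=-1)         # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)  # position within the top k
    return top_indices[choice].item()

torch.manual_seed(0)
logits = torch.tensor([0.1, 2.0, -1.0, 1.5, 0.3, 1.0])  # made-up scores
samples = [sample_top_k(logits, k=3) for _ in range(20)]
print(samples)
```

With a small `k`, the sampler can only ever pick from a handful of very likely words, which helps explain why the `k=3` outputs above are repetitive.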

#### Top-p or Nucleus Sampling

In [18]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=nucleus, sampler_kwargs={"temperature":0.9, "p":0.9}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** " in your cat with no past would forget his hand . <eos>

**life is a** pizza who really watched , ” the large hour was the proximity , no , good white and the single

**i'd rather be pissed** . what was going to one of his emotions . she said for what i can speak and can go

**all women may not be beautiful but** the light and is it to try like never understand the world 's mind ’s style . <eos>

**i really need a day between** with the house . <eos>

**it's never too late to** discover you can leave the fifteen in view . the street ? <eos>

In [20]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=nucleus, sampler_kwargs={"temperature":1, "p":0.5}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** it ’s not a place . <eos>

**life is a** lot of this ? <eos>

**i'd rather be pissed** in the moon . <eos>

**all women may not be beautiful but** i was not a man . <eos>

**i really need a day between** your own partner . i am , and you have the little real , and do n't know what i

**it's never too late to** me . <eos>

In [18]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=nucleus, sampler_kwargs={"temperature":0.9, "p":0.5}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** and and . <eos>

**life is a** family . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** they have n't been on the old . <eos>

**i really need a day between** the best and the person . <eos>

**it's never too late to** go on . <eos>
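
For reference, nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches `p` and samples from that set after renormalising. The project's `nucleus` sampler is not shown in this notebook; the sketch below is a self-contained approximation with hypothetical function names and a made-up distribution.

```python
import math
import random

def nucleus_filter(logits, p, temperature=1.0):
    """Return (indices, probabilities) of the smallest token set whose mass reaches p."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:  # stop as soon as the nucleus covers probability p
            break
    return kept, [probs[i] / mass for i in kept]

def sample_nucleus(logits, p, temperature=1.0, rng=random):
    """Sample one token index from the renormalised nucleus."""
    indices, probs = nucleus_filter(logits, p, temperature)
    return rng.choices(indices, weights=probs, k=1)[0]

# made-up distribution: probabilities 0.5, 0.3, 0.15, 0.05 expressed as logits
logits = [math.log(q) for q in (0.5, 0.3, 0.15, 0.05)]
print(nucleus_filter(logits, p=0.9)[0])
print(nucleus_filter(logits, p=0.6)[0])
```

Unlike top-k, the number of candidate words adapts to the shape of the distribution: a confident prediction leaves very few candidates, while a flat one leaves many.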

## Attention

### Introduction

The next major innovation in the field of Neural Models was the introduction of Attention.

### Pytorch Lightning Model

### Model Training

In [5]:
from pytorch_lightning_lm.model import RNNAttentionModel

In [24]:
parser = ArgumentParser()
parser.add_argument("-f", "--fff", help="a dummy argument to fool ipython", default="1")

def str2bool(value):
    # argparse's type=bool treats any non-empty string (even "False") as True,
    # so parse boolean flags explicitly
    return str(value).lower() in ("true", "1", "yes")

# add PROGRAM level args
parser.add_argument('--project-name', type=str, default='neural_lms')
parser.add_argument('--experiment-tag', type=str, default='RNN_LM_w_Att')
parser.add_argument('--use-cuda', type=str2bool, default=True)
parser.add_argument('--use-wandb', type=str2bool, default=False)
parser.add_argument('--log-gradients', type=str2bool, default=False)
parser.add_argument('--unk-cutoff', type=int, default=1)

# add model specific args
# parser = LitModel.add_model_specific_args(parser)
parser.add_argument('--batch_size', type=int, default=128)
parser.add_argument('--bptt', type=int, default=32)
parser.add_argument('--rnn-type', type=str, default="LSTM")
parser.add_argument('--nhid', type=int, default=128)
parser.add_argument('--nlayers', type=int, default=2)
parser.add_argument('--att-width', type=int, default=16)
parser.add_argument('--pretrained-vector', type=str, default="fasttext.simple.300d")

# add all the available trainer options to argparse
parser.add_argument('--max_epochs', type=int, default=25)
parser.add_argument('--fast_dev_run', type=str2bool, default=False)
# ie: now --gpus --num_nodes ... --fast_dev_run all work in the cli
# parser = Trainer.add_argparse_args(parser)
args = parser.parse_args()
# use the GPU only when one is available and it was requested
device = torch.device('cuda') if (torch.cuda.is_available() and args.use_cuda) else torch.device('cpu')


In [25]:
dm = QuotesDataModule(
    train_file="data/quotesdb/funny_quotes.train.txt",
    valid_file="data/quotesdb/funny_quotes.val.txt",
    test_file="data/quotesdb/funny_quotes.test.txt",
    tokenizer=None,
    batch_size=args.batch_size,
    bptt=args.bptt,
    unk_limit = args.unk_cutoff,
    pretrained_vectors=args.pretrained_vector,
)



In [18]:
vocab = dm.vocab
weight_matrix = vocab.vectors
ntoken, ninp = weight_matrix.shape

pad_idx = vocab.stoi["<pad>"]

ppl = Perplexity(pad_idx)
model = RNNAttentionModel(
    rnn_type=args.rnn_type, 
    ntoken=ntoken, 
    ninp=ninp, 
    nhid=args.nhid, 
    attention_width=args.att_width,
    nlayers=args.nlayers, 
    batch_size=args.batch_size, 
    device_type= device.type, 
    pretrained_vectors=weight_matrix, metric=ppl
)

early_stop_callback = pl.callbacks.EarlyStopping(
   min_delta=0.01,
   patience=5,
   verbose=False,
   mode='min'
)

trainer = pl.Trainer(gpus=1 if device.type == 'cuda' else 0,
                     max_epochs=args.max_epochs,
                     auto_lr_find=not args.fast_dev_run,
                     fast_dev_run=args.fast_dev_run,
                     early_stop_callback=early_stop_callback)

trainer.fit(model, datamodule=dm)
trainer.auto_lr_find = False
trainer.save_checkpoint("models/LSTM_w_Att_LM.ckpt")
torch.save(dm.vocab, "models/LSTM_w_Att_LM_vocab.sav")

### Load Trained Model

In [19]:
from IPython.display import display, Markdown, Latex

seeds = [
    "when life hands you a lemon,",
    "life is a",
    "i'd rather be pissed",
    "all women may not be beautiful but",
    "i really need a day between",
    "it's never too late to"
]

def generate_sentences(model, vocab, tokenizer, sampler_func, seeds, sampler_kwargs=None, num_words=20, device='cpu'):
    # avoid sharing one mutable default dict between calls
    sampler_kwargs = sampler_kwargs or {}
    for seed in seeds:
        if isinstance(sampler_func, BeamSearch):
            # beam-search objects generate directly; respect the requested length
            gen_text = " ".join(sampler_func.generate(text_seed=seed, num_words=num_words))
        else:
            gen_text = generate_sentence(model, vocab, tokenizer=tokenizer, seed=seed, sampler=sampler_func, sampler_kwargs=sampler_kwargs, num_words=num_words, device=device)
        gen_text = gen_text.replace("<unk>", "UNK")
        display(Markdown(f"**{seed}** {gen_text}"))

In [21]:
# device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device = torch.device('cpu')
trainer = pl.Trainer(gpus=1 if device.type =='cuda' else 0, auto_lr_find=False)
model = RNNAttentionModel.load_from_checkpoint("models/RNN_LM_w_Att_LSTM_128_32_128_2_16.ckpt")
model.eval()
model = model.to(device)
model.hidden = model.init_hidden(1)
vocab = torch.load("models/RNN_LM_w_Att_LSTM_128_32_128_2_16_vocab.sav")

weight_matrix = vocab.vectors
ntoken, ninp = weight_matrix.shape
assert(model.encoder.weight.data.shape == torch.Size([ntoken,ninp]))
bptt = 32

GPU available: True, used: False
TPU available: False, using: 0 TPU cores


AssertionError: 

In [22]:
model.encoder.weight.data.shape

torch.Size([45947, 300])

In [23]:
ntoken

22656

In [9]:
TEXT = data.Field(lower=True, tokenize=spacy_tokenizer)
test_data = LanguageModelingDataset("data/quotesdb/funny_quotes.test.txt", TEXT)

tokens = test_data.examples[0].text

Even though we calculated perplexity while training the model, to compare it against our previous models we need to calculate it in the same way. So let's do that now.

```python

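import math

def logscore(model, vocab, word, context, device):
    # encode the context as a (seq_len, 1) batch of token indices
    inp = torch.LongTensor([vocab.stoi[x] for x in context]).unsqueeze(1).to(device)
    word_idx = vocab.stoi[word]
    # natural-log probabilities for the position after the last context token
    out = F.log_softmax(model(inp), dim=1)[-1,:]
    # convert to log base 2 so that it matches pow(2.0, entropy) below
    return out[word_idx].item() / math.log(2)

def perplexity(model, vocab, ngrams, device):
    log_score_sum = 0
    log_score_count = 0
    for ngram in tqdm(ngrams):
        log_score_sum += logscore(model, vocab, ngram[-1], ngram[:-1], device)
        log_score_count += 1
    # average negative log2-probability, i.e. cross-entropy in bits
    entropy = -1 * (log_score_sum / log_score_count)
    return pow(2.0, entropy)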
```
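
Perplexity is the exponentiated average negative log-probability of the observed tokens, and the base of the exponent has to match the base of the logarithm. The following is a minimal, model-free sketch of that relationship; the helper name and the probabilities are made up for illustration and are not part of `pytorch_lightning_lm`.

```python
import math

def perplexity_from_log_probs(log_probs, base=math.e):
    """Perplexity from per-token log-probabilities taken in the given base."""
    entropy = -sum(log_probs) / len(log_probs)
    return base ** entropy

probs = [0.25, 0.5, 0.125]  # made-up probabilities assigned to three observed tokens
nat = perplexity_from_log_probs([math.log(p) for p in probs], base=math.e)
bits = perplexity_from_log_probs([math.log2(p) for p in probs], base=2.0)
print(nat, bits)
```

The geometric mean of these probabilities is 1/4, so both calls give a perplexity of 4 up to floating-point rounding. Mixing the two, for example natural-log scores with a base-2 exponent, would give a different and incomparable number.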

In [None]:
from pytorch_lightning_lm.utils import perplexity

from pytorch_lightning_lm.samplers import BeamSearch, DiverseNbestBeamSearch, DiverseBeamSearch, greedy_decoding, weighted_random_choice, topk, nucleus

from pytorch_lightning_lm.utils import generate_sentence

In [8]:
%%time
ngrams = nltk.ngrams(tokens,n=bptt+1)
test_ppl = perplexity(model, vocab, ngrams, device)
print(f"Test PPL is {test_ppl}")

Test PPL is 50.50857680006482
Wall time: 14.6 s


#### Greedy Decoding

In [8]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = greedy_decoding, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** i 'm not a good thing . <eos>

**life is a** good thing . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** i 'm not a good thing . <eos>

**i really need a day between** the world . <eos>

**it's never too late to** be a same . <eos>

#### Beam Search

In [10]:
beam_search=BeamSearch(model=model, vocab=vocab, tokenizer=spacy_tokenizer, beam_width=30,verbose=False, debug_level=0, device = device)
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = beam_search, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** " he said . <eos>

**life is a** man . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** i do n't know . <eos>

**i really need a day between** you . <eos>

**it's never too late to** you . <eos>

#### Diverse N-Best Beam Search

In [11]:
div_nbest_bs=DiverseNbestBeamSearch(model=model, vocab=vocab, tokenizer=spacy_tokenizer, beam_width=30,verbose=False, debug_level=0, device = device, diversity_factor=1)
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = div_nbest_bs, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** in the world . <eos>

**life is a** man . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** in the world . <eos>

**i really need a day between** you . <eos>

**it's never too late to** you . <eos>

#### Diverse Beam Search

In [13]:
dbs=DiverseBeamSearch(model=model, vocab=vocab, tokenizer=spacy_tokenizer, beam_width=30, num_groups=15, verbose=False, debug_level=0, device = device, diversity_strength=0)
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func = dbs, seeds=seeds, sampler_kwargs={}, num_words = 20, device=device)

**when life hands you a lemon,** no . <eos>

**life is a** man . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** no . <eos>

**i really need a day between** you . <eos>

**it's never too late to** you . <eos>

#### Weighted Random Sampling

In [15]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=weighted_random_choice, sampler_kwargs={"temperature":1}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** the maestro is as you do n't wish to sporting out . i was stolen , but i 'm protecting

**life is a** room in yet ... i want for our pressure . <eos>

**i'd rather be pissed** , i would make me its business - claim . <eos>

**all women may not be beautiful but** i ’m comming , you can mean in . the ' answer is small 12 . it was lying to

**i really need a day between** the shore of children , he was starving before ' human exploited and she 'd think i guess , sir

**it's never too late to** get doing my heart <eos>

In [13]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=weighted_random_choice, sampler_kwargs={"temperature":0.9}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** where when you are just up in his mind , ” he says , “ is what it is hard

**life is a** sex?why is no business . i do n't seem an mind more dumb i am going to rescue a combination

**i'd rather be pissed** on me at the bed that much events , what is much people one of the grail the things are

**all women may not be beautiful but** it was n't being ? <eos>

**i really need a day between** the picture of her ... <eos>

**it's never too late to** him . <eos>

In [14]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=weighted_random_choice, sampler_kwargs={"temperature":0.5}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** the world and the i would not have already been . <eos>

**life is a** great . <eos>

**i'd rather be pissed** to think that i hoped you 're not a good thing . <eos>

**all women may not be beautiful but** i think to be a few - the opinion of the beginning of the person , he could actually have

**i really need a day between** the one of the time . <eos>

**it's never too late to** protect you . <eos>

#### Top-k Sampling

In [16]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=topk, sampler_kwargs={"temperature":1, "k":3}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** i was not going to be a same thing to be . <eos>

**life is a** lot of the same . <eos>

**i'd rather be pissed** to be . <eos>

**all women may not be beautiful but** you 're not a good thing to be a way , and the man , and you are a good

**i really need a day between** a time . <eos>

**it's never too late to** be the way of the world . <eos>

In [11]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=topk, sampler_kwargs={"temperature":0.5, "k":50}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** the world , and that 's not a only man , but the lot of the great - love ,

**life is a** last time . <eos>

**i'd rather be pissed** to be a same - <eos>

**all women may not be beautiful but** i 'm not a time to be a thing about them . <eos>

**i really need a day between** the word , that was the lot of the same , and i have to have to hold his face

**it's never too late to** be a very same of a same person . <eos>

#### Top-p or Nucleus Sampling

In [18]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=nucleus, sampler_kwargs={"temperature":0.9, "p":0.9}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** " in your cat with no past would forget his hand . <eos>

**life is a** pizza who really watched , ” the large hour was the proximity , no , good white and the single

**i'd rather be pissed** . what was going to one of his emotions . she said for what i can speak and can go

**all women may not be beautiful but** the light and is it to try like never understand the world 's mind ’s style . <eos>

**i really need a day between** with the house . <eos>

**it's never too late to** discover you can leave the fifteen in view . the street ? <eos>

In [20]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=nucleus, sampler_kwargs={"temperature":1, "p":0.5}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** it ’s not a place . <eos>

**life is a** lot of this ? <eos>

**i'd rather be pissed** in the moon . <eos>

**all women may not be beautiful but** i was not a man . <eos>

**i really need a day between** your own partner . i am , and you have the little real , and do n't know what i

**it's never too late to** me . <eos>

In [18]:
generate_sentences(model, vocab, tokenizer=spacy_tokenizer, sampler_func=nucleus, sampler_kwargs={"temperature":0.9, "p":0.5}, seeds=seeds, num_words = 20, device=device)

**when life hands you a lemon,** and and . <eos>

**life is a** family . <eos>

**i'd rather be pissed** . <eos>

**all women may not be beautiful but** they have n't been on the old . <eos>

**i really need a day between** the best and the person . <eos>

**it's never too late to** go on . <eos>