In [8]:
import random
random.seed(42)
%load_ext autoreload
%autoreload 2

import torch
import torch.nn as nn
import numpy as np
import pytorch_lightning as pl
from torch.nn import functional as F
from torch.utils.data import DataLoader
import torch.optim as optim
from torch.autograd import Variable as V
import torchtext
from torchtext import data
import spacy
en = spacy.load('en')
from torchtext.datasets import LanguageModelingDataset

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


__Intro about neural LMs__ 

Let's train some neural models on the same dataset, starting with simple LSTMs.

We will be sticking with PyTorch and the transformers library from Hugging Face for the remainder of the workshop. To make things easier, we are going to use PyTorch Lightning, an abstraction over pure PyTorch that hides the boilerplate code we would otherwise have to write to train a model. It also takes over the tedious job of making training work seamlessly on a GPU, multiple GPUs, etc. The only thing it asks in return is that you structure your code in a specific way, and that specific way is pretty much how we usually write code anyway.

## PyTorch Lightning
PyTorch Lightning is an attempt at standardizing PyTorch code and abstracting away the boilerplate and the more technical aspects of training (distributed training, mixed-precision training, multi-GPU training, etc.) so that researchers can focus on what they do best and accelerate the research cycle. It also serves as a standard for production systems, which makes the code more structured and less prone to errors.

Below is a diagram from a [medium post](https://towardsdatascience.com/supercharge-your-ai-research-with-pytorch-lightning-337948a99eec) by the author of the library. It shows which parts of the whole cycle are automated by PyTorch Lightning.

![](images/lightning.jpeg)

The boxes in blue are the ones we need to fill in with our own code; the rest are taken care of by the framework. If you have written PyTorch code before, porting it to PyTorch Lightning is really easy. I strongly advise you to check out [this link](https://pytorch-lightning.readthedocs.io/en/latest/new-project.html) to get an overview of what it can do.

The primary requirement of the framework is that we define a LightningModule (the Lightning counterpart of nn.Module), which is the model. In the same module, we define the training steps in specific methods.
The data can be packaged into a plain DataLoader, or a DataModule can wrap everything related to data (downloading, loading, splitting, tokenization, batching, etc.). The DataModule is the recommended approach.

So, let's define our DataModule

### DataModule

_N.B._ - I have preprocessed QuotesDB, split it into train, validation, and test sets, and saved each split to a txt file. Learning from the previous models, I have also slightly refined the cleaning process: contractions are now replaced with their full forms, and spaces are inserted between punctuation marks and words.
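
The actual preprocessing script is not part of this notebook, so the following is only a rough sketch of the two cleaning steps described above. The contraction table is a small illustrative subset, and the function name clean_quote is hypothetical.

```python
import re

# Illustrative subset only; the real contraction list is not shown in this notebook.
# Irregular forms come first so that the generic "n't" rule does not mangle them.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'ll": " will",
    "'ve": " have",
    "'m": " am",
}


def clean_quote(text):
    """Expand contractions, then put spaces around punctuation marks."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Surround each punctuation mark with spaces so it becomes its own token.
    text = re.sub(r"([.,!?;:\"()])", r" \1 ", text)
    # Collapse the extra whitespace introduced above.
    return re.sub(r"\s+", " ", text).strip()
```

With this sketch, `clean_quote("I can't wait, they're here!")` yields `I cannot wait , they are here !`, so a whitespace tokenizer sees punctuation as separate tokens.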

> A datamodule encapsulates the five steps involved in data processing
> in PyTorch:
> 
> Download / tokenize / process.
> 
> Clean and (maybe) save to disk.
> 
> Load inside Dataset.
> 
> Apply transforms (rotate, tokenize, etc…).
> 
> Wrap inside a DataLoader. 

_- PyTorch Lightning Docs_

> To define a DataModule define 5 methods:
> 
> -   prepare_data (how to download(), tokenize, etc…)
>     
> -   setup (how to split, etc…)
>     
> -   train_dataloader
>     
> -   val_dataloader(s)
>     
> -   test_dataloader(s)

In [2]:
from pytorch_lightning_lm.data_module import QuotesDataModule
from pytorch_lightning_lm.model import RNNModel
from pytorch_lightning.loggers import WandbLogger

In [3]:
project = 'neural_lms'

In [4]:
batch_size = 32
bptt = 6
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
rnn_type = 'LSTM'
# ninp = 200
nhid=32
nlayers=2

In [5]:
dm = QuotesDataModule(
    train_file="data/quotesdb/funny_quotes.train.txt",
    valid_file="data/quotesdb/funny_quotes.val.txt",
    test_file="data/quotesdb/funny_quotes.test.txt",
    tokenizer=None,
    batch_size=batch_size,
    bptt=bptt,
    pretrained_vectors="fasttext.simple.300d",
)
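
The internals of QuotesDataModule are not shown here, so the following is only a sketch of what batch_size and bptt usually mean in language-model batching, assuming the module follows the common scheme: the token stream is cut into batch_size parallel columns, and each training example is a window of at most bptt steps whose target is the same window shifted by one token. The helper names batchify and bptt_chunks are hypothetical.

```python
def batchify(token_ids, batch_size):
    """Split one long token stream into `batch_size` parallel columns."""
    n_rows = len(token_ids) // batch_size
    trimmed = token_ids[: n_rows * batch_size]  # drop the ragged tail
    columns = [trimmed[i * n_rows:(i + 1) * n_rows] for i in range(batch_size)]
    # Transpose so that rows[t][b] is time step t of column b.
    return [list(row) for row in zip(*columns)]


def bptt_chunks(rows, bptt):
    """Yield (input, target) windows of at most `bptt` steps; targets are shifted by one."""
    for start in range(0, len(rows) - 1, bptt):
        length = min(bptt, len(rows) - 1 - start)
        yield rows[start:start + length], rows[start + 1:start + 1 + length]


rows = batchify(list(range(20)), batch_size=2)
for inputs, targets in bptt_chunks(rows, bptt=6):
    print(len(inputs), inputs[0], "->", targets[0])
```

A small bptt such as 6 keeps each backpropagation window short, which suits short quotes but limits how far back gradients can flow.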



In [9]:
vocab = dm.vocab
weight_matrix = vocab.vectors
ntoken, ninp = weight_matrix.shape

In [10]:
model = RNNModel(
    rnn_type=rnn_type, ntoken=ntoken, ninp=ninp, nhid=nhid, nlayers=nlayers, batch_size=batch_size, device_type= device.type, pretrained_vectors=weight_matrix
)

In [11]:
# wandb_logger = WandbLogger(name='trial_1',project=project)
trainer = pl.Trainer(gpus=1 if device.type =='cuda' else 0, max_epochs=5, fast_dev_run=True)#, logger= wandb_logger) #fast_dev_run=True,

Running in fast_dev_run mode: will run a full train, val and test loop using a single batch
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


In [12]:
trainer.fit(model, datamodule=dm)


  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | drop      | Dropout          | 0     
2 | encoder   | Embedding        | 13 M  
3 | rnn       | LSTM             | 51 K  
4 | decoder   | Linear           | 1 M   


1

In [13]:
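# Build a fixed-length input of bptt tokens, left-padded with index 1
# (assumed to be the padding index, as in torchtext's default vocab).
seq = torch.ones(bptt, dtype=torch.long)
toks = dm.TEXT.preprocess("When life hands you lemons")
x = dm.TEXT.numericalize([toks]).to('cpu').squeeze(1)  # [sent_len]
length = min(len(x), bptt)
seq[-length:] = x[-length:]  # keep only the last `length` tokens of the prompt
seq = seq.unsqueeze(1)  # [bptt, 1]: add a batch dimension of size 1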

In [14]:
model.hidden = model.init_hidden(1)
model = model.to('cpu')
model.eval()
out = model(seq)
out

tensor([[-10.7090, -10.6165, -10.7652,  ..., -10.8997, -10.8639, -10.6925],
        [-10.7068, -10.6144, -10.7447,  ..., -10.9029, -10.8717, -10.6778],
        [-10.7071, -10.6122, -10.7299,  ..., -10.9087, -10.8767, -10.6698],
        [-10.7085, -10.6129, -10.7242,  ..., -10.9129, -10.8805, -10.6679],
        [-10.7108, -10.6154, -10.7195,  ..., -10.9155, -10.8800, -10.6670],
        [-10.7117, -10.6141, -10.7179,  ..., -10.9189, -10.8809, -10.6644]],
       grad_fn=<LogSoftmaxBackward>)

In [15]:
dm.TEXT.vocab.itos[torch.argmax(out[-1,:])]

'my'
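
A rough way to read the tensor printed above: the model outputs log-probabilities (note the LogSoftmaxBackward), and almost every entry is close to -10.7. The arithmetic below, which takes -10.7 as a representative value, shows what that implies.

```python
import math

log_prob = -10.7  # representative entry from the printed tensor
prob = math.exp(log_prob)

# A uniform distribution over V tokens assigns log(1/V) to each token,
# so exp(-log_prob) is the vocabulary size that would give this value.
implied_vocab = 1 / prob

print(f"p ~ {prob:.2e}, equivalent uniform vocabulary ~ {implied_vocab:,.0f} tokens")
```

The implied size is in the tens of thousands, which is in line with a 300-dimensional embedding of about 13 M parameters. In other words, the distribution is still nearly uniform, which is what one would expect after fast_dev_run has trained on a single batch, and 'my' is only marginally more likely than any other token.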

In [None]:
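def generate_sequence(self, src, max_new_tokens=20):
    """Greedily extend `src` by `max_new_tokens` tokens (batch size 1)."""
    # src = [sent_len]
    src = src.unsqueeze(1)
    # src = [sent_len, 1]
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = self.forward(src)
            # Flatten to [sent_len, vocab_size] so this works whether forward
            # returns a 2-D tensor (as printed above) or [sent_len, 1, vocab_size].
            out = out.reshape(-1, out.size(-1))
            next_id = torch.argmax(out[-1], dim=-1)  # most likely next token
            src = torch.cat((src, next_id.view(1, 1)), dim=0)
    return src.squeeze(1)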

In [None]:
def word_ids_to_sentence(id_tensor, vocab, join=None):
    """Converts a sequence of word ids to a sentence"""
    if torch.is_tensor(id_tensor):
        # torch.is_tensor also covers CUDA tensors, which are not torch.LongTensor instances
        ids = id_tensor.transpose(0, 1).contiguous().view(-1).tolist()
    elif isinstance(id_tensor, np.ndarray):
        ids = id_tensor.transpose().reshape(-1)
    else:
        raise TypeError(f"Unsupported type for id_tensor: {type(id_tensor)}")
    batch = [vocab.itos[ind] for ind in ids] # denumericalize
    if join is None:
        return batch
    else:
        return join.join(batch)

In [None]:
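# `b` is not defined in this notebook; it is assumed to be a batch with a
# `.text` field drawn from one of the DataModule's dataloaders.
arrs = model(b.text).cpu().detach().numpy()
# axis=-1 picks the most likely token per position for 2-D or 3-D output
word_ids_to_sentence(np.argmax(arrs, axis=-1), dm.TEXT.vocab, join=' ')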