## 4-Training Using Pytorch

We assume you have run through the first 3 steps described in `full_process_keras`, which are independent of whether one uses `keras` or `pytorch`.

If you want to run the code in this notebook from your terminal: 

    python train_pytorch.py models/model_pytorch data/sklearn_clean/ data/scalaz_clean

plus any of the options one can pass (e.g. `--bidirectional`, `--epochs`, etc.).

Remember that the batch generator keeps a lot of `txt` files open at once (each batch passes 1024 sequences from the files, so at least 1024 files will be open). Set `ulimit -n` high enough, for example, in the terminal:

    ulimit -n 4096
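
If you would rather check (or raise) the limit from within Python, the standard-library `resource` module exposes the same setting on Unix-like systems. This is only a minimal sketch: the helper name `ensure_open_file_limit` is made up for illustration and is not part of the repo.

```python
import resource  # Unix only


def ensure_open_file_limit(target=4096):
    """Try to raise the soft limit on open files to `target` (never above the hard limit)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit"; otherwise we cannot go above the hard limit
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and soft < wanted:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            # some systems refuse the change; keep the current limit in that case
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = ensure_open_file_limit(4096)
print("open-file limit (soft, hard):", soft, hard)
```

The change only affects the current Python process (and its children), just like `ulimit -n` only affects the current shell.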

Now let's talk a little about some differences between using `keras` and `pytorch`. One notable difference is that while `tensorflow` (`keras`'s default backend) traditionally uses static graphs (define-and-run), `pytorch` is dynamic (define-by-run). This changes the way you normally code your models and, in my opinion, adds flexibility to both the building of the model and the training methods.
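
To make the define-by-run idea a bit more concrete, here is a small sketch (the `DynamicDepthNet` class is invented for this illustration) where the number of times a layer is applied is decided by plain Python control flow at call time. The graph is simply whatever the code executed.

```python
import torch
import torch.nn as nn


class DynamicDepthNet(nn.Module):
    """Toy module: applies the same linear layer a variable number of times."""

    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x, n_repeats):
        # an ordinary Python loop: the graph is built as this code runs
        for _ in range(n_repeats):
            x = torch.tanh(self.layer(x))
        return x


net = DynamicDepthNet(4)
x = torch.randn(2, 4)
out_short = net(x, n_repeats=1)  # graph with one application of the layer
out_long = net(x, n_repeats=3)   # graph with three applications, same module
out_long.sum().backward()        # autograd follows whatever was executed
print(out_short.shape, out_long.shape)
```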

For example, when using `pytorch` you normally code your models as a python class where you define the forward pass (although `pytorch` also has a sequential API). In addition, you also need to define your training and validation phases, as we will see below. This means having to compute the loss, compute and clear the gradients, update parameters, etc. It is more code, but it gives you additional flexibility. In this "simple" exercise, this might not be perceived as an advantage, but when it comes to more complex models or forms of training, it is.

Without further ado, let's code the stateful, multi-layer RNN in pytorch with a final `Timedistributed` linear layer. Remember that `stateful` simply means that the hidden state of the previous batch of sequences will be used as the starting state for the current batch.
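
Before the full model, here is a stripped-down sketch of the stateful idea with a single `nn.LSTM`. The sizes and random inputs are placeholders, and this variant carries the state in a local variable rather than as model attributes, so take it as an illustration of the concept and not as the notebook's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, n_features, hidden = 4, 10, 8, 16
lstm = nn.LSTM(n_features, hidden, batch_first=True)

# initial (h, c) state, shape (num_layers, batch, hidden)
state = (torch.zeros(1, batch_size, hidden),
         torch.zeros(1, batch_size, hidden))

for step in range(3):
    X = torch.randn(batch_size, seq_len, n_features)  # stand-in for a real batch
    # the final state of the previous batch is the starting state of this one
    output, state = lstm(X, state)
    # detach so that gradients do not flow back into previous batches
    state = tuple(s.detach() for s in state)

print(output.shape, state[0].shape)
```

Note that this only makes sense if row `j` of consecutive batches is a continuation of the same sequence, which is why the batch size has to stay fixed.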

The code below is overcommented for clarity. With the comments it looks like a lot of code, but if you remove them it looks better :)

In [1]:
import os
import sys
import numpy as np

from tqdm import tqdm,trange
from random import choice
from joblib import dump
from text_utils import char2vec, n_chars
from glob import glob

import torch
import torch.nn  as nn
import torch.optim as optim
import torch.nn.functional as F

from train_pytorch import generate_batches, n_chars
from torch.autograd import Variable


class TimeDistributed(nn.Module):
    def __init__(self, module):
        """
        No credit for me for this, I took it from here a while ago and been using it since then: 
        https://github.com/SeanNaren/deepspeech.pytorch/blob/master/model.py
        
        Collapses input of dim T*N*H to (T*N)*H, and applies to a module.
        :param module: Module to apply input to.
        """
        super(TimeDistributed, self).__init__()
        self.module = module

    def forward(self, x):
        t, n = x.size(0), x.size(1)

        # let's collapse the dimensions
        # contiguous simply returns a contiguous tensor containing the same data as self 
        x = x.contiguous().view(t * n, -1)

        # apply the module we pass when we initialize the class (in our case will be a Linear layer)
        x = self.module(x)

        # and expand dimensions
        x = x.contiguous().view(t, n, -1)
        return x

    def __repr__(self):
        # let's make it unambiguous and consistent with other pytorch layers
        tmpstr = self.__class__.__name__ + ' (\n'
        tmpstr += self.module.__repr__()
        tmpstr += ')'
        return tmpstr 
    
    
class RNNCharTagger(nn.Module):
    """
    Here is where the fun begins!

    Parameters are self explanatory
    """
    def __init__(self, lstm_layers, input_dim, out_dim, batch_size, dropout, batch_first=True):
        super(RNNCharTagger, self).__init__()
        
        
        self.lstm_layers = lstm_layers
        self.input_dim = input_dim
        self.out_dim = out_dim
        self.dropout = dropout
        self.batch_first = batch_first
        self.batch_size = batch_size


        # LSTM layers: because we have to make every layer stateful, we need to define them one by one. 
        # Note that the dropout option in pytorch adds dropout after all but the last recurrent layer. 
        # This means that if you define a one-layer LSTM and add dropout, it will do nothing. 
        # The solution is easy, let's manually add it after every recurrent layer
        self.lstm1 =  nn.LSTM(self.input_dim, self.out_dim, batch_first=self.batch_first)
        self.drop1 = nn.Dropout(self.dropout)

        # for every layer after the 1st one, we define the RNN and add dropout for all but last one
        for i in range(1,self.lstm_layers):
            if (i+1) < self.lstm_layers:
                setattr(self, 'lstm'+str(i+1), nn.LSTM(self.out_dim, self.out_dim, batch_first=self.batch_first))
                setattr(self, 'drop'+str(i+1), nn.Dropout(self.dropout))
            else:
                setattr(self, 'lstm'+str(i+1), nn.LSTM(self.out_dim, self.out_dim, batch_first=self.batch_first))

        # The Timedistributed layer after the RNN layers
        self.linear = TimeDistributed(nn.Linear(self.out_dim, 1))

        # Initialize cell states
        for i in range(self.lstm_layers):

            # one could also initialize as zeros.
            # setattr(self, 'h'+str(i+1), nn.Parameter(torch.zeros(1, self.batch_size, self.out_dim)))
            # setattr(self, 'c'+str(i+1), nn.Parameter(torch.zeros(1, self.batch_size, self.out_dim)))
            setattr(self, 'h'+str(i+1), nn.Parameter(nn.init.normal_(torch.Tensor(1, self.batch_size, self.out_dim))))
            setattr(self, 'c'+str(i+1), nn.Parameter(nn.init.normal_(torch.Tensor(1, self.batch_size, self.out_dim))))

    def forward(self, X):

        # in the first forward pass we will use the initialized cell state and store 
        # the output state per LSTM layer. If we used just one layer, this would be
        # easier:  

        # output, self.hidden = self.lstm(X, (self.hidden))

        # Because we use multiple layers and we want the code to be readable, we store
        # the cell states in a list and update at the end
        output, (h1, c1) = self.lstm1(X, (self.h1, self.c1))
        output = self.drop1(output)
        hidden_states = [(h1,c1)]
        for i in range(1,self.lstm_layers):
            h,c = getattr(self, 'h'+str(i+1)), getattr(self, 'c'+str(i+1))
            output, (nh,nc) = getattr(self, 'lstm'+str(i+1))(output, (h,c))
            if (i+1) < self.lstm_layers:
                output = getattr(self, 'drop'+str(i+1))(output)
            hidden_states.append((nh,nc))

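        # keep the new states so the next batch starts from them (this is the "stateful" part).
        # Using .data drops the graph history, so gradients do not flow across batches
        for i in range(self.lstm_layers):
            setattr(self, 'h'+str(i+1), nn.Parameter(hidden_states[i][0].data))
            setattr(self, 'c'+str(i+1), nn.Parameter(hidden_states[i][1].data))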

        # Finally, sigmoid on the output to classify as python or scala 
        # (torch.sigmoid replaces the deprecated F.sigmoid)
        output = torch.sigmoid(self.linear(output))

        return output   
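
As a quick sanity check of what `TimeDistributed` does (collapse the first two dimensions, apply the module, restore the dimensions), the sketch below restates it as a standalone function, `time_distributed`, written only for this illustration. In current PyTorch versions `nn.Linear` already accepts inputs with extra leading dimensions, so both routes should agree up to floating point error.

```python
import torch
import torch.nn as nn


def time_distributed(module, x):
    """Apply `module` to every (row, step) pair of a 3D tensor."""
    a, b = x.size(0), x.size(1)
    flat = x.contiguous().view(a * b, -1)  # collapse the first two dims
    out = module(flat)                     # apply the wrapped module
    return out.view(a, b, -1)              # restore the first two dims


torch.manual_seed(0)
linear = nn.Linear(128, 1)
x = torch.randn(32, 100, 128)  # e.g. (batch, seq_len, out_dim)

wrapped = time_distributed(linear, x)
direct = linear(x)  # nn.Linear acts on the last dimension only
print(wrapped.shape, torch.allclose(wrapped, direct, atol=1e-5))
```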

After the code above, coding 3 layers of `Bidirectional` LSTMs is pretty straightforward, even more so given that, as mentioned in `full_process_keras`, we do not use `stateful` here.

In [2]:
class BiRNNCharTagger(nn.Module):
    def __init__(self, lstm_layers, input_dim, out_dim, batch_size, dropout, batch_first=True):
        super(BiRNNCharTagger, self).__init__()

        self.lstm_layers = lstm_layers
        self.input_dim = input_dim
        self.out_dim = out_dim
        self.dropout = dropout
        self.batch_first = batch_first
        self.batch_size = batch_size

        self.lstm =  nn.LSTM(
            self.input_dim,
            self.out_dim,
            batch_first=self.batch_first,
            dropout=self.dropout,
            num_layers = self.lstm_layers,
            bidirectional=True)
        self.linear = TimeDistributed(nn.Linear(2*self.out_dim, 1))

    def forward(self, X):
        lstm_output, hidden = self.lstm(X)
        # (torch.sigmoid replaces the deprecated F.sigmoid)
        output = torch.sigmoid(self.linear(lstm_output))
        return output
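
The `2*self.out_dim` in the linear layer above comes from the fact that a bidirectional LSTM concatenates the forward and backward outputs along the last dimension. A small shape check (the batch size of 8 is arbitrary, the other sizes mirror the ones used in this notebook):

```python
import torch
import torch.nn as nn

n_features, hidden, layers = 96, 128, 3
lstm = nn.LSTM(n_features, hidden, num_layers=layers, batch_first=True,
               dropout=0.2, bidirectional=True)

X = torch.randn(8, 100, n_features)  # (batch, seq_len, n_features)
output, (h_n, c_n) = lstm(X)

print(output.shape)  # (batch, seq_len, 2 * hidden)
print(h_n.shape)     # (2 * num_layers, batch, hidden)
```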

As simple as that! So, let's train it...right? 

Not so fast. Another difference between `pytorch` and `keras` is that when using `pytorch`, you normally need to define the training and validation phases yourself, setting the model mode accordingly.

These can be simple functions like the ones in the cell below (again, overcommented).

In [3]:
def train(train_gen, model, criterion, optimizer, epoch, steps_per_epoch):
    """
    Params:
    -----------
    train_gen: train generator
    model    : pytorch model
    criterion: loss function
    optimizer: your favourite optimizer
    epoch    : integer indicating the current epoch
    steps_per_epoch: how many steps will define an epoch
    """
    
    # switch to train mode
    model.train()

    # we will use tqdm for pretty progressbars
    with trange(steps_per_epoch) as t:
        for i in t:
            t.set_description('epoch %i' % epoch)

            # 1. Generate X and y, turn them into Variables to be passed to the model, 
            # and to cuda mode if cuda is available
            X,y = train_gen.__next__()
            X_var = Variable(torch.from_numpy(X).float())
            y_var = Variable(torch.from_numpy(y).float())
            if use_cuda:
                X_var, y_var = X_var.cuda(), y_var.cuda()

            # 2. Pytorch accumulates gradients. We need to clear them before each backward pass
            optimizer.zero_grad()

            # 3. Run the forward pass
            y_pred = model(X_var)
            
            # 4. Compute the loss, gradients, and update the parameters by
            # calling optimizer.step()            
            loss = criterion(y_pred, y_var)
            # if using previous torch versions
            # t.set_postfix(loss=loss.data[0])            
            t.set_postfix(loss=loss.item())
            loss.backward()
            optimizer.step()

Before defining the validation function, which is going to be nearly identical to `train`, let me define a couple of helpers so our validation metrics will be directly comparable to those in the `full_process_keras` notebook. 

In pytorch there is no such thing as a built-in *"accuracy metric"*, or not that I have found, but it is very easy to code one. In addition, we would like to keep track of the mean of each validation metric over the validation steps. 

With that in mind, let's define two helper classes: 

In [4]:
class AverageMeter(object):
    """Computes and stores the average and current value
    from here: https://github.com/SeanNaren/deepspeech.pytorch/blob/master/model.py
    """
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


class Accuracy(nn.Module):
    """
    wrapper to compute accuracy
    """
    def __init__(self):
        super(Accuracy, self).__init__()

    def forward(self,y_pred,y):
        y_pred = (y_pred.view(-1, 1) > 0.5).data.float()
        y = y.view(-1, 1).data.float()
        # if using previous torch versions
        # acc = (y_pred == y).sum()/y.size(0)
        acc = (y_pred == y).sum().item()/y.size(0)
        return Variable(torch.FloatTensor([acc]))

    def __repr__(self):
        return self.__class__.__name__ + '(\n)'
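
To see what these two helpers compute, here is a toy example with made-up numbers. `binary_accuracy` and `RunningMean` are simplified stand-ins written for this illustration, not the classes defined above.

```python
import torch


def binary_accuracy(y_pred, y, threshold=0.5):
    """Fraction of positions where the thresholded prediction equals the label."""
    hard = (y_pred.view(-1) > threshold).float()
    return (hard == y.view(-1).float()).float().mean().item()


class RunningMean:
    """Keeps the mean of all values seen so far."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def update(self, value, n=1):
        self.total += value * n
        self.count += n

    @property
    def avg(self):
        return self.total / self.count if self.count else 0.0


y_pred = torch.tensor([[0.9, 0.2], [0.6, 0.4]])  # sigmoid outputs
y_true = torch.tensor([[1.0, 0.0], [0.0, 0.0]])  # labels

acc = binary_accuracy(y_pred, y_true)  # 3 of the 4 positions are correct
meter = RunningMean()
meter.update(acc)  # first "validation step"
meter.update(1.0)  # a second, perfect step
print(acc, meter.avg)  # 0.75 0.875
```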

And now, the validation stage, which is identical to `train` with the exception that we do not clear gradients or update parameters; we simply *"predict"* and compute the validation metrics:

In [5]:
def validate(val_gen, model, metrics, validation_steps):

    # switch to evaluate mode
    model.eval()

    losses = []
    for i in range(len(metrics)):
        losses.append(AverageMeter())

    with trange(validation_steps) as t:
        for i in t:
            t.set_description('validating')
            X,y = val_gen.__next__()
            X_var = Variable(torch.from_numpy(X).float())
            y_var = Variable(torch.from_numpy(y).float())
            if use_cuda:
                X_var, y_var = X_var.cuda(), y_var.cuda()
            y_pred = model(X_var)
            for j in range(len(metrics)):
                # if using previous torch versions
                # losses[j].update(metrics[j](y_pred, y_var).data[0])
                losses[j].update(metrics[j](y_pred, y_var).item())

        # note: loss.val is the value at the last validation step;
        # use loss.avg instead to report the mean over all steps
        for metric,loss in zip(metrics, losses):
            print("val_{}: {}".format(metric.__repr__().split("(")[0], loss.val))
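
Two things are worth knowing about the evaluation phase. `model.eval()` switches layers such as dropout to their inference behaviour, and wrapping the forward pass in `torch.no_grad()` (not used in the function above, so consider it an optional addition) avoids building the autograd graph at all. A minimal sketch with a throwaway model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5), nn.Linear(8, 1))
x = torch.randn(4, 8)

model.eval()           # dropout becomes the identity
with torch.no_grad():  # no graph is recorded, which saves memory
    first = model(x)
    second = model(x)

# in eval mode repeated predictions are identical, and they carry no gradient info
print(torch.equal(first, second), first.requires_grad)
```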

Now we can train. Let's first define the model and check that it all makes sense.

In [7]:
lstm_layers = 3
input_dim = n_chars
out_dim = 128 #(rnn_size in the keras notebook)
batch_size = 1024
dropout = 0.2

model = RNNCharTagger(lstm_layers, input_dim, out_dim, batch_size, dropout)
use_cuda = torch.cuda.is_available()
if use_cuda:
    model = model.cuda()
print(model)

RNNCharTagger(
  (lstm1): LSTM(96, 128, batch_first=True)
  (drop1): Dropout(p=0.2)
  (lstm2): LSTM(128, 128, batch_first=True)
  (drop2): Dropout(p=0.2)
  (lstm3): LSTM(128, 128, batch_first=True)
  (linear): TimeDistributed (
  Linear(in_features=128, out_features=1, bias=True))
)


In [8]:
dir_a = "data/sklearn_clean/"
dir_b = "data/scalaz_clean/"

# training and validation files
train_a = glob(os.path.join(dir_a, "train/*"))
train_b = glob(os.path.join(dir_b, "train/*"))
val_a = glob(os.path.join(dir_a, "test/*"))
val_b = glob(os.path.join(dir_b, "test/*"))

# sequences of less than 200 and more than 20 will be spliced together
min_jump_size_a = 20
min_jump_size_b = 20
max_jump_size_a = 200
max_jump_size_b = 200
juma = [min_jump_size_a, max_jump_size_a]
jumb = [min_jump_size_b, max_jump_size_b]

# length of the resulting sequence that will be passed to the RNN model
seq_len = 100

# start the generators
train_gen = generate_batches(train_a, juma, train_b, jumb, batch_size, seq_len)
val_gen = generate_batches(val_a, juma, val_b, jumb, batch_size, seq_len)

# set training and validation parameters
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
metrics = [nn.MSELoss(), nn.BCELoss(), Accuracy()]
epochs = 5
steps_per_epoch = 100
validation_steps = 50    
for epoch in range(1,epochs+1):
    train(train_gen, model, criterion, optimizer, epoch, steps_per_epoch)
    validate(val_gen, model, metrics, validation_steps)

epoch 1: 100%|██████████| 100/100 [00:34<00:00,  2.88it/s, loss=0.124]
validating: 100%|██████████| 50/50 [00:17<00:00,  2.78it/s]
epoch 2:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.13149894773960114
val_BCELoss: 0.41690024733543396
val_Accuracy: 0.8169531226158142


epoch 2: 100%|██████████| 100/100 [00:37<00:00,  2.68it/s, loss=0.0731]
validating: 100%|██████████| 50/50 [00:18<00:00,  2.70it/s]
epoch 3:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.08572780340909958
val_BCELoss: 0.29218724370002747
val_Accuracy: 0.8883105516433716


epoch 3: 100%|██████████| 100/100 [00:37<00:00,  2.68it/s, loss=0.0555]
validating: 100%|██████████| 50/50 [00:18<00:00,  2.74it/s]
epoch 4:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.06608203053474426
val_BCELoss: 0.23601852357387543
val_Accuracy: 0.9172949194908142


epoch 4: 100%|██████████| 100/100 [00:37<00:00,  2.67it/s, loss=0.0449]
validating: 100%|██████████| 50/50 [00:18<00:00,  2.69it/s]
epoch 5:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.05284154787659645
val_BCELoss: 0.19539694488048553
val_Accuracy: 0.93408203125


epoch 5: 100%|██████████| 100/100 [00:36<00:00,  2.70it/s, loss=0.0431]
validating: 100%|██████████| 50/50 [00:18<00:00,  2.74it/s]

val_MSELoss: 0.0439709909260273
val_BCELoss: 0.16351854801177979
val_Accuracy: 0.945263683795929





Note that with three layers of stateful LSTMs, 100 steps per epoch and 5 epochs, we achieve a better accuracy than with `keras` (0.945 here versus 0.8906 there). But not only that: remember that with `keras` every epoch took around 101 seconds, while here it takes approximately 36 seconds. FAST!

Let's try with Bidirectional LSTMs. 

In [7]:
# del(model)
model = BiRNNCharTagger(lstm_layers,n_chars,out_dim,batch_size,dropout)
use_cuda = torch.cuda.is_available()
if use_cuda:
    model = model.cuda()
print(model)

BiRNNCharTagger(
  (lstm): LSTM(96, 128, num_layers=3, batch_first=True, dropout=0.2, bidirectional=True)
  (linear): TimeDistributed (
  Linear(in_features=256, out_features=1, bias=True))
)


In [8]:
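# re-create the optimizer so it tracks the parameters of the new model
optimizer = optim.Adam(model.parameters(), lr=0.01)
for epoch in range(1,epochs+1):
    train(train_gen, model, criterion, optimizer, epoch, steps_per_epoch)
    validate(val_gen, model, metrics, validation_steps)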

epoch 1: 100%|██████████| 100/100 [00:56<00:00,  1.78it/s, loss=0.0843]
validating: 100%|██████████| 50/50 [00:18<00:00,  2.66it/s]
epoch 2:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.08892463892698288
val_BCELoss: 0.2923818826675415
val_Accuracy: 0.8800097703933716


epoch 2: 100%|██████████| 100/100 [00:58<00:00,  1.72it/s, loss=0.0459]
validating: 100%|██████████| 50/50 [00:19<00:00,  2.62it/s]
epoch 3:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.05173882469534874
val_BCELoss: 0.17716321349143982
val_Accuracy: 0.9314258098602295


epoch 3: 100%|██████████| 100/100 [00:59<00:00,  1.69it/s, loss=0.0314]
validating: 100%|██████████| 50/50 [00:19<00:00,  2.62it/s]
epoch 4:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.040631264448165894
val_BCELoss: 0.13843119144439697
val_Accuracy: 0.9457812309265137


epoch 4: 100%|██████████| 100/100 [00:59<00:00,  1.69it/s, loss=0.0234]
validating: 100%|██████████| 50/50 [00:19<00:00,  2.59it/s]
epoch 5:   0%|          | 0/100 [00:00<?, ?it/s]

val_MSELoss: 0.0357527956366539
val_BCELoss: 0.1246110275387764
val_Accuracy: 0.9513378739356995


epoch 5: 100%|██████████| 100/100 [00:58<00:00,  1.70it/s, loss=0.0191]
validating: 100%|██████████| 50/50 [00:19<00:00,  2.63it/s]

val_MSELoss: 0.02755051665008068
val_BCELoss: 0.09572076797485352
val_Accuracy: 0.9632617235183716





Again, compare with the results obtained using `keras` for the same number of epochs and steps per epoch: there the accuracy was 0.931 and every epoch took around 200 seconds, while here we get 0.963 with just under a minute of training per epoch. 

Let's save the model and plot the results

In [10]:
model_path = "models/model_pytorch"
MODEL_DIR = model_path.split("/")[0]
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)
torch.save(model.state_dict(), model_path)
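
Saving the `state_dict` (rather than the whole model object) means that, to load it back, you first re-create the model with the same arguments and then load the weights into it. A self-contained round trip with a small throwaway module and a temporary file (both chosen just for this sketch):

```python
import os
import tempfile

import torch
import torch.nn as nn

model_demo = nn.LSTM(4, 8, batch_first=True)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_demo")
    torch.save(model_demo.state_dict(), path)

    # re-create the architecture, then load the saved weights into it;
    # map_location="cpu" lets you load weights saved on a GPU on a CPU-only machine
    restored = nn.LSTM(4, 8, batch_first=True)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

same = all(torch.equal(p, q) for p, q in
           zip(model_demo.state_dict().values(), restored.state_dict().values()))
print(same)
```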

From here, tagging the characters and plotting is nearly identical (if not a bit easier) to the process described in `full_process_keras`

In [8]:
model = BiRNNCharTagger(lstm_layers,n_chars,out_dim,batch_size,dropout)
model.load_state_dict(torch.load(model_path))
if use_cuda:
    model = model.cuda()
model.eval()

gen = generate_batches(val_a, juma, val_b, jumb, batch_size, seq_len, return_text=True)
steps = 50

# 1. Store the predictions, labels and corresponding text
predictions, labels, texts = [],[],[]
with trange(steps) as t:
    for i in t:
        X,y,text = gen.__next__()
        X_var = Variable(torch.from_numpy(X).float())
        y_var = Variable(torch.from_numpy(y).float())
        if use_cuda:
            X_var, y_var = X_var.cuda(), y_var.cuda()
        pr = model(X_var)
        predictions.append(pr.data)
        labels.append(y_var.data)
        texts.append(text)

preds = torch.cat(predictions,dim=1).reshape(batch_size,steps*seq_len)
preds = preds.cpu().numpy()
labs = torch.cat(labels,dim=1).reshape(batch_size,steps*seq_len)
labs = labs.cpu().numpy()
txts = []
for j in range(batch_size):
    txts.append("".join([texts[i][j] for i in range(steps)]))


output_dir = "output/sklearn_or_scala_preds_pytorch"    
os.makedirs(output_dir, exist_ok=True)
for i in range(batch_size):
    path = os.path.join(output_dir, 'part_' + str(i).zfill(5) + ".joblib")
    dump((txts[i], preds[i], labs[i]), path)

100%|██████████| 50/50 [00:14<00:00,  3.39it/s]


And finally plotting

In [10]:
import matplotlib
import matplotlib.pyplot as plt

from joblib import load
%matplotlib inline

from plot_predictions import prediction_to_html

predictions_dir = "output/sklearn_or_scala_preds_pytorch" 
output_dir = "output/sklearn_or_scala_preds_pytorch_html"

os.makedirs(output_dir, exist_ok=True)
files = glob(os.path.join(predictions_dir, "*"))
for i, f in enumerate(files[100:110]):
    text, prediction, labels = load(f)
    html = prediction_to_html(text, prediction, labels, cmap="Reds")
    out_path = os.path.join(output_dir, 'part-' + str(i).zfill(5) + ".html")
    with open(out_path, "w") as out:
        out.write(html)

In [13]:
from IPython.display import display, HTML
display(HTML(filename="output/sklearn_or_scala_preds_pytorch_html/part-00004.html"))