### NLP Transfer Learning using ULMFit

Universal Language Model Fine-Tuning (ULMFit) is a transfer learning technique for NLP tasks. For text classification, it is one of the state-of-the-art techniques, alongside alternatives such as BERT and XLNet.

We use transfer learning to do text classification of IMDB movie reviews, using a language model pre-trained on Wikitext.

![alt text](https://nlp.fast.ai/images/ulmfit_approach.png)

The code here is based on the following four notebooks in Fastai's Course V3 Part 2 - [text pre-process](https://github.com/fastai/course-v3/blob/master/nbs/dl2/12_text.ipynb), [awd-lstm](https://github.com/fastai/course-v3/blob/master/nbs/dl2/12a_awd_lstm.ipynb), [LM pre-train](https://github.com/fastai/course-v3/blob/master/nbs/dl2/12b_lm_pretrain.ipynb) and [ULMFit](https://github.com/fastai/course-v3/blob/master/nbs/dl2/12c_ulmfit.ipynb). The logic for text pre-processing has been implemented in the data_lib and imported into this notebook.

NB: This is intended to be a clean, updated version of the notebook KD AWD LSTM, which had become very messy and intertwined.

**ULMFit**

Note that our transfer learning changes two things at once:
1. Dataset: from the Wikitext corpus to IMDB movie reviews
2. Problem: from language modeling (i.e. next-word prediction) to text classification

Hence, the overall process we follow is as shown below:

![alt text](https://miro.medium.com/max/1413/1*stYzRq07Blajrg2l6gw9Aw.png)

1. Load text data for IMDB and Wikitext and pre-process it for Language Modeling
2. Build our AWD-LSTM module which will serve as the core module for both our Language Model and Text Classification architecture.  
3. Build our Language Model architecture using AWD-LSTM and a Linear Decoder
4. Train our Language Model using Wikitext data
5. Adapt the data of the pre-trained Wikitext Language Model to IMDB - transform the vocab and embeddings
6. Adapt the weights of the pre-trained Wikitext Language Model to IMDB - retrain on IMDB
7. Load text data for IMDB and pre-process it for Classification
8. Build our Classification architecture using AWD-LSTM and a Pooling Classifier
9. Train our Classification using IMDB data
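
Step 5 is probably the least obvious of the steps above, so here is a minimal sketch of the idea in plain Python (lists instead of tensors). It assumes the approach used in the Fastai lesson, as far as I understand it: rows for words present in both vocabs are reused, and words that the Wikitext model never saw start from the mean of the pre-trained embeddings. The helper `adapt_embeddings` and the toy vocabs are hypothetical and are not part of this notebook's libraries.

```python
def adapt_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding matrix for new_vocab from one trained on old_vocab."""
    emb_sz = len(old_emb[0])
    # Column-wise mean of the pre-trained embeddings, used for unseen words
    mean_vec = [sum(row[j] for row in old_emb) / len(old_emb) for j in range(emb_sz)]
    old_index = {word: i for i, word in enumerate(old_vocab)}

    new_emb = []
    for word in new_vocab:
        if word in old_index:
            # Word exists in the pre-trained vocab: reuse its embedding row
            new_emb.append(list(old_emb[old_index[word]]))
        else:
            # New word: start from the mean embedding
            new_emb.append(list(mean_vec))
    return new_emb

# Toy data (hypothetical): 3 pre-trained words with 2-dimensional embeddings
wiki_vocab = ["the", "film", "river"]
wiki_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
imdb_vocab = ["film", "the", "blockbuster"]

imdb_emb = adapt_embeddings(wiki_vocab, wiki_emb, imdb_vocab)
print(imdb_emb)  # [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
```

The new matrix follows the row order of the IMDB vocab, which is why the vocab has to be saved together with the pre-trained weights.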

**Todos**
*   Add DataBundle for data instead of Fastai data
*   Track results in DTR
*   DTR Hyperparameters tracking
*   Trigger the shuffle of the LMDataset at the beginning of each epoch. Should be done inside AWD-LSTM Callback
*   Add comments as required
*   Cleanup cells with Fastai code
*   Run for epochs and check results. Compare data/results with Fastai
*   Make sure App and Arch are clean and as per standard template
*   Display batch and results
*   Train the model with Wikitext data and save the data and the model
*   Check all the parameters at each step with what is used in the Fastai lesson eg. lr, bs, bptt, emb_sz, nh, nl, dps, scheduling rates, gradient clipping, optimiser hyper params, AWD-LSTM CB params, num_epochs

### Import Fastai Lesson code - this is needed only to load the data (from IMDB)

In [0]:
#----------------------------------------------------
# Import Fastai Lesson code
#----------------------------------------------------
!git clone https://github.com/fastai/course-v3.git
!mv course-v3/nbs/dl2/exp .
import IPython.core.debugger as db

! git clone https://github.com/NVIDIA/apex
! pip install -v --no-cache-dir apex/

#export
from exp.nb_12 import *

In [0]:
#----------------------------------------------------
# Create the Data Bunch using Fastai Lesson code
# After running this cell once in a session, set 'loaded' to True to simply load the pickle file instead of running all the processing again
#----------------------------------------------------
torch.manual_seed(0)
path = datasets.untar_data(datasets.URLs.IMDB)
loaded=True
if (not loaded):
  il = TextList.from_files(path, include=['train', 'test', 'unsup'])
  sd = SplitData.split_by_func(il, partial(random_splitter, p_valid=0.1))
  proc_tok,proc_num = TokenizeProcessor(max_workers=8),NumericalizeProcessor()
  ll = label_by_func(sd, lambda x: 0, proc_x = [proc_tok,proc_num])
  pickle.dump(ll, open(path/'ll_lm.pkl', 'wb'))
  pickle.dump(proc_num.vocab, open(path/'vocab_lm.pkl', 'wb'))
ll = pickle.load(open(path/'ll_lm.pkl', 'rb'))
vocab = pickle.load(open(path/'vocab_lm.pkl', 'rb'))
bs,bptt = 64,70
data = lm_databunchify(ll, bs, bptt)

#----------------------------------------------------
# The Data Bunch loaded above takes 12-30 minutes to train 1 epoch.
# So we make a subset of that data to allow faster training while debugging
#----------------------------------------------------

# Get subset of data
num_t = 2500
num_v = num_t // 5

new_t, new_v = ll.train, ll.valid
tx, ty=TextList(new_t.x[:num_t]), TextList(new_t.y[:num_t])
vx, vy=TextList(new_v.x[:num_v]), TextList(new_v.y[:num_v])
tll = LabeledData(tx, ty)
vll = LabeledData(vx, vy)
new_sd = SplitData(tll, vll)
subset_data = lm_databunchify(new_sd, bs, bptt)

# Grab a single batch to run some tests below
x,y = next(iter(subset_data.train_dl))
print('Data ', x.float().mean(), y.float().mean())

Downloading https://s3.amazonaws.com/fast-ai-nlp/imdb


Data  tensor(1626.1665) tensor(1617.8551)
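
The `bs` and `bptt` values above control how the token stream is cut into batches. The sketch below illustrates the general idea in plain Python and does not reproduce `lm_databunchify`: the stream is split into `bs` contiguous rows, each batch takes up to `bptt` columns, and the target is the input shifted by one token. The function `lm_batches` and the toy stream are hypothetical.

```python
def lm_batches(stream, bs, bptt):
    """Yield (x, y) batches from a flat token stream; y is x shifted by one token."""
    # Each of the bs rows is a contiguous slice of the stream; leftover tokens are dropped
    row_len = len(stream) // bs
    rows = [stream[i * row_len:(i + 1) * row_len] for i in range(bs)]

    # Step through the rows bptt columns at a time. The last column of each
    # row has no next token, so it is only ever used as a target.
    for start in range(0, row_len - 1, bptt):
        end = min(start + bptt, row_len - 1)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = list(range(20))   # stand-in for numericalized tokens
batches = list(lm_batches(stream, bs=2, bptt=4))
first_x, first_y = batches[0]
print(first_x)  # [[0, 1, 2, 3], [10, 11, 12, 13]]
print(first_y)  # [[1, 2, 3, 4], [11, 12, 13, 14]]
```

Row `i` of each batch continues where row `i` of the previous batch stopped, which is why the model can carry its hidden state from one batch to the next.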


### Import KD Libraries

In [0]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [0]:
import IPython.core.debugger as db
from functools import partial
import pickle
import warnings
import torch
import torch.nn.functional as F
from torch import tensor, nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')
gd_path = 'gdrive/My Drive/Colab Data/fastai-v3'  #change dir to your project folder
gn_path = 'gdrive/My Drive/Colab Notebooks'  #change dir to your project folder

import sys
sys.path.insert(1, gn_path + '/exp')

In [0]:
from nb_data import *
from nb_training import *
from nb_optimiser import *

### Define the Language Model Application class

In [0]:
#----------------------------------------------------
# Language Model Application
#----------------------------------------------------
class AppLanguageModel():

  def __init__(self):
    self._arch = None
    self.lmdb = None
    self.vocab = None
    pass

  def load_data(self, data_path):
    ds_params = {'target_ds': FastaiLMDataset, 'bs': 64, 'bptt': 70}
    self.lmdb = LanguageModelFolderDataBundle(data_path, ds_params)
    self.lmdb.do()
    self.vocab = self.lmdb.convert_state['vocab_i2w']

  def create_arch(self, *dps):
    tok_pad = self.vocab.index(PAD)
    emb_sz, n_h, n_layers, pad_idx = 300, 300, 2, tok_pad
    self._arch = ArchLanguageModel(self.vocab, emb_sz, n_h, n_layers, pad_idx, *dps)
    return self._arch

  def run_train(self, split_lr=[5e-3], split=False, one_cycle=False, freeze="UNFREEZE_LSTM", num_epochs=1):
    train_dl = self.lmdb.train_dl
    valid_dl = self.lmdb.valid_dl
    # NB: Use cross_entropy_flat for Language Model
    loss_func = cross_entropy_flat

    # split_lr is a list:
    #   1. a single-element list [0.01] - same LR for all groups. 
    #        If 'Split' is False, there is only one group. 
    #        If 'Split' is True, there are multiple groups.
    #   2. a multi-element list [0.01, 0.03, 0.05] - discriminative LR for different groups. 
    #        'Split' cannot be False
    # eg. split_lr = [lr/2., lr/2., lr] for one cycle
    assert(isinstance(split_lr, list))
    assert(len(split_lr) > 0)
    assert(not ((split == False) and (len(split_lr) > 1)))

    # NB: Use AwdLstmCB for Language Model and accuracy_flat
    callbs=[CudaCB(device = torch.device('cuda',0)), AwdLstmCB(alpha=2., beta=1.), GradientClipping(clip=0.1), ProgressCallback(), MetricsCB({"acc": accuracy_flat})]
    if (one_cycle):
      one_cycle_callbs = create_OneCycleCB(split_lr, phases=[0.5, 0.5], mom_start=0.8, mom_mid=0.7, mom_end=0.8)
      callbs = callbs + one_cycle_callbs

    model = self._arch.model
    self._arch.freeze(freeze)

    print ('BEFORE Hyper parameters')
    # eg. opt_func = optim.SGD, sgd_opt_func, adam_opt_func
    opt_func=adam_opt_func
    lr = split_lr[0]
    if (split and (len(split_lr) == 1)):
      hypers_group = [{}] * self._arch.n_splits
      opt_groups=(self._arch.splitter, hypers_group, {'lr': lr})
    elif (split and (len(split_lr) > 1)):
      hypers_group = [{'lr': lr_g} for lr_g in split_lr]
      opt_groups=(self._arch.splitter, hypers_group, {})
    else:
      opt_groups = None
    opt = get_optimiser(model, lr, opt_func, opt_groups)

    dtr = DebugTracker(disp=(True, False))
    debug_cbs = [dtr, DebugYhatLossCB()]
    callbs = callbs + debug_cbs

    loop = Trainer(train_dl, valid_dl, model, opt, loss_func, callbs, dtr=dtr)
    loop.fit(num_epochs=num_epochs)

    # TODO !!!!!!! Make _print_opt() a public function, maybe using repr
    print ('AFTER Hyper parameters')
    loop.opt._print_opt()
    return loop
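
The `split_lr` rules described in the comments of `run_train` can be summarised as a small stand-alone function. This is only a sketch of those rules in plain Python: `resolve_group_lrs` is a hypothetical helper, not part of `nb_optimiser`, and the check that a multi-element list has exactly one rate per group is my assumption.

```python
def resolve_group_lrs(split_lr, split, n_groups):
    """Return one learning rate per parameter group, following the split_lr rules."""
    assert isinstance(split_lr, list) and len(split_lr) > 0
    # Several learning rates only make sense when the model is split into groups
    assert split or len(split_lr) == 1

    if not split:
        return [split_lr[0]]            # one group, one learning rate
    if len(split_lr) == 1:
        return split_lr * n_groups      # same learning rate for every group
    assert len(split_lr) == n_groups    # assumption: one rate per group
    return list(split_lr)               # discriminative learning rates

lr = 5e-3
print(resolve_group_lrs([lr], split=False, n_groups=3))
print(resolve_group_lrs([lr], split=True, n_groups=3))
print(resolve_group_lrs([lr / 2, lr / 2, lr], split=True, n_groups=3))
```

The last call corresponds to the `[lr/2., lr/2., lr]` example in the comments, where the last group trains with the full learning rate.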

### Step 1 - Load Text Data (IMDB and Wikitext) and pre-process 

In [0]:
#----------------------------------------------------
# Import IMDb dataset that contains 50,000 labeled reviews and 50,000 unlabeled reviews. 
#----------------------------------------------------

lm_app = AppLanguageModel()

path_imdb = datasets.untar_data(datasets.URLs.IMDB)
lm_app.load_data(path_imdb)

# !!!!!!! TEMP
lm_imdb_vocab = lm_app.vocab

--------- IMDB Language Model DataBundle init /root/.fastai/data/imdb {'target_ds': <class 'nb_data.FastaiLMDataset'>, 'bs': 64, 'bptt': 70}
FolderItemContainer loaded 100001 items of type TextFileItemList
Split using split_random into 3000, 2000 and 0 items of type TextFileItemList
Extracted 3000 items of type SentenceItemList using extract_doc
Extracted 3000 items of type DummyItemList using extract_dummy
Converted 3000 items to type SentenceWordItemList using SentenceToWord
Converted 3000 items to type SentenceWordIdItemList using WordToWordId
Converted 894342 items to type StreamWordIdItemList using WordIdToStream
Extracted 2000 items of type SentenceItemList using extract_doc
Extracted 2000 items of type DummyItemList using extract_dummy
Converted 2000 items to type SentenceWordItemList using SentenceToWord
Converted 2000 items to type SentenceWordIdItemList using WordToWordId
Converted 595678 items to type StreamWordIdItemList using WordIdToStream
Final StreamWordIdItemList (8943

### Step 2 - Build AWD-LSTM Module (Dropout Layers)

In [0]:
#----------------------------------------------------
# AWD-LSTM implements dropout at several places in the LSTM model
#
# Essentially, dropout replaces some elements with 0, each with probability p. To keep the
# expected value of the tensor unchanged, the elements that aren't zeroed are scaled
# by a factor of 1/(1-p)
#----------------------------------------------------

#----------------------------------------------------
# Create a mask that tells us which elements to nullify or not
#----------------------------------------------------
def dropout_mask(x, sz, p):
  return x.new(*sz).bernoulli_(1-p).div_(1-p)

#----------------------------------------------------
# Dropout for an RNN has to work differently than dropout for a fully-connected network. Since an RNN
# has timesteps, a value that is dropped should be dropped at every timestep.
#
# Inside an RNN, all tensors have shape (samples, timesteps, features). We want to apply the same
# dropout mask at every timestep, so we create a mask over the samples and features dimensions
# only and broadcast it across the timesteps dimension.
# 
# Once we have a mask, applying the dropout to a tensor 'x' is simply done by 'x = x * mask'
#----------------------------------------------------
class RNNDropout(nn.Module):
  def __init__(self, p=0.5):
    super().__init__()
    self.p=p

  def forward(self, x):
    if not self.training or self.p == 0.: return x
    # The mask 'm' has shape (samples, 1, features). The timesteps dimension
    # which has size 1 gets replicated since it gets broadcast when the mask 
    # is multiplied with 'x'
    m = dropout_mask(x.data, (x.size(0), 1, x.size(2)), self.p)
    return x * m

#----------------------------------------------------
# Weight Dropout is applied to the hidden-to-hidden weight matrix inside the LSTM by zeroing out 
# individual weights at random. This is a little hacky if we want to preserve the CuDNN speed and not 
# reimplement the cell from scratch. It needs to be done in a way that ensures the gradients 
# are still computed and the raw weights are still updated. 
#
# We wrap the LSTM with the WeightDropout module. We add a parameter that will contain the raw weights. 
# Weight masks are randomly refreshed at each batch's forward pass.
#----------------------------------------------------
WEIGHT_HH = 'weight_hh_l0'
class WeightDropout(nn.Module):
  def __init__(self, module, weight_p=0., param_names=(WEIGHT_HH,)):
    super().__init__()
    self.module,self.weight_p,self.param_names = module,weight_p,param_names
    
    # For each LSTM weight that we want to apply dropout to
    for param in self.param_names:
      # Keep a trainable copy of the LSTM's weights on this wrapper, named '<param>_raw'
      w = getattr(self.module, param)
      self.register_parameter(f'{param}_raw', nn.Parameter(w.data))

      # F.dropout with training=False returns its input unchanged, so at this point the
      # LSTM still sees the unmodified weights. _setweights() overwrites this entry
      # at the start of every forward pass.
      self.module._parameters[param] = F.dropout(w, p=self.weight_p, training=False)

  # ----------------------------
  # Update the LSTM weights at the beginning of each forward pass
  # ----------------------------
  def _setweights(self):
    for param in self.param_names:
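      # Recompute the LSTM's weights from the saved raw copy. In training mode a new random
      # mask is applied each time; in eval mode the weights pass through unchanged.
      # The masked tensor is computed from raw_w, so gradients flow back to the '_raw' parameter.
      raw_w = getattr(self, f'{param}_raw')
      self.module._parameters[param] = F.dropout(raw_w, p=self.weight_p, training=self.training)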

  # ----------------------------
  # During our forward, we update the LSTM weights and then call the LSTM forward. Since we
  # are wrapping the LSTM, the LSTM forward is not called directly.
  # ----------------------------
  def forward(self, *args):
    self._setweights()
    with warnings.catch_warnings():
      #To avoid the warning that comes because the weights aren't flattened.
      warnings.simplefilter("ignore")
      return self.module.forward(*args)

#----------------------------------------------------
# Embedding Dropout is applied when we lookup the ids of our tokens inside the embedding matrix
# As the dropout occurs on the embedding matrix that is used for a full forward and backward 
# pass, this means that all occurrences of a specific word will disappear within that pass
#----------------------------------------------------
class EmbeddingDropout(nn.Module):
  "Applies dropout in the embedding layer by zeroing out some elements of the embedding vector."
  def __init__(self, emb, embed_p):
    super().__init__()
    self.emb,self.embed_p = emb,embed_p
    self.pad_idx = self.emb.padding_idx
    if self.pad_idx is None: self.pad_idx = -1

  def forward(self, words, scale=None):
    if self.training and self.embed_p != 0:
      # Since we zero out the entire embedding vector, our mask only looks at the
      # first size dimension ie. the number of embedding words
      size = (self.emb.weight.size(0),1)
      mask = dropout_mask(self.emb.weight.data, size, self.embed_p)

      # Apply the dropout mask on the embedding
      masked_embed = self.emb.weight * mask
    else:
      # No dropout, use the original embedding
      masked_embed = self.emb.weight

    if scale: 
      # Multiply out-of-place so that the stored embedding weights are never modified
      masked_embed = masked_embed * scale

    return F.embedding(words, masked_embed, self.pad_idx, self.emb.max_norm,
                           self.emb.norm_type, self.emb.scale_grad_by_freq, self.emb.sparse)

test_x = torch.randn(4, 5)
test_mask = dropout_mask(test_x, (4, 5), 0.5); print('Test Mask is', test_mask)

test_dp = RNNDropout(0.3)
test_x = torch.randn(2, 3, 4)
print ('Original input\n', test_x, '\nAfter dropout\n', test_dp(test_x))

test_module = nn.LSTM(5, 2)
test_dp_module = WeightDropout(test_module, 0.4)
print('Initial weights are', getattr(test_dp_module.module, WEIGHT_HH))

test_input = torch.randn(4,8,5)
test_h = (torch.zeros(1,8,2), torch.zeros(1,8,2))
test_x, test_h = test_dp_module(test_input, test_h)
print('After forward weights are', getattr(test_dp_module.module, WEIGHT_HH))

test_enc = nn.Embedding(100, 7, padding_idx=1)
test_enc_dp = EmbeddingDropout(test_enc, 0.5)
test_input = torch.randint(0,100,(8,))
test_enc_dp(test_input)

Test Mask is tensor([[2., 2., 2., 0., 2.],
        [2., 0., 0., 2., 2.],
        [0., 2., 0., 0., 2.],
        [0., 2., 2., 0., 2.]])
Original input
 tensor([[[-2.0707,  0.6421, -0.9576,  1.9489],
         [ 0.2422,  1.5373,  1.3674,  0.9027],
         [-0.0936, -2.5835,  0.8783, -1.1348]],

        [[-1.2275,  2.2400,  0.7559,  1.0947],
         [-0.4146, -1.2804,  1.0682,  1.3891],
         [ 0.2492,  0.1286, -1.6805,  1.0375]]]) 
After dropout
 tensor([[[-2.9582,  0.9173, -0.0000,  0.0000],
         [ 0.3460,  2.1961,  0.0000,  0.0000],
         [-0.1337, -3.6908,  0.0000, -0.0000]],

        [[-1.7536,  3.2001,  0.0000,  1.5639],
         [-0.5923, -1.8291,  0.0000,  1.9845],
         [ 0.3559,  0.1837, -0.0000,  1.4822]]])
Initial weights are Parameter containing:
tensor([[-1.9371e-02, -7.6613e-02],
        [ 2.7376e-04,  3.3425e-01],
        [-3.6033e-01, -4.4733e-01],
        [ 3.2776e-01, -2.0330e-01],
        [-4.5878e-01, -3.2258e-01],
        [-6.1957e-01,  5.1847e-01],
    

tensor([[-1.0230, -0.2401,  1.1616, -2.6485,  2.1642,  3.8254,  4.4194],
        [ 1.1437,  0.6681,  1.0725, -1.4870, -0.7679, -1.1570, -0.9646],
        [-0.0000, -0.0000,  0.0000,  0.0000, -0.0000,  0.0000,  0.0000],
        [ 0.0000, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000, -0.0000],
        [ 0.0000,  0.0000,  0.0000,  0.0000, -0.0000, -0.0000,  0.0000],
        [ 1.7627, -2.3011,  2.8185, -3.2187,  1.5134,  0.2587, -0.8498],
        [ 0.0805,  2.8928,  0.9860, -0.2880,  0.4170,  0.0177,  1.6220],
        [ 1.5210, -1.3632, -1.6870,  0.4517,  0.4301, -1.5746, -1.2623]],
       grad_fn=<EmbeddingBackward>)
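
The $1/(1-p)$ correction used by `dropout_mask` can be checked with a one-line expectation. As a worked sketch, write each mask entry as $m = b/(1-p)$ with $b \sim \mathrm{Bernoulli}(1-p)$ (the symbols $m$ and $b$ are introduced here only for illustration), and treat the input $x$ as fixed:

$$
\mathbb{E}[m] = \frac{(1-p)\cdot 1 + p \cdot 0}{1-p} = 1
\quad\Longrightarrow\quad
\mathbb{E}[m \cdot x] = x
$$

With $p = 0.5$ the surviving entries are $1/(1-0.5) = 2$, which matches the 0s and 2s in the Test Mask output above. With $p = 0.3$ the factor is $1/0.7 \approx 1.4286$, e.g. $-2.0707 \times 1.4286 \approx -2.9582$ for the first entry of the RNNDropout output.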

### Step 2 - Build AWD-LSTM Module (Core) and Callbacks

**Core AWD-LSTM Module**

In [0]:
#----------------------------------------------------
# AWD-LSTM architecture. It is a multi-layer LSTM network with 'n_layers' layers.
# The input data goes through an embedding layer (with embedding dropout). The resulting output of
# the embedding layer goes through the 'input dropout'. This is then fed to the first LSTM
# layer. Each LSTM layer has a 'weight dropout'. The output of the first LSTM layer goes through
# a 'hidden dropout' before being fed to the second LSTM layer. And so on for each layer.
#
# The last LSTM layer doesn't have a 'hidden dropout' but an 'output dropout' instead.
# That then goes through a Linear Decoder module which produces the final output.
#----------------------------------------------------

class AWD_LSTM(nn.Module):
  emb_init=0.1

  # ----------------------------
  # Build the architecture
  # ----------------------------
  def __init__(self, vocab_sz, emb_sz, n_h, n_layers, pad_idx, emb_p, inp_p, weight_p, hidden_p, apply_packing=False):
    super().__init__()
    self.bs = 1
    self.vocab_sz, self.emb_sz, self.n_h, self.n_layers = vocab_sz, emb_sz, n_h, n_layers
    self.pad_idx, self.apply_packing = pad_idx, apply_packing

    # Probabilities for various dropouts
    self.emb_p, self.inp_p, self.weight_p, self.hidden_p = emb_p, inp_p, weight_p, hidden_p

    # Embedding layer with Dropout
    # Embedding matrix has shape [vocab size, embedding vector size]
    self.emb = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_idx)
    self.emb.weight.data.uniform_(-self.emb_init, self.emb_init)
    self.emb_dp = EmbeddingDropout(self.emb, emb_p)

    # Input dropout (post Embedding)
    self.inp_dp = RNNDropout(inp_p)

    # Create the multi-layer deep RNN network. We create a list of LSTM layers and another list
    # of Hidden Dropouts, one for each LSTM layer.
    self.lstm_layers = nn.ModuleList([self._create_lstm(*self._layer_sz(i)) for i in range(n_layers)])
    self.hidden_dps = nn.ModuleList([self._create_hidden_dp(i == n_layers-1) for i in range(n_layers)])

    # Re-initialise the hidden state for all LSTM layers
    self.reset_state()

  # ----------------------------
  # Process the forward pass
  # ----------------------------
  def forward(self, input):
    # If the batch size changes, then re-initialise the hidden state (whose shape depends on batch size)
    bs, ts = input.size()
    if bs!=self.bs:
      self.bs=bs
      self.reset_state()

    # When this module is used for Classification, the input data is padded as the sequences 
    # could be of different lengths. To make the LSTM computations efficient, we pack this
    # data more compactly as an optimisation. Padding (and therefore packing) is not needed when 
    # the module is used for a Language Model, because the input is a continuous stream of tokens 
    # rather than individual sentences.
    if (self.apply_packing):
      input, pad_mask, seq_lengths = self._pad_mask(input, ts)

    # Input goes through Embedding layer with dropout
    # Input has shape [samples, timesteps]
    # Emb_val has shape [samples, timesteps, embedding size]
    emb_val = self.emb_dp(input)

    # Now apply Input dropout to the result of the embedding
    # This will then be fed to the LSTM layers
    # Inp_val has shape [samples, timesteps, embedding size]
    inp_val = self.inp_dp(emb_val)

    #print ('KD input and inp val', input.float().mean(), inp_val.mean())

    # Keep a list of hidden state, raw output values and output values post dropout for each LSTM layer
    new_states, out_vals, out_dp_vals = [], [], []

    # Go through each LSTM layer and its corresponding Hidden Dropout and Hidden State
    for lstm_dp, hidden_dp, state in zip(self.lstm_layers, self.hidden_dps, self.state):
      
      if (self.apply_packing):
        # Pack the data, apply the LSTM and then unpack the data
        inp_val = pack_padded_sequence(inp_val, seq_lengths, batch_first=True)
        out_val, new_state = lstm_dp(inp_val, state)
        out_val = pad_packed_sequence(out_val, batch_first=True)[0]
      else:
        # Apply the LSTM directly
        out_val, new_state = lstm_dp(inp_val, state)

      # Apply the Hidden Dropout to the LSTM output
      out_dp_val = hidden_dp(out_val)

      #print ('KD layer', out_val.mean(), state[0].mean(), state[1].mean(), new_state[0].mean(), new_state[1].mean())
      #print ('KD hidden', out_dp_val.mean())

      # Add the state, raw output value and output value post dropout for this layer, to the lists
      # [hidden state layer 1, hidden state layer 2, ....]
      # [raw output layer 1, raw output layer 2, ...]
      # [(post dropout) output layer 1, output layer 2, ...]
      new_states.append(new_state)
      out_vals.append(out_val)
      out_dp_vals.append(out_dp_val)

      # The post-dropout output will become the input to the next layer
      inp_val = out_dp_val

    # Save the new hidden states
    self.state = self._to_detach(new_states)

    # Return ([raw outputs for each layer], [post-dropout outputs for each layer]),
    # plus the padding mask when packing is used
    if (self.apply_packing):
      return (out_vals, out_dp_vals, pad_mask)
    else:
      return (out_vals, out_dp_vals)

  # ----------------------------
  # Initialise the hidden state for all LSTM layers
  # ----------------------------
  def reset_state(self):
    # [hidden state layer 1, hidden state layer 2, ...]
    self.state = [self._zero_state(*self._layer_sz(i)) for i in range(self.n_layers)]

  # ----------------------------
  # Detach 'h' from its history
  # ----------------------------
  def _to_detach(self, h):
    return h.detach() if isinstance(h, torch.Tensor) else tuple(self._to_detach(v) for v in h)

  # ----------------------------
  # Compute the dimensions of the i'th LSTM layer
  #   n_x = number of input features of the LSTM
  #   n_h = number of hidden features of the LSTM
  #
  # NB:     
  #     the input of a LSTM layer has shape (samples, timesteps, input features)
  #    the output of a LSTM layer has shape (samples, timesteps, hidden features)
  #    Since the first two dimensions are always the same, we mention only the third
  #    dimension in the comments below.
  # ----------------------------
  def _layer_sz(self, layer_i):
    is_last = False

    if (layer_i == 0):
      # First layer has (n_x, n_h) = (embedding size, hidden size)
      # ie. input size matches the embedding, and output size is hidden
      n_x = self.emb_sz
      n_h = self.n_h

    elif (layer_i == self.n_layers-1):
      # Last layer has (n_x, n_h) = (hidden size, embedding size)
      # ie. input size matches the output size of previous layer = hidden
      # and output size is same as embedding
      n_x = self.n_h
      n_h = self.emb_sz
      is_last = True

    else:
      # All middle layers have (n_x, n_h) = (hidden size, hidden size)
      # ie. input size matches the output size of previous layer = hidden
      # and output size is also hidden
      n_x = self.n_h
      n_h = self.n_h

    return (n_x, n_h, is_last)

  # ----------------------------
  # Initialise the hidden state for one LSTM layer
  # ----------------------------
  def _zero_state(self, n_x, n_h, is_last):
    # Borrow any existing parameter so that the new state tensors created from it
    # below share its data type and device
    some_param = next(self.parameters()).data

    # Create new hidden state and set all values to 0.
    # Hidden and Cell state has shape (1, samples, hidden features)
    state_h = some_param.new(1, self.bs, n_h).zero_()
    state_c = some_param.new(1, self.bs, n_h).zero_()
    return ((state_h, state_c))

  # ----------------------------
  # Create one LSTM layer with its Weight Dropout
  # ----------------------------
  def _create_lstm(self, n_x, n_h, is_last):
    # 'batch_first' is true because the data is shaped (samples, timesteps, hidden size)
    # instead of (timesteps, samples, hidden size)
    lstm = nn.LSTM(n_x, n_h, 1, batch_first=True)
    lstm_dp = WeightDropout(lstm, self.weight_p)
    return (lstm_dp)

  # ----------------------------
  # Create one Hidden Dropout layer
  # Hidden Dropout is not applied to the last layer so we use an Identity layer
  # which does a No-op
  # ----------------------------
  def _create_hidden_dp(self, is_last):
    if (is_last):
      hidden_dp = nn.Identity()
    else:
      hidden_dp = RNNDropout(self.hidden_p)
    return (hidden_dp)

  # ----------------------------
  # Compute the padding mask
  # ----------------------------
  def _pad_mask(self, input, ts):
    # Mask contains True for all elements which are padding
    # Shape is same as input ie. [samples, timesteps]
    pad_mask = (input == self.pad_idx)
    # Count number of padding values in each row by adding up all True mask values per row
    # Subtract that from the total width (ie. timesteps) to get the length of the data sequence
    # Shape is [samples]
    seq_lengths = ts - pad_mask.long().sum(1)

    # Strip out rows with no sequence data
    n_empty = (seq_lengths == 0).sum()
    if n_empty > 0:
      input = input[:-n_empty]
      seq_lengths = seq_lengths[:-n_empty]
      self.state = [(h[0][:,:input.size(0)], h[1][:,:input.size(0)]) for h in self.state]

    return (input, pad_mask, seq_lengths)

#----------------------------------------------------
# A sequential module that passes the reset call to its children
# KD - Not sure when this reset functionality is needed or how it is used
#----------------------------------------------------
class SequentialRNN(nn.Sequential):
  def reset(self):
    for c in self.children():
      if hasattr(c, 'reset_state'): c.reset_state()
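
The sizing rule in `_layer_sz` is easier to see on concrete numbers. The sketch below restates that rule as a stand-alone plain Python function; `layer_sizes` is a hypothetical helper, and the sizes 400 and 1150 are illustrative values rather than the ones used in this notebook.

```python
def layer_sizes(emb_sz, n_h, n_layers):
    """Return (input size, output size) for each LSTM layer in the stack."""
    sizes = []
    for i in range(n_layers):
        # The first layer reads embeddings, every other layer reads hidden features
        n_in = emb_sz if i == 0 else n_h
        # The last layer (when it is not also the first) maps back to embedding size
        n_out = emb_sz if (i == n_layers - 1 and i != 0) else n_h
        sizes.append((n_in, n_out))
    return sizes

print(layer_sizes(400, 1150, 3))  # [(400, 1150), (1150, 1150), (1150, 400)]
print(layer_sizes(300, 300, 2))   # [(300, 300), (300, 300)]
```

Mapping the last layer back to the embedding size is what allows the decoder to share its weights with the embedding layer. Note that with a single layer the first-layer rule wins, so the output would have hidden size rather than embedding size.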

**AWD-LSTM Callbacks**

In [0]:
#----------------------------------------------------
# Implement AWD LSTM logic for:
#     1. Modify the architecture's output, which returns a tuple (decoded, raw output, output), to
#        keep only the decoded tensor as the 'yhat' value (for the loss function). The 
#        raw output and output are stored separately
#     2. Apply Activation Regularization (AR): we add to the loss an L2 penalty on the last 
#        activations of the AWD LSTM (with dropout applied)
#     3. Apply Temporal Activation Regularization (TAR): we add to the loss an L2 penalty on 
#        the difference between two consecutive (in terms of timesteps) raw outputs
#     4. TODO - Trigger the shuffle of the LMDataset at the beginning of each epoch
#----------------------------------------------------
class AwdLstmCB(Callback):
  def __init__(self, alpha, beta):
    self.alpha, self.beta = alpha, beta

  def after_tr_pred(self, ctx):
    self._extract_yhat(ctx)

  def after_val_pred(self, ctx):
    self._extract_yhat(ctx)

  def _extract_yhat(self, ctx):
    ctx.out_vals, ctx.out_dp_vals = ctx.yhat[1], ctx.yhat[2]
    ctx.yhat = ctx.yhat[0]

  def _regularise_ar_tar(self, ctx):
    if self.alpha != 0.:
      ctx.loss += self.alpha * ctx.out_dp_vals[-1].float().pow(2).mean()

    if self.beta != 0.:
      h = ctx.out_vals[-1]
      if h.size(1)>1: 
        ctx.loss += self.beta * (h[:,1:] - h[:,:-1]).float().pow(2).mean()
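
To make the two penalties in `_regularise_ar_tar` concrete, here is a small worked example in plain Python for a single sample with 3 timesteps and 2 features. It is a sketch of the arithmetic only: the helpers `mean_sq`, `ar_penalty` and `tar_penalty`, the toy activations and the cross-entropy value are all hypothetical.

```python
def mean_sq(values):
    """Mean of the squared values."""
    return sum(v * v for v in values) / len(values)

def ar_penalty(out_dp, alpha):
    """AR: alpha * mean of squared post-dropout activations (timesteps x features)."""
    flat = [v for step in out_dp for v in step]
    return alpha * mean_sq(flat)

def tar_penalty(out_raw, beta):
    """TAR: beta * mean of squared differences between consecutive timesteps."""
    diffs = [b - a
             for prev, nxt in zip(out_raw, out_raw[1:])
             for a, b in zip(prev, nxt)]
    return beta * mean_sq(diffs) if diffs else 0.0

# Last-layer outputs for one sample: 3 timesteps, 2 features
out_raw = [[1.0, 2.0], [3.0, 2.0], [3.0, 0.0]]   # before dropout
out_dp = [[2.0, 0.0], [0.0, 4.0], [2.0, 0.0]]    # after dropout

ar = ar_penalty(out_dp, alpha=2.0)     # 2.0 * (24 / 6) = 8.0
tar = tar_penalty(out_raw, beta=1.0)   # 1.0 * (8 / 4)  = 2.0

cross_entropy = 3.5                    # hypothetical loss value
print(cross_entropy + ar + tar)        # 13.5
```

AR discourages large activations, while TAR discourages large jumps in the raw output from one timestep to the next.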

### Step 3 - Build Language Model Architecture

**Linear Decoder Module**

In [0]:
#----------------------------------------------------
# Linear Decoder module to process the output of the LSTM layers
# It applies the Output Dropout on the last LSTM layer and then uses a simple
# Linear layer. It optionally uses the same weights as the weights of the embedding
# layer
#----------------------------------------------------
class LinearDecoder(nn.Module):
  def __init__(self, n_inp, n_out, out_p, tie_encoder):
    super().__init__()

    # Output Dropout
    self.out_dp = RNNDropout(out_p)

    # Decoder is a simple linear layer
    self.decoder = nn.Linear(n_inp, n_out)
    self.decoder.bias.data.zero_()

    # If we are using weight-tying, share the embedding layer's weight tensor with
    # the decoder. If not, use Kaiming Init to initialise the decoder's own weights
    if (tie_encoder):
      self.decoder.weight = tie_encoder.weight
    else:
      nn.init.kaiming_uniform_(self.decoder.weight)

  # ----------------------------
  # Process the forward pass
  # ----------------------------
  def forward(self, input):
    # Get the list of raw outputs and post-dropout outputs from the LSTM layers
    out_vals, out_dp_vals = input

    # Apply Output Dropout. Make it contiguous so the subsequent view() function can work
    out_dp = self.out_dp(out_dp_vals[-1]).contiguous()

    # Flatten the data for the linear layer from 3D to 2D as (samples * timesteps, features)
    decoded = self.decoder(out_dp.view(out_dp.size(0)*out_dp.size(1), out_dp.size(2)))
    
    #print ('KD Decoder output & decoded', out_dp.mean(), decoded.mean())

    # Return Output - ([final decoded scores (logits) for each word], [list of raw outputs for each layer], [list of outputs for each layer])
    return (decoded, out_vals, out_dp_vals)
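
Weight tying works because the two layers need a matrix of the same shape: `nn.Embedding(vocab_sz, emb_sz)` and `nn.Linear(emb_sz, vocab_sz)` both store a weight of shape (vocab size, embedding size). The toy sketch below shows the idea in plain Python with a single shared matrix and no bias; the names `shared_weight`, `embed` and `decode` are hypothetical.

```python
# One shared matrix of shape (vocab size, embedding size) = (3, 2)
shared_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def embed(token_id):
    """Embedding lookup: pick one row of the shared matrix."""
    return shared_weight[token_id]

def decode(hidden):
    """Linear decoder without bias: one score per vocab word, using the same matrix."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in shared_weight]

scores = decode(embed(2))
print(scores)  # [1.0, 1.0, 2.0]

# Both functions read the same object, so an update is seen by both
shared_weight[0][0] = 5.0
print(embed(0))            # [5.0, 0.0]
print(decode([1.0, 0.0]))  # [5.0, 0.0, 1.0]
```

In the sketch the word whose embedding was fed in gets the highest score, which gives some intuition for why sharing the matrix between input and output is a reasonable choice.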


**Language Model Architecture**

In [0]:
#----------------------------------------------------
# Create the end-to-end language model architecture with the AWD-LSTM followed by the Decoder
#----------------------------------------------------
class ArchLanguageModel():

  def __init__(self, vocab, emb_sz, n_h, n_layers, pad_idx, out_p=0.4, hidden_p=0.2, inp_p=0.6, 
                       emb_p=0.1, weight_p=0.5, tie_weights=True, bias=True):
    self.vocab = vocab
    vocab_sz = len(vocab)

    self.awd_lstm_enc = AWD_LSTM(vocab_sz, emb_sz, n_h, n_layers, pad_idx, emb_p, inp_p, weight_p, hidden_p)

    # Get the embedding layer from the AWD-LSTM to enable weight-tying with the
    # Decoder
    enc = self.awd_lstm_enc.emb if tie_weights else None

    # The input of the Decoder has embedding size to match the output of the 
    # last AWD-LSTM layer. The output of the Decoder has vocab size to produce
    # a probability for each word in the vocab
    self.decoder = LinearDecoder(emb_sz, vocab_sz, out_p, tie_encoder=enc)

    self.model = SequentialRNN(self.awd_lstm_enc, self.decoder)

  # ----------------------------
  # Load a state dict for the full model (encoder + decoder)
  # ----------------------------
  def load_weights(self, weights):
    self.model.load_state_dict(weights)

  # ----------------------------
  # Save the encoder weights, the vocab and the full model weights
  # ----------------------------
  def save_weights(self, awd_lstm_enc_path, vocab_path, weights_path):

    # For the classification task we only need the encoder (first part of the model)...
    torch.save(self.awd_lstm_enc.state_dict(), awd_lstm_enc_path)

    # ...as well as the vocabulary used.
    with open(vocab_path, 'wb') as f:
      pickle.dump(self.vocab, f)

    # Also save the full model, so that LM training can be resumed later
    torch.save(self.model.state_dict(), weights_path)

  # ----------------------------
  # Freeze or unfreeze the parameters of the LSTM layers
  # ----------------------------
  def freeze(self, type):
    if (type == "FREEZE_LSTM"):
      for rnn in self.awd_lstm_enc.lstm_layers:
        for p in rnn.parameters(): p.requires_grad_(False)
    elif (type == "UNFREEZE_LSTM"):
      for rnn in self.awd_lstm_enc.lstm_layers:
        for p in rnn.parameters(): p.requires_grad_(True)

  # ----------------------------
  # Split the parameters into three groups - one for each of the two LSTM/hidden 
  # dropout pairs, then one last group that contains the embeddings/decoder.
  # NB: n_splits = 3 assumes that the AWD-LSTM has two LSTM layers.
  # ----------------------------
  n_splits = 3
  def splitter(self, type):
    awd_lstm_enc = self.awd_lstm_enc
    
    groups = []
    # One group for each LSTM/Hidden Dropout pair
    for i in range(len(awd_lstm_enc.lstm_layers)): 
      groups.append(nn.Sequential(awd_lstm_enc.lstm_layers[i], awd_lstm_enc.hidden_dps[i]))
  
    # Embedding and Input Dropouts and Decoder
    groups += [nn.Sequential(awd_lstm_enc.emb, awd_lstm_enc.emb_dp, awd_lstm_enc.inp_dp, self.decoder)]

    # List of list of parameters by group
    return [list(o.parameters()) for o in groups]
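
The freeze and splitter logic can be tried out on a toy model. The sketch below is only an illustration under assumed names: an `nn.LSTM` stands in for the AWD-LSTM layers, an `nn.Linear` stands in for the decoder, and `set_trainable` and `count_trainable` are hypothetical helpers that are not part of the notebook.

```python
import torch.nn as nn

encoder = nn.LSTM(4, 4, batch_first=True)   # stand-in for the AWD-LSTM layers
head = nn.Linear(4, 2)                      # stand-in for the decoder

def set_trainable(module, flag):
    # Freeze (flag=False) or unfreeze (flag=True) every parameter of a module
    for p in module.parameters():
        p.requires_grad_(flag)

def count_trainable(*modules):
    # Number of parameter tensors that will receive gradients
    return sum(1 for m in modules for p in m.parameters() if p.requires_grad)

set_trainable(encoder, False)
n_when_frozen = count_trainable(encoder, head)     # only the head's weight and bias

set_trainable(encoder, True)
n_when_unfrozen = count_trainable(encoder, head)   # 4 LSTM tensors + 2 head tensors

# Parameter groups in the same "list of lists" format that splitter() returns
groups = [list(encoder.parameters()), list(head.parameters())]
```

Freezing only switches off gradient computation for the chosen parameters. They still belong to their parameter group, so the groups returned by a splitter do not change when layers are frozen or unfrozen.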

**Callbacks for Language Model**

In [0]:
#----------------------------------------------------
# Clip the gradients to allow us to use a higher learning rate by putting a 
# maximum value on the norm of the gradients
#----------------------------------------------------
class GradientClipping(Callback):
  def __init__(self, clip=None): 
    self.clip = clip

  def after_tr_backward(self, ctx):
    if self.clip:  
      nn.utils.clip_grad_norm_(ctx.model.parameters(), self.clip)
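
As a quick illustration of what the callback does, the sketch below calls `nn.utils.clip_grad_norm_` on a toy linear model whose loss is deliberately scaled up to produce large gradients. The model, the loss and the `total_grad_norm` helper are assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)

# A deliberately large loss so that the gradients are well above the clip value
loss = (100 * model(torch.randn(8, 4))).pow(2).mean()
loss.backward()

def total_grad_norm(params):
    # L2 norm over all gradients, treated as one long vector
    return torch.norm(torch.stack([p.grad.norm(2) for p in params]), 2)

max_norm = 0.1
norm_before = total_grad_norm(list(model.parameters()))

# Rescales all gradients in place so that their total norm is at most max_norm.
# The return value is the total norm measured before clipping.
returned = nn.utils.clip_grad_norm_(model.parameters(), max_norm)

norm_after = total_grad_norm(list(model.parameters()))
```

All gradients are scaled by the same factor, so clipping changes the size of the update step but not its direction.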

**Flattened versions of the cross entropy loss and the accuracy metric**

In [0]:
def cross_entropy_flat(input, target):
  bs,sl = target.size()
  return F.cross_entropy(input.view(bs * sl, -1), target.view(bs * sl))

def accuracy_flat(input, target):
  bs,sl = target.size()
  return accuracy(input.view(bs * sl, -1), target.view(bs * sl))
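
The following sketch checks the flattening on random data. It repeats the `cross_entropy_flat` definition so that it runs on its own, and compares the result against a reference that averages the per-token losses. The tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, sl, vocab_sz = 2, 5, 7   # toy batch size, sequence length and vocab size

logits_3d = torch.randn(bs, sl, vocab_sz)
target = torch.randint(0, vocab_sz, (bs, sl))

def cross_entropy_flat(input, target):
    bs, sl = target.size()
    return F.cross_entropy(input.view(bs * sl, -1), target.view(bs * sl))

# Works on a 3D input as well as on an input that is already 2D
loss_from_3d = cross_entropy_flat(logits_3d, target)
loss_from_2d = cross_entropy_flat(logits_3d.view(bs * sl, vocab_sz), target)

# Reference: per-token losses computed one timestep at a time, then averaged
per_token = torch.stack(
    [F.cross_entropy(logits_3d[:, t], target[:, t], reduction='none') for t in range(sl)],
    dim=1)
loss_ref = per_token.mean()
```

Since the decoder already returns a 2D tensor of shape (samples * timesteps, vocab size), the `view` on the input is a no-op in that case and only the target needs to be flattened.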

### Step 3 - Test Run language model end-to-end 

In [0]:
torch.manual_seed(0)
lm_app.create_arch()
loop = lm_app.run_train(split_lr=[5e-3], split=False, one_cycle=False, freeze="UNFREEZE_LSTM", num_epochs=1)

### Set up Tensorboard

**To view Tensorboard output locally, use ngrok to tunnel traffic to localhost. First, download and unzip ngrok on the Colab server**

In [0]:
! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
! unzip ngrok-stable-linux-amd64.zip

--2020-03-24 14:30:27--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 52.206.78.89, 34.192.123.246, 54.174.156.76, ...
Connecting to bin.equinox.io (bin.equinox.io)|52.206.78.89|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2020-03-24 14:30:28 (19.0 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   


**Get TensorBoard running in the background**

In [0]:
# Set the LOGDIR correctly to use Tensorboard
LOG_DIR = 'tbtry'

In [0]:
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)

**Launch ngrok background process**

In [0]:
get_ipython().system_raw('./ngrok http 6006 &')

**We get the public URL where we can access the colab TensorBoard web page. This will output a URL you can click on**

In [0]:
! curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

http://46803a22.ngrok.io


In [0]:
!rm -R tbtry
!ls -lR tbtry

ls: cannot access 'tbtry': No such file or directory


### Debug Experimentation

In [0]:
dcdrrun_df, dcdrbatch_df, dcdrstep_df, dcdrdf = loop.dtr.pd_results()

### Step 4 - Pre-Train Language Model on Wikitext data

TO DO - run the code from this [notebook](https://github.com/fastai/course-v3/blob/master/nbs/dl2/12b_lm_pretrain.ipynb), which combines the previous two notebooks. At a high level there are two steps:
1. Import and pre-process the data
2. Create the model that we built in the previous notebook

An important thing to notice is that the IMDB dataset uses a different vocabulary from Wikitext-103. To handle this, we build the embeddings for the IMDB vocab from the Wikitext-103 embeddings: words that exist in both vocabs reuse their pretrained embedding, and words that are missing from the Wikitext-103 vocab get the mean of the pretrained embeddings.

### Step 5 - Transform the pre-trained Wikitext Language Model vocab to the IMDB vocab

**Get the IMDB corpus vocab**

In [0]:
# The code in the Fastai lesson gets the vocab like this, but there is no need to
# do that. We can use the vocab which we had loaded earlier since it is identical.
tvocab = ll.train.proc_x[1].vocab

# The tvocab and the vocab that we had loaded earlier are actually identical.
vocab[50:60], tvocab[50:60]

(['!', 'from', 'so', 'like', 'there', 'or', 'just', 'her', 'do', 'about'],
 ['!', 'from', 'so', 'like', 'there', 'or', 'just', 'her', 'do', 'about'],
 tensor([0.0500, 0.0750, 0.1250, 0.0100, 0.1000]))

**Fetch the Language Model pretrained on Wikitext**

We can use the Language Model from Step 4, or download a smaller version that was pretrained by the Fastai team.


In [0]:
# Download a pretrained small model which was trained with wikitext 103.
# In the final version we will train our own full model, but during debugging we can use this smaller one
path_wiki_model = Path.cwd()/'wikimdl'

! wget http://files.fast.ai/models/wt103_tiny.tgz -P {path_wiki_model}
! tar xf {path_wiki_model}/wt103_tiny.tgz -C {path_wiki_model}

--2020-03-24 09:00:03--  http://files.fast.ai/models/wt103_tiny.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75482451 (72M) [application/x-gtar-compressed]
Saving to: ‘/content/wikimdl/wt103_tiny.tgz’


2020-03-24 09:00:09 (12.8 MB/s) - ‘/content/wikimdl/wt103_tiny.tgz’ saved [75482451/75482451]



In [0]:
# See the directory structure
pt = path_wiki_model/'pretrained'
path_wiki_model.ls(), pt.ls()

([PosixPath('/content/wikimdl/pretrained'),
  PosixPath('/content/wikimdl/wt103_tiny.tgz')],
 [PosixPath('/content/wikimdl/pretrained/pretrained.pth'),
  PosixPath('/content/wikimdl/pretrained/vocab.pkl')])

**Load the weights of the pretrained model**

In [0]:
# Get the model weights of the pretrained model
old_wgts  = torch.load(path_wiki_model/'pretrained'/'pretrained.pth')
print(old_wgts.keys())

odict_keys(['0.emb.weight', '0.emb_dp.emb.weight', '0.rnns.0.weight_hh_l0_raw', '0.rnns.0.module.weight_ih_l0', '0.rnns.0.module.weight_hh_l0', '0.rnns.0.module.bias_ih_l0', '0.rnns.0.module.bias_hh_l0', '0.rnns.1.weight_hh_l0_raw', '0.rnns.1.module.weight_ih_l0', '0.rnns.1.module.weight_hh_l0', '0.rnns.1.module.bias_ih_l0', '0.rnns.1.module.bias_hh_l0', '1.decoder.weight', '1.decoder.bias'])


In [0]:
# See the different weights in the pretrained model. Below we see that the emb weights, emb dp weights and decoder weights are the same, as expected.
old_wgts['0.emb.weight'][500:520, 6], old_wgts['0.emb_dp.emb.weight'][500:520, 6], old_wgts['1.decoder.weight'][500:520, 6]

(tensor([ 0.5687,  0.0495,  0.2678, -0.2443, -0.0888,  0.0075, -0.6737,  0.2850,
         -0.6399,  0.3175, -0.9170,  0.4346,  0.4071, -0.1929,  0.5366, -0.7570,
          0.5967,  0.0317, -0.3167, -0.3438], device='cuda:0'),
 tensor([ 0.5687,  0.0495,  0.2678, -0.2443, -0.0888,  0.0075, -0.6737,  0.2850,
         -0.6399,  0.3175, -0.9170,  0.4346,  0.4071, -0.1929,  0.5366, -0.7570,
          0.5967,  0.0317, -0.3167, -0.3438], device='cuda:0'),
 tensor([ 0.5687,  0.0495,  0.2678, -0.2443, -0.0888,  0.0075, -0.6737,  0.2850,
         -0.6399,  0.3175, -0.9170,  0.4346,  0.4071, -0.1929,  0.5366, -0.7570,
          0.5967,  0.0317, -0.3167, -0.3438], device='cuda:0'))

**Update the Embeddings of the pretrained model so that they conform to the new IMDB vocab rather than the old Wikitext vocab**

In [0]:
# Get the vocab of the pretrained model

# In our current vocabulary, it is very unlikely that the ids correspond to those in the 
# vocabulary used to train the pretrained model. The tokens are sorted by frequency (apart 
# from the special tokens, which all come first), so that order is specific to the corpus used. 
# For instance, the word 'house' has different ids in our current vocab and the pretrained one.

old_vocab = pickle.load(open(path_wiki_model/'pretrained'/'vocab.pkl', 'rb'))
print(len(old_vocab))

idx_house_new, idx_house_old = lm_imdb_vocab.index('house'),old_vocab.index('house')
idx_house_new, idx_house_old

60002


(343, 230)

In [0]:
# We need to match our pretrained weights to the new vocabulary. This is done 
# on the embeddings and the decoder (since the weights between embeddings and decoders 
# are tied) by putting the rows of the embedding matrix (or decoder bias) in the right order.
# Some words may not be in the pretrained vocab. For those, we use the mean of the 
# pretrained embedding weights/decoder bias.
# NB: match_embeds modifies the 'old_wgts' dict in place and returns it.

def match_embeds(old_wgts, old_vocab, new_vocab):
  old_emb_wgts = old_wgts['0.emb.weight']
  old_decoder_bias = old_wgts['1.decoder.bias']
  old_emb_wgts_mean, old_decoder_bias_mean = old_emb_wgts.mean(dim=0), old_decoder_bias.mean()

  new_emb_wgts = old_emb_wgts.new_zeros(len(new_vocab), old_emb_wgts.size(1))
  new_decoder_bias = old_decoder_bias.new(len(new_vocab))
  w2i = {word:i for i, word in enumerate(old_vocab)}

  for new_i, word in enumerate(new_vocab):
    if word in w2i:
      old_i = w2i[word]
      new_emb_wgts[new_i] = old_emb_wgts[old_i]
      new_decoder_bias[new_i] = old_decoder_bias[old_i]
    else:
      new_emb_wgts[new_i] = old_emb_wgts_mean
      new_decoder_bias[new_i] = old_decoder_bias_mean

  old_wgts['0.emb.weight'] = new_emb_wgts
  old_wgts['0.emb_dp.emb.weight'] = new_emb_wgts
  old_wgts['1.decoder.weight'] = new_emb_wgts
  old_wgts['1.decoder.bias'] = new_decoder_bias

  return (old_wgts)

# check that the word "house" was properly converted

house_wgt  = old_wgts['0.emb.weight'][idx_house_old]
house_bias = old_wgts['1.decoder.bias'][idx_house_old]

matched_wgts = match_embeds(old_wgts, old_vocab, lm_imdb_vocab)

test_near(matched_wgts['0.emb.weight'][idx_house_new],house_wgt)
test_near(matched_wgts['1.decoder.bias'][idx_house_new],house_bias)
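
The row re-ordering is easier to follow on a toy example. The sketch below applies the same idea as `match_embeds` to a four-word vocab with a made-up embedding matrix. The vocab contents and the values are assumptions chosen so that the result can be checked by eye.

```python
import torch

old_vocab = ['xxunk', 'the', 'house', 'cat']     # order of the pretrained corpus
new_vocab = ['xxunk', 'house', 'the', 'movie']   # order of the new corpus

# Made-up pretrained embeddings: row i belongs to old_vocab[i]
old_emb = torch.arange(12, dtype=torch.float).view(4, 3)
mean_row = old_emb.mean(dim=0)

w2i = {word: i for i, word in enumerate(old_vocab)}
new_emb = old_emb.new_zeros(len(new_vocab), old_emb.size(1))

for new_i, word in enumerate(new_vocab):
    if word in w2i:
        # Known word: copy its pretrained row into the new position
        new_emb[new_i] = old_emb[w2i[word]]
    else:
        # Unknown word ('movie' here): fall back to the mean embedding
        new_emb[new_i] = mean_row
```

Here 'house' moves from row 2 to row 1, 'the' moves from row 1 to row 2, and 'movie' receives the mean row because it has no pretrained embedding.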

### Step 6 - Load our model with the pre-trained weights, in preparation for retraining on IMDB

In [0]:
# These are the dropout probabilities 
dps = tensor([0.1, 0.15, 0.25, 0.02, 0.2]) * 0.5
_lm_arch = lm_app.create_arch(*dps)
# !!!!!!! Think over whether this is the right way to do it
_lm_model = _lm_arch.model
_lm_model, _lm_model.state_dict().keys(), matched_wgts.keys()

(SequentialRNN(
   (0): AWD_LSTM(
     (emb): Embedding(17965, 300, padding_idx=1)
     (emb_dp): EmbeddingDropout(
       (emb): Embedding(17965, 300, padding_idx=1)
     )
     (inp_dp): RNNDropout()
     (lstm_layers): ModuleList(
       (0): WeightDropout(
         (module): LSTM(300, 300, batch_first=True)
       )
       (1): WeightDropout(
         (module): LSTM(300, 300, batch_first=True)
       )
     )
     (hidden_dps): ModuleList(
       (0): RNNDropout()
       (1): Identity()
     )
   )
   (1): LinearDecoder(
     (out_dp): RNNDropout()
     (decoder): Linear(in_features=300, out_features=17965, bias=True)
   )
 ),
 odict_keys(['0.emb.weight', '0.emb_dp.emb.weight', '0.lstm_layers.0.weight_hh_l0_raw', '0.lstm_layers.0.module.weight_ih_l0', '0.lstm_layers.0.module.weight_hh_l0', '0.lstm_layers.0.module.bias_ih_l0', '0.lstm_layers.0.module.bias_hh_l0', '0.lstm_layers.1.weight_hh_l0_raw', '0.lstm_layers.1.module.weight_ih_l0', '0.lstm_layers.1.module.weight_hh_l0', '0.lstm

In [0]:
#----------------------------------------------------
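# The pretrained keys use different module names (eg. 'rnns' rather than 'lstm_layers'),
# so we rename the keys in the weights to correspond to the KD model module names.
# NB: This relies on both state dicts listing their keys in the same order.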
#----------------------------------------------------

def rename_wgt_keys(model, wgts):
  renamed_wgts=OrderedDict()
  for old_key, new_key in zip(wgts.keys(), model.state_dict().keys()):
    renamed_wgts[new_key] = wgts[old_key]
    print (old_key, new_key, torch.all(torch.eq(renamed_wgts[new_key], wgts[old_key])))
  return (renamed_wgts)

new_wgts = rename_wgt_keys(_lm_model, matched_wgts)

0.emb.weight 0.emb.weight tensor(True, device='cuda:0')
0.emb_dp.emb.weight 0.emb_dp.emb.weight tensor(True, device='cuda:0')
0.rnns.0.weight_hh_l0_raw 0.lstm_layers.0.weight_hh_l0_raw tensor(True, device='cuda:0')
0.rnns.0.module.weight_ih_l0 0.lstm_layers.0.module.weight_ih_l0 tensor(True, device='cuda:0')
0.rnns.0.module.weight_hh_l0 0.lstm_layers.0.module.weight_hh_l0 tensor(True, device='cuda:0')
0.rnns.0.module.bias_ih_l0 0.lstm_layers.0.module.bias_ih_l0 tensor(True, device='cuda:0')
0.rnns.0.module.bias_hh_l0 0.lstm_layers.0.module.bias_hh_l0 tensor(True, device='cuda:0')
0.rnns.1.weight_hh_l0_raw 0.lstm_layers.1.weight_hh_l0_raw tensor(True, device='cuda:0')
0.rnns.1.module.weight_ih_l0 0.lstm_layers.1.module.weight_ih_l0 tensor(True, device='cuda:0')
0.rnns.1.module.weight_hh_l0 0.lstm_layers.1.module.weight_hh_l0 tensor(True, device='cuda:0')
0.rnns.1.module.bias_ih_l0 0.lstm_layers.1.module.bias_ih_l0 tensor(True, device='cuda:0')
0.rnns.1.module.bias_hh_l0 0.lstm_layers.1.

In [0]:
#----------------------------------------------------
# Load the pre-trained weights into our LM architecture using the renamed keys
#----------------------------------------------------

_lm_arch.load_weights(new_wgts)
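
The rename-then-load pattern can be reproduced with two tiny modules that differ only in an attribute name. In this sketch the `Old` and `New` classes are hypothetical stand-ins for the pretrained model and our model, and the shape assertion is an extra safety check that the notebook code does not perform.

```python
from collections import OrderedDict
import torch
import torch.nn as nn

class Old(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnns = nn.ModuleList([nn.Linear(3, 3)])

class New(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm_layers = nn.ModuleList([nn.Linear(3, 3)])

torch.manual_seed(0)
old, new = Old(), New()
old_sd, new_sd = old.state_dict(), new.state_dict()

renamed = OrderedDict()
for old_key, new_key in zip(old_sd.keys(), new_sd.keys()):
    # Positional matching is only safe if the paired tensors have the same shape
    assert old_sd[old_key].shape == new_sd[new_key].shape
    renamed[new_key] = old_sd[old_key]

# With the default strict=True, a missing or unexpected key raises an error
result = new.load_state_dict(renamed)
```

Matching keys by position is fragile: if the two models registered their modules in a different order, tensors of equal shape could be swapped without any error, so printing the key pairs as the notebook does is a useful check.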

### Step 6 (continued) - Tune the weights of the pre-trained Wikitext Language Model by retraining on IMDB, using discriminative Learning Rates

**Step A - Train with LSTM Layers frozen and split param groups (but same hyperparameters over time and across param groups)**

In [0]:
lm_app.run_train(split_lr=[5e-3], split=True, one_cycle=False, freeze="FREEZE_LSTM", num_epochs=1)

BEFORE Hyper parameters
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,4.697655,0.231395,4.287965,0.256731,00:11


AFTER Hyper parameters
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


**Step B - Train with LSTM Layers frozen and split param groups (hyperparameters vary over time due to 1 cycle but are the same across param groups)**

In [0]:
lm_app.run_train(split_lr=[2e-2], split=True, one_cycle=True, freeze="FREEZE_LSTM", num_epochs=1)

BEFORE Hyper parameters
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.02}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.02}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.02}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,4.42474,0.244092,4.216343,0.261604,00:11


AFTER Hyper parameters
4 {'momentum': 0.7999745709300017, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.285763141529842e-06}
4 {'momentum': 0.7999745709300017, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.285763141529842e-06}
2 {'momentum': 0.7999745709300017, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.285763141529842e-06}


**Step C - Train with all Layers unfrozen and split param groups (hyperparameters vary over time due to 1 cycle and also vary across param groups)**

In [0]:
lr_tmp = 2e-3
split_lr = [lr_tmp/2., lr_tmp/2., lr_tmp]
lm_app.run_train(split_lr=split_lr, split=True, one_cycle=True, freeze="UNFREEZE_LSTM", num_epochs=1)

BEFORE Hyper parameters
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.001}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.001}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.002}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,4.224579,0.256747,4.177207,0.266004,00:12


AFTER Hyper parameters
4 {'momentum': 0.7999745709300017, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 2.6428815707657885e-07}
4 {'momentum': 0.7999745709300017, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 2.6428815707657885e-07}
2 {'momentum': 0.7999745709300017, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.285763141531577e-07}
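
The notebook uses its own optimiser and one-cycle callbacks. As a rough analogue only, the sketch below shows the same two ideas with stock PyTorch: per-group learning rates in `torch.optim.Adam` and a `OneCycleLR` schedule with one maximum learning rate per group. The three `nn.Linear` layers are stand-ins for the two LSTM groups and the embedding/decoder group.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm1, lstm2, head = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 2)

lr = 2e-3
group_max_lrs = [lr / 2., lr / 2., lr]   # lower rates for the earlier layers

# One parameter group per layer, each with its own learning rate
opt = torch.optim.Adam([
    {'params': lstm1.parameters(), 'lr': group_max_lrs[0]},
    {'params': lstm2.parameters(), 'lr': group_max_lrs[1]},
    {'params': head.parameters(),  'lr': group_max_lrs[2]},
])
initial_lrs = [g['lr'] for g in opt.param_groups]

# One-cycle schedule: every group follows the same shape, scaled by its own max_lr
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=group_max_lrs, total_steps=10)

history = []
for _ in range(9):
    opt.zero_grad()
    loss = head(lstm2(lstm1(torch.randn(3, 4)))).pow(2).mean()
    loss.backward()
    opt.step()
    sched.step()
    history.append([g['lr'] for g in opt.param_groups])
```

At every step the last group's learning rate stays twice that of the first two groups, which mirrors the 1:1:2 ratio visible in the BEFORE and AFTER hyperparameters printed above.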


**Save our model**

In [0]:
! mkdir -p imdbtuned
path_imdb_tuned = Path.cwd()/'imdbtuned'

_lm_arch.save_weights(path_imdb_tuned/'finetuned_enc.pth', path_imdb_tuned/'vocab_lm.pkl', path_imdb_tuned/'finetuned.pth')

### Step 7 - Load IMDB Data and pre-process for Classification

In [0]:
loaded=True
if (not loaded):
  vocab = pickle.load(open(path/'vocab_lm.pkl', 'rb'))
  proc_tok,proc_num,proc_cat = TokenizeProcessor(),NumericalizeProcessor(vocab=vocab),CategoryProcessor()
  il = TextList.from_files(path, include=['train', 'test'])
  sd = SplitData.split_by_func(il, partial(grandparent_splitter, valid_name='test'))
  ll = label_by_func(sd, parent_labeler, proc_x = [proc_tok, proc_num], proc_y=proc_cat)
  pickle.dump(ll, open(path/'ll_clas.pkl', 'wb'))
ll = pickle.load(open(path/'ll_clas.pkl', 'rb'))
vocab = pickle.load(open(path/'vocab_lm.pkl', 'rb'))
bs,bptt = 64,70
data = clas_databunchify(ll, bs)

### Define the Text Classifier Application class

In [0]:
#----------------------------------------------------
# Text Classifier Application
#----------------------------------------------------
class AppTextClassifier():

  def __init__(self):
    self._arch = None
    self.tcdb = None
    self.vocab = None
    pass

  def _load_vocab(self, vocab_path):
    vocab = pickle.load(open(vocab_path, 'rb'))
    return vocab

  def load_data(self, data_path, vocab_path):
    self.vocab = self._load_vocab(vocab_path)
    self.tcdb = TextClassificationFolderDataBundle(data_path, bs=64, vocab_i2w=self.vocab)
    self.tcdb.do()

  def create_arch(self, *dps):
    tok_pad = self.vocab.index(PAD)
    vocab_sz = len(self.vocab)
    emb_sz, n_h, n_layers, pad_idx, n_out, bptt = 300, 300, 2, tok_pad, 2, 70
    self._arch = ArchTextClassifier(vocab_sz, emb_sz, n_h, n_layers, pad_idx, n_out, bptt, *dps)
    return self._arch

  # ----------------------------
  # ----------------------------
  def run_train(self, split_lr, split=False, one_cycle=False, freeze="UNFREEZE_LSTM", num_epochs=1):
    train_dl = self.tcdb.train_dl
    valid_dl = self.tcdb.valid_dl
    # NB: We don't use cross_entropy_flat for Classification
    loss_func = F.cross_entropy

    # split_lr is a list:
    #   1. a single-element list [0.01] - same LR for all groups. 
    #        If 'Split' is False, there is only one group. 
    #        If 'Split' is True, there are multiple groups.
    #   2. a multi-element list [0.01, 0.03, 0.05] - discriminative LR for different groups. 
    #        'Split' cannot be False
    assert(isinstance(split_lr, list))
    assert(len(split_lr) > 0)
    assert(not ((split == False) and (len(split_lr) > 1)))

    # NB: We don't use the AwdLstmCB while doing Classification and we use accuracy not accuracy_flat
    callbs=[CudaCB(device = torch.device('cuda',0)), GradientClipping(clip=0.1), ProgressCallback(), MetricsCB({"acc": accuracy})]
    if (one_cycle):
      one_cycle_callbs = create_OneCycleCB(split_lr, phases=[0.5, 0.5], mom_start=0.8, mom_mid=0.7, mom_end=0.8)
      callbs = callbs + one_cycle_callbs

    model = self._arch.model
    self._arch.freeze(freeze)

    print ('BEFORE Hyper parameters')
    opt_func=adam_opt_func
    lr = split_lr[0]
    if (split and (len(split_lr) == 1)):
      hypers_group = [{}] * self._arch.n_splits
      opt_groups=(self._arch.splitter, hypers_group, {'lr': lr})
    elif (split and (len(split_lr) > 1)):
      hypers_group = [{'lr': lr_g} for lr_g in split_lr]
      opt_groups=(self._arch.splitter, hypers_group, {})
    else:
      opt_groups = None
    opt = get_optimiser(model, lr, opt_func, opt_groups)

    loop = Trainer(train_dl, valid_dl, model, opt, loss_func, callbs)
    loop.fit(num_epochs=num_epochs)

    # TODO !!!!!!! Make _print_opt() a public function, maybe using repr
    print ('AFTER Hyper parameters')
    loop.opt._print_opt()

  # ----------------------------
  # Predict on one validation batch, and check that predicting on the padded batch
  # gives the same results as predicting on each unpadded sample individually
  # ----------------------------
  def run_predict(self):
    x,y = next(iter(self.tcdb.valid_dl))
    model = self._arch.model.eval()
    pred_batch = model(x.cuda())

    # Predicting on the padded batch or on the individual unpadded samples should give the same results.
    pad_idx = self.vocab.index(PAD)
    pred_ind = []
    for inp in x:
      length = x.size(1) - (inp == pad_idx).long().sum()
      inp = inp[:length]
      pred_ind.append(model(inp[None].cuda()))
    assert near(pred_batch, torch.cat(pred_ind))

    return (pred_batch)

In [0]:
tc_app = AppTextClassifier()
tc_app.load_data(path_imdb, path_imdb_tuned/'vocab_lm.pkl')

--------- IMDB Classification DataBundle init /root/.fastai/data/imdb
FolderItemContainer loaded 50001 items of type TextFileItemList
Split using split_random into 1500, 1000 and 0 items of type TextFileItemList
Extracted 1500 items of type SentenceItemList using extract_doc
Extracted 1500 items of type ClassNameItemList using extract_custom
Converted 1500 items to type SentenceWordItemList using SentenceToWord
Converted 1500 items to type SentenceWordIdItemList using WordToWordId
Converted 1500 items to type ClassIdItemList using NameToId
Extracted 1000 items of type SentenceItemList using extract_doc
Extracted 1000 items of type ClassNameItemList using extract_custom
Converted 1000 items to type SentenceWordItemList using SentenceToWord
Converted 1000 items to type SentenceWordIdItemList using WordToWordId
Converted 1000 items to type ClassIdItemList using NameToId
Final SentenceWordIdItemList (1500 items)
[[2, 7, 8, 198, 81, 1805, 120, 19, 11, 2705, 844, 12, 11, 485, 717, 8, 1529, 1

### Step 8 - Build Classification architecture

**Pooling Classifier Module**

In [0]:
#----------------------------------------------------
# Create a Linear Classifier with Pooling, which sits after the AWD-LSTM as a classification head of the model
#
# Instead of using only the last hidden value from the (last LSTM layer of the) AWD-LSTM for classification, Concat Pooling uses three 
# things (from the last LSTM layer) viz. the last hidden value, the average of all hidden values and the maximum of all the hidden values 
#----------------------------------------------------
class PoolingLinearClassifier(nn.Module):

  # ----------------------------
  # The architecture consists of one or more sequential 'blocks' of layers
  # Each 'block' consists of BatchNorm, Dropout, Linear and ReLU layers,
  # except the last 'block' which has no ReLU
  #
  # 'layers_sz' is a list of input sizes of each block (and the output size of the last block). 
  #     The output size of a block is the same as the input size of the next block.
  #     eg. [900, 50, 2] means two layer blocks ie. layer_1 is (900, 50) and layer_2 is (50, 2)
  #
  # 'drops_p' is a list of the dropout probabilities of each block eg. [0.1, 0.1]
  # ----------------------------
  def __init__(self, layers_sz, drops_p):
    super().__init__()

    # Number of blocks is one less than the length of the layers_sz list
    n_layer_blocks = len(layers_sz) - 1
    assert(n_layer_blocks > 0)
    assert(n_layer_blocks == len(drops_p))
    
    # Create the layers of each block
    layers = []
    for i in range(n_layer_blocks):
      n_in, n_out = layers_sz[i], layers_sz[i + 1]
      is_last = (i == (n_layer_blocks - 1))
      layers += self._create_block(n_in, n_out, drops_p[i], is_last)

    # Build a single Sequential with all layers
    self.layers = nn.Sequential(*layers)

  # ----------------------------
  # Apply Concat Pooling on the output value of the last LSTM layer. Then pass that to the linear
  # classifier layers
  # ----------------------------
  def forward(self, input):
    x = self._concat_pool(input)

    # Now pass this through the classification layers
    x = self.layers(x)
    return x

  # ----------------------------
  # From the output value of the last LSTM layer, find the last hidden value, average 
  # hidden value and max hidden value and concatenate them.
  # ----------------------------
  def _concat_pool(self, input):
    out_vals, out_dp_vals, pad_mask = input

    # Get the output value from the last LSTM layer, with shape [samples, timesteps, embedding size]
    lastlstm_out_dp = out_dp_vals[-1]
    # Calculate the actual sequence lengths without padding
    # The mask has shape [samples, timesteps] and 
    # the sequence lengths has shape [samples]
    seq_lengths = lastlstm_out_dp.size(1) - pad_mask.long().sum(dim=1)

    # Make sure that we ignore the padding in the last state/average/maximum.
    # Expand the mask to a third embedding dimension so it has shape [samples, timesteps, embedding size]
    fill_mask_3d = pad_mask[:, :, None]

    # Use the fill mask to fill the pad values with 0, so they get ignored in the average calculation
    # Then sum all the values in each row, and divide by the number of values in each row (ie. the sequence length)
    # to get the average value for each row (ie. sample)
    # The average has shape [samples, embedding size]
    zero_fill_out_dp = lastlstm_out_dp.masked_fill(fill_mask_3d, 0)
    sum_out_dp = zero_fill_out_dp.sum(dim=1)
    numval_out_dp = seq_lengths.type(sum_out_dp.dtype)
    numval_out_dp = numval_out_dp[:,None]
    avg_out_dp = sum_out_dp.div(numval_out_dp)

    # Use the fill mask to fill the pad values with neg-infinity, so they get ignored in the max calculation
    # Then get the max values in each row
    # The max has shape [samples, embedding size]
    neginf_fill_out_dp = lastlstm_out_dp.masked_fill(fill_mask_3d, -float('inf'))
    max_out_dp, _ = neginf_fill_out_dp.max(dim=1)

    # Get the hidden value from the column for the last elements of each sequence
    # The last hidden value has shape [samples, embedding size]
    n_samples = lastlstm_out_dp.size(0)
    all_rowidxs = torch.arange(0, n_samples, device=lastlstm_out_dp.device)
    lastval_colidxs = seq_lengths - 1
    lastval_out_dp = lastlstm_out_dp[all_rowidxs, lastval_colidxs]

    # Concatenate the last value, max value and average value (in that order) along axis 1
    # The concat pool has shape [samples, embedding size * 3]
    c_pool = torch.cat([lastval_out_dp, max_out_dp, avg_out_dp], 1)

    return (c_pool)

  # ----------------------------
  # Create the layers of a block
  # ----------------------------
  def _create_block (self, n_in, n_out, drop_p, is_last):
    layers = []
    layers.append(nn.BatchNorm1d(n_in))
    layers.append(nn.Dropout(drop_p))
    layers.append(nn.Linear(n_in, n_out))
    if (not is_last):
      layers.append(nn.ReLU(inplace=True))

    return layers
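
Concat Pooling with a padding mask can be verified by hand on a small tensor. The sketch below repeats the masking steps of `_concat_pool` outside the class on a made-up batch of two sequences, where the second sequence has two padded timesteps at the end. The values and the assumption that padding sits at the end of each sequence are chosen for illustration.

```python
import torch

# (samples=2, timesteps=4, features=3) with easy-to-read values
out = torch.arange(24, dtype=torch.float).view(2, 4, 3)

# True marks a padded timestep. The second sample has only 2 real timesteps
pad_mask = torch.tensor([[False, False, False, False],
                         [False, False, True,  True]])

seq_lengths = out.size(1) - pad_mask.long().sum(dim=1)
mask_3d = pad_mask[:, :, None]           # broadcast over the feature dimension

# Average: zero out the padding, sum over time, divide by the real length
avg = out.masked_fill(mask_3d, 0).sum(dim=1) / seq_lengths[:, None].type(out.dtype)

# Max: fill the padding with -inf so that it can never be selected
mx, _ = out.masked_fill(mask_3d, float('-inf')).max(dim=1)

# Last: pick the final real timestep of each sample
last = out[torch.arange(out.size(0)), seq_lengths - 1]

pooled = torch.cat([last, mx, avg], dim=1)   # (samples, 3 * features)
```

For the second sample, the last and max values both come from timestep 1, and the average covers only timesteps 0 and 1, so the padded rows have no influence on the pooled vector.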

**Sentence Encoder Module**

In [0]:
#----------------------------------------------------
# Create a module which feeds our text sequences to the AWD-LSTM in chunks of bptt length
# so that we don't run out of memory by feeding all the data at once.
#----------------------------------------------------
class SentenceEncoder(nn.Module):
  def __init__(self, lstm_module, bptt, pad_idx=1):
    super().__init__()
    self.lstm_module = lstm_module
    self.bptt, self.pad_idx = bptt, pad_idx
    
  # ----------------------------
  # Wrapper over the AWD-LSTM to break the input sequences into chunks of bptt timesteps each
  # and then merge all the result values from each chunk, before returning them
  # ----------------------------
  def forward(self, input):
    bs, full_seq_len = input.size()
    self.lstm_module.bs = bs
    self.lstm_module.reset_state()

    out_vals_lol, out_dp_vals_lol, pad_masks = [],[],[]
    for i in range(0, full_seq_len, self.bptt):
      # Get a chunk at a time, with all samples and bptt timesteps columns
      chunk = input[:, i: min(i+self.bptt, full_seq_len)]

      # Feed a chunk to the LSTM, and append the outputs to result lists
      # The outputs ovs and odps are a list of values from each LSTM layer
      # So out_vals_lol and out_dp_vals_lol are list-of-lists (lol)
      # The LSTM may have truncated empty rows, so extend the outputs so they are all
      # of the required size
      ovs, odps, pm = self.lstm_module(chunk)
      pad_masks.append(self._pad_extend_rows(pm, bs, 1))
      out_vals_lol.append([self._pad_extend_rows(ov, bs, 0) for ov in ovs])
      out_dp_vals_lol.append([self._pad_extend_rows(odp, bs, 0) for odp in odps])

    # Now convert the return value to the same format as returned by the AWD-LSTM ie.
    # outputs are a flat list of LSTM layer values, and a single pad mask
    return (self._val_flatten(out_vals_lol, bs),
           self._val_flatten(out_dp_vals_lol, bs),
           torch.cat(pad_masks, dim=1))

  # ----------------------------
  # Flatten the list-of-lists (lol)
  #
  # The outer list is the chunks eg three chunks 'a', 'b', 'c'. The inner list is the LSTM layers eg. two layers '1', '2'
  #
  # We will flatten the list by concatenating all values for each LSTM layer from 
  #    [[l1_a, l2_a], [l1_b, l2_b], [l1_c, l2_c]] to
  #    [l1_a + l1_b + l1_c, l2_a + l2_b + l2_c]                         
  # ----------------------------
  def _val_flatten(self, val_lol, bs):
    # Get the number of LSTM layers from the inner list
    n_inner = len(val_lol[0])
    # For each layer index, pick out the corresponding value for that layer from each inner list and concatenate them
    val_flat = [torch.cat([inner[i_inner] for inner in val_lol], dim = 1) for i_inner in range(n_inner)]

    return (val_flat)

  # ----------------------------
  # If an output tensor has fewer rows than the sample batch size (because the LSTM truncated 
  # empty rows), add additional rows filled with 'val'
  # ----------------------------
  def _pad_extend_rows(self, ot, bs, val):
    if (ot.shape[0] < bs):
      # Create additional rows with the right shape filled with 'val', keeping the dtype of 'ot'
      add_rows = ot.new_full((bs - ot.shape[0], *ot.shape[1:]), val)

      # Concat the additional rows to the output tensor
      return torch.cat([ot, add_rows])
    else:
      return (ot)
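
The chunking and row-extension steps can be illustrated without an LSTM. In the sketch below the "model" is the identity, so re-joining the chunks must give back the input. The `pad_extend_rows` function is a standalone variant of `_pad_extend_rows` that uses `new_full`, which is a choice made for this example.

```python
import torch

bptt = 3
x = torch.arange(16).view(2, 8)   # (samples=2, timesteps=8)

# Split the timesteps into chunks of at most bptt columns
chunks = [x[:, i: min(i + bptt, x.size(1))] for i in range(0, x.size(1), bptt)]
chunk_widths = [c.size(1) for c in chunks]

# Concatenating along the time dimension restores the full sequence
rejoined = torch.cat(chunks, dim=1)

def pad_extend_rows(t, bs, val):
    # Add rows filled with 'val' until the tensor has 'bs' rows
    if t.shape[0] < bs:
        extra = t.new_full((bs - t.shape[0], *t.shape[1:]), val)
        return torch.cat([t, extra])
    return t

# A mask for one surviving sample, extended to a batch of three.
# The added rows are marked as fully padded (True)
mask = torch.tensor([[False, False, True]])
extended_mask = pad_extend_rows(mask, 3, True)
```

Extending the mask with padded rows means that the pooling step sees the missing samples as containing no real timesteps for that chunk.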


**Classifier Architecture**

In [0]:

#----------------------------------------------------
# Create the end-to-end classification architecture with the AWD-LSTM (and Sentence 
# Encoder) followed by the Pooling Classifier
#----------------------------------------------------
class ArchTextClassifier():

  def __init__(self, vocab_sz, emb_sz, n_h, n_layers, pad_idx, n_out, bptt, out_p=0.4, hidden_p=0.2, 
                        inp_p=0.6, emb_p=0.1, weight_p=0.5, layers_sz=None, drops_p=None):
    # Create the AWD-LSTM and Sentence Encoder
    self.awd_lstm_enc = AWD_LSTM(vocab_sz, emb_sz, n_h, n_layers, pad_idx, 
                       emb_p, inp_p, weight_p, hidden_p, apply_packing=True)
    self.enc = SentenceEncoder(self.awd_lstm_enc, bptt)

    # Initialise the layers_sz and drops_p for the Pooling Classifier
    if layers_sz is None: layers_sz = [50]
    if drops_p is None:  drops_p = [0.1] * len(layers_sz)
    n_in = 3 * emb_sz
    layers_sz = [n_in] + layers_sz + [n_out] 
    drops_p = [out_p] + drops_p

    self.plc = PoolingLinearClassifier(layers_sz, drops_p)
    self.model = SequentialRNN(self.enc, self.plc)

  # ----------------------------
  # Load pre-trained weights into the AWD-LSTM encoder
  # ----------------------------
  def load_weights(self, weights_path):
    self.awd_lstm_enc.load_state_dict(torch.load(weights_path))

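  # ----------------------------
  # Enable or disable gradients for parts of the encoder. 'type' is one of
  # FREEZE_LSTM, UNFREEZE_LSTM, FREEZE_ENCODER, UNFREEZE_ENCODER or
  # UNFREEZE_LAST_LSTM; any other value leaves the model unchanged.
  # ----------------------------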
  def freeze(self, type):
    if (type == "FREEZE_LSTM"):
      for rnn in self.awd_lstm_enc.lstm_layers:
        for p in rnn.parameters(): p.requires_grad_(False)
    elif (type == "UNFREEZE_LSTM"):
      for rnn in self.awd_lstm_enc.lstm_layers:
        for p in rnn.parameters(): p.requires_grad_(True)
    elif (type == "FREEZE_ENCODER"):
      for p in self.enc.parameters(): p.requires_grad_(False)
    elif (type == "UNFREEZE_ENCODER"):
      for p in self.enc.parameters(): p.requires_grad_(True)
    elif (type == "UNFREEZE_LAST_LSTM"):
      for p in self.awd_lstm_enc.lstm_layers[-1].parameters(): p.requires_grad_(True)

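  # ----------------------------
  # Split the parameters into groups so that each group can get its own
  # hyperparameters (eg. discriminative learning rates): one group for the
  # embeddings, one per LSTM layer and one for the Pooling Classifier.
  # n_splits = 4 corresponds to a two-layer LSTM.
  # ----------------------------
  n_splits = 4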
  def splitter(self, type):
    awd_lstm_enc = self.awd_lstm_enc

    # Embedding and Input Dropouts
    groups = [nn.Sequential(awd_lstm_enc.emb, awd_lstm_enc.emb_dp, awd_lstm_enc.inp_dp)]

    # One group for each LSTM/Hidden Dropout pair
    for i in range(len(awd_lstm_enc.lstm_layers)): 
      groups.append(nn.Sequential(awd_lstm_enc.lstm_layers[i], awd_lstm_enc.hidden_dps[i]))

    # One group for the Pooling Classifier
    groups.append(self.plc)

    # List of list of parameters by group
    return [list(o.parameters()) for o in groups]
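
The size bookkeeping in `__init__` is easier to follow with concrete numbers. The sketch below is illustrative only: `classifier_head_sizes` and `concat_pool` are hypothetical helpers, and the reading of `3 * emb_sz` as "last timestep + max pool + mean pool" assumes that `PoolingLinearClassifier` uses concat pooling as in the fastai lesson notebook.

```python
def classifier_head_sizes(emb_sz, n_out, out_p, layers_sz=None, drops_p=None):
    """Mirror the size bookkeeping in ArchTextClassifier.__init__."""
    if layers_sz is None:
        layers_sz = [50]
    if drops_p is None:
        drops_p = [0.1] * len(layers_sz)
    n_in = 3 * emb_sz                      # last step + max pool + mean pool (assumed)
    sizes = [n_in] + layers_sz + [n_out]
    drops = [out_p] + drops_p
    # Each (n_in, n_out, p) triple describes one block of the classifier head
    return list(zip(sizes[:-1], sizes[1:], drops))

def concat_pool(seq):
    """Toy concat pooling for one sample: seq is a list of timestep vectors."""
    last = seq[-1]
    max_pool = [max(col) for col in zip(*seq)]
    avg_pool = [sum(col) / len(seq) for col in zip(*seq)]
    return last + max_pool + avg_pool

blocks = classifier_head_sizes(emb_sz=300, n_out=2, out_p=0.1)
print(blocks)                                       # [(900, 50, 0.1), (50, 2, 0.1)]
print(len(concat_pool([[1.0, 4.0], [3.0, 2.0]])))   # 6, ie. 3 * hidden size
```

With an embedding size of 300 and two output classes, the head therefore maps 900 features to 50 and then 50 to 2.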

### Step 9 - Train the Classifier with IMDB data

In [0]:
torch.manual_seed(0)

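# Dropout probabilities, scaled down by 0.25. ArchTextClassifier takes them in the order
# (out_p, hidden_p, inp_p, emb_p, weight_p), assuming create_arch passes them through unchanged
dps = tensor([0.4, 0.3, 0.4, 0.05, 0.5]) * 0.25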
_tc_arch = tc_app.create_arch(*dps)

# Load pre-trained AWD-LSTM weights
_tc_arch.load_weights(path_imdb_tuned/'finetuned_enc.pth')

tc_app.run_train(split_lr=[1e-2], split=True, one_cycle=True, freeze="FREEZE_ENCODER", num_epochs=1)

lr_tmp = 5e-3
split_lr=[lr_tmp/2., lr_tmp/2., lr_tmp/2., lr_tmp]
tc_app.run_train(split_lr=split_lr, split=True, one_cycle=True, freeze="UNFREEZE_LAST_LSTM", num_epochs=1)

lr_tmp = 1e-3
split_lr=[lr_tmp/8., lr_tmp/4., lr_tmp/2., lr_tmp]
tc_app.run_train(split_lr=split_lr, split=True, one_cycle=True, freeze="UNFREEZE_ENCODER", num_epochs=2)

BEFORE Hyper parameters
1 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}
8 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,0.59224,0.682571,0.634573,0.668359,00:01


AFTER Hyper parameters
1 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
5 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
5 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
8 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
BEFORE Hyper parameters
1 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0025}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0025}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0025}
8 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,0.47138,0.778646,0.532593,0.727148,00:02


AFTER Hyper parameters
1 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.261729121149329e-05}
4 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.261729121149329e-05}
4 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.261729121149329e-05}
8 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 8.523458242298658e-05}
BEFORE Hyper parameters
1 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.000125}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00025}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0005}
8 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.001}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,0.385024,0.833984,0.47216,0.760547,00:03
1,0.353268,0.844215,0.452019,0.783398,00:03


AFTER Hyper parameters
1 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.359408171751905e-07}
4 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 1.071881634350381e-06}
4 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 2.143763268700762e-06}
8 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.287526537401524e-06}
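
The three training calls above follow a gradual-unfreezing pattern with discriminative learning rates: train only the classifier head, then also the last LSTM layer, then the whole encoder, with lower layers getting smaller learning rates. A minimal sketch of how the per-group learning rates are derived is shown below. The helper names `discriminative_lrs` and `check_split_lr` are hypothetical, and the checks are modelled on the assertions in the notebook's training helper.

```python
def discriminative_lrs(lr, divisors):
    """Per-group learning rates: one entry per parameter group, lowest layers first."""
    return [lr / d for d in divisors]

def check_split_lr(split_lr, split):
    """Sanity checks on split_lr, modelled on the notebook's training helper."""
    assert isinstance(split_lr, list) and len(split_lr) > 0
    assert split or len(split_lr) == 1, "discriminative LRs need split=True"

# (freeze mode, per-group learning rates, number of epochs)
stages = [
    ("FREEZE_ENCODER",     [1e-2],                                 1),
    ("UNFREEZE_LAST_LSTM", discriminative_lrs(5e-3, [2, 2, 2, 1]), 1),
    ("UNFREEZE_ENCODER",   discriminative_lrs(1e-3, [8, 4, 2, 1]), 2),
]

for freeze, split_lr, num_epochs in stages:
    check_split_lr(split_lr, split=True)
    print(freeze, split_lr, num_epochs)
```

The four entries line up with the four parameter groups printed in the 'BEFORE Hyper parameters' output: embeddings, first LSTM, second LSTM and classifier.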


### Predictions

In [0]:
tc_app.run_predict()

In [0]:
_tc_arch.model

SequentialRNN(
  (0): SentenceEncoder(
    (lstm_module): AWD_LSTM(
      (emb): Embedding(17965, 300, padding_idx=1)
      (emb_dp): EmbeddingDropout(
        (emb): Embedding(17965, 300, padding_idx=1)
      )
      (inp_dp): RNNDropout()
      (lstm_layers): ModuleList(
        (0): WeightDropout(
          (module): LSTM(300, 300, batch_first=True)
        )
        (1): WeightDropout(
          (module): LSTM(300, 300, batch_first=True)
        )
      )
      (hidden_dps): ModuleList(
        (0): RNNDropout()
        (1): Identity()
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(900, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.10000000149011612, inplace=False)
      (2): Linear(in_features=900, out_features=50, bias=True)
      (3): ReLU(inplace=True)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1, inplace=False)
    

## Obsolete

In [0]:
path_imdb = datasets.untar_data(datasets.URLs.IMDB)
lmds_params = {'target_ds': FastaiLMDataset, 'bs': 64, 'bptt': 70}
lmdb = LanguageModelFolderDataBundle(path_imdb, lmds_params)
lmdb.do()
lm_imdb_vocab = lmdb.convert_state['vocab_i2w']

In [0]:
print (len(lmdb.convert_state['vocab_i2w']), len(lmdb.convert_state['vocab_w2i'].keys()))
lmdb.convert_state['vocab_i2w'][500], lmdb.convert_state['vocab_w2i']['history']

18189 41168


('son', 426)
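
The two lengths printed above differ: the index-to-word list has 18189 entries while the word-to-index mapping has 41168 keys. One plausible explanation, assuming `vocab_w2i` is a `defaultdict` as in the fastai lesson notebook (this is not confirmed by the code shown here), is that every lookup of an out-of-vocabulary word silently adds a new key that maps to the default index. A toy illustration with a hypothetical four-word vocab:

```python
from collections import defaultdict

vocab_i2w = ["xxunk", "xxpad", "the", "movie"]   # toy vocab (hypothetical)

# Unknown words fall back to index 0, the assumed position of the unknown token
vocab_w2i = defaultdict(int, {w: i for i, w in enumerate(vocab_i2w)})

before = len(vocab_w2i)
ids = [vocab_w2i[w] for w in ["the", "zebra", "movie", "quokka"]]
after = len(vocab_w2i)

print(ids)             # [2, 0, 3, 0]
print(before, after)   # 4 6
```

If this is the cause, the extra keys are harmless for numericalisation, but `len(vocab_w2i)` should not be used as the vocab size.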

In [0]:
#----------------------------------------------------
#----------------------------------------------------
class ArchLanguageModel():

  # ----------------------------
  # ----------------------------
  def test_train(self, test_data, opt_func, split_lr, split, one_cycle, freeze, num_epochs=1):
    train_dl = test_data.train_dl
    valid_dl = test_data.valid_dl
    # NB: Use cross_entropy_flat for Language Model
    loss_func = cross_entropy_flat

    # split_lr is a list:
    #   1. a single-element list [0.01] - same LR for all groups. 
    #        If 'Split' is False, there is only one group. 
    #        If 'Split' is True, there are multiple groups.
    #   2. a multi-element list [0.01, 0.03, 0.05] - discriminative LR for different groups. 
    #        'Split' cannot be False
    assert(isinstance(split_lr, list))
    assert(len(split_lr) > 0)
    assert(not ((split == False) and (len(split_lr) > 1)))

    # NB: Use AwdLstmCB for Language Model and accuracy_flat
    callbs=[CudaCB(device = torch.device('cuda',0)), AwdLstmCB(alpha=2., beta=1.), GradientClipping(clip=0.1), ProgressCallback(), MetricsCB({"acc": accuracy_flat})]
    if (one_cycle):
      one_cycle_callbs = create_OneCycleCB(split_lr, phases=[0.5, 0.5], mom_start=0.8, mom_mid=0.7, mom_end=0.8)
      callbs = callbs + one_cycle_callbs

    self.freeze(freeze)

    print ('BEFORE Hyper parameters')
    lr = split_lr[0]
    if (split and (len(split_lr) == 1)):
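      # One independent (empty) hyperparameter dict per group, so that groups never share state
      hypers_group = [{} for _ in range(self.n_splits)]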
      opt_groups=(self.splitter, hypers_group, {'lr': lr})
    elif (split and (len(split_lr) > 1)):
      hypers_group = [{'lr': lr_g} for lr_g in split_lr]
      opt_groups=(self.splitter, hypers_group, {})
    else:
      opt_groups = None
    opt = get_optimiser(self.arch, lr, opt_func, opt_groups)

    loop = Trainer(train_dl, valid_dl, self.arch, opt, loss_func, callbs)
    #db.set_trace()
    loop.fit(num_epochs=num_epochs)

    # TODO !!!!!!! Make _print_opt() a public function, maybe using repr
    print ('AFTER Hyper parameters')
    loop.opt._print_opt()
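
The one-cycle option explains why the learning rate and momentum in the 'AFTER Hyper parameters' printouts differ from the 'BEFORE' values. The sketch below shows a two-phase cosine schedule of the kind `create_OneCycleCB` is assumed to build. The function names are hypothetical, and the `lr/10` start and `lr/1e5` end points are assumptions borrowed from the fastai lesson rather than values confirmed by this notebook.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle(pos, phases, values):
    """Value at training progress pos in [0, 1] for a two-phase schedule.

    values = (start, mid, end); phases are the fractions spent in each phase.
    """
    start, mid, end = values
    split = phases[0]
    if pos < split:
        return cos_anneal(start, mid, pos / split)
    return cos_anneal(mid, end, (pos - split) / phases[1])

lr = 1e-2
# The lr/10 and lr/1e5 end points are assumptions, see the note above
lr_values = (lr / 10., lr, lr / 1e5)
mom_values = (0.8, 0.7, 0.8)   # mom_start, mom_mid, mom_end from the call above

for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(pos, one_cycle(pos, [0.5, 0.5], lr_values), one_cycle(pos, [0.5, 0.5], mom_values))
```

The learning rate rises to its peak halfway through training and then decays to a very small value, while momentum does the opposite, which is consistent with the momentum of roughly 0.8 and the tiny learning rates printed after training.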

In [0]:
test_vocab = lm_imdb_vocab
test_data = lmdb
#test_data = subset_data
tok_pad = test_vocab.index(PAD)

torch.manual_seed(0)
test_language_model = ArchLanguageModel(test_vocab, 300, 300, 2, tok_pad)

test_language_model.test_train(test_data, opt_func=adam_opt_func, split_lr=[5e-3], split=False, one_cycle=False, freeze="UNFREEZE_LSTM", num_epochs=1)

BEFORE Hyper parameters
12 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,6.506213,0.079433,6.039642,0.085496,00:18


AFTER Hyper parameters
12 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


In [0]:
#----------------------------------------------------
# The pre-trained weights are named after the module names in Fastai's model, which differ
# slightly from the module names in my KD model
#----------------------------------------------------
# These are the dropout probabilities 
dps = tensor([0.1, 0.15, 0.25, 0.02, 0.2]) * 0.5

arch_language_model = ArchLanguageModel(lm_imdb_vocab, 300, 300, 2, tok_pad, *dps)
arch_language_model.arch, arch_language_model.arch.state_dict().keys(), matched_wgts.keys()

(SequentialRNN(
   (0): AWD_LSTM(
     (emb): Embedding(18189, 300, padding_idx=1)
     (emb_dp): EmbeddingDropout(
       (emb): Embedding(18189, 300, padding_idx=1)
     )
     (inp_dp): RNNDropout()
     (lstm_layers): ModuleList(
       (0): WeightDropout(
         (module): LSTM(300, 300, batch_first=True)
       )
       (1): WeightDropout(
         (module): LSTM(300, 300, batch_first=True)
       )
     )
     (hidden_dps): ModuleList(
       (0): RNNDropout()
       (1): Identity()
     )
   )
   (1): LinearDecoder(
     (out_dp): RNNDropout()
     (decoder): Linear(in_features=300, out_features=18189, bias=True)
   )
 ),
 odict_keys(['0.emb.weight', '0.emb_dp.emb.weight', '0.lstm_layers.0.weight_hh_l0_raw', '0.lstm_layers.0.module.weight_ih_l0', '0.lstm_layers.0.module.weight_hh_l0', '0.lstm_layers.0.module.bias_ih_l0', '0.lstm_layers.0.module.bias_hh_l0', '0.lstm_layers.1.weight_hh_l0_raw', '0.lstm_layers.1.module.weight_ih_l0', '0.lstm_layers.1.module.weight_hh_l0', '0.lstm

In [0]:
arch_language_model.test_train(test_data, opt_func=adam_opt_func, split_lr=[5e-3], split=True, one_cycle=False, freeze="FREEZE_LSTM", num_epochs=1)

BEFORE Hyper parameters
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,4.703729,0.231917,4.291876,0.256696,00:17


AFTER Hyper parameters
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


In [0]:
arch_language_model.test_train(test_data, opt_func=adam_opt_func, split_lr=[2e-2], split=True, one_cycle=True, freeze="FREEZE_LSTM", num_epochs=1)

BEFORE Hyper parameters
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.02}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.02}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.02}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,4.430527,0.245284,4.219647,0.261421,00:17


AFTER Hyper parameters
4 {'momentum': 0.7999748271191593, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.234525822385427e-06}
4 {'momentum': 0.7999748271191593, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.234525822385427e-06}
2 {'momentum': 0.7999748271191593, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.234525822385427e-06}


In [0]:
lr_tmp = 2e-3
split_lr = [lr_tmp/2., lr_tmp/2., lr_tmp]
arch_language_model.test_train(test_data, opt_func=adam_opt_func, split_lr=split_lr, split=True, one_cycle=True, freeze="UNFREEZE_LSTM", num_epochs=1)

BEFORE Hyper parameters
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.001}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.001}
2 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.002}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,4.229662,0.257942,4.181029,0.265248,00:19


AFTER Hyper parameters
4 {'momentum': 0.7999748271191593, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 2.617262911193581e-07}
4 {'momentum': 0.7999748271191593, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 2.617262911193581e-07}
2 {'momentum': 0.7999748271191593, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.234525822387162e-07}


In [0]:
# Use a context manager so that the file is closed after loading
with open(path_imdb_tuned/'vocab_lm.pkl', 'rb') as f:
  lm_imdb_vocab = pickle.load(f)
tlmdb = TextClassificationFolderDataBundle(path_imdb, bs=64, vocab_i2w=lm_imdb_vocab)
tlmdb.do()

--------- IMDB Classification DataBundle init /root/.fastai/data/imdb
FolderItemContainer loaded 50001 items of type TextFileItemList
Split using split_random into 1500, 1000 and 0 items of type TextFileItemList
Extracted 1500 items of type SentenceItemList using extract_doc
Extracted 1500 items of type ClassNameItemList using extract_custom
Converted 1500 items to type SentenceWordItemList using SentenceToWord
Converted 1500 items to type SentenceWordIdItemList using WordToWordId
Converted 1500 items to type ClassIdItemList using NameToId
Extracted 1000 items of type SentenceItemList using extract_doc
Extracted 1000 items of type ClassNameItemList using extract_custom
Converted 1000 items to type SentenceWordItemList using SentenceToWord
Converted 1000 items to type SentenceWordIdItemList using WordToWordId
Converted 1000 items to type ClassIdItemList using NameToId
Final SentenceWordIdItemList (1500 items)
[[2, 7, 8, 381, 13, 8, 7, 48, 22, 6149, 15236, 10, 7, 5185, 7, 1843, 10, 24, 6

In [0]:
#----------------------------------------------------
class ArchTextClassifier():

  # ----------------------------
  # ----------------------------
  def OLD_test_predict(self, test_data):
    x,y = next(iter(test_data.valid_dl))
    pred_batch = self.arch.eval()(x.cuda())
    return (pred_batch)

    # NB: The early return above means that the check below never runs. It is kept for reference:
    # predicting on the padded batch or on the individual unpadded samples gives the same results.
    pred_ind = []
    for inp in x:
      length = x.size(1) - (inp == self.awd_lstm_enc.pad_idx).long().sum()
      inp = inp[:length]
      pred_ind.append(self.arch.eval()(inp[None].cuda()))
    assert near(pred_batch, torch.cat(pred_ind))

  # ----------------------------
  # ----------------------------
  def OLD_test_train(self, test_data, opt_func, split_lr, split, one_cycle, freeze, num_epochs=1):
    train_dl = test_data.train_dl
    valid_dl = test_data.valid_dl
    # NB: We don't use cross_entropy_flat for Classification
    loss_func = F.cross_entropy

    # split_lr is a list:
    #   1. a single-element list [0.01] - same LR for all groups. 
    #        If 'Split' is False, there is only one group. 
    #        If 'Split' is True, there are multiple groups.
    #   2. a multi-element list [0.01, 0.03, 0.05] - discriminative LR for different groups. 
    #        'Split' cannot be False
    assert(isinstance(split_lr, list))
    assert(len(split_lr) > 0)
    assert(not ((split == False) and (len(split_lr) > 1)))

    # NB: We don't use the AwdLstmCB while doing Classification and we use accuracy not accuracy_flat
    callbs=[CudaCB(device = torch.device('cuda',0)), GradientClipping(clip=0.1), ProgressCallback(), MetricsCB({"acc": accuracy})]
    if (one_cycle):
      one_cycle_callbs = create_OneCycleCB(split_lr, phases=[0.5, 0.5], mom_start=0.8, mom_mid=0.7, mom_end=0.8)
      callbs = callbs + one_cycle_callbs

    self.freeze(freeze)

    print ('BEFORE Hyper parameters')
    lr = split_lr[0]
    if (split and (len(split_lr) == 1)):
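      # One independent (empty) hyperparameter dict per group, so that groups never share state
      hypers_group = [{} for _ in range(self.n_splits)]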
      opt_groups=(self.splitter, hypers_group, {'lr': lr})
    elif (split and (len(split_lr) > 1)):
      hypers_group = [{'lr': lr_g} for lr_g in split_lr]
      opt_groups=(self.splitter, hypers_group, {})
    else:
      opt_groups = None
    opt = get_optimiser(self.arch, lr, opt_func, opt_groups)

    loop = Trainer(train_dl, valid_dl, self.arch, opt, loss_func, callbs)
    #db.set_trace()
    loop.fit(num_epochs=num_epochs)

    # TODO !!!!!!! Make _print_opt() a public function, maybe using repr
    print ('AFTER Hyper parameters')
    loop.opt._print_opt()

# eg. opt_func = optim.SGD, sgd_opt_func, adam_opt_func
# eg. split_lr = [lr/2., lr/2., lr] for one cycle

In [0]:
test_vocab = lm_imdb_vocab
test_data = tlmdb
#test_data = subset_data
tok_pad = test_vocab.index(PAD)
bptt = 70

torch.manual_seed(0)
dps = tensor([0.4, 0.3, 0.4, 0.05, 0.5]) * 0.25
arch_classifier = ArchTextClassifier(len(test_vocab), 300, 300, 2, tok_pad, 2, bptt, *dps)

# Load pre-trained AWD-LSTM weights
arch_classifier.load_weights(path_imdb_tuned/'finetuned_enc.pth')

arch_classifier.test_train(test_data, opt_func=adam_opt_func, split_lr=[1e-2], split=True, one_cycle=True, freeze="FREEZE_ENCODER", num_epochs=1)

lr_tmp = 5e-3
split_lr=[lr_tmp/2., lr_tmp/2., lr_tmp/2., lr_tmp]
arch_classifier.test_train(test_data, opt_func=adam_opt_func, split_lr=split_lr, split=True, one_cycle=True, freeze="UNFREEZE_LAST_LSTM", num_epochs=1)

lr_tmp = 1e-3
split_lr=[lr_tmp/8., lr_tmp/4., lr_tmp/2., lr_tmp]
arch_classifier.test_train(test_data, opt_func=adam_opt_func, split_lr=split_lr, split=True, one_cycle=True, freeze="UNFREEZE_ENCODER", num_epochs=2)

BEFORE Hyper parameters
1 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}
5 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}
8 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.01}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,0.631232,0.658575,0.632412,0.63457,00:02


AFTER Hyper parameters
1 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
5 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
5 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
8 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00017046916484597316}
BEFORE Hyper parameters
1 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0025}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0025}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0025}
8 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,0.494811,0.756417,0.543253,0.723437,00:02


AFTER Hyper parameters
1 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.261729121149329e-05}
4 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.261729121149329e-05}
4 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.261729121149329e-05}
8 {'momentum': 0.7982962913144535, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 8.523458242298658e-05}
BEFORE Hyper parameters
1 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.000125}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.00025}
4 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.0005}
8 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.001}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,0.412415,0.807199,0.482806,0.760938,00:03
1,0.369527,0.838914,0.449211,0.794141,00:03


AFTER Hyper parameters
1 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 5.359408171751905e-07}
4 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 1.071881634350381e-06}
4 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 2.143763268700762e-06}
8 {'momentum': 0.7995722430686906, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 4.287526537401524e-06}


In [0]:
arch_classifier.test_predict(test_data)

In [0]:
#----------------------------------------------------
# End-to-end run - Train using KD Trainer (as opposed to Fastai Learner)
# Change the 'model' variable to switch between KD Model and Fastai Model
# Change the 'test_data' variable to switch between full 'data' and 'subset_data'
# Change the 'opt' variable to switch between the Pytorch SGD optimiser, KD's SGD optimiser and KD's Adam optimiser
#----------------------------------------------------

#----------------------------------------------------
# Note that I ran various combinations of training to compare results. Those results are in an email on fal_thu.
# This tested the KD model and the Fastai model, with the KD Trainer and the Fastai Learner, full data and subset data, and different optimisers.
# The results were more or less consistent across all these scenarios, showing that the KD Trainer, KD model, KD callbacks and KD optimisers
# are equivalent to the ones from the Fastai lesson.
#----------------------------------------------------


#----------------------------------------------------
# Create the end-to-end architecture with the AWD-LSTM followed by the Decoder
#----------------------------------------------------
def get_language_model(vocab_sz, emb_sz, n_h, n_layers, pad_idx, out_p=0.4, hidden_p=0.2, inp_p=0.6, 
                       emb_p=0.1, weight_p=0.5, tie_weights=True, bias=True):
  awd_lstm_enc = AWD_LSTM(vocab_sz, emb_sz, n_h, n_layers, pad_idx, emb_p, inp_p, weight_p, hidden_p)

  # Get the embedding layer from the AWD-LSTM to enable weight-tying with the
  # Decoder
  enc = awd_lstm_enc.emb if tie_weights else None

  # The input of the Decoder has embedding size to match the output of the 
  # last AWD-LSTM layer. The output of the Decoder has vocab size to produce
  # a probability for each word in the vocab
  awd_lstm_dec = LinearDecoder(emb_sz, vocab_sz, out_p, tie_encoder=enc)

  return SequentialRNN(awd_lstm_enc, awd_lstm_dec)

#----------------------------------------------------
# Freeze the LSTM layers
#----------------------------------------------------
def freeze_lstm(model, freeze):
  do_grad = not freeze
  for rnn in model[0].lstm_layers:
    for p in rnn.parameters(): p.requires_grad_(do_grad)

#----------------------------------------------------
# Now run the model end to end using same hyperparameters for the 3 parameter groups
# !!!! Using my code only (ie. KD Trainer, KD Optimiser and KD Model) and no Fastai
#----------------------------------------------------

# !!!!! The first thing was to implement the param group splitter and make sure it works with the KD Adam optimiser. That is done.

def test_awd_lstm(model, test_data, lr, opt_type, one_cycle, freeze):
  train_dl = test_data.train_dl
  valid_dl = test_data.valid_dl
  loss_func = cross_entropy_flat

  callbs=[CudaCB(device = torch.device('cuda',0)), AwdLstmCB(alpha=2., beta=1.), GradientClipping(clip=0.1), ProgressCallback(), MetricsCB({"acc": accuracy_flat})]
  if (one_cycle):
    # TODO !!!!!!! Using freeze as the if condition to differentiate between Step 2 and Step 3 below is a hack
    if (freeze):
      one_cycle_callbs = create_OneCycleCB([lr], phases=[0.5, 0.5], mom_start=0.8, mom_mid=0.7, mom_end=0.8)
    else:
      one_cycle_callbs = create_OneCycleCB([lr/2., lr/2., lr], phases=[0.5, 0.5], mom_start=0.8, mom_mid=0.7, mom_end=0.8)
    callbs = callbs + one_cycle_callbs

  freeze_lstm(model, freeze)

  print ('BEFORE Hyper parameters')
  if (opt_type == "py_sgd"):
    opt = optim.SGD(model.parameters(), lr=lr)
  elif (opt_type == "lib_sgd"):
    opt = sgd_opt_func(model.parameters(), lr=lr)
  elif (opt_type == "lib_adam"):
    opt = adam_opt_func(model.parameters(), lr=lr)
  elif (opt_type == "lib_adam_split"):
    opt_groups=(lm_splitter, [{}, {}, {}], {'lr': lr})
    opt = get_optimiser(model, lr, adam_opt_func, opt_groups)
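  else:
    # Fail early rather than passing a None optimiser to the Trainer
    raise ValueError(f"Unknown opt_type: {opt_type}")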

  loop = Trainer(train_dl, valid_dl, model, opt, loss_func, callbs)
  loop.fit(num_epochs=1)

  # TODO !!!!!!! Make _print_opt() a public function, maybe using repr
  print ('AFTER Hyper parameters')
  loop.opt._print_opt()
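
The callback list above includes `GradientClipping(clip=0.1)`. Assuming it rescales the gradients so that their overall L2 norm does not exceed `clip` (the behaviour of PyTorch's `clip_grad_norm_`, which the fastai lesson uses), the idea can be sketched on a flat list of gradient values. The function `clip_grad_norm` below is a simplified stand-in, not the library routine.

```python
import math

def clip_grad_norm(grads, clip):
    """Scale a flat list of gradient values so that their L2 norm is at most `clip`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= clip:
        return list(grads), total_norm
    scale = clip / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_grad_norm([0.3, -0.4], clip=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Clipping by norm keeps the direction of the update and only shrinks its size, which helps against the exploding gradients that recurrent networks are prone to.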

In [0]:
#----------------------------------------------------
# Create the model based on KD code
# Do this after importing Lib ie. the first 4 cells of this current notebook
#----------------------------------------------------
torch.manual_seed(0)
tok_pad = vocab.index(PAD)
kd_model_cpu = get_language_model(len(vocab), 300, 300, 2, tok_pad)
kd_model = kd_model_cpu.cuda()

In [0]:
test_awd_lstm(kd_model, subset_data, lr=5e-3, opt_type="lib_adam", one_cycle=False, freeze=False)

BEFORE Hyper parameters
12 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


epoch,train_loss,train_acc,valid_loss,valid_acc,time
0,6.837101,0.081921,6.57095,0.087704,00:43


AFTER Hyper parameters
12 {'momentum': 0.9, 'sqr_momentum': 0.99, 'eps': 1e-05, 'weight_decay': 0.0, 'lr': 0.005}


**Split the model into param groups**

In [0]:
#----------------------------------------------------
# Split into three groups - one for each of the two RNN layers with its corresponding
# hidden dropout, then one last group that contains the embeddings and the decoder.
# The last group is the one that needs to be trained the most as we may have new embedding vectors.
#----------------------------------------------------

def lm_splitter(m):
    groups = []
    for i in range(len(m[0].lstm_layers)): 
      groups.append(nn.Sequential(m[0].lstm_layers[i], m[0].hidden_dps[i]))
    groups += [nn.Sequential(m[0].emb, m[0].emb_dp, m[0].inp_dp, m[1])]
    return [list(o.parameters()) for o in groups]
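
To see which parameters land in which group, the same grouping can be mimicked on state-dict style key names. This is only an illustration: `split_keys` is a hypothetical helper that groups by name prefix, whereas `lm_splitter` groups the actual modules. The first four keys are taken from the state-dict listing printed earlier, while the decoder key is made up for the example.

```python
def split_keys(keys, n_layers):
    """Assign state-dict style keys to parameter groups, in lm_splitter order."""
    groups = [[] for _ in range(n_layers + 1)]
    for key in keys:
        for i in range(n_layers):
            if key.startswith(f"0.lstm_layers.{i}."):
                groups[i].append(key)
                break
        else:
            # Embedding ('0.emb...') and decoder ('1....') share the last group
            groups[n_layers].append(key)
    return groups

keys = [
    "0.emb.weight",
    "0.lstm_layers.0.module.weight_ih_l0",
    "0.lstm_layers.0.module.weight_hh_l0",
    "0.lstm_layers.1.module.weight_ih_l0",
    "1.decoder.weight",   # hypothetical decoder key, for illustration only
]

key_groups = split_keys(keys, n_layers=2)
for i, group in enumerate(key_groups):
    print(i, group)
```

The order of the groups matters because the list of per-group learning rates, eg. `[lr/2., lr/2., lr]`, is matched to the groups by position.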

In [0]:
test_awd_lstm(kd_model, subset_data, lr=5e-3, opt_type="lib_adam_split", one_cycle=False, freeze=True)

In [0]:
test_awd_lstm(kd_model, subset_data, lr=2e-2, opt_type="lib_adam_split", one_cycle=True, freeze=True)

In [0]:
test_awd_lstm(kd_model, subset_data, lr=2e-3, opt_type="lib_adam_split", one_cycle=True, freeze=False)

In [0]:
  def temp_forward(self, input):
    # If the batch size changes, then re-initialise the hidden state (whose shape depends on batch size)
    bs, _ = input.size()
    if bs!=self.bs:
      self.bs=bs
      self.reset_state()

    # Input goes through Embedding layer with dropout
    # Input has shape [samples, timesteps]
    # Emb_val has shape [samples, timesteps, embedding size]
    emb_val = self.emb_dp(input)

    # Now apply Input dropout to the result of the embedding
    # This will then be fed to the LSTM layers
    # Inp_val has shape [samples, timesteps, embedding size]
    inp_val = self.inp_dp(emb_val)

    #print ('KD input and inp val', input.float().mean(), inp_val.mean())

    # Keep a list of hidden state, raw output values and output values post dropout for each LSTM layer
    new_states, out_vals, out_dp_vals = [], [], []

    # Go through each LSTM layer and its corresponding Hidden Dropout and Hidden State
    for lstm_dp, hidden_dp, state in zip(self.lstm_layers, self.hidden_dps, self.state):
      # Apply the LSTM
      out_val, new_state = lstm_dp(inp_val, state)
      # Apply the Hidden Dropout to the LSTM output
      out_dp_val = hidden_dp(out_val)

      #print ('KD layer', out_val.mean(), state[0].mean(), state[1].mean(), new_state[0].mean(), new_state[1].mean())
      #print ('KD hidden', out_dp_val.mean())

      # Add the state, raw output value and output value post dropout for this layer, to the lists
      # [hidden state layer 1, hidden state layer 2, ....]
      # [raw output layer 1, raw output layer 2, ...]
      # [(post dropout) output layer 1, output layer 2, ...]
      new_states.append(new_state)
      out_vals.append(out_val)
      out_dp_vals.append(out_dp_val)

      # The post-dropout output will become the input to the next layer
      inp_val = out_dp_val

    # Save the new hidden states
    self.state = self._to_detach(new_states)

    # Return ([list of raw outputs for each layer], [list of post-dropout outputs for each layer])
    return (out_vals, out_dp_vals)
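
The forward pass returns both the raw and the post-dropout outputs. A likely reason, assuming `AwdLstmCB(alpha=2., beta=1.)` follows the RNN training callback from the fastai lesson, is that the two AWD-LSTM regularisers need them: activation regularisation (AR) penalises large post-dropout activations of the last layer, and temporal activation regularisation (TAR) penalises large changes between consecutive timesteps of the raw output. A toy sketch on nested lists of shape [batch][time][hidden], with hypothetical function names:

```python
def mean_sq(values):
    """Mean of squared values over a flat list."""
    return sum(v * v for v in values) / len(values)

def ar_penalty(out_dp, alpha):
    """Activation regularisation on post-dropout outputs [batch][time][hidden]."""
    flat = [h for sample in out_dp for step in sample for h in step]
    return alpha * mean_sq(flat)

def tar_penalty(out_raw, beta):
    """Temporal activation regularisation on raw outputs: penalise step-to-step change."""
    diffs = [b - a
             for sample in out_raw
             for prev, nxt in zip(sample[:-1], sample[1:])
             for a, b in zip(prev, nxt)]
    return beta * mean_sq(diffs)

# One sample, three timesteps, hidden size 2 (toy numbers)
last_layer_raw = [[[1.0, 0.0], [1.0, 2.0], [3.0, 2.0]]]
last_layer_dp = [[[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]]

ar = ar_penalty(last_layer_dp, alpha=2.)
tar = tar_penalty(last_layer_raw, beta=1.)
extra_loss = ar + tar   # would be added to the cross-entropy loss
print(ar, tar, extra_loss)
```

Both penalties are added to the language model loss during training only, which would also explain why the classifier training code does not use this callback.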


In [0]:


model = kd_model
test_data = subset_data
#test_data = data
py_sgd = optim.SGD(model.parameters(), lr=5e-3)
lib_sgd = sgd_opt_func(model.parameters(), lr=5e-3)
lib_adam = adam_opt_func(model.parameters(), lr=5e-3)

train_dl = test_data.train_dl
valid_dl = test_data.valid_dl
loss_func = cross_entropy_flat
opt = lib_adam

callbs=[CudaCB(device = torch.device('cuda',0)), AwdLstmCB(alpha=2., beta=1.), GradientClipping(clip=0.1), ProgressCallback(), MetricsCB({"acc": accuracy_flat})]

loop = Trainer(train_dl, valid_dl, model, opt, loss_func, callbs)
loop.fit(num_epochs=1)


In [0]:


model = kd_model
test_data = subset_data
#test_data = data
py_sgd = optim.SGD(model.parameters(), lr=5e-3)
lib_sgd = sgd_opt_func(model.parameters(), lr=5e-3)
lib_adam = adam_opt_func(model.parameters(), lr=5e-3)

opt_groups=(lm_splitter, [{}, {}, {}], {'lr': 1e-2})
lib_adam_split = get_optimiser(model, 5e-3, adam_opt_func, opt_groups)

train_dl = test_data.train_dl
valid_dl = test_data.valid_dl
loss_func = cross_entropy_flat
opt = lib_adam_split

callbs=[CudaCB(device = torch.device('cuda',0)), AwdLstmCB(alpha=2., beta=1.), GradientClipping(clip=0.1), ProgressCallback(), MetricsCB({"acc": accuracy_flat})]

loop = Trainer(train_dl, valid_dl, model, opt, loss_func, callbs)
loop.fit(num_epochs=1)