# Lesson 7

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.fastai.learner import *
from fastai.fastai.column_data import *
from fastai.fastai.io import *
from fastai.fastai.lm_rnn import *

## 00:00:00 - Part 1 recap

* Part 1 theme = classification and regression with DL.
  * Identifying and learning best practices.
* First 4 lessons: image classification, structured data and NLP in practice.
* Last 3 lessons: understanding more detail about what is going on under the hood.

## 00:01:01 - Part 2 preview

* Move from classification focus to generative models:
  * Chat responses.
  * Images.
  * Text.
* Move from best practices to speculative stuff:
  * Recent papers that haven't been fully tested.
* Learn how to read papers.

## 00:02:51 - RNNs (recap of Lesson 6)

* RNNs are just standard fully-connected networks.
* Recap of lesson from last week (see Lesson 6 notebook).

### 00:06:20 - Multi-output model

* Split into non-overlapping pieces, then use a piece to predict the next chars offset by 1.
* Problem with RNN model created earlier: each time we start a new sequence, we have to learn the hidden state from scratch:
  ```
  def forward(self, *cs):
      bs = cs[0].size(0)
      
      # Problem: we are creating a brand new hidden state on each forward pass
      h = V(torch.zeros(1, bs, n_hidden))
      
      inp = self.e(torch.stack(cs))
      outp, h = self.rnn(inp, h)
  ```
  * Can improve on that by saving the state of `self.h` in the constructor:

In [11]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        
        super().__init__()
        
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        
        # This handles the last batch, if we don't have enough
        # text for a batch size.
        if self.h.size(1) != bs:
            self.init_hidden(bs)
            
        outp, h = self.rnn(self.e(cs), self.h)
        
        # Keep the values of the hidden state but throw away its history of
        # operations, so backprop through time stops at the start of this batch.
        self.h = repackage_var(h)
        
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(1, bs, n_hidden)))

### 00:10:50 - Backprop through time

* In the multi-output model, the unrolled RNN is going to be the size of the corpus. E.g. if it's a million words, `self.h` would have a million layers of history, which would be expensive to run backprop through.
  * Want to remember state but not history: `Variable(h.data) if type(h) == Variable else tuple(repackage_var(v) for v in h)`
    * By passing `h.data` into a new `Variable` class, you lose the history.
* The process of running backprop through the hidden state history is called backprop through time (bptt).
* Usually set a cap on how many layers to run bptt on.
  * In original RNN lesson, we had a var called `bptt = 70`, which sets how many layers to run backprop through.
* Longer values may let you capture more state about the problem, but may also result in exploding / vanishing gradients.
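
To see why a longer `bptt` can destabilise training, here is a minimal sketch using a hypothetical one-unit linear recurrence `h_t = w * h_{t-1}` (an illustration only, not the lesson's model; `grad_through_steps` is a made-up helper). The gradient of the final state with respect to the initial one is `w ** steps`, so it grows or shrinks exponentially with the number of steps backprop runs through.

```python
def grad_through_steps(w, steps):
    """d h_steps / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: each unrolled step contributes one factor of w
    return grad


# Compare a short and a long backprop window for w > 1 and w < 1.
for bptt in (8, 70):
    print(bptt, grad_through_steps(1.5, bptt), grad_through_steps(0.5, bptt))
```

With `w = 1.5` the gradient blows up as the window grows, and with `w = 0.5` it shrinks towards zero, which is the trade-off to weigh against capturing more state.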

### 00:16:00 - Minibatching in RNNs

* How do you get the data into the model, given we want to do a mini-batch of sections?
* Reminder about how Torch text breaks up documents for batching:
  * Split the corpus into 64 equal sized chunks.
  * Stack the chunks side by side as columns - each mini-batch is then a slice of `bptt` rows taken across all the chunks.
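
The chunk-and-slice layout described above can be sketched in plain Python. This is an assumption-laden illustration, not Torchtext's code: `batchify` and `minibatches` are hypothetical helpers, the corpus is a list of integers, and `bptt` is fixed rather than randomised.

```python
def batchify(tokens, bs):
    """Split tokens into bs equal chunks and lay the chunks out as columns.

    Returns a list of rows; row i holds the i-th token of every chunk.
    Leftover tokens that don't fill a whole chunk are dropped.
    """
    chunk_len = len(tokens) // bs
    chunks = [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(bs)]
    return [[chunk[i] for chunk in chunks] for i in range(chunk_len)]


def minibatches(rows, bptt):
    """Yield (input, target) pairs: up to bptt rows, and the same rows shifted by one."""
    for start in range(0, len(rows) - 1, bptt):
        seq_len = min(bptt, len(rows) - 1 - start)
        yield rows[start:start + seq_len], rows[start + 1:start + 1 + seq_len]


tokens = list(range(24))        # stand-in for a numericalised corpus
rows = batchify(tokens, bs=3)   # 8 rows x 3 columns
x, y = next(minibatches(rows, bptt=2))
print(x)  # [[0, 8, 16], [1, 9, 17]]
print(y)  # [[1, 9, 17], [2, 10, 18]]
```

Each column continues where it left off in the previous mini-batch, which is what lets a stateful model carry its hidden state from one batch to the next.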

### 00:20:31 - Augmentation for NLP (audience question)

* No good way known: Jeremy planning to study it.
* Recent Kaggle winner won it by randomly inserting parts of rows.

### 00:21:38 - Choosing a `bptt` size (audience question)

* Question: how do you choose a `bptt` size?
* Answers:
  * Memory: Matrix size for a minibatch has a bptt x bs size, so pick one that your GPU can handle.
  * Stability: If training is unstable (loss shooting off to NaN), try reducing `bptt`: there are fewer layers for gradients to explode through.
  * Performance: Consider reducing if each batch is training too slowly.

## 00:23:24 - Training model with Torchtext tooling

* Back to Torchtext for preparing dataset.
* Have a choice when preparing dataset:
  * Write your own dataset subclass
  * Change your data to fit the dataset classes you have.

In [3]:
from torchtext import vocab, data

from fastai.fastai.nlp import *
from fastai.fastai.lm_rnn import *

PATH = 'data/nietzche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'

TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

### Preparing dataset (not covered in lecture)

* Prepared data by copying nietzsche.txt to the `trn` and `val` folders, then deleting the last 20% of rows in `trn` and the first 80% of rows in `val`.
* Useful to not have a random shuffled validation set: val set is actual ordered text.

In [13]:
!mkdir {TRN}

In [14]:
!mkdir {VAL}

In [5]:
get_data('https://s3.amazonaws.com/text-datasets/nietzsche.txt', f'{PATH}nietzsche.txt')

nietzsche.txt: 606kB [00:05, 113kB/s]                             


In [9]:
!wc -l {PATH}nietzsche.txt

    9934 data/nietzche/nietzsche.txt


In [32]:
# Use ceil over round because that line ends a sentence.
trn_rows = math.ceil(9934 * 0.8)

In [30]:
!head -n {trn_rows} {PATH}nietzsche.txt > {TRN}/train.txt

In [35]:
!tail -n 5 {TRN}/train.txt

that the easy life which the Jesuit manuals advocate is for the benefit,
not of the Jesuits but the laity. Indeed, it may be questioned whether
we enlightened ones would become equally competent workers as the result
of similar tactics and organization, and equally worthy of admiration as
the result of self mastery, indefatigable industry and devotion.


In [33]:
!tail -n {9934 - trn_rows} {PATH}nietzsche.txt > {VAL}/val.txt

In [37]:
!head -n 5 {VAL}/val.txt


56

=Victory of Knowledge over Radical Evil.=--It proves a material gain to
him who would attain knowledge to have had during a considerable period


* In Torchtext, you create a `Field`.
  * Description for how to process text.
  * `tokenize` - since you want a character model, can just use the list function in Python to tokenise:

In [41]:
list('yo wassup')

['y', 'o', ' ', 'w', 'a', 's', 's', 'u', 'p']

In [38]:
TEXT = data.Field(lower=True, tokenize=list)

* Settings:
  * `bs` - batch size (same setting as last notebook).
  * `bptt` - size of backprop through time.
  * `n_fac` - size of embedding.
  * `n_hidden` - size of hidden state.

In [42]:
bs=64; bptt=8; n_fac=42; n_hidden=256

* Create Fast.ai dataset:

In [39]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

* Length of data loader is how many batches to go through (should be equal to token num / batch size / bptt):

In [46]:
len(md.trn_ds[0].text) / bs / bptt

943.3046875

  * Not exactly that in Torchtext: bptt randomised a little bit:

In [47]:
len(md.trn_dl)

942

Num unique tokens:

In [49]:
md.nt

55

In [50]:
len(md.trn_ds), len(md.trn_ds[0].text)

(1, 482972)

* `TEXT` also contains an extra attribute called `vocab` containing:
  * list of all unique items in vocab (`TEXT.vocab.itos`).
  * reverse mapping from each item to number (`TEXT.vocab.stoi`).
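
To make the two mappings concrete, here is a minimal sketch of how such a vocab could be built by hand. `build_vocab` is a hypothetical helper, and the ordering (special tokens first, then descending frequency with alphabetical tie-breaks) is an assumption for illustration rather than a statement about Torchtext internals.

```python
from collections import Counter


def build_vocab(tokens, min_freq=1, specials=('<unk>', '<pad>')):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tokens)
    # Most common first; ties are broken alphabetically so the order is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, n in ordered if n >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


itos, stoi = build_vocab(list('yo wassup'))
# Characters that are not in the vocab fall back to the '<unk>' index.
ids = [stoi.get(ch, stoi['<unk>']) for ch in 'sup?']
print(itos)  # ['<unk>', '<pad>', 's', ' ', 'a', 'o', 'p', 'u', 'w', 'y']
print(ids)   # [2, 7, 6, 0]
```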

In [53]:
TEXT.vocab.itos

['<unk>',
 '<pad>',
 ' ',
 'e',
 't',
 'i',
 'a',
 'o',
 'n',
 's',
 'r',
 'h',
 'l',
 'd',
 'c',
 'u',
 'f',
 'm',
 'p',
 'g',
 ',',
 'y',
 'w',
 'b',
 'v',
 '-',
 '.',
 '"',
 'k',
 'x',
 ';',
 ':',
 'q',
 'j',
 '!',
 '?',
 '(',
 ')',
 "'",
 'z',
 '1',
 '2',
 '=',
 '_',
 '3',
 '[',
 ']',
 '4',
 '5',
 '6',
 '8',
 '7',
 '9',
 '0',
 'ä']

In [56]:
TEXT.vocab.stoi

defaultdict(<function torchtext.vocab._default_unk_index()>,
            {'<unk>': 0,
             '<pad>': 1,
             ' ': 2,
             'e': 3,
             't': 4,
             'i': 5,
             'a': 6,
             'o': 7,
             'n': 8,
             's': 9,
             'r': 10,
             'h': 11,
             'l': 12,
             'd': 13,
             'c': 14,
             'u': 15,
             'f': 16,
             'm': 17,
             'p': 18,
             'g': 19,
             ',': 20,
             'y': 21,
             'w': 22,
             'b': 23,
             'v': 24,
             '-': 25,
             '.': 26,
             '"': 27,
             'k': 28,
             'x': 29,
             ';': 30,
             ':': 31,
             'q': 32,
             'j': 33,
             '!': 34,
             '?': 35,
             '(': 36,
             ')': 37,
             "'": 38,
             'z': 39,
             '1': 40,
             '2': 41,
             '=':

In [68]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        
        # Check for final batch (might not equal batch size)
        if self.h.size(1) != bs:
            self.init_hidden(bs)

        outp, h = self.rnn(self.e(cs), self.h)
        
        # Get rid of history
        self.h = repackage_var(h)
        
        # Have to use `.view` to make output rank 2 tensor which is what's required by loss func.
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = V(torch.zeros(1, bs, n_hidden))

In [69]:
m = CharSeqStatefulRnn(md.nt, n_fac, 512)
opt = optim.Adam(m.parameters(), 1e-3)

* Output uses `.view` because loss function can't handle rank 3 tensors: has to be rank 2 (or rank 4).
* `dim` arg is required in `log_softmax` for PyTorch >= 0.3.

In [70]:
fit(m, md, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.880812   1.850538  
    1      1.702586   1.708077                              
    2      1.628063   1.638349                              
    3      1.570418   1.602545                              



[1.6025455]

In [71]:
set_lrs(opt, 1e-4)

In [72]:
fit(m, md, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.493271   1.558022  
    1      1.494684   1.553495                              
    2      1.493436   1.55244                               
    3      1.48739    1.543987                              



[1.5439872]

### 00:42:49 - RNN unpacked

In [78]:
# Roughly equivalent to what PyTorch is doing

def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))

In [81]:
class CharSeqStatefulRnn2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs:
            self.init_hidden(bs)
            
        o = self.h
        
        outp = []
        for c in cs:
            o = self.rnn(self.e(c), o)
            outp.append(o)
        outp = self.l_out(torch.stack(outp))
        
        self.h = repackage_var(o)
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = V(torch.zeros(1, bs, n_hidden))

In [80]:
m = CharSeqStatefulRnn2(md.nt, n_fac, 512)
opt = optim.Adam(m.parameters(), 1e-3)

### 00:44:05 - Why use Tanh (audience question)

* Shape is similar to sigmoid but goes between -1 and 1.
* Force it into a range to help avoid a gradient explosion, which may happen with a `relu`.
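
A small sketch of the bounding effect, assuming a hypothetical one-unit recurrence `h = activation(w * h)` with `w = 2` (`run_recurrence` is a made-up helper). It shows only the forward activations, not the gradients, but it illustrates why an unbounded activation can run away when the same weight is applied over and over.

```python
import math


def relu(x):
    return max(0.0, x)


def run_recurrence(activation, w=2.0, h0=0.5, steps=20):
    """Apply h = activation(w * h) repeatedly and return the final value."""
    h = h0
    for _ in range(steps):
        h = activation(w * h)
    return h


print(run_recurrence(relu))       # doubles every step: 0.5 * 2**20
print(run_recurrence(math.tanh))  # squashed every step, stays inside (-1, 1)
```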


## 00:46:43 - GRUCell

* `RNNCell` is not commonly used in practice - even with `tanh` activation function, it tends to have trouble with gradient explosion.
* Instead: use `GRUCell`:

<img src="http://www.wildml.com/wp-content/uploads/2015/10/Screen-Shot-2015-10-23-at-10.36.51-AM.png" width=400px>

GRU Gating. Chung, Junyoung, et al. “Empirical evaluation of gated recurrent neural networks on sequence modeling.” (2014)

* Contains a neural net in a neural net that learns how much to remember of hidden state:
  * If you see a full stop, throw away hidden state (for example).
* `r` (reset gate):
  * input normally gets multiplied by weight matrix to create new activations.
  * With reset gate, input doesn't get added to previous activation directly.
  * Previous activations are first multiplied by the reset gate (a value between 0 and 1) before being concatenated with the input.

  * Equation for reset: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
  
  * Equal to matrix product of a weight matrix and concat of hidden state and new input.
  * Basically a one layer neural net / logistic regression model: can be thought of as a neural network within a neural network.
  
* `z` (update gate)
  * decides to what degree you use your new hidden state vs leaving it how it was.
  * In other words, how important is it to remember the current input?
  * Equation for update: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
  
* Final update equation is a linear interpolation using $z_t$:

  $$
  \hat{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t]) \\
  h_t = (1 - z_t) * h_{t-1} + z_t * \hat{h}_t
  $$
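
The three equations can be traced with a scalar sketch. It assumes a one-dimensional hidden state and input, no biases, and made-up weights; `gru_step` is a hypothetical helper, and each weight pair stands in for a matrix applied to the concatenation of hidden state and input.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(h_prev, x, w_r, w_z, w):
    """One GRU step for a scalar hidden state and a scalar input.

    Each weight is a pair (weight on the hidden part, weight on x).
    """
    r = sigmoid(w_r[0] * h_prev + w_r[1] * x)          # reset gate
    z = sigmoid(w_z[0] * h_prev + w_z[1] * x)          # update gate
    h_hat = math.tanh(w[0] * (r * h_prev) + w[1] * x)  # candidate state
    return (1 - z) * h_prev + z * h_hat                # linear interpolation


h = 0.0
for x in (1.0, -0.5, 0.25):
    h = gru_step(h, x, w_r=(0.5, -1.0), w_z=(1.0, 2.0), w=(0.8, 1.5))
    print(round(h, 4))
```

When `z` is close to 0 the old hidden state passes through almost unchanged, and when it is close to 1 the state is replaced by the candidate.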
  
* Definition from PyTorch source code:

In [82]:
# Roughly equivalent to what PyTorch is doing

def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(input, w_ih, b_ih)
    gh = F.linear(hidden, w_hh, b_hh)
    i_r, i_i, i_n = gi.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)
    
    resetgate = F.sigmoid(i_r + h_r)
    inputgate = F.sigmoid(i_i + h_i)
    newgate = F.tanh(i_n + resetgate * h_n)
    return newgate + inputgate * (hidden - newgate)
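
The return expression in the cell above is also a linear interpolation, just written in a different form. Working only from that snippet, `newgate + inputgate * (hidden - newgate)` expands to `(1 - inputgate) * newgate + inputgate * hidden`, so in this code a gate value near 1 keeps the old hidden state. The check below uses made-up scalar values and hypothetical helper names.

```python
def pytorch_form(hidden, newgate, inputgate):
    """The expression as written in the GRUCell snippet."""
    return newgate + inputgate * (hidden - newgate)


def interpolation_form(hidden, newgate, inputgate):
    """The same quantity written as an explicit interpolation."""
    return (1 - inputgate) * newgate + inputgate * hidden


for gate in (0.0, 0.25, 1.0):
    a = pytorch_form(0.8, -0.4, gate)
    b = interpolation_form(0.8, -0.4, gate)
    print(gate, a, b)
```

Note that the gate's role is flipped relative to the update equation in these notes, where $z_t$ near 1 selects the new candidate state instead.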

### 00:52:36 - Replace RNNCell with GRUCell in char model