# Notebook 2: Training

## Setup

First I import all the libraries I'll need:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import re
from collections import Counter
import random
from IPython.display import clear_output
import math

Next I set random seeds for all sources of randomness in this notebook, so that the results will be the same each time it is run:

In [2]:
random.seed(1)
torch.manual_seed(1)

<torch._C.Generator at 0x7f48e80b2dd0>

## Get Data

Since I downloaded the book I'll be using as a data source in the notebook `1-Download`, I can simply load the text file into this notebook from the current directory:

In [3]:
with open('reason.txt') as f:
    fulltext = f.read()

I don't want the text before or after the book's actual contents, so I find the start and end points and trim the document string:

In [4]:
start = fulltext.index('PREFACE TO THE FIRST EDITION 1781')
start

7902

In [5]:
end = fulltext.index('End of the Project Gutenberg EBook')
end

1270832

In [6]:
fulltext = fulltext[start:end]

Finally, I make the whole text lowercase so that capitalization doesn't play a role in tokenization:

In [7]:
fulltext = fulltext.lower()

## Tokenization

I manually tokenize the text with a few simple rules:
- Remove all underscores, since they don't contribute any semantic meaning
- Remove all numbers, since I only want words for my text generator
- Only include full words (possibly with an apostrophe) and certain common punctuation that is useful for sentence structure

In [8]:
def tokenize(text):
    text = re.sub('_', '', text) # remove all underscores
    text = re.sub(r'\d+', '', text) # remove all numbers (raw string so \d is not treated as a string escape)
    fulltoks = re.findall(r"[\w']+|[,.?;:]", text) # only include full words and wanted punctuation
    return fulltoks
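
As a quick sanity check, a sketch like the following shows what the three rules do to a short string. The sample sentence is made up for illustration (it is not taken from the book), and the function is repeated so the snippet runs by itself:

```python
import re

def tokenize(text):
    text = re.sub('_', '', text)     # remove all underscores
    text = re.sub(r'\d+', '', text)  # remove all numbers
    # keep full words (possibly with an apostrophe) and the wanted punctuation
    return re.findall(r"[\w']+|[,.?;:]", text)

# made-up sample with underscores, a number, an apostrophe and punctuation
sample = "the _a priori_ cognition, in 1781, isn't empirical."
toks = tokenize(sample)
print(toks)
# ['the', 'a', 'priori', 'cognition', ',', 'in', ',', "isn't", 'empirical', '.']
```

Note that removing a number leaves its surrounding punctuation behind, which is why two commas appear in a row here.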

In [9]:
fulltoks = tokenize(fulltext)

I create a vocabulary using the `Counter` class from Python's standard library. Its `most_common()` method returns the tokens ranked from most to least frequent:

In [10]:
vocab = [tok for tok,count in Counter(fulltoks).most_common()]

I'd like to only use the top 3000 most common tokens:

In [11]:
vocab = vocab[:3000]
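
To see what this ranking and cut-off do, here is a small sketch on a toy token list (the tokens and the cut-off of 2 are made up for illustration). It also computes how much of the text the truncated vocabulary covers, which is one way to judge whether a cut-off such as 3000 is reasonable:

```python
from collections import Counter

# toy token list standing in for fulltoks
toy_toks = ['the', 'of', 'the', ',', 'reason', 'the', 'of', ',', ',', ',']

counts = Counter(toy_toks)
ranked = [tok for tok, count in counts.most_common()]  # most frequent first
top2 = ranked[:2]                                      # keep only the 2 most common

# fraction of all token occurrences that fall inside the truncated vocabulary
coverage = sum(counts[tok] for tok in top2) / len(toy_toks)
print(ranked, top2, coverage)
# [',', 'the', 'of', 'reason'] [',', 'the'] 0.7
```

Every token outside the truncated vocabulary is later treated as unknown, so the coverage figure is also the share of tokens that keep their identity.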

## Transforms

These classes are defined in `helpers.py` and are loosely based on the concept of transforms from the [fastai library](https://github.com/fastai/fastai). I'm using them to store the token vocabulary and for numericalization (converting lists of tokens to tensors of numbers, and vice versa):

In [12]:
from helpers import BaseTransform, TokTransform

In [13]:
tok_tfm = TokTransform(vocab)
data = tok_tfm.encode(fulltoks) # numericalize the full dataset
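
Since `helpers.py` isn't shown in this notebook, here is a rough sketch of what a numericalizing transform could look like. The class name `SketchTokTransform`, the `itos`/`stoi` attributes and the `decode` method are my own assumptions for illustration. Only `encode`, `count` and the use of index 0 for `'xxunk'` are taken from how the object is used in this notebook:

```python
import torch

class SketchTokTransform:
    "Hypothetical stand-in for TokTransform; the real class lives in helpers.py"
    def __init__(self, vocab):
        # assumption: index 0 is reserved for unknown tokens ('xxunk')
        self.itos = ['xxunk'] + list(vocab)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.count = len(self.itos)  # vocabulary size, including 'xxunk'

    def encode(self, toks):
        # tokens -> tensor of indices; out-of-vocabulary tokens map to 0
        return torch.tensor([self.stoi.get(tok, 0) for tok in toks])

    def decode(self, nums):
        # tensor of indices -> list of tokens
        return [self.itos[i] for i in nums.tolist()]

tfm = SketchTokTransform(['the', 'of', ','])
nums = tfm.encode(['the', 'reason', 'of', ','])
print(nums, tfm.decode(nums))
```

Here `'reason'` is not in the toy vocabulary, so it is encoded as 0 and decoded back as `'xxunk'`.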

Since I'm training a language model, I want the model to predict the token that follows each input token. This means the inputs include everything but the last token (since the last token has no next token) and the targets include everything but the first token (since the first token does not follow anything):

In [14]:
x = data[:-1] # the inputs include everything but the last token
y = data[1:] # the outputs include everything but the first token
assert len(x) == len(y)
assert len(data) == len(x)+1
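
The one-position shift is easier to see on a tiny example. This sketch uses a made-up five-element tensor in place of the real data:

```python
import torch

data = torch.tensor([10, 11, 12, 13, 14])  # stand-in for the numericalized text

x = data[:-1]  # inputs:  everything but the last token
y = data[1:]   # targets: everything but the first token

# each target is the token that comes right after its input
for inp, targ in zip(x.tolist(), y.tolist()):
    print(inp, '->', targ)
```

At every position `i`, `y[i]` is the same element as `data[i+1]`, so the pairs line up as (current token, next token).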

## Chunking

I want to break the full dataset into chunks of 50 tokens, instead of one long tensor. Then I will have a long list of tensors that are each 50 tokens long. These will be the "items" that I use as inputs and group together into batches:

In [15]:
chunk_sz = 50

In [16]:
n_chunks = len(x) // chunk_sz # number of full chunks
n_chunks

4694

In [17]:
# trim the leftover tokens so the data divides evenly, then split it into chunks
x = x[:n_chunks*chunk_sz].chunk(n_chunks)
y = y[:n_chunks*chunk_sz].chunk(n_chunks)

In [18]:
assert len(x) == len(y)
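
Here is the same trim-then-chunk step on a toy tensor, with a made-up chunk size of 3. The trimming matters because `Tensor.chunk` returns a shorter last chunk when the length does not divide evenly:

```python
import torch

chunk_sz = 3
x = torch.arange(11)           # 11 tokens: not a multiple of chunk_sz

n_chunks = len(x) // chunk_sz  # 3 full chunks
x = x[:n_chunks * chunk_sz]    # drop the 2 leftover tokens
chunks = x.chunk(n_chunks)     # tuple of 3 tensors, each of length 3

for c in chunks:
    print(c.tolist())
```

After trimming, every chunk has exactly `chunk_sz` elements, so the chunks can later be stacked into rectangular batches.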

## Training / Validation Split

I want to shuffle all the chunks of text before splitting them into training and validation sets:

In [19]:
def shuffle_same(x_set, y_set):
    "Shuffle both x_set and y_set, but keep them lined up with each other"
    zipped = list(zip(x_set, y_set))
    random.shuffle(zipped)
    return list(zip(*zipped))

In [20]:
x,y = shuffle_same(x,y)
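
A quick way to convince myself that the pairs stay aligned is to run the function on toy lists where the pairing is obvious. The letter lists below are made up for illustration:

```python
import random

def shuffle_same(x_set, y_set):
    "Shuffle both x_set and y_set, but keep them lined up with each other"
    zipped = list(zip(x_set, y_set))  # pair each x with its y
    random.shuffle(zipped)            # shuffle the pairs in place
    return list(zip(*zipped))         # unzip back into two tuples

random.seed(1)
xs = ['a', 'b', 'c', 'd']
ys = ['A', 'B', 'C', 'D']
xs_shuf, ys_shuf = shuffle_same(xs, ys)
print(xs_shuf, ys_shuf)
```

Whatever order comes out, each lowercase letter still sits at the same position as its uppercase partner. The function returns tuples rather than lists, which is fine for slicing.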

Then I'll simply take the first 80% of the pairs for the training set, and the last 20% for the validation set:

In [21]:
len(x)

4694

In [22]:
cut = int(len(x) * .8)
cut

3755

In [23]:
range(cut), range(cut,len(x))

(range(0, 3755), range(3755, 4694))

In [24]:
x_train, x_val = x[:cut], x[cut:]
y_train, y_val = y[:cut], y[cut:]

In [25]:
assert len(x_train) == len(y_train)
assert len(x_val) == len(y_val)

## Dataloading

These classes are again based on the concepts from `fastai`. They hold the datasets plus a batch generator:

In [26]:
from helpers import DataLoader, DataLoaders

In [27]:
bs = 16
dl_train = DataLoader(x_train, y_train, bs)
dl_val = DataLoader(x_val, y_val, bs)
dls = DataLoaders(dl_train, dl_val)
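
The classes in `helpers.py` aren't shown here, so the following is only a sketch of what a minimal batch generator could look like. The class name `SketchDataLoader` and its internals are my own assumptions, not the actual implementation:

```python
import torch

class SketchDataLoader:
    "Hypothetical stand-in for helpers.DataLoader: holds a dataset and yields batches"
    def __init__(self, x, y, bs):
        self.x, self.y, self.bs = x, y, bs

    def __len__(self):
        # number of batches, counting a final partial batch
        return (len(self.x) + self.bs - 1) // self.bs

    def __iter__(self):
        for i in range(0, len(self.x), self.bs):
            # stack bs chunks into one (batch, chunk_len) tensor
            xb = torch.stack(list(self.x[i:i + self.bs]))
            yb = torch.stack(list(self.y[i:i + self.bs]))
            yield xb, yb

# toy dataset: 10 chunks of 5 tokens each
x = [torch.arange(5) + i for i in range(10)]
y = [t + 1 for t in x]
dl = SketchDataLoader(x, y, bs=4)
shapes = [tuple(xb.shape) for xb, yb in dl]
print(len(dl), shapes)
# 3 [(4, 5), (4, 5), (2, 5)]
```

Under this rule, 3755 training chunks with `bs = 16` would give 235 batches, the last one being partial.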

## Model

My custom model is created in `model.py`:

In [28]:
from model import LangModel

In [29]:
voc_sz = tok_tfm.count
emb_sz = 200
hid_sz = 300

In [30]:
model = LangModel(voc_sz, emb_sz, hid_sz)
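
`model.py` isn't reproduced in this notebook. For readers who want something concrete, here is a sketch of one common shape for a model with the same constructor arguments: an embedding, a single LSTM layer and a linear decoder. This architecture is an assumption on my part and may differ from the real `LangModel`:

```python
import torch
import torch.nn as nn

class SketchLangModel(nn.Module):
    "Hypothetical model; the architecture is assumed, not taken from model.py"
    def __init__(self, voc_sz, emb_sz, hid_sz):
        super().__init__()
        self.emb = nn.Embedding(voc_sz, emb_sz)               # token index -> vector
        self.rnn = nn.LSTM(emb_sz, hid_sz, batch_first=True)  # reads the sequence
        self.out = nn.Linear(hid_sz, voc_sz)                  # hidden state -> vocabulary scores

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))  # h: (batch, seq_len, hid_sz)
        return self.out(h)            # (batch, seq_len, voc_sz)

m = SketchLangModel(voc_sz=30, emb_sz=8, hid_sz=16)
xb = torch.randint(0, 30, (4, 5))  # batch of 4 chunks, 5 tokens each
preds = m(xb)
print(preds.shape)
# torch.Size([4, 5, 30])
```

The output has one score per vocabulary entry at every position, which is the shape the loss function in the next section flattens.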

## Loss

My loss function is simple cross-entropy, but first I need to flatten the predictions and the targets so that the PyTorch cross-entropy function will accept them. I also don't want to include unknown ('xxunk') tokens in the loss calculation:

In [31]:
def LM_loss(preds, targ):
    preds = preds.view(-1, tok_tfm.count)
    targ = targ.view(-1)
    # don't include xxunk indices
    preds = preds[targ!=0]
    targ = targ[targ!=0]
    return F.cross_entropy(preds, targ)
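
The flatten-and-mask steps can be checked on a toy batch. This sketch uses random predictions and made-up targets. It also compares the result with the `ignore_index` argument of `F.cross_entropy`, which skips the given target index and averages over the remaining ones:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
voc_sz = 6
preds = torch.randn(2, 4, voc_sz)                  # (batch, seq_len, voc_sz)
targ = torch.tensor([[1, 0, 3, 2], [0, 5, 4, 1]])  # 0 marks an unknown token

flat_preds = preds.view(-1, voc_sz)  # (batch*seq_len, voc_sz)
flat_targ = targ.view(-1)            # (batch*seq_len,)

# manual masking, as in LM_loss
keep = flat_targ != 0
masked = F.cross_entropy(flat_preds[keep], flat_targ[keep])

# built-in alternative: ignore target index 0
builtin = F.cross_entropy(flat_preds, flat_targ, ignore_index=0)
print(masked.item(), builtin.item())
```

Both versions average the loss over the six in-vocabulary targets only, so the two numbers agree.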

## Optimizer

I'm simply using the Adam optimizer from PyTorch, with a learning rate that seemed to work well:

In [32]:
opt = torch.optim.Adam(model.parameters(), lr=.001)

## Metrics (accuracy)

I'm using simple accuracy for my metric, but again not including unknown tokens since I only care about how many in-vocabulary tokens it predicts correctly:

In [33]:
def accuracy(preds, targ):
    # don't include xxunk indices
    preds = preds[targ!=0]
    targ = targ[targ!=0]
    nums = preds.argmax(dim=-1)
    return (nums == targ).float().mean()
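
A worked example with hand-written scores makes the masking visible. The numbers below are made up, and the function is repeated in a slightly condensed form so the snippet runs by itself:

```python
import torch

def accuracy(preds, targ):
    keep = targ != 0                       # don't include xxunk indices
    preds, targ = preds[keep], targ[keep]
    return (preds.argmax(dim=-1) == targ).float().mean()

preds = torch.tensor([[0.10, 0.80, 0.10],   # predicts 1
                      [0.20, 0.10, 0.70],   # predicts 2
                      [0.90, 0.05, 0.05],   # predicts 0
                      [0.10, 0.20, 0.70]])  # predicts 2
targ = torch.tensor([1, 1, 0, 2])           # third target is unknown

acc = accuracy(preds, targ)
print(acc.item())
```

The third row is dropped because its target is 0. Of the remaining three, the first and last are correct, so the accuracy is 2/3.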

## Learner

The learner class is again based on the concepts from `fastai` but much simpler. It just groups all the objects together that I need for the training process, along with training methods:

In [34]:
from helpers import Learner

In [35]:
learn = Learner(dls, model, LM_loss, opt, accuracy)
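
The `Learner` code lives in `helpers.py`, so I'm not showing it here. The sketch below is my guess at the general shape of such a training loop. The function name `sketch_fit` and the toy model and data are assumptions for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sketch_fit(model, train_batches, val_batches, loss_fn, opt, epochs):
    "Hypothetical training loop; returns (train_loss, val_loss) for each epoch"
    logs = []
    for _ in range(epochs):
        model.train()
        train_losses = []
        for xb, yb in train_batches:
            loss = loss_fn(model(xb), yb)
            loss.backward()   # compute gradients
            opt.step()        # update the parameters
            opt.zero_grad()   # reset gradients for the next batch
            train_losses.append(loss.item())
        model.eval()
        with torch.no_grad():  # no gradients needed for validation
            val_losses = [loss_fn(model(xb), yb).item() for xb, yb in val_batches]
        logs.append((sum(train_losses) / len(train_losses),
                     sum(val_losses) / len(val_losses)))
    return logs

# toy classifier and random data, just to exercise the loop
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=0.001)
batches = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(5)]
logs = sketch_fit(toy_model, batches[:4], batches[4:], F.cross_entropy, toy_opt, epochs=3)
print(logs)
```

Averaging the per-batch losses once per epoch gives one training loss and one validation loss per row, which is the layout of the log table printed further down.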

After some experimentation, I realized my accuracy usually goes down after 3 epochs, so that's how long I'll train the model:

In [36]:
learn.fit(3)

Batch 235/235


I use my logging method to see the losses and metric after each epoch:

In [37]:
learn.print_logs()

   Train loss  Val loss    Metric
0    5.291291  4.747541  0.218831
1    4.480110  4.473159  0.236761
2    4.182501  4.345974  0.244180


## Saving

Finally, I save the trained model and the `TokTransform` object (which I will need for text generation) so that I can use them in the next notebook, `3-Generation`:

In [38]:
!mkdir -p saves


In [39]:
torch.save(learn.model, 'saves/model')

In [40]:
import pickle

In [41]:
with open('saves/tok_tfm.p', 'wb') as f: # the with block closes the file after writing
    pickle.dump(tok_tfm, f)