# Image Captioning

## Part 2: Train a CNN-RNN Model

---

In this notebook, we will train our CNN-RNN model.  

- [Step 1](#step1): Training Setup
  - [1a](#1a): CNN-RNN architecture
  - [1b](#1b): Hyperparameters and other variables
  - [1c](#1c): Image transform
  - [1d](#1d): Data loader
  - [1e](#1e): Loss function, learnable parameters and optimizer


- [Step 2](#step2): Train and Validate the Model
  - [2a](#2a): Train for the first time
  - [2b](#2b): Resume training
  - [2c](#2c): Notes regarding model validation

<a id='step1'></a>
## Step 1: Training Setup

We will describe the model architecture, specify the hyperparameters, and set other options that matter for the training procedure. We will refer to [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for guidance.

<a id='1a'></a>
### CNN-RNN architecture

For the complete CNN-RNN model, see **model.py**. The encoder is a pre-trained ResNet, an architecture with a strong track record in image classification. The decoder is an RNN made up of an embedding layer, an LSTM layer and a fully-connected layer. LSTMs have been shown to work well for sequence generation.

<a id='1b'></a>
### Hyperparameters and other variables

In the next code cell, we will set the values for:

- `batch_size` - the number of image-caption pairs used to update the model weights in each training step. We will set it to `32`.
- `vocab_threshold` - the minimum word count threshold.  A larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary. We will set it to `5`, just like [this paper](https://arxiv.org/pdf/1411.4555.pdf).
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. This will be changed to `True` once we are done setting `vocab_threshold` and generating a `vocab.pkl` file.
- `embed_size` - the dimensionality of the image and word embeddings. We tried `512`, as done in [this paper](https://arxiv.org/pdf/1411.4555.pdf), but training took a long time, so we will set it to `256`.
- `hidden_size` - the number of features in the hidden state of the RNN decoder. We will use `512` based on [this paper](https://arxiv.org/pdf/1411.4555.pdf). The larger the number, the better the RNN model can memorize sequences. However, larger numbers can significantly slow down the training process.
- `num_epochs` - the number of epochs to train the model.  The dataset is large, so even 1 epoch takes a long time to complete. Therefore, we will set `num_epochs` to `1`. We will save the model and the optimizer every 100 training steps so that we can resume training from the last saved step.
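
To make the effect of `vocab_threshold` concrete, here is a minimal sketch of threshold-based vocabulary building. The helper `build_vocab` and the toy captions are hypothetical; the project's own `vocabulary.py` is not shown here and may differ, for example in how it tokenizes, whether the comparison is `>=` or `>`, and which special tokens it adds.

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times.

    Hypothetical helper for illustration; not the project's vocabulary.py.
    """
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)

# Toy captions, made up for illustration (not taken from MS COCO)
captions = [
    "a dog runs on the grass",
    "a dog sits on a bench",
    "a cat sleeps on the sofa",
]

small_vocab = build_vocab(captions, vocab_threshold=3)  # only frequent words survive
large_vocab = build_vocab(captions, vocab_threshold=1)  # every word is kept
print(len(small_vocab), len(large_vocab))
```

Raising the threshold from 1 to 3 shrinks this toy vocabulary to the two words that appear in at least three positions, which is the trade-off described for `vocab_threshold` above.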

In [1]:
# Watch for any changes in vocabulary.py, data_loader.py, utils.py or model.py, and re-load it automatically.
%load_ext autoreload
%autoreload 2

In [2]:
import torch
import torch.nn as nn
from torch.autograd import Variable
from torchvision import transforms
import sys
from pycocotools.coco import COCO
import math
import torch.utils.data as data
import numpy as np
import os
import requests
import time

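# save_epoch and early_stopping are called in the training cells below;
# they are assumed to be defined in utils.py alongside train and validate
from utils import train, validate, save_epoch, early_stopping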
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN

# Set values for the training variables
batch_size = 32         # batch size
vocab_threshold = 5     # minimum word count threshold
vocab_from_file = True  # if True, load existing vocab file
embed_size = 256        # dimensionality of image and word embeddings
hidden_size = 512       # number of features in hidden state of the RNN decoder
num_epochs = 1          # number of training epochs
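
As a rough illustration of why larger `embed_size` and `hidden_size` values slow training down, the sketch below counts the parameters of a single LSTM layer. It assumes the decoder uses one unidirectional LSTM layer whose input size equals `embed_size`, with the parameter layout of `torch.nn.LSTM` (four gates, two bias vectors); `model.py` is not shown here, so treat this as an estimate.

```python
def lstm_param_count(input_size, hidden_size):
    """Parameters in one single-layer, unidirectional LSTM (torch.nn.LSTM layout)."""
    gates = 4  # input, forget, cell and output gates
    input_weights = gates * hidden_size * input_size
    recurrent_weights = gates * hidden_size * hidden_size
    biases = 2 * gates * hidden_size  # nn.LSTM keeps two bias vectors per gate
    return input_weights + recurrent_weights + biases

# (embed_size, hidden_size) pairs; the first one matches the values set above
for embed_size, hidden_size in [(256, 512), (512, 512), (512, 1024)]:
    print(embed_size, hidden_size, lstm_param_count(embed_size, hidden_size))
```

The recurrent weights grow with the square of `hidden_size`, so doubling it roughly quadruples that part of the layer.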

<a id='1c'></a>
### Image transform

When setting this transform, we keep two things in mind:
- the images in the dataset have varying heights and widths, and 
- since we are using a pre-trained model, we must perform the corresponding appropriate normalization.

**Training set**: As seen in the following code cell, we will set the transform for training set as follows:

```python
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])
```

According to [this page](https://pytorch.org/docs/master/torchvision/models.html), ResNet, like the other pre-trained torchvision models, expects input images prepared as follows:
- The images are expected to have width and height of at least 224. The first and second transformations resize and crop the images to 224 x 224:
```python
transforms.Resize(256),                          # smaller edge of image resized to 256
transforms.RandomCrop(224),                      # get 224x224 crop from random location
```
- The images have to be converted from a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0]:
```python
transforms.ToTensor(),                           # convert the PIL Image to a tensor
```
- Then they are normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. This is achieved using the last transformation step:
```python
transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                     (0.229, 0.224, 0.225))
```
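
The arithmetic behind the last two steps can be checked by hand on a single pixel. The sketch below uses plain Python instead of torchvision, and the helper names `to_unit_range` and `normalize_pixel` are made up for illustration: it first scales 8-bit values to [0.0, 1.0] and then applies `(value - mean) / std` per channel.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def to_unit_range(rgb_uint8):
    """Scale 8-bit channel values in [0, 255] to floats in [0.0, 1.0]."""
    return tuple(v / 255 for v in rgb_uint8)

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))

black = normalize_pixel(to_unit_range((0, 0, 0)))
white = normalize_pixel(to_unit_range((255, 255, 255)))
mean_colour = normalize_pixel(MEAN)  # a pixel equal to the mean maps to zero

print(black, white, mean_colour)
```

After normalization the values are no longer confined to [0.0, 1.0]: black pixels become negative and white pixels exceed 2 in every channel.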

The data augmentation step `transforms.RandomHorizontalFlip()` can improve accuracy on image classification tasks, as reported in [this paper](http://cs231n.stanford.edu/reports/2017/pdfs/300.pdf).

**Validation set**: We won't use the image augmentation step, i.e. `RandomHorizontalFlip()`, and will use `CenterCrop()` instead of `RandomCrop()`.
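
The difference between the two crops is easiest to see in terms of geometry. The following sketch is plain Python, not torchvision code: the three helpers are hypothetical and only approximate what `Resize(256)`, `CenterCrop(224)` and `RandomCrop(224)` do to image coordinates, so rounding may differ from torchvision by a pixel.

```python
import random

def resize_smaller_edge(width, height, size=256):
    """Scale so the smaller edge equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size

def center_crop_box(width, height, crop=224):
    """(left, top, right, bottom) of a crop centred in the image."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

def random_crop_box(width, height, crop=224, rng=random):
    """(left, top, right, bottom) of a crop at a random valid location."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop

# A 640x480 image is an arbitrary example size
resized = resize_smaller_edge(640, 480)
val_box = center_crop_box(*resized)                           # same box every time
train_box = random_crop_box(*resized, rng=random.Random(0))   # varies with the seed
print(resized, val_box, train_box)
```

The center crop is deterministic, which keeps validation scores comparable between runs, while the random crop gives the training set a slightly different view of each image on every pass.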

In [3]:
# Define a transform to pre-process the training images
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Define a transform to pre-process the validation images
transform_val = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.CenterCrop(224),                      # get 224x224 crop from the center
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

<a id='1d'></a>
### Data loader
We will build data loaders for training and validation sets, applying the above image transforms. We will then get the size of the vocabulary from the `train_loader`, and use it to initialize our `encoder` and `decoder`.

In [4]:
# Build data loader, applying the transforms
train_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)
val_loader = get_loader(transform=transform_val,
                         mode='val',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)


# The size of the vocabulary
vocab_size = len(train_loader.dataset.vocab)

# Initialize the encoder and decoder
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available
if torch.cuda.is_available():
    encoder.cuda()
    decoder.cuda()

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...


  0%|          | 0/414113 [00:00<?, ?it/s]

Done (t=0.61s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [00:51<00:00, 8017.16it/s]


Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...


  0%|          | 0/202654 [00:00<?, ?it/s]

Done (t=0.29s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 202654/202654 [00:24<00:00, 8158.09it/s]


<a id='1e'></a>
### Loss function, learnable parameters and optimizer

**Loss function**: We will use `CrossEntropyLoss()`.
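
For intuition, cross-entropy for a single predicted word is the negative log-probability that the softmax over the vocabulary assigns to the correct word. The sketch below computes that quantity in plain Python for a toy four-word vocabulary; the function `cross_entropy` is an illustrative stand-in, whereas `nn.CrossEntropyLoss` works on batches of raw scores and averages over them by default.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

# Toy scores over a 4-word vocabulary; the correct word has index 2
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)    # no preference
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], target_index=2)  # favours the target
print(uniform, confident)
```

With uniform scores the loss equals the log of the vocabulary size, and it approaches zero as the model puts more probability on the correct word.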

**Learnable parameters**: According to [this paper](https://arxiv.org/pdf/1411.4555.pdf), the "loss is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN and word embeddings." We will follow this strategy and choose the parameters accordingly. This makes sense for two reasons:
- the EncoderCNN in this project uses ResNet which has been pre-trained on an image classification task. So we don't have to optimize the parameters of the entire network again for a similar image classification task. We only need to optimize the top layer whose outputs are fed into the DecoderRNN.
- the DecoderRNN is not a pre-trained network, so we have to optimize all its parameters.

**Optimizer**: According to [this paper](https://arxiv.org/pdf/1502.03044.pdf), the Adam optimizer worked best on the MS COCO dataset. Therefore, we will use it.

In [5]:
# Define the loss function
criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()

# Specify the learnable parameters of the model
params = list(decoder.parameters()) + list(encoder.embed.parameters()) 

# Define the optimizer
optimizer = torch.optim.Adam(params=params, lr=0.001)

<a id='step2'></a>
## Step 2: Train and Validate the Model

At the beginning of this notebook, we imported the `train` function and the `validate` function from `utils.py`. To see how well our model is doing, we will print the training loss and perplexity during training. We will try to limit overfitting by monitoring the model's performance, i.e. the Bleu-4 score, on the validation dataset.
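
The relationship between the two printed quantities can be sketched as follows, assuming perplexity is computed as the exponential of the average cross-entropy loss (in nats). `utils.py` is not shown here, but this assumption reproduces the loss and perplexity pairs in the logs further down this notebook up to rounding.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity as exp(loss); assumes the loss is an average in nats."""
    return math.exp(cross_entropy_loss)

# (loss, logged perplexity) pairs copied from this notebook's output
logged = [(2.3676, 10.6715), (1.9856, 7.2833), (4.0215, 55.7821)]
for loss, logged_perplexity in logged:
    print(loss, round(perplexity(loss), 4), logged_perplexity)
```

A perplexity of about 10 can be read as the model being, on average, as uncertain as if it were choosing uniformly among 10 words at each step.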

It will take a long time to train and validate the model. Therefore we will split the training procedure into two parts: first, we will train the model for the first time and save it every 100 steps; then we will resume training as many times as we like, or until the early stopping criterion is satisfied. We will save the model and optimizer weights in the `models` subdirectory. We will do the same for the validation procedure.

First, let's calculate the total number of training and validation steps per epoch.

In [6]:
# Set the total number of training and validation steps per epoch
total_train_step = math.ceil(len(train_loader.dataset.caption_lengths) / train_loader.batch_sampler.batch_size)
total_val_step = math.ceil(len(val_loader.dataset.caption_lengths) / val_loader.batch_sampler.batch_size)
print ("Number of training steps:", total_train_step)
print ("Number of validation steps:", total_val_step)
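
The step counts follow directly from the dataset sizes. As a quick check in plain Python, using the caption counts shown in the data loader progress bars above (414113 for training, 202654 for validation) and the batch size of 32 (the function name `steps_per_epoch` is only illustrative):

```python
import math

def steps_per_epoch(num_captions, batch_size):
    """Number of batches needed to see every caption once (last batch may be partial)."""
    return math.ceil(num_captions / batch_size)

train_steps = steps_per_epoch(414113, 32)
val_steps = steps_per_epoch(202654, 32)
print(train_steps, val_steps)
```

These are the totals that appear in the progress messages, e.g. `Step [100/12942]` for training and `Step [10/6333]` for validation.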

<a id='2a'></a>
### Train for the first time

Run the cell below if training for the first time or training continuously without a break. To resume training, skip this cell and run the one below it.

In [12]:
# Keep track of train and validation losses and validation Bleu-4 scores by epoch
train_losses = []
val_losses = []
val_bleus = []
# Keep track of the current best validation Bleu score
best_val_bleu = float("-INF")

start_time = time.time()
for epoch in range(1, num_epochs + 1):
    train_loss = train(train_loader, encoder, decoder, criterion, optimizer, 
                       vocab_size, epoch, total_train_step)
    train_losses.append(train_loss)
    val_loss, val_bleu = validate(val_loader, encoder, decoder, criterion,
                                  train_loader.dataset.vocab, epoch, total_val_step)
    val_losses.append(val_loss)
    val_bleus.append(val_bleu)
    if val_bleu > best_val_bleu:
        print ("Validation Bleu-4 improved from {:0.4f} to {:0.4f}, saving model to best-model.pkl".
               format(best_val_bleu, val_bleu))
        best_val_bleu = val_bleu
        filename = os.path.join("./models", "best-model.pkl")
        save_epoch(filename, encoder, decoder, optimizer, train_losses, val_losses, 
                   val_bleu, val_bleus, epoch)
    else:
        print ("Validation Bleu-4 did not improve, saving model to model-{}.pkl".format(epoch))
    # Save the entire model anyway, regardless of being the best model so far or not
    filename = os.path.join("./models", "model-{}.pkl".format(epoch))
    save_epoch(filename, encoder, decoder, optimizer, train_losses, val_losses, 
               val_bleu, val_bleus, epoch)
    print ("Epoch [%d/%d] took %ds" % (epoch, num_epochs, time.time() - start_time))
    if epoch > 5:
        # Stop if the validation Bleu doesn't improve for 3 epochs
        if early_stopping(val_bleus, 3):
            break
    start_time = time.time()

Epoch [1/1], Step [100/12942], Train loss: 2.3676, Train perplexity: 10.6715
Epoch [1/1], Step [10/6333], Val loss: 1.9856, Val perplexity: 7.2833, Val Bleu-4: 0.1056
Validation Bleu-4 improved from -inf to 0.0001666988058694744, saving models to best-encoder.pkl and best-decoder.pkl
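
The training loop above calls `early_stopping(val_bleus, 3)` to stop when the validation Bleu-4 has not improved for 3 epochs. The real implementation is not shown in this notebook; one possible reading of that rule, written as a hypothetical `early_stopping_sketch`, is:

```python
def early_stopping_sketch(val_bleus, patience=3):
    """Return True if none of the last `patience` scores beats the earlier best.

    Hypothetical stand-in for the early_stopping helper used in the loop above.
    """
    if len(val_bleus) <= patience:
        return False  # not enough history to judge yet
    best_before = max(val_bleus[:-patience])
    return max(val_bleus[-patience:]) <= best_before

# Made-up Bleu-4 histories, one score per epoch
plateaued = [0.10, 0.12, 0.15, 0.14, 0.15, 0.13]
improving = [0.10, 0.12, 0.13, 0.14, 0.15, 0.16]
print(early_stopping_sketch(plateaued), early_stopping_sketch(improving))
```

In this version a score that only ties the earlier best does not count as an improvement; whether ties should reset the counter is a design choice.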


<a id='2b'></a>
### Resume training

Resume training if the model has already been trained and saved. There are two ways to load a checkpoint, depending on where we are in the process:
1. If we are in the middle of an epoch and have previously saved a model, we load the model from the latest training step, e.g. train-model-14000.pkl, which means the model was saved for epoch 1 at training step 4000.
2. If we have finished validating one epoch and are about to train the next, we load the model saved by the validation process below. In this case we need to reset `start_loss` and `start_step` to 0.0 and 1 respectively.

We will modify the code cell below depending on where we are in the training process.

In [18]:
# Load the last checkpoints
checkpoint = torch.load(os.path.join('./models', 'train-model-76500.pkl'))

# Load the pre-trained weights
encoder.load_state_dict(checkpoint['encoder'])
decoder.load_state_dict(checkpoint['decoder'])
optimizer.load_state_dict(checkpoint['optimizer'])

# Load start_loss from checkpoint if in the middle of training process; otherwise, comment it out
start_loss = checkpoint['total_loss']
# Reset start_loss to 0.0 if starting a new epoch; otherwise comment it out
#start_loss = 0.0

# Load epoch. Add 1 if we start a new epoch
epoch = checkpoint['epoch']
# Load start_step from checkpoint if in the middle of training process; otherwise, comment it out
start_step = checkpoint['train_step'] + 1
# Reset start_step to 1 if starting a new epoch; otherwise comment it out
#start_step = 1

# Train 1 epoch at a time due to very long training time
train_loss = train(train_loader, encoder, decoder, criterion, optimizer, 
                   vocab_size, epoch, total_train_step, start_step, start_loss)

Epoch [1/1], Step [100/12942], Loss: 4.0215, Perplexity: 55.7821
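
The comment-and-uncomment switch in the resume cell could also be expressed as a small function. This is only a sketch: `resume_state` is a hypothetical helper, the checkpoint keys (`epoch`, `train_step`, `total_loss`) are the ones read in the cell above, and the checkpoint values below are invented.

```python
def resume_state(checkpoint, starting_new_epoch):
    """Return (epoch, start_step, start_loss) for the next call to train().

    Hypothetical helper; mirrors the two cases described under "Resume training".
    """
    if starting_new_epoch:
        # Previous epoch is finished: move to the next one and reset the counters
        return checkpoint["epoch"] + 1, 1, 0.0
    # Mid-epoch: continue right after the last saved training step
    return checkpoint["epoch"], checkpoint["train_step"] + 1, checkpoint["total_loss"]

# Stand-in for the dict returned by torch.load (values are made up)
checkpoint = {"epoch": 7, "train_step": 6500, "total_loss": 13500.0}

mid_epoch = resume_state(checkpoint, starting_new_epoch=False)
new_epoch = resume_state(checkpoint, starting_new_epoch=True)
print(mid_epoch, new_epoch)
```

Keeping both cases in one function makes it harder to reset one of `start_step` and `start_loss` while forgetting the other.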


Now that we have completed training an entire epoch, we will save the necessary information. We will load the pre-trained weights from the last training step (`train-model-{epoch}12900.pkl`), `best_val_bleu` from `best-model.pkl`, and the rest from `model-{epoch}.pkl`. We will then append `train_loss` to the list `train_losses` and save the information needed for the epoch.

In [None]:
# Load checkpoints
train_checkpoint = torch.load(os.path.join('./models', 'train-model-712900.pkl'))
epoch_checkpoint = torch.load(os.path.join('./models', 'model-6.pkl'))
best_checkpoint = torch.load(os.path.join('./models', 'best-model.pkl'))

# Load the pre-trained weights and epoch from the last train step
encoder.load_state_dict(train_checkpoint['encoder'])
decoder.load_state_dict(train_checkpoint['decoder'])
optimizer.load_state_dict(train_checkpoint['optimizer'])
epoch = train_checkpoint['epoch']

# Load from the previous epoch
train_losses = epoch_checkpoint['train_losses']
val_losses = epoch_checkpoint['val_losses']
val_bleus = epoch_checkpoint['val_bleus']

# Load from the best model
best_val_bleu = best_checkpoint['val_bleu']

train_losses.append(train_loss)
print (train_losses, val_losses, val_bleus, best_val_bleu)
print ("Training completed for epoch {}, saving model to train-model-{}.pkl".format(epoch, epoch))
filename = os.path.join("./models", "train-model-{}.pkl".format(epoch))
save_epoch(filename, encoder, decoder, optimizer, train_losses, val_losses, 
           best_val_bleu, val_bleus, epoch)

<a id='2c'></a>
### Notes regarding model validation

- Another way to validate a model is to create a JSON file, such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json), containing the model's predicted captions for the validation images. Then write a script, or use one [available online](https://github.com/tylin/coco-caption), to calculate the BLEU score of the model.
- Other evaluation metrics (such as METEOR and CIDEr) are mentioned in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf).
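
A results file of the kind linked in the first note is a JSON list with one entry per image, each holding an `image_id` and a `caption`. The sketch below builds such a list from a dictionary of predictions; the helper `build_results`, the image ids and the captions are made up for illustration.

```python
import json

def build_results(predictions):
    """Convert {image_id: caption} into a list of {"image_id", "caption"} entries."""
    return [{"image_id": int(image_id), "caption": caption}
            for image_id, caption in sorted(predictions.items())]

# Made-up predictions; real ids would come from the validation annotations
predictions = {42: "a dog runs on the grass", 7: "a cat sleeps on a sofa"}

results_json = json.dumps(build_results(predictions), indent=2)
print(results_json)
```

The resulting string can be written to a file and passed to an evaluation script, which compares each predicted caption with the reference captions for the same `image_id`.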


## Next steps

A few things that we may try in the future to improve model performance:

- Adjust learning rate: make it decay over time, as in [this example](https://github.com/pytorch/examples/blob/master/imagenet/main.py).
- Perform batch normalization.
- Run the code on a GPU so that we can train the model for longer. Currently we train for only up to 1800 steps of the first epoch.
- Update the way we save checkpoints: save and load the correct epoch #, save the best validation Bleu-4 and val_bleus.
- Update **Resume training** to start from the correct epoch and/or training step. We also need to ensure we load the latest best validation Bleu-4. 
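
As an illustration of the first item, a step decay schedule multiplies the learning rate by a fixed factor every few epochs. The sketch below is plain Python with a hypothetical helper `step_decay_lr`; the drop factor and interval are illustrative values, not settings tested in this project. In PyTorch the same schedule is available as `torch.optim.lr_scheduler.StepLR`.

```python
def step_decay_lr(initial_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    """Learning rate after multiplying by `drop_factor` every `epochs_per_drop` epochs."""
    return initial_lr * (drop_factor ** (epoch // epochs_per_drop))

# Starting from the learning rate used for Adam in this notebook
schedule = {epoch: step_decay_lr(0.001, epoch) for epoch in (0, 29, 30, 65)}
print(schedule)
```

Because each epoch takes so long here, a much shorter interval than 30 epochs (or a per-step schedule) would probably be needed for the decay to have any effect.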