# Training Network

In this notebook, we will train the CNN-RNN model.  

We can try out many different architectures and hyperparameters when searching for a good model.

Outline of this notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Training the Model
- [Step 3](#step3): Validating the Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, we customize the training of the CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure. The values we set now will be used when training the model in **Step 2** below.

### Parameters

We begin by setting the following variables:
- `batch_size` - the batch size of each training batch. It is the number of image-caption pairs used to update the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold. Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary. 
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.
- `hidden_size` - the number of features in the hidden state of the RNN decoder.
- `num_epochs` - the number of epochs to train the model. We set `num_epochs=3`, but feel free to increase or decrease this number. [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but we'll soon see that we can get reasonable results in a matter of a few hours! (_But of course, if we want to compete with current research, we will have to train for much longer._)
- `save_every` - determines how often to save the model weights. We set `save_every=1`, to save the model weights after each epoch. This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training. Note that we probably **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected! The default value of `100` avoids clogging the notebook; the run recorded below sets `print_every = 1`, which prints every step.
- `log_file` - the name of the text file containing, for every step, how the loss and perplexity evolved during training.
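
To make the effect of `vocab_threshold` concrete, here is a minimal sketch that counts the words in a toy set of captions and keeps those that occur at least `threshold` times. The captions and the `build_vocab` helper are hypothetical and only illustrate the idea; the project's actual vocabulary code may differ, for example in tokenization and in how special tokens are handled.

```python
from collections import Counter


def build_vocab(captions, threshold):
    """Return the sorted words that appear at least `threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, count in counts.items() if count >= threshold)


# Toy captions, made up for illustration.
captions = [
    "a dog runs on the grass",
    "a cat sits on the sofa",
    "a dog sits on the grass",
]

small_vocab = build_vocab(captions, threshold=3)  # only the most frequent words survive
large_vocab = build_vocab(captions, threshold=1)  # every word is kept

print(small_vocab)
print(len(small_vocab), len(large_vocab))
```

Raising the threshold from 1 to 3 shrinks this toy vocabulary from nine words to three, which mirrors the trade-off described for `vocab_threshold` above.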


### Image Transformations

When modifying this transform, keep in mind that:
- the images in the dataset have varying heights and widths, and 
- when using a pre-trained model, we must apply the normalization that the model was trained with.


We use the `transform_train` described in the previous notebook. Our CNN encoder uses the ResNet architecture. In the original [ResNet](https://arxiv.org/pdf/1512.03385.pdf) paper, the training preprocessing rescales the shorter edge of each image, takes a random 224x224 crop from the image or its horizontal flip, and normalizes the pixel values. To stay close to the conditions under which the pre-trained ResNet performs best, it makes sense to keep our image preprocessing and transforms the same as the original model's. We therefore use the default `transform_train`, which resizes the shorter edge to 256 before cropping:

```python
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])
```
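
`transforms.Normalize` applies `(x - mean) / std` independently to each channel. As a quick sanity check of the numbers above, the sketch below applies that formula to a single RGB pixel whose values are already scaled to `[0, 1]`, as `ToTensor` produces. The helper `normalize_pixel` is hypothetical and only mirrors the arithmetic; it is not part of torchvision.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means used in the transform above
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (x - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(MEAN))

# A pure white pixel maps to positive values a little above 2 in every channel.
print(normalize_pixel((1.0, 1.0, 1.0)))
```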

### Architecture Details

The architecture consists of a CNN encoder and an RNN decoder. The CNN encoder is a ResNet pre-trained on ImageNet; ResNet is a deep convolutional neural network with skip (residual) connections. It works very well on tasks like image recognition because each block only has to model the residual between its input and its output, while the identity shortcut carries the input through unchanged, which makes very deep networks easier to train. A network pre-trained on ImageNet is already good at extracting useful low-level and high-level features for image tasks, so it naturally serves as a feature encoder for the image we want to caption. Since we are not doing the traditional image classification task, we drop the last fully connected layer and replace it with a new trainable fully connected layer that transforms the final feature map into an encoding that is more useful for the RNN decoder.

RNNs have long been shown to be useful in language tasks because of their ability to model sequential data. LSTMs in particular carry both long-term and short-term information as memory in the network. Thus, we pick an RNN decoder for the captioning task. Following the spirit of the sequence-to-sequence (seq2seq) models used in translation, we adopt the architecture choices in [this paper](https://arxiv.org/pdf/1411.4555.pdf) and use an LSTM to generate captions from the encoded information produced by the CNN encoder. Specifically, **we first use the CNN encoder output concatenated with the "START" token as the initial input for the RNN decoder.** We apply a fully connected layer to the hidden state at each time step to produce a softmax probability over the words in our entire vocabulary, and we choose the word with the highest probability as the word generated at that time step. Then, we feed this predicted word back in as the input for the next step. We continue until we have generated a caption of maximum length or the network generates the "STOP" token, which indicates the end of the sentence.
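
The generation loop described above is greedy decoding: pick the most probable word, feed it back in, and stop at "STOP" or at the maximum length. The sketch below shows only that control flow. The lookup table `TABLE` is a hypothetical stand-in for the decoder; the real LSTM conditions on the image features and its hidden state rather than on the previous word alone.

```python
def greedy_decode(next_word_probs, start_token, stop_token, max_len=20):
    """Repeatedly pick the most probable next word until stop_token or max_len."""
    caption = []
    previous = start_token
    for _ in range(max_len):
        probs = next_word_probs(previous)   # dict mapping word -> probability
        word = max(probs, key=probs.get)    # greedy choice: highest probability
        if word == stop_token:
            break
        caption.append(word)
        previous = word                     # feed the prediction back in
    return caption


# Hypothetical next-word probabilities, made up for illustration.
TABLE = {
    "START": {"a": 0.7, "dog": 0.2, "STOP": 0.1},
    "a": {"dog": 0.6, "cat": 0.3, "STOP": 0.1},
    "dog": {"runs": 0.5, "STOP": 0.4, "a": 0.1},
    "runs": {"STOP": 0.9, "a": 0.1},
}

print(greedy_decode(TABLE.__getitem__, "START", "STOP"))
```

Greedy decoding is simple and fast, but it commits to one word per step; alternatives such as beam search keep several candidate captions alive at once.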

### Hyperparameters

Please peruse [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance on setting some of the values above. We consult these research papers to obtain a strong initial guess for which hyperparameters are likely to work best. Then, we train a single model and proceed to the next notebook (**3_Inference.ipynb**). If we are unhappy with the performance, we can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train the model.


To choose the hyperparameters, we initially consulted [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf). We used a minimum word count threshold of **5**, an embedding size of **512**, and a hidden size of **512**. We trained the network for 3 epochs. The loss initially decreased as expected, but after 20 hours of training, inference on test images suggested that the network had overfitted to the training data, because the generated captions were not related to the test images at all. We repeated the inference with the model saved after each epoch, and it still performed unsatisfactorily. We therefore decreased the embedding size to **256** and trained again, this time for only 1 epoch. The network performed great this time!


### Trainable Parameters

Next, we will specify a Python list containing the learnable parameters of the model. For instance, if we decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then we should set `params` to something like:

```python
params = list(decoder.parameters()) + list(encoder.embed.parameters()) 
```

We decided to freeze all but the last layer of ResNet, because it is already pre-trained on ImageNet and performs well. We could still fine-tune the entire ResNet for better performance, but since ResNet is a large, deep architecture with many parameters, freezing them makes training faster, which matters because the RNN decoder is already slow to train. Empirical results suggest that the pre-trained ResNet indeed does a good job. Since the last layer of the CNN encoder transforms the CNN feature map into the representation that the RNN needs, it makes sense to train this new fully connected layer from scratch. 

The RNN decoder is completely new, not a part of the pre-trained ResNet, so we also train all the parameters inside the RNN decoder.

### Optimizer

Finally, we will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).

We chose the Adam optimizer because it is one of the most popular and effective optimizers. It combines momentum with per-parameter adaptive learning rates.
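
For intuition, Adam keeps a running average of the gradients (momentum) and of the squared gradients (used to scale the step for each parameter). The sketch below applies the standard Adam update to a single scalar parameter. The helper `adam_step` and the toy objective are hypothetical and only illustrate the update rule; in the notebook, `torch.optim.Adam` does this for every parameter tensor.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem: minimize f(x) = x**2 starting from x = 1.0; the gradient is 2 * x.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)

print(f"x after 2000 steps: {x:.4f}")
```

Note that the very first step moves the parameter by almost exactly `lr`, regardless of the gradient's magnitude, because the bias-corrected ratio `m_hat / sqrt(v_hat)` is then close to the sign of the gradient.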

In [1]:
import math
from model import EncoderCNN, DecoderRNN
from data_loader import get_loader
from pycocotools.coco import COCO
from torchvision import transforms
import torch.nn as nn
import torch

# Watch for any changes in model.py, and re-load it automatically.
%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package punkt to /home/masoud/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [17]:
# Select appropriate values for the Python variables below.
batch_size = 128  # batch size
vocab_threshold = 5  # minimum word count threshold
vocab_from_file = True  # if True, load existing vocab file
embed_size = 256  # dimensionality of image and word embeddings
hidden_size = 512  # number of features in hidden state of the RNN decoder
num_epochs = 1  # number of training epochs
save_every = 1  # determines frequency of saving model weights
print_every = 1  # determines window for printing average loss
# name of file with saved training loss and perplexity
log_file = "training_log.txt"
# Path to cocoapi dir
cocoapi_dir = r"/media/masoud/F60C689F0C685C9D/immediate D/Course_Assignments/FINISHED/VISION/Udacity - Computer Vision Nanodegree/PROJECTS/2 - IMAGE CAPTIONING/MY/"


# Amend the image transform below.
transform_train = transforms.Compose(
    [
        # smaller edge of image resized to 256
        transforms.Resize(256),
        # get 224x224 crop from random location
        transforms.RandomCrop(224),
        # horizontally flip image with probability=0.5
        transforms.RandomHorizontalFlip(),
        # convert the PIL Image to a tensor
        transforms.ToTensor(),
        transforms.Normalize(
            (0.485, 0.456, 0.406),  # normalize image for pre-trained model
            (0.229, 0.224, 0.225),
        ),
    ]
)

In [18]:
# Build data loader.
data_loader = get_loader(
    transform=transform_train,
    mode="train",
    batch_size=batch_size,
    vocab_threshold=vocab_threshold,
    vocab_from_file=vocab_from_file,
    cocoapi_loc=cocoapi_dir,
)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=1.33s)
creating index...
index created!
Obtaining caption lengths...


100%|█████████████████████████████████████████████████████████████████████████| 414113/414113 [00:44<00:00, 9207.58it/s]


In [19]:
# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder.
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function and place it on the same device as the models.
criterion = nn.CrossEntropyLoss().to(device)

# Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.001)

# Set the total number of training steps per epoch.
total_step = math.ceil(
    len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size
)

In [20]:
print(total_step)

3236
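
The value above can be checked by hand. The sketch below assumes that `len(data_loader.dataset.caption_lengths)` equals the 414113 captions shown in the progress bar of the data loader cell, and uses the `batch_size` of 128 set earlier.

```python
import math

num_captions = 414113  # caption count taken from the progress bar above (assumed)
batch_size = 128       # value set earlier in this notebook

# The last, partially filled batch still counts as a step, hence the ceiling.
total_step = math.ceil(num_captions / batch_size)
print(total_step)  # 3236
```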


<a id='step2'></a>
## Step 2: Training the Model

It is useful to load saved weights to resume training. In that case, note the names of the files containing the encoder and decoder weights that we'd like to load (`encoder_file` and `decoder_file`).  Then we can load the weights by using the lines below:

```python
# Load pre-trained weights before resuming training.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))
```

It is good practice to take extensive notes and record the settings used in each training run while trying out parameters.

### A Note on Tuning Hyperparameters

To figure out how well the model is doing, we can look at how the training loss and [perplexity](http://www.sefidian.com/2022/05/11/understanding-perplexity-for-language-models/) evolve during training. However, this will not tell us if our model is overfitting to the training data, and, unfortunately, **overfitting is a problem that is commonly encountered when training image captioning models**.  
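
Perplexity is the exponential of the cross-entropy loss, which is how the training loop in this notebook computes it (`np.exp(loss.item())`). A handy reference point is a model that assigns equal probability to every word: its loss is the natural log of the vocabulary size, so its perplexity equals the vocabulary size. The sketch below uses a vocabulary size of 9955, taken from the debug shapes in the training cell; the `perplexity` helper is hypothetical.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the average cross-entropy loss (in nats)."""
    return math.exp(cross_entropy_loss)


vocab_size = 9955  # from the debug shapes in the training cell (assumed current)

# A uniform guess over the vocabulary has loss ln(vocab_size) ...
uniform_loss = math.log(vocab_size)
# ... and therefore a perplexity equal to the vocabulary size.
print(round(uniform_loss, 4), round(perplexity(uniform_loss)))

# A lower loss means the model is effectively choosing among far fewer words.
print(round(perplexity(4.7131), 1))
```

This is why a loss close to 9.2 at the very first training step is expected for an untrained model with this vocabulary.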

In this project we do not have strict requirements regarding the performance of the model. We want to demonstrate that the model has learned **_something_** when we generate captions on the test data. For now, we train the model for 3 epochs without worrying about performance. Then, we will go to the next notebook in the sequence (**3_Inference.ipynb**) to see how the model performs on the test data. We can come back to this notebook and amend hyperparameters (if necessary), and re-train the model.

You can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636). In the next step of this notebook, we provide some guidance for assessing the performance on the validation dataset.

In [21]:
import torch.utils.data as data
import sys
import numpy as np
import os
import requests
import time
from workspace_utils import active_session

In [22]:
# Open the training log file.
f = open(log_file, "w")

old_time = time.time()


for epoch in range(1, num_epochs + 1):
    for i_step in range(1, total_step + 1):

        print(i_step)
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler

        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)

        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()

        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)

        # uncomment to debug
        # print(outputs.shape, captions.shape)
        # torch.Size([128, 12, 9955]) torch.Size([128, 12])

        # print(outputs.view(-1, vocab_size).shape, captions.view(-1).shape)
        # torch.Size([1536, 9955]) torch.Size([1536])

        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))

        # Backward pass.
        loss.backward()

        # Update the parameters in the optimizer.
        optimizer.step()

        # Get training statistics.
        stats = f"Epoch [{epoch}/{num_epochs}], Step [{i_step}/{total_step}], Loss: {loss.item():.4f}, Perplexity: {np.exp(loss.item()):.4f}"

        # Print training statistics (on same line).
        print("\r" + stats, end="")
        sys.stdout.flush()

        # Print training statistics to file.
        f.write(stats + "\n")
        f.flush()

        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print("\r" + stats)

    # Save the weights.
    if epoch % save_every == 0:
        torch.save(
            decoder.state_dict(), os.path.join("./models", "decoder-%d.pkl" % epoch)
        )
        torch.save(
            encoder.state_dict(), os.path.join("./models", "encoder-%d.pkl" % epoch)
        )

# Close the training log file.
f.close()

1
Epoch [1/1], Step [1/3236], Loss: 9.2042, Perplexity: 9938.3369
2
Epoch [1/1], Step [2/3236], Loss: 9.0498, Perplexity: 8516.4601
3
Epoch [1/1], Step [3/3236], Loss: 8.8611, Perplexity: 7052.4929
4
Epoch [1/1], Step [4/3236], Loss: 8.6000, Perplexity: 5431.8378
5
Epoch [1/1], Step [5/3236], Loss: 8.2145, Perplexity: 3694.3029
6
Epoch [1/1], Step [6/3236], Loss: 8.0998, Perplexity: 3293.9635
7
Epoch [1/1], Step [7/3236], Loss: 6.8265, Perplexity: 921.9415
8
Epoch [1/1], Step [8/3236], Loss: 6.1396, Perplexity: 463.8692
9
Epoch [1/1], Step [9/3236], Loss: 5.4941, Perplexity: 243.2473
10
Epoch [1/1], Step [10/3236], Loss: 5.2898, Perplexity: 198.3087
11
Epoch [1/1], Step [11/3236], Loss: 5.0797, Perplexity: 160.7320
12
Epoch [1/1], Step [12/3236], Loss: 4.9498, Perplexity: 141.1531
13
Epoch [1/1], Step [13/3236], Loss: 4.8679, Perplexity: 130.0506
14
Epoch [1/1], Step [14/3236], Loss: 5.0019, Perplexity: 148.6954
15
Epoch [1/1], Step [15/3236], Loss: 4.7131, Perplexity: 111.3971
16
Epoc

KeyboardInterrupt: 

In [None]:
# Uncomment to save the models

In [16]:
# torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-final.pkl'))
# torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-final.pkl'))

<a id='step3'></a>
## Step 3: Validating the Model

To assess potential overfitting, one approach is to assess performance on a validation set. To do this, we first need to complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, we write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses the RNN decoder to generate captions. That code is very useful here. 

To validate our model, we create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data.  We can access:
- the validation images at filepath `'/opt/cocoapi/images/val2014/'`, and
- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.

The suggested approach to validating the model involves creating a .json file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing the model's predicted captions for the validation images. Then, we can write our own script or use one that we can [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of our model. Read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr), in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf). For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset.
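
The linked example file is a JSON list with one record per image, each holding an `image_id` and a `caption`. A minimal sketch of producing that structure follows, assuming the predictions are held in a dictionary that maps each image id to a list of generated captions. The helper `to_coco_results` and the sample predictions are hypothetical.

```python
import json


def to_coco_results(pred_result):
    """Flatten {image_id: [caption, ...]} into a list of result records."""
    return [
        {"image_id": image_id, "caption": captions[0].strip()}
        for image_id, captions in sorted(pred_result.items())
        if captions  # skip images without any generated caption
    ]


# Made-up predictions for illustration; real ones come from the decoder.
pred_result = {42: [" a dog runs on the grass ."], 7: [" a cat sits on a sofa ."]}

results = to_coco_results(pred_result)
print(json.dumps(results, indent=2))
```

Writing `results` to disk with `json.dump` gives a file that evaluation scripts expecting this layout can read.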

In [16]:
# Validate the model.
from data_loader_val import get_loader
from collections import defaultdict
import json
import os


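def clean_sentence(output):
    """Convert a list of predicted word indices into a caption string."""
    sentence = ""
    for i in output:
        if i == 0:  # skip index 0 (assumed to be the start token)
            continue
        if i == 1:  # stop at index 1 (assumed to be the end token)
            break
        word = data_loader.dataset.vocab.idx2word[i]
        if i == 18:  # index 18 is appended without a leading space (assumed to be ".")
            sentence = sentence + word
        else:
            sentence = sentence + " " + word
    return sentence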


transform_test = transforms.Compose(
    [
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(
            (0.485, 0.456, 0.406),  # normalize image for pre-trained model
            (0.229, 0.224, 0.225),
        ),
    ]
)


# Create the data loader.
data_loader = get_loader(
    transform=transform_test, mode="valid", cocoapi_loc=cocoapi_dir
)

vocab_size = len(data_loader.dataset.vocab)
embed_size = 256  # dimensionality of image and word embeddings
hidden_size = 512  # number of features in hidden state of the RNN decoder
encoder_file = "encoder-1.pkl"
decoder_file = "decoder-1.pkl"

# Initialize the encoder and decoder.
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Load the trained weights.
encoder.load_state_dict(torch.load(os.path.join("./models", encoder_file)))
decoder.load_state_dict(torch.load(os.path.join("./models", decoder_file)))

encoder.eval()
decoder.eval()


pred_result = defaultdict(list)
for img_id, img in data_loader:
    img = img.to(device)
    with torch.no_grad():
        features = encoder(img).unsqueeze(1)
        output = decoder.sample(features)
    sentence = clean_sentence(output)
    pred_result[img_id.item()].append(sentence)

with open("cocoapi/annotations/captions_val2014.json", "r") as f:
    caption = json.load(f)
    valid_annot = caption["annotations"]
valid_result = defaultdict(list)
for i in valid_annot:
    valid_result[i["image_id"]].append(i["caption"])

Vocabulary successfully loaded from vocab.pkl file!


RuntimeError: Error(s) in loading state_dict for DecoderRNN:
	Missing key(s) in state_dict: "embed.weight", "linear.weight", "linear.bias". 
	Unexpected key(s) in state_dict: "word_embeddings.weight", "hidden2out.weight", "hidden2out.bias". 
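
The error above lists parameter names that the current `DecoderRNN` defines but the checkpoint lacks, and names that the checkpoint contains but the model does not expect. This usually means the checkpoint was saved from a version of `model.py` whose layers had different attribute names. A small sketch for spotting such mismatches before calling `load_state_dict` follows; `diff_keys` is a hypothetical helper, and the shared key `lstm.weight_ih_l0` is made up for illustration.

```python
def diff_keys(expected, found):
    """Return (missing, unexpected) key lists, mirroring the error message."""
    expected, found = set(expected), set(found)
    return sorted(expected - found), sorted(found - expected)


# Key names copied from the error above; "lstm.weight_ih_l0" is hypothetical.
model_keys = ["embed.weight", "lstm.weight_ih_l0", "linear.weight", "linear.bias"]
checkpoint_keys = [
    "word_embeddings.weight",
    "lstm.weight_ih_l0",
    "hidden2out.weight",
    "hidden2out.bias",
]

missing, unexpected = diff_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With real objects, the two arguments would be `decoder.state_dict().keys()` and the keys of the dictionary returned by `torch.load`. Retraining with the current `model.py`, or restoring the layer names used when the checkpoint was saved, are two ways to make the names line up.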

In [None]:
from bleu import Bleu

bleu_scorer = Bleu()

score, scores = bleu_scorer.compute_score(valid_result, pred_result)
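
For intuition about what the scorer measures, BLEU is built on modified (clipped) n-gram precision: each n-gram in the candidate counts only as often as it appears in the most generous reference. The sketch below computes that single ingredient. It is not the implementation inside the `bleu` module used above, and full BLEU additionally combines several n-gram orders and applies a brevity penalty. The example captions are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count seen in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


references = ["a dog runs on the grass", "a dog is running outside"]
print(modified_precision("a dog runs outside", references, 1))  # 1.0
print(modified_precision("a dog runs outside", references, 2))
```

Clipping is what stops a degenerate caption such as "the the the" from scoring well: against the reference "the cat", only one of its three unigrams is credited.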