# Computer Vision

## Project: Image Captioning

---

In this notebook, we will train our CNN-RNN model.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train the Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, we will customize the training of our CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values we set now will be used when training our model in **Step 2** below.

### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step.
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.
- `save_every` - determines how often to save the model weights.  We will set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.
- `log_file` - the name of the text file that records the loss and perplexity at every training step.
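
To make the effect of `vocab_threshold` concrete, here is a minimal sketch. It is not the project's actual vocabulary code, which is not shown in this notebook. The helper `build_vocab` and the toy captions are hypothetical, and the sketch assumes a word is kept when its count is at least the threshold.

```python
from collections import Counter

def build_vocab(captions, threshold):
    """Keep only words that occur at least `threshold` times (hypothetical helper)."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= threshold)

# Toy captions, invented for illustration only.
captions = [
    "a dog runs on the beach",
    "a dog sits on a sofa",
    "a cat sits on the sofa",
]

small_threshold_vocab = build_vocab(captions, threshold=1)
large_threshold_vocab = build_vocab(captions, threshold=3)

print(len(small_threshold_vocab), small_threshold_vocab)
print(len(large_threshold_vocab), large_threshold_vocab)
```

With the larger threshold only the frequent words survive, which is the trade-off described for `vocab_threshold` in the list above.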

## Explanations: How Do We Construct Our Model?
### 1 CNN-RNN architecture

**explanations:** 
The encoder CNN is a pre-trained ResNet-50 (with the final fully-connected layer removed) that extracts features from a batch of pre-processed images.  The output is then flattened to a vector and passed through a `Linear` layer, which transforms the feature vector to the same size as the word embedding.

The decoder RNN takes the embedded image feature vector from the encoder CNN together with the word embeddings of the caption, passes them through an LSTM layer, and then through a `Linear` layer that generates the predicted scores for each output caption word.

The hyperparameter values were chosen as follows:
- `batch_size` is 128, a power of 2 and one of the commonly used batch sizes (32, 64, 128).
- `vocab_threshold` is 4, which we tried in the previous notebook and which seems to be a reasonable cutoff for removing infrequent words.
- `embed_size` is 256, within the 100-300 range described in these two papers: https://nlp.stanford.edu/pubs/glove.pdf and https://arxiv.org/pdf/1310.4546.pdf.
- `hidden_size` is 512, one of the hidden layer sizes suggested in many papers.
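
The decoder structure described above can be sketched as follows. This is an illustrative assumption, not the contents of the project's `model.py`, which this notebook does not show. The class name `SketchDecoderRNN`, the way the image feature is prepended to the caption embeddings, and the vocabulary size of 1000 are all hypothetical. Random tensors stand in for the encoder output so that no pre-trained weights are needed.

```python
import torch
import torch.nn as nn

class SketchDecoderRNN(nn.Module):
    """Illustrative decoder: embedding -> LSTM -> linear scores over the vocabulary."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Embed every caption token except the last one (assumed to be an end token).
        word_embeddings = self.embed(captions[:, :-1])
        # Assumption: the image feature vector is the first input of the sequence.
        inputs = torch.cat((features.unsqueeze(1), word_embeddings), dim=1)
        hidden_states, _ = self.lstm(inputs)
        return self.fc(hidden_states)  # (batch, caption_length, vocab_size)

embed_size, hidden_size, vocab_size = 256, 512, 1000  # vocab_size is a made-up value
decoder_sketch = SketchDecoderRNN(embed_size, hidden_size, vocab_size)

features = torch.randn(4, embed_size)             # stand-in for the encoder output
captions = torch.randint(0, vocab_size, (4, 12))  # 4 captions of 12 token ids each
outputs = decoder_sketch(features, captions)
print(outputs.shape)
```

Under these assumptions the output has one score vector per caption position, which is the shape the loss computation `criterion(outputs.view(-1, vocab_size), captions.view(-1))` expects.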

### 2 Transform in `transform_train`

**explanations:**
I resize each image so that its smaller edge is 256 pixels, since some images might be smaller than 224. `RandomCrop(224)` is necessary to ensure that all images have the same size, which is also the input size used by the pre-trained CNN. `RandomHorizontalFlip` with probability 0.5 augments the training images, so the model sees more varied data. Finally, the images are normalized with the same mean and standard deviation as the data the pre-trained model was trained on.
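
The arithmetic behind these transforms can be sketched in plain Python. This is a simplified illustration, not torchvision's implementation. The helper names and the 480x640 example image are hypothetical, and the rounding in `resized_shape` is an assumption.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means used in `transform_train`
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations used in `transform_train`

def resized_shape(height, width, size=256):
    """Scale the smaller edge to `size`, keeping the aspect ratio (approximate)."""
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size

def max_crop_offsets(height, width, crop=224):
    """Largest valid top/left offsets for a crop x crop window."""
    return height - crop, width - crop

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel with values in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

h, w = resized_shape(480, 640)            # a hypothetical 480x640 input image
max_top, max_left = max_crop_offsets(h, w)
print((h, w), (max_top, max_left))
print(normalize_pixel(MEAN))              # a pixel equal to the mean maps to zeros
```

Resizing to 256 before cropping leaves some slack in both directions, so the random 224x224 crop can land in different places on each pass over the data.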

### Task #3

Next, we will specify a Python list containing the learnable parameters of the model.  For instance, if we decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then we should set `params` to something like:
```python
params = list(decoder.parameters()) + list(encoder.embed.parameters())
```

### 3 Selection of trainable parameters

**explanations:** The trainable parameters I select are: `params = list(decoder.parameters()) + list(encoder.embed.parameters())`

I select only `encoder.embed.parameters()` from the encoder because the rest of the encoder is a pre-trained ResNet-50 (with the final fully-connected layer removed) that extracts features from a batch of pre-processed images.  Its output is flattened to a vector and passed through a `Linear` layer that transforms the feature vector to the same size as the word embedding. So only this last layer (`encoder.embed`) of the encoder network has parameters that need to be trained.

For the decoder network, all layers have trainable parameters, so I select all of them.
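
The effect of this selection can be checked on a toy model. The sketch below uses small `nn.Linear` layers as placeholders for the real networks. The class `SketchEncoder`, its `backbone` attribute, and the explicit freezing via `requires_grad_(False)` are assumptions for illustration, not taken from the project's `model.py`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SketchEncoder(nn.Module):
    """Stand-in encoder: a frozen 'backbone' plus a trainable `embed` layer."""

    def __init__(self, embed_size):
        super().__init__()
        self.backbone = nn.Linear(32, 16)   # placeholder for the pre-trained CNN
        for param in self.backbone.parameters():
            param.requires_grad_(False)     # assumption: pre-trained weights are frozen
        self.embed = nn.Linear(16, embed_size)

    def forward(self, x):
        return self.embed(self.backbone(x))

encoder_sketch = SketchEncoder(embed_size=8)
decoder_sketch = nn.Linear(8, 5)            # placeholder for the RNN decoder

# Same selection pattern as in the notebook: all decoder weights plus `encoder.embed`.
params = list(decoder_sketch.parameters()) + list(encoder_sketch.embed.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

backbone_before = encoder_sketch.backbone.weight.clone()
embed_before = encoder_sketch.embed.weight.clone()

loss = decoder_sketch(encoder_sketch(torch.randn(4, 32))).pow(2).mean()
loss.backward()
optimizer.step()

print(torch.equal(backbone_before, encoder_sketch.backbone.weight))  # backbone comparison
print(torch.equal(embed_before, encoder_sketch.embed.weight))        # embed comparison
```

Only the tensors passed to the optimizer are updated by `optimizer.step()`, so the placeholder backbone keeps its weights while the `embed` layer and the decoder change.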

### Task #4

Finally, we will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).

### 4 Selection of the optimizer

**explanations:** I use Adam as the optimizer. Adam (adaptive moment estimation) keeps running averages of past gradients and of their squares, and uses them to scale the update of each parameter. I chose it because it is widely used and well accepted in practice for training neural networks, and it often works well on CNN/RNN models without much tuning.
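
As a rough illustration of what "adaptive moment estimation" means, here is a single-parameter version of the Adam update written in plain Python. It is a teaching sketch, not the code inside `torch.optim.Adam`. The function name `adam_step` and the toy objective are hypothetical, while the default hyperparameters mirror the usual Adam defaults.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (step counter t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)
```

Because the update divides by the square root of the running squared gradient, the size of the first step is close to `lr` regardless of the gradient's scale.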

In [1]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math


## Select appropriate values for the Python variables below.
batch_size = 128           # batch size
vocab_threshold = 4        # minimum word count threshold
vocab_from_file = False    # if True, load existing vocab file
embed_size = 256           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # print the batch loss on its own line every `print_every` steps
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# Define the optimizer.
optimizer = torch.optim.Adam(params, lr = 0.001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

loading annotations into memory...
Done (t=0.90s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...
Done (t=0.93s)
creating index...
index created!
Obtaining caption lengths...
100%|██████████| 414113/414113 [01:09<00:00, 5921.98it/s]


<a id='step2'></a>
## Step 2: Train Our Model

It can be useful to load saved weights in order to resume training.  In that case, note the names of the files containing the encoder and decoder weights that we would like to load (`encoder_file` and `decoder_file`).  Then we can load the weights using the lines below:

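```python
import os

# Load previously saved weights before resuming training.
# `encoder_file` and `decoder_file` must be set to the saved file names first.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file), map_location=device))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file), map_location=device))
```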

In [2]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()
response = requests.request("GET", 
                            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token", 
                            headers={"Metadata-Flavor":"Google"})

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        if time.time() - old_time > 60:
            old_time = time.time()
            requests.request("POST", 
                             "https://nebula.udacity.com/api/v1/remote/keep-alive", 
                             headers={'Authorization': "STAR " + response.text})
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights (creating the models folder first if it does not exist).
    if epoch % save_every == 0:
        os.makedirs('./models', exist_ok=True)
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()
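
The numbers in the training statistics can be sanity-checked with a few lines of standard-library Python. This sketch only reuses values that appear in this notebook: the caption count reported while the data loader was built, the batch size, and one logged loss value. The variable names are chosen for illustration.

```python
import math

num_captions = 414113   # caption count reported while building the data loader
batch_size = 128

# Same formula as `total_step` in the setup cell.
steps_per_epoch = math.ceil(num_captions / batch_size)
print(steps_per_epoch)

# Perplexity is the exponential of the cross-entropy loss (natural logarithm).
logged_loss = 3.8191    # loss logged at step 100 of epoch 1
perplexity = math.exp(logged_loss)
print(round(perplexity, 2))
```

The step count matches the `3236` shown in the log, and the perplexity agrees with the logged value up to the rounding of the printed loss.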

Epoch [1/3], Step [100/3236], Loss: 3.8191, Perplexity: 45.5628
Epoch [1/3], Step [200/3236], Loss: 3.5648, Perplexity: 35.3333
Epoch [1/3], Step [300/3236], Loss: 3.3685, Perplexity: 29.0356
Epoch [1/3], Step [400/3236], Loss: 3.2900, Perplexity: 26.8423
Epoch [1/3], Step [500/3236], Loss: 3.4279, Perplexity: 30.8123
Epoch [1/3], Step [600/3236], Loss: 3.0085, Perplexity: 20.2577
Epoch [1/3], Step [700/3236], Loss: 2.8884, Perplexity: 17.9644
Epoch [1/3], Step [800/3236], Loss: 2.8039, Perplexity: 16.5089
Epoch [1/3], Step [900/3236], Loss: 2.7008, Perplexity: 14.8920
Epoch [1/3], Step [1000/3236], Loss: 4.0320, Perplexity: 56.3708
Epoch [1/3], Step [1100/3236], Loss: 2.6640, Perplexity: 14.3538
Epoch [1/3], Step [1200/3236], Loss: 2.6839, Perplexity: 14.6420
Epoch [1/3], Step [1300/3236], Loss: 2.4073, Perplexity: 11.1040
Epoch [1/3], Step [1400/3236], Loss: 2.3821, Perplexity: 10.8276
Epoch [1/3], Step [1500/3236], Loss: 2.3732, Perplexity: 10.7322
Epoch [1/3], Step [1600/3236], Los