# Computer Vision Nanodegree

## Project: Image Captioning

---

Train your CNN-RNN model.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model
- [Step 3](#step3): (Optional) Validate your Model

<a id='step1'></a>
## Step 1: Training Setup
Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to amend the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.
- `save_every` - determines how often to save the model weights.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.
- `log_file` - the name of the text file containing - for every step - how the loss and perplexity evolved during training.
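
The effect of `vocab_threshold` on vocabulary size can be sketched in a few lines of plain Python. The helper `build_vocab` and the toy captions below are hypothetical and not part of the project code, and the sketch assumes that a word is kept when its count is at least the threshold.

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)

# Toy captions, made up for illustration only.
captions = [
    "a dog runs on the beach",
    "a dog plays with a ball",
    "a cat sleeps on the sofa",
]

print(len(build_vocab(captions, vocab_threshold=1)))  # every distinct word is kept
print(build_vocab(captions, vocab_threshold=2))       # only repeated words survive
```

Raising the threshold from 1 to 2 shrinks this toy vocabulary from twelve words to four, which is the trade-off described for `vocab_threshold` above.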

### Question 1

**Question:** Describe your CNN-RNN architecture in detail.  With this architecture in mind, how did you select the values of the variables in Task 1?  If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.

**Answer:** The architecture consists of two parts: a CNN encoder and an RNN decoder. For the CNN part I used the pre-trained ResNet50 provided in the Udacity notebook. Since the CNN is used only for feature extraction, its final fully connected classification layer is left out, and the resulting feature map is transformed into an embedding (`encoder.embed`) that the RNN decoder can use as input.

For the RNN part, I followed the architecture described in https://arxiv.org/pdf/1411.4555v2.pdf. Based on the paper, I set `vocab_threshold = 5`, `embed_size = 256` (to reduce the risk of overfitting), and `hidden_size = 512` as my final parameters. The number of epochs was set to 3, as recommended in the project instructions. Both the training loss and the perplexity drop quickly at first and then reach a plateau.
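
To make the data flow through the encoder and decoder concrete, here is a minimal shape trace in plain Python. It is a sketch under assumptions, not the actual `model.py`: `trace_shapes` is a hypothetical helper, the decoder is assumed to be an LSTM fed with the image embedding followed by the caption word embeddings so that it emits one score vector per caption token, and the caption length and vocabulary size used in the example are made up. The 2048 features are the size of ResNet-50's pooled output.

```python
def trace_shapes(batch_size, caption_length, embed_size, hidden_size, vocab_size):
    """List the tensor shape expected after each stage of the encoder-decoder."""
    return [
        ("input images", (batch_size, 3, 224, 224)),
        ("ResNet-50 pooled features", (batch_size, 2048)),
        ("encoder.embed output", (batch_size, embed_size)),
        ("decoder input sequence", (batch_size, caption_length, embed_size)),
        ("LSTM hidden states", (batch_size, caption_length, hidden_size)),
        ("vocabulary scores", (batch_size, caption_length, vocab_size)),
    ]

# caption_length=12 and vocab_size=9000 are hypothetical values for illustration.
for name, shape in trace_shapes(128, 12, embed_size=256, hidden_size=512, vocab_size=9000):
    print(f"{name:28s} {shape}")
```

The last shape is what makes a call such as `criterion(outputs.view(-1, vocab_size), captions.view(-1))` line up: one row of vocabulary scores per caption token.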


### Question 2

**Question:** How did you select the transform in `transform_train`?  If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

**Answer:** I kept the recommended transform provided in `transform_train`:

    transform_train = transforms.Compose([
        transforms.Resize(256),
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])

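
As a rough illustration of what the last two steps of this transform do to a single pixel, the sketch below reproduces the arithmetic in plain Python: `ToTensor` scales 8-bit values to the range [0, 1], and `Normalize` then subtracts the per-channel mean and divides by the per-channel standard deviation (the ImageNet statistics used by the pre-trained model). The helper `normalize_pixel` is hypothetical and only mirrors the arithmetic; it is not the torchvision implementation.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means from the transform above
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations from the transform above

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [value / 255.0 for value in rgb]
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]

print(normalize_pixel((124, 116, 104)))  # close to the channel means, so near zero
print(normalize_pixel((255, 255, 255)))  # a white pixel lands well above zero
```

Inputs preprocessed this way have roughly the same per-channel statistics as the images the pre-trained CNN was trained on.
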
### Task #3

Next, you will specify a Python list containing the learnable parameters of the model.  For instance, if you decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then you should set `params` to something like:
```
params = list(decoder.parameters()) + list(encoder.embed.parameters()) 
```

### Question 3

**Question:** How did you select the trainable parameters of your architecture?  Why do you think this is a good choice?

**Answer:** I trained all the parameters of the RNN decoder together with the embedding layer of the encoder, using `params = list(decoder.parameters()) + list(encoder.embed.parameters())` as originally provided.

I kept `batch_size = 128` so that the model trains faster. Originally both `embed_size` and `hidden_size` were set to 512, but I later changed `embed_size` to 256 to make the feature vector smaller: longer embedding vectors do not add enough information, while much smaller ones do not represent the semantics well enough.
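
To give a feel for how much `embed_size` changes the number of trainable decoder weights, the following sketch counts parameters by hand. It assumes a decoder made of a word embedding, a single-layer LSTM (with the two bias vectors per gate block that PyTorch's `nn.LSTM` uses), and a linear layer over the vocabulary. The helper names and the vocabulary size of 9000 are hypothetical; the real size is `len(data_loader.dataset.vocab)`.

```python
def lstm_param_count(input_size, hidden_size):
    """Single-layer LSTM: four gates with input and recurrent weights, plus two bias vectors."""
    return 4 * hidden_size * (input_size + hidden_size) + 8 * hidden_size

def decoder_param_count(embed_size, hidden_size, vocab_size):
    """Word embedding + LSTM + final linear layer over the vocabulary."""
    embedding = vocab_size * embed_size
    lstm = lstm_param_count(embed_size, hidden_size)
    output = hidden_size * vocab_size + vocab_size
    return embedding + lstm + output

VOCAB_SIZE = 9000  # hypothetical value for illustration
for embed_size in (256, 512):
    print(embed_size, decoder_param_count(embed_size, 512, VOCAB_SIZE))
```

Under these assumptions, halving `embed_size` from 512 to 256 removes roughly a quarter of the decoder's parameters.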

### Task #4

Finally, you will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).

### Question 4

**Question:** How did you select the optimizer used to train your model?

**Answer:** I chose the Adam optimizer because it usually converges faster and more reliably than plain stochastic gradient descent, especially with LSTMs. It is also one of the most popular optimizers for training neural networks, since it combines the adaptive per-parameter learning rates of Root Mean Square Propagation (RMSProp) with momentum on top of stochastic gradient descent.
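
The combination of momentum and RMSProp-style scaling can be sketched for a single scalar parameter as follows. This is an illustrative re-implementation of the standard Adam update rule, not the code behind `torch.optim.Adam`; the function `adam_minimize` and the quadratic test function are hypothetical, and the default hyperparameters mirror PyTorch's defaults.

```python
import math

def adam_minimize(grad_fn, x, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    """Run `steps` Adam updates on a single scalar parameter and return it."""
    m = 0.0  # running mean of gradients (momentum term)
    v = 0.0  # running mean of squared gradients (RMSProp-style term)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

def grad(x):
    """Gradient of f(x) = x**2, whose minimum is at x = 0."""
    return 2 * x

print(adam_minimize(grad, 5.0, lr=0.1, steps=1))    # first step moves by about lr
print(adam_minimize(grad, 5.0, lr=0.1, steps=500))  # ends close to the minimum at 0
```

Because the update divides by the root of the squared-gradient average, the size of the first step is set by `lr` rather than by the raw gradient magnitude, which is part of why Adam tends to need less learning-rate tuning.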

In [2]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math


## Select appropriate values for the Python variables below.
batch_size = 128          # batch size
vocab_threshold = 5        # minimum word count threshold
vocab_from_file = True    # if True, load existing vocab file
embed_size = 256           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...



Done (t=1.03s)
creating index...
index created!
Obtaining caption lengths...


<a id='step2'></a>
## Step 2: Train your Model
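
The two numbers that appear in every line of the training log, the step count per epoch and the perplexity, follow from simple arithmetic. The sketch below assumes a training set of 414113 captions (an assumption chosen to be consistent with the 3236 steps per epoch shown in the log; the actual value is `len(data_loader.dataset.caption_lengths)`) and uses the fact that perplexity is the exponential of the cross-entropy loss. The helper names are hypothetical.

```python
import math

def steps_per_epoch(num_captions, batch_size):
    """Number of batches needed to cover every caption once."""
    return math.ceil(num_captions / batch_size)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)

print(steps_per_epoch(414113, 128))  # 3236
print(round(perplexity(3.8684), 2))  # about 47.87
```

A perplexity near 48 means the model is, on average, about as uncertain as if it were choosing uniformly among 48 words at each position; lower is better.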

In [3]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()
response = requests.request("GET", 
                            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token", 
                            headers={"Metadata-Flavor":"Google"})

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        if time.time() - old_time > 60:
            old_time = time.time()
            requests.request("POST", 
                             "https://nebula.udacity.com/api/v1/remote/keep-alive", 
                             headers={'Authorization': "STAR " + response.text})
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [100/3236], Loss: 3.8684, Perplexity: 47.8655
Epoch [1/3], Step [200/3236], Loss: 3.6402, Perplexity: 38.0994
Epoch [1/3], Step [300/3236], Loss: 3.5219, Perplexity: 33.8498
Epoch [1/3], Step [400/3236], Loss: 3.3213, Perplexity: 27.6955
Epoch [1/3], Step [500/3236], Loss: 3.0596, Perplexity: 21.3188
Epoch [1/3], Step [600/3236], Loss: 3.0153, Perplexity: 20.3960
Epoch [1/3], Step [700/3236], Loss: 3.0016, Perplexity: 20.1167
Epoch [1/3], Step [800/3236], Loss: 3.2770, Perplexity: 26.4971
Epoch [1/3], Step [900/3236], Loss: 3.0740, Perplexity: 21.6293
Epoch [1/3], Step [1000/3236], Loss: 2.7862, Perplexity: 16.2194
Epoch [1/3], Step [1100/3236], Loss: 2.5998, Perplexity: 13.4608
Epoch [1/3], Step [1200/3236], Loss: 2.4921, Perplexity: 12.0861
Epoch [1/3], Step [1300/3236], Loss: 2.5904, Perplexity: 13.3352
Epoch [1/3], Step [1400/3236], Loss: 2.5594, Perplexity: 12.9278
Epoch [1/3], Step [1500/3236], Loss: 2.3900, Perplexity: 10.9133
Epoch [1/3], Step [1600/3236], Lo