## Image Captioning

---

In this notebook, you will train your CNN-RNN model.  

You are welcome and encouraged to try out many different architectures and hyperparameters when searching for a good model.

This does have the potential to make the project quite messy!  Before submitting your project, make sure that you clean up:
- the code you write in this notebook.  The notebook should describe how to train a single CNN-RNN architecture, corresponding to your final choice of hyperparameters.  You should structure the notebook so that the reviewer can replicate your results by running the code in this notebook.  
- the output of the code cell in **Step 2**.  The output should show the output obtained when training the model from scratch.

This notebook **will be graded**.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model
- [Step 3](#step3): (Optional) Validate your Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, you will customize the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values you set now will be used when training your model in **Step 2** below.


### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to amend the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file containing - for every step - how the loss and perplexity evolved during training.
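
To make the effect of `vocab_threshold` concrete, here is a minimal sketch that builds a vocabulary from a toy caption list. The captions and the `build_vocab` helper are illustrative assumptions; this is not the project's actual vocabulary code, which is not shown in this notebook.

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, count in counts.items() if count >= vocab_threshold)

# Hypothetical captions, used only for illustration.
toy_captions = [
    "a dog runs on the grass",
    "a dog sleeps on the couch",
    "a cat sleeps on the couch",
]

small_threshold = build_vocab(toy_captions, vocab_threshold=1)  # keeps rare words
large_threshold = build_vocab(toy_captions, vocab_threshold=3)  # keeps frequent words only

print(len(small_threshold), small_threshold)
print(len(large_threshold), large_threshold)
```

Raising the threshold shrinks the vocabulary, which in turn shrinks the decoder's embedding and output layers.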

If you're not sure where to begin to set some of the values above, you can peruse [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance!  **To avoid spending too long on this notebook**, you are encouraged to consult these suggested research papers to obtain a strong initial guess for which hyperparameters are likely to work best.  Then, train a single model, and proceed to the next notebook (**3_Inference.ipynb**).  If you are unhappy with your performance, you can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train your model.


#### CNN-RNN Architecture Overview
1. The model uses a CNN-RNN architecture.
2. The CNN acts as the encoder and the RNN acts as the decoder.
3. The CNN encoder encodes the image into a feature vector.
4. The RNN decoder decodes the feature vector produced by the CNN encoder into a caption.
5. The CNN feature vector is fed to the RNN decoder as its input at the first time step.
6. The architecture is shown in the image below:
![Image Captioning CNN-RNN model](images/encoder-decoder.png)
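
The way the image feature enters the decoder can be sketched as follows. This is a toy illustration under the assumption that the decoder prepends the image feature to the embedded caption tokens (all but the last); the real `DecoderRNN` lives in **model.py** and may differ. The helper name `build_decoder_inputs` and the tiny embedding table are hypothetical.

```python
def build_decoder_inputs(image_feature, caption_ids, embedding_table):
    """Prepend the image feature to the embeddings of all caption tokens but the last."""
    word_inputs = [embedding_table[token] for token in caption_ids[:-1]]
    return [image_feature] + word_inputs

# Toy setup: embed_size = 2 and four token ids (0 = <start>, 1 = <end>).
embedding_table = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6], 3: [0.7, 0.8]}
image_feature = [0.9, 1.0]   # stand-in for the encoder output
caption_ids = [0, 2, 3, 1]   # <start> word word <end>

inputs = build_decoder_inputs(image_feature, caption_ids, embedding_table)

# One input per target token, so the decoder outputs line up with the full caption.
print(len(inputs), len(caption_ids))
```

This is consistent with the training loop in **Step 2**, which passes `captions[:, :-1]` to the decoder and compares the outputs against the full `captions` tensor.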

#### CNN-Encoder ResNet50
1. The CNN encoder is a ResNet50 model.
2. The ResNet50 was pre-trained on ImageNet, and its weights were loaded from that pre-trained model.
3. The pre-trained CNN encoder layers were frozen (not trainable); only the final embedding layer was trained.
4. The ResNet50 model consists of 50 layers.
5. The building block of ResNet50 is the residual block.
6. Below is a diagram of a residual block:
![Image Captioning RESIDUAL BLOCK model](images/residual-block.png)
7. Instead of hoping each stack of layers directly fits a desired underlying mapping, we explicitly let these layers fit a residual mapping.
8. The original mapping is recast into F(x)+x.
9. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
10. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
11. To learn more about residual networks, refer to the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf).
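
The F(x)+x idea can be illustrated with a few lines of plain Python. This is only a sketch: real residual blocks use convolutions, batch normalization and ReLU, and the functions `zero_residual` and `small_residual` below are hypothetical stand-ins for the stacked layers F.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where `residual_fn` plays the role of the stacked layers F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def zero_residual(x):
    """F(x) = 0: the block reduces to the identity mapping."""
    return [0.0 for _ in x]

def small_residual(x):
    """F(x) = 0.1 * x: the block applies a small correction on top of the input."""
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)
perturbed_out = residual_block(x, small_residual)

print(identity_out)
print(perturbed_out)
```

When the residual is pushed to zero, the block passes its input through unchanged, which is the "identity mapping" case described in point 10.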



#### RNN-Decoder
1] The RNN-Decoder:

| Layer | Description |
|:---------------------:|:---------------------------------------------:|
| Input | Captions encoded as integer word indices |
| Embedding layer | 128 units (`embed_size`) |
| LSTM layer | 100 units (`hidden_size`) |
| Output layer | `vocab_size` units (one per vocabulary word) |

2] Embedding layer:
  - When you're dealing with words in text, you end up with tens of thousands of classes to predict, one for each word.
  - One-hot encoding these words is massively inefficient: each vector has one element set to 1 and all the others set to 0.
  - In the matrix multiplication going into the first hidden layer, almost all of the products are zero. This is a huge waste of computation.
![Image Captioning Embedding_Explainataion_1](images/embedding_explainataion_1.png)
  - To solve this problem and greatly increase the efficiency of our networks, we use what are called embeddings.
  - Embeddings are just a fully connected layer like you've seen before.
  - We call this layer the embedding layer and the weights are embedding weights.
  - We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix.
  - We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding to the index of the "on" input unit.
  - Instead of doing the matrix multiplication, we use the weight matrix as a lookup table.
![Image Captioning Embedding_Explainataion_2](images/embedding_lookup.png)
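
The equivalence between the one-hot multiplication and the table lookup can be checked directly. The sketch below uses a made-up 4x3 weight matrix (4 words, embedding size 3); the values and the helper `matvec_onehot` are assumptions for illustration only.

```python
def matvec_onehot(one_hot, weights):
    """Multiply a one-hot row vector by the weight matrix (the slow way)."""
    n_rows, n_cols = len(weights), len(weights[0])
    return [sum(one_hot[r] * weights[r][c] for r in range(n_rows)) for c in range(n_cols)]

# Hypothetical embedding weights: one row per word in a 4-word vocabulary.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_index = 2
one_hot = [1.0 if i == word_index else 0.0 for i in range(len(weights))]

via_matmul = matvec_onehot(one_hot, weights)  # 12 multiplications, mostly by zero
via_lookup = weights[word_index]              # a single row access

print(via_matmul)
print(via_lookup)
```

Both routes give the same vector, but the lookup skips every multiplication by zero.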

3] LSTM Layer:
  - The LSTM layer consists of LSTM nodes (cells).
  - Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies
  - LSTMs are explicitly designed to avoid the long-term dependency problem.
  - Below is a diagram of an LSTM node and the computations that happen inside it:
  ![Image Captioning LSTM Node](images/lstm_node.png)
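
As a companion to the diagram, here is a minimal sketch of one time step of an LSTM cell in the standard formulation (input, forget and output gates plus a candidate value). It is reduced to scalars (input size 1, hidden size 1) for readability, and the parameter layout in `params` is an assumption for illustration; the diagram above may use different notation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; `params[name]` holds (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # how much new information to write
    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate values to write

    c_new = f * c_prev + i * g                  # updated cell state (long-term memory)
    h_new = o * math.tanh(c_new)                # updated hidden state (output)
    return h_new, c_new

# With all weights and biases at zero, every gate is 0.5 and the candidate is 0.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("input", "forget", "output", "candidate")}
h1, c1 = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h1, c1)
```

The additive update of the cell state is what lets information persist over many time steps.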



### Transformation Chosen and Why

1. `transform_train` is part of the pre-processing step.
2. The original image is resized so that its smaller edge is 256 pixels.
3. The resized image is randomly cropped to 224x224.
4. The cropped image is flipped horizontally with a probability of 50%.
5. The image is converted to a Torch tensor. This converts the image from (H x W x C) to (C x H x W), which is the layout the model accepts.
6. The tensor is normalized with the per-channel mean and standard deviation.
7. The `transform_train` in code: `transform_train = transforms.Compose([transforms.Resize(256), transforms.RandomCrop(224), transforms.RandomHorizontalFlip(), transforms.ToTensor(), transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])`
8. I left the transform at its provided value.
9. I think it's a good transform because in every epoch the model sees a slightly different version of each image with the same captions.
10. This helps the model generalize, because it effectively sees more varied data.
11. The images in the dataset have varying heights and widths, and the model cannot accept images of varying size, so all images are resized and cropped to 224x224.
12. When using a pre-trained model, you must apply the normalization that the model was trained with.
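
The arithmetic behind the last two transform steps can be sketched in plain Python. The helpers `to_unit_range` and `normalize` are hypothetical names that imitate what `ToTensor` (scaling 8-bit values to [0, 1]) and `Normalize` (subtracting the channel mean and dividing by the channel standard deviation) do to a single RGB pixel; they are not the torchvision implementations.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def to_unit_range(pixel_rgb):
    """Imitate the scaling in ToTensor: 8-bit values in [0, 255] -> floats in [0, 1]."""
    return tuple(value / 255.0 for value in pixel_rgb)

def normalize(pixel_rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Imitate Normalize: (value - channel mean) / channel std."""
    return tuple((value - m) / s for value, m, s in zip(pixel_rgb, mean, std))

white = normalize(to_unit_range((255, 255, 255)))
mean_pixel = normalize(IMAGENET_MEAN)

print(white)       # a white pixel maps to roughly +2.2 to +2.6 per channel
print(mean_pixel)  # a pixel equal to the mean maps to zero
```

The pre-trained ResNet50 expects inputs on this scale, which is why the mean and standard deviation must match the ones used during its original training.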

### Trainable Parameters Chosen and Why

1. Hyperparameters selected:
   - number of epochs = 3
   - batch size = 3
   - embedding size = 128
   - LSTM hidden size = 100
   - optimizer = Adam (learning rate 0.0001)
2. The only trainable weights were those of the RNN decoder and the last (embedding) layer of the CNN encoder.
3. `params = list(decoder.parameters()) + list(encoder.embed.parameters())`
4. ResNet50 is already pre-trained on ImageNet, so it is a good candidate for extracting rich features. Not training the whole ResNet50 saves training time and precious GPU memory, because training such a large network would be very costly.
5. The last layer of the CNN encoder is trainable so that the selected features are adapted to this specific dataset.
6. With this strategy we get better features while saving GPU memory and training time.
7. The RNN decoder weights are trainable because the decoder has to learn the relationship between the image features and the captions.
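
To get a feel for how small the trainable part is, the sketch below counts parameters under a few assumptions: `encoder.embed` is a linear layer from ResNet50's 2048-dimensional feature vector to `embed_size`, and the decoder is an embedding layer, a single-layer LSTM and a linear output layer. The actual layers are defined in **model.py**, and `vocab_size = 9000` is a hypothetical value used only for illustration.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def lstm_params(input_size, hidden_size):
    """Single-layer LSTM as parameterized in PyTorch: 4 gates, two bias vectors."""
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

embed_size, hidden_size = 128, 100   # values chosen in this notebook
resnet_feature_size = 2048           # size of ResNet50's final feature vector
vocab_size = 9000                    # hypothetical; the real value comes from the data loader

encoder_embed = linear_params(resnet_feature_size, embed_size)
decoder_total = (
    vocab_size * embed_size                   # word embedding table
    + lstm_params(embed_size, hidden_size)    # LSTM layer
    + linear_params(hidden_size, vocab_size)  # output layer over the vocabulary
)

print(encoder_embed, decoder_total, encoder_embed + decoder_total)
```

Under these assumptions the trainable part is a few million parameters, compared with roughly 25 million in the frozen ResNet50 backbone.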


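### Optimizer

- The Adam optimizer is selected.
- Adam is chosen because it is computationally efficient.
- It typically converges faster than plain stochastic gradient descent.
- In practice it often works well with little tuning compared with other variants of gradient descent.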
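
For reference, a single Adam update for one scalar parameter can be sketched as below. The function `adam_step` is a simplified illustration, not the `torch.optim.Adam` implementation; it uses the learning rate chosen in this notebook (`1e-4`) and the commonly used default values for the other hyperparameters.

```python
import math

def adam_step(param, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return its new value."""
    state["t"] += 1
    # Exponential moving averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
w = 1.0
w_after = adam_step(w, grad=2.0, state=state)
print(w - w_after)  # size of the first step
```

On the first step the update is approximately `lr` in magnitude regardless of the gradient's scale, which is one reason Adam tends to need less learning-rate tuning.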

In [3]:
import sys
sys.path.append('./cocoapi/PythonAPI')
from pycocotools.coco import COCO
!pip install nltk
import nltk
nltk.download('punkt')
import torch
import torch.nn as nn
from torchvision import transforms
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math


## TODO #1: Select appropriate values for the Python variables below.
batch_size = 3          # batch size
vocab_threshold = 5       # minimum word count threshold
vocab_from_file = True    # if True, load existing vocab file
embed_size = 128          # dimensionality of image and word embeddings
hidden_size = 100        # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# (Optional) TODO #2: Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# TODO #3: Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.embed.parameters())

# TODO #4: Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.0001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

[nltk_data] Downloading package punkt to /home/jai/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...


Done (t=0.42s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [00:29<00:00, 14013.36it/s]
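
The number of steps per epoch follows directly from the caption count reported above and the batch size. The short sketch below repeats the `total_step` calculation from the code cell with the numbers filled in; `num_captions` is simply the count shown by the progress bar.

```python
import math

num_captions = 414113  # number of training captions reported by the progress bar
batch_size = 3         # value set in the code cell above

# Same formula as in the code cell: round up so the last partial batch is counted.
total_step = math.ceil(num_captions / batch_size)
print(total_step)
```

This matches the step count that appears in the training log in **Step 2**.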


<a id='step2'></a>
## Step 2: Train your Model

Once you have executed the code cell in **Step 1**, the training procedure below should run without issue.  

It is completely fine to leave the code cell below as-is without modifications to train your model.  However, if you would like to modify the code used to train the model below, you must ensure that your changes are easily parsed by your reviewer.  In other words, make sure to provide appropriate comments to describe how your code works!  

You may find it useful to load saved weights to resume training.  In that case, note the names of the files containing the encoder and decoder weights that you'd like to load (`encoder_file` and `decoder_file`).  Then you can load the weights by using the lines below:

```python
# Load pre-trained weights before resuming training.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))
```
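
If you follow the `encoder-i.pkl` / `decoder-i.pkl` naming convention from **Step 1**, the file names for a given epoch can be built as in this small sketch. The helper `checkpoint_paths` is a hypothetical convenience function, not part of the project code.

```python
import os

def checkpoint_paths(epoch, model_dir="./models"):
    """Return (encoder_path, decoder_path) for the weights saved after `epoch`."""
    encoder_file = "encoder-%d.pkl" % epoch
    decoder_file = "decoder-%d.pkl" % epoch
    return os.path.join(model_dir, encoder_file), os.path.join(model_dir, decoder_file)

encoder_path, decoder_path = checkpoint_paths(epoch=2)
print(encoder_path)
print(decoder_path)

# Before loading, it is worth checking that both files actually exist.
ready_to_resume = os.path.exists(encoder_path) and os.path.exists(decoder_path)
print(ready_to_resume)
```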

While trying out parameters, make sure to take extensive notes and record the settings that you used in your various training runs.  In particular, you don't want to encounter a situation where you've trained a model for several hours but can't remember what settings you used :).

### A Note on Tuning Hyperparameters

To figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training - and for the purposes of this project, you are encouraged to amend the hyperparameters based on this information.  
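
The perplexity printed during training is the exponential of the cross-entropy loss, as in the `np.exp(loss.item())` call in the training loop below. The sketch here uses only the standard library; the 1000-word uniform model is a hypothetical reference point.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the (natural-log) cross-entropy loss."""
    return math.exp(cross_entropy_loss)

# A model that guesses uniformly among 1000 words has loss ln(1000),
# so its perplexity equals the number of words it is choosing between.
uniform_loss = math.log(1000)
print(perplexity(uniform_loss))

# A loss of 0 means the model is certain and correct: perplexity 1.
print(perplexity(0.0))

# The first logged loss in the training output below.
logged = perplexity(8.5686)
print(logged)
```

Intuitively, a perplexity of about 20 means the model is, on average, as uncertain as if it were choosing uniformly among about 20 words at each step.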

However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.  

For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data.  For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence (**3_Inference.ipynb**) to see how your model performs on the test data.  If your model needs to be changed, you can come back to this notebook, amend hyperparameters (if necessary), and re-train the model.

That said, if you would like to go above and beyond in this project, you can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636).  In the next (optional) step of this notebook, we provide some guidance for assessing the performance on the validation dataset.

In [4]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions[:, :-1])
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [100/138038], Loss: 8.5686, Perplexity: 5263.7438
Epoch [1/3], Step [200/138038], Loss: 6.6477, Perplexity: 771.03467
Epoch [1/3], Step [300/138038], Loss: 5.2150, Perplexity: 184.01379
Epoch [1/3], Step [400/138038], Loss: 5.6655, Perplexity: 288.7222
Epoch [1/3], Step [500/138038], Loss: 5.0063, Perplexity: 149.3552
Epoch [1/3], Step [600/138038], Loss: 4.8283, Perplexity: 125.0007
Epoch [1/3], Step [700/138038], Loss: 5.7179, Perplexity: 304.2780
Epoch [1/3], Step [800/138038], Loss: 4.6182, Perplexity: 101.3149
Epoch [1/3], Step [900/138038], Loss: 5.3391, Perplexity: 208.3182
Epoch [1/3], Step [1000/138038], Loss: 4.6895, Perplexity: 108.7992
Epoch [1/3], Step [1100/138038], Loss: 4.2879, Perplexity: 72.81400
Epoch [1/3], Step [1200/138038], Loss: 5.0057, Perplexity: 149.2591
Epoch [1/3], Step [1300/138038], Loss: 4.7591, Perplexity: 116.6428
Epoch [1/3], Step [1400/138038], Loss: 4.3322, Perplexity: 76.11296
Epoch [1/3], Step [1500/138038], Loss: 4.2645, Perplex

Epoch [1/3], Step [12100/138038], Loss: 4.0432, Perplexity: 57.00975
Epoch [1/3], Step [12200/138038], Loss: 3.3443, Perplexity: 28.34206
Epoch [1/3], Step [12300/138038], Loss: 4.5519, Perplexity: 94.81028
Epoch [1/3], Step [12400/138038], Loss: 3.3265, Perplexity: 27.84173
Epoch [1/3], Step [12500/138038], Loss: 3.6667, Perplexity: 39.12309
Epoch [1/3], Step [12600/138038], Loss: 4.5171, Perplexity: 91.56865
Epoch [1/3], Step [12700/138038], Loss: 3.5817, Perplexity: 35.93408
Epoch [1/3], Step [12800/138038], Loss: 3.6377, Perplexity: 38.00561
Epoch [1/3], Step [12900/138038], Loss: 4.4117, Perplexity: 82.41141
Epoch [1/3], Step [13000/138038], Loss: 3.4575, Perplexity: 31.73709
Epoch [1/3], Step [13100/138038], Loss: 4.0240, Perplexity: 55.92453
Epoch [1/3], Step [13200/138038], Loss: 3.5956, Perplexity: 36.43741
Epoch [1/3], Step [13300/138038], Loss: 4.3064, Perplexity: 74.16936
Epoch [1/3], Step [13400/138038], Loss: 3.7802, Perplexity: 43.82321
Epoch [1/3], Step [13500/138038], 

Epoch [1/3], Step [23900/138038], Loss: 3.2845, Perplexity: 26.69646
Epoch [1/3], Step [24000/138038], Loss: 3.5984, Perplexity: 36.54088
Epoch [1/3], Step [24100/138038], Loss: 4.9037, Perplexity: 134.7816
Epoch [1/3], Step [24200/138038], Loss: 3.0545, Perplexity: 21.209812
Epoch [1/3], Step [24300/138038], Loss: 2.8637, Perplexity: 17.52710
Epoch [1/3], Step [24400/138038], Loss: 2.9109, Perplexity: 18.37307
Epoch [1/3], Step [24500/138038], Loss: 3.8301, Perplexity: 46.06902
Epoch [1/3], Step [24600/138038], Loss: 4.4488, Perplexity: 85.52164
Epoch [1/3], Step [24700/138038], Loss: 3.2617, Perplexity: 26.09405
Epoch [1/3], Step [24800/138038], Loss: 3.5144, Perplexity: 33.59662
Epoch [1/3], Step [24900/138038], Loss: 3.8400, Perplexity: 46.52477
Epoch [1/3], Step [25000/138038], Loss: 3.5025, Perplexity: 33.19805
Epoch [1/3], Step [25100/138038], Loss: 4.9707, Perplexity: 144.1274
Epoch [1/3], Step [25200/138038], Loss: 4.2295, Perplexity: 68.68360
Epoch [1/3], Step [25300/138038],

Epoch [1/3], Step [35700/138038], Loss: 2.8109, Perplexity: 16.62566
Epoch [1/3], Step [35800/138038], Loss: 2.8033, Perplexity: 16.49872
Epoch [1/3], Step [35900/138038], Loss: 3.7450, Perplexity: 42.30819
Epoch [1/3], Step [36000/138038], Loss: 2.8795, Perplexity: 17.80564
Epoch [1/3], Step [36100/138038], Loss: 3.2155, Perplexity: 24.91511
Epoch [1/3], Step [36200/138038], Loss: 3.3663, Perplexity: 28.96993
Epoch [1/3], Step [36300/138038], Loss: 3.8657, Perplexity: 47.7359
Epoch [1/3], Step [36400/138038], Loss: 3.0360, Perplexity: 20.82201
Epoch [1/3], Step [36500/138038], Loss: 3.1364, Perplexity: 23.02061
Epoch [1/3], Step [36600/138038], Loss: 4.2066, Perplexity: 67.12641
Epoch [1/3], Step [36700/138038], Loss: 3.6267, Perplexity: 37.58870
Epoch [1/3], Step [36800/138038], Loss: 3.7455, Perplexity: 42.33155
Epoch [1/3], Step [36900/138038], Loss: 4.1058, Perplexity: 60.69061
Epoch [1/3], Step [37000/138038], Loss: 4.7067, Perplexity: 110.6813
Epoch [1/3], Step [37100/138038], L

Epoch [1/3], Step [47500/138038], Loss: 2.8126, Perplexity: 16.65302
Epoch [1/3], Step [47600/138038], Loss: 2.7977, Perplexity: 16.40741
Epoch [1/3], Step [47700/138038], Loss: 2.4577, Perplexity: 11.67811
Epoch [1/3], Step [47800/138038], Loss: 3.1066, Perplexity: 22.3441
Epoch [1/3], Step [47900/138038], Loss: 2.8816, Perplexity: 17.84365
Epoch [1/3], Step [48000/138038], Loss: 2.7723, Perplexity: 15.99600
Epoch [1/3], Step [48100/138038], Loss: 2.6535, Perplexity: 14.20391
Epoch [1/3], Step [48200/138038], Loss: 3.4741, Perplexity: 32.26932
Epoch [1/3], Step [48300/138038], Loss: 2.4853, Perplexity: 12.00468
Epoch [1/3], Step [48400/138038], Loss: 2.1443, Perplexity: 8.536191
Epoch [1/3], Step [48500/138038], Loss: 4.8910, Perplexity: 133.0816
Epoch [1/3], Step [48600/138038], Loss: 2.9137, Perplexity: 18.42464
Epoch [1/3], Step [48700/138038], Loss: 4.4796, Perplexity: 88.20241
Epoch [1/3], Step [48800/138038], Loss: 3.1966, Perplexity: 24.44820
Epoch [1/3], Step [48900/138038], L

Epoch [1/3], Step [59400/138038], Loss: 3.2827, Perplexity: 26.64792
Epoch [1/3], Step [59500/138038], Loss: 2.3942, Perplexity: 10.95994
Epoch [1/3], Step [59600/138038], Loss: 3.5814, Perplexity: 35.92399
Epoch [1/3], Step [59700/138038], Loss: 3.4234, Perplexity: 30.67414
Epoch [1/3], Step [59800/138038], Loss: 3.2238, Perplexity: 25.12228
Epoch [1/3], Step [59900/138038], Loss: 1.9141, Perplexity: 6.78082
Epoch [1/3], Step [60000/138038], Loss: 3.2716, Perplexity: 26.35354
Epoch [1/3], Step [60100/138038], Loss: 2.8215, Perplexity: 16.80139
Epoch [1/3], Step [60200/138038], Loss: 3.4999, Perplexity: 33.11183
Epoch [1/3], Step [60300/138038], Loss: 2.9420, Perplexity: 18.95326
Epoch [1/3], Step [60400/138038], Loss: 3.4727, Perplexity: 32.22225
Epoch [1/3], Step [60500/138038], Loss: 3.3759, Perplexity: 29.2505
Epoch [1/3], Step [60600/138038], Loss: 2.7917, Perplexity: 16.3089
Epoch [1/3], Step [60700/138038], Loss: 3.0878, Perplexity: 21.9280
Epoch [1/3], Step [60800/138038], Loss

Epoch [1/3], Step [71300/138038], Loss: 2.1870, Perplexity: 8.908310
Epoch [1/3], Step [71400/138038], Loss: 2.6071, Perplexity: 13.55998
Epoch [1/3], Step [71500/138038], Loss: 3.7703, Perplexity: 43.39218
Epoch [1/3], Step [71600/138038], Loss: 3.2505, Perplexity: 25.8021
Epoch [1/3], Step [71700/138038], Loss: 3.2722, Perplexity: 26.36881
Epoch [1/3], Step [71800/138038], Loss: 2.7749, Perplexity: 16.03660
Epoch [1/3], Step [71900/138038], Loss: 3.3764, Perplexity: 29.26395
Epoch [1/3], Step [72000/138038], Loss: 3.9761, Perplexity: 53.3094
Epoch [1/3], Step [72100/138038], Loss: 2.5661, Perplexity: 13.01533
Epoch [1/3], Step [72200/138038], Loss: 2.7054, Perplexity: 14.96029
Epoch [1/3], Step [72300/138038], Loss: 3.3413, Perplexity: 28.25563
Epoch [1/3], Step [72400/138038], Loss: 3.3236, Perplexity: 27.75965
Epoch [1/3], Step [72500/138038], Loss: 3.4092, Perplexity: 30.2416
Epoch [1/3], Step [72600/138038], Loss: 3.4655, Perplexity: 31.9924
Epoch [1/3], Step [72700/138038], Loss

Epoch [1/3], Step [83200/138038], Loss: 2.5239, Perplexity: 12.4773
Epoch [1/3], Step [83300/138038], Loss: 3.4295, Perplexity: 30.86132
Epoch [1/3], Step [83400/138038], Loss: 2.7663, Perplexity: 15.89995
Epoch [1/3], Step [83500/138038], Loss: 3.9382, Perplexity: 51.32364
Epoch [1/3], Step [83600/138038], Loss: 3.5115, Perplexity: 33.4973
Epoch [1/3], Step [83700/138038], Loss: 1.9497, Perplexity: 7.026893
Epoch [1/3], Step [83800/138038], Loss: 2.4103, Perplexity: 11.13737
Epoch [1/3], Step [83900/138038], Loss: 2.3445, Perplexity: 10.42858
Epoch [1/3], Step [84000/138038], Loss: 2.2178, Perplexity: 9.187236
Epoch [1/3], Step [84100/138038], Loss: 3.3973, Perplexity: 29.8839
Epoch [1/3], Step [84200/138038], Loss: 3.8991, Perplexity: 49.3588
Epoch [1/3], Step [84300/138038], Loss: 2.8285, Perplexity: 16.9205
Epoch [1/3], Step [84400/138038], Loss: 2.3753, Perplexity: 10.75377
Epoch [1/3], Step [84500/138038], Loss: 2.2343, Perplexity: 9.34033
Epoch [1/3], Step [84600/138038], Loss: 

Epoch [1/3], Step [95100/138038], Loss: 2.7854, Perplexity: 16.2057
Epoch [1/3], Step [95200/138038], Loss: 2.1455, Perplexity: 8.54644
Epoch [1/3], Step [95300/138038], Loss: 4.0678, Perplexity: 58.4287
Epoch [1/3], Step [95400/138038], Loss: 3.1456, Perplexity: 23.23315
Epoch [1/3], Step [95500/138038], Loss: 3.0752, Perplexity: 21.6538
Epoch [1/3], Step [95600/138038], Loss: 3.0064, Perplexity: 20.2152
Epoch [1/3], Step [95700/138038], Loss: 3.0085, Perplexity: 20.2565
Epoch [1/3], Step [95800/138038], Loss: 2.5559, Perplexity: 12.88344
Epoch [1/3], Step [95900/138038], Loss: 3.3458, Perplexity: 28.3838
Epoch [1/3], Step [96000/138038], Loss: 4.2828, Perplexity: 72.44658
Epoch [1/3], Step [96100/138038], Loss: 3.3388, Perplexity: 28.18631
Epoch [1/3], Step [96200/138038], Loss: 3.3909, Perplexity: 29.69380
Epoch [1/3], Step [96300/138038], Loss: 2.6639, Perplexity: 14.35268
Epoch [1/3], Step [96400/138038], Loss: 2.8335, Perplexity: 17.00442
Epoch [1/3], Step [96500/138038], Loss: 3

Epoch [1/3], Step [106900/138038], Loss: 2.0406, Perplexity: 7.69545
Epoch [1/3], Step [107000/138038], Loss: 3.3565, Perplexity: 28.69005
Epoch [1/3], Step [107100/138038], Loss: 2.8187, Perplexity: 16.7546
Epoch [1/3], Step [107200/138038], Loss: 2.2195, Perplexity: 9.202889
Epoch [1/3], Step [107300/138038], Loss: 5.4174, Perplexity: 225.2998
Epoch [1/3], Step [107400/138038], Loss: 3.5884, Perplexity: 36.1764
Epoch [1/3], Step [107500/138038], Loss: 2.6924, Perplexity: 14.76679
Epoch [1/3], Step [107600/138038], Loss: 3.2419, Perplexity: 25.5817
Epoch [1/3], Step [107700/138038], Loss: 2.4961, Perplexity: 12.13521
Epoch [1/3], Step [107800/138038], Loss: 1.6575, Perplexity: 5.24610
Epoch [1/3], Step [107900/138038], Loss: 3.1528, Perplexity: 23.40152
Epoch [1/3], Step [108000/138038], Loss: 3.3610, Perplexity: 28.81903
Epoch [1/3], Step [108100/138038], Loss: 3.4402, Perplexity: 31.1930
Epoch [1/3], Step [108200/138038], Loss: 1.9766, Perplexity: 7.21847
Epoch [1/3], Step [108300/1

Epoch [1/3], Step [118600/138038], Loss: 2.8330, Perplexity: 16.99675
Epoch [1/3], Step [118700/138038], Loss: 3.4954, Perplexity: 32.96263
Epoch [1/3], Step [118800/138038], Loss: 3.4174, Perplexity: 30.48928
Epoch [1/3], Step [118900/138038], Loss: 2.8034, Perplexity: 16.5009
Epoch [1/3], Step [119000/138038], Loss: 2.6590, Perplexity: 14.28181
Epoch [1/3], Step [119100/138038], Loss: 2.6279, Perplexity: 13.8453
Epoch [1/3], Step [119200/138038], Loss: 3.6638, Perplexity: 39.0080
Epoch [1/3], Step [119300/138038], Loss: 2.6813, Perplexity: 14.6033
Epoch [1/3], Step [119400/138038], Loss: 2.4453, Perplexity: 11.53366
Epoch [1/3], Step [119500/138038], Loss: 1.9819, Perplexity: 7.256816
Epoch [1/3], Step [119600/138038], Loss: 3.4162, Perplexity: 30.4538
Epoch [1/3], Step [119700/138038], Loss: 2.2672, Perplexity: 9.65253
Epoch [1/3], Step [119800/138038], Loss: 2.7053, Perplexity: 14.95898
Epoch [1/3], Step [119900/138038], Loss: 3.0473, Perplexity: 21.05955
Epoch [1/3], Step [120000/

Epoch [1/3], Step [130300/138038], Loss: 2.3817, Perplexity: 10.82320
Epoch [1/3], Step [130400/138038], Loss: 3.7157, Perplexity: 41.08612
Epoch [1/3], Step [130500/138038], Loss: 3.5638, Perplexity: 35.2971
Epoch [1/3], Step [130600/138038], Loss: 3.2072, Perplexity: 24.7107
Epoch [1/3], Step [130700/138038], Loss: 3.1776, Perplexity: 23.98862
Epoch [1/3], Step [130800/138038], Loss: 3.3794, Perplexity: 29.3541
Epoch [1/3], Step [130900/138038], Loss: 2.9580, Perplexity: 19.2592
Epoch [1/3], Step [131000/138038], Loss: 2.4329, Perplexity: 11.39245
Epoch [1/3], Step [131100/138038], Loss: 1.9997, Perplexity: 7.387247
Epoch [1/3], Step [131200/138038], Loss: 2.6294, Perplexity: 13.86599
Epoch [1/3], Step [131300/138038], Loss: 2.5550, Perplexity: 12.87193
Epoch [1/3], Step [131400/138038], Loss: 2.5588, Perplexity: 12.9203
Epoch [1/3], Step [131500/138038], Loss: 2.6404, Perplexity: 14.0190
Epoch [1/3], Step [131600/138038], Loss: 2.8090, Perplexity: 16.5941
Epoch [1/3], Step [131700/1

Epoch [2/3], Step [4200/138038], Loss: 2.6225, Perplexity: 13.7699
Epoch [2/3], Step [4300/138038], Loss: 3.2807, Perplexity: 26.5951
Epoch [2/3], Step [4400/138038], Loss: 3.2105, Perplexity: 24.7903
Epoch [2/3], Step [4500/138038], Loss: 2.3529, Perplexity: 10.51564
Epoch [2/3], Step [4600/138038], Loss: 2.3536, Perplexity: 10.52293
Epoch [2/3], Step [4700/138038], Loss: 5.3407, Perplexity: 208.6662
Epoch [2/3], Step [4800/138038], Loss: 2.1951, Perplexity: 8.98100
Epoch [2/3], Step [4900/138038], Loss: 3.6114, Perplexity: 37.0164
Epoch [2/3], Step [5000/138038], Loss: 2.5498, Perplexity: 12.80463
Epoch [2/3], Step [5100/138038], Loss: 2.4323, Perplexity: 11.3853
Epoch [2/3], Step [5200/138038], Loss: 2.8539, Perplexity: 17.3555
Epoch [2/3], Step [5300/138038], Loss: 1.8082, Perplexity: 6.099680
Epoch [2/3], Step [5400/138038], Loss: 2.4579, Perplexity: 11.6798
Epoch [2/3], Step [5500/138038], Loss: 3.6737, Perplexity: 39.3991
Epoch [2/3], Step [5600/138038], Loss: 3.3349, Perplexity

Epoch [2/3], Step [16200/138038], Loss: 2.2146, Perplexity: 9.15749
Epoch [2/3], Step [16300/138038], Loss: 2.0966, Perplexity: 8.138549
Epoch [2/3], Step [16400/138038], Loss: 2.5290, Perplexity: 12.5409
Epoch [2/3], Step [16500/138038], Loss: 2.7139, Perplexity: 15.08783
Epoch [2/3], Step [16600/138038], Loss: 2.7462, Perplexity: 15.58319
Epoch [2/3], Step [16700/138038], Loss: 2.9715, Perplexity: 19.5209
Epoch [2/3], Step [16800/138038], Loss: 2.8013, Perplexity: 16.46651
Epoch [2/3], Step [16900/138038], Loss: 2.2405, Perplexity: 9.39769
Epoch [2/3], Step [17000/138038], Loss: 3.7010, Perplexity: 40.48600
Epoch [2/3], Step [17100/138038], Loss: 2.8438, Perplexity: 17.18082
Epoch [2/3], Step [17200/138038], Loss: 3.3751, Perplexity: 29.2276
Epoch [2/3], Step [17300/138038], Loss: 2.0858, Perplexity: 8.05065
Epoch [2/3], Step [17400/138038], Loss: 2.8076, Perplexity: 16.5705
Epoch [2/3], Step [17500/138038], Loss: 2.9747, Perplexity: 19.5837
Epoch [2/3], Step [17600/138038], Loss: 4.

Epoch [2/3], Step [28100/138038], Loss: 2.3354, Perplexity: 10.3331
Epoch [2/3], Step [28200/138038], Loss: 3.2090, Perplexity: 24.75499
Epoch [2/3], Step [28300/138038], Loss: 2.3272, Perplexity: 10.2494
Epoch [2/3], Step [28400/138038], Loss: 3.4425, Perplexity: 31.2637
Epoch [2/3], Step [28500/138038], Loss: 3.0801, Perplexity: 21.7603
Epoch [2/3], Step [28600/138038], Loss: 2.4903, Perplexity: 12.06515
Epoch [2/3], Step [28700/138038], Loss: 3.4151, Perplexity: 30.4202
Epoch [2/3], Step [28800/138038], Loss: 2.6644, Perplexity: 14.3590
Epoch [2/3], Step [28900/138038], Loss: 3.2219, Perplexity: 25.0761
Epoch [2/3], Step [29000/138038], Loss: 3.1582, Perplexity: 23.52854
Epoch [2/3], Step [29100/138038], Loss: 3.7089, Perplexity: 40.8072
Epoch [2/3], Step [29200/138038], Loss: 1.9516, Perplexity: 7.040042
Epoch [2/3], Step [29300/138038], Loss: 2.1647, Perplexity: 8.712474
Epoch [2/3], Step [29400/138038], Loss: 2.8234, Perplexity: 16.8344
Epoch [2/3], Step [29500/138038], Loss: 2.6

Epoch [2/3], Step [40000/138038], Loss: 3.1610, Perplexity: 23.5937
Epoch [2/3], Step [40100/138038], Loss: 3.5896, Perplexity: 36.2183
Epoch [2/3], Step [40200/138038], Loss: 3.5441, Perplexity: 34.6088
Epoch [2/3], Step [40300/138038], Loss: 2.2944, Perplexity: 9.918609
Epoch [2/3], Step [40400/138038], Loss: 2.4163, Perplexity: 11.2048
Epoch [2/3], Step [40500/138038], Loss: 2.0583, Perplexity: 7.832642
Epoch [2/3], Step [40600/138038], Loss: 3.0159, Perplexity: 20.4070
Epoch [2/3], Step [40700/138038], Loss: 3.3462, Perplexity: 28.3958
Epoch [2/3], Step [40800/138038], Loss: 4.2988, Perplexity: 73.6095
Epoch [2/3], Step [40900/138038], Loss: 2.8282, Perplexity: 16.91421
Epoch [2/3], Step [41000/138038], Loss: 2.8749, Perplexity: 17.7237
Epoch [2/3], Step [41100/138038], Loss: 3.0457, Perplexity: 21.0237
Epoch [2/3], Step [41200/138038], Loss: 2.8078, Perplexity: 16.57354
Epoch [2/3], Step [41300/138038], Loss: 2.3266, Perplexity: 10.2426
Epoch [2/3], Step [41400/138038], Loss: 3.28

Epoch [2/3], Step [51900/138038], Loss: 3.1727, Perplexity: 23.8727
Epoch [2/3], Step [52000/138038], Loss: 2.4150, Perplexity: 11.1894
Epoch [2/3], Step [52100/138038], Loss: 3.6569, Perplexity: 38.74302
Epoch [2/3], Step [52200/138038], Loss: 1.8380, Perplexity: 6.284199
Epoch [2/3], Step [52300/138038], Loss: 2.0100, Perplexity: 7.46363
Epoch [2/3], Step [52400/138038], Loss: 2.1368, Perplexity: 8.47238
Epoch [2/3], Step [52500/138038], Loss: 2.5437, Perplexity: 12.7267
Epoch [2/3], Step [52600/138038], Loss: 2.9068, Perplexity: 18.2978
Epoch [2/3], Step [52700/138038], Loss: 3.3521, Perplexity: 28.5620
Epoch [2/3], Step [52800/138038], Loss: 2.8746, Perplexity: 17.71799
Epoch [2/3], Step [52900/138038], Loss: 3.4780, Perplexity: 32.3942
Epoch [2/3], Step [53000/138038], Loss: 3.4525, Perplexity: 31.5803
Epoch [2/3], Step [53100/138038], Loss: 3.0722, Perplexity: 21.5894
Epoch [2/3], Step [53200/138038], Loss: 2.3660, Perplexity: 10.6545
Epoch [2/3], Step [53300/138038], Loss: 5.098

Epoch [2/3], Step [63800/138038], Loss: 3.3243, Perplexity: 27.7783
Epoch [2/3], Step [63900/138038], Loss: 2.9044, Perplexity: 18.2535
Epoch [2/3], Step [64000/138038], Loss: 2.7361, Perplexity: 15.4271
Epoch [2/3], Step [64100/138038], Loss: 1.8469, Perplexity: 6.34045
Epoch [2/3], Step [64200/138038], Loss: 1.9841, Perplexity: 7.27284
Epoch [2/3], Step [64300/138038], Loss: 2.4224, Perplexity: 11.2728
Epoch [2/3], Step [64400/138038], Loss: 1.6498, Perplexity: 5.20607
Epoch [2/3], Step [64500/138038], Loss: 2.6510, Perplexity: 14.16830
Epoch [2/3], Step [64600/138038], Loss: 1.3881, Perplexity: 4.007339
Epoch [2/3], Step [64700/138038], Loss: 2.7427, Perplexity: 15.5281
Epoch [2/3], Step [64800/138038], Loss: 4.4128, Perplexity: 82.50020
Epoch [2/3], Step [64900/138038], Loss: 3.3759, Perplexity: 29.2513
Epoch [2/3], Step [65000/138038], Loss: 2.4905, Perplexity: 12.06700
Epoch [2/3], Step [65100/138038], Loss: 1.9391, Perplexity: 6.95276
Epoch [2/3], Step [65200/138038], Loss: 3.16

Epoch [2/3], Step [75700/138038], Loss: 2.7706, Perplexity: 15.9680
Epoch [2/3], Step [75800/138038], Loss: 2.6734, Perplexity: 14.4897
Epoch [2/3], Step [75900/138038], Loss: 2.6888, Perplexity: 14.71384
Epoch [2/3], Step [76000/138038], Loss: 2.8849, Perplexity: 17.90179
Epoch [2/3], Step [76100/138038], Loss: 3.8090, Perplexity: 45.10560
Epoch [2/3], Step [76200/138038], Loss: 2.7534, Perplexity: 15.6964
Epoch [2/3], Step [76300/138038], Loss: 2.2824, Perplexity: 9.79978
Epoch [2/3], Step [76400/138038], Loss: 3.4624, Perplexity: 31.8929
Epoch [2/3], Step [76500/138038], Loss: 1.7757, Perplexity: 5.90457
Epoch [2/3], Step [76600/138038], Loss: 2.9664, Perplexity: 19.4226
Epoch [2/3], Step [76700/138038], Loss: 2.5998, Perplexity: 13.46102
Epoch [2/3], Step [76800/138038], Loss: 3.8585, Perplexity: 47.3940
Epoch [2/3], Step [76900/138038], Loss: 2.8424, Perplexity: 17.1568
Epoch [2/3], Step [77000/138038], Loss: 3.9868, Perplexity: 53.8802
Epoch [2/3], Step [77100/138038], Loss: 2.17

Epoch [2/3], Step [87600/138038], Loss: 2.0122, Perplexity: 7.479826
Epoch [2/3], Step [87700/138038], Loss: 2.5861, Perplexity: 13.27791
Epoch [2/3], Step [87800/138038], Loss: 2.8218, Perplexity: 16.8074
Epoch [2/3], Step [87900/138038], Loss: 1.9725, Perplexity: 7.188429
Epoch [2/3], Step [88000/138038], Loss: 2.6794, Perplexity: 14.5770
Epoch [2/3], Step [88100/138038], Loss: 2.0735, Perplexity: 7.95245
Epoch [2/3], Step [88200/138038], Loss: 2.8321, Perplexity: 16.9811
Epoch [2/3], Step [88300/138038], Loss: 1.9434, Perplexity: 6.982874
Epoch [2/3], Step [88400/138038], Loss: 2.8355, Perplexity: 17.0385
Epoch [2/3], Step [88500/138038], Loss: 1.9221, Perplexity: 6.83549
Epoch [2/3], Step [88600/138038], Loss: 1.8228, Perplexity: 6.189182
Epoch [2/3], Step [88700/138038], Loss: 3.7791, Perplexity: 43.77486
Epoch [2/3], Step [88800/138038], Loss: 3.1052, Perplexity: 22.3132
Epoch [2/3], Step [88900/138038], Loss: 2.6354, Perplexity: 13.9483
Epoch [2/3], Step [89000/138038], Loss: 1.

Epoch [2/3], Step [99500/138038], Loss: 3.1853, Perplexity: 24.1750
Epoch [2/3], Step [99600/138038], Loss: 2.1018, Perplexity: 8.181200
Epoch [2/3], Step [99700/138038], Loss: 4.3739, Perplexity: 79.3512
Epoch [2/3], Step [99800/138038], Loss: 2.1165, Perplexity: 8.30212
Epoch [2/3], Step [99900/138038], Loss: 2.6738, Perplexity: 14.4955
Epoch [2/3], Step [100000/138038], Loss: 2.0919, Perplexity: 8.0999
Epoch [2/3], Step [100100/138038], Loss: 2.5637, Perplexity: 12.9832
Epoch [2/3], Step [100200/138038], Loss: 2.6143, Perplexity: 13.6577
Epoch [2/3], Step [100300/138038], Loss: 3.3018, Perplexity: 27.1606
Epoch [2/3], Step [100400/138038], Loss: 2.5444, Perplexity: 12.7350
Epoch [2/3], Step [100500/138038], Loss: 3.3057, Perplexity: 27.2684
Epoch [2/3], Step [100600/138038], Loss: 1.9831, Perplexity: 7.265655
Epoch [2/3], Step [100700/138038], Loss: 1.9607, Perplexity: 7.10409
Epoch [2/3], Step [100800/138038], Loss: 2.4382, Perplexity: 11.45216
Epoch [2/3], Step [100900/138038], Lo

Epoch [2/3], Step [111300/138038], Loss: 3.5440, Perplexity: 34.6039
Epoch [2/3], Step [111400/138038], Loss: 3.3560, Perplexity: 28.67335
Epoch [2/3], Step [111500/138038], Loss: 2.8552, Perplexity: 17.3783
Epoch [2/3], Step [111600/138038], Loss: 3.2158, Perplexity: 24.9236
Epoch [2/3], Step [111700/138038], Loss: 2.4342, Perplexity: 11.4070
Epoch [2/3], Step [111800/138038], Loss: 2.1793, Perplexity: 8.84016
Epoch [2/3], Step [111900/138038], Loss: 2.4447, Perplexity: 11.5276
Epoch [2/3], Step [112000/138038], Loss: 2.7244, Perplexity: 15.24719
Epoch [2/3], Step [112100/138038], Loss: 2.9195, Perplexity: 18.53196
Epoch [2/3], Step [112200/138038], Loss: 1.5088, Perplexity: 4.52151
Epoch [2/3], Step [112300/138038], Loss: 2.5904, Perplexity: 13.3355
Epoch [2/3], Step [112400/138038], Loss: 2.2774, Perplexity: 9.750905
Epoch [2/3], Step [112500/138038], Loss: 2.1540, Perplexity: 8.61978
Epoch [2/3], Step [112600/138038], Loss: 2.9498, Perplexity: 19.1028
Epoch [2/3], Step [112700/1380

Epoch [2/3], Step [123100/138038], Loss: 2.3346, Perplexity: 10.32539
Epoch [2/3], Step [123200/138038], Loss: 3.4257, Perplexity: 30.74275
Epoch [2/3], Step [123300/138038], Loss: 4.0307, Perplexity: 56.2994
Epoch [2/3], Step [123400/138038], Loss: 3.1849, Perplexity: 24.16479
Epoch [2/3], Step [123500/138038], Loss: 2.9408, Perplexity: 18.9302
Epoch [2/3], Step [123600/138038], Loss: 4.1598, Perplexity: 64.06136
Epoch [2/3], Step [123700/138038], Loss: 2.3697, Perplexity: 10.69422
Epoch [2/3], Step [123800/138038], Loss: 3.7614, Perplexity: 43.00917
Epoch [2/3], Step [123900/138038], Loss: 1.7113, Perplexity: 5.53637
Epoch [2/3], Step [124000/138038], Loss: 1.4631, Perplexity: 4.31926
Epoch [2/3], Step [124100/138038], Loss: 1.9771, Perplexity: 7.221833
Epoch [2/3], Step [124200/138038], Loss: 1.7046, Perplexity: 5.49900
Epoch [2/3], Step [124300/138038], Loss: 2.2748, Perplexity: 9.72611
Epoch [2/3], Step [124400/138038], Loss: 2.9422, Perplexity: 18.9571
Epoch [2/3], Step [124500/1

Epoch [2/3], Step [134900/138038], Loss: 2.4692, Perplexity: 11.81256
Epoch [2/3], Step [135000/138038], Loss: 2.2354, Perplexity: 9.35000
Epoch [2/3], Step [135100/138038], Loss: 2.5316, Perplexity: 12.5737
Epoch [2/3], Step [135200/138038], Loss: 1.8327, Perplexity: 6.25097
Epoch [2/3], Step [135300/138038], Loss: 2.3523, Perplexity: 10.50995
Epoch [2/3], Step [135400/138038], Loss: 2.2593, Perplexity: 9.57647
Epoch [2/3], Step [135500/138038], Loss: 4.4491, Perplexity: 85.55093
Epoch [2/3], Step [135600/138038], Loss: 2.4891, Perplexity: 12.0510
Epoch [2/3], Step [135700/138038], Loss: 3.4401, Perplexity: 31.1908
Epoch [2/3], Step [135800/138038], Loss: 3.3047, Perplexity: 27.2416
Epoch [2/3], Step [135900/138038], Loss: 2.0227, Perplexity: 7.558916
Epoch [2/3], Step [136000/138038], Loss: 2.3814, Perplexity: 10.8196
Epoch [2/3], Step [136100/138038], Loss: 2.5270, Perplexity: 12.51585
Epoch [2/3], Step [136200/138038], Loss: 4.1630, Perplexity: 64.2615
Epoch [2/3], Step [136300/138

Epoch [3/3], Step [8900/138038], Loss: 1.9777, Perplexity: 7.22618
Epoch [3/3], Step [9000/138038], Loss: 2.2669, Perplexity: 9.649347
Epoch [3/3], Step [9100/138038], Loss: 2.6766, Perplexity: 14.5351
Epoch [3/3], Step [9200/138038], Loss: 3.8008, Perplexity: 44.7352
Epoch [3/3], Step [9300/138038], Loss: 2.6007, Perplexity: 13.47363
Epoch [3/3], Step [9400/138038], Loss: 3.1416, Perplexity: 23.14125
Epoch [3/3], Step [9500/138038], Loss: 3.5093, Perplexity: 33.4248
Epoch [3/3], Step [9600/138038], Loss: 3.1011, Perplexity: 22.2227
Epoch [3/3], Step [9700/138038], Loss: 2.3281, Perplexity: 10.2579
Epoch [3/3], Step [9800/138038], Loss: 2.2091, Perplexity: 9.10747
Epoch [3/3], Step [9900/138038], Loss: 1.8033, Perplexity: 6.06961
Epoch [3/3], Step [10000/138038], Loss: 2.4624, Perplexity: 11.7335
Epoch [3/3], Step [10100/138038], Loss: 2.3211, Perplexity: 10.1864
Epoch [3/3], Step [10200/138038], Loss: 3.1921, Perplexity: 24.3394
Epoch [3/3], Step [10300/138038], Loss: 2.6841, Perplexi

Epoch [3/3], Step [20900/138038], Loss: 2.8320, Perplexity: 16.9795
Epoch [3/3], Step [21000/138038], Loss: 2.1513, Perplexity: 8.59561
Epoch [3/3], Step [21100/138038], Loss: 1.8689, Perplexity: 6.480872
Epoch [3/3], Step [21200/138038], Loss: 2.2352, Perplexity: 9.34866
Epoch [3/3], Step [21300/138038], Loss: 2.0129, Perplexity: 7.485013
Epoch [3/3], Step [21400/138038], Loss: 3.3815, Perplexity: 29.4149
Epoch [3/3], Step [21500/138038], Loss: 2.4372, Perplexity: 11.4408
Epoch [3/3], Step [21600/138038], Loss: 2.2018, Perplexity: 9.04158
Epoch [3/3], Step [21700/138038], Loss: 2.7034, Perplexity: 14.9303
Epoch [3/3], Step [21800/138038], Loss: 2.5760, Perplexity: 13.1443
Epoch [3/3], Step [21900/138038], Loss: 3.4863, Perplexity: 32.6640
Epoch [3/3], Step [22000/138038], Loss: 3.4503, Perplexity: 31.51136
Epoch [3/3], Step [22100/138038], Loss: 2.3689, Perplexity: 10.6857
Epoch [3/3], Step [22200/138038], Loss: 3.1470, Perplexity: 23.2661
Epoch [3/3], Step [22300/138038], Loss: 2.273

Epoch [3/3], Step [32800/138038], Loss: 2.0560, Perplexity: 7.81441
Epoch [3/3], Step [32900/138038], Loss: 1.9933, Perplexity: 7.33961
Epoch [3/3], Step [33000/138038], Loss: 2.4178, Perplexity: 11.22099
Epoch [3/3], Step [33100/138038], Loss: 2.3285, Perplexity: 10.26269
Epoch [3/3], Step [33200/138038], Loss: 2.2336, Perplexity: 9.33324
Epoch [3/3], Step [33300/138038], Loss: 2.3821, Perplexity: 10.8277
Epoch [3/3], Step [33400/138038], Loss: 2.1233, Perplexity: 8.35886
Epoch [3/3], Step [33500/138038], Loss: 4.1000, Perplexity: 60.34331
Epoch [3/3], Step [33600/138038], Loss: 2.8444, Perplexity: 17.1914
Epoch [3/3], Step [33700/138038], Loss: 2.4116, Perplexity: 11.1513
Epoch [3/3], Step [33800/138038], Loss: 2.4320, Perplexity: 11.38171
Epoch [3/3], Step [33900/138038], Loss: 1.8222, Perplexity: 6.18527
Epoch [3/3], Step [34000/138038], Loss: 3.1101, Perplexity: 22.4232
Epoch [3/3], Step [34100/138038], Loss: 5.1113, Perplexity: 165.8897
Epoch [3/3], Step [34200/138038], Loss: 2.1

Epoch [3/3], Step [44700/138038], Loss: 2.0753, Perplexity: 7.967375
Epoch [3/3], Step [44800/138038], Loss: 2.2745, Perplexity: 9.72285
Epoch [3/3], Step [44900/138038], Loss: 1.6817, Perplexity: 5.37502
Epoch [3/3], Step [45000/138038], Loss: 2.6777, Perplexity: 14.5517
Epoch [3/3], Step [45100/138038], Loss: 1.9640, Perplexity: 7.12817
Epoch [3/3], Step [45200/138038], Loss: 1.7959, Perplexity: 6.02502
Epoch [3/3], Step [45300/138038], Loss: 2.3102, Perplexity: 10.07613
Epoch [3/3], Step [45400/138038], Loss: 2.6508, Perplexity: 14.1659
Epoch [3/3], Step [45500/138038], Loss: 2.9498, Perplexity: 19.1016
Epoch [3/3], Step [45600/138038], Loss: 2.5667, Perplexity: 13.0233
Epoch [3/3], Step [45700/138038], Loss: 1.9132, Perplexity: 6.77491
Epoch [3/3], Step [45800/138038], Loss: 1.9510, Perplexity: 7.03578
Epoch [3/3], Step [45900/138038], Loss: 2.5138, Perplexity: 12.3522
Epoch [3/3], Step [46000/138038], Loss: 1.5065, Perplexity: 4.51085
Epoch [3/3], Step [46100/138038], Loss: 3.4021

Epoch [3/3], Step [56700/138038], Loss: 2.3769, Perplexity: 10.77180
Epoch [3/3], Step [56800/138038], Loss: 2.6165, Perplexity: 13.6881
Epoch [3/3], Step [56900/138038], Loss: 3.2324, Perplexity: 25.3406
Epoch [3/3], Step [57000/138038], Loss: 1.6538, Perplexity: 5.226698
Epoch [3/3], Step [57100/138038], Loss: 2.8489, Perplexity: 17.2681
Epoch [3/3], Step [57200/138038], Loss: 2.7261, Perplexity: 15.27270
Epoch [3/3], Step [57300/138038], Loss: 2.1756, Perplexity: 8.80702
Epoch [3/3], Step [57400/138038], Loss: 2.5423, Perplexity: 12.70858
Epoch [3/3], Step [57500/138038], Loss: 2.5854, Perplexity: 13.2680
Epoch [3/3], Step [57600/138038], Loss: 3.7019, Perplexity: 40.5225
Epoch [3/3], Step [57700/138038], Loss: 3.4144, Perplexity: 30.39882
Epoch [3/3], Step [57800/138038], Loss: 2.3118, Perplexity: 10.0925
Epoch [3/3], Step [57900/138038], Loss: 2.3881, Perplexity: 10.8927
Epoch [3/3], Step [58000/138038], Loss: 1.8247, Perplexity: 6.20078
Epoch [3/3], Step [58100/138038], Loss: 2.0

Epoch [3/3], Step [68600/138038], Loss: 2.9304, Perplexity: 18.7357
Epoch [3/3], Step [68700/138038], Loss: 2.0377, Perplexity: 7.673123
Epoch [3/3], Step [68800/138038], Loss: 2.2406, Perplexity: 9.398965
Epoch [3/3], Step [68900/138038], Loss: 3.6006, Perplexity: 36.6216
Epoch [3/3], Step [69000/138038], Loss: 2.1718, Perplexity: 8.77444
Epoch [3/3], Step [69100/138038], Loss: 3.0723, Perplexity: 21.59191
Epoch [3/3], Step [69200/138038], Loss: 2.5359, Perplexity: 12.6277
Epoch [3/3], Step [69300/138038], Loss: 2.1317, Perplexity: 8.428899
Epoch [3/3], Step [69400/138038], Loss: 3.1098, Perplexity: 22.4157
Epoch [3/3], Step [69500/138038], Loss: 3.1893, Perplexity: 24.27183
Epoch [3/3], Step [69600/138038], Loss: 2.4886, Perplexity: 12.0448
Epoch [3/3], Step [69700/138038], Loss: 3.6769, Perplexity: 39.5221
Epoch [3/3], Step [69800/138038], Loss: 2.8845, Perplexity: 17.8943
Epoch [3/3], Step [69900/138038], Loss: 2.0976, Perplexity: 8.146363
Epoch [3/3], Step [70000/138038], Loss: 2.

Epoch [3/3], Step [80500/138038], Loss: 2.9602, Perplexity: 19.3013
Epoch [3/3], Step [80600/138038], Loss: 3.7757, Perplexity: 43.62936
Epoch [3/3], Step [80700/138038], Loss: 2.8486, Perplexity: 17.2634
Epoch [3/3], Step [80800/138038], Loss: 2.6020, Perplexity: 13.4913
Epoch [3/3], Step [80900/138038], Loss: 3.4396, Perplexity: 31.1739
Epoch [3/3], Step [81000/138038], Loss: 2.0341, Perplexity: 7.64552
Epoch [3/3], Step [81100/138038], Loss: 2.5351, Perplexity: 12.61790
Epoch [3/3], Step [81200/138038], Loss: 2.0818, Perplexity: 8.01933
Epoch [3/3], Step [81300/138038], Loss: 1.7353, Perplexity: 5.67094
Epoch [3/3], Step [81400/138038], Loss: 2.7408, Perplexity: 15.50016
Epoch [3/3], Step [81500/138038], Loss: 2.8762, Perplexity: 17.74742
Epoch [3/3], Step [81600/138038], Loss: 2.1991, Perplexity: 9.01652
Epoch [3/3], Step [81700/138038], Loss: 1.8617, Perplexity: 6.43495
Epoch [3/3], Step [81800/138038], Loss: 1.3929, Perplexity: 4.026698
Epoch [3/3], Step [81900/138038], Loss: 2.0

Epoch [3/3], Step [92500/138038], Loss: 3.1656, Perplexity: 23.7034
Epoch [3/3], Step [92600/138038], Loss: 1.9836, Perplexity: 7.26897
Epoch [3/3], Step [92700/138038], Loss: 2.7907, Perplexity: 16.2924
Epoch [3/3], Step [92800/138038], Loss: 3.4324, Perplexity: 30.9521
Epoch [3/3], Step [92900/138038], Loss: 2.1071, Perplexity: 8.22451
Epoch [3/3], Step [93000/138038], Loss: 2.7586, Perplexity: 15.7785
Epoch [3/3], Step [93100/138038], Loss: 2.1748, Perplexity: 8.80052
Epoch [3/3], Step [93200/138038], Loss: 2.1459, Perplexity: 8.549498
Epoch [3/3], Step [93300/138038], Loss: 2.5813, Perplexity: 13.21476
Epoch [3/3], Step [93400/138038], Loss: 3.0215, Perplexity: 20.5218
Epoch [3/3], Step [93500/138038], Loss: 1.7904, Perplexity: 5.99174
Epoch [3/3], Step [93600/138038], Loss: 2.4320, Perplexity: 11.3821
Epoch [3/3], Step [93700/138038], Loss: 2.8603, Perplexity: 17.46664
Epoch [3/3], Step [93800/138038], Loss: 3.5226, Perplexity: 33.87334
Epoch [3/3], Step [93900/138038], Loss: 1.88

Epoch [3/3], Step [104400/138038], Loss: 2.6964, Perplexity: 14.8267
Epoch [3/3], Step [104500/138038], Loss: 2.9104, Perplexity: 18.3633
Epoch [3/3], Step [104600/138038], Loss: 2.1864, Perplexity: 8.90275
Epoch [3/3], Step [104700/138038], Loss: 2.9735, Perplexity: 19.5612
Epoch [3/3], Step [104800/138038], Loss: 3.5560, Perplexity: 35.0225
Epoch [3/3], Step [104900/138038], Loss: 2.5002, Perplexity: 12.1852
Epoch [3/3], Step [105000/138038], Loss: 2.6323, Perplexity: 13.9058
Epoch [3/3], Step [105100/138038], Loss: 3.0775, Perplexity: 21.7035
Epoch [3/3], Step [105200/138038], Loss: 2.6850, Perplexity: 14.6575
Epoch [3/3], Step [105300/138038], Loss: 2.7760, Perplexity: 16.05529
Epoch [3/3], Step [105400/138038], Loss: 2.0775, Perplexity: 7.98472
Epoch [3/3], Step [105500/138038], Loss: 2.2336, Perplexity: 9.33368
Epoch [3/3], Step [105600/138038], Loss: 2.7395, Perplexity: 15.4798
Epoch [3/3], Step [105700/138038], Loss: 2.1162, Perplexity: 8.29929
Epoch [3/3], Step [105800/138038]

Epoch [3/3], Step [116200/138038], Loss: 2.3896, Perplexity: 10.9088
Epoch [3/3], Step [116300/138038], Loss: 3.0512, Perplexity: 21.1408
Epoch [3/3], Step [116400/138038], Loss: 4.3008, Perplexity: 73.7593
Epoch [3/3], Step [116500/138038], Loss: 2.6851, Perplexity: 14.6599
Epoch [3/3], Step [116600/138038], Loss: 2.8647, Perplexity: 17.5434
Epoch [3/3], Step [116700/138038], Loss: 2.0040, Perplexity: 7.41887
Epoch [3/3], Step [116800/138038], Loss: 2.7541, Perplexity: 15.70725
Epoch [3/3], Step [116900/138038], Loss: 3.0219, Perplexity: 20.5293
Epoch [3/3], Step [117000/138038], Loss: 2.9881, Perplexity: 19.84803
Epoch [3/3], Step [117100/138038], Loss: 3.0766, Perplexity: 21.6835
Epoch [3/3], Step [117200/138038], Loss: 1.5370, Perplexity: 4.650532
Epoch [3/3], Step [117300/138038], Loss: 2.6679, Perplexity: 14.4101
Epoch [3/3], Step [117400/138038], Loss: 2.1249, Perplexity: 8.372249
Epoch [3/3], Step [117500/138038], Loss: 4.0636, Perplexity: 58.1814
Epoch [3/3], Step [117600/1380

Epoch [3/3], Step [128000/138038], Loss: 2.1731, Perplexity: 8.785794
Epoch [3/3], Step [128100/138038], Loss: 3.4485, Perplexity: 31.4527
Epoch [3/3], Step [128200/138038], Loss: 2.6992, Perplexity: 14.8679
Epoch [3/3], Step [128300/138038], Loss: 3.1350, Perplexity: 22.98910
Epoch [3/3], Step [128400/138038], Loss: 3.2690, Perplexity: 26.2851
Epoch [3/3], Step [128500/138038], Loss: 2.2970, Perplexity: 9.943987
Epoch [3/3], Step [128600/138038], Loss: 2.3822, Perplexity: 10.8283
Epoch [3/3], Step [128700/138038], Loss: 2.8874, Perplexity: 17.94601
Epoch [3/3], Step [128800/138038], Loss: 1.8701, Perplexity: 6.48936
Epoch [3/3], Step [128900/138038], Loss: 3.4509, Perplexity: 31.5277
Epoch [3/3], Step [129000/138038], Loss: 3.0091, Perplexity: 20.2693
Epoch [3/3], Step [129100/138038], Loss: 3.1505, Perplexity: 23.3478
Epoch [3/3], Step [129200/138038], Loss: 2.3955, Perplexity: 10.9734
Epoch [3/3], Step [129300/138038], Loss: 2.5236, Perplexity: 12.4738
Epoch [3/3], Step [129400/1380

<a id='step3'></a>
## Step 3: (Optional) Validate your Model

One way to check for overfitting is to evaluate the model on a validation set.  If you decide to do this **optional** task, you must first complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, you will write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses your RNN decoder to generate captions.  That code will prove very useful here.

If you decide to validate your model, please do not edit the data loader in **data_loader.py**.  Instead, create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data.  You can access:
- the validation images at filepath `'/opt/cocoapi/images/val2014/'`, and
- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.
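
As a starting point for **data_loader_val.py**, the sketch below reads the annotation file and groups the ground-truth captions by image.  It assumes the standard COCO captions layout (a top-level `"annotations"` list with `"image_id"` and `"caption"` keys, and an `"images"` list with `"id"` and `"file_name"` keys).  The function names `load_reference_captions` and `image_file_names` are hypothetical and not part of the project code.

```python
import json
from collections import defaultdict


def load_reference_captions(annotation_file):
    """Return {image_id: [caption, ...]} from a COCO-style captions file.

    Assumes each entry of data["annotations"] has "image_id" and "caption".
    """
    with open(annotation_file) as f:
        data = json.load(f)

    references = defaultdict(list)
    for ann in data["annotations"]:
        references[ann["image_id"]].append(ann["caption"].strip())
    return dict(references)


def image_file_names(annotation_file):
    """Return {image_id: file_name} from the "images" list of the same file."""
    with open(annotation_file) as f:
        data = json.load(f)
    return {img["id"]: img["file_name"] for img in data["images"]}
```

A validation data loader could iterate over the keys of the second dictionary, load each image from the validation image folder, and keep the image id alongside the tensor so that predictions can later be matched to their reference captions.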

The suggested approach to validating your model involves creating a JSON file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing your model's predicted captions for the validation images.  Then, you can write your own script or use one that you [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of your model.  You can read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr), in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf).  For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset.
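
A minimal sketch of the two pieces described above follows.  `save_results` writes predictions as a list of `{"image_id": ..., "caption": ...}` records, which is the shape of the example results file linked above.  `clipped_unigram_precision` computes only the unigram building block of BLEU (no higher-order n-grams, no brevity penalty), so treat it as a sanity check rather than a real BLEU score.  Both function names are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
import json
from collections import Counter


def save_results(predictions, path):
    """Write {image_id: caption} predictions as a COCO-style results list."""
    results = [
        {"image_id": int(image_id), "caption": caption}
        for image_id, caption in sorted(predictions.items())
    ]
    with open(path, "w") as f:
        json.dump(results, f)
    return results


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words found in the references, with each word's
    count clipped to its maximum count in any single reference."""
    cand_counts = Counter(candidate.lower().split())
    if not cand_counts:
        return 0.0

    # Largest number of times each word appears in any one reference.
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], count)

    clipped = sum(min(count, max_ref[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

For a reportable number, the results file can be passed to an established evaluation script such as the one linked above, which also handles tokenization and the remaining BLEU terms.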

In [None]:
# (Optional) TODO: Validate your model.