# Computer Vision Nanodegree

## Project: Image Captioning

---

In this notebook, you will train your CNN-RNN model.  

You are welcome and encouraged to try out many different architectures and hyperparameters when searching for a good model.

This does have the potential to make the project quite messy!  Before submitting your project, make sure that you clean up:
- the code you write in this notebook.  The notebook should describe how to train a single CNN-RNN architecture, corresponding to your final choice of hyperparameters.  You should structure the notebook so that the reviewer can replicate your results by running the code in this notebook.  
- the output of the code cell in **Step 2**.  It should show the log produced when training the model from scratch.

This notebook **will be graded**.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model
- [Step 3](#step3): (Optional) Validate your Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, you will customize the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values you set now will be used when training your model in **Step 2** below.

You should only amend blocks of code that are preceded by a `TODO` statement.  **Any code blocks that are not preceded by a `TODO` statement should not be modified**.

### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file that records the loss and perplexity at every training step.
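
To make the effect of `vocab_threshold` concrete, the sketch below builds a vocabulary from a toy corpus at several thresholds. The captions, the `build_vocab` helper, and the whitespace tokenization are illustrative assumptions; this is not the project's actual vocabulary code.

```python
from collections import Counter


def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, count in counts.items() if count >= vocab_threshold)


# Toy captions, made up for illustration only.
captions = [
    "a dog runs on the beach",
    "a dog sits on a couch",
    "a cat sits on the couch",
]

for threshold in (1, 2, 3):
    vocab = build_vocab(captions, threshold)
    print(threshold, len(vocab), vocab)
```

Raising the threshold drops the rarest words first, so the vocabulary shrinks as the threshold grows.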

If you're not sure where to begin when setting the values above, [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) offer useful guidance.  **To avoid spending too long on this notebook**, use them to obtain a strong initial guess for the hyperparameters that are likely to work best.  Then train a single model and proceed to the next notebook (**3_Inference.ipynb**).  If you are unhappy with the model's performance, you can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train your model.

### Question 1

**Question:** Describe your CNN-RNN architecture in detail.  With this architecture in mind, how did you select the values of the variables in Task 1?  If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.

**Answer:** 


### (Optional) Task #2

Note that we have provided a recommended image transform `transform_train` for pre-processing the training images, but you are welcome (and encouraged!) to modify it as you wish.  When modifying this transform, keep in mind that:
- the images in the dataset have varying heights and widths, and 
- if using a pre-trained model, you must apply the same normalization that the model was trained with.
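
As a rough illustration of what the normalization step does, the pure-Python sketch below subtracts a per-channel mean and divides by a per-channel standard deviation. The mean and standard deviation values are the ones used in the provided `transform_train`; the `normalize_pixel` helper is hypothetical and works on a single RGB pixel rather than a full image tensor.

```python
# Per-channel statistics taken from the provided transform_train.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


mid_gray = normalize_pixel((0.5, 0.5, 0.5))
mean_pixel = normalize_pixel(MEAN)

print(mid_gray)    # small positive values: mid-gray is slightly above each channel mean
print(mean_pixel)  # a pixel equal to the channel means maps to zero in every channel
```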

### Question 2

**Question:** How did you select the transform in `transform_train`?  If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

**Answer:** 

### Task #3

Next, you will specify a Python list containing the learnable parameters of the model.  For instance, if you decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then you should set `params` to something like:
```python
params = list(decoder.parameters()) + list(encoder.embed.parameters())
```
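
The sketch below shows the same pattern on two tiny stand-in modules, so it can be run without the dataset. `TinyEncoder` and `TinyDecoder` are hypothetical and are not the classes in **model.py**; the only thing they are assumed to share with the real encoder is an `embed` layer on top of a frozen backbone.

```python
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Stand-in encoder: a frozen 'backbone' followed by a trainable embedding layer."""

    def __init__(self, embed_size):
        super().__init__()
        self.backbone = nn.Linear(8, 16)  # placeholder for a pre-trained CNN
        self.embed = nn.Linear(16, embed_size)
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # freeze the backbone weights


class TinyDecoder(nn.Module):
    """Stand-in decoder: an LSTM followed by a linear layer over the vocabulary."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)


encoder = TinyEncoder(embed_size=4)
decoder = TinyDecoder(embed_size=4, hidden_size=6, vocab_size=10)

# All decoder weights plus only the embedding layer of the encoder.
params = list(decoder.parameters()) + list(encoder.embed.parameters())
num_trainable = sum(p.numel() for p in params)
print(len(params), num_trainable)
```

Counting the entries in `params` is a quick sanity check that the optimizer will see exactly the layers you intended to train.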

### Question 3

**Question:** How did you select the trainable parameters of your architecture?  Why do you think this is a good choice?

**Answer:** 

### Task #4

Finally, you will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).

### Question 4

**Question:** How did you select the optimizer used to train your model?

**Answer:** 

In [1]:
import sys
sys.path.append('./cocoapi/PythonAPI')
from pycocotools.coco import COCO
!pip install nltk
import nltk
nltk.download('punkt')
import torch
import torch.nn as nn
from torchvision import transforms
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math


## TODO #1: Select appropriate values for the Python variables below.
batch_size = 3          # batch size
vocab_threshold = 5       # minimum word count threshold
vocab_from_file = True    # if True, load existing vocab file
embed_size = 128          # dimensionality of image and word embeddings
hidden_size = 100        # number of features in hidden state of the RNN decoder
num_epochs = 3             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # print the batch loss on a new line every `print_every` steps
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# (Optional) TODO #2: Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# TODO #3: Specify the learnable parameters of the model.
params = list(decoder.parameters())

# TODO #4: Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.0001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[nltk_data] Downloading package punkt to /home/jai/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...


  0%|          | 0/414113 [00:00<?, ?it/s]

Done (t=0.40s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [00:29<00:00, 14156.99it/s]
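
The number of training steps per epoch follows directly from the caption count reported above and the chosen `batch_size`. The short sketch below redoes that arithmetic for a few batch sizes; `steps_per_epoch` is a hypothetical helper that mirrors the `total_step` formula in the setup cell, and the batch sizes other than 3 are arbitrary examples.

```python
import math


def steps_per_epoch(num_captions, batch_size):
    """Number of batches needed to cover every caption once (last batch may be partial)."""
    return math.ceil(num_captions / batch_size)


num_captions = 414113  # caption count shown by the progress bar above

for batch_size in (3, 32, 64, 128):
    print(f"batch_size={batch_size:>3} -> total_step={steps_per_epoch(num_captions, batch_size)}")
```

With `batch_size = 3` this gives 138038 steps, which matches the step counter in the training log in **Step 2**.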


<a id='step2'></a>
## Step 2: Train your Model

Once you have executed the code cell in **Step 1**, the training procedure below should run without issue.  

It is completely fine to leave the code cell below as-is without modifications to train your model.  However, if you would like to modify the code used to train the model below, you must ensure that your changes are easily parsed by your reviewer.  In other words, make sure to provide appropriate comments to describe how your code works!  

You may find it useful to load saved weights to resume training.  In that case, note the names of the files containing the encoder and decoder weights that you'd like to load (`encoder_file` and `decoder_file`).  Then you can load the weights by using the lines below:

```python
import os

# Load pre-trained weights before resuming training.
# `encoder_file` and `decoder_file` are the names of the saved weight files.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))
```
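
If you save weights after every epoch, you may want to pick the most recent files automatically. The sketch below is one possible way to do that, assuming the `encoder-i.pkl` / `decoder-i.pkl` naming scheme described in Step 1; `latest_checkpoint` is a hypothetical helper and not part of the project code.

```python
import os
import re


def latest_checkpoint(model_dir, prefix):
    """Return (epoch, filename) for the highest-numbered `<prefix>-<epoch>.pkl`, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)\.pkl$")
    found = []
    if os.path.isdir(model_dir):
        for name in os.listdir(model_dir):
            match = pattern.match(name)
            if match:
                # Compare epochs as integers so that epoch 10 sorts after epoch 2.
                found.append((int(match.group(1)), name))
    return max(found) if found else None


print(latest_checkpoint('./models', 'encoder'))
print(latest_checkpoint('./models', 'decoder'))
```

The filename in the returned tuple can then be used as `encoder_file` or `decoder_file` in the loading lines above.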

While trying out parameters, make sure to take extensive notes and record the settings that you used in your various training runs.  In particular, you don't want to encounter a situation where you've trained a model for several hours but can't remember what settings you used :).

### A Note on Tuning Hyperparameters

To figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training - and for the purposes of this project, you are encouraged to amend the hyperparameters based on this information.  

However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.  

For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data.  For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence (**3_Inference.ipynb**) to see how your model performs on the test data.  If your model needs to be changed, you can come back to this notebook, amend hyperparameters (if necessary), and re-train the model.

That said, if you would like to go above and beyond in this project, you can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636).  In the next (optional) step of this notebook, we provide some guidance for assessing the performance on the validation dataset.
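
Because the per-batch loss is noisy, it can be easier to judge progress from an average over a window of logged steps. The sketch below parses lines in the format written to `log_file` and averages the loss per window; perplexity is computed as the exponential of the loss, as in the training cell. The `windowed_mean_loss` helper and the sample lines are made up for illustration.

```python
import math
import re

# Matches the "Step [i/N], Loss: x.xxxx" part of a logged line.
LOSS_RE = re.compile(r"Step \[(\d+)/\d+\], Loss: (\d+\.\d+)")


def windowed_mean_loss(lines, window):
    """Average the logged loss over consecutive windows of `window` steps."""
    losses = [float(m.group(2)) for m in map(LOSS_RE.search, lines) if m]
    return [
        sum(losses[i:i + window]) / len(losses[i:i + window])
        for i in range(0, len(losses), window)
    ]


# Made-up sample in the same format as the training log.
sample_log = [
    "Epoch [1/3], Step [1/4], Loss: 4.0000, Perplexity: 54.5982",
    "Epoch [1/3], Step [2/4], Loss: 3.0000, Perplexity: 20.0855",
    "Epoch [1/3], Step [3/4], Loss: 3.5000, Perplexity: 33.1155",
    "Epoch [1/3], Step [4/4], Loss: 2.5000, Perplexity: 12.1825",
]

for mean_loss in windowed_mean_loss(sample_log, window=2):
    print(f"mean loss {mean_loss:.4f}, perplexity {math.exp(mean_loss):.4f}")
```

To apply it to a real run, pass the open log file instead of the sample list, for example `windowed_mean_loss(open(log_file), window=100)`.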

In [3]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions[:, :-1])
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the model parameters with the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights (create the models/ folder first if it does not exist).
    if epoch % save_every == 0:
        os.makedirs('./models', exist_ok=True)
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [100/138038], Loss: 2.1073, Perplexity: 8.22573
Epoch [1/3], Step [200/138038], Loss: 1.8138, Perplexity: 6.133653
Epoch [1/3], Step [300/138038], Loss: 3.0154, Perplexity: 20.3978
Epoch [1/3], Step [400/138038], Loss: 2.0732, Perplexity: 7.950005
Epoch [1/3], Step [500/138038], Loss: 2.2363, Perplexity: 9.358334
Epoch [1/3], Step [600/138038], Loss: 3.2319, Perplexity: 25.32776
Epoch [1/3], Step [700/138038], Loss: 2.9498, Perplexity: 19.1027
Epoch [1/3], Step [800/138038], Loss: 2.7602, Perplexity: 15.8037
Epoch [1/3], Step [900/138038], Loss: 1.7479, Perplexity: 5.74269
Epoch [1/3], Step [1000/138038], Loss: 2.0718, Perplexity: 7.93890
Epoch [1/3], Step [1100/138038], Loss: 3.0420, Perplexity: 20.9478
Epoch [1/3], Step [1200/138038], Loss: 2.5171, Perplexity: 12.3921
Epoch [1/3], Step [1300/138038], Loss: 1.9686, Perplexity: 7.16056
Epoch [1/3], Step [1400/138038], Loss: 2.5132, Perplexity: 12.3443
Epoch [1/3], Step [1500/138038], Loss: 3.2980, Perplexity: 27.0586


Epoch [1/3], Step [12200/138038], Loss: 2.0956, Perplexity: 8.13073
Epoch [1/3], Step [12300/138038], Loss: 3.6374, Perplexity: 37.9922
Epoch [1/3], Step [12400/138038], Loss: 2.1328, Perplexity: 8.438724
Epoch [1/3], Step [12500/138038], Loss: 3.4372, Perplexity: 31.1000
Epoch [1/3], Step [12600/138038], Loss: 3.2756, Perplexity: 26.4595
Epoch [1/3], Step [12700/138038], Loss: 3.0183, Perplexity: 20.4566
Epoch [1/3], Step [12800/138038], Loss: 2.2882, Perplexity: 9.857315
Epoch [1/3], Step [12900/138038], Loss: 2.5864, Perplexity: 13.28180
Epoch [1/3], Step [13000/138038], Loss: 2.2209, Perplexity: 9.21555
Epoch [1/3], Step [13100/138038], Loss: 3.4572, Perplexity: 31.72958
Epoch [1/3], Step [13200/138038], Loss: 2.8030, Perplexity: 16.49360
Epoch [1/3], Step [13300/138038], Loss: 1.9693, Perplexity: 7.16598
Epoch [1/3], Step [13400/138038], Loss: 2.5268, Perplexity: 12.51348
Epoch [1/3], Step [13500/138038], Loss: 2.9461, Perplexity: 19.0319
Epoch [1/3], Step [13600/138038], Loss: 2.

Epoch [1/3], Step [24100/138038], Loss: 2.8817, Perplexity: 17.84521
Epoch [1/3], Step [24200/138038], Loss: 3.3792, Perplexity: 29.34755
Epoch [1/3], Step [24300/138038], Loss: 2.5739, Perplexity: 13.11694
Epoch [1/3], Step [24400/138038], Loss: 2.3819, Perplexity: 10.82539
Epoch [1/3], Step [24500/138038], Loss: 3.8884, Perplexity: 48.83479
Epoch [1/3], Step [24600/138038], Loss: 2.5974, Perplexity: 13.42875
Epoch [1/3], Step [24700/138038], Loss: 3.3128, Perplexity: 27.46327
Epoch [1/3], Step [24800/138038], Loss: 2.7550, Perplexity: 15.72175
Epoch [1/3], Step [24900/138038], Loss: 3.2608, Perplexity: 26.0709
Epoch [1/3], Step [25000/138038], Loss: 2.9898, Perplexity: 19.88090
Epoch [1/3], Step [25100/138038], Loss: 2.1415, Perplexity: 8.51213
Epoch [1/3], Step [25200/138038], Loss: 3.0569, Perplexity: 21.2626
Epoch [1/3], Step [25300/138038], Loss: 3.1833, Perplexity: 24.12686
Epoch [1/3], Step [25400/138038], Loss: 2.4326, Perplexity: 11.38811
Epoch [1/3], Step [25500/138038], Los

Epoch [1/3], Step [36000/138038], Loss: 2.2816, Perplexity: 9.79264
Epoch [1/3], Step [36100/138038], Loss: 2.5006, Perplexity: 12.19005
Epoch [1/3], Step [36200/138038], Loss: 1.8668, Perplexity: 6.46731
Epoch [1/3], Step [36300/138038], Loss: 3.1962, Perplexity: 24.43902
Epoch [1/3], Step [36400/138038], Loss: 3.3298, Perplexity: 27.93414
Epoch [1/3], Step [36500/138038], Loss: 2.1622, Perplexity: 8.69063
Epoch [1/3], Step [36600/138038], Loss: 3.1644, Perplexity: 23.6756
Epoch [1/3], Step [36700/138038], Loss: 2.7430, Perplexity: 15.5332
Epoch [1/3], Step [36800/138038], Loss: 1.8140, Perplexity: 6.134733
Epoch [1/3], Step [36900/138038], Loss: 3.8543, Perplexity: 47.19407
Epoch [1/3], Step [37000/138038], Loss: 3.4390, Perplexity: 31.1562
Epoch [1/3], Step [37100/138038], Loss: 2.2522, Perplexity: 9.50888
Epoch [1/3], Step [37200/138038], Loss: 1.8642, Perplexity: 6.45114
Epoch [1/3], Step [37300/138038], Loss: 2.6673, Perplexity: 14.40172
Epoch [1/3], Step [37400/138038], Loss: 2.

Epoch [1/3], Step [47900/138038], Loss: 3.1202, Perplexity: 22.65009
Epoch [1/3], Step [48000/138038], Loss: 3.8524, Perplexity: 47.1068
Epoch [1/3], Step [48100/138038], Loss: 2.5847, Perplexity: 13.25969
Epoch [1/3], Step [48200/138038], Loss: 2.0853, Perplexity: 8.04739
Epoch [1/3], Step [48300/138038], Loss: 1.9570, Perplexity: 7.078428
Epoch [1/3], Step [48400/138038], Loss: 3.4398, Perplexity: 31.18156
Epoch [1/3], Step [48500/138038], Loss: 3.7587, Perplexity: 42.8926
Epoch [1/3], Step [48600/138038], Loss: 2.0457, Perplexity: 7.73420
Epoch [1/3], Step [48700/138038], Loss: 2.3135, Perplexity: 10.10992
Epoch [1/3], Step [48800/138038], Loss: 2.7472, Perplexity: 15.5997
Epoch [1/3], Step [48900/138038], Loss: 4.0048, Perplexity: 54.8629
Epoch [1/3], Step [49000/138038], Loss: 2.4247, Perplexity: 11.2989
Epoch [1/3], Step [49100/138038], Loss: 1.5961, Perplexity: 4.93354
Epoch [1/3], Step [49200/138038], Loss: 3.0476, Perplexity: 21.06469
Epoch [1/3], Step [49300/138038], Loss: 3.

Epoch [1/3], Step [59800/138038], Loss: 3.2039, Perplexity: 24.62826
Epoch [1/3], Step [59900/138038], Loss: 3.1835, Perplexity: 24.1308
Epoch [1/3], Step [60000/138038], Loss: 3.7621, Perplexity: 43.03709
Epoch [1/3], Step [60100/138038], Loss: 2.2384, Perplexity: 9.37847
Epoch [1/3], Step [60200/138038], Loss: 3.4748, Perplexity: 32.29114
Epoch [1/3], Step [60300/138038], Loss: 2.4585, Perplexity: 11.6874
Epoch [1/3], Step [60400/138038], Loss: 2.9516, Perplexity: 19.13622
Epoch [1/3], Step [60500/138038], Loss: 2.9734, Perplexity: 19.5580
Epoch [1/3], Step [60600/138038], Loss: 4.2401, Perplexity: 69.41329
Epoch [1/3], Step [60700/138038], Loss: 2.4597, Perplexity: 11.7018
Epoch [1/3], Step [60800/138038], Loss: 2.0369, Perplexity: 7.66692
Epoch [1/3], Step [60900/138038], Loss: 2.8112, Perplexity: 16.6297
Epoch [1/3], Step [61000/138038], Loss: 2.7987, Perplexity: 16.42354
Epoch [1/3], Step [61100/138038], Loss: 2.2821, Perplexity: 9.796909
Epoch [1/3], Step [61200/138038], Loss: 2

Epoch [1/3], Step [71700/138038], Loss: 3.0512, Perplexity: 21.14105
Epoch [1/3], Step [71800/138038], Loss: 2.5655, Perplexity: 13.0065
Epoch [1/3], Step [71900/138038], Loss: 3.0512, Perplexity: 21.14111
Epoch [1/3], Step [72000/138038], Loss: 1.8329, Perplexity: 6.25216
Epoch [1/3], Step [72100/138038], Loss: 2.8314, Perplexity: 16.9689
Epoch [1/3], Step [72200/138038], Loss: 1.7297, Perplexity: 5.63907
Epoch [1/3], Step [72300/138038], Loss: 4.9277, Perplexity: 138.0590
Epoch [1/3], Step [72400/138038], Loss: 3.3444, Perplexity: 28.34284
Epoch [1/3], Step [72500/138038], Loss: 3.4984, Perplexity: 33.0630
Epoch [1/3], Step [72600/138038], Loss: 2.6906, Perplexity: 14.74080
Epoch [1/3], Step [72700/138038], Loss: 3.2189, Perplexity: 24.9999
Epoch [1/3], Step [72800/138038], Loss: 2.7656, Perplexity: 15.88868
Epoch [1/3], Step [72900/138038], Loss: 1.8735, Perplexity: 6.51130
Epoch [1/3], Step [73000/138038], Loss: 2.6148, Perplexity: 13.6645
Epoch [1/3], Step [73100/138038], Loss: 1.

Epoch [1/3], Step [83600/138038], Loss: 2.7917, Perplexity: 16.3090
Epoch [1/3], Step [83700/138038], Loss: 2.4942, Perplexity: 12.11215
Epoch [1/3], Step [83800/138038], Loss: 3.3196, Perplexity: 27.6495
Epoch [1/3], Step [83900/138038], Loss: 2.2727, Perplexity: 9.70569
Epoch [1/3], Step [84000/138038], Loss: 2.4913, Perplexity: 12.0770
Epoch [1/3], Step [84100/138038], Loss: 2.4564, Perplexity: 11.66243
Epoch [1/3], Step [84200/138038], Loss: 3.1149, Perplexity: 22.5304
Epoch [1/3], Step [84300/138038], Loss: 2.3465, Perplexity: 10.4488
Epoch [1/3], Step [84400/138038], Loss: 2.3136, Perplexity: 10.1113
Epoch [1/3], Step [84500/138038], Loss: 2.3524, Perplexity: 10.5106
Epoch [1/3], Step [84600/138038], Loss: 2.5660, Perplexity: 13.0139
Epoch [1/3], Step [84700/138038], Loss: 3.1738, Perplexity: 23.8993
Epoch [1/3], Step [84800/138038], Loss: 2.6870, Perplexity: 14.6873
Epoch [1/3], Step [84900/138038], Loss: 2.6521, Perplexity: 14.18338
Epoch [1/3], Step [85000/138038], Loss: 2.054

Epoch [1/3], Step [95500/138038], Loss: 2.7259, Perplexity: 15.2699
Epoch [1/3], Step [95600/138038], Loss: 3.3018, Perplexity: 27.16115
Epoch [1/3], Step [95700/138038], Loss: 2.2180, Perplexity: 9.18872
Epoch [1/3], Step [95800/138038], Loss: 2.4779, Perplexity: 11.9159
Epoch [1/3], Step [95900/138038], Loss: 3.4462, Perplexity: 31.38175
Epoch [1/3], Step [96000/138038], Loss: 2.8026, Perplexity: 16.4879
Epoch [1/3], Step [96100/138038], Loss: 2.4503, Perplexity: 11.5919
Epoch [1/3], Step [96200/138038], Loss: 2.2113, Perplexity: 9.12804
Epoch [1/3], Step [96300/138038], Loss: 2.3321, Perplexity: 10.29967
Epoch [1/3], Step [96400/138038], Loss: 3.3091, Perplexity: 27.36150
Epoch [1/3], Step [96500/138038], Loss: 2.5188, Perplexity: 12.4141
Epoch [1/3], Step [96600/138038], Loss: 3.4956, Perplexity: 32.9687
Epoch [1/3], Step [96700/138038], Loss: 2.8276, Perplexity: 16.9049
Epoch [1/3], Step [96800/138038], Loss: 2.1412, Perplexity: 8.50938
Epoch [1/3], Step [96900/138038], Loss: 2.18

Epoch [1/3], Step [107300/138038], Loss: 3.0222, Perplexity: 20.5371
Epoch [1/3], Step [107400/138038], Loss: 2.3175, Perplexity: 10.1505
Epoch [1/3], Step [107500/138038], Loss: 2.8638, Perplexity: 17.52723
Epoch [1/3], Step [107600/138038], Loss: 3.5701, Perplexity: 35.5206
Epoch [1/3], Step [107700/138038], Loss: 3.3533, Perplexity: 28.59837
Epoch [1/3], Step [107800/138038], Loss: 1.9152, Perplexity: 6.788377
Epoch [1/3], Step [107900/138038], Loss: 2.7581, Perplexity: 15.7702
Epoch [1/3], Step [108000/138038], Loss: 2.7319, Perplexity: 15.3613
Epoch [1/3], Step [108100/138038], Loss: 2.2395, Perplexity: 9.38849
Epoch [1/3], Step [108200/138038], Loss: 3.5607, Perplexity: 35.1878
Epoch [1/3], Step [108300/138038], Loss: 3.0536, Perplexity: 21.1910
Epoch [1/3], Step [108400/138038], Loss: 2.0237, Perplexity: 7.566141
Epoch [1/3], Step [108500/138038], Loss: 2.5817, Perplexity: 13.2192
Epoch [1/3], Step [108600/138038], Loss: 2.9159, Perplexity: 18.4659
Epoch [1/3], Step [108700/1380

Epoch [1/3], Step [119100/138038], Loss: 1.8442, Perplexity: 6.32307
Epoch [1/3], Step [119200/138038], Loss: 4.3412, Perplexity: 76.8018
Epoch [1/3], Step [119300/138038], Loss: 2.0200, Perplexity: 7.538691
Epoch [1/3], Step [119400/138038], Loss: 2.3776, Perplexity: 10.7786
Epoch [1/3], Step [119500/138038], Loss: 3.2245, Perplexity: 25.1415
Epoch [1/3], Step [119600/138038], Loss: 2.0701, Perplexity: 7.92593
Epoch [1/3], Step [119700/138038], Loss: 1.4634, Perplexity: 4.320786
Epoch [1/3], Step [119800/138038], Loss: 4.0587, Perplexity: 57.9012
Epoch [1/3], Step [119900/138038], Loss: 2.8352, Perplexity: 17.03330
Epoch [1/3], Step [120000/138038], Loss: 3.4864, Perplexity: 32.6691
Epoch [1/3], Step [120100/138038], Loss: 2.8369, Perplexity: 17.06241
Epoch [1/3], Step [120200/138038], Loss: 2.9542, Perplexity: 19.1869
Epoch [1/3], Step [120300/138038], Loss: 1.7118, Perplexity: 5.538886
Epoch [1/3], Step [120400/138038], Loss: 3.6186, Perplexity: 37.2838
Epoch [1/3], Step [120500/138

Epoch [1/3], Step [130800/138038], Loss: 1.7676, Perplexity: 5.85650
Epoch [1/3], Step [130900/138038], Loss: 2.6828, Perplexity: 14.6261
Epoch [1/3], Step [131000/138038], Loss: 2.2434, Perplexity: 9.42574
Epoch [1/3], Step [131100/138038], Loss: 4.1742, Perplexity: 64.98581
Epoch [1/3], Step [131200/138038], Loss: 2.2762, Perplexity: 9.739354
Epoch [1/3], Step [131300/138038], Loss: 2.4422, Perplexity: 11.4985
Epoch [1/3], Step [131400/138038], Loss: 2.0511, Perplexity: 7.77626
Epoch [1/3], Step [131500/138038], Loss: 2.9723, Perplexity: 19.53670
Epoch [1/3], Step [131600/138038], Loss: 2.4325, Perplexity: 11.3877
Epoch [1/3], Step [131700/138038], Loss: 2.7145, Perplexity: 15.0978
Epoch [1/3], Step [131800/138038], Loss: 2.2172, Perplexity: 9.181576
Epoch [1/3], Step [131900/138038], Loss: 2.5380, Perplexity: 12.6540
Epoch [1/3], Step [132000/138038], Loss: 2.7307, Perplexity: 15.34435
Epoch [1/3], Step [132100/138038], Loss: 3.0874, Perplexity: 21.92082
Epoch [1/3], Step [132200/13

Epoch [2/3], Step [4700/138038], Loss: 2.5281, Perplexity: 12.52945
Epoch [2/3], Step [4800/138038], Loss: 2.2221, Perplexity: 9.226685
Epoch [2/3], Step [4900/138038], Loss: 2.4637, Perplexity: 11.7479
Epoch [2/3], Step [5000/138038], Loss: 1.7254, Perplexity: 5.61466
Epoch [2/3], Step [5100/138038], Loss: 2.6297, Perplexity: 13.8700
Epoch [2/3], Step [5200/138038], Loss: 3.3082, Perplexity: 27.33685
Epoch [2/3], Step [5300/138038], Loss: 3.0612, Perplexity: 21.35265
Epoch [2/3], Step [5400/138038], Loss: 2.7871, Perplexity: 16.23397
Epoch [2/3], Step [5500/138038], Loss: 2.0847, Perplexity: 8.04247
Epoch [2/3], Step [5600/138038], Loss: 2.8580, Perplexity: 17.42716
Epoch [2/3], Step [5700/138038], Loss: 2.6960, Perplexity: 14.82103
Epoch [2/3], Step [5800/138038], Loss: 2.5084, Perplexity: 12.28570
Epoch [2/3], Step [5900/138038], Loss: 2.5937, Perplexity: 13.3789
Epoch [2/3], Step [6000/138038], Loss: 2.1368, Perplexity: 8.47246
Epoch [2/3], Step [6100/138038], Loss: 2.4801, Perplex

Epoch [2/3], Step [16700/138038], Loss: 2.6508, Perplexity: 14.16564
Epoch [2/3], Step [16800/138038], Loss: 2.0660, Perplexity: 7.892872
Epoch [2/3], Step [16900/138038], Loss: 3.0679, Perplexity: 21.4958
Epoch [2/3], Step [17000/138038], Loss: 2.4934, Perplexity: 12.1021
Epoch [2/3], Step [17100/138038], Loss: 2.7705, Perplexity: 15.9661
Epoch [2/3], Step [17200/138038], Loss: 2.2204, Perplexity: 9.21118
Epoch [2/3], Step [17300/138038], Loss: 3.1906, Perplexity: 24.30282
Epoch [2/3], Step [17400/138038], Loss: 2.7437, Perplexity: 15.54379
Epoch [2/3], Step [17500/138038], Loss: 3.1544, Perplexity: 23.4391
Epoch [2/3], Step [17600/138038], Loss: 3.1698, Perplexity: 23.8018
Epoch [2/3], Step [17700/138038], Loss: 3.1641, Perplexity: 23.6668
Epoch [2/3], Step [17800/138038], Loss: 2.6489, Perplexity: 14.13819
Epoch [2/3], Step [17900/138038], Loss: 2.1914, Perplexity: 8.94799
Epoch [2/3], Step [18000/138038], Loss: 2.0446, Perplexity: 7.72576
Epoch [2/3], Step [18100/138038], Loss: 3.2

Epoch [2/3], Step [28600/138038], Loss: 3.3922, Perplexity: 29.73080
Epoch [2/3], Step [28700/138038], Loss: 2.0734, Perplexity: 7.951853
Epoch [2/3], Step [28800/138038], Loss: 2.5838, Perplexity: 13.2474
Epoch [2/3], Step [28900/138038], Loss: 4.1845, Perplexity: 65.66350
Epoch [2/3], Step [29000/138038], Loss: 2.6159, Perplexity: 13.67903
Epoch [2/3], Step [29100/138038], Loss: 3.5835, Perplexity: 35.9987
Epoch [2/3], Step [29200/138038], Loss: 2.4869, Perplexity: 12.0244
Epoch [2/3], Step [29300/138038], Loss: 3.0964, Perplexity: 22.11777
Epoch [2/3], Step [29400/138038], Loss: 2.5703, Perplexity: 13.07000
Epoch [2/3], Step [29500/138038], Loss: 2.7682, Perplexity: 15.9297
Epoch [2/3], Step [29600/138038], Loss: 2.9047, Perplexity: 18.25905
Epoch [2/3], Step [29700/138038], Loss: 2.6009, Perplexity: 13.4761
Epoch [2/3], Step [29800/138038], Loss: 3.6260, Perplexity: 37.56076
Epoch [2/3], Step [29900/138038], Loss: 2.1097, Perplexity: 8.245745
Epoch [2/3], Step [30000/138038], Loss:

Epoch [2/3], Step [40500/138038], Loss: 1.6614, Perplexity: 5.26691
Epoch [2/3], Step [40600/138038], Loss: 1.5782, Perplexity: 4.846389
Epoch [2/3], Step [40700/138038], Loss: 3.6318, Perplexity: 37.7806
Epoch [2/3], Step [40800/138038], Loss: 2.7026, Perplexity: 14.9183
Epoch [2/3], Step [40900/138038], Loss: 2.6445, Perplexity: 14.0759
Epoch [2/3], Step [41000/138038], Loss: 2.4649, Perplexity: 11.76184
Epoch [2/3], Step [41100/138038], Loss: 3.3579, Perplexity: 28.72827
Epoch [2/3], Step [41200/138038], Loss: 2.5000, Perplexity: 12.1830
Epoch [2/3], Step [41300/138038], Loss: 4.0382, Perplexity: 56.7265
Epoch [2/3], Step [41400/138038], Loss: 2.2634, Perplexity: 9.61544
Epoch [2/3], Step [41500/138038], Loss: 2.0985, Perplexity: 8.154092
Epoch [2/3], Step [41600/138038], Loss: 3.4430, Perplexity: 31.2804
Epoch [2/3], Step [41700/138038], Loss: 3.5671, Perplexity: 35.41322
Epoch [2/3], Step [41800/138038], Loss: 3.2958, Perplexity: 26.9990
Epoch [2/3], Step [41900/138038], Loss: 2.4

Epoch [2/3], Step [52400/138038], Loss: 3.0091, Perplexity: 20.2701
Epoch [2/3], Step [52500/138038], Loss: 3.1870, Perplexity: 24.21514
Epoch [2/3], Step [52600/138038], Loss: 2.1860, Perplexity: 8.899666
Epoch [2/3], Step [52700/138038], Loss: 3.0594, Perplexity: 21.31395
Epoch [2/3], Step [52800/138038], Loss: 2.4490, Perplexity: 11.5768
Epoch [2/3], Step [52900/138038], Loss: 3.5409, Perplexity: 34.4995
Epoch [2/3], Step [53000/138038], Loss: 3.5407, Perplexity: 34.49018
Epoch [2/3], Step [53100/138038], Loss: 3.4144, Perplexity: 30.39914
Epoch [2/3], Step [53200/138038], Loss: 4.1429, Perplexity: 62.98615
Epoch [2/3], Step [53300/138038], Loss: 2.0595, Perplexity: 7.84178
Epoch [2/3], Step [53400/138038], Loss: 3.3678, Perplexity: 29.0154
Epoch [2/3], Step [53500/138038], Loss: 3.5726, Perplexity: 35.6105
Epoch [2/3], Step [53600/138038], Loss: 2.8647, Perplexity: 17.5444
Epoch [2/3], Step [53700/138038], Loss: 1.8044, Perplexity: 6.07611
Epoch [2/3], Step [53800/138038], Loss: 3.

Epoch [2/3], Step [64300/138038], Loss: 3.1843, Perplexity: 24.1505
Epoch [2/3], Step [64400/138038], Loss: 2.5833, Perplexity: 13.2406
Epoch [2/3], Step [64500/138038], Loss: 3.7468, Perplexity: 42.3854
Epoch [2/3], Step [64600/138038], Loss: 2.2878, Perplexity: 9.85360
Epoch [2/3], Step [64700/138038], Loss: 3.2545, Perplexity: 25.9056
Epoch [2/3], Step [64800/138038], Loss: 2.7060, Perplexity: 14.9699
Epoch [2/3], Step [64900/138038], Loss: 2.5114, Perplexity: 12.3223
Epoch [2/3], Step [65000/138038], Loss: 2.7384, Perplexity: 15.4618
Epoch [2/3], Step [65100/138038], Loss: 3.3509, Perplexity: 28.52794
Epoch [2/3], Step [65200/138038], Loss: 2.5825, Perplexity: 13.23042
Epoch [2/3], Step [65300/138038], Loss: 2.6034, Perplexity: 13.5101
Epoch [2/3], Step [65400/138038], Loss: 2.3100, Perplexity: 10.0746
Epoch [2/3], Step [65500/138038], Loss: 3.1881, Perplexity: 24.24122
Epoch [2/3], Step [65600/138038], Loss: 2.1781, Perplexity: 8.829627
Epoch [2/3], Step [65700/138038], Loss: 4.16

Epoch [2/3], Step [76200/138038], Loss: 2.8745, Perplexity: 17.7160
Epoch [2/3], Step [76300/138038], Loss: 2.3160, Perplexity: 10.13504
Epoch [2/3], Step [76400/138038], Loss: 2.8712, Perplexity: 17.6583
Epoch [2/3], Step [76500/138038], Loss: 3.8064, Perplexity: 44.9872
Epoch [2/3], Step [76600/138038], Loss: 3.5171, Perplexity: 33.68682
Epoch [2/3], Step [76700/138038], Loss: 1.6824, Perplexity: 5.378634
Epoch [2/3], Step [76800/138038], Loss: 2.9470, Perplexity: 19.0480
Epoch [2/3], Step [76900/138038], Loss: 4.0745, Perplexity: 58.8229
Epoch [2/3], Step [77000/138038], Loss: 1.7872, Perplexity: 5.97258
Epoch [2/3], Step [77100/138038], Loss: 2.5804, Perplexity: 13.20319
Epoch [2/3], Step [77200/138038], Loss: 1.7502, Perplexity: 5.75552
Epoch [2/3], Step [77300/138038], Loss: 1.7115, Perplexity: 5.53737
Epoch [2/3], Step [77400/138038], Loss: 2.8209, Perplexity: 16.79154
Epoch [2/3], Step [77500/138038], Loss: 2.7382, Perplexity: 15.45872
Epoch [2/3], Step [77600/138038], Loss: 3.

Epoch [2/3], Step [88100/138038], Loss: 2.0502, Perplexity: 7.76925
Epoch [2/3], Step [88200/138038], Loss: 2.4383, Perplexity: 11.4533
Epoch [2/3], Step [88300/138038], Loss: 3.4521, Perplexity: 31.56618
Epoch [2/3], Step [88400/138038], Loss: 3.6936, Perplexity: 40.1905
Epoch [2/3], Step [88500/138038], Loss: 3.4441, Perplexity: 31.31430
Epoch [2/3], Step [88600/138038], Loss: 3.4029, Perplexity: 30.05042
Epoch [2/3], Step [88700/138038], Loss: 3.0762, Perplexity: 21.67644
Epoch [2/3], Step [88800/138038], Loss: 2.5073, Perplexity: 12.2715
Epoch [2/3], Step [88900/138038], Loss: 2.1320, Perplexity: 8.431376
Epoch [2/3], Step [89000/138038], Loss: 2.9136, Perplexity: 18.42359
Epoch [2/3], Step [89100/138038], Loss: 3.0550, Perplexity: 21.2213
Epoch [2/3], Step [89200/138038], Loss: 2.1878, Perplexity: 8.915763
Epoch [2/3], Step [89300/138038], Loss: 3.0315, Perplexity: 20.72777
Epoch [2/3], Step [89400/138038], Loss: 2.6406, Perplexity: 14.0219
Epoch [2/3], Step [89500/138038], Loss: 

Epoch [2/3], Step [100000/138038], Loss: 2.2314, Perplexity: 9.31293
Epoch [2/3], Step [100100/138038], Loss: 2.4013, Perplexity: 11.0376
Epoch [2/3], Step [100200/138038], Loss: 2.4336, Perplexity: 11.39965
Epoch [2/3], Step [100300/138038], Loss: 2.5493, Perplexity: 12.7984
Epoch [2/3], Step [100400/138038], Loss: 2.4947, Perplexity: 12.1184
Epoch [2/3], Step [100500/138038], Loss: 2.7269, Perplexity: 15.28501
Epoch [2/3], Step [100600/138038], Loss: 2.6673, Perplexity: 14.4010
Epoch [2/3], Step [100700/138038], Loss: 4.4276, Perplexity: 83.7300
Epoch [2/3], Step [100800/138038], Loss: 3.2386, Perplexity: 25.4985
Epoch [2/3], Step [100900/138038], Loss: 3.7351, Perplexity: 41.8936
Epoch [2/3], Step [101000/138038], Loss: 2.1908, Perplexity: 8.94249
Epoch [2/3], Step [101100/138038], Loss: 2.7199, Perplexity: 15.17836
Epoch [2/3], Step [101200/138038], Loss: 2.8392, Perplexity: 17.1026
Epoch [2/3], Step [101300/138038], Loss: 2.3829, Perplexity: 10.83578
Epoch [2/3], Step [101400/1380

Epoch [2/3], Step [111800/138038], Loss: 3.4439, Perplexity: 31.31021
Epoch [2/3], Step [111900/138038], Loss: 2.6066, Perplexity: 13.5534
Epoch [2/3], Step [112000/138038], Loss: 2.2757, Perplexity: 9.73460
Epoch [2/3], Step [112100/138038], Loss: 2.1149, Perplexity: 8.28874
Epoch [2/3], Step [112200/138038], Loss: 2.4243, Perplexity: 11.2946
Epoch [2/3], Step [112300/138038], Loss: 2.1509, Perplexity: 8.59299
Epoch [2/3], Step [112400/138038], Loss: 2.6479, Perplexity: 14.1238
Epoch [2/3], Step [112500/138038], Loss: 2.1693, Perplexity: 8.75210
Epoch [2/3], Step [112600/138038], Loss: 2.1320, Perplexity: 8.43181
Epoch [2/3], Step [112700/138038], Loss: 2.6653, Perplexity: 14.3727
Epoch [2/3], Step [112800/138038], Loss: 2.9157, Perplexity: 18.4626
Epoch [2/3], Step [112900/138038], Loss: 1.9173, Perplexity: 6.80272
Epoch [2/3], Step [113000/138038], Loss: 3.0276, Perplexity: 20.6473
Epoch [2/3], Step [113100/138038], Loss: 3.1480, Perplexity: 23.2887
Epoch [2/3], Step [113200/138038]

Epoch [2/3], Step [123600/138038], Loss: 3.2354, Perplexity: 25.4171
Epoch [2/3], Step [123700/138038], Loss: 2.4669, Perplexity: 11.7853
Epoch [2/3], Step [123800/138038], Loss: 3.0333, Perplexity: 20.7648
Epoch [2/3], Step [123900/138038], Loss: 3.9931, Perplexity: 54.22365
Epoch [2/3], Step [124000/138038], Loss: 2.2842, Perplexity: 9.817693
Epoch [2/3], Step [124100/138038], Loss: 3.3461, Perplexity: 28.3913
Epoch [2/3], Step [124200/138038], Loss: 2.9862, Perplexity: 19.8102
Epoch [2/3], Step [124300/138038], Loss: 2.9547, Perplexity: 19.1967
Epoch [2/3], Step [124400/138038], Loss: 3.6868, Perplexity: 39.9159
Epoch [2/3], Step [124500/138038], Loss: 2.0163, Perplexity: 7.51017
Epoch [2/3], Step [124600/138038], Loss: 3.4030, Perplexity: 30.05412
Epoch [2/3], Step [124700/138038], Loss: 2.6848, Perplexity: 14.65572
Epoch [2/3], Step [124800/138038], Loss: 3.1421, Perplexity: 23.15293
Epoch [2/3], Step [124900/138038], Loss: 3.8856, Perplexity: 48.6970
Epoch [2/3], Step [125000/138

Epoch [2/3], Step [135400/138038], Loss: 2.1582, Perplexity: 8.65547
Epoch [2/3], Step [135500/138038], Loss: 2.8764, Perplexity: 17.7501
Epoch [2/3], Step [135600/138038], Loss: 3.9737, Perplexity: 53.1825
Epoch [2/3], Step [135700/138038], Loss: 2.8165, Perplexity: 16.7190
Epoch [2/3], Step [135800/138038], Loss: 2.3304, Perplexity: 10.2819
Epoch [2/3], Step [135900/138038], Loss: 2.8802, Perplexity: 17.81726
Epoch [2/3], Step [136000/138038], Loss: 2.0911, Perplexity: 8.093529
Epoch [2/3], Step [136100/138038], Loss: 3.1803, Perplexity: 24.0541
Epoch [2/3], Step [136200/138038], Loss: 1.7726, Perplexity: 5.88600
Epoch [2/3], Step [136300/138038], Loss: 2.6118, Perplexity: 13.6241
Epoch [2/3], Step [136400/138038], Loss: 3.2434, Perplexity: 25.6208
Epoch [2/3], Step [136500/138038], Loss: 1.9202, Perplexity: 6.82261
Epoch [2/3], Step [136600/138038], Loss: 2.5304, Perplexity: 12.5583
Epoch [2/3], Step [136700/138038], Loss: 3.7364, Perplexity: 41.94651
Epoch [2/3], Step [136800/13803

Epoch [3/3], Step [9500/138038], Loss: 3.9114, Perplexity: 49.97063
Epoch [3/3], Step [9600/138038], Loss: 2.7545, Perplexity: 15.71365
Epoch [3/3], Step [9700/138038], Loss: 2.2787, Perplexity: 9.763938
Epoch [3/3], Step [9800/138038], Loss: 3.0043, Perplexity: 20.17291
Epoch [3/3], Step [9900/138038], Loss: 3.2923, Perplexity: 26.90398
Epoch [3/3], Step [10000/138038], Loss: 3.5028, Perplexity: 33.2092
Epoch [3/3], Step [10100/138038], Loss: 2.4739, Perplexity: 11.8685
Epoch [3/3], Step [10200/138038], Loss: 2.7108, Perplexity: 15.0410
Epoch [3/3], Step [10300/138038], Loss: 2.2540, Perplexity: 9.52555
Epoch [3/3], Step [10400/138038], Loss: 3.2598, Perplexity: 26.0450
Epoch [3/3], Step [10500/138038], Loss: 4.2553, Perplexity: 70.48053
Epoch [3/3], Step [10600/138038], Loss: 2.5671, Perplexity: 13.0274
Epoch [3/3], Step [10700/138038], Loss: 2.5793, Perplexity: 13.1879
Epoch [3/3], Step [10800/138038], Loss: 2.0010, Perplexity: 7.39648
Epoch [3/3], Step [10900/138038], Loss: 1.9669,

Epoch [3/3], Step [21500/138038], Loss: 2.7843, Perplexity: 16.1880
Epoch [3/3], Step [21600/138038], Loss: 1.6459, Perplexity: 5.185802
Epoch [3/3], Step [21700/138038], Loss: 1.5069, Perplexity: 4.51286
Epoch [3/3], Step [21800/138038], Loss: 4.5578, Perplexity: 95.3731
Epoch [3/3], Step [21900/138038], Loss: 3.2098, Perplexity: 24.77374
Epoch [3/3], Step [22000/138038], Loss: 3.0901, Perplexity: 21.98040
Epoch [3/3], Step [22100/138038], Loss: 2.2300, Perplexity: 9.30017
Epoch [3/3], Step [22200/138038], Loss: 2.8574, Perplexity: 17.41626
Epoch [3/3], Step [22300/138038], Loss: 3.3614, Perplexity: 28.8290
Epoch [3/3], Step [22400/138038], Loss: 2.7502, Perplexity: 15.6454
Epoch [3/3], Step [22500/138038], Loss: 3.1710, Perplexity: 23.8307
Epoch [3/3], Step [22600/138038], Loss: 2.5554, Perplexity: 12.8761
Epoch [3/3], Step [22700/138038], Loss: 3.1968, Perplexity: 24.4543
Epoch [3/3], Step [22800/138038], Loss: 1.9687, Perplexity: 7.161253
Epoch [3/3], Step [22900/138038], Loss: 2.6

Epoch [3/3], Step [33400/138038], Loss: 3.4567, Perplexity: 31.7135
Epoch [3/3], Step [33500/138038], Loss: 2.2954, Perplexity: 9.928322
Epoch [3/3], Step [33600/138038], Loss: 2.3904, Perplexity: 10.91842
Epoch [3/3], Step [33700/138038], Loss: 2.5187, Perplexity: 12.41236
Epoch [3/3], Step [33800/138038], Loss: 1.8104, Perplexity: 6.113083
Epoch [3/3], Step [33900/138038], Loss: 1.9865, Perplexity: 7.28980
Epoch [3/3], Step [34000/138038], Loss: 2.6739, Perplexity: 14.4957
Epoch [3/3], Step [34100/138038], Loss: 2.4761, Perplexity: 11.8950
Epoch [3/3], Step [34200/138038], Loss: 1.6291, Perplexity: 5.09933
Epoch [3/3], Step [34300/138038], Loss: 2.9785, Perplexity: 19.65765
Epoch [3/3], Step [34400/138038], Loss: 2.3341, Perplexity: 10.3206
Epoch [3/3], Step [34500/138038], Loss: 2.4147, Perplexity: 11.1860
Epoch [3/3], Step [34600/138038], Loss: 2.0488, Perplexity: 7.758885
Epoch [3/3], Step [34700/138038], Loss: 2.2037, Perplexity: 9.05802
Epoch [3/3], Step [34800/138038], Loss: 2.

Epoch [3/3], Step [45300/138038], Loss: 2.0043, Perplexity: 7.42116
Epoch [3/3], Step [45400/138038], Loss: 3.5921, Perplexity: 36.31177
Epoch [3/3], Step [45500/138038], Loss: 3.3396, Perplexity: 28.20767
Epoch [3/3], Step [45600/138038], Loss: 2.1929, Perplexity: 8.96082
Epoch [3/3], Step [45700/138038], Loss: 2.2574, Perplexity: 9.55803
Epoch [3/3], Step [45800/138038], Loss: 2.4090, Perplexity: 11.1228
Epoch [3/3], Step [45900/138038], Loss: 2.7093, Perplexity: 15.0194
Epoch [3/3], Step [46000/138038], Loss: 3.6146, Perplexity: 37.1370
Epoch [3/3], Step [46100/138038], Loss: 2.8565, Perplexity: 17.3997
Epoch [3/3], Step [46200/138038], Loss: 3.5038, Perplexity: 33.24042
Epoch [3/3], Step [46300/138038], Loss: 2.1837, Perplexity: 8.87885
Epoch [3/3], Step [46400/138038], Loss: 3.4790, Perplexity: 32.4281
Epoch [3/3], Step [46500/138038], Loss: 2.8349, Perplexity: 17.0294
Epoch [3/3], Step [46600/138038], Loss: 2.8984, Perplexity: 18.1456
Epoch [3/3], Step [46700/138038], Loss: 1.644

Epoch [3/3], Step [57200/138038], Loss: 2.6931, Perplexity: 14.77790
Epoch [3/3], Step [57300/138038], Loss: 2.4147, Perplexity: 11.1868
Epoch [3/3], Step [57400/138038], Loss: 3.7963, Perplexity: 44.53475
Epoch [3/3], Step [57500/138038], Loss: 1.8945, Perplexity: 6.649050
Epoch [3/3], Step [57600/138038], Loss: 1.7506, Perplexity: 5.758338
Epoch [3/3], Step [57700/138038], Loss: 3.9555, Perplexity: 52.22129
Epoch [3/3], Step [57800/138038], Loss: 2.4012, Perplexity: 11.0369
Epoch [3/3], Step [57900/138038], Loss: 2.4166, Perplexity: 11.2078
Epoch [3/3], Step [58000/138038], Loss: 2.7511, Perplexity: 15.6600
Epoch [3/3], Step [58100/138038], Loss: 3.0152, Perplexity: 20.39299
Epoch [3/3], Step [58200/138038], Loss: 2.0535, Perplexity: 7.795295
Epoch [3/3], Step [58300/138038], Loss: 1.7628, Perplexity: 5.82895
Epoch [3/3], Step [58400/138038], Loss: 1.9892, Perplexity: 7.30947
Epoch [3/3], Step [58500/138038], Loss: 1.8210, Perplexity: 6.17811
Epoch [3/3], Step [58600/138038], Loss: 2

Epoch [3/3], Step [69200/138038], Loss: 2.2241, Perplexity: 9.24526
Epoch [3/3], Step [69300/138038], Loss: 3.0635, Perplexity: 21.4021
Epoch [3/3], Step [69400/138038], Loss: 2.9057, Perplexity: 18.2789
Epoch [3/3], Step [69500/138038], Loss: 2.2903, Perplexity: 9.87813
Epoch [3/3], Step [69600/138038], Loss: 2.6831, Perplexity: 14.63072
Epoch [3/3], Step [69700/138038], Loss: 2.0903, Perplexity: 8.087134
Epoch [3/3], Step [69800/138038], Loss: 2.1101, Perplexity: 8.24919
Epoch [3/3], Step [69900/138038], Loss: 2.7217, Perplexity: 15.2067
Epoch [3/3], Step [70000/138038], Loss: 1.2688, Perplexity: 3.556570
Epoch [3/3], Step [70100/138038], Loss: 2.7572, Perplexity: 15.75611
Epoch [3/3], Step [70200/138038], Loss: 1.9148, Perplexity: 6.78556
Epoch [3/3], Step [70300/138038], Loss: 2.6391, Perplexity: 14.00029
Epoch [3/3], Step [70400/138038], Loss: 3.8407, Perplexity: 46.5598
Epoch [3/3], Step [70500/138038], Loss: 3.2088, Perplexity: 24.74927
Epoch [3/3], Step [70600/138038], Loss: 1.

Epoch [3/3], Step [81100/138038], Loss: 3.0732, Perplexity: 21.6102
Epoch [3/3], Step [81200/138038], Loss: 2.5904, Perplexity: 13.3355
Epoch [3/3], Step [81300/138038], Loss: 3.1670, Perplexity: 23.73654
Epoch [3/3], Step [81400/138038], Loss: 2.5190, Perplexity: 12.4164
Epoch [3/3], Step [81500/138038], Loss: 3.3525, Perplexity: 28.5744
Epoch [3/3], Step [81600/138038], Loss: 2.7432, Perplexity: 15.5363
Epoch [3/3], Step [81700/138038], Loss: 2.0731, Perplexity: 7.949456
Epoch [3/3], Step [81800/138038], Loss: 2.4630, Perplexity: 11.74015
Epoch [3/3], Step [81900/138038], Loss: 2.4359, Perplexity: 11.4264
Epoch [3/3], Step [82000/138038], Loss: 1.9464, Perplexity: 7.00370
Epoch [3/3], Step [82100/138038], Loss: 2.2742, Perplexity: 9.720668
Epoch [3/3], Step [82200/138038], Loss: 2.2739, Perplexity: 9.717397
Epoch [3/3], Step [82300/138038], Loss: 3.2490, Perplexity: 25.7657
Epoch [3/3], Step [82400/138038], Loss: 2.0384, Perplexity: 7.67804
Epoch [3/3], Step [82500/138038], Loss: 2.5

Epoch [3/3], Step [93000/138038], Loss: 2.4403, Perplexity: 11.47619
Epoch [3/3], Step [93100/138038], Loss: 2.5799, Perplexity: 13.1962
Epoch [3/3], Step [93200/138038], Loss: 2.4352, Perplexity: 11.4184
Epoch [3/3], Step [93300/138038], Loss: 2.1192, Perplexity: 8.32467
Epoch [3/3], Step [93400/138038], Loss: 2.4789, Perplexity: 11.9277
Epoch [3/3], Step [93500/138038], Loss: 3.2511, Perplexity: 25.8175
Epoch [3/3], Step [93600/138038], Loss: 2.2174, Perplexity: 9.18374
Epoch [3/3], Step [93700/138038], Loss: 2.4181, Perplexity: 11.22443
Epoch [3/3], Step [93800/138038], Loss: 4.1432, Perplexity: 63.00141
Epoch [3/3], Step [93900/138038], Loss: 1.7923, Perplexity: 6.003254
Epoch [3/3], Step [94000/138038], Loss: 2.0700, Perplexity: 7.924554
Epoch [3/3], Step [94100/138038], Loss: 2.7989, Perplexity: 16.4267
Epoch [3/3], Step [94200/138038], Loss: 2.7216, Perplexity: 15.2039
Epoch [3/3], Step [94300/138038], Loss: 1.8444, Perplexity: 6.324674
Epoch [3/3], Step [94400/138038], Loss: 2.

Epoch [3/3], Step [104900/138038], Loss: 2.4330, Perplexity: 11.3932
Epoch [3/3], Step [105000/138038], Loss: 2.9408, Perplexity: 18.9306
Epoch [3/3], Step [105100/138038], Loss: 2.3916, Perplexity: 10.9309
Epoch [3/3], Step [105200/138038], Loss: 2.2373, Perplexity: 9.367648
Epoch [3/3], Step [105300/138038], Loss: 1.9657, Perplexity: 7.13998
Epoch [3/3], Step [105400/138038], Loss: 1.7546, Perplexity: 5.781278
Epoch [3/3], Step [105500/138038], Loss: 3.8104, Perplexity: 45.1678
Epoch [3/3], Step [105600/138038], Loss: 2.5223, Perplexity: 12.4571
Epoch [3/3], Step [105700/138038], Loss: 1.9476, Perplexity: 7.012298
Epoch [3/3], Step [105800/138038], Loss: 2.3949, Perplexity: 10.9671
Epoch [3/3], Step [105900/138038], Loss: 3.0712, Perplexity: 21.5668
Epoch [3/3], Step [106000/138038], Loss: 2.2068, Perplexity: 9.08642
Epoch [3/3], Step [106100/138038], Loss: 1.3863, Perplexity: 4.00023
Epoch [3/3], Step [106200/138038], Loss: 2.1029, Perplexity: 8.18959
Epoch [3/3], Step [106300/13803

Epoch [3/3], Step [116700/138038], Loss: 2.0827, Perplexity: 8.02651
Epoch [3/3], Step [116800/138038], Loss: 2.1352, Perplexity: 8.458641
Epoch [3/3], Step [116900/138038], Loss: 1.8594, Perplexity: 6.42017
Epoch [3/3], Step [117000/138038], Loss: 2.4984, Perplexity: 12.16276
Epoch [3/3], Step [117100/138038], Loss: 2.8563, Perplexity: 17.3973
Epoch [3/3], Step [117200/138038], Loss: 2.4413, Perplexity: 11.4882
Epoch [3/3], Step [117300/138038], Loss: 2.1635, Perplexity: 8.70120
Epoch [3/3], Step [117400/138038], Loss: 3.5980, Perplexity: 36.5235
Epoch [3/3], Step [117500/138038], Loss: 3.8659, Perplexity: 47.7478
Epoch [3/3], Step [117600/138038], Loss: 2.3452, Perplexity: 10.43579
Epoch [3/3], Step [117700/138038], Loss: 2.8199, Perplexity: 16.7744
Epoch [3/3], Step [117800/138038], Loss: 2.2740, Perplexity: 9.71820
Epoch [3/3], Step [117900/138038], Loss: 2.3663, Perplexity: 10.6576
Epoch [3/3], Step [118000/138038], Loss: 2.9567, Perplexity: 19.23494
Epoch [3/3], Step [118100/1380

Epoch [3/3], Step [128500/138038], Loss: 1.9585, Perplexity: 7.08834
Epoch [3/3], Step [128600/138038], Loss: 3.5326, Perplexity: 34.21129
Epoch [3/3], Step [128700/138038], Loss: 1.7996, Perplexity: 6.04749
Epoch [3/3], Step [128800/138038], Loss: 3.5744, Perplexity: 35.6750
Epoch [3/3], Step [128900/138038], Loss: 3.6484, Perplexity: 38.4116
Epoch [3/3], Step [129000/138038], Loss: 2.7261, Perplexity: 15.2739
Epoch [3/3], Step [129100/138038], Loss: 2.2293, Perplexity: 9.29373
Epoch [3/3], Step [129200/138038], Loss: 2.1922, Perplexity: 8.95498
Epoch [3/3], Step [129300/138038], Loss: 2.3897, Perplexity: 10.9103
Epoch [3/3], Step [129400/138038], Loss: 1.9724, Perplexity: 7.18808
Epoch [3/3], Step [129500/138038], Loss: 1.9659, Perplexity: 7.14153
Epoch [3/3], Step [129600/138038], Loss: 3.2038, Perplexity: 24.6251
Epoch [3/3], Step [129700/138038], Loss: 1.3629, Perplexity: 3.90775
Epoch [3/3], Step [129800/138038], Loss: 2.5524, Perplexity: 12.8382
Epoch [3/3], Step [129900/138038]

<a id='step3'></a>
## Step 3: (Optional) Validate your Model

One way to check for overfitting is to evaluate the model on a validation set.  If you decide to do this **optional** task, you must first complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, you will write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses your RNN decoder to generate captions.  That code will prove very useful here. 

If you decide to validate your model, please do not edit the data loader in **data_loader.py**.  Instead, create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data.  You can access:
- the validation images at filepath `'/opt/cocoapi/images/val2014/'`, and
- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.
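
As a rough starting point for **data_loader_val.py**, the sketch below shows one way the annotation file could be paired with image paths.  It does not reproduce the project's data loader: the helper names `index_val_captions` and `load_val_index` are hypothetical, the image folder is passed in as a parameter, and the code assumes the standard COCO captions layout (an `images` list with `id` and `file_name`, and an `annotations` list with `image_id` and `caption`).

```python
import json
import os
from collections import defaultdict


def index_val_captions(annotations, img_folder):
    """Map each image id to its file path and reference captions.

    `annotations` is the parsed annotation file (a dict in the standard
    COCO captions layout); `img_folder` is the validation image folder.
    This helper is hypothetical and is not part of the project code.
    """
    # Group the reference captions by the image they describe.
    captions = defaultdict(list)
    for ann in annotations["annotations"]:
        captions[ann["image_id"]].append(ann["caption"].strip())

    # Attach the on-disk path to each image; images without any
    # annotation end up with an empty caption list.
    index = {}
    for img in annotations["images"]:
        index[img["id"]] = {
            "path": os.path.join(img_folder, img["file_name"]),
            "captions": captions[img["id"]],
        }
    return index


def load_val_index(annotation_file, img_folder):
    """Read the annotation file from disk and build the index."""
    with open(annotation_file) as f:
        return index_val_captions(json.load(f), img_folder)
```

A validation data loader could then iterate over this index, load each image from its `path`, and apply the same image transform used at test time.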

The suggested approach to validating your model involves creating a JSON file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing your model's predicted captions for the validation images.  Then, you can write your own script or use one that you [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of your model.  You can read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr), in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf).  For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset.
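
A minimal sketch of producing such a results file is shown below.  It assumes you have already collected your model's predictions in a dictionary mapping each integer image id to one caption string; the function names `build_caption_results` and `save_caption_results` are hypothetical, and the record layout (`image_id`, `caption`) follows the COCO caption results format.

```python
import json


def build_caption_results(predictions):
    """Convert {image_id: caption} into a list of COCO-style result records.

    `predictions` is assumed to hold exactly one predicted caption per
    image id. Records are sorted by image id so the output is stable.
    """
    return [
        {"image_id": int(image_id), "caption": caption}
        for image_id, caption in sorted(predictions.items())
    ]


def save_caption_results(predictions, out_path):
    """Write the predicted captions to `out_path` as a JSON list."""
    results = build_caption_results(predictions)
    with open(out_path, "w") as f:
        json.dump(results, f)
    return results
```

For example, `save_caption_results({42: "a dog on a beach"}, "val_results.json")` would write a one-element list; the file name here is only a placeholder.  The resulting file can then be given to whichever scoring script you choose.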

In [None]:
# (Optional) TODO: Validate your model.