**Captioning images using CNN and RNN**

The problem is posed as follows: given an image, we want to generate a sentence that describes its content.

The solution I propose in this notebook consists of an encoder-decoder model.
The encoder is a ConvNet and the decoder is an LSTM RNN that generates the caption.
My solution was inspired by the amazing [Medium article](https://medium.com/@stepanulyanin/captioning-images-with-pytorch-bc592e5fd1a3) written by Stepan Ulyanin.

I'm going to train the model on the COCO dataset. Every image comes with 5 different captions written by different people, so each caption differs slightly from the other captions for the same image.
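
Because the decoder works on integer token ids, each reference caption is usually tokenized and mapped through a vocabulary before training. The following is a minimal sketch of that step, not the notebook's actual preprocessing: the `tokenize`, `build_vocab`, and `encode` helpers, the special tokens, and the sample captions are all hypothetical.

```python
from collections import Counter

# Hypothetical special tokens; the names are assumptions for illustration.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"

def tokenize(caption):
    """Lowercase a caption, replace periods with spaces, and split on whitespace."""
    return caption.lower().replace(".", " ").split()

def build_vocab(captions, min_count=1):
    """Map every word seen at least `min_count` times to an integer id."""
    counts = Counter(word for c in captions for word in tokenize(c))
    words = sorted(w for w, n in counts.items() if n >= min_count)
    return {w: i for i, w in enumerate([PAD, START, END, UNK] + words)}

def encode(caption, vocab):
    """Turn one caption into a list of ids wrapped in START/END markers."""
    ids = [vocab.get(w, vocab[UNK]) for w in tokenize(caption)]
    return [vocab[START]] + ids + [vocab[END]]

# Made-up captions standing in for the references of one image.
captions = [
    "A dog runs on the beach.",
    "A brown dog running along the shore.",
]
vocab = build_vocab(captions)
encoded = [encode(c, vocab) for c in captions]
```

The size of such a vocabulary is what a parameter like `vocab_size` would be set to, and words never seen during training fall back to the `<unk>` id.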

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [3]:
# Pytorch libraries 
import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models

In [4]:
# Graphic library
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

In [5]:
# Config matplotlib
import matplotlib as mpl
mpl.rcParams['axes.grid'] = False
mpl.rcParams['image.interpolation'] = 'nearest'
mpl.rcParams['figure.figsize'] = 15, 25

In [6]:

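def show_dataset(dataset, n=6):
  # Each row shows sample i fetched n times, so random transforms (if any) appear side by side.
  # np.hstack/np.vstack need a sequence, not a generator, and all images must share one size.
  rows = [np.hstack([np.asarray(dataset[i][0]) for _ in range(n)])
          for i in range(5)]
  img = np.vstack(rows)
  plt.imshow(img)
  plt.axis('off')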

**Encoder**

To encode the input images, we are going to use a ConvNet. A better choice than training one from scratch is transfer learning, so we are going to use the DenseNet121 model pretrained on the ImageNet dataset.
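
With transfer learning, a common choice is to keep the pretrained weights fixed and train only the newly added layers. The notebook does not say whether the backbone is frozen, so the following is only a sketch of that idea under this assumption. The `freeze_all_but` helper and the `TinyBackbone` stand-in network are hypothetical, used so the example runs without downloading DenseNet weights.

```python
import torch
from torch import nn

def freeze_all_but(model, trainable_prefix):
    """Enable gradients only for parameters whose name starts with `trainable_prefix`."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefix)
    return model

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained network with `features` and `classifier` parts."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
        self.classifier = nn.Linear(8, 4)

    def forward(self, x):
        x = self.features(x).mean(dim=(2, 3))  # global average pooling over height and width
        return self.classifier(x)

model = freeze_all_but(TinyBackbone(), trainable_prefix="classifier")
trainable = [n for n, p in model.named_parameters() if p.requires_grad]

# Only the parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

Freezing the backbone reduces memory use and training time, at the cost of not adapting the convolutional features to the captioning task.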

In [9]:
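class EncoderConvNet(nn.Module):
    def __init__(self, embed_size=1024):
        super(EncoderConvNet, self).__init__()
        # Get the DenseNet121 model pretrained on ImageNet
        # (newer torchvision releases prefer the `weights=` argument over `pretrained=True`)
        self.densenet = models.densenet121(pretrained=True)
        # The original DenseNet classifier layer is:
        # (classifier): Linear(in_features=1024, out_features=1000, bias=True)
        # Replace it so the backbone outputs 1024 features instead of 1000 ImageNet
        # logits; otherwise the output would not match `self.fc` below
        self.densenet.classifier = nn.Linear(in_features=1024, out_features=1024)
        # Fully connected layer that maps the features to the embedding size
        self.fc = nn.Linear(in_features=1024, out_features=embed_size)
        # Dropout layer
        self.dropout = nn.Dropout(p=0.5)
        # Activation layer; applies the element-wise function
        # PReLU(x) = max(0, x) + a * min(0, x), where a is learned
        self.prelu = nn.PReLU()

    def forward(self, images):
        # Extract features with DenseNet, then apply the activation and dropout
        outputs = self.dropout(self.prelu(self.densenet(images)))
        # Project the features to the embedding size
        outputs = self.fc(outputs)

        return outputs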

**Decoder**


In [10]:
class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        # define the properties
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        
        # lstm cell
        self.lstm_cell = nn.LSTMCell(input_size=embed_size, hidden_size=hidden_size)
    
        # output fully connected layer
        self.fc_out = nn.Linear(in_features=self.hidden_size, out_features=self.vocab_size)
    
        # embedding layer
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.embed_size)
    
    def forward(self, features, captions):
        # batch size
        batch_size = features.size(0)
        
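        # init the hidden and cell states to zeros, on the same device as the features
        # (avoids hard-coding .cuda(), so the model also runs on CPU)
        device = features.device
        hidden_state = torch.zeros((batch_size, self.hidden_size), device=device)
        cell_state = torch.zeros((batch_size, self.hidden_size), device=device)
    
        # define the output tensor placeholder
        outputs = torch.empty((batch_size, captions.size(1), self.vocab_size), device=device)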

        # embed the captions
        captions_embed = self.embed(captions)
        
        # pass the caption word by word
        for t in range(captions.size(1)):

            # for the first time step the input is the feature vector
            if t == 0:
                hidden_state, cell_state = self.lstm_cell(features, (hidden_state, cell_state))
                
            # for later time steps, use teacher forcing: feed the ground-truth caption token
            else:
                hidden_state, cell_state = self.lstm_cell(captions_embed[:, t, :], (hidden_state, cell_state))
            
            # project the hidden state to scores over the vocabulary
            out = self.fc_out(hidden_state)
            
            # build the output tensor
            outputs[:, t, :] = out
    
        return outputs
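
At inference time no ground-truth caption is available, so teacher forcing is typically replaced by feeding each predicted token back in as the next input. The following is a minimal greedy-decoding sketch under assumptions, not part of the original notebook: `greedy_decode`, `end_id`, and `max_len` are hypothetical, and the layers are randomly initialized stand-ins that play the same roles as `lstm_cell`, `embed`, and `fc_out` in the decoder.

```python
import torch
from torch import nn

@torch.no_grad()
def greedy_decode(features, lstm_cell, embed, fc_out, end_id, max_len=20):
    """Generate token ids for one image feature vector of shape (1, embed_size)."""
    hidden = torch.zeros(1, lstm_cell.hidden_size, device=features.device)
    cell = torch.zeros_like(hidden)
    inputs, tokens = features, []
    for _ in range(max_len):
        hidden, cell = lstm_cell(inputs, (hidden, cell))
        next_id = fc_out(hidden).argmax(dim=1)  # most likely word id, shape (1,)
        tokens.append(next_id.item())
        if tokens[-1] == end_id:                # stop once the end marker is produced
            break
        inputs = embed(next_id)                 # feed the prediction back in
    return tokens

# Randomly initialized stand-in layers; the sizes are arbitrary assumptions.
embed_size, hidden_size, vocab_size = 16, 32, 10
lstm_cell = nn.LSTMCell(embed_size, hidden_size)
embed = nn.Embedding(vocab_size, embed_size)
fc_out = nn.Linear(hidden_size, vocab_size)

tokens = greedy_decode(torch.randn(1, embed_size), lstm_cell, embed, fc_out,
                       end_id=2, max_len=5)
```

Greedy decoding picks the single most likely word at every step; beam search is a common alternative when caption quality matters more than speed.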

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [18]:
# how many samples per batch to load
batch_size = 48
# percentage of training set to use as validation
valid_size = 0.3

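# Loading the data is not working yet. `ann_file` must be the path to the
# annotation file, not a directory, otherwise the call below fails with:
# IsADirectoryError: [Errno 21] Is a directory: 'captions'
#train_data = datasets.Flickr8k(root='data', ann_file='captions', transform=transforms.ToTensor())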
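
The `valid_size` fraction can be turned into a train/validation split by shuffling the sample indices once and cutting the list. A minimal sketch, assuming a dataset of known length: `split_indices`, the seed, and the dataset size of 100 are hypothetical.

```python
import random

def split_indices(num_samples, valid_size, seed=0):
    """Shuffle the indices 0..num_samples-1 and split off a validation fraction."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)    # seeded so the split is reproducible
    n_valid = int(valid_size * num_samples)
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = split_indices(num_samples=100, valid_size=0.3)
```

The two index lists could then be passed to samplers such as `torch.utils.data.SubsetRandomSampler`, one per `DataLoader`.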

**Training**