<a href="https://colab.research.google.com/github/koralpc/image-captioning/blob/main/Image_Captioning_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Image Captioning

This notebook provides a short tutorial on the image captioning task. We will train an encoder-decoder network with attention, using an ImageNet-pretrained backbone as the encoder. The network is trained on the COCO 2014 dataset, using its images and caption annotations.


## Step 1: Copying the source
The source code for the dataset, model, and training lives in my repo [image_captioning](https://github.com/koralpc/image-captioning), so we first clone it.

In [None]:
!git clone https://github.com/koralpc/image-captioning.git

In [4]:
import os
os.chdir('image-captioning')

## Step 2: Download & setup dataset
First we download the dataset. It is around 13GB, so the download may take a while; make sure you have enough disk space.

In [5]:
#@title Set your input variables
#@markdown You can modify the fields here to change the dataset settings
#@markdown limit_size limits how many instances you will use in the training dataset
#@markdown top_k will keep top_k words in vocabulary

annotation_url = "http://images.cocodataset.org/annotations/annotations_trainval2014.zip"  #@param
img_url = "http://images.cocodataset.org/zips/train2014.zip"  #@param {type: "string"}
buffer_size = 100  #@param {type: "slider", min: 1, max: 1000}
limit_size = 10000 #@param {type: "slider", min: 10, max: 10000}
batch_size = 64  #@param {type: "integer"}
top_k = 5000  #@param {type: "integer"}
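
Since the download is around 13GB, it may help to check free disk space before fetching anything. The snippet below is a minimal sketch using only the standard library. The `has_enough_space` helper and the 30 GB threshold (room for the archives plus the extracted files) are assumptions for illustration and are not part of the repository.

```python
import shutil

def has_enough_space(path=".", required_gb=30.0):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# Hypothetical threshold; adjust it to your own setup.
print("Enough space:", has_enough_space("."))
```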


### Initialize dataset loader

In [6]:
from src.dataset import ImageCaptionDataset
caption_dataset = ImageCaptionDataset(img_url, annotation_url)

First we fetch and extract the dataset. This step downloads the 13GB of data and extracts it. It is both disk- and RAM-intensive, so it might take a while.

In [7]:
annotation_file, image_path = caption_dataset._fetch_dataset()

Now that the dataset is downloaded, we load the annotation file to extract the image paths and the captions for each image.

In [8]:
train_captions, img_name_vector = caption_dataset.load_dataset(
    annotation_file, image_path, limit_size=limit_size
)
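
To make the loading step more concrete, here is a rough sketch of how caption annotations in the COCO layout can be grouped per image. It is not the repository's `load_dataset` implementation: the `group_captions` helper, the made-up sample annotations, and the `<start>`/`<end>` markers are assumptions for illustration.

```python
import collections
import os

# Made-up sample in the COCO captions layout (normally read with json.load).
sample = {
    "annotations": [
        {"image_id": 9, "caption": "a dog runs on the beach"},
        {"image_id": 25, "caption": "a plate of food on a table"},
        {"image_id": 9, "caption": "a brown dog near the sea "},
    ]
}

def group_captions(annotations, image_dir):
    """Map each image path to its list of captions, wrapped in start/end markers."""
    image_to_captions = collections.defaultdict(list)
    for ann in annotations["annotations"]:
        # train2014 files are named COCO_train2014_<12-digit zero-padded id>.jpg
        filename = "COCO_train2014_%012d.jpg" % ann["image_id"]
        caption = "<start> " + ann["caption"].strip() + " <end>"
        image_to_captions[os.path.join(image_dir, filename)].append(caption)
    return image_to_captions

grouped = group_captions(sample, "train2014")
for path, captions in grouped.items():
    print(path, captions)
```

Each image usually has several captions, so the caption list and the image-name list end up with one entry per (image, caption) pair.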

The next step is for users with limited RAM. Each image is encoded as an 8×8×2048 feature map, and holding these in memory during training can exhaust the RAM. If your RAM is limited, run the cell below: it passes the images through an ImageNet-pretrained network up to the last layer before its output layer, then saves the resulting features so that the captioning network can train on them directly.

In [None]:
caption_dataset.preprocess_features(img_name_vector)
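
To see why memory becomes a concern, the arithmetic below estimates how much space 8×8×2048 tensors take. It assumes `float32` values and 10,000 images (one tensor per image); both figures are assumptions for illustration, and the real count depends on `limit_size` and on how the repository stores the features.

```python
# Assumed: float32 values and one 8x8x2048 tensor per image.
height, width, channels = 8, 8, 2048
bytes_per_value = 4  # float32

bytes_per_image = height * width * channels * bytes_per_value
num_images = 10_000  # assumed image count for this estimate

total_gib = bytes_per_image * num_images / 2**30
print(f"{bytes_per_image / 2**20:.2f} MiB per image")  # 0.50 MiB per image
print(f"{total_gib:.2f} GiB for {num_images} images")  # 4.88 GiB for 10000 images
```

The 8×8 grid flattens to 64 spatial locations with 2048 channels each, which matches the `attention_features_shape = 64` and `features_shape = 2048` values set later in this notebook.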

The next step after processing the images is to tokenize the captions. Since we use an RNN-based decoder, each caption is encoded as a vector of token indices.

In [10]:
from src.preprocess import Preprocess
cap_vector, max_length, tokenizer = Preprocess.tokenize(train_captions, top_k)
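
The idea behind tokenization with a `top_k` vocabulary can be sketched in plain Python. This is not the code in `Preprocess.tokenize`, whose internals are not shown here. The `build_vocab` and `encode` helpers, the `<pad>` and `<unk>` entries, and the choice of index 0 for padding are assumptions for illustration.

```python
from collections import Counter

def build_vocab(captions, top_k):
    """Keep the most frequent words; index 0 is padding, index 1 is the unknown word."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    word_index = {"<pad>": 0, "<unk>": 1}
    # "<unk>" counts as one of the top_k entries, so add top_k - 1 real words.
    for word, _ in counts.most_common(top_k - 1):
        word_index[word] = len(word_index)
    return word_index

def encode(captions, word_index):
    """Turn captions into integer sequences padded with 0 to a common length."""
    unk = word_index["<unk>"]
    seqs = [[word_index.get(w, unk) for w in cap.lower().split()] for cap in captions]
    max_length = max(len(s) for s in seqs)
    padded = [s + [0] * (max_length - len(s)) for s in seqs]
    return padded, max_length

captions = ["<start> a dog runs <end>", "<start> a cat <end>"]
word_index = build_vocab(captions, top_k=5)
cap_vector, max_length = encode(captions, word_index)
print(word_index)
print(cap_vector, max_length)
```

Under these assumptions the vocabulary holds `top_k` words plus the padding index, which is one way to read the `vocab_size = top_k + 1` setting used later in the notebook.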

After both the images and the captions are processed, we split the dataset into training and validation sets.

In [11]:
img_name_train, cap_train, img_name_val, cap_val = caption_dataset.split_dataset(
    img_name_vector, cap_vector
)
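
One detail worth keeping in mind when splitting captioning data is that an image usually has several captions, so splitting by image (rather than by caption) keeps the same image out of both subsets. The sketch below shows that idea. Whether `split_dataset` works this way is an assumption, and the `split_by_image` helper and the 80/20 ratio are hypothetical.

```python
import random

def split_by_image(img_names, captions, train_fraction=0.8, seed=0):
    """Split (image, caption) pairs so every image lands in exactly one subset."""
    unique_images = sorted(set(img_names))
    random.Random(seed).shuffle(unique_images)
    cut = int(len(unique_images) * train_fraction)
    train_images = set(unique_images[:cut])

    train, val = [], []
    for name, cap in zip(img_names, captions):
        (train if name in train_images else val).append((name, cap))
    return train, val

names = ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
caps = ["cap0", "cap1", "cap2", "cap3", "cap4", "cap5"]
train, val = split_by_image(names, caps)
print(len(train), len(val))
```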

Finally, we construct a `tf.data.Dataset` for each of the training and validation sets.

In [12]:
train_dataset = caption_dataset.create_dataset(
    img_name_train, cap_train, buffer_size, batch_size
)
val_dataset = caption_dataset.create_dataset(
    img_name_val, cap_val, buffer_size, batch_size
)
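
The roles of `buffer_size` and `batch_size` can be illustrated without TensorFlow. The sketch below mimics, in plain Python, what a buffered shuffle followed by batching does conceptually in `tf.data`. It assumes that `create_dataset` applies these two steps, which is not confirmed here, and the `shuffle_buffer` and `batched` helpers are hypothetical.

```python
import random

def shuffle_buffer(items, buffer_size, rng):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer

def batched(items, batch_size):
    """Group items into lists of batch_size; the last batch may be smaller."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

rng = random.Random(0)
batches = list(batched(shuffle_buffer(range(10), buffer_size=4, rng=rng), batch_size=3))
print(batches)
```

With a buffer much smaller than the dataset, elements only move a limited distance from their original position, so a larger `buffer_size` gives a more thorough shuffle at the cost of memory.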

### Note
If you want to do all these steps at once in your code, uncomment the one-liner below, which gives the same output as the steps above.

In [13]:
#train_data, val_data, max_length, tokenizer = caption_dataset.prepare_data(limit_size, buffer_size, batch_size)

## Setting up the model
In this tutorial we use an encoder-decoder model with attention.
You can play with the variables below to find the best setting!

In [14]:
#@title Set your model parameters
#@markdown Here you can play with some of the variables used in model architecture
embedding_dim = 512  #@param {type: "integer"}
units = 1024  #@param {type: "integer"}
vocab_size = top_k + 1
num_epochs = 40  #@param {type: "integer"}
features_shape = 2048
attention_features_shape = 64

In [None]:
from src.model import EDModel
model = EDModel(embedding_dim, units, vocab_size, tokenizer)
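
As a reminder of what attention does in this kind of model, the sketch below computes attention weights over image locations and the resulting context vector for a single decoding step. It is a toy illustration in plain Python, not the `EDModel` code. The source does not say which attention variant the repository uses; dot-product scoring is swapped in here for brevity, and the `attend` and `softmax` helpers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, query):
    """Return (context, weights) for one decoding step.

    features: one vector per image location; query: the decoder state,
    assumed to have the same size as each location vector.
    """
    scores = [sum(f * q for f, q in zip(vec, query)) for vec in features]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * vec[i] for w, vec in zip(weights, features)) for i in range(dim)]
    return context, weights

# Toy example: 3 locations with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend(features, query=[2.0, 0.0])
print(weights)
print(context)
```

In the notebook's setting there would be 64 locations per image instead of 3, and the weights are what the attention plots visualise for each generated word.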

## Setting up the trainer
In this tutorial we use a separate trainer class, which manages training and evaluation as well as loading and saving the model.

In [15]:
from src.train import Trainer
checkpoint_dir = "./checkpoints/train"
train_config = dict(
    buffer_size=buffer_size, limit_size=limit_size, batch_size=batch_size,
    max_length=max_length, attn_shape=attention_features_shape,
)
trainer = Trainer(checkpoint_path=checkpoint_dir, train_config=train_config)

Here we compute the number of training steps per epoch and initialize the checkpoint manager for the model.

In [16]:
# Steps per epoch; train_dataset is presumably already batched, so count examples instead.
num_steps = len(img_name_train) // batch_size
trainer.set_checkpoint(model)

### Start of training

In [None]:
trainer.train(model, train_dataset, num_epochs, num_steps)

## Evaluation
For evaluation, we use the trainer's `eval_single` function, which takes the validation captions and image names from the split above. It randomly selects an image, predicts a caption for it, and displays the caption together with an attention plot for each word.

In [None]:
trainer.eval_single(model, cap_val, img_name_val, visualise=True)

## Exporting the model
The next step is to export our model.
We will export it in the TensorFlow SavedModel format, so that it can later be converted for use with TF.js.