<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Libraries</a></span></li><li><span><a href="#Functions-and-Classes" data-toc-modified-id="Functions-and-Classes-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Functions and Classes</a></span></li><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Preprocessing</a></span></li><li><span><a href="#Loading-Images-and-Captions" data-toc-modified-id="Loading-Images-and-Captions-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Loading Images and Captions</a></span></li><li><span><a href="#Split:-Train-and-Test" data-toc-modified-id="Split:-Train-and-Test-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Split: Train and Test</a></span></li></ul></div>

# Image Captioning - Advanced Statistics Topics

In this notebook we set up the environment to use the GPU provided by Google Colab. Unfortunately, the VM's storage capacity is too low, so we read, process, train on, and delete the training images in small batches.

## Import Libraries

In this section we import the required libraries and establish the connection with Google Drive.

In [27]:
# Import the required libraries
import numpy as np
import pandas as pd
import os
import gzip
import timeit
import shutil
import json
import collections
import zipfile
import random
import time
from PIL import Image
from tqdm import tqdm
from google.colab import drive # Required for Google Colab

# TensorFlow 2.x
import tensorflow as tf
print('TENSORFLOW VERSION: {}'.format(tf.__version__))

TENSORFLOW VERSION: 2.4.1


In [6]:
# Set the reference directory through which Colab connects to Drive (required for Google Colab)
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [7]:
# Store the course directory in a variable

# Data directory
data_folder = '../Data/'
os.chdir('drive/MyDrive/Colab Notebooks/AST-ImageCaptioning/Notebooks/')

## Functions and Classes

In [8]:
# Load an image and preprocess it for the InceptionV3 network (299x299 input)
def image_to_v3_format(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
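
To make the last preprocessing step concrete: `inception_v3.preprocess_input` rescales pixel values from [0, 255] to [-1, 1]. The pure-Python sketch below reproduces only that arithmetic on a few values; `scale_like_inception` is a hypothetical helper written for illustration, not part of TensorFlow.

```python
def scale_like_inception(pixel_value):
    """Map a pixel intensity in [0, 255] to [-1, 1] via x / 127.5 - 1."""
    return pixel_value / 127.5 - 1.0


# Darkest, middle and brightest intensities
scaled = [scale_like_inception(p) for p in (0, 127.5, 255)]
print(scaled)
```

Black pixels map to the lower bound and white pixels to the upper bound of the range the network was trained on.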

## Loading Images and Captions

In [9]:
# Process all the captions
annotation_folder = '../Data/Captions/'
PATH = '../Data/Images/Train/'
with open(annotation_folder + 'captions_train2014.json', 'r') as f:
    captions = json.load(f)
    
# Group all captions together having the same image ID.
image_path_to_caption = collections.defaultdict(list)
for val in captions['annotations']:
    caption = f"<start> {val['caption']} <end>"
    image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (val['image_id'])
    image_path_to_caption[image_path].append(caption)
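
The grouping step can be tried on a toy input. This is a minimal sketch: the annotation entries and the `toy_prefix` folder are made up, and only the file-name pattern and the `<start>`/`<end>` wrapping follow the cell above.

```python
import collections

# Toy stand-in for captions['annotations']; IDs and captions are invented.
toy_annotations = [
    {'image_id': 42, 'caption': 'A dog on a beach.'},
    {'image_id': 7, 'caption': 'Two people riding bikes.'},
    {'image_id': 42, 'caption': 'A brown dog running on sand.'},
]

toy_prefix = 'Images/Train/'  # hypothetical folder
grouped = collections.defaultdict(list)
for ann in toy_annotations:
    # '%012d' zero-pads the ID to 12 digits, as in the COCO file names
    path = toy_prefix + 'COCO_train2014_' + '%012d.jpg' % ann['image_id']
    grouped[path].append(f"<start> {ann['caption']} <end>")

for path, caps in grouped.items():
    print(path, len(caps))
```

Both captions for image 42 end up under the same key, which is what later lets the split operate on images rather than on individual captions.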

In [10]:
image_paths = list(image_path_to_caption.keys())
random.shuffle(image_paths)
print('THE NUMBER OF IMAGES IS {} AND THERE IS {} CAPTIONS'.format(len(image_paths), len(captions['annotations'])))

THE NUMBER OF IMAGES IS 82783 AND THERE IS 414113 CAPTIONS


In [11]:
captions = []
img_name_vector = []

for image_path in image_paths:
    caption_list = image_path_to_caption[image_path]
    captions.extend(caption_list)
    img_name_vector.extend([image_path] * len(caption_list))

In [12]:
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5


In [13]:
# Get unique images
encode_train = sorted(set(img_name_vector))

# Large batches are feasible here because our VM has 128 GB of RAM
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(image_to_v3_format, num_parallel_calls=tf.data.AUTOTUNE).batch(2**6) 

In [14]:
# Find the maximum length of any caption in our dataset
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

# Choose the top 5000 words from the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer.fit_on_texts(captions)

tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

# Create the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(captions)

# Pad each vector to the max_length of the captions
# If you do not provide a max_length value, pad_sequences calculates it automatically
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

# Calculates the max_length, which is used to store the attention weights
max_length = calc_max_length(train_seqs)
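
The idea behind this cell (keep the most frequent words, map everything else to `<unk>`, pad on the right with `<pad>` = 0) can be sketched without Keras. The helpers `build_vocab` and `encode_and_pad` are hypothetical, and their index assignment and character filtering are simplified, so they will not reproduce the exact indices of `Tokenizer`.

```python
from collections import Counter


def build_vocab(texts, top_k):
    """Index 0 is <pad>, 1 is <unk>, then the top_k most frequent words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {'<pad>': 0, '<unk>': 1}
    for word, _ in counts.most_common(top_k):
        vocab[word] = len(vocab)
    return vocab


def encode_and_pad(texts, vocab):
    """Map words to indices (unknown -> <unk>) and right-pad with zeros."""
    seqs = [[vocab.get(w, vocab['<unk>']) for w in t.lower().split()]
            for t in texts]
    longest = max(len(s) for s in seqs)
    padded = [s + [vocab['<pad>']] * (longest - len(s)) for s in seqs]
    return padded, longest


toy_captions = ['<start> a dog runs <end>', '<start> a cat <end>']
vocab = build_vocab(toy_captions, top_k=3)
padded, longest = encode_and_pad(toy_captions, vocab)
print(padded, longest)
```

With `top_k=3`, the rarer words (`dog`, `runs`, `cat`) all collapse to the `<unk>` index, and the shorter caption gets a trailing 0.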

## Split: Train and Test

In [15]:
img_to_cap_vector = collections.defaultdict(list)
for img, cap in zip(img_name_vector, cap_vector):
    img_to_cap_vector[img].append(cap)

# Create training and validation sets using an 80-20 split randomly.
img_keys = list(img_to_cap_vector.keys())
random.shuffle(img_keys)

slice_index = int(len(img_keys)*0.8)
img_name_train_keys, img_name_val_keys = img_keys[:slice_index], img_keys[slice_index:]

img_name_train = []
cap_train = []
for imgt in img_name_train_keys:
    capt_len = len(img_to_cap_vector[imgt])
    img_name_train.extend([imgt] * capt_len)
    cap_train.extend(img_to_cap_vector[imgt])

img_name_val = []
cap_val = []
for imgv in img_name_val_keys:
    capv_len = len(img_to_cap_vector[imgv])
    img_name_val.extend([imgv] * capv_len)
    cap_val.extend(img_to_cap_vector[imgv])
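
Splitting on image keys rather than on individual captions keeps every caption of an image on the same side of the split. A small sketch with made-up data (`toy_img_to_caps` and the seed are assumptions for reproducibility only):

```python
import random

# Hypothetical toy mapping: image key -> list of caption vectors
toy_img_to_caps = {f'img_{i}.jpg': [[i, 1], [i, 2]] for i in range(10)}

rng = random.Random(0)  # seeded only so the sketch is reproducible
keys = list(toy_img_to_caps)
rng.shuffle(keys)

cut = int(len(keys) * 0.8)
train_keys, val_keys = keys[:cut], keys[cut:]

# Expand back to one (image, caption) pair per caption
train_pairs = [(k, c) for k in train_keys for c in toy_img_to_caps[k]]
val_pairs = [(k, c) for k in val_keys for c in toy_img_to_caps[k]]

# No image appears on both sides of the split
print(set(train_keys).isdisjoint(val_keys), len(train_pairs), len(val_pairs))
```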

In [16]:
# Note: each image path is repeated once per caption, so both counts below are (image, caption) pairs
print('NUMBER OF TRAIN IMAGES: {} AND NUMBER OF TRAIN CAPTIONS: {}'.format(len(img_name_train), len(cap_train)))
print('NUMBER OF TEST IMAGES: {} AND NUMBER OF TEST CAPTIONS: {}'.format(len(img_name_val), len(cap_val)))

NUMBER OF TRAIN IMAGES: 331290 AND NUMBER OF TRAIN CAPTIONS: 331290
NUMBER OF TEST IMAGES: 82823 AND NUMBER OF TEST CAPTIONS: 82823


In [17]:
# Feel free to change these parameters according to your system's configuration

BATCH_SIZE = 32
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = top_k + 1
num_steps = len(img_name_train) // BATCH_SIZE
# Shape of the vector extracted from InceptionV3 is (64, 2048)
# These two variables represent that vector shape
features_shape = 2048
attention_features_shape = 64
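
The two shape constants and `num_steps` follow from simple arithmetic. The sketch below assumes the usual 8 x 8 x 2048 final feature map of InceptionV3 for 299x299 inputs and reuses the train count printed earlier; the variable names are illustrative only.

```python
# InceptionV3 without its top, fed 299x299 images, ends in an 8 x 8 x 2048 map.
grid_height, grid_width, channels = 8, 8, 2048
attention_locations = grid_height * grid_width  # flattened spatial positions

# Steps per epoch for the train split reported above (331290 pairs, batch 32)
train_pairs = 331290
batch_size = 32
steps_per_epoch = train_pairs // batch_size

print(attention_locations, channels, steps_per_epoch)
```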

In [18]:
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)

        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # attention_hidden_layer shape == (batch_size, 64, units)
        attention_hidden_layer = (tf.nn.tanh(self.W1(features) +
                                             self.W2(hidden_with_time_axis)))

        # score shape == (batch_size, 64, 1)
        # This gives you an unnormalized score for each image feature.
        score = self.V(attention_hidden_layer)

        # attention_weights shape == (batch_size, 64, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, embedding_dim)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
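
The last two steps of the attention layer (softmax over the image locations, then a weighted sum of the features) can be sketched in plain Python. This is a simplified illustration: the scores are given directly instead of being computed by the `W1`, `W2` and `V` layers, and `softmax` and `attend` are hypothetical helpers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Weighted sum of feature vectors, with weights = softmax(scores)."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return context, weights


# Toy input: 3 locations (instead of 64), feature size 2; scores are made up.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [0.0, 0.0, 0.0]
context, weights = attend(toy_features, toy_scores)
print(context, weights)
```

With equal scores every location gets the same weight, so the context vector is just the average of the feature vectors; it has the feature size, not the number of locations.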

In [19]:
class CNN_Encoder(tf.keras.Model):
    # The image features have already been extracted with InceptionV3,
    # so this encoder only passes them through a fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

In [20]:
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units

        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)

        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden):
        # defining attention as a separate model
        context_vector, attention_weights = self.attention(features, hidden)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # the decoder receives one token per call, so the sequence length is 1
        # shape == (batch_size, 1, units)
        x = self.fc1(output)

        # x shape == (batch_size * 1, units)
        x = tf.reshape(x, (-1, x.shape[2]))

        # output shape == (batch_size * 1, vocab_size)
        x = self.fc2(x)

        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

In [21]:
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

In [22]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')


def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
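
What the mask does can be checked by hand on a few numbers. The sketch below is plain Python with made-up per-position losses; `masked_mean_losses` is a hypothetical helper. It also shows that averaging over all positions, as `tf.reduce_mean` does above, gives a smaller value than averaging over the non-padding positions only.

```python
def masked_mean_losses(targets, per_position_loss, pad_id=0):
    """Return (mean over all positions, mean over non-padding positions)."""
    mask = [0.0 if t == pad_id else 1.0 for t in targets]
    masked = [loss * m for loss, m in zip(per_position_loss, mask)]
    mean_all = sum(masked) / len(masked)
    mean_real = sum(masked) / sum(mask)
    return mean_all, mean_real


toy_targets = [4, 0, 9, 0]          # 0 marks padding
toy_losses = [0.5, 1.5, 2.0, 0.7]   # invented per-position losses
mean_all, mean_real = masked_mean_losses(toy_targets, toy_losses)
print(mean_all, mean_real)
```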

In [23]:
checkpoint_path = "../Data/TrainCheckpoints/TrainType0"
ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

In [24]:
start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # restoring the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)
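
The epoch number is recovered from the numeric suffix of the checkpoint path. A minimal sketch, assuming the manager's usual `ckpt-N` naming; the example path and the helper `epoch_from_checkpoint` are hypothetical:

```python
def epoch_from_checkpoint(path):
    """Read the trailing save counter from a path such as '.../ckpt-3'."""
    return int(path.split('-')[-1])


example_path = '../Data/TrainCheckpoints/TrainType0/ckpt-3'  # hypothetical
print(epoch_from_checkpoint(example_path))
```

The suffix counts saves, so it matches the number of completed epochs only as long as exactly one checkpoint is written per epoch.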

In [25]:
# adding this in a separate cell because if you run the training cell
# many times, the loss_plot array will be reset
loss_plot = []

@tf.function
def train_step(img_tensor, target):
    loss = 0

    # initializing the hidden state for each batch
    # because the captions are not related from image to image
    hidden = decoder.reset_state(batch_size=target.shape[0])

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)

        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden, _ = decoder(dec_input, features, hidden)

            loss += loss_function(target[:, i], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = (loss / int(target.shape[1]))

    trainable_variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, trainable_variables)

    optimizer.apply_gradients(zip(gradients, trainable_variables))

    return loss, total_loss
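
Teacher forcing means that at every step the decoder is fed the ground-truth previous token rather than its own prediction. The loop in `train_step` can be summarised with this small sketch; `teacher_forcing_pairs` and the toy caption are illustrative only.

```python
def teacher_forcing_pairs(target):
    """List the (decoder input, token to predict) pair for every step."""
    return [(target[i - 1], target[i]) for i in range(1, len(target))]


toy_target = ['<start>', 'a', 'dog', '<end>']
pairs = teacher_forcing_pairs(toy_target)
for dec_input, expected in pairs:
    print(f'input={dec_input!r} -> predict {expected!r}')
```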

In [40]:
zip_files = os.listdir('../Data/Images/Zips/')
EPOCHS = 20

for epoch in range(start_epoch, EPOCHS):
    start = time.time()
    total_loss = 0
    
    real_batch = 0
    for zippy in zip_files:
        with zipfile.ZipFile('../Data/Images/Zips/' + zippy, 'r') as zip_ref:
            new_temp = '../Data/Images/Temp/'
            os.makedirs(new_temp, exist_ok=True)  # do not fail if an interrupted run left the folder behind
            zip_ref.extractall(new_temp)

        img_train_batch = os.listdir(new_temp + 'Data/Images/Train/')
        image_dataset = tf.data.Dataset.from_tensor_slices([new_temp + 'Data/Images/Train/' + i for i in img_train_batch])
        image_dataset = image_dataset.map(image_to_v3_format, num_parallel_calls=tf.data.AUTOTUNE).batch(16)
        first = True
        for img, path in image_dataset:
            batch_features = image_features_extract_model(img)
            batch_features = tf.reshape(batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
            if first:
                batch_features_total = tf.identity(batch_features)
                first = False
            else:
                batch_features_total = tf.concat(axis=0, values=[batch_features_total,  batch_features])
            
        # Use at most 2 captions per image: Colab runs out of memory with min_caption >= 3
        cap_train = []
        min_caption = np.inf
        for imgt in img_train_batch:
            capt_len = len(img_to_cap_vector['../Data/Images/Train/' + imgt])
            min_caption = min(min_caption, capt_len)
        min_caption = min(min_caption, 2)
        for idx in range(min_caption):
            if idx == 0:
                batch_train = tf.identity(batch_features_total)
            else:
                batch_train = tf.concat(axis=0, values=[batch_train, batch_features_total])
                
            for imgt in img_train_batch:
                cap_train.append(img_to_cap_vector['../Data/Images/Train/' + imgt][idx])

        dataset = tf.data.Dataset.from_tensor_slices((batch_train, cap_train))

        # Shuffle and batch
        dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
        dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
        
      
        shutil.rmtree(new_temp)
        
        for (batch, (img_tensor, target)) in enumerate(dataset):
            batch_loss, t_loss = train_step(img_tensor, target)
            total_loss += t_loss
            real_batch += 1
            if real_batch % 100 == 0:
                average_batch_loss = batch_loss.numpy()/int(target.shape[1])
                print(f'Epoch {epoch+1} Batch {real_batch} Loss {average_batch_loss:.4f}')
    # storing the epoch end loss value to plot later
    loss_plot.append(total_loss/num_steps)

    if epoch % 1 == 0:
        ckpt_manager.save()

    print(f'Epoch {epoch+1} Loss {total_loss/num_steps:.6f}')
    print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')

Epoch 1 Batch 100 Loss 0.6092
Epoch 1 Batch 200 Loss 0.6891
Epoch 1 Batch 300 Loss 0.6646
Epoch 1 Batch 400 Loss 0.6236
Epoch 1 Batch 500 Loss 0.7345
Epoch 1 Batch 600 Loss 0.6345
Epoch 1 Batch 700 Loss 0.6344
Epoch 1 Batch 800 Loss 0.6047
Epoch 1 Batch 900 Loss 0.7420
Epoch 1 Batch 1000 Loss 0.7920
Epoch 1 Batch 1100 Loss 0.6598
Epoch 1 Batch 1200 Loss 0.6581
Epoch 1 Batch 1300 Loss 0.5825
Epoch 1 Batch 1400 Loss 0.5495
Epoch 1 Batch 1500 Loss 0.6275
Epoch 1 Batch 1600 Loss 0.6545
Epoch 1 Batch 1700 Loss 0.5883
Epoch 1 Batch 1800 Loss 0.6513
Epoch 1 Batch 1900 Loss 0.6246
Epoch 1 Batch 2000 Loss 0.5561
Epoch 1 Batch 2100 Loss 0.5084
Epoch 1 Batch 2200 Loss 0.7304
Epoch 1 Batch 2300 Loss 0.7075
Epoch 1 Batch 2400 Loss 0.6274
Epoch 1 Batch 2500 Loss 0.5540
Epoch 1 Batch 2600 Loss 0.6554
Epoch 1 Batch 2700 Loss 0.6225
Epoch 1 Batch 2800 Loss 0.6313
Epoch 1 Batch 2900 Loss 0.7246
Epoch 1 Batch 3000 Loss 0.6092
Epoch 1 Batch 3100 Loss 0.5905
Epoch 1 Batch 3200 Loss 0.6840
Epoch 1 Batch 330

KeyboardInterrupt: ignored
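
If the run is interrupted between extracting a zip and deleting the temporary folder, the folder is left behind. One possible alternative to the `os.mkdir` / `shutil.rmtree` pair, sketched here with the standard library's `tempfile.TemporaryDirectory` instead of the approach used above, guarantees cleanup; `process_zip` and the demo archive are hypothetical.

```python
import os
import tempfile
import zipfile


def process_zip(zip_path, handle_files):
    """Extract zip_path into a throwaway directory, process it, delete it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(tmp_dir)
        # tmp_dir is removed on exit, even if handle_files raises
        return handle_files(tmp_dir)


# Self-contained demo: build a tiny archive with two fake "images"
with tempfile.TemporaryDirectory() as work_dir:
    demo_zip = os.path.join(work_dir, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('a.jpg', b'fake')
        zf.writestr('b.jpg', b'fake')
    names = process_zip(demo_zip, lambda d: sorted(os.listdir(d)))

print(names)
```

In the training loop, `handle_files` would be the step that turns the extracted images into InceptionV3 features before the folder disappears.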

In [45]:
!ls ../Data

Captions  Images  TrainCheckpoints


# End

In [39]:
import shutil

dir_path = '../Data/Images/Temp/'

try:
    shutil.rmtree(dir_path)
    print('DELETED')
except OSError as e:
    print("Error: %s : %s" % (dir_path, e.strerror))

DELETED
