# Code structure
### Data Preprocessing
Load and preprocess image data
Preprocess text data

### Feature Extraction
Load pre-trained CNN model
Extract features from images

### Caption Generation Model
Design sequence-based model
Define embedding layer
Implement attention mechanism (if needed)
Train the model using training data

### Training
Split data into train, validation, and test sets
Train the model
Validate the model
Evaluate the model using BLEU scores

### Evaluation
Calculate BLEU scores
Visual inspection of generated captions
Compare with expert and crowd judgments

### Hyperparameter Tuning
Experiment with different architectures and hyperparameters

### Model Deployment
Deploy the model for inference
Provide a user-friendly interface
Monitor the model's performance


# Code starts here 

## 1 Library and Dataset Loading

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Set random seed for reproducibility
tf.random.set_seed(42)

# Define paths to dataset files
dataset_dir = "dataset"
image_dir = os.path.join(dataset_dir, "Flicker8k_Dataset")
caption_file = os.path.join(dataset_dir, "Flickr8k_text/Flickr8k.token.txt")

# Load captions into a DataFrame
captions_df = pd.read_csv(caption_file, sep="\t", header=None, names=["image_id", "caption"])

## 2 Preprocessing

We'll resize the images to a fixed size, preprocess them according to the requirements of the VGG16 model, and extract features using the pre-trained VGG16 model.
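For reference, the VGG16 version of preprocess_input follows the "caffe" convention: it reorders the channels from RGB to BGR and subtracts the per-channel ImageNet means, without rescaling. The NumPy sketch below mimics that transformation on a dummy image. The function name vgg16_preprocess_sketch is illustrative only and is not part of Keras; in the notebook itself we keep using the library function.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by the "caffe" preprocessing mode.
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_preprocess_sketch(img_array):
    """Illustrative re-implementation; use Keras' preprocess_input in practice."""
    x = np.asarray(img_array, dtype=np.float32)
    x = x[..., ::-1]                 # RGB -> BGR
    return x - IMAGENET_BGR_MEANS    # zero-center each channel, no scaling

# A uniform grey dummy image stands in for a real photo.
dummy = np.full((224, 224, 3), 128.0, dtype=np.float32)
out = vgg16_preprocess_sketch(dummy)
print(out.shape, out[0, 0])
```

The output values are no longer in the 0-255 range, which is why the preprocessed arrays should not be passed to plt.imshow directly.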

In [2]:
# Function to load and preprocess images
def load_and_preprocess_image(image_path, target_size=(224, 224)):
    img = load_img(image_path, target_size=target_size)
    img_array = img_to_array(img)
    img_array = preprocess_input(img_array)
    return img_array

# Load and preprocess all images
image_data = {}
for img_file in os.listdir(image_dir):
    img_path = os.path.join(image_dir, img_file)
    image_data[img_file.split('.')[0]] = load_and_preprocess_image(img_path)

# Extract image features using pre-trained VGG16 model
vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
image_features = {}
for img_id, img_data in image_data.items():
    img_data = np.expand_dims(img_data, axis=0)
    features = vgg_model.predict(img_data)
    image_features[img_id] = features.reshape(features.shape[1:])


2024-05-14 10:06:17.684436: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz




## 3 Tokenizing

Preprocess the text data by tokenizing captions, building a vocabulary, and converting words to indices. The following code tokenizes captions using the Tokenizer class from Keras, creates sequences of tokens for each caption, and pads sequences to ensure uniform length. It also prepares input-output pairs for training, where X contains image features and y contains padded sequences of word indices.
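To make the steps concrete, here is a small pure-Python sketch of the same idea on two toy captions: build a frequency-ordered vocabulary, map words to indices, and post-pad with zeros. It only approximates what the Keras Tokenizer and pad_sequences do (the real Tokenizer also strips punctuation), and the toy captions are made up for illustration.

```python
from collections import Counter

captions = ["A dog runs", "A dog jumps over a log"]

# Lowercase and split on whitespace.
tokenized = [c.lower().split() for c in captions]

# Index words by descending frequency, starting at 1; index 0 is reserved for padding.
counts = Counter(w for toks in tokenized for w in toks)
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

# Convert each caption to a list of word indices.
sequences = [[word_index[w] for w in toks] for toks in tokenized]

# Pad every sequence with zeros at the end ("post" padding) up to the longest length.
max_len = max(len(s) for s in sequences)
padded = [s + [0] * (max_len - len(s)) for s in sequences]

vocab_size = len(word_index) + 1  # +1 accounts for the padding index
print(word_index)
print(padded)
```

The +1 in vocab_size mirrors the notebook's len(tokenizer.word_index) + 1, since index 0 never appears in word_index but still needs a row in the embedding matrix.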

In [3]:
# Print image IDs from DataFrame
print("Image IDs from DataFrame:", captions_df["image_id"].head())

# Print keys in image_features dictionary
print("Keys in image_features dictionary:", list(image_features.keys())[:5])


Image IDs from DataFrame: 0    1000268201_693b08cb0e.jpg#0
1    1000268201_693b08cb0e.jpg#1
2    1000268201_693b08cb0e.jpg#2
3    1000268201_693b08cb0e.jpg#3
4    1000268201_693b08cb0e.jpg#4
Name: image_id, dtype: object
Keys in image_features dictionary: ['2387197355_237f6f41ee', '2609847254_0ec40c1cce', '2046222127_a6f300e202', '2853743795_e90ebc669d', '2696951725_e0ae54f6da']
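The printout shows that the caption IDs carry a ".jpg#n" suffix while the feature keys are bare file stems, so the two cannot be matched as they are. One way to reconcile them is a small helper like the one below; normalize_image_id is a hypothetical name introduced here for illustration.

```python
import os

def normalize_image_id(raw_id):
    """Map '1000268201_693b08cb0e.jpg#0' to '1000268201_693b08cb0e'."""
    filename = raw_id.split("#", 1)[0]      # drop the '#<caption index>' suffix
    return os.path.splitext(filename)[0]    # drop the file extension, if any

print(normalize_image_id("1000268201_693b08cb0e.jpg#0"))
```

The helper leaves IDs that are already bare stems unchanged, so it can be applied to caption IDs and to plain file names alike.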


In [4]:
# Tokenize captions
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions_df["caption"])
vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary size:", vocab_size)

# Create sequences of tokens for each caption
sequences = tokenizer.texts_to_sequences(captions_df["caption"])

# Pad sequences to ensure uniform length
max_seq_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

# Reduce image IDs such as "1000268201_693b08cb0e.jpg#0" to the bare file stem
# so they match the keys of image_features
captions_df["image_id"] = captions_df["image_id"].apply(lambda x: x.split(".")[0])

# Create input-output pairs for training
X = []
y = []

for img_id, seq in zip(captions_df["image_id"], padded_sequences):
    if img_id in image_features:
        X.append(image_features[img_id])
        y.append(seq)

X = np.array(X)
y = np.array(y)

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


Vocabulary size: 8494
Shape of X: (40455, 7, 7, 512)
Shape of y: (40455, 37)


The data is now preprocessed: X holds the image features for each of the 40,455 captions and y holds the corresponding padded word-index sequences.
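Note that each pair above maps an image to a whole caption. If the captioning model is trained to predict one word at a time (a common setup, and an assumption here), each caption is usually expanded into several (prefix, next word) samples. The sketch below shows that expansion for a single token sequence; make_next_word_pairs is a hypothetical helper, and it uses post-padding only to match the rest of this notebook.

```python
def make_next_word_pairs(sequence, max_len):
    """Expand one token sequence into (padded prefix, next token) pairs."""
    pairs = []
    for i in range(1, len(sequence)):
        prefix = sequence[:i][-max_len:]                 # words seen so far
        padded = prefix + [0] * (max_len - len(prefix))  # post-pad with 0
        pairs.append((padded, sequence[i]))              # target is the next word
    return pairs

pairs = make_next_word_pairs([4, 9, 2, 7], max_len=5)
for prefix, target in pairs:
    print(prefix, "->", target)
```

Many captioning pipelines also wrap each caption in explicit start and end tokens so the model can learn when to stop; this notebook does not add them.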

## 4 Building the model

Next, we'll design and build the caption generation model. The VGG16 features extracted above serve as the image input, and an LSTM-based sequence model processes the caption.

In [5]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Flatten, Concatenate, Dense


# Image branch: takes the pre-extracted VGG16 feature maps as input
input_shape = (7, 7, 512)  # Shape of the extracted image features
image_input = Input(shape=input_shape, name='image_input')

# Flatten the image features
flattened_image = Flatten()(image_input)

# Define sequence-based model
caption_input = Input(shape=(max_seq_length,), name='caption_input')
# Embedding layer to convert word indices to dense vectors
embedding_dim = 300  # Example value for embedding dimensionality
embedding_layer = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_seq_length)(caption_input)
# LSTM layer to process the embedded sequences
lstm_units = 256  # Example value for LSTM units
lstm_layer = LSTM(lstm_units)(embedding_layer)
# An attention mechanism could be added here (not implemented in this notebook)

# Concatenate image and caption features
concatenated_features = Concatenate(axis=1)([flattened_image, lstm_layer])

# Output layer to predict the next word
output_layer = Dense(vocab_size, activation='softmax')(concatenated_features)

# Combine feature extractor and sequence-based model
model = Model(inputs=[image_input, caption_input], outputs=output_layer)

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print model summary
model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 caption_input (InputLayer)     [(None, 37)]         0           []                               
                                                                                                  
 image_input (InputLayer)       [(None, 7, 7, 512)]  0           []                               
                                                                                                  
 embedding (Embedding)          (None, 37, 300)      2548200     ['caption_input[0][0]']          
                                                                                                  
 flatten (Flatten)              (None, 25088)        0           ['image_input[0][0]']            
                                                                                              

Breakdown of the model summary:

- The model has two input layers: caption_input and image_input.
- The caption_input layer takes sequences of word indices representing captions.
- The image_input layer takes the features extracted from images.
- The embedding layer converts word indices into dense 300-dimensional vectors.
- The flatten layer flattens the image features into a 1D tensor of length 25,088.
- The lstm layer processes the embedded sequences.
- The concatenate layer joins the flattened image features and the LSTM output.
- The dense layer is the output layer that predicts the next word in the caption.
- The model has a total of 218,398,998 parameters, all of which are trainable.
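The total can be checked by hand from the layer sizes used above (vocabulary of 8,494 words, 300-dimensional embeddings, 256 LSTM units, 7 x 7 x 512 image features). The short calculation below uses the standard parameter formulas for Embedding, LSTM, and Dense layers; the variable names are chosen for this example.

```python
vocab_size = 8494
embedding_dim = 300
lstm_units = 256
flattened_image_dim = 7 * 7 * 512  # 25088

# One embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# The output layer sees the flattened image features concatenated with the LSTM output.
dense_inputs = flattened_image_dim + lstm_units
dense_params = dense_inputs * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Nearly all of the parameters sit in the final dense layer, because the 25,088 flattened image features are connected directly to every word in the vocabulary.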

## 5 Train/test split and expert judgments

In [6]:
# Load training and test image filenames
train_images_file = os.path.join(dataset_dir, "Flickr8k_text/Flickr_8k.trainImages.txt")
test_images_file = os.path.join(dataset_dir, "Flickr8k_text/Flickr_8k.testImages.txt")

with open(train_images_file, 'r') as file:
    # Drop the ".jpg" extension so the names match the image IDs in captions_df
    train_image_filenames = [os.path.splitext(line.strip())[0] for line in file if line.strip()]

with open(test_images_file, 'r') as file:
    test_image_filenames = [os.path.splitext(line.strip())[0] for line in file if line.strip()]


In [7]:
# Filter captions DataFrame for training and test images
train_captions_df = captions_df[captions_df['image_id'].isin(train_image_filenames)]
test_captions_df = captions_df[captions_df['image_id'].isin(test_image_filenames)]

# Load expert judgments
expert_annotations_file = os.path.join(dataset_dir, "Flickr8k_text/ExpertAnnotations.txt")
expert_annotations_df = pd.read_csv(expert_annotations_file, sep="\t")


## 6 Data generators for training and validation

In [11]:
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

batch_size = 32

# Create training and validation data generators

def create_data_generator(data, labels, batch_size): 
    train_gen = TimeseriesGenerator(data, labels, length=max_seq_length, batch_size=batch_size) 

    return train_gen

train_gen = create_data_generator(X, y, batch_size)

# Calculate the number of batches for validation

val_batches = len(X) // batch_size

# Create validation data generator

def create_validation_generator(data, batch_size): 
    val_gen = TimeseriesGenerator(data, data, length=max_seq_length, batch_size=batch_size) 
    
    return val_gen

val_gen = create_validation_generator(X, batch_size)


## 7 Model Checkpoint

In [12]:
from tensorflow.keras.callbacks import ModelCheckpoint

# Define model checkpoint to save the best model during training

checkpoint_callback = ModelCheckpoint( filepath="best_model.h5", save_weights_only=True, monitor="val_loss", mode="min", save_best_only=True )

## 8 Train the Model

In [13]:
# Define number of epochs

epochs = 20

# Train the model

history = model.fit( train_gen, epochs=epochs, validation_data=val_gen, callbacks=[checkpoint_callback] )

Epoch 1/20


ValueError: in user code:

    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1146, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1135, in run_step  **
        outputs = model.train_step(data)
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 993, in train_step
        y_pred = self(x, training=True)
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/input_spec.py", line 216, in assert_input_compatibility
        raise ValueError(

    ValueError: Layer "model" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None, None, None, None, None) dtype=float32>]
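The error arises because TimeseriesGenerator yields a single input array per batch, while the model was built with two inputs (image_input and caption_input). A plain Python generator can supply both. The sketch below is one possible shape for it, assuming the captions have already been expanded into caption prefixes and next-word targets; two_input_batches and the toy arrays are hypothetical and stand in for real data.

```python
import numpy as np

def two_input_batches(image_feats, prefixes, next_words, batch_size):
    """Yield ((images, caption_prefixes), targets) batches indefinitely."""
    n = len(image_feats)
    while True:
        order = np.random.permutation(n)          # reshuffle once per pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield (image_feats[idx], prefixes[idx]), next_words[idx]

# Toy arrays whose shapes follow the model inputs defined earlier.
toy_images = np.zeros((10, 7, 7, 512), dtype=np.float32)
toy_prefixes = np.zeros((10, 37), dtype=np.int32)
toy_targets = np.zeros((10,), dtype=np.int32)

batches = two_input_batches(toy_images, toy_prefixes, toy_targets, batch_size=4)
(img_batch, cap_batch), target_batch = next(batches)
print(img_batch.shape, cap_batch.shape, target_batch.shape)
```

Because this generator never stops, model.fit would need an explicit steps_per_epoch (and validation_steps for a validation generator of the same kind).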


In [14]:
# Data generators for training and validation (second attempt)
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# Define batch size
batch_size = 32

# Create training and validation data generators
def create_data_generator(data, labels, batch_size):
    image_gen = TimeseriesGenerator(data, data, length=max_seq_length, batch_size=batch_size)
    caption_gen = TimeseriesGenerator(labels, labels, length=max_seq_length, batch_size=batch_size)
    return zip(image_gen, caption_gen)

train_gen = create_data_generator(X, y, batch_size)

# Calculate the number of batches for validation
val_batches = len(X) // batch_size

# Create validation data generator
val_gen = create_data_generator(X, y, batch_size)

# Train the model (second attempt)
# Define number of epochs
epochs = 20

# Train the model
history = model.fit(
    train_gen,
    epochs=epochs,
    validation_data=val_gen,
    callbacks=[checkpoint_callback]
)

Epoch 1/20


ValueError: in user code:

    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1146, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1135, in run_step  **
        outputs = model.train_step(data)
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/training.py", line 993, in train_step
        y_pred = self(x, training=True)
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/Users/hemang/anaconda3/lib/python3.10/site-packages/keras/engine/input_spec.py", line 232, in assert_input_compatibility
        raise ValueError(

    ValueError: Exception encountered when calling layer "model" "                 f"(type Functional).
    
    Input 0 of layer "lstm" is incompatible with the layer: expected ndim=3, found ndim=5. Full shape received: (None, None, None, None, 300)
    
    Call arguments received by layer "model" "                 f"(type Functional):
      • inputs=('tf.Tensor(shape=(None, None, None, None, None), dtype=float32)', 'tf.Tensor(shape=(None, None, None, None), dtype=float32)')
      • training=True
      • mask=None
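The shapes in this traceback are consistent with how TimeseriesGenerator works: it treats the first axis as time and turns every sample into a window of `length` consecutive rows, which adds an extra dimension. Zipping two such generators also yields a pair of (samples, targets) tuples, so Keras appears to treat the image generator's output as the two model inputs and the caption generator's output as the labels. The NumPy sketch below mimics the first batch of such a sliding-window generator to show where the 5-D and 4-D tensors come from; window_batch is an illustrative stand-in, not the Keras implementation.

```python
import numpy as np

def window_batch(data, targets, length, batch_size):
    """Mimic the first batch of a sliding-window generator (illustrative only)."""
    ends = range(length, min(length + batch_size, len(data)))
    x = np.stack([data[i - length:i] for i in ends])   # windows of `length` rows
    y = np.stack([targets[i] for i in ends])           # the row following each window
    return x, y

features = np.zeros((50, 7, 7, 512), dtype=np.float32)  # stand-in for X
x_batch, y_batch = window_batch(features, features, length=37, batch_size=4)
print(x_batch.shape)  # 5-D: (batch, window length, 7, 7, 512)
print(y_batch.shape)  # 4-D: (batch, 7, 7, 512)
```

Since every row of X is already an independent image, no windowing is needed here; the samples only have to be batched.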


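## 9 Evaluate the Model

In [None]:
# Restore the best weights. The checkpoint was saved with save_weights_only=True,
# so the weights are loaded into the existing model rather than via load_model.
model.load_weights("best_model.h5")
best_model = model

# Evaluate the model using BLEU scores
from nltk.translate.bleu_score import corpus_bleu
from tensorflow.keras.preprocessing.text import text_to_word_sequence

index_to_word = {index: word for word, index in tokenizer.word_index.items()}

def generate_caption(image_feature):
    """Greedy decoding: feed the words predicted so far and take the most likely next word."""
    indices = []
    for _ in range(max_seq_length):
        seq = pad_sequences([indices], maxlen=max_seq_length, padding='post')
        probs = best_model.predict([image_feature[np.newaxis], seq], verbose=0)[0]
        next_index = int(np.argmax(probs))
        if next_index == 0:  # padding index, treated here as the end of the caption
            break
        indices.append(next_index)
    return [index_to_word[i] for i in indices]

# Generate one caption per test image and collect its reference captions
captions_by_image = captions_df.groupby("image_id")["caption"].apply(list)
references = []   # per image: list of tokenized reference captions
hypotheses = []   # per image: tokenized generated caption
for filename in test_image_filenames:
    img_id = os.path.splitext(filename)[0]
    hypotheses.append(generate_caption(image_features[img_id]))
    references.append([text_to_word_sequence(ref) for ref in captions_by_image[img_id]])

# Convert predictions to text
predictions_text = [" ".join(words) for words in hypotheses]

# Calculate cumulative BLEU-1 to BLEU-4 (corpus_bleu returns a single score per call)
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    print(f"BLEU-{n} score:", corpus_bleu(references, hypotheses, weights=weights))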
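BLEU is built on clipped (modified) n-gram precision: each n-gram in the generated caption counts only as often as it appears in the most generous reference. The pure-Python sketch below computes just that precision term for one caption, leaving out the brevity penalty and the geometric mean over n-gram orders that full BLEU adds. The helper names and the toy sentences are chosen for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision of one hypothesis against several references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # For each n-gram, the largest count found in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

refs = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
hyp = ["the"] * 7
p1 = modified_precision(refs, hyp, 1)
print(p1)
```

Clipping is what keeps a degenerate caption that repeats one common word from scoring well: here only two of the seven occurrences of "the" are credited.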

## 10 Visual Inspection of Generated Captions

In [None]:
# Function to display an image with its generated caption

def display_image_and_caption(image_id, caption):
    img_path = os.path.join(image_dir, image_id + ".jpg")
    img = load_img(img_path, target_size=(224, 224))
    plt.imshow(img)
    plt.axis('off')
    plt.title(caption)
    plt.show()

# Display the first test image with its generated caption

image_id = os.path.splitext(test_image_filenames[0])[0]  # works with or without a ".jpg" suffix
display_image_and_caption(image_id, predictions_text[0])