# Code structure
### Data Preprocessing
- Load and preprocess image data
- Preprocess text data

### Feature Extraction
- Load pre-trained CNN model
- Extract features from images

### Caption Generation Model
- Design sequence-based model
- Define embedding layer
- Implement attention mechanism (if needed)
- Train the model using training data

### Training
- Split data into train, validation, and test sets
- Train the model
- Validate the model
- Evaluate the model using BLEU scores

### Evaluation
- Calculate BLEU scores
- Visual inspection of generated captions
- Compare with expert and crowd judgments

### Hyperparameter Tuning
- Experiment with different architectures and hyperparameters

### Model Deployment
- Deploy the model for inference
- Provide a user-friendly interface
- Monitor the model's performance


# Code starts here 

In [47]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Set random seed for reproducibility
tf.random.set_seed(42)

# Define paths to dataset files
dataset_dir = "dataset"
image_dir = os.path.join(dataset_dir, "Flicker8k_Dataset")
caption_file = os.path.join(dataset_dir, "Flickr8k_text/Flickr8k.token.txt")

# Load captions into a DataFrame
captions_df = pd.read_csv(caption_file, sep="\t", header=None, names=["image_id", "caption"])

## Preprocessing

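We'll resize each image to 224×224 pixels, apply the VGG16 `preprocess_input` function, and extract convolutional features with the pre-trained VGG16 model. The model is loaded with `include_top=False`, so it returns a 7×7×512 feature map per image rather than class predictions.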
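To make the preprocessing step less of a black box, here is a minimal NumPy sketch of what `preprocess_input` does for VGG16 as far as we understand it: it reorders the channels from RGB to BGR and subtracts the ImageNet channel means, without any scaling. The function name `vgg16_style_preprocess` and the white test image are made up for illustration. The notebook itself should keep calling the Keras function.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by Caffe-style VGG preprocessing.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_style_preprocess(rgb_image):
    """Illustrative stand-in for preprocess_input on an (H, W, 3) RGB array in [0, 255]."""
    bgr = rgb_image[..., ::-1].astype(np.float32)  # reorder channels RGB -> BGR
    return bgr - IMAGENET_MEAN_BGR                 # zero-center each channel, no scaling

# Hypothetical input: a pure white 224x224 image.
white = np.full((224, 224, 3), 255.0, dtype=np.float32)
out = vgg16_style_preprocess(white)
print(out.shape)  # (224, 224, 3)
```

Because the result is zero-centered rather than scaled to [0, 1], the preprocessed arrays contain negative values and are not meant to be displayed directly with `plt.imshow`.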

In [48]:
# Function to load and preprocess images
def load_and_preprocess_image(image_path, target_size=(224, 224)):
    img = load_img(image_path, target_size=target_size)
    img_array = img_to_array(img)
    img_array = preprocess_input(img_array)
    return img_array

# Load and preprocess all images
image_data = {}
for img_file in os.listdir(image_dir):
    img_path = os.path.join(image_dir, img_file)
    image_data[img_file.split('.')[0]] = load_and_preprocess_image(img_path)

# Extract image features using pre-trained VGG16 model
vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
image_features = {}
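for img_id, img_data in image_data.items():
    img_data = np.expand_dims(img_data, axis=0)  # add batch dimension: (1, 224, 224, 3)
    features = vgg_model.predict(img_data, verbose=0)  # verbose=0 avoids one progress bar per image
    image_features[img_id] = features[0]  # drop batch dimension: (7, 7, 512)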




## Tokenizing

Next we preprocess the text data: tokenize the captions, build a vocabulary, and convert words to indices. The first cell below prints a few image IDs from the captions DataFrame and from the `image_features` dictionary to check how each is formatted. The second cell tokenizes the captions with the Keras `Tokenizer` class, converts each caption to a sequence of word indices, and pads the sequences with zeros to a uniform length. It then builds the input-output pairs for training, where `X` contains image features and `y` contains the padded sequences of word indices.
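As a rough illustration of the tokenize, index, and pad steps, here is a simplified pure-Python sketch on two made-up captions. It is not the Keras implementation: it only lowercases and splits on whitespace, whereas `Tokenizer` also strips punctuation by default. The helper names (`build_vocab`, `encode`, `pad_post`) are hypothetical.

```python
from collections import Counter

def build_vocab(captions):
    """Map each word to an index starting at 1; index 0 is reserved for padding."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    # most_common() orders words by descending frequency, so frequent words get small indices.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(captions, vocab):
    """Replace each known word with its vocabulary index."""
    return [[vocab[w] for w in cap.lower().split() if w in vocab] for cap in captions]

def pad_post(seqs, maxlen):
    """Append zeros so that every sequence has length maxlen."""
    return [seq + [0] * (maxlen - len(seq)) for seq in seqs]

# Made-up captions for illustration only.
toy_captions = ["a dog runs", "a dog jumps over a log"]
toy_vocab = build_vocab(toy_captions)
toy_seqs = encode(toy_captions, toy_vocab)
toy_max_len = max(len(s) for s in toy_seqs)
toy_padded = pad_post(toy_seqs, toy_max_len)
toy_vocab_size = len(toy_vocab) + 1  # +1 for the padding index 0

print(toy_padded)      # [[1, 2, 3, 0, 0, 0], [1, 2, 4, 5, 1, 6]]
print(toy_vocab_size)  # 7
```

The `+ 1` on the vocabulary size mirrors the notebook's `vocab_size` computation: word indices start at 1, so an embedding layer needs one extra row for the padding value 0.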

In [49]:
# Print image IDs from DataFrame
print("Image IDs from DataFrame:", captions_df["image_id"].head())

# Print keys in image_features dictionary
print("Keys in image_features dictionary:", list(image_features.keys())[:5])


Image IDs from DataFrame: 0    1000268201_693b08cb0e.jpg#0
1    1000268201_693b08cb0e.jpg#1
2    1000268201_693b08cb0e.jpg#2
3    1000268201_693b08cb0e.jpg#3
4    1000268201_693b08cb0e.jpg#4
Name: image_id, dtype: object
Keys in image_features dictionary: ['2387197355_237f6f41ee', '2609847254_0ec40c1cce', '2046222127_a6f300e202', '2853743795_e90ebc669d', '2696951725_e0ae54f6da']
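The output above shows the mismatch we have to resolve: IDs in the DataFrame look like `1000268201_693b08cb0e.jpg#0` (file name plus caption index), while the keys of `image_features` are bare file stems. The notebook handles this further down by splitting on the first dot. A slightly more explicit sketch, using a hypothetical helper name, could look like this:

```python
import os

def caption_id_to_image_key(caption_id):
    """Turn '1000268201_693b08cb0e.jpg#0' into '1000268201_693b08cb0e'."""
    filename = caption_id.split("#", 1)[0]  # drop the '#<caption index>' suffix
    return os.path.splitext(filename)[0]    # drop the file extension

sample_ids = ["1000268201_693b08cb0e.jpg#0", "1000268201_693b08cb0e.jpg#4"]
keys = [caption_id_to_image_key(cid) for cid in sample_ids]
print(keys)  # ['1000268201_693b08cb0e', '1000268201_693b08cb0e']
```

Both captions map to the same key, which is what lets several captions share one feature map.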


In [50]:
# Tokenize captions
tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions_df["caption"])
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
print("Vocabulary size:", vocab_size)

# Create sequences of tokens for each caption
sequences = tokenizer.texts_to_sequences(captions_df["caption"])

# Pad sequences to ensure uniform length
max_seq_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

# Strip the ".jpg#<caption index>" suffix so the IDs match the keys of image_features
captions_df["image_id"] = captions_df["image_id"].apply(lambda x: x.split(".")[0])

# Create input-output pairs for training
X = []
y = []

for img_id, seq in zip(captions_df["image_id"], padded_sequences):
    if img_id in image_features:
        X.append(image_features[img_id])
        y.append(seq)

X = np.array(X)
y = np.array(y)

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


Vocabulary size: 8494
Shape of X: (40455, 7, 7, 512)
Shape of y: (40455, 37)
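The shapes above are worth a quick sanity check. Each row of `X` is a full 7×7×512 feature map, and the same map is stored once for every caption of an image. The sketch below estimates the memory this takes, assuming float32 values (the Keras default), and shows two common ways to reshape a feature map before it reaches a caption model. Neither option is confirmed as this notebook's choice, and `toy_X` is random stand-in data.

```python
import numpy as np

# Shapes reported in the notebook output above.
n_pairs, h, w, channels = 40455, 7, 7, 512

# Memory needed to hold X, assuming float32 (4 bytes per value).
x_bytes = n_pairs * h * w * channels * 4
print(f"X needs about {x_bytes / 1e9:.2f} GB")  # about 4.06 GB

# Toy stand-in for a few rows of X (random values, not real features).
toy_X = np.random.default_rng(0).random((3, h, w, channels), dtype=np.float32)

# Option A: keep the spatial detail as 49 region vectors per image,
# the usual input layout for an attention mechanism.
regions = toy_X.reshape(toy_X.shape[0], h * w, channels)  # (3, 49, 512)

# Option B: average over the 7x7 grid to get a single 512-d vector per image.
pooled = toy_X.mean(axis=(1, 2))                          # (3, 512)

print(regions.shape, pooled.shape)
```

If memory becomes a problem, one alternative is to keep `image_features` as the only copy of the feature maps and store just the image ID per caption, looking the features up batch by batch during training.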


The data is now preprocessed: each of the 40,455 captions is paired with the 7×7×512 feature map of its image, giving the input-output pairs for training a caption generation model.
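One thing to keep in mind for the train/validation/test split planned in the outline: each image contributes several rows (captions `#0` to `#4` in the output earlier), so a purely row-wise random split can put the same image in both the training and the test set. Below is a minimal NumPy sketch of splitting by image ID instead. The function name, the fractions, and the toy IDs are assumptions for illustration. To use it here, the image ID of each row would need to be collected alongside `X` and `y` in the pairing loop.

```python
import numpy as np

def split_rows_by_image(image_ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Return boolean row masks (train, val, test) that keep all rows of an image together."""
    image_ids = np.asarray(image_ids)
    unique_ids = np.unique(image_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_ids)  # shuffle images, not individual caption rows

    n_test = int(round(len(unique_ids) * test_frac))
    n_val = int(round(len(unique_ids) * val_frac))
    test_ids = unique_ids[:n_test]
    val_ids = unique_ids[n_test:n_test + n_val]

    test_mask = np.isin(image_ids, test_ids)
    val_mask = np.isin(image_ids, val_ids)
    train_mask = ~(test_mask | val_mask)
    return train_mask, val_mask, test_mask

# Toy example: 10 images with 5 captions each.
toy_ids = np.repeat([f"img{i}" for i in range(10)], 5)
train_mask, val_mask, test_mask = split_rows_by_image(toy_ids)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 40 5 5
```

The masks can then index the arrays directly, for example `X[train_mask]` and `y[train_mask]`. scikit-learn's `GroupShuffleSplit` offers similar grouping behaviour if you prefer to stay with `sklearn.model_selection`.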

## Training the model