# 1. Installing and Importing Libraries

In this section, we install and import the libraries needed for data loading, preprocessing, model building, training, and evaluation.

- We install the core packages (TensorFlow, NumPy, pandas, Matplotlib, and scikit-learn) along with the other packages listed in the `pip` command below.
- We import the built-in modules `os`, `re`, and `pickle`.
- We set the TensorFlow and NumPy random seeds so that runs are more reproducible.
- If running on Google Colab, we also mount Google Drive to access the dataset and save models.

In [None]:
!pip install --upgrade pip
!pip3 install numpy pandas matplotlib scikit-learn tensorflow transformers torch Pillow requests pyngrok streamlit deep_translator googletrans==4.0.0-rc1

Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 11.8 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.1.1
Collecting pyngrok
  Downloading pyngrok-7.2.7-py3-none-any.whl.metadata (9.4 kB)
Collecting streamlit
  Downloading streamlit-1.45.0-py3-none-any.whl.metadata (8.9 kB)
Collecting deep_translator
  Downloading deep_translator-1.11.4-py3-none-any.whl.metadata (30 kB)
Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreloa

In [None]:
import numpy as np
import pandas as pd
import os
import re
import pickle
import tensorflow as tf
from tensorflow.keras import Model, layers
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Bidirectional, concatenate, Layer, MultiHeadAttention, LayerNormalization, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.applications import DenseNet201
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

tf.random.set_seed(42)
np.random.seed(42)

print("DEBUG: Libraries imported successfully.")

try:
    from google.colab import drive
    drive.mount('/content/drive')
    print("DEBUG: Google Drive mounted successfully.")
    drive_base_path = '/content/drive/MyDrive/'
    data_path = os.path.join(drive_base_path, 'captions.txt')
    image_dir = os.path.join(drive_base_path, 'Flicker8k_Dataset')
    features_path = os.path.join(drive_base_path, 'features.pkl')
    tokenizer_path = os.path.join(drive_base_path, 'tokenizer.pkl')
    model_weights_dir = os.path.join(drive_base_path, 'ImageCaptioningModels_Org')
    os.makedirs(model_weights_dir, exist_ok=True)
    print(f"DEBUG: Paths: \n{data_path}\n{image_dir}\n{features_path}\n{tokenizer_path}\n{model_weights_dir}")
except ModuleNotFoundError:
    print("DEBUG: Google Colab environment not detected.")
    data_path = 'captions.txt'
    image_dir = 'Flicker8k_Dataset'
    features_path = 'features.pkl'
    tokenizer_path = 'tokenizer.pkl'
    model_weights_dir = 'ImageCaptioningModels_Org'
    os.makedirs(model_weights_dir, exist_ok=True)

DEBUG: Libraries imported successfully.
Mounted at /content/drive
DEBUG: Google Drive mounted successfully.
DEBUG: Paths: 
/content/drive/MyDrive/captions.txt
/content/drive/MyDrive/Flicker8k_Dataset
/content/drive/MyDrive/features.pkl
/content/drive/MyDrive/tokenizer.pkl
/content/drive/MyDrive/ImageCaptioningModels_Org


# 2. Loading Captions Data and Processing Image Names and Captions

In this section, we load the captions data and preprocess the image names and captions.
- The captions are loaded from a tab-separated `.txt` file.
- We clean the image names by stripping the `#` and anything that follows it.
- We store the cleaned image names and the captions in separate columns (`image` and `caption`).
- We also check that the expected columns are present.

Finally, we count the unique image names and print some statistics for validation.
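
As an illustration of the cleaning step, the sketch below parses two made-up lines in the tab-separated `image.jpg#index<TAB>caption` layout that the loading code appears to assume. The sample lines and the helper name `parse_caption_line` are hypothetical and are not part of the notebook.

```python
def parse_caption_line(line):
    """Split one raw line into (image_name, caption)."""
    image_raw, caption_raw = line.rstrip("\n").split("\t", 1)
    image_name = image_raw.split("#")[0]  # drop the '#' and whatever follows it
    return image_name, caption_raw.strip()

# Two made-up lines in the assumed layout (not taken from the dataset)
sample_lines = [
    "example_001.jpg#0\tA dog runs across the grass .\n",
    "example_001.jpg#1\tA brown dog is running outside .\n",
]

parsed = [parse_caption_line(line) for line in sample_lines]
unique_images = {name for name, _ in parsed}

print(parsed[0])
print("unique images:", len(unique_images))
```

Both sample rows map to the same image name once the suffix is removed, which is why the number of unique images is much smaller than the number of caption rows.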

In [None]:
# 2. Loading Captions Data and Processing Image Names and Captions
try:
    data = pd.read_csv(data_path, sep='\t', header=None, names=['image_raw', 'caption_raw'])
    print(f"DEBUG: Captions file loaded successfully with {len(data)} rows.")
    if 'image_raw' not in data.columns or 'caption_raw' not in data.columns:
        raise ValueError("Columns 'image_raw' or 'caption_raw' are missing!")
    data['image'] = data['image_raw'].astype(str).apply(lambda x: x.split('#')[0])  # Clean image name
    data['caption'] = data['caption_raw'].astype(str)  # Ensure captions are strings
    print(f"DEBUG: Example cleaned image: {data['image'].iloc[0]}")
    print(f"DEBUG: Number of unique images: {data['image'].nunique()}")
except Exception as e:
    print(f"Error loading or preparing captions file: {e}")

DEBUG: Captions file loaded successfully with 40455 rows.
DEBUG: Example cleaned image: 1000268201_693b08cb0e.jpg
DEBUG: Number of unique images: 8091


# 3. Text Cleaning and Tokenizer Preparation

In this section, we clean the captions and prepare the tokenizer:
- Each caption is converted to lowercase, and every character other than `a`-`z` and the space is removed.
- We add the start and end tokens `startseq` and `endseq` to each caption to mark where the sequence begins and ends.
- We load the tokenizer from a pickle file if one exists; otherwise we fit a new tokenizer on the cleaned captions and save it for future use.
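
To make the effect of these steps concrete, here is a small self-contained sketch. The `clean_text` function mirrors the one used in the notebook; `build_word_index` is a hypothetical, simplified stand-in for what fitting a Keras `Tokenizer` produces (a word-to-integer map ordered by frequency), and the two captions are made up.

```python
import re
from collections import Counter

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z ]', '', text)       # keep only a-z and spaces
    text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated spaces
    return 'startseq ' + text + ' endseq'

def build_word_index(texts):
    """Simplified stand-in for a fitted tokenizer: most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.split())
    # sorted() is stable, so ties keep their first-seen order
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}

captions = ["A dog runs!", "A dog sleeps, quietly."]
cleaned = [clean_text(c) for c in captions]
word_index = build_word_index(cleaned)

vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
max_length = max(len(c.split()) for c in cleaned)

print(cleaned)
print(word_index)
print(vocab_size, max_length)
```

Index 0 is never assigned to a word, which is what later allows zero to serve as the padding value.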

In [None]:
# 3. Text Cleaning and Tokenizer Preparation
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-z ]', '', text)  # Remove non-alphabetic characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return 'startseq ' + text + ' endseq'  # Add start and end tokens

data['caption_cleaned'] = data['caption'].apply(clean_text)
print("DEBUG: Example cleaned caption:", data['caption_cleaned'].iloc[0])

if os.path.exists(tokenizer_path):
    with open(tokenizer_path, 'rb') as f:
        tokenizer = pickle.load(f)  # Load existing tokenizer
    print("DEBUG: Tokenizer loaded.")
else:
    tokenizer = Tokenizer()  # Create a new tokenizer
    tokenizer.fit_on_texts(data['caption_cleaned'])
    with open(tokenizer_path, 'wb') as f:
        pickle.dump(tokenizer, f)  # Save the tokenizer
    print("DEBUG: Tokenizer initialized and saved.")

vocab_size  = len(tokenizer.word_index) + 1  # Size of vocabulary
max_length  = max(len(txt.split()) for txt in data['caption_cleaned'])  # Max length of captions
print(f"DEBUG: vocab_size={vocab_size} | max_length={max_length}")

DEBUG: Example cleaned caption: startseq a child in a pink dress is climbing up a set of stairs in an entry way endseq
DEBUG: Tokenizer loaded.
DEBUG: vocab_size=8768 | max_length=37


# 4. Extracting Image Features or Loading from Pickle File

In this section, we extract features from the images using a pre-trained DenseNet201 model.
- If the features are already extracted and stored in a pickle file, we load them from the file.
- If not, we use the DenseNet201 model (without its classification head, with global average pooling) to extract features from each image and save them for future use.
- Each image is represented by a 1920-dimensional feature vector.
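
The load-or-compute pattern described above can be sketched in isolation. In the example below, `load_or_compute` and `fake_extractor` are hypothetical names; the extractor simply returns zeros with the same `(1, 1920)` shape that a pooled DenseNet201 prediction for a single image would have, so no model is needed to run it.

```python
import os
import pickle
import tempfile
import numpy as np

def load_or_compute(cache_path, names, extract_fn):
    """Return {name: vector}, reading the pickle cache when it already exists."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    # squeeze() drops the batch axis: (1, 1920) becomes (1920,)
    features = {name: np.asarray(extract_fn(name)).squeeze() for name in names}
    with open(cache_path, 'wb') as f:
        pickle.dump(features, f)
    return features

calls = []

def fake_extractor(name):
    # Placeholder for a real model call; records how often it is invoked
    calls.append(name)
    return np.zeros((1, 1920), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'features.pkl')
    first = load_or_compute(cache, ['a.jpg', 'b.jpg'], fake_extractor)
    second = load_or_compute(cache, ['a.jpg', 'b.jpg'], fake_extractor)  # served from cache

print(first['a.jpg'].shape)
print("extractor calls:", len(calls))
```

The second call never reaches the extractor, which is the point of the cache: the expensive forward pass over every image runs only once.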

In [None]:
# 4. Extracting Image Features or Loading from Pickle File
print("DEBUG: Starting image feature extraction or loading...")

if os.path.exists(features_path):
    with open(features_path, 'rb') as f:
        features = pickle.load(f)  # Load precomputed image features
else:
    base_model = DenseNet201(include_top=False, weights='imagenet', pooling='avg', input_shape=(224, 224, 3))  # Load DenseNet model
    features = {}
    for img_name in data['image'].unique():
        img_path = os.path.join(image_dir, img_name)
        if not os.path.exists(img_path):
            continue
        img = load_img(img_path, target_size=(224, 224))  # Load and resize image
        img_array = img_to_array(img)
        img_array = np.expand_dims(img_array, axis=0)
        img_array = tf.keras.applications.densenet.preprocess_input(img_array)  # Preprocess image for DenseNet
        feature_vector = base_model.predict(img_array, verbose=0)
        feature_vector = np.array(feature_vector).squeeze()  # Flatten the feature vector
        features[img_name] = feature_vector
    with open(features_path, 'wb') as f:
        pickle.dump(features, f)  # Save extracted features
    print("DEBUG: Image features saved.")

# Example feature
example_feature = next(iter(features.values()))
print("DEBUG: Example image feature shape:", example_feature.shape)
print("DEBUG: Example feature values (partial):", example_feature[:10])

feature_size = example_feature.shape[-1]  # Size of feature vector
print(f"DEBUG: Feature vector size: {feature_size}")

DEBUG: Starting image feature extraction or loading...
DEBUG: Example image feature shape: (1, 1920)
DEBUG: Example feature values (partial): [[7.8687917e-05 7.3524064e-04 1.1395990e-03 ... 5.6523514e-01
  2.2903775e-01 6.9639796e-01]]
DEBUG: Feature vector size: 1920


# 5. Converting Captions to Sequences

In this section, we convert the cleaned captions into integer sequences:
- The tokenizer is used to map each word in the caption to an integer.
- The sequences are padded with zeros at the end (`padding='post'`) so that they all have the same length (`max_length`).
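
As a rough illustration of what post-padding produces, the sketch below uses a hypothetical NumPy helper, `pad_post`, in place of `pad_sequences`. The tiny `word_index` and the captions are invented for the example.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Right-pad integer sequences with zeros up to maxlen."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]  # if too long, keep the last maxlen tokens
        out[row, :len(seq)] = seq
    return out

# Hypothetical vocabulary; index 0 is reserved for padding
word_index = {'startseq': 1, 'endseq': 2, 'a': 3, 'dog': 4, 'runs': 5}
captions = ['startseq a dog runs endseq', 'startseq a dog endseq']

sequences = [[word_index[w] for w in c.split()] for c in captions]
padded = pad_post(sequences, maxlen=6)
print(padded)
```

Every row ends up with the same length, and the zeros at the end are the positions that `mask_zero=True` in the embedding layers is meant to ignore.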

In [None]:
# 5. Converting Captions to Sequences
sequences = tokenizer.texts_to_sequences(data['caption_cleaned'])  # Convert captions to sequences of integers
X_seq = pad_sequences(sequences, maxlen=max_length, padding='post')  # Pad sequences to the same length

# Mapping images to their corresponding caption sequences
image_to_seq_map = {}
for img_name, seq in zip(data['image'], X_seq):
    if img_name not in image_to_seq_map:
        image_to_seq_map[img_name] = []
    image_to_seq_map[img_name].append(seq)

# 6. Splitting Images for Training and Validation

In this section, we split the dataset into training and validation sets:
- We split the unique image names into a training set (80%) and a validation set (20%), so that all captions of an image end up in the same set.
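
The reason for splitting by image name rather than by caption row can be shown with a small standard-library sketch. The helper `split_by_image` is hypothetical and only approximates what `train_test_split` does; the image names are made up.

```python
import math
import random

def split_by_image(image_names, test_fraction=0.2, seed=42):
    """Shuffle the unique image names and cut off a validation share."""
    names = sorted(set(image_names))
    random.Random(seed).shuffle(names)
    n_val = math.ceil(len(names) * test_fraction)
    return names[n_val:], names[:n_val]

# 10 made-up images with 5 captions each
rows = [(f"img_{i}.jpg", f"caption {j}") for i in range(10) for j in range(5)]

train_names, val_names = split_by_image([name for name, _ in rows])
train_set = set(train_names)
train_rows = [r for r in rows if r[0] in train_set]

print(len(train_names), len(val_names), len(train_rows))
```

Because the split is made on image names, no image contributes captions to both sets, so the validation loss is measured on images the model has not seen during training.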

In [None]:
# 6. Splitting Images for Training and Validation
unique_images = data['image'].unique()
img_train_names, img_val_names = train_test_split(unique_images, test_size=0.2, random_state=42)  # Split images into train/val
print("DEBUG: train images:", len(img_train_names), "| val images:", len(img_val_names))

DEBUG: train images: 6472 | val images: 1619


### 7. Data Generator for Model Training

In this section, we define a custom data generator based on the Keras `Sequence` class (`tensorflow.keras.utils.Sequence`). This generator handles batching and shuffling of image-caption pairs for training and validation. It provides an efficient way to feed data into the model during training by:

- **Generating Pairs:** The generator creates pairs of image features and their corresponding caption sequences.
- **Teacher Forcing:** Each caption is expanded into several training samples. For every position, the input is the ground-truth prefix of the caption and the target is the next word. Feeding the true previous words, rather than the model's own predictions, is known as teacher forcing.
- **Batching:** The generator returns batches of image features, padded caption prefixes, and one-hot encoded target words. Because each caption expands into several samples, a batch holds more than `batch_size` rows.
- **Shuffling:** The generator shuffles the data after each epoch if the `shuffle` parameter is set to `True`.

This generator is designed to be used with the Keras `fit` function for model training.
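
The expansion performed for teacher forcing can be seen on a single padded caption. The sketch below uses a hypothetical helper, `expand_caption`, and an invented integer sequence; it follows the same loop structure as the generator's `__getitem__` but leaves out the image features and the one-hot encoding.

```python
import numpy as np

def expand_caption(seq, max_length):
    """Turn one padded caption into (prefix, next_word) training pairs."""
    pairs = []
    for i in range(1, len(seq)):
        target = int(seq[i])
        if target == 0:
            continue  # padding positions are never used as targets
        prefix = np.zeros(max_length, dtype=np.int32)
        prefix[:i] = seq[:i]  # ground-truth words seen so far, post-padded
        pairs.append((prefix, target))
    return pairs

# Made-up caption: 1 = startseq, 2 = endseq, 0 = padding
seq = np.array([1, 5, 7, 2, 0, 0])
pairs = expand_caption(seq, max_length=6)

for prefix, target in pairs:
    print(prefix.tolist(), '->', target)
```

A caption with four real tokens yields three training samples, so the number of rows in a batch depends on the caption lengths and not only on `batch_size`.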

#### Key Components:
- **`__init__`**: Initializes the generator with parameters like image names, caption sequences, and batch size.
- **`make_pairs`**: Creates a list of pairs `(image_name, caption_sequence)` by combining images and their corresponding captions.
- **`__len__`**: Returns the number of batches per epoch.
- **`on_epoch_end`**: Shuffles the data indices after each epoch if `shuffle` is enabled.
- **`__getitem__`**: Retrieves a batch of image-caption pairs, where each pair contains image features and caption sequences, with teacher forcing applied.

#### Purpose:
The primary purpose of this data generator is to streamline the process of training the image captioning model. It ensures that the model receives the data in batches, with captions split into input-output pairs, and it looks up the precomputed image features instead of recomputing them.

The generators are constructed as shown below and can then be passed directly to the model’s `fit` method:

```
train_gen = DataGenerator(img_train_names, image_to_seq_map, features, batch_size, max_length, vocab_size)
val_gen = DataGenerator(img_val_names, image_to_seq_map, features, batch_size, max_length, vocab_size)
```

In [None]:
from tensorflow.keras.utils import Sequence
from tensorflow.keras.utils import to_categorical

class DataGenerator(Sequence):
    def __init__(self, img_names, image_to_seq_map, features, batch_size, max_length, vocab_size, shuffle=True):
        super().__init__()  # Let the Keras base class set up its own state
        # Initialize the data generator
        self.img_names = img_names
        self.image_to_seq_map = image_to_seq_map
        self.features = features
        self.batch_size = batch_size
        self.max_length = max_length
        self.vocab_size = vocab_size
        self.shuffle = shuffle

        # Prepare a list of (img_name, seq) pairs
        self.pairs = self.make_pairs()
        self.indices = np.arange(len(self.pairs))
        self.on_epoch_end()  # Initial shuffle (if enabled) before the first epoch

    def make_pairs(self):
        # Create pairs of image names and corresponding caption sequences
        pairs = []
        for img_name in self.img_names:
            if img_name in self.image_to_seq_map and img_name in self.features:
                for seq in self.image_to_seq_map[img_name]:
                    pairs.append((img_name, seq))  # Add (image_name, caption_sequence) to pairs list
        return pairs

    def __len__(self):
        # Returns the number of batches per epoch
        return int(np.ceil(len(self.pairs) / float(self.batch_size)))

    def on_epoch_end(self):
        # Shuffle the indices after each epoch (optional)
        if self.shuffle:
            np.random.shuffle(self.indices)

    def __getitem__(self, idx):
        # Generate a batch of data (X_img, X_seq, y) for training
        batch_indices = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_pairs = [self.pairs[i] for i in batch_indices]

        X_img, X_seq, y = [], [], []
        for img_name, seq in batch_pairs:
            feature_vector = self.features[img_name]  # Get image features
            feature_vector = np.array(feature_vector).squeeze()

            # Teacher forcing: each step in the sequence has an input/output
            for i in range(1, len(seq)):
                in_seq = pad_sequences([seq[:i]], maxlen=self.max_length, padding='post')[0]  # Input sequence
                out_word_index = seq[i]  # The next word in the sequence
                if out_word_index == 0:
                    continue  # Skip padding token
                out_word = to_categorical([out_word_index], num_classes=self.vocab_size)[0]  # Convert to one-hot encoding

                X_img.append(feature_vector)  # Append the image feature
                X_seq.append(in_seq)  # Append the input sequence
                y.append(out_word)  # Append the output (target) word

        return (np.array(X_img), np.array(X_seq)), np.array(y)

# Initialize the data generators for training and validation
batch_size = 32  # You can adjust this value depending on memory constraints
train_gen = DataGenerator(img_train_names, image_to_seq_map, features, batch_size, max_length, vocab_size)
val_gen   = DataGenerator(img_val_names, image_to_seq_map, features, batch_size, max_length, vocab_size)

### 8. Training the Model

The `train_model` function handles training the image captioning model, saving the best model during training, and loading previously saved weights when they exist.

#### Key Steps:
1. **Check if Weights Exist**: If the model weights already exist (i.e., training has been done previously), the function will load the weights from the file and return the model without retraining it.
2. **Train the Model**: If no weights are found, the function will proceed with training the model.
   - **Callbacks**:
     - **ModelCheckpoint**: Saves the best version of the model based on the validation loss (`val_loss`).
     - **EarlyStopping**: Stops training early if the validation loss does not improve for a specified number of epochs (patience), and restores the best weights from the training process.

#### Parameters:
- **model**: The image captioning model to be trained.
- **model_name**: The name of the model, which will be used to save the weights.
- **train_gen**: The training data generator that provides the image-caption pairs.
- **val_gen**: The validation data generator.
- **model_weights_dir**: The directory where the model weights will be saved.
- **epochs**: The number of epochs for training (default is 15).

#### Function Logic:
- If the model weights already exist, they are loaded into the model, and the model itself is returned.
- If the weights don't exist, the model is trained with the provided data generators (`train_gen` and `val_gen`) for the specified number of epochs. The best model is saved based on the lowest validation loss, and the training `History` object is returned.
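
The patience rule behind early stopping can be traced by hand on a short list of validation losses. The function below, `simulate_early_stopping`, is a hypothetical, simplified model of that rule and is not the Keras implementation; the loss values are invented, and epochs are counted from zero.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses, one per epoch
losses = [1.00, 0.90, 0.95, 0.93, 0.92, 0.80]
stopped_at, best_epoch = simulate_early_stopping(losses, patience=3)

print("stopped at epoch:", stopped_at)
print("best epoch:", best_epoch)
```

In this trace, training stops after three epochs without improvement, so the later drop to 0.80 is never reached. With `restore_best_weights=True`, the weights from the best epoch are the ones the model keeps.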

#### Example Usage:

```
history = train_model(model, 'image_captioning_model', train_gen, val_gen, model_weights_dir, epochs=20)
```

In [None]:
def train_model(model, model_name, train_gen, val_gen, model_weights_dir, epochs=15):
    model_weights_path = os.path.join(model_weights_dir, f"{model_name}_full_model.keras")
    print(f"\nDEBUG: ----- Training/Loading model: {model_name} -----")

    # Check if the model weights already exist
    if os.path.exists(model_weights_path):
        print(f"DEBUG: Found weights file! Loading the model weights...")
        model.load_weights(model_weights_path)
        return model

    # Train the model if weights don't exist
    history = model.fit(
        train_gen,
        epochs=epochs,
        validation_data=val_gen,
        callbacks=[
            # Save the best model based on validation loss
            tf.keras.callbacks.ModelCheckpoint(
                model_weights_path, save_best_only=True, save_weights_only=False, monitor='val_loss', verbose=1
            ),
            # Stop training early if the validation loss doesn't improve
            tf.keras.callbacks.EarlyStopping(
                monitor='val_loss', patience=3, verbose=1, restore_best_weights=True
            )
        ],
        verbose=1
    )

    print(f"DEBUG: ----- Model training completed: {model_name} -----")
    return history

### 9. Model Architecture: `build_model1`

This function defines the architecture of a simple image captioning model that uses both image features and text sequences (captions). The model consists of two inputs: one for image features and one for the caption sequence. The image features are passed through a dense layer, and the caption sequence goes through an embedding layer followed by an LSTM layer. The outputs of both branches are concatenated and passed through a final dense layer to predict the next word in the sequence.

#### Key Components:
1. **Image Input**:
   - The image features are passed as a vector of shape `(feature_size,)` where `feature_size` is the dimension of the image feature vector extracted earlier using a pre-trained model like DenseNet.
   - These features are passed through a Dense layer with 256 units and ReLU activation.

2. **Caption Input**:
   - The caption input is a sequence of word indices, represented by a vector of shape `(max_length,)`.
   - An Embedding layer is used to convert word indices into dense vectors, followed by an LSTM layer to capture the sequential dependencies in the captions.

3. **Concatenation of Image and Caption**:
   - The outputs of the image feature processing branch (`img_feats`) and the LSTM layer (`lstm`) are concatenated together into a combined feature vector.

4. **Output Layer**:
   - A Dense layer with softmax activation is used to predict the next word in the sequence. The output dimension is equal to the vocabulary size.

#### Parameters:
- **feature_size**: The dimension of the image feature vector.
- **vocab_size**: The total number of unique words in the vocabulary (plus one for padding).
- **max_length**: The maximum length of the caption sequences.

#### Function Logic:
- The model uses both image features and caption sequences as inputs. The image features are passed through a dense layer, and the captions are processed using an embedding followed by an LSTM.
- The outputs of both the image and caption branches are concatenated and passed through a softmax output layer to predict the next word.
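
To get a feel for where the model's capacity sits, the parameter count of each layer can be worked out by hand. The sketch below assumes the standard formulas for fully connected, embedding, and LSTM layers (one bias vector per LSTM gate), and uses the feature size (1920) and vocabulary size (8768) printed earlier in the notebook. The helper names are hypothetical.

```python
def dense_params(inputs, units):
    return inputs * units + units  # weights + biases

def embedding_params(vocab_size, dim):
    return vocab_size * dim  # one vector per vocabulary index

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

feature_size, vocab_size = 1920, 8768

counts = {
    'image_dense': dense_params(feature_size, 256),
    'embedding': embedding_params(vocab_size, 256),
    'lstm': lstm_params(256, 256),
    'output_dense': dense_params(256 + 256, vocab_size),  # input is the concatenation
}
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name:>12}: {value:,}")
print(f"{'total':>12}: {total:,}")
```

Under these assumptions the total comes to 7,759,680 parameters, with more than half of them in the output layer, because that layer maps the 512-dimensional combined vector onto the whole vocabulary.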

In [None]:
def build_model1(feature_size, vocab_size, max_length):
    print(f"feature_size: {feature_size}")

    # Image input branch (input shape is the feature size)
    image_input = Input(shape=(feature_size,))
    img_feats = Dense(256, activation='relu')(image_input)  # Dense layer for image features

    # Caption input branch (input shape is the maximum length of the sequence)
    caption_input = Input(shape=(max_length,))
    emb = Embedding(vocab_size, 256, mask_zero=True)(caption_input)  # Embedding layer for words
    lstm = LSTM(256)(emb)  # LSTM to process the sequence of words

    # Combine image and caption features
    combined = concatenate([img_feats, lstm])

    # Output layer (predict the next word)
    output = Dense(vocab_size, activation='softmax')(combined)

    # Create and compile the model
    model = Model([image_input, caption_input], output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

### 10. Model Architecture: `build_model2`

This function defines a second variation of the image captioning model. The architecture is similar to `build_model1`, but it uses a **Bidirectional LSTM** layer instead of a regular LSTM. The bidirectional layer reads the input sequence in both the forward and backward directions. Since the input here is the caption prefix seen so far, the backward pass covers the later words of that prefix, not words the model has yet to predict. This extra context may improve performance on sequence-based tasks.

#### Key Components:
1. **Image Input**:
   - The image features are passed as a vector of shape `(feature_size,)`, which represents the dimension of the image feature vector extracted earlier. This feature vector is passed through a Dense layer with 256 units and ReLU activation.

2. **Caption Input**:
   - The caption input is a sequence of word indices with a shape of `(max_length,)`, where `max_length` is the maximum length of the caption sequence.
   - An Embedding layer is used to convert word indices into dense vectors of size 256, followed by a **Bidirectional LSTM** layer. The Bidirectional LSTM processes the sequence in both directions (forward and backward) with 128 units per direction, so its concatenated output has 256 dimensions, the same size as the LSTM output in `build_model1`.

3. **Concatenation of Image and Caption**:
   - The outputs of the image feature processing branch (`img_feats`) and the Bidirectional LSTM layer (`lstm`) are concatenated together into a single vector.

4. **Output Layer**:
   - A Dense layer with softmax activation is used to predict the next word in the sequence. The output dimension is equal to the vocabulary size.

#### Parameters:
- **feature_size**: The dimension of the image feature vector.
- **vocab_size**: The total number of unique words in the vocabulary (plus one for padding).
- **max_length**: The maximum length of the caption sequences.

#### Function Logic:
- The model takes both image features and caption sequences as input. The image features go through a dense layer, and the captions are processed through an embedding layer followed by a Bidirectional LSTM layer.
- The outputs from both the image and caption branches are concatenated and passed through a softmax output layer to predict the next word in the sequence.
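
The idea of a bidirectional wrapper can be illustrated without Keras. The sketch below is a toy under stated assumptions: it uses a plain `tanh` recurrent cell instead of an LSTM, random weights, and a hypothetical helper `rnn_last_state`. It is only meant to show that one pass reads the sequence forward, another reads it reversed, and the two final states are concatenated.

```python
import numpy as np

def rnn_last_state(x, w_in, w_rec):
    """Run a plain tanh RNN over x of shape (T, d_in); return the final state."""
    h = np.zeros(w_rec.shape[0])
    for step in x:
        h = np.tanh(step @ w_in + h @ w_rec)
    return h

rng = np.random.default_rng(0)
T, d_in, units = 5, 256, 128
x = rng.normal(size=(T, d_in))  # stand-in for embedded caption tokens

# Each direction has its own weights
fwd = (rng.normal(scale=0.05, size=(d_in, units)),
       rng.normal(scale=0.05, size=(units, units)))
bwd = (rng.normal(scale=0.05, size=(d_in, units)),
       rng.normal(scale=0.05, size=(units, units)))

h_forward = rnn_last_state(x, *fwd)         # reads token 0 .. T-1
h_backward = rnn_last_state(x[::-1], *bwd)  # reads token T-1 .. 0
bidirectional_output = np.concatenate([h_forward, h_backward])

print(bidirectional_output.shape)
```

With 128 units per direction the concatenated vector has 256 entries, which is why `Bidirectional(LSTM(128))` can be concatenated with the 256-dimensional image branch in the same way as `LSTM(256)`.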

#### Example Usage:

```
model = build_model2(feature_size=1920, vocab_size=10000, max_length=40)
```


In [None]:
def build_model2(feature_size, vocab_size, max_length):
    # Image input branch (input shape is the feature size)
    image_input = Input(shape=(feature_size,))
    img_feats = Dense(256, activation='relu')(image_input)  # Dense layer for image features

    # Caption input branch (input shape is the maximum length of the sequence)
    caption_input = Input(shape=(max_length,))
    emb = Embedding(vocab_size, 256, mask_zero=True)(caption_input)  # Embedding layer for words
    lstm = Bidirectional(LSTM(128))(emb)  # Bidirectional LSTM to process the sequence of words

    # Combine image and caption features
    combined = concatenate([img_feats, lstm])

    # Output layer (predict the next word)
    output = Dense(vocab_size, activation='softmax')(combined)

    # Create and compile the model
    model = Model([image_input, caption_input], output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

### 11. Model Architecture: `build_model3` with Attention Layer

In this model, we introduce an **Attention Layer**. Instead of using only the final LSTM state, the model scores every time step of the LSTM output against the image features and builds a weighted summary of the caption prefix, which may help it generate more accurate and relevant captions.

#### Key Components:
1. **Image Input**:
   - The image features are passed as a vector of shape `(feature_size,)`, which represents the dimension of the image feature vector. This feature vector is passed through a Dense layer with 256 units and ReLU activation.

2. **Caption Input**:
   - The caption input is a sequence of word indices with a shape of `(max_length,)`, where `max_length` is the maximum length of the caption sequence.
   - An Embedding layer is used to convert word indices into dense vectors of size 256. These embeddings are then passed through an LSTM layer with 256 units. The `return_sequences=True` argument ensures that the LSTM outputs sequences for every time step, not just the final state.

3. **Attention Mechanism**:
   - The attention layer takes the image features (`img_feats`) and the per-step outputs of the LSTM layer (`lstm`). It computes one attention score per caption time step, conditioned on the image features.
   - The scores are computed additively: the image features and the LSTM outputs are each passed through a dense layer (`W1` and `W2`), summed, passed through `tanh`, and reduced to a scalar by a dense layer (`V`). A softmax over the time steps gives the attention weights, and the context vector is the sum of the LSTM outputs weighted by them.

4. **Output Layer**:
   - The context vector from the attention mechanism is passed through a Dense layer with softmax activation, which predicts the next word in the caption sequence.

#### Parameters:
- **feature_size**: The dimension of the image feature vector.
- **vocab_size**: The total number of unique words in the vocabulary (plus one for padding).
- **max_length**: The maximum length of the caption sequences.

#### Function Logic:
- The model takes both image features and caption sequences as input. The image features go through a dense layer, and the captions are processed through an embedding layer followed by an LSTM layer.
- The attention layer computes the attention weights and context vector, which are then passed through a softmax output layer to predict the next word in the sequence.
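
The computation inside the attention layer can be followed shape by shape in NumPy. The sketch below mirrors the operations in the layer's `call` method, under the assumptions that the dense layers are replaced by random weight matrices without biases and that the sizes are small, made-up values.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, steps, units = 2, 4, 8

features = rng.normal(size=(batch, units))       # image branch output
hidden = rng.normal(size=(batch, steps, units))  # LSTM output per time step

# Random stand-ins for the W1, W2, and V dense layers (biases omitted)
W1 = rng.normal(size=(units, units))
W2 = rng.normal(size=(units, units))
V = rng.normal(size=(units, 1))

# The image features are broadcast across the time axis before the tanh
score = np.tanh(features[:, None, :] @ W1 + hidden @ W2)  # (batch, steps, units)
weights = softmax(score @ V, axis=1)                      # (batch, steps, 1)
context = (weights * hidden).sum(axis=1)                  # (batch, units)

print(weights.shape, context.shape)
```

The weights sum to one across the time steps of each example, so the context vector is a weighted average of the LSTM outputs.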

In [None]:
from tensorflow.keras.layers import Layer, Dense
import tensorflow as tf

class AttentionLayer(Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        # Initialize the sub-layers
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)

    def build(self, input_shape):
        super().build(input_shape)  # Automatically build sub-layers

    def call(self, inputs):
        features, hidden = inputs
        features_expanded = tf.expand_dims(features, 1)
        hidden_dense = self.W2(hidden)
        features_dense = self.W1(features_expanded)
        score = tf.nn.tanh(features_dense + hidden_dense)
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        context_vector = attention_weights * hidden
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

    def compute_output_shape(self, input_shape):
        features_shape, hidden_shape = input_shape
        context_vector_shape = (features_shape[0], hidden_shape[2])
        attention_weights_shape = (features_shape[0], hidden_shape[1], 1)
        return [context_vector_shape, attention_weights_shape]

    def get_config(self):
        config = super().get_config()
        config.update({"units": self.units})
        return config

def build_model3(feature_size, vocab_size, max_length):
    # Image input branch (input shape is the feature size)
    image_input = Input(shape=(feature_size,))
    img_feats = Dense(256, activation='relu')(image_input)  # Dense layer for image features

    # Caption input branch (input shape is the maximum length of the sequence)
    caption_input = Input(shape=(max_length,))
    emb = Embedding(vocab_size, 256, mask_zero=True)(caption_input)  # Embedding layer for words
    lstm = LSTM(256, return_sequences=True)(emb)  # LSTM layer to process the sequence of words

    # Attention mechanism
    attention = AttentionLayer(256)
    context_vector, _ = attention([img_feats, lstm])

    # Output layer (predict the next word)
    output = Dense(vocab_size, activation='softmax')(context_vector)

    # Create and compile the model
    model = Model([image_input, caption_input], output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

# 12. Model Architecture: `build_model4` with Transformer Decoder

In this model, we use a Transformer-style decoder block to process both the image features and the text (captions). The Transformer architecture is commonly used for sequence-to-sequence tasks such as machine translation and image captioning.

## **Model Components:**

### **Image Input:**
The image features are passed as a vector of size `feature_size`. This vector is projected into a 256-dimensional space using a Dense layer.

### **Caption Input:**
The caption input consists of word indices of shape `(max_length,)`, where `max_length` is the length of the longest caption.

These word indices are then converted into dense vector representations using an Embedding layer.

### **RepeatFeatures Layer:**
This custom layer is used to repeat the image features to match the length of the caption sequence. This step is necessary to model the relationship between the image features and the words in the caption at each step.

### **Transformer Decoder:**
The function `transformer_decoder` processes both the caption and image features using the Transformer mechanism.

- The image features are projected into a 256-dimensional space, and then they are repeated to match the caption length.
- The caption is also projected into a 256-dimensional space.
- These projections are combined using the Add layer.
- The combined features are passed through the Multi-Head Attention layer, used here as self-attention over the combined sequence, so that every position can attend to every other position.
- After attention, a residual connection and LayerNormalization are applied to stabilize the learning.
- Finally, the output is passed through a Feed-Forward Network (FFN), followed by a second residual connection and LayerNormalization.
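
The steps listed above can be traced in NumPy on a single example. This is a simplified sketch under several assumptions: one attention head instead of two, random weights without biases, no output projection after attention, layer normalization without learned scale and offset, and small made-up sizes. It is not a reimplementation of the Keras layers.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
steps, dim = 5, 16
text = rng.normal(size=(steps, dim))  # projected caption embeddings
image = rng.normal(size=(dim,))       # projected image features

# Repeat the image vector along the time axis and add it to every position
image_repeated = np.repeat(image[None, :], steps, axis=0)  # (steps, dim)
combined = text + image_repeated

# Single-head scaled dot-product self-attention over the combined sequence
Wq, Wk, Wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
q, k, v = combined @ Wq, combined @ Wk, combined @ Wv
attn = softmax(q @ k.T / np.sqrt(dim)) @ v
x = layer_norm(attn + combined)  # residual connection + normalization

# Position-wise feed-forward network with a second residual connection
W1 = rng.normal(scale=dim ** -0.5, size=(dim, 2 * dim))
W2 = rng.normal(scale=(2 * dim) ** -0.5, size=(2 * dim, dim))
ffn = np.maximum(x @ W1, 0.0) @ W2
out = layer_norm(ffn + x)

print(image_repeated.shape, out.shape)
```

The output keeps the `(steps, dim)` shape of the input sequence, so a later step is still needed to reduce it to a single fixed-size vector.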

### **Global Average Pooling:**
After the Transformer decoding process, **GlobalAveragePooling1D** is applied to obtain a fixed-size vector representation from the resulting sequence.

### **Output Layer:**
The final output is passed through a Dense layer with a softmax activation function to predict the next word in the sequence.

## **Model Explanation:**

**Parameters:**
- `feature_size`: The size of the image feature vector.
- `vocab_size`: The total number of words in the vocabulary (including padding).
- `max_length`: The maximum allowed length for a caption sequence.
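As a rough sketch of how `vocab_size` and `max_length` relate to the caption data, the snippet below derives both from two toy captions. The captions, the `startseq`/`endseq` markers, and the `word_index` mapping are assumptions for illustration; the notebook computes the real values in its preprocessing step.

```python
captions = [
    "startseq a dog runs endseq",
    "startseq two children play in the park endseq",
]  # toy data (assumed)

# Assign indices starting at 1 so that 0 stays reserved for padding.
words = sorted({w for c in captions for w in c.split()})
word_index = {w: i + 1 for i, w in enumerate(words)}

vocab_size = len(word_index) + 1                      # +1 for the padding index
max_length = max(len(c.split()) for c in captions)    # longest caption in tokens

print(vocab_size, max_length)  # 12 8
```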


In [None]:
import tensorflow as tf
from tensorflow.keras import Model, layers
from tensorflow.keras.layers import Layer, Dense, Embedding, Input, MultiHeadAttention, GlobalAveragePooling1D

# Repeat the image features to match the sequence length
class RepeatFeatures(Layer):
    def __init__(self, max_length, **kwargs):
        super().__init__(**kwargs)
        self.max_length = max_length

    def call(self, inputs):
        features_expanded = tf.expand_dims(inputs, 1)  # (batch, features) -> (batch, 1, features)
        return tf.repeat(features_expanded, repeats=self.max_length, axis=1)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.max_length, input_shape[-1])

    def get_config(self):
        config = super().get_config()
        config.update({"max_length": self.max_length})
        return config

# Transformer Decoder function
def transformer_decoder(text_embeddings, image_features, max_length):
    img_proj = Dense(256)(image_features)  # Project image features to match text embeddings size
    img_repeated = RepeatFeatures(max_length)(img_proj)  # Repeat the image features to match caption sequence length
    text_proj = Dense(256)(text_embeddings)  # Project text embeddings to match image features size
    combined = layers.Add()([text_proj, img_repeated])  # Combine both projections

    # Multi-Head Attention
    attn_output = MultiHeadAttention(num_heads=2, key_dim=256)(combined, combined)
    x = layers.LayerNormalization(epsilon=1e-6)(attn_output + combined)  # Normalize the output

    # Feed-Forward Network (FFN)
    ffn = Dense(512, activation='relu')(x)
    ffn = Dense(256)(ffn)  # Project back to 256 dimensions so it can be added to x
    output = layers.LayerNormalization(epsilon=1e-6)(ffn + x)  # Normalize the final output
    return output

# Building the model
def build_model4(feature_size, vocab_size, max_length):
    image_input = Input(shape=(feature_size,))  # Image feature input
    img_feats = Dense(256)(image_input)  # Project image features to a 256-dimensional space

    caption_input = Input(shape=(max_length,))  # Caption input
    emb = Embedding(vocab_size, 256, mask_zero=True)(caption_input)  # Embed the words in the captions

    trans_out = transformer_decoder(emb, img_feats, max_length)  # Pass through the transformer decoder
    trans_out = GlobalAveragePooling1D()(trans_out)  # Global average pooling to get a fixed-size vector

    output = Dense(vocab_size, activation='softmax')(trans_out)  # Output layer to predict the next word
    model = Model([image_input, caption_input], output)  # Create the model

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # Compile the model
    return model


### Training `model1` and `model2`

In this section, we train two models with different architectures: **Model1** using a simple LSTM layer and **Model2** using a Bidirectional LSTM (BiLSTM) layer.

#### **Training `model1`:**

1. **Building the Model:**
   - `build_model1(feature_size, vocab_size, max_length)`: This function builds the model with a simple LSTM layer after embedding the captions.
   - `feature_size`: The dimension of the image features passed to the model.
   - `vocab_size`: The size of the vocabulary, which is the total number of unique words.
   - `max_length`: The maximum sequence length for the captions.

2. **Training the Model:**
   - `train_model(model1, "model1_lstm", train_gen, val_gen, model_weights_dir, epochs=15)`: This trains the model for 15 epochs. It uses the `train_gen` (training data generator) and `val_gen` (validation data generator) to feed the model and monitor performance on the validation set.
   - Model weights are saved with the name `"model1_lstm"`.

#### **Training `model2`:**

1. **Building the Model:**
   - `build_model2(feature_size, vocab_size, max_length)`: This function builds the model with a Bidirectional LSTM layer.
   - Similar to `model1`, this model also uses the `feature_size`, `vocab_size`, and `max_length` parameters.

2. **Training the Model:**
   - `train_model(model2, "model2_bilstm", train_gen, val_gen, model_weights_dir, epochs=15)`: This trains the Bidirectional LSTM model for 15 epochs and saves the weights with the name `"model2_bilstm"`.
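The cell outputs in this section show `train_model` loading existing weights instead of retraining when a weights file is found. A minimal sketch of that load-or-train decision is given below. The helper names and the `.weights.h5` file suffix are hypothetical; the actual `train_model` defined earlier in the notebook may name and locate files differently.

```python
import os
import tempfile

def weights_path_for(model_name, weights_dir, suffix=".weights.h5"):
    """Hypothetical naming scheme for a model's weights file."""
    return os.path.join(weights_dir, model_name + suffix)

def choose_action(model_name, weights_dir):
    """Return 'load' if a weights file already exists, otherwise 'train'."""
    path = weights_path_for(model_name, weights_dir)
    return "load" if os.path.exists(path) else "train"

with tempfile.TemporaryDirectory() as tmp:
    before = choose_action("model1_lstm", tmp)             # no file yet
    open(weights_path_for("model1_lstm", tmp), "w").close()  # simulate a saved file
    after = choose_action("model1_lstm", tmp)

print(before, after)  # train load
```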

### Key Points:
- **LSTM vs BiLSTM**: The key difference between `model1` and `model2` is the architecture. While `model1` uses a regular LSTM layer, `model2` utilizes a Bidirectional LSTM, which processes the input sequence in both forward and backward directions. This allows the model to capture context from both the past and future words in a sentence.
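The difference in available context can be illustrated without any neural network. The sketch below is purely conceptual (it is not an LSTM): for each token it lists what a forward pass has seen so far and what a backward pass has seen so far. The function name and the example sentence are made up.

```python
def directional_contexts(tokens):
    """For each position, list the tokens visible to a forward and a backward pass."""
    rows = []
    for i, tok in enumerate(tokens):
        rows.append({
            "token": tok,
            "forward": tokens[: i + 1],   # everything up to and including this token
            "backward": tokens[i:],       # this token and everything after it
        })
    return rows

rows = directional_contexts(["a", "dog", "runs", "fast"])
for row in rows:
    print(row["token"], row["forward"], row["backward"])
```

A Bidirectional LSTM combines the states of both passes at each position, so the representation of `dog` can reflect both `a` and `runs fast`.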

In [None]:
# Build the model using a simple LSTM layer
model1 = build_model1(feature_size, vocab_size, max_length)

# Train the model for 15 epochs
# `train_model` function will handle the training process and saving the model weights
train_model(model1, "model1_lstm", train_gen, val_gen, model_weights_dir, epochs=15)

feature_size: 1920

DEBUG: ----- Training/Loading model: model1_lstm -----
DEBUG: Found weights file! Loading the model weights...


<Functional name=functional, built=True>

In [None]:
# Build the model using a Bidirectional LSTM layer
model2 = build_model2(feature_size, vocab_size, max_length)

# Train the model for 15 epochs
# `train_model` function will handle the training process and saving the model weights
train_model(model2, "model2_bilstm", train_gen, val_gen, model_weights_dir, epochs=15)


DEBUG: ----- Training/Loading model: model2_bilstm -----
DEBUG: Found weights file! Loading the model weights...


<Functional name=functional_1, built=True>

### Training `model3` with Attention Mechanism

In this section, we train **Model3** using an Attention mechanism to enhance the caption generation process by focusing on relevant parts of the image and sequence.

#### **Training `model3`:**


In [None]:
# Build the model using the Attention mechanism
model3 = build_model3(feature_size, vocab_size, max_length)

# Train the model for 15 epochs
# `train_model` function will handle the training process and saving the model weights
train_model(model3, "model3_attention", train_gen, val_gen, model_weights_dir, epochs=15)


DEBUG: ----- Training/Loading model: model3_attention -----
Epoch 1/15
 117/1012 ━━━━━━━━━━━━━━━━━━━━ 5:22 360ms/step - accuracy: 0.0946 - loss: 7.2686

KeyboardInterrupt: 

### Training `model4` with Transformer Decoder

In this section, we train **Model4**, which uses a simplified Transformer Decoder to handle both the image features and the caption sequence. The Transformer mechanism is widely used in sequence-to-sequence tasks because attention lets it model long-range dependencies in the data.

#### **Training `model4`:**

In [None]:
# Build the model using the Transformer Decoder
model4 = build_model4(feature_size, vocab_size, max_length)

# Train the model for 15 epochs
# `train_model` function will handle the training process and saving the model weights
train_model(model4, "model4_transformer_simplified", train_gen, val_gen, model_weights_dir, epochs=15)


DEBUG: ----- Training/Loading model: model4_transformer_simplified -----
Epoch 1/15
  50/1012 ━━━━━━━━━━━━━━━━━━━━ 14:21 895ms/step - accuracy: 0.0891 - loss: 7.1057

KeyboardInterrupt: 

### Ngrok Setup and Streamlit App Launch

This section sets up Ngrok to expose your local Streamlit app to the internet and then runs the app in the background.

#### Step-by-Step Breakdown:

1. **`ngrok.kill()`**:
   - This command stops any currently running ngrok tunnels to ensure you start with a clean slate.

2. **`ngrok.set_auth_token()`**:
   - This sets your ngrok authentication token, which is required to authenticate your session with ngrok's servers. Make sure to keep your token secure.

3. **`public_url = ngrok.connect(8501)`**:
   - This opens a tunnel to the local port 8501 (Streamlit's default port) and stores the returned tunnel object, which includes the public URL for accessing your app. In the code, a 2-second pause precedes this call.

4. **`print` statements**:
   - These print out a message in the console with the public URL of the app that you can open in a web browser.

5. **`app_path = "/content/drive/MyDrive/app.py"`**:
   - This specifies the location of the Streamlit app file that you want to run (here, a file stored in the mounted Google Drive).

6. **`os.system(f"streamlit run {app_path} --server.port 8501 --server.headless true &")`**:
   - This launches the Streamlit app in the background on the specified port (8501). The `--server.headless true` flag ensures that the app runs without opening a browser window automatically.
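One way to keep the token from step 2 out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `NGROK_AUTH_TOKEN` and a helper called `get_ngrok_token`; both names are hypothetical, and the variable is set inline only so the example runs on its own.

```python
import os

def get_ngrok_token(env_var="NGROK_AUTH_TOKEN"):
    """Read the ngrok token from the environment and fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token

os.environ["NGROK_AUTH_TOKEN"] = "example-token"  # demo value only, not a real token
token = get_ngrok_token()

# The value would then be passed to the call described in step 2:
# ngrok.set_auth_token(token)
print(len(token) > 0)  # True
```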

---

### Things to Note:
- **ngrok**: This is used to expose a local server to the internet, which is useful when running on a remote environment like Google Colab.
- **Port 8501**: Streamlit by default runs on this port. If you use a different port, make sure to update both the `ngrok.connect()` call and the `streamlit run` command accordingly.
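To keep the port consistent in both places, it can help to define it once and build the Streamlit command from it. The sketch below only constructs and prints the command; `PORT`, `APP_PATH`, and `streamlit_command` are illustrative names, and nothing is launched.

```python
import shlex

PORT = 8501                                   # single source of truth for the port
APP_PATH = "/content/drive/MyDrive/app.py"    # path used in this notebook

def streamlit_command(app_path, port):
    """Build the argument list for launching Streamlit on the given port."""
    return [
        "streamlit", "run", app_path,
        "--server.port", str(port),
        "--server.headless", "true",
    ]

cmd = streamlit_command(APP_PATH, PORT)
print(shlex.join(cmd))

# The same PORT would be passed to ngrok.connect(PORT), and the list could be
# handed to subprocess.Popen(cmd) to start the app without going through a shell.
```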


In [None]:
from pyngrok import ngrok
import os
import time

# Kill any existing ngrok processes
ngrok.kill()

# Set ngrok authentication token (replace the placeholder with your own token; never share a real token)
ngrok.set_auth_token('YOUR_NGROK_AUTH_TOKEN')

# Wait briefly after killing old ngrok processes before opening a new tunnel
delay_seconds = 2
print(f"Waiting for {delay_seconds} seconds before starting a new tunnel...")
time.sleep(delay_seconds)  # Pause for delay_seconds

# Open a new tunnel on port 8501 (default for Streamlit)
public_url = ngrok.connect(8501)

# Print the public URL
print(f"✅ Streamlit app public URL (open in your browser): {public_url}")

# Define the path to your Streamlit app
app_path = "/content/drive/MyDrive/app.py"

# Run the Streamlit app in the background
os.system(f"streamlit run {app_path} --server.port 8501 --server.headless true &")


Waiting for 2 seconds before starting a new tunnel...
✅ Streamlit app public URL (open in your browser): NgrokTunnel: "https://d719-34-16-201-201.ngrok-free.app" -> "http://localhost:8501"


0