### Dataset Extraction

We begin by downloading and extracting the dataset from Google Drive. The dataset is provided as a ZIP file, which is unzipped to the `/content/dataset/` directory for further processing.
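
As a small, self-contained illustration of the extract-then-list pattern, the sketch below builds a throwaway archive in a temporary directory instead of downloading the real dataset. The helper name `extract_and_list` and the file `sample/train.csv` are made up for the example.

```python
import os
import tempfile
import zipfile

def extract_and_list(zip_path, extract_path):
    """Extract a ZIP archive and return the sorted top-level entries."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_path)
    return sorted(os.listdir(extract_path))

# Build a tiny archive so the example runs without network access.
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("sample/train.csv", "filename,caption\n")  # hypothetical content

top_level = extract_and_list(demo_zip, os.path.join(workdir, "dataset"))
print("Extracted Files:", top_level)
```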

---

In [None]:
!gdown --id 1FMVcFM78XZE1KE1rIkGBpCdcdI58S1LB -O /content/dataset.zip

import os
import zipfile

dataset_path = "/content/dataset.zip"
extract_path = "/content/dataset/"

with zipfile.ZipFile(dataset_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Extracted Files:", os.listdir(extract_path))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Downloading...
From (original): https://drive.google.com/uc?id=1FMVcFM78XZE1KE1rIkGBpCdcdI58S1LB
From (redirected): https://drive.google.com/uc?id=1FMVcFM78XZE1KE1rIkGBpCdcdI58S1LB&confirm=t&uuid=18999ec7-050e-4f26-90c6-58959965351f
To: /content/dataset.zip
100%|████████████████████████████████████████| 286M/286M [00:02<00:00, 96.3MB/s]
Extracted Files: ['custom_captions_dataset']


### Installing Dependencies and Setting Up Zero-Shot Captioning

This section installs the required libraries (`transformers`, `pillow`, `rouge-score`, `evaluate`) and sets up zero-shot image captioning with `SmolVLM`. It defines two helpers: `load_model_and_processor`, which loads a pretrained model and its processor, and `zero_shot_captioning`, which generates a caption for a single image without any fine-tuning. The model runs on the GPU when one is available, is loaded in `bfloat16`, and the CUDA cache is cleared after each caption to limit memory use.
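
The helpers in this section cache a single model in module-level globals, so once a model is loaded it is reused whatever `model_name` is passed later. If several checkpoints were needed in one session, one option would be a cache keyed by name. The sketch below illustrates that idea with a stand-in loader; `get_model`, `fake_loader`, and `_MODEL_CACHE` are hypothetical names, not part of this notebook or of `transformers`.

```python
_MODEL_CACHE = {}  # hypothetical cache: model name -> loaded object

def get_model(model_name, loader):
    """Return the cached object for model_name, calling loader only on a miss."""
    if model_name not in _MODEL_CACHE:
        _MODEL_CACHE[model_name] = loader(model_name)
    return _MODEL_CACHE[model_name]

load_calls = []

def fake_loader(name):
    # Stand-in for an expensive from_pretrained call.
    load_calls.append(name)
    return {"name": name}

first = get_model("SmolVLM", fake_loader)
second = get_model("SmolVLM", fake_loader)      # served from the cache
other = get_model("another-model", fake_loader)  # triggers a second load

print(load_calls)
```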

In [46]:
!pip install -q transformers pillow
!pip install rouge-score
!pip install evaluate

from transformers import AutoProcessor, AutoModelForVision2Seq
import torch
import nltk
import gc
from PIL import Image
from transformers.image_utils import load_image
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from evaluate import load

nltk.download('punkt')

GLOBAL_MODEL = None
GLOBAL_PROCESSOR = None
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

if torch.cuda.is_available():
    torch.cuda.empty_cache()

def load_model_and_processor(model_name):
    """Load the processor and model and cache them in the module-level globals."""
    global GLOBAL_MODEL, GLOBAL_PROCESSOR

    # Free any memory left over from a previously loaded model.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()

    # "SmolVLM" is shorthand for the Hub checkpoint; any other name is used as-is.
    checkpoint = "HuggingFaceTB/SmolVLM-Instruct" if model_name == "SmolVLM" else model_name

    GLOBAL_PROCESSOR = AutoProcessor.from_pretrained(checkpoint)
    GLOBAL_MODEL = AutoModelForVision2Seq.from_pretrained(
        checkpoint,
        torch_dtype=torch.bfloat16
    ).to(DEVICE)

    return GLOBAL_MODEL, GLOBAL_PROCESSOR

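def zero_shot_captioning(image_path: str, model_name="SmolVLM"):
    global GLOBAL_MODEL, GLOBAL_PROCESSOR

    # Load lazily on the first call; later calls reuse the cached model,
    # so `model_name` only takes effect when nothing is loaded yet.
    if GLOBAL_MODEL is None or GLOBAL_PROCESSOR is None:
        GLOBAL_MODEL, GLOBAL_PROCESSOR = load_model_and_processor(model_name)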

    image = load_image(image_path)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Can you describe the image?"}
            ]
        },
    ]

    prompt = GLOBAL_PROCESSOR.apply_chat_template(messages, add_generation_prompt=True)
    inputs = GLOBAL_PROCESSOR(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

    with torch.inference_mode():
        generated_ids = GLOBAL_MODEL.generate(
            **inputs,
            max_new_tokens=64
        )

    generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
    generated_texts = GLOBAL_PROCESSOR.batch_decode(
        generated_ids,
        skip_special_tokens=True,
    )

    del inputs
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return generated_texts[0]




[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Custom ImageCaptionModel: ViT Encoder + DistilGPT2 Decoder

This section defines a custom `ImageCaptionModel` combining a frozen ViT encoder (`vit-small-patch16-224`) with a trainable DistilGPT2 decoder. The encoder's CLS-token embedding is taken from the last hidden state and linearly projected to the decoder's embedding size. The decoder, with cross-attention enabled, generates captions conditioned on this single projected image embedding. The model supports both training (`forward`) and inference (`generate`), using beam search with 4 beams for decoding.
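
To make the tensor shapes concrete, the sketch below does the shape bookkeeping in plain Python. It assumes 224×224 inputs, a patch size of 16, an encoder hidden size of 384, and a decoder embedding size of 768; these values are assumptions about the two checkpoints and should be checked against `encoder.config.hidden_size` and `decoder.config.n_embd`. The helper name `caption_model_shapes` is illustrative.

```python
def caption_model_shapes(batch_size, image_size=224, patch_size=16,
                         encoder_dim=384, decoder_dim=768):
    """Trace tensor shapes from the ViT output to the decoder's cross-attention input."""
    num_patches = (image_size // patch_size) ** 2
    seq_len = num_patches + 1  # patch tokens plus the CLS token
    return {
        "last_hidden_state": (batch_size, seq_len, encoder_dim),
        "cls_embedding": (batch_size, encoder_dim),               # last_hidden_state[:, 0, :]
        "projected": (batch_size, decoder_dim),                   # after the linear projection
        "encoder_hidden_states": (batch_size, 1, decoder_dim),    # after unsqueeze(1)
    }

shapes = caption_model_shapes(batch_size=8)
for name, shape in shapes.items():
    print(f"{name:22s} {shape}")
```

Under these assumptions the decoder's cross-attention sees a sequence of length one per image, so it cannot attend to individual patches.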

In [None]:
import torch
import torch.nn as nn
from transformers import (
    AutoImageProcessor,
    AutoModel,
    GPT2Config,
    AutoTokenizer,
    AutoModelForCausalLM
)

class ImageCaptionModel(nn.Module):
    def __init__(self, encoder_name="WinKawaks/vit-small-patch16-224", decoder_name="distilgpt2"):
        super().__init__()

        # Encoder
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.feature_extractor = AutoImageProcessor.from_pretrained(encoder_name)

        for param in self.encoder.parameters():
            param.requires_grad = False

        # Decoder
        config = GPT2Config.from_pretrained(decoder_name)
        config.add_cross_attention = True
        self.decoder = AutoModelForCausalLM.from_pretrained(decoder_name, config=config)

        self.projection = nn.Linear(
            self.encoder.config.hidden_size,
            self.decoder.config.n_embd
        )

        self.tokenizer = AutoTokenizer.from_pretrained(decoder_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def forward(self, images, decoder_input_ids):
        encoder_outputs = self.encoder(pixel_values=images)
        image_embeddings = encoder_outputs.last_hidden_state[:, 0, :]  # CLS token
        projected_embeds = self.projection(image_embeddings).unsqueeze(1)

        outputs = self.decoder(
            input_ids=decoder_input_ids,
            encoder_hidden_states=projected_embeds
        )
        return outputs.logits

    def generate(self, images, max_length=64):
        encoder_outputs = self.encoder(pixel_values=images)
        image_embeddings = encoder_outputs.last_hidden_state[:, 0, :]
        projected_embeds = self.projection(image_embeddings).unsqueeze(1)

        batch_size = images.size(0)
        start_tokens = torch.full(
            (batch_size, 1),
            self.tokenizer.bos_token_id if self.tokenizer.bos_token_id is not None else self.tokenizer.eos_token_id,
            dtype=torch.long,
            device=images.device
        )

        outputs = self.decoder.generate(
            input_ids=start_tokens,
            encoder_hidden_states=projected_embeds,
            max_length=max_length,
            num_beams=4,
            early_stopping=True,
            pad_token_id=self.tokenizer.pad_token_id
        )

        return outputs


### Custom Image Captioning Model: Training Pipeline

This section prepares and trains a custom image captioning model using a ViT encoder and DistilGPT2 decoder. Image-caption pairs are loaded from a CSV file and processed using a feature extractor and tokenizer. A PyTorch `DataLoader` batches the processed inputs. The model is trained using teacher forcing, with a cross-entropy loss applied to non-padded tokens. The training loop runs for 5 epochs, with real-time progress tracking using `tqdm`. Finally, the trained model weights are saved to disk for later use.
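
The teacher-forcing shift used in the training loop can be illustrated on a single padded caption. The token ids below are made up, and `PAD_ID`, `IGNORE_INDEX`, and `shift_for_teacher_forcing` are illustrative names; `-100` is the default `ignore_index` of `nn.CrossEntropyLoss`, which is why masked positions do not contribute to the loss.

```python
PAD_ID = 50256       # assumed pad id (the notebook sets pad_token = eos_token)
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def shift_for_teacher_forcing(caption_ids, pad_id=PAD_ID):
    """Split a padded caption into decoder inputs and next-token labels."""
    decoder_input_ids = caption_ids[:-1]  # everything except the last token
    # Labels are the inputs shifted left by one; padding is excluded from the loss.
    labels = [IGNORE_INDEX if tok == pad_id else tok for tok in caption_ids[1:]]
    return decoder_input_ids, labels

caption = [64, 3797, 319, 257, 2603, PAD_ID, PAD_ID, PAD_ID]  # made-up token ids
inputs, labels = shift_for_teacher_forcing(caption)
print("inputs:", inputs)
print("labels:", labels)
```

Because the pad token and the end-of-sequence token share one id here, the label at the first padding position is ignored as well.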

In [48]:
from tqdm import tqdm
import pandas as pd
from torch.utils.data import DataLoader, TensorDataset

def prepare_dataloader(csv_path, image_folder, processor, tokenizer, max_length=64, batch_size=8):
    df = pd.read_csv(csv_path)

    image_tensors = []
    caption_tensors = []

    for idx, row in df.iterrows():
        image_path = os.path.join(image_folder, row["filename"])
        caption = row["caption"]

        image = Image.open(image_path).convert("RGB")
        image_tensor = processor(images=image, return_tensors="pt")["pixel_values"].squeeze(0)
        image_tensors.append(image_tensor)

        tokenized = tokenizer(
            caption,
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        ).input_ids.squeeze(0)
        caption_tensors.append(tokenized)

    images_tensor = torch.stack(image_tensors)
    captions_tensor = torch.stack(caption_tensors)

    dataset = TensorDataset(images_tensor, captions_tensor)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    return dataloader

def train_model(model, dataloader, optimizer, criterion, device, epochs):
    model.train()
    model.to(device)

    for epoch in range(epochs):
        epoch_loss = 0.0
        progress = tqdm(dataloader, desc=f"Epoch {epoch + 1}/{epochs}")

        for images, captions in progress:
            images = images.to(device)
            captions = captions.to(device)

            decoder_input_ids = captions[:, :-1]
            labels = captions[:, 1:].clone()
            labels[captions[:, 1:] == model.tokenizer.pad_token_id] = -100

            optimizer.zero_grad()
            outputs = model(images, decoder_input_ids)
            logits = outputs.view(-1, outputs.size(-1))
            labels = labels.view(-1)

            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            progress.set_postfix(loss=loss.item())

        print(f"Epoch [{epoch+1}/{epochs}] - Average Loss: {epoch_loss/len(dataloader):.4f}")

csv_path = "/content/dataset/custom_captions_dataset/train.csv"
img_dir = "/content/dataset/custom_captions_dataset/train"

GLOBAL_MODEL_Custom = ImageCaptionModel()
GLOBAL_PROCESSOR_Custom = GLOBAL_MODEL_Custom.feature_extractor
tokenizer = GLOBAL_MODEL_Custom.tokenizer

train_loader = prepare_dataloader(
    csv_path=csv_path,
    image_folder=img_dir,
    processor=GLOBAL_PROCESSOR_Custom,
    tokenizer=tokenizer,
    max_length=64,
    batch_size=8
)


optimizer = torch.optim.Adam(GLOBAL_MODEL_Custom.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()

train_model(GLOBAL_MODEL_Custom, train_loader, optimizer, criterion, DEVICE, epochs=5)

save_path = "/content/smolvlm_custom_caption_model.pth"
torch.save(GLOBAL_MODEL_Custom.state_dict(), save_path)

Some weights of ViTModel were not initialized from the model checkpoint at WinKawaks/vit-small-patch16-224 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight'

Epoch [1/5] - Average Loss: 2.7940


Epoch 2/5: 100%|██████████| 715/715 [01:45<00:00,  6.77it/s, loss=2.77]


Epoch [2/5] - Average Loss: 2.4627


Epoch 3/5: 100%|██████████| 715/715 [01:47<00:00,  6.62it/s, loss=1.95]


Epoch [3/5] - Average Loss: 2.2908


Epoch 4/5: 100%|██████████| 715/715 [01:48<00:00,  6.56it/s, loss=1.63]


Epoch [4/5] - Average Loss: 2.1575


Epoch 5/5: 100%|██████████| 715/715 [01:49<00:00,  6.55it/s, loss=2.19]


Epoch [5/5] - Average Loss: 2.0348


### Evaluation of Image Captioning Model

The evaluation pipeline assesses the quality of generated captions against ground truth using multiple NLP metrics:

- **BLEU (Bilingual Evaluation Understudy):**  
  Computes corpus-level n-gram overlap using `nltk.translate.bleu_score`, with smoothing so that an n-gram order with no matches does not force the score to zero.

- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):**  
  Uses `rouge_scorer` from `rouge_score` to calculate ROUGE-1, ROUGE-2, and ROUGE-L F-measures, averaged over all predictions.

- **METEOR (Metric for Evaluation of Translation with Explicit ORdering):**  
  Loaded via Hugging Face’s `evaluate` library, measuring alignment, synonym matching, and ordering.

The `generate_captions` function handles inference by processing test images, generating captions using the model's `generate()` method, and decoding the outputs. It returns both predicted and ground truth captions for metric computation.
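
To make the overlap-based scores concrete, here is a simplified unigram F-measure in the spirit of ROUGE-1. It is a sketch only: it lowercases and splits on whitespace, with none of the stemming or tokenization that `rouge_score` applies, so its numbers will not match the library exactly. The function name `unigram_f1` and the example sentences are made up.

```python
from collections import Counter

def unigram_f1(reference, hypothesis):
    """Harmonic mean of unigram precision and recall (whitespace tokens)."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("a dog runs on the beach", "a dog is on the sand")
print(round(score, 3))
```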


In [50]:
def evaluate_captions(generated_captions, ground_truth_captions):
  references = [[caption.split()] for caption in ground_truth_captions]
  hypotheses = [caption.split() for caption in generated_captions]

  smoothie = SmoothingFunction().method1
  bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smoothie)

  scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
  rouge_scores = {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}

  for ref, hyp in zip(ground_truth_captions, generated_captions):
    scores = scorer.score(ref, hyp)
    rouge_scores['rouge1'] += scores['rouge1'].fmeasure
    rouge_scores['rouge2'] += scores['rouge2'].fmeasure
    rouge_scores['rougeL'] += scores['rougeL'].fmeasure

  for key in rouge_scores:
    rouge_scores[key] /= len(ground_truth_captions)

  # Load and compute METEOR once, outside the averaging loop.
  meteor = load("meteor")
  meteor_score = meteor.compute(
    predictions=generated_captions,
    references=ground_truth_captions
  )

  return {
      "bleu": bleu_score,
      "rouge-1": rouge_scores['rouge1'],
      "rouge-2": rouge_scores['rouge2'],
      "rouge-l": rouge_scores['rougeL'],
      "meteor": meteor_score['meteor']
  }

def generate_captions(model, processor, tokenizer, csv_path, image_folder, device):
    df = pd.read_csv(csv_path)

    generated_captions = []
    ground_truth_captions = []

    model.eval()
    model.to(device)

    with torch.no_grad():
        for idx, row in tqdm(df.iterrows(), total=len(df)):
            image_path = os.path.join(image_folder, row["filename"])
            gt_caption = row["caption"]

            image = Image.open(image_path).convert("RGB")
            inputs = processor(images=image, return_tensors="pt").to(device)

            output_ids = model.generate(inputs["pixel_values"], max_length=64)
            generated_caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)

            generated_captions.append(generated_caption)
            ground_truth_captions.append(gt_caption)

    return generated_captions, ground_truth_captions


In [51]:
test_csv_path = "/content/dataset/custom_captions_dataset/test.csv"
test_img_dir = "/content/dataset/custom_captions_dataset/test"

generated_captions, ground_truth_captions = generate_captions(
    model= GLOBAL_MODEL_Custom,
    processor= GLOBAL_PROCESSOR_Custom,
    tokenizer=tokenizer,
    csv_path=test_csv_path,
    image_folder=test_img_dir,
    device=DEVICE
)

results = evaluate_captions(generated_captions, ground_truth_captions)
print("Evaluation Results:\n", results)

100%|██████████| 928/928 [08:46<00:00,  1.76it/s]
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

Evaluation Results:
 {'bleu': 0.04652119564607364, 'rouge-1': 0.36049009736668486, 'rouge-2': 0.10296845457090224, 'rouge-l': 0.27386845266198157, 'meteor': 0.24408502391322467}


### Zero-Shot SmolVLM Evaluation

This section evaluates the zero-shot SmolVLM model on the same test set.

- Captions are generated one image at a time using the `zero_shot_captioning` function.
- Evaluation is performed using BLEU, ROUGE (1, 2, L), and METEOR metrics via `evaluate_captions`.
- Results are printed for comparison with the custom model.

In [52]:
test_csv_path = "/content/dataset/custom_captions_dataset/test.csv"
test_img_dir = "/content/dataset/custom_captions_dataset/test"

df = pd.read_csv(test_csv_path)
generated_smolvlm = []
ground_truth_smolvlm = []

for idx, row in tqdm(df.iterrows(), total=len(df)):
    image_path = os.path.join(test_img_dir, row["filename"])
    gt_caption = row["caption"]

    gen_caption = zero_shot_captioning(image_path, model_name="HuggingFaceTB/SmolVLM-Instruct")

    generated_smolvlm.append(gen_caption)
    ground_truth_smolvlm.append(gt_caption)

metrics_smolvlm = evaluate_captions(generated_smolvlm, ground_truth_smolvlm)
print("SmolVLM Evaluation:", metrics_smolvlm)


100%|██████████| 928/928 [2:14:05<00:00,  8.67s/it]  
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

SmolVLM Evaluation: {'bleu': 0.04042871294140659, 'rouge-1': 0.3883891408616449, 'rouge-2': 0.10463825938056362, 'rouge-l': 0.2512128810068725, 'meteor': 0.24336596110007008}


### Occlusion Augmentation

This block applies patch-based occlusion to test images:

- Each image is resized to 224×224 and divided into 16×16-pixel patches (a 14×14 grid).
- A specified percentage of the patches, chosen at random, is masked with black boxes.
- Occluded images are saved to the output directory for robustness evaluation.
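
The patch arithmetic can be checked without loading any images. The sketch below assumes the 224×224 resize and 16-pixel patches used in this notebook and reports how many patches are masked at each occlusion level. `patches_to_mask` is an illustrative helper, and the fixed seed is added here only to make the sample reproducible.

```python
import random

def patches_to_mask(height, width, mask_percentage, patch_size=16, seed=0):
    """Return the total patch count and a random sample of patch coordinates to mask."""
    rows = height // patch_size
    cols = width // patch_size
    patches = [(i, j) for i in range(rows) for j in range(cols)]
    num_to_mask = int((mask_percentage / 100) * len(patches))  # truncates toward zero
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return len(patches), rng.sample(patches, num_to_mask)

counts = {}
for level in (10, 50, 80):
    total, selected = patches_to_mask(224, 224, level)
    counts[level] = len(selected)
    print(f"{level}%: {len(selected)} of {total} patches masked")
```

Because `int()` truncates, the masked share can fall slightly below the nominal percentage.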

In [None]:
import numpy as np
import random
import os
import cv2
from tqdm import tqdm

def occlude_image(image: np.ndarray, mask_percentage: int) -> np.ndarray:
    patch_size = 16
    height, width, _ = image.shape

    height = (height // patch_size) * patch_size
    width = (width // patch_size) * patch_size
    image = image[:height, :width, :]

    num_patches_y = height // patch_size
    num_patches_x = width // patch_size

    patches = [(i, j) for i in range(num_patches_y) for j in range(num_patches_x)]
    total_patches = len(patches)
    num_to_mask = int((mask_percentage / 100) * total_patches)

    selected_patches = random.sample(patches, num_to_mask)

    for i, j in selected_patches:
        y_start = i * patch_size
        x_start = j * patch_size
        image[y_start:y_start+patch_size, x_start:x_start+patch_size, :] = 0  # Black patch

    return image

def apply_occlusion_to_dataset(test_img_dir, output_dir, mask_percentage=50):
    os.makedirs(output_dir, exist_ok=True)
    image_files = [f for f in os.listdir(test_img_dir) if f.lower().endswith('.jpg')]

    for img_name in tqdm(image_files, desc=f"Applying {mask_percentage}% occlusion"):
        img_path = os.path.join(test_img_dir, img_name)
        image = cv2.imread(img_path)
        if image is None:
            print(f"Skipping unreadable image: {img_name}")
            continue

        image = cv2.resize(image, (224, 224))  
        occluded = occlude_image(image.copy(), mask_percentage)
        cv2.imwrite(os.path.join(output_dir, img_name), occluded)


### Generate Occluded Test Sets

This block creates test sets with varying occlusion levels (10%, 50%, 80%) by masking image patches and saving them into corresponding folders for further evaluation.



In [None]:
test_img_dir = "/content/dataset/custom_captions_dataset/test"
occlusion_levels = [10, 50, 80]

for level in occlusion_levels:
    output_dir = f"/content/dataset/custom_captions_dataset/test_{level}_per"
    apply_occlusion_to_dataset(test_img_dir, output_dir, mask_percentage=level)


Applying 10% occlusion: 100%|██████████| 928/928 [00:01<00:00, 563.07it/s]
Applying 50% occlusion: 100%|██████████| 928/928 [00:01<00:00, 558.40it/s]
Applying 80% occlusion: 100%|██████████| 928/928 [00:01<00:00, 561.09it/s]


### Evaluation on Occluded Images

This function evaluates the image captioning model on images with varying levels of occlusion, comparing the performance to a baseline:

- **Occlusion Levels:**  
  The function processes images with different occlusion percentages, specified in the `occlusion_levels` list.

- **Caption Generation:**  
  For each occlusion level, the `generate_captions` function generates captions for the occluded images using the model, processor, and tokenizer. Captions are stored in a dictionary.

- **Metrics Calculation:**  
  The performance of the generated captions is assessed using multiple metrics, including BLEU, ROUGE, and METEOR. The differences between the occluded and baseline metrics are calculated to measure the impact of occlusion on the model's performance.

- **Results:**  
  The function returns a dictionary of score deltas (metric differences) and absolute metrics for each occlusion level, along with the generated captions for further analysis.
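
The delta computation itself is a per-metric subtraction of the baseline score from the occluded score. A minimal sketch with placeholder scores (the numbers are made up and are not results from this notebook; `metric_deltas` is an illustrative name):

```python
def metric_deltas(occluded_metrics, baseline_metrics, ndigits=4):
    """Return occluded minus baseline for every metric, rounded to ndigits."""
    return {
        metric: round(occluded_metrics[metric] - baseline_metrics[metric], ndigits)
        for metric in occluded_metrics
    }

baseline = {"bleu": 0.050, "meteor": 0.250}  # placeholder values
occluded = {"bleu": 0.040, "meteor": 0.255}  # placeholder values

deltas = metric_deltas(occluded, baseline)
print(deltas)
```

A negative delta means the metric dropped under occlusion, and a positive one means it rose.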

In [None]:
def evaluate_on_occluded_images(
    model,
    processor,
    tokenizer,
    test_csv_path,
    base_image_folder,
    device,
    occlusion_levels,
    baseline_metrics,  
    ground_truths       
):
    score_deltas = {}

    occluded_captions_dict = {}

    for occ in occlusion_levels:
        occluded_image_folder = os.path.join(base_image_folder, f"test_{occ}_per")
        print(f"Evaluating on {occ}% occluded images...")

        occluded_captions, _ = generate_captions(
            model=model,
            processor=processor,
            tokenizer=tokenizer,
            csv_path=test_csv_path,
            image_folder=occluded_image_folder,
            device=device
        )
        variable_name = f"occluded_captions_{occ}_model_B"
        occluded_captions_dict[variable_name] = occluded_captions
        occluded_metrics = evaluate_captions(occluded_captions, ground_truths)

        delta_metrics = {
            metric: round(occluded_metrics[metric] - baseline_metrics[metric], 4)
            for metric in occluded_metrics
        }

        score_deltas[occ] = {
            "delta": delta_metrics,
            "absolute": occluded_metrics
        }

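    # Also expose each caption list as a notebook-level variable,
    # e.g. `occluded_captions_10_model_B`.
    for var_name, captions in occluded_captions_dict.items():
        globals()[var_name] = captions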

    return score_deltas, occluded_captions_dict


### Running the Occlusion Evaluation for the Custom Model

This code snippet calls the `evaluate_on_occluded_images` function to assess the performance of a custom image captioning model on images with varying occlusion levels (10%, 50%, 80%). It uses the provided model, processor, and tokenizer to generate captions for the occluded images and compares the results with baseline metrics (`results`). The ground truth captions are used for evaluation. The function returns the `score_deltas`, which contain the metric differences between occluded and baseline images for each occlusion level.

In [None]:
score_deltas,_ = evaluate_on_occluded_images(
    model=GLOBAL_MODEL_Custom,
    processor=GLOBAL_PROCESSOR_Custom,
    tokenizer=tokenizer,
    test_csv_path=test_csv_path,
    base_image_folder="/content/dataset/custom_captions_dataset",
    device=DEVICE,
    occlusion_levels=[10, 50, 80],
    baseline_metrics=results,  
    ground_truths=ground_truth_captions
)

Evaluating on 10% occluded images...


100%|██████████| 928/928 [08:40<00:00,  1.78it/s]
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

Evaluating on 50% occluded images...


100%|██████████| 928/928 [08:39<00:00,  1.79it/s]
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

Evaluating on 80% occluded images...


100%|██████████| 928/928 [08:39<00:00,  1.79it/s]
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

### Printing Evaluation Results for Custom Model

This code prints the evaluation results for the custom image captioning model as deltas (occluded score minus baseline score) for each occlusion level, so a negative value means performance dropped under occlusion. The output covers the following metrics:

- **BLEU:** Measures n-gram precision against the reference caption.
- **ROUGE-1:** Measures unigram (single-word) overlap, reported as an F-measure.
- **ROUGE-2:** Measures bigram overlap, reported as an F-measure.
- **ROUGE-L:** Measures overlap based on the longest common subsequence, reported as an F-measure.
- **METEOR:** Evaluates alignment, synonym matching, and word order.

The results are displayed for each occlusion level (10%, 50%, 80%), showing how the model's performance changes as the occlusion increases.

In [58]:
print("Results for custom model - ")
for level, data in score_deltas.items():
    delta = data["delta"]
    print(f"\n--- Occlusion Level: {level}% ---")
    print(f"Δ BLEU     : {delta['bleu']:+.4f}")
    print(f"Δ ROUGE-1  : {delta['rouge-1']:+.4f}")
    print(f"Δ ROUGE-2   : {delta['rouge-2']:+.4f}")
    print(f"Δ ROUGE-L   : {delta['rouge-l']:+.4f}")
    print(f"Δ METEOR   : {delta['meteor']:+.4f}")

Results for custom model - 

--- Occlusion Level: 10% ---
Δ BLEU     : -0.0037
Δ ROUGE-1  : -0.0095
Δ ROUGE-2   : -0.0064
Δ ROUGE-L   : -0.0023
Δ METEOR   : -0.0080

--- Occlusion Level: 50% ---
Δ BLEU     : -0.0065
Δ ROUGE-1  : -0.0176
Δ ROUGE-2   : -0.0134
Δ ROUGE-L   : -0.0058
Δ METEOR   : -0.0104

--- Occlusion Level: 80% ---
Δ BLEU     : -0.0141
Δ ROUGE-1  : -0.0339
Δ ROUGE-2   : -0.0285
Δ ROUGE-L   : -0.0213
Δ METEOR   : -0.0243
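
Absolute deltas are hard to compare across metrics with different scales, so it can help to express them relative to the baseline. The sketch below does this for the 80% occlusion level, copying the custom model's baseline scores and the deltas printed above; `relative_change` is an illustrative helper.

```python
# Baseline scores of the custom model on the clean test set (copied from the output above).
baseline = {
    "bleu": 0.04652119564607364,
    "rouge-1": 0.36049009736668486,
    "rouge-2": 0.10296845457090224,
    "rouge-l": 0.27386845266198157,
    "meteor": 0.24408502391322467,
}
# Deltas printed for the 80% occlusion level.
delta_80 = {
    "bleu": -0.0141,
    "rouge-1": -0.0339,
    "rouge-2": -0.0285,
    "rouge-l": -0.0213,
    "meteor": -0.0243,
}

def relative_change(deltas, baseline_scores):
    """Express each delta as a percentage of the corresponding baseline score."""
    return {m: 100.0 * deltas[m] / baseline_scores[m] for m in deltas}

rel = relative_change(delta_80, baseline)
for metric, pct in rel.items():
    print(f"{metric:8s}: {pct:+.1f}%")
```

On these numbers, BLEU and ROUGE-2 fall by roughly 30% and 28% relative to baseline at 80% occlusion, while ROUGE-1, ROUGE-L, and METEOR fall by about 8 to 10%.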


### Evaluating on Occluded Images with SmolVLM

This function evaluates the zero-shot image captioning model (SmolVLM) on images with different occlusion levels. For each level it generates a caption for every occluded image using the `zero_shot_captioning` function and scores the captions against the ground truth with BLEU, ROUGE, and METEOR. These scores are then compared to the baseline metrics to compute deltas for each occlusion level. The function returns the score deltas and the generated captions for further analysis.

In [None]:
def evaluate_on_occluded_images_smolvlm(
    test_csv_path,
    base_image_folder,
    device,
    occlusion_levels,
    baseline_metrics,
    ground_truths
):
    # NOTE: `device` is kept in the signature for the existing call site,
    # but it is not used inside this function.
    score_deltas = {}
    occluded_captions_dict = {}

    # The test CSV is the same for every occlusion level, so read it once.
    df = pd.read_csv(test_csv_path)

    for occ in occlusion_levels:
        # Occluded images live in sibling folders such as "test_10_per".
        occluded_image_folder = os.path.join(base_image_folder, f"test_{occ}_per")
        print(f"Evaluating on {occ}% occluded images...")

        occluded_captions = []
        for _, row in tqdm(df.iterrows(), total=len(df)):
            image_path = os.path.join(occluded_image_folder, row["filename"])
            gen_caption = zero_shot_captioning(image_path)
            occluded_captions.append(gen_caption)

        variable_name = f"occluded_captions_{occ}_model_A"
        occluded_captions_dict[variable_name] = occluded_captions

        # Also publish the captions as a notebook global, because
        # create_caption_classifier_dataset looks them up by this name.
        globals()[variable_name] = occluded_captions

        occluded_metrics = evaluate_captions(occluded_captions, ground_truths)

        # Delta = occluded score minus baseline score (negative means a drop).
        delta_metrics = {
            metric: round(occluded_metrics[metric] - baseline_metrics[metric], 4)
            for metric in occluded_metrics
        }

        score_deltas[occ] = {
            "delta": delta_metrics,
            "absolute": occluded_metrics
        }

    return score_deltas, occluded_captions_dict

### Running the Occlusion Evaluation for SmolVLM

This cell calls `evaluate_on_occluded_images_smolvlm` to evaluate SmolVLM on images at three occlusion levels (10%, 50%, 80%). Captions are generated for the occluded images with the zero-shot captioning method, scored against the ground-truth captions (`ground_truth_smolvlm`), and compared with the baseline metrics (`metrics_smolvlm`). The result, `score_deltas_smolvlm`, holds the metric deltas and absolute scores for each occlusion level. The second return value (the caption dictionary) is discarded here; the captions remain available through the notebook globals set inside the function.

In [None]:
score_deltas_smolvlm, _ = evaluate_on_occluded_images_smolvlm(
    test_csv_path=test_csv_path,
    base_image_folder="/content/dataset/custom_captions_dataset",
    device=DEVICE,
    occlusion_levels=[10, 50, 80],
    baseline_metrics=metrics_smolvlm,
    ground_truths=ground_truth_smolvlm
)


Evaluating on 10% occluded images...


100%|██████████| 928/928 [2:45:07<00:00, 10.68s/it]  
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

Evaluating on 50% occluded images...


100%|██████████| 928/928 [2:44:34<00:00, 10.64s/it]  

Evaluating on 80% occluded images...


100%|██████████| 928/928 [2:44:30<00:00, 10.64s/it]  

### Printing Evaluation Results for SmolVLM Model

This code snippet prints the evaluation results for the SmolVLM model, showing the metric differences (deltas) between the baseline and occluded image performance for each occlusion level. The output includes the changes in the following metrics:

- **BLEU:** Measures n-gram overlap.
- **ROUGE-1:** Measures unigram (single-word) overlap between the generated and reference captions.
- **ROUGE-2:** Measures bigram (two-word sequence) overlap.
- **ROUGE-L:** Measures overlap based on the longest common subsequence.
- **METEOR:** Evaluates alignment, synonym matching, and word order.

The results are displayed for each occlusion level (10%, 50%, 80%), highlighting how the model's performance is affected by different levels of image occlusion.
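To make the unigram-overlap idea behind ROUGE-1 concrete, here is a simplified sketch. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not necessarily match the `rouge_score` package; `unigram_overlap_scores` is a hypothetical helper name and the two captions are invented for illustration.

```python
from collections import Counter


def unigram_overlap_scores(reference, candidate):
    """Simplified ROUGE-1-style precision, recall, and F1 for one caption pair."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented captions: 4 of the 6 words match in each direction.
p, r, f = unigram_overlap_scores(
    "a dog runs on the beach",
    "a dog is on the sand",
)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

ROUGE-2 follows the same pattern with bigrams in place of single words.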

In [61]:
print(f"Results for SmolVLM model - ")
for level, data in score_deltas_smolvlm.items():
    delta = data["delta"]
    print(f"\n--- Occlusion Level: {level}% ---")
    print(f"Δ BLEU     : {delta['bleu']:+.4f}")
    print(f"Δ ROUGE-1  : {delta['rouge-1']:+.4f}")
    print(f"Δ ROUGE-2   : {delta['rouge-2']:+.4f}")
    print(f"Δ ROUGE-L   : {delta['rouge-l']:+.4f}")
    print(f"Δ METEOR   : {delta['meteor']:+.4f}")

Results for SmolVLM model - 

--- Occlusion Level: 10% ---
Δ BLEU     : +0.0000
Δ ROUGE-1  : +0.0033
Δ ROUGE-2   : +0.0012
Δ ROUGE-L   : +0.0017
Δ METEOR   : +0.0003

--- Occlusion Level: 50% ---
Δ BLEU     : -0.0042
Δ ROUGE-1  : -0.0131
Δ ROUGE-2   : -0.0088
Δ ROUGE-L   : -0.0048
Δ METEOR   : -0.0115

--- Occlusion Level: 80% ---
Δ BLEU     : -0.0255
Δ ROUGE-1  : -0.0871
Δ ROUGE-2   : -0.0515
Δ ROUGE-L   : -0.0463
Δ METEOR   : -0.0601


### Creating Caption Classifier Dataset

This function, `create_caption_classifier_dataset`, builds a dataset for training a classifier that distinguishes captions generated by two models (Model A and Model B) at varying levels of image occlusion. For each occlusion level it reads both models' generated captions from the notebook globals `occluded_captions_{level}_model_A` and `occluded_captions_{level}_model_B` and pairs them with the ground-truth captions. Each row holds the original caption, the generated caption, the perturbation percentage (occlusion level), and the model label (0 for Model A, 1 for Model B), plus a `text` column that joins the first three fields with ` [SEP] `. The dataset is saved as a CSV file (`caption_classifier_dataset.csv`), and `FileLink` displays a download link for it.
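The row layout can be illustrated without relying on notebook globals. The sketch below is an assumption-laden alternative, not the notebook's implementation: `build_classifier_rows` and `rows_to_csv` are hypothetical names, the captions are passed in as an explicit dictionary, and the toy captions are invented.

```python
import csv
import io


def build_classifier_rows(ground_truths, captions_by_level):
    """Pair each ground-truth caption with both models' captions per level."""
    rows = []
    for level, captions_by_label in captions_by_level.items():
        for label in (0, 1):  # 0 = Model A, 1 = Model B
            for original, generated in zip(ground_truths, captions_by_label[label]):
                rows.append({
                    "original_caption": original,
                    "generated_caption": generated,
                    "perturbation_percentage": level,
                    "model_label": label,
                    "text": f"{original} [SEP] {generated} [SEP] {level}",
                })
    return rows


def rows_to_csv(rows):
    """Serialize rows with every field quoted (csv.QUOTE_ALL)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented toy captions, used only to show the row layout.
ground_truths = ["a red bus", "two dogs"]
captions_by_level = {
    10: {0: ["a bus on a street", "dogs playing"], 1: ["red bus", "two dogs run"]},
    50: {0: ["a vehicle", "animals"], 1: ["bus", "dogs"]},
}

rows = build_classifier_rows(ground_truths, captions_by_level)
csv_text = rows_to_csv(rows)
print(len(rows), "rows")
print(rows[0]["text"])
```

Passing the captions explicitly keeps the function independent of cell execution order, which is the main drawback of looking names up in `globals()`.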

In [None]:

import csv

import pandas as pd
from IPython.display import FileLink


def create_caption_classifier_dataset(ground_truths, occlusion_levels, output_csv_path):
    original_captions = []
    generated_captions = []
    perturbation_percentages = []
    model_labels = []

    for occ in occlusion_levels:
        print(f"Processing occlusion level: {occ}%")

        # The caption lists are expected to exist as notebook globals,
        # one list per model and occlusion level.
        model_a_captions = globals()[f"occluded_captions_{occ}_model_A"]
        model_b_captions = globals()[f"occluded_captions_{occ}_model_B"]

        # Label 0 = Model A (SmolVLM), label 1 = Model B (Custom).
        # All Model A rows for a level are added before the Model B rows.
        for label, captions in ((0, model_a_captions), (1, model_b_captions)):
            for i in range(len(ground_truths)):
                original_captions.append(ground_truths[i])
                generated_captions.append(captions[i])
                perturbation_percentages.append(occ)
                model_labels.append(label)

    df = pd.DataFrame({
        'original_caption': original_captions,
        'generated_caption': generated_captions,
        'perturbation_percentage': perturbation_percentages,
        'model_label': model_labels
    })

    # Single input string for the classifier: original, generated, occlusion level.
    df['text'] = (
        df['original_caption'] + " [SEP] "
        + df['generated_caption'] + " [SEP] "
        + df['perturbation_percentage'].astype(str)
    )

    # Quote every field so captions containing commas or quotes stay intact.
    df.to_csv(output_csv_path, index=False, quoting=csv.QUOTE_ALL)
    print(f"Dataset saved to {output_csv_path}")

    return df


create_caption_classifier_dataset(
    ground_truths=ground_truth_captions,
    occlusion_levels=[10, 50, 80],
    output_csv_path="caption_classifier_dataset.csv"
)

FileLink('caption_classifier_dataset.csv')

Processing occlusion level: 10%
Processing occlusion level: 50%
Processing occlusion level: 80%
Dataset saved to caption_classifier_dataset.csv


### Reading and Printing CSV Dataset

The `read_and_print_csv` helper uses `pandas` to read a CSV file and print it in full. It sets the display options so that no rows, columns, or cell contents are truncated or wrapped, loads the file into a DataFrame, and prints it. The cell is commented out, so it produces no output as shown; uncomment it to inspect `caption_classifier_dataset.csv`.

In [8]:
# import pandas as pd
# import csv

# def read_and_print_csv(filename):
#     pd.set_option('display.max_rows', None)  # show all rows
#     pd.set_option('display.max_columns', None)  # show all columns (if needed)
#     pd.set_option('display.width', None)  # no line wrapping
#     pd.set_option('display.max_colwidth', None)  # show full content of each cell

#     df = pd.read_csv(filename)
#     print(df)

# read_and_print_csv("caption_classifier_dataset.csv")


### Loading Caption Dataset from Google Drive

The `load_caption_dataset_from_gdrive` function downloads the dataset from Google Drive using a shared link. It extracts the file ID from the URL (the code assumes the `.../file/d/<id>/view` link form), downloads the file with `gdown`, and loads it into a Pandas DataFrame. It then prints a summary of the dataset: the total number of rows, the number of unique original captions, the number of samples for each model (Model A and Model B), and the perturbation levels present. The file is read with `quoting=csv.QUOTE_ALL`, matching the way it was written.
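The string-splitting approach to the file ID fails with an `IndexError` when the link is not in the `/file/d/<id>/view` form. A slightly more defensive sketch using regular expressions is shown below; `extract_gdrive_file_id` is a hypothetical helper, and the two patterns cover only the link shapes that appear in this notebook (`/d/<id>` and `?id=<id>`).

```python
import re

# Link shapes seen in this notebook: ".../file/d/<id>/view" and ".../uc?id=<id>".
_FILE_ID_PATTERNS = (
    re.compile(r"/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)


def extract_gdrive_file_id(url):
    """Return the Google Drive file ID embedded in `url`, or raise ValueError."""
    for pattern in _FILE_ID_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    raise ValueError(f"No Google Drive file ID found in: {url}")


share_url = "https://drive.google.com/file/d/1GoQ9AFVaWkaMxxB8iPh8K-MGePhkmbJI/view?usp=sharing"
direct_url = "https://drive.google.com/uc?id=1GoQ9AFVaWkaMxxB8iPh8K-MGePhkmbJI"

print(extract_gdrive_file_id(share_url))
print(extract_gdrive_file_id(direct_url))
```

Raising a `ValueError` with the offending URL makes a malformed link easier to diagnose than a bare `IndexError`.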

In [9]:
import csv
import pandas as pd
try:
    import gdown
except ImportError:
    !pip install gdown
    import gdown

try:
    import transformers
except ImportError:
    !pip install transformers
    import transformers
def load_caption_dataset_from_gdrive(gdrive_url):
    file_id = gdrive_url.split('/d/')[1].split('/view')[0]
    output_path = 'caption_classifier_dataset.csv'
    gdown.download(f'https://drive.google.com/uc?id={file_id}', output_path, quiet=False)

    print(f"File downloaded to {output_path}")
    df = pd.read_csv(output_path, quoting=csv.QUOTE_ALL)
    print(f"Dataset loaded with {len(df)} rows")
    print(f"Number of unique original captions: {df['original_caption'].nunique()}")
    print(f"Number of samples per model: Model A (SmolVLM): {sum(df['model_label'] == 0)}, Model B (Custom): {sum(df['model_label'] == 1)}")
    print(f"Perturbation levels in the dataset: {df['perturbation_percentage'].unique()}")

    return df

gdrive_url = "https://drive.google.com/file/d/1GoQ9AFVaWkaMxxB8iPh8K-MGePhkmbJI/view?usp=sharing"
load_caption_dataset_from_gdrive(gdrive_url)



Downloading...
From: https://drive.google.com/uc?id=1GoQ9AFVaWkaMxxB8iPh8K-MGePhkmbJI
To: /kaggle/working/caption_classifier_dataset.csv
100%|██████████| 6.80M/6.80M [00:00<00:00, 23.2MB/s]

File downloaded to caption_classifier_dataset.csv
Dataset loaded with 5568 rows
Number of unique original captions: 928
Number of samples per model: Model A (SmolVLM): 2784, Model B (Custom): 2784
Perturbation levels in the dataset: [10 50 80]





| | original_caption | generated_caption | perturbation_percentage | model_label | text |
|---|---|---|---|---|---|
| 0 | A large building with bars on the windows in f... | The image depicts a modern urban setting with... | 10 | 0 | A large building with bars on the windows in f... |
| 1 | A person is skiing through the snow. There is ... | The image features a person skiing down a sno... | 10 | 0 | A person is skiing through the snow. There is ... |
| 2 | There is a bed in a room against a wall. There... | The image depicts a bedroom scene, likely fro... | 10 | 0 | There is a bed in a room against a wall. There... |
| 3 | A black and red train is on the tracks and has... | The image depicts a scene featuring a steam l... | 10 | 0 | A black and red train is on the tracks and has... |
| 4 | A white and yellow public transportation bus w... | The image depicts a bus parked on a city stre... | 10 | 0 | A white and yellow public transportation bus w... |
| ... | ... | ... | ... | ... | ... |
| 5563 | This is a picture of a stream. The stream is m... | picture is in black and white. The picture is... | 80 | 1 | This is a picture of a stream. The stream is m... |
| 5564 | This is a black and white photograph of a very... | picture is taken outside on a cloudy day. Two... | 80 | 1 | This is a black and white photograph of a very... |
| 5565 | There are three trains in the image. The build... | picture is taken outside on a cloudy day. The... | 80 | 1 | There are three trains in the image. The build... |
| 5566 | A flock of white sheep are grazing on both gre... | image is of a zoo. The zoo is surrounded by t... | 80 | 1 | A flock of white sheep are grazing on both gre... |


### Caption Classifier Dataset Preparation

The code prepares the dataset for the caption classification task. Each sample contains an original caption, a generated caption, a perturbation percentage, and a model label.

- The dataset is split into training, validation, and test sets by unique original caption (roughly 70% / 10% / 20% of the captions), so every row derived from the same caption lands in a single split and no caption leaks between splits.
- A custom `Dataset` class, `CaptionClassifierDataset`, handles tokenization with a BERT tokenizer. This class:
  - Tokenizes the `text` field (original caption, generated caption, and perturbation percentage joined by ` [SEP] `) and pads or truncates it to `max_length` (128 tokens by default).
  - Returns the input IDs, attention mask, token type IDs, and label for each sample.

The data is now ready for further processing and training a model on the caption classification task.
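The split sizes printed by the cell can be reproduced with a little arithmetic. The sketch below assumes that the held-out share is rounded up at each step and that every unique caption contributes exactly 6 rows (3 occlusion levels × 2 models); both assumptions are consistent with the sizes reported in this notebook, but `split_sizes` itself is a hypothetical helper.

```python
import math


def split_sizes(n_groups, holdout_fraction=0.3, test_fraction_of_holdout=0.67):
    """Number of unique captions in train / val / test (held-out share rounded up)."""
    n_holdout = math.ceil(holdout_fraction * n_groups)
    n_train = n_groups - n_holdout
    n_test = math.ceil(test_fraction_of_holdout * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


ROWS_PER_CAPTION = 3 * 2  # assumed: 3 occlusion levels x 2 models per caption

caption_counts = split_sizes(928)  # 928 unique original captions in the dataset
row_counts = tuple(n * ROWS_PER_CAPTION for n in caption_counts)

print("captions (train, val, test):", caption_counts)
print("rows     (train, val, test):", row_counts)
```

Because the split is made over captions and not over rows, the row counts are always multiples of the rows per caption.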

In [None]:
import csv

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from tqdm.notebook import tqdm

df = pd.read_csv("caption_classifier_dataset.csv", quoting=csv.QUOTE_ALL)

unique_captions = df['original_caption'].unique()
train_captions, temp_captions = train_test_split(unique_captions, test_size=0.3, random_state=42)
val_captions, test_captions = train_test_split(temp_captions, test_size=0.67, random_state=42)

train_df = df[df['original_caption'].isin(train_captions)]
val_df = df[df['original_caption'].isin(val_captions)]
test_df = df[df['original_caption'].isin(test_captions)]

print(f"Train set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")
print(f"Test set size: {len(test_df)}")

class CaptionClassifierDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        # Tokenize the text
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=True,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding['token_type_ids'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

Train set size: 3894
Validation set size: 552
Test set size: 1122


### Caption Classifier Model Definition

The `CaptionClassifier` class defines a deep learning model using BERT for classifying caption pairs based on their origin (SmolVLM or Custom model). Key components:

- **BERT Model**: The `BertModel` is loaded from the Hugging Face `transformers` library and used to extract contextual embeddings from the input text.
- **Dropout Layer**: A dropout layer is applied to the pooled output from BERT to prevent overfitting.
- **Classifier Layer**: A fully connected linear layer maps the pooled output to a final class prediction (two classes: SmolVLM or Custom model).

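The model takes tokenized inputs (`input_ids`, `attention_mask`, `token_type_ids`) and returns logits, one unnormalized score per class. Applying a softmax to the logits yields class probabilities, and the index of the largest logit is the predicted class.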
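The relationship between logits and probabilities can be shown with a plain-Python softmax. This is only an illustration of the math: in the notebook the equivalent step would operate on PyTorch tensors, and the logit values here are invented.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for one sample: index 0 = SmolVLM, index 1 = Custom model.
logits = [2.0, -1.0]
probs = softmax(logits)
predicted_class = max(range(len(logits)), key=lambda i: logits[i])

print("probabilities:", [round(p, 4) for p in probs])
print("predicted class:", predicted_class)
```

Softmax preserves the ordering of the logits, which is why the classifier can take the arg-max of the raw logits without normalizing them first.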

In [11]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from tqdm.notebook import tqdm
import csv
class CaptionClassifier(nn.Module):
    def __init__(self, bert_model_name="bert-base-uncased", num_classes=2, dropout_rate=0.3):
        super(CaptionClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout_rate)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

        pooled_output = outputs.pooler_output

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        return logits

### Training Loop for the Caption Classifier

The `train_classifier` function fine-tunes the `CaptionClassifier` for a given number of epochs. Key components:

- **Training Mode**: `model.train()` is called before the epoch loop so that dropout is active during training.
- **Batch Step**: For each batch, the gradients are cleared, the inputs and labels are moved to `device`, the logits are computed, and the loss from `criterion` is backpropagated before `optimizer.step()` updates the weights.
- **Loss Reporting**: The per-batch loss is shown on the progress bar. The mean loss of each epoch is printed, followed by the mean of those epoch averages once training completes.


In [12]:
def train_classifier(model, dataloader, optimizer, criterion, device, epochs=3):
    model.train()
    total_loss = 0.0

    for epoch in range(epochs):
        progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")
        epoch_loss = 0.0

        for batch in progress_bar:
            optimizer.zero_grad()

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask, token_type_ids)
            loss = criterion(outputs, labels)

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            progress_bar.set_postfix({'loss': loss.item()})

        avg_epoch_loss = epoch_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{epochs} completed. Average loss: {avg_epoch_loss:.4f}")
        total_loss += avg_epoch_loss

    print(f"Training completed. Average loss over all epochs: {total_loss/epochs:.4f}")

### Evaluation Function for the Caption Classifier

The `evaluate_classifier` function evaluates the trained `CaptionClassifier` model on a given dataset. Key components:

- **Model Evaluation Mode**: The model is set to evaluation mode using `model.eval()`, which disables dropout (and would make any batch normalization layers use their running statistics instead of batch statistics).
- **Prediction and Label Collection**: For each batch, predictions are made, and the true labels are collected.
- **Metrics Calculation**: The function calculates key classification metrics:
  - **Accuracy**: The proportion of correct predictions.
  - **Precision (Macro)**: The precision averaged over all classes.
  - **Recall (Macro)**: The recall averaged over all classes.
  - **F1 Score (Macro)**: The F1 score averaged over all classes.
- **No Gradient Computation**: The `torch.no_grad()` context manager is used to prevent gradient computation during evaluation, which reduces memory usage and speeds up the process.

This function provides a summary of the model's performance on the evaluation set.
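Macro averaging computes each metric per class and then takes the unweighted mean over classes. The sketch below works through that by hand in plain Python; `macro_metrics` is a hypothetical helper, the labels are invented, and undefined ratios are treated as 0.0 (an assumption about the edge case).

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []

    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Invented labels: one class-0 sample is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
scores = macro_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

With balanced classes, as in this dataset (2784 samples per model), macro averages stay close to accuracy.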

In [13]:
def evaluate_classifier(model, dataloader, device):

    model.eval()
    all_predictions = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            # Move batch to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask, token_type_ids)
            _, preds = torch.max(outputs, 1)

            all_predictions.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_labels, all_predictions, average='macro'
    )
    accuracy = accuracy_score(all_labels, all_predictions)

    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

    print(f"Evaluation Results:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision (Macro): {precision:.4f}")
    print(f"Recall (Macro): {recall:.4f}")
    print(f"F1 Score (Macro): {f1:.4f}")

    return metrics

### Hyperparameter Tuning for Caption Classifier

The `hyperparameter_tuning` function aims to find the best hyperparameters for training the `CaptionClassifier` model. Here's how it works:

- **Parameter Combinations**: The function tests multiple combinations of learning rate (`learning_rate`), batch size (`batch_size`), and epochs (`epochs`).
- **Dataset Preparation**: It creates training and validation datasets from the `train_df` and `val_df` DataFrames using the `CaptionClassifierDataset` class.
- **Model Training**: For each parameter combination, it:
  - Initializes the `CaptionClassifier` model.
  - Sets up the optimizer (`AdamW`) and loss function (`CrossEntropyLoss`).
  - Trains the model using the `train_classifier` function.
- **Validation and Evaluation**: After training, the model is evaluated on the validation set using the `evaluate_classifier` function, and the accuracy is recorded.
- **Best Hyperparameters**: The function tracks the best-performing hyperparameters based on validation accuracy. The comparison is strict (`>`), so when several configurations tie, the one tested first is kept.

At the end, it prints and returns the hyperparameters that gave the best validation accuracy. The trained models themselves are discarded, so the final model is retrained from scratch with the selected settings.
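The effect of the strict `>` comparison on ties can be seen in a small sketch of the selection loop. `select_best` is a hypothetical stand-in for the bookkeeping inside `hyperparameter_tuning`; the accuracies are the validation accuracies reported in this notebook's output, where every configuration scored 0.9837.

```python
def select_best(results):
    """Keep the first configuration with the highest score (strict `>`)."""
    best_score, best_params = 0.0, None
    for params, score in results:
        if score > best_score:  # a tie does NOT replace the current best
            best_score, best_params = score, params
    return best_params, best_score


# (params, validation accuracy) pairs in the order they are tested.
results = [
    ({"learning_rate": 1e-5, "batch_size": 16, "epochs": 2}, 0.9837),
    ({"learning_rate": 2e-5, "batch_size": 16, "epochs": 2}, 0.9837),
    ({"learning_rate": 5e-5, "batch_size": 16, "epochs": 2}, 0.9837),
    ({"learning_rate": 2e-5, "batch_size": 8, "epochs": 2}, 0.9837),
    ({"learning_rate": 2e-5, "batch_size": 32, "epochs": 2}, 0.9837),
    ({"learning_rate": 2e-5, "batch_size": 16, "epochs": 3}, 0.9837),
    ({"learning_rate": 2e-5, "batch_size": 16, "epochs": 4}, 0.9837),
]

best_params, best_score = select_best(results)
print(best_params, best_score)
```

When all candidates tie, the choice reflects only the order of the list, so a secondary criterion such as validation loss or macro F1 could be used to break ties if a more meaningful selection is needed.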

In [14]:
def hyperparameter_tuning(train_df, val_df, tokenizer, device):
    param_combinations = [
        {'learning_rate': 1e-5, 'batch_size': 16, 'epochs': 2},
        {'learning_rate': 2e-5, 'batch_size': 16, 'epochs': 2},
        {'learning_rate': 5e-5, 'batch_size': 16, 'epochs': 2},
        {'learning_rate': 2e-5, 'batch_size': 8, 'epochs': 2},
        {'learning_rate': 2e-5, 'batch_size': 32, 'epochs': 2},
        {'learning_rate': 2e-5, 'batch_size': 16, 'epochs': 3},
        {'learning_rate': 2e-5, 'batch_size': 16, 'epochs': 4}
    ]
    
    train_dataset = CaptionClassifierDataset(
        texts=train_df['text'].tolist(),
        labels=train_df['model_label'].tolist(),
        tokenizer=tokenizer
    )
    
    val_dataset = CaptionClassifierDataset(
        texts=val_df['text'].tolist(),
        labels=val_df['model_label'].tolist(),
        tokenizer=tokenizer
    )
    
    val_dataloader = DataLoader(val_dataset, batch_size=16)
    best_val_accuracy = 0.0
    best_params = None

    for params in param_combinations:
        lr = params['learning_rate']
        bs = params['batch_size']
        ep = params['epochs']
        
        print(f"\nTesting parameters: LR={lr}, BS={bs}, Epochs={ep}")
        
        train_dataloader = DataLoader(train_dataset, batch_size=bs, shuffle=True)
        
        model = CaptionClassifier().to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        
        train_classifier(model, train_dataloader, optimizer, criterion, device, epochs=ep)
        
        val_metrics = evaluate_classifier(model, val_dataloader, device)
        val_accuracy = val_metrics['accuracy']
        
        print(f"Validation accuracy: {val_accuracy:.4f}")
        
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_params = params
    
    print("\nBest hyperparameters found:")
    print(f"Learning Rate: {best_params['learning_rate']}")
    print(f"Batch Size: {best_params['batch_size']}")
    print(f"Epochs: {best_params['epochs']}")
    print(f"Validation Accuracy: {best_val_accuracy:.4f}")
    
    return best_params

### Final Model Training and Evaluation

1. **Device Selection**: Chooses between `cuda` (GPU) or `cpu` for training based on availability.

2. **Tokenizer Initialization**: Loads `BertTokenizer` from the pre-trained 'bert-base-uncased' model.

3. **Dataset Preparation**: Prepares the training, validation, and test datasets using `CaptionClassifierDataset`.

4. **Hyperparameter Tuning**: Runs `hyperparameter_tuning` to find optimal settings (`learning_rate`, `batch_size`, `epochs`).

5. **Model Initialization**: Creates a fresh `CaptionClassifier`, sets up the `AdamW` optimizer with the best learning rate and the `CrossEntropyLoss` loss function, and builds the data loaders with the best batch size.

6. **Final Training**: Trains the model using the best hyperparameters.

7. **Evaluation**: Evaluates the model on validation and test sets, printing accuracy, precision, recall, and F1 score.

8. **Model Saving**: Saves the trained model as `'caption_classifier_model.pt'`.

In [15]:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_dataset = CaptionClassifierDataset(
    texts=train_df['text'].tolist(),
    labels=train_df['model_label'].tolist(),
    tokenizer=tokenizer
)

val_dataset = CaptionClassifierDataset(
    texts=val_df['text'].tolist(),
    labels=val_df['model_label'].tolist(),
    tokenizer=tokenizer
)

test_dataset = CaptionClassifierDataset(
    texts=test_df['text'].tolist(),
    labels=test_df['model_label'].tolist(),
    tokenizer=tokenizer
)


print("Starting hyperparameter tuning...")
best_params = hyperparameter_tuning(train_df, val_df, tokenizer, device)

model = CaptionClassifier().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=best_params['learning_rate'])
criterion = nn.CrossEntropyLoss()
train_dataloader = DataLoader(train_dataset, batch_size=best_params['batch_size'], shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=best_params['batch_size'])
test_dataloader = DataLoader(test_dataset, batch_size=best_params['batch_size'])


print("Starting final training with best hyperparameters...")
train_classifier(
    model=model,
    dataloader=train_dataloader,
    optimizer=optimizer,
    criterion=criterion,
    device=device,
    epochs=best_params['epochs']
)


print("\nEvaluating on validation set...")
val_metrics = evaluate_classifier(
    model=model,
    dataloader=val_dataloader,
    device=device
)

print("\nEvaluating on test set...")
test_metrics = evaluate_classifier(
    model=model,
    dataloader=test_dataloader,
    device=device
)


torch.save(model.state_dict(), 'caption_classifier_model.pt')
print("Model saved to caption_classifier_model.pt")


Using device: cuda
Starting hyperparameter tuning...

Testing parameters: LR=1e-05, BS=16, Epochs=2
Epoch 1/2 completed. Average loss: 0.1450
Epoch 2/2 completed. Average loss: 0.0409
Training completed. Average loss over all epochs: 0.0929

Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9838
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837
Validation accuracy: 0.9837

Testing parameters: LR=2e-05, BS=16, Epochs=2
Epoch 1/2 completed. Average loss: 0.0947
Epoch 2/2 completed. Average loss: 0.0318
Training completed. Average loss over all epochs: 0.0633

Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9838
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837
Validation accuracy: 0.9837

Testing parameters: LR=5e-05, BS=16, Epochs=2
Epoch 1/2 completed. Average loss: 0.0875
Epoch 2/2 completed. Average loss: 0.0589
Training completed. Average loss over all epochs: 0.0732

Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9838
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837
Validation accuracy: 0.9837

Testing parameters: LR=2e-05, BS=8, Epochs=2
Epoch 1/2 completed. Average loss: 0.0752
Epoch 2/2 completed. Average loss: 0.0347
Training completed. Average loss over all epochs: 0.0549

Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9838
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837
Validation accuracy: 0.9837

Testing parameters: LR=2e-05, BS=32, Epochs=2
Epoch 1/2 completed. Average loss: 0.1077
Epoch 2/2 completed. Average loss: 0.0349
Training completed. Average loss over all epochs: 0.0713

Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9838
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837
Validation accuracy: 0.9837

Testing parameters: LR=2e-05, BS=16, Epochs=3
Epoch 1/3 completed. Average loss: 0.0957
Epoch 2/3 completed. Average loss: 0.0337
Epoch 3/3 completed. Average loss: 0.0321
Training completed. Average loss over all epochs: 0.0538

Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9842
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837
Validation accuracy: 0.9837

Testing parameters: LR=2e-05, BS=16, Epochs=4
Epoch 1/4 completed. Average loss: 0.1049
Epoch 2/4 completed. Average loss: 0.0330
Epoch 3/4 completed. Average loss: 0.0322
Epoch 4/4 completed. Average loss: 0.0310
Training completed. Average loss over all epochs: 0.0503

Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9842
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837
Validation accuracy: 0.9837

Best hyperparameters found:
Learning Rate: 1e-05
Batch Size: 16
Epochs: 2
Validation Accuracy: 0.9837
Starting final training with best hyperparameters...
Epoch 1/2 completed. Average loss: 0.1379
Epoch 2/2 completed. Average loss: 0.0358
Training completed. Average loss over all epochs: 0.0868

Evaluating on validation set...
Evaluation Results:
Accuracy: 0.9837
Precision (Macro): 0.9838
Recall (Macro): 0.9837
F1 Score (Macro): 0.9837

Evaluating on test set...
Evaluation Results:
Accuracy: 0.9866
Precision (Macro): 0.9866
Recall (Macro): 0.9866
F1 Score (Macro): 0.9866
Model saved to caption_classifier_model.pt
