This part of the code contains the imports and package installs needed to run the notebook on a **Kaggle Notebook** specifically.  
Please update the `'directory'` and `'path'` variables wherever necessary.  
The code has not been tested outside a Kaggle Notebook, so it is not guaranteed to run in other setups.

In [1]:
# Run this if working on Kaggle (for local runs, just pip install the requirements.txt)
!pip install evaluate rouge_score transformers
!pip install --upgrade nltk
!pip install bert_score

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.whl (183 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=2386a8aa6e75ec3a643a3148d44ba89154bb79966d08ce6326b6f8e916fc0d88
  Stored in directory: /

In [2]:
import os
import torch
import json
import pandas as pd
import numpy as np
import nltk
import evaluate
import random
from tqdm import tqdm
from PIL import Image
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader, random_split
import torch.nn as nn
import torch.optim as optim
from transformers import(
    AutoProcessor,
    AutoModelForVision2Seq,
    ViTModel,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    GPT2Config,
    ViTFeatureExtractor,
    BertModel,
    BertTokenizer
)
from sklearn.metrics import precision_recall_fscore_support
from transformers.image_utils import load_image
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score
from bert_score import score

In [6]:
# Run this if working on Kaggle, changing the path as required (it should also work locally, though that is untested)
# !mkdir -p /kaggle/working/nltk_data/corpora
nltk_data_dir = "/kaggle/working/"
nltk.download("wordnet", download_dir=nltk_data_dir)
nltk.download("omw-1.4", download_dir=nltk_data_dir)
nltk.download("punkt", download_dir=nltk_data_dir)
nltk.download("punkt_tab", download_dir=nltk_data_dir)

[nltk_data] Downloading package wordnet to /kaggle/working/...
[nltk_data] Downloading package omw-1.4 to /kaggle/working/...
[nltk_data] Downloading package punkt to /kaggle/working/...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /kaggle/working/...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [7]:
!unzip -o /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora
!unzip -o /kaggle/working/corpora/omw-1.4.zip -d /kaggle/working/corpora

Archive:  /kaggle/working/corpora/wordnet.zip
   creating: /kaggle/working/corpora/wordnet/
  inflating: /kaggle/working/corpora/wordnet/lexnames  
  inflating: /kaggle/working/corpora/wordnet/data.verb  
  inflating: /kaggle/working/corpora/wordnet/index.adv  
  inflating: /kaggle/working/corpora/wordnet/adv.exc  
  inflating: /kaggle/working/corpora/wordnet/index.verb  
  inflating: /kaggle/working/corpora/wordnet/cntlist.rev  
  inflating: /kaggle/working/corpora/wordnet/data.adj  
  inflating: /kaggle/working/corpora/wordnet/index.adj  
  inflating: /kaggle/working/corpora/wordnet/LICENSE  
  inflating: /kaggle/working/corpora/wordnet/citation.bib  
  inflating: /kaggle/working/corpora/wordnet/noun.exc  
  inflating: /kaggle/working/corpora/wordnet/verb.exc  
  inflating: /kaggle/working/corpora/wordnet/README  
  inflating: /kaggle/working/corpora/wordnet/index.sense  
  inflating: /kaggle/working/corpora/wordnet/data.noun  
  inflating: /kaggle/working/corpora/wordnet/data.adv  


In [8]:
nltk.data.path.append("/kaggle/working/corpora")
nltk.data.path.append("/kaggle/working/tokenizers")
nltk.data.path.append("/kaggle/working/")

In [9]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using Device: {DEVICE}")

Using Device: cuda


#### PART A - Implementing and Benchmarking a Custom Encoder-Decoder Model

This part of the code contains the following functions/classes:<br>
    - def zero_shot_captioning(image_path, model_name="HuggingFaceTB/SmolVLM-256M-Instruct")<br>
    - def generate_and_save_captions(test_dir, output_csv_path)<br>
    - def evaluate_model(test_csv_path, generated_csv_path)<br>
    - class ImageCaptionDataset(Dataset)<br>
    - class ImageCaptionModel(nn.Module)<br>
    - def train_model(model, train_loader, val_loader, optimizer, criterion, device, epochs=3, save_path="best_custom_model.pth")<br>
    - def generate_captions_with_custom_model(model, image_path, tokenizer, device)<br>
    - def generate_captions_for_test_set(model, test_dir, output_csv, device)<br>
    - def create_dataloaders(train_csv, val_csv, train_img_dir, val_img_dir, tokenizer, batch_size=8)<br>

In [10]:
def zero_shot_captioning(image_path, model_name="HuggingFaceTB/SmolVLM-256M-Instruct"):
    """
    Generate a caption for one image zero-shot with the SmolVLM model.
    Args:
        image_path (str): Path to the image file.
        model_name (str): Model name (default: HuggingFaceTB/SmolVLM-256M-Instruct).
    Returns:
        str: Generated caption for the image.
    """
    
    if not hasattr(zero_shot_captioning, "model") or not hasattr(zero_shot_captioning, "processor"):
        print("Loading Model and Processor...")
        zero_shot_captioning.model = AutoModelForVision2Seq.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
            device_map=DEVICE,
            _attn_implementation="eager" # here "flash_attention_2" could not be used due to cuda version errors on Kaggle Notebook
        ).to(DEVICE)
        zero_shot_captioning.processor = AutoProcessor.from_pretrained(model_name)
        print("Model and Processor loaded Successfully!")
        
    try:
        image = load_image(image_path)
    except Exception as e:
        print(f"Loading image from {image_path} failed: {e}")
        return None
    
    messages = [
        {
            "role": "user",
            "content" : [
                {"type": "image"},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ]
    
    prompt = zero_shot_captioning.processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = zero_shot_captioning.processor(text=prompt, images=image, return_tensors="pt").to(DEVICE)
    
    with torch.no_grad():
        generated_ids = zero_shot_captioning.model.generate(**inputs, max_new_tokens=100)
        generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
        generated_text = zero_shot_captioning.processor.batch_decode(
            generated_ids, skip_special_tokens=True
        )[0].strip()
        
    return generated_text
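
The function above caches the model and processor as attributes on the function object, so the expensive load runs only on the first call. A minimal sketch of that pattern is shown below with a dummy loader; `load_once` and `fake_loader` are hypothetical names used only for illustration and are not part of the notebook.

```python
def load_once(loader):
    """Wrap a loader so the expensive call happens only on first use."""
    def get():
        # Cache the result on the function object, mirroring the
        # hasattr(...) check used in zero_shot_captioning
        if not hasattr(get, "value"):
            get.value = loader()
        return get.value
    return get


calls = []


def fake_loader():
    # Stand-in for an expensive from_pretrained(...) call
    calls.append("loaded")
    return {"name": "dummy-model"}


get_model = load_once(fake_loader)
first = get_model()
second = get_model()
print(len(calls), first is second)  # 1 True
```

One consequence of this design is that a different `model_name` passed on a later call would be ignored, because the cached model is reused.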


In [11]:
def generate_and_save_captions(test_dir, output_csv_path):
    """
    Generate captions for all images in the directory and save to a CSV file.
    Args:
        test_dir (str): Directory containing the images.
        output_csv_path (str): Path to save the output CSV file.
    Returns:
        pd.DataFrame: DataFrame containing filenames and generated captions.
    """
    image_files = [f for f in os.listdir(test_dir) if f.lower().endswith('.jpg')]
    image_files.sort()
    
    captions = []
    
    for image_file in tqdm(image_files, desc="Generating Captions"):
        image_path = os.path.join(test_dir, image_file)
        caption = zero_shot_captioning(image_path)
        
        if caption:
            captions.append({
                'filename': image_file,
                'generated_caption': caption
            })
            
    df = pd.DataFrame(captions)
    df.to_csv(output_csv_path, index=False)  # skip the row index, consistent with the custom-model CSV
    print(f"Captions saved to {output_csv_path}")
    
    return df

In [12]:
def evaluate_model(test_csv_path, generated_csv_path):
    """
    Evaluate the model performance using the BLEU, ROUGE-L, and METEOR metrics.
    Args:
        test_csv_path (str): Path to the test CSV file containing ground truth captions.
        generated_csv_path (str): Path to the CSV file containing generated captions.
    Returns:
        dict: Dictionary containing the evaluation metrics.
    """
    
    test_df = pd.read_csv(test_csv_path)
    generated_df = pd.read_csv(generated_csv_path)
    
    merged_df = pd.merge(test_df, generated_df, on='filename', how='inner')
    
    if len(merged_df) == 0:
        print("No matching filenames found between test and generated dataframes.")
        return None
    
    # BLEU Score
    references = []
    hypotheses = []
    
    for _, row in merged_df.iterrows():
        reference = nltk.word_tokenize(row['caption'].lower())
        hypothesis = nltk.word_tokenize(row['generated_caption'].lower())
        
        references.append([reference])
        hypotheses.append(hypothesis)
    
    smooth = SmoothingFunction().method1
    bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
    
    # ROUGE-L Score
    rouge = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    rouge_scores = []
    
    for i in range(len(merged_df)):
        score = rouge.score(merged_df.iloc[i]['caption'], merged_df.iloc[i]['generated_caption'])
        rouge_scores.append(score['rougeL'].fmeasure)
    rouge_l_score = np.mean(rouge_scores)
    
    # METEOR Score
    meteor_scores = []
    
    for i in range(len(merged_df)):
        ref = nltk.word_tokenize(merged_df.iloc[i]['caption'].lower())
        hyp = nltk.word_tokenize(merged_df.iloc[i]['generated_caption'].lower())
        score = meteor_score([ref], hyp)
        meteor_scores.append(score)
    meteor_score_avg = np.mean(meteor_scores)
    
    results = {
        'BLEU': bleu_score,
        'ROUGE-L': rouge_l_score,
        'METEOR': meteor_score_avg
    }
    
    return results
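
To make the ROUGE-L number more concrete, here is a minimal pure-Python sketch of the quantity it is based on: the longest common subsequence (LCS) between reference and hypothesis, combined into an F-measure. It is a simplification and not the `rouge_score` implementation: it splits on whitespace and does no stemming, so its values can differ from those of `RougeScorer(['rougeL'], use_stemmer=True)`. The helper names and the example sentences are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """LCS-based F-measure on lowercased, whitespace-split tokens."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


ref = "a dog runs on the grass"
hyp = "a dog is running on grass"
# Shared subsequence: "a dog ... on ... grass" (4 tokens out of 6 on each side)
print(round(rouge_l_f1(ref, hyp), 4))  # 0.6667
```

With stemming enabled, "runs" and "running" would likely match as well, which is one reason the library score can be higher than this sketch.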

In [13]:
class ImageCaptionDataset(Dataset):
    """
    Dataset for the image captioning model with an encoder-decoder architecture.
    """
    def __init__(self, csv_file, img_dir, tokenizer, max_length=50, transform=None):
        self.df = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.tokenizer = tokenizer
        self.max_length = max_length
        
        if transform is None:
            self.transform = transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])
        else:
            self.transform = transform
            
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
            
        img_name = os.path.join(self.img_dir, self.df.iloc[idx]['filename'])
        caption = self.df.iloc[idx]['caption']
        
        try:
            image = Image.open(img_name).convert("RGB")
            image = self.transform(image)
        except Exception as e:
            print(f"Error loading image {img_name}: {e}")
            image = torch.zeros((3, 224, 224)) # generate a dummy black image if error occurs
            
        caption_encoding = self.tokenizer(
            caption, padding="max_length", 
            truncation=True, max_length=self.max_length,
            return_tensors="pt"
        )
        caption_ids = caption_encoding.input_ids.squeeze(0)
        
        return image, caption_ids
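
The tokenizer call above uses `padding="max_length"` and `truncation=True`, so every caption comes back with exactly `max_length` ids. The sketch below imitates that behaviour on plain lists. It assumes right-side padding (the usual default for the GPT-2 tokenizer) and uses made-up token ids; `pad_or_truncate` is a hypothetical helper, not a library function.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Right-pad or cut a list of token ids to exactly max_length entries."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


PAD_ID = 50256  # GPT-2's <|endoftext|> id, reused as the pad token in this notebook

short_caption = [10, 11, 12]     # made-up ids for a short caption
long_caption = list(range(8))    # made-up ids for a caption longer than max_length

print(pad_or_truncate(short_caption, 5, PAD_ID))  # [10, 11, 12, 50256, 50256]
print(pad_or_truncate(long_caption, 5, PAD_ID))   # [0, 1, 2, 3, 4]
```

Fixed-length rows are what let the default `DataLoader` collation stack captions into a single `(batch_size, max_length)` tensor.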

In [14]:
class ImageCaptionModel(nn.Module):
    """
    Image Captioning Model using ViT and GPT2.
    """
    def __init__(self, vit_model="WinKawaks/vit-small-patch16-224", gpt2_model="gpt2", freeze_vit=True, freeze_gpt2_partial=True):
        super(ImageCaptionModel, self).__init__()

        # setup the encoder
        self.encoder = ViTModel.from_pretrained(vit_model)
        self.encoder_dim = self.encoder.config.hidden_size
        
        if freeze_vit:
            for param in self.encoder.parameters():
                param.requires_grad = False
                
        # setup the decoder
        gpt2_config = GPT2Config.from_pretrained(gpt2_model)
        gpt2_config.add_cross_attention = True # add cross attention layer to the decoder
        self.decoder = GPT2LMHeadModel.from_pretrained(gpt2_model, config=gpt2_config)
        self.decoder_dim = self.decoder.config.hidden_size
        
        if freeze_gpt2_partial:
            for i, block in enumerate(self.decoder.transformer.h):
                if i < len(self.decoder.transformer.h) - 2:
                    for param in block.parameters():
                        param.requires_grad = False
                        
        self.connect = nn.Linear(self.encoder_dim, self.decoder_dim)
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_model)
        
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        self.img_token_id = self.tokenizer.convert_tokens_to_ids("<|img|>") if "<|img|>" in self.tokenizer.get_vocab() else self.tokenizer.convert_tokens_to_ids("<|endoftext|>")
        
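    def forward(self, images, captions=None):
        """
        Forward pass of the model.
        Args:
            images (torch.Tensor): Batch of input image tensors.
            captions (torch.Tensor, optional): Caption token ids. Defaults to None.
        Returns:
            tuple[torch.Tensor, torch.Tensor]: Pad-masked shifted logits and labels when captions are given.
            torch.Tensor: Generated token ids when captions is None (inference mode).
        """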
        
        encoder_outputs = self.encoder(images).last_hidden_state
        cls_output = encoder_outputs[:, 0, :]
        img_features = self.connect(cls_output)
        batch_size = images.size(0)
        img_tokens = torch.full((batch_size, 1), self.img_token_id, dtype=torch.long, device=images.device)
        
        if captions is not None:
            if not isinstance(captions, torch.Tensor):
                raise ValueError("Captions should be a tensor of input_ids.")
            
            input_ids = torch.cat([img_tokens, captions], dim=1)
            outputs = self.decoder(
                input_ids=input_ids,
                encoder_hidden_states=img_features.unsqueeze(1),
            )
            logits = outputs.logits
            shifted_logits = logits[:, :-1, :].contiguous()
            labels = input_ids[:, 1:].contiguous()
            
            pad_mask = (labels != self.tokenizer.pad_token_id)
            
            shifted_logits = shifted_logits.view(-1, self.decoder.config.vocab_size)
            labels = labels.view(-1)
            mask = pad_mask.view(-1)
            
            shifted_logits = shifted_logits[mask]
            labels = labels[mask]
            
            return shifted_logits, labels
        else:
            # inference mode
            input_ids = img_tokens
            encoder_outputs = img_features.unsqueeze(1)
            
            generated = self.decoder.generate(
                input_ids=input_ids,
                encoder_hidden_states=encoder_outputs,
                max_length=50,
                num_beams=4,
                early_stopping=True,
                pad_token_id=self.tokenizer.pad_token_id,
            )
            
            return generated
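
The training branch of `forward` shifts logits and labels by one position and then drops padded targets. The sketch below replays that bookkeeping on plain Python lists so the alignment is easy to inspect. Apart from 50256 (GPT-2's `<|endoftext|>` id), the token ids are made up, and `shift_and_mask` is a hypothetical helper.

```python
PAD_ID = 50256   # pad token == eos token in the model above
IMG_ID = PAD_ID  # "<|img|>" is not in the GPT-2 vocabulary, so the fallback id is used


def shift_and_mask(caption_ids, img_id=IMG_ID, pad_id=PAD_ID):
    """Return (logit_position, target_id) pairs that survive the pad mask."""
    input_ids = [img_id] + caption_ids
    # The logits at position t are trained to predict the token at position t + 1
    positions = range(len(input_ids) - 1)
    labels = input_ids[1:]
    return [(t, label) for t, label in zip(positions, labels) if label != pad_id]


caption = [64, 3290, 4539, PAD_ID, PAD_ID]  # three made-up word ids plus padding
print(shift_and_mask(caption))  # [(0, 64), (1, 3290), (2, 4539)]
```

Because the pad id and the end-of-text id are the same here, an end-of-text target would be filtered out in the same way as padding.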

In [15]:
def train_model(model, train_loader, val_loader, optimizer, criterion, device, epochs=3, save_path="best_custom_model.pth"):
    """
    Train the image captioning model.
    Args:
        model (nn.Module): The image captioning model.
        train_loader (DataLoader): DataLoader for training data.
        val_loader (DataLoader): DataLoader for validation data.
        optimizer (torch.optim.Optimizer): Optimizer for training.
        criterion (nn.Module): Loss function.
        device (str): Device to train on ("cuda" or "cpu").
        epochs (int, optional): Number of epochs to train. Defaults to 3.
        save_path (str, optional): Path to save the best model. Defaults to "best_custom_model.pth".
    """
    
    best_val_loss = float('inf')
    
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        
        train_pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} [Training]")
        for images, captions in train_pbar:
            images = images.to(device)
            captions = captions.to(device)
            
            logits, labels = model(images, captions)
            
            loss = criterion(logits, labels)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
            train_pbar.set_postfix({'loss': loss.item()})
            
        avg_train_loss = train_loss / len(train_loader)
        
        model.eval()
        val_loss = 0
        
        val_pbar = tqdm(val_loader, desc=f"Epoch {epoch+1}/{epochs} [Validation]")
        
        with torch.no_grad():
            for images, captions in val_pbar:
                images = images.to(device)
                captions = captions.to(device)
                
                logits, labels = model(images, captions)
                
                loss = criterion(logits, labels)
                
                val_loss += loss.item()
                val_pbar.set_postfix({'loss': loss.item()})
            
        avg_val_loss = val_loss / len(val_loader)
        
        print(f"Epoch {epoch+1}/{epochs} - Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")
        
        # save the best model to desired path
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save(model.state_dict(), save_path)
            print(f"Best model saved to {save_path} with loss: {best_val_loss:.4f}")

In [16]:
def generate_captions_with_custom_model(model, image_path, tokenizer, device):
    """
    Generate a caption for one image using the trained custom model.
    Args:
        model (nn.Module): The trained image captioning model.
        image_path (str): Path to the image file.
        tokenizer (GPT2Tokenizer): Tokenizer used to decode the generated ids.
        device (str): Device to use ("cuda" or "cpu").
    Returns:
        str: Generated caption for the image, or None if the image cannot be loaded.
    """
    
    model.eval()
    
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    
    try:
        image = Image.open(image_path).convert("RGB")
        image = transform(image).unsqueeze(0).to(device)
    except Exception as e:
        print(f"Error loading image {image_path}: {e}")
        return None
    
    with torch.no_grad():
        batch_size = image.size(0)
        input_ids = torch.full((batch_size, 1), model.img_token_id, dtype=torch.long, device=device)
        attention_mask = torch.ones_like(input_ids)

        encoder_outputs = model.encoder(image).last_hidden_state
        cls_output = encoder_outputs[:, 0, :]
        img_features = model.connect(cls_output).unsqueeze(1)

        generated = model.decoder.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=img_features,
            max_length=50,
            num_beams=4,
            early_stopping=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
        
    caption = tokenizer.decode(generated[0], skip_special_tokens=True)
    return caption

In [17]:
def generate_captions_for_test_set(model, test_dir, output_csv, device):
    """
    Generate the captions using the trained model on the test set images
    Args:
        model (nn.Module): The trained image captioning model.
        test_dir (str): Directory containing the test images.
        output_csv (str): Path to save the generated captions.
        device (str): Device to use ("cuda" or "cpu").
    """
    
    image_files = [f for f in os.listdir(test_dir) if f.lower().endswith('.jpg')]
    image_files.sort()
    
    captions = []
    
    for image_file in tqdm(image_files, desc="Generating Captions"):
        image_path = os.path.join(test_dir, image_file)
        caption = generate_captions_with_custom_model(model, image_path, model.tokenizer, device)
        
        if caption:
            captions.append({
                'filename': image_file,
                'generated_caption': caption
            })
            
    df = pd.DataFrame(captions)
    df.to_csv(output_csv, index=False)
    print(f"Captions saved to {output_csv}")

In [18]:
def create_dataloaders(train_csv, val_csv, train_img_dir, val_img_dir, tokenizer, batch_size=8):
    train_dataset = ImageCaptionDataset(csv_file=train_csv, img_dir=train_img_dir, tokenizer=tokenizer)
    val_dataset = ImageCaptionDataset(csv_file=val_csv, img_dir=val_img_dir, tokenizer=tokenizer)
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    return train_loader, val_loader

All the function calls and class instances required for Part A

In [19]:
# We keep these runs modular and separated instead of wrapping them in a single main, so that a failure in one step does not affect the others
# and so that it is easier to debug and run in parts if needed
test_dir = "/kaggle/input/dataset/Dataset/test"             # change as required
test_csv_path = "/kaggle/input/dataset/Dataset/test.csv"    # change as required
generated_csv_path = "smolvlm_captions_0.csv"               # change as required

In [20]:
print("Generating captions using the SmolVLM model...")
generate_and_save_captions(test_dir, generated_csv_path)
print("Generation of captions using SmolVLM model completed.")

Generating captions using the SmolVLM model...


Generating Captions:   0%|          | 0/928 [00:00<?, ?it/s]

Loading Model and Processor...


Model and Processor loaded Successfully!


Generating Captions: 100%|██████████| 928/928 [1:12:01<00:00,  4.66s/it]

Captions saved to smolvlm_captions_0.csv
Generation of captions using SmolVLM model completed.





In [21]:
print("Evaluating the SmolVLM model...")
evaluation_results = evaluate_model(test_csv_path, generated_csv_path)
if evaluation_results:
    print("\nEvaluation Results:")
    print(f"BLEU Score: {evaluation_results['BLEU']:.4f}")
    print(f"ROUGE-L Score: {evaluation_results['ROUGE-L']:.4f}")
    print(f"METEOR Score: {evaluation_results['METEOR']:.4f}")
df_results_smolvlm = pd.DataFrame([evaluation_results])
df_results_smolvlm.to_csv("smolvlm_evaluation_results.csv", index=False)
print("Evaluation of SmolVLM model completed and results saved to smolvlm_evaluation_results.csv")

Evaluating the SmolVLM model...

Evaluation Results:
BLEU Score: 0.0545
ROUGE-L Score: 0.2396
METEOR Score: 0.2750
Evaluation of SmolVLM model completed and results saved to smolvlm_evaluation_results.csv


In [22]:
train_csv = "/kaggle/input/dataset/Dataset/train.csv"       # change as required
val_csv = "/kaggle/input/dataset/Dataset/val.csv"           # change as required
test_csv = "/kaggle/input/dataset/Dataset/test.csv"         # change as required
train_img_dir = "/kaggle/input/dataset/Dataset/train"       # change as required
val_img_dir = "/kaggle/input/dataset/Dataset/val"           # change as required
test_img_dir = "/kaggle/input/dataset/Dataset/test"         # change as required

In [23]:
model = ImageCaptionModel()
tokenizer = model.tokenizer
model = model.to(DEVICE)

Some weights of ViTModel were not initialized from the model checkpoint at WinKawaks/vit-small-patch16-224 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_attn.bias', 'transformer.h.1.crossattention.c_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.bias', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.bias', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.10.crossattention.c_attn.bias', 'transformer.h.10.crossattention.c_attn.weight', 'transformer.h.10.crossattention.c_proj.bias', 'transformer.h.10.cros

In [24]:
train_loader, val_loader = create_dataloaders(
    train_csv=train_csv,
    val_csv=val_csv,
    train_img_dir=train_img_dir,
    val_img_dir=val_img_dir,
    tokenizer=tokenizer,
    batch_size=8            # adjust according to the GPU availability
)

In [25]:
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
criterion  = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
print("Starting Model Training...")
train_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    criterion=criterion,
    device=DEVICE,
    epochs=5,               # change as required
    save_path="best_custom_model.pth"
)
print("Model Training Completed and Best Model Saved to best_custom_model.pth")

Starting Model Training...


Epoch 1/5 [Training]: 100%|██████████| 715/715 [02:11<00:00,  5.43it/s, loss=2.82]
Epoch 1/5 [Validation]: 100%|██████████| 119/119 [00:16<00:00,  7.42it/s, loss=2.63]


Epoch 1/5 - Train Loss: 2.7448, Val Loss: 2.4590
Best model saved to best_custom_model.pth with loss: 2.4590


Epoch 2/5 [Training]: 100%|██████████| 715/715 [01:36<00:00,  7.44it/s, loss=2.8] 
Epoch 2/5 [Validation]: 100%|██████████| 119/119 [00:09<00:00, 11.94it/s, loss=2.56]


Epoch 2/5 - Train Loss: 2.4296, Val Loss: 2.4042
Best model saved to best_custom_model.pth with loss: 2.4042


Epoch 3/5 [Training]: 100%|██████████| 715/715 [01:35<00:00,  7.47it/s, loss=2.33]
Epoch 3/5 [Validation]: 100%|██████████| 119/119 [00:10<00:00, 11.67it/s, loss=2.5] 


Epoch 3/5 - Train Loss: 2.2758, Val Loss: 2.3820
Best model saved to best_custom_model.pth with loss: 2.3820


Epoch 4/5 [Training]: 100%|██████████| 715/715 [01:35<00:00,  7.48it/s, loss=2.19]
Epoch 4/5 [Validation]: 100%|██████████| 119/119 [00:10<00:00, 11.88it/s, loss=2.53]


Epoch 4/5 - Train Loss: 2.1526, Val Loss: 2.3875


Epoch 5/5 [Training]: 100%|██████████| 715/715 [01:35<00:00,  7.47it/s, loss=2]   
Epoch 5/5 [Validation]: 100%|██████████| 119/119 [00:09<00:00, 11.96it/s, loss=2.56]

Epoch 5/5 - Train Loss: 2.0431, Val Loss: 2.4055
Model Training Completed and Best Model Saved to best_custom_model.pth
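
The checkpoint rule in `train_model` can be replayed on the validation losses printed in the log above to see which epoch ends up in `best_custom_model.pth`. This is only a sketch of the selection logic, with the five logged values typed in by hand.

```python
# Validation losses copied from the training log above (epochs 1 to 5)
val_losses = [2.4590, 2.4042, 2.3820, 2.3875, 2.4055]

best_val_loss = float("inf")
best_epoch = None
saved_epochs = []  # epochs at which the checkpoint file would be overwritten

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_val_loss:
        best_val_loss, best_epoch = loss, epoch
        saved_epochs.append(epoch)

print(best_epoch, f"{best_val_loss:.4f}", saved_epochs)  # 3 2.3820 [1, 2, 3]
```

The saved weights therefore come from epoch 3. The training loss keeps falling in epochs 4 and 5 while the validation loss rises slightly, which may indicate the start of overfitting.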





In [26]:
print("Loading the best model for inference...")
# map_location keeps loading device-independent; weights_only=True restricts unpickling to tensors and plain containers
model.load_state_dict(torch.load("best_custom_model.pth", map_location=DEVICE, weights_only=True))

Loading the best model for inference...

<All keys matched successfully>

In [28]:
print("Generating Captions for the test set using the trained model...")
generate_captions_for_test_set(
    model=model,
    test_dir=test_img_dir,
    output_csv="custom_model_captions_0.csv",
    device=DEVICE
)
print("Captions for the test set generated and saved to custom_model_captions_0.csv")

Generating Captions for the test set using the trained model...


Generating Captions: 100%|██████████| 928/928 [11:26<00:00,  1.35it/s]

Captions saved to custom_model_captions_0.csv
Captions for the test set generated and saved to custom_model_captions_0.csv





In [29]:
print("Evaluating the custom model performance...")
evaluation_results = evaluate_model(test_csv, "custom_model_captions_0.csv")
if evaluation_results:
    print("\nEvaluation Results:")
    print(f"BLEU Score: {evaluation_results['BLEU']:.4f}")
    print(f"ROUGE-L Score: {evaluation_results['ROUGE-L']:.4f}")
    print(f"METEOR Score: {evaluation_results['METEOR']:.4f}")
df = pd.DataFrame([evaluation_results])
df.to_csv("custom_model_evaluation_results.csv", index=False)
print("Evaluation of custom model completed and results saved to custom_model_evaluation_results.csv")

Evaluating the custom model performance...

Evaluation Results:
BLEU Score: 0.0592
ROUGE-L Score: 0.2722
METEOR Score: 0.2156
Evaluation of custom model completed and results saved to custom_model_evaluation_results.csv
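
For a quick side-by-side view, the two sets of scores reported above can be compared directly. The snippet below only re-uses the numbers printed by the two evaluation cells and computes the difference (custom model minus SmolVLM) per metric.

```python
# Scores copied from the two evaluation outputs above
smolvlm = {"BLEU": 0.0545, "ROUGE-L": 0.2396, "METEOR": 0.2750}
custom = {"BLEU": 0.0592, "ROUGE-L": 0.2722, "METEOR": 0.2156}

# Positive delta: the custom ViT + GPT-2 model scores higher on that metric
deltas = {metric: round(custom[metric] - smolvlm[metric], 4) for metric in smolvlm}

for metric, delta in deltas.items():
    print(f"{metric:8s} SmolVLM={smolvlm[metric]:.4f} custom={custom[metric]:.4f} delta={delta:+.4f}")
```

On these numbers the custom model is ahead on BLEU and ROUGE-L, while the zero-shot SmolVLM captions score higher on METEOR.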


#### PART B - Studying Performance Change Under Image Occlusion

This part of the code contains the following functions/classes:<br>
    - def occlude_image(image, mask_percentage)<br>
    - def generate_smolvlm_captions(test_dir, occlusion_levels, output_path_template="smolvlm_captions_{}.csv", model_name="HuggingFaceTB/SmolVLM-256M-Instruct")<br>
    - def generate_custom_model_captions(test_dir, occlusion_levels, model_path, output_path_template="custom_model_captions_{}.csv")<br>
    - def evaluate_on_occluded_images(model_name, test_csv_path, baseline_csv_path, occluded_captions_template, occlusion_levels)<br>
    - def evaluate_and_save_results(test_csv_path, generated_dir, output_csv_path, partc_csv_path)<br>

In [30]:
def occlude_image(image, mask_percentage):
    """
    Apply patch-wise occlusion to the given image.
    Args:
        image (PIL.Image.Image or np.ndarray): Input image (height x width x channels).
        mask_percentage (float): Percentage of the image patches to occlude (0-100).
    Returns:
        np.ndarray: Occluded copy of the image as an array.
    """
    
    if isinstance(image, Image.Image):
        np_image = np.array(image)
    else:
        np_image = image.copy()
        
    height, width, _ = np_image.shape
    
    patch_size = 16 # as mentioned in the assignment
    
    patches_h = height // patch_size
    patches_w = width // patch_size
    total_patches = patches_h * patches_w
    
    num_patches_to_mask = int((mask_percentage / 100) * total_patches)
    mask_indices = random.sample(range(total_patches), num_patches_to_mask)
    
    for idx in mask_indices:
        patch_h = idx // patches_w
        patch_w = idx % patches_w
        
        h_start = patch_h * patch_size
        h_end = min(h_start + patch_size, height)
        w_start = patch_w * patch_size
        w_end = min(w_start + patch_size, width)
        
        np_image[h_start:h_end, w_start:w_end, :] = 0 # set the patch to black
        
    return np_image
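
To see how many pixels the patch arithmetic above actually blacks out, the following sketch reproduces only the index calculations of `occlude_image` in plain Python. The helper names `masked_patch_boxes` and `masked_fraction` and the `seed` parameter are illustrative assumptions, not part of the notebook; the 16-pixel patch size and the `int(...)` truncation mirror the code above.

```python
import random


def masked_patch_boxes(height, width, mask_percentage, patch_size=16, seed=None):
    """Return the (h_start, h_end, w_start, w_end) boxes that would be blacked out.

    Hypothetical helper: it mirrors the index arithmetic of occlude_image
    but does not touch any pixel data.
    """
    rng = random.Random(seed)  # local RNG so a seed makes the choice repeatable
    patches_h = height // patch_size
    patches_w = width // patch_size
    total_patches = patches_h * patches_w

    # int() truncates, so 10% of 196 patches selects 19 patches, not 19.6
    num_to_mask = int((mask_percentage / 100) * total_patches)
    chosen = rng.sample(range(total_patches), num_to_mask)

    boxes = []
    for idx in chosen:
        row, col = divmod(idx, patches_w)
        boxes.append((row * patch_size, (row + 1) * patch_size,
                      col * patch_size, (col + 1) * patch_size))
    return boxes


def masked_fraction(height, width, mask_percentage, patch_size=16):
    """Fraction of all pixels covered by the selected patches."""
    n = len(masked_patch_boxes(height, width, mask_percentage, patch_size, seed=0))
    return n * patch_size * patch_size / (height * width)


if __name__ == "__main__":
    # 224x224 divides evenly into 14 * 14 = 196 patches
    print(masked_fraction(224, 224, 10))
    # 230x230 also yields 14 * 14 patches; the 6-pixel border is never selected
    print(masked_fraction(230, 230, 50))
```

Because of the truncation and the unmasked border, the occluded area is at most the requested percentage and is usually slightly below it. `occlude_image` itself draws from the global `random` module, so calling `random.seed(...)` before generating captions would make the chosen patches repeatable across runs.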

In [31]:
def generate_smolvlm_captions(test_dir, occlusion_levels, output_path_template="smolvlm_captions_{}.csv", model_name="HuggingFaceTB/SmolVLM-256M-Instruct"):
    """
    Generate captions using the SmolVLM model at each of the given occlusion levels.
    Args:
        test_dir (str): Directory containing the images.
        occlusion_levels (list): List of occlusion levels (in percentage).
        output_path_template (str): Template for output CSV file names.
        model_name (str): Model name (default: HuggingFaceTB/SmolVLM-256M-Instruct).
    """
    
    filenames = [f for f in os.listdir(test_dir) if f.lower().endswith('.jpg')]
    filenames.sort()
    
    print("Loading SmolVLm Model and Processor...")
    model = AutoModelForVision2Seq.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32,
        device_map=DEVICE,
        _attn_implementation="eager" # here "flash_attention_2" could not be used due to cuda version errors on Kaggle Notebook
    ).to(DEVICE)
    processor = AutoProcessor.from_pretrained(model_name)
    print("Model and Processor loaded Successfully!")
    
    for occlusion_level in occlusion_levels:
        print(f"ProcessingOcclusion Level: {occlusion_level}%")
        output_path = output_path_template.format(occlusion_level)
        
        captions = []
        
        for filename in tqdm(filenames, desc=f"Generating captions for {occlusion_level}% occlusion"):
            try:
                image_path = os.path.join(test_dir, filename)
                image = Image.open(image_path).convert("RGB")
                
                occluded_image = Image.fromarray(occlude_image(image, occlusion_level))
                
                messages = [
                    {
                        "role": "user",
                        "content" : [
                            {"type": "image"},
                            {"type": "text", "text": "Describe this image in detail."}
                        ]
                    }
                ]
                
                prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
                inputs = processor(text=prompt, images=occluded_image, return_tensors="pt").to(DEVICE)
                
                with torch.no_grad():
                    generated_ids = model.generate(**inputs, max_new_tokens=100)
                    generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
                    generated_text = processor.batch_decode(
                        generated_ids, skip_special_tokens=True
                    )[0].strip()
                    
                    captions.append({
                        'filename': filename,
                        'generated_caption': generated_text
                    })
            except Exception as e:
                print(f"Error processing image {filename}: {e}")
                captions.append({
                    'filename': filename,
                    'generated_caption': ""
                })
                
        df = pd.DataFrame(captions)
        df.to_csv(output_path, index=False)
        print(f"Captions saved to {output_path}")
        print(f"Captions for {occlusion_level}% occlusion completed successfully!")

In [32]:
def generate_custom_model_captions(test_dir, occlusion_levels, model_path, output_path_template="custom_model_captions_{}.csv"):
    """
    Generate captions using the custom model at each of the given occlusion levels.
    Args:
        test_dir (str): Directory containing the images.
        occlusion_levels (list): List of occlusion levels (in percentage).
        model_path (str): Path to the trained custom model.
        output_path_template (str): Template for output CSV file names.
    """
    
    filenames = [f for f in os.listdir(test_dir) if f.lower().endswith('.jpg')]
    filenames.sort()
    
    print("Loading the trained custom model...")
    model = ImageCaptionModel().to(DEVICE)
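    # pass weights_only=True on torch 2.3 and above; compare numeric (major, minor) parts,
    # because comparing version strings would rank "2.10" below "2.3"
    torch_major_minor = tuple(int(part) for part in torch.__version__.split("+")[0].split(".")[:2])
    use_weights_only = torch_major_minor >= (2, 3)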
    state_dict = torch.load(model_path, map_location=DEVICE, weights_only=use_weights_only)
    model.load_state_dict(state_dict)
    
    model.eval()    # set the model to evaluation mode
    
    # create the pad token if not already present or if it is same as eos token
    # this is important for the model to work properly in inference mode
    if model.tokenizer.pad_token is None or model.tokenizer.pad_token == model.tokenizer.eos_token:
        model.tokenizer.pad_token = "[PAD]"
        
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    
    for occlusion_level in occlusion_levels:
        print(f"Generating Captions for Occlusion Level: {occlusion_level}%")
        output_path = output_path_template.format(occlusion_level)
        
        captions = []
        
        for filename in tqdm(filenames, desc=f"Generating Captions for {occlusion_level}% occlusion"):
            try:
                image_path = os.path.join(test_dir, filename)
                image = Image.open(image_path).convert("RGB")
                
                occluded_image = Image.fromarray(occlude_image(image, occlusion_level))
                
                image_tensor = preprocess(occluded_image).unsqueeze(0).to(DEVICE)
                
                with torch.no_grad():
                    generated_ids = model(image_tensor)
                    generated_text = model.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
                    
                captions.append({
                    'filename': filename,
                    'generated_caption': generated_text
                })
            except Exception as e:
                print(f"Error processing image {filename}: {e}")
                captions.append({
                    'filename': filename,
                    'generated_caption': ""
                })
                
            if len(captions) % 100 == 0:
                temp_df = pd.DataFrame(captions)
                temp_path = f"temp_{output_path}"
                temp_df.to_csv(temp_path, index=False)
                print(f"Saved intermediate results to {temp_path} ({len(captions)} / {len(filenames)})")
                
        df = pd.DataFrame(captions)
        df.to_csv(output_path, index=False)
        print(f"Captions saved to {output_path}")

In [33]:
def evaluate_on_occluded_images(model_name, test_csv_path, baseline_csv_path, occluded_captions_template, occlusion_levels):
    """
    Evaluate after occluding images
    Args:
        model_name (str): Name of the model used for evaluation.
        test_csv_path (str): Path to the test CSV file containing ground truth captions.
        baseline_csv_path (str): Path to the CSV file containing baseline captions.
        occluded_captions_template (str): Template for occluded captions CSV file names.
        occlusion_levels (list): List of occlusion levels (in percentage).
    Returns:
        dict: Dictionary containing the evaluation metrics for each occlusion level.
    """
    
    print(f"Evaluating {model_name} Baseline performance (0 occlusion)")
    baseline_metrics = evaluate_model(test_csv_path, baseline_csv_path)
    
    results = {
        "model": model_name,
        "baseline": baseline_metrics,
        "occlusion_results": {}
    }
    
    for occlusion_level in occlusion_levels:
        print(f"Evaluating {model_name} for {occlusion_level}% Occlusion...")
        occluded_csv_path = occluded_captions_template.format(occlusion_level)
        
        occluded_metrics = evaluate_model(test_csv_path, occluded_csv_path)
        
        changes = {
            "BLEU_change": occluded_metrics["BLEU"] - baseline_metrics["BLEU"],
            "ROUGE-L_change": occluded_metrics["ROUGE-L"] - baseline_metrics["ROUGE-L"],
            "METEOR_change": occluded_metrics["METEOR"] - baseline_metrics["METEOR"]
        }
        
        results["occlusion_results"][occlusion_level] = {
            "metrics": occluded_metrics,
            "changes": changes
        }
        
    return results
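
The `changes` dictionary above stores occluded minus baseline, so a drop in quality shows up as a negative number, whereas `evaluate_and_save_results` in the next cell stores baseline minus occluded. A minimal sketch of the two sign conventions follows; the scores are made-up placeholders (not results from this notebook) and `metric_deltas` is a hypothetical helper.

```python
def metric_deltas(baseline, occluded, convention="occluded_minus_baseline"):
    """Per-metric difference between two score dictionaries.

    Hypothetical helper. `convention` selects the sign:
      - "occluded_minus_baseline": a negative value means the score dropped
      - "baseline_minus_occluded": a positive value means the score dropped
    """
    if convention not in ("occluded_minus_baseline", "baseline_minus_occluded"):
        raise ValueError(f"unknown convention: {convention}")
    sign = 1.0 if convention == "occluded_minus_baseline" else -1.0
    return {name: sign * (occluded[name] - baseline[name]) for name in baseline}


if __name__ == "__main__":
    # Placeholder numbers chosen only to illustrate the sign
    baseline = {"BLEU": 0.50, "ROUGE-L": 0.25, "METEOR": 0.75}
    occluded = {"BLEU": 0.25, "ROUGE-L": 0.25, "METEOR": 0.50}
    print(metric_deltas(baseline, occluded))
    print(metric_deltas(baseline, occluded, "baseline_minus_occluded"))
```

Whichever convention is used, it helps to state it next to any table or plot built from these values, since the two functions in this part differ in sign.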

In [46]:
def evaluate_and_save_results(test_csv_path, generated_dir, output_csv_path, partc_csv_path):
    """
    Evaluate both models at every occlusion level (0% is the baseline) and save the obtained results.
    Args:
        test_csv_path (str): Path to the test CSV file containing ground truth captions.
        generated_dir (str): Directory containing the generated captions.
        output_csv_path (str): Path to save the evaluation results.
        partc_csv_path (str): Path to save the caption-pair CSV used in Part C.
    """
    test_df = pd.read_csv(test_csv_path)
    
    results_list = []
    partc_list = []
    
    occlusion_levels = [0, 10, 50, 80]
    models = ["custom_model", "smolvlm"]
    
    baseline_scores = {}
    
    for model in models:
        for occlusion_level in occlusion_levels:
            generated_csv_path = os.path.join(generated_dir, f"{model}_captions_{occlusion_level}.csv")
            
            if not os.path.exists(generated_csv_path):
                print(f"File not found: {generated_csv_path}")
                continue
            
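            # empty captions are read back as NaN; restore them to "" so .lower() below does not fail
            generated_df = pd.read_csv(generated_csv_path).fillna({'generated_caption': ''})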
            merged_df = pd.merge(test_df, generated_df, on='filename', how='inner')
            
            if len(merged_df) == 0:
                print(f"No matching filenames found between test and generated dataframes for {model} at {occlusion_level}% occlusion.")
                continue
            
            references = []
            hypotheses = []
            
            for _, row in merged_df.iterrows():
                reference = nltk.word_tokenize(row['caption'].lower())
                hypothesis = nltk.word_tokenize(row['generated_caption'].lower())
                
                references.append([reference])
                hypotheses.append(hypothesis)
                
                partc_list.append([
                    row['caption'], row['generated_caption'], occlusion_level, model
                ])
                
            # BLEU Score
            smooth = SmoothingFunction().method1
            bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
            
            # ROUGE-L Score
            rouge = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
            rouge_scores = [
                rouge.score(row['caption'], row['generated_caption'])['rougeL'].fmeasure
                for _, row in merged_df.iterrows()
            ]
            rouge_l_score = np.mean(rouge_scores)
            
            # METEOR Score
            meteor_scores = [
                meteor_score([nltk.word_tokenize(row['caption'].lower())],
                             nltk.word_tokenize(row['generated_caption'].lower())) 
                for _, row in merged_df.iterrows()
            ]
            meteor_score_avg = np.mean(meteor_scores)
            
            if occlusion_level == 0:
                baseline_scores[model] = {
                    'BLEU': bleu_score,
                    'ROUGE-L': rouge_l_score,
                    'METEOR': meteor_score_avg
                }
                
            differences = {
                "BLEU_diff": baseline_scores[model]["BLEU"] - bleu_score if occlusion_level != 0 else 0,
                "ROUGE-L_diff": baseline_scores[model]["ROUGE-L"] - rouge_l_score if occlusion_level != 0 else 0,
                "METEOR_diff": baseline_scores[model]["METEOR"] - meteor_score_avg if occlusion_level != 0 else 0,
            }
            
            results_list.append([
                model, occlusion_level, bleu_score, rouge_l_score, meteor_score_avg,
                differences["BLEU_diff"], differences["ROUGE-L_diff"], differences["METEOR_diff"]
            ])
            
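        # NOTE: this block is indented inside the `for model in models` loop, so both CSVs are
        # rewritten (and both messages printed) once per model; the final files contain all rows
        results_df = pd.DataFrame(results_list, columns=[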
            "Model", "Occlusion Level", "BLEU", "ROUGE-L", "METEOR", 
            "BLEU Difference", "ROUGE-L Difference", "METEOR Difference"
        ])
        results_df.to_csv(output_csv_path, index=False)
        print(f"Results saved to {output_csv_path}")
        
        # for further Part C of the assignment
        partc_df = pd.DataFrame(partc_list, columns=["original_caption", "generated_caption", "occlusion_level", "model"])
        partc_df.to_csv(partc_csv_path, index=False)
        print(f"Part C required csv saved to {partc_csv_path}")
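
ROUGE-L, as used above, scores a caption by the longest common subsequence (LCS) it shares with the reference. The sketch below is a simplified, dependency-free illustration with whitespace tokenisation and no stemming, so its numbers will not exactly match `rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)`; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """Simplified ROUGE-L F-measure on lowercased, whitespace-split tokens."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)   # share of the hypothesis that is in the LCS
    recall = lcs / len(ref)      # share of the reference that is in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # LCS is "a man a bike" (4 tokens); reference has 5 tokens, hypothesis 6
    print(rouge_l_f1("a man rides a bike", "a man is riding a bike"))
```

Unlike BLEU, the LCS does not require the matched words to be adjacent, only to appear in the same order.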

All the function calls and class instances required for Part B

In [36]:
# Each step is run in its own cell rather than from a single main(), so that a failure in one step
# does not force rerunning the others and each part can be debugged separately.
test_dir = "/kaggle/input/dataset/Dataset/test"             # change as required
test_csv_path = "/kaggle/input/dataset/Dataset/test.csv"    # change as required
model_path = "/kaggle/working/best_custom_model.pth"        # change as required

In [37]:
occlusion_levels = [10, 50, 80] # as mentioned in the assignment

In [38]:
generate_smolvlm_captions(test_dir, occlusion_levels)

Loading SmolVLm Model and Processor...
Model and Processor loaded Successfully!
ProcessingOcclusion Level: 10%


Generating captions for 10% occlusion: 100%|██████████| 928/928 [1:12:03<00:00,  4.66s/it]


Captions saved to smolvlm_captions_10.csv
Captions for 10% occlusion completed successfully!
ProcessingOcclusion Level: 50%


Generating captions for 50% occlusion: 100%|██████████| 928/928 [1:11:16<00:00,  4.61s/it]


Captions saved to smolvlm_captions_50.csv
Captions for 50% occlusion completed successfully!
ProcessingOcclusion Level: 80%


Generating captions for 80% occlusion: 100%|██████████| 928/928 [1:11:35<00:00,  4.63s/it]

Captions saved to smolvlm_captions_80.csv
Captions for 80% occlusion completed successfully!





In [40]:
generate_custom_model_captions(test_dir, occlusion_levels, model_path)

Loading the trained custom model...


Some weights of ViTModel were not initialized from the model checkpoint at WinKawaks/vit-small-patch16-224 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_attn.bias', 'transformer.h.1.crossattention.c_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.bias', '

Generating Captions for Occlusion Level: 10%


Generating Captions for 10% occlusion:  11%|█         | 100/928 [01:13<10:00,  1.38it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (100 / 928)


Generating Captions for 10% occlusion:  22%|██▏       | 200/928 [02:27<09:03,  1.34it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (200 / 928)


Generating Captions for 10% occlusion:  32%|███▏      | 300/928 [03:41<07:46,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (300 / 928)


Generating Captions for 10% occlusion:  43%|████▎     | 400/928 [04:55<06:32,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (400 / 928)


Generating Captions for 10% occlusion:  54%|█████▍    | 500/928 [06:09<05:17,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (500 / 928)


Generating Captions for 10% occlusion:  65%|██████▍   | 600/928 [07:23<04:02,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (600 / 928)


Generating Captions for 10% occlusion:  75%|███████▌  | 700/928 [08:37<02:49,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (700 / 928)


Generating Captions for 10% occlusion:  86%|████████▌ | 800/928 [09:51<01:35,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (800 / 928)


Generating Captions for 10% occlusion:  97%|█████████▋| 900/928 [11:05<00:20,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_10.csv (900 / 928)


Generating Captions for 10% occlusion: 100%|██████████| 928/928 [11:26<00:00,  1.35it/s]


Captions saved to custom_model_captions_10.csv
Generating Captions for Occlusion Level: 50%


Generating Captions for 50% occlusion:  11%|█         | 100/928 [01:13<10:07,  1.36it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (100 / 928)


Generating Captions for 50% occlusion:  22%|██▏       | 200/928 [02:27<09:07,  1.33it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (200 / 928)


Generating Captions for 50% occlusion:  32%|███▏      | 300/928 [03:41<07:46,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (300 / 928)


Generating Captions for 50% occlusion:  43%|████▎     | 400/928 [04:55<06:28,  1.36it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (400 / 928)


Generating Captions for 50% occlusion:  54%|█████▍    | 500/928 [06:09<05:17,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (500 / 928)


Generating Captions for 50% occlusion:  65%|██████▍   | 600/928 [07:23<04:02,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (600 / 928)


Generating Captions for 50% occlusion:  75%|███████▌  | 700/928 [08:37<02:48,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (700 / 928)


Generating Captions for 50% occlusion:  86%|████████▌ | 800/928 [09:51<01:35,  1.34it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (800 / 928)


Generating Captions for 50% occlusion:  97%|█████████▋| 900/928 [11:05<00:20,  1.36it/s]

Saved intermediate results to temp_custom_model_captions_50.csv (900 / 928)


Generating Captions for 50% occlusion: 100%|██████████| 928/928 [11:25<00:00,  1.35it/s]


Captions saved to custom_model_captions_50.csv
Generating Captions for Occlusion Level: 80%


Generating Captions for 80% occlusion:  11%|█         | 100/928 [01:14<10:16,  1.34it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (100 / 928)


Generating Captions for 80% occlusion:  22%|██▏       | 200/928 [02:27<08:56,  1.36it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (200 / 928)


Generating Captions for 80% occlusion:  32%|███▏      | 300/928 [03:42<07:43,  1.36it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (300 / 928)


Generating Captions for 80% occlusion:  43%|████▎     | 400/928 [04:56<06:30,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (400 / 928)


Generating Captions for 80% occlusion:  54%|█████▍    | 500/928 [06:10<05:16,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (500 / 928)


Generating Captions for 80% occlusion:  65%|██████▍   | 600/928 [07:24<04:02,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (600 / 928)


Generating Captions for 80% occlusion:  75%|███████▌  | 700/928 [08:38<02:48,  1.35it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (700 / 928)


Generating Captions for 80% occlusion:  86%|████████▌ | 800/928 [09:52<01:36,  1.32it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (800 / 928)


Generating Captions for 80% occlusion:  97%|█████████▋| 900/928 [11:06<00:20,  1.34it/s]

Saved intermediate results to temp_custom_model_captions_80.csv (900 / 928)


Generating Captions for 80% occlusion: 100%|██████████| 928/928 [11:27<00:00,  1.35it/s]

Captions saved to custom_model_captions_80.csv





In [47]:
# the paths have been given wrt Kaggle Notebook, change as required
evaluate_and_save_results("/kaggle/input/dataset/Dataset/test.csv", "/kaggle/working/", "/kaggle/working/overall_results.csv", "/kaggle/working/partc.csv")

Results saved to /kaggle/working/overall_results.csv
Part C required csv saved to /kaggle/working/partc.csv
Results saved to /kaggle/working/overall_results.csv
Part C required csv saved to /kaggle/working/partc.csv


In [49]:
# FOR BERT SCORE ANALYSIS
test_df = pd.read_csv('/kaggle/input/dataset/Dataset/test.csv') # change as required
print(f"Test DataFrame shape: {test_df.shape}")
occlusion_levels = [0, 10, 50, 80]
results = {
    'model' : [],
    'occlusion_level' : [],
    'precision' : [],
    'recall' : [],
    'f1' : []
}

Test DataFrame shape: (928, 3)


This separate section computes BERTScore for all the generated captions.

In [50]:
def calculate_bert_score(references, candidates, model_type='bert-base-uncased'):
    """
    Calculate BERTScore between the reference and candidate sentences.
    Args:
        references (list): List of reference sentences.
        candidates (list): List of candidate sentences, in the same order as the references.
        model_type (str): BERT model type. Defaults to 'bert-base-uncased'.
    Returns:
        tuple: Per-sentence precision, recall and F1 as NumPy arrays.
    """
    P, R, F1 = score(candidates, references, lang='en', model_type=model_type, device=DEVICE)
    return P.numpy(), R.numpy(), F1.numpy()
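
BERTScore matches each token embedding in one sentence to its most similar token in the other: precision averages the best match for every candidate token, recall does the same for every reference token, and F1 is their harmonic mean. The toy sketch below uses hand-written 2-D vectors in place of BERT embeddings and leaves out the optional IDF weighting and baseline rescaling; `cosine` and `greedy_match_scores` are hypothetical names, not part of the `bert_score` package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    # precision: each candidate token is matched to its closest reference token
    precision = sum(max(cosine(c, r) for r in reference_vecs)
                    for c in candidate_vecs) / len(candidate_vecs)
    # recall: each reference token is matched to its closest candidate token
    recall = sum(max(cosine(r, c) for c in candidate_vecs)
                 for r in reference_vecs) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy "embeddings": the candidate covers only one of the two reference tokens
    reference = [[1.0, 0.0], [0.0, 1.0]]
    candidate = [[1.0, 0.0]]
    print(greedy_match_scores(candidate, reference))
```

In this toy case the candidate contains nothing that is absent from the reference, so precision is high, but it covers only half of the reference tokens, so recall is lower.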

In [52]:
for ocl in occlusion_levels:
    smolvlm_captions = f"smolvlm_captions_{ocl}.csv" # change as required
    custom_model_captions = f"custom_model_captions_{ocl}.csv" # change as required
    
    if os.path.exists(smolvlm_captions):
        smolvlm_df = pd.read_csv(smolvlm_captions)
        print(f"Processing SmolVLM captions with occlusion level {ocl}%")
        
        references = test_df['caption'].tolist()
        candidates = smolvlm_df['generated_caption'].tolist()
        
        precision, recall, f1 = calculate_bert_score(references, candidates)
        
        results['model'].append('SmolVLM')
        results['occlusion_level'].append(ocl)
        results['precision'].append(np.mean(precision))
        results['recall'].append(np.mean(recall))
        results['f1'].append(np.mean(f1))
        
        print(f"Precision: {np.mean(precision)}, Recall: {np.mean(recall)}, F1: {np.mean(f1)}")
    else:
        print(f"File not found: {smolvlm_captions}")
        
    
    if os.path.exists(custom_model_captions):
        custom_model_df = pd.read_csv(custom_model_captions)
        print(f"Processing Custom Model captions with occlusion level {ocl}%")
        
        references = test_df['caption'].tolist()
        candidates = custom_model_df['generated_caption'].tolist()
        
        precision, recall, f1 = calculate_bert_score(references, candidates)
        
        results['model'].append('Custom Model')
        results['occlusion_level'].append(ocl)
        results['precision'].append(np.mean(precision))
        results['recall'].append(np.mean(recall))
        results['f1'].append(np.mean(f1))
        
        print(f"Precision: {np.mean(precision)}, Recall: {np.mean(recall)}, F1: {np.mean(f1)}")
    else:
        print(f"File not found: {custom_model_captions}")

Processing SmolVLM captions with occlusion level 0%


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Precision: 0.4970172345638275, Recall: 0.5353716015815735, F1: 0.5149171948432922
Processing Custom Model captions with occlusion level 0%
Precision: 0.561847448348999, Recall: 0.5097543001174927, F1: 0.5337085127830505
Processing SmolVLM captions with occlusion level 10%
Precision: 0.49908581376075745, Recall: 0.5365116000175476, F1: 0.5165361166000366
Processing Custom Model captions with occlusion level 10%
Precision: 0.56614750623703, Recall: 0.5117161870002747, F1: 0.5367445945739746
Processing SmolVLM captions with occlusion level 50%
Precision: 0.4868716597557068, Recall: 0.5268229842185974, F1: 0.5054410696029663
Processing Custom Model captions with occlusion level 50%
Precision: 0.5678975582122803, Recall: 0.5069512128829956, F1: 0.5346135497093201
Processing SmolVLM captions with occlusion level 80%
Precision: 0.4533844590187073, Recall: 0.4975966513156891, F1: 0.4738430380821228
Processing Custom Model captions with occlusion level 80%
Precision: 0.5716530084609985, Recall:

In [53]:
bert_results_df = pd.DataFrame(results)
bert_results_df.to_csv("bert_score_results.csv", index=False)
print("BERT Score results saved to bert_score_results.csv")
print("BERT Score Summary:")
print(bert_results_df.groupby(['model', 'occlusion_level']).agg({'precision': 'mean', 'recall': 'mean', 'f1': 'mean'}).reset_index())
print("BERT Score Analysis Completed!")

BERT Score results saved to bert_score_results.csv
BERT Score Summary:
          model  occlusion_level  precision    recall        f1
0  Custom Model                0   0.561847  0.509754  0.533709
1  Custom Model               10   0.566148  0.511716  0.536745
2  Custom Model               50   0.567898  0.506951  0.534614
3  Custom Model               80   0.571653  0.505543  0.535693
4       SmolVLM                0   0.497017  0.535372  0.514917
5       SmolVLM               10   0.499086  0.536512  0.516536
6       SmolVLM               50   0.486872  0.526823  0.505441
7       SmolVLM               80   0.453384  0.497597  0.473843
BERT Score Analysis Completed!
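
The BERTScore loop above pairs `test_df['caption']` with `generated_caption` by row position, which is only correct if both files list the images in the same order. A more defensive option is to pair them by `filename`, as `evaluate_and_save_results` does with `pd.merge`. The dependency-free sketch below illustrates that idea; the rows are invented examples and `pair_by_filename` is a hypothetical helper.

```python
def pair_by_filename(test_rows, generated_rows):
    """Return (references, candidates) aligned on the 'filename' key.

    Rows are dicts such as those yielded by csv.DictReader. Files present in
    only one of the two inputs are skipped, like an inner join.
    """
    generated = {row["filename"]: row["generated_caption"] for row in generated_rows}
    references, candidates = [], []
    for row in test_rows:
        caption = generated.get(row["filename"])
        if caption is None:
            continue  # no generated caption for this image
        references.append(row["caption"])
        candidates.append(caption)
    return references, candidates


if __name__ == "__main__":
    # Invented rows, deliberately listed in different orders
    test_rows = [
        {"filename": "b.jpg", "caption": "a red bus"},
        {"filename": "a.jpg", "caption": "a man skiing"},
    ]
    generated_rows = [
        {"filename": "a.jpg", "generated_caption": "a person on skis"},
        {"filename": "b.jpg", "generated_caption": "a bus on a street"},
    ]
    refs, cands = pair_by_filename(test_rows, generated_rows)
    for ref, cand in zip(refs, cands):
        print(ref, "<->", cand)
```

With pandas, the same alignment is an inner merge on `filename` followed by taking the two caption columns from the merged frame.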


#### PART C - Building a BERT-based Classifier for Model Identification 

This part of the code contains the following functions/classes:<br>
    - class CaptionClassifier(nn.Module)<br>
    - class CaptionDataset(Dataset)<br>
    - def train_classifier(model, train_dataloader, val_dataloader, optimizer, criterion, device, epochs=10)<br>
    - def evaluate_classifier(model, test_dataloader, device)<br>
    - def run_part_c()<br>
    - def prepare_dataframe(df)<br>


In [54]:
df = pd.read_csv("/kaggle/working/partc.csv")
df['model'] = df['model'].replace({'custom_model': 'Model B', 'smolvlm': 'Model A'})
df.head(10)

Unnamed: 0,original_caption,generated_caption,occlusion_level,model
0,A large building with bars on the windows in f...,This is an image of a street. The street is ma...,0,Model B
1,A person is skiing through the snow. There is ...,A man is standing on top of a snowboard. He is...,0,Model B
2,There is a bed in a room against a wall. There...,This is an image of a bed. The bed is made of ...,0,Model B
3,A black and red train is on the tracks and has...,This is an image of a train station. The train...,0,Model B
4,A white and yellow public transportation bus w...,This is an image of a bus. The bus is white. T...,0,Model B
5,A large white house with a brown door sits beh...,This is an image of a sheep. The sheep is stan...,0,Model B
6,A man in a red and yellow t-shirt is holding a...,This is an image of a man and woman. The man i...,0,Model B
7,There are three men riding bicycles. Two of th...,A man is riding a bike on a road. He is wearin...,0,Model B
8,TWO CELL PHONES ON THE TABLE.THEY BOTH ARE DAR...,This is an image of a clock. The clock is blac...,0,Model B
9,A painted mannequin head sits on a table. It i...,This is an image of a clock. The clock is blac...,0,Model B


In [55]:
class CaptionClassifier(nn.Module):
    """
    BERT-based classifier for identifying which model generated the caption
    """
    def __init__(self, pretrained_model='google-bert/bert-base-uncased', num_classes=2):
        super(CaptionClassifier, self).__init__()
        
        self.bert = BertModel.from_pretrained(pretrained_model)
        self.dropout = nn.Dropout(0.1)
        self.fc1 = nn.Linear(self.bert.config.hidden_size, 256)
        self.fc2 = nn.Linear(256, num_classes)
        self.relu = nn.ReLU()
        
    def forward(self, input_ids, attention_mask):
        """
        Forward pass of the model
        Args:
            input_ids (torch.Tensor): Input IDs for the BERT model.
            attention_mask (torch.Tensor): Attention mask for the BERT model.
        Returns:
            torch.Tensor: Output logits from the classifier.
        """
        
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        
        x = self.dropout(pooled_output)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x

In [56]:
class CaptionDataset(Dataset):
    """
    Dataset for caption classification task
    """
    def __init__(self, dataframe, tokenizer, max_length=128):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        
        self.label_map = {'Model A': 0, 'Model B': 1}
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        
        text = f"{row['original_caption']} [SEP] {row['generated_caption']} [SEP] {row['occlusion_level']}"
        
        encoding = self.tokenizer(
            text, add_special_tokens=True,
            max_length=self.max_length, padding='max_length',
            truncation=True, return_tensors='pt'
        )
        
        label = self.label_map[row['model']]
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

In [57]:
def train_classifier(model, train_dataloader, val_dataloader, optimizer, criterion, device, epochs=10):
    """
    Train the BERT-based classifier and restore the weights with the lowest validation loss.
    Args:
        model (nn.Module): The caption classifier model.
        train_dataloader (DataLoader): DataLoader for training data.
        val_dataloader (DataLoader): DataLoader for validation data.
        optimizer (torch.optim.Optimizer): Optimizer for training.
        criterion (nn.Module): Loss function.
        device (str): Device to train on ("cuda" or "cpu").
        epochs (int, optional): Number of epochs to train. Defaults to 10.
    Returns:
        nn.Module: The model loaded with the best validation-loss weights.
    """
    
    best_val_loss = float('inf')
    best_model_state = None
    
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        train_pbar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{epochs} [Training]")
        
        for batch in train_pbar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
            train_pbar.set_postfix({'loss': f"{loss.item():.4f}"})
        
        avg_train_loss = train_loss / len(train_dataloader)
        
        model.eval()
        val_loss = 0
        val_pbar = tqdm(val_dataloader, desc=f"Epoch {epoch+1}/{epochs} [Validation]")
        
        with torch.no_grad():
            for batch in val_pbar:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)
                
                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs, labels)
                
                val_loss += loss.item()
                val_pbar.set_postfix({'loss': f"{loss.item():.4f}"})
        
        avg_val_loss = val_loss / len(val_dataloader)
        
        print(f"Epoch {epoch+1}/{epochs} - Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")
        
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
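            # dict.copy() is shallow: its tensors would keep tracking the live weights,
            # so clone each tensor to take a real snapshot of this epoch
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}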
            print(f"New best model saved with validation loss: {best_val_loss:.4f}")
    
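    # Restore the best weights (no snapshot exists if the loss never improved, e.g. NaN losses)
    if best_model_state is not None:
        model.load_state_dict(best_model_state)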
    return model

In [73]:
def evaluate_classifier(model, test_dataloader, device):
    """
    Evaluate the classifier
    Args:
        model (nn.Module): The caption classifier model.
        test_dataloader (DataLoader): DataLoader for test data.
        device (str or torch.device): Device to evaluate on ("cuda" or "cpu").
    Returns:
        dict: Dictionary containing evaluation metrics.
    """
    model.eval()
    all_preds = []
    all_labels = []
    
    test_pbar = tqdm(test_dataloader, desc="Evaluating Classifier")
    
    with torch.no_grad():
        for batch in test_pbar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            outputs = model(input_ids, attention_mask)
            _, preds = torch.max(outputs, dim=1)

            # Ensure data is moved to CPU and converted to numpy
            all_preds.extend(preds.cpu().numpy().tolist())
            all_labels.extend(labels.cpu().numpy().tolist())

    # Sanity check: make sure lengths match and not empty
    if len(all_preds) == 0 or len(all_labels) == 0:
        raise ValueError("No predictions or labels found. Ensure your test_dataloader has data and model outputs correctly.")

    if len(all_preds) != len(all_labels):
        raise ValueError(f"Mismatch in predictions and labels: {len(all_preds)} preds vs {len(all_labels)} labels.")

    precision, recall, f1, _ = precision_recall_fscore_support(
        all_labels, all_preds, average='macro', zero_division=0
    )
    
    results = {
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }

    print(f"Evaluation Results - Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")
    return results
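
For readers unfamiliar with `average='macro'`, the sketch below recomputes macro-averaged precision, recall and F1 by hand on a toy set of labels. The labels and predictions are invented, and `macro_prf` is a hypothetical helper written for illustration; it is not the scikit-learn implementation, only the same arithmetic for the two-class case.

```python
def macro_prf(labels, preds, classes=(0, 1)):
    """Score each class separately, then take the unweighted mean over classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        # Mirror zero_division=0: an undefined ratio counts as 0
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: 0 = 'Model A', 1 = 'Model B'
labels = [0, 0, 0, 1, 1, 1, 1, 1]
preds = [0, 0, 1, 1, 1, 1, 1, 0]

precision, recall, f1 = macro_prf(labels, preds)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")
```

Because every class carries equal weight regardless of how many samples it has, macro averaging does not let the larger class dominate the score.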

In [74]:
# Part C is small, so a single function runs the whole pipeline
def run_part_c():
    # set seed
    torch.manual_seed(42)
    np.random.seed(42)
    
    print(f"Using Device: {DEVICE}")
    
    tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')
    full_dataset = CaptionDataset(df, tokenizer=tokenizer)
    
    dataset_size = len(full_dataset)
    train_size = int(0.7 * dataset_size)
    val_size = int(0.1 * dataset_size)
    test_size = dataset_size - train_size - val_size
    
    print(f"Total Dataset Size: {dataset_size}")
    print(f"Train size: {train_size}, Val size: {val_size}, Test size: {test_size}")
    
    train_dataset, val_dataset, test_dataset = random_split(
        full_dataset, [train_size, val_size, test_size]
    )
    
    batch_size = 16 # can be varied according to GPU constraints
    
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
    
    model = CaptionClassifier().to(DEVICE)
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    criterion = nn.CrossEntropyLoss()
    
    print("Training Model...")
    model = train_classifier(model, train_dataloader, val_dataloader, optimizer, criterion, DEVICE, epochs=10)
    print("Model Training Finished for the Classifier")
    
    print("Evaluating Model...")
    results = evaluate_classifier(model, test_dataloader, DEVICE)
    
    return model, results
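
The split arithmetic in `run_part_c` can be checked by hand. The sketch below reproduces it for the 7424-row dataset reported in the run output, along with the number of batches per epoch at `batch_size = 16`. The helper names `split_sizes` and `num_batches` are hypothetical, and the batch count assumes the `DataLoader` default of keeping the final partial batch.

```python
import math


def split_sizes(n, train_frac=0.7, val_frac=0.1):
    """Same arithmetic as run_part_c: truncate train/val, give the remainder to test."""
    train = int(train_frac * n)
    val = int(val_frac * n)
    return train, val, n - train - val


def num_batches(n_samples, batch_size):
    """Batches per epoch when the last, smaller batch is kept (drop_last=False)."""
    return math.ceil(n_samples / batch_size)


train, val, test = split_sizes(7424)
batches = [num_batches(size, 16) for size in (train, val, test)]
print(train, val, test)
print(batches)
```

Since the test size is computed as the remainder, the three sizes always sum to the dataset size even when the fractions do not divide it evenly.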

In [75]:
def prepare_dataframe(df):
    required_cols = ['original_caption', 'generated_caption', 'occlusion_level', 'model']
    for col in required_cols:
        if col not in df.columns:
            raise ValueError(f"Required column '{col}' is missing from your dataframe")
        
    df['occlusion_level'] = df['occlusion_level'].astype(str)
         
    if not set(df['model']).issubset({'Model A', 'Model B'}):
        raise ValueError("Model column should only contain 'Model A' and 'Model B' values")
    
    return df

Running all of Part C at once from the `__main__` block

In [77]:
if __name__ == "__main__":
    df = prepare_dataframe(df)
    
    # run for Part C
    model, results = run_part_c()
    
    print("\nFinal Results:")
    print(f"Precision: {results['precision']:.4f}")
    print(f"Recall: {results['recall']:.4f}")
    print(f"F1 Score: {results['f1_score']:.4f}")
    
    # save the results to csv
    results_df = pd.DataFrame({
        'metric': ['precision', 'recall', 'f1'],
        'value': [results['precision'], results['recall'], results['f1_score']]
    })
    results_df.to_csv('part_c_results.csv', index=False)
    torch.save(model.state_dict(), "caption_classifier_model.pth")
    print("Results and Model saved successfully!")

Using Device: cuda
Total Dataset Size: 7424
Train size: 5196, Val size: 742, Test size: 1486
Training Model...


Epoch 1/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.21it/s, loss=0.0531]
Epoch 1/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.25it/s, loss=0.0022]


Epoch 1/10 - Train Loss: 0.0914, Val Loss: 0.0392
New best model saved with validation loss: 0.0392


Epoch 2/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.21it/s, loss=0.0575]
Epoch 2/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.11it/s, loss=0.0009]


Epoch 2/10 - Train Loss: 0.0318, Val Loss: 0.0414


Epoch 3/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.19it/s, loss=0.1372]
Epoch 3/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 10.99it/s, loss=0.0003]


Epoch 3/10 - Train Loss: 0.0283, Val Loss: 0.0593


Epoch 4/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.19it/s, loss=0.0005]
Epoch 4/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 10.83it/s, loss=0.0004]


Epoch 4/10 - Train Loss: 0.0307, Val Loss: 0.0453


Epoch 5/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.19it/s, loss=0.0004]
Epoch 5/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.03it/s, loss=0.0003]


Epoch 5/10 - Train Loss: 0.0283, Val Loss: 0.0382
New best model saved with validation loss: 0.0382


Epoch 6/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.19it/s, loss=0.0002]
Epoch 6/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.09it/s, loss=0.0002]


Epoch 6/10 - Train Loss: 0.0254, Val Loss: 0.0574


Epoch 7/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.19it/s, loss=0.0002]
Epoch 7/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.02it/s, loss=0.0002]


Epoch 7/10 - Train Loss: 0.0249, Val Loss: 0.0568


Epoch 8/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.20it/s, loss=0.0001]
Epoch 8/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.16it/s, loss=0.0001]


Epoch 8/10 - Train Loss: 0.0252, Val Loss: 0.0605


Epoch 9/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.22it/s, loss=0.0001]
Epoch 9/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.14it/s, loss=0.0001]


Epoch 9/10 - Train Loss: 0.0240, Val Loss: 0.0646


Epoch 10/10 [Training]: 100%|██████████| 325/325 [01:17<00:00,  4.22it/s, loss=0.0001]
Epoch 10/10 [Validation]: 100%|██████████| 47/47 [00:04<00:00, 11.30it/s, loss=0.0001]


Epoch 10/10 - Train Loss: 0.0267, Val Loss: 0.0748
Model Training Finished for the Classifier
Evaluating Model...


Evaluating Classifier: 100%|██████████| 93/93 [00:08<00:00, 11.26it/s]


Evaluation Results - Precision: 0.9732, Recall: 0.9731, F1 Score: 0.9731

Final Results:
Precision: 0.9732
Recall: 0.9731
F1 Score: 0.9731
Results and Model saved successfully!
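
The log above shows the validation loss reaching its minimum at epoch 5 and drifting upward afterwards while the training loss keeps falling. As a rough sketch, the snippet below takes the logged validation losses and reports which epoch holds the best checkpoint and where a patience-based early-stopping rule would have halted. Early stopping is not part of `train_classifier`; `early_stop_epoch` and its `patience` parameter are assumptions introduced only for illustration.

```python
# Per-epoch validation losses copied from the training log
val_losses = [0.0392, 0.0414, 0.0593, 0.0453, 0.0382,
              0.0574, 0.0568, 0.0605, 0.0646, 0.0748]


def best_epoch(losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1


def early_stop_epoch(losses, patience):
    """1-based epoch at which a hypothetical patience rule would stop training."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(losses)


best = best_epoch(val_losses)
stop_p3 = early_stop_epoch(val_losses, patience=3)
stop_p4 = early_stop_epoch(val_losses, patience=4)
print(best, stop_p3, stop_p4)
```

With a patience of 3 the rule would stop at epoch 4 and miss the slightly lower loss at epoch 5 (0.0382 against 0.0392), while a patience of 4 would reach epoch 5 and stop at epoch 9. Either setting would cut training time, at the cost of one more hyperparameter to tune.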
