# Lab 1 — Tiny VLM Adversarial Cost Challenge

## Overview

Welcome to the ML Security Lab! In this lab, you'll implement adversarial attacks against a Vision-Language Model (VLM) using the custom dataset and a frozen TinyCLIP scorer.

### Objective
Your task is to build an attack function that can manipulate either:
- **Caption tokens** (text modifications)
- **Image pixels** (visual modifications)
- **Both** (multimodal attack)

The goal is to flip the model's decision (match → no-match or vice versa) while minimizing the attack cost.

### Constraints
- **T_MAX = 10**: Maximum token edits per sample
- **P_MAX = 100**: Maximum pixel edits per sample  
- **Q_MAX = 100**: Maximum queries per sample
- **Evaluation**: Public leaderboard (1,000 val pairs) + Private leaderboard (1,000 test pairs)

### Scoring
Your attack will be evaluated based on:
1. **Success Rate**: Percentage of samples where you successfully flip the decision
2. **Cost Efficiency**: Lower total cost (token edits + pixel edits + queries) is better
3. **Attack Budget**: Must stay within the specified limits

In [2]:
import os
import zipfile

# Check if 'images' directory or 'val_pairs.json' file is missing
if not os.path.exists('images') or not os.path.exists('val_pairs.json'):
    print("Required data not found. Extracting 'data.zip'...")
    
    # Check if 'data.zip' exists
    if os.path.exists('data.zip'):
        with zipfile.ZipFile('data.zip', 'r') as zip_ref:
            zip_ref.extractall()  # Extract all files in the current directory
        print("Extraction complete!")
    else:
        print("Error: 'data.zip' not found. Please ensure the file is in the current directory.")
else:
    print("All required data is already present.")

All required data is already present.


In [29]:
# Import required libraries
import numpy as np
import pandas as pd
from PIL import Image
import PIL
import torch
import torch.nn.functional as F
from tqdm.auto import tqdm
import open_clip
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
import json
import random
import os
from typing import List, Tuple, Dict, Optional
import warnings
import requests
from urllib.parse import urlparse
import hashlib
from collections import defaultdict
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

print("All packages imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

All packages imported successfully!
PyTorch version: 2.8.0+cu128
CUDA available: True
Using device: cuda


In [6]:
# Data Loading
print("Loading validation data...")

# Load validation pairs from JSON
with open('val_pairs.json', 'r') as f:
    val_pairs = json.load(f)

print(f"Loaded {len(val_pairs)} validation pairs")

# Helper function to load images
def load_image_from_pair(pair: dict) -> Image.Image:
    """Load image from the pair dictionary using image_path"""
    return Image.open(pair['image_path']).convert('RGB')

# Sample a few pairs to verify data loading
print("\nSample validation pairs:")
for i in range(3):
    pair = val_pairs[i]
    print(f"  Image ID: {pair['image_id']}, Path: {pair['image_path']}")
    print(f"  Caption: {pair['caption'][:50]}..., Match: {pair['is_match']}")
    print()

print(f"Data distribution:")
labels = [pair['is_match'] for pair in val_pairs]
print(f"  Match (True): {sum(labels)}")
print(f"  No-match (False): {len(labels) - sum(labels)}")
print(f"  Balance: {sum(labels)/len(labels):.2%} positive")

Loading validation data...
Loaded 1000 validation pairs

Sample validation pairs:
  Image ID: 38137, Path: images/val/000000119402.jpg
  Caption: A gold bus traveling on a single lane road..., Match: True

  Image ID: 15194, Path: images/val/000000404780.jpg
  Caption: Two women in the snow on skis in front of a large ..., Match: False

  Image ID: 19082, Path: images/val/000000148898.jpg
  Caption: A man jumping a brown horse over an obstacle...., Match: False

Data distribution:
  Match (True): 500
  No-match (False): 500
  Balance: 50.00% positive


In [None]:
# TinyCLIP Scorer Implementation
print("Loading CLIP model...")

# Try to load TinyCLIP, fallback to OpenCLIP ViT-B/32 if failed
try:
    # Attempt to load TinyCLIP from HuggingFace hub
    model, preprocess, tokenizer = open_clip.create_model_and_transforms(
        "hf-hub:microsoft/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"
    )
    print("Successfully loaded TinyCLIP model")
except Exception as e:
    print(f"Failed to load TinyCLIP: {e}")
    print("Falling back to OpenCLIP ViT-B/32...")

    model, preprocess, tokenizer = open_clip.create_model_and_transforms(
        "ViT-B-32", 
        pretrained="laion2b_s34b_b79k"
    )
    print("Successfully loaded OpenCLIP ViT-B/32")

# Move model to device
model = model.to(device)
model.eval()

print(f"Model loaded on: {device}")


Loading CLIP model...
Failed to load TinyCLIP: Failed initial config/weights load from HF Hub microsoft/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M: Failed to download file (open_clip_config.json) for microsoft/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M. Last error: 404 Client Error. (Request ID: Root=1-68e0baa7-1dc50171703be15951316f1c;819ab7e2-e670-4da3-9e49-09736cf8d2cc)

Repository Not Found for url: https://huggingface.co/microsoft/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M/resolve/main/open_clip_config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication
Falling back to OpenCLIP ViT-B/32...


Successfully loaded OpenCLIP ViT-B/32
Model loaded on: cuda
<class 'torchvision.transforms.transforms.Compose'>
Compose(
    RandomResizedCrop(size=(224, 224), scale=(0.9, 1.0), ratio=(0.75, 1.3333), interpolation=bicubic, antialias=True)
    <function _convert_to_rgb at 0x7fed8fc6e0c0>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)


In [9]:
def clip_embed(image: Image.Image, caption: str) -> float:
    """
    Compute cosine similarity between image and text embeddings.
    
    Args:
        image: PIL Image
        caption: Text string
    
    Returns:
        Cosine similarity score between normalized embeddings
    """
    with torch.no_grad():
        # Preprocess image
        image_tensor = preprocess(image).unsqueeze(0).to(device)
        
        # Tokenize text properly using open_clip tokenizer
        text_tokens = open_clip.tokenize([caption]).to(device)
        
        # Get embeddings
        image_features = model.encode_image(image_tensor)
        text_features = model.encode_text(text_tokens)
        
        # Normalize embeddings
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        
        # Compute cosine similarity
        similarity = (image_features @ text_features.T).item()
        
    return similarity

# Test the embedding function
print("\nTesting CLIP embedding function...")
test_pair = val_pairs[0]
test_image = load_image_from_pair(test_pair)
test_similarity = clip_embed(test_image, test_pair['caption'])
print(f"Sample similarity score: {test_similarity:.4f} (Expected match: {test_pair['is_match']})")

# Test on a few more samples
print("\nTesting on more samples:")
for i in range(3):
    pair = val_pairs[i]
    image = load_image_from_pair(pair)
    similarity = clip_embed(image, pair['caption'])
    print(f"Sample {i+1}: similarity={similarity:.4f}, match={pair['is_match']}")


Testing CLIP embedding function...
Sample similarity score: 0.3473 (Expected match: True)

Testing on more samples:
Sample 1: similarity=0.3543, match=True
Sample 2: similarity=0.0343, match=False
Sample 3: similarity=0.0779, match=False


In [36]:
# Calibration: Fit logistic regression to get alpha, beta parameters
print("Calibrating scorer with logistic regression...")

# Use first 200 samples for calibration
tune_slice = val_pairs[:200]
print(f"Using {len(tune_slice)} samples for calibration")

# Compute similarities for calibration
similarities = []
ground_truths = []

print("Computing similarities for calibration...")
for pair in tqdm(tune_slice, desc="Calibration"):
    image = load_image_from_pair(pair)
    similarity = clip_embed(image, pair['caption'])
    similarities.append(similarity)
    ground_truths.append(int(pair['is_match']))

similarities = np.array(similarities).reshape(-1, 1)
ground_truths = np.array(ground_truths)

# Fit logistic regression: sigmoid(alpha * cosine + beta)
lr = LogisticRegression()
lr.fit(similarities, ground_truths)

# Extract alpha and beta
alpha = lr.coef_[0][0]  # Coefficient for similarity
beta = lr.intercept_[0]  # Intercept

print(f"Calibration complete!")
print(f"   Alpha (slope): {alpha:.4f}")
print(f"   Beta (intercept): {beta:.4f}")

# Test calibration
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

test_similarities = similarities[:5].flatten()
test_labels = ground_truths[:5]
calibrated_probs = sigmoid(alpha * test_similarities + beta)

print(f"\nCalibration test:")
for i in range(5):
    print(f"  Sim: {test_similarities[i]:.4f} → Prob: {calibrated_probs[i]:.4f}, True: {test_labels[i]}")

Calibrating scorer with logistic regression...
Using 200 samples for calibration
Computing similarities for calibration...


Calibration: 100%|██████████| 200/200 [00:03<00:00, 64.28it/s]

Calibration complete!
   Alpha (slope): 6.6427
   Beta (intercept): -1.1133

Calibration test:
  Sim: 0.3779 → Prob: 0.8017, True: 1
  Sim: 0.0343 → Prob: 0.2921, True: 0
  Sim: 0.0812 → Prob: 0.3604, True: 0
  Sim: 0.1899 → Prob: 0.5370, True: 0
  Sim: 0.3772 → Prob: 0.8010, True: 1





In [None]:
from torchvision.transforms import RandomResizedCrop, ToTensor, Normalize, Compose

def clip_loss(image_tensor, caption: str, label) -> torch.Tensor:
    """
    Compute cosine similarity between image and text embeddings.
    Returns a torch scalar that supports autograd.
    """
    # Preprocess image (make sure to keep gradient tracking)

    # Tokenize text (token embeddings are usually non-differentiable wrt raw string)
    text_tokens = open_clip.tokenize([caption]).to(device)

    # Get embeddings (these will remain differentiable wrt image_tensor)
    image_features = model.encode_image(image_tensor)
    text_features = model.encode_text(text_tokens)

    # Normalize embeddings
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    similarity = (image_features @ text_features.T).squeeze()  # shape: scalar

    output = similarity*alpha + beta
    output = torch.sigmoid(output)
    

    # Cosine similarity
    if label == 1:
        loss = 1 - output  # want to maximize similarity
    elif label == 0:
        loss = output  # want to minimize similarity
    return loss

def calc_image_grad(image_tensor, caption: str, label) -> torch.Tensor:
    image_tensor.requires_grad_(True)  # allow gradient wrt input pixels
    loss = clip_loss(image_tensor, caption, label)
    image_grad = torch.autograd.grad(loss, [image_tensor])[0]
    return image_grad

for pair in val_pairs[:5]:
    image = load_image_from_pair(pair)
    image_tensor = torch.tensor(np.array(image)).permute(2, 0, 1).unsqueeze(0).to(device)  # Change to (C, H, W)
    # this is how it should be done according to the preprocess function
    # but its kinda wrong since the preprocess function is maybe for training
    # idk maybe we can find it out later, so the gradient can be passed through it
    #image_tensor = preprocess(image).unsqueeze(0).to(device)
    print(image_tensor.shape)
    image_grad = calc_image_grad(image_tensor, pair['caption'], pair['is_match'])
    loss = clip_loss(image_tensor, pair['caption'], pair['is_match'])

    similarity = clip_embed(image, pair['caption'])
    print(f"loss: {loss.item():.4f}, Similarity: {similarity:.4f}, Match: {pair['is_match']}")
    alpha = 10 # step size for adversarial perturbation
    adversarial_image_tensor = torch.clamp(image_tensor + alpha * image_grad.sign(), 0, 255)
    adversarial_image = PIL.Image.fromarray(adversarial_image_tensor.detach().squeeze().cpu().numpy().astype(np.uint8).transpose(1, 2, 0))
    adversarial_loss = clip_loss(adversarial_image_tensor, pair['caption'], pair['is_match'])
    adversarial_similarity = clip_embed(adversarial_image, pair['caption'])
    print(f"Adversarial loss: {adversarial_loss.item():.4f}, Adversarial Similarity: {adversarial_similarity:.4f}")


torch.Size([1, 3, 480, 640])


RuntimeError: only Tensors of floating point dtype can require gradients

In [90]:
#investigate the preprocessing:

print(type(preprocess))
print(preprocess)

<class 'torchvision.transforms.transforms.Compose'>
Compose(
    RandomResizedCrop(size=(224, 224), scale=(0.9, 1.0), ratio=(0.75, 1.3333), interpolation=bicubic, antialias=True)
    <function _convert_to_rgb at 0x7fed8fc6e0c0>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)


In [None]:
# Looking up the code for clip _convert_to_rgb it is:
def _convert_to_rgb(image):
    return image.convert("RGB")
# aka the built-in RGB conversion from PIL

In [93]:
from scipy.ndimage import label

# BlackBox API Implementation
import editdistance  # For Levenshtein distance

class BlackBoxAPI:
    """
    Black-box API for the VLM scorer with query budget tracking and cost calculation.
    """
    
    def __init__(self, alpha: float, beta: float, q_max: int = 200):
        """
        Initialize the black-box API.
        
        Args:
            alpha: Logistic regression slope parameter
            beta: Logistic regression intercept parameter  
            q_max: Maximum queries allowed per sample
        """
        self.alpha = alpha
        self.beta = beta
        self.q_max = q_max
        self.query_count = 0
        
    def score(self, image_uint8: np.ndarray, caption_str: str) -> float:
        """
        Score image-caption pair and return probability.
        
        Args:
            image_uint8: Image as uint8 numpy array (H, W, C)
            caption_str: Caption string
            
        Returns:
            Probability in [0, 1] using sigmoid(alpha * cosine + beta)
        """
        if self.query_count >= self.q_max:
            raise RuntimeError(f"Query budget exceeded! Used {self.query_count}/{self.q_max}")
            
        # Convert numpy array to PIL Image
        image_pil = Image.fromarray(image_uint8)
        
        # Get cosine similarity
        cosine_sim = clip_embed(image_pil, caption_str)
        
        # Apply calibrated sigmoid
        logit = self.alpha * cosine_sim + self.beta
        probability = 1 / (1 + np.exp(-logit))
        
        self.query_count += 1
        
        return probability
    
    def reset_query_count(self):
        """Reset query counter for new sample."""
        self.query_count = 0
        
    def get_remaining_queries(self) -> int:
        """Get remaining query budget."""
        return self.q_max - self.query_count

# Cost Functions
def token_edit_cost(original: str, modified: str) -> int:
    """
    Compute token-level Levenshtein distance using CLIP tokenizer.
    
    Args:
        original: Original caption
        modified: Modified caption
        
    Returns:
        Number of token edits (insertions, deletions, substitutions)
    """
    # Use CLIP tokenizer for more accurate tokenization
    orig_tokens = open_clip.tokenize([original], context_length=77)[0].numpy()
    mod_tokens = open_clip.tokenize([modified], context_length=77)[0].numpy()
    
    # Remove padding tokens (0s) and special tokens for fair comparison
    # Keep only actual content tokens
    orig_tokens = orig_tokens[orig_tokens != 0]
    mod_tokens = mod_tokens[mod_tokens != 0]
    
    return editdistance.eval(orig_tokens.tolist(), mod_tokens.tolist())

def pixel_edit_cost(original: np.ndarray, modified: np.ndarray) -> int:
    """
    Compute number of changed pixels with reduced cost for continuous regions.
    
    Args:
        original: Original image as uint8 numpy array
        modified: Modified image as uint8 numpy array
        
    Returns:
        Adjusted cost based on number of changed pixels, with reduced cost for continuous regions.
    """
    # Find the difference mask
    diff_mask = np.any(original != modified, axis=-1)
    
    # Label connected components in the difference mask
    labeled_regions, num_features = label(diff_mask)
    
    # Count pixels in each connected region
    total_cost = 0
    for region_id in range(1, num_features + 1):
        region_size = np.sum(labeled_regions == region_id)
        if region_size > 0:
            # Full cost for the first pixel, half cost for the rest
            total_cost += 1 + (region_size - 1) * 0.5
    
    return int(total_cost)

# Test the BlackBox API
print("Testing BlackBox API...")

# Initialize API with calibrated parameters
api = BlackBoxAPI(alpha, beta, q_max=200)

# Test on a sample
test_pair = val_pairs[0]
test_image = load_image_from_pair(test_pair)
test_image_uint8 = np.array(test_image)

# Get score
score = api.score(test_image_uint8, test_pair['caption'])
print(f"API Score: {score:.4f} (Expected match: {test_pair['is_match']})")
print(f"Queries used: {api.query_count}/{api.q_max}")

# Test cost functions
original_caption = "A cat sitting on a mat"
modified_caption = "A dog standing on a rug" 
token_cost = token_edit_cost(original_caption, modified_caption)
print(f"\nToken edit cost example:")
print(f"  Original: '{original_caption}'")  
print(f"  Modified: '{modified_caption}'")
print(f"  Cost: {token_cost} token edits")

# Test pixel cost (create a simple modification)
original_img = np.zeros((100, 100, 3), dtype=np.uint8)
modified_img = original_img.copy()
modified_img[10:20, 10:20] = 255  # Change a 10x10 region
pixel_cost = pixel_edit_cost(original_img, modified_img)
print(f"\nPixel edit cost example:")
print(f"  Modified {pixel_cost} pixels in 100x100 image")

Testing BlackBox API...
API Score: 0.9191 (Expected match: True)
Queries used: 1/200

Token edit cost example:
  Original: 'A cat sitting on a mat'
  Modified: 'A dog standing on a rug'
  Cost: 3 token edits

Pixel edit cost example:
  Modified 50 pixels in 100x100 image


## Your Task: Implement Adversarial Attacks

### Attack Function Template

Replace the trivial baseline in the `attack()` function with sophisticated adversarial attacks:

```python
def attack(image_np_uint8, caption_str, api, budgets):
    # Your attack implementation here!
    # You can modify:
    # - Caption tokens (text modifications)  
    # - Image pixels (visual modifications)
    # - Both (multimodal attack)
    
    # Stay within budgets:
    # - budgets['T_MAX'] = 10  token edits
    # - budgets['P_MAX'] = 100 pixel edits
    # - budgets['Q_MAX'] = 100 queries
    
    return {
        'success': success,      # bool: did you flip the decision?
        'image': final_image,    # np.array: attacked image
        'caption': final_caption, # str: attacked caption  
        'token_cost': token_cost, # int: tokens changed
        'pixel_cost': pixel_cost, # int: pixels changed
        'query_cost': query_cost  # int: API calls made
    }
```

### Attack Strategies to Consider

- **Text Attacks**: Synonym replacement, word insertion/deletion, semantic paraphrasing
- **Image Attacks**: Adversarial noise, targeted pixel modifications, patch attacks
- **Query Optimization**: Gradient-free optimization, genetic algorithms, hill climbing
- **Multimodal**: Combined text+image attacks for maximum effectiveness

### Evaluation Metrics

Your attack will be scored as: **ASR - 0.5×ANC - 0.1×(AQ/Q_MAX)**

- **ASR**: Attack Success Rate (higher is better)
- **ANC**: Average Number of Changes (lower is better) 
- **AQ**: Average Queries (lower is better)

### Next Steps

1. **Implement your attack** in the `attack()` function above
2. **Test locally** using the evaluation framework  
3. **Run on full dataset** by changing `max_samples=None`

Good luck! 

### <span style="color:red">**FILL THIS CODE BLOCK**</span>

In [73]:
def attack(image_np_uint8: np.ndarray, caption_str: str, api: BlackBoxAPI, budgets: dict) -> dict:
    """
    Student attack function to implement adversarial attacks.
    
    Args:
        image_np_uint8: Original image as uint8 numpy array (H, W, C)
        caption_str: Original caption string
        api: BlackBoxAPI instance for querying the model
        budgets: Dictionary with 'T_MAX', 'P_MAX', 'Q_MAX' limits
        
    Returns:
        Dictionary with:
        - 'success': bool, whether attack succeeded (flipped decision)
        - 'image': np.ndarray, final attacked image  
        - 'caption': str, final attacked caption
        - 'token_cost': int, number of token edits used
        - 'pixel_cost': int, number of pixel edits used
        - 'query_cost': int, number of queries used
    """
    
    # TRIVIAL BASELINE - Students should replace this!
    # This baseline just returns the original inputs without any attack
    
    original_image = image_np_uint8.copy()
    original_caption = caption_str
    
    # Get original score to determine target (flip the decision)
    original_score = api.score(original_image, original_caption)
    

    # For this baseline, we don't actually perform any attack
    # Students should implement sophisticated attacks here!
    
    # TODO: IMPLEMENT YOUR ATTACK HERE!
    # You can modify:
    # - Caption tokens (text modifications)  
    # - Image pixels (visual modifications)
    # - Both (multimodal attack)
    
    # Stay within budgets:
    # - budgets['T_MAX'] = maximum token edits
    # - budgets['P_MAX'] = maximum pixel edits
    # - budgets['Q_MAX'] = maximum queries
    
    # Attack Strategies to Consider:
    # - Text Attacks: Synonym replacement, word insertion/deletion, semantic paraphrasing
    # - Image Attacks: Adversarial noise, targeted pixel modifications, patch attacks
    # - Query Optimization: Gradient-free optimization, genetic algorithms, hill climbing
    # - Multimodal: Combined text+image attacks for maximum effectiveness
    print(original_image.shape)
    image_tensor = preprocess(Image.fromarray(original_image)).unsqueeze(0).to(device)
    print(image_tensor.shape)
    image_grad = calc_image_grad(image_tensor, caption_str, 1)
    print(image_grad.shape)
    alpha = 10 # step size for adversarial perturbation
    adversarial_image_tensor = torch.clamp(image_tensor + alpha * image_grad.sign(), 0, 255)
    print(adversarial_image_tensor.shape)
    adversarial_image = PIL.Image.fromarray(adversarial_image_tensor.detach().squeeze().cpu().numpy().astype(np.uint8).transpose(1, 2, 0))
    final_image = np.array(adversarial_image)        # Comment out this line when implementing image attacks
    print(final_image.shape)
    final_caption = original_caption    # Comment out this line when implementing text attacks
    
    # Get final score  
    final_score = api.score(final_image, final_caption)
    
    # Check if attack succeeded (decision flipped)
    original_decision = original_score > 0.5
    final_decision = final_score > 0.5
    success = (original_decision != final_decision)
    
    # Calculate costs
    token_cost = token_edit_cost(original_caption, final_caption)
    pixel_cost = pixel_edit_cost(original_image, final_image)
    query_cost = api.query_count
    
    return {
        'success': success,
        'image': final_image,
        'caption': final_caption,
        'token_cost': token_cost,
        'pixel_cost': pixel_cost,  
        'query_cost': query_cost,
        'original_score': original_score,
        'final_score': final_score
    }

print("Attack function defined (TRIVIAL BASELINE)")
print("   Students should replace the trivial implementation with sophisticated attacks!")
print("   Current baseline: Returns original inputs unchanged (0% success rate expected)")

# Test the attack function
print("\nTesting attack function...")
test_pair = val_pairs[0]
test_image = np.array(load_image_from_pair(test_pair))

# Create fresh API instance  
test_api = BlackBoxAPI(alpha, beta, q_max=100)

attack_budgets = {
    'T_MAX': 10,     # Maximum token edits per sample
    'P_MAX': 100,   # Maximum pixel edits per sample  
    'Q_MAX': 100    # Maximum queries per sample
}

result = attack(test_image, test_pair['caption'], test_api, attack_budgets)

Attack function defined (TRIVIAL BASELINE)
   Students should replace the trivial implementation with sophisticated attacks!
   Current baseline: Returns original inputs unchanged (0% success rate expected)

Testing attack function...
(480, 640, 3)
torch.Size([1, 3, 224, 224])
torch.Size([1, 3, 224, 224])
torch.Size([1, 3, 224, 224])
(224, 224, 3)


ValueError: operands could not be broadcast together with shapes (480,640,3) (224,224,3) 

In [60]:
# Evaluation Framework
def evaluate_attack(val_pairs: list, attack_function, alpha: float, beta: float, budgets: dict, max_samples: int = None):
    """
    Evaluate attack function on validation pairs.
    
    Args:
        val_pairs: List of validation pairs
        attack_function: Attack function to evaluate
        alpha, beta: Calibrated parameters
        budgets: Attack budgets dictionary
        max_samples: Limit number of samples (None = all)
        
    Returns:
        Dictionary with evaluation metrics
    """
    
    print(f"Starting evaluation...")
    
    # Limit samples if specified
    eval_pairs = val_pairs[:max_samples] if max_samples else val_pairs
    print(f"Evaluating on {len(eval_pairs)} samples")
    
    results = []
    total_success = 0
    total_token_cost = 0
    total_pixel_cost = 0
    total_query_cost = 0
    
    for i, pair in enumerate(tqdm(eval_pairs, desc="Attacking")):
        # Create fresh API instance for each sample
        api = BlackBoxAPI(alpha, beta, q_max=budgets['Q_MAX'])
        
        # Load image
        image = np.array(load_image_from_pair(pair))
        caption = pair['caption']
        
        # try:
        # Run attack
        result = attack_function(image, caption, api, budgets)
        
        # Validate budget constraints
        budget_valid = (
            result['token_cost'] <= budgets['T_MAX'] and
            result['pixel_cost'] <= budgets['P_MAX'] and  
            result['query_cost'] <= budgets['Q_MAX']
        )
        
        if not budget_valid:
            print(f"Sample {i}: Budget violation!")
            print(f"   Tokens: {result['token_cost']}/{budgets['T_MAX']}")
            print(f"   Pixels: {result['pixel_cost']}/{budgets['P_MAX']}")  
            print(f"   Queries: {result['query_cost']}/{budgets['Q_MAX']}")
            result['success'] = False  # Invalid attacks count as failures
        
        results.append(result)
        
        if result['success']:
            total_success += 1
        total_token_cost += result['token_cost']
        total_pixel_cost += result['pixel_cost']
        total_query_cost += result['query_cost']
            
        # except Exception as e:
        #     print(f"Sample {i}: Attack failed with error: {e}")
        #     # Add failed result
        #     results.append({
        #         'success': False,
        #         'token_cost': budgets['T_MAX'],  # Penalize failures
        #         'pixel_cost': budgets['P_MAX'], 
        #         'query_cost': budgets['Q_MAX'],
        #         'error': str(e)
        #     })
    
    # Calculate metrics
    n_samples = len(results)
    asr = total_success / n_samples  # Attack Success Rate
    anc = (10*total_token_cost + total_pixel_cost) / n_samples  # Average Number of Changes  
    aq = total_query_cost / n_samples  # Average Queries
    
    # Final score: ASR - 0.5*ANC - 0.1*(AQ/Q_MAX)
    score = asr - 0.5 * (anc / (10*budgets['T_MAX'] + budgets['P_MAX'])) - 0.1 * (aq / budgets['Q_MAX'])
    
    evaluation_result = {
        'ASR': asr,
        'ANC': anc, 
        'AQ': aq,
        'Score': score,
        'n_samples': n_samples,
        'total_success': total_success,
        'avg_token_cost': total_token_cost / n_samples,
        'avg_pixel_cost': total_pixel_cost / n_samples,
        'budgets': budgets,
        'results': results
    }
    
    return evaluation_result

# Run Evaluation
print("Running evaluation on validation set...")

# Define attack budgets
attack_budgets = {
    'T_MAX': 10,     # Maximum token edits per sample
    'P_MAX': 100,   # Maximum pixel edits per sample  
    'Q_MAX': 100    # Maximum queries per sample
}

# Evaluate on subset first (faster for testing)
print("Running on first 50 samples for quick testing...")
eval_result = evaluate_attack(
    val_pairs=val_pairs, 
    attack_function=attack,
    alpha=alpha,
    beta=beta, 
    budgets=attack_budgets,
    max_samples=50  # Quick test on 50 samples
)

# Print results
print(f"\nEVALUATION RESULTS (50 samples):")
print(f"{'='*50}")
print(f"Attack Success Rate (ASR): {eval_result['ASR']:.1%}")
print(f"Average Number of Changes (ANC): {eval_result['ANC']:.2f}")  
print(f"Average Queries (AQ): {eval_result['AQ']:.1f}")
print(f"Final Score: {eval_result['Score']:.4f}")
print(f"{'='*50}")
print(f"Budget Usage:")
print(f"  Avg Token Cost: {eval_result['avg_token_cost']:.2f}/{attack_budgets['T_MAX']}")
print(f"  Avg Pixel Cost: {eval_result['avg_pixel_cost']:.2f}/{attack_budgets['P_MAX']}")  
print(f"  Avg Query Cost: {eval_result['AQ']:.1f}/{attack_budgets['Q_MAX']}")
print(f"\nNOTE: This is a trivial baseline (0% ASR expected)")
print(f"Students should implement sophisticated attacks to improve ASR!")

Running evaluation on validation set...
Running on first 50 samples for quick testing...
Starting evaluation...
Evaluating on 50 samples


Attacking:   0%|          | 0/50 [00:00<?, ?it/s]


TypeError: Unexpected type <class 'numpy.ndarray'>

In [14]:
# Full Evaluation (Uncomment when ready to test your attack)

def run_full_evaluation():
    """Run evaluation on all 1000 validation samples."""
    print("Running FULL evaluation on all 1000 validation samples...")
    print("This may take several minutes depending on your attack implementation.")
    
    full_result = evaluate_attack(
        val_pairs=val_pairs,
        attack_function=attack, 
        alpha=alpha,
        beta=beta,
        budgets=attack_budgets,
        max_samples=None  # All samples
    )
    
    print(f"\nFINAL EVALUATION RESULTS:")
    print(f"{'='*60}")
    print(f"Attack Success Rate (ASR): {full_result['ASR']:.1%}")
    print(f"Average Number of Changes (ANC): {full_result['ANC']:.2f}")
    print(f"Average Queries (AQ): {full_result['AQ']:.1f}")
    print(f"Final Score: {full_result['Score']:.4f}")
    print(f"{'='*60}")
    
    return full_result

# Uncomment the line below when ready to run full evaluation:
full_results = run_full_evaluation()

print("To run full evaluation on all 1000 samples:")
print("Uncomment: full_results = run_full_evaluation()")
print("\nCurrent status: Framework ready for student implementations!")

Running FULL evaluation on all 1000 validation samples...
This may take several minutes depending on your attack implementation.
Starting evaluation...
Evaluating on 1000 samples


Attacking: 100%|██████████| 1000/1000 [00:34<00:00, 28.72it/s]


FINAL EVALUATION RESULTS:
Attack Success Rate (ASR): 0.1%
Average Number of Changes (ANC): 0.00
Average Queries (AQ): 2.0
Final Score: -0.0010
To run full evaluation on all 1000 samples:
Uncomment: full_results = run_full_evaluation()

Current status: Framework ready for student implementations!





## Baseline scoring

In [None]:
def baseline_attack(image_np_uint8: np.ndarray, caption_str: str, api: BlackBoxAPI, budgets: dict) -> dict:

    # TRIVIAL BASELINE - Students should replace this!
    # This baseline just returns the original inputs without any attack
    
    original_image = image_np_uint8.copy()
    original_caption = caption_str
    
    # Get original score to determine target (flip the decision)
    original_score = api.score(original_image, original_caption)

    final_image = original_image        # Comment out this line when implementing image attacks
    final_caption = original_caption    # Comment out this line when implementing text attacks
    
    # Get final score  
    final_score = api.score(final_image, final_caption)
    
    # Check if attack succeeded (decision flipped)
    original_decision = original_score > 0.5
    final_decision = final_score > 0.5
    success = (original_decision != final_decision)
    
    # Calculate costs
    token_cost = token_edit_cost(original_caption, final_caption)
    pixel_cost = pixel_edit_cost(original_image, final_image)
    query_cost = api.query_count
    
    return {
        'success': success,
        'image': final_image,
        'caption': final_caption,
        'token_cost': token_cost,
        'pixel_cost': pixel_cost,  
        'query_cost': query_cost,
        'original_score': original_score,
        'final_score': final_score
    }

In [52]:


# Run Evaluation
print("Running evaluation on validation set...")

# Define attack budgets
attack_budgets = {
    'T_MAX': 10,     # Maximum token edits per sample
    'P_MAX': 100,   # Maximum pixel edits per sample  
    'Q_MAX': 100    # Maximum queries per sample
}

# Evaluate on subset first (faster for testing)
print("Running on first 50 samples for quick testing...")
eval_result = evaluate_attack(
    val_pairs=val_pairs, 
    attack_function=baseline_attack,
    alpha=alpha,
    beta=beta, 
    budgets=attack_budgets,
    max_samples=50  # Quick test on 50 samples
)

# Print results
print(f"\nEVALUATION RESULTS (50 samples):")
print(f"{'='*50}")
print(f"Attack Success Rate (ASR): {eval_result['ASR']:.1%}")
print(f"Average Number of Changes (ANC): {eval_result['ANC']:.2f}")  
print(f"Average Queries (AQ): {eval_result['AQ']:.1f}")
print(f"Final Score: {eval_result['Score']:.4f}")
print(f"{'='*50}")
print(f"Budget Usage:")
print(f"  Avg Token Cost: {eval_result['avg_token_cost']:.2f}/{attack_budgets['T_MAX']}")
print(f"  Avg Pixel Cost: {eval_result['avg_pixel_cost']:.2f}/{attack_budgets['P_MAX']}")  
print(f"  Avg Query Cost: {eval_result['AQ']:.1f}/{attack_budgets['Q_MAX']}")
print(f"\nNOTE: This is a trivial baseline (0% ASR expected)")
print(f"Students should implement sophisticated attacks to improve ASR!")

Running evaluation on validation set...
Running on first 50 samples for quick testing...
Starting evaluation...
Evaluating on 50 samples


Attacking: 100%|██████████| 50/50 [00:01<00:00, 28.71it/s]


EVALUATION RESULTS (50 samples):
Attack Success Rate (ASR): 0.0%
Average Number of Changes (ANC): 0.00
Average Queries (AQ): 2.0
Final Score: -0.0020
Budget Usage:
  Avg Token Cost: 0.00/10
  Avg Pixel Cost: 0.00/100
  Avg Query Cost: 2.0/100

NOTE: This is a trivial baseline (0% ASR expected)
Students should implement sophisticated attacks to improve ASR!





In [53]:
# Full Evaluation (Uncomment when ready to test your attack)

def run_full_evaluation():
    """Run evaluation on all 1000 validation samples."""
    print("Running FULL evaluation on all 1000 validation samples...")
    print("This may take several minutes depending on your attack implementation.")
    
    full_result = evaluate_attack(
        val_pairs=val_pairs,
        attack_function=baseline_attack, 
        alpha=alpha,
        beta=beta,
        budgets=attack_budgets,
        max_samples=None  # All samples
    )
    
    print(f"\nFINAL EVALUATION RESULTS:")
    print(f"{'='*60}")
    print(f"Attack Success Rate (ASR): {full_result['ASR']:.1%}")
    print(f"Average Number of Changes (ANC): {full_result['ANC']:.2f}")
    print(f"Average Queries (AQ): {full_result['AQ']:.1f}")
    print(f"Final Score: {full_result['Score']:.4f}")
    print(f"{'='*60}")
    
    return full_result

# Uncomment the line below when ready to run full evaluation:
full_results = run_full_evaluation()

print("To run full evaluation on all 1000 samples:")
print("Uncomment: full_results = run_full_evaluation()")
print("\nCurrent status: Framework ready for student implementations!")

Running FULL evaluation on all 1000 validation samples...
This may take several minutes depending on your attack implementation.
Starting evaluation...
Evaluating on 1000 samples


Attacking: 100%|██████████| 1000/1000 [00:34<00:00, 28.58it/s]


FINAL EVALUATION RESULTS:
Attack Success Rate (ASR): 0.6%
Average Number of Changes (ANC): 0.00
Average Queries (AQ): 2.0
Final Score: 0.0040
To run full evaluation on all 1000 samples:
Uncomment: full_results = run_full_evaluation()

Current status: Framework ready for student implementations!



