# Metrics Notebook
This notebook runs all of the metrics for a given CLIP variant. It expects the model defined in the config below to be "CLIP-like": it takes an image or a piece of text as input and outputs a fixed-size embedding. The model should adhere to the Clip interface defined in `models/clip.py`.

The four metrics we implement are outlined below:

### Top-K Retrieval Accuracy
Given an image, we compute its CLIP embedding and retrieve the closest K captions based on cosine similarity with caption embeddings. If the target caption is within the top K, we count this as a correct retrieval. This metric is a direct proxy for classification accuracy in multimodal retrieval. It is valuable because strong cross-modal alignment should yield high retrieval accuracy. However, since our approach aims to reduce sparsity on the hypersphere, the embeddings may become less linearly separable, potentially lowering retrieval performance even as uniformity improves.
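
As a rough illustration of the retrieval step described above, the following sketch computes top-K accuracy in one batched pass. It assumes the image and text embeddings are row-aligned, so that row `i` of each tensor belongs to the same pair. The function name `recall_at_k` is hypothetical and not part of this notebook's code.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_embeddings, text_embeddings, k=5):
    """Fraction of images whose paired caption is among the k most similar captions.

    Assumes (hypothetically) that row i of both (N, D) tensors is a matched pair.
    """
    # Unit-normalize so the dot product equals cosine similarity
    img = F.normalize(image_embeddings, dim=1)
    txt = F.normalize(text_embeddings, dim=1)
    sims = img @ txt.T  # (N, N) image-to-caption similarities

    # Clamp k so topk does not fail on small candidate sets
    k = min(k, sims.size(1))
    top_k = sims.topk(k, dim=1).indices  # (N, k) caption indices per image

    # The correct caption for image i is caption i
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    return (top_k == targets).any(dim=1).float().mean().item()
```

This differs from a per-image loop only in how the candidates are laid out: every caption in the batch acts as a candidate for every image.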

### Modality Gap via Linear Separability
Following \citet{modalityGAP}, we measure the modality gap between text and image embeddings by training a soft-margin SVM classifier to distinguish modality type. We evaluate classification accuracy, precision, and recall. High separability indicates a strong modality gap, which is undesirable because semantically matched image–text pairs should ideally share indistinguishable representations. Reducing modality separability would thus reflect improved multimodal coordination.
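
One way to approximate the soft-margin SVM described above is to train a linear model with hinge loss and an L2 penalty in PyTorch. The sketch below does that and reports accuracy, precision, and recall with text treated as the positive class. The function name `hinge_modality_gap`, the trade-off parameter `c`, and the default hyperparameters are assumptions for illustration. They are not taken from the cited work and are untuned.

```python
import torch

def hinge_modality_gap(image_embeddings, text_embeddings, c=1.0, num_epochs=200, lr=1e-2):
    """Linear soft-margin classifier (hinge loss + L2 penalty) separating modalities.

    Labels: image = -1, text = +1. Metrics are computed on the training data.
    """
    x = torch.cat([image_embeddings, text_embeddings], dim=0).detach().float()
    y = torch.cat([-torch.ones(len(image_embeddings)), torch.ones(len(text_embeddings))]).to(x.device)

    w = torch.zeros(x.size(1), device=x.device, requires_grad=True)
    b = torch.zeros(1, device=x.device, requires_grad=True)
    optimizer = torch.optim.Adam([w, b], lr=lr)

    for _ in range(num_epochs):
        optimizer.zero_grad()
        margins = y * (x @ w + b)
        # Soft-margin objective: L2 penalty on w plus the mean hinge loss
        loss = 0.5 * w.dot(w) + c * torch.clamp(1 - margins, min=0).mean()
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        pred = (x @ w + b >= 0).float() * 2 - 1  # map scores to {-1, +1}

    tp = ((pred == 1) & (y == 1)).sum().item()
    fp = ((pred == 1) & (y == -1)).sum().item()
    fn = ((pred == -1) & (y == 1)).sum().item()
    return {
        "accuracy": (pred == y).float().mean().item(),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Because the metrics are measured on the data used for fitting, they are an optimistic estimate of separability. A held-out split would give a stricter reading.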

### Hyperspherical Entropy Estimation
We measure the entropy of the embedding distribution on the hypersphere using the k-nearest neighbor–based estimator proposed by \citet{entropy}. This estimator leverages angular distances to compute local density estimates, which are aggregated into a global entropy measure. Entropy serves as a proxy for sparsity: low entropy distributions are clustered and “spiky,” while high entropy indicates more uniform coverage of the hypersphere. Since our method encourages uniformity, we expect an increase in entropy relative to standard CLIP.
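
To make the idea of a k-nearest-neighbor entropy estimate concrete, here is a minimal sketch based on the classical Kozachenko-Leonenko estimator, applied to angular (geodesic) distances. It is not necessarily the estimator from the cited work. It assumes that small spherical caps can be treated as flat balls of intrinsic dimension D - 1, which is reasonable only when the k-th neighbor is close. The function name `knn_entropy_on_sphere` is hypothetical.

```python
import math
import torch

def knn_entropy_on_sphere(embeddings, k=5):
    """Kozachenko-Leonenko style entropy estimate (in nats) for points on a unit sphere.

    Assumption: the cap around each point out to its k-th neighbor is small enough
    to be approximated by a Euclidean ball of dimension D - 1.
    """
    x = torch.nn.functional.normalize(embeddings.detach().double(), dim=1)
    n, dim = x.shape
    d = dim - 1  # intrinsic dimension of the sphere surface

    # Pairwise angular distances; the diagonal is masked so a point is not its own neighbor
    angles = torch.acos((x @ x.T).clamp(-1.0, 1.0))
    angles.fill_diagonal_(float("inf"))

    # Angle to the k-th nearest neighbor (requires k <= n - 1)
    kth = angles.kthvalue(k, dim=1).values.clamp_min(1e-12)

    # Log-volume of the unit ball in d dimensions
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)

    digamma_n = torch.digamma(torch.tensor(float(n), dtype=torch.float64))
    digamma_k = torch.digamma(torch.tensor(float(k), dtype=torch.float64))
    return (digamma_n - digamma_k + log_unit_ball + d * kth.log().mean()).item()
```

Under this sketch, tightly clustered embeddings yield small neighbor angles and therefore a low (possibly negative) differential entropy, while embeddings spread evenly over the sphere yield a higher value.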

### Downstream Captioning Performance (BLEU)
Finally, we evaluate the utility of embeddings on a generative downstream task: image captioning. Image embeddings are passed into a pretrained language model to generate captions, which are compared against ground-truth captions using BLEU score. BLEU measures n-gram overlap between generated and reference text, rewarding fluency and accuracy. This extrinsic metric demonstrates how improvements in embedding geometry translate into practical benefits for end-user tasks, beyond abstract geometric properties.
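
For intuition about what BLEU measures, the sketch below implements a simplified sentence-level BLEU: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by the brevity penalty. It uses whitespace tokenization and no smoothing, so it returns 0 whenever any n-gram order has no match. The function name `simple_bleu` is hypothetical. A real evaluation would more likely use an established implementation with corpus-level statistics.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU with uniform weights and no smoothing."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        # For each n-gram, the maximum count seen in any single reference
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated n-grams are not over-rewarded
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty against the reference length closest to the candidate length
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    brevity = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

A generated caption identical to a reference scores 1, a caption sharing no words with any reference scores 0, and partial overlaps fall in between.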

In [None]:
from models.clipModel import CLIPModel

model = CLIPModel()

In [None]:
import torch

def top_k_similarities(embeddings, query_embedding, k=5):
    """
    Compute the top-k most similar embeddings to the query_embedding.
    
    Args:
        embeddings (torch.Tensor): Tensor of shape (N, D) where N is the number of embeddings and D is the embedding dimension.
        query_embedding (torch.Tensor): Tensor of shape (D,) representing the query embedding.
        k (int): Number of top similar embeddings to return.

    Returns:
        List[Tuple[int, float]]: List of tuples containing the index and similarity score of the top-k most similar embeddings.
    """
    # Cosine similarity between the query and every candidate embedding
    similarities = torch.nn.functional.cosine_similarity(embeddings, query_embedding.unsqueeze(0), dim=1)

    # topk raises an error if k exceeds the number of candidates, so clamp it
    k = min(k, similarities.size(0))
    top_k = similarities.topk(k)

    # Return list of (index, similarity) tuples, most similar first
    return list(zip(top_k.indices.tolist(), top_k.values.tolist()))

def top_k_score(embedding_pairs, k=5):
    """
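    Given a list of (text_embeddings, image_embedding) pairs, return the fraction (between 0 and 1) of images whose correct caption is among the top-k most similar texts.

    Each text_embeddings tensor has shape (N, D), with the correct caption assumed to be at index 0; each image_embedding has shape (D,).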
    """
    correct_count = 0
    for text_embeddings, image_embedding in embedding_pairs:
        top_k = top_k_similarities(text_embeddings, image_embedding, k)
        if 0 in [idx for idx, _ in top_k]:  # Assuming the correct text is always at index 0
            correct_count += 1
    return correct_count / len(embedding_pairs) if embedding_pairs else 0.0

In [None]:
import torch
import torch.nn as nn

def linear_separability(image_embeddings, text_embeddings, num_epochs=100, learning_rate=1e-3):
    """
    Train a linear classifier to distinguish between image and text embeddings, and report the accuracy.
    
    Args:
        image_embeddings (torch.Tensor): Tensor of shape (N, D) for image embeddings.
        text_embeddings (torch.Tensor): Tensor of shape (N, D) for text embeddings.
        num_epochs (int): Number of training epochs.
        learning_rate (float): Learning rate for the optimizer.

    Returns:
        float: Accuracy of the classifier on the same embeddings it was trained on (no held-out split).
    """
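    # Combine image and text embeddings; detach so repeated backward passes
    # do not try to backpropagate into the model that produced them
    embeddings = torch.cat([image_embeddings, text_embeddings], dim=0).detach().float()
    labels = torch.cat([torch.zeros(image_embeddings.size(0)), torch.ones(text_embeddings.size(0))], dim=0).to(embeddings.device)

    # Train a linear classifier on the same device as the embeddings
    classifier = nn.Linear(embeddings.size(1), 2).to(embeddings.device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=learning_rate)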

    for epoch in range(num_epochs):
        optimizer.zero_grad()
        outputs = classifier(embeddings)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer.step()

    # Evaluate the classifier
    with torch.no_grad():
        outputs = classifier(embeddings)
        preds = outputs.argmax(dim=1)
        accuracy = (preds == labels).float().mean().item()

    return accuracy