# Metrics Notebook
This notebook runs all the metrics for a given variant of CLIP. It expects a "CLIP-like" model, as defined in the config below: one that takes an image or a text and outputs a fixed-size embedding. The model should adhere to the Clip interface defined in `models/clip.py`.

The four metrics we implement are outlined here: 

### Top-K Retrieval Accuracy
Given an image, we compute its CLIP embedding and retrieve the closest K captions based on cosine similarity with caption embeddings. If the target caption is within the top K, we count this as a correct retrieval. This metric is a direct proxy for classification accuracy in multimodal retrieval. It is valuable because strong cross-modal alignment should yield high retrieval accuracy. However, since our approach aims to reduce sparsity on the hypersphere, the embeddings may become less linearly separable, potentially lowering retrieval performance even as uniformity improves.
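
As an illustration of the metric, here is a minimal NumPy sketch of top-K retrieval accuracy in a batched setting. It assumes that row `i` of `image_emb` is paired with row `i` of `text_emb`, which differs from the per-image candidate lists used by the notebook code; the function name `recall_at_k` is hypothetical.

```python
import numpy as np

def recall_at_k(image_emb, text_emb, k=5):
    """Fraction of images whose paired caption (same row index) ranks in the top k."""
    # L2-normalize so that dot products equal cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T            # (N, N) image-to-caption similarities
    target = np.diag(sims)[:, None]          # similarity of each image to its own caption
    # Rank of the paired caption = number of captions scoring strictly higher
    ranks = (sims > target).sum(axis=1)
    return float((ranks < k).mean())

# Toy example: four orthogonal embeddings, with captions 2 and 3 swapped
image_emb = np.eye(4)
text_emb = np.eye(4)[[0, 1, 3, 2]]
print(recall_at_k(image_emb, text_emb, k=1))  # 0.5
print(recall_at_k(image_emb, text_emb, k=4))  # 1.0
```

Ties are resolved optimistically here, because only strictly higher similarities push the paired caption down the ranking.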

### Modality Gap via Linear Separability
Following \citet{modalityGAP}, we measure the modality gap between text and image embeddings by training a soft-margin SVM classifier to distinguish modality type. We evaluate classification accuracy, precision, and recall. High separability indicates a strong modality gap, which is undesirable because semantically matched image–text pairs should ideally share indistinguishable representations. Reducing modality separability would thus reflect improved multimodal coordination.
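
The following is a minimal sketch of such a probe, not the cited authors' implementation: a soft-margin linear SVM trained by full-batch subgradient descent on the hinge loss, followed by accuracy, precision, and recall with text as the positive class. The function names, the hyperparameters (`C`, `lr`, `epochs`), and the synthetic data are assumptions made for illustration, and the probe is scored on its own training data for brevity.

```python
import numpy as np

def fit_linear_svm(X, y, C=1.0, lr=0.1, epochs=200):
    """Soft-margin linear SVM via full-batch subgradient descent.

    Minimizes 0.5 * ||w||^2 + C * mean(max(0, 1 - y * (X @ w + b))),
    with labels y in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # points inside or beyond the margin
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -C * y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def modality_gap_report(image_emb, text_emb):
    """Train the probe (image = -1, text = +1) and score it on the same data."""
    X = np.concatenate([image_emb, text_emb])
    y = np.concatenate([-np.ones(len(image_emb)), np.ones(len(text_emb))])
    w, b = fit_linear_svm(X, y)
    pred = np.where(X @ w + b >= 0, 1.0, -1.0)
    tp = int(np.sum((pred == 1) & (y == 1)))
    fp = int(np.sum((pred == 1) & (y == -1)))
    fn = int(np.sum((pred == -1) & (y == 1)))
    return {
        "accuracy": float(np.mean(pred == y)),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Synthetic embeddings with an exaggerated modality gap along the first axis
rng = np.random.default_rng(0)
offset = np.zeros(8)
offset[0] = 1.0
image_emb = 0.1 * rng.normal(size=(50, 8)) + offset
text_emb = 0.1 * rng.normal(size=(50, 8)) - offset
report = modality_gap_report(image_emb, text_emb)
print(report)
```

On real embeddings, a held-out split would give a less optimistic estimate of separability than scoring on the training data.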

### Hyperspherical Entropy Estimation
We measure the entropy of the embedding distribution on the hypersphere using the k-nearest neighbor–based estimator proposed by \citet{entropy}. This estimator leverages angular distances to compute local density estimates, which are aggregated into a global entropy measure. Entropy serves as a proxy for sparsity: low entropy distributions are clustered and “spiky,” while high entropy indicates more uniform coverage of the hypersphere. Since our method encourages uniformity, we expect an increase in entropy relative to standard CLIP.
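
For reference, the quantities computed by the entropy code in this notebook can be written out as follows. The formulas are read off the docstrings of `knn_entropy` and `knn_entropy_alternative`, writing $d$ for the embedding dimension, $\varphi_i$ for the angular distance from point $i$ to its $k$-th nearest neighbor, and $\psi$ for the digamma function; the last line shows why the two formulations in the code agree.

$$
\begin{aligned}
S_d &= \frac{2\pi^{d/2}}{\Gamma(d/2)}, \qquad
S(\varphi) = \tfrac{1}{2}\, S_d \left[ 1 - \operatorname{sgn}(\cos\varphi)\, I_{\cos^2\varphi}\!\left(\tfrac{1}{2}, \tfrac{d-1}{2}\right) \right], \\
L_{n,i} &= \ln\frac{k/n}{S(\varphi_i)} = \ln k - \ln n - \ln S(\varphi_i), \\
\hat H_n &= -\frac{1}{n}\sum_{i=1}^{n}\left[ L_{n,i} - \ln k + \psi(k) \right]
= \frac{1}{n}\sum_{i=1}^{n} \ln\!\left[ n\, S(\varphi_i) \right] - \psi(k).
\end{aligned}
$$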

### Downstream Captioning Performance (BLEU)
Finally, we evaluate the utility of embeddings on a generative downstream task: image captioning. Image embeddings are passed into a pretrained language model to generate captions, which are compared against ground-truth captions using BLEU score. BLEU measures n-gram overlap between generated and reference text, rewarding fluency and accuracy. This extrinsic metric demonstrates how improvements in embedding geometry translate into practical benefits for end-user tasks, beyond abstract geometric properties.
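
To make "n-gram overlap" concrete, here is a small pure-Python sketch of the two ingredients of BLEU for a single reference: clipped (modified) n-gram precision and the brevity penalty. It is meant as an illustration rather than a replacement for a library implementation, and the helper names are hypothetical.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(modified_precision(candidate, reference, 1))  # 2/7: "the" is credited only twice
print(modified_precision(candidate, reference, 2))  # 0.0: no bigram overlap
print(brevity_penalty(candidate, reference))        # 1.0: candidate is longer than the reference
```

The clipping step is what prevents a degenerate caption that repeats a common word from scoring a perfect unigram precision.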

In [None]:
from models.clipModel import CLIPModel
import nltk

model = CLIPModel()

In [None]:
import torch

def top_k_similarities(embeddings, query_embedding, k=5):
    """
    Compute the top-k most similar embeddings to the query_embedding.
    
    Args:
        embeddings (torch.Tensor): Tensor of shape (N, D) where N is the number of embeddings and D is the embedding dimension.
        query_embedding (torch.Tensor): Tensor of shape (D,) representing the query embedding.
        k (int): Number of top similar embeddings to return.

    Returns:
        List[Tuple[int, float]]: List of tuples containing the index and similarity score of the top-k most similar embeddings.
    """
    # Compute cosine similarities
    similarities = torch.nn.functional.cosine_similarity(embeddings, query_embedding.unsqueeze(0), dim=1)

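    # Get top-k indices; clamp k so topk does not raise when there are fewer than k candidates
    k = min(k, similarities.size(0))
    top_k = similarities.topk(k)

    # Return list of (index, similarity) tuples, most similar first
    return [(idx.item(), score.item()) for idx, score in zip(top_k.indices, top_k.values)]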

def top_k_score(embedding_pairs, k=5):
    """
    Given a list of (text_embedding[], image_embedding) pairs, return the percentage of texts that are in the top-k most similar to their corresponding image embeddings.
    """
    correct_count = 0
    for text_embeddings, image_embedding in embedding_pairs:
        top_k = top_k_similarities(text_embeddings, image_embedding, k)
        if 0 in [idx for idx, _ in top_k]:  # Assuming the correct text is always at index 0
            correct_count += 1
    return correct_count / len(embedding_pairs) if embedding_pairs else 0.0

In [None]:
import torch.nn as nn

def linear_separability(image_embeddings, text_embeddings, num_epochs=100, learning_rate=1e-3):
    """
    Train a linear classifier to distinguish between image and text embeddings, and report the accuracy.
    
    Args:
        image_embeddings (torch.Tensor): Tensor of shape (N, D) for image embeddings.
        text_embeddings (torch.Tensor): Tensor of shape (N, D) for text embeddings.
        num_epochs (int): Number of training epochs.
        learning_rate (float): Learning rate for the optimizer.

    Returns:
        float: Accuracy of the classifier on the given set.
    """
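    # Combine image and text embeddings; detach so that training the probe
    # never backpropagates into the encoder that produced them
    embeddings = torch.cat([image_embeddings, text_embeddings], dim=0).detach().float()
    labels = torch.cat([torch.zeros(image_embeddings.size(0)), torch.ones(text_embeddings.size(0))], dim=0).to(embeddings.device)

    # Train a linear classifier on the same device as the embeddings
    classifier = nn.Linear(embeddings.size(1), 2).to(embeddings.device)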
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        optimizer.zero_grad()
        outputs = classifier(embeddings)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer.step()

    # Evaluate the classifier
    with torch.no_grad():
        outputs = classifier(embeddings)
        preds = outputs.argmax(dim=1)
        accuracy = (preds == labels).float().mean().item()

    return accuracy

In [None]:
def bleu_score(predictions, references):
    """
    Compute a simple BLEU score for a list of predictions and references.
    
    Args:
        predictions (List[str]): List of predicted sentences.
        references (List[str]): List of reference sentences.

    Returns:
        float: Average BLEU score across all predictions.
    """
    from nltk.translate.bleu_score import sentence_bleu

    total_score = 0.0
    for pred, ref in zip(predictions, references):
        ref_tokens = [ref.split()]
        pred_tokens = pred.split()
        score = sentence_bleu(ref_tokens, pred_tokens)
        total_score += score

    return total_score / len(predictions) if predictions else 0.0

In [None]:
import torch
import numpy as np
from scipy.special import betainc, digamma, gammaln

def knn_entropy(embeddings, k=5):
    """
    Compute the k-nearest neighbor entropy estimator for hyperspherical data.
    
    This estimator is designed for data on a unit hypersphere and uses the 
    k-nearest neighbor approach to estimate entropy consistently.
    
    Args:
        embeddings (torch.Tensor): Tensor of shape (N, D) representing N embeddings on the unit hypersphere of dimension D-1.
        k (int): Number of nearest neighbors to consider.
    
    Returns:
        float: Estimated entropy of the distribution.
    """
    # Ensure embeddings are normalized to unit sphere
    embeddings = embeddings / torch.norm(embeddings, dim=1, keepdim=True)
    embeddings_np = embeddings.detach().cpu().numpy()
    
    n, d = embeddings_np.shape
    
    # Compute pairwise angular distances using arccos(x^T y)
    # Since embeddings are normalized, dot product gives cosine similarity
    dot_products = np.dot(embeddings_np, embeddings_np.T)
    # Clamp to avoid numerical issues with arccos
    dot_products = np.clip(dot_products, -1.0, 1.0)
    angular_distances = np.arccos(dot_products)
    
    # For each point, find the angular distance to its k-th nearest neighbor
    if not 1 <= k < n:
        raise ValueError(f"k must satisfy 1 <= k < n, got k={k} with n={n}")
    # Mask self-distances with +inf so a point is never its own neighbor
    np.fill_diagonal(angular_distances, np.inf)
    # Partial sort: after partitioning, column k-1 of each row holds the k-th smallest distance
    phi_values = np.partition(angular_distances, k - 1, axis=1)[:, k - 1]
    
    # Compute the log surface area of caps, ln S(phi_i). Working in log space matters:
    # for CLIP-sized d (e.g. 512), gamma(d/2) overflows float64, so evaluating the
    # sphere area directly yields 0 and its logarithm is -inf.
    def hypersphere_log_cap_area(phi, d):
        """
        Compute the log-area of a spherical cap with polar angle phi on the unit (d-1)-sphere.
        
        S(φ) = (1/2) * S_d * [1 - sgn(cos φ) * I_{cos²φ}(1/2, (d-1)/2)]
        
        where S_d = 2π^(d/2) / Γ(d/2) is the surface area of the (d-1)-sphere and I is the
        regularized incomplete beta function.
        """
        from scipy.special import betainc, gammaln

        # ln S_d = ln 2 + (d/2) ln π - ln Γ(d/2)
        log_S_d = np.log(2.0) + (d / 2) * np.log(np.pi) - gammaln(d / 2)
        
        # Identity 1 - I_x(a, b) = I_{1-x}(b, a): evaluating the complement directly
        # avoids cancellation when the cap is a tiny fraction of the sphere
        tail = betainc((d - 1) / 2, 0.5, np.sin(phi) ** 2)
        
        # Caps up to a hemisphere cover tail/2 of the sphere; larger caps cover the rest
        fraction = np.where(np.cos(phi) >= 0, 0.5 * tail, 1.0 - 0.5 * tail)
        
        return log_S_d + np.log(fraction)
    
    # Compute log cap areas for all phi values (vectorized over the array)
    log_S_phi = hypersphere_log_cap_area(phi_values, d)
    
    # Compute L_{n,i} = ln(f_n(X_i)) = ln(k/n / S(phi_i))
    L_values = np.log(k/n) - log_S_phi
    
    # Compute digamma function ψ(k)
    psi_k = digamma(k)
    
    # Compute entropy using the first formulation:
    # H_n(f) = -(1/n) * Σ[L_{n,i} - ln(k) + ψ(k)]
    entropy = -(1/n) * np.sum(L_values - np.log(k) + psi_k)
    
    return entropy

# Alternative implementation using the second formulation for verification
def knn_entropy_alternative(embeddings, k=5):
    """
    Alternative implementation using the second formulation:
    H_n(f) = (1/n) * Σ ln[n * S(φ_i)] - ψ(k)
    """
    # Ensure embeddings are normalized to unit sphere
    embeddings = embeddings / torch.norm(embeddings, dim=1, keepdim=True)
    embeddings_np = embeddings.detach().cpu().numpy()
    
    n, d = embeddings_np.shape
    
    # Compute pairwise angular distances
    dot_products = np.dot(embeddings_np, embeddings_np.T)
    dot_products = np.clip(dot_products, -1.0, 1.0)
    angular_distances = np.arccos(dot_products)
    
    # Find k-th nearest neighbor distances
    if not 1 <= k < n:
        raise ValueError(f"k must satisfy 1 <= k < n, got k={k} with n={n}")
    # Mask self-distances with +inf so a point is never its own neighbor
    np.fill_diagonal(angular_distances, np.inf)
    phi_values = np.partition(angular_distances, k - 1, axis=1)[:, k - 1]
    
    # Compute log cap areas (log space avoids the overflow of gamma(d/2) for large d)
    def hypersphere_log_cap_area(phi, d):
        from scipy.special import betainc, gammaln
        # ln S_d = ln 2 + (d/2) ln π - ln Γ(d/2)
        log_S_d = np.log(2.0) + (d / 2) * np.log(np.pi) - gammaln(d / 2)
        # 1 - I_{cos²φ}(1/2, (d-1)/2) = I_{sin²φ}((d-1)/2, 1/2)
        tail = betainc((d - 1) / 2, 0.5, np.sin(phi) ** 2)
        fraction = np.where(np.cos(phi) >= 0, 0.5 * tail, 1.0 - 0.5 * tail)
        return log_S_d + np.log(fraction)
    
    log_S_phi = hypersphere_log_cap_area(phi_values, d)
    
    # Second formulation: H_n(f) = (1/n) * Σ ln[n * S(φ_i)] - ψ(k)
    psi_k = digamma(k)
    entropy = (1/n) * np.sum(np.log(n) + log_S_phi) - psi_k
    
    return entropy
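
One independent way to sanity-check the estimator is the uniform distribution on the ordinary sphere in 3D, whose entropy is ln(4π) ≈ 2.531. The sketch below is a self-contained NumPy check that assumes d = 3, where the cap area has the closed form 2π(1 − cos φ); the function name `knn_entropy_s2` and the sample size are hypothetical choices for illustration.

```python
import numpy as np

def knn_entropy_s2(points, k=5):
    """k-NN entropy estimate for unit vectors in R^3.

    Uses the second formulation, (1/n) * sum(ln[n * S(phi_i)]) - psi(k),
    with the closed-form cap area S(phi) = 2 * pi * (1 - cos(phi)).
    """
    n = points.shape[0]
    cos = np.clip(points @ points.T, -1.0, 1.0)
    np.fill_diagonal(cos, -np.inf)             # exclude each point from its own neighbors
    # The k-th nearest neighbor in angle has the k-th largest cosine,
    # which sits at index n - k of the ascending order
    cos_k = np.partition(cos, n - k, axis=1)[:, n - k]
    cap_area = 2.0 * np.pi * (1.0 - cos_k)
    # Digamma at a positive integer: psi(k) = -gamma + sum_{j=1}^{k-1} 1/j
    psi_k = -np.euler_gamma + sum(1.0 / j for j in range(1, k))
    return float(np.mean(np.log(n * cap_area)) - psi_k)

# Uniform samples on the sphere: normalize isotropic Gaussian draws
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)

estimate = knn_entropy_s2(x, k=5)
print(estimate, np.log(4 * np.pi))  # the two values should be close
```

Feeding the same normalized samples to `knn_entropy` should give a comparable value, which makes this a cheap regression test for the general-dimension code.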