# Metrics Notebook
This notebook runs all of the metrics for a given variant of CLIP. It expects a "CLIP-like" model, as defined in the config below: one that takes an image or a text input and outputs an embedding. The CLIP-like model should adhere to the CLIP interface defined in `models/clip.py`.

The four metrics we implement are outlined here: 

### Top-K Retrieval Accuracy
Given an image, we compute its CLIP embedding and retrieve the K closest captions, ranked by cosine similarity between the image embedding and the caption embeddings. If the target caption is among the top K, we count the retrieval as correct. This metric is a direct proxy for classification accuracy in multimodal retrieval, and it is valuable because strong cross-modal alignment should yield high retrieval accuracy. However, since our approach aims to reduce sparsity on the hypersphere, the embeddings may become less linearly separable, potentially lowering retrieval performance even as uniformity improves.
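
The retrieval check described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's actual implementation: the function name `top_k_retrieval_accuracy` is hypothetical, and it assumes that row `i` of the image embeddings is paired with row `i` of the caption embeddings.

```python
import numpy as np

def top_k_retrieval_accuracy(image_emb, text_emb, k=5):
    """Fraction of images whose paired caption is among the k most similar captions.

    Assumes image_emb[i] and text_emb[i] form a matched pair (hypothetical layout).
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T  # shape: (n_images, n_captions)

    # Indices of the k highest-similarity captions per image (order within the k is arbitrary).
    top_k = np.argpartition(-sims, kth=k - 1, axis=1)[:, :k]

    # A retrieval is correct if the paired caption index appears in the top k.
    targets = np.arange(sims.shape[0])[:, None]
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())
```

`np.argpartition` avoids a full sort, which matters when the caption pool is large. Note that `k` must not exceed the number of captions.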

### Modality Gap via Linear Separability
Following \citet{modalityGAP}, we measure the modality gap between text and image embeddings by training a soft-margin SVM classifier to distinguish modality type. We evaluate classification accuracy, precision, and recall. High separability indicates a strong modality gap, which is undesirable because semantically matched image–text pairs should ideally share indistinguishable representations. Reducing modality separability would thus reflect improved multimodal coordination.
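
As a rough sketch of this measurement, the code below trains a linear soft-margin SVM to tell image embeddings from text embeddings and reports accuracy, precision, and recall on a held-out split. To stay dependency-free it uses a plain subgradient-descent solver for the hinge loss, which is a stand-in and not necessarily the solver used in this project; a library SVM would likely be used in practice. The function names, the hyperparameters (`C`, `lr`, `epochs`), and the 75/25 split are all illustrative assumptions.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.1, epochs=200):
    """Soft-margin linear SVM: minimize 0.5*||w||^2 + C * mean(hinge loss).

    Labels y must be in {-1, +1}. Full-batch subgradient descent (illustrative solver).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points that violate the margin
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -C * y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def modality_separability(image_emb, text_emb, test_frac=0.25, seed=0):
    """Train on a random split and score modality classification (image = +1, text = -1)."""
    X = np.vstack([image_emb, text_emb])
    y = np.concatenate([np.ones(len(image_emb)), -np.ones(len(text_emb))])
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]

    w, b = train_linear_svm(X[train], y[train])
    pred = np.where(X[test] @ w + b >= 0, 1.0, -1.0)

    tp = np.sum((pred == 1) & (y[test] == 1))
    fp = np.sum((pred == 1) & (y[test] == -1))
    fn = np.sum((pred == -1) & (y[test] == 1))
    return {
        "accuracy": float(np.mean(pred == y[test])),
        "precision": float(tp / (tp + fp)) if tp + fp else 0.0,
        "recall": float(tp / (tp + fn)) if tp + fn else 0.0,
    }
```

Accuracy near 0.5 on balanced data would suggest the two modalities are hard to separate linearly, while accuracy near 1.0 indicates a pronounced modality gap.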

### Hyperspherical Entropy Estimation
We measure the entropy of the embedding distribution on the hypersphere using the k-nearest-neighbor-based estimator proposed by \citet{entropy}. This estimator uses angular distances to compute local density estimates, which are aggregated into a global entropy measure. Entropy serves as a proxy for sparsity: low-entropy distributions are clustered and “spiky,” while high entropy indicates more uniform coverage of the hypersphere. Since our method encourages uniformity, we expect an increase in entropy relative to standard CLIP.
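
To make the idea concrete, here is a hedged sketch of a Kozachenko-Leonenko-style k-NN entropy estimate adapted to the unit sphere: the Euclidean ball volume is replaced by the surface area of the geodesic cap whose angular radius is the distance to the k-th nearest neighbor. This may differ in detail from the cited estimator; the function names, the midpoint-rule integration, and the `grid` parameter are assumptions made for illustration.

```python
import math
import numpy as np

def log_cap_area(theta, d, grid=512):
    """Log area of a geodesic cap of angular radius theta on the unit sphere in R^d.

    Area = S * integral_0^theta sin^(d-2)(t) dt, with S = 2*pi^((d-1)/2) / Gamma((d-1)/2).
    The integral is evaluated with a midpoint rule in log space to avoid underflow.
    """
    theta = np.asarray(theta, dtype=float)
    u = (np.arange(grid) + 0.5) / grid          # midpoints in (0, 1)
    t = theta[:, None] * u[None, :]             # integration nodes per point
    log_f = (d - 2) * np.log(np.sin(t))
    m = log_f.max(axis=1, keepdims=True)        # log-sum-exp stabilization
    log_integral = m[:, 0] + np.log(np.exp(log_f - m).sum(axis=1)) + np.log(theta / grid)
    log_const = math.log(2.0) + 0.5 * (d - 1) * math.log(math.pi) - math.lgamma(0.5 * (d - 1))
    return log_const + log_integral

def knn_hyperspherical_entropy(emb, k=5):
    """k-NN entropy estimate (in nats) for points projected onto the unit sphere."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n, d = x.shape
    theta = np.arccos(np.clip(x @ x.T, -1.0, 1.0))   # pairwise angular distances
    np.fill_diagonal(theta, np.inf)                   # exclude each point from its own neighbors
    theta_k = np.partition(theta, k - 1, axis=1)[:, k - 1]
    theta_k = np.maximum(theta_k, 1e-12)              # guard against duplicate points
    # For integer arguments, digamma(n) - digamma(k) = sum_{j=k}^{n-1} 1/j.
    digamma_diff = sum(1.0 / j for j in range(k, n))
    return float(digamma_diff + log_cap_area(theta_k, d).mean())
```

As a sanity check, points drawn uniformly on the 2-sphere should give an estimate close to `log(4 * pi)`, the log of the sphere's surface area, and tightly clustered points should score lower. The pairwise angle matrix is quadratic in the number of embeddings, so a large evaluation set would need batching or an approximate neighbor search.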

### Downstream Captioning Performance (BLEU)
Finally, we evaluate the utility of embeddings on a generative downstream task: image captioning. Image embeddings are passed into a pretrained language model to generate captions, which are compared against ground-truth captions using the BLEU score. BLEU measures n-gram overlap between generated and reference text, so it rewards captions that reproduce the wording and word order of the references. This extrinsic metric indicates whether improvements in embedding geometry translate into practical benefits for end-user tasks, beyond abstract geometric properties.
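
For reference, the scoring step can be sketched with the standard library alone. The version below is a simplified corpus-level BLEU with clipped n-gram precision and a brevity penalty, assuming one reference per caption, pre-tokenized inputs, and no smoothing. The function names are hypothetical, and an established BLEU implementation would normally be preferred for reported numbers.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Unsmoothed corpus BLEU with a single reference per candidate (illustrative)."""
    matched = [0] * max_n   # clipped n-gram matches, per order
    total = [0] * max_n     # candidate n-gram counts, per order
    cand_len = ref_len = 0

    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            c_counts = ngram_counts(cand, n)
            r_counts = ngram_counts(ref, n)
            # Clip each candidate n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in c_counts.items())
            total[n - 1] += sum(c_counts.values())

    # Without smoothing, any order with zero matches makes the score zero.
    if cand_len == 0 or min(matched) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty discourages candidates shorter than their references.
    bp = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    return bp * math.exp(log_precision)
```

For example, calling `corpus_bleu([generated.split()], [reference.split()])` on whitespace-tokenized strings returns a score in the range 0 to 1. Short captions often have no matching 4-grams, so a smaller `max_n` or a smoothed variant may be more informative at the sentence level.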

```python
from models.clipModel import CLIPModel

model = CLIPModel()
```