# Embedding Model Benchmark — Demo Notebook

This notebook walks through using the `benchmark` scaffold to compare embedding models.

**Structure:**
```
benchmark/
├── __init__.py
├── models.py   ← adapters (HuggingFace, OpenAI, …)
├── tasks.py    ← benchmark tasks (STS, Retrieval, Clustering)
├── cache.py    ← disk caching of embeddings
└── runner.py   ← orchestrator + result helpers
```

## 0. Install dependencies

In [None]:
# Run once (uncomment the line below); matplotlib is needed for the plot in section 6
# !pip install sentence-transformers datasets scipy scikit-learn pandas tqdm matplotlib

## 1. Setup & logging

In [None]:
import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(message)s",
    datefmt="%H:%M:%S",
    stream=sys.stdout,
)

# Make the `benchmark` package importable when the notebook runs from a
# subdirectory of the repo (not needed when running from the repo root)
import os

sys.path.insert(0, os.path.abspath(".."))

## 2. Inspect available tasks

In [None]:
from benchmark import TASK_REGISTRY

for name, task in TASK_REGISTRY.items():
    print(f"  {name:20s}  →  {task.description}")

## 3. Define models to benchmark

Models are declared as plain dicts, so adding or removing a model only requires editing this list.

In [None]:
MODEL_CONFIGS = [
    {
        "type": "sentence_transformer",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        # Fast, ~80 MB — good baseline
    },
    {
        "type": "sentence_transformer",
        "model": "BAAI/bge-small-en-v1.5",
        # Strong small model from BAAI
    },
    # Uncomment to add OpenAI (requires OPENAI_API_KEY):
    # {
    #     "type": "openai",
    #     "model": "text-embedding-3-small",
    # },
]

# Which tasks to run
TASK_NAMES = ["sts", "retrieval"]   # add "clustering" for a heavier run
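
One way a scaffold can turn plain dicts like these into model objects is a small factory keyed on `"type"`. The sketch below is illustrative only: `EchoAdapter`, `ADAPTERS`, and `build_model` are hypothetical names, not part of the `benchmark` package, and the real adapters in `models.py` may be constructed differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EchoAdapter:
    """Hypothetical stand-in for a real adapter; it only records the model name."""
    model: str

    def encode(self, texts: List[str]) -> List[List[float]]:
        # Placeholder "embedding": a single number per text (its length).
        return [[float(len(t))] for t in texts]


# Hypothetical mapping from the "type" field to a constructor.
ADAPTERS: Dict[str, Callable[..., EchoAdapter]] = {
    "sentence_transformer": EchoAdapter,
    "openai": EchoAdapter,
}


def build_model(config: dict) -> EchoAdapter:
    """Build an adapter from a config dict, failing loudly on unknown types."""
    params = dict(config)  # copy so the caller's dict is not mutated
    kind = params.pop("type")
    if kind not in ADAPTERS:
        raise ValueError(f"Unknown model type {kind!r}; expected one of {sorted(ADAPTERS)}")
    return ADAPTERS[kind](**params)


model = build_model({"type": "sentence_transformer", "model": "BAAI/bge-small-en-v1.5"})
print(model.model, model.encode(["hello"]))
```

Failing early on an unknown `"type"` keeps a typo in the config list from surfacing only after earlier models have already been encoded.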

## 4. Run the benchmark

In [None]:
from pathlib import Path
from benchmark import BenchmarkRunner

runner = BenchmarkRunner(
    model_configs=MODEL_CONFIGS,
    task_names=TASK_NAMES,
    output_dir=Path("results"),
    cache_dir=Path(".cache/embeddings"),
    batch_size=128,
    show_progress=True,
)

results = runner.run()

## 5. Explore results

In [None]:
from benchmark import results_to_dataframe, pivot_main_scores
import pandas as pd

pd.set_option("display.float_format", "{:.4f}".format)

df = results_to_dataframe(results)
df

In [None]:
# Model × Task pivot of main scores (higher is better for all tasks)
pivot_main_scores(results)
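
For readers who want to reshape the long-format frame themselves, the sketch below shows a comparable pivot in plain pandas. It assumes the frame has one row per `(model, task)` pair with `model`, `task`, and `main_score` columns; the model names and scores are arbitrary placeholders, not benchmark results, and `pivot_main_scores` itself may be implemented differently.

```python
import pandas as pd

# Placeholder rows standing in for results_to_dataframe(results);
# the numbers are arbitrary and only illustrate the reshaping.
long_df = pd.DataFrame(
    [
        {"model": "model-a", "task": "sts", "main_score": 0.10},
        {"model": "model-a", "task": "retrieval", "main_score": 0.20},
        {"model": "model-b", "task": "sts", "main_score": 0.30},
        {"model": "model-b", "task": "retrieval", "main_score": 0.40},
    ]
)

# One row per model, one column per task; each cell holds the main score.
wide_df = long_df.pivot(index="model", columns="task", values="main_score")
print(wide_df)
```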

In [None]:
# Encode and eval time per (model, task), fastest encode first
df[["model", "task", "encode_time_s", "eval_time_s"]].sort_values("encode_time_s")

## 6. Plot

In [None]:
import matplotlib.pyplot as plt

pivot = pivot_main_scores(results)

ax = pivot.plot(kind="bar", figsize=(8, 4), rot=30)
ax.set_ylabel("main_score")
ax.set_title("Embedding Model Comparison")
ax.legend(title="Task")
plt.tight_layout()
plt.show()

## 7. Load results from disk (useful after long runs)

The result for each `(model, task)` pair is written atomically as soon as that pair finishes,
so partial results from an interrupted run can be reloaded safely.

In [None]:
import json
from pathlib import Path

result_files = sorted(Path("results").glob("*.json"))
print(f"Found {len(result_files)} result file(s):")
for p in result_files:
    print(" ", p.name)
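
The atomic writes described at the top of this section are commonly implemented by writing to a temporary file in the target directory and then renaming it over the final path. The sketch below shows that pattern with the standard library only; `write_json_atomic` is a hypothetical helper, and the `benchmark` package may implement its writes differently.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write payload as JSON so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the final rename
    # stays on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # swaps the file into place in one step
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


demo_dir = Path(tempfile.mkdtemp())
target = demo_dir / "example_result.json"
write_json_atomic(target, {"main_score": 0.5})
print(json.loads(target.read_text(encoding="utf-8")))
```

Because the temporary file uses a `.tmp` suffix, a `*.json` glob like the one above would not pick up a write that is still in progress.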

In [None]:
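# Load summary (pandas is imported here so this cell also works in a fresh kernel)
import pandas as pd
from IPython.display import display

summary_path = Path("results") / "summary.json"
if summary_path.exists():
    with open(summary_path, encoding="utf-8") as f:
        summary = json.load(f)
    # An expression inside an `if` block is not auto-displayed; call display() explicitly
    display(pd.DataFrame(summary))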

## 8. Adding your own task

```python
from benchmark.tasks import Task, TASK_REGISTRY
from benchmark.cache import encode_with_cache
from dataclasses import dataclass
from pathlib import Path
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

@dataclass
class MyClassificationTask(Task):
    name: str = "My-Classification"
    description: str = "Linear probe accuracy on my dataset"

    def run(self, model, cache_dir: Path, **kwargs):
        # 1. Load your data (`load_my_data` is a placeholder; supply your own loader)
        train_texts, train_labels = load_my_data(split="train")
        test_texts,  test_labels  = load_my_data(split="test")

        # 2. Encode (with cache)
        train_embs = encode_with_cache(model, train_texts, self.name + "_train", cache_dir, **kwargs)
        test_embs  = encode_with_cache(model, test_texts,  self.name + "_test",  cache_dir, **kwargs)

        # 3. Your benchmark logic
        clf = LogisticRegression(max_iter=1000).fit(train_embs, train_labels)
        acc = accuracy_score(test_labels, clf.predict(test_embs))

        return {"accuracy": acc, "main_score": acc}

# Register it
TASK_REGISTRY["my-cls"] = MyClassificationTask()
```
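
`load_my_data` in the example above is not provided by the scaffold. A minimal sketch of such a loader follows, assuming a hypothetical layout of one CSV file per split with `text` and `label` columns; the `data_dir` parameter and its default are illustrative assumptions as well.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Tuple


def load_my_data(split: str, data_dir: Path = Path("data")) -> Tuple[List[str], List[str]]:
    """Read <data_dir>/<split>.csv and return (texts, labels)."""
    texts: List[str] = []
    labels: List[str] = []
    with open(data_dir / f"{split}.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            texts.append(row["text"])
            labels.append(row["label"])
    return texts, labels


# Tiny throwaway file to show the expected layout.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "train.csv").write_text(
    "text,label\nhello world,greeting\nsee you,farewell\n", encoding="utf-8"
)
texts, labels = load_my_data("train", data_dir=demo_dir)
print(texts, labels)
```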

Then add `"my-cls"` to `TASK_NAMES` in section 3 and re-run section 4; no other code changes are needed.
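
The registry mechanism can be pictured as a dictionary lookup per requested task name. The following is a minimal sketch under that assumption; `DemoTask`, `DEMO_REGISTRY`, and `resolve_tasks` are hypothetical names, and the actual lookup in `runner.py` may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DemoTask:
    """Hypothetical stand-in for benchmark.tasks.Task."""
    name: str
    description: str


# Hypothetical registry with the same shape as TASK_REGISTRY: key -> task instance.
DEMO_REGISTRY: Dict[str, DemoTask] = {
    "sts": DemoTask("STS", "Semantic textual similarity"),
}


def resolve_tasks(names: List[str], registry: Dict[str, DemoTask]) -> List[DemoTask]:
    """Return the tasks for the given names, reporting every unknown name at once."""
    missing = [n for n in names if n not in registry]
    if missing:
        raise KeyError(f"Unregistered task(s): {missing}; known: {sorted(registry)}")
    return [registry[n] for n in names]


# Registering a new task is a plain dict assignment ...
DEMO_REGISTRY["my-cls"] = DemoTask("My-Classification", "Linear probe accuracy")

# ... after which its key can be requested like any built-in task.
for task in resolve_tasks(["sts", "my-cls"], DEMO_REGISTRY):
    print(f"{task.name}: {task.description}")
```

Under this model, the registration cell has to run before the task names are resolved, otherwise the new key is reported as unregistered.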