# 📊 01_data_selection.ipynb — Data Selection Pipeline

This notebook prepares the dataset for our instruction-tuning experiments by applying **active example selection** to the OpenMathInstruct-1 dataset. Inspired by the LESS paper (Xia et al., 2024), we test whether **strategically selected examples** improve instruction-following performance more efficiently than random sampling.

## 🔧 What this notebook does:

1. **Load** a 10,000-example subset of the OpenMathInstruct-1 dataset
2. **Generate sentence embeddings** for the instructions using the `all-MiniLM-L6-v2` sentence-transformers model
3. **Select 1,000 examples** using **KMeans clustering** to maximize diversity
4. **Also sample 1,000 random examples** as a baseline
5. **Save all outputs** to the `data/` folder for training and evaluation

## 🗂 Outputs:

| File | Description |
|------|-------------|
| `data/openmath_full.json` | All 10K instruction-response pairs |
| `data/openmath_embeddings.npy` | Vector embeddings of instructions |
| `data/openmath_selected_1k.json` | 1K examples chosen via KMeans clustering (diversity-based selection) |
| `data/openmath_random_1k.json` | 1K random examples (baseline) |

These datasets will be used to train separate models and evaluate instruction-following quality using AlpacaEval.

In [None]:
# Run this cell only if the environment is not already set up
!pip install datasets sentence-transformers scikit-learn tqdm

## Load Dataset

In [None]:
from datasets import load_dataset
import json
import os

os.makedirs("data", exist_ok=True)

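# Load the first 10k training examples (adjust as needed).
# The dataset is published on the Hugging Face Hub under the `nvidia` organization.
dataset = load_dataset("nvidia/OpenMathInstruct-1", split="train")
dataset = dataset.select(range(10000))

# Format as instruction-response pairs. The dataset's own columns are
# `question` and `generated_solution`; rename them to the keys used downstream.
examples = [{"instruction": ex["question"], "response": ex["generated_solution"]} for ex in dataset]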

# Save full subset
with open("data/openmath_full.json", "w") as f:
    json.dump(examples, f, indent=2)

print(f"✅ Saved {len(examples)} examples.")
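
Math instruction datasets often include several solutions for the same question, so the 10K subset may contain repeated instructions. Repeats would skew both the clustering and the random baseline. The sketch below shows one way to collapse them before embedding. The helper `dedupe_by_instruction` is hypothetical (it is not part of the original notebook), and the toy list stands in for `examples`.

```python
def dedupe_by_instruction(examples):
    """Keep the first example for each distinct instruction text."""
    seen = set()
    unique = []
    for ex in examples:
        # Normalize whitespace and case so trivially different copies collapse
        key = " ".join(ex["instruction"].split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


# Toy data standing in for `examples` (assumption: same dict keys as above)
toy = [
    {"instruction": "What is 2 + 2?", "response": "4"},
    {"instruction": "what is  2 + 2?", "response": "2 + 2 = 4"},
    {"instruction": "Solve x + 1 = 3.", "response": "x = 2"},
]
unique_toy = dedupe_by_instruction(toy)
print(f"{len(toy)} examples -> {len(unique_toy)} unique instructions")
```

Whether to deduplicate is a design choice: keeping repeats preserves the dataset's natural distribution, while removing them makes the 1K selections cover more distinct problems.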

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm

# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Get instruction texts
texts = [ex["instruction"] for ex in examples]

# Embed instructions
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

# Save embeddings
np.save("data/openmath_embeddings.npy", embeddings)
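
KMeans clusters by Euclidean distance, while sentence embeddings are usually compared by cosine similarity. On unit-length vectors the two agree, since the squared Euclidean distance equals 2 − 2·cos. Many sentence-transformers models already return unit-length vectors, but checking is cheap. The following is a minimal sketch on random stand-in data; `l2_normalize` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Random stand-in for `embeddings`; 384 is the output size of all-MiniLM-L6-v2
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 384)).astype(np.float32)

unit = l2_normalize(toy_embeddings)
cosine = unit @ unit.T  # pairwise cosine similarity, shape (5, 5)

# On unit vectors: squared Euclidean distance == 2 - 2 * cosine similarity
sq_dist_01 = float(((unit[0] - unit[1]) ** 2).sum())
print(np.linalg.norm(unit, axis=1))
print(sq_dist_01, 2 - 2 * float(cosine[0, 1]))
```

In the real pipeline, the same effect can be requested directly with `model.encode(..., normalize_embeddings=True)`.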

In [None]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

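# Cluster the embeddings into 1,000 groups, one per example to select
k = 1000
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(embeddings)

# For each centroid, find the index of the closest example.
# distances has shape (k, n_examples); argmin over axis=1 gives one index per centroid.
centers = kmeans.cluster_centers_
distances = cdist(centers, embeddings)
closest_idxs = distances.argmin(axis=1)

# Two centroids can share the same nearest example, so report how many picks are distinct
print(f"{len(set(closest_idxs.tolist()))} distinct examples selected for {k} clusters")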

# Select final training examples
selected_examples = [examples[i] for i in closest_idxs]

# Save selected examples
with open("data/openmath_selected_1k.json", "w") as f:
    json.dump(selected_examples, f, indent=2)

print("✅ Saved data/openmath_selected_1k.json")
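
Taking the nearest example over the whole dataset for each centroid can, in rare cases, pick the same example for two centroids. A variant that searches only within each cluster's own members avoids this by construction. The sketch below assumes the cluster assignments (`labels`) and centroids (`centers`) from the cell above. The helper name `nearest_member_per_cluster` and the toy arrays are illustrative only.

```python
import numpy as np


def nearest_member_per_cluster(embeddings, labels, centers):
    """For each cluster, return the index of its own member closest to the centroid."""
    chosen = []
    for c in range(len(centers)):
        members = np.flatnonzero(labels == c)  # indices of examples assigned to cluster c
        if members.size == 0:
            continue  # empty cluster: nothing to pick
        dists = np.linalg.norm(embeddings[members] - centers[c], axis=1)
        chosen.append(int(members[dists.argmin()]))
    return chosen


# Toy 2-D data: two well-separated clusters of two points each
toy_embeddings = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.4]])
toy_labels = np.array([0, 0, 1, 1])
toy_centers = np.array([[0.15, 0.0], [5.0, 5.1]])

toy_idxs = nearest_member_per_cluster(toy_embeddings, toy_labels, toy_centers)
print(toy_idxs)
```

Because every example belongs to exactly one cluster, the returned indices are always distinct. The list is shorter than `k` only if some cluster is empty.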


In [None]:
import random

random.seed(42)
random_idxs = random.sample(range(len(examples)), 1000)
random_1k = [examples[i] for i in random_idxs]

with open("data/openmath_random_1k.json", "w") as f:
    json.dump(random_1k, f, indent=2)

print("✅ Saved data/openmath_random_1k.json")
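
Before training, it can be useful to check that the clustered subset really covers the embedding space better than the random one. One simple proxy, which is an assumption of this sketch and not a metric from the LESS paper or from this notebook, is the mean distance from every example to its nearest selected example: lower means better coverage. The function `mean_nearest_distance` is hypothetical and uses only NumPy.

```python
import numpy as np


def mean_nearest_distance(embeddings, subset_idxs, chunk_size=1024):
    """Mean Euclidean distance from each example to its nearest subset member."""
    subset = embeddings[np.asarray(subset_idxs)]
    subset_sq = (subset ** 2).sum(axis=1)
    nearest = []
    # Process in chunks so the (n_examples, n_subset) matrix never fully materializes
    for start in range(0, len(embeddings), chunk_size):
        chunk = embeddings[start:start + chunk_size]
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        sq = (chunk ** 2).sum(axis=1)[:, None] + subset_sq[None, :] - 2.0 * chunk @ subset.T
        sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding
        nearest.append(np.sqrt(sq.min(axis=1)))
    return float(np.concatenate(nearest).mean())


# Toy data: two groups of points far apart on a line
toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
spread = mean_nearest_distance(toy_points, [0, 2])   # one pick per group
clumped = mean_nearest_distance(toy_points, [0, 1])  # both picks in one group
print(spread, clumped)
```

With the real data, the same function could be called with `closest_idxs` and `random_idxs` to compare the two 1K subsets.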
