In [24]:
import pandas as pd
import numpy as np
import sagemaker
import boto3

### Notebook 4: Model Capability Profile Construction

This notebook transforms benchmark-level achievement data into aggregated model-level capability profiles.

It computes a composite quality score using performance, peer review status, and human comparison indicators, then aggregates these metrics by model.

The resulting model profiles provide structured inputs for downstream routing and cost-performance optimization.


### Configure AWS Environment and Data Access

Initialize AWS session and define S3 paths for reading input data and writing processed outputs.

In [25]:
# --- S3 paths ---
sess = sagemaker.Session()
bucket = sess.default_bucket()
raw_path = f"s3://{bucket}/llmachievements.csv"
out_local = "model_profiles.csv"
out_s3 = f"s3://{bucket}/processed/model_profiles.csv"

print("Reading:", raw_path)

Reading: s3://sagemaker-us-east-1-907086662522/llmachievements.csv


### Load Benchmark Achievement Data

Read the benchmark achievement dataset from Amazon S3.

This dataset contains model performance results across evaluation domains and tasks.

In [26]:
df = pd.read_csv(raw_path)

In [27]:
required_cols = ["Model", "Field", "Achievement"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"Dataset missing required columns: {missing}")

### Standardize and Clean Dataset Columns

Normalize column names and data formats to ensure consistent processing.

This step prepares the dataset for feature engineering and aggregation.
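
As a minimal sketch of the header cleanup described above, the snippet below collapses stray whitespace and line breaks in column names, then checks for required columns against the cleaned names. The helpers `normalize_headers` and `find_missing_columns` and the toy frame are hypothetical illustrations, not part of the notebook's pipeline.

```python
import pandas as pd

def normalize_headers(frame):
    """Return a copy whose column names have line breaks and extra spaces collapsed."""
    cleaned = frame.copy()
    # str.split() with no arguments splits on any whitespace run, including "\n" and "\r".
    cleaned.columns = [" ".join(str(c).split()) for c in cleaned.columns]
    return cleaned

def find_missing_columns(frame, required):
    """List the required column names that are absent from the frame."""
    return [c for c in required if c not in frame.columns]

# Toy data with deliberately messy headers (illustrative only).
toy = pd.DataFrame({" Model ": ["GPT-4"], "Peer-\nreviewed?": ["Yes"], "Field": ["Maths"]})
toy = normalize_headers(toy)
print(list(toy.columns))

missing_cols = find_missing_columns(toy, ["Model", "Field", "Achievement"])
print(missing_cols)
```

Running a required-column check on the cleaned names avoids reporting a column as missing when only its formatting differs.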

In [28]:
# --- basic cleaning ---
# normalize column names
df.columns = [c.strip().replace("\n", " ").replace("\r", " ") for c in df.columns]

# handle the irregular "Peer-\nreviewed?" header if it exists (matched case-insensitively)
peer_col = None
for c in df.columns:
    if "peer" in c.lower() and "review" in c.lower():
        peer_col = c
        break

### Engineer Model Performance Indicators

Create numeric indicators that capture important evaluation signals, including:

- Peer review status  
- Human performance comparison  
- Normalized benchmark results  

These features contribute to the model quality score.
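
The yes/no style indicators can be sketched as a small mapping from free text to booleans. This is an illustration under stated assumptions: `to_flag` is a hypothetical helper, the accepted spellings mirror the ones used in the cell below, and missing or unrecognized answers are treated as `False`.

```python
import pandas as pd

# Spellings counted as "yes"; everything else, including missing values, maps to False.
TRUTHY = {"yes", "y", "true", "1"}

def to_flag(series):
    """Map free-text yes/no answers to booleans."""
    return series.map(
        lambda v: pd.notna(v) and str(v).strip().lower() in TRUTHY
    ).astype(bool)

answers = pd.Series([" Yes ", "no", None, "TRUE", "n/a"])
flags = to_flag(answers)
print(flags.tolist())
```

Treating unknown answers as `False` is a conservative choice: a model only earns the bonus when the dataset states the condition explicitly.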

In [29]:
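# normalize text fields; the "string" dtype keeps missing values as <NA>
# instead of turning them into the literal text "nan"
for col in ["Model", "Field", "Outperforms human avg?"]:
    if col in df.columns:
        df[col] = df[col].astype("string").str.strip()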

# coerce numeric result fields if present
for col in ["Result", "Human result"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# normalize peer-reviewed to boolean-ish
if peer_col:
    df[peer_col] = df[peer_col].astype(str).str.strip().str.lower()
    df["peer_reviewed_flag"] = df[peer_col].isin(["yes", "y", "true", "1"])
else:
    df["peer_reviewed_flag"] = False

# normalize "outperforms" to boolean-ish
if "Outperforms human avg?" in df.columns:
    df["outperforms_flag"] = df["Outperforms human avg?"].astype(str).str.strip().str.lower().isin(["yes", "y", "true", "1"])
else:
    df["outperforms_flag"] = False

### Compute Composite Quality Score

Calculate a weighted performance score that summarizes model capability across benchmark evaluations.

The score combines multiple performance indicators into a single interpretable metric.
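
To make the weighting concrete, here is a small worked sketch on toy data. It assumes the additive rules used in the scoring cell below (+2 for outperforming the human average, +1 for peer review, up to +2 from the numeric result scaled between its 5th and 95th percentiles, +0.5 when a human result exists), so a single row can score at most 5.5. The function name `composite_score` and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def composite_score(frame):
    """Score each row with the additive rules described above."""
    score = 2.0 * frame["outperforms_flag"].astype(float)
    score = score + 1.0 * frame["peer_reviewed_flag"].astype(float)

    # Scale Result to 0..1 between its 5th and 95th percentiles, then weight by 2.
    result = frame["Result"]
    if result.notna().any():
        lo, hi = np.nanpercentile(result, [5, 95])
    else:
        lo, hi = 0.0, 100.0
    span = (hi - lo) if hi != lo else 1.0
    score = score + 2.0 * ((result - lo) / span).clip(0, 1).fillna(0)

    # Small bonus when a human baseline is available for comparison.
    score = score + 0.5 * frame["Human result"].notna().astype(float)
    return score

toy = pd.DataFrame({
    "outperforms_flag": [True, False, True],
    "peer_reviewed_flag": [False, True, True],
    "Result": [50.0, np.nan, 90.0],
    "Human result": [40.0, np.nan, np.nan],
})
toy["quality_score"] = composite_score(toy)
print(toy["quality_score"].tolist())
```

In this toy frame the three rows score 2.5, 1.0, and 5.0. The lowest and highest results are clipped to 0 and 1 by the percentile scaling, and a missing result contributes nothing.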

In [30]:
# --- Build a "quality score" ---
# Simple, explainable additive scoring (maximum 5.5 per row):
#   +2    if the model outperforms the human average
#   +1    if the result is peer-reviewed
#   +0..2 from the numeric Result (scaled to 0..2)
#   +0.5  if a Human result exists (a comparable benchmark exists); proxy for design doc
score = np.zeros(len(df), dtype=float)

score += 2.0 * df["outperforms_flag"].astype(float)
score += 1.0 * df["peer_reviewed_flag"].astype(float)

if "Result" in df.columns:
    # normalize Result to 0..2 using percentile scaling (robust)
    r = df["Result"].copy()
    r_min = np.nanpercentile(r, 5) if np.isfinite(r).any() else 0
    r_max = np.nanpercentile(r, 95) if np.isfinite(r).any() else 100
    denom = (r_max - r_min) if (r_max - r_min) != 0 else 1.0
    r_norm = ((r - r_min) / denom).clip(0, 1)
    score += 2.0 * r_norm.fillna(0)

if "Human result" in df.columns:
    score += 0.5 * df["Human result"].notna().astype(float)

df["quality_score"] = score

### Aggregate Metrics at Model Level

Group benchmark results by model and compute summary statistics, including:

- Mean performance score  
- Evaluation coverage  
- Domain diversity  

This produces one profile per model.
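
The grouping step can be sketched with pandas named aggregation on a toy frame. The column names follow the notebook; the rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Model": ["A", "A", "B"],
    "Field": ["Maths", "Medicine", "Finance"],
    "Achievement": ["a1", "a2", "b1"],
    "quality_score": [4.0, 3.0, 2.5],
})

# One output row per model; each keyword names an output column
# and gives the (input column, aggregation) pair that produces it.
profiles = (
    toy.groupby("Model")
    .agg(
        quality_score_mean=("quality_score", "mean"),
        achievements_count=("Achievement", "count"),
        domains_count=("Field", "nunique"),
    )
    .reset_index()
)
print(profiles)
```

Here `achievements_count` stands in for evaluation coverage and `domains_count` for domain diversity.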

### Assign Performance Tiers

Classify models into relative performance tiers based on aggregated quality scores.

This enables simplified comparison and routing decisions.
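
A minimal sketch of quantile-based tiering, assuming the rank-then-`qcut` approach used in the aggregation cell below. `assign_tiers` is a hypothetical helper and the scores are made up.

```python
import pandas as pd

def assign_tiers(scores, max_tiers=5):
    """Split scores into equally sized tiers numbered 1 (lowest) to max_tiers (highest)."""
    n_tiers = min(max_tiers, len(scores))
    # Ranking with method="first" makes every value unique, so qcut never
    # fails on duplicate bin edges when several models share a score.
    ranks = scores.rank(method="first")
    return pd.qcut(ranks, q=n_tiers, labels=False) + 1

scores = pd.Series([1.0, 2.0, 2.0, 3.5, 4.0, 4.0, 4.2, 4.5, 4.8, 5.0])
tiers = assign_tiers(scores)
print(tiers.tolist())
```

One consequence of this design: tied scores are ordered by position, so two models with the same mean score can land in adjacent tiers (the two 2.0 values above fall in tiers 1 and 2). The tiers are balanced in size rather than strictly separated by score.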

### Derive Domain Coverage Features

Identify evaluation domains associated with each model to measure breadth of capability.
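
As a sketch of the domain-coverage feature, the snippet below builds a sorted, de-duplicated, comma-separated domain list per model. The helper `join_domains` and the toy rows are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Model": ["A", "A", "A", "B"],
    "Field": ["Maths", " Medicine ", "Maths", None],
})

def join_domains(fields):
    """Return the distinct, trimmed domain names as one comma-separated string."""
    unique = {str(f).strip() for f in fields.dropna()}
    return ", ".join(sorted(unique))

domains = (
    toy.groupby("Model")["Field"]
    .agg(join_domains)
    .rename("domains_covered")
    .reset_index()
)
print(domains)
```

A model with no recorded domain gets an empty string rather than a missing value, which keeps the column uniformly textual.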


In [31]:
# --- Aggregate to model-level profiles ---
# We keep: mean and max score, achievement count, domain count,
# peer-reviewed rate, and outperforms-human rate
agg = df.groupby("Model", dropna=False).agg(
    quality_score_mean=("quality_score", "mean"),
    quality_score_max=("quality_score", "max"),
    achievements_count=("Achievement", "count"),
    domains_count=("Field", "nunique"),
    peer_reviewed_rate=("peer_reviewed_flag", "mean"),
    outperforms_rate=("outperforms_flag", "mean"),
).reset_index()

# comma-separated, sorted list of the distinct domains each model appears in
domains = (
    df.groupby("Model")["Field"]
    .apply(lambda s: ", ".join(sorted({str(x).strip() for x in s.dropna()})))
    .reset_index()
    .rename(columns={"Field": "domains_covered"})
)
model_profiles = agg.merge(domains, on="Model", how="left")

# --- Create tiers (1..5) for routing constraints ---
# Use quantiles so tiers are balanced
model_profiles["quality_tier"] = pd.qcut(
    model_profiles["quality_score_mean"].rank(method="first"),
    q=min(5, model_profiles.shape[0]),
    labels=False
) + 1

# sort best-first (highest tier, then highest mean score) for easier inspection
model_profiles = model_profiles.sort_values(["quality_tier", "quality_score_mean"], ascending=[False, False])

print("Model profiles preview:")
display(model_profiles.head(10))

# --- Save locally and upload to S3 ---
model_profiles.to_csv(out_local, index=False)
print("Saved:", out_local)

# upload via the AWS CLI (works in SageMaker notebooks)
!aws s3 cp {out_local} {out_s3}
!aws s3 ls s3://{bucket}/processed/

Model profiles preview:


| | Model | quality_score_mean | quality_score_max | achievements_count | domains_count | peer_reviewed_rate | outperforms_rate | domains_covered | quality_tier |
|---|---|---|---|---|---|---|---|---|---|
| 8 | Gemini 3 | 4.47688 | 4.47688 | 1 | 1 | 0.0 | 1.0 | Transcription | 5 |
| 3 | Claude 3.6S | 4.415792 | 4.415792 | 1 | 1 | 0.0 | 1.0 | Persuasion | 5 |
| 14 | o3-mini-high | 4.360643 | 4.360643 | 1 | 1 | 0.0 | 1.0 | Health reviews | 5 |
| 13 | o1 | 4.040036 | 4.5 | 2 | 2 | 0.0 | 1.0 | Maths, Medicine | 5 |
| 5 | GPT-4, etc | 3.694612 | 3.694612 | 1 | 1 | 0.0 | 1.0 | Emotional intelligence | 4 |
| 15 | o4-mini | 3.61401 | 3.61401 | 1 | 1 | 0.0 | 1.0 | Finance | 4 |
| 10 | davinci | 3.468992 | 4.0 | 4 | 3 | 0.0 | 1.0 | General knowledge, IQ (Binet-Simon Scale, verb... | 4 |
| 0 | Bing Chat | 3.385513 | 3.70368 | 2 | 2 | 0.0 | 1.0 | Japan: National Medical Licensure Examination,... | 3 |
| 6 | GPT-4.5 | 3.355234 | 3.355234 | 1 | 1 | 0.0 | 1.0 | Being human | 3 |
| 4 | GPT-4 | 2.38128 | 4.168894 | 16 | 15 | 0.0 | 0.9375 | Academia, Aerospace, Art (via prompting Midjou... | 3 |


Saved: model_profiles.csv
upload: ./model_profiles.csv to s3://sagemaker-us-east-1-907086662522/processed/model_profiles.csv
2026-02-22 23:20:31       1769 model_profiles.csv
2026-02-22 22:40:31    2667107 synthetic_requests_labeled_v2.csv


### Summary

This notebook generated aggregated model capability profiles from benchmark performance data.

The resulting dataset provides structured inputs for model routing, comparison, and cost-performance optimization.
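
Before a downstream step consumes the profiles, a light sanity check can catch malformed rows. The sketch below is a hypothetical `validate_profiles` helper, not part of this notebook. It assumes the column names produced above and uses two rows taken from the preview output.

```python
import pandas as pd

def validate_profiles(profiles):
    """Return a list of human-readable problems; an empty list means the table looks sane."""
    problems = []
    if profiles["Model"].duplicated().any():
        problems.append("duplicate Model rows")
    if not profiles["quality_tier"].between(1, 5).all():
        problems.append("quality_tier outside 1..5")
    for col in ("peer_reviewed_rate", "outperforms_rate"):
        if not profiles[col].between(0, 1).all():
            problems.append(f"{col} outside 0..1")
    return problems

sample = pd.DataFrame({
    "Model": ["Gemini 3", "GPT-4"],
    "quality_tier": [5, 3],
    "peer_reviewed_rate": [0.0, 0.0],
    "outperforms_rate": [1.0, 0.9375],
})
issues = validate_profiles(sample)
print(issues)
```

The same function could be applied to the frame loaded from `model_profiles.csv`, assuming that file keeps the column names shown in the preview.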