New here? Start with the QUICKSTART for a no-keys-needed demo in under 2 minutes, then come back here for the full reference. If you prefer Jupyter, jump straight into WALKTHROUGH.ipynb.
Model Router is a Microsoft Foundry feature that automatically picks the cheapest model that can answer each prompt well. This toolkit answers the question every team asks before adopting it: "Will Model Router actually save me money on my workload without hurting quality?"
You plug in your prompts, point at your endpoints, and get a side-by-side report comparing Model Router to any baseline model on three things that matter: quality, cost, and latency.
Automated quality, cost, and latency evaluation of Microsoft Foundry Model Router against any baseline model — bring your own prompts, get a full report in one command.
- Zero-friction start — run with mock data first to explore the reports and dashboard, no API keys needed
- Scale when ready — handles 1,000+ prompts with concurrency, async I/O, checkpoints, and resume
- Flexible data import — load prompts from JSONL, CSV, or a SQL database
- 24 models pre-configured — all pricing built into the YAML config, including router markup
- Fully configurable — swap the judge model, baseline model, scoring prompts, and concurrency limits
- One-file HTML dashboard — self-contained report with 8 embedded charts; share it, attach it to a PR, no server needed
- Anti-bias judge — every pairwise comparison runs twice with A/B order swapped; disagreements become ties, eliminating LLM position bias
- Value & efficiency scores — quality-per-dollar and quality-per-second ratios so you can weigh trade-offs, not just raw numbers
- Compare runs — diff two evaluations side-by-side to track improvements across configs, baselines, or time
- Crash-safe & auto-retry — append-only checkpoints lose at most one in-flight request; exponential backoff handles rate limits automatically
Optional: Foundry cloud evaluation — submit results to Microsoft Foundry as a post-processing step for cloud-based quality (score_model), cost, and latency grading. Adds governance, CI/CD integration, RBAC, and Foundry portal observability.
Benchmark Microsoft Foundry Model Router against a baseline model on quality, cost, and latency — then decide whether the router is right for your workload.
| Metric | What it measures |
|---|---|
| Quality | LLM-as-a-judge pairwise + absolute scoring (1–5) |
| Cost | Per-model token pricing with router markup formula |
| Latency | Response time — mean, p50, p90, p95, p99 |
| Value & Efficiency | Quality-per-dollar and quality-per-second composite scores |
| Model Distribution | Which models the router selects and how often |
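To make the latency and value rows concrete, here is a minimal sketch of how such summaries could be computed. The function names and the use of `statistics.quantiles` with inclusive interpolation are assumptions for illustration; the toolkit's own percentile method and score definitions may differ (see docs/methodology.md).

```python
import statistics


def latency_summary(latencies_s):
    """Return the mean and selected percentiles for latencies in seconds."""
    # With n=100, quantiles() returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": cuts[94],
        "p99": cuts[98],
    }


def value_scores(mean_quality, total_cost_usd, mean_latency_s):
    """Hypothetical composite ratios: higher means more quality per unit spent."""
    # Assumes cost and latency are non-zero.
    return {
        "quality_per_dollar": mean_quality / total_cost_usd,
        "quality_per_second": mean_quality / mean_latency_s,
    }
```

Ratios like these are only comparable between runs that use the same quality scale and the same prompt set.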
| Step | Action | Link |
|---|---|---|
| 0 | Explore interactively: step through the pipeline in Jupyter (no API keys) | WALKTHROUGH.ipynb |
| 0b | Or run the CLI quickstart: one command, open the dashboard | QUICKSTART.md |
| 1 | Install: clone and `pip install -e .` | Setup |
| 2 | Credentials: create `.env` with your Azure keys | Credentials |
| 3 | Configure: review `configs/default.yaml` | Configuration |
| 4 | Run: `python scripts/run_eval.py` | Run an Evaluation |
| 5 | Results: open `results/*/dashboard.html` | View Results |
| 6 | Go deeper: guides, methodology, architecture | docs/ |
Open WALKTHROUGH.ipynb and click Run All — see every chart, metric, and table inline in under 2 minutes. No API keys, no config, no installs beyond the base package.
Or run the CLI quickstart if you prefer a terminal:
# Windows
.\scripts\demo.ps1
# Linux / macOS
bash scripts/demo.sh

See QUICKSTART.md for the CLI quickstart details.
Prerequisites: Python 3.9+, a Microsoft Foundry Model Router endpoint, and an Azure OpenAI baseline endpoint.
git clone https://github.com/microsoft-foundry/Model-Router-Auto-Evaluation.git
cd Model-Router-Auto-Evaluation

A virtual environment keeps this project's dependencies isolated from your system Python.

Windows (PowerShell)

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Windows (Command Prompt)

python -m venv .venv
.venv\Scripts\activate.bat

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

Tip: You'll know the virtual environment is active when you see `(.venv)` at the start of your terminal prompt. To deactivate later, run `deactivate`.
pip install -e ".[dev]"

cp .env.example .env
# Edit .env with your endpoints and API keys

| Variable | Description |
|---|---|
| `AZURE_MODEL_ROUTER_ENDPOINT` | Model Router endpoint URL |
| `AZURE_MODEL_ROUTER_KEY` | Model Router API key |
| `AZURE_MODEL_ROUTER_DEPLOYMENT` | Model Router deployment name (e.g. model-router) |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL (baseline) |
| `AZURE_OPENAI_KEY` | Azure OpenAI API key (baseline) |
| `AZURE_BASELINE_DEPLOYMENT` | Baseline model deployment name (e.g. gpt-5) |
| `AZURE_JUDGE_ENDPOINT` | Judge model endpoint URL (can be same as baseline) |
| `AZURE_JUDGE_KEY` | Judge model API key |
| `AZURE_JUDGE_DEPLOYMENT` | Judge model deployment name (e.g. gpt-5) |
| `AZURE_AI_PROJECT_ENDPOINT` | Microsoft Foundry project endpoint (optional, for cloud eval) |
| `AZURE_AI_MODEL_DEPLOYMENT_NAME` | Foundry judge model deployment (optional, for cloud eval) |
Edit configs/default.yaml to set endpoints, baseline model, pricing, and judge settings.
- 24 models pre-configured — Azure OpenAI, Anthropic, xAI, DeepSeek, Meta
- LLM-as-a-judge enabled by default; configure via `AZURE_JUDGE_*` env vars or `judge.endpoint` in YAML
- Environment variables (`${VAR}`) are resolved from `.env` at load time
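As an illustration of `${VAR}` resolution at load time, the sketch below substitutes placeholders in an already-parsed config. The helper name `resolve_env` and the choice to raise on a missing variable are assumptions, not the toolkit's actual loader behavior.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable identifier.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Recursively replace ${VAR} placeholders in strings nested in dicts and lists."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        def _sub(match):
            name = match.group(1)
            if name not in env:
                # Assumption: fail loudly instead of leaving the placeholder in place.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _VAR.sub(_sub, value)
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged
```

Passing an explicit `env` mapping keeps the function easy to test without touching the real process environment.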
To swap the baseline or judge model, update the deployment name in .env or override in YAML:
# configs/default.yaml
baseline:
  deployment: gpt-5        # change to any Azure OpenAI deployment
judge:
  deployment: gpt-5        # model used for LLM-as-a-judge scoring
  concurrency: 3           # parallel judge calls

See configs/ for all presets (quick_test, large_scale, foundry).
# Dry-run — validate config and dataset, no API calls
python scripts/run_eval.py --dry-run
# Full evaluation (default: 10 sample prompts)
python scripts/run_eval.py
# Custom dataset / sample size (JSONL, CSV, or database)
python scripts/run_eval.py --dataset my_prompts.jsonl --sample-size 100
python scripts/run_eval.py --dataset my_prompts.csv
python scripts/run_eval.py --dataset "sqlite:///prompts.db?table=prompts"
# Resume an interrupted run from checkpoint
python scripts/run_eval.py --resume --output-dir results/my-run

For large-scale runs (500–1000+ prompts), see docs/how-to-resume-and-scale.md.
Results are written to results/<run-name>/:
| File | Contents |
|---|---|
| `dashboard.html` | Self-contained HTML dashboard with all charts |
| `report.md` | Markdown summary |
| `results.json` | Machine-readable metrics |
| `detailed_results.csv` | Per-prompt detail for further analysis |
Compare runs or export to CSV/JSON:
python scripts/compare_results.py results/run-a results/run-b
python scripts/export_results.py results/my-run --format csv

The evaluation pipeline runs in five stages:
Load Dataset → Resume Checkpoint → Eval Prompts → Judge (LLM) → Generate Report
↓ flush ↓ flush
checkpoint_eval checkpoint_judge
Checkpoint/resume — every result is flushed to disk immediately. If interrupted (Ctrl+C or crash), --resume picks up where you left off. See docs/how-to-resume-and-scale.md.
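The append-only checkpoint idea can be sketched as follows. This is a minimal illustration, assuming one JSON record per line keyed by `id`; the function names and file layout are hypothetical and not taken from the toolkit's code.

```python
import json
import os


def append_result(path, record):
    """Append one result as a JSON line and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def load_completed_ids(path):
    """Collect the ids already checkpointed, tolerating a truncated final line."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # A crash mid-write can leave a partial record; skip it so the
                # corresponding prompt is simply re-run on resume.
                continue
    return done


def pending(prompts, path):
    """Return the prompts that still need to be evaluated."""
    done = load_completed_ids(path)
    return [p for p in prompts if p["id"] not in done]
```

Because each record is written and synced on its own, an interruption can lose at most the request that was in flight at the time.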
Graceful shutdown — SIGINT is caught, in-flight requests finish, then exit with a resume command.
Concurrency — asyncio.Semaphore controls parallel API calls (default: 5 eval, 3 judge). Per-prompt, router and baseline are called sequentially for fair latency comparison.
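A rough sketch of this concurrency pattern is shown below. `call_router` and `call_baseline` are hypothetical async callables standing in for the real API clients; only the use of `asyncio.Semaphore` and the sequential router-then-baseline order come from the description above.

```python
import asyncio


async def evaluate_all(prompts, call_router, call_baseline, max_concurrency=5):
    """Evaluate prompts in parallel, with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def evaluate_one(prompt):
        async with sem:
            # Within a prompt, the two calls run one after the other so they
            # do not compete with each other for bandwidth or rate limit.
            router = await call_router(prompt)
            baseline = await call_baseline(prompt)
            return {"id": prompt["id"], "router": router, "baseline": baseline}

    # gather() returns results in the same order as the input prompts.
    return await asyncio.gather(*(evaluate_one(p) for p in prompts))
```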
Auto-retry — API calls retry with exponential backoff (2^attempt seconds) on transient errors and 429 rate limits, so large runs survive throttling without manual intervention.
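The backoff schedule could look roughly like this. `TransientError` is a hypothetical stand-in for whatever retryable exceptions the real client raises, and counting attempts from 0 (so the first wait is 1 second) is an assumption.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical marker for a 429 or other retryable failure."""


async def with_retry(call, max_attempts=5, sleep=asyncio.sleep):
    """Await call(); on TransientError wait 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```

The injectable `sleep` parameter is only there so the schedule can be checked in tests without actually waiting.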
Router cost = router_markup × input_tokens
+ underlying_model_input × input_tokens
+ underlying_model_output × output_tokens
The underlying model is identified from each response and matched to pricing via prefix matching. See docs/methodology.md for full details.
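As a worked illustration of the formula, the sketch below computes the router cost for one response. The model names, the per-1M-token price unit, and the longest-prefix tie-break are all assumptions made for the example; the real values and matching rules live in configs/default.yaml and docs/methodology.md.

```python
def match_pricing(model_name, pricing):
    """Find the pricing entry whose key is a prefix of the reported model name."""
    candidates = [key for key in pricing if model_name.startswith(key)]
    if not candidates:
        raise KeyError(f"no pricing entry matches {model_name!r}")
    # Assumption: prefer the longest prefix so "model-a-mini-..." does not
    # fall back to the shorter "model-a" entry.
    return pricing[max(candidates, key=len)]


def router_cost(model_name, input_tokens, output_tokens, pricing, markup_per_1m):
    """Router cost in USD, with all prices expressed per 1M tokens (assumed unit)."""
    price = match_pricing(model_name, pricing)
    total = (
        markup_per_1m * input_tokens
        + price["input"] * input_tokens
        + price["output"] * output_tokens
    )
    return total / 1_000_000


# Hypothetical prices, for illustration only.
EXAMPLE_PRICING = {
    "model-a": {"input": 1.0, "output": 10.0},
    "model-a-mini": {"input": 0.5, "output": 2.0},
}
```

With these made-up numbers and a markup of 0.1, a response from `model-a-mini-2025-01-01` using 1,000,000 input tokens and 500,000 output tokens costs (0.1 + 0.5) × 1 + 2.0 × 0.5 = 1.6 USD.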
| Method | Description |
|---|---|
| Pairwise | Side-by-side comparison with dual-ordering to cancel position bias |
| Absolute | Independent scoring on Accuracy, Completeness, Clarity, Helpfulness (1–5) |
See docs/methodology.md for statistical methodology and sample size guidance.
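The dual-ordering rule for pairwise judging can be sketched as a small pure function. The verdict labels ("A", "B", "tie") and the function name are assumptions; the rule itself, that disagreement between the two orderings counts as a tie, is the one described in this README.

```python
def combine_verdicts(first_pass, second_pass):
    """Merge two judge verdicts taken with the answer order swapped.

    first_pass:  verdict when the router answer was shown as A.
    second_pass: verdict when the router answer was shown as B.
    Returns "router", "baseline", or "tie".
    """
    first = {"A": "router", "B": "baseline"}.get(first_pass, "tie")
    second = {"A": "baseline", "B": "router"}.get(second_pass, "tie")
    # A win only counts if the judge picks the same system in both orders.
    return first if first == second else "tie"
```

A judge that always favors position A produces ("A", "A"), which maps to a tie instead of a spurious win.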
Supports JSONL, CSV, and SQL databases. Only id and prompt are required; all other fields are optional.
JSONL (one JSON object per line):
{"id": "001", "prompt": "Explain quantum entanglement in simple terms."}
{"id": "002", "prompt": "Write a Python function to merge two sorted lists.", "category": "code_generation"}

CSV (header row with id and prompt columns):
id,prompt,category,difficulty
001,Explain quantum entanglement in simple terms.,,
002,Write a Python function to merge two sorted lists.,code_generation,easy

Database (SQLite built-in; other DBs via pip install -e ".[db]"):
python scripts/run_eval.py --dataset "sqlite:///prompts.db?table=prompts"
python scripts/run_eval.py --dataset "postgresql://user:pw@host/db?table=prompts"

Optional fields: category, difficulty, ground_truth, metadata. See docs/how-to-custom-dataset.md.
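To check a dataset before a run, a loader along these lines can validate the two required fields. This is a hedged sketch with a hypothetical `load_prompts` helper covering only JSONL and CSV; it is not the toolkit's importer, and `--dry-run` remains the supported way to validate a dataset.

```python
import csv
import json

REQUIRED = ("id", "prompt")


def load_prompts(path):
    """Load prompts from a .jsonl or .csv file and check the required fields."""
    if path.endswith(".jsonl"):
        with open(path, encoding="utf-8") as f:
            rows = [json.loads(line) for line in f if line.strip()]
    elif path.endswith(".csv"):
        with open(path, newline="", encoding="utf-8") as f:
            # Drop empty cells so optional columns left blank are simply absent.
            rows = [{k: v for k, v in row.items() if v} for row in csv.DictReader(f)]
    else:
        raise ValueError(f"unsupported dataset format: {path}")

    for number, row in enumerate(rows, start=1):
        missing = [field for field in REQUIRED if not row.get(field)]
        if missing:
            raise ValueError(f"record {number} is missing required field(s): {missing}")
    return rows
```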
Project Structure: see STRUCTURE.md for the full file/folder map with annotations.
Full guides live in docs/ — start with the docs index for a suggested reading order tailored to your goal.
| Guide | Description |
|---|---|
| WALKTHROUGH.ipynb | Interactive step-by-step in Jupyter — see every chart inline (no API keys) |
| QUICKSTART.md | CLI quickstart — one command, open the dashboard |
| How to Run a Live Eval | End-to-end walkthrough with real Azure endpoints |
| Interpreting Results | What each chart and metric means, with a glossary |
| Custom Datasets | JSONL / CSV / SQL schemas, examples, best practices |
| FAQ | Troubleshooting, rate limits, Foundry issues |
| Guide | Description |
|---|---|
| Resume & Scale | Checkpoint/resume, 1,000+ prompt runs, rate limits |
| Compare Runs | Side-by-side diff of two evaluations |
| Methodology | Scoring, cost formula, statistical approach, judge bias mitigation |
| Guide | Description |
|---|---|
| Foundry Cloud Eval | Run grading in Microsoft Foundry with managed graders + portal visibility |
| Cost & Latency Design | Why we use Python graders for cost/latency in Foundry (advanced) |
| Guide | Description |
|---|---|
| Architecture | Component design, data flow, extension points |
| STRUCTURE.md | Annotated file/folder map of the whole repo |
Submit your evaluation results to Microsoft Foundry for cloud-based grading with governance, RBAC, and portal visibility.
# Install Foundry SDK dependencies
pip install -e ".[foundry]"
# Authenticate
az login
# Add Foundry variables to your .env (see .env.example)
# AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com
# AZURE_AI_MODEL_DEPLOYMENT_NAME=gpt-5
# Dry run (validate config + transform data, no API calls)
python scripts/run_foundry_eval.py --dry-run
# Full cloud evaluation
python scripts/run_foundry_eval.py --input-dir results/full-eval
# Cross-validate local vs Foundry results
python scripts/cross_validate.py

The Foundry integration is a separate post-processing layer; the core eval flow is untouched. See docs/how-to-foundry-eval-sdk.md for details and docs/faq.md for Foundry troubleshooting.

How cost & latency work in Foundry: Foundry Evaluations supports quality grading natively via `score_model` graders. For cost and latency, which Foundry doesn't evaluate out of the box, we use `python` graders that score pre-computed metrics from the local eval run. This gives you a unified quality + cost + latency view in a single Foundry eval, while keeping the door open for native support when it arrives. See docs/foundry-cost-latency-design.md for the full design rationale.
# All unit tests (167 tests; 3 live-integration tests are skipped without Azure credentials)
pytest tests/ -v
# Skip live integration tests
pytest tests/ -v -m "not integration"
# Only SDK compatibility checks
pytest tests/foundry/test_sdk_compat.py -v
# Only Foundry integration tests (requires az login + env vars)
pytest tests/foundry/test_integration.py -v -m integration

- Fork the repo and create a feature branch
- Install dev dependencies: `pip install -e ".[dev]"`
- Run tests: `pytest`
- Open a PR with a clear description
We welcome bug reports, feature requests, documentation fixes, and general feedback:
- 🐛 Bug? Open a bug report
- ✨ Feature idea? Open a feature request
- 📚 Docs unclear? File a documentation issue
- 💬 General feedback or experience report? Share feedback or use GitHub Discussions for open-ended questions.
- 🔒 Security vulnerability? Please report privately via the Microsoft Security Response Center, not as a public issue.
MIT — see LICENSE for details.