Microsoft Foundry — Model Router Auto Evaluation

New here? Start with the QUICKSTART for a no-keys-needed demo in under 2 minutes, then come back here for the full reference. If you prefer Jupyter, jump straight into WALKTHROUGH.ipynb.

What is this?

Model Router is a Microsoft Foundry feature that automatically picks the cheapest model that can answer each prompt well. This toolkit answers the question every team asks before adopting it: "Will Model Router actually save me money on my workload without hurting quality?"

You plug in your prompts, point at your endpoints, and get a side-by-side report comparing Model Router to any baseline model on three things that matter: quality, cost, and latency.

What You Get

Automated quality, cost, and latency evaluation of Microsoft Foundry Model Router against any baseline model — bring your own prompts, get a full report in one command.

Zero-friction start: run with mock data first to explore the reports and dashboard, no API keys needed
Scale when ready: handles 1,000+ prompts with concurrency, async I/O, checkpoints, and resume
Flexible data import: load prompts from JSONL, CSV, or a SQL database
24 models pre-configured: all pricing is built into the YAML config, including the router markup
Fully configurable: swap the judge model, baseline model, scoring prompts, and concurrency limits
One-file HTML dashboard: a self-contained report with 8 embedded charts that you can share or attach to a PR, no server needed
Anti-bias judge: every pairwise comparison runs twice with the A/B order swapped, and disagreements become ties, which mitigates LLM position bias
Value & efficiency scores: quality-per-dollar and quality-per-second ratios, so you can weigh trade-offs rather than only raw numbers
Compare runs: diff two evaluations side by side to track changes across configs, baselines, or time
Crash-safe & auto-retry: append-only checkpoints lose at most one in-flight request, and exponential backoff handles rate limits automatically

Optional: Foundry cloud evaluation — submit results to Microsoft Foundry as a post-processing step for cloud-based quality (score_model), cost, and latency grading. Adds governance, CI/CD integration, RBAC, and Foundry portal observability.

What it Measures

Benchmark Microsoft Foundry Model Router against a baseline model on quality, cost, and latency — then decide whether the router is right for your workload.

Metric	What it measures
Quality	LLM-as-a-judge pairwise + absolute scoring (1–5)
Cost	Per-model token pricing with router markup formula
Latency	Response time — mean, p50, p90, p95, p99
Value & Efficiency	Quality-per-dollar and quality-per-second composite scores
Model Distribution	Which models the router selects and how often
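
As a rough illustration of the Latency and Value & Efficiency rows above, the sketch below computes the listed percentiles and the two ratios. The function names, the nearest-rank percentile method, and the exact ratio definitions (mean quality divided by total cost or mean latency) are assumptions made for this example; the toolkit's own definitions are in docs/methodology.md.

```python
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def latency_summary(latencies_s):
    """Summarize latencies (in seconds) as mean, p50, p90, p95, and p99."""
    ordered = sorted(latencies_s)
    summary = {"mean": statistics.fmean(ordered)}
    for p in (50, 90, 95, 99):
        summary[f"p{p}"] = percentile(ordered, p)
    return summary


def value_scores(mean_quality, total_cost_usd, mean_latency_s):
    """Quality-per-dollar and quality-per-second ratios (assumed definitions)."""
    return {
        "quality_per_dollar": mean_quality / total_cost_usd,
        "quality_per_second": mean_quality / mean_latency_s,
    }
```

Comparing these ratios for the router and the baseline shows whether a small quality loss is offset by a larger cost or latency gain.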

Get Started

Step	Action	Link
0	Explore interactively — step through the pipeline in Jupyter (no API keys)	WALKTHROUGH.ipynb
0b	Or run the CLI quickstart — one command, open the dashboard	QUICKSTART.md
1	Install — clone and `pip install -e .`	Setup
2	Credentials — create `.env` with your Azure keys	Credentials
3	Configure — review `configs/default.yaml`	Configuration
4	Run — `python scripts/run_eval.py`	Run an Evaluation
5	Results — open `results/*/dashboard.html`	View Results
6	Go deeper — guides, methodology, architecture	docs/

0. Explore (No Keys Required)

Open WALKTHROUGH.ipynb and click Run All — see every chart, metric, and table inline in under 2 minutes. No API keys, no config, no installs beyond the base package.

Or run the CLI quickstart if you prefer a terminal:

# Windows
.\scripts\demo.ps1

# Linux / macOS
bash scripts/demo.sh

See QUICKSTART.md for the CLI quickstart details.

1. Install

Prerequisites: Python 3.9+, a Microsoft Foundry Model Router endpoint, and an Azure OpenAI baseline endpoint.

git clone https://github.com/microsoft-foundry/Model-Router-Auto-Evaluation.git
cd Model-Router-Auto-Evaluation

Create a virtual environment (recommended)

A virtual environment keeps this project's dependencies isolated from your system Python.

Windows (PowerShell)

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Windows (Command Prompt)

python -m venv .venv
.venv\Scripts\activate.bat

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate

Tip: You'll know the virtual environment is active when you see (.venv) at the start of your terminal prompt. To deactivate later, run deactivate.

Install the package

pip install -e ".[dev]"

2. Credentials

cp .env.example .env
# Edit .env with your endpoints and API keys

Variable	Description
`AZURE_MODEL_ROUTER_ENDPOINT`	Model Router endpoint URL
`AZURE_MODEL_ROUTER_KEY`	Model Router API key
`AZURE_MODEL_ROUTER_DEPLOYMENT`	Model Router deployment name (e.g. `model-router`)
`AZURE_OPENAI_ENDPOINT`	Azure OpenAI endpoint URL (baseline)
`AZURE_OPENAI_KEY`	Azure OpenAI API key (baseline)
`AZURE_BASELINE_DEPLOYMENT`	Baseline model deployment name (e.g. `gpt-5`)
`AZURE_JUDGE_ENDPOINT`	Judge model endpoint URL (can be same as baseline)
`AZURE_JUDGE_KEY`	Judge model API key
`AZURE_JUDGE_DEPLOYMENT`	Judge model deployment name (e.g. `gpt-5`)
`AZURE_AI_PROJECT_ENDPOINT`	Microsoft Foundry project endpoint (optional, for cloud eval)
`AZURE_AI_MODEL_DEPLOYMENT_NAME`	Foundry judge model deployment (optional, for cloud eval)

3. Configuration

Edit configs/default.yaml to set endpoints, baseline model, pricing, and judge settings.

24 models pre-configured — Azure OpenAI, Anthropic, xAI, DeepSeek, Meta
LLM-as-a-judge enabled by default; configure via AZURE_JUDGE_* env vars or judge.endpoint in YAML
Environment variables (${VAR}) are resolved from .env at load time
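
To make the `${VAR}` substitution concrete, here is a minimal sketch of how placeholders in an already-parsed config could be resolved. `resolve_env` is a hypothetical helper written for this example, not the toolkit's loader, and leaving unset variables untouched is an assumption.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Recursively replace ${VAR} placeholders in strings, dicts, and lists."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    if isinstance(value, str):
        # Unset variables are left as-is so a missing key is easy to spot.
        return _VAR_PATTERN.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    return value
```

For example, `resolve_env({"judge": {"endpoint": "${AZURE_JUDGE_ENDPOINT}"}})` would return the same structure with the endpoint filled in from the environment.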

To swap the baseline or judge model, update the deployment name in .env or override in YAML:

# configs/default.yaml
baseline:
  deployment: gpt-5          # change to any Azure OpenAI deployment
judge:
  deployment: gpt-5          # model used for LLM-as-a-judge scoring
  concurrency: 3             # parallel judge calls

See configs/ for all presets (quick_test, large_scale, foundry).

4. Run an Evaluation

# Dry-run — validate config and dataset, no API calls
python scripts/run_eval.py --dry-run

# Full evaluation (default: 10 sample prompts)
python scripts/run_eval.py

# Custom dataset / sample size (JSONL, CSV, or database)
python scripts/run_eval.py --dataset my_prompts.jsonl --sample-size 100
python scripts/run_eval.py --dataset my_prompts.csv
python scripts/run_eval.py --dataset "sqlite:///prompts.db?table=prompts"

# Resume an interrupted run from checkpoint
python scripts/run_eval.py --resume --output-dir results/my-run

For large-scale runs (500–1000+ prompts), see docs/how-to-resume-and-scale.md.

5. View Results

Results are written to results/<run-name>/:

File	What
`dashboard.html`	Self-contained HTML dashboard with all charts
`report.md`	Markdown summary
`results.json`	Machine-readable metrics
`detailed_results.csv`	Per-prompt detail for further analysis

Compare runs or export to CSV/JSON:

python scripts/compare_results.py results/run-a results/run-b
python scripts/export_results.py results/my-run --format csv

How It Works

The evaluation pipeline runs in five stages:

Load Dataset → Resume Checkpoint → Eval Prompts → Judge (LLM) → Generate Report
                                     ↓ flush         ↓ flush
                               checkpoint_eval   checkpoint_judge

Checkpoint/resume — every result is flushed to disk immediately. If interrupted (Ctrl+C or crash), --resume picks up where you left off. See docs/how-to-resume-and-scale.md.
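
The append-only checkpoint idea can be sketched as follows. This assumes a JSONL checkpoint file with one record per line and an `id` field; the helper names and file layout are illustrative, not the toolkit's actual format.

```python
import json
import os


def append_result(path, record):
    """Append one result as a JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def load_completed_ids(path):
    """Return the ids already checkpointed, skipping a torn final line."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # partial write from a crash; that prompt is re-run
    return done
```

On resume, prompts whose ids are in the returned set are skipped, which is why at most the request that was in flight during the crash is lost.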

Graceful shutdown: SIGINT is caught, in-flight requests are allowed to finish, and the process then exits with the command to resume the run.

Concurrency: an asyncio.Semaphore limits parallel API calls (default: 5 eval, 3 judge). For each prompt, the router and the baseline are called sequentially rather than in parallel, so the latency comparison is fair.
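
A minimal sketch of this pattern is shown below. `call_model` is a hypothetical stand-in for a real API call, and the function names are assumptions; only the use of `asyncio.Semaphore` and the sequential router-then-baseline order come from the description above.

```python
import asyncio
import time


async def call_model(name, prompt):
    """Hypothetical stand-in for a real endpoint call."""
    await asyncio.sleep(0.01)
    return f"{name}: {prompt}"


async def eval_prompt(semaphore, prompt):
    """Time router and baseline one after the other while holding one slot."""
    async with semaphore:
        timings = {}
        for name in ("router", "baseline"):
            start = time.perf_counter()
            await call_model(name, prompt)
            timings[name] = time.perf_counter() - start
        return timings


async def run_all(prompts, max_concurrency=5):
    """Evaluate all prompts with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(eval_prompt(semaphore, p) for p in prompts))
```

Running `asyncio.run(run_all(["a", "b", "c"]))` returns one timing dict per prompt, in the same order as the input.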

Auto-retry — API calls retry with exponential backoff (2^attempt seconds) on transient errors and 429 rate limits, so large runs survive throttling without manual intervention.
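
The retry behavior can be sketched like this. `TransientError`, `call_with_retry`, the attempt limit, and counting attempts from zero are assumptions for the example; only the 2^attempt delay is taken from the description above.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as HTTP 429."""


def call_with_retry(fn, max_attempts=5, sleep=time.sleep):
    """Call fn, waiting 2**attempt seconds between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            sleep(2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.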

Cost Formula

Router cost = router_markup × input_tokens
            + underlying_model_input × input_tokens
            + underlying_model_output × output_tokens

The underlying model is identified from each response and matched to pricing via prefix matching. See docs/methodology.md for full details.
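
As a worked sketch of the formula, the code below matches a response's model name to a pricing entry by longest prefix and then applies the three terms. The prices, the markup, the model names, and the per-1M-token unit are placeholders invented for this example, not values from the toolkit's YAML config.

```python
# Placeholder rates in USD per 1M tokens (illustrative only).
PRICING = {
    "model-a": {"input": 1.0, "output": 4.0},
    "model-a-mini": {"input": 0.5, "output": 2.0},
}


def match_pricing(model_name, pricing):
    """Return the entry whose key is the longest prefix of model_name."""
    candidates = [key for key in pricing if model_name.startswith(key)]
    if not candidates:
        raise KeyError(f"no pricing entry matches {model_name!r}")
    return pricing[max(candidates, key=len)]


def router_cost(model_name, input_tokens, output_tokens, pricing, router_markup):
    """Router markup on input tokens plus the underlying model's input and output cost."""
    rates = match_pricing(model_name, pricing)
    total = (
        router_markup * input_tokens
        + rates["input"] * input_tokens
        + rates["output"] * output_tokens
    )
    return total / 1_000_000  # rates are per 1M tokens in this sketch
```

Choosing the longest matching prefix keeps a versioned name such as `model-a-mini-2025-01-01` from being priced as `model-a`.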

Quality Evaluation

Method	Description
Pairwise	Side-by-side comparison with dual-ordering to cancel position bias
Absolute	Independent scoring on Accuracy, Completeness, Clarity, Helpfulness (1–5)

See docs/methodology.md for statistical methodology and sample size guidance.
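
The dual-ordering idea in the Pairwise row can be sketched as below. The `judge` callable and its return values ("A", "B", or "tie") are a hypothetical interface assumed for this example; the real judge prompts and parsing live in the toolkit.

```python
def debiased_verdict(judge, prompt, router_answer, baseline_answer):
    """Run the judge in both orders and keep the verdict only if they agree."""
    # First pass: the router answer is shown in position A.
    first = judge(prompt, router_answer, baseline_answer)
    # Second pass: positions swapped, so the router answer is in position B.
    second = judge(prompt, baseline_answer, router_answer)

    first_winner = {"A": "router", "B": "baseline"}.get(first, "tie")
    second_winner = {"A": "baseline", "B": "router"}.get(second, "tie")

    # Disagreement between the two orderings is treated as a tie.
    return first_winner if first_winner == second_winner else "tie"
```

A judge that always prefers position A produces opposite winners in the two passes, so its verdicts collapse to ties instead of favoring one side.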

Dataset Format

Supports JSONL, CSV, and SQL databases. Only id and prompt are required; all other fields are optional.

JSONL (one JSON object per line):

{"id": "001", "prompt": "Explain quantum entanglement in simple terms."}
{"id": "002", "prompt": "Write a Python function to merge two sorted lists.", "category": "code_generation"}

CSV (header row with id and prompt columns):

id,prompt,category,difficulty
001,Explain quantum entanglement in simple terms.,,
002,Write a Python function to merge two sorted lists.,code_generation,easy

Database (SQLite built-in; other DBs via pip install -e ".[db]"):

python scripts/run_eval.py --dataset "sqlite:///prompts.db?table=prompts"
python scripts/run_eval.py --dataset "postgresql://user:pw@host/db?table=prompts"

Optional fields: category, difficulty, ground_truth, metadata. See docs/how-to-custom-dataset.md.
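
To check a file against the required fields before a run, a small validator along these lines may help. `load_prompts` is a hypothetical helper for this example and covers only the JSONL and CSV cases; the toolkit's own loader may differ.

```python
import csv
import json

REQUIRED_FIELDS = ("id", "prompt")


def load_prompts(path):
    """Load prompts from a .jsonl or .csv file and verify required fields."""
    if path.endswith(".jsonl"):
        with open(path, encoding="utf-8") as handle:
            rows = [json.loads(line) for line in handle if line.strip()]
    elif path.endswith(".csv"):
        with open(path, newline="", encoding="utf-8") as handle:
            # Drop empty cells so optional columns are simply absent.
            rows = [
                {key: value for key, value in row.items() if key and value}
                for row in csv.DictReader(handle)
            ]
    else:
        raise ValueError(f"unsupported dataset format: {path}")

    for number, row in enumerate(rows, start=1):
        missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
        if missing:
            raise ValueError(f"record {number} is missing {missing}")
    return rows
```

Running it on the CSV sample above yields two records, the first with only `id` and `prompt` because its optional cells are empty.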

Project structure: see STRUCTURE.md for the full file/folder map with annotations.

Documentation

Full guides live in docs/ — start with the docs index for a suggested reading order tailored to your goal.

Start here (beginners)

Guide	Description
WALKTHROUGH.ipynb	Interactive step-by-step in Jupyter — see every chart inline (no API keys)
QUICKSTART.md	CLI quickstart — one command, open the dashboard
How to Run a Live Eval	End-to-end walkthrough with real Azure endpoints
Interpreting Results	What each chart and metric means, with a glossary
Custom Datasets	JSONL / CSV / SQL schemas, examples, best practices
FAQ	Troubleshooting, rate limits, Foundry issues

Going deeper (intermediate)

Guide	Description
Resume & Scale	Checkpoint/resume, 1,000+ prompt runs, rate limits
Compare Runs	Side-by-side diff of two evaluations
Methodology	Scoring, cost formula, statistical approach, judge bias mitigation

Microsoft Foundry cloud evaluation

Guide	Description
Foundry Cloud Eval	Run grading in Microsoft Foundry with managed graders + portal visibility
Cost & Latency Design	Why we use Python graders for cost/latency in Foundry (advanced)

Contributors

Guide	Description
Architecture	Component design, data flow, extension points
STRUCTURE.md	Annotated file/folder map of the whole repo

Foundry Cloud Evaluation (Optional)

Submit your evaluation results to Microsoft Foundry for cloud-based grading with governance, RBAC, and portal visibility.

# Install Foundry SDK dependencies
pip install -e ".[foundry]"

# Authenticate
az login

# Add Foundry variables to your .env (see .env.example)
# AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com
# AZURE_AI_MODEL_DEPLOYMENT_NAME=gpt-5

# Dry run (validate config + transform data, no API calls)
python scripts/run_foundry_eval.py --dry-run

# Full cloud evaluation
python scripts/run_foundry_eval.py --input-dir results/full-eval

# Cross-validate local vs Foundry results
python scripts/cross_validate.py

The Foundry integration is a separate post-processing layer — the core eval flow is untouched. See docs/how-to-foundry-eval-sdk.md for details and docs/faq.md for Foundry troubleshooting.

How cost & latency work in Foundry: Foundry Evaluations supports quality grading natively via score_model graders. For cost and latency — which Foundry doesn't evaluate out of the box — we use python graders that score pre-computed metrics from the local eval run. This gives you a unified quality + cost + latency view in a single Foundry eval, while keeping the door open for native support when it arrives. See docs/foundry-cost-latency-design.md for the full design rationale.

Running Tests

# All unit tests (167 tests; 3 live-integration tests are skipped without Azure credentials)
pytest tests/ -v

# Skip live integration tests
pytest tests/ -v -m "not integration"

# Only SDK compatibility checks
pytest tests/foundry/test_sdk_compat.py -v

# Only Foundry integration tests (requires az login + env vars)
pytest tests/foundry/test_integration.py -v -m integration

Contributing

Fork the repo and create a feature branch
Install dev dependencies: pip install -e ".[dev]"
Run tests: pytest
Open a PR with a clear description

Reporting issues and feedback

We welcome bug reports, feature requests, documentation fixes, and general feedback:

🐛 Bug? Open a bug report
✨ Feature idea? Open a feature request
📚 Docs unclear? File a documentation issue
💬 General feedback or experience report? Share feedback or use GitHub Discussions for open-ended questions.
🔒 Security vulnerability? Please report privately via the Microsoft Security Response Center, not as a public issue.

License

MIT — see LICENSE for details.
