evals_intro

A from-scratch introduction to AI evaluation harnesses — built by doing, not by reading frameworks.

This repo contains a practical essay and working Python code for understanding how to evaluate AI systems in production. Built around a support ticket triage use case.

What's in here

File	What it does
`Evals Intro Essay.md`	Conceptual walkthrough — ground truth, confusion matrix, metrics, use-case mapping
`eval_core.py`	Batch eval harness — runs a dataset through AI, computes precision/recall/consistency
`prod_eval_hitl.py`	Live HITL eval runner — one ticket at a time, human labels in real time
`run_interactive.py`	Entry point for interactive mode
`run_preloaded.py`	Entry point for batch/preloaded dataset mode
`*.json`	Example eval run outputs — timestamped sessions with metrics and results

Core concepts covered

Ground Truth — Human-labelled benchmarks. Without these, you're just comparing AI to itself.

Confusion Matrix

	AI: Flag	AI: OK
Truth: Flag	True Positive (TP)	False Negative (FN)
Truth: OK	False Positive (FP)	True Negative (TN)

Metrics

Recall = TP / (TP + FN) — minimise FN; use for safety-critical cases (fraud, medical, legal)
Precision = TP / (TP + FP) — minimise FP; use for UX-critical cases (spam filters, recommendations)
F1 = 2 * (P * R) / (P + R) — penalises imbalance between precision and recall
Reason Consistency — % of cases where AI reason category matches human label; tracks drift over time

Metric selection by use case

Use Case	Minimise	Why
Credit card fraud	FN	Missing fraud = real financial loss
Email spam filter	FP	Legit emails in junk = trust broken
Contract risk review	FN	Missed clause = legal liability
Support ticket triage	FN	Missed P1 = production down
Content moderation	FN	Wrong-age content = serious harm

Rule of thumb: Safety-critical → minimise FN. UX-critical → minimise FP. Business tradeoff → cost model decides.

How to run

Prerequisites

pip install httpx

Set your OpenRouter API key in the relevant .py file (replace "Bearer KEY HERE").

⚠️ Never commit real API keys. Use environment variables in any real project.

Batch eval (preloaded dataset)

python run_preloaded.py

Runs a fixed set of pre-labelled tickets through the AI and outputs metrics + a timestamped JSON file.

Live HITL eval (one ticket at a time)

python run_interactive.py

You enter a ticket title → AI predicts → you provide ground truth → repeat until done → metrics calculated and saved.

What the output looks like

{
  "evaluation_completed_at": "2026-05-30T17:14:59",
  "summary_metrics": {
    "precision": 0.91,
    "recall": 0.88,
    "reason_consistency": 0.75
  },
  "total_items_evaluated": 12,
  "results": [...]
}

Why this exists

Most evals content either throws you into a framework (Ragas, LangSmith) or stays at the theory level. This is the gap in between — understanding what the numbers mean and writing the harness yourself before you trust a library to do it for you.

The support ticket example is intentional: it's a real-world AI agent use case where getting FN wrong means production is down.

What's next

Threshold sweep — vary decision boundary, plot precision-recall curve
Multi-label support (beyond binary flag/ok)
LLM-as-judge consistency variant

Author

Built by productiser as part of a practical AI skills sprint.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitignore		.gitignore
Evals Intro Essay.md		Evals Intro Essay.md
README.md		README.md
consistency_session_20260530_171459.json		consistency_session_20260530_171459.json
eval_core.py		eval_core.py
eval_pipeline.png		eval_pipeline.png
eval_run_20260530_154843.json		eval_run_20260530_154843.json
preloaded_eval_run_20260530_163049.json		preloaded_eval_run_20260530_163049.json
prod_eval_hitl.py		prod_eval_hitl.py
run_interactive.py		run_interactive.py
run_preloaded.py		run_preloaded.py
user_eval_run_20260530_165056.json		user_eval_run_20260530_165056.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

evals_intro

What's in here

Core concepts covered

How to run

Prerequisites

Batch eval (preloaded dataset)

Live HITL eval (one ticket at a time)

What the output looks like

Why this exists

What's next

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

evals_intro

What's in here

Core concepts covered

How to run

Prerequisites

Batch eval (preloaded dataset)

Live HITL eval (one ticket at a time)

What the output looks like

Why this exists

What's next

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages