# 04b — Custom Metrics (Registration & Usage)

This notebook demonstrates how to:

- Register a custom metric with `MetricCalculator.register_metric()`
- Run an evaluation with `EvaluationHarness` and then compute the custom metric on the run's predictions and references
- Optionally, combine custom and built-in metrics

Note: `EvaluationHarness` uses its own internal metric set during evaluation. To get custom metrics, we recompute them afterwards with `MetricCalculator`, using the run's predictions and references.

In [None]:
from bench.evaluation.harness import EvaluationHarness
from bench.evaluation.metric_calculator import MetricCalculator

# Run a quick evaluation using the local demo model
h = EvaluationHarness(tasks_dir="bench/tasks", results_dir="results", cache_dir="cache")
rep = h.evaluate(
    model_id="demo-local",
    task_ids=["simple_qa"],
    model_type="local",
    module_path="bench.examples.mypkg.mylocal",
    model_path=None,
)
# Show the number of task results and the aggregate scores
len(rep.detailed_results), list(rep.overall_scores.items())

Extract predictions and references for a task from the `EvaluationResult` objects in the report.

In [None]:
# Collect predictions and references from the first task result
er = rep.detailed_results[0]
task_id = er.task_id
predictions = er.model_outputs  # standardized prediction dicts
references = er.expected_outputs
task_id, predictions[:2], references[:2]
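
A metric that pairs items by position depends on the two lists having the same length, and `zip` will silently drop any surplus items. A minimal sketch of a pre-check follows. `check_aligned` is a hypothetical helper, not part of `bench`, and the toy lists only mimic the shape of `model_outputs` and `expected_outputs`.

```python
def check_aligned(predictions, references):
    """Raise if the two sequences cannot be paired item by item.

    Hypothetical helper for illustration; not part of the bench package.
    """
    if len(predictions) != len(references):
        raise ValueError(
            f"length mismatch: {len(predictions)} predictions "
            f"vs {len(references)} references"
        )
    return len(predictions)


# Toy stand-ins for er.model_outputs / er.expected_outputs
toy_predictions = [{"label": "yes"}, {"label": "no"}]
toy_references = [{"label": "yes"}, {"label": "yes"}]

n_pairs = check_aligned(toy_predictions, toy_references)
print(n_pairs)
```

Calling such a check before computing metrics turns a silent truncation into an explicit error.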

Register a custom metric function and compute it with `MetricCalculator`.

Below we define a simple exact-match score over a string field (defaulting to `label`).

In [None]:
mc = MetricCalculator()


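def exact_match(y_true, y_pred, *, field="label", **kwargs):
    """Fraction of positions where reference and prediction agree on `field`."""
    # Items may be dicts (compare `field`) or plain values (compare directly).
    t = [(r.get(field) if isinstance(r, dict) else r) for r in y_true]
    p = [(r.get(field) if isinstance(r, dict) else r) for r in y_pred]
    # zip() pairs items by position and stops at the shorter list.
    num = sum(int(tt == pp) for tt, pp in zip(t, p))
    den = max(1, len(t))  # avoid division by zero on empty input
    return float(num / den)


# Register the function under the name used in `metric_names` below.
mc.register_metric("exact_match", exact_match, field="label")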
custom = mc.calculate_metrics(
    task_id, predictions, references, metric_names=["exact_match"]
)
{k: v.value for k, v in custom.items()}
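
Because the metric above is an ordinary Python function, its behavior can be checked on toy data without the harness. The sketch below repeats the definition so that it runs on its own; the inputs are made up for illustration. It also shows an edge case worth knowing: when the field is missing on both sides, `.get` returns `None` for both and the pair counts as a match.

```python
def exact_match(y_true, y_pred, *, field="label", **kwargs):
    # Same logic as the notebook cell above, repeated so this block is standalone.
    t = [(r.get(field) if isinstance(r, dict) else r) for r in y_true]
    p = [(r.get(field) if isinstance(r, dict) else r) for r in y_pred]
    num = sum(int(tt == pp) for tt, pp in zip(t, p))
    den = max(1, len(t))
    return float(num / den)


# Made-up toy data: dict items are compared on `field`, plain values directly.
refs = [{"label": "a"}, {"label": "b"}, {"label": "c"}, {"label": "d"}]
preds = [{"label": "a"}, {"label": "x"}, {"label": "c"}, {"label": "d"}]

score_dicts = exact_match(refs, preds)                 # 3 of 4 agree
score_plain = exact_match(["a", "b"], ["a", "b"])      # plain strings
score_empty = exact_match([], [])                      # denominator clamped to 1
score_missing = exact_match([{"id": 1}], [{"id": 2}])  # both lack "label"

print(score_dicts, score_plain, score_empty, score_missing)
# 0.75 1.0 0.0 1.0
```

If a missing field should count as a miss instead, the comparison would need an explicit `None` check; that is a design choice this notebook's metric does not make.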

You can also compute both built-in and custom metrics together by passing a combined list to `metric_names`.

In [None]:
both = mc.calculate_metrics(
    task_id, predictions, references, metric_names=["accuracy", "exact_match"]
)
{k: v.value for k, v in both.items()}
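
To make the name-based lookup concrete, the toy registry below shows one way a calculator could map metric names to functions and bind default keyword arguments at registration time. `ToyMetricRegistry`, `label_match`, and `count_items` are hypothetical names, and this is not how `MetricCalculator` is implemented; in particular, the argument order passed to each metric is an assumption.

```python
from functools import partial


class ToyMetricRegistry:
    """Hypothetical name -> function registry; NOT bench's MetricCalculator."""

    def __init__(self):
        self._metrics = {}

    def register(self, name, fn, **default_kwargs):
        # Bind defaults now so callers only need to supply the data later.
        self._metrics[name] = partial(fn, **default_kwargs)

    def calculate(self, references, predictions, metric_names):
        unknown = [n for n in metric_names if n not in self._metrics]
        if unknown:
            raise KeyError(f"unregistered metrics: {unknown}")
        return {n: self._metrics[n](references, predictions) for n in metric_names}


def label_match(y_true, y_pred, *, field="label"):
    # Fraction of positions where the chosen field agrees.
    hits = sum(t.get(field) == p.get(field) for t, p in zip(y_true, y_pred))
    return hits / max(1, len(y_true))


def count_items(y_true, y_pred):
    # Stand-in for a "built-in" metric: reports how many pairs were scored.
    return float(min(len(y_true), len(y_pred)))


registry = ToyMetricRegistry()
registry.register("label_match", label_match, field="label")
registry.register("count", count_items)

toy_refs = [{"label": "a"}, {"label": "b"}]
toy_preds = [{"label": "a"}, {"label": "z"}]
scores = registry.calculate(toy_refs, toy_preds, ["count", "label_match"])
print(scores)
```

Requesting a name that was never registered raises `KeyError` in this sketch; how `MetricCalculator` handles that case is not shown in this notebook.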

For a script-only example outside the notebook, see also `bench/examples/register_custom_metric.py`.

Next: see `02b_python_task_interface.ipynb` for defining tasks programmatically and registering them with the registry.