# 03 — Result Analysis & Visualization

This notebook demonstrates how to:

- Load or create evaluation runs
- Use `ResultAggregator` to retrieve a report by `run_id`
- Export reports to JSON/CSV/Markdown/HTML
- Compare two runs and visualize metric distributions


In [None]:
from bench.evaluation.harness import EvaluationHarness
from bench.evaluation.result_aggregator import ResultAggregator

# First, ensure at least one run exists (use the local demo model).
h = EvaluationHarness(tasks_dir="bench/tasks", results_dir="results", cache_dir="cache")
run1 = h.evaluate(
    model_id="demo-local",
    task_ids=["simple_qa", "medical_qa_symptoms"],
    model_type="local",
    module_path="bench.examples.mypkg.mylocal",
    model_path=None,
)
run1_id = run1.metadata.get("run_id")
run1_id

Create a second run to compare against. Only `model_id` changes here, which stands in for a real configuration change.


In [None]:
h2 = EvaluationHarness(
    tasks_dir="bench/tasks", results_dir="results", cache_dir="cache"
)
run2 = h2.evaluate(
    model_id="demo-local-v2",
    task_ids=["simple_qa", "medical_qa_symptoms"],
    model_type="local",
    module_path="bench.examples.mypkg.mylocal",
    model_path=None,
)
run2_id = run2.metadata.get("run_id")
run1_id, run2_id

Use `ResultAggregator` to access run reports and export them in several formats.


In [None]:
ra = ResultAggregator(output_dir="results")
# Re-add the detailed results from both runs to the aggregator for this session
for run, run_id in ((run1, run1_id), (run2, run2_id)):
    for er in run.detailed_results:
        ra.add_evaluation_result(er, run_id=run_id)

# Export to various formats
p_json = ra.export_report_json(run1_id)
p_csv = ra.export_report_csv(run1_id, f"results/{run1_id}.csv")
p_md = ra.export_report_markdown(run1_id, f"results/{run1_id}.md")
p_html = ra.export_report_html(
    run1_id, f"results/{run1_id}.html", include_examples=True
)
p_json, p_csv, p_md, p_html
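
After exporting, it can be worth confirming that each file was actually written. The helper below is a minimal sketch for that check; `check_exports` is a hypothetical name and not part of `bench`, and it assumes each export method returns the path it wrote to, which this notebook does not confirm.

```python
from pathlib import Path


def check_exports(paths):
    """Map each export path to whether a non-empty file exists there."""
    status = {}
    for p in paths:
        path = Path(p)
        # A missing file short-circuits before stat() is called.
        status[str(path)] = path.is_file() and path.stat().st_size > 0
    return status
```

If the assumption about return values holds, `check_exports([p_json, p_csv, p_md, p_html])` would report any export that is missing or empty.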

Compare the two runs using `compare_runs()`.


In [None]:
diff = ra.compare_runs(run1_id, run2_id)
diff
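
The structure of the object returned by `compare_runs()` is not shown in this notebook. As a rough illustration of what a run comparison typically boils down to, the sketch below computes per-metric deltas between two runs using plain dictionaries. `metric_deltas` and the scores are hypothetical, and this is not the library's implementation.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for each metric present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


# Hypothetical aggregate scores, for illustration only.
run_a = {"accuracy": 0.5, "clinical_accuracy": 0.25}
run_b = {"accuracy": 0.75, "clinical_accuracy": 0.25, "f1": 0.5}

deltas = metric_deltas(run_a, run_b)
print(deltas)
```

Metrics that appear in only one run (`f1` here) are left out, since no delta can be computed for them.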

Visualize a metric distribution across tasks with `plot_metric_distribution()`. If matplotlib is not installed, the method returns the underlying data without producing a plot.


In [None]:
plot_data = ra.plot_metric_distribution(
    run1_id,
    metric="clinical_accuracy",
    output_path="results/" + run1_id + "_clinical_accuracy.png",
)
plot_data
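
The "return the data even if plotting is unavailable" behavior follows a common optional-dependency pattern. The sketch below shows one way such a fallback can be written. `plot_or_return` is a hypothetical helper and not how `ResultAggregator` is necessarily implemented. It builds a `matplotlib.figure.Figure` directly, without `pyplot`, so it leaves the notebook's plotting backend alone.

```python
def plot_or_return(values, output_path=None):
    """Save a histogram if matplotlib is available; always return the data."""
    result = {"values": list(values), "plotted": False}
    if output_path is None:
        return result
    try:
        # Imported lazily so the function still works without matplotlib.
        from matplotlib.figure import Figure
    except ImportError:
        return result  # matplotlib missing: hand back the raw data only

    fig = Figure()
    ax = fig.subplots()
    ax.hist(result["values"], bins=10)
    ax.set_xlabel("score")
    ax.set_ylabel("count")
    fig.savefig(output_path)
    result["plotted"] = True
    return result
```

Callers can check `result["plotted"]` to see whether an image was written, and use `result["values"]` either way.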

You can also call `BenchmarkReport.plot_overall_scores()` and `plot_task_scores(metric)` on a loaded report.
