# 01 — Basic Model Evaluation

This quickstart notebook shows how to:

- List available tasks in `bench/tasks/`
- Run a simple local demo model via `EvaluationHarness`
- Save and inspect a `BenchmarkReport`

It defaults to a tiny local model so it runs without external downloads.

In [None]:
from pathlib import Path

from bench.evaluation.harness import EvaluationHarness
from bench.models.benchmark_report import BenchmarkReport

# Initialize harness pointing to repo-relative paths
harness = EvaluationHarness(
    tasks_dir=str(Path("bench/tasks")),
    results_dir=str(Path("results")),
    cache_dir=str(Path("cache")),
    log_level="INFO",
)

# List a few available tasks
available = harness.list_available_tasks()
len(available), available[:3]  # show first 3
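
Before picking task IDs, it can help to confirm that each one actually appears in the harness's task list. The helper below is a minimal sketch and not part of the `bench` package: `split_known_tasks` is a hypothetical name, and it assumes `list_available_tasks()` returns a list of task-ID strings, which this notebook does not confirm.

```python
from typing import Iterable, List, Tuple


def split_known_tasks(
    requested: Iterable[str], available: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split requested task IDs into (known, missing), preserving order.

    Hypothetical helper: assumes both arguments are iterables of ID strings.
    """
    available_set = set(available)
    known: List[str] = []
    missing: List[str] = []
    for task_id in requested:
        if task_id in available_set:
            known.append(task_id)
        else:
            missing.append(task_id)
    return known, missing


# Stand-in for harness.list_available_tasks(); "not_a_task" is illustrative only.
example_available = ["simple_qa", "medical_qa_symptoms"]
known, missing = split_known_tasks(["simple_qa", "not_a_task"], example_available)
print("known:", known)
print("missing:", missing)
```

In the notebook itself, `example_available` would be replaced by the `available` list from the cell above, if it has the assumed shape.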

We'll evaluate a couple of lightweight example tasks that ship with the repo.

In [None]:
# Pick real task IDs from bench/tasks/
task_ids = [
    "simple_qa",
    "medical_qa_symptoms",
]

# Use the minimal local model example implemented at bench/examples/mypkg/mylocal.py
# ModelRunner will import module_path and call load_model()
report = harness.evaluate(
    model_id="demo-local",
    task_ids=task_ids,
    model_type="local",
    batch_size=8,
    strict_validation=False,
    module_path="bench.examples.mypkg.mylocal",
    model_path=None,  # not used by this toy loader
)
report.to_dict()["overall_scores"]
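
The comment in the cell above says the runner imports `module_path` and calls `load_model()`. The sketch below shows roughly what that pattern looks like with the standard library's `importlib`. It is an illustration under assumptions, not the actual `ModelRunner` code: `load_from_module` and the in-memory `demo_local_model` module are hypothetical, and the real loader may pass arguments such as `model_path` or expect a different model interface.

```python
import importlib
import sys
import types


def load_from_module(module_path: str):
    """Import `module_path` and return whatever its load_model() returns.

    Illustrative only: the real runner may pass arguments or validate more.
    """
    module = importlib.import_module(module_path)
    loader = getattr(module, "load_model", None)
    if not callable(loader):
        raise AttributeError(f"{module_path} does not define a callable load_model()")
    return loader()


# Register a throwaway in-memory module so the sketch runs without the repo.
demo = types.ModuleType("demo_local_model")
demo.load_model = lambda: (lambda prompt: f"echo: {prompt}")
sys.modules["demo_local_model"] = demo

model = load_from_module("demo_local_model")
print(model("hello"))
```

Resolving the loader by dotted module path keeps the harness decoupled from any particular model implementation, which is presumably why `module_path` is passed as a string.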

`evaluate()` returns the `BenchmarkReport` and also saves it as a JSON file under `results/`, so you can reload it later.

In [None]:
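# Find the most recently modified report JSON and load it back
# (sorting by modification time rather than by filename)
paths = sorted(Path("results").glob("*.json"), key=lambda p: p.stat().st_mtime)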
latest = paths[-1] if paths else None
print("Latest report:", latest)
if latest:
    rep2 = BenchmarkReport.from_file(latest)
    print(rep2.overall_scores)
    # Optional: visualize (requires matplotlib)
    try:
        rep2.plot_overall_scores()
    except Exception as e:
        print("Plotting skipped:", e)
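
If you only want a quick look at a saved report without going through `BenchmarkReport`, the file can also be read with the standard `json` module. This is a minimal sketch: `load_newest_json` is a hypothetical helper, and the `overall_scores` key is assumed to be present in the saved file because it appears in `report.to_dict()` above; the on-disk layout is not confirmed here.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_newest_json(directory: Path) -> Optional[Any]:
    """Parse the most recently modified *.json file in `directory`.

    Returns None if the directory is missing or holds no JSON files.
    """
    if not directory.is_dir():
        return None
    candidates = list(directory.glob("*.json"))
    if not candidates:
        return None
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    with newest.open(encoding="utf-8") as fh:
        return json.load(fh)


raw = load_newest_json(Path("results"))
if isinstance(raw, dict):
    print("Top-level keys:", sorted(raw))
    print("Overall scores:", raw.get("overall_scores"))  # assumed key
else:
    print("No report object found in results/")
```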

Next: see `02_custom_task_creation.ipynb` to learn how to create your own task.