# 04 — Advanced Configuration

This notebook demonstrates advanced options when running evaluations:

- Custom batch sizes and caching
- Exporting additional report formats (JSON, CSV, Markdown, HTML)
- Passing generation/inference kwargs (HuggingFace)
- Device/dtype and other loading options (HuggingFace)

By default it runs with the repo's local demo model. A commented-out example shows the same call for HuggingFace.

In [None]:
from bench.evaluation.harness import EvaluationHarness
from bench.evaluation.result_aggregator import ResultAggregator
from pathlib import Path

tasks = ["simple_qa", "medical_qa_symptoms"]
h = EvaluationHarness(
    tasks_dir="bench/tasks", results_dir="results", cache_dir="cache", log_level="INFO"
)

# Use a smaller batch size and enable caching
rep = h.evaluate(
    model_id="demo-local-adv",
    task_ids=tasks,
    model_type="local",
    batch_size=4,
    use_cache=True,
    save_results=True,
    report_formats=["json", "csv", "md", "html"],
    module_path="bench.examples.mypkg.mylocal",
    model_path=None,
)
rep.metadata, rep.overall_scores
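
To make the effect of `batch_size` concrete, the sketch below splits a list of inputs into fixed-size chunks. It is an illustration only: `iter_batches` is a hypothetical helper and is not the harness's own batching code, which may differ.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# Ten inputs with batch_size=4 give batches of 4, 4, and 2 items.
batches = list(iter_batches(list(range(10)), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

A smaller batch size means more, smaller chunks, which usually lowers peak memory at the cost of more model calls.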

If you keep the results in memory, you can also export reports manually with `ResultAggregator`.


In [None]:
# Register the in-memory results with a fresh aggregator under the same run ID
ra = ResultAggregator(output_dir="results")
run_id = rep.metadata.get("run_id")
for er in rep.detailed_results:
    ra.add_evaluation_result(er, run_id=run_id)

# Write one report file per format
ra.export_report_csv(run_id, f"results/{run_id}.csv")
ra.export_report_markdown(run_id, f"results/{run_id}.md")
ra.export_report_html(run_id, f"results/{run_id}.html", include_examples=True)

# Show up to five files written for this run
sorted(Path("results").glob(run_id + "*"))[:5]
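
The export cell assumes `run_id` is present in the report metadata. A minimal sketch of building the output paths defensively could look like the following. The helper `report_paths` and the `EXTENSIONS` mapping are hypothetical and not part of `bench`; the file naming simply mirrors the `results/{run_id}.<ext>` pattern used above.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional

# Assumed mapping from format name to file extension (illustrative only).
EXTENSIONS = {"json": ".json", "csv": ".csv", "md": ".md", "html": ".html"}


def report_paths(
    run_id: Optional[str], formats: Iterable[str], out_dir: str = "results"
) -> Dict[str, Path]:
    """Return one output path per requested format (hypothetical helper)."""
    if not run_id:
        raise ValueError("run_id is missing from the report metadata")
    paths = {}
    for fmt in formats:
        if fmt not in EXTENSIONS:
            raise ValueError(f"unsupported report format: {fmt!r}")
        # Append the extension instead of using with_suffix(), so run IDs
        # that contain dots are not truncated.
        paths[fmt] = Path(out_dir) / f"{run_id}{EXTENSIONS[fmt]}"
    return paths


paths = report_paths("run-123", ["csv", "md", "html"])
for fmt, path in paths.items():
    print(fmt, path)
```

Failing early on a missing `run_id` avoids a less obvious error later, for example when concatenating `None` with `"*"` in the glob pattern.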

## HuggingFace example (optional)

To use a HuggingFace pipeline instead of the local demo model, uncomment the cell below.

Notes:
- You need `transformers` and `torch` installed.
- Replace the model and task as desired. For summarization, the runner accepts input fields such as `document`, `text`, or `note`.
- Loading options such as `device_map`, `torch_dtype`, `low_cpu_mem_usage`, `trust_remote_code`, and `revision` are passed through.
- Generation kwargs (e.g., `max_new_tokens`) are forwarded for generative tasks.


In [None]:
# from bench.evaluation.harness import EvaluationHarness
# h_hf = EvaluationHarness(tasks_dir='bench/tasks', results_dir='results', cache_dir='cache')
# rep_hf = h_hf.evaluate(
#     model_id='hf-tiny-sum',
#     task_ids=['clinical_summarization_basic'],
#     model_type='huggingface',
#     hf_task='summarization',
#     model_path='sshleifer/tiny-t5',
#     batch_size=2,
#     use_cache=True,
#     generation_kwargs={'max_new_tokens': 32, 'do_sample': False},
#     device=-1,  # CPU; set to 0 for first CUDA GPU if available
#     trust_remote_code=False,
#     low_cpu_mem_usage=True,
# )
# rep_hf.overall_scores
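
Rather than hard-coding `device`, you can choose it at runtime. The sketch below follows the convention noted in the cell above (`-1` for CPU, `0` for the first CUDA GPU) and falls back to CPU when `torch` is not installed. `pick_device` is a hypothetical helper, not part of the harness.

```python
def pick_device() -> int:
    """Return 0 for the first CUDA GPU when available, otherwise -1 for CPU."""
    try:
        import torch  # optional dependency; only needed for the GPU check
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("device =", device)
```

The result can then be passed as `device=device` in the `evaluate` call.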


## Performance tips

See the comprehensive guide: [Performance Tips — MedAISure](docs/04c_performance_tips.md).

Highlights:
- Use smaller batch sizes on constrained hardware.
- Enable caching (`use_cache=True`) to avoid recomputation.
- For HuggingFace, set an appropriate `device` and `torch_dtype` (e.g., `'float16'` or `'bfloat16'`).
- Limit `max_new_tokens` and sample size when iterating.
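
To illustrate why caching avoids recomputation, here is a minimal in-memory sketch keyed on the model ID and the inputs. It is not how `use_cache=True` is implemented in the harness; `PredictionCache`, `cache_key`, and `fake_model` are hypothetical names used only for this example.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(model_id: str, inputs: Dict[str, Any]) -> str:
    """Build a stable key from the model ID and the (JSON-serializable) inputs."""
    payload = json.dumps({"model_id": model_id, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PredictionCache:
    """Tiny in-memory cache that counts hits and misses (illustrative only)."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self,
        model_id: str,
        inputs: Dict[str, Any],
        compute: Callable[[Dict[str, Any]], Any],
    ) -> Any:
        key = cache_key(model_id, inputs)
        if key in self._store:
            self.hits += 1
        else:
            # Only call the model when this (model, inputs) pair is new.
            self.misses += 1
            self._store[key] = compute(inputs)
        return self._store[key]


def fake_model(inputs: Dict[str, Any]) -> str:
    """Stand-in for an expensive model call."""
    return inputs["question"].upper()


cache = PredictionCache()
first = cache.get_or_compute("demo", {"question": "fever?"}, fake_model)
second = cache.get_or_compute("demo", {"question": "fever?"}, fake_model)
print(first, cache.hits, cache.misses)  # FEVER? 1 1
```

The second call returns the stored prediction without invoking the model, which is the saving that caching provides when the same task is re-run.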
