A benchmark for evaluating AI agents on spatial biology analysis tasks
SpatialBench is a comprehensive evaluation framework designed to test the capabilities of Large Language Models (LLMs) and AI agents on real-world spatial transcriptomics and epigenomics analysis workflows. The full benchmark comprises 98 evaluations across multiple platforms (Xenium, Vizgen MERFISH, AtlasXomics ATAC-seq, Curio Seeker) and covers key analysis tasks from quality control to differential expression.
This repository contains 7 representative examples from the full benchmark to demonstrate the evaluation format and grading system. The complete benchmark is withheld to prevent overfitting and ensure reliable model comparisons.
| Technology | Evaluations |
|---|---|
| Xenium | 30 |
| Vizgen (MERFISH) | 31 |
| AtlasXomics | 25 |
| Seeker/Curio | 12 |
| Total | 98 |
This repository contains 7 example evaluations that demonstrate:
- Standardized evaluation format: JSON-based test cases with clear task specifications
- Extensible grading system: Multiple grader types for different evaluation criteria
- Platform coverage: Examples from RNA (Xenium, MERFISH, Seeker) and ATAC-seq platforms
- Task diversity: Samples across QC, preprocessing, clustering, cell typing, differential expression, and spatial analysis
- Framework-agnostic: Works with any agent that can write JSON outputs
The full 98-evaluation benchmark is withheld to prevent overfitting and ensure that performance metrics reflect genuine spatial biology reasoning capabilities rather than memorization.
```bash
pip install spatialbench
```

### Using mini-swe-agent (Recommended)
SpatialBench includes built-in support for mini-swe-agent:
```bash
# Install spatialbench with mini-swe-agent
pip install spatialbench

# Configure your model
export MSWEA_MODEL_NAME=anthropic/claude-sonnet-4-5
export ANTHROPIC_API_KEY=your_api_key

# Run a single evaluation
spatialbench run evals/qc/seeker_qc_basic.json --agent minisweagent

# Run batch evaluations
spatialbench batch evals_full/seeker \
    --agent minisweagent \
    --model anthropic/claude-sonnet-4-5 \
    --output results/run1 \
    --parallel 6 \
    --keep-workspace
```

Or programmatically:
```python
from spatialbench import EvalRunner, run_minisweagent_task

runner = EvalRunner("evals/qc/seeker_qc_basic.json")
result = runner.run(agent_function=run_minisweagent_task)
print(f"Passed: {result['passed']}")
```

### Using a custom agent
```python
import json

from spatialbench import EvalRunner

def my_agent(task_prompt, work_dir):
    # Analyze the data in work_dir, then write the answer as JSON
    answer = {"mean_genes_per_bead": 45.2}
    answer_file = work_dir / "eval_answer.json"
    answer_file.write_text(json.dumps(answer))
    return answer

runner = EvalRunner("evals/qc/seeker_qc_basic.json")
result = runner.run(agent_function=my_agent)
print(f"Passed: {result['passed']}")
```

The full benchmark spans six major categories of spatial biology analysis. This repository includes one representative example from each:
### Quality Control

Assess basic dataset properties: gene counts, UMI counts, mitochondrial fraction, etc.

Example: evals/qc/seeker_qc_basic.json
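These metrics can typically be computed with standard single-cell tooling. A minimal sketch using scanpy (the benchmark does not prescribe a library, and the answer keys below are illustrative, not the fields required by this evaluation):

```python
# Sketch only: common QC metrics via scanpy; answer keys are illustrative.
import json
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical local path to the eval dataset
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

answer = {
    "mean_genes_per_bead": float(adata.obs["n_genes_by_counts"].mean()),
    "mean_umis_per_bead": float(adata.obs["total_counts"].mean()),
    "mean_pct_mito": float(adata.obs["pct_counts_mt"].mean()),
}
print(json.dumps(answer, indent=2))
```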
### Preprocessing

Test normalization, dimensionality reduction, and batch correction pipelines.

Example: evals/preprocessing/xenium_normalization.json
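As an illustration of this category (not the graded answer for the example above), a typical normalization and dimensionality-reduction pass with scanpy might look like:

```python
# Sketch only: library-size normalization, log transform, and PCA.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical local path
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)
print(adata.obsm["X_pca"].shape)
```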
### Clustering

Evaluate clustering algorithm application and parameter selection.

Example: evals/clustering/xenium_leiden.json
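For intuition, a standard Leiden run on an already-preprocessed AnnData (the parameters here are placeholders, not values any evaluation expects):

```python
# Sketch only: k-NN graph + Leiden clustering on an existing PCA embedding.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # assumed normalized, log-transformed, with PCA computed
sc.pp.neighbors(adata, n_neighbors=15, use_rep="X_pca")
sc.tl.leiden(adata, resolution=1.0)
print(adata.obs["leiden"].value_counts())
```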
### Cell Typing

Test marker-based cell type assignment and biological reasoning.

Example: evals/cell_typing/xenium_kidney_typing.json
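A minimal sketch of marker-score cell typing; the marker lists below are illustrative kidney markers chosen for this example, not the reference sets used by the grader:

```python
# Sketch only: assign each cluster the cell type with the highest mean marker score.
# Marker lists are illustrative, not the grader's reference sets.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # assumed normalized, log-transformed, clustered ("leiden")
markers = {
    "Proximal tubule": ["LRP2", "CUBN", "SLC34A1"],
    "Podocyte": ["NPHS1", "NPHS2", "PODXL"],
}
for cell_type, genes in markers.items():
    present = [g for g in genes if g in adata.var_names]
    sc.tl.score_genes(adata, gene_list=present, score_name=cell_type)

scores = adata.obs.groupby("leiden")[list(markers)].mean()
print(scores.idxmax(axis=1))  # cluster -> best-matching cell type
```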
### Differential Expression

Assess statistical testing and marker gene discovery.

Example: evals/differential_expression/vizgen_de_temporal.json
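For example, a Wilcoxon rank-sum marker scan between condition groups with scanpy (the grouping column name is an assumption):

```python
# Sketch only: per-group marker discovery; "timepoint" is a hypothetical obs column.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")
sc.tl.rank_genes_groups(adata, groupby="timepoint", method="wilcoxon")
top_markers = {
    str(group): list(adata.uns["rank_genes_groups"]["names"][str(group)][:10])
    for group in adata.obs["timepoint"].unique()
}
print(top_markers)
```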
### Spatial Analysis

Evaluate spatial-specific analyses like tissue composition and spatial contiguity.

Examples: evals/spatial_analysis/seeker_spatial_contiguity.json, evals/spatial_analysis/vizgen_tissue_composition.json
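A minimal sketch of the kinds of quantities these evaluations ask about, assuming spatial coordinates live in obsm["spatial"] and labels in obs["cell_type"] (both assumptions):

```python
# Sketch only: cell type composition and a nearest-neighbor contiguity proxy.
import scanpy as sc
from scipy.spatial import cKDTree

adata = sc.read_h5ad("dataset.h5ad")  # assumes obsm["spatial"] and obs["cell_type"] exist

# Tissue composition: fraction of cells per type.
print(adata.obs["cell_type"].value_counts(normalize=True))

# Contiguity proxy: fraction of cells whose nearest spatial neighbor shares their type.
coords = adata.obsm["spatial"]
_, idx = cKDTree(coords).query(coords, k=2)  # column 0 is the cell itself
labels = adata.obs["cell_type"].to_numpy()
same_type = labels == labels[idx[:, 1]]
print(f"Nearest-neighbor agreement: {same_type.mean():.3f}")
```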
SpatialBench includes 6 built-in graders:
- NumericToleranceGrader: For QC metrics, counts, percentages
- LabelSetJaccardGrader: For cell type label sets
- DistributionComparisonGrader: For cell type proportions
- MarkerGenePrecisionRecallGrader: For marker gene lists (P@K, R@K)
- MarkerGeneSeparationGrader: For expression-based validation (AUROC)
- SpatialAdjacencyGrader: For spatial proximity metrics
See docs/graders.md for detailed documentation.
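As one concrete illustration, precision@K and recall@K for marker gene lists are commonly defined as below; this is a sketch of the metric, not the grader's implementation (see docs/graders.md for that):

```python
# Sketch only: precision@K / recall@K for a predicted marker gene list.
def precision_recall_at_k(predicted, ground_truth, k):
    hits = len(set(predicted[:k]) & set(ground_truth))
    return hits / k, hits / len(ground_truth)

p, r = precision_recall_at_k(
    predicted=["NPHS1", "NPHS2", "LRP2", "UMOD", "CUBN"],  # illustrative gene lists
    ground_truth=["NPHS1", "NPHS2", "PODXL"],
    k=5,
)
print(f"P@5 = {p:.2f}, R@5 = {r:.2f}")  # P@5 = 0.40, R@5 = 0.67
```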
Evaluations are defined in JSON:
```json
{
  "id": "eval_identifier",
  "task": "Natural language task description...",
  "data_node": "latch://path/to/dataset.h5ad",
  "grader": {
    "type": "numeric_tolerance",
    "config": {
      "ground_truth": {"field": 100},
      "tolerances": {"field": {"type": "absolute", "value": 5}}
    }
  }
}
```

See docs/specification.md for the complete specification.
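For intuition, the numeric_tolerance configuration above implies a check along these lines (a sketch of the idea, not the package's grader code):

```python
# Sketch only: the pass/fail logic implied by an absolute-tolerance config.
config = {
    "ground_truth": {"field": 100},
    "tolerances": {"field": {"type": "absolute", "value": 5}},
}
answer = {"field": 103}

def passes(answer, config):
    for key, truth in config["ground_truth"].items():
        tol = config["tolerances"][key]
        if tol["type"] == "absolute" and abs(answer[key] - truth) > tol["value"]:
            return False
        # Other tolerance types (e.g. relative) would be handled similarly.
    return True

print(passes(answer, config))  # True: |103 - 100| <= 5
```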
We welcome contributions! See CONTRIBUTING.md for:
- Adding new evaluations
- Creating custom graders
- Submitting benchmark results
SpatialBench supports batch evaluation with parallel execution:
```bash
spatialbench batch evals_full/seeker \
    --agent minisweagent \
    --model anthropic/claude-sonnet-4-5 \
    --output results/run1 \
    --parallel 6 \
    --keep-workspace
```

For running benchmarks across multiple models, use the benchmark script:
```bash
./scripts/benchmark_models.sh evals_full/seeker 6 true
```

Results are organized by run timestamp in results/run_TIMESTAMP/ directories.
For monitoring batch runs and troubleshooting, see BATCH_MONITORING.md.
Batch runs produce:
- batch_results.json - Full results with pass/fail status, agent answers, and grader outputs
- batch_log.txt - Execution log with progress updates
- Agent metrics: cost, steps, duration per evaluation
- Summary statistics: pass rate, average cost, average steps
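A quick way to summarize a finished run, assuming batch_results.json holds a list of per-evaluation records with a boolean passed field (the schema shown here is an assumption; verify against your own output):

```python
# Sketch only: summarize pass rate from batch_results.json; assumes a list of
# records each carrying a boolean "passed" field.
import json
from pathlib import Path

results = json.loads(Path("results/run1/batch_results.json").read_text())
passed = sum(1 for r in results if r.get("passed"))
print(f"Pass rate: {passed}/{len(results)} = {passed / len(results):.1%}")
```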
- CLI Usage Guide
- Evaluation Specification
- Grader API
- Adding Evaluations
- Evaluation Catalog
- Batch Monitoring Guide
If you use SpatialBench in your research, please cite:
```bibtex
@software{spatialbench2024,
  title  = {SpatialBench: A Benchmark for AI Agents on Spatial Biology Analysis},
  author = {Latch Bio},
  year   = {2024},
  url    = {https://github.com/latchbio/spatialbench}
}
```

Apache 2.0 - see LICENSE for details.