A benchmark for evaluating AI agents on spatial biology analysis tasks
SpatialBench is a comprehensive evaluation framework designed to test the capabilities of Large Language Models (LLMs) and AI agents on real-world spatial transcriptomics and epigenomics analysis workflows. The full benchmark comprises 98 evaluations across multiple platforms (Xenium, Vizgen MERFISH, AtlasXomics ATAC-seq, Curio Seeker) and covers key analysis tasks from quality control to differential expression.
This repository contains 7 representative examples from the full benchmark to demonstrate the evaluation format and grading system. The complete benchmark is withheld to prevent overfitting and ensure reliable model comparisons.
| Technology | Evaluations |
|---|---|
| Xenium | 30 |
| Vizgen (MERFISH) | 31 |
| AtlasXomics | 25 |
| Seeker/Curio | 12 |
| Total | 98 |
This repository contains 7 example evaluations that demonstrate:
- Standardized evaluation format: JSON-based test cases with clear task specifications
- Extensible grading system: Multiple grader types for different evaluation criteria
- Platform coverage: Examples from RNA (Xenium, MERFISH, Seeker) and ATAC-seq platforms
- Task diversity: Samples across QC, preprocessing, clustering, cell typing, differential expression, and spatial analysis
- Framework-agnostic: Works with any agent that can write JSON outputs
The full 98-evaluation benchmark is withheld to prevent overfitting and ensure that performance metrics reflect genuine spatial biology reasoning capabilities rather than memorization.
```bash
pip install spatialbench
```

### Using mini-swe-agent (Recommended)
SpatialBench includes built-in support for mini-swe-agent:
```bash
# Install spatialbench with mini-swe-agent
pip install spatialbench

# Configure your model
export MSWEA_MODEL_NAME=anthropic/claude-sonnet-4-5
export ANTHROPIC_API_KEY=your_api_key

# Run a single evaluation
spatialbench run evals/qc/seeker_qc_basic.json --agent minisweagent

# Run batch evaluations
spatialbench batch evals_full/seeker \
    --agent minisweagent \
    --model anthropic/claude-sonnet-4-5 \
    --output results/run1 \
    --parallel 6 \
    --keep-workspace
```

Or programmatically:
```python
from spatialbench import EvalRunner, run_minisweagent_task

runner = EvalRunner("evals/qc/seeker_qc_basic.json")
result = runner.run(agent_function=run_minisweagent_task)
print(f"Passed: {result['passed']}")
```

### Using a custom agent
```python
import json

from spatialbench import EvalRunner

def my_agent(task_prompt, work_dir):
    # Analyze the data in work_dir, then write the answer as JSON
    answer = {"mean_genes_per_bead": 45.2}
    answer_file = work_dir / "eval_answer.json"
    answer_file.write_text(json.dumps(answer))
    return answer

runner = EvalRunner("evals/qc/seeker_qc_basic.json")
result = runner.run(agent_function=my_agent)
print(f"Passed: {result['passed']}")
```

The full benchmark spans six major categories of spatial biology analysis. This repository includes one representative example from each:
### Quality Control

Assess basic dataset properties: gene counts, UMI counts, mitochondrial fraction, etc.

Example: evals/qc/seeker_qc_basic.json
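These metrics can typically be computed with standard single-cell tooling. A minimal sketch using scanpy (the benchmark does not prescribe a library, and the answer keys below are illustrative, not the fields required by this evaluation):

```python
# Sketch only: common QC metrics via scanpy; answer keys are illustrative.
import json
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical local path to the eval dataset
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

answer = {
    "mean_genes_per_bead": float(adata.obs["n_genes_by_counts"].mean()),
    "mean_umis_per_bead": float(adata.obs["total_counts"].mean()),
    "mean_pct_mito": float(adata.obs["pct_counts_mt"].mean()),
}
print(json.dumps(answer, indent=2))
```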
### Preprocessing

Test normalization, dimensionality reduction, and batch correction pipelines.

Example: evals/preprocessing/xenium_normalization.json
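As an illustration of this category (not the graded answer for the example above), a typical normalization and dimensionality-reduction pass with scanpy might look like:

```python
# Sketch only: library-size normalization, log transform, and PCA.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical local path
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)
print(adata.obsm["X_pca"].shape)
```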
### Clustering

Evaluate clustering algorithm application and parameter selection.

Example: evals/clustering/xenium_leiden.json
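For intuition, a standard Leiden run on an already-preprocessed AnnData (the parameters here are placeholders, not values any evaluation expects):

```python
# Sketch only: k-NN graph + Leiden clustering on an existing PCA embedding.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # assumed normalized, log-transformed, with PCA computed
sc.pp.neighbors(adata, n_neighbors=15, use_rep="X_pca")
sc.tl.leiden(adata, resolution=1.0)
print(adata.obs["leiden"].value_counts())
```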
### Cell Typing

Test marker-based cell type assignment and biological reasoning.

Example: evals/cell_typing/xenium_kidney_typing.json
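A minimal sketch of marker-score cell typing; the marker lists below are illustrative kidney markers chosen for this example, not the reference sets used by the grader:

```python
# Sketch only: assign each cluster the cell type with the highest mean marker score.
# Marker lists are illustrative, not the grader's reference sets.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # assumed normalized, log-transformed, clustered ("leiden")
markers = {
    "Proximal tubule": ["LRP2", "CUBN", "SLC34A1"],
    "Podocyte": ["NPHS1", "NPHS2", "PODXL"],
}
for cell_type, genes in markers.items():
    present = [g for g in genes if g in adata.var_names]
    sc.tl.score_genes(adata, gene_list=present, score_name=cell_type)

scores = adata.obs.groupby("leiden")[list(markers)].mean()
print(scores.idxmax(axis=1))  # cluster -> best-matching cell type
```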
### Differential Expression

Assess statistical testing and marker gene discovery.

Example: evals/differential_expression/vizgen_de_temporal.json
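For example, a Wilcoxon rank-sum marker scan between condition groups with scanpy (the grouping column name is an assumption):

```python
# Sketch only: per-group marker discovery; "timepoint" is a hypothetical obs column.
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")
sc.tl.rank_genes_groups(adata, groupby="timepoint", method="wilcoxon")
top_markers = {
    str(group): list(adata.uns["rank_genes_groups"]["names"][str(group)][:10])
    for group in adata.obs["timepoint"].unique()
}
print(top_markers)
```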
### Spatial Analysis

Evaluate spatial-specific analyses like tissue composition and spatial contiguity.

Examples: evals/spatial_analysis/seeker_spatial_contiguity.json, evals/spatial_analysis/vizgen_tissue_composition.json
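A minimal sketch of the kinds of quantities these evaluations ask about, assuming spatial coordinates live in obsm["spatial"] and labels in obs["cell_type"] (both assumptions):

```python
# Sketch only: cell type composition and a nearest-neighbor contiguity proxy.
import scanpy as sc
from scipy.spatial import cKDTree

adata = sc.read_h5ad("dataset.h5ad")  # assumes obsm["spatial"] and obs["cell_type"] exist

# Tissue composition: fraction of cells per type.
print(adata.obs["cell_type"].value_counts(normalize=True))

# Contiguity proxy: fraction of cells whose nearest spatial neighbor shares their type.
coords = adata.obsm["spatial"]
_, idx = cKDTree(coords).query(coords, k=2)  # column 0 is the cell itself
labels = adata.obs["cell_type"].to_numpy()
same_type = labels == labels[idx[:, 1]]
print(f"Nearest-neighbor agreement: {same_type.mean():.3f}")
```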
SpatialBench includes 6 built-in graders:
- NumericToleranceGrader: For QC metrics, counts, percentages
- LabelSetJaccardGrader: For cell type label sets
- DistributionComparisonGrader: For cell type proportions
- MarkerGenePrecisionRecallGrader: For marker gene lists (P@K, R@K)
- MarkerGeneSeparationGrader: For expression-based validation (AUROC)
- SpatialAdjacencyGrader: For spatial proximity metrics
See docs/graders.md for detailed documentation.
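As one concrete illustration, precision@K and recall@K for marker gene lists are commonly defined as below; this is a sketch of the metric, not the grader's implementation (see docs/graders.md for that):

```python
# Sketch only: precision@K / recall@K for a predicted marker gene list.
def precision_recall_at_k(predicted, ground_truth, k):
    hits = len(set(predicted[:k]) & set(ground_truth))
    return hits / k, hits / len(ground_truth)

p, r = precision_recall_at_k(
    predicted=["NPHS1", "NPHS2", "LRP2", "UMOD", "CUBN"],  # illustrative gene lists
    ground_truth=["NPHS1", "NPHS2", "PODXL"],
    k=5,
)
print(f"P@5 = {p:.2f}, R@5 = {r:.2f}")  # P@5 = 0.40, R@5 = 0.67
```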
Evaluations are defined in JSON:
```json
{
  "id": "eval_identifier",
  "task": "Natural language task description...",
  "data_node": "latch://path/to/dataset.h5ad",
  "grader": {
    "type": "numeric_tolerance",
    "config": {
      "ground_truth": {"field": 100},
      "tolerances": {"field": {"type": "absolute", "value": 5}}
    }
  }
}
```

See docs/specification.md for the complete specification.
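For intuition, the numeric_tolerance configuration above implies a check along these lines (a sketch of the idea, not the package's grader code):

```python
# Sketch only: the pass/fail logic implied by an absolute-tolerance config.
config = {
    "ground_truth": {"field": 100},
    "tolerances": {"field": {"type": "absolute", "value": 5}},
}
answer = {"field": 103}

def passes(answer, config):
    for key, truth in config["ground_truth"].items():
        tol = config["tolerances"][key]
        if tol["type"] == "absolute" and abs(answer[key] - truth) > tol["value"]:
            return False
        # Other tolerance types (e.g. relative) would be handled similarly.
    return True

print(passes(answer, config))  # True: |103 - 100| <= 5
```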
We welcome contributions! See CONTRIBUTING.md for:
- Adding new evaluations
- Creating custom graders
- Submitting benchmark results
SpatialBench supports batch evaluation with parallel execution:
```bash
spatialbench batch evals_full/seeker \
    --agent minisweagent \
    --model anthropic/claude-sonnet-4-5 \
    --output results/run1 \
    --parallel 6 \
    --keep-workspace
```

For running benchmarks across multiple models, use the benchmark script:
```bash
./scripts/benchmark_models.sh evals_full/seeker 6 true
```

Results are organized by run timestamp in results/run_TIMESTAMP/ directories.
For monitoring batch runs and troubleshooting, see BATCH_MONITORING.md.
Batch runs produce:
- batch_results.json - Full results with pass/fail status, agent answers, and grader outputs
- batch_log.txt - Execution log with progress updates
- Agent metrics: cost, steps, duration per evaluation
- Summary statistics: pass rate, average cost, average steps
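A quick way to summarize a finished run, assuming batch_results.json holds a list of per-evaluation records with a boolean passed field (the schema shown here is an assumption; verify against your own output):

```python
# Sketch only: summarize pass rate from batch_results.json; assumes a list of
# records each carrying a boolean "passed" field.
import json
from pathlib import Path

results = json.loads(Path("results/run1/batch_results.json").read_text())
passed = sum(1 for r in results if r.get("passed"))
print(f"Pass rate: {passed}/{len(results)} = {passed / len(results):.1%}")
```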
- CLI Usage Guide
- Evaluation Specification
- Grader API
- Adding Evaluations
- Evaluation Catalog
- Batch Monitoring Guide
If you use SpatialBench in your research, please cite:
```bibtex
@software{spatialbench2024,
  title  = {SpatialBench: A Benchmark for AI Agents on Spatial Biology Analysis},
  author = {Latch Bio},
  year   = {2024},
  url    = {https://github.com/latchbio/spatialbench}
}
```

Apache 2.0 - see LICENSE for details.