BioEval is a benchmark framework that evaluates whether large language models can reason about biology, not just recall biological facts. While existing benchmarks test factual knowledge (e.g., "What does TP53 encode?"), BioEval tests the reasoning capabilities scientists actually need: executing protocols, predicting experimental outcomes, identifying methodological flaws, and handling adversarial scenarios.
Current LLM benchmarks for biology measure whether models have learned about biology from text. BioEval measures whether models have learned biology — the causal reasoning that predicts what happens when you perturb a biological system.
| Benchmark | Limitation |
|---|---|
| MedQA, MedMCQA | Multiple choice, knowledge retrieval only |
| GPQA | Intentionally removes questions where experts disagree |
| PubMedQA | Yes/no questions on abstracts |
| BioASQ | Question answering, not reasoning |
| LAB-Bench | Factual accuracy only |
BioEval fills this gap with procedural reasoning, causal grounding from experimental data, design critique, and adversarial robustness testing.
| Component | What It Tests | Base Tasks | Ground Truth |
|---|---|---|---|
| ProtoReason | Protocol execution, calculation, troubleshooting | 17 | Expert annotation |
| CausalBio | Perturbation outcome prediction | 13 | Experimental data (DepMap, CMap) |
| DesignCheck | Experimental design critique | 10 | Annotated flaws |
| Adversarial | Robustness to trick questions | 24 | Trap detection |
| MultiTurn | Scientific dialogue coherence | 6 | Conversation flow |
| Calibration | Confidence calibration | 10 | "I don't know" tests |
| Tier | Tasks | Description |
|---|---|---|
| Base | 80 | Core evaluation tasks across all 6 components |
| Extended | 114 | Additional ProtoReason (70) and CausalBio (44) tasks |
| Advanced | 78 | Advanced ProtoReason (49), CausalBio (19), DesignCheck (10) |
| Total | 272 | Full benchmark suite |
```bash
# Clone repository
git clone https://github.com/jang1563/BioEval.git
cd BioEval

# Option 1: pip install
pip install -e .

# Option 2: conda (recommended for Apple Silicon)
conda create -n bioeval python=3.11
conda activate bioeval
pip install -e .
```

```bash
# Show complete task inventory (no API key needed)
bioeval inventory

# Dry run — shows what would be evaluated without API calls
bioeval run --all --dry-run

# Run full evaluation
export ANTHROPIC_API_KEY="your-key-here"
bioeval run --all --model claude-sonnet-4-20250514

# Run specific component
bioeval run -c adversarial -m claude-sonnet-4-20250514

# Run with extended data tier
bioeval run --all --data-tier extended

# Compare two result files
bioeval compare results_a.json results_b.json

# Show pre-cached results (no API key needed)
bioeval demo
```

```python
from bioeval.protoreason.evaluator import ProtoReasonEvaluator
from bioeval.causalbio.evaluator import CausalBioEvaluator
from bioeval.adversarial.tasks import AdversarialEvaluator

# Run individual components
evaluator = ProtoReasonEvaluator(model_name="claude-sonnet-4-20250514")
results = evaluator.run_evaluation()

# With enhanced prompts (adversarial)
evaluator = AdversarialEvaluator(use_enhanced_prompts=True)
results = evaluator.run_evaluation()
```

| Test Type | Baseline | Enhanced | Improvement |
|---|---|---|---|
| False Premise | 60% | 100% | +40% |
| Plausible Nonsense | 67% | 100% | +33% |
| Edge Case | 75% | 100% | +25% |
| Hallucination Trap | 80% | 100% | +20% |
| Overall Pass Rate | 62.5% | 83.3% | +20.8% |
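For concreteness, a false-premise task might look like the record below. The field names and the keyword check are hypothetical illustrations, not BioEval's actual schema or scoring logic:

```python
# Hypothetical adversarial task record; field names are illustrative,
# not BioEval's actual schema.
false_premise_task = {
    "id": "adv-001",
    "category": "false_premise",
    "prompt": "Since TP53 drives oncogenic growth, which inhibitor blocks it?",
    "trap": "TP53 is a tumor suppressor, so the premise is false.",
    "pass_criterion": "The model must challenge the premise, not name an inhibitor.",
}

def passes(response: str) -> bool:
    """Crude keyword check for illustration only: a response passes
    if it rejects the false premise rather than answering it."""
    return "tumor suppressor" in response.lower()
```

In practice such checks are done with the LLM-as-judge rubrics described below, not keyword matching.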
BioEval includes targeted prompt engineering strategies that address specific failure modes:
- Calibration Enhancement — Reduces overconfidence by requiring explicit evidence listing
- Context Defense — Filters misleading/irrelevant information via relevance analysis
- Edge Case Recognition — Forces explicit consideration of boundary conditions
- Nonsense Detection — Catches hallucination traps by requiring entity verification
- Chain-of-Thought — Structured 6-step reasoning for causal biology questions
```python
from bioeval.prompts import enhance_prompt, PromptEnhancementConfig

config = PromptEnhancementConfig(
    calibration=True,
    context_defense=True,
    edge_case=True,
    nonsense_detection=True,
    chain_of_thought=True,
)

enhanced = enhance_prompt(original_prompt, config)
```

| Provider | Models | Method |
|---|---|---|
| Anthropic | Claude Sonnet 4, Claude Opus 4 | API |
| OpenAI | GPT-4o, GPT-4-turbo | API |
| HuggingFace | Mistral, Llama, etc. | Local (with LoRA support) |
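All three providers sit behind the model wrappers in `bioeval/models/base.py`. A minimal sketch of that pattern — class and method names here are assumptions for illustration, and the offline stand-in exists only so the sketch runs without API keys:

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Illustrative unified model interface; the real wrappers in
    bioeval/models/base.py may differ in names and signatures."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 1024) -> str:
        """Return the model's text completion for a prompt."""

class EchoModel(BaseModel):
    """Offline stand-in: echoes the prompt, truncated to max_tokens."""

    def generate(self, prompt: str, max_tokens: int = 1024) -> str:
        return prompt[:max_tokens]
```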
- Async Execution — Parallel evaluation with rate limiting (`scripts/run_enhanced.py`)
- Response Caching — SQLite-based caching to avoid redundant API calls
- LLM-as-Judge — Semantic evaluation using structured rubrics
- Confidence Calibration — ECE, overconfidence rates, reliability diagrams
- Multi-Turn Dialogue — Hypothesis refinement, iterative design, troubleshooting
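The calibration metrics above can be sketched in a few lines. This is a generic illustration of ECE and an overconfidence rate, not BioEval's scoring code; the bin count and the 0.8 threshold are assumptions:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy
    inside equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 -> last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(conf - acc)
    return ece

def overconfidence_rate(confidences, correct, threshold=0.8):
    """Fraction of high-confidence answers that are wrong
    (threshold is an assumed value for illustration)."""
    high = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    return sum(1 for ok in high if not ok) / len(high) if high else 0.0
```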
```
BioEval/
├── bioeval/
│   ├── __init__.py                  # Package exports
│   ├── cli.py                       # Unified CLI entry point
│   ├── config.py                    # Configuration settings
│   ├── models/
│   │   └── base.py                  # Model wrappers (Claude, OpenAI, HuggingFace)
│   ├── prompts/
│   │   └── prompt_templates.py      # Enhancement templates
│   ├── protoreason/                 # Protocol reasoning component
│   │   ├── evaluator.py             # Base tasks (17)
│   │   ├── extended_data.py         # Extended tasks (70)
│   │   └── advanced_data.py         # Advanced tasks (49)
│   ├── causalbio/                   # Causal biology component
│   │   ├── evaluator.py             # Base tasks (13)
│   │   ├── extended_data.py         # Extended tasks (44)
│   │   └── advanced_data.py         # Advanced tasks (19)
│   ├── designcheck/                 # Experimental design critique
│   │   ├── evaluator.py             # Base tasks (10)
│   │   └── advanced_data.py         # Advanced tasks (10)
│   ├── adversarial/                 # Adversarial robustness (24 tasks)
│   ├── multiturn/                   # Multi-turn dialogues (6 scenarios)
│   ├── scoring/                     # Scoring & calibration
│   └── calibration/                 # Calibration tests (10 tasks)
├── scripts/
│   ├── run_evaluation.py            # Basic evaluation runner
│   ├── run_enhanced.py              # Full-featured async runner
│   ├── run_comparison.py            # Enhanced vs baseline comparison
│   └── visualize_results.py         # Results visualization
├── docs/                            # Project documentation
│   ├── PRD.md                       # Product Requirements Document
│   ├── IMPROVEMENT_PLAN.md          # Development roadmap
│   ├── BIOLOGICAL_AMBIGUITY_DESIGN.md  # BioAmbiguity component design
│   ├── LITERATURE_SURVEY.md         # Related work survey
│   ├── EXPERT_PANEL_REVIEW.md       # Expert panel feedback
│   ├── PUBLICATION_QUALITY_REVIEW.md   # Quality review
│   └── PHASE0_BASELINE.md           # Phase 0 baseline report
├── results/                         # Evaluation outputs
├── tests/                           # Test suite (27 tests)
├── notebooks/                       # Analysis notebooks
├── setup.py
├── requirements.txt
└── README.md
```
| Phase | Goal | Status |
|---|---|---|
| Phase 0 | Make It Run — imports, tests, CLI, baseline | COMPLETE |
| Phase 1 | Make It Score — real metrics (Kendall's tau, directional accuracy, detection rate) | Planned |
| Phase 2 | Make It Credible — 3-model comparison, statistical tests, judge validation | Planned |
| Phase 2b | BioAmbiguity — novel component for context-dependent biological reasoning (45 tasks) | Planned |
| Phase 3 | Make It Impressive — dashboard, publication prep, HuggingFace distribution | Planned |
Publication target: NeurIPS Datasets & Benchmarks / Nature Methods
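The Phase 1 metrics can be sketched in a few lines. These are generic implementations of directional accuracy and Kendall's tau (tau-a, no tie correction), offered as illustrations rather than the benchmark's actual scoring code:

```python
def directional_accuracy(predicted, observed):
    """Fraction of effects whose predicted sign (up / down / no change)
    matches the experimentally observed sign."""
    sign = lambda x: (x > 0) - (x < 0)
    return sum(sign(p) == sign(o) for p, o in zip(predicted, observed)) / len(predicted)

def kendall_tau(x, y):
    """Kendall's tau-a rank correlation: (concordant - discordant)
    over all pairs, computed O(n^2) for clarity."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)
```

In practice a library routine such as `scipy.stats.kendalltau` would handle ties and large inputs more robustly.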
BioEval includes expert-curated evaluation tasks that work out of the box:
- 13+ protocols with 235+ steps (sources: protocols.io, Nature Protocols, STAR Protocols)
- 44+ causal biology tasks with experimental ground truth
- 10 flawed experimental designs with 30 annotated flaws
- 24 adversarial robustness tests across 7 categories
- 6 multi-turn dialogue scenarios
| Source | License | Used For |
|---|---|---|
| DepMap | CC BY 4.0 | Gene essentiality ground truth |
| Connectivity Map | CC BY 4.0 | Drug response signatures |
| protocols.io | Various | Additional protocols |
| GEO | Public | Expression data |
```bash
# Run full test suite
pytest tests/ -v

# Expected output: 27 passed in ~1s
```

Test coverage includes:
- Data loading for all 6 components
- Task structure validation
- Confidence extraction
- Adversarial scoring
- Calibration metrics
- Response caching
- Statistics functions
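A task structure-validation check of the kind listed above can be as simple as the sketch below; the required field names are assumptions for illustration, not the repository's schema:

```python
REQUIRED_FIELDS = {"id", "prompt", "ground_truth"}  # assumed schema

def validate_task(task: dict) -> list:
    """Return the sorted list of required fields missing from a task record."""
    return sorted(REQUIRED_FIELDS - task.keys())
```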
| Tier | Tasks | Cost per Run (Claude) |
|---|---|---|
| Base | 80 | ~$1.40 |
| Base + Judge | 80 | ~$1.80 |
| Extended + Judge | 194 | ~$3.50 |
```bibtex
@software{bioeval2026,
  author = {JangKeun Kim},
  title = {BioEval: Multi-dimensional Evaluation of LLMs for Biological Research},
  year = {2026},
  url = {https://github.com/jang1563/BioEval}
}
```

MIT License. See LICENSE for details.
- DepMap project for CRISPR screening data
- Connectivity Map for drug perturbation signatures
- protocols.io community for open protocols