Validating LinkML schemas and datasets that depend on external terms
A collection of LinkML ValidationPlugin implementations for validating ontology term references:
- Schema Validation: Validate `meaning` fields in enum permissible values
- Data Validation: Validate data against dynamic enums and binding constraints
- ✅ Three composable validation plugins for the LinkML validator framework
- ✅ Validates `meaning` fields in `permissible_values` in LinkML schemas
- ✅ Validates data against dynamic enums (`reachable_from`, `matches`, `concepts`)
- ✅ Validates binding constraints on nested object fields
- ✅ Supports multiple ontology sources via OAK (Ontology Access Kit)
- ✅ Multi-level caching (in-memory + file-based) for fast repeated validation
- ✅ Configurable per-prefix validation via `oak_config.yaml`
- ✅ Standalone CLI + LinkML validator integration
- ✅ Tracks unknown ontology prefixes
```bash
pip install linkml-term-validator
```

Or with uv:

```bash
uv add linkml-term-validator
```

For interactive tutorials, see the Jupyter notebooks in the `notebooks/` directory.
Check that `meaning` fields in your schema reference valid ontology terms:
```bash
linkml-term-validator validate-schema schema.yaml
```

Validate data instances against dynamic enums and binding constraints:

```bash
linkml-term-validator validate-data data.yaml --schema schema.yaml
```

The `validate-data` command checks:

- Dynamic enums - values match `reachable_from`, `matches`, or `concepts` definitions
- Binding constraints - nested object fields satisfy binding ranges
- Labels (optional with `--labels`) - ontology term labels match the canonical labels
Here's a LinkML schema that uses ontology terms:
```yaml
id: https://example.org/my-schema
name: my-schema
prefixes:
  GO: http://purl.obolibrary.org/obo/GO_
  CHEBI: http://purl.obolibrary.org/obo/CHEBI_
enums:
  BiologicalProcessEnum:
    description: Examples of biological processes
    permissible_values:
      BIOLOGICAL_PROCESS:
        title: biological process
        meaning: GO:0008150
      CELL_CYCLE:
        title: cell cycle
        meaning: GO:0007049
  ChemicalEntityEnum:
    description: Examples of chemical entities
    permissible_values:
      WATER:
        title: water
        meaning: CHEBI:15377
      GLUCOSE:
        title: glucose
        meaning: CHEBI:17234
```

When you run validation:

```bash
linkml-term-validator validate-schema my-schema.yaml
```

The validator will:
- Check that `GO:0008150` exists and has label "biological_process" (or "biological process")
- Check that `GO:0007049` exists and has label "cell cycle"
- Check that `CHEBI:15377` exists and has label "water"
- Check that `CHEBI:17234` exists and has label "glucose"
- Report any mismatches or missing terms
```
Validation Results for my-schema.yaml
============================================================
Enums checked: 2
Values checked: 4
Meanings validated: 4

✅ No issues found!
```
Or if there's an issue:
```
⚠️  WARNING: Label mismatch
    Enum: BiologicalProcessEnum
    Value: BIOLOGICAL_PROCESS
    Expected label: biological process
    Found label: biological_process
    Meaning: GO:0008150

Validation Results for my-schema.yaml
============================================================
Enums checked: 2
Values checked: 4
Meanings validated: 4
Issues found: 1
  Warnings: 1
  Errors: 0
```
Schema with a dynamic enum using `reachable_from`:
```yaml
enums:
  NeuronTypeEnum:
    description: Any neuron type
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000540  # neuron
      relationship_types:
        - rdfs:subClassOf
```

Data file with neuron instances:

```yaml
neurons:
  - id: "1"
    cell_type: CL:0000540  # neuron - valid
  - id: "2"
    cell_type: CL:0000100  # neuron associated cell - valid (descendant)
  - id: "3"
    cell_type: GO:0008150  # biological process - INVALID
```

Validate:

```bash
linkml-term-validator validate-data neurons.yaml --schema schema.yaml
```

Output:

```
❌ Validation failed with 1 issue(s):

❌ ERROR: Value 'GO:0008150' not in dynamic enum NeuronTypeEnum
   Expected one of the descendants of CL:0000540
```
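Under the hood, a `reachable_from` check boils down to a transitive-closure membership test: collect the source node plus everything that subclasses it, then ask whether the value is in that set. The sketch below illustrates the idea on a toy hand-written hierarchy (the dictionary is illustrative, not real ontology data; the actual validator obtains the closure via OAK):

```python
# Toy subclass hierarchy standing in for an ontology (edges: child -> parent).
# Illustrative data only; in practice the closure comes from an OAK adapter.
SUBCLASS_OF = {
    "CL:0000100": "CL:0000540",  # a neuron subtype -> neuron (simplified)
    "CL:0000540": "CL:0000000",  # neuron -> cell
    "GO:0008150": None,          # biological process (unrelated branch)
}

def descendants_of(root: str) -> set[str]:
    """Collect root plus everything that transitively subclasses it."""
    result = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

def in_dynamic_enum(value: str, source_node: str) -> bool:
    """Membership test corresponding to a reachable_from constraint."""
    return value in descendants_of(source_node)

print(in_dynamic_enum("CL:0000100", "CL:0000540"))  # True  - descendant of neuron
print(in_dynamic_enum("GO:0008150", "CL:0000540"))  # False - not reachable
```

This mirrors why `GO:0008150` is rejected in the output above: it is simply not in the descendant set of `CL:0000540`.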
Schema with binding constraints:
```yaml
classes:
  GeneAnnotation:
    slots:
      - gene
      - go_term
    slot_usage:
      go_term:
        range: GOTerm
        bindings:
          - binds_value_of: id
            range: BiologicalProcessEnum
  GOTerm:
    slots:
      - id
      - label
```

Data file:

```yaml
annotations:
  - gene: BRCA1
    go_term:
      id: GO:0008150  # biological_process
      label: biological process
```

Validate with label checking:

```bash
linkml-term-validator validate-data annotations.yaml --schema schema.yaml --labels
```

The validator uses multi-level caching to speed up repeated validations:
During a single validation run, ontology labels are cached in memory. This means if multiple permissible values use the same ontology term, it's only looked up once.
Labels are persisted to CSV files in the cache directory (default: cache/). The cache is organized by ontology prefix:
```
cache/
├── go/
│   └── terms.csv      # GO term labels
├── chebi/
│   └── terms.csv      # CHEBI term labels
└── uberon/
    └── terms.csv      # UBERON term labels
```

Each CSV contains:

```csv
curie,label,retrieved_at
GO:0008150,biological_process,2025-11-15T10:30:00
GO:0007049,cell cycle,2025-11-15T10:30:01
```
- First run: Queries ontology databases, saves to cache
- Subsequent runs: Loads from cache files (very fast!)
- Cache location: Configurable via `--cache-dir` flag
- Disable caching: Use `--no-cache` flag
You might want to clear the cache if:
- Ontology databases have been updated
- You suspect stale or incorrect labels
```bash
# Clear cache for specific ontology
rm -rf cache/go/

# Clear entire cache
rm -rf cache/
```

Create an `oak_config.yaml` to control which ontologies are validated:
```yaml
ontology_adapters:
  GO: sqlite:obo:go          # Use local GO database
  CHEBI: sqlite:obo:chebi    # Use local CHEBI database
  UBERON: sqlite:obo:uberon  # Use local UBERON database
  CUSTOM: ""                 # Skip validation for CUSTOM prefix
```

Then validate with this config:

```bash
linkml-term-validator validate-schema schema.yaml --config oak_config.yaml
```

Important: When using `oak_config.yaml`, ONLY the prefixes listed in the config will be validated. Any prefix not in the config will be tracked as "unknown" and reported at the end of validation.
Without an `oak_config.yaml`, the validator uses `sqlite:obo:` as the default adapter. This automatically creates per-prefix adapters:

- `GO:0008150` → uses `sqlite:obo:go`
- `CHEBI:15377` → uses `sqlite:obo:chebi`
- `UBERON:0000468` → uses `sqlite:obo:uberon`
This works for any OBO ontology that has been downloaded via OAK.
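The default derivation above amounts to lowercasing the CURIE prefix and appending it to the base adapter string. A minimal sketch, assuming the `sqlite:obo:` convention documented here (the helper name is hypothetical):

```python
def adapter_for_curie(curie: str, base: str = "sqlite:obo:") -> str:
    """Derive a per-prefix OAK adapter string from a CURIE.

    Hypothetical helper illustrating the default behavior described above,
    not the library's actual internal function.
    """
    prefix = curie.split(":", 1)[0]
    return base + prefix.lower()

print(adapter_for_curie("GO:0008150"))      # sqlite:obo:go
print(adapter_for_curie("CHEBI:15377"))     # sqlite:obo:chebi
print(adapter_for_curie("UBERON:0000468"))  # sqlite:obo:uberon
```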
linkml-term-validator supports two main validation use cases:
Validates `meaning` fields in enum permissible values.
CLI:
```bash
# Validate schema permissible values
linkml-term-validator validate-schema schema.yaml

# With strict mode (warnings become errors)
linkml-term-validator validate-schema --strict schema.yaml

# With custom config
linkml-term-validator validate-schema --config oak_config.yaml schema.yaml
```

Python API:
```python
from linkml.validator import Validator
from linkml_term_validator.plugins import PermissibleValueMeaningPlugin

plugin = PermissibleValueMeaningPlugin(
    oak_adapter_string="sqlite:obo:",
    strict_mode=False,
)

validator = Validator(schema="schema.yaml", validation_plugins=[plugin])
report = validator.validate("schema.yaml")

if len(report.results) == 0:
    print("Valid!")
else:
    for result in report.results:
        print(f"{result.severity}: {result.message}")
```

Validates data instances against dynamic enums and binding constraints.
CLI:
```bash
# Validate data (checks both dynamic enums and bindings)
linkml-term-validator validate-data data.yaml --schema schema.yaml

# With specific target class
linkml-term-validator validate-data data.yaml -s schema.yaml -t Person

# Also validate labels match ontology
linkml-term-validator validate-data data.yaml -s schema.yaml --labels

# Only check bindings, skip dynamic enums
linkml-term-validator validate-data data.yaml -s schema.yaml --no-dynamic-enums

# Only check dynamic enums, skip bindings
linkml-term-validator validate-data data.yaml -s schema.yaml --no-bindings
```

Data validation includes two aspects:
Validates against enums defined via `reachable_from`, `matches`, or `concepts`.
Example schema:
```yaml
enums:
  NeuronTypeEnum:
    reachable_from:
      source_ontology: obo:cl
      source_nodes: [CL:0000540]  # neuron
      relationship_types: [rdfs:subClassOf]
```

Python API:
```python
from linkml.validator import Validator
from linkml_term_validator.plugins import DynamicEnumPlugin

plugin = DynamicEnumPlugin(oak_adapter_string="sqlite:obo:")
validator = Validator(schema="schema.yaml", validation_plugins=[plugin])
report = validator.validate("data.yaml")
```

Validates nested object fields against binding constraints.
Example schema:
```yaml
classes:
  Annotation:
    slots:
      - term
    slot_usage:
      term:
        range: Term
        bindings:
          - binds_value_of: id
            range: GOTermEnum
```

Python API:
```python
from linkml.validator import Validator
from linkml_term_validator.plugins import BindingValidationPlugin

plugin = BindingValidationPlugin(
    validate_labels=True,  # Also check labels match ontology
)

validator = Validator(schema="schema.yaml", validation_plugins=[plugin])
report = validator.validate("data.yaml")
```

CLI:
```bash
# Validate data with both dynamic enums and bindings (default)
linkml-term-validator validate-data data.yaml --schema schema.yaml

# With label validation enabled
linkml-term-validator validate-data data.yaml -s schema.yaml --labels
```

Python API:
```python
from linkml.validator import Validator
from linkml.validator.plugins import JsonschemaValidationPlugin
from linkml_term_validator.plugins import (
    DynamicEnumPlugin,
    BindingValidationPlugin,
)

# Comprehensive validation pipeline
plugins = [
    JsonschemaValidationPlugin(closed=True),        # Structural validation
    DynamicEnumPlugin(),                            # Dynamic enum validation
    BindingValidationPlugin(validate_labels=True),  # Binding validation
]

validator = Validator(schema="schema.yaml", validation_plugins=plugins)
report = validator.validate("data.yaml")
```

The linkml-term-validator plugins can be used directly with the standard `linkml-validate` command via configuration files.
Create a validation config file (e.g., validation_config.yaml):
```yaml
# Validation configuration for linkml-validate
schema: schema.yaml
target_class: Person
data_sources:
  - data.yaml

plugins:
  # Standard JSON Schema validation
  JsonschemaValidationPlugin:
    closed: true

  # Ontology term validation for dynamic enums
  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_dir: cache

  # Binding constraint validation
  "linkml_term_validator.plugins.BindingValidationPlugin":
    oak_adapter_string: "sqlite:obo:"
    validate_labels: true
    cache_labels: true
    cache_dir: cache
```

Then run validation:

```bash
linkml-validate --config validation_config.yaml
```

See the `examples/` directory for complete examples:
- `simple_config.yaml` - Basic validation config
- `linkml_validate_config.yaml` - Full config with ontology plugins
- `simple_schema.yaml` - Example schema
- `simple_data.yaml` - Example data
```yaml
"linkml_term_validator.plugins.DynamicEnumPlugin":
  oak_adapter_string: "sqlite:obo:"  # OAK adapter (default: sqlite:obo:)
  cache_labels: true                 # Enable label caching (default: true)
  cache_dir: cache                   # Cache directory (default: cache)
  oak_config_path: oak_config.yaml   # Optional: custom OAK config
```

```yaml
"linkml_term_validator.plugins.BindingValidationPlugin":
  oak_adapter_string: "sqlite:obo:"  # OAK adapter (default: sqlite:obo:)
  validate_labels: true              # Check labels match ontology (default: false)
  cache_labels: true                 # Enable label caching (default: true)
  cache_dir: cache                   # Cache directory (default: cache)
  oak_config_path: oak_config.yaml   # Optional: custom OAK config
```

You can also use the plugins programmatically:
```python
from linkml.validator import Validator
from linkml.validator.plugins import JsonschemaValidationPlugin
from linkml_term_validator.plugins import (
    DynamicEnumPlugin,
    BindingValidationPlugin,
)

# Build validation pipeline
plugins = [
    JsonschemaValidationPlugin(closed=True),
    DynamicEnumPlugin(oak_adapter_string="sqlite:obo:"),
    BindingValidationPlugin(validate_labels=True),
]

# Create validator
validator = Validator(
    schema="schema.yaml",
    validation_plugins=plugins,
)

# Validate
report = validator.validate("data.yaml")

# Check results
if len(report.results) == 0:
    print("✅ Validation passed")
else:
    for result in report.results:
        print(f"{result.severity.name}: {result.message}")
```

- `docs/` - mkdocs-managed documentation
- `src/` - source files (edit these)
- `tests/` - Python tests
- `data/` - Example data
There are several predefined command recipes available, written for the command runner `just`. To list all predefined commands, run `just` or `just --list`.
While linkml-term-validator is designed for standard data validation, it serves a crucial role as an anti-hallucination guardrail for agentic AI pipelines that generate ontology term references.
Language models frequently hallucinate identifiers like gene IDs, ontology terms, and other structured references. These fake identifiers often appear structurally correct (e.g., GO:9999999, CHEBI:88888) but don't actually exist in the source ontologies.
A robust guardrail requires dual validation—forcing the AI to provide both the identifier and its canonical label, then validating that they match:
Instead of accepting:

```yaml
term: GO:0005515  # Single piece of information - easy to hallucinate
```

Require and validate:

```yaml
term:
  id: GO:0005515
  label: protein binding  # Must match canonical label in ontology
```

This dramatically reduces hallucinations because the AI must get two interdependent facts correct simultaneously, which is significantly harder to fake convincingly than inventing a single plausible-looking identifier.
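Stripped of the plugin machinery, the dual-validation idea is just two checks against the ontology: the id must exist, and the supplied label must equal the canonical one. A minimal sketch, where `KNOWN_TERMS` is a hypothetical stand-in for a real ontology lookup (e.g. via OAK):

```python
# Illustrative stand-in for an ontology lookup; real code would query OAK.
KNOWN_TERMS = {
    "GO:0005515": "protein binding",
    "GO:0008150": "biological_process",
}

def check_term(term: dict) -> list[str]:
    """Return a list of errors; empty means the id/label pair is valid."""
    errors = []
    canonical = KNOWN_TERMS.get(term["id"])
    if canonical is None:
        errors.append(f"unknown id: {term['id']}")
    elif term.get("label") != canonical:
        errors.append(
            f"label mismatch for {term['id']}: "
            f"expected {canonical!r}, got {term.get('label')!r}"
        )
    return errors

print(check_term({"id": "GO:0005515", "label": "protein binding"}))  # valid: []
print(check_term({"id": "GO:9999999", "label": "made up"}))          # unknown id
```

A hallucinated id fails the first check; a real id paired with an invented label fails the second, which is what makes the pair so much harder to fake than a lone identifier.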
Use linkml-term-validator to embed validation directly into your agentic workflow:
1. Define schemas with binding constraints:

```yaml
classes:
  GeneAnnotation:
    slots:
      - gene
      - go_term
    slot_usage:
      go_term:
        range: GOTerm
        bindings:
          - binds_value_of: id
            range: BiologicalProcessEnum
  GOTerm:
    slots:
      - id     # AI must provide both
      - label  # fields correctly
```

2. Validate AI-generated outputs before committing:
```python
from linkml.validator import Validator
from linkml_term_validator.plugins import BindingValidationPlugin

# Create validator with label checking enabled
plugin = BindingValidationPlugin(validate_labels=True)
validator = Validator(schema="schema.yaml", validation_plugins=[plugin])

# Validate AI-generated data
report = validator.validate(ai_generated_data)

if len(report.results) > 0:
    # Reject hallucinated terms, prompt AI to regenerate
    raise ValueError("Invalid ontology terms detected")
```

3. Use validation during generation (not just post-hoc):
The most effective approach embeds validation during AI generation rather than treating it as a filtering step afterward. This transforms hallucination resistance from a detection problem into a generation constraint.
- Prevents fake identifiers from entering curated datasets
- Catches label mismatches where AI uses real IDs but wrong labels
- Validates dynamic constraints (e.g., only disease terms, only neuron types)
- Enables reliable automation of curation tasks traditionally requiring human experts
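The generation-time loop described in step 3 can be sketched as a regenerate-until-valid wrapper. Everything here is a hypothetical harness: `generate` stands in for an LLM call and `validate` for the plugin-based validator; feeding the errors back into the next prompt is the part that turns detection into a generation constraint:

```python
def validated_generation(generate, validate, max_attempts: int = 3):
    """Regenerate until the validator accepts, feeding errors back each round.

    `generate` and `validate` are caller-supplied stand-ins for an LLM call
    and a validator (e.g. one built on BindingValidationPlugin).
    """
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        errors = validate(candidate)
        if not errors:
            return candidate
        feedback = errors  # include the errors in the next prompt
    raise ValueError(f"no valid output after {max_attempts} attempts: {feedback}")

# Toy demo: the fake "model" corrects its label once it sees the error.
def fake_generate(feedback):
    return {"id": "GO:0005515",
            "label": "protein binding" if feedback else "binding"}

def fake_validate(term):
    return [] if term["label"] == "protein binding" else ["label mismatch"]

print(validated_generation(fake_generate, fake_validate))
```

Bounding the attempts matters in practice: a model that cannot produce a valid pair should surface an error for human review rather than loop forever.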
For detailed patterns and best practices on making ontology IDs hallucination-resistant in AI workflows, see:
- Make IDs Hallucination Resistant - Comprehensive guide from the AI for Curation project
- Jupyter Notebooks - Interactive tutorials demonstrating validation workflows