Problem
Datasets sometimes contain typos or invalid terms from well-known vocabularies. For example, the Kleksi Musiom collection uses schema:ceator (31 records) instead of schema:creator. These invalid terms go undetected and propagate through the pipeline into the dataset browser.
Proposed solution
Add a vocabulary term validation step to @lde/pipeline-void that checks whether properties and classes found in the data are actually defined in their respective vocabularies.
Implementation: TermValidationExecutor (ExecutorDecorator)
A new executor decorator (like VocabularyExecutor) that wraps existing analysis executors:
- Passes through all quads from the inner executor.
- Collects void:property and void:class IRIs from the output.
- Checks each term against known vocabulary definitions for its namespace.
- Appends validation quads for unrecognized terms.
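The steps above could be sketched roughly as follows. This is a simplified, synchronous illustration and not the actual @lde/pipeline API: the `Quad`, `Executor`, and `Vocabulary` types, the `execute()` signature, and the schema.org namespace IRI are all assumptions made for the example, and the error message uses the full IRI rather than a prefixed name.

```typescript
// Hypothetical minimal types; the real @lde/pipeline interfaces may differ.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
}

interface Executor {
  execute(): Quad[];
}

// Assumed shape for the valid terms of one vocabulary.
interface Vocabulary {
  namespace: string;
  label: string;
  properties: Set<string>;
  classes: Set<string>;
}

const VOID = 'http://rdfs.org/ns/void#';
const SCHEMA = 'https://schema.org/'; // assumed namespace IRI

class TermValidationExecutor implements Executor {
  private readonly inner: Executor;
  private readonly vocabularies: Vocabulary[];

  constructor(inner: Executor, vocabularies: Vocabulary[]) {
    this.inner = inner;
    this.vocabularies = vocabularies;
  }

  execute(): Quad[] {
    const output = this.inner.execute();
    const errors: Quad[] = [];

    for (const quad of output) {
      // Only void:property and void:class objects are candidate terms.
      const kind =
        quad.predicate === `${VOID}property` ? 'property'
        : quad.predicate === `${VOID}class` ? 'class'
        : undefined;
      if (kind === undefined) continue;

      // Skip terms from namespaces we have no definitions for.
      const vocabulary = this.vocabularies.find((v) =>
        quad.object.startsWith(v.namespace),
      );
      if (vocabulary === undefined) continue;

      const known =
        kind === 'property' ? vocabulary.properties : vocabulary.classes;
      if (!known.has(quad.object)) {
        errors.push({
          subject: quad.subject,
          predicate: `${SCHEMA}error`,
          object: `${quad.object} is not a recognized ${vocabulary.label} ${kind}`,
        });
      }
    }

    // Pass through all inner quads, then append the validation quads.
    return [...output, ...errors];
  }
}
```

Terms in namespaces without a registered vocabulary are left alone in this sketch, so an unknown vocabulary never produces a false error.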
Output format
The output is not pure VoID — it extends the VoID property/class partitions with data quality annotations using schema:error:
# Existing output (from entity-properties.rq):
<.../void#property-partition-abc123>
void:property schema:ceator ;
void:entities 31 .
# Appended by TermValidationExecutor:
<.../void#property-partition-abc123>
schema:error "schema:ceator is not a recognized schema.org property" .
# Same for invalid classes (from class-partition.rq):
<.../void#class-abc456>
void:class schema:ceator ;
void:entities 31 .
<.../void#class-abc456>
schema:error "schema:ceator is not a recognized schema.org class" .
This reuses schema:error as a literal, consistent with how distribution probe failures are reported in @lde/pipeline. Consumers can query for schema:error on partition nodes to find data quality issues.
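As an illustration of the consumer side, a minimal sketch that groups schema:error messages by partition node from an in-memory list of quads. The `Quad` type, the `collectErrors` helper, and the schema.org namespace IRI are hypothetical and only serve the example; a real consumer would more likely run a SPARQL query against the dataset.

```typescript
// Hypothetical simplified quad; real consumers would use RDF/JS terms.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
}

const SCHEMA_ERROR = 'https://schema.org/error'; // assumed namespace IRI

// Groups all schema:error literals by the partition node they annotate.
function collectErrors(quads: Quad[]): Map<string, string[]> {
  const errors = new Map<string, string[]>();
  for (const quad of quads) {
    if (quad.predicate !== SCHEMA_ERROR) continue;
    const messages = errors.get(quad.subject) ?? [];
    messages.push(quad.object);
    errors.set(quad.subject, messages);
  }
  return errors;
}
```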
Sourcing valid term lists
The best fit appears to be @zazuko/rdf-vocabularies (and its @vocabulary/* packages), which is in the same Zazuko ecosystem as @zazuko/prefixes (already a dependency). It bundles full RDF definitions for ~60 vocabularies including schema.org, Dublin Core, FOAF, and SKOS. Valid terms can be extracted by filtering on rdf:type rdf:Property / rdf:type rdfs:Class.
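The filtering step might look like the sketch below. It deliberately does not call @zazuko/rdf-vocabularies; loading the vocabulary is left out, and the function works on a simplified, hypothetical `Quad` shape so that only the rdf:type filter is shown. The `extractTerms` name is likewise an assumption for illustration.

```typescript
// Hypothetical simplified quad; the vocabulary packages expose RDF/JS quads.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
}

const RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';
const RDF_PROPERTY = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property';
const RDFS_CLASS = 'http://www.w3.org/2000/01/rdf-schema#Class';

interface ValidTerms {
  properties: Set<string>;
  classes: Set<string>;
}

// Collects every subject typed as rdf:Property or rdfs:Class.
function extractTerms(definitions: Quad[]): ValidTerms {
  const terms: ValidTerms = { properties: new Set(), classes: new Set() };
  for (const quad of definitions) {
    if (quad.predicate !== RDF_TYPE) continue;
    if (quad.object === RDF_PROPERTY) terms.properties.add(quad.subject);
    if (quad.object === RDFS_CLASS) terms.classes.add(quad.subject);
  }
  return terms;
}
```

Vocabularies that declare their terms with other types (for example owl:ObjectProperty or owl:Class) would need additional type IRIs in the filter; this sketch covers only the two named above.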
Scope
- Start with schema.org validation (most common vocabulary in the datasets we process).
Context
Discovered via the Kleksi Phase 1 quality report, which identified schema:ceator as a high-severity data quality issue affecting 31 records.