feat(sparql-qlever): support JSON-LD and zipped distributions#397
Merged
- Add JSON-LD support via in-Node preprocessing to N-Quads (jsonld lib).
- Add zip extraction when compressFormat=application/zip (yauzl), with the inner mediaType driving entry filtering. Standalone application/zip distributions are rejected: the inner format must be declared.
- Sort distributions in Importer.import() to prefer native QLever formats (nt/nq/ttl) over JSON-LD; the preprocessor is only invoked when needed.
- Add compressMimeType getter on Distribution that strips the IANA prefix.
- Treat application/zip as a compression Content-Type in distribution-probe so a zip Content-Type no longer raises a format-mismatch warning.
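The prefix stripping and the standalone-zip rejection described in this commit could look roughly like the sketch below. The names `stripIanaPrefix` and `assertInnerFormatDeclared` are hypothetical, and the exact IANA prefix that `@lde/dataset` strips is an assumption; the real `Distribution.compressMimeType` getter may differ.

```typescript
// Assumption: compressFormat values are IANA media-type URIs such as
// "https://www.iana.org/assignments/media-types/application/zip".
const IANA_PREFIX = /^https?:\/\/www\.iana\.org\/assignments\/media-types\//;

// Hypothetical helper: return the bare MIME type, leaving values
// without the prefix unchanged.
function stripIanaPrefix(value: string): string {
  return value.replace(IANA_PREFIX, '');
}

// Hypothetical guard: a distribution whose mediaType is itself
// application/zip does not declare what is inside the archive.
function assertInnerFormatDeclared(mediaType: string): void {
  if (stripIanaPrefix(mediaType) === 'application/zip') {
    throw new Error(
      'Standalone application/zip is not accepted: declare the inner RDF format',
    );
  }
}
```

With this shape, `mediaType=application/ld+json` plus `compressFormat=application/zip` passes the guard, while `mediaType=application/zip` on its own is rejected.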
…or native zips
- Replace the in-memory jsonld lib with jsonld-streaming-parser + n3.StreamWriter so JSON-LD documents flow through the pipeline as a stream and memory use stays bounded for large distributions.
- Restrict Node-side preprocessing to JSON-LD only. Native RDF (nt/nq/ttl) in a zip container is now handled by the shell pipeline via unzip -p, which is already available in the QLever Docker image.
- Add application/zip to the importer's compressionTypes set so a server returning that Content-Type doesn't get flagged as a format mismatch.
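The split between Node-side preprocessing and the shell pipeline can be sketched as a small decision function. This is an illustration only: `planImport` and the `Plan` type are hypothetical names, and the real importer may structure the dispatch differently.

```typescript
// Hypothetical result type: either hand the file to the shell pipeline
// (optionally behind a decompression command) or to the Node preprocessor.
type Plan =
  | { kind: 'shell'; decompress: string | null }
  | { kind: 'preprocess' };

// Assumption: mediaType and compressMimeType are bare MIME types.
function planImport(mediaType: string, compressMimeType?: string): Plan {
  // JSON-LD always goes through Node, whatever the compression.
  if (mediaType === 'application/ld+json') {
    return { kind: 'preprocess' };
  }
  // Native RDF stays in the shell pipeline in front of qlever-index.
  switch (compressMimeType) {
    case undefined:
      return { kind: 'shell', decompress: null };
    case 'application/gzip':
      return { kind: 'shell', decompress: 'gunzip -c' };
    case 'application/zip':
      return { kind: 'shell', decompress: 'unzip -p' };
    default:
      throw new Error(`Unsupported compression: ${compressMimeType}`);
  }
}
```

Keeping native formats in the shell avoids pulling large N-Triples or Turtle files through Node at all.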
…pressionMediaTypes
Code-review follow-ups on the JSON-LD/zip support:
- Switch JSON-LD preprocessing to rdf-parse + rdf-serialize, matching the stack @lde/fastify-rdf already uses. Drops the direct deps on jsonld-streaming-parser and n3 (still present transitively via rdf-parse) and leaves the preprocessor format-agnostic for future formats.
- Lift the compression-content-type set into @lde/dataset as compressionMediaTypes and reuse it from the importer and the probe.
- Collapse the importer's supportedFormats Set + preferenceOrder Record into a single ordered acceptedMediaTypes list.
- preprocess.ts: open the output writable once per call and use a PassThrough tap to keep it open across zip entries; close the yauzl handle in a finally; tighten the mtime cache check to strict greater-than.
- Hoist basename(file) in index() and drop the unused PreprocessedFormat type export.
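A single ordered list can serve as both the filter and the preference order, which is the idea behind collapsing the Set and the Record. The sketch below assumes the nt/nq/ttl/JSON-LD order mentioned earlier; `sortByPreference` is a hypothetical helper, not the importer's actual method.

```typescript
// Assumption: most preferred first; JSON-LD is last because it needs
// preprocessing before qlever-index can read it.
const acceptedMediaTypes: readonly string[] = [
  'application/n-triples',
  'application/n-quads',
  'text/turtle',
  'application/ld+json',
];

// Drop unsupported distributions, then order the rest by list position.
// filter() returns a new array, so the caller's input is not mutated.
function sortByPreference<T extends { mediaType: string }>(
  distributions: readonly T[],
): T[] {
  return distributions
    .filter((d) => acceptedMediaTypes.includes(d.mediaType))
    .sort(
      (a, b) =>
        acceptedMediaTypes.indexOf(a.mediaType) -
        acceptedMediaTypes.indexOf(b.mediaType),
    );
}
```

Membership and rank now come from one place, so adding a format is a one-line change.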
ddeboer added a commit to netwerk-digitaal-erfgoed/dataset-knowledge-graph that referenced this pull request May 22, 2026
…ough (#292)

Refs #284. Companion of ldelements/lde#397, which teaches `@lde/sparql-qlever` to handle JSON-LD and zip-compressed distributions. This PR is the consumer-side change here in the DKG.

## Changes

- Emit `dcat:compressFormat` in the CONSTRUCT so the LDE `Distribution` model receives the compression info it now uses to decide between `gunzip -c` (gzip), `unzip -p` (zip) and JSON-LD preprocessing.
- `OPTIONAL { ?distribution dcat:compressFormat ?distribution_compressFormat }` added to the WHERE.
- Drop the `application/{ld+json,n-quads,n-triples}+gzip` and `text/turtle+gzip` lines from the `FILTER`: the dataset register normalizes those suffixes into a separate `dcat:compressFormat` during ingestion, so those values never appear in `?distribution_mediaType`. The lines were dead.

## What it unlocks

Now selectable end-to-end:

- Gzipped JSON-LD, e.g. Nijmegen `LOD+Beelddocumenten.jsonld.gz` (`mediaType=application/ld+json` + `compressFormat=application/gzip`).
- Plain JSON-LD: already in the FILTER, now actually processable.
- Zipped JSON-LD when the publisher declares the inner format, i.e. `encodingFormat=application/ld+json+zip` on the schema.org side, which the register splits into `mediaType=application/ld+json` + `compressFormat=application/zip`.

## What it does NOT unlock

The Verhaal van Utrecht (`vvu_verhalen`) distribution mentioned in #284 declares `encodingFormat=application/zip` alone, with the inner format only in a free-text `description`. The pipeline intentionally rejects that: without a declared inner mediaType we can't safely process the archive. The publisher needs to update their schema.org to `application/ld+json+zip`.
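To make the suffix normalization concrete, here is a minimal sketch of how an `encodingFormat` such as `application/ld+json+zip` could be split into a media type and a compression format. The function `splitEncodingFormat` and the suffix table are hypothetical; the dataset register's actual implementation is not shown in this PR.

```typescript
// Assumption: only these two compression suffixes are recognised.
const compressionSuffixes: Record<string, string> = {
  '+gzip': 'application/gzip',
  '+zip': 'application/zip',
};

// Hypothetical helper: split "type/subtype+suffix" into its two parts.
function splitEncodingFormat(encodingFormat: string): {
  mediaType: string;
  compressFormat?: string;
} {
  for (const [suffix, compressFormat] of Object.entries(compressionSuffixes)) {
    if (encodingFormat.endsWith(suffix)) {
      return {
        mediaType: encodingFormat.slice(0, -suffix.length),
        compressFormat,
      };
    }
  }
  // No compression suffix: the value is the media type itself.
  return { mediaType: encodingFormat };
}
```

A bare `application/zip` has no `+zip` suffix, so it comes back as the media type with no compression format, which is the case the pipeline rejects.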
Refs netwerk-digitaal-erfgoed/dataset-knowledge-graph#284 — the consumer-side change that takes advantage of this lives in that repo.
## Summary

Extends `@lde/sparql-qlever` so the QLever importer can ingest:

- JSON-LD distributions, which are preprocessed to N-Quads before `qlever-index` runs.
- Zip-compressed distributions: `dcat:compressFormat=application/zip` with a known inner `dcat:mediaType`. The inner mediaType drives which zip entries are accepted (`application/ld+json`, `application/n-triples`, `application/n-quads`).

Standalone `dcat:mediaType=application/zip` is intentionally not accepted: the inner RDF format must be declared so we know what to expect inside the archive. Publishers should declare e.g. `application/ld+json+zip` (which the dataset-register normalizes to `mediaType=application/ld+json` + `compressFormat=application/zip`).

## Changes

### @lde/sparql-qlever

- `preprocess.ts` module: JSON-LD → N-Quads via `jsonld`; zip extraction via `yauzl`; gzip fallback by file extension when `compressFormat` is missing; mtime-based output caching.
- `Importer.import()` now sorts distributions by preference (native formats first, JSON-LD last), so e.g. an `nq` distribution is tried before an `ld+json` one.
- `Importer.doImport()` dispatches through the preprocessor only when needed; the existing `gunzip -c | qlever-index` path is untouched for plain and gzipped native formats.
- Dependencies: `jsonld`, `yauzl` (+ types).

### @lde/dataset

- `Distribution.compressMimeType` getter (strips the IANA prefix from `compressFormat`).

### @lde/distribution-probe

- Add `application/zip` to `compressionTypes` so a zip Content-Type no longer raises a format-mismatch warning.

Tests: 12 new unit tests covering preference ordering, JSON-LD conversion, zip extraction, inner-mediaType validation, and the standalone-zip rejection.
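Two of the smaller preprocessor rules, the gzip fallback by file extension and the mtime-based cache, can be expressed as pure functions. This is a sketch under assumptions: both function names are hypothetical, the `.gz` extension is assumed to be the trigger, and the cache is assumed to count as fresh only when the output is strictly newer than the source.

```typescript
// Hypothetical helper: fall back to the file extension only when the
// distribution declares no compressFormat.
function effectiveCompression(
  fileName: string,
  compressMimeType?: string,
): string | undefined {
  if (compressMimeType !== undefined) return compressMimeType;
  return fileName.toLowerCase().endsWith('.gz') ? 'application/gzip' : undefined;
}

// Hypothetical helper: reuse cached output only if it exists and is
// strictly newer than the source. Equal timestamps count as stale.
function isCacheFresh(
  sourceMtimeMs: number,
  outputMtimeMs: number | undefined,
): boolean {
  return outputMtimeMs !== undefined && outputMtimeMs > sourceMtimeMs;
}
```

Treating equal timestamps as stale costs an occasional redundant conversion, but it avoids serving output written in the same tick as a source update.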