feat(sparql-qlever): support JSON-LD and zipped distributions #397

Merged: ddeboer merged 7 commits into main from feat/zip-jsonld-distributions on May 21, 2026

@ddeboer ddeboer commented May 21, 2026

Refs netwerk-digitaal-erfgoed/dataset-knowledge-graph#284 — the consumer-side change that takes advantage of this lives in that repo.

Summary

Extends @lde/sparql-qlever so the QLever importer can ingest:

  • JSON-LD distributions (plain, gzipped, or zipped) — converted to N-Quads in Node before qlever-index runs.
  • Zip-compressed RDF distributions: dcat:compressFormat=application/zip with a known inner dcat:mediaType. The inner mediaType determines which zip entries are accepted (application/ld+json, application/n-triples, application/n-quads).

Standalone dcat:mediaType=application/zip is intentionally not accepted: the inner RDF format must be declared so we know what to expect inside the archive. Publishers should declare e.g. application/ld+json+zip (which the dataset-register normalizes to mediaType=application/ld+json + compressFormat=application/zip).
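
The acceptance rule described above could be sketched roughly as follows. This is only an illustration, not the code in preprocess.ts: the names `DistributionInfo`, `innerFormats` and `isAcceptedZip` are hypothetical, and the sketch assumes `compressFormat` has already been reduced to a bare media type.

```typescript
// Hypothetical shape: only the two fields this rule inspects.
interface DistributionInfo {
  mediaType: string; // e.g. "application/ld+json"
  compressFormat?: string; // e.g. "application/zip" (assumed to be a bare media type)
}

// Inner formats the PR description lists as acceptable inside a zip archive.
const innerFormats: ReadonlySet<string> = new Set([
  'application/ld+json',
  'application/n-triples',
  'application/n-quads',
]);

function isAcceptedZip(distribution: DistributionInfo): boolean {
  // A bare zip declares nothing about its contents, so it is rejected.
  if (distribution.mediaType === 'application/zip') {
    return false;
  }
  // Otherwise require zip compression plus a known inner RDF format.
  return (
    distribution.compressFormat === 'application/zip' &&
    innerFormats.has(distribution.mediaType)
  );
}
```

With this sketch, `{mediaType: 'application/ld+json', compressFormat: 'application/zip'}` is accepted, while `{mediaType: 'application/zip'}` is not.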

Changes

  • @lde/sparql-qlever

    • New preprocess.ts module: JSON-LD → N-Quads via jsonld; zip extraction via yauzl; gzip fallback by file extension when compressFormat is missing; mtime-based output caching.
    • Importer.import() now sorts distributions by preference — native formats first, JSON-LD last — so e.g. an nq distribution is tried before a ld+json one.
    • Importer.doImport() dispatches through the preprocessor only when needed; the existing gunzip -c | qlever-index path is untouched for plain and gzipped native formats.
    • New deps: jsonld, yauzl (+ types).
  • @lde/dataset

    • Add Distribution.compressMimeType getter (strips the IANA prefix from compressFormat).
  • @lde/distribution-probe

    • Add application/zip to compressionTypes so a zip Content-Type no longer raises a format-mismatch warning.
  • Tests: 12 new unit tests covering preference ordering, JSON-LD conversion, zip extraction, inner-mediaType validation, and the standalone-zip rejection.
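
The preference ordering in Importer.import() might look something like the sketch below. The list name `preferredOrder`, the helper `sortByPreference` and the relative order among the native formats are assumptions made for illustration; the description only states that native formats come first and JSON-LD last.

```typescript
// Assumed order: native QLever formats first, JSON-LD last.
const preferredOrder: readonly string[] = [
  'application/n-quads',
  'application/n-triples',
  'text/turtle',
  'application/ld+json',
];

function sortByPreference<T extends { mediaType: string }>(
  distributions: readonly T[],
): T[] {
  // Unknown media types get the largest rank so they end up at the back.
  const rank = (mediaType: string): number => {
    const index = preferredOrder.indexOf(mediaType);
    return index === -1 ? Number.MAX_SAFE_INTEGER : index;
  };
  // Copy before sorting so the caller's array is left untouched.
  return [...distributions].sort(
    (a, b) => rank(a.mediaType) - rank(b.mediaType),
  );
}
```

Because the sort works on a copy and `Array.prototype.sort` is stable, distributions of the same media type keep their original relative order.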

ddeboer added 6 commits May 21, 2026 13:35
- Add JSON-LD support via in-Node preprocessing to N-Quads (jsonld lib).
- Add zip extraction when compressFormat=application/zip (yauzl), with the
  inner mediaType driving entry filtering. Standalone application/zip
  distributions are rejected: the inner format must be declared.
- Sort distributions in Importer.import() to prefer native QLever formats
  (nt/nq/ttl) over JSON-LD; the preprocessor is only invoked when needed.
- Add compressMimeType getter on Distribution that strips the IANA prefix.
- Treat application/zip as a compression Content-Type in distribution-probe
  so a zip Content-Type no longer raises a format-mismatch warning.
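
As a rough illustration of the compressMimeType getter, prefix stripping could be written as below. The prefix value is an assumption (the commonly used IANA media-types IRI), and the free function form is hypothetical; in the PR this is a getter on Distribution.

```typescript
// Assumption: compressFormat is published as an IANA media-type IRI.
const IANA_PREFIX = 'https://www.iana.org/assignments/media-types/';

function compressMimeType(
  compressFormat: string | undefined,
): string | undefined {
  if (compressFormat === undefined) {
    return undefined;
  }
  // Strip the IRI prefix when present; pass bare media types through unchanged.
  return compressFormat.startsWith(IANA_PREFIX)
    ? compressFormat.slice(IANA_PREFIX.length)
    : compressFormat;
}
```
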
…or native zips

- Replace the in-memory jsonld lib with jsonld-streaming-parser + n3.StreamWriter
  so JSON-LD documents flow through the pipeline as a stream and memory use stays
  bounded for large distributions.
- Restrict Node-side preprocessing to JSON-LD only. Native RDF (nt/nq/ttl) in a
  zip container is now handled by the shell pipeline via unzip -p, which is
  already available in the QLever Docker image.
- Add application/zip to the importer's compressionTypes set so a server
  returning that Content-Type doesn't get flagged as a format mismatch.
…pressionMediaTypes

Code-review follow-ups on the JSON-LD/zip support:

- Switch JSON-LD preprocessing to rdf-parse + rdf-serialize, matching the
  stack @lde/fastify-rdf already uses. Drops the direct deps on
  jsonld-streaming-parser and n3 (still present transitively via rdf-parse)
  and leaves the preprocessor format-agnostic for future formats.
- Lift the compression-content-type set into @lde/dataset as
  compressionMediaTypes and reuse it from the importer and the probe.
- Collapse the importer's supportedFormats Set + preferenceOrder Record
  into a single ordered acceptedMediaTypes list.
- preprocess.ts: open the output writable once per call and use a
  PassThrough tap to keep it open across zip entries; close the yauzl handle
  in a finally; tighten the mtime cache check to strict greater-than.
- Hoist basename(file) in index() and drop the unused PreprocessedFormat
  type export.
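
The tightened cache check can be illustrated with a small pure function. This is a sketch under the assumption that the cached output counts as fresh only when its modification time is strictly later than the source's; the name `isCacheFresh` and the millisecond parameters are hypothetical.

```typescript
// Assumed rule: reuse the preprocessed output only if it was written
// strictly after the source file was last modified.
function isCacheFresh(
  sourceMtimeMs: number,
  outputMtimeMs: number | undefined,
): boolean {
  // No output on disk yet means there is nothing to reuse.
  if (outputMtimeMs === undefined) {
    return false;
  }
  // Strict comparison: equal timestamps are treated as stale, since the
  // source may have changed within the same timestamp tick.
  return outputMtimeMs > sourceMtimeMs;
}
```
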
@ddeboer ddeboer enabled auto-merge (rebase) May 21, 2026 13:02
@ddeboer ddeboer merged commit 260757d into main May 21, 2026
2 checks passed
@ddeboer ddeboer deleted the feat/zip-jsonld-distributions branch May 21, 2026 13:08
ddeboer added a commit to netwerk-digitaal-erfgoed/dataset-knowledge-graph that referenced this pull request May 22, 2026
…ough (#292)

Refs #284.

Companion of ldelements/lde#397, which teaches `@lde/sparql-qlever` to
handle JSON-LD and zip-compressed distributions. This PR is the
consumer-side change here in the DKG.

## Changes

- Emit `dcat:compressFormat` in the CONSTRUCT so the LDE `Distribution`
model receives the compression info it now uses to decide between
`gunzip -c` (gzip), `unzip -p` (zip) and JSON-LD preprocessing.
- `OPTIONAL { ?distribution dcat:compressFormat
?distribution_compressFormat }` added to the WHERE.
- Drop the `application/{ld+json,n-quads,n-triples}+gzip` and
`text/turtle+gzip` lines from the `FILTER`: the dataset register
normalizes those suffixes into a separate `dcat:compressFormat` during
ingestion, so those values never appear in `?distribution_mediaType`.
The lines were dead.
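
A minimal sketch of how the media type and compression info could select between the three paths named above. The `Strategy` type, the `'direct'` label for uncompressed native files and the `chooseStrategy` helper are hypothetical; the real dispatch lives in `@lde/sparql-qlever`.

```typescript
// 'direct' is a made-up label for native files that need no decompression.
type Strategy = 'direct' | 'gunzip -c' | 'unzip -p' | 'preprocess';

function chooseStrategy(
  mediaType: string,
  compressMimeType?: string,
): Strategy {
  // JSON-LD always goes through Node-side preprocessing,
  // whether it is plain, gzipped or zipped.
  if (mediaType === 'application/ld+json') {
    return 'preprocess';
  }
  // Native RDF formats stay in the shell pipeline.
  if (compressMimeType === 'application/gzip') {
    return 'gunzip -c';
  }
  if (compressMimeType === 'application/zip') {
    return 'unzip -p';
  }
  return 'direct';
}
```
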

## What it unlocks

Now selectable end-to-end:
- Gzipped JSON-LD — e.g. Nijmegen `LOD+Beelddocumenten.jsonld.gz`
(`mediaType=application/ld+json` + `compressFormat=application/gzip`).
- Plain JSON-LD — already in the FILTER, now actually processable.
- Zipped JSON-LD when the publisher declares the inner format — i.e.
`encodingFormat=application/ld+json+zip` on the schema.org side, which
the register splits into `mediaType=application/ld+json` +
`compressFormat=application/zip`.
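
For illustration, the split the register performs could be approximated as below. This is not the register's actual code: `splitEncodingFormat` and the suffix table are hypothetical names, and only the `+gzip` and `+zip` suffixes mentioned in this PR are handled.

```typescript
// Suffix-to-compression mapping, limited to the two suffixes discussed here.
const compressionSuffixes: ReadonlyArray<readonly [string, string]> = [
  ['+gzip', 'application/gzip'],
  ['+zip', 'application/zip'],
];

interface SplitFormat {
  mediaType: string;
  compressFormat?: string;
}

function splitEncodingFormat(encodingFormat: string): SplitFormat {
  for (const [suffix, compressFormat] of compressionSuffixes) {
    if (encodingFormat.endsWith(suffix)) {
      // Drop the suffix to recover the inner media type.
      return {
        mediaType: encodingFormat.slice(0, -suffix.length),
        compressFormat,
      };
    }
  }
  // No known suffix: the value is the media type as-is.
  return { mediaType: encodingFormat };
}
```

Note that a bare `application/zip` has no suffix to split off, so it comes back with `mediaType` set to `application/zip` and no `compressFormat`.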

## What it does NOT unlock

The Verhaal van Utrecht (`vvu_verhalen`) distribution mentioned in #284
declares `encodingFormat=application/zip` alone, with the inner format
only in a free-text `description`. The pipeline intentionally rejects
that — without a declared inner mediaType we can't safely process the
archive. The publisher needs to update their schema.org to
`application/ld+json+zip`.