Skip to content

fix(sparql-qlever): shell-escape data filenames in the QLever index command#424

Merged
ddeboer merged 1 commit into
mainfrom
fix/qlever-importer-shell-quote-filenames
May 29, 2026
Merged

fix(sparql-qlever): shell-escape data filenames in the QLever index command#424
ddeboer merged 1 commit into
mainfrom
fix/qlever-importer-shell-quote-filenames

Conversation

@ddeboer
Copy link
Copy Markdown
Member

@ddeboer ddeboer commented May 29, 2026

Problem

Importer.index() builds the QLever index command by interpolating the local data filename into a shell pipeline inside single quotes:

`(gunzip -c '${localName}' 2>/dev/null || cat '${localName}') | qlever-index … && cat ${metadataFile}`

If localName contains a single quote, the quoting breaks. This is not hypothetical: a Dataset Register entry titled “Beeldbank Erfgoed ’s-Hertogenbosch” has distribution URLs whose %27 decodes to an apostrophe, so the downloaded file is …Erfgoed+'s-Hertogenbosch.nt. The apostrophe terminates the single-quoted argument, cat/gunzip receive a mangled, non-existent path, and qlever-index is fed empty input.

The result: QLever reports 0 triples, the import is marked failed, and the importer falls through every distribution (they all share the apostrophe filename) – including the JSON-LD fallback, whose .preprocessed.nq also keeps the apostrophe and fails the same way after a long Node-side conversion. Net effect observed in production: a dataset whose native .nt indexes to 1.24 M triples in seconds instead burned ~3 hours and was skipped entirely.

It is also a latent shell-injection vector: a crafted distribution URL mapping to a filename with shell metacharacters would inject into taskRunner.run.

Fix

Add a small POSIX shellQuote helper – wrap in single quotes and escape embedded single quotes as '\'' – and apply it to the data filename (both the unzip and gunzip/cat branches) and the metadata file.

POSIX single-quote semantics are identical across every POSIX shell, so this is correct regardless of which shell runs the command – important here because the Docker task runner executes it inside the QLever container, not the host shell. (For that reason a host-shell-detecting library such as shescape would be the wrong tool; this needs no dependency.)

Test

test/importer.test.ts adds a regression test that:

  • feeds a distribution whose local filename contains an apostrophe,
  • asserts the generated command POSIX-escapes it (and not the broken unescaped form), and
  • runs the generated decompress sub-command through sh and verifies it reads the real file’s contents.

The test fails on the previous code and passes with the fix. Full package suite: 37/37 pass (incl. the Docker-based tests); coverage thresholds auto-updated upward.

Follow-up (not in this PR)

Separately, JSON-LD preprocessing is impractically slow for large dumps (~100 MB JSON-LD took hours of single-threaded jsonld parsing). Worth tracking on its own – but with this fix the native formats index directly and the JSON-LD path is rarely reached.

…ommand

The importer interpolated the downloaded data filename into a shell command
inside single quotes (`cat '<name>'`). A filename containing a single quote –
e.g. a dataset titled ’s-Hertogenbosch, whose distribution URL maps to a local
file like `...Erfgoed+'s-Hertogenbosch.nt` – terminated the quoting, so
`cat`/`gunzip` read a non-existent path and fed `qlever-index` empty input.

QLever then “succeeded” with 0 triples, so the import was treated as failed and
every distribution fell through – including the JSON-LD fallback, whose
preprocessed output keeps the apostrophe and fails identically after hours of
Node-side conversion – leaving the whole dataset skipped.

Add a POSIX `shellQuote` helper (wrap in single quotes, escape embedded single
quotes as '\''), shell-agnostic so it is correct for the in-container shell the
Docker task runner uses, and apply it to the data filename and metadata file.
Includes a regression test that runs the generated decompress command through
`sh` and verifies it reads the real file.
@ddeboer ddeboer force-pushed the fix/qlever-importer-shell-quote-filenames branch from 37daa91 to c99c6bb Compare May 29, 2026 17:56
@ddeboer ddeboer enabled auto-merge (squash) May 29, 2026 17:57
@ddeboer ddeboer merged commit f69e623 into main May 29, 2026
2 checks passed
@ddeboer ddeboer deleted the fix/qlever-importer-shell-quote-filenames branch May 29, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant