fix(sparql-qlever): shell-escape data filenames in the QLever index command#424
Merged
Merged
Conversation
…ommand The importer interpolated the downloaded data filename into a shell command inside single quotes (`cat '<name>'`). A filename containing a single quote – e.g. a dataset titled ’s-Hertogenbosch, whose distribution URL maps to a local file like `...Erfgoed+'s-Hertogenbosch.nt` – terminated the quoting, so `cat`/`gunzip` read a non-existent path and fed `qlever-index` empty input. QLever then “succeeded” with 0 triples, so the import was treated as failed and every distribution fell through – including the JSON-LD fallback, whose preprocessed output keeps the apostrophe and fails identically after hours of Node-side conversion – leaving the whole dataset skipped. Add a POSIX `shellQuote` helper (wrap in single quotes, escape embedded single quotes as '\''), shell-agnostic so it is correct for the in-container shell the Docker task runner uses, and apply it to the data filename and metadata file. Includes a regression test that runs the generated decompress command through `sh` and verifies it reads the real file.
37daa91 to
c99c6bb
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Importer.index()builds the QLever index command by interpolating the local data filename into a shell pipeline inside single quotes:`(gunzip -c '${localName}' 2>/dev/null || cat '${localName}') | qlever-index … && cat ${metadataFile}`If
localNamecontains a single quote, the quoting breaks. This is not hypothetical: a Dataset Register entry titled “Beeldbank Erfgoed ’s-Hertogenbosch” has distribution URLs whose%27decodes to an apostrophe, so the downloaded file is…Erfgoed+'s-Hertogenbosch.nt. The apostrophe terminates the single-quoted argument,cat/gunzipreceive a mangled, non-existent path, andqlever-indexis fed empty input.The result: QLever reports 0 triples, the import is marked failed, and the importer falls through every distribution (they all share the apostrophe filename) – including the JSON-LD fallback, whose
.preprocessed.nqalso keeps the apostrophe and fails the same way after a long Node-side conversion. Net effect observed in production: a dataset whose native.ntindexes to 1.24 M triples in seconds instead burned ~3 hours and was skipped entirely.It is also a latent shell-injection vector: a crafted distribution URL mapping to a filename with shell metacharacters would inject into
taskRunner.run.Fix
Add a small POSIX
shellQuotehelper – wrap in single quotes and escape embedded single quotes as'\''– and apply it to the data filename (both theunzipandgunzip/catbranches) and the metadata file.POSIX single-quote semantics are identical across every POSIX shell, so this is correct regardless of which shell runs the command – important here because the Docker task runner executes it inside the QLever container, not the host shell. (For that reason a host-shell-detecting library such as
shescapewould be the wrong tool; this needs no dependency.)Test
test/importer.test.tsadds a regression test that:shand verifies it reads the real file’s contents.The test fails on the previous code and passes with the fix. Full package suite: 37/37 pass (incl. the Docker-based tests); coverage thresholds auto-updated upward.
Follow-up (not in this PR)
Separately, JSON-LD preprocessing is impractically slow for large dumps (~100 MB JSON-LD took hours of single-threaded
jsonldparsing). Worth tracking on its own – but with this fix the native formats index directly and the JSON-LD path is rarely reached.