Shell quote user-provided subsampling options #885
The monkeypatch I was thinking of turns out to be reducible to a short one-liner at the top of the Snakefile:
The (private, undocumented) Doing this would mean all bare Snakemake interpolations like |
I finally got some time to test this using the ncov-tutorial builds as an example of something that should work. I found that a filter query like:
--query "(custom_data == 'yes')"
gets converted to:
--query '(custom_data == '"'"'yes'"'"')'
This result makes sense as the outer quotes get converted from double to single and the original single quotes need to be escaped by double quotes. Similarly, a query like this:
--query "(custom_data != 'yes') & (country == 'USA')"
gets converted to:
--query '(custom_data != '"'"'yes'"'"') & (country == '"'"'USA'"'"')'
These queries all worked as expected when they passed through to Augur.
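The re-quoted forms above match what Python's `shlex.quote` produces for the same values. A minimal sketch, assuming the workflow's quoting behaves like `shlex.quote` (the variable names are only for illustration):

```python
import shlex

# The two query values from the examples above, as the shell sees them
# once the outer double quotes are removed.
simple_query = "(custom_data == 'yes')"
compound_query = "(custom_data != 'yes') & (country == 'USA')"

# shlex.quote wraps the value in single quotes and rewrites each embedded
# single quote as '"'"' (close quote, double-quoted quote, reopen quote).
quoted_simple = shlex.quote(simple_query)
quoted_compound = shlex.quote(compound_query)

print("--query", quoted_simple)
print("--query", quoted_compound)

# Splitting the quoted form with POSIX rules recovers the original value.
assert shlex.split(quoted_simple) == [simple_query]
assert shlex.split(quoted_compound) == [compound_query]
```

The round trip through `shlex.split` is a quick way to check that quoting changed only the representation, not the value Augur receives.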
I also inspected the filter commands for the Nextstrain GISAID builds which use other subsampling parameters like the min-date, priorities, include, exclude-where etc. These commands all look correctly rendered.
After we resolve conflicts with the current workflow, we could merge this. I'm not excited about the idea of monkey-patching using an undocumented, private Snakemake variable, though. Switching to consistent use of `:q` seems reasonable.
Resolved the simple merge conflict with a rebase and force-push.
Thanks! Merged.
Fair enough. Given the lackluster appetite for my previous attempt (context), I'm not excited by the idea of trying to make using |
Err...I forgot that style guide existed! 🤦🏻 Seems good to merge with the style guide in the wiki (that probably no one else knows about or looks at?)?
Ha, well, it took me a bit to find the right repo name since I was blind typing in URLs based on vague recollections. Amazed the repo is not archived given it went nowhere... but I would be keen to revive it (maybe under new management of the Nextstrain org on GitHub) if we think there's more appetite for such a thing nowadays. I'd want a Nextstrain style guide to be public, so I would prefer a repo over our internal wiki. Instead of a separate repo, though, it could be a section maintained in the docs repo.
Got it. I like the separate section of the docs repo, especially given recent discussion about moving more dev docs there.
Description of proposed changes
Many (most?) of the subsampling config a user can specify takes the form
of command-line options to `augur filter`. Previously the values were
passed through unmodified, leaving shell features like variable
interpolation and command substitution as tripping hazards for the user,
who would have to know to escape those in their YAML build config. This
change preserves the shell-like semantics of single and double quotes
and backslashes in subsampling values, but renders the tripping hazards
inert by re-quoting each word in the value.
This assumes that we don't need to support those shell features within
subsampling config values (e.g. a user subsampling config doesn't need
to refer to an environment variable), or at least that their cost in
complexity for all users outweighs their benefit to a few users.
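The re-quoting described above can be sketched in a few lines of Python. This is an illustration under assumptions, not necessarily the code in this pull request: `requote_words` is a hypothetical helper name, and the sample config value is made up.

```python
import shlex


def requote_words(value: str) -> str:
    """Split a config value into shell words, then quote each word.

    Hypothetical helper. shlex.split honors single quotes, double quotes,
    and backslashes, so word boundaries are preserved; shlex.quote then
    makes characters like $ and ` literal.
    """
    return " ".join(shlex.quote(word) for word in shlex.split(value))


# A made-up subsampling value that refers to an environment variable.
risky = '--exclude-where "division=$HOME" --group-by country'
safe = requote_words(risky)
print(safe)

# The word boundaries are unchanged by re-quoting...
assert shlex.split(safe) == shlex.split(risky)
# ...but the variable reference now sits inside single quotes, where a
# POSIX shell will not expand it.
assert "'division=$HOME'" in safe
```

Words that contain only safe characters, such as `--group-by`, are left unquoted by `shlex.quote`, so the rendered command stays readable.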
There are many other places in the workflow which blindly interpolate,
without quoting, user-provided values into shell commands. This change
demonstrates a pattern we can use to handle them more robustly if we
choose.
Relatedly, we should consider a) explicitly using Snakemake's built in
{foo:q} interpolation syntax by default for single-word values, or b)
coercing Snakemake to always apply quoting unless asked otherwise with
{foo:u} (currently requires a monkeypatch).
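To illustrate what a `:q` suffix changes, here is a small emulation built on Python's `string.Formatter`. It is only a sketch: `QuotingFormatter` is a hypothetical stand-in and not Snakemake's own code, and the file name is made up.

```python
import shlex
import string


class QuotingFormatter(string.Formatter):
    """Hypothetical stand-in that shell-quotes fields whose spec ends in q."""

    def format_field(self, value, format_spec):
        if format_spec.endswith("q"):
            # Format with the rest of the spec, then quote the result.
            plain = super().format_field(value, format_spec[:-1])
            return shlex.quote(plain)
        return super().format_field(value, format_spec)


fmt = QuotingFormatter()
metadata = "my data.tsv"  # made-up path containing a space

bare = fmt.format("augur filter --metadata {path}", path=metadata)
quoted = fmt.format("augur filter --metadata {path:q}", path=metadata)

print(bare)    # the shell would see two words: my and data.tsv
print(quoted)  # the shell would see one word
```

Option (a) amounts to writing the second form everywhere by hand; option (b) would make the second behavior the default for bare `{path}` fields.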
Related issue(s)
Motivated by discussion in Slack.
Testing
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
If this pull request introduces new features, complete the following steps:
Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.