Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shell quote user-provided subsampling options #885

Merged
merged 1 commit into from May 3, 2022
Merged

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Mar 11, 2022

Description of proposed changes

Many (most?) of the subsampling config a user can specify takes the form
of command-line options to augur filter. Previously the values were
passed through unmodified, leaving shell features like variable
interpolation and command substitution as tripping hazards for the user,
who would have to know to escape those in their YAML build config. This
change preserves the shell-like semantics of single and double quotes
and backslashes in subsampling values, but renders the tripping hazards
inert by re-quoting each word in the value.

This assumes that we don't need to support those shell features within
subsampling config values (e.g. a user subsampling config doesn't need
to refer to an environment variable), or at least that their cost in
complexity for all users outweighs their benefit to a few users.

There are many other places in the workflow which blindly interpolate,
without quoting, user-provided values into shell commands. This change
demonstrates a pattern we can use to handle them more robustly if we
choose.

Relatedly, we should consider a) explicitly using Snakemake's built in
{foo:q} interpolation syntax by default for single-word values, or b)
coercing Snakemake to always apply quoting unless asked otherwise with
{foo:u} (currently requires a monkeypatch).

Related issue(s)

Motivated by discussion in Slack.

Testing

  • Test this doesn't break existing configs

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@tsibley
Copy link
Member Author

tsibley commented Mar 11, 2022

b) coercing Snakemake to always apply quoting unless asked otherwise with
{foo:u} (currently requires a monkeypatch).

The monkeypatch I was thinking of turns out to be reducible a short one-liner at the top of the Snakefile:

snakemake.utils.format.__kwdefaults__["_quote_all"] = True

The (private, undocumented) _quote_all keyword arg exists since Snakemake 3.8.0 (just like support for {foo:q} itself).

Doing this would mean all bare Snakemake interpolations like {foo} will be treated as {foo:q} and explicitly unquoted interpolations can be requested with {foo:u}. (This kind of opt-out behaviour matches very broad consensus on best practice for quoting.)

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally got some time to test this using the ncov-tutorial builds as an example of something that should work. I found that a filter query like:

--query "(custom_data == 'yes')"

gets converted to:

--query '(custom_data == '"'"'yes'"'"')'

This result makes sense as the outer quotes get converted from double to single and the original single quotes need to be escaped by double quotes. Similarly, a query like this:

--query "(custom_data != 'yes') & (country == 'USA')"

gets converted to:

--query '(custom_data != '"'"'yes'"'"') & (country == '"'"'USA'"'"')'

These queries all worked as expected when they passed through to Augur.

I also inspected the filter commands for the Nextstrain GISAID builds which use other subsampling parameters like the min-date, priorities, include, exclude-where etc. These commands all look correctly rendered.

After we resolve conflicts with the current workflow, we could merge this. I'm not excited about the idea of monkey-patching using an undocumented, private Snakemake variable, though. Switching to consistent use of :q seems reasonable.

Many (most?) of the subsampling config a user can specify takes the form
of command-line options to `augur filter`.  Previously the values were
passed through unmodified, leaving shell features like variable
interpolation and command substitution as tripping hazards for the user,
who would have to know to escape those in their YAML build config.  This
change preserves the shell-like semantics of single and double quotes
and backslashes in subsampling values, but renders the tripping hazards
inert by re-quoting each word in the value.

This assumes that we don't need to support those shell features within
subsampling config values (e.g. a user subsampling config doesn't need
to refer to an environment variable), or at least that their cost in
complexity for all users outweighs their benefit to a few users.

There are many other places in the workflow which blindly interpolate,
without quoting, user-provided values into shell commands.  This change
demonstrates a pattern we can use to handle them more robustly if we
choose.

Relatedly, we should consider a) explicitly using Snakemake's built in
{foo:q} interpolation syntax by default for single-word values, or b)
coercing Snakemake to always apply quoting unless asked otherwise with
{foo:u} (currently requires a monkeypatch).
@huddlej
Copy link
Contributor

huddlej commented May 3, 2022

Resolved the simple merge conflict with a rebase and force-push.

@tsibley tsibley merged commit 6ff9410 into master May 3, 2022
@tsibley tsibley deleted the trs/shell-quoting branch May 3, 2022 21:54
@tsibley
Copy link
Member Author

tsibley commented May 3, 2022

Thanks! Merged.

I'm not excited about the idea of monkey-patching using an undocumented, private Snakemake variable, though. Switching to consistent use of :q seems reasonable.

Fair enough. Given the lackluster appetite for my previous attempt (context), I'm not excited by the idea of trying to make using :q consistent given all the contributors, but it's fine to try again. Could write a Snakemake linter… 😶 (one more discerning than snakemake --lint)

@huddlej
Copy link
Contributor

huddlej commented May 3, 2022

Err...I forgot that style guide existed! 🤦🏻 Seems good to merge with the style guide in the wiki (that probably no one else knows about or looks at?)?

@tsibley
Copy link
Member Author

tsibley commented May 3, 2022

Ha, well, it took me a bit to find the right repo name since I was blind typing in URLs based on vague recollections. Amazed the repo is not archived given it went nowhere... but I would be keen to revive it (maybe under new management of the Nextstrain org on GitHub) if we think there's more appetite for such a thing nowadays. I'd want a Nextstrain style guide to be public, so would prefer a repo over our internal wiki. Instead of a separate repo though, could be a section maintained in the docs repo.

@huddlej
Copy link
Contributor

huddlej commented May 3, 2022

Got it. I like the separate section of the docs repo, especially given recent discussion about moving more dev docs there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants