Skip to content

Commit

Permalink
Shell quote user-provided subsampling options
Browse files Browse the repository at this point in the history
Many (most?) of the subsampling config a user can specify takes the form
of command-line options to `augur filter`.  Previously the values were
passed through unmodified, leaving shell features like variable
interpolation and command substitution as tripping hazards for the user,
who would have to know to escape those in their YAML build config.  This
change preserves the shell-like semantics of single and double quotes
and backslashes in subsampling values, but renders the tripping hazards
inert by re-quoting each word in the value.

This assumes that we don't need to support those shell features within
subsampling config values (e.g. a user subsampling config doesn't need
to refer to an environment variable), or at least that their cost in
complexity for all users outweighs their benefit to a few users.

There are many other places in the workflow which blindly interpolate,
without quoting, user-provided values into shell commands.  This change
demonstrates a pattern we can use to handle them more robustly if we
choose.

Relatedly, we should consider a) explicitly using Snakemake's built in
{foo:q} interpolation syntax by default for single-word values, or b)
coercing Snakemake to always apply quoting unless asked otherwise with
{foo:u} (currently requires a monkeypatch).
  • Loading branch information
tsibley committed Mar 11, 2022
1 parent bf3c344 commit eedaf4a
Show file tree
Hide file tree
Showing 2 changed files with 22 additions and 9 deletions.
23 changes: 18 additions & 5 deletions workflow/snakemake_rules/common.smk
@@ -1,8 +1,24 @@
"""Small, shared functions used to generate inputs and parameters.
"""
import datetime
from shlex import (
quote as shquote, # shquote() is used in this file and also other workflow files
split as shsplitwords,
)
from urllib.parse import urlsplit

def shquotewords(s: str) -> str:
"""
Split string *s* into (POSIX) shell words, quote each word, and join them
back into a string.
This is suitable for properly quoting multi-word, user-defined values which
should follow shell quoting and escaping semantics (e.g. to allow spaces in
single words) but not allow shell features like variable interpolation,
command substition, redirection, piping, etc.
"""
return " ".join(shquote(word) for word in shsplitwords(s))

def numeric_date(dt=None):
"""
Convert datetime object to the numeric date.
Expand Down Expand Up @@ -73,15 +89,12 @@ def _get_filter_min_length_query(wildcards):
# as recommended by pandas:
#
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
#
# We escape the backticks with backslashes to prevent the bash shell
# from expanding the contents between the backticks as a subprocess.
query_parts.append(f"(\`{input_name}\` == 'yes' & _length >= {min_length})")
query_parts.append(f"(`{input_name}` == 'yes' & _length >= {min_length})")
else:
query_parts.append(f"(_length >= {min_length})")

query = " | ".join(query_parts)
return f"--query \"{query}\""
return f"--query {shquote(query)}"

def _get_filter_value(wildcards, key):
for input_name in config["inputs"].keys():
Expand Down
8 changes: 4 additions & 4 deletions workflow/snakemake_rules/main_workflow.smk
Expand Up @@ -160,9 +160,9 @@ def get_priority_argument(wildcards):
return ""

if subsampling_settings["priorities"]["type"] == "proximity":
return "--priority " + get_priorities(wildcards)
return "--priority " + shquote(get_priorities(wildcards))
elif subsampling_settings["priorities"]["type"] == "file" and "file" in subsampling_settings["priorities"]:
return "--priority " + subsampling_settings["priorities"]["file"]
return "--priority " + shquote(subsampling_settings["priorities"]["file"])
else:
return ""

Expand Down Expand Up @@ -199,15 +199,15 @@ def _get_specific_subsampling_setting(setting, optional=False):
elif setting == 'max_sequences':
value = f"--subsample-max-sequences {value}"

return value
return shquotewords(value)
else:
value = ""

# Check format strings that haven't been resolved.
if re.search(r'\{.+\}', value):
raise Exception(f"The parameters for the subsampling scheme '{wildcards.subsample}' of build '{wildcards.build_name}' reference build attributes that are not defined in the configuration file: '{value}'. Add these build attributes to the appropriate configuration file and try again.")

return value
return shquotewords(value)

return _get_setting

Expand Down

0 comments on commit eedaf4a

Please sign in to comment.