Conversation
jameshadfield
left a comment
There was a problem hiding this comment.
Changes look generally good - left a number of questions and discussion points in threads.
There are no updates to the build-configs/inrb workflow. We should endeavour to keep this up to date. I have some testing files if that would help?
| exclude: "defaults/exclude_accessions.txt" | ||
| clades: "defaults/clades.tsv" | ||
| lat_longs: "defaults/lat_longs.tsv" | ||
| color_ordering: "defaults/color_ordering.tsv" |
There was a problem hiding this comment.
Discussion question prompted by the similarity between these configs.
Did we consider unifying the currently independent workflow invocations (clade-i, hmpv1 etc) into a single workflow parameterised by wildcards? Or has that ship sailed long ago? An alternate direction would be config overlays.
There was a problem hiding this comment.
An alternate direction would be config overlays.
We tried config overlays in the past, but that got reverted in 5019420. So I think that's a no-go.
Did we consider unifying the currently independent workflow invocations (clade-i, hmpv1 etc) into a single workflow parameterised by wildcards? Or has that ship sailed long ago?
I like the simplicity of the single build config, but we are moving towards most pathogens having multi-build wildcard configs. So maybe the ship has not sailed if we think it's worth reworking this config.
With nextstrain run in mind, there would then only be a single phylogenetic workflow and users would have to know the single target for the single clade build OR provide a config overlay that can specify a single clade build.
There was a problem hiding this comment.
Yeah, the nextstrain run interface raises the bar for repos structured like mpox is because each build will need its own snakefile, although the bar's not really that high! (Of course, someone could use one workflow with separate configfiles as I originally did with avian-flu, but I think it's better for our nextstrain repos to all look similar.)
I'm not pushing for this change at the moment, I don't want it to hold up the migration to nextstrain run, but there sure is a lot of duplication across config YAMLs!
Centralizes all config manipulation in rules/config.smk and ensures that the config changes are backwards compatible with older config files. Prints a message to prompt users to update to their own config files. Prompted by various PR comments from @jameshadfield <#321 (comment)> <#321 (comment)> <#321 (comment)>
358d40f to
13f062b
Compare
Centralizes all config manipulation in rules/config.smk and ensures that the config changes are backwards compatible with older config files. Prints a message to prompt users to update to their own config files. Prompted by various PR comments from @jameshadfield <#321 (comment)> <#321 (comment)> <#321 (comment)>
13f062b to
06af354
Compare
| group_by: "--group-by country year" | ||
| sequences_per_group: "--subsample-max-sequences 300" | ||
| other_filters: "--exclude-where outbreak!=hMPXV-1 clade!=IIb" | ||
| non_b1: >- |
There was a problem hiding this comment.
Ah, with adding support for nextstrain run, I'm remembering that this subsample schema does not work well with config overlays because Snakemake merges dictionaries.
Will have to think on how to make this both overridable and extendable...
There was a problem hiding this comment.
One option would be similar to the proposed multi-input config params in nextstrain/zika#80.
The subsample param will be changed to a list and there will be a new additional_subsample param:
subsample:
- name: non_b1
args: >-
--group-by lineage year country
--sequences-per-group 50
...
- name: b1
args: >-
--group-by country year
--subsample-max-sequences 300
...
additional_subsample: []Then config overlays can either completely override the subsample groups by using the subsample config param or define additional subsample groups with additional_subsample.
There was a problem hiding this comment.
This is the exact reason I designed the inputs interface like I did.
For mpox this approach seems nice. My instinct tells me it won't work for a multi-build config (e.g. ncov!). Do you think it's worth solving both now, or running with this approach for the 1-config-1-build style repos and leaving the multi-build-configs until later?
There was a problem hiding this comment.
Another option that Tom proposed is to keep subsample as a dict and clearly document the default subsample groups. Then the user can override the default subsampling groups or completely drop them by nullifying them
subsample:
non_b1: ~
b1: ~
custom_group: >-
--group-by lineage year country
--sequences-per-group 50
...The config parsing in the workflow will just have to know to drop the null subsample groups.
There was a problem hiding this comment.
For mpox this approach seems nice. My instinct tells me it won't work for a multi-build config (e.g. ncov!). Do you think it's worth solving both now, or running with this approach for the 1-config-1-build style repos and leaving the multi-build-configs until later?
Yeah, this could get hairy for multi-build configs...I do think it'd be worth trying to find a solution that works for both. Or at least an approach that doesn't make converting a single build to a multi-build config too difficult.
There was a problem hiding this comment.
The config parsing in the workflow will just have to know to drop the null subsample groups.
A way to nullify config values has been something many of us have been pushing for, so no objections there.
For me the salient question is around the user experience for designing config overlays. If you want to go the (dict + nullification) direction, which is programmatically simpler, then a user has to know the keys in the (default) subsample dict, and we have to have a story for communicating this without saying "read the src". (Aside: this is exactly what I did for the inputs interface - I implemented it as a single dict with ∅ used for nullification - but for the reason I just described I went with the {additional,}inputs design.)
What to do right now for mpox? I'd go for a single dict + nullification. All that's needed to add is the nullification aspect. At the current time you have to read the src to build a config overlay, so that's not a problem at the moment.
There was a problem hiding this comment.
Thanks for the feedback @jameshadfield! I'll go with dict + nullification here.
Surfacing the default subsample dict for the user to nullify could be part of the larger issue for logging config values
There was a problem hiding this comment.
Added support for disabling subsample groups in 18d57d4.
Updated to match latest guidelines from Nextstrain's Snakemake style guide¹: - Always use quoted (:q) interpolations - Use raw, triple-quoted shell blocks - Log standard out and error output to log files and the terminal - Always use the benchmark directive I've left out quoted interpolations that borked the build because the config params need to be converted to lists. I'll clean these up in subsequent commits. ¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>
Allows Snakemake to do the right thing with quoted (:q) interpolations for config params that are passed in as multiple CLI args. Will automatically split string into list for backwards compatibility. Based on similar changes done in zika <nextstrain/zika@11d2644> Since the config function as_list are identical, this is definitely a candidate for consolidating in nextstrain/shared.
Reduce the need for users to supply the CLI flag for a custom script in the workflow. Keeping this change backwards compatible with older configs by automatically stripping the flag from the config param in the workflow.
Follows our guidelines on generic subsampling to use a single config param for the subsample rule.¹ I think this is an improvement over the previous subsample config params: - it is no longer unclear which params require the CLI flag - removes the confusing dependency between `exclude_lineages` and `other_filters` However, this does put the burden on the user to know the augur filter flags and to format them correctly in the configs. I've created examples in the default build configs that uses the YAML folded style² to keep the args readable in the configs. This required an update to the yamlfmt config. ¹ <https://docs.nextstrain.org/en/latest/guides/bioinformatics/filtering-and-subsampling.html#generalizing-subsampling-in-a-workflow> ² <https://yaml.org/spec/1.2.2/#813-folded-style> squash generic subsampling
Most of the default files are already defined in the configs. This commit moves the remaining files to the config. Doing this in preparation for adding support for `nextstrain run`.
Simplify the name of the config files in preparation for support for `nextstrain run`. Also inline with previous discussion on `exclude.txt` in <nextstrain/dengue#26 (comment)>
Updates to match the new config schema
This Snakemake file will only contain helper functions for parsing and validating configs that do not need to be autoformatted with pre-commit/snakefmt.
Centralizes all config manipulation in rules/config.smk and ensures that the config changes are backwards compatible with older config files. Prints a message to prompt users to update to their own config files. Prompted by various PR comments from @jameshadfield <#321 (comment)> <#321 (comment)> <#321 (comment)>
Previously, config overlays were unable to disable default subsampling defined in the `config["subsample"]` dict because Snakemake merges dicts in configs. This change adds support for disabling default subsampling by ignoring subsample groups that have a null value. Users will still have to know the name of the default subsampling groups in order to nullify them in their config overlays. We should be doing a better job of surfacing these subsample names so that users don't have to read the source, but I'm not going to focus on that here as discussed in the PR <#321 (comment)>
06af354 to
18d57d4
Compare
Previously, config overlays were unable to disable default subsampling defined in the `config["subsample"]` dict because Snakemake merges dicts in configs. This change adds support for disabling default subsampling by ignoring subsample groups that have a null value. Users will still have to know the name of the default subsampling groups in order to nullify them in their config overlays. We should be doing a better job of surfacing these subsample names so that users don't have to read the source, but I'm not going to focus on that here as discussed in the PR <#321 (comment)>
18d57d4 to
813a99e
Compare
813a99e to
d59e0d7
Compare
|
Pushed up changes to move filter query into config param in 1021e9c (which was the original purpose of this PR but somehow I forgot about it in all the other changes 🤦♀️ ) If there are no other comments, I'll merge this on Monday. |
Description of proposed changes
Various phylogenetic workflow updates ahead of adding support for
nextstrain run. The main changes are making things more configurable via config files. See commits for details.I think the most controversial change will be f1ed6e9, which switches to the generic subsample config schema. Seeking feedback on whether this is the correct direction.
Related issue(s)
Resolves #273
Checklist