Skip to content

phylogenetic workflow updates#321

Merged
joverlee521 merged 14 commits intomasterfrom
phylo-updates
Jul 7, 2025
Merged

phylogenetic workflow updates#321
joverlee521 merged 14 commits intomasterfrom
phylo-updates

Conversation

@joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Jun 25, 2025

Description of proposed changes

Various phylogenetic workflow updates ahead of adding support for nextstrain run. The main changes are making things more configurable via config files. See commits for details.

I think the most controversial change will be f1ed6e9, which switches to the generic subsample config schema. Seeking feedback on whether this is the correct direction.

Related issue(s)

Resolves #273

Checklist

  • Checks pass
  • Update CHANGELOG

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look generally good - left a number of questions and discussion points in threads.

There are no updates to the build-configs/inrb workflow. We should endeavour to keep this up to date. I have some testing files if that would help?

exclude: "defaults/exclude_accessions.txt"
clades: "defaults/clades.tsv"
lat_longs: "defaults/lat_longs.tsv"
color_ordering: "defaults/color_ordering.tsv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion question prompted by the similarity between these configs.

Did we consider unifying the currently independent workflow invocations (clade-i, hmpv1 etc) into a single workflow parameterised by wildcards? Or has that ship sailed long ago? An alternate direction would be config overlays.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternate direction would be config overlays.

We tried config overlays in the past, but that got reverted in 5019420. So I think that's a no-go.

Did we consider unifying the currently independent workflow invocations (clade-i, hmpv1 etc) into a single workflow parameterised by wildcards? Or has that ship sailed long ago?

I like the simplicity of the single build config, but we are moving towards most pathogens having multi-build wildcard configs. So maybe the ship has not sailed if we think it's worth reworking this config.

With nextstrain run in mind, there would then only be a single phylogenetic workflow and users would have to know the single target for the single clade build OR provide a config overlay that can specify a single clade build.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the nextstrain run interface raises the bar for repos structured like mpox is because each build will need its own snakefile, although the bar's not really that high! (Of course, someone could use one workflow with separate configfiles as I originally did with avian-flu, but I think it's better for our nextstrain repos to all look similar.)

I'm not pushing for this change at the moment, I don't want it to hold up the migration to nextstrain run, but there sure is a lot of duplication across config YAMLs!

Base automatically changed from update-metadata-columns to master June 26, 2025 16:42
joverlee521 added a commit that referenced this pull request Jul 1, 2025
Centralizes all config manipulation in rules/config.smk and ensures that the 
config changes are backwards compatible with older config files. Prints a 
message to prompt users to update to their own config files. 

Prompted by various PR comments from @jameshadfield
<#321 (comment)>
<#321 (comment)>
<#321 (comment)>
@joverlee521 joverlee521 changed the base branch from master to james/node-name-accession July 1, 2025 21:43
Base automatically changed from james/node-name-accession to master July 1, 2025 21:47
joverlee521 added a commit that referenced this pull request Jul 1, 2025
Centralizes all config manipulation in rules/config.smk and ensures that the 
config changes are backwards compatible with older config files. Prints a 
message to prompt users to update to their own config files. 

Prompted by various PR comments from @jameshadfield
<#321 (comment)>
<#321 (comment)>
<#321 (comment)>
@joverlee521 joverlee521 marked this pull request as ready for review July 1, 2025 21:52
group_by: "--group-by country year"
sequences_per_group: "--subsample-max-sequences 300"
other_filters: "--exclude-where outbreak!=hMPXV-1 clade!=IIb"
non_b1: >-
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, with adding support for nextstrain run, I'm remembering that this subsample schema does not work well with config overlays because Snakemake merges dictionaries.

Will have to think on how to make this both overridable and extendable...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option would be similar to the proposed multi-input config params in nextstrain/zika#80.
The subsample param will be changed to a list and there will be a new additional_subsample param:

subsample:
  - name: non_b1
    args: >-
      --group-by lineage year country
      --sequences-per-group 50
      ...
  - name: b1
    args: >-
      --group-by country year
      --subsample-max-sequences 300
      ...

additional_subsample: []

Then config overlays can either completely override the subsample groups by using the subsample config param or define additional subsample groups with additional_subsample.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the exact reason I designed the inputs interface like I did.

For mpox this approach seems nice. My instinct tells me it won't work for a multi-build config (e.g. ncov!). Do you think it's worth solving both now, or running with this approach for the 1-config-1-build style repos and leaving the multi-build-configs until later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option that Tom proposed is to keep subsample as a dict and clearly document the default subsample groups. Then the user can override the default subsampling groups or completely drop them by nullifying them

subsample:
  non_b1: ~
  b1: ~
  custom_group: >-
    --group-by lineage year country
    --sequences-per-group 50
    ...

The config parsing in the workflow will just have to know to drop the null subsample groups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mpox this approach seems nice. My instinct tells me it won't work for a multi-build config (e.g. ncov!). Do you think it's worth solving both now, or running with this approach for the 1-config-1-build style repos and leaving the multi-build-configs until later?

Yeah, this could get hairy for multi-build configs...I do think it'd be worth trying to find a solution that works for both. Or at least an approach that doesn't make converting a single build to a multi-build config too difficult.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config parsing in the workflow will just have to know to drop the null subsample groups.

A way to nullify config values has been something many of us have been pushing for, so no objections there.

For me the salient question is around the user experience for designing config overlays. If you want to go the (dict + nullification) direction, which is programmatically simpler, then a user has to know the keys in the (default) subsample dict, and we have to have a story for communicating this without saying "read the src". (Aside: this is exactly what I did for the inputs interface - I implemented it as a single dict with used for nullification - but for the reason I just described I went with the {additional,}inputs design.)

What to do right now for mpox? I'd go for a single dict + nullification. All that's needed to add is the nullification aspect. At the current time you have to read the src to build a config overlay, so that's not a problem at the moment.

Copy link
Contributor Author

@joverlee521 joverlee521 Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @jameshadfield! I'll go with dict + nullification here.

Surfacing the default subsample dict for the user to nullify could be part of the larger issue for logging config values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support for disabling subsample groups in 18d57d4.

Updated to match latest guidelines from Nextstrain's Snakemake style guide¹:

- Always use quoted (:q) interpolations
- Use raw, triple-quoted shell blocks
- Log standard out and error output to log files and the terminal
- Always use the benchmark directive

I've left out quoted interpolations that borked the build because the config 
params need to be converted to lists. I'll clean these up in subsequent commits.

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>
Allows Snakemake to do the right thing with quoted (:q) interpolations 
for config params that are passed in as multiple CLI args. Will automatically 
split string into list for backwards compatibility.

Based on similar changes done in zika
<nextstrain/zika@11d2644>

Since the config function as_list are identical, this is definitely a candidate
for consolidating in nextstrain/shared.
Reduce the need for users to supply the CLI flag for a custom script in the 
workflow. Keeping this change backwards compatible with older configs by 
automatically stripping the flag from the config param in the workflow.
Follows our guidelines on generic subsampling to use a single config param 
for the subsample rule.¹ I think this is an improvement over the previous 
subsample config params:
- it is no longer unclear which params require the CLI flag 
- removes the confusing dependency between `exclude_lineages` and `other_filters`

However, this does put the burden on the user to know  the augur filter flags 
and to format them correctly in the configs. I've created examples in the 
default build configs that uses the YAML folded style² to keep the args 
readable in the configs. This required an update to the yamlfmt config.

¹ <https://docs.nextstrain.org/en/latest/guides/bioinformatics/filtering-and-subsampling.html#generalizing-subsampling-in-a-workflow>
² <https://yaml.org/spec/1.2.2/#813-folded-style>

squash generic subsampling
Most of the default files are already defined in the configs. This commit 
moves the remaining files to the config.

Doing this in preparation for adding support for `nextstrain run`.
Simplify the name of the config files in preparation for support for 
`nextstrain run`. 

Also inline with previous discussion on `exclude.txt` in 
<nextstrain/dengue#26 (comment)>
Updates to match the new config schema
This Snakemake file will only contain helper functions for parsing and 
validating configs that do not need to be autoformatted with pre-commit/snakefmt.
Centralizes all config manipulation in rules/config.smk and ensures that the 
config changes are backwards compatible with older config files. Prints a 
message to prompt users to update to their own config files. 

Prompted by various PR comments from @jameshadfield
<#321 (comment)>
<#321 (comment)>
<#321 (comment)>
joverlee521 added a commit that referenced this pull request Jul 2, 2025
Previously, config overlays were unable to disable default subsampling defined 
in the `config["subsample"]` dict because Snakemake merges dicts in configs.
This change adds support for disabling default subsampling by ignoring subsample 
groups that have a null value. 

Users will still have to know the name of the default subsampling groups in 
order to nullify them in their config overlays. We should be doing a better 
job of surfacing these subsample names so that users don't have to read the 
source, but I'm not going to focus on that here as discussed in the PR
<#321 (comment)>
Previously, config overlays were unable to disable default subsampling defined 
in the `config["subsample"]` dict because Snakemake merges dicts in configs.
This change adds support for disabling default subsampling by ignoring subsample 
groups that have a null value. 

Users will still have to know the name of the default subsampling groups in 
order to nullify them in their config overlays. We should be doing a better 
job of surfacing these subsample names so that users don't have to read the 
source, but I'm not going to focus on that here as discussed in the PR
<#321 (comment)>
@joverlee521
Copy link
Contributor Author

Pushed up changes to move filter query into config param in 1021e9c (which was the original purpose of this PR but somehow I forgot about it in all the other changes 🤦‍♀️ )

If there are no other comments, I'll merge this on Monday.

@joverlee521 joverlee521 merged commit 4da6e51 into master Jul 7, 2025
5 checks passed
@joverlee521 joverlee521 deleted the phylo-updates branch July 7, 2025 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Move hardcode filter query into configuration files

3 participants