phylogenetic workflow updates by joverlee521 · Pull Request #321 · nextstrain/mpox

joverlee521 · 2025-06-25T20:40:11Z

Description of proposed changes

Various phylogenetic workflow updates ahead of adding support for nextstrain run. The main changes are making things more configurable via config files. See commits for details.

I think the most controversial change will be f1ed6e9, which switches to the generic subsample config schema. Seeking feedback on whether this is the correct direction.

Related issue(s)

Resolves #273

Checklist

Checks pass
Update CHANGELOG

jameshadfield

Changes look generally good - left a number of questions and discussion points in threads.

There are no updates to the build-configs/inrb workflow. We should endeavour to keep this up to date. I have some testing files if that would help?

phylogenetic/rules/annotate_phylogeny.smk

phylogenetic/rules/construct_phylogeny.smk

jameshadfield · 2025-06-25T22:14:55Z

phylogenetic/defaults/clade-i/config.yaml

+exclude: "defaults/exclude_accessions.txt"
 clades: "defaults/clades.tsv"
 lat_longs: "defaults/lat_longs.tsv"
+color_ordering: "defaults/color_ordering.tsv"


Discussion question prompted by the similarity between these configs.

Did we consider unifying the currently independent workflow invocations (clade-i, hmpxv1 etc) into a single workflow parameterised by wildcards? Or has that ship sailed long ago? An alternate direction would be config overlays.

> An alternate direction would be config overlays.

We tried config overlays in the past, but that got reverted in 5019420. So I think that's a no-go.

> Did we consider unifying the currently independent workflow invocations (clade-i, hmpxv1 etc) into a single workflow parameterised by wildcards? Or has that ship sailed long ago?

I like the simplicity of the single build config, but we are moving towards most pathogens having multi-build wildcard configs. So maybe the ship has not sailed if we think it's worth reworking this config.

With nextstrain run in mind, there would then only be a single phylogenetic workflow and users would have to know the single target for the single clade build OR provide a config overlay that can specify a single clade build.

Yeah, the nextstrain run interface raises the bar for repos structured the way mpox is, because each build will need its own Snakefile, although the bar's not really that high! (Of course, someone could use one workflow with separate configfiles as I originally did with avian-flu, but I think it's better for our nextstrain repos to all look similar.)

I'm not pushing for this change at the moment, I don't want it to hold up the migration to nextstrain run, but there sure is a lot of duplication across config YAMLs!

joverlee521 · 2025-07-01T22:55:43Z

phylogenetic/defaults/hmpxv1/config.yaml

-    group_by: "--group-by country year"
-    sequences_per_group: "--subsample-max-sequences 300"
-    other_filters: "--exclude-where outbreak!=hMPXV-1 clade!=IIb"
+  non_b1: >-


Ah, with adding support for nextstrain run, I'm remembering that this subsample schema does not work well with config overlays because Snakemake merges dictionaries.

Will have to think on how to make this both overridable and extendable...
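To make the merging problem concrete, here is a minimal sketch of a recursive dict merge. It is a simplified stand-in for how Snakemake combines config files, not Snakemake's own code, and the `custom_group` name and its args are hypothetical. An overlay can add or replace keys in `subsample`, but it has no way to remove the default groups:

```python
from copy import deepcopy


def deep_merge(base, overlay):
    """Recursively merge `overlay` into a copy of `base`.

    Nested dicts are merged key by key; any other value in the overlay
    replaces the value in the base. (Simplified illustration only.)
    """
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Default config with two subsample groups (args shortened for illustration).
defaults = {
    "subsample": {
        "non_b1": "--group-by lineage year country --sequences-per-group 50",
        "b1": "--group-by country year --subsample-max-sequences 300",
    }
}

# Hypothetical overlay that only wants its own group.
overlay = {"subsample": {"custom_group": "--group-by country --sequences-per-group 10"}}

merged = deep_merge(defaults, overlay)
print(sorted(merged["subsample"]))  # ['b1', 'custom_group', 'non_b1']
```

The default groups survive the merge, which is why the schema needs an explicit way to override or disable them.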

One option would be similar to the proposed multi-input config params in nextstrain/zika#80.
The subsample param will be changed to a list and there will be a new additional_subsample param:

    subsample:
      - name: non_b1
        args: >-
          --group-by lineage year country
          --sequences-per-group 50
          ...
      - name: b1
        args: >-
          --group-by country year
          --subsample-max-sequences 300
          ...
    additional_subsample: []

Then config overlays can either completely override the subsample groups by using the subsample config param or define additional subsample groups with additional_subsample.
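A rough sketch of how a workflow might resolve this list-based schema is shown below. The function name `resolve_subsample_groups` and the `custom_group` entry are hypothetical, and the sketch assumes that list-valued config params are replaced wholesale by an overlay rather than merged:

```python
def resolve_subsample_groups(config):
    """Combine `subsample` and `additional_subsample` into a name -> args dict."""
    entries = list(config.get("subsample", [])) + list(
        config.get("additional_subsample", [])
    )
    groups = {}
    for entry in entries:
        name = entry["name"]
        if name in groups:
            # Two groups with the same name would be ambiguous, so fail early.
            raise ValueError(f"Duplicate subsample group name: {name!r}")
        groups[name] = entry["args"]
    return groups


# Defaults (args shortened for illustration).
defaults = {
    "subsample": [
        {"name": "non_b1", "args": "--group-by lineage year country --sequences-per-group 50"},
        {"name": "b1", "args": "--group-by country year --subsample-max-sequences 300"},
    ],
    "additional_subsample": [],
}

# Hypothetical overlay that keeps the defaults and adds one more group.
overlay = {
    "additional_subsample": [
        {"name": "custom_group", "args": "--group-by country --sequences-per-group 10"},
    ]
}

# Top-level replacement is enough here because both params are lists.
config = {**defaults, **overlay}
groups = resolve_subsample_groups(config)
print(list(groups))  # ['non_b1', 'b1', 'custom_group']
```

An overlay that sets `subsample` instead would replace the default groups entirely.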

This is the exact reason I designed the inputs interface like I did.

For mpox this approach seems nice. My instinct tells me it won't work for a multi-build config (e.g. ncov!). Do you think it's worth solving both now, or running with this approach for the 1-config-1-build style repos and leaving the multi-build-configs until later?

Another option that Tom proposed is to keep subsample as a dict and clearly document the default subsample groups. Then the user can override the default subsampling groups or completely drop them by nullifying them:

    subsample:
      non_b1: ~
      b1: ~
      custom_group: >-
        --group-by lineage year country
        --sequences-per-group 50
        ...

The config parsing in the workflow will just have to know to drop the null subsample groups.
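A minimal sketch of that parsing step, assuming the merged config is already loaded into a Python dict (YAML `~` loads as `None`). The helper name `active_subsample_groups` is hypothetical, and the elided args from the example are left out:

```python
def active_subsample_groups(subsample):
    """Return only the subsample groups whose value is not null (None)."""
    return {name: args for name, args in subsample.items() if args is not None}


# Merged config as it might look after an overlay nullifies both defaults.
merged_subsample = {
    "non_b1": None,
    "b1": None,
    "custom_group": "--group-by lineage year country --sequences-per-group 50",
}

groups = active_subsample_groups(merged_subsample)
print(list(groups))  # ['custom_group']
```

The filter checks `is not None` rather than truthiness, so only an explicit null disables a group.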

> For mpox this approach seems nice. My instinct tells me it won't work for a multi-build config (e.g. ncov!). Do you think it's worth solving both now, or running with this approach for the 1-config-1-build style repos and leaving the multi-build-configs until later?

Yeah, this could get hairy for multi-build configs...I do think it'd be worth trying to find a solution that works for both. Or at least an approach that doesn't make converting a single build to a multi-build config too difficult.

> The config parsing in the workflow will just have to know to drop the null subsample groups.

A way to nullify config values has been something many of us have been pushing for, so no objections there.

For me the salient question is around the user experience for designing config overlays. If you want to go the (dict + nullification) direction, which is programmatically simpler, then a user has to know the keys in the (default) subsample dict, and we have to have a story for communicating this without saying "read the src". (Aside: this is exactly what I did for the inputs interface - I implemented it as a single dict with ∅ used for nullification - but for the reason I just described I went with the {additional,}inputs design.)

What to do right now for mpox? I'd go for a single dict + nullification. All that's needed to add is the nullification aspect. At the current time you have to read the src to build a config overlay, so that's not a problem at the moment.

Thanks for the feedback @jameshadfield! I'll go with dict + nullification here.

Surfacing the default subsample dict for the user to nullify could be part of the larger issue for logging config values

Added support for disabling subsample groups in 18d57d4.

Updated to match latest guidelines from Nextstrain's Snakemake style guide¹:

- Always use quoted (:q) interpolations
- Use raw, triple-quoted shell blocks
- Log standard out and error output to log files and the terminal
- Always use the benchmark directive

I've left out quoted interpolations that borked the build because the config params need to be converted to lists. I'll clean these up in subsequent commits.

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>

Allows Snakemake to do the right thing with quoted (:q) interpolations for config params that are passed in as multiple CLI args. Will automatically split a string into a list for backwards compatibility.

Based on similar changes done in zika <nextstrain/zika@11d2644>. Since the `as_list` config functions are identical, this is definitely a candidate for consolidating in nextstrain/shared.
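For illustration, a helper of this kind could look like the sketch below. This is an assumption about its general shape, not the code from the mpox or zika repos, and the choice of `shlex.split` for splitting is mine:

```python
import shlex


def as_list(value):
    """Normalise a config value to a list of CLI arguments.

    Lists pass through as a copy, strings from older configs are split on
    whitespace (respecting quotes), and a missing value becomes an empty list.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return shlex.split(value)
    return list(value)


print(as_list("country year"))       # ['country', 'year']
print(as_list(["country", "year"]))  # ['country', 'year']
```

With a list in hand, each element can be quoted separately when it is interpolated into a shell command.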

Reduce the need for users to supply the CLI flag for a custom script in the workflow. Keeping this change backwards compatible with older configs by automatically stripping the flag from the config param in the workflow.
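One way to keep older configs working is sketched below. The helper name `strip_leading_flag` and the node names are hypothetical, and the actual workflow may handle this differently:

```python
def strip_leading_flag(value, flag="--root"):
    """Remove a leading CLI flag from a config string, if present.

    Older configs included the flag in the value (e.g. "--root NODE_A"),
    newer configs supply only the value (e.g. "NODE_A").
    """
    tokens = value.split()
    if tokens and tokens[0] == flag:
        tokens = tokens[1:]
    return " ".join(tokens)


print(strip_leading_flag("--root NODE_A NODE_B"))  # NODE_A NODE_B
print(strip_leading_flag("NODE_A NODE_B"))         # NODE_A NODE_B
```

Both spellings normalise to the same value, so the rule can add the flag itself.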

Follows our guidelines on generic subsampling to use a single config param for the subsample rule.¹

I think this is an improvement over the previous subsample config params:

- it is no longer unclear which params require the CLI flag
- removes the confusing dependency between `exclude_lineages` and `other_filters`

However, this does put the burden on the user to know the augur filter flags and to format them correctly in the configs. I've created examples in the default build configs that use the YAML folded style² to keep the args readable in the configs. This required an update to the yamlfmt config.

¹ <https://docs.nextstrain.org/en/latest/guides/bioinformatics/filtering-and-subsampling.html#generalizing-subsampling-in-a-workflow>
² <https://yaml.org/spec/1.2.2/#813-folded-style>

Most of the default files are already defined in the configs. This commit moves the remaining files to the config. Doing this in preparation for adding support for `nextstrain run`.

Simplify the name of the config files in preparation for support for `nextstrain run`. Also in line with previous discussion on `exclude.txt` in <nextstrain/dengue#26 (comment)>

Updates to match the new config schema

This Snakemake file will only contain helper functions for parsing and validating configs that do not need to be autoformatted with pre-commit/snakefmt.

@jameshadfield

Centralizes all config manipulation in rules/config.smk and ensures that the config changes are backwards compatible with older config files. Prints a message to prompt users to update their own config files. Prompted by various PR comments from @jameshadfield <#321 (comment)> <#321 (comment)> <#321 (comment)>

Previously, config overlays were unable to disable default subsampling defined in the `config["subsample"]` dict because Snakemake merges dicts in configs. This change adds support for disabling default subsampling by ignoring subsample groups that have a null value. Users will still have to know the name of the default subsampling groups in order to nullify them in their config overlays. We should be doing a better job of surfacing these subsample names so that users don't have to read the source, but I'm not going to focus on that here as discussed in the PR <#321 (comment)>

Resolves #273

joverlee521 · 2025-07-03T20:59:28Z

Pushed up changes to move filter query into config param in 1021e9c (which was the original purpose of this PR but somehow I forgot about it in all the other changes 🤦‍♀️ )

If there are no other comments, I'll merge this on Monday.

.pre-commit-config.yaml

jameshadfield requested changes Jun 25, 2025

Base automatically changed from update-metadata-columns to master June 26, 2025 16:42

joverlee521 force-pushed the phylo-updates branch from 358d40f to 13f062b Compare July 1, 2025 21:42

joverlee521 changed the base branch from master to james/node-name-accession July 1, 2025 21:43

Base automatically changed from james/node-name-accession to master July 1, 2025 21:47

joverlee521 force-pushed the phylo-updates branch from 13f062b to 06af354 Compare July 1, 2025 21:49

joverlee521 marked this pull request as ready for review July 1, 2025 21:52

joverlee521 commented Jul 1, 2025

joverlee521 added 10 commits July 2, 2025 14:14

phylogenetic: remove --root flag from configs

b6095b8

phylogenetic: define default files in configs

3beda2d

phylo: rename exclude_accessions.txt to exclude.txt

623bdf0

phylogenetic: Fix CI config

cee3215

phylo: Update inrb/config.yaml

6f23ff7

pre-commit: ignore phylogenetic/rules/config.smk

9fa4eeb

joverlee521 force-pushed the phylo-updates branch from 06af354 to 18d57d4 Compare July 2, 2025 22:58

joverlee521 force-pushed the phylo-updates branch from 18d57d4 to 813a99e Compare July 2, 2025 23:25

joverlee521 added 3 commits July 3, 2025 12:13

phylogenetic: move config parsing to config.smk

fbb5fe0

phylogenetic: Move filter rule query to config param

1021e9c

Update changelog

d59e0d7

joverlee521 force-pushed the phylo-updates branch from 813a99e to d59e0d7 Compare July 3, 2025 19:29

genehack approved these changes Jul 3, 2025

joverlee521 merged commit 4da6e51 into master Jul 7, 2025
5 checks passed

joverlee521 deleted the phylo-updates branch July 7, 2025 16:46

victorlin mentioned this pull request Aug 28, 2025

Configuration improvements nextstrain/WNV#98

Merged

10 tasks


3 participants
