Split GISAID profile to "six-month" and "all-time" builds #910

trvrb · 2022-04-10T16:44:08Z

Description of proposed changes

This PR splits the existing regional builds global, africa, etc... in the nextstrain-gisaid profile into six-month builds that focus subsampling on the previous six months and all-time builds that subsample evenly across time. This uses the new relative dates functionality in augur filter to make these subsampling strategies easier to implement and more obvious.

Frequencies timespans are set to match subsampling ranges.

The general subsampling logic is cleaned up in a few ways:

North America and Oceania are subsampled and traits reconstructed at the "division" level, while Africa, Asia, Europe and South America are subsampled and traits reconstructed at the "country" level. Previously this behavior had been inconsistent between subsampling, traits, etc...
For global builds, all regions are now sampled at equal frequency except for Oceania which is 33%. Previous overemphasis on Europe and North America is no longer justified.
There is a consistent 4:1 emphasis on recent vs early samples for the six-month builds and a consistent 4:1 emphasis on focal vs context for the regional builds.

Trial runs can be seen at:

I would plan to mirror changes to open profile once we're happy with things here.

Blocking issues

There are some blocking issues however before this PR can be merged. These are:

Support relative dates for --min-date and --max-date augur#889
We need augur filter --min-date 6M, augur frequencies --min-date 6M, etc... for this PR to function. This needs to be merged and a new version of Augur released. Currently, the filter work is merged to master and for frequencies, I've just specified

frequencies:
  north-america_six-months:
    min_date: "2021-10-09"

so that this will function. Once augur frequencies has been updated this can get swapped to min_date: "6M". This issue also needs to be fully closed by a new Augur release.

Allow numbers in build names #524
I named builds as global_six-months, africa_six-months, etc... because global_6m doesn't work. If we can generally fix things to allow numbers in build names, then the PR can be updated accordingly.

Todo

Update build_description rule and _get_upload_inputs() to handle the new build_names during templating / in remote file names (per @trvrb's comment below and subsequent discussion).
~~Ping GISAID about updated URLs (per @huddlej's comment).~~ No longer necessary.
Decide how to handle the breaking change of new URLs/filenames under https://data.nextstrain.org/files/ncov/open/. We're preserving existing URLs for now.

Testing

I've tested locally.

Release checklist

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

trvrb · 2022-04-15T18:28:53Z

The above commit swaps from six-months to 6m enabled by @huddlej's recent merge of #524 and directly specifies min_date of 6M for frequencies based on @victorlin's Augur PR nextstrain/augur#890. I think things are looking pretty good.

trvrb · 2022-04-15T18:43:07Z

The above commit replicates the 6m logic into 1m build targets. Given we're already splitting time, I think having build targets that focus on bleeding edge of sequences / variants is really useful. You can see trial runs here:

For example you can easily see the recent prevalence of BA.4 and BA.5 in South Africa and the distribution across the country with BA.4 in Gauteng and BA.5 in KwaZulu-Natal (https://nextstrain.org/staging/multiple-timespans/ncov/gisaid/africa/1m?c=emerging_lineage&f_country=South%20Africa&r=division):

Here, I'd plan to default to 6m, but it should be easy for users to increase resolution to 1m or step back for greater context with all-time.

emmahodcroft · 2022-04-18T18:24:54Z

This is looking great Trevor - I really like this! I actually think the 1-month views on Nextstrain could be pretty interesting. But of course, adding more builds mean more time & more money, so need to be balanced. Still, it might make quick investigation of growing/changing trends a little easier.

trvrb · 2022-04-19T04:52:56Z

Trial run has completed with datasets available as:

Things seem to look generally good. Though a quick note on compute time / resources. Our standard GISAID builds with 6 regional endpoints run on 36 CPUs with 70Gb of memory and the tree rule in particular is specified to use 16 CPUs. This generally takes somewhere around 2hr 45 min, with subsampling taking the first ~2hr.

This trial run used the same resources to run 21 regional endpoints. It took ~7hr. So almost but not quite 3X as expected.

When merging this PR, we should increase resources for these rebuild jobs to use 96 CPUs and specify 8 CPUs for tree rule. Currently, by asking for 16 threads for tree, it's locking at 32 CPUs when building 2 trees. Additionally, speed up for IQTREE is not linear with the number of cores assigned. We can of course test, but it should definitely be faster when running this many builds in parallel to specify 8 threads rather than 16 for tree.

In using 96 CPUs, we'll still have wasted resources as the refine step isn't multithreaded, but essentially tripling resources should bring run time back down. Or we could split the difference and bump to 48 CPUs or 64 CPUs instead.

trvrb · 2022-04-21T21:52:14Z

Okay. After the level of Slack discussion surrounding 1m timepoints, I thought it better to refocus PR on just the uncontroversial split of existing single timepoint to 6m vs all-time. I'd plan an additional PR to expand to either 1m or 2m subsequent to merging this PR.

I've also adjusted the .github/workflows to double compute resources on AWS Batch. This should result in new runs taking approximately the same amount of time as previous runs. The GISAID trial run took 3 hr 10 min instead of the standard ~2hr 45min, so I think we're good there.

trvrb · 2022-04-22T00:35:22Z

The above two commits fix some oversights that I had in copying the GISAID builds.yaml to work as open builds.yaml. I believe everything should be working as it should at this point.

Trial runs are up at:

GISAID

Open

I believe this PR is now ready to be merged.

jameshadfield

Really happy with these changes - it's great to have a builds which are (temporally) uniform sampled as well as ones focused on the recent (6m) situation.

We need to update the description markdown to reflect this however (it currently states "Our primary global analysis subsamples to ~600 genomes per continental region with ~400 from the previous 4 months and ~200 from before this.") We can make 2 markdowns (6m vs all) and set this in per-build via config["builds"][<build>]["description"]).

trvrb · 2022-04-22T22:37:47Z

Thanks for the catch @jameshadfield! I went through and revised both the GISAID and the open description.md in the above commit. I ended up just doing a single description file rather splitting description between 6m and all-time builds. I also cleaned up a few small issues in the description while I was at it.

I list out the available views like so:

Which is rendered as:

It should look better to table them out as:

region	time period	URL
global	past 6 months	/ncov/gisaid/global/6m
Africa	past 6 months	/ncov/gisaid/africa/6m
Asia	past 6 months	/ncov/gisaid/asia/6m
Europe	past 6 months	/ncov/gisaid/europe/6m
...	...	...
global	all time	/ncov/gisaid/global/all-time

However, this currently, renders as:

I'm going to leave this in list form for this PR, but perhaps we can update Auspice's footer.js styles to improve table rendering and them come back and update the Markdown here in a later PR. Edit: I've made a PR to Auspice at nextstrain/auspice#1502.

I've kicked off new trial runs that should fix descriptions. But know believe this is ready to merge.

huddlej · 2022-04-26T18:06:16Z

@trvrb I just realized that when we merge this PR, we'll need to ping GISAID to let them know the new canonical paths for the "nextregions" files. I assume we want to direct folks to "all-time" builds for now, so we don't break workflows for people who have used these data as background for full-pandemic builds. But, it could make sense to switch to a shorter time period for "nextregions" data eventually.

trvrb · 2022-04-26T20:55:26Z

Good catch @huddlej! It's a little unfortunate to have to pick either all-time or 6m for each of the "nextregions" downloads as I can imagine people wanting 6m context for analyses of recent circulation and all-time for more general analyses. There may be enough real estate here:

that GISAID could offer "Global / 6m", "Global / all time", etc... options and not just swap for the 7 existing. I'll propose this, but if it's just the 7 them updating paths to correspond to "all time" outputs seems best.

trvrb · 2022-04-27T17:46:39Z

@tsibley: I'm running into a final small issue with this PR that I don't know immediately how to implement a simple fix for. If you look at https://nextstrain.org/staging/ncov/open/trial/multiple-timespans/africa/6m you'll see:

The first link is then to https://data.nextstrain.org/files/ncov/open/africa_6m/metadata.tsv.xz rather than the desired https://data.nextstrain.org/files/ncov/open/africa/6m/metadata.tsv.xz.

Here, build_name is directly inserted into the description.md with:

env BUILD={wildcards.build_name:q} \
	perl -pe 's/\$\{{BUILD\}}/$ENV{{BUILD}}/g' \
		< {input.description:q} \
		> {output.description:q}

in rule build_description.

I know this should be a simple change to regex _ to / in build_name, but I wasn't sure how to implement this. Would you be able to implement this? 🙏

Edit: I'm realizing we also need to update _get_upload_inputs for uploading intermediates as it currently just uses {build_name} as well.

trvrb · 2022-04-27T17:57:22Z

Though perhaps we also have an issue of what URL to use for these build-specific Auspice files. Previously we were using:

https://data.nextstrain.org/files/ncov/open/africa/africa.json
Following this pattern, I'd think to update to links to something like
https://data.nextstrain.org/files/ncov/open/africa/6m/africa_6m.json
https://data.nextstrain.org/files/ncov/open/africa/6m/africa_6m_tip-frequencies.json
https://data.nextstrain.org/files/ncov/open/africa/6m/africa_6m_root-sequence.json
though I realize this is perhaps a bit wonky.

We should have a pattern here that works across pathogens where we would make Auspice files available through https://data.nextstrain.org/files/. For example, we'd like https://nextstrain.org/flu/seasonal/h3n2/ha/2y to be available as:

https://data.nextstrain.org/files/flu/seasonal/h3n2/ha/2y...

Maybe just?

This would lobby for the above ncov files to be:

Though it's perhaps strange to have:

at different levels of directory hierarchy.

Any thoughts or preferences here? cc @tsibley @jameshadfield @joverlee521

trvrb · 2022-04-28T17:11:55Z

After thinking about the above issue a bit further, my proposal is to just use auspice.json for the file in question. This would work as:

This is directly names what the file is in the same way as metadata and sequences directly name what the files are and it places the file at the correct point in the /files/ hierarchy.

jameshadfield · 2022-04-28T20:13:43Z

The question I’m asking myself reading this is whether we even want the dataset JSON(s) to be in the directory or not? My reading of files is that it’s akin to the ./results directory in (snakemake) workflows, and the (auspice) JSONs are located at (e.g.) https://data.nextstrain.org/flu_seasonal_h3n2_ha_2y.json

tsibley · 2022-04-28T23:41:05Z

For those playing at home, @trvrb, @jameshadfield, and I had a lengthy discussion about URLs/names in a Slack thread.

@trvrb I can update build_description rule and _get_upload_inputs() once we make a decision about the desired URLs/filenames for these. I added a todo checklist item above. Will aim for tomorrow?

One thing I want to call out clearly is that this PR does represent effectively a breaking change for the https://data.nextstrain.org/files/ncov/open/ data, as the old names will no longer be updated. We should figure out how we want to handle that (even if it's to explicitly decide the breakage is ok).

trvrb · 2022-04-29T00:09:17Z

Thanks @tsibley! Given level of uncertainty regarding file names (and how this interacts with planned dataset API for nextstrain.org), I'd think to make the least breaking change with this PR. I'd see this as:

GISAID: Upload ncov_gisaid_global_6m.json as s3://nextstrain-ncov-private/global/global.json, metadata for global 6m as s3://nextstrain-ncov-private/global/metadata.tsv.xz, etc... so that current GISAID fetch won't break. Do the same for all the other 6m regional builds.
Open: Upload ncov_open_global_6m.json as s3://nextstrain-data/files/ncov/open/global/global.json, metadata for global 6m as s3://nextstrain-data/files/ncov/open/global/metadata.tsv.xz, etc... so that users of subsampled open data won't have their processes break. Do the same for all the other 6m regional builds.
Update the open description.md to point to https://data.nextstrain.org/files/ncov/open/global/global.json, etc... essentially stripping out _6m or _all-time from {build-name}.

I'm choosing 6m over all-time here for what files to share as this is effectively the same files as we're currently providing (with subsampling targeting recent viruses).

This will allow us to more seamlessly transition to updated file names when we're ready where we can continue to provide the previous file URLs and then include on top of these additional split 6m and all-time files.

tsibley · 2022-04-29T22:18:07Z

@trvrb Sounds good. Testing changes for that locally and ran into a small complication because of the reference build being an odd one out. I'll push up a commit once I smooth some wrinkles out.

@trvrb

Avoids (for now) changes that would break downstream usage like external builds or other analyses based on the Open data and the fetches GISAID makes to re-serve the files themselves. 6m builds are, per @trvrb, "effectively the same files as we're currently providing (with subsampling targeting recent viruses)."¹ In the future, once we work out naming more generally and other APIs, we'll provide files for the all-time builds too. The build description template is updated to handle the new build names. The templating method changed to make it easier to support dynamic template vars. Since the `build_description` rule runs for _all_ builds, not just our own, it's important that we maintain backwards compat. This will mostly maintain it except in an slight edge case where `$BUILD` will now be substituted in addition to `${BUILD}`. The `upload` rule is expected to get less usage outside of our own builds, but I believe it does get some so it will maintain backwards compat behaviour (as long as someone's current build names don't already match our new ones). ¹ #910 (comment)

tsibley · 2022-04-29T23:22:51Z

Pushed.

tsibley · 2022-04-29T23:23:54Z

After deploying these builds for the first time to nextstrain.org, we'll want to setup redirects from the unqualified names to the 6m builds (e.g. ncov/open/global → ncov/open/global/6m).

@trvrb

Avoids (for now) changes that would break downstream usage like external builds or other analyses based on the Open data and the fetches GISAID makes to re-serve the files themselves. 6m builds are, per @trvrb, "effectively the same files as we're currently providing (with subsampling targeting recent viruses)."¹ In the future, once we work out naming more generally and other APIs, we'll provide files for the all-time builds too. The build description template is updated to handle the new build names. The templating method changed to make it easier to support dynamic template vars. Since the `build_description` rule runs for _all_ builds, not just our own, it's important that we maintain backwards compat. This will mostly maintain it except in an slight edge case where `$BUILD` will now be substituted in addition to `${BUILD}`. The `upload` rule is expected to get less usage outside of our own builds, but I believe it does get some so it will maintain backwards compat behaviour (as long as someone's current build names don't already match our new ones). ¹ #910 (comment)

@trvrb

Avoids (for now) changes that would break downstream usage like external builds or other analyses based on the Open data and the fetches GISAID makes to re-serve the files themselves. 6m builds are, per @trvrb, "effectively the same files as we're currently providing (with subsampling targeting recent viruses)."¹ In the future, once we work out naming more generally and other APIs, we'll provide files for the all-time builds too. The build description template is updated to handle the new build names. The templating method changed to make it easier to support dynamic template vars. Since the `build_description` rule runs for _all_ builds, not just our own, it's important that we maintain backwards compat. This will mostly maintain it except in an slight edge case where `$BUILD` will now be substituted in addition to `${BUILD}`. The `upload` rule is expected to get less usage outside of our own builds, but I believe it does get some so it will maintain backwards compat behaviour (as long as someone's current build names don't already match our new ones). ¹ #910 (comment)

@trvrb

Avoids (for now) changes that would break downstream usage like external builds or other analyses based on the Open data and the fetches GISAID makes to re-serve the files themselves. 6m builds are, per @trvrb, "effectively the same files as we're currently providing (with subsampling targeting recent viruses)."¹ In the future, once we work out naming more generally and other APIs, we'll provide files for the all-time builds too. The build description template is updated to handle the new build names. The templating method changed to make it easier to support dynamic template vars. Since the `build_description` rule runs for _all_ builds, not just our own, it's important that we maintain backwards compat. This will mostly maintain it except in an slight edge case where `$BUILD` will now be substituted in addition to `${BUILD}`. The `upload` rule is expected to get less usage outside of our own builds, but I believe it does get some so it will maintain backwards compat behaviour (as long as someone's current build names don't already match our new ones). ¹ #910 (comment)

This commit splits the existing regional builds "global", "africa", etc... in the "nextstrain-gisaid" profile into "six-month" builds that focus subsampling on the previous six months and "all-time" builds that subsample evenly across time. This uses the new relative dates functionality in "augur filter" and "augur frequencies" to make these subsampling strategies easier to implement and more obvious. The general subsampling logic is cleaned up in a few ways: 1. North America and Oceania are subsampled and traits reconstructed at the "division" level, while Africa, Asia, Europe and South America are subsampled and traits reconstructed at the "country" level. Previously this behavior had been inconsistent between subsampling, traits, etc... 2. For global builds, all regions are now sampled at equal frequency except for Oceania which is 33%. Previous overemphasis on Europe and North America is no longer justified. 3. There is a consistent 4:1 emphasis on recent vs early samples for the "six-month" builds and a consistent 4:1 emphasis on focal vs context for the regional builds. Frequencies timespans are set to match subsampling ranges. The description.md footer text is updated to describe this split and to provide a table links of region x time period combinations.

Follow the same logic from Nextstrain GISAID and split Nextstrain open to produce "6m" targets that focus subsampling on the previous 6 months as well as "all-time" targets that subsample evenly since pandemic start. Remove subsampling_ranges.smk as it's no longer referenced.

To compensate for doubling build targets from 7 regional builds to 14 regional builds, this commit doubles computational resources from 36 CPUs to 72 CPUs. These specific CPU numbers are keyed to AWS EC2 instance sizes. A c5.9xlarge is 36 CPUs, a c5.12xlarge is 48 CPUs and a c5.18xlarge is 72 CPUs. We should be picking one of these and not a number in between. Finally, this reduces `--set-threads tree` from 16 to 8. There are often close to 7 trees that wanted to simultaneously be run. With 36 CPUs, we'd get situations where 2 trees were taking up 32 CPUs leaving 4 open. With this commit, we'll have 72 CPUs and want to simultaneously run 14 trees. If trees are each 8 CPUs this should better fit into resources.

Tells Snakemake what the `prefix` wildcard's literal value is, preventing Snakemake from interpreting part of the build name as the prefix. When Snakemake misinterprets the build name, this causes key errors downstream that are difficult to debug.

…constraint Avoids accidentally treating the fixed string as a regex which could lead to very weird Snakemake DAG issues when matching the "prefix" wildcard.

@trvrb

Avoids (for now) changes that would break downstream usage like external builds or other analyses based on the Open data and the fetches GISAID makes to re-serve the files themselves. 6m builds are, per @trvrb, "effectively the same files as we're currently providing (with subsampling targeting recent viruses)."¹ In the future, once we work out naming more generally and other APIs, we'll provide files for the all-time builds too. The build description template is updated to handle the new build names. The templating method changed to make it easier to support dynamic template vars. Since the `build_description` rule runs for _all_ builds, not just our own, it's important that we maintain backwards compat. This will mostly maintain it except in an slight edge case where `$BUILD` will now be substituted in addition to `${BUILD}`. The `upload` rule is expected to get less usage outside of our own builds, but I believe it does get some so it will maintain backwards compat behaviour (as long as someone's current build names don't already match our new ones). ¹ #910 (comment)

trvrb · 2022-04-30T01:40:10Z

I've rebased this PR onto master while compressing extraneous commits. I walked through via GitHub's branch comparison interface to make sure everything worked as it should. I believe the PR is fully ready to be merged at this point.

trvrb force-pushed the multiple-timespans branch from 5ddf7cc to 0bf1383 Compare April 15, 2022 02:35

trvrb marked this pull request as ready for review April 21, 2022 21:52

jameshadfield requested changes Apr 22, 2022

View reviewed changes

trvrb requested a review from jameshadfield April 22, 2022 22:38

trvrb mentioned this pull request Apr 22, 2022

Update footer styling to better display tables nextstrain/auspice#1502

Merged

trvrb added 3 commits April 29, 2022 18:31

huddlej and others added 4 commits April 29, 2022 18:32

Escape regexp metachars when using auspice_json_prefix as a wildcard …

bd2d766

…constraint Avoids accidentally treating the fixed string as a regex which could lead to very weird Snakemake DAG issues when matching the "prefix" wildcard.

Update change log

7bbc46e

trvrb force-pushed the multiple-timespans branch from 4723ab1 to 7bbc46e Compare April 30, 2022 01:37

trvrb merged commit 6732f18 into master Apr 30, 2022

trvrb deleted the multiple-timespans branch April 30, 2022 18:33

tsibley mentioned this pull request May 3, 2022

Forecase Automation: Provision counts files nextstrain/forecasts-ncov#1

Closed

trvrb mentioned this pull request May 28, 2022

Include 2m timespan in Nextstrain GISAID and open profiles #957

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split GISAID profile to "six-month" and "all-time" builds #910

Split GISAID profile to "six-month" and "all-time" builds #910

trvrb commented Apr 10, 2022 •

edited by tsibley

Loading

trvrb commented Apr 15, 2022

trvrb commented Apr 15, 2022 •

edited

Loading

emmahodcroft commented Apr 18, 2022

trvrb commented Apr 19, 2022

trvrb commented Apr 21, 2022 •

edited

Loading

trvrb commented Apr 22, 2022 •

edited

Loading

jameshadfield left a comment •

edited

Loading

trvrb commented Apr 22, 2022 •

edited

Loading

huddlej commented Apr 26, 2022

trvrb commented Apr 26, 2022 •

edited

Loading

trvrb commented Apr 27, 2022 •

edited

Loading

trvrb commented Apr 27, 2022 •

edited

Loading

trvrb commented Apr 28, 2022 •

edited

Loading

jameshadfield commented Apr 28, 2022

tsibley commented Apr 28, 2022

trvrb commented Apr 29, 2022 •

edited

Loading

tsibley commented Apr 29, 2022

tsibley commented Apr 29, 2022

tsibley commented Apr 29, 2022

trvrb commented Apr 30, 2022

Split GISAID profile to "six-month" and "all-time" builds #910

Split GISAID profile to "six-month" and "all-time" builds #910

Conversation

trvrb commented Apr 10, 2022 • edited by tsibley Loading

Description of proposed changes

Blocking issues

Todo

Testing

Release checklist

trvrb commented Apr 15, 2022

trvrb commented Apr 15, 2022 • edited Loading

emmahodcroft commented Apr 18, 2022

trvrb commented Apr 19, 2022

trvrb commented Apr 21, 2022 • edited Loading

trvrb commented Apr 22, 2022 • edited Loading

GISAID

Open

jameshadfield left a comment • edited Loading

Choose a reason for hiding this comment

trvrb commented Apr 22, 2022 • edited Loading

huddlej commented Apr 26, 2022

trvrb commented Apr 26, 2022 • edited Loading

trvrb commented Apr 27, 2022 • edited Loading

trvrb commented Apr 27, 2022 • edited Loading

trvrb commented Apr 28, 2022 • edited Loading

jameshadfield commented Apr 28, 2022

tsibley commented Apr 28, 2022

trvrb commented Apr 29, 2022 • edited Loading

tsibley commented Apr 29, 2022

tsibley commented Apr 29, 2022

tsibley commented Apr 29, 2022

trvrb commented Apr 30, 2022

trvrb commented Apr 10, 2022 •

edited by tsibley

Loading

trvrb commented Apr 15, 2022 •

edited

Loading

trvrb commented Apr 21, 2022 •

edited

Loading

trvrb commented Apr 22, 2022 •

edited

Loading

jameshadfield left a comment •

edited

Loading

trvrb commented Apr 22, 2022 •

edited

Loading

trvrb commented Apr 26, 2022 •

edited

Loading

trvrb commented Apr 27, 2022 •

edited

Loading

trvrb commented Apr 27, 2022 •

edited

Loading

trvrb commented Apr 28, 2022 •

edited

Loading

trvrb commented Apr 29, 2022 •

edited

Loading