Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split GISAID profile to "six-month" and "all-time" builds #910

Merged
merged 7 commits into from
Apr 30, 2022

Conversation

trvrb
Copy link
Member

@trvrb trvrb commented Apr 10, 2022

Description of proposed changes

This PR splits the existing regional builds global, africa, etc... in the nextstrain-gisaid profile into six-month builds that focus subsampling on the previous six months and all-time builds that subsample evenly across time. This uses the new relative dates functionality in augur filter to make these subsampling strategies easier to implement and more obvious.

Frequencies timespans are set to match subsampling ranges.

The general subsampling logic is cleaned up in a few ways:

  1. North America and Oceania are subsampled and traits reconstructed at the "division" level, while Africa, Asia, Europe and South America are subsampled and traits reconstructed at the "country" level. Previously this behavior had been inconsistent between subsampling, traits, etc...
  2. For global builds, all regions are now sampled at equal frequency except for Oceania which is 33%. Previous overemphasis on Europe and North America is no longer justified.
  3. There is a consistent 4:1 emphasis on recent vs early samples for the six-month builds and a consistent 4:1 emphasis on focal vs context for the regional builds.

Trial runs can be seen at:

I would plan to mirror changes to open profile once we're happy with things here.

Blocking issues

There are some blocking issues however before this PR can be merged. These are:

  • Support relative dates for --min-date and --max-date augur#889
    We need augur filter --min-date 6M, augur frequencies --min-date 6M, etc... for this PR to function. This needs to be merged and a new version of Augur released. Currently, the filter work is merged to master and for frequencies, I've just specified
frequencies:
  north-america_six-months:
    min_date: "2021-10-09"

so that this will function. Once augur frequencies has been updated this can get swapped to min_date: "6M". This issue also needs to be fully closed by a new Augur release.

  • Allow numbers in build names #524
    I named builds as global_six-months, africa_six-months, etc... because global_6m doesn't work. If we can generally fix things to allow numbers in build names, then the PR can be updated accordingly.

Todo

  • Update build_description rule and _get_upload_inputs() to handle the new build_names during templating / in remote file names (per @trvrb's comment below and subsequent discussion).

  • Ping GISAID about updated URLs (per @huddlej's comment). No longer necessary.

  • Decide how to handle the breaking change of new URLs/filenames under https://data.nextstrain.org/files/ncov/open/. We're preserving existing URLs for now.

Testing

I've tested locally.

Release checklist

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@trvrb
Copy link
Member Author

trvrb commented Apr 15, 2022

The above commit swaps from six-months to 6m enabled by @huddlej's recent merge of #524 and directly specifies min_date of 6M for frequencies based on @victorlin's Augur PR nextstrain/augur#890. I think things are looking pretty good.

@trvrb
Copy link
Member Author

trvrb commented Apr 15, 2022

The above commit replicates the 6m logic into 1m build targets. Given we're already splitting time, I think having build targets that focus on bleeding edge of sequences / variants is really useful. You can see trial runs here:

For example you can easily see the recent prevalence of BA.4 and BA.5 in South Africa and the distribution across the country with BA.4 in Gauteng and BA.5 in KwaZulu-Natal (https://nextstrain.org/staging/multiple-timespans/ncov/gisaid/africa/1m?c=emerging_lineage&f_country=South%20Africa&r=division):

ba4-ba5

Here, I'd plan to default to 6m, but it should be easy for users to increase resolution to 1m or step back for greater context with all-time.

@emmahodcroft
Copy link
Member

This is looking great Trevor - I really like this! I actually think the 1-month views on Nextstrain could be pretty interesting. But of course, adding more builds mean more time & more money, so need to be balanced. Still, it might make quick investigation of growing/changing trends a little easier.

@trvrb
Copy link
Member Author

trvrb commented Apr 19, 2022

Trial run has completed with datasets available as:

Things seem to look generally good. Though a quick note on compute time / resources. Our standard GISAID builds with 6 regional endpoints run on 36 CPUs with 70Gb of memory and the tree rule in particular is specified to use 16 CPUs. This generally takes somewhere around 2hr 45 min, with subsampling taking the first ~2hr.

This trial run used the same resources to run 21 regional endpoints. It took ~7hr. So almost but not quite 3X as expected.

When merging this PR, we should increase resources for these rebuild jobs to use 96 CPUs and specify 8 CPUs for tree rule. Currently, by asking for 16 threads for tree, it's locking at 32 CPUs when building 2 trees. Additionally, speed up for IQTREE is not linear with the number of cores assigned. We can of course test, but it should definitely be faster when running this many builds in parallel to specify 8 threads rather than 16 for tree.

In using 96 CPUs, we'll still have wasted resources as the refine step isn't multithreaded, but essentially tripling resources should bring run time back down. Or we could split the difference and bump to 48 CPUs or 64 CPUs instead.

@trvrb
Copy link
Member Author

trvrb commented Apr 21, 2022

Okay. After the level of Slack discussion surrounding 1m timepoints, I thought it better to refocus PR on just the uncontroversial split of existing single timepoint to 6m vs all-time. I'd plan an additional PR to expand to either 1m or 2m subsequent to merging this PR.

I've also adjusted the .github/workflows to double compute resources on AWS Batch. This should result in new runs taking approximately the same amount of time as previous runs. The GISAID trial run took 3 hr 10 min instead of the standard ~2hr 45min, so I think we're good there.

@trvrb trvrb marked this pull request as ready for review April 21, 2022 21:52
@trvrb
Copy link
Member Author

trvrb commented Apr 22, 2022

The above two commits fix some oversights that I had in copying the GISAID builds.yaml to work as open builds.yaml. I believe everything should be working as it should at this point.

Trial runs are up at:

GISAID

Open

I believe this PR is now ready to be merged.

Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really happy with these changes - it's great to have a builds which are (temporally) uniform sampled as well as ones focused on the recent (6m) situation.

We need to update the description markdown to reflect this however (it currently states "Our primary global analysis subsamples to ~600 genomes per continental region with ~400 from the previous 4 months and ~200 from before this.") We can make 2 markdowns (6m vs all) and set this in per-build via config["builds"][<build>]["description"]).

@trvrb
Copy link
Member Author

trvrb commented Apr 22, 2022

Thanks for the catch @jameshadfield! I went through and revised both the GISAID and the open description.md in the above commit. I ended up just doing a single description file rather splitting description between 6m and all-time builds. I also cleaned up a few small issues in the description while I was at it.

I list out the available views like so:

Which is rendered as:
list

It should look better to table them out as:

region time period URL
global past 6 months /ncov/gisaid/global/6m
Africa past 6 months /ncov/gisaid/africa/6m
Asia past 6 months /ncov/gisaid/asia/6m
Europe past 6 months /ncov/gisaid/europe/6m
... ... ...
global all time /ncov/gisaid/global/all-time

However, this currently, renders as:
table

I'm going to leave this in list form for this PR, but perhaps we can update Auspice's footer.js styles to improve table rendering and them come back and update the Markdown here in a later PR. Edit: I've made a PR to Auspice at nextstrain/auspice#1502.

I've kicked off new trial runs that should fix descriptions. But know believe this is ready to merge.

@huddlej
Copy link
Contributor

huddlej commented Apr 26, 2022

@trvrb I just realized that when we merge this PR, we'll need to ping GISAID to let them know the new canonical paths for the "nextregions" files. I assume we want to direct folks to "all-time" builds for now, so we don't break workflows for people who have used these data as background for full-pandemic builds. But, it could make sense to switch to a shorter time period for "nextregions" data eventually.

@trvrb
Copy link
Member Author

trvrb commented Apr 26, 2022

Good catch @huddlej! It's a little unfortunate to have to pick either all-time or 6m for each of the "nextregions" downloads as I can imagine people wanting 6m context for analyses of recent circulation and all-time for more general analyses. There may be enough real estate here:

gisaid

that GISAID could offer "Global / 6m", "Global / all time", etc... options and not just swap for the 7 existing. I'll propose this, but if it's just the 7 them updating paths to correspond to "all time" outputs seems best.

@trvrb
Copy link
Member Author

trvrb commented Apr 27, 2022

@tsibley: I'm running into a final small issue with this PR that I don't know immediately how to implement a simple fix for. If you look at https://nextstrain.org/staging/ncov/open/trial/multiple-timespans/africa/6m you'll see:

templating

The first link is then to https://data.nextstrain.org/files/ncov/open/africa_6m/metadata.tsv.xz rather than the desired https://data.nextstrain.org/files/ncov/open/africa/6m/metadata.tsv.xz.

Here, build_name is directly inserted into the description.md with:

env BUILD={wildcards.build_name:q} \
	perl -pe 's/\$\{{BUILD\}}/$ENV{{BUILD}}/g' \
		< {input.description:q} \
		> {output.description:q}

in rule build_description.

I know this should be a simple change to regex _ to / in build_name, but I wasn't sure how to implement this. Would you be able to implement this? 🙏

Edit: I'm realizing we also need to update _get_upload_inputs for uploading intermediates as it currently just uses {build_name} as well.

@trvrb
Copy link
Member Author

trvrb commented Apr 27, 2022

Though perhaps we also have an issue of what URL to use for these build-specific Auspice files. Previously we were using:

We should have a pattern here that works across pathogens where we would make Auspice files available through https://data.nextstrain.org/files/. For example, we'd like https://nextstrain.org/flu/seasonal/h3n2/ha/2y to be available as:

Maybe just?

This would lobby for the above ncov files to be:

Though it's perhaps strange to have:

at different levels of directory hierarchy.

Any thoughts or preferences here? cc @tsibley @jameshadfield @joverlee521

@trvrb
Copy link
Member Author

trvrb commented Apr 28, 2022

After thinking about the above issue a bit further, my proposal is to just use auspice.json for the file in question. This would work as:

This is directly names what the file is in the same way as metadata and sequences directly name what the files are and it places the file at the correct point in the /files/ hierarchy.

@jameshadfield
Copy link
Member

The question I’m asking myself reading this is whether we even want the dataset JSON(s) to be in the directory or not? My reading of files is that it’s akin to the ./results directory in (snakemake) workflows, and the (auspice) JSONs are located at (e.g.) https://data.nextstrain.org/flu_seasonal_h3n2_ha_2y.json

@tsibley
Copy link
Member

tsibley commented Apr 28, 2022

For those playing at home, @trvrb, @jameshadfield, and I had a lengthy discussion about URLs/names in a Slack thread.

@trvrb I can update build_description rule and _get_upload_inputs() once we make a decision about the desired URLs/filenames for these. I added a todo checklist item above. Will aim for tomorrow?

One thing I want to call out clearly is that this PR does represent effectively a breaking change for the https://data.nextstrain.org/files/ncov/open/ data, as the old names will no longer be updated. We should figure out how we want to handle that (even if it's to explicitly decide the breakage is ok).

@trvrb
Copy link
Member Author

trvrb commented Apr 29, 2022

Thanks @tsibley! Given level of uncertainty regarding file names (and how this interacts with planned dataset API for nextstrain.org), I'd think to make the least breaking change with this PR. I'd see this as:

  1. GISAID: Upload ncov_gisaid_global_6m.json as s3://nextstrain-ncov-private/global/global.json, metadata for global 6m as s3://nextstrain-ncov-private/global/metadata.tsv.xz, etc... so that current GISAID fetch won't break. Do the same for all the other 6m regional builds.
  2. Open: Upload ncov_open_global_6m.json as s3://nextstrain-data/files/ncov/open/global/global.json, metadata for global 6m as s3://nextstrain-data/files/ncov/open/global/metadata.tsv.xz, etc... so that users of subsampled open data won't have their processes break. Do the same for all the other 6m regional builds.
  3. Update the open description.md to point to https://data.nextstrain.org/files/ncov/open/global/global.json, etc... essentially stripping out _6m or _all-time from {build-name}.

I'm choosing 6m over all-time here for what files to share as this is effectively the same files as we're currently providing (with subsampling targeting recent viruses).

This will allow us to more seamlessly transition to updated file names when we're ready where we can continue to provide the previous file URLs and then include on top of these additional split 6m and all-time files.

@tsibley
Copy link
Member

tsibley commented Apr 29, 2022

@trvrb Sounds good. Testing changes for that locally and ran into a small complication because of the reference build being an odd one out. I'll push up a commit once I smooth some wrinkles out.

tsibley added a commit that referenced this pull request Apr 29, 2022
Avoids (for now) changes that would break downstream usage like external
builds or other analyses based on the Open data and the fetches GISAID
makes to re-serve the files themselves.

6m builds are, per @trvrb, "effectively the same files as we're
currently providing (with subsampling targeting recent viruses)."¹  In
the future, once we work out naming more generally and other APIs, we'll
provide files for the all-time builds too.

The build description template is updated to handle the new build names.
The templating method changed to make it easier to support dynamic
template vars.  Since the `build_description` rule runs for _all_ builds,
not just our own, it's important that we maintain backwards compat.
This will mostly maintain it except in an slight edge case where
`$BUILD` will now be substituted in addition to `${BUILD}`.

The `upload` rule is expected to get less usage outside of our own
builds, but I believe it does get some so it will maintain backwards
compat behaviour (as long as someone's current build names don't already
match our new ones).

¹ #910 (comment)
@tsibley
Copy link
Member

tsibley commented Apr 29, 2022

Pushed.

@tsibley
Copy link
Member

tsibley commented Apr 29, 2022

After deploying these builds for the first time to nextstrain.org, we'll want to setup redirects from the unqualified names to the 6m builds (e.g. ncov/open/global → ncov/open/global/6m).

trvrb pushed a commit that referenced this pull request Apr 30, 2022
Avoids (for now) changes that would break downstream usage like external
builds or other analyses based on the Open data and the fetches GISAID
makes to re-serve the files themselves.

6m builds are, per @trvrb, "effectively the same files as we're
currently providing (with subsampling targeting recent viruses)."¹  In
the future, once we work out naming more generally and other APIs, we'll
provide files for the all-time builds too.

The build description template is updated to handle the new build names.
The templating method changed to make it easier to support dynamic
template vars.  Since the `build_description` rule runs for _all_ builds,
not just our own, it's important that we maintain backwards compat.
This will mostly maintain it except in an slight edge case where
`$BUILD` will now be substituted in addition to `${BUILD}`.

The `upload` rule is expected to get less usage outside of our own
builds, but I believe it does get some so it will maintain backwards
compat behaviour (as long as someone's current build names don't already
match our new ones).

¹ #910 (comment)
trvrb pushed a commit that referenced this pull request Apr 30, 2022
Avoids (for now) changes that would break downstream usage like external
builds or other analyses based on the Open data and the fetches GISAID
makes to re-serve the files themselves.

6m builds are, per @trvrb, "effectively the same files as we're
currently providing (with subsampling targeting recent viruses)."¹  In
the future, once we work out naming more generally and other APIs, we'll
provide files for the all-time builds too.

The build description template is updated to handle the new build names.
The templating method changed to make it easier to support dynamic
template vars.  Since the `build_description` rule runs for _all_ builds,
not just our own, it's important that we maintain backwards compat.
This will mostly maintain it except in an slight edge case where
`$BUILD` will now be substituted in addition to `${BUILD}`.

The `upload` rule is expected to get less usage outside of our own
builds, but I believe it does get some so it will maintain backwards
compat behaviour (as long as someone's current build names don't already
match our new ones).

¹ #910 (comment)
trvrb pushed a commit that referenced this pull request Apr 30, 2022
Avoids (for now) changes that would break downstream usage like external
builds or other analyses based on the Open data and the fetches GISAID
makes to re-serve the files themselves.

6m builds are, per @trvrb, "effectively the same files as we're
currently providing (with subsampling targeting recent viruses)."¹  In
the future, once we work out naming more generally and other APIs, we'll
provide files for the all-time builds too.

The build description template is updated to handle the new build names.
The templating method changed to make it easier to support dynamic
template vars.  Since the `build_description` rule runs for _all_ builds,
not just our own, it's important that we maintain backwards compat.
This will mostly maintain it except in an slight edge case where
`$BUILD` will now be substituted in addition to `${BUILD}`.

The `upload` rule is expected to get less usage outside of our own
builds, but I believe it does get some so it will maintain backwards
compat behaviour (as long as someone's current build names don't already
match our new ones).

¹ #910 (comment)
This commit splits the existing regional builds "global", "africa", etc... in the "nextstrain-gisaid" profile into "six-month" builds that focus subsampling on the previous six months and "all-time" builds that subsample evenly across time. This uses the new relative dates functionality in "augur filter" and "augur frequencies" to make these subsampling strategies easier to implement and more obvious.

The general subsampling logic is cleaned up in a few ways:
1. North America and Oceania are subsampled and traits reconstructed at the "division" level, while Africa, Asia, Europe and South America are subsampled and traits reconstructed at the "country" level. Previously this behavior had been inconsistent between subsampling, traits, etc...
2. For global builds, all regions are now sampled at equal frequency except for Oceania which is 33%. Previous overemphasis on Europe and North America is no longer justified.
3. There is a consistent 4:1 emphasis on recent vs early samples for the "six-month" builds and a consistent 4:1 emphasis on focal vs context for the regional builds.

Frequencies timespans are set to match subsampling ranges.

The description.md footer text is updated to describe this split and to provide a table links of region x time period combinations.
Follow the same logic from Nextstrain GISAID and split Nextstrain open to produce "6m" targets that focus subsampling on the previous 6 months as well as "all-time" targets that subsample evenly since pandemic start.

Remove subsampling_ranges.smk as it's no longer referenced.
To compensate for doubling build targets from 7 regional builds to 14 regional builds, this commit doubles computational resources from 36 CPUs to 72 CPUs.

These specific CPU numbers are keyed to AWS EC2 instance sizes. A c5.9xlarge is 36 CPUs, a c5.12xlarge is 48 CPUs and a c5.18xlarge is 72 CPUs. We should be picking one of these and not a number in between.

Finally, this reduces `--set-threads tree` from 16 to 8. There are often close to 7 trees that wanted to simultaneously be run. With 36 CPUs, we'd get situations where 2 trees were taking up 32 CPUs leaving 4 open.

With this commit, we'll have 72 CPUs and want to simultaneously run 14 trees. If trees are each 8 CPUs this should better fit into resources.
huddlej and others added 4 commits April 29, 2022 18:32
Tells Snakemake what the `prefix` wildcard's literal value is,
preventing Snakemake from interpreting part of the build name as the
prefix. When Snakemake misinterprets the build name, this causes key
errors downstream that are difficult to debug.
…constraint

Avoids accidentally treating the fixed string as a regex which could
lead to very weird Snakemake DAG issues when matching the "prefix"
wildcard.
Avoids (for now) changes that would break downstream usage like external
builds or other analyses based on the Open data and the fetches GISAID
makes to re-serve the files themselves.

6m builds are, per @trvrb, "effectively the same files as we're
currently providing (with subsampling targeting recent viruses)."¹  In
the future, once we work out naming more generally and other APIs, we'll
provide files for the all-time builds too.

The build description template is updated to handle the new build names.
The templating method changed to make it easier to support dynamic
template vars.  Since the `build_description` rule runs for _all_ builds,
not just our own, it's important that we maintain backwards compat.
This will mostly maintain it except in an slight edge case where
`$BUILD` will now be substituted in addition to `${BUILD}`.

The `upload` rule is expected to get less usage outside of our own
builds, but I believe it does get some so it will maintain backwards
compat behaviour (as long as someone's current build names don't already
match our new ones).

¹ #910 (comment)
@trvrb
Copy link
Member Author

trvrb commented Apr 30, 2022

I've rebased this PR onto master while compressing extraneous commits. I walked through via GitHub's branch comparison interface to make sure everything worked as it should. I believe the PR is fully ready to be merged at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants