-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Persephone #28
Persephone #28
Conversation
Ideally zika ingest should be added via the pathogen_template_repo. However, to update the recent zika sequences, this temporary branch will copy the ingest directory from the mpox repo: https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05 The `ingest/vendored` subdirectory is not copied over since that folder should be added with `git subtree`. https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05/ingest#ingestvendored Future commits will change this to work with Zika data.
…/vendored subrepo: subdir: "ingest/vendored" merged: "a0faef5" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "a0faef5" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"
Removal of Nextclade related rules, pending the compilation of a Nextclade zika dataset and potential v3 changes.
Organize field renaming into two parts. 1. Rename the NCBI output columns to match the NCBI mnemonics¹ (see "ncbi_datasets_fields:" in `config/config.yaml`) 2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names² (see "transform: fieldmap:" in `config/config.yaml`) ¹ https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields ² https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html
Rescue some of the original functionality of the zika_upload script from fauna. https://github.com/nextstrain/fauna/blob/master/vdb/zika_upload.py#L14-L30
Move phylogenetic workflow from top-level to folder phylogenetic in order to follow the Pathogen Repo Template: https://github.com/nextstrain/pathogen-repo-template
The original Zika build contained USVI data that had been posted publiclly to GitHub but not yet submitted to NCBI GenBank. This commit adds rules to merge the USVI data with the NCBI GenBank data. Since USVI does not have a genbank_accession column, we create a new accession column for both USVI and NCBI GenBank accessions. This accession column is then used as the strain_id column for the phylogenetic build. Since auspice automagically generates a NCBI GenBank url for "genbank_accession" fields, we use a "url" field instead, allowing for a mix of GenBank and GitHub urls to be used in the strain popup window.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template. We are removing the `source-data` directory and consolidating everything into the config directory since there's really no functional difference between them. nextstrain/mpox@c34f370
Part of work to update this repo to match the pathogen-repo-template. Always provide default config values that can then easily be overridden with --configfiles/--config options. This makes it simpler to change a subset of config values or extend the default configs. Renames the config.yaml to defaults.yaml to reflect this change. nextstrain/mpox@1e01fed
Use the stopgap from mpox (nextstrain/mpox#214) until the pathogen-repo-ci is updated. However, we may be moving to using config/customization/ci in the future, so revisit this commit and edit accordingly. nextstrain/pathogen-repo-guide#24
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for following the progress with the pathogen-repo-template in the zika repo @j23414!
I left mostly non-blocking comments that can be future improvements. Since this branch's workflows are being used to produce the live builds at nextstrain.org/zika I would prefer to get this merged ASAP and continue improvements in future PRs.
scripts/set_final_strain_name.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Another reminder for us to tackle the issue of accession vs strain name display in Auspice across the board with nextstrain/auspice#1668
scripts/set_final_strain_name.py
Outdated
if saveoldcolumn == "accession": | ||
node["node_attrs"][saveoldcolumn] = node["name"] | ||
node["node_attrs"]["url"] = "https://www.ncbi.nlm.nih.gov/nuccore/" + node["name"] | ||
elif saveoldcolumn == "genbank_accession": | ||
node["node_attrs"][saveoldcolumn] = {} | ||
node["node_attrs"][saveoldcolumn]["value"] = node["name"] | ||
else: | ||
node["node_attrs"][saveoldcolumn] = node["name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this workaround still needed since the augur export v2
no longer loses the attributes after nextstrain/augur#1261?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion with @j23414: this is currently blocked by needing to add metadata as a coloring to keep accession
and genbank_accession
.
Maybe solved in the future by nextstrain/augur#1384
scripts/set_final_strain_name.py
Outdated
|
||
replace_name_recursive(data['tree'], name_lookup) | ||
replace_name_recursive(data['tree'], name_lookup, args.metadata_id_columns[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hard-coded index really stood out to me here, but maybe can be removed if the work around is no longer needed.
@@ -1,5 +1,4 @@ | |||
if not config: | |||
configfile: "config/config_zika.yaml" | |||
configfile: "config/config_zika.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
With this change to always use the default config YAML, the config access throughout the workflow can be updated from config.get("strain_id_field", "strain")
to config["strain_id_field"]
without the hard-coded default value. Embedding default values throughout the Snakemake workflow has potential to be confusing in the future.
This commit refactors the generic post-processing script to better align with its specific purpose in Zika ingest. The purpose of this script is to fix zika strain names based on historical modifications from the fauna repo. In summary the following changes: * Rename script to fix-zika-strain-names.py to match the purpose * Add a docstring to the script * Replace the accession argument with a strain field argument, which is the field that needs to be fixed
* Rephrase the CI section to be less likely to get out of sync * Drop the snakefmt section since we are not currently using it. The snakefmt may be added back in the future. Co-authored-by: Jover Lee <joverlee521@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, as noted in our chat, this should be renamed to curate.smk
.
""" | ||
|
||
|
||
rule transform: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto rename to curate
.
log: | ||
"logs/transform.txt", | ||
params: | ||
field_map=config["transform"]["field_map"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, just noticed that this is handled a little differently in the pathogen-repo-template to simplify the field_map in the config YAML
Description of proposed changes
The
persephone
branch was initially intended to live long enough to update the zika live build. However, after some internal discussion it was recommended to submit all the changes as a PR.These commits can be broadly outlined as:
Related issue(s)
Checklist