Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persephone #28

Merged
merged 28 commits into from
Jan 19, 2024
Merged

Persephone #28

merged 28 commits into from
Jan 19, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Dec 5, 2023

Description of proposed changes

The persephone branch was initially intended to live long enough to update the zika live build. However, after some internal discussion it was recommended to submit all the changes as a PR.

These commits can be broadly outlined as:

  1. Add ingest pipeline (copied from mpox repo)
    1. Use ncbi-datasets to fetch GenBank records
    2. Add the zika data processing steps from fauna (17e3912)
  2. Move the phylogenetic workflow to its own folder
    1. Add rules for merging USVI data (9334700)
    2. Move rules to sub snakefiles to match the pathogen-repo-template

Related issue(s)

Checklist

  • Checks pass

Ideally zika ingest should be added via the pathogen_template_repo. However,
to update the recent zika sequences, this temporary branch will copy the ingest
directory from the mpox repo:

https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05

The `ingest/vendored` subdirectory is not copied over since that folder should
be added with `git subtree`.

https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05/ingest#ingestvendored

Future commits will change this to work with Zika data.
…/vendored

subrepo:
  subdir:   "ingest/vendored"
  merged:   "a0faef5"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "a0faef5"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Removal of Nextclade related rules, pending the compilation of a Nextclade
zika dataset and potential v3 changes.
Organize field renaming into two parts.

1. Rename the NCBI output columns to match the NCBI mnemonics¹
   (see "ncbi_datasets_fields:" in `config/config.yaml`)
2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names²
   (see "transform: fieldmap:" in `config/config.yaml`)

¹ https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
² https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html
Move phylogenetic workflow from top-level to folder phylogenetic in order
to follow the Pathogen Repo Template:

https://github.com/nextstrain/pathogen-repo-template
The original Zika build contained USVI data that had been posted publiclly to
GitHub but not yet submitted to NCBI GenBank. This commit adds rules to merge
the USVI data with the NCBI GenBank data.

Since USVI does not have a genbank_accession column, we create a new accession column
for both USVI and NCBI GenBank accessions. This accession column is then used
as the strain_id column for the phylogenetic build.

Since auspice automagically generates a NCBI GenBank url for "genbank_accession" fields,
we use a "url" field instead, allowing for a mix of GenBank and GitHub urls
to be used in the strain popup window.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template.
Part of work to update this repo to match the pathogen-repo-template.

We are removing the `source-data` directory and consolidating everything
into the config directory since there's really no functional difference
between them.

nextstrain/mpox@c34f370
Part of work to update this repo to match the pathogen-repo-template.

Always provide default config values that can then easily be overridden
with --configfiles/--config options. This makes it simpler to change a
subset of config values or extend the default configs.

Renames the config.yaml to defaults.yaml to reflect this change.

nextstrain/mpox@1e01fed
Use the stopgap from mpox (nextstrain/mpox#214) until the pathogen-repo-ci is updated.
However, we may be moving to using config/customization/ci in the future, so revisit
this commit and edit accordingly.

nextstrain/pathogen-repo-guide#24
@j23414 j23414 marked this pull request as ready for review January 11, 2024 23:54
@j23414 j23414 requested a review from a team January 12, 2024 19:09
@joverlee521 joverlee521 self-requested a review January 13, 2024 01:11
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for following the progress with the pathogen-repo-template in the zika repo @j23414!

I left mostly non-blocking comments that can be future improvements. Since this branch's workflows are being used to produce the live builds at nextstrain.org/zika I would prefer to get this merged ASAP and continue improvements in future PRs.

ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/transform.smk Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Another reminder for us to tackle the issue of accession vs strain name display in Auspice across the board with nextstrain/auspice#1668

Comment on lines 7 to 14
if saveoldcolumn == "accession":
node["node_attrs"][saveoldcolumn] = node["name"]
node["node_attrs"]["url"] = "https://www.ncbi.nlm.nih.gov/nuccore/" + node["name"]
elif saveoldcolumn == "genbank_accession":
node["node_attrs"][saveoldcolumn] = {}
node["node_attrs"][saveoldcolumn]["value"] = node["name"]
else:
node["node_attrs"][saveoldcolumn] = node["name"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this workaround still needed since the augur export v2 no longer loses the attributes after nextstrain/augur#1261?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion with @j23414: this is currently blocked by needing to add metadata as a coloring to keep accession and genbank_accession.
Maybe solved in the future by nextstrain/augur#1384


replace_name_recursive(data['tree'], name_lookup)
replace_name_recursive(data['tree'], name_lookup, args.metadata_id_columns[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hard-coded index really stood out to me here, but maybe can be removed if the work around is no longer needed.

@@ -1,5 +1,4 @@
if not config:
configfile: "config/config_zika.yaml"
configfile: "config/config_zika.yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

With this change to always use the default config YAML, the config access throughout the workflow can be updated from config.get("strain_id_field", "strain") to config["strain_id_field"] without the hard-coded default value. Embedding default values throughout the Snakemake workflow has potential to be confusing in the future.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
j23414 and others added 5 commits January 18, 2024 08:00
This commit refactors the generic post-processing script to better align
with its specific purpose in Zika ingest. The purpose of this script is
to fix zika strain names based on historical modifications from the fauna
repo. In summary the following changes:

* Rename script to fix-zika-strain-names.py to match the purpose
* Add a docstring to the script
* Replace the accession argument with a strain field argument, which is
  the field that needs to be fixed
* Rephrase the CI section to be less likely to get out of sync
* Drop the snakefmt section since we are not currently using it. The
  snakefmt may be added back in the future.

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, as noted in our chat, this should be renamed to curate.smk.

"""


rule transform:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto rename to curate.

log:
"logs/transform.txt",
params:
field_map=config["transform"]["field_map"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, just noticed that this is handled a little differently in the pathogen-repo-template to simplify the field_map in the config YAML

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants