Persephone #28

j23414 · 2023-12-05T19:35:32Z

Description of proposed changes

The persephone branch was initially intended to live long enough to update the zika live build. However, after some internal discussion it was recommended to submit all the changes as a PR.

These commits can be broadly outlined as:

Add ingest pipeline (copied from mpox repo)
1. Use ncbi-datasets to fetch GenBank records
2. Add the zika data processing steps from fauna (17e3912)
Move the phylogenetic workflow to its own folder
1. Add rules for merging USVI data (9334700)
2. Move rules to sub snakefiles to match the pathogen-repo-template

Related issue(s)

Checklist

Checks pass

Ideally zika ingest should be added via the pathogen_template_repo. However, to update the recent zika sequences, this temporary branch will copy the ingest directory from the mpox repo: https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05 The `ingest/vendored` subdirectory is not copied over since that folder should be added with `git subtree`. https://github.com/nextstrain/mpox/tree/ed4a15c4d8476cc96fb2126cc7a833b36a825e05/ingest#ingestvendored Future commits will change this to work with Zika data.

…/vendored subrepo: subdir: "ingest/vendored" merged: "a0faef5" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "a0faef5" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"

Removal of Nextclade related rules, pending the compilation of a Nextclade zika dataset and potential v3 changes.

Organize field renaming into two parts. 1. Rename the NCBI output columns to match the NCBI mnemonics¹ (see "ncbi_datasets_fields:" in `config/config.yaml`) 2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names² (see "transform: fieldmap:" in `config/config.yaml`) ¹ https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields ² https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html

Rescue some of the original functionality of the zika_upload script from fauna. https://github.com/nextstrain/fauna/blob/master/vdb/zika_upload.py#L14-L30

Move phylogenetic workflow from top-level to folder phylogenetic in order to follow the Pathogen Repo Template: https://github.com/nextstrain/pathogen-repo-template

The original Zika build contained USVI data that had been posted publiclly to GitHub but not yet submitted to NCBI GenBank. This commit adds rules to merge the USVI data with the NCBI GenBank data. Since USVI does not have a genbank_accession column, we create a new accession column for both USVI and NCBI GenBank accessions. This accession column is then used as the strain_id column for the phylogenetic build. Since auspice automagically generates a NCBI GenBank url for "genbank_accession" fields, we use a "url" field instead, allowing for a mix of GenBank and GitHub urls to be used in the strain popup window.

Part of work to update this repo to match the pathogen-repo-template.

Part of work to update this repo to match the pathogen-repo-template. We are removing the `source-data` directory and consolidating everything into the config directory since there's really no functional difference between them. nextstrain/mpox@c34f370

Part of work to update this repo to match the pathogen-repo-template. Always provide default config values that can then easily be overridden with --configfiles/--config options. This makes it simpler to change a subset of config values or extend the default configs. Renames the config.yaml to defaults.yaml to reflect this change. nextstrain/mpox@1e01fed

Use the stopgap from mpox (nextstrain/mpox#214) until the pathogen-repo-ci is updated. However, we may be moving to using config/customization/ci in the future, so revisit this commit and edit accordingly. nextstrain/pathogen-repo-guide#24

https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#define-input-paths-with-literal-path-strings

joverlee521

Thanks for following the progress with the pathogen-repo-template in the zika repo @j23414!

I left mostly non-blocking comments that can be future improvements. Since this branch's workflows are being used to produce the live builds at nextstrain.org/zika I would prefer to get this merged ASAP and continue improvements in future PRs.

ingest/bin/post_process_metadata.py

ingest/workflow/snakemake_rules/transform.smk

joverlee521 · 2024-01-16T20:33:37Z

scripts/set_final_strain_name.py

non-blocking

Another reminder for us to tackle the issue of accession vs strain name display in Auspice across the board with nextstrain/auspice#1668

joverlee521 · 2024-01-16T22:50:13Z

scripts/set_final_strain_name.py

+        if saveoldcolumn == "accession":
+            node["node_attrs"][saveoldcolumn] = node["name"]
+            node["node_attrs"]["url"] = "https://www.ncbi.nlm.nih.gov/nuccore/" + node["name"]
+        elif saveoldcolumn == "genbank_accession":
+            node["node_attrs"][saveoldcolumn] = {}
+            node["node_attrs"][saveoldcolumn]["value"] = node["name"]
+        else:
+            node["node_attrs"][saveoldcolumn] = node["name"]


Is this workaround still needed since the augur export v2 no longer loses the attributes after nextstrain/augur#1261?

After discussion with @j23414: this is currently blocked by needing to add metadata as a coloring to keep accession and genbank_accession.
Maybe solved in the future by nextstrain/augur#1384

joverlee521 · 2024-01-16T23:21:58Z

scripts/set_final_strain_name.py


-    replace_name_recursive(data['tree'], name_lookup)
+        replace_name_recursive(data['tree'], name_lookup, args.metadata_id_columns[0])


The hard-coded index really stood out to me here, but maybe can be removed if the work around is no longer needed.

joverlee521 · 2024-01-16T23:33:55Z

phylogenetic/Snakefile

@@ -1,5 +1,4 @@
-if not config:
-    configfile: "config/config_zika.yaml"
+configfile: "config/config_zika.yaml"


non-blocking

With this change to always use the default config YAML, the config access throughout the workflow can be updated from config.get("strain_id_field", "strain") to config["strain_id_field"] without the hard-coded default value. Embedding default values throughout the Snakemake workflow has potential to be confusing in the future.

CONTRIBUTING.md

This commit refactors the generic post-processing script to better align with its specific purpose in Zika ingest. The purpose of this script is to fix zika strain names based on historical modifications from the fauna repo. In summary the following changes: * Rename script to fix-zika-strain-names.py to match the purpose * Add a docstring to the script * Replace the accession argument with a strain field argument, which is the field that needs to be fixed

* Rephrase the CI section to be less likely to get out of sync * Drop the snakefmt section since we are not currently using it. The snakefmt may be added back in the future. Co-authored-by: Jover Lee <joverlee521@gmail.com>

joverlee521 · 2024-01-18T23:41:07Z

ingest/rules/transform.smk

Ah, as noted in our chat, this should be renamed to curate.smk.

joverlee521 · 2024-01-18T23:41:24Z

ingest/rules/transform.smk

+        """
+
+
+rule transform:


Ditto rename to curate.

joverlee521 · 2024-01-18T23:43:41Z

ingest/rules/transform.smk

+    log:
+        "logs/transform.txt",
+    params:
+        field_map=config["transform"]["field_map"],


Ah, just noticed that this is handled a little differently in the pathogen-repo-template to simplify the field_map in the config YAML

j23414 added 18 commits November 13, 2023 13:21

Replace mpox text and taxon id with zika

38007bb

Remove Nextclade related rules

c475361

Removal of Nextclade related rules, pending the compilation of a Nextclade zika dataset and potential v3 changes.

Clear pathogen specific user provided annotations and rules

d154a88

Rescue fauna data processing steps that are specific to Zika

17e3912

Rescue some of the original functionality of the zika_upload script from fauna. https://github.com/nextstrain/fauna/blob/master/vdb/zika_upload.py#L14-L30

Ignore snakemake state dir for current and subfolders

6bdee23

Use genbank_accession column as ID column

41c902f

workaround to get accession links to work

7b37112

Move phylogenetic workflow to a phylogenetic folder

9071f02

Move phylogenetic workflow from top-level to folder phylogenetic in order to follow the Pathogen Repo Template: https://github.com/nextstrain/pathogen-repo-template

Move rules for preparing sequences to its own smk file

11123b9

Part of work to update this repo to match the pathogen-repo-template.

Move rules for constructing phylogeny to its own smk file

59ef926

Part of work to update this repo to match the pathogen-repo-template.

Move rules for annotating phylogeny to its own smk file

4467646

Part of work to update this repo to match the pathogen-repo-template.

Move rules for exporting auspice json to its own smk file

6976946

Part of work to update this repo to match the pathogen-repo-template.

j23414 force-pushed the persephone branch from a37480e to 3124b18 Compare December 18, 2023 23:35

j23414 added 2 commits January 10, 2024 15:20

Use a rules folder for ingest to follow pathogen-repo-template

f388525

Update the CI workflow

efe11e3

Use the stopgap from mpox (nextstrain/mpox#214) until the pathogen-repo-ci is updated. However, we may be moving to using config/customization/ci in the future, so revisit this commit and edit accordingly. nextstrain/pathogen-repo-guide#24

j23414 force-pushed the persephone branch from 6b83abc to efe11e3 Compare January 10, 2024 23:37

Define input paths with literal path strings

6349fd7

https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#define-input-paths-with-literal-path-strings

j23414 marked this pull request as ready for review January 11, 2024 23:54

j23414 added 2 commits January 12, 2024 10:50

Copy contributing docs from mpox

44825a3

Simplify README instructions

aefdec1

j23414 requested a review from a team January 12, 2024 19:09

joverlee521 self-requested a review January 13, 2024 01:11

joverlee521 approved these changes Jan 16, 2024

View reviewed changes

jameshadfield reviewed Jan 17, 2024

View reviewed changes

CONTRIBUTING.md Outdated Show resolved Hide resolved

j23414 and others added 5 commits January 18, 2024 08:00

Abbreviate authors inplace instead of a separate field

a99af29

As discussed, always use the default config

230c6b0

Update Contributing docs

8cf7326

* Rephrase the CI section to be less likely to get out of sync * Drop the snakefmt section since we are not currently using it. The snakefmt may be added back in the future. Co-authored-by: Jover Lee <joverlee521@gmail.com>

fixup: add docstring

0560686

joverlee521 reviewed Jan 18, 2024

View reviewed changes

ingest/rules/transform.smk

Copy link

Contributor

joverlee521 Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, as noted in our chat, this should be renamed to curate.smk.

joverlee521 reviewed Jan 18, 2024

View reviewed changes

ingest/rules/transform.smk

"""

rule transform:

Copy link

Contributor

joverlee521 Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto rename to curate.

joverlee521 reviewed Jan 18, 2024

View reviewed changes

j23414 merged commit 5f11069 into main Jan 19, 2024
8 checks passed

j23414 deleted the persephone branch January 19, 2024 18:40

This was referenced Jan 19, 2024

Harmonize with pathogen-repo-guide #29

Closed

Update zika instructions nextstrain/fauna#152

Merged

Harmonize with pathogen repo guide #31

Merged

joverlee521 mentioned this pull request Jan 24, 2024

Add ingest nextstrain/measles#10

Merged

This was referenced Feb 6, 2024

Drop accession workaround when nextstrain/augur 1384 is solved #32

Open

Update existing pathogen repos to vendor ingest scripts nextstrain/ingest#9

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Persephone #28

Persephone #28

j23414 commented Dec 5, 2023 •

edited

Loading

joverlee521 left a comment

joverlee521 Jan 16, 2024

joverlee521 Jan 16, 2024

joverlee521 Jan 18, 2024

joverlee521 Jan 16, 2024

joverlee521 Jan 16, 2024

joverlee521 Jan 18, 2024

joverlee521 Jan 18, 2024

joverlee521 Jan 18, 2024


		replace_name_recursive(data['tree'], name_lookup)
		replace_name_recursive(data['tree'], name_lookup, args.metadata_id_columns[0])

Persephone #28

Persephone #28

Conversation

j23414 commented Dec 5, 2023 • edited Loading

Description of proposed changes

Related issue(s)

Checklist

joverlee521 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

j23414 commented Dec 5, 2023 •

edited

Loading