Pangolin flag #593

sidneymbell · 2021-03-26T15:35:32Z

Draft implementation for fixing #557.

I need to spend time debugging and testing, but would love any high-level feedback on the approach if you have a minute.

cc @ttung

Arguably should be going through "combine_metadata" but I can't quite figure out how to do that properly in this context.

sidneymbell · 2021-04-01T00:34:06Z

Alright, this branch now:
A. Does not run pangolin if the config flag is not present
B. Does run pangolin if the config flag is present
C. Produces the expected output pangolineages.csv with minimal runtime increase
D. Attempts to export this output via node_data.

I have two main questions (probably for @huddlej or @jameshadfield ?) before I land this:

I believe the more appropriate way of exporting the information in pangolineages.csv is by adding it to the combined_metadata.tsv. Presumably this needs to be fed into combine_metadata somehow. However, this is rather confusing to me with the new multiple inputs setup.

Would love suggestions on how to implement this metadata addendum (I can then remove the make_pangolin_node_data.py script.

Sidenote: (D) is currently broken, but given that I don't think this is the path we want to go down, putting those questions aside for now.

On my local instance, I've gone ahead and added pangolin to my nextstrain.yaml via the route suggested here. This works well, at least on my machine. However, I'm not sure this is necessarily a desired addition as it adds more dependencies that some users may not want?

I'm unfamiliar with the conda: argument in snakemake. Is this a way to specify a separate env file to be used for that specific rule? If so, does this seem like a reasonable approach to separating out the dependencies?

rneher · 2021-04-02T14:35:15Z

@sidneymbell

this looks mostly good, but the node data output needs be

{
"produced_by":"pangolin",
"nodes": {
  "tax1": {"pango": "B.34.2323.454"},
  "tax2": {"pango": "X.3.3.4"},
...
}
}

Your implementation is missing the explicit attribute name.

besides, we would need to figure out how this interacts with the Pango lineage assignment we do from GISAID metadata.

huddlej

@sidneymbell I made some specific comments below, but maybe most are not helpful if the goal is to not have a new node data JSON.

Is the most helpful outcome here to add a "Nextstrain-approved" pangolin lineage column to the combined metadata (local + GISAID, for example) prior to subsampling? Or is the goal more specifically for PHLs to reannotate their own metadata file(s) prior to filtering, subsampling, etc.? The specific implementation could be different depending on which of those use cases you need to support.

scripts/make_pangolin_node_data.py

workflow/snakemake_rules/main_workflow.smk

sidneymbell · 2021-04-02T20:22:40Z

Thanks so much, @rneher and @huddlej!

The goal for us is to enable our partner PHLs to see up-to-date pangolin lineage assignments for both the contextual gisaid data and their local data both (1) on the tree and (2, very impt) in the "download metadata" tsv file from the auspice frontend.

Would reimplementing the make_pangolin_node_data.py to instead feed through combine_metadata accomplish both (1) and (2)? If so, I'd love guidance on where/how to insert this in the mainworkflow.smk (or elsewhere...?).

Re: these assignments vs gisaid's, I'd propose having these pangolin assignments (if run) supercede the gisaid assignments for two reasons:

these are more comprehensive as they also include local samples
discrepancies between the gisaid and local pangolineages should be minor; if there are any discrepancies, taking all the lineages from one set at least keeps things internally consistent

huddlej · 2021-04-02T21:03:40Z

The goal for us is to enable our partner PHLs to see up-to-date pangolin lineage assignments for both the contextual gisaid data and their local data both (1) on the tree and (2, very impt) in the "download metadata" tsv file from the auspice frontend

Perfect! It looks like most (maybe all?) node data annotations in the auspice JSON are available through Auspice's metadata download. Since the current workflow mainly flows from raw metadata and sequences toward node data JSONs aggregated into an auspice JSON, the simplest way to achieve the goal is to export the new Pangolin annotations in a node data JSON like you've setup here.

You don't have to choose between existing annotations and the re-annotation. For instance, you can provide an argument like --attribute-name to the script that converts the Pangolin output to a node data JSON and use the value of that argument to set the attribute name of the lineages in the output JSON. This would allow us to annotate lineages as a separate field or in the same field which will override the existing metadata values.

Someone from @nextstrain/core can add the pangolin package to the nextstrain Conda package and Docker image in separate PRs. Before we make those changes, maybe we can get this PR working fully with the --use-conda flag? I can help test on OS X and the Hutch Linux cluster.

sidneymbell · 2021-04-02T23:51:01Z

Thanks for your help, @huddlej !

Recent changes:

Node data export works, confirmed that it shows up in the metadata.tsv file (🎉 ) under the attribute name PANGO lineages - local (suggestions welcome)
Tidied up old comments, etc.
Added pangolin's dependencies to the yaml file, tested on my local machine that the --use-conda flag produces a clean build

Opening full PR for review, looking forward to any additional suggestions you may have.

huddlej

This worked for me as it is, @sidneymbell, but I made a couple of requests for minor Snakemake surgery and a simpler conda environment file. Let me know if you have any questions. We can also chat online (offline?), if it helps to be more synchronous.

Next, we'll have to figure out the best way to get pangolin and pangolearn in the Nextstrain Docker image, but that can be a separate PR.

scripts/make_pangolin_node_data.py

workflow/envs/nextstrain.yaml

workflow/snakemake_rules/main_workflow.smk

sidneymbell · 2021-04-12T20:24:37Z

As always, thanks for such a detailed, thoughtful and constructive review, @huddlej ! I believe I've implemented all of the suggested changes, and this runs well on my box using the --use-conda flag. LMK if you encounter any bugs or have additional suggestions. Thanks!

huddlej

This looks good, @sidneymbell. I had only two small edits:

conda-forge has to appear first in the channel list of the conda environment
the run_pangolin rule needs to write out to the log file with 2>&1 | tee {log}

After those changes, we can merge! 🎉

At the last Nextstrain meeting, we discussed adding the pangolin tools to the base Nextstrain Docker image vs. maintaining a separate image for the ncov workflow. The consensus was to add these tools to the base Docker image, so I'll make a separate issue/PR for that change once I figure out how to install pangolin without Conda. :)

workflow/envs/nextstrain.yaml

sidneymbell added 4 commits March 24, 2021 12:46

Draft pangolin rule

bc8c213

Add script to create node data from CSV

665869b

Arguably should be going through "combine_metadata" but I can't quite figure out how to do that properly in this context.

Comments

0c8e3f2

Bug bashing

ce4505e

rneher mentioned this pull request Apr 2, 2021

Add Pango lineages to the GenBank ingest nextstrain/ncov-ingest#129

Closed

huddlej reviewed Apr 2, 2021

View reviewed changes

sidneymbell added 2 commits April 2, 2021 13:23

Add in pangolin's dependencies

35859eb

Remove unnecessary threads declaration

dcf2470

sidneymbell added 2 commits April 2, 2021 16:43

Add attribute name

ed00198

Fix pandas version issue

5390ffe

sidneymbell marked this pull request as ready for review April 2, 2021 23:51

sidneymbell requested a review from huddlej April 2, 2021 23:51

huddlej requested changes Apr 6, 2021

View reviewed changes

sidneymbell added 5 commits April 12, 2021 12:56

make attr name a param

1f38082

reset changes to dependencies

1b72bf7

tidy up comments

f03ef5b

yay pango conda packages!

1eed5c9

Threads, benchmarking, another couple comments

8eaa3aa

sidneymbell requested a review from huddlej April 12, 2021 20:23

huddlej requested changes Apr 12, 2021

View reviewed changes

workflow/envs/nextstrain.yaml Outdated Show resolved Hide resolved

Log & conda order

9597125

sidneymbell requested a review from huddlej April 12, 2021 21:55

sidneymbell merged commit 2cf2133 into nextstrain:master Apr 12, 2021

sidneymbell deleted the pangolin-flag branch April 12, 2021 22:14

huddlej mentioned this pull request Nov 25, 2021

Pulling pre-computed pangolin is spotty; add flag to enable bespoke pangolin run #557

Closed

tsibley mentioned this pull request Dec 7, 2022

IQtree is locked to v1.6 rather than something modern like v2.2 nextstrain/conda-base#7

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pangolin flag #593

Pangolin flag #593

sidneymbell commented Mar 26, 2021

sidneymbell commented Apr 1, 2021 •

edited

Loading

rneher commented Apr 2, 2021 •

edited

Loading

huddlej left a comment

sidneymbell commented Apr 2, 2021

huddlej commented Apr 2, 2021

sidneymbell commented Apr 2, 2021

huddlej left a comment

sidneymbell commented Apr 12, 2021

huddlej left a comment

Pangolin flag #593

Pangolin flag #593

Conversation

sidneymbell commented Mar 26, 2021

sidneymbell commented Apr 1, 2021 • edited Loading

I have two main questions (probably for @huddlej or @jameshadfield ?) before I land this:

rneher commented Apr 2, 2021 • edited Loading

huddlej left a comment

Choose a reason for hiding this comment

sidneymbell commented Apr 2, 2021

huddlej commented Apr 2, 2021

sidneymbell commented Apr 2, 2021

huddlej left a comment

Choose a reason for hiding this comment

sidneymbell commented Apr 12, 2021

huddlej left a comment

Choose a reason for hiding this comment

sidneymbell commented Apr 1, 2021 •

edited

Loading

rneher commented Apr 2, 2021 •

edited

Loading