Add workflow for producing the Nextclade dengue dataset #25

j23414 · 2024-02-07T19:12:48Z

Description of proposed changes

Introduce a workflow that generates the Nextclade datasets for dengue serotypes and ~~subtypes~~ genotypes. The workflow lives in a dedicated nextclade folder, following pathogen-repo-guide/nextclade, and is intended to streamline dataset creation, testing, and debugging.

The changes can be summarized as follows:

1. Establish a nextclade directory that follows pathogen-repo-guide/nextclade, starting from a copy of the Nextclade README in the pathogen-repo-guide repository.
2. Copy rules from the phylogenetic workflow, since most of the rules for generating the tree.json files should be the same.
3. Modify the rules to deal with reference-root incongruence.
4. Add files and rules to assemble the Nextclade dengue dataset (e.g. pathogen.json), with rules copied from mpox.
5. Connect rules to test the Nextclade dataset (currently failing), based on the mpox Nextclade test rule.
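
For anyone who wants to reproduce the test step outside Snakemake, here is a minimal Python sketch that builds the same `nextclade run` invocation that appears in the test logs in this thread. The flags and paths are taken from those logs; the function name `nextclade_test_command` and the `SEROTYPES` list are hypothetical helpers for illustration, not part of the workflow.

```python
# Hypothetical helper: mirrors the `nextclade run` call shown in the test logs.
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def nextclade_test_command(serotype, sequences="resources/all/sequences.fasta"):
    """Return the argv list for testing one assembled dataset."""
    if serotype not in SEROTYPES:
        raise ValueError(f"unknown serotype: {serotype}")
    return [
        "nextclade", "run",
        "--input-dataset", f"datasets/{serotype}",
        "--output-all", f"test_output/{serotype}",
        "--silent",
        sequences,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for serotype in SEROTYPES:
        print(" ".join(nextclade_test_command(serotype)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` from inside the nextclade directory should behave like the rule's shell block, assuming `nextclade` is on the PATH.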

Related issue(s)

Checklist

Checks pass
Nextclade test rule passes ...

nextclade/rules/annotate_phylogeny.smk

https://github.com/nextstrain/pathogen-repo-guide/blob/f33c43edd9ebad10aa0e8d2b0791755ddbe2f5c8/nextclade/README.md

https://github.com/nextstrain/dengue/tree/75d9c5fc01e48d1d8385b11fd8cf295ec5b995c2/phylogenetic Subsequent commits will reuse the phylogenetic config and bin directories to avoid duplication.

j23414 · 2024-05-27T17:44:33Z

This PR so far creates a Nextclade dataset but is stuck on the following errors when testing the assembled datasets:

nextstrain build nextclade test_output/all

~~view error for dengue/all~~ -> fixed by C or M coordinates

[Sun May 26 05:59:38 2024]
rule test_dataset:
    input: datasets/all/tree.json, datasets/all/pathogen.json, resources/all/sequences.fasta, datasets/all/genome_annotation.gff3, datasets/all/README.md, datasets/all/CHANGELOG.md
    output: test_output/all
    jobid: 0
    reason: Missing output files: test_output/all
    wildcards: serotype=all
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/all           --output-all test_output/all           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000001
   2: Encountered a mutation (S2H) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'S', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'V'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

nextstrain build nextclade test_output/denv1

~~view error for dengue/denv1~~ -> Fixed by Jover 🥳

[Sun May 26 06:02:12 2024]
rule test_dataset:
    input: datasets/denv1/tree.json, datasets/denv1/pathogen.json, resources/all/sequences.fasta, datasets/denv1/genome_annotation.gff3, datasets/denv1/README.md, datasets/denv1/CHANGELOG.md
    output: test_output/denv1
    jobid: 0
    reason: Missing output files: test_output/denv1
    wildcards: serotype=denv1
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv1           --output-all test_output/denv1           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000043
   2: Encountered a mutation (T59A) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'T', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'Q'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

nextstrain build nextclade test_output/denv2

~~view error for dengue/denv2 - different error~~ -> fixed by C or M coords

[Sun May 26 06:03:17 2024]
rule test_dataset:
    input: datasets/denv2/tree.json, datasets/denv2/pathogen.json, resources/all/sequences.fasta, datasets/denv2/genome_annotation.gff3, datasets/denv2/README.md, datasets/denv2/CHANGELOG.md
    output: test_output/denv2
    jobid: 0
    reason: Missing output files: test_output/denv2
    wildcards: serotype=denv2
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv2           --output-all test_output/denv2           --silent           resources/all/sequences.fasta
        
The application panicked (crashed).
Message:  index out of bounds: the len is 100 but the index is 100
Location: packages/nextclade/src/tree/tree_preprocess.rs:213

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

nextstrain build nextclade test_output/denv3

~~view error for dengue/denv3~~ -> fixed by C or M coords

[Sun May 26 06:06:18 2024]
rule test_dataset:
    input: datasets/denv3/tree.json, datasets/denv3/pathogen.json, resources/all/sequences.fasta, datasets/denv3/genome_annotation.gff3, datasets/denv3/README.md, datasets/denv3/CHANGELOG.md
    output: test_output/denv3
    jobid: 0
    reason: Missing output files: test_output/denv3
    wildcards: serotype=denv3
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv3           --output-all test_output/denv3           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000005
   2: When preprocessing reference tree node NODE_0000005: amino acid mutation C:I108M is outside of the peptide C (length 100). This is likely an inconsistency between reference tree, reference sequence, and genome annotation in the Nextclade dataset

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:203

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

nextstrain build nextclade test_output/denv4

~~view error for dengue/denv4~~ -> fixed by C or M coords

[Sun May 26 06:10:37 2024]
rule test_dataset:
    input: datasets/denv4/tree.json, datasets/denv4/pathogen.json, resources/all/sequences.fasta, datasets/denv4/genome_annotation.gff3, datasets/denv4/README.md, datasets/denv4/CHANGELOG.md
    output: test_output/denv4
    jobid: 0
    reason: Missing output files: test_output/denv4
    wildcards: serotype=denv4
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv4           --output-all test_output/denv4           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000001
   2: Encountered a mutation (S2H) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'S', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'V'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Because I am wary of inadvertently wandering off the path of acceptable solutions, it might be most helpful and efficient for someone with more experience to push commits to this branch. From those changes, we can have a productive discussion. Please feel free to message me if anyone wants a zipped folder (results.zip) of the intermediate files.
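
To help narrow down which nodes trigger the errors above, a rough Python sketch of the same kind of consistency check might look like the following. It assumes the tree follows the Auspice JSON layout, where each node carries `branch_attrs.mutations` keyed by CDS name (plus `nuc`), and that the caller supplies the reference peptides as a dict. The function name `find_inconsistencies` and the `ref_peptides` input are hypothetical; this is not Nextclade's implementation, only an approximation of what its error messages describe.

```python
import re

# Amino-acid mutation strings such as "S2H": origin state, 1-based position, derived state.
MUTATION = re.compile(r"^([A-Z*\-])(\d+)([A-Z*\-])$")


def find_inconsistencies(tree, ref_peptides):
    """Report aa mutations in the tree that conflict with the reference peptides.

    tree: root node of an Auspice-style tree ("name", "branch_attrs", "children").
    ref_peptides: dict of CDS name -> reference amino-acid string (assumed input).
    Returns a list of (node_name, cds, mutation, reason) tuples.
    """
    problems = []
    # Each stack entry pairs a node with the states its ancestors already set.
    stack = [(tree, {})]
    while stack:
        node, inherited = stack.pop()
        states = dict(inherited)
        name = node.get("name", "?")
        mutations = node.get("branch_attrs", {}).get("mutations", {})
        for cds, muts in mutations.items():
            if cds == "nuc":
                continue  # only amino-acid mutations are checked here
            peptide = ref_peptides.get(cds)
            if peptide is None:
                problems.append((name, cds, "", "CDS missing from reference peptides"))
                continue
            for mut in muts:
                match = MUTATION.match(mut)
                if match is None:
                    problems.append((name, cds, mut, "unparseable mutation"))
                    continue
                origin, pos, derived = match.group(1), int(match.group(2)), match.group(3)
                if pos < 1 or pos > len(peptide):
                    problems.append(
                        (name, cds, mut, f"outside peptide (length {len(peptide)})")
                    )
                    continue
                # State at this position: last ancestral mutation, else the reference.
                current = states.get((cds, pos), peptide[pos - 1])
                if current != origin:
                    problems.append(
                        (name, cds, mut, f"expected origin '{current}', found '{origin}'")
                    )
                states[(cds, pos)] = derived
        for child in node.get("children", []):
            stack.append((child, states))
    return problems
```

Running this against the `tree` object in datasets/<serotype>/tree.json, with peptides translated from the reference and annotation, should list every offending node at once rather than only the first one Nextclade stops on.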

nextclade/resources/denv1/genome_annotation.gff3

Co-authored-by: Jover Lee <joverlee521@gmail.com>

Since dengue sequences seem to contain many mutations - too many for the browser SVG engine to render efficiently in Nextclade's sequence views - we set the default CDS to display to the E gene as the "main" gene of interest. The full genome and other gene/CDS regions can still be displayed by selecting them from the dropdown menu at the top. Flagged by the following comment: nextstrain/nextclade_data#203 (comment)
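
As a rough illustration of that change, the sketch below sets `defaultCds` in a pathogen.json file. The key name comes from the commit message; placing it at the top level of the JSON object is an assumption, and the function name `set_default_cds` is hypothetical.

```python
import json
from pathlib import Path


def set_default_cds(pathogen_json, cds_name="E"):
    """Set the CDS shown by default and rewrite the file in place.

    Assumes "defaultCds" is a top-level key of pathogen.json.
    """
    path = Path(pathogen_json)
    data = json.loads(path.read_text())
    data["defaultCds"] = cds_name
    path.write_text(json.dumps(data, indent=2) + "\n")
    return data
```

For example, `set_default_cds("datasets/denv1/pathogen.json")` would update the denv1 dataset, assuming the file already exists at that path.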

Applies fixes to the dataset so far:

1. Gff coordinate fixup
2. Adding the example sequences
3. Set defaultCds to the E gene
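
A quick way to sanity-check annotation coordinates is to compute the amino-acid length each CDS implies and compare it with the positions of mutations in the tree. The sketch below assumes standard nine-column GFF3, single-strand features, and that the CDS name sits in a `Name` attribute; the function name `cds_aa_lengths`, the `name_key` default, and the coordinates in the usage note are illustrative, not taken from the dataset.

```python
def cds_aa_lengths(gff3_text, feature_type="CDS", name_key="Name"):
    """Return {feature name: amino-acid length} for features of one type.

    Segments that share a name are summed before dividing by three.
    Assumes the name is stored under `name_key`, falling back to "ID".
    """
    nucleotides = {}
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        start, end = int(cols[3]), int(cols[4])  # GFF3 is 1-based, inclusive
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        name = attrs.get(name_key, attrs.get("ID", "unknown"))
        nucleotides[name] = nucleotides.get(name, 0) + (end - start + 1)
    return {name: total // 3 for name, total in nucleotides.items()}
```

With a made-up feature spanning 95..394, the function reports 100 amino acids, so a mutation at position 108 in that CDS would fall outside the peptide, which is the kind of mismatch the denv3 error describes.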

j23414 · 2024-05-30T22:26:17Z

After some discussion with a few people, I may move the commits that fine-tune the "dengue/all" dataset to a new draft PR, since we are still testing solutions.

This approach allows us to merge a functional workflow for assembling a Nextclade dataset, providing a base from which we can test different solutions. @joverlee521, this scoped PR is ready for review.

joverlee521

The added workflow makes sense to me. This looks good to merge, leaving fine-tuning of the all dataset for another PR.

I do wonder if we can just drop nextclade/datasets/ since the datasets are being officially added in nextstrain/nextclade_data#203? There's no need to maintain the datasets in two places.

j23414 · 2024-05-30T22:41:51Z

> if we can just drop nextclade/datasets/

Yes, I wondered that as well, but then decided to keep it as a foundation for a "fine-tuning" PR, or for others who might want to create separate branches to explore different solutions from the existing dataset.

My plan is to delete this when nextstrain/nextclade_data#203 is finalized and merged.

j23414 linked an issue Feb 7, 2024 that may be closed by this pull request

Add workflow for producing the Nextclade dengue dataset #21

Open

j23414 commented Feb 8, 2024

nextclade/rules/annotate_phylogeny.smk

This was referenced Feb 10, 2024

Fix: update dropped strains file to list accession instead of strain names #26

Merged

Split by dengue serotype (denv1-denv4) #19

Closed

joverlee521 mentioned this pull request Feb 14, 2024

Nextclade assignment #16

Merged

2 tasks

j23414 mentioned this pull request Mar 18, 2024

Harmonize ingest with pathogen repo guide #35

Merged

1 task

j23414 force-pushed the nextclade_dataset_rules branch 5 times, most recently from 8c7755d to e8b059b Compare May 22, 2024 19:04

This was referenced May 23, 2024

Add workflow for producing the Nextclade dengue dataset #21

Open

Prepare phylogenetic workflow for Nextclade workflow #57

Merged

j23414 added 2 commits May 24, 2024 18:20

Copy nextclade README from the pathogen-repo-guide

6d00517

https://github.com/nextstrain/pathogen-repo-guide/blob/f33c43edd9ebad10aa0e8d2b0791755ddbe2f5c8/nextclade/README.md

Copy rules from phylogenetic workflow

2f1b103

https://github.com/nextstrain/dengue/tree/75d9c5fc01e48d1d8385b11fd8cf295ec5b995c2/phylogenetic Subsequent commits will reuse the phylogenetic config and bin directories to avoid duplication.

j23414 force-pushed the nextclade_dataset_rules branch from e8b059b to 5717648 Compare May 25, 2024 01:30

j23414 marked this pull request as ready for review May 25, 2024 01:30

j23414 marked this pull request as draft May 25, 2024 01:49

j23414 added 5 commits May 24, 2024 20:39

Assemble Nextclade dataset

8365267

Connect rules to test nextclade dataset

0a19071

Update docs

425b4b5

Reuse phylogenetic config and bin folders

a249bf2

Only assemble genome dataset

d1fef70

j23414 force-pushed the nextclade_dataset_rules branch from 5717648 to d1fef70 Compare May 26, 2024 03:06

wip: datasets

bc07dac

joverlee521 reviewed May 28, 2024

nextclade/resources/denv1/genome_annotation.gff3 (outdated)

j23414 marked this pull request as ready for review May 28, 2024 22:42

j23414 force-pushed the nextclade_dataset_rules branch from 6c778db to 9705755 Compare May 30, 2024 22:08

Fixup: match denv1 C and M gene annotations to genbank file

6a442ed

Co-authored-by: Jover Lee <joverlee521@gmail.com>

j23414 added 3 commits May 30, 2024 15:13

Add example sequences for each DENV serotype

61c6c65

fixup: dataset

baf0263

Applies fixes to the dataset so far:

1. Gff coordinate fixup
2. Adding the example sequences
3. Set defaultCds to the E gene

j23414 force-pushed the nextclade_dataset_rules branch from 9705755 to baf0263 Compare May 30, 2024 22:16

j23414 requested a review from joverlee521 May 30, 2024 22:17

joverlee521 approved these changes May 30, 2024

j23414 merged commit 9c6827e into main May 30, 2024
32 checks passed

j23414 deleted the nextclade_dataset_rules branch May 30, 2024 22:58
