Add workflow for producing the Nextclade dengue dataset #25

Merged
merged 12 commits into from
May 30, 2024

j23414

@j23414 j23414 commented Feb 7, 2024

Description of proposed changes

Introduce a workflow dedicated to generating the Nextclade dataset for dengue serotypes and genotypes. The workflow lives in a dedicated nextclade folder, in line with pathogen-repo-guide/nextclade, and is intended to streamline dataset creation, testing, and debugging.

The changes can be summarized as follows:

  1. Establish a nextclade directory that follows pathogen-repo-guide/nextclade, starting with a copy of the Nextclade README from pathogen-repo-guide/nextclade.
  2. Copy rules from the phylogenetic workflow, since most of the rules for generating the tree.json files should be the same.
  3. Modify the rules to deal with reference-root incongruence (the reference sequence not matching the root of the reference tree).
  4. Add files and rules to assemble the Nextclade dengue dataset (e.g. pathogen.json), with rules copied from mpox.
  5. Connect rules to test the Nextclade dataset (currently failing), based on the mpox Nextclade test rule.
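
As a rough illustration of step 5, the test invocation can be sketched in Python as below. The flags and paths are copied from the Snakemake log later in this thread; the helper name `nextclade_test_command` and the `SEROTYPES` list are hypothetical and are not names used by the workflow itself.

```python
# Hypothetical helper: builds the `nextclade run` command seen in the
# test_dataset log output. Names here are illustrative, not from the workflow.
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def nextclade_test_command(serotype, sequences="resources/all/sequences.fasta"):
    """Return the argument list for testing one assembled dataset."""
    return [
        "nextclade", "run",
        "--input-dataset", f"datasets/{serotype}",
        "--output-all", f"test_output/{serotype}",
        "--silent",
        sequences,
    ]


# One command per dataset; pass a list to subprocess.run(cmd, check=True)
# to actually execute it (requires nextclade on the PATH).
commands = [nextclade_test_command(s) for s in SEROTYPES]
```

Building the command as a list rather than a single string avoids shell quoting issues if a path ever contains spaces.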

Related issue(s)

Checklist

  • Checks pass
  • Nextclade test rule passes ...

@j23414 j23414 linked an issue Feb 7, 2024 that may be closed by this pull request
@j23414 j23414 marked this pull request as ready for review May 25, 2024 01:30
@j23414 j23414 marked this pull request as draft May 25, 2024 01:49
@j23414

j23414 commented May 27, 2024

So far, this PR creates a Nextclade dataset, but it is stuck on the following errors when testing the assembled dataset:

nextstrain build nextclade test_output/all
view error for dengue/all -> fixed by C or M coordinates
[Sun May 26 05:59:38 2024]
rule test_dataset:
    input: datasets/all/tree.json, datasets/all/pathogen.json, resources/all/sequences.fasta, datasets/all/genome_annotation.gff3, datasets/all/README.md, datasets/all/CHANGELOG.md
    output: test_output/all
    jobid: 0
    reason: Missing output files: test_output/all
    wildcards: serotype=all
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/all           --output-all test_output/all           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000001
   2: Encountered a mutation (S2H) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'S', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'V'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
nextstrain build nextclade test_output/denv1
view error for dengue/denv1 -> Fixed by Jover 🥳
[Sun May 26 06:02:12 2024]
rule test_dataset:
    input: datasets/denv1/tree.json, datasets/denv1/pathogen.json, resources/all/sequences.fasta, datasets/denv1/genome_annotation.gff3, datasets/denv1/README.md, datasets/denv1/CHANGELOG.md
    output: test_output/denv1
    jobid: 0
    reason: Missing output files: test_output/denv1
    wildcards: serotype=denv1
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv1           --output-all test_output/denv1           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000043
   2: Encountered a mutation (T59A) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'T', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'Q'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
nextstrain build nextclade test_output/denv2
view error for dengue/denv2 - different error -> fixed by C or M coords
[Sun May 26 06:03:17 2024]
rule test_dataset:
    input: datasets/denv2/tree.json, datasets/denv2/pathogen.json, resources/all/sequences.fasta, datasets/denv2/genome_annotation.gff3, datasets/denv2/README.md, datasets/denv2/CHANGELOG.md
    output: test_output/denv2
    jobid: 0
    reason: Missing output files: test_output/denv2
    wildcards: serotype=denv2
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv2           --output-all test_output/denv2           --silent           resources/all/sequences.fasta
        
The application panicked (crashed).
Message:  index out of bounds: the len is 100 but the index is 100
Location: packages/nextclade/src/tree/tree_preprocess.rs:213

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
nextstrain build nextclade test_output/denv3
view error for dengue/denv3 -> fixed by C or M coords
[Sun May 26 06:06:18 2024]
rule test_dataset:
    input: datasets/denv3/tree.json, datasets/denv3/pathogen.json, resources/all/sequences.fasta, datasets/denv3/genome_annotation.gff3, datasets/denv3/README.md, datasets/denv3/CHANGELOG.md
    output: test_output/denv3
    jobid: 0
    reason: Missing output files: test_output/denv3
    wildcards: serotype=denv3
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv3           --output-all test_output/denv3           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000005
   2: When preprocessing reference tree node NODE_0000005: amino acid mutation C:I108M is outside of the peptide C (length 100). This is likely an inconsistency between reference tree, reference sequence, and genome annotation in the Nextclade dataset

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:203

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
nextstrain build nextclade test_output/denv4
view error for dengue/denv4 -> fixed by C or M coords
[Sun May 26 06:10:37 2024]
rule test_dataset:
    input: datasets/denv4/tree.json, datasets/denv4/pathogen.json, resources/all/sequences.fasta, datasets/denv4/genome_annotation.gff3, datasets/denv4/README.md, datasets/denv4/CHANGELOG.md
    output: test_output/denv4
    jobid: 0
    reason: Missing output files: test_output/denv4
    wildcards: serotype=denv4
    resources: tmpdir=/var/folders/3_/0vmyf52s7dvdr36h6nlvwt7r0000gp/T


        nextclade run           --input-dataset datasets/denv4           --output-all test_output/denv4           --silent           resources/all/sequences.fasta
        
Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving aa mutations from reference tree node NODE_0000001
   2: Encountered a mutation (S2H) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'S', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'V'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages/nextclade/src/tree/tree_preprocess.rs:226

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Because I fear inadvertently wandering off the path of acceptable solutions, it might be most helpful and efficient for someone with more experience to submit commits to this branch. From those changes, we can have a productive discussion. Please feel free to message me if anyone wants a zipped folder (results.zip) of the intermediate files.
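
The "origin state" errors above can be approximated outside Nextclade with a small check like the following. This is a minimal sketch, not Nextclade's implementation: it assumes an Auspice-style tree where each node may carry `branch_attrs.mutations` keyed by CDS name, and it assumes the translated reference peptides are already available as a dict. The function name, the CDS name "C", and the example peptide are all hypothetical.

```python
import re

# Amino acid mutation such as "S2H": origin state, 1-based position, new state.
MUTATION = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def find_origin_mismatches(node, ref_peptides, inherited=None):
    """Walk the tree and report mutations whose origin state disagrees with
    the state implied by the reference peptide plus ancestral mutations.

    ref_peptides: {cds_name: translated reference peptide} (assumed input).
    inherited:    {(cds, position): state} set by mutations on ancestors.
    """
    inherited = dict(inherited or {})  # copy so siblings stay independent
    problems = []
    mutations = node.get("branch_attrs", {}).get("mutations", {})
    for cds, muts in mutations.items():
        if cds == "nuc" or cds not in ref_peptides:
            continue  # only amino acid mutations in known CDSs are checked
        peptide = ref_peptides[cds]
        for mut in muts:
            match = MUTATION.match(mut)
            if not match:
                continue
            origin, pos, derived = match.group(1), int(match.group(2)), match.group(3)
            if pos > len(peptide):
                problems.append((node.get("name"), cds, mut, "outside peptide"))
                continue
            expected = inherited.get((cds, pos), peptide[pos - 1])
            if origin != expected:
                problems.append((node.get("name"), cds, mut, f"expected origin {expected}"))
            inherited[(cds, pos)] = derived
    for child in node.get("children", []):
        problems.extend(find_origin_mismatches(child, ref_peptides, inherited))
    return problems


# Hypothetical data shaped like the S2H case: the reference has 'V' at position 2.
tree = {
    "name": "root",
    "children": [
        {"name": "NODE_0000001", "branch_attrs": {"mutations": {"C": ["S2H"]}}},
    ],
}
print(find_origin_mismatches(tree, {"C": "MVNQ"}))
```

Running a check like this on the tree.json before assembling the dataset would surface the offending nodes without waiting for the `nextclade run` step to fail.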

@j23414 j23414 marked this pull request as ready for review May 28, 2024 22:42
Co-authored-by: Jover Lee <joverlee521@gmail.com>
Since dengue sequences seem to contain many mutations - too many for the browser
SVG engine to render efficiently in Nextclade's sequence views - we will set the
default CDS to display to the E gene as the "main" gene of interest.

The full genome and other gene/CDS regions can still be displayed by selecting
them from the dropdown menu at the top.

Flagged by the following comment:

nextstrain/nextclade_data#203 (comment)
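
A minimal sketch of the change described in this commit message, assuming the default is controlled by a top-level `defaultCds` key in pathogen.json (the key name is taken from the commit summary in this thread; the helper name and the example dict are hypothetical):

```python
import json


def set_default_cds(pathogen, cds_name):
    """Return a copy of the pathogen.json contents with defaultCds set."""
    updated = dict(pathogen)  # shallow copy so the input dict is not modified
    updated["defaultCds"] = cds_name
    return updated


# Hypothetical stand-in for the parsed contents of datasets/<serotype>/pathogen.json.
pathogen = {"files": {"reference": "reference.fasta"}}
print(json.dumps(set_default_cds(pathogen, "E"), indent=2))
```

In practice the dict would come from `json.load` on the dataset's pathogen.json and be written back with `json.dump`.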
Applies fixes to the dataset so far

1. GFF coordinate fixup
2. Adding the example sequences
3. Set defaultCds to the E gene
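
To illustrate why GFF coordinates matter for the "outside of the peptide C (length 100)" error earlier in this thread, the sketch below derives a peptide length from CDS coordinates and flags mutations beyond it. The coordinates used in the example are hypothetical and are not the values from the dataset's genome_annotation.gff3; both function names are illustrative.

```python
def peptide_length(start, end, includes_stop=False):
    """Amino acid length of a CDS given 1-based inclusive GFF3 coordinates."""
    nucleotides = end - start + 1
    if nucleotides % 3 != 0:
        raise ValueError(f"CDS length {nucleotides} is not a multiple of 3")
    codons = nucleotides // 3
    return codons - 1 if includes_stop else codons


def mutations_outside_peptide(mutations, length):
    """Return amino acid mutations (e.g. 'I108M') whose position exceeds length."""
    return [m for m in mutations if int(m[1:-1]) > length]


# Hypothetical coordinates giving a 300 nt CDS, i.e. a 100 aa peptide.
length = peptide_length(95, 394)
print(length, mutations_outside_peptide(["T59A", "I108M"], length))
```

If the annotated CDS is shorter than the region the tree's mutations were called against, positions such as 108 fall outside the peptide, which matches the shape of the error reported for denv3.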
@j23414

j23414 commented May 30, 2024

After some discussion with a few people, I may move the commits that fine-tune the "dengue/all" dataset to a new draft PR, since we are still testing solutions.

This approach allows us to merge a functional workflow for assembling a Nextclade dataset, providing a base from which we can test different solutions. @joverlee521, this scoped PR is ready for review.


@joverlee521 joverlee521 left a comment

The added workflow makes sense to me. This looks good to merge, leaving the fine-tuning of the all dataset to another PR.

I do wonder if we can just drop nextclade/datasets/ since the datasets are being officially added in nextstrain/nextclade_data#203? There's no need to maintain the datasets in two places.

@j23414

j23414 commented May 30, 2024

if we can just drop nextclade/datasets/

Yes, I wondered that as well, but then decided to keep it as a foundation for a "fine-tuning" PR or for others who might want to create separate branches to explore different solutions from the existing dataset.

My plan is to delete this when nextstrain/nextclade_data#203 is finalized and merged.

@j23414 j23414 merged commit 9c6827e into main May 30, 2024
32 checks passed
@j23414 j23414 deleted the nextclade_dataset_rules branch May 30, 2024 22:58