Split by dengue serotype (denv1-denv4) #19

Closed
2 of 3 tasks
j23414 opened this issue Feb 6, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments

@j23414
Contributor

j23414 commented Feb 6, 2024

Description

Implement strategies (or an ensemble of strategies for cross-validation) to produce pairs of sequences_{serotype}.fasta and metadata_{serotype}.tsv files.

Context

Following the merge of #13, all ingested dengue records now exist in a unified pair of sequences.fasta and metadata.tsv files.

For the subsequent analysis in #18, and to stay consistent with the previous approach and integrate cleanly with the phylogenetic pipeline, these files need to be split by dengue serotype (i.e., sequences_denv1.fasta through sequences_denv4.fasta, each with a matching metadata file).

Possible solution(s)

Rely on NCBI taxon id annotations for serotype segregation

Historically, for dengue, we fetched each serotype individually in this code, which led to redundant fetching and processing of each sequence. Now that we're using NCBI Datasets, the numeric taxon IDs are recorded in a virus-tax-id field that we can use to separate the serotypes.

Notably, this method risks dropping sequences from the individual serotype builds when NCBI did not annotate the record with the lineage ID (~3k records). Those records could potentially be classified further with an all-serotype Nextclade dataset.
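A minimal sketch of the taxon-ID split described above, not the pipeline's actual implementation. It assumes the metadata column names `accession` and `virus_tax_id` (both hypothetical; the real column names may differ) and that FASTA headers start with the accession. The four taxon IDs in the mapping should be verified against NCBI Taxonomy before use. Records whose taxon ID is not serotype-level are returned so they can be handled separately.

```python
import csv
import os
from collections import defaultdict

# NCBI taxon IDs for dengue virus 1-4 (assumption: verify against NCBI Taxonomy).
TAXID_TO_SEROTYPE = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def read_fasta(path):
    """Yield (record_id, sequence) pairs; record_id is the first header token."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                tokens = line[1:].split()
                name, chunks = (tokens[0] if tokens else ""), []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_serotype(metadata_path, sequences_path, out_dir=".",
                      id_column="accession", taxid_column="virus_tax_id"):
    """Write metadata_{serotype}.tsv / sequences_{serotype}.fasta pairs.

    Returns the IDs of records whose taxon ID did not map to a serotype.
    Column names are hypothetical defaults, not confirmed by the pipeline.
    """
    serotype_of = {}
    rows_by_serotype = defaultdict(list)
    unassigned = []
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = reader.fieldnames
        for row in reader:
            serotype = TAXID_TO_SEROTYPE.get(row[taxid_column].strip())
            if serotype is None:
                unassigned.append(row[id_column])
                continue
            serotype_of[row[id_column]] = serotype
            rows_by_serotype[serotype].append(row)

    # One metadata file per serotype, even if it ends up with only a header.
    for serotype in TAXID_TO_SEROTYPE.values():
        path = os.path.join(out_dir, f"metadata_{serotype}.tsv")
        with open(path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                    delimiter="\t", lineterminator="\n")
            writer.writeheader()
            writer.writerows(rows_by_serotype[serotype])

    # Stream the FASTA once, routing each record to its serotype file.
    handles = {
        serotype: open(os.path.join(out_dir, f"sequences_{serotype}.fasta"), "w")
        for serotype in TAXID_TO_SEROTYPE.values()
    }
    try:
        for record_id, sequence in read_fasta(sequences_path):
            serotype = serotype_of.get(record_id)
            if serotype is not None:
                handles[serotype].write(f">{record_id}\n{sequence}\n")
    finally:
        for handle in handles.values():
            handle.close()
    return unassigned
```

Calling `split_by_serotype("metadata.tsv", "sequences.fasta")` reads each input once, which avoids the per-serotype refetching mentioned above. The returned list corresponds to the records that would otherwise be silently missing from the serotype builds.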

Create a Nextclade dataset for finer subtype classification

Originally, the plan was to use Nextclade assignment to categorize records into major dengue serotypes and then minor subtypes (#16). However, because of the diversity within dengue, the major serotype classification did not align with expectations. The idea is therefore to rely on NCBI taxon IDs for major serotypes and on Nextclade datasets for within-serotype sub-classification.

An ensemble method

Ideally, employ a combination of the above methods to consistently and accurately classify records into major serotypes and minor subtypes.
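One way the ensemble could work is sketched below. It assumes each method has already produced a per-record call such as "denv1", or None when it made no call. The function names, the status labels, and the policy of preferring the taxon-ID call on disagreement are all illustrative assumptions, not decisions made in this issue.

```python
def ensemble_serotype(taxid_call, nextclade_call):
    """Combine two serotype calls for one record into (serotype, status).

    Assumed policy: the NCBI taxon-ID call wins on disagreement, but the
    record is flagged as "conflict" so it can be reviewed.
    """
    if taxid_call and nextclade_call:
        status = "agree" if taxid_call == nextclade_call else "conflict"
        return taxid_call, status
    if taxid_call:
        return taxid_call, "taxid_only"
    if nextclade_call:
        # Fallback for records NCBI did not annotate at the serotype level.
        return nextclade_call, "nextclade_only"
    return None, "unassigned"


def classify_all(taxid_calls, nextclade_calls):
    """Apply ensemble_serotype to every record ID seen by either method."""
    record_ids = sorted(set(taxid_calls) | set(nextclade_calls))
    return {
        record_id: ensemble_serotype(taxid_calls.get(record_id),
                                     nextclade_calls.get(record_id))
        for record_id in record_ids
    }
```

Keeping the status alongside the serotype makes it possible to count agreements and conflicts, which is the cross-validation use mentioned in the description.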

Tasks to solve this issue

@j23414 j23414 added the enhancement New feature or request label Feb 6, 2024
@j23414 j23414 self-assigned this Feb 11, 2024
@joverlee521
Contributor

Learned in today's Nextstrain meeting that this can probably be done with nextclade sort, similar to how RSV separates A/B subtypes.

@j23414 j23414 closed this as completed Mar 24, 2024