-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removal of genome containing plasmid sequence #28
Comments
For now, exclude the circular synthetic sequences from the phylogenetic build flagged by #28. Alternatively, we can attempt to drop the plasma from the ends of the sequences if that makes sense.
Similarly, there are genomes with
|
Hello @Rohit-Satyam, thanks for flagging those sequences for removal! We tend to pull the entire NCBI dengue dataset into the sequences FASTA file and then subsequently filter out any problematic ones during the phylogenetic build. I've put the flagged circular strains in the phylogenetic/config/exclude.txt file in this PR: #29 Thanks for the violin plot! The upper and lower bounds should be adding in an augur filter call. Currently we're filtering out strains that are shorter than 5000nt, defined in a config file but your assessment of a We don't seem to have a I'm still looking into |
This is not a query. Just some exploratory analysis and mental notes on why not to rely on the genome completeness filter of the NCBI Virus database. Besides ViruSurf uses outdated genome assemblies so avoid using their resource for time being.
We can see that there are 9745 private to Nextstrain. But now I know that this nuisance arose because the NCBI virus tagged many complete assemblies as incomplete and many (nearly 85) incomplete genomes of merely 1.4K length as complete. I have requested them to correct this. Only MN192436 is missing from Nextstrsin. @j23414 There are duplicates in Nextstrain i.e. out of 14522, 13736 are unique. Deduplication was done using vsearch. Further, these duplicates arise from some old accession numbers associated with new accession numbers. Eg:
|
@j23414 Besides do you guys have any plans for developing a similar resource for FMD virus? |
Hi @Rohit-Satyam, no plans for a FMD resource yet. Might be able to use the pathogen-repo-guide but nothing planned yet |
Thanks for flagging the 13 strains with many Ns! It gave me incentive to look into the augur min-length filter (to check if N's are counted when calculating sequence length). For the test, I pulled the flagged strains and metadata into
|
I'm working through the duplicates. I played around with vsearch although I ended up using smof uniq to aggregate duplicate names into the fasta header: conda install bioconda::smof
smof uniq \
--pack data/sequences_all.fasta \
--pack-sep ';' \
> packed_uniq.fasta
grep ">" packed_uniq.fasta \
| grep ";" \
> duplicates.txt
head duplicates.txt
#> >ON123563;ON123566
#> >ON123564;ON123565;ON123567
#> >ON123568;ON123569;ON123570
#> >LQ868394;OG164735;LQ729835;JC562943;LQ868377;OG164704;AY099336;NC_001475;ON103302;ON103390;ON111445;ON115021;ON115023;ON115025;ON115029;ON115182;ON115811;ON115814;ON115816;ON115818;ON116129;ON127418;ON127533;ON127537;ON127550;ON127552
#> >LY683285;HQ026763; I think a bunch of these will disappear if we filter to only "VRL" genbank files, trying to track down if we can get that field in ncbi datasets, but I'll draft out a manual exclude until then.
Will look into why this is missing |
@j23414 nice work. I appreciate your efforts. I would also suggest you uniformize the subclade for DENV serotype 2 and serotype 4.
If you could change the following:
The roman numerals represents the genotype and I got this information from the following:
|
@j23414 I would like to bring your attention to the following subclade assignments The subclade of the genome
Similarly
To confirm if it is Asian II, I carried out blast of with
Which came out as follows with 99% identity and query cover: |
@j23414 and I think there are more. Below is a list of 16 genomes that needs correction at subclad level based on the table obtained from supplementary material link highlighted above. Please find the attachment below Can you tell me how these subclad assignments were achieved by you? |
For now, exclude the circular synthetic sequences from the phylogenetic build flagged by #28. Alternatively, we can attempt to drop the plasma from the ends of the sequences if that makes sense.
Duplicates, referring to identical sequences that may or may not be distinct samples, were highlighted in the following comment: #28 (comment). Additional discussion can be found in the thread starting here: #28 (comment). It is crucial to note that some of these excluded duplicates actually represent patents (PAT) for vaccine candidates, and as such, they are omitted from the phylogenetic analysis of current dengue diversity. When encountering duplicates and there is a reference sequence identified with prefix (NC_), the preference was to retain the reference and exclude other duplicates. In cases where multiple VRL records share the same nucleotide sequence, the earliest sample in alphabetical order was selected, and the others excluded. The rationale for each exclusion is documented in the respective comments in the exclude.txt file.
For now, exclude the circular synthetic sequences from the phylogenetic build flagged by #28. Alternatively, we can attempt to drop the plasma from the ends of the sequences if that makes sense.
Duplicates, referring to identical sequences that may or may not be distinct samples, were highlighted in the following comment: #28 (comment). Additional discussion can be found in the thread starting here: #28 (comment). It is crucial to note that some of these excluded duplicates actually represent patents (PAT) for vaccine candidates, and as such, they are omitted from the phylogenetic analysis of current dengue diversity. When encountering duplicates and there is a reference sequence identified with prefix (NC_), the preference was to retain the reference and exclude other duplicates. In cases where multiple VRL records share the same nucleotide sequence, the earliest sample in alphabetical order was selected, and the others excluded. The rationale for each exclusion is documented in the respective comments in the exclude.txt file.
Hi @Rohit-Satyam , sorry for the late answer! The subclade assignments were outdated (from before we had
You can also update the URL's in the phylogenetic/rules/prepare_sequences.smk file, and it should automatically pull the correct ones. dengue/phylogenetic/rules/prepare_sequences.smk Lines 24 to 25 in 5385608
On our end, we're discussion how to smoothly transition to the the |
Hi, @j23414 thanks for working on this promptly and correcting the subclade assignment of the aforementioned genomes. I was wondering if we can carry out BlastN of sequences with unknown serotypes with the ones that have known serotypes and then use their metadata to assign serotypes/genotype to the unknown ones using some stringent threshold such as 90% query cover and identity or use something like MeshClustV3? Besides when I see your exclude.txt file, I see some sequences being "too divergent" how do you decide those sequences? Did you calculate msa entropy or something? |
Hi @j23414. I downloaded new metadata and tried to annotate some more genomes at least at the serotype level and some at the genotype level. Below attached is the Excel sheet. Some are literature-based (dug some supplementary material files) and another sheet is based on serotype information based on fields like while read p;
do
efetch -db nuccore -id "$p" -format FEATURES > file.temp;
note=$(grep note file.temp | awk -F'=' '{print $2}' | tr '\n' ':' | sed 's/ *\/note="//');
strain=$(grep "strain=" file.temp | tr '\n' ' ' | sed 's/ *\/strain="//');
org=$(grep organism file.temp | sed 's/ *\/organism="//');
isolate=$(grep "isolate=" file.temp | tr '\n' ' ' | sed 's/ *\/isolate="//');
def=$(grep DEFINITION file.temp);
sero=$(grep serotype= file.temp | sed 's/ *\/serotype="//');
echo -e "$p;$strain;$note;$org;$isolate;$def;$sero" >> results.tsv;
done < temp
## where temp is file containing the accessions of genomes that do not have serotype or genotype information |
@j23414 did you check for the genomes above? |
Hi @Rohit-Satyam! Sorry, bogged down with prepping for a presentation next week. I plan to come back to this soon. From a glance through, they seem reasonable to add to "annotations.tsv", basically figuring out how to mark the reasoning for each manual change (e.g. from genbank, from linked publication, from nearest blast hit). |
Add serotype (denv1-4) annotations where the serotype number is mentioned in the GenBank file either as /strain= /genotype= or in the publication title. If the GenBank is PAT instead of VRL, ignore fixing the serotype annotation. If the GenBank is annotated with "UNVERIFIED", ignore fixing the serotype annotation.
Add serotype (denv1-4) annotations where the serotype number is mentioned in the GenBank file either as /strain= /genotype= or in the publication title. If the GenBank is PAT instead of VRL, ignore fixing the serotype annotation. If the GenBank is annotated with "UNVERIFIED", ignore fixing the serotype annotation.
Add serotype (denv1-4) annotations where the serotype number is mentioned in the GenBank file either as /strain= /genotype= or in the publication title. If the GenBank is PAT instead of VRL, ignore fixing the serotype annotation. If the GenBank is annotated with "UNVERIFIED", ignore fixing the serotype annotation.
Closing this since merged
The genotype renaming is being discussed on a separate issue Of course, feel free to submit new issues to flag any problematic records |
Dear @j23414
I see that some of the sequences in all_sequences.fasta file have plasmid DNA as well and therefore are circular DNA and have length greater than 12000 bp. These include
Shouldn't these be either removed or the plasmid sequence be chopped off?
I also see some extremely small genomes of length 2K. Too lar or too small genomes can influence the MSA so shouldn't they be removed from the resource? If so what's thresholds would you recommend for filtering the uninformative genomes. Given the graph below, I was thinking to take Upper boundary (i.e. 12184bp) and lower boundary (i.e. 8670bp)
The text was updated successfully, but these errors were encountered: