Removal of genome containing plasmid sequence #28

Rohit-Satyam · 2024-02-16T18:40:51Z

I see that some of the sequences in all_sequences.fasta file have plasmid DNA as well and therefore are circular DNA and have length greater than 12000 bp. These include

AY243466
AY243467
AY243468
AY243469
AY376438
AY648301
AY656167
AY656168
AY656169
AY656170
AY744148

Shouldn't these be either removed or the plasmid sequence be chopped off?
I also see some extremely small genomes of length 2K. Too lar or too small genomes can influence the MSA so shouldn't they be removed from the resource? If so what's thresholds would you recommend for filtering the uninformative genomes. Given the graph below, I was thinking to take Upper boundary (i.e. 12184bp) and lower boundary (i.e. 8670bp)

The text was updated successfully, but these errors were encountered:

For now, exclude the circular synthetic sequences from the phylogenetic build flagged by #28. Alternatively, we can attempt to drop the plasma from the ends of the sequences if that makes sense.

Rohit-Satyam · 2024-02-16T21:39:23Z

Similarly, there are genomes with ~61% of the genome as Ns. GISAID uses at-least 50% genome assembly completeness as mandatory to assign lineage to an assembly. If we use a similar analogy these genomes might also be removed. All these 13 genomes come from the study https://pubmed.ncbi.nlm.nih.gov/19965886/. The submitters somehow managed to meet the length requirement and reported these genomes as a complete assembly. What do you think?

These 13 genomes have <50% of genome assembly covered and come from this 2010 study.
GU318306
GU318307
GU318308
GU318309
GU318310
GU318311
GU318312
GU318313
GU318314
GU318315
GU318316
GU318317
GU318318

j23414 · 2024-02-16T23:57:47Z

Hello @Rohit-Satyam, thanks for flagging those sequences for removal!

We tend to pull the entire NCBI dengue dataset into the sequences FASTA file and then subsequently filter out any problematic ones during the phylogenetic build. I've put the flagged circular strains in the phylogenetic/config/exclude.txt file in this PR: #29

Thanks for the violin plot! The upper and lower bounds should be adding in an augur filter call. Currently we're filtering out strains that are shorter than 5000nt, defined in a config file but your assessment of a lower boundary (i.e. 8670bp) makes sense to me...just need to glance through the tree to see if there are potential issues.

We don't seem to have a --max-length flag, but we should be able to check for sequences that are too long or too distant from the reference sequence, to be added to the "exclude.txt" file.

I'm still looking into genomes with ~61% of the genome as Ns you've flagged. I'll go through the links and background and check if they're making it through our current filters or add them to the exclude list.

Rohit-Satyam · 2024-02-18T12:32:16Z

This is not a query. Just some exploratory analysis and mental notes on why not to rely on the genome completeness filter of the NCBI Virus database. Besides ViruSurf uses outdated genome assemblies so avoid using their resource for time being.

I observed the most exhaustive set of DENV genomes has been provided by Nextstrain. On the contrary, the NCBI Virus database only provides merely 4K complete genomes when using their "complete" genome filter). To investigate reason behind this I tried to intersect the Accession numbers obtained from metadata from NCBi and Nextstrain and the results are shown below.

We can see that there are 9745 private to Nextstrain. But now I know that this nuisance arose because the NCBI virus tagged many complete assemblies as incomplete and many (nearly 85) incomplete genomes of merely 1.4K length as complete. I have requested them to correct this. Only MN192436 is missing from Nextstrsin.

@j23414 There are duplicates in Nextstrain i.e. out of 14522, 13736 are unique. Deduplication was done using vsearch. Further, these duplicates arise from some old accession numbers associated with new accession numbers. Eg:

vsearch --derep_fulllength all.fasta --output drep.fasta

DEN2CGA
DEN2JAMCG

Rohit-Satyam · 2024-02-18T13:16:29Z

@j23414 Besides do you guys have any plans for developing a similar resource for FMD virus?

j23414 · 2024-02-20T23:27:44Z

Hi @Rohit-Satyam, no plans for a FMD resource yet.

Might be able to use the pathogen-repo-guide but nothing planned yet

j23414 · 2024-02-20T23:28:28Z

Thanks for flagging the 13 strains with many Ns! It gave me incentive to look into the augur min-length filter (to check if N's are counted when calculating sequence length).

For the test, I pulled the flagged strains and metadata into NN.fasta and NN.tsv files respectively. Looks like it disregards N's when calculating length, so these will be filtered out by the --min-length 5000 parameter, but thanks for checking!

augur filter \
  --min-length 5000 \
  --sequences NN.fasta \
  --metadata NN.tsv \
  --metadata-id-columns accession \
  --output-strain passed.txt

#> 13 strains were dropped during filtering
#>   13 of these were dropped because they were shorter than minimum length of 5000bp
ERROR: All samples have been dropped! Check filter rules and metadata file format.

j23414 · 2024-02-20T23:37:02Z

I'm working through the duplicates. I played around with vsearch although I ended up using smof uniq to aggregate duplicate names into the fasta header:

conda install bioconda::smof
smof uniq \
  --pack data/sequences_all.fasta \
  --pack-sep ';' \
  > packed_uniq.fasta

grep ">" packed_uniq.fasta \
| grep ";" \
> duplicates.txt

head duplicates.txt
#> >ON123563;ON123566
#> >ON123564;ON123565;ON123567
#> >ON123568;ON123569;ON123570
#> >LQ868394;OG164735;LQ729835;JC562943;LQ868377;OG164704;AY099336;NC_001475;ON103302;ON103390;ON111445;ON115021;ON115023;ON115025;ON115029;ON115182;ON115811;ON115814;ON115816;ON115818;ON116129;ON127418;ON127533;ON127537;ON127550;ON127552
#> >LY683285;HQ026763;

I think a bunch of these will disappear if we filter to only "VRL" genbank files, trying to track down if we can get that field in ncbi datasets, but I'll draft out a manual exclude until then.

For example the duplicates: LP918029 and AF326573

LOCUS       AF326573               10649 bp    RNA     linear   VRL 12-FEB-2004
LOCUS       LP918029               10649 bp    DNA     linear   PAT 29-APR-2018

Only MN192436 is missing from Nextstrain.

Will look into why this is missing

Rohit-Satyam · 2024-02-21T15:25:04Z

@j23414 nice work. I appreciate your efforts. I would also suggest you uniformize the subclade for DENV serotype 2 and serotype 4.
Subclades for serotype 1 and 3 are Roman numerals ("DENV1/V" and "DENV3/III" ) while subclades for serotype 2 and 4 are characters like "S" or "AA" ("DENV4/S" and "DENV2/AA" ).

unique(nextstrain$subclade)
 [1] ""          "DENV1/V"   "DENV3/III" "DENV2/C"   "DENV1/III" "DENV2/AA" 
 [7] "DENV4/II"  "DENV2/AI"  "DENV1/I"   "DENV4/I"   "DENV3/I"   "DENV1/IV" 
[13] "DENV3/IV"  "DENV3/II"  "DENV2/AM"  "DENV2/S"   "DENV1/II"  "DENV4/S"

If you could change the following:

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II
DENV2/S --> DENV2/VI 
DENV4/S --> DENV4/IV

The roman numerals represents the genotype and I got this information from the following:

DENV-2 serotype is classified into genotypes I (American), II (Cosmopolitan), III (Asian/American), IV (Asian II), V (Asian I) and VI (Sylvatic). Source:

Genotype IV is sylvatic, with only three known sequences. Source 1 and we know P73_1120 is sylvatic strain

Rohit-Satyam · 2024-02-21T16:10:07Z

@j23414 I would like to bring your attention to the following subclade assignments

The subclade of the genome GQ398268.1 is designated as DENV2/AI in nextclade annotation which means this is Asian I (i.e genotype V). However, according to a supplementary material here says it's Asian II (i.e genotype IV).

GQ398268 Dengue virus 2 strain DENV-2/ID/1022DN/1975, complete genome Homo sapiens Indonesia 1975 Asian II

Similarly OK573328 is assigned to DENV2/AI while the strain field says it's Type_2/New_Guinea_C_NGC_ and a study here says and I quote

Similarly, the NIH DENV2 component of the tetravalent TV003 vaccine, based on the New Guinea C 1980 strain from the Asian II genotype...

To confirm if it is Asian II, I carried out blast of with OK573328

KM204118 Dengue virus 2 strain New Guinea C, complete genome Homo sapiens Papua New Guinea 1944 Asian II

Which came out as follows with 99% identity and query cover:

Rohit-Satyam · 2024-02-21T16:44:46Z

@j23414 and I think there are more. Below is a list of 16 genomes that needs correction at subclad level based on the table obtained from supplementary material link highlighted above. Please find the attachment below

temp_check.csv

Can you tell me how these subclad assignments were achieved by you?

For now, exclude the circular synthetic sequences from the phylogenetic build flagged by #28. Alternatively, we can attempt to drop the plasma from the ends of the sequences if that makes sense.

Duplicates, referring to identical sequences that may or may not be distinct samples, were highlighted in the following comment: #28 (comment). Additional discussion can be found in the thread starting here: #28 (comment). It is crucial to note that some of these excluded duplicates actually represent patents (PAT) for vaccine candidates, and as such, they are omitted from the phylogenetic analysis of current dengue diversity. When encountering duplicates and there is a reference sequence identified with prefix (NC_), the preference was to retain the reference and exclude other duplicates. In cases where multiple VRL records share the same nucleotide sequence, the earliest sample in alphabetical order was selected, and the others excluded. The rationale for each exclusion is documented in the respective comments in the exclude.txt file.

For now, exclude the circular synthetic sequences from the phylogenetic build flagged by #28. Alternatively, we can attempt to drop the plasma from the ends of the sequences if that makes sense.

Duplicates, referring to identical sequences that may or may not be distinct samples, were highlighted in the following comment: #28 (comment). Additional discussion can be found in the thread starting here: #28 (comment). It is crucial to note that some of these excluded duplicates actually represent patents (PAT) for vaccine candidates, and as such, they are omitted from the phylogenetic analysis of current dengue diversity. When encountering duplicates and there is a reference sequence identified with prefix (NC_), the preference was to retain the reference and exclude other duplicates. In cases where multiple VRL records share the same nucleotide sequence, the earliest sample in alphabetical order was selected, and the others excluded. The rationale for each exclusion is documented in the respective comments in the exclude.txt file.

j23414 · 2024-02-28T22:01:44Z

Hi @Rohit-Satyam , sorry for the late answer!

The subclade assignments were outdated (from before we had DENV2/AII (aka DENV2/IV) classification). I've updated the nextclade_subtype (last column) in the following file (notice the extra "workflows/" in the URL).

You can also update the URL's in the phylogenetic/rules/prepare_sequences.smk file, and it should automatically pull the correct ones.

dengue/phylogenetic/rules/prepare_sequences.smk

Lines 24 to 25 in 5385608

    
           sequences_url = "https://data.nextstrain.org/files/dengue/sequences_{serotype}.fasta.zst", 
        
           metadata_url = "https://data.nextstrain.org/files/dengue/metadata_{serotype}.tsv.zst"

On our end, we're discussion how to smoothly transition to the the DENV2/IV style, probably add it as a separate field for a while then deprecate the old field.

Rohit-Satyam · 2024-02-29T11:50:34Z

Hi, @j23414 thanks for working on this promptly and correcting the subclade assignment of the aforementioned genomes. I was wondering if we can carry out BlastN of sequences with unknown serotypes with the ones that have known serotypes and then use their metadata to assign serotypes/genotype to the unknown ones using some stringent threshold such as 90% query cover and identity or use something like MeshClustV3? Besides when I see your exclude.txt file, I see some sequences being "too divergent" how do you decide those sequences? Did you calculate msa entropy or something?

Rohit-Satyam · 2024-03-01T23:45:53Z

Hi @j23414. I downloaded new metadata and tried to annotate some more genomes at least at the serotype level and some at the genotype level. Below attached is the Excel sheet. Some are literature-based (dug some supplementary material files) and another sheet is based on serotype information based on fields like serotype and description present in the GenBank file using code below. Now I know that it's not a perfect code but it does the job and helped me manually annotate several genomes quickly. Please have a look. I've included below the code I used but I think you can optimize it. If you do let me know

while read p; 
do 
efetch -db nuccore -id "$p" -format FEATURES > file.temp; 
note=$(grep note file.temp | awk -F'=' '{print $2}' | tr '\n' ':' | sed 's/ *\/note="//'); 
strain=$(grep "strain=" file.temp | tr '\n' ' ' | sed 's/ *\/strain="//'); 
org=$(grep organism file.temp | sed 's/ *\/organism="//'); 
isolate=$(grep "isolate=" file.temp | tr '\n' ' ' | sed 's/ *\/isolate="//'); 
def=$(grep DEFINITION file.temp); 
sero=$(grep serotype= file.temp | sed 's/ *\/serotype="//'); 
echo -e "$p;$strain;$note;$org;$isolate;$def;$sero" >> results.tsv; 
done < temp

## where temp is file containing the accessions of genomes that do not have serotype or genotype information

serotype.xlsx

Rohit-Satyam · 2024-03-07T10:23:14Z

@j23414 did you check for the genomes above?

j23414 · 2024-03-07T22:39:52Z

Hi @Rohit-Satyam! Sorry, bogged down with prepping for a presentation next week. I plan to come back to this soon.

From a glance through, they seem reasonable to add to "annotations.tsv", basically figuring out how to mark the reasoning for each manual change (e.g. from genbank, from linked publication, from nearest blast hit).

…comment)

Add serotype (denv1-4) annotations where the serotype number is mentioned in the GenBank file either as /strain= /genotype= or in the publication title. If the GenBank is PAT instead of VRL, ignore fixing the serotype annotation. If the GenBank is annotated with "UNVERIFIED", ignore fixing the serotype annotation.

j23414 · 2024-04-12T22:56:38Z

Closing this since merged

The genotype renaming is being discussed on a separate issue

Switch DENV2 genotypes to numeric to be consistent with DENV1, 3, and 4 #32

Of course, feel free to submit new issues to flag any problematic records

Rohit-Satyam added the bug Something isn't working label Feb 16, 2024

j23414 mentioned this issue Feb 16, 2024

Exclude circular synthetic (or chimeric) sequences #29

Merged

1 task

j23414 mentioned this issue Feb 21, 2024

Establish some deduplication guidelines within the phylogenetic workflow #30

Open

j23414 mentioned this issue Feb 22, 2024

Nextclade assignment #16

Merged

2 tasks

This was referenced Feb 22, 2024

[filter]: Add flag for --max-length nextstrain/augur#1421

Closed

filter: Update --min-length help message nextstrain/augur#1422

Merged

This was referenced Mar 14, 2024

Switch DENV2 genotypes to numeric to be consistent with DENV1, 3, and 4 #32

Open

Add manual serotype annotations along with justifications to "annotations.tsv" #33

Closed

j23414 added a commit that referenced this issue Apr 6, 2024

First pass to independently verify missing serotypes flagged by #28 (…

03c3afa

…comment)

j23414 added a commit that referenced this issue Apr 6, 2024

First pass to independently verify missing serotypes flagged by #28 (…

b5fe73a

…comment)

j23414 added a commit that referenced this issue Apr 6, 2024

First pass to independently verify missing serotypes flagged by #28 (…

c056d97

…comment)

j23414 added a commit that referenced this issue Apr 6, 2024

First pass to independently verify missing serotypes flagged by #28 (…

e71da36

…comment)

j23414 added a commit that referenced this issue Apr 6, 2024

First pass to independently verify missing serotypes flagged by #28 (…

ba3af78

…comment)

j23414 closed this as completed Apr 12, 2024

Rohit-Satyam mentioned this issue May 21, 2024

Update the links of FASTA and metadata in the README #52

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removal of genome containing plasmid sequence #28

Removal of genome containing plasmid sequence #28

Rohit-Satyam commented Feb 16, 2024

Rohit-Satyam commented Feb 16, 2024

j23414 commented Feb 16, 2024

Rohit-Satyam commented Feb 18, 2024 •

edited

Loading

Rohit-Satyam commented Feb 18, 2024

j23414 commented Feb 20, 2024

j23414 commented Feb 20, 2024

j23414 commented Feb 20, 2024 •

edited

Loading

Rohit-Satyam commented Feb 21, 2024

Rohit-Satyam commented Feb 21, 2024 •

edited

Loading

Rohit-Satyam commented Feb 21, 2024 •

edited

Loading

j23414 commented Feb 28, 2024

Rohit-Satyam commented Feb 29, 2024 •

edited

Loading

Rohit-Satyam commented Mar 1, 2024

Rohit-Satyam commented Mar 7, 2024

j23414 commented Mar 7, 2024 •

edited

Loading

j23414 commented Apr 12, 2024

Removal of genome containing plasmid sequence #28

Removal of genome containing plasmid sequence #28

Comments

Rohit-Satyam commented Feb 16, 2024

Rohit-Satyam commented Feb 16, 2024

j23414 commented Feb 16, 2024

Rohit-Satyam commented Feb 18, 2024 • edited Loading

Rohit-Satyam commented Feb 18, 2024

j23414 commented Feb 20, 2024

j23414 commented Feb 20, 2024

j23414 commented Feb 20, 2024 • edited Loading

Rohit-Satyam commented Feb 21, 2024

Rohit-Satyam commented Feb 21, 2024 • edited Loading

Rohit-Satyam commented Feb 21, 2024 • edited Loading

j23414 commented Feb 28, 2024

Rohit-Satyam commented Feb 29, 2024 • edited Loading

Rohit-Satyam commented Mar 1, 2024

Rohit-Satyam commented Mar 7, 2024

j23414 commented Mar 7, 2024 • edited Loading

j23414 commented Apr 12, 2024

Rohit-Satyam commented Feb 18, 2024 •

edited

Loading

j23414 commented Feb 20, 2024 •

edited

Loading

Rohit-Satyam commented Feb 21, 2024 •

edited

Loading

Rohit-Satyam commented Feb 21, 2024 •

edited

Loading

Rohit-Satyam commented Feb 29, 2024 •

edited

Loading

j23414 commented Mar 7, 2024 •

edited

Loading