Skip to content
Mike Tisza edited this page May 9, 2022 · 30 revisions

Welcome to the Cenote-Taker 2 wiki!

With any suggestions to improve Cenote-Taker 2 or if you believe you have verifiable viral contigs that the pipeline is not detecting, please email me at michael.tisza@gmail.com

Installation notes

Hopefully, the instructions provided in the repo readme provide enough guidance on installing Cenote-Taker 2 on your linux HPC or cloud. I always suggest saving a log file of the install to check for errors that you may not notice until trying to run Cenote-Taker 2 later.

Running the provided test datasets

There are two test datasets provided in the repo. One contains contigs from DNA viruses, the other contains contigs from RNA viruses. Info below was generated from Cenote-Taker 2 v2.1.5

DNA virus test contigs: testcontigs_DNA_ct2.fasta

This file contains five contigs. Here's what they are and what to expect from Cenote-Taker 2: testcontig1 = a circular sequence of a novel microvirus-like sequence quite divergent from current GenBank entries. Should find genome map in the main sequin_and_genome_maps/

testcontig2 = a circular sequence of a known circovirus (porcine circovirus 1). Should find genome map in the main sequin_and_genome_maps/

testcontig3 = a linear fragment of a novel phycodnavirus quite divergent from current GenBank entries. Should find genome map in sequin_and_genome_maps/

testcontig4 = a linear fragment of a known crAss-like phage containing several viral hallmark genes. Should find genome map in sequin_and_genome_maps/

test_contig5 = a linear fragment of a known crAss-like phage with only "accessory" or non-hallmark genes. With default settings, this will not be annotated. If you use -am True, a genome map will be found in sequin_and_genome_maps/

What the output genome maps should look like:

Command that I ran:

python run_cenote-taker2.py -c testcontigs_DNA_ct2.fasta -r test_DNA_ct -p True -m 32 -t 32

Note: this command cannot be copy/pasted into your terminal. You must specify the correct path to your Cenote-Taker 2 directory and script.

Output "CONTIG_SUMMARY.tsv" file:

ORIGINAL_NAME	CENOTE_NAME	ORGANISM_NAME	END_FEATURE	LENGTH	ORF_CALLER	NUM_HALLMARKS	HALLMARK_NAMES	BLASTN_INFO
testcontig1	test_DNA_ct1	Petitvirales  sp. ctNWH1	DTR	6097	Prodigal	1	Varsani_microviridae_cluster.14/PFAM:PF02305.16-Major-capsid-protein	BLASTN not conducted
testcontig2	test_DNA_ct2	Circovirus  sp. ctsq52	DTR	1760	Prodigal	1	circoviridae_cluster.1/PFAM:PF02443.14-capsid-protein	BLASTN not conducted
testcontig3	test_DNA_ct3_vs01	Phycodnaviridae  sp. ctSNK3_01	None	9783	Prodigal	1	phycodna_cluster.4/PDB:5TIP_B-Major-capsid-protein	BLASTN not conducted
testcontig4	test_DNA_ct4_vs01	crAss-like phage  sp. ctyu04_01	None	18176	PHANOTATE	3	crass_cluster.2/PFAM:PF16510.4-portal-protein|crass_cluster.3/PDB:3EZK_B-Terminase-large-subunit|crass_cluster.1/PFAM:PF17236.1-Major-Capsid-Protein	BLASTN not conducted

RNA virus test contigs: testcontigs_RNA_ct2.fasta

This file contains two contigs. Here's what they are and what to expect from Cenote-Taker 2:

testcontig6 = a linear sequence from a known human astrovirus. Should find genome map in the main sequin_and_genome_maps/

testcontig7 = a linear sequence from a novel levivirus-like sequence. Should find genome map in the main sequin_and_genome_maps/

What the output genome maps should look like:

Command that I ran:

python run_cenote-taker2.py -c testcontigs_RNA_ct2.fasta -r test_RNA_ct -p True -m 32 -t 32 -db rna_virus

Output "CONTIG_SUMMARY.tsv" file:

ORIGINAL_NAME	CENOTE_NAME	ORGANISM_NAME	END_FEATURE	LENGTH	ORF_CALLER	NUM_HALLMARKS	HALLMARK_NAMES	BLASTN_INFO
testcontig7	test_RNA_ct2_vs01	Norzivirales  sp. ctFyT2_01	None	3422	PHANOTATE	1	levi_cluster.2/PFAM:PF03431.12-RDRP	BLASTN not conducted
testcontig6	test_RNA_ct1_vs01	Mamastrovirus  sp. ctevq1_01	None	6415	Prodigal	2	astro_cluster.1/PDB:5I61_A-RDRP|astro_cluster.2/PFAM:PF03115.13-capsid	BLASTN not conducted

Note: this command cannot be copy/pasted into your terminal. You must specify the correct path to your Cenote-Taker 2 directory and script.

Visualizing genome maps

The .gbf files produced by successful runs can be opened with any genome/plasmid viewer. Free softwares include Ugene and SnapGene Viewer. Paid softwares include Geneious, MacVector, and SnapGene. Others probably exist. Note that changes to .gbf files will not be reflected on the .sqn files that are submittable to GenBank.

Pruning prophages

-p True trims down input contigs that contain phage genomes to remove non-phage bacterial chromosomes

Original contig:

^^^^BACTERIAL CHROM^^^^^\/\/\/\/\/\/PHAGE GENOME\/\/\/\/\/\/\^^^^BACTERIAL CHROM^^^^^

"Pruned" sequence after pruning module:

\/\/\/\/\/\/PHAGE GENOME\/\/\/\/\/\/\

The virus discovery module

How does it work?

For each provided set of contigs, ORFs are translated and compared to a Hidden Markov Model database of viral hallmark genes using hmmscan.

What's a Hidden Markov Model database?

In this case, multiple sequence alignments of related amino acid sequences are prepared. Then a probabilistic model of the position and likelihood of conserved amino acid residues is created using hmmer (i.e. a Hidden Markov Model). Many models go into a database.

What's a viral hallmark gene?

I define this as a gene only found in some viruses but never in non-viral sequences. I also ONLY consider genes with known function. The classic example of this is a virus capsid gene.

Where did I get the hallmark gene sequences?

I considered all available virus proteins from GenBank. I also considered protein sequences from a variety of assembled DNA viromes. Pairwise alignment of sequences was conducted using EFI-EST. Clusters of related sequences were clustered with the MCL add-on in Cytoscape. Clusters were extracted, multiple sequence alignments (MSAs) were prepared with MAFFT. MSAs were used as queries for HHsearch. Best hits with probability > 85% were used to name HMMs, which were then prepared with hmmer. HMMs identified as viral replication, packaging or virion proteins were kept. HMMs were tested for specificity to to viral proteins by querying known NON-VIRAL proteins with a permissive E value. HMMs with hits to non-viral proteins were removed. HMMs with coil-coil and Ig-like domains were also removed as these can generate non-specific hits. Some viral HMMs from CDD and Pfam were also used.

Other discovery notes

Plasmids and conjugative transposons can swap genes with viruses, most notably dsDNA phage. I've taken steps to identify these in the pipeline via gene composition analysis and BLASTX.

Speed considerations

The adding more CPUs should improve the speed of processing datasets. The best way to increase the speed for users mainly interested in virus discovery is to turn off HHsuite via -hh none. I would recommend doing this by default if you want to run many large datasets as quickly as possible. Or, for those just interested in quick discovery without generation of genome maps, run python unlimited_breadsticks.py.

Minimum length considerations

This is really up to the user. Typically, assemblers generate lots of small contigs and few long contigs. Changing minimum consideration from 5000 nt to 1000 nt may increase the total amount of sequence to be analyzed by several fold. Changing '--minimum_length_linear' has the greatest effect as most sequences are not circular. Therefore, the decision is really just one of how many resources do you have to throw at the dataset, how long are you willing to wait, and how important are small contigs to you?

Hard limits are 1000 nt for detection of circularity and 4000nt for detection of ITRs. Smaller contigs will still be considered as 'linear' contigs if you have lower minimum settings for '--minimum_length_linear'.

Sequences from samples enriched for viruses

There are many methods for enriching for viral nucleic acid, most of which involve enriching physical viral particles. Checking your purity with ViromeQC or other bioinformatic tools could inform which discovery thresholds are trustworthy.

Minimum hallmark genes for circular/ITR sequences (--circ_minimum_hallmark_genes):

Circularity itself is a feature common to some viral families (but also to plasmids) that can clue you in to the nature of the element at hand (same goes for inverted terminal repeats, i.e. ITRs). Because some viruses are so divergent or independently derived from known viruses (example), analyzing all circular sequences in a virus enriched set can be fruitful. To do this, set --circ_minimum_hallmark_genes 0. Inevitably, some "circles" without hallmark genes will be false positives, so be vigilant. Otherwise, set this to --circ_minimum_hallmark_genes 1.

Minimum hallmark genes for linear sequences (--lin_minimum_hallmark_genes):

For enriched datasets, I'd set this at --lin_minimum_hallmark_genes 1. Most viral contigs in your dataset will represent incomplete genome fragments. Further, large DNA viruses have long "accessory regions" that don't contain any genes I'd consider to be "hallmarks". These accessory regions will be missed, unfortunately. However, even in samples that appear to be highly enriched for virus particles, there will be non-viral elements. The nature of these elements is yet-to-be-determined, but include PICIs.

Sequences from unenriched (WGS) samples or bacterial genomes

Such datasets have a lot more 'noise' than 'signal', and using a little more caution might be beneficial. First of all, please always use -p True with these samples. My experience suggests that around half of long viral contigs from unenriched WGS metagenome datasets are flanked by bacterial chromosome sequences on at least one side. If your data contain linear large eukaryotic DNA viruses and you've elected to prune prophages, these sequences may have gotten over-chopped or over-pruned. Check the .pdf generated by the pruning script. You can rerun these specific contigs with -p False to recover the entire contig.

I'd suggest using -db virion for unenriched datasets/bacterial genomes, as replication genes can occasionally be exchanged with other mobile genetic elements, and a small false-positive rate might be observed.

Minimum hallmark genes for circular/ITR sequences (--circ_minimum_hallmark_genes):

I'd suggest using --circ_minimum_hallmark_genes 1

Minimum hallmark genes for linear sequences (--lin_minimum_hallmark_genes):

I'd suggest using --lin_minimum_hallmark_genes 2, but 1 could be used.

Without Discovery

Suppose you've used some other method to infer which of your contigs are viral. We can skip the discovery step and just do annotation with -am True:

Note: other arguments (such as -hh are up to the user, but each should be considered.

Note on detecting circularity

Detection of circularity in contigs assembled from Illumina data relies on the fact that most assemblers (e.g. SPAdes, MEGAhit, IDBA_UD) will append direct terminal repeat sequences (equal in length to the largest k-mer used in assembly) to the ends of a contig. Some assemblers will not do this (e.g. CLC Bio), and circularity cannot be determined. Bear in mind that circularity cannot be distinguished from presence of direct repeats on a linear DNA molecule if the repeat sequences is greater in length than the largest k-mer used for assembly.

SPAdes with "Vanilla" settings will occasionally concatenate two ssDNA viral genomes that share a highly conserved hairpin region. Using the '--meta' and '--plasmid' SPAdes options seem to eliminate this problem.

Note on segmented viruses

Cenote-Taker 2 does not try to bin separate segments of segmented viruses. I understand people might look for contigs with identical terminal sequences to group segments. I don't know how to prepare GenBank files for this type of sequence.

Notes on prophage pruning

The basic premise for this module is that it quickly annotates each gene on a contig, scores each position based on the annotation (+10 for common virus genes, +5 for hypothetical genes, -3 for common chromosomal genes, 0 for intergenic), and marks the location of virus hallmark genes. Then, sliding 5kb windows are scored and smoothed. Positive "chunks" of windows with the user-determined minimum number of virus hallmark genes get kept as viral sequences. Everything else is discarded.

I've tested the prophage pruning module on most of the caudoviruses in RefSeq. These should NOT be cut or trimmed by the module, as they contain purely viral sequence. Encouragingly, ~97% of these genomes remain intact, whereas the other ~3% get cut or trimmed in regions where they've picked up genes from bacterial chromosomes that are not commonly found in other caudoviruses.

Error can occur the other way as well, when flanking chromosomal regions don't get fully trimmed. I've seen this occasionally when the virus is integrated adjacent to 'hypothetical' genes or genes that are common in both chromosomal and viral sequences (e.g. some topoisomerases). Ideas on 'rules' I could implement to improve the accuracy of the prophage pruning are most welcome. For example, I'm unaware of how I could accurately identify 'Att' sites. Relatedly, integrase genes are often found near one extreme of an integrated virus's genome. However, there are sometimes genes between the integrase and the integration site, so a hard-and-fast rule seems inappropriate.

Notes on virus taxonomy

Virus taxonomy is extremely challenging, particularly for divergent viruses. Cenote-Taker 2 looks at annotations to find the "best" gene to make a taxonomy call (e.g. RdRp for RNA viruses, terminase or portal for tailed phage). The amino acid sequence of said gene is compared to a database of amino acids from taxonomically defined viruses (mostly RefSeq). This is mostly correct, but finer resolution can be obtained using other, lower throughput methods. VContact2 is typically insightful for bacteriophage (this tool is also available with GUI on CyVerse).

Standard AA sequence approach thresholds:

Hallmark AAI to Reference Taxonomic granularity from CT2
>90% Genus, e.g. "Ilzatvirus"
>40% Family, e.g. "Siphoviridae"
>25% Order, e.g. "Caudovirales"
=<25% Generic name, e.g. "phage"

With BLASTN module, each contig in its entirety will be aligned to the reference database, and if a contig aligns to a reference genome with at least 95% ANI and 85% AF (alignment fraction), the contig will be considered to belong the same species as the reference.

I'm aware that ICTV has updated and changed phage taxonomy significantly and NCBI has not yet followed their lead. Unfortunately, I am unaware of how I could use ICTV data for taxonomy calls.

Change log

See this repo's release notes for more recent updates.

CHANGE LOG - March 19, 2020

Created 'patch1_2.0.1' branch, made changes, then merged back to master Fixed path to BLASTN database. Improved BLASTN results reporting in .gbf as well as table output. Even close matches to entry in BLASTN database will be annotated with same modules as divergent contigs. Fixed a bug where contigs with all ORFs being annotated with custom viral HMMs were misformatted. Fixed some minor things with taxonomy calls. This includes attempts to identify conjugative transposons that have similar replication modules to dsDNA phage. If no virion/packaging genes are present but at least one conjugative transposon-associated gene is present, contig is labeled as conjugative transposon. Added a wiki with some but not all relvant info on pipeline. Reformatted README with new schematic and updated info. Added DNA test contigs and RNA test contigs.

CHANGE LOG - April 21, 2020

Created 'patch4_2.0.1' branch, made changes, then merged back to master. Fixed formatting bug that cut off certain RPSBLAST result descriptions. Modified all RPSBLAST searches to '-seg yes' which filters out low-complexity hits Added ~600 pVOG HMMs from here, mostly to bolster viral hallmarks of bacteriophage. Added Quenyavirus RDRP model to rna_virus HMM database and some manually-trimmed helicase domains from small DNA viruses. Added the ORF calling info (PHANOTATE or Prodigal) to the summary .tsv file.

CHANGE LOG - May 7, 2020

Made minor adjustments to the prophage pruning module to allow pruning annotation of up to 99 prophage from a single parent contig. Added coordinates to pruned prophage .fasta header corresponding with parent contig.

CHANGE LOG - September 15, 2020

Made fairly major edits of the virus hallmark gene HMM database. One change was that models that contained a polyprotein or fusion protein were trimmed down to only contain the hallmark gene domain (e.g. models with major capsid protein fused with capsid maturation protease were trimmed to only have the major capsid protein domain). This effected many RNA virus models and some dsDNA models. Also, genes from GenBank identified by VirFam as phage virion genes but missed by the old Cenote-Taker 2 HMMs were identified (Thanks, Igor Tolstoy), and used to generate new HMMs. A small number of models were removed due to non-specific hits.

Also, RPS-BLAST was parallelized more efficiently in the module for linear virus contig annotation.

CHANGE LOG - October 2, 2020

An option "virion" for the "--virus_domain_db" argument was added. With this option, only hallmark genes encoding virion structural (e.g. capsid) genes, packaging genes, and virion maturation genes will be used in the hallmark count (in regards to "--lin_minimum_hallmark_genes" and "--circ_minimum_hallmark_genes"). A bug where Cenote-Taker wasn't finding read coverage information for linear contigs was fixed. Several changes were made that don't effect the output of the program, but the fixes prevent "ugly" bash errors from being spit out to the terminal. Edits tested on 'Patch7_2.0.1' branch.

CHANGE LOG - March 4, 2021

I've made some very large changes to the code. Mostly, these changes increase paralellization of BLAST, HMMER, HHblits/HHpred, and the pruning module. Further, all genome maps for the entire run can be found in 'sequin_and_genome_maps/'. I've added annotation mode: -am True, discovery only mode: unlimited_breadsticks.py, and improved the output tables. Also, about 100 hallmark gene HMMs were removed for being non specific to viruses, and about 50 new hallmark genes were added from divergent genomes identified with SEEKER. With these changes, I've made Release v2.1.1

Edits tested on 'Patch8_2.0.1' branch.