-
Notifications
You must be signed in to change notification settings - Fork 23
RSV annotation
- Changes made to
rsv.minfo
file - Changes made to covariance models (
rsv.cm
) - Changes made to blastx protein library (
.vadr.protein.fa
files) - Construction of profile hidden Markov models (
rsv.hmm
)
The steps below explain how to use VADR for RSV sequence validation and annotation. Testing of VADR for RSV is still ongoing and these recommendations are subject to change.
GenBank is not currently using VADR to automatically process RSV sequence submissions like it does for SARS-CoV-2. GenBank RSV sequence submissions are still all manually reviewed, but indexers typically use VADR as part of their analysis, using the command-line options listed in step 4 below.
Steps for using VADR for RSV annotation:
-
Download and install the latest version of VADR (v1.5.1 or later), following the instructions on this page. Alternatively, you can use the StaPH-B VADR 1.5.1 (or later) docker image created by Curtis Kapsak (docker image names:
staphb/vadr:1.5.1
andstaphb/vadr:latest
), available on dockerhub and quay. A brief README for the docker image is here. -
Download the latest RSV VADR models (version 1.5-2, gzipped tarball) from here, unpack them (e.g.
tar xfz <tarball.gz>
). Note the path to the directory name created (<rsv-models-dir-path>
) for step 3. (If you are using the docker image you can skip this step as the RSV VADR models are included.) -
WARNING: the
fasta-trim-terminal-ambigs.pl
script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.To remove terminal ambiguous nucleotides from your sequence file
<input-fasta-file>
and to remove short and long sequences to create a new trimmed file<trimmed-fasta-file>
, execute:
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 150 --maxlen 15500 <input-fasta-file> > <trimmed-fasta-file>
-
Run the
v-annotate.pl
program on an input trimmed fasta file with RSV sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).*NOTE: The following command runs multithreaded on up to 4 CPUs, and so is only recommended if you have at least 4 CPUs and 128Gb RAM available. To run on
<n>
CPUs instead, replace--cpu 4
with--cpu <n>
(which will require<n>
32Gb of RAM). To run single threaded on a single CPU remove the--cpu 4
option. The--split
and--cpu
options are incompatible with-p
.
v-annotate.pl --split --cpu 4 -r --mkey rsv --xnocomp --mdir <rsv-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>
This section shows output from an example v-annotate.pl
annotation of three
RSV sequences from GenBank using the above command and options.
The fasta file of those three sequences can be downloaded here.
(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)
For this example, the RSV model directory is in /usr/local/vadr-models-rsv-1.5-2
and the pretrim.rsv.3.fa
sequence file is in the current directory. We will create a new file
rsv.3.fa
with trimmed sequences in the next step.
As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 150nt or longer than 15,500nt with the command:
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 150 --maxlen 15500 pretrim.rsv.3.fa > rsv.3.fa
Next, to annotate the trimmed sequences using the above
v-annotate.pl
options for RSV, run the following command (scroll to the
right to see full command), which will create a new directory called
my3
into which VADR output files will be written.
v-annotate.pl --split --cpu 4 -r --mkey rsv --xnocomp --mdir /usr/local/vadr-models-rsv-1.5-2 rsv.3.fa my3
The options used are explained further below.
When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:
# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.5.1 (Feb 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Fri Feb 10 12:25:31 2023
# $VADRBIOEASELDIR: /home/nawrocki/vadr-install-dir/Bio-Easel
# $VADRBLASTDIR: /home/nawrocki/vadr-install-dir/ncbi-blast/bin
# $VADREASELDIR: /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRINFERNALDIR: /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRMODELDIR: /home/nawrocki/vadr-install-dir/vadr-models-calici
# $VADRSCRIPTSDIR: /home/nawrocki/vadr-install-dir/vadr
#
# sequence file: rsv.3.fa
# output directory: my3
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr': rsv [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR: /usr/local/vadr-models-rsv-1.5-2 [--mdir]
# turn off composition-based for blastx statistics with -comp_based_stats 0: yes [--xnocomp]
# replace stretches of Ns with expected nts, where possible: yes [-r]
# split input file into chunks, run each chunk separately: yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch): 4 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Next, v-annotate.pl
will output information as it proceeds through different steps of the analysis:
# Validating input ... done. [ 0.1 seconds]
# Splitting sequence file into chunks to run independently in parallel on 4 processors ... done. [ 0.4 seconds]
# Executing 3 scripts in parallel on 4 processors to process 3 partition(s) of all 3 sequence(s) ...
# 3 of 3 jobs finished (0.1 minutes spent waiting)
# done. [ 64.9 seconds]
# Merging and finalizing output ... done. [ 0.8 seconds]
With the --split --cpu 4
options, the input fasta script is split up
into chunks and runs v-annotate.pl
separately on those chunks on 4
different CPUs in parallel. When all sequences are finished processing
the main script merges the output together.
The v-annotate.pl
output then includes a summary of the classification of sequences, and the alerts reported:
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- -------- ----- -------- ---- ---- ----
1 KY654518 RSV A 2 1 1
2 MZ516105 RSV B 1 1 0
#--- -------- ----- -------- ---- ---- ----
- *all* - - 3 2 1
- *none* - - 0 0 0
#--- -------- ----- -------- ---- ---- ----
#
# Summary of reported alerts:
#
# alert causes short per num num long
#idx code failure description type cases seqs description
#--- -------- ------- ----------------------------- ------- ----- ---- -----------
1 ambgnt3f no AMBIGUITY_AT_FEATURE_END feature 1 1 final nucleotide of non-CDS feature is an ambiguous nucleotide
2 ambgnt3c no AMBIGUITY_AT_CDS_END feature 1 1 final nucleotide of CDS is an ambiguous nucleotide
#--- -------- ------- ----------------------------- ------- ----- ---- -----------
3 unexleng yes UNEXPECTED_LENGTH feature 1 1 length of complete coding (CDS or mat_peptide) feature is not a multiple of 3
4 cdsstopn yes CDS_HAS_STOP_CODON feature 1 1 in-frame stop codon exists 5' of stop position predicted by homology to reference
5 fsthicft yes POSSIBLE_FRAMESHIFT_HIGH_CONF feature 1 1 high confidence possible frameshift in CDS (frame not restored before end)
6 indf5pst yes INDEFINITE_ANNOTATION_START feature 1 1 protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
#--- -------- ------- ----------------------------- ------- ----- ---- -----------
And finally a list of the output files created:
# Output printed to screen saved in: my3.vadr.log
# List of executed commands saved in: my3.vadr.cmd
# List and description of all output files saved in: my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in: my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in: my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in: my3.vadr.fail.tbl
# list of passing sequences saved in: my3.vadr.pass.list
# list of failing sequences saved in: my3.vadr.fail.list
# list of alerts in the feature tables saved in: my3.vadr.alt.list
# fasta file with passing sequences saved in: my3.vadr.pass.fa
# fasta file with failing sequences saved in: my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in: my3.vadr.sqa
# per-sequence tabular classification summary file saved in: my3.vadr.sqc
# per-feature tabular summary file saved in: my3.vadr.ftr
# per-model-segment tabular summary file saved in: my3.vadr.sgm
# per-alert tabular summary file saved in: my3.vadr.alt
# alert count tabular summary file saved in: my3.vadr.alc
# per-model tabular summary file saved in: my3.vadr.mdl
# alignment doctoring tabular summary file saved in: my3.vadr.dcr
# replaced stretches of Ns summary file (-r) saved in: my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time: 00:01:06.44
# hh:mm:ss
#
[ok]
Note that all the output files will be in the newly created my3
directory.
The Summary of classified sequences
listed that two sequences passed and one failed.
The file my3.vadr.pass.list
, lists the two sequences that passed:
KU839633.1
OM062710.1
and my3.vadr.fail.list
lists the sequence that failed:
MZ516127.1
Also, FASTA-formatted sequence files for each the passing and failing
sequences are my3.vadr.pass.fa
and my3.vadr.fail.fa
.
For the two sequences that passed, the annotation is available in the
output my3.vadr.pass.tbl
file and for the two sequences that failed the
annotation is in the my3.vadr.fail.tbl
file.
Here is the output for the first sequence in the my3.vadr.pass.tbl
file: (with the middle of the table for each sequence removed for
brevity)
>Feature KU839633.1
29 448 gene
gene NS1
29 448 CDS
product nonstructural protein 1
protein_id KU839633.1_1
556 930 gene
gene NS2
556 930 CDS
product nonstructural protein 2
protein_id KU839633.1_2
1069 2244 gene
gene N
1069 2244 CDS
product nucleoprotein
protein_id KU839633.1_3
...snip...
7600 8187 gene
gene M2
7600 8187 CDS
product M2-1 protein
protein_id KU839633.1_9
8153 8425 gene
gene M2
8153 8425 CDS
product M2-2 protein
protein_id KU839633.1_10
8491 14991 gene
gene L
8491 14991 CDS
product RNA-dependent RNA polymerase
protein_id KU839633.1_11
And the sequence in the my3.vadr.fail.tbl
(with the middle removed for brevity):
>Feature MZ516127.1
96 515 gene
gene NS1
96 515 CDS
product nonstructural protein 1
protein_id MZ516127.1_1
625 999 gene
gene NS2
625 999 CDS
product nonstructural protein 2
protein_id MZ516127.1_2
1137 2312 gene
gene N
1137 2312 CDS
product nucleoprotein
protein_id MZ516127.1_3
...snip...
7666 8250 gene
gene M2-1
7666 8250 CDS
product M2-1 protein
protein_id MZ516127.1_9
8225 8491 gene
gene M2-2
8225 8491 CDS
product M2-2 protein
protein_id MZ516127.1_10
8558 >15029 gene
gene L
8558 >15029 misc_feature
note similar to RNA-dependent RNA polymerase
Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:RNA-dependent RNA polymerase) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:4649,M:4649]; seq-coords:10403..10405:+; mdl-coords:10407..10409:+; mdl:KY654518;
ERROR: POSSIBLE_FRAMESHIFT_HIGH_CONF: (CDS:RNA-dependent RNA polymerase) high confidence possible frameshift in CDS (frame not restored before end) [cause:delete,S:10384,M:10388(1); frame:1(3); length:1827:(4670); shifted_avgpp:0.974; exp_avgpp:0.968;]; seq-coords:10385..15054:+; mdl-coords:10388..15058:+; mdl:KY654518;
ERROR: INDEFINITE_ANNOTATION_START: (CDS:RNA-dependent RNA polymerase) protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [1826>5]; seq-coords:8558..10383:+; mdl-coords:8561..8561:+; mdl:KY654518;
ERROR: UNEXPECTED_LENGTH: (CDS:RNA-dependent RNA polymerase) length of complete coding (CDS or mat_peptide) feature is not a multiple of 3 [6497]; seq-coords:8558..15054:+; mdl-coords:8561..15058:+; mdl:KY654518;
Feature table format is described at https://www.ncbi.nlm.nih.gov/WebSub/html/help/feature-table.html.
Note that the end of the fail.tbl
file lists ERRORs for
MZ516127.1
. In this case, the RNA-dependent RNA polymerase coding region includes
many ambiguous nucleotides (N
s) and one of those N-regions seems to
be of unexpected length, causing a frameshift. To investigate issues like these
further, it can be helpful to add the --keep
option to v-annotate.pl
which
results in additional output files being created, including alignment files.
ERROR
lines such as this are meant to highlight potential problems
for manual review or regions of interest to the user, but they do not
necessarily mean that there is a problem with the sequence.
The annotation information is also available in other files with
different formats, such as the my3/my3.vadr.ftr
file, which may be
easier to parse for some applications.
For examples of alerts/errors and more information on how to interpret the
VADR output related to those alerts in the output feature tables, .alt.list
file and the GenBank submission portal detailed error report .tsv
files, see this vadr documentation page.
See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.
You can find information on two papers on VADR (below).
The options used in the above command are the recommended set of options for RSV annotation. These options are currently used by GenBank indexers when they run VADR for validating and annotating incoming RSV sequence submissions. (All submissions are currently manually reviewed, no automated processing/approval using VADR is in place for RSV sequences at this time.) The options are each briefly explained in the table below. More information can be found here,
option | explanation |
---|---|
--split |
split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to <n> by adding option --nkb <n>
|
--cpu 4 |
for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 4 CPU workers (4 can be changed to <n1> with --cpu <n1> ), requires --split
|
-r |
turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible |
--mkey rsv |
use the model files with prefix rsv in the directory from --mdir
|
--xnocomp |
turn off composition-based stats for blastx, which seems to increase the length of RSV blastx alignments in some cases |
--mdir /usr/local/vadr-models-rsv-1.5-2 |
specify that the models to use are in directory /usr/local/vadr-models-rsv-1.5-2 |
The -r
option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see
this page
As of January 31, 2023, the VADR model library for RSV annotation
(vadr-models-rsv-1.5-2
model
file)
includes two RSV models. One model is based on
KY654518.1
a human RSV A GenBank sequence of length 15,277 nt. The other model
is based on MZ516105.1
a human RSV B GenBank sequence of length 15,276 nt.
The model files were developed during testing on existing INSDC RSV A and B sequences. The initial model files were created with the following command with VADR 1.5 in December 2022, and then later modified during testing as explained more below.
v-build.pl KY654518 KY654518
v-build.pl MZ516105 MZ516105
The files in the two output directories KY654518
and MZ516105
were
then combined to make a model 'library' for RSV (consisting of two
models) as explained
here
with the exception that rsv
was used as the root for naming the
library files instead of my.vadr
.
After that, the changes discussed below were made to the model library files:
In the rsv.minfo
file the following changes were made manuually:
The two lines that begin with MODEL were changed to:
MODEL KY654518 blastdb:"KY654518.vadr.protein.fa" cmfile:"KY654518.vadr.cm" length:"15277" group:"RSV" subgroup:"A"
MODEL MZ516105 blastdb:"MZ516105.vadr.protein.fa" cmfile:"MZ516105.vadr.cm" length:"15276" group:"RSV" subgroup:"B"
Additionally in rsv.minfo
, the lines that begin with FEATURE KY654518
and include gene:"G"
were removed and
replaced with:
FEATURE KY654518 type:"gene" coords:"4681..5643:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"15"
FEATURE KY654518 type:"gene" coords:"4681..5646:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"16"
FEATURE KY654518 type:"gene" coords:"4681..5649:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"17"
FEATURE KY654518 type:"CDS" coords:"4681..5643:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5493:72;5494:72;5495:72;5496:72;5497:72;5498:72;5499:72;5500:72;5531:72;" xmaxdel_exc:"259:72;260:72;261:72;262:72;263:72;264:72;265:72;266:72;267:72;268:72;269:72;270:72;271:72;272:72;273:72;274:72;275:72;276:72;" alternative_ftr_set:"attachment(cds)"
FEATURE KY654518 type:"CDS" coords:"4681..5646:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5493:72;5494:72;5495:72;5496:72;5497:72;5498:72;5499:72;5500:72;5531:72;" xmaxdel_exc:"259:72;260:72;261:72;262:72;263:72;264:72;265:72;266:72;267:72;268:72;269:72;270:72;271:72;272:72;273:72;274:72;275:72;276:72;" alternative_ftr_set:"attachment(cds)"
FEATURE KY654518 type:"CDS" coords:"4681..5649:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5493:72;5494:72;5495:72;5496:72;5497:72;5498:72;5499:72;5500:72;5531:72;" xmaxdel_exc:"259:72;260:72;261:72;262:72;263:72;264:72;265:72;266:72;267:72;268:72;269:72;270:72;271:72;272:72;273:72;274:72;275:72;276:72;" alternative_ftr_set:"attachment(cds)"
And the lines that begin with FEATURE MZ516105
and include gene:"G"
were removed and
replaced with:
FEATURE MZ516105 type:"gene" coords:"4688..5620:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"16"
FEATURE MZ516105 type:"gene" coords:"4688..5629:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"17"
FEATURE MZ516105 type:"gene" coords:"4688..5635:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"18"
FEATURE MZ516105 type:"gene" coords:"4688..5641:+" parent_idx_str:"GBNULL" gene:"G" alternative_ftr_set:"attachment(gene)" alternative_ftr_set_subn:"19"
FEATURE MZ516105 type:"CDS" coords:"4688..5620:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" xmaxins_exc:"250:60;251:60;252:60;253:60;254:60;255:60;256:60;257:60;258:60;259:60;260:60" alternative_ftr_set:"attachment(cds)"
FEATURE MZ516105 type:"CDS" coords:"4688..5629:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" xmaxins_exc:"250:60;251:60;252:60;253:60;254:60;255:60;256:60;257:60;258:60;259:60;260:60" alternative_ftr_set:"attachment(cds)"
FEATURE MZ516105 type:"CDS" coords:"4688..5635:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" alternative_ftr_set:"attachment(cds)"
FEATURE MZ516105 type:"CDS" coords:"4688..5641:+" parent_idx_str:"GBNULL" gene:"G" product:"attachment glycoprotein" nmaxdel_exc:"5441:60;" xmaxins_exc:"235:60;236:60;237:60;238:60;239:60;240:60;241:60;242:60;243:60;244:60;245:60;246:60;247:60;248:60;249:60;250:60;251:60;252:60;253:60;254:60;255:60;256:60;257:60;258:60;259:60;260:60;" alternative_ftr_set:"attachment(cds)"
These sets of lines allow different RSV sequences with alternative
stop codons in the attachment glycoprotein coding sequence to be
appropriately handled by v-annotate.pl
through the
alternative_ftr_set
and alternative_ftr_set_subn
attributes.
Additionally, the nmaxdel_exc
and xmaxins_exc
attributes enable
exceptions for the maximum allowed deletion and insertions,
respectively, at given reference positions. These allow additional known
diversity in the attachment glycoprotein to be appropriately handled
(and not cause valid RSV sequences to fail due to these indels).
In many RSV A genomes, there is a common deletion of length 72
positions relative to the KY654518.1 sequence starting at KY654518
position 5496, and there is a similar deletion in many RSV B genomes
relative to MZ516105 of length 60 starting at position 5441. To allow
VADR to better handle these deletions, especially at correctly
aligning sequences to the model reference sequences, the CMs in the
RSV 1.5-2 model file rsv.cm
were built from alignments with two
sequences each. The KY654518 CM was built from the KY654518.2.stk
alignment (included in the model library) which includes 2 sequences:
KY654518.1 and a modified version of KY654518.1 with a deletion of length 72 starting at
position 5496, which is a commonly deleted region in RSV A. Similarly,
the MZ516105 CM was built from the MZ516105.2.stk
alignment (included in the model library) which includes 2 sequences:
MZ516105.1 and a modified version of MZ516105.1 with a deletion of length 60 starting at
position 5441.
Those models were built using infernal v1.1.4 with the commands:
cmbuild -F --noss --hand -n KY654518 --eset 1.0 KY654518.2.cm KY654518.2.stk
cmbuild -F --noss --hand -n MZ516105 --eset 1.0 MZ516105.2.cm MZ516105.2.stk
cat MZ516105.2.cm KY654518.2.cm > rsv.cm
The v-annotate.pl
script validates protein coding sequences using
blastx
by default. To accomodate natural variation in protein coding
sequences in RSV, mainly the attachment glycoprotein coding sequence,
additional proteins besides the model proteins from KY654518
and
MZ516105
were added to the blastx
protein libraries. These were
chosen based on empirical performance during testing of the RSV models
with v-annotate.pl
on all existing INSDC RSV sequences, by
identifying sequences that failed for similar reasons, typically
indf5pst
and indf3pst
alerts (with descriptions
INDEFINITE_ANNOTATION_START
and INDEFINITE_ANNOTATION_END
) due to
stop codons that occur upstream of the stop codon position in the
KY654518
or MZ516105
sequences. Specifically, the
.vadr.protein.fa
files contain the following sequences:
(the sequences from KY654518.1
and MZ516105.1
were added initially by v-build.pl
and all other sequences
were manually added based on testing results):
sequence name in .vadr.protein.fa file |
source sequence | source sequence positions | protein length | reference model positions | gene | CDS product
|
---|---|---|---|---|---|---|
KY654518.1/99..518:+ |
KY654518.1 |
99..518:+ |
140 |
99..518:+ |
NS1 |
nonstructural protein 1 |
KY654518.1/628..1002:+ |
KY654518.1 |
628..1002:+ |
125 |
628..1002:+ |
NS2 |
nonstructural protein 2 |
KY654518.1/1140..2315:+ |
KY654518.1 |
1140..2315:+ |
392 |
1140..2315:+ |
N |
nucleoprotein |
KY654518.1/2347..3072:+ |
KY654518.1 |
2347..3072:+ |
242 |
2347..3072:+ |
P |
phosphoprotein |
KY654518.1/3255..4025:+ |
KY654518.1 |
3255..4025:+ |
257 |
3255..4025:+ |
M |
matrix protein |
KY654518.1/4295..4489:+ |
KY654518.1 |
4295..4489:+ |
65 |
4295..4489:+ |
SH |
small hydrophobic protein |
KY654518.1/4681..5646:+ |
KY654518.1 |
4681..5646:+ |
322 |
4681..5646:+ |
G |
attachment glcyoprotein |
KY654518.1/5726..7450:+ |
KY654518.1 |
5726..7450:+ |
575 |
5726..7450:+ |
F |
fusion glcyoprotein |
KY654518.1/7669..8253:+ |
KY654518.1 |
7669..8253:+ |
195 |
7669..8253:+ |
M2-1 |
M2-1 protein |
KY654518.1/8228..8494:+ |
KY654518.1 |
8228..8494:+ |
89 |
8228..8494:+ |
M2-2 |
M2-2 protein |
KY654518.1/8561..15058:+ |
KY654518.1 |
8561..15058:+ |
2166 |
8561..15058:+ |
L |
RNA-dependent RNA polymerase |
proteins below were manually added | ||||||
OM857255.1:4629..5591:+/4681..5643:+ |
OM857255.1 |
4629..5591:+ |
321 |
4681..5643:+ |
G |
attachment glcyoprotein |
AF065254.1:16..909:+/4681..5646:+ |
AF065254.1 |
16..909:+ |
298 |
4681..5646:+ |
G |
attachment glcyoprotein |
OK649616.1:4670..5563:+/4681..5646:+ |
OK649616.1 |
4670..5563:+ |
298 |
4681..5646:+ |
G |
attachment glcyoprotein |
hybrid:KY654518.1:4681..4695:+:AF065410.1:1..879:+/4681..5646:+ |
KY654518.1 |
4681..5646:+ |
322 |
4681..5646:+ |
G |
attachment glcyoprotein |
KF826850.1:4675..5568:+/4681..5646:+ |
KF826850.1 |
4675..5568:+ |
298 |
4681..5646:+ |
G |
attachment glcyoprotein |
KU316092.1:4620..5516:+/4681..5646:+ |
KU316092.1 |
4620..5516:+ |
299 |
4681..5646:+ |
G |
attachment glcyoprotein |
KU316164.1:4611..5504:+/4681..5646:+ |
KU316164.1 |
4611..5504:+ |
298 |
4681..5646:+ |
G |
attachment glcyoprotein |
NC_038235.1:4688..5584:+/4681..5649:+ |
NC_038235.1 |
4688..5584:+ |
299 |
4681..5649:+ |
G |
attachment glcyoprotein |
MZ515659.1:4681..5649:+/4681..5649:+ |
MZ515659.1 |
4681..5649:+ |
323 |
4681..5649:+ |
G |
attachment glcyoprotein |
HQ699266.1:1..897:+/4681..5649:+ |
HQ699266.1 |
1..897:+ |
299 |
4681..5649:+ |
G |
attachment glcyoprotein |
KJ641590.1:4630..5526:+/4681..5649:+ |
KJ641590.1 |
4630..5526:+ |
299 |
4681..5649:+ |
G |
attachment glcyoprotein |
OK649616.1:4670..5566:+/4681..5649:+ |
OK649616.1 |
4670..5566:+ |
299 |
4681..5649:+ |
G |
attachment glcyoprotein |
M17212.1:16..912:+/4681..5649:+ |
M17212.1 |
16..912:+ |
299 |
4681..5649:+ |
G |
attachment glcyoprotein |
OM857351.1:7614..8180:+/7669..8235:+ |
OM857351.1 |
7614..8180:+ |
189 |
7669..8235:+ |
M2-1 |
M2-1 protein |
sequence name in .vadr.protein.fa file |
source sequence | source sequence positions | protein length | reference model positions | gene | CDS product
|
---|---|---|---|---|---|---|
MZ516105.1/99..518:+ |
MZ516105.1 |
99..518:+ |
140 |
99..518:+ |
NS1 |
nonstructural protein 1 |
MZ516105.1/626..1000:+ |
MZ516105.1 |
626..1000:+ |
125 |
626..1000:+ |
NS2 |
nonstructural protein 2 |
MZ516105.1/1139..2314:+ |
MZ516105.1 |
1139..2314:+ |
392 |
1139..2314:+ |
N |
nucleoprotein |
MZ516105.1/2347..3072:+ |
MZ516105.1 |
2347..3072:+ |
242 |
2347..3072:+ |
P |
phosphoprotein |
MZ516105.1/3262..4032:+ |
MZ516105.1 |
3262..4032:+ |
257 |
3262..4032:+ |
M |
matrix protein |
MZ516105.1/4301..4498:+ |
MZ516105.1 |
4301..4498:+ |
66 |
4301..4498:+ |
SH |
small hydrophobic protein |
MZ516105.1/4688..5620:+ |
MZ516105.1 |
4688..5620:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
MZ516105.1/5718..7442:+ |
MZ516105.1 |
5718..7442:+ |
575 |
5718..7442:+ |
F |
fusion glcyoprotein |
MZ516105.1/7669..8256:+ |
MZ516105.1 |
7669..8256:+ |
196 |
7669..8256:+ |
M2-1 |
M2-1 protein |
MZ516105.1/8222..8494:+ |
MZ516105.1 |
8222..8494:+ |
91 |
8222..8494:+ |
M2-2 |
M2-2 protein |
MZ516105.1/8560..15060:+ |
MZ516105.1 |
8560..15060:+ |
2167 |
8560..15060:+ |
L |
RNA-dependent RNA polymerase |
proteins below were manually added | ||||||
MT040088.1:4679..5572:+/4688..5620:+ |
MT040088.1 |
4679..5572:+ |
298 |
4688..5620:+ |
G |
attachment glcyoprotein |
KJ627364.1:4618..5550:+/4688..5620:+ |
KJ627364.1 |
4618..5550:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
MH760718.1:4597..5529:+/4688..5620:+ |
MH760718.1 |
4597..5529:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
MG642047.1:4666..5565:+/4688..5620:+ |
MG642047.1 |
4666..5565:+ |
300 |
4688..5620:+ |
G |
attachment glcyoprotein |
LC474547.1:4663..5595:+/4688..5620:+ |
LC474547.1 |
4663..5595:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
MG431253.1:4674..5567:+/4688..5620:+ |
MG431253.1 |
4674..5567:+ |
298 |
4688..5620:+ |
G |
attachment glcyoprotein |
MZ962122.1:1..933:+/4688..5620:+ |
MZ962122.1 |
1..933:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
KC297442.1:1..933:+/4688..5620:+ |
KC297442.1 |
1..933:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
LC311384.1:1..933:+/4688..5620:+ |
LC311384.1 |
1..933:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
MZ515748.1:4689..5621:+/4688..5620:+ |
MZ515748.1 |
4689..5621:+ |
311 |
4688..5620:+ |
G |
attachment glcyoprotein |
KC297470.1:1..882:+/4688..5629:+ |
KC297470.1 |
1..882:+ |
294 |
4688..5629:+ |
G |
attachment glcyoprotein |
KP856962.1:4618..5505:+/4688..5629:+ |
KP856962.1 |
4618..5505:+ |
296 |
4688..5629:+ |
G |
attachment glcyoprotein |
KU950619.1:4663..5604:+/4688..5629:+ |
KU950619.1 |
4663..5604:+ |
314 |
4688..5629:+ |
G |
attachment glcyoprotein |
KP258745.1:4620..5507:+/4688..5629:+ |
KP258745.1 |
4620..5507:+ |
296 |
4688..5629:+ |
G |
attachment glcyoprotein |
MF185751.1:4640..5527:+/4688..5629:+ |
MF185751.1 |
4640..5527:+ |
296 |
4688..5629:+ |
G |
attachment glcyoprotein |
KU316181.1:4618..5505:+/4688..5629:+ |
KU316181.1 |
4618..5505:+ |
296 |
4688..5629:+ |
G |
attachment glcyoprotein |
KJ627249.1:4618..5565:+/4688..5635:+ |
KJ627249.1 |
4618..5565:+ |
316 |
4688..5635:+ |
G |
attachment glcyoprotein |
NC_001781.1:4690..5589:+/4688..5641:+ |
NC_001781.1 |
4690..5589:+ |
300 |
4688..5641:+ |
G |
attachment glcyoprotein |
KU316144.1:4618..5517:+/4688..5641:+ |
KU316144.1 |
4618..5517:+ |
300 |
4688..5641:+ |
G |
attachment glcyoprotein |
MN365572.1:4676..5629:+/4688..5641:+ |
MN365572.1 |
4676..5629:+ |
318 |
4688..5641:+ |
G |
attachment glcyoprotein |
OK649740.1:4675..5574:+/4688..5641:+ |
OK649740.1 |
4675..5574:+ |
300 |
4688..5641:+ |
G |
attachment glcyoprotein |
LC474543.1:8538..15017:+/8560..15039:+ |
LC474543.1 |
8538..15017:+ |
2160 |
8560..15039:+ |
L |
RNA-dependent RNA polymerase |
The protein validation stage of v-annotate.pl
can use profile HMMs
with HMMER
instead of blastx
by using the --pv_hmmer
option. By default, v-build.pl
will create 'profile' HMMs from the
single sequence proteins encoded by CDS in the model sequence, in this
case, KY654518.1 and MZ516105.1. For the rsv v1.5-2 model library, for
both the RSV A and RSV B models, additional attachment glycoprotein
profile HMMs corresponding to the different possible stop codon
positions were created and the default glycoprotein profile HMM was
replaced with one built from an alignment of multiple sequences.
Specifically, the following additional profile HMMs were created:
reference model | reference model positions | gene | number of sequences in hmmbuild alignment |
alignment file in hmm-alignments subdirectory |
---|---|---|---|---|
KY654518 (RSV A) |
4681..5643 |
G | 1 | KY654518.4681..5643.protein.fa |
KY654518 (RSV A) |
4681..5646 |
G | 7 | KY654518.4681..5646.protein.stk |
KY654518 (RSV A) |
4681..5649 |
G | 6 | KY654518.4681..5649.protein.stk |
MZ516105 (RSV B) |
4688..5620 |
G | 11 | MZ516105.4688..5620.protein.stk |
MZ516105 (RSV B) |
4688..5629 |
G | 6 | MZ516105.4688..5629.protein.stk |
MZ516105 (RSV B) |
4688..5635 |
G | 1 | MZ516105.4688..5629.protein.fa |
MZ516105 (RSV B) |
4688..5641 |
G | 4 | MZ516105.4688..5641.protein.stk |
These profile HMMs were added to the default set created by
v-build.pl
(after removing the original attachment glycoprotein
profile HMMs) to make the RSV profile HMM library file rsv.hmm
. All
files necessary for reproducing the construction of rsv.hmm
are
available in the hmm-alignments
subdirectory in the RSV 1.5-2 model
directory. The following commands will recreate the rsv.hmm
file
within that subdirectory:
sh ./A.build.sh
sh ./B.build.sh
cat KY*hmm MZ*hmm > rsv.hmm
A tutorial based on building an RSV model
library
is included in the VADR documentation. That tutorial includes a long
discussion of how to build a model library and includes a procedure
similar to the one taken to build the RSV vadr-models-rsv-1.5-2
model set.
- VADR README
- VADR installation instructions
-
v-build.pl
example usage and command-line options -
v-annotate.pl
example usage, command-line options and alert information - Advanced tutorial: building an RSV model library
-
Explanations and examples of
v-annotate.pl
detailed alert and error messages- Output fields with detailed alert and error messages
- Explanation of sequence and model coordinate fields in
.alt
files toy50
toy model used in examples of alert messages- Examples of different alert types and corresponding
.alt
output - Posterior probability annotation in VADR output Stockholm alignments
- VADR output file formats
- Available VADR model files (github wiki)
- SARS-CoV-2 annotation (github wiki)
- Rfam-based structural annotation of a viral genome sequence for use with VADR (github wiki)
- Development notes and instructions (github wiki)
-
The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3
-
The following article describes changes made to VADR for faster SARS-CoV-2 annotation, including the
-r
option: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. NAR Genom Bioinform. Vol 5, Issue 1 (2023) https://doi.org/10.1093/nargab/lqad002