Kaptive

Welcome to the documentation page for Kaptive Web

Kaptive Web reports information about capsular (K) loci found in genome assemblies.

Given a novel genome and a database of known K-loci, Kaptive will help you to decide whether your sample has a known or a novel K-locus. It carries out the following for each input genome assembly:

  • BLAST for all known K-locus nucleotide sequences (using blastn) to identify the best match ('best' defined as having the highest coverage).
  • Extract the region(s) of the assembly which correspond to the BLAST hits (i.e. the K-locus sequence in the assembly) and save it to a FASTA file.
  • BLAST for all known K-locus genes (using tblastn) to identify which expected genes (genes in the best matching K-locus) are present/missing and whether any unexpected genes (genes from other K-loci) are present.
  • Visualise the results on-screen in the form of images and tables.
  • Summarise the results in downloadable table and json files.
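As an illustration of the first step, the sketch below picks the best-matching locus by reference coverage. It is not Kaptive's own code: the function names, the `(name, start, end)` hit tuples and the `ref_lengths` dictionary are hypothetical, and it assumes hit coordinates have already been parsed from BLAST output.

```python
from collections import defaultdict


def reference_coverage(hits, ref_lengths):
    """Return the percent of each reference locus covered by its hits.

    hits: iterable of (ref_name, start, end), 1-based inclusive coordinates
          on the reference locus (hypothetical input format).
    ref_lengths: dict mapping ref_name -> reference length in bp.
    """
    intervals = defaultdict(list)
    for name, start, end in hits:
        # Reverse-strand hits can be reported with start > end, so normalise.
        intervals[name].append((min(start, end), max(start, end)))

    coverage = {}
    for name, spans in intervals.items():
        covered = 0
        current_start, current_end = None, None
        for start, end in sorted(spans):
            if current_end is None:
                current_start, current_end = start, end
            elif start > current_end + 1:
                # Gap between hits: close the previous merged interval.
                covered += current_end - current_start + 1
                current_start, current_end = start, end
            else:
                # Overlapping or adjacent hit: extend the merged interval.
                current_end = max(current_end, end)
        covered += current_end - current_start + 1
        coverage[name] = 100.0 * covered / ref_lengths[name]
    return coverage


def best_match(hits, ref_lengths):
    """Return the reference name with the highest coverage."""
    coverage = reference_coverage(hits, ref_lengths)
    return max(coverage, key=coverage.get)
```

Overlapping hits are merged before counting so that the same reference bases are not counted twice.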

Kaptive will indicate the confidence of the K-locus match.

In cases where your input assembly closely matches a known K-locus, Kaptive will indicate a "Perfect" or "Very High" confidence match.

If Kaptive has lower confidence in the match it may mean that your assembly contains a novel K-locus, a deletion or an insertion sequence variant of a known locus. Alternatively it may mean that your input assembly was not of sufficient quality to make a confident match (e.g. if it is very fragmented).

Kaptive cannot reliably extract or annotate K-locus sequences for totally novel loci – if you think you have a novel K-locus you should investigate this further. If you think you may have a variant of a known locus, and you haven't already done so, you could try rerunning Kaptive with the appropriate variant database.

If you do have a novel K-locus or novel variant and you would like it to be added to the database, please let us know.

If you use Kaptive Web in your research, please cite this paper: Kaptive Web: user-friendly capsule and lipopolysaccharide serotype prediction for Klebsiella genomes. doi: 10.1101/260125

If you use the command-line version of Kaptive (download here), please cite this paper: Identification of Klebsiella capsule synthesis loci from whole genome data. doi: 10.1099/mgen.0.000102

Input assemblies

Kaptive takes as input one or more pre-assembled bacterial genomes. We use Unicycler to generate high quality short-read or hybrid assemblies, but you can use your favourite assembly program. Assemblies can be uploaded in FASTA or zipped FASTA format. Alternatively, you can upload multiple assemblies in a zipped directory (one file per sample).
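If you have many assemblies, a short script can bundle them for upload. The sketch below is only an assumption about a convenient workflow: the function name, the list of FASTA file extensions and the choice to place files at the top level of a `.zip` archive are all hypothetical, so check the upload page if your archive is rejected.

```python
import zipfile
from pathlib import Path

# Assumed FASTA extensions; adjust to match your own file names.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def zip_assemblies(directory, archive_path):
    """Write every FASTA file in `directory` (one per sample) to a zip archive.

    Returns the list of file names that were added.
    """
    directory = Path(directory)
    fasta_files = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in fasta_files:
            # Store only the file name, not the full path on disk.
            archive.write(path, arcname=path.name)
    return [p.name for p in fasta_files]
```

For example, `zip_assemblies("assemblies", "assemblies.zip")` would collect the FASTA files from a folder named `assemblies`.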

Results

When your job(s) are completed, the results will be shown on-screen and will remain accessible for up to 7 days, so make sure to note your token! You can also download a summary results table, a summary json file and/or the individual K-locus FASTA sequences extracted from your input assemblies.

Find more details about these outputs here.

Match confidence

This is a categorical measure of match quality, optimised for use with the primary Klebsiella K-locus database:

  • Perfect = the K-locus was found in a single piece with 100% coverage and 100% identity.
  • Very high = the K-locus was found in a single piece with ≥99% coverage and ≥95% identity, with no missing genes and no extra genes.
  • High = the K-locus was found in a single piece with ≥99% coverage, with ≤ 3 missing genes and no extra genes.
  • Good = the K-locus was found in a single piece or with ≥95% coverage, with ≤ 3 missing genes and ≤ 1 extra genes.
  • Low = the K-locus was found in a single piece or with ≥90% coverage, with ≤ 3 missing genes and ≤ 2 extra genes.
  • None = did not qualify for any of the above.
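The categories above can be read as a cascade of checks, from strictest to most lenient. The following is a minimal sketch that transcribes the listed thresholds literally; the function and parameter names are hypothetical, and Kaptive's actual implementation may apply additional conditions that are not listed here.

```python
def match_confidence(one_piece, coverage, identity, missing, extra):
    """Classify a locus match using the thresholds listed in this section.

    one_piece: True if the locus was found in a single assembly piece.
    coverage, identity: blastn percentages (0-100).
    missing, extra: counts of missing expected genes and of extra genes.
    """
    if one_piece and coverage == 100.0 and identity == 100.0:
        return "Perfect"
    if (one_piece and coverage >= 99.0 and identity >= 95.0
            and missing == 0 and extra == 0):
        return "Very high"
    if one_piece and coverage >= 99.0 and missing <= 3 and extra == 0:
        return "High"
    if (one_piece or coverage >= 95.0) and missing <= 3 and extra <= 1:
        return "Good"
    if (one_piece or coverage >= 90.0) and missing <= 3 and extra <= 2:
        return "Low"
    return "None"
```

With made-up inputs, a single-piece match at 99.5% coverage and 97% identity with one missing gene and no extra genes would fall into "High" under these rules, because "Very high" requires no missing genes.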

WARNING: If you use the variant Klebsiella K-locus database please inspect your results carefully and decide for yourself what constitutes a confident match!

Example results and interpretation

Very close match

Example close match

The genome ATCC_BAA1705 is a close match to KL107 with 100% blastn identity at 100% coverage. The K-locus was found in a single assembly piece and was exactly the same length as the reference. All of the expected KL107 genes were found in the K-locus region of the assembly with high tblastn coverage and identity (indicated by dark purple shading). No unexpected genes were found in the K-locus region of the assembly and only a small number were found outside it. This is expected, since some K-locus genes share similarity with genes in other regions of the genome.

More distant match

Example distant match

The genome UCICRE7 is a more distant match to KL2. It has 100% blastn coverage but only 98.72% identity. The K-locus region of the assembly is in a single piece but it is 3 bp shorter than the reference. Most of the expected KL2 genes were found within the K-locus region of the assembly at high tblastn coverage and identity (dark purple shading) but KL2_13 was missing (grey shading). Together the results suggest there may be a small deletion causing a frameshift mutation within KL2_13.

Broken assembly

Example broken assembly

The genome MGH51 seems to be a reasonable match to the KL106 reference (99.11% coverage and 99.95% identity by blastn). However, the K-locus region of its assembly is in at least 6 pieces! When an assembly is broken into multiple pieces we should also treat Kaptive's results cautiously because we can't be sure about the true order of the pieces and we may have missed some pieces that contain novel genes (Kaptive can't find these because it only searches for known K-locus genes).

Poor match - possible novel locus

Example novel locus

The genome ERR276923 best matches the KL30 reference (100% blastn coverage and 96.37% identity) but is missing one of the expected KL30 genes (wcuG, indicated by grey shading) and has an unexpected gene within the K-locus region of the assembly (KL104_18). Five expected KL30 genes also have low coverage and/or identity tblastn matches (light purple shading). These genes are all in the capsule-specific region of the locus (the centre) and are adjacent to the missing gene. The combination of these results (clustered low quality gene matches, a missing gene and an unexpected gene) suggests that this genome may have a novel K-locus. However, the K-locus region of the ERR276923 assembly is in multiple pieces, so care should be taken when interpreting these results. In such a case we recommend further investigation, e.g. exploring the K-locus region of the assembly graph to check for other assembly contigs that may be part of the K-locus. If these contigs contain completely novel genes, Kaptive cannot find them!

Poor match - possible novel variant

Example variant

The genome 1753_ST258 is a partial match to KL107 with 99.98% blastn identity but only 85.56% coverage. The K-locus region of the assembly is in one piece from the left-most galF gene to the right-most ugd gene, homologues of both of which are found in almost all K-loci. However, the K-locus region of the assembly is 2977 bp shorter than the reference, and four genes are missing from the centre of the locus (wbaP, KL107_08, KL107_09 and KL107_10, shown in grey). In fact, 1753_ST258 is a deletion variant of KL107. Running Kaptive with the Klebsiella K-locus variants database shows that it is a very good match to KL107-D1 (shown below).

Example variant database run

Databases available in Kaptive Web

Currently only Klebsiella K-locus and O-locus databases are available in Kaptive Web. You can run the command-line version of Kaptive with any appropriately formatted database of your own.

If you have a locus database that you would like to be added to Kaptive Web for use by yourself and others in the community, please get in touch. Similarly, if you have identified new locus variants not currently in the existing databases, let us know!

Klebsiella K-locus databases

The primary reference database comprises full-length (galF to ugd) annotated sequences for each distinct Klebsiella K-locus, where available:

  • KL1 - KL77 correspond to the loci associated with each of the 77 serologically defined K-type references.
  • KL101 and above are defined from DNA sequence data on the basis of gene content. Note that insertion sequences (IS) are excluded from this database since we assume that the ancestral sequence was likely IS-free and IS transposase genes are not specific to the K-locus. Synthetic IS-free K-locus sequences were generated for K-loci for which no naturally occurring IS-free variants have been identified to date.

The variants database comprises full-length annotated sequences for variants of the distinct loci:

  • IS variants are named KLN-1, KLN-2, etc. For example, KL15-1 is an IS variant of KL15.
  • Deletion variants are named KLN-D1, KLN-D2, etc. For example, KL15-D1 is a deletion variant of KL15. Note that KL156-D1 is included in the primary reference database since no full-length version of this locus has been identified to date.
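The naming scheme above is regular enough to parse with a short pattern. The sketch below is illustrative only: the regular expression and function name are hypothetical and are based solely on the naming rules described in this section.

```python
import re

# KL<number>, optionally followed by -<n> (IS variant) or -D<n> (deletion variant).
LOCUS_NAME = re.compile(r"^(KL\d+)(?:-(D?)(\d+))?$")


def describe_locus(name):
    """Return (parent_locus, kind) for a K-locus name such as 'KL15-D1'."""
    match = LOCUS_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognised K-locus name: {name!r}")
    parent, deletion_flag, number = match.groups()
    if number is None:
        return (parent, "distinct locus")
    kind = "deletion variant" if deletion_flag == "D" else "IS variant"
    return (parent, kind)
```

The name describes the type of variant, not the database it lives in: KL156-D1 parses as a deletion variant even though it is distributed with the primary reference database.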

We recommend screening your data with the primary reference database first to find the best-matching K-locus type. If you have poor matches or are particularly interested in detecting variant loci you should try the variant database. WARNING: If you use the variants database please inspect your results carefully and decide for yourself what constitutes a confident match! Kaptive is not optimised for accurate variant detection.

Klebsiella O locus database

The O locus database (Klebsiella_o_locus_primary_reference.gbk) contains annotated sequences for 12 distinct Klebsiella O loci.

O locus classification requires some special logic, as the O1 and O2 serotypes contain the same locus genes. Two additional genes elsewhere in the chromosome (wbbY and wbbZ) result in the O1 antigen. Kaptive therefore looks for these genes to call an assembly as either O1 or O2. When only one of the two additional genes can be found, the result is ambiguous and Kaptive will report a locus type of O1/O2.
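The decision described above can be written as a small truth table. This is a sketch under the assumption that the locus match itself has already been assigned to the shared O1/O2 locus; the function name and boolean inputs are hypothetical, and the handling of the "neither gene found" case as O2 is inferred from the text rather than taken from Kaptive's code.

```python
def call_o1_or_o2(has_wbbY, has_wbbZ):
    """Resolve the shared O1/O2 locus using the two extra chromosomal genes."""
    if has_wbbY and has_wbbZ:
        return "O1"      # both additional genes present
    if not has_wbbY and not has_wbbZ:
        return "O2"      # neither additional gene present (inferred)
    return "O1/O2"       # only one found: ambiguous
```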

Read more about the O locus and its classification here: The diversity of Klebsiella pneumoniae surface polysaccharides.

FAQs

Why are there K-locus genes found outside the K-locus?

A number of the K-locus genes are orthologous to genes outside of the K-locus region of the genome. For example, the Klebsiella K-locus man and rml genes have orthologues in the LPS (lipopolysaccharide) locus, so it is not unusual to find a small number of genes "outside" the locus. However, if you have a large number of genes (>5) outside the locus it may mean that there is a problem with the locus match, or that your assembly is very fragmented or contaminated (contains more than one sample).

How can my sample be missing K-locus genes when it has a full-length, high identity K-locus match?

Kaptive uses 'tblastn' to screen for the presence of each K-locus gene with a coverage threshold of 90%. A single nonsense mutation or small indel in the centre of a gene will interrupt the 'tblastn' match and cause it to fall below the 90% threshold. However, such a small change has only a minor effect on the nucleotide 'blastn' match across the full locus.
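A rough calculation shows why the two measures diverge. The numbers below (a 400 amino acid protein, a 25,000 bp locus) are made up for illustration, and the sketch assumes that an interrupted gene is scored by its longest uninterrupted segment, which is a simplification of how BLAST hits are actually evaluated.

```python
def longest_segment_coverage(protein_length, stop_position):
    """Percent of a protein covered by the longer uninterrupted segment
    when a premature stop codon falls at `stop_position` (1-based)."""
    before = stop_position - 1
    after = protein_length - stop_position
    return 100.0 * max(before, after) / protein_length


def locus_identity(locus_length, changed_bases):
    """Percent nucleotide identity across the locus after a few base changes."""
    return 100.0 * (locus_length - changed_bases) / locus_length


# Hypothetical example: a stop codon in the middle of a 400 aa protein.
gene_coverage = longest_segment_coverage(400, 200)   # 50.0, below the 90% threshold
# The same single-base change barely affects a 25,000 bp locus alignment.
whole_locus = locus_identity(25000, 1)               # still above 99.99
```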

Why does the K-locus region of my sample contain a ugd gene matching another locus?

A small number of the original K-locus references are truncated, containing only a partial ugd sequence. The reference annotations for these loci do not include ugd, so it is not among the expected genes for these loci in the 'tblastn' search. Instead, Kaptive reports the closest match to the partial sequence (if it exceeds the 90% coverage threshold).

Installation

If you would like to install and run your own version of Kaptive Web, follow the instructions here.

License

GNU General Public License, version 3

http://dx.doi.org/10.5281/zenodo.55773