The pipeline will verify the organism name provided on input when the pgap.py flag --taxcheck-only is used.
The taxonomy check assesses whether the organism name provided in the YAML input file matches the input genome sequence. Using average nucleotide identity (ANI), it compares the input genome sequence to the genomes of the type strains in GenBank. In a first step, the set of type assemblies to which the input sequence is most closely related is determined via k-mer analysis. This set of assemblies is then aligned to the input sequence with pairwise MegaBLAST. The percent identity of the resulting filtered reciprocal best hits is declared as the overall genome-to-type-assembly ANI.
For most species, we use an ANI threshold of 96% identity and a minimum coverage threshold of 80% of both the query and the type assembly to declare that a query assembly matches a type assembly with high confidence.
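As a minimal sketch of the rule above, the following function tests one query-versus-type comparison against the default thresholds. The function and parameter names are hypothetical and not part of PGAP, and treating the thresholds as inclusive (greater than or equal) is an assumption; the text does not say how exact ties are handled.

```python
def matches_type_high_confidence(ani, query_cov, type_cov,
                                 ani_threshold=96.0, min_coverage=80.0):
    """Return True when ANI and both coverages reach the thresholds.

    All values are percentages. The defaults are the thresholds quoted
    for most species; inclusive comparison is an assumption.
    """
    return (ani >= ani_threshold
            and query_cov >= min_coverage
            and type_cov >= min_coverage)


# Illustrative values: 99.975% ANI with 99.8% coverage on both sides.
print(matches_type_high_confidence(99.975, 99.8, 99.8))  # True
# High identity but low coverage of the type assembly does not qualify.
print(matches_type_high_confidence(94.484, 76.4, 59.1))  # False
```

Species that use a different ANI cutoff could be handled by passing another `ani_threshold` value.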
More information is available in this publication.
Possible ANI statuses
The status returned by the taxonomy check can be one of the following:
CONFIRMED: The submitted organism name has been confirmed by ANI. A species can be confirmed by the following methods:
- The assembly matches a type and both are of the same species.
- The assembly matches a type and at least one is a subspecies of the same species.
- The assembly lacks a submitted full binomial name (i.e., submitted organism is a "sp.", or at genus level), matches a type, and both share the same genus.
- The assembly matches a type of a species that was added to a specialized synonymy list designed to cover difficult-to-handle cases of typing.
MISASSIGNED: The submitted organism name has been found to be misassigned to the query assembly.
- The assembly matches a type for a different species.
- If the submitted organism name is a "sp.", there is a mismatch at the genus level.
INCONCLUSIVE: The organism cannot be identified.
- There is no type assembly available for the submitted organism name.
- The assembly matches a type of the same species, but the ANI is below the species ANI threshold.
- The assembly matches a type of a different species, but the ANI is below the species ANI threshold.
- The assembly and closest type do not share enough sequence to make a determination.
CONTAMINATED: Contamination in genome assemblies will be reported if the following conditions are met:
- We have a reference covering at least 50% of the assembly.
- We have a single taxon accounting for at least 10% of the coverage and at least half of the remaining sequence.
Description of the reports
The taxonomy check will produce three reports:
ani-tax-report.txt: This file provides the results of the taxonomy check in text format. It includes:
- Submitted organism name: the organism declared by the submitter, along with NCBI taxonomy identifier, rank (ex: species), and taxonomic lineage.
- Predicted organism name: the organism identity determined by ANI. This may be the same as the submitted organism name.
- Submitted organism has type: possible values are Yes and No. Indicates whether there is a public genome assembly available for the type strain of the declared species.
- Status: possible values are CONFIRMED, MISASSIGNED, INCONCLUSIVE or CONTAMINATED (see above).
- Confidence: possible values are HIGH or LOW. Indicates the confidence level of the stated status. HIGH: the ANI result meets the expected cutoff (96% for most prokaryotic taxa). LOW: the ANI result does not meet the expected cutoff, but the check has provided the best prediction possible based on currently available data.
A table with the following columns:
- Percent identity: the percent identity the submitted sequence has to a public type strain sequence of a different species.
- (Query coverage, Subject coverage): the percent coverage of the query (submitted sequence) to the subject (public type strain), and the percent coverage of the subject (public type strain) to the query (submitted sequence), respectively.
- GenBank assembly ID: identifier for the GenBank assembly used in comparison.
- Organism name: The organism name of the public type strain used for comparison.
- Assembly accession, assembly name: The assembly accession and assembly name of the public type strain used for comparison.
The same data as ani-tax-report.txt, but in XML format.
List of assemblies selected by k-mer analysis for ANI calculation, and their k-mer distance to the query assembly, in XML format.
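The header fields of the text report can be pulled out with a few regular expressions. The sketch below assumes the field labels appear exactly as in the example reports in this document ("Submitted organism:", "Predicted organism:", "Submitted organism has type:", "Status:", "Confidence:"); the function name `parse_report_header` is hypothetical and not part of PGAP.

```python
import re

# Field name -> pattern capturing its value. The labels are taken from the
# example reports in this document; other layouts are not handled.
HEADER_PATTERNS = {
    "submitted_organism": r"Submitted organism:\s*(.+?)\s*\(taxid",
    "predicted_organism": r"Predicted organism:\s*(.+?)\s*\(taxid",
    "has_type": r"Submitted organism has type:\s*(Yes|No)",
    "status": r"Status:\s*(CONFIRMED|MISASSIGNED|INCONCLUSIVE|CONTAMINATED)",
    "confidence": r"Confidence:\s*(HIGH|LOW)",
}


def parse_report_header(text):
    """Return a dict of header fields; a missing field maps to None."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


sample = (
    "Submitted organism: Rickettsia hoogstraalii (taxid = 467174, rank = species)\n"
    "Predicted organism: Rickettsia japonica (taxid = 35790, rank = species)\n"
    "Submitted organism has type: Yes\n"
    "Status: MISASSIGNED\n"
    "Confidence: HIGH\n"
)
print(parse_report_header(sample))
```

The sample header is shortened (the lineage is omitted) to keep the example compact. For anything beyond quick inspection, the XML version of the report is likely the more robust source.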
Example of a MISASSIGNED report:
ANI report for assembly: my_gc_assm_name
Submitted organism: Rickettsia hoogstraalii (taxid = 467174, rank = species, lineage = Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; spotted fever group)
Predicted organism: Rickettsia japonica (taxid = 35790, rank = species, lineage = Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; spotted fever group)
Submitted organism has type: Yes
Status: MISASSIGNED
Confidence: HIGH
99.975 (99.8 99.8) 406738 assembly Rickettsia japonica YH (GCA_000283595.1, ASM28359v1)
99.985 (99.4 99.9) 864348 assembly Rickettsia japonica YH (GCA_000302635.2, ASM30263v2)
97.722 (96.5 97.1) 320558 assembly Rickettsia slovaca 13-B (GCA_000237845.1, ASM23784v1)
98.893 (95.3 84.4) 6004488 assembly Rickettsia fournieri (GCA_900243065.1, PRJEB23962)
97.100 (96.3 91.7) 834068 assembly Rickettsia gravesii BWI-1 (GCA_000485845.1, RicGra1.0)
97.246 (95.9 83.0) 1655938 assembly Rickettsia raoultii (GCA_000940955.1, ASM94095v1)
97.114 (95.8 96.9) 3973378 assembly Rickettsia rickettsii (GCA_001951015.1, ASM195101v1)
97.115 (95.8 96.9) 3973358 assembly Rickettsia rickettsii (GCA_001950995.1, ASM195099v1)
97.115 (95.8 96.9) 1526588 assembly Rickettsia rickettsii str. Iowa (GCA_000017445.3, ASM1744v3)
94.100 (84.7 74.8) 1199088 assembly Rickettsia tamurae (GCA_000751075.1, Rickettsia tamurae AT-1)
94.312 (79.2 75.1) 1720158 assembly Rickettsia monacensis (GCA_000499665.2, RMONA_1)
94.484 (76.4 59.1) 1086398 assembly Rickettsia buchneri (GCA_000696365.1, REISMNv1)
99.121 (99.0 99.4) 296048 assembly Rickettsia heilongjiangensis 054 (GCA_000221205.1, ASM22120v1)
97.446 (96.3 97.4) 380228 assembly Rickettsia honei RB (GCA_000263055.1, Rho1.0)
[...]
96.865 (94.5 92.9) 407678 assembly Rickettsia rhipicephali str. 3-7-female6-CWPP (GCA_000284075.1, ASM28407v1)
93.031 (83.9 72.5) 1485538 assembly Rickettsia hoogstraalii (GCA_000825685.1, Rickettsia hoogstraalii Croatica)
[...]
In the above example, the organism was declared by the submitter to be Rickettsia hoogstraalii. The predicted organism is found to be Rickettsia japonica with high confidence, based on an ANI of 99.975% over 99.8% of the input sequence.
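To read figures such as the 99.975% identity and 99.8% coverage programmatically, a single hit row of the text report can be split into its columns. This is a sketch that assumes each row follows the layout seen in the example reports (identity, the two coverages in parentheses, a numeric assembly ID, the word "assembly", the organism name, then accession and assembly name in parentheses). The function name `parse_hit_row` and the dictionary keys are hypothetical.

```python
import re

# One hit row, e.g.:
# 99.975 (99.8 99.8) 406738 assembly Rickettsia japonica YH (GCA_000283595.1, ASM28359v1)
HIT_ROW = re.compile(
    r"^\s*(?P<identity>\d+(?:\.\d+)?)\s+"
    r"\((?P<query_cov>\d+(?:\.\d+)?)\s+(?P<subject_cov>\d+(?:\.\d+)?)\)\s+"
    r"(?P<assembly_id>\d+)\s+assembly\s+"
    r"(?P<organism>.+?)\s+"
    r"\((?P<accession>GC[AF]_\d+\.\d+),\s*(?P<assembly_name>.+)\)\s*$"
)


def parse_hit_row(line):
    """Parse one hit row into a dict, or return None if it does not match."""
    match = HIT_ROW.match(line)
    if match is None:
        return None
    row = match.groupdict()
    # Convert the numeric columns; the remaining fields stay as strings.
    for key in ("identity", "query_cov", "subject_cov"):
        row[key] = float(row[key])
    return row


top_hit = parse_hit_row(
    "99.975 (99.8 99.8) 406738 assembly Rickettsia japonica YH "
    "(GCA_000283595.1, ASM28359v1)"
)
print(top_hit["organism"], top_hit["identity"], top_hit["query_cov"])
```

Lines that are not hit rows, such as the header fields or an elision marker like `[...]`, return `None` and can simply be skipped.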
Example of a CONTAMINATED report:
ANI report for assembly: my_gc_assm_name
Submitted organism: Staphylococcus aureus (taxid = 1280, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus)
Predicted organism: Staphylococcus aureus (taxid = 1280, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus)
Submitted organism has type: Yes
Status: CONTAMINATED
Confidence: HIGH
99.045 (54.5 80.3) 4972758 assembly Ochrobactrum quorumnocens (GCA_002278035.1, ASM227803v1)
99.450 (31.5 94.1) 11348628 assembly Staphylococcus aureus (GCA_006364675.1, ASM636467v1)
99.450 (31.5 94.3) 10960368 assembly Staphylococcus aureus subsp. aureus (GCA_006094915.1, ASM609491v1)
99.450 (31.5 94.3) 1806888 assembly Staphylococcus aureus subsp. aureus DSM 20231 (GCA_001027105.1, ASM102710v1)
99.441 (31.5 94.2) 8986608 assembly Staphylococcus aureus (GCA_900706775.1, 27323_B01)
99.450 (31.5 95.1) 2490008 assembly Staphylococcus aureus subsp. aureus DSM 20231 (GCA_000330825.2, SASA1.0)
99.465 (31.4 96.1) 2855328 assembly Staphylococcus aureus subsp. aureus NBRC 100910 (GCA_001544175.1, ASM154417v1)
97.857 (28.8 93.1) 5947508 assembly Staphylococcus aureus subsp. anaerobius (GCA_002902425.1, ASM290242v1)
88.596 (38.7 58.4) 6727398 assembly Ochrobactrum pituitosum (GCA_003049685.2, ASM304968v2)
[...]
In the above example, the organism was declared by the submitter to be Staphylococcus aureus. The predicted organism was in agreement, but there was contamination from Ochrobactrum quorumnocens, which has 99.045% identity over 54.5% of the input sequence, representing 80.3% of the contaminating organism's genome.