Skip to content
This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Searching for missing genes and identifying causes for absence with tblastn

mattb112885 edited this page Jun 4, 2014 · 8 revisions

tBLASTn primer

tBLASTn is a powerful tool that searches for proteins in a nucleotide sequence (such as the sequence of a whole contig) by translating it in all six possible frames (3 per strand) and then running a BLASTP against all of those translated regions. It has been used in tools such as GenePRIMP to identify and try to fix missing genes. However, it is useful to be able to identify such problems on-the-fly, and the tBLASTn program does not natively attempt to identify whether the hit regions correspond to any called proteins or (if so) how much overlap they have.

ITEP wrapper for tBLASTn

We have included a program called db_TBlastN_wrapper.py that solves these problems by generating tBLASTn databases for contigs on the fly (from a specified set of organisms to run against), running tBLASTn against a set of input proteins and looking for overlaps with existing called genes.

As an example, we searched for ribosomal protein L20 in Acetobacterium woodii (which does not have an annotated copy in the RefSeq annotation of Acetobacterium woodii), organism ID 931626.1, using the C. beijerinckii gene (Cbei_1571) as a seed.

$ db_getGenesWithAnnotation.py "L20" | grep "Cbei" | db_TBlastN_wrapper.py -o 931626.1 > tblastn_results
fig|290402.1.peg.1555   359     931626.1.NC_016894.1    Acetobacterium woodii DSM 1030  2362582 2362229 353     98.33   1e-36    132    -3      NOGENE                                  TBLASTN_CONTIG_931626.1.NC_016894.1_START_2362582_STOP_2362229

This indicates that there is a strong hit with length 353 (to a protein with length 359) in a region with no called gene.

On the other hand, running it against C. novyi (organism ID 386415.1) finds a strong hit with 99.16% overlap with a called gene (fig|386415.1.peg.862) with a predicted function as an L20 ribosomal protein:

$ db_getGenesWithAnnotation.py "L20" | grep "Cbei" | db_TBlastN_wrapper.py -o 386415.1
fig|290402.1.peg.1555	359	386415.1.NC_008593.1	Clostridium novyi NT	965021	965377	356	99.16	2e-54	 183	2	SAMESTRAND	fig|386415.1.peg.862	50S ribosomal protein L20_YP_877836.1_NT01CX_1763_rplT	359	99.16	TBLASTN_CONTIG_386415.1.NC_008593.1_START_965021_STOP_965377

It is important to check four values here to ensure robustness of the interaction between the called gene and the tBLASTn hit: the overlap percentage between the hit and the called gene (99.16%), the length of the called gene (359), the length of the tBLASTn hit (359) and the length of the query gene (359). If the latter three are all close in magnitude and the overlap is high, it is almost certainly a correct call (you can also look at the E-value, 2E-54, and the bit score, 183, to help determine whether you believe it or not). However, often the overlap percentage will be low due to mistaken start site calls in the annotation or other errors, in which case the gene association cannot be trusted.

If more than one overlapping gene is found for a single tBLASTn hit, one row is created for each overlapping gene. Only overlaps > 1% of the called gene are printed by default. This cutoff can be modified using the -a flag. If the overlap between a particular hit and a called gene is less than the cutoff, INSUFFICIENT_OVERLAP is printed in place of the overlapping gene information.

Getting the amino acid sequence for a tBLASTn hit

We can get the amino acid sequence for the uncalled region using the file we created for tBLASTn results against Acetobacterium woodii above:

$ cat tblastn_results | db_getSequencesFromBlastResults.py -c -i 3 -s 5 -e 6 -t

We will describe the db_getSequencesFromBlastResults.py function more in the tutorial "Extracting DNA and amino acid sequences from a region of a genome, gene or protein"

This command appends the following sequence to the tBLASTn results table:

   IMRIKKGVNAKKKHKKVLKLAKGFYGAKSKLYRSANEAVMRAQRSSYVGRKEKKRNFRRLWITRINAGARMYDLSYSKFMFGLKQAGVEIDRKILADLAMNDINAFKDLVEVSKKNLN

The sequence can easily be added to an alignment of L20 proteins to confirm that it is very similar along the entire length.

Identifying frameshifts, insertions, inversions and nonsense mutations with tBLASTn

NOTE - This function is still somewhat of a work in progress, it gives way more false positives than we would expect due to some design issues that I need to address.

ITEP script db_findBadMutationsFromTblastn.py include crude ways to identify putative frameshift mutations, insertions and inversions with tBLASTn by identifying multiple tBLASTn hits with the same query gene but hits to different locations on the genome that are close to each other. Note that this definition can also lead to false positives if for example a gene is duplicated and the second copy is very close to the first, or multiple distantly related proteins (but still closely enough to show up in a tBLASTn search) are found in close proximity. The definitions are defined in the help text and repeated here (the user can specify the required distances)

If two hits are found within INTERVAL base pairs of each other (where INTERVAL is set by the user)

  • Frameshift: The two hits are on the same strand on different frames
  • Insertion: The two hits are on the same strand on the same frame
  • Inversion: The two hits are on different strands

The script also identifies putative nonsense mutations if a single hit has a stop codon within a user-defined percentage of the total length of the hit.

As an example, when we ran tBLASTn with the 30S ribosomal protein S2 from Acetobacterium woodii as the query, we found that the homologous protein in C. novii is slightly shorter and that the hit goes about 10 amino acids past the stop codon, indicating a possible (small) gene truncation event or a sequencing error at that position.

$ echo "fig|931626.1.peg.1904" | db_TBlastN_wrapper.py -f ../organisms > ribosomal_tblastn
fig|931626.1.peg.1904   779     931626.1.NC_016894.1    Acetobacterium woodii DSM 1030  2300191 2300967 776     99.61   1e-169   523    1       SAMESTRAND      fig|931626.1.peg.1904   30S ribosomal protein S2_YP_005269607.1_Awo_c19390_rpsB 779     99.61   TBLASTN_CONTIG_931626.1.NC_016894.1_START_2300191_STOP_2300967
fig|931626.1.peg.1904   779     386415.1.NC_008593.1    Clostridium novyi NT    1356578 1355847 731     93.84   2e-109   350    -2      SAMESTRAND      fig|386415.1.peg.1248   30S ribosomal protein S2_YP_878222.1_NT01CX_2149_rpsB   701     100.00  TBLASTN_CONTIG_386415.1.NC_008593.1_START_1356578_STOP_1355847
fig|931626.1.peg.1904   779     290402.1.NC_009617.1    Clostridium beijerinckii NCIMB 8052     1409826 1410524 698     89.60   3e-105   338    3       SAMESTRAND      fig|290402.1.peg.1180   30S ribosomal protein S2_YP_001308326.1_Cbei_1188_rpsB  701     99.57   TBLASTN_CONTIG_290402.1.NC_009617.1_START_1409826_STOP_141052
queryid, querylen, subcontig, organism, tblaststart, tblastend, tblastlen, queryoverlappct, evalue, bitscore, hitframe, strandedString, targetgeneid, targetannotation, targetgenelen, targetoverlappct, TBLASTN_hitID

Note that the C. novyi gene is only 701 base pairs but the tBLASTn hit was 731 bases. Searching for issues with the db_findBadMutationsFromTblastn.py reveals that there is a potential nonsense mutation here:

$ cat ribosomal_tblastn | db_findBadMutationsFromTblastn.py
386415.1.NC_008593.1    fig|931626.1.peg.1904   1356578 1355847 1356578 1355847 NONSENSE

We can get the sequence from this (since it has a contig ID, start and stop location) using db_getSequencesFromBlastResults. We use -t to turn the DNA sequence into an amino acid sequence.

$ cat ribosomal_tblastn | db_findBadMutationsFromTblastn.py | db_getSequencesFromBlastResults.py -c -i 1 -s 3 -e 4 -t
386415.1.NC_008593.1    fig|931626.1.peg.1904   1356578 1355847 1356578 1355847 NONSENSE        MSIISMKQLLEAGVHFGHQTRRWNPKMAPYIFTERNGIYIIDLQKTVKKVEEAYEFVKSVVADGKEVLFVGTKKQAQEAIEEESLRSGMHFVNNRWLGGMLTNFKTIKTRINKLEQLEKMEEDGTFEVLPKKEVIKLRNEKEKLEKNLGGIKNLDASNLGAIFIVDPRKEKNAIDEAKNLGIPVVAIVDTNCDPDEIDYVIPGNDDAIRAVRLITSKIADAIIEGNQGEQLAE*FKINLLGVCD

Note the "*" 11 amino acids prior to the end of the tBLASTn hit indicating a stop codon, which is less than 10% of the protein. We can tell db_findBadMutationsFromTblastn.py to ignore stop codons that are too close to the end of the protein using the -n flag. E.g. if we don't care about the stop codon if it is less than 5% of way into the protein we use:

$ cat ribosomal_tblastn | db_findBadMutationsFromTblastn.py -n 5

This returns nothing because the stop codon is less than 5% of the way into the protein.

Clone this wiki locally