Skip to content
This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Extracting dna and amino acid sequences from a region of a genome, gene or protein

mattb112885 edited this page Jun 13, 2013 · 8 revisions

DNA and amino acid sequences can be extracted from specific genes (or proteins) or contigs using the db_getSequencesFromBlastResults.py command. The name of the function comes from the fact that most of the time, the function will be used from BLAST results, but it can be used with any table as input that contains, in three separate columns:

  1. The name of a gene or a contig from which to get the sequence
  2. The start location of the gene or contig
  3. The stop location of the gene or contig.

Obviously the way that you call the function should be different depending on whether you want amino acid or nucleotide sequences, and whether you are starting with a gene (or protein) ID or a contig ID. Here's some common use cases:

Getting sequences of homologous regions from BLASTP results

The BLASTP results table contains locations of the HSP within the AMINO ACID sequence for each gene. Therefore you must call the db_getSequencesFromBlastResults.py function with the -p flag.

The column number options (-i for ID, -s for start location column and -e for end location column) are set up by default to get you the sequences of the target in a BLASTP results table.

As an example, going back to the example we discussed in an earlier tutorial, a search for genes homologous to "Cbei_1843" (a phosphofructokinase) gave many results; here is one of them.

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 > Cbei_1843_blastp_res
fig|290402.1.peg.3996    fig|290402.1.peg.1824    25.66    304    194    13    11    297    10    298    1e-10    58.9    600.0    600.0

Note that the db_getBlastResultsContainingGenes.py function returns BLASTP results by default.

Now we get the sequence of the target gene (which in this case is fig|290402.1.peg.1824) using

$ cat Cbei_1843_blastp_res | db_getSequencesFromBlastResults.py -p

The result is the addition of an amino acid sequence to each line in the BLASTP results. The sequence appended to the above BLAST hit is

SLDYIVKVDSFKVD ....

This sequence is amino acids number 10 to 298 (columns 9 and 10 above) in the protein sequence for fig|290402.1.peg.1824.

To get the sequence of the HSP for the query we need to specify the columns: column 1 for the query gene ID, column 7 for the query gene start location and column 8 for the query gene stop location.

$ cat Cbei_1843_blastp_res | db_getSequencesFromBlastResults.py -p -i 1 -s 7 -e 8

This would append the amino acid sequence for amino acid positions 11 to 297 within the protein fig|290402.1.peg.3996.

Getting sequences of homologous regions from BLASTN results

BLASTN takes a gene sequence (nucleotide) as both query and target and returns as part of its output the nucleotide positions within the query and target that were found to be homologous.

First lets get some BLASTN results by using the -n flag on db_getBlastResultsContainingGenes.py

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -n > Cbei_1843_blastn_res

To get the correct sequences from BLASTN results you must use the -n flag which specifies that the positions are nucleotide positions within a gene:

$ cat Cbei_1843_blastn_res | db_getSequencesFromBlastResults.py -n

Just as in BLASTP, this result by default gives you the nucleotide sequences within the target gene. To get the nucleotide sequences within the query gene use this command:

$ cat Cbei_1843_blastn_res | db_getSequencesFromBlastResults.py -n -i 1 -s 7 -e 8

We suggest NOT trying to use the translate (-t) tag to get amino acid sequences from BLASTN hits because the homologous regions are likely to begin and\or end in the wrong frame.

Getting sequences of homologous regions from tBLASTn results

Unlike BLASTP and BLASTN, the target sequence for tBLASTn is a contig, not a gene. Therefore, when trying to get a sequence corresponding to a tBLASTn hit you should use the -c flag to indicate that the ID for the target is a contig ID and not a gene ID. Also the meaning of the columns is somewhat different since the tBLASTn wrapper script prints out different information than what is provided by BLASTP and BLASTN so you need to use the -i, -s, and -e flags to get the right column numbers.

The tBLASTn table includes locations on the contig (i.e. they are nucleotide positions). To obtain a nucleotide sequence from the target contig use:

$ cat [tblastn_results] | db_getSequencesFromBlastResults.py -c -i 3 -s 5 -e 6

Unlike BLASTN, the tBLASTn hits will generally be in the correct frame (i.e. a frame that produces an amino acid that is actually homologous to the query) and more importantly, using the default settings, will always both start and end in the same frame. Therefore, you can get the translated amino acid sequence for the homologous region by adding the -t flag:

$ cat [tblastn_results] | db_getSequencesFromBlastResults.py -c -i 3 -s 5 -e 6 -t

Note that tBLASTn can and will find hits that go past stop codons. Stop codons will appear in the translation as "*".

Clone this wiki locally