Identifying the upstream regions of homologous proteins

Why get upstream regions?

Comparing the upstream regions of related genes can be useful for identifying regulatory motifs (see e.g. http://nar.oxfordjournals.org/content/28/22/4523.abstract). The upstream regions also contain features such as ribosome binding sites and the TATA box.

Caveats when getting upstream regions

The ITEP tool for getting upstream regions identifies and warns about several possible issues with the returned sequences:

Presence of a called gene in the requested upstream region - Genes are more conserved than intergenic regions because the amino acid sequence of functionally important parts must be conserved. Therefore, by default ITEP reports only intergenic regions unless you tell it to ignore genes shorter than a certain length.
Presence of sequencing gaps in the requested upstream region - If you use scaffolds rather than contigs as the DNA unit, the upstream region of the gene of interest may consist mainly of "n". By default, ITEP prints the upstream region only up to the first "n" it encounters (you can control how many "n" are allowed).
Your gene is at the end of a contig - For genes at the end of a contig, the returned upstream region extends only to the end of that contig.
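
To make the last two caveats concrete, here is a minimal Python sketch of how an upstream window can be truncated at a contig end and at sequencing gaps. It is an illustration only, not ITEP's actual implementation: the function name upstream_region and its parameters are hypothetical, it handles only forward-strand genes, and it does not check for neighboring called genes.

```python
def upstream_region(contig, gene_start, length=100, max_n=0):
    """Return up to `length` bases immediately upstream of a forward-strand gene.

    Hypothetical helper for illustration; not part of ITEP.
    contig      -- DNA sequence of the contig or scaffold (string)
    gene_start  -- 0-based index of the first base of the start codon
    length      -- number of upstream bases requested
    max_n       -- number of "n" gap characters tolerated before stopping
    """
    # Contig-end caveat: fewer than `length` bases may exist before the gene.
    window_start = max(0, gene_start - length)
    window = contig[window_start:gene_start]

    # Gap caveat: walk away from the gene and stop once more than
    # `max_n` gap characters have been seen.
    kept = []
    n_seen = 0
    for base in reversed(window):
        if base.lower() == "n":
            n_seen += 1
            if n_seen > max_n:
                break
        kept.append(base)
    return "".join(reversed(kept))


# The gene starts at index 14 (ATG); a two-base gap sits at indices 4-5.
print(upstream_region("ACGTNNACGTACGTATG", 14))  # prints ACGTACGT
```

With the default max_n=0 the window stops at the first "n", which mirrors the default behavior described above; raising max_n lets the window extend into the gap.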

Obtaining upstream regions for a list of genes

We can get the upstream regions for a list of genes using the db_getUpstreamRegion.py script. As an example, we pull out the list of genes in the 6-phosphofructokinase cluster that we have analyzed many times before, and then get the upstream region of each (the default is to extract the start codon of the query gene plus 100 bases upstream of it, but both of these can be changed):

$ makeTabDelimitedRow.py "all_I_2.0_c_0.4_m_maxbit" "647" | db_getGenesInClusters.py | db_getUpstreamRegion.py -g 3
fig|290402.1.peg.4768           TTTATGTACAATAATTCTTGATTAATAAATTATGGTTATATTATGTACACAAGGGTCATTTAAGGTCCCTTTGCGGTATTTTAAGGAGGAAATATATATTATG
fig|386415.1.peg.406            TGGTGGTATACTATTCATAGCTAATAAAAAACTTAAAATTAACAATTTACATGCTTTATGGGATAAAAAATGTCCCAGTGTTTTTGGGAGGTAATAAGTGATG
fig|931626.1.peg.1249   OTHERGENE,      ATATG

The output indicates no problems getting the upstream regions for two of the genes, but the third one (fig|931626.1.peg.1249) has another called gene ending 3 base pairs upstream of its start, so we could reliably obtain only 2 base pairs of upstream sequence (the AT preceding the ATG start codon). To see what that gene was, use db_getGeneNeighborhoods.py (only the two relevant lines of output are shown):

$ echo "fig|931626.1.peg.1249" | db_getGeneNeighborhoods.py
fig|931626.1.peg.1249   fig|931626.1.peg.1248   -1      931626.1.NC_016894.1    1486386 **1489946** +       DNA polymerase III subunit alpha DnaE_YP_005268951.1_Awo_c12780_dnaE
fig|931626.1.peg.1249   fig|931626.1.peg.1249   0       931626.1.NC_016894.1    **1489949** 1490905 +       6-phosphofructokinase_YP_005268952.1_Awo_c12790_pfkA

The gene very close to this one encodes a DNA polymerase III subunit. The conclusion here is that we cannot use this upstream region for analysis, because any conservation is likely due to the presence of protein-coding sequence rather than to regulatory conservation (these two genes might even be in an operon!).
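
The size of the gap can be checked directly from the two highlighted coordinates in the neighborhood output. The following Python sketch assumes the coordinates are 1-based and inclusive and that both genes are on the forward strand (as the "+" column indicates); the function name intergenic_gap is hypothetical and not part of ITEP.

```python
def intergenic_gap(upstream_gene_end, query_gene_start):
    """Number of bases strictly between two forward-strand genes.

    Assumes 1-based, inclusive coordinates (an assumption about the
    neighborhood output, not a documented guarantee). Overlapping genes
    are reported as a gap of 0.
    """
    return max(0, query_gene_start - upstream_gene_end - 1)


# dnaE ends at 1489946 and pfkA starts at 1489949.
print(intergenic_gap(1489946, 1489949))  # prints 2
```

The result agrees with the ATATG reported earlier: two intergenic bases followed by the ATG start codon.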
