Skip to content
This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Searching for genes by homology with other genes

mattb112885 edited this page Jan 21, 2014 · 8 revisions

Finding genes homologous to a specific gene

Recall that you can get the ITEP gene ID for a gene of interest with specific function or locus tag by using the db_getGenesWithAnnotation.py function:

$ db_getGenesWithAnnotation.py "Cbei_1843"
fig|290402.1.peg.1824 1-phosphofructokinase_YP_001308970.1_Cbei_1843

Many ITEP scripts are designed to take output such as this (which is by default printed to stdout) and use them as inputs to another script using pipes (|). One such script is the script db_getBlastResultsContainingGenes.py function, which identifies all genes homologous to one or more input genes. We connect the results of the previous script with this new one by using the following syntax (the -g flag tells ITEP that the gene IDs are found in the first column of the input):

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 

Which produces an output like this (only one line shown here)

fig|290402.1.peg.1824   fig|290402.1.peg.1824   100.0   300     0       0       1       300     1       300     5e-173  600.0   600.0   600.0

This is the standard (-m9) tab-delimited output from BLAST with some additional information added to the end. The table consists of in order: query gene, target gene, percent ID, length of alignment, number of mismatches, number of gap openings, query start, query end, target start, target end, E-value, bitscore, query self-bitscore and target self-bitscore.

The script by default gives you BLASTP results - you can also get BLASTN results with identical format by using the -n flag:

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -n
fig|290402.1.peg.1824   fig|290402.1.peg.1824   100.0   903     0       0       1       903     1       903     0.0     1629.0  1629.0  1629.0

You can specify an E-value cutoff for results to display with the -c flag.

Finally, if the input file has the gene IDs you want to query against in a different column, change the value of -g to reflect that. The -g argument is optional if the gene ID is in the first column as described in the help text (Use the -h flag with any python script to get help text).

Finding homology between a set of genes

Suppose you have a set of genes and want to identify whether or not they are strongly homologous. For example, Clostridium beijerinckii has three genes annotated as 6-phosphofructokinase, as we can see by searching for this annotation:

$ db_getGenesWithAnnotation.py "6-phosphofructokinase" | grep "Cbei"
fig|290402.1.peg.581    6-phosphofructokinase_YP_001307727.1_Cbei_0584
fig|290402.1.peg.992    6-phosphofructokinase_YP_001308138.1_Cbei_0998
fig|290402.1.peg.4768   6-phosphofructokinase_YP_001311914.1_Cbei_4852

You can tell whether these genes are actually homologous (and how strongly so they are) using the db_getBlastResultsBetweenSpecificGenes.py function (note the difference between this and the db_getBlastResultsContainingGenes.py function). Use pipes to link the output from the above function with this one:

$ db_getGenesWithAnnotation.py "6-phosphofructokinase" | grep "Cbei" | db_getBlastResultsBetweenSpecificGenes.py

Two lines in the output are

fig|290402.1.peg.581    fig|290402.1.peg.4768   27.55   294     167     12      1       279     1       263     1e-13   68.6    833.0   637.0
fig|290402.1.peg.992    fig|290402.1.peg.4768   37.97   345     178     5       5       349     1       309     5e-71   225.0   723.0   637.0

Suggesting that the two copies fig|290402.1.peg.581 and fig|290402.1.peg.4768 are not very strongly homologous, while fig|290402.1.peg.992 and fig|290402.1.peg.4768 are more closely related.

Building a homologous set from a seed protein

We can combine the two analyses above to first obtain a list of homologous proteins to a given protein, and then find all of the homologies between them to make a cohesive set. Visualizing this data (see "Visualizing homology patterns") with specific homology scoring metrics is a useful way to see how the proteins are likely to cluster without actually performing the clustering.

To make a cohesive set of BLASTP results between all proteins that are similar to Cbei_1843 within an E-value of 1E-20, the following commands are sufficient:

$ db_getGenesWithAnnotation.py "Cbei_1843" | db_getBlastResultsContainingGenes.py -g 1 -c 1E-20 | db_getBlastResultsBetweenSpecificGenes.py -g 1
Clone this wiki locally