This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

Searching for functions using conserved domains

Matthew Benedict edited this page Nov 8, 2013 · 10 revisions

WARNING: The 32-bit VM distribution cannot run RPSBLAST against the full CDD because of the memory limits of a 32-bit machine. To analyze conserved domains with the VM, use the 64-bit VM.

We ran RPSBLAST on another machine with sufficient memory and copied the results to the 32-bit VM so that you can still explore the functions.

Finding conserved domains in the CDD by keyword search

To find conserved domains that match a particular description, search the CDD descriptions with the db_getExternalClustersByDescription.py script. It accepts any number of description strings, matches them case-insensitively, and returns every CDD domain whose description matches. For example, to find domains related to biotin synthase (some descriptions below have been truncated for readability):

$ db_getExternalClustersByDescription.py "biotin synthase"
30848   COG0502 BioB    Biotin synthase and related enzymes [Coenzyme metabolism]       335
32586   COG2516 COG2516 Biotin synthase-related enzyme [General function prediction only]       339
178013  PLN02389        PLN02389        biotin synthase 379
180492  PRK06256        PRK06256        biotin synthase; Validated      336
180835  PRK07094        PRK07094        biotin synthase; Provisional    323
181453  PRK08508        PRK08508        biotin synthase; Provisional    279
185063  PRK15108        PRK15108        biotin synthase; Provisional    345
129447  TIGR00347       bioD    dethiobiotin synthase. [description truncated]       166
200012  TIGR00433       bioB    biotin synthase. [description truncated]     296
100105  cd01335 Radical_SAM     Radical SAM superfamily. ... Examples are biotin synthase (BioB),...  204
198863  cl06149 BATS    Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), ...    0
148534  pfam06968       BATS    Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), EC:2.8.1.6, c...   93
205678  pfam13500       AAA_26  AAA domain. ... found in a number of proteins involved in cofactor biosynthesis such as dethiobiotin synthase ...       197
197846  smart00729      Elp3    Elongator protein 3, MiaB family, Radical SAM. This superfamily contains ... biotin synthase ...    216
197944  smart00876      BATS    Biotin and Thiamin Synthesis associated domain...    94

You can also specify that you only want results from a specific database, such as PFAM here:

$ db_getExternalClustersByDescription.py "biotin synthase" -d pfam
148534  pfam06968       BATS    Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), EC:2.8.1.6, c...   93
205678  pfam13500       AAA_26  AAA domain. ... found in a number of proteins involved in cofactor biosynthesis such as dethiobiotin synthase ...       197
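The matching behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual implementation: it assumes tab-delimited rows with the columns CDD ID, source ID, name, description, and length, and the function names and the `source_prefix` argument (a stand-in for the `-d` option, assumed here to match on the source ID prefix) are hypothetical.

```python
def matches_any(description, keywords):
    """Return True if any keyword occurs in the description, ignoring case."""
    lowered = description.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def filter_clusters(lines, keywords, source_prefix=None):
    """Return rows whose description column matches at least one keyword.

    Assumed columns (tab-delimited): CDD ID, source ID, name, description, length.
    source_prefix is a hypothetical stand-in for the -d option (e.g. "pfam").
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed rows
        source_id, description = fields[1], fields[3]
        if source_prefix and not source_id.lower().startswith(source_prefix.lower()):
            continue
        if matches_any(description, keywords):
            kept.append(fields)
    return kept
```

Because the match is a case-insensitive substring test, a query for "biotin synthase" also picks up descriptions that mention dethiobiotin synthase, which is consistent with the pfam13500 row in the output above.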

Searching for conserved domains associated with a protein

To list the conserved domains associated with a protein, use the db_getExternalClusterGroups.py script, which reads a list of gene IDs from standard input and returns their RPSBLAST hits to the CDD. For our favorite 6-phosphofructokinase gene, this gives the following set of conserved domains:

$ echo "fig|290402.1.peg.4768" | db_getExternalClusterGroups.py
fig|290402.1.peg.4768   235111  63.64   319     115     1       1       318     1       319     2e-157  550.0   PRK03202
fig|290402.1.peg.4768   238388  58.68   317     131     0       2       318     1       317     2e-123  437.0   cd00763
fig|290402.1.peg.4768   213713  63.12   301     108     2       3       300     1       301     1e-119  425.0   TIGR02482
fig|290402.1.peg.4768   223283  51.47   340     143     6       1       318     2       341     1e-111  398.0   COG0205
fig|290402.1.peg.4768   109425  58.99   278     111     2       2       276     1       278     2e-110  394.0   pfam00365

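Each row lists the gene ID, the CDD ID of the external cluster, the percent identity, further alignment statistics, the E-value (e.g. 2e-157), the bit score, and the cluster's ID in its source database. The columns between percent identity and E-value appear to follow standard BLAST tabular order (alignment length, mismatches, gap openings, then query and subject start and end positions). Here the strongest hit is to PRK03202.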
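If you want to post-process these rows, a small parser helps. The sketch below assumes the 13-column layout shown above. Only the gene, CDD ID, percent identity, E-value, bit score, and source ID columns are named in this page; the remaining field names are borrowed from the standard BLAST tabular format and are an assumption, as are the `Hit` and `parse_hit` names.

```python
from collections import namedtuple

# Field names for the middle columns are assumed from BLAST tabular output.
Hit = namedtuple(
    "Hit",
    "gene cdd_id pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore source_id",
)


def parse_hit(line):
    """Parse one whitespace-delimited result row into a Hit record."""
    f = line.split()
    if len(f) != 13:
        raise ValueError("expected 13 columns, got %d" % len(f))
    return Hit(
        gene=f[0],
        cdd_id=f[1],
        pident=float(f[2]),
        length=int(f[3]),
        mismatch=int(f[4]),
        gapopen=int(f[5]),
        qstart=int(f[6]),
        qend=int(f[7]),
        sstart=int(f[8]),
        send=int(f[9]),
        evalue=float(f[10]),
        bitscore=float(f[11]),
        source_id=f[12],
    )
```

Splitting on whitespace works here because none of the 13 columns contains spaces; output that includes cluster descriptions should be split on tabs instead.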

The script can optionally append the cluster's name (e.g. BATS) or description to the results table, apply an E-value cutoff lower (stricter) than the default of 1E-5, or limit the printed results to a single conserved domain database (e.g. COG). See the script's help text for details.
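The same kind of filtering can also be applied after the fact to saved output. The following is a rough sketch under stated assumptions: rows have the 13 whitespace-delimited columns shown earlier, hits are kept when their E-value is less than or equal to the cutoff, and the database is recognized by the prefix of the source ID. The function name and parameters are hypothetical and do not mirror the script's real options.

```python
def filter_hits(lines, max_evalue=1e-5, source_prefix=None):
    """Keep rows with E-value <= max_evalue, optionally from one database.

    Assumes the E-value is column 11 and the source ID is column 13.
    source_prefix (e.g. "COG") is matched case-insensitively against the
    start of the source ID; this matching rule is an assumption.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or malformed rows
        evalue = float(fields[10])
        source_id = fields[12]
        if evalue > max_evalue:
            continue
        if source_prefix and not source_id.upper().startswith(source_prefix.upper()):
            continue
        kept.append(fields)
    return kept
```

For example, `filter_hits(rows, max_evalue=1e-120)` would keep only the PRK03202 and cd00763 rows from the table above.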

You can also get information about a particular cluster after the fact using db_getExternalClustersById.py:

$ echo "PRK03202" | db_getExternalClustersById.py
235111	PRK03202	PRK03202	6-phosphofructokinase; Provisional	320

NOTE: If you get the following error, it indicates that you have not run setup_step4.sh (or that it failed):

error:
Traceback (most recent call last):
File "[directory]/src/db_getExternalClusterGroups.py", line 47, in <module>
   cur.execute(cmd, (geneid, ) )
sqlite3.OperationalError: no such table: rpsblast_results
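To confirm whether the table is present before running the search scripts, you can query SQLite's catalog directly. This is a minimal sketch: the `has_table` helper is hypothetical, and the path to the toolkit's database file is not given on this page, so substitute the location used by your installation.

```python
import sqlite3


def has_table(conn, table_name):
    """Return True if the SQLite database behind conn contains table_name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    # ":memory:" is a placeholder; point this at your actual database file.
    conn = sqlite3.connect(":memory:")
    print(has_table(conn, "rpsblast_results"))
    conn.close()
```

If the check returns False against your real database, rerun setup_step4.sh and look for errors in its output.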

Searching for proteins associated with a conserved domain

You can perform the reverse search (looking for proteins that match a domain, such as pfam00001) with the db_getHitsToExternalClusters.py script. It takes a list of external cluster IDs as input and returns the RPSBLAST hits to those clusters, with each cluster's name and description appended.

$ echo "PRK03202" | db_getHitsToExternalClusters.py
fig|290402.1.peg.581	235111	28.12	352	165	20	35	368	33	314	6e-28	121.0	235111	PRK03202	PRK03202	6-phosphofructokinase; Provisional	320
fig|290402.1.peg.992	235111	43.42	357	165	5	5	361	1	320	4e-123	437.0	235111	PRK03202	PRK03202	6-phosphofructokinase; Provisional	320
fig|290402.1.peg.4768	235111	63.64	319	115	1	1	318	1	319	2e-157	550.0	235111	PRK03202	PRK03202	6-phosphofructokinase; Provisional	320
fig|386415.1.peg.406	235111	62.07	319	120	1	1	318	1	319	3e-153	537.0	235111	PRK03202	PRK03202	6-phosphofructokinase; Provisional	320
fig|931626.1.peg.1249	235111	52.47	324	143	4	1	318	2	320	4e-126	447.0	235111	PRK03202	PRK03202	6-phosphofructokinase; Provisional	320
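When a domain has many hits, it can be useful to rank the matching genes by hit strength. The sketch below assumes tab-delimited rows laid out as in the output above, with the E-value in column 11 and the bit score in column 12. It splits on tabs because the appended description contains spaces. The function name is hypothetical.

```python
def rank_genes_by_bitscore(lines):
    """Return (gene, evalue, bitscore) tuples, strongest hit first."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed rows
        hits.append((fields[0], float(fields[10]), float(fields[11])))
    # Higher bit score means a stronger hit.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Applied to the five rows above, the first entry would be fig|290402.1.peg.4768 with a bit score of 550.0.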

Visualizing conserved domains associated with a protein

You can visualize the locations and strengths (E-values) of a protein's hits to the conserved domain databases with the db_displayExternalClusterHits.py script. It takes a list of gene IDs as input and produces a PNG file showing the position and name of each sufficiently strong hit relative to the gene, with the strongest hits at the bottom.
