The purpose of this notebook is to build a query that pulls all promiscuous enzyme data from KEGG.

Resources:

https://biopython.readthedocs.io/en/latest/Tutorial/chapter_kegg.html

http://biopython.org/DIST/docs/api/Bio.KEGG.REST-module.html

In [16]:
# imports

from Bio.KEGG import REST
from Bio.KEGG import Enzyme

import pandas as pd

In [12]:
# pull info on EC: 5.4.2.2

request = REST.kegg_get("ec:5.4.2.2")

In [13]:
request.read()

'ENTRY       EC 5.4.2.2                  Enzyme\nNAME        phosphoglucomutase (alpha-D-glucose-1,6-bisphosphate-dependent);\n            glucose phosphomutase;\n            phosphoglucose mutase\nCLASS       Isomerases;\n            Intramolecular transferases;\n            Phosphotransferases (phosphomutases)\nSYSNAME     alpha-D-glucose 1,6-phosphomutase\nREACTION    alpha-D-glucose 1-phosphate = D-glucose 6-phosphate [RN:R08639]\nALL_REAC    R08639 > R00959;\n            (other) R01057 R03319\nSUBSTRATE   alpha-D-glucose 1-phosphate [CPD:C00103]\nPRODUCT     D-glucose 6-phosphate [CPD:C00092]\nCOMMENT     Maximum activity is only obtained in the presence of alpha-D-glucose 1,6-bisphosphate. This bisphosphate is an intermediate in the reaction, being formed by transfer of a phosphate residue from the enzyme to the substrate, but the dissociation of bisphosphate from the enzyme complex is much slower than the overall isomerization. The enzyme also catalyses (more slowly) the interco
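Since the goal is to find promiscuous enzymes, one field in the record above that looks useful is `ALL_REAC`, which lists the KEGG reactions linked to the entry. The sketch below collects those reaction IDs. It rests on an assumption introduced here, not stated by KEGG or elsewhere in this notebook: an enzyme linked to more than one reaction ID is treated as a promiscuity candidate. The helper name `reaction_ids` is illustrative, and the sample lines are copied from the output above.

```python
import re

# Lines copied from the ec:5.4.2.2 record printed above.
sample = (
    "ALL_REAC    R08639 > R00959;\n"
    "            (other) R01057 R03319\n"
    "SUBSTRATE   alpha-D-glucose 1-phosphate [CPD:C00103]\n"
)


def reaction_ids(flat_text):
    """Return the reaction IDs found in the ALL_REAC field, in order."""
    ids = []
    in_all_reac = False
    for line in flat_text.splitlines():
        keyword = line[:12].strip()  # field names occupy the first 12 columns
        if keyword:  # a non-blank keyword starts a new field
            in_all_reac = keyword == "ALL_REAC"
        if in_all_reac:
            ids.extend(re.findall(r"R\d{5}", line[12:]))
    return ids


ids = reaction_ids(sample)
# ASSUMPTION (working definition only): more than one linked reaction.
is_candidate = len(ids) > 1
print(ids, is_candidate)
```

Whether reaction count is a good proxy for promiscuity is an open question for this notebook; the `COMMENT` field may need to be inspected as well.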

>>> from Bio.KEGG import REST
>>> from Bio.KEGG import Enzyme
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> open("ec_5.4.2.2.txt", 'w').write(request.read())
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'


In [4]:
open("ec_5.4.2.2.txt", 'w').write(request.read())

157253

In [5]:
records = Enzyme.parse(open("ec_5.4.2.2.txt"))

In [7]:
record = list(records)[0]

In [8]:
record

<Bio.KEGG.Enzyme.Record at 0x10e8127f0>

In [9]:
record.classname

['Isomerases;',
 'Intramolecular transferases;',
 'Phosphotransferases (phosphomutases)']

In [10]:
record.entry

'5.4.2.2'

In [52]:
# Tutorial example: collect the genes in human DNA repair pathways

human_pathways = REST.kegg_list("pathway", "hsa").read()

# Filter all human pathways for repair pathways
repair_pathways = []
for line in human_pathways.rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)

# Get the genes for pathways and add them to a list
repair_genes = []
for pathway in repair_pathways:
    pathway_file = REST.kegg_get(pathway).read()  # query and read each pathway

    # iterate through each KEGG pathway file, keeping track of which section
    # of the file we're in, only read the gene in each pathway
    current_section = None
    for line in pathway_file.rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if section != "":
            current_section = section

        if current_section == "GENE":
            gene_identifiers, gene_description = line[12:].split("; ")
            gene_id, gene_symbol = gene_identifiers.split()

            if gene_symbol not in repair_genes:
                repair_genes.append(gene_symbol)

print(f"There are {len(repair_pathways)} repair pathways and "
      f"{len(repair_genes)} repair genes. The genes are:")
print(", ".join(repair_genes))

There are 3 repair pathways and 78 repair genes. The genes are:
OGG1, NTHL1, NEIL1, NEIL2, NEIL3, UNG, SMUG1, MUTYH, MPG, MBD4, TDG, APEX1, APEX2, POLB, POLL, HMGB1, XRCC1, PCNA, POLD1, POLD2, POLD3, POLD4, POLE, POLE2, POLE3, POLE4, LIG1, LIG3, PARP2, PARP1, PARP3, PARP4, FEN1, RBX1, CUL4B, CUL4A, DDB1, DDB2, XPC, RAD23B, RAD23A, CETN2, ERCC8, ERCC6, CDK7, MNAT1, CCNH, ERCC3, ERCC2, GTF2H5, GTF2H1, GTF2H2, GTF2H2C_2, GTF2H2C, GTF2H3, GTF2H4, ERCC5, BIVM-ERCC5, XPA, RPA1, RPA2, RPA3, RPA4, ERCC4, ERCC1, RFC1, RFC4, RFC2, RFC5, RFC3, SSBP1, PMS2, MLH1, MSH6, MSH2, MSH3, MLH3, EXO1
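The `GENE` parsing above unpacks the result of `split("; ")` directly, which raises `ValueError` when a line contains no `"; "` or more than one. A slightly more defensive sketch follows. The helper name `parse_gene_line` and the sample lines are hypothetical: they are modeled on the layout the tutorial code assumes and are not real KEGG output.

```python
def parse_gene_line(line):
    """Return (gene_id, gene_symbol) from a GENE line, or None if it does not fit."""
    body = line[12:].strip()  # skip the 12-column field-name area
    # partition splits at the first "; " only, so extra "; " in the
    # description cannot break the unpacking.
    identifiers, _, _description = body.partition("; ")
    parts = identifiers.split()
    if len(parts) != 2:  # missing symbol or unexpected layout
        return None
    return parts[0], parts[1]


# Hypothetical sample lines (illustrative IDs and symbols).
lines = [
    "GENE        1111  ABC1; example protein; second clause [KO:K00000]",
    "            2222  ABC2; another example protein",
    "            3333  uncharacterized entry without a symbol",
    "            1111  ABC1; repeated entry",
]

symbols = []
for line in lines:
    parsed = parse_gene_line(line)
    if parsed and parsed[1] not in symbols:
        symbols.append(parsed[1])

print(symbols)
```

Lines that do not match are skipped silently here; logging them instead would make format changes easier to notice.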


In [37]:
human_pathways = REST.kegg_list("pathway", "hsa")

In [38]:
df = pd.read_csv(human_pathways, sep='\t', header=None)

In [41]:
df.head()

|     | 0             | 1                                                  |
| :-- | :------------ | :------------------------------------------------- |
| 0   | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h...  |
| 1   | path:hsa00020 | Citrate cycle (TCA cycle) - Homo sapiens (human)   |
| 2   | path:hsa00030 | Pentose phosphate pathway - Homo sapiens (human)   |
| 3   | path:hsa00040 | Pentose and glucuronate interconversions - Hom...  |
| 4   | path:hsa00051 | Fructose and mannose metabolism - Homo sapiens...  |
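The two columns above are the pathway ID and its description. As a rough standard-library sketch, the same tab-separated text can be split by hand, dropping the `path:` prefix and the organism suffix along the way. The sample rows are the two untruncated rows from the output above; `parse_pathway_list` and `ORGANISM_SUFFIX` are illustrative names.

```python
ORGANISM_SUFFIX = " - Homo sapiens (human)"

# Two rows as returned by kegg_list("pathway", "hsa"): ID, tab, description.
sample = (
    "path:hsa00020\tCitrate cycle (TCA cycle) - Homo sapiens (human)\n"
    "path:hsa00030\tPentose phosphate pathway - Homo sapiens (human)\n"
)


def parse_pathway_list(text):
    """Map pathway ID (without 'path:') to the pathway name (without organism)."""
    pathways = {}
    for line in text.rstrip().split("\n"):
        pathway_id, description = line.split("\t", 1)
        if pathway_id.startswith("path:"):
            pathway_id = pathway_id[len("path:"):]
        if description.endswith(ORGANISM_SUFFIX):
            description = description[: -len(ORGANISM_SUFFIX)]
        pathways[pathway_id] = description
    return pathways


pathways = parse_pathway_list(sample)
print(pathways)
```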


## experimentation

#### Current number of entries in KEGG:

| database      | entry type          | number of entries |
| :------------ | :------------------ | :---------------- |
| KEGG Enzyme   | Enzyme nomenclature | 7,524             |
| KEGG Compound | Metabolic compounds | 18,505            |


In [28]:
REST.kegg_info('enzyme').read()

'enzyme           KEGG Enzyme Database\nec               Release 89.0+/02-22, Feb 19\n                 Kanehisa Laboratories\n                 7,524 entries\n\nlinked db        pathway\n                 module\n                 ko\n                 <org>\n                 vg\n                 ag\n                 compound\n                 glycan\n                 reaction\n                 rclass\n'
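The entry count is embedded in the free text returned by `kegg_info`. A small sketch of extracting it with a regular expression, so it can be compared with the number of rows fetched from `kegg_list`. The text is copied (shortened) from the output above, and `entry_count` is an illustrative helper name.

```python
import re

info_text = (
    "enzyme           KEGG Enzyme Database\n"
    "ec               Release 89.0+/02-22, Feb 19\n"
    "                 Kanehisa Laboratories\n"
    "                 7,524 entries\n"
)


def entry_count(text):
    """Return the number from the 'N entries' line as an int, or None if absent."""
    match = re.search(r"([\d,]+) entries", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


count = entry_count(info_text)
print(count)  # 7524
```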

In [46]:
enzyme_list = pd.read_csv(REST.kegg_list('enzyme'), sep='\t', header=None, names=['EC number', 'description'])

enzyme_list.head()

|     | EC number  | description                                        |
| :-- | :--------- | :------------------------------------------------- |
| 0   | ec:1.1.1.1 | alcohol dehydrogenase; aldehyde reductase; ADH...  |
| 1   | ec:1.1.1.2 | alcohol dehydrogenase (NADP+); aldehyde reduct...  |
| 2   | ec:1.1.1.3 | homoserine dehydrogenase; HSDH; HSD                |
| 3   | ec:1.1.1.4 | (R,R)-butanediol dehydrogenase; butyleneglycol...  |
| 4   | ec:1.1.1.5 | Transferred to 1.1.1.303 and 1.1.1.304             |


In [50]:
enzyme_list.shape

(7524, 2)
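Not every one of these rows is a live enzyme: `ec:1.1.1.5` above only records that the entry was transferred to other EC numbers. A sketch of separating such rows before issuing further queries, assuming the description always starts with `Transferred to` (other placeholder wordings may exist and are not handled here). The rows are copied from the output above, and `split_transferred` is an illustrative name.

```python
import re

rows = [
    ("ec:1.1.1.3", "homoserine dehydrogenase; HSDH; HSD"),
    ("ec:1.1.1.5", "Transferred to 1.1.1.303 and 1.1.1.304"),
]

EC_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")


def split_transferred(rows):
    """Return (active_rows, transfers); transfers maps old EC -> list of new ECs."""
    active, transfers = [], {}
    for ec, description in rows:
        if description.startswith("Transferred to"):
            transfers[ec] = EC_PATTERN.findall(description)
        else:
            active.append((ec, description))
    return active, transfers


active, transfers = split_transferred(rows)
print(active)
print(transfers)
```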

In [55]:
# read_csv cannot parse the flat-file record (see the output below); use the Enzyme parser instead

first_enzyme = pd.read_csv(REST.kegg_get('ec:1.1.1.1'), sep='\t')

first_enzyme

In [56]:
first_enzyme

|     | ENTRY EC 1.1.1.1 Enzyme               |
| :-- | :------------------------------------ |
| 0   | NAME alcohol dehydrogenase;           |
| 1   | aldehyde reductase;                   |
| 2   | ADH;                                  |
| 3   | alcohol dehydrogenase (NAD);          |
| 4   | aliphatic alcohol dehydrogenase;      |
| 5   | ethanol dehydrogenase;                |
| 6   | NAD-dependent alcohol dehydrogenase;  |
| 7   | NAD-specific aromatic alcohol dehy... |
| 8   | NADH-alcohol dehydrogenase;           |
| 9   | NADH-aldehyde dehydrogenase;          |
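`read_csv` produces the single-column result above because the record is a fixed-width flat file, not a table: the field name sits in the first 12 columns, continuation lines leave those columns blank, and `///` ends the record. Biopython's `Enzyme.parse`, used earlier in this notebook, handles that layout. To illustrate the idea without the library, here is a minimal standard-library sketch; the sample record is abridged from the output above, and `split_fields` is an illustrative helper, not part of Biopython.

```python
# Abridged ec:1.1.1.1 record; the real record has many more fields.
sample = (
    "ENTRY       EC 1.1.1.1                  Enzyme\n"
    "NAME        alcohol dehydrogenase;\n"
    "            aldehyde reductase;\n"
    "            ADH;\n"
    "///\n"
)


def split_fields(flat_text):
    """Group the lines of one KEGG flat-file record by field name."""
    fields = {}
    current = None
    for line in flat_text.splitlines():
        if line.startswith("///"):  # record terminator
            break
        keyword = line[:12].strip()
        if keyword:  # blank keyword columns mean "same field continues"
            current = keyword
        if current is not None:
            value = line[12:].strip().rstrip(";")
            fields.setdefault(current, []).append(value)
    return fields


fields = split_fields(sample)
print(fields["NAME"])
```

For real work the Biopython parser is the safer choice, since it already knows how to interpret individual fields such as `DBLINKS` and `GENES`.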


In [60]:
reactions_df = pd.read_csv(REST.kegg_list('reaction'), sep='\t', header=None, names=['reaction ID', 'description'])

reactions_df.head()

|     | reaction ID | description                                        |
| :-- | :---------- | :------------------------------------------------- |
| 0   | rn:R00001   | polyphosphate polyphosphohydrolase; Polyphosph...  |
| 1   | rn:R00002   | Reduced ferredoxin:dinitrogen oxidoreductase (...  |
| 2   | rn:R00004   | diphosphate phosphohydrolase; pyrophosphate ph...  |
| 3   | rn:R00005   | urea-1-carboxylate amidohydrolase; Urea-1-carb...  |
| 4   | rn:R00006   | pyruvate:pyruvate acetaldehydetransferase (dec...  |


In [61]:
reactions_df.description[0]

'polyphosphate polyphosphohydrolase; Polyphosphate + n H2O <=> (n+1) Oligophosphate'
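Each description packs the reaction name and the equation into one string separated by `; `. A sketch of splitting it into names, substrates and products follows. It assumes the equation is the last `; `-separated segment and uses ` <=> ` and ` + ` as separators, which holds for the row shown above but has not been checked against every row. `split_reaction` is an illustrative name.

```python
def split_reaction(description):
    """Split 'name; ...; A + B <=> C' into (names, substrates, products)."""
    *names, equation = description.split("; ")
    left, arrow, right = equation.partition(" <=> ")
    if not arrow:  # no equation found in the last segment
        return names + [equation], [], []
    substrates = [term.strip() for term in left.split(" + ")]
    products = [term.strip() for term in right.split(" + ")]
    return names, substrates, products


names, substrates, products = split_reaction(
    "polyphosphate polyphosphohydrolase; "
    "Polyphosphate + n H2O <=> (n+1) Oligophosphate"
)
print(names)
print(substrates)
print(products)
```

Stoichiometric coefficients such as `n` and `(n+1)` are left attached to the compound names; separating them would need a further parsing step.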

In [62]:
reactions_df.shape

(11136, 2)