# Chapter 18 KEGG

KEGG (http://www.kegg.jp/) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

Please note that the KEGG parser implementation in Biopython is incomplete. While the KEGG website indicates many flat file formats, only parsers and writers for compound, enzyme, and map are currently implemented. However, a generic parser is implemented to handle the other formats.

## 18.1 Parsing KEGG records
Parsing a KEGG record is as simple as using any other file format parser in Biopython. (Before running the
following code, open http://rest.kegg.jp/get/ec:5.4.2.2 in your web browser and save the page as
`ec_5.4.2.2.txt`.)

In [1]:
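from Bio.KEGG import Enzyme

# Parse the saved flat file; the context manager closes the handle afterwards
with open("ec_5.4.2.2.txt") as handle:
    records = list(Enzyme.parse(handle))
record = records[0]  # the file holds a single enzyme record
record.classname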

['Isomerases;',
 'Intramolecular transferases;',
 'Phosphotransferases (phosphomutases)']

In [2]:
record.entry

'5.4.2.2'

## 18.2 Querying the KEGG API

Biopython fully supports querying the KEGG API. All endpoints and methods documented by KEGG (http://www.kegg.jp/kegg/rest/keggapi.html) are supported. The interface validates queries against some of the rules defined on the KEGG site. However, invalid queries that return a 400 or 404 must be handled by the user.
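Since 400 and 404 replies are left to the caller, one way to handle them is a small wrapper such as the sketch below. The helper `fetch_or_none` and the stand-in `fake_fetch` are hypothetical names introduced for illustration only. The sketch assumes that the REST functions surface failed requests as `urllib.error.HTTPError`; check this against your Biopython version.

```python
import io
from email.message import Message
from urllib.error import HTTPError


def fetch_or_none(fetch, *args):
    """Call fetch(*args) and return its text, or None on an HTTP 400/404 reply.

    `fetch` is any callable returning a readable handle, for example
    Bio.KEGG.REST.kegg_get (assumed here to raise urllib's HTTPError).
    """
    try:
        return fetch(*args).read()
    except HTTPError as err:
        if err.code in (400, 404):
            return None
        raise  # any other HTTP error is passed on unchanged


def fake_fetch(query):
    """Stand-in for a KEGG call so the sketch runs without network access."""
    if query == "ec:5.4.2.2":
        return io.StringIO("ENTRY       EC 5.4.2.2                  Enzyme\n")
    url = "http://rest.kegg.jp/get/" + query
    raise HTTPError(url, 404, "Not Found", Message(), io.BytesIO())


found = fetch_or_none(fake_fetch, "ec:5.4.2.2")    # text of the record
missing = fetch_or_none(fake_fetch, "ec:0.0.0.0")  # None: the query failed
```

With Biopython installed, the same wrapper could be called as `fetch_or_none(REST.kegg_get, "ec:5.4.2.2")`.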

In [3]:
from Bio.KEGG import REST
from Bio.KEGG import Enzyme

# Download the record and save it locally before parsing it
request = REST.kegg_get("ec:5.4.2.2")
with open("ec_5.4.2.2.txt", "w") as handle:
    handle.write(request.read())
with open("ec_5.4.2.2.txt") as handle:
    records = list(Enzyme.parse(handle))
record = records[0]
record.classname

['Isomerases;',
 'Intramolecular transferases;',
 'Phosphotransferases (phosphomutases)']

In [4]:
record.entry

'5.4.2.2'

In [5]:
from Bio.KEGG import REST
human_pathways = REST.kegg_list("pathway", "hsa").read()

In [6]:
# Filter all human pathways for repair pathways
repair_pathways = []
for line in human_pathways.rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)
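
The filtering step can be tried without a network connection. The sketch below wraps the same loop in a hypothetical helper, `filter_listing`, and runs it on a short hand-written sample in the tab-separated layout that the loop above expects; the sample is illustrative and is not actual API output. Unlike the cell above, the match is case-insensitive, which is a design choice of this sketch.

```python
def filter_listing(listing, keyword):
    """Return entry IDs whose description contains keyword (case-insensitive)."""
    matches = []
    for line in listing.rstrip().split("\n"):
        # partition() tolerates a line with no tab instead of raising ValueError
        entry, _, description = line.partition("\t")
        if keyword.lower() in description.lower():
            matches.append(entry)
    return matches


# Hand-written sample in the "<entry>\t<description>" layout (illustrative only)
SAMPLE_LISTING = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa03410\tBase excision repair - Homo sapiens (human)\n"
    "hsa03420\tNucleotide excision repair - Homo sapiens (human)\n"
    "hsa03430\tMismatch repair - Homo sapiens (human)\n"
)

repair = filter_listing(SAMPLE_LISTING, "repair")
print(repair)
```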


In [7]:
# Get the genes for pathways and add them to a list
repair_genes = [] 
for pathway in repair_pathways:
    pathway_file = REST.kegg_get(pathway).read()  # query and read each pathway

    # iterate through each KEGG pathway file, keeping track of which section
    # of the file we're in, and only read the genes in each pathway
    current_section = None
    for line in pathway_file.rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if section != "":
            current_section = section
        
        if current_section == "GENE":
            # split only on the first "; " in case the description contains one
            gene_identifiers, gene_description = line[12:].split("; ", 1)
            gene_id, gene_symbol = gene_identifiers.split()

            if gene_symbol not in repair_genes:
                repair_genes.append(gene_symbol)

print(
    "There are %d repair pathways and %d repair genes. The genes are:"
    % (len(repair_pathways), len(repair_genes))
)
print(", ".join(repair_genes))

There are 3 repair pathways and 100 repair genes. The genes are:
OGG1, NTHL1, NEIL1, NEIL2, NEIL3, UNG, SMUG1, MUTYH, MPG, MBD4, TDG, APEX1, PNKP, TDP1, POLB, POLL, HMGB1, PARP1, PARP2, PARP3, PARP4, PARG, ADPRS, APTX, XRCC1, POLG, POLG2, LIG3, POLD1, POLD2, POLD3, POLD4, POLE, POLE2, POLE3, POLE4, PCNA, RFC1, RFC4, RFC2, RFC5, RFC3, FEN1, LIG1, RBX1, CUL4B, CUL4A, DDB1, DDB2, XPC, RAD23B, RAD23A, CETN2, ERCC8, ERCC6, UVSSA, POLR2A, POLR2B, POLR2C, POLR2D, POLR2E, POLR2F, POLR2G, POLR2H, POLR2I, POLR2L, POLR2K, POLR2J, POLR2J3, POLR2J2, POLR2M, CDK7, MNAT1, CCNH, ERCC3, ERCC2, GTF2H5, GTF2H1, GTF2H2, GTF2H2C_2, GTF2H2C, GTF2H3, GTF2H4, ERCC5, BIVM-ERCC5, XPA, RPA1, RPA2, RPA3, RPA4, ERCC4, ERCC1, SSBP1, PMS2, MLH1, MSH6, MSH2, MSH3, MLH3, EXO1
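
The section-tracking logic above can also be written as a standalone function and exercised offline. The following is a minimal sketch, not Biopython's parser: `genes_in_record` is a hypothetical helper, and `SAMPLE_RECORD` is a shortened, hand-written fragment in the 12-column flat-file layout rather than a real download. Lines in the GENE section that do not match the `<id>  <symbol>; <description>` pattern are skipped instead of raising an error.

```python
def genes_in_record(record_text):
    """Return unique gene symbols from the GENE section, in order of appearance."""
    symbols = []
    current_section = None
    for line in record_text.splitlines():
        keyword = line[:12].strip()  # section names sit in the first 12 columns
        if keyword:
            current_section = keyword
        if current_section != "GENE":
            continue
        identifiers, sep, _description = line[12:].partition("; ")
        parts = identifiers.split()
        if not sep or len(parts) != 2:
            continue  # not of the form "<id>  <symbol>; <description>"
        symbol = parts[1]
        if symbol not in symbols:
            symbols.append(symbol)
    return symbols


# Hand-written, shortened sample record (illustrative only)
SAMPLE_RECORD = (
    "ENTRY       hsa03410                    Pathway\n"
    "NAME        Base excision repair - Homo sapiens (human)\n"
    "GENE        4968  OGG1; 8-oxoguanine DNA glycosylase\n"
    "            4913  NTHL1; nth like DNA glycosylase 1\n"
    "            4968  OGG1; 8-oxoguanine DNA glycosylase\n"
    "COMPOUND    C00001  H2O\n"
    "///\n"
)

genes = genes_in_record(SAMPLE_RECORD)
print(", ".join(genes))
```

Keeping the symbols in a list preserves the order in which they appear in the record, at the cost of a linear membership test; for much larger gene sets, a set kept alongside the list would make that test faster.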
