# Map from chemical names to KEGG IDs

This notebook shows how to map chemical names (e.g. '16-Hydroxypalmitate') to KEGG IDs (e.g. C18218). The resulting dataframe can be used as input to WebOmics analysis.

In [18]:
import os
import pandas as pd
from bioservices import KEGG

### Read TSV data

In [19]:
base_dir = 'C:\\Users\\joewa\\Dropbox\\Analysis\\omics_integration\\covid19_data'

In [21]:
df = pd.read_csv(os.path.join(base_dir, 'compound_names.tsv'), sep='\t', index_col=0)

In [22]:
df.head()

Identifier,h_jkdz1,h_jkdz2,h_jkdz3,h_jkdz4,h_jkdz5,h_jkdz6,h_jkdz7,h_jkdz8,h_jkdz9,h_jkdz10,...,s_ZX12,s_ZX13,s_ZX14,s_ZX15,s_ZX16,s_ZX17,s_ZX18,s_ZX19,s_ZX20,s_ZX21
(14 or 15)-methylpalmitate (a17:0 or i17:0),7439425.0,8212134.0,6727393.0,9237738.0,16576250.0,14113510.0,13881904.0,14007920.0,22017360.0,16636080.0,...,7678126.0,5950649.5,9728899.0,14192680.0,8378746.0,9313578.0,6368525.0,9560249.0,8087552.0,5124412.0
(16 or 17)-methylstearate (a19:0 or i19:0),1007010.0,931975.2,597569.6,1043849.0,1898809.0,3054365.0,1874052.0,1515312.0,3905149.0,1793032.0,...,466395.0,740318.0,983813.4,1937461.0,779367.7,1031906.813,644210.0,894437.2,888394.5,585608.9
(2 or 3)-decenoate (10:1n7 or n8),832046.2,854998.8,162133.4,803271.8,1682666.0,2289676.0,596591.25,644776.7,1276429.0,447215.6,...,136620.1,175718.875,667318.8,,1386143.0,2392266.25,1268946.0,996628.6,795150.1,104943.2
"(2,4 or 2,5)-dimethylphenol sulfate",79229.42,146257.3,109284.8,44204.26,781608.9,80404.38,901579.0,399109.2,55137.11,326591.1,...,,,,30958.77,,,,76090.78,,
(R)-3-hydroxybutyrylcarnitine,,,,,,50869.08,,,112154.7,,...,,,,,535975.2,,337522.8,208344.9,90297.62,


The names of the chemical compounds are stored in the index of the dataframe. For example:

In [43]:
df.index.values[0:5]

array(['(14 or 15)-methylpalmitate (a17:0 or i17:0)',
       '(16 or 17)-methylstearate (a19:0 or i19:0)',
       '(2 or 3)-decenoate (10:1n7 or n8)',
       '(2,4 or 2,5)-dimethylphenol sulfate',
       '(R)-3-hydroxybutyrylcarnitine'], dtype=object)

Print the total number of names

In [42]:
len(df.index.values)

905

### Map Chemical Names to KEGG ID

Iterate through the names and query KEGG for each one. Names that return no result are skipped; otherwise the ID of the first hit is stored.

In [None]:
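k = KEGG()  # KEGG client from bioservices

mapping = {}
for name in df.index.values:
    print('Processing', name)

    try:
        res = k.find('compound', name)
        res = res.strip()
    except AttributeError:
        # res is not a string, so the query failed: skip this name
        continue

    if len(res) == 0:
        continue

    # Use only the first hit; each result line is 'cpd:<ID>\t<names>'
    tokens = res.split('\n')[0].split('\t')
    front = tokens[0].split(':')
    kegg_id = front[1]
    kegg_names = tokens[1]
    print(kegg_id, ':', kegg_names)

    mapping[name] = kegg_id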
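
The parsing step in the loop above can be isolated and checked without a network connection. The following is a minimal sketch, assuming the response is plain text with one hit per line in the form `cpd:<ID>`, a tab, then the compound names. The helper name `parse_first_hit` and the sample string are hypothetical and written by hand; the sample is not real KEGG output.

```python
def parse_first_hit(response):
    """Return (kegg_id, names) for the first hit, or None if there is no hit."""
    if not isinstance(response, str):
        return None  # the query failed and did not return text
    lines = response.strip().splitlines()
    if not lines:
        return None  # empty response: no compound matched
    entry, _, names = lines[0].partition('\t')
    kegg_id = entry.split(':')[-1]  # 'cpd:C18218' -> 'C18218'
    return kegg_id, names


# Hand-written sample in the assumed layout, not real KEGG output
sample = 'cpd:C18218\t16-Hydroxypalmitate\ncpd:C00000\tplaceholder second hit'
print(parse_first_hit(sample))
```

Keeping the parsing in a small function like this makes it easier to see which hit is chosen when a name matches several compounds.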

Print the number of names that were mapped to a KEGG ID

In [48]:
print('Found %d/%d KEGG IDs' % (len(mapping), len(df.index.values)))

Found 220/905 KEGG IDs
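
Because the loop sends one query per name, it may be convenient to store the resulting dictionary so that later runs can skip the queries. The sketch below is one possible way to do this with the standard `json` module. The file name `kegg_mapping.json` and the temporary directory are assumptions for illustration, and `example_mapping` stands in for the notebook's `mapping`.

```python
import json
import os
import tempfile

# Placeholder for the notebook's `mapping` dictionary
example_mapping = {'16-Hydroxypalmitate': 'C18218'}

# Hypothetical file name, written to a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), 'kegg_mapping.json')

# Save the name -> KEGG ID mapping
with open(path, 'w') as f:
    json.dump(example_mapping, f, indent=2)

# Load it back and confirm nothing was lost
with open(path) as f:
    restored = json.load(f)

print(restored == example_mapping)
```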


Filter the dataframe: remove the compounds with no mapping and replace the remaining names with their KEGG IDs

In [11]:
keep = set(mapping.keys())
drop = set([name for name in df.index.values if name not in keep])
len(drop), len(keep)

(685, 220)

In [17]:
filtered_df = df.drop(drop)
filtered_df = filtered_df.rename(index=mapping)
filtered_df.head()

Identifier,h_jkdz1,h_jkdz2,h_jkdz3,h_jkdz4,h_jkdz5,h_jkdz6,h_jkdz7,h_jkdz8,h_jkdz9,h_jkdz10,...,s_ZX12,s_ZX13,s_ZX14,s_ZX15,s_ZX16,s_ZX17,s_ZX18,s_ZX19,s_ZX20,s_ZX21
C21482,19413052.0,6381812.0,9748316.0,5326872.0,19980720.0,3580375.0,8256121.0,8079382.0,15596590.0,15203630.0,...,1904349.0,3226016.0,737814.7,2817698.0,3329101.0,3206752.75,1466174.0,2779301.0,2117668.0,2184310.0
C18218,2711915.25,2056393.0,1445594.0,2038765.0,2536996.0,2638198.0,2285757.0,1973140.0,2015425.0,2290842.0,...,1409720.0,1413307.0,3218834.0,1602131.0,1317878.0,2930312.75,1168094.0,2946776.0,1417311.0,1474166.0
C05127,87727.25,,92387.06,,159787.9,,,,90551.3,121411.4,...,,,,,,,,138278.8,,
C01152,58832828.0,58439340.0,55521330.0,45162140.0,54789520.0,39412590.0,29878760.0,67517260.0,46660310.0,91185240.0,...,28813140.0,31643580.0,25387670.0,33076040.0,39156980.0,24400592.0,25933750.0,64138680.0,40205880.0,49044880.0
C02918,,181554.9,224039.2,160939.7,320619.4,717655.7,326818.2,513581.0,273458.2,,...,333724.5,,434715.2,35321.18,,655827.25,835970.6,4034381.0,283935.8,80621.6
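
Renaming the index with a dictionary does not guarantee unique labels: if two chemical names map to the same KEGG ID, the renamed dataframe will contain that ID twice. Whether this happens in the data above is not shown, so the following is only a sketch on toy data of how such duplicates could be detected. All names, IDs and values in it are made up, and keeping the first row is just one possible policy.

```python
import pandas as pd

# Toy data: the names, IDs and values are made up for illustration
toy = pd.DataFrame({'s1': [1.0, 2.0, 3.0]},
                   index=['name a', 'name b', 'name c'])
toy_mapping = {'name a': 'ID_X', 'name b': 'ID_X', 'name c': 'ID_Y'}

renamed = toy.rename(index=toy_mapping)

# IDs that occur more than once in the index
dup_ids = renamed.index[renamed.index.duplicated()].unique().tolist()
print(dup_ids)

# One possible policy: keep the first row for each ID
deduplicated = renamed[~renamed.index.duplicated(keep='first')]
print(deduplicated)
```

Other policies, such as summing or averaging the rows that share an ID, may suit the downstream analysis better.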


Save output CSV

In [23]:
filtered_df.to_csv(os.path.join(base_dir, 'compound_data.csv'))