*Nutrition & Biomedicine Internship*

# Example: Using ElementTree to access information in the HMDB metabolites XML file
This code was written to extract specific identifiers and synonyms from HMDB; it does not extract all metabolite information comprehensively.

In [1]:
# import important packages
import xml.etree.ElementTree as ET
import pandas as pd

In [2]:
# read in the xml file and get root (might take some time as the file is very big)
tree = ET.parse('hmdb_metabolites.xml')
root = tree.getroot()
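
Because `ET.parse` loads the whole file into memory, a streaming alternative may be worth considering for a file of this size. The following is a minimal sketch using `ET.iterparse`, not the approach used in the rest of this notebook. The helper name `iter_metabolites` and the two-record sample document are assumptions made for illustration; with the real data one would pass `'hmdb_metabolites.xml'` as the source instead.

```python
import io
import xml.etree.ElementTree as ET

NS = '{http://www.hmdb.ca}'

def iter_metabolites(source):
    """Yield one <metabolite> element at a time (hypothetical helper)."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == NS + 'metabolite':
            yield elem
            # drop the element's children and text once the caller is done with it
            elem.clear()

# Tiny stand-in for hmdb_metabolites.xml (illustrative content only)
sample = io.BytesIO(
    b'<hmdb xmlns="http://www.hmdb.ca">'
    b'<metabolite><accession>HMDB0000001</accession><name>1-Methylhistidine</name></metabolite>'
    b'<metabolite><accession>HMDB0000002</accession><name>1,3-Diaminopropane</name></metabolite>'
    b'</hmdb>'
)

accessions = [m.findtext(NS + 'accession') for m in iter_metabolites(sample)]
print(accessions)
```

Each element has to be read before the generator resumes, because `elem.clear()` empties it afterwards.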

In [3]:
root.tag

'{http://www.hmdb.ca}hmdb'

In [4]:
# Every child should correspond to a metabolite
child = root[0]
child.tag

'{http://www.hmdb.ca}metabolite'
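
The `{http://www.hmdb.ca}` prefix in the tags above is the XML namespace of the document. As a small sketch of two common ways to deal with it, the snippet below strips the prefix from a tag and uses a prefix-to-URI mapping with `find` and `findtext`. The prefix `hmdb` and the one-record sample document are arbitrary choices for this illustration; only the namespace URI comes from the file.

```python
import xml.etree.ElementTree as ET

# prefix -> namespace URI; the prefix name 'hmdb' is our own choice
ns = {'hmdb': 'http://www.hmdb.ca'}

# Minimal stand-in for the real file (illustrative content only)
sample_root = ET.fromstring(
    '<hmdb xmlns="http://www.hmdb.ca">'
    '<metabolite><accession>HMDB0000001</accession><name>1-Methylhistidine</name></metabolite>'
    '</hmdb>'
)

first = sample_root.find('hmdb:metabolite', ns)

# remove the '{namespace}' part to get the bare tag name
local_tag = first.tag.split('}', 1)[1]

# look up a child by its qualified name without spelling out the full URI
name = first.findtext('hmdb:name', namespaces=ns)

print(local_tag, name)
```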

In [5]:
# Have a look at what information is contained in each child
for element in child:
    print(element.tag)
    print(element.text)

{http://www.hmdb.ca}version
4.0
{http://www.hmdb.ca}creation_date
2005-11-16 15:48:42 UTC
{http://www.hmdb.ca}update_date
2020-04-23 20:53:36 UTC
{http://www.hmdb.ca}accession
HMDB0000001
{http://www.hmdb.ca}status
quantified
{http://www.hmdb.ca}secondary_accessions

    
{http://www.hmdb.ca}name
1-Methylhistidine
{http://www.hmdb.ca}description
1-Methylhistidine, also known as 1-MHis, belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from a reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom. 1-Methylhistidine is derived mainly from the anserine of dietary flesh sources, especially poultry. The enzyme, carnosinase, splits anserine into beta-alanine and 1-MHis. High levels of 1-MHis tend to inhibit the enzyme carnosinase and increase anserine levels. Conversely, genetic variants with defici
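
Some elements, such as `secondary_accessions` above, print only whitespace because their content sits in nested child elements rather than in their own text. The sketch below shows one way to read such a container. It assumes that the children are named `accession`, and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

ns = {'hmdb': 'http://www.hmdb.ca'}

# Illustrative fragment; the child tag name and the values are assumptions
sample_met = ET.fromstring(
    '<metabolite xmlns="http://www.hmdb.ca">'
    '<secondary_accessions>\n'
    '    <accession>HMDB00001</accession>\n'
    '    <accession>HMDB0004935</accession>\n'
    '  </secondary_accessions>'
    '</metabolite>'
)

container = sample_met.find('hmdb:secondary_accessions', ns)

# the container's own text is just the whitespace before its first child
print(repr(container.text))

# the actual values live in the child elements
secondary = [acc.text for acc in container.findall('hmdb:accession', ns)]
print(secondary)
```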

In [6]:
# prepare the containers to fill: a dict of synonyms per HMDB ID and a list of rows (the first row holds the column names)
hmdb_synonyms = {}
hmdb_mets = [['name', 'hmdb_id', 'iupac', 'kegg_id', 'foodb_id', 'chemspider_id', 'drugbank_id', 'pdb_id',
             'chebi_id', 'pubchem_compound_id', 'wikipedia_id', 'bigg_id', 'vmh_id']]

In [7]:
# iterate through every child (metabolite) contained in the root and extract information on selected identifiers
for child in root:
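    synonym_set = set()
    # reset all fields so that a missing element cannot inherit the value of the previous metabolite
    name = hmdb_id = iupac_id = kegg_id = foodb_id = chemspider_id = drugbank_id = None
    pdb_id = chebi_id = pubchem_id = wikipedia_id = bigg_id = vmh_id = None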

    for elem in child:
        if elem.tag == r'{http://www.hmdb.ca}name':
            name = elem.text
            synonym_set.add(name)

        if elem.tag == r'{http://www.hmdb.ca}synonyms':
            for synonym in elem:
                # skip empty synonym entries (text is None) and the name itself
                if synonym.text and synonym.text != name:
                    synonym_set.add(synonym.text)
                    synonym_set.add(synonym.text.lower())

        if elem.tag == r'{http://www.hmdb.ca}accession':
            hmdb_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}iupac_name':
            iupac_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}kegg_id':
            kegg_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}foodb_id':
            foodb_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}chemspider_id':
            chemspider_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}drugbank_id':
            drugbank_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}pdb_id':
            pdb_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}chebi_id':
            chebi_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}pubchem_compound_id':
            pubchem_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}wikipedia_id':
            wikipedia_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}bigg_id':
            bigg_id = elem.text

        if elem.tag == r'{http://www.hmdb.ca}vmh_id':
            vmh_id = elem.text

    hmdb_synonyms[hmdb_id] = synonym_set

    metabolite = [name, hmdb_id, iupac_id, kegg_id, foodb_id, chemspider_id, drugbank_id, pdb_id, chebi_id, pubchem_id, wikipedia_id, bigg_id, vmh_id]
    hmdb_mets.append(metabolite)
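
The `hmdb_synonyms` dictionary maps each HMDB ID to its set of synonyms. If the goal is to find the HMDB ID for a given metabolite name, the mapping can be inverted. The following is a minimal sketch under that assumption; `hmdb_synonyms_example`, `synonym_to_ids` and `lookup` are hypothetical names, and the two example entries are only illustrative.

```python
from collections import defaultdict

# Small stand-in with the same shape as hmdb_synonyms (illustrative entries)
hmdb_synonyms_example = {
    'HMDB0000001': {'1-Methylhistidine', '1-MHis', '1-mhis'},
    'HMDB0000002': {'1,3-Diaminopropane', 'Trimethylenediamine', 'trimethylenediamine'},
}

# invert the mapping: lower-cased synonym -> set of HMDB IDs
synonym_to_ids = defaultdict(set)
for hmdb_id, synonyms in hmdb_synonyms_example.items():
    for synonym in synonyms:
        synonym_to_ids[synonym.lower()].add(hmdb_id)

def lookup(name):
    """Return the sorted HMDB IDs whose synonyms match the name, ignoring case."""
    return sorted(synonym_to_ids.get(name.lower(), ()))

print(lookup('1-MHIS'))
print(lookup('not a metabolite'))
```

A set of IDs is kept per synonym because the same synonym could, in principle, be listed for more than one metabolite.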

In [8]:
# Turn information saved in the list into a pandas dataframe to have a better view of the data
df_hmbd = pd.DataFrame(hmdb_mets[1:], columns=hmdb_mets[0])

In [9]:
df_hmbd

| | name | hmdb_id | iupac | kegg_id | foodb_id | chemspider_id | drugbank_id | pdb_id | chebi_id | pubchem_compound_id | wikipedia_id | bigg_id | vmh_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1-Methylhistidine | HMDB0000001 | (2S)-2-amino-3-(1-methyl-1H-imidazol-4-yl)prop... | C01152 | FDB093588 | 83153 | DB04151 | | 50599 | 92105 | Methylhistidine | | |
| 1 | 1,3-Diaminopropane | HMDB0000002 | propane-1,3-diamine | C00986 | FDB031131 | 415 | | | 15725 | 428 | 1,3-Diaminopropane | | |
| 2 | 2-Ketobutyric acid | HMDB0000005 | 2-oxobutanoic acid | C00109 | FDB030356 | 57 | DB04553 | | 30831 | 58 | Alpha-Ketobutyric_acid | 33889 | 2OBUT |
| 3 | 2-Hydroxybutyric acid | HMDB0000008 | (2S)-2-hydroxybutanoic acid | C05984 | FDB021867 | 389701 | | | 50613 | 440864 | 2-Hydroxybutyric_acid | | |
| 4 | 2-Methoxyestrone | HMDB0000010 | (1S,10R,11S,15S)-5-hydroxy-4-methoxy-15-methyl... | C05299 | FDB021868 | 389515 | | | 1189 | 440624 | 2-Methoxyestrone | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 114217 | Cer(d17:1/18:0) | HMDB0240683 | N-[(2S,3R,4E)-1,3-dihydroxyheptadec-4-en-2-yl]... | | | 8160656 | | | | 9985066 | | | |
| 114218 | Cer(d20:1/18:0) | HMDB0240684 | N-[(2S,3R,4E)-1,3-dihydroxyicos-4-en-2-yl]octa... | | | | | | | 101853667 | | | |
| 114219 | Cer(d17:1/16:0) | HMDB0240685 | N-[(2S,3R,4E)-1,3-dihydroxyheptadec-4-en-2-yl]... | | | 8431739 | | | | 10256256 | | | |
| 114220 | Cer(d18:2(4E,14Z)/16:0) | HMDB0240686 | N-[(2S,3R,4E,14Z)-1,3-dihydroxyoctadeca-4,14-d... | | | | | | | 52931118 | | | |


In [10]:
# Save data as csv
df_hmbd.to_csv('hmdb_metabolites_specific_identifiers.csv', index=False)
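
When the CSV is read back later, pandas infers column types, so numeric-looking identifier columns that contain missing values are likely to come back as floats. The sketch below illustrates this with a small in-memory example and shows `dtype=str` as one possible way to keep identifiers as text. The frame `df_example` and its two rows are a stand-in for the real table, used only for illustration.

```python
import io
import pandas as pd

# Small stand-in for the saved table (one identifier is missing)
df_example = pd.DataFrame(
    [['1-Methylhistidine', 'HMDB0000001', '83153'],
     ['Cer(d20:1/18:0)', 'HMDB0240684', None]],
    columns=['name', 'hmdb_id', 'chemspider_id'],
)

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_example.to_csv(buffer, index=False)

# default type inference: the ID column becomes numeric
buffer.seek(0)
default_read = pd.read_csv(buffer)

# dtype=str keeps every non-missing value as text
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype=str)

print(default_read['chemspider_id'].tolist())
print(string_read['chemspider_id'].tolist())
```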