# <b>AOP-Wiki XML conversion to RDF</b>
Author: Marvin Martens

The [AOP-Wiki](https://aopwiki.org/) is the central repository for qualitative descriptions of AOPs, and releases its database every three months in XML format. This Jupyter notebook converts the AOP-Wiki XML into RDF in Turtle (ttl) syntax.

It downloads and parses the AOP-Wiki XML file with the ElementTree XML API Python library, and stores its components in nested dictionaries for all subjects that form the basis of the AOP-Wiki: AOPs, KEs, KERs, stressors, chemicals, taxonomy, cell-terms, organ-terms, and the KE components, which comprise Biological Processes (BPs), Biological Objects (BOs) and Biological Actions (BAs). While these dictionaries are filled, semantic annotations are added for the subjects, for the relationship (predicate) to their property (object), and for the properties themselves when they represent an identifier or ontology term.

<img src="Overview AOP-Wiki RDF.svg" style="width: 650px;">
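
As a rough illustration of the nested-dictionary layout described above, the sketch below stores one made-up AOP under predicate keys and writes it out as Turtle. The entry, its values, and the `to_turtle` helper are hypothetical and only meant to show the idea; they are not the notebook's own serialization code.

```python
# Hypothetical example entry: keys are predicates, values are already
# formatted as Turtle objects (CURIEs, IRIs, or quoted literals).
example_aopdict = {
    "1": {
        "dc:identifier": "aop:1",
        "rdfs:label": '"AOP 1"',
        "foaf:page": "<https://identifiers.org/aop/1>",
    }
}


def to_turtle(entities):
    """Serialize {id: {predicate: object}} into Turtle statements."""
    blocks = []
    for props in entities.values():
        subject = props["dc:identifier"]  # the identifier doubles as the subject
        lines = [f"\t{pred}\t{obj}" for pred, obj in props.items()]
        blocks.append(subject + "\n" + " ;\n".join(lines) + " .")
    return "\n\n".join(blocks)


print(to_turtle(example_aopdict))
```

Keeping the predicate as the dictionary key means the serialization step does not need to know which kind of entity it is writing.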

## <b>Step #1: imports</b>
First, all required Python libraries are imported. The cell also upgrades `pip`; libraries that are missing from the system must be installed with `pip install` before the imports succeed.

In [61]:
import sys

!{sys.executable} -m pip install --upgrade pip 
from xml.etree.ElementTree import parse
import re
TAG_RE = re.compile(r'<[^>]+>')
import requests
import datetime
import urllib.request  # urlretrieve is used later to download promapping.txt
import gzip
import shutil
import os
import stat
import time
import pandas as pd
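
The cell above only upgrades `pip`. A minimal sketch of installing a library on demand when its import fails could look as follows; the `ensure_module` helper is hypothetical and not part of the notebook.

```python
import importlib
import subprocess
import sys


def ensure_module(module_name, pip_name=None):
    """Import a module, installing it with pip first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install into the interpreter that runs this notebook.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()  # make the new package visible
        return importlib.import_module(module_name)


# A standard-library module is always importable, so nothing is installed here.
json_module = ensure_module("json")
```

For third-party packages the same call would be, for example, `ensure_module("pandas")`; the optional `pip_name` covers packages whose install name differs from the import name.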




In [62]:
!pipreqsnb . 

/bin/bash: line 1: pipreqsnb: command not found


This notebook maps identifiers for chemicals and genes. To make this possible, the URL of the BridgeDb service must be defined in the `bridgedb` variable and end with `/Human/`. The quickest way to execute the code is to use a local BridgeDb service launched from the BridgeDb Docker image by following the [instructions](https://github.com/bridgedb/docker). Alternatively, the live web service can be used by setting the `bridgedb` variable to `'https://webservice.bridgedb.org/Human/'`.

In [63]:
bridgedb = 'https://webservice.bridgedb.org/Human/'#'http://localhost:8180/Human/'
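
The chemical mapping later in this notebook appends `xrefs/Ca/` and a CAS number to this base URL. The sketch below shows how such a request URL is put together; the `xrefs_url` helper is hypothetical, and the CAS number is only an example input.

```python
def xrefs_url(base_url, system_code, identifier):
    """Build a BridgeDb xrefs URL, tolerating a missing trailing slash."""
    return base_url.rstrip("/") + "/xrefs/" + system_code + "/" + identifier


live = "https://webservice.bridgedb.org/Human/"
local = "http://localhost:8180/Human"  # trailing slash left out on purpose

print(xrefs_url(live, "Ca", "50-00-0"))
print(xrefs_url(local, "Ca", "50-00-0"))
```

Normalizing the slash in one place avoids malformed URLs when switching between the local and the live service.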

## <b>Step #2: Getting the AOP-Wiki XML</b>
Next, the latest version of the AOP-Wiki XML is downloaded from the [download page of the AOP-Wiki](https://aopwiki.org/downloads/) and saved under the name defined in the `aopwikixmlfilename` variable, which includes today's date. The file is then unzipped and opened, after which the ElementTree XML API parses it, making its contents available through `root`.

In [64]:
from datetime import date

today = date.today()
print("Today's date:", today)

Today's date: 2025-02-11


In [65]:
aopwikixmlfilename = 'aop-wiki-xml-'+str(today)
response = requests.get('https://aopwiki.org/downloads/aop-wiki-xml.gz', verify=False)
with open(aopwikixmlfilename, 'wb') as f:
    f.write(response.content)



The XML is extracted to the folder defined by the `filepath` variable in the next block of code, which is by default `data/`, relative to the location of the Jupyter notebook. All data files used and produced by this notebook are placed there.

In [66]:
filepath = 'data/'

In [67]:
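try:
    os.makedirs(filepath, exist_ok=True)  # make sure the target folder exists
    with gzip.open(aopwikixmlfilename, 'rb') as f_in:
        with open(filepath+aopwikixmlfilename, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
            print('File ' + filepath+aopwikixmlfilename + ' opened')
except OSError as error:  # file-system errors and invalid gzip data
    print('Check if the filepath is correct:\n' + filepath+aopwikixmlfilename)
    print(error)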

File data/aop-wiki-xml-2025-02-11 opened
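
The extraction step can be tried in isolation with a throwaway archive. The sketch below is a self-contained variant of the cell above; the `extract_gzip` helper and the file names are hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def extract_gzip(archive_path, target_dir, target_name):
    """Decompress archive_path into target_dir/target_name and return the path."""
    os.makedirs(target_dir, exist_ok=True)  # create the folder if it is missing
    target_path = os.path.join(target_dir, target_name)
    with gzip.open(archive_path, "rb") as f_in, open(target_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return target_path


# Demonstration on a small archive created in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo.gz")
    with gzip.open(archive, "wb") as f:
        f.write(b"<data/>")
    extracted = extract_gzip(archive, os.path.join(tmp, "data"), "demo.xml")
    with open(extracted, "rb") as f:
        demo_content = f.read()

print(demo_content)
```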


In [68]:
tree = parse(filepath + aopwikixmlfilename)
root = tree.getroot()
print('The AOP-Wiki XML is parsed correctly, and contains ' + str(len(root)) + ' entities')

aopxml = '{http://www.aopkb.org/aop-xml}'

The AOP-Wiki XML is parsed correctly, and contains 6559 entities
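
Every tag in the file lives in the namespace stored in `aopxml`, which is why the prefix is prepended to each tag name in the cells that follow. The sketch below shows the effect on a made-up miniature document; the element contents are invented.

```python
from xml.etree.ElementTree import fromstring

aopxml = '{http://www.aopkb.org/aop-xml}'

# Made-up miniature document; the real file is far larger.
sample = (
    '<data xmlns="http://www.aopkb.org/aop-xml">'
    '<aop id="a1"><title>Example AOP</title></aop>'
    '<aop id="a2"><title>Another AOP</title></aop>'
    '</data>'
)
sample_root = fromstring(sample)

# Without the namespace prefix nothing is found.
unqualified = sample_root.findall('aop')

# With the prefix, both elements and their children are reachable.
titles = {
    aop.get('id'): aop.find(aopxml + 'title').text
    for aop in sample_root.findall(aopxml + 'aop')
}
print(unqualified, titles)
```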


## <b>Step #3: extracting information from the XML</b>
The next section extracts all information from the main 11 AOP-Wiki entities shown in Figure 1. These are stored in nested dictionaries, while using ontological annotations as keys for semantic mapping of the information. Note that the cell-terms and organ-terms are included in the KE block of code.

First, all reference identifiers for AOPs, KEs, KERs and stressors need to be extracted.

In [69]:
refs = {'AOP': {}, 'KE': {}, 'KER': {}, 'Stressor': {}}
for ref in root.find(aopxml + 'vendor-specific').findall(aopxml + 'aop-reference'):
    refs['AOP'][ref.get('id')] = ref.get('aop-wiki-id')
for ref in root.find(aopxml + 'vendor-specific').findall(aopxml + 'key-event-reference'):
    refs['KE'][ref.get('id')] = ref.get('aop-wiki-id')
for ref in root.find(aopxml + 'vendor-specific').findall(aopxml + 'key-event-relationship-reference'):
    refs['KER'][ref.get('id')] = ref.get('aop-wiki-id')
for ref in root.find(aopxml + 'vendor-specific').findall(aopxml + 'stressor-reference'):
    refs['Stressor'][ref.get('id')] = ref.get('aop-wiki-id')
for item in refs:
    print('\nThe AOP-Wiki XML contains ' + str(len(refs[item])) + ' identifiers for the entity ' + item)


The AOP-Wiki XML contains 514 identifiers for the entity AOP

The AOP-Wiki XML contains 1497 identifiers for the entity KE

The AOP-Wiki XML contains 2123 identifiers for the entity KER

The AOP-Wiki XML contains 719 identifiers for the entity Stressor
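
The four loops in the cell above differ only in the tag name, so the same lookup table can also be built from a small mapping. This is a sketch on made-up XML, assuming the same `vendor-specific` layout; the `reference_tags` name and the identifiers are hypothetical.

```python
from xml.etree.ElementTree import fromstring

aopxml = '{http://www.aopkb.org/aop-xml}'

# Entity name -> reference tag, as used in the cell above.
reference_tags = {
    'AOP': 'aop-reference',
    'KE': 'key-event-reference',
    'KER': 'key-event-relationship-reference',
    'Stressor': 'stressor-reference',
}

# Made-up miniature document with hypothetical identifiers.
sample_root = fromstring(
    '<data xmlns="http://www.aopkb.org/aop-xml"><vendor-specific>'
    '<aop-reference id="uuid-a" aop-wiki-id="1"/>'
    '<key-event-reference id="uuid-k" aop-wiki-id="10"/>'
    '</vendor-specific></data>'
)

vendor = sample_root.find(aopxml + 'vendor-specific')
sample_refs = {
    entity: {
        ref.get('id'): ref.get('aop-wiki-id')
        for ref in vendor.findall(aopxml + tag)
    }
    for entity, tag in reference_tags.items()
}
print(sample_refs)
```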


### Adverse Outcome Pathways

In [70]:
aopdict = {}
kedict = {}
for AOP in root.findall(aopxml + 'aop'):
    aopdict[AOP.get('id')] = {}
    aopdict[AOP.get('id')]['dc:identifier'] = 'aop:' + refs['AOP'][AOP.get('id')]
    aopdict[AOP.get('id')]['rdfs:label'] = '"AOP ' + refs['AOP'][AOP.get('id')] + '"'
    aopdict[AOP.get('id')]['foaf:page'] = '<https://identifiers.org/aop/' + refs['AOP'][AOP.get('id')] + '>'
    aopdict[AOP.get('id')]['dc:title'] = '"' + AOP.find(aopxml + 'title').text + '"'
    aopdict[AOP.get('id')]['dcterms:alternative'] = AOP.find(aopxml + 'short-name').text
    aopdict[AOP.get('id')]['dc:description'] = []
    if AOP.find(aopxml + 'background') is not None:
        aopdict[AOP.get('id')]['dc:description'].append('"""' + TAG_RE.sub('', AOP.find(aopxml + 'background').text) + '"""')
    if AOP.find(aopxml + 'authors').text is not None:
        aopdict[AOP.get('id')]['dc:creator'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'authors').text) + '"""'
    if AOP.find(aopxml + 'abstract').text is not None:
        aopdict[AOP.get('id')]['dcterms:abstract'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'abstract').text) + '"""'
    if AOP.find(aopxml + 'status').find(aopxml + 'wiki-status') is not None:
        aopdict[AOP.get('id')]['dcterms:accessRights'] = '"' + AOP.find(aopxml + 'status').find(aopxml + 'wiki-status').text + '"' 
    if AOP.find(aopxml + 'status').find(aopxml + 'oecd-status') is not None:
        aopdict[AOP.get('id')]['oecd-status'] =  '"' + AOP.find(aopxml + 'status').find(aopxml + 'oecd-status').text + '"' 
    if AOP.find(aopxml + 'status').find(aopxml + 'saaop-status') is not None:
        aopdict[AOP.get('id')]['saaop-status'] =  '"' + AOP.find(aopxml + 'status').find(aopxml + 'saaop-status').text + '"' 
    aopdict[AOP.get('id')]['oecd-project'] = AOP.find(aopxml + 'oecd-project').text
    aopdict[AOP.get('id')]['dc:source'] = AOP.find(aopxml + 'source').text
    aopdict[AOP.get('id')]['dcterms:created'] = AOP.find(aopxml + 'creation-timestamp').text
    aopdict[AOP.get('id')]['dcterms:modified'] = AOP.find(aopxml + 'last-modification-timestamp').text
    for appl in AOP.findall(aopxml + 'applicability'):
        for sex in appl.findall(aopxml + 'sex'):
            if 'pato:0000047' not in aopdict[AOP.get('id')]:
                aopdict[AOP.get('id')]['pato:0000047'] = [[sex.find(aopxml + 'evidence').text, sex.find(aopxml + 'sex').text]]
            else:
                aopdict[AOP.get('id')]['pato:0000047'].append([sex.find(aopxml + 'evidence').text, sex.find(aopxml + 'sex').text])
        for life in appl.findall(aopxml + 'life-stage'):
            if 'aopo:LifeStageContext' not in aopdict[AOP.get('id')]:
                aopdict[AOP.get('id')]['aopo:LifeStageContext'] = [[life.find(aopxml + 'evidence').text, life.find(aopxml + 'life-stage').text]]
            else:
                aopdict[AOP.get('id')]['aopo:LifeStageContext'].append([life.find(aopxml + 'evidence').text, life.find(aopxml + 'life-stage').text])
    aopdict[AOP.get('id')]['aopo:has_key_event'] = {}
    if AOP.find(aopxml + 'key-events') is not None:
        for KE in AOP.find(aopxml + 'key-events').findall(aopxml + 'key-event'):
            aopdict[AOP.get('id')]['aopo:has_key_event'][KE.get('key-event-id')] = {}
            aopdict[AOP.get('id')]['aopo:has_key_event'][KE.get('key-event-id')]['dc:identifier'] = 'aop.events:' + refs['KE'][KE.get('key-event-id')]
    aopdict[AOP.get('id')]['aopo:has_key_event_relationship'] = {}
    if AOP.find(aopxml + 'key-event-relationships') is not None:
        for KER in AOP.find(aopxml + 'key-event-relationships').findall(aopxml + 'relationship'):
            aopdict[AOP.get('id')]['aopo:has_key_event_relationship'][KER.get('id')] = {}
            aopdict[AOP.get('id')]['aopo:has_key_event_relationship'][KER.get('id')]['dc:identifier'] = 'aop.relationships:' + refs['KER'][KER.get('id')]
            aopdict[AOP.get('id')]['aopo:has_key_event_relationship'][KER.get('id')]['adjacency'] = KER.find(aopxml + 'adjacency').text
            aopdict[AOP.get('id')]['aopo:has_key_event_relationship'][KER.get('id')]['quantitative-understanding-value'] = KER.find(aopxml + 'quantitative-understanding-value').text
            aopdict[AOP.get('id')]['aopo:has_key_event_relationship'][KER.get('id')]['aopo:has_evidence'] = KER.find(aopxml + 'evidence').text
    aopdict[AOP.get('id')]['aopo:has_molecular_initiating_event'] = {}
    for MIE in AOP.findall(aopxml + 'molecular-initiating-event'):
        aopdict[AOP.get('id')]['aopo:has_molecular_initiating_event'][MIE.get('key-event-id')] = {}
        aopdict[AOP.get('id')]['aopo:has_molecular_initiating_event'][MIE.get('key-event-id')]['dc:identifier'] = 'aop.events:' + refs['KE'][MIE.get('key-event-id')]
        aopdict[AOP.get('id')]['aopo:has_key_event'][MIE.get('key-event-id')] = {}
        aopdict[AOP.get('id')]['aopo:has_key_event'][MIE.get('key-event-id')]['dc:identifier'] = 'aop.events:' + refs['KE'][MIE.get('key-event-id')]
        if MIE.find(aopxml + 'evidence-supporting-chemical-initiation').text is not None:
            kedict[MIE.get('key-event-id')] = {}
            aopdict[AOP.get('id')]['dc:description'].append('"""' + TAG_RE.sub('', MIE.find(aopxml + 'evidence-supporting-chemical-initiation').text) + '"""')
    aopdict[AOP.get('id')]['aopo:has_adverse_outcome'] = {}
    for AO in AOP.findall(aopxml + 'adverse-outcome'):
        aopdict[AOP.get('id')]['aopo:has_adverse_outcome'][AO.get('key-event-id')] = {}
        aopdict[AOP.get('id')]['aopo:has_adverse_outcome'][AO.get('key-event-id')]['dc:identifier'] = 'aop.events:' + refs['KE'][AO.get('key-event-id')]
        aopdict[AOP.get('id')]['aopo:has_key_event'][AO.get('key-event-id')] = {}
        aopdict[AOP.get('id')]['aopo:has_key_event'][AO.get('key-event-id')]['dc:identifier'] = 'aop.events:' + refs['KE'][AO.get('key-event-id')]
        if AO.find(aopxml + 'examples').text is not None:
            kedict[AO.get('key-event-id')] = {}
            aopdict[AOP.get('id')]['dc:description'].append('"""' + TAG_RE.sub('', AO.find(aopxml + 'examples').text) + '"""')
    aopdict[AOP.get('id')]['nci:C54571'] = {}
    if AOP.find(aopxml + 'aop-stressors') is not None:
        for stressor in AOP.find(aopxml + 'aop-stressors').findall(aopxml + 'aop-stressor'):
            aopdict[AOP.get('id')]['nci:C54571'][stressor.get('stressor-id')] = {}
            aopdict[AOP.get('id')]['nci:C54571'][stressor.get('stressor-id')]['dc:identifier'] = 'aop.stressor:' + refs['Stressor'][stressor.get('stressor-id')]
            aopdict[AOP.get('id')]['nci:C54571'][stressor.get('stressor-id')]['aopo:has_evidence'] = stressor.find(aopxml + 'evidence').text
    if AOP.find(aopxml + 'overall-assessment').find(aopxml + 'description').text is not None:
        aopdict[AOP.get('id')]['nci:C25217'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'overall-assessment').find(aopxml + 'description').text) + '"""'
    if AOP.find(aopxml + 'overall-assessment').find(aopxml + 'key-event-essentiality-summary').text is not None:
        aopdict[AOP.get('id')]['nci:C48192'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'overall-assessment').find(aopxml + 'key-event-essentiality-summary').text) + '"""'
    if AOP.find(aopxml + 'overall-assessment').find(aopxml + 'applicability').text is not None:
        aopdict[AOP.get('id')]['aopo:AopContext'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'overall-assessment').find(aopxml + 'applicability').text) + '"""'
    if AOP.find(aopxml + 'overall-assessment').find(aopxml + 'weight-of-evidence-summary').text is not None:
        aopdict[AOP.get('id')]['aopo:has_evidence'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'overall-assessment').find(aopxml + 'weight-of-evidence-summary').text) + '"""'
    if AOP.find(aopxml + 'overall-assessment').find(aopxml + 'quantitative-considerations').text is not None:
        aopdict[AOP.get('id')]['edam:operation_3799'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'overall-assessment').find(aopxml + 'quantitative-considerations').text) + '"""'
    if AOP.find(aopxml + 'potential-applications').text is not None:
        aopdict[AOP.get('id')]['nci:C25725'] = '"""' + TAG_RE.sub('', AOP.find(aopxml + 'potential-applications').text) + '"""'
print('Done!\n\nA total of ' + str(len(aopdict)) + ' Adverse Outcome Pathways have been parsed.')

Done!

A total of 514 Adverse Outcome Pathways have been parsed.
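
Free-text fields are stripped of HTML tags with `TAG_RE` and wrapped in triple quotes so that they can be written as Turtle long literals. The cell above does not escape the text itself; the sketch below assumes that backslashes and double quotes may occur in the source text and escapes them. The `to_long_literal` helper is hypothetical.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')


def to_long_literal(text):
    """Strip HTML tags and wrap text as an escaped Turtle long literal."""
    plain = TAG_RE.sub('', text)
    # Escape backslashes first, then double quotes.
    escaped = plain.replace('\\', '\\\\').replace('"', '\\"')
    return '"""' + escaped + '"""'


print(to_long_literal('<p>Binding to the "AhR" receptor</p>'))
```

Escaping every double quote is stricter than Turtle requires, but it rules out a literal that ends in a quote or contains three quotes in a row.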


### Chemicals
For the chemicals in the AOP-Wiki, BridgeDb mappings were added to increase the coverage of chemical databases, using the CAS identifiers already present in the AOP-Wiki.
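
The code below treats each line of the BridgeDb response as an identifier and a data source name separated by a tab. A minimal sketch of that grouping step, run on an invented response, is given here; the `parse_xrefs` helper and the identifiers are hypothetical.

```python
def parse_xrefs(response_text):
    """Group BridgeDb xrefs lines ("identifier<TAB>source") by source name."""
    grouped = {}
    for line in response_text.split('\n'):
        parts = line.split('\t')
        if len(parts) == 2:  # skip blank lines and non-tabular error pages
            identifier, source = parts
            grouped.setdefault(source, []).append(identifier)
    return grouped


# Illustrative response text; the identifiers are made up.
sample_response = 'CHEBI:00001\tChEBI\nQ000001\tWikidata\nCHEBI:00002\tChEBI\n'
print(parse_xrefs(sample_response))
```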

In [71]:
chedict = {}
listofchebi = []
listofchemspider = []
listofwikidata = []
listofchembl = []
listofdrugbank = []
listofpubchem = []
listoflipidmaps = []
listofhmdb = []
listofkegg = []
listofcas = []
listofinchikey = []
listofcomptox = []

for che in root.findall(aopxml + 'chemical'):
    chedict[che.get('id')] = {}
    if che.find(aopxml + 'casrn') is not None:
        if 'NOCAS' not in che.find(aopxml + 'casrn').text:  # all NOCAS ids are taken out, so no issues as subjects
            chedict[che.get('id')]['dc:identifier'] = 'cas:' + che.find(aopxml + 'casrn').text
            listofcas.append('cas:' + che.find(aopxml + 'casrn').text)
            chedict[che.get('id')]['cheminf:000446'] = '"' + che.find(aopxml + 'casrn').text + '"'
            response_text = requests.get(bridgedb+'xrefs/Ca/'+che.find(aopxml + 'casrn').text).text
            a = response_text.split('\n')
            dictionaryforchemical = {}
            if 'html' not in response_text:  # skip HTML error pages returned instead of mappings
                for item in a:
                    b = item.split('\t')
                    if len(b) == 2:
                        # group identifiers (b[0]) by their data source (b[1])
                        dictionaryforchemical.setdefault(b[1], []).append(b[0])
            if 'ChEBI' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000407'] = []
                for chebi in dictionaryforchemical['ChEBI']:
                    # Remove "CHEBI:" prefix if it exists
                    formatted_chebi = "chebi:" + chebi.split("CHEBI:")[-1]
                    if formatted_chebi not in listofchebi:
                        listofchebi.append(formatted_chebi)
                    chedict[che.get('id')]['cheminf:000407'].append(formatted_chebi)
            if 'Chemspider' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000405'] = []
                for chemspider in dictionaryforchemical['Chemspider']:
                    if "chemspider:"+chemspider not in listofchemspider:
                        listofchemspider.append("chemspider:"+chemspider)
                    chedict[che.get('id')]['cheminf:000405'].append("chemspider:"+chemspider)
            if 'Wikidata' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000567'] = []
                for wd in dictionaryforchemical['Wikidata']:
                    if "wikidata:"+wd not in listofwikidata:
                        listofwikidata.append("wikidata:"+wd)
                    chedict[che.get('id')]['cheminf:000567'].append("wikidata:"+wd)
            if 'ChEMBL compound' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000412'] = []
                for chembl in dictionaryforchemical['ChEMBL compound']:
                    if "chembl.compound:"+chembl not in listofchembl:
                        listofchembl.append("chembl.compound:"+chembl)
                    chedict[che.get('id')]['cheminf:000412'].append("chembl.compound:"+chembl)
            if 'PubChem-compound' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000140'] = []
                for pub in dictionaryforchemical['PubChem-compound']:
                    if "pubchem.compound:"+pub not in listofpubchem:
                        listofpubchem.append("pubchem.compound:"+pub)
                    chedict[che.get('id')]['cheminf:000140'].append("pubchem.compound:"+pub)
            if 'DrugBank' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000406'] = []
                for drugbank in dictionaryforchemical['DrugBank']:
                    if "drugbank:"+drugbank not in listofdrugbank:
                        listofdrugbank.append("drugbank:"+drugbank)
                    chedict[che.get('id')]['cheminf:000406'].append("drugbank:"+drugbank)
            if 'KEGG Compound' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000409'] = []
                for kegg in dictionaryforchemical['KEGG Compound']:
                    if "kegg.compound:"+kegg not in listofkegg:
                        listofkegg.append("kegg.compound:"+kegg)
                    chedict[che.get('id')]['cheminf:000409'].append("kegg.compound:"+kegg)
            if 'LIPID MAPS' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000564'] = []
                for lipidmaps in dictionaryforchemical['LIPID MAPS']:
                    if "lipidmaps:"+lipidmaps not in listoflipidmaps:
                        listoflipidmaps.append("lipidmaps:"+lipidmaps)
                    chedict[che.get('id')]['cheminf:000564'].append("lipidmaps:"+lipidmaps)
            if 'HMDB' in dictionaryforchemical:
                chedict[che.get('id')]['cheminf:000408'] = []
                for hmdb in dictionaryforchemical['HMDB']:
                    if "hmdb:"+hmdb not in listofhmdb:
                        listofhmdb.append("hmdb:"+hmdb)
                    chedict[che.get('id')]['cheminf:000408'].append("hmdb:"+hmdb)
        else:
            chedict[che.get('id')]['dc:identifier'] = '"' + che.find(aopxml + 'casrn').text + '"'
    if che.find(aopxml + 'jchem-inchi-key') is not None:
        chedict[che.get('id')]['cheminf:000059'] = 'inchikey:' + str(che.find(aopxml + 'jchem-inchi-key').text)
        listofinchikey.append('inchikey:' + str(che.find(aopxml + 'jchem-inchi-key').text))
    if che.find(aopxml + 'preferred-name') is not None:
        chedict[che.get('id')]['dc:title'] = '"' + che.find(aopxml + 'preferred-name').text + '"'
    if che.find(aopxml + 'dsstox-id') is not None:
        chedict[che.get('id')]['cheminf:000568'] = 'comptox:' + che.find(aopxml + 'dsstox-id').text
        listofcomptox.append('comptox:' + che.find(aopxml + 'dsstox-id').text)
    if che.find(aopxml + 'synonyms') is not None:
        chedict[che.get('id')]['dcterms:alternative'] = []
        for synonym in che.find(aopxml + 'synonyms').findall(aopxml + 'synonym'):
            chedict[che.get('id')]['dcterms:alternative'].append(synonym.text[:-1])
print('Done!\n\nA total of ' + str(len(chedict)) + ' chemicals have been parsed.')

Done!

A total of 412 chemicals have been parsed.
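
The per-database blocks in the cell above all follow the same pattern, so the mapping can also be expressed as a table. The sketch below copies the source names, predicates and prefixes from that cell; the `annotate_chemical` helper is hypothetical and, unlike the cell, does not fill the global `listof...` lists.

```python
# BridgeDb source name -> (predicate, CURIE prefix), copied from the cell above.
xref_targets = {
    'ChEBI': ('cheminf:000407', 'chebi:'),
    'Chemspider': ('cheminf:000405', 'chemspider:'),
    'Wikidata': ('cheminf:000567', 'wikidata:'),
    'ChEMBL compound': ('cheminf:000412', 'chembl.compound:'),
    'PubChem-compound': ('cheminf:000140', 'pubchem.compound:'),
    'DrugBank': ('cheminf:000406', 'drugbank:'),
    'KEGG Compound': ('cheminf:000409', 'kegg.compound:'),
    'LIPID MAPS': ('cheminf:000564', 'lipidmaps:'),
    'HMDB': ('cheminf:000408', 'hmdb:'),
}


def annotate_chemical(xrefs_by_source):
    """Turn {source: [ids]} into {predicate: [CURIEs]}."""
    annotations = {}
    for source, (predicate, prefix) in xref_targets.items():
        for identifier in xrefs_by_source.get(source, []):
            if source == 'ChEBI':
                identifier = identifier.split('CHEBI:')[-1]  # drop "CHEBI:" if present
            annotations.setdefault(predicate, []).append(prefix + identifier)
    return annotations


# Made-up identifiers, only to show the shape of the result.
print(annotate_chemical({'ChEBI': ['CHEBI:123'], 'HMDB': ['HMDB0000001']}))
```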


### Stressors

In [72]:
strdict = {}
for stressor in root.findall(aopxml + 'stressor'):
    strdict[stressor.get('id')] = {}
    strdict[stressor.get('id')]['dc:identifier'] = 'aop.stressor:' + refs['Stressor'][stressor.get('id')]
    strdict[stressor.get('id')]['rdfs:label'] = '"Stressor ' + refs['Stressor'][stressor.get('id')] + '"'
    strdict[stressor.get('id')]['foaf:page'] = '<https://identifiers.org/aop.stressor/' + refs['Stressor'][stressor.get('id')] + '>'
    strdict[stressor.get('id')]['dc:title'] = '"' + stressor.find(aopxml + 'name').text + '"'
    if stressor.find(aopxml + 'description').text is not None:
        strdict[stressor.get('id')]['dc:description'] = '"""' + TAG_RE.sub('', stressor.find(aopxml + 'description').text) + '"""'
    strdict[stressor.get('id')]['dcterms:created'] = stressor.find(aopxml + 'creation-timestamp').text
    strdict[stressor.get('id')]['dcterms:modified'] = stressor.find(aopxml + 'last-modification-timestamp').text
    strdict[stressor.get('id')]['aopo:has_chemical_entity'] = []
    strdict[stressor.get('id')]['linktochemical'] = []
    if stressor.find(aopxml + 'chemicals') is not None:
        for chemical in stressor.find(aopxml + 'chemicals').findall(aopxml + 'chemical-initiator'):
            strdict[stressor.get('id')]['aopo:has_chemical_entity'].append('"' + chemical.get('user-term') + '"')
            strdict[stressor.get('id')]['linktochemical'].append(chemical.get('chemical-id'))
print('Done!\n\nA total of ' + str(len(strdict)) + ' Stressors have been parsed.')

Done!

A total of 719 Stressors have been parsed.


### Taxonomy

In [73]:
taxdict = {}
for tax in root.findall(aopxml + 'taxonomy'):
    taxdict[tax.get('id')] = {}
    taxdict[tax.get('id')]['dc:source'] = tax.find(aopxml + 'source').text
    taxdict[tax.get('id')]['dc:title'] = tax.find(aopxml + 'name').text
    if taxdict[tax.get('id')]['dc:source'] == 'NCBI':
        taxdict[tax.get('id')]['dc:identifier'] = 'ncbitaxon:' + tax.find(aopxml + 'source-id').text
    else:
        # any other (or missing) source: keep the source ID as a plain literal
        taxdict[tax.get('id')]['dc:identifier'] = '"' + tax.find(aopxml + 'source-id').text + '"'
print('Done!\n\nA total of ' + str(len(taxdict)) + ' taxonomies have been parsed.')

Done!

A total of 282 taxonomies have been parsed.


### Key Event Components
These comprise the Biological Actions, Biological Processes, and Biological Objects.

In [74]:
bioactdict = {None: {}}
bioactdict[None]['dc:identifier'] = None
bioactdict[None]['dc:source'] = None
bioactdict[None]['dc:title'] = None
for bioact in root.findall(aopxml + 'biological-action'):
    bioactdict[bioact.get('id')] = {}
    bioactdict[bioact.get('id')]['dc:source'] = '"' + bioact.find(aopxml + 'source').text + '"'
    bioactdict[bioact.get('id')]['dc:title'] = '"' + bioact.find(aopxml + 'name').text + '"'
    bioactdict[bioact.get('id')]['dc:identifier'] = '"WIKI:' + bioact.find(aopxml + 'source-id').text + '"'
print('Done!\nA total of ' + str(len(bioactdict)) + ' Biological Activity annotations have been parsed.')

Done!
A total of 12 Biological Activity annotations have been parsed.


In [75]:
# Initialize bioprodict with default values
bioprodict = {
    None: {
        'dc:identifier': None,
        'dc:source': None,
        'dc:title': None
    }
}

# Mapping of source prefixes to their respective formats
source_prefix_map = {
    '"GO"': ('go:', 3),
    '"MI"': ('mi:', 0),
    '"MP"': ('mp:', 3),
    '"MESH"': ('mesh:', 0),
    '"HP"': ('hp:', 3),
    '"PCO"': ('pco:', 4),
    '"NBO"': ('nbo:', 4),
    '"VT"': ('vt:', 3),
    '"RBO"': ('rbo:', 4),
    '"NCI"': ('nci:', 4),
    '"IDO"': ('ido:', 4),
}

# Loop through biological processes and populate bioprodict
for biopro in root.findall(aopxml + 'biological-process'):
    biopro_id = biopro.get('id')
    bioprodict[biopro_id] = {}

    # Extract values
    source = f'"{biopro.find(aopxml + "source").text}"'
    name = f'"{biopro.find(aopxml + "name").text}"'
    source_id = biopro.find(aopxml + 'source-id').text

    # Populate source and title
    bioprodict[biopro_id]['dc:source'] = source
    bioprodict[biopro_id]['dc:title'] = name

    # Handle identifier based on source prefix
    if source in source_prefix_map:
        prefix, offset = source_prefix_map[source]
        identifier = prefix + source_id[offset:]
        bioprodict[biopro_id]['dc:identifier'] = identifier
    else:
        # Default case for unhandled sources
        bioprodict[biopro_id]['dc:identifier'] = source_id

print(f"Done!\n\nA total of {len(bioprodict)} Biological Process annotations have been parsed.")


Done!

A total of 526 Biological Process annotations have been parsed.
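
Each entry in `source_prefix_map` pairs a lower-case prefix with the number of leading characters to cut from the source identifier. The sketch below shows the effect on two example inputs; the `to_curie` helper and `demo_map` are hypothetical, and the identifiers only illustrate the format.

```python
def to_curie(source, source_id, prefix_map):
    """Swap an ontology ID's original prefix for its lower-case CURIE prefix."""
    if source in prefix_map:
        prefix, offset = prefix_map[source]
        return prefix + source_id[offset:]
    return source_id  # unhandled sources are passed through unchanged


demo_map = {'"GO"': ('go:', 3), '"MESH"': ('mesh:', 0)}

go_curie = to_curie('"GO"', 'GO:0006915', demo_map)     # "GO:" (3 characters) is cut
mesh_curie = to_curie('"MESH"', 'D015427', demo_map)    # offset 0: nothing is cut
other = to_curie('"XYZ"', 'XYZ:1', demo_map)            # source not in the map
print(go_curie, mesh_curie, other)
```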


In [76]:
# Initialize bioobjdict with default values
bioobjdict = {
    None: {
        'dc:identifier': None,
        'dc:source': None,
        'dc:title': None
    }
}
objectstoskip = []
prolist = []

# Mapping of source prefixes to their respective formats
source_prefix_map = {
    '"PR"': ('pr:', 3),
    '"CL"': ('cl:', 3),
    '"MESH"': ('mesh:', 0),
    '"GO"': ('go:', 3),
    '"UBERON"': ('uberon:', 7),
    '"CHEBI"': ('chebio:', 6),
    '"MP"': ('mp:', 3),
    '"FMA"': ('fma:', 4),
    '"PCO"': ('pco:', 4),
}

# Loop through biological objects and populate bioobjdict
for bioobj in root.findall(aopxml + 'biological-object'):
    bioobj_id = bioobj.get('id')
    bioobjdict[bioobj_id] = {}

    # Extract values
    source = f'"{bioobj.find(aopxml + "source").text}"'
    name = f'"{bioobj.find(aopxml + "name").text}"'
    source_id = bioobj.find(aopxml + 'source-id').text

    # Populate source and title
    bioobjdict[bioobj_id]['dc:source'] = source
    bioobjdict[bioobj_id]['dc:title'] = name

    # Handle identifier based on source prefix
    if source in source_prefix_map:
        prefix, offset = source_prefix_map[source]
        identifier = prefix + source_id[offset:]
        bioobjdict[bioobj_id]['dc:identifier'] = identifier

        # Add to prolist if PR
        if source == '"PR"':
            prolist.append(identifier)
    else:
        # Default case for unhandled sources
        bioobjdict[bioobj_id]['dc:identifier'] = f'"{source_id}"'

print(f"Done!\n\nA total of {len(bioobjdict)} Biological Object annotations have been parsed.")


Done!

A total of 476 Biological Object annotations have been parsed.


The Biological Objects that contain terms from the Protein Ontology are mapped to protein identifiers with the PR mapping file `promapping.txt`. This file is downloaded from the [Protein Consortium website](https://proconsortium.org/download/current/) and provides matching identifiers from Entrez Gene, HGNC and UniProt. It is stored in the folder defined by the `filepath` variable in Step #2.

In [77]:
pro = "promapping.txt"
urllib.request.urlretrieve('https://proconsortium.org/download/current/promapping.txt', filepath + pro)

('data/promapping.txt', <http.client.HTTPMessage at 0x797af0b7df10>)

In [78]:
fileStatsObj = os.stat(filepath + pro)
PromodificationTime = time.ctime(fileStatsObj[stat.ST_MTIME])
print("Last Modified Time : ", PromodificationTime)

Last Modified Time :  Tue Feb 11 14:34:16 2025


In [79]:
prodict = {}
hgnclist = []
uniprotlist = []
ncbigenelist = []
with open(filepath+pro, "r") as f:  # the file is closed automatically
    for line in f:
        a = line.split('\t')
        key = 'pr:'+a[0][3:]
        if key in prolist:
            if key not in prodict:
                prodict[key] = []
            if 'HGNC:' in a[1]:
                prodict[key].append('hgnc:'+a[1][5:])
                hgnclist.append('hgnc:'+a[1][5:])
            if 'NCBIGene:' in a[1]:
                prodict[key].append('ncbigene:'+a[1][9:])
                ncbigenelist.append('ncbigene:'+a[1][9:])
            if 'UniProtKB:' in a[1]:
                prodict[key].append('uniprot:'+a[1].split(',')[0][10:])
                uniprotlist.append('uniprot:'+a[1].split(',')[0][10:])
            if prodict[key]==[]:
                del prodict[key]
print('This step added ' + str(len(hgnclist)+len(ncbigenelist)+len(uniprotlist)) + ' identifiers for ' + str(len(prodict)) + ' Protein Ontology terms')

This step added 709 identifiers for 150 Protein Ontology terms
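
The mapping step can be tried without the downloaded file. The sketch below assumes tab-separated lines with the PR term in the first column and one external identifier in the second, as the cell above does; the sample lines and the `parse_promapping` helper are made up. Unlike the cell, it matches prefixes with `startswith` and cuts every identifier at the first comma.

```python
def parse_promapping(lines, wanted):
    """Map wanted PR terms to HGNC, NCBI Gene and UniProt CURIEs."""
    prefixes = {'HGNC:': 'hgnc:', 'NCBIGene:': 'ncbigene:', 'UniProtKB:': 'uniprot:'}
    mapping = {}
    for line in lines:
        columns = line.rstrip('\n').split('\t')
        if len(columns) < 2:
            continue  # skip malformed lines
        key = 'pr:' + columns[0][3:]
        if key not in wanted:
            continue
        for source_prefix, curie_prefix in prefixes.items():
            if columns[1].startswith(source_prefix):
                # keep only the part before any comma
                local_id = columns[1][len(source_prefix):].split(',')[0]
                mapping.setdefault(key, []).append(curie_prefix + local_id)
    return mapping


# Invented sample lines in the assumed three-column layout.
sample_lines = [
    'PR:000000001\tHGNC:0001\texact\n',
    'PR:000000001\tUniProtKB:P00001, P00002\texact\n',
    'PR:000000002\tNCBIGene:42\texact\n',
]
sample_mapping = parse_promapping(sample_lines, {'pr:000000001'})
print(sample_mapping)
```

Passing the wanted terms as a set keeps the membership test fast when the mapping file is large.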


### Key Events
The KEs also include the entities for cell-terms and organ-terms.
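
The KE code below also collects the applicability of each KE (sex and life stage) as lists of `[evidence, term]` pairs. A compact sketch of that pattern with `dict.setdefault`, run on a made-up key event fragment, is shown here; the XML values and the `applicability_keys` name are hypothetical.

```python
from xml.etree.ElementTree import fromstring

aopxml = '{http://www.aopkb.org/aop-xml}'

# Made-up key event fragment with hypothetical values.
sample_ke = fromstring(
    '<key-event xmlns="http://www.aopkb.org/aop-xml" id="k1"><applicability>'
    '<sex><sex>Male</sex><evidence>High</evidence></sex>'
    '<sex><sex>Female</sex><evidence>Moderate</evidence></sex>'
    '<life-stage><life-stage>Adult</life-stage><evidence>High</evidence></life-stage>'
    '</applicability></key-event>'
)

# XML tag -> dictionary key, as used in the notebook.
applicability_keys = {'sex': 'pato:0000047', 'life-stage': 'aopo:LifeStageContext'}

entry = {}
for appl in sample_ke.findall(aopxml + 'applicability'):
    for tag, key in applicability_keys.items():
        # findall only returns direct children, so the nested term is read with find
        for item in appl.findall(aopxml + tag):
            pair = [item.find(aopxml + 'evidence').text, item.find(aopxml + tag).text]
            entry.setdefault(key, []).append(pair)
print(entry)
```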

In [80]:
listofkedescriptions = []
for ke in root.findall(aopxml + 'key-event'):
    if not ke.get('id') in kedict:
        kedict[ke.get('id')] = {}
    kedict[ke.get('id')]['dc:identifier'] = 'aop.events:' + refs['KE'][ke.get('id')]
    kedict[ke.get('id')]['rdfs:label'] = '"KE ' + refs['KE'][ke.get('id')] + '"'
    kedict[ke.get('id')]['foaf:page'] = '<https://identifiers.org/aop.events/' + refs['KE'][ke.get('id')] + '>'
    kedict[ke.get('id')]['dc:title'] = '"' + ke.find(aopxml + 'title').text + '"'
    kedict[ke.get('id')]['dcterms:alternative'] = ke.find(aopxml + 'short-name').text
    kedict[ke.get('id')]['nci:C25664'] = '"""' + ke.find(aopxml + 'biological-organization-level').text + '"""'
    if ke.find(aopxml + 'description').text is not None:
        kedict[ke.get('id')]['dc:description'] = '"""' + TAG_RE.sub('', ke.find(aopxml + 'description').text) + '"""'
#    if ke.find(aopxml + 'evidence-supporting-taxonomic-applicability').text is not None:
#        kedict[ke.get('id')]['dc:description'] = '"""' + TAG_RE.sub('', ke.find(aopxml + 'evidence-supporting-taxonomic-applicability').text) + '"""'
    if ke.find(aopxml + 'measurement-methodology').text is not None:
        kedict[ke.get('id')]['mmo:0000000'] = '"""' + TAG_RE.sub('', ke.find(aopxml + 'measurement-methodology').text) + '"""'
    kedict[ke.get('id')]['biological-organization-level'] = ke.find(aopxml + 'biological-organization-level').text
    kedict[ke.get('id')]['dc:source'] = ke.find(aopxml + 'source').text
    for appl in ke.findall(aopxml + 'applicability'):
        for sex in appl.findall(aopxml + 'sex'):
            if 'pato:0000047' not in kedict[ke.get('id')]:
                kedict[ke.get('id')]['pato:0000047'] = [[sex.find(aopxml + 'evidence').text, sex.find(aopxml + 'sex').text]]
            else:
                kedict[ke.get('id')]['pato:0000047'].append([sex.find(aopxml + 'evidence').text, sex.find(aopxml + 'sex').text])
        for life in appl.findall(aopxml + 'life-stage'):
            if 'aopo:LifeStageContext' not in kedict[ke.get('id')]:
                kedict[ke.get('id')]['aopo:LifeStageContext'] = [[life.find(aopxml + 'evidence').text, life.find(aopxml + 'life-stage').text]]
            else:
                kedict[ke.get('id')]['aopo:LifeStageContext'].append([life.find(aopxml + 'evidence').text, life.find(aopxml + 'life-stage').text])
        for tax in appl.findall(aopxml + 'taxonomy'):
            if 'ncbitaxon:131567' not in kedict[ke.get('id')]:
                if 'dc:identifier' in taxdict[tax.get('taxonomy-id')]:
                    kedict[ke.get('id')]['ncbitaxon:131567'] = [[tax.get('taxonomy-id'), tax.find(aopxml + 'evidence').text, taxdict[tax.get('taxonomy-id')]['dc:identifier'], taxdict[tax.get('taxonomy-id')]['dc:source'], taxdict[tax.get('taxonomy-id')]['dc:title']]]
            else:
                if 'dc:identifier' in taxdict[tax.get('taxonomy-id')]:
                    kedict[ke.get('id')]['ncbitaxon:131567'].append([tax.get('taxonomy-id'), tax.find(aopxml + 'evidence').text, taxdict[tax.get('taxonomy-id')]['dc:identifier'], taxdict[tax.get('taxonomy-id')]['dc:source'], taxdict[tax.get('taxonomy-id')]['dc:title']])
    if ke.find(aopxml + 'biological-events') is not None:
        kedict[ke.get('id')]['biological-event'] = {}
        kedict[ke.get('id')]['biological-event']['go:0008150'] = []
        kedict[ke.get('id')]['biological-event']['pato:0001241'] = []
        kedict[ke.get('id')]['biological-event']['pato:0000001'] = []
        for event in ke.find(aopxml + 'biological-events').findall(aopxml + 'biological-event'):
            if event.get('process-id') is not None:
                kedict[ke.get('id')]['biological-event']['go:0008150'].append(bioprodict[event.get('process-id')]['dc:identifier'])
            if event.get('object-id') is not None:
                kedict[ke.get('id')]['biological-event']['pato:0001241'].append(bioobjdict[event.get('object-id')]['dc:identifier'])
            if event.get('action-id') is not None:
                kedict[ke.get('id')]['biological-event']['pato:0000001'].append(bioactdict[event.get('action-id')]['dc:identifier'])
    if ke.find(aopxml + 'cell-term') is not None:
        kedict[ke.get('id')]['aopo:CellTypeContext'] = {}
        kedict[ke.get('id')]['aopo:CellTypeContext']['dc:source'] = '"' + ke.find(aopxml + 'cell-term').find(aopxml + 'source').text + '"'
        kedict[ke.get('id')]['aopo:CellTypeContext']['dc:title'] = '"' + ke.find(aopxml + 'cell-term').find(aopxml + 'name').text + '"'
        if kedict[ke.get('id')]['aopo:CellTypeContext']['dc:source'] == '"CL"':
            kedict[ke.get('id')]['aopo:CellTypeContext']['dc:identifier'] = ['cl:' + ke.find(aopxml + 'cell-term').find(aopxml + 'source-id').text[3:], ke.find(aopxml + 'cell-term').find(aopxml + 'source-id').text]
        elif kedict[ke.get('id')]['aopo:CellTypeContext']['dc:source'] == '"UBERON"':
            kedict[ke.get('id')]['aopo:CellTypeContext']['dc:identifier'] = ['uberon:' + ke.find(aopxml + 'cell-term').find(aopxml + 'source-id').text[7:], ke.find(aopxml + 'cell-term').find(aopxml + 'source-id').text]
        else:
            kedict[ke.get('id')]['aopo:CellTypeContext']['dc:identifier'] = ['"' + ke.find(aopxml + 'cell-term').find(aopxml + 'source-id').text + '"', 'placeholder']
    if ke.find(aopxml + 'organ-term') is not None:
        kedict[ke.get('id')]['aopo:OrganContext'] = {}
        kedict[ke.get('id')]['aopo:OrganContext']['dc:source'] = '"' + ke.find(aopxml + 'organ-term').find(aopxml + 'source').text + '"'
        kedict[ke.get('id')]['aopo:OrganContext']['dc:title'] = '"' + ke.find(aopxml + 'organ-term').find(aopxml + 'name').text + '"'
        if kedict[ke.get('id')]['aopo:OrganContext']['dc:source'] == '"UBERON"':
            kedict[ke.get('id')]['aopo:OrganContext']['dc:identifier'] = ['uberon:' + ke.find(aopxml + 'organ-term').find(aopxml + 'source-id').text[7:], ke.find(aopxml + 'organ-term').find(aopxml + 'source-id').text]
        else:
            kedict[ke.get('id')]['aopo:OrganContext']['dc:identifier'] = [
                '"' + ke.find(aopxml + 'organ-term').find(aopxml + 'source-id').text + '"', 'placeholder']
    if ke.find(aopxml + 'key-event-stressors') is not None:
        kedict[ke.get('id')]['nci:C54571'] = {}
        for stressor in ke.find(aopxml + 'key-event-stressors').findall(aopxml + 'key-event-stressor'):
            kedict[ke.get('id')]['nci:C54571'][stressor.get('stressor-id')] = {}
            kedict[ke.get('id')]['nci:C54571'][stressor.get('stressor-id')]['dc:identifier'] = strdict[stressor.get('stressor-id')]['dc:identifier']
            kedict[ke.get('id')]['nci:C54571'][stressor.get('stressor-id')]['aopo:has_evidence'] = stressor.find(aopxml + 'evidence').text
print('Done!\n\nA total of ' + str(len(kedict)) + ' Key Events have been parsed.')

Done!

A total of 1497 Key Events have been parsed.
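
The applicability loops above create a list the first time a predicate is seen and append to it afterwards. The same pattern can be written more compactly with `dict.setdefault`. This is a minimal sketch, not part of the notebook: `add_applicability` is a hypothetical helper and the sample values are invented.

```python
def add_applicability(entry, predicate, values):
    """Append `values` to the list stored under `predicate`, creating the list if needed."""
    entry.setdefault(predicate, []).append(values)
    return entry


# Hypothetical Key Event entry with two sex applicability records
ke_entry = {}
add_applicability(ke_entry, 'pato:0000047', ['High', 'Male'])
add_applicability(ke_entry, 'pato:0000047', ['Moderate', 'Female'])
print(ke_entry)
```

Each record keeps the `[evidence, value]` order used in the cell above, so the writing steps can still read the value at index 1.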


### Key Event Relationships

In [81]:
kerdict = {}
for ker in root.findall(aopxml + 'key-event-relationship'):
    kerdict[ker.get('id')] = {}
    kerdict[ker.get('id')]['dc:identifier'] = 'aop.relationships:' + refs['KER'][ker.get('id')]
    kerdict[ker.get('id')]['rdfs:label'] = '"KER ' + refs['KER'][ker.get('id')] + '"'
    kerdict[ker.get('id')]['foaf:page'] = '<https://identifiers.org/aop.relationships/' + refs['KER'][ker.get('id')] + '>'
    kerdict[ker.get('id')]['dc:source'] = ker.find(aopxml + 'source').text
    kerdict[ker.get('id')]['dcterms:created'] = ker.find(aopxml + 'creation-timestamp').text
    kerdict[ker.get('id')]['dcterms:modified'] = ker.find(aopxml + 'last-modification-timestamp').text
    if ker.find(aopxml + 'description').text is not None:
        kerdict[ker.get('id')]['dc:description'] = '"""' + TAG_RE.sub('', ker.find(aopxml + 'description').text) + '"""'
    for weight in ker.findall(aopxml + 'weight-of-evidence'):
        if weight.find(aopxml + 'biological-plausibility').text is not None:
            kerdict[ker.get('id')]['nci:C80263'] = '"""' + TAG_RE.sub('', weight.find(aopxml + 'biological-plausibility').text) + '"""'
        if weight.find(aopxml + 'emperical-support-linkage').text is not None:
            kerdict[ker.get('id')]['edam:data_2042'] = '"""' + TAG_RE.sub('', weight.find(aopxml + 'emperical-support-linkage').text) + '"""'
        if weight.find(aopxml + 'uncertainties-or-inconsistencies').text is not None:
            kerdict[ker.get('id')]['nci:C71478'] = '"""' + TAG_RE.sub('', weight.find(aopxml + 'uncertainties-or-inconsistencies').text) + '"""'
    kerdict[ker.get('id')]['aopo:has_upstream_key_event'] = {}
    kerdict[ker.get('id')]['aopo:has_upstream_key_event']['id'] = ker.find(aopxml + 'title').find(aopxml + 'upstream-id').text
    kerdict[ker.get('id')]['aopo:has_upstream_key_event']['dc:identifier'] = 'aop.events:' + refs['KE'][ker.find(aopxml + 'title').find(aopxml + 'upstream-id').text]
    kerdict[ker.get('id')]['aopo:has_downstream_key_event'] = {}
    kerdict[ker.get('id')]['aopo:has_downstream_key_event']['id'] = ker.find(aopxml + 'title').find(aopxml + 'downstream-id').text
    kerdict[ker.get('id')]['aopo:has_downstream_key_event']['dc:identifier'] = 'aop.events:' + refs['KE'][ker.find(aopxml + 'title').find(aopxml + 'downstream-id').text]
    for appl in ker.findall(aopxml + 'taxonomic-applicability'):
        for sex in appl.findall(aopxml + 'sex'):
            if 'pato:0000047' not in kerdict[ker.get('id')]:
                kerdict[ker.get('id')]['pato:0000047'] = [[sex.find(aopxml + 'evidence').text, sex.find(aopxml + 'sex').text]]
            else:
                kerdict[ker.get('id')]['pato:0000047'].append([sex.find(aopxml + 'evidence').text, sex.find(aopxml + 'sex').text])
        for life in appl.findall(aopxml + 'life-stage'):
            if 'aopo:LifeStageContext' not in kerdict[ker.get('id')]:
                kerdict[ker.get('id')]['aopo:LifeStageContext'] = [[life.find(aopxml + 'evidence').text, life.find(aopxml + 'life-stage').text]]
            else:
                kerdict[ker.get('id')]['aopo:LifeStageContext'].append([life.find(aopxml + 'evidence').text, life.find(aopxml + 'life-stage').text])
        for tax in appl.findall(aopxml + 'taxonomy'):
            if 'ncbitaxon:131567' not in kerdict[ker.get('id')]:
                if 'dc:identifier' in taxdict[tax.get('taxonomy-id')]:
                    kerdict[ker.get('id')]['ncbitaxon:131567'] = [[tax.get('taxonomy-id'), tax.find(aopxml + 'evidence').text, taxdict[tax.get('taxonomy-id')]['dc:identifier'], taxdict[tax.get('taxonomy-id')]['dc:source'], taxdict[tax.get('taxonomy-id')]['dc:title']]]
            else:
                if 'dc:identifier' in taxdict[tax.get('taxonomy-id')]:
                    kerdict[ker.get('id')]['ncbitaxon:131567'].append([tax.get('taxonomy-id'), tax.find(aopxml + 'evidence').text, taxdict[tax.get('taxonomy-id')]['dc:identifier'], taxdict[tax.get('taxonomy-id')]['dc:source'], taxdict[tax.get('taxonomy-id')]['dc:title']])
print('Done!\n\nA total of ' + str(len(kerdict)) + ' Key Event Relationships have been parsed.')

Done!

A total of 2123 Key Event Relationships have been parsed.
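
Free-text fields such as `dc:description` are stripped of HTML tags with `TAG_RE` and wrapped in triple quotes. The sketch below shows that step in isolation. The extra escaping of backslashes and double quotes is an assumption added here to keep the Turtle literal valid; the cell above does not do this, and `to_long_literal` is a hypothetical helper name.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')  # same pattern as in Step #1


def to_long_literal(text):
    """Strip HTML tags and wrap the text as a Turtle triple-quoted literal."""
    cleaned = TAG_RE.sub('', text)
    # Assumption: escape backslashes first, then double quotes
    cleaned = cleaned.replace('\\', '\\\\').replace('"', '\\"')
    return '"""' + cleaned + '"""'


print(to_long_literal('<p>Binding of the <b>"ligand"</b></p>'))
```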


## <b>Step #4: Writing the AOP-Wiki RDF</b>
This step writes the main RDF file, which contains all information parsed from the AOP-Wiki XML, in Turtle (ttl) syntax.

In [82]:
g = open(filepath + 'AOPWikiRDF.ttl', 'w', encoding='utf-8')

### Writing prefixes
First, the prefixes of all ontologies and database identifiers are written at the top of the document. They are followed by all entities of the AOP-Wiki described in Figure 1.

In [83]:
# Load the prefixes from a CSV file
prefixes = pd.read_csv("prefixes.csv")

# Format the prefixes as RDF-compatible strings
prefix_strings = prefixes.apply(lambda row: f"@prefix {row['prefix']}: <{row['uri']}> .", axis=1)

# Join the strings with newlines
rdf_prefixes = "\n".join(prefix_strings)
print(rdf_prefixes)

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix aop: <https://identifiers.org/aop/> .
@prefix aop.events: <https://identifiers.org/aop.events/> .
@prefix aop.relationships: <https://identifiers.org/aop.relationships/> .
@prefix aop.stressor: <https://identifiers.org/aop.stressor/> .
@prefix aopo: <http://aopkb.org/aop_ontology#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix cas: <https://identifiers.org/cas/> .
@prefix inchikey: <https://identifiers.org/inchikey/> .
@prefix pato: <http://purl.obolibrary.org/obo/PATO_> .
@prefix ncbitaxon: <http://purl.bioontology.org/ontology/NCBITAXON/> .
@prefix cl: <http://purl.obolibrary.org/obo/CL_> .
@prefix uberon: <http://purl.obolibrary.org/obo/UBERON_> .
@prefix go: <http://purl.obolibrary.org/obo/GO_> .
@prefix mi: <http:

In [84]:
g.write(rdf_prefixes + "\n")
#g.write('@prefix dc: <http://purl.org/dc/elements/1.1/> .\n@prefix dcterms: <http://purl.org/dc/terms/> .\n@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n@prefix owl: <http://www.w3.org/2002/07/owl#> .\n@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n@prefix aop: <https://identifiers.org/aop/> .\n@prefix aop.events: <https://identifiers.org/aop.events/> .\n@prefix aop.relationships: <https://identifiers.org/aop.relationships/> .\n@prefix aop.stressor: <https://identifiers.org/aop.stressor/> .\n@prefix aopo: <http://aopkb.org/aop_ontology#> .\n@prefix skos: <http://www.w3.org/2004/02/skos/core#> . \n@prefix cas: <https://identifiers.org/cas/> .\n@prefix inchikey: <https://identifiers.org/inchikey/> .\n@prefix pato: <http://purl.obolibrary.org/obo/PATO_> .\n@prefix ncbitaxon: <http://purl.bioontology.org/ontology/NCBITAXON/> .\n@prefix cl: <http://purl.obolibrary.org/obo/CL_> .\n@prefix uberon: <http://purl.obolibrary.org/obo/UBERON_> .\n@prefix go: <http://purl.obolibrary.org/obo/GO_> .\n@prefix mi: <http://purl.obolibrary.org/obo/MI_> .\n@prefix mp: <http://purl.obolibrary.org/obo/MP_> .\n@prefix mesh: <http://purl.org/commons/record/mesh/> .\n@prefix hp: <http://purl.obolibrary.org/obo/HP_> .\n@prefix pco: <http://purl.obolibrary.org/obo/PCO_> .\n@prefix nbo: <http://purl.obolibrary.org/obo/NBO_> .\n@prefix vt: <http://purl.obolibrary.org/obo/VT_> .\n@prefix pr: <http://purl.obolibrary.org/obo/PR_> .\n@prefix chebio: <http://purl.obolibrary.org/obo/CHEBI_> .\n@prefix fma: <http://purl.org/sig/ont/fma/fma> .\n@prefix cheminf: <http://semanticscience.org/resource/CHEMINF_> .\n@prefix nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#> .\n@prefix comptox: <https://identifiers.org/comptox/> .\n@prefix mmo: <http://purl.obolibrary.org/obo/MMO_> .\n@prefix chebi: <https://identifiers.org/chebi/> .\n@prefix chemspider: <https://identifiers.org/chemspider/> .\n@prefix wikidata: <https://identifiers.org/wikidata/> .\n@prefix chembl.compound: 
#<https://identifiers.org/chembl.compound/> .\n@prefix pubchem.compound: <https://identifiers.org/pubchem.compound/> .\n@prefix drugbank: <https://identifiers.org/drugbank/> .\n@prefix kegg.compound: <https://identifiers.org/kegg.compound/> .\n@prefix lipidmaps: <https://identifiers.org/lipidmaps/> .\n@prefix hmdb: <https://identifiers.org/hmdb/> .\n@prefix ensembl: <https://identifiers.org/ensembl/> .\n@prefix edam: <http://edamontology.org/> .\n@prefix hgnc: <https://identifiers.org/hgnc/>.\n@prefix ncbigene: <https://identifiers.org/ncbigene/>.\n@prefix uniprot: <https://identifiers.org/uniprot/>.\n@prefix rbo: <http://purl.obolibrary.org/obo/RBO_>.\n@prefix ido: <http://purl.obolibrary.org/obo/IDO_>.\n\n')

2644
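
The prefix cell reads a `prefix` and a `uri` column from `prefixes.csv`. As a sketch of that layout, the example below uses only the standard library and two rows copied from the output above. The in-memory file is an assumption for illustration; the real file is not shown here.

```python
import csv
import io

# Two sample rows in the assumed layout of prefixes.csv
sample_csv = io.StringIO(
    "prefix,uri\n"
    "dc,http://purl.org/dc/elements/1.1/\n"
    "aopo,http://aopkb.org/aop_ontology#\n"
)

# Format each row as a Turtle prefix declaration
prefix_lines = [
    f"@prefix {row['prefix']}: <{row['uri']}> ."
    for row in csv.DictReader(sample_csv)
]
print("\n".join(prefix_lines))
```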

### Writing Adverse Outcome Pathway triples

In [85]:
for aop in aopdict:
    g.write(aopdict[aop]['dc:identifier'] + '\n\ta\taopo:AdverseOutcomePathway ;\n\tdc:identifier\t' + aopdict[aop]['dc:identifier'] + ' ;\n\trdfs:label\t' + aopdict[aop]['rdfs:label'] + ' ;\n\trdfs:seeAlso\t' + aopdict[aop]['foaf:page'] + ' ;\n\tfoaf:page\t' + aopdict[aop]['foaf:page'] + ' ;\n\tdc:title\t' + aopdict[aop]['dc:title'] + ' ;\n\tdcterms:alternative\t"' + aopdict[aop]['dcterms:alternative'] + '" ;\n\tdc:source\t"' + aopdict[aop]['dc:source'] + '" ;\n\tdcterms:created\t"' + aopdict[aop]['dcterms:created'] + '" ;\n\tdcterms:modified\t"' + aopdict[aop]['dcterms:modified'] + '"')
    if 'dc:description' in aopdict[aop]:
        if not aopdict[aop]['dc:description'] == []:
            g.write(' ;\n\tdc:description\t' + ','.join(aopdict[aop]['dc:description']))
    if 'nci:C25217' in aopdict[aop]:
        g.write(' ;\n\tnci:C25217\t' + aopdict[aop]['nci:C25217'])
    if 'nci:C48192' in aopdict[aop]:
        g.write(' ;\n\tnci:C48192\t' + aopdict[aop]['nci:C48192'])
    if 'aopo:AopContext' in aopdict[aop]:
        g.write(' ;\n\taopo:AopContext\t' + aopdict[aop]['aopo:AopContext'])
    if 'aopo:has_evidence' in aopdict[aop]:
        g.write(' ;\n\taopo:has_evidence\t' + aopdict[aop]['aopo:has_evidence'])
    if 'edam:operation_3799' in aopdict[aop]:
        g.write(' ;\n\tedam:operation_3799\t' + aopdict[aop]['edam:operation_3799'])
    if 'nci:C25725' in aopdict[aop]:
        g.write(' ;\n\tnci:C25725\t' + aopdict[aop]['nci:C25725'])
    if 'dc:creator' in aopdict[aop]:
        g.write(' ;\n\tdc:creator\t' + aopdict[aop]['dc:creator'])
    if 'dcterms:accessRights' in aopdict[aop]:
        g.write(' ;\n\tdcterms:accessRights\t' + aopdict[aop]['dcterms:accessRights'])
    if 'oecd-status' in aopdict[aop]:
        g.write(' ;\n\tnci:C25688\t' + aopdict[aop]['oecd-status'])
    if 'saaop-status' in aopdict[aop]:
        g.write(' ;\n\tnci:C25688\t' + aopdict[aop]['saaop-status'])
    if 'dcterms:abstract' in aopdict[aop]:
        g.write(' ;\n\tdcterms:abstract\t' + aopdict[aop]['dcterms:abstract'])
    listofthings = []
    for KE in aopdict[aop]['aopo:has_key_event']:
        listofthings.append(aopdict[aop]['aopo:has_key_event'][KE]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\taopo:has_key_event\t' + (','.join(listofthings)))
    listofthings = []
    for KER in aopdict[aop]['aopo:has_key_event_relationship']:
        listofthings.append(aopdict[aop]['aopo:has_key_event_relationship'][KER]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\taopo:has_key_event_relationship\t' + (','.join(listofthings)))
    listofthings = []
    for mie in aopdict[aop]['aopo:has_molecular_initiating_event']:
        listofthings.append(aopdict[aop]['aopo:has_molecular_initiating_event'][mie]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\taopo:has_molecular_initiating_event\t' + (','.join(listofthings)))
    listofthings = []
    for ao in aopdict[aop]['aopo:has_adverse_outcome']:
        listofthings.append(aopdict[aop]['aopo:has_adverse_outcome'][ao]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\taopo:has_adverse_outcome\t' + (','.join(listofthings)))
    listofthings = []
    for stressor in aopdict[aop]['nci:C54571']:
        listofthings.append(aopdict[aop]['nci:C54571'][stressor]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\tnci:C54571\t' + (','.join(listofthings)))
    listofthings = []
    if 'pato:0000047' in aopdict[aop]:
        for sex in aopdict[aop]['pato:0000047']:
            listofthings.append('"' + sex[1] + '"')
        if not listofthings == []:
            g.write(' ;\n\tpato:0000047\t' + (','.join(listofthings)))
    listofthings = []
    if 'aopo:LifeStageContext' in aopdict[aop]:
        for lifestage in aopdict[aop]['aopo:LifeStageContext']:
            listofthings.append('"' + lifestage[1] + '"')
        if not listofthings == []:
            g.write(' ;\n\taopo:LifeStageContext\t' + (','.join(listofthings)))
    g.write(' .\n\n')
print("Done!")

Done!
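
Every entity is written with the same pattern: the subject, then predicate-object pairs separated by ` ;`, multiple objects of one predicate separated by commas, and a closing ` .`. A minimal sketch of that pattern follows, with a hypothetical `turtle_block` helper and invented identifiers; the notebook itself builds these strings inline.

```python
def turtle_block(subject, pairs):
    """Format one subject with (predicate, objects) pairs as a Turtle block.

    `objects` is a list; predicates with an empty list are skipped.
    """
    lines = [
        predicate + '\t' + ','.join(objects)
        for predicate, objects in pairs
        if objects
    ]
    return subject + '\n\t' + ' ;\n\t'.join(lines) + ' .\n\n'


block = turtle_block('aop:1', [
    ('a', ['aopo:AdverseOutcomePathway']),
    ('aopo:has_key_event', ['aop.events:10', 'aop.events:20']),
    ('nci:C54571', []),  # no stressors: the predicate is left out
])
print(block)
```

Skipping empty lists mirrors the `if not listofthings == []` checks, which avoid writing a predicate without an object.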


### Writing Key Event triples
This step also extracts the cell terms and organ terms, which are written to the file in a later step.

In [86]:
cterm = {}
oterm = {}
for ke in kedict:
    g.write(kedict[ke]['dc:identifier'] + '\n\ta\taopo:KeyEvent ;\n\tdc:identifier\t' + kedict[ke]['dc:identifier'] + ' ;\n\trdfs:label\t' + kedict[ke]['rdfs:label'] + ' ;\n\tfoaf:page\t' + kedict[ke]['foaf:page'] + ' ;\n\trdfs:seeAlso\t' + kedict[ke]['foaf:page'] + ' ;\n\tdc:title\t' + kedict[ke]['dc:title'] + ' ;\n\tdcterms:alternative\t"' + kedict[ke]['dcterms:alternative'] + '" ;\n\tdc:source\t"' + kedict[ke]['dc:source'] + '"')
    if 'dc:description' in kedict[ke]:
        g.write(' ;\n\tdc:description\t' + kedict[ke]['dc:description'])
    if 'mmo:0000000' in kedict[ke]:
        g.write(' ;\n\tmmo:0000000\t' + kedict[ke]['mmo:0000000'])
    if 'nci:C25664' in kedict[ke]:
        g.write(' ;\n\tnci:C25664\t' + kedict[ke]['nci:C25664'])
    listofthings = []
    if 'pato:0000047' in kedict[ke]:
        for sex in kedict[ke]['pato:0000047']:
            listofthings.append('"' + sex[1] + '"')
        if not listofthings == []:
            g.write(' ;\n\tpato:0000047\t' + (','.join(listofthings)))
    listofthings = []
    if 'aopo:LifeStageContext' in kedict[ke]:
        for lifestage in kedict[ke]['aopo:LifeStageContext']:
            listofthings.append('"' + lifestage[1] + '"')
        if not listofthings == []:
            g.write(' ;\n\taopo:LifeStageContext\t' + (','.join(listofthings)))
    listofthings = []
    if 'ncbitaxon:131567' in kedict[ke]:
        for taxonomy in kedict[ke]['ncbitaxon:131567']:
            listofthings.append(taxonomy[2])
        if not listofthings == []:
            g.write(' ;\n\tncbitaxon:131567\t' + (','.join(listofthings)))
    listofthings = []
    if 'nci:C54571' in kedict[ke]:
        for stressor in kedict[ke]['nci:C54571']:
            listofthings.append(kedict[ke]['nci:C54571'][stressor]['dc:identifier'])
        if not listofthings == []:
            g.write(' ;\n\tnci:C54571\t' + (','.join(listofthings)))
    if 'aopo:CellTypeContext' in kedict[ke]:
        g.write(' ;\n\taopo:CellTypeContext\t' + kedict[ke]['aopo:CellTypeContext']['dc:identifier'][0])
        if not kedict[ke]['aopo:CellTypeContext']['dc:identifier'][0] in cterm:
            cterm[kedict[ke]['aopo:CellTypeContext']['dc:identifier'][0]] = {}
            cterm[kedict[ke]['aopo:CellTypeContext']['dc:identifier'][0]]['dc:source'] = kedict[ke]['aopo:CellTypeContext']['dc:source']
            cterm[kedict[ke]['aopo:CellTypeContext']['dc:identifier'][0]]['dc:title'] = kedict[ke]['aopo:CellTypeContext']['dc:title']
    if 'aopo:OrganContext' in kedict[ke]:
        g.write(' ;\n\taopo:OrganContext\t' + kedict[ke]['aopo:OrganContext']['dc:identifier'][0])
        if not kedict[ke]['aopo:OrganContext']['dc:identifier'][0] in oterm:
            oterm[kedict[ke]['aopo:OrganContext']['dc:identifier'][0]] = {}
            oterm[kedict[ke]['aopo:OrganContext']['dc:identifier'][0]]['dc:source'] = kedict[ke]['aopo:OrganContext']['dc:source']
            oterm[kedict[ke]['aopo:OrganContext']['dc:identifier'][0]]['dc:title'] = kedict[ke]['aopo:OrganContext']['dc:title']
    if 'biological-event' in kedict[ke]:
        if len(kedict[ke]['biological-event']['go:0008150']) > 0:
            g.write(' ;\n\tgo:0008150\t' + (','.join(kedict[ke]['biological-event']['go:0008150'])))
        if len(kedict[ke]['biological-event']['pato:0000001']) > 0:
            g.write(' ;\n\tpato:0000001\t' + (','.join(kedict[ke]['biological-event']['pato:0000001'])))
        if len(kedict[ke]['biological-event']['pato:0001241']) > 0:
            g.write(' ;\n\tpato:0001241\t' + (','.join(kedict[ke]['biological-event']['pato:0001241'])))
    listofthings = []
    for aop in aopdict:
        if ke in aopdict[aop]['aopo:has_key_event']:
            listofthings.append(aopdict[aop]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\tdcterms:isPartOf\t' + (','.join(listofthings)))
    g.write(' .\n\n')
print("Done!")

Done!
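
While writing Key Events, each cell term and organ term is stored once, keyed by its identifier, so that the term itself can be written later without duplicates. The sketch below shows that bookkeeping with invented, simplified Key Event entries; the names `kedict_sample` and `cterm_sample` are hypothetical.

```python
# Hypothetical, simplified Key Event entries sharing one cell term
kedict_sample = {
    'ke1': {'aopo:CellTypeContext': {'dc:identifier': ['cl:0000182', 'CL:0000182'],
                                     'dc:source': '"CL"', 'dc:title': '"hepatocyte"'}},
    'ke2': {'aopo:CellTypeContext': {'dc:identifier': ['cl:0000182', 'CL:0000182'],
                                     'dc:source': '"CL"', 'dc:title': '"hepatocyte"'}},
    'ke3': {},  # Key Event without a cell term
}

cterm_sample = {}
for ke in kedict_sample.values():
    context = ke.get('aopo:CellTypeContext')
    if context is None:
        continue
    identifier = context['dc:identifier'][0]
    # Store each term only the first time it is seen
    cterm_sample.setdefault(identifier, {
        'dc:source': context['dc:source'],
        'dc:title': context['dc:title'],
    })

print(cterm_sample)
```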


### Writing Key Event Relationship triples

In [87]:
for ker in kerdict:
    g.write(kerdict[ker]['dc:identifier'] + '\n\ta\taopo:KeyEventRelationship ;\n\tdc:identifier\t' + kerdict[ker]['dc:identifier'] + ' ;\n\trdfs:label\t' + kerdict[ker]['rdfs:label'] + ' ;\n\tfoaf:page\t' + kerdict[ker]['foaf:page'] + ' ;\n\trdfs:seeAlso\t' + kerdict[ker]['foaf:page'] + ' ;\n\tdcterms:created\t"' + kerdict[ker]['dcterms:created'] + '" ;\n\tdcterms:modified\t"' + kerdict[ker]['dcterms:modified'] + '" ;\n\taopo:has_upstream_key_event\t' + kerdict[ker]['aopo:has_upstream_key_event']['dc:identifier'] + ' ;\n\taopo:has_downstream_key_event\t' + kerdict[ker]['aopo:has_downstream_key_event']['dc:identifier'])
    if 'dc:description' in kerdict[ker]:
        g.write(' ;\n\tdc:description\t' + kerdict[ker]['dc:description'])
    if 'nci:C80263' in kerdict[ker]:
        g.write(' ;\n\tnci:C80263\t' + kerdict[ker]['nci:C80263'])
    if 'edam:data_2042' in kerdict[ker]:
        g.write(' ;\n\tedam:data_2042\t' + kerdict[ker]['edam:data_2042'].replace("\\", ""))
    if 'nci:C71478' in kerdict[ker]:
        g.write(' ;\n\tnci:C71478\t' + kerdict[ker]['nci:C71478'].replace("\\", ""))
    listofthings = []
    if 'pato:0000047' in kerdict[ker]:
        for sex in kerdict[ker]['pato:0000047']:
            listofthings.append('"' + sex[1] + '"')
        if not listofthings == []:
            g.write(' ;\n\tpato:0000047\t' + (','.join(listofthings)))
    listofthings = []
    if 'aopo:LifeStageContext' in kerdict[ker]:
        for lifestage in kerdict[ker]['aopo:LifeStageContext']:
            listofthings.append('"' + lifestage[1] + '"')
        if not listofthings == []:
            g.write(' ;\n\taopo:LifeStageContext\t' + (','.join(listofthings)))
    listofthings = []
    if 'ncbitaxon:131567' in kerdict[ker]:
        for taxonomy in kerdict[ker]['ncbitaxon:131567']:
            listofthings.append(taxonomy[2])
        if not listofthings == []:
            g.write(' ;\n\tncbitaxon:131567\t' + (','.join(listofthings)))
    listofthings = []
    for aop in aopdict:
        if ker in aopdict[aop]['aopo:has_key_event_relationship']:
            listofthings.append(aopdict[aop]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\tdcterms:isPartOf\t' + (','.join(listofthings)))
    g.write(' .\n\n')
print("Done!")

Done!


### Writing Taxonomy triples

In [88]:
for tax in taxdict:
    if 'dc:identifier' in taxdict[tax]:
        if '"' not in taxdict[tax]['dc:identifier']:
            g.write(taxdict[tax]['dc:identifier'] + '\n\ta\tncbitaxon:131567 ;\n\tdc:identifier\t' + taxdict[tax]['dc:identifier'] + ' ;\n\tdc:title\t"' + taxdict[tax]['dc:title'])
            if taxdict[tax]['dc:source'] is not None:
                g.write('" ;\n\tdc:source\t"' + taxdict[tax]['dc:source'])
            g.write('" .\n\n')
print("Done!")

Done!


### Writing Stressor triples

In [89]:
for stressor in strdict:
    g.write(strdict[stressor]['dc:identifier'] + '\n\ta\tnci:C54571 ;\n\tdc:identifier\t' + strdict[stressor]['dc:identifier'] + ' ;\n\trdfs:label\t' + strdict[stressor]['rdfs:label'] + ' ;\n\tfoaf:page\t' + strdict[stressor]['foaf:page'] + ' ;\n\tdc:title\t' + strdict[stressor]['dc:title'] + ' ;\n\tdcterms:created\t"' + strdict[stressor]['dcterms:created'] + '" ;\n\tdcterms:modified\t"' + strdict[stressor]['dcterms:modified'] + '"')
    if 'dc:description' in strdict[stressor]:
        g.write(' ;\n\tdc:description\t' + strdict[stressor]['dc:description'])
    listofthings = []
    for chem in strdict[stressor]['linktochemical']:
        listofthings.append(chedict[chem]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\taopo:has_chemical_entity\t' + ','.join(listofthings))
    listofthings = []
    for ke in kedict:
        if 'nci:C54571' in kedict[ke]:
            if stressor in kedict[ke]['nci:C54571']:
                listofthings.append(kedict[ke]['dc:identifier'])
    for item in listofthings:
        for ke in kedict:
            if kedict[ke]['dc:identifier'] == item:
                for aop in aopdict:
                    if ke in aopdict[aop]['aopo:has_key_event'] and aopdict[aop]['dc:identifier'] not in listofthings:
                        listofthings.append(aopdict[aop]['dc:identifier'])
    for aop in aopdict:
        if stressor in aopdict[aop]['nci:C54571']:
            if aopdict[aop]['dc:identifier'] not in listofthings:
                listofthings.append(aopdict[aop]['dc:identifier'])
    if not listofthings == []:
        g.write(' ;\n\tdcterms:isPartOf\t' + (','.join(listofthings)))
    g.write(' .\n\n')
print("Done!")

Done!
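
The stressor cell scans all Key Events and AOPs again for every stressor. One possible alternative is to build a stressor-to-Key-Event index once and look each stressor up in it. The sketch below uses invented data and is not used in the notebook; `kes_by_stressor` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical, simplified version of kedict
kedict_sample = {
    'ke1': {'dc:identifier': 'aop.events:1', 'nci:C54571': {'s1': {}, 's2': {}}},
    'ke2': {'dc:identifier': 'aop.events:2', 'nci:C54571': {'s1': {}}},
    'ke3': {'dc:identifier': 'aop.events:3'},  # no stressors linked
}

# Build the index in a single pass over the Key Events
kes_by_stressor = defaultdict(list)
for ke in kedict_sample.values():
    for stressor_id in ke.get('nci:C54571', {}):
        kes_by_stressor[stressor_id].append(ke['dc:identifier'])

print(dict(kes_by_stressor))
```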


### Writing Biological Process triples

In [90]:
for pro in bioprodict:
    # Skip the entry stored under the key None, which has no process ID
    if pro is not None:
        g.write(bioprodict[pro]['dc:identifier'] + '\ta\tgo:0008150 ;\n\tdc:identifier\t' + bioprodict[pro]['dc:identifier'] + ' ;\n\tdc:title\t' + bioprodict[pro]['dc:title'] + ' ;\n\tdc:source\t' + bioprodict[pro]['dc:source'] + ' . \n\n')
print("Done!")

Done!


### Writing Biological Object triples

In [91]:
for obj in bioobjdict:
    if obj is not None and obj != "N/A" and 'TAIR' not in bioobjdict[obj]['dc:identifier']:
        g.write(bioobjdict[obj]['dc:identifier'] + '\ta\tpato:0001241 ;\n\tdc:identifier\t' + bioobjdict[obj]['dc:identifier'] + ' ;\n\tdc:title\t' + bioobjdict[obj]['dc:title'] + ' ;\n\tdc:source\t' + bioobjdict[obj]['dc:source'])
        if bioobjdict[obj]['dc:identifier'] in prodict:
            g.write(' ;\n\tskos:exactMatch\t'+','.join(prodict[bioobjdict[obj]['dc:identifier']]))
        # Put a space before the closing dot so it is never read as part of a prefixed name
        g.write(' .\n\n')
print("Done!")

Done!


### Writing Biological Action triples


In [92]:
for act in bioactdict:
    if act is not None:
        if '"' not in bioactdict[act]['dc:identifier']:
            g.write(bioactdict[act]['dc:identifier'] + '\ta\tpato:0000001 ;\n\tdc:identifier\t' + bioactdict[act]['dc:identifier'] + ' ;\n\tdc:title\t' + bioactdict[act]['dc:title'] + ' ;\n\tdc:source\t' + bioactdict[act]['dc:source'] + ' . \n\n')
print("Done!")

Done!


### Writing Cell term triples

In [93]:
for item in cterm:
    if '"' not in item:
        g.write(item + '\ta\taopo:CellTypeContext ;\n\tdc:identifier\t' + item + ' ;\n\tdc:title\t' + cterm[item]['dc:title'] + ' ;\n\tdc:source\t' + cterm[item]['dc:source'] + ' .\n\n')
print("Done!")

Done!


### Writing Organ term triples

In [94]:
for item in oterm:
    if '"' not in item:
        g.write(item + '\ta\taopo:OrganContext ;\n\tdc:identifier\t' + item + ' ;\n\tdc:title\t' + oterm[item]['dc:title'] + ' ;\n\tdc:source\t' + oterm[item]['dc:source'] + ' .\n\n')
print("Done!")

Done!
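
Terms from a source without a known prefix are stored as quoted strings during parsing. A literal cannot be the subject of a triple in Turtle, which is why the cells above skip every identifier that contains a double quote. A small sketch of that filter with invented identifiers:

```python
# Hypothetical identifiers: two prefixed names and one quoted literal
terms = ['cl:0000182', '"FMA:7197"', 'uberon:0002107']

# Literals cannot be the subject of a triple, so keep only prefixed names
writable = [term for term in terms if '"' not in term]
print(writable)
```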


### Writing Chemical triples

In [95]:
for che in chedict:
    if 'dc:identifier' in chedict[che] and '"' not in chedict[che]['dc:identifier']:
        g.write(chedict[che]['dc:identifier'] + '\n\tdc:identifier\t' + chedict[che]['dc:identifier'])
        if 'cheminf:000446' in chedict[che]:
            g.write(' ;\n\ta\tcheminf:000000, cheminf:000446 ;\n\tcheminf:000446\t' + chedict[che]['cheminf:000446'])
        if not chedict[che]['cheminf:000059'] == 'inchikey:None':
            g.write(' ;\n\tcheminf:000059\t' + chedict[che]['cheminf:000059'])
        if 'dc:title' in chedict[che]:
            g.write(' ;\n\tdc:title\t' + chedict[che]['dc:title'])
        if 'cheminf:000568' in chedict[che]:
            g.write(' ;\n\tcheminf:000568\t' + str(chedict[che]['cheminf:000568']))
        # Collect the mapped identifiers of every database, in a fixed order, for skos:exactMatch
        listofexactmatches = []
        for predicate in ['cheminf:000407', 'cheminf:000405', 'cheminf:000567', 'cheminf:000412', 'cheminf:000140', 'cheminf:000406', 'cheminf:000408', 'cheminf:000409', 'cheminf:000564']:
            if predicate in chedict[che]:
                listofexactmatches.append(','.join(chedict[che][predicate]))
        if listofexactmatches:
            g.write(' ;\n\tskos:exactMatch\t'+','.join(listofexactmatches))
        listofthings = []
        if 'dcterms:alternative' in chedict[che]:
            for alt in chedict[che]['dcterms:alternative']:
                listofthings.append('"' + alt + '"')
            g.write(' ;\n\tdcterms:alternative\t' + ','.join(listofthings))
        listofthings = []
        for stressor in strdict:
            if 'aopo:has_chemical_entity' in strdict[stressor]:
                if che in strdict[stressor]['linktochemical']:
                    listofthings.append(strdict[stressor]['dc:identifier'])
        if not listofthings == []:
            g.write(' ;\n\tdcterms:isPartOf\t' + (','.join(listofthings)))
        g.write(' .\n\n')
print("Done!")

Done!


In [96]:
n = 0

### Writing mapped Chemical identifiers

In [97]:
for cas in listofcas:
    g.write(cas + '\tdc:source\t"CAS".\n\n')
    n += 1
print(n)
for inchikey in listofinchikey:
    g.write(inchikey + '\tdc:source\t"InChIKey".\n\n')
    n += 1
print(n)
    
for comptox in listofcomptox:
    g.write(comptox + '\tdc:source\t"CompTox".\n\n')
    n += 1
print(n)

for chebi in listofchebi:
    g.write(chebi + '\ta\tcheminf:000407 ;\n\tcheminf:000407\t"'+chebi[6:]+'";\n\tdc:identifier\t"'+chebi+'";\n\tdc:source\t"ChEBI".\n\n')
    n += 1
print(n)
for chemspider in listofchemspider:
    g.write(chemspider + '\ta\tcheminf:000405 ;\n\tcheminf:000405\t"'+chemspider[11:]+'";\n\tdc:identifier\t"'+chemspider+'";\n\tdc:source\t"ChemSpider".\n\n')
    n += 1
print(n)
for wd in listofwikidata:
    g.write(wd + '\ta\tcheminf:000567 ;\n\tcheminf:000567\t"'+wd[9:]+'";\n\tdc:identifier\t"'+wd+'";\n\tdc:source\t"Wikidata".\n\n')
    n += 1
print(n)
for chembl in listofchembl:
    g.write(chembl + '\ta\tcheminf:000412 ;\n\tcheminf:000412\t"'+chembl[16:]+'";\n\tdc:identifier\t"'+chembl+'";\n\tdc:source\t"ChEMBL".\n\n')
    n += 1
print(n)
for pubchem in listofpubchem:
    g.write(pubchem + '\ta\tcheminf:000140 ;\n\tcheminf:000140\t"'+pubchem[17:]+'";\n\tdc:identifier\t"'+pubchem+'";\n\tdc:source\t"PubChem".\n\n')
    n += 1
print(n)
for drugbank in listofdrugbank:
    g.write(drugbank + '\ta\tcheminf:000406 ;\n\tcheminf:000406\t"'+drugbank[9:]+'";\n\tdc:identifier\t"'+drugbank+'";\n\tdc:source\t"DrugBank".\n\n')
    n += 1
print(n)
for kegg in listofkegg:
    g.write(kegg + '\ta\tcheminf:000409 ;\n\tcheminf:000409\t"'+kegg[14:]+'";\n\tdc:identifier\t"'+kegg+'";\n\tdc:source\t"KEGG".\n\n')
    n += 1
print(n)
for lipidmaps in listoflipidmaps:
    g.write(lipidmaps + '\ta\tcheminf:000564 ;\n\tcheminf:000564\t"'+lipidmaps[10:]+'";\n\tdc:identifier\t"'+lipidmaps+'";\n\tdc:source\t"LIPID MAPS".\n\n')
    n += 1
print(n)
for hmdb in listofhmdb:
    g.write(hmdb + '\ta\tcheminf:000408 ;\n\tcheminf:000408\t"'+hmdb[5:]+'";\n\tdc:identifier\t"'+hmdb+'";\n\tdc:source\t"HMDB".\n\n')
    n += 1
print(n)
print("Done!")

405
817
1229
1660
2079
2495
2840
3256
3472
3786
3820
4242
Done!
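
Most loops in the cell above follow one pattern: a `cheminf` class, a slice that strips the prefix from the identifier, and a source label. The following is a minimal sketch of that pattern as a single helper. The name `identifier_triples` is hypothetical and not part of the notebook, the helper returns the string instead of writing to `g`, and it assumes every entry has the form `prefix:id`. The ChEBI value is only an illustrative input.

```python
def identifier_triples(curie, cheminf_class, source):
    """Return the Turtle block for one mapped identifier (hypothetical helper)."""
    # Strip everything up to and including the first colon,
    # e.g. 'chebi:15377' -> '15377' (assumes a 'prefix:id' form).
    local_id = curie.split(':', 1)[1]
    return (curie + '\ta\t' + cheminf_class + ' ;\n\t'
            + cheminf_class + '\t"' + local_id + '";\n\t'
            + 'dc:identifier\t"' + curie + '";\n\t'
            + 'dc:source\t"' + source + '".\n\n')


# Illustrative input; in the notebook the values come from lists such as listofchebi.
print(identifier_triples('chebi:15377', 'cheminf:000407', 'ChEBI'))
```

Splitting on the first colon avoids hard-coding a slice length such as `[6:]` or `[16:]` for each database prefix.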


### Writing mapped Gene identifiers

In [98]:
for hgnc in hgnclist:
    g.write(hgnc + '\ta\tedam:data_2298, edam:data_1025 ;\n\tedam:data_2298\t"'+hgnc[5:]+'";\n\tdc:identifier\t"'+hgnc+'";\n\tdc:source\t"HGNC".\n\n')

for entrez in ncbigenelist:
    g.write(entrez + '\ta\tedam:data_1027, edam:data_1025 ;\n\tedam:data_1027\t"'+entrez[9:]+'";\n\tdc:identifier\t"'+entrez+'";\n\tdc:source\t"Entrez Gene".\n\n')

for uniprot in uniprotlist:
    g.write(uniprot + '\ta\tedam:data_2291, edam:data_1025 ;\n\trdfs:seeAlso <http://purl.uniprot.org/uniprot/' + uniprot[8:] + '>;\n\towl:sameAs <http://purl.uniprot.org/uniprot/' + uniprot[8:] + '>;\n\tedam:data_2291\t"'+uniprot[8:]+'";\n\tdc:identifier\t"'+uniprot+'";\n\tdc:source\t"UniProt".\n\n')
    
print("Done!")

Done!


### Writing class labels

In [99]:

df = pd.read_csv(filepath + 'typelabels.txt')
df

Unnamed: 0,URI,label,description
0,<http://rdfs.org/ns/void#Dataset>,Dataset,-
1,<http://edamontology.org/data_1027>,Gene ID (NCBI),An NCBI unique identifier of a gene.
2,<http://edamontology.org/data_1025>,Gene identifier,"An identifier of a gene, such as a name/symbol..."
3,<http://edamontology.org/data_2298>,Gene ID (HGNC),Identifier for a gene approved by the HUGO Gen...
4,<http://purl.org/obo/owl/GO#0008150>,biological_process,A biological process represents a specific obj...
5,<http://aopkb.org/aop_ontology#CellTypeContext>,Cell-term,-
6,<http://aopkb.org/aop_ontology#OrganContext>,Organ-term,-
7,<http://semanticscience.org/resource/CHEMINF_0...,ChEBI identifier,Database identifier used by ChEBI.
8,<http://semanticscience.org/resource/CHEMINF_0...,ChEMBL identifier,Identifier used by the ChEMBL database for com...
9,<http://semanticscience.org/resource/CHEMINF_0...,ChemSpider identifier,Database identifier used by ChemSpider.


In [100]:
for index, row in df.iterrows():
    g.write('\n\n'+row['URI']+'\trdfs:label\t"'+row['label'])
    if row['description'] != '-':
        g.write('";\n\tdc:description\t"""'+row['description']+'""".')
    else:
        g.write('".')


Close the file.

In [101]:
g.close()
print("The AOP-Wiki RDF file is created!")

The AOP-Wiki RDF file is created!


## <b>Step #5: Gene ID text-mapping (HGNC)</b>
In order to identify genes present in the textual descriptions of Key Events (KEs) and Key Event Relationships (KERs), HGNC identifier mapping was performed. [Genenames.org](https://www.genenames.org/) is the curated online repository for HGNC nomenclature, and it allows custom downloads for all HGNC entries, including approved symbols and names, previous symbols and synonyms. 

## Step #5A - Parsing the custom HGNC file
This starts with loading the custom download file, which was named `HGNCgenes.txt` and stored in the path defined in Step #2. Next, its contents are extracted and stored in a dictionary called `genedict1`, which holds the approved symbol, approved name, previous symbols, synonyms, and the remaining columns for every gene. For every entry, variants are then created by wrapping it in each combination of leading and trailing delimiter characters (space, brackets, comma, and full stop). These variants are stored in `genedict2`, which is used for more precise mapping of genes in Step #5B.

In [102]:
HGNCfilename = 'HGNCgenes.txt'

In [103]:
fileStatsObj = os.stat(filepath + HGNCfilename)
HGNCmodificationTime = time.ctime(fileStatsObj[stat.ST_MTIME])
print("Last Modified Time : ", HGNCmodificationTime)

Last Modified Time :  Tue Aug 20 08:51:09 2024


In [104]:
HGNCgenes = open(filepath + HGNCfilename, 'r')
symbols = [' ','(',')','[',']',',','.']
genedict1 = {}
genedict2 = {}
b = 0
for line in HGNCgenes:
    if not 'HGNC ID	Approved symbol	Approved name	Previous symbols	Synonyms	Accession numbers	Ensembl ID(supplied by Ensembl)'in line:
        a = line[:-1].split('\t')
        if not '@' in a[1]: #gene clusters contain a '@' in their symbol, which are filtered out
            genedict1[a[1]] = []
            genedict2[a[1]] = []
            genedict1[a[1]].append(a[1])
            if not a[2] == '':
                genedict1[a[1]].append(a[2])
            for item in a[3:]:
                if not item == '':
                    for name in item.split(', '):
                        genedict1[a[1]].append(name)
            for item in genedict1[a[1]]:
                for s1 in symbols:
                    for s2 in symbols:
                        genedict2[a[1]].append((s1+item+s2))
HGNCgenes.close()
print("A total of " + str(len(genedict2)) + " genes are included for mappings")

A total of 41893 genes are included for mappings
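
The variant generation of Step #5A can be sketched in isolation. The helper name `make_variants` is hypothetical, and `TP53` is only an illustrative input. `itertools.product` yields the delimiter pairs in the same order as the nested `s1`/`s2` loops above.

```python
from itertools import product

SYMBOLS = [' ', '(', ')', '[', ']', ',', '.']


def make_variants(name, symbols=SYMBOLS):
    """Wrap a gene name in every (leading, trailing) delimiter pair."""
    return [s1 + name + s2 for s1, s2 in product(symbols, repeat=2)]


variants = make_variants('TP53')
print(len(variants))    # 7 x 7 = 49 variants per name
print(variants[:3])     # [' TP53 ', ' TP53(', ' TP53)']
```

Because the approved symbol is the first entry stored for each gene, the second variant in `genedict2` is always the symbol wrapped in `' '` and `'('`, so stripping its first and last character returns the symbol itself.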


## Step #5B - HGNC identifier mapping
Genes are mapped for descriptions of KEs and KERs, and for the biological plausibility and empirical support sections of KERs. First, these texts are screened for any overlap with all possible gene symbols and names captured in `genedict1`. Then, all positive matches are checked against the delimiter-wrapped variants of those genes in `genedict2`, which filters out hits that are only substrings of longer words. All matches are stored in the `kedict` and `kerdict` dictionaries. Also, all mapped genes are stored in a list called `hgnclist`.
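
The two-stage screening can be sketched on toy data as follows. This is a simplified illustration, not the notebook's exact code: `map_genes` is a hypothetical helper, the two-gene dictionaries are hand-made, and the example sentence is invented.

```python
def map_genes(text, genedict1, genedict2):
    """Return 'hgnc:<symbol>' for every gene with a delimited variant in text."""
    hits = []
    for symbol in genedict2:
        # Stage 1: cheap pre-screen on the raw names and symbols.
        if not any(name in text for name in genedict1[symbol]):
            continue
        # Stage 2: confirm with a delimiter-wrapped variant, so that
        # substrings of longer words (e.g. 'AR' in 'PARP') are rejected.
        if any(variant in text for variant in genedict2[symbol]):
            hits.append('hgnc:' + symbol)
    return hits


symbols = [' ', '(', ')', '[', ']', ',', '.']
# Toy dictionaries, built the same way as in Step #5A.
genedict1 = {'AR': ['AR', 'androgen receptor'],
             'ESR1': ['ESR1', 'estrogen receptor 1']}
genedict2 = {key: [s1 + name + s2 for name in names for s1 in symbols for s2 in symbols]
             for key, names in genedict1.items()}

text = 'Binding to the androgen receptor (AR) is reduced in PARP assays.'
print(map_genes(text, genedict1, genedict2))    # ['hgnc:AR']
```

The second stage is what keeps `PARP` on its own from counting as a hit for `AR`: the pre-screen matches the raw substring, but no delimiter-wrapped variant occurs in the text.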

### Key Events

In [105]:
hgnclist = []
keyhitcount = {}
print("Gene mapping on Key Events can take a minute...")
for ke in root.findall(aopxml + 'key-event'):
    geneoverlapdict = {}
    if ke.find(aopxml + 'description').text is not None:
        geneoverlapdict[ke.get('id')] = []
        for key in genedict2:
            a = 0
            for item in genedict1[key]:
                if item in kedict[ke.get('id')]['dc:description']:
                    a = 1
            if a == 1:
                for item in genedict2[key]:
                    if item in kedict[ke.get('id')]['dc:description'] and not 'hgnc:' + genedict2[key][1][1:-1] in geneoverlapdict[ke.get('id')]:
                        geneoverlapdict[ke.get('id')].append('hgnc:' + genedict2[key][1][1:-1])
                        if 'hgnc:' + genedict2[key][1][1:-1] not in hgnclist:
                            hgnclist.append('hgnc:' + genedict2[key][1][1:-1])
                        if item in keyhitcount:
                            keyhitcount[item] += 1
                        else:
                            keyhitcount[item] = 1
                            
        if not geneoverlapdict[ke.get('id')]:
            del geneoverlapdict[ke.get('id')]
    if ke.get('id') in geneoverlapdict:
        kedict[ke.get('id')]['edam:data_1025'] = geneoverlapdict[ke.get('id')]
print("Done!\nIn total, " + str(len(hgnclist))+ " genes were mapped to descriptions of Key Events")

Gene mapping on Key Events can take a minute...
Done!
In total, 943 genes were mapped to descriptions of Key Events


In [106]:
for gene, count in keyhitcount.items():
    if count > 10:
        print(f"{gene}: {count} hits")

 ROS : 12 hits
 II : 41 hits
 B : 37 hits
(ALP): 24 hits
(ROS): 14 hits
 T3 : 11 hits
 TH : 18 hits
 G2 : 15 hits
 T : 35 hits
 AR : 12 hits
 E2 : 30 hits


### Key Event Relationships

In [107]:
print("Gene mapping on Key Event Relationships can take a couple of minutes...")
for ker in root.findall(aopxml + 'key-event-relationship'):
    geneoverlapdict = {}
    geneoverlapdict[ker.get('id')] = []
    if ker.find(aopxml + 'description').text is not None:
        for key in genedict2:
            a = 0
            for item in genedict1[key]:
                if item in kerdict[ker.get('id')]['dc:description']:
                    a = 1
            if a == 1:
                for item in genedict2[key]:
                    if item in kerdict[ker.get('id')]['dc:description'] and not 'hgnc:' + genedict2[key][1][1:-1] in geneoverlapdict[ker.get('id')]:
                        geneoverlapdict[ker.get('id')].append('hgnc:' + genedict2[key][1][1:-1])
                        if 'hgnc:' + genedict2[key][1][1:-1] not in hgnclist:
                            hgnclist.append('hgnc:' + genedict2[key][1][1:-1])
    for weight in ker.findall(aopxml + 'weight-of-evidence'):
        if weight.find(aopxml + 'biological-plausibility').text is not None:
            for key in genedict2:
                a = 0
                for item in genedict1[key]:
                    if item in kerdict[ker.get('id')]['nci:C80263']:
                        a = 1
                if a== 1:
                    for item in genedict2[key]:
                        if item in kerdict[ker.get('id')]['nci:C80263'] and not 'hgnc:' + genedict2[key][1][1:-1] in geneoverlapdict[ker.get('id')]:
                            geneoverlapdict[ker.get('id')].append('hgnc:' + genedict2[key][1][1:-1])
                            if 'hgnc:' + genedict2[key][1][1:-1] not in hgnclist:
                                hgnclist.append('hgnc:' + genedict2[key][1][1:-1])
        if weight.find(aopxml + 'emperical-support-linkage').text is not None:
            for key in genedict2:
                a = 0
                for item in genedict1[key]:
                    if item in kerdict[ker.get('id')]['edam:data_2042']:
                        a = 1
                if a== 1:
                    for item in genedict2[key]:
                        if item in kerdict[ker.get('id')]['edam:data_2042'] and not 'hgnc:' + genedict2[key][1][1:-1] in geneoverlapdict[ker.get('id')]:
                            geneoverlapdict[ker.get('id')].append('hgnc:' + genedict2[key][1][1:-1])
                            if 'hgnc:' + genedict2[key][1][1:-1] not in hgnclist:
                                hgnclist.append('hgnc:' + genedict2[key][1][1:-1])
    if not geneoverlapdict[ker.get('id')]:
        del geneoverlapdict[ker.get('id')]
    if ker.get('id') in geneoverlapdict:
        kerdict[ker.get('id')]['edam:data_1025'] = geneoverlapdict[ker.get('id')]
print("Done!\nIn total, " + str(len(hgnclist))+ " genes were mapped to descriptions of Key Events and Key Event Relationships")

Gene mapping on Key Event Relationships can take a couple of minutes...
Done!
In total, 1549 genes were mapped to descriptions of Key Events and Key Event Relationships


## Step #5C - Identifier mapping for other databases
BridgeDb was used to map additional identifiers from other databases, including Entrez Gene, Ensembl, and UniProt IDs. For every mapped gene, a request call is sent with its HGNC symbol, and the returned identifiers are stored in the dictionary called `geneiddict`. The BridgeDb service URL has already been defined in Step #3.

In [108]:
geneiddict = {}
listofentrez = []
listofensembl = []
listofuniprot = []

for gene in hgnclist:
    a = requests.get(bridgedb + 'xrefs/H/' + gene[5:]).text.split('\n')
    dictionaryforgene = {}
    if 'html' not in a:
        for item in a:
            b = item.split('\t')
            if len(b) == 2:
                if b[1] not in dictionaryforgene:
                    dictionaryforgene[b[1]] = []
                    dictionaryforgene[b[1]].append(b[0])
                else:
                    dictionaryforgene[b[1]].append(b[0])
    geneiddict[gene] = []
    if 'Entrez Gene' in dictionaryforgene:
        for entrez in dictionaryforgene['Entrez Gene']:
            if 'ncbigene:'+entrez not in listofentrez:
                listofentrez.append("ncbigene:"+entrez)
            geneiddict[gene].append("ncbigene:"+entrez)
    if 'Ensembl' in dictionaryforgene:
        for ensembl in dictionaryforgene['Ensembl']:
            if 'ensembl:' + ensembl not in listofensembl:
                listofensembl.append("ensembl:"+ensembl)
            geneiddict[gene].append("ensembl:"+ensembl)
    if 'Uniprot-TrEMBL' in dictionaryforgene:
        for uniprot in dictionaryforgene['Uniprot-TrEMBL']:
            if 'uniprot:'+uniprot not in listofuniprot:
                listofuniprot.append("uniprot:"+uniprot)
            geneiddict[gene].append("uniprot:"+uniprot)
print("Gene identifiers mapped:\n" + str(len(listofentrez)) + " Entrez gene IDs\n" + str(len(listofuniprot)) + " Uniprot IDs\n" + str(len(listofensembl)) + " Ensembl IDs")

Gene identifiers mapped:
1523 Entrez gene IDs
7599 Uniprot IDs
1521 Ensembl IDs
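
The parsing of the response can be sketched without a network call. The cell above treats every line as an identifier and a data source name separated by a tab. The helper `parse_xrefs` is hypothetical, and the sample text is hand-written for illustration rather than copied from the service.

```python
def parse_xrefs(response_text):
    """Group identifiers by data source from a tab-separated xrefs response."""
    by_source = {}
    for line in response_text.split('\n'):
        fields = line.split('\t')
        if len(fields) == 2:    # skip blank or malformed lines
            identifier, source = fields
            by_source.setdefault(source, []).append(identifier)
    return by_source


# Hand-written sample in the 'identifier<TAB>data source' layout.
sample = '7157\tEntrez Gene\nENSG00000141510\tEnsembl\nP04637\tUniprot-TrEMBL\n'
xrefs = parse_xrefs(sample)
print(['ncbigene:' + entrez for entrez in xrefs.get('Entrez Gene', [])])    # ['ncbigene:7157']
```

Using `dict.get` with an empty list as default gives the same result as the `if 'Entrez Gene' in dictionaryforgene` checks when a data source is missing from the response.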


## Step #5D - Writing output file
The final step involves the writing of the RDF file in Turtle syntax. After writing the prefixes used for predicates and identifier types, all gene mapping links stored in the kedict and kerdict are written, followed by the HGNC IDs and matched IDs for other databases.

In [109]:
g = open(filepath + 'AOPWikiRDF-Genes.ttl', 'w', encoding='utf-8')

### Writing prefixes

In [110]:
g.write('@prefix dc: <http://purl.org/dc/elements/1.1/> .\n@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n@prefix aop.events: <https://identifiers.org/aop.events/> .\n@prefix aop.relationships: <https://identifiers.org/aop.relationships/> .\n@prefix skos: <http://www.w3.org/2004/02/skos/core#> . \n@prefix ensembl: <https://identifiers.org/ensembl/> .\n@prefix edam: <http://edamontology.org/> .\n@prefix hgnc: <https://identifiers.org/hgnc/>.\n@prefix ncbigene: <https://identifiers.org/ncbigene/>.\n@prefix uniprot: <https://identifiers.org/uniprot/>.\n@prefix owl: <http://www.w3.org/2002/07/owl#>.\n\n')

595

### Writing Key Event triples
These triples only contain the mappings with genes.

In [111]:
n = 0
for ke in kedict:
    if 'edam:data_1025' in kedict[ke]:
        n += 1
        g.write(kedict[ke]['dc:identifier']+'\tedam:data_1025\t' + ','.join(kedict[ke]['edam:data_1025'])+' .\n\n')
print("Done!\n\nNumber of Key Events with genes mapped to their descriptions: " + str(n))

Done!

Number of Key Events with genes mapped to their descriptions: 381


### Writing Key Event Relationship triples
These triples only contain the mappings with genes.

In [112]:
n = 0
for ker in kerdict:
    if 'edam:data_1025' in kerdict[ker]:
        n += 1
        g.write(kerdict[ker]['dc:identifier']+'\tedam:data_1025\t' + ','.join(kerdict[ker]['edam:data_1025'])+' .\n\n')
print("Done!\n\nNumber of Key Event Relationships with genes mapped to their descriptions: " + str(n))

Done!

Number of Key Event Relationships with genes mapped to their descriptions: 591
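
Both cells above write one statement per subject, with all mapped genes as a comma-separated object list. A minimal sketch of that line format follows. The helper `gene_link_triple` is hypothetical, and the subject identifier is an assumed example value, not taken from the data.

```python
def gene_link_triple(subject, genes):
    """Return one Turtle statement linking a KE or KER to its mapped genes."""
    # A comma separates objects that share the same subject and predicate.
    return subject + '\tedam:data_1025\t' + ','.join(genes) + ' .\n\n'


# 'aop.events:123' is an assumed example identifier.
print(gene_link_triple('aop.events:123', ['hgnc:AR', 'hgnc:ESR1']))
```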


### Writing Gene identifier triples

In [113]:
for hgnc in hgnclist:
    g.write(hgnc + '\ta\tedam:data_2298, edam:data_1025 ;\n\tedam:data_2298\t"'+hgnc[5:]+'";\n\tdc:identifier\t"'+hgnc+'";\n\tdc:source\t"HGNC"')
    if not geneiddict[hgnc] == []:
        g.write(' ;\n\tskos:exactMatch\t'+','.join(geneiddict[hgnc]))
    g.write('.\n\n')
print(str(len(hgnclist))+" HGNC triples written.")
for entrez in listofentrez:
    g.write(entrez + '\ta\tedam:data_1027, edam:data_1025 ;\n\tedam:data_1027\t"'+entrez[9:]+'";\n\tdc:identifier\t"'+entrez+'";\n\tdc:source\t"Entrez Gene".\n\n')
print(str(len(listofentrez))+" Entrez gene triples written.")
for ensembl in listofensembl:
    g.write(ensembl + '\ta\tedam:data_1033, edam:data_1025 ;\n\tedam:data_1033\t"'+ensembl[8:]+'";\n\tdc:identifier\t"'+ensembl+'";\n\tdc:source\t"Ensembl".\n\n')
print(str(len(listofensembl))+ " Ensembl triples written.")
for uniprot in listofuniprot:
    g.write(uniprot + '\ta\tedam:data_2291, edam:data_1025 ;\n\tedam:data_2291\t"'+uniprot[8:]+'";\n\tdc:identifier\t"'+uniprot+'";\n\tdc:source\t"UniProt".\n\n')
print(str(len(listofuniprot))+ " UniProt triples written.\nDone!")

1549 HGNC triples written.
1523 Entrez gene triples written.
1521 Ensembl triples written.
7599 UniProt triples written.
Done!


Close the file.

In [114]:
g.close()
print("The AOP-Wiki RDF Genes file is created!")

The AOP-Wiki RDF Genes file is created!


## <b>Step #6: Creating the VoID file</b>
The last file contains the metadata of the original data, script, and tools used for the creation of the AOP-Wiki RDF files. 

In [115]:
a = requests.get(bridgedb + 'properties').text.split('\n')
info = {}
for item in a:
    if not item.split('\t')[0] in info:
        info[item.split('\t')[0]] = []
    if len(item.split('\t')) == 2:
        info[item.split('\t')[0]].append(item.split('\t')[1])
print('The version of the BridgeDb mapping files: \n Gene/Proteins: '
      + str(info['DATASOURCENAME'][0]) + ':' + str(info['DATASOURCEVERSION'][0]) + '\n Chemicals: '
      + str(info['DATASOURCENAME'][5]) + ':' + str(info['DATASOURCEVERSION'][5]))

The version of the BridgeDb mapping files: 
 Gene/Proteins: Ensembl:108
 Chemicals: HMDB-CHEBI-WIKIDATA:HMDB5.0.20211102-CHEBI211-WIKIDATA20220707


In [116]:
x = datetime.datetime.now()
print('The date: ' + str(x))
y = str(x)
y = y[:10]

The date: 2025-02-11 14:46:20.225426
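
The cell above keeps the first ten characters of `str(x)`, which is the `YYYY-MM-DD` part needed for an `xsd:date` literal. An equivalent sketch using `date().isoformat()` is shown below. The helper name `xsd_date` is hypothetical, and the timestamp is the one printed above.

```python
import datetime


def xsd_date(moment):
    """Return the YYYY-MM-DD part of a datetime for an xsd:date literal."""
    return moment.date().isoformat()


stamp = datetime.datetime(2025, 2, 11, 14, 46, 20, 225426)
print(xsd_date(stamp))                 # 2025-02-11
print('"' + xsd_date(stamp) + '"^^xsd:date')
```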


In [117]:
g = open(filepath + 'AOPWikiRDF-Void.ttl', 'w', encoding='utf-8')
g.write('@prefix : <https://aopwiki.rdf.bigcat-bioinformatics.org/> .\n@prefix dc: <http://purl.org/dc/elements/1.1/> .\n@prefix dcterms: <http://purl.org/dc/terms/> .\n@prefix void:  <http://rdfs.org/ns/void#> .\n@prefix pav:   <http://purl.org/pav/> .\n@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .\n@prefix dcat:  <http://www.w3.org/ns/dcat#> .\n@prefix foaf:  <http://xmlns.com/foaf/0.1/> .\n@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n@prefix freq:  <http://purl.org/cld/freq/> .')
g.write('\n:AOPWikiRDF.ttl\ta\tvoid:Dataset ;\n\tdc:description\t"AOP-Wiki RDF data from the AOP-Wiki database" ;\n\tpav:createdOn\t"' + y + '"^^xsd:date;\n\tdcterms:modified\t"' + y +'"^^xsd:date ;\n\tpav:createdWith\t"' + str(aopwikixmlfilename) + '", :Promapping ;\n\tpav:createdBy\t<https://zenodo.org/badge/latestdoi/146466058> ;\n\tfoaf:homepage\t<https://aopwiki.org> ;\n\tdcterms:accrualPeriodicity  freq:quarterly ;\n\tdcat:downloadURL\t<https://aopwiki.org/downloads/' + str(aopwikixmlfilename) + '> .\n\n:AOPWikiRDF-Genes.ttl\ta\tvoid:Dataset ;\n\tdc:description\t"AOP-Wiki RDF extension with gene mappings based on approved names and symbols" ;\n\tpav:createdOn\t"' + str(x) + '" ;\n\tpav:createdWith\t"' + str(aopwikixmlfilename) + '", :HGNCgenes ;\n\tpav:createdBy\t<https://zenodo.org/badge/latestdoi/146466058> ;\n\tdcterms:accrualPeriodicity  freq:quarterly ;\n\tfoaf:homepage\t<https://aopwiki.org> ;\n\tdcat:downloadURL\t<https://aopwiki.org/downloads/' + str(aopwikixmlfilename) + '>, <https://www.genenames.org/download/custom/> . \n\n:HGNCgenes.txt\ta\tvoid:Dataset, void:Linkset ;\n\tdc:description\t"HGNC approved symbols and names for genes" ;\n\tdcat:downloadURL\t<https://www.genenames.org/download/custom/> ;\n\tpav:importedOn\t"'+HGNCmodificationTime+'" .\n\n<https://proconsortium.org/download/current/promapping.txt>\ta\tvoid:Dataset, void:Linkset;\n\tdc:description\t"PRotein ontology mappings to protein database identifiers";\n\tdcat:downloadURL\t<https://proconsortium.org/download/current/promapping.txt>;\n\tpav:importedOn\t"'+PromodificationTime+'".')
g.close()
print("The VoID file is created!")

The VoID file is created!
