### Generation of predictions using the Toolbox
- Created by: Louis Groff
- PIs: Imran Shah and Grace Patlewicz
- Last modified: 4 March 2024
- Changes made: Additional note on implementation added from the SI

Most of the effort required to run the Toolbox API goes into the initial setup of the local CLI instance of the Toolbox server, which runs outside the Python environment. Once the server is running, the main inputs are the user-specified port number (port_number) that the Toolbox communicates over, a QSAR-Ready SMILES for the parent chemical obtained as described previously, and the numerical index of the “GUID” identifier hash (model_GUID) for the desired metabolic simulator in the list of available simulators returned by the Toolbox. “GUIDs” are viewable within the Swagger UI for the Web API tool via:
http://localhost:<port_number>/api/v6/metabolism/
where the In Vivo Rat Simulator and In Vitro Rat Liver S9 models sit at indexes 8 and 15 of the returned list of simulators, respectively. The in.hcd_smiles is URL-encoded (smiles_url) with the urllib.parse.quote_plus function from Python's “urllib” package. An example metabolism API call structure is:
http://localhost:<port_number>/api/v6/metabolism/<model_GUID>?smiles=<smiles_url>
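As a minimal sketch of how such a call URL can be assembled, the snippet below URL-encodes a SMILES with `urllib.parse.quote_plus` and appends it to the metabolism endpoint. The helper name `build_metabolism_url` and the placeholder GUID are illustrative assumptions, not part of the Toolbox API.

```python
import urllib.parse

def build_metabolism_url(port_number, model_guid, smiles):
    """Assemble a metabolism call URL from a port, simulator GUID, and raw SMILES.

    Hypothetical helper for illustration only.
    """
    smiles_url = urllib.parse.quote_plus(smiles)  # e.g. '=' -> '%3D', '(' -> '%28'
    return ('http://localhost:' + str(port_number) + '/api/v6/metabolism/'
            + model_guid + '?smiles=' + smiles_url)

# '<model_GUID>' is a placeholder; real GUIDs come from the simulator list.
example_url = build_metabolism_url(16384, '<model_GUID>', 'CC1=C(N)C=C(N)C=C1')
print(example_url)
```

Encoding matters because characters such as `=`, `(`, and `)` are common in SMILES and are not safe inside a URL query string.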
No further parameter tuning is available, and simulators run on TIMES defaults, except that the types of transformations are limited to phase I metabolism. The functions developed within this study take these inputs and perform the API calls needed to query the Toolbox for metabolites with the given simulator number and SMILES; the call returns the list of metabolite SMILES for that chemical if metabolites exist. The Toolbox does not provide a way to track the generation of its output phase I metabolites. If a given metabolism query to the Toolbox API does not yield metabolites, the Toolbox is further queried via its “search” functions, using the parent chemical in:casrn to retrieve all available chemical entries for that in:casrn. The input casrn is stripped of hyphens in the API URL call (cas_nohyphen). The search parameters are also set to ignore stereochemical information (the True/False parameter at the end of the URL call). An example of the URL structure to perform a search on a casrn is given below:
http://localhost:<port_number>/api/v6/search/cas/<cas_nohyphen>/true
These entries are filtered to remove mixtures: any result whose “SubstanceType” parameter in the output of the API call does not equal “MonoConstituent” is discarded. Metabolism queries are then performed sequentially on each of the Chemical Identifier hash strings (ChemID) associated with mono-constituent chemical entries until the SMILES associated with one of the entries returns metabolite SMILES. In this case, the URL structure differs slightly from the SMILES-based API call above and instead uses the ChemID returned from the CASRN search:
http://localhost:<port_number>/api/v6/metabolism/<model_GUID>/<ChemID>
If none of the available mono-constituent chemical entries in the Toolbox API database yield metabolites, an empty output schema is stored accordingly to reflect this result.
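The fallback described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the study's implementation: `fetch_json` is a hypothetical callable standing in for the HTTP request, and the "ChemId" key name is assumed from the Toolbox output described here.

```python
def metabolites_via_cas_search(port_number, model_guid, casrn, fetch_json):
    """Search on CASRN, keep mono-constituent entries, and try each ChemId in turn.

    fetch_json is a hypothetical callable: URL string -> parsed JSON.
    Returns (ChemId, metabolite SMILES list), or (None, []) if nothing succeeds.
    """
    base = 'http://localhost:' + str(port_number) + '/api/v6/'
    cas_nohyphen = casrn.replace('-', '')
    # The 'cas/' path segment follows the search_cas lambda used in this notebook.
    entries = fetch_json(base + 'search/cas/' + cas_nohyphen + '/true')
    # Discard mixtures and anything else that is not a single substance.
    mono = [e for e in entries if e.get('SubstanceType') == 'MonoConstituent']
    for entry in mono:
        chem_id = entry['ChemId']
        metabolites = fetch_json(base + 'metabolism/' + model_guid + '/' + chem_id)
        if metabolites:
            return chem_id, metabolites
    return None, []
```

Passing the request function in as an argument keeps the control flow testable without a running Toolbox server.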

# 1. metsim_hcd_out():
     Wrapped within the Toolbox WebAPI or BioTransformer calling functions. (Does not require VPN)
     input: SMILES string (required), optional are the DTXSID (required for CCD searching), CASRN, and Chemical Names.
     action: queries the EPA Hazard Comparison Dashboard (HCD) for JSON output data for a given chemical. If output data exists, relevant chemical/structural identifiers are supplemented to existing metsim outputs.
     output: If they exist - DTXSID, Canonical SMILES (hcd_smiles key), CASRN, and InChIKey (via HCD, or RDKit if not in HCD).
# 2. metsim_metadata_full():
    Used to save time between running BioTransformer and gathering metabolite metadata. BioTransformer suffers more from the "combinatorial explosion" of metabolites it produces, which is computationally taxing. We therefore run BioTransformer first, save the SMILES and generational tracking in our dictionary format as a list of dictionaries, and later use metsim_metadata_full to fill in the metadata for this list of dictionaries.
# 3. toolbox_metsim_api():
    Runs a metabolism simulation based on available data in the TB database, using QSAR-ready SMILES as input. If the metabolism simulation fails, we note which molecules fail and pass them to toolbox_metsim_api_search to find alternate chemical IDs (ChemId in Toolbox outputs) and reattempt the simulation with alternate QSAR-ready SMILES from the Toolbox. On occasion, the default record that comes up from a SMILES metabolism simulation doesn't have metabolites associated, but other records for the same chemical will.
    input: Toolbox API port number running on your local machine within command prompt (required), simulator number (0-17, metsim_url_base subfunction will give list of GUIDs corresponding to which metabolism simulator is desired. 15 = In Vitro Rat Liver S9 Phase I model from TIMES, required), QSAR-Ready SMILES string (required), CASRN (optional), DTXSID (optional), chemical name (optional), index (for multiprocessing to keep track of sequential order of input data while parallel processing).
    action: Simulates rat liver metabolism (Assumed default from TIMES as 3 cycles of Phase I, thresholded at 5 metabolites/cycle or 0.1 Transformation probability, uncertain if true)
    output: tuple of index parameter and standardized dictionary of precursor and successors for each chemical (not generationally tracked).
# 4. toolbox_metsim_api_search():
    input: same as toolbox_metsim_api()
    action: Searches QSAR Toolbox Database for alternate records for the same chemical, and attempts metabolism simulation for all qsar-ready smiles records for a given chemical either until one succeeds, or all fail, and the standardized output dictionary is updated accordingly.
    output: same as toolbox_metsim_api()
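To make the "standardized dictionary of precursor and successors" concrete, the sketch below walks one such record and lists its precursor/metabolite SMILES pairs. The key names follow the dictionaries built in the functions in this notebook; the helper `precursor_metabolite_pairs` and the sample record are illustrative assumptions.

```python
def precursor_metabolite_pairs(record):
    """Return (precursor SMILES, metabolite SMILES) pairs from one metsim record."""
    pairs = []
    for step in record.get('output', []):
        parent = step['precursor']['smiles']
        for successor in step['successors']:
            pairs.append((parent, successor['metabolite']['smiles']))
    return pairs

# Made-up record for illustration; not real Toolbox output.
sample_record = {
    'input': {'smiles': 'CCO'},
    'output': [{'precursor': {'smiles': 'CCO'},
                'successors': [{'enzyme': None, 'mechanism': None,
                                'metabolite': {'smiles': 'CC=O'}},
                               {'enzyme': None, 'mechanism': None,
                                'metabolite': {'smiles': 'CC(O)=O'}}]}],
}
print(precursor_metabolite_pairs(sample_record))
```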

In [1]:
import os
import subprocess
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.request, urllib.parse, json
import time
from rdkit import Chem
import datetime
import multiprocess as mp

ModuleNotFoundError: No module named 'multiprocess'

In [2]:
#Function 1: HCD Queries
def metsim_hcd_out(smiles = None, 
                   casrn = None,
                   dtxsid = None,
                   chem_name = None,
                   likely = None):
    """
    Query function for the Cheminformatics Modules Standardizer API, formerly wrapped within the Hazard Comparison Dashboard (HCD) API. 
    Used to convert an input SMILES string into QSAR-Ready SMILES. Returns InChIKey structural identifier as well,
    along with any other chemical identifier metadata if available, and not already given as inputs (e.g., CASRN, DTXSID, Chemical Name).
    
    If SMILES is not known, but DTXSID is known, can instead query on DTXSID to obtain Daylight SMILES from the Comptox Chemicals Dashboard API (CCD API),
    and subsequently query the Standardizer API using the SMILES obtained from the CCD API.
    
    Required Inputs:
    smiles: Daylight SMILES string
    or
    dtxsid: DSSTox Substance Identifier
    
    Optional Inputs:
    chem_name: Chemical name, whether trade name or IUPAC
    casrn: Chemical Abstracts Services Registry Number
    likely: If MetSim predictions are obtained from the Chemical Transformation Simulator, can optionally keep the transformation "likelihood" parameter
    
    Returns:
    out_dict: Output dictionary containing all available output data for the given chemical, using the input parameter names as dictionary keys.
    Includes "hcd_smiles" as output dictionary key containing QSAR-Ready version of the input SMILES.
    
    Examples:
    
    SMILES given as sole input:
    input:    
    test_dict = metsim_hcd_out(smiles = "OCCOCCO")
    
    output:
    Attempting query of Cheminformatics Modules Standardizer with SMILES: OCCOCCO...
    Query succeeded.
    test_dict
    {'smiles': 'OCCOCCO',
     'casrn': '111-46-6',
     'hcd_smiles': 'OCCOCCO',
     'inchikey': 'MTHSVFCYNBDYFN-UHFFFAOYNA-N',
     'dtxsid': 'DTXSID8020462',
     'chem_name': 'Diethylene glycol',
     'likelihood': None}
    
    DTXSID given as sole input:
    
    input:    
    test_dict = metsim_hcd_out(dtxsid = "DTXSID4020402")
    
    output:
    Attempting query of Comptox Chemicals Dashboard with DTXSID: DTXSID4020402...
    Query succeeded.
    No SMILES given. Using CCD output SMILES.
    Attempting query of Cheminformatics Modules Standardizer with SMILES: CC1=C(N)C=C(N)C=C1...
    Query succeeded.
    test_dict
    {'smiles': 'CC1=C(N)C=C(N)C=C1',
     'casrn': '95-80-7',
     'hcd_smiles': 'CC1C=CC(N)=CC=1N',
     'inchikey': 'VOZKAJLKRJDJLL-UHFFFAOYNA-N',
     'dtxsid': 'DTXSID4020402',
     'chem_name': '2,4-Diaminotoluene',
     'likelihood': None}
     
     Empty inputs:
     input:
     test_dict = metsim_hcd_out(smiles = None, dtxsid = None)
     
     output:
     test_dict
     {'smiles': None,
      'casrn': None,
      'hcd_smiles': None,
      'inchikey': None,
      'dtxsid': None,
      'chem_name': None,
      'likelihood': None}   
    """
    
    ccd_out = []
    if dtxsid != None and smiles == None:
        #get metadata from Comptox Chemicals Dashboard for a given DTXSID (No structure searching atm).
        ccd_url = 'https://comptox.epa.gov/dashboard-api/ccdapp2/chemical-detail/search/by-dsstoxsid?id='+dtxsid
        ccd_success = 0
        try_count = 0
        while ccd_success == 0 and try_count < 3:
            try:
                print('Attempting query of Comptox Chemicals Dashboard with DTXSID: '+dtxsid+'...')
                ccd_out = json.loads(urllib.request.urlopen(ccd_url).read().decode())
                #Only report success when the request and JSON parsing actually succeeded.
                print('Query succeeded.')
                ccd_success = 1
                try_count+=1
            except Exception:
                #This occasionally fails randomly due to timeout errors,
                #but then works again later, so try again after a 0.5 second pause.
                #Should work on second attempt.
                print('URL Error Occurred, reattempting CCD query in 0.5 seconds.')
                time.sleep(0.5)
                try_count+=1
    if smiles != None or len(ccd_out) > 0:
        if smiles != None:
            smiles_url = urllib.parse.quote_plus(smiles) #URL encode SMILES string.
        elif len(ccd_out) > 0 and ccd_out['smiles'] != None:
            print('No SMILES given. Using CCD output SMILES.')
            smiles = ccd_out['smiles']
            smiles_url = urllib.parse.quote_plus(smiles) #URL encode CCD SMILES string.
        else:
            print('No SMILES given, and no SMILES available from CCD output.')
            smiles_url = None
        base_url = "https://hcd.rtpnc.epa.gov/api/stdizer?workflow=qsar-ready&smiles=" #Production environment (current, no VPN needed)
        # base_url = "https://hazard-dev.sciencedataexperts.com/api/stdizer?workflow=qsar-ready&smiles=" #Dev environment (no VPN needed)
        # base_url = "https://hazard.sciencedataexperts.com/api/stdizer?workflow=qsar-ready&smiles=" #Production environment (VPN needed)
        if smiles_url != None:
            hcd_url = base_url+smiles_url
            hcd_success = 0
            try_count = 0
            hcd_out = []
            while hcd_success == 0 and try_count < 3:
                try:
                    print('Attempting query of Cheminformatics Modules Standardizer with SMILES: '+smiles+'...')
                    time.sleep(0.5)
                    hcd_out = json.loads(urllib.request.urlopen(hcd_url).read().decode())
                    print('Query succeeded.')
                    hcd_success = 1
                    try_count+=1
                except Exception:
                    #This occasionally fails randomly due to timeout errors,
                    #but then works again later, so try again after a 0.5 second pause.
                    #Should work on second attempt.
                    print('URL Error Occurred, reattempting Cheminformatics Modules query in 0.5 seconds.')
                    time.sleep(0.5)
                    try_count+=1
            if len(hcd_out) > 0:
                out_dict = {'smiles': smiles, 
                            'casrn': casrn,
                            'hcd_smiles': hcd_out[0]['smiles'],
                            'inchikey': hcd_out[0]['inchiKey'],
                            'dtxsid': dtxsid,
                            'chem_name': chem_name,
                            'likelihood': likely}
                if out_dict['dtxsid'] == None:
                    if 'DTXSID' in hcd_out[0]['id']:
                        out_dict['dtxsid'] = hcd_out[0]['id']
                    elif len(ccd_out) > 0 and ccd_out['dsstoxSubstanceId'] != None:
                        out_dict['dtxsid'] = ccd_out['dsstoxSubstanceId']
                if out_dict['casrn'] == None:
                    if 'casrn' in hcd_out[0].keys():
                        out_dict['casrn'] = hcd_out[0]['casrn']
                    elif len(ccd_out) > 0 and ccd_out['casrn'] != None:
                        out_dict['casrn'] = ccd_out['casrn']
                if out_dict['chem_name'] == None:
                    if len(ccd_out) > 0 and ccd_out['preferredName'] != None:
                        out_dict['chem_name'] = ccd_out['preferredName']
                    elif 'name' in hcd_out[0].keys():
                        out_dict['chem_name'] = hcd_out[0]['name']
                if out_dict['inchikey'] == None and len(ccd_out) > 0:
                    if ccd_out['inchiKey'] != None:
                        out_dict['inchikey'] = ccd_out['inchiKey'] 
            else:
                out_dict = {'smiles': smiles,
                            'casrn': casrn,
                            'hcd_smiles': None,
                            'inchikey': None,
                            'dtxsid': dtxsid,
                            'chem_name': chem_name,
                            'likelihood': likely
                           }
                #HCD Returns empty list. Try to supplement with metadata from RDKit.
                try:
                    smiles_mol = Chem.MolFromSmiles(smiles)
                    out_dict['inchikey'] = Chem.inchi.MolToInchiKey(smiles_mol)
                except:
                    #Rarely, BioTransformer makes a bad SMILES string for a metabolite, and RDKit can't convert it to an InChIKey. Store None
                    print('RDKit failed to generate an inchikey for SMILES: '+smiles)
                    out_dict['inchikey'] = None
        else:
            out_dict = {'smiles': smiles,
                        'casrn': casrn,
                        'hcd_smiles': None,
                        'inchikey': None,
                        'dtxsid': dtxsid,
                        'chem_name': chem_name,
                        'likelihood': likely
                       }
    else:
        out_dict = {'smiles': smiles,
                    'casrn': casrn,
                    'hcd_smiles': None,
                    'inchikey': None,
                    'dtxsid': dtxsid,
                    'chem_name': chem_name,
                    'likelihood': likely
                   }
    return out_dict
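
The two query loops in metsim_hcd_out share the same bounded-retry pattern. A minimal sketch of that pattern as a standalone helper is shown below; the name `query_with_retries` and its parameters are hypothetical and are not used elsewhere in this notebook.

```python
import time

def query_with_retries(query, max_tries=3, pause=0.5):
    """Call query() up to max_tries times, pausing between failed attempts.

    Returns the first successful result, or an empty list if every attempt fails,
    mirroring the empty-list fallback used in metsim_hcd_out.
    """
    for attempt in range(max_tries):
        try:
            return query()
        except Exception:
            if attempt < max_tries - 1:
                time.sleep(pause)  # brief pause before the next attempt
    return []
```

For example, `query_with_retries(lambda: json.loads(urllib.request.urlopen(hcd_url).read().decode()))` could stand in for the inner while loop of the Standardizer query.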

In [3]:
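def metsim_metadata_full(metsim_out = [], fnam = None, metsim_cache = None):
    """
    Fill in chemical identifier metadata (via metsim_hcd_out) for every input chemical,
    precursor, and successor metabolite in a metsim dataset.

    metsim_out: list of metsim output dictionaries (each with 'input' and 'output' keys)
    fnam: optional JSON file name; if given, metsim_out is saved after each processed input chemical
    metsim_cache: optional list of previously queried metadata dictionaries, matched on 'smiles'

    Returns metsim_out with metadata filled in.
    """
    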
    if len(metsim_out) > 0:
        if metsim_cache != None:
            #Supplement metadata via serial HCD query through individual input chemicals, precursors, successors/metabolites for a full metsim dataset
            for i in range(len(metsim_out)): # i = number of input chemicals
                if metsim_out[i]['input']['inchikey'] != None:
                    continue
                if metsim_out[i]['input']['smiles'] not in [cache_item['smiles'] for cache_item in metsim_cache]:
                    metsim_out[i]['input'] = metsim_hcd_out(smiles = metsim_out[i]['input']['smiles'],
                                                            casrn = metsim_out[i]['input']['casrn'],
                                                            dtxsid = metsim_out[i]['input']['dtxsid'],
                                                            chem_name = metsim_out[i]['input']['chem_name'])
                    metsim_cache.append(metsim_out[i]['input'])
                    print('Input query added to metadata cache...')
                else:
                    print('Input SMILES found in cached results. Inserting into dictionary...')
                    metsim_out[i]['input'] = metsim_cache[[idx for idx in range(len(metsim_cache)) if metsim_cache[idx]['smiles'] == metsim_out[i]['input']['smiles']][0]]
                for j in range(len(metsim_out[i]['output'])): # j = number of unique precursors
                    if 'likelihood' in list(metsim_out[i]['output'][j]['precursor'].keys()):
                        if metsim_out[i]['output'][j]['precursor']['smiles'] not in [cache_item['smiles'] for cache_item in metsim_cache]:
                            metsim_out[i]['output'][j]['precursor'] = metsim_hcd_out(smiles = metsim_out[i]['output'][j]['precursor']['smiles'],
                                                                                     casrn = metsim_out[i]['output'][j]['precursor']['casrn'],
                                                                                     dtxsid = metsim_out[i]['output'][j]['precursor']['dtxsid'],
                                                                                     chem_name = metsim_out[i]['output'][j]['precursor']['chem_name'],
                                                                                     likely = metsim_out[i]['output'][j]['precursor']['likelihood'])
                            metsim_cache.append(metsim_out[i]['output'][j]['precursor'])
                            print('Precursor query added to metadata cache...')
                        else:
                            print('Precursor SMILES found in cached results. Inserting into dictionary...')
                            metsim_out[i]['output'][j]['precursor'] = metsim_cache[[idx for idx in range(len(metsim_cache)) if metsim_cache[idx]['smiles'] == metsim_out[i]['output'][j]['precursor']['smiles']][0]]
                    else:
                        if metsim_out[i]['output'][j]['precursor']['smiles'] not in [cache_item['smiles'] for cache_item in metsim_cache]:
                            metsim_out[i]['output'][j]['precursor'] = metsim_hcd_out(smiles = metsim_out[i]['output'][j]['precursor']['smiles'],
                                                                                     casrn = metsim_out[i]['output'][j]['precursor']['casrn'],
                                                                                     dtxsid = metsim_out[i]['output'][j]['precursor']['dtxsid'],
                                                                                     chem_name = metsim_out[i]['output'][j]['precursor']['chem_name'])
                            metsim_cache.append(metsim_out[i]['output'][j]['precursor'])
                            print('Precursor query added to metadata cache...')
                        else:
                            print('Precursor SMILES found in cached results. Inserting into dictionary...')
                            metsim_out[i]['output'][j]['precursor'] = metsim_cache[[idx for idx in range(len(metsim_cache)) if metsim_cache[idx]['smiles'] == metsim_out[i]['output'][j]['precursor']['smiles']][0]]
                    for k in range(len(metsim_out[i]['output'][j]['successors'])): # k = number of metabolites per precursor
                        if 'likelihood' in list(metsim_out[i]['output'][j]['successors'][k]['metabolite'].keys()):
                            if metsim_out[i]['output'][j]['successors'][k]['metabolite']['smiles'] not in [cache_item['smiles'] for cache_item in metsim_cache]:
                                metsim_out[i]['output'][j]['successors'][k]['metabolite'] = metsim_hcd_out(smiles = metsim_out[i]['output'][j]['successors'][k]['metabolite']['smiles'],
                                                                                                           casrn = metsim_out[i]['output'][j]['successors'][k]['metabolite']['casrn'],
                                                                                                           dtxsid = metsim_out[i]['output'][j]['successors'][k]['metabolite']['dtxsid'],
                                                                                                           chem_name = metsim_out[i]['output'][j]['successors'][k]['metabolite']['chem_name'],
                                                                                                           likely = metsim_out[i]['output'][j]['successors'][k]['metabolite']['likelihood'])
                                metsim_cache.append(metsim_out[i]['output'][j]['successors'][k]['metabolite'])
                                print('Successor metabolite query added to metadata cache...')
                            else:
                                print('Successor metabolite SMILES found in cached results. Inserting into dictionary...')
                                metsim_out[i]['output'][j]['successors'][k]['metabolite'] = metsim_cache[[idx for idx in range(len(metsim_cache)) if metsim_cache[idx]['smiles'] == metsim_out[i]['output'][j]['successors'][k]['metabolite']['smiles']][0]] 
                        else:
                            if metsim_out[i]['output'][j]['successors'][k]['metabolite']['smiles'] not in [cache_item['smiles'] for cache_item in metsim_cache]:
                                metsim_out[i]['output'][j]['successors'][k]['metabolite'] = metsim_hcd_out(smiles = metsim_out[i]['output'][j]['successors'][k]['metabolite']['smiles'],
                                                                                                           casrn = metsim_out[i]['output'][j]['successors'][k]['metabolite']['casrn'],
                                                                                                           dtxsid = metsim_out[i]['output'][j]['successors'][k]['metabolite']['dtxsid'],
                                                                                                           chem_name = metsim_out[i]['output'][j]['successors'][k]['metabolite']['chem_name'])
                                metsim_cache.append(metsim_out[i]['output'][j]['successors'][k]['metabolite'])
                                print('Successor metabolite query added to metadata cache...')
                            else:
                                print('Successor metabolite SMILES found in cached results. Inserting into dictionary...')
                                metsim_out[i]['output'][j]['successors'][k]['metabolite'] = metsim_cache[[idx for idx in range(len(metsim_cache)) if metsim_cache[idx]['smiles'] == metsim_out[i]['output'][j]['successors'][k]['metabolite']['smiles']][0]] 
                        print('input: '+str(i+1)+'/'+str(len(metsim_out))+' precursor: '+str(j+1)+'/'+str(len(metsim_out[i]['output']))+' metabolite: '+str(k+1)+'/'+str(len(metsim_out[i]['output'][j]['successors'])))
                if fnam != None:
                    #Use a context manager so the file handle is closed after each save.
                    with open(fnam, 'w') as f:
                        json.dump(metsim_out, f)
        else:
            return metsim_metadata_full(metsim_out = metsim_out, fnam = fnam, metsim_cache = [])
    else:
        raise ValueError('Please supply a metsim dataset (list of dictionaries)')
    # print(metsim_out)
    return metsim_out
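
The cache lookups above scan the whole list for every SMILES. As a hedged alternative sketch (not the implementation used here), a dictionary keyed on SMILES gives the same reuse with constant-time lookups. `lookup` is a hypothetical stand-in for metsim_hcd_out so that the example runs without network access.

```python
def cached_metadata(smiles, cache, lookup):
    """Return metadata for smiles, calling lookup(smiles) only on a cache miss.

    cache: dict mapping SMILES -> metadata dict (mutated in place).
    lookup: hypothetical callable standing in for metsim_hcd_out.
    """
    if smiles not in cache:
        cache[smiles] = lookup(smiles)
    return cache[smiles]

# Count how many times the stand-in lookup is actually invoked.
calls = []
def fake_lookup(smiles):
    calls.append(smiles)
    return {'smiles': smiles, 'inchikey': None}

cache = {}
for s in ['OCCOCCO', 'CCO', 'OCCOCCO']:
    cached_metadata(s, cache, fake_lookup)
print(len(calls))  # 2: the repeated SMILES is served from the cache
```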

In [4]:
#Example:
test_dict = metsim_hcd_out(smiles = 'OCCOCCO')
test_dict

Attempting query of Cheminformatics Modules Standardizer with SMILES: OCCOCCO...
Query succeeded.


{'smiles': 'OCCOCCO',
 'casrn': '111-46-6',
 'hcd_smiles': 'OCCOCCO',
 'inchikey': 'MTHSVFCYNBDYFN-UHFFFAOYNA-N',
 'dtxsid': 'DTXSID8020462',
 'chem_name': 'Diethylene glycol',
 'likelihood': None}

In [5]:
#Function 3: OECD Toolbox Rat Liver S9 WebAPI Metsim.
def toolbox_metsim_api(tb_port = 16384,
                       simulator_num = 15,
                       smiles = None,
                       casrn = None,
                       dtxsid = None,
                       chem_name = None,
                       idx = None):
    import datetime
    import time
    import pandas as pd
    import urllib.request, urllib.parse, json
    
    metsim_url_base = 'http://localhost:'+str(tb_port)+'/API/v6/Metabolism/'
    with urllib.request.urlopen(metsim_url_base) as url:
        oecd_metsim_guids = json.loads(url.read().decode())
    #store GUID of metabolism simulator
    guid = oecd_metsim_guids[simulator_num]['Guid']
    #Store base metsim info in output dictionary:
    oecd_dict = {'datetime': str(datetime.datetime.now().strftime('%Y-%m-%d_%Hh%Mm%Ss')),
                 'software': 'OECD QSAR Toolbox WebAPI',
                 'version': 6,
                 'params':{'depth': 3,
                           'organism': 'Rat',
                           'site_of_metabolism': False,
                           'model': [oecd_metsim_guids[simulator_num]['Caption']]
                          }
                }

    #make lambda function to perform metsim in API:
    metsim_exec = lambda url: json.loads(urllib.request.urlopen(url).read().decode())
    #Lambda function to search with url-encoded SMILES:
    search_base = 'http://localhost:'+str(tb_port)+'/api/v6/Search/'
    search_smiles = lambda smiles: json.loads(urllib.request.urlopen(search_base+'smiles/false/true?smiles='+smiles).read().decode())
    #Lambda function to search on casrn if qsar_ready_smiles = nan:
    search_cas = lambda casrn: json.loads(urllib.request.urlopen(search_base+'cas/'+casrn+'/true').read().decode())
    stereo_filter = ['@','/','\\','.']
    oecd_dict['input'] = {'smiles': smiles,
                          'inchikey': None,
                          'casrn': casrn,
                          'hcd_smiles': None,
                          'dtxsid': dtxsid,
                          'chem_name': chem_name
                         }
    oecd_metab_list = []
    if pd.notna(smiles):
        #url encode SMILES
        smiles_encoded = urllib.parse.quote_plus(smiles)
        #Try metsim with base SMILES call first before doing anything more complicated than that:
        if pd.notna(smiles_encoded):
            print('Attempting metsim from SMILES input for index #'+str(idx)+'...')
            metsim_url = metsim_url_base+guid+'?smiles='+smiles_encoded  
            oecd_metab_list = metsim_exec(metsim_url) #store metabolite list for the input precursor
        if len(oecd_metab_list) > 0:
            oecd_dict['output'] = []
            oecd_dict['output'].append({'precursor': oecd_dict['input'],
                                        'successors': [{'enzyme': None,
                                                        'mechanism': None,
                                                        'metabolite': {'smiles': oecd_metab_list[j],
                                                                       'inchikey': None,
                                                                       'casrn': None,
                                                                       'hcd_smiles': None,
                                                                       'dtxsid': None,
                                                                       'chem_name': None
                                                                      },
                                                       } for j in range(len(oecd_metab_list))]
                                      })
            print('metsim succeeded for index #'+str(idx))
        else:
            oecd_dict = idx
    elif pd.notna(casrn) and ('NOCAS' not in casrn): #'and' short-circuits, so a missing casrn is never tested with 'in'
        oecd_dict = idx
    else:
        #This dictionary returns if no SMILES or CASRN are given:
        oecd_dict['input'] = {'smiles': None,
                              'inchikey': None,
                              'casrn': None,
                              'hcd_smiles': None,
                              'dtxsid': None,
                              'chem_name': None
                             }
        oecd_dict['output'] = [{'precursor': {'smiles': None,
                                              'inchikey': None,
                                              'casrn': None,
                                              'hcd_smiles': None,
                                              'dtxsid': None,
                                              'chem_name': None
                                             },
                                'successors': [{'enzyme': [],
                                                'mechanism': None,
                                                'metabolite': {'smiles': None,
                                                               'inchikey': None,
                                                               'casrn': None,
                                                               'hcd_smiles': None,
                                                               'dtxsid': None,
                                                               'chem_name': None
                                                               }
                                              }]
                              }]
        print('metsim failed for index #'+str(idx)+'. Neither SMILES nor CASRN were provided.')
    return (idx, oecd_dict)
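
Because toolbox_metsim_api returns `(idx, idx)` when no metabolites were obtained from the SMILES query (or only a CASRN was supplied) and `(idx, dictionary)` otherwise, the results of a batch can be separated into completed records and indices to retry. The sketch below shows one way to do this; `split_metsim_results` is a hypothetical helper, and passing the failed indices on to toolbox_metsim_api_search is left to the caller.

```python
def split_metsim_results(results):
    """Split (idx, result) tuples into completed records and indices to retry.

    A dict result is a completed record; anything else (the bare idx) marks
    a chemical that still needs the search-based fallback.
    """
    done, retry = {}, []
    for idx, result in results:
        if isinstance(result, dict):
            done[idx] = result
        else:
            retry.append(idx)
    return done, sorted(retry)

# Made-up batch output, in the arbitrary order parallel workers might return it.
batch = [(2, 2), (0, {'output': []}), (1, 1)]
done, retry = split_metsim_results(batch)
print(retry)  # [1, 2]
```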

In [6]:
def toolbox_metsim_api_search(tb_port = 16384,
                              simulator_num = 15,
                              smiles = None,
                              casrn = None,
                              dtxsid = None,
                              chem_name = None,
                              idx = None):
    #Version of toolbox_metsim_api that does not do the SMILES metsim query, but searches for ChemIds on SMILES and CASRN to do metsim.
    #Run serially to circumvent TB search server crash issues until rectified by LMC updates.
    import datetime
    import time
    import pandas as pd
    import urllib.request, urllib.parse, json
    
    metsim_url_base = 'http://localhost:'+str(tb_port)+'/API/v6/Metabolism/'
    with urllib.request.urlopen(metsim_url_base) as url:
        oecd_metsim_guids = json.loads(url.read().decode())
    #store GUID of metabolism simulator
    guid = oecd_metsim_guids[simulator_num]['Guid']
    #Store base metsim info in output dictionary:
    oecd_dict = {'datetime': str(datetime.datetime.now().strftime('%Y-%m-%d_%Hh%Mm%Ss')),
                 'software': 'OECD QSAR Toolbox WebAPI',
                 'version': 6,
                 'params':{'depth': 3,
                           'organism': 'Rat',
                           'site_of_metabolism': False,
                           'model': [oecd_metsim_guids[simulator_num]['Caption']]
                          }
                }

    #make lambda function to perform metsim in API:
    metsim_exec = lambda url: json.loads(urllib.request.urlopen(url).read().decode())
    #Lambda function to search with url-encoded SMILES:
    search_base = 'http://localhost:'+str(tb_port)+'/api/v6/search/'
    search_smiles = lambda smiles: json.loads(urllib.request.urlopen(search_base+'smiles/false/true?smiles='+smiles).read().decode())
    #Lambda function to search on casrn if qsar_ready_smiles = nan:
    search_cas = lambda casrn: json.loads(urllib.request.urlopen(search_base+'cas/'+casrn+'/true').read().decode())
    stereo_filter = ['@','/','\\','.'] #comment if using daylight smiles with stereochemistry, uncomment for qsar-ready smiles
    oecd_dict['input'] = {'smiles': smiles,
                          'inchikey': None,
                          'casrn': casrn,
                          'hcd_smiles': None,
                          'dtxsid': dtxsid,
                          'chem_name': chem_name
                         }
    oecd_metab_list = []
    smiles_encoded = None
    if pd.notna(smiles):
        #url encode SMILES
        smiles_encoded = urllib.parse.quote_plus(smiles)
    if pd.notna(casrn):
        print('Base SMILES query for index #'+str(idx)+' yields no metabolites. Searching for alternate ChemIds...')
        #casrn without hyphens:
        if 'NOCAS' not in casrn:
            cas_nohyphen = ''.join(casrn.split('-'))
        chem_entries = []
        #Query database on both smiles and casrn within try statements.
        #Necessary because some chemicals trigger HTTP errors:
        if pd.notna(smiles_encoded):
            try: 
                chem_entries = search_smiles(smiles_encoded)
            except Exception:
                print('Bad http request for index #'+str(idx)+' SMILES')
                chem_entries = []
        if len(chem_entries) > 0:
            try:
                if 'NOCAS' not in casrn:
                    chem_cas = search_cas(cas_nohyphen)
                    chem_entries = chem_entries+chem_cas
            except Exception:
                print('Bad http request for index #'+str(idx)+' CASRN')
        else:
            try:
                if 'NOCAS' not in casrn:
                    chem_entries = search_cas(cas_nohyphen)
            except Exception:
                print('Bad http request for index #'+str(idx)+' CASRN')
                chem_entries = []
        if len(chem_entries) > 0:
            #remove chem name lists from dicts so that duplicate dicts can be removed via set comprehension
            for entry in chem_entries:
                entry.pop('Names', None)
            chem_entries = [dict(t) for t in {tuple(entry.items()) for entry in chem_entries}] #remove duplicate entries
            #Filter out mixtures, and substances with no casrn:
            chem_mono = [entry for entry in chem_entries if (entry['SubstanceType'] == 'MonoConstituent' and entry['Cas'] != 0)]
            #Store ChemID so long as smiles has no stereo information:
            chem_Id = [entry['ChemId'] for entry in chem_mono if not any(token in entry['Smiles'] for token in stereo_filter)] #uncomment if using qsar-ready smiles
            # chem_Id = [entry['ChemId'] for entry in chem_mono] #uncomment if using daylight smiles with stereochemistry
            if len(chem_Id) > 0:
                print('Alternate ChemId corresponding to a QSAR-Ready SMILES found for index #'+str(idx)+'. Attempting metsim...')
                #Construct URL from base call for metsims + GUID of the simulator + ChemID.
                #Entries that survive the filters above most likely share the same SMILES
                #even when their ChemIds differ, and so should yield the same metabolite list.
                #Therefore only the first ChemId is used.
                metsim_url = metsim_url_base+guid+'/'+chem_Id[0]
                oecd_metab_list = metsim_exec(metsim_url)
            elif len(chem_mono) > 0:
                print('Metsim failed. Alternate monoconstituent ChemIds found. Attempting metsim for alternate ChemId 1/'+str(len(chem_mono))+' for index #'+str(idx)+'...')
                #If the only monoconstituent db entries for our search carry stereo SMILES,
                #run the first such ChemId anyway: it is more likely to yield metabolites than skipping the query.
                chem_Id = chem_mono[0]['ChemId']
                metsim_url = metsim_url_base+guid+'/'+chem_Id
                oecd_metab_list = metsim_exec(metsim_url)
            if len(oecd_metab_list) > 0:
                oecd_dict['output'] = []
                oecd_dict['output'].append({'precursor': oecd_dict['input'],
                                            'successors': [{'enzyme': None,
                                                            'mechanism': None,
                                                            'metabolite': {'smiles': oecd_metab_list[j],
                                                                           'inchikey': None,
                                                                           'casrn': None,
                                                                           'hcd_smiles': None,
                                                                           'dtxsid': None,
                                                                           'chem_name': None
                                                                          },
                                                           } for j in range(len(oecd_metab_list))]
                                          })
                print('Metsim succeeded for index #'+str(idx))
            elif len(chem_mono) > 1:
                print('Metsim failed. Attempting metsim from alternate monoconstituent ChemIds for index #'+str(idx)+'...')
                #If the first ChemId yields no metabolites and other monoconstituent ChemIds exist,
                #try each in turn until one yields a non-empty metabolite list.
                for l in range(1,len(chem_mono)):
                    oecd_metab_list = metsim_exec(metsim_url_base+guid+'/'+chem_mono[l]['ChemId'])
                    if len(oecd_metab_list) > 0:
                        oecd_dict['output'] = []
                        oecd_dict['output'].append({'precursor': oecd_dict['input'],
                                                    'successors': [{'enzyme': None,
                                                                    'mechanism': None,
                                                                    'metabolite': {'smiles': oecd_metab_list[j],
                                                                                   'inchikey': None,
                                                                                   'casrn': None,
                                                                                   'hcd_smiles': None,
                                                                                   'dtxsid': None,
                                                                                   'chem_name': None
                                                                                  },
                                                                   } for j in range(len(oecd_metab_list))]
                                                  })
                        print('Metsim succeeded for index #'+str(idx))
                        break
                    else:
                        #No available ChemIDs yield a metabolite list, inspect list manually later.
                        oecd_dict['output'] = [{'precursor': oecd_dict['input'],
                                                'successors': [{'enzyme': [],
                                                                'mechanism': None,
                                                                'metabolite': {'smiles': None,
                                                                               'inchikey': None,
                                                                               'casrn': None,
                                                                               'hcd_smiles': None,
                                                                               'dtxsid': None,
                                                                               'chem_name': None
                                                                              }
                                                              }]
                                              }]
                        print('Metsim failed for index #'+str(idx)+'. Alternate ChemId '+str(l+1)+'/'+str(len(chem_mono))+' yielded no metabolites.')
            else:
                oecd_dict['output'] = [{'precursor': oecd_dict['input'],
                                        'successors': [{'enzyme': [],
                                                        'mechanism': None,
                                                        'metabolite': {'smiles': None,
                                                                       'inchikey': None,
                                                                       'casrn': None,
                                                                       'hcd_smiles': None,
                                                                       'dtxsid': None,
                                                                       'chem_name': None
                                                                      }
                                                      }]
                                      }]
                print('Metsim failed for index #'+str(idx)+'. No alternate ChemIds found.')            
        else:
            oecd_dict['output'] = [{'precursor': oecd_dict['input'],
                                    'successors': [{'enzyme': [],
                                                    'mechanism': None,
                                                    'metabolite': {'smiles': None,
                                                                   'inchikey': None,
                                                                   'casrn': None,
                                                                   'hcd_smiles': None,
                                                                   'dtxsid': None,
                                                                   'chem_name': None
                                                                  }
                                                  }]
                                  }]
            print('Metsim failed for index #'+str(idx)+'. Neither SMILES nor CASRN yields valid ChemIds.')
    return (idx, oecd_dict)
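
The de-duplication and filtering step inside the search function can be easier to follow in isolation. Below is a minimal sketch, not the notebook's own implementation: `select_chem_ids` is a hypothetical helper name, and the entry keys (`ChemId`, `Cas`, `Smiles`, `SubstanceType`, `Names`) are assumed to match the search results handled above. Unlike the set-based approach, it keeps first-seen order and does not mutate the input entries.

```python
def select_chem_ids(chem_entries, stereo_tokens=('@', '/', '\\', '.')):
    """Return (stereo_free_chem_ids, monoconstituent_entries) from raw search hits.

    Hypothetical helper; assumes every entry has string 'Smiles' and 'ChemId' values.
    """
    seen = set()
    unique = []
    for entry in chem_entries:
        # Drop the unhashable 'Names' list on a copy, leaving the caller's dict intact.
        trimmed = {k: v for k, v in entry.items() if k != 'Names'}
        key = tuple(sorted(trimmed.items()))
        if key not in seen:
            seen.add(key)
            unique.append(trimmed)
    # Keep single-constituent substances that have a CAS number.
    mono = [e for e in unique
            if e['SubstanceType'] == 'MonoConstituent' and e['Cas'] != 0]
    # Keep ChemIds whose SMILES carries no stereo or multi-fragment tokens.
    stereo_free = [e['ChemId'] for e in mono
                   if not any(tok in e['Smiles'] for tok in stereo_tokens)]
    return stereo_free, mono
```

If `stereo_free` is empty but `mono` is not, the caller can fall back to `mono[0]['ChemId']`, mirroring the branch order used in the function above.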

In [14]:
def tb_metsim_api_search_logkow(casrn = None, tb_port = None, idx = None):
    """ 
    Search the OECD Toolbox database via its WebAPI for the chemical IDs of an input chemical, then 
    return its octanol-water partition coefficient as a measure of hydrophobicity.
    
    Inputs: 
    casrn: CAS Registry Number
    tb_port: Port number selected for locally running instance of the Toolbox Server
    idx: Optional index of the chemical, used only in log messages
    
    Outputs:
    log_kow: Log10 scaled octanol-water partition coefficient if available, otherwise None.
    """
    import urllib.request, urllib.parse, json
    
    search_base = 'http://localhost:'+str(tb_port)+'/api/v6/search/'
    kow_url = 'http://localhost:'+str(tb_port)+'/api/v6/calculation/41552380-4d5d-4eab-bee0-03774c0eabb6/'
    search_cas = lambda casrn: json.loads(urllib.request.urlopen(search_base+'cas/'+casrn+'/true').read().decode())
    kow_exec = lambda chem_Id: json.loads(urllib.request.urlopen(kow_url+chem_Id).read().decode())
    stereo_filter = ['@','/','\\','.']
    chem_entries = []
    if casrn is None:
        print('No CASRN provided.')
        return None
    try:
        if 'NOCAS' not in casrn:
            print('Searching Toolbox database for ChemIds using CASRN: '+casrn+'.')
            cas_nohyphen = ''.join(casrn.split('-'))
            chem_entries = search_cas(cas_nohyphen)
        else:
            print('Invalid CASRN (contains "NOCAS"). Cannot search for ChemIds using this identifier.')
            return None
    except Exception:
        print('Bad http request for index #'+str(idx)+' CASRN')
        return None
    if len(chem_entries) > 0:
        print('Toolbox database ChemIds found for CASRN: '+casrn+' via Toolbox API search.')
        #remove chem name lists from dicts so that duplicate dicts can be removed via set comprehension
        for entry in chem_entries:
            entry.pop('Names', None)
        chem_entries = [dict(t) for t in {tuple(entry.items()) for entry in chem_entries}] #remove duplicate entries
        #Filter out mixtures, and substances with no casrn:
        chem_mono = [entry for entry in chem_entries if (entry['SubstanceType'] == 'MonoConstituent' and entry['Cas'] != 0)]
        #Store ChemID so long as smiles has no stereo information:
        chem_Id = [entry['ChemId'] for entry in chem_mono if not any(token in entry['Smiles'] for token in stereo_filter)]
        if len(chem_Id) > 0:
            print('Monoconstituent ChemIds found within search results...')
            for i in range(len(chem_Id)):
                print('Calculating Log Kow value for ChemId '+str(i+1)+'/'+str(len(chem_Id))+'...')
                try:
                    log_kow = kow_exec(chem_Id[i])
                    if log_kow is not None:
                        if log_kow['Value'] is not None:
                            print('Log Kow value successfully determined for CASRN: '+casrn+'.')
                            return float(log_kow['Value'])
                        else:
                            print('Log Kow value not found for current ChemId.')
                            continue
                except Exception:
                    print('Bad http request for index #'+str(idx)+' Log Kow.')
                    return None
            if log_kow is not None:
                if log_kow['Value'] is None:
                    print('ChemId(s) found, but no Log Kow available for CASRN: '+casrn+'.')
                    return None
            else:
                print('ChemId(s) found, but no Log Kow available for CASRN: '+casrn+'.')
                return None
        else:
            print('No ChemId(s) found for CASRN: '+casrn+'.')
            return None
#example
tb_metsim_api_search_logkow(casrn = '15687-27-1', tb_port = 16384)

Searching Toolbox database for ChemIds using CASRN: 15687-27-1.
Toolbox database ChemIds found for CASRN: 15687-27-1 via Toolbox API search.
Monoconstituent ChemIds found within search results...
Calculating Log Kow value for ChemId 1/2...
Log Kow value successfully determined for CASRN: 15687-27-1.


3.7931
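
The search calls used above follow a fixed URL pattern: the SMILES is URL-encoded and passed as a query parameter, while the CASRN is stripped of hyphens and placed in the path. The sketch below builds those URLs without contacting the server, which can help when checking a request by hand in the Swagger UI. `build_search_urls` is a hypothetical helper name; the path segments are copied from the lambdas in the functions above and are assumed to be valid for the locally running Toolbox version.

```python
import urllib.parse

def build_search_urls(tb_port, smiles=None, casrn=None):
    """Return a dict of search URLs for whichever identifiers are usable.

    Hypothetical helper: only builds strings, performs no HTTP request.
    """
    base = 'http://localhost:' + str(tb_port) + '/api/v6/search/'
    urls = {}
    if smiles:
        # quote_plus percent-encodes characters such as '(', ')', '=' and '#'.
        urls['smiles'] = base + 'smiles/false/true?smiles=' + urllib.parse.quote_plus(smiles)
    if casrn and 'NOCAS' not in casrn:
        # The API path takes the CASRN without hyphens.
        urls['cas'] = base + 'cas/' + casrn.replace('-', '') + '/true'
    return urls

print(build_search_urls(16384, casrn='15687-27-1')['cas'])
# prints http://localhost:16384/api/v6/search/cas/15687271/true
```

Identifiers containing "NOCAS" are skipped, so the returned dictionary can be empty when neither identifier is usable.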

In [9]:
with open('smpdb_jcim_valid_aggregate_112parents.json','r') as f:
    drug_dataset = json.load(f)
pool = mp.Pool(mp.cpu_count()) #create a worker pool with one process per available CPU core.
tb_metsim_vitro = pool.starmap_async(toolbox_metsim_api,
                                      #arguments (must be listed in the same order as given in the function definition):
                                      [(16384, #tb_port
                                        15, #simulator_num
                                        drug_dataset[idx]['input']['hcd_smiles'], #smiles
                                        drug_dataset[idx]['input']['casrn'], #casrn
                                        drug_dataset[idx]['input']['dtxsid'], #dtxsid
                                        drug_dataset[idx]['input']['chem_name'], #Chemical Name
                                        idx #index
                                       )
                                       for idx in range(len(drug_dataset[0:5]))]).get()

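#indices flagged with an int in place of an output dictionary are rerun serially via the search function:
metsim_rerun_vitro = [i for i in range(len(tb_metsim_vitro)) if isinstance(tb_metsim_vitro[i][1], int)]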
for idx in metsim_rerun_vitro:
    tb_metsim_vitro[idx] = pool.apply(toolbox_metsim_api_search,
                                       #arguments (must be listed in the same order as given in the function definition):
                                       args = (16384, #tb_port
                                                15, #simulator_num
                                                drug_dataset[idx]['input']['hcd_smiles'], #smiles
                                                drug_dataset[idx]['input']['casrn'], #casrn
                                                drug_dataset[idx]['input']['dtxsid'], #dtxsid
                                                drug_dataset[idx]['input']['chem_name'], #Chemical Name
                                                tb_metsim_vitro[idx][0] #index
                                              )
                                      )
#keep output dictionaries in list, remove tuple index:
tb_metsim_vitro = [tb_metsim_vitro[i][1] for i in range(len(tb_metsim_vitro))]
pool.close()
pool.join() #wait for worker processes to exit
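
The cell above uses a two-pass pattern: a parallel first pass, then a serial rerun of only those entries whose result is flagged as failed. The sketch below shows that control flow on its own, with no Toolbox server or process pool involved. `rerun_failures` and `fallback` are hypothetical names, and the assumption that a failed first pass is signalled by an `int` in place of the output dictionary is inferred from the type check in the cell above.

```python
def rerun_failures(results, fallback):
    """Rerun flagged entries one at a time and return the payloads only.

    results:  list of (idx, payload) tuples from the first pass; payload is
              assumed to be an int when that pass failed.
    fallback: callable taking idx and returning a new (idx, payload) tuple.
    """
    out = list(results)
    for pos, (idx, payload) in enumerate(out):
        if isinstance(payload, int):
            # Serial rerun, one query at a time.
            out[pos] = fallback(idx)
    # Drop the tuple index and keep the output dictionaries.
    return [payload for _, payload in out]
```

Running the fallback serially follows the note in `toolbox_metsim_api_search` about avoiding Toolbox search server crashes.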

In [None]:
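#NOTE: the runs above use simulator_num 15 (in vitro rat liver S9); confirm the output filename matches the simulator used.
metsim_metadata_full(tb_metsim_vitro, fnam = 'tb_metsim_invivoratsimulator_112parents.json')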