# Generation of predictions using BioTransformer
- Created by: Louis Groff
- PIs: Imran Shah and Grace Patlewicz (GP)
- Last modified by GP: 5 April 2024
- Changes made: Added implementation notes from the SI, cleaned up the notebook by moving functions into the metsim package, and added an example.

## Information on installation and operation of BioTransformer

A working copy of BioTransformer is needed. Assuming the input directory contains the BioTransformer executable Java Archive (JAR), the full command-line interface (CLI) command is:
- java -jar BioTransformer3.0_20220615.jar -k pred -q "ecbased:1;cyp450:2;phaseII:1" -cm 1 -ismi "<in.hcd_smiles>" -ocsv "\tmpfiles\btrans_out_<in.dtxsid>_<random filename string>.csv"
<br>The randomly suffixed output files, generated via the “tempfile” module in Python, can be either kept or discarded once data processing is complete.

For the validation datasets, the models used were either “cyp450” for single-step phase I metabolism (Table S2 in the supplemental information), or any combination of “cyp450”, “ecbased” and “phaseII” that terminated in phase II, run either individually or sequentially. The CLI command for the highest-performing single-step human model (no gut) was:
- java -jar BioTransformer3.0_20220615.jar -k pred -q "ecbased:1;cyp450:1;phaseII:1" -cm 1 -ismi "<in.smiles>" -ocsv "\tmpfiles\btrans_out_<in.dtxsid>_<random filename string>.csv"
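
The two commands above differ only in the cycle counts passed to `-q`. The following is a minimal sketch of how such a command could be assembled in Python. The helper name `build_btrans_command`, its parameters and its defaults are hypothetical and are not taken from the metsim package. It uses only the flags shown above (`-k`, `-q`, `-cm`, `-ismi`, `-ocsv`) and the standard `tempfile` module to obtain a randomly suffixed output file name.

```python
import os
import tempfile


def build_btrans_command(smiles, dtxsid, models, cycles, cyp_mode=1,
                         jar="BioTransformer3.0_20220615.jar", out_dir=None):
    """Hypothetical helper: return (argument list, output CSV path)."""
    if len(models) != len(cycles):
        raise ValueError("models and cycles must have the same length")
    # Sequence string for -q, e.g. "ecbased:1;cyp450:2;phaseII:1"
    sequence = ";".join(f"{m}:{c}" for m, c in zip(models, cycles))
    # Assumption: fall back to the system temp directory if none is given.
    out_dir = out_dir or tempfile.gettempdir()
    # mkstemp creates an empty file whose name has a random middle part.
    fd, out_csv = tempfile.mkstemp(prefix=f"btrans_out_{dtxsid}_",
                                   suffix=".csv", dir=out_dir)
    os.close(fd)
    cmd = ["java", "-jar", jar, "-k", "pred", "-q", sequence,
           "-cm", str(cyp_mode), "-ismi", smiles, "-ocsv", out_csv]
    return cmd, out_csv


cmd, out_csv = build_btrans_command(
    smiles="CC(C)CC1=CC=C(C=C1)C(C)C(O)=O", dtxsid="DTXSID3047138",
    models=["ecbased", "cyp450", "phaseII"], cycles=[1, 2, 1])
print(" ".join(cmd))
# To run it (requires Java and the JAR in the working directory):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with SMILES characters such as parentheses and `=`.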

### Import relevant libraries including the metsim functions

In [1]:
import os, sys
import pandas as pd

In [2]:
TOP = os.getcwd().replace("notebooks", "")
raw_dir = TOP + 'data/raw/'
processed_dir = TOP + 'data/processed/'
figures_dir = TOP + 'reports/figures/'
jcim = TOP + 'data/raw/JCIM_PhaseI/'
smpdb = TOP + 'data/raw/smpdb_drugs/'
missing = TOP + 'data/raw/extra_metsim/'

In [3]:
LIB = os.getcwd().replace("notebooks", "")

In [4]:
if LIB not in sys.path:
    sys.path.insert(0, LIB)

In [5]:
from metsim.sim.metsim_bt import *

### btrans_metsim_subprocess():

- input: SMILES, List of Models, List of Cycles, Delete Tempfile (True/False)
- action: Simulates human metabolism (Default: 2 cycles Phase I/"cyp450", 1 cycle Phase II/"phaseII") using BioTransformer 3.0 through Java in the command prompt, producing output data in temporary files that can be kept or deleted. Recursively searches through output CSV to process data into a standardized dictionary output.
- output: Tuple with dummy index (for parallel processing), Dictionary of precursor and successor SMILES, CASRN, DTXSID, InChIKey as supplemented by HCD (or RDKit for InChIKey), and filename (if del_tmp = False).
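
To make the output description concrete, the sketch below shows the approximate shape of one returned tuple once its output has been processed. It is inferred from the post-processing code in this notebook and is illustrative only: the values are placeholders, and the exact keys of the `input` and `precursor` entries are assumptions.

```python
# Illustrative shape only. Values are placeholders, and the keys of the
# "input" and "precursor" entries are assumptions (mirroring the metabolite fields).
example_result = (
    0,  # dummy index, used to keep track of order in parallel runs
    {
        "input": {
            "smiles": "CC(C)CC1=CC=C(C=C1)C(C)C(O)=O",
            "dtxsid": "DTXSID3047138",
            "chem_name": "Ibuprofen",
        },
        "output": [
            {
                "precursor": {"smiles": "CC(C)CC1=CC=C(C=C1)C(C)C(O)=O"},
                "successors": [
                    {
                        "enzyme": [],
                        "mechanism": None,
                        "generation": 1,
                        "metabolite": {
                            "smiles": None, "inchikey": None, "casrn": None,
                            "hcd_smiles": None, "dtxsid": None, "chem_name": None,
                        },
                    }
                ],
            }
        ],
    },
    # Temporary CSV file name, present when del_tmp = False
    "btrans_out_DTXSID3047138_<random filename string>.csv",
)

# Unpack the tuple and walk the precursor/successor records.
idx, output_dict, csv_filename = example_result
for entry in output_dict["output"]:
    print(entry["precursor"]["smiles"], "->",
          len(entry["successors"]), "successor record(s)")
```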

### Example for a single chemical, Ibuprofen

In [1]:
#test_btrans = btrans_metsim_subprocess(smiles = 'CC(C)CC1=CC=C(C=C1)C(C)C(O)=O',dtxsid = 'DTXSID3047138')

### MetSim output from BioTransformer for Ibuprofen before Cheminformatics Modules queries:

In [7]:
#test_btrans

### Example: parallel processing of the first ten chemicals of the full drug dataset of 112 parents

The "multiprocess" (mp) package is used to wrap btrans_metsim_subprocess in a multiprocess call.
    
### General format of parallel-processed MetSim:
    1. import multiprocess as mp
    2. Define the pool of CPUs available for parallel processing with pool = mp.Pool(mp.cpu_count()), which allocates all cores; reduce the count as needed if multitasking on a PC.
    3. For functions that can be processed asynchronously in parallel, call pool.starmap_async():
        a. The first argument is the function to be called (it must have all necessary imports inside its function definition!)
        b. The second argument is a list comprehension of the form [(tuple of parameters, in the order they are defined in the function) for idx in range(len(dataset))]
        c. Append .get() to the starmap_async() call, e.g., pool.starmap_async(<function>, [(ordered argument tuple)]).get()
        d. The completed output is a list of (idx, output_dict, csv_filename) tuples
    4. Iterate through tuple index 2 (csv_filename) of each (idx, output_dict, csv_filename) tuple with the recursive_gen_list() function, and store the result as a list of generationally tracked output_dict for the first 10 parent chemicals
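
The steps above can be sketched with a toy function in place of btrans_metsim_subprocess. This is a minimal illustration, not the metsim implementation: `toy_metsim`, `build_args` and the toy dataset are hypothetical. The sketch uses the standard library's `multiprocessing` so that it runs as a plain script; the "multiprocess" package is a fork of it with the same Pool API, so in a notebook `import multiprocess as mp` would be used as in step 1.

```python
import multiprocessing as mp  # stand-in here for: import multiprocess as mp


def toy_metsim(smiles, cycles, idx):
    """Stand-in for the real function: returns (idx, output_dict, csv_filename)."""
    output_dict = {"input": {"smiles": smiles}, "output": []}
    return idx, output_dict, f"toy_out_{idx}.csv"


def build_args(dataset, cycles):
    # One tuple per chemical, in the same order as the function's parameters.
    return [(dataset[idx]["smiles"], cycles, idx) for idx in range(len(dataset))]


if __name__ == "__main__":
    dataset = [{"smiles": "CCO"}, {"smiles": "CC(=O)O"}, {"smiles": "c1ccccc1"}]
    # Use at most two worker processes for this small example.
    with mp.Pool(min(2, mp.cpu_count())) as pool:
        # .get() blocks until every task has finished.
        results = pool.starmap_async(toy_metsim,
                                     build_args(dataset, [1, 2, 1])).get()
    for idx, output_dict, csv_filename in results:
        print(idx, output_dict["input"]["smiles"], csv_filename)
```

Because each result carries its own index, results can be matched back to the input dataset after the pool has finished.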

In [None]:
import json
import multiprocess as mp
# Load the aggregated validation dataset of 112 parent chemicals.
with open('..\\smpdb_jcim_valid_aggregate_112parents.json', 'r') as f:
    smpdb_obs_pathways = json.load(f)
smpdb_obs_pathways = smpdb_obs_pathways[:10]
out_list = []
pool = mp.Pool(mp.cpu_count()) #define number of available CPU cores for multiprocessing.

out_list = pool.starmap_async(btrans_metsim_subprocess,
                              #arguments (must be listed in the same order as given in the function definition):
                              [('C:\\Users\\LGROFF\\OneDrive - Environmental Protection Agency (EPA)\\Profile\\Desktop\\biotransformer3.0jar',
                                  ['ecbased','cyp450','phaseII'], #models
                                 1, #cyp_mode
                                 [1,2,1], #cycles
                                 smpdb_obs_pathways[idx]['input']['qsar_smiles'], #smiles
                                 smpdb_obs_pathways[idx]['input']['casrn'], #casrn
                                 False, #del_tmp
                                 smpdb_obs_pathways[idx]['input']['dtxsid'], #dtxsid
                                 smpdb_obs_pathways[idx]['input']['chem_name'], #chem_name
                                 idx, #index
                                 True #multi_proc
                               )
                               for idx in range(len(smpdb_obs_pathways))]).get() #run for the first ten chemicals in the SMPDB dataset as a test

pool.close() #close the processing pool to release resources.

# Post-processing of multiprocessed BioTransformer Data
for i in range(len(out_list)):
    if os.path.getsize(out_list[i][2]) > 0:
        input_df = pd.read_csv(out_list[i][2])
        parent_list = input_df['Precursor ID']
        parent_list = parent_list.drop_duplicates()
        out_list[i][1]['output'] = recursive_gen_list(input_df = input_df,
                                                      parent_list = parent_list,
                                                      successor_list = [],
                                                      out_list = [],
                                                      gen = 1)
    else: #Valid SMILES given, but no metabolites were produced:
        print('No metabolites produced for index #'+str(i))
        out_list[i][1]['output'] = [{'precursor': out_list[i][1]['input'],
                                     'successors': [{'enzyme': [],
                                                     'mechanism': None,
                                                     'generation': None,
                                                     'metabolite': {'smiles': None,
                                                                    'inchikey': None,
                                                                    'casrn': None,
                                                                    'hcd_smiles': None,
                                                                    'dtxsid': None,
                                                                    'chem_name': None
                                                                   }
                                                   }]
                                   }]
preds_complete = [out_list[i][1] for i in range(len(out_list))]
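# Write the predictions to JSON; the with-block ensures the file is flushed and closed.
with open('metsim_biotransformer_1xecbased_2xcyp450_1xphase2_smpdb_test.json', 'w') as f:
    json.dump(preds_complete, f)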
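
As a quick sanity check on the saved predictions, the number of metabolites per parent can be counted. The sketch below assumes the precursor/successors structure built in the post-processing step above. The helper `count_metabolites` and the two toy records are hypothetical; in practice the list would come from `json.load()` on the saved file.

```python
import json


def count_metabolites(prediction):
    """Count successors with a non-empty SMILES across all precursor entries."""
    return sum(
        1
        for entry in prediction.get("output", [])
        for succ in entry["successors"]
        if succ["metabolite"]["smiles"] is not None
    )


# Toy records: parent A has two metabolites, parent B has the empty placeholder.
preds = [
    {"input": {"chem_name": "parent A"},
     "output": [{"precursor": {"chem_name": "parent A"},
                 "successors": [{"metabolite": {"smiles": "CCO"}},
                                {"metabolite": {"smiles": "CC=O"}}]}]},
    {"input": {"chem_name": "parent B"},
     "output": [{"precursor": {"chem_name": "parent B"},
                 "successors": [{"metabolite": {"smiles": None}}]}]},
]
# Round-trip through JSON to mimic loading the saved file.
preds = json.loads(json.dumps(preds))
counts = [count_metabolites(p) for p in preds]
print(counts)  # [2, 0]
```

Parents for which no metabolites were produced carry a single placeholder successor with `smiles` set to None, so they count as zero here.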