# Generation of predictions using BioTransformer
- Created by: Louis Groff
- PIs: Imran Shah and Grace Patlewicz (GP)
- Last modified by GP: 28 March 2024
- Changes made: Additional notes on implementation added from the supplemental information (SI); notebook cleaned up by moving functions into the metsim package.

## Information re installation and operation of BioTransformer 

A working copy of BioTransformer is needed. Assuming the input directory contains the BioTransformer executable Java Archive (JAR), the full Command Line Interface (CLI) command is:
java -jar BioTransformer3.0_20220615.jar -k pred -q "ecbased:1;cyp450:2;phaseII:1" -cm 1 -ismi "<in.hcd_smiles>" -ocsv ".\tmpfiles\btrans_out_<in.dtxsid>_<random filename string>.csv"

The randomly suffixed output files, generated via the “tempfile” package in Python, can be either kept or discarded once data processing is complete.

For the validation datasets, the models used were either “cyp450” alone for single-step phase I metabolism (Table S2 in the supplemental information), or any mixture of “cyp450”, “ecbased” and “phaseII” that terminated in phase II, whether as individual or sequential runs. The CLI command for the highest-performing single-step human model (no gut) was:
java -jar BioTransformer3.0_20220615.jar -k pred -q "ecbased:1;cyp450:1;phaseII:1" -cm 1 -ismi "<in.smiles>" -ocsv ".\tmpfiles\btrans_out_<in.dtxsid>_<random filename string>.csv"
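The commands above can also be assembled programmatically. The following is a minimal sketch, not the metsim implementation: `build_bt_command` and its parameter names are hypothetical, the parent identifier in the file prefix is a placeholder, and the JAR is assumed to sit in the working directory. Only the flags shown in the commands above are used, and the command is built and printed rather than executed.

```python
import os
import tempfile

def build_bt_command(smiles, models, cycles, out_csv, cyp_mode=1,
                     jar="BioTransformer3.0_20220615.jar"):
    """Return the BioTransformer CLI call as an argument list (hypothetical helper)."""
    if len(models) != len(cycles):
        raise ValueError("models and cycles must have the same length")
    # e.g. ["ecbased", "cyp450"] with [1, 2] becomes "ecbased:1;cyp450:2"
    query = ";".join(f"{model}:{cycle}" for model, cycle in zip(models, cycles))
    return ["java", "-jar", jar, "-k", "pred", "-q", query,
            "-cm", str(cyp_mode), "-ismi", smiles, "-ocsv", out_csv]

# Randomly suffixed output file created with the tempfile package.
# "PARENT_ID" is a placeholder for the parent chemical's identifier.
tmp_dir = tempfile.mkdtemp()
fd, out_csv = tempfile.mkstemp(prefix="btrans_out_PARENT_ID_", suffix=".csv", dir=tmp_dir)
os.close(fd)

cmd = build_bt_command("CCO", ["ecbased", "cyp450", "phaseII"], [1, 2, 1], out_csv)
print(" ".join(cmd))

# To actually run it (requires Java and the BioTransformer JAR):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting of the `-q` string and of SMILES that contain special characters.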

### Import relevant libraries including the metsim functions

In [2]:
import os, sys
import json
import multiprocessing as mp
import pandas as pd

In [3]:
TOP = os.getcwd().replace("notebooks", "")
raw_dir = TOP + 'data/raw/'
processed_dir = TOP + 'data/processed/'
figures_dir = TOP + 'reports/figures/'
jcim = TOP + 'data/raw/JCIM_PhaseI/'
smpdb = TOP + 'data/raw/smpdb_drugs/'
missing = TOP + 'data/raw/extra_metsim/'

In [4]:
LIB = os.getcwd().replace("notebooks", "")

In [5]:
if LIB not in sys.path:
    sys.path.insert(0, LIB)

In [8]:
from metsim.sim.metsim_bt import *

1. metsim_metadata_full():

Used to decouple running BioTransformer from gathering metabolite metadata, which saves time. BioTransformer is particularly prone to "combinatorial explosion": it produces so many metabolites that a run becomes computationally taxing. We therefore run BioTransformer first and save its SMILES and generation tracking in our dictionary format as a list of dictionaries, then later call metsim_metadata_full to fill in the metadata for that list.
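The two-stage workflow described above can be sketched as follows. This assumes the nested layout used in the post-processing cell of this notebook (`input`, `output`, `precursor`, `successors`, `metabolite`). `collect_metabolite_smiles` is a hypothetical helper, not part of metsim, and the record is toy data rather than BioTransformer output.

```python
import json

def collect_metabolite_smiles(predictions):
    """Return unique, non-empty metabolite SMILES in order of first appearance."""
    seen = []
    for record in predictions:
        for step in record.get("output", []):
            for successor in step["successors"]:
                smiles = successor["metabolite"]["smiles"]
                if smiles is not None and smiles not in seen:
                    seen.append(smiles)
    return seen

# Stage 1: predictions saved with structure only, metadata fields left empty (toy data).
predictions = [{
    "input": {"smiles": "CCO"},
    "output": [
        {"precursor": {"smiles": "CCO"},
         "successors": [{"enzyme": [], "mechanism": None, "generation": 1,
                         "metabolite": {"smiles": "CC=O", "inchikey": None}}]},
        {"precursor": {"smiles": "CC=O"},
         "successors": [{"enzyme": [], "mechanism": None, "generation": 2,
                         "metabolite": {"smiles": "CC(O)=O", "inchikey": None}}]},
    ],
}]
saved = json.dumps(predictions)  # in practice this would be written to a .json file

# Stage 2: reload later and list the structures whose metadata still has to be looked up.
reloaded = json.loads(saved)
todo = collect_metabolite_smiles(reloaded)
print(todo)
```

Deduplicating before any lookup keeps the metadata stage proportional to the number of distinct metabolites rather than the number of predicted transformations.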

2. btrans_metsim_subprocess():

- input: SMILES, List of Models, List of Cycles, Delete Tempfile (True/False)
- action: Simulates human metabolism (Default: 2 cycles Phase I/"cyp450", 1 cycle Phase II/"phaseII") using BioTransformer 3.0, run through Java on the command line, producing output data in temporary files that can be kept or deleted. Recursively searches through the output CSV to process the data into a standardized dictionary output.
- output: Tuple with a dummy index (for parallel processing), a dictionary of precursor and successor SMILES, CASRN, DTXSID and InChIKey as supplemented by HCD (or RDKit for InChIKey), and the filename (if del_tmp = False).
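The recursive search through the output CSV can be illustrated with a toy precursor/successor table. This is a sketch of one way to assign generations, not the metsim `recursive_gen_list` implementation: `assign_generations` is a hypothetical helper, the `Metabolite ID` column name and the parent label `P0` are assumptions (only `Precursor ID` appears in this notebook), and the rows are invented for illustration.

```python
import csv
import io

# Toy table standing in for a BioTransformer output CSV (assumed column names).
toy_csv = """Metabolite ID,Precursor ID
M1,P0
M2,P0
M3,M1
M4,M3
"""

def assign_generations(rows, parent_ids, gen=1, out=None):
    """Recursively label each metabolite with the generation in which it first appears."""
    if out is None:
        out = {}
    # Successors of the current parents that have not been labelled yet.
    children = [row["Metabolite ID"] for row in rows
                if row["Precursor ID"] in parent_ids and row["Metabolite ID"] not in out]
    if not children:
        return out  # no further successors: recursion ends
    for child in children:
        out[child] = gen
    # The successors of this generation become the parents of the next one.
    return assign_generations(rows, set(children), gen + 1, out)

rows = list(csv.DictReader(io.StringIO(toy_csv)))
generations = assign_generations(rows, {"P0"})
print(generations)
```

Skipping metabolites that already have a label keeps the recursion finite even if the table contains a cycle between two structures.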

In [None]:
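# Load in the existing dictionary instead of rerunning.
# Local copy of the aggregated validation set; if reading the raw file from GitHub instead, update the token in the URL as necessary.
with open('smpdb_jcim_valid_aggregate_112parents.json', 'r') as f:
    smpdb_obs_pathways = json.load(f)  # json.load reads from a file object; json.loads expects a string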

out_list = []
pool = mp.Pool(mp.cpu_count()) #define number of available CPU cores for multiprocessing.

out_list = pool.starmap_async(btrans_metsim_subprocess,
                              #arguments (must be listed in the same order as given in the function definition):
                              [(['ecbased','cyp450','phaseII'], #models
                                 1, #cyp_mode
                                 [1,2,1], #cycles
                                 smpdb_obs_pathways[idx]['input']['hcd_smiles'], #smiles
                                 smpdb_obs_pathways[idx]['input']['casrn'], #casrn
                                 False, #del_tmp
                                 smpdb_obs_pathways[idx]['input']['dtxsid'], #dtxsid
                                 smpdb_obs_pathways[idx]['input']['chem_name'], #chem_name
                                 idx, #index
                                 True #multi_proc
                               )
                               for idx in range(len(smpdb_obs_pathways[0:4]))]).get() #show for first four chemicals in SMPDB dataset as a test

pool.close() #close the processing pool to release resources.
pool.join() #wait for the worker processes to exit.

# Post-processing of multiprocessed BioTransformer Data
for i in range(len(out_list)):
    if os.path.getsize(out_list[i][2]) > 0:
        input_df = pd.read_csv(out_list[i][2])
        parent_list = input_df['Precursor ID']
        parent_list = parent_list.drop_duplicates()
        out_list[i][1]['output'] = recursive_gen_list(input_df = input_df,
                                                      parent_list = parent_list,
                                                      successor_list = [],
                                                      out_list = [],
                                                      gen = 1)
    else: # Valid SMILES given, but no metabolites produced:
        print('No metabolites produced for index #'+str(i))
        out_list[i][1]['output'] = [{'precursor': out_list[i][1]['input'],
                                     'successors': [{'enzyme': [],
                                                     'mechanism': None,
                                                     'generation': None,
                                                     'metabolite': {'smiles': None,
                                                                    'inchikey': None,
                                                                    'casrn': None,
                                                                    'hcd_smiles': None,
                                                                    'dtxsid': None,
                                                                    'chem_name': None
                                                                   }
                                                    }]
                                    }]
# Keep only the prediction dictionary (element 1) from each result tuple.
preds_complete = [result[1] for result in out_list]
with open('metsim_biotransformer_1xecbased_2xcyp450_1xphase2_smpdb_test.json', 'w') as f:
    json.dump(preds_complete, f)

In [34]:
# Fill in the metadata for the list of prediction dictionaries built above.
preds_complete_all_metadata = metsim_metadata_full(preds_complete, fnam = 'metsim_biotransformer_1xecbased_2xcyp450_1xphase2_smpdb_test.json')
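
For a quick check of the result, the nested records can be flattened to one row per precursor/successor pair and counted by generation. This is a minimal sketch under assumptions: `flatten_predictions` is a hypothetical helper, not part of metsim, it assumes each `input` and `precursor` dictionary carries a `smiles` key, and the records are toy data. Placeholder entries written for parents with no metabolites (metabolite SMILES of None) are skipped.

```python
from collections import Counter

def flatten_predictions(predictions):
    """Return one row per precursor/successor pair, skipping empty placeholder entries."""
    rows = []
    for record in predictions:
        for step in record["output"]:
            for successor in step["successors"]:
                metabolite = successor["metabolite"]
                if metabolite["smiles"] is None:
                    continue  # placeholder for a parent with no predicted metabolites
                rows.append({
                    "parent": record["input"]["smiles"],
                    "precursor": step["precursor"]["smiles"],
                    "metabolite": metabolite["smiles"],
                    "generation": successor["generation"],
                })
    return rows

# Toy records in the notebook's dictionary layout (not real BioTransformer output).
toy = [
    {"input": {"smiles": "CCO"},
     "output": [
         {"precursor": {"smiles": "CCO"},
          "successors": [{"generation": 1, "metabolite": {"smiles": "CC=O"}}]},
         {"precursor": {"smiles": "CC=O"},
          "successors": [{"generation": 2, "metabolite": {"smiles": "CC(O)=O"}}]},
     ]},
    {"input": {"smiles": "C"},
     "output": [
         {"precursor": {"smiles": "C"},
          "successors": [{"generation": None, "metabolite": {"smiles": None}}]},
     ]},
]

rows = flatten_predictions(toy)
per_generation = Counter(row["generation"] for row in rows)
print(len(rows), dict(per_generation))
```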