#### Generate predictions for a new list of chemicals for Intrinsic Clearance

- Step 1: Identify the substances of interest and their SMILES codes, then use KNIME to convert the SMILES into a V2000 SDF file.
  - See the example KNIME workflow (knwf file, generated in KNIME 3.7.2) in the models directory (httk/models).
- Step 2: Use the SDF file to generate PubChem and ToxPrint fingerprints using KNIME and the Chemotyper.
- Step 3: Use the SDF file to generate OPERA descriptors (v2.6).



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import glob

In [2]:
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import r2_score
import pickle

In [3]:
def normalizeDescriptors(X):
    # Standardize each column to zero mean and unit variance,
    # keeping the original index and column labels.
    scaler = preprocessing.StandardScaler().fit(X)
    transformed = scaler.transform(X)
    x_norm = pd.DataFrame(transformed, index=X.index, columns=X.columns)
    return x_norm
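
To make the effect of the scaling step concrete, here is a minimal pandas-only sketch of the same column-wise standardization. `StandardScaler` uses the population standard deviation (`ddof=0`) and leaves constant columns unscaled; the helper name `zscore_columns` and the toy data are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd

def zscore_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    # Replace a zero spread with 1 so constant columns become 0 instead of NaN.
    std = X.std(axis=0, ddof=0).replace(0, 1.0)
    return (X - mean) / std

# Toy data (made up): one varying column and one constant column.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]},
                   index=["x", "y", "z"])
toy_norm = zscore_columns(toy)
print(toy_norm)
```

Index and column labels are preserved, so the result can be aligned with other descriptor tables by identifier.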

In [4]:
raw_dir = '/home/grace/Documents/python/httk/data/raw/'
processed_dir = '/home/grace/Documents/python/httk/data/processed/'
interim_dir = '/home/grace/Documents/python/httk/data/interim/'
figures_dir = '/home/grace/Documents/python/httk/reports/figures/'
external_dir = '/home/grace/Documents/python/httk/data/external/'
models_dir = '/home/grace/Documents/python/httk/models/'
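
The directory variables above are absolute paths tied to one machine. As a possible alternative, the sketch below derives the same layout from a single root using `pathlib`. The root `/path/to/httk` is a placeholder assumption; substitute the location of your own checkout.

```python
from pathlib import Path

# Hypothetical project root (placeholder); change this one line per machine.
project_root = Path("/path/to/httk")

data_dir = project_root / "data"
raw_dir = data_dir / "raw"
processed_dir = data_dir / "processed"
interim_dir = data_dir / "interim"
external_dir = data_dir / "external"
figures_dir = project_root / "reports" / "figures"
models_dir = project_root / "models"

# Files are addressed with the / operator instead of string concatenation.
pubchem_path = processed_dir / "Fub_Pubchem.csv"
print(pubchem_path)
```

`pd.read_csv` and `open` both accept `Path` objects, but `+` concatenation as used elsewhere in this notebook does not work on them, so adopting this layout would mean switching those calls to the `/` form.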

Load descriptors needed for intrinsic clearance (regression model)

Per Table S6, this model appears to need only PubChem and ToxPrint fingerprints.

In [19]:
pubchem = pd.read_csv(processed_dir+'Fub_Pubchem.csv')

In [20]:
pubchem.head()

Unnamed: 0,CASRN,bitvector0,bitvector1,bitvector2,bitvector3,bitvector4,bitvector5,bitvector6,bitvector7,bitvector8,...,bitvector871,bitvector872,bitvector873,bitvector874,bitvector875,bitvector876,bitvector877,bitvector878,bitvector879,bitvector880
0,94-74-6,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,148477-71-8,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,56-29-1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,153233-91-1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,96182-53-5,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
txps = pd.read_excel(processed_dir+'ToxPrints.xlsx')


None of the supplementary files appears to correspond to the final Clint features described in mmc1 Table S6, so the feature names were copied directly from Table S6. Very frustrating!

In [8]:
pc = ['bitvector2',
 'bitvector12',
 'bitvector14',
 'bitvector15',
 'bitvector19',
 'bitvector20',
 'bitvector33',
 'bitvector37',
 'bitvector143',
 'bitvector179',
 'bitvector180',
 'bitvector185',
 'bitvector186',
 'bitvector256',
 'bitvector257',
 'bitvector286',
 'bitvector299',
 'bitvector308',
 'bitvector333',
 'bitvector340',
 'bitvector341',
 'bitvector345',
 'bitvector346',
 'bitvector356',
 'bitvector370',
 'bitvector374',
 'bitvector375',
 'bitvector376',
 'bitvector377',
 'bitvector380',
 'bitvector381',
 'bitvector390',
 'bitvector391',
 'bitvector392',
 'bitvector405',
 'bitvector420',
 'bitvector439',
 'bitvector451',
 'bitvector476',
 'bitvector516',
 'bitvector553',
 'bitvector592',
 'bitvector597',
 'bitvector599',
 'bitvector613',
 'bitvector614',
 'bitvector643',
 'bitvector645',
 'bitvector656',
 'bitvector696',
 'bitvector697',
 'bitvector698',
 'bitvector712']

In [9]:
tp = ['bond:CN_amine_aliphatic_generic',
 'bond:CX_halide_aromatic-X_generic',
 'chain:alkaneLinear_ethyl_C2(H_gt_1)',
 'chain:alkaneLinear_ethyl_C2_(connect_noZ_CN=4)',
 'chain:aromaticAlkane_Ph-C1_acyclic_connect_noDblBd']
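
Because the feature names were transcribed from a table, a typo would only surface later as a `KeyError` when subsetting. A minimal sketch of an upfront check follows; the helper `missing_columns` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def missing_columns(df: pd.DataFrame, wanted: list) -> list:
    """Return the names in `wanted` that are not columns of `df`, in order."""
    return [name for name in wanted if name not in df.columns]

# Toy fingerprint table (made up) with only two of the three requested bits.
toy = pd.DataFrame({"bitvector2": [1, 0], "bitvector12": [0, 1]})
wanted = ["bitvector2", "bitvector12", "bitvector14"]

missing = missing_columns(toy, wanted)
print(missing)
```

In the notebook, the same check could be run as `missing_columns(pubchem, pc)` and `missing_columns(txps, tp)` before the feature lists are used to subset the tables.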

In [21]:
pubchem.set_index('CASRN', inplace = True)

In [24]:
pubchem_ = pubchem[pc]

In [26]:
txps.drop(['DTXSID', 'PREFERRED_NAME'], axis=1, inplace=True)
txps.set_index('INPUT', inplace = True)

In [29]:
txps_ = txps[tp]

In [31]:
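# Identifiers present in both tables; `&` between Index objects is no longer a set operation in recent pandas
ids = list(txps_.index.intersection(pubchem_.index))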

In [32]:
txps_ = txps_.loc[ids]
pubchem_ = pubchem_.loc[ids]

In [35]:
descriptors = pd.concat([pubchem_, txps_], axis = 1)

In [39]:
descriptors.shape

(1118, 58)
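
The alignment steps above (keep only shared identifiers, then join column-wise) can be sketched on toy data. The identifiers and values here are made up, and `Index.intersection` is used as the explicit set operation.

```python
import pandas as pd

# Two toy descriptor tables (made-up values) indexed by placeholder identifiers.
fp_a = pd.DataFrame({"bit1": [1, 0, 1]}, index=["id-A", "id-B", "id-C"])
fp_b = pd.DataFrame({"chain": [0, 1, 1]}, index=["id-C", "id-A", "id-D"])

# Identifiers present in both tables.
shared = fp_a.index.intersection(fp_b.index)

# Reorder both tables by the same identifier list, then join column-wise.
combined = pd.concat([fp_a.loc[shared], fp_b.loc[shared]], axis=1)
print(combined)
```

Selecting both tables with the same identifier list means the rows line up even when the source files list the chemicals in different orders.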

Load saved model

In [36]:
with open(models_dir+'clintReg_rf.sav', 'rb') as f:
    clint_rf = pickle.load(f)



In [38]:
len(clint_rf.feature_importances_)

58

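The model expects 58 features, which matches the 58 descriptor columns assembled above. This confirms only the count, not that the columns are in the order used for training.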
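
The count comparison can be turned into an explicit guard before predicting. This is a minimal sketch: `check_feature_count` is a hypothetical helper, and the model is replaced by a stub exposing only `feature_importances_`, the attribute used above. It cannot detect columns supplied in a different order from training.

```python
import pandas as pd
from types import SimpleNamespace

def check_feature_count(model, X: pd.DataFrame) -> None:
    """Raise if the model's feature count differs from the number of columns in X."""
    expected = len(model.feature_importances_)
    if expected != X.shape[1]:
        raise ValueError(f"model expects {expected} features, got {X.shape[1]}")

# Stand-in for a fitted tree-based model (assumption, not a real estimator).
stub_model = SimpleNamespace(feature_importances_=[0.5, 0.3, 0.2])
toy_X = pd.DataFrame([[1, 0, 1]], columns=["f1", "f2", "f3"])

check_feature_count(stub_model, toy_X)  # three features, three columns: no error
```

In the notebook this would be called as `check_feature_count(clint_rf, descriptors)` just before `clint_rf.predict(descriptors)`.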

In [40]:
predicted_clint_rf = pd.DataFrame(clint_rf.predict(descriptors), descriptors.index )

In [41]:
predicted_clint_rf.columns = ['pred_clint_rf']

In [42]:
predicted_clint_rf

Unnamed: 0,pred_clint_rf
5152-30-7,1.035721
120-71-8,1.217503
77-28-1,0.536407
79538-32-2,1.267599
57-94-3,0.943510
...,...
123441-03-2,0.897008
42399-41-7,0.720585
95153-31-4,0.697933
484-17-3,0.878016


Set up the descriptor sets for the other Clint model, which is an SVC (support vector classifier)

In [43]:
pc2 = ['bitvector2', 'bitvector12', 'bitvector14', 'bitvector15', 'bitvector19', 'bitvector20', 'bitvector33', 'bitvector37', 'bitvector143', 'bitvector179', 'bitvector180', 'bitvector185', 'bitvector186', 'bitvector256', 'bitvector257', 'bitvector286', 'bitvector299', 'bitvector308', 'bitvector333', 'bitvector340', 'bitvector341', 'bitvector345', 'bitvector346', 'bitvector355', 'bitvector356', 'bitvector366', 'bitvector374', 'bitvector375', 'bitvector376', 'bitvector377', 'bitvector380', 'bitvector381', 'bitvector390', 'bitvector391', 'bitvector392', 'bitvector405', 'bitvector420', 'bitvector439', 'bitvector451', 'bitvector476', 'bitvector493', 'bitvector516', 'bitvector539', 'bitvector592', 'bitvector614', 'bitvector637', 'bitvector643', 'bitvector645', 'bitvector656', 'bitvector688', 'bitvector696', 'bitvector697', 'bitvector698', 'bitvector712']

In [44]:
tp2 = ['chain:alkaneLinear_ethyl_C2(H_gt_1)', 'chain:alkaneLinear_ethyl_C2_(connect_noZ_CN=4)', 'chain:aromaticAlkane_Ph-C1_acyclic_connect_noDblBd']

In [45]:
txps_2 = txps[tp2]

In [46]:
pubchem_2 = pubchem[pc2]

In [47]:
pubchem_2 = pubchem_2.loc[ids]
txps_2 = txps_2.loc[ids]

In [51]:
df_opera = pd.read_csv(interim_dir+'normalised_opera.csv')

In [53]:
df_opera.set_index('CASRN', inplace = True)

In [54]:
opera_ = df_opera.loc[ids]

In [55]:
descriptors_2 = pd.concat([pubchem_2, txps_2, opera_], axis = 1).dropna(axis=0, how='any')
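
Since `dropna(axis=0, how='any')` silently discards every chemical with at least one missing descriptor, it can help to record which identifiers were lost. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy combined descriptor table (made up); "id-B" lacks a continuous descriptor.
toy = pd.DataFrame(
    {"fp": [1, 0, 1], "opera": [0.2, np.nan, -1.1]},
    index=["id-A", "id-B", "id-C"],
)

complete = toy.dropna(axis=0, how="any")

# Identifiers removed because at least one descriptor was missing.
dropped_ids = toy.index.difference(complete.index)
print(f"kept {len(complete)} of {len(toy)} rows; dropped: {list(dropped_ids)}")
```

Note also that in recent pandas versions `.loc` with a list containing labels absent from the index raises a `KeyError` rather than returning NaN rows; `reindex` is the tolerant alternative if some identifiers may be missing from one of the tables.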

In [59]:
descriptors_2.to_csv(interim_dir+'descriptors_2.csv')

In [56]:
with open(models_dir+'clintClas_svc.sav', 'rb') as f:
    clint_svc = pickle.load(f)



In [57]:
predicted_clint_svc = pd.DataFrame(clint_svc.predict(descriptors_2), descriptors_2.index )

AttributeError: 'SVC' object has no attribute 'break_ties'

The SVC model fails to run because of a scikit-learn version mismatch: the pickled object lacks the `break_ties` attribute that the installed scikit-learn expects on `SVC`.

Note: the models load and run correctly in a conda environment that includes scikit-learn=0.20.1.
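
Given the version note above, it may be worth checking the installed scikit-learn version before unpickling the models. This is a minimal sketch; the helper `installed_version` is hypothetical, and `0.20.1` is the version reported above as working.

```python
import importlib

def installed_version(module_name: str):
    """Return the module's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

EXPECTED_SKLEARN = "0.20.1"  # version reported to load and run the pickled models

found = installed_version("sklearn")
if found != EXPECTED_SKLEARN:
    # A pinned environment can be built with, e.g.: conda install scikit-learn=0.20.1
    print(f"scikit-learn {found} found; the models were reported to work with {EXPECTED_SKLEARN}")
```

Pickled estimators are generally not guaranteed to work across scikit-learn versions, so pinning the version in the environment is the safer route.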