#### Generate predictions for a new list of chemicals for Fraction Unbound


- Step 1: Identify the substances of interest and their SMILES codes, then use KNIME to convert the SMILES into a V2000 sdf file
- See the KNIME workflow in the models directory (httk/models) for an example knwf file generated in KNIME 3.7.2
- Step 2: Use the sdf file to generate Pubchem and ToxPrint fingerprints using KNIME and the Chemotyper
- Step 3: Use the sdf file to generate OPERA descriptors (v2.6)


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import glob

In [31]:
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import r2_score
import pickle

In [32]:
def normalizeDescriptors(X):
    """Z-score each column of X, keeping its index and column names."""
    scaler = preprocessing.StandardScaler().fit(X)
    transformed = scaler.transform(X)
    x_norm = pd.DataFrame(transformed, index=X.index, columns=X.columns)
    return x_norm
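
To make explicit what `normalizeDescriptors` computes, here is a minimal sketch on made-up data. It assumes `StandardScaler` is used with its default settings, in which case each column is centred on its mean and divided by its population standard deviation (`ddof=0`). The frame `X` and the `chem_*` labels are illustrative only and are not part of the Fub data.

```python
import pandas as pd

# Toy descriptor table; values and row labels are illustrative only.
X = pd.DataFrame(
    {"LogP_pred": [1.0, 2.0, 3.0], "pKa_pred": [4.0, 4.0, 10.0]},
    index=["chem_a", "chem_b", "chem_c"],
)

# Column-wise z-score using the population standard deviation (ddof=0),
# which is the convention StandardScaler follows by default.
x_norm = (X - X.mean()) / X.std(ddof=0)

print(x_norm.round(3))
```

Note that the function fits the scaler on whatever table is passed in, so the scaled values depend on the set of chemicals being predicted. A zero-variance column would give NaN in this pandas version, whereas `StandardScaler` leaves such a column at 0.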

In [118]:
raw_dir = '/home/grace/Documents/python/httk/data/raw/'
processed_dir = '/home/grace/Documents/python/httk/data/processed/'
interim_dir = '/home/grace/Documents/python/httk/data/interim/'
figures_dir = '/home/grace/Documents/python/httk/reports/figures/'
external_dir = '/home/grace/Documents/python/httk/data/external/'
models_dir = '/home/grace/Documents/python/httk/models/'

Importing descriptor files

In [5]:
pubchem = pd.read_csv(processed_dir+'Fub_Pubchem.csv')

In [9]:
pubchem.head()

Unnamed: 0,CASRN,bitvector0,bitvector1,bitvector2,bitvector3,bitvector4,bitvector5,bitvector6,bitvector7,bitvector8,...,bitvector871,bitvector872,bitvector873,bitvector874,bitvector875,bitvector876,bitvector877,bitvector878,bitvector879,bitvector880
0,94-74-6,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,148477-71-8,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,56-29-1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,153233-91-1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,96182-53-5,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
cdk = pd.read_csv(processed_dir+'Fub_CDK.csv')

In [10]:
cdk.head()

Unnamed: 0,Molecule,INPUT,FOUND_BY,DTXSID,PREFERRED_NAME,CASRN,Molecular Weight,SMILES,QSAR_READY_SMILES,Mannhold LogP,...,XLogP,Zagreb Index,Molecular Formula,Formal Charge,Formal Charge (pos),Formal Charge (neg),Heavy Atoms Count,Molar Mass,SP3 Character,Rotatable Bonds Count (non terminal)
0,Cc1c(OCC(O)=O)ccc(Cl)c1,94-74-6,CAS-RN,DTXSID4024195,MCPA,94-74-6,200.024022,CC1=C(OCC(O)=O)C=CC(Cl)=C1,CC1=C(OCC(O)=O)C=CC(Cl)=C1,2.01,...,2.167,60,C9H9ClO3,0,0,0,13,200.619242,0.090909,3
1,CCC(C)(C)C(=O)OC1=C(C(=O)OC21CCCCC2)c3ccc(Cl)c...,148477-71-8,CAS-RN,DTXSID6034928,Spirodiclofen,148477-71-8,410.105165,CCC(C)(C)C(=O)OC1=C(C(=O)OC11CCCCC1)C1=CC=C(Cl...,CCC(C)(C)C(=O)OC1=C(C(=O)OC11CCCCC1)C1=CC=C(Cl...,3.11,...,6.084,146,C21H24Cl2O4,0,0,0,27,411.319527,0.215686,5
2,CN1C(O)=NC(=O)C(C)(C2=CCCCC2)C1=O,56-29-1,CAS-RN,DTXSID9023122,Hexobarbital,56-29-1,236.116092,CN1C(O)=NC(=O)C(C)(C2=CCCCC2)C1=O,CN1C(O)=NC(=O)C(C)(C2=CCCCC2)C1=O,2.23,...,1.838,90,C12H16N2O3,0,0,0,17,236.267504,0.212121,1
3,CCOc1c(ccc(c1)C(C)(C)C)C2COC(=N2)c3c(F)cccc3F,153233-91-1,CAS-RN,DTXSID8034586,Etoxazole,153233-91-1,359.169685,CCOC1=C(C=CC(=C1)C(C)(C)C)C1COC(=N1)C1=C(F)C=C...,CCOC1=C(C=CC(=C1)C(C)(C)C)C1COC(=N1)C1=C(F)C=C...,3.22,...,6.008,138,C21H23F2NO2,0,0,0,26,359.410411,0.163265,5
4,CCOP(=S)(OC(C)C)Oc1cnc(nc1)C(C)(C)C,96182-53-5,CAS-RN,DTXSID1032482,Tebupirimfos,96182-53-5,318.1167,CCOP(=S)(OC(C)C)OC1=CN=C(N=C1)C(C)(C)C,CCOP(=S)(OC(C)C)OC1=CN=C(N=C1)C(C)(C)C,2.12,...,3.253,98,C13H23N2O3PS,0,0,0,20,318.373672,0.209302,7


The CDK descriptors do not appear to be used in the Fub model

In [8]:
txps = pd.read_excel(processed_dir+'ToxPrints.xlsx')

In [11]:
txps.head()

Unnamed: 0,INPUT,DTXSID,PREFERRED_NAME,atom:element_main_group,atom:element_metal_group_I_II,atom:element_metal_group_III,atom:element_metal_metalloid,atom:element_metal_poor_metal,atom:element_metal_transistion_metal,atom:element_noble_gas,...,ring:polycycle_bicyclo_propene,ring:polycycle_spiro_[2.2]pentane,ring:polycycle_spiro_[2.5]octane,ring:polycycle_spiro_[4.5]decane,ring:polycycle_spiro_1_4-dioxaspiro[4.5]decane,ring:polycycle_tricyclo_[3.5.5]_cyclopropa[cd]pentalene,ring:polycycle_tricyclo_[3.7.7]bullvalene,ring:polycycle_tricyclo_[3.7.7]semibullvalene,ring:polycycle_tricyclo_adamantane,ring:polycycle_tricyclo_benzvalene
0,94-74-6,DTXSID4024195,MCPA,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,148477-71-8,DTXSID6034928,Spirodiclofen,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,56-29-1,DTXSID9023122,Hexobarbital,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,153233-91-1,DTXSID8034586,Etoxazole,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,96182-53-5,DTXSID1032482,Tebupirimfos,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
opera = pd.read_csv(processed_dir+'Fub-sdf_OPERA2.6Pred.csv')

In [57]:
opera.head()

Unnamed: 0,CASRN,MolWeight,nbAtoms,nbHeavyAtoms,nbC,nbO,nbN,nbAromAtom,nbRing,nbHeteroRing,...,SpMax4_Bhm,nHCsats,LipinskiFailures,ATSC6e,minsOH,WTPT_4,SaaN,minHBint2,maxHBa,ETA_dBeta
0,94-74-6,200.024022,22,13,9,3,0,6,1,0,...,2.823948,0,0,0.095497,8.403354,7.771488,0.0,8.004069,10.233938,0.75
1,148477-71-8,410.105165,51,27,21,4,0,6,3,1,...,3.580445,9,1,-1.903217,0.0,11.293233,0.0,0.0,12.953436,-5.0
2,56-29-1,236.116092,33,17,12,3,2,0,2,1,...,3.05689,5,0,-2.178422,9.363909,7.544963,0.0,0.649081,12.190578,-5.5
3,153233-91-1,359.169685,49,26,21,2,1,12,3,1,...,3.325359,7,1,-2.737049,0.0,6.073529,0.0,0.0,13.99048,-1.0
4,96182-53-5,318.1167,43,20,13,3,2,6,1,1,...,3.273965,8,0,0.482987,0.0,8.841324,8.76651,0.0,6.064177,-4.75


Supplementary file mmc24 corresponds to the final Fub features described in Table S6 of mmc1

In [26]:
desc = pd.read_csv(interim_dir+'Fub_final_features.csv') 

In [71]:
desc.Fingerprints.values

array(["['bitvector2', 'bitvector12', 'bitvector15', 'bitvector16', 'bitvector19', 'bitvector20', 'bitvector33', 'bitvector37', 'bitvector143', 'bitvector145', 'bitvector179', 'bitvector180', 'bitvector185', 'bitvector186', 'bitvector192', 'bitvector256', 'bitvector257', 'bitvector299', 'bitvector308', 'bitvector333', 'bitvector335', 'bitvector338', 'bitvector339', 'bitvector340', 'bitvector341', 'bitvector345', 'bitvector346', 'bitvector352', 'bitvector356', 'bitvector357', 'bitvector370', 'bitvector374', 'bitvector375', 'bitvector376', 'bitvector377', 'bitvector379', 'bitvector380', 'bitvector381', 'bitvector390', 'bitvector391', 'bitvector392', 'bitvector405', 'bitvector420', 'bitvector439', 'bitvector464', 'bitvector476', 'bitvector493', 'bitvector502', 'bitvector516', 'bitvector521', 'bitvector528', 'bitvector539', 'bitvector566', 'bitvector569', 'bitvector592', 'bitvector593', 'bitvector597', 'bitvector607', 'bitvector614', 'bitvector637', 'bitvector638', 'bitvector643', 'bitvect

In [75]:
pc = ['bitvector2', 'bitvector12', 'bitvector15', 'bitvector16', 'bitvector19', 'bitvector20', 'bitvector33', 'bitvector37', 'bitvector143', 'bitvector145', 'bitvector179', 'bitvector180', 'bitvector185', 'bitvector186', 'bitvector192', 'bitvector256', 'bitvector257', 'bitvector299', 'bitvector308', 'bitvector333', 'bitvector335', 'bitvector338', 'bitvector339', 'bitvector340', 'bitvector341', 'bitvector345', 'bitvector346', 'bitvector352', 'bitvector356', 'bitvector357', 'bitvector370', 'bitvector374', 'bitvector375', 'bitvector376', 'bitvector377', 'bitvector379', 'bitvector380', 'bitvector381', 'bitvector390', 'bitvector391', 'bitvector392', 'bitvector405', 'bitvector420', 'bitvector439', 'bitvector464', 'bitvector476', 'bitvector493', 'bitvector502', 'bitvector516', 'bitvector521', 'bitvector528', 'bitvector539', 'bitvector566', 'bitvector569', 'bitvector592', 'bitvector593', 'bitvector597', 'bitvector607', 'bitvector614', 'bitvector637', 'bitvector638', 'bitvector643', 'bitvector646', 'bitvector656', 'bitvector667', 'bitvector688', 'bitvector696', 'bitvector697', 'bitvector698', 'bitvector699', 'bitvector712']

In [76]:
tp = ['bond:CN_amine_aliphatic_generic', 'bond:CN_amine_ter-N_aliphatic', 'bond:COH_alcohol_generic', 'bond:CX_halide_aromatic-X_generic', 'chain:alkaneCyclic_ethyl_C2_(connect_noZ)', 'chain:alkaneLinear_ethyl_C2(H_gt_1)', 'chain:alkaneLinear_ethyl_C2_(connect_noZ_CN=4)', 'chain:aromaticAlkane_Ph-C1_acyclic_connect_noDblBd', 'ring:hetero_[6]_N_pyridine_generic']

In [53]:
print(desc['Padel+CDK'].values)

["['nN', 'nO', 'nS', 'nP', 'nF', 'nCl', 'nBr', 'nI', 'SM1_DzZ', 'SM1_Dzv']"]


In [None]:
['nN', 'nO', 'nS', 'nP', 'nF', 'nCl', 'nBr', 'nI', 'SM1_DzZ', 'SM1_Dzv']
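
The feature lists above are stored in the CSV as text (a Python list written out as a string), which is why they were pasted by hand into `pc` and `tp`. As a hedged alternative, the standard-library `ast.literal_eval` can parse such a cell directly. The frame `desc_demo` and the helper `parse_list_cell` below are hypothetical stand-ins, not part of the original workflow.

```python
import ast
import pandas as pd

# Stand-in for Fub_final_features.csv: each cell holds a list written as text.
desc_demo = pd.DataFrame({
    "Fingerprints": ["['bitvector2', 'bitvector12', 'bitvector15']"],
    "Padel+CDK": ["['nN', 'nO', 'nS']"],
})

def parse_list_cell(cell):
    """Turn the text "['a', 'b']" into the Python list ['a', 'b']."""
    return ast.literal_eval(cell)

pc_demo = parse_list_cell(desc_demo["Fingerprints"].iloc[0])
padel_demo = parse_list_cell(desc_demo["Padel+CDK"].iloc[0])
print(pc_demo)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text read from a file.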

#### Filter OPERA descriptors for the 2 descriptors needed for the model and normalise them using the 'normalizeDescriptors' function

In [58]:
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
df_opera = opera[['CASRN','LogP_pred','pKa_a_pred', 'pKa_b_pred']].copy()
df_opera['pKa_pred'] = df_opera[['pKa_a_pred','pKa_b_pred']].min(axis=1)
df_opera.set_index('CASRN', inplace = True)


In [59]:
#df_opera

In [60]:
df_opera = normalizeDescriptors(df_opera)
# keep only the two descriptors used by the model
df_opera = df_opera[['pKa_pred','LogP_pred']]

In [166]:
df_opera.to_csv(interim_dir+'normalised_opera.csv')
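
The OPERA step above takes `pKa_pred` as the smaller of the acidic and basic pKa predictions. The sketch below, on invented values, shows how the row-wise minimum behaves when one or both predictions are missing: pandas skips NaN by default, so a single available value is used, and the result is NaN only when both are missing. The frame `opera_demo` and its numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy OPERA-like output; all values are invented for illustration.
opera_demo = pd.DataFrame(
    {
        "CASRN": ["chem_a", "chem_b", "chem_c"],
        "LogP_pred": [2.1, 0.4, 3.3],
        "pKa_a_pred": [3.5, np.nan, np.nan],
        "pKa_b_pred": [9.0, 8.2, np.nan],
    }
)

# .copy() makes the selection an independent frame, so adding a column
# does not trigger SettingWithCopyWarning.
sub = opera_demo[["CASRN", "LogP_pred", "pKa_a_pred", "pKa_b_pred"]].copy()

# Row-wise minimum; NaN entries are skipped unless both values are missing.
sub["pKa_pred"] = sub[["pKa_a_pred", "pKa_b_pred"]].min(axis=1)
sub = sub.set_index("CASRN")
print(sub["pKa_pred"])
```

Rows where `pKa_pred` stays NaN are the ones that the later `dropna(axis=0, how='any')` call removes from the descriptor table.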

#### Filter ToxPrints descriptor file for relevant ToxPrints needed for the model

In [77]:
txps.set_index('INPUT', inplace = True)
txps.head()

INPUT,DTXSID,PREFERRED_NAME,atom:element_main_group,atom:element_metal_group_I_II,atom:element_metal_group_III,atom:element_metal_metalloid,atom:element_metal_poor_metal,atom:element_metal_transistion_metal,atom:element_noble_gas,bond:C#N_cyano_acylcyanide,...,ring:polycycle_bicyclo_propene,ring:polycycle_spiro_[2.2]pentane,ring:polycycle_spiro_[2.5]octane,ring:polycycle_spiro_[4.5]decane,ring:polycycle_spiro_1_4-dioxaspiro[4.5]decane,ring:polycycle_tricyclo_[3.5.5]_cyclopropa[cd]pentalene,ring:polycycle_tricyclo_[3.7.7]bullvalene,ring:polycycle_tricyclo_[3.7.7]semibullvalene,ring:polycycle_tricyclo_adamantane,ring:polycycle_tricyclo_benzvalene
94-74-6,DTXSID4024195,MCPA,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
148477-71-8,DTXSID6034928,Spirodiclofen,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56-29-1,DTXSID9023122,Hexobarbital,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
153233-91-1,DTXSID8034586,Etoxazole,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96182-53-5,DTXSID1032482,Tebupirimfos,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [79]:
txps.drop(['DTXSID', 'PREFERRED_NAME'], axis = 1, inplace = True)

In [83]:
txps_ = txps[tp]

#### Filter Pubchem file for relevant Pubchem features needed for the model

In [85]:
pubchem.set_index('CASRN', inplace = True)

In [87]:
pubchem_ = pubchem[pc]

#### Note: the txps_ and pubchem_ descriptor sets have different numbers of rows, depending on what could be calculated for each substance. Either merge the sets with an inner join, or take the set of common ids and concatenate the dfs. Here we take the common CASRN ids and concatenate the 2 dfs by column using axis = 1

In [91]:
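# CASRNs present in both tables; plain Python sets avoid the Index & Index
# operator, which newer pandas no longer treats as a set intersection
ids = list(set(pubchem_.index) & set(txps_.index))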

In [94]:
txps_ = txps_.loc[ids]

In [95]:
pubchem_ = pubchem_.loc[ids]

In [96]:
fingerprints = pd.concat([pubchem_,txps_ ], axis =1)
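
The note above mentions an inner join as the alternative to collecting common ids by hand. A minimal sketch of that route, on toy frames, is given here. It assumes both tables are indexed by the same identifier with no duplicate index values; `pubchem_demo` and `txps_demo` are invented names and data.

```python
import pandas as pd

# Two toy fingerprint tables with only partly overlapping row labels.
pubchem_demo = pd.DataFrame(
    {"bitvector2": [1, 0, 1]}, index=["chem_a", "chem_b", "chem_c"]
)
txps_demo = pd.DataFrame(
    {"bond:COH_alcohol_generic": [0, 1]}, index=["chem_b", "chem_d"]
)

# join="inner" keeps only the row labels found in both frames,
# so no separate list of common ids is needed.
fp_demo = pd.concat([pubchem_demo, txps_demo], axis=1, join="inner")
print(fp_demo)
```

Only `chem_b` appears in both toy tables, so it is the only row in the result.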

In [97]:
fingerprints.head()

Unnamed: 0,bitvector2,bitvector12,bitvector15,bitvector16,bitvector19,bitvector20,bitvector33,bitvector37,bitvector143,bitvector145,...,bitvector712,bond:CN_amine_aliphatic_generic,bond:CN_amine_ter-N_aliphatic,bond:COH_alcohol_generic,bond:CX_halide_aromatic-X_generic,chain:alkaneCyclic_ethyl_C2_(connect_noZ),chain:alkaneLinear_ethyl_C2(H_gt_1),chain:alkaneLinear_ethyl_C2_(connect_noZ_CN=4),chain:aromaticAlkane_Ph-C1_acyclic_connect_noDblBd,ring:hetero_[6]_N_pyridine_generic
106-44-5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
129453-61-8,1,1,0,0,1,0,1,0,1,0,...,1,0,0,1,0,1,1,1,0,0
91-66-7,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,1,0,0,0
162808-62-0,1,1,1,1,1,1,0,0,1,1,...,0,1,0,1,0,0,1,1,1,0
106791-40-6,1,1,1,0,1,1,0,0,0,0,...,1,0,0,0,0,0,1,1,1,1


In [102]:
padel = pd.read_csv(processed_dir+'padel.csv', index_col = 'Name')

In [103]:
padel.head()

Name,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,nAtom,nHeavyAtom,nH,...,AMW,WTPT-1,WTPT-2,WTPT-3,WTPT-4,WTPT-5,WPATH,WPOL,XLogP,Zagreb
94-74-6,1,0.7906,0.625048,24.5181,26.427137,6,6,22,13,9,...,9.092001,25.243848,1.941834,10.285051,7.771488,0.0,266.0,15.0,1.942,60.0
148477-71-8,0,1.7152,2.941911,76.1574,60.531032,6,6,51,27,24,...,8.041278,54.16076,2.005954,16.368275,11.293233,0.0,1661.0,48.0,5.468,146.0
56-29-1,0,-0.579,0.335241,58.7187,36.394688,0,0,33,17,16,...,7.155033,33.720604,1.983565,13.671018,7.544963,6.126055,458.0,32.0,1.838,90.0
153233-91-1,0,2.6304,6.919004,48.0469,56.114239,12,12,49,26,23,...,7.329994,52.376492,2.01448,14.332016,6.073529,3.156631,1688.0,42.0,6.844,138.0
96182-53-5,0,2.7615,7.625882,63.3122,49.352239,6,6,43,20,23,...,7.398063,38.619132,1.930957,21.024075,8.841324,6.106695,871.0,27.0,4.154,98.0


In [105]:
padel_ = padel[['nN', 'nO', 'nS', 'nP', 'nF', 'nCl', 'nBr', 'nI', 'SM1_DzZ', 'SM1_Dzv']]

In [107]:
padel_ = normalizeDescriptors(padel_)

In [110]:
padel_ = padel_.loc[ids]

It turns out that no Padel descriptors are needed, despite what is written in Table S6 and captured in MMC24: the Fub model takes only 82 descriptors (71 Pubchem bits + 9 ToxPrints + 2 OPERA descriptors), and adding the 10 Padel descriptors would result in there being 92!

In [112]:
opera_ = df_opera.loc[ids]

In [127]:
descriptors = pd.concat([fingerprints, opera_], axis=1).dropna(axis=0, how='any')

In [128]:
descriptors

CASRN,bitvector2,bitvector12,bitvector15,bitvector16,bitvector19,bitvector20,bitvector33,bitvector37,bitvector143,bitvector145,...,bond:CN_amine_ter-N_aliphatic,bond:COH_alcohol_generic,bond:CX_halide_aromatic-X_generic,chain:alkaneCyclic_ethyl_C2_(connect_noZ),chain:alkaneLinear_ethyl_C2(H_gt_1),chain:alkaneLinear_ethyl_C2_(connect_noZ_CN=4),chain:aromaticAlkane_Ph-C1_acyclic_connect_noDblBd,ring:hetero_[6]_N_pyridine_generic,pKa_pred,LogP_pred
106-44-5,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1.246075,-0.342767
129453-61-8,1,1,0,0,1,0,1,0,1,0,...,0,1,0,1,1,1,0,0,-1.082904,3.830820
91-66-7,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0.164490,0.366283
162808-62-0,1,1,1,1,1,1,0,0,1,1,...,0,1,0,0,1,1,1,0,0.107434,-0.826579
106791-40-6,1,1,1,0,1,1,0,0,0,0,...,0,0,0,0,1,1,1,1,-0.021484,1.028392
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74051-80-2,1,1,0,0,1,0,1,0,0,0,...,0,1,0,1,1,1,0,0,-2.528775,0.925867
56-54-2,1,1,1,0,1,0,0,0,0,0,...,1,1,0,1,0,0,0,1,0.753297,0.436290
924-16-3,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,-0.099127,0.014637
161326-34-7,1,1,1,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,-1.784756,0.076640


#### Load sklearn pickle files

In [120]:
with open(models_dir+'fub_rf.sav', 'rb') as f:
    fub_rf = pickle.load(f)
with open(models_dir+'fub_svr.sav', 'rb') as f:
    fub_svr = pickle.load(f)



Number of features in the saved model for random forest

In [138]:
len(fub_rf.feature_importances_)


82

In [137]:
fub_svr.predict(descriptors)

AttributeError: 'SVR' object has no attribute '_n_support'

In [155]:
descriptors.shape

(992, 82)
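
The comparison between the descriptor table's column count and the model's feature count was done by eye above. A small guard like the one sketched below could make that check explicit before calling `predict`. The helper `check_feature_count` and the frame `X_demo` are hypothetical and not part of the original notebook.

```python
import pandas as pd

def check_feature_count(X, n_expected):
    """Raise a ValueError if X does not have exactly n_expected columns."""
    n_found = X.shape[1]
    if n_found != n_expected:
        raise ValueError(f"model expects {n_expected} features, got {n_found}")
    return n_found

# Toy frame standing in for `descriptors`; 3 columns instead of 82.
X_demo = pd.DataFrame(
    [[0, 1, 0.5], [1, 0, -0.2]], columns=["fp_a", "fp_b", "LogP_pred"]
)

print(check_feature_count(X_demo, 3))
```

In this notebook the call would be `check_feature_count(descriptors, len(fub_rf.feature_importances_))`. A matching count does not confirm that the columns are in the order used for training, which the saved model alone may not reveal.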

A warning flags that the models were saved with a different, older version of sklearn than the one in my conda environment - oh brother! I can't run a consensus model prediction because the SVR model cannot be run.

Make predictions of the substances using the RF model

In [150]:
predicted_Fub = pd.DataFrame(fub_rf.predict(descriptors), descriptors.index )

In [153]:
predicted_Fub.columns = ['pred_Fub_rf']

In [154]:
predicted_Fub

CASRN,pred_Fub_rf
106-44-5,0.288792
129453-61-8,2.613303
91-66-7,1.151550
162808-62-0,-0.192807
106791-40-6,0.310187
...,...
74051-80-2,1.366361
56-54-2,0.551601
924-16-3,0.706235
161326-34-7,1.448633


If the SVM model were available, the consensus predictions would be computed from the df above by taking the mean of the predictions from each model, as shown below

In [None]:
#predicted_Fub['Consensus (SVM,RF)'] = predicted_Fub[['SVR', 'pred_Fub_rf']].mean(axis = 1)
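
Since the SVR model could not be run here, the consensus step can only be illustrated on invented numbers. The sketch below assumes both models' predictions sit in one frame indexed by chemical; the column name `pred_Fub_svr` and all values are hypothetical.

```python
import pandas as pd

# Invented predictions from two models for three placeholder chemicals.
pred_demo = pd.DataFrame(
    {"pred_Fub_svr": [0.20, 1.10, -0.40], "pred_Fub_rf": [0.30, 0.90, -0.20]},
    index=["chem_a", "chem_b", "chem_c"],
)

# Consensus = row-wise mean of the two model predictions.
pred_demo["Consensus (SVM,RF)"] = pred_demo[["pred_Fub_svr", "pred_Fub_rf"]].mean(axis=1)
print(pred_demo)
```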

#### Comparing against the training set compounds - MMC2 in Supplementary corresponds to Fub_1139.csv here

Note that we can't check whether the predictions are exact matches, given that we only have one of the two model predictions and no file of published predictions to check against

In [156]:
fub = pd.read_csv(raw_dir+'Fub_1139.csv')

In [160]:
fub_expt = fub[fub['CASRN'].isin(predicted_Fub.index)]

In [164]:
fub_expt.set_index('CASRN', inplace = True)

In [165]:
fub_expt.loc[predicted_Fub.index]

CASRN,Name,Human.Funbound.plasma
106-44-5,4-methylphenol|P-cresol,0.325890
129453-61-8,Fulvestrant,0.001375
91-66-7,"N,n-diethyl aniline|N,n-diethylaniline",0.042869
162808-62-0,Caspofungin,0.035000
106791-40-6,Mivacurium (cis/cis)|Mivacurium mixture of iso...,0.700000
...,...,...
74051-80-2,Sethoxydim,0.055000
56-54-2,Quinidine,0.130000
924-16-3,N-nitrosodi-n-butylamine|N-nitrosodibutylamine,0.260237
161326-34-7,Fenamidone,0.024000
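
To view predictions and experimental values side by side rather than in two separate tables, the frames can be joined on CASRN. The sketch below uses invented values and placeholder ids. Whether `pred_Fub_rf` is on the same scale as `Human.Funbound.plasma` is not established in this notebook, so the join only aligns rows and makes no claim about agreement.

```python
import pandas as pd

# Invented predictions, indexed by a placeholder CASRN.
pred_demo = pd.DataFrame(
    {"pred_Fub_rf": [0.29, 2.61, 1.15]},
    index=pd.Index(["chem_a", "chem_b", "chem_c"], name="CASRN"),
)

# Invented experimental table with CASRN as an ordinary column.
expt_demo = pd.DataFrame(
    {
        "CASRN": ["chem_c", "chem_a", "chem_x"],
        "Human.Funbound.plasma": [0.04, 0.33, 0.50],
    }
)

# Inner join keeps only chemicals present in both tables.
comparison = pred_demo.join(expt_demo.set_index("CASRN"), how="inner")
print(comparison)
```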
