This notebook demonstrates how to predict IC50 from BindingDB data using the digital-cell model, and gives suggestions for improving the model's performance beyond what is shown here.

In [1]:
import os
os.chdir('./DeepPurpose/')

from DeepPurpose import utils, dataset, CompoundPred
from DeepPurpose import DTI as models
import digitalcell
import warnings
warnings.filterwarnings("ignore")

# First pass: DeepPurpose model

Only run this first cell if you haven't downloaded BindingDB yet! It's several gigabytes.

In [4]:
data_path = dataset.download_BindingDB('../data/')

Beginning to download dataset...
100% [......................................................................] 327218168 / 327218168Beginning to extract zip file...
Done!



## Data Importing

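The entirety of BindingDB is saved as a .tsv file. The function `digitalcell.process_BindingDB_omic` loads it into a Pandas dataframe. Rows with an unexpected number of fields are skipped, which is what the warnings in the output below are reporting.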

In [4]:
pwd

'C:\\Users\\Julia\\Dropbox\\Work\\insight\\omic\\digicell\\DeepPurpose'

In [5]:
data_path = '../data/BindingDB_All.tsv'
df, X_drugs, X_targets, y = digitalcell.process_BindingDB_omic(path = data_path, temp_ph = True)

Loading Dataset from path...


b'Skipping line 772572: expected 193 fields, saw 205\nSkipping line 772598: expected 193 fields, saw 205\n'
b'Skipping line 805291: expected 193 fields, saw 205\n'
b'Skipping line 827961: expected 193 fields, saw 265\n'
b'Skipping line 1231688: expected 193 fields, saw 241\n'
b'Skipping line 1345591: expected 193 fields, saw 241\nSkipping line 1345592: expected 193 fields, saw 241\nSkipping line 1345593: expected 193 fields, saw 241\nSkipping line 1345594: expected 193 fields, saw 241\nSkipping line 1345595: expected 193 fields, saw 241\nSkipping line 1345596: expected 193 fields, saw 241\nSkipping line 1345597: expected 193 fields, saw 241\nSkipping line 1345598: expected 193 fields, saw 241\nSkipping line 1345599: expected 193 fields, saw 241\n'
b'Skipping line 1358864: expected 193 fields, saw 205\n'
b'Skipping line 1378087: expected 193 fields, saw 241\nSkipping line 1378088: expected 193 fields, saw 241\nSkipping line 1378089: expected 193 fields, saw 241\nSkipping line 1378090: e

Beginning Processing...
There are 93336 drug target pairs.


The cell below just shows what the data looks like.

In [3]:
df.head()

Unnamed: 0,ID,InChI,SMILES,PubChem_ID,UniProt_ID,Organism,Target Sequence,Kd,IC50,Ki,EC50,kon,koff,pH,Temp,pIC50
180,181,InChI=1/C31H51N5O5/c1-20(2)28(32-22(5)37)30(40...,CC(C)[C@H](NC(C)=O)C(=O)N[C@@H](Cc1ccccc1)[C@@...,65023.0,,Human immunodeficiency virus 1,PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKM...,,8.5,,,,,6.0,37.0,8.065502
181,182,InChI=1/C33H55N5O7/c1-7-44-32(42)35-28(22(3)4)...,CCOC(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)[...,461984.0,,Human immunodeficiency virus 1,PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKM...,,177.0,,,,,6.0,37.0,6.751781
183,184,InChI=1/C35H59N5O9/c1-24(2)30(37-34(44)48-19-1...,COCCOC(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1...,461988.0,,Human immunodeficiency virus 1,PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKM...,,164.0,,,,,6.0,37.0,6.784891
184,185,InChI=1/C39H67N5O11/c1-28(2)34(41-38(48)54-23-...,COCCOCCOC(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccc...,461990.0,,Human immunodeficiency virus 1,PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKM...,,67.0,,,,,6.0,37.0,7.173277
185,186,InChI=1/C38H51N7O7/c1-24(2)34(43-38(51)52-3)37...,COC(=O)N[C@@H](C(C)C)C(=O)NN(C[C@H](O)[C@H](Cc...,461985.0,,Human immunodeficiency virus 1,PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKM...,,27.0,,,,,6.0,37.0,7.567031
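The `pIC50` column in the table above is consistent with IC50 values in nM being converted to a negative log molar scale with a small offset, i.e. $-\log_{10}(\mathrm{IC50} \cdot 10^{-9} + 10^{-10})$. The sketch below checks that against two of the rows shown. The function name `ic50_nm_to_pic50` is hypothetical, and the `1e-10` offset is an assumption inferred from the numbers in the table, not something read from the library source.

```python
import math

def ic50_nm_to_pic50(ic50_nm, offset=1e-10):
    """Convert an IC50 in nM to pIC50 (negative log10 of the molar value).

    The small offset keeps the log finite when IC50 is 0. Its value is an
    assumption inferred from the table above.
    """
    return -math.log10(ic50_nm * 1e-9 + offset)

# IC50 values taken from the first and last rows of df.head()
for ic50 in (8.5, 27.0):
    print(ic50, round(ic50_nm_to_pic50(ic50), 6))
```

Under that assumption the results agree with the `pIC50` values listed for those rows (8.065502 and 7.567031) to six decimal places.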


## Drug and target encoding

DeepPurpose supports several possible encodings for drugs and targets. Right now I'm using the Morgan Extended-Connectivity Fingerprints encoding for drugs and the conjoint triad features encoding for targets because they're not computationally intensive.

In [4]:
drug_encoding, target_encoding = 'Morgan', 'Conjoint_triad'
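To give a feel for why the conjoint triad encoding is cheap, here is a simplified sketch of the idea: each amino acid is mapped to one of seven classes, and the feature vector counts every run of three consecutive classes, giving 7 × 7 × 7 = 343 features. This is an illustration only; the class grouping follows the usual conjoint triad definition, and the encoder that DeepPurpose actually calls may count or normalize differently. The function name is hypothetical.

```python
from itertools import product

# Seven residue classes used by the conjoint triad method
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: str(i + 1) for i, group in enumerate(CLASSES) for aa in group}

def conjoint_triad_counts(sequence):
    """Count every window of three consecutive residue classes (343 bins)."""
    # Residues outside the 20 standard amino acids are dropped
    encoded = "".join(RESIDUE_TO_CLASS[aa] for aa in sequence if aa in RESIDUE_TO_CLASS)
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(encoded) - 2):
        counts[encoded[i:i + 3]] += 1
    return counts

# Visible prefix of the target sequence from df.head()
sequence = "PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKM"
features = conjoint_triad_counts(sequence)
print(len(features))           # one feature per class triple
print(sum(features.values()))  # number of windows = sequence length - 2
```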

The following cell splits the data into training, validation, and testing sets (70/10/20). I think these numbers are probably fine, but maybe there's a way to improve on them. One thing to note is that RDKit (the cheminformatics library) fails to parse some of the SMILES strings (~500 drugs), and those drugs are encoded as all-zero feature vectors instead. That's only a small fraction of the drugs in the training data, though.

In [5]:
train, val, test = utils.data_process(X_drugs, X_targets, y, 
                                drug_encoding, target_encoding, 
                                split_method='random',frac=[0.7,0.1,0.2])

Drug Target Interaction Prediction Mode...
in total: 93336 drug-target pairs
encoding drug...
unique drugs: 63535
rdkit not found this smiles for morgan: CSc1ccc(cc1)C1=C(C=C[N]([O-])=C1)[C@@H]1CCC(F)(F)C[C@H]1C(=O)NCC#N convert to all 0 features
rdkit not found this smiles for morgan: O=C1NC(=O)c2c1c1c3ccccc3n3[Ru](C#[O])[n+]4cccc2c4c13 convert to all 0 features
rdkit not found this smiles for morgan: CN1C(=O)c2c(C1=O)c1cc(F)c[n+]3[Ru](C#[O])n4c5ccc(O)cc5c2c4c13 convert to all 0 features
rdkit not found this smiles for morgan: NOOSc1ccc(CC[N]23CC4=CC=CC=[N]4[Re+]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features
rdkit not found this smiles for morgan: NOOSc1ccc(CC[N@@]23CC(=O)O[Re]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features
rdkit not found this smiles for morgan: CN1C=C[N]2=C1C[N]1(CCc3ccc(SOON)cc3)CC3=[N](C=CN3C)[Re+]21 convert to all 0 features
rdkit not found this smiles for morgan: NOOSc1ccc(NC(=S)NCCCCCCCC[N]23CC4=CC=CC=[N]4[Re+]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features

rdkit not found this smiles for morgan: Clc1cccc(Cl)c1NC(=O)N1CCN(CC1)c1ccc(cc1)-c1[nH]cnn1CCc1ccccc1 convert to all 0 features
rdkit not found this smiles for morgan: Clc1cccc(Cl)c1NC(=O)N1CCN(CC1)c1ccc(cc1)-c1ncnn1 convert to all 0 features
rdkit not found this smiles for morgan: Cc1ccc(cc1)-c1nc([nH]o1)-c1ccc(cc1)N1CCN(CC1)C(=O)Nc1ccccc1Cl convert to all 0 features
rdkit not found this smiles for morgan: Cn1cc(ccc1C(N)=O)S(=O)(=O)c1ccc(CN\C(Nc2ccncc2)=N/C#N)cc1 convert to all 0 features
rdkit not found this smiles for morgan: Cc1cc(nn1)-c1cc(C(=O)Nc2ccc(F)cn2)c2ncnn2c1 convert to all 0 features
rdkit not found this smiles for morgan: C[N]1=C(OC(=C1)c1cccc(Nc2nccc(n2)-c2cccnc2)c1)N1CCNC(=O)C11CC1 convert to all 0 features
rdkit not found this smiles for morgan: Cc1ccnc(Nc2cccc(C3=C[N](C)=C(O3)N3CCNC(=O)C33CC3)c2C)n1 convert to all 0 features
rdkit not found this smiles for morgan: CN1C=[N](C)C=C1c1ccc(C[C@H](NC(=O)[C@H]2N[C@@H]3CC[C@H]2C3)C#N)c(F)c1 convert to all 0 features
rdkit no
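For reference, a random split with `frac=[0.7,0.1,0.2]` amounts to shuffling the row indices and cutting them at 70% and 80%. The sketch below shows that logic with the standard library only; it is not DeepPurpose's implementation, the function name and `seed` argument are hypothetical, and the exact set sizes the library produces may differ by a row or two because of rounding.

```python
import random

def random_split(n_rows, frac=(0.7, 0.1, 0.2), seed=0):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(frac[0] * n_rows)
    n_val = int(frac[1] * n_rows)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = random_split(93336)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which matters when comparing models trained at different times.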

## Model configuration

DeepPurpose's model configuration utility is a wrapper for generating neural networks using PyTorch. The list of options for hyperparameters is [here](https://github.com/kexinhuang12345/DeepPurpose/blob/e169e2f550694145077bb2af95a4031abe400a77/DeepPurpose/utils.py#L486). Several types of model architecture are supported, including CNNs, RNNs, MPNNs, MLPs, and transformers. I think there is a lot of potential work to be done on hyperparameter optimization here. The hyperparameters used below are suggested defaults that aren't too computationally intensive. Note that the `mpnn_*` settings should only matter when the drug encoding is `'MPNN'`; with the Morgan and conjoint triad encodings chosen above, I'd expect each input to go through an MLP encoder before the classifier, whose three hidden layers are set by `cls_hidden_dims`.

In [6]:
config = utils.generate_config(drug_encoding = drug_encoding, 
                               target_encoding = target_encoding,
                         cls_hidden_dims = [1024,1024,512], 
                         train_epoch = 3, 
                         LR = 0.001, 
                         batch_size = 128,
                         hidden_dim_drug = 128,
                         mpnn_hidden_size = 128,
                         mpnn_depth = 3
                        )

In [7]:
model = models.model_initialize(**config)
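As a starting point for the hyperparameter optimization mentioned above, a plain grid search can be sketched by enumerating combinations of candidate values. The keys below reuse argument names from the `generate_config` call above; the candidate values and the helper name `config_grid` are hypothetical.

```python
from itertools import product

# Candidate values (illustrative only); keys match generate_config arguments used above
search_space = {
    "LR": [1e-3, 1e-4],
    "batch_size": [64, 128],
    "cls_hidden_dims": [[1024, 1024, 512], [512, 256]],
}

def config_grid(space):
    """Yield one keyword-argument dict per combination of candidate values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

grid = list(config_grid(search_space))
print(len(grid))   # 2 * 2 * 2 combinations
for overrides in grid[:2]:
    print(overrides)
```

Each dictionary could then be unpacked into the configuration call, e.g. `utils.generate_config(drug_encoding = drug_encoding, target_encoding = target_encoding, **overrides)`, training one model per combination and comparing them on the validation set. With training times of 20 minutes or more per model, a small grid or a random subset is probably the practical choice.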

## Model training and loading

Using the hyperparameters above, the model takes about 20 minutes on a GPU (1.5 hours on a laptop CPU) to train on BindingDB. For demo purposes I'm going to just load a model I trained earlier today, but I used the exact same code as above.

In [None]:
#run this if you want to train a new one
model.train(train, val, test, verbose = True)
model.save_model('../model-10-11')

In [8]:
pwd

'C:\\Users\\Julia\\Dropbox\\Work\\insight\\omic\\digicell\\DeepPurpose\\DeepPurpose'

In [6]:
#run this if you want to use my trained one
model = models.model_pretrained(path_dir = '../model-9-24')

## Model validation

The following code is built into the DeepPurpose `train` method; I've just pulled it out so I can evaluate on the dataset that was set aside for validation during the data processing step.

In [13]:
import torch
from torch.utils import data

params = {'batch_size': config['batch_size'],
    'shuffle': True,
    'num_workers': config['num_workers'],
    'drop_last': False}

validation_generator = data.DataLoader(utils.data_process_loader(val.index.values, val.Label.values, val, **config), **params)

The available performance metrics are mean squared error, Pearson's R, p-value of Pearson's R, and concordance index.

In [14]:
model.binary = False
mse, pearsonr, pvalue, concordance_index = models.DBTA.test_(model, validation_generator, model.model, test=True)
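# `auc` is not assigned in this cell, so the value printed below comes from a
# variable left over from earlier in the session; the metrics computed here
# are mse, pearsonr, pvalue, and concordance_index
print(auc)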

0.8197822689413422
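For readers unfamiliar with the metrics listed above, here is a small standard-library sketch of three of them (the p-value is omitted). These are textbook definitions written for clarity, not the functions DeepPurpose calls internally; in particular, the pairwise loop in `concordance_index` is quadratic and only suitable for small arrays.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def concordance_index(y_true, y_pred):
    """Fraction of pairs with different labels that the predictions rank in
    the same order; tied predictions count as half."""
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal labels are not comparable
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    return concordant / comparable

# Toy pIC50 values, not taken from the dataset
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.8, 7.4, 7.9]
print(mse(y_true, y_pred), pearson_r(y_true, y_pred), concordance_index(y_true, y_pred))
```

A concordance index of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes it a useful complement to MSE when the goal is to prioritize compounds, not to predict exact values.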


## Model usage

To predict IC50, run `data_process` to create a dataset consisting of a single drug-target pair, and then run `predict`, which is just a wrapper on `test_`. The output is in pIC50.

In [28]:
X_drug = ['CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N']
X_target = ['MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL']
X_pred = utils.data_process(X_drug, X_target, y, 
                                drug_encoding, target_encoding, 
                                split_method='no_split')
y_pred = model.predict(X_pred)
print('The predicted score is ' + str(y_pred))

Drug Target Interaction Prediction Mode...
in total: 1 drug-target pairs
encoding drug...
unique drugs: 1
encoding protein...
unique target sequence: 1
splitting dataset...
do not do train/test split on the data for already splitted data
predicting...
The predicted score is [7.395412921905518]
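Since the prediction is in pIC50, it can be converted back to a concentration for interpretation. The sketch below assumes the labels were produced as $-\log_{10}(\mathrm{IC50_{nM}} \cdot 10^{-9} + 10^{-10})$, which is consistent with the `IC50` and `pIC50` columns shown by `df.head()` earlier; both the offset and the function name are assumptions.

```python
def pic50_to_ic50_nm(pic50, offset=1e-10):
    """Invert pIC50 = -log10(IC50_nM * 1e-9 + offset) to recover IC50 in nM."""
    return (10 ** (-pic50) - offset) * 1e9

# Predicted score from the cell above
ic50_nm = pic50_to_ic50_nm(7.395412921905518)
print(f"{ic50_nm:.1f} nM")
```

A predicted pIC50 of about 7.4 therefore corresponds to an IC50 in the tens of nanomolar.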


# Second pass: Gradient Boosting with additional information

My main innovation here has been to improve upon DeepPurpose by incorporating additional information such as temperature, pH, and model organism in a gradient boosting model that adjusts the first model's predictions.


## Data importing

There are a number of supported encodings; calling `digitalcell.feature_select` without specifying any runs a default set that has low overhead and performs well in the second-pass model.

TODO: import support for other datasets (incl conversion to SMILES), list of supported encodings

In [7]:
df_data = digitalcell.feature_select(df)

Encoding:  drug2emb_encoder
Elapsed time:  30.04101538658142
Encoding:  smiles2daylight
rdkit not found this smiles: CSc1ccc(cc1)C1=C(C=C[N]([O-])=C1)[C@@H]1CCC(F)(F)C[C@H]1C(=O)NCC#N convert to all 0 features
rdkit not found this smiles: O=C1NC(=O)c2c1c1c3ccccc3n3[Ru](C#[O])[n+]4cccc2c4c13 convert to all 0 features
rdkit not found this smiles: CN1C(=O)c2c(C1=O)c1cc(F)c[n+]3[Ru](C#[O])n4c5ccc(O)cc5c2c4c13 convert to all 0 features
rdkit not found this smiles: NOOSc1ccc(CC[N]23CC4=CC=CC=[N]4[Re+]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features
rdkit not found this smiles: NOOSc1ccc(CC[N@@]23CC(=O)O[Re]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features
rdkit not found this smiles: CN1C=C[N]2=C1C[N]1(CCc3ccc(SOON)cc3)CC3=[N](C=CN3C)[Re+]21 convert to all 0 features
rdkit not found this smiles: NOOSc1ccc(NC(=S)NCCCCCCCC[N]23CC4=CC=CC=[N]4[Re+]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features
rdkit not found this smiles: NOOSc1ccc(NC(=S)NCCOCCOCC[N]23CC4=CC=CC=[N]4[Re+]2[N]2=C(C3)C=CC=C2)cc1 c

rdkit not found this smiles: Cc1cc(nn1)-c1cc(C(=O)Nc2ccc(F)cn2)c2ncnn2c1 convert to all 0 features
rdkit not found this smiles: C[N]1=C(OC(=C1)c1cccc(Nc2nccc(n2)-c2cccnc2)c1)N1CCNC(=O)C11CC1 convert to all 0 features
rdkit not found this smiles: Cc1ccnc(Nc2cccc(C3=C[N](C)=C(O3)N3CCNC(=O)C33CC3)c2C)n1 convert to all 0 features
rdkit not found this smiles: CN1C=[N](C)C=C1c1ccc(C[C@H](NC(=O)[C@H]2N[C@@H]3CC[C@H]2C3)C#N)c(F)c1 convert to all 0 features
rdkit not found this smiles: Oc1cnc(Cn2cnc(c(Oc3cc(Cl)cc(c3)C#N)c2=O)C(F)(F)F)c[nH]1 convert to all 0 features
Elapsed time:  194.3962700366974
Encoding:  CalculateConjointTriad
Elapsed time:  195.450261592865
Encoding:  protein2emb_encoder
Elapsed time:  228.64833068847656


TODO: The option to include entries that don't have pH and temperature and impute those values should be straightforward to implement
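One simple way to approach the TODO above is median imputation: fill each missing pH or temperature with the median of the observed values in that column. The sketch below uses plain dictionaries so it runs without any dependencies; the function name is hypothetical and this is not part of `digitalcell`.

```python
from statistics import median

def impute_median(rows, columns=("pH", "Temp")):
    """Fill missing (None) values in the given columns with the column median."""
    filled = [dict(row) for row in rows]          # leave the input untouched
    for col in columns:
        observed = [r[col] for r in rows if r.get(col) is not None]
        fill_value = median(observed)
        for r in filled:
            if r.get(col) is None:
                r[col] = fill_value
    return filled

# Toy rows, not taken from BindingDB
rows = [
    {"pH": 6.0, "Temp": 37.0},
    {"pH": None, "Temp": 25.0},
    {"pH": 7.4, "Temp": None},
]
for row in impute_median(rows):
    print(row)
```

On a Pandas dataframe the same idea is a one-liner per column, e.g. `df['pH'] = df['pH'].fillna(df['pH'].median())`. Adding an indicator column that marks which values were imputed would let the gradient boosting model treat them differently.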

## Getting first pass predictions from Model 1

This cell loads the initial DeepPurpose model and converts the current training data into the format used by said model. Calling `utils.data_process` may not be necessary if you're computing both models at the same time, but if the encodings used in Model 1 differ from those used in Model 2, trying to run the model without doing this might break it.

In [8]:
model = models.model_pretrained(path_dir = '../model-9-24')
X_pred = utils.data_process(X_drugs, X_targets, y, 
                                model.drug_encoding, model.target_encoding, 
                                split_method='no_split')

Drug Target Interaction Prediction Mode...
in total: 93336 drug-target pairs
encoding drug...
unique drugs: 63535
rdkit not found this smiles for morgan: CSc1ccc(cc1)C1=C(C=C[N]([O-])=C1)[C@@H]1CCC(F)(F)C[C@H]1C(=O)NCC#N convert to all 0 features
rdkit not found this smiles for morgan: O=C1NC(=O)c2c1c1c3ccccc3n3[Ru](C#[O])[n+]4cccc2c4c13 convert to all 0 features
rdkit not found this smiles for morgan: CN1C(=O)c2c(C1=O)c1cc(F)c[n+]3[Ru](C#[O])n4c5ccc(O)cc5c2c4c13 convert to all 0 features
rdkit not found this smiles for morgan: NOOSc1ccc(CC[N]23CC4=CC=CC=[N]4[Re+]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features
rdkit not found this smiles for morgan: NOOSc1ccc(CC[N@@]23CC(=O)O[Re]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features
rdkit not found this smiles for morgan: CN1C=C[N]2=C1C[N]1(CCc3ccc(SOON)cc3)CC3=[N](C=CN3C)[Re+]21 convert to all 0 features
rdkit not found this smiles for morgan: NOOSc1ccc(NC(=S)NCCCCCCCC[N]23CC4=CC=CC=[N]4[Re+]2[N]2=C(C3)C=CC=C2)cc1 convert to all 0 features

rdkit not found this smiles for morgan: Clc1cccc(Cl)c1NC(=O)N1CCN(CC1)c1ccc(cc1)-c1[nH]cnn1CCc1ccccc1 convert to all 0 features
rdkit not found this smiles for morgan: Clc1cccc(Cl)c1NC(=O)N1CCN(CC1)c1ccc(cc1)-c1ncnn1 convert to all 0 features
rdkit not found this smiles for morgan: Cc1ccc(cc1)-c1nc([nH]o1)-c1ccc(cc1)N1CCN(CC1)C(=O)Nc1ccccc1Cl convert to all 0 features
rdkit not found this smiles for morgan: Cn1cc(ccc1C(N)=O)S(=O)(=O)c1ccc(CN\C(Nc2ccncc2)=N/C#N)cc1 convert to all 0 features
rdkit not found this smiles for morgan: Cc1cc(nn1)-c1cc(C(=O)Nc2ccc(F)cn2)c2ncnn2c1 convert to all 0 features
rdkit not found this smiles for morgan: C[N]1=C(OC(=C1)c1cccc(Nc2nccc(n2)-c2cccnc2)c1)N1CCNC(=O)C11CC1 convert to all 0 features
rdkit not found this smiles for morgan: Cc1ccnc(Nc2cccc(C3=C[N](C)=C(O3)N3CCNC(=O)C33CC3)c2C)n1 convert to all 0 features
rdkit not found this smiles for morgan: CN1C=[N](C)C=C1c1ccc(C[C@H](NC(=O)[C@H]2N[C@@H]3CC[C@H]2C3)C#N)c(F)c1 convert to all 0 features
rdkit no

We then use the first model to generate estimates of pIC50 for the whole training dataset.

In [9]:
first_pass = model.predict(X_pred)

predicting...


## Training model 2

This step converts the features generated by the encoding step into vectors to be fed into the model. You can expect this to take over 5 minutes.

In [10]:
X_train,X_test,y_train,y_test,cat_list = digitalcell.data_process_omic(df_data, first_pass)

Converting features to vectors (this takes a while)


The following cells generate a gradient boosting model that improves on the performance of model 1 by incorporating additional information and encodings. The model actually consists of 3 estimators: a point estimate plus lower and upper quantile models (the 10th and 90th percentiles), which together give an 80 percent prediction interval. You can adjust the bounds of the interval with `lower_alpha` and `upper_alpha` when you create the model. Each estimator takes about 3 or 4 minutes to build.

In [11]:
model2 = digitalcell.GBoostModel(
    lower_alpha=0.1, upper_alpha=0.9, n_estimators=10, org_list = cat_list, init_model = model)

In [12]:
# Fit and make predictions
model2.fit(X_train, y_train)

Calculating estimates...
Getting lower bound...
Getting upper bound...
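The lower and upper estimators are trained with a quantile (pinball) loss, which penalizes under- and over-prediction asymmetrically. The toy example below, written with the standard library only, shows that the constant minimizing this loss at `alpha = 0.1` and `alpha = 0.9` sits at the 10th and 90th percentile of the targets. It illustrates the loss only; it is not how the gradient boosting models are fitted.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile loss: under-predictions cost alpha per unit,
    over-predictions cost (1 - alpha) per unit."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)

y = [float(v) for v in range(1, 100)]   # toy targets 1..99

def best_constant(alpha):
    """Constant prediction, chosen among the targets, with the lowest loss."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), alpha))

print(best_constant(0.1), best_constant(0.9))
```

Widening the interval (say `lower_alpha=0.05, upper_alpha=0.95`) trades tighter bounds for higher coverage.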


We can quickly use this model to make predictions for all of the testing data, which is useful for getting performance metrics.

In [62]:
model2.predictions = model2.predict(X_test)

In [64]:
X_test.shape

(18874, 3664)
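A useful check on the prediction intervals is their empirical coverage: the fraction of held-out targets that actually fall between the lower and upper predictions. The sketch below uses toy numbers; the helper names are hypothetical.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    inside = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return inside / len(y_true)

def mean_interval_width(lower, upper):
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Toy values standing in for y_test and the "lower"/"upper" prediction columns
y_true = [6.1, 7.3, 5.2, 8.8, 6.9]
lower = [5.0, 6.5, 5.5, 7.0, 6.0]
upper = [7.0, 8.0, 6.5, 8.5, 7.5]

print(interval_coverage(y_true, lower, upper))
print(mean_interval_width(lower, upper))
```

With the notebook's objects, the inputs would be `y_test`, `model2.predictions["lower"]`, and `model2.predictions["upper"]`. If the quantile models are well calibrated, coverage should land near 0.8 for the 0.1/0.9 bounds used here.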

Be sure to pickle your model to preserve its deliciousness. (This cell just saves the model to a file.)

TODO: the date should not be hardcoded in

In [14]:
import pickle

filename = '../model_10_12_2.sav'
# Use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model2, f)
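One way to address the TODO about the hardcoded date is to build the filename from the current date. This is a minimal sketch; the helper names, the `model` prefix, and the year-month-day layout are all assumptions, not part of `digitalcell`.

```python
import pickle
from datetime import date
from pathlib import Path

def dated_model_path(directory, prefix="model", today=None):
    """Build a path like <directory>/<prefix>_YYYY_MM_DD.sav."""
    today = today or date.today()
    return Path(directory) / f"{prefix}_{today:%Y_%m_%d}.sav"

def save_model(obj, directory, prefix="model"):
    """Pickle obj to a dated file and return the path that was written."""
    path = dated_model_path(directory, prefix)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path

# Arbitrary example date, used only to show the naming pattern
print(dated_model_path("..", today=date(2000, 1, 2)).name)
```

Returning the path from `save_model` makes it easy to log which file a given run produced.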

## Usage

The usage is pretty straightforward, but currently takes a few lines of code, which I will write a wrapper function for shortly.

TODO: wrapper
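As a rough idea of what such a wrapper could look like, the sketch below chains the two passes for a single drug-target pair. The function name `predict_pair` is hypothetical. The processing function is passed in as an argument (for example `utils.data_process`) so the sketch itself has no dependency on DeepPurpose, and the placeholder label `[0]` is an assumption about what the processing step will accept.

```python
def predict_pair(first_model, second_model, data_process, smiles, sequence):
    """Run both passes for one drug-target pair and return the interval predictions.

    first_model:  the DeepPurpose model (needs drug_encoding, target_encoding, predict)
    second_model: the gradient boosting model (needs predict_drug)
    data_process: processing function, e.g. DeepPurpose's utils.data_process
    """
    X_drug, X_target = [smiles], [sequence]
    # Placeholder label: required by the call signature, not used for prediction
    encoded = data_process(X_drug, X_target, [0],
                           first_model.drug_encoding, first_model.target_encoding,
                           split_method='no_split')
    first_pass = first_model.predict(encoded)
    return second_model.predict_drug(X_drug, X_target, first_pass)
```

With the objects defined in this notebook, the call would be `predict_pair(model, model2, utils.data_process, X_drug[0], X_target[0])`.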

In [15]:
import pickle

with open("../model_10_12_2.sav", "rb") as f:
    model2 = pickle.load(f)

In [16]:
X_drug = ['CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N']
X_target = ['MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL']

In [109]:
test_1 = utils.data_process(X_drug, X_target, y, 
                                model.drug_encoding, model.target_encoding, 
                                split_method='no_split')
test_2 = model.predict(test_1)
test_3 = predict_drug(model2, X_drug,X_target,test_2)

Drug Target Interaction Prediction Mode...
in total: 1 drug-target pairs
encoding drug...
unique drugs: 1
encoding protein...
unique target sequence: 1
splitting dataset...
do not do train/test split on the data for already splitted data
predicting...
Encoding:  drug2emb_encoder
Elapsed time:  0.0009982585906982422
Encoding:  smiles2daylight
Elapsed time:  0.004998207092285156
Encoding:  CalculateConjointTriad
Elapsed time:  0.0070002079010009766
Encoding:  protein2emb_encoder
Elapsed time:  0.008998394012451172
pH : <class 'numpy.int64'>
Temp : <class 'numpy.int64'>
drug2emb_encoder : <class 'tuple'>
50
50
smiles2daylight : <class 'numpy.ndarray'>
2048
CalculateConjointTriad : <class 'numpy.ndarray'>
343
protein2emb_encoder : <class 'tuple'>
545
545
var_Abelson murine leukemia virus : <class 'numpy.float64'>
var_Agaricus bisporus : <class 'numpy.float64'>
var_Aspergillus fumigatiaffinis : <class 'numpy.float64'>
var_Avian sarcoma virus : <class 'numpy.float64'>
var_Bacillus anthracis 

In [48]:
test_4 = model2.predict_drug(X_drug,X_target,test_2)

Encoding:  drug2emb_encoder
Elapsed time:  0.0020020008087158203
Encoding:  smiles2daylight
Elapsed time:  0.005998849868774414
Encoding:  CalculateConjointTriad
Elapsed time:  0.00900125503540039
Encoding:  protein2emb_encoder
Elapsed time:  0.010999441146850586
pH : <class 'numpy.int64'>
Temp : <class 'numpy.int64'>
drug2emb_encoder : <class 'tuple'>
50
50
smiles2daylight : <class 'numpy.ndarray'>
2048
CalculateConjointTriad : <class 'numpy.ndarray'>
343
protein2emb_encoder : <class 'tuple'>
545
545
var_Abelson murine leukemia virus : <class 'numpy.float64'>
var_Agaricus bisporus : <class 'numpy.float64'>
var_Aspergillus fumigatiaffinis : <class 'numpy.float64'>
var_Avian sarcoma virus : <class 'numpy.float64'>
var_Bacillus anthracis : <class 'numpy.float64'>
var_Bacillus cereus (strain ATCC 14579 / DSM 31) : <class 'numpy.float64'>
var_Bombyx mori : <class 'numpy.float64'>
var_Borrelia burgdorferi : <class 'numpy.float64'>
var_Bos taurus : <class 'numpy.float64'>
var_Caenorhabditis 

TypeError: 'int' object is not subscriptable

Coming soon: data visualization, feature importances, recommended drugs per target

In [111]:
test_3

Unnamed: 0,lower,mid,upper
0,2.688733,8.145013,9.158571


In [66]:
X_train

array([[   6.        ,   37.        , 2266.        , ...,    0.        ,
           0.        ,    5.93279696],
       [   6.        ,   37.        ,  240.        , ...,    0.        ,
           0.        ,    6.6564703 ],
       [   6.        ,   37.        , 2227.        , ...,    0.        ,
           0.        ,    6.89564133],
       ...,
       [   2.5       ,   25.        ,  622.        , ...,    0.        ,
           0.        ,    6.26271915],
       [   2.5       ,   25.        , 1953.        , ...,    0.        ,
           0.        ,    7.46452904],
       [   2.5       ,   25.        ,  622.        , ...,    0.        ,
           0.        ,    8.14915848]])

pH : <class 'numpy.int64'>
Temp : <class 'numpy.int64'>
drug2emb_encoder : <class 'tuple'>
50
50
smiles2daylight : <class 'numpy.ndarray'>
2048
CalculateConjointTriad : <class 'numpy.ndarray'>
343
protein2emb_encoder : <class 'tuple'>
545
545
var_Abelson murine leukemia virus : <class 'numpy.float64'>
var_Agaricus bisporus : <class 'numpy.float64'>
var_Aspergillus fumigatiaffinis : <class 'numpy.float64'>
var_Avian sarcoma virus : <class 'numpy.float64'>
var_Bacillus anthracis : <class 'numpy.float64'>
var_Bacillus cereus (strain ATCC 14579 / DSM 31) : <class 'numpy.float64'>
var_Bombyx mori : <class 'numpy.float64'>
var_Borrelia burgdorferi : <class 'numpy.float64'>
var_Bos taurus : <class 'numpy.float64'>
var_Caenorhabditis elegans : <class 'numpy.float64'>
var_Canavalia ensiformis : <class 'numpy.float64'>
var_Candida albicans : <class 'numpy.float64'>
var_Canis lupus dingo : <class 'numpy.float64'>
var_Clostridium botulinum : <class 'numpy.float64'>
var_Crithidia fasciculata : <cla

In [114]:
digitalcell.model_metrics(model2, typelist, X_test, y_test)

TypeError: 'NoneType' object is not subscriptable

In [108]:
   def predict_drug(self, X_drug, X_target, first_pass = None, temp = 35, pH = 7, org = None):
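        """
        Predict with all 3 models for the given drug-target pairs
        
        :param X_drug: list of SMILES strings
        :param X_target: list of target sequences
        :param first_pass: model 1 predictions; computed with init_model if None
        :return predictions: dataframe of predictions
        
        TODO: parallelize this code across processors
        """
        if first_pass is None:
            # Placeholder labels: data_process expects y, but prediction ignores it
            proc = utils.data_process(X_drug, X_target, [0] * len(X_drug), 
                                self.init_model.drug_encoding, self.init_model.target_encoding, 
                                split_method='no_split')
            first_pass = self.init_model.predict(proc)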
        
        df_data = pd.DataFrame()
        df_data['SMILES'] = X_drug
        df_data['Target Sequence'] = X_target
        if self.temp_ph:
            df_data['pH'] = pH
            df_data['Temp'] = temp
        
        df_data = digitalcell.feature_select(df_data, self.drug_func_list, self.prot_func_list)
        df_data=df_data.join(self.org_list)
        if org is not None:
            df_data['var_'+org] = 1
            
        discard=['SMILES','Target Sequence','Organism','IC50','pIC50','ID','InChI','PubChem_ID','UniProt_ID','Kd','Ki','EC50','kon','koff']
        df_vars=df_data.columns.values.tolist()
        to_keep=[i for i in df_vars if i not in discard]
        X=df_data[to_keep]
        X['estimate'] = first_pass

        typelist = digitalcell.get_typelist(X)
        Z = np.asarray(digitalcell.flattener(np.asarray(X)))
        s=np.isnan(Z)
        Z[s]=0.0
        Z = Z.reshape(1, -1)

        predictions = pd.DataFrame()
        predictions["lower"] = self.lower_model.predict(Z)
        predictions["mid"] = self.mid_model.predict(Z)
        predictions["upper"] = self.upper_model.predict(Z)

        return predictions

In [89]:
import numpy as np
import pandas as pd

from DeepPurpose.utils import *
from DeepPurpose.model_helper import Encoder_MultipleLayers, Embeddings        
from DeepPurpose.encoders import *
from DeepPurpose import DTI

from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

from matplotlib import pyplot as plt

class GBoostModel(BaseEstimator):
    """
    Adapted from https://github.com/WillKoehrsen/Data-Analysis/blob/master/prediction-intervals/prediction_intervals.ipynb
    Model that produces prediction intervals with a Scikit-Learn interface
    
    :param lower_alpha: lower quantile for prediction, default=0.1
    :param upper_alpha: upper quantile for prediction, default=0.9
    :param **kwargs: additional keyword arguments for creating a GradientBoostingRegressor model
    """

    def __init__(self, lower_alpha=0.1, upper_alpha=0.9, drug_func_list= [drug2emb_encoder,smiles2daylight], 
                    prot_func_list = [CalculateConjointTriad, protein2emb_encoder], temp_ph = True, org_list = None,
                    n_estimators = 10, init_model = None, **kwargs):
        self.lower_alpha = lower_alpha
        self.upper_alpha = upper_alpha
        self.init_model = init_model

        self.drug_func_list = drug_func_list
        self.prot_func_list = prot_func_list
        self.temp_ph = temp_ph
        if org_list is not None:
            self.org_list = org_list
        else:
            self.org_list = []
            #TODO: figure out a way to do a default here
            
        # Three separate models
        self.lower_model = GradientBoostingRegressor(loss="quantile", alpha=self.lower_alpha, n_estimators = n_estimators, **kwargs)
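        # "squared_error" is the scikit-learn >= 1.0 name for the least-squares loss (formerly "ls")
        self.mid_model = GradientBoostingRegressor(loss="squared_error", n_estimators = n_estimators, **kwargs)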
        self.upper_model = GradientBoostingRegressor(loss="quantile", alpha=self.upper_alpha, n_estimators = n_estimators, **kwargs)
        self.predictions = None
        self.typelist = []

    def fit(self, X_train, y_train):
        """
        Fit all three models
            
        :param X_train: train features
        :param y_train: train targets
        
        TODO: parallelize this code across processors
        """
        
        print("Calculating estimates...")
        self.mid_model.fit(X_train, y_train)
        print("Getting lower bound...")
        self.lower_model.fit(X_train, y_train)
        print("Getting upper bound...")
        self.upper_model.fit(X_train, y_train)

    def predict(self, Z):
        """Predict lower, mid, and upper values for the feature matrix Z"""
        predictions = pd.DataFrame()
        predictions["lower"] = self.lower_model.predict(Z)
        predictions["mid"] = self.mid_model.predict(Z)
        predictions["upper"] = self.upper_model.predict(Z)
        self.predictions = predictions
        return predictions
    
    def predict_drug(self, X_drug, X_target, first_pass = None, temp = 35, pH = 7, org = None):
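        """
        Predict with all 3 models for the given drug-target pairs
        
        :param X_drug: list of SMILES strings
        :param X_target: list of target sequences
        :param first_pass: model 1 predictions; computed with init_model if None
        :return predictions: dataframe of predictions
        
        TODO: parallelize this code across processors
        """
        if first_pass is None:
            # Placeholder labels: data_process expects y, but prediction ignores it
            proc = data_process(X_drug, X_target, [0] * len(X_drug), 
                                self.init_model.drug_encoding, self.init_model.target_encoding, 
                                split_method='no_split')
            first_pass = self.init_model.predict(proc)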
        
        df_data = pd.DataFrame()
        df_data['SMILES'] = X_drug
        df_data['Target Sequence'] = X_target
        if self.temp_ph:
            df_data['pH'] = pH
            df_data['Temp'] = temp
        
        df_data = feature_select(df_data, self.drug_func_list, self.prot_func_list)
        df_data=df_data.join(self.org_list)
        if org is not None:
            df_data['var_'+org] = 1
            
        discard=['SMILES','Target Sequence','Organism','IC50','pIC50','ID','InChI','PubChem_ID','UniProt_ID','Kd','Ki','EC50','kon','koff']
        df_vars=df_data.columns.values.tolist()
        to_keep=[i for i in df_vars if i not in discard]
        X=df_data[to_keep]
        X[len(X.columns)] = first_pass
        self.typelist = get_typelist(X)
        Z = flattener(np.asarray(X))
        
        predictions = self.predict(Z)

        return predictions

    def plot_intervals(self, mid=False, start=None, stop=None):
        """
        Plot the prediction intervals
        
        :param mid: boolean for whether to show the mid prediction
        :param start: optional parameter for subsetting start of predictions
        :param stop: optional parameter for subsetting end of predictions
    
        :return fig: plotly figure
        """

        if self.predictions is None:
            raise ValueError("This model has not yet made predictions.")
        
        fig = plot_intervals(self.predictions, mid=mid, start=start, stop=stop)
        return fig
    
    def calculate_and_show_errors(self):
        """
        Calculate and display the errors associated with a set of prediction intervals
        
        :return fig: plotly boxplot of absolute error metrics
        """
        if self.predictions is None:
            raise ValueError("This model has not yet made predictions.")
        
        calculate_error(self.predictions)
        fig = show_metrics(self.predictions)
        return fig