**This is a basic tutorial for using DXGB as a module**

In [1]:
import os
import pandas as pd
import DXGB
from DXGB.get_DXGB import get_DXGB

**get_DXGB** is the main function of DXGB. It provides many options, as shown below:

In [2]:
help(get_DXGB)

Help on function get_DXGB in module DXGB.get_DXGB:

get_DXGB(model, modeldir, datadir, pdbid, outfile, runfeatures, water, opt, rewrite, average=True, modelidx='1', featuretype='all', runrf=False, runscore=True)
    Get deltaVinaXGB all features and score
    
    Parameters
    ----------
    model : str
        model type, "DXGB" is the one for our previous trained deltaVinaXGB.
    modeldir : str
        directory for models, the previous trained models should be in the modeldir/model.
    datadir : str
        directory for data(input structures or features), and all output files.
    pdbid : str
        unique index for input data, can be pdb id, or any other customrized index.
    outfile : str
        output score file name (format is csv)
    runfeatures : bool
        whether to calculate features.
    water : str
        water type, can be:
        "rbw" --> all types of water, 
        "rw" --> only receptor water, 
        "bw" --> only bridging water,
        False --> no 

Here we show some examples of predicting scores and calculating features.

### Calculate Scores

#### Only structures provided 

When only structures are provided, we need to calculate the features first and then predict scores from them.

Our model has been saved in *modeldir/model/*. <br>
If you have another trained model, you can save it in *modeldir/yourmodeltag* and use that model to predict scores. In that case, make sure your Input.csv (or Input_min/min_RW/min_BW.csv) contains the right features in the right order.
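
Checking the feature order before prediction can be sketched as follows. This is a minimal illustration using only the standard library: the helper `check_feature_order` and the column names `feat_a`, `feat_b`, `feat_c` are hypothetical and not part of DXGB. In practice the expected list would come from an Input.csv generated by the DXGB scripts.

```python
import csv
import io


def check_feature_order(csv_text, expected_columns):
    """Compare a CSV header with the expected column order and return (ok, message)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [c for c in expected_columns if c not in header]
    if missing:
        return False, "missing columns: " + ", ".join(missing)
    extra = [c for c in header if c not in expected_columns]
    if extra:
        return False, "unexpected columns: " + ", ".join(extra)
    if header != list(expected_columns):
        return False, "columns are present but not in the expected order"
    return True, "header matches"


# Hypothetical feature names, for illustration only.
expected = ["pdb", "feat_a", "feat_b", "feat_c"]
good = "pdb,feat_a,feat_b,feat_c\n2al5,0.1,0.2,0.3\n"
shuffled = "pdb,feat_b,feat_a,feat_c\n2al5,0.2,0.1,0.3\n"

print(check_feature_order(good, expected))
print(check_feature_order(shuffled, expected))
```

For a file on disk, the same check could be applied to the first line of the file instead of an in-memory string.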

In [3]:
model = "DXGB" ### tag for our deltaVinaXGB model
modeldir = "/Users/jianinglu1/Documents/deltaVinaXGB/model" ### absolute model dir
datadir = "/Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5" ### data dir
pdbid = "2al5" ### input index, can be another type of identifier
outfile = "score.csv" ### output file name
runfeatures = True ### we want to calculate features
water = "rbw" ### consider both receptor water and bridging water
opt = "rbwo" ### optimize structures in all situations
rewrite = False ### we don't want to overwrite previously generated structures and conformations
average = True ### we use ensemble models (10 models)
featuretype = "all" ### since we don't have any features yet, we want to calculate all of them
runrf = True ### besides Vina and deltaVinaXGB, we also want to calculate deltaVinaRF
runscore = True ### we want to calculate scores

In [4]:
cwd = os.getcwd() ### feature calculation runs inside datadir, so save the current directory and return to it after predicting scores
get_DXGB(model, modeldir, datadir, pdbid, outfile, runfeatures, water, opt, rewrite, average=average, featuretype=featuretype, runrf=runrf, runscore=runscore)
os.chdir(cwd)

pdb index: 2al5
file directory: /Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5
feature will be calculated:all
Ligand for conformation stability:2al5_ligand.mol2
Ligand for Vina, SASA, BA, ION:2al5_ligand_rename.pdb
Protein without water molecules:2al5_protein.pdb
Protein with water molecules:2al5_protein_all.pdb
Finish Input Preparation
Protein Water: calculate both RW and BW
RW satisfies distance requirement:572
563 RW have been saved in 2al5_protein_RW.pdb
BW satisfies structural requirement:3
3 BW have been saved in 2al5_protein_BW.pdb
Finish generate BW
Consideration of Water Effect
Finish Optimization
C
Finish Vina, save in Vina58.csv
Finish SASA, save in SASA.csv
No Ion
Finish Ion, save in Num_Ions.csv
Co
Finish Vina, save in Vina58_min.csv
Finish SASA, save in SASA_min.csv
No Ion
Finish Ion, save in Num_Ions_min.csv
Crwo
BW satisfies structural requirement:3
Finish Bridging Water feature calculation, save in Feature_BW_min_RW.csv
Finish Vina, save in Vina58_min_RW.csv
Finish

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
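
The cell above saves the working directory before calling `get_DXGB` and restores it afterwards. If the call raises an exception, the final `os.chdir(cwd)` never runs. A small context manager can make the restore unconditional. The sketch below is not part of DXGB; `preserve_cwd` is a hypothetical helper name, and a temporary directory stands in for `datadir`.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def preserve_cwd():
    """Restore the current working directory on exit, even if the body raises."""
    start = os.getcwd()
    try:
        yield start
    finally:
        os.chdir(start)


# Demonstration: a temporary directory stands in for datadir.
before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with preserve_cwd():
        os.chdir(tmp)  # stands in for a call that changes directory
    after = os.getcwd()

print(before == after)
```

With this helper, the call could be wrapped as `with preserve_cwd(): get_DXGB(...)`, using the same arguments as in the cell above.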


The two slowest steps in score calculation are **receptor water structure preparation** and **ligand conformation generation**, so a run may take some time. However, receptor water molecules only need to be generated once per protein, and ligand conformations only need to be generated once per ligand (even for different docking poses).<br> You can directly copy and paste previously generated structures and use **rewrite=False** to skip these generation steps during score prediction. <br>
<br>
**Note**: 
1. When you rerun this on the same data, all feature files are rewritten; this behavior cannot be changed.
2. Sometimes the initial ligand file (mol2 or sdf) cannot be read by RDKit because of structural problems, in which case the ligand stability calculation is skipped. You should then provide a pdb file for the ligand so that the other features can be calculated and scores predicted.
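
Before relying on **rewrite=False**, it can help to check whether the previously generated water structures are actually present. The sketch below assumes the naming pattern `<pdbid>_protein_RW.pdb` and `<pdbid>_protein_BW.pdb`, generalized from the 2al5 log output above. The helper `generated_water_files` is hypothetical and not part of DXGB.

```python
import os
import tempfile


def generated_water_files(datadir, pdbid):
    """Map each expected water-structure file name to whether it exists in datadir."""
    # Assumed naming pattern, generalized from the 2al5 log output.
    names = [f"{pdbid}_protein_RW.pdb", f"{pdbid}_protein_BW.pdb"]
    return {name: os.path.isfile(os.path.join(datadir, name)) for name in names}


# Demonstration in a temporary directory where only the RW file is present.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "2al5_protein_RW.pdb"), "w").close()
    status = generated_water_files(tmp, "2al5")

print(status)
```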


In [5]:
### Take a look at the score file
score = pd.read_csv("../Test_2al5/score.csv")
score

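| Unnamed: 0 | pdb | vina | XGB | RF20 | vina_min | XGB_min | RF20_min | vina_min_RW | XGB_min_RW | RF20_min_RW | vina_min_BW | XGB_min_BW | RF20_min_BW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2al5 | 6.26831 | 6.611149 | 7.3313 | 6.410879 | 6.807068 | 7.450087 | 7.156691 | 7.112672 | 8.455182 | 6.896757 | 6.7354 | 8.129866 |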
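
The `Unnamed: 0` column is what pandas produces when the first column of a CSV file has no header, which typically happens when a DataFrame index was written to the file. Assuming that is the case here, passing `index_col=0` to `pd.read_csv` reads that column back as the index. The sketch below uses an in-memory CSV with a subset of the columns shown above; the exact on-disk layout of score.csv is an assumption.

```python
import io

import pandas as pd

# In-memory stand-in for score.csv (assumed layout: unnamed first column holds the index).
csv_text = ",pdb,vina,XGB,RF20\n0,2al5,6.26831,6.611149,7.3313\n"

# index_col=0 uses the first column as the row index instead of a data column.
score = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(score.columns))
print(score.loc[0, "pdb"])
```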


#### Only input feature file provided

When only an input feature file is provided, we don't need to calculate features and can predict scores directly.<br>
The input file with the features must be named Input.csv.

In [6]:
model = "DXGB" ### tag for our deltaVinaXGB model
modeldir = "/Users/jianinglu1/Documents/deltaVinaXGB/model" ### absolute model dir 
datadir = "/Users/jianinglu1/Documents/deltaVinaXGB/Test" ### data dir
pdbid = None ### no longer needed; the column names should match those of an Input.csv generated by our script
outfile = "score.csv" ### output file name
runfeatures = False ### we don't want to calculate features any more
water = False ### no structures have been provided
opt = False ### no structures have been provided 
rewrite = False ### no structures have been provided
average = True ### we use ensemble models (10 models)
featuretype = "all" ### doesn't matter
runrf = True ### besides Vina and deltaVinaXGB, we also want to calculate deltaVinaRF
runscore = True ### we want to calculate scores

In [7]:
os.chdir(cwd) ### the calculation runs inside datadir, so start from the original directory and return to it after predicting scores
get_DXGB(model, modeldir, datadir, pdbid, outfile, runfeatures, water, opt, rewrite, average=average, featuretype=featuretype, runrf=runrf, runscore=runscore)
os.chdir(cwd)

file directory: /Users/jianinglu1/Documents/deltaVinaXGB/Test
output score file: score.csv
1
2
3
4
5
6
7
8
9
10


In [8]:
### Take a look at the score file
score = pd.read_csv("../Test/score.csv")
score.head()

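| Unnamed: 0 | pdb | vina | XGB | RF20 |
| --- | --- | --- | --- | --- |
| 0 | 1h22 | 7.12477 | 8.174137 | 8.347142 |
| 1 | 4k77 | 5.547744 | 5.699233 | 6.136828 |
| 2 | 4dld | 5.275392 | 5.530861 | 6.140189 |
| 3 | 3f3c | 5.484393 | 5.61397 | 6.395983 |
| 4 | 4cig | 5.580979 | 5.157811 | 5.57571 |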


### Calculate Features

For development purposes, we also provide a way to calculate features only, either all of them or a specific one.

#### Calculate all features for provided structures (rwo, rw)

In [9]:
model = "DXGB" ### tag for our deltaVinaXGB model
modeldir = "/Users/jianinglu1/Documents/deltaVinaXGB/model" ### absolute model dir 
datadir = "/Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5" ### data dir
pdbid = "2al5" ### input index, can be other type
outfile = "score.csv" ### provided, but should be empty
runfeatures = True ### we want to calculate features
water = "rbw" ### consider both receptor water and bridging water
opt = "rbwo" ### optimize structures in all situations
rewrite = False ### we don't want to overwrite previously generated structures and conformations
average = True ### doesn't matter
featuretype = "all" ### we want to calculate all features
runrf = False 
runscore = False ### we don't want to calculate scores

In [10]:
cwd = os.getcwd() ### feature calculation runs inside datadir, so save the current directory and return to it afterwards
get_DXGB(model, modeldir, datadir, pdbid, outfile, runfeatures, water, opt, rewrite, average=average, featuretype=featuretype, runrf=runrf, runscore=runscore)
os.chdir(cwd)

pdb index: 2al5
file directory: /Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5
feature will be calculated:all
Ligand for conformation stability:2al5_ligand.mol2
Ligand for Vina, SASA, BA, ION:2al5_ligand_rename.pdb
Protein without water molecules:2al5_protein.pdb
Protein with water molecules:2al5_protein_all.pdb
Finish Input Preparation
Protein Water: calculate both RW and BW
Use previous generated RW
Use previous RW and BW
Consideration of Water Effect
Use pervious generated CO
Use pervious generated C_RWO
Use pervious generated C_BWO
Finish Optimization
C
Finish Vina, save in Vina58.csv
Finish SASA, save in SASA.csv
No Ion
Finish Ion, save in Num_Ions.csv
Co
Finish Vina, save in Vina58_min.csv
Finish SASA, save in SASA_min.csv
No Ion
Finish Ion, save in Num_Ions_min.csv
Crwo
BW satisfies structural requirement:3
Finish Bridging Water feature calculation, save in Feature_BW_min_RW.csv
Finish Vina, save in Vina58_min_RW.csv
Finish SASA, save in SASA_min_RW.csv
No Ion
Finish Ion, sa

#### Calculate a specific feature (no opt, no rw)

##### Vina

In [11]:
model = "DXGB" ### tag for our deltaVinaXGB model
modeldir = "/Users/jianinglu1/Documents/deltaVinaXGB/model" ### absolute model dir 
datadir = "/Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5" ### data dir
pdbid = "2al5" ### input index, can be other type
outfile = "score.csv" ### provided, but should be empty
runfeatures = True ### we want to calculate features
water = False ### no water
opt = False ### no optimization
rewrite = False ### we don't want to overwrite previously generated structures and conformations
average = True ### doesn't matter
featuretype = "Vina" ### we only want to calculate the Vina features
runrf = False 
runscore = False ### we don't want to calculate scores

In [12]:
cwd = os.getcwd() ### feature calculation runs inside datadir, so save the current directory and return to it afterwards
get_DXGB(model, modeldir, datadir, pdbid, outfile, runfeatures, water, opt, rewrite, average=average, featuretype=featuretype, runrf=runrf, runscore=runscore)
os.chdir(cwd)

pdb index: 2al5
file directory: /Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5
feature will be calculated:Vina
Ligand for conformation stability:2al5_ligand.mol2
Ligand for Vina, SASA, BA, ION:2al5_ligand_rename.pdb
Protein without water molecules:2al5_protein.pdb
Protein with water molecules:2al5_protein_all.pdb
Finish Input Preparation
No Consideration of Water
No Optimized Ligand
C
Finish Vina, save in Vina58.csv
Use previous calculated SASA in SASA.csv
Use previous calculated Ion in Num_Ions.csv
Use previous calculated ligand stability in dE_RMSD.csv
Finish Feature Calculation


##### SASA

In [13]:
model = "DXGB" ### tag for our deltaVinaXGB model
modeldir = "/Users/jianinglu1/Documents/deltaVinaXGB/model" ### absolute model dir 
datadir = "/Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5" ### data dir
pdbid = "2al5" ### input index, can be other type
outfile = "score.csv" ### provided, but should be empty
runfeatures = True ### we want to calculate features
water = False ### no water
opt = False ### no optimization
rewrite = False ### we don't want to overwrite previously generated structures and conformations
average = True ### doesn't matter
featuretype = "SASA" ### we only want to calculate the SASA features
runrf = False 
runscore = False ### we don't want to calculate scores

In [14]:
cwd = os.getcwd() ### feature calculation runs inside datadir, so save the current directory and return to it afterwards
get_DXGB(model, modeldir, datadir, pdbid, outfile, runfeatures, water, opt, rewrite, average=average, featuretype=featuretype, runrf=runrf, runscore=runscore)
os.chdir(cwd)

pdb index: 2al5
file directory: /Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5
feature will be calculated:SASA
Ligand for conformation stability:2al5_ligand.mol2
Ligand for Vina, SASA, BA, ION:2al5_ligand_rename.pdb
Protein without water molecules:2al5_protein.pdb
Protein with water molecules:2al5_protein_all.pdb
Finish Input Preparation
No Consideration of Water
No Optimized Ligand
C
Use previous calculated Vina in Vina58.csv
Finish SASA, save in SASA.csv
Use previous calculated Ion in Num_Ions.csv
Use previous calculated ligand stability in dE_RMSD.csv
Finish Feature Calculation


##### Ligand stability

In [15]:
model = "DXGB" ### tag for our deltaVinaXGB model
modeldir = "/Users/jianinglu1/Documents/deltaVinaXGB/model" ### absolute model dir 
datadir = "/Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5" ### data dir
pdbid = "2al5" ### input index, can be other type
outfile = "score.csv" ### provided, but should be empty
runfeatures = True ### we want to calculate features
water = False ### no water
opt = False ### no optimization
rewrite = False ### we don't want to overwrite previously generated structures and conformations
average = True ### doesn't matter
featuretype = "dE" ### we only want to calculate the ligand stability feature
runrf = False 
runscore = False ### we don't want to calculate scores

In [16]:
cwd = os.getcwd() ### feature calculation runs inside datadir, so save the current directory and return to it afterwards
get_DXGB(model, modeldir, datadir, pdbid, outfile, runfeatures, water, opt, rewrite, average=average, featuretype=featuretype, runrf=runrf, runscore=runscore)
os.chdir(cwd)

pdb index: 2al5
file directory: /Users/jianinglu1/Documents/deltaVinaXGB/Test_2al5
feature will be calculated:dE
Ligand for conformation stability:2al5_ligand.mol2
Ligand for Vina, SASA, BA, ION:2al5_ligand_rename.pdb
Protein without water molecules:2al5_protein.pdb
Protein with water molecules:2al5_protein_all.pdb
Finish Input Preparation
No Consideration of Water
No Optimized Ligand
C
Use previous calculated Vina in Vina58.csv
Use previous calculated SASA in SASA.csv
Use previous calculated Ion in Num_Ions.csv
Use previous generated confs
Input Type:mol2
Finish ligand stability calculation, save in dE_RMSD.csv
Finish Feature Calculation
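
The three single-feature cells above differ only in `featuretype`. One way to avoid repeating the parameter block is to keep the shared arguments in a dictionary and loop over the feature types. The sketch below is an illustration, not part of DXGB: `run_feature_types` and `fake_runner` are hypothetical names, the paths are placeholders, and `fake_runner` stands in for `get_DXGB` so the block runs without DXGB installed.

```python
def run_feature_types(runner, base_args, featuretypes):
    """Call runner once per feature type, overriding only the featuretype argument."""
    results = {}
    for ft in featuretypes:
        args = dict(base_args, featuretype=ft)
        results[ft] = runner(**args)
    return results


# Shared arguments, named after the parameters listed by help(get_DXGB).
base_args = dict(
    model="DXGB",
    modeldir="/path/to/model",     # placeholder path
    datadir="/path/to/Test_2al5",  # placeholder path
    pdbid="2al5",
    outfile="score.csv",
    runfeatures=True,
    water=False,
    opt=False,
    rewrite=False,
    average=True,
    runrf=False,
    runscore=False,
)


def fake_runner(**kwargs):
    """Stand-in for get_DXGB: just report which feature type was requested."""
    return kwargs["featuretype"]


print(run_feature_types(fake_runner, base_args, ["Vina", "SASA", "dE"]))
```

With DXGB installed, passing `get_DXGB` as `runner` should work, assuming it accepts these names as keyword arguments as the `help(get_DXGB)` signature suggests. The working directory should still be restored after each call, as in the cells above.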
