# Basil Docking V0.1 - Docking and Preliminary Analysis
## Purpose

__Target Audience__<br>
Undergraduate chemistry/biochemistry students and, more generally, anyone with little to no knowledge of protein-ligand docking who would like to understand the general process of docking a ligand to a protein receptor.

__Brief Overview__<br>
Molecular docking is a computational method used to predict where molecules can bind to a protein receptor and what interactions exist between the molecule (from now on referred to as the "ligand") and the receptor. It is a popular technique in drug discovery and design: when creating new drugs or testing existing drugs against new receptors, estimating the likelihood of binding before screening makes it possible to eliminate molecules that are unlikely to bind to the receptor. This can significantly reduce the cost and time needed to test the efficacy of a set of possible ligands. <br>

The general steps of molecular docking, assuming the ligand and receptor are ready to be docked, are the generation of potential ligand binding poses and the scoring of each generated pose (the score predicts how strongly the ligand binds to the receptor, with a more negative score corresponding to stronger binding). Before a ligand can be docked to a protein, both the receptor and the ligand/s need to be "sanitized", which includes making sure that bonds and protonation states are as they would be in an organism. The receptor and ligand/s also need to be converted into the correct file formats, which depend on the docking engine used. Because preparation alone involves this many steps, an in-depth view of each distinct step is useful. This series attempts to provide that, as well as give users the flexibility to customize the proteins, ligands, and procedures used.<br>
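
As a rough illustration of the scoring convention described above, the sketch below ranks a few poses by score. The pose names and score values are made up for illustration and are not output from any docking engine.

```python
# Hypothetical docking scores in kcal/mol (illustrative values only).
pose_scores = {
    "pose_1": -7.9,
    "pose_2": -6.4,
    "pose_3": -8.3,
}

# A more negative score predicts stronger binding, so sort in ascending order.
ranked = sorted(pose_scores.items(), key=lambda item: item[1])
best_pose, best_score = ranked[0]
print(f"Best pose: {best_pose} ({best_score} kcal/mol)")
```

Sorting in ascending order places the most negative (most favorable) score first, which is the order in which docking engines typically report their poses.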

This notebook series encompasses:
1. The preparation needed prior to docking (protein and ligand sanitation, ensuring files are in readable formats, and finding possible binding pockets)
2. __The process of docking ligand/s to a protein receptor using two docking engines (VINA and SMINA) and visualizing/analyzing the outputs__
3. Further data collection and manipulation

__Stepwise summary for this notebook (docking and preliminary analysis, notebook 2 out of 3)__
- Get docking box sizes from docking-prep notebook
- Dock ligand to protein using either VINA or SMINA
- Visualize different poses of ligands docked to protein
- Visualize protein-ligand interactions of poses

The methods used in this notebook are based on Angel J. Ruiz-Moreno's Jupyter-Dock notebooks, which can be found on their GitHub account, AngelRuizMoreno.

Ruiz-Moreno A.J. Jupyter Dock: Molecular Docking integrated in Jupyter Notebooks. https://doi.org/10.5281/zenodo.5514956

## Table of Libraries Used
### Operations, variable creation, and variable manipulation

| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :---|
| numpy | np | perform mathematical operations and fix NaN values in dataframe outputs | Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2. |
| pandas | pd | organize data in an easy-to-read format and allow for the exporting of data as a .csv file | The pandas development team. (2024). pandas-dev/pandas: Pandas (v2.2.3). Zenodo. https://doi.org/10.5281/zenodo.13819579 |
| re |n/a| regular expression; find and pull specific strings of characters depending on need, allow for easy naming and variable creation | Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| os | n/a| allow for interaction with computer operating system, including the reading and writing of files |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| sys |n/a| manipulate python runtime environment |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation.|
| glob |n/a| pull files of interest, specifically for blind docking |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| warnings | n/a | filter warnings | Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |

### Visualization
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| py3Dmol | n/a | apoprotein and protein complex visualization |  Keshavan Seshadri, Peng Liu, and David Ryan Koes. Journal of Chemical Education 2020 97 (10), 3872-3876. https://doi.org/10.1021/acs.jchemed.0c00579. |

### Docking
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| vina | n/a | ligand-protein docking |  Eberhardt, J., Santos-Martins, D., Tillack, A.F., Forli, S. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling. |
| --- | --- | --- | Trott, O., & Olson, A. J. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2), 455-461. |
| smina | n/a | ligand-protein docking |  Koes, D. R., Baumgartner, M. P., & Camacho, C. J. (2013). Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. Journal of chemical information and modeling, 53(8), 1893–1904. https://doi.org/10.1021/ci300604z |
| fpocket | n/a | find possible binding pockets in protein receptors | Le Guilloux, V., Schmidtke, P. & Tuffery, P. Fpocket: An open source platform for ligand pocket detection. BMC Bioinformatics 10, 168 (2009). https://doi.org/10.1186/1471-2105-10-168. |
|pdbqt_to_sdf | n/a | create sdf files from pdbqt files created from docking with vina | Ruiz-Moreno A.J. Jupyter Dock: Molecular Docking integrated in Jupyter Notebooks. https://doi.org/10.5281/zenodo.5514956 |

### Data analysis
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| rdkit (Chem)| n/a | reorder/retrieve ligand atoms, retrieve information from ligand sdf files for visualization and comparison  |  RDKit: Open-source cheminformatics; http://www.rdkit.org |
| MDAnalysis (PDB)| mda | allow for the selection of atoms for separating protein from ligands and ligands from each other | R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein. MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations. In S. Benthall and S. Rostrup, editors, Proceedings of the 15th Python in Science Conference, pages 98-105, Austin, TX, 2016. SciPy, doi:10.25080/majora-629e541a-00e. |
| --- | --- | --- | N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. J. Comput. Chem. 32 (2011), 2319-2327, doi:10.1002/jcc.21787. PMCID:PMC3144279. |
| prolif (Complex3D)| plf | calculate, record, and view protein-ligand interactions |  chemosim-lab/ProLIF: v0.3.3 - 2021-06-11.https://doi.org/10.5281/zenodo.4386984. |

### UI
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| IPython (ipywidgets)| n/a | allow for widgets to be implemented | Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: https://ipython.org |
| ipywidgets (Layout, Label, Dropdown, Box)| widgets | create dropdowns for docking engine, pocket number, ligand, and pose selection | Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: https://ipython.org |

Import all necessary libraries using the cells below

In [None]:
# find way for this to work without needing ssh key
# wouldn't work well on windows, find solution?
! git submodule update --init --recursive

In [1]:
import numpy as np
import pandas as pd
import numbers
import re
import sys, os
import glob

import py3Dmol
import ipywidgets as widgets
from ipywidgets import Layout, Label, Dropdown, Box, HBox, SelectMultiple

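# the vina Python bindings may be unavailable (e.g. on Windows);
# in that case the notebook falls back to the vina executable
windows_os = False
try:
    from vina import Vina
except ImportError:
    windows_os = True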

from openbabel import pybel
from rdkit import Chem
import MDAnalysis as mda
from MDAnalysis.coordinates import PDB
import prolif as plf
from prolif.plotting.complex3d import Complex3D
from ligandsplitter.ligandanalysis import group_idxes_from_mol, get_ligand_properties, oral_bioactive_classifier, interaction_regressor

sys.path.insert(1, 'utilities/')
from utils import pdbqt_to_sdf
from basil_utils import fetch_data_files, fetch_docking_data, get_ifps, get_scores, get_largest_array_column, expand_df, fill_df, save_dataframe, compare_poses_form, compare_poses

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

## Import values from docking-prep

Before docking, the data produced by the previous notebook needs to be imported. The glob library is used to retrieve the .csv files created at the end of the previous notebook for both the protein and the ligands. The protein pocket information generated by fpocket is imported from the "protein_pockets" .csv file, and the ligand information is imported from the "ligand_information" .csv file.
<br>

To address the presence of multiple .csv files containing information for different proteins and ligands, each generated .csv file in this series contains within its name:
1. The PDB ID of the receptor
2. The number of ligands generated from the first notebook (for ligand_information only)
<br>

To select the .csv files of interest, make sure that the characters following "id" in the file name correspond to the PDB ID of interest. For example, for a job using PDB ID "1oyt" as the receptor and three ligands, the .csv containing information on its pockets would be named "data/protein_pockets_id_1oyt.csv" and the .csv containing information on the ligands would be named "data/ligand_information_id_1oyt_3.csv".
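
The naming convention above can be checked programmatically. The sketch below is a minimal illustration that assumes the file names follow exactly the two patterns described and that the PDB ID contains no underscore; the helper names `parse_ligand_csv`, `parse_pocket_csv`, and `same_receptor` are hypothetical and are not part of `basil_utils`.

```python
import re
from pathlib import Path

# Patterns assumed from the naming convention described above:
#   protein_pockets_id_<PDB ID>.csv
#   ligand_information_id_<PDB ID>_<number of ligands>.csv
POCKET_CSV = re.compile(r"protein_pockets_id_(?P<pdb_id>[^_]+)\.csv$")
LIGAND_CSV = re.compile(r"ligand_information_id_(?P<pdb_id>[^_]+)_(?P<n_ligands>\d+)\.csv$")

def parse_pocket_csv(path):
    """Return the PDB ID in a pocket .csv name, or None if the name does not match."""
    match = POCKET_CSV.search(Path(path).name)
    return match["pdb_id"] if match else None

def parse_ligand_csv(path):
    """Return (PDB ID, number of ligands), or None if the name does not match."""
    match = LIGAND_CSV.search(Path(path).name)
    if match is None:
        return None
    return match["pdb_id"], int(match["n_ligands"])

def same_receptor(pocket_csv, ligand_csv):
    """True when both file names refer to the same PDB ID."""
    ligand_info = parse_ligand_csv(ligand_csv)
    pocket_id = parse_pocket_csv(pocket_csv)
    return ligand_info is not None and pocket_id is not None and pocket_id == ligand_info[0]

print(parse_ligand_csv("data/ligand_information_id_1oyt_3.csv"))
print(same_receptor("data/protein_pockets_id_1oyt.csv",
                    "data/ligand_information_id_1oyt_3.csv"))
```

A check like `same_receptor` could be run on the two selected files before loading them, to catch a mismatch between the receptor and the ligand information early.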

In [2]:
receptor_files, prot_pocket_csvs, ligand_info_csvs = fetch_data_files()

The cell below creates a Dropdown widget which can be used to select the files of interest. Make sure that the PDB IDs in both selections are the same.

In [3]:
style = {'description_width': 'initial'}
box_layout = Layout(display = "flex", flex_flow = "column", align_items = "stretch", border = "solid", width = "75%")
items = [Dropdown(layout = {'width': 'initial'}, options = receptor_files, description = "Select PDB ID of receptor", style = style),
         Dropdown(layout = {'width': 'initial'}, options = ligand_info_csvs, description = "Select CSV with Ligand Information", style = style)]
box = Box(children=items, layout = box_layout)
box


Based on the selections made in the Dropdown widget(s) above, a dataframe is created from each .csv file. These dataframes hold the values needed for docking (e.g. ligand filenames, ligand centers, and ligand sizes).

In [4]:
pdb_id = items[0].value
csv_selected_ligands = items[1].value
csv_smiles_ligands = csv_selected_ligands.replace('information', 'smiles_data')

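try:
    prot_pockets = pd.read_csv(f"data/protein_pockets_id_{pdb_id}.csv", index_col = [0])
    can_blind_dock = True
except FileNotFoundError:
    # no pocket data for this receptor: only site-specific docking is possible
    prot_pockets = pd.DataFrame()
    can_blind_dock = False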

ligand_information = pd.read_csv(csv_selected_ligands)
ligand_smiles = pd.read_csv(csv_smiles_ligands)

In [5]:
ligs = []
filenames = []
filenames_H = []
filenames_pdbqt = []
center = []
size = []
for r in ligand_information.index:
    ligs.append(ligand_information["ligs"][r])
    filenames.append(ligand_information["filenames"][r])
    filenames_H.append(ligand_information["filenames_H"][r])
    filenames_pdbqt.append(ligand_information["filenames_pdbqt"][r])
    temp = [float(ligand_information["center_x"][r]), float(ligand_information["center_y"][r]), float(ligand_information["center_z"][r])]
    center.append(temp)
    temp1 = [float(ligand_information["size_x"][r]), float(ligand_information["size_y"][r]), float(ligand_information["size_z"][r])]
    size.append(temp1)

## Docking

This notebook utilizes two docking engines for molecular docking: VINA and SMINA. VINA is one of many docking engines available in AutoDock Suite, and is widely used due to its relatively quick docking speed and easy-to-use interface compared to the other docking engines in the suite. SMINA is a fork of VINA, and allows for the modification of scoring terms by users and also adds other functions that make the engine more convenient (allowing multi-ligand files such as .sdf files, improving minimization algorithms, adding additional term types, and allowing for multiple ligand molecular formats). 

<div class="alert alert-block alert-info">
<b>Please note:</b> 
VINA only has force field parameters for atoms of the following elements
<ul> <li>hydrogen</li> <li>carbon</li> <li>oxygen</li> <li>nitrogen</li> <li>phosphorus</li> <li>sulfur</li> <li>calcium</li> <li>manganese</li> <li>iron</li> <li>zinc</li> <li>halogens (fluorine, chlorine, bromine, iodine)</li> </ul>
For ligands containing atoms of elements not listed above, it is recommended that users either 1) use the selection widget below to select only the ligands without unsupported atoms or 2) use only SMINA as the docking engine. Trying to dock a ligand that contains an unsupported atom using VINA will result in an error.

To select multiple ligands using the selection widget, hold down the control key (PC) or command key (Mac) while clicking on the names of each ligand you would like to dock
</div>
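
One way to screen a ligand against the element list above is sketched below. It is a minimal illustration: the function name `unsupported_elements` is hypothetical, and the element symbols are assumed to have already been extracted from the ligand file by some other means.

```python
# Elements listed above as having VINA force field parameters.
VINA_SUPPORTED = {"H", "C", "O", "N", "P", "S", "Ca", "Mn", "Fe", "Zn",
                  "F", "Cl", "Br", "I"}

def unsupported_elements(element_symbols):
    """Return the sorted element symbols that are not in the supported list."""
    # capitalize() normalizes spellings such as "CL" or "fe" to "Cl" and "Fe"
    found = {symbol.capitalize() for symbol in element_symbols}
    return sorted(found - VINA_SUPPORTED)

# Hypothetical ligand containing a boron atom
print(unsupported_elements(["C", "H", "N", "B", "O"]))
```

An empty list means every element in the ligand appears in the list above. A non-empty list suggests deselecting that ligand or docking it with SMINA instead.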

In [None]:
style = {'description_width': 'initial'}
select_ligs = SelectMultiple(options = ligs, description = 'Select Ligand/s to Dock:', style = style)
select_ligs

Ligand docking can be either site-specific or blind. Site-specific docking targets a location on the receptor where the ligand is known to bind, and uses the center and size of the ligand as determined in docking_prep. Blind docking attempts to dock the ligand into multiple potential pockets in the protein (determined using fpocket in docking_prep) and requires more computation. The option selected in the dropdown below determines the method used in this notebook.
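
The difference between the two modes comes down to which docking boxes are searched. The sketch below illustrates this under simplified assumptions: the function `docking_boxes` is hypothetical, each box is represented as a `(center, size)` pair, and all coordinates are made up.

```python
def docking_boxes(docking_type, ligand_box, pocket_boxes):
    """Return the list of (center, size) boxes to search for one ligand.

    ligand_box   : (center, size) built around the ligand's known position
    pocket_boxes : list of (center, size) pairs, one per predicted pocket
    """
    if docking_type == "Site-specific docking":
        return [ligand_box]        # a single box around the known binding site
    if docking_type == "Blind docking":
        return list(pocket_boxes)  # one docking run per candidate pocket
    raise ValueError(f"Unknown docking type: {docking_type!r}")

# Illustrative values only
ligand_box = ([10.0, 4.5, -2.0], [18.0, 16.0, 20.0])
pocket_boxes = [([9.8, 4.1, -1.7], [20.0, 20.0, 20.0]),
                ([-5.2, 12.0, 8.3], [22.0, 18.0, 24.0])]

print(len(docking_boxes("Site-specific docking", ligand_box, pocket_boxes)))
print(len(docking_boxes("Blind docking", ligand_box, pocket_boxes)))
```

Because blind docking repeats the search once per pocket, its cost grows with the number of pockets that fpocket identified.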

In [None]:
style = {'description_width': 'initial'}
if can_blind_dock:
    select_type = Dropdown(options = ["Site-specific docking","Blind docking"], description = 'Select Docking Type:', style = style)
else:
    select_type = Dropdown(options = ["Site-specific docking"], description = 'Select Docking Type:', style = style)
select_type

In [None]:
pocket_center = []
pocket_size = []
# one [x, y, z] entry per pocket; both lists stay empty if no pocket data was loaded
if can_blind_dock:
    for pocket in prot_pockets.index:
        c_x = prot_pockets.loc[pocket,'center_x']
        c_y = prot_pockets.loc[pocket,'center_y']
        c_z = prot_pockets.loc[pocket,'center_z']
        s_x = prot_pockets.loc[pocket,'size_x']
        s_y = prot_pockets.loc[pocket,'size_y']
        s_z = prot_pockets.loc[pocket,'size_z']
        pocket_center.append([c_x, c_y, c_z])
        pocket_size.append([s_x, s_y, s_z])

### Docking using VINA

Below is a step-by-step (cell-by-cell) guide on how the VINA docking engine is used to generate poses and scores for each pocket and ligand:
- Prior to docking, two new folders are created in the data folder to organize the output data (vina_out and vina_out_2).
- Using the information collected in the docking-prep notebook, each pocket's center values and size values are added to their respective lists, which are called pocket_center and pocket_size. In both lists, each element is a list of the x, y, and z values corresponding to one pocket's data (as a result, pocket_center and pocket_size are nested lists, and the length of both lists is equal to the number of binding pockets).
    - For example, pocket_center may look like this: [[x1, y1, z1],[x2, y2, z2],[x3, y3, z3]]
- Using the pocket size and center lists and the pdbqt files for the receptor and desired ligand, ligand poses are generated for each binding pocket (the number of poses depends on the value of n_poses, which is set to 5 in this notebook). The amount of computational effort spent generating the poses for a given pocket and ligand is called the exhaustiveness. The higher the exhaustiveness, the more reproducible the results tend to be. While the default value of exhaustiveness is 8, this notebook uses an exhaustiveness of 5 due to memory limitations.
- The results of running the VINA docking engine are stored as pdbqt files in the vina_out folder. To analyze and visualize the results, the pdbqt files are converted into sdf files using the function pdbqt_to_sdf (created by Angel Ruiz-Moreno); the sdf files can be found in the vina_out_2 folder. For blind docking, the file names follow the pattern `(ligand name)_vina_pocket_(pocket number).pdbqt` for the pdbqt files and `(ligand name)_pocket_(pocket number)_(name of folder).sdf` for the sdf files. For site-specific docking, the files are named `(ligand name).pdbqt` and `(ligand name)_(name of folder).sdf`.
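
The file naming described in the last step can be sketched as a small helper. This is only an illustration: `vina_output_paths` is a hypothetical function, and the patterns are taken from the paths used in this notebook's docking cells.

```python
from pathlib import Path

def vina_output_paths(ligand, pocket=None, data_dir="data"):
    """Build the pdbqt and sdf output paths for one VINA docking run.

    pocket=None corresponds to site-specific docking, where the
    file names carry no pocket number.
    """
    data_dir = Path(data_dir)
    if pocket is None:
        pdbqt = data_dir / "vina_out" / f"{ligand}.pdbqt"
        sdf = data_dir / "vina_out_2" / f"{ligand}_vina_out_2.sdf"
    else:
        pdbqt = data_dir / "vina_out" / f"{ligand}_vina_pocket_{pocket}.pdbqt"
        sdf = data_dir / "vina_out_2" / f"{ligand}_pocket_{pocket}_vina_out_2.sdf"
    return pdbqt, sdf

# "LIG" is a placeholder ligand name
pdbqt_path, sdf_path = vina_output_paths("LIG", pocket=2)
print(pdbqt_path.as_posix())
print(sdf_path.as_posix())
```

Keeping the pattern in one place makes it easier to locate the converted sdf file that belongs to a given ligand and pocket.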

In [None]:
current_dir = os.getcwd()
dataPath = os.path.join(current_dir, "data")
vina_out = os.path.join(current_dir, "data", "vina_out")
vina_out_2 = os.path.join(current_dir, "data", "vina_out_2")

if not os.path.exists(dataPath):
    print("ERROR: Cannot find 'data' folder. Make sure you are in the correct directory.")
else:
    # create any output folder that does not exist yet
    for folder in (vina_out, vina_out_2):
        if not os.path.exists(folder):
            print(f"Cannot find '{os.path.basename(folder)}' folder. Creating folder...")
            os.mkdir(folder)

In [None]:
def vina_dock(ligand):
    # need to test windows specific
    if windows_os:
        receptor = f'data/PDBQT_files/{pdb_id}_protein.pdbqt'
        ligand_in = f"data/PDBQT_files/{ligand}_H.pdbqt"
        if select_type.value == "Blind docking":
            for pock_num, pocket in enumerate(prot_pockets.index):
                # create txt file with center and size values (one "key = value" entry per line)
                txt_config_list = [f"center_x = {pocket_center[pock_num][0]}", f"center_y = {pocket_center[pock_num][1]}", f"center_z = {pocket_center[pock_num][2]}", f"size_x = {pocket_size[pock_num][0]}", f"size_y = {pocket_size[pock_num][1]}", f"size_z = {pocket_size[pock_num][2]}"]
                config_path = f"data/vina_out/{ligand}_pocket_{pocket}_config.txt"
                with open(config_path, "w") as datafile:
                    datafile.write("\n".join(txt_config_list) + "\n")
                out = f"data/vina_out/{ligand}_vina_pocket_{pocket}.pdbqt"
                ! ./utilities/vina_1.2.7_win.exe --receptor {receptor} --ligand {ligand_in} --config {config_path} --exhaustiveness=5 --out {out}
        else:
            # create txt file with center and size values (one "key = value" entry per line)
            lig_index = ligs.index(ligand)
            txt_config_list = [f"center_x = {center[lig_index][0]}", f"center_y = {center[lig_index][1]}", f"center_z = {center[lig_index][2]}", f"size_x = {size[lig_index][0]}", f"size_y = {size[lig_index][1]}", f"size_z = {size[lig_index][2]}"]
            config_path = f"data/vina_out/{ligand}_config.txt"
            with open(config_path, "w") as datafile:
                datafile.write("\n".join(txt_config_list) + "\n")
            out = f"data/vina_out/{ligand}.pdbqt"
            ! ./utilities/vina_1.2.7_win.exe --receptor {receptor} --ligand {ligand_in} --config {config_path} --exhaustiveness=5 --out {out}
    else:
        v = Vina(sf_name='vina')
        v.set_receptor(f'data/PDBQT_files/{pdb_id}_protein.pdbqt')
        v.set_ligand_from_file(f"data/PDBQT_files/{ligand}_H.pdbqt")
        if select_type.value == "Blind docking":
            for pock_num, pocket in enumerate(prot_pockets.index):
                v.compute_vina_maps(center = pocket_center[pock_num], box_size = pocket_size[pock_num])
                v.dock(exhaustiveness=5, n_poses=5)
                v.write_poses("data/vina_out/" + str(ligand) + "_vina_pocket_" + str(pocket) + '.pdbqt', n_poses=5, overwrite=True)
        else:
            v.compute_vina_maps(center = center[ligs.index(ligand)], box_size = size[ligs.index(ligand)])
            v.dock(exhaustiveness=5, n_poses=5)
            v.write_poses("data/vina_out/" + str(ligand) + '.pdbqt', n_poses=5, overwrite=True)

In [None]:
for i in select_ligs.value:
    vina_dock(i)

In [None]:
# Create sdf files from pdbqt
for i in select_ligs.value:
    if select_type.value == "Blind docking":
        for pocket in prot_pockets.index:
            pdbqt_to_sdf(pdbqt_file=f"data/vina_out/{i}_vina_pocket_{pocket}.pdbqt",output=f"data/vina_out_2/{i}_pocket_{pocket}_vina_out_2.sdf")
    else:
        pdbqt_to_sdf(pdbqt_file=f"data/vina_out/{i}.pdbqt",output=f"data/vina_out_2/{i}_vina_out_2.sdf")

### Docking using SMINA

Below is a step-by-step (cell-by-cell) guide on how the SMINA docking engine is used to generate poses and scores for each pocket and ligand:
- Prior to docking, two new folders are created in the data folder to organize the output data (smina_out and smina_out_2). The path for the smina docking engine executable is also initialized to allow for the docking engine to be used, as it is a local file.
- Using the pdbqt file for the receptor, the mol2 file for the desired ligand, and the pocket center/size values from the prot_pockets dataframe, ligand poses are generated for each binding pocket (the number of poses depends on the value of num_modes, which is set to 5 in this notebook). The amount of computational effort spent generating the poses for a given pocket and ligand is called the exhaustiveness. The higher the exhaustiveness, the more reproducible the results tend to be. While the default value of exhaustiveness is 8, this notebook uses an exhaustiveness of 5 due to memory limitations.
- The results of running the SMINA docking engine are stored as sdf files in the smina_out folder. However, because the output files do not have a flag marking them as three-dimensional, the sdf files must be read using SDMolSupplier and re-written using SDWriter to avoid excessive errors. The re-written sdf files can be found in the smina_out_2 folder. For blind docking, the file names follow the pattern `(ligand name)_pocket_(pocket number)_(name of folder).sdf`. For site-specific docking, the files are named `(ligand name)_(name of folder).sdf`.
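
The smina call described above can also be assembled as an argument list, which keeps the blind and site-specific cases side by side. This is a sketch only: `smina_command` is a hypothetical helper, the file names are placeholders, and nothing is executed here. The flags are the same ones used in the docking cell of this notebook.

```python
import shlex

def smina_command(receptor, ligand, outfile, center=None, size=None,
                  exhaustiveness=5, num_modes=5, autobox_add=5):
    """Return the smina command as a list of arguments (nothing is run)."""
    cmd = ["smina", "-r", receptor, "-l", ligand, "-o", outfile]
    if center is not None and size is not None:
        # Blind docking: explicit box taken from a predicted pocket
        for axis, c, s in zip("xyz", center, size):
            cmd += [f"--center_{axis}", str(c), f"--size_{axis}", str(s)]
    else:
        # Site-specific docking: box built around the ligand's own coordinates
        cmd += ["--autobox_ligand", ligand, "--autobox_add", str(autobox_add)]
    cmd += ["--exhaustiveness", str(exhaustiveness), "--num_modes", str(num_modes)]
    return cmd

# Placeholder file names and box values
blind = smina_command("receptor.pdbqt", "ligand_H.mol2", "out.sdf",
                      center=(1.0, 2.0, 3.0), size=(20, 20, 20))
print(shlex.join(blind))
```

A list built this way could be passed to `subprocess.run(cmd, check=True)`, provided the smina executable is available on the system path.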

In [None]:
current_dir = os.getcwd()
dataPath = os.path.join(current_dir, "data")
smina_out = os.path.join(current_dir, "data", "smina_out")
smina_out_2 = os.path.join(current_dir, "data", "smina_out_2")

if not os.path.exists(dataPath):
    print("ERROR: Cannot find 'data' folder. Make sure you are in the correct directory.")
else:
    # create any output folder that does not exist yet
    for folder in (smina_out, smina_out_2):
        if not os.path.exists(folder):
            print(f"Cannot find '{os.path.basename(folder)}' folder. Creating folder...")
            os.mkdir(folder)

In [None]:
# Using SMINA to dock ligand/s in docking boxes based on fpocket's identified pockets (blind docking)
# or in a box built around the ligand itself (site-specific docking)
for i in select_ligs.value: 
    if select_type.value == "Blind docking":
        for pock_num, pocket in enumerate(prot_pockets.index):
            rec = f'data/PDBQT_files/{pdb_id}_protein.pdbqt'
            lig = f'data/MOL2_files/{i}_H.mol2'
            outfile = f'data/smina_out/{i}_pocket_{pocket}_smina_out.sdf'
            ! smina -r {rec} -l {lig} -o {outfile} --center_x {pocket_center[pock_num][0]} --center_y {pocket_center[pock_num][1]} --center_z {pocket_center[pock_num][2]} --size_x {pocket_size[pock_num][0]} --size_y {pocket_size[pock_num][1]} --size_z {pocket_size[pock_num][2]} --exhaustiveness 5 --num_modes 5
    else:
        rec = f'data/PDBQT_files/{pdb_id}_protein.pdbqt'
        lig = f'data/MOL2_files/{i}_H.mol2'
        outfile = f'data/smina_out/{i}_smina_out.sdf'
        ! smina -r {rec} -l {lig} -o {outfile} --autobox_ligand {lig} --autobox_add 5 --exhaustiveness 5 --num_modes 5

In [None]:
%%capture
# Rewrite .sdf output files to add 3D tag
# This code will result in warnings. This is normal as long as the warning is
# "Warning: molecule is tagged as 2D, but at least one Z coordinate is not zero. Marking the mol as 3D."
for i in select_ligs.value:
    if select_type.value == "Blind docking":
        for pocket in prot_pockets.index:
            # start a fresh list for every pocket so each output file only holds that pocket's poses
            mols = []
            with Chem.SDMolSupplier(f'data/smina_out/{i}_pocket_{pocket}_smina_out.sdf') as suppl:
                for mol in suppl:
                    if mol is not None:
                        Chem.MolToMolBlock(mol)
                        mols.append(mol)
            with Chem.SDWriter(f"data/smina_out_2/{i}_pocket_{pocket}_smina_out_2.sdf") as w:
                for mol in mols:
                    w.write(mol)
    else:
        mols = []
        with Chem.SDMolSupplier(f'data/smina_out/{i}_smina_out.sdf') as suppl:
            for mol in suppl:
                if mol is not None:
                    Chem.MolToMolBlock(mol)
                    mols.append(mol)
        with Chem.SDWriter(f"data/smina_out_2/{i}_smina_out_2.sdf") as w:
            for mol in mols:
                w.write(mol)

## Analysis of docking output

Now that we have results from molecular docking, we need to make sense of the information. If you were to open the sdf files in a text editor, you would see the x, y, and z coordinates of each atom in the ligand, the bond types between atoms in the ligand, and the score of the ligand pose. While useful, this information is difficult to interpret and visualize. The prolif library can generate interaction fingerprints (IFPs), which record the number of interactions, the types of interaction, and the atoms (ligand) and residues (receptor) involved in binding the ligand to the receptor. IFPs can therefore be used to identify the key ligand atoms and receptor residues involved in protein-ligand complex formation.
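
To make the point about reading sdf files by hand concrete, the sketch below pulls the data fields out of sdf text with plain Python. It is an illustration under assumptions: `read_sdf_tags` is a hypothetical helper, the sample text is a shortened stand-in for a real file, the score values are made up, and the field name `minimizedAffinity` is assumed to be the tag under which the score is stored (the tag in your own files may differ).

```python
import re

# A data field header in an sdf file looks like "> <TAG>"
TAG_LINE = re.compile(r"^>\s*<([^>]+)>")

def read_sdf_tags(sdf_text):
    """Collect the data fields of every record in an sdf string."""
    records, current, tag = [], {}, None
    for line in sdf_text.splitlines():
        header = TAG_LINE.match(line)
        if line.startswith("$$$$"):      # "$$$$" ends one pose/record
            records.append(current)
            current, tag = {}, None
        elif header:
            tag = header.group(1)
            current[tag] = ""
        elif tag is not None:
            if line.strip():
                current[tag] += line.strip()
            else:
                tag = None               # a blank line closes the field
    return records

sample = """ligand_pose_1
  (atom and bond block omitted)
M  END
> <minimizedAffinity>
-7.85

$$$$
ligand_pose_2
  (atom and bond block omitted)
M  END
> <minimizedAffinity>
-7.12

$$$$
"""

scores = [float(record["minimizedAffinity"]) for record in read_sdf_tags(sample)]
print(scores)
```

This recovers one number per pose, but says nothing about which atoms and residues interact, which is the gap the interaction fingerprints fill.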

In [None]:
# load protein
prot_mol = Chem.MolFromPDBFile(f"data/PDB_files/{pdb_id}_protein_H.pdb")
protein_plf = plf.Molecule.from_rdkit(prot_mol)

In [None]:
style = {'description_width': 'initial'}
select_dock = Dropdown(options = [('smina'), ('vina')], description = 'Select the docking engine that was used:', style = style)
select_dock

In [None]:
# get interaction fingerprints for all docked poses
all_df, all_ifps, all_ligand_plf, ligand_plf_descriptors = get_ifps(select_type.value, select_dock.value, select_ligs.value, protein_plf, prot_pockets)


In [None]:
# get scores for all docked poses
scores = get_scores(select_type.value, select_dock.value, select_ligs.value, prot_pockets)
ligs_in_order = []
pocks_in_order = []
for value in ligand_plf_descriptors:
    ligand_name = value.split(",")[0]
    ligs_in_order.append(ligand_name)
    if select_type.value == "Blind docking":
        pocket_name = value.split(",")[1]
        pock_name_isolated = pocket_name.split(" ")[-1]
        pocks_in_order.append(int(pock_name_isolated))

In [None]:
df = pd.concat([d for d in all_df], axis=0, ignore_index=False, sort=False).reset_index()
df.insert(1, "Score", pd.Series(scores))
df.insert(1, "Ligand", pd.Series(ligs_in_order))
if select_type.value == "Blind docking":
    df.insert(1, "Pocket", pd.Series(pocks_in_order))
df = df.fillna(0)
# save csv
save_dataframe(df, select_dock.value, pdb_id, select_ligs.value)
#display csv
df

While the dataframe generated using the prolif library has a lot of useful information, we are also going to add the distance between interacting ligand and protein atoms, the indexes of both the ligand and protein atoms involved in the interaction, and the functional group the ligand's atom is a member of if applicable.
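
The distance mentioned above is the straight-line (Euclidean) distance between the two atoms' coordinates. A minimal sketch with made-up coordinates:

```python
import math

# Hypothetical 3D coordinates (in angstroms) of two interacting atoms
ligand_atom = (12.4, 8.1, -3.0)
protein_atom = (14.0, 9.3, -1.5)

# sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
distance = math.dist(ligand_atom, protein_atom)
print(f"{distance:.2f}")
```

Shorter distances generally indicate closer contact between the two atoms, which is why the distance is a useful addition to the interaction type alone.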

In [None]:
# fix for site specific
if select_type.value == "Blind docking":
    df2 = df[["Frame", "Score", "Ligand", "Pocket", "UNL1"]].copy()
else:
    df2 = df[["Frame", "Score", "Ligand", "UNL1"]].copy()

In [None]:
largest_array_column = get_largest_array_column(df, select_type.value)

In [None]:
%%capture
# create new columns for functional group, residue type, distance, and index information
expand_df(all_ifps, df, df2, largest_array_column)

To get the functional groups in each ligand, a dictionary is created in which the keys are the indexes of atoms determined to be in a functional group and each value is the name of the corresponding functional group. Because dictionary keys must be unique, an atom that is a member of two or more functional groups will have only one of those groups listed as its value.
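
The limitation described above can be reproduced with a few lines of plain Python. The atom indexes and group names below are hypothetical, and the list-based mapping is shown only as one possible alternative, not as what this notebook does.

```python
from collections import defaultdict

# Hypothetical (atom index, functional group) assignments;
# atom 4 belongs to two groups
assignments = [(3, "amine"), (4, "carbonyl"), (4, "amide"), (7, "hydroxyl")]

# One value per key: a later assignment replaces an earlier one
single = {}
for atom_idx, group in assignments:
    single[atom_idx] = group

# Alternative that keeps every group: map each atom index to a list
multi = defaultdict(list)
for atom_idx, group in assignments:
    multi[atom_idx].append(group)

print(single[4])
print(multi[4])
```

In this sketch the last assignment wins for atom 4. Which group survives in practice depends on the order in which the functional groups are processed.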

In [None]:
# find atom indices for ligand and protein, functional groups involved (ligand), residue type (protein), and
# distance between ligand and protein in interaction
fill_df(df2, all_ifps, all_ligand_plf, largest_array_column)

In [None]:
# view dataframe and save compressed format to data folder
save_dataframe(df2, select_dock.value, pdb_id, select_ligs.value, csv_name = f"{pdb_id}_{str(len(select_ligs.value))}_ligands_docking_information_{select_dock.value}_extended")
df2

### Visualization of Docking Poses

Using the dropdown created by the cell below, two different ligand poses can be selected to be viewed with the receptor and compared. A given pose for a ligand can be compared against its original pose (i.e. its location and pose relative to the receptor prior to docking) or against another pose generated by the docking engine. The original ligand pose may not be useful for ligands that were added by uploading a local mol2 file or by inputting a SMILES string.


In [None]:
pose_mode = Dropdown(options = [('Compare to original pose', 1), ('Compare to other docked poses', 2)])
pose_mode

To select specific poses to view, use the form created by the cell below. Depending on whether blind docking or site-specific docking was used to find binding locations, you will be able to select generated poses for a specific pocket or poses without the ability to specify the pocket of interest, respectively. 

If you are comparing a pose against the original pose (i.e. the conformation prior to docking), only one additional pose can be specified for viewing. Otherwise, if you are comparing different docked poses, two different poses can be specified. 

<div class="alert alert-block alert-info">
<b>Tip for Pose Selection: </b>
When selecting the pocket number (if applicable) and the pose number corresponding to each desired pose in the dropdown, make sure that they belong to the same selection as noted in the parentheses of the dropdown's label. 

__Example 1__ _(After using site-specific docking as the docking method)_: <br>
To assign the conformation of a given ligand as seen in pose number 5 to View 1, select "5" in the dropdown marked "Pose Number (View 1)"
    
__Example 2__ _(After using blind docking as the docking method)_:<br>
To assign the conformation of a given ligand as seen in pocket number 5 and pose number 3 to View 1, select "5" in the dropdown marked "Pocket Number (View 1)" and "3" in the dropdown marked "Pose Number (View 1)"

__Example 3__ _(After using blind docking as the docking method)_: <br>
To assign the conformation of a given ligand as seen in pocket number 5 and pose number 3 to View 1 and the conformation as seen in pocket number 2 and pose number 4 to View 2, select "5" in the dropdown marked "Pocket Number (View 1)", "3" in the dropdown marked "Pose Number (View 1)", "2" in the dropdown marked "Pocket Number (View 2)", and "4" in the dropdown marked "Pose Number (View 2)"</div>

In [None]:
form, ligand_number, form_items1, form_items2 = compare_poses_form(select_type.value, select_ligs.value, prot_pockets, pose_mode.value)
form

For the poses selected in the form made by the cell above, a viewer containing the receptor (including the space-filling surface of the receptor) and the two ligand poses specified will be created. 
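
The cells that follow gather the form selections into two dictionaries before building the viewer. As a minimal sketch of what those dictionaries might look like, assuming each view contributes one pocket number and one pose number and that the keys are 1-based view numbers, Example 3 from the tip above could be represented as shown here. `selections_by_view` is a hypothetical helper used only for illustration; it is not part of this notebook series.

```python
def selections_by_view(selected_values):
    """Map 1-based view numbers to the values chosen in the form."""
    return {view: value for view, value in enumerate(selected_values, start=1)}


# Example 3: blind docking, two docked poses compared
pockets = selections_by_view([5, 2])  # pocket numbers for View 1 and View 2
poses = selections_by_view([3, 4])    # pose numbers for View 1 and View 2

print(pockets)
print(poses)
```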

In [None]:
# Collect the pocket and pose selections from the form, keyed by 1-based position
pockets = {}
poses = {}
for index, _ in enumerate(form_items1):
    pockets[index + 1] = form_items1[index + 1].value

for index, _ in enumerate(form_items2):
    poses[index + 1] = form_items2[index + 1].value

In [None]:
compare_poses(pdb_id, select_type.value, pose_mode.value, ligand_number.value, select_dock.value, pockets, poses)

### Visualization of Protein-Ligand Interactions

Using an interaction fingerprint (IFP), the interactions between the ligand and the receptor can be visualized with ProLIF's `Complex3D` class. Only one pose and its interactions can be viewed at a time. If a pose has no interactions assigned to it, the ligand will not be shown in the display and only the protein will be visible.
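
Because a pose without interactions shows only the protein, it can help to know in advance which poses have an empty fingerprint. The sketch below uses plain dictionaries as a simplified stand-in for per-pose fingerprints. The residue labels and the `poses_with_interactions` helper are hypothetical and are not part of ProLIF or this notebook series, and the real `all_ifps` objects may be structured differently.

```python
# Simplified stand-in for per-pose interaction fingerprints: each pose maps a
# (made-up) protein residue label to the interaction types detected for it.
example_ifps = [
    {"ASP129": ["HBAcceptor", "Cationic"], "PHE330": ["PiStacking"]},
    {},  # a pose with no detected interactions
    {"SER212": ["HBDonor"]},
]


def poses_with_interactions(ifps):
    """Return the indices of poses that have at least one interaction."""
    return [index for index, ifp in enumerate(ifps) if len(ifp) > 0]


print(poses_with_interactions(example_ifps))
```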

In [None]:
# Build the dropdown options: one (label, index) pair per pose in df
pose_pock_select = [(ligand_plf_descriptors[a], a) for a in range(df.shape[0])]
style = {'description_width': 'initial'}
select_pose = Dropdown(options = pose_pock_select, description = 'Select Pose to View:', style = style)
select_pose

In [None]:
comp = Complex3D(all_ifps[select_pose.value], all_ligand_plf[select_pose.value], protein_plf)
comp.display()

In [None]:
# EVERYTHING BELOW THIS IS STILL IN PROGRESS

### Determine which ligand features affect ligand binding affinity most strongly

(info)

In [None]:
docking_data = fetch_docking_data()

In [None]:
style = {'description_width': 'initial'}
dock_data = Dropdown(options = docking_data, description = 'Select the docking results to analyze', style = style)
dock_data

In [None]:
# Load the selected docking results and compute feature importances for binding affinity
docking_results = pd.read_csv(dock_data.value)
rf_affinity_importances, xgb_affinity_importances = interaction_regressor(docking_results)

In [None]:
rf_affinity_importances

In [None]:
xgb_affinity_importances

### Determine if features and properties suggest potential oral bioavailability of ligand

Lipinski's Rule of 5 (LRO5) states that drugs that meet at least three of the following criteria are likely to be orally bioactive:
1. Molecular mass less than 500 daltons
2. Octanol-water partition coefficient (log P) of 5 or less
3. No more than 5 hydrogen bond donors
4. No more than 10 hydrogen bond acceptors

From this, it can be stated that, in general, small drugs that are not excessively hydrophobic (low partition coefficient) and that have few hydrogen bond donors and acceptors are more likely to be orally bioactive than large, highly hydrophobic drugs with many hydrogen-bonding groups.
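
The four criteria listed above can be checked with a few comparisons. The following is a minimal sketch, not the implementation used by this notebook series: `lipinski_criteria_met` and `passes_lro5` are hypothetical helper names, and the property values for aspirin and digoxin are copied from the ligand property table shown later in this notebook.

```python
def lipinski_criteria_met(molecular_weight, log_p, h_donors, h_acceptors):
    """Count how many of the four Lipinski criteria a ligand satisfies."""
    checks = [
        molecular_weight < 500,  # molecular mass under 500 daltons
        log_p <= 5,              # octanol-water partition coefficient
        h_donors <= 5,           # hydrogen bond donors
        h_acceptors <= 10,       # hydrogen bond acceptors
    ]
    return sum(checks)


def passes_lro5(molecular_weight, log_p, h_donors, h_acceptors):
    """Apply the 'at least three of four' reading of the rule."""
    return lipinski_criteria_met(molecular_weight, log_p, h_donors, h_acceptors) >= 3


aspirin = dict(molecular_weight=180.1384, log_p=1.3101, h_donors=1, h_acceptors=4)
digoxin = dict(molecular_weight=780.8276, log_p=2.2181, h_donors=6, h_acceptors=14)

print(lipinski_criteria_met(**aspirin), passes_lro5(**aspirin))
print(lipinski_criteria_met(**digoxin), passes_lro5(**digoxin))
```

Digoxin meets only one criterion here even though the dataset marks it as orally bioactive, which is a reminder that the rule is a heuristic rather than a guarantee.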

(Why are there variants?)

The Ghose filter further refines the rules set up by LRO5 by adding the following criteria:
1. Octanol-water partition coefficient (log P) is between -0.4 and +5.6
2. Molar refractivity is between 40 and 130
3. Molecular weight is between 180 and 480 daltons
4. Number of atoms is between 20 and 70 (including hydrogen bond donors and acceptors)
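
The Ghose criteria above can be sketched as four range checks. This is an illustrative sketch only: `passes_ghose` is a hypothetical helper, the bounds are assumed to be inclusive, and the values for aspirin and tylenol are copied from the ligand property table shown later in this notebook (with `num_of_atoms` used as the atom count).

```python
def passes_ghose(log_p, mol_refractivity, molecular_weight, num_atoms):
    """Return True if all four Ghose ranges are satisfied (inclusive bounds assumed)."""
    return (
        -0.4 <= log_p <= 5.6
        and 40 <= mol_refractivity <= 130
        and 180 <= molecular_weight <= 480
        and 20 <= num_atoms <= 70
    )


aspirin = dict(log_p=1.3101, mol_refractivity=44.7103, molecular_weight=180.1384, num_atoms=21)
tylenol = dict(log_p=1.3506, mol_refractivity=42.4105, molecular_weight=151.1438, num_atoms=20)

print(passes_ghose(**aspirin))
print(passes_ghose(**tylenol))  # molecular weight is below 180 daltons
```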

Veber's rule takes a different approach to predicting oral bioactivity and checks only two criteria:
1. The molecule has 10 or fewer rotatable bonds
2. The molecule has a polar surface area of 140 square angstroms (Å²) or less
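
Veber's two criteria can be sketched in the same way. `passes_veber` is a hypothetical helper, not the function used by this notebook series, and the values for abacavir and digoxin are copied from the ligand property table shown later in this notebook.

```python
def passes_veber(rotatable_bonds, polar_surface_area):
    """Return True if both Veber criteria are satisfied."""
    return rotatable_bonds <= 10 and polar_surface_area <= 140


abacavir = dict(rotatable_bonds=6, polar_surface_area=101.88)
digoxin = dict(rotatable_bonds=18, polar_surface_area=203.06)

print(passes_veber(**abacavir))
print(passes_veber(**digoxin))
```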

In [6]:
# Load the reference ligands with known oral bioactivity labels
bioactive_df = pd.read_csv("data/test_train_bioactive_data.csv")
# Append the ligands from this session if ligand_smiles was defined earlier
try:
    bioactive_df = pd.concat([bioactive_df, ligand_smiles])
except Exception:
    # ligand_smiles is unavailable or unusable; keep only the reference ligands
    pass

In [7]:
lig_bioactive_df = get_ligand_properties(bioactive_df)
display(lig_bioactive_df)

Unnamed: 0,filename_hydrogens,smiles,molecular_weight,log_P,H_donors,H_acceptors,mol_refractivity,rotatable_bonds,polar_surface_area,orally_bioactive,...,num_of_In_atoms,num_of_Sn_atoms,num_of_Sb_atoms,num_of_I_atoms,num_of_Ir_atoms,num_of_Pt_atoms,num_of_Au_atoms,num_of_Hg_atoms,num_of_Pb_atoms,num_of_Bi_atoms
0,fluoxetine-R,CNCC[C@@H](Oc1ccc(cc1)C(F)(F)F)c2ccccc2\n,309.2882,4.4350,1,2,79.7987,7,21.26,1.0,...,0,0,0,0,0,0,0,0,0,0
1,fluoxetine-S,CNCC[C@@H](c1ccccc1)Oc2ccc(cc2)C(F)(F)F,309.2882,4.4350,1,2,79.7987,7,21.26,1.0,...,0,0,0,0,0,0,0,0,0,0
2,tylenol,CC(=O)Nc1ccc(cc1)O,151.1438,1.3506,2,2,42.4105,3,49.33,1.0,...,0,0,0,0,0,0,0,0,0,0
3,aspirin,CC(=O)Oc1ccccc1C(=O)O,180.1384,1.3101,1,4,44.7103,3,63.60,1.0,...,0,0,0,0,0,0,0,0,0,0
4,abacavir,c1nc2c(nc(nc2n1[C@@H]3C[C@@H](C=C3)CO)N)NC4CC4,286.2954,1.0923,3,7,79.7499,6,101.88,1.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,naldemedine,CC(C)(NC(=O)C1=C(O)[C@@H]2Oc3c(O)ccc4c3[C@@]23...,34.2380,3.4798,4,9,150.7311,11,141.18,1.0,...,0,0,0,0,0,0,0,0,0,0
117,oxybutynin,CCN(CC)CC#CCOC(=O)C(O)(c1ccccc1)C1CCCCC1,357.4312,3.3429,1,4,103.4368,10,49.77,1.0,...,0,0,0,0,0,0,0,0,0,0
118,methamphetamine,CN[C@@H](C)Cc1ccccc1,149.2070,1.8370,1,1,48.6677,5,12.03,1.0,...,0,0,0,0,0,0,0,0,0,0
0,data/MOL2_files/ALE300_H.mol2,[H]Oc1c([H])c([H])c([C@]([H])(O[H])C([H])([H])...,183.1804,0.3506,4,4,48.6581,7,72.72,,...,0,0,0,0,0,0,0,0,0,0


In [8]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(lig_bioactive_df)

Unnamed: 0,filename_hydrogens,smiles,molecular_weight,log_P,H_donors,H_acceptors,mol_refractivity,rotatable_bonds,polar_surface_area,orally_bioactive,mol,num_of_atoms,num_of_heavy_atoms,num_of_C_atoms,num_of_N_atoms,num_of_O_atoms,num_of_F_atoms,num_of_Al_atoms,num_of_P_atoms,num_of_S_atoms,num_of_Cl_atoms,num_of_Cr_atoms,num_of_Mn_atoms,num_of_Fe_atoms,num_of_Co_atoms,num_of_Ni_atoms,num_of_Cu_atoms,num_of_Zn_atoms,num_of_Ga_atoms,num_of_Ge_atoms,num_of_As_atoms,num_of_Br_atoms,num_of_Zr_atoms,num_of_Mo_atoms,num_of_Pd_atoms,num_of_Ag_atoms,num_of_Cd_atoms,num_of_In_atoms,num_of_Sn_atoms,num_of_Sb_atoms,num_of_I_atoms,num_of_Ir_atoms,num_of_Pt_atoms,num_of_Au_atoms,num_of_Hg_atoms,num_of_Pb_atoms,num_of_Bi_atoms
0,fluoxetine-R,CNCC[C@@H](Oc1ccc(cc1)C(F)(F)F)c2ccccc2\n,309.2882,4.435,1,2,79.7987,7,21.26,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d4270>,40,22,17,1,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,fluoxetine-S,CNCC[C@@H](c1ccccc1)Oc2ccc(cc2)C(F)(F)F,309.2882,4.435,1,2,79.7987,7,21.26,1.0,<rdkit.Chem.rdchem.Mol object at 0x1071843c0>,40,22,17,1,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,tylenol,CC(=O)Nc1ccc(cc1)O,151.1438,1.3506,2,2,42.4105,3,49.33,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d4190>,20,11,8,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,aspirin,CC(=O)Oc1ccccc1C(=O)O,180.1384,1.3101,1,4,44.7103,3,63.6,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d4430>,21,13,9,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,abacavir,c1nc2c(nc(nc2n1[C@@H]3C[C@@H](C=C3)CO)N)NC4CC4,286.2954,1.0923,3,7,79.7499,6,101.88,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d44a0>,39,21,14,6,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,diazepam,CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl,261.2652,3.1538,0,2,81.81,2,32.67,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d4510>,33,20,17,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,escitalopram,CN(C)CCC[C@@]1(c2ccc(cc2CO1)C#N)c3ccc(cc3)F,324.348,3.81298,0,3,90.914,7,36.26,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d4580>,45,24,20,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,metformin,CN(C)C(=N)NC(N)=N,129.1454,-1.03416,4,2,36.4635,2,88.99,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d45f0>,20,9,4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,digoxin,C[C@@H]1[C@H]([C@H](C[C@@H](O1)O[C@@H]2[C@H](O...,780.8276,2.2181,6,14,192.6108,18,203.06,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d4660>,119,55,41,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,mifepristone,CC#C[C@@]1(CC[C@@H]2[C@@]1(C[C@@H](C3=C4CCC(=O...,429.5274,5.4065,1,3,129.4318,6,40.54,1.0,<rdkit.Chem.rdchem.Mol object at 0x15c8d46d0>,67,32,29,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# select the rule set used for the oral bioactivity analysis using Dropdown widget: LRO5, Ghose, or Veber
style = {'description_width': 'initial'}
methods = Dropdown(options = ["LRO5", "Ghose", "Veber"], description = 'Select Method for Analysis:', style = style)
methods

In [None]:
bioactive_importances, classes_dict = oral_bioactive_classifier(lig_bioactive_df, methods.value)

In [None]:
bioactive_importances

In [None]:
for key, value in classes_dict.items():
    if value == 1:
        print(f"Predicted orally bioactive value for {key}: Yes")
    else:
        print(f"Predicted orally bioactive value for {key}: No")