# Basil Docking V0.1 - Docking Preparation
## Purpose

__Target Audience__<br>
Undergraduate chemistry/biochemistry students and, more generally, anyone with little to no knowledge of protein-ligand docking who would like to understand the general process of docking a ligand to a protein receptor.

__Brief Overview__<br>
Molecular docking is a computational method used to predict where molecules can bind to a protein receptor and what interactions exist between the molecule (from now on referred to as the "ligand") and the receptor. It is a popular technique in drug discovery and design: when creating new drugs or testing existing drugs against new receptors, estimating the likelihood of binding before screening makes it possible to eliminate molecules that are unlikely to bind to the receptor. This significantly reduces the cost and time needed to test the efficacy of a set of possible ligands. <br>

The general steps to perform molecular docking, assuming the ligand and receptor are ready to be docked, include the generation of potential ligand binding poses and the scoring of each generated pose (which predicts how strongly the ligand binds to the receptor, with a more negative score corresponding to stronger binding). To dock a ligand to a protein, both the receptor and the ligand/s need to be "sanitized", which includes making sure bonds and protonation states are as they would be in an organism. The receptor and ligand/s also need to be converted into the correct file formats depending on which docking engine is utilized. With all of these steps needed for preparation alone, there is a need for an in-depth view of each distinct step. This series attempts to provide that, as well as give users flexibility to customize the proteins, ligands, and procedures used.<br>
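
To make the "more negative is stronger" convention concrete, the sketch below converts a score into an approximate dissociation constant using ΔG = RT ln(Kd). It assumes the score is a binding free energy estimate in kcal/mol and a temperature of 298.15 K; the function name `score_to_kd` is illustrative and not part of this notebook series. Docking scores are rough estimates, so the resulting values are order-of-magnitude guides at best.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def score_to_kd(score_kcal_per_mol, temperature_k=298.15):
    """Convert a binding free energy estimate (kcal/mol) to Kd (mol/L)."""
    # Rearranged from dG = RT * ln(Kd)
    return math.exp(score_kcal_per_mol / (R_KCAL * temperature_k))

# A more negative score gives a smaller Kd, i.e. tighter predicted binding
for score in (-6.0, -9.0):
    print(f"{score:5.1f} kcal/mol -> Kd ~ {score_to_kd(score):.1e} M")
```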

This notebook series encompasses<br>
1. __The preparation needed prior to docking (protein and ligand sanitation, ensuring files are in readable formats, and finding possible binding pockets)__
2. The process of docking ligand/s to a protein receptor using two docking engines (VINA and SMINA) and visualizing/analyzing the outputs
3. Further data collection and manipulation

__Stepwise summary for this notebook (docking preparation, notebook 1 out of 3)__<br>
- Get PDB file from the protein data bank and separate the protein and ligand into different files
- Import additional ligands (if desired)
- Prepare and separate ligands into their own MOL2 and PDBQT files
- Find possible binding pockets in protein
- View protein and ligand/s

The methods utilized by this notebook are based on Angel J. Ruiz-Moreno's Jupyter-Dock notebooks, which can be found on their GitHub account AngelRuizMoreno.

Ruiz-Moreno A.J. Jupyter Dock: Molecular Docking integrated in Jupyter Notebooks. https://doi.org/10.5281/zenodo.5514956

Methods for sanitizing the protein PDBQT file were adapted from Jessica Nash's iqb-2024 repository, which was used in the IQB 2024 workshop - Python for Molecular Docking, and can be found on her GitHub account janash.

## Table of Libraries Used
### Operations, variable creation, and variable manipulation

| Module (Submodule/s)| Abbreviation| Role | Citation |
| :--- | :--- | :--- | :--- |
| numpy | np | perform mathematical operations, fix NaN values in dataframe outputs, and get docking box values from MDAnalysis | Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2. |
| pandas | pd | organize data in an easy-to-read format and allow for the exporting of data as a .csv file | The pandas development team. (2024). pandas-dev/pandas: Pandas (v2.2.3). Zenodo. https://doi.org/10.5281/zenodo.13819579 |
| re |n/a| regular expression; find and pull specific strings of characters depending on need, allow for easy naming and variable creation | Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| os | n/a| allow for interaction with computer operating system, including the reading and writing of files |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| sys |n/a| manipulate python runtime environment |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation.|
| glob |n/a| pull files of interest, specifically for blind docking |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| warnings | n/a | filter warnings | Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |

### Protein and Ligand Preparation
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| biopython (Bio.PDB, PDBList)| n/a | fetch and download pdb structures from rcsb.org | Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878 |
| MDAnalysis (PDB)| mda | allow for the selection of atoms for separating protein from ligands and ligands from each other | R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein. MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations. In S. Benthall and S. Rostrup, editors, Proceedings of the 15th Python in Science Conference, pages 98-105, Austin, TX, 2016. SciPy, doi:10.25080/majora-629e541a-00e. |
| --- | --- | --- | N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. J. Comput. Chem. 32 (2011), 2319-2327, doi:10.1002/jcc.21787. PMCID:PMC3144279. |
| pdb2pqr | n/a | prepare protein receptors for docking | PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, Baker NA. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W522-5. |
| --- | --- | --- | PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Dolinsky TJ, Nielsen JE, McCammon JA, Baker NA. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W665-7. |
| open babel (pybel)| n/a | prepare ligands for docking and allow for the conversion of ligand information to different file types |  O'Boyle, N.M., Banck, M., James, C.A. et al. Open Babel: An open chemical toolbox. J Cheminform 3, 33 (2011). https://doi.org/10.1186/1758-2946-3-33.|
| rdkit (Chem)| n/a | ligand sanitation |  RDKit: Open-source cheminformatics; http://www.rdkit.org |
| fpocket | n/a | find possible binding pockets in protein receptors | Le Guilloux, V., Schmidtke, P. & Tuffery, P. Fpocket: An open source platform for ligand pocket detection. BMC Bioinformatics 10, 168 (2009). https://doi.org/10.1186/1471-2105-10-168. |

### Visualization
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| rdkit.Chem (Draw)| n/a | ligand visualization |  RDKit: Open-source cheminformatics; http://www.rdkit.org |
| py3Dmol | n/a | apoprotein and protein complex visualization |  Keshavan Seshadri, Peng Liu, and David Ryan Koes. Journal of Chemical Education 2020 97 (10), 3872-3876. https://doi.org/10.1021/acs.jchemed.0c00579. |

### UI
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| IPython (ipywidgets, display)| n/a | allow for widgets to be implemented and displayed | Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: https://ipython.org |
| ipywidgets (FileUpload, Dropdown, Text, Layout, Label, Box, HBox)| widgets | create interactive widgets of different types | Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: https://ipython.org |

To use this notebook, certain libraries are required for the analysis to run as planned. You can either use a conda environment (provided as a yml file) or install all required libraries using pip install. Only run the cells below if you are not using the conda environment to install the required libraries, and only use them as needed. If you are using the conda environment, start at the coding cell that imports the libraries.
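
Before running the import cell, it can help to check which required libraries are already available. The sketch below uses only the standard library; the `REQUIRED` list is an assumption based on the top-level names imported in this notebook, and `missing_modules` is an illustrative helper rather than part of the notebook's utilities.

```python
import importlib.util

# Top-level import names assumed from the import cell of this notebook
REQUIRED = ["numpy", "pandas", "Bio", "MDAnalysis", "openbabel",
            "rdkit", "py3Dmol", "ipywidgets", "requests"]

def missing_modules(names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing:", missing_modules(REQUIRED) or "none")
```

Anything reported as missing can then be installed through the conda environment or pip, whichever route you chose above.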

### Import libraries

Import all necessary libraries using the cells below

In [None]:
# wouldn't work well on windows (issue with finding git), find solution?
! git submodule update --recursive --init --remote

In [None]:
import numpy as np
import pandas as pd
import numbers
import re
import sys, os
import requests
import glob
import ipywidgets as widgets
from ipywidgets import FileUpload, Dropdown, SelectMultiple, Text, Layout, Label, Box, HBox, Button, Output
from IPython.display import display

from Bio.PDB import PDBList
import pdb2pqr
import MDAnalysis as mda 
from MDAnalysis.coordinates import PDB
from openbabel import pybel
from rdkit import Chem
from rdkit.Chem import Draw
import rcsbapi
from rcsbapi.search import AttributeQuery, Attr, TextQuery, ChemSimilarityQuery

sys.path.insert(1, 'utilities/ligandsplitter')
from ligandsplitter.basefunctions import create_folders, convert_type
from ligandsplitter.ligandsplit import File_Info, Ligand, retrieve_pdb_file, get_mol2_info, get_ligands, find_ligands_unique, write_mol2, separate_mol2_ligs, isolate_by_method
from ligandsplitter.ligandvalidate import parse_unique_ligands, validate_unique_ligands
from ligandsplitter.ligandgenerate import create_ligands_from_smiles, display_smiles_form, create_mols_from_smiles, create_search_for_expo, create_search_for_protein, display_expo_form, create_ligands_from_expo, create_proteins

sys.path.insert(1, 'utilities/')
from basil_utils import get_prot_pockets_data

import py3Dmol

import warnings
warnings.filterwarnings("ignore")

## Retrieve desired macromolecule for receptor

The desired protein receptor (and ligand/s, if the PDB entry is a complex) can be retrieved from a variety of sources. 

1. Retrieval Using Manual Text Entry
2. Retrieval Using Upload from Local File
3. Retrieval Using RCSB PDB Advanced Search

Before retrieving the receptor, folders to contain the data used within this notebook series need to be created and variables need to be initialized.

In [None]:
current_dir = create_folders()
name = "" #placeholder variable for receptor name
upload = {} #placeholder variable for uploaded receptor metadata

### Method One: Retrieval Using Manual Text Entry
One method of obtaining a protein receptor is retrieval from the Protein Data Bank using the biopython module; specifically, the Bio.PDB package. The retrieved PDB structure file is then cleaned (referring to the removal of water molecules and ions that may interfere with docking) before it is separated into two files using MDAnalysis atom selection: a PDB file containing the protein receptor, and a MOL2 file containing the ligand/s bound to the protein receptor (if present).
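
The idea behind the protein/ligand split can be sketched with plain string handling. This is a simplified stand-in for the MDAnalysis atom selection used by the notebook's helper functions, not their actual implementation: it assumes standard fixed-column PDB records, treats `ATOM` lines as protein and `HETATM` lines as ligand, and by default only skips waters (ion residue names could be added to `skip_resnames`). The function name and the example records are made up for illustration.

```python
def split_pdb_text(pdb_text, skip_resnames=("HOH", "WAT")):
    """Split PDB text into protein (ATOM) and ligand (HETATM) lines."""
    protein, ligands = [], []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            protein.append(line)
        # Columns 18-20 (index 17:20) hold the residue name
        elif line.startswith("HETATM") and line[17:20].strip() not in skip_resnames:
            ligands.append(line)
    return protein, ligands

example = "\n".join([
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "HETATM    2  O   HOH A 201       1.000   2.000   3.000  1.00  0.00           O",
    "HETATM    3  C1  DMS A 101       4.000   5.000   6.000  1.00  0.00           C",
])
protein_lines, ligand_lines = split_pdb_text(example)
print(len(protein_lines), "protein atom line(s);", len(ligand_lines), "ligand atom line(s)")
```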

Using the text input box created from running the cell below, type in the 4-character PDB ID for the desired receptor for molecular docking. The protein can either be just the apoprotein (no bound ligands) or in complex. Later cells split the file into a PDB file containing just the protein and a MOL2 file containing just the ligands (if present).

Here are some possible PDB IDs to use if you need suggestions
- __1oyt__ (small protein with two ligands in complex)
- __9bxw__ (single chain protein with four unique ligands, nine total)
- __1a3b__ (small heterotrimer protein with one ligand in complex)
- __9dvp__ (large homodimer (greater than 1000 residues) with four unique ligands, 13 total)

<div class="alert alert-block alert-info">
<b>Tip for Text Widgets: </b> For cells that use a "Text" widget (like the one below), first execute the cell and then type in your text -- in this case, the PDB ID. Once you type in text, you can move on to the next cell, no need to press "Enter"/"Return"</div>
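
Because the text box accepts any input, a quick format check can catch typos before a download is attempted. The sketch below assumes classic 4-character PDB IDs made of one digit followed by three letters or digits; `is_valid_pdb_id` is an illustrative helper, and passing the check does not guarantee that the entry exists.

```python
import re

# Assumption: classic PDB IDs are one digit followed by three letters or digits
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")

def is_valid_pdb_id(text):
    """Return True if the stripped text looks like a 4-character PDB ID."""
    return PDB_ID_PATTERN.fullmatch(text.strip()) is not None

for candidate in ("1oyt", "9bxw", "oyt1", "1oy"):
    print(candidate, is_valid_pdb_id(candidate))
```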

In [None]:
#Do not type in cell, execute and then type in the produced text box
select_name = Text(value = '', placeholder='Type 4-character PDB ID to be used', disabled=False)
select_name

In [None]:
name = select_name.value
print(name)

### Method Two: Retrieval Using Upload from Local File
If you have a file on your local device that contains information for a protein you'd like to use as a receptor, it can be uploaded and used within this notebook series. This allows for the use of structures generated from prediction software like AlphaFold, so proteins without an experimentally determined structure can be used.
The formats supported for upload include:
- .cif
- .pdb
- .ent
<br>

If your desired receptor is in a format that is not listed above, please convert it to one of the supported formats before uploading it.
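
If you want to confirm that a file is in a supported format before uploading it, a check along these lines can be used. It is a minimal sketch: `SUPPORTED_FORMATS` mirrors the list above, `is_supported_receptor_file` is a hypothetical helper, and only the file extension is inspected, not the file contents.

```python
from pathlib import Path

SUPPORTED_FORMATS = {".cif", ".pdb", ".ent"}

def is_supported_receptor_file(filename):
    """Return True if the file extension is one of the supported formats."""
    return Path(filename).suffix.lower() in SUPPORTED_FORMATS

print(is_supported_receptor_file("my_model.pdb"))
print(is_supported_receptor_file("my_model.mol2"))
```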

<div class="alert alert-block alert-info">
<b>Tip for Upload Widgets: </b> For cells that use an "Upload" widget (like the one below), first execute the cell, then click the "Upload" button that is produced. </div>

In [None]:
# run this cell to create upload widget
file_upload = widgets.FileUpload(accept='.cif,.pdb,.ent', multiple=False)
display(file_upload)

In [None]:
upload = file_upload.value

### Method Three: Retrieval Using Advanced Search

If you want to search for proteins with specific properties, the advanced search function can be used to filter by enzyme classification name or number, the number of chains present, the number of amino acids present, or molecular weight. To make this process as simple as possible, a form will be generated where the criteria of interest can be selected by the user. After selecting the criteria you would like to search by, execute the cell containing the `create_proteins()` function to generate a list of proteins that meet your criteria.

In [None]:
attr_bool, attr_val, attr_comp, form_1, form_2, form_3 = create_search_for_protein()

In [None]:
display_expo_form(form_1, form_2, form_3)

In [None]:
result_receptor, query = create_proteins(attr_bool, attr_val, attr_comp)

With the list of potential receptors generated, you can look at the proteins that fit your criteria by first selecting one from the Dropdown widget below and then executing the cell underneath it. The cell underneath the Dropdown widget uses py3Dmol to provide a 3D view of the protein that can be moved and rotated.

To view other potential receptors, simply change the protein selected in the Dropdown widget (no need to re-run the Dropdown cell) then execute the cell that uses py3Dmol again.


In [None]:
# view proteins that satisfy criteria
style = {'description_width': 'initial'}
view_receptor = Dropdown(options = result_receptor, description = 'Select Desired Receptor to View:', style = style)
view_receptor

In [None]:
# whichever pdb id is selected in the dropdown cell above will be visualized in this cell
# to view a different protein to use as a receptor, select the desired pdb id in the drop down and re-run this cell
pdb_list = PDBList()
view = py3Dmol.view()
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

#visualization for protein
# retrieve_pdb_file returns the path of the downloaded file, so read from that path directly
pdb_lig_filename = pdb_list.retrieve_pdb_file(view_receptor.value, pdir="data/test_files", file_format="pdb")
with open(pdb_lig_filename, 'r') as pdb_file:
    view.addModel(pdb_file.read(), format='pdb')
Prot=view.getModel()
Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})
view.zoomTo()
view.show()

Once the receptor to use has been determined, run the cell below and select the PDB ID of the protein you would like to use in molecular docking. Retrieved proteins can be found in the "data/PDB_files/" folder.

In [None]:
select_receptor = Dropdown(options = result_receptor, description = 'Select Receptor from Dropdown:', style = style)
select_receptor

In [None]:
name = select_receptor.value
print(name)

### Select Method of Retrieval 
Depending on the method used to get the receptor, different processes need to be used to get the required information for docking. 

Run the cell to be able to choose the method to use to retrieve the desired receptor. Select "Manual text entry" to use the 4 character PDB ID in the Text Widget above to download the protein from rcsb.org, "Upload from local file" to use a local .pdb/.ent/.cif file containing the receptor of interest, or "Advanced Search" to use a selected PDB ID that has desired properties.

In [None]:
# select method of obtaining PDB ID using Dropdown widget: manual text entry, local upload, or advanced search
style = {'description_width': 'initial'}
fetch_methods = Dropdown(options = ["Manual text entry", "Upload from local file", "Advanced Search"], description = 'Select Method to Choose PDB ID:', style = style)
fetch_methods

In [None]:
# select desired output file for PDB ID: PDB or MMCIF
style = {'description_width': 'initial'}
file_format = Dropdown(options = ["pdb", "mmcif"], description = 'Select Desired Format for PDB ID File:', style = style)
file_format

The following cell retrieves the protein receptor and any ligands that may be present in the file based on the value of the Dropdown above. 

In [None]:
protein_filename, ligand_filename_initial = isolate_by_method(fetch_methods.value, file_format.value, name, upload)
short_filename = protein_filename.split("/")[-1]
pdb_id_initial = short_filename.split("_")[:-1]
# join the pieces back into one string so that pdb_id is always a string, never a list
pdb_id = '_'.join(str(x) for x in pdb_id_initial)
print(pdb_id)

## Retrieving desired ligand/s and separating ligands into separate .mol2 files

In this notebook, we will make sure that each ligand has its own mol2/pdbqt files. While this isn't a required step for docking, separating the ligands into separate files makes data collection and analysis easier to perform and understand.

In addition to ligand separation, this notebook also contains three methods of retrieving additional ligands to be used in ligand docking other than those present in the original protein complex. This allows for the testing of non-canonical binding agents using ligands that are of interest to the user.

1. Fetching from the RCSB Chemical Components Dictionary
2. Importing local mol2 files from a personal computer
3. Getting ligand/s mol2 files using SMILES strings


### Obtaining ligands from input PDB file

To create multiple output files from one input file, the original file must be read thoroughly to ensure all data is captured, and the resulting files must be carefully pieced together so that the mol2 format is followed exactly, as any discrepancies in the output files can drastically impact docking results. The function `separate_mol2_ligs` first parses through the input file, obtaining the line numbers for the different attributes (molecule, atom, bond, and substructure) and determining which information belongs to each ligand based on the name associated with it. From this, the following attributes are obtained:
- Molecule Information
    - The line number where molecule information begins in the file
    - Ligand names in order of appearance in the file
    - Whether or not one ligand is present in the original MOL2 file multiple times
- Atom Information
    - The line number where atom information begins in the file
    - The location of the first instance of an atom corresponding to a given ligand
    - The number of atoms in a given ligand
    - The lines of the mol2 file that contain atom information across all ligands
    - The total number of atoms across all ligands
- Bond Information
    - The line number where bond information begins in the file
    - The location of the first instance of a bond corresponding to a given ligand
    - The number of bonds in a given ligand
    - The lines of the mol2 file that contain bond information across all ligands
    - The total number of bonds across all ligands

Using all of this information, new mol2 files are created for each ligand, with the final number of mol2 files outputted equalling the number of ligands present in the input file.
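
The grouping step described above can be sketched in a few lines. This is an illustration of the idea, not the implementation of `separate_mol2_ligs`: it assumes each `@<TRIPOS>ATOM` line carries the optional substructure name as its eighth whitespace-separated field, and the function name and example text are made up for the demonstration.

```python
from collections import defaultdict

def atoms_by_ligand(mol2_text):
    """Group @<TRIPOS>ATOM lines by their substructure name (eighth field)."""
    groups = defaultdict(list)
    section = None
    for line in mol2_text.splitlines():
        if line.startswith("@<TRIPOS>"):
            # Remember which record type the following lines belong to
            section = line.strip()[len("@<TRIPOS>"):]
        elif section == "ATOM" and line.strip():
            fields = line.split()
            if len(fields) > 7:  # substructure name is an optional field
                groups[fields[7]].append(line)
    return dict(groups)

example = """@<TRIPOS>MOLECULE
example
 3 1 2
SMALL
USER_CHARGES

@<TRIPOS>ATOM
      1 S   1.0 2.0 3.0 S.o 1 DMS101 0.0
      2 O   1.5 2.0 3.0 O.2 1 DMS101 0.0
      3 S   4.0 2.0 3.0 S.o 2 DMS102 0.0
@<TRIPOS>BOND
     1 1 2 2
"""
groups = atoms_by_ligand(example)
for lig_name, atom_lines in groups.items():
    print(lig_name, len(atom_lines))
```

Bond lines would additionally need their atom indices renumbered for each output file, which is part of why the real function tracks the first atom and bond position of every ligand.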

For more information on the mol2 file format, [this pdf has a lot of useful information](https://www.structbio.vanderbilt.edu/archives/amber-archive/2007/att-1568/01-mol2_2pg_113.pdf)

In [None]:
# create separate mol2 files for ligand/s in input pdb file
try:
    ligs, filenames = separate_mol2_ligs(filename = ligand_filename_initial)
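except Exception: # no ligands could be separated (e.g. the receptor had no bound ligands), so start with empty lists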
    ligs = []
    filenames = []
    
print(ligs)
print(filenames)

If a protein has multiple identical chains, it is likely that identical ligands are written in the original MOL2 file under different identifiers (for example, there could be two instances of dimethyl sulfoxide, with one labelled DMS101 and the other labelled DMS102). Since these ligands have an identical structure, only one instance is needed for docking. The cell below compares the SMILES strings of each ligand in the original MOL2 file to detect redundant ligands and, if any are found, removes all but one instance of each.
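
The comparison can be sketched as follows. This is not the implementation of `validate_unique_ligands`; it assumes a canonical SMILES string has already been generated for every ligand (for example with RDKit), so that identical molecules yield identical strings. The helper name and example dictionary are illustrative.

```python
def keep_unique_ligands(smiles_by_ligand):
    """Keep the first ligand name seen for each distinct SMILES string."""
    first_seen = {}
    for lig_name, smiles in smiles_by_ligand.items():
        # setdefault only stores the name if this SMILES has not been seen yet
        first_seen.setdefault(smiles, lig_name)
    return list(first_seen.values())

example = {"DMS101": "CS(C)=O", "DMS102": "CS(C)=O", "EOH201": "CCO"}
print(keep_unique_ligands(example))
```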

In [None]:
# only execute validate_unique_ligands if ligands were extracted from input pdb file, else do nothing
if len(ligs) > 0 and len(filenames) > 0:
    ligs, filenames = validate_unique_ligands(ligs)

<div class="alert alert-block alert-info">
<b>Please note:</b> Some ligands covalently bind to residues of the receptor, and thus are not good candidates for molecular docking. Iron-sulfur clusters, for example, are cofactors that typically bind to sulfur atoms on CYS residues via thiol exchange or a similar mechanism. This means that trying to dock them into potential binding pockets is not necessarily the best method of determining where they will bind. </div>

Duplicates of a ligand in a protein complex's pdb file can result in inaccurate calculations of ligand locations, sizes, and centers in future cells. To prevent this, the chain ID of the first occurrence of each ligand present in the input pdb file is recorded, and will be used to select the atoms present in the ligand unambiguously.
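
One way to read the chain ID reliably is to slice the fixed columns of each `HETATM` record instead of splitting on whitespace, since the chain ID and residue number can run together when the residue number has four digits. The sketch below assumes standard PDB column positions (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26); the function name and example records are illustrative and differ from the cell that follows.

```python
def first_chain_per_ligand(pdb_lines):
    """Map each ligand label (resName + resSeq) to the chain of its first HETATM record."""
    chains = {}
    for line in pdb_lines:
        if not line.startswith("HETATM") or len(line) < 26:
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain_id = line[21]              # column 22
        res_seq = line[22:26].strip()    # columns 23-26
        # setdefault keeps the chain from the first occurrence only
        chains.setdefault(res_name + res_seq, chain_id)
    return chains

example_lines = [
    "HETATM    1  S   DMS A 101       1.000   2.000   3.000  1.00  0.00           S",
    "HETATM    2  S   DMS B1001       4.000   5.000   6.000  1.00  0.00           S",
    "HETATM    3  O   DMS A 101       1.500   2.000   3.000  1.00  0.00           O",
]
print(first_chain_per_ligand(example_lines))
```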

In [None]:
# only determine which chain each ligand is in if ligands were extracted from input pdb file, else do nothing
# see if needed still 
lig_chain = []
if len(ligs) > 0 and len(filenames) > 0:
    with open(f"data/PDB_files/{pdb_id}_clean_ligand.pdb", "r") as infile:
        temp_ligs = []
        for line in infile:
            ligand = line.split()
            # skip blank lines and any record that is not a HETATM entry
            if not ligand or "HETATM" not in ligand[0]:
                continue
            lig1 = ligand[3] + ligand[5]
            # a "." means the chain ID and residue number are fused (e.g. "A1001"),
            # so ligand[5] is a coordinate; take the residue number from ligand[4] instead
            if "." in lig1:
                lig1 = ligand[3] + ''.join(re.findall(r'\d+', ligand[4]))
            if (lig1 in ligs) and (lig1 not in temp_ligs):
                temp_ligs.append(lig1)
                lig_chain.append(ligand[4][0])
    # pair each chain with the ligand it was recorded for
    for lig_name, chain_id in zip(temp_ligs, lig_chain):
        print(f"Ligand {lig_name} is found in chain {chain_id}")

### Method 1:  Adding ligands from the chemical component dictionary

If the ligand you want to dock is not present in the initial protein PDB, there are multiple ways to search for one. One method is to use the RCSB Chemical Component Dictionary (CCD), where ligands can be searched for by their chemical name, type, ID, or brand name, as well as by formula or structural similarity. To make this process as simple as possible, a form will be generated where the criteria of interest can be selected.

In [None]:
# create form for selecting properties to search by
attr_bool, attr_val, form_items1, form_items2 = create_search_for_expo()

In [None]:
# view form
display_expo_form(form_items1, form_items2)

After selecting the criteria you would like to search by, execute the next two cells to generate a list of ligands that meet your criteria. 

In [None]:
# obtain a list of PDB IDs that meet form requirements 
result_lig, query = create_ligands_from_expo(attr_bool, attr_val)

In [None]:
# select only molecular definitions that are in output; no structures
result_lig_list = []
for nonPoly in query(return_type="mol_definition"):
    result_lig_list.append(nonPoly)

With the list of potential ligands generated, you can look at the ligands that fit your criteria by first selecting one from the Dropdown widget below and then executing the cell underneath it. The cell underneath the Dropdown widget uses py3Dmol to provide a 3D view of the ligand that can be moved and rotated.

To view other ligands, simply change the ligand selected in the Dropdown widget (no need to re-run the Dropdown cell) then execute the cell that uses py3Dmol again.

In [None]:
# view ligands that satisfy criteria
style = {'description_width': 'initial'}
view_ligand = Dropdown(options = result_lig_list, description = 'Select Ligand to View from Dropdown:', style = style)
view_ligand

In [None]:
# whichever pdb id is selected in the dropdown cell above will be visualized in this cell
# to view a different ligand, select the desired ligand pdb id in the drop down and re-run this cell
view = py3Dmol.view()
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

#visualization for ligands
sdf_filename = f"data/test_files/{view_ligand.value}_ligand.sdf"
try: # try getting ligand as sdf file first
    lig_mol2 = requests.get(f'https://files.rcsb.org/ligands/download/{view_ligand.value}_ideal.sdf')
    lig_mol2.raise_for_status() # requests does not raise on HTTP errors (e.g. 404) unless asked to
    with open(sdf_filename, "w+") as file:
        file.write(lig_mol2.text)
    lig_filename = sdf_filename
except Exception: # if sdf doesn't work, get ligand as cif file and convert it to sdf
    lig_mol2 = requests.get(f'https://files.rcsb.org/ligands/download/{view_ligand.value}.cif')
    lig_mol2.raise_for_status()
    with open(f"data/PDB_files/{view_ligand.value}_ligand.cif", "w+") as file:
        file.write(lig_mol2.text)
    lig_filename = f"data/PDB_files/{view_ligand.value}_ligand.cif"
    pdb_mol2 = next(pybel.readfile(filename = lig_filename, format='cif'))
    out_mol2 = pybel.Outputfile(filename = sdf_filename, overwrite = True, format='sdf')
    out_mol2.write(pdb_mol2)
    out_mol2.close() # make sure the sdf file is fully written before it is read below
with open(sdf_filename, 'r') as sdf_file:
    view.addModel(sdf_file.read(), format='sdf')
view.zoomTo()
view.show()

Once the ligand or ligands to use have been determined, run the cell below and select all ligands you would like to use in molecular docking.

<div class="alert alert-block alert-info">
<b>Tip for SelectMultiple Widgets: </b> For cells that use a "Select Multiple" widget (like the one below), first execute the cell and then select the desired output/outputs -- in this case, ligands that satisfy the criteria selected. To select multiple ligands using the selection widget, hold down the control key (PC) or command key (Mac) while clicking on the names of each ligand you would like to dock</div>

In [None]:
# select ligand of interest using SelectMultiple widget
# To select multiple ligands using the selection widget, 
# hold down the control key (PC) or command key (Mac) 
# while clicking on the names of each ligand you would like to dock
select_ligand = SelectMultiple(options = result_lig_list, description = 'Select Desired Ligand from Dropdown:', style = style)
select_ligand

All ligands selected in the SelectMultiple widget above will be iterated through and downloaded to the data folder in .sdf or .cif format (data/PDB_files/) and in .mol2 format (data/MOL2_files/). The name of each ligand will be added to the `ligs` list and the path of its mol2 file to the `filenames` list.
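
The "try SDF first, then fall back to CIF" logic can be sketched with the standard library alone. The URL pattern is the one used in this notebook; the function names are illustrative, and `urllib` is swapped in for `requests` so the example has no third-party dependencies. Unlike `requests.get`, `urlopen` raises an error for HTTP failures such as 404, which is what lets the fallback trigger.

```python
from urllib.error import URLError
from urllib.request import urlopen

BASE_URL = "https://files.rcsb.org/ligands/download"  # URL pattern used in this notebook

def candidate_urls(ligand_id):
    """Return download URLs to try in order: ideal SDF first, then CIF."""
    return [f"{BASE_URL}/{ligand_id}_ideal.sdf", f"{BASE_URL}/{ligand_id}.cif"]

def fetch_first_available(ligand_id, timeout=30):
    """Return (url, text) for the first URL that downloads without an error."""
    for url in candidate_urls(ligand_id):
        try:
            with urlopen(url, timeout=timeout) as response:
                return url, response.read().decode("utf-8")
        except URLError:
            continue  # HTTPError is a subclass of URLError, so 404s land here
    raise FileNotFoundError(f"No ligand file found for {ligand_id}")

print(candidate_urls("ATP"))
```

Calling `fetch_first_available("ATP")` requires network access, so it is left out of the example run.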

In [None]:
# create mol2 file for each ligand selected
for ligand_id in select_ligand.value:
    try: # try getting ligand as sdf file first
        lig_mol2 = requests.get(f'https://files.rcsb.org/ligands/download/{ligand_id}_ideal.sdf')
        lig_mol2.raise_for_status() # requests does not raise on HTTP errors (e.g. 404) unless asked to
        with open(f"data/PDB_files/{ligand_id}_ligand.sdf", "w+") as file:
            file.write(lig_mol2.text)
        lig_filename = f"data/PDB_files/{ligand_id}_ligand.sdf"
        pdb_mol2 = next(pybel.readfile(filename = lig_filename, format='sdf'))
    except Exception: # if sdf doesn't work, get ligand as cif file
        lig_mol2 = requests.get(f'https://files.rcsb.org/ligands/download/{ligand_id}.cif')
        lig_mol2.raise_for_status()
        with open(f"data/PDB_files/{ligand_id}_ligand.cif", "w+") as file:
            file.write(lig_mol2.text)
        lig_filename = f"data/PDB_files/{ligand_id}_ligand.cif"
        pdb_mol2 = next(pybel.readfile(filename = lig_filename, format='cif'))
    out_mol2 = pybel.Outputfile(filename = f"data/MOL2_files/{ligand_id}.mol2", overwrite = True, format='mol2')
    out_mol2.write(pdb_mol2)
    out_mol2.close() # flush the mol2 file to disk
    ligs.append(ligand_id)
    filenames.append(f"data/MOL2_files/{ligand_id}.mol2")

In [None]:
print(ligs)

### Method 2:  Adding ligands from local .mol2 files

To dock a ligand that is not present in the imported PDB file, we can upload its mol2 file (which can be obtained from the PDB website) and obtain all the relevant information using ipywidgets. The upload widget will only accept mol2 files; any other file type will result in an error. Multiple files can be uploaded at once. To use the uploader, run the cell below. After running the cell, the upload button will appear, allowing mol2 files to be selected. After the files are uploaded, the next cell can be run to write each uploaded mol2 file into the "data/MOL2_files/" folder.
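
A file extension alone does not guarantee valid contents, so a light check of the uploaded bytes can be useful. The sketch below only looks for the two record headers that every mol2 file with atoms should contain; `looks_like_mol2` is a hypothetical helper, and passing the check does not mean the file is chemically sensible.

```python
REQUIRED_RECORDS = ("@<TRIPOS>MOLECULE", "@<TRIPOS>ATOM")

def looks_like_mol2(content):
    """Return True if the uploaded bytes contain the record headers a mol2 file needs."""
    # bytes() also accepts a memoryview, which some upload widgets return
    text = bytes(content).decode("utf-8", errors="replace")
    return all(record in text for record in REQUIRED_RECORDS)

print(looks_like_mol2(b"@<TRIPOS>MOLECULE\nexample\n@<TRIPOS>ATOM\n"))
print(looks_like_mol2(b"HETATM    1  C1  LIG A   1"))
```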

In [None]:
# create upload widget to allow for uploading of local mol2 files
lig_files = []
upload = widgets.FileUpload(accept='.mol2', multiple=True)
display(upload)

In [None]:
# get the name of each uploaded mol2 file
# (assumes ipywidgets >= 8, where upload.value is a tuple of dictionaries)
for uploaded_file in upload.value:
    lig_files.append(uploaded_file['name'])

# write mol2 files into data folder for each ligand
# the loop variable is called lig_file so the receptor variable "name" is not overwritten
for lig_num, lig_file in enumerate(lig_files):
    with open("data/MOL2_files/" + str(lig_file), "wb") as fp:
        fp.write(upload.value[lig_num]["content"])
    filenames.append("data/MOL2_files/" + str(lig_file))
    name_alone = lig_file.split('.')[0]
    ligs.append(name_alone)

In [None]:
print(ligs)

### Method 3: Adding ligands using user-input SMILES strings

If you are familiar with SMILES format, you can input the SMILES string for the ligand/s in the cell below. Invalid SMILES strings will result in an error. This method is not recommended for those with no experience with SMILES formatting, as a small mistake in the SMILES string can result in the creation of an invalid molecule and can cause issues in the docking process. Up to 20 ligands at a time can currently be generated using SMILES strings depending on the size of the ligands.
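
A quick sanity check can catch the most common typing mistake, an unclosed parenthesis or bracket, before a string is entered in the form. The sketch below is deliberately limited: it does not validate chemistry or ring closures, and `brackets_balanced` is an illustrative helper. A full check would parse the string with a cheminformatics toolkit such as RDKit.

```python
def brackets_balanced(smiles):
    """Return True if every '(' and '[' in the string is closed in the right order."""
    pairs = {")": "(", "]": "["}
    stack = []
    for char in smiles:
        if char in "([":
            stack.append(char)
        elif char in pairs:
            # A closing character must match the most recent opening one
            if not stack or stack.pop() != pairs[char]:
                return False
    return not stack

for smiles in ("CC(=O)O", "CC(=O", "C[N+](C)(C)C"):
    print(smiles, brackets_balanced(smiles))
```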

In [None]:
# let users select how many ligands to add (up to 20)
style = {'description_width': 'initial'}
num_of_ligs = Dropdown(options = range(1, 21), description = 'Select number of ligands to input', style = style)
num_of_ligs

To allow for easy creation of ligands, a form will be used that takes in a name and a SMILES string for each ligand. Each row corresponds to one ligand, and the number of rows depends on the number of ligands you selected in the Dropdown widget above. After the form is filled out, do not re-run the cell.

In [None]:
# create form to take name and smiles input
names_for_ligs, smiles_for_ligs, form_items1, form_items2 = create_ligands_from_smiles(num_of_ligs)

In [None]:
# display form
display_smiles_form(num_of_ligs, form_items1, form_items2)

The cell below will use the values entered in the form above to create a mol2 file containing all of the ligands generated by SMILES strings, iterating through each entry in the form. 

In [None]:
cleaned_names = {}
cleaned_smiles = {}

# because widgets result in names/smiles strings being stored as a text object, retrieve and store form inputs as
# strings in new dictionary (cleaned_names, cleaned_smiles)
for num, val in enumerate(names_for_ligs):
    index = "name" + str(num + 1)
    cleaned_names[index] = names_for_ligs[val].value
for num, val in enumerate(smiles_for_ligs):
    index = "scratch" + str(num + 1)
    cleaned_smiles[index] = smiles_for_ligs[val].value

In [None]:
# use values in form to create a concatenated mol2 file containing information for each ligand
name_vals, scratch_vals = create_mols_from_smiles(num_of_ligs.value, cleaned_names, cleaned_smiles)

Using the combined mol2 file generated above, mol2 files for each ligand will be created and stored in the "data/MOL2_files/" folder. The name of the ligand that was chosen (via the form above) will be used to name the resulting files.

In [None]:
# split concatenated mol2 file into separate mol2 files for each ligand
smile_lig, smile_filename = separate_mol2_ligs(filename = 'data/MOL2_files/InputMols.mol2', name_vals = name_vals)

In [None]:
for num, i in enumerate(smile_lig):
    ligs.append(i)
    filenames.append(smile_filename)

In [None]:
print(ligs)

## Cleaning and Preparing Ligands for Docking

Before docking, both the protein receptor and the ligand/s need to be sanitized so that the ligand and receptor molecules have valid structures and the possibility of biologically irrelevant, unlikely, or impossible poses is reduced. Sanitizing includes adding the hydrogens that are missing in the PDB/MOL2 files, making sure the charges of the protein are correct, and converting both the PDB (protein receptor) and MOL2 (ligand/s) files to PDBQT format. The PDBQT format stores the hydrogen and charge information for each molecule and is necessary for docking using the VINA engine.
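
To make the extra information in a PDBQT file concrete, the sketch below pulls the partial charge and atom type out of a single atom record. It is only an illustration: `charge_and_type` is a hypothetical helper, the example line is made up, and the sketch assumes the charge and atom type are the last two whitespace-separated fields of the record (dedicated parsers read fixed columns instead).

```python
def charge_and_type(pdbqt_line):
    """Return (partial_charge, atom_type) from one ATOM/HETATM line of a PDBQT file.

    Assumption: the charge and the atom type are the last two
    whitespace-separated fields on the line.
    """
    fields = pdbqt_line.split()
    if not fields or fields[0] not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return float(fields[-2]), fields[-1]


# made-up example record for illustration
example = "ATOM      1  N   GLY A   1      -1.204   2.351   0.000  1.00  0.00    -0.066 N "
print(charge_and_type(example))
```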

In [None]:
# protein sanitization
# add hydrogens to protein receptor
input_file = f"data/PDB_files/{pdb_id}_protein.pdb"
pqr_file = f"data/PDB_files/{pdb_id}_protein.pqr"
output_file = f"data/PDB_files/{pdb_id}_protein_H.pdb"

! pdb2pqr --pdb-output={output_file} --pH=7.4 --whitespace {input_file} {pqr_file}

In [None]:
# protein sanitization
# create pdbqt file for receptor
try:
    to_pdbqt = mda.Universe(pqr_file)
    to_pdbqt.atoms.write(f"data/PDBQT_files/{pdb_id}_protein.pdbqt")

    # replace "TITLE" and "CRYST1" labels with "REMARK" to reduce chance of errors later on
    with open(f"data/PDBQT_files/{pdb_id}_protein.pdbqt", 'r') as file:
        file_content = file.read()
    file_content = file_content.replace('TITLE', 'REMARK').replace('CRYST1', 'REMARK')
    with open(f"data/PDBQT_files/{pdb_id}_protein.pdbqt", 'w') as file:
        file.write(file_content)
except EOFError as error:
    print(f"Hydrogens unable to be added. Check log output generated by pdb2pqr: data/PDB_files/{pdb_id}_protein.log")

The cells below focus on ligand sanitation, creating new MOL2 files that contain the locations of hydrogens in the ligand/s which then get converted into PDBQT files.

In [None]:
# ligand sanitization
# add hydrogens to ligands
filenames_H = []
for lig_name, lig_file in zip(ligs, filenames):
    # read the first molecule stored in the MOL2 file
    mol = next(pybel.readfile(filename=str(lig_file), format='mol2'))
    mol.addh()
    # write the hydrogen-added ligand to a new MOL2 file
    out_name = "data/MOL2_files/" + str(lig_name) + "_H.mol2"
    filenames_H.append(out_name)
    mol.write(filename=out_name, format='mol2', overwrite=True)

In [None]:
# ligand sanitization
# convert the hydrogen-added MOL2 files to pdbqt
filenames_pdbqt = []
for lig_name, lig_file in zip(ligs, filenames_H):
    # read the first molecule stored in the hydrogen-added MOL2 file
    ligand = next(pybel.readfile(filename=str(lig_file), format='mol2'))
    out_name = "data/PDBQT_files/" + str(lig_name) + "_H.pdbqt"
    filenames_pdbqt.append(out_name)
    ligand.write(filename=out_name, format='pdbqt', overwrite=True)

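For docking, information about the size and center of the ligand/s is needed to ensure that the entire ligand can be docked to the desired binding pocket. The size along each of the x, y, and z dimensions is the difference between the largest and smallest atom coordinate in that dimension. To add a little bit of "wiggle room", each of these lengths is increased by 5 angstroms (a length of exactly zero is left at zero; the code also subtracts five from a negative length, although a maximum-minus-minimum length cannot be negative).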

In [None]:
# get center and size of ligand/s
lig_box_c = []
lig_box_s = []
for h, i in enumerate(filenames_H):
    u2 = mda.Universe(i)
    ligand_mda = u2.atoms
    pocket_center = ligand_mda.center_of_geometry() # get coordinates for center of ligand
    pocket_center_list = np.ndarray.tolist(pocket_center)
    ligand_box = ligand_mda.positions.max(axis=0) - ligand_mda.positions.min(axis=0) #calculate size of ligand
    ligand_box_list = np.ndarray.tolist(ligand_box)
    ligand_box_list2 = []
    for value in ligand_box_list: # add five angstroms to each value to allow for "wiggle room"
        if value < 0:
            ligand_box_list2.append(float(value - 5))
        elif value > 0:
            ligand_box_list2.append(float(value + 5))
        else:
            ligand_box_list2.append(float(0))
    lig_box_c.append(pocket_center_list)
    lig_box_s.append(ligand_box_list2)
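
The same calculation can be written without any molecular libraries, which may make the arithmetic easier to follow. This is a minimal sketch, not the notebook's method: `docking_box` is a hypothetical helper, the coordinates are invented, and, unlike the cell above, it pads every dimension, including one of zero length.

```python
def docking_box(coords, padding=5.0):
    """Return (center, size) of an axis-aligned box around a list of (x, y, z) points.

    center: unweighted mean of the coordinates (center of geometry)
    size:   max - min along each axis, plus `padding` angstroms
    """
    if not coords:
        raise ValueError("no coordinates given")
    axes = list(zip(*coords))  # regroup as (all x), (all y), (all z)
    center = [sum(axis) / len(axis) for axis in axes]
    size = [max(axis) - min(axis) + padding for axis in axes]
    return center, size


# four made-up atom positions, in angstroms
atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0), (1.0, 1.0, 4.0)]
center, size = docking_box(atoms)
print(center, size)
```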

## Find possible binding pockets in protein using fpocket

fpocket is a tool for detecting and scoring protein pockets. Based on variables including solvent accessibility, the hydrophobicity of residues, density, flexibility, residue charges, and more (all contributing variables are listed in the table below), it calculates the likelihood that a pocket acts as a binding site for an unspecified ligand (also known as the druggability score). This score helps determine possible docking boxes to be used in ligand docking.

Column descriptions for data output (pocket_descriptors.csv):

| Descriptor | Role |
| :--- | :--- |
| drug_score | score ranging from 0 to 1 describing the likelihood of a drug binding to a given pocket, where 0.5 is the threshold at which the binding of a drug in the pocket is possible |
| volume | pocket volume|
|nb_asph| the number of alpha spheres in a pocket, which measures the size of cavity normalized to the largest pocket|
|inter_chain | an integer equal to 0 (if the pocket is made of a single chain) or 1 (if the pocket is comprised of 2 chains)|
|apol_asph_proportion | proportion of apolar alpha spheres; the percentage of alpha spheres in a pocket that are apolar|
|mean_asph_radius| mean alpha sphere radius|
|as_density| alpha sphere density of pocket, calculated by taking the mean of all alpha sphere pair-to-pair distances. smaller values indicate a more compact and dense pocket|
|mean_asph_solv_acc| mean alpha sphere solvent accessibility|
|mean_loc_hyd_dens| mean local hydrophobic density; identification of areas of the binding pocket with localized hydrophobicity. calculated by seeing how many apolar spheres overlap with each other. the sum of all apolar neighbors is divided by the total number of apolar spheres|
|flex| flexibility of pocket (b factor)|
|hydrophobicity_score| the hydrophobicity score, which is the mean hydrophobicity score of all residues in the pocket|
|volume_score| the volume score, which is the mean volume score of all amino acids in contact with at least one alpha sphere of the pocket|
|charge_score| the charge score, which is the mean charge for all amino acids in contact with at least one alpha sphere of the pocket|
|polarity_score| the polarity score, which is the hydrophilicity of the binding pocket, which is calculated by taking the mean of all polarity scores of all residues in the pocket|
|a0_apol | describes apolar Van der Waals surface of pocket|
|a0_pol | describes polar Van der Waals surface of pocket|
|af_apol | describes apolar Van der Waals surface of pocket|
|af_pol | describes polar Van der Waals surface of pocket|
|n_abpa| the number of abpas in the binding site |
|three-letter amino acid code (e.g. "ala")|Absolute amino acid composition of a given pocket, divided into groups by amino acid|
|chain_1_type| chain 1 type; an integer equal to 0 (if the pocket is a protein pocket), 1 (if the pocket is a nucleic acid pocket), or 2 (if the pocket is a HETATM pocket)|
|chain_2_type| chain 2 type; an integer equal to 0 (if the pocket is a protein pocket), 1 (if the pocket is a nucleic acid pocket), or 2 (if the pocket is a HETATM pocket)|
|num_res_chain_1|  the total number of residues in chain 1|
|num_res_chain_2| number of residues on chain 2. if the pocket is only made up of one chain, the value of this descriptor is equal to the value of "num_res_chain_1"|
|lig_het_tag|  HETATM tag of ligands situated in the binding pocket|
|name_chain_1|  the name of the first chain in contact with the pocket (denoted using a letter [e.g. "A"])|
|name_chain_2|  the name of the second chain in contact with the pocket (denoted using a letter [e.g. "A"]). if the pocket is only made up of one chain, the value of this descriptor is equal to the value of "name_chain_1"|

In [None]:
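# use fpocket to view potential pockets in protein
# a failed shell command does not raise a Python exception,
# so check that the output file exists and is not empty instead
pocket_input = "data/PDB_files/" + str(pdb_id) + "_protein.pdb"
pocket_output = "data/pocket_descriptors.csv"
! fpocket -f {pocket_input} -d > {pocket_output}
can_blind_dock = os.path.isfile(pocket_output) and os.path.getsize(pocket_output) > 0
if not can_blind_dock:
    print("Unable to find protein pockets")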

In [None]:
if can_blind_dock:
    prot_pockets = pd.read_csv('data/pocket_descriptors.csv',sep=' ',index_col=[0])
else:
    print("Cannot determine pockets. Skip to Section 1.7 - View ligands and receptor together prior to docking")

In [None]:
#get pockets and docking boxes for all pockets in a dataframe
if can_blind_dock:
    get_prot_pockets_data(current_dir, pdb_id, prot_pockets)
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(prot_pockets)
else:
    print("Cannot determine pockets for a non-protein receptor. Skip to Section 1.7 - View ligands and receptor together prior to docking")

To make docking more efficient, a druggability score cutoff can be implemented: all pockets with a druggability score greater than or equal to the selected value are kept in the dataframe that will be used to create the docking boxes for ligands. If the number of possible binding pockets found by fpocket exceeds 25, it is highly recommended that the cells below are run and a cutoff implemented.

In [None]:
if can_blind_dock:
    # cutoff options from 0.05 to 0.95 in steps of 0.05
    value = [round(x * 0.05, 2) for x in range(1, 20)]
    cutoff_score = Dropdown(options = value, description = 'Select Cutoff Druggability Score:', style = style)
    display(cutoff_score)
else:
    print("Cannot determine pockets for a non-protein receptor. Skip to Section 1.7 - View ligands and receptor together prior to docking")

In [None]:
if can_blind_dock:
    prot_pockets = prot_pockets[prot_pockets['drug_score'] >= float(cutoff_score.value)]
    display(prot_pockets)
else:
    print("Cannot determine pockets for a non-protein receptor. Skip to Section 1.7 - View ligands and receptor together prior to docking")
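
For readers who want to see the cutoff logic on its own, the sketch below applies the same "greater than or equal to" filter using only the standard library. It is an illustration under assumptions: the descriptor text is invented, it assumes a space-separated table with a `drug_score` column, and the `cav_id` column name and the `pockets_above_cutoff` helper are hypothetical.

```python
import csv
import io

# invented, space-separated descriptor table for illustration only
descriptor_text = """\
cav_id drug_score volume
1 0.85 912.4
2 0.12 301.7
3 0.50 455.0
"""


def pockets_above_cutoff(text, cutoff):
    """Return the rows whose drug_score is greater than or equal to `cutoff`."""
    reader = csv.DictReader(io.StringIO(text), delimiter=" ")
    return [row for row in reader if float(row["drug_score"]) >= cutoff]


kept = pockets_above_cutoff(descriptor_text, 0.5)
print([row["cav_id"] for row in kept])
```

A pocket whose score equals the cutoff is kept, matching the `>=` comparison used on the dataframe.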

Regardless of whether a druggability cutoff is used, the cell below exports the dataframe as a .csv file, as the pockets and their properties need to be accessed by future notebooks.

In [None]:
if can_blind_dock:
    prot_pockets.to_csv(f"data/protein_pockets_id_{pdb_id}.csv")
else:
    print("Cannot determine pockets for a non-protein receptor. Skip to Section 1.7 - View ligands and receptor together prior to docking")

## View ligands and receptor together prior to docking

While viewing the ligand/s and receptor is not required, it does help to see what the molecules look like and where the possible binding pockets are on the receptor. There are a few different methods this notebook will use to visualize the ligands/proteins to be used in docking. <br>
The first method this notebook will be using is RDKit's Draw module, which takes RDKit molecules and displays a static image of them. This method is easy to implement and only takes one line of code (assuming a list of RDKit molecules already exists).<br>
The second method that will be used is py3Dmol, which requires more code to implement but allows the user to move and rotate the molecule/s and allows larger molecules (including proteins) to be viewed.

In [None]:
# create list of rdkit molecules
mols = []
ligand_smiles = []
for i in ligs:
    mol = Chem.MolFromMol2File("data/MOL2_files/" + str(i) + "_H.mol2",sanitize=False)
    select_mol_smile = Chem.MolToSmiles(mol)
    print(select_mol_smile)
    ligand_smiles.append(select_mol_smile)
    mols.append(mol)

# view ligands
Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(300,300))

Below is code to create the py3Dmol viewer, which consists of three different views. They are as follows:
1. a viewer containing the ligand/s and the receptor, in which the space filling model (surface) of the receptor is present
2. a viewer containing the ligand/s and the receptor, with the addition of transparent boxes around each ligand demonstrating the size and center of the ligand docking boxes. The colors of the ligand boxes differ for clarity's sake, but are otherwise meaningless
3. a viewer containing the ligand/s and the receptor, with the addition of the binding pockets found by fpocket. The colors of the binding pockets differ for clarity's sake, but are otherwise meaningless

### View 1:  ligand/s and the space filling model (surface) of the receptor

In [None]:
view = py3Dmol.view()
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

# add models to viewers
view.addModel(open(f'data/PDB_files/{pdb_id}_protein.pdb','r').read(),format='pdb')
Prot=view.getModel()
Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})
 
# add ligand/s to all py3Dmol viewers
for i in filenames_H:
    view.addModel(open(i,'r').read(),format='mol2')
    ref_m = view.getModel()
    ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})
            
view.addSurface(py3Dmol.VDW,{'opacity':0.6,'color':'white'})


view.zoomTo()
view.show()

### View 2: receptor and ligand/s with ligand docking boxes.

In [None]:
view = py3Dmol.view()
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

# add receptor (protein) model to py3Dmol viewer
view.addModel(open(f'data/PDB_files/{pdb_id}_protein.pdb','r').read(),format='pdb')
Prot=view.getModel()
Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})

#visualization for ligands and docking boxes for each ligand
for i in filenames_H:
    view.addModel(open(i,'r').read(),format='mol2')
    ref_m = view.getModel()
    ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})
    
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'magenta']
for j in range(len(filenames_H)):
    # in a 3Dmol.js box, w, h, and d are the lengths along x, y, and z
    # colors repeat once every color in the list has been used
    view.addBox({"center": dict(x = lig_box_c[j][0], y = lig_box_c[j][1], z = lig_box_c[j][2]),
                 "dimensions": dict(w = abs(lig_box_s[j][0]), h = abs(lig_box_s[j][1]), d = abs(lig_box_s[j][2])),
                 "color": colors[j % len(colors)], "opacity": 0.5})

view.zoomTo()
view.show()
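
The box descriptions passed to the viewer are plain dictionaries, so they can be built and inspected separately from py3Dmol. The sketch below is illustrative only: `box_specs` is a hypothetical helper, the centers and sizes are invented, and it assumes that the `w`, `h`, and `d` dimensions correspond to the x, y, and z lengths of the box. Colors are reused in order once the list runs out.

```python
from itertools import cycle

COLORS = ['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'magenta']


def box_specs(centers, sizes, colors=COLORS, opacity=0.5):
    """Build one box dictionary per ligand, cycling through `colors`."""
    specs = []
    for (cx, cy, cz), (sx, sy, sz), color in zip(centers, sizes, cycle(colors)):
        specs.append({
            "center": {"x": cx, "y": cy, "z": cz},
            # assumption: w, h, d are the box lengths along x, y, z
            "dimensions": {"w": abs(sx), "h": abs(sy), "d": abs(sz)},
            "color": color,
            "opacity": opacity,
        })
    return specs


# eight invented boxes, one more than the number of colors
centers = [(float(n), 0.0, 0.0) for n in range(8)]
sizes = [(7.0, 8.0, 9.0)] * 8
specs = box_specs(centers, sizes)
print([spec["color"] for spec in specs])
```

Each dictionary in `specs` has the same shape as the argument given to `view.addBox` in the cell above.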

### View 3: receptor and ligand/s with protein binding pockets

To avoid parsing through every binding pocket file only to visualize a portion of them, a list is created that contains the PQR file paths of only those binding pockets that remain in the pocket dataframe (that is, pockets that passed the druggability cutoff, if one was applied). In addition to the information recorded in PDB files, the PQR file format includes charge and radius fields for each atom in a binding pocket.

In [None]:
if can_blind_dock:
    revised_files = []
    pocketPath = os.path.join(current_dir, "data", "PDB_files", str(pdb_id) + "_protein_out", "*.pqr")
    pocketFiles = glob.glob(pocketPath)
    for file in pocketFiles:
        split_1 = os.path.basename(file)  # file name without its folder, on any operating system
        split_2 = split_1.split("_")[0]
        index_num = re.findall(r'\d+', split_2)
        index_num2 = ''.join(str(x) for x in index_num)
        if int(index_num2) in prot_pockets.index:
            revised_files.append(file)
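
The file-name parsing in the cell above can be tried out on its own. This sketch is only an illustration: `pocket_index` is a hypothetical helper, the paths are invented, and it assumes pocket files are named in the form `pocket<N>_...pqr`, which is what the splitting logic above implies.

```python
import os
import re


def pocket_index(path):
    """Return the pocket number encoded in a file name like 'pocket12_vert.pqr', or None."""
    match = re.match(r"pocket(\d+)_", os.path.basename(path))
    return int(match.group(1)) if match else None


# invented paths and an invented set of pocket numbers that passed the cutoff
paths = [
    "data/PDB_files/example_protein_out/pocket1_vert.pqr",
    "data/PDB_files/example_protein_out/pocket2_vert.pqr",
    "data/PDB_files/example_protein_out/pocket12_vert.pqr",
]
kept_pockets = {1, 12}
selected = [p for p in paths if pocket_index(p) in kept_pockets]
print(selected)
```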

In [None]:
view = py3Dmol.view()
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

# add receptor (protein) model to py3Dmol viewer
if can_blind_dock:
    view.addModel(open(f'data/PDB_files/{pdb_id}_protein.pdb','r').read(),format='pdb')
    Prot=view.getModel()
    Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})

    #visualization ligands
    colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'magenta']
    for h, i in enumerate(filenames_H):
        # add ligand/s to all py3Dmol viewers
        view.addModel(open(i,'r').read(),format='mol2')
        ref_m = view.getModel()
        ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})

    # add each selected pocket; colors repeat once every color in the list has been used
    for a, file in enumerate(revised_files):
        view.addModel(open(file,'r').read(),format = 'pqr')
        pockets = view.getModel()
        pockets.setStyle({},{'sphere':{'color':colors[a % len(colors)],'opacity':0.5}})

    view.zoomTo()
    view.show()

## Saving results for further use and cleaning up

To use the data collected in this notebook for the next notebook in this series (Docking and Preliminary Analysis), a .csv file containing ligand filenames and ligand box sizes and centers will be created, allowing for the variables to be easily imported and used.

In [None]:
center_x = []
center_y = []
center_z = []
size_x = []
size_y = []
size_z = []
for h, i in enumerate(ligs):
    center_x.append(lig_box_c[h][0])
    center_y.append(lig_box_c[h][1])
    center_z.append(lig_box_c[h][2])
    size_x.append(lig_box_s[h][0])
    size_y.append(lig_box_s[h][1])
    size_z.append(lig_box_s[h][2])
ligand_information = pd.DataFrame({"pdb_id": pdb_id,
                                   "ligs": ligs,
                                   "filenames": filenames,
                                   "filenames_H": filenames_H,
                                   "filenames_pdbqt": filenames_pdbqt,
                                   "center_x": center_x,
                                   "center_y": center_y,
                                   "center_z": center_z,
                                   "size_x": size_x,
                                   "size_y": size_y,
                                   "size_z": size_z
                                  })

ligand_smiles_data = pd.DataFrame({"filename_hydrogens": filenames_H,
                                   "smiles": ligand_smiles})

ligand_information.to_csv(f'data/ligand_information_id_{pdb_id}_{str(len(ligs))}.csv', index = True)
ligand_smiles_data.to_csv(f'data/ligand_smiles_data_id_{pdb_id}_{str(len(ligs))}.csv', index = False)

At this point, you may have some files in the "data/test_files/" folder that you no longer need. You can remove these files manually using your computer's file manager, or you can run the next two cells to delete them. The first cell outputs the names of every file within the folder: __please make sure you no longer need any of the files listed before running the second cell__. The files will be permanently removed from your computer.

In [None]:
testing_data = os.path.join('data', 'test_files', '*')
testing_files = glob.glob(testing_data)
testing_files

In [None]:
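for file in testing_files:
    # os.remove also handles file names containing spaces and works on systems without `rm`
    if os.path.isfile(file):
        os.remove(file)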
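
To see what such a cleanup does before running it on real data, the sketch below performs the same kind of deletion inside a temporary folder. It is an illustration only: `remove_files` is a hypothetical helper and the file names are invented. Only regular files directly inside the folder are deleted; subfolders are left alone.

```python
import tempfile
from pathlib import Path


def remove_files(folder):
    """Delete regular files directly inside `folder`; return the names removed."""
    removed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            path.unlink()  # permanent: the file does not go to a recycle bin
            removed.append(path.name)
    return removed


# practice run in a throwaway folder with invented file names
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.mol2").write_text("x")
    (Path(tmp) / "b.pdbqt").write_text("y")
    (Path(tmp) / "keep_dir").mkdir()
    removed = remove_files(tmp)
    left = sorted(p.name for p in Path(tmp).iterdir())

print(removed, left)
```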