# Basil Docking V0.1 - Docking Preparation
## Purpose

__Target Audience__<br>
Undergraduate chemistry/biochemistry students and, more generally, anyone with little to no knowledge of protein-ligand docking who would like to understand the general process of docking a ligand to a protein receptor.

__Brief Overview__<br>
Molecular docking is a computational method used to predict where molecules can bind to a protein receptor and what interactions exist between the molecule (from now on referred to as the "ligand") and the receptor. It is a popular technique in drug discovery and design. When creating new drugs or testing existing drugs against new receptors, estimating the likelihood of binding before screening makes it possible to eliminate molecules that are unlikely to bind to the receptor. This can significantly reduce the cost and time needed to test the efficacy of a set of possible ligands. <br>

The general steps of molecular docking, assuming the ligand and receptor are ready to be docked, are the generation of potential ligand binding poses and the scoring of each generated pose. The score predicts how strongly the ligand binds to the receptor, with a more negative score corresponding to stronger predicted binding. To dock a ligand to a protein, (insert text).<br>
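
As a rough illustration of the scoring step, the sketch below ranks a few poses from strongest to weakest predicted binding. The pose labels and score values are invented for the example and are not output from any docking engine.

```python
# Hypothetical docking output: (pose label, predicted score in kcal/mol).
# These numbers are made up for illustration only.
poses = [("pose_1", -6.2), ("pose_2", -8.9), ("pose_3", -7.4)]


def rank_poses(scored_poses):
    """Sort poses from most negative score (strongest predicted binding) to least."""
    return sorted(scored_poses, key=lambda pose: pose[1])


ranked = rank_poses(poses)
best_label, best_score = ranked[0]
print(f"Best pose: {best_label} ({best_score} kcal/mol)")
```

Sorting in ascending order puts the most negative score first, which matches the convention described above.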

This notebook series encompasses:<br>
1. __The preparation needed prior to docking (protein and ligand sanitization, ensuring files are in readable formats, and finding possible binding pockets)__
2. The process of docking ligand/s to a protein receptor using two docking engines (VINA and SMINA) and visualizing/analyzing the outputs
3. Further data collection and manipulation
4. Utilizing machine learning to determine key residues (on the protein) and functional groups (on the ligand) responsible for protein-ligand binding

__Stepwise summary for this notebook (docking preparation, notebook 1 out of 4)__<br>
- Get the PDB file from the Protein Data Bank and separate the protein and ligand into different files
- Import additional ligands (if desired)
- Prepare and separate ligands into their own MOL2 and PDBQT files
- Find possible binding pockets in protein
- View protein and ligand/s

The methods used in this notebook are based on Angel J. Ruiz-Moreno's Jupyter-Dock notebooks, which can be found on their GitHub account, AngelRuizMoreno.

Ruiz-Moreno A.J. Jupyter Dock: Molecular Docking integrated in Jupyter Notebooks. https://doi.org/10.5281/zenodo.5514956

Methods for sanitizing the protein PDBQT file were adapted from Jessica Nash's iqb-2024 repository, which was used in the IQB 2024 workshop (Python for Molecular Docking) and can be found on her GitHub account, janash.

## Table of Libraries Used
### Operations, variable creation, and variable manipulation

| Module (Submodule/s)| Abbreviation| Role | Citation |
| :--- | :--- | :--- | :--- |
| numpy | np | perform mathematical operations, fix NaN values in dataframe outputs, and get docking box values from MDAnalysis | Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2. |
| pandas | pd | organize data in an easy-to-read format and allow for the exporting of data as a .csv file | The pandas development team. (2024). pandas-dev/pandas: Pandas (v2.2.3). Zenodo. https://doi.org/10.5281/zenodo.13819579 |
| re |n/a| regular expression; find and pull specific strings of characters depending on need, allow for easy naming and variable creation | Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| os | n/a| allow for interaction with computer operating system, including the reading and writing of files |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| sys |n/a| manipulate python runtime environment |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation.|
| glob |n/a| pull files of interest, specifically for blind docking |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| warnings | n/a | filter warnings | Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |

### Protein and Ligand Preparation
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| biopython (Bio.PDB, PDBList)| n/a | fetch and download PDB structures from rcsb.org | Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878 |
| MDAnalysis (PDB)| mda | allow for the selection of atoms for separating protein from ligands and ligands from each other | R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein. MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations. In S. Benthall and S. Rostrup, editors, Proceedings of the 15th Python in Science Conference, pages 98-105, Austin, TX, 2016. SciPy, doi:10.25080/majora-629e541a-00e. |
| --- | --- | --- | N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. J. Comput. Chem. 32 (2011), 2319-2327, doi:10.1002/jcc.21787. PMCID:PMC3144279. |
| pdb2pqr | n/a | prepare protein receptors for docking | PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, Baker NA. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W522-5. |
| --- | --- | --- | PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Dolinsky TJ, Nielsen JE, McCammon JA, Baker NA. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W665-7. |
| open babel (pybel)| n/a | prepare ligands for docking and allow for the conversion of ligand information to different file types |  O'Boyle, N.M., Banck, M., James, C.A. et al. Open Babel: An open chemical toolbox. J Cheminform 3, 33 (2011). https://doi.org/10.1186/1758-2946-3-33.|
| rdkit (Chem)| n/a | ligand sanitization |  RDKit: Open-source cheminformatics; http://www.rdkit.org |
| fpocket | n/a | find possible binding pockets in protein receptors | Le Guilloux, V., Schmidtke, P. & Tuffery, P. Fpocket: An open source platform for ligand pocket detection. BMC Bioinformatics 10, 168 (2009). https://doi.org/10.1186/1471-2105-10-168. |

### Visualization
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| rdkit.Chem (Draw)| n/a | ligand visualization |  RDKit: Open-source cheminformatics; http://www.rdkit.org |
| py3Dmol | n/a | apoprotein and protein complex visualization |  Keshavan Seshadri, Peng Liu, and David Ryan Koes. Journal of Chemical Education 2020 97 (10), 3872-3876. https://doi.org/10.1021/acs.jchemed.0c00579. |

### UI
| Module (Submodule/s)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| IPython (ipywidgets, display)| n/a | allow for widgets to be implemented and displayed | Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: https://ipython.org |
| ipywidgets (FileUpload, Dropdown, Text, Layout, Label, Box, HBox)| widgets | create interactive widgets of different types | Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: https://ipython.org |

In [None]:
import numpy as np
import pandas as pd
import numbers
import re
import sys, os
import glob
import ipywidgets as widgets
from ipywidgets import FileUpload, Dropdown, Text, Layout, Label, Box, HBox
from IPython.display import display

from Bio.PDB import PDBList
import pdb2pqr
import MDAnalysis as mda 
from MDAnalysis.coordinates import PDB
from openbabel import pybel
from rdkit import Chem
from rdkit.Chem import Draw

import py3Dmol

import warnings
warnings.filterwarnings("ignore")

## Retrieve desired protein and ligand/s

The desired protein receptor (and ligand/s, if the PDB entry is a complex) can be retrieved from the Protein Data Bank using the Biopython module, specifically the Bio.PDB package. The retrieved PDB structure file is then cleaned (that is, water molecules and ions that may interfere with docking are removed) and separated into two files using MDAnalysis atom selection: a PDB file containing the protein receptor, and a MOL2 file containing the ligand/s bound to the protein receptor (if present).
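
The sketch below shows one way the cleaning rule could be expressed with the fixed column positions of the PDB format instead of whitespace splitting. It is an illustration only, not the code this notebook runs: `pdb_line` and `classify` are hypothetical helpers, every `ATOM` record is assumed to belong to the receptor, and an ion is assumed to be any `HETATM` record whose atom name equals its residue name.

```python
def pdb_line(record, serial, atom, res, chain, resseq):
    """Build the first 26 columns of a PDB coordinate record (illustrative helper)."""
    return f"{record:<6}{serial:>5} {atom:<4} {res:>3} {chain}{resseq:>4}"


def classify(line):
    """Label a coordinate record as protein, water, ion, ligand, or other."""
    record = line[0:6].strip()        # columns 1-6: record name
    if record == "ATOM":
        return "protein"              # assumption: all ATOM records are receptor atoms
    if record != "HETATM":
        return "other"
    atom_name = line[12:16].strip()   # columns 13-16: atom name
    res_name = line[17:20].strip()    # columns 18-20: residue name
    if res_name == "HOH":
        return "water"
    if atom_name == res_name:
        return "ion"                  # heuristic: e.g. atom "ZN" in residue "ZN"
    return "ligand"


sample = [
    pdb_line("ATOM", 1, "N", "ILE", "A", 16),
    pdb_line("HETATM", 2000, "O", "HOH", "A", 501),
    pdb_line("HETATM", 2001, "ZN", "ZN", "A", 401),
    pdb_line("HETATM", 10000, "C1", "LIG", "A", 301),
]
for line in sample:
    print(classify(line), "|", line)
```

Slicing by column still works when the atom serial number reaches 10000 and runs into the record name, a case where splitting on whitespace shifts every field by one.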

In [None]:
# build the data directory tree; directories that already exist are left untouched
current_dir = os.getcwd()
dataPath = os.path.join(current_dir, "data")
pdbPath = os.path.join(dataPath, "PDB_files")      # downloaded and split PDB files
mol2Path = os.path.join(dataPath, "MOL2_files")    # ligand MOL2 files
pdbqtPath = os.path.join(dataPath, "PDBQT_files")  # docking-ready PDBQT files

for path in (dataPath, pdbPath, mol2Path, pdbqtPath):
    os.makedirs(path, exist_ok=True)

Using the text input box created by running the cell below, type in the 4-character PDB ID of the desired receptor for molecular docking. The structure can be either the apoprotein (no bound ligands) or a complex. The subsequent cell splits the PDB file into separate files containing just the protein and just the ligands (if present).

Here are some possible PDB IDs to use if you need suggestions:
- __1oyt__ (small protein with two ligands in complex)
- __id1__ (find a single chain protein with 2-4+ ligands)
- __id2__ (find a multi-chain ligand, all distinct chains)
- __id3__ (find a multi-chain ligand that has some identical chains)
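
Before downloading, it can help to check that the typed value looks like a PDB ID. The sketch below is a minimal example; `normalize_pdb_id` is a hypothetical helper, and the pattern assumes the classic 4-character form (a digit followed by three letters or digits).

```python
import re

# Assumed pattern for classic PDB IDs: one digit, then three alphanumeric characters.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def normalize_pdb_id(raw):
    """Return the ID in lower case, or raise ValueError if it does not look valid."""
    candidate = raw.strip()
    if not PDB_ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"'{raw}' does not look like a 4-character PDB ID")
    return candidate.lower()


print(normalize_pdb_id(" 1OYT "))
```

A check like this only confirms the shape of the ID; whether the entry actually exists is still determined when the file is requested.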

In [None]:
name = Text(value = '', placeholder='Type 4-character PDB ID to be used', disabled=False)
name

In [None]:
pdb_list = PDBList()

# download the PDB file from the Protein Data Bank
pdb_id = str(name.value)
pdb_filename = pdb_list.retrieve_pdb_file(pdb_id, pdir="data/PDB_files", file_format="pdb")

# isolate protein
u = mda.Universe(pdb_filename)
protein = u.select_atoms("protein")
protein.write(f"data/PDB_files/{pdb_id}_protein.pdb")

# isolate ligands and remove water molecules from PDB file
ligand = u.select_atoms("not protein and not resname HOH")
ligand.write(f"data/PDB_files/{pdb_id}_ligand.pdb")

# write a cleaned copy of the ligand file that leaves out monatomic ions
with open(f"data/PDB_files/{pdb_id}_ligand.pdb", "r") as infile:
    data = infile.readlines()

with open(f"data/PDB_files/{pdb_id}_clean_ligand.pdb", "w") as outfile:
    for line in data:
        if 'HETATM' in line:
            split_line = line.split()
            # keep only the alphabetical characters of fields 2-4 for comparison
            line_1_join = ''.join(re.findall(r'[a-zA-Z]', split_line[1]))
            line_2_join = ''.join(re.findall(r'[a-zA-Z]', split_line[2]))
            line_3_join = ''.join(re.findall(r'[a-zA-Z]', split_line[3]))
            if 'HETATM' not in split_line[0]:
                outfile.write(line)
            # only write HETATM lines if they are not atomic ions: if the alphabetical characters in
            # the atom name and residue name fields are the same, it is likely an atomic ion
            elif (split_line[0] == 'HETATM') and (line_2_join != line_3_join):
                outfile.write(line)
            # if the atom serial number is 10000 or greater, "HETATM" and the serial number run
            # together into one field, so the atom and residue names each sit one field earlier
            elif (split_line[0] != 'HETATM') and (line_1_join != line_2_join):
                outfile.write(line)
        else:
            outfile.write(line)
        
# convert ligand pdb file to mol2 file
pdb_mol2 = next(pybel.readfile(filename = f"data/PDB_files/{pdb_id}_clean_ligand.pdb", format='pdb'))
out_mol2 = pybel.Outputfile(filename = f"data/MOL2_files/{pdb_id}_ligand.mol2", overwrite = True, format='mol2')
out_mol2.write(pdb_mol2)
out_mol2.close()  # close the output so the mol2 file is fully written before it is read again

## Separating ligands into separate .mol2 files

__1. Separating ligands from input pdb file into separate mol2 files (if needed)__ <br>
In this notebook, we will make sure that each ligand has its own mol2/pdbqt files. While this isn't a required step for docking, separating the ligands into separate files makes data collection and analysis easier to perform and understand.

__2. Importing local mol2 files from a personal computer__ <br>
__3. Getting ligand/s mol2 files using SMILES strings__ <br>
In addition to ligand separation, this notebook also contains two methods of retrieving additional ligands to be used in ligand docking other than those present in the original protein complex. This allows for the testing of non-canonical binding agents using ligands that are of interest to the user.

### Method 1 : Obtaining ligands from input PDB file

To create multiple output files from one input file, the original file must be read carefully so that all data is captured, and the resulting files must follow the mol2 format exactly, as any discrepancies in the output files can drastically affect docking results. The function `separate_mol2_ligs` first parses the input file, recording the line numbers of the different record types (molecule, atom, bond, and substructure) and determining which information belongs to each ligand based on the name associated with it. From this, the following attributes are obtained:
- the line number where molecule information begins in the file
- the line number where atom information begins in the file
- the line number where bond information begins in the file
- the line number where structure information begins in the file
- ligand names in order of appearance in the file
- the location of the first instance of an atom corresponding to a given ligand
- the number of atoms in a given ligand
- the lines of the mol2 file that contain atom information across all ligands
- the total number of atoms across all ligands
- the location of the first instance of a bond corresponding to a given ligand
- the number of bonds in a given ligand
- the lines of the mol2 file that contain bond information across all ligands
- the total number of bonds across all ligands
- the location of the first instance of a structure corresponding to a given ligand
- structure data for a given ligand

Using all of this information, new mol2 files are created for each ligand, with the final number of mol2 files outputted equalling the number of ligands present in the input file.

For more information on the mol2 file format, [this pdf has a lot of useful information](https://www.structbio.vanderbilt.edu/archives/amber-archive/2007/att-1568/01-mol2_2pg_113.pdf)
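
The core idea can be sketched on a tiny, made-up mol2 string. This is a simplified illustration rather than the `separate_mol2_ligs` function defined in this notebook: `find_sections`, `section_body`, and `split_ligands` are hypothetical helpers, the input is assumed to have a single ATOM and a single BOND section, and each ligand's atoms are assumed to be numbered consecutively.

```python
MOL2_TEXT = """\
@<TRIPOS>MOLECULE
demo
4 2 2
SMALL
NO_CHARGES

@<TRIPOS>ATOM
1 C1 0.000 0.000 0.000 C.3 1 AAA1 0.000
2 O1 1.400 0.000 0.000 O.3 1 AAA1 0.000
3 C1 5.000 5.000 5.000 C.3 2 BBB2 0.000
4 N1 6.400 5.000 5.000 N.3 2 BBB2 0.000
@<TRIPOS>BOND
1 1 2 1
2 3 4 1
"""


def find_sections(lines):
    """Map each '@<TRIPOS>' header to the line indices where it occurs."""
    sections = {}
    for index, line in enumerate(lines):
        if line.startswith("@<TRIPOS>"):
            sections.setdefault(line.strip(), []).append(index)
    return sections


def section_body(lines, start):
    """Yield the split fields of each non-empty line until the next header."""
    for line in lines[start + 1:]:
        if line.startswith("@<TRIPOS>"):
            break
        if line.strip():
            yield line.split()


def split_ligands(text):
    """Group atom ids by substructure name and renumber each ligand's bonds from 1."""
    lines = text.splitlines()
    sections = find_sections(lines)
    atoms = {}  # ligand name -> original atom ids, in order of appearance
    for fields in section_body(lines, sections["@<TRIPOS>ATOM"][0]):
        atoms.setdefault(fields[7], []).append(int(fields[0]))
    bonds = {name: [] for name in atoms}
    for fields in section_body(lines, sections["@<TRIPOS>BOND"][0]):
        first, second = int(fields[1]), int(fields[2])
        for name, ids in atoms.items():
            if first in ids and second in ids:
                offset = ids[0] - 1  # shift so the ligand's first atom becomes atom 1
                new_id = len(bonds[name]) + 1
                bonds[name].append((new_id, first - offset, second - offset, fields[3]))
    return atoms, bonds


atoms, bonds = split_ligands(MOL2_TEXT)
print(atoms)
print(bonds)
```

Each ligand ends up with its own atom list and a bond list whose indices start at 1, which is what a standalone mol2 file for that ligand requires.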

In [None]:
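ligs = [] # name of ligands
filenames = [] # resulting file names for each ligand
smile_names = [] # names of SMILES-derived ligands; filled in by Method 3, read by separate_mol2_ligs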

In [None]:
# determine if first item in a line can be converted to an integer, used to determine where to look for
# ligand atoms (useful when other headers are in between "@<TRIPOS>ATOM" and "@<TRIPOS>BOND")
def convert_type(start_type):
    try:
        int(start_type)
        return True
    except ValueError:
        return False

In [None]:
def separate_mol2_ligs(filename = ''):
    ligand_file = os.path.join(current_dir, filename)
    tripos_mol = []
    tripos_atom = []
    tripos_bond = []
    with open(ligand_file, "r") as outfile:
        data = outfile.readlines()
        for linenum, line in enumerate(data):
            if "@<TRIPOS>MOLECULE" in line:
                tripos_mol.append(linenum)
            if "@<TRIPOS>ATOM" in line:
                tripos_atom.append(linenum)
            if '@<TRIPOS>BOND' in line:
                tripos_bond.append(linenum)

    # variable initialization
    # variable group one - atoms
    lig1 = '' # ligand name
    ligs_temp = [] # initial list of all ligands found in combined mol2 file
    ligs_unique = [] # list of all unique ligands
    ligs_unique_index = [] # list of all indexes of unique ligands
    ligs_repeat = [] # list of all ligands that appear more than once
    
    lig_loc = [] # location of first instance of an atom corresp. to a ligand in mol2 file
    lig_loc_unique = [] # location of first instance of an atom corresp. to a unique ligand in mol2 file
    
    lines_atoms = [] # lines of atoms
    atoms = [] # number of atoms for each ligand
    all_atoms = 0 # number of total atoms
    
    # variable group two - bonds
    lig_bond_loc = [] # location of first instance of a bond corresp. to a ligand in mol2 file (nested list)

    lines_bonds = [] # lines containing bonds for each ligand
    lines_bonds_2 = [] # lines containing bonds for each ligand (renumbered to start at 1 for each ligand)
    bonds = [] # number of bonds for each ligand
    all_bonds = 0 # number of total bonds
    
    # find lines containing atoms for each ligand
    with open(ligand_file, "r+") as outfile:
        data = outfile.readlines()
        a = 1
        for instance, value in enumerate(tripos_atom):
            temp_lines_counter = 0
            for linenum, line in enumerate(data):
                if (linenum > value) and (linenum < tripos_bond[instance]):
                    if (convert_type(line.split()[0])) and (len(line.split()) > 7):
                        ligand = line
                        lig_atom = ligand.split()
                        lig1 = str(lig_atom[-2])
                        # if ligand is labelled as "UNL1", change it
                        if lig1 == 'UNL1' and len(smile_names) > 0:
                            lig_atom[-2] = smile_names[instance]
                            lig1 = smile_names[instance]
                        # if a ligand is not in the list of identified ligands and is not labeled as 
                        # "UNL1", record the line number
                        if (lig1 not in ligs_temp) & (lig1 != 'UNL1'):
                            ligs_temp.append(lig1)
                            lig_loc.append(int(linenum))
                            a = 1
                        # if the number corresponding to the order of atoms is equal to one, it means
                        # these atoms belong to a new ligand, record the line number
                        elif (int(lig_atom[0]) == 1):
                            lig_loc.append(int(linenum))
                            find_lig = data[tripos_mol[instance] + 1]
                            find_lig_2 = find_lig.split()
                            ligs_temp.append(find_lig_2[0])
                            a = 1
                        # if a ligand is in the list of identified ligands and is a different ligand 
                        # than the one in the line above it, the new ligand is a duplicate of a previously
                        # identified ligand, record the line number
                        elif ((lig1 in ligs_temp) and (lig1 != ligs_temp[-1])):
                            ligs_temp.append(lig1)
                            lig_loc.append(int(linenum))
                            a = 1
                        lig_atom[0] = str(a)
                        lig_atom[-3] = str(1)
                        newline_2 = ' '.join(str(x) for x in lig_atom)
                        lines_atoms.append(newline_2)
                        temp_lines_counter += 1
                    a += 1
            lig_loc.append(lig_loc[-1] + temp_lines_counter)
    
    # determine the number of unique ligands and their location in lig_temp list
    odd_counter = 0
    for index, templig in enumerate(ligs_temp):
        if templig not in ligs_repeat:
            ligs_unique.append(templig)
            ligs_unique_index.append(index)
            ligs_repeat.append(templig)
            if len(tripos_atom) == 1:
                lig_loc_unique.append(lig_loc[index])
                if (((index + 1) < len(ligs_temp)) and (ligs_temp[index + 1] in ligs_repeat)) or ((index + 1) == len(ligs_temp)):
                    lig_loc_unique.append(lig_loc[index + 1])
            else:
                lig_loc_unique.append(lig_loc[index + odd_counter])
                lig_loc_unique.append(lig_loc[index + 1 + odd_counter])
                odd_counter += 1
    
    # get number of atoms in each unique ligand
    d = 0
    while d < len(lig_loc_unique) - 1:
        if len(tripos_atom) == 1:
            atoms_1 = lig_loc_unique[d + 1] - lig_loc_unique[d]
            atoms.append(int(atoms_1))
            d += 1
        else:
            atoms_1 = lig_loc_unique[d + 1] - lig_loc_unique[d]
            atoms.append(int(atoms_1))
            d += 2
    all_atoms = sum(atoms)
    
    # find lines containing bonds for each unique ligand
    for ligand_number, atom in enumerate(atoms):
        odder_counter = 2
        ligand_bonds = []
        ligs_multiple_bond_sects = []
        with open(ligand_file, "r+") as outfile:
            data = outfile.readlines()
            if len(tripos_atom) == 1:
                for instance, value in enumerate(tripos_bond):
                    for linenum, line in enumerate(data):
                        if (linenum > value) and (len(line.split()) == 4):
                            bond = line
                            bond_num = bond.split()
                            bond_atom1 = int(bond_num[1])
                            bond_atom2 = int(bond_num[2])
                            if (max(bond_atom1, bond_atom2) <= (sum(atoms[:ligand_number]) + atom) and min(bond_atom1, bond_atom2) > sum(atoms[:ligand_number])):
                                ligand_bonds.append(linenum)
                                lines_bonds.append(bond)
            else:
                for linenum, line in enumerate(data):
                    if (linenum > tripos_bond[ligand_number]) and ((tripos_bond[-1] == tripos_bond[ligand_number]) or (linenum < lig_loc_unique[ligand_number + odder_counter])) and (len(line.split()) == 4):
                        bond = line
                        ligand_bonds.append(linenum)
                        ligs_multiple_bond_sects.append(bond)
                lines_bonds.append(ligs_multiple_bond_sects)
                odder_counter += 1
        if len(ligand_bonds) > 0:
            bonds.append(len(ligand_bonds))
            lig_bond_loc.append(ligand_bonds)
    
    # renumber bond numbers in each line for each ligand (only needed for multiple ligand being present
    # under one instance of the "@<TRIPOS>ATOM" header)
    ind = 0
    if len(tripos_atom) == 1:
        for num, item in enumerate(lig_bond_loc):
            b = 1
            temp_lines = []
            for lineloc in item:
                bond_line = lines_bonds[ind].split()
                bond_line[0] = str(b)
                bond_line[1] = int(bond_line[1]) - sum(atoms[:num])
                bond_line[2] = int(bond_line[2]) - sum(atoms[:num])
                bond_line_2 = ' '.join(str(x) for x in bond_line)
                temp_lines.append(bond_line_2)
                ind += 1
                b += 1
            lines_bonds_2.append(temp_lines)
   
    #write a mol2 file for each ligand using collected data
    l = 0
    k = 0
    # if data from each ligand that is the same category (atom, bond, etc) is interrupted 
    # (e.g. "ATOM INFO FOR LIG 1" -> "BOND INFO FOR LIG 1" -> "ATOM INFO FOR LIG 2" -> "BOND INFO FOR LIG 2" -> ...)
    if (lig_loc_unique[-1] - lig_loc_unique[0]) > sum(atoms):
        while l < len(ligs_unique):
            filename = "data/MOL2_files/" + str(ligs_unique[l]) + ".mol2"
            filenames.append(filename)
            infile = open(filename, "w")
            tripos_mol = ["@<TRIPOS>MOLECULE\n", str(ligs_unique[l]) + "\n", str(atoms[l]) + " " + str(bonds[l]) + " " + str(1) + "\n","****\n", "****\n"]
            tripos_atoms = ["@<TRIPOS>ATOM\n"]
            int1 = int(lig_loc_unique[k]) - lig_loc_unique[0]
            int2 = int(lig_loc_unique[k+1]) - lig_loc_unique[0]
            if k > 0:
                int1 = int(lig_loc_unique[k]) - sum(lig_loc_unique[:k]) - (6 * l)
                int2 = int(lig_loc_unique[k+1]) - sum(lig_loc_unique[:k]) - (6 * l)
            while int1 < int2:
                tripos_atoms.append(str(lines_atoms[int1]) + "\n")
                int1 += 1
            tripos_bonds = ["@<TRIPOS>BOND\n"]
            sub_lines = lines_bonds[l]
            for item in sub_lines:
                tripos_bonds.append(item)
            infile.writelines(tripos_mol)
            infile.writelines(tripos_atoms)
            infile.writelines(tripos_bonds)
            ligs.append(ligs_unique[l])
            infile.close()
            l += 1
            k += 2
    # if data from each ligand that is the same category (atom, bond, etc) is unbroken
    # (e.g. "ATOM INFO FOR LIG 1" -> "ATOM INFO FOR LIG 2" -> "BOND INFO FOR LIG 1" -> "BOND INFO FOR LIG 2" -> ...)
    else:
        while l < len(ligs_unique):
            filename = "data/MOL2_files/" + str(ligs_unique[l]) + ".mol2"
            filenames.append(filename)
            infile = open(filename, "w") 
            tripos_mol = ["@<TRIPOS>MOLECULE\n", str(ligs_unique[l]) + "\n", str(atoms[l]) + " " + str(bonds[l]) + " " + str(1) + "\n","****\n", "****\n"]
            tripos_atoms = ["@<TRIPOS>ATOM\n"]
            int1 = int(lig_loc_unique[l]) - lig_loc_unique[0]
            int2 = int(lig_loc_unique[l+1]) - lig_loc_unique[0]
            while int1 < int2:
                tripos_atoms.append(str(lines_atoms[int1]) + "\n")
                int1 += 1
            tripos_bonds = ["@<TRIPOS>BOND\n"]
            sub_bond = lig_bond_loc[l]
            sub_lines = lines_bonds_2[l]
            for index, value in enumerate(sub_bond):
                tripos_bonds.append(str(sub_lines[index]) + "\n")
            infile.writelines(tripos_mol)
            infile.writelines(tripos_atoms)
            infile.writelines(tripos_bonds)
            infile.close()
            ligs.append(ligs_unique[l])
            l += 1
    return ligs

In [None]:
# create separate mol2 files for ligand/s in input pdb file
file = "data/MOL2_files/" + str(pdb_id) + "_ligand.mol2"
separate_mol2_ligs(filename = file)

<div class="alert alert-block alert-info">
<b>Please note:</b> Some ligands covalently bind to residues of the receptor, and thus are not good candidates for molecular docking. Iron-sulfur clusters, for example, are cofactors that typically bind to sulfur atoms on CYS residues via thiol exchange or a similar mechanism. This means that trying to dock them into potential binding pockets is not necessarily the best method of determining where they will bind. </div>

Duplicates of a ligand in a protein complex's PDB file can result in inaccurate calculations of ligand locations, sizes, and centers in future cells. To prevent this, the chain ID of the first occurrence of each ligand present in the input PDB file is recorded and is later used to select exactly the atoms that belong to that ligand.
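
The idea can be sketched with fixed-column slicing on a few made-up `HETATM` records. This is an illustration under stated assumptions, not the cell that follows: `first_chain_per_ligand` is a hypothetical helper, and a ligand is identified by its residue name plus residue number.

```python
def first_chain_per_ligand(pdb_lines):
    """Map (residue name, residue number) to the chain of its first HETATM record."""
    chains = {}
    for line in pdb_lines:
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()   # columns 18-20: residue name
        chain_id = line[21]              # column 22: chain identifier
        res_seq = int(line[22:26])       # columns 23-26: residue number
        chains.setdefault((res_name, res_seq), chain_id)
    return chains


# Invented records: the same ligand (LIG 401) appears in chains A and B.
records = [
    ("HETATM", 3001, "C1", "LIG", "A", 401),
    ("HETATM", 3002, "O1", "LIG", "A", 401),
    ("HETATM", 6001, "C1", "LIG", "B", 401),
    ("HETATM", 6050, "C1", "NAG", "B", 500),
]
sample = [
    f"{rec:<6}{serial:>5} {atom:<4} {res:>3} {chain}{seq:>4}"
    for rec, serial, atom, res, chain, seq in records
]
print(first_chain_per_ligand(sample))
```

Because `setdefault` only stores a key the first time it is seen, the copy of LIG 401 in chain B does not overwrite the chain recorded for the first copy.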

In [None]:
# determine which chain each ligand is in
lig_chain = []
with open(f"data/PDB_files/{pdb_id}_ligand.pdb", "r") as outfile:
    temp_ligs = []
    data = outfile.readlines()
    for linenum, line in enumerate(data):
        ligand = line.split()
        if "HETATM" in ligand[0]:
            lig1 = ligand[3] + ligand[5]
            if "." in lig1:
                temp_num = re.findall(r'\d+', ligand[4])
                temp2_num = ''.join(str(x) for x in temp_num)
                lig1 = ligand[3] + temp2_num
            if lig1 not in temp_ligs:
                temp_ligs.append(lig1)
                chain_id = ligand[4][0]
                lig_chain.append(chain_id[0])
print(lig_chain)

### Method 2:  Adding ligands from local .mol2 files

To dock a ligand that is not present in the imported PDB file, we can upload its mol2 file (which can be obtained from the PDB website) and obtain all the relevant information using ipywidgets. The upload widget will only accept mol2 files; any other file type will result in an error. Multiple files can be uploaded at once. To use the uploader, run the cell below. An upload button will then appear, allowing mol2 files to be selected. Once the files are uploaded, run the next cell to write each uploaded mol2 file into the "data/MOL2_files" folder.
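
Since a file extension alone does not guarantee valid content, a quick check of the uploaded bytes can catch obvious mistakes early. The sketch below is a minimal example that is independent of the widget; `looks_like_mol2` and `safe_target_path` are hypothetical helpers, and the check only looks for the two section headers every mol2 file is expected to contain.

```python
import os

# Section headers assumed to be present in any usable mol2 file.
REQUIRED_HEADERS = (b"@<TRIPOS>MOLECULE", b"@<TRIPOS>ATOM")


def looks_like_mol2(file_name, content):
    """Cheap sanity check on an uploaded file's name and raw bytes."""
    has_extension = file_name.lower().endswith(".mol2")
    has_headers = all(header in content for header in REQUIRED_HEADERS)
    return has_extension and has_headers


def safe_target_path(file_name, folder="data/MOL2_files"):
    """Drop any directory components from the uploaded name before joining."""
    return os.path.join(folder, os.path.basename(file_name))


example_bytes = b"@<TRIPOS>MOLECULE\ndemo\n@<TRIPOS>ATOM\n"
print(looks_like_mol2("ligand.mol2", example_bytes))
print(safe_target_path("ligand.mol2"))
```

This does not validate the chemistry in the file; it only rejects uploads that are clearly not in mol2 format.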

In [None]:
lig_files = []
upload = widgets.FileUpload(accept='.mol2', multiple=True)
display(upload)

In [None]:
# get information for each uploaded mol2 file
for file_num, upload_filename in enumerate(upload.value):
    uploaded_file = upload.value[upload_filename]
    uploaded_file_name = uploaded_file['metadata']['name']
    lig_files.append(uploaded_file_name)

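# write mol2 files into data/MOL2_files for each ligand
# (the loop variable is not called "name" so that the PDB ID text box defined earlier is not overwritten)
for lig_file in lig_files:
    with open("data/MOL2_files/" + str(lig_file), "wb") as fp:
        fp.write(upload.value[lig_file]["content"])
    filenames.append(lig_file)
    name_alone = lig_file.split('.')[0]
    ligs.append(name_alone)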

### Method 3: Adding ligands using user-input SMILES strings

If you are familiar with SMILES format, you can input the SMILES string for the ligand/s in the cell below. Invalid SMILES strings will result in an error. This method is not recommended for those with no experience with SMILES formatting, as a small mistake in the SMILES string can result in the creation of an invalid molecule and can cause issues in the docking process.
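
One common slip is an unclosed parenthesis or square bracket. The sketch below is a lightweight pre-check for that single kind of mistake; `brackets_balanced` is a hypothetical helper, and passing it does not mean the SMILES string is chemically valid, so the RDKit check used in this notebook is still needed.

```python
def brackets_balanced(smiles):
    """Return True if every '(' and '[' in the string is closed in the right order."""
    pairs = {")": "(", "]": "["}
    stack = []
    for char in smiles:
        if char in "([":
            stack.append(char)
        elif char in pairs:
            # a closing character must match the most recent opening one
            if not stack or stack.pop() != pairs[char]:
                return False
    return not stack


for candidate in ("CC(=O)Oc1ccccc1C(=O)O", "CC(=O", "C[NH3+"):
    print(candidate, brackets_balanced(candidate))
```

The first string is aspirin and passes; the other two are deliberately truncated and fail.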

In [None]:
# LEE- TEST EXTENSIVELY
style = {'description_width': 'initial'}
num_of_ligs = Dropdown(options = list(range(6)), description = 'Select number of ligands to input', style = style)
num_of_ligs

In [None]:
form_item_layout = Layout(
    display='flex',
    flex_flow='row',
    justify_content='space-between')

name1 = Text(value = '', placeholder='Type the name of ligand 1 with no spaces', disabled=False)
name2 = Text(value = '', placeholder='Type the name of ligand 2 with no spaces', disabled=False)
name3 = Text(value = '', placeholder='Type the name of ligand 3 with no spaces', disabled=False)
name4 = Text(value = '', placeholder='Type the name of ligand 4 with no spaces', disabled=False)
name5 = Text(value = '', placeholder='Type the name of ligand 5 with no spaces', disabled=False)
scratch1 = Text(value = '', placeholder='Type in the SMILES string for ligand 1', disabled=False)
scratch2 = Text(value = '', placeholder='Type in the SMILES string for ligand 2', disabled=False)
scratch3 = Text(value = '', placeholder='Type in the SMILES string for ligand 3', disabled=False)
scratch4 = Text(value = '', placeholder='Type in the SMILES string for ligand 4', disabled=False)
scratch5 = Text(value = '', placeholder='Type in the SMILES string for ligand 5', disabled=False)


form_items1 = [Box([Label(value='Name of ligand 1'), name1], layout=form_item_layout),
               Box([Label(value='Name of ligand 2'), name2], layout=form_item_layout),
               Box([Label(value='Name of ligand 3'), name3], layout=form_item_layout),
               Box([Label(value='Name of ligand 4'), name4], layout=form_item_layout), 
               Box([Label(value='Name of ligand 5'), name5], layout=form_item_layout)]

form_items2 = [Box([Label(value='SMILES string for ligand 1'), scratch1], layout=form_item_layout),
               Box([Label(value='SMILES string for ligand 2'), scratch2], layout=form_item_layout),
               Box([Label(value='SMILES string for ligand 3'), scratch3], layout=form_item_layout),
               Box([Label(value='SMILES string for ligand 4'), scratch4], layout=form_item_layout), 
               Box([Label(value='SMILES string for ligand 5'), scratch5], layout=form_item_layout)]

form1 = Box(form_items1, layout = Layout(
    display = 'flex',
    flex_flow = 'column',
    border = 'solid 2px',
    align_items = 'stretch',
    width = '50%'
))
form2 = Box(form_items2, layout = Layout(
    display = 'flex',
    flex_flow = 'column',
    border = 'solid 2px',
    align_items = 'stretch',
    width = '50%'
))

form = HBox([form1, form2])
form

In [None]:
name_vals = {"name1": name1.value, "name2": name2.value, 
             "name3": name3.value,"name4": name4.value, "name5": name5.value}
scratch_vals = {"scratch1": scratch1.value, "scratch2": scratch2.value, 
                "scratch3": scratch3.value, "scratch4": scratch4.value, "scratch5": scratch5.value}
smiles = []
smile_names = []
# only read as many entries as were selected in the dropdown above
for a in range(num_of_ligs.value):
    lig_name = name_vals["name" + str(a + 1)]
    lig_scratch = scratch_vals["scratch" + str(a + 1)]
    # Chem.MolFromSmiles returns None when the SMILES string cannot be parsed
    lig_test = Chem.MolFromSmiles(lig_scratch)
    # skip empty, overly long, or unparsable SMILES strings
    if 0 < len(lig_scratch) < 2000 and lig_test is not None:
        smile_names.append(lig_name)
        smiles.append(lig_scratch)
    
out=pybel.Outputfile(filename='data/MOL2_files/InputMols.mol2',format='mol2',overwrite=True)
for index, smi in enumerate(smiles):
    mol = pybel.readstring(string=smi,format='smiles')
    mol.title= str(smile_names[index])
    mol.make3D('mmff94s')
    mol.localopt(forcefield = 'mmff94s', steps = 500)
    out.write(mol)
out.close()

In [None]:
separate_mol2_ligs(filename = 'data/MOL2_files/InputMols.mol2')

## Cleaning and Preparing Ligands for Docking

Before docking, both the protein receptor and the ligand/s need to be sanitized so that the molecular structures are valid and biologically irrelevant, unlikely, or impossible poses become less likely. Sanitizing includes adding the hydrogens that are missing from the PDB/MOL2 files, making sure the charges of the protein are correct, and converting both the PDB (protein receptor) and MOL2 (ligand/s) files to PDBQT format. The PDBQT format stores the hydrogen and charge information for each molecule and is required for docking with the VINA engine.

In [None]:
# protein sanitization
# add hydrogens to protein receptor
input_file = "data/PDB_files/" + str(pdb_id) + "_protein.pdb"
pqr_file = "data/PDB_files/" + str(pdb_id) + "_protein.pqr"
output_file = "data/PDB_files/" + str(pdb_id) + "_protein_H.pdb"
! pdb2pqr --pdb-output {output_file} --pH 7.4 --whitespace {input_file} {pqr_file}

In [None]:
# protein sanitization
# create pdbqt file for receptor
to_pdbqt = mda.Universe(pqr_file)
to_pdbqt.atoms.write(f"data/PDBQT_files/{pdb_id}_protein.pdbqt")

# replace "TITLE" and "CRYST1" labels with "REMARK" to reduce chance of errors later on
with open(f"data/PDBQT_files/{pdb_id}_protein.pdbqt", 'r') as file:
    file_content = file.read()
file_content = file_content.replace('TITLE', 'REMARK').replace('CRYST1', 'REMARK')
with open(f"data/PDBQT_files/{pdb_id}_protein.pdbqt", 'w') as file:
    file.write(file_content)

In [None]:
# ligand sanitization
# add hydrogens to ligands
filenames_H = []
a = 0
for i in filenames:
    mol= [m for m in pybel.readfile(filename= str(i),format='mol2')][0]
    mol.addh()
    s = "data/MOL2_files/" + str(ligs[a]) + "_H.mol2"
    filenames_H.append(s)
    out = pybel.Outputfile(filename= "data/MOL2_files/" + str(ligs[a]) + "_H.mol2",format='mol2',overwrite=True)
    out.write(mol)
    out.close()
    a += 1

In [None]:
# ligand sanitization
# convert to pdbqt
filenames_pdbqt = []
# read the hydrogen-added MOL2 files so that the PDBQT files keep the added hydrogens
for n, i in enumerate(filenames_H):
    ligand = [m for m in pybel.readfile(filename= str(i) ,format='mol2')][0]
    s = "data/PDBQT_files/" + str(ligs[n]) + "_H.pdbqt"
    filenames_pdbqt.append(s)
    ligand.write(filename = s, format='pdbqt', overwrite=True)

For docking, information about the size and center of the ligand/s is needed to ensure that the entire ligand can be docked to the desired binding pocket. To add a little bit of "wiggle room", the lengths of the x, y, and z dimensions are increased by 5 angstroms (if the length is positive, five is added; if the length is negative, five is subtracted).

In [None]:
# get center and size of ligand/s
lig_box_c = []
lig_box_s = []
for h, i in enumerate(filenames_H):
    res_name_joined = ligs[h][0:3]
    res_id_joined = ligs[h][3:]
    if h < len(lig_chain):
        res_chain = lig_chain[h]
        ligand_mda = u.select_atoms("resname " + str(res_name_joined) + " and resnum " + str(res_id_joined) +" and chainID " + str(res_chain))
    else:
        # need to create a new universe. "u" as universe will not work as the ligand will not be in the pdb file
        # and thus would not have a chain id. every atom in the ligand file belongs to the ligand, so all atoms are used
        u2 = mda.Universe(i)
        ligand_mda = u2.atoms
    pocket_center = ligand_mda.center_of_geometry()
    pocket_center_list = np.ndarray.tolist(pocket_center)
    ligand_box = ligand_mda.positions.max(axis=0) - ligand_mda.positions.min(axis=0)
    ligand_box_list = np.ndarray.tolist(ligand_box)
    ligand_box_list2 = []
    for value in ligand_box_list:
        if value < 0:
            ligand_box_list2.append(float(value - 5))
        elif value > 0:
            ligand_box_list2.append(float(value + 5))
        else:
            ligand_box_list2.append(float(0))
    lig_box_c.append(pocket_center_list)
    lig_box_s.append(ligand_box_list2)
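
The box calculation in the cell above can be reduced to a few lines of plain Python. The sketch below is illustrative only: the function name `docking_box` and the three-atom "ligand" are hypothetical, and unlike the cell above it pads every axis, including axes where the ligand has zero extent.

```python
def docking_box(coords, padding=5.0):
    """Return (center, size) of an axis-aligned box around a set of atoms.

    coords  : list of (x, y, z) tuples in angstroms
    padding : extra length added to each dimension as "wiggle room"
    """
    if not coords:
        raise ValueError("at least one atom is required")
    n = len(coords)
    # center of geometry: unweighted mean of the atom positions
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    # extent along each axis (max - min is never negative), plus padding
    size = [max(p[k] for p in coords) - min(p[k] for p in coords) + padding
            for k in range(3)]
    return center, size


# hypothetical, flat three-atom ligand
atoms = [(0.0, 0.0, 0.0), (2.0, 4.0, 0.0), (4.0, 2.0, 0.0)]
center, size = docking_box(atoms)
print(center)  # [2.0, 2.0, 0.0]
print(size)    # [9.0, 9.0, 5.0]
```

Because the extent is computed as maximum minus minimum, it can never be negative, so only the "add five" branch of the padding rule is ever used in practice.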

## Find possible binding pockets in protein using fpocket

fpocket is an algorithm that aids in protein pocket detection and scoring. Based on variables including solvent accessibility, the hydrophobicity of residues, density, flexibility, residue charges, and more (all contributing variables are listed in the table below), the likelihood of a pocket acting as a binding site to a nonspecified ligand is calculated (also known as the druggability score), which helps determine possible docking boxes to be used in ligand docking.

Column descriptions for data output (pocket_descriptors.csv):

| Descriptor | Role |
| :--- | :--- |
| drug_score | score ranging from 0 to 1 describing the likelihood of a drug binding to a given pocket, where 0.5 is the threshold where the binding of a drug in the pocket is possible |
| volume | pocket volume|
|nb_asph| the number of alpha spheres in a pocket, which measures the size of cavity normalized to the largest pocket|
|inter_chain | an integer equal to 0 (if the pocket is made of a single chain) or 1 (if the pocket is comprised of 2 chains)|
|apol_asph_proportion | proportion of apolar alpha spheres; the percentage of alpha spheres in a pocket that are apolar|
|mean_asph_radius| mean alpha sphere radius|
|as_density| alpha sphere density of pocket, calculated by taking the mean of all alpha sphere pair-to-pair distances. smaller values indicate a more compact and dense pocket|
|mean_asph_solv_acc| mean alpha sphere solvent accessibility|
|mean_loc_hyd_dens| mean local hydrophobic density; identification of areas of the binding pocket with localized hydrophobicity. calculated by seeing how many apolar spheres overlap with each other. the sum of all apolar neighbors is divided by the total number of apolar spheres|
|flex| flexibility of pocket (b factor)|
|hydrophobicity_score| the hydrophobicity score, which is the mean hydrophobicity score of all residues in the pocket|
|volume_score| the volume score, which is the mean volume score of all amino acids in contact with at least one alpha sphere of the pocket|
|charge_score| the charge score, which is the mean charge for all amino acids in contact with at least one alpha sphere of the pocket|
|polarity_score| the polarity score, which is the hydrophilicity of the binding pocket, which is calculated by taking the mean of all polarity scores of all residues in the pocket|
|a0_apol | describes apolar Van der Waals surface of pocket|
|a0_pol | describes polar Van der Waals surface of pocket|
|af_apol | describes apolar Van der Waals surface of pocket|
|af_pol | describes polar Van der Waals surface of pocket|
|n_abpa| the number of abpas in the binding site |
|three-letter amino acid code (e.g. "ala")|absolute amino acid composition of a given pocket, divided into groups by amino acid|
|chain_1_type| chain 1 type; an integer equal to 0 (if the pocket is a protein pocket), 1 (if the pocket is a nucleic acid pocket), or 2 (if the pocket is a HETATM pocket)|
|chain_2_type| chain 2 type; an integer equal to 0 (if the pocket is a protein pocket), 1 (if the pocket is a nucleic acid pocket), or 2 (if the pocket is a HETATM pocket)|
|num_res_chain_1|  the total number of residues in chain 1|
|num_res_chain_2| number of residues on chain 2. if the pocket is only made up of one chain, the value of this descriptor is equal to the value of "num_res_chain_1"|
|lig_het_tag|  HETATM tag of ligands situated in the binding pocket|
|name_chain_1|  the name of the first chain in contact with the pocket (denoted using a letter [e.g. "A"])|
|name_chain_2|  the name of the second chain in contact with the pocket (denoted using a letter [e.g. "A"]). if the pocket is only made up of one chain, the value of this descriptor is equal to the value of "name_chain_1"|

In [None]:
#use fpocket to view potential pockets in protein
! fpocket -f {"data/PDB_files/" + str(pdb_id) + "_clean.pdb"} -d > {"data/pocket_descriptors.csv"}

In [None]:
prot_pockets = pd.read_csv('data/pocket_descriptors.csv',sep=' ',index_col=[0])

In [None]:
#get pockets and docking boxes for all pockets in a dataframe
fpocket_out = "data/PDB_files/" + str(pdb_id)+ "_clean_out/"
f_pocket_dir = os.path.join(current_dir, fpocket_out)
for file in os.listdir(f_pocket_dir):
    if 'env_atm' in file:
        atoms = []
        res_and_atoms = []
        pocket_num = int(file.split('_')[0].replace('pocket',''))
        out_dir = os.path.join(f_pocket_dir, file)
        with open(out_dir, 'r') as outfile:
            data = outfile.readlines()
        for line in data:
            split_line = line.split()
            if len(split_line) > 1:
                select_atom_num = split_line[1]
                select_atom = split_line[2]
                select_residue = split_line[3]
                select_residue_num = split_line[5]
                # if the residue number for a protein has four digits (greater than 999), split_line[5] will be 
                # equal to the x coordinate of the atom as the whitespace between the chain identifier and the 
                # residue number will disappear. the following if statement addresses this
                if "." in select_residue_num: 
                    temp_residue_num = re.findall(r'\d+', split_line[4])
                    temp2_residue_num = ''.join(str(x) for x in temp_residue_num)
                    select_residue_num = int(temp2_residue_num)
                atoms.append(select_atom_num)
                md_input1 = "(resid " + str(select_residue_num) + " and name " + str(select_atom) + ")"
                res_and_atoms.append(md_input1)

        # get center of docking box
        atom_string = ', '.join(str(x) for x in atoms)
        res_and_atom_string = ' or '.join(str(x) for x in res_and_atoms)
        md_input2 = "id " + str(atom_string)
        pocket_mda = u.select_atoms(res_and_atom_string)
        pocket_center = pocket_mda.center_of_geometry()
        pocket_center_list = np.ndarray.tolist(pocket_center)

        # get size of docking box
        ligand_box = pocket_mda.positions.max(axis=0) - pocket_mda.positions.min(axis=0)
        ligand_box_list = np.ndarray.tolist(ligand_box)
        ligand_box_list2 = []
        for value in ligand_box_list:
            if value < 0:
                ligand_box_list2.append(float(value - 5))
            elif value > 0:
                ligand_box_list2.append(float(value + 5))
            else:
                ligand_box_list2.append(float(0))
        
        prot_pockets.loc[pocket_num,'center_x'] = pocket_center_list[0]
        prot_pockets.loc[pocket_num,'center_y'] = pocket_center_list[1]
        prot_pockets.loc[pocket_num,'center_z'] = pocket_center_list[2]
        prot_pockets.loc[pocket_num,'size_x'] = abs(ligand_box_list2[0])
        prot_pockets.loc[pocket_num,'size_y'] = abs(ligand_box_list2[1])
        prot_pockets.loc[pocket_num,'size_z'] = abs(ligand_box_list2[2])
        
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(prot_pockets)
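
The cell above works around the chain identifier and a four-digit residue number running together when a line is split on whitespace. An alternative, sketched below, is to read the fields by column position. This assumes the atom lines follow the standard fixed-width PDB `ATOM` layout; the helper names and the sample atom are hypothetical.

```python
def format_atom_line(serial, name, resname, chain, resnum, x, y, z):
    """Build a minimal fixed-width PDB ATOM line (illustration only)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resnum:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line):
    """Read the fields of a PDB ATOM line by column position."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "resname": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                 # column 22
        "resnum": int(line[22:26]),        # columns 23-26
        "xyz": (float(line[30:38]),        # columns 31-38
                float(line[38:46]),        # columns 39-46
                float(line[46:54])),       # columns 47-54
    }


# hypothetical atom on chain A with a four-digit residue number
line = format_atom_line(1, "CA", "LYS", "A", 1024, 12.345, -6.789, 0.5)
tokens = line.split()
record = parse_atom_line(line)

print(tokens[4])          # A1024  (chain and residue number are fused)
print(record["chain"])    # A
print(record["resnum"])   # 1024
```

Column-based parsing does not depend on whitespace between fields, so it behaves the same for residue numbers of any width.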

To make docking more efficient, all pockets with a druggability score of at least 0.20 will be added to a new dataframe, which will be used to create the docking boxes for ligands. If the number of possible binding pockets found by fpocket exceeds 25, it is highly recommended that the cell below is run.

In [None]:
prot_pockets2 = prot_pockets[prot_pockets['drug_score'] >= 0.20]
prot_pockets2

In [None]:
prot_pockets2.to_csv("data/protein_pockets.csv")
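
For readers who want to see the filtering step without pandas, here is a minimal sketch using only the standard library. The sample text is a hypothetical excerpt: a real descriptor file has many more columns, and the `cav_id` column name is an assumption about the file header.

```python
import csv
import io

# hypothetical, shortened stand-in for data/pocket_descriptors.csv
sample = """cav_id drug_score volume
1 0.85 912.4
2 0.12 301.7
3 0.20 450.0
"""


def select_pockets(text, threshold=0.20):
    """Return the ids of pockets whose drug_score is at least `threshold`."""
    reader = csv.DictReader(io.StringIO(text), delimiter=" ")
    return [int(row["cav_id"]) for row in reader
            if float(row["drug_score"]) >= threshold]


kept = select_pockets(sample)
print(kept)  # [1, 3]
```

The comparison is inclusive, so a pocket scoring exactly 0.20 is kept, matching the `>=` used in the cell above.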

## View ligands and proteins together prior to docking

While viewing the ligand/s and receptor is not required, it is helpful to see what the molecules look like and where the possible binding pockets are on the receptor. There are a few different methods this notebook will use to visualize the ligands/proteins to be used in docking. <br>
The first method this notebook will be using is RDKit's Draw module, which takes RDKit molecules and displays a static image of them. This method is easy to implement and only takes one line of code (assuming a list of RDKit molecules already exists).<br> 
The second method that will be used is py3Dmol, which requires more code to implement but allows the user to move and rotate the molecule/s and allows larger molecules (including proteins) to be viewed.

In [None]:
# create list of rdkit molecules
mols = []
ligand_smiles = []
for i in ligs:
    mol = Chem.MolFromMol2File("data/MOL2_files/" + str(i) + "_H.mol2",sanitize=False)
    select_mol_smile = Chem.MolToSmiles(mol)
    ligand_smiles.append(select_mol_smile)
    mols.append(mol)

# view ligands
Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(300,300))

Below is code to create the py3Dmol viewer, which consists of three different views. They are as follows:
1. a viewer containing the ligand/s and the receptor, in which the space filling model (surface) of the receptor is present
2. a viewer containing the ligand/s and the receptor, with the addition of transparent boxes around each ligand demonstrating the size and center of the ligand docking boxes. The colors of the ligand boxes differ for clarity's sake, but are otherwise meaningless
3. a viewer containing the ligand/s and the receptor, with the addition of the binding pockets found by fpocket. The colors of the binding pockets differ for clarity's sake, but are otherwise meaningless

To avoid parsing through every binding pocket file only to visualize a portion of them, a list containing the pqr file paths for all binding pockets with a druggability score of at least 0.20 will be created. In addition to the information recorded in pdb files, the pqr file format includes charge and radius fields for each atom in a binding pocket.
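
As a rough illustration of the extra pqr fields, the sketch below pulls the coordinates, charge, and radius out of a single atom line. It assumes the whitespace-delimited pqr layout in which the last five fields are x, y, z, charge, and radius; the function name and the sample line are hypothetical.

```python
def parse_pqr_atom(line):
    """Return coordinates, charge, and radius from one PQR atom line.

    Returns None for lines that are not ATOM/HETATM records.
    """
    fields = line.split()
    if not fields or fields[0] not in ("ATOM", "HETATM"):
        return None
    # the last five fields are x, y, z, charge, radius
    x, y, z, charge, radius = (float(v) for v in fields[-5:])
    return {"xyz": (x, y, z), "charge": charge, "radius": radius}


# hypothetical pocket sphere line
line = "ATOM      1    O STP     1      24.386  13.221  25.342    0.00     3.46"
rec = parse_pqr_atom(line)
print(rec["charge"])  # 0.0
print(rec["radius"])  # 3.46
```

Reading from the end of the line keeps the sketch independent of whether a chain identifier column is present.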

In [None]:
revised_files = []
pocketPath = os.path.join(current_dir, "data", "PDB_files", str(pdb_id) + "_clean_out", "*.pqr")
pocketFiles = glob.glob(pocketPath)
for file in pocketFiles:
    # take the file name from the path in a way that works on any operating system
    split_1 = os.path.basename(file)
    split_2 = split_1.split("_")[0]
    index_num = re.findall(r'\d+', split_2)
    index_num2 = ''.join(str(x) for x in index_num)
    if int(index_num2) in prot_pockets2.index:
        revised_files.append(file)

<div class="alert alert-block alert-info">
<b>Please note:</b> 
When viewing a larger protein receptor, a py3Dmol viewer may not be able to support three views at once. For proteins that are greater than 100 kDa in size, it is recommended to use two different py3Dmol viewers that depict one view each.
</div>

The cell below creates a py3Dmol viewer that has all three views described above in a grid that are linked, where moving one view moves the other two views to the same position in space. __Only use for proteins that are less than 100 kDa in size.__

In [None]:
# View Protein and ligand/s together
# Only use for smaller proteins
view = py3Dmol.view(height = 800, width = 900, viewergrid = (1,3), linked = True)
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

for i in filenames_H:
    viewer_count = 0
    while viewer_count < 3:
        # add receptor (protein) model to all py3Dmol viewers
        view.addModel(open('data/PDB_files/' + str(pdb_id) + '_protein.pdb','r').read(),format='pdb')
        Prot=view.getModel(viewer = (0,viewer_count))
        Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}}, viewer=(0,viewer_count))
 
        # add ligand/s to all py3Dmol viewers
        view.addModel(open(i,'r').read(),format='mol2')
        ref_m = view.getModel(viewer = (0, viewer_count))
        ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})
        viewer_count += 1

view.addSurface(py3Dmol.VDW,{'opacity':0.6,'color':'white'}, viewer=(0,0))

#visualization for docking boxes for each ligand (viewer 2)
a = 0
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'magenta']
for num, i in enumerate(filenames_H):
    view.addBox({"center" :dict(x = lig_box_c[num][0], y = lig_box_c[num][1], z = lig_box_c[num][2]), "dimensions": dict(d = abs(lig_box_s[num][0]), h = abs(lig_box_s[num][1]), w = abs(lig_box_s[num][2])), "color" : colors[a], "opacity" : 0.5}, viewer = (0,1))
    a += 1
    if a > 6:
        a = 0

#visualization for binding pockets found by fpocket (viewer 3)
b = 0
for file in revised_files:
    view.addModel(open(file,'r').read(),format = 'pqr', viewer = (0,2))
    pockets = view.getModel(viewer = (0,2))
    pockets.setStyle({},{'sphere':{'color':colors[b],'opacity':0.5}}) 
    b += 1
    if b > 6:
        b = 0

view.zoomTo()
view.show()

The cells below create two separate py3Dmol viewers: the first visualizes the ligands and their docking boxes with the receptor, and the second visualizes the size and location of potential binding pockets along with the ligands and receptor. These can be used for proteins of any size, but are necessary for proteins greater than 100 kDa in size.

In [None]:
view = py3Dmol.view()
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

# add receptor (protein) model to py3Dmol viewer
view.addModel(open('data/PDB_files/' + str(pdb_id) + '_protein.pdb','r').read(),format='pdb')
Prot=view.getModel()
Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})

#visualization for ligands and docking boxes for each ligand
for i in filenames_H:
    view.addModel(open(i,'r').read(),format='mol2')
    ref_m = view.getModel()
    ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})
    
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'magenta']
a = 0
for j, i in enumerate(filenames_H):
    view.addBox({"center": dict(x = lig_box_c[j][0], y = lig_box_c[j][1], z= lig_box_c[j][2]), "dimensions": dict(d = abs(lig_box_s[j][0]), h = abs(lig_box_s[j][1]), w = abs(lig_box_s[j][2])), "color" : colors[a], "opacity" : 0.5})
    a += 1
    if a > 6:
        a = 0

view.zoomTo()
view.show()

In [None]:
view = py3Dmol.view()
view.removeAllModels()
view.setViewStyle({'style':'outline','color':'black','width':0.1})

# add receptor (protein) model to py3Dmol viewer
view.addModel(open('data/PDB_files/' + str(pdb_id) + '_protein.pdb','r').read(),format='pdb')
Prot=view.getModel()
Prot.setStyle({'cartoon':{'arrows':True, 'tubes':True, 'style':'oval', 'color':'white'}})

# visualization for ligands
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'magenta']
for h, i in enumerate(filenames_H):
    # add ligand/s to the py3Dmol viewer
    view.addModel(open(i,'r').read(),format='mol2')
    ref_m = view.getModel()
    ref_m.setStyle({},{'stick':{'colorscheme':'greenCarbon','radius':0.2}})

a = 0
for file in revised_files:
    view.addModel(open(file,'r').read(),format = 'pqr')
    pockets = view.getModel()
    pockets.setStyle({},{'sphere':{'color':colors[a],'opacity':0.5}}) 
    a += 1
    if a > 6:
        a = 0

view.zoomTo()
view.show()

## Save results for further use

To use the data collected in this notebook for the next notebook in this series (Docking and Preliminary Analysis), two .csv files will be created: one containing ligand filenames and ligand box sizes and centers, and one containing the SMILES string for each ligand. This allows the variables to be easily imported and used.

In [None]:
center_x = []
center_y = []
center_z = []
size_x = []
size_y = []
size_z = []
for h, i in enumerate(ligs):
    center_x.append(lig_box_c[h][0])
    center_y.append(lig_box_c[h][1])
    center_z.append(lig_box_c[h][2])
    size_x.append(lig_box_s[h][0])
    size_y.append(lig_box_s[h][1])
    size_z.append(lig_box_s[h][2])
ligand_information = pd.DataFrame({"ligs": ligs,
                                   "filenames": filenames,
                                   "filenames_H": filenames_H,
                                   "filenames_pdbqt": filenames_pdbqt,
                                   "center_x": center_x,
                                   "center_y": center_y,
                                   "center_z": center_z,
                                   "size_x": size_x,
                                   "size_y": size_y,
                                   "size_z": size_z
                                  })
ligand_smiles_data = pd.DataFrame({"filename_hydrogens": filenames_H,
                                   "smiles": ligand_smiles})
ligand_information.to_csv('data/ligand_information.csv', index = False)
ligand_smiles_data.to_csv('data/ligand_smiles_data.csv', index = False)
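
A minimal sketch of how a later notebook might read the box information back in is shown below. It uses only the standard library and an in-memory stand-in for `data/ligand_information.csv`; the ligand name and all values are hypothetical, and only a subset of the saved columns is included.

```python
import csv
import io

# hypothetical stand-in for data/ligand_information.csv (subset of columns)
sample_csv = """ligs,filenames_pdbqt,center_x,center_y,center_z,size_x,size_y,size_z
LIG401,data/PDBQT_files/LIG401_H.pdbqt,10.5,-3.25,7.0,14.0,12.5,11.0
"""


def load_boxes(handle):
    """Map each ligand name to its PDBQT path, box center, and box size."""
    boxes = {}
    for row in csv.DictReader(handle):
        center = tuple(float(row[f"center_{axis}"]) for axis in "xyz")
        size = tuple(float(row[f"size_{axis}"]) for axis in "xyz")
        boxes[row["ligs"]] = {"pdbqt": row["filenames_pdbqt"],
                              "center": center,
                              "size": size}
    return boxes


boxes = load_boxes(io.StringIO(sample_csv))
print(boxes["LIG401"]["center"])  # (10.5, -3.25, 7.0)
print(boxes["LIG401"]["size"])    # (14.0, 12.5, 11.0)
```

With a real file, the same function could be called as `load_boxes(open('data/ligand_information.csv', newline=''))`.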