# Basil Docking V0.1 - Machine Learning Analysis
## Purpose

__Target Audience__<br>
Undergraduate chemistry/biochemistry students and, in general, people who have little to no knowledge of protein-ligand docking and would like to understand the general process of docking a ligand to a protein receptor.

__Brief Overview__<br>
Molecular docking is a computational method used to predict where molecules can bind to a protein receptor and what interactions exist between the molecule (from now on, referred to as the "ligand") and the receptor. It is a popular technique in drug discovery and design: when creating new drugs or testing existing drugs against new receptors, estimating the likelihood of binding before screening makes it possible to eliminate molecules that are unlikely to bind to the receptor. This significantly reduces the cost and time needed to test the efficacy of a set of possible ligands. <br>

The general steps to perform molecular docking, assuming the ligand and receptor are ready to be docked, are the generation of potential ligand binding poses and the scoring of each generated pose (the score predicts how strongly the ligand binds to the receptor, with a more negative score corresponding to stronger binding). To dock a ligand to a protein, (insert text).<br>
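
To make the scoring convention concrete, here is a minimal sketch that ranks a few poses by score and picks the most negative one. The pose names and score values are made-up placeholders for illustration, not output from VINA or SMINA, and `rank_poses` is a hypothetical helper.

```python
# Hypothetical pose scores in kcal/mol (illustrative values only, not real docking output)
pose_scores = {"pose_1": -7.2, "pose_2": -8.5, "pose_3": -6.1}

def rank_poses(scores):
    """Return (pose, score) pairs sorted from most negative score to least negative."""
    return sorted(scores.items(), key=lambda item: item[1])

ranked = rank_poses(pose_scores)
best_pose, best_score = ranked[0]  # most negative score = strongest predicted binding
print(best_pose, best_score)
```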

This notebook series encompasses<br>
1. The preparation needed prior to docking (protein and ligand sanitation, ensuring files are in readable formats, and finding possible binding pockets)
2. The process of docking ligand/s to a protein receptor using two docking engines (VINA and SMINA) and visualizing/analyzing the outputs
3. Further data collection and manipulation
4. __Utilizing machine learning to determine key residues (on the protein) and functional groups (on the ligand) responsible for protein-ligand binding__

__Stepwise summary for this notebook (machine learning analysis, notebook 4 out of 4)__<br>
- Determine the likelihood of a compound being orally bioactive using Lipinski's Rule of Five


## Table of Libraries Used
### Operations, variable creation, and variable manipulation

| Module (Submodule)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| numpy | np | performs mathematical operations, fixes NaN values in dataframe outputs, and gets docking box values from MDAnalysis | Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2. |
| pandas | pd | organizes data in an easy-to-read format and allows for the exporting of data as a .csv file | The pandas development team. (2024). pandas-dev/pandas: Pandas (v2.2.3). Zenodo. https://doi.org/10.5281/zenodo.13819579 |
| re |n/a| regular expression; finds and pulls specific strings of characters depending on need, allows for easy naming and variable creation | Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| os | n/a| allows for interaction with computer operating system, including the reading and writing of files |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation. |
| sys |n/a| manipulates python runtime environment |  Van Rossum, G. (2020). The Python Library Reference, release 3.8.2. Python Software Foundation.|

### Protein and Ligand Preparation
| Module (Submodule)| Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| open babel (pybel)| n/a | adds hydrogens to ligands and prepares ligands for docking |  O'Boyle, N.M., Banck, M., James, C.A. et al. Open Babel: An open chemical toolbox. J Cheminform 3, 33 (2011). https://doi.org/10.1186/1758-2946-3-33.|
| rdkit (Chem)| n/a | ligand creation and sanitation |  RDKit: Open-source cheminformatics; http://www.rdkit.org |

### Machine Learning Methods
| Library/Module | Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| sklearn (RandomForestClassifier, DecisionTreeClassifier, SVC)| n/a | add descrip. |  Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. |

### Data analysis
| Module (Submodule) | Abbreviation | Role | Citation |
| :--- | :--- | :--- | :--- |
| rdkit (Chem (AllChem, Crippen, Lipinski))| n/a | calculate Lipinski descriptors using RDKit mol and SMILES strings  |  RDKit: Open-source cheminformatics; http://www.rdkit.org |
| prolif | plf | calculate, record, and view protein-ligand interactions|  chemosim-lab/ProLIF: v0.3.3 - 2021-06-11. https://doi.org/10.5281/zenodo.4386984. |

To use this notebook, certain libraries must be installed for the analysis to run as planned. You can either use a conda environment (provided as a yml file) or install the required libraries with pip install. Only run the cells below if you are not using the conda environment, and only run the ones you need. If you are using the conda environment, start at the code cell that imports the libraries.
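
Before installing anything, it can help to check which libraries are already available. The sketch below uses the standard library's `importlib.util.find_spec` to do that; the module list is an assumption based on the library tables above (import names, not pip package names), and `find_missing` is a hypothetical helper rather than part of this notebook.

```python
import importlib.util

# Import names (not pip package names) of the libraries this notebook relies on.
# This list is an assumption based on the library tables above; adjust it as needed.
REQUIRED_MODULES = ["numpy", "pandas", "rdkit", "openbabel", "sklearn", "prolif"]

def find_missing(modules):
    """Return the names in `modules` that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

Only the libraries reported as missing would then need one of the install cells below.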

In [None]:
# create cell to import all libraries via pip install
! pip install numpy

In [None]:
! pip install pandas

In [None]:
! pip install ipywidgets

In [None]:
! pip install mdanalysis

In [None]:
# pybel is part of Open Babel; the PyPI package named "pybel" is an unrelated project
! pip install openbabel

In [None]:
! pip install rdkit

In [None]:
! pip install scikit-learn

In [None]:
! pip install prolif

In [None]:
! pip install scipy

In [None]:
# Random Forest, SVM, DecisionTree
import re
import numpy as np
import pandas as pd
import os, sys
from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem, Draw, Crippen, Lipinski
from rdkit import DataStructs
from rdkit.Chem.Draw import IPythonConsole

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split, cross_val_score, cross_validate
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.ensemble import RandomForestClassifier #maybe regressor using score as target

IPythonConsole.ipython_useSVG=True

## Lipinski's Rule of 5 and Oral Bioactivity

In [None]:
# Lipinski's Rule of 5: use a decision tree to determine whether a ligand/derivative is likely to be pharmacologically active
# (only applies to orally active compounds -- look into other routes of administration too?)

In [None]:
lro5 = pd.read_csv('data/ligand_smiles_data.csv') # find way to add other csv files?
# get possible ligands based on receptor protein type?

In [None]:
# determine and record the number of atoms and the number of heavy atoms in each ligand
atom_type_dict = {}
atom_abbriv = ['C','N','O','F','Al','P','S','Cl','Cr','Mn','Fe','Co','Ni','Cu',
               'Zn','Ga','Ge','As','Br','Zr','Mo','Pd','Ag','Cd','In','Sn','Sb',
               'I','Ir','Pt','Au','Hg','Pb','Bi']
mol_format = []
atom_total = []
atom_total_heavy = []
for index, row in lro5.iterrows():
    mol = Chem.MolFromMol2File(row['filename_hydrogens'],sanitize=False)
    if mol is not None:
        mol_H = Chem.AddHs(mol)
        mol_format.append(mol_H)
        mol_atoms = mol_H.GetNumAtoms()
        atom_total.append(mol_atoms)
        mol_atoms_heavy = mol_H.GetNumHeavyAtoms()
        atom_total_heavy.append(mol_atoms_heavy)
    else:
        # fallback when the mol2 file cannot be read: estimate atom counts from the SMILES string
        # (currently only works for molecules containing only atoms with single-letter symbols, need to fix)
        string = row['smiles']
        string_alpha = re.findall(r'[a-zA-Z]', string)
        string_H = re.findall(r'[H]', string)
        mol_format.append(np.nan)
        atom_total.append(len(string_alpha))
        atom_total_heavy.append(len(string_alpha) - len(string_H))
lro5['mol'] = mol_format
lro5['num_of_atoms'] = atom_total
lro5['num_of_heavy_atoms'] = atom_total_heavy
lro5

In [None]:
# determine the number of different heavy atoms
# LEE - TEST WITH LIG CONT. Co2+ or Co3+
num_of_atoms_dict = {}
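def number_of_atoms(atom_list, df):
    """Count the atoms of each element in atom_list using the SMILES strings in df['smiles'].

    Adds one column per element, named num_of_<element>_atoms, to df in place.
    """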
    for i in atom_list:
        substruct_list = []
        for index, row in df.iterrows():
            smile_string = row['smiles']
            if len(i) == 1:
                string_finder_lower = re.findall(r'{}(?![aelu+][+\d])(?!([aeolu]+[+\d]))'.format(i.lower()), smile_string)
                string_finder_upper = re.findall(r'{}(?![aelu+][+\d])(?!([aeolu]+[+\d]))'.format(i), smile_string)
                substruct_list.append(len(string_finder_lower) + len(string_finder_upper))
            else:
                string_finder_brackets = re.findall(r'[\[]{}[\]]'.format(i), smile_string)
                string_finder_charged = re.findall(r'[\[]{}[+][+\d]'.format(i), smile_string)
                substruct_list.append(len(string_finder_brackets) + len(string_finder_charged))
        df['num_of_{}_atoms'.format(i)] = substruct_list

number_of_atoms(atom_abbriv, lro5)
lro5
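
As an alternative way to count heavy atoms per element, the sketch below tokenizes a SMILES string into atoms instead of searching for one element at a time, so two-letter symbols such as Cl, Br, or Co are not confused with single-letter ones. The pattern and the `count_heavy_atoms` helper are assumptions for illustration, not code from this notebook. It counts heavy atoms only: hydrogens that are implicit in the SMILES string are not counted, and parsing the SMILES with RDKit is likely more robust when that is possible.

```python
import re
from collections import Counter

# One SMILES atom token: either a bracket atom such as [Co+3] or [nH],
# or an unbracketed atom from the organic subset (two-letter Cl/Br are tried first).
ATOM_TOKEN = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|se|as|[bcnops])[^\]]*\]"
    r"|(?P<organic>Cl|Br|[BCNOPSFI]|[bcnops])"
)

def count_heavy_atoms(smiles):
    """Count heavy (non-hydrogen) atoms per element in a SMILES string."""
    counts = Counter()
    for match in ATOM_TOKEN.finditer(smiles):
        # aromatic atoms are written in lowercase, so normalize to the element symbol
        symbol = (match.group("bracket") or match.group("organic")).capitalize()
        if symbol != "H":
            counts[symbol] += 1
    return counts

print(count_heavy_atoms("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(count_heavy_atoms("[Co+3]"))                 # a charged metal ion
```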

In [None]:
# calculate weight of ligands
atom_weights = {
    'C': 12.011,
    'N': 14.007,
    'O': 15.999,
    'F': 18.998,
    'Al': 26.981,
    'P': 30.974,
    'S': 32.06,
    'Cl': 35.45,
    'Cr': 51.9961,
    'Mn': 54.938,
    'Fe': 55.845,
    'Co': 58.933,
    'Ni': 58.693,
    'Cu': 63.546,
    'Zn': 65.38,
    'Ga': 69.723,
    'Ge': 72.630,
    'As': 74.921,
    'Br': 79.904,
    'Zr': 91.224,
    'Mo': 95.95,
    'Pd': 106.42,
    'Ag': 107.8682,
    'Cd': 112.414,
    'In': 114.818,
    'Sn': 118.71,
    'Sb': 121.760,
    'I': 126.904,
    'Ir': 192.217,
    'Pt': 195.08,
    'Au': 196.966570,
    'Hg': 200.592,
    'Pb': 207.2,
    'Bi': 208.980
}
ligand_weights = []
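# columns holding per-element atom counts, e.g. num_of_C_atoms (num_of_heavy_atoms does not match)
count_columns = [c for c in lro5.columns if re.fullmatch(r'num_of_[A-Z][a-z]?_atoms', c)]
for index, row in lro5.iterrows():
    weight_da = 0
    # add heavy-atom weights only when every heavy atom was assigned to an element
    if row['num_of_heavy_atoms'] == sum(row[count_columns]):
        for column_title in count_columns:
            element = column_title.split('_')[2]
            weight_da += atom_weights[element] * row[column_title]
    # hydrogens: total atoms minus heavy atoms, at 1.008 Da each
    weight_da += (row['num_of_atoms'] - row['num_of_heavy_atoms']) * 1.008
    ligand_weights.append(weight_da)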
lro5.insert(2, "molecular_weight", ligand_weights)
lro5
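
The weight calculation above amounts to summing atomic weight times atom count over every element in the ligand. The sketch below works through that arithmetic for a single molecule, aspirin (C9H8O4). The `molecular_weight` function and the small weight table are illustrative assumptions, using rounded standard atomic weights.

```python
# Rounded standard atomic weights for the elements used in this example
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def molecular_weight(element_counts):
    """Sum atomic weight x atom count over every element in the formula."""
    return sum(ATOMIC_WEIGHTS[element] * count
               for element, count in element_counts.items())

# Aspirin, C9H8O4: 9 x 12.011 + 8 x 1.008 + 4 x 15.999
aspirin = {"C": 9, "H": 8, "O": 4}
aspirin_weight = molecular_weight(aspirin)
print(round(aspirin_weight, 2))
```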

In [None]:
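# calculate logP (partition coefficient) and the number of hydrogen bond donors/acceptors of each ligand
log_P = []
H_donors = []
H_acceptors = []
for index, row in lro5.iterrows():
    mol = row['mol']
    if isinstance(mol, Chem.Mol):
        log_P.append(Crippen.MolLogP(mol))
        H_donors.append(Lipinski.NumHDonors(mol))
        H_acceptors.append(Lipinski.NumHAcceptors(mol))
    else:
        # no RDKit molecule for this row: record NaN so the lists stay aligned with the dataframe rows
        log_P.append(np.nan)
        H_donors.append(np.nan)
        H_acceptors.append(np.nan)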
lro5.insert(3, "log_P", log_P)
lro5.insert(4, "H_donors", H_donors)
lro5.insert(5, "H_acceptors", H_acceptors)
lro5
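
With molecular weight, logP, and hydrogen bond donor/acceptor counts available, Lipinski's Rule of Five can be checked directly. The sketch below is one possible way to do that in plain Python. The function names are hypothetical, the example descriptor values are placeholders, and allowing at most one violation is a common convention that is assumed here rather than taken from this notebook.

```python
def lipinski_violations(molecular_weight, log_p, h_donors, h_acceptors):
    """Return the list of Rule of Five criteria that a ligand violates."""
    violations = []
    if molecular_weight > 500:
        violations.append("molecular weight > 500 Da")
    if log_p > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    return violations

def passes_rule_of_five(molecular_weight, log_p, h_donors, h_acceptors, max_violations=1):
    """True if the ligand has no more than max_violations violations (assumed default: 1)."""
    found = lipinski_violations(molecular_weight, log_p, h_donors, h_acceptors)
    return len(found) <= max_violations

# Placeholder descriptor values for a small ligand and a large, greasy ligand
print(passes_rule_of_five(180.16, 1.3, 1, 3))
print(lipinski_violations(720.0, 6.2, 2, 12))
```

The same functions could be applied row by row to the `molecular_weight`, `log_P`, `H_donors`, and `H_acceptors` columns. Rows whose descriptors are NaN should be filtered out first, because comparisons with NaN are always false and such rows would appear to have no violations.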

In [None]:
# LEE NOTE TO LEE: create csv file with same columns generate above for testing/training

In [None]:
# determine important residues/residue types (receptor), functional groups (ligand), and important interactions