2021-02-12

# **XGBScore: A Gradient Boosted Decision Tree Scoring Function for Structure Based Virtual Screening**

This is my lab book for my dissertation project. It will contain my daily work towards the project, and *in silico* experiments performed in Python 3 code cells for debugging and evaluation before being turned into separate scripts for publication and analysis.

### **Table of Contents**
- Database Scraping
- Dataset Assembly
- Dataset Cleaning
- Feature Engineering
- Adding Decoys
- Splitting the Data
- Feature Importance/Dimensionality Reduction
- Training the Model
- Optimising the Model
- Evaluating the Model

# **Database Scraping**

While there will likely be lots of redundancy, four databases are being scraped for data for the training and test sets. These are:
- Binding MOAD
- PDBBind
- Binding DB
- DUD-E

Generally, the approach will be to download the raw data into a folder called 'Dissertation/Data/Raw/{database_name}/Compressed', e.g. for Binding_MOAD: 'Dissertation/Data/Raw/Binding_MOAD/Compressed'. Then, the files will be extracted into 'Dissertation/Data/Raw/Binding_MOAD/Extracted'.
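
A minimal sketch of how this folder convention could be set up in code. The function names `raw_data_dirs` and `make_raw_data_dirs` are hypothetical helpers for illustration, and the relative root path is an assumption that would need changing to wherever the project actually lives.

```python
from pathlib import Path


def raw_data_dirs(database_name, root='Dissertation/Data/Raw'):
    """Return the (Compressed, Extracted) directories for one database."""
    base = Path(root) / database_name
    return base / 'Compressed', base / 'Extracted'


def make_raw_data_dirs(database_name, root='Dissertation/Data/Raw'):
    """Create both directories if they do not already exist and return them."""
    directories = raw_data_dirs(database_name, root)
    for directory in directories:
        directory.mkdir(parents=True, exist_ok=True)
    return directories
```

Calling `make_raw_data_dirs('Binding_MOAD')` would then create the two Binding_MOAD folders named above.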

## **Binding MOAD**

Binding MOAD offers several different datasets. I have downloaded the structures (biounits) and binding data for the [**Non-redundant dataset only with binding data**](https://bindingmoad.org/Home/download). This dataset contains all complexes which have known binding data, and includes one 'leader' from each family with similar binding to prevent very similar data clogging the dataset. 

It has been downloaded to 'Dissertation/Data/Raw/Binding_MOAD/Compressed' and extracted to 'Dissertation/Data/Raw/Binding_MOAD/Extracted'. I will look at downloading the larger set once the pipeline for raw data to clean training and test datasets has been put together. 


The Binding MOAD files are stored as 'Biounit' files or .bio files, which are just partial structure PDB files and can be treated as such by changing the extension from '.bio1' to '.pdb'. We should download the original PDB files to a third folder 'Dissertation/Data/Raw/Binding_MOAD/original_PDB_files' just in case they are more useful. The code below downloads them:

In [None]:
import os
import requests
from tqdm import tqdm

def create_target_url(filepath):
    
    # get the pdb code from the filename
    target_name = filepath.split('.')[0]
    
    # set the url string using the pdb code as where to download the pdb file
    target_url = f'https://files.rcsb.org/download/{target_name}.pdb'
    return target_url, target_name

def save_target_file(filename):
    
    # get the file url and the pdb code from the biounit filename
    url, target_name = create_target_url(filename)
    
    # change this as to where you need to save the pdb files
    target_path = f'/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/{target_name}.pdb'
    
    # the timeout stops one stalled request from hanging the whole loop
    response = requests.get(url, timeout=60)
    
    # save the file if the pdb returned it successfully
    if response.status_code == 200:
        # the with block closes the file, so no explicit close() is needed
        with open(target_path, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Whoops! Something's wrong with {target_name}: Response {response.status_code}")


# get the list of all the Binding_MOAD extracted protein-ligand complex biounit files
filepaths = os.listdir('/home/milesm/Dissertation/Data/Raw/Binding_MOAD/Extracted/BindingMOAD_2020')

# download the pdb files for all the Binding_MOAD extracted protein-ligand complex biounit files
with tqdm(total=len(filepaths)) as pbar:
    for target_file in filepaths:
        save_target_file(target_file)
        pbar.update(1)
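
The extension change described above (treating '.bio1' biounit files as '.pdb' files) could be sketched as follows. This is an illustration under assumptions: the helper name `copy_biounits_as_pdb` is hypothetical, it copies rather than renames so the raw files stay untouched, and it only handles the '.bio1' suffix so that a second biounit of the same complex cannot overwrite the first.

```python
import shutil
from pathlib import Path


def copy_biounits_as_pdb(source_dir, destination_dir, suffix='.bio1'):
    """Copy every biounit file with the given suffix to destination_dir as '<code>.pdb'."""
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_dir).iterdir()):
        # files with any other suffix (e.g. '.bio2') are skipped
        if path.is_file() and path.suffix == suffix:
            target = destination / (path.stem + '.pdb')
            shutil.copyfile(path, target)
            copied.append(target)
    return copied
```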

## PDBBind

I have downloaded the ['Protein-ligand complexes: The refined set'](http://www.pdbbind.org.cn/download.php) to the standard directory 'Dissertation/Data/Raw/PDBBind/Compressed', and extracted it to 'Dissertation/Data/Raw/PDBBind/Extracted'. No more needed to be done to the PDBBind dataset.

## Binding DB

I cannot see a way of selecting only those complexes which have crystal structures attached to them. At this stage it is not being included in the project.

## DUD-E

For some reason, when I try to download the whole dataset at once, the server throws a 503/overload error saying I've exceeded the maximum number of simultaneous downloads, despite it only being one tarball file. I think we can scrape the files one by one fairly quickly as there are only 102 of them. The files downloaded will be [the compressed archive of the receptor and all actives and decoys in each individual target directory, like aa2ar.tar.gz](http://dude.docking.org/targets/aa2ar). Each file has a standard url of 'dude.docking.org/targets/{target}/{target}.tar.gz'. So if we scrape all the hrefs from the main index page, we can just populate a list and request them one at a time:

In [None]:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

# set the index url
index_url = 'http://dude.docking.org/subsets/all'

def create_target_url(url):
    
    # this creates the url string where the file should logically be stored on DUD-E
    position = url.find('/targets/') + len('/targets/')
    target_name = url[position:]
    target_url = f'http://dude.docking.org/targets/{target_name}/{target_name}.tar.gz'
    
    # returns the url and the target protein name for use as the master folder name
    return target_url, target_name

def save_target_file(url):
    
    # get the url and the name for the folder to store the complexes in
    url, target_name = create_target_url(url)
    
    # change this path to where you want the files to be downloaded to
    target_path = f'/home/milesm/Dissertation/Data/Raw/DUD-E/Compressed/{target_name}.tar.gz'
    
    # request the url and check it exists (verify=False is dropped as it has no effect on plain http)
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        
        # save the compressed archive file with the decoys, actives and receptor in
        # the with block closes the file, so no explicit close() is needed
        with open(target_path, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Whoops! Something's wrong with {target_name}: Response {response.status_code}")

# check the DUD-E homepage for the HTML and make it readable with bs4
index_page = requests.get(index_url)
html_content = index_page.text
soup = BeautifulSoup(html_content, 'html.parser')

# get urls of all the target proteins available (only anchors that actually carry an href)
file_urls = [url['href'] for url in soup.find_all('a', href=True) if '/targets/' in url['href']]

# for each target protein, download all the associated files
with tqdm(total=len(file_urls)) as pbar:
    for target_file in file_urls:
        save_target_file(target_file)
        pbar.update(1)

2021-02-13

DUD-E files have been downloaded in a separate tar.gz file for each target protein. These all need extracting, which can be performed by the script below:

In [None]:
import tarfile
import os
from tqdm import tqdm

# simple function to extract a file into the current working directory
def extract_file(fname):
    with tarfile.open(fname, "r:gz") as tar:
        tar.extractall()

# make a list of all the filepaths of the compressed protein target files in the folder
files = [('/home/milesm/Dissertation/Data/Raw/DUD-E/Compressed/' + file) for file in os.listdir('/home/milesm/Dissertation/Data/Raw/DUD-E/Compressed/')]

# extract all the files
with tqdm(total=len(files)) as pbar:
    for file in files:
        extract_file(file)
        pbar.update(1)

These can then be sorted by filetype, and the extracted folders copied and pasted into the 'Dissertation/Data/Raw/DUD-E/Extracted' folder.
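
The copy step could also be scripted. The sketch below is an assumption-laden illustration rather than what was actually run: the helper name `move_extracted_targets` is hypothetical, and it assumes each archive such as 'aa2ar.tar.gz' unpacks to a folder called 'aa2ar' in the working directory. It moves the folders instead of copying them.

```python
import shutil
from pathlib import Path


def move_extracted_targets(compressed_dir, working_dir, extracted_dir):
    """Move each unpacked target folder from working_dir into extracted_dir.

    Assumes an archive named '<target>.tar.gz' unpacked to a folder '<target>'.
    """
    extracted = Path(extracted_dir)
    extracted.mkdir(parents=True, exist_ok=True)
    moved = []
    for archive in sorted(Path(compressed_dir).glob('*.tar.gz')):
        target_name = archive.name[:-len('.tar.gz')]
        source = Path(working_dir) / target_name
        # only move folders that exist and have not been moved already
        if source.is_dir() and not (extracted / target_name).exists():
            shutil.move(str(source), str(extracted / target_name))
            moved.append(target_name)
    return moved
```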

# **Data Cleaning**

2021-02-18

## Software Installations

All the software outlined below is needed to produce a CSV file of features from each protein-ligand complex and from the DUD-E decoys.

#### **PyMOL 2.4**

Downloaded the tar.bz2 file from [PyMOL](https://pymol.org/2/), unpacked it and ran pymol using: 

#### **Autodock and Autodocktools via MGLTools 1.5.6**

Downloaded the x64 GUI linux installer from the official source and executed it.

#### **Openbabel 3.1.1**

I have had to build openbabel several times from source with different flags for cmake before getting it working - what worked was installing:
- libopenbabel-dev
- libopenbabel4v5
- openbabel-gui

Using the command:

And then cloning openbabel from github and building from source using the instructions from the documentation with specific flags for the python bindings when using cmake:

#### **Python Libraries**

All python libraries except openbabel were pip installed with "python3 -m pip install xxx"
- BioPython
- Open Drug Discovery Toolkit (ODDT)
- Pandas
- Numpy
- Biopandas

2021-02-20

## **General Approach**

The general approach for cleaning each dataset and the end goals of the data cleaning can be seen below in Figure 1. The general idea will be to clean data created in the last step from the 'Dissertation/Data/Raw/{database_name}/Extracted' directory into a standard format of:
- A foldername with the pdb code containing:
    - A ligand.pdb file
    - A receptor.pdb file

These folders will be stored in a new folder: 'Dissertation/Data/Parsed/{database_name}'
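
A small check for this target layout could look like the sketch below. The function name `find_incomplete_complexes` is hypothetical, and the two required filenames are taken from the list above.

```python
from pathlib import Path

# filenames every parsed complex folder is expected to contain
REQUIRED_FILES = ('ligand.pdb', 'receptor.pdb')


def find_incomplete_complexes(parsed_dir):
    """Return {folder_name: [missing filenames]} for folders lacking a required file."""
    incomplete = {}
    for folder in sorted(Path(parsed_dir).iterdir()):
        if not folder.is_dir():
            continue
        missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
        if missing:
            incomplete[folder.name] = missing
    return incomplete
```

An empty dictionary means every complex folder in the parsed directory has both files.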

<img src='Images/data_cleaning_pipeline.png' align='center'/>

*Figure 1 - Flowchart of planned data cleaning process*

2021-02-21

## **DUD-E Data Cleaning**

### **Splitting active and decoy ligand multimol files**

The DUD-E actives and decoys are stored in multimol .mol2 files, which need to be split into separate .mol2 files for GWOVina docking. This has been done with a custom python script which has been pushed to the project GitHub repository as 'split_ligands.py'. See the code comments for how it works. This script produces parsed DUD-E data in the form:

- Foldername is protein target:
    - 'actives' folder contains one mol2 file for each active ligand
    - 'decoys' folder contains one mol2 file for each decoy
    - crystal_ligand.mol2 file is a docked ligand
    - receptor.pdb is the receptor pdb file
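
For illustration, the splitting step can be sketched as below. This is not the contents of 'split_ligands.py'; it is a minimal version that relies on each molecule in a Tripos MOL2 file starting with a `@<TRIPOS>MOLECULE` record whose following line is the molecule name. The function names and the numbered output filenames are assumptions, and any comment lines between molecules end up attached to the preceding block.

```python
from pathlib import Path

MOLECULE_TAG = '@<TRIPOS>MOLECULE'


def split_mol2_text(text):
    """Split multi-molecule MOL2 text into a list of single-molecule blocks."""
    blocks, current = [], []
    for line in text.splitlines(keepends=True):
        # a new MOLECULE record closes the block collected so far
        if line.startswith(MOLECULE_TAG) and current:
            blocks.append(''.join(current))
            current = []
        current.append(line)
    if current:
        blocks.append(''.join(current))
    # discard anything (e.g. header comments) that precedes the first molecule
    return [block for block in blocks if MOLECULE_TAG in block]


def molecule_name(block):
    """Return the molecule name, which is the line after the MOLECULE record."""
    lines = block.splitlines()
    index = next(i for i, line in enumerate(lines) if line.startswith(MOLECULE_TAG))
    return lines[index + 1].strip()


def write_split_mol2(multimol_path, output_dir):
    """Write each molecule to its own numbered .mol2 file and return the paths."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    blocks = split_mol2_text(Path(multimol_path).read_text())
    for number, block in enumerate(blocks, start=1):
        # the running number keeps filenames unique if a name is repeated
        target = output / f'{number:05d}_{molecule_name(block)}.mol2'
        target.write_text(block)
        written.append(target)
    return written
```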

### **Docking active and decoy ligand mol2 files to target protein with GWOVina**

The supplied actives and decoys do not appear to be docked for the DUD-E database. Therefore, each one needs docking with GWOVina before being converted to a PDBQT file with Autodocktools. This script is a work in progress, but it will automate the docking process and create a new subset of DUD-E data in 'Dissertation/Data/Parsed/DUD-E/Docked' with a folder for each protein-ligand and protein-decoy complex.

2021-02-22

## **Binding MOAD Data Cleaning**

### **Splitting Protein-Ligand Complexes**

The raw extracted Binding MOAD dataset just contains .pdb complex files, which need to be separated into a 'protein.pdb' file and a 'ligand.pdb' file. In .pdb files, amino acids/residues are stored as type 'ATOM', whereas non-amino acid residues are stored as type 'HETATM'. This 'HETATM' type includes sulfates, zincs and waters that are likely not ligands, as well as cofactors. 

Therefore, I have written a python script called 'split_complex.py' and pushed it to the project GitHub repository. This looks for all the different residues where all the atoms are classed as 'HETATM', counts how many atoms are present in each residue, and keeps the 'HETATM' residue with the most atoms as the ligand. This way, the chemical ligand is kept, as waters, zincs and sulfates have very few atoms compared to standard ligands. **This approach misses peptide ligands and this will be addressed later**. The script produces a 'protein.pdb' and a 'ligand.pdb' file in a folder named as the complex pdb code, in the master folder 'Dissertation/Data/Parsed/Binding_MOAD/'
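
The selection rule described above can be sketched in plain Python as follows. This is an illustration and not the contents of 'split_complex.py': the function name `split_complex_lines` is hypothetical, residues are identified from the fixed PDB columns (residue name, chain and residue number with insertion code), and ties are resolved in favour of the first residue encountered.

```python
from collections import defaultdict


def split_complex_lines(pdb_lines):
    """Return (protein_lines, ligand_lines) from the lines of a PDB complex.

    The ligand is taken to be the HETATM residue with the most atoms.
    """
    protein = [line for line in pdb_lines if line.startswith('ATOM')]

    # group HETATM records by residue name, chain ID and residue number
    hetero_residues = defaultdict(list)
    for line in pdb_lines:
        if line.startswith('HETATM'):
            residue_key = (line[17:20], line[21:22], line[22:27])
            hetero_residues[residue_key].append(line)

    # waters, ions and sulfates have few atoms, so the largest group wins
    ligand = max(hetero_residues.values(), key=len) if hetero_residues else []
    return protein, ligand
```

The two returned lists can then be written out as 'protein.pdb' and 'ligand.pdb'.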

2021-03-01

The meeting with Dr Houston has been postponed by one day. I am going to look at the GWOVina and autodocktools command line usage and try to automate the docking and conversion process to .pdbqt files. It looks like the process goes:

1. Convert to .pdbqt with autodocktools
2. Dock with GWOVina (For DUD-E compounds and ligands)

I will need to evaluate whether we need docking or not to actually score the ligands, or if they just need preparing with autodock.

BINANA.py output has also been analysed. It outputs to a text file, which I have written a script to parse and convert to a .csv file. I have got an error for some ligands when using autodocktools to convert to .pdbqt; I think it is because those structures are dimers with two identical ligands docked.
I have since worked around the dimer problem with two identical ligands by adding exception handling to the script. Autodocktools command line usage is working well for pdb to pdbqt conversion, but automation of the whole process will be much easier with openbabel imported as a part of oddt. Unfortunately, the python script I have written to do this with openbabel produces a problematic pdbqt file that won't work in autodock or BINANA. I am going to do a side by side comparison of autodock output and openbabel output to see the differences and what might be causing them. If that does not resolve the problem, I will have to use RDKit, building it from source, as part of oddt.
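
One possible way to run the side by side comparison is sketched below using only the standard library. The function names and the default file labels are hypothetical. `record_counts` tallies the first keyword on each line (for example ROOT, BRANCH, ATOM or TORSDOF in a PDBQT file), which should show quickly whether one tool omits a record type, and `diff_pdbqt` returns a unified diff of the two files.

```python
import difflib
from collections import Counter


def record_counts(pdbqt_text):
    """Count lines by their leading record keyword, ignoring blank lines."""
    return Counter(line.split()[0] for line in pdbqt_text.splitlines() if line.strip())


def diff_pdbqt(reference_text, candidate_text,
               reference_name='autodocktools.pdbqt', candidate_name='openbabel.pdbqt'):
    """Return the unified diff between two PDBQT files as a list of lines."""
    return list(difflib.unified_diff(
        reference_text.splitlines(keepends=True),
        candidate_text.splitlines(keepends=True),
        fromfile=reference_name,
        tofile=candidate_name,
    ))
```

Reading both files with `open(path).read()` and printing `''.join(diff_pdbqt(a, b))` would show the differing lines with a `-` or `+` prefix.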