2021-02-12

# **XGBScore: A Gradient Boosted Decision Tree Scoring Function for Structure Based Virtual Screening**

This is my lab book for my dissertation project. It will contain my daily work towards the project, and *in silico* experiments performed in Python 3 code cells for debugging and evaluation before being turned into separate scripts for publication and analysis.

### **Table of Contents**
- Database Scraping
- Dataset Assembly
- Dataset Cleaning
- Feature Engineering
- Adding Decoys
- Splitting the Data
- Feature Importance/Dimensionality Reduction
- Training the Model
- Optimising the Model
- Evaluating the Model

# **Database Scraping**

While there will likely be lots of redundancy, four databases are being scraped for data for the training and test sets. These are:
- Binding MOAD
- PDBBind
- Binding DB
- DUD-E

## Binding MOAD

Binding MOAD offers several different datasets. I have downloaded the structures and binding data for the **Non-redundant dataset only with binding data**. This dataset contains all complexes which have known binding data, and includes one 'leader' from each family with similar binding to prevent very similar data clogging the dataset. 

## PDBBind

The PDBBind server seems to be down at the moment, so I will return to this dataset later.

## Binding DB

I cannot see a way of selecting only those complexes which have crystal structures attached to them. Let's move on to DUD-E.

## DUD-E

For some reason, when I try to download the whole dataset at once, the server throws a 503 (overload) error saying I've exceeded the maximum number of simultaneous downloads, despite it being only one tarball file. I think we can scrape the files one by one fairly quickly, as there are only 102 of them. Each file has a standard URL of the form 'dude.docking.org/targets/{target}/{target}.tar.gz', so if we scrape all the hrefs from the main index page, we can populate a list and request them one at a time:

In [None]:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

index_url = 'http://dude.docking.org/subsets/all'

def create_target_url(url):
    position = url.find('/targets/') + len('/targets/')
    target_name = url[position:]
    target_url = f'http://dude.docking.org/targets/{target_name}/{target_name}.tar.gz'
    return target_url, target_name

def save_target_file(url):
    # Build the tarball URL for this target and download it
    url, target_name = create_target_url(url)
    target_path = f'/home/milesm/Dissertation/Data/Raw/DUD-E/{target_name}.tar.gz'
    response = requests.get(url, verify=False)
    if response.status_code == 200:
        # The with block closes the file, so no explicit close() is needed
        with open(target_path, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Whoops! Something's wrong: Response {response.status_code}")

index_page = requests.get(index_url)
html_content = index_page.text
soup = BeautifulSoup(html_content, 'html.parser')
file_urls = [url['href'] for url in soup.find_all('a') if '/targets/' in str(url)]

with tqdm(total=len(file_urls)) as pbar:
    for target_file in file_urls:
        save_target_file(target_file)
        pbar.update(1)

2021-02-13

The DUD-E files have been downloaded as a separate tar.gz file for each target. These need extracting and sorting.

In [None]:
import tarfile
import os
from tqdm import tqdm

def extract_file(fname):
    # Extracts into the current working directory; the with block closes the archive
    with tarfile.open(fname, "r:gz") as tar:
        tar.extractall()

files = [('/home/milesm/Dissertation/Data/Raw/DUD-E/Compressed/' + file) for file in os.listdir('/home/milesm/Dissertation/Data/Raw/DUD-E/Compressed/')]

with tqdm(total=len(files)) as pbar:
    for file in files:
        extract_file(file)
        pbar.update(1)
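
`extract_file` above unpacks everything into whatever the notebook's working directory happens to be. As a rough sketch of an alternative (not what was actually run), the archive could be extracted into an explicit folder, with a check that no member escapes it. The names `extract_archive` and `output_dir` are my own placeholders, not part of the project scripts.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, output_dir):
    """Extract a .tar.gz archive into output_dir and return the extracted paths.

    Hypothetical helper: the name and the output_dir argument are assumptions
    made for illustration.
    """
    output_dir = Path(output_dir).resolve()
    output_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse any member that would be written outside output_dir
            destination = (output_dir / member.name).resolve()
            if destination != output_dir and output_dir not in destination.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(path=output_dir, members=members)
    return [output_dir / member.name for member in members]
```

Calling this once per downloaded tarball with a fixed `output_dir` would keep all targets in one known location regardless of where the notebook is started from.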

The Binding MOAD files are stored as 'Biounit' files or .bio files, so I can't open them with PyMOL and have to use JMOLViewer. We should download the original PDB files as well just in case they are more useful:

In [None]:
import os
import requests
from tqdm import tqdm

def create_target_url(filepath):
    target_name = filepath.split('.')[0]
    target_url = f'https://files.rcsb.org/download/{target_name}.pdb'
    return target_url, target_name

def save_target_file(filepath):
    # filepath is a Binding MOAD filename; its stem is the PDB ID to download
    url, target_name = create_target_url(filepath)
    target_path = f'/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/{target_name}.pdb'
    response = requests.get(url)
    if response.status_code == 200:
        # The with block closes the file, so no explicit close() is needed
        with open(target_path, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Whoops! Something's wrong: Response {response.status_code}")


filepaths = os.listdir('/home/milesm/Dissertation/Data/Raw/Binding_MOAD/Extracted/BindingMOAD_2020')

with tqdm(total=len(filepaths)) as pbar:
    for target_file in filepaths:
        save_target_file(target_file)
        pbar.update(1)

# **Data Cleaning**

Right now we have multiple files of separate ligands and receptors that we know either bind or don't bind. These will need (I think) docking and converting into pdbqt files for analysis by BINANA/Scoria. We will use ODDT and Autodock for this(?)

2021-02-18

I have had to build openbabel several times from source with different flags for cmake before getting it working - what worked was apt installing:
- libopenbabel-dev
- libopenbabel4v5
- openbabel-gui

And then cloning openbabel from github and building from source with specific flags.

2021-02-20

Open Babel's functionality was not immediately suitable for separating ligands from the protein-ligand complexes. I have written a separate script using BioPython to split the Binding MOAD data into separate ligand and protein .pdb files and have pushed it to the project GitHub. The overall data cleaning process has several steps (Fig. 1).

<img src='Images/data_cleaning_pipeline.png' align='center'/>

*Figure 1 - Flowchart of planned data cleaning process*
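
To illustrate the idea behind the ligand/protein split, here is a minimal plain-text sketch. It is not the BioPython script pushed to GitHub: it is a simplified stand-in that sorts PDB records by type, treating `ATOM` records as protein and non-water `HETATM` records as ligand. The function name `split_complex` and the `skip_residues` argument are assumptions for illustration.

```python
def split_complex(pdb_text, skip_residues=("HOH",)):
    """Return (protein_lines, ligand_lines) from the text of a PDB file.

    Simplified assumption: ATOM records are protein, HETATM records are ligand.
    Modified residues stored as HETATM (e.g. MSE) would be misfiled as ligand,
    so a real script needs a more careful rule.
    """
    protein_lines, ligand_lines = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # columns 1-6 hold the record name
        if record == "ATOM":
            protein_lines.append(line)
        elif record == "HETATM":
            # Columns 18-20 hold the residue name; skip waters by default
            if line[17:20].strip() not in skip_residues:
                ligand_lines.append(line)
    return protein_lines, ligand_lines
```

Each returned list could then be written to its own .pdb file, one for the receptor and one for the ligand.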

2021-02-21

The DUD-E ligands and actives were stored in multimol .mol2 files, which needed to be split into separate .mol2 files for GWOVina docking. This has been done with a custom Python script which has also been pushed to the GitHub. I have a meeting with Dr. Houston tomorrow to discuss the workflow above and confirm it is correct.
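
The splitting step can be sketched roughly as follows. This is an illustration only, not the script on GitHub: it relies on each molecule in a .mol2 file starting with an `@<TRIPOS>MOLECULE` line, followed by the molecule name on the next line. The function names `split_multimol2` and `molecule_name` are my own placeholders.

```python
def split_multimol2(mol2_text):
    """Split the text of a multi-molecule .mol2 file into one string per molecule."""
    marker = "@<TRIPOS>MOLECULE"
    molecules = []
    current = None  # lines of the molecule being read; None until the first marker
    for line in mol2_text.splitlines(keepends=True):
        if line.startswith(marker):
            # A new molecule starts here, so store the previous one if any
            if current is not None:
                molecules.append("".join(current))
            current = []
        if current is not None:
            current.append(line)
    if current is not None:
        molecules.append("".join(current))
    return molecules


def molecule_name(molecule_text):
    """Return the name line that follows the MOLECULE marker."""
    lines = molecule_text.splitlines()
    return lines[1].strip() if len(lines) > 1 else ""
```

Each returned string could then be written to its own .mol2 file, for example named after `molecule_name` plus a running index in case names repeat.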

2021-02-22

The meeting with Dr. Houston has been postponed by one day. I am going to look at the GWOVina command line usage and try to automate the docking and conversion process to .pdbqt files. It looks like the process goes:

1. Convert to .pdbqt with autodocktools
2. Dock with GWOVina

I will need to evaluate whether we need docking or not to actually score the ligands, or if they just need preparing with autodock.
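
As a first pass at automating step 1, the conversion commands could be assembled like this. It is a sketch under assumptions: that the AutoDockTools scripts `prepare_ligand4.py` and `prepare_receptor4.py` are used, that they can be found from the working directory, and that they are run through the MGLTools `pythonsh` interpreter. The helper name `pdbqt_command` is hypothetical. Step 2 is not sketched because I have not yet settled the GWOVina options.

```python
from pathlib import Path


def pdbqt_command(input_path, kind, interpreter="pythonsh"):
    """Build the argument list for converting one structure to .pdbqt.

    Assumes the AutoDockTools scripts are reachable by name; in practice the
    full path to each script will probably be needed.
    """
    input_path = Path(input_path)
    output_path = input_path.with_suffix(".pdbqt")
    if kind == "ligand":
        script, flag = "prepare_ligand4.py", "-l"
    elif kind == "receptor":
        script, flag = "prepare_receptor4.py", "-r"
    else:
        raise ValueError(f"kind must be 'ligand' or 'receptor', got {kind!r}")
    return [interpreter, script, flag, str(input_path), "-o", str(output_path)]
```

The returned list can be passed straight to `subprocess.run(command, check=True)`, looping over every split ligand and receptor file.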