2021-12-02

# **XGBScore: A Gradient Boosted Decision Tree Scoring Function for Structure Based Virtual Screening**

This is my lab book for my dissertation project. It will contain my daily work towards the project, and *in silico* experiments performed in Python 3 code cells for debugging and evaluation before they are turned into separate scripts for publication and analysis.

### **Table of Contents**
- Database Scraping
- Dataset Assembly
- Dataset Cleaning
- Feature Engineering
- Adding Decoys
- Splitting the Data
- Feature Importance/Dimensionality Reduction
- Training the Model
- Optimising the Model
- Evaluating the Model

# Database Scraping

Four databases are being scraped for the training and test sets, although there will likely be a lot of redundancy between them. These are:
- Binding MOAD
- PDBBind
- Binding DB
- DUD-E

## Binding MOAD

Binding MOAD offers several different datasets. I have downloaded the structures and binding data for the **Non-redundant dataset only with binding data**. This dataset contains all complexes which have known binding data, and includes one 'leader' from each family with similar binding to prevent very similar data clogging the dataset. These are stored in the Dissertation/Data/Raw/Binding_MOAD folder.

## PDBBind

The PDBBind server seems to be down at the moment, so I will return to this dataset later.

## Binding DB

I cannot see a way of selecting only those complexes which have crystal structures attached to them. Let's move on to DUD-E.

## DUD-E

For some reason, when I try to download the whole dataset at once, the server throws a 503 (overload) error saying I've exceeded the maximum number of simultaneous downloads, despite the download being a single tarball. I think we can scrape the files one by one fairly quickly, as there are only 102 of them. Each file has a standard URL of the form 'dude.docking.org/targets/{target}/{target}.tar.gz'. So if we scrape all the hrefs from the main index page, we can populate a list and request them one at a time:

In [None]:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

index_url = 'http://dude.docking.org/subsets/all'

def create_target_url(url):
    # Everything after '/targets/' in the href is the target name
    position = url.find('/targets/') + len('/targets/')
    target_name = url[position:].strip('/')
    target_url = f'http://dude.docking.org/targets/{target_name}/{target_name}.tar.gz'
    return target_url, target_name

def save_target_file(url):
    url, target_name = create_target_url(url)
    target_path = f'/home/milesm/Dissertation/Data/Raw/DUD-E/{target_name}.tar.gz'
    # verify=False skips TLS certificate checks; it only matters if the
    # server redirects the request to HTTPS
    response = requests.get(url, verify=False)
    if response.status_code == 200:
        # The with block closes the file on exit, so no explicit close()
        with open(target_path, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Whoops! Something's wrong with {target_name}: Response {response.status_code}")

# Collect the href of every link on the index page that points at a target
index_page = requests.get(index_url)
html_content = index_page.text
soup = BeautifulSoup(html_content, 'html.parser')
file_urls = [a['href'] for a in soup.find_all('a', href=True) if '/targets/' in a['href']]

# Download the tarballs one at a time to stay under the server's download limit
for target_file in tqdm(file_urls):
    save_target_file(target_file)

 11%|█         | 11/102 [09:57<55:56, 36.89s/it]  
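
Since the server has already returned a 503 once, it may be worth retrying a failed download after a short wait before giving up on a target. The sketch below is one possible way to do that, not the code used for the run above. The names `target_archive_url` and `fetch_with_retry`, the retry count and the backoff interval are all hypothetical, and the href format (`/targets/<name>`) is assumed from the parsing in the cell above. `fetch` can be any callable that returns an object with a `status_code` attribute, such as `requests.get`.

```python
import time
from urllib.parse import urlparse

# Same base URL as the download cell above
BASE_URL = 'http://dude.docking.org/targets'

def target_archive_url(href):
    """Build the tarball URL from an index-page href such as '/targets/aa2ar'."""
    # Take the last path component, tolerating a trailing slash
    # or a fully qualified href (assumed href layout, see lead-in)
    target_name = urlparse(href).path.rstrip('/').split('/')[-1]
    return f'{BASE_URL}/{target_name}/{target_name}.tar.gz', target_name

def fetch_with_retry(fetch, url, retries=3, backoff=5.0, sleep=time.sleep):
    """Call fetch(url), retrying only while the server answers 503.

    Returns the last response, whatever its status code, so the caller
    still decides what to do with a non-200 result.
    """
    response = fetch(url)
    for attempt in range(retries):
        if response.status_code != 503:
            break
        # Linear backoff: wait a little longer after each failed attempt
        sleep(backoff * (attempt + 1))
        response = fetch(url)
    return response
```

In `save_target_file`, the plain `requests.get(url, verify=False)` call could then be swapped for something like `fetch_with_retry(lambda u: requests.get(u, verify=False), url)`, leaving the status check and file write unchanged.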