2021-02-12

# **XGBScore: A Gradient Boosted Decision Tree Scoring Function for Structure Based Virtual Screening**

This is my lab book for my dissertation project. It will contain my daily work towards the project, and *in silico* experiments performed in Python 3 code cells for debugging and evaluation before being turned into separate scripts for publication and analysis.

### **Table of Contents**
- Database Scraping
- Dataset Assembly
- Dataset Cleaning
- Feature Engineering
- Adding Decoys
- Splitting the Data
- Feature Importance/Dimensionality Reduction
- Training the Model
- Optimising the Model
- Evaluating the Model

# **Database Scraping**

While there will likely be lots of redundancy, four databases are being scraped for data for the training and test sets. These are:
- Binding MOAD
- PDBBind
- Binding DB
- DUD-E

## Binding MOAD

Binding MOAD offers several different datasets. I have downloaded the structures and binding data for the **Non-redundant dataset only with binding data**. This dataset contains all complexes which have known binding data, and includes one 'leader' from each family with similar binding to prevent very similar data clogging the dataset. 

## PDBBind

The PDBBind server seems to be down at the moment, so I will return to this dataset later.

## Binding DB

I cannot see a way of selecting only those complexes which have crystal structures attached to them. Let's move on to DUD-E.

## DUD-E

For some reason, when I try to download the whole dataset at once, the server throws a 503 (overload) error saying I've exceeded the maximum number of simultaneous downloads, despite it only being one tarball file. I think we can scrape the files one by one fairly quickly as there are only 102 of them. Each file has a standard base url of 'dude.docking.org/targets/{target}/{target}.tar.gz'. So if we scrape all the hrefs from the main index page, we can just populate a list and request them one at a time:

In [26]:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

index_url = 'http://dude.docking.org/subsets/all'

def create_target_url(url):
    position = url.find('/targets/') + len('/targets/')
    target_name = url[position:]
    target_url = f'http://dude.docking.org/targets/{target_name}/{target_name}.tar.gz'
    return target_url, target_name

def save_target_file(url):
    # Build the tarball URL for this target and download it
    url, target_name = create_target_url(url)
    target_path = f'/home/milesm/Dissertation/Data/Raw/DUD-E/{target_name}.tar.gz'
    response = requests.get(url, verify=False)
    if response.status_code == 200:
        # The with block closes the file, so no explicit close() is needed
        with open(target_path, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Whoops! Something's wrong: Response {response.status_code}")

index_page = requests.get(index_url)
html_content = index_page.text
soup = BeautifulSoup(html_content, 'html.parser')
file_urls = [url['href'] for url in soup.find_all('a') if '/targets/' in str(url)]

with tqdm(total=len(file_urls)) as pbar:
    for target_file in file_urls:
        save_target_file(target_file)
        pbar.update(1)

100%|██████████| 102/102 [37:25<00:00, 22.02s/it] 
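Since the server answered the bulk download with a 503, single requests could fail the same way now and then. Below is a minimal sketch of retrying a download with a growing delay. The helper `fetch_with_retry`, its parameters and the set of retryable status codes are illustrative assumptions and are not part of the cell above. The `fetch` argument is any zero-argument callable returning `(status_code, content)`.

```python
import time

# Assumption: these status codes are treated as temporary and worth retrying
RETRYABLE = {429, 500, 502, 503, 504}

def fetch_with_retry(fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns status 200 or the attempts run out.

    fetch is a zero-argument callable returning (status_code, content).
    Returns the content on success, or None if every attempt failed.
    """
    for attempt in range(retries):
        status, content = fetch()
        if status == 200:
            return content
        if status not in RETRYABLE:
            break  # e.g. a 404 will not improve by retrying
        if attempt < retries - 1:
            # Wait 1x, 2x, 4x ... the base delay between attempts
            sleep(delay * (2 ** attempt))
    return None
```

To use it with `requests`, `fetch` could be a small function that calls `requests.get(url)` and returns `(response.status_code, response.content)`.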


2021-02-13

Each DUD-E target has been downloaded as a separate tar.gz file. These need extracting and sorting.

In [5]:
import tarfile
import os
from tqdm import tqdm

def extract_file(fname):
    # Extracts into the current working directory; the with block closes the archive
    with tarfile.open(fname, "r:gz") as tar:
        tar.extractall()

files = [('/home/milesm/Dissertation/Data/Raw/DUD-E/Compressed/' + file) for file in os.listdir('/home/milesm/Dissertation/Data/Raw/DUD-E/Compressed/')]

with tqdm(total=len(files)) as pbar:
    for file in files:
        extract_file(file)
        pbar.update(1)

100%|██████████| 102/102 [00:09<00:00, 11.33it/s]
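`extractall()` without arguments unpacks into the current working directory. Here is a small sketch, assuming a separate output folder is wanted, that unpacks every `.tar.gz` in one directory into another. The function name `extract_all` and both directory arguments are hypothetical.

```python
import os
import tarfile

def extract_all(compressed_dir, extracted_dir):
    """Extract every .tar.gz archive in compressed_dir into extracted_dir.

    Returns the names of the archives that were extracted.
    """
    os.makedirs(extracted_dir, exist_ok=True)
    extracted = []
    for name in sorted(os.listdir(compressed_dir)):
        if not name.endswith('.tar.gz'):
            continue  # skip anything that is not a gzipped tarball
        with tarfile.open(os.path.join(compressed_dir, name), 'r:gz') as tar:
            tar.extractall(path=extracted_dir)
        extracted.append(name)
    return extracted
```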


The Binding MOAD files are stored as 'Biounit' files or .bio files, so I can't open them with PyMOL and have to use JMOLViewer. We should download the original PDB files as well just in case they are more useful:

In [None]:
import os
import requests
from tqdm import tqdm

def create_target_url(filepath):
    target_name = filepath.split('.')[0]
    target_url = f'https://files.rcsb.org/download/{target_name}.pdb'
    return target_url, target_name

def save_target_file(url):
    # 'url' is a Binding MOAD filename here; create_target_url turns it into an RCSB URL
    url, target_name = create_target_url(url)
    target_path = f'/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/{target_name}.pdb'
    response = requests.get(url)
    if response.status_code == 200:
        # The with block closes the file, so no explicit close() is needed
        with open(target_path, 'wb') as file:
            file.write(response.content)
    else:
        print(f"Whoops! Something's wrong: Response {response.status_code}")


filepaths = os.listdir('/home/milesm/Dissertation/Data/Raw/Binding_MOAD/Extracted/BindingMOAD_2020')

with tqdm(total=len(filepaths)) as pbar:
    for target_file in filepaths:
        save_target_file(target_file)
        pbar.update(1)

 31%|███       | 1535/4991 [34:26<6:45:36,  7.04s/it] 
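The download above was still at 31% after about half an hour, so it may be worth avoiding repeated work. This is a minimal sketch, assuming the Binding MOAD files are named like `1a0q.bio1` (so several biounit files can share one PDB code) and assuming files already on disk should be skipped when the cell is re-run. `pending_pdb_codes` is a hypothetical helper.

```python
import os

def pending_pdb_codes(filenames, output_dir):
    """Return unique PDB codes from filenames that have no .pdb in output_dir yet."""
    codes = []
    seen = set()
    for name in filenames:
        code = name.split('.')[0]  # '1a0q.bio1' -> '1a0q'
        if code in seen:
            continue  # same entry, different biounit file
        seen.add(code)
        if not os.path.isfile(os.path.join(output_dir, f'{code}.pdb')):
            codes.append(code)
    return codes
```

The returned list could then replace `filepaths` in the download loop, since `create_target_url` leaves a name without a dot unchanged.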

# **Dataset Cleaning**

Right now we have multiple files of separate ligands and receptors that we know either bind or don't bind. I think these will need docking and converting into pdbqt files for analysis by BINANA/Scoria. We will probably use ODDT and AutoDock for this, though I am not yet sure.
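One possible route for the conversion step is the Open Babel command line tool, which can write pdbqt. The sketch below only builds and runs the command. It assumes `obabel` is on the PATH, and the choice of options (`-h` to add hydrogens, `-xr` to write a rigid molecule for receptors) is an assumption about what the later docking step needs rather than a settled protocol. Both function names are hypothetical.

```python
import subprocess

def pdbqt_command(input_pdb, output_pdbqt, rigid=False, add_hydrogens=True):
    """Build an obabel command line that converts a PDB file to pdbqt."""
    command = ['obabel', input_pdb, '-O', output_pdbqt]
    if add_hydrogens:
        command.append('-h')   # add hydrogens before writing
    if rigid:
        command.append('-xr')  # pdbqt writer option: no torsion tree (receptors)
    return command

def convert_to_pdbqt(input_pdb, output_pdbqt, **options):
    """Run the conversion; raises CalledProcessError if obabel exits with an error."""
    subprocess.run(pdbqt_command(input_pdb, output_pdbqt, **options), check=True)
```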

2021-02-13

In [13]:
import os
import oddt
import openbabel
print(os.path.isfile('/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/1a0q.pdb'))
oddt.toolkit.readfile('pdb', '/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/1a0q.pdb')

True


AttributeError: 'NoneType' object has no attribute 'readfile'
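The `AttributeError` suggests that `oddt.toolkit` was `None`, which, as far as I can tell, is what ODDT leaves it as when it cannot import either of its backends (Open Babel or RDKit). Here is a small sketch for checking which candidate modules Python can find before importing `oddt`. The module names listed are assumptions meant to cover the older and newer Open Babel Python layouts.

```python
import importlib.util

# Assumed module names: 'openbabel' and 'pybel' (Open Babel), 'rdkit' (RDKit)
CANDIDATE_BACKENDS = ('openbabel', 'pybel', 'rdkit')

def available_backends(names=CANDIDATE_BACKENDS):
    """Map each module name to True if Python can locate it, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

print(available_backends())
```

If every entry is `False`, the problem lies with the Python bindings and not with the ODDT call itself.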

2021-02-18

I have had to build openbabel several times from source with different flags for cmake before getting it working - what worked was apt installing:
- libopenbabel-dev
- libopenbabel4v5
- openbabel-gui

And then cloning openbabel from GitHub and building from source with specific flags. Now, to try this code again!

In [29]:
import os
import oddt
print(os.path.isfile('/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/1a0q.pdb'))
molecules = oddt.toolkit.readfile('pdb', '/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/1a0q.pdb')

True


In [47]:
molecules = oddt.toolkit.readfile('pdb', '/home/milesm/Dissertation/Data/Raw/PDBBind/CASF-2016/coreset/1a30')
for mol in molecules:
    print(mol.ligand)

AttributeError: Molecule has no such property: ligand

In [48]:
import scoria
mol = scoria.Molecule()
mol.load_pdb_into(filename='/home/milesm/Dissertation/Data/Raw/Binding_MOAD/original_PDB_files/1a0q.pdb')

ERROR: Unknown bond distance between elements N and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements C and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements O and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements O and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements N and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements C and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements N and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements C and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements O and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements O and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements C and ZN. Assuming 2.687.
ERROR: Unknown bond distance between elements ZN and N. Assuming 2.687.
ERROR: Unknown bond distance between elements ZN and C. Assuming 2.687.
ERROR: Unknown bond distance between elements ZN and O. Assuming

In [43]:
print(type(mol))

<class 'scoria.Molecule.Molecule'>


In [46]:
print(mol.other_molecules.__dict__)

{'_OtherMolecules__parent_molecule': <scoria.Molecule.Molecule object at 0x7fc9576964a8>}


In [24]:
import os

from Bio.PDB import PDBIO, PDBParser, Select  # explicit imports instead of *
from tqdm import tqdm
from warnings import filterwarnings
filterwarnings('ignore')

class LigandResidueSelect(Select):
    
    def __init__(self, chain, residue):
        self.chain = chain
        self.residue = residue

    def accept_chain(self, chain):
        return chain.id == self.chain.id

    def accept_residue(self, residue):
        """ Recognition of heteroatoms - Remove water molecules """
        return residue == self.residue and is_het(residue)

class NonHetSelect(Select):
    
    def accept_residue(self, residue):
        # A blank hetero flag marks a standard (non-hetero) residue
        return residue.id[0] == " "


def is_het(residue):
    res = residue.id[0]
    return res != " " and res != "W"

def extract_protein(pdb, filename):
    io = PDBIO()
    io.set_structure(pdb)
    io.save(filename, NonHetSelect())

def extract_ligands(pdb, filename):
    """ Extraction of the heteroatoms of .pdb files """
    io = PDBIO()
    io.set_structure(pdb)
    for model in pdb:
        for chain in model:
            for residue in chain:
                # Skip protein residues, waters and zinc ions
                if not is_het(residue) or '_ZN' in str(residue.id):
                    continue
                # NOTE: every save reuses the same filename, so if a structure
                # has several ligands only the last one written is kept
                io.save(filename, LigandResidueSelect(chain, residue))

def split_structure(structure_code, structure):
    output_dir = f'/home/milesm/Dissertation/Data/Parsed/Binding_MOAD/{structure_code}'
    # makedirs with exist_ok avoids a FileExistsError when the cell is re-run
    os.makedirs(output_dir, exist_ok=True)
    extract_ligands(structure, f'{output_dir}/{structure_code}_ligand.pdb')
    extract_protein(structure, f'{output_dir}/{structure_code}_receptor.pdb')

parser = PDBParser()
structure = parser.get_structure("Test","/home/milesm/Desktop/1a0q.pdb")
split_structure('1a0q', structure)
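Because `extract_ligands` is given a single filename, a structure with several hetero residues would have each one written to the same file. One option is to give each ligand its own file. This is a minimal sketch of a naming helper, assuming Biopython's residue id layout `(hetero_flag, sequence_number, insertion_code)` with hetero flags such as `'H_ATP'`. `ligand_filename` and the naming scheme are hypothetical.

```python
import os

def ligand_filename(output_dir, structure_code, chain_id, residue_id):
    """Build a per-ligand path from a Biopython-style residue id tuple."""
    hetero_flag, sequence_number, insertion_code = residue_id
    resname = hetero_flag.replace('H_', '', 1).strip()  # 'H_ATP' -> 'ATP'
    suffix = insertion_code.strip()                     # usually blank
    name = f'{structure_code}_{chain_id}_{resname}_{sequence_number}{suffix}_ligand.pdb'
    return os.path.join(output_dir, name)
```

Inside the residue loop, the path passed to `io.save` could then come from `ligand_filename(output_dir, structure_code, chain.id, residue.id)`, which would need those two extra values passed into `extract_ligands`.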