# Summary
In this article I'll go through the overall thought process of setting up a benchmark set. For the sake of simplicity, I'll do it for a single target, to evaluate the performance of a binding site comparison algorithm. I'll try to provide code against public resources whenever possible, or snippets where a bit of scripting is needed. When public domain information is insufficient, I might fall back on resources like 3decision or MOE families, which are not in the public domain but can very much help fulfill the task at hand.

# The target binding site
To show the overall process, let's start with a well-studied and relatively easy target, HSP90. It has been my usual guinea pig since my PhD in Xavier Barril's lab, and it's a good example to begin with, without having to explore very large protein families right away.
I aim to apply the process outlined here later to thrombin, a trypsin-like serine protease, which will likely involve some adjustments.

# HSP90
Heat shock protein 90 (HSP90) is a rather abundant protein in the cell. It helps during the folding of proteins that are not yet fully formed and protects already folded proteins from external stress (thermal stress, for instance), hence HSP = Heat Shock Protein. Let's start with [human HSP90 alpha](https://www.uniprot.org/uniprotkb/P07900/entry), even though HSP90s are common among eukaryotes.
[Several kinases depend on HSP90 for their activation, especially those acting as hubs](https://www.sciencedirect.com/science/article/pii/S0021925820780828). This is one of the reasons why HSP90 has been investigated as a potential drug target for the treatment of several forms of cancer. The action of HSP90 depends on ATP and its hydrolysis, and the ATP binding site is located in a particular area of the N-terminal part of HSP90. This is the binding site to focus on here.

## Why HSP90 as a first example
- HSP90 has a lot of structures available in the public domain. 
- It's not part of any gigantic protein family (kinases, GPCRs, etc.), which keeps the initial comparison space a bit smaller.
- The fold of the protein is still conserved among several other proteins, so there is material for detecting expected similarities.
- It binds ATP, like a lot of proteins in nature, which is interesting for the "if I bind the same molecule, I must be similar" conundrum.
- The binding site can undergo important conformational changes, which is good for evaluating sensitivity to conformation.
- Water plays a very important role upon binding of small molecules into the ATP binding site

## Domain architecture
Human HSP90 alpha is composed of two domains:
- the N-terminal Histidine kinase, DNA gyrase B and HSP90-like ATPase domain (ranging from amino acid 40 to 193)
- the C-terminal HSP90 protein domain (196-714)

The ATP binding site of interest is on the N-terminal part, and this is the part with the most crystal structures in the RCSB PDB today. A full-length AlphaFold model is also available in the public domain. [NB: there appears to be another ATP binding site on the C-terminal part that is only accessible when activated](https://pubmed.ncbi.nlm.nih.gov/12755697/), so it might be interesting to look out for that one as well.
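As a small illustration, the domain boundaries quoted above can be kept in a lookup table to check which domain a given residue falls into. This is only a minimal sketch: the `DOMAINS` dictionary and the `domain_of` helper are hypothetical names introduced here, and the boundaries are simply the ones listed above.

```python
# Domain boundaries as quoted in the text (numbering of human HSP90 alpha).
# DOMAINS and domain_of are illustrative names, not part of any library.
DOMAINS = {
    "N-terminal ATPase domain": (40, 193),
    "C-terminal HSP90 domain": (196, 714),
}


def domain_of(resnum):
    """Return the name of the domain containing a residue number, or None."""
    for name, (start, end) in DOMAINS.items():
        if start <= resnum <= end:
            return name
    return None


if __name__ == "__main__":
    for resnum in (93, 195, 600):
        print(resnum, domain_of(resnum))
```

Residues that fall between the two ranges (194 and 195 with these boundaries) return `None`, which makes gaps in the annotation explicit.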

## The topology of the ATP binding site (N-ter)


```{python}
#| code-fold: true

import nglview

# Load 3t0z from the RCSB PDB and create the viewer widget
view = nglview.show_pdbid("3t0z")

# Residues lining the ATP binding site (NGL selection language)
site_sele = ("47-55 or 91 or 93 or 96-97 or 98 or 102 or 106 or 107 "
             "or 112 or 132-139 or 150 or 162 or 186")
# Same residues minus 91 and 186, plus ligand and waters, for contacts
contact_sele = ("47-55 or 93 or 96-97 or 98 or 102 or 106 or 107 "
                "or 112 or 132-139 or 150 or 162 or ligand or water")

view.representations = [
    # Protein backbone as cartoon, coloured along the sequence
    {"type": "cartoon", "params": {
        "sele": "protein", "color": "residueindex"
    }},
    # Hetero atoms coloured by element, with their carbons in yellow
    {"type": "ball+stick", "params": {
        "sele": "hetero", "color": "element"
    }},
    {"type": "ball+stick", "params": {
        "sele": "hetero and _C", "color": "yellow"
    }},
    # Binding site side chains and their contacts
    {"type": "licorice", "params": {
        "sele": site_sele, "color": "element"
    }},
    {"type": "contact", "params": {
        "sele": contact_sele, "color": "element"
    }},
]

view.camera = 'orthographic'
view.background = 'black'

view
```


{{< mol-rcsb 3t0z viewportShowAnimation=false layoutShowLeftPanel=false layoutShowControls=false >}}

The binding site comprises a section that accommodates the adenine moiety, characterized by a beta sheet at the bottom of the site and two helices lining it. The side chains exposed to the binding site lumen are globally hydrophobic, apart from the very important aspartate 93, which interacts directly with the adenine moiety. The adenine moiety is surrounded by water molecules, and several of these waters are important hallmarks of several HSP90 binders.
The ribose moiety does not form any H-bonds with the protein itself; its hydroxyls are oriented towards the solvent. The ether oxygen of the ribose is oriented towards valine 107, adjacent to a rather hydrophobic part of the pocket lined by Y139, F138 and W162.
The triphosphate is solvent exposed and interacts with a small helix-loop-helix motif which, as we will probably see a bit later, is part of the more mobile regions of the binding site.
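A description like the one above is usually backed by an explicit residue list. One common way to derive such a list is a distance cutoff around the bound ligand. The snippet below is a minimal sketch of that idea using only the Python standard library and the fixed-column PDB format. The function names, the 4.5 Å default cutoff and the ligand residue name passed in are assumptions made for illustration; this is not how the residue selection shown in the viewer was produced.

```python
import math


def parse_atoms(pdb_text):
    """Parse ATOM/HETATM records from PDB-format text (fixed columns)."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append({
            "record": record,
            "resname": line[17:20].strip(),
            "chain": line[21],
            "resnum": int(line[22:26]),
            "xyz": (float(line[30:38]),
                    float(line[38:46]),
                    float(line[46:54])),
        })
    return atoms


def site_residues(atoms, ligand_resname, cutoff=4.5):
    """Protein residues with any atom within `cutoff` angstrom of the ligand.

    Returns a sorted list of (chain, resnum, resname) tuples. Only ATOM
    records count as protein, so waters and other hetero groups are ignored.
    """
    ligand_xyz = [a["xyz"] for a in atoms
                  if a["record"] == "HETATM"
                  and a["resname"] == ligand_resname]
    site = set()
    for atom in atoms:
        if atom["record"] != "ATOM":
            continue
        if any(math.dist(atom["xyz"], xyz) <= cutoff for xyz in ligand_xyz):
            site.add((atom["chain"], atom["resnum"], atom["resname"]))
    return sorted(site)
```

With a locally downloaded PDB file (the path is hypothetical), `site_residues(parse_atoms(open("3t0z.pdb").read()), "ATP")` would return the residues around a ligand named `ATP`. A distance cutoff ignores waters and conformational variability, so the result is a starting point rather than a definitive binding site definition.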


## Establishing a reference set
Now that we have defined a binding site of interest, let's first establish a few obvious scenarios one might want to cover with a pocket comparison method. In my previous post I stated that the principal use case for binding site comparison methods that I'm focusing on is large-scale comparison or screening. The underlying use cases that I'm mainly interested in are NOT protein function prediction, but rather the prediction of potential counter targets, or the extraction of bound ligands from related binding sites to provide structural input during my compound design cycle.

Let's consider that we are working on a drug design project on the HSP90 N-terminal ATP binding site and that we have our favourite structure of HSP90 as a starting point, namely [4cwr](https://www.rcsb.org/structure/4CWR).

If I search for binding sites similar to my query binding site against all known/putative binding sites, which hits would I expect to get first in the hitlist?

1. the ATP binding site of other HSP90 alpha human structures with the same overall conformation (more or less) - same sequence & conformation
2. the ATP binding site of other HSP90 human isoforms with a similar conformation - likely locally identical sequences & conformation
3. the ATP binding site of HSP90 sequences from other species with a similar conformation - locally very similar sequences
4. all of the above but with somewhat different conformations - conformations
5. the ATP binding site of close homologs (sequence - families etc) to the query structure - similar sequences
6. the ATP binding sites of proteins sharing the same fold as HSP90 - same fold
7. nucleotide binding sites with similar interaction patterns but dissimilar fold - same interactions
8. all binding sites binding ATP must be similar (provocative on purpose ...) - same ligand
9. unexpected & unrelated / unknown similarities - nightmare


These first five give a gradation of how close another ATP binding site could potentially be to the HSP90 binding site. These are the obvious clusters of sequences, structures & conformations one would expect to find. As a result, one can also use this type of gradation for validating binding site comparison methods. One major difference in the setting I'm laying out here is that the background data encompasses the full set of RCSB PDB structures, containing all ligand binding sites + putative binding sites (empty clefts). This puts the approach I'm suggesting in stark contrast with previous benchmark sets, which were classically composed of a list of binding site pairs expected to match and decoys (expected mismatches).
As Vincent Le Guilloux would say: if you can avoid a threshold effect, avoid it! The classical match/decoy setup is exactly such a case, since it introduces a discrete split between a match & a mismatch.
As a result, my background data (what one usually calls a decoy) is the full pocketome, and I'll try to use metrics of success that measure how many of the potentially expected hits are found before the bulk of less expected hits, & why.
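To make the idea of a threshold-free success metric a bit more concrete, here is a minimal sketch of two such measures on a ranked hitlist: the ranks at which the members of each expected category appear, and a cumulative recovery curve. The function names and the identifiers in the example are hypothetical, and these are only one possible choice of metric.

```python
def ranks_by_category(hitlist, categories):
    """Map each category name to the sorted 1-based ranks of its members.

    `hitlist` is ordered from most to least similar; members that are
    absent from the hitlist are simply skipped.
    """
    first_rank = {}
    for position, hit in enumerate(hitlist, start=1):
        first_rank.setdefault(hit, position)  # keep the first occurrence
    return {
        name: sorted(first_rank[m] for m in members if m in first_rank)
        for name, members in categories.items()
    }


def recovery_curve(hitlist, expected):
    """Fraction of expected hits recovered after each rank of the hitlist."""
    remaining = set(expected)
    total = len(remaining)
    if total == 0:
        raise ValueError("expected must contain at least one identifier")
    curve = []
    for hit in hitlist:
        remaining.discard(hit)
        curve.append((total - len(remaining)) / total)
    return curve


if __name__ == "__main__":
    # Purely illustrative identifiers, not real binding sites.
    hitlist = ["site_a", "site_x", "site_b", "site_y", "site_c"]
    categories = {
        "same sequence": ["site_a", "site_b"],
        "other isoforms": ["site_c"],
    }
    print(ranks_by_category(hitlist, categories))
    print(recovery_curve(hitlist, ["site_a", "site_b", "site_c"]))
```

Neither function needs a match/mismatch cutoff: the ranks and the curve can be inspected per category, which fits the graded scenarios listed above.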

In the subsequent sections I'll go through all the painful steps to create the dataset corresponding to each of the scenarios 1 to 5. Scenarios 6 & 7 are a bit more tricky to set up. As for 8 - that's the big issue with binding site comparison benchmarking - you don't know until you know, but I'll try to do a bit of my homework on that as well!


### 1. Same Sequence & conformation
Alright, here we go ... the same sequence is the easiest case of all & several previous studies included a selection of such structures, but as you'll see, even here it quickly gets tricky to do things properly.
The following script will cover the required steps:
- gather all structures (PDB codes) containing a resolved N-terminal domain of human HSP90 alpha
- filter out structures with mutations in binding site residues compared to the wild type
- get an all-by-all comparison of the binding sites (structurally speaking), which allows for some rough clustering of conformations
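
The first two steps above can be sketched as follows. This is a hedged illustration and not the final script. It assumes the PDBe SIFTS "best structures" endpoint and its record fields (`pdb_id`, `chain_id`, `unp_start`, `unp_end`), takes the residue range 47 to 186 from the binding site selection used in the viewer, and uses only the residue identities named in the text (D93, V107, F138, Y139, W162) as a wild-type reference. All function names are hypothetical. Note that the mapped UniProt range only says which part of the sequence a chain corresponds to, not that every residue is actually resolved. The third step (the all-by-all structural comparison) is not covered here.

```python
import json
import urllib.request

# Assumed endpoint: PDBe SIFTS mappings from a UniProt accession to PDB chains.
PDBE_BEST_STRUCTURES = (
    "https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/{acc}"
)

# Wild-type residues named in the text, keyed by UniProt position.
REFERENCE_SITE = {93: "D", 107: "V", 138: "F", 139: "Y", 162: "W"}


def fetch_mappings(accession):
    """Download the chain mappings for a UniProt accession (needs network)."""
    url = PDBE_BEST_STRUCTURES.format(acc=accession)
    with urllib.request.urlopen(url) as response:
        return json.load(response)[accession]


def chains_covering(records, first, last):
    """Keep (pdb_id, chain_id) pairs whose mapped range spans first..last."""
    return sorted({
        (rec["pdb_id"], rec["chain_id"])
        for rec in records
        if rec["unp_start"] <= first and rec["unp_end"] >= last
    })


def site_mutations(site, reference=REFERENCE_SITE):
    """List (position, expected, observed) for each mismatch with reference.

    `site` maps UniProt positions to one-letter residue codes for one chain;
    a position missing from `site` is reported with observed=None.
    """
    return [
        (pos, expected, site.get(pos))
        for pos, expected in sorted(reference.items())
        if site.get(pos) != expected
    ]
```

A typical use would be `chains_covering(fetch_mappings("P07900"), 47, 186)`, followed by `site_mutations(...)` on the binding site residues extracted from each retained chain; chains returning an empty list carry no mutation at the checked positions.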




### SCOP classification

If one considers that the SCOP family of this domain will likely contain other HSP90 N-terminal ATP binding sites, then SCOP is a useful resource for looking up similarities already assigned by protein fold classification. This has already been used in several other papers in the literature, but let's go through it here for the sake of completeness.

The N-ter ATP binding domain of HSP90 and the hierarchy of [SCOP classifications can be found here](https://scop.mrc-lmb.cam.ac.uk/term/8028628).

![./images/binding_site_comparison_benchmark/scop_hsp90.png](../images/binding_site_comparison_benchmark/scop_hsp90.png)

