## Project: Prediction of Cryptic Binding Sites

#### Contact: David Hoksza (Charles University, Prague; david.hoksza@matfyz.cuni.cz)

### Introduction

Accurate prediction of protein-ligand binding sites is crucial for drug discovery, protein engineering, and understanding biological functions at the molecular level. Binding site prediction can be performed based on either the three-dimensional structure of a protein or its sequence. Traditionally, structure-based methods have been more commonly used, as binding is governed by the structural and physico-chemical complementarity between the protein and its small molecule binding partner.

However, structure-based approaches face inherent challenges because they are typically trained and evaluated on holo (ligand-bound) structures. In these cases, the ligand is removed, leaving a clear cavity in the protein structure, which is then detected by the prediction method. This creates a bias, as methods trained on such data often struggle to predict pockets in apo (ligand-free) protein forms, where the binding site may be obscured or, conversely, too open for the ligand to bind. Binding sites that exhibit different conformations in apo and holo forms are referred to as cryptic binding sites.

Most existing protein-ligand binding site prediction methods are biased toward detecting holo binding sites because current benchmarks and training datasets predominantly consist of holo structures. As a result, these methods are often ineffective at identifying cryptic binding sites. To address this limitation, a novel dataset and benchmark called [CryptoBench](https://osf.io/pz4a9/) has been introduced. CryptoBench is built upon [AHoJ-DB](https://apoholo.cz/db), a comprehensive database of apo-holo protein conformations. For every experimentally determined binding site (i.e., a combination of a protein structure and a ligand forming a binding site), AHoJ-DB stores all known apo and holo forms. CryptoBench focuses on apo-holo protein pairs with significant structural rearrangements in their binding sites. On this dataset, existing state-of-the-art (SOTA) methods [have been evaluated](https://academic.oup.com/bioinformatics/article/41/1/btae745/7927823) and compared with a transfer learning approach. This approach (PLM-NN) applies a simple neural network (NN) on top of a protein language model (PLM) and has demonstrated superior performance over the existing SOTA methods.

### Goal

The goal of this project is to improve the PLM-based SOTA approach for cryptic binding site (CBS) detection on the CryptoBench test set. This can be achieved by either extending the training set for the PLM-NN or improving the PLM-NN itself. Potential improvements to the PLM-NN could include enhancing the simple fully-connected NN that uses PLM embeddings as input.
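
To make the PLM-NN setup concrete, the sketch below shows the forward pass of a small fully-connected classifier that maps each residue's PLM embedding to a binding probability. It is an illustration only and not the CryptoBench model: the layer sizes, the random (untrained) weights, and the names `init_mlp` and `predict_residues` are assumptions, and a real model would be trained with a deep learning framework.

```python
import numpy as np


def init_mlp(embedding_dim, hidden_dim, seed=0):
    """Create random weights for a one-hidden-layer classifier (untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.1, size=(embedding_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, size=(hidden_dim, 1)),
        "b2": np.zeros(1),
    }


def predict_residues(embeddings, params):
    """Return one binding probability per residue.

    embeddings: array of shape (n_residues, embedding_dim), one PLM vector
    per residue. Each residue is scored independently of its neighbours.
    """
    hidden = np.maximum(0.0, embeddings @ params["w1"] + params["b1"])  # ReLU
    logits = hidden @ params["w2"] + params["b2"]
    return (1.0 / (1.0 + np.exp(-logits))).ravel()  # sigmoid


# Toy input: 5 residues with 8-dimensional embeddings (hypothetical sizes;
# real PLM embeddings are much larger).
toy_embeddings = np.random.default_rng(1).normal(size=(5, 8))
toy_params = init_mlp(embedding_dim=8, hidden_dim=4)
probabilities = predict_residues(toy_embeddings, toy_params)
```

Because this baseline scores every residue in isolation, changes to the network itself would most likely start in `predict_residues`, while an extended training set only changes which embeddings and labels are fed in.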

To extend the training set, the [AlphaFold Protein Structure Database](https://alphafold.ebi.ac.uk/) can be combined with [AHoJ-DB](https://apoholo.cz/db) as follows:

1. Process AHoJ-DB to identify holo protein structures for which no experimental apo structures exist (i.e., structures whose data could not have been used in CryptoBench).
2. Retrieve the UniProt IDs of these proteins.
3. Obtain AlphaFold-predicted structures for these proteins from AlphaFold DB.
4. Filter the predicted structures to identify those with significant conformational changes compared to the experimental holo structures. These structures may harbor binding sites that could be considered potential cryptic sites.
5. Use the filtered set of potential cryptic sites to extend the training set for the PLM-NN.

Data for steps 1 and 2 will be provided by the organizers.
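
Step 4 above can be sketched as follows. The example superimposes the pocket atoms of a predicted structure onto those of the holo structure with the Kabsch algorithm and keeps pockets whose RMSD exceeds a cutoff. It is a sketch under stated assumptions: plain NumPy stands in for a structure library, the names `kabsch_rmsd` and `filter_candidates` and the 2.0 Å cutoff are hypothetical, and the two coordinate arrays are assumed to list the same pocket atoms in the same order (residue numbering may differ between a PDB entry and an AlphaFold model, so a mapping step may be needed first).

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b):
    """RMSD between two (n, 3) coordinate sets after optimal superposition."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (n, 3)")
    # Remove translation by centring both sets on their centroids.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the covariance matrix.
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))  # guard against improper rotations
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = a @ rotation - b
    return float(np.sqrt((diff ** 2).sum() / len(a)))


def filter_candidates(pockets, threshold=2.0):
    """Keep pockets whose predicted conformation deviates from the holo one.

    pockets: dict mapping an identifier to (holo_coords, predicted_coords).
    threshold: hypothetical RMSD cutoff in angstroms.
    Returns a dict mapping the kept identifiers to their pocket RMSD.
    """
    kept = {}
    for key, (holo_coords, predicted_coords) in pockets.items():
        rmsd = kabsch_rmsd(holo_coords, predicted_coords)
        if rmsd >= threshold:
            kept[key] = rmsd
    return kept


# Toy pocket: six atoms on an octahedron with a 3 Å radius.
holo = 3.0 * np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                       [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

# Same pocket, only rotated by 30 degrees about z and shifted (no real change).
theta = np.pi / 6
rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
rigid = holo @ rz.T + np.array([10.0, -5.0, 2.0])

# Expanded pocket: every atom moves 3 Å outward from the centre.
opened = 2.0 * holo

kept = filter_candidates({"rigid": (holo, rigid), "opened": (holo, opened)})
```

Only the expanded pocket passes the filter here, since a rigid rotation and shift is removed by the superposition. A suitable cutoff for real data is a design choice that should be checked against how CryptoBench defines a significant rearrangement.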

### Data and Model

- The train/test splits containing PDB IDs of apo-holo structure pairs (and corresponding UniProt IDs) can be obtained from the [CryptoBench OSF project site](https://osf.io/pz4a9).
- To assess the structural similarity of pockets, we recommend using Biopython's [Bio.PDB package](https://biopython.org/docs/1.75/api/Bio.PDB.html). A short tutorial on working with PDB structures in Biopython can be found [here](https://biopython.org/DIST/docs/tutorial/Tutorial.html#sec240).
- AlphaFold DB can be queried [using its API](https://www.ebi.ac.uk/training/online/courses/alphafold/accessing-and-predicting-protein-structures-with-alphafold/accessing-predicted-protein-structures-in-the-alphafold-database/whats-the-best-way-to-access-the-database/). The resulting JSON data can be parsed to obtain the predicted structure of the requested protein.
- The code for generating the PLM embeddings used by the PLM-NN can be found [here](https://github.com/skrhakv/esm2-generator). Additionally, much of the code for working with protein language models, training the neural network for binding site prediction, and handling protein structures can be reused from the labs on protein-ligand binding site prediction.
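
As an illustration of the AlphaFold DB item above, the following sketch parses an API response and extracts the link to the predicted structure file. The endpoint pattern in `API_URL`, the response shape (a JSON list of entries), and the `pdbUrl` field name are assumptions that should be checked against the current AlphaFold DB API documentation. The function names and the sample response are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint pattern; verify against the AlphaFold DB API documentation.
API_URL = "https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"


def model_url_from_response(response_text, key="pdbUrl"):
    """Extract the structure file URL from an API response.

    Assumes the response is a JSON list of entries, each a dict that
    contains `key`. Returns None when no usable entry is present.
    """
    entries = json.loads(response_text)
    if not isinstance(entries, list) or not entries:
        return None
    first = entries[0]
    return first.get(key) if isinstance(first, dict) else None


def fetch_model_url(uniprot_id, timeout=30):
    """Query AlphaFold DB for one UniProt accession (needs network access)."""
    url = API_URL.format(uniprot_id=uniprot_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return model_url_from_response(response.read().decode("utf-8"))


# Offline example with a hand-written response in the assumed shape.
sample_response = (
    '[{"uniprotAccession": "P00000", '
    '"pdbUrl": "https://example.org/model.pdb"}]'
)
sample_url = model_url_from_response(sample_response)
```

Keeping the parsing separate from the network call makes it possible to test the logic offline and to cache raw responses when many UniProt IDs are processed.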