# Access _PLINDER_ data for training ML models

The goal of this tutorial is to provide a simple, hands-on demo showing a new user how to access the _PLINDER_ dataset in preparation for training machine learning models.

Here, we are going to demonstrate how to get the input data:
- protein receptor fasta sequence
- small-molecule ligand SMILES string

- #TODO: access to linked _apo_ structure


In the process, we will show:
- How to query the _PLINDER_ index and splits to select relevant data using the `plinder.core` API
- How to extract the data one might want for training a task-specific ML model, e.g. one protein, one ligand
- How to use the `plinder.core` API and the `PlinderDataset` class to supply dataset inputs for the `train` or `val` splits
- How to create a simple diversity sampler based on cluster labels
- #TODO: load linked `apo` structures


### Load _PLINDER_ index

We recommend that users interact with the dataset using the _PLINDER_ Python API.

You may need to install `plinder.loader` with: `` pip install '.[loader]'``

#TODO: test and replace with ``pip install plinder``

In [3]:
from __future__ import annotations
import pandas as pd
from plinder.core.scores import query_index

Load the _PLINDER_ index with selected columns from the annotations table. For a full list of columns with descriptions, please refer to the [docs](https://plinder-org.github.io/plinder/dataset.html).

In [4]:
# get the plinder index with the selected annotation columns
plindex = query_index(
    columns=[
        "system_id",
        "ligand_id",
        "ligand_rdkit_canonical_smiles",
        "ligand_is_ion",
        "ligand_is_artifact",
        "system_num_ligand_chains",
        "system_num_neighboring_protein_chains",
    ],
    filters=[
        ("system_type", "==", "holo"),
        ("system_num_neighboring_protein_chains", "<=", 5),
    ],
)

2024-08-29 21:48:12,016 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.78s


In [5]:
plindex.head()

| | system_id | ligand_id | ligand_rdkit_canonical_smiles | ligand_is_ion | ligand_is_artifact | system_num_ligand_chains | system_num_neighboring_protein_chains |
|---|---|---|---|---|---|---|---|
| 0 | `3grt__1__1.A_2.A__1.B` | `3grt__1__1.B` | `Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...` | False | False | 1 | 2 |
| 1 | `3grt__1__1.A_2.A__1.C` | `3grt__1__1.C` | `N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...` | False | False | 1 | 2 |
| 2 | `3grt__1__1.A_2.A__2.B` | `3grt__1__2.B` | `Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...` | False | False | 1 | 2 |
| 3 | `3grt__1__1.A_2.A__2.C` | `3grt__1__2.C` | `N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...` | False | False | 1 | 2 |
| 4 | `1grx__1__1.A__1.B` | `1grx__1__1.B` | `N[C@@H](CCC(=O)N[C@@H](CS)C(=O)NCC(=O)O)C(=O)O` | False | False | 1 | 1 |


In [6]:
plindex.groupby("system_num_neighboring_protein_chains").system_id.count()

system_num_neighboring_protein_chains
1    406826
2    213268
3     43478
4     10835
5      1783
Name: system_id, dtype: int64
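
To put these counts in proportion, the sketch below converts them to fractions. The numbers are copied by hand from the output above so that the snippet runs without downloading anything; in a live session, something like `plindex["system_num_neighboring_protein_chains"].value_counts(normalize=True)` should give the same picture.

```python
import pandas as pd

# Row counts copied from the groupby output above (one index row per ligand).
counts = pd.Series(
    {1: 406826, 2: 213268, 3: 43478, 4: 10835, 5: 1783},
    name="system_id",
)
counts.index.name = "system_num_neighboring_protein_chains"

# Fraction of index rows for each number of neighboring protein chains.
fractions = (counts / counts.sum()).round(3)
print(fractions)
```

Roughly 60% of the index rows have a single neighboring protein chain, so about 40% involve two or more.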

## Extracting specific data using _PLINDER_ annotations
As the tables above show, a significant fraction of _PLINDER_ systems involve multiple protein chains. If we would like to focus only on single-protein, single-ligand systems, we can use the annotated columns to select systems that:
- contain only one protein chain
- contain only one "proper" ligand

In _PLINDER_, ions and artifacts are also included in the index, so we will use the columns `ligand_is_ion` and `ligand_is_artifact` to select only "proper" ligands.

In [7]:
# define "proper" ligands that are not ions or artifacts
plindex["ligand_is_proper"] = (
    ~plindex["ligand_is_ion"] & ~plindex["ligand_is_artifact"]
)
# make count of these "proper" ligands per system
plindex["system_proper_num_ligand_chains"] = plindex.groupby("system_id")["ligand_is_proper"].transform("sum")

In [8]:
plindex.groupby("ligand_is_proper").system_id.count()

ligand_is_proper
False    128401
True     547789
Name: system_id, dtype: int64
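
To see what the `transform("sum")` step does, the sketch below applies the same two lines to a small made-up DataFrame. The system ids are hypothetical placeholders, not real _PLINDER_ entries. Each row receives the number of "proper" ligands in its system, so the count is repeated on every ligand row of that system.

```python
import pandas as pd

# Toy index: one row per ligand chain, hypothetical system ids.
toy = pd.DataFrame(
    {
        "system_id": ["sysA", "sysA", "sysA", "sysB", "sysC"],
        "ligand_is_ion": [False, True, False, False, False],
        "ligand_is_artifact": [False, False, False, True, False],
    }
)

# Same logic as above: a ligand is "proper" if it is neither ion nor artifact.
toy["ligand_is_proper"] = ~toy["ligand_is_ion"] & ~toy["ligand_is_artifact"]

# Broadcast the per-system count of proper ligands back onto every row.
toy["system_proper_num_ligand_chains"] = (
    toy.groupby("system_id")["ligand_is_proper"].transform("sum")
)
print(toy[["system_id", "ligand_is_proper", "system_proper_num_ligand_chains"]])
```

Here `sysA` has two proper ligands and one ion, `sysB` contains only an artifact, and `sysC` has a single proper ligand.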

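The next three cells count index rows in `plindex_split`, the merged index and split DataFrame that is created in the "Loading splits" section below, so that merge has to be run before these cells.

In [37]: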
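# rows with exactly one ligand chain of any kind (ions and artifacts included) and one protein chain
sum((plindex_split["system_num_ligand_chains"] == 1) & (plindex_split["system_num_neighboring_protein_chains"] == 1))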

200173

In [39]:
# rows with exactly one "proper" ligand chain and one protein chain
sum((plindex_split["system_proper_num_ligand_chains"] == 1) & (plindex_split["system_num_neighboring_protein_chains"] == 1))

297360

In [40]:
# rows with exactly one neighboring protein chain, regardless of ligand count
sum(plindex_split["system_num_neighboring_protein_chains"] == 1)

351198
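
Counting on "proper" ligands gives more rows than counting on all ligand chains because a system with one real ligand plus an ion or artifact has `system_num_ligand_chains > 1` but `system_proper_num_ligand_chains == 1`. The sketch below reproduces this effect on a made-up DataFrame (hypothetical system ids) and then keeps only the proper ligand row of each qualifying system, which is one plausible way to build a one-protein, one-ligand training table.

```python
import pandas as pd

# Toy index rows (hypothetical system ids), one row per ligand chain.
toy = pd.DataFrame(
    {
        "system_id": ["sysA", "sysA", "sysB", "sysC", "sysC"],
        "ligand_is_proper": [True, False, True, True, True],
        "system_num_ligand_chains": [2, 2, 1, 2, 2],
        "system_proper_num_ligand_chains": [1, 1, 1, 2, 2],
        "system_num_neighboring_protein_chains": [1, 1, 1, 1, 1],
    }
)

single_protein = toy["system_num_neighboring_protein_chains"] == 1
one_ligand_any_kind = toy["system_num_ligand_chains"] == 1
one_proper_ligand = toy["system_proper_num_ligand_chains"] == 1

# Row counts under the two definitions of "single ligand".
n_any = int((single_protein & one_ligand_any_kind).sum())
n_proper = int((single_protein & one_proper_ligand).sum())

# Keep only the proper ligand row of each qualifying system.
single_pl = toy[single_protein & one_proper_ligand & toy["ligand_is_proper"]]
print(n_any, n_proper, single_pl["system_id"].tolist())
```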

## Loading splits

Let's have a look at the splits using the _PLINDER_ API.

In [9]:
from plinder.core import PlinderDataset, get_split

In [10]:
# get the current plinder split
split_df = get_split()
split_df.head()

2024-08-29 21:48:40,079 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.70s
2024-08-29 21:48:40,817 | plinder.core.split.utils:40 | INFO : reading /Users/vladas/.local/share/plinder/2024-06/v2/splits/split.parquet
2024-08-29 21:48:40,961 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 3.11s


| | system_id | uniqueness | split | cluster | cluster_for_val_split | system_pass_validation_criteria | system_pass_statistics_criteria | system_proper_num_ligand_chains | system_proper_pocket_num_residues | system_proper_num_interactions | system_proper_ligand_max_molecular_weight | system_has_binding_affinity | system_has_apo_or_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `101m__1__1.A__1.C_1.D` | `101m__A__C_D_c188899` | train | c14 | c0 | True | True | 1 | 27 | 20 | 616.177293 | False | False |
| 1 | `102m__1__1.A__1.C` | `102m__A__C_c237197` | train | c14 | c0 | True | True | 1 | 26 | 20 | 616.177293 | False | True |
| 2 | `103m__1__1.A__1.C_1.D` | `103m__A__C_D_c252759` | train | c14 | c0 | False | True | 1 | 26 | 16 | 616.177293 | False | False |
| 3 | `104m__1__1.A__1.C_1.D` | `104m__A__C_D_c274687` | train | c14 | c0 | False | True | 1 | 27 | 21 | 616.177293 | False | False |
| 4 | `105m__1__1.A__1.C_1.D` | `105m__A__C_D_c221688` | train | c14 | c0 | False | True | 1 | 28 | 20 | 616.177293 | False | False |


For simplicity, let's merge the `plindex` and `split_df` DataFrames into one.

In [49]:
# merge to a single DataFrame
plindex_split = plindex.merge(split_df, on="system_id", how="inner")
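
Two properties of this merge are worth checking. An inner join silently drops index rows whose `system_id` is not in the split, and when both frames carry a column with the same name, pandas renames the two copies with `_x` and `_y` suffixes. If the cells are run top to bottom, `system_proper_num_ligand_chains` exists in both `plindex` (computed earlier) and `split_df` (shipped with the split), so it would be affected. The sketch below shows both effects on made-up data; the system and ligand ids are hypothetical.

```python
import pandas as pd

# Toy stand-ins for plindex (one row per ligand) and split_df (one row per system).
index_toy = pd.DataFrame(
    {
        "system_id": ["sysA", "sysA", "sysB", "sysX"],
        "ligand_id": ["sysA.1", "sysA.2", "sysB.1", "sysX.1"],
        "system_proper_num_ligand_chains": [2, 2, 1, 1],
    }
)
split_toy = pd.DataFrame(
    {
        "system_id": ["sysA", "sysB", "sysC"],
        "split": ["train", "val", "test"],
        "system_proper_num_ligand_chains": [2, 1, 1],
    }
)

# validate="many_to_one" raises if system_id is duplicated in the split table.
merged = index_toy.merge(
    split_toy, on="system_id", how="inner", validate="many_to_one"
)

print(merged.columns.tolist())
print(len(index_toy), "->", len(merged), "rows")  # sysX is not in the split
```

Dropping or renaming the duplicated column in one of the frames before merging avoids the suffixes.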

### Load dataset

The {class}`PlinderDataset` class is the primary way to access the data. It accepts the following arguments:

- `df`: the split DataFrame to use (for example, the output of `get_split()`)
- `split`: the name of the split to sample from (for example, `'val'`)
- `file_with_system_ids`: path to a file containing a list of system ids (default: full index)
- `store_file_path`: if True, include the file path of the source structures in the dataset
- `load_alternative_structures`: if True, include alternative structures in the dataset
- `num_alternative_structures`: number of alternative structures (apo and pred) to include

Below, we provide a wrapper function, `load_dataset_path`, that accesses {class}`PlinderDataset` and samples from protein-ligand similarity clusters using a user-defined sampling function, `sampler_func`.

In [54]:
val_dataset = PlinderDataset(
    df=split_df,
    split='val'
)

Note: if the files are not already available, this downloads them to the `~/.local/share/plinder/{PLINDER_RELEASE}/{PLINDER_ITERATION}` directory.

#### Define diversity sampler function
Here, we provide an example of how one might use `torch.utils.data.WeightedRandomSampler`. However, users are free to sample for diversity in any way they see fit. In this example, we sample for diversity based on the `cluster` column of the splits DataFrame.
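
A minimal sketch of such a sampler follows. It is not the tutorial's original implementation: `cluster_sample_weights` and `make_diversity_sampler` are hypothetical helper names, and inverse cluster size is just one reasonable weighting choice. Each row is weighted by one over the size of its cluster, so every cluster has the same total probability of being drawn. PyTorch is imported only inside the second function, so the weight computation can be checked without it.

```python
from typing import Optional

import pandas as pd


def cluster_sample_weights(clusters: pd.Series) -> pd.Series:
    """Weight each row by 1 / (number of rows sharing its cluster label)."""
    cluster_sizes = clusters.map(clusters.value_counts())
    return 1.0 / cluster_sizes


def make_diversity_sampler(clusters: pd.Series, num_samples: Optional[int] = None):
    """Build a WeightedRandomSampler from cluster labels (requires PyTorch)."""
    from torch.utils.data import WeightedRandomSampler

    weights = cluster_sample_weights(clusters).tolist()
    return WeightedRandomSampler(
        weights,
        num_samples=num_samples if num_samples is not None else len(weights),
        replacement=True,
    )


# Toy cluster labels: three systems in one cluster, one system in another.
toy_clusters = pd.Series(["c14", "c14", "c14", "c7"], name="cluster")
weights = cluster_sample_weights(toy_clusters)
print(weights.tolist())
```

With the real data, the labels would presumably come from the `cluster` column of the training rows, for example `split_df.loc[split_df["split"] == "train", "cluster"]`, and the resulting sampler would be passed to a `torch.utils.data.DataLoader` through its `sampler` argument.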

### Extract specific molecular format needed for training
The function `get_model_input` wraps it all together, allowing us to extract the FASTA sequence and SMILES string needed for training.
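
The body of `get_model_input` is not shown here, so the sketch below only illustrates the kind of record such a function might return. It assumes the protein sequence has already been read from the system's files. `to_fasta` and `make_model_input` are hypothetical helpers, not part of `plinder.core`, and the sequence is a placeholder rather than real data.

```python
def to_fasta(record_id: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record, wrapping lines at `width`."""
    chunks = [sequence[i : i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{record_id}"] + chunks)


def make_model_input(system_id: str, sequence: str, smiles: str) -> dict:
    """Bundle the two training inputs for a one-protein, one-ligand system."""
    return {
        "system_id": system_id,
        "fasta": to_fasta(system_id, sequence),
        "smiles": smiles,
    }


example = make_model_input(
    system_id="1grx__1__1.A__1.B",
    sequence="ACDEFGHIK",  # placeholder, NOT the real sequence of this system
    smiles="N[C@@H](CCC(=O)N[C@@H](CS)C(=O)NCC(=O)O)C(=O)O",  # from the index above
)
print(example["fasta"])
print(example["smiles"])
```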

### Sample diverse training set 

### Get validation set without cluster sampling
Here, we will show how to get the validation set without cluster sampling.
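
Without cluster sampling, every validation system is simply visited once. As a rough sketch on made-up data (hypothetical system ids, with column names taken from the split table above), selecting the validation rows amounts to a plain filter on the `split` column:

```python
import pandas as pd

# Toy stand-in for split_df with the columns used here.
toy_split = pd.DataFrame(
    {
        "system_id": ["sysA", "sysB", "sysC", "sysD"],
        "split": ["train", "val", "val", "test"],
        "cluster": ["c14", "c7", "c7", "c2"],
    }
)

# Keep every validation row in its original order; no reweighting by cluster.
val_df = toy_split[toy_split["split"] == "val"].reset_index(drop=True)
val_ids = val_df["system_id"].tolist()
print(val_ids)
```

With the real data, `PlinderDataset(df=split_df, split='val')` as constructed earlier presumably performs the equivalent selection, and iterating over it without a weighted sampler yields each validation system once.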