# Access _PLINDER_ data for training ML models

The goal of this tutorial is to provide a simple hands-on demo that shows a new user how to access the _PLINDER_ dataset in preparation for training machine learning models.

Here, we are going to demonstrate how to get the key input data:
- protein receptor fasta sequence
- small molecule ligand SMILES string
- linked _apo_ and _pred_ structures


In the process, we will show:
- How to query the _PLINDER_ index and splits to select relevant data using the `plinder.core` API
- How to extract task-specific data for training a task-specific ML model, e.g. one protein, one ligand
- How to use the `plinder.core` API and the `PlinderDataset` class to supply dataset inputs for `train` or `val` splits
- How to load linked `apo` and `pred` structures
- How to create a simple diversity sampler based on cluster labels


## Loading _PLINDER_

We recommend that users interact with the dataset using the _PLINDER_ Python API.

To install the API, run: ``pip install plinder[loader]``

In [1]:
%load_ext autoreload
%autoreload 2

from __future__ import annotations
import pandas as pd
from plinder.core.scores import query_index

### Load _PLINDER_ index with selected columns from annotations table
For a full list with descriptions, please refer to [docs](https://plinder-org.github.io/plinder/dataset.html).

In [2]:
# get plinder index with selected annotation columns specified 
plindex = query_index(
    columns=["system_id", "ligand_id", "ligand_rdkit_canonical_smiles", "ligand_is_ion", "ligand_is_artifact", "system_num_ligand_chains", "system_num_neighboring_protein_chains"],
    filters=[
        ("system_type", "==", "holo"),
        ("system_num_neighboring_protein_chains", "<=", 5)
    ]
)

2024-08-30 18:12:12,823 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.69s
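To make the `filters` argument more concrete, here is a rough pandas analogue of what such `(column, operator, value)` tuples express, run on a made-up toy table. It assumes the tuples are combined with a logical AND; it is only an illustration of the idea, not how `query_index` is implemented, and `apply_filters` is a hypothetical helper.

```python
import operator

import pandas as pd

# Toy stand-in for the annotation table (values are made up for illustration).
toy = pd.DataFrame(
    {
        "system_id": ["a", "b", "c", "d"],
        "system_type": ["holo", "holo", "apo", "holo"],
        "system_num_neighboring_protein_chains": [1, 7, 1, 3],
    }
)

# Map operator strings to Python comparison functions.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def apply_filters(df, filters):
    """Hypothetical helper: keep rows satisfying every (column, op, value) tuple."""
    mask = pd.Series(True, index=df.index)
    for column, op, value in filters:
        mask &= OPS[op](df[column], value)
    return df[mask]


filtered = apply_filters(
    toy,
    [
        ("system_type", "==", "holo"),
        ("system_num_neighboring_protein_chains", "<=", 5),
    ],
)
print(filtered["system_id"].tolist())
```

Only rows that pass both conditions survive, which mirrors the `holo` and `<= 5` chain restriction used in the query above.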


In [3]:
plindex.head()

| | system_id | ligand_id | ligand_rdkit_canonical_smiles | ligand_is_ion | ligand_is_artifact | system_num_ligand_chains | system_num_neighboring_protein_chains |
|---|---|---|---|---|---|---|---|
| 0 | 3grt__1__1.A_2.A__1.B | 3grt__1__1.B | `Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...` | False | False | 1 | 2 |
| 1 | 3grt__1__1.A_2.A__1.C | 3grt__1__1.C | `N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...` | False | False | 1 | 2 |
| 2 | 3grt__1__1.A_2.A__2.B | 3grt__1__2.B | `Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...` | False | False | 1 | 2 |
| 3 | 3grt__1__1.A_2.A__2.C | 3grt__1__2.C | `N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...` | False | False | 1 | 2 |
| 4 | 1grx__1__1.A__1.B | 1grx__1__1.B | `N[C@@H](CCC(=O)N[C@@H](CS)C(=O)NCC(=O)O)C(=O)O` | False | False | 1 | 1 |


In [4]:
plindex.groupby("system_num_neighboring_protein_chains").system_id.count()

system_num_neighboring_protein_chains
1    406826
2    213268
3     43478
4     10835
5      1783
Name: system_id, dtype: int64

### Extracting specific data using _PLINDER_ annotations
As the counts above show, a significant fraction of _PLINDER_ entries (roughly 40% of the index rows counted above) involve more than one neighboring protein chain.

#### Task specific selection
If we would like to focus on single protein, single ligand systems for training, we can use the annotated columns to keep only systems that:
- contain only one protein chain
- contain only one ligand

Remember: in _PLINDER_, artifacts and (single atom) ions are also included in the index if they are part of the pocket.
- We can use the columns `ligand_is_ion` and `ligand_is_artifact` to select only "proper" ligands.

Let's find out how many annotated ligands are "proper".

In [5]:
# define "proper" ligands that are not ions or artifacts
plindex["ligand_is_proper"] = (
    ~plindex["ligand_is_ion"] & ~plindex["ligand_is_artifact"]
)

In [6]:
plindex.groupby("ligand_is_proper").system_id.count()

ligand_is_proper
False    128401
True     547789
Name: system_id, dtype: int64
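The same "proper" ligand mask can be tried on a tiny made-up table to see exactly which rows it keeps. This is a minimal sketch; the toy values are invented, and it assumes both flag columns have a plain boolean dtype (the `~` operator would not negate correctly on object columns containing missing values).

```python
import pandas as pd

# Toy ligand table: s1 has a proper ligand plus an ion, s2 only an artifact.
toy = pd.DataFrame(
    {
        "system_id": ["s1", "s1", "s2", "s3"],
        "ligand_is_ion": [False, True, False, False],
        "ligand_is_artifact": [False, False, True, False],
    }
)

# A ligand is "proper" when it is neither an ion nor an artifact.
toy["ligand_is_proper"] = ~toy["ligand_is_ion"] & ~toy["ligand_is_artifact"]

print(toy[["system_id", "ligand_is_proper"]])
print(toy["ligand_is_proper"].value_counts())
```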

#### User choice

The annotations table gives flexibility to choose the systems for training:
- One could strictly use only the data that contains single protein, single ligand systems
- Alternatively, one could expand the selection to include systems containing a single proper ligand, and optionally ignore the artifacts and ions in the pocket

Let's compare the numbers of such systems!

In [7]:
# create mask for single receptor single ligand systems
systems_1P1L = (plindex["system_num_neighboring_protein_chains"] == 1) & (plindex["system_num_ligand_chains"] == 1)

# make count of these "proper" ligands per system
plindex["system_proper_num_ligand_chains"] = plindex.groupby("system_id")["ligand_is_proper"].transform("sum")

# create mask only for single receptor single "proper" ligand systems
systems_proper_1P1L = (plindex["system_num_neighboring_protein_chains"] == 1) & (plindex["system_proper_num_ligand_chains"] == 1) & plindex["ligand_is_proper"]

print(f"Number of single receptor single ligand systems: {sum(systems_1P1L)}")
print(f"Number of single receptor single \"proper\" ligand systems: {sum(systems_proper_1P1L)}")

Number of single receptor single ligand systems: 238228
Number of single receptor single "proper" ligand systems: 282433


As we can see, the second choice provides roughly 19% more data for training. The caveat is that some of the interactions made by artifacts or ions may influence the binding pose of the "proper" ligand. Users could devise further filtering strategies using the annotations table or external tools, but this is beyond the scope of this tutorial.
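To see why the relaxed selection can be larger than the strict one, the sketch below applies both masks to a made-up table in which one system pairs a proper ligand with an ion. It also recomputes the relative gain from the two counts printed above. The toy rows are invented for illustration.

```python
import pandas as pd

# s1: one proper ligand; s2: one proper ligand + one ion; s3: two proper ligands.
toy = pd.DataFrame(
    {
        "system_id": ["s1", "s2", "s2", "s3", "s3"],
        "system_num_neighboring_protein_chains": [1, 1, 1, 1, 1],
        "system_num_ligand_chains": [1, 2, 2, 2, 2],
        "ligand_is_proper": [True, True, False, True, True],
    }
)

# Count proper ligands per system and broadcast the count back to each row.
toy["system_proper_num_ligand_chains"] = toy.groupby("system_id")[
    "ligand_is_proper"
].transform("sum")

one_chain = toy["system_num_neighboring_protein_chains"] == 1
strict = one_chain & (toy["system_num_ligand_chains"] == 1)
relaxed = (
    one_chain
    & (toy["system_proper_num_ligand_chains"] == 1)
    & toy["ligand_is_proper"]
)
print(int(strict.sum()), int(relaxed.sum()))

# Relative gain computed from the counts reported in the output above.
gain = 282433 / 238228 - 1
print(f"{gain:.1%}")
```

System `s2` is rejected by the strict mask because it has two ligand chains, but it is kept by the relaxed mask because only one of them is a proper ligand.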

### Loading splits

Now that we have curated the systems of interest, let's have a look at the splits using the _PLINDER_ API.

In [8]:
from plinder.core import get_split

#### Getting the splits
The `get_split` function returns the current _PLINDER_ split as a DataFrame. A detailed description of its columns is provided in the [dataset documentation](https://plinder-org.github.io/plinder/dataset.html#splits-splits). For our practical purposes, we are mostly interested in `system_id` and `split`, which assigns each of our systems to a specific split category.

In [9]:
# get the current plinder split
split_df = get_split()

2024-08-30 18:12:15,789 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.67s
2024-08-30 18:12:16,464 | plinder.core.split.utils:40 | INFO : reading /Users/vladas/.local/share/plinder/2024-06/v2/splits/split.parquet
2024-08-30 18:12:16,609 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 2.38s


In [10]:
split_df.head()

| | system_id | uniqueness | split | cluster | cluster_for_val_split | system_pass_validation_criteria | system_pass_statistics_criteria | system_proper_num_ligand_chains | system_proper_pocket_num_residues | system_proper_num_interactions | system_proper_ligand_max_molecular_weight | system_has_binding_affinity | system_has_apo_or_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101m__1__1.A__1.C_1.D | 101m__A__C_D_c188899 | train | c14 | c0 | True | True | 1 | 27 | 20 | 616.177293 | False | False |
| 1 | 102m__1__1.A__1.C | 102m__A__C_c237197 | train | c14 | c0 | True | True | 1 | 26 | 20 | 616.177293 | False | True |
| 2 | 103m__1__1.A__1.C_1.D | 103m__A__C_D_c252759 | train | c14 | c0 | False | True | 1 | 26 | 16 | 616.177293 | False | False |
| 3 | 104m__1__1.A__1.C_1.D | 104m__A__C_D_c274687 | train | c14 | c0 | False | True | 1 | 27 | 21 | 616.177293 | False | False |
| 4 | 105m__1__1.A__1.C_1.D | 105m__A__C_D_c221688 | train | c14 | c0 | False | True | 1 | 28 | 20 | 616.177293 | False | False |


Method developers working on _flexible_ docking may also find the annotation column `system_has_apo_or_pred` handy: it indicates whether the system has linked `apo` or `pred` structures available (see later).

In [11]:
split_df.groupby(["split", "system_has_apo_or_pred"]).system_id.count()

split    system_has_apo_or_pred
removed  False                      56876
         True                       41842
test     False                        548
         True                         488
train    False                     189703
         True                      119437
val      False                        456
         True                         376
Name: system_id, dtype: int64
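Raw counts can be hard to compare across splits of very different sizes. A small sketch on made-up data shows how the mean of a boolean column per group gives the fraction of systems with a linked structure; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy split table with a boolean availability flag.
toy_split = pd.DataFrame(
    {
        "system_id": ["a", "b", "c", "d", "e", "f"],
        "split": ["train", "train", "train", "train", "val", "val"],
        "system_has_apo_or_pred": [True, False, False, False, True, False],
    }
)

# The mean of a boolean column is the fraction of True values in each group.
fraction = toy_split.groupby("split")["system_has_apo_or_pred"].mean()
print(fraction)
```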

For simplicity, let's merge the `plindex` and split DataFrames into one.

In [12]:
# merge to a single DataFrame
plindex_split = plindex.merge(split_df, on="system_id", how="left")
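Two details of this merge are worth checking on a toy example. First, the index has one row per ligand while the split table has one row per system, so the join is many-to-one. Second, both tables carry a `system_proper_num_ligand_chains` column (we added one to `plindex` earlier), so pandas will rename the overlapping columns with suffixes. The sketch below uses made-up rows; the `validate` and `suffixes` arguments are optional additions, not part of the tutorial's call.

```python
import pandas as pd

# One row per ligand (system s1 has two ligands).
ligands = pd.DataFrame(
    {
        "system_id": ["s1", "s1", "s2"],
        "ligand_id": ["s1.B", "s1.C", "s2.B"],
        "system_proper_num_ligand_chains": [2, 2, 1],
    }
)
# One row per system; s2 is missing from the split table on purpose.
splits = pd.DataFrame(
    {
        "system_id": ["s1", "s3"],
        "split": ["train", "val"],
        "system_proper_num_ligand_chains": [2, 1],
    }
)

merged = ligands.merge(
    splits,
    on="system_id",
    how="left",               # keep every ligand row, even without a split
    validate="many_to_one",   # raise if the split table has duplicate system_ids
    suffixes=("", "_split"),  # make the origin of the overlapping column explicit
)
print(merged)
```

With `how="left"`, systems absent from the split table keep their rows and get missing values in `split`, so it can be useful to check for those before training.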

### Getting links to `apo` or `pred` structures

In [13]:
from plinder.core.scores import query_links

In [14]:
links_df = query_links(
    # columns=["reference_system_id", "id", "sort_score"],
)

2024-08-30 18:12:19,507 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.91s
2024-08-30 18:12:22,941 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 5.92s


Note that the table is sorted by `sort_score`, which is the resolution for `apo` structures and the `plddt` for `pred` structures. The added `kind` column records whether each link is `apo` or `pred`.

In [15]:
links_df.head()

| | reference_system_id | id | pocket_fident | pocket_lddt | protein_fident_qcov_weighted_sum | protein_fident_weighted_sum | protein_lddt_weighted_sum | target_id | sort_score | receptor_file | ... | posebusters_volume_overlap_with_inorganic_cofactors | posebusters_volume_overlap_with_waters | fraction_reference_proteins_mapped | fraction_model_proteins_mapped | lddt | bb_lddt | per_chain_lddt_ave | per_chain_bb_lddt_ave | filename | kind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6pl9__1__1.A__1.C | 2vb1_A | 100.0 | 86.0 | 100.0 | 100.0 | 96.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/6pl9__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.903772 | 0.968844 | 0.890822 | 0.959674 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 1 | 6ahh__1__1.A__1.G | 2vb1_A | 100.0 | 98.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/6ahh__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.894349 | 0.962846 | 0.883217 | 0.954721 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 2 | 5b59__1__1.A__1.B | 2vb1_A | 100.0 | 91.0 | 100.0 | 100.0 | 96.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/5b59__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.903266 | 0.962318 | 0.890656 | 0.955258 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 3 | 3ato__1__1.A__1.B | 2vb1_A | 100.0 | 99.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/3ato__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.89053 | 0.954696 | 0.879496 | 0.946326 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 4 | 6mx9__1__1.A__1.K | 2vb1_A | 100.0 | 98.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/6mx9__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.904116 | 0.964309 | 0.892434 | 0.955853 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |


If you want to consider only one linked structure per system, sort by `sort_score` and then drop duplicates on `reference_system_id`. With this priority score, `pred` structures are used only when no `apo` structure is available. A different priority can be obtained by sorting with `ascending=False`, or by filtering on the `kind` column (for example `kind == "pred"`).

In [16]:
single_links_df = links_df.sort_values("sort_score", ascending=True).drop_duplicates("reference_system_id")
single_links_df.head()

| | reference_system_id | id | pocket_fident | pocket_lddt | protein_fident_qcov_weighted_sum | protein_fident_weighted_sum | protein_lddt_weighted_sum | target_id | sort_score | receptor_file | ... | posebusters_volume_overlap_with_inorganic_cofactors | posebusters_volume_overlap_with_waters | fraction_reference_proteins_mapped | fraction_model_proteins_mapped | lddt | bb_lddt | per_chain_lddt_ave | per_chain_bb_lddt_ave | filename | kind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6pl9__1__1.A__1.C | 2vb1_A | 100.0 | 86.0 | 100.0 | 100.0 | 96.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/6pl9__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.903772 | 0.968844 | 0.890822 | 0.959674 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 110 | 6agr__1__1.A__1.G | 2vb1_A | 100.0 | 94.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/6agr__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.893323 | 0.960976 | 0.882798 | 0.953605 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 111 | 4qgz__1__1.A__1.C | 2vb1_A | 100.0 | 80.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/4qgz__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.890561 | 0.954621 | 0.880202 | 0.947527 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 112 | 4owa__2__1.B__1.NA | 2vb1_A | 100.0 | 95.0 | 100.0 | 100.0 | 92.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/4owa__2__1.B_... | ... | True | True | 1.0 | 1.0 | 0.886402 | 0.945307 | 0.865961 | 0.928915 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
| 113 | 6wgo__1__1.A__1.E | 2vb1_A | 100.0 | 98.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | /plinder/2024-06/assignments/apo/6wgo__1__1.A_... | ... | True | True | 1.0 | 1.0 | 0.899726 | 0.962098 | 0.885907 | 0.951752 | /Users/vladas/.local/share/plinder/2024-06/v2/... | apo |
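One subtlety of a single ascending sort is that resolution is better when lower, while pLDDT is better when higher, so among `pred` links the lowest-scoring prediction would come first. A possible alternative, sketched on made-up rows, is to sort by `kind` first and then by a key in which smaller is always better. The `rank_key` column and the toy values are illustrative assumptions, not part of the _PLINDER_ API.

```python
import pandas as pd

# Toy links: s1 has two apo links and one pred link, s2 has only pred links.
toy_links = pd.DataFrame(
    {
        "reference_system_id": ["s1", "s1", "s1", "s2", "s2"],
        "id": ["apoA", "apoB", "predA", "predB", "predC"],
        "kind": ["apo", "apo", "pred", "pred", "pred"],
        "sort_score": [2.1, 1.5, 92.0, 71.0, 88.0],
    }
)

# Keep resolution as is for apo, negate pLDDT for pred,
# so that a smaller rank_key is better for both kinds.
toy_links["rank_key"] = toy_links["sort_score"].where(
    toy_links["kind"] == "apo", -toy_links["sort_score"]
)

best = (
    toy_links.sort_values(["kind", "rank_key"])  # "apo" sorts before "pred"
    .drop_duplicates("reference_system_id")      # keep the first row per system
    .set_index("reference_system_id")["id"]
)
print(best.to_dict())
```

Here `s1` gets its best-resolution `apo` link and `s2`, which has no `apo`, gets its highest-pLDDT prediction.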


Now that we have links to `apo` / `pred` structures, we can see how many of them are available for our single protein, single ligand systems.

In [17]:
plindex_split[systems_1P1L].groupby(["split", "system_has_apo_or_pred"]).system_id.count()

split    system_has_apo_or_pred
removed  False                       4720
         True                       41685
test     False                         59
         True                         487
train    False                      33897
         True                      118925
val      False                         26
         True                         374
Name: system_id, dtype: int64

In [18]:
plindex_split_1P1L_links = plindex_split[systems_1P1L].merge(single_links_df, left_on="system_id", right_on="reference_system_id", how="left")

In [19]:
# let's check how many systems have linked structures
plindex_split_1P1L_links['system_has_linked_apo_or_pred'] = ~plindex_split_1P1L_links.filename.isna()
plindex_split_1P1L_links.groupby(["split", "system_has_linked_apo_or_pred"]).system_id.count()

split    system_has_linked_apo_or_pred
removed  False                              7097
         True                              39308
test     False                                76
         True                                470
train    False                             47109
         True                             105713
val      False                                30
         True                                370
Name: system_id, dtype: int64

In [20]:
# TODO: fix inconsistency
plindex_split_1P1L_links.groupby(["system_has_apo_or_pred", "system_has_linked_apo_or_pred"]).system_id.count()

system_has_apo_or_pred  system_has_linked_apo_or_pred
False                   False                             37600
                        True                               1102
True                    False                             16712
                        True                             144759
Name: system_id, dtype: int64
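When two flags are expected to agree, it can help to list the disagreeing rows rather than only count them. The sketch below, on a made-up table, builds the same kind of cross-tabulation with `pd.crosstab` and then pulls out the mismatches; the rows and file names are invented.

```python
import pandas as pd

# Toy table: the annotated flag and the presence of a merged link can disagree.
toy = pd.DataFrame(
    {
        "system_id": ["a", "b", "c", "d", "e"],
        "system_has_apo_or_pred": [True, True, False, False, True],
        "filename": ["x.parquet", None, None, "y.parquet", "z.parquet"],
    }
)
toy["system_has_linked_apo_or_pred"] = toy["filename"].notna()

# Contingency table of annotated flag (rows) vs. link actually found (columns).
table = pd.crosstab(
    toy["system_has_apo_or_pred"], toy["system_has_linked_apo_or_pred"]
)
print(table)

# Rows where the two flags disagree are the ones worth inspecting.
mismatch = toy[
    toy["system_has_apo_or_pred"] != toy["system_has_linked_apo_or_pred"]
]
print(mismatch["system_id"].tolist())
```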

#### Selecting final dataset
Let's select only the systems that have linked structures for flexible docking.

In [21]:
plindex_final_df = plindex_split_1P1L_links[
    (plindex_split_1P1L_links.system_has_linked_apo_or_pred) & (plindex_split_1P1L_links.split != "removed")
]
plindex_final_df.groupby(["split", "system_has_linked_apo_or_pred"]).system_id.count()

split  system_has_linked_apo_or_pred
test   True                                470
train  True                             105713
val    True                                370
Name: system_id, dtype: int64

In [22]:
plindex_final_df[["ligand_rdkit_canonical_smiles", "filename"]].iloc[0].filename

'/Users/vladas/.local/share/plinder/2024-06/v2/links/pred_links.parquet'

### Using _PLINDER_ API to load dataset by split

In [31]:
from plinder.core import PlinderDataset
from plinder.core.loader import get_model_input_files

In [24]:
train_dataset = PlinderDataset(
    df=plindex_final_df,
    split="train",
    num_alternative_structures=2,
    file_paths_only=True,
)

Note: the `get_model_input_files` function accepts `split="train"`, `"val"` or `"test"`.

In [25]:
sample_dataset = get_model_input_files(
    plindex_final_df,
    split = "val",
    max_num_sample = 10,
    num_alternative_structures = 1,
    )

2024-08-30 18:12:26,737 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.84s
2024-08-30 18:12:27,402 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.66s
2024-08-30 18:12:30,058 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.75s
2024-08-30 18:12:30,794 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 3.39s
2024-08-30 18:12:34,089 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 1.01s
2024-08-30 18:12:34,803 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.71s
2024-08-30 18:12:38,148 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 1.01s
2024-08-30 18:12:38,833 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.68s
2024-08-30 18:12:41,860 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.73s
2024-08-30 18:12:42,615 | plinder.core.scores.links.query_links:24 | INFO : runtime succeed

Note: if the files are not already available locally, this call downloads them to the `~/.local/share/plinder/{PLINDER_RELEASE}/{PLINDER_ITERATION}` directory.

In [26]:
sample_dataset

[(PosixPath('/Users/vladas/.local/share/plinder/2024-06/v2/systems/4cj6__1__1.A__1.B/sequences.fasta'),
  'CC1=C(/C=C/C(C)=C/C=C/C(C)=C/C=O)C(C)(C)CCC1',
  ['/Users/vladas/.local/share/plinder/2024-06/v2/linked_structures/pred/4cj6__1__1.A__1.B/P12271_A/superposed.cif']),
 (PosixPath('/Users/vladas/.local/share/plinder/2024-06/v2/systems/3cj9__1__1.A__1.D/sequences.fasta'),
  'Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)(O)O)[C@@H](O)[C@H]1O',
  ['/Users/vladas/.local/share/plinder/2024-06/v2/linked_structures/apo/3cj9__1__1.A__1.D/3cj1_A/superposed.cif']),
 (PosixPath('/Users/vladas/.local/share/plinder/2024-06/v2/systems/7da6__1__1.A__1.B/sequences.fasta'),
  'NCCCC[C@@H](C=O)NC(=O)CNC(=O)[C@H](CCCN=C(N)N)NC(=O)[C@@H](N)Cc1ccccc1',
  ['/Users/vladas/.local/share/plinder/2024-06/v2/linked_structures/apo/7da6__1__1.A__1.B/7h54_A/superposed.cif']),
 (PosixPath('/Users/vladas/.local/share/plinder/2024-06/v2/systems/5dyw__1__1.A__1.E/sequences.fasta'),
  'CC(=O)N[C@H]1[C@@H](O[C@H]2[C@H](O)[C@@H](
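Each entry printed above is a tuple of a FASTA path, a SMILES string, and a list of linked structure paths. As a minimal sketch of how such an entry might be consumed, the code below unpacks a stand-in record and reads its sequences with a small hand-written FASTA parser. The record contents (header, sequence, SMILES, file names) are made up, and `read_fasta` is a hypothetical helper, not part of `plinder.core`.

```python
import tempfile
from pathlib import Path


def read_fasta(path):
    """Hypothetical helper: parse a FASTA file into a {header: sequence} dict."""
    sequences = {}
    header = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines; join them.
            sequences[header] += line
    return sequences


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in record with the same (fasta_path, smiles, linked_paths) shape.
    fasta_path = Path(tmp) / "sequences.fasta"
    fasta_path.write_text(">1.A\nMKTAYIAK\nQRQISFVK\n")
    record = (fasta_path, "CCO", [str(Path(tmp) / "superposed.cif")])

    fasta_file, smiles, linked_structures = record
    chains = read_fasta(fasta_file)

print(chains, smiles, len(linked_structures))
```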

### Using _PLINDER_ clusters in sampling
- #TODO: Example how to create a simple diversity sampler based on cluster labels

#### Define diversity sampler function
Here, we provide an example of how one might use `torch.utils.data.WeightedRandomSampler`. However, users are free to sample for diversity in any way they see fit. For this example, we sample for diversity based on the `cluster` column in the splits DataFrame.
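A minimal sketch of the weighting idea, under the assumption that each sample is weighted by the inverse size of its cluster so that every cluster is drawn with equal probability. To keep the example free of a PyTorch dependency, the draw itself uses NumPy; `cluster_balanced_weights` is a hypothetical helper and the toy clusters are made up.

```python
import numpy as np
import pandas as pd


def cluster_balanced_weights(clusters: pd.Series) -> np.ndarray:
    """Weight each sample by 1 / (size of its cluster), normalised to sum to 1."""
    cluster_sizes = clusters.map(clusters.value_counts())
    weights = 1.0 / cluster_sizes.to_numpy(dtype=float)
    return weights / weights.sum()


# Toy split table: cluster c0 is three times larger than c1.
toy = pd.DataFrame(
    {
        "system_id": ["a", "b", "c", "d"],
        "cluster": ["c0", "c0", "c0", "c1"],
    }
)
weights = cluster_balanced_weights(toy["cluster"])
print(weights)

# Draw sample indices with replacement according to the weights.
rng = np.random.default_rng(0)
draws = rng.choice(len(toy), size=1000, replace=True, p=weights)
drawn_clusters = toy["cluster"].to_numpy()[draws]
print(pd.Series(drawn_clusters).value_counts(normalize=True))
```

The same weights could be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)` and the resulting sampler handed to a `DataLoader`. The sampler does not require the weights to be normalised.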