# MLSB/PLINDER Data Access
(mlsb-notebook-target)=

This tutorial provides background information for the MLSB/PLINDER challenge, along with a simple hands-on demo of how participants can access and use the _PLINDER_ dataset.

## Background information <a id="background-information"></a>

For background information on the rules of the challenge, see [MLSB/P(L)INDER challenge rules](#mlsb-rules-target).

## Accessing and loading data for training <a class="anchor" id="load-data"></a>

Here, we demonstrate how to get the key input data:
- protein receptor FASTA sequence
- small-molecule ligand SMILES string
- linked _apo_ and _pred_ structures


In the process, we will show how to:
- download the _PLINDER_ data
- query the _PLINDER_ index and splits to select relevant data using the `plinder.core` API
- extract the data one might want for training a task-specific ML model, e.g. one protein, one ligand
- use the `plinder.core` API to:
    - supply dataset inputs for `train` or `val` splits
    - load linked `apo` and `pred` structures
    - use diversity subsampling based on cluster labels

### Download _PLINDER_ <a class="anchor" id="download-plinder"></a>

To download, run: `plinder_download --release 2024-06 --iteration v2 --yes` <br>
This will download and unpack all necessary files. For more information on downloading, check out the [Dataset Tutorial](https://plinder-org.github.io/plinder/tutorial/dataset.html#getting-the-data).

:::{note} The dataset is hundreds of gigabytes in size; downloading and extracting should take about 40 minutes. If you want to play around with a toy example dataset, please use `--iteration tutorial`
:::


### Loading _PLINDER_  <a class="anchor" id="loading-plinder"></a>

We recommend that users interact with the dataset using the _PLINDER_ Python API.

To install the API, run: ``pip install plinder[loader]``. If you are using a `zsh` terminal, you will have to quote the package name, like ``"plinder[loader]"``.

NOTE: once _PLINDER_ is downloaded locally, you can use it in offline mode to save time on data queries with: `os.environ["PLINDER_OFFLINE"] = "true"`

In [1]:
%load_ext autoreload
%autoreload 2
%env PLINDER_LOG_LEVEL=0
# once PLINDER is downloaded you can set this to true
# %env PLINDER_OFFLINE=true

from __future__ import annotations

env: PLINDER_LOG_LEVEL=0


#### Load _PLINDER_ index with selected columns from annotations table
For a full list with descriptions, please refer to [docs](https://plinder-org.github.io/plinder/dataset.html#annotation-tables-index).

In [2]:
from plinder.core.scores import query_index

In [3]:
# get plinder index with selected annotation columns specified
plindex = query_index(
    columns=["system_id", "ligand_id",
             "ligand_rdkit_canonical_smiles", "ligand_is_ion",
             "ligand_is_artifact", "system_num_ligand_chains",
             "system_num_protein_chains",
             "ligand_is_proper",
             "system_proper_num_ligand_chains",
             ],
    splits=["train", "val"], # This is the default
)

In [4]:
plindex.head()

| | system_id | ligand_id | ligand_rdkit_canonical_smiles | ligand_is_ion | ligand_is_artifact | system_num_ligand_chains | system_num_protein_chains | ligand_is_proper | system_proper_num_ligand_chains | split |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3grt__1__1.A_2.A__1.B | 3grt__1__1.B | `Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...` | False | False | 1 | 2 | True | 1 | train |
| 1 | 3grt__1__1.A_2.A__1.C | 3grt__1__1.C | `N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...` | False | False | 1 | 2 | True | 1 | train |
| 2 | 3grt__1__1.A_2.A__2.B | 3grt__1__2.B | `Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...` | False | False | 1 | 2 | True | 1 | train |
| 3 | 3grt__1__1.A_2.A__2.C | 3grt__1__2.C | `N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...` | False | False | 1 | 2 | True | 1 | train |
| 4 | 1grx__1__1.A__1.B | 1grx__1__1.B | `N[C@@H](CCC(=O)N[C@@H](CS)C(=O)NCC(=O)O)C(=O)O` | False | False | 1 | 1 | True | 1 | train |


In [5]:
# Count index entries by the number of neighboring protein chains in each system
plindex.groupby("system_num_protein_chains").system_id.count()

system_num_protein_chains
1    260142
2    131014
3     23306
4      4466
5       610
Name: system_id, dtype: int64
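
Note that `count()` tallies index rows, not unique systems: the five counts above sum to 419538, which is more than the 309972 unique `system_id` values reported at the end of this tutorial. The sketch below illustrates the difference between `count()` and `nunique()` on a small toy table. The IDs are made up for illustration, and the idea that a system with several ligands occupies several rows is an assumption inferred from those numbers.

```python
import pandas as pd

# Toy stand-in for the index; all IDs below are hypothetical.
toy_index = pd.DataFrame(
    {
        "system_id": ["sys1", "sys1", "sys2", "sys3"],
        "ligand_id": ["sys1.B", "sys1.C", "sys2.B", "sys3.B"],
        "system_num_protein_chains": [1, 1, 2, 1],
    }
)

grouped = toy_index.groupby("system_num_protein_chains").system_id
rows_per_group = grouped.count()       # counts rows (one per ligand entry)
systems_per_group = grouped.nunique()  # counts each system only once

print(rows_per_group.to_dict())
print(systems_per_group.to_dict())
```

On the real index, swapping `.count()` for `.nunique()` in the cell above should give per-system rather than per-row numbers.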

#### Extracting specific data using _PLINDER_ annotations  <a class="anchor" id="extracting-annotations"></a>
As the tables above show, a significant fraction of _PLINDER_ systems are complexes with multiple protein chains.

##### Task specific selection
If we would like to focus on single protein and single ligand systems for training, we can use the annotated columns to keep only systems that contain:
- only one protein chain
- only one ligand

Remember: in _PLINDER_, artifacts and (single atom) ions are also included in the index if they are part of the pocket.
- `ligand_is_proper` combines the columns `ligand_is_ion` and `ligand_is_artifact` so that only "proper" ligands can be selected.

Let's find out how many annotated ligands are "proper".

In [6]:
plindex.groupby("ligand_is_proper").system_id.count()

ligand_is_proper
False     74608
True     344930
Name: system_id, dtype: int64
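
As a rough illustration of how such a combined flag can be derived, the sketch below recomputes a "proper" flag on a toy table. It assumes that a ligand is "proper" exactly when it is neither an ion nor an artifact. That rule is an assumption based on the description above, not a statement of how _PLINDER_ computes the column, and the ligand IDs are made up.

```python
import pandas as pd

# Toy ligand table; IDs and flags are hypothetical.
toy_ligands = pd.DataFrame(
    {
        "ligand_id": ["lig1", "lig2", "lig3"],
        "ligand_is_ion": [False, True, False],
        "ligand_is_artifact": [False, False, True],
    }
)

# Assumed rule: proper = not an ion and not an artifact.
toy_ligands["is_proper_recomputed"] = ~(
    toy_ligands["ligand_is_ion"] | toy_ligands["ligand_is_artifact"]
)
print(toy_ligands)
```

On the real index, comparing such a recomputed column against `plindex["ligand_is_proper"]` is a quick way to check whether the assumption holds.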

##### User choice

The annotations table gives flexibility to choose the systems for training:
- One could strictly use only the data that contains single protein, single ligand systems
- Alternatively, one could expand the number of systems to include those containing a single proper ligand, and optionally ignore the artifacts and ions in the pocket

Let's compare the numbers of such systems!

In [7]:
# create mask for single receptor single ligand systems
systems_1p1l = (plindex["system_num_protein_chains"] == 1) & (plindex["system_num_ligand_chains"] == 1)

# create mask only for single receptor single "proper" ligand systems
systems_proper_1p1l = (plindex["system_num_protein_chains"] == 1) & (plindex["system_proper_num_ligand_chains"] == 1) & plindex["ligand_is_proper"]

print(f"Number of single receptor single ligand systems: {sum(systems_1p1l)}")
print(f"Number of single receptor single \"proper\" ligand systems: {sum(systems_proper_1p1l)}")

Number of single receptor single ligand systems: 153222
Number of single receptor single "proper" ligand systems: 182861


As we can see, the second choice provides about 19% more data for training (182861 vs. 153222 systems). The caveat, however, is that some of the interactions made by artifacts or ions may influence the binding pose of the "proper" ligand. Users could come up with further filtering strategies using the annotations table or external tools, but this is beyond the scope of this tutorial.
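
The size of the gain can be checked directly from the two counts printed above. This is plain arithmetic on those numbers; the variable names are chosen here for illustration.

```python
# Counts copied from the output of the previous cell.
n_1p1l = 153222         # single receptor, single ligand
n_proper_1p1l = 182861  # single receptor, single "proper" ligand

extra = n_proper_1p1l - n_1p1l
relative_gain = extra / n_1p1l
print(f"{extra} additional systems ({relative_gain:.1%} more)")
# prints: 29639 additional systems (19.3% more)
```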

#### Getting links to `apo` or `pred` structures <a class="anchor" id="getting-apo-pred"></a>

:::{currentmodule} plinder.core
:::

For users interested in including `apo` and `pred` structures in their workflow, all the information needed can be obtained from the function {func}`query_links`.

In [8]:
from plinder.core.scores import query_links

In [9]:
links_df = query_links(
    columns=["reference_system_id", "id", "sort_score"],
)

:::{note} The table is sorted by `sort_score`, which is the resolution for `apo` structures and `plddt` for `pred` structures. Whether a linked structure is `apo` or `pred` is specified in the additionally added `filename` and `kind` columns, indicating that the structure was sourced from PDB or AF2DB, respectively.
:::

In [10]:
links_df.head()

| | reference_system_id | id | sort_score | kind |
|---|---|---|---|---|
| 0 | 6pl9__1__1.A__1.C | 2vb1_A | 0.65 | apo |
| 1 | 6ahh__1__1.A__1.G | 2vb1_A | 0.65 | apo |
| 2 | 5b59__1__1.A__1.B | 2vb1_A | 0.65 | apo |
| 3 | 3ato__1__1.A__1.B | 2vb1_A | 0.65 | apo |
| 4 | 6mx9__1__1.A__1.K | 2vb1_A | 0.65 | apo |


If a user wants to consider only one linked structure per system, we can sort by `sort_score` and then drop duplicates. With this priority score in ascending order, `pred` structures will not be used unless no `apo` is available. The alternative can be achieved by sorting with `ascending=False`, or by filtering on the `kind` column with `kind == "pred"`.

In [11]:
single_links_df = links_df.sort_values("sort_score", ascending=True).drop_duplicates("reference_system_id")
single_links_df.head()

| | reference_system_id | id | sort_score | kind |
|---|---|---|---|---|
| 0 | 6pl9__1__1.A__1.C | 2vb1_A | 0.65 | apo |
| 110 | 6agr__1__1.A__1.G | 2vb1_A | 0.65 | apo |
| 111 | 4qgz__1__1.A__1.C | 2vb1_A | 0.65 | apo |
| 112 | 4owa__2__1.B__1.NA | 2vb1_A | 0.65 | apo |
| 113 | 6wgo__1__1.A__1.E | 2vb1_A | 0.65 | apo |
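
To make the effect of the sort order concrete, here is a minimal sketch on a toy links table. All IDs and scores are made up, and the sketch assumes that `apo` resolutions are numerically smaller than `pred` `plddt` values, which is what makes an ascending sort prefer `apo` structures.

```python
import pandas as pd

# Toy links table; IDs and scores are hypothetical.
toy_links = pd.DataFrame(
    {
        "reference_system_id": ["sysA", "sysA", "sysB"],
        "id": ["apo1_A", "pred1_A", "pred2_A"],
        "sort_score": [1.8, 92.0, 88.5],  # resolution for apo, plddt for pred
        "kind": ["apo", "pred", "pred"],
    }
)

# Ascending sort: the apo link wins for sysA; sysB falls back to its pred.
best_links = toy_links.sort_values("sort_score", ascending=True).drop_duplicates(
    "reference_system_id"
)

# Alternative: keep only pred links, taking the highest plddt per system.
pred_only = (
    toy_links[toy_links["kind"] == "pred"]
    .sort_values("sort_score", ascending=False)
    .drop_duplicates("reference_system_id")
)

print(best_links[["reference_system_id", "id", "kind"]])
print(pred_only[["reference_system_id", "id", "kind"]])
```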


Now that we have links to `apo` / `pred` structures, we can see how many of those are available for our single protein single ligand systems.

In [12]:
plindex_split_1p1l_links = plindex[systems_1p1l].merge(single_links_df, left_on="system_id", right_on="reference_system_id", how="left")

In [13]:
# let's check how many systems have linked structures
plindex_split_1p1l_links['system_has_linked_apo_or_pred'] = ~plindex_split_1p1l_links.kind.isna()
plindex_split_1p1l_links.groupby(["split", "system_has_linked_apo_or_pred"]).system_id.count()

split  system_has_linked_apo_or_pred
train  False                             47109
       True                             105713
val    False                                30
       True                                370
Name: system_id, dtype: int64
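
The left merge used above keeps every system and leaves the link columns empty (NaN) where no linked structure exists, which is why checking `kind` for missing values works as an availability flag. A minimal sketch on toy tables, with made-up IDs:

```python
import pandas as pd

# Toy system and link tables; all IDs are hypothetical.
toy_systems = pd.DataFrame(
    {"system_id": ["sysA", "sysB", "sysC"], "split": ["train", "train", "val"]}
)
toy_single_links = pd.DataFrame(
    {
        "reference_system_id": ["sysA", "sysC"],
        "id": ["apo1_A", "pred2_A"],
        "kind": ["apo", "pred"],
    }
)

# how="left" keeps sysB even though it has no link; its link columns are NaN.
merged = toy_systems.merge(
    toy_single_links,
    left_on="system_id",
    right_on="reference_system_id",
    how="left",
)
merged["system_has_linked_apo_or_pred"] = merged["kind"].notna()
print(merged[["system_id", "split", "kind", "system_has_linked_apo_or_pred"]])
```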

##### Selecting final dataset
Let's select only the subset that has linked structures for flexible docking.

In [14]:
plindex_final_df = plindex_split_1p1l_links[
    (plindex_split_1p1l_links.system_has_linked_apo_or_pred)
]
plindex_final_df.groupby(["split", "system_has_linked_apo_or_pred"]).system_id.count()

split  system_has_linked_apo_or_pred
train  True                             105713
val    True                                370
Name: system_id, dtype: int64

#### Using _PLINDER_ API to load dataset by split <a class="anchor" id="api-load-split"></a>

More to come here after revamping the data loader code in `plinder`

:::{currentmodule} plinder.core
:::

:::{note} If the files are not already available, this downloads them to the `~/.local/share/plinder/{PLINDER_RELEASE}/{PLINDER_ITERATION}` directory.
:::

#### Using _PLINDER_ clusters in sampling <a class="anchor" id="cluster-sampling"></a>

##### Define diversity sampler function
:::{currentmodule} plinder.core
:::


Here, we provide an example of how one might use the function `get_diversity_samples`, which is based on `torch.utils.data.WeightedRandomSampler`.

NOTE: This example function is provided for demonstration purposes, and users are encouraged to come up with a sampling strategy that suits their needs. <br>

In general, diversity can be sampled using cluster information described [here](https://plinder-org.github.io/plinder/dataset.html#clusters-clusters).
All cluster information can easily be added to `plindex`. <br>

In the example below, we sample based on the following cluster label:
`pli_qcov__70__community`.

The returned DataFrame could then be passed to {func}`get_model_input_files` the same way `plindex_final_df` was used above.

In [17]:
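def get_diversity_samples(
    plindex,
    cluster_column: str = "pli_qcov__70__community",
):
    from torch.utils.data import WeightedRandomSampler

    # number of index entries in each cluster
    cluster_counts = plindex[cluster_column].value_counts().rename("cluster_count")
    # attach the cluster size to every entry; reset the index so labels match row positions
    plindex = plindex.merge(cluster_counts, left_on=cluster_column, right_index=True).reset_index(drop=True)
    # weight each entry by the inverse of its cluster size
    cluster_weights = 1.0 / plindex.cluster_count.values
    # draw as many samples as there are entries (with replacement, the sampler default)
    sampler = WeightedRandomSampler(
        weights=cluster_weights, num_samples=len(cluster_weights)
    )
    sampler_index = list(sampler)
    # sampling with replacement yields repeats, so keep each sampled system once
    return plindex.loc[sampler_index, ["system_id", "split"]].drop_duplicates()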

cluster_column = "pli_qcov__70__community"
plindex_clusters = query_index(columns=["system_id", cluster_column]).drop(columns=["split"])
plindex_with_clusters = plindex_final_df.merge(plindex_clusters, on="system_id", how="left")
sampled_df = get_diversity_samples(plindex_with_clusters, cluster_column)

In [18]:
sampled_df.system_id.nunique(), plindex.system_id.nunique()

(28509, 309972)
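
For readers who want to experiment with the same idea without `torch`, the sketch below uses pandas `DataFrame.sample` with inverse cluster size weights instead of `WeightedRandomSampler`. This is a swapped-in technique on a toy table, not the function defined above; the IDs, cluster labels and column names are made up for illustration.

```python
import pandas as pd

# Toy table: one large cluster (8 systems) and one small cluster (2 systems).
toy = pd.DataFrame(
    {
        "system_id": [f"sys{i}" for i in range(10)],
        "cluster": ["c1"] * 8 + ["c2"] * 2,
    }
)

# Weight each entry by the inverse of its cluster size.
cluster_size = toy.groupby("cluster")["system_id"].transform("count")
toy["weight"] = 1.0 / cluster_size

# Each cluster now carries the same total weight, so both clusters are
# equally likely to be drawn from, regardless of how many members they have.
weight_per_cluster = toy.groupby("cluster")["weight"].sum()

# Draw with replacement, then keep each sampled system once.
sampled = toy.sample(n=len(toy), replace=True, weights="weight", random_state=0)
unique_sampled = sampled.drop_duplicates("system_id")

print(weight_per_cluster.to_dict())
print(unique_sampled["cluster"].value_counts().to_dict())
```

Because every cluster has the same total weight, members of small clusters are drawn more often individually, which is the behaviour diversity subsampling aims for.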