# Python API tutorial

## Installation

### Getting Plinder

Because some of its dependencies cannot be installed via `pip`, `plinder` is currently not
available on PyPI.
Instead, you can download the official
[_GitHub_ repository](https://github.com/plinder-org/plinder/),
for example via `git`.

```console
$ git clone https://github.com/plinder-org/plinder.git
```

### Creating the Conda environment

The most convenient way to install the aforementioned extra dependencies is a _Conda_
environment.
If you do not have _Conda_ installed yet, we recommend installing it via
[miniforge](https://github.com/conda-forge/miniforge).
Afterwards, the environment can be created from the `environment.yml` in the local
repository clone.

:::{note}
We currently only support a Linux environment.
`plinder` uses `openstructure` for some of its functionality.
`openstructure` is available from the `aivant` conda channel via
`conda install aivant::openstructure`, but it is only built for Linux architectures.
Windows and macOS users, please see the relevant
[_Docker_](#docker-target) resources.
:::

```console
$ mamba env create -f environment.yml
$ mamba activate plinder
```

### Installing plinder

Now `plinder` can be installed into the created environment:

```console
$ pip install .
```

(docker-target)=
### Alternative: Using a Docker container

We also publish the `plinder` project as a Docker container as an alternative to the
_Conda_-based installation, to ensure the highest level of compatibility with
non-Linux platforms.
See the following Docker resources in the repository for more details:

- `docker-compose.yml`: defines a `base` image, the `plinder` "app" and a `test`
  container
- `dockerfiles/base/`: contains the files for the `base` image
- `dockerfiles/main/`: contains the files for the `plinder` "app" image

## Overview

The user-facing subpackage of `plinder` is `plinder.core`. It provides the utility functions for accessing the dataset, splits and annotations. Its top-level functions are `get_config` (to access the global PLINDER configuration), `get_plindex` (to access the full annotation table), `get_manifest` (to map PLINDER systems to PDB IDs) and `get_split` (to access the full split table). In addition, it provides the data class `PlinderSystem` for reconstituting a PLINDER system from its `system_id`.

Furthermore, it has the following sub-packages:

- `plinder.core.loader`: Interface for loading the dataset via the loader class `plinder.core.loader.PlinderDataset`. This is an [atom3d](https://atom3d.readthedocs.io/en/latest/getting_started.html)-compliant loader, so the interface should be familiar to users with atom3d experience.
- `plinder.core.scores`: Functions for querying similarity (`query_ligand_similarity`, `cross_ligand_similarity`, `query_protein_similarity`, `cross_protein_similarity`), cluster identity (`query_clusters`), annotations (`query_index`) and apo/pred linkage (`query_links`).

## Example Usage

### Configure dataset environment variable
> We need to set environment variables to point to the release and iteration of choice. For the sake of demonstration, we point them to a smaller example dataset: `PLINDER_RELEASE=2024-04` and `PLINDER_ITERATION=tutorial`.

:::{note}
The version used for the preprint is `PLINDER_RELEASE=2024-04` and `PLINDER_ITERATION=v1`, while the current version with updated annotations, to be used for the MLSB challenge, is `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`. You can set these variables directly in a shell with `export PLINDER_RELEASE=2024-04 && export PLINDER_ITERATION=tutorial`, or from Python via `os.environ`.
:::


```python
import os
from pathlib import Path

# Release and iteration of the dataset to use
release = "2024-04"
iteration = "tutorial"
os.environ["PLINDER_RELEASE"] = release
os.environ["PLINDER_ITERATION"] = iteration
os.environ["PLINDER_REPO"] = str(Path.home() / "plinder-org/plinder")
os.environ["PLINDER_LOCAL_DIR"] = str(Path.home() / ".local/share/plinder")
version = f"{release}/{iteration}"
```

### Get config
Get the configuration to check that all parameters are set correctly. The snippet below prints the locations of the local and remote PLINDER directories. They should point to `~/.local/share/plinder/<release>/<iteration>` for the local cache and `gs://plinder/<release>/<iteration>` for the remote data directory.

```python
import plinder.core.utils.config

cfg = plinder.core.utils.config.get_config()
print(f"""
local cache directory: {cfg.data.plinder_dir}
remote data directory: {cfg.data.plinder_remote}
""")
```

```
local cache directory: /Users/yusuf/.local/share/plinder/2024-04/tutorial
remote data directory: gs://plinder/2024-04/tutorial
```
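As a quick sanity check, the expected locations can be derived from the release and iteration alone. The sketch below is illustrative and does not call `plinder`: the helper name `expected_plinder_paths` is hypothetical, and the `gs://plinder` prefix and the default cache root are simply taken from the description above.

```python
from __future__ import annotations

import os
from pathlib import Path


def expected_plinder_paths(
    release: str, iteration: str, local_root: str | None = None
) -> dict[str, str]:
    """Build the local cache and remote paths expected for a release/iteration.

    Hypothetical helper: it mirrors the layout described in this tutorial,
    not plinder's own path logic.
    """
    # Fall back to the default cache root used in this tutorial
    root = Path(local_root) if local_root else Path.home() / ".local/share/plinder"
    return {
        "local": str(root / release / iteration),
        "remote": f"gs://plinder/{release}/{iteration}",
    }


paths = expected_plinder_paths(
    os.environ.get("PLINDER_RELEASE", "2024-04"),
    os.environ.get("PLINDER_ITERATION", "tutorial"),
)
print(paths["local"])
print(paths["remote"])
```

If the values printed by `get_config()` differ from these, the environment variables were most likely set after the configuration was first read, or not set at all.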



### Query annotations table

**Query full dataset**

To get the full annotations table, run the code below. The columns of the table are described in [Annotation tables](https://plinder-org.github.io/plinder/dataset.html).

```python
from plinder.core import get_plindex

annotation_df = get_plindex()
annotation_df.head()
```

```
2024-08-21 19:18:57,298 | plinder.core.index.utils.get_plindex:24 | INFO : runtime succeeded: 0.00s
```

| | entry_pdb_id | entry_release_date | entry_oligomeric_state | entry_determination_method | entry_keywords | entry_pH | entry_resolution | entry_rfree | entry_r | entry_clashscore | ... | ligand_interacting_ligand_chains_Pfam | ligand_neighboring_ligand_chains_Pfam | ligand_interacting_ligand_chains_PANTHER | ligand_neighboring_ligand_chains_PANTHER | system_ligand_chains_SCOP2 | system_ligand_chains_SCOP2B | `protein_lddt_qcov_weighted_sum__100__strong__component` | `pli_qcov__100__strong__component` | system_ccd_codes | uniqueness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7115 | 5dax | 2015-08-20 | dimeric | X-RAY DIFFRACTION | OXIDOREDUCTASE | 8.0 | 1.7 | 0.1986 | 0.1807 | 1.54 | ... | | | | | | | c1381 | c301829 | NI_AKG_58L | 5dax_c1381_c301829_NI_AKG_58L |
| 7116 | 5dax | 2015-08-20 | dimeric | X-RAY DIFFRACTION | OXIDOREDUCTASE | 8.0 | 1.7 | 0.1986 | 0.1807 | 1.54 | ... | | | | | | | c1381 | c301829 | NI_AKG_58L | 5dax_c1381_c301829_NI_AKG_58L |
| 7117 | 5dax | 2015-08-20 | dimeric | X-RAY DIFFRACTION | OXIDOREDUCTASE | 8.0 | 1.7 | 0.1986 | 0.1807 | 1.54 | ... | | | | | | | c1381 | c301829 | NI_AKG_58L | 5dax_c1381_c301829_NI_AKG_58L |
| 7118 | 5dax | 2015-08-20 | dimeric | X-RAY DIFFRACTION | OXIDOREDUCTASE | 8.0 | 1.7 | 0.1986 | 0.1807 | 1.54 | ... | | | | | | | c1381 | c276429 | NI_AKG_58L | 5dax_c1381_c276429_NI_AKG_58L |
| 7119 | 5dax | 2015-08-20 | dimeric | X-RAY DIFFRACTION | OXIDOREDUCTASE | 8.0 | 1.7 | 0.1986 | 0.1807 | 1.54 | ... | | | | | | | c1381 | c276429 | NI_AKG_58L | 5dax_c1381_c276429_NI_AKG_58L |


**Query specific columns**

To query the annotations table for specific columns or to filter it by specific criteria, use the `plinder.core.scores.query_index` function. Called without any arguments, it returns a table of `system_id` and `entry_pdb_id`. To select other columns, pass the `columns` argument, a list of column names as described in [Annotation tables](https://plinder-org.github.io/plinder/dataset.html).

```python
from plinder.core.scores import query_index

# Get system_id and entry_pdb_id columns
query_index()
```

```
2024-08-21 19:27:12,948 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.28s
```

| | system_id | entry_pdb_id |
|---|---|---|
| 0 | `5dax__1__1.A__1.B_1.C_1.D` | 5dax |
| 1 | `5dax__1__1.A__1.B_1.C_1.D` | 5dax |
| 2 | `5dax__1__1.A__1.B_1.C_1.D` | 5dax |
| 3 | `5dax__1__2.A__2.B_2.C_2.D` | 5dax |
| 4 | `5dax__1__2.A__2.B_2.C_2.D` | 5dax |
| ... | ... | ... |
| 1552 | `4lps__1__1.A__1.C_1.D_1.E` | 4lps |
| 1553 | `4lps__1__1.A__1.C_1.D_1.E` | 4lps |
| 1554 | `4lps__2__1.B__1.M_1.N_1.O` | 4lps |
| 1555 | `4lps__2__1.B__1.M_1.N_1.O` | 4lps |


```python
# Get specific columns from the annotation table
cols_of_interest = [
    "entry_pdb_id",
    "entry_release_date",
    "entry_oligomeric_state",
    "system_ccd_codes",
    "entry_clashscore",
    "entry_resolution",
]
query_index(columns=cols_of_interest)
```

```
2024-08-21 19:27:15,662 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.33s
```

| | entry_pdb_id | entry_release_date | entry_oligomeric_state | system_ccd_codes | entry_clashscore | entry_resolution |
|---|---|---|---|---|---|---|
| 0 | 5dax | 2015-08-20 | dimeric | NI_AKG_58L | 1.54 | 1.7 |
| 1 | 5dax | 2015-08-20 | dimeric | NI_AKG_58L | 1.54 | 1.7 |
| 2 | 5dax | 2015-08-20 | dimeric | NI_AKG_58L | 1.54 | 1.7 |
| 3 | 5dax | 2015-08-20 | dimeric | NI_AKG_58L | 1.54 | 1.7 |
| 4 | 5dax | 2015-08-20 | dimeric | NI_AKG_58L | 1.54 | 1.7 |
| ... | ... | ... | ... | ... | ... | ... |
| 1552 | 4lps | 2013-07-16 | monomeric | MG_GDP_PO4 | 4.56 | 2.0 |
| 1553 | 4lps | 2013-07-16 | monomeric | MG_GDP_PO4 | 4.56 | 2.0 |
| 1554 | 4lps | 2013-07-16 | monomeric | MG_GDP_PO4 | 4.56 | 2.0 |
| 1555 | 4lps | 2013-07-16 | monomeric | MG_GDP_PO4 | 4.56 | 2.0 |


**Query annotations with specific filters**

We can also pass a `filters` argument, which is a list of tuples. The filter syntax is `[[(column, op, val), …], …]`, where `op` is one of `==`, `=`, `>`, `>=`, `<`, `<=`, `!=`, `in` or `not in`. The innermost tuples are combined into a set of filters applied through an AND operation. See the [`pandas.read_parquet` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html) for more information on the filter syntax.

```python
# Filter by specific criteria
filters = [("entry_clashscore", ">", "2.0"), ("entry_resolution", "==", "1.5")]
query_index(columns=cols_of_interest, filters=filters)
```

```
2024-08-21 19:25:01,093 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.29s
```

| | entry_pdb_id | entry_release_date | entry_oligomeric_state | system_ccd_codes | entry_clashscore | entry_resolution |
|---|---|---|---|---|---|---|
| 0 | 3vm1 | 2011-12-05 | monomeric | SRM_SF4_BCT | 5.07 | 1.5 |
| 1 | 3vm1 | 2011-12-05 | monomeric | SRM_SF4_BCT | 5.07 | 1.5 |
| 2 | 3vm1 | 2011-12-05 | monomeric | SRM_SF4_BCT | 5.07 | 1.5 |
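To make the AND semantics concrete, the following sketch applies a flat list of `(column, op, val)` tuples to plain Python rows. It only illustrates how such filters combine and is not how `query_index` is implemented: `apply_filters` and the toy rows are hypothetical, and the values are numeric rather than strings.

```python
import operator

# Map filter operator strings to Python comparison functions
OPS = {
    "==": operator.eq,
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def apply_filters(rows, filters):
    """Keep the rows that satisfy every (column, op, val) tuple (logical AND)."""
    return [
        row
        for row in rows
        if all(OPS[op](row[column], val) for column, op, val in filters)
    ]


# Toy rows with made-up values, for illustration only
rows = [
    {"entry_pdb_id": "aaaa", "entry_clashscore": 5.0, "entry_resolution": 1.5},
    {"entry_pdb_id": "bbbb", "entry_clashscore": 1.0, "entry_resolution": 1.5},
    {"entry_pdb_id": "cccc", "entry_clashscore": 6.0, "entry_resolution": 2.0},
]
filters = [("entry_clashscore", ">", 2.0), ("entry_resolution", "==", 1.5)]
print([row["entry_pdb_id"] for row in apply_filters(rows, filters)])  # ['aaaa']
```

Only the first row passes, because it is the only one that satisfies both conditions at once.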


### Inspect manifest table
The manifest table maps each `system_id` to its respective PDB entry.

```python
from plinder.core import get_manifest

get_manifest()
```

| | system_id | entry_pdb_id |
|---|---|---|
| 0 | `5dax__1__1.A__1.B_1.C_1.D` | 5dax |
| 1 | `5dax__1__1.A__1.B_1.C_1.D` | 5dax |
| 2 | `5dax__1__1.A__1.B_1.C_1.D` | 5dax |
| 3 | `5dax__1__2.A__2.B_2.C_2.D` | 5dax |
| 4 | `5dax__1__2.A__2.B_2.C_2.D` | 5dax |
| ... | ... | ... |
| 1552 | `4lps__1__1.A__1.C_1.D_1.E` | 4lps |
| 1553 | `4lps__1__1.A__1.C_1.D_1.E` | 4lps |
| 1554 | `4lps__2__1.B__1.M_1.N_1.O` | 4lps |
| 1555 | `4lps__2__1.B__1.M_1.N_1.O` | 4lps |
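The system IDs shown above appear to follow a `<pdb id>__<biounit>__<protein chains>__<ligand chains>` pattern, with the chains inside each group separated by single underscores. Assuming that layout holds, a small parser can recover the PDB entry without a table lookup. The function `parse_system_id` is a hypothetical helper written for this example, not part of `plinder`.

```python
def parse_system_id(system_id: str) -> dict:
    """Split a system ID into its four double-underscore separated parts.

    Assumes the layout <pdb_id>__<biounit>__<protein chains>__<ligand chains>,
    inferred from the IDs shown in this tutorial.
    """
    parts = system_id.split("__")
    if len(parts) != 4:
        raise ValueError(f"unexpected system_id format: {system_id!r}")
    pdb_id, biounit, protein_chains, ligand_chains = parts
    return {
        "entry_pdb_id": pdb_id,
        "biounit": biounit,
        "protein_chains": protein_chains.split("_"),
        "ligand_chains": ligand_chains.split("_"),
    }


parsed = parse_system_id("5dax__1__1.A__1.B_1.C_1.D")
print(parsed["entry_pdb_id"])   # 5dax
print(parsed["ligand_chains"])  # ['1.B', '1.C', '1.D']
```

The manifest remains the authoritative mapping; a parser like this is only a convenience for quick inspection.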


### Query protein similarity
There are three kinds of similarity datasets that we provide:

- Similarity between ligand-bound structures (`holo`)
- Similarity between ligand-bound and unbound protein structures (`apo`)
- Similarity between ligand-bound and AlphaFold-predicted structures (`pred`)

Any of these can be selected with the `search_db` argument of the function `plinder.core.scores.query_protein_similarity`.

```python
from plinder.core.scores import query_protein_similarity

query_protein_similarity(
    search_db="apo",
    filters=[("similarity", ">", "50")]
)
```

```
2024-08-21 17:12:57,399 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.30s
2024-08-21 17:12:57,580 | plinder.core.scores.protein.query_protein_similarity:24 | INFO : runtime succeeded: 3.08s
```

| | query_system | target_system | protein_mapping | mapping | protein_mapper | source | metric | similarity |
|---|---|---|---|---|---|---|---|---|
| 0 | `3to9__2__2.A__2.B` | 5j9t_A | 2.A:0.A | | foldseek | foldseek | protein_lddt_weighted_sum | 86 |
| 1 | `3to9__2__2.A__2.B` | 5j9t_A | 2.A:0.A | 2.A:0.A | foldseek | foldseek | protein_lddt_weighted_max | 86 |
| 2 | `3to9__2__2.A__2.B` | 5j9t_A | 2.A:0.A | 2.A:0.A | foldseek | foldseek | protein_lddt_max | 86 |
| 3 | `3to9__2__2.A__2.B` | 5j9t_A | 2.A:0.A | 2.A:0.A | foldseek | foldseek | protein_lddt_qcov_weighted_sum | 86 |
| 4 | `3to9__2__2.A__2.B` | 5j9t_A | 2.A:0.A | 2.A:0.A | foldseek | foldseek | protein_lddt_qcov_weighted_max | 86 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 133 | `3to9__2__2.A__2.B` | 5j9u_E | 2.A:0.E | | foldseek | both | protein_seqsim_qcov_weighted_sum | 100 |
| 134 | `3to9__2__2.A__2.B` | 5j9u_E | 2.A:0.E | 2.A:0.E | foldseek | both | protein_seqsim_qcov_weighted_max | 100 |
| 135 | `3to9__2__2.A__2.B` | 5j9u_E | 2.A:0.E | 2.A:0.E | foldseek | both | protein_seqsim_qcov_max | 100 |
| 136 | `3to9__2__2.A__2.B` | 5j9u_E | 2.A:0.E | | foldseek | foldseek | pocket_lddt | 87 |
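The result holds one row per metric for each query/target pair, so a common follow-up is to keep a single metric and rank the targets by it. The sketch below does this on plain dictionaries with made-up scores; `rank_targets` is a hypothetical helper, and with the real DataFrame the same selection could be done with ordinary pandas filtering and sorting.

```python
def rank_targets(rows, metric):
    """Return (target_system, similarity) pairs for one metric, best first."""
    best = {}
    for row in rows:
        if row["metric"] != metric:
            continue
        target = row["target_system"]
        # Keep the highest similarity seen for each target
        best[target] = max(best.get(target, row["similarity"]), row["similarity"])
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy rows mirroring three columns of the table above; scores are made up
rows = [
    {"target_system": "t1_A", "metric": "pocket_lddt", "similarity": 70},
    {"target_system": "t2_A", "metric": "pocket_lddt", "similarity": 90},
    {"target_system": "t1_A", "metric": "protein_lddt_max", "similarity": 99},
]
print(rank_targets(rows, "pocket_lddt"))  # [('t2_A', 90), ('t1_A', 70)]
```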


### Load PLINDER system data object from system_id
To reconstitute PLINDER systems directly from a set of system IDs, run the snippet below. This gives you access to the `PlinderSystem` data objects.

```python
from plinder.core.system.utils import load_systems

load_systems(
    system_ids=["5dax__1__1.A__1.B_1.C_1.D", "3to9__2__2.A__2.B"]
)
```

```
{'5dax__1__1.A__1.B_1.C_1.D': PlinderSystem(system_id=5dax__1__1.A__1.B_1.C_1.D),
 '3to9__2__2.A__2.B': PlinderSystem(system_id=3to9__2__2.A__2.B)}
```

### Load split data
To get the splits, run the snippet below. You will get a table with the following columns:

- `system_id`: PLINDER system ID
- `split`: split category, one of `train`, `val`, `test` or `removed`
- `cluster`: cluster used for test set de-leaking
- `cluster_for_val_split`: cluster used for validation set sampling

```python
from plinder.core import get_split

split_df = get_split()
split_df
```

```
2024-08-22 02:05:32,053 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.00s
```

| | system_id | split | cluster | cluster_for_val_split |
|---|---|---|---|---|
| 1235 | `8dat__1__1.A_1.B__1.L` | train | c2 | c0 |
| 1236 | `8dat__1__1.A__1.M` | train | c2 | c0 |
| 1237 | `8dat__1__1.B__1.N` | train | c2 | c0 |
| 1238 | `8dat__1__1.B__1.O` | removed | c2 | c0 |
| 1239 | `8dat__1__1.C__1.P` | train | c2 | c0 |
| ... | ... | ... | ... | ... |
| 435619 | `3lpp__4__1.D__1.I` | removed | c8113 | c3320 |
| 435620 | `3lpp__4__1.D__1.X` | removed | c158828 | c139 |
| 435621 | `5lps__1__1.A__1.B` | train | c88235 | c158 |
| 435622 | `5lps__1__2.A__2.B` | train | c88234 | c158 |
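Since the `cluster` column is described as the one used for test set de-leaking, one simple way to inspect a split table is to look for clusters that occur in two different splits. The following is a minimal sketch on plain tuples with made-up IDs; `shared_clusters` is a hypothetical helper and not part of `plinder`.

```python
from collections import defaultdict


def shared_clusters(rows, split_a="train", split_b="test"):
    """Return the clusters that occur in both splits (empty set means no overlap)."""
    clusters_by_split = defaultdict(set)
    for system_id, split, cluster in rows:
        clusters_by_split[split].add(cluster)
    return clusters_by_split[split_a] & clusters_by_split[split_b]


# Toy (system_id, split, cluster) rows; IDs and clusters are made up
rows = [
    ("sys1", "train", "c2"),
    ("sys2", "train", "c5"),
    ("sys3", "test", "c9"),
    ("sys4", "removed", "c2"),
]
print(shared_clusters(rows))  # set()
```

With the real table, the three columns could be passed in as tuples, for example via `split_df[["system_id", "split", "cluster"]].itertuples(index=False)`.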



### Query apo/pred links
To query the links between PLINDER systems and their apo and predicted (`pred`) structures, use `plinder.core.scores.query_links`.

```python
from plinder.core.scores import query_links

query_links()
```

```
2024-08-21 17:44:22,879 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-08-21 17:44:27,821 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 8.11s
```

| | reference_system_id | id | pocket_fident | pocket_lddt | protein_fident_qcov_weighted_sum | protein_fident_weighted_sum | protein_lddt_weighted_sum | target_id | sort_score | receptor_file | ... | posebusters_volume_overlap_with_inorganic_cofactors | posebusters_volume_overlap_with_waters | fraction_reference_proteins_mapped | fraction_model_proteins_mapped | lddt | bb_lddt | per_chain_lddt_ave | per_chain_bb_lddt_ave | filename | kind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `6pl9__1__1.A__1.C` | 2vb1_A | 100.0 | 86.0 | 100.0 | 100.0 | 96.0 | 2vb1 | 0.65 | `/plinder/2024-06/assignments/apo/6pl9__1__1.A_...` | ... | True | True | 1.0 | 1.0 | 0.903772 | 0.968844 | 0.890822 | 0.959674 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | apo |
| 1 | `6ahh__1__1.A__1.G` | 2vb1_A | 100.0 | 98.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | `/plinder/2024-06/assignments/apo/6ahh__1__1.A_...` | ... | True | True | 1.0 | 1.0 | 0.894349 | 0.962846 | 0.883217 | 0.954721 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | apo |
| 2 | `5b59__1__1.A__1.B` | 2vb1_A | 100.0 | 91.0 | 100.0 | 100.0 | 96.0 | 2vb1 | 0.65 | `/plinder/2024-06/assignments/apo/5b59__1__1.A_...` | ... | True | True | 1.0 | 1.0 | 0.903266 | 0.962318 | 0.890656 | 0.955258 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | apo |
| 3 | `3ato__1__1.A__1.B` | 2vb1_A | 100.0 | 99.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | `/plinder/2024-06/assignments/apo/3ato__1__1.A_...` | ... | True | True | 1.0 | 1.0 | 0.890530 | 0.954696 | 0.879496 | 0.946326 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | apo |
| 4 | `6mx9__1__1.A__1.K` | 2vb1_A | 100.0 | 98.0 | 100.0 | 100.0 | 95.0 | 2vb1 | 0.65 | `/plinder/2024-06/assignments/apo/6mx9__1__1.A_...` | ... | True | True | 1.0 | 1.0 | 0.904116 | 0.964309 | 0.892434 | 0.955853 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | apo |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 597774 | `6x3q__1__1.A__1.B` | A8AWU7_A | 100.0 | 79.0 | 99.0 | 99.0 | 88.0 | A8AWU7 | 38.90 | `/plinder/2024-06/assignments/pred/6x3q__1__1.A...` | ... | True | True | 1.0 | 1.0 | 0.815736 | 0.877814 | 0.806444 | 0.871054 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | pred |
| 597775 | `8st5__1__1.A__1.B` | A8AWU7_A | 100.0 | 95.0 | 99.0 | 99.0 | 88.0 | A8AWU7 | 38.90 | `/plinder/2024-06/assignments/pred/8st5__1__1.A...` | ... | True | True | 1.0 | 1.0 | 0.814876 | 0.885938 | 0.814176 | 0.881858 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | pred |
| 597776 | `6efd__1__1.A__1.B` | A8AWU7_A | 100.0 | 81.0 | 99.0 | 99.0 | 87.0 | A8AWU7 | 38.90 | `/plinder/2024-06/assignments/pred/6efd__1__1.A...` | ... | True | True | 1.0 | 1.0 | 0.814404 | 0.879823 | 0.810680 | 0.872417 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | pred |
| 597777 | `8st6__1__1.A__1.D` | A8AWU7_A | 100.0 | 80.0 | 99.0 | 99.0 | 88.0 | A8AWU7 | 38.90 | `/plinder/2024-06/assignments/pred/8st6__1__1.A...` | ... | True | True | 1.0 | 1.0 | 0.816566 | 0.884372 | 0.813010 | 0.877505 | `/Users/yusuf/.local/share/plinder/2024-04/tuto...` | pred |


### Load dataset for training ML model
To instantiate the loader for training a machine learning model, we use the class `plinder.core.loader.PlinderDataset`. The class has the following signature:
```
class PlinderDataset(Dataset):  # type: ignore
    """
    Creates a dataset from plinder systems

    Parameters
    ----------
    df : pd.DataFrame | None
        the split to use
    split : str
        the split to sample from
    split_parquet_path: str | Path | None = None,
        path to split parquet to use
    file_with_system_ids : str | Path
        path to a file containing a list of system ids (default: full index)
    store_file_path : bool, default=True
        if True, include the file path of the source structures in the dataset
    load_alternative_structures : bool, default=False
        if True, include alternative structures in the dataset
    num_alternative_structures : int, default=1
        number of alternative structures (apo and pred) to include
    """

    def __init__(
        self,
        df: pd.DataFrame | None = None,
        split: str = "train",
        split_parquet_path: str | Path | None = None,
        store_file_path: bool = True,
        load_alternative_structures: bool = False,
        num_alternative_structures: int = 1,
    )
```
:::{note}
If both `df` and `split_parquet_path` are set to `None`, the class automatically invokes `get_split()`.
:::


#### Load training data
The PLINDER loader was specifically designed to be compatible with [atom3d](https://atom3d.readthedocs.io/en/latest/getting_started.html) loaders.

```python
from plinder.core.loader import PlinderDataset

train_data = PlinderDataset(
    df=None,
    split="train",
    split_parquet_path=None,
    store_file_path=True,
    load_alternative_structures=False,
    num_alternative_structures=1,
)

print(len(train_data))  # Print the number of entries
print(train_data[0].keys())  # Print the keys stored for the first entry
```

```
2024-08-22 02:27:09,677 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.00s
```
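The two `print` calls above rely on the map-style dataset protocol: `len()` needs `__len__`, and integer indexing needs `__getitem__`, which here returns a dictionary. The toy class below sketches that protocol in isolation. It is not the implementation of `PlinderDataset`; the class name `ToySystemDataset` and the single `"id"` key are made up for illustration.

```python
class ToySystemDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, system_ids):
        self._system_ids = list(system_ids)

    def __len__(self):
        return len(self._system_ids)

    def __getitem__(self, index):
        # Each item is a dictionary; the key used here is illustrative only
        return {"id": self._system_ids[index]}


data = ToySystemDataset(["5dax__1__1.A__1.B_1.C_1.D", "3to9__2__2.A__2.B"])
print(len(data))       # 2
print(data[0].keys())  # dict_keys(['id'])
```

Any object with these two methods can be iterated by index in the same way, which is what makes such loaders interchangeable in training code.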
