# 1. Configure dataset environment variables
> Set these variables to point to the release and iteration of your choice. For the sake of demonstration, they are set here to a smaller toy example dataset: `PLINDER_RELEASE=2024-04` and `PLINDER_ITERATION=toy`.

> NOTE: the version used for the preprint is `PLINDER_RELEASE=2024-04` and `PLINDER_ITERATION=v1`, while the current version, with updated annotations to be used for the MLSB challenge, is `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`.


```bash
export PLINDER_RELEASE=2024-04 # Release used in this demo
export PLINDER_ITERATION=toy # Iteration used in this demo
```

In [5]:
from __future__ import annotations
import os
from pathlib import Path

release = "2024-04"
iteration = "toy"
os.environ["PLINDER_RELEASE"] = release
os.environ["PLINDER_ITERATION"] = iteration
os.environ["PLINDER_REPO"] = str(Path.home() / "plinder-org/plinder")
os.environ["PLINDER_LOCAL_DIR"] = str(Path.home() / ".local/share/plinder")
version = f"{release}/{iteration}"
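
The variables set above determine where the data ends up locally. As a minimal sketch, the path can be derived from the environment and checked before anything is downloaded. The helper name `local_data_dir` is hypothetical and not part of `plinder`; it assumes the `<PLINDER_LOCAL_DIR>/<release>/<iteration>` layout that appears in the download logs of this notebook.

```python
import os
from pathlib import Path


def local_data_dir(env=None):
    """Return <PLINDER_LOCAL_DIR>/<PLINDER_RELEASE>/<PLINDER_ITERATION>.

    Hypothetical helper (not a plinder API). Raises KeyError naming
    every required variable that is unset or empty.
    """
    env = os.environ if env is None else env
    required = ("PLINDER_LOCAL_DIR", "PLINDER_RELEASE", "PLINDER_ITERATION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"unset environment variables: {', '.join(missing)}")
    return (
        Path(env["PLINDER_LOCAL_DIR"])
        / env["PLINDER_RELEASE"]
        / env["PLINDER_ITERATION"]
    )


# Example with an explicit mapping instead of the real environment.
example_env = {
    "PLINDER_LOCAL_DIR": "/tmp/plinder",
    "PLINDER_RELEASE": "2024-04",
    "PLINDER_ITERATION": "toy",
}
print(local_data_dir(example_env))
```

Passing a mapping explicitly keeps the helper easy to test; calling `local_data_dir()` with no argument reads the real environment.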

# 2. Download dataset
> You can download the full contents of the directory with the command below.

In [4]:
# Download systems and splits
! time python ${PLINDER_REPO}/scripts/download_plinder.py --bucket_name plinder-collab-bucket --release 2024-04 --iteration toy  --unpack

bucket_name: plinder-collab-bucket release: 2024-04 iteration: toy specific_dirs: [] unpack: True skip_download: False
100%|█████████████████████████████████████████| 299/299 [01:32<00:00,  3.25it/s]
2024-08-15 01:52:55,024 | plinder.core.utils.gcs.download_many:24 | INFO : runtime succeeded: 92.65s
Extracting /Users/yusuf/.local/share/plinder/2024-04/toy/entries/da.zip: 100%|█|
Extracting /Users/yusuf/.local/share/plinder/2024-04/toy/entries/lp.zip: 100%|█|
Extracting /Users/yusuf/.local/share/plinder/2024-04/toy/entries/to.zip: 100%|█|
Extracting /Users/yusuf/.local/share/plinder/2024-04/toy/systems/da.zip: 100%|█|
Extracting /Users/yusuf/.local/share/plinder/2024-04/toy/systems/lp.zip: 100%|█|
Extracting /Users/yusuf/.local/share/plinder/2024-04/toy/systems/to.zip: 100%|█|
python ${PLINDER_REPO}/scripts/download_plinder.py --bucket_name  --release    346.58s user 125.66s system 448% cpu 1:45.38 total
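
The command above hard-codes the release and iteration, which can drift from the environment variables set in step 1. The following sketch builds the same argument list from the environment instead. The function name `build_download_command` is hypothetical; the script path and the flags (`--bucket_name`, `--release`, `--iteration`, `--unpack`) are only the ones that appear in the command above.

```python
import os
import sys


def build_download_command(env=None, bucket_name="plinder-collab-bucket", unpack=True):
    """Assemble the download_plinder.py invocation from environment variables.

    Hypothetical helper: it only builds the argument list and does not
    start the download.
    """
    env = os.environ if env is None else env
    script = os.path.join(env["PLINDER_REPO"], "scripts", "download_plinder.py")
    cmd = [
        sys.executable,
        script,
        "--bucket_name", bucket_name,
        "--release", env["PLINDER_RELEASE"],
        "--iteration", env["PLINDER_ITERATION"],
    ]
    if unpack:
        cmd.append("--unpack")
    return cmd


demo_env = {
    "PLINDER_REPO": "plinder",
    "PLINDER_RELEASE": "2024-04",
    "PLINDER_ITERATION": "toy",
}
print(" ".join(build_download_command(demo_env)))
```

To actually run the download, the list could be passed to `subprocess.run(build_download_command(), check=True)`.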


# 3. Inspect the content


```
2024-04/                     # The "`plinder` release" (`PLINDER_RELEASE`)
|-- toy                      # The "`plinder` iteration" (`PLINDER_ITERATION`)
|   |-- systems              # Actual structure files for all systems (split by `two_char_code` and zipped)
|   |-- splits               # List of system ids in a .parquet file for each split, plus the configs used to generate them (if available)
|   |-- clusters             # Pre-calculated cluster labels derived from the protein similarity dataset
|   |-- entries              # Raw annotations prior to consolidation (split by `two_char_code` and zipped)
|   |-- fingerprints         # Index mapping files for the ligand similarity dataset
|   |-- index                # Consolidated tabular annotations
|   |-- leakage              # Leakage results
|   |-- ligand_scores        # Ligand similarity parquet dataset
|   |-- ligands              # Ligand data expanded from entries for computing similarity
|   |-- linked_structures    # Linked structures
|   |-- mmp                  # Matched Molecular Series/Pair data
|   |-- scores               # Extended protein similarity parquet dataset
```
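
After a download it can be useful to see which of the directories listed above are actually present, since a given iteration (the toy one included) may ship only a subset. A minimal sketch, with the hypothetical helper `missing_subdirs` and the directory names copied from the layout above:

```python
from pathlib import Path

# Directory names taken from the layout shown above.
EXPECTED_SUBDIRS = (
    "systems", "splits", "clusters", "entries", "fingerprints", "index",
    "leakage", "ligand_scores", "ligands", "linked_structures", "mmp", "scores",
)


def missing_subdirs(root, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectory names that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]
```

For example, `missing_subdirs(Path.home() / ".local/share/plinder/2024-04/toy")` would list whatever was not downloaded or unpacked for the toy iteration.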


In [11]:
! tree ${PLINDER_LOCAL_DIR}/${PLINDER_RELEASE}/${PLINDER_ITERATION}/systems | head

/Users/yusuf/.local/share/plinder/2024-04/toy/systems
├── 1daa__1__1.A__1.C
│   ├── chain_mapping.json
│   ├── ligand_files
│   │   └── 1.C.sdf
│   ├── receptor.cif
│   ├── receptor.pdb
│   ├── sequences.fasta
│   ├── system.cif
│   ├── system.pdb
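
Each system directory is named after its system id, such as `1daa__1__1.A__1.C`. The sketch below splits such an id into its parts. It rests on two assumptions that are inferred from the outputs in this notebook rather than confirmed by it: that the four `__`-separated fields are the PDB id, the biounit, the receptor chain(s) and the ligand chain(s), with multiple chains joined by a single `_`; and that `two_char_code` is the middle two characters of the PDB id (`1daa` goes with `da.zip`, `3to9` with `to.zip`). The names `SystemId` and `parse_system_id` are hypothetical.

```python
from typing import NamedTuple


class SystemId(NamedTuple):
    pdb_id: str
    biounit: str
    receptor_chains: tuple
    ligand_chains: tuple

    @property
    def two_char_code(self):
        # Assumption: the zip archives are keyed by the middle two
        # characters of the PDB id (e.g. "1daa" -> "da").
        return self.pdb_id[1:3]


def parse_system_id(system_id):
    """Split an id like '1daa__1__1.A__1.C' into its four assumed fields."""
    parts = system_id.split("__")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '__'-separated fields, got {len(parts)}: {system_id!r}")
    pdb_id, biounit, receptors, ligands = parts
    return SystemId(pdb_id, biounit, tuple(receptors.split("_")), tuple(ligands.split("_")))


parsed = parse_system_id("1daa__1__1.A__1.C")
print(parsed.two_char_code, parsed.receptor_chains, parsed.ligand_chains)
```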


In [13]:
! tree ${PLINDER_LOCAL_DIR}/${PLINDER_RELEASE}/${PLINDER_ITERATION}/splits

/Users/yusuf/.local/share/plinder/2024-04/toy/splits
├── plinder-pl50.parquet
└── plinder-pl50.yaml

0 directories, 2 files


# 4. Query similarity dataset

In [18]:
import plinder.core.utils.config
from plinder.core.scores.query import make_query_no_schema
from plinder.core.scores.protein import query_protein_similarity
from plinder.core.scores import query_ligand_similarity
from plinder.core.utils import gcs, cpl


cfg = plinder.core.utils.config.get_config()
print(f"""
local cache directory: {cfg.data.plinder_dir}
remote data directory: {cfg.data.plinder_remote}
""")
data_dir = Path(cfg.data.plinder_dir)


local cache directory: /Users/yusuf/.local/share/plinder/2024-04/toy
remote data directory: gs://plinder/2024-04/toy



In [20]:
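def get_specific_protein_similarity(search_db, columns, filters, metric):
    """Query protein similarity and keep only rows for the given metric(s)."""
    # NOTE: `columns` is accepted but not forwarded to the query yet.
    prot_sim_df = query_protein_similarity(
        search_db=search_db,
        # columns=columns,
        filters=filters,
    )
    # Series.isin requires a list-like, so wrap a single metric name in a list.
    metrics = [metric] if isinstance(metric, str) else list(metric)
    return prot_sim_df[prot_sim_df.metric.isin(metrics)]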

In [21]:
get_specific_protein_similarity("apo", None, [("similarity", ">", "50")], "pocket_lddt")

2024-08-14 20:41:55,369 | plinder.core.utils.cpl.ping:24 | INFO : runtime succeeded: 1.06s
2024-08-14 20:41:55,657 | plinder.core.scores.protein.query_protein_similarity:24 | INFO : runtime succeeded: 3.04s


Unnamed: 0,query_system,target_system,protein_mapping,mapping,protein_mapper,source,metric,similarity
21,3to9__2__2.A__2.B,5j9t_A,2.A:0.A,,foldseek,foldseek,pocket_lddt,88
44,3to9__2__2.A__2.B,5j9t_E,2.A:0.E,,foldseek,foldseek,pocket_lddt,88
67,3to9__2__2.A__2.B,5j9t_I,2.A:0.I,,foldseek,foldseek,pocket_lddt,86
90,3to9__2__2.A__2.B,5j9u_A,2.A:0.A,,foldseek,foldseek,pocket_lddt,87
113,3to9__2__2.A__2.B,5j9u_D,2.A:0.D,,foldseek,foldseek,pocket_lddt,87
136,3to9__2__2.A__2.B,5j9u_E,2.A:0.E,,foldseek,foldseek,pocket_lddt,87


In [25]:
query_ligand_similarity(
    filters=[("tanimoto_similarity_max", ">", "50")]
)

2024-08-14 20:57:53,809 | plinder.core.utils.cpl.ping:24 | INFO : runtime succeeded: 0.30s
2024-08-14 20:57:54,091 | plinder.core.scores.ligand.query_ligand_similarity:24 | INFO : runtime succeeded: 1.87s


Unnamed: 0,query_ligand_id,target_ligand_id,tanimoto_similarity_max
0,19987,19987,100
1,19987,12591,67
2,19987,8937,67
3,19987,19079,64
4,19987,30879,56
...,...,...,...
2239,4274,40444,51
2240,4274,34909,51
2241,4274,8435,51
2242,4274,44573,51
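
The result above includes each ligand's match against itself (row 0, similarity 100), which is rarely of interest. As a rough sketch of one way to post-process such a result, the hypothetical helper `top_hits` below drops self-matches and keeps the `k` most similar targets per query ligand. It works on plain tuples copied from the first rows of the output above; on the returned DataFrame the same idea could be expressed with a filter followed by a group-by.

```python
from collections import defaultdict

# (query_ligand_id, target_ligand_id, tanimoto_similarity_max),
# copied from the first rows of the output above.
rows = [
    (19987, 19987, 100),
    (19987, 12591, 67),
    (19987, 8937, 67),
    (19987, 19079, 64),
    (19987, 30879, 56),
]


def top_hits(rows, k=3):
    """Return the k most similar targets per query, ignoring self-matches."""
    hits = defaultdict(list)
    for query, target, score in rows:
        if query != target:  # skip the trivial self-match
            hits[query].append((target, score))
    # sorted() is stable, so ties keep their original order.
    return {
        query: sorted(targets, key=lambda hit: hit[1], reverse=True)[:k]
        for query, targets in hits.items()
    }


print(top_hits(rows, k=2))
```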
