<div class="admonition abstract highlight">
    <p class="admonition-title">In short</p>
    <p>This tutorial shows how to create a dataset from PDB files and store it in the Zarr format.</p>
</div>

<div class="admonition abstract example">
    <p class="admonition-title">This feature is still very new.</p>
    <p>The features we will show in this tutorial are still experimental. We would love to learn from the community how we can make it easier to create datasets.</p>
</div>

### Dummy PDB example

In [1]:
import platformdirs

import datamol as dm

from polaris.dataset import DatasetFactory
from polaris.dataset.converters import PDBConverter

SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname="polaris-tutorials"), "dataset_pdb")

### Fetch PDB files from RCSB PDB

In [None]:
import biotite.database.rcsb as rcsb

pdb_path = rcsb.fetch("6s89", "pdb", SAVE_DIR)
print(pdb_path)

### Create dataset from PDB file

In [14]:
save_dst = dm.fs.join(SAVE_DIR, "tutorial_pdb.zarr")

factory = DatasetFactory(zarr_root_path=save_dst)
factory.reset(save_dst)

factory.register_converter("pdb", PDBConverter(pdb_column="pdb"))
factory.add_from_file(pdb_path)

# Build the dataset
dataset = factory.build()
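
To make the register-then-add pattern above more concrete, here is a minimal, standard-library-only sketch of a factory that picks a converter by file extension. It is not how Polaris implements `DatasetFactory`; `MiniFactory` and `pdb_pointer_converter` are hypothetical names used only for illustration, and the file paths are placeholders that are never opened.

```python
import os


class MiniFactory:
    """Hypothetical stand-in: dispatches each file to a converter by extension."""

    def __init__(self):
        self._converters = {}
        self.rows = []

    def register_converter(self, ext, converter):
        # Store the converter under a normalized (lowercase) extension.
        self._converters[ext.lower()] = converter

    def add_from_file(self, path):
        ext = os.path.splitext(path)[1].lstrip(".").lower()
        if ext not in self._converters:
            raise ValueError(f"No converter registered for '.{ext}' files")
        self.rows.append(self._converters[ext](path))

    def add_from_files(self, paths):
        for path in paths:
            self.add_from_file(path)


def pdb_pointer_converter(path, column="pdb"):
    """Hypothetical converter: returns one table row pointing at the stored entry."""
    pdb_id = os.path.splitext(os.path.basename(path))[0]
    return {column: f"{column}/{pdb_id}"}


factory_sketch = MiniFactory()
factory_sketch.register_converter("pdb", pdb_pointer_converter)
factory_sketch.add_from_files(["/tmp/1l2y.pdb", "/tmp/4i23.pdb"])
print(factory_sketch.rows)
# [{'pdb': 'pdb/1l2y'}, {'pdb': 'pdb/4i23'}]
```

Keying converters by extension means a file with an unregistered extension fails early with a clear error instead of being silently skipped.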

### Check the dataset

In [15]:
dataset

0,1
name,
description,
tags,
user_attributes,
owner,
polaris_version,0.7.10.dev22+g8edf177.d20240814
default_adapters,pdb: ARRAY_TO_PDB
zarr_root_path,/Users/lu.zhu/Library/Caches/polaris-tutorials/002/tutorial_pdb.zarr
readme,
annotations,pdb: is_pointer=True; modality=PROTEIN_3D; description=None; user_attributes=(empty); dtype=object


### Check data table

In [16]:
dataset.table

Unnamed: 0,pdb
0,pdb/6s89
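
The `pdb` column holds a pointer (`pdb/6s89`) rather than the structure itself, which matches `is_pointer` being `True` in the annotations. A rough sketch of the idea, using a plain dictionary as a stand-in for the Zarr archive, might look like this. The names `store_sketch`, `table_sketch`, `resolve_pointer`, and `get_data_sketch` are hypothetical and do not reflect the Polaris internals.

```python
# Stand-in for the Zarr archive: group name -> entry key -> stored payload.
store_sketch = {"pdb": {"6s89": "<atom data for 6s89>"}}

# Stand-in for the dataset table: each cell is a "<group>/<key>" pointer.
table_sketch = [{"pdb": "pdb/6s89"}]


def resolve_pointer(store, pointer):
    """Split a '<group>/<key>' pointer and look up the payload it refers to."""
    group, key = pointer.split("/", 1)
    return store[group][key]


def get_data_sketch(row, col):
    """Follow the pointer stored in the table cell at (row, col)."""
    return resolve_pointer(store_sketch, table_sketch[row][col])


print(get_data_sketch(0, "pdb"))
# <atom data for 6s89>
```

Keeping only pointers in the table keeps it small, while the bulky structure data stays in the archive and is read on demand.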


### Get PDB data from specific row
An array of `biotite.Atom` objects is returned.
For more details, see [fastpdb](https://github.com/biotite-dev/fastpdb) and [Atom](https://github.com/biotite-dev/biotite/blob/main/src/biotite/structure/atoms.py).

In [None]:
dataset.get_data(0, "pdb")
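
To give a feel for what per-atom records contain, the sketch below parses the fixed-width `ATOM`/`HETATM` lines of the PDB format using only the standard library. It is an illustration, not a replacement for biotite or fastpdb; `AtomRecord`, `format_atom_line`, and `parse_atoms` are hypothetical helpers, and the two sample atoms are made-up illustrative values.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    name: str
    res_name: str
    chain_id: str
    res_id: int
    coord: tuple


def format_atom_line(serial, name, res_name, chain_id, res_id, x, y, z, element):
    """Build a fixed-width ATOM line so the sample follows the PDB column layout."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain_id}{res_id:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )


def parse_atoms(pdb_text):
    """Extract atom name, residue, chain, and coordinates from ATOM/HETATM lines."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        atoms.append(
            AtomRecord(
                name=line[12:16].strip(),       # columns 13-16
                res_name=line[17:20].strip(),   # columns 18-20
                chain_id=line[21],              # column 22
                res_id=int(line[22:26]),        # columns 23-26
                coord=(
                    float(line[30:38]),         # x, columns 31-38
                    float(line[38:46]),         # y, columns 39-46
                    float(line[46:54]),         # z, columns 47-54
                ),
            )
        )
    return atoms


sample_pdb = "\n".join(
    [
        format_atom_line(1, " N", "ASN", "A", 1, -8.901, 4.127, -0.555, "N"),
        format_atom_line(2, " CA", "ASN", "A", 1, -8.608, 3.135, -1.618, "C"),
        "END",
    ]
)

atoms = parse_atoms(sample_pdb)
print(atoms[0])
```

Each parsed record carries roughly the kind of annotations (atom name, residue, chain, coordinates) that a structure library exposes per atom.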

### Create dataset from multiple PDB files

In [7]:
pdb_paths = rcsb.fetch(["1l2y", "4i23"], "pdb", SAVE_DIR)
print(pdb_paths)

['/Users/lu.zhu/Library/Caches/polaris-tutorials/002/1l2y.pdb', '/Users/lu.zhu/Library/Caches/polaris-tutorials/002/4i23.pdb']


In [8]:
# SAVE_DIR is a plain string, so build the path with dm.fs.join rather than str.join
factory = DatasetFactory(dm.fs.join(SAVE_DIR, "pdbs.zarr"))

converter = PDBConverter()
factory.register_converter("pdb", converter)

factory.add_from_files(pdb_paths, axis=0)
dataset = factory.build()

In [9]:
dataset.table

Unnamed: 0,pdb
0,pdb/1l2y
1,pdb/4i23


In [None]:
dataset.get_data(1, "pdb")

Completing the dataset's metadata and uploading it to the Hub follows the same steps as in the [dataset_zarr.ipynb](docs/tutorials/dataset_zarr.ipynb) tutorial.

The End. 