# Structure, Systems, Dataset and Loader Tutorial

# Topics:
- Interacting with the `Structure` class
- Interacting with `PlinderSystem`
- Interacting with `PlinderDataset`
- Loader



## Setup

### Installation

`plinder` is available on *PyPI*.

```
pip install plinder
```

### Environment variable configuration
:::{note}
We need to set environment variables to point to the release and iteration of choice.
For the sake of demonstration, these are set to point to a smaller tutorial example
dataset: `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=tutorial`.
:::
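
A minimal sketch of setting these variables from Python is shown here. It assumes that `plinder` reads its configuration from the process environment, so the assignment should happen before `plinder` is imported; exporting the same variables in your shell is an equivalent alternative.

```python
import os

# Point plinder at the small tutorial dataset.
# Assumption: the values are read from the environment when plinder is
# first configured, so set them before importing plinder.
os.environ["PLINDER_RELEASE"] = "2024-06"
os.environ["PLINDER_ITERATION"] = "tutorial"

print(os.environ["PLINDER_RELEASE"], os.environ["PLINDER_ITERATION"])
```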

## Getting the configuration

First, we get the configuration to check that all parameters are set correctly.
In the snippet below, we check whether the local and remote *PLINDER* paths point to
the expected locations.

In [1]:
import plinder.core.utils.config

cfg = plinder.core.get_config()
print(f"local cache directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")

local cache directory: /Users/yusuf/.local/share/plinder/2024-06/v2
remote data directory: gs://plinder/2024-06/v2


## Data ecosystem overview
This tutorial assumes you have already downloaded the _PLINDER_ dataset. The examples will still run if you have not, but we encourage downloading the data beforehand for the sake of performance. The _PLINDER_ data hierarchy is shown below. This tutorial follows the same hierarchy from the ground up.
![image](../static/asset/data/plinder_data_hierarchy.png)

## 0. Structure files

After download, all files are stored locally at `~/.local/share/plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/systems`. The current defaults are `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`.

There, we have one sub-folder per system. Each sub-folder contains:
- Receptor structure: `receptor.cif` and `receptor.pdb`
- Ligand SDFs: `<biounit_instance_id>.<chain_id>.sdf`, stored under `ligand_files/`. For complexes with more than one ligand, all SDFs are saved
- Sequence FASTA: `sequences.fasta`

For more information on the file organization, see "<link-to-dataset-tutorial>"
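
The layout described above can be sketched with `pathlib`. The helper names `system_dir` and `expected_files` are hypothetical and not part of `plinder`; the file names follow the listing above, and the root directory is an assumption based on the default cache location.

```python
from pathlib import PurePosixPath


def system_dir(system_id, release="2024-06", iteration="v2",
               root="~/.local/share/plinder"):
    """Build the folder that holds all files of one system.

    Note: "~" is kept as-is here; use Path(...).expanduser() for real paths.
    """
    return PurePosixPath(root) / release / iteration / "systems" / system_id


def expected_files(system_id, ligand_ids, **kwargs):
    """Return the files we expect to find for a system, keyed by role."""
    base = system_dir(system_id, **kwargs)
    files = {
        "receptor_cif": base / "receptor.cif",
        "receptor_pdb": base / "receptor.pdb",
        "sequences": base / "sequences.fasta",
    }
    # One SDF per ligand chain, named <instance_id>.<chain_id>.sdf
    for ligand_id in ligand_ids:
        files[f"ligand_{ligand_id}"] = base / "ligand_files" / f"{ligand_id}.sdf"
    return files


files = expected_files("1avd__1__1.A__1.C", ["1.C"])
for role, path in files.items():
    print(f"{role}: {path}")
```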

## 1. Structure Python Abstraction
To make interacting with our data seamless, we provide the `Structure` class, a pydantic data class that:
- Loads all the structure files and SMILES
- Gets coordinates
- Featurizes residues and atoms of the associated protein and ligand molecules
- Masks molecules to account for resolved vs. unresolved parts

To interact with the example, do the following:

### Load the structure for a given system_id
For this purpose we will use `"1avd__1__1.A__1.C"` as our example system id.

In [2]:
from pathlib import Path

from biotite.sequence.io.fasta import FastaFile

from plinder.core import PlinderSystem
from plinder.core.structure.structure import Structure

system_id = "1avd__1__1.A__1.C"
system_dir = Path(cfg.data.plinder_dir) / "systems" / system_id

# Template SMILES of the full ligand; needed to account for the unresolved part of the ligand
input_smiles = "CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O"
input_sdf = system_dir / "ligand_files/1.C.sdf"
protein_structure_path = system_dir / "receptor.cif"
input_sequence_path = system_dir / "sequences.fasta"

# Map each FASTA header (the chain id) to its sequence
input_sequences = {k: v for k, v in FastaFile.read_iter(input_sequence_path)}

# Load holo structure
holo_struc = Structure(
    id=system_id,
    protein_path=protein_structure_path,
    protein_sequence=input_sequences,
    ligand_sdfs={"1.C": str(input_sdf)},
    ligand_smiles={"1.C": input_smiles},
)


[23:37:05] Molecule does not have explicit Hs. Consider calling AddHs()


In [3]:
holo_struc.ligand_mols

{'1.C': (<rdkit.Chem.rdchem.Mol at 0x15612fca0>,
  <rdkit.Chem.rdchem.Mol at 0x15612fe60>,
  (array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]])),
  <rdkit.Chem.rdchem.Mol at 0x15612fd10>,
  (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]])),
  (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]])))}

In [4]:
holo_struc.input_ligand_conformers

{'1.C': <rdkit.Chem.rdchem.Mol at 0x15612fe60>}

### List fields
- We list all fields and their `FieldInfo` to show which ones are required. `id` and `protein_path` are required. Everything else is optional. Particularly worth mentioning is the decision to make `ligand_sdfs` and `ligand_smiles` optional; this is because ligands are not available in apo and predicted structures.
- Of these fields, `ligand_mols` and `protein_atom_array` are computed within the object if left at their defaults.
- `ligand_mols` returns a chain-mapped dictionary of the form:
    ```python
    {
        "<instance_id>.<chain_id>": (
            rdkit mol of template smiles of type `Chem.Mol`,
            random conformer of rdkit mol of template smiles of type `Chem.Mol`,
            conformer atoms to template smiles atom map of type `tuple[NDArray.int_, NDArray.int_]`,
            rdkit mol of solved ligand structure of type `Chem.Mol`,
            solved ligand atoms to template smiles atom map of type `tuple[NDArray.int_, NDArray.int_]`,
            conformer atoms to solved ligand atom map of type `tuple[NDArray.int_, NDArray.int_]`
        )
    }
    ```
- `protein_atom_array` returns a [biotite AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) of the receptor protein structure.
- `add_ligand_hydrogens` specifies whether to add hydrogens to the ligand
- `structure_type` can be `"holo"`, `"apo"` or `"pred"`
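
Positional access into a six-element tuple is easy to get wrong, so one option is to wrap each entry in a `NamedTuple`. The sketch below is hypothetical: `LigandEntry` and its field names are not part of `plinder`, and strings and lists stand in for the RDKit molecules and index arrays so that the example runs without any dependency.

```python
from typing import Any, NamedTuple


class LigandEntry(NamedTuple):
    """Names for the six positions of a `ligand_mols` value (hypothetical)."""
    template: Any            # mol built from the template SMILES
    conformer: Any           # random conformer of the template mol
    conformer2smiles: Any    # conformer atoms -> template SMILES atoms
    resolved: Any            # mol of the solved ligand structure
    resolved2smiles: Any     # solved ligand atoms -> template SMILES atoms
    conformer2resolved: Any  # conformer atoms -> solved ligand atoms


# Stand-in for holo_struc.ligand_mols; real values are RDKit mols and arrays
ligand_mols = {
    "1.C": (
        "template_mol", "conformer_mol", ([0, 1, 2], [0, 1, 2]),
        "resolved_mol", ([2, 0], [0, 1]), ([2, 0], [0, 1]),
    )
}

entries = {chain: LigandEntry(*value) for chain, value in ligand_mols.items()}
print(entries["1.C"].resolved)  # access by name instead of index 3
print(entries["1.C"].conformer2resolved)
```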

In [5]:
# Show fields
holo_struc.model_fields

{'id': FieldInfo(annotation=str, required=True),
 'protein_path': FieldInfo(annotation=Path, required=True),
 'protein_sequence': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default=None),
 'ligand_sdfs': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default=None),
 'ligand_smiles': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default=None),
 'protein_atom_array': FieldInfo(annotation=Union[AtomArray, NoneType], required=False, default=None),
 'ligand_mols': FieldInfo(annotation=Union[dict[str, tuple[Mol, Mol, tuple[ndarray[Any, dtype[+_ScalarType_co]], ndarray[Any, dtype[+_ScalarType_co]]], Mol, tuple[ndarray[Any, dtype[+_ScalarType_co]], ndarray[Any, dtype[+_ScalarType_co]]], tuple[ndarray[Any, dtype[+_ScalarType_co]], ndarray[Any, dtype[+_ScalarType_co]]]]], NoneType], required=False, default=None),
 'add_ligand_hydrogens': FieldInfo(annotation=bool, required=False, default=False),
 'structure_type': FieldInf

In [6]:
# Inspect ligand_mols
holo_struc.ligand_mols

{'1.C': (<rdkit.Chem.rdchem.Mol at 0x15612fca0>,
  <rdkit.Chem.rdchem.Mol at 0x15612fe60>,
  (array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]])),
  <rdkit.Chem.rdchem.Mol at 0x15612fd10>,
  (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]])),
  (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]])))}

In [7]:
# Inspect protein_atom_array

In [8]:
holo_struc.protein_atom_array[0]

Atom(np.array([31.221, 22.957, 43.101], dtype=float32), chain_id="1.A", res_id=3, ins_code="", res_name="LYS", hetero=False, atom_name="N", element="N")

### List structure protein properties
Show protein-related properties.

In [9]:
# `prop` avoids shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "protein" in prop:
        print(prop)

protein_backbone_mask
protein_calpha_coords
protein_calpha_mask
protein_chain_ordered
protein_chains
protein_coords
protein_n_atoms
protein_sequence_from_structure
protein_structure_b_factor
protein_structure_tokenized_sequence
protein_unique_atom_names
protein_unique_residue_ids
protein_unique_residue_names


#### Protein backbone mask
This is a boolean mask that can be used to select backbone atoms from a biotite `AtomArray`. Positions that are `True` correspond to backbone atoms.

In [10]:
holo_struc.protein_backbone_mask

array([ True,  True,  True, False, False, False, False, False, False,
        True,  True,  True, False, False, False,  True,  True,  True,
       False, False, False,  True,  True,  True, False, False, False,
       False, False,  True,  True,  True, False, False, False, False,
        True,  True,  True, False,  True,  True,  True, False, False,
       False, False, False, False,  True,  True,  True, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True,  True, False, False, False, False,  True,  True,
        True, False, False, False, False, False,  True,  True,  True,
       False, False, False, False, False,  True,  True,  True, False,
       False, False, False, False,  True,  True,  True, False,  True,
        True,  True, False, False, False,  True,  True,  True, False,
       False, False, False, False,  True,  True,  True, False, False,
       False, False, False,  True,  True,  True, False, False, False,
       False,  True,

#### Protein Calpha mask
This is a boolean mask that is `True` at the positions of Calpha atoms.

In [11]:
holo_struc.protein_calpha_mask

array([False,  True, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False,  True, False, False, False,
        True, False, False, False, False, False,  True, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,
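
To make the two masks concrete, the sketch below rebuilds them for a single lysine residue in plain Python. The rule used here (backbone = `N`, `CA`, `C`) is an assumption that matches the first nine values of the outputs above; the coordinates are toy values, not taken from the structure.

```python
from itertools import compress

# Atom names of one lysine residue, in the usual PDB order
atom_names = ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"]
# Toy coordinates, one (x, y, z) triple per atom
coords = [(float(i), 0.0, 0.0) for i in range(len(atom_names))]

# Assumed rule: backbone atoms are N, CA and C
backbone_mask = [name in {"N", "CA", "C"} for name in atom_names]
calpha_mask = [name == "CA" for name in atom_names]

# A mask selects the entries at positions where it is True
backbone_coords = list(compress(coords, backbone_mask))
calpha_coords = list(compress(coords, calpha_mask))

print(backbone_mask)
print(calpha_coords)
```

With NumPy arrays or a biotite `AtomArray`, the same selection is written as `array[mask]`.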

### Get protein chain ordered
This gives a list of protein chains in the order in which they appear in the structure.

In [12]:
holo_struc.protein_chain_ordered

['1.A']

### Get protein chains for all atoms
The list of chain IDs in the structure. The order in which they appear is not preserved.

In [13]:
holo_struc.protein_chains

['1.A']

### Get protein coordinates
This property gets the 3D positions of all atoms in the protein molecules.

In [14]:
holo_struc.protein_coords

[array([[31.221, 22.957, 43.101],
        [31.828, 24.118, 42.476],
        [31.979, 23.854, 41.021],
        ...,
        [34.341, 35.018, 24.674],
        [35.484, 35.831, 24.497],
        [33.105, 35.742, 24.15 ]], dtype=float32)]

### Get number of atoms of protein molecule

In [15]:
holo_struc.protein_n_atoms

964

### Get protein structure atom names
Returns all atom names in the order in which they appear in the structure.

### Get protein b-factors
Get the B-factors of the protein atoms. If they are not available in a structure, they are set to zero.

In [16]:
holo_struc.protein_structure_b_factor

[0.0,
 0.0,
 0.0,
 ...

### Get protein residue names

### Get protein residue numbers
Residue numbers as they appear in the structure

### Get fasta from protein structure


### Get tokenized sequence
Get a tensor of the sequence converted to integer amino acid tokens

In [17]:
holo_struc.protein_structure_tokenized_sequence

tensor([11,  4, 15, 10, 16,  7, 11, 17, 16,  2,  3, 10,  7, 15,  2, 12, 16,  9,
         7,  0, 19,  2, 15,  1,  7,  6, 13, 16,  7, 16, 18, 16, 16,  0, 19, 16,
         0, 16, 15,  2,  6,  9, 11,  6, 15, 14, 10,  8,  7, 16,  6,  2, 16,  9,
         2, 11,  1, 16,  5, 14, 16, 13,  7, 13, 16, 19,  2, 17, 11, 13, 15,  6,
        15, 16, 16, 19, 13, 16,  7,  5,  4, 13,  9,  3,  1,  2,  7, 11,  6, 19,
        10, 11, 16, 12, 17, 10, 10,  1, 15, 15, 19,  2,  3,  9,  7,  3,  3, 17,
        11,  0, 16,  1, 19,  7,  9,  2,  9, 13, 16,  1, 10,  1, 16])
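
The token values above are consistent with indexing into the alphabet `ARNDCQEGHILKMFPSTWYV`. That ordering is inferred from the output, not taken from the library source, so treat the sketch below as an illustration of integer tokenization rather than the actual implementation.

```python
# Assumed residue alphabet; the position of a letter is its token
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
TOKEN_OF = {aa: idx for idx, aa in enumerate(ALPHABET)}


def tokenize(sequence):
    """Convert a one-letter amino acid sequence to integer tokens."""
    return [TOKEN_OF[aa] for aa in sequence]


# First ten resolved residues of chain 1.A (the structure starts at residue 3)
print(tokenize("KCSLTGKWTN"))  # [11, 4, 15, 10, 16, 7, 11, 17, 16, 2]
```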

#### Inspect holo sequences
Returns a chain-mapped dictionary of sequences from seqres
```python
{
    "<instance_id>.<chain_id>": sequence of type `str`

}
```

In [18]:
holo_struc.protein_sequence

{'1.A': 'ARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE'}

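`filter` returns a `Structure` restricted to the atoms whose property matches the mask. Here we keep only the Calpha atoms (`atom_name` equal to `CA`), which is why `protein_atom_array` in the output below has shape `(123,)`.

In [19]:
holo_struc.filter(
    property="atom_name",
    mask="CA",
)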

[23:37:07] Molecule does not have explicit Hs. Consider calling AddHs()


Structure(
    (
        'id',
        '1avd__1__1.A__1.C',
    ),
    (
        'protein_path',
        /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.cif,
    ),
    (
        'protein_sequence',
        {
            '1.A': 'ARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE',
        },
    ),
    (
        'ligand_sdfs',
        {
            '1.C': '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf',
        },
    ),
    (
        'ligand_smiles',
        {
            '1.C': 'CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O',
        },
    ),
    (
        'protein_atom_array',
        <class 'biotite.structure.AtomArray'> with shape (123,),
    ),
    (
        'ligand_mols',
        {
            '1.C': (
                <rdkit.Chem.rdchem.Mol object at 0x1562703c0>,
                <rdkit.Chem.rdchem.Mol object at 0x15ce805

### List ligand properties
Show ligand-related properties.

In [20]:
# `prop` avoids shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "ligand" in prop:
        print(prop)

input_ligand_conformer2resolved_stacks
input_ligand_conformer2smiles_stacks
input_ligand_conformer_coords
input_ligand_conformers
input_ligand_templates
ligand_chain_ordered
ligand_conformer2resolved_mask
resolved_ligand_mols
resolved_ligand_mols_coords
resolved_ligand_structure2smiles_stacks
resolved_ligand_structure_coords
resolved_smiles_ligand_mask_stacked


:::{todo}
- Vladas to write the description for the ligand properties
:::

### Ligand atom id mapping
TODO: Vladas

Conformer to resolved structure mapping:

In [21]:
holo_struc.input_ligand_conformer2resolved_stacks

{'1.C': (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
  array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]]))}
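
One way to read such a pair of index arrays: position `k` of the first array is a conformer atom index and position `k` of the second array is the matching resolved atom index. This reading is an assumption, chosen because the conformer has 15 atoms and the resolved ligand 14. Under it, the conformer atoms without a resolved partner can be found as follows (plain lists stand in for the NumPy arrays).

```python
# Index pair copied from the output above (inner row of each array)
conformer_idx = [9, 4, 5, 6, 7, 11, 1, 0, 3, 14, 13, 8, 12, 2]
resolved_idx = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
n_conformer_atoms = 15

# resolved atom index -> conformer atom index
resolved_to_conformer = dict(zip(resolved_idx, conformer_idx))

# True where a conformer atom has a counterpart in the resolved structure
matched = set(conformer_idx)
resolved_mask = [i in matched for i in range(n_conformer_atoms)]
unresolved_atoms = [i for i, ok in enumerate(resolved_mask) if not ok]

print(unresolved_atoms)  # [10]
```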


### Ligand conformer to input smiles mapping
TODO: Vladas

In [22]:
holo_struc.input_ligand_conformer2smiles_stacks

{'1.C': (array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]]),
  array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]]))}

### Ligand conformer coordinates
TODO: Vladas


In [24]:
holo_struc.input_ligand_conformer_coords

{'1.C': array([[ 4.1748436 ,  0.45433175,  0.32472356],
        [ 2.80881867, -0.02750113,  0.04236991],
        [ 2.66192611, -1.11128166, -0.55813492],
        [ 1.67381615,  0.70920623,  0.43596204],
        [ 0.33013961,  0.31686703,  0.20331715],
        [-0.26752988, -0.52223263,  1.31293962],
        [-1.76939599, -0.47843872,  1.11231553],
        [-2.04519106, -0.01650565, -0.28351731],
        [-1.26184089, -0.5996018 , -1.24681587],
        [ 0.09352656, -0.40019854, -1.08703447],
        [ 0.71640631,  0.16171004, -2.17215408],
        [-2.14727479,  1.47310426, -0.40186445],
        [-2.69789253,  1.82486712, -1.63155343],
        [-2.39235494, -1.66988754,  1.38086291],
        [ 0.12200306, -0.11443877,  2.56858381]])}

## 2. Interacting with PLINDER systems
`PlinderSystem` is the next layer of abstraction above `Structure`. It encapsulates all structures associated with a particular `system_id`. With it, we can access the `holo` structure and the alternate (`apo` and `pred`) structures.

In [25]:
sample_system = PlinderSystem(
    system_id="1avd__1__1.A__1.C")

### Check holo structure
Since having a `holo` structure is a defining feature of a _PLINDER_ system, a holo structure is by definition available for every system.

In [26]:
sample_system.holo_structure

2024-09-22 23:37:07,900 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:37:07,901 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:37:08,037 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:37:08,037 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:37:08,038 | plinder.core.index.utils:148 | INFO : loading entries from 1 zips
2024-09-22 23:37:08,041 | plinder.core.index.utils:163 | INFO : loaded 1 entries
2024-09-22 23:37:08,041 | plinder.core.index.utils.load_entries:24 | INFO : runtime succeeded: 0.14s
[23:37:08] Molecule does not have explicit Hs. Consider calling AddHs()


Structure(
    (
        'id',
        '1avd__1__1.A__1.C',
    ),
    (
        'protein_path',
        /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.cif,
    ),
    (
        'protein_sequence',
        {
            '1.A': 'ARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE',
        },
    ),
    (
        'ligand_sdfs',
        {
            '1.C': '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf',
        },
    ),
    (
        'ligand_smiles',
        {
            '1.C': 'CC(=O)N[C@@H]1[C@@H](O)[C@H](O)[C@@H](CO)O[C@H]1O',
        },
    ),
    (
        'protein_atom_array',
        <class 'biotite.structure.AtomArray'> with shape (964,),
    ),
    (
        'ligand_mols',
        {
            '1.C': (
                <rdkit.Chem.rdchem.Mol object at 0x15ce3bb50>,
                <rdkit.Chem.rdchem.Mol object at 0x15ce5e

### Get annotations 
The `system` property returns the annotations of the system in question as `json` data. To get the annotations of all other systems sharing the same PDB entry id, use the `.entry` property.

In [27]:
sample_system.system

{'pdb_id': '1avd',
 'biounit_id': '1',
 'ligands': [{'pdb_id': '1avd',
   'biounit_id': '1',
   'asym_id': 'C',
   'instance': 1,
   'ccd_code': 'NAG',
   'plip_type': 'SMALLMOLECULE',
   'bird_id': '',
   'centroid': [36.85636520385742, 25.090288162231445, 17.591215133666992],
   'smiles': 'CC(=O)N[C@@H]1[C@@H](O)[C@H](O)[C@@H](CO)O[C@H]1O',
   'resolved_smiles': 'OC[C@H]1O[CH][C@@H]([C@H]([C@@H]1O)O)NC(=O)C',
   'residue_numbers': [1],
   'rdkit_canonical_smiles': 'CC(=O)N[C@@H]1[C@@H](O)[C@H](O)[C@@H](CO)O[C@H]1O',
   'molecular_weight': 221.0899372,
   'crippen_clogp': -3.077599999999999,
   'num_rot_bonds': 2,
   'num_hbd': 5,
   'num_hba': 6,
   'num_rings': 1,
   'num_heavy_atoms': 15,
   'is_covalent': True,
   'covalent_linkages': ['17:ASN:A:17:ND2__600:NAG:C:.:C1'],
   'neighboring_residues': {'1.A': [9, 11, 15, 16, 17, 34, 35, 36, 123]},
   'neighboring_ligands': [],
   'interacting_residues': {'1.A': [34, 15]},
   'interacting_ligands': [],
   'interactions': {'1.A': {'15':

### Get paths of the underlying structure files

`archive` points to the subfolder where all the files (except `apo` and `pred` files) relating to a given system are stored

In [28]:
sample_system.archive

PosixPath('/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C')

Similarly, `system.cif`, `receptor.cif`, `receptor.pdb`, `sequences.fasta`, and the ligand SDFs can be accessed via the `.system_cif`, `.receptor_cif`, `.receptor_pdb`, `.sequence_fasta` and `.ligand_sdf` properties, respectively.

To get all the paths of the structures together, use `.structures`

In [29]:
sample_system.structures

['/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.cif',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.pdb',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/chain_mapping.json',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/sequences.fasta',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/system.cif',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf']

In [30]:
sample_system.system_cif

'/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/system.cif'

### Get binding site water (`.water_mapping`)
This returns information about the binding site waters.

In [31]:
sample_system.water_mapping

### Chain mapping 
:::{todo}
Confirm with Jay
:::

`.chain_mapping` maps chain ids in system (`<instance_id>.<asym_id>`) to PDB author chain ids 

In [32]:
sample_system.chain_mapping

{'1.A': 'A'}
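
The same `<instance_id>.<asym_id>` chain ids appear inside the system id itself. A small sketch of taking a system id apart is shown below; it assumes the fields are separated by double underscores and that multiple chains within a field are separated by single underscores. `parse_system_id` and `split_chain` are hypothetical helpers, not `plinder` functions.

```python
from typing import List, NamedTuple, Tuple


class SystemId(NamedTuple):
    pdb_id: str
    biounit_id: str
    receptor_chains: List[str]
    ligand_chains: List[str]


def parse_system_id(system_id: str) -> SystemId:
    """Split a system id into its four double-underscore separated fields."""
    pdb_id, biounit_id, receptors, ligands = system_id.split("__")
    return SystemId(pdb_id, biounit_id, receptors.split("_"), ligands.split("_"))


def split_chain(chain: str) -> Tuple[int, str]:
    """Split '<instance_id>.<asym_id>' into (instance_id, asym_id)."""
    instance_id, asym_id = chain.split(".", 1)
    return int(instance_id), asym_id


parsed = parse_system_id("1avd__1__1.A__1.C")
print(parsed)
print(split_chain(parsed.ligand_chains[0]))  # (1, 'C')
```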

### Linked apo and predicted structures
The following properties provide different kinds of information about linked structures:
- `.linked_archive`: returns the path to the local subfolder where the linked structures are saved
- `.linked_structures`: returns the dataframe of linked structures along with all their metrics
- `.get_linked_structure`: gives the path to a specific linked structure
- `.best_linked_structures_paths`: gives the best linked structures based on `scrmsd_wave`, the average symmetry-corrected RMSD across mapped ligands weighted by number of atoms. This selects a maximum of two alternate structures, with at most one `apo` and one `pred` when available
- `.alternate_structures`: returns a dictionary of `Structure` objects for the best `apo` and `pred` structures, with the corresponding `holo` chain as key
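
The selection rule for `.best_linked_structures_paths` can be illustrated in plain Python. This is a sketch of the idea only, not the library's implementation: the helper names and the records (including their RMSD values and atom counts) are made up, and it assumes that a lower `scrmsd_wave` is better.

```python
def weighted_average_rmsd(rmsds, n_atoms):
    """Average per-ligand RMSDs, weighting each ligand by its atom count."""
    return sum(r * n for r, n in zip(rmsds, n_atoms)) / sum(n_atoms)


def best_per_kind(links):
    """Keep the link with the lowest scrmsd_wave for each kind (apo/pred)."""
    best = {}
    for link in links:
        current = best.get(link["kind"])
        if current is None or link["scrmsd_wave"] < current["scrmsd_wave"]:
            best[link["kind"]] = link
    return best


# Made-up records: two ligands with 10 and 30 atoms per linked structure
atoms = [10, 30]
links = [
    {"id": "apo_1", "kind": "apo", "scrmsd_wave": weighted_average_rmsd([1.0, 2.0], atoms)},
    {"id": "apo_2", "kind": "apo", "scrmsd_wave": weighted_average_rmsd([3.0, 1.0], atoms)},
    {"id": "pred_1", "kind": "pred", "scrmsd_wave": weighted_average_rmsd([2.0, 2.0], atoms)},
]

best = best_per_kind(links)
print({kind: link["id"] for kind, link in best.items()})
```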

In [33]:
sample_system.linked_archive

2024-09-22 23:37:08,629 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:37:08,630 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s


PosixPath('/Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures')

In [34]:
sample_system.linked_structures

2024-09-22 23:37:09,017 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.17s
2024-09-22 23:37:09,379 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 0.71s


Unnamed: 0,reference_system_id,id,pocket_fident,pocket_lddt,protein_fident_qcov_weighted_sum,protein_fident_weighted_sum,protein_lddt_weighted_sum,target_id,sort_score,receptor_file,...,posebusters_most_extreme_ligand_element_waters,posebusters_most_extreme_protein_element_waters,posebusters_most_extreme_ligand_vdw_waters,posebusters_most_extreme_protein_vdw_waters,posebusters_most_extreme_sum_radii_waters,posebusters_most_extreme_distance_waters,posebusters_most_extreme_sum_radii_scaled_waters,posebusters_most_extreme_relative_distance_waters,posebusters_most_extreme_clash_waters,kind
0,1avd__1__1.A__1.C,1vyo_B,100.0,83.0,98.0,98.0,93.0,1vyo,1.48,/plinder/2024-06/assignments/apo/1avd__1__1.A_...,...,,,,,,,,,,apo
1,1avd__1__1.A__1.C,1vyo_A,100.0,82.0,98.0,98.0,93.0,1vyo,1.48,/plinder/2024-06/assignments/apo/1avd__1__1.A_...,...,,,,,,,,,,apo
2,1avd__1__1.A__1.C,1nqn_A,100.0,76.0,94.0,99.0,88.0,1nqn,1.8,/plinder/2024-06/assignments/apo/1avd__1__1.A_...,...,,,,,,,,,,apo
3,1avd__1__1.A__1.C,1rav_B,100.0,81.0,99.0,99.0,90.0,1rav,2.2,/plinder/2024-06/assignments/apo/1avd__1__1.A_...,...,,,,,,,,,,apo
4,1avd__1__1.A__1.C,1rav_A,100.0,83.0,99.0,99.0,90.0,1rav,2.2,/plinder/2024-06/assignments/apo/1avd__1__1.A_...,...,,,,,,,,,,apo
5,1avd__1__1.A__1.C,P02701_A,100.0,94.0,98.0,98.0,93.0,P02701,91.22,/plinder/2024-06/assignments/pred/1avd__1__1.A...,...,,,,,,,,,,pred


In [35]:
sample_system.get_linked_structure(link_kind="apo", link_id='1vyo_B')

'/Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/1vyo_B.cif'

In [36]:
sample_system.alternate_structures

{'1vyo_B': Structure(
     (
         'id',
         '1vyo_B',
     ),
     (
         'protein_path',
         /Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/1vyo_B.cif,
     ),
     (
         'protein_sequence',
         {
             '1.A': 'ARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE',
         },
     ),
     (
         'ligand_sdfs',
         None,
     ),
     (
         'ligand_smiles',
         None,
     ),
     (
         'protein_atom_array',
         <class 'biotite.structure.AtomArray'> with shape (958,),
     ),
     (
         'ligand_mols',
         None,
     ),
     (
         'add_ligand_hydrogens',
         False,
     ),
     (
         'structure_type',
         'apo',
     ),
 ),
 '1vyo_A': Structure(
     (
         'id',
         '1vyo_A',
     ),
     (
         'protein_path',
         /Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/1vyo_A.ci

### Get `OpenStructure` entities and views
- `.receptor_entity` returns the receptor `mol.EntityHandle` object
- `.ligand_views` returns a `mol.ResidueView` for all ligands

:::{note}
You must have OpenStructure installed to use these properties.
:::


### Other properties
This includes:
- `num_ligands`: Number of ligand chains
- `smiles`: Ligand smiles dictionary
- `num_proteins`: Number of protein chains


## 3. Interacting with the PLINDER dataset
`PlinderDataset` provides an interface to interact with _PLINDER_ data as a dataset. It is a subclass of `torch.utils.data.Dataset`, so subclassing and extending it should be familiar to most users. Flexibility and general applicability were our top concerns when designing this interface: `PlinderDataset` allows users not only to define their own split but also to bring their own featurizer.
It can be initialized with the following parameters:
```
Parameters
----------
df : pd.DataFrame | None
    the split to use
split : str
    the split to sample from
split_parquet_path : str | Path, default=None
    split parquet file
input_structure_priority : str, default="apo"
    Which alternate structure to prioritize
featurizer : Callable[[Structure, int], dict[str, torch.Tensor]], default=structure_featurizer
    Transformation to turn structure to input tensors
padding_value : int
    Value for padding uneven array
**kwargs : Any
    Any other keyword args
```
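
The `padding_value` parameter presumably exists because systems differ in size, so per-system arrays must be brought to a common length before they can be stacked. A dependency-free sketch of that idea follows; `pad_to_longest` is a hypothetical helper and not how `PlinderDataset` is implemented.

```python
def pad_to_longest(rows, padding_value=0):
    """Right-pad every row with padding_value up to the longest row."""
    target = max(len(row) for row in rows)
    return [list(row) + [padding_value] * (target - len(row)) for row in rows]


# Token lists of two chains with different lengths
batch = pad_to_longest([[11, 4, 15], [7, 6]], padding_value=-1)
print(batch)  # [[11, 4, 15], [7, 6, -1]]
```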

In [37]:
from plinder.core.loader import PlinderDataset

#### Make _PLINDER_ training dataset with default parameters
When no parameter is set, `PlinderDataset` defaults to the training set of the most current version of the dataset. To change this behaviour, we can explicitly pass a data frame to the parameter `df` or a split file to `split_parquet_path`. Either of these must have at least two columns, named `system_id` and `split`. By default, the dataset also uses our featurizer, `plinder.core.loader.featurizer.featurizer`.
NOTE: We provide `plinder.core.loader.featurizer.featurizer` as an example featurizer; users are encouraged to use featurizers that suit their needs.

In [38]:
train_dataset = PlinderDataset(split="train")

2024-09-22 23:37:10,146 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.13s
2024-09-22 23:37:10,888 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.13s
2024-09-22 23:37:11,021 | plinder.core.split.utils:40 | INFO : reading /Users/yusuf/.local/share/plinder/2024-06/v2/splits/split.parquet
2024-09-22 23:37:11,301 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.59s


We can inspect one of the items in the dataset. Let's select the item at index 1000.

In [39]:
example_train_data = train_dataset[1000]

2024-09-22 23:37:14,078 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 2.06s
2024-09-22 23:37:14,245 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.17s
2024-09-22 23:37:16,768 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.53s
2024-09-22 23:37:16,850 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.08s
2024-09-22 23:37:16,852 | plinder.core.index.utils:148 | INFO : loading entries from 1 zips
2024-09-22 23:37:16,856 | plinder.core.index.utils:163 | INFO : loaded 1 entries
2024-09-22 23:37:16,857 | plinder.core.index.utils.load_entries:24 | INFO : runtime succeeded: 0.75s
2024-09-22 23:37:17,222 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.16s
2024-09-22 23:37:17,389 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 0.46s
2024-09-22 23:37:17,705 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:37:17

In [40]:
example_train_data.keys()

dict_keys(['system_id', 'holo_structure', 'alternate_structures', 'features_and_coords', 'path'])

We can see we have access to:
- `system_id`: the _PLINDER_ system id
- `holo_structure`: a `Structure` object with all the properties described above
- `alternate_structures`: the linked alternate (`apo` and `pred`) structures
- `features_and_coords`: features and coordinates based on the featurizer passed to the dataset
- `path`: the path to the underlying structure files

In [41]:
example_train_data["system_id"]

'4nv1__2__1.C__1.L'

In [42]:
example_train_data["holo_structure"].get_properties()

['__fields_set__',
 'input_ligand_conformer2resolved_stacks',
 'input_ligand_conformer2smiles_stacks',
 'input_ligand_conformer_coords',
 'input_ligand_conformers',
 'input_ligand_templates',
 'input_sequence_list_ordered_by_chain',
 'input_sequence_residue_mask_stacked',
 'ligand_chain_ordered',
 'ligand_conformer2resolved_mask',
 'model_extra',
 'model_fields_set',
 'protein_backbone_mask',
 'protein_calpha_coords',
 'protein_calpha_mask',
 'protein_chain_ordered',
 'protein_chains',
 'protein_coords',
 'protein_n_atoms',
 'protein_sequence_from_structure',
 'protein_structure_b_factor',
 'protein_structure_tokenized_sequence',
 'protein_unique_atom_names',
 'protein_unique_residue_ids',
 'protein_unique_residue_names',
 'resolved_ligand_mols',
 'resolved_ligand_mols_coords',
 'resolved_ligand_structure2smiles_stacks',
 'resolved_ligand_structure_coords',
 'resolved_smiles_ligand_mask_stacked',
 'sequence_atom_mask_stacked']

In [43]:
for feat_and_coord in example_train_data["features_and_coords"].keys():
    print(feat_and_coord)

sequence_atom_mask_feature
input_sequence_residue_mask_feature
protein_coordinates
protein_calpha_coordinates
protein_structure_residue_feature
input_conformer_ligand_feature
input_conformer_ligand_coordinates
resolved_ligand_mols_feature


Let's inspect the protein residue-level feature, which is a one-hot encoding of the residue type.

In [44]:
example_train_data["features_and_coords"]["protein_structure_residue_feature"]

tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]], dtype=torch.float64)
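To make the idea of a one-hot residue feature concrete, here is a minimal pure-Python sketch. The vocabulary `RESIDUE_NAMES`, the mapping `RES_TO_INDEX` and the helper `one_hot_residues` are hypothetical names introduced for illustration only; the tutorial's own featurizer relies on `pc.AA_TO_INDEX`, whose exact contents may differ. The sketch assumes 20 standard residues plus an `"UNK"` bucket for anything else.

```python
# Hypothetical vocabulary: 20 standard residues plus an "UNK" bucket.
# The real mapping lives in plinder.core.utils.constants and may differ.
RESIDUE_NAMES = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]
RES_TO_INDEX = {name: i for i, name in enumerate(RESIDUE_NAMES + ["UNK"])}


def one_hot_residues(res_names, mapping=RES_TO_INDEX, unknown="UNK"):
    """Return one one-hot row per residue name; unseen names map to `unknown`."""
    width = len(mapping)
    rows = []
    for name in res_names:
        row = [0.0] * width
        row[mapping.get(name, mapping[unknown])] = 1.0
        rows.append(row)
    return rows


encoded = one_hot_residues(["ALA", "GLY", "XYZ"])
print(len(encoded), len(encoded[0]))  # 3 21
```

Each row contains exactly one `1.0`, and the unrecognized name `"XYZ"` falls into the `"UNK"` column.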

#### Customize `PlinderDataset`

To customize `PlinderDataset`, we will do the following:
- Select a specific subset of _PLINDER_ data that has only one ligand chain and whose ligand obeys Lipinski's rule of five.
- Write our own featurizer to demonstrate how one might do this in the future.

**Select a single-ligand, Lipinski-compliant training set**

In [45]:
from plinder.core.scores import query_index
from plinder.core.split import get_split

# Get the list of single-ligand, Lipinski-compliant system ids
system_id_list = query_index(
    columns=["system_id"],
    filters=[
        ("system_num_ligand_chains", "==", 1),
        ("ligand_is_lipinski", "==", True),
    ],
).system_id.to_list()
# Get the most current split
split_df = get_split()

# Select only the subset we care about
lipinski_split_df = split_df[split_df.system_id.isin(system_id_list)]

# lipinski_split_df should be passed to the PlinderDataset `df` parameter


2024-09-22 23:37:18,116 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.14s
2024-09-22 23:37:18,324 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:37:18,434 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.00s


**Bring your own featurizer**

In line with our philosophy of freedom to choose, we encourage users to write their own featurizer. To write a `PlinderDataset`-compatible featurizer, there are a few things to consider:
- `PlinderDataset` expects a featurizer that accepts a `Structure` object and a tensor padding value as its only arguments
- It must return a dictionary
- Use `structure.protein_chain_ordered` and `structure.ligand_chain_ordered` to order how the chains are stacked
- Remember to pad the tensors along the atom dimension, since chains don't usually have an equal number of atoms
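The contract above can be sketched without any _PLINDER_ or torch dependency. The helpers `pad_chains` and `toy_featurizer` below are hypothetical and use plain lists in place of tensors; they only illustrate ordering chains consistently, padding them to a common length and returning a dictionary. The padding value of `-100` is an assumption chosen to match the default used in the featurizer below.

```python
def pad_chains(chains, pad_value=-100):
    """Pad per-chain lists to the length of the longest chain."""
    max_len = max((len(chain) for chain in chains), default=0)
    return [list(chain) + [pad_value] * (max_len - len(chain)) for chain in chains]


def toy_featurizer(chain_features, chain_order, pad_value=-100):
    """Stack chains in `chain_order`, pad them, and return a dictionary."""
    ordered = [chain_features[chain_id] for chain_id in chain_order]
    return {"toy_feature": pad_chains(ordered, pad_value)}


features = toy_featurizer(
    {"1.B": [1, 2], "1.A": [3, 4, 5]},
    chain_order=["1.A", "1.B"],
)
print(features)  # {'toy_feature': [[3, 4, 5], [1, 2, -100]]}
```

The shorter chain is extended with the padding value so that all chains share one length, which is what makes stacking into a single tensor possible.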

In [54]:
import torch

from plinder.core.structure.structure import Structure
from plinder.core.structure.atoms import (
    _stack_atom_array_features,
    _stack_ligand_feat,
    _one_hot_encode_stack,
)
from plinder.core.utils import constants as pc
from plinder.core.loader.utils import pad_and_stack


def user_defined_featurizer(
    structure: Structure, pad_value: int = -100
) -> dict[str, torch.Tensor]:

    # This must be used to order the chain features
    protein_chain_order = structure.protein_chain_ordered
    ligand_chain_order = structure.ligand_chain_ordered
    protein_atom_array = structure.protein_atom_array
    sequence_atom_mask_stacked = structure.sequence_atom_mask_stacked
    input_sequence_residue_mask_stacked = structure.input_sequence_residue_mask_stacked
    protein_coordinates_stacked = structure.protein_coords
    protein_calpha_coordinates_stacked = structure.protein_calpha_coords
    resolved_ligand_mols_coords = structure.resolved_ligand_mols_coords

    # Get residue type feature
    protein_structure_residue_type_arr = _stack_atom_array_features(
        protein_atom_array, "res_name", protein_chain_order
    )
    protein_structure_residue_type_stack = [
        feat
        for feat in _one_hot_encode_stack(
            protein_structure_residue_type_arr, pc.AA_TO_INDEX, "UNK"
        )
    ]

    # Get resolved ligand mols coordinate
    resolved_ligand_mols_coords_stack = [
        coord
        for coord in _stack_ligand_feat(resolved_ligand_mols_coords, ligand_chain_order)
    ]
    features = {
        "sequence_atom_mask_feature": sequence_atom_mask_stacked,
        "input_sequence_residue_mask_feature": input_sequence_residue_mask_stacked,
        "protein_coordinates_feature": protein_coordinates_stacked,
        "protein_calpha_coordinates_feature": protein_calpha_coordinates_stacked,
        "protein_structure_residue_feature": protein_structure_residue_type_stack,
        "resolved_ligand_mols_feature": resolved_ligand_mols_coords_stack,
    }

    # Pad tensors to make chains have equal length.
    # This part is essential to create tensors with uniform dimensions
    padded_features = {
        feat_name: pad_and_stack(
            [torch.tensor(feat_per_chain) for feat_per_chain in feat],
            dim=0,
            value=pad_value,
        )
        for feat_name, feat in features.items()
    }

    # Return the padded features as a dictionary
    return padded_features


In [55]:
custom_train_dataset = PlinderDataset(
    df=lipinski_split_df,
    split="train",
    featurizer=user_defined_featurizer,
)

2024-09-22 23:49:11,680 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.13s
2024-09-22 23:49:12,166 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.00s


In [56]:
custom_train_dataset[0]

2024-09-22 23:49:15,366 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:15,367 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:15,505 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:15,505 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:15,505 | plinder.core.index.utils:148 | INFO : loading entries from 1 zips
2024-09-22 23:49:15,509 | plinder.core.index.utils:163 | INFO : loaded 1 entries
2024-09-22 23:49:15,509 | plinder.core.index.utils.load_entries:24 | INFO : runtime succeeded: 0.14s
2024-09-22 23:49:15,978 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.20s
2024-09-22 23:49:16,176 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 0.56s
2024-09-22 23:49:16,595 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:16

{'system_id': '2bmr__1__2.A__2.E',
 'holo_structure': Structure(
     (
         'id',
         '2bmr__1__2.A__2.E',
     ),
     (
         'protein_path',
         /Users/yusuf/.local/share/plinder/2024-06/v2/systems/2bmr__1__2.A__2.E/receptor.cif,
     ),
     (
         'protein_sequence',
         {
             '2.A': 'MSYQNLVSEAGLTQKLLIHGDKELFQHELKTIFARNWLFLTHDSLIPSPGDYVKAKMGVDEVIVSRQNDGSVRAFLNVCRHRGKTLVHAEAGNAKGFVCGYHGWGYGSNGELQSVPFEKELYGDAIKKKCLGLKEVPRIESFHGFIYGCFDAEAPPLIDYLGDAAWYLEPTFKYSGGLELVGPPGKVVVKANWKSFAENFVGDGYHVGWTHAAALRAGQSVFSSIAGNAKLPPEGAGLQMTSKYGSGMGVFWGYYSGNFSADMIPDLMAFGAAKQEKLAKEIGDVRARIYRSFLNGTIFPNNSFLTGSAAFRVWNPIDENTTEVWTYAFVEKDMPEDLKRRVADAVQRSIGPAGFWESDDNENMETMSQNGKKYQSSNIDQIASLGFGKDVYGDECYPGVVGKSAIGETSYRGFYRAYQAHISSSNWAEFENASRNWHIEHTKTTDR',
         },
     ),
     (
         'ligand_sdfs',
         {
             '2.E': '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/2bmr__1__2.A__2.E/ligand_files/2.E.sdf',
         },
     ),
     (
         'ligand_sm

## 4. Making Loader
The goal of this section is to walk you through how to make a simple torch loader.

Now that we have a dataset, we need to batch its items to feed them into a neural network. To do this, we need a collate function. Let's see how to write one.
Note: Each item in `PlinderDataset` returns `"system_id"`, `"holo_structure"`, `"alternate_structures"`, `"features_and_coords"` and `"path"`, so our collate function must be written to handle these keys. We have provided an example in `plinder.core.loader.utils.collate_batch`.

Having said that, let's see what it looks like in practice.


In [57]:
from plinder.core.loader.utils import collate_batch
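To illustrate what a collate function has to do with these keys, here is a simplified stand-in. It is not the implementation of `collate_batch`; the name `simple_collate` is hypothetical, and the sketch only gathers values into lists, whereas a real collate function would also pad and stack the features into tensors. The pluralized output keys are assumed to follow the naming seen in the batched output later in this section.

```python
from typing import Any


def simple_collate(batch: list[dict[str, Any]]) -> dict[str, Any]:
    """Group per-item values under batch-level keys (no padding or stacking)."""
    feature_names = batch[0]["features_and_coords"].keys()
    return {
        "system_ids": [item["system_id"] for item in batch],
        "holo_structures": [item["holo_structure"] for item in batch],
        "alternate_structures": [item["alternate_structures"] for item in batch],
        "paths": [item["path"] for item in batch],
        "features_and_coords": {
            name: [item["features_and_coords"][name] for item in batch]
            for name in feature_names
        },
    }


# Toy items that mimic the dataset's keys; the values are placeholders.
toy_batch = [
    {"system_id": "a", "holo_structure": None, "alternate_structures": {},
     "path": "/tmp/a", "features_and_coords": {"x": [1, 2]}},
    {"system_id": "b", "holo_structure": None, "alternate_structures": {},
     "path": "/tmp/b", "features_and_coords": {"x": [3]}},
]
collated = simple_collate(toy_batch)
print(collated["system_ids"])  # ['a', 'b']
```

A function with this signature can be passed as `collate_fn` to `torch.utils.data.DataLoader`, which calls it with the list of items that make up each batch.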

Let's wrap `torch.utils.data.DataLoader` around `PlinderDataset` 

In [58]:
from typing import Callable, Any
from torch.utils.data import DataLoader
def get_torch_loader(
    dataset: PlinderDataset,
    batch_size: int = 2,
    shuffle: bool = True,
    num_workers: int = 1,
    collate_fn: Callable[[list[dict[str, Any]]], dict[str, Any]] = collate_batch,
    **kwargs: Any,
) -> DataLoader[PlinderDataset]:
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        collate_fn=collate_fn,
        **kwargs,
    )

In [59]:

train_loader = get_torch_loader(train_dataset)
# Take the first batch to inspect it
for data in train_loader:
    sample_torch_data = data
    break


2024-09-22 23:49:28,989 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:28,989 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:29,740 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.62s
2024-09-22 23:49:29,878 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.14s
2024-09-22 23:49:29,878 | plinder.core.index.utils:148 | INFO : loading entries from 1 zips
2024-09-22 23:49:29,883 | plinder.core.index.utils:163 | INFO : loaded 1 entries
2024-09-22 23:49:29,883 | plinder.core.index.utils.load_entries:24 | INFO : runtime succeeded: 0.89s
2024-09-22 23:49:30,342 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.18s
2024-09-22 23:49:30,551 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 0.57s
2024-09-22 23:49:30,953 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-22 23:49:30

In [60]:
sample_torch_data.keys()

dict_keys(['system_ids', 'holo_structures', 'alternate_structures', 'paths', 'features_and_coords'])

In [61]:
sample_torch_data['system_ids']

['2zsj__1__1.B__1.F', '1ow0__4__1.D__1.N']

In [62]:
for k, v in sample_torch_data['features_and_coords'].items():
    print(k, v.shape)

sequence_atom_mask_feature torch.Size([2, 1, 3866])
input_sequence_residue_mask_feature torch.Size([2, 1, 352])
protein_coordinates torch.Size([2, 1, 2649, 3])
protein_calpha_coordinates torch.Size([2, 1, 350, 3])
protein_structure_residue_feature torch.Size([2, 1, 2649, 21])
input_conformer_ligand_feature torch.Size([2, 1, 16, 16])
input_conformer_ligand_coordinates torch.Size([2, 1, 16, 3])
resolved_ligand_mols_feature torch.Size([2, 1, 15, 3])
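Since systems in a batch rarely have the same number of atoms, the batched tensors are padded to a common size, and downstream code typically needs to know which positions are real. The sketch below shows one way to derive such a mask from padded coordinates using nested lists. The helper `coordinate_padding_mask` is hypothetical, and the padding value of `-100` is an assumption; check the value actually used by your featurizer and collate function.

```python
def coordinate_padding_mask(coords, pad_value=-100):
    """Return True for real atoms and False where all coordinates equal pad_value."""
    return [
        [any(c != pad_value for c in atom) for atom in chain]
        for chain in coords
    ]


# One chain with two atom slots: a real atom and a padded slot.
toy_coords = [[[0.0, 1.0, 2.0], [-100, -100, -100]]]
print(coordinate_padding_mask(toy_coords))  # [[True, False]]
```

With tensors, the same idea would be expressed as an elementwise comparison against the padding value, reduced over the coordinate axis.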
