# Structure, Systems, Dataset and Loader Tutorial

# Topics:
- Interacting with the `Structure` class
- Interacting with `PlinderSystem`
- Interacting with `PlinderDataset`
- Loader



## Setup

### Installation

`plinder` is available on *PyPI*.

```
pip install plinder
```

### Environment variable configuration
:::{note}
We need to set environment variables to point to the release and iteration of choice.
For the sake of demonstration, these are set to point to a smaller tutorial example
dataset: `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=tutorial`.
:::
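
A minimal sketch of setting these variables from within Python is shown below. It assumes the variables have to be in place before `plinder` is imported so that the configuration can pick them up; exporting them in the shell before launching Python is an equivalent alternative.

```python
import os

# Values taken from the note above. Set them before importing `plinder`
# (assumption: the configuration reads the environment when it is created).
os.environ["PLINDER_RELEASE"] = "2024-06"
os.environ["PLINDER_ITERATION"] = "tutorial"

print(os.environ["PLINDER_RELEASE"], os.environ["PLINDER_ITERATION"])
```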

## Getting the configuration

First, we get the configuration to check that all parameters are set correctly.
In the snippet below, we check whether the local and remote *PLINDER* paths point to
the expected locations.

In [219]:
import plinder.core.utils.config

cfg = plinder.core.get_config()
print(f"local cache directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")

local cache directory: /Users/yusuf/.local/share/plinder/2024-06/v2
remote data directory: gs://plinder/2024-06/v2


## Data ecosystem overview
This tutorial assumes the user has already downloaded the _PLINDER_ dataset. The examples will run without any prior download, but we encourage users to download the data beforehand for better performance. The _PLINDER_ data hierarchy is shown below. This tutorial follows the same hierarchy from the ground up.
![image](../static/asset/data/plinder_data_hierarchy.png)

## 0. Structure files

After download, all files will be stored locally at `~/.local/share/plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/systems`. The current default is `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`.

There, we have one sub-folder per system. In each sub-folder, we have:
- Receptor structure: `receptor.cif` and `receptor.pdb`
- Ligand SDFs: `<biounit_instance_id>.<chain_id>.sdf`. For complexes with more than one ligand, all the SDFs are saved
- Sequence FASTA: `sequences.fasta`

For more information on the file organization, see "<link-to-dataset-tutorial>"
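
As a rough illustration of this layout, the sketch below builds the expected folder of one system and reports which files are present. The helper `system_dir` is hypothetical (it is not part of `plinder`), the default release and iteration are the ones quoted above, and the file names follow the `.structures` output shown later in this tutorial.

```python
import os
from pathlib import Path


def system_dir(system_id: str) -> Path:
    """Hypothetical helper: expected local folder of a system."""
    release = os.environ.get("PLINDER_RELEASE", "2024-06")
    iteration = os.environ.get("PLINDER_ITERATION", "v2")
    base = Path.home() / ".local" / "share" / "plinder"
    return base / release / iteration / "systems" / system_id


folder = system_dir("1avd__1__1.A__1.C")

# Report which of the expected files exist locally (none, if nothing was downloaded)
for name in ("receptor.cif", "receptor.pdb", "sequences.fasta"):
    path = folder / name
    print(path, "exists" if path.exists() else "missing")
```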

## 1. Structure Python Abstraction
To make interacting with our data seamless, we provide the `Structure` class, a pydantic data class that:
- Loads all the structure files and SMILES
- Gets coordinates
- Featurizes residues and atoms of the associated protein and ligand molecules
- Masks molecules to account for resolved vs. unresolved parts

To interact with the example, do the following:

### Load the structure for a given system_id
For this purpose we will use `"1avd__1__1.A__1.C"` as our example system id.

In [260]:
from plinder.core.structure.structure import Structure
from plinder.core import PlinderSystem
from pathlib import Path

system_id = "1avd__1__1.A__1.C"
linked_apo_id = "P02701_A"

# Load holo structure
holo_struc = PlinderSystem(system_id=system_id).holo_structure

2024-09-20 16:51:14,406 | plinder.core.structure.atoms:158 | INFO : generate_conformer: MMFFOptimizeMolecule - more iterations are required, extending by 500 steps
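
The system ID used above packs several pieces of information into one string. The sketch below splits it into its parts; `SystemIdParts` and `parse_system_id` are hypothetical helpers, and the assumption that multiple chains are joined by single underscores is not confirmed by this tutorial (the example has only one receptor chain and one ligand chain).

```python
from typing import NamedTuple


class SystemIdParts(NamedTuple):  # hypothetical helper, not part of plinder
    pdb_id: str
    biounit_id: str
    receptor_chains: list[str]
    ligand_chains: list[str]


def parse_system_id(system_id: str) -> SystemIdParts:
    """Split on double underscores; chains are assumed to be joined by '_'."""
    pdb_id, biounit_id, receptors, ligands = system_id.split("__")
    return SystemIdParts(pdb_id, biounit_id, receptors.split("_"), ligands.split("_"))


parts = parse_system_id("1avd__1__1.A__1.C")
print(parts)
```

Each chain keeps the `<instance_id>.<chain_id>` form that is used as a dictionary key throughout the rest of this tutorial.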


### List fields
- We list all fields and their `FieldInfo` to show which ones are required. `id`, `protein_path` and `protein_sequence` are required; everything else is optional. In particular, `list_ligand_sdf_and_input_smiles` is optional because ligands are not available in apo and predicted structures.
- Of these fields, `ligand_mols` and `protein_atom_array` are computed within the object if left at their defaults.
- `ligand_mols` returns a chain-mapped dictionary of the form:
    ```python
    {
        "<instance_id>.<chain_id>": (
            rdkit mol of template smiles of type `Chem.Mol`,
            random conformer of rdkit mol of template smiles of type `Chem.Mol`,
            conformer atoms to template smiles atom map of type `tuple[NDArray.int_, NDArray.int_]`,
            rdkit mol of solved ligand structure of type `Chem.Mol`,
            solved ligand atom to template smiles atom map of type `tuple[NDArray.int_, NDArray.int_]`,
            conformer atoms to solved ligand atom map of type `tuple[NDArray.int_, NDArray.int_]`
        )
    }
    ```
- `protein_atom_array` returns a [biotite AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) of the receptor protein structure.
- `add_ligand_hydrogens` specifies whether to add hydrogens to the ligand.
- `structure_type` can be `"holo"`, `"apo"` or `"pred"`.

In [221]:
# Show fields
holo_struc.model_fields

{'id': FieldInfo(annotation=str, required=True),
 'protein_path': FieldInfo(annotation=Path, required=True),
 'protein_sequence': FieldInfo(annotation=Path, required=True),
 'list_ligand_sdf_and_input_smiles': FieldInfo(annotation=Union[list[tuple[Path, str]], NoneType], required=False, default=None),
 'protein_atom_array': FieldInfo(annotation=Union[AtomArray, NoneType], required=False, default=None),
 'ligand_mols': FieldInfo(annotation=Union[dict[str, tuple[Mol, Mol, tuple[ndarray[Any, dtype[+_ScalarType_co]], ndarray[Any, dtype[+_ScalarType_co]]], Mol, tuple[ndarray[Any, dtype[+_ScalarType_co]], ndarray[Any, dtype[+_ScalarType_co]]], tuple[ndarray[Any, dtype[+_ScalarType_co]], ndarray[Any, dtype[+_ScalarType_co]]]]], NoneType], required=False, default=None),
 'add_ligand_hydrogens': FieldInfo(annotation=bool, required=False, default=False),
 'structure_type': FieldInfo(annotation=str, required=False, default='holo')}

In [222]:
# Inspect ligand_mols
holo_struc.ligand_mols

{'1.C': (<rdkit.Chem.rdchem.Mol at 0x1cd19f7d0>,
  <rdkit.Chem.rdchem.Mol at 0x1cd19cba0>,
  (array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]])),
  <rdkit.Chem.rdchem.Mol at 0x1cd19d1c0>,
  (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]])),
  (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]])))}

In [224]:
# Inspect the first atom of protein_atom_array
holo_struc.protein_atom_array[0]

Atom(np.array([31.221, 22.957, 43.101], dtype=float32), chain_id="1.A", res_id=3, ins_code="", res_name="LYS", hetero=False, atom_name="N", element="N")

### List structure protein properties
Show protein related properties

In [225]:
# `prop` avoids shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "protein" in prop:
        print(prop)

protein_backbone_mask
protein_calpha_coords
protein_calpha_mask
protein_chain_ordered
protein_chains
protein_coords
protein_n_atoms
protein_sequence_from_structure
protein_structure_atom_names
protein_structure_b_factor
protein_structure_residue_names
protein_structure_residues
protein_structure_sequence_fasta
protein_structure_tokenized_sequence


#### Protein backbone mask
This is a boolean mask that can be used to select backbone atoms from a biotite `AtomArray`. Positions set to `True` correspond to backbone atoms.

In [226]:
holo_struc.protein_backbone_mask

array([ True,  True,  True, False, False, False, False, False, False,
        True,  True,  True, False, False, False,  True,  True,  True,
       False, False, False,  True,  True,  True, False, False, False,
       False, False,  True,  True,  True, False, False, False, False,
        True,  True,  True, False,  True,  True,  True, False, False,
       False, False, False, False,  True,  True,  True, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True,  True, False, False, False, False,  True,  True,
        True, False, False, False, False, False,  True,  True,  True,
       False, False, False, False, False,  True,  True,  True, False,
       False, False, False, False,  True,  True,  True, False,  True,
        True,  True, False, False, False,  True,  True,  True, False,
       False, False, False, False,  True,  True,  True, False, False,
       False, False, False,  True,  True,  True, False, False, False,
       False,  True,

#### Protein Calpha mask
This is the corresponding boolean mask for Cα atoms.

In [227]:
holo_struc.protein_calpha_mask

array([False,  True, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False,  True, False, False, False,
        True, False, False, False, False, False,  True, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,
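
To make the two masks concrete, the plain-Python sketch below rebuilds them for a single lysine residue and uses them to select coordinates. The backbone definition (`N`, `CA`, `C`) is an assumption that is consistent with the mask printed above, where the first nine atoms give three `True` values followed by six `False` values; the coordinates are toy values.

```python
# Atom names of one lysine residue in standard PDB order (illustrative input)
atom_names = ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"]
coords = [[float(i), 0.0, 0.0] for i in range(len(atom_names))]  # toy coordinates

BACKBONE = {"N", "CA", "C"}  # assumed backbone definition

# One boolean per atom, in the same order as the atoms
backbone_mask = [name in BACKBONE for name in atom_names]
calpha_mask = [name == "CA" for name in atom_names]

# Keep only the rows where the mask is True
backbone_coords = [xyz for xyz, keep in zip(coords, backbone_mask) if keep]
calpha_coords = [xyz for xyz, keep in zip(coords, calpha_mask) if keep]

print(backbone_mask)
print(calpha_coords)
```

With NumPy arrays or a biotite `AtomArray`, the same selection is written as boolean indexing, for example `array[mask]`.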

### Get protein chain ordered
This gives a list of protein chains ordered by how they are in the structure

In [228]:
holo_struc.protein_chain_ordered

['1.A']

### Get protein chains for all atoms
The list of chain IDs in the structure. The order in which they appear is not preserved.

In [229]:
holo_struc.protein_chains

['1.A']

### Get protein coordinates
This property gets the 3D positions of each of the atoms in protein molecules

In [230]:
holo_struc.protein_coords

array([[31.221, 22.957, 43.101],
       [31.828, 24.118, 42.476],
       [31.979, 23.854, 41.021],
       ...,
       [34.341, 35.018, 24.674],
       [35.484, 35.831, 24.497],
       [33.105, 35.742, 24.15 ]], dtype=float32)

### Get number of atoms of protein molecule

In [231]:
holo_struc.protein_n_atoms

964

### Get protein structure atom names
Returns the unique atom names that appear in the structure

In [232]:
holo_struc.protein_unique_atom_names

['C',
 'CA',
 'CB',
 'CD',
 'CD1',
 'CD2',
 'CE',
 'CE1',
 'CE2',
 'CE3',
 'CG',
 'CG1',
 'CG2',
 'CH2',
 'CZ',
 'CZ2',
 'CZ3',
 'N',
 'ND1',
 'ND2',
 'NE',
 'NE1',
 'NE2',
 'NH1',
 'NH2',
 'NZ',
 'O',
 'OD1',
 'OD2',
 'OE1',
 'OE2',
 'OG',
 'OG1',
 'OH',
 'SD',
 'SG']

### Get protein b-factors
Get the B-factor of each protein atom. If B-factors are not available in a structure, they are set to zero.

In [233]:
holo_struc.protein_structure_b_factor

[0.0,
 0.0,
 0.0,
 ...]

### Get protein residue names

In [234]:
holo_struc.protein_unique_residue_names

['ALA',
 'ARG',
 'ASN',
 'ASP',
 'CYS',
 'GLN',
 'GLU',
 'GLY',
 'HIS',
 'ILE',
 'LEU',
 'LYS',
 'MET',
 'PHE',
 'PRO',
 'SER',
 'THR',
 'TRP',
 'TYR',
 'VAL']

### Get protein residue numbers
Residue numbers as they appear in the structure

In [235]:
holo_struc.protein_unique_residue_ids

[3,
 4,
 5,
 ...,
 123,
 124,
 125]

### Get sequence from protein structure


In [236]:
holo_struc.protein_sequence_from_structure

'>receptor\nKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRT'

### Get tokenized sequence
Get a tensor of the sequence converted to integer-based amino acid tokens

In [237]:
holo_struc.protein_structure_tokenized_sequence

tensor([11,  4, 15, 10, 16,  7, 11, 17, 16,  2,  3, 10,  7, 15,  2, 12, 16,  9,
         7,  0, 19,  2, 15,  1,  7,  6, 13, 16,  7, 16, 18, 16, 16,  0, 19, 16,
         0, 16, 15,  2,  6,  9, 11,  6, 15, 14, 10,  8,  7, 16,  6,  2, 16,  9,
         2, 11,  1, 16,  5, 14, 16, 13,  7, 13, 16, 19,  2, 17, 11, 13, 15,  6,
        15, 16, 16, 19, 13, 16,  7,  5,  4, 13,  9,  3,  1,  2,  7, 11,  6, 19,
        10, 11, 16, 12, 17, 10, 10,  1, 15, 15, 19,  2,  3,  9,  7,  3,  3, 17,
        11,  0, 16,  1, 19,  7,  9,  2,  9, 13, 16,  1, 10,  1, 16])
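
The token values above are consistent with indexing the 20 standard amino acids in the alphabetical order of their three-letter names (`ALA`, `ARG`, `ASN`, ...). The sketch below reproduces the first row of the tensor under that assumption; the vocabulary is inferred from the printed output, not taken from the library, and how non-standard residues are handled is not covered.

```python
# One-letter codes ordered by three-letter name: ALA, ARG, ASN, ASP, CYS, ...
VOCAB = "ARNDCQEGHILKMFPSTWYV"  # assumed vocabulary, inferred from the output above
TOKEN_OF = {aa: index for index, aa in enumerate(VOCAB)}


def tokenize(sequence: str) -> list[int]:
    """Map each one-letter amino acid code to its integer token."""
    return [TOKEN_OF[aa] for aa in sequence]


# First 18 residues of the sequence derived from the structure
tokens = tokenize("KCSLTGKWTNDLGSNMTI")
print(tokens)
```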

#### Inspect holo sequences
Returns a chain-mapped dictionary of sequences from seqres
```python
{
    "<instance_id>.<chain_id>": sequence of type `str`

}
```

In [238]:
holo_struc.protein_sequence

{'1.A': 'ARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE'}
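
The sequence derived from the structure only covers resolved residues, so it can be shorter than the seqres sequence. A minimal sketch for locating one inside the other is shown below, using the two strings printed in this tutorial. Reading the offset as a residue number assumes that residue numbering follows seqres positions, which holds for this example but is not guaranteed in general.

```python
# Strings copied from the outputs shown in this tutorial
fasta = ">receptor\nKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRT"
seqres = "ARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYTTAVTATSNEIKESPLHGTENTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE"

# Split the FASTA text into its header line and the sequence itself
header, *sequence_lines = fasta.splitlines()
resolved = "".join(sequence_lines)

# Position of the resolved stretch within seqres (-1 if it is not contained)
offset = seqres.find(resolved)
first_residue_number = offset + 1

print(header, len(resolved), offset, first_residue_number)
```

Here the resolved stretch starts at the third seqres position, which agrees with `res_id=3` for the first atom of `protein_atom_array` shown earlier.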

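### Filter a structure by atom property
`filter` returns a `Structure` whose atoms are restricted to those where `property` matches `mask`. Here we keep only the Cα atoms, so the resulting `protein_atom_array` holds one atom per resolved residue.

In [239]:
holo_struc.filter(
    property="atom_name",
    mask="CA",
)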

Structure(
    (
        'id',
        '1avd__1__1.A__1.C',
    ),
    (
        'protein_path',
        /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.cif,
    ),
    (
        'protein_sequence',
        /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/sequences.fasta,
    ),
    (
        'list_ligand_sdf_and_input_smiles',
        [
            (
                /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf,
                'CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O',
            ),
        ],
    ),
    (
        'protein_atom_array',
        <class 'biotite.structure.AtomArray'> with shape (123,),
    ),
    (
        'ligand_mols',
        {
            '1.C': (
                <rdkit.Chem.rdchem.Mol object at 0x1cd19f7d0>,
                <rdkit.Chem.rdchem.Mol object at 0x1cd19cba0>,
                (
                    <class 'numpy.ndarray'> with shape (1, 15),
      

### List ligand properties
Show ligand-related properties

In [240]:
# `prop` avoids shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "ligand" in prop:
        print(prop)

input_ligand_conformer2resolved_stacks
input_ligand_conformer2smiles_stacks
input_ligand_conformer_coords
input_ligand_conformers
input_ligand_templates
ligand_chain_ordered
ligand_conformer2resolved_mask
resolved_ligand_mols
resolved_ligand_mols_coords
resolved_ligand_structure2smiles_stacks
resolved_ligand_structure_coords
resolved_smiles_ligand_mask



### Ligand atom ID mapping
Conformer to resolved (solved) structure mapping:

In [241]:
holo_struc.input_ligand_conformer2resolved_stacks

{'1.C': (array([[ 9,  4,  5,  6,  7, 11,  1,  0,  3, 14, 13,  8, 12,  2]]),
  array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]]))}
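
One way to read such a pair of index arrays is sketched below in plain Python. It assumes that position `k` pairs conformer atom `conformer_idx[k]` with resolved atom `resolved_idx[k]`; this reading is suggested by the property name but is not confirmed in this tutorial. The atom count of 15 comes from the conformer coordinates and from `num_heavy_atoms` in the annotations.

```python
# Index arrays copied from the output above (first row of each array)
conformer_idx = [9, 4, 5, 6, 7, 11, 1, 0, 3, 14, 13, 8, 12, 2]
resolved_idx = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
n_conformer_atoms = 15

# Assumed reading: conformer atom -> resolved atom, paired position by position
conformer_to_resolved = dict(zip(conformer_idx, resolved_idx))

# Conformer atoms without a resolved counterpart
has_match = [i in conformer_to_resolved for i in range(n_conformer_atoms)]
unmatched = [i for i, matched in enumerate(has_match) if not matched]

print(unmatched)
```

Under this reading, one conformer atom has no partner in the resolved ligand, which fits the resolved structure having 14 mapped atoms against 15 in the conformer.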


### Ligand conformer to input smiles mapping

In [242]:
holo_struc.input_ligand_conformer2smiles_stacks

{'1.C': (array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]]),
  array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]]))}

### Ligand conformer coordinates
Coordinates of the generated conformer for each input ligand, keyed by ligand chain.

In [243]:
holo_struc.input_ligand_conformer_coords

{'1.C': array([[-2.83700691,  2.36126463,  0.28437332],
        [-2.42165928,  0.98309349,  0.72201484],
        [-3.23134886,  0.14470484,  1.10349342],
        [-1.07511582,  0.73641591,  0.63011553],
        [-0.48277969, -0.53094777,  1.0809545 ],
        [ 1.05432673, -0.44899733,  1.07440101],
        [ 1.57964087, -0.31813821, -0.36717482],
        [ 0.60975084, -0.9426719 , -1.39497456],
        [-0.05877692, -2.07826678, -0.86218037],
        [-0.91964392, -1.76723073,  0.23402732],
        [-0.86187374, -2.91598786,  1.07917704],
        [ 1.36988164, -1.35032138, -2.66294677],
        [ 2.0478061 , -0.22445079, -3.21860561],
        [ 1.73895402,  1.07428959, -0.69398775],
        [ 1.6463587 , -1.58762236,  1.71806322]])}

## 2. Interacting with PLINDER systems
`PlinderSystem` is the next layer of abstraction above `Structure`. It encapsulates all structures associated with a particular `system_id`. Through it, we can access the `holo` and alternate (`apo` and `pred`) structures.

In [282]:
sample_system = PlinderSystem(
    system_id="1avd__1__1.A__1.C",
    input_smiles_dict={"1.C": "CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O"},
)

### Check holo structure
Since having a `holo` structure is a defining feature of a _PLINDER_ system, holo structures are by definition available for all systems.

In [283]:
sample_system.holo_structure

2024-09-20 21:03:01,657 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-20 21:03:01,657 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-20 21:03:01,696 | plinder.core.structure.atoms:158 | INFO : generate_conformer: MMFFOptimizeMolecule - more iterations are required, extending by 500 steps


Structure(
    (
        'id',
        '1avd__1__1.A__1.C',
    ),
    (
        'protein_path',
        /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.cif,
    ),
    (
        'protein_sequence',
        /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/sequences.fasta,
    ),
    (
        'list_ligand_sdf_and_input_smiles',
        [
            (
                /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf,
                'CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O',
            ),
        ],
    ),
    (
        'protein_atom_array',
        <class 'biotite.structure.AtomArray'> with shape (964,),
    ),
    (
        'ligand_mols',
        {
            '1.C': (
                <rdkit.Chem.rdchem.Mol object at 0x143fd02e0>,
                <rdkit.Chem.rdchem.Mol object at 0x143fd0190>,
                (
                    <class 'numpy.ndarray'> with shape (1, 15),
      

### Get annotations 
The `system` property returns the `json` annotations for the system in question. To get the annotations of all other systems sharing the same PDB entry ID, use the `.entry` property.

In [287]:
sample_system.system

{'pdb_id': '1avd',
 'biounit_id': '1',
 'ligands': [{'pdb_id': '1avd',
   'biounit_id': '1',
   'asym_id': 'C',
   'instance': 1,
   'ccd_code': 'NAG',
   'plip_type': 'SMALLMOLECULE',
   'bird_id': '',
   'centroid': [36.85636520385742, 25.090288162231445, 17.591215133666992],
   'smiles': 'CC(=O)N[C@@H]1[C@@H](O)[C@H](O)[C@@H](CO)O[C@H]1O',
   'resolved_smiles': 'OC[C@H]1O[CH][C@@H]([C@H]([C@@H]1O)O)NC(=O)C',
   'residue_numbers': [1],
   'rdkit_canonical_smiles': 'CC(=O)N[C@@H]1[C@@H](O)[C@H](O)[C@@H](CO)O[C@H]1O',
   'molecular_weight': 221.0899372,
   'crippen_clogp': -3.077599999999999,
   'num_rot_bonds': 2,
   'num_hbd': 5,
   'num_hba': 6,
   'num_rings': 1,
   'num_heavy_atoms': 15,
   'is_covalent': True,
   'covalent_linkages': ['17:ASN:A:17:ND2__600:NAG:C:.:C1'],
   'neighboring_residues': {'1.A': [9, 11, 15, 16, 17, 34, 35, 36, 123]},
   'neighboring_ligands': [],
   'interacting_residues': {'1.A': [34, 15]},
   'interacting_ligands': [],
   'interactions': {'1.A': {'15':

### Get paths of the underlying structure files

`archive` points to the subfolder where all the files (except `apo` and `pred` files) relating to a given system are stored

In [288]:
sample_system.archive

PosixPath('/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C')

Similarly, `system.cif`, `receptor.cif`, `receptor.pdb`, `sequences.fasta`, and the ligand SDFs can be accessed via the `.system_cif`, `.receptor_cif`, `.receptor_pdb`, `.sequence_fasta` and `.ligand_sdf` properties respectively.

To get all the paths of the structures together, use `.structures`

In [294]:
sample_system.structures

['/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.cif',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/receptor.pdb',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/chain_mapping.json',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/sequences.fasta',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/system.cif',
 '/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf']

In [289]:
sample_system.system_cif

'/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/system.cif'

### Get binding site water (`.water_mapping`)
This returns information about binding-site waters

In [292]:
sample_system.water_mapping

{}

### Chain mapping 

`.chain_mapping` maps chain ids in system (`<instance_id>.<asym_id>`) to PDB author chain ids 

In [293]:
sample_system.chain_mapping

{'1.A': 'A'}
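
A short sketch of working with this mapping is shown below. `split_system_chain` is a hypothetical helper (not part of `plinder`) that separates a system chain ID into its instance and `asym_id` parts before looking up the author chain.

```python
chain_mapping = {"1.A": "A"}  # copied from the output above


def split_system_chain(chain: str) -> tuple[int, str]:
    """Split '<instance_id>.<asym_id>' into its two parts."""
    instance_id, asym_id = chain.split(".", 1)
    return int(instance_id), asym_id


for system_chain, author_chain in chain_mapping.items():
    instance_id, asym_id = split_system_chain(system_chain)
    print(f"instance {instance_id}, asym_id {asym_id} -> author chain {author_chain}")
```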


### Linked apo and predicted structures.
The following properties and methods provide different kinds of information about linked structures:
- `.linked_archive`: returns the path to the local sub-folder where the linked structures are saved
- `.linked_structures`: returns the dataframe of linked structures along with all their metrics
- `.get_linked_structure`: gives the path to a specific linked structure
- `.best_linked_structures_paths`: gives the best linked structures based on `scrmsd_wave`, the average symmetry-corrected RMSD across mapped ligands weighted by number of atoms. This selects at most two alternate structures, with at most one `apo` and one `pred` when available
- `.alt_structures`: returns a dictionary of `Structure` objects for the best `apo` and `pred` structures, with the corresponding `holo` chain as key
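
To illustrate what an atom-weighted average RMSD means, the sketch below computes it for two ligands. The function is a plain restatement of the description of `scrmsd_wave` given above, not the library's implementation, and the input numbers are hypothetical.

```python
def atom_weighted_rmsd(scrmsds: list[float], n_atoms: list[int]) -> float:
    """Average of per-ligand RMSDs, weighted by the number of atoms per ligand."""
    total_atoms = sum(n_atoms)
    return sum(rmsd * n for rmsd, n in zip(scrmsds, n_atoms)) / total_atoms


# Hypothetical example: a 10-atom ligand at 1.0 Å and a 30-atom ligand at 3.0 Å
value = atom_weighted_rmsd([1.0, 3.0], [10, 30])
print(value)  # 2.5
```

The larger ligand dominates the average, so the result (2.5) sits closer to its RMSD than the unweighted mean (2.0) would.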

In [295]:
sample_system.linked_archive

PosixPath('/Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures')

In [296]:
sample_system.linked_structures

2024-09-21 11:07:33,340 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.18s
2024-09-21 11:07:33,703 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 1.06s


| Unnamed: 0 | `reference_system_id` | `id` | `pocket_fident` | `pocket_lddt` | `protein_fident_qcov_weighted_sum` | `protein_fident_weighted_sum` | `protein_lddt_weighted_sum` | `target_id` | `sort_score` | `receptor_file` | ... | `posebusters_most_extreme_ligand_element_waters` | `posebusters_most_extreme_protein_element_waters` | `posebusters_most_extreme_ligand_vdw_waters` | `posebusters_most_extreme_protein_vdw_waters` | `posebusters_most_extreme_sum_radii_waters` | `posebusters_most_extreme_distance_waters` | `posebusters_most_extreme_sum_radii_scaled_waters` | `posebusters_most_extreme_relative_distance_waters` | `posebusters_most_extreme_clash_waters` | `kind` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | `1avd__1__1.A__1.C` | `1vyo_B` | 100.0 | 83.0 | 98.0 | 98.0 | 93.0 | `1vyo` | 1.48 | `/plinder/2024-06/assignments/apo/1avd__1__1.A_...` | ... | | | | | | | | | | apo |
| 1 | `1avd__1__1.A__1.C` | `1vyo_A` | 100.0 | 82.0 | 98.0 | 98.0 | 93.0 | `1vyo` | 1.48 | `/plinder/2024-06/assignments/apo/1avd__1__1.A_...` | ... | | | | | | | | | | apo |
| 2 | `1avd__1__1.A__1.C` | `1nqn_A` | 100.0 | 76.0 | 94.0 | 99.0 | 88.0 | `1nqn` | 1.8 | `/plinder/2024-06/assignments/apo/1avd__1__1.A_...` | ... | | | | | | | | | | apo |
| 3 | `1avd__1__1.A__1.C` | `1rav_B` | 100.0 | 81.0 | 99.0 | 99.0 | 90.0 | `1rav` | 2.2 | `/plinder/2024-06/assignments/apo/1avd__1__1.A_...` | ... | | | | | | | | | | apo |
| 4 | `1avd__1__1.A__1.C` | `1rav_A` | 100.0 | 83.0 | 99.0 | 99.0 | 90.0 | `1rav` | 2.2 | `/plinder/2024-06/assignments/apo/1avd__1__1.A_...` | ... | | | | | | | | | | apo |
| 5 | `1avd__1__1.A__1.C` | `P02701_A` | 100.0 | 94.0 | 98.0 | 98.0 | 93.0 | `P02701` | 91.22 | `/plinder/2024-06/assignments/pred/1avd__1__1.A...` | ... | | | | | | | | | | pred |


In [299]:
sample_system.get_linked_structure(link_kind="apo", link_id='1vyo_B')

'/Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/1vyo_B.cif'

In [300]:
sample_system.best_linked_structures_paths

{'apo': {'1.A': '/Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/1vyo_A.cif'},
 'pred': {'1.A': '/Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/P02701_A.cif'}}

In [301]:
sample_system.alt_structures

{'apo': {'1.A': Structure(
      (
          'id',
          'linked_structures',
      ),
      (
          'protein_path',
          /Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/1vyo_A.cif,
      ),
      (
          'protein_sequence',
          /Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/sequences.fasta,
      ),
      (
          'list_ligand_sdf_and_input_smiles',
          None,
      ),
      (
          'protein_atom_array',
          <class 'biotite.structure.AtomArray'> with shape (947,),
      ),
      (
          'ligand_mols',
          {
  
          },
      ),
      (
          'add_ligand_hydrogens',
          False,
      ),
      (
          'structure_type',
          'apo',
      ),
  )},
 'pred': {'1.A': Structure(
      (
          'id',
          'linked_structures',
      ),
      (
          'protein_path',
          /Users/yusuf/.local/share/plinder/2024-06/v2/linked_structures/P02701_A.cif,
      ),
      (
     

### Get `OpenStructure` entities and views
- `.receptor_entity` returns the receptor as a `mol.EntityHandle` object
- `.ligand_views` returns a `mol.ResidueView` for each ligand

:::{note}
You must have OpenStructure installed to use these properties
:::


### Other properties
These include:
- `num_ligands`: Number of ligand chains
- `smiles`: Ligand smiles dictionary
- `num_proteins`: Number of protein chains


## 3. Interacting with the PLINDER dataset

In [254]:
from plinder.core.loader.dataset import get_torch_loader, PlinderDataset

#### Make plinder dataset

In [270]:
train_dataset = PlinderDataset(split="train")
#train_dataset = PlinderDataset(df=splits_df[splits_df.system_id =="6pl9__1__1.A__1.C"])

2024-09-20 17:09:18,162 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.00s
2024-09-20 17:09:20,068 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.68s
2024-09-20 17:09:22,749 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 4.59s
2024-09-20 17:09:23,120 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.08s


In [None]:
test_data = train_dataset[1]

test_data[110]

## 4. Loader

In [None]:
train_loader = get_torch_loader(
    train_dataset
)

In [None]:
for data in train_loader:
    # Keep the first batch for inspection
    test_torch = data
    break

In [None]:
test_torch.keys()

In [None]:
test_torch['system_ids']

In [None]:
for k, v in test_torch['features_and_coords'].items():
    print(k, v.shape)

In [266]:
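# Build the input paths from the system archive shown earlier
archive = sample_system.archive
protein_structure_path = archive / "receptor.cif"
input_sequence_path = archive / "sequences.fasta"
list_ligand_sdf_and_input_smiles = [
    (archive / "ligand_files" / "1.C.sdf", "CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O"),
]

holo_struc = Structure.load_structure(
    id=system_id,
    protein_path=protein_structure_path,
    protein_sequence=input_sequence_path,
    list_ligand_sdf_and_input_smiles=list_ligand_sdf_and_input_smiles,
)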

2024-09-20 16:58:27,992 | plinder.core.structure.atoms:158 | INFO : generate_conformer: MMFFOptimizeMolecule - more iterations are required, extending by 500 steps


[(PosixPath('/Users/yusuf/.local/share/plinder/2024-06/v2/systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf'), 'CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O')]


In [267]:
struct = Structure.load_structure(
    id="102m__1__1.A__1.C",
    protein_path=Path("/Users/yusuf/.local/share/plinder/2024-06/v2/systems/102m__1__1.A__1.C/receptor.cif"),
    protein_sequence=Path("/Users/yusuf/.local/share/plinder/2024-06/v2/systems/102m__1__1.A__1.C/sequences.fasta"),
    list_ligand_sdf_and_input_smiles=[
        (
            Path("/Users/yusuf/.local/share/plinder/2024-06/v2/systems/102m__1__1.A__1.C/ligand_files/1.C.sdf"),
            "C=CC1=C(C)C2=Cc3c(C)c(CCC(=O)O)c4n3[Fe]35<-N6=C(C=c7c(C=C)c(C)c(n73)=CC1=N->52)C(C)=C(CCC(=O)O)C6=C4",
        )
    ],
)



[(PosixPath('/Users/yusuf/.local/share/plinder/2024-06/v2/systems/102m__1__1.A__1.C/ligand_files/1.C.sdf'), 'C=CC1=C(C)C2=Cc3c(C)c(CCC(=O)O)c4n3[Fe]35<-N6=C(C=c7c(C=C)c(C)c(n73)=CC1=N->52)C(C)=C(CCC(=O)O)C6=C4')]


In [268]:
struct.input_ligand_conformers

{'1.C': <rdkit.Chem.rdchem.Mol at 0x1d23d8120>}

In [264]:
struct.ligand_mols