# Structure, Systems, Dataset and Loader Tutorial

# Topics:
- Interacting with the `Structure` class
- Interacting with `PlinderSystem`
- Interacting with `PlinderDataset`
- Loader



## Setup

### Installation

`plinder` is available on *PyPI*.

```
pip install plinder
```

### Environment variable configuration
:::{note}
We need to set environment variables to point to the release and iteration of choice.
For demonstration purposes, we point to a smaller tutorial dataset by setting
`PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=tutorial`.
:::
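
As a minimal sketch, the same variables can also be set from Python rather than from the shell. Setting them before `plinder` is imported is an assumption made here to be safe; the values are the tutorial ones named in the note above.

```python
import os

# Point PLINDER at the small tutorial dataset described in the note above.
# Assumption: these are set before importing plinder so the config sees them.
os.environ["PLINDER_RELEASE"] = "2024-06"
os.environ["PLINDER_ITERATION"] = "tutorial"

print(os.environ["PLINDER_RELEASE"], os.environ["PLINDER_ITERATION"])
```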

## Getting the configuration

First, we get the configuration to check that all parameters are set correctly.
In the snippet below, we check whether the local and remote *PLINDER* paths point to
the expected locations.

In [None]:
import plinder.core.utils.config

cfg = plinder.core.get_config()
print(f"local cache directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")

## Data ecosystem overview
This tutorial assumes that you have already downloaded the _PLINDER_ dataset. The examples will still run without any action on your part, but we encourage you to download the data first for better performance. The _PLINDER_ data hierarchy is shown below; this tutorial follows the same hierarchy from the ground up.
![image](../static/asset/data/plinder_data_hierarchy.png)

## 0. Structure files

After download, all files are stored locally at `~/.local/share/plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/systems`. The current defaults are `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`.

There, we have one sub-folder per system. Each sub-folder contains:
- Receptor structure: `receptor.cif` and `receptor.pdb`
- Ligand SDFs: `<biounit_instance_id>.<chain_id>.sdf`. For complexes with more than one ligand, all SDF files are saved
- Sequence FASTA: `sequence.fasta`

For more information on the file organization, see "<link-to-dataset-tutorial>"
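
The layout above can be explored with a short sketch like the one below. Both `systems_dir` and `list_system_files` are hypothetical helpers written for this tutorial, not part of `plinder`; they only assume the directory structure and default values described in this section.

```python
import os
from pathlib import Path


def systems_dir(release=None, iteration=None):
    """Build the local systems directory described above (hypothetical helper)."""
    release = release or os.environ.get("PLINDER_RELEASE", "2024-06")
    iteration = iteration or os.environ.get("PLINDER_ITERATION", "v2")
    return Path.home() / ".local" / "share" / "plinder" / release / iteration / "systems"


def list_system_files(system_id, root=None):
    """Return receptor, ligand SDF and FASTA paths found for one system."""
    folder = (root or systems_dir()) / system_id
    if not folder.is_dir():
        return {}  # nothing downloaded yet for this system
    return {
        "receptor": sorted(folder.glob("receptor.*")),
        "ligands": sorted(folder.rglob("*.sdf")),
        "sequences": sorted(folder.glob("*.fasta")),
    }


print(systems_dir("2024-06", "tutorial"))
```

Calling `list_system_files("1avd__1__1.A__1.C")` returns an empty dictionary when the system has not been downloaded, so it is safe to run before fetching any data.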

## 1. Structure Python Abstraction
To make interacting with our data seamless, we provide the `Structure` class, a pydantic data class that:
- Loads all the structure files and SMILES
- Gets coordinates
- Featurizes residues and atoms of the associated protein and ligand molecules
- Masks molecules to account for resolved vs. unresolved parts

To interact with the example, do the following:

### Load the structure for a given system_id
For this purpose we will use `"1avd__1__1.A__1.C"` as our example system id.

In [None]:
from plinder.core.structure.structure import Structure
from plinder.core import PlinderSystem
from pathlib import Path

system_id = "1avd__1__1.A__1.C"
linked_apo_id = "P02701_A"

# Load holo structure
holo_struc = PlinderSystem(system_id=system_id).holo_structure
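
The example system ID appears to follow the pattern `<PDB ID>__<biounit>__<receptor chains>__<ligand chains>`, with multiple chains joined by a single underscore. That reading is an assumption based on the ID used here, and `parse_system_id` below is a hypothetical helper, not a `plinder` function. A minimal sketch of splitting an ID under that assumption:

```python
from typing import NamedTuple


class SystemIdParts(NamedTuple):
    pdb_id: str
    biounit: str
    receptor_chains: list
    ligand_chains: list


def parse_system_id(system_id: str) -> SystemIdParts:
    # Assumed layout: four fields separated by a double underscore.
    pdb_id, biounit, receptors, ligands = system_id.split("__")
    # Assumed: several chains in one field are joined by a single underscore.
    return SystemIdParts(pdb_id, biounit, receptors.split("_"), ligands.split("_"))


parts = parse_system_id("1avd__1__1.A__1.C")
print(parts)
```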

### List fields
- We list all fields and their `FieldInfo` to show which ones are required. `id`, `protein_path` and `protein_sequence` are required; everything else is optional. Particularly worth mentioning is the decision to make `list_ligand_sdf_and_input_smiles` optional: ligands are not available in apo and predicted structures.
- Of these fields, `ligand_mols` and `protein_atom_array` are computed within the object if left at their defaults.
- `ligand_mols` returns a chain-mapped dictionary of the form:
    ```python
    {
        "<instance_id>.<chain_id>": (
            rdkit mol of template smiles of type `Chem.Mol`,
            random conformer of rdkit mol of template smiles of type `Chem.Mol`,
            conformer atoms to template smiles map of type `tuple[NDArray.int_, NDArray.int_]`,
            rdkit mol of solved ligand structure of type `Chem.Mol`,
            solved ligand atom to template smiles atom map of type `tuple[NDArray.int_, NDArray.int_]`,
            conformer atoms to solved ligand atom map of type `tuple[NDArray.int_, NDArray.int_]`
        )
    }
    ```
- `protein_atom_array` returns a [biotite AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) of the receptor protein structure.
- `add_ligand_hydrogens` specifies whether to add hydrogens to the ligand
- `structure_type` can be `"holo"`, `"apo"` or `"pred"`
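
Because each `ligand_mols` value is a six-element tuple, reading it by position is error-prone. The sketch below wraps the tuple in a `NamedTuple` so each member can be read by name. The field names are hypothetical labels chosen for this tutorial, following the tuple order listed above, and placeholder strings stand in for real RDKit molecules and index arrays.

```python
from typing import Any, NamedTuple


class LigandEntry(NamedTuple):
    """Hypothetical field names for one `ligand_mols` value, in tuple order."""

    template_mol: Any           # RDKit mol built from the template SMILES
    conformer_mol: Any          # random conformer of the template mol
    conformer_to_template: Any  # conformer atom -> template SMILES atom map
    resolved_mol: Any           # RDKit mol of the solved ligand structure
    resolved_to_template: Any   # solved ligand atom -> template SMILES atom map
    conformer_to_resolved: Any  # conformer atom -> solved ligand atom map


def name_ligand_entries(ligand_mols):
    """Wrap each 6-tuple so its members can be read by name."""
    return {key: LigandEntry(*value) for key, value in ligand_mols.items()}


# Placeholder values stand in for real RDKit molecules and index arrays.
demo = name_ligand_entries({"1.C": ("tmpl", "conf", "c2t", "res", "r2t", "c2r")})
print(demo["1.C"].resolved_mol)
```

With a loaded structure, the same helper could be applied as `name_ligand_entries(holo_struc.ligand_mols)`, assuming the tuple order matches the description above.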

In [None]:
# Show fields
holo_struc.model_fields

In [None]:
# Inspect ligand_mols
holo_struc.ligand_mols

In [None]:
# Inspect protein_atom_array
holo_struc.protein_atom_array[0]

### List structure protein properties
Show protein-related properties.

In [None]:
# Use `prop` to avoid shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "protein" in prop:
        print(prop)

#### Protein backbone mask
This is a boolean mask that can be used to select backbone atoms from a biotite `AtomArray`. The positions that are `True` correspond to backbone atoms.

In [None]:
holo_struc.protein_backbone_mask
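
To illustrate how such a boolean mask selects atoms, here is a toy sketch in plain Python. The atom names are made up, and treating N, CA and C as the backbone is an assumption; the definition used inside PLINDER may differ.

```python
from itertools import compress

# Toy atom names for two residues; a real AtomArray holds these per atom.
atom_names = ["N", "CA", "C", "O", "CB", "N", "CA", "C", "O"]

# Assumption: backbone is taken as N, CA and C; PLINDER's definition may differ.
backbone_mask = [name in {"N", "CA", "C"} for name in atom_names]
calpha_mask = [name == "CA" for name in atom_names]

# Keep only the entries whose mask value is True.
backbone_atoms = list(compress(atom_names, backbone_mask))
# Positions of True values are the indices of the selected atoms.
calpha_indices = [i for i, keep in enumerate(calpha_mask) if keep]
print(backbone_atoms, calpha_indices)
```

With a biotite `AtomArray`, the same selection is typically written as boolean indexing, presumably `holo_struc.protein_atom_array[holo_struc.protein_backbone_mask]` here.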

#### Protein Calpha mask
This is the boolean mask of C-alpha atoms.

In [None]:
holo_struc.protein_calpha_mask

### Get protein chain ordered
This gives a list of protein chains in the order they appear in the structure.

In [None]:
holo_struc.protein_chain_ordered

### Get protein chains for all atoms
The list of chain IDs in the structure. The order in which they appear is not preserved.

In [None]:
holo_struc.protein_chains
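
The difference between an ordered and an unordered list of chains can be shown with a toy example. The chain labels below are invented, and this is only a sketch of the idea, not how `plinder` computes either property.

```python
# Toy per-atom chain annotation, in the order atoms appear in a structure.
chain_per_atom = ["B", "B", "A", "A", "B", "C"]

# dict.fromkeys keeps the first occurrence of each key in insertion order.
chains_ordered = list(dict.fromkeys(chain_per_atom))

# A set deduplicates too, but makes no promise about order.
chains_unordered = set(chain_per_atom)

print(chains_ordered)
```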

### Get protein coordinates
This property gets the 3D positions of each atom in the protein molecules.

In [None]:
holo_struc.protein_coords

### Get number of atoms of protein molecule

In [None]:
holo_struc.protein_n_atoms

### Get protein structure atom names
Returns all atom names as they appear in the structure.

In [None]:
holo_struc.protein_unique_atom_names

### Get protein B-factors
Get protein atom B-factors. If not available in a structure, they are set to zero.

In [None]:
holo_struc.protein_structure_b_factor

### Get protein residue names

In [None]:
holo_struc.protein_unique_residue_names

### Get protein residue numbers
Residue numbers as they appear in the structure.

In [None]:
holo_struc.protein_unique_residue_ids

### Get sequence from protein structure


In [None]:
holo_struc.protein_sequence_from_structure

In [None]:
holo_struc.protein_sequence

### Get tokenized sequence
Get a tensor of the sequence converted to integer-based amino acid tokens.

In [None]:
holo_struc.protein_structure_tokenized_sequence
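
As a rough illustration of what integer-based tokenization means, the sketch below maps one-letter amino acid codes to indices. The vocabulary and the `tokenize` helper are hypothetical; the token values PLINDER actually assigns may differ, and the real property returns a tensor rather than a list.

```python
# Hypothetical vocabulary: the 20 standard amino acids ordered by one-letter
# code, plus one extra index for anything unrecognised.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_OF = {letter: index for index, letter in enumerate(ALPHABET)}
UNKNOWN = len(ALPHABET)


def tokenize(sequence):
    """Map each residue letter to its integer token (hypothetical helper)."""
    return [TOKEN_OF.get(letter, UNKNOWN) for letter in sequence.upper()]


tokens = tokenize("ACDX")
print(tokens)
```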

#### Inspect holo sequences
Returns a chain-mapped dictionary of sequences from SEQRES:
```python
{
    "<instance_id>.<chain_id>": sequence of type `str`
}
```

In [None]:
holo_struc.protein_sequence

In [None]:
holo_struc.filter(
    property="atom_name",
    mask="CA",
)

### List ligand properties
Show ligand-related properties.

In [None]:
# Use `prop` to avoid shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "ligand" in prop:
        print(prop)

:::{todo}
- Vladas to write the description for the ligand properties
:::

### Ligand atom id mapping
TODO: Vladas

Conformer to solved structure mappings

In [None]:
holo_struc.input_ligand_conformer2resolved_stacks

### Ligand conformer to input smiles mapping
TODO: Vladas

In [None]:
holo_struc.input_ligand_conformer2smiles_stacks

### Ligand conformer coordinates
TODO: Vladas


In [None]:
holo_struc.input_ligand_conformer_coords

## 3. Interacting with the PLINDER dataset

In [None]:
from plinder.core.loader.dataset import get_torch_loader, PlinderDataset

#### Make plinder dataset

In [None]:
train_dataset = PlinderDataset(split="train")
#train_dataset = PlinderDataset(df=splits_df[splits_df.system_id =="6pl9__1__1.A__1.C"])

In [None]:
test_data = train_dataset[1]

test_data[110]

## 4. Loader

In [None]:
train_loader = get_torch_loader(
    train_dataset
)

In [None]:
# Grab the first batch from the loader
test_torch = next(iter(train_loader))

In [None]:
test_torch.keys()

In [None]:
test_torch['system_ids']

In [None]:
for k, v in test_torch['features_and_coords'].items():
    print(k, v.shape)
