# Structure, Systems, Dataset and Loader Tutorial

# Topics:
- Interacting with the `Structure` class
- Interacting with `PlinderSystem`
- Interacting with `PlinderDataset`
- Loader



## Setup

### Installation

`plinder` is available on *PyPI*.

```
pip install plinder
```

### Environment variable configuration
:::{note}
We need to set environment variables to point to the release and iteration of choice.
For the sake of demonstration, these are set to point to a smaller tutorial example
dataset: `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=tutorial`.
:::
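
As a minimal sketch, the two variables can be set from within Python. This assumes they need to be in place before `plinder` is imported so that the configuration picks them up; exporting them in the shell before starting Python achieves the same thing.

```python
import os

# Point PLINDER at the small tutorial dataset (values taken from the note above).
os.environ["PLINDER_RELEASE"] = "2024-06"
os.environ["PLINDER_ITERATION"] = "tutorial"

# Assumption: set these before importing `plinder`, so the config can read them.
print(os.environ["PLINDER_RELEASE"], os.environ["PLINDER_ITERATION"])
```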

## Getting the configuration

First, we get the configuration to check that all parameters are set correctly.
In the snippet below, we check whether the local and remote *PLINDER* paths point to
the expected locations.

In [None]:
import plinder.core.utils.config

cfg = plinder.core.get_config()
print(f"local cache directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")

## Data ecosystem overview
This tutorial assumes that users have already downloaded the _PLINDER_ dataset. The examples will run without any prior download, but we encourage users to download the data first for the sake of performance. The _PLINDER_ data hierarchy is shown below. This tutorial follows the same hierarchy from the ground up.
![image](../static/asset/data/plinder_data_hierarchy.png)

## 0. Structure files

After download, all files are stored locally at `~/.local/share/plinder/${PLINDER_RELEASE}/${PLINDER_ITERATION}/systems`. The current default is `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`.

There, we have one sub-folder for each system. In each sub-folder, we have:
- Receptor structure: `receptor.cif` and `receptor.pdb`
- Ligand SDFs: `<biounit_instance_id>.<chain_id>.sdf`. For complexes with more than one ligand, each ligand is saved in a separate SDF file.
- Sequence FASTA: `sequence.fasta`

For more information on the file organization, see "<link-to-dataset-tutorial>"
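
To make the layout concrete, here is a minimal sketch that builds the expected file paths for one system. `system_files` is a hypothetical helper, not part of `plinder`; the file names follow the list above, and the `ligand_files` subfolder is taken from the paths used later in this tutorial.

```python
from pathlib import Path


def system_files(release: str, iteration: str, system_id: str, ligand_ids: list[str]) -> dict[str, Path]:
    """Hypothetical helper: build the expected file paths for one system."""
    root = Path.home() / ".local" / "share" / "plinder" / release / iteration / "systems" / system_id
    files = {
        "receptor_cif": root / "receptor.cif",
        "receptor_pdb": root / "receptor.pdb",
        "sequence_fasta": root / "sequence.fasta",
    }
    # One SDF file per ligand chain
    for ligand_id in ligand_ids:
        files[f"ligand_{ligand_id}"] = root / "ligand_files" / f"{ligand_id}.sdf"
    return files


paths = system_files("2024-06", "v2", "1avd__1__1.A__1.C", ["1.C"])
for name, path in paths.items():
    print(f"{name}: {path} (exists: {path.exists()})")
```

Printing `path.exists()` is a quick way to check whether the data has already been downloaded to the default location.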

## 1. Structure Python Abstraction
To make interacting with our data seamless, we provide the `Structure` class, a pydantic data class that:
- Loads all the structure files and SMILES
- Gets coordinates
- Featurizes residues and atoms of the associated protein and ligand molecules
- Masks molecules to account for resolved vs. unresolved parts

To interact with the example, do the following:

### Load the structure for a given system_id
For this purpose we will use `"1avd__1__1.A__1.C"` as our example system id.

In [295]:
from plinder.core.structure.structure import Structure
from plinder.core import PlinderSystem
from pathlib import Path

from biotite.sequence.io.fasta import FastaFile

input_smiles = "CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O)O" # Need to account for unresolved part of the ligand
input_sdf = Path(cfg.data.plinder_dir)/"systems/1avd__1__1.A__1.C/ligand_files/1.C.sdf"
system_id = "1avd__1__1.A__1.C"
protein_structure_path = Path(cfg.data.plinder_dir)/"systems/1avd__1__1.A__1.C/receptor.cif"
input_sequence_path = Path(cfg.data.plinder_dir)/"systems/1avd__1__1.A__1.C/sequences.fasta"
list_ligand_sdf_and_input_smiles = [(input_sdf, input_smiles)]

input_sequences = {k: v for k, v in FastaFile.read_iter(input_sequence_path)}

# Load holo structure
holo_struc = PlinderSystem(system_id=system_id).holo_structure

2024-09-23 10:15:26,737 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-23 10:15:26,737 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-23 10:15:26,867 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-23 10:15:26,868 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s
2024-09-23 10:15:26,868 | plinder.core.index.utils:148 | INFO : loading entries from 1 zips
2024-09-23 10:15:26,871 | plinder.core.index.utils:163 | INFO : loaded 1 entries
2024-09-23 10:15:26,872 | plinder.core.index.utils.load_entries:24 | INFO : runtime succeeded: 0.13s
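
The system id itself encodes where the system comes from. The sketch below splits it into its parts, assuming the `<pdb_id>__<biounit>__<receptor chains>__<ligand chains>` layout, with multiple chains joined by a single underscore. `parse_system_id` and `SystemId` are hypothetical names used for illustration only.

```python
from typing import NamedTuple


class SystemId(NamedTuple):
    pdb_id: str
    biounit: str
    receptor_chains: list[str]
    ligand_chains: list[str]


def parse_system_id(system_id: str) -> SystemId:
    """Hypothetical helper: split a system id on the double-underscore separators."""
    pdb_id, biounit, receptors, ligands = system_id.split("__")
    return SystemId(pdb_id, biounit, receptors.split("_"), ligands.split("_"))


parsed = parse_system_id("1avd__1__1.A__1.C")
print(parsed)
```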


In [None]:
# Show fields
holo_struc.model_fields

### List fields
- We list all fields and their `FieldInfo` to show which ones are required. `id`, `protein_path` and `protein_sequence` are required. Everything else is optional. Particularly worth mentioning is the decision to make `list_ligand_sdf_and_input_smiles` optional; this is because ligands are not available in apo and predicted structures.
- Of these fields, `ligand_mols` and `protein_atom_array` are computed within the object if left at their defaults.
- `ligand_mols` returns a chain-mapped dictionary of the form:
    ```python
    {
        "<instance_id>.<chain_id>": (
            RDKit 2D mol from template SMILES of type `Chem.Mol`,
            RDKit mol from template SMILES with random 3D conformer of type `Chem.Mol`,
            RDKit mol of solved (holo) ligand structure of type `Chem.Mol`,
            paired stacked arrays (template vs holo) mapping atom order by index of type `tuple[NDArray.int_, NDArray.int_]`
        )

    }
    ```
- `protein_atom_array` returns a [biotite AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) of the receptor protein structure.
- `add_ligand_hydrogens` specifies whether to add hydrogens to the ligand.
- `structure_type` can be `"holo"`, `"apo"` or `"pred"`.
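
Because `ligand_mols` stores a plain four-element tuple per ligand chain, it can help to name the slots when unpacking. The sketch below uses stand-in strings in place of RDKit mols so that it runs on its own; `LigandEntry` is a hypothetical wrapper, not a `plinder` class.

```python
from typing import Any, NamedTuple


class LigandEntry(NamedTuple):
    """Hypothetical wrapper naming the four tuple slots described above."""
    template: Any           # 2D mol from template SMILES
    conformer: Any          # template mol with a random 3D conformer
    resolved: Any           # mol of the solved (holo) ligand
    atom_order_stacks: Any  # (template_stack, holo_stack) index arrays


# Stand-in data with the same shape as `ligand_mols`; strings replace RDKit mols.
ligand_mols = {
    "1.C": ("template_mol", "conformer_mol", "resolved_mol", ([[1, 0, 2]], [[0, 1, 2]])),
}

for chain, entry in ligand_mols.items():
    named = LigandEntry(*entry)
    template_stack, holo_stack = named.atom_order_stacks
    print(chain, named.resolved, len(template_stack), "possible atom mapping(s)")
```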

In [296]:
# Inspect ligand_mols
holo_struc.ligand_mols

{'1.C': (<rdkit.Chem.rdchem.Mol at 0x1bd0067a0>,
  <rdkit.Chem.rdchem.Mol at 0x1bd005070>,
  <rdkit.Chem.rdchem.Mol at 0x1bd006810>,
  (array([[13,  4,  5,  7,  9, 10,  1,  0,  3,  6,  8, 12, 11,  2]]),
   array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]])))}

In [None]:
holo_struc.ligand_smiles

In [None]:
# Inspect protein_atom_array
holo_struc.protein_atom_array[0]

### List structure protein properties
Show protein related properties

In [None]:
# `prop` avoids shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "protein" in prop:
        print(prop)

#### Protein backbone mask
This is a boolean mask that can be used to select backbone atoms from a biotite `AtomArray`. The positions that are `True` correspond to backbone atoms.

In [None]:
holo_struc.protein_backbone_mask
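
To make the idea of a boolean atom mask concrete, here is a minimal sketch on toy data. It uses plain Python lists so that it runs without extra dependencies, and it assumes that "backbone" means the N, CA and C atoms of each residue. With a NumPy array or a biotite `AtomArray`, the same selection is written `array[mask]`.

```python
# Toy atom names for two residues; in practice these come from the AtomArray.
atom_names = ["N", "CA", "C", "O", "CB", "N", "CA", "C", "O"]

# Assumption: "backbone" means the N, CA and C atoms of each residue.
backbone_names = {"N", "CA", "C"}
backbone_mask = [name in backbone_names for name in atom_names]
calpha_mask = [name == "CA" for name in atom_names]

# Positions where the mask is True are the selected atoms.
backbone_indices = [i for i, keep in enumerate(backbone_mask) if keep]
calpha_indices = [i for i, keep in enumerate(calpha_mask) if keep]
print(backbone_indices)
print(calpha_indices)
```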

#### Protein Calpha mask
This is the corresponding boolean mask for selecting Calpha atoms.

In [None]:
holo_struc.protein_calpha_mask

### Get protein chain ordered
This gives a list of protein chains in the order in which they appear in the structure.

In [None]:
holo_struc.protein_chain_ordered

### Get protein chains for all atoms
The list of chain IDs in the structure. The order in which they appear is not preserved.

In [None]:
holo_struc.protein_chains

### Get protein coordinates
This property gets the 3D positions of each of the atoms in protein molecules

In [None]:
holo_struc.protein_coords

### Get number of atoms of protein molecule

In [None]:
holo_struc.protein_n_atoms

### Get protein structure atom names
Returns all atom names as they appear in the structure.

In [None]:
holo_struc.protein_unique_atom_names

### Get protein b-factors
Get protein atom B-factors. If not available in a structure, they are set to zero.

In [None]:
holo_struc.protein_structure_b_factor

### Get protein residue names

In [None]:
holo_struc.protein_unique_residue_names

### Get protein residues number
Residue numbers as they appear in the structure.

In [None]:
holo_struc.protein_unique_residue_ids

### Get sequence from protein structure


In [None]:
holo_struc.protein_sequence_from_structure

In [None]:
holo_struc.protein_sequence

### Get tokenized sequence
Get a tensor of the sequence converted to integer-based amino acid tokens.

In [None]:
holo_struc.protein_structure_tokenized_sequence
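
As a rough illustration of what integer tokenization means, the sketch below maps one-letter amino acid codes to indices. The vocabulary (alphabetical one-letter codes plus one index for unknown residues) is an assumption made for this example; the actual token mapping used by `plinder` may differ.

```python
# Assumption: a 20-letter vocabulary in alphabetical order of one-letter codes,
# plus one extra index for anything unknown. PLINDER's own mapping may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_OF = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN_TOKEN = len(AMINO_ACIDS)


def tokenize(sequence: str) -> list[int]:
    """Convert a one-letter amino acid sequence into a list of integer tokens."""
    return [TOKEN_OF.get(aa, UNKNOWN_TOKEN) for aa in sequence.upper()]


print(tokenize("ACDX"))
```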

#### Inspect holo sequences
Returns a chain-mapped dictionary of sequences from seqres
```python
{
    "<instance_id>.<chain_id>": sequence of type `str`

}
```

In [None]:
holo_struc.protein_sequence

In [None]:
holo_struc.filter(
    property="atom_name",
    mask="CA",
)

### List ligand properties
Show ligand related properties

In [None]:
# `prop` avoids shadowing the built-in `property`
for prop in holo_struc.get_properties():
    if "ligand" in prop:
        print(prop)

:::{todo}

The input ligands are provided using dictionaries.
These dictionaries contain information for each ligand:
- `input_ligand_templates`: 2D RDKit mols generated from RDKit canonical SMILES (taken from annotation table)
- `input_ligand_conformers`: 3D (random) conformers generated for each input mol
- `input_ligand_conformers_coords`: positional coordinates for 3D conformers
- `resolved_ligand_mols`: RDKit mols of solved (holo) ligand structures
- `resolved_ligand_mols_coords`: positional coordinates for holo ligand structures
- `ligand_template2resolved_atom_order_stacks`: paired stacked arrays (template vs holo) mapping atom order by index
- `ligand_chain_ordered`: ordered list of all ligands by their keys

:::

### Ligand atom id mapping

Unlike the protein sequence, there is no canonical order to ligand atoms in the molecule.
It can be further complicated by automorphisms present in the structure due to symmetry, i.e. more than one match is possible between the structures.

This is important when calculating ligand structure loss, as the optimal atom order can change between different inference results. Typically, the accepted approach is to take the atom ordering that results in the best objective score and use that for the loss calculation.

Occasionally, further ambiguity arises due to part of the ligand structure being unresolved in the holo structure, which can lead to multiple available matches. We use the RascalMCES algorithm from RDKit to provide all the possible matches between the atom order in the input structure (from SMILES) and the resolved holo structure.

This is provided as stacks of atom order arrays that reorder the template and holo indices to provide matches. Each stack is a unique order transformation and should be iterated.
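
The following sketch shows one way to iterate over such stacks and keep the ordering with the best score. It uses RMSD as the objective and plain Python lists on a toy three-atom ligand; `rmsd` and `best_atom_order` are hypothetical helpers, and a real loss would typically operate on tensors instead.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    squared = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(squared / len(a))


def best_atom_order(pred_coords, holo_coords, template_stack, holo_stack):
    """Try every paired ordering and return (score, template_order, holo_order) of the best."""
    best = None
    for t_order, h_order in zip(template_stack, holo_stack):
        pred = [pred_coords[i] for i in t_order]
        ref = [holo_coords[j] for j in h_order]
        score = rmsd(pred, ref)
        if best is None or score < best[0]:
            best = (score, t_order, h_order)
    return best


# Toy symmetric ligand: atoms 0 and 2 are interchangeable.
holo = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]  # same pose, flipped atom order
template_stack = [[0, 1, 2], [2, 1, 0]]
holo_stack = [[0, 1, 2], [0, 1, 2]]

score, t_order, h_order = best_atom_order(pred, holo, template_stack, holo_stack)
print(score, t_order, h_order)
```

With the identity ordering the prediction looks wrong even though the pose is identical; the second ordering in the stack recovers the match.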

### Ligand conformer to input smiles mapping

Each RDKit ligand mol that is generated from SMILES matches the atom order of the starting SMILES.
This order is retained when the 3D conformer is generated.

NOTE: While we ensure that each PLINDER ligand can be loaded and sanitized into an RDKit 2D molecule, some ligands may fail to generate sanitizable 3D conformers. We nonetheless try to provide each ligand with as sensible a starting structure as possible and make it accessible via coordinate arrays.

In [None]:
# sample_system_2 = PlinderSystem(
#     system_id="102m__1__1.A__1.C",
# )
# sample_system_2.holo_structure.ligand_mols['1.C'][1]

In [None]:
holo_struc.ligand_template2resolved_atom_order_stacks

Below we use RDKit functionality to draw atom indices for the 2D template, conformer and holo structures of the `'1.C'` ligand.

In [None]:
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.drawOptions.addAtomIndices = True

In [None]:
holo_struc.input_ligand_templates["1.C"]

In [None]:
holo_struc.input_ligand_conformers["1.C"]

In [None]:
holo_struc.resolved_ligand_mols["1.C"]

### Ligand conformer coordinates

As you can tell, the atom indices of the input 2D template and the 3D conformer match, but those of the resolved ligand differ.
Thus, to compare their coordinates correctly, one should use the atom order stacks.


In [None]:
input_atom_order_stack, holo_atom_order_stack = holo_struc.ligand_template2resolved_atom_order_stacks["1.C"]

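In [None]:
# Resolved (holo) coordinates, reordered by the holo side of the mapping
holo_struc.resolved_ligand_mols_coords['1.C'][holo_atom_order_stack]

In [None]:
# Conformer coordinates, reordered by the template (input) side of the mapping
holo_struc.input_ligand_conformers_coords['1.C'][input_atom_order_stack]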

## 2. Interacting with PLINDER systems
`PlinderSystem` is the next layer of abstraction above `Structure`. It encapsulates all structures associated with a particular `system_id`. With it, we can access the `holo` structure and the alternate (`apo` and `pred`) structures.

In [None]:
sample_system = PlinderSystem(
    system_id="1avd__1__1.A__1.C",
)

### Check holo structure
Since having a `holo` structure is a defining feature of a _PLINDER_ system, a holo structure is by definition available for every system.

In [None]:
sample_system.holo_structure

### Get annotations 
The `system` property returns the `json` annotation data for the system in question. To get the annotations of all other systems sharing the same PDB entry id, use the `.entry` property.

In [None]:
sample_system.system

### Get paths of the underlying structure files

`archive` points to the subfolder where all the files (except `apo` and `pred` files) relating to a given system are stored

In [None]:
sample_system.archive

Similarly, `system.cif`, `receptor.cif`, `receptor.pdb`, `sequence.fasta`, and the ligand SDFs can be accessed via the `.system_cif`, `.receptor_cif`, `.receptor_pdb`, `.sequence_fasta` and `.ligand_sdf` properties, respectively.

To get all the paths of the structures together, use `.structures`

In [None]:
sample_system.structures

In [None]:
sample_system.system_cif

### Get binding site water (`.water_mapping`)
This returns information about binding site waters.

In [None]:
sample_system.water_mapping

### Chain mapping 

`.chain_mapping` maps chain ids in the system (`<instance_id>.<asym_id>`) to PDB author chain ids.

In [None]:
sample_system.chain_mapping

### Linked apo and predicted structures
The following properties provide different kinds of information about linked structures:
- `.linked_archive`: returns the path to the local subfolder where the linked structures are saved.
- `.linked_structures`: returns the dataframe of linked structures along with all their metrics.
- `.get_linked_structure`: gives the path to a specific linked structure.
- `.best_linked_structures_paths`: gives the best linked structures based on `scrmsd_wave`, the average symmetry-corrected RMSD across mapped ligands weighted by number of atoms. This selects a maximum of two alternate structures, with at most one `apo` and one `pred` when available.
- `.alternate_structures`: returns a dictionary of `Structure` objects for the best `apo` and `pred` structures, with the corresponding `holo` chain as key.
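
The selection logic described for the best linked structures can be sketched on toy records as follows. The record keys `kind` and `id`, and both helper functions, are hypothetical; only `scrmsd_wave` and its definition as an atom-count-weighted average come from the description above. Lower RMSD is treated as better.

```python
def weighted_scrmsd(rmsds: list[float], n_atoms: list[int]) -> float:
    """Average of per-ligand RMSDs weighted by the number of ligand atoms."""
    return sum(r * n for r, n in zip(rmsds, n_atoms)) / sum(n_atoms)


def pick_best(links: list[dict]) -> dict[str, dict]:
    """Keep the lowest-`scrmsd_wave` record for each kind ("apo" or "pred")."""
    best: dict[str, dict] = {}
    for link in links:
        kind = link["kind"]
        if kind not in best or link["scrmsd_wave"] < best[kind]["scrmsd_wave"]:
            best[kind] = link
    return best


# Toy records; ids and scores are made up for illustration.
links = [
    {"kind": "apo", "id": "apo_1", "scrmsd_wave": weighted_scrmsd([1.0, 3.0], [10, 30])},
    {"kind": "apo", "id": "apo_2", "scrmsd_wave": 1.2},
    {"kind": "pred", "id": "pred_1", "scrmsd_wave": 0.8},
]
best = pick_best(links)
print({kind: link["id"] for kind, link in best.items()})
```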

In [None]:
sample_system.linked_archive

In [None]:
sample_system.linked_structures

In [None]:
sample_system.get_linked_structure(link_kind="apo", link_id='1vyo_B')

In [None]:
sample_system.alternate_structures

### Get `OpenStructure` entities and views
- `.receptor_entity` returns the receptor `mol.EntityHandle` object
- `.ligand_views` returns a `mol.ResidueView` for all ligands

:::{note}
You must have OpenStructure installed to use these properties.
:::


### Other properties
These include:
- `num_ligands`: Number of ligand chains
- `smiles`: Ligand smiles dictionary
- `num_proteins`: Number of protein chains


## 3. Interacting with the PLINDER dataset
`PlinderDataset` provides an interface to interact with _PLINDER_ data as a dataset. It is a subclass of `torch.utils.data.Dataset`, so subclassing and extending it should be familiar to most users. Flexibility and general applicability were our top concerns when designing this interface: `PlinderDataset` allows users not only to define their own split but also to bring their own featurizer.
It can be initialized with the following parameters:
```
Parameters
----------
df : pd.DataFrame | None
    the split to use
split : str
    the split to sample from
split_parquet_path : str | Path, default=None
    split parquet file
input_structure_priority : str, default="apo"
    Which alternate structure to prioritize
featurizer : Callable[[Structure, int], dict[str, torch.Tensor]], default=structure_featurizer
    Transformation to turn structure to input tensors
padding_value : int
    Value for padding uneven array
**kwargs : Any
    Any other keyword args
``` 
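
To illustrate what `padding_value` is for, the sketch below pads rows of uneven length to a common width. It is a plain-Python illustration only: `pad_to_longest` is a hypothetical helper, the value `-100` is arbitrary, and how `PlinderDataset` pads internally is not shown in this tutorial.

```python
def pad_to_longest(rows: list[list[int]], padding_value: int) -> list[list[int]]:
    """Right-pad every row with `padding_value` up to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [row + [padding_value] * (width - len(row)) for row in rows]


# Two "systems" with a different number of atoms each
batch = pad_to_longest([[1, 2, 3], [4]], padding_value=-100)
print(batch)
```

Choosing a padding value that cannot occur in real features makes it easy to mask the padded positions later.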

In [None]:
from plinder.core.loader import PlinderDataset

#### Make _PLINDER_ training dataset with default parameters
When no parameter is set, `PlinderDataset` defaults to the training set of the most current version of the dataset. To change this behaviour, we can explicitly pass a data frame to the parameter `df` or a split file to `split_parquet_path`. Either of these must have at least two columns named `system_id` and `split`. This also uses our default featurizer, `plinder.core.loader.featurizer.featurizer`.
NOTE: We provide `plinder.core.loader.featurizer.featurizer` as an example featurizer; users are encouraged to use featurizers that suit their needs.

In [None]:
train_dataset = PlinderDataset(split="train")
#train_dataset = PlinderDataset(df=splits_df[splits_df.system_id =="6pl9__1__1.A__1.C"])

In [None]:
test_data = train_dataset[1]

test_data[110]

## 4. Loader

In [None]:
# NOTE: get_torch_loader must already be imported; its import is not shown in this tutorial
train_loader = get_torch_loader(train_dataset)

In [None]:
for data in train_loader:

    test_torch = data
    break
    #for k, v in test_torch['input_features'].items():
    #    if v.shape[1] > 1:
    #        break

In [None]:
test_torch.keys()

In [None]:
test_torch['system_ids']

In [None]:
for k, v in test_torch['features_and_coords'].items():
    print(k, v.shape)

In [None]:
holo_struc.ligand_mols

In [None]:
holo_struc.input_ligand_conformer2resolved_stacks

In [None]:
holo_struc = Structure.load_structure(
    id=system_id,
    protein_path=protein_structure_path,
    protein_sequence=input_sequence_path,
    list_ligand_sdf_and_input_smiles=list_ligand_sdf_and_input_smiles,
)

In [None]:
# Build paths from the configured data directory rather than a user-specific absolute path
system_dir = Path(cfg.data.plinder_dir) / "systems" / "102m__1__1.A__1.C"
struct = Structure.load_structure(
    id="102m__1__1.A__1.C",
    protein_path=system_dir / "receptor.cif",
    protein_sequence=system_dir / "sequence.fasta",
    list_ligand_sdf_and_input_smiles=[(
        system_dir / "ligand_files" / "1.C.sdf",
        "C=CC1=C(C)C2=Cc3c(C)c(CCC(=O)O)c4n3[Fe]35<-N6=C(C=c7c(C=C)c(C)c(n73)=CC1=N->52)C(C)=C(CCC(=O)O)C6=C4",
    )],
)

In [None]:
struct.input_ligand_conformers

In [None]:
struct.ligand_mols