# Structure abstraction

In [None]:
from rdkit.Chem.Draw import IPythonConsole

IPythonConsole.drawOptions.addAtomIndices = True

To enable using systems and linked structure files in training deep learning models, we've implemented a number of useful functions to align, mask, and featurize proteins and ligands.

For this, we convert our `PlinderSystem` to a `Structure` object.

In [None]:
from plinder.core import PlinderSystem

plinder_system = PlinderSystem(system_id="4agi__1__1.C__1.W")

system_structure = plinder_system.holo_structure

In [None]:
system_structure

- We list all fields and their `FieldInfo` to show which ones are required. `id`, `protein_path` and `protein_sequence` are required; everything else is optional. Notably, `list_ligand_sdf_and_input_smiles` is optional because ligands are not available in apo and predicted structures.
- Of these fields, `ligand_mols` and `protein_atom_array` are computed within the object if left at their defaults.
- `ligand_mols` returns a chain-mapped dictionary of the form:
    ```
    {
        "<instance_id>.<chain_id>": (
            RDKit 2D mol from template SMILES of type `Chem.Mol`,
            RDKit mol from template SMILES with random 3D conformer of type `Chem.Mol`,
            RDKit mol of solved (holo) ligand structure of type `Chem.Mol`,
            paired stacked arrays (template vs holo) mapping atom order by index of type `tuple[NDArray.int_, NDArray.int_]`
        )

    }
    ```
- `protein_atom_array` returns a [biotite AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) of the receptor protein structure.
- `add_ligand_hydrogens` specifies whether to add hydrogens to the ligand.
- `structure_type`: can be `"holo"`, `"apo"` or `"pred"`.

In [None]:
system_structure.model_fields

## Ligand

In [None]:
for name in system_structure.get_properties():
    if "ligand" in name:
        print(name)

The ligands are provided using dictionaries.

These dictionaries contain information for each ligand:
- `input_ligand_templates`: 2D RDKit mols generated from the RDKit canonical SMILES
- `input_ligand_conformers`: 3D (random) conformers generated for each input mol
- `input_ligand_conformers_coords`: positional coordinates for 3D conformers
- `resolved_ligand_mols`: RDKit mols of solved (holo) ligand structures
- `resolved_ligand_mols_coords`: positional coordinates for holo ligand structures
- `ligand_template2resolved_atom_order_stacks`: paired stacked arrays (template vs holo) mapping atom order by index
- `ligand_chain_ordered`: ordered list of all ligands by their keys

### Ligand atom ID mapping

Unlike the protein sequence, ligand atoms have no canonical order in the molecule.
Automorphisms arising from symmetry complicate this further: more than one match between the structures can be valid.

This matters when calculating a ligand structure loss, because the optimal atom order can change between inference results. A common practice is to take the atom ordering that gives the best objective score and use it for the loss calculation.

Occasionally, further ambiguity arises when part of the ligand is unresolved in the holo structure, which can lead to multiple valid matches. We use the RascalMCES algorithm from RDKit to provide all possible matches between the atom order in the input structure (from SMILES) and the resolved holo structure.

These matches are provided as stacks of atom order arrays that reorder the template and holo indices so that they correspond. Each stack is a unique order transformation, and all stacks should be iterated over.
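To make the "best objective score" idea concrete, here is a minimal sketch that evaluates an RMSD for every stack and keeps the lowest one. It uses toy NumPy arrays rather than a real `Structure`; the function name `best_rmsd_over_stacks` is hypothetical, and the assumption that each stack array has shape `(n_stacks, n_matched_atoms)` is inferred from how the stacks are used for indexing in this tutorial.

```python
import numpy as np


def best_rmsd_over_stacks(pred_coords, holo_coords, template_stack, holo_stack):
    """Return (lowest RMSD, index of the stack achieving it).

    Hypothetical helper: stacks are assumed to be integer arrays of shape
    (n_stacks, n_matched_atoms), coordinates of shape (n_atoms, 3).
    """
    # Indexing with a 2D integer array gives (n_stacks, n_matched_atoms, 3)
    pred = pred_coords[template_stack]
    ref = holo_coords[holo_stack]
    # One RMSD value per stack
    rmsd = np.sqrt(((pred - ref) ** 2).sum(axis=-1).mean(axis=-1))
    best = int(rmsd.argmin())
    return float(rmsd[best]), best


# Toy symmetric 3-atom ligand: template atoms 0 and 2 are interchangeable
holo_coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
# Prediction in template atom order, which happens to be the flipped order
pred_coords = np.array([[2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
template_stack = np.array([[0, 1, 2], [2, 1, 0]])
holo_stack = np.array([[0, 1, 2], [0, 1, 2]])

best_rmsd, best_idx = best_rmsd_over_stacks(
    pred_coords, holo_coords, template_stack, holo_stack
)
print(best_rmsd, best_idx)
```

In this toy case the second stack undoes the flip, so a loss computed with the first stack alone would penalize a prediction that is in fact correct.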

In [None]:
system_structure.input_ligand_templates[system_structure.ligand_chain_ordered[0]]

In [None]:
system_structure.input_ligand_conformers[system_structure.ligand_chain_ordered[0]]

In [None]:
system_structure.resolved_ligand_mols[system_structure.ligand_chain_ordered[0]]

### Ligand conformer coordinates

As you can tell, the atom indices of the input 2D template and the 3D conformer match, but those of the resolved ligand differ.
To compare their coordinates correctly, use the atom order stacks.


In [None]:
(
    input_atom_order_stack,
    holo_atom_order_stack,
) = system_structure.ligand_template2resolved_atom_order_stacks[
    system_structure.ligand_chain_ordered[0]
]
input_atom_order_stack, holo_atom_order_stack

In [None]:
system_structure.input_ligand_conformers_coords[
    system_structure.ligand_chain_ordered[0]
][input_atom_order_stack]

In [None]:
system_structure.resolved_ligand_mols_coords[system_structure.ligand_chain_ordered[0]][
    holo_atom_order_stack
]

### More complicated examples

Symmetry in the ligand (automorphism): there are two ways of pairwise mapping the atom order.

In [None]:
structure_with_symmetry_in_ligand = PlinderSystem(
    system_id="4v2y__1__1.A__1.E"
).holo_structure
structure_with_symmetry_in_ligand.input_ligand_templates[
    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]
]

In [None]:
structure_with_symmetry_in_ligand.resolved_ligand_mols[
    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]
]

In [None]:
structure_with_symmetry_in_ligand.ligand_template2resolved_atom_order_stacks[
    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]
]

Symmetry arises because the ligand is only partially resolved: there are three template pieces that can be mapped to the resolved ground truth.

In [None]:
structure_with_partly_resolved_ligand = PlinderSystem(
    system_id="1ngx__1__1.A_1.B__1.E"
).holo_structure
structure_with_partly_resolved_ligand.input_ligand_templates[
    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]
]

In [None]:
structure_with_partly_resolved_ligand.resolved_ligand_mols[
    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]
]

In [None]:
structure_with_partly_resolved_ligand.ligand_template2resolved_atom_order_stacks[
    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]
]

## Protein

In [None]:
for name in system_structure.get_properties():
    if "protein" in name:
        print(name)

### Other properties
These include:
- `num_ligands`: Number of ligand chains
- `smiles`: Ligand SMILES dictionary
- `num_proteins`: Number of protein chains


### Masking
The properties `protein_backbone_mask` and `protein_calpha_mask` are boolean masks that can be used to select backbone or calpha atoms from a biotite `AtomArray`. The `True` entries correspond to backbone or calpha atoms.

In [None]:
print(
    "Total number of atoms:",
    len(system_structure.protein_atom_array),
)
print("Number of backbone atoms:", system_structure.protein_backbone_mask.sum())
print(
    "Number of calpha atoms:",
    system_structure.protein_calpha_mask.sum(),
)

In [None]:
calpha_atom_array = system_structure.protein_atom_array[
    system_structure.protein_calpha_mask
]
calpha_atom_array.coord.shape
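The same boolean-mask pattern can be tried without loading a system. The sketch below uses plain NumPy arrays standing in for the `atom_name` annotation and coordinates of an `AtomArray`. Treating backbone atoms as `N`, `CA` and `C` is an assumption made for illustration and may not match how `protein_backbone_mask` is defined internally.

```python
import numpy as np

# Toy per-atom annotation for two residues, standing in for AtomArray.atom_name
atom_names = np.array(["N", "CA", "C", "O", "CB", "N", "CA", "C", "O"])
coords = np.arange(len(atom_names) * 3, dtype=float).reshape(-1, 3)

# Assumed backbone definition (N, CA, C); the library's definition may differ
backbone_mask = np.isin(atom_names, ["N", "CA", "C"])
calpha_mask = atom_names == "CA"

# A boolean mask selects the rows where it is True
calpha_coords = coords[calpha_mask]
# The positions of the True entries are the atom indices themselves
calpha_indices = np.flatnonzero(calpha_mask)

print(backbone_mask.sum(), calpha_mask.sum(), calpha_coords.shape)
```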

You can also filter by arbitrary properties of the `AtomArray` using the `filter` method. This returns a new `Structure` object.

In [None]:
calpha_structure = system_structure.filter(
    property="atom_name",
    mask="CA",
)

In [None]:
calpha_structure.protein_atom_array.coord.shape

### Get protein chain ordered
This gives a list of protein chains in the order they appear in the structure.

In [None]:
system_structure.protein_chain_ordered

### Get protein chains for all atoms
The list of chain IDs in the structure. The order in which they appear is not preserved.

In [None]:
system_structure.protein_chains

### Get protein coordinates
This property gives the 3D position of each atom in the protein molecules.

In [None]:
system_structure.protein_coords

### Get number of atoms of protein molecule

In [None]:
system_structure.protein_n_atoms

### Get protein structure atom names
Returns all atom names as they appear in the structure.

In [None]:
system_structure.protein_unique_atom_names

### Get protein residue names

In [None]:
system_structure.protein_unique_residue_names

### Get protein residues number
Residue numbers as they appear in the structure.

In [None]:
system_structure.protein_unique_residue_ids

### Get sequence from protein structure


In [None]:
system_structure.protein_sequence_from_structure

Note that this is different from the canonical SEQRES sequence due to unresolved terminal residues:

In [None]:
system_structure.protein_sequence
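For the simple case where only terminal residues are unresolved, the difference between the two sequences can be located with a substring search. This is a minimal sketch on made-up sequences; the helper `unresolved_termini` is hypothetical and not part of plinder, and it does not handle internal gaps or mutations.

```python
def unresolved_termini(seqres, resolved):
    """Return (missing N-terminal, missing C-terminal) residues.

    Returns None when the resolved sequence is not one contiguous stretch of
    the SEQRES sequence (for example, when there are internal gaps).
    """
    start = seqres.find(resolved)
    if start < 0:
        return None
    end = start + len(resolved)
    return seqres[:start], seqres[end:]


# Made-up example: two N-terminal residues and one C-terminal residue missing
termini = unresolved_termini("MKTAYIAKQR", "TAYIAKQ")
print(termini)
```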

### Get tokenized sequence
Get a tensor of the sequence converted to integer-based amino acid tokens.

In [None]:
system_structure.protein_structure_tokenized_sequence
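As a rough illustration of what integer tokenization means, the sketch below maps one-letter amino acid codes to integers. The vocabulary (20 standard residues in alphabetical one-letter order plus one unknown token) and the use of a NumPy array are assumptions for this example; the actual token mapping and tensor type used by `protein_structure_tokenized_sequence` may differ.

```python
import numpy as np

# Hypothetical vocabulary; not necessarily the one plinder uses
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_OF = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN_TOKEN = len(AMINO_ACIDS)  # shared index for any non-standard residue


def tokenize(sequence):
    """Convert a one-letter sequence into an array of integer tokens."""
    return np.array(
        [TOKEN_OF.get(aa, UNKNOWN_TOKEN) for aa in sequence], dtype=np.int64
    )


tokens = tokenize("ACDX")
print(tokens)
```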

## Linked protein input structures


For realistic inference scenarios we need to initialize our protein structures using a linked structure (introduced above). In most cases, these will not be a perfect match to the _holo_ structure: the number of residues, the residue numbering, and sometimes the sequence can differ. It's important to be able to match these structures to ensure that we can map between them.
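To picture what matching residues between two sequences of different length involves, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library. It is a stand-in for illustration only: the sequences are made up, the helper `common_residue_pairs` is hypothetical, and block matching of this kind is not a proper sequence alignment nor the method plinder uses.

```python
from difflib import SequenceMatcher


def common_residue_pairs(seq_a, seq_b):
    """Return (index_in_a, index_in_b) pairs for residues in shared blocks."""
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # Each block is a run of identical residues of length block.size
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


# Made-up sequences: the second one has extra terminal residues
holo_seq = "TAYIAKQ"
pred_seq = "MKTAYIAKQR"
pairs = common_residue_pairs(holo_seq, pred_seq)
print(pairs)
```

Pairs like these are what allow both structures to be cropped to the residues they share before any coordinates are compared.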

In the example below, we will take a _holo_ structure and its linked predicted (_pred_) form, which has a different number of residues, and match and crop the resulting structures to work out the correspondence between their residues. For this we use the `align_common_sequence` function of the holo `Structure` object, which aligns two structures based on their shared sequences. It has the following parameters:

```
other: Structure
    The other structure to align to
copy: bool
    Whether to make a copy or edit in-place
remove_differing_atoms: bool
    Whether to remove differing atoms between the two structures
renumber_residues: bool [False]
    If True, renumber residues in the two structures to match and starting from 1.
    If False, sets the resulting residue indices to the one from the aligned sequence
remove_differing_annotations: bool [False]
    Whether to remove differing annotations, like b-factor, etc
```
In this example, we will match, make copies and crop the structures.

:::{note}
To use this function, the proteins to be aligned must have the same chain IDs. So, we first set the chain ID of the predicted structure to that of the holo structure.
:::

```python
plinder_system = PlinderSystem(system_id="4cj6__1__1.A__1.B")
holo = plinder_system.holo_structure
predicted = plinder_system.alternate_structures["P12271_A"]
predicted.set_chain(holo.protein_chain_ordered[0])
holo_cropped, predicted_cropped = holo.align_common_sequence(predicted)
predicted_cropped_superposed, raw_rmsd, refined_rmsd = predicted_cropped.superimpose(
    holo_cropped
)
```
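Once the two structures contain the same atoms in the same order, superposition is a rigid-body fitting problem. The sketch below uses the Kabsch algorithm, a standard method for this, on toy coordinates. It is named here as a stand-in and is not necessarily what `superimpose` does internally; in particular, it returns a single RMSD and does not model the distinction between `raw_rmsd` and `refined_rmsd`. The function name `kabsch_superimpose` is hypothetical.

```python
import numpy as np


def kabsch_superimpose(mobile, target):
    """Rigidly fit `mobile` onto `target` (both of shape (n, 3)).

    Returns the fitted coordinates and the RMSD after fitting.
    """
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    p = mobile - mobile_center
    q = target - target_center
    # Covariance matrix between the centered point sets and its SVD
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    fitted = p @ rotation.T + target_center
    rmsd = float(np.sqrt(((fitted - target) ** 2).sum(axis=1).mean()))
    return fitted, rmsd


# Toy check: rotate and translate a point set, then recover the original pose
target = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
mobile = target @ rot_z.T + np.array([5.0, -3.0, 2.0])

fitted, rmsd = kabsch_superimpose(mobile, target)
print(rmsd)
```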