# OpenEye Structural Featurizer

This notebook introduces structural modeling featurizers using the [OpenEye toolkits](https://docs.eyesopen.com/toolkits/python/index.html) to prepare protein structures and to dock small molecules into their binding sites.

**Note:** All structural featurizers fetch data and/or perform expensive computations. Fetched data (e.g. PDB structures) and intermediate results (e.g. a prepared protein structure) are therefore stored in a cache directory, which speeds up calculations when the same or similar systems are featurized repeatedly. The cache directory can be set via the `cache_dir` parameter and defaults to `user_cache_dir` from `appdirs`. If you update KinoML, consider deleting the cache directory; otherwise, intermediate results produced by the previous version will be read from the cache.
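Clearing the cache after an upgrade can be scripted. The following is a minimal sketch using only the standard library; the helper name `clear_cache` is hypothetical and not part of KinoML, and the path you pass should be whatever you use as `cache_dir` (or the default cache location). It is demonstrated on a temporary directory so that nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir):
    """Delete a cache directory and its contents; return True if it existed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demonstration on a throwaway directory standing in for the real cache.
demo_cache = Path(tempfile.mkdtemp()) / "kinoml_demo_cache"
demo_cache.mkdir()
(demo_cache / "4f8o.pdb").write_text("placeholder for a cached structure")

removed = clear_cache(demo_cache)        # directory existed and was deleted
removed_again = clear_cache(demo_cache)  # nothing left to delete
```

Returning a boolean instead of raising on a missing directory keeps the helper safe to call unconditionally, e.g. at the top of a notebook.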

In [1]:
%%capture --no-display
from importlib import resources
import inspect
from pathlib import Path

from appdirs import user_cache_dir

from kinoml.core.ligands import Ligand
from kinoml.core.proteins import Protein, KLIFSKinase
from kinoml.core.systems import ProteinSystem, ProteinLigandComplex
from kinoml.features.core import Pipeline
from kinoml.features.protein import OEProteinStructureFeaturizer
from kinoml.features.complexes import ( 
    OEComplexFeaturizer, 
    OEDockingFeaturizer,
    MostSimilarPDBLigandFeaturizer,
    KLIFSConformationTemplatesFeaturizer,
)

## OEProteinStructureFeaturizer

All OpenEye featurizers come with an extensive docstring explaining their capabilities and requirements.

In [2]:
print(inspect.getdoc(OEProteinStructureFeaturizer))

Given systems with exactly one protein, prepare the protein structure by:

 - modeling missing loops
 - building missing side chains
 - mutations, if `uniprot_id` or `sequence` attribute is provided for
   the protein component (see below)
 - removing everything but protein and water
 - protonation at pH 7.4

The protein component of each system must be a `core.proteins.Protein`
or a subclass thereof, must be initialized with toolkit='OpenEye' and
give access to a molecular structure, e.g. via a pdb_id. Additionally,
the protein component can have the following optional attributes to
customize the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
    generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
    should be used.
 - `expo_id`: A string specifying a ligand bound to the protein of
   interest. This is especially useful if mult

In general, these featurizers work with a minimal amount of information, e.g. just a PDB ID. However, it is recommended to be as explicit as possible when defining the systems to featurize. For example, if a given PDB entry has multiple chains and ligands, the featurizer has to guess which chain and ligand are of interest unless they are stated explicitly.

In [3]:
# collect systems to featurize, i.e. prepare the protein structure
systems = []

In [4]:
# unspecific definition of the system, only via PDB ID
# modeling will be performed according to the sequence stored in the PDB Header
protein = Protein(pdb_id="4f8o", name="PsaA")
system = ProteinSystem(components=[protein])
systems.append(system)

In [5]:
# more specific definition of the system, protein of chain A co-crystallized with ligand AES and
# alternate location B, modeling will be performed according to the sequence of the given 
# UniProt ID
protein = Protein.from_pdb(pdb_id="4f8o", name="PsaA")
protein.uniprot_id = "P31522"
protein.chain_id = "A"
protein.alternate_location = "B"
protein.expo_id = "AES"
system = ProteinSystem(components=[protein])
systems.append(system)

In [6]:
# use a protein structure from file
with resources.path("kinoml.data.proteins", "4f8o_edit.pdb") as structure_path:
    pass
protein = Protein.from_file(file_path=structure_path, name="PsaA")
protein.uniprot_id = "P31522"
system = ProteinSystem(components=[protein])
systems.append(system)

In [7]:
with resources.path("kinoml.data.proteins", "kinoml_tests_4f8o_spruce.loop_db") as loop_db:
    pass
featurizer = OEProteinStructureFeaturizer(
    loop_db=loop_db,
    output_dir=user_cache_dir() + "/protein",
)

In [8]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinSystem with 1 components (<Protein name=PsaA>)>,
 <ProteinSystem with 1 components (<Protein name=PsaA>)>,
 <ProteinSystem with 1 components (<Protein name=PsaA>)>]

The featurizers return the featurized systems, with each featurization stored as an [MDAnalysis universe](https://www.mdanalysis.org/). Systems that failed are filtered out. To investigate failures, enable logging messages via:
```
import logging  
logging.basicConfig(level=logging.DEBUG)
```
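The snippet above turns on DEBUG output for every library that uses `logging`, which can be noisy. A narrower alternative is sketched below, under the assumption (not confirmed in this notebook) that KinoML modules create their loggers with `logging.getLogger(__name__)`, so that their names start with `kinoml`. The logger name `kinoml.demo` and the message are purely illustrative.

```python
import io
import logging

# Assumption: KinoML modules log under names starting with "kinoml".
kinoml_logger = logging.getLogger("kinoml")
kinoml_logger.setLevel(logging.DEBUG)

# Collect messages in memory here; a logging.FileHandler works the same way.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
kinoml_logger.addHandler(handler)

# Stand-in for a message a featurizer might emit (illustrative only).
logging.getLogger("kinoml.demo").debug("system could not be featurized")
captured = log_buffer.getvalue()
```

Attaching the handler to the parent logger means messages from all child loggers below `kinoml` are collected in one place, while other libraries stay at their default level.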

In [9]:
systems[0]

<ProteinSystem with 1 components (<Protein name=PsaA>)>

In [10]:
systems[0].featurizations["last"]

<Universe with 2381 atoms>

If an `output_dir` was provided, the prepared structure is saved in PDB and OEB format.

In [11]:
for path in sorted(Path(user_cache_dir() + "/protein").glob("*")):
    print(path.name)

kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.oeb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.pdb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.oeb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.pdb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.oeb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.pdb
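To work with these files programmatically, it can help to group them by prepared structure. The sketch below uses only the standard library and the file names from the listing above; grouping by file stem is an illustrative choice, not something KinoML provides.

```python
from collections import defaultdict
from pathlib import Path

# File names copied from the directory listing above.
file_names = [
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.oeb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.pdb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.oeb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.pdb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.oeb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.pdb",
]

# Map each prepared structure (file stem) to the formats it was saved in.
structures = defaultdict(list)
for name in file_names:
    path = Path(name)
    structures[path.stem].append(path.suffix.lstrip("."))

# Keep only the PDB files, e.g. for loading into a viewer.
pdb_files = sorted(name for name in file_names if name.endswith(".pdb"))
```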


In [12]:
# these are pytest nbval checks
# check number of residues
assert len(systems[0].featurizations["last"].residues) == 239
assert len(systems[1].featurizations["last"].residues) == 216
assert len(systems[2].featurizations["last"].residues) == 109

# check numbering of first residue
assert systems[0].featurizations["last"].residues[0].resid == 1
assert systems[1].featurizations["last"].residues[0].resid == 44
assert systems[2].featurizations["last"].residues[0].resid == 47

## OEComplexFeaturizer

In [13]:
print(inspect.getdoc(OEComplexFeaturizer))

Given systems with exactly one protein and one ligand, prepare the complex
structure by:

 - modeling missing loops
 - building missing side chains
 - mutations, if `uniprot_id` or `sequence` attribute is provided for the
   protein component (see below)
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='OpenEye' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
   should be used.
 - `expo_id`: A string specifying the ligand of interest. This is
   especiall

In [14]:
systems = []

In [15]:
protein = Protein(pdb_id="4f8o", name="PsaA")
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [16]:
protein = Protein.from_pdb(pdb_id="4f8o", name="PsaA")
protein.uniprot_id = "P31522"
protein.chain_id = "A"
protein.alternate_location = "B"
protein.expo_id = "AES"
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [17]:
featurizer = OEComplexFeaturizer(
    output_dir=user_cache_dir() + "/complex",
)

In [18]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>,
 <ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>]

If an `output_dir` was provided, the prepared structure is saved in PDB and OEB format, the prepared ligand is additionally saved in SDF format.

In [19]:
for path in sorted(Path(user_cache_dir() + "/complex").glob("*")):
    print(path.name)

kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_complex.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_complex.pdb
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_ligand.sdf
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_protein.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_protein.pdb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_complex.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_complex.pdb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_ligand.sdf
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_protein.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_protein.pdb


In [20]:
# these are pytest nbval checks
# check LIG exists
assert len(systems[0].featurizations["last"].select_atoms("resname LIG").residues) == 1
assert len(systems[1].featurizations["last"].select_atoms("resname LIG").residues) == 1

# check caps
assert len(systems[0].featurizations["last"].select_atoms("resname ACE or resname NME").residues) == 2
assert len(systems[1].featurizations["last"].select_atoms("resname ACE or resname NME").residues) == 1

# check number of residues
assert len(systems[0].featurizations["last"].residues) == 240
assert len(systems[1].featurizations["last"].residues) == 217

# check numbering of first residue
assert systems[0].featurizations["last"].residues[0].resid == 1
assert systems[1].featurizations["last"].residues[0].resid == 44

## OEDockingFeaturizer

The `OEDockingFeaturizer` supports [3 docking methods](https://docs.eyesopen.com/toolkits/python/dockingtk/index.html), i.e.:
 - **Fred** - standard docking protocol
 - **Hybrid** - biased by co-crystallized ligand
 - **Posit** - bias depends on the similarity to the co-crystallized ligand

In [21]:
print(inspect.getdoc(OEDockingFeaturizer))

Given systems with exactly one protein and one ligand, prepare the
structure and dock the ligand into the prepared protein structure with
one of OpenEye's docking algorithms:

 - modeling missing loops
 - building missing side chains
 - mutations, if `uniprot_id` or `sequence` attribute is provided for the
   protein component (see below)
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4
 - perform docking

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='OpenEye' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate l

### Fred

In [22]:
systems = []

In [23]:
# define the binding site for docking via co-crystallized ligand
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [24]:
# define the binding site for docking via residue IDs
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.pocket_resids = [
    516, 517, 521, 524, 542, 544, 573, 589, 590, 591, 592, 595, 596, 654, 655, 656, 657, 667, 668
]
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib_2")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [25]:
featurizer = OEDockingFeaturizer(
    output_dir=user_cache_dir() + "/Fred",
    method="Fred"
)

In [26]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>,
 <ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib_2>)>]

Docking scores are stored in the returned MDAnalysis universe.

In [27]:
[system.featurizations["last"]._topology.docking_score for system in systems]

[-17.801493, -3.960361]
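The scores can be used to rank the docked systems. The sketch below reuses the two scores shown above and assumes that, as with OpenEye's Chemgauss scoring functions, a more negative score indicates a better pose; the pairing of names and scores follows the order of `systems`.

```python
# Names and scores copied from the outputs above, in the order of `systems`.
names = ["larotrectinib", "larotrectinib_2"]
scores = [-17.801493, -3.960361]

# Assumption: lower (more negative) docking scores are better.
ranked = sorted(zip(names, scores), key=lambda pair: pair[1])
best_name, best_score = ranked[0]
```

In this example, the system whose binding site was defined via the co-crystallized ligand scores markedly better than the one defined via residue IDs.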

In [28]:
# these are pytest nbval checks
# check LIG exists
assert len(systems[0].featurizations["last"].select_atoms("resname LIG").residues) == 1
assert len(systems[1].featurizations["last"].select_atoms("resname LIG").residues) == 1

# check caps
assert len(systems[0].featurizations["last"].select_atoms("resname ACE or resname NME").residues) == 10
assert len(systems[1].featurizations["last"].select_atoms("resname ACE or resname NME").residues) == 10

# check numbering of first residue
assert systems[0].featurizations["last"].residues[0].resid == 501
assert systems[1].featurizations["last"].residues[0].resid == 501

### Hybrid

In [29]:
systems = []

In [30]:
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [31]:
featurizer = OEDockingFeaturizer(
    output_dir=user_cache_dir() + "/Hybrid",
    method="Hybrid"
)

In [32]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

In [33]:
# these are pytest nbval checks
# check LIG exists
assert len(systems[0].featurizations["last"].select_atoms("resname LIG").residues) == 1

# check caps
assert len(systems[0].featurizations["last"].select_atoms("resname ACE or resname NME").residues) == 10

# check numbering of first residue
assert systems[0].featurizations["last"].residues[0].resid == 501

### Posit

In [34]:
systems = []

In [35]:
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [36]:
featurizer = OEDockingFeaturizer(
    output_dir=user_cache_dir() + "/Posit",
    method="Posit"
)

In [37]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

Besides the docking score, the Posit probability is also stored in the returned MDAnalysis universe.

In [38]:
systems[0].featurizations["last"]._topology.posit_probability

0.5

In [39]:
# these are pytest nbval checks
# check LIG exists
assert len(systems[0].featurizations["last"].select_atoms("resname LIG").residues) == 1

# check caps
assert len(systems[0].featurizations["last"].select_atoms("resname ACE or resname NME").residues) == 10

# check numbering of first residue
assert systems[0].featurizations["last"].residues[0].resid == 501

## MostSimilarPDBLigandFeaturizer

Manually specifying the most suitable PDB structure to dock into is not practical for a larger set of ligands. Hence, the `MostSimilarPDBLigandFeaturizer` was implemented, which can find the most suitable structure for docking in the PDB based on ligand similarity. The user can choose from one of the following similarity metrics:

- Fingerprint
- Maximum common substructure
- OpenEye's shape
- Schrodinger's shape
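To give an intuition for fingerprint-based selection, the sketch below picks the most similar candidate by Tanimoto similarity on toy fingerprints represented as sets of "on" bits. This is an illustration only: the fingerprint type and similarity measure actually used by the featurizer are not stated here, and the ligand names and bit sets are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    bits_a, bits_b = set(bits_a), set(bits_b)
    union = bits_a | bits_b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(bits_a & bits_b) / len(union)


# Toy fingerprints (hypothetical values, not derived from real molecules).
query = {1, 4, 7, 9, 12}
candidates = {
    "ligand_A": {1, 4, 7, 9, 13},
    "ligand_B": {2, 3, 5},
    "ligand_C": {1, 4, 8},
}

similarities = {name: tanimoto(query, bits) for name, bits in candidates.items()}
most_similar = max(similarities, key=similarities.get)
```

Here `ligand_A` shares four of six distinct bits with the query and is therefore selected.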

In [40]:
print(inspect.getdoc(MostSimilarPDBLigandFeaturizer))

Find the most similar co-crystallized ligand in the PDB according to a
given SMILES and UniProt ID.

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, and must be initialized with a `uniprot_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structure with the most
    similar ligand ["fingerprint", "mcs", "openeye_shape",
    "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    n

### Maximum common substructure

In [41]:
systems = []

In [42]:
protein = Protein(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [43]:
featurizer = MostSimilarPDBLigandFeaturizer(
    similarity_metric="mcs"
)

In [44]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id

('4YNE', 'A', '4EK')

4.39 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Fingerprint

In [45]:
featurizer = MostSimilarPDBLigandFeaturizer(
    similarity_metric="fingerprint"
)

In [46]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id

('4YNE', 'A', '4EK')

3.88 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### OpenEye's shape

In [47]:
featurizer = MostSimilarPDBLigandFeaturizer(
    similarity_metric="openeye_shape"
)

In [48]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id

('4YNE', 'A', '4EK')

3min 37s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


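Using shape is clearly the slowest option (3 min 37 s versus about 4 s for the other two metrics in this example, i.e. roughly 50 times slower), but in many cases it is the most accurate one.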

### Pipeline of MostSimilarPDBLigandFeaturizer and OEDockingFeaturizer

The `MostSimilarPDBLigandFeaturizer` can be joined with the `OEDockingFeaturizer` into a `Pipeline` featurizer.

In [49]:
systems = []

In [50]:
protein = Protein(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [51]:
featurizer = Pipeline([
    MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint"),
    OEDockingFeaturizer(output_dir=user_cache_dir() + "/docking_pipeline", method="Posit"),
])

In [52]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

In [53]:
systems[0].featurizations

{'last': <Universe with 4783 atoms>,
 'Pipeline([MostSimilarPDBLigandFeaturizer, OEDockingFeaturizer])': <Universe with 4783 atoms>}
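The output shows the final result stored under both `last` and the pipeline's name. The toy sketch below mimics that bookkeeping with plain Python; it is not KinoML's implementation, and `ToyPipeline`, `find_template` and `dock` are hypothetical stand-ins.

```python
class ToyPipeline:
    """Chain callables and store the final result under two keys."""

    def __init__(self, steps):
        self.steps = steps
        step_names = ", ".join(step.__name__ for step in steps)
        self.name = f"ToyPipeline([{step_names}])"

    def featurize(self, systems):
        for system in systems:
            result = system["input"]
            for step in self.steps:
                result = step(result)  # output of one step feeds the next
            system["featurizations"] = {"last": result, self.name: result}
        return systems


def find_template(query):
    # Stand-in for a template search; the returned ID is hard-coded.
    return {"query": query, "template": "4YNE"}


def dock(prepared):
    # Stand-in for docking; only marks the input as docked.
    return {**prepared, "docked": True}


toy_systems = ToyPipeline([find_template, dock]).featurize([{"input": "larotrectinib"}])
featurizations = toy_systems[0]["featurizations"]
```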

## KLIFSConformationTemplatesFeaturizer

The `KLIFSConformationTemplatesFeaturizer` searches for suitable templates to model a kinase:ligand complex in different conformations. The templates are selected based on ligand and sequence similarity.

In [54]:
print(inspect.getdoc(KLIFSConformationTemplatesFeaturizer))

Find suitable kinase templates for modeling a kinase:inhibitor complex in
different KLIFS conformations.

The protein component of each system must be a `core.proteins.KLIFSKinase`,
and must be initialized with a `uniprot_id` or `kinase_klifs_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structures with similar
    ligands ["fingerprint", "mcs", "openeye_shape", "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    numb

In [55]:
systems = []

In [56]:
protein = KLIFSKinase(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [57]:
featurizer = KLIFSConformationTemplatesFeaturizer(
    similarity_metric="fingerprint"
)

In [58]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinLigandComplex with 2 components (<KLIFSKinase name=NTRK1>, <Ligand name=larotrectinib>)>]

In [59]:
systems[0].featurizations["last"]

| | dfg | ac_helix | pdb_id | chain_id | expo_id | ligand_similarity | pocket_similarity |
|---|---|---|---|---|---|---|---|
| 0 | in | in | 4yne | A | 4EK | 0.568047 | 443.0 |
| 1 | in | out | 6tfp | A | N6Z | 0.534031 | 215.0 |
| 2 | out | in | 4pmp | A | 31W | 0.482759 | 443.0 |
| 3 | out-like | in | 6brj | A | VX6 | 0.521739 | 279.0 |
| 4 | out-like | out | 3aqv | A | TAK | 0.435754 | 171.0 |
| 5 | out | out | 5jfv | A | 6K1 | 0.49162 | 422.0 |