# Schrodinger Structural Featurizer

This notebook introduces structural modeling featurizers using molecular modeling capabilities from the [Schrodinger Suite](https://www.schrodinger.com/) to prepare protein structures and to dock small molecules into their binding sites.

**Note:** All structural featurizers fetch data and/or perform expensive computations. Fetched data (e.g. PDB structures) and intermediate results (e.g. a prepared protein structure) are therefore stored in a cache directory to speed up calculations when the same or similar systems are featurized multiple times. The cache directory can be set via the `cache_dir` parameter and defaults to `user_cache_dir` from `appdirs`. After updating KinoML, consider deleting the cache directory; otherwise, intermediate results produced by the previous version will be read from the cache and may affect your results.
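One way to make cache clean-up safe is to give the featurizers a dedicated cache directory and reset only that directory after an upgrade. The following is a minimal sketch under stated assumptions: `reset_cache` is a hypothetical helper (not part of KinoML), and the directory name in the usage comment is arbitrary. The default location from `appdirs` may be shared with other applications, so it is probably not a good idea to delete it wholesale.

```python
import shutil
from pathlib import Path


def reset_cache(cache_dir):
    """Delete a dedicated cache directory and recreate it empty.

    Hypothetical helper, not part of KinoML. Only use it on a directory
    that holds nothing but featurizer cache files.
    """
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True)
    return cache_dir


# Usage idea (the directory name is an assumption):
# cache = reset_cache("kinoml_cache")
# featurizer = SCHRODINGERComplexFeaturizer(cache_dir=cache, output_dir="output/complex")
```

Passing the same `cache_dir` to every featurizer in a project keeps all intermediate files in one place that can be removed without side effects.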

In [1]:
%%capture --no-display
from importlib import resources
import inspect
from pathlib import Path

from appdirs import user_cache_dir

from kinoml.core.ligands import Ligand
from kinoml.core.proteins import Protein, KLIFSKinase
from kinoml.core.systems import ProteinLigandComplex
from kinoml.features.core import Pipeline
from kinoml.features.complexes import ( 
    SCHRODINGERComplexFeaturizer, 
    SCHRODINGERDockingFeaturizer,
    MostSimilarPDBLigandFeaturizer,
    KLIFSConformationTemplatesFeaturizer,
)

## SCHRODINGERComplexFeaturizer

All Schrodinger featurizers come with an extensive docstring explaining their capabilities and requirements.

In [2]:
print(inspect.getdoc(SCHRODINGERComplexFeaturizer))

Given systems with exactly one protein and one ligand, prepare the complex
structure by:

 - modeling missing loops
 - building missing side chains
 - mutations, if `uniprot_id` or `sequence` attribute is provided for the
   protein component
   (see below)
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='MDAnalysis' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
   should be used.
 - `expo_id`: A string specifying the ligand of interest. This is
   esp

In general, these featurizers will work with a minimal amount of information, e.g. just a PDB ID. However, it is recommended to be as explicit as possible when defining the systems to featurize. For example, if a given PDB entry has multiple chains and ligands, the featurizer has to guess which chain and ligand are of interest unless they are stated explicitly.
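To spot underspecified systems before an expensive run, one could list which optional attributes have not been set. This is only a sketch: `unset_specifiers` is a hypothetical helper, and it assumes that an attribute that was never provided is either absent or `None` on the protein object. The attribute names are the ones mentioned in the docstring above.

```python
from types import SimpleNamespace

# Optional protein attributes named in the docstring above.
SPECIFIERS = ("chain_id", "alternate_location", "expo_id", "uniprot_id")


def unset_specifiers(protein):
    """Return the names of optional attributes that are missing or None."""
    return [name for name in SPECIFIERS if getattr(protein, name, None) is None]


# Stand-in object for illustration; a real `Protein` would be passed instead.
demo = SimpleNamespace(pdb_id="4f8o", chain_id="A", expo_id="AES")
print(unset_specifiers(demo))
```

An empty list means nothing is left for the featurizer to guess among these attributes.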

In [3]:
# collect systems to featurize, i.e. prepare the protein structure
systems = []

In [4]:
# unspecific definition of the system, only via PDB ID
# modeling will be performed according to the sequence stored in the PDB header
protein = Protein(pdb_id="4f8o", name="PsaA", toolkit="MDAnalysis")
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [5]:
# more specific definition of the system, protein of chain A co-crystallized with ligand AES and
# alternate location B, modeling will be performed according to the sequence of the given 
# UniProt ID
protein = Protein.from_pdb(pdb_id="4f8o", name="PsaA", toolkit="MDAnalysis")
protein.uniprot_id = "P31522"
protein.chain_id = "A"
protein.alternate_location = "B"
protein.expo_id = "AES"
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)



In [6]:
featurizer = SCHRODINGERComplexFeaturizer(output_dir="output/complex")

`featurize` returns the systems that were featurized successfully; each of them stores its result as an [MDAnalysis universe](https://www.mdanalysis.org/) in its `featurizations` attribute. Systems that failed are filtered out. To find out why a system failed, enable logging messages via:
```
import logging  
logging.basicConfig(level=logging.DEBUG)
```
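Since failed systems are silently removed from the returned list, it can be useful to record a label for each system beforehand and compare afterwards. The sketch below is not part of KinoML; `label` and `dropped` are hypothetical helpers, and `label` assumes that every system exposes `components` whose items have a `name` attribute.

```python
from collections import Counter


def label(system):
    """Hypothetical label built from component names (assumes a `name` attribute)."""
    return tuple(component.name for component in system.components)


def dropped(before, after):
    """Return labels present in `before` but missing from `after`.

    Counter subtraction keeps duplicates apart, which matters when several
    systems share the same protein and ligand names.
    """
    return list((Counter(before) - Counter(after)).elements())


# Usage idea: record labels before featurizing, compare afterwards.
# before = [label(s) for s in systems]
# systems = featurizer.featurize(systems)
# print(dropped(before, [label(s) for s in systems]))
```

Comparing by label rather than by object identity avoids relying on the featurizer returning the very same objects, which may not hold when multiprocessing is involved.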

In [7]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

JobId: lt10-0-62455dbf
JobId: lt10-1-62455dbf




[<ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>,
 <ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>]

In [8]:
systems[0].featurizations["last"]

<Universe with 2506 atoms>

If an `output_dir` was provided, the prepared structure is saved in PDB format.

In [9]:
for path in sorted(Path("output/complex").glob("*")):
    print(path.name)

kinoml_SCHRODINGERComplexFeaturizer_PsaA_4f8o_AEBSF_complex.pdb
kinoml_SCHRODINGERComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_complex.pdb


## SCHRODINGERDockingFeaturizer

Docking can be performed with or without a shape restraint to the co-crystallized ligand (the `shape_restrain` parameter). In either case, the protein structure used for docking must contain a co-crystallized ligand, which is required to define the binding pocket.

In [10]:
print(inspect.getdoc(SCHRODINGERDockingFeaturizer))

Given systems with exactly one protein and one ligand, prepare the
structure dock the ligand into its binding site identified by a
co-crystallized ligand. The following steps will be performed:

 - modeling missing loops
 - building missing side chains
 - mutations, if `uniprot_id` or `sequence` attribute is provided for the
   protein component (see below)
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4
 - docking a ligand

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='MDAnalysis' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string speci

### Without shape restraint

In [11]:
systems = []

In [12]:
protein = Protein(pdb_id="4yne", uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [13]:
featurizer = SCHRODINGERDockingFeaturizer(
    output_dir="output/docking_without_shape_restrain",
    shape_restrain=False
)

In [14]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

JobId: lt10-0-62455edf


Converted file: /scratch/lsftmp/6152749.tmpdir/tmpovx75qrg.mae


Removing previous job files...


JobId: lt10-0-62455fde
ExitStatus: finished


Removing previous job files...


JobId: lt10-0-6245601f
ExitStatus: finished


Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_without_shape_restrain/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae


[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

Docking scores are stored in the returned MDAnalysis universe.

In [15]:
systems[0].featurizations["last"]._topology.docking_score

-10.1766
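When many systems are docked, the scores can be gathered in one pass. The following sketch mirrors the attribute access used in the cell above; `docking_scores` is a hypothetical helper, and because `_topology` is a private attribute of the universe, the lookup is written defensively and yields `None` where no score is found.

```python
from types import SimpleNamespace


def docking_scores(systems, key="last"):
    """Map each system's position to its docking score, or None if absent."""
    scores = {}
    for index, system in enumerate(systems):
        universe = system.featurizations.get(key)
        topology = getattr(universe, "_topology", None)
        scores[index] = getattr(topology, "docking_score", None)
    return scores


# Stand-in objects for illustration; real featurized systems would be passed instead.
fake_universe = SimpleNamespace(_topology=SimpleNamespace(docking_score=-10.1766))
fake_system = SimpleNamespace(featurizations={"last": fake_universe})
print(docking_scores([fake_system]))
```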

If an `output_dir` was provided, the prepared structure is saved in PDB and MAE format, and the prepared ligand is additionally saved in SDF format.

In [16]:
for path in sorted(Path("output/docking_without_shape_restrain").glob("*")):
    print(path.name)

kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.pdb
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_ligand.sdf
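For larger runs, grouping the output files by format makes it easier to check that every expected file was written. This is a small sketch using only the standard library; `files_by_suffix` is a hypothetical helper and makes no assumption about the file naming scheme.

```python
from collections import defaultdict
from pathlib import Path


def files_by_suffix(output_dir):
    """Group the file names in `output_dir` by their extension."""
    grouped = defaultdict(list)
    for path in sorted(Path(output_dir).glob("*")):
        if path.is_file():
            grouped[path.suffix].append(path.name)
    return dict(grouped)


# Usage idea:
# print(files_by_suffix("output/docking_without_shape_restrain"))
```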


### With shape restraint

In [17]:
systems = []

In [18]:
protein = Protein(pdb_id="4yne", uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [19]:
featurizer = SCHRODINGERDockingFeaturizer(
    output_dir="output/docking_with_shape_restrain",
    shape_restrain=True,
)

In [20]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

Converted file: /scratch/lsftmp/6152749.tmpdir/tmp_n1sbprn.mae


Removing previous job files...


JobId: lt10-0-624560bd
ExitStatus: finished


Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_with_shape_restrain/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae


[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

## MostSimilarPDBLigandFeaturizer

Manually specifying the most suitable PDB structure to dock into is not practical for a larger set of ligands. Hence, the `MostSimilarPDBLigandFeaturizer` was implemented, which can find the most suitable structure for docking in the PDB based on ligand similarity. The user can choose one of the following similarity metrics:

- Fingerprint
- Maximum common substructure (MCS)
- OpenEye's shape
- Schrodinger's shape
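To illustrate the idea behind fingerprint-based selection, the sketch below computes a Tanimoto coefficient on sets of "on" bits and picks the candidate with the highest value. This is a generic illustration, not KinoML's implementation: which fingerprint and which coefficient the featurizer uses is not stated here, and the bit sets and candidate names are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / len(union)


def most_similar(query_bits, candidates):
    """Return the name of the candidate whose fingerprint is closest to the query."""
    return max(candidates, key=lambda name: tanimoto(query_bits, candidates[name]))


# Toy fingerprints; names and bits are hypothetical.
query = {1, 2, 3, 8}
candidates = {"lig_A": {1, 2, 3, 9}, "lig_B": {4, 5, 6}, "lig_C": {1, 2}}
print(most_similar(query, candidates))
```

The other metrics follow the same pattern of scoring every co-crystallized ligand against the query and keeping the best one, only with a different scoring function.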

In [21]:
print(inspect.getdoc(MostSimilarPDBLigandFeaturizer))

Find the most similar co-crystallized ligand in the PDB according to a
given SMILES and UniProt ID.

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, and must be initialized with a `uniprot_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structure with the most
    similar ligand ["fingerprint", "mcs", "openeye_shape",
    "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    n

### Maximum common substructure

In [22]:
systems = []

In [23]:
protein = Protein(uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [24]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="mcs")

In [25]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id

('4YNE', 'A', '4EK')

7.17 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Fingerprint

In [26]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint")

In [27]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id

('4YNE', 'A', '4EK')

2.54 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Schrodinger's shape

In [28]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="schrodinger_shape")

In [29]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id

JobId: lt10-0-62456159


('4YPS', 'A', '4F6')

27.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


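Shape-based comparison was the slowest option in this run (27.8 s, versus 7.17 s for MCS and 2.54 s for fingerprints), but it is in many cases the most accurate one. Note that it also selected a different structure here (4YPS with ligand 4F6 instead of 4YNE with ligand 4EK).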
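Outside of a notebook, or when the timings should be kept for later comparison, a plain timer can replace the cell magic. The sketch below uses only the standard library; `time_call` is a hypothetical helper, and the loop over metrics in the usage comment reuses names from the cells above.

```python
import time


def time_call(func, *args, **kwargs):
    """Call `func` once and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage idea:
# timings = {}
# for metric in ("fingerprint", "mcs", "schrodinger_shape"):
#     featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric=metric)
#     _, timings[metric] = time_call(featurizer.featurize, systems)
```

Single runs like these are only a rough guide, since cached intermediate results can make repeated calls faster than the first one.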

### Pipeline of MostSimilarPDBLigandFeaturizer and SCHRODINGERDockingFeaturizer

The `MostSimilarPDBLigandFeaturizer` can be joined with the `SCHRODINGERDockingFeaturizer` into a `Pipeline` featurizer.
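Conceptually, chaining means that the systems returned by one featurizer become the input of the next. The following sketch shows that pattern with toy classes; `MiniPipeline` and `Tag` are hypothetical and only illustrate the idea, so KinoML's `Pipeline` may differ in its details.

```python
class Tag:
    """Toy featurizer that adds one key/value pair to every system (a dict here)."""

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def featurize(self, systems):
        return [{**system, self.key: self.value} for system in systems]


class MiniPipeline:
    """Run featurizers in order, feeding each one the output of the previous one."""

    def __init__(self, steps):
        self.steps = list(steps)

    def featurize(self, systems):
        for step in self.steps:
            systems = step.featurize(systems)
        return systems


pipeline = MiniPipeline([Tag("template", "selected"), Tag("docked", True)])
print(pipeline.featurize([{"name": "toy_system"}]))
```

In the real pipeline below, the first step fills in the PDB ID, chain and ligand of the selected template, which the docking step then needs to define the pocket.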

In [30]:
systems = []

In [31]:
protein = Protein(uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [32]:
featurizer = Pipeline([
    MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint"),
    SCHRODINGERDockingFeaturizer(output_dir="output/docking_pipeline"),
])

In [33]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

JobId: lt10-0-62456191


Converted file: /scratch/lsftmp/6152749.tmpdir/tmpg0s9ghna.mae


Removing previous job files...


JobId: lt10-0-6245628f
ExitStatus: finished


Removing previous job files...


JobId: lt10-0-624562d1
ExitStatus: finished


Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_pipeline/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4YNE_chainA_larotrectinib_complex.mae


[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

In [34]:
systems[0].featurizations

{'last': <Universe with 4964 atoms>,
 'Pipeline([MostSimilarPDBLigandFeaturizer, SCHRODINGERDockingFeaturizer])': <Universe with 4964 atoms>}

## KLIFSConformationTemplatesFeaturizer

The `KLIFSConformationTemplatesFeaturizer` searches for suitable templates to model a kinase:ligand complex in different conformations. The templates are selected based on ligand and sequence similarity.

In [35]:
print(inspect.getdoc(KLIFSConformationTemplatesFeaturizer))

Find suitable kinase templates for modeling a kinase:inhibitor complex in
different KLIFS conformations.

The protein component of each system must be a `core.proteins.KLIFSKinase`,
and must be initialized with a `uniprot_id` or `kinase_klifs_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structures with similar
    ligands ["fingerprint", "mcs", "openeye_shape", "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    numb

In [36]:
systems = []

In [37]:
protein = KLIFSKinase(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)

In [38]:
featurizer = KLIFSConformationTemplatesFeaturizer(
    similarity_metric="fingerprint"
)

In [39]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems

[<ProteinLigandComplex with 2 components (<KLIFSKinase name=NTRK1>, <Ligand name=larotrectinib>)>]

In [40]:
systems[0].featurizations["last"]

|   | dfg | ac_helix | pdb_id | chain_id | expo_id | ligand_similarity | pocket_similarity |
|---|---|---|---|---|---|---|---|
| 0 | in | in | 4yne | A | 4EK | 0.568047 | 443.0 |
| 1 | in | out | 6tfp | A | N6Z | 0.534031 | 215.0 |
| 2 | out | in | 4pmp | A | 31W | 0.482759 | 443.0 |
| 3 | out-like | in | 6brj | A | VX6 | 0.521739 | 279.0 |
| 4 | out-like | out | 3aqv | A | TAK | 0.435754 | 171.0 |
| 5 | out | out | 5jfv | A | 6K1 | 0.49162 | 422.0 |
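A template for a particular conformation can then be picked from this table. The sketch below copies the values shown above into a small CSV string and selects the row with the highest `ligand_similarity`, optionally restricted to a given conformation. `best_template` is a hypothetical helper, and ranking by ligand similarity alone is an assumption made for illustration; `pocket_similarity` could be taken into account as well.

```python
import csv
import io

# Values copied from the table above (index column omitted).
TABLE = """
dfg,ac_helix,pdb_id,chain_id,expo_id,ligand_similarity,pocket_similarity
in,in,4yne,A,4EK,0.568047,443.0
in,out,6tfp,A,N6Z,0.534031,215.0
out,in,4pmp,A,31W,0.482759,443.0
out-like,in,6brj,A,VX6,0.521739,279.0
out-like,out,3aqv,A,TAK,0.435754,171.0
out,out,5jfv,A,6K1,0.49162,422.0
"""


def load_rows(text):
    """Parse the CSV text into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def best_template(rows, **conformation):
    """Return the row with the highest ligand similarity among exact matches."""
    matches = [
        row for row in rows
        if all(row[key] == value for key, value in conformation.items())
    ]
    if not matches:
        return None
    return max(matches, key=lambda row: float(row["ligand_similarity"]))


rows = load_rows(TABLE)
print(best_template(rows, dfg="in", ac_helix="in")["pdb_id"])
```

With the real featurizer output, which is a table-like object, the same selection could presumably be done with its own filtering and sorting methods instead of the `csv` module.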
