# Homology Modelling

### Author: William Glass

This notebook shows examples of how to perform homology modelling in KinoML. 

In [1]:
# get relevant imports
from kinoml.modeling.homology import HomologyModel
from kinoml.modeling.alignment import Alignment
from kinoml.core.proteins import ProteinStructure

### Basic usage

To start, we need a template structure on which to base our homology model.

In [2]:
hm = HomologyModel()

# If we already have a prepared structure (e.g. prepared using Spruce TK), we can load it directly from file
structure = ProteinStructure.from_file("./4yne_protein.pdb")

# If we just want to grab a structure from the PDB, we can use the `from_name` method of `ProteinStructure`, e.g.:
structure_from_pdb = ProteinStructure.from_name('4yne')

# Once we have our structure, we need to extract its sequence. Note that this will often not be the canonical sequence.
sequence = structure_from_pdb.sequence

We now have our template structure and its sequence. If we did not have access to the structure but did have the sequence, we could run a BLAST search to find a PDB structure to use as our template:

In [3]:
model_templates = hm.get_pdb_template(sequence)

@> Blast searching NCBI PDB database for "GSGIR..."
@> Blast search completed in 47.5s.


In [4]:
model_templates.keys()

dict_keys(['4yne', '4aoj', '5h3q', '6d1y', '6iqn', '6nss', '5kml', '6nsp', '6npt', '5kmi', '4gt5', '6pl1', '5jfs', '6d22', '5wr7', '4pmm', '5i8a', '5kvt', '4f0i', '4ymj', '3v5q', '6kzc', '4asz', '1luf', '3zos', '5fdp', '6few', '5bvk', '6y23', '6brj', '6tu9', '6fer', '3zzw', '4gt4', '3eta', '1irk', '4ibm', '5hhw', '5e1s', '1p14', '3ekk', '6pxv', '5xff', '5xfj', '5jkg', '4uxq', '1i44', '4tye', '5nud', '4qqj', '6jpe', '4qqt', '2z8c', '1gag', '1rqq', '4qq5', '6iuo', '4qqc', '4xlv', '4k33', '4fnw', '2yjr', '3lco', '3lw0', '1p4o', '6pnx', '4ux0', '1agw', '3rhx', '4zsa', '3kxx', '5a4c', '4rwi', '4f63', '4wun', '3js2', '5zv2', '3c4f', '3gql', '5aa8', '5aa9', '4fnx', '2xb7', '6mx8', '4fnz', '6e0r', '4tt7', '4dce', '5a9u', '2xp2', '3lcs', '2yfx', '3aox', '2yjs', '4z55', '4fob', '2yhv', '3l9p', '4hw7', '6t2w', '4xcu', '6lvm', '3tt0', '1jqh', '3o23', '3qqu', '2oj9', '3i81', '5fxq', '5fxr', '4d2r', '1m7n', '3lvp', '3d94', '6jk8', '6mzw', '6nvl', '5flf', '5vnd', '4anl', '4ans', '2ogv', '5i9u', '1mqb

In [5]:
model_template = list(model_templates.keys())[0]

In this toy example our BLAST search using the `4YNE` sequence has, as expected, returned the `4YNE` structure in the PDB as the "best" model for us to use as our template.
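Taking the first key works because Python dictionaries preserve insertion order (guaranteed since Python 3.7). The following is a minimal sketch, assuming `get_pdb_template` inserts hits in BLAST rank order, which the output above suggests but this notebook does not state. The `first_key` helper and the `toy_templates` dictionary are hypothetical stand-ins, not part of KinoML.

```python
def first_key(mapping):
    """Return the first inserted key of a dict without building a full list."""
    try:
        return next(iter(mapping))
    except StopIteration:
        raise ValueError("no templates were returned") from None


# Hypothetical stand-in for `model_templates`; the values are placeholders,
# not real BLAST results.
toy_templates = {"4yne": None, "4aoj": None, "5h3q": None}

best_template = first_key(toy_templates)
print(best_template)
```

Raising an explicit error for an empty result is a design choice here: it gives a clearer message than the `IndexError` that `list(...)[0]` would produce if the search returned no hits.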

Typically, we will want to search the PDB with a query sequence and find the most relevant PDB structure to use as our template. We can use the `get_sequence` method to obtain the full canonical sequence from a database based on a unique ID (using either the default `backend="uniprot"` or `backend="ncbi"`). Defaults: `backend="uniprot"` and `kinase=True`; the latter restricts the result to sequences that contain a kinase domain.

In [7]:
uniprot_seq = hm.get_sequence('P04629', kinase=True)  # Get sequence based on the UNIPROT ID = P04629. Default: `kinase=True` is shown here for clarity.
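To illustrate what a sequence lookup by UniProt ID involves, here is a rough sketch that uses only the standard library. It is not how `get_sequence` is implemented: the helper names are hypothetical, the sketch does no kinase-domain filtering, and it assumes the public UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta`. The demonstration at the end parses a made-up FASTA record so that it runs offline.

```python
from urllib.request import urlopen


def uniprot_fasta_url(accession):
    """Build the UniProt REST URL for one entry in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; keep only the first
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def fetch_uniprot_sequence(accession, timeout=30):
    """Download and parse one UniProt entry (requires network access)."""
    with urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Offline demonstration with a made-up record (not a real UniProt entry).
toy_header, toy_sequence = parse_fasta(">toy|X00000|EXAMPLE\nMKTAY\nIAKQR\n")
print(toy_header, toy_sequence)
```

With network access, `fetch_uniprot_sequence("P04629")` would return the header and the full-length sequence for the entry used in this notebook.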

We can then run a BLAST search with this canonical sequence to find the best template structure in the PDB.

In [8]:
model_templates = hm.get_pdb_template(uniprot_seq) # Need to fix timeouts

@> Blast searching NCBI PDB database for "IVLKW..."
@> Blast search completed in 68.7s.


In [9]:
model_templates.keys()

dict_keys(['4aoj', '5h3q', '5kml', '6d1y', '6iqn', '5kmi', '4gt5', '4pmm', '5i8a', '4yne', '6nsp', '5kvt', '6nss', '4f0i', '5jfs', '6d22', '5wr7', '6npt', '6pl1', '3v5q', '6kzc', '4ymj', '4asz', '1luf', '3zos', '5fdp', '5bvk', '6few', '6y23', '6brj', '6fer', '6tu9', '3eta', '1irk', '4ibm', '5hhw', '5e1s', '1p14', '3ekk', '1i44', '6pxv', '3zzw', '4gt4', '4fnw', '2yjr', '2z8c', '1gag', '1rqq', '4xlv', '5aa9', '5aa8', '6mx8', '2xb7', '4fnz', '4dce', '2xp2', '6e0r', '4tt7', '5a9u', '3aox', '2yfx', '2yjs', '4fob', '2yhv', '3lcs', '4z55', '3l9p', '3lw0', '1p4o', '4fnx', '4ans', '4anl', '6pyh', '3lco', '2oj9', '5fxq', '1jqh', '3o23', '3i81', '5fxr', '3qqu', '4d2r', '3lvp', '1m7n', '3d94', '6jk8', '5xfj', '5xff', '4uxq', '5jkg', '6t2w', '3lcd', '3zbf', '4tye', '5nud', '4qqj', '6jpe', '4qqt', '3gql', '4hw7', '2ogv', '5fxs', '4qq5', '4qqc', '6iuo', '3tt0', '4k33', '2zm3', '1k3a', '5a46', '4zsa', '3rhx', '5a4c', '1agw', '4ux0', '4f63', '4rwi', '4wun', '3kxx', '5zv2', '3js2', '6pnx', '3c4f', '3gqi

Here, `model_templates.keys()` has returned a number of PDB structures we could use as templates for our homology model. 

### Generating the alignment

A structure (either one prepared by the user or one downloaded from the PDB with `ProteinStructure.from_name()`) can be used as the template onto which we build our homology model.

We also need a target sequence. This is downloaded from the UniProt server using `HomologyModel().get_sequence()`, as shown above.

This information can be used in `Alignment.get_alignment()` to produce an alignment of the two sequences. 

The user can also use `Alignment.make_ali_file()` that generates an alignment file based on MODELLER formatting.

#### Generating an alignment and homology model: no ligand

In this example we shall use the `4yne` PDB structure as a template and the UniProt sequence for NTRK1 as the target to build a homology model.

In [10]:
structure_from_pdb = ProteinStructure.from_name('4yne') 
uniprot_seq = hm.get_sequence('P04629', kinase=True)

In [16]:
# Generate the sequence alignment between the template and target sequence
seq_alignment = Alignment.get_alignment(structure_from_pdb.sequence.sequence, uniprot_seq)
# To access the alignment information you can use the metadata keys:
seq_alignment.metadata.keys()

dict_keys(['score', 'sequence_identity', 'symbols', 'codes'])
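To give a feel for what an alignment score and a sequence identity represent, the sketch below implements a textbook Needleman-Wunsch global alignment in pure Python. It is an illustration only: the scoring scheme (match `+1`, mismatch `-1`, linear gap `-1`) and the identity definition (identical columns divided by alignment length) are assumptions, and `Alignment.get_alignment()` may use a different algorithm, substitution matrix, and identity definition.

```python
def global_align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score[-1][-1]


def sequence_identity(aligned_a, aligned_b):
    """Fraction of alignment columns with the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return matches / len(aligned_a)


# Toy sequences, chosen so that the optimal alignment is unambiguous.
aligned_a, aligned_b, score = global_align("ACDEFG", "ACDFG")
identity = sequence_identity(aligned_a, aligned_b)
print(aligned_a, aligned_b, score, identity)
```

The two aligned strings play the same role as the two entries under the `symbols` key above: gapped versions of the template and target sequences with equal length.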

In [17]:
# Now, we can generate an alignment file (based on Modeller)
seq_alignment.make_ali_file(
    seq_alignment.metadata['symbols'][0], # the aligned template sequence
    seq_alignment.metadata['symbols'][1], # the aligned target sequence
    structure_from_pdb,
    uniprot_seq
    )
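For orientation, MODELLER alignment files use the PIR format: each record starts with a `>P1;code` line, followed by a header line of ten colon-separated fields (entry type, code, first residue, first chain, last residue, last chain, and four descriptive fields), and then the gapped sequence terminated by `*`. The sketch below writes such a record by hand. The `pir_record` helper and the toy sequences are hypothetical, and the exact fields that `make_ali_file` fills in may differ.

```python
def pir_record(code, aligned_seq, kind="sequence",
               start="", start_chain="", end="", end_chain="", width=75):
    """Format one PIR record in the layout MODELLER reads.

    `kind` is the entry type, e.g. "structureX" for an X-ray template
    or "sequence" for the target.
    """
    # Ten fields in total; the last four (name, source, resolution,
    # R-factor) are left empty in this sketch.
    fields = [kind, code, start, start_chain, end, end_chain, "", "", "", ""]
    body = aligned_seq + "*"
    wrapped = "\n".join(body[i:i + width] for i in range(0, len(body), width))
    return f">P1;{code}\n{':'.join(fields)}\n{wrapped}\n"


# Toy aligned sequences; the codes and residue numbers are placeholders.
template_record = pir_record("toy_template", "ACDEFG", kind="structureX",
                             start="1", start_chain="A", end="6", end_chain="A")
target_record = pir_record("toy_target", "ACD-FG")
print(template_record + "\n" + target_record)
```

The sequence is wrapped with plain slicing rather than `textwrap`, because `textwrap` treats the gap character `-` as a preferred break point and would produce uneven lines.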

The alignment file can now be used to generate a homology model using Modeller. In a real use case, the user will need to generate a large number of models and score them (e.g. using DOPE or QMEAN). Here, we shall only generate one.

In [18]:
hm.make_model(structure_from_pdb, uniprot_seq, seq_alignment, num_models=1)

.0000       1.000
 9 Distance restraints 1 (CA-CA)      :       0       0      0   0.000   0.000      0.0000       1.000
10 Distance restraints 2 (N-O)        :       0       0      0   0.000   0.000      0.0000       1.000
11 Mainchain Phi dihedral restraints  :       0       0      0   0.000   0.000      0.0000       1.000
12 Mainchain Psi dihedral restraints  :       0       0      0   0.000   0.000      0.0000       1.000
13 Mainchain Omega dihedral restraints:      36       4      6   0.261   0.261      95.357       1.000
14 Sidechain Chi_1 dihedral restraints:      28       0      2   1.616   1.616      20.504       1.000
15 Sidechain Chi_2 dihedral restraints:      21       0      3   1.358   1.358      19.340       1.000
16 Sidechain Chi_3 dihedral restraints:      14       0      0   1.143   1.143      9.9072       1.000
17 Sidechain Chi_4 dihedral restraints:       6       0      0   1.778   1.778      5.2542       1.000
18 Disulfide distance restraints      :       0       0
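When many models are generated, they are usually ranked by score before one is picked. A minimal sketch of that selection step follows, assuming the scores have already been collected into a dictionary keyed by model file name. For DOPE, lower (more negative) values indicate a better model. The `rank_models` helper is hypothetical, and the numbers are made up for illustration; they are not output from Modeller or KinoML.

```python
def rank_models(scores):
    """Sort (model_name, score) pairs from best (lowest) to worst."""
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up DOPE-like scores for illustration only.
toy_scores = {
    "model_1.pdb": -30100.5,
    "model_2.pdb": -31250.0,
    "model_3.pdb": -29875.2,
}

ranking = rank_models(toy_scores)
best_name, best_score = ranking[0]
print(best_name, best_score)
```

A score where higher is better would need the sort reversed, so it is worth checking the convention of each scoring function before combining them.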

#### Generating an alignment and homology model: with ligand

Often, structures will contain the coordinates of bound ligands. If these are of interest and need to be retained in the model, pass `ligand=True` when writing the alignment file and when building the model:

In [19]:
# We generate the same alignment file (based on Modeller), but now retain the bound ligand
seq_alignment.make_ali_file(
    seq_alignment.metadata['symbols'][0], # the aligned template sequence
    seq_alignment.metadata['symbols'][1], # the aligned target sequence
    structure_from_pdb,
    uniprot_seq,
    ligand=True
    )
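In MODELLER's alignment convention, a `.` character stands for a generic HETATM residue and `/` marks a chain break, so a ligand is typically carried over by appending the same marker to both the template and the target rows (and by enabling HETATM reading in MODELLER with `env.io.hetatm = True`). The sketch below shows only that string manipulation. The `append_ligands` helper is hypothetical, and whether `ligand=True` in KinoML produces exactly this layout is an assumption.

```python
def append_ligands(aligned_seq, n_ligands=1, chain_break=True):
    """Append MODELLER-style ligand markers to an aligned sequence.

    Each '.' represents one HETATM residue; '/' separates it from the
    protein chain.
    """
    if n_ligands < 0:
        raise ValueError("n_ligands must be non-negative")
    if n_ligands == 0:
        return aligned_seq
    return aligned_seq + ("/" if chain_break else "") + "." * n_ligands


# Toy aligned sequences; both rows receive the same ligand marker so that
# the alignment columns stay matched.
template_row = append_ligands("ACDEFG")
target_row = append_ligands("ACD-FG")
print(template_row)
print(target_row)
```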

In [20]:
hm.make_model(structure_from_pdb, uniprot_seq, seq_alignment, num_models=1, ligand=True)

  24.805       1.000
14 Sidechain Chi_1 dihedral restraints:      29       0      1   1.336   1.336      10.207       1.000
15 Sidechain Chi_2 dihedral restraints:      22       0      0   1.424   1.424      14.724       1.000
16 Sidechain Chi_3 dihedral restraints:      14       0      0   1.901   1.901      11.528       1.000
17 Sidechain Chi_4 dihedral restraints:       6       0      0   1.474   1.474      2.8384       1.000
18 Disulfide distance restraints      :       0       0      0   0.000   0.000      0.0000       1.000
19 Disulfide angle restraints         :       0       0      0   0.000   0.000      0.0000       1.000
20 Disulfide dihedral angle restraints:       0       0      0   0.000   0.000      0.0000       1.000
21 Lower bound distance restraints    :       0       0      0   0.000   0.000      0.0000       1.000
22 Upper bound distance restraints    :       0       0      0   0.000   0.000      0.0000       1.000
23 Distance restraints 3 (SDCH-MNCH)  :       0     