## Dunbrack cluster assignment

This notebook assigns structures (e.g. a PDB file or a simulation trajectory) to Dunbrack clusters, following the classification of [Modi and Dunbrack, 2019](https://pubmed.ncbi.nlm.nih.gov/30867294/).

Maintainers: [@jiayeguo](https://github.com/jiayeguo), [@schallerdavid](https://github.com/schallerdavid)

In [1]:
from MDAnalysis.core.universe import Universe
import numpy as np
from kinoml.features import kinase, protein_struct_features, dunbrack_cluster
from kinoml.features.klifs import query_klifs_database
import pandas as pd
from subprocess import Popen, PIPE

#### set pdbid and chain id for the structure of interest

In [2]:
# PDB ID and chain ID of the structure to analyze
pdbid = '2JIU'
chain = 'B'

#### load the structure using [MDAnalysis](https://www.mdanalysis.org).

A PDB structure is loaded as an [MDAnalysis.core.universe.Universe object](https://www.mdanalysis.org/docs/documentation_pages/core/universe.html#MDAnalysis.core.universe.Universe) with a single frame.

In [3]:
u = Universe(f'data/{pdbid}_chain{chain}.pdb')

#### get key residue indices to calculate structural features for cluster assignment 

In [4]:
klifs = query_klifs_database(pdbid, chain)
key_res = protein_struct_features.key_klifs_residues(klifs['numbering'])

#### calculate the dihedrals and distances for Dunbrack cluster assignment

In [5]:
dihedrals, distances = protein_struct_features.compute_simple_protein_features(u, key_res)

Choosing frames to analyze
Starting preparation
Finishing up


There is no missing coordinates.  All dihedrals and distances will be computed.
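
For readers unfamiliar with the two feature types, the following is a minimal, library-free sketch of how a dihedral angle and a distance can be computed from atom coordinates. It is an illustration only and not the code used by `compute_simple_protein_features`; the helper names (`dihedral_degrees`, `distance`) and the coordinates are hypothetical.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four consecutive points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Normalize the central bond vector.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bond vectors onto the plane perpendicular to b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    # The angle between the two projections is the dihedral angle.
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(_dot(_sub(a, b), _sub(a, b)))


# Toy coordinates (hypothetical, not taken from any structure).
cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
side = distance((0, 0, 0), (3, 4, 0))
print(cis, trans, side)
```

Using `atan2` on the projected vectors gives a signed angle over the full circle, which a plain `acos` of the dot product would not.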


#### assign the conformation into a Dunbrack cluster

In [6]:
assignment = dunbrack_cluster.assign(dihedrals, distances)[0]
print(f"The input structure is assigned to Dunbrack cluster(s): {assignment}")

The input structure is assigned to Dunbrack cluster(s): 7
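
A simplified way to picture a dihedral-based assignment is a nearest-centroid lookup with a periodic angular distance. The sketch below is not the implementation behind `dunbrack_cluster.assign`: the centroids, the cutoff, and the function names are made up for illustration, and the distance term `2 * (1 - cos(delta))` is an assumption about a reasonable angular metric rather than a statement about the library.

```python
import math


def angular_distance(angles_a, angles_b):
    """Mean of 2 * (1 - cos(delta)) over paired angles given in degrees.

    The cosine makes the measure periodic, so 179 and -179 degrees are close.
    """
    terms = [
        2.0 * (1.0 - math.cos(math.radians(a - b)))
        for a, b in zip(angles_a, angles_b)
    ]
    return sum(terms) / len(terms)


def nearest_cluster(angles, centroids, cutoff):
    """Return the label of the closest centroid, or None if none is within cutoff."""
    best_label, best_dist = None, float("inf")
    for label, centroid in centroids.items():
        dist = angular_distance(angles, centroid)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= cutoff else None


# Hypothetical two-angle centroids; these are NOT the published cluster centers.
TOY_CENTROIDS = {0: (-60.0, -45.0), 1: (-120.0, 130.0)}
TOY_CUTOFF = 0.5  # hypothetical threshold

print(nearest_cluster((-65.0, -40.0), TOY_CENTROIDS, TOY_CUTOFF))
print(nearest_cluster((60.0, 60.0), TOY_CENTROIDS, TOY_CUTOFF))
```

Returning `None` for conformations far from every centroid mirrors the unassigned category that appears as cluster ID `None` in this notebook.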


## Assign all KLIFS entries to respective Dunbrack clusters

In [7]:
lib = dunbrack_cluster.PDBDunbrack()

Enabling RDKit 2020.03.4 jupyter extensions
100%|██████████| 11407/11407 [00:10<00:00, 1106.67it/s]


Assigning 11k structures is time-consuming because it requires downloading all files from the PDB. The results are therefore stored locally in the user cache. If the cluster assignment functions change considerably, the local library can be overwritten via `lib.update(reinitialize=True)`.
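
The caching behaviour described above can be pictured with the following minimal sketch. It is a generic "load from a local CSV unless a rebuild is requested" pattern and not the `PDBDunbrack` implementation; `load_or_build`, `expensive_build`, the field names, and the toy row are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["pdb", "chain", "dunbrack_cluster"]  # hypothetical column subset


def load_or_build(cache_file, build_rows, fieldnames, reinitialize=False):
    """Read rows from a CSV cache, rebuilding the file first if it is
    missing or if reinitialize is True."""
    cache_file = Path(cache_file)
    if reinitialize or not cache_file.exists():
        rows = build_rows()  # the slow step, e.g. download and assign
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with cache_file.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    # Always read back from disk so cached and fresh results look identical.
    with cache_file.open(newline="") as handle:
        return list(csv.DictReader(handle))


build_calls = []


def expensive_build():
    """Stand-in for the slow assignment step; records how often it runs."""
    build_calls.append(1)
    return [{"pdb": "toy1", "chain": "A", "dunbrack_cluster": "7"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_library.csv"
    first = load_or_build(cache, expensive_build, FIELDS)
    second = load_or_build(cache, expensive_build, FIELDS)  # served from cache
    third = load_or_build(cache, expensive_build, FIELDS, reinitialize=True)

n_builds = len(build_calls)
print(n_builds)
```

The slow step runs only when the cache is missing or a rebuild is explicitly requested, which is why a stale cache has to be overwritten by hand after the assignment logic changes.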

In [8]:
lib

<PDBDunbrack library located at /home/david/.cache/pdb_dunbrack_library.csv contains 11407 structures.>

How many structures are found for each Dunbrack cluster?

In [9]:
for cluster_id in [0, 1, 2, 3, 4, 5, 6, 7, None]:
    print(f'Cluster ID {cluster_id} has {lib.structures_by_cluster(cluster_id).shape[0]} structures.')

Cluster ID 0 has 5319 structures.
Cluster ID 1 has 370 structures.
Cluster ID 2 has 964 structures.
Cluster ID 3 has 552 structures.
Cluster ID 4 has 936 structures.
Cluster ID 5 has 476 structures.
Cluster ID 6 has 868 structures.
Cluster ID 7 has 193 structures.
Cluster ID None has 1729 structures.


KLIFS contains multiple structures per PDB entry, i.e. chains and alternate locations, so it is worth checking how often different Dunbrack clusters are observed within the same PDB entry.  
***!Alternate locations not yet implemented!***

In [10]:
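# PDB entries (per kinase) whose structures show 5 distinct Dunbrack cluster values;
# note that unique() also counts NaN (unassigned) as a value, if present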
lib.pdb_dunbrack_library.groupby(['pdb', 'kinase_ID']).filter(lambda x: len(x['dunbrack_cluster'].unique()) == 5)

| Unnamed: 0 | structure_ID | kinase | species | kinase_ID | pdb | alt | chain | rmsd1 | rmsd2 | pocket | ... | bp_I_B | bp_II_in | bp_II_A_in | bp_II_B_in | bp_II_out | bp_II_B | bp_III | bp_IV | bp_V | dunbrack_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1957 | 4901 | VRK1 | Human | 193 | 2kul | 10 | A | 0.903 | 2.268 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 0.0 |
| 1961 | 4921 | VRK1 | Human | 193 | 2kul | 2 | A | 0.841 | 2.289 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 2.0 |
| 1971 | 4896 | VRK1 | Human | 193 | 2kul | 18 | A | 0.901 | 2.311 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 0.0 |
| 1972 | 4949 | VRK1 | Human | 193 | 2kul | 6 | A | 0.979 | 2.195 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 1.0 |
| 1975 | 4924 | VRK1 | Human | 193 | 2kul | 16 | A | 0.961 | 2.313 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 1.0 |
| 1981 | 4948 | VRK1 | Human | 193 | 2kul | 4 | A | 0.838 | 2.21 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 0.0 |
| 1982 | 4919 | VRK1 | Human | 193 | 2kul | 1 | A | 0.865 | 2.236 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 0.0 |
| 1993 | 4915 | VRK1 | Human | 193 | 2kul | 20 | A | 0.863 | 2.243 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 0.0 |
| 2006 | 4883 | VRK1 | Human | 193 | 2kul | 12 | A | 0.879 | 2.214 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 0.0 |
| 2017 | 4887 | VRK1 | Human | 193 | 2kul | 8 | A | 0.889 | 2.279 | LPIGQGGFGCIYLCVVKVEPLFTELKFYQRAALGVPKYWGSFMIMD... | ... | False | False | False | False | False | False | False | False | False | 0.0 |
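
To get an overview for all PDB entries at once, rather than filtering for one specific count, the number of distinct cluster values per entry can be tabulated. The sketch below uses a small toy DataFrame with made-up rows; only the column names `pdb` and `dunbrack_cluster` are taken from the library shown above, and applying the same calls to `lib.pdb_dunbrack_library` is an assumption that has not been run here.

```python
import pandas as pd

# Toy stand-in for the library table (hypothetical rows).
toy = pd.DataFrame(
    {
        "pdb": ["aaaa", "aaaa", "aaaa", "bbbb", "bbbb", "cccc"],
        "chain": ["A", "B", "C", "A", "B", "A"],
        "dunbrack_cluster": [0.0, 2.0, None, 1.0, 1.0, 0.0],
    }
)

# Number of distinct cluster values per PDB entry.
# dropna=False counts NaN (unassigned) as its own value, like Series.unique().
clusters_per_pdb = toy.groupby("pdb")["dunbrack_cluster"].nunique(dropna=False)

# How many PDB entries show 1, 2, 3, ... distinct cluster values.
summary = clusters_per_pdb.value_counts().sort_index()
print(summary)
```

Setting `dropna=True` instead would ignore unassigned structures, which changes the counts whenever an entry mixes assigned and unassigned conformations.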
