# aC helix descriptors

This notebook aims to identify descriptors of aC helix conformations that can be used to bias SAMS simulations. The [KLIFS database](https://klifs.net/index.php) is queried for available kinase structures, which are then analyzed for five distances between C alpha atoms of the aC helix and the hinge region. These distances were chosen because the hinge region is part of the ATP binding pocket and may be stable enough to pull the aC helix closer to the ATP pocket without altering the overall structure of the kinase (aC helix out -> in). Finally, the mean and standard deviation of each distance are calculated per aC helix conformation; these values are the basis for adding bias to SAMS simulations.

In [1]:
import pathlib

from appdirs import user_cache_dir
import MDAnalysis as mda
from opencadd.databases.klifs import setup_remote
from openeye import oechem
from tqdm import tqdm

INFO:opencadd.databases.klifs.api:If you want to see an non-truncated version of the DataFrames in this module, use `pd.set_option('display.max_columns', 50)` in your notebook.


In [2]:
# Set up remote session
remote = setup_remote()

INFO:opencadd.databases.klifs.api:Set up remote session...
INFO:opencadd.databases.klifs.api:Remote session is ready!


In [3]:
# retrieve kinase structures
kinase_df = remote.structures.all_structures()
# remove NMR structures
kinase_df = kinase_df[kinase_df["structure.resolution"].notna()]
print("Number of PDB entries:", len(set(kinase_df["structure.pdb_id"])))
print("Number of KLIFS entries:", len(kinase_df))

Number of PDB entries: 5327
Number of KLIFS entries: 11485


In [4]:
def klifs_to_pdb_resids(klifs_resids, klifs_structure_id):
    """
    Convert klifs pocket resids into the corresponding pdb resids of the 
    specified klifs structure.
    
    Parameters
    ----------
    klifs_resids: list of int
        KLIFS pocket residue ids.
    klifs_structure_id: int
        KLIFS structure ID of the structure to get the corresponding pdb resids for.
        
    Returns
    -------
    convert_dict: dict
        The dictionary with klifs resids as keys and pdb resids as values.
    """    
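    pocket = remote.pockets.by_structure_klifs_id(klifs_structure_id)
    # keep only the requested pocket residues
    pocket = pocket[pocket["residue.klifs_id"].isin(klifs_resids)]
    # pair each KLIFS resid with the PDB resid from the same row, so that
    # row order or missing rows cannot misalign the mapping;
    # "_" marks residues that are not resolved in the structure
    convert_dict = {
        int(klifs_resid): int(pdb_resid)
        for klifs_resid, pdb_resid in zip(pocket["residue.klifs_id"], pocket["residue.id"])
        if pdb_resid != "_"
    }
    return convert_dict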
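
The mapping above relies on each KLIFS residue id being paired with the PDB residue id from the same pocket row. The following is a minimal sketch of why that matters, using a toy list of rows instead of the real pocket table; `pocket_rows`, `requested` and all values are made up for illustration. If the requested ids are sorted and the pocket table is ordered by KLIFS id, both pairings agree.

```python
# Toy stand-in for the pocket table: (KLIFS residue id, PDB residue id) rows.
# The values are made up; "_" marks a residue missing from the structure.
pocket_rows = [(20, "85"), (22, "_"), (24, "89"), (26, "91")]
requested = [24, 20, 26]  # deliberately unsorted request

# Positional pairing: assumes the filtered rows come back in request order
filtered_pdb_ids = [pdb for klifs, pdb in pocket_rows if klifs in requested]
positional = dict(zip(requested, filtered_pdb_ids))

# Row-wise pairing: each KLIFS id keeps the PDB id from its own row
row_wise = {
    klifs: int(pdb)
    for klifs, pdb in pocket_rows
    if klifs in requested and pdb != "_"
}

print("positional:", positional)
print("row-wise:  ", row_wise)
```

With the unsorted request, the positional pairing assigns KLIFS residue 24 the PDB id of residue 20, while the row-wise pairing stays correct.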

In [5]:
def distance(x, y):
    """ This function returns the Euclidean distance between two points in three-dimensional space. """
    return ((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 + (x[2] - y[2]) ** 2) ** 0.5
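
As a cross-check, the same quantity is available in the Python standard library as `math.dist` (Python 3.8 or later). The sketch below compares it with the hand-written formula; the coordinates are illustrative assumptions, not taken from any KLIFS structure.

```python
import math


def distance(x, y):
    """Euclidean distance between two points in three-dimensional space."""
    return ((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2 + (x[2] - y[2]) ** 2) ** 0.5


# Illustrative coordinates in angstrom (assumed values, not real CA positions)
point_a = (1.0, 2.0, 3.0)
point_b = (4.0, 6.0, 3.0)

manual = distance(point_a, point_b)
builtin = math.dist(point_a, point_b)
print(manual, builtin)
```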

The aC helix residues (20, 22, 24, 26, 28) and the hinge region residues (45, 46, 47, 48, 49) are positions in the KLIFS pocket numbering, which is aligned across protein kinases. The resulting distances should therefore be applicable to SAMS simulations of most protein kinases of interest ([KLIFS residues](https://klifs.net/faq.php)).
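
The five distances defined in the next cell pair the aC helix residues in ascending order with the hinge residues in descending order (20 with 49, 22 with 48, and so on). A minimal sketch that derives the same labels programmatically; `ac_helix_residues`, `hinge_residues` and `residue_pairs` are names introduced here for illustration only.

```python
# KLIFS pocket residue numbers taken from the text above
ac_helix_residues = [20, 22, 24, 26, 28]
hinge_residues = [45, 46, 47, 48, 49]

# Pair helix residues (ascending) with hinge residues (descending)
residue_pairs = list(zip(ac_helix_residues, reversed(hinge_residues)))

critical_residues = ac_helix_residues + hinge_residues
critical_distances = [f"{helix}_{hinge}" for helix, hinge in residue_pairs]
print(critical_distances)
```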

In [6]:
critical_residues = [20, 22, 24, 26, 28, 45, 46, 47, 48, 49]
critical_distances = ["20_49", "22_48", "24_47", "26_46", "28_45"]
conformations = ["out", "out-like", "in"]

In [7]:
directory = pathlib.Path(user_cache_dir()) / "klifs_structures"
directory.mkdir(parents=True, exist_ok=True)
distance_dict = {conformation: {critical_distance: [] 
                                for critical_distance in critical_distances} 
                 for conformation in conformations}
complete_df = kinase_df[(kinase_df["structure.dfg"] != "na") & 
                        (kinase_df["structure.ac_helix"] != "na")]
complete_df = complete_df[~complete_df["structure.pdb_id"].isin(["6pjx"])] # remove problematic structures
for index, structure in tqdm(complete_df.iterrows(), total=complete_df.shape[0]):
    path = directory / f"{structure['structure.klifs_id']}.pdb"
    if not path.is_file():
        pdb_text = remote.coordinates.to_text(structure["structure.klifs_id"], 
                                              extension="pdb")
        with open(path, "w") as wf:
            wf.write(pdb_text)
    pdb_structure = mda.Universe(path, guess_masses=False)
    klifs_to_pdb_dict = klifs_to_pdb_resids(critical_residues, 
                                            structure["structure.klifs_id"])
    if len(klifs_to_pdb_dict) == len(critical_residues):
        not_unique_atoms = False
        for critical_distance in critical_distances:
            residue_pair = critical_distance.split("_")
            coords1 = pdb_structure.select_atoms(
                f"resid {klifs_to_pdb_dict[int(residue_pair[0])]} and name CA").positions
            if len(coords1) == 0:
                print(f"No atom matches for structure {structure['structure.klifs_id']} with selection for klifs resid {residue_pair[0]} and name CA ...")
                break
            elif len(coords1) == 1:
                coords1 = coords1[0]
            else:
                not_unique_atoms = True
                print(f"Multiple atoms match selection for klifs resid {residue_pair[0]} and name CA ...")
                break
            coords2 = pdb_structure.select_atoms(
                f"resid {klifs_to_pdb_dict[int(residue_pair[1])]} and name CA").positions
            if len(coords2) == 0:
                print(f"No atom matches for structure {structure['structure.klifs_id']} with selection for klifs resid {residue_pair[1]} and name CA ...")
                break
            elif len(coords2) == 1:
                coords2 = coords2[0]
            else:
                not_unique_atoms = True
                print(f"Multiple atoms match selection for klifs resid {residue_pair[1]} and name CA ...")
                break
            distance_dict[structure["structure.ac_helix"]][critical_distance].append(distance(coords1, coords2))
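        # an ambiguous selection stops the whole scan (not just this structure),
        # so that the offending structure can be inspected
        if not_unique_atoms:
            break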

No atom matches for structure 9072 with selection for klifs resid 49 and name CA ...
No atom matches for structure 3929 with selection for klifs resid 20 and name CA ...
No atom matches for structure 3910 with selection for klifs resid 20 and name CA ...
No atom matches for structure 4468 with selection for klifs resid 20 and name CA ...
No atom matches for structure 5529 with selection for klifs resid 22 and name CA ...
No atom matches for structure 9826 with selection for klifs resid 22 and name CA ...
No atom matches for structure 4648 with selection for klifs resid 20 and name CA ...
No atom matches for structure 3446 with selection for klifs resid 28 and name CA ...
No atom matches for structure 12320 with selection for klifs resid 26 and name CA ...
No atom matches for structure 9533 with selection for klifs resid 45 and name CA ...
No atom matches for structure 3368 with selection for klifs resid 45 and name CA ...

100%|██████████| 11136/11136 [54:16<00:00,  3.42it/s]
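
In the loop above, a structure that fails partway through the five distances has already contributed its earlier distances to `distance_dict`, so the per-distance lists can end up with slightly different lengths. One possible alternative, sketched here under assumptions, is to compute all distances for a structure first and store them only if every selection matched exactly one CA atom. The helper `collect_distances` and its input format (a dict mapping a residue id to a list of CA coordinates) are hypothetical and not part of the notebook.

```python
import math


def collect_distances(ca_positions, residue_pairs):
    """Return all pair distances, or None if any CA position is missing or ambiguous.

    ca_positions maps a residue id to a list of matching CA coordinates
    (hypothetical input format used only for this sketch).
    """
    distances = {}
    for first, second in residue_pairs:
        matches_first = ca_positions.get(first, [])
        matches_second = ca_positions.get(second, [])
        if len(matches_first) != 1 or len(matches_second) != 1:
            return None  # skip the whole structure
        distances[f"{first}_{second}"] = math.dist(matches_first[0], matches_second[0])
    return distances


# Made-up coordinates: residue 49 has no CA atom in the second "structure"
pairs = [(20, 49), (22, 48)]
complete = {20: [(0.0, 0.0, 0.0)], 49: [(3.0, 4.0, 0.0)],
            22: [(1.0, 1.0, 1.0)], 48: [(1.0, 1.0, 3.0)]}
incomplete = {20: [(0.0, 0.0, 0.0)],
              22: [(1.0, 1.0, 1.0)], 48: [(1.0, 1.0, 3.0)]}

result_complete = collect_distances(complete, pairs)
result_incomplete = collect_distances(incomplete, pairs)
print(result_complete, result_incomplete)
```

Only structures for which the helper returns a dictionary would then be appended to the per-conformation lists, which keeps all five lists the same length.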


In [8]:
def mean(data):
    """ This function returns the arithmetic mean. """
    return sum(data) / len(data)


def squared_deviations_from_mean(data):
    """ This function returns the sum of squared deviations from the mean. """
    c = mean(data)
    return sum((x - c) ** 2 for x in data)


def standard_deviation(data):
    """ This function returns the population standard deviation. """
    return (squared_deviations_from_mean(data) / len(data)) ** 0.5
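
The functions above compute the population standard deviation (division by `n`, not `n - 1`). As a sanity check, the standard library's `statistics.fmean` and `statistics.pstdev` should give the same results; the sketch below verifies this on made-up distances and also shows the slightly larger sample standard deviation from `statistics.stdev` for comparison.

```python
import statistics

# Illustrative distances in angstrom (made-up values)
data = [26.0, 27.5, 25.0, 28.5]

# Hand-written population statistics, as in the notebook cell above
manual_mean = sum(data) / len(data)
manual_std = (sum((x - manual_mean) ** 2 for x in data) / len(data)) ** 0.5

# Standard library equivalents
library_mean = statistics.fmean(data)
library_std = statistics.pstdev(data)   # population: divides by n
sample_std = statistics.stdev(data)     # sample: divides by n - 1

print(manual_mean, manual_std, library_std, sample_std)
```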

In [9]:
statistics_dict = {conformation: {critical_distance: {} 
                                  for critical_distance in critical_distances} 
                   for conformation in conformations}

In [10]:
for conformation in conformations:
    for critical_distance in critical_distances:
        statistics_dict[conformation][critical_distance]["mean"] = mean(distance_dict[conformation][critical_distance])
        statistics_dict[conformation][critical_distance]["standard_deviation"] = standard_deviation(distance_dict[conformation][critical_distance])

In [11]:
statistics_dict

{'out': {'20_49': {'mean': 29.388096876141358,
   'standard_deviation': 2.3639302147587578},
  '22_48': {'mean': 23.618174779749292,
   'standard_deviation': 1.571313107813831},
  '24_47': {'mean': 20.18941644703819,
   'standard_deviation': 2.126781989616121},
  '26_46': {'mean': 16.37652998536413,
   'standard_deviation': 1.6903113654026147},
  '28_45': {'mean': 10.825879549548887,
   'standard_deviation': 1.7518496321605705}},
 'out-like': {'20_49': {'mean': 26.505665670276603,
   'standard_deviation': 1.5053202767038916},
  '22_48': {'mean': 22.85381628348522,
   'standard_deviation': 1.3242462025428903},
  '24_47': {'mean': 17.876768974216663,
   'standard_deviation': 1.0356373490369941},
  '26_46': {'mean': 16.604565303773455,
   'standard_deviation': 1.0961141455411763},
  '28_45': {'mean': 9.804295389753715,
   'standard_deviation': 0.6940670233499328}},
 'in': {'20_49': {'mean': 26.076600918421665,
   'standard_deviation': 2.1457908171117595},
  '22_48': {'mean': 22.4019541925

According to these statistics, aC helix in conformations differ from aC helix out conformations, even though the distances between the hinge region and the aC helix were not used in the KLIFS classification scheme: for the 20_49 distance, the mean is about 26.1 Å for in versus 29.4 Å for out, with standard deviations of roughly 2 Å. Interestingly, in the values shown above, out-like conformations (26.5 Å for 20_49, 22.9 Å for 22_48) lie closer to in conformations (26.1 Å and 22.4 Å) than to out conformations (29.4 Å and 23.6 Å). Hence, the out-like classification may reflect movements in the DFG motif rather than in the aC helix.
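
One way to put a number on how well two conformations separate along a given distance is the absolute difference of the means divided by a pooled standard deviation. This metric is an assumption of the sketch below, not part of the original analysis. Only the out and out-like values are used, rounded from the statistics output above, because they are shown in full; `standardized_difference` is a hypothetical helper.

```python
def standardized_difference(mean_a, std_a, mean_b, std_b):
    """Absolute mean difference in units of the pooled standard deviation."""
    pooled_std = ((std_a ** 2 + std_b ** 2) / 2) ** 0.5
    return abs(mean_a - mean_b) / pooled_std


# (mean, standard deviation) in angstrom, rounded from the output above
out = {"20_49": (29.388, 2.364), "28_45": (10.826, 1.752)}
out_like = {"20_49": (26.506, 1.505), "28_45": (9.804, 0.694)}

separation = {
    key: standardized_difference(*out[key], *out_like[key])
    for key in out
}
for key, value in separation.items():
    print(f"{key}: {value:.2f} pooled standard deviations")
```

Larger values indicate less overlap between the two distance distributions, so such a score could help decide which of the five distances is the most informative descriptor.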