# Structure preparation

This notebook guides you through preparing kinase structures for MD simulation and generating additional data files needed for simulation and Markov state modeling.

## Content

- Fetching data from OSF
- Structure preparation
- Cropping to same length
- Mapping residue IDs to KLIFS pocket residue IDs
- Assigning quality scores for each prepared structure

## Fetching data from OSF

All data generated by the scripts of this repository are deposited in an associated [OSF project](https://osf.io/6b3jr/). The [fetch_files_from_osf.py](https://github.com/openkinome/kinase-conformational-modeling/tree/main/scripts/fetch_files_from_osf.py) script can be used to download all files stored in the OSF project.

## Structure preparation

The [KLIFS database](https://klifs.net/) annotates human protein kinases available in the [PDB](https://www.rcsb.org/). The [generate_apo_kinases.py](https://github.com/openkinome/kinase-conformational-modeling/tree/main/scripts/generate_apo_kinases.py) script uses functionality from the [KinoML framework](https://github.com/openkinome/kinoml/) to automatically prepare the protein kinase structures found in KLIFS for a given KLIFS kinase ID (human ABL1: 392, human EGFR: 406), including different chains and alternate locations. This resulted in 84 prepared structures for human ABL1 and 400 prepared structures for human EGFR.

## Cropping to same length

To get a standardized kinase domain, the prepared structures were processed using the [crop_to_same_length.py](https://github.com/openkinome/kinase-conformational-modeling/tree/main/scripts/crop_to_same_length.py) script. This script analyzes all structures for a given kinase, identifies a suitable N- and C-terminus, excludes structures with missing residues, and adds caps to the termini:
- human ABL1
  - 83 structures with residues 246-496
  - excluded structures with gaps:
    - kinoml_OEKLIFSKinaseApoFeaturizer_ABL1_4wa9_chainA_altlocNone_protein
- human EGFR
  - 400 structures with residues 704-981
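
The selection logic described above can be sketched in a few lines of Python. This is a minimal illustration and not the code of the script: the function names, the input format (a mapping from structure name to resolved residue IDs), and the rule for choosing the termini (the range shared by all structures) are assumptions made for this example, and the toy data is made up. Capping of the termini is not covered.

```python
def common_range(structures):
    """Return the (start, end) residue range covered by every structure.

    Assumption: the termini are chosen as the latest first residue and the
    earliest last residue over all structures.
    """
    start = max(min(ids) for ids in structures.values())
    end = min(max(ids) for ids in structures.values())
    return start, end


def split_by_gaps(structures, start, end):
    """Separate structures that resolve every residue in start..end from those with gaps."""
    required = set(range(start, end + 1))
    kept, excluded = [], []
    for name, ids in structures.items():
        if required.issubset(ids):
            kept.append(name)
        else:
            excluded.append(name)
    return kept, excluded


# Hypothetical toy input: structure name -> resolved residue IDs.
structures = {
    "structure_a": range(242, 500),
    "structure_b": range(246, 497),
    # residues 275-279 are missing in this structure
    "structure_c": [r for r in range(240, 499) if r not in range(275, 280)],
}

start, end = common_range(structures)
kept, excluded = split_by_gaps(structures, start, end)
print(f"residues {start}-{end}")
print("kept:", kept)
print("excluded:", excluded)
```

A rule based on the shared range is sensitive to a single short structure, so a real workflow might instead pick termini that are present in most structures and drop the outliers.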

## Mapping residue IDs to KLIFS pocket residue IDs

The KLIFS database annotates 85 pocket residues for each kinase. Using this information in Markov state modeling would allow featurization across different kinases. The [generate_klifs_residues_dictionary.py](https://github.com/openkinome/kinase-conformational-modeling/tree/main/scripts/generate_klifs_residues_dictionary.py) script generates a dictionary in JSON format that maps the kinase-specific residue IDs to the KLIFS pocket residue IDs.
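
To make the idea of such a mapping concrete, the following sketch builds a dictionary from kinase-specific residue IDs to KLIFS pocket positions 1 to 85 and serializes it as JSON. It is only an illustration under stated assumptions: the function name `build_klifs_mapping`, the use of `None` for pocket positions without a residue, and the residue numbers in the example are hypothetical and do not reflect the actual script or real KLIFS data.

```python
import json

N_POCKET_RESIDUES = 85  # number of pocket residues annotated by KLIFS


def build_klifs_mapping(pocket_residue_ids):
    """Map kinase-specific residue IDs to KLIFS pocket positions (1-85).

    `pocket_residue_ids` lists the residue ID found at each pocket position,
    in pocket order. Positions without a residue are given as None
    (an assumption of this sketch) and are skipped.
    """
    if len(pocket_residue_ids) != N_POCKET_RESIDUES:
        raise ValueError(f"expected {N_POCKET_RESIDUES} pocket positions")
    mapping = {}
    for klifs_id, residue_id in enumerate(pocket_residue_ids, start=1):
        if residue_id is None:
            continue
        # JSON object keys are strings, so residue IDs are stored as strings.
        mapping[str(residue_id)] = klifs_id
    return mapping


# Made-up example: 85 consecutive residue IDs with one missing pocket position.
pocket_residue_ids = list(range(246, 331))
pocket_residue_ids[4] = None

mapping = build_klifs_mapping(pocket_residue_ids)
serialized = json.dumps(mapping, indent=2)
```

Because all kinases share the same 85 pocket positions, a feature defined on KLIFS positions (for example a distance between two pocket residues) can be looked up in each kinase through its own mapping.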

## Assigning quality scores for each prepared structure

Because many structures were prepared for MD simulation, it is useful to prioritize them for simulation, e.g. on Folding@home, based on their quality. The KLIFS database annotates a quality score for each structure. The [generate_structure_quality_dictionary.py](https://github.com/openkinome/kinase-conformational-modeling/tree/main/scripts/generate_structure_quality_dictionary.py) script generates a dictionary in JSON format that assigns each prepared structure its quality score from KLIFS.
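
A dictionary of this kind can be used to rank structures before submitting simulations. The sketch below assumes a flat JSON object that maps structure names to numeric scores, with higher scores meaning better quality; the structure names, the score values, and the helper `rank_by_quality` are hypothetical and only illustrate the idea.

```python
import json

# Hypothetical content of a quality dictionary (structure name -> score).
quality_json = '{"structure_a": 8.0, "structure_b": 9.6, "structure_c": 6.4}'


def rank_by_quality(scores, min_score=0.0):
    """Return structure names sorted from highest to lowest score.

    Structures scoring below `min_score` are dropped.
    """
    selected = [name for name, score in scores.items() if score >= min_score]
    return sorted(selected, key=lambda name: scores[name], reverse=True)


scores = json.loads(quality_json)
ranked = rank_by_quality(scores)
top_candidates = rank_by_quality(scores, min_score=8.0)
print(ranked)
print(top_candidates)
```

A score threshold such as `min_score` is one simple way to limit the number of simulations; taking a fixed number of top-ranked structures would work equally well.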