# Characterization of loop dynamics in kinases

We present a workflow to discover protein conformational features associated with kinase loop rearrangements. The purpose of this notebook is to describe the steps adopted in our study. Implementations of the described steps are included as `.py` files within the folder `workflow`.

## 0. Introduction

Our modelling pipeline is subdivided into the following sections: 
1. [Dataset creation](#1)   
    1.1. [The data](#11)   
    1.2. [Data download](#12)   
    1.3. [Kinase taxonomy](#13)   
2. [Dataset curation](#2)  
    2.1. [Extracting protein chains](#21)   
    2.2. [Filtering for activation loop](#22)   
    2.3. [Conformational classification](#23)   
    2.4. [Structural conservation](#24)   
    2.5. [Reconstructing small loop segments](#25)   
    2.6. [Coarse-graining activation loops](#26)   
3. [Dimensionality reduction](#3)   
    3.1. [Principal component analysis (PCA)](#31)   
    3.2. [Convolutional autoencoder](#32)   
4. [Feature definition](#4)   
    4.1. [Feature matrix](#41)  
5. [Feature selection](#5)
6. [Feature classification](#6)   

The overall pipeline, implemented in the sections hereafter, is represented according to the following schematic.

![State of the workflow](images/fullPipelineSchematic.png)

To get started, let's load some packages!

In [None]:
# File and system operations
import os
import sys
import subprocess
from glob import glob
import pickle
import shutil

# Data processing
import pandas as pd
import numpy as np
import mdtraj as md

# Network and parallel processing
import requests
import time
import multiprocessing
import concurrent.futures

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# custom utility functions and class
from workflow.utilities import count_pdb_files, braf_res, clear_and_make, make_seg, copy_filtered_pdbs
from workflow.utilities import PDBDownloader

# 1. Dataset creation <a id="1"></a>

In this section we will parse the Protein Data Bank (PDB) for kinase structures and download them into a dataset.

## 1.1 The data <a id="11"></a>
Our approach involves searching for protein homologs to the reference sequence: a BRAF kinase (PDB code: 6UAN).

BRAF is a key part of the MAPK/ERK pathway. This pathway relays signals from outside the cell, like growth factors binding to receptor tyrosine kinases, to control cell growth, division, survival, and differentiation.

The active site sits in a cleft between the small N-terminal lobe and the larger C-terminal lobe of the kinase domain. 
ATP binds in a pocket on the N-lobe side of the kinase domain, at the phosphate-binding loop (P-loop). 
The protein substrate, mainly MEK kinase, contacts a broad surface on the C-lobe of BRAF. 
The activation loop and αC helix interact with residues in the binding site, regulating activation. 

![State of the workflow](images/BRAFSlide1.png)

When creating our dataset, we take the BRAF structure as our structural reference, since we are familiar with its typical regulatory role during phosphorylation. We query the InterPro database (https://www.ebi.ac.uk/interpro/) to find structures in the PDB that match the protein kinase-like domain family. InterPro is a database that classifies protein sequences into families and predicts the presence of domains and important sites.

Our query input is the BRAF sequence and we filter for structures that are part of the "Protein kinase-like domain superfamily" (IPR011009) and that are included in the PDB.

## 1.2 Data download <a id="12"></a>
Here we download the structures returned by the InterPro query.

Let's start by writing all PDB codes to a list.

In [None]:
structure_path = 'structure-matching-IPR011009.tsv'
pdb_data = pd.read_csv(structure_path, sep = "\t", header=0, engine='python')
pdb_data['Accession'] = pdb_data['Accession'].str.upper()
pdb_ids = pdb_data['Accession'].tolist()

The class `PDBDownloader` performs multi-threaded PDB downloads, using up to 2 CPU cores. We now download the PDB structures listed above.
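
As a rough illustration of what a multi-threaded downloader can look like, here is a minimal sketch built on the standard library. It is not the implementation of `PDBDownloader`: the helper names `pdb_url`, `fetch_one` and `download_all` are hypothetical, and the only external fact assumed is the public RCSB file endpoint `https://files.rcsb.org/download/<ID>.pdb`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

# Public RCSB download endpoint for legacy PDB-format files
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_url(pdb_id):
    """Build the download URL for one PDB code (hypothetical helper)."""
    return RCSB_URL.format(pdb_id=pdb_id.upper())


def fetch_one(pdb_id, out_dir):
    """Download one entry and report success as (pdb_id, bool)."""
    target = Path(out_dir) / f"{pdb_id.upper()}.pdb"
    try:
        urlretrieve(pdb_url(pdb_id), target)
        return pdb_id, True
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return pdb_id, False


def download_all(pdb_ids, out_dir, max_workers=2, fetch=fetch_one):
    """Run `fetch` over all codes with a small thread pool."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda pid: fetch(pid, out_dir), pdb_ids))
```

Returning a success flag per code, rather than raising, makes it straightforward to tabulate failed downloads afterwards.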

In [None]:
downloader = PDBDownloader()
downloader.parallel_download(pdb_ids, "Results/InterProPDBs") 

We can now check how many structures from the InterPro query were actually downloaded.

In [None]:
folder_path = "Results/InterProPDBs"
file_names = [os.path.splitext(f)[0] for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
pdb_raw = pd.DataFrame({"PDBs": file_names})

pdb_data['Downloaded'] = pdb_data['Accession'].isin(pdb_raw['PDBs'])

# Use .get() so that a run with no successes or no failures does not raise a KeyError
counts = pdb_data['Downloaded'].value_counts().to_dict()
print(f"Downloaded: {counts.get(True, 0)}, Failed: {counts.get(False, 0)}")

We can save the names of failed PDB downloads for future reference.

In [None]:
fail_list = pdb_data[~pdb_data['Downloaded']]
fail_list.to_csv('fail_list.csv')

## 1.3 Kinase taxonomy <a id="13"></a>
We seek to annotate our dataset with kinase family, species, and class information.


The class `KinaseGroupLabeller` enables extracting metadata from UniProt to investigate what kinase families and which species are represented in our dataset.

In [None]:
from workflow.kinaseGroupLabelling import KinaseGroupLabeller

lab = KinaseGroupLabeller()
annot = lab.run()  # Auto-discovers PDBs from Results/InterProPDBs
display(annot.head())

We now use the plotting method `plot_distribution_bars()` to visualise the parsed metadata as horizontal bar charts.

In [None]:
# Create distribution plots for kinase families, species, and classes
# Using horizontal bar charts for readable labels
fig_family = lab.plot_distribution_bars(annot, 'family', top_n=10)
fig_species = lab.plot_distribution_bars(annot, 'species', top_n=15)
fig_class = lab.plot_distribution_bars(annot, 'kinase_class')


Our dataset includes only kinase-like structures, as expected.

# 2. Dataset curation <a id="2"></a>
In this section we will curate our kinase dataset for input into conformational analysis.

## 2.1 Extracting protein chains <a id="21"></a>
Here we extract only the protein chains containing a kinase domain from our database of downloaded PDB structures.

We use the class `PDBChainExtractor()` to write the coordinates of the chains indicated by the InterPro query output to separate PDB files. 

In [None]:
from workflow.pdb_chain_extractor import PDBChainExtractor

# Create an instance of the class
chain_extractor = PDBChainExtractor()

# Now call the method on the instance
chain_extractor.extract_chains_parallel(pdb_data, 'Results/activation_segments/unaligned/', None)

Let's make sure that the number of extracted chains is at least as large as the number of downloaded files.

In [None]:
pdb_directory = 'Results/activation_segments/unaligned/'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

## 2.2 Filtering for activation loop  <a id="22"></a>
Here we exclude all kinase domains that do not have the characteristic conserved residue motifs DFG and APE that delimit the activation loop.

We use the function `copy_filtered_pdbs()` to extract amino acid sequences from the structures in our dataset and exclude those not containing DFG and APE.
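
The motif check itself can be sketched on a one-letter sequence string. This is an illustration only: `has_activation_loop` is a hypothetical helper, and requiring APE to occur after DFG is our assumption about what "delimit the activation loop" implies, not a confirmed detail of `copy_filtered_pdbs()`.

```python
def has_activation_loop(sequence, start_motif="DFG", end_motif="APE"):
    """Return True if `start_motif` occurs and `end_motif` follows it.

    `sequence` is a one-letter amino acid string. The ordering requirement
    (APE downstream of DFG) is an assumption of this sketch.
    """
    start = sequence.find(start_motif)
    if start == -1:
        return False
    # Search for the end motif only after the end of the start motif
    return sequence.find(end_motif, start + len(start_motif)) != -1
```

For example, a chain whose sequence contains APE but no DFG would be rejected, as would one in which APE only appears upstream of DFG.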

In [None]:
# Copy filtered PDB files to a new directory
source_dir = 'Results/activation_segments/unaligned/'
target_dir = 'Results/activation_segments/motif_filtered/'
valid_pdbs, invalid_pdbs = copy_filtered_pdbs(source_dir, target_dir)

Let's check how many kinase domains we are left with.

In [None]:
pdb_directory = 'Results/activation_segments/motif_filtered/'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

## 2.3 Conformational classification  <a id="23"></a>
Here we investigate the conformational diversity of our kinase domains by applying the classification developed by the Dunbrack group.

The class `DunbrackWorkflow` enables performing the conformational classification using the `KinCore` software.

**When KinCore fails**, structures are marked with `'failed'` status. This happens when:
- KinCore cannot find the DFG or C-helix motifs
- Structure has missing residues in critical regions
- Non-standard kinase fold
- Structure quality issues

In [None]:
from workflow.DunbrackAssignment import DunbrackWorkflow

# Initialize the workflow with your KinCore installation location
workflow = DunbrackWorkflow(
    input_dir='Results/activation_segments/motif_filtered/',
    output_dir='Results/dunbrack_assignments',
    kincore_dir='/home/marmatt/Documents/Kincore-standalone'  # Your actual KinCore installation
)

# Run the complete conformation assignment workflow
results = workflow.run(
    output_csv='kinase_conformation_assignments.csv'
)

Let's now print some information about the KinCore analysis.

In [None]:
# Display results
print(f"\n{'='*80}")
print("CONFORMATION ASSIGNMENT RESULTS")
print(f"{'='*80}\n")
display(results.head(10))

# Show conformation distribution
print("\nConformation Distribution:")
print(results['overall_conformation'].value_counts())

# Show DFG motif distribution
print("\nDFG Motif Distribution:")
print(results['dfg_conformation'].value_counts())

# Show C-helix distribution
print("\nC-helix Distribution:")
print(results['chelix_conformation'].value_counts())

# Show ligand information
print("\n" + "="*80)
print("LIGAND INFORMATION")
print("="*80)
print("\nLigand Distribution:")
ligand_counts = results['ligand'].value_counts()
print(ligand_counts)

print(f"\nTotal unique ligands: {len(ligand_counts)}")
print(f"Structures with ligand: {(results['ligand'] != 'No_ligand').sum()}")
print(f"Structures without ligand: {(results['ligand'] == 'No_ligand').sum()}")

# Show top 10 most common ligands (excluding No_ligand)
print("\nTop 10 most common ligands (excluding apo structures):")
top_ligands = results[results['ligand'] != 'No_ligand']['ligand'].value_counts().head(10)
for ligand, count in top_ligands.items():
    print(f"  {ligand:.<20} {count:>4}")

Let's now visualise the metadata extracted from `KinCore`.

In [None]:
# Generate the multi-panel Dunbrack distribution plot
_ = DunbrackWorkflow.plot_conformation_distribution(
    assignments_csv="Results/dunbrack_assignments/kinase_conformation_assignments.csv",
    output_png="Results/dunbrack_assignments/conformation_distribution.png",
    show=True,
    print_dunbrack_summary=True,
)

## 2.4 Structural conservation  <a id="24"></a>
Here we will be investigating which residues of our reference BRAF kinase are structurally conserved across the collected dataset.

We choose to assess structure conservation using a novel multiple structure alignment algorithm: FoldMason. It uses the structural alphabet from Foldseek to represent 3D structures as sequences, enabling fast comparison between large structure sets. The class `AlignmentFoldMason` is implemented for this purpose.

In [None]:
from workflow.align_FoldMason import AlignmentFoldMason

# Initialize
aligner = AlignmentFoldMason(log_file="multiple_alignment_foldmason.log")

# Single multi-structure FoldMason run (optionally anchors with the template first)
aligner.process_foldmason_alignment_multi(
    pdb_path="Results/activation_segments/motif_filtered/",
    target_dir="Results/activation_segments/multi_aligned_foldmason/",
    template_pdb="6UAN_chainD.pdb",  # omit if you don't want to include a template
    out_name="msa",                  # output prefix
    report_mode=1                    # 0: no report, 1: HTML report
)

Let's check that the number of structurally aligned files matches the number of files that passed the activation-loop motif filter in Section 2.2.

In [None]:
pdb_directory = 'Results/activation_segments/multi_aligned_foldmason/'
pdb_count = count_pdb_files(pdb_directory, recursive=True)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.") 

We show structure conservation with respect to the BRAF reference sequence. We have written the `analyse_alignment()` class to load the multi-structure alignment, calculate conservation at each BRAF residue position, and select residues whose conservation meets or exceeds a threshold (70%). We visualise this analysis as a histogram.

**This class creates the `conservation` variable** that is used later in the feature selection workflow.
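
To make the idea of per-residue conservation concrete, here is a minimal sketch on a toy alignment. The definition used here (fraction of non-reference structures with an aligned, non-gap state at each reference position) is an assumption chosen for illustration; `analyse_alignment` may define conservation differently, and `conservation_vs_reference` is a hypothetical name.

```python
import numpy as np


def conservation_vs_reference(aligned, reference, gap="-"):
    """Per-reference-position occupancy across an alignment.

    `aligned` maps structure names to equal-length aligned strings.
    Columns where the reference has a gap are dropped, so the output
    has one value per reference residue.
    """
    ref = np.array(list(aligned[reference]))
    ref_cols = ref != gap
    others = np.array(
        [list(seq) for name, seq in aligned.items() if name != reference]
    )
    occupied = (others != gap)[:, ref_cols]
    return occupied.mean(axis=0)


toy = {"ref": "AC-D", "a": "ACGD", "b": "A--D", "c": "-C-D"}
toy_conservation = conservation_vs_reference(toy, "ref")
toy_conserved = np.where(toy_conservation >= 0.70)[0]
```

In the toy alignment only the last reference position is occupied in every structure, so it is the only one passing a 70% threshold.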

In [None]:
from workflow.analyse_alignment_foldmason import analyse_alignment

# Load multi-structure alignment
analyser = analyse_alignment()
multi_data = analyser.load_multi_alignment(
    "Results/activation_segments/multi_aligned_foldmason/msa_3di.fa",
    reference_name="6UAN_chainD"  # Specify the template structure as reference
)

# Get residue names from reference structure (format: "ALA-123")
reference_residues = braf_res()

# Visualize conservation with residue labels on x-axis
# Labels show: "ALA449 (0)" format (3-letter code + PDB number + 0-based index)
conservation, highly_conserved = analyser.visualize_residue_conservation(
    filtered_alignments=multi_data['structures'],
    reference_residues=reference_residues,  # Pass residue names for x-axis labels
    output_file="Results/multi_alignment_foldMason_conservation.png",
    show_plot=True
)

print(f"Analyzed {len(multi_data['structures'])} structures (total: {multi_data['n_structures']})")

# Identify residues with conservation >= 70%
conservation_threshold = 0.70
conserved_70_indices = np.where(conservation >= conservation_threshold)[0]

print(f"\n=== Residues with ≥{conservation_threshold*100:.0f}% Conservation ===")
print(f"Total conserved residues: {len(conserved_70_indices)}")
print(f"\nPositions and residues:")
for idx in conserved_70_indices:
    if idx < len(reference_residues):
        print(f"  Position {idx:3d}: {reference_residues[idx]:>10s} - {conservation[idx]*100:.1f}% conserved")
    else:
        print(f"  Position {idx:3d}: (no reference) - {conservation[idx]*100:.1f}% conserved")

# Save to file
conserved_df = pd.DataFrame({
    'position': conserved_70_indices,
    'residue': [reference_residues[i] if i < len(reference_residues) else 'N/A' for i in conserved_70_indices],
    'conservation': [conservation[i] for i in conserved_70_indices]
})
conserved_df.to_csv('Results/conserved_residues_70percent.csv', index=False)
print(f"\nSaved conserved residues to: Results/conserved_residues_70percent.csv")

To assess the validity of our approach, we can visualise how FoldMason aligns sequences using the class `visualise_sequence_alignment()`.

In [None]:
from workflow.analyse_alignment_foldmason import visualise_sequence_alignment

# Create visualizer instance
visualizer = visualise_sequence_alignment()

# Generate HTML for 3Di alignment
di_stats = visualizer.generate_multi_alignment_html(
    alignment_file="Results/activation_segments/multi_aligned_foldmason/msa_3di.fa",
    output_file="Results/multi_alignment_3di.html",
    reference_name="6UAN_chainD"
)

print(f"3Di alignment: {di_stats}")

## 2.5 Reconstructing small loop segments  <a id="25"></a>
Here we will be using homology modelling to reconstruct small missing residue regions within the activation loop.

Many crystal structures have missing residues in the activation loop, since X-ray crystallography often cannot resolve disordered regions. We have written the class `ProteinReconstructor()`, which extracts the full sequence from the original PDB file of each kinase domain, checks which structures require reconstruction of fewer than 4 consecutive missing residues in the activation loop, and uses MODELLER to fill in the missing residues with reasonable conformations.
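
The gap check can be illustrated with residue numbers alone. The sketch below is not the logic of `ProteinReconstructor`: the helper names are hypothetical, and treating `max_gap_length` as an inclusive upper bound is an assumption.

```python
def missing_runs(residue_numbers):
    """Return (first_missing, last_missing) for each break in numbering."""
    ordered = sorted(set(residue_numbers))
    runs = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev > 1:
            runs.append((prev + 1, nxt - 1))
    return runs


def is_reconstructable(residue_numbers, max_gap_length=4):
    """True if every missing run has at most `max_gap_length` residues.

    Whether the real pipeline treats the limit as inclusive is an
    assumption of this sketch.
    """
    return all(
        end - start + 1 <= max_gap_length
        for start, end in missing_runs(residue_numbers)
    )
```

A structure with no breaks in numbering has no missing runs and therefore passes trivially.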

In [None]:
from workflow.reconstruct import ProteinReconstructor

# Configuration
input_dir = "Results/activation_segments/aligned_mda"
full_pdb_dir = "Results/InterProPDBs"
output_dir = "Results/activation_segments/reconstructedModeller"
max_gap_length = 4
    
# Create and run the reconstructor
reconstructor = ProteinReconstructor(
    input_dir=input_dir,
    full_pdb_dir=full_pdb_dir,
    output_dir=output_dir,
    max_gap_length=max_gap_length
)
    
reconstructor.run_modeller_pipeline()

We should now check how many reconstructed structures we are left with.

In [None]:
pdb_directory = 'Results/activation_segments/reconstructedModeller/'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

## 2.6 Coarse-graining activation loops <a id="26"></a>
Here we present a method to coarse-grain activation loops and represent them with an equal number of coordinates, regardless of the length of the loop. 

The class `CAStripper` provides means to save to a new folder only the coordinates of the Cα atoms of each activation loop.
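
For readers unfamiliar with the PDB file layout, the following sketch shows how Cα coordinates can be read from `ATOM` records using the fixed column positions of the PDB format. It is a simplified stand-in, not the `CAStripper` code; `ca_coordinates` is a hypothetical helper and it ignores alternate locations and multiple models.

```python
def ca_coordinates(pdb_lines):
    """Return (x, y, z) tuples for C-alpha atoms in ATOM records.

    Uses fixed PDB columns: atom name in columns 13-16 and
    x, y, z in columns 31-38, 39-46 and 47-54 (1-based).
    HETATM records (e.g. calcium ions, also named CA) are skipped.
    """
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords
```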

In [None]:
from workflow.ca_stripper import CAStripper

stripper = CAStripper(motifs=['DFG', 'APE'])

# Process a directory
output_dir = stripper.strip_to_ca(
    input_dir="Results/activation_segments/reconstructedModeller/", 
    output_dir="Results/activation_segments/CA_segments/"
)

Let's make sure that the number of structures being processed has not decreased.

In [None]:
pdb_directory = 'Results/activation_segments/CA_segments/'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

In the next two filtering steps we will be using the `OutlierStripper` class to retain a dataset of well-aligned and similarly-sized loop structures.

We are first going to apply Tukey's method to exclude activation loop structures whose number of Cα atoms lies outside Tukey's fences, i.e. more than `k_factor` (here 1.5) times the interquartile range below the first quartile or above the third quartile of the distribution.
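
Tukey's rule can be written in a few lines of NumPy. This is a generic sketch of the technique, with `tukey_keep_mask` as a hypothetical name; the way `OutlierStripper` computes quartiles may differ in detail.

```python
import numpy as np


def tukey_keep_mask(values, k_factor=1.5):
    """Boolean mask of values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k_factor * iqr, q3 + k_factor * iqr
    return (values >= lower) & (values <= upper)


# Toy C-alpha counts: one loop is far longer than the rest
toy_counts = [28, 29, 30, 30, 31, 32, 60]
toy_mask = tukey_keep_mask(toy_counts)
```

With these toy counts the fences are 26.5 and 34.5, so only the 60-residue loop is flagged as an outlier.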

At this point it is useful to filter out loops whose extremities are not structurally aligned to those of the reference BRAF structure. This minimises the impact of the lack of roto-translational invariance on the dimensionality reduction performed later. To accomplish this, we look at the RMSD between the extremities of each structure and those of the reference.
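
One simple way to quantify how well the loop extremities match the reference is an RMSD over the two terminal Cα atoms. The sketch below assumes exactly that definition and assumes the cutoff is expressed in the same units as the coordinates; both are illustrative choices, and `terminal_rmsd` is a hypothetical helper rather than part of `OutlierStripper`.

```python
import numpy as np


def terminal_rmsd(loop_ca, ref_first, ref_last):
    """RMSD between a loop's first/last C-alpha and two reference positions.

    `loop_ca` is an (N, 3) array of C-alpha coordinates that is assumed
    to be already superposed onto the reference frame.
    """
    loop_ca = np.asarray(loop_ca, dtype=float)
    ends = np.vstack([loop_ca[0], loop_ca[-1]])
    ref = np.vstack([ref_first, ref_last]).astype(float)
    return float(np.sqrt(np.mean(np.sum((ends - ref) ** 2, axis=1))))


def passes_distance_filter(loop_ca, ref_first, ref_last, distance_cutoff=5.0):
    """Keep the loop if its terminal RMSD does not exceed the cutoff."""
    return terminal_rmsd(loop_ca, ref_first, ref_last) <= distance_cutoff
```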

In [None]:
from workflow.ca_stripper import OutlierStripper

# Load your reference structure
ref_traj = md.load("6UAN_chainD.pdb")
ca_indices = ref_traj.topology.select("name CA")

# Initialize OutlierStripper with reference PDB for distance filtering
distance_outlier_detector = OutlierStripper(
    k_factor=1.5,
    reference_pdb="6UAN_chainD.pdb",
    ref_first_resid=144,   # First residue ID
    ref_last_resid=173     # Last residue ID
)

# Analyze with both CA count AND distance filtering
final_results = distance_outlier_detector.analyze(
    ca_segments_dir="Results/activation_segments/CA_segments/",
    create_plots=True,  # This creates histograms AND violin plots
    distance_cutoff=5.0,
    apply_distance_filter=True,
    clean_dir_name="CA_segments_final_cleaned"
)

print(f"Final cleaned structures saved to: {final_results['clean_dir']}")

Let's again make sure the number of files retained after this filtering step is right.

In [None]:
pdb_directory = 'Results/activation_segments/CA_segments/CA_segments_final_cleaned'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

### 2.6.1 Cα interpolation

We now focus on preparing the input to the dimensionality reduction algorithms chosen. The issue we have at present is that we are dealing with heterogeneity in the number of atoms of each input. 

We use the class `Fitting()` to obtain a uniform representation of our dataset by fitting cubic splines to the Cα atoms of each structure and then sampling the resulting path at a fixed number of points, equal to the median Cα count of the distribution shown above.
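
The resampling idea can be sketched with `scipy.interpolate.CubicSpline`. This is an illustration under stated assumptions, not the `Fitting` implementation: the path is parameterised by cumulative chord length and sampled uniformly in that parameter, and `resample_path` is a hypothetical name.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def resample_path(coords, n_points):
    """Fit a cubic spline through 3D points and sample it `n_points` times.

    The spline parameter is the cumulative chord length between
    consecutive points, so consecutive points must not coincide.
    """
    coords = np.asarray(coords, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(coords, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(segment_lengths)])
    spline = CubicSpline(t, coords, axis=0)
    return spline(np.linspace(0.0, t[-1], n_points))


# Toy loop with 5 C-alpha atoms resampled to 9 points
toy_loop = np.array(
    [[0, 0, 0], [3.8, 0, 0], [5, 3.5, 0], [8, 5, 1], [11, 4, 3]], dtype=float
)
toy_resampled = resample_path(toy_loop, 9)
```

Because the spline interpolates its knots, the first and last resampled points coincide with the original loop extremities, whatever the original number of atoms.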

In [None]:
from workflow.fitting_class import Fitting

# Process all PDB files in a directory
input_directory = "Results/activation_segments/CA_segments/CA_segments_final_cleaned/"
output_directory = 'Results/activation_segments/fitted'

# Initialise
fitter = Fitting()

# This will fit all structures and create comparison plots
fitter.process_directory(
    input_dir=input_directory,
    output_dir=output_directory,
    create_plots=True,  # Set to False if you don't want plots
    plot_dir='Results/activation_segments/plots'  # Optional: specify plot directory
)

# 3. Dimensionality reduction  <a id="3"></a>
In this section we will apply dimensionality reduction to our coarse-grained representation of selected activation loops.

## 3.1 Principal component analysis (PCA)  <a id="31"></a>
We are now going to perform Principal Component Analysis (PCA) in order to reduce the dimensionality of our coarse-grained dataset.

We have written the class `PCAWorkflow()` in order to apply PCA to our activation loop dataset.
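
For orientation, PCA on coarse-grained loops amounts to flattening each structure into a vector, centring, and taking a singular value decomposition. The sketch below shows that generic procedure with NumPy; it is not the code of `PCAWorkflow`, `pca_project` is a hypothetical name, and it returns explained variance as fractions rather than percentages.

```python
import numpy as np


def pca_project(structures, n_components=2):
    """Project structures of shape (n, n_atoms, 3) onto principal components.

    Returns the scores of shape (n, n_components) and the fraction of
    variance explained by each retained component.
    """
    X = np.asarray(structures, dtype=float)
    X = X.reshape(len(X), -1)          # one flattened row per structure
    Xc = X - X.mean(axis=0)            # centre each coordinate
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = Xc @ Vt[:n_components].T
    return scores, explained[:n_components]
```

Since the input is raw Cartesian coordinates, the result depends on how the structures were superposed beforehand, which is why the alignment filters in Section 2 matter.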

In [None]:
from workflow.pca_analysis import PCAWorkflow

workflow = PCAWorkflow(n_components=8, n_clusters=2)
results = workflow.run_full_analysis(
    structures_path="Results/activation_segments/fitted/",
    output_prefix="my_analysis"
)

# Access results
print(f"Structures: {len(results['structure_names'])}")
print(f"PC1: {results['explained_variance'][0]:.1f}% variance")

We now use the class `ClusterAnalyzer()` in order to visualise how our dataset projects along the first two principal components. We cluster in this reduced space and obtain two labels for two clusters of projected conformations. We also visualise how active and inactive labels from KinCore project in PC space.
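
A two-cluster partition of points in PC space can be obtained with agglomerative clustering from SciPy. The sketch below assumes Ward linkage, which is our choice for illustration; the linkage actually used by `ClusterAnalyzer` is not stated here, and `two_cluster_labels` is a hypothetical helper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def two_cluster_labels(scores, n_clusters=2):
    """Hierarchical clustering of PC scores into `n_clusters` groups.

    Returns zero-based integer labels, one per row of `scores`.
    """
    Z = linkage(np.asarray(scores, dtype=float), method="ward")
    # fcluster numbers clusters from 1, so shift to start at 0
    return fcluster(Z, t=n_clusters, criterion="maxclust") - 1
```

Note that the numbering of clusters is arbitrary, so label 0 cannot be assumed to correspond to any particular conformational state without checking against external annotations.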

In [None]:
from workflow.pca_analysis import ClusterAnalyzer

cluster_analyzer = ClusterAnalyzer(n_clusters=2)

_ = cluster_analyzer.plot_pca_cluster_and_activation(
    results,
    kincore_file="Results/dunbrack_assignments/kinase_conformation_assignments.csv",
    cluster_plot_path="pca_clustering_labels.png",
    activation_plot_path="pca_activation_states.png",
    show=True
)


## 3.2 Convolutional autoencoder  <a id="32"></a>
We are now going to train a small convolutional autoencoder, i.e. an autoencoder built from convolutional neural network (CNN) layers, in order to reduce the dimensionality of our coarse-grained dataset.

We use the `molearn` framework to train the autoencoder on coarse-grained activation loop conformations. To facilitate running all the steps required by this package for training, we have written the class `AutoencoderWorkflow()`.

In [None]:
from workflow.autoencoder_workflow import AutoencoderWorkflow

# Create workflow
myworkflow = AutoencoderWorkflow(
    folder_name='Results/activation_segments/fitted',
    output_base_dir='Results/run_trial_BRAFActivationLoop_postalign_checkpoint0',
    manual_seed=25,
    batch_size=8
)

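Let's prepare the data for input into training and train our model for up to 32 epochs with early stopping. Because `patience` is set equal to `max_epochs` here, early stopping should not trigger before the final epoch.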

In [None]:
myworkflow.prepare_data(atom_selection=['CA'])
myworkflow.train(max_epochs=32, patience=32)

After training our Autoencoder we are going to load the trained model so that we can perform some analysis.

In [None]:
# Load checkpoint
myworkflow.load_checkpoint()

Let's now initialise the analysis class and decode structures in order to be able to quantify the performance of the model.

In [None]:
# Setup and run analysis
myworkflow.setup_analysis()
myworkflow.extract_dataset()
myworkflow.decode_structures()

Since the data gets shuffled before input into our model we are going to keep track of the shuffled indices to facilitate our analysis.

In [None]:
myworkflow.rename_files()

We can now calculate reconstruction errors, scan the error landscape, and save the learnt latent representation for future visualisation.

In [None]:
myworkflow.calculate_errors()
myworkflow.scan_error_landscape()
myworkflow.extract_encoded_coordinates()

We can also load the PCA labels obtained in the previous section in order to visualise how they project in latent space.

In [None]:
# Load PCA cluster labels 
pca_labels_file = 'cluster_labels_my_analysis_hierarchical.txt'
myworkflow.load_external_labels(pca_labels_file)

Finally, let's visualise how both training and validation data project in latent space, and how the PCA and activation labels are distributed over it.

In [None]:
# --- Plot 1: Latent space colored by PCA cluster labels ---
myworkflow.plot_latent_space(
    title="Latent Space Projection (PCA Cluster Labels)",
    output_file='latent_space_pca_labels.png'
)

# --- Plot 2: Latent space colored by KinCore activation state labels ---
# Load activation state labels from KinCore classification
myworkflow.load_kincore_labels(
    kincore_file='Results/dunbrack_assignments/kinase_conformation_assignments.csv'
)

# Plot with activation labels (pass them explicitly)
# Use red for inactive (0) and green for active (1)
myworkflow.plot_latent_space(
    labels=(myworkflow.activation_labels_train, myworkflow.activation_labels_valid),
    title="Latent Space Projection (Activation States)",
    output_file='latent_space_activation_states.png',
    activation_colors=['red', 'green']  # [inactive, active]
)

# Organize structures by PCA cluster labels
myworkflow.organize_by_clusters()

Let's now investigate whether there is a correlation between the labels assigned by Dunbrack and the ones obtained through clustering in PC space.

In [None]:
from workflow.pca_analysis import ClusterAnalyzer

cluster_analyzer = ClusterAnalyzer(n_clusters=2)

out = cluster_analyzer.integrate_dunbrack_with_pca_clusters(
    pca_labels_file="cluster_labels_my_analysis_hierarchical.txt",
    dunbrack_assignments_csv="Results/dunbrack_assignments/kinase_conformation_assignments.csv",
    prefix_len=6,
    merged_output_csv="Results/dunbrack_assignments/pca_dunbrack_merged.csv",
    print_tables=True,
    print_percentages=True,
)

# keep the merged dataframe available for downstream cells
merged = out["merged"]


In [None]:
cluster_analyzer = ClusterAnalyzer(n_clusters=2)

out = cluster_analyzer.analyze_cluster_vs_activity_status(
    merged_csv="Results/dunbrack_assignments/pca_dunbrack_merged.csv",
    print_tables=True,
    print_percentages=True,
    print_enrichment=True,
    enrichment_threshold_pct=10.0,
)

# keep commonly used objects in the notebook namespace
merged = out["merged"]
merged_clean = out["merged_clean"]
activity_crosstab = out["activity_crosstab"]


It can be useful to visualise this correlation as a heatmap of the cluster-versus-activation-state contingency table, analogous to a confusion matrix.
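
The table underlying such a heatmap is a contingency table, which pandas builds with `pd.crosstab`. The toy data and the column names `cluster` and `activity` below are hypothetical and do not reflect the actual columns of `pca_dunbrack_merged.csv`.

```python
import pandas as pd

# Toy labels: PCA cluster per structure and an activity annotation
toy = pd.DataFrame(
    {
        "cluster": [0, 0, 0, 1, 1, 1, 1],
        "activity": [
            "inactive", "inactive", "active",
            "active", "active", "active", "inactive",
        ],
    }
)

# Raw counts per (cluster, activity) combination
toy_table = pd.crosstab(toy["cluster"], toy["activity"])

# Row-normalised percentages: composition of each cluster
toy_row_pct = pd.crosstab(toy["cluster"], toy["activity"], normalize="index") * 100
```

A table dominated by its diagonal (or anti-diagonal, since cluster numbering is arbitrary) indicates that the unsupervised clusters track the activation labels.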

In [None]:
cluster_analyzer = ClusterAnalyzer(n_clusters=2)

out = cluster_analyzer.plot_cluster_vs_activation_state_heatmap(
    merged_csv="Results/dunbrack_assignments/pca_dunbrack_merged.csv",
    output_png="Results/dunbrack_assignments/pca_activation_correlation_plot.png",
    show=True,
)

# keep commonly used objects in the notebook namespace
merged = out["merged"]
merged_clean = out["merged_clean"]
activation_crosstab = out["activation_crosstab"]


# 4. Feature definition  <a id="4"></a>
In this section we will featurise the kinase domains in our dataset and select only features of statistical relevance.

## 4.1 Feature matrix  <a id="41"></a>
Here we define our featurisation approach. We construct a feature matrix for each structure in our dataset.

Let's first import and initialise the class `FeatureSelection()` that we use to perform all the necessary steps to featurise kinase domains outside the activation loop region.

In [None]:
from workflow.feature_selection import FeatureSelection

# Initialize the feature selection object
fs = FeatureSelection(
    dfg_index=145,  # Position of DFG motif
    ape_index=174,  # Position of APE motif
    conservation_threshold=0.70  # 70% conservation threshold
)

print("FeatureSelection initialized")
print(f"Activation loop region: {fs.dfg_index} to {fs.ape_index}")
print(f"Conservation threshold: {fs.conservation_threshold * 100}%")

The first step uses the conservation data to identify residues that are:
- Conserved in ≥70% of structures
- Located outside the activation loop region (DFG to APE motif)

<div class="alert alert-warning">
<b>PRE-REQUISITE:</b> You must first run the conservation analysis cell which calculates the <code>conservation</code> variable by analyzing the multi-structure alignment.
</div>
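
The two selection criteria listed above can be combined into a single boolean mask over reference positions. The sketch below is illustrative: `conserved_outside_loop` is a hypothetical function, and treating `dfg_index` and `ape_index` as inclusive loop boundaries is an assumption about how `FeatureSelection` delimits the activation loop.

```python
import numpy as np


def conserved_outside_loop(conservation, dfg_index, ape_index, threshold=0.70):
    """Indices of residues that are conserved and outside the activation loop.

    Positions from `dfg_index` to `ape_index` (inclusive, by assumption)
    are treated as belonging to the loop and are excluded.
    """
    conservation = np.asarray(conservation, dtype=float)
    idx = np.arange(len(conservation))
    outside = (idx < dfg_index) | (idx > ape_index)
    return idx[(conservation >= threshold) & outside]


# Toy example: positions 2-4 play the role of the activation loop
toy_selected = conserved_outside_loop(
    [0.9, 0.5, 0.8, 0.95, 0.75, 0.9], dfg_index=2, ape_index=4
)
```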

In [None]:
# Check if conservation variable exists
try:
    conservation
except NameError:
    raise NameError(
        "The 'conservation' variable is not defined. "
        "Please run the conservation analysis cell in Section 2.4 first to calculate "
        "conservation from the multi-structure alignment.\n"
        "That cell runs: analyser.visualize_residue_conservation()"
    )

# Load reference residue names
reference_residues = braf_res()

# Identify conserved residues (uses 70% threshold set in FeatureSelection initialization)
# This will find residues that are:
# - Conserved in ≥70% of structures
# - Located OUTSIDE the activation loop (not between DFG at 145 and APE at 174)
fs.identify_conserved_residues(
    conservation=conservation,
    reference_residues=reference_residues
)

print(f"\nFound {len(fs.fully_conserved)} conserved residues (≥70% conservation)")

Before calculating distance features, we need to organize structures into cluster-specific folders based on the PCA clustering results in order to be able to easily retain the label of each structure in our dataset.

In [None]:
# Load PCA cluster labels (use hierarchical clustering results)
pca_labels_file = 'cluster_labels_my_analysis_hierarchical.txt'
df_labels = pd.read_csv(pca_labels_file, skiprows=1, header=None, 
                        names=['ClusterLabel', 'PDBCode', 'FullName'])

# Create output directories for each cluster (clearing old structures)
cluster0_dir = "Results/activation_segments/structuresToFeaturiseCluster0/"
cluster1_dir = "Results/activation_segments/structuresToFeaturiseCluster1/"
clear_and_make(cluster0_dir)
clear_and_make(cluster1_dir)

# Source directory with the extracted (unaligned) chain structures
source_dir = "Results/activation_segments/unaligned/"

# Organize structures by cluster
copied_cluster0 = []
copied_cluster1 = []
missing_files = []

for _, row in df_labels.iterrows():
    cluster_label = int(row['ClusterLabel'])
    structure_name = row['FullName']
    
    # Ensure .pdb extension
    if not structure_name.endswith('.pdb'):
        structure_name = structure_name + '.pdb'
    
    # Extract first 6 characters for matching (PDB code)
    pdb_prefix = structure_name[:6]
    
    # Find matching file in source directory using prefix
    # Look for any file that starts with the 6-character prefix
    matching_files = [f for f in os.listdir(source_dir) 
                     if f.startswith(pdb_prefix) and f.endswith('.pdb')]
    
    if not matching_files:
        missing_files.append(structure_name)
        continue
    
    # Use the first matching file
    source_file = os.path.join(source_dir, matching_files[0])
    
    # Copy to appropriate cluster folder (keep the file name found in the source directory)
    if cluster_label == 0:
        dest_file = os.path.join(cluster0_dir, matching_files[0])
        shutil.copy2(source_file, dest_file)
        copied_cluster0.append(matching_files[0])
    elif cluster_label == 1:
        dest_file = os.path.join(cluster1_dir, matching_files[0])
        shutil.copy2(source_file, dest_file)
        copied_cluster1.append(matching_files[0])

print(f"=== PCA Cluster Organization ===")
print(f"Matching files using first 6 characters of structure name")
print(f"Source directory: {source_dir}")
print(f"\nCluster 0: {len(copied_cluster0)} structures → {cluster0_dir}")
print(f"Cluster 1: {len(copied_cluster1)} structures → {cluster1_dir}")
print(f"Total structures organized: {len(copied_cluster0) + len(copied_cluster1)}")

if missing_files:
    print(f"\n⚠️  Warning: {len(missing_files)} files not found in {source_dir}")
    print("First 5 missing files (showing first 6 chars used for matching):")
    for f in missing_files[:5]:
        print(f"  - {f[:6]}* (from {f})")


We can now calculate pairwise distances between conserved residues for structures in each cluster which will constitute our feature dataset.
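
As a minimal illustration of this featurisation, the sketch below computes all pairwise distances between a chosen set of residues in one structure. Using Cα positions as the residue coordinates is an assumption of this example, and `pairwise_distance_features` is a hypothetical helper, not the `calculate_intra_structure_distances` method.

```python
from itertools import combinations

import numpy as np


def pairwise_distance_features(ca_coords, residue_indices):
    """Distances between every pair of selected residues in one structure.

    `ca_coords` is an (N, 3) array of C-alpha coordinates; the result
    maps each index pair (i, j), with i listed before j, to a distance.
    """
    ca_coords = np.asarray(ca_coords, dtype=float)
    return {
        (i, j): float(np.linalg.norm(ca_coords[i] - ca_coords[j]))
        for i, j in combinations(residue_indices, 2)
    }


toy_coords = [[0, 0, 0], [3, 4, 0], [0, 0, 1]]
toy_features = pairwise_distance_features(toy_coords, [0, 1, 2])
```

For n conserved residues this yields n(n-1)/2 features per structure, and because distances are internal coordinates they do not depend on how the structure is oriented.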


We calculate distances first for the structures in cluster 0.

In [None]:
# Your aligned structures and alignment function
aligned_structures = multi_data['structures']  # Your list of alignment objects

# Choose which cluster to analyze
# Options: cluster0_dir or cluster1_dir (defined in previous cell)
# Cluster 0 is typically the inactive state, Cluster 1 is active state (or vice versa)
pdb_directory = cluster0_dir  # Change to cluster1_dir to analyze the other cluster

print(f"Analyzing structures from: {pdb_directory}")
print(f"Number of structures: {len(os.listdir(pdb_directory))}")

# Calculate distances between conserved residues
distance_df_cluster0 = fs.calculate_intra_structure_distances(
    aligned_structures=aligned_structures,
    pdb_directory=pdb_directory,
    alignment_function=make_seg  # Your function to create alignment segment
)

print(f"\n=== Distance Calculation Results (Cluster 0) ===")
print(f"Calculated {len(distance_df_cluster0)} distance measurements")
print(f"Across {len(fs.structures)} structures")
print(f"\nFirst few rows:")
distance_df_cluster0.head()

Then we calculate distances for structures in cluster 1.

In [None]:
# Calculate distances for Cluster 1
pdb_directory_cluster1 = cluster1_dir

print(f"Analyzing structures from: {pdb_directory_cluster1}")
print(f"Number of structures: {len(os.listdir(pdb_directory_cluster1))}")

# Calculate distances between conserved residues for Cluster 1
distance_df_cluster1 = fs.calculate_intra_structure_distances(
    aligned_structures=aligned_structures,
    pdb_directory=pdb_directory_cluster1,
    alignment_function=make_seg
)

print(f"\n=== Distance Calculation Results (Cluster 1) ===")
print(f"Calculated {len(distance_df_cluster1)} distance measurements")
print(f"Across {len(fs.structures)} structures")
print(f"\nFirst few rows:")
distance_df_cluster1.head()


We can now combine the features extracted from both clusters into a single dataset, which will later be used for classification.

In [None]:
# Combine distance measurements from both clusters
combined_distance_df = pd.concat([distance_df_cluster0, distance_df_cluster1], 
                                  ignore_index=True)

# Update fs.intra_structure_df with combined data
fs.intra_structure_df = combined_distance_df

print(f"=== Combined Distance Data ===")
print(f"Cluster 0: {len(distance_df_cluster0)} measurements from {distance_df_cluster0['structure'].nunique()} structures")
print(f"Cluster 1: {len(distance_df_cluster1)} measurements from {distance_df_cluster1['structure'].nunique()} structures")
print(f"Combined: {len(combined_distance_df)} total measurements from {combined_distance_df['structure'].nunique()} structures")
print(f"\nSample of combined data:")
print(combined_distance_df.head())


Let's add a label column to our feature dataset so that we can track which cluster label is associated with each structure.
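
The idea can be sketched with plain pandas. The column names `pair` and `distance`, the structure names, and the `cluster_members` dictionary are hypothetical; only the `structure` and `label` columns and the `-1` marker for unlabelled structures mirror what the cells in this notebook use.

```python
import pandas as pd

# Hypothetical long-format distance table: one row per (structure, residue pair)
df = pd.DataFrame({
    "structure": ["1abc_A", "1abc_A", "2xyz_B", "3foo_A"],
    "pair": ["10-20", "10-30", "10-20", "10-20"],
    "distance": [8.1, 12.4, 7.9, 9.3],
})

# Hypothetical cluster membership, e.g. collected from the file names in each cluster folder
cluster_members = {0: {"1abc_A"}, 1: {"2xyz_B"}}

# Invert the mapping to structure -> label, then label every row;
# structures found in neither cluster get the marker -1
structure_to_label = {
    name: label for label, names in cluster_members.items() for name in names
}
df["label"] = df["structure"].map(structure_to_label).fillna(-1).astype(int)
print(df)
```

Because each structure spans many rows, counts per label should be taken over unique structures rather than rows.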


In [None]:
# Check if cluster directories are defined (from the PCA cluster organization cell above)
try:
    cluster0_dir
    cluster1_dir
except NameError:
    # If not defined, set them to the expected paths
    print("⚠️  WARNING: cluster0_dir and cluster1_dir not found in environment.")
    print("Please run the PCA cluster organization cell first to organize structures by PCA clusters.")
    print("Using default paths as fallback...\n")
    
    cluster0_dir = "Results/activation_segments/structuresToFeaturiseCluster0/"
    cluster1_dir = "Results/activation_segments/structuresToFeaturiseCluster1/"

# Use the cluster directories created from PCA analysis
cluster_dirs = {
    0: cluster0_dir,  # "Results/activation_segments/structuresToFeaturiseCluster0/"
    1: cluster1_dir   # "Results/activation_segments/structuresToFeaturiseCluster1/"
}

print("Using PCA-based cluster directories:")
print(f"  Cluster 0: {cluster_dirs[0]}")
print(f"  Cluster 1: {cluster_dirs[1]}")

# Assign labels based on cluster membership
fs.assign_labels_from_clusters(cluster_dirs)

# Check label distribution
print("\n=== Label Distribution in Dataset ===")
if fs.intra_structure_df is not None and 'label' in fs.intra_structure_df.columns:
    # IMPORTANT: Each structure has MANY rows (one per residue pair distance)
    # So we need to count both rows AND unique structures
    
    print("📊 Unique structures per label:")
    for label in sorted(fs.intra_structure_df['label'].unique()):
        if label == -1:
            continue
        n_structures = fs.intra_structure_df[fs.intra_structure_df['label'] == label]['structure'].nunique()
        n_measurements = len(fs.intra_structure_df[fs.intra_structure_df['label'] == label])
        print(f"  Label {label}: {n_structures} structures ({n_measurements} distance measurements)")
    
    total_unique = fs.intra_structure_df[fs.intra_structure_df['label'] != -1]['structure'].nunique()
    total_measurements = len(fs.intra_structure_df[fs.intra_structure_df['label'] != -1])
    print(f"\n✅ Total: {total_unique} labeled structures, {total_measurements} total distance measurements")
    
    if (fs.intra_structure_df['label'] == -1).any():
        unlabeled = fs.intra_structure_df[fs.intra_structure_df['label'] == -1]['structure'].nunique()
        print(f"⚠️  Warning: {unlabeled} structures without labels")
else:
    print("No labels assigned yet. Run distance calculation and combination first.")

We can now organise all distance measurements into a structured feature matrix for each structure. Since not all residues between which we compute distances are conserved, we use median imputation to ensure that all feature matrices include the same number of features.
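
A minimal sketch of this step, assuming a long-format table with one row per structure and residue pair, is shown below. The column names `pair` and `distance` and the toy values are hypothetical, and the real `build_feature_matrix` method may organise the data differently.

```python
import pandas as pd

# Hypothetical long-format distances; structure "s3" lacks the pair "10-30"
long_df = pd.DataFrame({
    "structure": ["s1", "s1", "s2", "s2", "s3"],
    "pair": ["10-20", "10-30", "10-20", "10-30", "10-20"],
    "distance": [8.0, 12.0, 10.0, 14.0, 9.0],
})

# One row per structure, one column per residue pair; missing pairs become NaN
matrix = long_df.pivot(index="structure", columns="pair", values="distance")

# Remember which entries were missing before filling them in
imputation_mask = matrix.isna()

# Replace each missing entry with the median of its column (feature)
imputed = matrix.fillna(matrix.median())
print(imputed)
```

Keeping the mask makes it possible to check later how many values of a selected feature were imputed rather than measured.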

In [None]:
# Build feature matrix with median imputation
feature_matrix, imputation_mask = fs.build_feature_matrix(
    use_median_imputation=True
)

print(f"\nFeature matrix shape: {feature_matrix.shape}")
print(f"Number of structures: {len(fs.structure_names)}")
print(f"Number of features: {len(fs.unique_pairs)}")
print(f"\nSample feature names: {[f'{p[0]}-{p[1]}' for p in fs.unique_pairs[:5]]}")

# Save all results for later reloading
print("\n" + "="*60)
print("Saving feature matrix and related data...")
print("="*60)
fs.save_results(output_prefix="")
print("✅ All data saved! Can be reloaded without recomputing.")

Let's visualise some feature matrices as heat maps.

In [None]:
# Plot example heatmaps
fs.plot_distance_heatmaps(
    n_examples=4,
    save_dir=None  # Set to a directory path to save all heatmaps
)

# 5. Feature selection  <a id="5"></a>

In this section we will study how each feature is distributed across our kinase dataset and we will filter out features that are not statistically relevant in order to facilitate the classification step.

If you've already computed feature matrices and saved them, you can skip the previous cells and reload the data here using the class `FeatureSelection`.

In [None]:
from workflow.feature_selection import FeatureSelection

# Create new FeatureSelection object
fs = FeatureSelection(dfg_index=145, ape_index=174, conservation_threshold=0.97)

# Load reference data
fs.load_results('reference_data.pkl')

# Load feature matrix
feature_df = pd.read_csv('feature_matrix.csv', index_col=0)
fs.feature_matrix = feature_df.values
fs.structure_names = list(feature_df.index)

# Load labels
labels_df = pd.read_csv('labels.csv')
fs.labels = labels_df['label'].values

# Load distance dataframe
fs.intra_structure_df = pd.read_csv('intra_structure_distances.csv')

# Calculate statistics
final_shape = fs.feature_matrix.shape
final_total = fs.feature_matrix.size
final_valid = np.sum(~np.isnan(fs.feature_matrix))

print(f"\n✅ Reloaded successfully!")
print(f"\n📊 Feature matrix:")
print(f"   Matrix shape: {final_shape[0]:,} structures × {final_shape[1]:,} residue pairs")
print(f"   Total entries: {final_total:,}")
print(f"   Valid measurements: {final_valid:,}")
print(f"   NaN values: {final_total - final_valid:,}")


We first look for abnormally large distances (above 50 Å) that might indicate structural issues and exclude them from the feature matrices.
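
The logic of this filter can be sketched in NumPy as follows. The helper name `mask_outliers_and_drop_nan` and the toy matrix are hypothetical; the sketch only mirrors the behaviour described in the cell comments (distances above the threshold are set to NaN, then all-NaN rows and columns are dropped).

```python
import numpy as np

def mask_outliers_and_drop_nan(matrix, threshold=50.0):
    """Set entries above `threshold` to NaN, then drop all-NaN rows and columns."""
    cleaned = np.array(matrix, dtype=float)
    with np.errstate(invalid="ignore"):  # comparisons against existing NaNs are fine
        cleaned[cleaned > threshold] = np.nan
    keep_rows = ~np.isnan(cleaned).all(axis=1)
    keep_cols = ~np.isnan(cleaned).all(axis=0)
    return cleaned[keep_rows][:, keep_cols], keep_rows, keep_cols

# Hypothetical matrix (structures x features); the middle feature is always > 50 Å
X = np.array([[8.0, 75.0, 12.0],
              [9.0, 90.0, np.nan],
              [7.5, 60.0, 11.0]])
cleaned, kept_rows, kept_cols = mask_outliers_and_drop_nan(X)
print(cleaned.shape, kept_cols)
```

Features that are only partly NaN are kept here and left for a later imputation step.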

In [None]:
# Check prerequisites
try:
    fs
    if fs.feature_matrix is None:
        raise ValueError("Feature matrix not built yet. Build or reload the feature matrix first.")
except NameError:
    raise NameError("FeatureSelection object 'fs' not defined. Create or reload it first.")

# One-liner replacement for the long outlier/NaN-cleaning snippet.
# - sets distances > 50 Å to NaN
# - drops all-NaN features/structures
# - saves outlier table to CSV
results = fs.filter_outlier_distances_and_drop_nan(
    threshold=50.0,
    set_to_nan=True,
    max_nan_fraction=1.0,
    outliers_csv_path="outlier_distances.csv",
    print_top_n=10,
    verbose=True,
)


We now filter out all features defined between consecutive residues, since these are most likely highly correlated.
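
Assuming each feature is identified by a tuple of two integer residue indices (the example pairs below are hypothetical), the filter reduces to a one-line comparison:

```python
# Hypothetical residue-index pairs, each identifying one distance feature
unique_pairs = [(130, 131), (130, 145), (145, 146), (131, 174)]

# Keep only pairs whose indices are more than one position apart
non_consecutive = [(i, j) for i, j in unique_pairs if abs(i - j) > 1]
print(non_consecutive)
```

The corresponding columns of the feature matrix need to be dropped with the same mask so that pairs and columns stay aligned.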

In [None]:
# Filter out features with consecutive residue indices
# These features (e.g., 130-131) are not informative since consecutive residues
# are always close together in the protein structure

print("="*60)
print("FILTERING CONSECUTIVE RESIDUE FEATURES")
print("="*60)

# Count features before filtering
features_before = len(fs.unique_pairs)
print(f"\nFeatures before filtering: {features_before}")

# Find and remove consecutive residue pairs
consecutive_features = fs.filter_consecutive_residues(remove=True)

# Show final count
features_after = len(fs.unique_pairs)
print(f"\n📊 Final feature count: {features_after}")
print(f"   Features removed: {features_before - features_after}")


We then look specifically for highly correlated features, group them into classes of correlated features, and pick one representative feature from each class: the one with the largest variance.
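
One way to sketch this is to treat features as nodes of a graph, link two features when their absolute Pearson correlation exceeds the threshold, and keep the highest-variance feature of every connected component. Grouping by connected components is an assumption of this sketch, as are the function name and the synthetic data; the sketch also assumes a matrix without NaN values.

```python
import numpy as np

def select_uncorrelated(X, threshold=0.90):
    """Keep one feature (highest std) per group of mutually linked correlated features."""
    X = np.asarray(X, dtype=float)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    std = X.std(axis=0)
    unvisited = set(range(X.shape[1]))
    selected = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        group, stack = [start], [start]
        while stack:  # depth-first walk over the correlation graph
            current = stack.pop()
            linked = [k for k in unvisited if corr[current, k] > threshold]
            for k in linked:
                unvisited.remove(k)
                group.append(k)
                stack.append(k)
        selected.append(max(group, key=lambda k: std[k]))
    return sorted(selected)

# Synthetic example: feature 1 is a rescaled copy of feature 0, feature 2 is unrelated
t = np.linspace(0.0, 1.0, 50)
X = np.column_stack([t, 3 * t + 0.05 * np.sin(40 * t), np.cos(6 * np.pi * t)])
selected = select_uncorrelated(X, threshold=0.90)
print(selected)
```

Features 0 and 1 fall into the same group and feature 1 is kept because it varies more, while feature 2 forms its own group.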

In [None]:
# Correlation-based feature selection
# Identify groups of highly correlated features and keep only the feature
# with the highest standard deviation from each group

print("\n" + "="*60)
print("CORRELATION-BASED FEATURE SELECTION")
print("="*60)

# Speed tips:
# - If you have no NaN values, correlation will be much faster (uses numpy's corrcoef)
# - use_parallel=True enables parallel processing (2-8x faster with NaN values)
# - Set plot_histogram=False to skip plotting
# - Increase correlation_threshold (e.g., 0.95) to find fewer groups

print(f"Current feature matrix shape: {fs.feature_matrix.shape}")
print(f"Has NaN values: {np.any(np.isnan(fs.feature_matrix))}")

# Perform correlation analysis
# If parallel processing has issues, set use_parallel=False to use the safe sequential method
selected_features, analysis_info = fs.filter_correlated_features(
    correlation_threshold=0.90,  # Higher threshold = fewer correlated groups = faster
    plot_histogram=True,          # Set to False to skip plotting
    plot_network=False,           # Set to True to see the correlation network (slow for many features)
    use_parallel=True,            # Set to False if you encounter issues with parallel processing
    n_jobs=-1                     # Use all CPU cores (-1), or specify number (e.g., 4)
)

# Apply the selection to the feature matrix
fs.apply_feature_selection(selected_features)

print("\n✅ Correlation-based feature selection complete!")

# Save intermediate results (after correlation selection)
print("\n" + "="*60)
print("Saving correlation-filtered feature matrix...")
print("="*60)
fs.save_results(output_prefix="corr_filtered_")
print("\n✅ Saved correlation-filtered results!")


If you've already run correlation-based feature selection and want to skip directly to same-mean and variance filtering, you can run the following cell.

In [None]:
fs = FeatureSelection(dfg_index=145, ape_index=174, conservation_threshold=0.97)
fs.load_results('corr_filtered_reference_data.pkl')

feature_df = pd.read_csv('corr_filtered_feature_matrix.csv', index_col=0)
fs.feature_matrix = feature_df.values
fs.structure_names = list(feature_df.index)

labels_df = pd.read_csv('corr_filtered_labels.csv')
fs.labels = labels_df['label'].values

fs.intra_structure_df = pd.read_csv('corr_filtered_intra_structure_distances.csv')

print(f"✅ Reloaded correlation-filtered data!")
print(f"   Matrix shape: {fs.feature_matrix.shape}")
print(f"   Number of features: {len(fs.unique_pairs)}")


We now further reduce the feature space by grouping features that have the same mean (within 0.01 Å) and keeping only the feature with the highest variance from each group.
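
A possible reading of this rule is sketched below: features are sorted by mean, a new group starts whenever a mean lies more than the tolerance above the first mean of the current group, and the feature with the largest standard deviation is kept per group. The grouping strategy, the function name and the toy matrix are assumptions of this sketch.

```python
import numpy as np

def filter_same_mean(X, mean_tolerance=0.01):
    """Return indices of the highest-std feature within each group of near-equal means."""
    X = np.asarray(X, dtype=float)
    means = np.nanmean(X, axis=0)
    stds = np.nanstd(X, axis=0)
    order = np.argsort(means)

    groups, current = [], [order[0]]
    for idx in order[1:]:
        if means[idx] - means[current[0]] <= mean_tolerance:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    return sorted(int(max(group, key=lambda k: stds[k])) for group in groups)

# Hypothetical matrix: features 0 and 1 share a mean of 12 Å, feature 1 varies more
X = np.array([[10.0, 9.0, 20.0],
              [12.0, 13.0, 21.0],
              [14.0, 14.0, 22.0]])
kept = filter_same_mean(X, mean_tolerance=0.01)
print(kept)
```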

In [None]:
# Filter features with the same mean
# Keeps only the feature with highest standard deviation from each mean group

same_mean_info = fs.filter_same_mean_features(
    mean_tolerance=0.01,  # Features within 0.01 Å mean are considered "same"
    remove=True
)

print(f"\n📊 Same-mean filtering complete!")
print(f"   Mean groups found: {same_mean_info['n_groups']}")
print(f"   Features removed: {same_mean_info['n_removed']}")


We further reduce the feature space by filtering out features that have low variance (< 0.1 Å²).
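
In NumPy terms, and assuming the threshold is applied to the per-feature variance computed while ignoring NaN values, the filter amounts to the following (toy matrix is hypothetical):

```python
import numpy as np

# Hypothetical matrix: feature 0 barely changes, feature 1 varies across structures
X = np.array([[5.0, 8.0],
              [5.1, 10.0],
              [5.2, 12.0]])

variances = np.nanvar(X, axis=0)         # per-feature variance, NaNs ignored
keep = np.flatnonzero(variances >= 0.1)  # indices of features that pass the threshold
X_filtered = X[:, keep]
print(keep, X_filtered.shape)
```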

In [None]:
# Filter features with low variance
# Removes features that don't vary much across structures

selected_variance_indices = fs.filter_low_variance_features(
    variance_threshold=0.1,  # Remove features with variance < 0.1
    remove=True
)

print(f"\n✅ Low variance filtering complete!")
print(f"   Final feature count: {len(fs.unique_pairs)}")


Finally, we use the ANOVA F-value to select a subset of statistically relevant features, i.e. those that best distinguish between the two classes.
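
For reference, the one-way ANOVA F-value of a feature is the ratio of the between-class to the within-class mean square. The sketch below computes it directly in NumPy for a toy dataset; the function name and the data are hypothetical, and in practice scikit-learn's `f_classif` together with `SelectKBest` provides the same statistic.

```python
import numpy as np

def anova_f_values(X, y):
    """One-way ANOVA F statistic of every column of X with respect to the labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        group_mean = group.mean(axis=0)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += ((group - group_mean) ** 2).sum(axis=0)
    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_samples - n_classes)
    return ms_between / ms_within

# Hypothetical data: feature 0 separates the two classes, feature 1 hardly does
X = np.array([[1.0, 5.0],
              [2.0, 6.0],
              [9.0, 5.5],
              [10.0, 6.5]])
y = np.array([0, 0, 1, 1])
f_values = anova_f_values(X, y)
top = np.argsort(f_values)[::-1][:1]  # index of the single best feature
print(f_values, top)
```

Ranking features by this value and keeping the top N corresponds to the `n_features` argument used in the next cell.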

In [None]:
# Filter features using ANOVA F-value
# Keeps top N features that best distinguish between classes (active vs inactive)

selected_anova_indices = fs.filter_anova_features(
    n_features=300,      # Keep top 300 features
    plot_scores=False,   # Set to True to see F-value distribution
    remove=True
)

print(f"\n✅ ANOVA F-value filtering complete!")
print(f"   Final feature count: {len(fs.unique_pairs)}")


In [None]:
# Impute any remaining NaN values before saving and classification
fs.impute_remaining_nan(strategy='median')


In [None]:
# Save the fully filtered feature matrix and related data
print("\n" + "="*60)
print("SAVING FILTERED FEATURE MATRIX")
print("="*60)

print(f"\nFinal feature matrix shape: {fs.feature_matrix.shape}")
print(f"  Structures: {fs.feature_matrix.shape[0]}")
print(f"  Features: {fs.feature_matrix.shape[1]}")

fs.save_results(output_prefix="filtered_")
print("\n✅ Saved filtered results! Can be reloaded for downstream analysis.")

# Print filtering summary
print("\n" + "="*60)
print("FILTERING SUMMARY")
print("="*60)
print("Applied filters in order:")
print("  1. ✓ Outlier distances (>50 Å)")
print("  2. ✓ All-NaN features/structures")
print("  3. ✓ Consecutive residue pairs")
print("  4. ✓ Correlation-based selection (r > 0.90)")
print("  5. ✓ Same-mean features (tolerance 0.01 Å)")
print("  6. ✓ Low variance features (threshold 0.1)")
print("  7. ✓ ANOVA F-value selection (top 300)")
print(f"\nFinal: {fs.feature_matrix.shape[0]} structures × {fs.feature_matrix.shape[1]} features")


# 6. Feature classification  <a id="6"></a>

In this section we will train and analyse a Random Forest (RF) classifier to investigate which features are most significant in predicting the predominant activation loop conformational changes.

The class `FeatureClassification` facilitates running all the steps required to train a RF classifier.

In [None]:
# Classification Analysis using FeatureClassification class
from workflow.feature_classification import FeatureClassification

# Use the filtered feature matrix and labels from feature selection (full dataset, no balancing)
classifier = FeatureClassification(
    feature_matrix=fs.feature_matrix,
    labels=fs.labels,
    unique_pairs=fs.unique_pairs,
    fully_conserved=fs.fully_conserved,
    structure_names=fs.structure_names
)

print(f"✅ FeatureClassification initialized")
print(f"   Features: {len(classifier.unique_pairs)}")
print(f"   Structures: {len(classifier.labels)}")
print(f"   Classes: {np.unique(classifier.labels)}")
print(f"   Class distribution: {dict(zip(*np.unique(classifier.labels, return_counts=True)))}")


Let's first split the data into training and test sets.

In [None]:
# Step 1: Split data into train/test sets
classifier.split_data(train_size=0.9, random_state=42)

We can now train the model with our input features.

In [None]:
# Step 2: Train Random Forest model
classifier.train_model(n_estimators=100, random_state=42)

Let's evaluate model performance and visualise it with a confusion matrix to make sure our classifier is able to deal with the input.

In [None]:
# Step 3: Evaluate model performance
metrics = classifier.evaluate_model()

# Step 4: Plot confusion matrix
cm = classifier.plot_confusion_matrix()

Let's now visualise and investigate which features are most significant, looking at Mean Decrease in Impurity (MDI), permutation importances and SHAP values.
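
To make the difference between the first two measures concrete, here is a small self-contained sketch that uses scikit-learn directly on synthetic data. It is not the `FeatureClassification` implementation: the data, the stratified split and the variable names are assumptions, and SHAP values are left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered feature matrix: only feature 0 carries the label
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

# 90/10 split; stratification is an assumption of this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.9, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# MDI: impurity-based importances, with their spread across the individual trees
mdi = model.feature_importances_
mdi_std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)

# Permutation importance: drop in test accuracy when one feature is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

for idx in np.argsort(mdi)[::-1]:
    print(f"feature {idx}: MDI={mdi[idx]:.3f} ± {mdi_std[idx]:.3f}, "
          f"permutation={perm.importances_mean[idx]:.3f}")
```

MDI is computed on the training data and can favour features with many distinct values, whereas permutation importance is evaluated on held-out data, so it is worth checking that both rankings agree on the top features.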

In [None]:
# Step 5: Compute feature importances (MDI)
importances, importances_std, importances_sem = classifier.compute_feature_importances()

# Step 6: Print top features
top_indices = classifier.print_top_features(n_top=20)

# Step 7: Plot feature ranking
classifier.plot_feature_ranking(n_top=20)

# Step 8: Compute permutation importances
perm_result = classifier.compute_permutation_importances(n_repeats=10, n_jobs=4)

# Step 9: Compute SHAP values (can be slow)
shap_values = classifier.compute_shap_values()
classifier.plot_shap_summary(class_idx=0, max_display=20)  # Class 0
classifier.plot_shap_summary(class_idx=1, max_display=20)  # Class 1
classifier.plot_feature_distributions(n_top=20, class_idx=1)

print("\n" + "="*60)
print("✅ CLASSIFICATION ANALYSIS COMPLETE")
print("="*60)

---