## Introduction

We present a workflow to discover protein conformational features associated with loop rearrangements. This notebook describes the steps adopted in our study. Implementations of these steps are included in the folder Classes.

This notebook is divided into the following sections:
1. Dataset creation
    1. Downloading sequence-related chains
    2. Reconstructing small regions of each chain
    3. Checking structure conservation
2. PCA
    1. Low-dimensional representation
    2. Clustering
    3. Analysis
3. Autoencoder
    1. Low-dimensional representation
    2. Clustering
    3. Analysis
4. Analysis

To get started, let's load some packages!

In [1]:
# File and system operations
import os
import sys
import subprocess
from glob import glob

# Data processing
import pandas as pd

# Network and parallel processing
import requests
import time
import multiprocessing
import concurrent.futures

# 1. Dataset creation

## 1.1 The data

## 1.2 Data download

The first step of the workflow is to select structures whose sequence is similar to the reference B-RAF sequence. To do this, we query InterPro on its online server and export the list of PDB structures that match.

The following code reads the PDB accessions from the tab-separated file exported from InterPro.

In [2]:
structure_path = 'structure-matching-IPR011009.tsv'
pdb_data = pd.read_csv(structure_path, sep = "\t", header=0, engine='python')
pdb_data['Accession'] = pdb_data['Accession'].str.upper()
pdb_ids = pdb_data['Accession'].tolist()

The structures are fetched with a multi-threaded PDB download system built from three methods:
- Method download2 downloads an individual file from the RCSB database. It handles 404 errors and empty files, and uses a streaming download for large files.
- Method download_pdbs processes a list of PDB files to be downloaded. It skips files that are already present to avoid duplicates and creates the output directory if it does not exist.
- Method parallel_download implements parallel downloading. It splits the PDB list into chunks, one per thread, and uses up to 20 threads or 2x the number of CPU cores, whichever is smaller.
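
The three methods above can be pictured with the minimal sketch below. It is not the PDBDownloader implementation: the names `download_one`, `n_workers` and `parallel_download_sketch` are hypothetical, it uses the standard library's `urllib` instead of `requests` to stay dependency-free, and it hands IDs to a thread pool one by one rather than splitting the list into chunks. The RCSB file endpoint `https://files.rcsb.org/download/<ID>.pdb` is assumed as the download URL.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Public RCSB download endpoint (assumed here; PDBDownloader may use another URL)
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def download_one(pdb_id, out_dir, timeout=30):
    """Stream one PDB file to disk; return True if a non-empty file exists afterwards."""
    target = os.path.join(out_dir, f"{pdb_id}.pdb")
    if os.path.isfile(target) and os.path.getsize(target) > 0:
        return True  # already downloaded, skip
    try:
        url = RCSB_URL.format(pdb_id=pdb_id)
        with urllib.request.urlopen(url, timeout=timeout) as response, \
                open(target, "wb") as handle:
            shutil.copyfileobj(response, handle)  # copies in chunks, not all at once
    except OSError:
        # urllib's HTTPError (e.g. 404) and URLError are both OSError subclasses
        if os.path.exists(target):
            os.remove(target)
        return False
    if os.path.getsize(target) == 0:  # treat empty files as failures
        os.remove(target)
        return False
    return True


def n_workers(cpu_count=None):
    """Up to 20 threads or 2x the CPU cores, whichever is smaller."""
    cores = cpu_count or os.cpu_count() or 1
    return min(20, 2 * cores)


def parallel_download_sketch(pdb_ids, out_dir, fetch=download_one):
    """Download all IDs with a thread pool; return {pdb_id: success}."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=n_workers()) as pool:
        results = list(pool.map(lambda pdb_id: fetch(pdb_id, out_dir), pdb_ids))
    return dict(zip(pdb_ids, results))
```

Threads suit this task because each worker spends most of its time waiting on the network. The `fetch` argument exists only so that the pool logic can be exercised without network access.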




In [None]:
from pdb_downloader import PDBDownloader

downloader = PDBDownloader()
downloader.parallel_download(pdb_ids, "Results/InterProPDBs") 

Checking which of the PDB entries listed by InterPro were actually downloaded into the directory Results/InterProPDBs.

In [None]:
folder_path = "Results/InterProPDBs"
file_names = [os.path.splitext(f)[0] for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
pdb_raw = pd.DataFrame({"PDBs": file_names})

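# Flag every accession whose file is present in the download folder
pdb_data['Downloaded'] = pdb_data['Accession'].str.upper().isin(pdb_raw['PDBs'])

# .get avoids a KeyError when every download succeeded (or every one failed)
counts = pdb_data['Downloaded'].value_counts().to_dict()
print(f"Downloaded: {counts.get(True, 0)}, Failed: {counts.get(False, 0)}")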

Saving the accessions of the PDB entries that failed to download.

In [4]:
# Keep only the rows whose download flag is False
fail_list = pdb_data[~pdb_data['Downloaded']]
fail_list.to_csv('fail_list.csv')

In [None]:
from utilities import count_pdb_files

pdb_directory = 'Results/InterProPDBs'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")
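
For readers without access to the utilities module, the sketch below shows what a helper like count_pdb_files could look like. It is an assumption about its behaviour (counting regular files whose name ends in `.pdb`, ignoring case), not the actual implementation, and `count_pdb_files_sketch` is a hypothetical name.

```python
import os
import tempfile


def count_pdb_files_sketch(directory):
    """Count regular files in `directory` whose name ends in .pdb (case-insensitive)."""
    return sum(
        1
        for name in os.listdir(directory)
        if name.lower().endswith(".pdb") and os.path.isfile(os.path.join(directory, name))
    )


# Usage on a throwaway directory holding two PDB files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1ABC.pdb", "2XYZ.PDB", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    demo_count = count_pdb_files_sketch(tmp)
```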

## 1.3 Extracting protein chains

Next, we keep only the protein chain of interest from each structure, as indicated by InterPro.

In [None]:
from pdb_chain_extractor import PDBChainExtractor

# Create an instance of the class
chain_extractor = PDBChainExtractor()

# Now call the method on the instance
chain_extractor.extract_chains_parallel(pdb_data, 'Results/activation_segments/unaligned/', None)
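
As an illustration of what extracting a chain involves, the sketch below filters the lines of a PDB file by chain identifier, which the PDB format stores in column 22 of each ATOM record. The function name `extract_chain_lines` is hypothetical, and dropping HETATM records is an assumption made here; PDBChainExtractor may behave differently.

```python
def extract_chain_lines(pdb_lines, chain_id):
    """Keep ATOM records whose chain identifier (column 22) matches `chain_id`."""
    kept = [
        line
        for line in pdb_lines
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id
    ]
    if kept:
        kept.append("END")  # close the output so it is a complete PDB file
    return kept


# Usage on four hand-written records: two atoms of chain A, one of chain B, one water
example_lines = [
    "ATOM      1  N   ASP A 594      10.000  10.000  10.000  1.00 20.00           N",
    "ATOM      2  CA  ASP A 594      11.000  10.000  10.000  1.00 20.00           C",
    "ATOM      3  N   ASP B 594      30.000  10.000  10.000  1.00 20.00           N",
    "HETATM    4  O   HOH A 901      15.000  15.000  15.000  1.00 30.00           O",
]
chain_a = extract_chain_lines(example_lines, "A")
```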

In [None]:
from utilities import count_pdb_files

pdb_directory = 'Results/activation_segments/unaligned/'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

### 1.3.2 Cleaning the dataset

Here we remove all structures whose sequence does not contain both the DFG and APE motifs.
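
A minimal sketch of such a motif check on a one-letter amino-acid sequence is given below. The name `has_dfg_then_ape` is hypothetical, and requiring DFG to appear before APE is an assumption made here, since the two motifs delimit the activation segment in that order.

```python
def has_dfg_then_ape(sequence):
    """True if the sequence contains a DFG motif followed, later on, by an APE motif."""
    dfg = sequence.find("DFG")
    if dfg == -1:
        return False
    # Search for APE only downstream of the DFG motif
    return sequence.find("APE", dfg + 3) != -1


# Usage: keep only the sequences that carry both motifs in the expected order
sequences = {
    "both": "KIGDFGLATVKSRWAPEVI",
    "dfg_only": "KIGDFGLATV",
    "wrong_order": "APEKIGDFG",
}
kept = [name for name, seq in sequences.items() if has_dfg_then_ape(seq)]
```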

## 1.4 Reconstructing small protein segments

In [None]:
from reconstruct import ProteinReconstructor
# Note: the MODELLER output should be collected in a folder holding all target files written by MODELLER.
# MODELLER's printed output should also be kept to a minimum.
# Configuration
input_dir = "Results/activation_segments/unaligned"
full_pdb_dir = "Results/InterProPDBs"
output_dir = "Results/activation_segments/reconstructedModeller"
max_gap_length = 4
    
# Create and run the reconstructor
reconstructor = ProteinReconstructor(
    input_dir=input_dir,
    full_pdb_dir=full_pdb_dir,
    output_dir=output_dir,
    max_gap_length=max_gap_length
)
    
reconstructor.run_modeller_pipeline()
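
To make the role of max_gap_length concrete, the sketch below finds breaks in a chain's residue numbering and accepts a structure only if every missing stretch is short enough. This reading of the parameter (maximum number of consecutive missing residues) is an assumption, the function names are hypothetical, and insertion codes are ignored.

```python
def missing_stretches(residue_numbers):
    """Return (first_missing, last_missing) for each break in the residue numbering."""
    ordered = sorted(set(residue_numbers))
    return [
        (prev + 1, nxt - 1)
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt - prev > 1
    ]


def is_reconstructable(residue_numbers, max_gap_length=4):
    """True if every missing stretch is at most `max_gap_length` residues long."""
    return all(
        last - first + 1 <= max_gap_length
        for first, last in missing_stretches(residue_numbers)
    )


# Usage: residues 597-599 are missing in the first segment, 596-600 in the second
short_gap = [594, 595, 596, 600, 601]
long_gap = [594, 595, 601]
gaps_found = missing_stretches(short_gap)
```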

In [None]:
from utilities import count_pdb_files

pdb_directory = 'Results/activation_segments/reconstructedModeller/'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

## 1.5 Structural alignment

Now let's run the MUSTANG algorithm to align all structures.

In [None]:
from align import Alignment

# Initialize the alignment class
aligner = Alignment()

# Example 1: MUSTANG alignment
aligner.process_mustang_alignment(
    pdb_path="Results/activation_segments/reconstructedModeller",
    target_dir="Results/activation_segments/reconstructed_mustang",
    template_pdb="6UAN_chainD.pdb"
)
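
For orientation, the sketch below assembles a pairwise MUSTANG command line of the kind a wrapper such as process_mustang_alignment might issue. The helper name `mustang_command` and the output prefix are hypothetical, and the flags (`-i`, `-o`, `-F`, `-s`) follow MUSTANG's usage message as we understand it, so they should be checked against `mustang --help` for the installed version.

```python
import shlex


def mustang_command(template_pdb, mobile_pdb, output_prefix, mustang_bin="mustang"):
    """Build (but do not run) a pairwise MUSTANG call as an argument list."""
    return [
        mustang_bin,
        "-i", template_pdb, mobile_pdb,  # structures to align, template first
        "-o", output_prefix,             # identifier used for the output files
        "-F", "fasta",                   # alignment output format
        "-s", "ON",                      # also write the superposed PDB
    ]


# Usage: print the command; it could then be run with subprocess.run(cmd, check=True)
cmd = mustang_command("6UAN_chainD.pdb", "example_chainA.pdb", "example_chainA_aligned")
print(shlex.join(cmd))
```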

In [None]:
from utilities import count_pdb_files

pdb_directory = 'Results/activation_segments/reconstructed_mustang/'
pdb_count = count_pdb_files(pdb_directory)

print(f"There are {pdb_count} PDB files in the directory '{pdb_directory}'.")

### 1.5.1 Analysis of structural conservation

Here we check structural conservation. We also show the proposed sequence alignment, both from MUSTANG and from an actual sequence alignment.

We work with unreconstructed structures in order to assess residue conservation. Let's create a dataset of the unreconstructed counterparts of the structures selected for reconstruction.

In [None]:
from utilities import create_nonreconstructed_folder

create_nonreconstructed_folder(
    unaligned_dir="Results/activation_segments/unaligned",
    reconstructed_dir="Results/activation_segments/reconstructedModeller", 
    target_dir="Results/activation_segments/nonReconstructed4Mustang"
)


Let's run the structural alignment again for structures that are not reconstructed.

In [None]:
# Simplified workflow with imports
import os
from utilities import find_pdbs, fname
from MUSTANG import run_mustang

# Define the paths
pdb_path = "Results/activation_segments/nonReconstructed4Mustang"
target_dir = "Results/activation_segments/nonReconstructed_mustang"
template_pdb = "6UAN_chainD.pdb"

# Ensure the target directory exists
os.makedirs(target_dir, exist_ok=True)

print(f"Input directory: {pdb_path}")
print(f"Output directory: {target_dir}")
print(f"Template PDB: {template_pdb}")

# Get list of PDB files
pdbs = find_pdbs(pdb_path)
print(f"Found {len(pdbs)} PDB files to process")

# Process each file
for pdb in pdbs:
    name = fname(pdb)
    new_fp = run_mustang(template_pdb, pdb, target_dir, name=name)

print("Processing complete")

Now let's analyse the structural alignments to B-RAF and identify which regions of the B-RAF structure are most conserved in our dataset. We start by loading the MUSTANG alignments.

In [None]:
from analyse_alignment import analyse_alignment

# Create a handler for loading alignments
analyzer = analyse_alignment("Results/activation_segments/nonReconstructed_mustang")

# Step 1: Load alignments
aligned = analyzer.load_alignments()

### 1.5.2 Cleaning the dataset

Here we remove all structures whose sequence does not contain both the DFG and APE motifs.

First, let's make sure our alignment file is correct by keeping only structures with both DFG and APE.

In [4]:
from analyse_alignment import analyse_alignment

# Initialize analyzer
analyzer = analyse_alignment("Results/activation_segments/nonReconstructed_mustang")

In [5]:
# STEP 1: Filter structures with both DFG and APE motifs
print("=== STEP 1: Filtering structures with DFG and APE motifs ===")
valid_alignments, invalid_alignments = analyzer.filter_and_save_structures_with_motifs(
    target_dir="Results/activation_segments/reconstructed_mustang_filtered/"
)

=== STEP 1: Filtering structures with DFG and APE motifs ===
Loading alignments...
Filtering alignments based on DFG and APE motifs...
Found 1107 structures with both DFG and APE motifs
Excluded 49 structures without required motifs
Copying valid structures to Results/activation_segments/reconstructed_mustang_filtered/...


Copying structures with DFG and APE motifs: 100%|██████████| 1107/1107 [04:50<00:00,  3.81it/s]

Successfully copied 2214 files
Saved 1107 filtered alignments to Results/activation_segments/reconstructed_mustang_filtered/filtered_alignments_with_motifs.pkl





In [None]:
# STEP 2: Check alignment quality (DFG and APE not aligned to gaps)
print("\n=== STEP 2: Validating DFG and APE alignment quality ===")
validated_alignments, aligning_segs, counts, lengths = analyzer.filter_alignments_by_gaps(valid_alignments)

Here we discard the structures that MUSTANG aligned poorly, identified as those in which DFG or APE is aligned to gaps.
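
The gap check can be sketched as follows for a single pairwise alignment, represented as two gapped strings of equal length with `-` as the gap character. This representation and the function names are assumptions; filter_alignments_by_gaps may store and test alignments differently.

```python
def motif_columns(aligned_seq, motif, gap="-"):
    """Return the alignment columns holding the first occurrence of `motif`, or None."""
    columns = [i for i, residue in enumerate(aligned_seq) if residue != gap]
    ungapped = "".join(aligned_seq[i] for i in columns)
    start = ungapped.find(motif)
    if start == -1:
        return None
    return columns[start:start + len(motif)]


def motif_aligned_without_gaps(aligned_query, aligned_reference, motif, gap="-"):
    """True if every residue of the query motif faces a reference residue, not a gap."""
    columns = motif_columns(aligned_query, motif, gap)
    if columns is None:
        return False
    return all(aligned_reference[c] != gap for c in columns)


# Usage: DFG faces reference residues, but the A of APE faces a gap
query = "AADFG--KAPE"
reference = "AADFGTTK-PE"
dfg_ok = motif_aligned_without_gaps(query, reference, "DFG")
ape_ok = motif_aligned_without_gaps(query, reference, "APE")
```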

In [None]:
# STEP 3: Copy final validated structures to ultimate destination
print("\n=== STEP 3: Copying final validated structures ===")
analyzer.copy_validated_structures(
    validated_alignments,
    source_dir="Results/activation_segments/reconstructed_mustang_filtered/",
    target_dir="Results/activation_segments/reconstructed_mustang_final/"
)

print("\nFinal Results:")
print(f"- Initial structures: {len(valid_alignments) + len(invalid_alignments)}")
print(f"- With DFG/APE motifs: {len(valid_alignments)}")
print(f"- With proper alignment: {len(validated_alignments)}")

Let's now visualise a histogram indicating which regions of the B-RAF structure are conserved throughout the dataset.

In [None]:
# Call visualize_residue_conservation
conservation, highly_conserved = analyzer.visualize_residue_conservation(
    filtered_alignments=validated_alignments,
    braf_reference_seq=None,  # Optional: provide your BRAF reference sequence
    output_file="conservation_plot.png",
    figsize=(15, 6),
    show_plot=True
)
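
One plausible way to quantify conservation from pairwise alignments is the fraction of structures that place a residue, rather than a gap, opposite each reference position. The sketch below computes that quantity; the definition, the `(aligned_reference, aligned_query)` pair layout and the function name are assumptions, and visualize_residue_conservation may use a different measure.

```python
def coverage_per_reference_residue(pairs, gap="-"):
    """Fraction of alignments with a residue facing each reference position.

    `pairs` is a list of (aligned_reference, aligned_query) strings; the ungapped
    reference must be the same sequence in every pair.
    """
    counts = None
    for aligned_reference, aligned_query in pairs:
        # One flag per reference residue: is it matched by a query residue?
        covered = [
            q != gap
            for r, q in zip(aligned_reference, aligned_query)
            if r != gap
        ]
        if counts is None:
            counts = [0] * len(covered)
        if len(covered) != len(counts):
            raise ValueError("reference length differs between alignments")
        counts = [count + int(flag) for count, flag in zip(counts, covered)]
    if counts is None:
        return []
    return [count / len(pairs) for count in counts]


# Usage with a four-residue reference and two aligned structures
example_pairs = [("DFG-A", "DF-KA"), ("DFGA", "D-GA")]
coverage = coverage_per_reference_residue(example_pairs)
```

Positions with a coverage close to 1 are present in almost every structure, which is what a histogram over the reference positions would make visible.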

## 1.6 Aligning activation loops

Now, let's align the extremities of the loops so as to minimise the impact of the lack of roto-translational invariance on ML performance.
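
The idea of superposing on the loop ends can be illustrated with a small NumPy version of the Kabsch algorithm: the rigid transform is fitted on the first and last few atoms only and then applied to the whole loop. This is a stand-in for illustration, not the PyMOL-based procedure used in this notebook; the function names, the `n_end` parameter and the assumption of one coordinate row per residue (same count in both loops) are all hypothetical.

```python
import numpy as np


def kabsch_transform(mobile, reference):
    """Return rotation R and translation t so that mobile @ R.T + t best fits reference."""
    mobile_centroid = mobile.mean(axis=0)
    reference_centroid = reference.mean(axis=0)
    covariance = (mobile - mobile_centroid).T @ (reference - reference_centroid)
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = reference_centroid - rotation @ mobile_centroid
    return rotation, translation


def superpose_on_ends(loop, reference_loop, n_end=3):
    """Fit on the first and last `n_end` atoms only, then move the whole loop."""
    anchors = np.r_[0:n_end, len(loop) - n_end:len(loop)]
    rotation, translation = kabsch_transform(loop[anchors], reference_loop[anchors])
    return loop @ rotation.T + translation


# Usage: a rotated and shifted copy of a 10-atom loop is brought back onto the original
rng = np.random.default_rng(0)
reference_loop = rng.normal(size=(10, 3))
theta = 0.7
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved_loop = reference_loop @ rot_z.T + np.array([5.0, -2.0, 1.0])
fitted_loop = superpose_on_ends(moved_loop, reference_loop)
```

Fitting on the ends rather than on all atoms places every loop in a common frame defined by its anchor points, so that the remaining coordinate differences reflect the conformation of the loop itself.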

In [3]:
# Import the class
from align import Alignment

# Initialize with your MUSTANG path if needed
aligner = Alignment(mustang_path="/home/marmatt/Downloads/MUSTANG_v3.2.4/bin/mustang-3.2.4")

# PyMOL alignment on the loop ends
aligner.process_pymol_alignment(
    pdb_dir="Results/activation_segments/reconstructed_mustang/",
    reference_pdb="6UAN_chainD.pdb",  # Your reference structure
    output_dir="Results/activation_segments/reconstructed_mustang_ends/",
    ref_name="6UAN_chainD"
)

PyMOL initialized in headless mode

## 1.7 Strip to CA atoms


## 1.8 Fit stripped loops