# AF2_MiniPAE

AF2_MiniPAE is a lightweight tool for analyzing multiple AlphaFold2 multimer predictions (from **ColabFold**) to identify high-confidence motif-like regions between an **intrinsically disordered protein (chain B)** and a **receptor (chain A)**. It computes the minimal **Predicted Aligned Error (PAE)** between chains and highlights the most confident interaction region.

> Author: Martin Veinstein

You can generate your AlphaFold2 predictions using ColabFold: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

Or the batch version for multiple predictions: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/batch/AlphaFold2_batch.ipynb#scrollTo=kOblAo-xetgx

## 🔍 **What AF2_MiniPAE does**

Given AlphaFold2 prediction outputs (`.result.zip` archives from ColabFold), this tool:

- Parses the `.a3m` alignment file to determine sequence lengths and structure type (monomer, homodimer, or heterodimer)
- Loads the PAE matrix from the rank 001 scores file (`*_scores_rank_001*.json`)
- Extracts, for each residue of the IDP chain, the **minimal PAE** against the receptor chain
- Highlights residues with a minimal PAE below a confidence threshold (default: 4 Å)
- Outputs:
  - Formatted sequence (uppercase = high-confidence contact)
  - Motif region (first-to-last confident contact)
  - PAE metrics and region indices
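
The minimal-PAE step can be illustrated on a toy matrix. This is only a sketch: the 5x5 values and the chain lengths (2 residues for chain A, 3 for chain B) are made up, and the names `len_a`, `len_b`, and `inter_chain` are illustrative rather than taken from the notebook.

```python
import numpy as np

# Toy PAE matrix: chain A (receptor) has 2 residues, chain B (IDP) has 3.
len_a, len_b = 2, 3
pae = np.array([
    [0.5,  1.0,  12.0, 3.0, 20.0],
    [1.0,  0.5,  9.0,  2.5, 18.0],
    [11.0, 8.0,  0.5,  4.0, 15.0],
    [3.5,  2.0,  4.0,  0.5, 6.0],
    [19.0, 17.0, 14.0, 6.0, 0.5],
])

# Inter-chain block: rows = chain A residues, columns = chain B residues.
inter_chain = pae[:len_a, len_a:]

# For each chain B residue, keep the lowest PAE against any chain A residue.
min_pae_per_residue = inter_chain.min(axis=0)

threshold = 4  # same default as the notebook
confident = min_pae_per_residue < threshold

print(min_pae_per_residue.tolist())  # [9.0, 2.5, 18.0]
print(confident.tolist())            # [False, True, False]
```

Here only the second chain B residue falls below the threshold, so it alone would be written in uppercase.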

## ⚠️ **Requirements & Notes**

This script only works for **dimeric predictions** (homodimers or heterodimers).

The SLiM-containing protein **must be the second chain** in your AlphaFold2 prediction (i.e., chain B).

This script **reads `.result.zip` archives**. We therefore strongly recommend ticking the `zip_results` option if you are running the ColabFold batch notebook.

This script expects:

- a `.a3m` alignment file
- a rank 001 scores `.json` file (its name must contain both `scores` and `rank_001`)

Both files should be zipped together in a `.result.zip` archive, stored in your Google Drive.
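
Before running the analysis, it can help to confirm that an archive holds both files. The sketch below is an illustration, not part of the notebook: `check_archive` is a hypothetical helper, the file names are placeholders, and the A3M check is simplified to "any `.a3m` file" (the notebook itself looks for an A3M file named after the archive).

```python
import io
import zipfile

def check_archive(zip_file):
    """Return (a3m_names, score_names) found in an open ZipFile."""
    names = zip_file.namelist()
    a3m = [n for n in names if n.endswith(".a3m")]
    scores = [
        n for n in names
        if n.endswith(".json") and "rank_001" in n and "scores" in n
    ]
    return a3m, scores

# Build a small in-memory archive with placeholder file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("myprotein.a3m", "#120,35\t1,1\n")
    zf.writestr("myprotein_scores_rank_001.json", "{}")

with zipfile.ZipFile(buffer) as zf:
    a3m_files, score_files = check_archive(zf)

print(a3m_files)    # ['myprotein.a3m']
print(score_files)  # ['myprotein_scores_rank_001.json']
```

To check a real archive, open it with `zipfile.ZipFile(path)` instead of the in-memory buffer; an empty list for either entry means the archive would be skipped.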


**References**

This script is adapted from the work of:

Omidi et al.

A. Omidi, M.H. Møller, N. Malhis, J.M. Bui, & J. Gsponer,
AlphaFold-Multimer accurately captures interactions and dynamics of intrinsically disordered protein regions,
Proc. Natl. Acad. Sci. U.S.A. 121 (44) e2406407121,
https://doi.org/10.1073/pnas.2406407121 (2024)

The method and code were used and described in our own study:

[Full citation coming soon]

# **Run MiniPAE**

You can run all cells at once from the menu at the top of the notebook:

Runtime > Run all

📘 Step 1: Mount your Google Drive

In [1]:
# @title
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


📌 Select the input folder location in your Drive

Your `.result.zip` archives must be stored together in one folder in your Drive, for example `/input_alphafold/`, containing files such as `myprotein.result.zip`.

Each zip must include the following files. The `.a3m` file must carry the same base name as the archive:

>myprotein**.a3m**

>myprotein_scores_rank_001**.json**

Set the path to that folder below:

In [2]:
# 📁 User input parameters
dir = '/content/drive/MyDrive/input_alphafold'  #@param {type:"string"}



Define the PAE threshold (in Å) below which a residue counts as a high-confidence contact (default = 4)

In [3]:
pae_threshold = 4  #@param {type:"number"}
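
To see how the threshold shapes the output, the sketch below formats one sequence at two thresholds. The sequence and PAE values are made up, and `format_by_threshold` is a hypothetical helper that mirrors the uppercase/lowercase rule used later in the notebook.

```python
import re

def format_by_threshold(sequence, min_pae, threshold):
    """Uppercase residues whose minimal PAE is below the threshold."""
    return "".join(
        res.upper() if value < threshold else res.lower()
        for res, value in zip(sequence, min_pae)
    )

sequence = "MSPKRLDE"                                   # hypothetical chain B sequence
min_pae = [15.2, 9.8, 3.1, 2.4, 6.0, 3.9, 12.5, 20.1]   # made-up minimal PAE values

for threshold in (4, 8):
    formatted = format_by_threshold(sequence, min_pae, threshold)
    # Motif = span from the first to the last uppercase residue.
    match = re.search(r"[A-Z].*[A-Z]", formatted)
    motif = match.group(0) if match else "N/A"
    print(threshold, formatted, motif)
# 4 msPKrLde PKrL
# 8 msPKRLde PKRL
```

A stricter (lower) threshold can leave lowercase residues inside the motif, as with `PKrL` here, because the motif spans everything between the first and last confident residue.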

📘 Step 2: Install required libraries (if needed)

In [None]:
# @title
!pip install numpy pandas


📘 Step 3: Import Python libraries

In [5]:
# @title
# 🔧 Required imports
import numpy as np
import json
import os
import re
import zipfile
import pandas as pd



📘 Step 4: Define functions to process your files

In [6]:
# @title
def parse_a3m_header(a3mhead):
    lengths_str, cardinalities = a3mhead.strip().lstrip("#").split()
    lengths = list(map(int, lengths_str.split(",")))
    return lengths, cardinalities

def read_scores(zip_file, score_filename, a3m_filename, threshold):
    with zip_file.open(score_filename) as f:
        scores = json.load(f)

    with zip_file.open(a3m_filename) as f:
        lines = f.read().decode("utf-8").splitlines()
        a3mhead = lines[0].strip()
        lengths, card = parse_a3m_header(a3mhead)

        # Chain order follows the prediction: chain A (receptor) first, chain B (IDP) second
        if len(lengths) == 2 and card == "1,1":
            receptor_len, idp_len = lengths
            marker = ">102"  # Heterodimer: chain B is the second query sequence
        elif len(lengths) == 1 and card == "2":
            receptor_len = idp_len = lengths[0]
            marker = ">101"  # Homodimer: both chains share one query sequence
        elif len(lengths) == 1 and card == "1":
            raise ValueError("Monomeric structure detected: only dimeric models (homodimer or heterodimer) are supported.")
        else:
            raise ValueError(f"Unsupported structure configuration: lengths={lengths}, cardinalities={card}. Only dimers are supported.")

        # Extract the chain B sequence from the line that follows the marker
        protein_sequence = None
        for i, line in enumerate(lines):
            if line.strip() == marker:
                if i + 1 < len(lines):
                    protein_sequence = lines[i + 1].strip().replace("-", "")
                break

    pae = np.array(scores["pae"])
    total_len = receptor_len + idp_len
    assert pae.shape == (total_len, total_len), f"PAE shape mismatch: {pae.shape}"

    # Rows = chain A residues, columns = chain B residues; keep the lowest PAE per chain B residue
    pae_idp_receptor_min = pae[:receptor_len, receptor_len:].min(axis=0).tolist()

    if protein_sequence and len(protein_sequence) == len(pae_idp_receptor_min):
        formatted_sequence = "".join(
            res.upper() if pae_val < threshold else res.lower()
            for res, pae_val in zip(protein_sequence, pae_idp_receptor_min)
        )
        # Motif = span from the first to the last uppercase (high-confidence) residue
        match = re.search(r"[A-Z].*[A-Z]", formatted_sequence)
    else:
        formatted_sequence = "N/A"
        match = None  # do not run the regex on the "N/A" placeholder, which it would match

    motif_sequence = match.group(0) if match else "N/A"
    motif_start = match.start() + 1 if match else "N/A"  # 1-based, inclusive
    motif_stop = match.end() if match else "N/A"  # 1-based, inclusive

    return {
        "pae_idp_receptor_min": pae_idp_receptor_min,
        "formatted_sequence": formatted_sequence,
        "motif_sequence": motif_sequence,
        "motif_start": motif_start,
        "motif_stop": motif_stop
    }
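
As a quick check of the header parsing above, the sketch below runs `parse_a3m_header` (copied from the cell above so the example is self-contained) on three header lines. The lengths are made up, and `describe` is a hypothetical helper that only names the case each header falls into.

```python
def parse_a3m_header(a3mhead):
    lengths_str, cardinalities = a3mhead.strip().lstrip("#").split()
    lengths = list(map(int, lengths_str.split(",")))
    return lengths, cardinalities

def describe(header):
    """Name the structure type encoded in an A3M header line."""
    lengths, card = parse_a3m_header(header)
    if len(lengths) == 2 and card == "1,1":
        return "heterodimer"
    if len(lengths) == 1 and card == "2":
        return "homodimer"
    if len(lengths) == 1 and card == "1":
        return "monomer"
    return "unsupported"

# Header format: "#<chain lengths>\t<copies per chain>"
print(parse_a3m_header("#120,35\t1,1"))  # ([120, 35], '1,1')
print(describe("#120,35\t1,1"))          # heterodimer
print(describe("#80\t2"))                # homodimer
print(describe("#80\t1"))                # monomer
```

Only the first two cases are accepted by `read_scores`; the others raise a `ValueError`.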


📘 Step 5: Set your folder and run the analysis

In [None]:
# @title
output_file = f"/content/metrics_miniPAE-thresh{pae_threshold}.csv"
results_summary = []

def read_scores_from_zip(zip_path, threshold):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # Find relevant JSON and A3M files inside the archive
        json_candidates = [f for f in zip_ref.namelist() if f.endswith(".json") and "rank_001" in f and "scores" in f]

        basename = os.path.basename(zip_path).replace(".result.zip", "")
        expected_a3m_name = f"{basename}.a3m"
        a3m_candidates = [f for f in zip_ref.namelist() if f.endswith(expected_a3m_name)]

        # Safety checks
        if not json_candidates:
            raise FileNotFoundError("No score JSON file found inside the archive.")
        if not a3m_candidates:
            raise FileNotFoundError("No A3M file found inside the archive.")

        json_name = json_candidates[0]
        a3m_name = a3m_candidates[0]

        return read_scores(zip_ref, json_name, a3m_name, threshold)

with open(output_file, "w") as csvfile:
    csvfile.write("experiment_name,min_pae_val,pae_idp_receptor_min,formatted_sequence,motif_sequence,motif_start,motif_stop\n")

    for fname in os.listdir(dir):
        if fname.endswith(".result.zip"):
            exp = fname.replace(".result.zip", "")
            zip_path = os.path.join(dir, fname)

            try:
                result = read_scores_from_zip(zip_path, pae_threshold)
                pae_vals = result["pae_idp_receptor_min"]
                min_val = min(pae_vals) if pae_vals else "N/A"
                pae_str = ";".join(map(str, pae_vals))

                csvfile.write(f"{exp},{min_val},{pae_str},{result['formatted_sequence']},{result['motif_sequence']},{result['motif_start']},{result['motif_stop']}\n")

                results_summary.append({
                    "experiment": exp,
                    "min_pae_val": min_val,
                    "motif_sequence": result["motif_sequence"]
                })

            except FileNotFoundError as fnf_err:
                print(f"⚠️ Skipping {fname}: {fnf_err}")
            except zipfile.BadZipFile:
                print(f"❌ ERROR: {fname} is not a valid .zip file. Skipping.")
            except Exception as e:
                print(f"❌ ERROR while processing {fname}: {e}")

print(f"✅ Done! Results saved to: {output_file}")
pd.DataFrame(results_summary)


📘 Step 6: Download results

In [8]:
# @title
from google.colab import files
files.download(output_file)



## **Results**
A summary table is displayed at the end of this Colab notebook, and the full results are downloaded as a .csv file.

This CSV includes:

- experiment_name = the name of the prediction

- min_pae_val = the lowest inter-chain PAE value over all residues of the disordered protein (chain B)

- pae_idp_receptor_min = the minimal PAE of every residue of the disordered protein (chain B), separated by semicolons

- formatted_sequence = sequence of the disordered protein (chain B) with UPPER case for PAE < threshold

- motif_sequence = region of the disordered protein (chain B) spanning the first to the last residue below the threshold

- motif_start & motif_stop = start & stop positions (1-based) of that region
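
Once downloaded, the CSV can be read back for further analysis. The sketch below is one possible approach using only the standard library; the two rows are made-up examples written in the column layout produced by the notebook, and `pae_values` is an illustrative key added for the parsed per-residue list.

```python
import csv
import io

# Made-up content in the same column layout as the notebook's output file.
csv_text = (
    "experiment_name,min_pae_val,pae_idp_receptor_min,formatted_sequence,"
    "motif_sequence,motif_start,motif_stop\n"
    "demo_b,7.5,12.0;7.5;9.1,aqe,N/A,N/A,N/A\n"
    "demo_a,2.4,15.2;9.8;3.1;2.4;6.0;3.9,msPKrL,PKrL,3,6\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # Per-residue values are stored as one semicolon-separated string.
    row["pae_values"] = [float(v) for v in row["pae_idp_receptor_min"].split(";")]
    row["min_pae_val"] = float(row["min_pae_val"])

# Rank predictions from most to least confident (lowest minimal PAE first).
rows.sort(key=lambda r: r["min_pae_val"])
for row in rows:
    print(row["experiment_name"], row["min_pae_val"], row["motif_sequence"])
# demo_a 2.4 PKrL
# demo_b 7.5 N/A
```

For a real file, replace `io.StringIO(csv_text)` with `open(path, newline="")`. Rows without a confident region keep `N/A` in the motif columns, so convert `motif_start` and `motif_stop` to integers only after checking for that value.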