## Explanation of the Protein Folding Prediction Script

This Python script is designed to classify protein sequences as either "Folded" (typically from PDB) or "Disordered" (typically from DisProt). It does this by:
1.  Defining a set of biophysical and compositional features for amino acid sequences.
2.  Implementing and evaluating two main rule-based classification approaches.
3.  Using a proper training/testing split to ensure fair evaluation of the classifiers.

The script's structure and logic can be mapped to the **Concept Model** framework (M1: Properties, M2: Constraints, M3: Transformations, M4: Goal State).

### Key Components of the Script:

1.  **Amino Acid Property Definitions (`aa_properties_base`):**
    * The script begins by defining various fundamental physicochemical properties for each of the 20 canonical amino acids (e.g., hydrophobicity, charge, flexibility, propensity for helix/sheet).
    * These base properties are normalized and stored. They are the building blocks for the features used by the classifiers.

2.  **Data Loading (`load_fasta_with_labels`):**
    * Protein sequences are loaded from FASTA files (`pdb_chains.fasta` for folded, `disprot_13000.fasta` for disordered).
    * Each sequence is stored along with its true label (1 for PDB/Folded, 0 for DisProt/Disordered) and its raw sequence string.

3.  **Seven New Features (`compute_new_seven_features`):**
    * `compute_new_seven_features` calculates a fixed set of 7 features for any sequence string, whether a whole protein or a shorter window/segment. These features are:
        1.  `hydro_norm_avg`: Average normalized hydrophobicity.
        2.  `flex_norm_avg`: Average normalized flexibility.
        3.  `h_bond_potential_avg`: Average H-bonding potential (sum of donors/acceptors).
        4.  `abs_net_charge_prop`: Absolute proportion of net charge.
        5.  `shannon_entropy`: A measure of sequence complexity.
        6.  `freq_proline`: Frequency of Proline.
        7.  `freq_bulky_hydrophobics`: Combined frequency of W, C, F, Y, I, V, L.
    * **Concept Model M1 (Property Vectors / Tensor Snapshots):** This set of 7 features calculated for a protein (or segment) constitutes its M1 representation – a vector of its key properties.

4.  **Main Feature Computation (`compute_features_for_dataset`):**
    * This function processes a list of raw sequences.
    * It can either calculate the 7 new features globally for each entire protein (if `WINDOW_SIZE_GLOBAL_FEATURES` is `None`) or calculate them for sliding windows and then average the window features to get 7 global values per protein. For the "New Global Features Classifier", it is set to compute direct global features.

5.  **Train/Test Split:**
    * The full dataset (with globally computed new features and labels) is split into a training set (80%) and a testing set (20%). This is crucial for an unbiased evaluation of how well the classifiers generalize to unseen data.
    * Raw sequences corresponding to the test set are kept aside for the sliding window classifier.

6.  **Midpoint Calculation (from Training Data's Global Features):**
    * From the **training set's global features**, the script calculates the average value of each of the 7 new features for PDB proteins and for DisProt proteins.
    * The `midpoints` are then calculated as the halfway point between these PDB and DisProt averages for each feature.
    * **Concept Model M2 (Constraints):** These empirically derived `midpoints` define the thresholds for the classification rules. A condition like `feature_value >= midpoint` (or `<= midpoint`, depending on the feature) acts as a constraint. A protein/segment feature vector is tested against these constraints.

7.  **Defining "Conditions Met" (`count_conditions_for_new_feature_vector`):**
    * This helper function takes a 7-feature vector (for a protein or a segment) and the `midpoints`.
    * It checks, for each of the 7 features, whether it falls on the "PDB-like" side of its respective midpoint (e.g., higher hydrophobicity, lower proline frequency). The direction of comparison (`>=` or `<=`) is determined by observing the means of PDB vs. DisProt proteins in the training data.
    * It returns the total number of conditions (out of 7) that were met.
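
The composition-based features from item 3 can be illustrated with a minimal sketch. This is not the script's `compute_new_seven_features`: the name `composition_features`, the base-2 logarithm for the entropy, dividing the net charge by the sequence length, and skipping non-canonical residues are all assumptions made for illustration. Only the four features that need no per-residue property table are covered.

```python
import math
from collections import Counter

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")
# Same simplified charges as the script's `charge` table (H treated as neutral).
CHARGE = {"R": 1, "K": 1, "D": -1, "E": -1}
BULKY_HYDROPHOBICS = set("WCFYIVL")


def composition_features(seq):
    """Return four composition-based features for a sequence or window.

    Hypothetical helper: non-canonical residues are ignored (assumption).
    """
    residues = [aa for aa in seq if aa in CANONICAL]
    n = len(residues)
    if n == 0:
        # Nothing to measure: return zeros, mirroring the script's zero-vector fallback.
        return {
            "abs_net_charge_prop": 0.0,
            "shannon_entropy": 0.0,
            "freq_proline": 0.0,
            "freq_bulky_hydrophobics": 0.0,
        }

    counts = Counter(residues)
    net_charge = sum(CHARGE.get(aa, 0) * c for aa, c in counts.items())
    # Shannon entropy of the residue distribution, in bits (log base is an assumption).
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    bulky = sum(c for aa, c in counts.items() if aa in BULKY_HYDROPHOBICS)

    return {
        "abs_net_charge_prop": abs(net_charge) / n,
        "shannon_entropy": entropy,
        "freq_proline": counts.get("P", 0) / n,
        "freq_bulky_hydrophobics": bulky / n,
    }


print(composition_features("MKKLPEEPVW"))
```

Because the function accepts any string, the same call works for a whole protein and for a 9-residue window, which is how both classifiers below reuse one feature definition.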

### Classifier 1: Baseline Threshold-Based Classifier (New Global Features)

* **Logic:**
    1.  The 7 new global features are calculated for each protein in the test set.
    2.  For each test protein, `count_conditions_for_new_feature_vector` determines how many of its 7 global features satisfy the midpoint-derived conditions. This result is stored as `conditions_met`.
    3.  The script then evaluates performance by trying different thresholds `k` (from 1 to 7). A protein is predicted as "Folded" if its `conditions_met >= k`.
* **Relation to Concept Model:**
    * **M1:** The 7 new global features of an entire protein.
    * **M2:** The set of 7 conditions derived from `midpoints`.
    * **M3 (Transformation/Rule):** The process of (a) counting how many conditions are met by M1, and (b) comparing this count to a threshold `k`.
    * **M4 (Goal State):** The true labels (Folded/Disordered). The script finds the `k` that yields the best F1-score for PDB proteins, effectively optimizing this simple M3 rule against M4.
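
The steps of Classifier 1 (midpoints from training data, counting satisfied conditions, sweeping `k`) can be sketched on toy data. The feature names `hydro` and `pro`, all numeric values, and the helper names are hypothetical. The sketch also infers each comparison direction automatically from the training means, which is an assumption; the description above only says the direction was chosen by observing those means.

```python
def class_mean(rows, labels, feature, label):
    """Mean of one feature over the rows that carry the given label."""
    values = [r[feature] for r, y in zip(rows, labels) if y == label]
    return sum(values) / len(values)


def fit_midpoints(train_rows, train_labels):
    """Midpoint between PDB (1) and DisProt (0) means, plus the PDB-like direction."""
    midpoints, higher_is_pdb = {}, {}
    for feature in train_rows[0]:
        pdb_mean = class_mean(train_rows, train_labels, feature, 1)
        dis_mean = class_mean(train_rows, train_labels, feature, 0)
        midpoints[feature] = (pdb_mean + dis_mean) / 2
        higher_is_pdb[feature] = pdb_mean >= dis_mean
    return midpoints, higher_is_pdb


def count_conditions(row, midpoints, higher_is_pdb):
    """Number of features lying on the PDB-like side of their midpoint."""
    met = 0
    for feature, mid in midpoints.items():
        if higher_is_pdb[feature]:
            met += row[feature] >= mid
        else:
            met += row[feature] <= mid
    return met


def f1_for_k(counts, labels, k):
    """F1 for the PDB class when predicting 'Folded' if conditions_met >= k."""
    preds = [int(c >= k) for c in counts]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical values, two features instead of seven).
train_rows = [{"hydro": 0.50, "pro": 0.02}, {"hydro": 0.46, "pro": 0.04},
              {"hydro": 0.38, "pro": 0.10}, {"hydro": 0.42, "pro": 0.08}]
train_labels = [1, 1, 0, 0]
test_rows = [{"hydro": 0.47, "pro": 0.03}, {"hydro": 0.45, "pro": 0.09},
             {"hydro": 0.39, "pro": 0.05}, {"hydro": 0.36, "pro": 0.12}]
test_labels = [1, 1, 0, 0]

midpoints, higher_is_pdb = fit_midpoints(train_rows, train_labels)
counts = [count_conditions(r, midpoints, higher_is_pdb) for r in test_rows]
best_k = max(range(1, len(midpoints) + 1),
             key=lambda k: f1_for_k(counts, test_labels, k))
print("conditions met:", counts, "best k:", best_k)
```

The midpoints are fitted on the training rows only, while the count and the `k` sweep run on the held-out rows, matching the order of operations described above.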

### Classifier 2: Sliding Window (Larger - 9 AA) Classifier with Failure Cancellation

* **Logic:** This classifier processes each raw protein sequence in the test set with a more complex, stateful rule:
    1.  **Parameters:**
        * `SLIDING_WINDOW_SIZE = 9` (each local window to analyze).
        * `SLIDING_WINDOW_SLIDE_STEP = 9` (non-overlapping windows).
        * `SLIDING_WINDOW_PASS_K = 4` (a 9-AA window "passes" if its 7 *local* features meet at least 4 conditions, judged by the *globally-derived `midpoints`*).
        * `MAX_UNFORGIVEN_FAILED_WINDOWS_SLIDING = 3` (the protein is "Folded" if it has 3 or fewer uncancelled failed windows).
    2.  **Serial Processing:** It slides a 9-AA window across the sequence.
    3.  **Window Evaluation:** For each 9-AA window, its 7 new features are calculated. `count_conditions_for_new_feature_vector` determines if this window "passes" or "fails" based on `SLIDING_WINDOW_PASS_K` and the global `midpoints`.
    4.  **Failure Cancellation:** A running count of `current_consecutive_failures_streak` is maintained. If a window "passes," this streak is reset to 0 (any failures in that streak are "cancelled"). If a window "fails," the streak count increases.
    5.  **Protein Classification:** After all windows are processed, the `total_unforgiven_failures` is simply the value of `current_consecutive_failures_streak` at the end of the sequence (as any streaks terminated by a pass were reset). The protein is predicted "Folded" if `total_unforgiven_failures <= MAX_UNFORGIVEN_FAILED_WINDOWS_SLIDING`.
* **Relation to Concept Model:**
    * **M1 (local):** The 7 new features calculated for each 9-AA window.
    * **M2 (local):** The conditions a window must meet (based on global `midpoints` and `SLIDING_WINDOW_PASS_K`) to "pass."
    * **M3 (Transformation/Rule):** This is a more complex M3. It involves:
        * The serial processing of windows.
        * The evaluation of each window against local M2.
        * The stateful tracking of `current_consecutive_failures_streak`.
        * The "failure cancellation" logic.
        * The final decision rule based on `total_unforgiven_failures` and `MAX_UNFORGIVEN_FAILED_WINDOWS_SLIDING`.
    * **M4 (Goal State):** The true labels (Folded/Disordered) that this entire M3 rule system is trying to predict.
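
The failure-cancellation rule can be sketched in a few lines. The helper names and the stand-in predicate `toy_window_passes` are hypothetical; in the script, a window passes when it meets at least `SLIDING_WINDOW_PASS_K` midpoint conditions. Dropping a trailing partial window is also an assumption, since the text above does not say how a remainder shorter than 9 residues is handled.

```python
def iter_windows(sequence, size=9, step=9):
    """Yield full-length windows; a trailing partial window is dropped (assumption)."""
    for start in range(0, len(sequence) - size + 1, step):
        yield sequence[start:start + size]


def trailing_failure_streak(window_passed):
    """Length of the failure streak still open at the end of the sequence.

    Any pass resets the streak to 0, which 'cancels' the failures before it.
    """
    streak = 0
    for passed in window_passed:
        streak = 0 if passed else streak + 1
    return streak


def classify_protein(sequence, window_passes, size=9, step=9, max_unforgiven=3):
    """Return 1 (Folded) if the uncancelled failures do not exceed the limit."""
    flags = [window_passes(w) for w in iter_windows(sequence, size, step)]
    return 1 if trailing_failure_streak(flags) <= max_unforgiven else 0


def toy_window_passes(window):
    """Hypothetical stand-in for the real 'meets >= 4 of 7 conditions' test."""
    return "P" not in window


# Four failing windows followed by one passing window: the pass cancels them.
print(classify_protein("P" * 36 + "A" * 9, toy_window_passes))
# One passing window followed by four failing windows: the streak stays open.
print(classify_protein("A" * 9 + "P" * 36, toy_window_passes))
```

The two calls use the same windows in a different order and receive different labels, which shows that under this rule only the failure streak at the end of the sequence counts.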

### Summary
The script first establishes a baseline performance using a threshold classifier on 7 new global features. It then tests a more intricate, serial window-based classifier with a failure cancellation mechanism, using the same underlying feature definitions (calculated locally) and the same globally-derived midpoints for local window evaluation. The results then show how these different approaches (different M1 aggregations and different M3 rules) perform at predicting the M4 goal state.

In [1]:
# 1.) Download the PDB chain sequences (FASTA format from RCSB) via the HTTPS mirror,
#      then keep only the first 15 000 entries.

import requests

# Use the “files.wwpdb.org” HTTPS mirror instead of FTP
pdb_url = "https://files.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt"
try:
    resp = requests.get(pdb_url, timeout=60)
    resp.raise_for_status()
    text = resp.text.strip()
    if not text.startswith(">"):
        raise RuntimeError("Downloaded content does not look like FASTA.")
except Exception as e:
    # Chain the original exception so the root cause stays in the traceback
    raise RuntimeError(f"Failed to download PDB chain sequences: {e}") from e

# Write the complete dump to disk; this file is overwritten with the 15 000-chain subset below
with open("pdb_chains.fasta", "w", encoding="utf-8") as f:
    f.write(text + "\n")

# ─── Split the full FASTA into individual (header, sequence) tuples ─────────────
def split_fasta(filepath):
    sequences = []
    with open(filepath, "r") as f:
        header = None
        seq_lines = []
        for line in f:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    sequences.append((header, "".join(seq_lines)))
                header = line
                seq_lines = []
            else:
                seq_lines.append(line)
        # Add the final sequence
        if header is not None:
            sequences.append((header, "".join(seq_lines)))
    return sequences

all_chains = split_fasta("pdb_chains.fasta")

# ─── Keep exactly the first 15 000 chains ─────────────────────────────────────────
subset = all_chains[:15000]

# ─── Write those 15 000 chains back to “pdb_chains.fasta” ───────────────────────
with open("pdb_chains.fasta", "w", encoding="utf-8") as f:
    for header, seq in subset:
        f.write(f"{header}\n")
        f.write(f"{seq}\n")

print(f"✔ Extracted {len(subset)} chains → 'pdb_chains.fasta'")


✔ Extracted 15000 chains → 'pdb_chains.fasta'


In [2]:
# 2.) Use DisProt’s search endpoint with format=fasta
import requests

url = "https://disprot.org/api/search?format=fasta&limit=10000"
try:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
except Exception as e:
    raise RuntimeError(f"Failed to GET DisProt FASTA via API: {e}")

text = resp.text.strip()

# 2.2) Quick sanity check: FASTA must start with '>', not '<'
if not text.startswith(">"):
    raise RuntimeError(
        "Downloaded content does not look like FASTA. "
        "If it begins with '<', you're still hitting an HTML page instead of raw FASTA."
    )

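# 2.3) Write the returned DisProt entries to 'disprot_13000.fasta'.
#      The API caps a single request at 100 entries regardless of `limit`;
#      note that the message printed below still names 'disprot_1000.fasta'.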
with open("disprot_13000.fasta", "w") as f:
    f.write(text + "\n")

print("✔ Successfully fetched 100 DisProt sequences in FASTA format → 'disprot_1000.fasta'")


✔ Successfully fetched 100 DisProt sequences in FASTA format → 'disprot_1000.fasta'


In [3]:
# 2.1) Collect more data

import requests
import time

# ─── PARAMETERS ─────────────────────────────────────────────────────────────
TOTAL_DESIRED = 25_000   # how many DisProt sequences we want total
PER_PAGE      = 100      # DisProt’s hard cap per request
OUTPUT_FILE   = "disprot_13000.fasta"

accum_seqs = []
offset     = 0

while len(accum_seqs) < TOTAL_DESIRED:
    url = f"https://disprot.org/api/search?format=fasta&limit={PER_PAGE}&offset={offset}"
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except Exception as e:
        raise RuntimeError(f"Failed to GET DisProt FASTA (offset={offset}): {e}")

    block = resp.text.strip()
    if not block.startswith(">"):
        raise RuntimeError(
            "Downloaded content does not look like FASTA. "
            "If it begins with '<', you're still hitting an HTML page."
        )

    # Parse out this page’s FASTA sequences (collecting only the raw sequences, not full headers):
    raw_lines = block.splitlines()
    header = None
    seq_buf = ""
    this_page_seqs = []
    for line in raw_lines:
        if line.startswith(">"):
            if header is not None and seq_buf:
                this_page_seqs.append(seq_buf)
            header = line
            seq_buf = ""
        else:
            seq_buf += line.strip()
    if header is not None and seq_buf:
        this_page_seqs.append(seq_buf)

    if not this_page_seqs:
        # No more sequences returned → break out early
        break

    accum_seqs.extend(this_page_seqs)
    offset += PER_PAGE

    # Sleep briefly (so we don’t hammer the server)
    time.sleep(0.4)

# Trim in case we overshot
accum_seqs = accum_seqs[:TOTAL_DESIRED]

# Write out ~25k sequences in FASTA format (with minimal headers)
with open(OUTPUT_FILE, "w") as f:
    for i, seq in enumerate(accum_seqs):
        f.write(f">disprot_sequence_{i+1}\n")
        f.write(seq + "\n")

print(f"✔ Fetched {len(accum_seqs)} DisProt sequences → '{OUTPUT_FILE}'")


✔ Fetched 25000 DisProt sequences → 'disprot_13000.fasta'


In [4]:
# 2.2) Verify Downloaded Sequences
with open("disprot_13000.fasta") as f:
    for _ in range(5):
        print(f.readline().rstrip())


>disprot_sequence_1
EHVIEMDVTSENGQRALKEQSSKAKIVKNRWGRNVVQISNT
>disprot_sequence_2
VYRNSRAQGGG
>disprot_sequence_3


# Constraint Based - (Concept Model)

In [19]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# ─── (A) Build aa_properties ───────────────────────────────────────────────────
# Per-residue property tables for the 20 canonical amino acids
kd_hydro = {
    'A':  1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C':  2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I':  4.5,
    'L':  3.8, 'K': -3.9, 'M':  1.9, 'F':  2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V':  4.2
}
charge = {
    'A':  0, 'R':  1, 'N':  0, 'D': -1, 'C':  0,
    'Q':  0, 'E': -1, 'G':  0, 'H':  0, 'I':  0,
    'L':  0, 'K':  1, 'M':  0, 'F':  0, 'P':  0,
    'S':  0, 'T':  0, 'W':  0, 'Y':  0, 'V':  0
}
h_donors = {'A':0,'R':2,'N':2,'D':0,'C':0,'Q':2,'E':0,'G':0,'H':1,'I':0,
            'L':0,'K':1,'M':0,'F':0,'P':0,'S':1,'T':1,'W':1,'Y':1,'V':0}
h_acceptors = {'A':0,'R':0,'N':2,'D':2,'C':1,'Q':2,'E':2,'G':0,'H':1,'I':0,
               'L':0,'K':0,'M':0,'F':0,'P':0,'S':1,'T':1,'W':0,'Y':1,'V':0}
flexibility = {
    'A': 0.357, 'R': 0.529, 'N': 0.463, 'D': 0.511, 'C': 0.346,
    'Q': 0.493, 'E': 0.497, 'G': 0.544, 'H': 0.323, 'I': 0.462,
    'L': 0.365, 'K': 0.466, 'M': 0.295, 'F': 0.314, 'P': 0.509,
    'S': 0.507, 'T': 0.444, 'W': 0.305, 'Y': 0.420, 'V': 0.386
}
sidechain_volume = {
    'A':  88.6, 'R': 173.4, 'N': 114.1, 'D': 111.1, 'C': 108.5,
    'Q': 143.8, 'E': 138.4, 'G':  60.1, 'H': 153.2, 'I': 166.7,
    'L': 166.7, 'K': 168.6, 'M': 162.9, 'F': 189.9, 'P': 112.7,
    'S':  89.0, 'T': 116.1, 'W': 227.8, 'Y': 193.6, 'V': 140.0
}
polarity = {
    'A':  8.1, 'R': 10.5, 'N': 11.6, 'D': 13.0, 'C':  5.5,
    'Q': 10.5, 'E': 12.3, 'G':  9.0, 'H': 10.4, 'I':  5.2,
    'L':  4.9, 'K': 11.3, 'M':  5.7, 'F':  5.2, 'P':  8.0,
    'S':  9.2, 'T':  8.6, 'W':  5.4, 'Y':  6.2, 'V':  5.9
}
choufa_helix = {
    'A': 1.45, 'R': 0.79, 'N': 0.73, 'D': 1.01, 'C': 0.77,
    'Q': 1.17, 'E': 1.51, 'G': 0.53, 'H': 1.00, 'I': 1.08,
    'L': 1.34, 'K': 1.07, 'M': 1.20, 'F': 1.12, 'P': 0.59,
    'S': 0.79, 'T': 0.82, 'W': 1.14, 'Y': 0.61, 'V': 1.06
}
choufa_sheet = {
    'A': 0.97, 'R': 0.90, 'N': 0.65, 'D': 0.54, 'C': 1.30,
    'Q': 1.23, 'E': 0.37, 'G': 0.75, 'H': 0.87, 'I': 1.60,
    'L': 1.22, 'K': 0.74, 'M': 1.67, 'F': 1.28, 'P': 0.62,
    'S': 0.72, 'T': 1.20, 'W': 1.19, 'Y': 1.29, 'V': 1.70
}
rel_ASA = {
    'A': 0.74, 'R': 1.48, 'N': 1.14, 'D': 1.23, 'C': 0.86,
    'Q': 1.36, 'E': 1.26, 'G': 1.00, 'H': 0.91, 'I': 0.59,
    'L': 0.61, 'K': 1.29, 'M': 0.64, 'F': 0.65, 'P': 0.71,
    'S': 1.42, 'T': 1.20, 'W': 0.55, 'Y': 0.63, 'V': 0.54
}
beta_branched = {aa: (1 if aa in ('V','I','T') else 0) for aa in kd_hydro.keys()}

aa_properties = {}
canonical_set = set(kd_hydro.keys())
for aa in canonical_set:
    hydro_norm  = (kd_hydro[aa] + 4.5) / 9.0
    volume_norm = sidechain_volume[aa] / 227.8
    pol_norm    = (polarity[aa] - 4.9) / (13.0 - 4.9)
    helix_norm  = choufa_helix[aa] / 1.51
    sheet_norm  = choufa_sheet[aa] / 1.70
    asa_norm    = (rel_ASA[aa] - 0.54) / (1.48 - 0.54)
    aromatic    = 1 if aa in ('F','Y','W') else 0

    # 12 properties per residue, stored in a fixed order (index noted on each line)
    aa_properties[aa] = [
        hydro_norm,          # [0]
        charge[aa],          # [1]
        h_donors[aa],        # [2]
        h_acceptors[aa],     # [3]
        flexibility[aa],     # [4]
        volume_norm,         # [5]
        pol_norm,            # [6]
        aromatic,            # [7]
        helix_norm,          # [8]
        sheet_norm,          # [9]
        asa_norm,            # [10]
        beta_branched[aa]    # [11]
    ]

# ─── (B) Load FASTA sequences ─────────────────────────────────────────────────
def load_fasta(filepath, filter_non_canonical=False):
    seqs = []
    try:
        with open(filepath) as f:
            header = None
            seq_content = ""
            for line in f:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None and seq_content:
                        if (not filter_non_canonical) or (set(seq_content) <= canonical_set):
                            seqs.append(seq_content)
                    header = line
                    seq_content = ""
                else:
                    seq_content += line
            if header is not None and seq_content: # Add the last sequence
                if (not filter_non_canonical) or (set(seq_content) <= canonical_set):
                    seqs.append(seq_content)
    except FileNotFoundError:
        print(f"Warning: File not found {filepath}. Returning empty list.")
    return seqs

pdb_seqs    = load_fasta("pdb_chains.fasta",   filter_non_canonical=False)
disprot_seqs = load_fasta("disprot_13000.fasta", filter_non_canonical=False)

print(f"Loaded {len(pdb_seqs)} PDB sequences.")
print(f"Loaded {len(disprot_seqs)} DisProt sequences.")

if not pdb_seqs and not disprot_seqs:
    print("Error: No sequences loaded. Exiting.")
    exit()


# ─── (C) Compute each chain’s 7 global features ────────────────────────────────
# Each feature is the per-residue mean of a value derived from aa_properties.
def compute_global_features(sequence_str):
    props = []
    valid_aas_in_sequence = 0
    for aa in sequence_str:
        if aa in aa_properties: # Checks if 'aa' is a key in our properties dictionary
            v = aa_properties[aa] # v is the list of 12 properties for this amino acid
            # Ensure 'v' has enough elements before indexing, though with the fix it should always have 12
            if len(v) == 12:
                props.append([
                    v[0],               # hydrophobicity_norm
                    v[1],               # charge
                    v[2] + v[3],        # h_dh_a (h_donors[aa] + h_acceptors[aa])
                    v[4] / 0.544,       # norm_flex (flexibility[aa] / 0.544)
                    v[6],               # pol_norm (pol_norm from calculated values)
                    v[7] + v[8],        # arom_plus_helix (aromatic + helix_norm)
                    v[10]               # asa_norm (asa_norm from calculated values)
                ])
                valid_aas_in_sequence +=1
            else:
                # This case should ideally not be reached if aa_properties is built correctly
                print(f"Warning: Amino acid {aa} has an unexpected number of properties: {len(v)}. Skipping.")
    
    if not props or valid_aas_in_sequence == 0:
        return np.zeros(7) # Return a vector of zeros if no properties could be computed
    return np.mean(np.vstack(props), axis=0)

all_features_list = []
all_labels_list   = []

for s in pdb_seqs:
    if s: 
        all_features_list.append(compute_global_features(s))
        all_labels_list.append(1)

for s in disprot_seqs:
    if s:
        all_features_list.append(compute_global_features(s))
        all_labels_list.append(0)

if not all_features_list:
    print("Error: No features could be computed. Exiting.")
    exit()

df_all_data = pd.DataFrame(
    all_features_list,
    columns=[
        "hydro_norm", "charge", "h_dh_a", "norm_flex",
        "pol_norm", "arom_plus_helix", "asa_norm"
    ])
df_all_data["label"] = all_labels_list

# --- Train/test split, done BEFORE computing midpoints so test data cannot leak into them ---
if df_all_data.empty or df_all_data['label'].nunique() < 2:
    print("Error: Not enough data or classes to perform a meaningful split and train. Exiting.")
    exit()

min_class_count = df_all_data['label'].value_counts().min()
if min_class_count < 2 : 
    print(f"Warning: The smallest class has only {min_class_count} member(s). Stratification might fail or be unreliable.")
    if min_class_count < 1: 
        print("Error: Smallest class has 0 members. Cannot proceed.")
        exit()

try:
    X_train, X_test, y_train, y_test = train_test_split(
        df_all_data.drop(columns=["label"]),
        df_all_data["label"],
        test_size=0.2, 
        random_state=42, 
        stratify=df_all_data["label"] 
    )
except ValueError as e:
    print(f"Error during train_test_split (likely due to stratification issues with small class sizes): {e}")
    print("Consider using a non-stratified split if appropriate, or ensure more samples per class.")
    exit()


# Create DataFrames for training and testing sets
df_train = X_train.copy()
df_train["label"] = y_train

df_test = X_test.copy()
df_test["label"] = y_test

print(f"\nTraining set size: {len(df_train)}")
print(f"Testing set size: {len(df_test)}")
print(f"Training set PDB (1) count: {y_train.sum()}, DisProt (0) count: {len(y_train) - y_train.sum()}")
print(f"Testing set PDB (1) count: {y_test.sum()}, DisProt (0) count: {len(y_test) - y_test.sum()}")


# ─── (D) Compute midpoint thresholds USING ONLY TRAINING DATA ───────────────────
if df_train["label"].nunique() < 2:
    print("\nError: Training set does not contain both classes. Cannot compute midpoints robustly.")
    midpoints = {col: 0.5 for col in X_train.columns} 
    print("Warning: Using default midpoints (0.5).")
else:
    train_means = df_train.groupby("label").mean().rename(index={0:"DisProt", 1:"PDB"})
    if "PDB" not in train_means.index or "DisProt" not in train_means.index:
        print("\nError: Could not find means for both PDB and DisProt in the training set.")
        midpoints = {col: 0.5 for col in X_train.columns} 
        print("Warning: Using default midpoints (0.5).")
    else:
        midpoints = {col: (train_means.loc["PDB", col] + train_means.loc["DisProt", col]) / 2
                     for col in X_train.columns}

        print("\nGlobal Feature Means (DisProt vs. PDB) from TRAINING DATA:\n")
        print(train_means, "\n")
        print("Chosen Midpoint Thresholds (from TRAINING DATA):\n")
        for feat, t in midpoints.items():
            print(f"  {feat:18s} = {t:.3f}")
        print()

# ─── (E) Count how many of the 7 conditions each chain IN THE TEST SET satisfies ───
def count_conditions_on_test_set(row, midpoints_dict):
    c1 = row["hydro_norm"]          >= midpoints_dict.get("hydro_norm", 0) 
    c2 = abs(row["charge"])         <= abs(midpoints_dict.get("charge", 0))
    c3 = row["h_dh_a"]              <= midpoints_dict.get("h_dh_a", float('inf'))
    c4 = row["norm_flex"]           <= midpoints_dict.get("norm_flex", float('inf'))
    c5 = row["pol_norm"]            <= midpoints_dict.get("pol_norm", float('inf'))
    c6 = row["arom_plus_helix"]     >= midpoints_dict.get("arom_plus_helix", 0)
    c7 = row["asa_norm"]            <= midpoints_dict.get("asa_norm", float('inf'))
    return sum([c1, c2, c3, c4, c5, c6, c7])

if df_test.empty:
    print("Test set is empty. Skipping evaluation.")
else:
    df_test["conditions_met"] = df_test.apply(lambda r: count_conditions_on_test_set(r, midpoints), axis=1)

    dist_test = df_test.groupby("label")["conditions_met"] \
                  .value_counts() \
                  .unstack(fill_value=0) \
                  .rename(index={0:"DisProt", 1:"PDB"})
    pd.set_option("display.max_columns", None)
    print("Distribution of ‘conditions_met’ by Label (ON TEST SET):\n")
    print(dist_test, "\n")

    # ─── (F) For each k=1…7, classify “folded if conditions_met ≥ k” ON TEST SET ─────────────
    results = []
    true_test_labels = df_test["label"]
    for k_threshold in range(1, 8):
        preds_test = (df_test["conditions_met"] >= k_threshold).astype(int)
        
        tp = ((preds_test == 1) & (true_test_labels == 1)).sum()
        fn = ((preds_test == 0) & (true_test_labels == 1)).sum()
        tn = ((preds_test == 0) & (true_test_labels == 0)).sum()
        fp = ((preds_test == 1) & (true_test_labels == 0)).sum()
        
        acc = (tp + tn) / len(df_test) if len(df_test) > 0 else 0
        
        precision_pdb = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall_pdb = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1_pdb = 2 * (precision_pdb * recall_pdb) / (precision_pdb + recall_pdb) if (precision_pdb + recall_pdb) > 0 else 0
        
        results.append({
            "k (min # of features)": k_threshold,
            "TP": tp, "FN": fn, "TN": tn, "FP": fp,
            "Accuracy": f"{acc:.2%}",
            "Precision (PDB)": f"{precision_pdb:.2%}",
            "Recall (PDB)": f"{recall_pdb:.2%}",
            "F1-score (PDB)": f"{f1_pdb:.2%}"
        })

    df_results = pd.DataFrame(results)
    pd.set_option("display.max_rows", None)
    print("Performance on TEST SET as we vary k = minimum # of satisfied conditions:\n")
    print(df_results.to_string(index=False))

    if not df_results.empty:
        try:
            df_results['F1_PDB_float'] = df_results['F1-score (PDB)'].str.rstrip('%').astype('float') / 100.0
            best_k_row = df_results.loc[df_results['F1_PDB_float'].idxmax()]
            best_k = int(best_k_row["k (min # of features)"])
            print(f"\n--- Detailed Classification Report for best k = {best_k} (based on F1 PDB) ---")
            
            best_preds_test = (df_test["conditions_met"] >= best_k).astype(int)
            # Use zero_division=0 or 1 in classification_report to handle UndefinedMetricWarning
            print(classification_report(true_test_labels, best_preds_test, target_names=["DisProt (0)", "PDB (1)"], zero_division=0))

            cm = confusion_matrix(true_test_labels, best_preds_test)
            cm_df = pd.DataFrame(cm, index=["Actual DisProt","Actual PDB"], columns=["Pred DisProt","Pred PDB"])
            print("Confusion Matrix for best k:\n", cm_df)
        except Exception as e:
            print(f"Could not generate detailed report for best k: {e}")


Loaded 15000 PDB sequences.
Loaded 25000 DisProt sequences.

Training set size: 32000
Testing set size: 8000
Training set PDB (1) count: 12000, DisProt (0) count: 20000
Testing set PDB (1) count: 3000, DisProt (0) count: 5000

Global Feature Means (DisProt vs. PDB) from TRAINING DATA:

         hydro_norm    charge    h_dh_a  norm_flex  pol_norm  arom_plus_helix  \
label                                                                           
DisProt    0.401099 -0.022805  1.256680   0.837249  0.512060         0.718566   
PDB        0.475068 -0.008898  1.073287   0.806553  0.437201         0.735020   

         asa_norm  
label              
DisProt  0.519782  
PDB      0.445255   

Chosen Midpoint Thresholds (from TRAINING DATA):

  hydro_norm         = 0.438
  charge             = -0.016
  h_dh_a             = 1.165
  norm_flex          = 0.822
  pol_norm           = 0.475
  arom_plus_helix    = 0.727
  asa_norm           = 0.483

Distribution of ‘conditions_met’ by Label (ON TEST 

# 5.) Sliding Window vs Global Learned Constraints

In [20]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import math

# --- Parameters for Classifiers ---
# For Global Feature Classifier (WINDOW_SIZE_GLOBAL_FEATURES = None means direct global calculation)
WINDOW_SIZE_GLOBAL_FEATURES = None 

# For Sliding Window Classifier
SLIDING_WINDOW_SIZE = 9  # Size of the sliding window
SLIDING_WINDOW_SLIDE_STEP = 9 # Step for sliding (equal to WINDOW_SIZE for non-overlapping)
SLIDING_WINDOW_PASS_K = 4     # Min conditions a window must meet to "pass"
MAX_UNFORGIVEN_FAILED_WINDOWS_SLIDING = 3 # Max uncancelled failed windows for protein to pass

# ─── (A) Build aa_properties (underlying single AA properties) ────────────────
kd_hydro = {
    'A':  1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C':  2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I':  4.5,
    'L':  3.8, 'K': -3.9, 'M':  1.9, 'F':  2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V':  4.2
}
charge = { # Simplified charge for H, assuming neutral pH for general calculation
    'A':  0, 'R':  1, 'N':  0, 'D': -1, 'C':  0,
    'Q':  0, 'E': -1, 'G':  0, 'H':  0, 'I':  0, 
    'L':  0, 'K':  1, 'M':  0, 'F':  0, 'P':  0,
    'S':  0, 'T':  0, 'W':  0, 'Y':  0, 'V':  0
}
h_donors = {'A':0,'R':2,'N':2,'D':0,'C':0,'Q':2,'E':0,'G':0,'H':1,'I':0,
            'L':0,'K':1,'M':0,'F':0,'P':0,'S':1,'T':1,'W':1,'Y':1,'V':0}
h_acceptors = {'A':0,'R':0,'N':2,'D':2,'C':1,'Q':2,'E':2,'G':0,'H':1,'I':0,
               'L':0,'K':0,'M':0,'F':0,'P':0,'S':1,'T':1,'W':0,'Y':1,'V':0}
flexibility = {
    'A': 0.357, 'R': 0.529, 'N': 0.463, 'D': 0.511, 'C': 0.346,
    'Q': 0.493, 'E': 0.497, 'G': 0.544, 'H': 0.323, 'I': 0.462,
    'L': 0.365, 'K': 0.466, 'M': 0.295, 'F': 0.314, 'P': 0.509,
    'S': 0.507, 'T': 0.444, 'W': 0.305, 'Y': 0.420, 'V': 0.386
}
sidechain_volume = {
    'A':  88.6, 'R': 173.4, 'N': 114.1, 'D': 111.1, 'C': 108.5, 'Q': 143.8, 
    'E': 138.4, 'G':  60.1, 'H': 153.2, 'I': 166.7, 'L': 166.7, 'K': 168.6, 
    'M': 162.9, 'F': 189.9, 'P': 112.7, 'S':  89.0, 'T': 116.1, 'W': 227.8, 
    'Y': 193.6, 'V': 140.0
}
polarity = {
    'A':  8.1, 'R': 10.5, 'N': 11.6, 'D': 13.0, 'C':  5.5, 'Q': 10.5, 
    'E': 12.3, 'G':  9.0, 'H': 10.4, 'I':  5.2, 'L':  4.9, 'K': 11.3, 
    'M':  5.7, 'F':  5.2, 'P':  8.0, 'S':  9.2, 'T':  8.6, 'W':  5.4, 
    'Y':  6.2, 'V':  5.9
}
choufa_helix = {
    'A': 1.45, 'R': 0.79, 'N': 0.73, 'D': 1.01, 'C': 0.77, 'Q': 1.17, 
    'E': 1.51, 'G': 0.53, 'H': 1.00, 'I': 1.08, 'L': 1.34, 'K': 1.07, 
    'M': 1.20, 'F': 1.12, 'P': 0.59, 'S': 0.79, 'T': 0.82, 'W': 1.14, 
    'Y': 0.61, 'V': 1.06
}
choufa_sheet = {
    'A': 0.97, 'R': 0.90, 'N': 0.65, 'D': 0.54, 'C': 1.30, 'Q': 1.23, 
    'E': 0.37, 'G': 0.75, 'H': 0.87, 'I': 1.60, 'L': 1.22, 'K': 0.74, 
    'M': 1.67, 'F': 1.28, 'P': 0.62, 'S': 0.72, 'T': 1.20, 'W': 1.19, 
    'Y': 1.29, 'V': 1.70
}
rel_ASA = {
    'A': 0.74, 'R': 1.48, 'N': 1.14, 'D': 1.23, 'C': 0.86, 'Q': 1.36, 
    'E': 1.26, 'G': 1.00, 'H': 0.91, 'I': 0.59, 'L': 0.61, 'K': 1.29, 
    'M': 0.64, 'F': 0.65, 'P': 0.71, 'S': 1.42, 'T': 1.20, 'W': 0.55, 
    'Y': 0.63, 'V': 0.54
}
beta_branched = {aa: (1 if aa in ('V','I','T') else 0) for aa in kd_hydro.keys()}

aa_properties_base = {} 
canonical_set = set(kd_hydro.keys())
for aa in canonical_set:
    aa_properties_base[aa] = {
        'hydro_norm': (kd_hydro[aa] + 4.5) / 9.0,
        'charge_val': charge[aa], 
        'h_donors': h_donors[aa],
        'h_acceptors': h_acceptors[aa],
        'flexibility': flexibility[aa],
        'volume_norm': sidechain_volume[aa] / 227.8,
        'pol_norm': (polarity[aa] - 4.9) / (13.0 - 4.9),
        'is_aromatic': 1 if aa in ('F','Y','W') else 0,
        'helix_prop': choufa_helix[aa] / 1.51,
        'sheet_prop': choufa_sheet[aa] / 1.70,
        'asa_norm': (rel_ASA[aa] - 0.54) / (1.48 - 0.54),
        'is_beta_branched': beta_branched[aa]
    }

# ─── (B) Load FASTA sequences & Store Raw Sequences with Labels ────────────────
def load_fasta_with_labels(filepath, label, filter_non_canonical=False):
    """Parse a FASTA file into a list of {'sequence', 'label', 'header'} dicts."""
    sequences_with_labels = []

    def add_record(header, seq_content):
        # Skip header-less or empty records; optionally drop non-canonical sequences.
        if header is None or not seq_content:
            return
        if filter_non_canonical and not set(seq_content) <= canonical_set:
            return
        sequences_with_labels.append({'sequence': seq_content, 'label': label, 'header': header})

    try:
        with open(filepath) as f:
            header, seq_content = None, ""
            for line in f:
                line = line.strip()
                if line.startswith(">"):
                    add_record(header, seq_content)
                    header, seq_content = line, ""
                else:
                    seq_content += line
            add_record(header, seq_content)  # Flush the last record
    except FileNotFoundError:
        print(f"Warning: File not found {filepath}.")
    return sequences_with_labels

all_sequences_data = []
all_sequences_data.extend(load_fasta_with_labels("pdb_chains.fasta", 1))
all_sequences_data.extend(load_fasta_with_labels("disprot_13000.fasta", 0))

print(f"Loaded {len([item for item in all_sequences_data if item['label'] == 1])} PDB sequences.")
print(f"Loaded {len([item for item in all_sequences_data if item['label'] == 0])} DisProt sequences.")

if not all_sequences_data:
    print("Error: No sequences loaded. Exiting."); exit()

# ─── (C) NEW Feature Computation Functions ───────────────────────────────────
def get_aa_composition(sequence_str):
    composition = {aa: 0 for aa in canonical_set}
    valid_len = 0
    for aa in sequence_str:
        if aa in canonical_set:
            composition[aa] += 1
            valid_len += 1
    if valid_len == 0: return {aa: 0.0 for aa in canonical_set}, 0
    for aa in composition: composition[aa] /= valid_len
    return composition, valid_len

def calculate_shannon_entropy(aa_composition):
    entropy = 0.0
    for aa_freq in aa_composition.values(): # Iterate over frequencies directly
        if aa_freq > 0:
            entropy -= aa_freq * math.log2(aa_freq)
    return entropy

def compute_new_seven_features(sequence_str):
    """Return the 7-feature vector for a sequence or window (order matches new_feature_names)."""
    if not sequence_str: return np.zeros(7)
    composition, valid_seq_len = get_aa_composition(sequence_str)
    if valid_seq_len == 0: return np.zeros(7)

    hydro_norm_sum, flex_norm_sum, h_bond_potential_sum = 0, 0, 0
    for aa in sequence_str:
        if aa in aa_properties_base:
            props = aa_properties_base[aa]
            hydro_norm_sum += props['hydro_norm']
            # 0.544 is the Gly entry in `flexibility`; dividing by it expresses flexibility relative to Gly
            flex_norm_sum += props['flexibility'] / 0.544
            h_bond_potential_sum += props['h_donors'] + props['h_acceptors']

    # Signed net charge per residue: (R + K) - (D + E). His is not counted; abs() is applied in the return below.
    net_charge_prop = (composition.get('R',0) + composition.get('K',0)) - \
                      (composition.get('D',0) + composition.get('E',0))
    bulky_hydrophobics_list = ['W', 'C', 'F', 'Y', 'I', 'V', 'L']
    
    return np.array([
        hydro_norm_sum / valid_seq_len,
        flex_norm_sum / valid_seq_len,
        h_bond_potential_sum / valid_seq_len,
        abs(net_charge_prop),
        calculate_shannon_entropy(composition),
        composition.get('P', 0),
        sum(composition.get(aa, 0) for aa in bulky_hydrophobics_list)
    ])

def compute_features_for_dataset(sequence_list, window_size_param=None):
    """
    Computes the new 7 features for a list of sequence strings.
    If window_size_param is None, computes global features.
    If window_size_param is an int, computes features for each window and averages them.
    """
    all_feature_vectors = []
    for seq_str in sequence_list:
        if not seq_str: 
            all_feature_vectors.append(np.zeros(7))
            continue
        
        canonical_sequence = "".join([aa for aa in seq_str if aa in canonical_set])
        if not canonical_sequence: 
            all_feature_vectors.append(np.zeros(7))
            continue

        if window_size_param is None or len(canonical_sequence) < window_size_param:
            all_feature_vectors.append(compute_new_seven_features(canonical_sequence))
        else:
            window_derived_feature_sets = [] 
            for i in range(len(canonical_sequence) - window_size_param + 1):
                window_segment_str = canonical_sequence[i : i + window_size_param]
                window_features = compute_new_seven_features(window_segment_str)
                window_derived_feature_sets.append(window_features)
            if not window_derived_feature_sets:
                all_feature_vectors.append(compute_new_seven_features(canonical_sequence)) # Fallback
            else:
                all_feature_vectors.append(np.mean(np.vstack(window_derived_feature_sets), axis=0))
    return all_feature_vectors

# --- Prepare data for Global New Features Classifier ---
new_feature_names = [
    "hydro_norm_avg", "flex_norm_avg", "h_bond_potential_avg",
    "abs_net_charge_prop", "shannon_entropy", "freq_proline", "freq_bulky_hydrophobics"
]
print(f"\nComputing NEW GLOBAL features (WINDOW_SIZE_GLOBAL_FEATURES = {WINDOW_SIZE_GLOBAL_FEATURES})...")
raw_sequences_list = [item['sequence'] for item in all_sequences_data]
labels_list = [item['label'] for item in all_sequences_data]

global_features_calculated = compute_features_for_dataset(raw_sequences_list, window_size_param=WINDOW_SIZE_GLOBAL_FEATURES)

df_global_features = pd.DataFrame(global_features_calculated, columns=new_feature_names)
df_global_features["label"] = labels_list
print("NEW GLOBAL feature computation complete.")

if df_global_features.empty or df_global_features['label'].nunique() < 2:
    print("Error: Not enough data for global features. Exiting."); exit()
    
X_global_train, X_global_test, y_global_train, y_global_test, train_indices_global, test_indices_global = train_test_split(
    df_global_features.drop(columns=["label"]),
    df_global_features["label"],
    np.arange(len(raw_sequences_list)), 
    test_size=0.2, random_state=42, stratify=df_global_features["label"] 
)

df_train_global_features = X_global_train.copy()
df_train_global_features["label"] = y_global_train

# Raw sequences for the test set (will be used by the sliding window classifier)
test_raw_sequences_for_sliding_window = [raw_sequences_list[i] for i in test_indices_global]
y_test_for_sliding_window = y_global_test # True labels for the test set

print(f"\nGlobal Features Training set size: {len(df_train_global_features)}")
print(f"Global Features Testing set size: {len(X_global_test)}")

# --- Calculate Midpoints from Global New Features Training Data ---
if df_train_global_features["label"].nunique() < 2:
    print("\nError: Global features training set lacks class diversity for midpoints."); exit()
else:
    train_means_global = df_train_global_features.groupby("label").mean().rename(index={0:"DisProt", 1:"PDB"})
    if "PDB" not in train_means_global.index or "DisProt" not in train_means_global.index:
        print("\nError: Could not find means for both PDB and DisProt in global training data."); exit()
    else:
        midpoints_global_new_features = {col: (train_means_global.loc["PDB", col] + train_means_global.loc["DisProt", col]) / 2
                                         for col in X_global_train.columns}
        print("\nGlobal Feature Means (DisProt vs. PDB) from NEW GLOBAL FEATURES TRAINING DATA:\n")
        print(train_means_global, "\n")
        print("Chosen Midpoint Thresholds (from NEW GLOBAL FEATURES TRAINING DATA):\n")
        for feat, t in midpoints_global_new_features.items(): print(f"  {feat:18s} = {t:.3f}")
        print()

# --- Helper to count conditions met for the NEW 7 features ---
def count_conditions_for_new_feature_vector(new_feature_vector_values, midpoints_dict, train_means_for_direction):
    row = pd.Series(new_feature_vector_values, index=new_feature_names)
    conditions_met_count = 0
    
    # hydro_norm_avg: PDB typically higher
    if row["hydro_norm_avg"] >= midpoints_dict.get("hydro_norm_avg", 0.0): conditions_met_count +=1
    # flex_norm_avg: PDB typically lower
    if row["flex_norm_avg"] <= midpoints_dict.get("flex_norm_avg", float('inf')): conditions_met_count +=1
    # h_bond_potential_avg: PDB typically lower
    if row["h_bond_potential_avg"] <= midpoints_dict.get("h_bond_potential_avg", float('inf')): conditions_met_count +=1
    # abs_net_charge_prop: PDB typically lower
    if row["abs_net_charge_prop"] <= midpoints_dict.get("abs_net_charge_prop", float('inf')): conditions_met_count +=1
    # shannon_entropy: PDB typically higher (inspect means to confirm this assumption)
    if train_means_for_direction.loc["PDB", "shannon_entropy"] > train_means_for_direction.loc["DisProt", "shannon_entropy"]:
        if row["shannon_entropy"] >= midpoints_dict.get("shannon_entropy", 0.0): conditions_met_count +=1
    else: # PDB shannon_entropy is lower or equal
        if row["shannon_entropy"] <= midpoints_dict.get("shannon_entropy", float('inf')): conditions_met_count +=1
    # freq_proline: PDB typically lower
    if row["freq_proline"] <= midpoints_dict.get("freq_proline", float('inf')): conditions_met_count +=1
    # freq_bulky_hydrophobics: PDB typically higher
    if row["freq_bulky_hydrophobics"] >= midpoints_dict.get("freq_bulky_hydrophobics", 0.0): conditions_met_count +=1
    
    return conditions_met_count

# ----------------------------------------------------------------------------------
# --- 1. New Global Features Threshold-Based Classifier Evaluation ---
# ----------------------------------------------------------------------------------
print("\n\n--- Evaluating New Global Features Threshold-Based Classifier ---")
df_test_global_features_eval = X_global_test.copy()
df_test_global_features_eval["label"] = y_global_test

if df_test_global_features_eval.empty:
    print("Global features test set is empty. Skipping evaluation.")
else:
    df_test_global_features_eval["conditions_met"] = df_test_global_features_eval.apply(
        lambda r: count_conditions_for_new_feature_vector(r[new_feature_names].values, midpoints_global_new_features, train_means_global), axis=1
    )
    
    dist_test_global = df_test_global_features_eval.groupby("label")["conditions_met"].value_counts().unstack(fill_value=0).rename(index={0:"DisProt", 1:"PDB"})
    print("Distribution of 'conditions_met' (NEW GLOBAL features) by Label (ON TEST SET):\n")
    print(dist_test_global, "\n")

    results_global = []
    for k_thresh in range(1, 8):
        preds_test_global = (df_test_global_features_eval["conditions_met"] >= k_thresh).astype(int)
        tp = ((preds_test_global == 1) & (y_global_test == 1)).sum()
        fn = ((preds_test_global == 0) & (y_global_test == 1)).sum()
        tn = ((preds_test_global == 0) & (y_global_test == 0)).sum()
        fp = ((preds_test_global == 1) & (y_global_test == 0)).sum()
        acc = (tp + tn) / len(y_global_test) if len(y_global_test) > 0 else 0
        prec_pdb = tp / (tp + fp) if (tp + fp) > 0 else 0
        rec_pdb = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1_pdb = 2*(prec_pdb*rec_pdb)/(prec_pdb+rec_pdb) if (prec_pdb+rec_pdb)>0 else 0
        results_global.append({
            "k": k_thresh, "TP": tp, "FN": fn, "TN": tn, "FP": fp, "Accuracy": f"{acc:.2%}",
            "Precision (PDB)": f"{prec_pdb:.2%}", "Recall (PDB)": f"{rec_pdb:.2%}", "F1-score (PDB)": f"{f1_pdb:.2%}",
            "F1_PDB_float": f1_pdb  # Raw F1, kept so the best k can be selected without re-parsing the formatted string
        })
    df_results_global = pd.DataFrame(results_global)
    print("Performance of New Global Features Classifier on TEST SET (varying k):\n")
    print(df_results_global.drop(columns=["F1_PDB_float"]).to_string(index=False))

    if not df_results_global.empty:
        try:
            # Caveat: k is chosen by F1 on the test set itself, so the report below is optimistic for that k.
            best_k_row_global = df_results_global.loc[df_results_global['F1_PDB_float'].idxmax()]
            best_k_global = int(best_k_row_global["k"])
            print(f"\n--- Detailed New Global Features Classification Report for best k = {best_k_global} (based on F1 PDB) ---")
            best_preds_global = (df_test_global_features_eval["conditions_met"] >= best_k_global).astype(int)
            print(classification_report(y_global_test, best_preds_global, target_names=["DisProt (0)", "PDB (1)"], zero_division=0))
            cm_global = confusion_matrix(y_global_test, best_preds_global)
            print("Confusion Matrix for best k (New Global Features):\n", pd.DataFrame(cm_global, index=["Actual DisProt","Actual PDB"], columns=["Pred DisProt","Pred PDB"]))
        except Exception as e: print(f"Error in detailed report for global features: {e}")

# ----------------------------------------------------------------------------------
# --- 2. Sliding Window (Larger) Classifier with Failure Cancellation ---
# ----------------------------------------------------------------------------------
print("\n\n--- Testing Sliding Window (Larger) Classifier with Failure Cancellation ---")

def classify_protein_sliding_window_cancellation_new_features(
    sequence_str, window_size, slide_step,
    midpoints_for_eval, window_k_pass_thresh, 
    max_allowed_total_failures, train_means_for_direction_check):
    """
    Classify one protein from per-window condition counts.
    Returns 1 (Folded/PDB) or 0 (Disordered/DisProt).
    Sequences that are empty or shorter than window_size default to 1.
    """
    if not sequence_str: return 1 
    canonical_sequence = "".join([aa for aa in sequence_str if aa in canonical_set])
    if not canonical_sequence or len(canonical_sequence) < window_size: return 1 

    current_consecutive_failures_streak = 0
    num_windows_processed = 0

    for i in range(0, len(canonical_sequence) - window_size + 1, slide_step):
        window_str = canonical_sequence[i : i + window_size]
        num_windows_processed += 1
        
        seven_new_features_for_current_window = compute_new_seven_features(window_str)
        num_conditions_this_window_met = count_conditions_for_new_feature_vector(
            seven_new_features_for_current_window, 
            midpoints_for_eval,
            train_means_for_direction_check # Pass train_means here
        )
        
        window_passes = (num_conditions_this_window_met >= window_k_pass_thresh)
        
        if window_passes: current_consecutive_failures_streak = 0 
        else: current_consecutive_failures_streak += 1
            
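    # Not reachable for a positive slide_step: the length check above guarantees at least one window.
    if num_windows_processed == 0: return 0 
    # A passing window resets the streak, so only the run of failed windows at the END of the sequence is counted.
    total_unforgiven_failures = current_consecutive_failures_streak
    
    return 1 if total_unforgiven_failures <= max_allowed_total_failures else 0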

print(f"\nApplying Sliding Window (Larger) Classifier (NEW features) with Failure Cancellation: window_size={SLIDING_WINDOW_SIZE}, slide_step={SLIDING_WINDOW_SLIDE_STEP}, window_k_pass_thresh={SLIDING_WINDOW_PASS_K}, max_total_unforgiven_failures={MAX_UNFORGIVEN_FAILED_WINDOWS_SLIDING}...")
predictions_sliding_window_test = []
if not test_raw_sequences_for_sliding_window:
    print("No raw sequences in test set for sliding window classifier.")
else:
    for raw_seq in test_raw_sequences_for_sliding_window:
        pred = classify_protein_sliding_window_cancellation_new_features(
            raw_seq, SLIDING_WINDOW_SIZE, SLIDING_WINDOW_SLIDE_STEP,
            midpoints_global_new_features, # Use midpoints from global new features training
            SLIDING_WINDOW_PASS_K,
            MAX_UNFORGIVEN_FAILED_WINDOWS_SLIDING,
            train_means_global # Pass train_means for direction check
        )
        predictions_sliding_window_test.append(pred)
    print("\nSliding Window (Larger) with Failure Cancellation classification complete.")

    if predictions_sliding_window_test:
        print("\nPerformance of Sliding Window (Larger) Classifier with Failure Cancellation (ON TEST SET):\n")
        print(classification_report(y_test_for_sliding_window, predictions_sliding_window_test, target_names=["DisProt (0)", "PDB (1)"], zero_division=0))
        cm_sliding = confusion_matrix(y_test_for_sliding_window, predictions_sliding_window_test)
        print("Confusion Matrix:\n", pd.DataFrame(cm_sliding, index=["Actual DisProt","Actual PDB"], columns=["Pred DisProt","Pred PDB"]))
        acc_sliding = (cm_sliding[0,0] + cm_sliding[1,1]) / np.sum(cm_sliding) if np.sum(cm_sliding) > 0 else 0
        print(f"Accuracy: {acc_sliding:.2%}")
    else:
        print("No predictions made by Sliding Window (Larger) classifier.")
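
The failure-cancellation rule used by the sliding window classifier can be easier to see in isolation. The following is a minimal sketch, not part of the script: the helper names `trailing_failure_streak` and `classify_from_window_results` and the example pass/fail lists are hypothetical, and each list stands in for the per-window result that the script derives from `count_conditions_for_new_feature_vector`. It assumes the same reset-on-pass behavior as the function above.

```python
def trailing_failure_streak(window_passes):
    """Length of the run of failed windows at the end of the list."""
    streak = 0
    for passed in window_passes:
        # A passing window cancels every failure seen so far.
        streak = 0 if passed else streak + 1
    return streak


def classify_from_window_results(window_passes, max_allowed_failures):
    """Return 1 (Folded) if the trailing failure streak is within the limit, else 0."""
    return 1 if trailing_failure_streak(window_passes) <= max_allowed_failures else 0


# Hypothetical per-window results (True = window met at least k conditions).
examples = {
    "fails early, passes last": [False, False, False, True],
    "passes early, fails last": [True, True, False, False],
    "never passes": [False, False, False],
}

for name, passes in examples.items():
    streak = trailing_failure_streak(passes)
    label = classify_from_window_results(passes, max_allowed_failures=1)
    print(f"{name}: streak={streak}, predicted label={label}")
```

Under this rule the prediction depends only on the final windows of the protein: a sequence whose windows all fail except the last one is still labeled Folded, which is worth keeping in mind when choosing `MAX_UNFORGIVEN_FAILED_WINDOWS_SLIDING`.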


Loaded 15000 PDB sequences.
Loaded 25000 DisProt sequences.

Computing NEW GLOBAL features (WINDOW_SIZE_GLOBAL_FEATURES = None)...
NEW GLOBAL feature computation complete.

Global Features Training set size: 32000
Global Features Testing set size: 8000

Global Feature Means (DisProt vs. PDB) from NEW GLOBAL FEATURES TRAINING DATA:

         hydro_norm_avg  flex_norm_avg  h_bond_potential_avg  \
label                                                          
DisProt        0.401099       0.837249              1.256680   
PDB            0.475068       0.806553              1.073287   

         abs_net_charge_prop  shannon_entropy  freq_proline  \
label                                                         
DisProt             0.115245         3.359850      0.069069   
PDB                 0.036236         3.700737      0.043093   

         freq_bulky_hydrophobics  
label                             
DisProt                 0.210505  
PDB                     0.312975   

Chosen Midpoin