<a href="https://colab.research.google.com/github/neetushibu/IontheFold-Team6/blob/main/IontheFold_SelectionCriteria.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Charged Protein - Selection Criteria**

🥇 Gold (Top 10%)

Chains: ≥ 2 (must be multichain, since we are studying protein-protein interactions, PPIs)

Charged residues per 100 aa: ≥ 18.3%

Total binding sites: ≥ 1716

Represents the “hottest” PPIs: extremely electrostatically rich and ideal for focused fine-tuning.

🥈 Silver (Top 25%)

Chains: ≥ 2

Charged residues per 100 aa: ≥ 10.2%

Total binding sites: ≥ 1078

Strong electrostatic candidates that balance size and quality; good for building a robust mid-sized training dataset.

🥉 Bronze (Top 40%)

Chains: ≥ 2

Charged residues per 100 aa: ≥ 7.4%

Total binding sites: ≥ 800 (approximately the 60th percentile)

A broader dataset, useful when more coverage and diversity are needed, at the cost of some added noise.

🚫 Excluded

Ineligible: Single-chain proteins, or entries with missing charge or binding-site data.

Below Threshold: Multichain, but fails to meet even the Bronze thresholds.
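
The tier rules above can be sketched as a small standalone function. This is an illustration only: the names `TIERS` and `classify` are hypothetical, the cutoffs are the rounded values quoted in the tier descriptions, and it assumes a protein must meet both the charge and the binding-site cutoff of a tier, as the notebook's `assign_tier` does.

```python
from typing import Optional

# (label, min charged residues per 100 aa, min total binding sites)
# Values copied from the tier descriptions; ordered from strictest to loosest.
TIERS = [
    ("Gold", 18.3, 1716),
    ("Silver", 10.2, 1078),
    ("Bronze", 7.4, 800),
]

def classify(chains: int,
             charged_per_100: Optional[float],
             binding_sites: Optional[float]) -> str:
    """Return the first (strictest) tier whose two cutoffs are both met."""
    # Single-chain entries and entries with missing data are not eligible.
    if chains < 2 or charged_per_100 is None or binding_sites is None:
        return "Ineligible"
    for label, min_charged, min_sites in TIERS:
        if charged_per_100 >= min_charged and binding_sites >= min_sites:
            return label
    return "Below Threshold"

print(classify(4, 19.0, 1800))  # Gold
print(classify(2, 19.0, 900))   # Bronze
print(classify(1, 25.0, 3000))  # Ineligible
print(classify(2, 5.0, 2000))   # Below Threshold
```

The second call shows the effect of requiring both cutoffs: a highly charged complex with only 900 binding sites lands in Bronze, because the binding-site count clears only the Bronze cutoff.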

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
from pathlib import Path

# Load the analyzer summary provided by the user
input_file = Path("/content/drive/MyDrive/IontheFold/EDA/19-Aug-25/FullSequence_Analysis.csv")
df = pd.read_csv(input_file)

# Standardize column names
df.columns = [c.strip().replace(" ", "_").replace("-", "_").lower() for c in df.columns]

# Key columns
tot_res_col = "total_residues"
chg_res_col = "total_charged_residues"
sites_col   = "total_binding_sites"
chains_col  = "total_chains"

# Derived metric: charged residues per 100 aa (a percentage, despite the "fraction" name)
df["charged_fraction"] = (
    pd.to_numeric(df[chg_res_col], errors="coerce")
    / pd.to_numeric(df[tot_res_col], errors="coerce")
    * 100
)

# Compute percentiles for tier cutoffs (taken over all rows, including single-chain entries)
charged_frac_p90 = df["charged_fraction"].quantile(0.90)
charged_frac_p75 = df["charged_fraction"].quantile(0.75)
charged_frac_p60 = df["charged_fraction"].quantile(0.60)

# Coerce the binding-site column once and reuse it for all three cutoffs
sites_numeric = pd.to_numeric(df[sites_col], errors="coerce")
binding_sites_p90 = sites_numeric.quantile(0.90)
binding_sites_p75 = sites_numeric.quantile(0.75)
binding_sites_p60 = sites_numeric.quantile(0.60)

# Assign tiers
def assign_tier(row):
    """Return the tier label for one row; a tier requires both of its cutoffs."""
    chains = pd.to_numeric(row[chains_col], errors="coerce")
    frac   = row["charged_fraction"]
    sites  = pd.to_numeric(row[sites_col], errors="coerce")
    # NaN < 2 evaluates to False, so a missing chain count must be checked explicitly
    if pd.isna(chains) or chains < 2 or pd.isna(frac) or pd.isna(sites):
        return "Ineligible"
    if frac >= charged_frac_p90 and sites >= binding_sites_p90:
        return "Gold (Top 10%)"
    elif frac >= charged_frac_p75 and sites >= binding_sites_p75:
        return "Silver (Top 25%)"
    elif frac >= charged_frac_p60 and sites >= binding_sites_p60:
        return "Bronze (Top 40%)"
    else:
        return "Below Threshold"

df["tier"] = df.apply(assign_tier, axis=1)

# Save updated file
out_file = Path("/content/drive/MyDrive/IontheFold/EDA/19-Aug-25/FullSequence_Analysis_with_Tiers.csv")
df.to_csv(out_file, index=False)

# Summary counts
tier_counts = df["tier"].value_counts().to_dict()
(
    charged_frac_p90, charged_frac_p75, charged_frac_p60,
    binding_sites_p90, binding_sites_p75, binding_sites_p60,
    tier_counts, str(out_file)
)


(np.float64(1830.4347826086957),
 np.float64(984.0),
 np.float64(654.5454545454546),
 np.float64(1770.0),
 np.float64(1045.0),
 np.float64(743.0),
 {'Ineligible': 3063,
  'Below Threshold': 2803,
  'Silver (Top 25%)': 1278,
  'Bronze (Top 40%)': 1263,
  'Gold (Top 10%)': 718},
 '/content/drive/MyDrive/IontheFold/EDA/19-Aug-25/FullSequence_Analysis_with_Tiers.csv')
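
Two quick checks on the output above may be worth running; the sketch below is illustrative and not part of the original notebook. The helper names `tier_shares` and `out_of_range_rows` are hypothetical. The first turns the printed tier counts into shares of all rows, which shows how far each tier's actual size is from its nominal label once both cutoffs are required. The second flags rows whose charged-residue percentage falls outside 0 to 100. It assumes `total_charged_residues` counts a subset of `total_residues`; the printed charged-fraction cutoffs are well above 100, so it may help to confirm how the two columns are defined. The `demo` frame holds made-up values purely to show the call.

```python
import pandas as pd

def tier_shares(tier_counts: dict) -> dict:
    """Convert tier counts to fractions of the total row count."""
    total = sum(tier_counts.values())
    return {tier: count / total for tier, count in tier_counts.items()}

def out_of_range_rows(df: pd.DataFrame, col: str = "charged_fraction") -> pd.DataFrame:
    """Return rows whose percentage lies outside [0, 100]; NaN rows are not flagged."""
    values = pd.to_numeric(df[col], errors="coerce")
    return df[(values < 0) | (values > 100)]

# Counts copied from the notebook output above.
tier_counts = {
    "Ineligible": 3063,
    "Below Threshold": 2803,
    "Silver (Top 25%)": 1278,
    "Bronze (Top 40%)": 1263,
    "Gold (Top 10%)": 718,
}
for tier, share in tier_shares(tier_counts).items():
    print(f"{tier}: {share:.1%}")

# Toy data (not from the dataset) to demonstrate the range check.
demo = pd.DataFrame({"charged_fraction": [18.3, 1830.4, None]})
print(out_of_range_rows(demo))
```

In the notebook itself, the second helper would be called as `out_of_range_rows(df)` after the `charged_fraction` column has been computed.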