# HybridStackPPI Experiments (Unified Pipeline)

This notebook uses the standardized functions in `pipelines/` and `experiments/run.py`.
- Splits avoid leakage (protein-level/cluster-level), and PPI pairs are normalized (P1/P2 sorted).
- Motif local embeddings use **max pooling**.
- `class_weight` settings have been synchronized.
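
The pair normalization mentioned above (sorting P1/P2) can be illustrated with a minimal sketch. The function names `canonical_pair` and `dedupe_pairs` are hypothetical and are not taken from `pipelines/`; the sketch assumes each row is a `(p1, p2, label)` tuple and that the first occurrence of a duplicated pair is the one to keep.

```python
def canonical_pair(p1: str, p2: str) -> tuple[str, str]:
    """Return the pair with its two protein IDs in sorted order."""
    return (p1, p2) if p1 <= p2 else (p2, p1)


def dedupe_pairs(pairs):
    """Canonicalize each (p1, p2, label) row and drop repeated pairs, keeping the first."""
    seen = set()
    unique = []
    for p1, p2, label in pairs:
        key = canonical_pair(p1, p2)
        if key not in seen:
            seen.add(key)
            unique.append((*key, label))
    return unique


# (Q2, P1) and (P1, Q2) are the same interaction and collapse to one row.
rows = [("Q2", "P1", 1), ("P1", "Q2", 1), ("P3", "P1", 0)]
print(dedupe_pairs(rows))  # [('P1', 'Q2', 1), ('P1', 'P3', 0)]
```

Sorting the two IDs gives every unordered pair a single representation, so the same interaction cannot appear once in training and once in validation under a different ordering.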


In [2]:
from importlib import reload

from experiments import run as exp_run
from pipelines.feature_engine import EmbeddingComputer, FeatureEngine
from pipelines.builders import define_stacking_columns, create_stacking_pipeline

# Reload to pick up code changes if the module has been edited
reload(exp_run)


<module 'experiments.run' from '/media/SAS/Van/ppis/experiments/run.py'>

## Path configuration & shared parameters

In [3]:
H5_CACHE_FILE = 'cache/esm2_embeddings.h5'
ESM_MODEL = 'facebook/esm2_t33_650M_UR50D'
CACHE_VERSION = 'v3'

# BioGrid datasets packaged with the repo
BIOGRID_HUMAN_FASTA = 'data/BioGrid/Human/human_dict.fasta'
BIOGRID_HUMAN_PAIR = 'data/BioGrid/Human/human_pairs.tsv'

BIOGRID_YEAST_FASTA = 'data/BioGrid/Yeast/yeast_dict.fasta'
BIOGRID_YEAST_PAIR = 'data/BioGrid/Yeast/yeast_pairs.tsv'

N_JOBS = -1  # use all CPUs; change if needed
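
Because the experiments depend on several input files, it may help to check that they exist before starting a long run. The helper below is a small sketch and not part of the repository: `missing_inputs` is a hypothetical name, and the path list simply repeats the FASTA and pair files configured in this cell. The HDF5 cache is left out on the assumption that it may be created on first use.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the subset of `paths` that do not exist on disk, in input order."""
    return [str(p) for p in paths if not Path(p).exists()]


# Same input files as configured in the notebook cell.
required = [
    "data/BioGrid/Human/human_dict.fasta",
    "data/BioGrid/Human/human_pairs.tsv",
    "data/BioGrid/Yeast/yeast_dict.fasta",
    "data/BioGrid/Yeast/yeast_pairs.tsv",
]

missing = missing_inputs(required)
if missing:
    print("Missing input files:", missing)
```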


## Initialize FeatureEngine and stacking columns

In [4]:
embedding_computer = EmbeddingComputer(model_name=ESM_MODEL)
feature_engine = FeatureEngine(h5_cache_path=H5_CACHE_FILE, embedding_computer=embedding_computer)
interp_cols, embed_cols = define_stacking_columns(feature_engine, pairing_strategy='concat')

model_factory = lambda n_jobs: create_stacking_pipeline(
    interp_cols=interp_cols,
    embed_cols=embed_cols,
    n_jobs=n_jobs,
    use_selector=True,
)


Loading protein language model: facebook/esm2_t33_650M_UR50D...


2025-11-30 11:03:45.994848: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-30 11:03:46.436309: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-30 11:03:47.783914: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t3

ESM2 model loaded successfully on CPU.
Initializing Sequence-Only Hybrid Feature Engine...
Fetching and compiling ELM motifs from API: http://elm.eu.org/elms/elms_index.tsv...
✅ Successfully loaded and compiled 353 motifs from ELM database.
Feature Engine ready.
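
In the cell above, `model_factory` is a callable that takes `n_jobs` and returns a new pipeline. One likely reason to pass a factory and not a model instance is that each cross-validation fold can then be given a fresh, unfitted model. How `exp_run.run_experiment` actually calls the factory is not shown in this notebook, so the following is only a sketch of the pattern; `ToyModel`, `toy_factory`, and `run_folds` are hypothetical stand-ins.

```python
class ToyModel:
    """Hypothetical stand-in for the stacking pipeline; it only records what it was fit on."""

    def __init__(self, n_jobs):
        self.n_jobs = n_jobs
        self.fitted_on = None

    def fit(self, fold_id):
        self.fitted_on = fold_id
        return self


def toy_factory(n_jobs):
    # A factory builds a new, unfitted model on every call.
    return ToyModel(n_jobs=n_jobs)


def run_folds(factory, n_splits, n_jobs):
    """Fit one fresh model per fold so no fitted state carries over between folds."""
    models = []
    for fold_id in range(1, n_splits + 1):
        models.append(factory(n_jobs).fit(fold_id))
    return models


models = run_folds(toy_factory, n_splits=3, n_jobs=-1)
print([m.fitted_on for m in models])  # [1, 2, 3]
```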


## 1) 5-Fold Cross-Validation (protein-level)
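
A protein-level split partitions proteins, not pairs, into folds, so that no protein in a validation pair also appears in a training pair. The sketch below is one minimal way to do this and is not the code in `experiments/run.py`: the function name `protein_level_folds`, the seed, and the choice to drop pairs whose two proteins fall on different sides of the split are all assumptions. With five folds, dropping such pairs would leave roughly (4/5)^2 = 64% of pairs for training and (1/5)^2 = 4% for validation in each fold, which is in line with the fold sizes logged in this section.

```python
import random


def protein_level_folds(pairs, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) lists of pair indices with disjoint protein sets."""
    proteins = sorted({p for pair in pairs for p in pair})
    random.Random(seed).shuffle(proteins)
    # Assign each protein to exactly one fold, round-robin over the shuffled order.
    fold_of = {p: i % n_splits for i, p in enumerate(proteins)}

    for k in range(n_splits):
        train_idx, val_idx = [], []
        for idx, (p1, p2) in enumerate(pairs):
            in_val = (fold_of[p1] == k, fold_of[p2] == k)
            if all(in_val):
                val_idx.append(idx)    # both proteins are held out
            elif not any(in_val):
                train_idx.append(idx)  # neither protein is held out
            # Pairs with exactly one held-out protein are dropped (assumption).
        yield train_idx, val_idx


pairs = [("A", "B"), ("C", "D"), ("A", "C"), ("E", "F"), ("B", "E"), ("D", "F")]
for train_idx, val_idx in protein_level_folds(pairs, n_splits=2):
    train_proteins = {p for i in train_idx for p in pairs[i]}
    val_proteins = {p for i in val_idx for p in pairs[i]}
    # The last number is the leakage check: shared proteins between train and val.
    print(len(train_idx), len(val_idx), len(train_proteins & val_proteins))
```

The trade-off of this design is that some pairs are used in neither set for a given fold, in exchange for a validation score that is not inflated by proteins already seen during training.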


In [6]:
cv_results = exp_run.run_experiment(
    fasta_path=BIOGRID_HUMAN_FASTA,
    pairs_path=BIOGRID_HUMAN_PAIR,
    h5_cache_path=H5_CACHE_FILE,
    model_factory=model_factory,
    pairing_strategy='concat',
    n_splits=5,
    esm_model_name=ESM_MODEL,
    n_jobs=N_JOBS,
    cache_version=CACHE_VERSION,
)
cv_results


[33m[1müì¶ PHASE: LOADING FEATURES FROM CACHE[0m
------------------------------------------------------------
  [Cache] Loading feature matrix from cache/human_human_pairs_facebook_esm2_t33_650m_ur50d_concat_v2_features.h5...
  [Cache] Load complete. X=(62328, 7140), y=(62328,)

üöÄ [36m[1mEXPERIMENT: 5-FOLD CV (PROTEIN-LEVEL SPLIT - NO LEAKAGE)[0m
Generating 5-fold splits based on 38869 unique proteins...
  Fold 1: Train Pairs=39636, Val Pairs=2713 (Leakage Check: 0 overlap)
  Fold 2: Train Pairs=40205, Val Pairs=2534 (Leakage Check: 0 overlap)
  Fold 3: Train Pairs=40749, Val Pairs=2424 (Leakage Check: 0 overlap)
  Fold 4: Train Pairs=39724, Val Pairs=2760 (Leakage Check: 0 overlap)
  Fold 5: Train Pairs=39834, Val Pairs=2733 (Leakage Check: 0 overlap)
  ℹ️  --- Fold 1/5 ---
✅ Stacking (Selector=True) pipeline created (using *permissive* thresholds).

🔥 Starting Cumulative Feature Selection (Initial: 2020)
   [Config] use_variance=True, use_importance=True, use

{'Accuracy': 0.9945238969973869,
 'Precision': 0.9995533510325678,
 'Recall (Sensitivity)': 0.9901341375282133,
 'F1 Score': 0.994817741243371,
 'Specificity': 0.9995022967276579,
 'MCC': 0.9890695645208197,
 'ROC-AUC': 0.9983302993894503,
 'PR-AUC': 0.9989671146365471}
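
For reference, the threshold-based metrics in the dictionary above can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions; `binary_metrics` is a hypothetical helper, not the function used by `exp_run`, and returning 0.0 for an undefined ratio is an assumption made here for simplicity. ROC-AUC and PR-AUC are not covered because they need predicted scores, not counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 when the denominator is zero (assumption for degenerate cases).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # sensitivity, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "Precision": precision,
        "Recall (Sensitivity)": recall,
        "F1 Score": ratio(2 * precision * recall, precision + recall),
        "Specificity": specificity,
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


# A classifier no better than chance on balanced data: MCC is 0.
print(binary_metrics(tp=5, fp=5, tn=5, fn=5))
```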

## 2) Independent Test (train=Human, test=Yeast)

In [None]:
indep_results = exp_run.run_experiment(
    fasta_path=BIOGRID_HUMAN_FASTA,
    pairs_path=BIOGRID_HUMAN_PAIR,
    test_fasta_path=BIOGRID_YEAST_FASTA,
    test_pairs_path=BIOGRID_YEAST_PAIR,
    h5_cache_path=H5_CACHE_FILE,
    model_factory=model_factory,
    pairing_strategy='concat',
    n_splits=1,
    esm_model_name=ESM_MODEL,
    n_jobs=N_JOBS,
    cache_version=CACHE_VERSION,
)
indep_results


## 3) Full Ablation Study (5 models)

In [None]:
ablation_df = exp_run.run_ablation_study(
    fasta_path=BIOGRID_HUMAN_FASTA,
    pairs_path=BIOGRID_HUMAN_PAIR,
    h5_cache_path=H5_CACHE_FILE,
    esm_model_name=ESM_MODEL,
    n_splits=5,
    n_jobs=N_JOBS,
)
ablation_df