# Model Evaluation and First Interpretability Pass

This notebook evaluates whether the modeling pipeline captures meaningful biological signal in drug response prediction.
The focus is on honest performance assessment, sanity checks, and a controlled first-pass interpretability analysis.

At this stage, the goal is **not** to optimize models or claim biological mechanisms,
but to determine whether the current feature space explains variability in drug response
beyond trivial baselines.
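
As a point of reference, a "trivial baseline" can be as simple as predicting the training-set mean response for every held-out cell line. The sketch below illustrates that idea on synthetic AUC-like values; the helper `r2`, the variable names, and the data are all illustrative assumptions, not part of the project pipeline.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination, computed against the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic AUC-like values for one drug (illustrative only, not real data).
rng = np.random.default_rng(0)
auc_train = rng.normal(loc=0.9, scale=0.2, size=200)
auc_test = rng.normal(loc=0.9, scale=0.2, size=50)

# Trivial baseline: predict the training-set mean for every held-out cell line.
baseline_pred = np.full(auc_test.shape, auc_train.mean())
baseline_r2 = r2(auc_test, baseline_pred)
baseline_rmse = float(np.sqrt(np.mean((auc_test - baseline_pred) ** 2)))

print(f"baseline R^2:  {baseline_r2:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

A constant predictor cannot score above zero on held-out R², so a model is only of interest here if its held-out R² is clearly positive and its RMSE is clearly below the baseline's.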


## Imports, paths, and minimal I/O checks

In [1]:
# Import necessary libraries

from pathlib import Path

import pandas as pd
from IPython.display import display  # explicit import so the cells also run outside Jupyter

In [2]:
# Project paths (assumes the notebook is run with <project>/notebooks as the working directory)
PROJECT_ROOT = Path.cwd().resolve().parent
DATA_PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"

In [3]:
# Input parquet files
EXPR_PATH = DATA_PROCESSED_DIR / "depmap_expression_matched.parquet"
PRISM_PATH = DATA_PROCESSED_DIR / "prism_auc_filtered.parquet"
DRUG_INDEX_PATH = DATA_PROCESSED_DIR / "drug_index.parquet"
CELL_META_PATH = DATA_PROCESSED_DIR / "cell_line_metadata.parquet"
SELECTED_DRUGS_PATH = DATA_PROCESSED_DIR / "selected_drugs.parquet"

paths = {
    "expression": EXPR_PATH,
    "prism": PRISM_PATH,
    "drug_index": DRUG_INDEX_PATH,
    "cell_meta": CELL_META_PATH,
    "selected_drugs": SELECTED_DRUGS_PATH,
}

In [4]:
# Check that all required parquet files exist, reporting the path of each missing one
missing_files = {name: p for name, p in paths.items() if not p.exists()}
if missing_files:
    raise FileNotFoundError(
        "Missing required parquet files:\n"
        + "\n".join(f"- {name}: {p}" for name, p in missing_files.items())
    )

In [5]:
# Load parquets
expr_df = pd.read_parquet(EXPR_PATH)
prism_df = pd.read_parquet(PRISM_PATH)
drug_index_df = pd.read_parquet(DRUG_INDEX_PATH)
cell_meta_df = pd.read_parquet(CELL_META_PATH)
selected_drugs_df = pd.read_parquet(SELECTED_DRUGS_PATH)

In [7]:
# Minimal sanity prints
print("Expression shape:", expr_df.shape)
print("PRISM shape:", prism_df.shape)
print("Drug index shape:", drug_index_df.shape)
print("Cell metadata shape:", cell_meta_df.shape)
print("Selected drugs shape:", selected_drugs_df.shape)

display(selected_drugs_df.head())

print("\n✅ all parquet inputs loaded successfully.")

Expression shape: (751, 19220)
PRISM shape: (732066, 5)
Drug index shape: (1528, 7)
Cell metadata shape: (751, 2)
Selected drugs shape: (100, 7)


|   | broad_id | name | n_cell_lines | auc_mean | auc_std | auc_min | auc_max |
|---|---|---|---|---|---|---|---|
| 0 | BRD-K95142244-001-01-5 | talazoparib | 711 | 0.68425 | 0.195109 | 0.11725 | 1.807046 |
| 1 | BRD-K50168500-001-07-9 | canertinib | 707 | 0.915852 | 0.153452 | 0.305531 | 1.792206 |
| 2 | BRD-K33610132-001-02-9 | rociletinib | 704 | 0.959678 | 0.178074 | 0.535787 | 2.882057 |
| 3 | BRD-A70858459-001-01-7 | estramustine | 699 | 1.004251 | 0.181686 | 0.492309 | 1.954327 |
| 4 | BRD-K77625799-001-07-7 | vandetanib | 690 | 1.006869 | 0.196001 | 0.625323 | 2.260526 |



✅ all parquet inputs loaded successfully.


## Dataset Assembly for Evaluation

In this section, we reconstruct the modeling dataset in memory by combining
DepMap expression features with PRISM drug response measurements.

The dataset is restricted to the predefined MVP drug subset and to cell lines
with matched molecular and pharmacological data.
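
The assembly described above can be sketched on toy data as a filter followed by an inner join. This is a minimal illustration under stated assumptions: the column names `depmap_id`, `broad_id`, and `auc`, the gene columns, and the IDs are hypothetical stand-ins, since the actual schemas of `prism_df` and `expr_df` are not shown in this section.

```python
import pandas as pd

# Toy stand-ins for the loaded frames; all column names and IDs are hypothetical.
expr_toy = pd.DataFrame(
    {"GENE_A": [1.2, 0.4, 3.1], "GENE_B": [0.0, 2.2, 1.5]},
    index=pd.Index(["ACH-000001", "ACH-000002", "ACH-000003"], name="depmap_id"),
)
prism_toy = pd.DataFrame(
    {
        "depmap_id": ["ACH-000001", "ACH-000002", "ACH-000004", "ACH-000001"],
        "broad_id": ["BRD-1", "BRD-1", "BRD-1", "BRD-2"],
        "auc": [0.70, 0.95, 0.80, 1.01],
    }
)
selected_toy = pd.DataFrame({"broad_id": ["BRD-1"]})

# 1) Restrict responses to the selected drug subset.
resp = prism_toy[prism_toy["broad_id"].isin(selected_toy["broad_id"])]

# 2) Inner-join on cell line so only lines with both expression and response remain.
#    validate="many_to_one" asserts that each cell line has a single expression row.
model_df = resp.merge(
    expr_toy.reset_index(),
    on="depmap_id",
    how="inner",
    validate="many_to_one",
)

print(model_df)
```

In this toy case the `BRD-2` row is dropped by the drug filter and `ACH-000004` is dropped by the join because it has no expression profile, leaving one row per (cell line, drug) pair with the expression features attached.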
