# 01 — Sanity Checks: DepMap Expression + PRISM Drug Response

## Purpose
This notebook performs foundational sanity checks for the MVP pipeline by validating:
- The presence and integrity of raw input files (fingerprinting).
- The basic structure and expected columns of DepMap expression and PRISM drug response datasets.
- The feasibility of joining datasets via shared cell line identifiers.
- Minimal descriptive statistics to detect obvious anomalies early (without heavy preprocessing).

This notebook is intentionally conservative: it does **not** train models, define final resistance labels, or perform feature engineering. Its role is to confirm that the project "stands on solid ground" before downstream analysis.

---

## Inputs (raw)
Expected raw file locations (not tracked by Git):

- DepMap expression (RNA-seq):  
  `data/raw/OmicsExpressionTPMLogp1HumanProteinCodingGenes.csv`

- PRISM drug response (secondary screen dose–response parameters):  
  `data/raw/prism-repurposing-20q2-secondary-screen-dose-response-curve-parameters.csv`

> Raw datasets are intentionally stored under `data/raw/` and excluded from version control to prevent accidental publication of large files and to keep the repository lightweight. Provenance and file fingerprints are documented instead.

---

## Reproducibility: file fingerprinting (SHA256)
To support strong reproducibility, we compute a SHA256 fingerprint for each raw file used by this notebook, along with file size.  
These fingerprints allow future users to confirm they are using the same source files (bitwise identity), even if filenames or download mechanisms change.

We **do not** hard-fail execution when fingerprints differ, so that the notebook still runs in environments where the files were obtained or stored differently. Instead, we record the fingerprints and encourage users to report them when sharing results.
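
The "record, do not fail" policy can be sketched as a comparison that warns on a mismatch instead of raising. This is only an illustration: `check_fingerprint` and the reference values are hypothetical and not part of the notebook, and the demo runs on a throwaway file rather than the real inputs.

```python
import hashlib
import tempfile
import warnings
from pathlib import Path
from typing import Optional


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large CSV is never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fingerprint(path: Path, expected_sha256: Optional[str]) -> bool:
    """Return True on a match; warn instead of raising otherwise."""
    actual = file_sha256(path)
    if expected_sha256 is None:
        warnings.warn(f"No reference fingerprint for {path.name}; computed {actual}")
        return False
    if actual != expected_sha256.strip().lower():
        warnings.warn(f"Fingerprint mismatch for {path.name}: computed {actual}")
        return False
    return True


# Demo on a throwaway file (the bytes and reference values are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_bytes(b"a,b\n1,2\n")
    good_reference = hashlib.sha256(b"a,b\n1,2\n").hexdigest()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # keep the demo quiet
        matches_good = check_fingerprint(demo_path, good_reference)
        matches_bad = check_fingerprint(demo_path, "0" * 64)
```

A mismatch only produces a warning, so downstream cells still run and the computed hash can be reported alongside the results.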

---

## Step 0 — Environment & configuration
- Import core packages
- Define file paths
- Basic runtime checks (paths exist, readable)

In [1]:
from __future__ import annotations

import hashlib
from pathlib import Path
import pandas as pd
import numpy as np


# --- Paths (Windows-friendly). Use forward slashes; Path() handles it on Windows too.
DEP_MAP_EXPR_PATH = Path("../data/raw/OmicsExpressionTPMLogp1HumanProteinCodingGenes.csv")
PRISM_DOSE_RESP_PATH = Path(
    "../data/raw/prism-repurposing-20q2-secondary-screen-dose-response-curve-parameters.csv"
)

# --- Basic runtime checks
for p in [DEP_MAP_EXPR_PATH, PRISM_DOSE_RESP_PATH]:
    if not p.exists():
        raise FileNotFoundError(f"Missing file: {p.resolve()}")

print("✅ Input files found:")
print(f"- DepMap expression: {DEP_MAP_EXPR_PATH.resolve()}")
print(f"- PRISM dose-response: {PRISM_DOSE_RESP_PATH.resolve()}")


✅ Input files found:
- DepMap expression: C:\Users\paula\OneDrive\Documentos\Proyectos\epigenetic-drug-resistance\data\raw\OmicsExpressionTPMLogp1HumanProteinCodingGenes.csv
- PRISM dose-response: C:\Users\paula\OneDrive\Documentos\Proyectos\epigenetic-drug-resistance\data\raw\prism-repurposing-20q2-secondary-screen-dose-response-curve-parameters.csv


---

## Step 1 — Fingerprint raw files
- Report file name, size (MB), and SHA256 for:
  - DepMap expression file
  - PRISM response file

**Expected outcome**
- Both files exist locally
- Fingerprints are computed successfully

In [2]:
def file_sha256(path: Path, chunk_size: int = 8_192) -> str:
    sha256 = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

def describe_file(path: Path) -> dict:
    size_bytes = path.stat().st_size
    return {
        "file": path.name,
        "size_mb": round(size_bytes / (1024**2), 2),
        "sha256": file_sha256(path),
    }

fingerprints = pd.DataFrame(
    [describe_file(DEP_MAP_EXPR_PATH), describe_file(PRISM_DOSE_RESP_PATH)]
)

fingerprints

Unnamed: 0,file,size_mb,sha256
0,OmicsExpressionTPMLogp1HumanProteinCodingGenes...,517.87,9d7e64ebbcb2811fa5e0ecf56d952ee96ca9f1f2b6ead0...
1,prism-repurposing-20q2-secondary-screen-dose-r...,276.73,2ac69a21f1d681fe7447689262b82ca6e3dc90bfef0bd9...


---

## Step 2 — Load DepMap expression (minimal read)
### Goals
- Confirm the table loads correctly
- Validate expected index/identifier columns
- Verify dimensionality is plausible for this release (roughly 19,000 protein-coding genes, on the order of a thousand or more expression profiles)

### Checks
- Shape
- Column presence and basic types
- Missingness overview (high-level)

**Notes**
The DepMap expression file is expected to contain gene-level log2(TPM+1) expression values for protein-coding genes. It should include a cell line identifier column (commonly `ModelID` and/or `DepMap_ID`, depending on the file format).
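
Because the values are log2(TPM+1), they should never be negative, and the transform can be inverted as TPM = 2^x − 1. The sketch below shows a cheap plausibility check along those lines. The helper names and the `max_plausible` ceiling are assumptions chosen for illustration, not values taken from DepMap documentation.

```python
import numpy as np


def log2p1_to_tpm(x):
    """Invert log2(TPM + 1): TPM = 2**x - 1."""
    return np.exp2(np.asarray(x, dtype=float)) - 1.0


def looks_like_log2p1(values, max_plausible=25.0):
    """Heuristic: finite values are non-negative and below a loose ceiling.

    max_plausible is an arbitrary, assumed bound; adjust it to the data at hand.
    """
    v = np.asarray(values, dtype=float)
    v = v[np.isfinite(v)]
    return bool(v.size > 0 and v.min() >= 0.0 and v.max() <= max_plausible)


# Made-up values in the typical range of a log2(TPM + 1) matrix.
example_values = [0.0, 1.0, 3.0]
example_tpm = log2p1_to_tpm(example_values)
example_ok = looks_like_log2p1(example_values)
```

A negative value or an implausibly large one would suggest the file holds raw TPM or a different transform.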

In [3]:
# Minimal load to inspect structure without stressing memory

# Read only the header first
depmap_cols = pd.read_csv(DEP_MAP_EXPR_PATH, nrows=0).columns.tolist()

print(f"Number of columns: {len(depmap_cols)}")
print("First 10 columns:")
depmap_cols[:10]


Number of columns: 19221
First 10 columns:


['Unnamed: 0',
 'SequencingID',
 'ModelID',
 'IsDefaultEntryForModel',
 'ModelConditionID',
 'IsDefaultEntryForMC',
 'TSPAN6 (7105)',
 'TNMD (64102)',
 'DPM1 (8813)',
 'SCYL3 (57147)']

In [4]:
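# Heuristic check for identifier columns.
# Caveat: the bare "id" key also matches gene symbols (e.g. JARID2, IDH1), so the result contains false positives.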
id_like_cols = [c for c in depmap_cols if any(
    key.lower() in c.lower() 
    for key in ["depmap", "model", "ccle", "id"]
)]

id_like_cols

['SequencingID',
 'ModelID',
 'IsDefaultEntryForModel',
 'ModelConditionID',
 'JARID2 (3720)',
 'IDS (3423)',
 'BID (637)',
 'ARID4A (5926)',
 'ARID1B (57492)',
 'ARID4B (51742)',
 'IDI1 (3422)',
 'IDH3G (3421)',
 'SIDT1 (54847)',
 'MID2 (11043)',
 'NID2 (22795)',
 'PRELID3B (51012)',
 'DIDO1 (11083)',
 'GID8 (54994)',
 'IDH3B (3420)',
 'MID1 (4281)',
 'ID2 (3398)',
 'ARID3A (1820)',
 'NID1 (4811)',
 'ID3 (3399)',
 'ARID1A (8289)',
 'IDE (3416)',
 'ID1 (3397)',
 'PCID2 (55795)',
 'IDUA (3425)',
 'IDO1 (3620)',
 'RIDA (10247)',
 'KIDINS220 (57498)',
 'CIDEB (27141)',
 'ATRAID (51374)',
 'IDH1 (3417)',
 'ITPRID2 (6744)',
 'GID4 (79018)',
 'PRELID3A (10650)',
 'IDNK (414328)',
 'IDI2 (91734)',
 'SIDT2 (51092)',
 'ARID5B (84159)',
 'GRID2 (2895)',
 'PID1 (55022)',
 'MIDEAS (91748)',
 'SPIDR (23514)',
 'MID1IP1 (58526)',
 'IDH3A (3419)',
 'MIDN (90007)',
 'NFKBID (84807)',
 'HID1 (283987)',
 'PRELID1 (27166)',
 'PPID (5481)',
 'ID4 (3400)',
 'CIDEA (1149)',
 'EID2 (163126)',
 'EID2B (126272
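
The substring match on `"id"` catches many gene columns (JARID2, IDH1, and so on). A tighter variant first separates metadata columns from gene columns. This sketch assumes that gene columns always follow the `SYMBOL (EntrezID)` naming seen in the header above. `GENE_COL_PATTERN`, `split_columns`, and `id_like` are hypothetical helpers, not part of the notebook.

```python
import re

# Assumed naming convention for gene columns, e.g. "TSPAN6 (7105)".
GENE_COL_PATTERN = re.compile(r"^\S+ \(\d+\)$")


def split_columns(columns):
    """Split column names into (metadata columns, gene columns)."""
    gene_cols = [c for c in columns if GENE_COL_PATTERN.match(c)]
    meta_cols = [c for c in columns if not GENE_COL_PATTERN.match(c)]
    return meta_cols, gene_cols


def id_like(columns, keys=("depmap", "model", "ccle", "id")):
    """Apply the substring heuristic to metadata columns only."""
    meta_cols, _ = split_columns(columns)
    return [c for c in meta_cols if any(k in c.lower() for k in keys)]


# Column names copied from the header printed earlier in this notebook.
demo_cols = [
    "Unnamed: 0", "SequencingID", "ModelID", "IsDefaultEntryForModel",
    "ModelConditionID", "IsDefaultEntryForMC",
    "TSPAN6 (7105)", "JARID2 (3720)", "IDH1 (3417)",
]
demo_meta, demo_genes = split_columns(demo_cols)
demo_ids = id_like(demo_cols)
```

In the notebook this would be called as `id_like(depmap_cols)`. On the demo list it keeps the four identifier-style metadata columns and no gene symbols.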

In [5]:
depmap_preview = pd.read_csv(
    DEP_MAP_EXPR_PATH,
    nrows=100
)

depmap_preview.head()

Unnamed: 0.1,Unnamed: 0,SequencingID,ModelID,IsDefaultEntryForModel,ModelConditionID,IsDefaultEntryForMC,TSPAN6 (7105),TNMD (64102),DPM1 (8813),SCYL3 (57147),...,ATXN8 (724066),SMIM42 (117981789),NPBWR1 (2831),ACTL10 (170487),RNF228 (122319436),PANO1 (101927423),HRURF (120766137),PRRC2B (84726),F8A2 (474383),F8A1 (8263)
0,0,CDS-010xbm,ACH-001113,Yes,MC-001113-k2lR,Yes,4.956577,0.0,7.577648,3.179411,...,0.0,0.0,0.414727,0.077634,1.113094,0.411901,0.0,5.134808,1.214541,4.315653
1,1,CDS-02TzJp,ACH-001289,Yes,MC-001289-BpdI,Yes,4.954992,0.617243,7.334747,2.783576,...,0.0,0.0,0.02984,0.0,1.262156,1.026871,0.0,5.951231,0.007227,4.25106
2,2,CDS-0693hw,ACH-001339,Yes,MC-001339-5nRN,Yes,3.421952,0.0,7.546069,2.61588,...,0.0,0.0,0.127768,0.105594,0.413829,0.540575,0.0,4.205971,0.030894,2.783448
3,3,CDS-07Plat,ACH-001619,No,MC-001619-IR6I,No,5.196729,0.0,6.362268,2.144996,...,0.0,0.0,0.075396,0.932918,0.0,0.511978,0.0,4.510747,3.223597,5.106595
4,4,CDS-08FOcu,ACH-001979,Yes,MC-001979-E3qW,Yes,4.651643,0.0,5.946408,2.454515,...,0.0,0.0,0.012012,0.0,1.00647,0.729221,0.0,4.922916,0.150926,4.661556


In [6]:
depmap_preview.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 19221 entries, Unnamed: 0 to F8A1 (8263)
dtypes: float64(19215), int64(1), object(5)
memory usage: 14.7+ MB


In [7]:
depmap_preview.describe().iloc[:, :5]  # first few columns only

Unnamed: 0.1,Unnamed: 0,TSPAN6 (7105),TNMD (64102),DPM1 (8813),SCYL3 (57147)
count,100.0,100.0,100.0,100.0,100.0
mean,49.5,3.858736,0.110574,6.809955,2.585842
std,29.011492,1.675112,0.789898,0.633608,0.610053
min,0.0,0.0,0.0,5.519271,1.063075
25%,24.75,3.46932,0.0,6.394414,2.124794
50%,49.5,4.2882,0.0,6.769404,2.600558
75%,74.25,4.930172,0.0,7.207496,2.936453
max,99.0,7.10594,7.858733,8.576556,4.170262


---

## Step 3 — Load PRISM drug response (minimal read)
### Goals
- Confirm the table loads correctly
- Validate expected response metrics exist (AUC, IC50, EC50) and identifiers (drug name/broad_id, depmap_id, etc.)
- Ensure there are multiple compounds and multiple cell lines

### Checks
- Shape
- Column presence (especially: `depmap_id`, `name`/drug identifier, `auc`)
- Missingness overview for key variables

**Notes**
This project will use **AUC** as the primary continuous drug response measure for the MVP, with optional binarization defined later.

In [8]:
# Load PRISM
prism_cols = pd.read_csv(PRISM_DOSE_RESP_PATH, nrows=0).columns.tolist()

print(f"Number of columns: {len(prism_cols)}")
print("First 30 columns:")
prism_cols[:30]

Number of columns: 20
First 30 columns:


['broad_id',
 'depmap_id',
 'ccle_name',
 'screen_id',
 'upper_limit',
 'lower_limit',
 'slope',
 'r2',
 'auc',
 'ec50',
 'ic50',
 'passed_str_profiling',
 'row_name',
 'name',
 'moa',
 'target',
 'disease.area',
 'indication',
 'smiles',
 'phase']

In [9]:
# Identify key columns (depmap_id, drug id, AUC)
key_like = [c for c in prism_cols if any(k in c.lower() for k in ["depmap", "auc", "ic50", "ec50", "name", "broad", "compound", "dose"])]
key_like

['broad_id',
 'depmap_id',
 'ccle_name',
 'auc',
 'ec50',
 'ic50',
 'row_name',
 'name']

In [10]:
# Load a small preview
prism_preview = pd.read_csv(PRISM_DOSE_RESP_PATH, nrows=200)
prism_preview.head()

Unnamed: 0,broad_id,depmap_id,ccle_name,screen_id,upper_limit,lower_limit,slope,r2,auc,ec50,ic50,passed_str_profiling,row_name,name,moa,target,disease.area,indication,smiles,phase
0,BRD-K36949735-001-01-1,ACH-000948,2313287_STOMACH,MTS010,1.088523,-1.079841,8.284451,0.913068,0.989373,10.375348,9.209687,True,PR500_ACH-000948,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
1,BRD-K36949735-001-01-1,ACH-000011,253J_URINARY_TRACT,MTS010,1.004565,-2.248875,7.883544,0.934277,0.988011,11.475826,9.255366,True,PR500_ACH-000011,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
2,BRD-K36949735-001-01-1,ACH-000026,253JBV_URINARY_TRACT,MTS010,1.068269,-21.457752,1.216485,0.877853,0.958743,167.921713,8.327173,True,PR500_ACH-000026,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
3,BRD-K36949735-001-01-1,ACH-000323,42MGBA_CENTRAL_NERVOUS_SYSTEM,MTS010,0.985761,0.029825,0.929279,0.929666,0.814224,2.221636,2.300992,True,PR500_ACH-000323,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
4,BRD-K36949735-001-01-1,ACH-000905,5637_URINARY_TRACT,MTS010,1.129441,0.017405,1.158793,0.929414,0.830589,1.459074,1.835021,True,PR500_ACH-000905,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3


In [11]:
# Basic sanity info
prism_preview.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   broad_id              200 non-null    object 
 1   depmap_id             200 non-null    object 
 2   ccle_name             196 non-null    object 
 3   screen_id             200 non-null    object 
 4   upper_limit           200 non-null    float64
 5   lower_limit           200 non-null    float64
 6   slope                 200 non-null    float64
 7   r2                    200 non-null    float64
 8   auc                   200 non-null    float64
 9   ec50                  200 non-null    float64
 10  ic50                  171 non-null    float64
 11  passed_str_profiling  200 non-null    bool   
 12  row_name              200 non-null    object 
 13  name                  200 non-null    object 
 14  moa                   200 non-null    object 
 15  target                2

---

## Step 4 — Identifier alignment feasibility
### Goals
- Identify the join key(s) that connect DepMap expression to PRISM response
- Quantify overlap between datasets

### Checks
- Candidate keys (e.g., `depmap_id`, `ModelID`, `ccle_name`)
- Overlap counts (how many cell lines are shared)
- Proportion of PRISM entries mappable to expression features

**Deliverable**
A clear choice of join key for downstream steps (documented, but not yet “finalized forever”).

In [12]:
# --- Load minimal identifiers from expression
expr_ids = pd.read_csv(
    DEP_MAP_EXPR_PATH,
    usecols=["ModelID", "SequencingID", "ModelConditionID"],
)

# --- Load minimal PRISM core
prism_core = pd.read_csv(
    PRISM_DOSE_RESP_PATH,
    usecols=["depmap_id", "broad_id", "name", "auc"],
)

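# --- Defensive normalization
# Caveat: astype(str) turns missing IDs into the literal string "nan", so the dropna() calls below do not remove them.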
expr_ids["ModelID"] = expr_ids["ModelID"].astype(str).str.strip()
prism_core["depmap_id"] = prism_core["depmap_id"].astype(str).str.strip()
prism_core["broad_id"] = prism_core["broad_id"].astype(str).str.strip()

print("Expression ModelID example:", expr_ids["ModelID"].dropna().iloc[0])
print("PRISM depmap_id example:", prism_core["depmap_id"].dropna().iloc[0])

# --- Overlap (cell lines)
expr_set = set(expr_ids["ModelID"].dropna().unique())
prism_set = set(prism_core["depmap_id"].dropna().unique())
shared = expr_set.intersection(prism_set)

overlap_stats = pd.DataFrame([{
    "n_expression_models": len(expr_set),
    "n_prism_cell_lines": len(prism_set),
    "n_shared_cell_lines": len(shared),
    "shared_vs_expression_pct": round(100 * len(shared) / max(len(expr_set), 1), 2),
    "shared_vs_prism_pct": round(100 * len(shared) / max(len(prism_set), 1), 2),
}])

overlap_stats

Expression ModelID example: ACH-001113
PRISM depmap_id example: ACH-000948


Unnamed: 0,n_expression_models,n_prism_cell_lines,n_shared_cell_lines,shared_vs_expression_pct,shared_vs_prism_pct
0,1699,738,727,42.79,98.51
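
The overlap table counts 1,699 distinct `ModelID` values, while the full expression table loaded in Step 5 has 1,754 rows, so some models contribute more than one profile. One possible way to get a single row per model is sketched below. It assumes that `IsDefaultEntryForModel == "Yes"` marks the preferred profile, which is a reading of the column name and is not verified in this notebook. The helper name and the toy IDs are made up.

```python
import pandas as pd


def default_entries(df, id_col="ModelID", flag_col="IsDefaultEntryForModel"):
    """Keep rows flagged "Yes" and report how many duplicate IDs remain."""
    is_default = df[flag_col].astype("string").str.strip().str.lower() == "yes"
    kept = df.loc[is_default.fillna(False)].reset_index(drop=True)
    n_remaining_dupes = int(kept[id_col].duplicated().sum())
    return kept, n_remaining_dupes


# Toy table: model ACH-A has two profiles, only one flagged as default.
toy_expr = pd.DataFrame({
    "ModelID": ["ACH-A", "ACH-A", "ACH-B"],
    "IsDefaultEntryForModel": ["Yes", "No", "Yes"],
    "GENE1 (1)": [1.0, 2.0, 3.0],
})
filtered_expr, remaining_dupes = default_entries(toy_expr)
```

If `remaining_dupes` is non-zero on the real data, the flag alone is not enough and a further tie-breaking rule would be needed.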


In [13]:
prism_core["mappable_to_expression"] = prism_core["depmap_id"].isin(expr_set)

pd.DataFrame([{
    "total_prism_rows": len(prism_core),
    "mappable_rows": int(prism_core["mappable_to_expression"].sum()),
    "mappable_rows_pct": round(100 * prism_core["mappable_to_expression"].mean(), 2),
    "unmappable_rows": int((~prism_core["mappable_to_expression"]).sum()),
}])

Unnamed: 0,total_prism_rows,mappable_rows,mappable_rows_pct,unmappable_rows
0,753778,738001,97.91,15777


In [14]:
(
    prism_core.loc[~prism_core["mappable_to_expression"], "depmap_id"]
    .value_counts()
    .head(20)
)

depmap_id
nan           10137
ACH-001024     1557
ACH-000047     1494
ACH-000309     1484
ACH-001212      169
ACH-000658      163
ACH-001205      159
ACH-000534      158
ACH-000992      157
ACH-001078      152
ACH-000010      147
Name: count, dtype: int64
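
The `nan` entry at the top of this list is a string. In this run, the `astype(str)` normalization evidently converted missing `depmap_id` values into the literal `"nan"`, so the later `dropna()` calls left them in place and they were counted as one more "cell line". A sketch of a normalization that keeps missing IDs missing, assuming pandas' nullable `string` dtype is acceptable for this column (`normalize_ids` is a hypothetical helper):

```python
import numpy as np
import pandas as pd


def normalize_ids(s: pd.Series) -> pd.Series:
    """Strip whitespace while keeping missing values as <NA>."""
    return s.astype("string").str.strip()


# Toy column with one missing ID and stray whitespace.
raw_ids = pd.Series(["ACH-000948 ", np.nan, " ACH-000011"])
clean_ids = normalize_ids(raw_ids)

n_missing = int(clean_ids.isna().sum())
unique_ids = set(clean_ids.dropna())
```

With this variant, `dropna()` removes the missing entry before the set is built, so overlap counts refer only to real identifiers.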

In [15]:
dup_mask = prism_core.duplicated(subset=["depmap_id", "broad_id"], keep=False)

pd.DataFrame([{
    "total_rows": len(prism_core),
    "duplicate_rows": int(dup_mask.sum()),
    "duplicate_rows_pct": round(100 * dup_mask.mean(), 6),
    "unique_pairs_cell_drug": prism_core[["depmap_id", "broad_id"]].drop_duplicates().shape[0],
}])


Unnamed: 0,total_rows,duplicate_rows,duplicate_rows_pct,unique_pairs_cell_drug
0,753778,99402,13.187172,699583


### Duplicate cell line × drug entries (PRISM)

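Multiple rows per (depmap_id, broad_id) combination are present in the PRISM dose–response dataset: 99,402 of 753,778 rows (about 13.2%) belong to a pair that occurs more than once.
This is expected and reflects:
- multiple experimental screens (e.g. MTS010),
- multiple fitted dose–response curves per compound–cell line pair.

The count also includes rows with a missing `depmap_id` (shown as `nan` above) whenever several of them share a compound, because missing IDs compare as equal here.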

No resolution strategy is applied at this stage.
The choice of how to handle duplicate entries (e.g. screen selection, aggregation, or quality-based filtering)
is deferred to the next notebook, where response targets are formally defined.
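
To see how much of the duplication comes from multiple screens versus repeated fits within one screen, the pairs can be broken down by `screen_id`. The following is a sketch on toy data. `duplicate_breakdown` is a hypothetical helper, and the toy IDs are invented; only the column names come from the PRISM table.

```python
import pandas as pd


def duplicate_breakdown(df, keys=("depmap_id", "broad_id"), screen_col="screen_id"):
    """Summarize duplicated (cell line, compound) pairs by their screen usage."""
    per_pair = (
        df.groupby(list(keys), dropna=False)
        .agg(n_rows=(screen_col, "size"), n_screens=(screen_col, "nunique"))
        .reset_index()
    )
    dupes = per_pair[per_pair["n_rows"] > 1]
    return {
        "pairs_with_duplicates": int(len(dupes)),
        # Pairs measured in more than one screen.
        "pairs_in_multiple_screens": int((dupes["n_screens"] > 1).sum()),
        # Pairs with more rows than screens, i.e. repeats within a screen.
        "pairs_repeated_within_screen": int((dupes["n_rows"] > dupes["n_screens"]).sum()),
    }


toy_prism = pd.DataFrame({
    "depmap_id": ["ACH-A", "ACH-A", "ACH-B", "ACH-B", "ACH-C"],
    "broad_id": ["BRD-1", "BRD-1", "BRD-1", "BRD-1", "BRD-1"],
    "screen_id": ["HTS002", "MTS010", "MTS010", "MTS010", "HTS002"],
})
breakdown = duplicate_breakdown(toy_prism)
```

On the real table this would be called as `duplicate_breakdown(prism)`. The two categories can overlap when a pair appears in several screens and is also repeated within one of them.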

---

## Step 5 — Minimal descriptive sanity checks
### DepMap expression
- Distribution snapshots for expression values (no heavy plotting required)
- Spot-check a few genes and a few models

### PRISM response
- Distribution of AUC (overall and per-screen if applicable)
- Spot-check a few compounds and a few cell lines

**Purpose**
Catch glaring issues early (e.g., all zeros, swapped columns, unexpected ranges).

In [16]:
# ---- Load DepMap expression matrix ----
depmap_expr_df = pd.read_csv(DEP_MAP_EXPR_PATH, low_memory=False)

print("Loaded DepMap expression -> depmap_expr_df")
print(f"Shape: {depmap_expr_df.shape}")
display(depmap_expr_df.head())


# ---- Load PRISM dose–response data ----
prism_df = pd.read_csv(PRISM_DOSE_RESP_PATH, low_memory=False)

print("\nLoaded PRISM dose–response -> prism_df")
print(f"Shape: {prism_df.shape}")
display(prism_df.head())


# ---- Standardized working copies for downstream steps ----
dep_expr = depmap_expr_df.copy()
prism = prism_df.copy()

print("\nStandardized working dataframes created:")
print(" - dep_expr  (DepMap expression)")
print(" - prism     (PRISM dose–response)")


Loaded DepMap expression -> depmap_expr_df
Shape: (1754, 19221)


Unnamed: 0.1,Unnamed: 0,SequencingID,ModelID,IsDefaultEntryForModel,ModelConditionID,IsDefaultEntryForMC,TSPAN6 (7105),TNMD (64102),DPM1 (8813),SCYL3 (57147),...,ATXN8 (724066),SMIM42 (117981789),NPBWR1 (2831),ACTL10 (170487),RNF228 (122319436),PANO1 (101927423),HRURF (120766137),PRRC2B (84726),F8A2 (474383),F8A1 (8263)
0,0,CDS-010xbm,ACH-001113,Yes,MC-001113-k2lR,Yes,4.956577,0.0,7.577648,3.179411,...,0.0,0.0,0.414727,0.077634,1.113094,0.411901,0.0,5.134808,1.214541,4.315653
1,1,CDS-02TzJp,ACH-001289,Yes,MC-001289-BpdI,Yes,4.954992,0.617243,7.334747,2.783576,...,0.0,0.0,0.02984,0.0,1.262156,1.026871,0.0,5.951231,0.007227,4.25106
2,2,CDS-0693hw,ACH-001339,Yes,MC-001339-5nRN,Yes,3.421952,0.0,7.546069,2.61588,...,0.0,0.0,0.127768,0.105594,0.413829,0.540575,0.0,4.205971,0.030894,2.783448
3,3,CDS-07Plat,ACH-001619,No,MC-001619-IR6I,No,5.196729,0.0,6.362268,2.144996,...,0.0,0.0,0.075396,0.932918,0.0,0.511978,0.0,4.510747,3.223597,5.106595
4,4,CDS-08FOcu,ACH-001979,Yes,MC-001979-E3qW,Yes,4.651643,0.0,5.946408,2.454515,...,0.0,0.0,0.012012,0.0,1.00647,0.729221,0.0,4.922916,0.150926,4.661556



Loaded PRISM dose–response -> prism_df
Shape: (753778, 20)


Unnamed: 0,broad_id,depmap_id,ccle_name,screen_id,upper_limit,lower_limit,slope,r2,auc,ec50,ic50,passed_str_profiling,row_name,name,moa,target,disease.area,indication,smiles,phase
0,BRD-K36949735-001-01-1,ACH-000948,2313287_STOMACH,MTS010,1.088523,-1.079841,8.284451,0.913068,0.989373,10.375348,9.209687,True,PR500_ACH-000948,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
1,BRD-K36949735-001-01-1,ACH-000011,253J_URINARY_TRACT,MTS010,1.004565,-2.248875,7.883544,0.934277,0.988011,11.475826,9.255366,True,PR500_ACH-000011,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
2,BRD-K36949735-001-01-1,ACH-000026,253JBV_URINARY_TRACT,MTS010,1.068269,-21.457752,1.216485,0.877853,0.958743,167.921713,8.327173,True,PR500_ACH-000026,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
3,BRD-K36949735-001-01-1,ACH-000323,42MGBA_CENTRAL_NERVOUS_SYSTEM,MTS010,0.985761,0.029825,0.929279,0.929666,0.814224,2.221636,2.300992,True,PR500_ACH-000323,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
4,BRD-K36949735-001-01-1,ACH-000905,5637_URINARY_TRACT,MTS010,1.129441,0.017405,1.158793,0.929414,0.830589,1.459074,1.835021,True,PR500_ACH-000905,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3



Standardized working dataframes created:
 - dep_expr  (DepMap expression)
 - prism     (PRISM dose–response)


In [17]:
# -----------------------------
# Helpers
# -----------------------------
def _to_numeric_series(x: pd.Series) -> pd.Series:
    """Coerce to numeric; returns numeric series with NaNs for non-parsable."""
    return pd.to_numeric(x, errors="coerce")

def quick_numeric_snapshot(s: pd.Series, name: str) -> pd.Series:
    s = _to_numeric_series(s).dropna()
    if s.empty:
        return pd.Series({"name": name, "n": 0})
    qs = s.quantile([0.0, 0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99, 1.0])
    out = {
        "name": name,
        "n": int(s.shape[0]),
        "mean": float(s.mean()),
        "std": float(s.std(ddof=1)) if s.shape[0] > 1 else 0.0,
        "min": float(qs.loc[0.0]),
        "p01": float(qs.loc[0.01]),
        "p05": float(qs.loc[0.05]),
        "p25": float(qs.loc[0.25]),
        "p50": float(qs.loc[0.50]),
        "p75": float(qs.loc[0.75]),
        "p95": float(qs.loc[0.95]),
        "p99": float(qs.loc[0.99]),
        "max": float(qs.loc[1.0]),
    }
    return pd.Series(out)

def print_block(title: str):
    print("\n" + "=" * 80)
    print(title)
    print("=" * 80)



In [18]:
# -----------------------------
# Step 5 — Minimal descriptive sanity checks
# -----------------------------

# ========= DepMap expression =========
print_block("Step 5A — DepMap expression: minimal descriptive sanity checks (memory-safe)")

# Identify ID-like cols (your DepMap has 'ModelID')
id_like_cols = [c for c in dep_expr.columns if str(c).lower() in {"modelid", "depmap_id", "ccle_name", "stripped_cell_line_name"}]

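# numeric candidate columns
# Caveat: this also picks up the integer 'Unnamed: 0' index column, so the count is one more than the number of gene columns.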
numeric_cols = [c for c in dep_expr.columns if c not in id_like_cols and pd.api.types.is_numeric_dtype(dep_expr[c])]

print(f"DepMap rows: {dep_expr.shape[0]:,} | cols: {dep_expr.shape[1]:,}")
print(f"ID-like cols detected: {id_like_cols if id_like_cols else 'None'}")
print(f"Numeric columns (candidate genes): {len(numeric_cols):,}")

if len(numeric_cols) == 0:
    print("No numeric columns detected. Check parsing.")
else:
    rng = np.random.default_rng(42)

    # --- sample a small subset of genes (columns), then flatten values (safe) ---
    n_genes_sample = min(200, len(numeric_cols))
    genes_sample = rng.choice(numeric_cols, size=n_genes_sample, replace=False).tolist()

    # flatten values from a SMALL matrix: (n_rows x 200) -> manageable
    vals = dep_expr[genes_sample].to_numpy().ravel()
    snap = quick_numeric_snapshot(pd.Series(vals), name=f"depmap_expr_values_{n_genes_sample}genes")
    display(snap.to_frame().T)

    # --- spot-check a few genes and a few rows ---
    genes_to_check = rng.choice(genes_sample, size=min(6, len(genes_sample)), replace=False).tolist()
    print(f"\nSpot-check genes: {genes_to_check}")

    if "ModelID" in dep_expr.columns:
        model_ids = dep_expr["ModelID"].dropna().astype(str).unique()
        pick_models = rng.choice(model_ids, size=min(6, len(model_ids)), replace=False).tolist()
        print(f"Spot-check ModelID: {pick_models}")
        display(dep_expr.loc[dep_expr["ModelID"].astype(str).isin(pick_models), ["ModelID"] + genes_to_check])
    else:
        idx_to_check = rng.choice(dep_expr.index.to_numpy(), size=min(6, len(dep_expr)), replace=False)
        print(f"Spot-check row indices: {idx_to_check.tolist()}")
        display(dep_expr.loc[idx_to_check, genes_to_check])

    # --- missingness on sampled genes (fast) ---
    miss_rate = dep_expr[genes_sample].isna().mean().mean()
    print(f"\nMean missing rate across sampled numeric matrix: {miss_rate:.6f}")



Step 5A — DepMap expression: minimal descriptive sanity checks (memory-safe)
DepMap rows: 1,754 | cols: 19,221
ID-like cols detected: ['ModelID']
Numeric columns (candidate genes): 19,216


Unnamed: 0,name,n,mean,std,min,p01,p05,p25,p50,p75,p95,p99,max
0,depmap_expr_values_200genes,350800,2.747282,2.479266,0.0,0.0,0.0,0.138171,2.589681,4.638407,6.965323,8.712341,14.435996



Spot-check genes: ['OR56A5 (390084)', 'SLC25A29 (123096)', 'C2orf66 (401027)', 'ALOX12B (242)', 'ANAPC4 (29945)', 'QRFP (347148)']
Spot-check ModelID: ['ACH-000828', 'ACH-000022', 'ACH-001068', 'ACH-001680', 'ACH-002668', 'ACH-000767']


Unnamed: 0,ModelID,OR56A5 (390084),SLC25A29 (123096),C2orf66 (401027),ALOX12B (242),ANAPC4 (29945),QRFP (347148)
30,ACH-000828,0.0,6.05548,0.138877,0.0,4.67114,0.109374
225,ACH-001068,0.0,4.682792,1.860752,2.149549,4.087319,0.0
298,ACH-002668,0.0,3.155077,0.172539,0.0,3.491802,0.306409
393,ACH-001680,0.0,1.893288,0.17215,0.019111,4.333281,0.108731
478,ACH-000022,0.0,4.173847,0.246839,1.494072,4.486682,0.162964
1179,ACH-000767,0.0,5.498356,0.496501,1.683611,4.669322,0.183395



Mean missing rate across sampled numeric matrix: 0.000000


In [19]:
# ========= PRISM response =========
print_block("Step 5B — PRISM response: minimal descriptive sanity checks")

print(f"PRISM rows: {prism.shape[0]:,} | cols: {prism.shape[1]:,}")

# AUC column detection (robust)
auc_candidates = [c for c in prism.columns if str(c).lower() in {"auc", "auc_mean", "auc_prism"} or "auc" in str(c).lower()]
if len(auc_candidates) == 0:
    raise ValueError("Could not find an AUC-like column in PRISM dataframe. Check column names.")
auc_col = auc_candidates[0]
print(f"Using AUC column: {auc_col}")

# 1) Overall AUC distribution
auc_snap = quick_numeric_snapshot(prism[auc_col], name=f"prism_{auc_col}_overall")
display(auc_snap.to_frame().T)

# 2) Per-screen distribution if screen column exists
screen_candidates = [c for c in prism.columns if str(c).lower() in {"screen_id", "screen", "screen_name"}]
screen_col = screen_candidates[0] if screen_candidates else None

if screen_col is not None:
    print(f"\nScreen column detected: {screen_col}")
    # show top screens by count
    top_screens = prism[screen_col].value_counts(dropna=False).head(10)
    display(top_screens.to_frame("n_rows"))

    # per-screen quantiles (top 5 screens only to keep it light)
    top5 = top_screens.index[:5]
    rows = []
    for sc in top5:
        rows.append(quick_numeric_snapshot(prism.loc[prism[screen_col] == sc, auc_col], name=f"{screen_col}={sc}"))
    display(pd.DataFrame(rows))
else:
    print("\nNo screen column detected (OK). Skipping per-screen snapshot.")

# 3) Spot-check a few compounds and a few cell lines
compound_candidates = [c for c in prism.columns if str(c).lower() in {"broad_id", "compound", "drug", "pert_id"}]
cell_candidates = [c for c in prism.columns if str(c).lower() in {"depmap_id", "ccle_name", "cell_line", "cell"}]

compound_col = compound_candidates[0] if compound_candidates else None
cell_col = cell_candidates[0] if cell_candidates else None

print(f"\nCompound col: {compound_col} | Cell col: {cell_col}")

rng = np.random.default_rng(42)

cols_to_show = [c for c in [cell_col, compound_col, screen_col, auc_col] if c is not None]
cols_to_show = list(dict.fromkeys(cols_to_show))  # dedupe while preserving order

if compound_col is not None:
    compounds = prism[compound_col].dropna().astype(str).unique()
    pick_compounds = rng.choice(compounds, size=min(5, len(compounds)), replace=False).tolist()
    print(f"Spot-check compounds: {pick_compounds}")
    display(prism.loc[prism[compound_col].astype(str).isin(pick_compounds), cols_to_show].head(20))

if cell_col is not None:
    cells = prism[cell_col].dropna().astype(str).unique()
    pick_cells = rng.choice(cells, size=min(5, len(cells)), replace=False).tolist()
    print(f"Spot-check cells: {pick_cells}")
    display(prism.loc[prism[cell_col].astype(str).isin(pick_cells), cols_to_show].head(20))

# 4) Quick range sanity: AUC should not be constant / all missing
auc_num = _to_numeric_series(prism[auc_col])
print(f"\nAUC non-null: {auc_num.notna().sum():,} | unique (approx): {auc_num.dropna().round(6).nunique():,}")
print(f"AUC min/max: {auc_num.min()} / {auc_num.max()}")


Step 5B — PRISM response: minimal descriptive sanity checks
PRISM rows: 753,778 | cols: 20
Using AUC column: auc


Unnamed: 0,name,n,mean,std,min,p01,p05,p25,p50,p75,p95,p99,max
0,prism_auc_overall,753778,0.942378,0.281293,0.0,0.28958,0.503727,0.793111,0.907027,1.097597,1.442942,1.703991,4.889162



Screen column detected: screen_id


screen_id,n_rows
HTS002,602495
MTS010,116977
MTS006,33472
MTS005,834


Unnamed: 0,name,n,mean,std,min,p01,p05,p25,p50,p75,p95,p99,max
0,screen_id=HTS002,602495,0.971169,0.292115,0.004174,0.278742,0.535919,0.81048,0.916698,1.165028,1.479544,1.741179,4.889162
1,screen_id=MTS010,116977,0.812559,0.172539,0.0,0.339612,0.454659,0.712655,0.866787,0.950888,1.0,1.0,1.0
2,screen_id=MTS006,33472,0.87416,0.249803,0.06698,0.25041,0.400528,0.72993,0.897199,1.051512,1.238663,1.369608,2.339337
3,screen_id=MTS005,834,1.089643,0.138931,0.632434,0.81868,0.881082,0.981462,1.090548,1.173533,1.326709,1.450987,1.622269



Compound col: broad_id | Cell col: depmap_id
Spot-check compounds: ['BRD-A60594020-034-01-1', 'BRD-K73319509-001-08-0', 'BRD-K49371609-003-03-8', 'BRD-K42828737-001-03-3', 'BRD-K34073885-001-09-3']


Unnamed: 0,depmap_id,broad_id,screen_id,auc
83882,ACH-001001,BRD-K42828737-001-03-3,MTS010,0.92213
83883,ACH-000956,BRD-K42828737-001-03-3,MTS010,0.814937
83884,ACH-000948,BRD-K42828737-001-03-3,MTS010,0.985109
83885,ACH-000011,BRD-K42828737-001-03-3,MTS010,0.884877
83886,ACH-000026,BRD-K42828737-001-03-3,MTS010,0.903605
83887,ACH-000323,BRD-K42828737-001-03-3,MTS010,0.73652
83888,ACH-000905,BRD-K42828737-001-03-3,MTS010,0.839548
83889,ACH-000973,BRD-K42828737-001-03-3,MTS010,0.934278
83890,ACH-000896,BRD-K42828737-001-03-3,MTS010,0.885797
83891,ACH-000070,BRD-K42828737-001-03-3,MTS010,0.897736


Spot-check cells: ['ACH-000569', 'ACH-000107', 'ACH-000766', 'ACH-000715', 'ACH-000366']


Unnamed: 0,depmap_id,broad_id,screen_id,auc
71,ACH-000107,BRD-K36949735-001-01-1,MTS010,0.928761
397,ACH-000766,BRD-K36949735-001-01-1,MTS010,0.889187
559,ACH-000366,BRD-K36949735-001-01-1,MTS010,0.885681
579,ACH-000715,BRD-K36949735-001-01-1,MTS010,0.991025
757,ACH-000107,BRD-K13662825-001-07-5,MTS010,0.542889
1017,ACH-000766,BRD-K13662825-001-07-5,MTS010,0.463565
1138,ACH-000366,BRD-K13662825-001-07-5,MTS010,0.340145
1330,ACH-000107,BRD-K95142244-001-01-5,MTS010,0.867419
1503,ACH-000569,BRD-K95142244-001-01-5,MTS010,0.69426
1652,ACH-000766,BRD-K95142244-001-01-5,MTS010,0.575484



AUC non-null: 753,778 | unique (approx): 498,745
AUC min/max: 0.0 / 4.88916233189892


---

## Step 6 — Output summary & go/no-go
### Summary table
- File fingerprints
- Shapes
- Key columns detected
- Join key selected (provisional)
- Overlap statistics

### Go/no-go criteria
Proceed if:
- Both datasets load without errors
- Required identifiers exist
- There is a meaningful intersection of cell lines (non-trivial overlap)
- Response metric (AUC) is present and non-degenerate

If any criterion fails, stop here and resolve data access or mapping issues before continuing.
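
The final cell asserts that both tables carry a `join_id` column and points to a "Step 6-0" that does not appear in this notebook. A minimal sketch of what such a step might look like, assuming `join_id` is simply the whitespace-stripped `ModelID` (expression) and `depmap_id` (PRISM), which is consistent with the "Join key selected" row of the summary. `add_join_id` is a hypothetical helper and the toy IDs are invented.

```python
import pandas as pd


def add_join_id(df: pd.DataFrame, source_col: str) -> pd.DataFrame:
    """Return a copy with a 'join_id' column; missing IDs stay missing."""
    out = df.copy()
    out["join_id"] = out[source_col].astype("string").str.strip()
    return out


# Toy stand-ins for dep_expr and prism.
toy_dep_expr = add_join_id(pd.DataFrame({"ModelID": [" ACH-A", "ACH-B"]}), "ModelID")
toy_prism_tbl = add_join_id(
    pd.DataFrame({"depmap_id": ["ACH-A", None, "ACH-C "]}), "depmap_id"
)

shared_ids = set(toy_dep_expr["join_id"].dropna()) & set(toy_prism_tbl["join_id"].dropna())
```

In the notebook the equivalent calls would be `dep_expr = add_join_id(dep_expr, "ModelID")` and `prism = add_join_id(prism, "depmap_id")`. Each adds one column, which matches the shapes reported in the summary table (19,222 and 21 columns).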

In [None]:
# Preconditions
assert "dep_expr" in globals(), "dep_expr not found."
assert "prism" in globals(), "prism not found."
assert "join_id" in dep_expr.columns, "dep_expr must have 'join_id' (create it in Step 6-0)."
assert "join_id" in prism.columns, "prism must have 'join_id' (create it in Step 6-0)."

# AUC column (the minimal load showed it is 'auc', but detect it defensively)
auc_col = "auc" if "auc" in prism.columns else next((c for c in prism.columns if "auc" in str(c).lower()), None)

# Overlap
dep_set = set(dep_expr["join_id"].dropna().astype(str).str.strip().unique())
prism_set = set(prism["join_id"].dropna().astype(str).str.strip().unique())
shared = dep_set.intersection(prism_set)

dep_unique = len(dep_set)
prism_unique = len(prism_set)
shared_n = len(shared)
shared_vs_dep = round(100 * shared_n / max(1, dep_unique), 2)
shared_vs_prism = round(100 * shared_n / max(1, prism_unique), 2)

# AUC sanity
auc_present = auc_col is not None
auc_ok = False
auc_nonnull = 0
auc_unique = 0
if auc_present:
    auc_num = pd.to_numeric(prism[auc_col], errors="coerce")
    auc_nonnull = int(auc_num.notna().sum())
    auc_unique = int(auc_num.dropna().round(8).nunique())
    auc_ok = (auc_nonnull > 0) and (auc_unique >= 10)

# Summary table
fingerprints_present = "fingerprints" in globals()

summary = pd.DataFrame(
    [
        {"item": "Fingerprints computed", "value": bool(fingerprints_present)},
        {"item": "DepMap shape", "value": f"{dep_expr.shape[0]:,} × {dep_expr.shape[1]:,}"},
        {"item": "PRISM shape", "value": f"{prism.shape[0]:,} × {prism.shape[1]:,}"},
        {"item": "Join key selected", "value": "join_id (DepMap: ModelID, PRISM: depmap_id)"},
        {"item": "DepMap unique join_id", "value": dep_unique},
        {"item": "PRISM unique join_id", "value": prism_unique},
        {"item": "Shared join_id (intersection)", "value": shared_n},
        {"item": "Shared vs DepMap (%)", "value": shared_vs_dep},
        {"item": "Shared vs PRISM (%)", "value": shared_vs_prism},
        {"item": "AUC column detected", "value": auc_col},
        {"item": "AUC non-null count", "value": auc_nonnull},
        {"item": "AUC unique (rounded) values", "value": auc_unique},
    ]
)
display(summary)

# Go / No-Go
criteria = {
    "Both datasets load without errors": True,
    "Required identifiers exist (join_id)": (dep_unique > 0) and (prism_unique > 0),
    "Meaningful intersection (non-trivial overlap)": shared_n >= 25,
    "Response metric (AUC) is present and non-degenerate": auc_present and auc_ok,
}
criteria_df = pd.DataFrame([{"criterion": k, "pass": v} for k, v in criteria.items()])
display(criteria_df)

all_pass = bool(all(criteria.values()))
print("\nGO / NO-GO:", "✅ GO (proceed)" if all_pass else "⛔ NO-GO (stop & fix before continuing)")

if not all_pass:
    failed = [k for k, v in criteria.items() if not v]
    raise RuntimeError("Step 6 NO-GO. Failed criteria:\n- " + "\n- ".join(failed))


Unnamed: 0,item,value
0,Fingerprints computed,True
1,DepMap shape,"1,754 × 19,222"
2,PRISM shape,"753,778 × 21"
3,Join key selected,"join_id (DepMap: ModelID, PRISM: depmap_id)"
4,DepMap unique join_id,1699
5,PRISM unique join_id,738
6,Shared join_id (intersection),727
7,Shared vs DepMap (%),42.79
8,Shared vs PRISM (%),98.51
9,AUC column detected,auc


Unnamed: 0,criterion,pass
0,Both datasets load without errors,True
1,Required identifiers exist (join_id),True
2,Meaningful intersection (non-trivial overlap),True
3,Response metric (AUC) is present and non-degen...,True



GO / NO-GO: ✅ GO (proceed)


---