# 01 — Sanity Checks: DepMap Expression + PRISM Drug Response

## Purpose
This notebook performs foundational sanity checks for the MVP pipeline by validating:
- The presence and integrity of raw input files (fingerprinting).
- The basic structure and expected columns of DepMap expression and PRISM drug response datasets.
- The feasibility of joining datasets via shared cell line identifiers.
- Minimal descriptive statistics to detect obvious anomalies early (without heavy preprocessing).

This notebook is intentionally conservative: it does **not** train models, define final resistance labels, or perform feature engineering. Its role is to confirm that the project "stands on solid ground" before downstream analysis.

---

## Inputs (raw)
Expected raw file locations (not tracked by Git):

- DepMap expression (RNA-seq):  
  `data/raw/OmicsExpressionTPMLogp1HumanProteinCodingGenes.csv`

- PRISM drug response (secondary screen dose–response parameters):  
  `data/raw/prism-repurposing-20q2-secondary-screen-dose-response-curve-parameters.csv`

> Raw datasets are intentionally stored under `data/raw/` and excluded from version control to prevent accidental publication of large files and to keep the repository lightweight. Provenance and file fingerprints are documented instead.

---

## Reproducibility: file fingerprinting (SHA256)
To support strong reproducibility, we compute a SHA256 fingerprint for each raw file used by this notebook, along with file size.  
These fingerprints allow future users to confirm they are using the same source files (bitwise identity), even if filenames or download mechanisms change.

We **do not** hard-fail execution if fingerprints differ, so that the notebook still runs in other environments and on other copies of the files. Instead, we record the fingerprints and encourage users to report them when sharing results, so that any discrepancy can be traced back to differing inputs.
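
The "record, do not hard-fail" policy can be sketched as a soft comparison that warns on mismatch and lets execution continue. This is only an illustration: `compare_fingerprints`, the file names, and the shortened digests are hypothetical and not part of this notebook.

```python
import warnings


def compare_fingerprints(recorded: dict, current: dict) -> list:
    """Return the names of files whose SHA256 differs from the recorded value.

    A mismatch triggers a warning instead of an exception, so execution continues.
    Files without a recorded fingerprint are skipped.
    """
    mismatched = [
        name for name, sha in current.items()
        if name in recorded and recorded[name] != sha
    ]
    for name in mismatched:
        warnings.warn(f"SHA256 mismatch for {name}: got {current[name]}")
    return mismatched


# Hypothetical digests, shortened for readability (real SHA256 digests have 64 hex characters).
recorded = {"expression.csv": "aaa111", "response.csv": "bbb222"}
current = {"expression.csv": "aaa111", "response.csv": "ccc333"}

# Capture the warnings here only to keep the demo output clean.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mismatched = compare_fingerprints(recorded, current)

print(mismatched)  # files whose fingerprints should be mentioned when sharing results
```

Returning the mismatching names, rather than raising, leaves the decision to stop or continue with the user.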

---

## Step 0 — Environment & configuration
- Import core packages
- Define file paths
- Basic runtime checks (paths exist, readable)

In [None]:
from __future__ import annotations

import hashlib
from pathlib import Path
import pandas as pd

# --- Paths (Windows-friendly). Use forward slashes; Path() handles it on Windows too.
DEP_MAP_EXPR_PATH = Path("../data/raw/OmicsExpressionTPMLogp1HumanProteinCodingGenes.csv")
PRISM_DOSE_RESP_PATH = Path(
    "../data/raw/prism-repurposing-20q2-secondary-screen-dose-response-curve-parameters.csv"
)

# --- Basic runtime checks
for p in [DEP_MAP_EXPR_PATH, PRISM_DOSE_RESP_PATH]:
    if not p.exists():
        raise FileNotFoundError(f"Missing file: {p.resolve()}")

print("✅ Input files found:")
print(f"- DepMap expression: {DEP_MAP_EXPR_PATH.resolve()}")
print(f"- PRISM dose-response: {PRISM_DOSE_RESP_PATH.resolve()}")


✅ Input files found:
- DepMap expression: C:\Users\paula\OneDrive\Documentos\Proyectos\epigenetic-drug-resistance\data\raw\OmicsExpressionTPMLogp1HumanProteinCodingGenes.csv
- PRISM dose-response: C:\Users\paula\OneDrive\Documentos\Proyectos\epigenetic-drug-resistance\data\raw\prism-repurposing-20q2-secondary-screen-dose-response-curve-parameters.csv


---

## Step 1 — Fingerprint raw files
- Report file name, size (MB), and SHA256 for:
  - DepMap expression file
  - PRISM response file

**Expected outcome**
- Both files exist locally
- Fingerprints are computed successfully

In [2]:
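def file_sha256(path: Path, chunk_size: int = 8_192) -> str:
    """Return the SHA256 hex digest of a file, read in chunks to bound memory use."""
    sha256 = hashlib.sha256()
    with path.open("rb") as f:
        # iter(callable, sentinel) keeps reading until f.read() returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

def describe_file(path: Path) -> dict:
    """Collect file name, size in MiB (rounded to 2 decimals), and SHA256."""
    size_bytes = path.stat().st_size
    return {
        "file": path.name,
        "size_mb": round(size_bytes / (1024**2), 2),
        "sha256": file_sha256(path),
    }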

fingerprints = pd.DataFrame(
    [describe_file(DEP_MAP_EXPR_PATH), describe_file(PRISM_DOSE_RESP_PATH)]
)

fingerprints

| | file | size_mb | sha256 |
|---|---|---|---|
| 0 | OmicsExpressionTPMLogp1HumanProteinCodingGenes... | 517.87 | 9d7e64ebbcb2811fa5e0ecf56d952ee96ca9f1f2b6ead0... |
| 1 | prism-repurposing-20q2-secondary-screen-dose-r... | 276.73 | 2ac69a21f1d681fe7447689262b82ca6e3dc90bfef0bd9... |


---

## Step 2 — Load DepMap expression (minimal read)
### Goals
- Confirm the table loads correctly
- Validate expected index/identifier columns
- Verify dimensionality is plausible for this release (on the order of 19,000 protein-coding gene columns and more than a thousand models)

### Checks
- Shape
- Column presence and basic types
- Missingness overview (high-level)

**Notes**
The DepMap expression file is expected to contain gene-level log2(TPM+1) expression values for protein-coding genes. It should also include a cell line identifier column (commonly `ModelID` and/or `DepMap_ID`, depending on the file format).

In [3]:
# Minimal load to inspect structure without stressing memory

# Read only the header first
depmap_cols = pd.read_csv(DEP_MAP_EXPR_PATH, nrows=0).columns.tolist()

print(f"Number of columns: {len(depmap_cols)}")
print("First 10 columns:")
depmap_cols[:10]


Number of columns: 19221
First 10 columns:


['Unnamed: 0',
 'SequencingID',
 'ModelID',
 'IsDefaultEntryForModel',
 'ModelConditionID',
 'IsDefaultEntryForMC',
 'TSPAN6 (7105)',
 'TNMD (64102)',
 'DPM1 (8813)',
 'SCYL3 (57147)']

In [4]:
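# Heuristic check for identifier columns.
# Caveat: the bare "id" substring also matches gene symbols (e.g. IDH1, ARID1A),
# so the result mixes true identifier columns with gene columns.
id_like_cols = [c for c in depmap_cols if any(
    key.lower() in c.lower()
    for key in ["depmap", "model", "ccle", "id"]
)]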

id_like_cols

['SequencingID',
 'ModelID',
 'IsDefaultEntryForModel',
 'ModelConditionID',
 'JARID2 (3720)',
 'IDS (3423)',
 'BID (637)',
 'ARID4A (5926)',
 'ARID1B (57492)',
 'ARID4B (51742)',
 'IDI1 (3422)',
 'IDH3G (3421)',
 'SIDT1 (54847)',
 'MID2 (11043)',
 'NID2 (22795)',
 'PRELID3B (51012)',
 'DIDO1 (11083)',
 'GID8 (54994)',
 'IDH3B (3420)',
 'MID1 (4281)',
 'ID2 (3398)',
 'ARID3A (1820)',
 'NID1 (4811)',
 'ID3 (3399)',
 'ARID1A (8289)',
 'IDE (3416)',
 'ID1 (3397)',
 'PCID2 (55795)',
 'IDUA (3425)',
 'IDO1 (3620)',
 'RIDA (10247)',
 'KIDINS220 (57498)',
 'CIDEB (27141)',
 'ATRAID (51374)',
 'IDH1 (3417)',
 'ITPRID2 (6744)',
 'GID4 (79018)',
 'PRELID3A (10650)',
 'IDNK (414328)',
 'IDI2 (91734)',
 'SIDT2 (51092)',
 'ARID5B (84159)',
 'GRID2 (2895)',
 'PID1 (55022)',
 'MIDEAS (91748)',
 'SPIDR (23514)',
 'MID1IP1 (58526)',
 'IDH3A (3419)',
 'MIDN (90007)',
 'NFKBID (84807)',
 'HID1 (283987)',
 'PRELID1 (27166)',
 'PPID (5481)',
 'ID4 (3400)',
 'CIDEA (1149)',
 'EID2 (163126)',
 'EID2B (126272
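
The list above is dominated by gene columns because the substring `id` occurs in many gene symbols. A stricter filter could first discard anything that looks like a gene column. The sketch below assumes gene columns follow the `SYMBOL (EntrezID)` pattern seen in the header; `GENE_COL` and `metadata_columns` are illustrative names, and `sample_cols` is a small hand-picked sample of the real header.

```python
import re

# Assumption: gene columns look like "SYMBOL (EntrezID)", e.g. "TSPAN6 (7105)".
GENE_COL = re.compile(r"^\S+ \(\d+\)$")


def metadata_columns(columns: list) -> list:
    """Keep only the columns that do not look like gene columns."""
    return [c for c in columns if not GENE_COL.match(c)]


sample_cols = [
    "Unnamed: 0", "SequencingID", "ModelID", "IsDefaultEntryForModel",
    "ModelConditionID", "IsDefaultEntryForMC",
    "TSPAN6 (7105)", "IDH1 (3417)", "ARID1A (8289)",
]

meta = metadata_columns(sample_cols)
# Among the metadata columns, a suffix test is enough to pick out identifiers.
id_like = [c for c in meta if c.endswith("ID")]
print(id_like)
```

In the notebook, `metadata_columns(depmap_cols)` would play the role of `sample_cols` here.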

In [5]:
depmap_preview = pd.read_csv(
    DEP_MAP_EXPR_PATH,
    nrows=100
)

depmap_preview.head()

Unnamed: 0.1,Unnamed: 0,SequencingID,ModelID,IsDefaultEntryForModel,ModelConditionID,IsDefaultEntryForMC,TSPAN6 (7105),TNMD (64102),DPM1 (8813),SCYL3 (57147),...,ATXN8 (724066),SMIM42 (117981789),NPBWR1 (2831),ACTL10 (170487),RNF228 (122319436),PANO1 (101927423),HRURF (120766137),PRRC2B (84726),F8A2 (474383),F8A1 (8263)
0,0,CDS-010xbm,ACH-001113,Yes,MC-001113-k2lR,Yes,4.956577,0.0,7.577648,3.179411,...,0.0,0.0,0.414727,0.077634,1.113094,0.411901,0.0,5.134808,1.214541,4.315653
1,1,CDS-02TzJp,ACH-001289,Yes,MC-001289-BpdI,Yes,4.954992,0.617243,7.334747,2.783576,...,0.0,0.0,0.02984,0.0,1.262156,1.026871,0.0,5.951231,0.007227,4.25106
2,2,CDS-0693hw,ACH-001339,Yes,MC-001339-5nRN,Yes,3.421952,0.0,7.546069,2.61588,...,0.0,0.0,0.127768,0.105594,0.413829,0.540575,0.0,4.205971,0.030894,2.783448
3,3,CDS-07Plat,ACH-001619,No,MC-001619-IR6I,No,5.196729,0.0,6.362268,2.144996,...,0.0,0.0,0.075396,0.932918,0.0,0.511978,0.0,4.510747,3.223597,5.106595
4,4,CDS-08FOcu,ACH-001979,Yes,MC-001979-E3qW,Yes,4.651643,0.0,5.946408,2.454515,...,0.0,0.0,0.012012,0.0,1.00647,0.729221,0.0,4.922916,0.150926,4.661556


In [6]:
depmap_preview.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 19221 entries, Unnamed: 0 to F8A1 (8263)
dtypes: float64(19215), int64(1), object(5)
memory usage: 14.7+ MB


In [7]:
depmap_preview.describe().iloc[:, :5]  # first few columns only

| | Unnamed: 0 | TSPAN6 (7105) | TNMD (64102) | DPM1 (8813) | SCYL3 (57147) |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 49.5 | 3.858736 | 0.110574 | 6.809955 | 2.585842 |
| std | 29.011492 | 1.675112 | 0.789898 | 0.633608 | 0.610053 |
| min | 0.0 | 0.0 | 0.0 | 5.519271 | 1.063075 |
| 25% | 24.75 | 3.46932 | 0.0 | 6.394414 | 2.124794 |
| 50% | 49.5 | 4.2882 | 0.0 | 6.769404 | 2.600558 |
| 75% | 74.25 | 4.930172 | 0.0 | 7.207496 | 2.936453 |
| max | 99.0 | 7.10594 | 7.858733 | 8.576556 | 4.170262 |
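
The checks for this step also list a high-level missingness overview, which none of the cells above computes. A minimal sketch follows, assuming a preview frame such as `depmap_preview`; `missingness_summary` is a hypothetical helper, and the toy frame (including its model IDs and gene names) is invented for illustration.

```python
import numpy as np
import pandas as pd


def missingness_summary(df: pd.DataFrame) -> dict:
    """Summarise missing values without printing thousands of per-column rows."""
    na_frac = df.isna().mean()  # fraction of missing values per column
    return {
        "n_rows": int(df.shape[0]),
        "n_cols": int(df.shape[1]),
        "cols_with_any_na": int((na_frac > 0).sum()),
        "max_na_frac": float(na_frac.max()),
        "overall_na_frac": float(df.isna().to_numpy().mean()),
    }


# Toy stand-in for the expression preview (values are made up).
toy = pd.DataFrame({
    "ModelID": ["ACH-A", "ACH-B", "ACH-C", "ACH-D"],
    "GENE1 (1)": [1.0, np.nan, 3.0, 4.0],
    "GENE2 (2)": [0.0, 0.5, 1.5, 2.0],
})

summary = missingness_summary(toy)
print(summary)
```

Applied to the notebook's data, the call would be `missingness_summary(depmap_preview)`.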


---

## Step 3 — Load PRISM drug response (minimal read)
### Goals
- Confirm the table loads correctly
- Validate expected response metrics exist (AUC, IC50, EC50) and identifiers (drug name/broad_id, depmap_id, etc.)
- Ensure there are multiple compounds and multiple cell lines

### Checks
- Shape
- Column presence (especially: `depmap_id`, `name`/drug identifier, `auc`)
- Missingness overview for key variables

**Notes**
This project will use **AUC** as the primary continuous drug response measure for the MVP, with optional binarization defined later.

In [8]:
# Read only the PRISM header to list the available columns
prism_cols = pd.read_csv(PRISM_DOSE_RESP_PATH, nrows=0).columns.tolist()

print(f"Number of columns: {len(prism_cols)}")
print("First 30 columns:")
prism_cols[:30]

Number of columns: 20
First 30 columns:


['broad_id',
 'depmap_id',
 'ccle_name',
 'screen_id',
 'upper_limit',
 'lower_limit',
 'slope',
 'r2',
 'auc',
 'ec50',
 'ic50',
 'passed_str_profiling',
 'row_name',
 'name',
 'moa',
 'target',
 'disease.area',
 'indication',
 'smiles',
 'phase']

In [9]:
# Identify key columns (depmap_id, drug id, AUC)
key_terms = ["depmap", "auc", "ic50", "ec50", "name", "broad", "compound", "dose"]
key_like = [c for c in prism_cols if any(k in c.lower() for k in key_terms)]
key_like

['broad_id',
 'depmap_id',
 'ccle_name',
 'auc',
 'ec50',
 'ic50',
 'row_name',
 'name']
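
Substring matching shows which columns look relevant, but it does not confirm that the columns needed downstream are present. An explicit check could look like the sketch below. The required set mirrors the columns read in Step 4; `REQUIRED_PRISM_COLS` and `missing_columns` are illustrative names.

```python
# Columns that Step 4 reads from the PRISM file.
REQUIRED_PRISM_COLS = {"depmap_id", "broad_id", "name", "auc"}


def missing_columns(columns, required) -> list:
    """Return the required columns that are absent, sorted for stable output."""
    return sorted(set(required) - set(columns))


# Header as printed by the cell above.
header = [
    "broad_id", "depmap_id", "ccle_name", "screen_id", "upper_limit",
    "lower_limit", "slope", "r2", "auc", "ec50", "ic50",
    "passed_str_profiling", "row_name", "name", "moa", "target",
    "disease.area", "indication", "smiles", "phase",
]

missing = missing_columns(header, REQUIRED_PRISM_COLS)
if missing:
    raise KeyError(f"PRISM file lacks required columns: {missing}")
print("All required PRISM columns are present")
```

Failing early with the list of absent columns is usually easier to diagnose than the `usecols` error that `pd.read_csv` would raise later.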

In [10]:
# Load a small preview
prism_preview = pd.read_csv(PRISM_DOSE_RESP_PATH, nrows=200)
prism_preview.head()

Unnamed: 0,broad_id,depmap_id,ccle_name,screen_id,upper_limit,lower_limit,slope,r2,auc,ec50,ic50,passed_str_profiling,row_name,name,moa,target,disease.area,indication,smiles,phase
0,BRD-K36949735-001-01-1,ACH-000948,2313287_STOMACH,MTS010,1.088523,-1.079841,8.284451,0.913068,0.989373,10.375348,9.209687,True,PR500_ACH-000948,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
1,BRD-K36949735-001-01-1,ACH-000011,253J_URINARY_TRACT,MTS010,1.004565,-2.248875,7.883544,0.934277,0.988011,11.475826,9.255366,True,PR500_ACH-000011,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
2,BRD-K36949735-001-01-1,ACH-000026,253JBV_URINARY_TRACT,MTS010,1.068269,-21.457752,1.216485,0.877853,0.958743,167.921713,8.327173,True,PR500_ACH-000026,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
3,BRD-K36949735-001-01-1,ACH-000323,42MGBA_CENTRAL_NERVOUS_SYSTEM,MTS010,0.985761,0.029825,0.929279,0.929666,0.814224,2.221636,2.300992,True,PR500_ACH-000323,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3
4,BRD-K36949735-001-01-1,ACH-000905,5637_URINARY_TRACT,MTS010,1.129441,0.017405,1.158793,0.929414,0.830589,1.459074,1.835021,True,PR500_ACH-000905,anlotinib,"VEGFR inhibitor, PDGFR tyrosine kinase recepto...","KDR, PDGFRB",,,COc1cc2c(Oc3ccc4[nH]c(C)cc4c3F)ccnc2cc1OCC1(N)CC1,Phase 2/Phase 3


In [11]:
# Basic sanity info
prism_preview.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   broad_id              200 non-null    object 
 1   depmap_id             200 non-null    object 
 2   ccle_name             196 non-null    object 
 3   screen_id             200 non-null    object 
 4   upper_limit           200 non-null    float64
 5   lower_limit           200 non-null    float64
 6   slope                 200 non-null    float64
 7   r2                    200 non-null    float64
 8   auc                   200 non-null    float64
 9   ec50                  200 non-null    float64
 10  ic50                  171 non-null    float64
 11  passed_str_profiling  200 non-null    bool   
 12  row_name              200 non-null    object 
 13  name                  200 non-null    object 
 14  moa                   200 non-null    object 
 15  target                2

---

## Step 4 — Identifier alignment feasibility
### Goals
- Identify the join key(s) that connect DepMap expression to PRISM response
- Quantify overlap between datasets

### Checks
- Candidate keys (e.g., `depmap_id`, `ModelID`, `ccle_name`)
- Overlap counts (how many cell lines are shared)
- Proportion of PRISM entries mappable to expression features

**Deliverable**
A clear choice of join key for downstream steps (documented, but not yet “finalized forever”).

In [None]:
# --- Load minimal identifiers from expression
expr_ids = pd.read_csv(
    DEP_MAP_EXPR_PATH,
    usecols=["ModelID", "SequencingID", "ModelConditionID"],
)

# --- Load minimal PRISM core
prism_core = pd.read_csv(
    PRISM_DOSE_RESP_PATH,
    usecols=["depmap_id", "broad_id", "name", "auc"],
)

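# --- Defensive normalization
# Caveat: astype(str) turns missing IDs into the literal string "nan",
# so the dropna() calls below do not remove them.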
expr_ids["ModelID"] = expr_ids["ModelID"].astype(str).str.strip()
prism_core["depmap_id"] = prism_core["depmap_id"].astype(str).str.strip()
prism_core["broad_id"] = prism_core["broad_id"].astype(str).str.strip()

print("Expression ModelID example:", expr_ids["ModelID"].dropna().iloc[0])
print("PRISM depmap_id example:", prism_core["depmap_id"].dropna().iloc[0])

# --- Overlap (cell lines)
expr_set = set(expr_ids["ModelID"].dropna().unique())
prism_set = set(prism_core["depmap_id"].dropna().unique())
shared = expr_set.intersection(prism_set)

overlap_stats = pd.DataFrame([{
    "n_expression_models": len(expr_set),
    "n_prism_cell_lines": len(prism_set),
    "n_shared_cell_lines": len(shared),
    "shared_vs_expression_pct": round(100 * len(shared) / max(len(expr_set), 1), 2),
    "shared_vs_prism_pct": round(100 * len(shared) / max(len(prism_set), 1), 2),
}])

overlap_stats

Expression ModelID example: ACH-001113
PRISM depmap_id example: ACH-000948


| | n_expression_models | n_prism_cell_lines | n_shared_cell_lines | shared_vs_expression_pct | shared_vs_prism_pct |
|---|---|---|---|---|---|
| 0 | 1699 | 738 | 727 | 42.79 | 98.51 |


In [None]:
prism_core["mappable_to_expression"] = prism_core["depmap_id"].isin(expr_set)

pd.DataFrame([{
    "total_prism_rows": len(prism_core),
    "mappable_rows": int(prism_core["mappable_to_expression"].sum()),
    "mappable_rows_pct": round(100 * prism_core["mappable_to_expression"].mean(), 2),
    "unmappable_rows": int((~prism_core["mappable_to_expression"]).sum()),
}])

| | total_prism_rows | mappable_rows | mappable_rows_pct | unmappable_rows |
|---|---|---|---|---|
| 0 | 753778 | 738001 | 97.91 | 15777 |


In [None]:
(
    prism_core.loc[~prism_core["mappable_to_expression"], "depmap_id"]
    .value_counts()
    .head(20)
)

depmap_id
nan           10137
ACH-001024     1557
ACH-000047     1494
ACH-000309     1484
ACH-001212      169
ACH-000658      163
ACH-001205      159
ACH-000534      158
ACH-000992      157
ACH-001078      152
ACH-000010      147
Name: count, dtype: int64
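
The `nan` entry at the top of this list is not a cell line. In the environment that produced this output, casting with `astype(str)` turned missing `depmap_id` values into the literal string `nan`, which `dropna()` no longer removes. One possible alternative, sketched below, is pandas' nullable string dtype, which keeps missing values missing. `normalize_ids` is a hypothetical helper and the toy series is illustrative.

```python
import numpy as np
import pandas as pd


def normalize_ids(ids: pd.Series) -> pd.Series:
    """Strip surrounding whitespace while keeping missing values missing."""
    return ids.astype("string").str.strip()


# Toy IDs with stray whitespace and one missing value.
raw = pd.Series([" ACH-000948", "ACH-000011 ", np.nan, "ACH-000948"])

clean = normalize_ids(raw)
n_missing = int(clean.isna().sum())        # missing IDs remain countable as missing
n_unique = len(set(clean.dropna().unique()))  # dropna() now removes them as intended

print(n_missing, n_unique)
```

With this normalization, missing IDs would be reported through `isna()` instead of appearing as a pseudo cell line in the overlap and duplicate counts.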

In [15]:
dup_mask = prism_core.duplicated(subset=["depmap_id", "broad_id"], keep=False)

pd.DataFrame([{
    "total_rows": len(prism_core),
    "duplicate_rows": int(dup_mask.sum()),
    "duplicate_rows_pct": round(100 * dup_mask.mean(), 6),
    "unique_pairs_cell_drug": prism_core[["depmap_id", "broad_id"]].drop_duplicates().shape[0],
}])


| | total_rows | duplicate_rows | duplicate_rows_pct | unique_pairs_cell_drug |
|---|---|---|---|---|
| 0 | 753778 | 99402 | 13.187172 | 699583 |


### Duplicate cell line × drug entries (PRISM)

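Multiple rows per (`depmap_id`, `broad_id`) combination are present in the PRISM dose–response dataset: 99,402 of 753,778 rows (~13.2%) belong to a pair that appears more than once, leaving 699,583 unique pairs.
This is expected and most likely reflects:
- multiple experimental screens (e.g. MTS010),
- multiple fitted dose–response curves per compound–cell line pair.

Neither cause is verified in this notebook, because `screen_id` is not loaded into `prism_core`. Rows with a missing `depmap_id` are also counted under the string `nan`, which may inflate the duplicate count.

No resolution strategy is applied at this stage.
The choice of how to handle duplicate entries (e.g. screen selection, aggregation, or quality-based filtering)
is deferred to the next notebook, where response targets are formally defined.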
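
Whether duplicates come from multiple screens or from repeated fits within one screen can be checked by counting distinct screens per pair. The sketch below assumes `screen_id` is loaded alongside the other columns. The toy rows, the placeholder IDs, and the screen label `SCREEN_X` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for PRISM rows (values and IDs are made up).
toy = pd.DataFrame({
    "depmap_id": ["ACH-1", "ACH-1", "ACH-2", "ACH-2", "ACH-3"],
    "broad_id": ["BRD-A", "BRD-A", "BRD-A", "BRD-A", "BRD-A"],
    "screen_id": ["SCREEN_X", "MTS010", "MTS010", "MTS010", "MTS010"],
    "auc": [0.91, 0.88, 0.75, 0.80, 0.99],
})

# Per (cell line, drug) pair: number of rows and number of distinct screens.
per_pair = (
    toy.groupby(["depmap_id", "broad_id"], dropna=False)
    .agg(n_rows=("auc", "size"), n_screens=("screen_id", "nunique"))
    .reset_index()
)

dups = per_pair[per_pair["n_rows"] > 1]
across_screens = int((dups["n_screens"] > 1).sum())  # pair measured in several screens
within_screen = int((dups["n_screens"] == 1).sum())  # repeated rows within one screen

print(across_screens, within_screen)
```

The split is deliberately coarse: a pair with three rows across two screens counts as "across screens" even though it also has a within-screen repeat.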

---

## Step 5 — Minimal descriptive sanity checks
### DepMap expression
- Distribution snapshots for expression values (no heavy plotting required)
- Spot-check a few genes and a few models

### PRISM response
- Distribution of AUC (overall and per-screen if applicable)
- Spot-check a few compounds and a few cell lines

**Purpose**
Catch glaring issues early (e.g., all zeros, swapped columns, unexpected ranges).
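
No cell in this notebook implements these checks yet. As a starting point, a single helper could summarise any numeric column (a gene or `auc`) and flag degenerate cases. `numeric_sanity` is a hypothetical name. The toy values are the five `auc` values from the PRISM preview, rounded, plus one missing entry.

```python
import pandas as pd


def numeric_sanity(values: pd.Series) -> dict:
    """Summarise a numeric column and flag obviously degenerate cases."""
    clean = values.dropna()
    return {
        "n": int(clean.size),
        "na_frac": float(values.isna().mean()),
        "zero_frac": float((clean == 0).mean()),
        "min": float(clean.min()),
        "median": float(clean.median()),
        "max": float(clean.max()),
        "is_constant": bool(clean.nunique() <= 1),  # e.g. all zeros
    }


toy_auc = pd.Series([0.989, 0.988, 0.959, 0.814, 0.831, None])
stats = numeric_sanity(toy_auc)
print(stats)
```

For a per-screen view, the same helper could be applied within `groupby("screen_id")`, provided that column is loaded.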

---

## Step 6 — Output summary & go/no-go
### Summary table
- File fingerprints
- Shapes
- Key columns detected
- Join key selected (provisional)
- Overlap statistics

### Go/no-go criteria
Proceed if:
- Both datasets load without errors
- Required identifiers exist
- There is a meaningful intersection of cell lines (non-trivial overlap)
- Response metric (AUC) is present and non-degenerate

If any criterion fails, stop here and resolve data access or mapping issues before continuing.
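
The criteria above can be collected into one explicit decision, as in this sketch. `go_no_go` and the threshold `MIN_SHARED` are hypothetical. The three `True` flags are placeholders for results computed in earlier steps, and `n_shared` is the overlap reported in Step 4.

```python
def go_no_go(checks: dict) -> tuple:
    """Return (proceed, failed_check_names) from a mapping of check name to bool."""
    failed = [name for name, ok in checks.items() if not ok]
    return (not failed, failed)


MIN_SHARED = 100  # hypothetical threshold for a "non-trivial" overlap
n_shared = 727    # shared cell lines reported in Step 4

checks = {
    "datasets_load": True,         # placeholder: both files were read without errors
    "identifiers_present": True,   # placeholder: ModelID / depmap_id were found
    "overlap_non_trivial": n_shared >= MIN_SHARED,
    "auc_non_degenerate": True,    # placeholder: AUC present and not constant
}

proceed, failed = go_no_go(checks)
print("GO" if proceed else f"NO-GO, failed checks: {failed}")
```

Naming each check makes a failed run self-explanatory, since the summary states which criterion blocked the pipeline.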

---


## Next notebook (planned)
- `02_define_response_targets.ipynb`  
  Define drug-level targets for modeling (AUC), choose modeling strategy (per-drug vs multi-task), and formalize any binarization rules for resistance labels (if used).