### 🚫 Remove Cell Lines Not Present in Expression Matrix

Some cell lines in the GDSC drug response data have no matching column in the single-cell expression matrix. Left in place, they would end up with missing PCA values after the merge, so we drop their rows first to keep the two tables aligned.


In [1]:
import polars as pl

# Load GDSC drug response dataset
gdsc_bulk = pl.read_parquet("../../data/gdsc/gdsc_final_cleaned.parquet")

# Load single-cell gene expression (transposed version: one column per cell line)
# Only the column names are used here; they hold the cell line IDs
cell_expr_df = pl.read_parquet("../../data/sc_data/rnaseq_fpkm.parquet")
expr_cell_lines = cell_expr_df.columns[5:]  # Skip first 5 metadata columns (SANGER_MODEL_ID, etc.)

print(f"🧬 Cell lines with gene expression data: {len(expr_cell_lines)}")

# Keep only GDSC rows whose cell line has expression data
gdsc_filtered = gdsc_bulk.filter(pl.col("SANGER_MODEL_ID").is_in(expr_cell_lines))

print(f"✅ GDSC rows before filtering: {gdsc_bulk.shape[0]}")
print(f"✅ GDSC rows after filtering:  {gdsc_filtered.shape[0]}")
print(f"🧹 Removed {gdsc_bulk.shape[0] - gdsc_filtered.shape[0]} unmatched cell line rows.")

# Find missing cell lines (in GDSC but not in expression matrix)
missing_cell_lines = gdsc_bulk.select("SANGER_MODEL_ID").unique().filter(
    ~pl.col("SANGER_MODEL_ID").is_in(expr_cell_lines)
)

# Print summary
print("❗ Cell lines in GDSC but NOT in gene expression matrix:")
print(missing_cell_lines)

# Save the filtered GDSC dataset (note: this overwrites the input file loaded above)
gdsc_filtered.write_parquet("../../data/gdsc/gdsc_final_cleaned.parquet")
print("💾 Saved filtered GDSC dataset to 'gdsc_final_cleaned.parquet'")



🧬 Cell lines with gene expression data: 1427
✅ GDSC rows before filtering: 575197
✅ GDSC rows after filtering:  571985
🧹 Removed 3212 unmatched cell line rows.
❗ Cell lines in GDSC but NOT in gene expression matrix:
shape: (7, 1)
┌─────────────────┐
│ SANGER_MODEL_ID │
│ ---             │
│ str             │
╞═════════════════╡
│ SIDM00361       │
│ SIDM01201       │
│ SIDM01261       │
│ SIDM00205       │
│ SIDM01219       │
│ SIDM00003       │
│ SIDM01021       │
└─────────────────┘
💾 Saved filtered GDSC dataset to 'gdsc_final_cleaned.parquet'
