# Extract Top Highly Variable Genes (HVGs)

This notebook identifies and selects the top **N most variable genes** across all pseudo-bulk cell line expression profiles.

The result is a reduced expression matrix that keeps only the highest-variance genes, which are typically the most biologically informative.


In [6]:
import polars as pl
import numpy as np
import pandas as pd
import os


## 1. Load and Preprocess Pseudo-Bulk Expression Matrix

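We load the raw single-cell pseudo-bulk expression matrix and prepare it for variance calculation. This includes:
- Transposing so that rows are cell lines and columns are genes
- Dropping metadata columns (`model_name`, `dataset_name`, `data_source`, `gene_id`)
- Removing the leftover gene identifier header row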


In [7]:
# Load original expression matrix
aligned = pd.read_parquet("../../data/sc_data/rnaseq_fpkm.parquet")
print("✅ Raw data loaded:", aligned.shape)

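# Transpose so that rows are cell lines and columns are genes
transposed_df = aligned.set_index(aligned.columns[0]).transpose()
# Coerce to numeric; non-numeric entries become NaN and are then set to 0.0
transposed_df = transposed_df.apply(pd.to_numeric, errors='coerce').fillna(0.0)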

# Reset index to make SANGER_MODEL_ID a column
transposed_df.index.name = "SANGER_MODEL_ID"
transposed_df.reset_index(inplace=True)
print("✅ Transposition complete. Shape:", transposed_df.shape)

# Convert to Polars
cell_gene_matrix = pl.from_pandas(transposed_df)

# Drop metadata columns
cell_gene_matrix = cell_gene_matrix.drop(["model_name", "dataset_name", "data_source", "gene_id"])

# Drop the first row (transposed gene IDs)
cell_gene_matrix = cell_gene_matrix.slice(1)

print("✅ Cleaned gene matrix shape:", cell_gene_matrix.shape)


✅ Raw data loaded: (37606, 1432)
✅ Transposition complete. Shape: (1431, 37607)
✅ Cleaned gene matrix shape: (1430, 37603)
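
To see what the transpose step does to the layout, here is a minimal sketch on a toy frame. The column names (`gene`, `SIDM_A`, `SIDM_B`), gene IDs, and values are made up for illustration; only the `set_index(...).transpose()` and `to_numeric` pattern mirrors the cell above.

```python
import pandas as pd

# Toy genes-by-cell-lines table; all names and values are hypothetical
toy = pd.DataFrame({
    "gene": ["SIDG_1", "SIDG_2", "SIDG_3"],
    "SIDM_A": ["1.5", "0.0", "n/a"],  # strings, as in a mixed-type export
    "SIDM_B": ["2.5", "4.0", "7.0"],
})

# Same pattern as above: the first column becomes the header, then transpose
toy_t = toy.set_index(toy.columns[0]).transpose()

# Non-numeric entries become NaN and are then replaced with 0.0
toy_t = toy_t.apply(pd.to_numeric, errors="coerce").fillna(0.0)

# Expose the cell line ID as a regular column
toy_t.index.name = "SANGER_MODEL_ID"
toy_t = toy_t.reset_index()

print(toy_t.shape)          # (2, 4)
print(list(toy_t.columns))  # ['SANGER_MODEL_ID', 'SIDG_1', 'SIDG_2', 'SIDG_3']
```

Note that the `"n/a"` entry silently becomes `0.0`, which is the same behavior the real matrix gets from `errors='coerce'` followed by `fillna(0.0)`.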


## 2. Compute Variance and Select Top Genes

We compute variance for each gene across all cell lines and keep the top N most variable genes.


In [8]:
non_gene_cols = ["SANGER_MODEL_ID"]
gene_cols = [col for col in cell_gene_matrix.columns if col not in non_gene_cols]

# Store IDs and gene matrix
sanger_ids = cell_gene_matrix.select("SANGER_MODEL_ID").to_pandas()
X = cell_gene_matrix.select(gene_cols).to_pandas().to_numpy()

print("✅ Gene matrix extracted. Shape:", X.shape)

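# Compute per-gene variance across cell lines (np.var defaults to ddof=0)
TOP_N = 2000
variances = np.var(X, axis=0)
# argsort is ascending, so the last TOP_N indices are the highest-variance genes,
# ordered from lowest to highest variance within the selection
top_indices = np.argsort(variances)[-TOP_N:]
top_genes = [gene_cols[i] for i in top_indices]

print(f"✅ Selected top {TOP_N} HVGs.")
# Note: the first entries are the least variable genes within the top-N set
print("Top HVGs preview:", top_genes[:5])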


✅ Gene matrix extracted. Shape: (1430, 37602)
✅ Selected top 2000 HVGs.
Top HVGs preview: ['SIDG38622', 'SIDG26269', 'SIDG07496', 'SIDG39680', 'SIDG05293']
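
Because `np.argsort` sorts in ascending order, the selection above lists genes from lowest to highest variance within the top N. The sketch below reproduces that selection on a tiny hypothetical matrix and shows how reversing the indices would put the most variable gene first. The data and gene names are illustrative only.

```python
import numpy as np

# Hypothetical matrix: 4 cell lines (rows) x 4 genes (columns)
X_toy = np.array([
    [1.0, 10.0, 5.0, 0.0],
    [1.0, 20.0, 5.5, 0.0],
    [1.0, 30.0, 6.0, 8.0],
    [1.0, 40.0, 6.5, 8.0],
])
genes_toy = ["g_const", "g_high", "g_low", "g_mid"]
top_n = 2

# Per-gene variance across cell lines
variances = np.var(X_toy, axis=0)

# Last top_n indices of an ascending sort: lowest-to-highest within the selection
ascending = np.argsort(variances)[-top_n:]

# Reversed: most variable gene first
descending = ascending[::-1]

print([genes_toy[i] for i in ascending])   # ['g_mid', 'g_high']
print([genes_toy[i] for i in descending])  # ['g_high', 'g_mid']
```

Either ordering selects the same set of genes; only the column order of the downstream matrix differs.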


## 3. Merge HVG Matrix with Drug Response Data

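We join the top HVG expression features onto the GDSC drug sensitivity table (`LN_IC50`), yielding one row per (cell line, drug) pair. The join is a left join from the drug response table, so every response row is kept even when its cell line has no expression profile.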


In [9]:
# Filter expression data
cols_to_keep = non_gene_cols + top_genes
filtered_expr = cell_gene_matrix.select(cols_to_keep)
print("✅ Filtered expression matrix shape:", filtered_expr.shape)

# Load drug response data
gdsc_bulk = pl.read_parquet("../../data/gdsc/gdsc_final_cleaned.parquet").select([
    pl.col("SANGER_MODEL_ID").cast(pl.Utf8),
    pl.col("DRUG_ID").cast(pl.Int32),
    pl.col("LN_IC50").cast(pl.Float32)
])
print("✅ Loaded GDSC drug response. Shape:", gdsc_bulk.shape)

# Merge with GDSC
merged = gdsc_bulk.join(filtered_expr, on="SANGER_MODEL_ID", how="left")
print("✅ Merged shape:", merged.shape)
print(merged.head())

# Save
merged.write_parquet("../../data/pseudo_bulk/gdsc_single_cell_top_hvgs.parquet")
print("📁 Saved to: gdsc_single_cell_top_hvgs.parquet")


✅ Filtered expression matrix shape: (1430, 2001)
✅ Loaded GDSC drug response. Shape: (571985, 3)
✅ Merged shape: (571985, 2003)
shape: (5, 2_003)
┌────────────┬─────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ SANGER_MOD ┆ DRUG_ID ┆ LN_IC50   ┆ SIDG38622 ┆ … ┆ SIDG20383 ┆ SIDG20381 ┆ SIDG20382 ┆ SIDG19416 │
│ EL_ID      ┆ ---     ┆ ---       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ ---        ┆ i32     ┆ f32       ┆ f64       ┆   ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
│ str        ┆         ┆           ┆           ┆   ┆           ┆           ┆           ┆           │
╞════════════╪═════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ SIDM00374  ┆ 1009    ┆ 4.13448   ┆ 38.79     ┆ … ┆ 15125.9   ┆ 12108.8   ┆ 12398.1   ┆ 3645.29   │
│ SIDM00255  ┆ 268     ┆ -2.236015 ┆ 24.37     ┆ … ┆ 8855.12   ┆ 8250.84   ┆ 7095.92   ┆ 2427.56   │
│ SIDM01182  ┆ 1012    ┆ 1.321538  ┆ 54.09    