# 🔗 Merge PCA Embeddings with Drug Response Dataset

In this notebook, we merge the PCA-transformed scFoundation embeddings with our bulk drug response dataset on the shared keys `SANGER_MODEL_ID` and `DRUG_ID`. This prepares the data for downstream modeling.


In [9]:
import pandas as pd


## 📥 Load Datasets

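We load both:
- `bulk_with_pca.parquet`: the original drug response dataset with gene-expression PCA features.
- `bulk_with_pca_embeddings.parquet`: the drug response dataset with the selected PCA-transformed scFoundation embeddings.

Both files name their PCA columns `SCF_PC*`, so we rename the ones in the gene-expression table to `VOOM_PC*` to keep the two feature sets distinct after merging.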


In [10]:
# Load both DataFrames
gene_pca_df = pd.read_parquet("../../data/processed/bulk_with_pca.parquet")
emb_pca_df = pd.read_parquet("../../data/processed/bulk_with_pca_embeddings.parquet")

# Rename PCA columns in gene_pca_df
gene_pca_cols = [col for col in gene_pca_df.columns if col.startswith("SCF_PC")]
voom_renaming = {col: col.replace("SCF_PC", "VOOM_PC") for col in gene_pca_cols}
gene_pca_df = gene_pca_df.rename(columns=voom_renaming)

# Confirm
print("✅ Renamed columns in gene_pca_df:", list(voom_renaming.values())[:5], "...")


✅ Renamed columns in gene_pca_df: ['VOOM_PC1', 'VOOM_PC2', 'VOOM_PC3', 'VOOM_PC4', 'VOOM_PC5'] ...
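Before merging, it can help to list the non-key columns the two tables still share, since pandas appends `_x`/`_y` suffixes to any overlap. The following is a minimal sketch on toy frames; the helper name `shared_non_key_columns` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def shared_non_key_columns(left, right, keys):
    """Return non-key column names present in both DataFrames, in left's order."""
    right_cols = set(right.columns)
    return [col for col in left.columns if col in right_cols and col not in keys]


# Toy stand-ins for gene_pca_df and emb_pca_df (hypothetical values)
left_toy = pd.DataFrame(
    {"SANGER_MODEL_ID": ["SIDM_A"], "DRUG_ID": [1], "LN_IC50": [2.0], "VOOM_PC1": [0.5]}
)
right_toy = pd.DataFrame(
    {"SANGER_MODEL_ID": ["SIDM_A"], "DRUG_ID": [1], "LN_IC50": [2.0], "SCF_PC1": [0.1]}
)

overlap = shared_non_key_columns(left_toy, right_toy, ["SANGER_MODEL_ID", "DRUG_ID"])
print("Overlapping non-key columns:", overlap)
```

On the real data, the same call would take `gene_pca_df`, `emb_pca_df`, and the merge keys; any name it returns would be duplicated by a plain merge.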


## 🔗 Merge on SANGER_MODEL_ID and DRUG_ID

We merge the two datasets on their shared cell line identifier (`SANGER_MODEL_ID`) and drug identifier (`DRUG_ID`).


In [11]:
# Identify merge keys
merge_keys = ["SANGER_MODEL_ID", "DRUG_ID"]

# Merge DataFrames
merged_df = gene_pca_df.merge(
    emb_pca_df,
    on=merge_keys,
    how="inner"
)

print("✅ Merged shape:", merged_df.shape)
print("✅ Columns:", merged_df.columns[:10].tolist(), "...")


✅ Merged shape: (571985, 64)
✅ Columns: ['SANGER_MODEL_ID', 'DRUG_ID', 'LN_IC50_x', 'VOOM_PC1', 'VOOM_PC2', 'VOOM_PC3', 'VOOM_PC4', 'VOOM_PC5', 'VOOM_PC6', 'VOOM_PC7'] ...
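An inner merge silently drops rows whose keys appear in only one table and silently multiplies rows if a key pair is duplicated. One way to surface both cases is to use the `validate` and `indicator` parameters of `DataFrame.merge`. The sketch below uses toy frames with hypothetical IDs and assumes each (`SANGER_MODEL_ID`, `DRUG_ID`) pair should be unique in both tables, which the notebook does not state explicitly.

```python
import pandas as pd

keys = ["SANGER_MODEL_ID", "DRUG_ID"]

# Toy stand-ins for the two tables (hypothetical IDs and values)
left_toy = pd.DataFrame({
    "SANGER_MODEL_ID": ["SIDM_A", "SIDM_B", "SIDM_C"],
    "DRUG_ID": [1, 1, 2],
    "VOOM_PC1": [0.1, 0.2, 0.3],
})
right_toy = pd.DataFrame({
    "SANGER_MODEL_ID": ["SIDM_A", "SIDM_B", "SIDM_D"],
    "DRUG_ID": [1, 1, 2],
    "SCF_PC1": [1.0, 2.0, 3.0],
})

# validate="one_to_one" raises pandas.errors.MergeError if a key pair is duplicated;
# indicator=True adds a "_merge" column marking where each row came from.
checked = left_toy.merge(
    right_toy, on=keys, how="outer", validate="one_to_one", indicator=True
)
match_counts = checked["_merge"].value_counts().to_dict()
print(match_counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard, so nonzero counts there are worth investigating before training.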


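## 💾 Save Merged Dataset

The merge above duplicates `LN_IC50`, which is why the column preview shows `LN_IC50_x`. To avoid this, we first trim the embeddings DataFrame to the merge keys and its PCA columns, merge again, and then save the result to a new Parquet file for downstream training and evaluation.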


In [None]:
# Keep only merge keys + PCA columns in embeddings DataFrame.
# gene_pca_cols holds the original "SCF_PC*" names, which emb_pca_df still uses.
emb_pca_keep = emb_pca_df[merge_keys + gene_pca_cols].copy()

print("✅ Prepared embeddings DataFrame for merging:", emb_pca_keep.shape)


✅ Prepared embeddings DataFrame for merging: (571985, 32)


In [13]:
# Merge on SANGER_MODEL_ID and DRUG_ID
merged_df = gene_pca_df.merge(
    emb_pca_keep,
    on=merge_keys,
    how="inner"
)

# Confirm shape and columns
print("✅ Merged shape:", merged_df.shape)
print("✅ Columns preview:", merged_df.columns[:10].tolist(), "...")
print(merged_df.head())


✅ Merged shape: (571985, 63)
✅ Columns preview: ['SANGER_MODEL_ID', 'DRUG_ID', 'LN_IC50', 'VOOM_PC1', 'VOOM_PC2', 'VOOM_PC3', 'VOOM_PC4', 'VOOM_PC5', 'VOOM_PC6', 'VOOM_PC7'] ...
  SANGER_MODEL_ID  DRUG_ID   LN_IC50   VOOM_PC1    VOOM_PC2   VOOM_PC3  \
0       SIDM00263        1  3.966813 -28.846116  197.069926 -19.870734   
1       SIDM00269        1  2.692090 -32.939312  178.038200 -36.558014   
2       SIDM00203        1  2.477990 -50.438404  224.057089 -11.252632   
3       SIDM01111        1  2.033564  29.766660   21.063254  40.867959   
4       SIDM00909        1  2.966007  82.632950  -44.422875  -0.794799   

     VOOM_PC4    VOOM_PC5   VOOM_PC6   VOOM_PC7  ...  SCF_PC21  SCF_PC22  \
0   44.943251  120.252984 -11.736488  68.092467  ... -0.622784  0.211736   
1   76.119477   61.616892 -41.889930  48.596424  ...  0.163480  0.171078   
2   68.074396  118.967199 -35.606544  15.422671  ... -0.252287  0.144818   
3  121.357190  114.116486   0.979702  -9.181698  ... -0.379239  0.281085 
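
The second merge keeps the row count at 571985 and the `LN_IC50` column no longer carries a suffix. Checks like these can be automated so that a regression is caught before the file is written. The following is a minimal sketch; the helper `merge_issues` and the toy frames are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd


def merge_issues(merged, expected_rows):
    """Return a list of human-readable problems found in a merged DataFrame."""
    issues = []
    if len(merged) != expected_rows:
        issues.append(f"row count {len(merged)} != expected {expected_rows}")
    # pandas adds these suffixes when both inputs share a non-key column
    suffixed = [col for col in merged.columns if col.endswith(("_x", "_y"))]
    if suffixed:
        issues.append(f"suffixed duplicate columns: {suffixed}")
    n_missing = int(merged.isna().sum().sum())
    if n_missing:
        issues.append(f"{n_missing} missing values")
    return issues


# Toy examples (hypothetical values): one clean table, one with a duplicated column
clean_toy = pd.DataFrame({"SANGER_MODEL_ID": ["SIDM_A"], "DRUG_ID": [1], "LN_IC50": [2.0]})
dirty_toy = pd.DataFrame({"SANGER_MODEL_ID": ["SIDM_A"], "DRUG_ID": [1],
                          "LN_IC50_x": [2.0], "LN_IC50_y": [2.0]})

clean_issues = merge_issues(clean_toy, expected_rows=1)
dirty_issues = merge_issues(dirty_toy, expected_rows=1)
print(clean_issues)
print(dirty_issues)
```

In the notebook, this could be called as `merge_issues(merged_df, expected_rows=len(gene_pca_df))`, assuming every row of `gene_pca_df` is expected to find a match.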

In [15]:
# Save to parquet; reuse one path variable so the log message matches the file written
output_path = "../../data/processed/bulk_conc_pca_gene_embeddings.parquet"
merged_df.to_parquet(output_path, index=False)
print(f"✅ Saved concatenated PCA feature table to '{output_path}'.")


✅ Saved concatenated PCA feature table to '../../data/processed/bulk_conc_pca_gene_embeddings.parquet'.
