# Step 10: Align Pan-Cancer Cell Lines with GDSC

In this notebook, we map cell line names from the Pan-Cancer single-cell dataset to GDSC SANGER_MODEL_IDs using a provided lookup table, then drop every cell whose cell line has no match in that table.


## 10.1 Load Data and Cell Line Mapping


In [1]:
import pandas as pd
import scanpy as sc

# Load processed AnnData object
adata = sc.read("../../data/pancancer_dimred.h5ad")

# Load cell line mapping file
mapping_df = pd.read_csv("../../data/cell_sanger_map.csv").drop_duplicates()
mapping_df.columns = ['SANGER_MODEL_ID', 'CELL_LINE_NAME']

print("✅ Loaded mapping_df and adata")
print("Mapping sample:")
print(mapping_df.head())
print("AnnData shape:", adata.shape)


✅ Loaded mapping_df and adata
Mapping sample:
  SANGER_MODEL_ID CELL_LINE_NAME
0       SIDM00853            GCT
1       SIDM00567         ONS-76
2       SIDM00042            PL4
3       SIDM00455     PA-TU-8902
4       SIDM00881        HCC1428
AnnData shape: (56970, 30314)
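The rename in the cell above assigns column names by position, so it would silently mislabel the data if the CSV's column order ever changed. Below is a minimal sketch of a guard, run on a small in-memory frame rather than `cell_sanger_map.csv`. The helper name `check_mapping_columns` and the `SIDM` prefix check are assumptions inferred from the sample output, not part of the original notebook.

```python
import pandas as pd

def check_mapping_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename a two-column mapping frame by position and sanity-check the IDs."""
    if df.shape[1] != 2:
        raise ValueError(f"expected 2 columns, found {df.shape[1]}")
    out = df.copy()
    out.columns = ["SANGER_MODEL_ID", "CELL_LINE_NAME"]
    # Assumption: every Sanger model ID starts with "SIDM", as in the sample output.
    bad = ~out["SANGER_MODEL_ID"].astype(str).str.startswith("SIDM")
    if bad.any():
        raise ValueError(f"{int(bad.sum())} rows do not look like SIDM IDs")
    return out

# Toy stand-in for the CSV, with arbitrary original column names.
toy = pd.DataFrame({"a": ["SIDM00853", "SIDM00567"], "b": ["GCT", "ONS-76"]})
checked = check_mapping_columns(toy)
print(checked.columns.tolist())
```

If the two columns were swapped in the file, the prefix check would raise instead of letting the mapping proceed with reversed labels.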


## 10.2 Extract and Normalize Cell Line Names


In [2]:
# Extract the cell line prefix from the obs index (e.g. 'NCIH2126_LUNG' -> 'NCIH2126')
adata.obs['cell_line'] = adata.obs.index.str.split('_').str[0]

# Normalize: remove dashes and uppercase
adata.obs['cell_line_norm'] = adata.obs['cell_line'].str.replace('-', '', regex=False).str.upper()

print("✅ Extracted and normalized cell line names from adata.obs")
print("Unique normalized cell lines (first 10):", adata.obs['cell_line_norm'].unique()[:10])


✅ Extracted and normalized cell line names from adata.obs
Unique normalized cell lines (first 10): ['NCIH2126' 'SW579' 'C32' 'NCIH446' 'HEC251' 'MFE319' 'SKNAS' 'NCIH2452'
 'COLO741' 'WM88']
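The normalization rule can be illustrated on plain strings. This is a sketch using names from the mapping sample in 10.1; `normalize_name` is a hypothetical helper that mirrors the `str.replace('-', '')` and `str.upper()` chain, not a function defined in the notebook.

```python
def normalize_name(name: str) -> str:
    """Remove dashes and uppercase, mirroring the notebook's normalization."""
    return name.replace("-", "").upper()

samples = ["PA-TU-8902", "ONS-76", "HCC1428"]
normalized = [normalize_name(s) for s in samples]
print(normalized)  # ['PATU8902', 'ONS76', 'HCC1428']
```

Only dashes are removed, so names that differ by spaces, dots, or slashes would still fail to match under this rule.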


## 10.3 Normalize Mapping File and Apply Mapping


In [4]:
# Normalize mapping file
mapping_df['CELL_LINE_NAME_NORM'] = mapping_df['CELL_LINE_NAME'].str.replace('-', '', regex=False).str.upper()

# Map normalized names to SIDM (if a normalized name repeats, dict() keeps the last entry)
name_to_sidm = dict(zip(mapping_df['CELL_LINE_NAME_NORM'], mapping_df['SANGER_MODEL_ID']))
adata.obs['SIDM'] = adata.obs['cell_line_norm'].map(name_to_sidm)

# Preview mapping results
print("✅ Mapped cell lines to SIDM codes")
print(adata.obs[['cell_line', 'cell_line_norm', 'SIDM']].drop_duplicates().head(10))


✅ Mapped cell lines to SIDM codes
                        cell_line cell_line_norm       SIDM
NCIH2126_LUNG            NCIH2126       NCIH2126        NaN
SW579_THYROID               SW579          SW579        NaN
C32_SKIN                      C32            C32  SIDM00890
NCIH446_LUNG              NCIH446        NCIH446  SIDM00965
HEC251_ENDOMETRIUM         HEC251         HEC251        NaN
MFE319_ENDOMETRIUM         MFE319         MFE319  SIDM00333
SKNAS_AUTONOMIC_GANGLIA     SKNAS          SKNAS  SIDM01101
NCIH2452_PLEURA          NCIH2452       NCIH2452  SIDM00722
COLO741_SKIN              COLO741        COLO741        NaN
WM88_SKIN                    WM88           WM88        NaN
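Because `dict(zip(...))` keeps only the last value for a repeated key, two cell line names that normalize to the same string would overwrite each other without warning. The sketch below shows one way to detect such collisions before building the lookup. `find_collisions` is a hypothetical helper, and the `G-CT` spelling and `SIDM99999` ID are made up purely to force a collision.

```python
from collections import defaultdict

def find_collisions(pairs):
    """Return normalized names that map to more than one distinct ID."""
    ids_by_name = defaultdict(set)
    for name, sidm in pairs:
        ids_by_name[name.replace("-", "").upper()].add(sidm)
    return {n: sorted(ids) for n, ids in ids_by_name.items() if len(ids) > 1}

# Toy (CELL_LINE_NAME, SANGER_MODEL_ID) pairs; the second entry is fabricated.
pairs = [("GCT", "SIDM00853"), ("G-CT", "SIDM99999"), ("ONS-76", "SIDM00567")]
collisions = find_collisions(pairs)
print(collisions)  # {'GCT': ['SIDM00853', 'SIDM99999']}
```

An empty result would indicate that the dictionary lookup is unambiguous for the given mapping file.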


## 10.4 Filter Unmapped Cells


In [5]:
# Identify and report unmapped lines
unmapped = adata.obs[adata.obs['SIDM'].isna()]['cell_line'].unique()
print("❗ Unmapped cell lines:", unmapped)
print("Total unmapped:", len(unmapped))

# Keep only cells whose cell line mapped to a SIDM
adata = adata[adata.obs['SIDM'].notna()].copy()
print("✅ Filtered AnnData object to mapped cell lines:", adata.shape)


❗ Unmapped cell lines: ['NCIH2126' 'SW579' 'HEC251' 'COLO741' 'WM88' 'SNU899' 'HEC108' 'SNU308'
 'TM31' 'KPNSI9S' 'BICR18' 'SQ1' 'BICR6' 'SH10TC' 'UMUC1' 'CCFSTTG1' 'TEN'
 'RERFLCAD1' 'COV434' 'SNU1079' 'YD38' 'JHOC5' 'PANC1' 'VMCUB1' 'SNU1077'
 'LI7' 'ACCMESO1' 'HMC18' 'SNU1076' 'EFE184' 'PECAPJ49' 'BICR56' 'PK59'
 'HUH6' 'HS852T' 'LMSU' 'SNUC4' 'OVSAHO' 'GOS3' 'SNU738' 'PATU8988S'
 'HEC59' 'HS729' 'KPL1' 'NCIH2077' 'KMRC3' 'CL34' 'ZR751' 'PK45H'
 'RERFLCAI' 'SNU1196' 'OUMS23' 'HEC151' 'NCIH1373' 'HCC56' 'BICR16' 'HEC6'
 'SNU46' 'SNU1214' 'NCIH2073' 'ONCODG1' 'HUH28' 'TE14' 'CAKI2' 'SCC47'
 '93VU' 'JHU006']
Total unmapped: 67
✅ Filtered AnnData object to mapped cell lines: (39715, 30314)
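The report above lists unmapped cell lines but not how many cells each one contributes. A minimal sketch of a per-line count and an overall mapped fraction follows, using a toy `obs` frame in place of `adata.obs`; the cell counts in the toy frame are invented for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs; row counts per cell line are made up.
obs = pd.DataFrame({
    "cell_line": ["C32", "C32", "WM88", "SW579", "SW579", "SW579"],
    "SIDM": ["SIDM00890", "SIDM00890", None, None, None, None],
})

# Number of cells lost per unmapped cell line, sorted by name.
unmapped_counts = (
    obs.loc[obs["SIDM"].isna(), "cell_line"].value_counts().sort_index()
)
# Fraction of cells that received a SIDM.
mapped_fraction = obs["SIDM"].notna().mean()

print(unmapped_counts.to_dict())
print(f"{mapped_fraction:.1%} of cells mapped")
```

Run against the real `adata.obs` before filtering, a summary like this would show which unmapped lines account for most of the drop from 56970 to 39715 cells.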


## 10.5 Save Updated Dataset


In [6]:
# Save with updated metadata (this overwrites the input file loaded in 10.1)
adata.write("../../data/pancancer_dimred.h5ad")
print("✅ Overwrote the original file with SIDM-enhanced metadata")


✅ Overwrote the original file with SIDM-enhanced metadata
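Since the save step replaces the file that was loaded in 10.1, the unfiltered data cannot be recovered from disk afterwards. One alternative is to write to a derived path instead. The sketch below only builds such a path; the helper `derived_output_path` and the `_sidm` suffix are hypothetical choices, not names used elsewhere in this pipeline.

```python
from pathlib import Path

def derived_output_path(input_path: str, suffix: str = "_sidm") -> Path:
    """Insert a suffix before the file extension, keeping the directory."""
    p = Path(input_path)
    return p.with_name(f"{p.stem}{suffix}{p.suffix}")

out_path = derived_output_path("../../data/pancancer_dimred.h5ad")
print(out_path.as_posix())  # ../../data/pancancer_dimred_sidm.h5ad
```

Passing `out_path` to `adata.write(...)` would leave the original file untouched, at the cost of later steps needing to read the new filename.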
