# 🧬 Map Ensembl IDs to Gene Symbols (Breast Pseudo-Bulk)

This notebook maps Ensembl gene IDs in the breast cancer pseudo-bulk dataset to HGNC gene symbols using the `gene_info.csv` reference file.


In [9]:
import pandas as pd
import scanpy as sc


## 📥 Load Breast Dataset and Gene Mapping File


In [10]:
# Load h5ad file
adata = sc.read_h5ad("data/breast_cancer_dimred.h5ad")  # replace with actual file

# Load mapping file
gene_info = pd.read_csv("gene_info.csv")  # must contain columns 'feature_id' and 'feature_name'
ensg_to_symbol = dict(zip(gene_info["feature_id"], gene_info["feature_name"]))

print("Original shape:", adata.shape)



Original shape: (30523, 47096)
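
Building the lookup with `dict(zip(...))` silently keeps the last entry whenever a `feature_id` appears more than once, and it raises an unhelpful `KeyError` if a column is missing. The sketch below shows one way to validate the mapping table before using it. It runs on a small hypothetical table rather than the real `gene_info.csv`, whose contents are not shown here; the rows and the choice of `keep="first"` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for gene_info.csv (the real file is not reproduced here).
gene_info = pd.DataFrame({
    "feature_id": ["ENSG00000141510", "ENSG00000012048", "ENSG00000012048"],
    "feature_name": ["TP53", "BRCA1", "BRCA1"],
})

# Fail early with a clear message if an expected column is absent.
required = {"feature_id", "feature_name"}
missing = required - set(gene_info.columns)
if missing:
    raise KeyError(f"gene_info is missing columns: {sorted(missing)}")

# Count Ensembl IDs that occur more than once in the mapping table.
dup_ids = gene_info.loc[
    gene_info["feature_id"].duplicated(keep=False), "feature_id"
].unique()
n_dup_ids = len(dup_ids)
print(f"{n_dup_ids} Ensembl ID(s) appear more than once")

# Make the tie-breaking rule explicit: keep the first row per Ensembl ID.
ensg_to_symbol = (
    gene_info.drop_duplicates("feature_id", keep="first")
    .set_index("feature_id")["feature_name"]
    .to_dict()
)
```

If the real mapping file has no repeated IDs, this produces the same dictionary as the `dict(zip(...))` call above.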


## 🔁 Rename Columns Using Mapping


In [11]:
# Map using .var_names or .var.index
adata.var["original_id"] = adata.var_names
adata.var["mapped_symbol"] = adata.var["original_id"].map(ensg_to_symbol)

# Drop genes that failed to map (optional but recommended)
mapped_mask = adata.var["mapped_symbol"].notnull()
print(f"✅ Mapped {mapped_mask.sum()} / {adata.shape[1]} genes")

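adata = adata[:, mapped_mask].copy()
# Assign plain values so the index does not inherit the column name "mapped_symbol";
# an index that shares a column's name but not its values can make adata.write() fail.
adata.var_names = adata.var["mapped_symbol"].astype(str).to_numpy()
adata.var_names_make_unique()  # repeated symbols get a numeric suffix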


✅ Mapped 47022 / 47096 genes
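
Before the unmapped genes are dropped, it can be useful to see which IDs failed to map and which symbols are shared by several Ensembl IDs, since `var_names_make_unique()` only disambiguates the repeats by appending a numeric suffix. The following is a minimal sketch on a hypothetical table standing in for `adata.var`; the placeholder IDs `ENSG_UNMAPPED` and `ENSG_DUPLICATE` are invented for illustration and are not real Ensembl identifiers.

```python
import pandas as pd

# Hypothetical stand-in for adata.var after the mapping step.
var = pd.DataFrame(
    {"mapped_symbol": ["TP53", "BRCA1", None, "BRCA1"]},
    index=["ENSG00000141510", "ENSG00000012048", "ENSG_UNMAPPED", "ENSG_DUPLICATE"],
)

# IDs with no entry in the mapping table.
unmapped_ids = var.index[var["mapped_symbol"].isna()].tolist()

# Symbols that more than one ID maps to.
symbol_counts = var["mapped_symbol"].dropna().value_counts()
duplicated_symbols = symbol_counts[symbol_counts > 1]

print("Unmapped IDs:", unmapped_ids)
print("Symbols shared by several IDs:")
print(duplicated_symbols)
```

With the real dataset, the same checks would be run on `adata.var["mapped_symbol"]` before the `adata[:, mapped_mask]` subset is taken.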


## 💾 Save Mapped Dataset


In [12]:
out_path = "breast_cancer_mapped.h5ad"
adata.write(out_path)
print(f"📁 Saved mapped file to '{out_path}'")


📁 Saved mapped file to 'breast_cancer_mapped.h5ad'
