# Applied Data Science Final Project - HypoMap Label Projection 

Problem posed by OpenProblems: https://openproblems.bio/datasets/cellxgene_census/hypomap

Paper: https://www.nature.com/articles/s42255-022-00657-y

In [1]:
import scanpy as sc 
import pandas as pd
import numpy as np 

In [2]:
path = "/Users/gretavanzetten/Desktop/ADS_FinalProject/dataset.h5ad"
adata = sc.read_h5ad(path, backed='r')
adata

AnnData object with n_obs × n_vars = 384925 × 41642 backed at '/Users/gretavanzetten/Desktop/ADS_FinalProject/dataset.h5ad'
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'n_genes', 'batch', 'size_factors'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length', 'n_cells', 'hvg', 'hvg_score'
    uns: 'dataset_description', 'dataset_id', 'dataset_name', 'dataset_organism', 'dataset_reference', 'dataset_summary', 'dataset_url', 'knn', 'normalization_id', 'pca_variance'
    obsm: 'X_pca'
    varm: 'pca_loadings'
    layers: 'counts', 'normalized'
    obsp: 'knn_connectivit

In [3]:
print("Shape (cells × genes):", adata.shape)
print("Available layers:", list(adata.layers.keys()))  # layer names are readable in backed and in-memory mode
print("Available embeddings (obsm):", list(adata.obsm.keys()))
print("Available pairwise matrices (obsp):", list(adata.obsp.keys()))
print("Available uns keys:", list(adata.uns.keys()))

Shape (cells × genes): (384925, 41642)
Available layers: ['counts', 'normalized']
Available embeddings (obsm): ['X_pca']
Available pairwise matrices (obsp): ['knn_connectivities', 'knn_distances']
Available uns keys: ['dataset_description', 'dataset_id', 'dataset_name', 'dataset_organism', 'dataset_reference', 'dataset_summary', 'dataset_url', 'knn', 'normalization_id', 'pca_variance']


In [4]:
# first several rows of cell-level metadata
obs_df = adata.obs.head(5)
display(obs_df)

# first several rows of gene-level metadata
var_df = adata.var.head(5)
display(var_df)

# list up to 10 unique tissues and cell types (the "Categories" line in the output gives the total count)
print("Unique tissues:", adata.obs['tissue'].unique()[:10])
print("Unique cell types:", adata.obs['cell_type'].unique()[:10])

,soma_joinid,dataset_id,assay,assay_ontology_term_id,cell_type,cell_type_ontology_term_id,development_stage,development_stage_ontology_term_id,disease,disease_ontology_term_id,...,sex,sex_ontology_term_id,suspension_type,tissue,tissue_ontology_term_id,tissue_general,tissue_general_ontology_term_id,n_genes,batch,size_factors
0,1867178,dbb4e1ed-d820-4e83-981f-88ef7eb55a35,10x 3' v3,EFO:0009922,neuron,CL:0000540,2 month-old stage,MmusDv:0000062,normal,PATO:0000461,...,female,PATO:0000383,nucleus,hypothalamus,UBERON:0001898,brain,UBERON:0000955,1760,pooled,3310.0
1,1867179,dbb4e1ed-d820-4e83-981f-88ef7eb55a35,10x 3' v3,EFO:0009922,neuron,CL:0000540,2 month-old stage,MmusDv:0000062,normal,PATO:0000461,...,female,PATO:0000383,nucleus,hypothalamus,UBERON:0001898,brain,UBERON:0000955,963,pooled,1209.0
2,1867180,dbb4e1ed-d820-4e83-981f-88ef7eb55a35,10x 3' v3,EFO:0009922,neuron,CL:0000540,2 month-old stage,MmusDv:0000062,normal,PATO:0000461,...,female,PATO:0000383,nucleus,hypothalamus,UBERON:0001898,brain,UBERON:0000955,2165,pooled,4049.0
3,1867181,dbb4e1ed-d820-4e83-981f-88ef7eb55a35,10x 3' v3,EFO:0009922,neuron,CL:0000540,2 month-old stage,MmusDv:0000062,normal,PATO:0000461,...,female,PATO:0000383,nucleus,hypothalamus,UBERON:0001898,brain,UBERON:0000955,1977,pooled,3156.0
4,1867182,dbb4e1ed-d820-4e83-981f-88ef7eb55a35,10x 3' v3,EFO:0009922,neuron,CL:0000540,2 month-old stage,MmusDv:0000062,normal,PATO:0000461,...,female,PATO:0000383,nucleus,hypothalamus,UBERON:0001898,brain,UBERON:0000955,1584,pooled,2591.0


feature_id (index),soma_joinid,feature_id,feature_name,feature_length,n_cells,hvg,hvg_score
ENSMUSG00000051951,0,ENSMUSG00000051951,Xkr4,6094,71381,True,8.289147
ENSMUSG00000089699,1,ENSMUSG00000089699,Gm1992,250,9743,False,-0.862607
ENSMUSG00000102343,2,ENSMUSG00000102343,Gm37381,1364,38,False,-0.205761
ENSMUSG00000025900,3,ENSMUSG00000025900,Rp1,12311,637,False,1.351251
ENSMUSG00000025902,4,ENSMUSG00000025902,Sox17,4772,5937,True,7.640341


Unique tissues: ['hypothalamus']
Categories (1, object): ['hypothalamus']
Unique cell types: ['neuron', 'tanycyte', 'astrocyte', 'oligodendrocyte', 'oligodendrocyte precursor cell', 'microglial cell', 'pituitary gland cell', 'mural cell', 'fibroblast', 'endothelial cell']
Categories (13, object): ['astrocyte', 'endothelial cell', 'ependymal cell', 'erythrocyte', ..., 'oligodendrocyte', 'oligodendrocyte precursor cell', 'pituitary gland cell', 'tanycyte']
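
Because the output above lists 13 cell-type categories, a natural next check is how many cells fall into each one, since class imbalance affects any label-projection model. The following is a minimal sketch on hypothetical toy labels (the `cell_type` series below is made up and only stands in for `adata.obs['cell_type']`); it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy labels standing in for adata.obs['cell_type'].
cell_type = pd.Series(
    ["neuron", "neuron", "neuron", "astrocyte", "tanycyte", "neuron"],
    dtype="category",
    name="cell_type",
)

# Count cells per label, then convert the counts to fractions of the total.
counts = cell_type.value_counts()
fractions = counts / counts.sum()

# Combine both views into one summary table, indexed by cell type.
summary = pd.DataFrame({"n_cells": counts, "fraction": fractions.round(3)})
print(summary)
```

On the real object, the same two lines (`value_counts()` and the division by the total) should apply directly to `adata.obs['cell_type']`, since `.obs` is a pandas DataFrame.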


In [6]:
if adata.isbacked:
    # Backed mode: .layers is a manager object, can still see stored layer names
    print("Available layers:", list(adata.file["layers"].keys()))
else:
    print("Available layers:", list(adata.layers.keys()))
print("PCA embedding shape:", adata.obsm['X_pca'].shape if 'X_pca' in adata.obsm else "Not found")

Available layers: ['counts', 'normalized']
PCA embedding shape: (384925, 50)
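
The 50-dimensional `X_pca` embedding printed above is one possible input for label projection. As an illustration only, here is a sketch of a k-nearest-neighbour majority vote in embedding space. This is an assumed baseline, not the method used in this project or in the HypoMap paper; the function name `knn_project_labels`, the toy clusters, and the train/test split are all hypothetical.

```python
import numpy as np

def knn_project_labels(train_emb, train_labels, test_emb, k=5):
    """Assign each test cell the majority label among its k nearest training cells."""
    train_labels = np.asarray(train_labels)
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training cells for each test cell (unordered within the k).
    nn_idx = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    predicted = []
    for neighbours in train_labels[nn_idx]:
        values, counts = np.unique(neighbours, return_counts=True)
        predicted.append(values[np.argmax(counts)])
    return np.array(predicted)

rng = np.random.default_rng(0)
# Toy stand-in for X_pca: two well-separated clusters in 50 dimensions.
train_emb = np.vstack([rng.normal(0, 1, (20, 50)), rng.normal(10, 1, (20, 50))])
train_labels = ["neuron"] * 20 + ["astrocyte"] * 20
test_emb = np.vstack([rng.normal(0, 1, (3, 50)), rng.normal(10, 1, (3, 50))])

pred = knn_project_labels(train_emb, train_labels, test_emb)
print(pred)
```

The dense distance computation here allocates an array of size n_test × n_train × 50, so it only suits toy data; for the full 384,925 cells, a dedicated nearest-neighbour index or chunked computation would be needed.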


In [7]:
# print every unstructured annotation stored with the dataset
for key, value in adata.uns.items():
    print(f"{key}: {value}")

dataset_description: The hypothalamus plays a key role in coordinating fundamental body functions. Despite recent progress in single-cell technologies, a unified catalogue and molecular characterization of the heterogeneous cell types and, specifically, neuronal subtypes in this brain region are still lacking. Here we present an integrated reference atlas “HypoMap” of the murine hypothalamus consisting of 384,925 cells, with the ability to incorporate new additional experiments. We validate HypoMap by comparing data collected from SmartSeq2 and bulk RNA sequencing of selected neuronal cell types with different degrees of cellular heterogeneity.
dataset_id: cellxgene_census/hypomap
dataset_name: HypoMap
dataset_organism: mus_musculus
dataset_reference: steuernagel2022hypomap
dataset_summary: A unified single cell gene expression atlas of the murine hypothalamus
dataset_url: https://cellxgene.cziscience.com/collections/d86517f0-fa7e-4266-b82e-a521350d6d36
knn: {'connectivities_key': 'knn