# Human HypoMap: Cell2Location Spatial Mapping

This notebook demonstrates how to load human hypothalamus transcriptomic data and 
reconstruct cell2location maps similar to Figure 3e from Tadross et al. (2025) 
"A comprehensive spatio-cellular map of the human hypothalamus" (Nature).

Figure 3e shows the spatial mapping of five C3 VMH (ventromedial hypothalamus) 
GLU-2 clusters to subregions of the VMH using cell2location deconvolution.

## Setup and Imports

In [None]:
import sys
sys.path.insert(0, '..')

import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Import human dataset module
from src.datasets.human_hypomap import (
    load_spatial_positions,
    extract_human_metadata,
    get_human_hypomap_config,
    detect_cell_type_columns,
    H5AD_FILE,
    SPATIAL_POSITIONS_TAR,
    HUMAN_REGIONS,
    HUMAN_REGION_COLORS,
)

# Set plotting defaults
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=100, frameon=False, figsize=(6, 6))

## 1. Load the Human Hypothalamus snRNA-seq Data

The HYPOMAP dataset contains 433,369 nuclei from human hypothalamus, 
organized in a hierarchical clustering structure (C0-C4 levels).
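
A hierarchical clustering implies that every cluster at a finer level sits under exactly one cluster at the coarser level. The sketch below shows one way to check that property on a pair of label columns. The helper name `check_nested` and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def check_nested(child_labels, parent_labels):
    """Return child clusters that appear under more than one parent cluster."""
    parents = defaultdict(set)
    for child, parent in zip(child_labels, parent_labels):
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}

# Toy labels (hypothetical): two C2 clusters, three C3 clusters
c2 = ["C2-1", "C2-1", "C2-2", "C2-2"]
c3 = ["C3-1", "C3-2", "C3-3", "C3-3"]

violations = check_nested(c3, c2)
print(violations)  # an empty dict means every C3 cluster has a single C2 parent
```

With the real data, the same check could be run as `check_nested(adata.obs['C3_named'], adata.obs['C2_named'])`.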

In [2]:
# Load the h5ad file
print(f"Loading h5ad file from: {H5AD_FILE}")
adata = sc.read_h5ad(H5AD_FILE, backed='r')

print(f"\nDataset shape: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"\nAvailable obs columns: {list(adata.obs.columns)}")

Loading h5ad file from: /Users/patrickmineault/LocalDocuments/hypomap/notebooks/../data/raw/human_hypomap/480e89e7-84ad-4fa8-adc3-f7c562a77a78.h5ad

Dataset shape: 433369 cells x 36924 genes

Available obs columns: ['assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'disease_ontology_term_id', 'is_primary_data', 'self_reported_ethnicity_ontology_term_id', 'sex_ontology_term_id', 'suspension_type', 'tissue_type', 'tissue_ontology_term_id', 'nCount_RNA', 'nFeature_RNA', 'percent_mt', 'Sample_ID', 'Dataset', 'C0_named', 'C1_named', 'C2_named', 'C3_named', 'C4_named', 'region', 'cell_type', 'assay', 'disease', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid']


In [3]:
# Detect cell type hierarchy columns
cell_type_cols = detect_cell_type_columns(adata.obs.columns)
print(f"Cell type hierarchy columns: {cell_type_cols}")

# Show the hierarchical structure
for col in cell_type_cols:
    n_clusters = adata.obs[col].nunique()
    print(f"  {col}: {n_clusters} clusters")

Cell type hierarchy columns: ['C0_named', 'C1_named', 'C2_named', 'C3_named', 'C4_named']
  C0_named: 4 clusters
  C1_named: 13 clusters
  C2_named: 52 clusters
  C3_named: 156 clusters
  C4_named: 452 clusters


## 2. Explore VMH GLU-2 Clusters (Figure 3e clusters)

Figure 3e shows five C3 clusters from the C2-35 Mid-2 GLU-2 branch:
- C3-118: CALB1
- C3-119: MEGF10  
- C3-120: ARHGAP42
- C3-121: SLC22A10
- C3-122: PGR

These are VMH (ventromedial hypothalamus) glutamatergic neurons.
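
Because other labels in the same branch also contain the string `GLU-2` (for example the `GABA-GLU-2` clusters), selecting by cluster ID is more precise than matching on the name. The sketch below assumes that each `C3_named` label starts with the cluster ID followed by a space, as in `C3-118 Mid-2 GLU-2 CALB1`. The helper `select_by_cluster_id` is a hypothetical name.

```python
def select_by_cluster_id(labels, cluster_ids):
    """Keep labels whose first whitespace-separated token is a wanted cluster ID."""
    wanted = set(cluster_ids)
    return [label for label in labels if str(label).split(" ", 1)[0] in wanted]

labels = [
    "C3-118 Mid-2 GLU-2 CALB1",
    "C3-122 Mid-2 GLU-2 PGR",
    "C3-125 Mid-2 GABA-GLU-2 NTS",
]
figure_3e_ids = ["C3-118", "C3-119", "C3-120", "C3-121", "C3-122"]

selected = select_by_cluster_id(labels, figure_3e_ids)
print(selected)  # the GABA-GLU-2 label is excluded
```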

In [4]:
# Define the VMH GLU-2 clusters of interest (from Figure 3e)
vmh_glu2_clusters = {
    'C3-118': {'name': 'Mid-2 GLU-2 CALB1', 'marker': 'CALB1'},
    'C3-119': {'name': 'Mid-2 GLU-2 MEGF10', 'marker': 'MEGF10'},
    'C3-120': {'name': 'Mid-2 GLU-2 ARHGAP42', 'marker': 'ARHGAP42'},
    'C3-121': {'name': 'Mid-2 GLU-2 SLC22A10', 'marker': 'SLC22A10'},
    'C3-122': {'name': 'Mid-2 GLU-2 PGR', 'marker': 'PGR'},
}

# VMH marker genes (from Figure 3d)
vmh_markers = ['FEZF1', 'NR5A1', 'BDNF', 'ESR1', 'NOS1', 'ADCYAP1']

print("VMH GLU-2 clusters for Figure 3e reconstruction:")
for cluster_id, info in vmh_glu2_clusters.items():
    print(f"  {cluster_id}: {info['name']} (marker: {info['marker']})")

VMH GLU-2 clusters for Figure 3e reconstruction:
  C3-118: Mid-2 GLU-2 CALB1 (marker: CALB1)
  C3-119: Mid-2 GLU-2 MEGF10 (marker: MEGF10)
  C3-120: Mid-2 GLU-2 ARHGAP42 (marker: ARHGAP42)
  C3-121: Mid-2 GLU-2 SLC22A10 (marker: SLC22A10)
  C3-122: Mid-2 GLU-2 PGR (marker: PGR)


In [5]:
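# `adata` was opened with backed='r', so `.obs` is already in memory while `.X` stays on disk.
# To load the expression matrix as well (if memory allows), use `adata.to_memory()`;
# for large datasets, consider subsetting first.
print("Loading data into memory...")
adata_memory = adata

# Check if C3_named column exists and filter for VMH clusters
if 'C3_named' in adata_memory.obs.columns:
    # Find clusters matching the Mid-2 GLU-2 pattern. Note: this substring match
    # also catches the 'Mid-2 GABA-GLU-2' clusters, not only the five in Figure 3e.
    c3_values = adata_memory.obs['C3_named'].unique()
    vmh_clusters = [c for c in c3_values if 'GLU-2' in str(c) and 'Mid-2' in str(c)]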
    print(f"\nFound {len(vmh_clusters)} Mid-2 GLU-2 clusters:")
    for c in sorted(vmh_clusters):
        n_cells = (adata_memory.obs['C3_named'] == c).sum()
        print(f"  {c}: {n_cells} cells")

Loading data into memory...

Found 7 Mid-2 GLU-2 clusters:
  C3-118 Mid-2 GLU-2 CALB1: 2557 cells
  C3-119 Mid-2 GLU-2 MEGF10: 1394 cells
  C3-120 Mid-2 GLU-2 ARHGAP42: 2141 cells
  C3-121 Mid-2 GLU-2 SLC22A10: 1868 cells
  C3-122 Mid-2 GLU-2 PGR: 1049 cells
  C3-125 Mid-2 GABA-GLU-2 NTS: 470 cells
  C3-126 Mid-2 GABA-GLU-2 PPP1R17: 2189 cells


## 3. Load Spatial Transcriptomics Data

The spatial data comes from 10x Visium CytAssist experiments on 9 hypothalamic sections 
from 7 donors, covering preoptic/anterior, middle, and posterior hypothalamus.
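
For orientation, the sketch below parses rows in the layout of a Space Ranger `tissue_positions_list.csv` file (barcode, in-tissue flag, array row, array column, pixel row, pixel column). It is a minimal standard-library illustration, not the implementation of `load_spatial_positions`. Mapping the pixel row to `pixel_y` and the pixel column to `pixel_x` is an assumption, and the two example rows are made up.

```python
import csv
import io

# Column order of the positions file; the pixel_y/pixel_x naming is an assumption
FIELDS = ["barcode", "in_tissue", "array_row", "array_col", "pixel_y", "pixel_x"]

def parse_tissue_positions(text):
    """Parse positions CSV text into a list of dicts, one per spot."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record or record[0] == "barcode":  # skip blank lines and an optional header
            continue
        row = dict(zip(FIELDS, record))
        for key in ("in_tissue", "array_row", "array_col"):
            row[key] = int(row[key])
        for key in ("pixel_y", "pixel_x"):
            row[key] = float(row[key])
        rows.append(row)
    return rows

# Two made-up rows for illustration
example = (
    "AAACAACGAATAGTTC-1,0,0,16,1500,2300\n"
    "AAACAAGTATCTCCCA-1,1,50,102,7200,8100\n"
)
positions = parse_tissue_positions(example)
in_tissue = [p for p in positions if p["in_tissue"] == 1]
print(len(positions), len(in_tissue))
```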

In [None]:
# Load spatial positions from tissue_positions_list files
spatial_positions = load_spatial_positions()

print(f"\nLoaded spatial data for {len(spatial_positions)} samples:")
for sample_id, df in spatial_positions.items():
    n_spots = len(df)
    n_in_tissue = df['in_tissue'].sum()
    print(f"  {sample_id}: {n_spots} spots ({n_in_tissue} in tissue)")

In [None]:
# Example: visualize spatial positions for one sample
if spatial_positions:
    sample_id = list(spatial_positions.keys())[0]
    pos_df = spatial_positions[sample_id]
    
    # Filter to spots in tissue
    in_tissue = pos_df[pos_df['in_tissue'] == 1]
    
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(in_tissue['pixel_x'], in_tissue['pixel_y'], 
               s=10, alpha=0.6, c='steelblue')
    ax.set_xlabel('Pixel X')
    ax.set_ylabel('Pixel Y')
    ax.set_title(f'Spatial positions: {sample_id}')
    ax.invert_yaxis()  # Invert to match tissue orientation
    ax.set_aspect('equal')
    plt.tight_layout()
    plt.show()

## 4. Cell2Location Workflow for Spatial Mapping

Cell2location is a Bayesian model that deconvolves spatial transcriptomics spots 
to estimate the abundance of snRNA-seq cell types at each location.

The workflow consists of:
1. Estimate reference cell-type signatures from snRNA-seq data
2. Train cell2location model to map signatures to spatial spots
3. Visualize abundance scores for each cluster
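
The idea behind deconvolution can be illustrated on toy data: if the expected counts in a spot are a weighted sum of cell-type signatures, the weights (abundances) can be recovered from the counts. The sketch below uses ordinary least squares on noise-free simulated data as a deliberately simple stand-in. It is not cell2location's Bayesian model, and all sizes and distributions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_types, n_spots = 50, 3, 4

# Toy reference signatures (genes x cell types) and true abundances (spots x cell types)
signatures = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_types))
true_abundance = rng.uniform(0.0, 5.0, size=(n_spots, n_types))

# Noise-free expected counts per spot: abundance-weighted sum of signatures
expected_counts = true_abundance @ signatures.T  # spots x genes

# Solve signatures @ x = counts for every spot at once
solution, *_ = np.linalg.lstsq(signatures, expected_counts.T, rcond=None)
estimated_abundance = solution.T  # spots x cell types

print(np.allclose(estimated_abundance, true_abundance))
```

Real Visium counts are noisy and overdispersed, which is why a probabilistic model with priors such as cell2location is used instead of a plain linear solve.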

In [None]:
# Cell2location requires additional installation
# pip install cell2location

try:
    import cell2location
    from cell2location.utils.filtering import filter_genes
    from cell2location.models import RegressionModel
    CELL2LOC_AVAILABLE = True
    print("cell2location is available")
except ImportError:
    CELL2LOC_AVAILABLE = False
    print("cell2location not installed. Install with: pip install cell2location")
    print("\nThe following cells show the workflow structure.")

### 4.1 Prepare Reference Signatures from snRNA-seq

First, we estimate cell-type specific gene expression signatures from the snRNA-seq data.
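
The gene selection rule used for this step combines two thresholds: keep genes detected in a large fraction of cells, and also keep rarer genes if their mean expression among expressing cells is high enough. The sketch below implements that rule with NumPy on a toy count matrix. The function name `select_genes`, the toy thresholds, and the use of strict inequalities are assumptions for illustration.

```python
import numpy as np

def select_genes(counts, min_count, high_fraction, nonzero_mean_cutoff):
    """Boolean mask over genes for a cells x genes count matrix."""
    counts = np.asarray(counts)
    n_cells, n_genes = counts.shape
    n_expressing = (counts > 0).sum(axis=0)
    totals = counts.sum(axis=0)
    # Mean over expressing cells only; genes with no expressing cells get 0
    nonzero_mean = np.divide(
        totals, n_expressing, out=np.zeros(n_genes), where=n_expressing > 0
    )
    widely_expressed = n_expressing > high_fraction * n_cells
    rare_but_strong = (n_expressing > min_count) & (nonzero_mean > nonzero_mean_cutoff)
    return widely_expressed | rare_but_strong

# Toy matrix: 10 cells x 4 genes
counts = np.zeros((10, 4), dtype=int)
counts[:9, 0] = 1   # gene 0: detected in most cells
counts[:2, 1] = 5   # gene 1: rare but strongly expressed
counts[:2, 2] = 1   # gene 2: rare and weakly expressed
                    # gene 3: never detected

mask = select_genes(counts, min_count=1, high_fraction=0.5, nonzero_mean_cutoff=1.4)
print(mask.tolist())  # [True, True, False, False]
```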

In [None]:
def prepare_reference_signatures(adata, cluster_column='C3_named', 
                                  min_cells_per_gene=0.08,
                                  min_nonzero_mean=1.4,
                                  min_cells_rare=0.0005):
    """
    Select genes for reference signature estimation with cell2location.
    
    Parameters from the paper:
    - Include genes expressed in at least 8% of cells
    - Include genes expressed in at least 0.05% of cells if non-zero mean > 1.4
    
    Both cell thresholds are given here as fractions of cells.
    """
    print(f"Preparing reference signatures using {cluster_column}")
    
    # Filter genes based on expression criteria
    if CELL2LOC_AVAILABLE:
        # filter_genes takes the lower threshold as an absolute number of cells
        # (cell_count_cutoff) and the upper threshold as a fraction of cells
        # (cell_percentage_cutoff2)
        selected_genes = filter_genes(
            adata,
            cell_count_cutoff=max(1, int(min_cells_rare * adata.n_obs)),
            cell_percentage_cutoff2=min_cells_per_gene,
            nonz_mean_cutoff=min_nonzero_mean
        )
        adata_filtered = adata[:, selected_genes].copy()
    else:
        # Simplified filtering without cell2location:
        # applies only the upper (fraction of cells) threshold
        gene_expressed = (adata.X > 0).mean(axis=0)
        if hasattr(gene_expressed, 'A1'):
            gene_expressed = gene_expressed.A1
        selected = gene_expressed >= min_cells_per_gene
        adata_filtered = adata[:, selected].copy()
    
    print(f"Selected {adata_filtered.n_vars} genes for signature estimation")
    return adata_filtered

# Example usage (commented out to avoid long computation)
# adata_ref = prepare_reference_signatures(adata_memory, cluster_column='C3_named')

### 4.2 Train Reference Model

The reference model estimates cell-type signatures accounting for batch effects.

In [None]:
def train_reference_model(adata_ref, cluster_column='C3_named',
                          batch_key='Sample_ID', n_epochs=250):
    """
    Train the negative binomial regression model to estimate 
    reference cell-type signatures.
    
    From the paper: "We estimated reference signatures using the negative 
    binomial regression model, accounting for the effects of donor, sex, 
    batch and dataset."
    """
    if not CELL2LOC_AVAILABLE:
        print("Skipping - cell2location not available")
        return None, None  # keep the (model, adata) shape so callers can unpack
    
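    # Setup reference model
    # Only `batch_key` is passed here; further covariates (e.g. donor, sex, dataset)
    # can be supplied through `categorical_covariate_keys`
    cell2location.models.RegressionModel.setup_anndata(
        adata_ref,
        batch_key=batch_key,
        labels_key=cluster_column
    )
    
    # Create and train model
    # Note: recent scvi-tools releases replace `use_gpu` with `accelerator`/`devices`;
    # adjust this call to match the installed version
    model = cell2location.models.RegressionModel(adata_ref)
    model.train(max_epochs=n_epochs, use_gpu=True)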
    
    # Export posterior estimates
    adata_ref = model.export_posterior(
        adata_ref, 
        sample_kwargs={'num_samples': 1000, 'batch_size': 2500}
    )
    
    return model, adata_ref

# Example usage (commented out)
# ref_model, adata_ref = train_reference_model(adata_ref)

### 4.3 Load and Prepare Spatial Data

In [None]:
def load_visium_sample(h5_path, spatial_positions_df):
    """
    Load a Visium sample and attach spatial coordinates.
    """
    # Load the filtered feature matrix
    adata_vis = sc.read_10x_h5(h5_path)
    adata_vis.var_names_make_unique()
    
    # Align spatial positions to the barcode order of the count matrix;
    # barcodes missing from the positions table become NaN rows
    pos = spatial_positions_df.set_index('barcode').reindex(adata_vis.obs_names)
    matched = pos['in_tissue'].notna().to_numpy()
    
    # Add spot metadata (0 for unmatched barcodes)
    for col in ['in_tissue', 'array_row', 'array_col']:
        adata_vis.obs[col] = pos[col].fillna(0).astype(int).to_numpy()
    
    # Create spatial coordinates in obsm only if every barcode was matched,
    # so that rows of obsm['spatial'] line up with adata_vis.obs_names
    if matched.all():
        adata_vis.obsm['spatial'] = pos[['pixel_x', 'pixel_y']].to_numpy()
    
    return adata_vis

# List available Visium h5 files
visium_dir = Path('../data/raw/human_hypomap/visium_h5')
if visium_dir.exists():
    h5_files = list(visium_dir.glob('*.h5'))
    print(f"Found {len(h5_files)} Visium h5 files:")
    for f in h5_files:
        print(f"  {f.name}")

### 4.4 Train Cell2Location Model

In [None]:
def train_cell2location_model(adata_vis, inf_aver, n_cells_per_location=3,
                               detection_alpha=20, n_epochs=30000):
    """
    Train cell2location model to map cell types to spatial locations.
    
    Parameters from the paper:
    - detection_alpha=20
    - n_cells_per_location=3
    - trained for 30,000 epochs
    """
    if not CELL2LOC_AVAILABLE:
        print("Skipping - cell2location not available")
        return None, None  # keep the (model, adata) shape so callers can unpack
    
    # Setup and train model
    cell2location.models.Cell2location.setup_anndata(
        adata_vis,
        batch_key=None
    )
    
    model = cell2location.models.Cell2location(
        adata_vis,
        cell_state_df=inf_aver,
        N_cells_per_location=n_cells_per_location,
        detection_alpha=detection_alpha
    )
    
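    # Note: recent scvi-tools releases replace `use_gpu` with `accelerator`/`devices`;
    # adjust this call to match the installed version
    model.train(
        max_epochs=n_epochs,
        batch_size=None,
        train_size=1,
        use_gpu=True
    )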
    
    # Export results
    adata_vis = model.export_posterior(
        adata_vis,
        sample_kwargs={'num_samples': 1000, 'batch_size': 2500}
    )
    
    return model, adata_vis

print("Cell2location training function defined.")
print("\nKey hyperparameters from the paper:")
print("  - detection_alpha: 20")
print("  - N_cells_per_location: 3")
print("  - max_epochs: 30,000")

## 5. Visualize Cell2Location Results (Figure 3e Style)

After running cell2location, the abundance estimates for each cluster are stored in 
`adata_vis.obsm['q05_cell_abundance_w_sf']` (the 5% quantile of the posterior, a conservative estimate of abundance). We can visualize these to recreate Figure 3e.
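
As a rough illustration of what a 5% posterior quantile summary looks like, the sketch below draws toy "posterior samples" and takes the quantile across the sample axis. The gamma-distributed samples and array sizes are arbitrary assumptions. In the real pipeline these summaries are produced by `export_posterior`, not by this code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_spots, n_types = 1000, 5, 2

# Toy posterior samples of cell abundance: samples x spots x cell types
samples = rng.gamma(shape=3.0, scale=1.0, size=(n_samples, n_spots, n_types))

# 5% quantile across samples: a value the posterior exceeds about 95% of the time
q05 = np.quantile(samples, 0.05, axis=0)
posterior_mean = samples.mean(axis=0)

print(q05.shape)                            # one value per spot and cell type
print(bool((q05 <= posterior_mean).all()))  # the quantile sits below the mean here
```

Plotting the lower quantile instead of the mean favours spots where the model is confident that a cell type is present.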

In [None]:
def plot_cell2location_abundance(adata_vis, cluster_names, 
                                  abundance_key='q05_cell_abundance_w_sf',
                                  cmap='viridis', figsize=(15, 3)):
    """
    Plot cell2location abundance scores for specified clusters.
    Recreates the style of Figure 3e.
    
    Parameters:
    -----------
    adata_vis : AnnData
        Spatial data with cell2location results
    cluster_names : list
        List of cluster names to plot (e.g., ['C3-118', 'C3-119', ...])
    """
    n_clusters = len(cluster_names)
    fig, axes = plt.subplots(1, n_clusters, figsize=figsize)
    
    if n_clusters == 1:
        axes = [axes]
    
    for ax, cluster in zip(axes, cluster_names):
        if abundance_key in adata_vis.obsm:
            abundance = adata_vis.obsm[abundance_key]
            # Exported column names typically carry a prefix and the full
            # cluster label, so match on the cluster ID as a substring
            matches = [c for c in abundance.columns if cluster in str(c)]
            if matches:
                values = abundance[matches[0]].values
            else:
                values = np.zeros(adata_vis.n_obs)
                print(f"Warning: {cluster} not found in abundance matrix")
        else:
            # Demo with random values
            values = np.random.rand(adata_vis.n_obs)
        
        # Get spatial coordinates
        if 'spatial' in adata_vis.obsm:
            coords = adata_vis.obsm['spatial']
        else:
            # Use array coordinates
            coords = np.column_stack([
                adata_vis.obs['array_col'].values,
                adata_vis.obs['array_row'].values
            ])
        
        # Normalize values for color mapping
        vmin, vmax = np.percentile(values, [5, 95])
        
        scatter = ax.scatter(
            coords[:, 0], coords[:, 1],
            c=values, cmap=cmap,
            vmin=vmin, vmax=vmax,
            s=15, alpha=0.8
        )
        
        ax.set_title(cluster, fontsize=10)
        ax.set_aspect('equal')
        ax.invert_yaxis()
        ax.axis('off')
        
        # Add colorbar
        plt.colorbar(scatter, ax=ax, fraction=0.046, pad=0.04)
    
    plt.suptitle('Cell2Location Abundance Scores\n(Figure 3e style)', y=1.02)
    plt.tight_layout()
    return fig

print("Visualization function defined.")

In [None]:
# Example: Create a demonstration plot with spatial positions
if spatial_positions:
    # Use the first sample for demonstration
    sample_id = list(spatial_positions.keys())[0]
    pos_df = spatial_positions[sample_id]
    in_tissue = pos_df[pos_df['in_tissue'] == 1].copy()
    
    # Create mock abundance data for demonstration
    # In practice, this comes from cell2location output
    vmh_cluster_ids = ['C3-118', 'C3-119', 'C3-120', 'C3-121', 'C3-122']
    
    fig, axes = plt.subplots(1, 5, figsize=(18, 4))
    
    for ax, cluster_id in zip(axes, vmh_cluster_ids):
        # Generate mock abundance (replace with real cell2location output)
        # Real data would show distinct VMH subregion patterns
        # Seed from the numeric cluster ID; hash() of a str varies between Python sessions
        np.random.seed(int(cluster_id.split('-')[1]))
        
        # Simulate cluster-specific spatial patterns
        center_x = in_tissue['pixel_x'].mean() + np.random.randn() * 100
        center_y = in_tissue['pixel_y'].mean() + np.random.randn() * 100
        
        dist = np.sqrt((in_tissue['pixel_x'] - center_x)**2 + 
                       (in_tissue['pixel_y'] - center_y)**2)
        abundance = np.exp(-dist / dist.max() * 3) + np.random.rand(len(dist)) * 0.1
        
        scatter = ax.scatter(
            in_tissue['pixel_x'], 
            in_tissue['pixel_y'],
            c=abundance,
            cmap='YlOrRd',
            s=8,
            alpha=0.8
        )
        
        ax.set_title(f"{cluster_id}\n{vmh_glu2_clusters[cluster_id]['marker']}", 
                     fontsize=10)
        ax.invert_yaxis()
        ax.set_aspect('equal')
        ax.axis('off')
    
    # Add shared colorbar
    cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
    cbar = fig.colorbar(scatter, cax=cbar_ax)
    cbar.set_label('Cell Abundance', rotation=270, labelpad=15)
    
    plt.suptitle(f'VMH GLU-2 Cluster Mapping (Sample: {sample_id})\n'
                 'Simulated data - replace with cell2location output', 
                 fontsize=12, y=1.05)
    plt.tight_layout(rect=[0, 0, 0.9, 1])
    plt.show()
    
    print("\nNote: This is simulated data for demonstration.")
    print("Run the full cell2location pipeline to get real abundance scores.")

## 6. Full Pipeline Summary

To fully reconstruct Figure 3e, run the following steps:

```python
# 1. Prepare reference from snRNA-seq
adata_ref = prepare_reference_signatures(adata_memory, cluster_column='C3_named')

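# 2. Train reference model
ref_model, adata_ref = train_reference_model(adata_ref)
# Signatures as a genes x cell types DataFrame (no transpose), as expected by `cell_state_df`
inf_aver = adata_ref.varm['means_per_cluster_mu_fg'].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']

# 3. Load spatial data and keep only the genes shared with the reference
adata_vis = load_visium_sample(h5_path, spatial_positions_df)
shared_genes = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, shared_genes].copy()
inf_aver = inf_aver.loc[shared_genes, :].copy()

# 4. Train cell2location
c2l_model, adata_vis = train_cell2location_model(adata_vis, inf_aver)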

# 5. Visualize VMH clusters
vmh_clusters = ['C3-118', 'C3-119', 'C3-120', 'C3-121', 'C3-122']
fig = plot_cell2location_abundance(adata_vis, vmh_clusters)
```

Note: The full pipeline requires significant computational resources 
(GPU recommended) and takes several hours to complete.

## 7. Region Annotation

The paper assigns regional annotations to snRNA-seq clusters based on 
spatial abundance scoring in each hypothalamic region.
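
The exact scoring procedure is not reproduced in this notebook. As a hypothetical illustration of the general idea, the sketch below assigns each cluster to the region whose spots have the highest mean abundance. The function `assign_regions`, the highest-mean rule, and the toy values are all assumptions and should not be read as the paper's method.

```python
from collections import defaultdict

def assign_regions(abundance, spot_regions):
    """Assign each cluster to the region with the highest mean abundance.

    abundance: dict mapping cluster name to a list of per-spot abundances
    spot_regions: list with the region label of each spot
    """
    assignments = {}
    for cluster, values in abundance.items():
        totals = defaultdict(float)
        counts = defaultdict(int)
        for value, region in zip(values, spot_regions):
            totals[region] += value
            counts[region] += 1
        means = {region: totals[region] / counts[region] for region in totals}
        assignments[cluster] = max(means, key=means.get)
    return assignments

# Toy example: four spots in two regions, two clusters
spot_regions = ["VMH", "VMH", "ARC", "ARC"]
abundance = {
    "cluster_a": [2.0, 3.0, 0.1, 0.2],
    "cluster_b": [0.0, 0.1, 1.5, 2.5],
}
regions = assign_regions(abundance, spot_regions)
print(regions)  # {'cluster_a': 'VMH', 'cluster_b': 'ARC'}
```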

In [None]:
# Display available region annotations
print("Hypothalamic regions defined in the dataset:\n")
for abbr, (name, color) in HUMAN_REGIONS.items():
    print(f"  {abbr:12s} - {name:40s} {color}")

In [None]:
# Visualize region color scheme
fig, ax = plt.subplots(figsize=(12, 6))

regions = list(HUMAN_REGIONS.keys())
colors = [HUMAN_REGION_COLORS[r] for r in regions]
names = [HUMAN_REGIONS[r][0] for r in regions]

y_pos = range(len(regions))
bars = ax.barh(y_pos, [1]*len(regions), color=colors, edgecolor='black')

ax.set_yticks(y_pos)
ax.set_yticklabels([f"{r} - {n}" for r, n in zip(regions, names)], fontsize=9)
ax.set_xlim(0, 1.5)
ax.set_xlabel('')
ax.set_title('Human Hypothalamus Region Color Scheme')
ax.xaxis.set_visible(False)

plt.tight_layout()
plt.show()

## References

- Tadross, J.A., Steuernagel, L., Dowsett, G.K.C., et al. (2025). A comprehensive spatio-cellular map of the human hypothalamus. *Nature*, 639, 708-716.
- Kleshchevnikov, V., et al. (2022). Cell2location maps fine-grained cell types in spatial transcriptomics. *Nature Biotechnology*, 40, 661-671.

Data availability:
- snRNA-seq: https://cellxgene.cziscience.com/collections/d0941303-7ce3-4422-9249-cf31eb98c480
- Spatial transcriptomics: GEO accession GSE278848
- Code: https://github.com/lsteuernagel/HYPOMAP