# Spatial Transcriptomics Analysis with Scanpy & Squidpy

**Dataset**: 10x Genomics Visium - Human Breast Cancer  
**Goal**: Perform spatial analysis and identify spatially patterned genes  
**Key Learning**: Always use `sc.read_visium()` for proper coordinate loading

---

## Why This Notebook Exists

This notebook demonstrates the **correct way** to analyze 10x Visium spatial transcriptomics data. A critical lesson learned: **never manually load spatial coordinates** - use Scanpy's built-in `read_visium()` function.

### What We'll Cover:

1. **Data Loading** - Proper Visium data loading with `sc.read_visium()`
2. **Quality Control** - Filter low-quality spots and genes
3. **Preprocessing** - Normalization, HVG selection, scaling
4. **Clustering** - PCA, UMAP, Leiden clustering
5. **Spatial Analysis** - Spatial neighbor graphs and Moran's I autocorrelation
6. **Visualization** - Verify spots are ON tissue (not around it!)

---

## Setup & Configuration

In [None]:
import warnings
warnings.filterwarnings('ignore')

import scanpy as sc
import squidpy as sq
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

print(f"scanpy version: {sc.__version__}")
print(f"squidpy version: {sq.__version__}")

In [None]:
SEED = 42
np.random.seed(SEED)

sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=100, facecolor='white', frameon=False)

In [None]:
PROJECT_ROOT = Path("/Users/sriharshameghadri/randomAIProjects/kaggle/medGemma")
DATA_DIR = PROJECT_ROOT / "data" / "sample"
OUTPUT_DIR = PROJECT_ROOT / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

---

## 1. Load Visium Data - THE CORRECT WAY ✅

### Critical: Use `sc.read_visium()` NOT Manual Loading!

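**Why this matters:**
- Visium describes each spot twice: as array indices (row/column on the slide grid) and as pixel positions in the full-resolution image
- Manual loading of `tissue_positions_list.csv` makes it easy to pick the wrong columns, swap x and y, or skip the scalefactor needed to overlay spots on the hires/lowres image
- Wrong coordinates → spots form rectangle AROUND tissue instead of ON tissue
- Wrong coordinates also invalidate downstream spatial statistics (Moran's I, co-occurrence)

**What `sc.read_visium()` does:**
1. Loads gene expression matrix (h5 file)
2. Loads spot positions from the `spatial/` folder and stores full-resolution pixel coordinates in `adata.obsm['spatial']`
3. Loads tissue images (hires and lowres; the full-resolution image is not loaded)
4. Stores the scalefactors in `adata.uns['spatial']`, which the plotting functions use for alignment
5. Creates complete AnnData object with all spatial metadata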
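To make the role of the scalefactors concrete, here is a minimal sketch with made-up numbers. It assumes that `tissue_hires_scalef` (the key used in Space Ranger's `scalefactors_json.json`) converts full-resolution pixel coordinates into hires-image pixels. The helper `to_image_pixels` and all values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def to_image_pixels(coords_fullres, scalef):
    """Map full-resolution pixel coordinates onto a downsampled image."""
    return np.asarray(coords_fullres, dtype=float) * scalef

# Toy full-resolution spot centres (x, y); illustrative values only.
coords_fullres = np.array([[12000.0, 8000.0], [12500.0, 8300.0]])
tissue_hires_scalef = 0.1  # assumed value, not taken from a real dataset

coords_hires = to_image_pixels(coords_fullres, tissue_hires_scalef)
print(coords_hires)
```

In a loaded AnnData object the real factor lives under `adata.uns['spatial'][library_id]['scalefactors']`. Forgetting this multiplication when plotting by hand is one way to end up with spots far outside the tissue image.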

In [None]:
# ✅ CORRECT APPROACH - Use Scanpy's built-in Visium reader
adata = sc.read_visium(
    DATA_DIR,
    count_file='Visium_Human_Breast_Cancer_filtered_feature_bc_matrix.h5',
    load_images=True
)
adata.var_names_make_unique()

print(f"Dataset loaded: {adata.shape}")
print(f"Spatial coordinates: {adata.obsm['spatial'].shape}")

library_id = list(adata.uns['spatial'].keys())[0]
print(f"Library ID: {library_id}")

### Validation: Check Coordinates Are Loaded Correctly

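`adata.obsm['spatial']` should hold full-resolution pixel coordinates: values that are typically in the thousands and vary irregularly from spot to spot. Small sequential integers (for example 0, 1, 2, ...) suggest that array row/column indices were loaded instead.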

In [None]:
print("First 5 spatial coordinates:")
print(adata.obsm['spatial'][:5])
print("\nCoordinate ranges:")
print(f"X range: {adata.obsm['spatial'][:, 0].min():.0f} - {adata.obsm['spatial'][:, 0].max():.0f}")
print(f"Y range: {adata.obsm['spatial'][:, 1].min():.0f} - {adata.obsm['spatial'][:, 1].max():.0f}")
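The visual check above can be complemented by a rough automated heuristic. The sketch below flags coordinates that look like slide-grid indices rather than pixels. It assumes the standard Visium array layout with indices no larger than 127; the function name `looks_like_array_indices` and the threshold are illustrative choices, not part of Scanpy or Squidpy.

```python
import numpy as np

def looks_like_array_indices(coords, max_index=127):
    """Heuristic: True if coords resemble grid indices rather than pixels."""
    coords = np.asarray(coords, dtype=float)
    all_integers = np.all(coords == np.round(coords))
    in_grid_range = coords.min() >= 0 and coords.max() <= max_index
    return bool(all_integers and in_grid_range)

# Toy examples; illustrative values only.
pixel_like = np.array([[12034, 8120], [15210, 3977], [9844, 11002]])
grid_like = np.array([[0, 0], [1, 1], [2, 0], [77, 127]])

print(looks_like_array_indices(pixel_like))
print(looks_like_array_indices(grid_like))
```

On real data this would be called as `looks_like_array_indices(adata.obsm['spatial'])`. A `True` result is a hint to re-check how the coordinates were loaded, not proof of an error.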

---

## 2. Quality Control

Calculate QC metrics for each spot:
- `n_genes_by_counts`: Number of genes with >0 counts
- `total_counts`: Total UMI counts per spot
- `pct_counts_mt`: Percentage of mitochondrial gene expression
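As a sanity check on what these three metrics mean, the following sketch computes them by hand with NumPy on a toy count matrix. The matrix and gene names are made up for illustration. `sc.pp.calculate_qc_metrics` computes the same quantities on the real data.

```python
import numpy as np

# Toy count matrix: 3 spots x 4 genes (illustrative values only).
counts = np.array([
    [5, 0, 3, 2],
    [0, 0, 1, 9],
    [4, 4, 0, 2],
])
gene_names = np.array(["ACTB", "CD3E", "KRT8", "MT-CO1"])

# Mitochondrial genes are identified by the "MT-" prefix, as in the notebook.
is_mt = np.char.startswith(gene_names, "MT-")

n_genes_by_counts = (counts > 0).sum(axis=1)   # genes with >0 counts per spot
total_counts = counts.sum(axis=1)              # total UMIs per spot
pct_counts_mt = 100.0 * counts[:, is_mt].sum(axis=1) / total_counts

print(n_genes_by_counts, total_counts, pct_counts_mt)
```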

In [None]:
# Flag mitochondrial genes first, then compute all QC metrics in a single pass
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)

print(f"Mean genes per spot: {int(adata.obs['n_genes_by_counts'].mean())}")
print(f"Mean counts per spot: {int(adata.obs['total_counts'].mean())}")
print(f"Mean MT%: {adata.obs['pct_counts_mt'].mean():.2f}%")

### Visualize QC Metrics

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sc.pl.violin(adata, 'n_genes_by_counts', ax=axes[0], show=False)
axes[0].set_title('Genes per Spot')

sc.pl.violin(adata, 'total_counts', ax=axes[1], show=False)
axes[1].set_title('Total Counts per Spot')

sc.pl.violin(adata, 'pct_counts_mt', ax=axes[2], show=False)
axes[2].set_title('Mitochondrial %')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'qc_metrics.png', dpi=150, bbox_inches='tight')
plt.show()

---

## 3. Filtering

Remove low-quality spots and uninformative genes:
- **Spots**: Keep only spots with ≥200 genes detected
- **Genes**: Keep only genes detected in ≥3 spots
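The order of the two filters matters: genes are counted only in the spots that survive the first filter. The sketch below shows this on a toy matrix with small thresholds. The helper `filter_spots_then_genes` is hypothetical and only mirrors the logic of calling `filter_cells` followed by `filter_genes`.

```python
import numpy as np

def filter_spots_then_genes(counts, min_genes, min_spots):
    """Drop spots with too few detected genes, then genes seen in too few spots."""
    counts = np.asarray(counts)
    keep_spots = (counts > 0).sum(axis=1) >= min_genes
    counts = counts[keep_spots]
    # Gene detection is evaluated on the remaining spots only.
    keep_genes = (counts > 0).sum(axis=0) >= min_spots
    return counts[:, keep_genes], keep_spots, keep_genes

# Toy matrix: 4 spots x 4 genes (illustrative values only).
counts = np.array([
    [1, 2, 0, 4],
    [0, 0, 0, 7],
    [3, 1, 0, 2],
    [2, 5, 1, 0],
])
filtered, keep_spots, keep_genes = filter_spots_then_genes(
    counts, min_genes=2, min_spots=2
)
print(filtered.shape, keep_spots, keep_genes)
```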

In [None]:
print(f"Before filtering: {adata.n_obs} spots, {adata.n_vars} genes")

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

print(f"After filtering: {adata.n_obs} spots, {adata.n_vars} genes")

---

## 4. Normalization & Preprocessing

### Normalization
- Normalize each spot to 10,000 total counts
- Log-transform for variance stabilization

### Highly Variable Genes (HVGs)
- Select top 2,000 genes with high biological variation
- Use Seurat method (standard, no extra dependencies)

### Scaling
- Z-score normalization per gene
- Clip to max_value=10 to reduce outlier influence
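The three transformations can be written out in a few lines of NumPy. This is a simplified sketch, not Scanpy's implementation: it assumes every spot has at least one count, uses the sample standard deviation, and clips only values above `max_value`. The function name `normalize_log_scale` is hypothetical.

```python
import numpy as np

def normalize_log_scale(counts, target_sum=1e4, max_value=10.0):
    """Library-size normalize, log1p-transform, then z-score each gene."""
    counts = np.asarray(counts, dtype=float)
    size_factors = counts.sum(axis=1, keepdims=True)
    normalized = counts / size_factors * target_sum
    logged = np.log1p(normalized)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # avoid dividing by zero for constant genes
    scaled = np.clip((logged - mean) / std, None, max_value)
    return normalized, logged, scaled

# Toy matrix: 3 spots x 3 genes (illustrative values only).
counts = np.array([
    [10, 0, 90],
    [5, 5, 40],
    [0, 20, 180],
])
normalized, logged, scaled = normalize_log_scale(counts)
print(normalized.sum(axis=1))
```

After normalization every spot sums to `target_sum`, which removes differences in sequencing depth before genes are compared across spots.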

In [None]:
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat')
print(f"Highly variable genes: {adata.var['highly_variable'].sum()}")

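# Keep the log-normalized values for all genes before subsetting to HVGs
adata.raw = adata
# .copy() turns the view into a standalone AnnData before it is scaled
adata = adata[:, adata.var['highly_variable']].copy()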

sc.pp.scale(adata, max_value=10)

---

## 5. Dimensionality Reduction & Clustering

### PCA
- Reduce 2,000 HVGs to 50 principal components
- Captures major axes of variation; the first 40 components (`n_pcs=40`) feed the neighbor graph

### UMAP
- Nonlinear embedding for visualization
- Preserves local and some global structure

### Leiden Clustering
- Graph-based community detection
- Resolution=0.5 for moderate granularity
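For intuition about the PCA step, here is a minimal sketch of PCA via singular value decomposition on random data. It is not Scanpy's solver, and component signs may differ between implementations. The function `pca_scores` and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def pca_scores(X, n_comps):
    """Project centered data onto its top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comps] * S[:n_comps]
    explained_var = (S[:n_comps] ** 2) / (X.shape[0] - 1)
    return scores, explained_var

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))      # toy data: 200 spots x 30 genes

scores, explained_var = pca_scores(X, n_comps=10)
neighbor_input = scores[:, :8]      # use only the leading components downstream
print(scores.shape, neighbor_input.shape)
```

The last line mirrors the notebook's pattern of computing more components than are passed on to the neighbor graph.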

In [None]:
sc.tl.pca(adata, n_comps=50, random_state=SEED)
sc.pp.neighbors(adata, n_pcs=40, random_state=SEED)
sc.tl.umap(adata, random_state=SEED)
sc.tl.leiden(adata, resolution=0.5, random_state=SEED)

print(f"Clusters found: {adata.obs['leiden'].nunique()}")

### Visualize Clustering

In [None]:
sc.pl.umap(adata, color='leiden', title='Leiden Clustering', legend_loc='on data')

---

## 6. Spatial Analysis

### Spatial Neighbor Graph

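Build a graph connecting neighboring spots:
- **n_neighs=6**: Matches the hexagonal Visium lattice, where each interior spot has six adjacent spots
- **coord_type='generic'**: Build a k-nearest-neighbor graph from the pixel coordinates in `adata.obsm['spatial']`

For interior spots this graph captures which spots are physically adjacent on the tissue. Because the generic mode looks for six nearest neighbors regardless of distance, spots on the tissue edge or next to filtered-out spots are also linked to more distant spots. `coord_type='grid'` is Squidpy's Visium-specific alternative.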
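The effect of a six-nearest-neighbor search on a hexagonal layout can be checked on a small synthetic grid. The sketch below builds an idealized hex lattice with unit spacing (an assumption, not real Visium coordinates) and compares neighbor distances for an interior spot and a corner spot. The helper names are hypothetical.

```python
import numpy as np

def hex_grid(n_rows, n_cols):
    """Idealized hexagonal lattice with unit spacing between adjacent spots."""
    coords = []
    for r in range(n_rows):
        for c in range(n_cols):
            coords.append((c + 0.5 * (r % 2), r * np.sqrt(3) / 2))
    return np.array(coords)

def knn_distances(coords, k):
    """Sorted distances from each spot to its k nearest other spots."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a spot is not its own neighbor
    return np.sort(dist, axis=1)[:, :k]

coords = hex_grid(5, 5)
nn_dist = knn_distances(coords, k=6)

interior = 2 * 5 + 2   # row 2, column 2
corner = 0             # row 0, column 0
print(nn_dist[interior])
print(nn_dist[corner])
```

The interior spot has six neighbors at the lattice spacing, while the corner spot's six nearest neighbors include spots up to two spacings away. This is why edge spots deserve attention when interpreting statistics built on a k-nearest-neighbor graph.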

In [None]:
sq.gr.spatial_neighbors(adata, coord_type='generic', n_neighs=6)
print(f"Spatial graph built: {adata.obsp['spatial_connectivities'].shape}")

### Moran's I Spatial Autocorrelation

**What is Moran's I?**
- Statistical measure of spatial clustering
- Typical range: about -1 (dispersed) to +1 (clustered)
- Tests if gene expression is spatially correlated

**Interpretation:**
- **I > 0**: Similar expression in neighboring spots (spatial clustering)
- **I ≈ 0**: Random spatial distribution
- **I < 0**: Dissimilar expression in neighbors (rare in biology)

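**Significance:**
- `n_perms=100` adds permutation-based p-values (`pval_sim`) to the results
- The cell below filters on `pval_norm`, the analytic p-value under a normality assumption
- p < 0.05 is treated as a significant spatial pattern; this threshold is applied without multiple-testing correction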
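To see how the statistic behaves, here is a small self-contained sketch of Moran's I with a one-sided permutation p-value, evaluated on a chain of six spots. It uses the textbook formula with a binary adjacency matrix; Squidpy's computation may differ in details such as weight normalization. The function names are hypothetical.

```python
import numpy as np

def morans_i(x, W):
    """Moran's I = (n / sum(W)) * (z' W z) / (z' z), with z = x - mean(x)."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def permutation_pvalue(x, W, n_perms=100, seed=0):
    """One-sided p-value: how often shuffled values reach the observed I."""
    rng = np.random.default_rng(seed)
    observed = morans_i(x, W)
    sims = np.array([morans_i(rng.permutation(x), W) for _ in range(n_perms)])
    return (np.sum(sims >= observed) + 1) / (n_perms + 1)

# Chain of 6 spots: each spot is adjacent to its left and right neighbor.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

clustered = np.array([0, 0, 0, 1, 1, 1])     # similar values sit together
alternating = np.array([0, 1, 0, 1, 0, 1])   # neighbors always differ

print(morans_i(clustered, W), morans_i(alternating, W))
print(permutation_pvalue(clustered, W))
```

The clustered pattern yields a positive I and the alternating pattern a negative one, matching the interpretation above. With only six spots the permutation p-value is coarse; on real data the number of spots and permutations determines its resolution.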

In [None]:
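# NOTE: this takes the first 100 HVGs in var order, not the 100 most variable genes
top_hvgs = adata.var_names[:100]
# seed makes the permutation-based columns reproducible
sq.gr.spatial_autocorr(adata, mode='moran', genes=top_hvgs, n_perms=100, n_jobs=1, seed=SEED)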

morans_df = adata.uns['moranI'].sort_values('I', ascending=False)
significant_genes = morans_df[morans_df['pval_norm'] < 0.05]

print(f"Significant spatial genes: {len(significant_genes)}")
print("\nTop 10 spatially autocorrelated genes:")
print(significant_genes.head(10))

### Biological Interpretation of Top Spatial Genes

Example: if ISG15 is the top gene:
- **ISG15**: Interferon-stimulated gene
- High Moran's I → clusters in specific tissue regions
- Likely marks immune-activated areas (tumor-immune interface)

Complement genes (C1QA/C1QB/C1QC):
- Part of innate immune system
- Expressed by macrophages
- Co-localization suggests immune infiltration zones

---

## 7. Spatial Visualization - CRITICAL VALIDATION

### What You Should See:

✅ **CORRECT**:
- Spots scattered ACROSS the tissue
- Clusters following tissue morphology
- Dense regions with high counts
- Tissue architecture visible

❌ **WRONG**:
- Spots forming rectangle AROUND tissue
- Fiducial markers visible as circles
- Geometric patterns instead of biological patterns
- Complete tissue-spot misalignment

If you see the WRONG pattern → coordinates are not loaded correctly!

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 16))

# Panel 1: Scanpy - Clusters on tissue
sc.pl.spatial(
    adata,
    library_id=library_id,
    color='leiden',
    ax=axes[0, 0],
    title='Leiden Clusters (Scanpy)',
    size=1.5,
    show=False
)

# Panel 2: Squidpy - Clusters on tissue
sq.pl.spatial_scatter(
    adata,
    library_id=library_id,
    color='leiden',
    ax=axes[0, 1],
    title='Leiden Clusters (Squidpy)',
    size=1.5,
    img=True,
    img_res_key='hires',
    img_alpha=0.5,
    alpha=0.8,
    frameon=False
)

# Panel 3: Total counts per spot
sc.pl.spatial(
    adata,
    library_id=library_id,
    color='total_counts',
    ax=axes[1, 0],
    title='Total Counts per Spot',
    size=1.5,
    cmap='viridis',
    show=False
)

# Panel 4: Top spatial gene expression
if len(significant_genes) > 0:
    top_gene = significant_genes.index[0]
    sc.pl.spatial(
        adata,
        library_id=library_id,
        color=top_gene,
        ax=axes[1, 1],
        title=f'Top Spatial Gene: {top_gene}',
        size=1.5,
        cmap='Reds',
        show=False
    )

plt.tight_layout()
output_path = OUTPUT_DIR / "spatial_tissue_overview.png"
plt.savefig(output_path, dpi=150, bbox_inches='tight')
print(f"\nSaved: {output_path}")
plt.show()

---

## 8. Spatial Co-occurrence Analysis

How much more (or less) often does one cluster appear near another, across increasing distances, than its overall frequency would suggest?

**Interpretation:**
- High co-occurrence → tissue niches or microenvironments
- Example: Tumor-immune interface (tumor + immune clusters adjacent)
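The idea behind a co-occurrence score can be sketched as a ratio of two probabilities: how often cluster A appears near a spot of cluster B, divided by how often A appears near any spot. The version below uses a single radius and toy coordinates. It is a simplification for intuition, not Squidpy's implementation, which evaluates a series of distance intervals. The function `co_occurrence_ratio` is hypothetical.

```python
import numpy as np

def co_occurrence_ratio(coords, labels, a, b, radius):
    """P(neighbor is a | spot is b) / P(neighbor is a), within a fixed radius."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    near = dist <= radius
    np.fill_diagonal(near, False)   # ignore self-pairs
    is_a = labels == a
    is_b = labels == b
    # Rows index the neighbor, columns index the conditioning spot.
    pairs_given_b = near[:, is_b]
    p_a_given_b = pairs_given_b[is_a].sum() / pairs_given_b.sum()
    p_a = near[is_a].sum() / near.sum()
    return p_a_given_b / p_a

# Two well-separated pairs of spots (illustrative values only).
coords = [[0, 0], [1, 0], [10, 0], [11, 0]]
labels = ["A", "A", "B", "B"]

print(co_occurrence_ratio(coords, labels, "A", "A", radius=1.5))
print(co_occurrence_ratio(coords, labels, "A", "B", radius=1.5))
```

A ratio above 1 means the two clusters sit near each other more often than their overall frequencies predict; a ratio below 1 means they tend to avoid each other at that distance.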

In [None]:
sq.gr.co_occurrence(adata, cluster_key='leiden')

sq.pl.co_occurrence(
    adata,
    cluster_key='leiden',
    figsize=(8, 8)
)
plt.savefig(OUTPUT_DIR / 'spatial_cooccurrence.png', dpi=150, bbox_inches='tight')
plt.show()

---

## 9. Export Results for MedGemma

Create JSON file with spatial features for clinical report generation.
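NumPy scalars and arrays are not JSON-serializable by default, which is why the export cell wraps values in `int()` and `float()`. As an alternative, a small recursive converter can handle nested structures in one place. The helper `to_builtin` and the toy `features` dictionary below are hypothetical and only sketch the approach.

```python
import json
import numpy as np

def to_builtin(value):
    """Recursively convert NumPy scalars and arrays into JSON-friendly types."""
    if isinstance(value, dict):
        return {str(k): to_builtin(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_builtin(v) for v in value]
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):
        return value.item()
    return value

# Toy feature dictionary with NumPy types (illustrative values only).
features = {
    "n_clusters": np.int64(7),
    "top_gene": {"morans_i": np.float32(0.5)},
    "counts": np.array([1, 2, 3]),
}

payload = json.dumps(to_builtin(features), indent=2)
print(payload)
```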

In [None]:
import json

# Prepare cluster statistics
cluster_counts = adata.obs['leiden'].value_counts().to_dict()
cluster_stats = {f"cluster_{k}": int(v) for k, v in cluster_counts.items()}

# Prepare spatial gene information
top_spatial_genes = significant_genes.head(20)[['I', 'pval_norm']].to_dict('index')
spatial_genes_clean = {
    gene: {"morans_i": float(data['I']), "pval": float(data['pval_norm'])}
    for gene, data in top_spatial_genes.items()
}

# Create output dictionary
output = {
    "dataset_info": {
        "total_spots": int(adata.n_obs),
        # adata was subset to HVGs in section 4, so this is the HVG count
        "total_genes": int(adata.n_vars),
        "n_clusters": int(adata.obs['leiden'].nunique())
    },
    "clusters": cluster_stats,
    "spatial_statistics": {
        "morans_i": {
            "n_significant_genes": len(significant_genes),
            "top_genes": spatial_genes_clean
        }
    },
    "qc_metrics": {
        "mean_genes_per_spot": float(adata.obs['n_genes_by_counts'].mean()),
        "mean_counts_per_spot": float(adata.obs['total_counts'].mean()),
        "mean_mt_percent": float(adata.obs['pct_counts_mt'].mean())
    }
}

# Save to JSON
json_path = OUTPUT_DIR / "spatial_features.json"
with open(json_path, 'w') as f:
    json.dump(output, f, indent=2)

print(f"\nSpatial features saved to: {json_path}")
print(f"File size: {json_path.stat().st_size / 1024:.1f} KB")

---

## 10. Save Processed Data

In [None]:
h5ad_path = OUTPUT_DIR / "processed_visium.h5ad"
adata.write(h5ad_path)
print(f"\nProcessed data saved to: {h5ad_path}")
print(f"File size: {h5ad_path.stat().st_size / (1024**2):.1f} MB")

---

## Summary & Key Takeaways

### What We Did:

1. ✅ Loaded Visium data **correctly** using `sc.read_visium()`
2. ✅ Performed QC and filtered low-quality data
3. ✅ Normalized, selected HVGs, and scaled data
4. ✅ Identified clusters using Leiden algorithm
5. ✅ Built spatial neighbor graph (6-neighbor hexagonal)
6. ✅ Calculated Moran's I for spatial autocorrelation
7. ✅ Visualized spatial patterns on tissue
8. ✅ Analyzed cluster co-occurrence
9. ✅ Exported features for downstream analysis

### Critical Lessons:

1. **ALWAYS use `sc.read_visium()`** - don't manually load coordinates
2. **Visually validate** - spots should be ON tissue, not around it
3. **Moran's I identifies spatial genes** - high I = spatially clustered expression
4. **Spatial neighbors matter** - 6-neighbor hexagonal for Visium
5. **Biological interpretation** - spatially clustered immune genes may indicate immune infiltration; confirm with cell type annotation

### Next Steps:

1. **Cell Type Identification** - Annotate clusters with cell type labels
2. **MedGemma Integration** - Generate clinical reports from spatial features
3. **Deployment** - Create Streamlit app for end-to-end pipeline