# Cell Type Annotation with CellTypist

**Dataset**: 10x Genomics Visium - Human Breast Cancer  
**Model**: Adult Breast (Kumar et al., 2023) - 58 cell types  
**Goal**: Automated cell type annotation for spatial spots

---

## About CellTypist

CellTypist is a machine learning tool for automated cell type annotation using pre-trained models.

**Key Features**:
- Pre-trained on large single-cell atlases
- Logistic regression classifier (interpretable)
- Probability scores for each cell type
- Majority voting option to refine predictions

**Adult Breast Model**:
- **Source**: Kumar et al. (2023), Nature
- **Cell Types**: 58 annotated types from adult human breast tissue
- **Reference**: https://doi.org/10.1038/s41586-023-06252-9
- **Coverage**: Epithelial, stromal, immune, endothelial cells

---

## Important Notes for Spatial Data

**Visium Spots vs. Single Cells**:
- Each Visium spot captures ~1-10 cells
- Predictions represent the **dominant cell type** in each spot
- Mixed spots tend to show lower confidence scores
- Majority voting can reduce spot-level noise

**Model Requirements**:
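- Input: log1p-transformed counts normalized to 10,000 counts per spot
- If `adata.X` is not in that format (for example, scaled values), CellTypist falls back to `adata.raw` when it is available
- Automatic subsetting to the genes shared with the model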
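
The input requirement above can be checked numerically before annotation. The following is a minimal sketch, assuming a dense matrix (a sparse matrix would need converting first); the helper name `looks_log1p_10k` and the 1% tolerance are illustrative choices, not part of CellTypist.

```python
import numpy as np

def looks_log1p_10k(X, target_sum=1e4, rtol=0.01):
    """Return True if each row of X, after undoing log1p, sums to about target_sum."""
    X = np.asarray(X, dtype=float)
    if X.min() < 0:
        # Scaled (z-scored) data contains negative values; log1p data cannot
        return False
    row_sums = np.expm1(X).sum(axis=1)
    return bool(np.allclose(row_sums, target_sum, rtol=rtol))

# Toy data: 3 spots x 4 genes of raw counts (hypothetical values)
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(3, 4)).astype(float)

# Normalize each spot to 10,000 counts, then log1p-transform
normalized = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

# Z-score each gene, as is commonly done before PCA
scaled = (normalized - normalized.mean(axis=0)) / normalized.std(axis=0)

print("normalized passes:", looks_log1p_10k(normalized))
print("scaled passes:", looks_log1p_10k(scaled))
```

In this notebook the same check could be applied to `adata.raw.X` (after densifying) to confirm that it holds normalized log counts rather than scaled values.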

---

## Setup

In [None]:
import warnings
warnings.filterwarnings('ignore')

import scanpy as sc
import squidpy as sq
import celltypist
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

print(f"scanpy version: {sc.__version__}")
print(f"squidpy version: {sq.__version__}")
print(f"celltypist version: {celltypist.__version__}")

In [None]:
SEED = 42
np.random.seed(SEED)

sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=100, facecolor='white', frameon=False)

In [None]:
PROJECT_ROOT = Path("/Users/sriharshameghadri/randomAIProjects/kaggle/medGemma")
OUTPUT_DIR = PROJECT_ROOT / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

---

## 1. Load Processed Data

Load the processed Visium data from the previous notebook.

In [None]:
h5ad_path = OUTPUT_DIR / "processed_visium.h5ad"

if not h5ad_path.exists():
    raise FileNotFoundError(
        f"Processed data not found at {h5ad_path}.\n"
        "Please run notebook 01_spatial_analysis.ipynb first."
    )

adata = sc.read_h5ad(h5ad_path)
print(f"Loaded: {adata.shape}")
print(f"Clusters: {adata.obs['leiden'].nunique()}")

### Verify Data Structure

CellTypist expects log1p-transformed counts normalized to 10,000 counts per spot, not scaled values.

In [None]:
print("Data layers:")
print(f"  Main matrix (scaled HVGs): {adata.X.shape}")
print(f"  Raw counts available: {adata.raw is not None}")

if adata.raw is not None:
    print(f"  Raw matrix (all genes): {adata.raw.X.shape}")

---

## 2. Download CellTypist Model

Download the pre-trained Adult Breast model from the CellTypist repository.

In [None]:
model_name = 'Cells_Adult_Breast.pkl'

print(f"Downloading model: {model_name}")
print("This may take a minute...\n")

celltypist.models.download_models(model=model_name, force_update=False)

### Inspect Model Information

In [None]:
model = celltypist.models.Model.load(model=model_name)

print(f"Model: {model_name}")
print(f"Cell types: {len(model.cell_types)}")
print(f"Features (genes): {len(model.features)}")
print("\nFirst 20 cell types:")
print(model.cell_types[:20])

---

## 3. Prepare Data for CellTypist

CellTypist works best with normalized log counts from all genes (not scaled HVGs).

**Strategy**:
- Use `adata.raw`, which contains normalized log counts for ALL genes
- CellTypist will automatically subset to its model genes
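
Because the model only uses genes it was trained on, it is worth checking how many of them are present in the data. The sketch below uses plain NumPy with hypothetical gene lists; `gene_overlap` is an illustrative helper, not a CellTypist function.

```python
import numpy as np

def gene_overlap(data_genes, model_genes):
    """Return the shared genes and the fraction of model genes found in the data."""
    data_genes = np.asarray(data_genes)
    model_genes = np.asarray(model_genes)
    shared = np.intersect1d(data_genes, model_genes)  # sorted, unique
    return shared, len(shared) / len(model_genes)

# Hypothetical gene lists for illustration only
data_genes = ["EPCAM", "KRT18", "PTPRC", "CD3E", "MALAT1"]
model_genes = ["EPCAM", "KRT18", "PTPRC", "PECAM1"]

shared, fraction = gene_overlap(data_genes, model_genes)
print(f"{len(shared)} of {len(model_genes)} model genes found ({fraction:.0%})")
```

With the objects in this notebook, the inputs would be `adata_for_celltypist.var_names` and `model.features`. A low overlap would suggest a gene naming mismatch (for example, Ensembl IDs versus gene symbols).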

In [None]:
if adata.raw is None:
    raise ValueError(
        "adata.raw is missing. Cannot proceed with CellTypist.\n"
        "Re-run 01_spatial_analysis.ipynb ensuring adata.raw is saved."
    )

adata_for_celltypist = adata.raw.to_adata()
print(f"Prepared data: {adata_for_celltypist.shape}")
print("Data type: Normalized log counts")

---

## 4. Run CellTypist Prediction

**Parameters**:
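- **model**: Adult Breast tissue model
- **majority_voting**: Refine predictions by over-clustering the spots and giving each subcluster its most common predicted label
- **mode**: 'best match' (highest probability cell type); this is the default, so it is not passed explicitly below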

**Expected Runtime**: ~30-60 seconds on M1 Mac

In [None]:
print("Running CellTypist annotation...")
print("This may take 30-60 seconds...\n")

predictions = celltypist.annotate(
    adata_for_celltypist,
    model=model_name,
    majority_voting=True
)

print("\nPrediction complete!")

### Extract Predictions

CellTypist returns:
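- **predicted_labels**: Per-spot cell type before majority voting
- **over_clustering**: Subcluster assignment used for the vote
- **majority_voting**: Cell type after voting within each subcluster (recommended)
- **conf_score**: Probability of the predicted label (0-1)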
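
To make the relationship between the per-spot labels and the voted labels concrete, here is a minimal pandas sketch of the voting idea: every spot in a subcluster receives that subcluster's most frequent per-spot label. It illustrates the principle only and is not CellTypist's implementation; the labels and cluster IDs are hypothetical.

```python
import pandas as pd

def majority_vote(predicted, clusters):
    """Assign each spot the most frequent predicted label within its cluster."""
    df = pd.DataFrame({"predicted": predicted, "cluster": clusters})
    # Most common label per cluster
    winner = df.groupby("cluster")["predicted"].agg(lambda s: s.value_counts().idxmax())
    # Broadcast the winning label back to every spot in that cluster
    return df["cluster"].map(winner)

# Hypothetical per-spot predictions and over-clustering assignments
predicted = ["T cell", "T cell", "B cell", "Fibroblast", "Fibroblast", "Fibroblast"]
clusters = ["0", "0", "0", "1", "1", "1"]

voted = majority_vote(predicted, clusters)
print(voted.tolist())
```

The single "B cell" call in cluster 0 is overruled by its neighbors, which is how voting smooths isolated misclassifications at the cost of hiding rare cell types inside a subcluster.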

In [None]:
adata_annotated = predictions.to_adata()

print("Annotation columns added:")
print(f"  - predicted_labels: {adata_annotated.obs['predicted_labels'].nunique()} types")
print(f"  - majority_voting: {adata_annotated.obs['majority_voting'].nunique()} types")
print(f"  - conf_score: range {adata_annotated.obs['conf_score'].min():.2f} - {adata_annotated.obs['conf_score'].max():.2f}")

---

## 5. Transfer Annotations to Original Data

Copy cell type predictions back to the original spatial AnnData object.

In [None]:
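# .values copies by position, so confirm both objects list the spots in the same order
assert adata.obs_names.equals(adata_annotated.obs_names), "Spot order differs between objects"

adata.obs['cell_type'] = adata_annotated.obs['majority_voting'].values
adata.obs['cell_type_raw'] = adata_annotated.obs['predicted_labels'].values
adata.obs['cell_type_confidence'] = adata_annotated.obs['conf_score'].values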

print("Annotations transferred to spatial data.")
print(f"Cell types identified: {adata.obs['cell_type'].nunique()}")
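
Copying with `.values` is positional, so it is only correct when both objects list the spots in the same order. The toy example below, with hypothetical barcodes and labels, shows how positional copying can mislabel spots when the order differs and how aligning on the index avoids that.

```python
import pandas as pd

# Target table: spots in their original order
spatial_obs = pd.DataFrame(index=["spot_A", "spot_B", "spot_C"])

# Annotation table: same spots, different row order (hypothetical labels)
annotated_obs = pd.DataFrame(
    {"majority_voting": ["Fibroblast", "T cell", "Luminal"]},
    index=["spot_C", "spot_A", "spot_B"],
)

# Positional copy: ignores the index, so labels land on the wrong spots here
positional = annotated_obs["majority_voting"].values

# Index-aligned copy: each label follows its own barcode
aligned = annotated_obs["majority_voting"].reindex(spatial_obs.index)
spatial_obs["cell_type"] = aligned

print("positional:", list(positional))
print("aligned:   ", spatial_obs["cell_type"].tolist())
```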

---

## 6. Analyze Cell Type Composition

### Cell Type Counts

In [None]:
cell_type_counts = adata.obs['cell_type'].value_counts()

print("Top 15 cell types by spot count:")
print(cell_type_counts.head(15))

print(f"\nTotal unique cell types: {len(cell_type_counts)}")

### Confidence Score Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(adata.obs['cell_type_confidence'], bins=50, color='skyblue', edgecolor='black')
axes[0].axvline(adata.obs['cell_type_confidence'].median(), color='red', linestyle='--', label='Median')
axes[0].set_xlabel('Confidence Score')
axes[0].set_ylabel('Number of Spots')
axes[0].set_title('Cell Type Prediction Confidence')
axes[0].legend()

top_types = cell_type_counts.head(10).index
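sns.violinplot(
    data=adata.obs[adata.obs['cell_type'].isin(top_types)],
    x='cell_type',
    y='cell_type_confidence',
    order=list(top_types),  # limit the axis to the top 10; unused categories would otherwise be drawn
    ax=axes[1]
)
axes[1].tick_params(axis='x', rotation=45)
plt.setp(axes[1].get_xticklabels(), ha='right')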
axes[1].set_xlabel('Cell Type')
axes[1].set_ylabel('Confidence Score')
axes[1].set_title('Confidence by Cell Type (Top 10)')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'cell_type_confidence.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nMedian confidence: {adata.obs['cell_type_confidence'].median():.2f}")
print(f"Mean confidence: {adata.obs['cell_type_confidence'].mean():.2f}")
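
Since mixed spots tend to receive lower confidence, it can help to flag them before interpreting the labels. The sketch below uses a toy table and a hypothetical cutoff of 0.5; the threshold is an assumption for illustration and should be chosen from the histogram above.

```python
import pandas as pd

CONF_THRESHOLD = 0.5  # hypothetical cutoff, not a CellTypist default

# Toy stand-in for adata.obs (hypothetical labels and scores)
obs = pd.DataFrame({
    "cell_type": ["Fibroblast", "Fibroblast", "T cell", "T cell"],
    "cell_type_confidence": [0.92, 0.35, 0.61, 0.12],
})

# Flag spots whose top probability falls below the cutoff
obs["low_confidence"] = obs["cell_type_confidence"] < CONF_THRESHOLD

frac_low = obs["low_confidence"].mean()
per_type = obs.groupby("cell_type")["low_confidence"].mean()

print(f"Low-confidence spots: {frac_low:.0%}")
print(per_type)
```

Applied to `adata.obs`, the resulting boolean column could also be passed as `color` to the spatial plot to see where uncertain spots concentrate.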

### Cell Type Proportions

In [None]:
top_10_types = cell_type_counts.head(10)
proportions = (top_10_types / adata.n_obs * 100)

fig, ax = plt.subplots(figsize=(10, 6))
proportions.plot(kind='barh', ax=ax, color='steelblue')
ax.set_xlabel('Percentage of Spots (%)')
ax.set_ylabel('Cell Type')
ax.set_title('Top 10 Cell Types by Proportion')
ax.invert_yaxis()

for i, v in enumerate(proportions):
    ax.text(v + 0.5, i, f'{v:.1f}%', va='center')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'cell_type_proportions.png', dpi=150, bbox_inches='tight')
plt.show()

---

## 7. Spatial Visualization of Cell Types

Visualize cell type distribution on the tissue.

In [None]:
library_id = list(adata.uns['spatial'].keys())[0]

fig, axes = plt.subplots(2, 2, figsize=(16, 16))

# Panel 1: All cell types
sc.pl.spatial(
    adata,
    library_id=library_id,
    color='cell_type',
    ax=axes[0, 0],
    title='Cell Types (Majority Voting)',
    size=1.3,
    show=False,
    legend_loc=None
)

# Panel 2: Confidence scores
sc.pl.spatial(
    adata,
    library_id=library_id,
    color='cell_type_confidence',
    ax=axes[0, 1],
    title='Prediction Confidence',
    size=1.3,
    cmap='viridis',
    show=False
)

# Panel 3: Leiden clusters (for comparison)
sc.pl.spatial(
    adata,
    library_id=library_id,
    color='leiden',
    ax=axes[1, 0],
    title='Leiden Clusters',
    size=1.3,
    show=False
)

# Panel 4: Highlight most abundant cell type
most_common = cell_type_counts.index[0]
adata.obs['is_most_common'] = (adata.obs['cell_type'] == most_common).astype(str)
sc.pl.spatial(
    adata,
    library_id=library_id,
    color='is_most_common',
    ax=axes[1, 1],
    title=f'Highlighted: {most_common}',
    size=1.3,
    show=False,
    palette=['lightgray', 'red']
)

plt.tight_layout()
output_path = OUTPUT_DIR / 'spatial_cell_types.png'
plt.savefig(output_path, dpi=150, bbox_inches='tight')
print(f"Saved: {output_path}")
plt.show()

---

## 8. Compare Cell Types vs Leiden Clusters

Create a cross-tabulation to see how cell types map to clusters.
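
As a small illustration of what the row-normalized cross-tabulation contains, the toy example below uses hypothetical cluster and cell type labels. With `normalize="index"`, each row (cluster) sums to 100%, so a cell reads as "the share of this cluster assigned to this cell type".

```python
import pandas as pd

# Toy stand-in for adata.obs (hypothetical labels)
obs = pd.DataFrame({
    "leiden": ["0", "0", "0", "0", "1", "1"],
    "cell_type": ["Fibroblast", "Fibroblast", "Fibroblast", "T cell", "T cell", "T cell"],
})

# Rows are clusters, columns are cell types, values are row percentages
composition = pd.crosstab(obs["leiden"], obs["cell_type"], normalize="index") * 100

# Dominant cell type per cluster
dominant = composition.idxmax(axis=1)

print(composition.round(1))
print(dominant)
```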

In [None]:
crosstab = pd.crosstab(adata.obs['leiden'], adata.obs['cell_type'], normalize='index') * 100

top_5_types = cell_type_counts.head(5).index
crosstab_top = crosstab[top_5_types]

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
    crosstab_top,
    annot=True,
    fmt='.1f',
    cmap='YlOrRd',
    ax=ax,
    cbar_kws={'label': 'Percentage (%)'}
)
ax.set_xlabel('Cell Type')
ax.set_ylabel('Leiden Cluster')
ax.set_title('Cell Type Composition by Leiden Cluster (Top 5 Types)')
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'cluster_celltype_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nDominant cell type per cluster:")
for cluster in crosstab.index:
    dominant_type = crosstab.loc[cluster].idxmax()
    percentage = crosstab.loc[cluster, dominant_type]
    print(f"  Cluster {cluster}: {dominant_type} ({percentage:.1f}%)")

---

## 9. Spatial Co-localization of Cell Types

Analyze which cell types occur near each other across a range of distances.

In [None]:
print("Calculating cell type co-occurrence...")

sq.gr.co_occurrence(adata, cluster_key='cell_type')

sq.pl.co_occurrence(
    adata,
    cluster_key='cell_type',
    figsize=(12, 12)
)
plt.savefig(OUTPUT_DIR / 'cell_type_cooccurrence.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nEach panel plots the co-occurrence ratio against distance for one conditioning cell type.")
print("Ratio > 1 = more co-occurrence than expected by chance, ratio < 1 = less than expected")
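
The idea behind a co-occurrence ratio can be sketched in a few lines of NumPy. The version below is a simplification under stated assumptions: it uses a single cumulative radius instead of a series of distance intervals, assumes every label has at least one neighbor within the radius, and `cooccurrence_ratio` is an illustrative helper rather than Squidpy's implementation. For each label pair it compares how often the pair appears among nearby spot pairs with what independent labels would give.

```python
import numpy as np

def cooccurrence_ratio(coords, labels, radius):
    """Return P(a, b | spots within radius) / (P(a) * P(b)) for every label pair."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    cats = np.unique(labels)
    codes = np.searchsorted(cats, labels)  # integer code per spot

    # Pairwise Euclidean distances; keep ordered pairs of distinct spots within radius
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.nonzero((dist <= radius) & ~np.eye(len(coords), dtype=bool))

    # Count label pairs among the nearby spot pairs
    counts = np.zeros((len(cats), len(cats)))
    np.add.at(counts, (codes[i], codes[j]), 1)

    joint = counts / counts.sum()    # P(a, b) among nearby pairs
    marginal = joint.sum(axis=1)     # P(a) among nearby pairs
    return cats, joint / np.outer(marginal, marginal)

# Two well-separated groups of spots (hypothetical coordinates and labels)
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["A", "A", "A", "B", "B", "B"]

cats, ratio = cooccurrence_ratio(coords, labels, radius=1.5)
print(cats)
print(ratio)
```

In this toy layout the two labels never fall within the radius of each other, so the off-diagonal ratio is 0 while each label co-occurs with itself more often than chance (ratio above 1).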

---

## 10. Export Annotated Data

Save the annotated data for downstream analysis.

In [None]:
annotated_h5ad = OUTPUT_DIR / "annotated_visium.h5ad"
adata.write(annotated_h5ad)
print(f"\nAnnotated data saved: {annotated_h5ad}")
print(f"File size: {annotated_h5ad.stat().st_size / (1024**2):.1f} MB")

### Export Cell Type Summary JSON

In [None]:
import json

cell_type_summary = {
    "total_spots": int(adata.n_obs),
    "n_cell_types": int(adata.obs['cell_type'].nunique()),
    "median_confidence": float(adata.obs['cell_type_confidence'].median()),
    "mean_confidence": float(adata.obs['cell_type_confidence'].mean()),
    # Cast to built-in types so json.dump does not trip over NumPy scalars
    "cell_type_counts": {str(k): int(v) for k, v in cell_type_counts.head(20).items()},
    "cell_type_proportions": {
        str(k): float(v)
        for k, v in (cell_type_counts.head(20) / adata.n_obs * 100).items()
    },
    "dominant_types_per_cluster": {
        str(cluster): str(crosstab.loc[cluster].idxmax())
        for cluster in crosstab.index
    }
}

json_path = OUTPUT_DIR / "cell_type_summary.json"
with open(json_path, 'w') as f:
    json.dump(cell_type_summary, f, indent=2)

print(f"\nCell type summary saved: {json_path}")
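
Values taken from pandas or NumPy objects may be NumPy scalars, and the standard `json` module does not serialize NumPy integers. The sketch below, using hypothetical counts, shows the failure and one simple way around it by casting to built-in types first.

```python
import json
import numpy as np

# Hypothetical counts stored as NumPy integers
counts = {"Fibroblast": np.int64(120), "T cell": np.int64(45)}

try:
    json.dumps(counts)
except TypeError as err:
    print(f"json cannot serialize NumPy integers: {err}")

# Cast keys to str and values to int before dumping
native = {str(k): int(v) for k, v in counts.items()}
encoded = json.dumps(native, indent=2)
print(encoded)
```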

---

## Summary & Interpretation

### What We Did:

1. ✅ Downloaded CellTypist Adult Breast model (58 cell types)
2. ✅ Ran automated annotation with majority voting
3. ✅ Analyzed cell type composition and confidence
4. ✅ Visualized cell types on tissue
5. ✅ Compared cell types to Leiden clusters
6. ✅ Calculated spatial co-occurrence
7. ✅ Exported annotated data and summary

### Key Findings:

**Cell Type Diversity**:
- Identified X unique cell types (from 58 in model)
- Median confidence: ~X.XX (range: 0-1)
- Most abundant: [Check output above]

**Spatial Patterns**:
- Cell types show clear spatial organization
- Co-occurrence reveals tissue microenvironments
- Clusters correlate with specific cell types

### Clinical Relevance (Breast Cancer):

**Expected Cell Types**:
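- **Epithelial cells**: Ductal/lobular cells; tumor cells are likely to receive the closest epithelial label, because the reference atlas was built from non-cancerous breast tissue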
- **Immune cells**: T cells, B cells, macrophages, NK cells
- **Stromal cells**: Fibroblasts, myofibroblasts
- **Endothelial cells**: Blood vessels, lymphatics

**Tumor Microenvironment**:
- Tumor-immune interface: Where epithelial meets immune
- Stromal remodeling: Fibroblast accumulation
- Vascular regions: Endothelial cell clusters

### Next Steps:

1. **MedGemma Integration** - Generate clinical reports from cell type data
2. **Niche Analysis** - Define tissue microenvironments
3. **Marker Gene Analysis** - Validate predictions with known markers
4. **Survival Analysis** - Correlate cell types with outcomes (if clinical data available)