# Automated Cell Type Annotation

## Overview
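This notebook performs automated cell type annotation using CellTypist and canonical marker scoring, then maps the detailed labels to major cell types.

### Methods
1. **CellTypist**: Pre-trained models for immune and pan-tissue annotation
2. **Reference mapping**: Project cells onto an annotated reference atlas (listed for completeness; no reference-mapping step is run in the cells below)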
3. **Marker-based scoring**: Score cells using canonical markers

---

In [None]:
import scanpy as sc
import anndata as ad
import celltypist
from celltypist import models
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import yaml
import warnings

warnings.filterwarnings('ignore')

# Project paths
PROJECT_ROOT = Path("../..").resolve()
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed' / 'scrna'
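FIGURES = PROJECT_ROOT / 'results' / 'figures'
FIGURES.mkdir(parents=True, exist_ok=True)
sc.settings.figdir = FIGURES  # make `save=` in sc.pl.* write here instead of ./figures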
CONFIG_PATH = PROJECT_ROOT / 'config' / 'analysis_params.yaml'

with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

# Load integrated atlas
adata = sc.read_h5ad(DATA_PROCESSED / 'integrated_atlas.h5ad')
print(f"Loaded atlas: {adata.n_obs} cells")
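The cells below read several nested keys from `config`. A minimal sketch of a check that reports missing keys up front, rather than failing partway through the notebook, might look like the following. The helper name `missing_keys` and the example values are hypothetical; only the key paths are taken from the cells in this notebook.

```python
# Key paths read later in this notebook (taken from the cells below).
REQUIRED_KEYS = [
    ("annotation", "celltypist", "model"),
    ("annotation", "celltypist", "majority_voting"),
    ("annotation", "markers"),
    ("annotation", "exhaustion", "markers"),
]

def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return the dotted key paths that are absent from a nested dict."""
    missing = []
    for path in required:
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing

# Illustrative config only; the real values live in analysis_params.yaml.
example_config = {
    "annotation": {
        "celltypist": {"model": "Immune_All_Low.pkl", "majority_voting": True},
        "markers": {"T_cell": ["CD3D", "CD3E"]},
    }
}
print(missing_keys(example_config))  # ['annotation.exhaustion.markers']
```

Calling `missing_keys(config)` right after loading the YAML file and stopping when the list is non-empty would surface configuration problems before any model is downloaded.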

## 1. CellTypist Annotation

Download and apply pre-trained immune cell models.

In [None]:
# Download models
models.download_models()

# List available models
print("Available CellTypist models:")
for m in models.models_description()['model']:
    print(f"  - {m}")

In [None]:
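# Prepare data for CellTypist, which expects log1p-transformed expression
# normalized to 10,000 counts per cell
adata_ct = adata.copy()
if 'normalized' in adata.layers:
    adata_ct.X = adata.layers['normalized'].copy()
else:
    print("Warning: no 'normalized' layer found; using adata.X as-is")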

# Run CellTypist with immune model
model_name = config['annotation']['celltypist']['model']
model = models.Model.load(model=model_name)

predictions = celltypist.annotate(
    adata_ct,
    model=model,
    majority_voting=config['annotation']['celltypist']['majority_voting']
)

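# Add predictions to original adata.
# `predicted_labels` holds the per-cell labels (plus a `majority_voting` column when
# majority voting is enabled); it has no confidence column, so the confidence is
# taken as the highest class probability for each cell.
adata.obs['celltypist_prediction'] = predictions.predicted_labels['predicted_labels']
adata.obs['celltypist_conf'] = predictions.probability_matrix.max(axis=1).values

print("\nAnnotation complete")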
print(adata.obs['celltypist_prediction'].value_counts().head(20))
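The confidence column stored alongside the predictions can be used to set aside uncertain calls before downstream analysis. The following is a minimal sketch, not part of the original workflow: the helper name `flag_low_confidence`, the 0.5 threshold, and the `Unassigned` label are all assumptions chosen for illustration.

```python
import pandas as pd

def flag_low_confidence(labels, conf, threshold=0.5, fallback="Unassigned"):
    """Replace labels whose confidence is below `threshold` with `fallback`."""
    labels = pd.Series(labels).astype(str)
    conf = pd.Series(conf, index=labels.index)
    return labels.where(conf >= threshold, fallback)

# Toy table standing in for adata.obs.
toy = pd.DataFrame({
    "celltypist_prediction": ["NK cells", "B cells", "Macrophages"],
    "celltypist_conf": [0.92, 0.31, 0.77],
})
toy["celltypist_filtered"] = flag_low_confidence(
    toy["celltypist_prediction"], toy["celltypist_conf"]
)
print(toy["celltypist_filtered"].tolist())
# ['NK cells', 'Unassigned', 'Macrophages']
```

A suitable threshold depends on the model and tissue, so inspecting the distribution of `celltypist_conf` first is advisable.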

In [None]:
# Visualize annotations
sc.pl.umap(
    adata,
    color=['celltypist_prediction'],
    legend_loc='on data',
    legend_fontsize=6,
    save='_celltypist_annotations.png'
)

## 2. Marker-Based Scoring

In [None]:
# Score cells using canonical markers
markers = config['annotation']['markers']

for cell_type, genes in markers.items():
    # Filter to genes present in data
    present_genes = [g for g in genes if g in adata.var_names]
    
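    if len(present_genes) > 0:
        sc.tl.score_genes(
            adata,
            gene_list=present_genes,
            score_name=f'{cell_type}_score'
        )
        print(f"{cell_type}: scored with {len(present_genes)}/{len(genes)} markers")
    else:
        # Report skipped marker sets instead of dropping them silently
        print(f"{cell_type}: skipped, none of the {len(genes)} markers are in the data")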
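For intuition about what the scores above represent: `sc.tl.score_genes` reports, per cell, the average expression of the gene set minus the average expression of a reference gene set sampled to match expression levels. The sketch below is a deliberately simplified version that skips the expression-matched sampling and uses every other gene as the control; the function name `simple_gene_score` is hypothetical and its output will not match scanpy's numerically.

```python
import numpy as np

def simple_gene_score(X, gene_names, gene_list):
    """Mean expression of `gene_list` minus mean expression of all other genes.

    Simplification: scanpy samples expression-matched control genes;
    here every gene outside `gene_list` serves as the control set.
    """
    gene_names = np.asarray(gene_names)
    in_set = np.isin(gene_names, gene_list)
    if not in_set.any() or in_set.all():
        raise ValueError("need at least one marker gene and one control gene")
    return X[:, in_set].mean(axis=1) - X[:, ~in_set].mean(axis=1)

# Toy matrix: 2 cells x 4 genes (log-normalized values).
X = np.array([[2.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0]])
genes = ["CD3D", "CD3E", "MS4A1", "CD79A"]
print(simple_gene_score(X, genes, ["CD3D", "CD3E"]).tolist())  # [3.0, -2.0]
```

Positive values indicate that the marker set is expressed above the background of the cell, and negative values indicate the opposite.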

In [None]:
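# Score exhaustion markers (computed for every cell; interpret within T cells)
exhaustion_genes = config['annotation']['exhaustion']['markers']
present = [g for g in exhaustion_genes if g in adata.var_names]

if present:
    sc.tl.score_genes(adata, gene_list=present, score_name='exhaustion_score')
    print(f"Exhaustion score calculated with {len(present)}/{len(exhaustion_genes)} markers")
else:
    print("Exhaustion score skipped: no exhaustion markers found in adata.var_names")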
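The scoring cells above write one `{cell_type}_score` column per marker set. One way to turn these into a marker-based label that can be compared against the CellTypist prediction is to take the highest-scoring set per cell. This is a sketch under assumptions: the helper name `best_scoring_type` is hypothetical, and it treats every column ending in `_score` (other than the excluded ones) as a marker score, which may need adjusting if `adata.obs` contains unrelated columns with that suffix.

```python
import pandas as pd

def best_scoring_type(obs, suffix="_score", exclude=("exhaustion_score",)):
    """Return, per cell, the name of the highest `*_score` column (suffix removed)."""
    cols = [c for c in obs.columns if c.endswith(suffix) and c not in exclude]
    if not cols:
        raise ValueError("no score columns found")
    return obs[cols].idxmax(axis=1).str[: -len(suffix)]

# Toy table standing in for adata.obs.
toy_obs = pd.DataFrame({
    "T_cell_score": [0.8, -0.1, 0.0],
    "B_cell_score": [0.1, 0.6, -0.2],
    "exhaustion_score": [0.9, 0.0, 0.3],
})
print(best_scoring_type(toy_obs).tolist())  # ['T_cell', 'B_cell', 'T_cell']
```

Cells where the top score is low or close to the runner-up are ambiguous, so this label is better suited to cross-checking than to replacing the model-based annotation.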

## 3. Create Unified Cell Type Labels

In [None]:
import re

# Map detailed annotations to major cell types
def map_to_major_type(annotation):
    """Map a detailed annotation to a major cell type (first matching rule wins)."""
    annotation_lower = str(annotation).lower()
    
    # Treg is tested first so that "CD4+ regulatory T cells" is not labelled CD4_T
    if any(x in annotation_lower for x in ['treg', 'regulatory']):
        return 'Treg'
    elif any(x in annotation_lower for x in ['cd8', 'cytotoxic']):
        return 'CD8_T'
    elif any(x in annotation_lower for x in ['cd4', 'helper', 'th1', 'th2', 'th17']):
        return 'CD4_T'
    # Word boundary keeps "unknown" (which contains "nk") out of the NK class
    elif re.search(r'\bnk|natural killer', annotation_lower):
        return 'NK'
    elif any(x in annotation_lower for x in ['b cell', 'b-cell']):
        return 'B_cell'
    # Plasmacytoid DCs fall through to the DC rule below
    elif 'plasma' in annotation_lower and 'plasmacytoid' not in annotation_lower:
        return 'Plasma'
    elif any(x in annotation_lower for x in ['macro', 'mono']):
        return 'Myeloid'
    elif any(x in annotation_lower for x in ['dc', 'dendritic']):
        return 'DC'
    elif any(x in annotation_lower for x in ['fibro']):
        return 'Fibroblast'
    elif any(x in annotation_lower for x in ['endo']):
        return 'Endothelial'
    elif any(x in annotation_lower for x in ['epi', 'tumor', 'cancer']):
        return 'Epithelial'
    else:
        return 'Other'

# Cast to str first so the mapping behaves the same for categorical columns
adata.obs['cell_type_major'] = (
    adata.obs['celltypist_prediction'].astype(str).map(map_to_major_type).astype('category')
)

print("Major cell type distribution:")
print(adata.obs['cell_type_major'].value_counts())
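Because the mapping relies on substring rules, it is worth auditing which detailed labels ended up in the catch-all `Other` category. A minimal sketch of such an audit follows; the helper name `unmapped_labels` and the toy labels are hypothetical.

```python
import pandas as pd

def unmapped_labels(detailed, major, other="Other"):
    """Count detailed labels that were assigned to the catch-all category."""
    detailed = pd.Series(detailed).astype(str)
    major = pd.Series(major).astype(str)
    return detailed[major.eq(other).to_numpy()].value_counts()

# Toy table standing in for adata.obs.
toy = pd.DataFrame({
    "celltypist_prediction": ["MAIT cells", "NK cells", "MAIT cells", "Mast cells"],
    "cell_type_major": ["Other", "NK", "Other", "Other"],
})
print(unmapped_labels(toy["celltypist_prediction"], toy["cell_type_major"]).to_dict())
# {'MAIT cells': 2, 'Mast cells': 1}
```

Frequent labels in this list are candidates for an additional rule in `map_to_major_type`; a full `pd.crosstab` of the two columns gives the same information for every major type.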

In [None]:
# Visualize major cell types
sc.pl.umap(
    adata,
    color=['cell_type_major', 'cancer_type'],
    ncols=2,
    save='_major_celltypes.png'
)

## 4. Save Annotated Data

In [None]:
# Save annotated atlas
output_path = DATA_PROCESSED / 'integrated_atlas_annotated.h5ad'
adata.write(output_path)
print(f"Saved annotated atlas to: {output_path}")

## Summary

### Annotations Added
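- CellTypist predictions and confidence (`celltypist_prediction`, `celltypist_conf`)
- Major cell type labels (`cell_type_major`)
- Marker scores (one `{cell_type}_score` column per marker set)
- Exhaustion scores (`exhaustion_score`)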

### Next Steps
1. Manual curation in `04b_marker_based_annotation.ipynb`
2. T cell subclustering in `04c_tcell_subclustering.ipynb`