# 02 - Cell Type Annotation

This notebook demonstrates clustering and cell type annotation for single-cell data.

## Overview

Steps include:
1. Load preprocessed data
2. Construct the neighbor graph
3. Cluster cells (Leiden algorithm)
4. Visualize with UMAP
5. Identify marker genes
6. Annotate cell types

## Setup

In [None]:
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yaml

# Configure scanpy
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white', figsize=(10, 10))

print(f"Scanpy version: {sc.__version__}")

## Load Configuration and Data

In [None]:
# Load configuration
with open('../config/analysis_config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Load preprocessed data
adata = sc.read_h5ad('../data/processed/preprocessed_data.h5ad')

print(f"Loaded data: {adata.shape[0]} cells x {adata.shape[1]} genes")
adata
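
The cells that follow read several nested keys from `config`. As a minimal sketch, assuming the YAML file nests those keys exactly as this notebook accesses them, a check like the one below could report missing entries before any computation starts. `REQUIRED_KEYS` and `missing_config_keys` are hypothetical names, not part of the project, and the values in the toy config are illustrative only.

```python
# Keys this notebook reads from the config, grouped by top-level section.
REQUIRED_KEYS = {
    "cell_annotation": ["n_neighbors", "n_pcs", "resolution", "marker_genes"],
    "visualization": ["umap_min_dist", "umap_spread"],
    "compute": ["random_state"],
}


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return dotted names of required keys that are absent from config."""
    missing = []
    for section, keys in required.items():
        # Treat a missing or empty section as an empty mapping.
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Toy config with made-up values; 'marker_genes' is deliberately left out.
example_config = {
    "cell_annotation": {"n_neighbors": 15, "n_pcs": 30, "resolution": 1.0},
    "visualization": {"umap_min_dist": 0.5, "umap_spread": 1.0},
    "compute": {"random_state": 0},
}
print(missing_config_keys(example_config))
```

In the notebook this could be called as `missing_config_keys(config)` right after loading the YAML file; an empty list means every key used below is present.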

## Neighborhood Graph

Construct a k-nearest-neighbor graph in PCA space.

In [None]:
# Compute neighborhood graph
sc.pp.neighbors(
    adata,
    n_neighbors=config['cell_annotation']['n_neighbors'],
    n_pcs=config['cell_annotation']['n_pcs']
)

print("Neighborhood graph computed")
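
To make the idea of a k-nearest-neighbor graph in PCA space concrete, here is a brute-force sketch on toy coordinates. It only illustrates the concept: `sc.pp.neighbors` also derives weighted connectivities and can use approximate search, so this is not how scanpy implements it. `knn_indices` and the toy points are hypothetical.

```python
import numpy as np


def knn_indices(coords, k):
    """Return, for each row of coords, the indices of its k nearest other rows."""
    # Pairwise Euclidean distances between all points.
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Exclude each point from its own neighbor list.
    np.fill_diagonal(dists, np.inf)
    # Sort each row by distance and keep the k closest columns.
    return np.argsort(dists, axis=1)[:, :k]


# Toy "PCA space": two well-separated groups of three points each.
pcs = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
neighbors = knn_indices(pcs, k=2)
print(neighbors)
```

Each point's two neighbors come from its own group, which is the structure that graph-based clustering later exploits.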

## Clustering

Apply the Leiden clustering algorithm to the neighborhood graph.

In [None]:
# Leiden clustering
sc.tl.leiden(
    adata,
    resolution=config['cell_annotation']['resolution'],
    random_state=config['compute']['random_state']
)

print(f"Identified {adata.obs['leiden'].nunique()} clusters")
print(adata.obs['leiden'].value_counts().sort_index())

## UMAP Visualization

Compute UMAP embedding for visualization.

In [None]:
# Compute UMAP
sc.tl.umap(
    adata,
    min_dist=config['visualization']['umap_min_dist'],
    spread=config['visualization']['umap_spread'],
    random_state=config['compute']['random_state']
)

print("UMAP embedding computed")

In [None]:
# Visualize clusters
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='Leiden Clusters')

## Marker Gene Identification

Find genes that are differentially expressed in each cluster relative to all other cells, using the Wilcoxon rank-sum test.

In [None]:
# Find marker genes for each cluster
sc.tl.rank_genes_groups(
    adata,
    groupby='leiden',
    method='wilcoxon',
    key_added='rank_genes'
)

print("Marker gene analysis complete")

In [None]:
# Visualize top marker genes
sc.pl.rank_genes_groups(adata, n_genes=10, key='rank_genes', sharey=False)

In [None]:
# Get top marker genes as dataframe
marker_df = sc.get.rank_genes_groups_df(adata, group=None, key='rank_genes')
marker_df.head(20)
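
The full marker table is long, so it often helps to keep only significant, up-regulated genes and a few top hits per cluster. The sketch below does this on a toy table that uses the column names returned by `sc.get.rank_genes_groups_df` (`group`, `names`, `scores`, `logfoldchanges`, `pvals_adj`). The helper `top_markers`, the thresholds, and the gene names are illustrative assumptions, not recommended defaults.

```python
import pandas as pd


def top_markers(marker_df, n_top=2, max_padj=0.05, min_lfc=1.0):
    """Keep significant, up-regulated genes and return the n_top best per cluster."""
    significant = marker_df[
        (marker_df["pvals_adj"] < max_padj)
        & (marker_df["logfoldchanges"] > min_lfc)
    ]
    # Highest test score first within each cluster, then take the first n_top rows.
    ranked = significant.sort_values(["group", "scores"], ascending=[True, False])
    return ranked.groupby("group", sort=False).head(n_top).reset_index(drop=True)


# Toy table with the same column names; gene names and values are made up.
toy_markers = pd.DataFrame({
    "group": ["0", "0", "0", "1", "1", "1"],
    "names": ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"],
    "scores": [12.0, 9.5, 1.2, 15.0, 8.0, 7.5],
    "logfoldchanges": [3.1, 2.0, 0.2, 4.0, 0.5, 1.8],
    "pvals_adj": [1e-10, 1e-6, 0.6, 1e-12, 1e-4, 1e-3],
})
print(top_markers(toy_markers))
```

In the notebook, the same function could be applied to `marker_df` directly, with thresholds chosen to suit the dataset.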

## Cell Type Annotation

Annotate clusters based on known marker genes.

In [None]:
# Load marker genes for reproductive tissue cell types from the config
marker_genes = config['cell_annotation']['marker_genes']

# Visualize marker gene expression
all_markers = [gene for genes in marker_genes.values() for gene in genes]
available_markers = [g for g in all_markers if g in adata.var_names]

if len(available_markers) > 0:
    sc.pl.dotplot(
        adata,
        var_names=available_markers[:15],  # Show first 15 available markers
        groupby='leiden',
        standard_scale='var'
    )
else:
    print("Note: Marker genes not found in dataset. This is expected for demo data.")
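
Because the dot plot silently drops markers that are absent from the dataset, it can be useful to see which markers were found for each cell type. The sketch below assumes, as the cell above does, that `marker_genes` maps a cell type name to a list of gene symbols. `split_markers` and the toy inputs are hypothetical.

```python
def split_markers(marker_genes, var_names):
    """Split each cell type's markers into those present in and absent from var_names."""
    known = set(var_names)
    report = {}
    for cell_type, genes in marker_genes.items():
        report[cell_type] = {
            "found": [g for g in genes if g in known],
            "missing": [g for g in genes if g not in known],
        }
    return report


# Toy inputs with placeholder names; real gene symbols would come from the config.
toy_marker_genes = {"Type A": ["GeneA", "GeneB"], "Type B": ["GeneC"]}
toy_var_names = ["GeneA", "GeneC", "GeneX"]

for cell_type, result in split_markers(toy_marker_genes, toy_var_names).items():
    print(cell_type, result)
```

With real data the call would be `split_markers(marker_genes, adata.var_names)`; a cell type whose `found` list is empty cannot be assessed from the dot plot.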

In [None]:
# Manual annotation based on marker gene expression
# This is a template - adjust based on your actual data

cluster_annotations = {
    '0': 'Cell Type A',
    '1': 'Cell Type B',
    '2': 'Cell Type C',
    # Add more annotations based on your analysis
}

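# Map cluster IDs to cell type names.
# 'leiden' is categorical, so convert to str before mapping and concatenating.
leiden = adata.obs['leiden'].astype(str)
adata.obs['cell_type'] = leiden.map(cluster_annotations)

# For unmapped clusters, keep cluster ID
adata.obs['cell_type'] = (
    adata.obs['cell_type'].fillna('Cluster ' + leiden).astype('category')
)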

print("Cell type annotations:")
print(adata.obs['cell_type'].value_counts())
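
Since the annotation dictionary is a template, some clusters will usually be left without a name. A small check such as the following can list them. It assumes cluster labels are integer-like strings, as Leiden labels in scanpy are; `unannotated_clusters` and the toy labels are hypothetical.

```python
def unannotated_clusters(cluster_ids, cluster_annotations):
    """Return cluster IDs that have no entry in cluster_annotations, in numeric order."""
    remaining = set(cluster_ids) - set(cluster_annotations)
    # Sort numerically so that '10' comes after '3'.
    return sorted(remaining, key=int)


# Toy labels: one label per cell, with clusters 3 and 10 left unannotated.
toy_cluster_ids = ["0", "1", "2", "3", "10", "3", "0"]
toy_annotations = {"0": "Cell Type A", "1": "Cell Type B", "2": "Cell Type C"}
print(unannotated_clusters(toy_cluster_ids, toy_annotations))
```

In the notebook this could be called as `unannotated_clusters(adata.obs['leiden'], cluster_annotations)`.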

In [None]:
# Visualize cell types
sc.pl.umap(adata, color='cell_type', title='Cell Type Annotations')

## Quality Checks

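Compare QC metrics across cell types to check that no annotated group stands out for low gene counts, low total counts, or high mitochondrial content.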

In [None]:
# Compare QC metrics across cell types
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Number of genes
adata.obs.boxplot('n_genes_by_counts', by='cell_type', ax=axes[0])
axes[0].set_xlabel('Cell Type')
axes[0].set_ylabel('Number of Genes')
axes[0].set_title('Genes per Cell Type')
axes[0].tick_params(axis='x', rotation=45)

# Total counts
adata.obs.boxplot('total_counts', by='cell_type', ax=axes[1])
axes[1].set_xlabel('Cell Type')
axes[1].set_ylabel('Total Counts')
axes[1].set_title('UMI Counts per Cell Type')
axes[1].tick_params(axis='x', rotation=45)

# Mitochondrial percentage
adata.obs.boxplot('pct_counts_mt', by='cell_type', ax=axes[2])
axes[2].set_xlabel('Cell Type')
axes[2].set_ylabel('Mitochondrial %')
axes[2].set_title('Mitochondrial Content per Cell Type')
axes[2].tick_params(axis='x', rotation=45)

# pandas adds a "Boxplot grouped by cell_type" figure title; clear it
fig.suptitle('')
plt.tight_layout()
plt.show()
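
A numeric summary can complement the box plots. The sketch below computes the median of each QC metric per cell type on a toy table that uses the same column names as the plots above. `qc_summary` and all values are hypothetical.

```python
import pandas as pd

QC_METRICS = ("n_genes_by_counts", "total_counts", "pct_counts_mt")


def qc_summary(obs, by="cell_type", metrics=QC_METRICS):
    """Return the per-group median of each QC metric."""
    # observed=True skips unused categories when the grouping column is categorical.
    return obs.groupby(by, observed=True)[list(metrics)].median()


# Toy per-cell table; the values are made up for illustration.
toy_obs = pd.DataFrame({
    "cell_type": ["Cell Type A", "Cell Type A", "Cell Type B"],
    "n_genes_by_counts": [1000, 1200, 400],
    "total_counts": [3000, 5000, 900],
    "pct_counts_mt": [2.0, 4.0, 18.0],
})
summary = qc_summary(toy_obs)
print(summary)
```

With real data the call would be `qc_summary(adata.obs)`. A group with markedly lower counts or higher mitochondrial content than the others may warrant a second look before its label is trusted.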

## Save Annotated Data

In [None]:
# Save annotated data
output_file = '../data/processed/annotated_data.h5ad'
adata.write(output_file, compression='gzip')

print(f"Saved annotated data to {output_file}")
print(f"Dataset: {adata.shape[0]} cells x {adata.shape[1]} genes")
print(f"Cell types: {adata.obs['cell_type'].nunique()}")

## Next Steps

Proceed to notebook `03_flux_estimation.ipynb` for metabolic flux analysis.