# 🧬 Transcriptomics and Single-Cell Analysis: Hands-on Practice

## Table of Contents
1. [Introduction to RNA-seq Data](#practice-1-introduction-to-rna-seq-data)
2. [Data Normalization Methods](#practice-2-data-normalization-methods)
3. [Differential Expression Analysis](#practice-3-differential-expression-analysis)
4. [Single-Cell Data Processing](#practice-4-single-cell-data-processing)
5. [Dimensionality Reduction (PCA)](#practice-5-dimensionality-reduction)
6. [Clustering and Cell Type Identification](#practice-6-clustering-and-cell-type-identification)
7. [Visualization and Interpretation](#practice-7-visualization-and-interpretation)

## Installing and Importing Essential Libraries

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Try importing scanpy for single-cell analysis (optional)
try:
    import scanpy as sc
    print("✅ Scanpy available for advanced single-cell analysis")
except ImportError:
    print("⚠️ Scanpy not installed. Install with: pip install scanpy")
    sc = None

# Visualization settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')

print("✅ All core libraries loaded successfully!")

---
## Practice 1: Introduction to RNA-seq Data

### 🎯 Learning Objectives
- Understand the structure of RNA-seq count matrices
- Learn about the genes × samples format
- Explore basic data properties

### 📖 Key Concepts
**Count Matrix:** Rows = genes, Columns = samples/cells  
**Raw Counts:** Integer values representing the number of reads mapped to each gene

In [None]:
# 1.1 Create simulated RNA-seq data
def create_sample_data():
    """Generate synthetic bulk RNA-seq data for practice"""
    
    np.random.seed(42)
    
    # Parameters
    n_genes = 100
    n_samples_control = 5
    n_samples_treatment = 5
    
    # Gene names
    gene_names = [f"Gene_{i:03d}" for i in range(1, n_genes + 1)]
    
    # Sample names
    control_names = [f"Control_{i}" for i in range(1, n_samples_control + 1)]
    treatment_names = [f"Treatment_{i}" for i in range(1, n_samples_treatment + 1)]
    sample_names = control_names + treatment_names
    
    # Generate count data (Negative Binomial distribution)
    # Control samples
    control_counts = np.random.negative_binomial(n=10, p=0.3, 
                                                   size=(n_genes, n_samples_control))
    
    # Treatment samples (some genes differentially expressed)
    treatment_counts = np.random.negative_binomial(n=10, p=0.3, 
                                                     size=(n_genes, n_samples_treatment))
    
    # Make first 10 genes upregulated in treatment
    treatment_counts[:10, :] = treatment_counts[:10, :] * 3
    
    # Make genes 11-20 downregulated in treatment
    treatment_counts[10:20, :] = treatment_counts[10:20, :] // 3
    
    # Combine into full count matrix
    counts = np.hstack([control_counts, treatment_counts])
    
    # Create DataFrame
    count_df = pd.DataFrame(counts, index=gene_names, columns=sample_names)
    
    # Create sample metadata
    metadata = pd.DataFrame({
        'sample': sample_names,
        'condition': ['Control'] * n_samples_control + ['Treatment'] * n_samples_treatment,
        'batch': [1, 1, 2, 2, 2, 1, 1, 2, 2, 2]
    })
    
    print("📊 RNA-seq Count Matrix Created")
    print("=" * 50)
    print(f"Dimensions: {count_df.shape[0]} genes × {count_df.shape[1]} samples")
    print("\nFirst 5 genes × 3 samples:")
    print(count_df.iloc[:5, :3])
    print("\n🔬 Sample Metadata:")
    print(metadata.head())
    
    return count_df, metadata

counts, metadata = create_sample_data()

In [None]:
# 1.2 Explore data properties
def explore_count_data(counts):
    """Visualize basic properties of count data"""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Distribution of counts
    axes[0].hist(counts.values.flatten(), bins=50, edgecolor='black', alpha=0.7)
    axes[0].set_xlabel('Count Value')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Distribution of Raw Counts')
    axes[0].set_yscale('log')
    
    # Library sizes (total counts per sample)
    library_sizes = counts.sum(axis=0)
    axes[1].bar(range(len(library_sizes)), library_sizes, color='steelblue', alpha=0.7)
    axes[1].set_xlabel('Sample Index')
    axes[1].set_ylabel('Total Counts')
    axes[1].set_title('Library Sizes')
    axes[1].tick_params(axis='x', rotation=45)
    
    # Genes detected per sample
    genes_detected = (counts > 0).sum(axis=0)
    axes[2].bar(range(len(genes_detected)), genes_detected, color='coral', alpha=0.7)
    axes[2].set_xlabel('Sample Index')
    axes[2].set_ylabel('Number of Genes')
    axes[2].set_title('Genes Detected per Sample')
    axes[2].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    print("\n📈 Data Summary Statistics:")
    print("=" * 50)
    # Summaries are computed over every entry of the matrix (all genes, all samples)
    print(f"Mean count (all entries): {counts.values.mean():.2f}")
    print(f"Median count (all entries): {np.median(counts.values):.2f}")
    n_zeros = int((counts == 0).sum().sum())
    print(f"Zero counts: {n_zeros} ({n_zeros / counts.size * 100:.1f}%)")

explore_count_data(counts)

---
## Practice 2: Data Normalization Methods

### 🎯 Learning Objectives
- Implement CPM (Counts Per Million) normalization
- Compare different normalization strategies
- Understand the importance of normalization

### 📖 Key Concepts
**CPM:** Accounts for sequencing depth only (the method used in the code below)  
**TPM:** Accounts for gene length and sequencing depth  
**DESeq2 size factors:** Median-of-ratios normalization  
**Log transformation:** Stabilizes variance
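
The cells below implement CPM only, so here is a minimal sketch of the other two concepts, TPM and median-of-ratios size factors. The function names `counts_to_tpm` and `median_of_ratios_size_factors`, the toy matrix, and the gene lengths are all illustrative assumptions and are not part of the practice dataset (which has no gene lengths). The size-factor function follows the general median-of-ratios idea and is not DESeq2's actual implementation.

```python
import numpy as np
import pandas as pd


def counts_to_tpm(counts: pd.DataFrame, gene_lengths_bp: pd.Series) -> pd.DataFrame:
    """Convert raw counts (genes x samples) to TPM (hypothetical helper)."""
    # Step 1: reads per kilobase, which corrects for gene length
    rpk = counts.div(gene_lengths_bp / 1_000, axis=0)
    # Step 2: rescale each sample so that its values sum to one million
    return rpk.div(rpk.sum(axis=0), axis=1) * 1e6


def median_of_ratios_size_factors(counts: pd.DataFrame) -> pd.Series:
    """Estimate per-sample size factors (hypothetical helper)."""
    # Keep genes with non-zero counts in every sample so the log is defined
    expressed = counts[(counts > 0).all(axis=1)]
    log_counts = np.log(expressed)
    # Log geometric mean per gene acts as a pseudo-reference sample
    log_geo_mean = log_counts.mean(axis=1)
    # Size factor = median ratio of each sample to the reference
    log_ratios = log_counts.sub(log_geo_mean, axis=0)
    return np.exp(log_ratios.median(axis=0))


# Toy data (assumed): same expression profile sequenced at 1x, 2x, and 4x depth
base = pd.Series([10, 20, 40, 80, 160], index=[f"Gene_{i}" for i in range(1, 6)])
toy_counts = pd.DataFrame({"S1": base, "S2": base * 2, "S3": base * 4})
toy_lengths = pd.Series([1_000, 2_000, 4_000, 8_000, 16_000], index=toy_counts.index)

toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
size_factors = median_of_ratios_size_factors(toy_counts)
normalized = toy_counts.div(size_factors, axis=1)

print(toy_tpm.sum(axis=0))   # every sample sums to one million
print(size_factors)          # reflects the 1x / 2x / 4x depth differences
```

In this toy case the three samples differ only in depth, so both methods make the columns identical after normalization. On real data the two approaches answer different questions: TPM supports within-sample comparison across genes, while size factors are designed for between-sample comparison of the same gene.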

In [None]:
# 2.1 Implement normalization methods
def normalize_counts(counts):
    """Apply different normalization methods"""
    
    # Method 1: CPM (Counts Per Million)
    library_sizes = counts.sum(axis=0)
    cpm = counts / library_sizes * 1e6
    
    # Method 2: Log-transformed CPM (log2(CPM + 1))
    log_cpm = np.log2(cpm + 1)
    
    # Method 3: Z-score normalization
    scaler = StandardScaler()
    z_score = pd.DataFrame(
        scaler.fit_transform(counts.T).T,
        index=counts.index,
        columns=counts.columns
    )
    
    print("🔧 Normalization Methods Applied")
    print("=" * 50)
    print("\n1. CPM (Counts Per Million)")
    print(cpm.iloc[:3, :3])
    print("\n2. Log2(CPM + 1)")
    print(log_cpm.iloc[:3, :3])
    print("\n3. Z-score Normalization")
    print(z_score.iloc[:3, :3])
    
    return cpm, log_cpm, z_score

cpm, log_cpm, z_score = normalize_counts(counts)

In [None]:
# 2.2 Visualize normalization effects
def visualize_normalization(counts, log_cpm):
    """Compare raw and normalized data"""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Raw counts boxplot
    axes[0].boxplot([counts[col] for col in counts.columns], 
                     labels=range(1, len(counts.columns) + 1))
    axes[0].set_xlabel('Sample')
    axes[0].set_ylabel('Raw Count')
    axes[0].set_title('Raw Counts Distribution')
    axes[0].set_yscale('log')
    
    # Normalized counts boxplot
    axes[1].boxplot([log_cpm[col] for col in log_cpm.columns],
                     labels=range(1, len(log_cpm.columns) + 1))
    axes[1].set_xlabel('Sample')
    axes[1].set_ylabel('Log2(CPM + 1)')
    axes[1].set_title('Normalized Counts Distribution')
    
    plt.tight_layout()
    plt.show()
    
    print("\n✅ Normalization reduces technical variation between samples")

visualize_normalization(counts, log_cpm)

---
## Practice 3: Differential Expression Analysis

### 🎯 Learning Objectives
- Perform simple differential expression analysis
- Apply statistical tests (t-test)
- Correct for multiple testing
- Create volcano plots

### 📖 Key Concepts
**Log2 Fold Change:** log2(Treatment / Control)  
**P-value:** Statistical significance  
**FDR:** False Discovery Rate (adjusted p-value)
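
To make the FDR concept concrete, the following is a small hand-rolled sketch of the Benjamini-Hochberg adjustment. The function name `benjamini_hochberg` and the five p-values are made up for illustration; the analysis cell below relies on SciPy's own routine instead.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order (illustrative helper)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity starting from the largest p-value, then cap at 1
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted_sorted = np.minimum(adjusted_sorted, 1.0)
    # Put the adjusted values back in the input order
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


# Hypothetical p-values for five genes
p_vals = [0.001, 0.008, 0.039, 0.041, 0.60]
p_adj = benjamini_hochberg(p_vals)

for raw, adj in zip(p_vals, p_adj):
    print(f"p = {raw:.3f} -> adjusted = {adj:.5f}")
```

With these inputs, four genes pass a raw p < 0.05 cutoff, but only the first two stay below 0.05 after adjustment, which is why the analysis below thresholds on the adjusted values.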

In [None]:
# 3.1 Differential expression testing
def perform_de_analysis(log_cpm, metadata):
    """Identify differentially expressed genes"""
    
    # Separate control and treatment
    control_samples = metadata[metadata['condition'] == 'Control']['sample']
    treatment_samples = metadata[metadata['condition'] == 'Treatment']['sample']
    
    control_data = log_cpm[control_samples]
    treatment_data = log_cpm[treatment_samples]
    
    # Calculate statistics for each gene
    results = []
    
    for gene in log_cpm.index:
        # Mean expression
        control_mean = control_data.loc[gene].mean()
        treatment_mean = treatment_data.loc[gene].mean()
        
        # Log2 fold change: difference of the mean log2(CPM + 1) values
        log2_fc = treatment_mean - control_mean
        
        # T-test
        t_stat, p_value = stats.ttest_ind(control_data.loc[gene], 
                                           treatment_data.loc[gene])
        
        results.append({
            'gene': gene,
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'log2_fc': log2_fc,
            'p_value': p_value
        })
    
    # Create results DataFrame
    de_results = pd.DataFrame(results)
    
    # Multiple testing correction (Benjamini-Hochberg); requires SciPy >= 1.11
    from scipy.stats import false_discovery_control
    de_results['p_adj'] = false_discovery_control(de_results['p_value'])
    
    # Classify genes
    de_results['significant'] = (de_results['p_adj'] < 0.05) & (np.abs(de_results['log2_fc']) > 1)
    de_results['direction'] = 'Not Sig'
    de_results.loc[(de_results['significant']) & (de_results['log2_fc'] > 1), 'direction'] = 'Up'
    de_results.loc[(de_results['significant']) & (de_results['log2_fc'] < -1), 'direction'] = 'Down'
    
    # Sort by p-value
    de_results = de_results.sort_values('p_value')
    
    print("🔬 Differential Expression Analysis Results")
    print("=" * 60)
    print(f"Total genes tested: {len(de_results)}")
    print(f"Upregulated (FDR < 0.05, log2FC > 1): {(de_results['direction'] == 'Up').sum()}")
    print(f"Downregulated (FDR < 0.05, log2FC < -1): {(de_results['direction'] == 'Down').sum()}")
    print(f"\nTop 5 upregulated genes:")
    print(de_results[de_results['direction'] == 'Up'][['gene', 'log2_fc', 'p_adj']].head())
    
    return de_results

de_results = perform_de_analysis(log_cpm, metadata)

In [None]:
# 3.2 Create volcano plot
def create_volcano_plot(de_results):
    """Visualize differential expression with volcano plot"""
    
    plt.figure(figsize=(10, 7))
    
    # Plot points by significance
    colors = {'Up': 'red', 'Down': 'blue', 'Not Sig': 'gray'}
    
    for direction, color in colors.items():
        subset = de_results[de_results['direction'] == direction]
        plt.scatter(subset['log2_fc'], 
                   -np.log10(subset['p_value']),
                   c=color, 
                   alpha=0.6, 
                   s=30,
                   label=direction)
    
    # Add threshold lines (the horizontal line marks raw p = 0.05,
    # while point colors are based on FDR-adjusted p-values)
    plt.axhline(y=-np.log10(0.05), color='black', linestyle='--', linewidth=1, alpha=0.5)
    plt.axvline(x=1, color='black', linestyle='--', linewidth=1, alpha=0.5)
    plt.axvline(x=-1, color='black', linestyle='--', linewidth=1, alpha=0.5)
    
    plt.xlabel('Log2 Fold Change', fontsize=12)
    plt.ylabel('-Log10 P-value', fontsize=12)
    plt.title('Volcano Plot: Treatment vs Control', fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n🌋 Volcano plot shows:")
    print("   - X-axis: Log2 fold change (effect size)")
    print("   - Y-axis: -Log10 p-value (significance)")
    print("   - Red: Upregulated genes")
    print("   - Blue: Downregulated genes")

create_volcano_plot(de_results)

---
## Practice 4: Single-Cell Data Processing

### 🎯 Learning Objectives
- Create simulated single-cell data
- Compute and visualize QC metrics
- Understand sparsity in scRNA-seq

### 📖 Key Concepts
**Sparsity:** Many zero values in single-cell data  
**QC Metrics:** nGenes, nUMI, %mitochondrial  
**Doublets:** Two cells captured in one droplet
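
The QC metrics listed above can be sketched on a tiny deterministic matrix. Everything here is an assumption made for illustration: the `MT-` gene-name prefix as the mitochondrial marker, the toy counts, and the thresholds. The simulated dataset in the next cell has no mitochondrial genes, so this sketch stands apart from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: 20 genes x 6 cells; the first 3 genes are mitochondrial
genes = [f"MT-G{i}" for i in range(3)] + [f"Gene_{i:03d}" for i in range(17)]
cells = [f"Cell_{i}" for i in range(6)]
pattern = (np.arange(20)[:, None] + np.arange(6)[None, :]) % 4  # counts in 0..3
toy_sc = pd.DataFrame(pattern, index=genes, columns=cells)

is_mito = toy_sc.index.str.startswith("MT-")
toy_sc.loc[is_mito, "Cell_5"] = 100   # simulate a stressed cell (high mito share)
toy_sc.iloc[5:, 4] = 0                # simulate a nearly empty droplet (Cell_4)

# Per-cell QC metrics: nGenes, nUMI, and percent mitochondrial counts
qc = pd.DataFrame({
    "n_genes": (toy_sc > 0).sum(axis=0),
    "n_counts": toy_sc.sum(axis=0),
    "pct_mito": toy_sc.loc[is_mito].sum(axis=0) / toy_sc.sum(axis=0) * 100,
})
print(qc.round(1))

# Illustrative thresholds only; real cutoffs depend on tissue and protocol
keep = (qc["n_genes"] >= 10) & (qc["pct_mito"] < 25)
filtered = toy_sc.loc[:, keep]
print(f"Kept {filtered.shape[1]} of {toy_sc.shape[1]} cells")
```

The stressed cell is removed by the mitochondrial cutoff and the nearly empty droplet by the gene-count cutoff. Doublet detection is not covered by these simple thresholds and usually needs a dedicated tool.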

In [None]:
# 4.1 Generate single-cell data
def create_single_cell_data():
    """Simulate single-cell RNA-seq data with multiple cell types"""
    
    np.random.seed(42)
    
    n_genes = 200
    n_cells = 300
    
    # Create three cell types with different expression patterns
    n_type1 = 100  # Cell type 1
    n_type2 = 100  # Cell type 2
    n_type3 = 100  # Cell type 3
    
    # Gene names
    gene_names = [f"Gene_{i:03d}" for i in range(1, n_genes + 1)]
    
    # Cell type 1: High expression of genes 1-50
    type1 = np.random.poisson(lam=3, size=(n_genes, n_type1))
    type1[:50, :] = np.random.poisson(lam=10, size=(50, n_type1))
    
    # Cell type 2: High expression of genes 51-100
    type2 = np.random.poisson(lam=3, size=(n_genes, n_type2))
    type2[50:100, :] = np.random.poisson(lam=10, size=(50, n_type2))
    
    # Cell type 3: High expression of genes 101-150
    type3 = np.random.poisson(lam=3, size=(n_genes, n_type3))
    type3[100:150, :] = np.random.poisson(lam=10, size=(50, n_type3))
    
    # Combine
    sc_counts = np.hstack([type1, type2, type3])
    
    # Add sparsity (dropout events)
    dropout_mask = np.random.rand(*sc_counts.shape) < 0.5
    sc_counts[dropout_mask] = 0
    
    # Create DataFrame
    cell_names = [f"Cell_{i:03d}" for i in range(1, n_cells + 1)]
    sc_df = pd.DataFrame(sc_counts, index=gene_names, columns=cell_names)
    
    # Create cell metadata
    cell_metadata = pd.DataFrame({
        'cell': cell_names,
        'true_type': ['Type1'] * n_type1 + ['Type2'] * n_type2 + ['Type3'] * n_type3,
        'n_genes': (sc_df > 0).sum(axis=0).values,
        'n_counts': sc_df.sum(axis=0).values
    })
    
    print("🔬 Single-Cell Data Created")
    print("=" * 50)
    print(f"Dimensions: {sc_df.shape[0]} genes × {sc_df.shape[1]} cells")
    print(f"Sparsity: {(sc_df == 0).sum().sum() / sc_df.size * 100:.1f}% zeros")
    print(f"\nCell types: {cell_metadata['true_type'].value_counts().to_dict()}")
    
    return sc_df, cell_metadata

sc_counts, cell_metadata = create_single_cell_data()

In [None]:
# 4.2 Quality control visualization
def visualize_qc_metrics(cell_metadata):
    """Visualize QC metrics for single-cell data"""
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Number of genes per cell
    axes[0].hist(cell_metadata['n_genes'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[0].axvline(cell_metadata['n_genes'].median(), color='red', linestyle='--', label='Median')
    axes[0].set_xlabel('Number of Genes Detected')
    axes[0].set_ylabel('Number of Cells')
    axes[0].set_title('Genes per Cell')
    axes[0].legend()
    
    # Total counts per cell
    axes[1].hist(cell_metadata['n_counts'], bins=30, edgecolor='black', alpha=0.7, color='coral')
    axes[1].axvline(cell_metadata['n_counts'].median(), color='red', linestyle='--', label='Median')
    axes[1].set_xlabel('Total UMI Counts')
    axes[1].set_ylabel('Number of Cells')
    axes[1].set_title('UMI Counts per Cell')
    axes[1].legend()
    
    # Scatter: genes vs counts
    axes[2].scatter(cell_metadata['n_counts'], cell_metadata['n_genes'], 
                    alpha=0.6, s=20, color='purple')
    axes[2].set_xlabel('Total Counts')
    axes[2].set_ylabel('Genes Detected')
    axes[2].set_title('Genes vs Counts')
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 QC Summary:")
    print(f"   Median genes per cell: {cell_metadata['n_genes'].median():.0f}")
    print(f"   Median UMI counts per cell: {cell_metadata['n_counts'].median():.0f}")

visualize_qc_metrics(cell_metadata)

---
## Practice 5: Dimensionality Reduction

### 🎯 Learning Objectives
- Apply PCA for dimensionality reduction
- Understand variance explained
- Visualize cells in low-dimensional space

### 📖 Key Concepts
**PCA:** Principal Component Analysis - linear projection  
**PC1, PC2:** First and second principal components  
**Variance Explained:** How much information each PC captures
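
To show where principal components and "variance explained" come from, here is a minimal sketch of PCA computed directly with a singular value decomposition in NumPy. The toy data (100 cells, 5 genes, one dominant direction of variation) is an assumption for illustration; the practice cells below use scikit-learn's `PCA` instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): 100 cells x 5 genes, most variance along one direction
latent = rng.normal(size=(100, 1))
loadings = np.array([[3.0, 2.0, 1.0, 0.0, 0.0]])
X = latent @ loadings + 0.1 * rng.normal(size=(100, 5))

# 1. Center each gene (column) at zero
X_centered = X - X.mean(axis=0)

# 2. SVD: the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project the cells onto the axes to obtain PC scores
pc_scores = X_centered @ Vt.T

# 4. Fraction of total variance captured by each component
explained_variance_ratio = S**2 / np.sum(S**2)

print("Variance explained per PC:", np.round(explained_variance_ratio, 4))
print("PC1 axis (up to sign):", np.round(Vt[0], 2))
```

Because the toy data has a single strong signal, PC1 captures almost all of the variance and its axis points along the planted loadings. Real single-cell data typically spreads variance over many more components.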

In [None]:
# 5.1 Perform PCA
def perform_pca(sc_counts, cell_metadata):
    """Apply PCA to single-cell data"""
    
    # Normalize: log2(CPM + 1)
    library_sizes = sc_counts.sum(axis=0)
    cpm = sc_counts / library_sizes * 1e6
    log_cpm = np.log2(cpm + 1)
    
    # Transpose: cells as rows, genes as columns
    data_for_pca = log_cpm.T
    
    # Standardize
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data_for_pca)
    
    # PCA
    pca = PCA(n_components=50)
    pca_result = pca.fit_transform(data_scaled)
    
    # Add PCA coordinates to metadata
    cell_metadata['PC1'] = pca_result[:, 0]
    cell_metadata['PC2'] = pca_result[:, 1]
    
    print("🔍 PCA Analysis")
    print("=" * 50)
    print(f"Variance explained by PC1: {pca.explained_variance_ratio_[0]*100:.2f}%")
    print(f"Variance explained by PC2: {pca.explained_variance_ratio_[1]*100:.2f}%")
    print(f"Cumulative variance (PC1-PC10): {pca.explained_variance_ratio_[:10].sum()*100:.2f}%")
    
    return pca, pca_result, cell_metadata

pca, pca_result, cell_metadata = perform_pca(sc_counts, cell_metadata)

In [None]:
# 5.2 Visualize PCA
def visualize_pca(cell_metadata):
    """Plot cells in PCA space"""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot by cell type
    for cell_type in cell_metadata['true_type'].unique():
        subset = cell_metadata[cell_metadata['true_type'] == cell_type]
        axes[0].scatter(subset['PC1'], subset['PC2'], 
                       label=cell_type, alpha=0.6, s=30)
    
    axes[0].set_xlabel('PC1')
    axes[0].set_ylabel('PC2')
    axes[0].set_title('PCA: Colored by Cell Type')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Plot by UMI counts
    scatter = axes[1].scatter(cell_metadata['PC1'], cell_metadata['PC2'],
                             c=cell_metadata['n_counts'], 
                             cmap='viridis', alpha=0.6, s=30)
    axes[1].set_xlabel('PC1')
    axes[1].set_ylabel('PC2')
    axes[1].set_title('PCA: Colored by UMI Counts')
    plt.colorbar(scatter, ax=axes[1], label='Total Counts')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n✅ PCA plot complete: check whether the cell types form separate groups.")

visualize_pca(cell_metadata)

---
## Practice 6: Clustering and Cell Type Identification

### 🎯 Learning Objectives
- Apply K-means clustering
- Compare clustering results with known cell types
- Understand clustering metrics

### 📖 Key Concepts
**K-means:** Partitioning cells into K clusters  
**Leiden/Louvain:** Graph-based clustering (used in real scRNA-seq)  
**Silhouette Score:** Quality metric for clustering
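
As a sketch of how clustering metrics can guide the choice of K, the example below scans several values of K with the silhouette score and then compares the best clustering to known labels using the adjusted Rand index. The blob data from `make_blobs`, the chosen centers, and the range of K are assumptions for illustration and are unrelated to the simulated cells.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy data (assumed): three well-separated 2D groups of points
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette score for each candidate K (higher means tighter, better-separated clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K = {k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)

# Adjusted Rand index compares a clustering to known labels (1.0 = identical partition)
best_labels = KMeans(n_clusters=best_k, random_state=0, n_init=10).fit_predict(X)
ari = adjusted_rand_score(true_labels, best_labels)
print(f"Best K by silhouette: {best_k}, ARI vs. true labels: {ari:.3f}")
```

The adjusted Rand index ignores how clusters are numbered, so it is a convenient summary of the cluster-versus-true-type cross-tabulation when ground truth is available, as it is for simulated data.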

In [None]:
# 6.1 Perform clustering
def perform_clustering(pca_result, cell_metadata, n_clusters=3):
    """Apply K-means clustering on PCA results"""
    
    # K-means on first 10 PCs
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(pca_result[:, :10])
    
    # Add to metadata
    cell_metadata['cluster'] = [f"Cluster_{i}" for i in clusters]
    
    # Calculate silhouette score
    from sklearn.metrics import silhouette_score
    silhouette = silhouette_score(pca_result[:, :10], clusters)
    
    print("🎯 Clustering Results")
    print("=" * 50)
    print(f"Number of clusters: {n_clusters}")
    print(f"Silhouette score: {silhouette:.3f} (higher is better)")
    print(f"\nCluster sizes:")
    print(cell_metadata['cluster'].value_counts())
    
    return cell_metadata

cell_metadata = perform_clustering(pca_result, cell_metadata, n_clusters=3)

In [None]:
# 6.2 Visualize clustering results
def visualize_clustering(cell_metadata):
    """Compare clustering with known cell types"""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot by predicted clusters
    for cluster in cell_metadata['cluster'].unique():
        subset = cell_metadata[cell_metadata['cluster'] == cluster]
        axes[0].scatter(subset['PC1'], subset['PC2'],
                       label=cluster, alpha=0.6, s=30)
    
    axes[0].set_xlabel('PC1')
    axes[0].set_ylabel('PC2')
    axes[0].set_title('Predicted Clusters (K-means)')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Plot by true cell types
    for cell_type in cell_metadata['true_type'].unique():
        subset = cell_metadata[cell_metadata['true_type'] == cell_type]
        axes[1].scatter(subset['PC1'], subset['PC2'],
                       label=cell_type, alpha=0.6, s=30)
    
    axes[1].set_xlabel('PC1')
    axes[1].set_ylabel('PC2')
    axes[1].set_title('True Cell Types')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Create confusion-like comparison
    comparison = pd.crosstab(cell_metadata['cluster'], 
                             cell_metadata['true_type'])
    
    print("\n📊 Cluster vs True Cell Type:")
    print(comparison)
    print("\n✅ Rows dominated by a single cell type indicate clusters that match the known populations.")

visualize_clustering(cell_metadata)

---
## Practice 7: Visualization and Interpretation

### 🎯 Learning Objectives
- Create heatmaps of gene expression
- Identify marker genes for each cluster
- Generate publication-quality visualizations

### 📖 Key Concepts
**Heatmap:** Visualizing expression patterns across cells/genes  
**Marker Genes:** Genes specifically expressed in one cell type  
**Violin Plots:** Distribution of expression values
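
Marker genes are often ranked with a statistical test instead of a mean difference alone. The following is a minimal sketch using a one-sided Mann-Whitney U (Wilcoxon rank-sum) test from SciPy. The toy expression matrix, the cluster labels, and the planted marker `GeneA` are assumptions for illustration; the practice cell below ranks genes by mean difference only.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Toy log-expression (assumed): 4 genes x 40 cells, two clusters of 20 cells
expr = pd.DataFrame(
    rng.normal(loc=1.0, scale=0.3, size=(4, 40)),
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
    columns=[f"Cell_{i:02d}" for i in range(40)],
)
cluster = pd.Series(["A"] * 20 + ["B"] * 20, index=expr.columns)

# Plant a marker: GeneA is strongly elevated in cluster A
expr.loc["GeneA", cluster == "A"] += 3.0

in_cluster = expr.loc[:, cluster == "A"]
others = expr.loc[:, cluster != "A"]

rows = []
for gene in expr.index:
    # One-sided test: is expression higher inside the cluster than outside?
    _, p_value = mannwhitneyu(in_cluster.loc[gene], others.loc[gene],
                              alternative="greater")
    rows.append({
        "gene": gene,
        "mean_diff": in_cluster.loc[gene].mean() - others.loc[gene].mean(),
        "p_value": p_value,
    })

ranking = pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)
print(ranking)
```

In a real analysis the p-values would also need multiple-testing correction across all genes and clusters, as in Practice 3.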

In [None]:
# 7.1 Find marker genes
def find_marker_genes(sc_counts, cell_metadata, top_n=5):
    """Identify top marker genes for each cluster"""
    
    # Normalize
    library_sizes = sc_counts.sum(axis=0)
    cpm = sc_counts / library_sizes * 1e6
    log_cpm = np.log2(cpm + 1)
    
    markers = {}
    
    for cluster in cell_metadata['cluster'].unique():
        # Get cells in this cluster
        cluster_cells = cell_metadata[cell_metadata['cluster'] == cluster]['cell']
        other_cells = cell_metadata[cell_metadata['cluster'] != cluster]['cell']
        
        # Calculate mean expression
        cluster_mean = log_cpm[cluster_cells].mean(axis=1)
        other_mean = log_cpm[other_cells].mean(axis=1)
        
        # Fold change
        fc = cluster_mean - other_mean
        
        # Get top genes
        top_genes = fc.nlargest(top_n).index.tolist()
        markers[cluster] = top_genes
    
    print("🔬 Top Marker Genes per Cluster")
    print("=" * 50)
    for cluster, genes in markers.items():
        print(f"\n{cluster}:")
        for gene in genes:
            print(f"  - {gene}")
    
    return markers, log_cpm

markers, log_cpm_sc = find_marker_genes(sc_counts, cell_metadata, top_n=5)

In [None]:
# 7.2 Create expression heatmap
def create_expression_heatmap(log_cpm_sc, cell_metadata, markers):
    """Visualize marker gene expression across clusters"""
    
    # Collect marker genes across clusters, dropping duplicates while keeping order
    all_markers = []
    for genes in markers.values():
        all_markers.extend(genes)
    all_markers = list(dict.fromkeys(all_markers))[:15]  # Up to 15 unique markers
    
    # Sort cells by cluster
    cell_metadata_sorted = cell_metadata.sort_values('cluster')
    sorted_cells = cell_metadata_sorted['cell']
    
    # Get expression matrix
    expr_matrix = log_cpm_sc.loc[all_markers, sorted_cells]
    
    # Create heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(expr_matrix, 
                cmap='RdYlBu_r',
                cbar_kws={'label': 'Log2(CPM + 1)'},
                xticklabels=False,
                yticklabels=True,
                linewidths=0)
    
    plt.xlabel('Cells (sorted by cluster)')
    plt.ylabel('Marker Genes')
    plt.title('Expression Heatmap: Top Marker Genes', fontsize=14, fontweight='bold')
    
    # Add cluster boundaries
    cluster_boundaries = cell_metadata_sorted.groupby('cluster').size().cumsum().iloc[:-1].values
    for boundary in cluster_boundaries:
        plt.axvline(x=boundary, color='black', linewidth=2)
    
    plt.tight_layout()
    plt.show()
    
    print("\n🎨 Heatmap shows distinct expression patterns for each cluster!")

create_expression_heatmap(log_cpm_sc, cell_metadata, markers)

---
## 🎯 Practice Complete!

### Summary of What We Learned:

1. **RNA-seq Data Structure**: Understanding count matrices and metadata
2. **Normalization**: CPM, log transformation, and their importance
3. **Differential Expression**: Statistical testing and volcano plots
4. **Single-Cell Analysis**: QC, filtering, and handling sparsity
5. **Dimensionality Reduction**: PCA for visualization
6. **Clustering**: Identifying cell populations
7. **Marker Genes**: Finding genes that define cell types

### Key Takeaways:
- RNA-seq data requires careful normalization and QC
- Single-cell data is sparse and high-dimensional
- PCA and clustering reveal cell populations
- Marker genes help interpret biological meaning

### Next Steps:
- Try with real datasets (10X Genomics, GEO)
- Learn Seurat (R) or Scanpy (Python) for advanced analysis
- Explore trajectory analysis and RNA velocity
- Apply to your own research questions!

### 📚 Recommended Resources:
- **Scanpy**: https://scanpy.readthedocs.io/
- **Seurat**: https://satijalab.org/seurat/
- **Single Cell Course**: https://www.singlecellcourse.org/