# Lecture 2: Introduction to Single-Cell Technology

**Course:** BRAIN - Single-Cell Neurogenomics Training  
**Date:** December 6, 2025  
**Duration:** 90 minutes  
**Instructor:** [Your Name]  

---

## Learning Objectives

By the end of this lecture, you will be able to:

1. **Understand** the principles and advantages of single-cell genomics over bulk sequencing
2. **Learn** major single-cell sequencing technologies (10X Genomics, Drop-seq, Smart-seq)
3. **Explain** key concepts: UMIs, cell barcodes, droplet microfluidics
4. **Analyze** and visualize single-cell RNA-seq data using Python
5. **Explore** the structure and components of AnnData objects
6. **Apply** quality metrics to assess single-cell datasets
7. **Identify** applications in biology, neuroscience, and medicine

---

## Table of Contents

1. [Introduction: The Single-Cell Revolution](#introduction)
2. [Bulk vs. Single-Cell RNA-seq](#bulk-vs-single-cell)
3. [Single-Cell RNA-seq Technologies](#technologies)
4. [10X Genomics Chromium Platform](#10x-platform)
5. [Key Concepts: UMIs and Cell Barcodes](#umis-barcodes)
6. [Data Structure: AnnData Objects](#anndata-structure)
7. [Hands-On: Loading and Exploring scRNA-seq Data](#hands-on)
8. [Quality Control Metrics](#qc-metrics)
9. [Applications in Neuroscience and Medicine](#applications)
10. [Summary and Key Takeaways](#summary)
11. [Additional Resources](#resources)
12. [Homework Assignment](#homework)

---

<a id='introduction'></a>
## 1. Introduction: The Single-Cell Revolution

### The Challenge of Cellular Heterogeneity

Traditional bulk RNA sequencing measures the average gene expression across millions of cells, masking the remarkable diversity present in biological tissues. This is particularly problematic in the brain, where hundreds of distinct cell types coexist in close proximity, each with unique molecular signatures and functions.

**Single-cell RNA sequencing (scRNA-seq)** has revolutionized our ability to study biology at unprecedented resolution. Instead of measuring averaged signals from tissue homogenates, scRNA-seq profiles the transcriptome of individual cells, revealing:

- **Cell type diversity**: Discovery of new cell types and subtypes
- **Cellular states**: Transient states during development, differentiation, or disease
- **Cell-to-cell variability**: Heterogeneity within populations
- **Rare cell populations**: Detection of stem cells, progenitors, or diseased cells
- **Spatial organization**: Tissue architecture when combined with spatial methods

### Historical Context

The field of single-cell genomics has evolved rapidly:

- **2009**: First single-cell RNA-seq of a mouse blastomere (Tang et al., *Nature Methods*)
- **2013**: Smart-seq2 enables full-length transcript sequencing (Picelli et al., *Nature Methods*)
- **2015**: Drop-seq introduces droplet-based barcoding (Macosko et al., *Cell*)
- **2017**: 10X Genomics commercializes high-throughput platform (Zheng et al., *Nature Communications*)
- **2020s**: Multi-omics integration (RNA + protein + chromatin) and spatial methods

### Impact on Neuroscience

Single-cell technologies have been transformative for neuroscience:

1. **Brain Cell Atlases**: Comprehensive maps of all brain cell types in mouse and human
2. **Neuronal Diversity**: Over 100 distinct neuronal subtypes identified in cortex alone
3. **Disease Mechanisms**: Cell-type-specific changes in Alzheimer's, Parkinson's, and schizophrenia
4. **Development**: Trajectories of neurogenesis and gliogenesis
5. **Therapy**: Identifying cellular targets for precision medicine

### This Lecture's Focus

In this lecture, we will explore the technological principles underlying modern single-cell RNA-seq platforms, with particular emphasis on the widely used **10X Genomics Chromium system**. We will work with real data from peripheral blood mononuclear cells (PBMCs), learning how to load, explore, and interpret single-cell datasets using Python's scverse ecosystem.

---

<a id='bulk-vs-single-cell'></a>
## 2. Bulk vs. Single-Cell RNA-seq

### Bulk RNA-seq: Averaging Across Populations

**Bulk RNA-seq** sequences RNA from millions of cells simultaneously:

**Advantages:**
- High read depth per gene (more precise quantification)
- Lower cost per sample
- Well-established protocols
- Easier to detect low-abundance transcripts

**Limitations:**
- Cannot detect cell-to-cell variation
- Rare cell types are masked by abundant populations
- Cannot assign gene expression to specific cell types
- Mixed signals from heterogeneous tissues

### Single-Cell RNA-seq: Individual Cell Resolution

**scRNA-seq** profiles thousands to millions of individual cells:

**Advantages:**
- Cell type identification and discovery
- Detection of rare populations (e.g., 1% of cells)
- Cellular heterogeneity within populations
- Trajectory inference (developmental paths)
- Cell-cell communication analysis

**Limitations:**
- Lower read depth per cell (dropout events)
- Higher cost per cell
- Computational complexity
- 3' bias in droplet-based methods
- Technical noise and batch effects

### When to Use Each Approach?

**Use Bulk RNA-seq when:**
- Comparing conditions with homogeneous cell populations
- Detecting subtle gene expression changes
- Budget is limited
- Need deep sequencing of all transcripts

**Use scRNA-seq when:**
- Studying heterogeneous tissues (brain, tumors, immune system)
- Discovering new cell types
- Tracking cell state transitions
- Investigating rare populations
- Understanding cell-type-specific disease mechanisms

### The Resolution Trade-off

There is a fundamental trade-off between **number of cells** and **sequencing depth per cell**:

| Method | Cells | Reads per Cell | Use Case |
|--------|-------|----------------|----------|
| Bulk RNA-seq | Millions (pooled) | 20-50M per sample (not resolved per cell) | High-precision quantification |
| Smart-seq2/3 | 100-1,000 | 1-2M | Full-length transcripts, isoforms |
| 10X Genomics | 1,000-100,000 | 20-50K | Cell type discovery, large atlases |
| SPLiT-seq | 100,000-1M | 1-5K | Very large-scale screening |
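
To make the trade-off concrete, the sketch below divides a fixed read budget among different numbers of cells. The 400-million-read budget is an illustrative assumption, not a figure tied to any particular sequencer, and `reads_per_cell` is a hypothetical helper.

```python
def reads_per_cell(total_reads: int, n_cells: int) -> float:
    """Average sequencing depth per cell for a fixed read budget."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return total_reads / n_cells


TOTAL_READS = 400_000_000  # assumed budget, for illustration only

for n_cells in (1_000, 10_000, 100_000):
    depth = reads_per_cell(TOTAL_READS, n_cells)
    print(f"{n_cells:>7,} cells -> {depth:>9,.0f} reads/cell")
```

With the budget held constant, every tenfold increase in cell number costs a tenfold reduction in depth per cell, which is why the methods in the table occupy different corners of the design space.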

---

<a id='technologies'></a>
## 3. Single-Cell RNA-seq Technologies

### Overview of Major Platforms

Several technologies have been developed for scRNA-seq, each with distinct advantages:

#### 1. **Plate-Based Methods (Smart-seq2/3)**
- Individual cells sorted into 96/384-well plates
- Full-length transcript coverage
- Best for: Isoform analysis, alternative splicing
- Throughput: Hundreds of cells
- Cost: $10-20 per cell

#### 2. **Droplet-Based Methods (10X, Drop-seq, inDrop)**
- Cells encapsulated in nanoliter droplets with barcoded beads
- 3' or 5' end counting only
- Best for: Large-scale cell type discovery
- Throughput: Thousands to hundreds of thousands
- Cost: $0.10-1.00 per cell

#### 3. **Combinatorial Indexing (SPLiT-seq, sci-RNA-seq)**
- Barcoding through multiple rounds of split-pool
- No microfluidics required
- Best for: Ultra-high-throughput screens
- Throughput: Millions of cells
- Cost: $0.01-0.10 per cell

#### 4. **Multi-omic Methods (CITE-seq, SHARE-seq)**
- Simultaneous measurement of RNA and protein/chromatin
- Best for: Multi-modal profiling
- Throughput: Thousands of cells
- Cost: $2-5 per cell

### Technology Selection Criteria

**Choose based on your experimental goals:**

| Goal | Recommended Platform | Why? |
|------|---------------------|------|
| Cell type discovery in brain | 10X Genomics | High throughput, good sensitivity |
| Isoform analysis of neurons | Smart-seq2 | Full-length coverage |
| Profile ~1M cells in a large screen | SPLiT-seq | Ultra-high throughput |
| Surface protein + RNA | CITE-seq | Multi-modal profiling |
| Time-course development | 10X or Smart-seq | Balance of throughput and depth |

### Why 10X Genomics Dominates

The **10X Genomics Chromium** platform has become the de facto standard because:

1. **Balance**: Good throughput (10,000s cells) with reasonable sensitivity
2. **Standardization**: Commercial kits ensure reproducibility
3. **Cost-effective**: ~$0.50-1.00 per cell at scale
4. **User-friendly**: Minimal hands-on time (~30 minutes)
5. **Software**: Cell Ranger pipeline for data processing
6. **Ecosystem**: Compatible with multi-omics (RNA+protein, RNA+ATAC)

**Key publications using 10X:**
- Zheng et al. (2017). Massively parallel digital transcriptional profiling. *Nature Communications* 8:14049
- Many brain atlases: Allen, HCA, Linnarsson lab

---

<a id='10x-platform'></a>
## 4. 10X Genomics Chromium Platform

### Droplet Microfluidics Workflow

The 10X Chromium system uses **droplet microfluidics** to partition individual cells with barcoded beads:

#### Step 1: Cell Preparation
- Dissociate tissue into single-cell suspension
- Target: 700-1,200 viable cells/μL
- Mix with Master Mix (RT reagents)

#### Step 2: GEM Generation (Gel Bead-in-Emulsion)
- Cells flow through microfluidic chip
- Each cell co-encapsulated with:
  - **One gel bead** (coated with many copies of a barcoded primer, all sharing one of roughly 750,000 possible cell barcodes)
  - **RT reagents** (reverse transcription enzymes)
  - **Oil** (forms droplet boundary)
- Result: ~100,000 droplets (GEMs) per channel, most of which contain a bead but no cell
- Each GEM = individual reaction chamber

#### Step 3: Barcoding and Reverse Transcription
- Gel beads dissolve, releasing primers
- Each primer contains:
  1. **Partial Illumina Read 1 adapter** (sequencing primer site; the P5/P7 flow-cell adapters are added later during library preparation)
  2. **Cell barcode** (16bp, unique to that bead)
  3. **UMI** (Unique Molecular Identifier, 10-12bp)
  4. **Poly(dT)** (captures mRNA poly-A tails)
- mRNA reverse transcribed into cDNA
- All transcripts in one droplet receive the **same cell barcode**

#### Step 4: Breaking Emulsions and Cleanup
- Break droplets with recovery agent
- Pool all cDNA together
- Cleanup to remove oil and unused primers

#### Step 5: Library Preparation
- Amplify cDNA by PCR
- Fragment and add Illumina adapters
- Size selection (~400-600bp)
- Final library ready for sequencing

#### Step 6: Sequencing
- Illumina sequencing (NovaSeq, NextSeq)
- Read 1: Cell barcode (16bp) + UMI (10-12bp)
- Read 2: cDNA sequence (50-150bp)
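
The Read 1 layout described above can be sketched as a simple string split. This is a minimal illustration, not how any production pipeline parses reads: `split_read1` is a hypothetical helper, the 12 bp UMI length is an assumption (the text gives 10-12 bp depending on chemistry), and the sequence is made up.

```python
BARCODE_LEN = 16  # cell barcode length stated above
UMI_LEN = 12      # assumed here; 10-12 bp depending on chemistry


def split_read1(read1: str, barcode_len: int = BARCODE_LEN, umi_len: int = UMI_LEN):
    """Split a Read 1 sequence into (cell_barcode, umi)."""
    needed = barcode_len + umi_len
    if len(read1) < needed:
        raise ValueError(f"Read 1 too short: need {needed} bases, got {len(read1)}")
    return read1[:barcode_len], read1[barcode_len:needed]


# Made-up sequence: 16 barcode bases followed by 12 UMI bases
example_read1 = "AAACCTGAGAAACCAT" + "GGTTCAGTACGT"
barcode, umi = split_read1(example_read1)
print(f"barcode={barcode} umi={umi}")
```

Read 2 carries the cDNA insert and is what gets aligned to the reference; Read 1 only tells us which cell and which molecule the read came from.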

### Critical Quality Parameters

**Cell viability**: >70% (ideally >85%)
- Dead cells release ambient RNA, creating "soup"
- Causes contamination between droplets

**Cell concentration**: 700-1,200 cells/μL
- Too low: Waste reagents, low yield
- Too high: Multiplets (2+ cells in one droplet)

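**Target cell recovery**: up to ~10,000 cells per channel
- Capture efficiency: roughly 50-65% of loaded cells are recovered (e.g., 5,000-6,500 cells from 10,000 loaded)
- Losses come from cells that never end up in a usable droplet, plus doublet removal and QC filtering downstream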
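
The link between loading density and multiplets can be illustrated with a Poisson loading model. This is a common simplifying assumption, not the vendor's specification: `lam` is a hypothetical mean number of cells per droplet, and real multiplet rates also depend on cell clumping and bead loading.

```python
import math


def multiplet_fraction(lam: float) -> float:
    """Fraction of cell-containing droplets that hold 2+ cells (Poisson model)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    p0 = math.exp(-lam)        # P(droplet holds 0 cells)
    p1 = lam * math.exp(-lam)  # P(droplet holds exactly 1 cell)
    occupied = 1.0 - p0        # P(droplet holds at least 1 cell)
    return (occupied - p1) / occupied


for lam in (0.01, 0.05, 0.1, 0.2):
    pct = 100 * multiplet_fraction(lam)
    print(f"mean cells/droplet = {lam:.2f} -> {pct:.1f}% of occupied droplets are multiplets")
```

Under this model the multiplet fraction grows almost linearly with `lam` when `lam` is small, which is why overloading a channel trades cell yield against doublet contamination.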

### Advantages of the 10X System

1. **High throughput**: Up to 80,000 cells per run (8 channels)
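2. **Low doublet rate at moderate loading**: ~1-2% when recovering roughly 1,000-2,500 cells (compared to 5-10% for Drop-seq); the rate rises with loading density, at about 0.8% per 1,000 cells recovered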
3. **Sensitive**: Detects 1,000-5,000 genes per cell (cell-type dependent)
4. **Standardized**: Commercial reagents ensure reproducibility
5. **Fast**: 30 minutes hands-on, library in 1 day

### Limitations of the 10X System

1. **3' bias**: Reads cover only the 3' end of transcripts, so most isoform differences cannot be resolved
2. **Cost**: ~$1,000-2,000 per channel (reagents only)
3. **Dropout**: Only ~10% of mRNA molecules captured per cell
4. **Cell size bias**: Large cells (neurons) may be underrepresented
5. **Dissociation required**: Needs a suspension of dissociated cells (or nuclei), so spatial information is lost

---

<a id='umis-barcodes'></a>
## 5. Key Concepts: UMIs and Cell Barcodes

### Cell Barcodes: Identifying Cells

**Cell barcodes** (also called "cellular barcodes" or "CBs") are short DNA sequences that uniquely identify which cell a read originated from.

**Structure:**
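- Length: 16 base pairs (10X v2 and v3 chemistry)
- Diversity: 4^16 ≈ 4.3 billion possible sequences, of which only a designed subset is used
- Error correction: an observed barcode within one mismatch (Hamming distance 1) of a whitelist entry can be corrected to it
- Whitelist: 10X provides a list of valid barcodes (~737,000 sequences for v2 chemistry; several million for v3)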

**How it works:**
1. Each gel bead carries many copies of a primer with the **same** barcode
2. When a cell and a bead are encapsulated together, all transcripts from that cell receive that barcode
3. During sequencing, we read the barcode in Read 1
4. Assign each read to a cell based on its barcode

**Quality control:**
- Some barcodes are empty droplets (no cell)
- Use read count distribution to distinguish real cells from empty droplets
- Cell Ranger automatically filters low-quality barcodes
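
Whitelist matching with one-mismatch correction can be sketched as follows. This is a simplified illustration under stated assumptions: production pipelines also weigh base qualities and barcode abundance, whereas this version rescues a barcode only when exactly one whitelist entry is a single mismatch away. The helper names and the 8-base toy barcodes are hypothetical.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed: str, whitelist: Iterable[str]) -> Optional[str]:
    """Return the matching whitelist barcode, or None if it cannot be rescued."""
    valid = set(whitelist)
    if observed in valid:
        return observed  # exact match, nothing to correct
    # Accept a correction only if exactly one whitelist entry is 1 mismatch away
    close = [bc for bc in valid
             if len(bc) == len(observed) and hamming(observed, bc) == 1]
    return close[0] if len(close) == 1 else None


toy_whitelist = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]  # toy barcodes, not real ones
for observed in ("AAAACCCC", "AAAACCCG", "TTTTTTTT"):
    print(observed, "->", correct_barcode(observed, toy_whitelist))
```

Scanning the whole whitelist for every read is far too slow for real data; the linear search here is only meant to show the logic.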

### UMIs: Counting Molecules

**Unique Molecular Identifiers (UMIs)** are random sequences added to each mRNA molecule during reverse transcription. They enable us to count **individual mRNA molecules** rather than sequencing reads.

**Why UMIs are critical:**

**Without UMIs:**
- 1 mRNA molecule → Many PCR duplicates during amplification
- 100 reads might represent 1 molecule or 100 molecules
- Cannot distinguish biological signal from technical amplification bias

**With UMIs:**
- Each mRNA molecule receives a unique UMI during RT
- After sequencing, collapse reads with same: Cell barcode + Gene + UMI
- Count unique UMIs = count molecules
- Removes PCR bias and amplification artifacts

**UMI structure in 10X:**
- Length: 10 base pairs (v2 chemistry) or 12 base pairs (v3 chemistry)
- Diversity: 4^10 ≈ 1.05 million (10 bp) or 4^12 ≈ 16.8 million (12 bp) possible sequences
- Random: Each mRNA molecule gets different UMI (statistically)
- Error correction: Similar UMIs collapsed if associated with same gene

**Example:**
```
Raw reads:
Cell_A + Gene_X + UMI_123 (50 reads)
Cell_A + Gene_X + UMI_456 (30 reads)
Cell_A + Gene_Y + UMI_789 (100 reads)

After UMI collapsing:
Cell_A + Gene_X = 2 UMIs → 2 molecules detected
Cell_A + Gene_Y = 1 UMI → 1 molecule detected
```
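
The collapsing step in the example above can be reproduced in a few lines of Python. This is a minimal sketch that counts exact UMI matches only; it leaves out the error correction of near-identical UMIs mentioned earlier, and the read list is constructed by hand to mirror the example.

```python
from collections import defaultdict

# One (cell, gene, UMI) tuple per sequenced read, mirroring the example above
reads = (
    [("Cell_A", "Gene_X", "UMI_123")] * 50
    + [("Cell_A", "Gene_X", "UMI_456")] * 30
    + [("Cell_A", "Gene_Y", "UMI_789")] * 100
)

# Collect the distinct UMIs seen for every (cell, gene) pair
umis_seen = defaultdict(set)
for cell, gene, umi in reads:
    umis_seen[(cell, gene)].add(umi)

# Number of distinct UMIs = number of molecules detected
molecule_counts = {key: len(umis) for key, umis in umis_seen.items()}

for (cell, gene), n_molecules in sorted(molecule_counts.items()):
    print(f"{cell} + {gene} = {n_molecules} molecule(s)")
```

Note that the 180 reads collapse to just 3 molecules: read counts reflect amplification, while UMI counts reflect the captured mRNA.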

### From FASTQ to Count Matrix

The bioinformatics pipeline (e.g., Cell Ranger) performs these steps:

1. **Demultiplexing**: Assign reads to samples (if multiplexed)
2. **Barcode correction**: Match to whitelist, correct 1bp errors
3. **UMI extraction**: Extract UMI sequence from Read 1
4. **Alignment**: Map Read 2 to reference genome/transcriptome
5. **Gene assignment**: Determine which gene each read came from
6. **UMI counting**: 
   - Group reads by: Cell barcode + Gene + UMI
   - Count unique UMIs per cell-gene combination
7. **Cell calling**: Distinguish real cells from empty droplets
8. **Matrix generation**: Create gene × cell matrix of UMI counts

**Output:** A **sparse matrix** where:
- Rows = genes (~20,000-30,000)
- Columns = cells (~1,000-100,000)
- Values = UMI counts (integers)
- Sparsity: ~90-95% zeros (dropout plus genes that are truly not expressed in a given cell)
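
The final matrix-generation step can be sketched with `scipy.sparse`. The gene names, cell names, and counts below are made up for illustration; only the non-zero entries are stored, which is what makes the sparse format memory efficient.

```python
import numpy as np
from scipy import sparse

genes = ["Gene_X", "Gene_Y", "Gene_Z"]
cells = ["Cell_A", "Cell_B"]

# (gene, cell, UMI count) triplets for the non-zero entries; values are made up
triplets = [("Gene_X", "Cell_A", 2), ("Gene_Y", "Cell_A", 1), ("Gene_Z", "Cell_B", 4)]

gene_index = {g: i for i, g in enumerate(genes)}
cell_index = {c: j for j, c in enumerate(cells)}

rows = [gene_index[g] for g, _, _ in triplets]
cols = [cell_index[c] for _, c, _ in triplets]
vals = [n for _, _, n in triplets]

# genes x cells matrix, as in the pipeline output described above
counts = sparse.csr_matrix(
    (vals, (rows, cols)), shape=(len(genes), len(cells)), dtype=np.int32
)

sparsity = 1.0 - counts.nnz / (counts.shape[0] * counts.shape[1])
print(counts.toarray())
print(f"Sparsity: {sparsity:.0%} zeros")
```

Transposing with `counts.T` gives the cells × genes orientation that AnnData uses for its main matrix.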

---

<a id='anndata-structure'></a>
## 6. Data Structure: AnnData Objects

### What is AnnData?

**AnnData** (Annotated Data) is the standard data structure for single-cell analysis in Python. It's designed to store:
- **Expression matrix** (cells × genes)
- **Cell metadata** (cell type, batch, QC metrics)
- **Gene metadata** (gene names, chromosomes, features)
- **Unstructured annotations** (methods, parameters, author)
- **Dimensionality reductions** (PCA, UMAP coordinates)
- **Multi-layer data** (raw counts, normalized, scaled)

### AnnData Structure

An AnnData object has several components:

```python
adata = sc.AnnData(
    X=count_matrix,           # Main data matrix (n_obs × n_vars)
    obs=cell_metadata,        # Cell annotations (n_obs × m)
    var=gene_metadata,        # Gene annotations (n_vars × p)
    uns=unstructured_dict,    # Unstructured annotations (dict)
    obsm=reduced_dims,        # Multi-dimensional cell annotations (e.g., PCA)
    varm=gene_embeddings,     # Multi-dimensional gene annotations
    layers=alternative_X      # Additional matrices (raw, normalized)
)
```

### Key Components Explained

#### 1. `adata.X` - Expression Matrix
- **Type**: Sparse matrix (scipy.sparse) or NumPy array
- **Shape**: (n_cells, n_genes)
- **Values**: UMI counts (integers) or normalized expression (floats)
- **Access**: `adata.X[0, :]` gets first cell's expression across all genes

#### 2. `adata.obs` - Cell Metadata (DataFrame)
- **Type**: pandas DataFrame
- **Rows**: Cells (same order as adata.X)
- **Columns**: Metadata features
- **Common columns**:
  - `n_genes_by_counts`: Number of genes detected
  - `total_counts`: Total UMI count
  - `pct_counts_mt`: Mitochondrial percentage
  - `cell_type`: Annotated cell type
  - `batch`: Experimental batch
  - `leiden`: Cluster assignment

#### 3. `adata.var` - Gene Metadata (DataFrame)
- **Type**: pandas DataFrame
- **Rows**: Genes (same order as adata.X)
- **Columns**: Gene features
- **Common columns**:
  - `gene_ids`: Ensembl IDs
  - `n_cells_by_counts`: Number of cells expressing gene
  - `mean_counts`: Average expression
  - `highly_variable`: Boolean, HVG selection
  - `mt`: Boolean, mitochondrial gene

#### 4. `adata.uns` - Unstructured Annotations (Dict)
- **Type**: Python dictionary
- **Contents**: 
  - Plotting parameters
  - Analysis results (e.g., differential expression)
  - Method parameters
  - Dataset provenance
- **Example**: `adata.uns['log1p']`, where `sc.pp.log1p` records the logarithm base it used

#### 5. `adata.obsm` - Multi-dimensional Cell Annotations
- **Type**: Dictionary of NumPy arrays
- **Shape**: (n_cells, n_dimensions)
- **Common entries**:
  - `adata.obsm['X_pca']`: PCA coordinates (n_cells × 50)
  - `adata.obsm['X_umap']`: UMAP coordinates (n_cells × 2)
  - `adata.obsm['X_tsne']`: t-SNE coordinates (n_cells × 2)

#### 6. `adata.layers` - Alternative Expression Matrices
- **Type**: Dictionary of sparse/dense matrices
- **Shape**: Same as adata.X (n_cells × n_genes)
- **Common layers**:
  - `adata.layers['counts']`: Raw UMI counts (before normalization)
  - `adata.layers['lognorm']`: Log-normalized counts
  - `adata.layers['scaled']`: Scaled expression (z-scores)

### Advantages of AnnData

1. **Integration**: Works seamlessly with scanpy, scvi-tools, squidpy
2. **Memory efficient**: Uses sparse matrices for count data
3. **Slicing**: Can subset by cells or genes while keeping metadata aligned
4. **I/O**: Read/write with `.h5ad` format (HDF5-based, compressed)
5. **Interoperability**: Convert to/from Seurat (R), loom, SingleCellExperiment

### Example Operations

```python
# Subset to T cells
adata_tcells = adata[adata.obs['cell_type'] == 'T cell', :]

# Subset to highly variable genes
adata_hvg = adata[:, adata.var['highly_variable']]

# Access PCA coordinates
pca_coords = adata.obsm['X_pca']

# Get expression of one gene
cd8a_expression = adata[:, 'CD8A'].X.toarray().flatten()
```

---

<a id='hands-on'></a>
## 7. Hands-On: Loading and Exploring scRNA-seq Data

Now let's work with real single-cell data! We'll use the **PBMC 3k dataset** from 10X Genomics, published by **Zheng et al. (2017)** in *Nature Communications*.

### Dataset Information

- **Tissue**: Peripheral Blood Mononuclear Cells (PBMCs)
- **Species**: Human (*Homo sapiens*)
- **Technology**: 10X Genomics Chromium (v1 chemistry)
- **Cells**: ~2,700 cells (after Cell Ranger cell calling, before any QC filtering)
- **Genes**: ~32,000 genes (human transcriptome)
- **Citation**: Zheng et al. (2017) *Nature Communications* 8:14049
- **DOI**: 10.1038/ncomms14049

**Why PBMCs?**
- Well-characterized cell types (T cells, B cells, monocytes, NK cells)
- Already a single-cell suspension once isolated from blood (no enzymatic tissue digestion needed)
- Standard benchmark dataset for scRNA-seq methods

---

```python
# Import required libraries
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure scanpy settings
sc.settings.verbosity = 2  # Print info, warnings, and errors
sc.settings.set_figure_params(
    dpi=100,              # Resolution for inline plots
    frameon=False,        # Remove plot frames
    figsize=(8, 6),       # Default figure size
    fontsize=12           # Font size for labels
)

# Set seaborn style for professional plots
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.2)

# Print library versions for reproducibility
print("="*70)
print("Library Versions")
print("="*70)
print(f"Scanpy: {sc.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print("="*70)
```

### Loading the PBMC Dataset

Scanpy provides convenient functions to download and load the PBMC 3k dataset directly.

```python
# Load the PBMC 3k dataset from 10X Genomics
# This dataset is hosted by scanpy and will be downloaded automatically
# If you've run this before, it will use the cached version

print("Loading PBMC 3k dataset...\n")
adata = sc.datasets.pbmc3k()

print("="*70)
print("Dataset loaded successfully!")
print("="*70)
print(f"Number of cells: {adata.n_obs:,}")
print(f"Number of genes: {adata.n_vars:,}")
print("="*70)
```

### Exploring the AnnData Structure

Let's examine the different components of our AnnData object.

```python
# Display overview of the AnnData object
print("AnnData Object Overview:")
print(adata)
print("\n")

# Inspect the expression matrix (adata.X)
print("="*70)
print("Expression Matrix (.X)")
print("="*70)
print(f"Type: {type(adata.X)}")
print(f"Shape: {adata.X.shape} (cells × genes)")
print(f"Data type: {adata.X.dtype}")
print(f"Is sparse: {hasattr(adata.X, 'nnz')}")
if hasattr(adata.X, 'nnz'):
    total_elements = adata.X.shape[0] * adata.X.shape[1]
    sparsity = 100 * (1 - adata.X.nnz / total_elements)
    print(f"Sparsity: {sparsity:.2f}% zeros")
print("="*70)
```

```python
# Examine cell metadata (adata.obs)
print("\nCell Metadata (.obs):")
print("="*70)
print(f"Shape: {adata.obs.shape}")
print(f"Columns: {list(adata.obs.columns)}")
print("\nFirst 5 cells:")
print(adata.obs.head())
print("="*70)
```

```python
# Examine gene metadata (adata.var)
print("\nGene Metadata (.var):")
print("="*70)
print(f"Shape: {adata.var.shape}")
print(f"Columns: {list(adata.var.columns)}")
print("\nFirst 10 genes:")
print(adata.var.head(10))
print("="*70)
```

```python
# Check if there are any precomputed embeddings
print("\nMulti-dimensional Annotations (.obsm):")
print("="*70)
if len(adata.obsm.keys()) > 0:
    print("Available embeddings:")
    for key in adata.obsm.keys():
        print(f"  - {key}: shape {adata.obsm[key].shape}")
else:
    print("No precomputed embeddings (we'll compute PCA and UMAP later)")
print("="*70)
```

### Examining the Count Matrix

Let's look at the actual UMI count values in the expression matrix.

```python
# Get expression data for first 5 cells and 10 genes
# Convert sparse matrix to dense array for viewing
sample_expression = adata.X[:5, :10].toarray()

# Create a DataFrame for better visualization
sample_df = pd.DataFrame(
    sample_expression,
    index=adata.obs_names[:5],
    columns=adata.var_names[:10]
)

print("\nExpression Matrix Sample (UMI Counts):")
print("="*70)
print("Rows = cells, Columns = genes, Values = UMI counts\n")
print(sample_df)
print("\nNote: Many zeros due to dropout and biological sparsity")
print("="*70)
```

### Basic Statistics

Let's compute some basic statistics about our dataset.

```python
# Calculate total UMI counts per cell
counts_per_cell = np.array(adata.X.sum(axis=1)).flatten()

# Calculate number of genes detected per cell (genes with UMI > 0)
genes_per_cell = np.array((adata.X > 0).sum(axis=1)).flatten()

# Calculate total UMI counts per gene
counts_per_gene = np.array(adata.X.sum(axis=0)).flatten()

# Calculate number of cells expressing each gene
cells_per_gene = np.array((adata.X > 0).sum(axis=0)).flatten()

print("="*70)
print("Dataset Statistics")
print("="*70)
print("\nPer-Cell Metrics:")
print(f"  Total UMI counts per cell:")
print(f"    - Mean: {counts_per_cell.mean():,.0f}")
print(f"    - Median: {np.median(counts_per_cell):,.0f}")
print(f"    - Min: {counts_per_cell.min():,.0f}")
print(f"    - Max: {counts_per_cell.max():,.0f}")
print(f"\n  Genes detected per cell:")
print(f"    - Mean: {genes_per_cell.mean():,.0f}")
print(f"    - Median: {np.median(genes_per_cell):,.0f}")
print(f"    - Min: {genes_per_cell.min():,.0f}")
print(f"    - Max: {genes_per_cell.max():,.0f}")

print("\nPer-Gene Metrics:")
print(f"  Total UMI counts per gene:")
print(f"    - Mean: {counts_per_gene.mean():,.0f}")
print(f"    - Median: {np.median(counts_per_gene):,.0f}")
print(f"\n  Cells expressing each gene:")
print(f"    - Mean: {cells_per_gene.mean():,.0f}")
print(f"    - Median: {np.median(cells_per_gene):,.0f}")

# Calculate overall sparsity
if hasattr(adata.X, 'nnz'):
    total_elements = adata.X.shape[0] * adata.X.shape[1]
    non_zero = adata.X.nnz
    sparsity = 100 * (1 - non_zero / total_elements)
    print(f"\nMatrix Sparsity: {sparsity:.2f}% zeros")
    print(f"Non-zero elements: {non_zero:,} / {total_elements:,}")
print("="*70)
```

### Visualizing Count Distributions

Let's create some visualizations to better understand our data.

```python
# Create figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Total UMI counts per cell
axes[0, 0].hist(counts_per_cell, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(np.median(counts_per_cell), color='red', linestyle='--', linewidth=2, label='Median')
axes[0, 0].set_xlabel('Total UMI Counts')
axes[0, 0].set_ylabel('Number of Cells')
axes[0, 0].set_title('Distribution of Total UMI Counts per Cell')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Genes detected per cell
axes[0, 1].hist(genes_per_cell, bins=50, color='darkorange', edgecolor='black', alpha=0.7)
axes[0, 1].axvline(np.median(genes_per_cell), color='red', linestyle='--', linewidth=2, label='Median')
axes[0, 1].set_xlabel('Number of Genes Detected')
axes[0, 1].set_ylabel('Number of Cells')
axes[0, 1].set_title('Distribution of Genes Detected per Cell')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: UMI counts vs genes detected (scatter)
axes[1, 0].scatter(genes_per_cell, counts_per_cell, alpha=0.3, s=5, color='green')
axes[1, 0].set_xlabel('Genes Detected per Cell')
axes[1, 0].set_ylabel('Total UMI Counts per Cell')
axes[1, 0].set_title('UMI Counts vs Genes Detected')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Cells expressing each gene (log scale)
axes[1, 1].hist(cells_per_gene, bins=50, color='purple', edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Number of Cells Expressing Gene')
axes[1, 1].set_ylabel('Number of Genes')
axes[1, 1].set_title('Distribution of Gene Expression Breadth')
axes[1, 1].set_yscale('log')  # Log scale because many genes are rare
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Top left: Most cells have 1,000-5,000 total UMI counts")
print("- Top right: Most cells detect 500-1,500 genes")
print("- Bottom left: Strong correlation between UMI counts and genes (expected)")
print("- Bottom right: Many genes are detected in few cells (dropout + rare genes)")
```

### Looking at Specific Genes

Let's examine the expression of some well-known marker genes for different immune cell types.

```python
# Define marker genes for major PBMC cell types
marker_genes = {
    'CD3D': 'T cells (general)',
    'CD8A': 'CD8+ T cells (cytotoxic)',
    'CD4': 'CD4+ T cells (helper)',
    'CD79A': 'B cells',
    'MS4A1': 'B cells (CD20)',
    'CD14': 'Monocytes',
    'FCGR3A': 'NK cells / CD16+ monocytes',
    'NKG7': 'NK cells'
}

# Check which markers are present in our dataset
available_markers = {gene: desc for gene, desc in marker_genes.items()
                     if gene in adata.var_names}

print("="*70)
print("Marker Gene Expression Summary")
print("="*70)

for gene, description in available_markers.items():
    # Get expression values for this gene
    expression = adata[:, gene].X.toarray().flatten()

    # Calculate statistics
    n_expressing = np.sum(expression > 0)
    pct_expressing = 100 * n_expressing / len(expression)
    mean_in_expressing = expression[expression > 0].mean() if n_expressing > 0 else 0

    print(f"\n{gene} ({description}):")
    print(f"  Cells expressing: {n_expressing:,} / {len(expression):,} ({pct_expressing:.1f}%)")
    print(f"  Mean UMI in expressing cells: {mean_in_expressing:.2f}")
    print(f"  Max UMI: {expression.max():.0f}")

print("="*70)
```

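**Interpretation:**

Different marker genes show distinct expression patterns. The figures below are rough expectations; compare them with the values printed by the cell above:
- **CD3D**: Expressed in ~40-50% of cells (T cells are abundant in PBMCs)
- **CD79A, MS4A1**: Expressed in ~15-20% of cells (B cells)
- **CD14**: Expressed in ~10-15% of cells (monocytes)
- **NKG7, FCGR3A**: Mark NK cells but are not exclusive to them (NKG7 is also expressed by cytotoxic T cells, FCGR3A by CD16+ monocytes), so they are detected in more cells than there are NK cells

Because of dropout, the fraction of cells in which a marker is detected tends to underestimate the true fraction of that cell type. The overall picture still matches expected PBMC composition: T cells are the most abundant population, followed by smaller, donor-dependent fractions of monocytes, B cells, and NK cells.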

---

<a id='qc-metrics'></a>
## 8. Quality Control Metrics

### Why Quality Control?

Not all cells captured in droplets are high-quality. Common problems include:

1. **Dying/dead cells**: Release RNA into solution, high mitochondrial content
2. **Empty droplets**: No cell, only ambient RNA
3. **Doublets/multiplets**: Two or more cells in one droplet
4. **Low-quality cells**: Damaged during dissociation, low RNA content
5. **Debris**: Cellular fragments, apoptotic bodies

### Standard QC Metrics

We typically evaluate three main metrics:

#### 1. **Total UMI Counts (Library Size)**
- **What it measures**: Total number of UMIs detected per cell
- **Expected range**: 1,000-10,000 for 10X (cell-type dependent)
- **Low values indicate**: Empty droplets, debris, or poor capture efficiency
- **High values indicate**: Doublets or very large cells (neurons)

#### 2. **Genes Detected per Cell**
- **What it measures**: Number of genes with at least 1 UMI
- **Expected range**: 500-5,000 for 10X
- **Low values indicate**: Low-quality cells, empty droplets
- **High values indicate**: Doublets or transcriptionally active cells

#### 3. **Mitochondrial Gene Percentage**
- **What it measures**: % of UMIs from mitochondrial genes (MT-*)
- **Expected range**: Below a tissue-dependent cutoff, commonly set somewhere between 5% and 20%
- **High values indicate**: Dying/stressed cells
- **Why**: Dying cells lose cytoplasmic RNA but retain mitochondrial RNA
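
To show what the three metrics actually compute, here is a NumPy-only sketch on a made-up 3-cell count matrix. It mirrors the quantities that scanpy reports as `total_counts`, `n_genes_by_counts`, and `pct_counts_mt`, but it is an illustration, not scanpy's implementation.

```python
import numpy as np

gene_names = np.array(["CD3D", "MS4A1", "MT-CO1", "MT-ND1"])

# Toy count matrix (cells x genes); values are made up
counts = np.array([
    [5, 0, 1, 0],  # healthy-looking cell
    [0, 1, 6, 3],  # dominated by mitochondrial counts
    [0, 0, 0, 0],  # barcode with no counts at all
])

is_mt = np.char.startswith(gene_names, "MT-")

total_counts = counts.sum(axis=1)         # metric 1: library size
n_genes = (counts > 0).sum(axis=1)        # metric 2: genes detected
mt_counts = counts[:, is_mt].sum(axis=1)  # UMIs from mitochondrial genes

# Metric 3: mitochondrial percentage, guarding against division by zero
pct_mt = np.divide(
    100.0 * mt_counts,
    total_counts,
    out=np.zeros(len(total_counts)),
    where=total_counts > 0,
)

for i in range(counts.shape[0]):
    print(f"cell {i}: total={total_counts[i]}, genes={n_genes[i]}, mt={pct_mt[i]:.1f}%")
```

The second toy cell would be flagged by a mitochondrial cutoff and the third by a minimum-counts cutoff, which is how the thresholds discussed above translate into boolean filters.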

### Calculating QC Metrics with Scanpy

Let's calculate these metrics for our PBMC dataset.

```python
# Identify mitochondrial genes (start with "MT-" in humans)
adata.var['mt'] = adata.var_names.str.startswith('MT-')

# Count number of mitochondrial genes
n_mt_genes = adata.var['mt'].sum()
print(f"Number of mitochondrial genes detected: {n_mt_genes}")
print(f"Mitochondrial genes: {list(adata.var_names[adata.var['mt']])}")
print("\n")

# Calculate QC metrics
# This adds several columns to adata.obs:
#   - n_genes_by_counts: number of genes detected
#   - total_counts: total UMI count
#   - pct_counts_mt: percentage of mitochondrial UMIs
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=['mt'],      # Calculate metrics for mitochondrial genes
    percent_top=None,    # Don't calculate top gene percentages (optional)
    log1p=False,         # Don't log-transform the metrics
    inplace=True         # Modify adata in-place
)

print("="*70)
print("QC Metrics Calculated")
print("="*70)
print("\nNew columns added to adata.obs:")
for col in ['n_genes_by_counts', 'total_counts', 'pct_counts_mt']:
    if col in adata.obs.columns:
        print(f"  - {col}")
print("\nFirst 5 cells with QC metrics:")
print(adata.obs[['n_genes_by_counts', 'total_counts', 'pct_counts_mt']].head())
print("="*70)
```

```python
# Summary statistics for QC metrics
print("\n" + "="*70)
print("QC Metrics Summary")
print("="*70)
print(f"\nTotal UMI Counts:")
print(f"  Mean: {adata.obs['total_counts'].mean():,.0f}")
print(f"  Median: {adata.obs['total_counts'].median():,.0f}")
print(f"  Min: {adata.obs['total_counts'].min():,.0f}")
print(f"  Max: {adata.obs['total_counts'].max():,.0f}")

print(f"\nGenes Detected:")
print(f"  Mean: {adata.obs['n_genes_by_counts'].mean():,.0f}")
print(f"  Median: {adata.obs['n_genes_by_counts'].median():,.0f}")
print(f"  Min: {adata.obs['n_genes_by_counts'].min():,.0f}")
print(f"  Max: {adata.obs['n_genes_by_counts'].max():,.0f}")

print(f"\nMitochondrial Percentage:")
print(f"  Mean: {adata.obs['pct_counts_mt'].mean():.2f}%")
print(f"  Median: {adata.obs['pct_counts_mt'].median():.2f}%")
print(f"  Min: {adata.obs['pct_counts_mt'].min():.2f}%")
print(f"  Max: {adata.obs['pct_counts_mt'].max():.2f}%")

# Identify potential low-quality cells
low_counts = (adata.obs['total_counts'] < 1000).sum()
low_genes = (adata.obs['n_genes_by_counts'] < 200).sum()
high_mt = (adata.obs['pct_counts_mt'] > 20).sum()

print(f"\nPotential Low-Quality Cells:")
print(f"  Cells with <1,000 UMIs: {low_counts} ({100*low_counts/adata.n_obs:.1f}%)")
print(f"  Cells with <200 genes: {low_genes} ({100*low_genes/adata.n_obs:.1f}%)")
print(f"  Cells with >20% MT: {high_mt} ({100*high_mt/adata.n_obs:.1f}%)")
print("="*70)
```

In [None]:
# Visualize QC metrics
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Violin plot for total counts
sc.pl.violin(adata, 'total_counts', ax=axes[0], show=False)
axes[0].set_title('Total UMI Counts per Cell')

# Violin plot for genes detected
sc.pl.violin(adata, 'n_genes_by_counts', ax=axes[1], show=False)
axes[1].set_title('Genes Detected per Cell')

# Violin plot for mitochondrial percentage
sc.pl.violin(adata, 'pct_counts_mt', ax=axes[2], show=False)
axes[2].set_title('Mitochondrial Gene Percentage')
axes[2].axhline(y=20, color='red', linestyle='--', linewidth=2, label='20% threshold')
axes[2].legend()

plt.tight_layout()
plt.show()

In [None]:
# Scatter plots showing relationships between QC metrics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Total counts vs genes detected
scatter = axes[0].scatter(
    adata.obs['total_counts'],
    adata.obs['n_genes_by_counts'],
    c=adata.obs['pct_counts_mt'],
    cmap='RdYlBu_r',
    s=5,
    alpha=0.5
)
axes[0].set_xlabel('Total UMI Counts')
axes[0].set_ylabel('Genes Detected')
axes[0].set_title('UMI Counts vs Genes Detected\n(colored by MT%)')
plt.colorbar(scatter, ax=axes[0], label='MT %')

# Total counts vs MT percentage
axes[1].scatter(
    adata.obs['total_counts'],
    adata.obs['pct_counts_mt'],
    s=5,
    alpha=0.5,
    color='coral'
)
axes[1].set_xlabel('Total UMI Counts')
axes[1].set_ylabel('Mitochondrial %')
axes[1].set_title('UMI Counts vs Mitochondrial %')
axes[1].axhline(y=20, color='red', linestyle='--', linewidth=2, label='20% threshold')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Left: Strong correlation between UMI counts and genes detected (expected)")
print("- Left: Cells with high MT% (red/orange) often have lower counts (dying cells)")
print("- Right: Most cells have <10% MT content (good quality)")
print("- Right: Few outliers with >20% MT should be filtered in QC step")

### Quality Control Filtering (Preview)

In Lecture 5 (Quality Control and Preprocessing), we will apply filtering thresholds such as:

```python
# Typical filtering criteria
sc.pp.filter_cells(adata, min_genes=200)        # Remove cells with <200 genes
sc.pp.filter_genes(adata, min_cells=3)          # Remove genes in <3 cells
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()  # Remove high-MT cells (copy avoids a view)
```

These thresholds are **dataset-dependent** and should be determined by:
- Visualizing distributions
- Understanding tissue biology
- Comparing to similar published datasets
- Iterative analysis and refinement
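
One way to make "visualizing distributions" concrete is to flag cells that sit several median absolute deviations (MADs) away from the median of a metric. The sketch below is an illustration under stated assumptions, not the filtering used in this course: `mad_outliers` is a hypothetical helper (not a scanpy function), the cutoff of 5 MADs is only a common starting point, and the toy values stand in for `adata.obs['total_counts']`.

```python
import numpy as np

def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median.

    Hypothetical helper for illustration; not part of scanpy.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad

# Toy stand-in for adata.obs['total_counts'] (made-up numbers).
# Counts are right-skewed, so the test is applied on a log1p scale.
total_counts = np.array([2100, 2400, 2250, 1980, 2600, 2300, 150, 2350])
is_outlier = mad_outliers(np.log1p(total_counts), n_mads=5)
print(is_outlier)
```

Because the cutoff is derived from the data, the same call adapts to datasets with different sequencing depths, but the flagged cells should still be checked against the tissue biology before removal.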

---

<a id='applications'></a>
## 9. Applications in Neuroscience and Medicine

### Neuroscience Applications

Single-cell RNA-seq has transformed neuroscience research:

#### 1. **Brain Cell Type Atlases**
- **Mouse Brain**: Allen Institute's whole-brain atlas (about 4.0M scRNA-seq cells passing QC plus about 4.3M MERFISH cells) defined 5,322 transcriptomic clusters
- **Human Brain**: Human Cell Atlas efforts aim to characterize cell types across brain regions
- **Impact**: Standard reference for cell type classification
- **Citation**: Yao et al. (2023) *Nature* 624:317-332

#### 2. **Neuronal Diversity**
- A single cortical area can contain roughly 50-100 neuronal subtypes
- Discover rare interneuron populations (<1% of neurons)
- Link transcriptional identity to electrophysiological properties
- **Citation**: Zeisel et al. (2015) *Science* 347:1138-1142

#### 3. **Neurodevelopment**
- Trajectory analysis of neural progenitor differentiation
- Identify transcription factors driving cell fate decisions
- Map spatial organization during cortical development
- **Citation**: Nowakowski et al. (2017) *Science* 358:1318-1323

#### 4. **Neurodegenerative Diseases**
- **Alzheimer's Disease**: Cell-type-specific changes in microglia and astrocytes
- **Parkinson's Disease**: Dopaminergic neuron vulnerability in substantia nigra
- **ALS**: Motor neuron stress signatures and glial activation
- **Citation**: Mathys et al. (2019) *Nature* 570:332-337

#### 5. **Psychiatric Disorders**
- Schizophrenia: Altered excitatory/inhibitory balance
- Autism: Convergence on specific cortical cell types
- Depression: Stress-induced transcriptional changes
- **Citation**: Velmeshev et al. (2019) *Science* 364:685-689

### Medical Applications Beyond Neuroscience

#### 1. **Cancer Biology**
- Tumor heterogeneity and evolution
- Identification of cancer stem cells
- Tumor-immune cell interactions
- Metastasis mechanisms

#### 2. **Immunology**
- Immune cell state transitions during infection
- Autoimmune disease mechanisms
- Vaccine response profiling
- COVID-19 immune signatures

#### 3. **Drug Discovery**
- Identify cellular targets for therapy
- Predict drug responses based on cell state
- Screen compounds on patient-derived cells
- Personalized medicine approaches

#### 4. **Developmental Biology**
- Organoid characterization
- Stem cell differentiation protocols
- Tissue regeneration studies
- Evolutionary developmental biology (evo-devo)

### Future Directions

The field continues to evolve rapidly:

1. **Spatial Multi-omics**: Combining RNA, protein, and chromatin in spatial context
2. **Live-cell Sequencing**: Imaging + sequencing of the same cells
3. **Single-cell Epigenomics**: ATAC-seq, ChIP-seq at single-cell resolution
4. **CRISPR Screens**: Perturb-seq linking genotype to phenotype
5. **Clinical Integration**: Diagnostic tools based on single-cell signatures

---

<a id='summary'></a>
## 10. Summary and Key Takeaways

### Key Concepts Learned

1. **Single-cell RNA-seq overcomes bulk RNA-seq limitations** by profiling individual cells, revealing heterogeneity, rare populations, and cellular states

2. **Multiple scRNA-seq technologies exist**, with droplet-based methods (10X Genomics) dominating due to high throughput and cost-effectiveness

3. **10X Chromium uses droplet microfluidics** to co-encapsulate cells with barcoded beads, enabling massively parallel single-cell profiling

4. **Cell barcodes identify cells**, while **UMIs count molecules**, together enabling accurate quantification of gene expression

5. **AnnData objects are the standard format** for storing scRNA-seq data in Python, integrating expression matrices with metadata

6. **Quality control is essential**, evaluating total counts, genes detected, and mitochondrial percentage to filter low-quality cells

7. **scRNA-seq has revolutionized neuroscience**, enabling brain cell atlases, disease mechanism studies, and developmental trajectories
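
Takeaway 4 can be made concrete with a toy example. The sketch below assumes reads have already been aligned and reduced to hypothetical `(cell_barcode, umi, gene)` tuples with made-up sequences; it deliberately ignores the barcode and UMI error correction that real pipelines perform.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell barcode, UMI, gene)
reads = [
    ("AAAC", "TTGCA", "CD3D"),
    ("AAAC", "TTGCA", "CD3D"),  # PCR duplicate: same cell, UMI, and gene
    ("AAAC", "GGATC", "CD3D"),  # a second CD3D molecule in the same cell
    ("AAAC", "CCTAG", "NKG7"),
    ("GGTT", "TTGCA", "CD3D"),  # same UMI sequence, but a different cell
]

def count_umis(reads):
    """Count unique UMIs per (cell, gene); duplicate reads collapse to one molecule."""
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        umis[(barcode, gene)].add(umi)
    return {key: len(umi_set) for key, umi_set in umis.items()}

counts = count_umis(reads)
print(counts)
```

Five reads collapse to four counted molecules: the barcode assigns each read to a cell, and the set of UMIs removes the amplification duplicate.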

### Technical Skills Acquired

✅ Load and explore scRNA-seq datasets using scanpy  
✅ Navigate AnnData object structure (.X, .obs, .var, .obsm)  
✅ Calculate and interpret QC metrics  
✅ Visualize count distributions and marker gene expression  
✅ Understand the relationship between raw data (FASTQ) and count matrices  

### What's Next?

**Lecture 3: Python Fundamentals**
- Deepen Python skills for bioinformatics
- Data manipulation with NumPy and Pandas
- Creating publication-quality visualizations

**Lecture 4: Quantification Pipeline**
- From FASTQ files to count matrices
- Running Cell Ranger and kallisto | bustools
- Interpreting pipeline outputs

**Lecture 5: Quality Control and Preprocessing**
- Apply filtering thresholds
- Normalization and feature selection
- Dimensionality reduction (PCA)

---

<a id='resources'></a>
## 11. Additional Resources

### Primary Literature

1. **Zheng et al. (2017)** Massively parallel digital transcriptional profiling of single cells. *Nature Communications* 8:14049
   - DOI: [10.1038/ncomms14049](https://doi.org/10.1038/ncomms14049)
   - Original 10X Genomics method paper

2. **Macosko et al. (2015)** Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. *Cell* 161:1202-1214
   - DOI: [10.1016/j.cell.2015.05.002](https://doi.org/10.1016/j.cell.2015.05.002)
   - Drop-seq pioneering paper

3. **Wolf et al. (2018)** SCANPY: large-scale single-cell gene expression data analysis. *Genome Biology* 19:15
   - DOI: [10.1186/s13059-017-1382-0](https://doi.org/10.1186/s13059-017-1382-0)
   - Scanpy software description

4. **Luecken & Theis (2019)** Current best practices in single-cell RNA-seq analysis: a tutorial. *Molecular Systems Biology* 15:e8746
   - DOI: [10.15252/msb.20188746](https://doi.org/10.15252/msb.20188746)
   - Comprehensive tutorial on scRNA-seq analysis

### Software and Documentation

- **Scanpy**: [https://scanpy.readthedocs.io](https://scanpy.readthedocs.io)
- **AnnData**: [https://anndata.readthedocs.io](https://anndata.readthedocs.io)
- **10X Genomics**: [https://www.10xgenomics.com/resources](https://www.10xgenomics.com/resources)
- **Cell Ranger**: [https://support.10xgenomics.com/single-cell-gene-expression/software](https://support.10xgenomics.com/single-cell-gene-expression/software)

### Online Courses and Tutorials

- **Scanpy Tutorials**: [https://scanpy-tutorials.readthedocs.io](https://scanpy-tutorials.readthedocs.io)
- **Harvard scRNA-seq Course**: [https://hbctraining.github.io/scRNA-seq](https://hbctraining.github.io/scRNA-seq)
- **Bioconductor OSCA Book (Orchestrating Single-Cell Analysis)**: [https://bioconductor.org/books/release/OSCA/](https://bioconductor.org/books/release/OSCA/)

### Datasets

- **10X Genomics Public Datasets**: [https://www.10xgenomics.com/datasets](https://www.10xgenomics.com/datasets)
- **Allen Brain Atlas**: [https://portal.brain-map.org/](https://portal.brain-map.org/)
- **Single Cell Portal**: [https://singlecell.broadinstitute.org](https://singlecell.broadinstitute.org)
- **CellxGene**: [https://cellxgene.cziscience.com/](https://cellxgene.cziscience.com/)

---

<a id='homework'></a>
## 12. Homework Assignment

### Assignment: Exploring the PBMC 68k Dataset

**Due:** Before Lecture 3  
**Points:** 100

#### Objectives
- Practice loading and exploring scRNA-seq data independently
- Calculate and interpret QC metrics
- Visualize data distributions
- Compare datasets of different sizes

---

#### Task 1: Load PBMC 68k Dataset (20 points)

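Load the PBMC 68k dataset from 10X Genomics. Note that `sc.datasets.pbmc68k_reduced()` returns the subsampled, preprocessed version bundled with scanpy, not all ~68,000 cells: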

```python
# Load the reduced (subsampled, preprocessed) PBMC 68k dataset
adata_68k = sc.datasets.pbmc68k_reduced()
```

**Questions:**
1. How many cells and genes are in this dataset? (5 points)
2. What is the sparsity of the count matrix? (5 points)
3. Which metadata columns are already present in `adata_68k.obs`? (5 points)
4. Are there any precomputed embeddings in `adata_68k.obsm`? (5 points)
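
For question 2, sparsity is usually taken to mean the fraction of matrix entries that are zero. A minimal sketch on a toy matrix follows; the values are made up and `sparsity` is a hypothetical helper, not a scanpy function.

```python
import numpy as np

def sparsity(matrix):
    """Return the fraction of entries equal to zero in a dense 2-D array."""
    matrix = np.asarray(matrix)
    return 1.0 - np.count_nonzero(matrix) / matrix.size

# Toy count matrix: 3 cells x 4 genes
toy_counts = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 5],
    [0, 0, 0, 3],
])
print(f"Sparsity: {sparsity(toy_counts):.1%}")
```

If the matrix is stored in a SciPy sparse format, its `nnz` attribute (the number of stored entries) can take the place of `np.count_nonzero`.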

---

#### Task 2: Calculate QC Metrics (25 points)

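Calculate the same QC metrics we computed for PBMC 3k. Because this dataset ships preprocessed, first confirm that `adata_68k.X` contains raw counts; if it holds normalized or scaled values, `total_counts` will not be a UMI count: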

```python
# Identify mitochondrial genes
adata_68k.var['mt'] = adata_68k.var_names.str.startswith('MT-')

# Calculate QC metrics
sc.pp.calculate_qc_metrics(adata_68k, qc_vars=['mt'], inplace=True)
```

**Questions:**
1. What is the median total UMI count per cell? (5 points)
2. What is the median number of genes detected per cell? (5 points)
3. What is the median mitochondrial percentage? (5 points)
4. How do these values compare to the PBMC 3k dataset? (10 points)

---

#### Task 3: Visualize QC Metrics (25 points)

Create visualizations to explore the QC metrics:

1. **Violin plots** for `total_counts`, `n_genes_by_counts`, and `pct_counts_mt` (10 points)
2. **Scatter plot** of total counts vs genes detected, colored by mitochondrial % (10 points)
3. **Histogram** of mitochondrial percentage with a vertical line at 15% (5 points)

**Hint:** Use `sc.pl.violin()` and standard matplotlib functions.

---

#### Task 4: Marker Gene Analysis (20 points)

Analyze the expression of cell type markers:

```python
markers = ['CD3D', 'CD8A', 'CD4', 'CD79A', 'MS4A1', 'CD14', 'NKG7']
```

For each marker gene:
1. Calculate the percentage of cells expressing it (UMI > 0) (10 points)
2. Calculate the mean expression in expressing cells (10 points)

Present results in a table format.
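
As a hint, the two statistics can be illustrated on toy data. This is a sketch under assumptions, not a reference solution: `marker_stats` is a hypothetical helper, the counts are made up, and it expects a dense array, so a sparse `adata.X` would need converting first.

```python
import numpy as np

def marker_stats(expr, gene_names, markers):
    """Per-marker percent of expressing cells and mean among expressing cells.

    expr: dense array of shape (n_cells, n_genes); gene_names: column labels.
    Hypothetical helper for illustration only.
    """
    expr = np.asarray(expr, dtype=float)
    rows = []
    for gene in markers:
        values = expr[:, gene_names.index(gene)]
        expressing = values > 0
        pct = 100.0 * expressing.mean()
        # Mean is undefined when no cell expresses the gene
        mean_expr = values[expressing].mean() if expressing.any() else float("nan")
        rows.append({"gene": gene, "pct_expressing": pct, "mean_in_expressing": mean_expr})
    return rows

# Toy data: 4 cells x 3 genes (made-up counts)
genes = ["CD3D", "CD14", "NKG7"]
toy = np.array([
    [3, 0, 0],
    [1, 0, 2],
    [0, 4, 0],
    [0, 0, 0],
])
stats = marker_stats(toy, genes, ["CD3D", "NKG7"])
for row in stats:
    print(f"{row['gene']:<6} {row['pct_expressing']:5.1f}%  {row['mean_in_expressing']:.2f}")
```

The list of row dictionaries can be passed to a table library of your choice for the final presentation.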

---

#### Task 5: Reflection Questions (10 points)

Answer the following in 2-3 sentences each:

1. What are the main advantages of the PBMC 68k dataset compared to PBMC 3k? (5 points)
2. Based on the QC metrics, does this dataset appear to be high quality? Why or why not? (5 points)

---

### Submission Guidelines

**Format:** Submit a Jupyter notebook (.ipynb file) containing:
- All code cells with outputs
- Markdown cells with your answers
- All required visualizations

**File name:** `lecture02_homework_[YourLastName].ipynb`

**Grading Rubric:**
- **Code functionality** (50%): Code runs without errors and produces correct results
- **Visualizations** (25%): Clear, properly labeled plots
- **Interpretation** (15%): Thoughtful answers to questions
- **Code quality** (10%): Well-commented, organized code

---

**Good luck, and see you in Lecture 3!**