# Lecture 2: Introduction to Single-Cell Technology

**Course:** Single-Cell Neurogenomics  
**Date:** December 6, 2025  
**Estimated Time:** 60 minutes  

---

## Learning Objectives

By the end of this assignment, you will be able to:
- Explain the principles of single-cell genomics
- Describe the major single-cell sequencing technologies
- Analyze and visualize single-cell data structures
- Identify applications of single-cell analysis in biology and medicine

---

## Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity. Unlike bulk RNA-seq, which measures average expression across a pooled population of cells, scRNA-seq captures the transcriptome of each cell individually.

**Key Concepts:**
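- **Cell barcodes:** Short nucleotide sequences that label every transcript captured from the same cell, so each read can be assigned back to its cell of origin
- **UMIs (Unique Molecular Identifiers):** Random tags attached to individual RNA molecules before amplification, so PCR duplicates can be collapsed into a single count
- **AnnData:** The standard data structure for storing single-cell data in the Scanpy ecosystem
- **Sparsity:** In any given cell, most genes have zero detected counts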
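
To make the roles of cell barcodes and UMIs concrete, here is a minimal sketch of how reads could be collapsed into a count table. The barcode and UMI sequences are made up for illustration, and the logic is deliberately simplified: real pipelines also correct sequencing errors in barcodes and UMIs, which is not shown here.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell_barcode, umi, gene). Sequences are invented.
reads = [
    ("AAAC", "GGT", "CD3D"),
    ("AAAC", "GGT", "CD3D"),   # PCR duplicate: same cell, same UMI, same gene
    ("AAAC", "TCA", "CD3D"),   # a second CD3D molecule in the same cell
    ("AAAC", "GGT", "CST3"),
    ("TTTG", "GGT", "CST3"),   # same UMI in a different cell is a different molecule
    ("TTTG", "ACG", "CST3"),
    ("TTTG", "ACG", "CST3"),   # PCR duplicate
]

# Collect the distinct UMIs observed for each (cell, gene) pair.
umis_per_cell_gene = defaultdict(set)
for barcode, umi, gene in reads:
    umis_per_cell_gene[(barcode, gene)].add(umi)

# The count for a cell/gene pair is the number of distinct UMIs.
counts = {key: len(umis) for key, umis in umis_per_cell_gene.items()}

for (barcode, gene), n in sorted(counts.items()):
    print(f"cell {barcode}  gene {gene}  molecules {n}")
```

Seven reads collapse to five molecules here: the barcode decides which cell (row) a read belongs to, and the UMI decides whether it is a new molecule or a copy of one already counted.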

---

## Setup

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
import anndata as ad
from scipy.sparse import csr_matrix

# Set scanpy settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(6, 6))

print("Libraries imported successfully!")
print(f"Scanpy version: {sc.__version__}")
print(f"AnnData version: {ad.__version__}")

---

## Task 1: Understanding Single-Cell Data Structure (20 points)

### Background
Single-cell data is typically stored in an **AnnData** object, which contains:
- `X`: Gene expression matrix (cells × genes)
- `obs`: Cell metadata (observations)
- `var`: Gene metadata (variables)
- `uns`: Unstructured annotations

### Instructions
1. Create a simulated single-cell dataset with 50 cells and 20 genes
2. Make the expression matrix sparse (most values should be zero)
3. Add cell metadata (cell_type, batch) and gene metadata (gene_name, highly_variable)
4. Create an AnnData object and display its structure
5. Print summary information about the dataset

### Hints
- Use `np.random.poisson()` for count data
- Sparsify by setting random positions to zero
- Use `ad.AnnData()` to create the object
- Access attributes with `.X`, `.obs`, `.var`
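
The first two hints can be sketched on a small matrix. The sizes, the Poisson rate, the keep probability of 0.3, and the seed are all arbitrary choices for this illustration, not values prescribed by the assignment.

```python
import numpy as np
from scipy.sparse import csr_matrix

np.random.seed(0)  # arbitrary seed, only to make the sketch reproducible

n_cells, n_genes = 6, 4   # deliberately smaller than the task's 50 x 20
counts = np.random.poisson(lam=2.0, size=(n_cells, n_genes))

# Keep each entry with probability 0.3; every other position is set to zero.
keep = np.random.rand(n_cells, n_genes) < 0.3
sparse_counts = counts * keep

# A CSR matrix stores only the non-zero entries.
X = csr_matrix(sparse_counts)

print(X.shape)
print(f"stored non-zeros: {X.nnz} of {n_cells * n_genes}")
```

Masking with a boolean array is one simple way to introduce zeros; the resulting matrix can be passed to `ad.AnnData()` either as a dense array or in the sparse form shown here.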

In [None]:
# TODO: Create simulated single-cell data
np.random.seed(42)

# Create expression matrix (50 cells x 20 genes)
# Use Poisson distribution and make it sparse


# Create cell metadata


# Create gene metadata


# Create AnnData object


# Display structure


# Print summary


**Expected Output:** 
- AnnData object with 50 cells and 20 genes
- Summary showing sparsity and data structure
- Metadata tables displayed

---

## Task 2: Exploring Data Sparsity (20 points)

### Background
Single-cell data is extremely sparse: most genes are not detected in most cells. Understanding sparsity is crucial for downstream analysis because it determines how the matrix should be stored and how zeros should be interpreted.

### Instructions
1. Calculate the sparsity of your dataset (percentage of zeros)
2. Visualize the distribution of the number of detected (non-zero) genes per cell
3. Calculate and plot the number of cells expressing each gene
4. Create a heatmap showing the expression matrix

### Hints
- Sparsity = (number of zeros) / (total elements)
- Use `np.sum(X > 0, axis=...)` to count non-zero values
- Create histograms with `plt.hist()`
- Use `sns.heatmap()` for visualization
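
As a worked instance of the sparsity formula and the `np.sum(X > 0, axis=...)` hint, the sketch below uses a hand-written 4 × 4 matrix (values invented) so that every number can be checked by eye.

```python
import numpy as np

# Rows are cells, columns are genes. Values are made up.
X = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 3, 0, 1],
])

# Sparsity = (number of zeros) / (total elements)
n_zeros = np.sum(X == 0)
sparsity = n_zeros / X.size
print(f"sparsity: {sparsity:.2%}")           # 11 of 16 entries are zero -> 68.75%

# axis=1 collapses the gene axis: one value per cell.
genes_per_cell = np.sum(X > 0, axis=1)
print("genes per cell:", genes_per_cell)     # [1 1 1 2]

# axis=0 collapses the cell axis: one value per gene.
cells_per_gene = np.sum(X > 0, axis=0)
print("cells per gene:", cells_per_gene)     # [1 2 0 2]
```

The `axis` argument names the dimension that is summed away, which is why `axis=1` yields per-cell values and `axis=0` yields per-gene values.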

In [None]:
# TODO: Analyze data sparsity

# Calculate sparsity


# Plot distribution of non-zero genes per cell


# Plot number of cells expressing each gene


# Create heatmap of expression matrix


**Expected Output:** 
- Sparsity percentage
- Histogram of detected genes per cell
- Bar plot of cells per gene
- Heatmap showing sparse structure

---

## Task 3: Working with Real Single-Cell Data (25 points)

### Background
Scanpy provides several built-in datasets for learning. The PBMC3k dataset contains about 2,700 peripheral blood mononuclear cells (PBMCs); the "3k" in the name is a rounded figure.

### Instructions
1. Load the PBMC 3k dataset using `sc.datasets.pbmc3k()`
2. Display basic information about the dataset
3. Show the first few rows of cell and gene metadata
4. Calculate basic statistics: total counts per cell, genes per cell
5. Identify the top 10 most highly expressed genes (for example, ranked by total counts across all cells)

### Hints
- Use `sc.datasets.pbmc3k()` to load data
- `.obs` contains cell metadata
- `.var` contains gene metadata
- Use `.sum(axis=...)` for calculations; if `.X` is a sparse matrix, wrap the result in `np.asarray(...).ravel()` to get a 1-D array
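
The per-cell and per-gene statistics can be sketched on a small sparse matrix. The gene names and counts are hypothetical, and ranking genes by total counts is only one reasonable reading of "most highly expressed"; mean expression or the fraction of expressing cells would be alternatives.

```python
import numpy as np
from scipy.sparse import csr_matrix

gene_names = np.array(["GeneA", "GeneB", "GeneC", "GeneD"])  # hypothetical names
X = csr_matrix(np.array([
    [0, 4, 0, 1],
    [2, 6, 0, 0],
    [0, 5, 0, 0],
]))

# Summing a csr_matrix along an axis gives a 2-D result; flatten it to 1-D.
total_counts = np.asarray(X.sum(axis=1)).ravel()          # per cell: [5 8 5]
genes_per_cell = np.asarray((X > 0).sum(axis=1)).ravel()  # per cell: [2 2 1]
gene_totals = np.asarray(X.sum(axis=0)).ravel()           # per gene: [2 15 0 1]

# Indices of the genes with the largest totals, highest first.
top = np.argsort(gene_totals)[::-1][:2]
for name, total in zip(gene_names[top], gene_totals[top]):
    print(name, total)
```

For the real dataset the same pattern applies with `adata.X` in place of `X` and `adata.var_names` in place of `gene_names`, taking the first 10 indices instead of 2.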

In [None]:
# TODO: Load and explore PBMC3k dataset

# Load dataset


# Display basic info


# Show metadata


# Calculate statistics


# Find top 10 highly expressed genes


**Expected Output:** 
- Dataset dimensions and structure
- Metadata tables
- Statistics about counts and genes
- List of top 10 expressed genes

---

## Task 4: Quality Control Metrics (20 points)

### Background
Before analyzing single-cell data, we need to compute quality control (QC) metrics to identify low-quality cells.

### Instructions
1. Calculate QC metrics using `sc.pp.calculate_qc_metrics()`
2. Visualize the distribution of:
   - Total counts per cell
   - Number of genes per cell
   - Percentage of counts from mitochondrial genes (genes starting with 'MT-')
3. Create violin plots for these metrics
4. Identify potential thresholds for filtering low-quality cells

### Hints
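- Flag mitochondrial genes in a boolean column: `adata.var['mt'] = adata.var_names.str.startswith('MT-')`
- Use `sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)`; `'mt'` must match the column name in `adata.var`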
- Create violin plots with `sc.pl.violin()`
- Use histograms to visualize distributions
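
To show what the mitochondrial percentage actually measures, the sketch below computes it by hand on a toy matrix. The counts are invented, and the 50% cutoff is purely illustrative, not a recommended threshold; suitable thresholds depend on the tissue and protocol and should be read off the plotted distributions.

```python
import numpy as np
import pandas as pd

gene_names = pd.Index(["MT-CO1", "MT-ND1", "CD3D", "CST3"])
X = np.array([
    [1, 1, 6, 2],     # 2 of 10 counts are mitochondrial
    [10, 5, 3, 2],    # 15 of 20
    [0, 0, 4, 4],     # 0 of 8
])

# Boolean mask over genes, using the same prefix test as the hint above.
is_mt = gene_names.str.startswith("MT-")

total_counts = X.sum(axis=1)
mt_counts = X[:, is_mt].sum(axis=1)
pct_mt = 100 * mt_counts / total_counts
print(pct_mt)   # [20. 75.  0.]

# Illustrative cutoff only: flag cells dominated by mitochondrial reads.
flagged = pct_mt > 50
print("flagged cells:", np.flatnonzero(flagged))
```

With `qc_vars=['mt']`, `sc.pp.calculate_qc_metrics()` stores the corresponding per-cell values in `adata.obs` as `total_counts`, `n_genes_by_counts`, and `pct_counts_mt`.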

In [None]:
# TODO: Calculate and visualize QC metrics

# Identify mitochondrial genes


# Calculate QC metrics


# Create violin plots


# Suggest filtering thresholds


**Expected Output:** 
- QC metrics calculated and stored in `.obs`
- Violin plots showing distributions
- Suggested thresholds for filtering

---

## Task 5: Comparing Bulk vs Single-Cell Data (15 points)

### Background
Understanding the differences between bulk and single-cell approaches is fundamental. Single-cell data reveals heterogeneity masked in bulk measurements.

### Instructions
1. Select 3 genes of interest from the PBMC dataset (e.g., CD3D, CD79A, CST3)
2. For each gene, create:
   - A histogram showing expression across all cells (single-cell view)
   - Calculate the mean expression (what bulk would show)
3. Create violin plots showing expression distribution by cell type (if available)
4. Discuss why single-cell resolution is important

### Hints
- Access gene expression: `adata[:, 'gene_name'].X` (if the result is sparse, convert it with `.toarray().ravel()` before plotting)
- Create subplots for multiple genes
- Use different colors for different genes
- Add vertical lines to show mean (bulk) values
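
Why a mean can mislead is easiest to see with a gene that is either on or off. The sketch below uses invented counts for one hypothetical marker gene in ten cells.

```python
import numpy as np

# Hypothetical counts for one marker gene: two expressing cells, eight silent.
expression = np.array([0, 0, 10, 0, 0, 0, 10, 0, 0, 0])

# What a bulk-like average over all cells would report.
bulk_like_mean = expression.mean()

# What single-cell resolution adds: how many cells express, and at what level.
fraction_expressing = np.mean(expression > 0)
mean_in_expressing = expression[expression > 0].mean()

print(f"bulk-like mean:            {bulk_like_mean}")        # 2.0
print(f"fraction of cells > 0:     {fraction_expressing}")   # 0.2
print(f"mean in expressing cells:  {mean_in_expressing}")    # 10.0
```

No cell in this toy example has an expression of 2; the average describes a cell that does not exist. A vertical line at the mean on the histogram makes the same point visually.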

In [None]:
# TODO: Compare single-cell vs bulk perspectives

# Select genes of interest


# Create visualizations


# Calculate and display bulk (mean) values


# Discuss importance of single-cell resolution


**Expected Output:** 
- Histograms showing cell-to-cell variability
- Mean values representing bulk measurements
- Clear visualization of heterogeneity
- Written discussion of single-cell advantages

---

## Reflection Questions (Bonus: 10 points)

Answer the following questions:

1. **Question 1:** What are the main technical challenges in single-cell RNA sequencing compared to bulk RNA-seq?

2. **Question 2:** Why is data sparsity (dropout) a major characteristic of scRNA-seq data? What causes it?

3. **Question 3:** How do cell barcodes and UMIs work together to enable single-cell sequencing?

4. **Question 4:** In what biological scenarios would single-cell analysis provide critical insights that bulk analysis would miss?

**Your Answers:**

1. [Your answer here]

2. [Your answer here]

3. [Your answer here]

4. [Your answer here]

---

## Submission Guidelines

1. Complete all tasks with proper code and outputs
2. Ensure all visualizations are clear and properly labeled
3. Answer reflection questions thoroughly
4. Save your notebook with outputs included
5. Submit the completed notebook file

---

## Grading Rubric

| Component | Points | Criteria |
|-----------|--------|----------|
| Task 1 | 20 | AnnData object created correctly with metadata |
| Task 2 | 20 | Sparsity analysis with appropriate visualizations |
| Task 3 | 25 | Successful loading and exploration of PBMC data |
| Task 4 | 20 | QC metrics calculated and visualized |
| Task 5 | 15 | Meaningful comparison between bulk and single-cell |
| Reflection (bonus) | 10 | Thoughtful answers demonstrating understanding |
| **Total** | **100 + 10 bonus** | |

---

## Additional Resources

- Scanpy tutorials: https://scanpy.readthedocs.io/
- AnnData documentation: https://anndata.readthedocs.io/
- 10x Genomics protocols: https://www.10xgenomics.com/
- Single Cell Portal: https://singlecell.broadinstitute.org/