# Lab 1: Explore a scRNA-seq Dataset Structure

**Module:** 1 - Introduction to scRNA-seq Processing  
**Duration:** 30-45 minutes

## Objectives
- Understand the file structure of a processed scRNA-seq dataset
- Load and inspect a count matrix
- Explore basic dataset dimensions and properties


In [None]:
import scanpy as sc
import pandas as pd
import numpy as np

# Verbosity 3 prints hints in addition to errors, warnings, and info messages
sc.settings.verbosity = 3


## 1. Load a Demo Dataset

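We'll use scanpy's built-in PBMC 3k dataset (about 2,700 peripheral blood mononuclear cells from 10x Genomics) to explore the structure. The data is downloaded on first use and cached locally.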


In [None]:
# Load the PBMC 3k dataset (raw counts; the fully processed version is sc.datasets.pbmc3k_processed())
adata = sc.datasets.pbmc3k()
adata


## 2. Understand the AnnData Object

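The AnnData object is the standard container for single-cell data in scanpy: `adata.X` holds the expression matrix (cells × genes), `adata.obs` holds per-cell metadata, and `adata.var` holds per-gene metadata.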
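
To make the alignment between the matrix and its metadata tables concrete, the sketch below mimics the layout with plain NumPy and pandas. It is not the AnnData API itself, and the barcodes, gene names, and counts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 cells (rows) x 4 genes (columns)
X = np.array([
    [0, 2, 0, 1],
    [5, 0, 0, 0],
    [0, 0, 3, 0],
])

# One row of metadata per cell, indexed by (made-up) barcodes
obs = pd.DataFrame(index=["cell_A", "cell_B", "cell_C"])

# One row of metadata per gene, indexed by (made-up) gene names
var = pd.DataFrame(index=["gene_1", "gene_2", "gene_3", "gene_4"])

# The invariant the container maintains: rows match obs, columns match var
assert X.shape == (len(obs), len(var))
print(f"{X.shape[0]} cells x {X.shape[1]} genes")
```

In the real object, `adata.n_obs` and `adata.n_vars` report these same two dimensions.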


In [None]:
# How many cells?
print(f"Number of cells: {adata.n_obs}")

# How many genes?
print(f"Number of genes: {adata.n_vars}")

# Matrix shape
print(f"Matrix shape: {adata.X.shape}")


In [None]:
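# Explore cell metadata (observations); a freshly loaded raw dataset may show
# only the barcode index here until QC metrics are added in Section 4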
adata.obs.head()


In [None]:
# Explore gene metadata (variables)
adata.var.head()


## 3. Inspect the Count Matrix


In [None]:
# What does the matrix look like?
print(f"Matrix type: {type(adata.X)}")

# Check sparsity
total_entries = adata.X.shape[0] * adata.X.shape[1]
nonzero_entries = adata.X.nnz if hasattr(adata.X, 'nnz') else np.count_nonzero(adata.X)
sparsity = 1 - (nonzero_entries / total_entries)
print(f"Sparsity: {sparsity:.2%}")


In [None]:
# Look at a small slice of the matrix (first 5 cells, first 10 genes)
if hasattr(adata.X, 'toarray'):
    print(adata.X[:5, :10].toarray())
else:
    print(adata.X[:5, :10])
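
The sparsity calculation above can be checked by hand on a tiny matrix. The following sketch uses a hypothetical 3 × 4 count matrix (values invented for illustration) and also shows the idea behind sparse storage: keeping only the nonzero entries as (row, column, value) triplets instead of every cell of the grid.

```python
import numpy as np

# Hypothetical toy counts: 3 cells x 4 genes, mostly zeros
dense = np.array([
    [0, 2, 0, 1],
    [5, 0, 0, 0],
    [0, 0, 3, 0],
])

# Same formula as in the lab: fraction of entries that are zero
nonzero = np.count_nonzero(dense)
sparsity = 1 - nonzero / dense.size
print(f"Nonzero entries: {nonzero} of {dense.size}")
print(f"Sparsity: {sparsity:.2%}")

# Coordinate-style view: store only where the nonzero values are
rows, cols = np.nonzero(dense)
values = dense[rows, cols]
triplets = list(zip(rows.tolist(), cols.tolist(), values.tolist()))
print(triplets)
```

With only 4 of 12 entries stored, the triplet view is already smaller than the dense grid; the saving grows with the fraction of zeros in the matrix.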


## 4. Basic Statistics


In [None]:
# Calculate basic QC metrics
sc.pp.calculate_qc_metrics(adata, inplace=True)

# Now obs has QC columns
adata.obs[['n_genes_by_counts', 'total_counts']].describe()
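
The two columns summarized above have simple definitions: `n_genes_by_counts` is the number of genes with at least one count in a cell, and `total_counts` is the sum of all counts in that cell. As a minimal sketch, assuming a hypothetical dense toy matrix rather than the PBMC data, they can be reproduced with NumPy:

```python
import numpy as np

# Hypothetical toy counts: 3 cells (rows) x 4 genes (columns)
counts = np.array([
    [0, 2, 0, 1],
    [5, 0, 0, 0],
    [0, 0, 3, 0],
])

# Genes detected per cell: how many columns are nonzero in each row
n_genes_by_counts = (counts > 0).sum(axis=1)

# Total counts per cell: sum across all genes in each row
total_counts = counts.sum(axis=1)

for i, (n_genes, total) in enumerate(zip(n_genes_by_counts, total_counts)):
    print(f"cell {i}: {n_genes} genes detected, {total} total counts")
```

Note that the second cell has the highest total count but only one detected gene, which is why the two metrics are reported separately.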


## Exercise Questions

1. What is the median number of genes detected per cell?
2. What is the median total UMI count per cell?
3. Why is the matrix sparse? What does this mean biologically?
4. How would you access the expression of gene 'CD3D' for all cells?


In [None]:
# Your answers here

# Q1: Median genes per cell


# Q2: Median UMIs per cell


# Q3: Why is the matrix sparse? (written answer)


# Q4: Access CD3D expression

