# Lab 4: Reference Preparation (Genome + Annotation)

## Objectives
- Understand what a reference is: a genome sequence (FASTA) plus a gene annotation (GTF)
- Learn how version mismatches between genome and annotation break processing
- Document reference provenance (source, release, checksum)

## Outputs
- `results/reference_notes.md`

---

## Important Principle
**Never** process data against a reference without recording:
- species
- genome build / release
- annotation release
- source URL
- date downloaded

This lab is written as a checklist, followed by commands you can copy into a terminal.


## A) Choose reference source

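Pick ONE source and use it for both the FASTA and the GTF (sources differ in sequence naming, e.g. `1` in Ensembl vs `chr1` in GENCODE):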
- Ensembl (common)
- GENCODE (human/mouse common)
- RefSeq (less common for scRNA-seq pipelines)

## B) Download files

You need two files, saved under these names:
- `genome.fa.gz` (genome sequence, FASTA)
- `genes.gtf.gz` (gene annotation, GTF)

## C) Validate

- Confirm both files are for the species you sequenced
- Confirm FASTA and GTF releases match (same release/build)
- Compute SHA-256 checksums for both files
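
One practical way to catch a FASTA/GTF mismatch is to check that every sequence name used in the GTF also exists in the FASTA. The sketch below is a minimal illustration, not part of any tool: `missing_in_fasta` is a hypothetical helper name, and it assumes both files are gzip-compressed and that the GTF is tab-separated with `#` comment lines.

```bash
# Hypothetical helper: print GTF sequence names that are absent from the FASTA.
# Usage: missing_in_fasta genome.fa.gz genes.gtf.gz
# Empty output means every annotated sequence exists in the genome.
missing_in_fasta() {
  {
    # FASTA names: text after ">" up to the first whitespace, tagged "F"
    gzip -dc "$1" | awk '/^>/ { sub(/^>/, ""); print "F\t" $1 }'
    # GTF names: column 1 of every non-comment line, tagged "G"
    gzip -dc "$2" | awk -F'\t' '!/^#/ && NF > 1 { print "G\t" $1 }'
  } | awk -F'\t' '
    $1 == "F" { seen[$2] = 1; next }                  # remember FASTA names
    !($2 in seen) && !reported[$2]++ { print $2 }     # report each missing name once
  '
}
```

If this prints names such as `chr1` while the FASTA uses `1`, the two files most likely come from different sources or builds. Decompressing a whole genome can take a minute or so.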

## D) Record in `results/reference_notes.md`

Include:
- source URLs for both files
- download dates
- SHA-256 checksums
- tool that will consume the reference (Cell Ranger / STARsolo / kb)
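
The record can be written by hand, but a small helper keeps the fields consistent. This is only a sketch: `write_reference_notes` is a hypothetical function, the field labels are one possible layout, and it assumes you run it on the day of the download (it stamps today's date) and from the project root.

```bash
# Hypothetical helper: append one provenance record to a notes file.
# Arguments: notes_file species build annotation_release fasta_url gtf_url tool
write_reference_notes() {
  notes_file=$1
  mkdir -p "$(dirname "$notes_file")"
  # Unquoted EOF so that the arguments and the date are expanded
  cat >> "$notes_file" <<EOF
## Reference record

- Species: $2
- Genome build / release: $3
- Annotation release: $4
- FASTA URL: $5
- GTF URL: $6
- Date downloaded: $(date +%Y-%m-%d)
- Consuming tool: $7
- Checksums: see data/reference/checksums.sha256
EOF
}
```

Example call, with placeholders to replace with your real values:

`write_reference_notes results/reference_notes.md "<SPECIES>" "<BUILD>" "<ANNOTATION_RELEASE>" "<FASTA_URL>" "<GTF_URL>" "<TOOL>"`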

---

## Terminal commands (examples)

```bash
mkdir -p data/reference && cd data/reference

# Example placeholders (replace with real URLs)
# wget -O genome.fa.gz "<FASTA_URL>"
# wget -O genes.gtf.gz "<GTF_URL>"

# checksums
sha256sum genome.fa.gz genes.gtf.gz > checksums.sha256

# peek at headers (gzip -dc is equivalent to zcat and more portable, e.g. on macOS)
gzip -dc genes.gtf.gz | head
gzip -dc genome.fa.gz | head
```
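
The recorded checksums are only useful if you re-check them later, for example after copying the reference to another machine. A minimal sketch, assuming GNU `sha256sum` is available and that `checksums.sha256` sits next to the files it describes; `verify_reference` is a hypothetical helper name.

```bash
# Hypothetical helper: re-check reference files against the recorded checksums.
# Usage: verify_reference data/reference
# Returns non-zero if any file is missing or has changed.
verify_reference() {
  # Subshell so the caller's working directory is left untouched
  ( cd "$1" && sha256sum -c checksums.sha256 )
}
```

`sha256sum -c` prints one `OK` or `FAILED` line per file, so a changed or truncated download is easy to spot.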

