<a href="https://colab.research.google.com/github/joannafernandez/cnt_MSci/blob/main/L4_bowtie2_samtools_worksheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Read Alignment with Bowtie2 (Toy Genome)

In this worksheet you will learn:
- What a reference genome index is
- How to build a Bowtie2 index
- How paired-end FASTQ files are aligned
- Common Bowtie2 parameters
- How to inspect alignment output

This notebook uses a **toy genome and tiny FASTQ files** so everything runs quickly.


## Install required tools

We will install:
- bowtie2
- samtools


In [None]:
!apt-get update -qq
!apt-get install -y bowtie2 samtools

## Create a working directory

In [None]:
%%bash
set -e
cd ~
mkdir -p bowtie2_demo/{genome,reads,alignments}
cd bowtie2_demo
pwd

## Create a toy reference genome

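This is a **very small artificial genome** (196 bp: the same 49-base line repeated four times) that stands in for a chromosome.
Because the sequence is so repetitive, most short reads match it at more than one position.
In real projects, the reference would be a full genome assembly in FASTA format.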

In [None]:
%%bash
set -e
cd ~/bowtie2_demo/genome

cat > toy_genome.fa << 'EOF'
>chrToy
ATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACTGACTGACTGACTGAC
ATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACTGACTGACTGACTGAC
ATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACTGACTGACTGACTGAC
ATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACTGACTGACTGACTGAC
EOF

cat toy_genome.fa
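
To check the size of a reference before indexing it, the FASTA file can be parsed in a few lines of Python. The following is a minimal sketch, not part of the original worksheet: `read_fasta` is a hypothetical helper, and the FASTA text is rebuilt inline (rather than read from `toy_genome.fa`) so the cell runs on its own.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {sequence_name: sequence}."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after '>'
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line.upper())
    return {n: "".join(parts) for n, parts in sequences.items()}


# Same content as toy_genome.fa: one 49-base line repeated four times
TOY_LINE = "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACTGACTGACTGACTGAC"
toy_fasta = ">chrToy\n" + "\n".join([TOY_LINE] * 4) + "\n"

genome = read_fasta(toy_fasta)
for name, seq in genome.items():
    print(name, len(seq), "bp")  # prints: chrToy 196 bp
```

To run it on the real file instead, pass `open("toy_genome.fa").read()` to `read_fasta`.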


## Build the Bowtie2 index

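Bowtie2 does not search the FASTA file directly. It requires a pre-built index of the reference genome (an FM index based on the Burrows-Wheeler transform), which lets it look up read sequences quickly.
`bowtie2-build` writes the index as six `.bt2` files that share the prefix you choose (here `toy_index`).
This step only needs to be done **once per genome**; the same index is reused for every alignment run.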
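
Bowtie2's index is an FM index built on the Burrows-Wheeler transform (BWT). The sketch below illustrates the idea on a toy scale, assuming a naive construction by sorting all rotations; it is not how `bowtie2-build` is implemented, and `bwt` and `count_occurrences` are hypothetical helper names.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text = text + "$"  # '$' sorts before all letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(text, pattern):
    """Count exact matches of pattern in text by backward search on the BWT."""
    last_column = bwt(text)
    # first_row[c] = number of characters in the text that sort before c
    first_row, total = {}, 0
    for c in sorted(set(last_column)):
        first_row[c] = total
        total += last_column.count(c)
    lo, hi = 0, len(last_column)
    # Extend the match one character at a time, from the pattern's last base
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + last_column[:lo].count(c)
        hi = first_row[c] + last_column[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


print(bwt("banana"))                       # prints: annb$aa
print(count_occurrences("banana", "ana"))  # prints: 2
```

The search never scans the text itself: it only narrows a range of rows in the sorted rotation table, which is why a lookup stays fast even for a large genome.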


In [None]:
%%bash
set -e
cd ~/bowtie2_demo/genome

bowtie2-build toy_genome.fa toy_index

ls -lh


## Create paired-end FASTQ files

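We will create two tiny FASTQ files, one per mate (`reads_R1.fastq` and `reads_R2.fastq`):
- Each record has four lines: read name, sequence, a `+` separator, and per-base quality scores
- The reads match the toy genome exactly; mismatches are introduced later, in the exercises

In a real experiment, FASTQ files come from a sequencer.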


In [None]:
%%bash
set -e
cd ~/bowtie2_demo/reads

cat > reads_R1.fastq << 'EOF'
@read1/1
ATGCGTACGTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIII
@read2/1
CTAGCTAGCTGACTGACTG
+
IIIIIIIIIIIIIIIIIII
EOF

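# Mate 2 is read from the opposite strand, so each R2 sequence is the
# reverse complement of the genome (Bowtie2's default --fr orientation)
cat > reads_R2.fastq << 'EOF'
@read1/2
TCAGTCAGCTAGCTAGCTA
+
IIIIIIIIIIIIIIIIIII
@read2/2
GTCAGTCAGTCAGTCAGTC
+
IIIIIIIIIIIIIIIIIII
EOF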

ls -lh
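
Each FASTQ record spans four lines, and the quality line encodes one Phred score per base as an ASCII character (score = ASCII code minus 33 in the standard encoding). The sketch below shows one way to read such records; `parse_fastq` and `phred_scores` are hypothetical helper names, and the record is pasted inline so the cell is self-contained.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality_string) for each 4-line FASTQ record."""
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


example = "@read1/1\nATGCGTACGTAGCTAGCTA\n+\nIIIIIIIIIIIIIIIIIII\n"
for name, seq, qual in parse_fastq(example):
    scores = phred_scores(qual)
    # A Phred score Q corresponds to an error probability of 10^(-Q/10)
    print(name, len(seq), "bases, min quality", min(scores))
```

The character `I` decodes to Q40, an error probability of 1 in 10,000 per base, so the toy reads are declared to be very high quality.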


## Align paired-end reads with Bowtie2

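Key arguments:
- `-x` : index prefix (the path to the index files, without the `.bt2` suffixes)
- `-1` : FASTQ file containing the first mates (R1)
- `-2` : FASTQ file containing the second mates (R2), in the same read order as R1
- `-S` : output SAM file (if omitted, alignments are written to standard output)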
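
By default Bowtie2 expects paired-end reads in forward/reverse (`--fr`) orientation: mate 1 matches the forward strand at one end of the fragment, and mate 2 is the reverse complement of the other end. The sketch below shows how such a pair arises from a fragment. `simulate_pair` is a hypothetical helper, and the 36-base fragment is an arbitrary choice taken from the start of the toy genome.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pair(fragment, read_length):
    """Return (mate1, mate2) for a fragment sequenced from both ends."""
    mate1 = fragment[:read_length]                        # forward strand, left end
    mate2 = reverse_complement(fragment[-read_length:])   # reverse strand, right end
    return mate1, mate2


TOY_LINE = "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACTGACTGACTGACTGAC"
fragment = TOY_LINE[:36]  # assumed fragment: first 36 bases of the toy genome
mate1, mate2 = simulate_pair(fragment, 19)
print(mate1)  # prints: ATGCGTACGTAGCTAGCTA
print(mate2)  # prints: TCAGTCAGCTAGCTAGCTA
```

If both mates of a pair match the same strand, Bowtie2 does not treat the pair as concordant unless the orientation option is changed (for example with `--ff`).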


In [None]:
%%bash
set -e
cd ~/bowtie2_demo

bowtie2 \
  -x genome/toy_index \
  -1 reads/reads_R1.fastq \
  -2 reads/reads_R2.fastq \
  -S alignments/aligned.sam
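
When the same alignment has to be repeated with different options, it can help to assemble the command in Python. The following is a minimal sketch under stated assumptions: `bowtie2_command` is a hypothetical helper that only builds the argument list from flags used in this worksheet, and it does not run Bowtie2 itself.

```python
import shlex

PRESETS = {"very-fast", "fast", "sensitive", "very-sensitive"}


def bowtie2_command(index_prefix, r1, r2, sam_out,
                    threads=1, preset=None, local=False, no_unal=False):
    """Build a bowtie2 paired-end command as a list of arguments."""
    cmd = ["bowtie2", "-x", index_prefix, "-1", r1, "-2", r2,
           "-S", sam_out, "-p", str(threads)]
    if preset is not None:
        if preset not in PRESETS:
            raise ValueError(f"unknown preset: {preset}")
        cmd.append(f"--{preset}")
    if local:
        cmd.append("--local")
    if no_unal:
        cmd.append("--no-unal")
    return cmd


cmd = bowtie2_command("genome/toy_index",
                      "reads/reads_R1.fastq", "reads/reads_R2.fastq",
                      "alignments/aligned.sam", preset="very-sensitive")
print(shlex.join(cmd))  # the command as it would be typed in a shell
```

To execute the command from `~/bowtie2_demo`, pass the list to `subprocess.run(cmd, check=True)`.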


## Inspect the SAM file

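SAM files are tab-delimited, human-readable text.
Header lines start with `@`; every other line describes one read, including its reference position, mapping quality (MAPQ), and CIGAR string.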


In [None]:
!head -n 20 ~/bowtie2_demo/alignments/aligned.sam
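
Every alignment line has 11 mandatory tab-separated fields, and the FLAG field packs several yes/no properties into a single integer. The sketch below splits a line into named fields and decodes the flag. The example line is hand-written for illustration and is not actual output from this worksheet; `parse_sam_line` and `decode_flag` are hypothetical helper names.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit values defined by the SAM specification
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one alignment line into a dict of the 11 mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]  # optional fields such as NM:i:0
    return record


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Hand-written illustrative line, NOT real Bowtie2 output
line = "\t".join(["read1", "99", "chrToy", "1", "1", "19M", "=", "18", "36",
                  "ATGCGTACGTAGCTAGCTA", "IIIIIIIIIIIIIIIIIII"])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"])
print(decode_flag(record["FLAG"]))
# prints: ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```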


## Convert SAM to BAM

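BAM is the compressed binary version of SAM.
This is the standard format for downstream analysis. Most tools expect a BAM file that is sorted by coordinate and indexed, so we also sort the file and create a `.bai` index.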


In [None]:
%%bash
set -e
cd ~/bowtie2_demo/alignments

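# Convert SAM to BAM (-b = BAM output; the input format is detected automatically)
samtools view -b -o aligned.bam aligned.sam
# Sort alignments by reference coordinate, which indexing requires
samtools sort -o aligned.sorted.bam aligned.bam
# Create the index file aligned.sorted.bam.bai
samtools index aligned.sorted.bam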

ls -lh
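
Because BAM is binary, `head` no longer shows readable text. A BAM file is stored in BGZF, a gzip-compatible block format, and its decompressed content begins with the four bytes `BAM\1`. The sketch below uses that to check whether a file looks like a BAM; `looks_like_bam` is a hypothetical helper, not a samtools feature.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file is gzip-compressed and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip data, unreadable, or truncated
        return False
```

For example, `looks_like_bam("aligned.sorted.bam")` should return `True` after the conversion above, while `looks_like_bam("aligned.sam")` should return `False`. To read the alignments themselves, `samtools view aligned.sorted.bam` converts them back to text.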


## Alignment statistics

`samtools flagstat` counts reads by their SAM flags, giving a quick summary of how many reads mapped and how many were properly paired.


In [None]:
!samtools flagstat ~/bowtie2_demo/alignments/aligned.sorted.bam
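
The counts reported by `flagstat` come directly from the FLAG field of each record. The sketch below reproduces a few of them in Python to show the logic. It is a simplified illustration: `summarize_flags` is a hypothetical helper, and the list of flags is made up rather than taken from this worksheet's output.

```python
def summarize_flags(flags):
    """Count primary, mapped, paired and properly paired reads from SAM flags."""
    # Ignore secondary (0x100) and supplementary (0x800) records
    primary = [f for f in flags if not f & (0x100 | 0x800)]
    mapped = [f for f in primary if not f & 0x4]
    paired = [f for f in primary if f & 0x1]
    # Properly paired: paired and proper-pair bits set, read itself mapped
    proper = [f for f in paired if f & 0x2 and not f & 0x4]
    return {
        "primary": len(primary),
        "mapped": len(mapped),
        "paired": len(paired),
        "properly_paired": len(proper),
        "mapped_fraction": len(mapped) / len(primary) if primary else 0.0,
    }


# Made-up flags: one properly paired pair (99, 147) and one unmapped pair (77, 141)
example_flags = [99, 147, 77, 141]
print(summarize_flags(example_flags))
```

With these four flags, two of the four primary reads are mapped and properly paired, so the mapped fraction is 0.5.
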

## Common Bowtie2 parameters

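- `--very-fast` / `--very-sensitive`
  - Presets that trade speed against sensitivity (the default is `--sensitive`)
- `-p N`
  - Number of CPU threads to use
- `--no-unal`
  - Do not write SAM records for reads that failed to align
- `-k N`
  - Report up to N distinct alignments per read instead of only the best one
- `--end-to-end` (default)
  - The whole read must align, from its first base to its last
- `--local`
  - Read ends may be soft-clipped (`S` in the CIGAR string) when that improves the alignment score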
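
The difference between `--end-to-end` and `--local` shows up in the CIGAR column of the SAM file: soft-clipped bases appear as `S` operations. The sketch below parses a CIGAR string and counts clipped bases. The helper names are hypothetical, and the CIGAR strings are illustrative values rather than output from this worksheet.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if cigar == "*":  # '*' means no alignment information
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar}")
    return ops


def soft_clipped_bases(cigar):
    """Number of read bases excluded from the alignment by soft clipping."""
    return sum(n for n, op in parse_cigar(cigar) if op == "S")


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(parse_cigar("3S16M"))         # prints: [(3, 'S'), (16, 'M')]
print(soft_clipped_bases("3S16M"))  # prints: 3
print(soft_clipped_bases("19M"))    # prints: 0
```

An end-to-end alignment of a 19-base read without indels is always `19M`; a local alignment of the same read may clip a few bases at either end.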


## Exercises

1. Re-run Bowtie2 using `--very-sensitive`
2. Try `--local` alignment
3. Add `--no-unal` and inspect the SAM output
4. What happens if you introduce mismatches into a read?
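
For exercise 4, it can help to predict where a read should land before running Bowtie2. The sketch below scans the forward strand of the toy genome for every position where a read matches within a given number of mismatches. It is a brute-force illustration with hypothetical helper names (`find_hits`, `substitute`), not Bowtie2's algorithm, and it ignores the reverse strand and indels.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(genome, read, max_mismatches=0):
    """Return 1-based forward-strand positions where read matches genome."""
    hits = []
    for start in range(len(genome) - len(read) + 1):
        window = genome[start:start + len(read)]
        if hamming(window, read) <= max_mismatches:
            hits.append(start + 1)  # SAM positions are 1-based
    return hits


def substitute(read, index, base):
    """Return a copy of read with the base at a 0-based index replaced."""
    return read[:index] + base + read[index + 1:]


genome = "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACTGACTGACTGACTGAC" * 4
read = "ATGCGTACGTAGCTAGCTA"

print(find_hits(genome, read))       # prints: [1, 50, 99, 148]
mutated = substitute(read, 5, "A")   # introduce one mismatch (T to A)
print(find_hits(genome, mutated))    # prints: []
print(find_hits(genome, mutated, max_mismatches=1))
```

The unmodified read matches once in each of the four repeats, which is why reads from this genome are ambiguous and why options such as `-k` change what is reported.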


## Key takeaways

- Bowtie2 aligns reads to a reference genome
- A genome index must be built before alignment
- Paired-end reads improve alignment confidence
- SAM/BAM files store alignment results
- Alignment parameters affect sensitivity and specificity
