🧬 Lesson 3: Align Reads to Genome with HISAT2

✅ 0. Identify your organism
You can:
Look at your runinfo.csv or SRA metadata (check the “ScientificName” column).
Or run this command on an SRR file to extract metadata:

In [None]:
esearch -db sra -query SRRXXXXX | efetch -format docsum | xtract -pattern DocumentSummary -element Title,Organism

It's Clostridium autoethanogenum in our case.
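If you already have a runinfo.csv, the “ScientificName” column can be pulled out by header name instead of by counting columns. The sketch below is one possible way to do it; the function name scientific_names is made up for illustration, and it assumes no field contains a quoted comma.

```shell
# Print the unique values of the ScientificName column in a runinfo-style CSV.
# Assumption: fields contain no quoted commas (a plain comma split is used).
scientific_names() {
    awk -F',' '
        NR == 1 {
            # Find the column whose header is exactly "ScientificName"
            for (i = 1; i <= NF; i++) if ($i == "ScientificName") col = i
            next
        }
        # Skip empty values and any repeated header rows
        col && $col != "" && $col != "ScientificName" { print $col }
    ' "$1" | sort -u
}

# Usage (only runs if the file is present in the current directory)
if [ -f runinfo.csv ]; then
    scientific_names runinfo.csv
fi
```

If more than one name is printed, the runs come from different organisms and need separate references.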

Choose the correct genome source. Most reference genomes and annotations come from one of the following:

Resource               Use if your organism is...
Ensembl                Eukaryotic model organisms
NCBI Genome            Most organisms, including bacteria
UCSC Genome Browser    Model eukaryotes
RefSeq (NCBI)          Common reference source, especially for viruses and prokaryotes

For NCBI Genomes:

Visit: https://www.ncbi.nlm.nih.gov/genome

Search your organism name.

Choose the latest RefSeq assembly.

Look under “Download the GenBank/FASTA/GTF files”.

Example for Clostridium autoethanogenum:

✅ 1. Download Reference Genome and Annotation (GFF)

# Create reference directory and enter it
mkdir -p rnaseq_project/reference
cd rnaseq_project/reference

# Download genome FASTA and annotation GFF from NCBI
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/484/505/GCF_000484505.2_ASM48450v2/GCF_000484505.2_ASM48450v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/484/505/GCF_000484505.2_ASM48450v2/GCF_000484505.2_ASM48450v2_genomic.gff.gz

# Unzip files
gunzip GCF_000484505.2_ASM48450v2_genomic.fna.gz
gunzip GCF_000484505.2_ASM48450v2_genomic.gff.gz
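After unzipping, a quick look at both files can catch a truncated or wrong download. The following is a minimal sketch; the helper names count_fasta_records and gff_feature_counts are illustrative, not part of any tool.

```shell
# Count sequence records (header lines starting with ">") in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Tally feature types (column 3) in a GFF file, skipping comment lines.
# Output: "<count><TAB><feature type>", most frequent first.
gff_feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print n[t] "\t" t }' "$1" \
        | sort -k1,1nr
}

# Usage (only runs if both files are present in the current directory)
fasta=GCF_000484505.2_ASM48450v2_genomic.fna
gff=GCF_000484505.2_ASM48450v2_genomic.gff
if [ -f "$fasta" ] && [ -f "$gff" ]; then
    echo "Sequences in FASTA: $(count_fasta_records "$fasta")"
    gff_feature_counts "$gff"
fi
```

A bacterial assembly usually has a small number of sequences (chromosome plus any plasmids or contigs), and the GFF should list gene and CDS features.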

🔁 Note: HISAT2 does not need the annotation to build its index or to align reads. The .gff/.gtf file is used by downstream tools such as StringTie, so .gff is fine for now.

✅ 2. Build HISAT2 Index
Still inside rnaseq_project/reference, run:

In [None]:
hisat2-build GCF_000484505.2_ASM48450v2_genomic.fna ca_index

This creates eight index files, ca_index.1.ht2 through ca_index.8.ht2.
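To confirm the build finished, you can check that every index file is present. This is a rough sketch under the assumption of a small index (eight .ht2 files); very large genomes produce .ht2l files instead. The function name check_hisat2_index is hypothetical.

```shell
# Return 0 if all eight small-index files exist for the given HISAT2 prefix,
# otherwise list the missing ones on stderr and return 1.
# Assumption: a small index (.ht2), not a large one (.ht2l).
check_hisat2_index() {
    local prefix="$1" missing=0 i
    for i in 1 2 3 4 5 6 7 8; do
        if [ ! -e "${prefix}.${i}.ht2" ]; then
            echo "Missing: ${prefix}.${i}.ht2" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: run from rnaseq_project/reference
if check_hisat2_index ca_index; then
    echo "Index looks complete"
else
    echo "Index is incomplete; rerun hisat2-build"
fi
```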

✅ 3. Align Trimmed Reads to Genome

Save the script below as:
📄 rnaseq_project/align_all_trimmed_parallel.sh

In [None]:
#!/bin/bash
# File: align_all_trimmed_parallel.sh
# Purpose: Align all paired-end trimmed FASTQ files in parallel using HISAT2 + samtools
# Hardware: 4 cores, 8 threads total available

mkdir -p bam_files

# Function to align one sample
align_sample() {
    local sample="$1"
    local f1="trimmed_data/${sample}_1.trimmed.fastq.gz"
    local f2="trimmed_data/${sample}_2.trimmed.fastq.gz"
    local sam_out="bam_files/${sample}.sam"
    local bam_out="bam_files/${sample}.bam"

    echo "Aligning $sample ..."

    # Use 6 threads for HISAT2 (leaving 2 threads for other processes)
    hisat2 -x reference/ca_index \
           -1 "$f1" \
           -2 "$f2" \
           -p 6 \
           -S "$sam_out"

    # Convert SAM to BAM with 2 extra samtools threads;
    # delete the SAM file only if the conversion succeeded
    samtools view -@ 2 -b -o "$bam_out" "$sam_out" && rm "$sam_out"
}

export -f align_sample

# Derive sample names from the forward-read files and pass them to GNU parallel
# --jobs 1 runs one alignment at a time, since each HISAT2 call already uses 6 threads
find trimmed_data -name "*_1.trimmed.fastq.gz" | sed 's/.*\///; s/_1\.trimmed\.fastq\.gz$//' | \
    parallel --jobs 1 --load 100% align_sample

echo "All alignments complete!"
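Before running the script, it can help to confirm that every forward-read file has its reverse-read mate, because HISAT2 is given both files for each sample. The check below is a small sketch that follows the script's file naming; the function name find_unpaired_samples is made up for illustration.

```shell
# List sample names whose forward-read file has no matching reverse-read file.
# The directory argument defaults to trimmed_data, as used by the alignment script.
find_unpaired_samples() {
    local dir="${1:-trimmed_data}" f1 sample
    for f1 in "$dir"/*_1.trimmed.fastq.gz; do
        [ -e "$f1" ] || continue    # the glob matched nothing
        sample="$(basename "$f1" _1.trimmed.fastq.gz)"
        [ -e "$dir/${sample}_2.trimmed.fastq.gz" ] || echo "$sample"
    done
}

# Usage: run from rnaseq_project
unpaired="$(find_unpaired_samples trimmed_data)"
if [ -n "$unpaired" ]; then
    echo "Samples missing a _2 file:" >&2
    echo "$unpaired" >&2
fi
```

No output means every sample found in trimmed_data has both read files.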

Make it executable and run it from the rnaseq_project directory (the script uses relative paths):

In [None]:
chmod +x align_all_trimmed_parallel.sh
./align_all_trimmed_parallel.sh

The reads for every sample are now aligned against the reference genome, with one BAM file per sample in bam_files/.
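HISAT2 prints an alignment summary to stderr that ends with a line such as "NN.NN% overall alignment rate". The script above does not save that output, so the sketch below assumes you have separately redirected each sample's stderr to a log file; the logs/ directory, the .hisat2.log suffix, and the function name overall_alignment_rate are all hypothetical.

```shell
# Extract the overall alignment rate (e.g. "95.12%") from a saved HISAT2 log.
# Assumption: the log holds HISAT2's stderr summary for one sample.
overall_alignment_rate() {
    sed -n 's/^\([0-9.]*%\) overall alignment rate$/\1/p' "$1"
}

# Usage: print one line per sample log, if any such logs exist
for log in logs/*.hisat2.log; do
    [ -e "$log" ] || continue    # the glob matched nothing
    echo "$(basename "$log" .hisat2.log): $(overall_alignment_rate "$log")"
done
```

A very low rate for a sample often points to the wrong reference genome or to contamination, so it is worth checking before moving on.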
////////////////////////////////////////////////////

Useful command (run in a terminal) to monitor hisat2 and samtools processes:

In [None]:
while true; do date -u "+%Y-%m-%d %H:%M:%S"; echo "User: $USER"; echo "----------------------------------------"; ps aux | grep -E "hisat2|samtools" | grep -v "grep"; echo; sleep 30; done