# Alignment
Some useful points from the lecture slides:

Three RNA-seq mapping/alignment strategies:
1. De novo assembly: used when a reference genome doesn't exist. Tools: Trinity, Velvet
2. Align to transcriptome: relies on known transcripts
3. Align to reference genome: allows for novel transcript discovery

Genomic coordinates:
* 1-based: GFF, SAM, VCF, Ensembl browser, etc.
* 0-based: BED, BAM, UCSC browser (its tables and downloads; the web display shows 1-based positions)
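
To make the difference concrete, the same interval can be written in both conventions. The sketch below assumes that 1-based coordinates are closed (both ends included, as in GFF) and that 0-based coordinates are half-open (end excluded, as in BED). The function names are made up for this illustration.

```bash
# Convert a 1-based, closed interval (GFF-style) to a 0-based,
# half-open interval (BED-style): only the start changes.
# NOTE: function names are hypothetical, chosen for this illustration.
one_based_to_bed() {
    local start=$1 end=$2
    echo "$((start - 1)) $end"
}

# Convert back: BED-style to GFF-style.
bed_to_one_based() {
    local start=$1 end=$2
    echo "$((start + 1)) $end"
}

# The first 100 bases of a chromosome in each convention:
one_based_to_bed 1 100     # prints: 0 100
bed_to_one_based 0 100     # prints: 1 100
```

Under these assumptions the interval length is end - start in the 0-based form and end - start + 1 in the 1-based form.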

Alignment QC: https://github.com/griffithlab/rnabio.org/blob/master/assets/lectures/cshl/2024/mini/RNASeq_MiniLecture_02_04_alignmentQC.pdf

# Adapter trim
Use fastp to trim sequencing adapters from the read FASTQ files and to perform basic data quality cleanup. The output of this step is a set of trimmed and filtered FASTQ files for each data set.

In [1]:
echo $RNA_DATA_TRIM_DIR
mkdir -p $RNA_DATA_TRIM_DIR
cd $RNA_REFS_DIR

#Download necessary Illumina adapter sequence files.
wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa

/home/ubuntu/workspace/rnaseq/data/trimmed
--2025-04-27 01:32:11--  http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
Resolving genomedata.org (genomedata.org)... 54.71.55.4
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa [following]
--2025-04-27 01:32:11--  https://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161
Saving to: ‘illumina_multiplex.fa’


2025-04-27 01:32:12 (134 MB/s) - ‘illumina_multiplex.fa’ saved [161/161]



Use fastp to remove Illumina adapter sequences (if any), trim the first 13 bases of each read, and perform default read quality filtering to remove reads that are too short, have too many low-quality bases, or have too many Ns.

* -l 25: the minimum read length allowed after trimming is 25 bp
* --adapter_fasta: the path to the FASTA file containing the adapter sequences to trim
* --trim_front1 13: trim a fixed number of bases (13 in this case) off the left end of read1
* --trim_front2 13: trim a fixed number of bases (13 in this case) off the left end of read2
* --json: the path to store a log file in JSON format
* --html: the path to store a web report file
* 2>: store the messages that would otherwise be printed to the screen in a file instead
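
The `2>` in the last bullet is a shell redirect rather than a fastp option. A minimal sketch of how it behaves, using a made-up function `noisy_tool` that stands in for any program that prints log messages to standard error:

```bash
# Hypothetical stand-in for a tool that writes results to stdout
# and progress/log messages to stderr.
noisy_tool() {
    echo "result line"            # standard output (file descriptor 1)
    echo "progress message" >&2   # standard error  (file descriptor 2)
}

log_file=$(mktemp)

# 2> sends only stderr to the log file; stdout still reaches the screen.
noisy_tool 2>"$log_file"

cat "$log_file"    # shows: progress message
rm -f "$log_file"
```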

In [2]:
cd $RNA_HOME

export S1=UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22
fastp -i $RNA_DATA_DIR/$S1.read1.fastq.gz -I $RNA_DATA_DIR/$S1.read2.fastq.gz -o $RNA_DATA_TRIM_DIR/$S1.read1.fastq.gz -O $RNA_DATA_TRIM_DIR/$S1.read2.fastq.gz -l 25 --adapter_fasta $RNA_REFS_DIR/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json $RNA_DATA_TRIM_DIR/$S1.fastp.json --html $RNA_DATA_TRIM_DIR/$S1.fastp.html 2>$RNA_DATA_TRIM_DIR/$S1.fastp.log

export S2=UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22
fastp -i $RNA_DATA_DIR/$S2.read1.fastq.gz -I $RNA_DATA_DIR/$S2.read2.fastq.gz -o $RNA_DATA_TRIM_DIR/$S2.read1.fastq.gz -O $RNA_DATA_TRIM_DIR/$S2.read2.fastq.gz -l 25 --adapter_fasta $RNA_REFS_DIR/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json $RNA_DATA_TRIM_DIR/$S2.fastp.json --html $RNA_DATA_TRIM_DIR/$S2.fastp.html 2>$RNA_DATA_TRIM_DIR/$S2.fastp.log

export S3=UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22
fastp -i $RNA_DATA_DIR/$S3.read1.fastq.gz -I $RNA_DATA_DIR/$S3.read2.fastq.gz -o $RNA_DATA_TRIM_DIR/$S3.read1.fastq.gz -O $RNA_DATA_TRIM_DIR/$S3.read2.fastq.gz -l 25 --adapter_fasta $RNA_REFS_DIR/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json $RNA_DATA_TRIM_DIR/$S3.fastp.json --html $RNA_DATA_TRIM_DIR/$S3.fastp.html 2>$RNA_DATA_TRIM_DIR/$S3.fastp.log

export S4=HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22
fastp -i $RNA_DATA_DIR/$S4.read1.fastq.gz -I $RNA_DATA_DIR/$S4.read2.fastq.gz -o $RNA_DATA_TRIM_DIR/$S4.read1.fastq.gz -O $RNA_DATA_TRIM_DIR/$S4.read2.fastq.gz -l 25 --adapter_fasta $RNA_REFS_DIR/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json $RNA_DATA_TRIM_DIR/$S4.fastp.json --html $RNA_DATA_TRIM_DIR/$S4.fastp.html 2>$RNA_DATA_TRIM_DIR/$S4.fastp.log

export S5=HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22
fastp -i $RNA_DATA_DIR/$S5.read1.fastq.gz -I $RNA_DATA_DIR/$S5.read2.fastq.gz -o $RNA_DATA_TRIM_DIR/$S5.read1.fastq.gz -O $RNA_DATA_TRIM_DIR/$S5.read2.fastq.gz -l 25 --adapter_fasta $RNA_REFS_DIR/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json $RNA_DATA_TRIM_DIR/$S5.fastp.json --html $RNA_DATA_TRIM_DIR/$S5.fastp.html 2>$RNA_DATA_TRIM_DIR/$S5.fastp.log

export S6=HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22
fastp -i $RNA_DATA_DIR/$S6.read1.fastq.gz -I $RNA_DATA_DIR/$S6.read2.fastq.gz -o $RNA_DATA_TRIM_DIR/$S6.read1.fastq.gz -O $RNA_DATA_TRIM_DIR/$S6.read2.fastq.gz -l 25 --adapter_fasta $RNA_REFS_DIR/illumina_multiplex.fa --trim_front1 13 --trim_front2 13 --json $RNA_DATA_TRIM_DIR/$S6.fastp.json --html $RNA_DATA_TRIM_DIR/$S6.fastp.html 2>$RNA_DATA_TRIM_DIR/$S6.fastp.log
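
The six commands above differ only in the sample name, so they could also be generated by a loop. The sketch below is an alternative, not the tutorial's method. `build_fastp_cmd` is a hypothetical helper that only prints each command so it can be reviewed before running. It reuses the same options and the `RNA_DATA_DIR`, `RNA_DATA_TRIM_DIR`, and `RNA_REFS_DIR` variables from the cell above.

```bash
# Hypothetical helper: print the fastp command for one sample name.
# Options mirror the commands above; nothing is executed here.
build_fastp_cmd() {
    local s=$1
    echo "fastp -i $RNA_DATA_DIR/$s.read1.fastq.gz -I $RNA_DATA_DIR/$s.read2.fastq.gz" \
         "-o $RNA_DATA_TRIM_DIR/$s.read1.fastq.gz -O $RNA_DATA_TRIM_DIR/$s.read2.fastq.gz" \
         "-l 25 --adapter_fasta $RNA_REFS_DIR/illumina_multiplex.fa" \
         "--trim_front1 13 --trim_front2 13" \
         "--json $RNA_DATA_TRIM_DIR/$s.fastp.json --html $RNA_DATA_TRIM_DIR/$s.fastp.html" \
         "2>$RNA_DATA_TRIM_DIR/$s.fastp.log"
}

# Sample names follow the pattern used above: UHR uses ERCC Mix1, HBR uses Mix2.
for rep in 1 2 3; do
    build_fastp_cmd "UHR_Rep${rep}_ERCC-Mix1_Build37-ErccTranscripts-chr22"
    build_fastp_cmd "HBR_Rep${rep}_ERCC-Mix2_Build37-ErccTranscripts-chr22"
done
```

Once the printed commands look right, piping the loop's output to `bash` should run them. The samples are processed in a different order than in the cell above, which does not affect the results.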

## Use FastQC and multiqc to compare the impact of trimming

In [3]:
cd $RNA_DATA_TRIM_DIR
fastqc *.fastq.gz
multiqc ./

Started analysis of HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz

The resulting HTML reports can be viewed by navigating to:
* http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/data/
* http://YOUR_PUBLIC_IPv4_ADDRESS/rnaseq/data/trimmed/

* http://54.197.176.236/rnaseq/data/fastqc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1_fastqc.html
* http://54.197.176.236/rnaseq/data/trimmed/fastqc/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1_fastqc.html
* http://54.197.176.236/rnaseq/data/multiqc_report.html
* http://54.197.176.236/rnaseq/data/trimmed/multiqc_report.html
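
Alongside the HTML reports, a quick sanity check is to count the reads before and after trimming. The sketch below assumes standard four-line FASTQ records and gzip-compressed files. `count_reads` is a hypothetical helper name.

```bash
# Hypothetical helper: count reads in a gzipped FASTQ file,
# assuming each record spans exactly four lines.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# Example usage with the tutorial's directory variables (uncomment to run):
# S=HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22
# echo "before: $(count_reads $RNA_DATA_DIR/$S.read1.fastq.gz)"
# echo "after:  $(count_reads $RNA_DATA_TRIM_DIR/$S.read1.fastq.gz)"
```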

## Move the FastQC and fastp results for the trimmed files into sub-directories to keep things tidy

In [4]:
cd $RNA_DATA_TRIM_DIR
# -p avoids an error if the directory already exists (e.g. when re-running the cell)
mkdir -p fastqc
mv *_fastqc* fastqc
mkdir -p fastp
mv *fastp.* fastp

# PRACTICAL EXERCISE 5
Using the approach above, trim the reads for both the normal and tumor samples that you downloaded for the previous practical exercise. NOTE: try dropping the hard left trim options used above (‘--trim_front1 13’ and ‘--trim_front2 13’). Once you have trimmed the reads, compare a pre- and post-trimming FASTQ file using the FastQC and multiqc tools.

In [5]:
cd $RNA_HOME/practice/data/
mkdir trimmed
wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa

fastp -i hcc1395_normal_rep1_r1.fastq.gz -I hcc1395_normal_rep1_r2.fastq.gz -o trimmed/hcc1395_normal_rep1_r1.fastq.gz -O trimmed/hcc1395_normal_rep1_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_normal_rep1.fastp.json --html trimmed/hcc1395_normal_rep1.fastp.html 2>trimmed/hcc1395_normal_rep1.fastp.log
fastp -i hcc1395_normal_rep2_r1.fastq.gz -I hcc1395_normal_rep2_r2.fastq.gz -o trimmed/hcc1395_normal_rep2_r1.fastq.gz -O trimmed/hcc1395_normal_rep2_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_normal_rep2.fastp.json --html trimmed/hcc1395_normal_rep2.fastp.html 2>trimmed/hcc1395_normal_rep2.fastp.log
fastp -i hcc1395_normal_rep3_r1.fastq.gz -I hcc1395_normal_rep3_r2.fastq.gz -o trimmed/hcc1395_normal_rep3_r1.fastq.gz -O trimmed/hcc1395_normal_rep3_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_normal_rep3.fastp.json --html trimmed/hcc1395_normal_rep3.fastp.html 2>trimmed/hcc1395_normal_rep3.fastp.log

fastp -i hcc1395_tumor_rep1_r1.fastq.gz -I hcc1395_tumor_rep1_r2.fastq.gz -o trimmed/hcc1395_tumor_rep1_r1.fastq.gz -O trimmed/hcc1395_tumor_rep1_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_tumor_rep1.fastp.json --html trimmed/hcc1395_tumor_rep1.fastp.html 2>trimmed/hcc1395_tumor_rep1.fastp.log
fastp -i hcc1395_tumor_rep2_r1.fastq.gz -I hcc1395_tumor_rep2_r2.fastq.gz -o trimmed/hcc1395_tumor_rep2_r1.fastq.gz -O trimmed/hcc1395_tumor_rep2_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_tumor_rep2.fastp.json --html trimmed/hcc1395_tumor_rep2.fastp.html 2>trimmed/hcc1395_tumor_rep2.fastp.log
fastp -i hcc1395_tumor_rep3_r1.fastq.gz -I hcc1395_tumor_rep3_r2.fastq.gz -o trimmed/hcc1395_tumor_rep3_r1.fastq.gz -O trimmed/hcc1395_tumor_rep3_r2.fastq.gz -l 25 --adapter_fasta illumina_multiplex.fa --json trimmed/hcc1395_tumor_rep3.fastp.json --html trimmed/hcc1395_tumor_rep3.fastp.html 2>trimmed/hcc1395_tumor_rep3.fastp.log

--2025-04-27 01:58:18--  http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
Resolving genomedata.org (genomedata.org)... 54.71.55.4
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa [following]
--2025-04-27 01:58:18--  https://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
Connecting to genomedata.org (genomedata.org)|54.71.55.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161
Saving to: ‘illumina_multiplex.fa’


2025-04-27 01:58:19 (113 MB/s) - ‘illumina_multiplex.fa’ saved [161/161]



In [6]:
cd $RNA_HOME/practice/data/trimmed/
fastqc *.fastq.gz
multiqc ./

Started analysis of hcc1395_normal_rep1_r1.fastq.gz

* http://YOUR_IPV4/rnaseq/practice/data/hcc1395_normal_rep1_r1_fastqc.html
* http://YOUR_IPV4/rnaseq/practice/data/trimmed/hcc1395_normal_rep1_r1_fastqc.html
* http://54.197.176.236/rnaseq/practice/data/hcc1395_normal_rep1_r1_fastqc.html
* http://54.197.176.236/rnaseq/practice/data/trimmed/hcc1395_normal_rep1_r1_fastqc.html

1. After trimming, what is the range of read lengths observed for hcc1395 normal replicate 1, read 1?
2. Which sections of the FastQC report are most informative for observing the effect of trimming?
Ans: ‘Basic Statistics’, ‘Sequence Length Distribution’, and ‘Adapter Content’
3. In the ‘Per base sequence content’ section, what pattern do you see? What could explain this pattern?
Ans: The first 9 base positions show a spiky pattern, suggesting biased base composition near the beginning of our reads/fragments. One possible explanation is that random hexamer priming for cDNA synthesis during library prep is not truly random, i.e. certain hexamers are favored, so the fragments (and ultimately the reads) show a non-random pattern near their start.
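
For question 1, the read-length range is reported by FastQC, and it can also be computed directly from the FASTQ file. A minimal sketch, assuming four-line FASTQ records in a gzip-compressed file. `read_length_range` is a hypothetical helper name.

```bash
# Hypothetical helper: print the shortest and longest read length
# in a gzipped FASTQ file (the sequence is line 2 of every 4-line record).
read_length_range() {
    gzip -dc "$1" | awk 'BEGIN { max = 0 }
    NR % 4 == 2 {
        n = length($0)
        if (min == "" || n < min) min = n
        if (n > max) max = n
    }
    END { if (min != "") print min "-" max }'
}

# Example usage (uncomment to run after trimming):
# read_length_range $RNA_HOME/practice/data/trimmed/hcc1395_normal_rep1_r1.fastq.gz
```

The output has the form min-max and should agree with the sequence length shown in the FastQC report for the same file.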

In [None]:
cd $RNA_HOME/practice/data
mkdir fastqc
mv *_fastqc* fastqc