Content and data courtesy of https://genomics.sschmeier.com/

<h3> Basic file/sequence analysis </h3>
We are working with FASTQ files. Each file contains sequencing reads, and each read is stored as a four-line record: the identifier (starting with @), the sequence, a separator (+), and the per-base quality scores.
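
To make the four-line layout concrete, the sketch below labels each line of a FASTQ stream by its position within the record. The read name, bases, and quality string are invented for illustration, and `label_fastq_lines` is a hypothetical helper, not part of any tool used in this notebook.

```shell
# Hypothetical helper: label each line of an uncompressed FASTQ stream
# by its role within the four-line record.
label_fastq_lines() {
  awk '{
    pos = NR % 4
    if (pos == 1)      role = "identifier"
    else if (pos == 2) role = "sequence"
    else if (pos == 3) role = "separator"
    else               role = "quality"
    print role ": " $0
  }'
}

# A made-up record, for illustration only.
printf '%s\n' '@read1' 'ACGTACGT' '+' 'IIIIHHGG' | label_fastq_lines
```

On the real data, the same helper could be fed the first record of a file with `gzip -dc analysis/data/anc_R1.fastq.gz | head -n 4 | label_fastq_lines`.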

In [1]:
%%bash
## Count sequences
cd analysis/data
for file in *.fastq.gz
do
  count=$(zcat "$file" | wc -l | awk '{print $1/4}')
  echo "$file: $count sequences"
done


anc_R1.fastq.gz: 281748 sequences
anc_R2.fastq.gz: 281748 sequences
evol1_R1.fastq.gz: 988653 sequences
evol1_R2.fastq.gz: 988653 sequences
evol2_R1.fastq.gz: 926261 sequences
evol2_R2.fastq.gz: 926261 sequences
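
Because these are paired-end files, R1 and R2 of each sample should hold the same number of records, which the counts above confirm. A minimal sketch of automating that check follows. `count_reads` and `check_pair` are hypothetical helper names, and the demo files are generated on the fly so the block runs without the tutorial data.

```shell
# Hypothetical helpers, not part of the original notebook.
# Count records in a gzipped FASTQ (four lines per record).
count_reads() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Succeed only if both mates contain the same number of records.
check_pair() {
  r1=$(count_reads "$1")
  r2=$(count_reads "$2")
  if [ "$r1" = "$r2" ]; then
    echo "OK: $r1 read pairs"
  else
    echo "MISMATCH: $1 has $r1 reads, $2 has $r2 reads" >&2
    return 1
  fi
}

# Demo on two tiny made-up files.
tmp=$(mktemp -d)
printf '%s\n' '@a/1' 'ACGT' '+' 'IIII' '@b/1' 'TTGA' '+' 'IIII' | gzip > "$tmp/demo_R1.fastq.gz"
printf '%s\n' '@a/2' 'ACGT' '+' 'IIII' '@b/2' 'TCAA' '+' 'IIII' | gzip > "$tmp/demo_R2.fastq.gz"
check_pair "$tmp/demo_R1.fastq.gz" "$tmp/demo_R2.fastq.gz"
```

For the tutorial data, the call would be, for example, `check_pair anc_R1.fastq.gz anc_R2.fastq.gz` from inside `analysis/data`.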


This sequencing run did not use a PhiX spike-in, so no PhiX removal is necessary.

<h3> Adapter trimming </h3>
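We use fastp for trimming, which removes exogenous adapter sequences from the reads. With the options below, fastp also trims low-quality bases using a sliding window (--cut_right) and corrects mismatched bases in the overlapping region of each read pair (--correction).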

In [4]:
%%bash
# activate env
source activate base
conda activate qc

## fastp
cd analysis
ls
mkdir -p trimmed

## ancestral and evolved samples
for sample in anc evol1 evol2
do
  fastp --detect_adapter_for_pe \
        --overrepresentation_analysis \
        --correction --cut_right --thread 2 \
        --html "trimmed/${sample}.fastp.html" --json "trimmed/${sample}.fastp.json" \
        -i "data/${sample}_R1.fastq.gz" -I "data/${sample}_R2.fastq.gz" \
        -o "trimmed/${sample}_R1.fastq.gz" -O "trimmed/${sample}_R2.fastq.gz"
done


data
ngs-tutorial
trimmed


Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 281748
total bases: 37625783
Q20 bases: 35813651(95.1838%)
Q30 bases: 33016386(87.7494%)

Read2 before filtering:
total reads: 281748
total bases: 37616928
Q20 bases: 35254581(93.72%)
Q30 bases: 32356901(86.0169%)

Read1 after filtering:
total reads: 266340
total bases: 34010216
Q20 bases: 33090965(97.2971%)
Q30 bases: 30885104(90.8113%)

Read2 after filtering:
total reads: 266340
total bases: 32927340
Q20 bases: 31917769(96.9339%)
Q30 bases: 29800717(90.5045%)

Filtering result:
reads passed filter: 532680
reads failed due to low quality: 396
reads failed due to too many N: 2
reads failed due to too short: 30418
reads with adapter trimmed: 1534
bases trimmed due to adapters: 49193
reads corrected by overlap analysis: 18866
bases corrected by overlap analysis: 21173

Duplication rate: 0.568593%

Insert size pe
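
The filtering summary for the ancestral sample is internally consistent, and the arithmetic is worth spelling out. fastp counts individual reads here, so the input is 2 × 281748 reads. The sketch below only re-derives figures from the log above; the variable names are illustrative.

```shell
# Numbers copied from the fastp log for the ancestral sample.
reads_per_mate=281748
passed=532680
low_quality=396
too_many_n=2
too_short=30418

total=$((reads_per_mate * 2))                     # R1 + R2 reads going in
failed=$((low_quality + too_many_n + too_short))  # reads dropped by any filter

echo "input reads:     $total"
echo "passed + failed: $((passed + failed))"
awk -v p="$passed" -v t="$total" 'BEGIN { printf "pass rate:       %.2f%%\n", 100 * p / t }'
```

Both totals come to 563496 reads, and about 94.53% of reads pass, which matches the drop from 281748 to 266340 reads per mate.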

<h3> FastQC and MultiQC </h3>

In [37]:
%%bash

# activate env
source activate base
conda activate qc
cd analysis
mkdir -p trimmed-fastqc


for FILE in trimmed/*.fastq.gz
do
  
  echo "$FILE"
  fastqc -o trimmed-fastqc "$FILE"
done

multiqc trimmed-fastqc trimmed

trimmed/anc_R1.fastq.gz
application/gzip


Started analysis of anc_R1.fastq.gz


Analysis complete for anc_R1.fastq.gz
trimmed/anc_R2.fastq.gz
application/gzip


Started analysis of anc_R2.fastq.gz


Analysis complete for anc_R2.fastq.gz
trimmed/evol1_R1.fastq.gz
application/gzip


Started analysis of evol1_R1.fastq.gz


Analysis complete for evol1_R1.fastq.gz
trimmed/evol1_R2.fastq.gz
application/gzip


Started analysis of evol1_R2.fastq.gz


Analysis complete for evol1_R2.fastq.gz
trimmed/evol2_R1.fastq.gz
application/gzip


Started analysis of evol2_R1.fastq.gz


Analysis complete for evol2_R1.fastq.gz
trimmed/evol2_R2.fastq.gz
application/gzip


Started analysis of evol2_R2.fastq.gz


Analysis complete for evol2_R2.fastq.gz



/// MultiQC 🔍 v1.23

       file_search | Search path: /home/misumi/ngs-tutorial/analysis/trimmed-fastqc
       file_search | Search path: /home/misumi/ngs-tutorial/analysis/trimmed
         searching | 100% 24/24
             fastp | Found 3 reports
            fastqc | Found 6 reports
     write_results | Data        : multiqc_data
     write_results | Report      : multiqc_report.html
           multiqc | MultiQC complete


<h3>QC Results</h3>
FastQC writes one HTML report per file, which can be viewed in a browser, and MultiQC combines them into a single multiqc_report.html. Browsing these results shows a common failure in the "Per base sequence content" module. The proportions of A, C, G, and T deviate noticeably from one another at later read positions, which is probably what triggers the failure. All of the files fail this check.
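
According to the FastQC documentation, the "Per base sequence content" module warns when the difference between A and T, or between G and C, exceeds 10% at any position, and fails when it exceeds 20%. The sketch below applies the 20% rule directly to a FASTQ stream with awk, as a rough way to see which positions trigger the failure. It is an approximation only: FastQC groups later positions into bins, and `flag_biased_positions` is a hypothetical helper, not a FastQC feature.

```shell
# Hypothetical helper: read uncompressed FASTQ on stdin and print every
# read position where |%A - %T| or |%G - %C| exceeds the limit (default 20).
flag_biased_positions() {
  awk -v limit="${1:-20}" '
    NR % 4 == 2 {                      # sequence lines only
      n = length($0)
      if (n > maxlen) { maxlen = n }
      for (i = 1; i <= n; i++) {
        base = toupper(substr($0, i, 1))
        count[i, base]++
        depth[i]++
      }
    }
    END {
      for (i = 1; i <= maxlen; i++) {
        a = 100 * count[i, "A"] / depth[i]
        t = 100 * count[i, "T"] / depth[i]
        g = 100 * count[i, "G"] / depth[i]
        c = 100 * count[i, "C"] / depth[i]
        at = a - t; if (at < 0) { at = -at }
        gc = g - c; if (gc < 0) { gc = -gc }
        if (at > limit || gc > limit) {
          printf "position %d: |A-T|=%.1f |G-C|=%.1f\n", i, at, gc
        }
      }
    }'
}

# Made-up reads: position 1 is always A, position 2 is evenly balanced.
printf '%s\n' '@r1' 'AA' '+' 'II' '@r2' 'AT' '+' 'II' \
              '@r3' 'AG' '+' 'II' '@r4' 'AC' '+' 'II' \
  | flag_biased_positions
```

On the trimmed data this could be run as `gzip -dc trimmed/anc_R1.fastq.gz | flag_biased_positions` from inside `analysis`. Trimming makes read lengths variable, so the last positions are covered by few reads and their percentages are noisy.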