# EEE338 NGS Bioinformatics

* 19 Sep. 2025
* Masaomi Hatakeyama
    - https://github.com/masaomi/EEE338_2025


# Today's plan

1. NGS overview and key terms
2. Typical workflows (RNA-seq / DNA-seq)
3. Data formats (FASTQ/FASTA) and Phred quality
4. Quality control (FastQC panels) and brief exercises



# Learning objectives

By the end of this 1–2 hour introduction, you will be able to:

1. Explain typical NGS workflows for RNA-seq and DNA-seq
2. Describe FASTQ/FASTA formats
3. Motivate why quality control is essential



# Git commands


Download (first time only; once is enough)

    git clone https://github.com/masaomi/EEE338_2025

Update the local repository (you do not have to do it now)

    git pull


# GitHub

<div class="row">
  <div class="column" align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/whatisgithub1.png?raw=true" width="600px">
  </div>
  <div class="column" align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/whatisgithub2.png?raw=true" width="800px">
  </div>
</div>

<span class="small">By David Whitaker, https://www.coursereport.com/blog/what-is-github</span>


# Open the Jupyter notebook in Google Colab


1. Go to https://github.com/masaomi/EEE338_2025
2. Open it in Google Colab
     - https://colab.research.google.com/github/masaomi/EEE338_2025


# JupyterLab/Jupyter notebook/Google Colab


* **JupyterLab** is a web-based interactive development environment for Jupyter notebooks, code, and data.
* **Jupyter Notebook** is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
* **Google Colab** is a hosted alternative that runs Jupyter notebooks in the browser, without a local installation.

https://jupyter.org/
https://colab.research.google.com/


# JupyterLab

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/jupyterlab.png?raw=true" width="700px">
</div>



# Jupyter notebook

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/jupyternotebook.png?raw=true" width="700px">
</div>



# How to use Jupyter notebook

- You can run a *cell* by pressing CTRL+ENTER or SHIFT+ENTER.
    - Two types of cell: **markdown** and **code**
- Double-click or press ENTER to switch to edit mode
- You can edit and save your notes and take them home
- *Restart* the kernel when it gets stuck

<span class="small">Anaconda: https://www.continuum.io/downloads</span>

    # FYI: Python code can run in a Jupyter notebook
    %matplotlib inline
    import numpy as np
    import matplotlib.pyplot as plt
    x = np.linspace(0, 2*np.pi, 300)
    y = np.sin(x**2)
    plt.plot(x, y)
    plt.title("A little chirp")
    fig = plt.gcf()  # let's keep the figure object around for later...

    %%bash
    # Basic Unix commands can also run
    date
    echo "Hello, World!!"

# LaTeX is also available
\begin{equation*}
1 +  \frac{q^2}{(1-q)}+\frac{q^6}{(1-q)(1-q^2)}+\cdots =
\prod_{j=0}^{\infty}\frac{1}{(1-q^{5j+2})(1-q^{5j+3})}
\end{equation*}

# Other related courses

* FGCZ courses
    - https://fgcz.ch/education.html
* Bio334 Practical bioinformatics
    - Python, R
    - https://studentservices.uzh.ch/uzh/anonym/vvz/?sap-language=EN&sap-ui-language=EN#/details/2023/004/SM/50628703


# FGCZ courses

* RNAseq course
* Genomics course
* Metabolomics course
    * ngs.courses@fgcz.ethz.ch
    * http://www.fgcz.ch/education.html


# NGS or HTS?

<div class="row">
  <div class="column">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/ngs.png?raw=true">
  </div>
  <div class="column">
          <item>
            <ul>- NGS: Next Generation Sequencing</ul>
            <ul>- HTS: High Throughput Sequencing</ul>
          </item>
  </div>
</div>




# NGS data analyses

* DNA-seq: DNA sequencing, Genomics
* RNA-seq: RNA sequencing, Transcriptomics
* ChIP-seq: Chromatin immunoprecipitation (ChIP) sequencing, Protein-DNA interactions
* Metagenomics, Epigenetics/genomics, Population genetics/genomics, and so on.


# DNA, nucleotide, gene, etc.

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/Where-is-DNA-found-in-a-human-cell-600x445.jpg?raw=true" width="500px">
</div>

<span class="small">https://whereismap.net/where-is-dna-found-in-a-human-cell-what-is-a-gene/</span>



# Genome & Transcriptome

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/transcript_structure.png?raw=true" width="800px">
</div>



# NGS data analysis

<div class="row">
  <div class="column" align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/ngs_level.png?raw=true" width="300px">
  </div>
  <div class="column" align="left">
      <p>
    - NGS data analysis has different levels
      </p>
  </div>
</div>
<span class="small">Werner T Brief Bioinform, 2010, 11:499-511</span>


# After sequencing

e.g. Illumina HiSeq (fastq file)
```
@HWI-ST1357:71:D2FNTACXX:4:1101:1667:2141 1:N:0:CGA
GCTGCAACTAACGGCATCTGAGTTACCCATTCAAATTTTTCGCGGCTGTGT
+
@@@DADBEHDCFD8:@EBHICGAC?4EG@FDHGDHGIIII??@?FA;B=CE
@HWI-ST1357:71:D2FNTACXX:4:1101:1664:2223 1:N:0:CGA
TGTTGGTTGAGGAGGGTATGGAGGAGGAGGGTAAGCTGACGATAGCGGAGG
+
```

# Many problems

* Many short reads
    - 100-200 bases long, more than 1 million reads
* Which gene does a read come from?
    - exon? intron?
* Sequencing error?
    - sequencing error or real mutation?
* Contamination?
    - adapter? others?


# How to assemble? (short reads)

1. **Mapping** / **Alignment**
    * Align reads to a reference sequence

2. **De novo assembly**
    * Build a reference sequence from the reads alone



# Mapping / Alignment


<div>
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/mapping.png?raw=true" width="700">
</div>

* Align short reads onto a reference sequence
    - Reference sequence is necessary
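
The idea of mapping can be sketched in a few lines of Python. This is a toy illustration only: the reference and reads below are made-up strings, and `map_read` is a hypothetical helper that accepts exact matches only. Real aligners such as Bowtie2 or BWA use indexed references and tolerate mismatches and gaps.

```python
# Toy reference and reads (made up for illustration)
reference = "TTGCTGCAACTAACGGCATCTGAGTTACCCATTCAAATTT"
reads = ["AACTAACGG", "GAGTTACCC", "GGGGGGGGG"]


def map_read(read, reference):
    """Return the 0-based position of the first exact match, or -1 if unmapped."""
    return reference.find(read)


# Map every read and report where it lands on the reference
positions = {read: map_read(read, reference) for read in reads}
for read, pos in positions.items():
    status = f"maps at position {pos}" if pos >= 0 else "unmapped"
    print(read, status)
```

A read that does not occur in the reference stays unmapped, which is why a suitable reference sequence is required for this approach.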


# De novo assembly

<div class="row">
  <div class="column" align="center">
      <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/assembly.png?raw=true" width="500px">
   </div>
  <div class="column">
      <item>
          <ul>
    - Concatenate reads into longer sequences (contigs/scaffolds)
          </ul>
          <item>
              <ul>
    - Misassembly may occur
              </ul>
          </item>
      </item>
  </div>
</div>
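
How two overlapping reads are joined into a contig can be sketched as follows. This is a simplified illustration, not how production assemblers work: `overlap_length` and `merge_reads` are hypothetical helpers, only exact suffix-prefix overlaps are considered, and the minimum overlap of 3 bases is an arbitrary assumption.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one contig if they overlap, otherwise return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Two toy reads that share the 6 bases "TAACGG"
contig = merge_reads("GCTGCAACTAACGG", "TAACGGCATCTGAG")
print(contig)
```

Repeated sequences produce ambiguous overlaps, which is one reason misassembly can occur.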

# Popular software (Alignment)

* Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
* BWA: http://bio-bwa.sourceforge.net/
* STAR: https://github.com/alexdobin/STAR
* HISAT2: http://daehwankimlab.github.io/hisat2/




# Popular software (De novo assembly)

* (Hi)Canu: https://github.com/marbl/canu
* Trinity: http://trinityrnaseq.sourceforge.net/
* Hifiasm: https://github.com/chhylp123/hifiasm




# Your teacher is on the Internet...

* ChatGPT, Gemini, Claude, Perplexity, ChatHub, etc.
    - https://chat.openai.com/, https://gemini.google.com, https://claude.ai/, https://www.perplexity.ai/, https://chathub.gg/


# Keywords

* *Read*: a short sequenced DNA or RNA fragment
* *Assembly*: the concatenation of short reads into longer sequences
    - Alignment / De novo assembly
* *Contig*: a longer sequence assembled from reads
* *Scaffold*: a longer sequence built by joining contigs


# Next 

Typical NGS workflows

1. RNA-seq: (RNA sequencing, Transcriptomics)
2. DNA-seq: (DNA sequencing, Genomics)



# What is RNA-seq?

RNA-seq is a next-generation sequencing method that captures and sequences all RNA molecules in a biological sample, allowing researchers to comprehensively measure gene expression, discover new transcripts, and study alternative splicing and other regulatory processes. (by ChatGPT5)

The main purposes of RNA-seq include:
1. **Gene expression analysis** 
2. **Transcript discovery** 





# What is DNA-seq?

DNA-seq is a next-generation sequencing method that determines the complete DNA sequence of an organism or sample, enabling researchers to identify genetic variants, mutations, structural changes, and study the overall genomic landscape. (by ChatGPT5)

# Whole genome shotgun sequencing

<div class="row">
  <div class="column" align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/wgss.png?raw=true" width="600px">
  </div>
  <div class="column" align="left">
      <br>
In whole genome shotgun sequencing, the entire genome 
   is sheared randomly into small fragments and then reassembled.
   (Wikipedia)
  </div>
</div>
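
The random shearing step can be imitated with a small simulation. This is a sketch under simple assumptions: the "genome" is a short toy string, fragment lengths and start positions are drawn uniformly, and `shear` is a hypothetical helper name.

```python
import random


def shear(genome, n_fragments, min_len, max_len, seed=0):
    """Draw random fragments (substrings) from a genome, like shotgun shearing."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    fragments = []
    for _ in range(n_fragments):
        length = rng.randint(min_len, max_len)
        start = rng.randint(0, len(genome) - length)
        fragments.append(genome[start:start + length])
    return fragments


# Toy genome (made up for illustration)
genome = "GCTGCAACTAACGGCATCTGAGTTACCCATTCAAATTTTTCGCGGCTGTGT"
fragments = shear(genome, n_fragments=10, min_len=8, max_len=15)
for fragment in fragments:
    print(fragment)
```

Because the fragments start at random positions, they overlap each other, and these overlaps are what make reassembly possible.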

   




# Typical process (NGS raw data processing)

1. Mapping / Alignment
    - Map reads to a reference sequence
2. *De novo* Assembly
    - Build a reference sequence



# Typical processes (NGS)

1. Mapping / Alignment
    - *Gene expression analysis* (RNA)
    - *Variant calling/SNP calling/Genotyping* (DNA)
        - population genetics, phylogenetics, genome wide association study

2. *De novo* Assembly 
    - *Transcriptome* reference (RNA)
    - *Genome* reference (DNA)



# RNA-seq workflow <br><span class="smaller">(Differential expression analysis)</span>

<div class="row">
  <div class="column" align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/count_based_deg.png?raw=true" width="500px">
  </div>
  <div class="column" align="left">
      <br>
1. Sequencing<br>
2. Quality control<br>
3. Mapping / Alignment<br>
4. Estimate gene expression level<br>
5. Differentially expressed genes<br>
6. Functional analysis<br>
  </div>
</div>

<span class="small">Simon Anders, et al., Count-based differential expression analysis of RNA sequencing data using R and Bioconductor, Nature Protocols, 2013, 8(9), 1765</span>
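
Step 4 (estimating the gene expression level) boils down to counting how many mapped reads fall into each gene. The sketch below is a toy version with assumptions: the gene coordinates and read positions are invented, a read is assigned by its start position only, and the counts-per-million (CPM) value uses the number of assigned reads as the library size. Dedicated tools handle strands, exons, and ambiguous reads.

```python
# Hypothetical gene intervals as half-open ranges [start, end)
genes = {"geneA": (0, 100), "geneB": (150, 300)}
# Hypothetical mapping start positions of reads on the reference
read_starts = [5, 20, 99, 150, 151, 210, 299, 400]


def count_reads(genes, read_starts):
    """Count reads whose start position lies inside each gene interval."""
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


counts = count_reads(genes, read_starts)
total = sum(counts.values())
# Simplified CPM: scale each count to one million assigned reads
cpm = {name: n / total * 1_000_000 for name, n in counts.items()}
print(counts)
print(cpm)
```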



# RNA-seq workflow <br><span class="smaller">(RNAseq de novo assembly)</span>

If you **DO NOT** have a reference genome
1. Sequencing
2. Quality control
3. **De novo assembly** (transcriptome)
4. Mapping / Alignment
...
or


# DNA-seq workflow <br><span class="smaller">(DNAseq de novo assembly)</span>

If you **DO NOT** have a genome reference
1. Sequencing
2. Quality control
3. **De novo assembly** (genome)
4. Gene annotation, prediction



# DNA-seq workflow <br><span class="smaller">(Population Genomics)</span>

If you **HAVE** a genome reference
1. Re-sequencing
2. Quality control
3. **Mapping / Alignment**
4. **Variant calling, genotyping**
5. Population genetics analysis, etc.
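
Steps 3 and 4 can be illustrated with a toy pileup. The sketch assumes the reads are already aligned without gaps (each read comes with its 0-based start position), and it calls a variant wherever the most frequent read base differs from the reference. `call_snps` and the `min_depth` threshold are hypothetical; real variant callers also model base quality and genotype likelihoods.

```python
from collections import Counter, defaultdict


def call_snps(reference, aligned_reads, min_depth=2):
    """Return (position, ref_base, alt_base, support, depth) for simple SNPs."""
    # Pileup: for every reference position, count the bases observed in reads
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos]:
            snps.append((pos, reference[pos], base, support, depth))
    return snps


# Toy data: all three reads carry a T where the reference has an A (position 6)
reference = "GCTGCAACTAACGG"
aligned_reads = [(0, "GCTGCATCT"), (3, "GCATCTAAC"), (5, "ATCTAACGG")]
snps = call_snps(reference, aligned_reads)
print(snps)
```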


# NGS workflows <br> (RNAseq/DNAseq)

<div class="row">
  <div class="column" align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/SFG.png?raw=true" width="500px">
  </div>
  <div class="column" align="left">
      <br>
<br>
      1. Sequencing<br>
      2. Quality control<br>
      3. Mapping / Alignment / De novo assembly<br>
...
  </div>
</div>

<span class="small">http://sfg.stanford.edu/SFG.pdf</span>


# RNA/DNA-seq workflow

<span class="important">Quality control</span>

is necessary in any case


# Mini summary (NGS workflow)

* Several workflows exist for RNA and DNA

* **Quality control** is necessary in any case
    - check, check, and check...

* A *reference sequence* (genome/transcriptome) is necessary for mapping
    - it is easier to study a model species (Arabidopsis, human, mouse, ...)
    - because you do not have to build the genome/transcriptome reference yourself

# Next

* Quality control
* Data format


# What is quality control?

In the context of NGS data analysis, **quality control (QC)** is the process of assessing the raw sequencing data to ensure reliability by checking factors such as read quality scores, adapter contamination, sequence duplication, and GC content, before proceeding to downstream analyses. (by ChatGPT5)

* Checking the quality of sequenced reads
* Trimming and filtering low-quality parts of reads
* Removing adapter sequences
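
Trimming low-quality parts can be sketched as cutting bases from the 3' end until a base with sufficient quality is reached. This is a deliberately simple illustration: `trim_3prime` is a hypothetical helper, the threshold of Q20 is an assumption, and the quality string is assumed to use offset 33. Tools such as Trimmomatic or fastp use more robust strategies, for example sliding windows.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Cut bases from the 3' end while their Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# Toy read: the last two bases have very low quality ('#' = Q2, '!' = Q0)
seq, qual = trim_3prime("GATTTGGGGT", "IIIIIIII#!")
print(seq)
print(qual)
```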



# After sequencing

e.g. Illumina HiSeq (fastq file)
```
@HWI-ST1357:71:D2FNTACXX:4:1101:1667:2141 1:N:0:CGA
GCTGCAACTAACGGCATCTGAGTTACCCATTCAAATTTTTCGCGGCTGTGT
+
@@@DADBEHDCFD8:@EBHICGAC?4EG@FDHGDHGIIII??@?FA;B=CE
@HWI-ST1357:71:D2FNTACXX:4:1101:1664:2223 1:N:0:CGA
TGTTGGTTGAGGAGGGTATGGAGGAGGAGGGTAAGCTGACGATAGCGGAGG
+
```
*How accurate are they?*

# Sequencing Error

| Platform | Representative model / chemistry | Typical per-base error (raw or standard consensus) |
| --- | --- | --- |
| Illumina (short-read) | NovaSeq X / X Plus, 2×150 | ~0.1% (≈Q30) per base |
| MGI / BGI (short-read) | DNBSEQ-T7 / T7RS (Standard MPS 2.0) | ~0.1% (≈Q30) per base |
| PacBio HiFi (long-read, CCS) | Sequel IIe / Revio (HiFi reads) | ≤0.1% (≥99.9% accuracy; ≈Q30) |
| Oxford Nanopore (long-read, simplex) | R10.4.1 + Kit14, simplex basecalling | ~1% (≈Q20) |
| Oxford Nanopore (long-read, duplex) | R10.4.1 + Kit14, duplex basecalling | ~0.1% (≈Q30) |
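
To get a feeling for these numbers, the per-base error rate can be translated into errors per read. The sketch assumes that errors are independent and that the rate is the same at every position, which is a simplification (quality usually drops toward the end of a read). Both function names are hypothetical.

```python
def expected_errors(read_length, per_base_error):
    """Expected number of wrong bases in one read."""
    return read_length * per_base_error


def prob_error_free(read_length, per_base_error):
    """Probability that a read contains no error at all."""
    return (1 - per_base_error) ** read_length


# A 150-base read at ~0.1% error (about Q30) versus ~1% error (about Q20)
for rate in (0.001, 0.01):
    print(rate, expected_errors(150, rate), round(prob_error_free(150, rate), 3))
```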



# FASTQ format

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTC
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CC

1. ID line (starts with `@`)
2. Sequence
3. Separator line (`+`, optionally followed by the ID again)
4. Quality (one ASCII character per base)
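
The four-line structure makes FASTQ easy to read with plain Python. The following is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); `read_fastq` is a hypothetical helper, and libraries such as Biopython offer more complete parsers.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+', ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# In-memory example instead of a real file
example = io.StringIO("@SEQ_ID\nGATTT\n+\n!''*(\n")
records = list(read_fastq(example))
print(records)
```

For a real file, the same function can be used with `open("reads.fastq")`, where the file name is a placeholder.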



# FASTA format

Nucleotides (*Arabidopsis thaliana*)

    >AT1G51370.2 | F-box domains-containing protein
    ATGGTGGGTGGCAAGAAGAAAACCAAGATATGTGACAAAGTGTCACATG
    TTTGATATCTGAAATACTTTTTCATCTTTCTACCAAGGACTCTGTCAGA
    TTTGGCAATCGGTTCCTGGATTGGACTTAGACCCCTACGCATCCTCAAA

1. **>** Header line with the sequence ID and annotation information
2. Sequence (may span several lines)

**No base quality information**
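
A FASTA record starts at a `>` line and continues until the next `>` line, so the sequence lines have to be joined. A minimal sketch, assuming a well-formed file that begins with a header line; `read_fasta` is a hypothetical helper and the two records are toy data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# In-memory example with one two-line record and one single-line record
example = io.StringIO(">seq1 first toy record\nATGGTG\nGGTGGC\n>seq2\nTTTGAT\n")
records = list(read_fasta(example))
print(records)
```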



# FASTQ format

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>C

1. ID line (starts with `@`)
2. Sequence
3. Separator line (`+`, optionally followed by the ID again)
4. **Quality** (one ASCII character per base)


# Phred quality score

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/quality_score.png?raw=true" height="400px" width="500px">
</div>

    Phred quality scores Q are defined as a property 
    which is logarithmically related to the base-calling 
    error probabilities (Wikipedia)
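
Written as a formula, the relation is $Q = -10 \log_{10} P$, where $P$ is the probability that the base call is wrong, so $P = 10^{-Q/10}$. A small sketch of both directions (the function names are hypothetical):

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score Q for an error probability P."""
    return -10 * math.log10(p)


# Every step of 10 in Q means a 10-fold lower error probability
for q in (10, 20):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {1 - p:.0%}")
```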





# ASCII code (33-126)

    33:! 34:" 35:# 36:$ 37:% 38:& 39:' 40:(
    41:) 42:* 43:+ 44:, 45:- 46:. 47:/ 48:0
    49:1 50:2 51:3 52:4 53:5 54:6 55:7 56:8
    57:9 58:: 59:; 60:< 61:= 62:> 63:? 64:@
    65:A 66:B 67:C 68:D 69:E 70:F 71:G 72:H
    73:I 74:J 75:K 76:L 77:M 78:N 79:O 80:P
    81:Q 82:R 83:S 84:T 85:U 86:V 87:W 88:X
    89:Y 90:Z 91:[ 92:\ 93:] 94:^ 95:_ 96:`
    97:a 98:b 99:c 100:d 101:e 102:f 103:g 104:h
    105:i 106:j 107:k 108:l 109:m 110:n 111:o 112:p
    113:q 114:r 115:s 116:t 117:u 118:v 119:w 120:x
    121:y 122:z 123:{ 124:| 125:} 126:~

character = chr(Quality + 33) 

e.g. Phred quality score=30 $\rightarrow$ ???

# Phred quality score


<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/phred_scores.png?raw=true" height="400px" width="500px">
</div>

* **Illumina 1.9** (Sanger format)
* Scoring: 0-41, Offset: 33
  
<span class="small">http://en.wikipedia.org/wiki/FASTQ_format</span>
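
With offset 33, converting between quality characters and scores takes one line per direction, using Python's built-in `ord` and `chr`. The helper names and the example string are made up for illustration.

```python
OFFSET = 33  # Phred+33 encoding, as on this slide


def decode_quality(quality_string, offset=OFFSET):
    """Quality characters -> list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def encode_quality(scores, offset=OFFSET):
    """List of Phred scores -> quality characters."""
    return "".join(chr(score + offset) for score in scores)


scores = decode_quality("#5:J")
print(scores)
print(encode_quality(scores))
```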


# Exercise 1

Calculate the average quality of the sequence (average quality score per base)

    @SEQ_ID
    GATTT
    +
    !''*(


# Exercise 2

What is the base-call accuracy for the characters **A**, **!**, and **I**?

(Illumina 1.9 Phred Quality Score)



# Popular software

* FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
* fastp: https://github.com/OpenGene/fastp
* Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic



<div class="question">FastQC</div>

# Phred quality score distribution per base (Good case)

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/good_quality.png?raw=true" width="400px">
</div>

<font color="red">red</font>: median, <font color="blue">blue</font>: average, box: 25th-75th percentile, whiskers: 10th-90th percentile
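
The values behind such a plot are per-position statistics over all reads. A rough sketch of the median and average lines, assuming offset 33 and reads of equal length (shorter reads would truncate the result); `per_position_quality` and the three quality strings are invented for illustration and do not reproduce FastQC's exact procedure.

```python
from statistics import mean, median


def per_position_quality(quality_strings, offset=33):
    """Return a list of (median, mean) Phred scores, one pair per read position."""
    score_rows = [[ord(char) - offset for char in q] for q in quality_strings]
    columns = zip(*score_rows)  # regroup the scores by position
    return [(median(col), mean(col)) for col in columns]


# Three toy reads whose quality drops toward the 3' end
quals = ["IIII5", "IIJ5#", "II5##"]
stats = per_position_quality(quals)
for position, (med, avg) in enumerate(stats, start=1):
    print(position, med, round(avg, 1))
```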


# Phred quality score distribution per base (Bad case)

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/bad_quality.png?raw=true" width="400px">
</div>
 
<font color="red">red</font>: median, <font color="blue">blue</font>: average, box: 25th-75th percentile, whiskers: 10th-90th percentile

# Sequence content per base

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/gc_content.png?raw=true" width="1000px">
</div>
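
The per-base sequence content is the fraction of A, C, G, and T at each read position. A minimal sketch with invented reads; `base_content` is a hypothetical helper, and positions beyond the shortest read are ignored.

```python
from collections import Counter


def base_content(reads):
    """Fraction of A, C, G, T at each position across all reads."""
    length = min(len(read) for read in reads)
    content = []
    for i in range(length):
        counts = Counter(read[i] for read in reads)
        total = sum(counts.values())
        content.append({base: counts[base] / total for base in "ACGT"})
    return content


reads = ["GATT", "GCTT", "GATA", "CATT"]
content = base_content(reads)
for position, fractions in enumerate(content, start=1):
    print(position, fractions)
```

In a random library the four lines should run roughly flat along the read; strong position-specific deviations can hint at a bias or contamination.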


# Overrepresented sequences

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/overrepresented.png?raw=true" width="900px">
</div>

* Adapter contamination
* Highly expressed transcripts
* Low-diversity library
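
Finding overrepresented sequences is essentially counting identical reads. The sketch below is an assumption-laden toy: `overrepresented` is a hypothetical helper, whole reads are compared exactly, and the threshold is a free parameter (the default of 0.1% is chosen here for illustration).

```python
from collections import Counter


def overrepresented(reads, min_fraction=0.001):
    """Return (sequence, count, fraction) for sequences above min_fraction."""
    counts = Counter(reads)
    total = len(reads)
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]


# Toy data: one sequence makes up 3 of 5 reads
reads = ["AGATCGGAAGAGC"] * 3 + ["GCTGCAACTAACG", "TGTTGGTTGAGGA"]
hits = overrepresented(reads, min_fraction=0.5)
print(hits)
```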



# Adapter contamination

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/adapter.png?raw=true" width="900px">
</div>

* If the inserted DNA/RNA fragment is shorter than the read length, the read runs into the adapter and contains part of the adapter sequence.
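
Adapter trimming can be sketched as searching for the adapter in the read and cutting from there. The adapter string below is the commonly used Illumina adapter prefix and serves only as an example; `trim_adapter` and the `min_match` value are hypothetical, and only exact matches are found, whereas real trimmers tolerate sequencing errors.

```python
ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence; check your library kit


def trim_adapter(read, adapter=ADAPTER, min_match=5):
    """Remove the adapter and everything after it from the 3' end of a read."""
    # Case 1: the complete adapter occurs somewhere in the read
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first bases of the adapter
    for size in range(min(len(adapter), len(read)) - 1, min_match - 1, -1):
        if read.endswith(adapter[:size]):
            return read[:-size]
    return read  # no adapter found


insert = "GCTGCAACTC"  # toy insert that is shorter than the read length
print(trim_adapter(insert + ADAPTER + "ACAC"))  # full adapter inside the read
print(trim_adapter(insert + ADAPTER[:7]))       # partial adapter at the end
```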


# Example (RNA-seq from the previous Bio373)

*Arabidopsis halleri*, Leaf, RNA, Ion PGM
* [ahal_rna_sample1_fastqc.html](fastqc_examples/ahal_rna_sample1_fastqc.html)

*Metrosideros polymorpha*, DNA, Illumina HiSeq
* [metros_dna_200_fastqc.html](fastqc_examples/metros_dna_200_fastqc.html)


# PacBio RSII (Finger millet DNA)

<div align="center">
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/pacbio.png?raw=true" width="700px">
</div>

# Example (RNA-seq from the previous Bio373)

*Arabidopsis halleri*, Leaf, RNA, Ion PGM
* [ahal_rna_sample1_fastqc.html](fastqc_examples/ahal_rna_sample1_fastqc.html)

After trimming (by Trimmomatic)
* [ahal_rna_sample1_trimmed_fastqc.html](fastqc_examples/ahal_rna_sample1_trimmed_fastqc.html)



# Check Contamination

If the sample is contaminated with DNA/RNA from another species,

* RNAseq (Differential expression analysis)
    - often still **usable** (depends on the purpose)
* DNAseq (De novo assembly, SNP calling)
    - **big problem**

FastqScreen: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/



# Example

Insect mRNA contamination in an *A. halleri* RNA-seq sample
<div align="center">
<table><tr><td>
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/fastqscreen_reinhold.png?raw=true" width="400px">
    </td><td>
    <img src="https://github.com/masaomi/EEE338_2025/blob/main/png/cricket.jpg?raw=true" width="100px">
    </td></tr></table>
</div>



# Mini summary (preprocessing)

Before NGS data processing
* **Quality control** **★**
* Sequencing error/bias
* Phred quality score

If the quality is bad,
 - **sequence again** (prepare the sample again?)
 - **trim/filter** (the total number of reads decreases)

# Mini summary (Quality control)

1. Check read quality
    * Sequence quality
    * Read length distribution
    * Contamination
2. Filtering & Trimming
    * Filtering out low-quality reads
    * Trimming low-quality parts, adapters, and contamination

