<title>
Microbial Genomics Group Rotation Report | 
Michał Kowalski | 
MCB PhD school
</title>

# Scope
How to make sense of microbial sequence data?

# Topics
## Next Generation Sequencing (Illumina vs Nanopore)
Understanding the differences between the outputs of the two most common sequencing platforms

### Illumina
High read quality; produces short reads

### Nanopore
Poor to decent read quality; produces long reads

<img src="https://i.imgur.com/FD6stLs.png"
     alt="Illumina vs Nanopore"
     style="float: left; margin-right: 10px;" />

## Quality Control
### Illumina
Bases are called from the fluorescence signal
### Nanopore
Bases are called from the raw electric current signal (basecalling)
### Why do we need QC
Uncertainty in reading the signal is reflected in the quality score of each base
### FastQ format
As described in [online materials](https://en.wikipedia.org/wiki/FASTQ_format), every Illumina read carries a certainty score for each base, encoded as an ASCII character, which allows better control over the quality of the results. These scores are explored in the exercises presented below.
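To make the four-line record layout concrete, here is a minimal Python sketch of a FASTQ reader. The function names `read_fastq` and `decode_phred33` are hypothetical helpers written for this report, not part of any library, and the sketch assumes unwrapped records (exactly four lines each) with Phred+33 encoding. The example record is the first read of `short_reads_1.fastq`, with its quality line written as 29 "I", one "8" and 7 "I".

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or an empty line) stops the iteration
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the name
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def decode_phred33(quality):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality]


# First record of short_reads_1.fastq, held in memory for illustration.
example = io.StringIO(
    "@SRR031716.1 HWI-EAS299_4_30M2BAAXX:3:1:944:1798 length=37\n"
    "GTGGATATGGATATCCAAATTATATTTGCATAATTTG\n"
    "+SRR031716.1 HWI-EAS299_4_30M2BAAXX:3:1:944:1798 length=37\n"
    + "I" * 29 + "8" + "I" * 7 + "\n"
)

name, seq, qual = next(read_fastq(example))
scores = decode_phred33(qual)
print(name.split()[0], len(seq), min(scores), max(scores))
```

For real data, a maintained parser (for example from Biopython) is a safer choice, since it also handles edge cases this sketch ignores.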

### Exercise 1
Open <code>assembly-data/part1_qc/data/short_reads_1.fastq</code> using a text editor (or "head -4" command)

Open [wiki](https://en.wikipedia.org/wiki/FASTQ_format)

Investigate the first short read

Answer the following questions:

<code>
A. Which ASCII character corresponds to the worst Phred score for Illumina 1.8+?
B. What is the Phred quality score of the 3rd nucleotide of the 1st sequence?
C. What is the accuracy of this 3rd nucleotide?
</code>

In [1]:
%%bash
head -4 assembly-data/part1_qc/data/short_reads_1.fastq

@SRR031716.1 HWI-EAS299_4_30M2BAAXX:3:1:944:1798 length=37
GTGGATATGGATATCCAAATTATATTTGCATAATTTG
+SRR031716.1 HWI-EAS299_4_30M2BAAXX:3:1:944:1798 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIIII8IIIIIII


### Answer 1
<code>
A. "!" corresponds to the worst Phred score (0) for Phred+33 (Illumina 1.8+)
B. The Phred score of the 3rd nucleotide of the 1st sequence is 40 ("I" is ASCII 73, and 73 - 33 = 40)
C. Accuracy is 99.99%; the probability of error is 10^(-40/10) = 0.0001
</code>
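The arithmetic behind these answers can be checked with a few lines of Python. This is a small sketch assuming Phred+33 encoding and the standard relation P(error) = 10^(-Q/10); the helper names are made up for illustration.

```python
def char_to_phred33(ch):
    """Phred score of a single quality character, assuming Phred+33 encoding."""
    return ord(ch) - 33


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# '!' is the lowest Phred+33 character; '8' and 'I' both occur in the first read.
for ch in "!8I":
    q = char_to_phred33(ch)
    p = phred_to_error_probability(q)
    print(f"{ch!r}: Q={q}, P(error)={p:.6f}, accuracy={100 * (1 - p):.4f}%")
```

Every increase of 10 in the Phred score corresponds to a tenfold drop in error probability, so Q40 means one expected error in 10,000 base calls.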

### Exercise 2
Open <code>assembly-data/part1_qc/data/short_reads_1.fastq</code> in FastQC

Answer the following questions:

<code>
A. Which Phred encoding is used in the FASTQ file for these sequences?
B. How is the mean per-base score changing along the sequence?
C. Is this tendency seen in all sequences?
D. Why is there a warning for the per-base sequence content graphs?
E. Why is there a warning for the per sequence GC content graphs?
</code>

In [4]:
%%bash
fastqc assembly-data/part1_qc/data/short_reads_1.fastq

Analysis complete for short_reads_1.fastq


Started analysis of short_reads_1.fastq
Approx 5% complete for short_reads_1.fastq
Approx 10% complete for short_reads_1.fastq
Approx 15% complete for short_reads_1.fastq
Approx 20% complete for short_reads_1.fastq
Approx 25% complete for short_reads_1.fastq
Approx 30% complete for short_reads_1.fastq
Approx 35% complete for short_reads_1.fastq
Approx 40% complete for short_reads_1.fastq
Approx 45% complete for short_reads_1.fastq
Approx 50% complete for short_reads_1.fastq
Approx 55% complete for short_reads_1.fastq
Approx 60% complete for short_reads_1.fastq
Approx 65% complete for short_reads_1.fastq
Approx 70% complete for short_reads_1.fastq
Approx 75% complete for short_reads_1.fastq
Approx 80% complete for short_reads_1.fastq
Approx 85% complete for short_reads_1.fastq
Approx 90% complete for short_reads_1.fastq
Approx 95% complete for short_reads_1.fastq
Approx 100% complete for short_reads_1.fastq


### Answers 2
<code>
A. Sanger / Illumina 1.9 encoding
B. Most sequences have a very high mean quality (Mean Sequence Quality is equal to 39), but the per-base quality drops toward the end of the read.
C. Yes. On the "Per base sequence quality" plot, the drop in mean quality stays within the interquartile range (IQR).
D. The "Per base sequence content" plot shows that between the 13th and 33rd positions the reads have very little information entropy, whereas from the 1st to the 13th position the entropy is very high.
E. The GC content is higher than expected in most of the sequences. The distribution of GC content is also negatively skewed.
</code>
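The notion of information entropy used in answer D can be made concrete with a short Python sketch that computes the Shannon entropy of the base composition at each read position. This illustrates the idea only; it is not how FastQC decides on its warnings, and the toy reads below are invented for the example.

```python
import math
from collections import Counter


def per_position_entropy(reads):
    """Shannon entropy (in bits) of the base composition at each read position."""
    length = max(len(read) for read in reads)
    entropies = []
    for i in range(length):
        # Count bases at position i, skipping reads that are shorter than i + 1.
        counts = Counter(read[i] for read in reads if i < len(read))
        total = sum(counts.values())
        entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(entropy)
    return entropies


# Invented toy reads: position 0 is maximally mixed, positions 1 and 2 are constant.
toy_reads = ["ACGT", "CCGT", "GCGA", "TCGA"]
print([round(h, 2) for h in per_position_entropy(toy_reads)])
```

With four possible bases the entropy ranges from 0 bits (one base only) to 2 bits (all four bases equally frequent).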

### Exercise 3
Open <code>assembly-data/part1_qc/data/short_reads_2.fastq</code> in FastQC

Repeat the analysis and answer the following questions:

<code>
A. What do you make of the quality of the data?
B. What can we do about it?
</code>

In [2]:
%%bash
fastqc assembly-data/part1_qc/data/short_reads_2.fastq

Analysis complete for short_reads_2.fastq


Started analysis of short_reads_2.fastq
Approx 5% complete for short_reads_2.fastq
Approx 10% complete for short_reads_2.fastq
Approx 15% complete for short_reads_2.fastq
Approx 20% complete for short_reads_2.fastq
Approx 25% complete for short_reads_2.fastq
Approx 30% complete for short_reads_2.fastq
Approx 35% complete for short_reads_2.fastq
Approx 40% complete for short_reads_2.fastq
Approx 45% complete for short_reads_2.fastq
Approx 50% complete for short_reads_2.fastq
Approx 55% complete for short_reads_2.fastq
Approx 60% complete for short_reads_2.fastq
Approx 65% complete for short_reads_2.fastq
Approx 70% complete for short_reads_2.fastq
Approx 75% complete for short_reads_2.fastq
Approx 80% complete for short_reads_2.fastq
Approx 85% complete for short_reads_2.fastq
Approx 90% complete for short_reads_2.fastq
Approx 95% complete for short_reads_2.fastq
Approx 100% complete for short_reads_2.fastq


### Answers 3

<code>
A. The quality of the data is not as good as in the previous example. Some reads fall into the less than 50% quality score range.
B. Preprocessing techniques: dropping sequences with ambiguous nucleotides, trimming, masking, cutting off the 5' and 3' ends, and removal of adapters.
</code>
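Two of the preprocessing steps listed in answer B can be sketched in a few lines of Python: trimming low-quality bases from the 3' end and flagging reads with ambiguous nucleotides. This is a deliberately simple illustration, not the algorithm of any particular trimming tool; the function names, the threshold `min_q=20` and the example read are all assumptions.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def has_ambiguous_bases(sequence):
    """True if the read contains anything other than A, C, G or T (for example N)."""
    return any(base not in "ACGT" for base in sequence.upper())


# Invented read whose last two bases have very low quality ('#' is Q2, '!' is Q0).
seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
print(seq, qual)
print(has_ambiguous_bases("ACGTNACGT"))
```

Dedicated tools usually apply more robust strategies, such as sliding-window trimming and adapter matching, so this sketch should only be read as an explanation of the principle.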

### Exercise 4
Open <code>assembly-data/part1_qc/fail-examples/31_S7_L001_R2_001_fastqc.html</code> and <code>assembly-data/part1_qc/fail-examples/50_S26_L001_R1_001_fastqc.html</code>

What can you say about the quality of these data? Can you improve it?

### Answers 4
The quality of both examples is close to terrible.
No adapters are detected, and the information entropy is very high. In my opinion, the reads are too long for Illumina and the quality drops are too significant to ignore. I think the best solution would be to repeat the sequencing.

## Assembly

### de Bruijn graph
- Named after a Dutch mathematician, Nicolaas Govert de Bruijn
- A directed graph of sequences of symbols
- Nodes in the graph are k-mers (oligonucleotides of length <i>k</i>)
- Edges represent consecutive k-mers (which overlap by <i>k</i>-1 symbols)
<img src="https://i.imgur.com/SdXAgfP.png"
     alt="de Bruijn graph"
     style="float: left; margin-right: 2px;" />
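
The definition above (k-mers as nodes, edges between consecutive k-mers that overlap by k-1 symbols) can be turned into a short Python sketch. The function name `build_de_bruijn` and the toy reads are invented for illustration; real assemblers add error correction, coverage tracking and graph simplification on top of this idea.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer (node) to the sorted list of k-mers that directly follow it."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # make sure every k-mer exists as a node, even without successors
        for left, right in zip(kmers, kmers[1:]):
            # Consecutive k-mers within a read overlap by k-1 symbols.
            graph[left].add(right)
    return {node: sorted(successors) for node, successors in graph.items()}


# Two invented overlapping reads; the shared k-mer "GTC" joins them in the graph.
graph = build_de_bruijn(["ACGTC", "GTCA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Following the edges from `ACG` spells out `ACGTCA`, which is the sequence that both reads are consistent with.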

### Assembly software
#### Tools for short read assembly
- SPAdes
- MEGAHIT
- Velvet
- SKESA
- IDBA
- ABySS

#### Long read assembly
- Canu
- Wtdbg2
- Flye
- Ra
- Miniasm
- Shasta
- [Comparison of tools](https://github.com/rrwick/Long-read-assembler-comparison)

#### State-of-the-art pipeline for hybrid assembly
[Unicycler](https://github.com/rrwick/Unicycler)

#### Bandage as a tool for viewing assembly graphs
[Bandage](http://rrwick.github.io/Bandage/)


In [4]:
%%bash
Bandage 

QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-michal'


<img src="https://i.imgur.com/irgruKB.png"
     alt="Example of Bandage"
     style="float: left; margin-right: 2px;" />