# Initial QC for Nanopore and Illumina SARS-CoV-2 genome sequencing
In this notebook we will analyze sequencing results from two approaches to sequencing the SARS-CoV-2 genome. Both are based on the **ARTIC protocol**, developed by the [ARTIC Network](https://artic.network/ncov-2019). For Illumina, we use the classic ARTIC protocol, which amplifies the SARS-CoV-2 genome in 98 fragments of about 400 bp each. For Nanopore, the protocol is called the **"Midnight" protocol** and is based on amplifying 29 overlapping fragments of about 1200 bp that cover the whole SARS-CoV-2 genome.
The contents of the notebook can be summarized as:

* Download the data
* Install the software and prepare the environment
* Run the sequencing quality control


### Download the data

In [None]:
!wget https://zenodo.org/records/10681134/files/module_2.tar.gz

### Extract the .tar.gz archive

In [None]:
!tar xvf module_2.tar.gz

### Install condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Install software

In [None]:
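# -y answers the confirmation prompt automatically so the cell does not wait for input
!conda install -y -c conda-forge -c bioconda fastqc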

In [None]:
!pip install nanoplot

In [None]:
!pip install multiqc

### Recap: the FASTQ format

All sequencers produce data in a format called **FASTQ**. The structure is shown below. Every sequence in a FASTQ file is represented by 4 lines:

```
@SEQ_ID                   <---- SEQUENCE NAME
AGCGTGTACTGTGCATGTCGATG   <---- SEQUENCE AS BASES
+                         <---- SEPARATOR LINE
%%).1***-+*''))**55CCFF   <---- ASCII QUALITY SCORES
```
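To make the 4-line layout concrete, here is a minimal sketch of a FASTQ reader in plain Python. The function names `read_fastq` and `open_fastq` are illustrative helpers, not part of FastQC or NanoPlot, and the sketch assumes each record uses exactly four lines (no wrapped sequences).

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each 4-line FASTQ record."""
    while True:
        line = handle.readline()
        if not line:
            return  # end of file
        header = line.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # separator line starting with "+", ignored
        quality = handle.readline().rstrip("\r\n")
        # A valid record starts with "@" and has one quality character per base.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


# Parse the example record shown above from an in-memory string.
example = "@SEQ_ID\nAGCGTGTACTGTGCATGTCGATG\n+\n%%).1***-+*''))**55CCFF\n"
records = list(read_fastq(io.StringIO(example)))
name, sequence, quality = records[0]
print(name, len(sequence), len(quality))
```

For a real file, `sum(1 for _ in read_fastq(open_fastq(path)))` would count the reads, which is the same number FastQC reports as "Total Sequences".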

The quality of each base is encoded as a single ASCII character, so the quality line has the same length as the sequence line. See [here](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm) for an explanation.
The numerical values behind the characters are Phred quality scores.
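As a small illustration of the encoding, the sketch below converts the quality string from the example record into Phred scores. It assumes the Phred+33 offset used by current Illumina and Nanopore FASTQ files; the function names are illustrative only. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base call is wrong.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


quality = "%%).1***-+*''))**55CCFF"
scores = phred_scores(quality)
print(scores)
print(f"mean Q = {sum(scores) / len(scores):.1f}")
print(f"P(error) at Q20 = {error_probability(20)}")
```

In this example the first bases have low scores (the `%` character is Q4) while the last ones are high (`F` is Q37).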

# Illumina QC

We will use [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to analyze the results of an Illumina run. FastQC runs a series of analyses on FASTQ files and reports the results as an HTML file that you can open in a browser. For help on any section of the report, see the following links.

*   [Basic statistics](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/1%20Basic%20Statistics.html)
*   [Per base sequence quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/2%20Per%20Base%20Sequence%20Quality.html)
*   [Per base sequence content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html)
*   [Per sequence GC content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/5%20Per%20Sequence%20GC%20Content.html)
*   [Per base N content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/6%20Per%20Base%20N%20Content.html)
*   [Sequence length distribution](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/7%20Sequence%20Length%20Distribution.html)
*   [Duplicate Sequences](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html)
*   [Overrepresented Sequences](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html)
*   [Adapter content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Adapter%20Content.html)
*   [Kmer content](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html)
*   [Per tile sequence quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html)


Run FastQC from the command line:

In [None]:
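# Create a directory to store all FastQC results (-p avoids an error if it already exists)
!mkdir -p Illumina_fastqc_results
# Run FastQC on every Illumina FASTQ file and write the reports to that directory
!fastqc -o Illumina_fastqc_results /content/Illumina_fastq/fastq/*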

As in the previous module, we can summarize the FastQC results with MultiQC:

In [None]:
!multiqc -o /content/Illumina_fastqc_results/ /content/Illumina_fastqc_results/

This creates an HTML report (`multiqc_report.html`) that summarizes the FastQC reports.

Browse the results for each file and report:

> **Which sample has the most reads?**

> **Do the reads vary in length, and if so, how are the lengths distributed?**

# Nanopore QC

Run FastQC from the command line. Note that FastQC is not a good choice for Nanopore data; NanoPlot, which we run afterwards, is better suited to long reads.

In [None]:
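# Create the output directory (-p avoids an error if it already exists)
!mkdir -p Nanopore_FastQC_report
# Run FastQC on the FASTQ files of every barcode
!fastqc -o Nanopore_FastQC_report /content/Nanopore_READS/nanopore_fastq/barcode*/*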

In [None]:
!multiqc -o /content/Nanopore_FastQC_report/ /content/Nanopore_FastQC_report/

Run NanoPlot on the Nanopore data:

In [None]:
!NanoPlot -o nanoplot_output --fastq_rich /content/Nanopore_READS/nanopore_fastq/barcode*/*.fastq.gz 

The output will be in the folder `nanoplot_output`. Download the file `NanoPlot-report.html` and browse the results.

---


> **How many reads are there in total?**

> **What is the average read length?**

> **How does this compare with the Illumina results?**
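When reading the NanoPlot summary, it helps to know how the mean read length and the read length N50 are computed. The sketch below shows both on a made-up list of read lengths; the numbers are hypothetical and do not come from this dataset, and the function names are illustrative rather than NanoPlot's own code.

```python
from typing import List


def mean_length(lengths: List[int]) -> float:
    """Arithmetic mean of the read lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    half_of_bases = sum(lengths) / 2
    running_total = 0
    # Walk from the longest read down, accumulating bases until half are covered.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise AssertionError("unreachable for a non-empty list of positive lengths")


# Hypothetical read lengths, for illustration only.
toy_lengths = [1200, 1180, 1150, 900, 400, 150]
toy_mean = mean_length(toy_lengths)
toy_n50 = n50(toy_lengths)
print(f"mean = {toy_mean:.0f} bp, N50 = {toy_n50} bp")
```

A few short reads pull the mean down much more than the N50, which is why the two values can differ noticeably for Nanopore runs.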




We don't do any trimming here because the pipelines we'll use do it for us. See you in the next notebook...