# FastQC Report Analysis

All .fastq files obtained from the .sra files were passed to FastQC to get a quality report on the sequencing data. The individual .fastq files are listed below:

<b>Control Group Replicate 1</b>
 - SRR3478128_1.fastq
 - SRR3478128_2.fastq
 
<b>Control Group Replicate 2</b>
 - SRR3478129_1.fastq
 - SRR3478129_2.fastq
 
<b>METTL14 KnockDown Replicate 1</b>
 - SRR3478130_1.fastq
 - SRR3478130_2.fastq
 
<b>METTL14 KnockDown Replicate 2</b>
 - SRR3478131_1.fastq
 - SRR3478131_2.fastq
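
As a minimal sketch, the file list above can be generated from the four run accessions and turned into a FastQC command line. The `data_dir` and `outdir` names are assumptions for illustration; the notebook does not state where the files live or how FastQC was actually invoked.

```python
from pathlib import Path

# Run accessions from the list above; "_1" and "_2" mark the two split files.
ACCESSIONS = ["SRR3478128", "SRR3478129", "SRR3478130", "SRR3478131"]


def fastq_paths(data_dir="."):
    """Return the eight expected .fastq paths (data_dir is an assumed location)."""
    return [Path(data_dir) / f"{acc}_{mate}.fastq"
            for acc in ACCESSIONS for mate in (1, 2)]


def fastqc_command(files, outdir="fastqc_reports"):
    """Build the argument list for one FastQC run; nothing is executed here."""
    return ["fastqc", "--outdir", outdir, *map(str, files)]
```

Passing the result to `subprocess.run(fastqc_command(fastq_paths()), check=True)` would start the analysis, assuming `fastqc` is on the `PATH` and the output directory already exists.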


The _1 and _2 files of each run are the split files generated during the fastq dumping process. They hold the two mates of each read pair: the same fragment, sequenced from opposite ends.
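
Because the two split files describe the same fragments, they should list the same read identifiers in the same order. The sketch below checks this on already opened files (or any iterable of lines). It assumes plain 4-line FASTQ records and identifiers that carry no `/1` or `/2` suffix, which may not hold for every dump setting.

```python
from itertools import zip_longest


def read_ids(lines):
    """Yield the read identifier from each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # Drop the leading "@" and anything after the first space.
            yield line.strip()[1:].split(" ")[0]


def mates_in_sync(lines_1, lines_2):
    """True if both inputs list the same read IDs in the same order."""
    pairs = zip_longest(read_ids(lines_1), read_ids(lines_2))
    return all(a == b for a, b in pairs)
```

A mismatch or a missing record in either file makes `mates_in_sync` return `False`.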

FastQC checks a total of 12 categories in the given sequencing data:

Basic Statistics, Per Base Sequence Quality, Per Tile Sequence Quality, Per Sequence Quality Scores, Per Base Sequence Content, Per Sequence GC Content, Per Base N Content, Sequence Length Distribution, Sequence Duplication Levels, Overrepresented Sequences, Adapter Content and Kmer Content.

Since almost all replicates and samples gave essentially the same results, one representative picture is shown for each part of the quality check.
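
One way to back up the claim that the samples behave alike is to compare the per-module status lines that FastQC writes to `summary.txt` inside each report archive. The sketch below assumes the tab-separated layout of status, module name and file name; the helper names are hypothetical.

```python
def parse_summary(text):
    """Map module name -> status (PASS/WARN/FAIL) from one summary.txt."""
    statuses = {}
    for line in text.splitlines():
        if line.strip():
            status, module, _filename = line.split("\t")
            statuses[module] = status
    return statuses


def differing_modules(summaries):
    """Return the modules whose status is not identical across all samples."""
    modules = set().union(*(s.keys() for s in summaries.values()))
    return sorted(m for m in modules
                  if len({s.get(m) for s in summaries.values()}) > 1)
```

Here `summaries` is a dictionary from sample name to the result of `parse_summary`; an empty list from `differing_modules` means every file received the same verdict in every category.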

## Basic Statistics

![Screen%20Shot%202020-09-29%20at%2021.29.16.png](attachment:Screen%20Shot%202020-09-29%20at%2021.29.16.png)

The Basic Statistics category passes the check for all .fastq files. This is expected, since this category never raises a warning or an error; it only gives general information on the file contents.

## Per Base Sequence Quality

![Screen%20Shot%202020-09-29%20at%2021.37.41.png](attachment:Screen%20Shot%202020-09-29%20at%2021.37.41.png)



In the figure above, we can see a box plot for each position in the read. All files passed this test and all look alike, so this figure can be taken as representative of all of them.

The blue line connecting the box plots shows the mean quality. It declines towards the end of the read, but it stays in the green area, which corresponds to a high quality score. It is normal for the quality to decrease towards the end of the reads; this happens on almost all sequencing platforms.
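
The mean-quality line can be reproduced from the raw quality strings (every fourth line of a FASTQ record). This is a minimal sketch that assumes Phred+33 encoding; it is not FastQC's own implementation.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First time this position is seen: start its running sums.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

For example, `mean_quality_per_position(["II", "I5"])` gives `[40.0, 30.0]`, since `I` encodes a score of 40 and `5` a score of 20.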

## Per Tile Sequence Quality

![Screen%20Shot%202020-10-12%20at%2017.59.18.png](attachment:Screen%20Shot%202020-10-12%20at%2017.59.18.png)

The figure above shows two different results. The left-side figure represents the forward reads of the split files (denoted as _1 in the .fastq file names). The right-side figure represents the reverse reads of the split files (denoted as _2 in the .fastq file names). The forward reads give only a warning, while the reverse reads give an error in this section.

The plots show the flowcell tiles that the reads in the fastq file came from, with the quality of each tile at each base position. Colors run from cold to hot: dark blue marks positions where a tile's quality is at or above the average for that base, while red marks tiles with worse quality than the others. Even though the right-side figure gives an error, there seems to be no significant problem with the reads overall.

## Per Sequence Quality Scores

![Screen%20Shot%202020-10-12%20at%2022.03.02.png](attachment:Screen%20Shot%202020-10-12%20at%2022.03.02.png)

This section shows whether a subset of the sequences has significantly lower quality than the rest. Both the forward and the reverse fastq files pass this test.

## Per Base Sequence Content

![Screen%20Shot%202020-10-12%20at%2022.06.33.png](attachment:Screen%20Shot%202020-10-12%20at%2022.06.33.png)

This plot shows the proportion of each base called at a specific base position across all reads in the file. This section gives a warning in both the forward and the reverse fastq files, which is most likely due to the beginning of the reads (first 12-13 bases). Even though it's concerning, it's normal for the 5' end of reads to show a biased base composition. In RNA-seq libraries this is commonly attributed to random priming during library preparation rather than to sequencing errors.
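
The per-position proportions can be sketched as below, together with a helper that lists the positions where A differs from T, or G from C, by more than a threshold. The 10% default is taken from the FastQC documentation for this module as we understand it and should be treated as an assumption; both function names are hypothetical.

```python
from collections import Counter


def base_fractions(reads):
    """Per-position fraction of A, C, G, T (N and other symbols are ignored)."""
    columns = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(columns):
                columns.append(Counter())
            if base in "ACGT":
                columns[i][base] += 1
    result = []
    for col in columns:
        total = sum(col.values()) or 1  # avoid dividing by zero on all-N columns
        result.append({b: col[b] / total for b in "ACGT"})
    return result


def imbalanced_positions(fractions, threshold=0.10):
    """0-based positions where |A-T| or |G-C| exceeds the threshold."""
    return [i for i, f in enumerate(fractions)
            if abs(f["A"] - f["T"]) > threshold
            or abs(f["G"] - f["C"]) > threshold]
```

If the flagged positions all fall within the first dozen bases, that would be consistent with the start-of-read bias described above.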

## Per Sequence GC Content

![Screen%20Shot%202020-10-12%20at%2022.12.30.png](attachment:Screen%20Shot%202020-10-12%20at%2022.12.30.png)

This part shows the distribution of GC content, measured over the whole length of each sequence in the input fastq file (red), and compares it to a normal distribution of GC content (blue). If the sum of deviations from the blue line represents more than 15% of the reads, this part gives a warning, and above 30% it gives an error, most likely indicating a contamination issue. Our forward and reverse samples have passed this part.
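
The red curve is essentially a histogram of per-read GC percentages. A minimal sketch of that calculation follows; ignoring N calls and rounding to whole percentages are simplifying assumptions, not a description of FastQC's internals.

```python
def gc_percent(read):
    """GC percentage of one read, ignoring N calls."""
    read = read.upper()
    called = sum(read.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (read.count("G") + read.count("C")) / called


def gc_histogram(reads):
    """Count reads per rounded GC percentage (index 0 to 100)."""
    hist = [0] * 101
    for read in reads:
        hist[round(gc_percent(read))] += 1
    return hist
```

Plotting `gc_histogram(reads)` against a normal curve fitted to the same data would give a rough equivalent of the figure above.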

## Per Base N Content

![Screen%20Shot%202020-10-12%20at%2022.14.07.png](attachment:Screen%20Shot%202020-10-12%20at%2022.14.07.png)

In sequencing, if the position that is read can't be assigned a base with sufficient certainty, an N is assigned. This part checks whether the sequences contain N calls and, if so, at what percentage per position. It gives a warning if any position has more than 5% N and an error above 20%. Our sequences have no Ns assigned, which is very good.
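
The percentage of N calls at each position can be computed directly from the read sequences. This is a small sketch with a hypothetical function name, not the FastQC implementation.

```python
def n_percent_per_position(reads):
    """Percentage of N calls at each read position."""
    n_counts, totals = [], []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(totals):
                n_counts.append(0)
                totals.append(0)
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [100.0 * n / t for n, t in zip(n_counts, totals)]
```

For files like ours, with no N calls, every entry of the returned list would be `0.0`.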

## Sequence Length Distribution

![Screen%20Shot%202020-10-12%20at%2022.14.30.png](attachment:Screen%20Shot%202020-10-12%20at%2022.14.30.png)

This part shows the length of the sequences in the fastq file. Some high-throughput sequencing methods produce sequences of equal length (like Illumina, used here). This part raises a warning when the sequence lengths differ, but the warning can be ignored if another sequencing method was used and different sequence lengths are expected. In our case, there was no warning.

## Sequence Duplication Levels

![Screen%20Shot%202020-10-12%20at%2022.14.55.png](attachment:Screen%20Shot%202020-10-12%20at%2022.14.55.png)

This part shows the degree of duplication among the sequences in our fastq file. A high level of duplication might indicate an enrichment bias.

In this part of the analysis, our samples give an error. As can be seen on the graph, there is a high level of duplicated sequences, shown as two high peaks in the blue line. Even though this is concerning, we should also note that human RNA-seq data contains a very large number of transcripts whose expression levels vary widely. To be able to detect lowly expressed transcripts, over-sequencing the highly expressed transcripts is sometimes necessary. This is most likely the reason for the unwanted peaks.
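
To make the idea of a duplication level concrete, the sketch below counts exact duplicates among a list of reads. It is a simplification: FastQC estimates duplication from a subset of the file, so these numbers would not match its report exactly.

```python
from collections import Counter


def duplication_summary(reads):
    """Return ({duplication level: distinct sequences}, % left after dedup)."""
    counts = Counter(reads)             # sequence -> number of occurrences
    total = sum(counts.values())
    levels = Counter(counts.values())   # occurrences -> how many sequences
    pct_remaining = 100.0 * len(counts) / total if total else 0.0
    return dict(levels), pct_remaining
```

A highly expressed transcript shows up here as a sequence with a large duplication level, which is the effect described above for RNA-seq data.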

## Overrepresented Sequences

No overrepresented sequences were observed in any of the fastq files.

## Adapter Content

![Screen%20Shot%202020-10-12%20at%2022.15.15.png](attachment:Screen%20Shot%202020-10-12%20at%2022.15.15.png)

This part checks for adapter content in our sequences. Since adapters were trimmed, this part shows zero adapters and gives no warnings/errors.

## Kmer Content

![Screen%20Shot%202020-10-12%20at%2022.16.02.png](attachment:Screen%20Shot%202020-10-12%20at%2022.16.02.png)

This plot shows whether some kmers are overrepresented in the sequences in our fastq file. This might be due to a bias, but it might also be due to the natural occurrence of these sequences biologically. Our fastq files give an error for this part.

As can be seen in the plot, this error comes from the beginning of the sequences. It is not something to be alarmed about, as it is most likely due to random priming.
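
To see where in the reads a kmer is enriched, its occurrences can be tallied by start position. The sketch below only does the counting, with `k=7` as an assumed default; it does not reproduce the statistical test FastQC applies.

```python
from collections import Counter, defaultdict


def kmer_counts_by_position(reads, k=7):
    """Map each kmer to a Counter of the 0-based positions where it starts."""
    table = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]][i] += 1
    return table
```

A kmer whose counts pile up at positions 0 to about 12 would match the start-of-read enrichment seen in the plot.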