# Assessing Data Quality

When DNA sequencing data is produced, it is stored in a file format called FastQ. This file contains not just the DNA sequence itself (i.e. the nucleotide sequence) but also information about the quality of that sequence. Sequencing DNA is not an error-free process; like any measurement, it has the potential for error. In generating millions of bases of DNA sequence, tens or hundreds of thousands of those bases may be miscalled. Fortunately, we can estimate the error using something called a [Phred](https://en.wikipedia.org/wiki/Phred_quality_score) score.

Now that we have the sequence data, we want to determine the quality of the sequencing reads.




## FastQ file format

A FastQ file (file extension `.fastq`) can contain hundreds of thousands or even millions of individual sequence reads. Each sequence read is represented by 4 lines of the file. Here is an example:

```bash
@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
```

Line|Description
----|-----------
1|Always begins with `@`, followed by information about the read (e.g. its direction, the machine it was sequenced on, etc.)
2|The actual DNA sequence
3|Always begins with `+`, sometimes followed by the same information as line 1
4|A string of characters representing the quality scores; must have the same number of characters as line 2
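
To make the four-line layout concrete, here is a minimal sketch that writes a tiny two-read FastQ file and uses `wc` and `awk` to count the reads and check that each quality string is as long as its sequence. The file name `example.fastq` and the reads in it are made up for illustration.

```bash
# Write a tiny, made-up FastQ file with two reads (hypothetical data)
cat > example.fastq <<'EOF'
@read1
CACTTGTAAG
+
@?@DDDDDDH
@read2
GGCANNNN
+
IIII####
EOF

# Every read takes exactly 4 lines, so reads = lines / 4
n_lines=$(wc -l < example.fastq)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Line 2 of each record is the sequence, line 4 the quality string;
# count the records where the two lengths disagree
mismatches=$(awk 'NR % 4 == 2 { seqlen = length($0) }
                  NR % 4 == 0 && length($0) != seqlen { bad++ }
                  END { print bad + 0 }' example.fastq)
echo "length mismatches: $mismatches"
```

Counting by line position (rather than searching for `@`) matters because `@` is also a valid quality character, as the first read shows. For a compressed file, the same commands could be fed from `gzip -dc file.fastq.gz` instead of reading the file directly.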

## Phred score in detail
A Phred score is a measure of uncertainty and error. The value of a [Phred](https://en.wikipedia.org/wiki/Phred_quality_score) score is an indication of how likely it is that a base reported in a sequencing read is in error. Here is the table describing the Phred score values:

Phred Quality Score|Probability of incorrect base call|Base call accuracy
-------------------|----------------------------------|------------------
10|1 in 10 |90%
20|1 in 100|99%
30|1 in 1000|99.9%
40|1 in 10,000|99.99%
50|1 in 100,000|99.999%
60|1 in 1,000,000|99.9999%
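
The table follows the relationship Q = -10 × log10(P), where P is the probability that the base call is wrong, so P = 10^(-Q/10). In the FastQ file each score is stored as a single character. The sketch below assumes the widely used Phred+33 encoding (score = ASCII code minus 33); some older Illumina files use an offset of 64 instead, so check which applies to your data. The function names are made up for this example.

```bash
# Assumption: Phred+33 encoding (score = ASCII code - 33)
phred_from_char() {
    # printf with a leading quote prints the character's ASCII code
    local code
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Error probability for a Phred score Q: P = 10^(-Q/10)
error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Decode three characters that appear in the example read
for c in '#' '@' 'I'; do
    q=$(phred_from_char "$c")
    echo "$c -> Q$q, error probability $(error_prob "$q")"
done
```

Under this assumption the long run of `#` at the end of the example read decodes to Q2, the lowest quality in that read, while `I` decodes to Q40.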

## Running and interpreting FastQC

[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is one of several programs we can use to determine the quality of DNA sequence data. A single FastQ file may have millions of individual sequencing reads, each with its own quality information (Phred scores). The FastQC report generates graphs and descriptive statistics that allow us to get a sense of the overall quality of a file of sequencing data. Once we have that information, we can use it later on to trim and/or filter out low quality data.

**Per base sequence quality**
This report shows the range of quality scores at each position along the read, across all reads. Poor quality at the beginning or end of the reads may suggest settings for trimming.

**Per sequence quality scores**
This report shows how the average quality score per read is distributed across your sequence file. Ideally, most reads will have a high average quality score. Populations of reads with lower average scores can be removed by downstream filtering.

**Adapter Content**
This report indicates the presence of sequencing adapters. If adapters are detected, you will need to remove them in downstream cleaning.
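
As a rough illustration of what the per sequence quality report summarises, the sketch below computes the mean Phred score of each read with `awk`. It assumes Phred+33 encoding and uses a made-up file (`mean_q_example.fastq`) and a hypothetical helper function (`mean_quality`); FastQC's own calculation may differ in detail.

```bash
# Made-up two-read FastQ file for illustration
cat > mean_q_example.fastq <<'EOF'
@good_read
ACGTACGT
+
IIIIIIII
@poor_read
ACGTNNNN
+
IIII####
EOF

# Print each read name with the mean Phred score of its quality line
# (assumes Phred+33: score = ASCII code - 33)
mean_quality() {
    awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 1 { name = substr($0, 2) }
         NR % 4 == 0 {
             total = 0
             for (j = 1; j <= length($0); j++)
                 total += ord[substr($0, j, 1)] - 33
             printf "%s\t%.1f\n", name, total / length($0)
         }' "$1"
}

mean_quality mean_q_example.fastq
```

A read like `poor_read`, whose second half is low quality, ends up with a much lower mean than `good_read`; reads like it form the low-scoring population that filtering would remove.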


## Installing FastQC

We will install FastQC using the [conda](https://docs.conda.io/en/latest/) package manager. We are doing this in the Jupyter notebook, but this and all other commands will work almost identically in any Linux terminal.

### Important!!! - complete each of the numbered steps

1. We will search for the tool we want to install

We will use the `conda search` command and the channel (`-c`) flag to search [bioconda](https://bioconda.github.io/)

In [None]:
conda search fastqc -c bioconda

2. Create a conda environment

Conda uses something called "environments", which are essentially isolated configurations on our computer where we can include all the compatible tools we need and exclude other tools which are unnecessary or would conflict with our desired tool. We will use the `-y` option to install without prompting the user for input and the `--name` option to name the environment for the tool. We will enforce versioning (`tool==version`) so that we know which version of a tool was used for an analysis should we wish to repeat it.

**Tip**: Use the latest version where possible, but if you get an error with dependencies, using a lower version may help. Some tools may never install successfully using conda, but we will face those when we have to.


In [None]:
conda create -y --name fastqc fastqc==0.11.9 -c bioconda

3. We will use the `conda init` command so that conda can be configured for this shell

In [None]:
conda init

4. **DON'T SKIP**: We need to restart the notebook's kernel (the process that runs our commands; not to be confused with the operating system [kernel](https://en.wikipedia.org/wiki/Kernel_(operating_system))). Go to the **Kernel** menu and choose **Restart Kernel**.

5. Finally, we can activate the conda environment using the name we gave it (`fastqc`). When you run the next cell it should return the name of the environment.

In [None]:
conda activate fastqc
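
If you want to double-check that the activation worked, one possible check is sketched below. It relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets and on FastQC's `--version` flag; the variable name `active_env` is just for this example.

```bash
# CONDA_DEFAULT_ENV is set by `conda activate`; fall back to "none" if unset
active_env=${CONDA_DEFAULT_ENV:-none}
echo "active environment: $active_env"

# Ask FastQC for its version only if it is on the PATH
if command -v fastqc >/dev/null 2>&1; then
    fastqc --version
else
    echo "fastqc not found on PATH"
fi
```

If the environment is active, the first line should report `fastqc` and the version should match the one requested when the environment was created.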

## Running FastQC

First, we will run FastQC on a small subset of the sequence reads. In this exercise, the reads are in the `concat_fastq` folder. We will run it on a sample of 100 reads from the polyrhiza experiment (`100_spolyrhiza_reads.fastq.gz`).

**Tip**: When using commands or searching for files, the tab key will help you autocomplete (and help ensure the files and commands you think you have are actually accessible).

1. Run fastqc on the `100_spolyrhiza_reads.fastq.gz` file located in `concat_fastq`

In [None]:
fastqc data/input/concat_fastq/100_spolyrhiza_reads.fastq.gz

2. In the file browser (left of your screen) navigate into the `concat_fastq` folder to locate the following outputs:

There are two outputs from the fastqc command:

- An .html file (that can be viewed as a webpage)
- A .zip file that contains individual images and reports

FastQC writes its output to the same location as the original FastQ files. This can get messy, so let's clean up and organize.

3. Make a new directory to store the .zip and .html files that are the outputs of the FastQC analysis

In [None]:
mkdir -p data/output/fastqc_sample_results

4. Inside this directory, let's make a folder specifically for the 100 reads analysis

In [None]:
mkdir -p data/output/fastqc_sample_results/100_reads

5. We will move the .html and .zip files from the `concat_fastq` folder into the folder we have created. The `*` wildcard allows us to move all files that have those file extensions.

In [None]:
mv data/input/concat_fastq/*.zip data/output/fastqc_sample_results/100_reads
mv data/input/concat_fastq/*.html data/output/fastqc_sample_results/100_reads

6. Use the `ls` command to confirm the results have been moved to the appropriate directory. 

In [None]:
ls data/output/fastqc_sample_results/100_reads

7. Use the file browser to take a look at the results. Double-click the .html file and navigate through the different reports.
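
As an alternative to moving the files afterwards, FastQC's `-o` (`--outdir`) option writes the reports directly into an existing directory. The sketch below reuses the paths from the steps above; the variable names and the `command -v` guard are additions for this example, there only so the cell does not fail if the environment is not active or the file is missing.

```bash
outdir=data/output/fastqc_sample_results/100_reads
reads=data/input/concat_fastq/100_spolyrhiza_reads.fastq.gz

# FastQC expects the output directory to exist already
mkdir -p "$outdir"

if command -v fastqc >/dev/null 2>&1 && [ -f "$reads" ]; then
    # Write the .html and .zip reports straight into $outdir
    fastqc -o "$outdir" "$reads"
    ls "$outdir"
else
    echo "fastqc or $reads not found; skipping"
fi
```

Writing straight to the output folder keeps the input folder clean and avoids a wildcard `mv` picking up unrelated files.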

# Questions

The documentation for each FastQC report is located in the [FastQC help](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/).


- How many sequences were analyzed; what were their lengths?

- Is the per base sequence quality more or less the same across the length of the sequence? Are there locations of the sequence that tend to have higher or lower Phred scores?


- Are there overrepresented sequences? Can they be identified?

# Challenges

1. Repeat this analysis for the sample of 1000 reads (`concat_fastq/1000_spolyrhiza_reads.fastq.gz`) and then the entire data set (`concat_fastq/spolyrhiza_reads.fastq.gz`). Save your work in this notebook. When you terminate this analysis in CyVerse, you can share all the outputs with your instructor.

**Tip**: Run fastqc with the `-t` (threads) option to make use of the available CPUs. For example, `fastqc -t 16 reads.fastq.gz` would use 16 threads. According to `fastqc --help`, each thread processes one file at a time, so extra threads help most when several files are given in one command. Make sure the number of threads does not exceed the number of CPUs you launched with in CyVerse. You can check that number with `cat /proc/cpuinfo`: add 1 to the last processor listed, since the count starts at 0. See also `fastqc --help`.
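
Rather than reading through `/proc/cpuinfo` by eye, the CPU count can be captured in a variable. This is a minimal sketch assuming a Linux machine; `nproc` is part of GNU coreutils, and `reads.fastq.gz` is the same placeholder file name used in the tip.

```bash
# Count the CPUs visible to this session
if command -v nproc >/dev/null 2>&1; then
    threads=$(nproc)
else
    # Fallback: count the "processor" entries in /proc/cpuinfo (Linux only)
    threads=$(grep -c '^processor' /proc/cpuinfo)
fi
echo "CPUs available: $threads"

# Print the command rather than running it; reads.fastq.gz is a placeholder
echo "fastqc -t $threads reads.fastq.gz"
```

Counting the `processor` lines gives the total directly, so there is no need to add 1 to the last index.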