# HW06 Data Check: Quality Control and Trimming

This notebook will walk you through checking your data and gathering the information you need to write up your Deep Dive for this section of the learning module.

This notebook will check the analyses you ran for read quality control and trimming. 

## Getting Started

Before we get started you will need to set several variables that we will use throughout this notebook. 

In [None]:
# set the variables for your netid and xfile
# note that each person has 8 SRA accession ids in the xfile.
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/06_qc_trimming"
%cd $work_dir

In [None]:
# Set the fastq directory. This is where we have our raw fastq files.
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"

### Step 1: Checking your quality control steps

You should have already run the homework hw06_qc_trimming.ipynb. Now we need to check your files and make sure you are ready to move on to the next step in the project.

#### Do you have the expected files?

First, let's check that you have the expected list of files, showing both the log files and the directories created in the analysis.

In [None]:
# check the list of files
!ls

#### Do you see something like this?

```
06A_fastqc-0.out        06B_trim-2.out    06C_fastqc-6.out
06A_fastqc-1.out        06B_trim-3.out    06C_fastqc-7.out
06A_fastqc-2.out        06B_trim-4.out    06C_run_fastqc.sh
06A_fastqc-3.out        06B_trim-5.out    06_launch_pipeline.sh
06A_fastqc-4.out        06B_trim-6.out    after_qc_trimming
06A_fastqc-5.out        06B_trim-7.out    before_qc_trimming
06A_fastqc-6.out        06C_fastqc-0.out  config.sh
06A_fastqc-7.out        06C_fastqc-1.out  trimmed_reads
06A_run_fastqc.sh       06C_fastqc-2.out  TruSeq3-PE-2.fa
06B_run_trimmomatic.sh  06C_fastqc-3.out  unpaired_reads
06B_trim-0.out          06C_fastqc-4.out
06B_trim-1.out          06C_fastqc-5.out
```

Great!
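If you would rather not compare the listing by eye, here is a minimal Python sketch that looks for the same log files and directories. It assumes 8 tasks per step (numbered 0-7, as in the listing above) and that you run it from your working directory. The names `find_missing`, `LOG_PREFIXES`, and `EXPECTED_DIRS` are made up for this example and are not part of the homework scripts.

```python
from pathlib import Path

# ASSUMPTION: 8 tasks (0-7) per step, matching the example listing above.
N_TASKS = 8
LOG_PREFIXES = ["06A_fastqc", "06B_trim", "06C_fastqc"]
EXPECTED_DIRS = ["before_qc_trimming", "after_qc_trimming",
                 "trimmed_reads", "unpaired_reads"]


def find_missing(work_dir, n_tasks=N_TASKS):
    """Return the expected log files and directories that are absent from work_dir."""
    root = Path(work_dir)
    expected_logs = [f"{prefix}-{i}.out"
                     for prefix in LOG_PREFIXES
                     for i in range(n_tasks)]
    missing = [name for name in expected_logs if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


# Check the current directory (the notebook already did `%cd $work_dir`).
missing = find_missing(".")
if missing:
    print("Missing:", missing)
else:
    print("All expected outputs found.")
```

An empty "Missing" list only tells you the files exist. It does not tell you the jobs finished cleanly, so it is still worth opening a couple of the `.out` logs.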

### Checking your output files

Let's check and make sure trimmomatic did a good job. 

Here are a few things we will check:

1. File size: Are the trimmed files smaller than the original raw files?
2. Poor quality sequences: How many sequences are flagged as being poor quality?
3. Sequence length: What is the range of sequence length?
4. Read counts: How many reads do we have before and after trimming?
5. Fastqc tests: Are we passing all of our tests with fastqc?
6. Fastqc plots: What do the plots show us? Have we done a good job trimming our reads?


In [None]:
# First, let's check the file sizes for our trimmed reads
# Notice I am using the "du" or disk usage command.
!echo "trimmed:"
!du -h /xdisk/bhurwitz/bh_class/$netid/assignments/06_qc_trimming/trimmed_reads/*fastq.gz

In [None]:
# Let's compare the file sizes above to the untrimmed raw reads that we started with.
# The file sizes should be smaller for the trimmed reads (above)
# ...but not too small (meaning way too much got trimmed)
!echo "raw:"
!du -h /xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data/*fastq.gz
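To put a number on "smaller, but not too small", the sketch below sums the sizes of the `*fastq.gz` files in each directory and reports the trimmed total as a fraction of the raw total. It assumes you are in your `06_qc_trimming` working directory, so the raw reads sit at `../05_getting_data`. The function names are hypothetical, and the sketch compares directory totals because the exact names of the trimmed files are not assumed here.

```python
from pathlib import Path


def total_size_bytes(directory, pattern="*fastq.gz"):
    """Sum the sizes (in bytes) of the files in directory that match pattern."""
    return sum(path.stat().st_size for path in Path(directory).glob(pattern))


def size_ratio(raw_dir, trimmed_dir):
    """Return trimmed/raw total size, or None if no raw files are found."""
    raw = total_size_bytes(raw_dir)
    trimmed = total_size_bytes(trimmed_dir)
    return trimmed / raw if raw else None


# ASSUMPTION: run from the 06_qc_trimming working directory.
ratio = size_ratio("../05_getting_data", "trimmed_reads")
if ratio is None:
    print("No raw fastq.gz files found; check the paths.")
else:
    print(f"Trimmed reads are {ratio:.0%} of the raw file size.")
```

Keep in mind that these are compressed file sizes, so the ratio is only a rough guide. The read counts later in this notebook are the better measure of how much was trimmed.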

#### Unzipping the report files

Fastqc creates interactive html graphs we can look at (more on that in a minute), but also gives us data we can quickly look through using Unix. First we are going to check those text-based results... 

In [None]:
# unzip the fastqc results (for the before and after trimming reports)
# Note that the unzip command will produce a bunch of output!
# The -o option overwrites files without asking, so it is safe to re-run this cell.
%cd $work_dir
%cd before_qc_trimming
!for file in *.zip; do unzip -o "$file"; done
%cd ..
%cd after_qc_trimming
!for file in *.zip; do unzip -o "$file"; done
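If the unzip output is too noisy, the same step can be sketched with Python's standard `zipfile` module, which extracts quietly. This is an alternative to the cell above, not something the homework requires, and `unzip_reports` is a hypothetical helper name. It assumes you run it from your `06_qc_trimming` working directory.

```python
import zipfile
from pathlib import Path


def unzip_reports(qc_dir):
    """Extract every *.zip in qc_dir into qc_dir; return how many archives were extracted."""
    qc_dir = Path(qc_dir)
    archives = sorted(qc_dir.glob("*.zip"))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(qc_dir)  # existing files are overwritten
    return len(archives)


# ASSUMPTION: run from the 06_qc_trimming working directory.
for subdir in ("before_qc_trimming", "after_qc_trimming"):
    print(subdir, "archives extracted:", unzip_reports(subdir))
```

A count of 0 for either directory means no zip files were found there, which usually points to being in the wrong directory.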

#### Compiling the results across files

Next, we are going to get some stats for each of our files from the "before" and "after" fastqc_data.txt files.

Each of these files has a report that looks like this:

```
Filename	ERR9752317_1.fastq.gz
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	23721314
Sequences flagged as poor quality	0
Sequence length	2-151
%GC	47
```
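Each line in that report is a name and a value separated by a tab, so it is easy to read into a Python dictionary. The sketch below is one way to do that. It assumes the Basic Statistics block is the first module in `fastqc_data.txt` and ends at the first `>>END_MODULE` line, and `parse_basic_stats` is a hypothetical name. The example text reuses the values shown above.

```python
def parse_basic_stats(report_text):
    """Return the Basic Statistics name/value pairs from fastqc_data.txt text."""
    stats = {}
    for line in report_text.splitlines():
        if line.startswith(">>END_MODULE"):
            break  # ASSUMPTION: Basic Statistics is the first module in the file
        if line.startswith(("#", ">>")) or "\t" not in line:
            continue  # skip header and module marker lines
        name, value = line.split("\t", 1)
        stats[name] = value.strip()
    return stats


# Example using the report lines shown above.
example = (
    "Filename\tERR9752317_1.fastq.gz\n"
    "Total Sequences\t23721314\n"
    "Sequences flagged as poor quality\t0\n"
    "Sequence length\t2-151\n"
    "%GC\t47\n"
)
stats = parse_basic_stats(example)
print(stats["Filename"], stats["Total Sequences"], stats["Sequence length"])
```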

We are going to combine these into a single file so we can see all the data together, and create a table for before and after.

In [None]:
# First we will create the "before" table
%cd $work_dir
!egrep "Filename" ./before_qc_trimming/*/fastqc_data.txt | cut -f2 > b_filename.txt
!egrep "Total Sequences" ./before_qc_trimming/*/fastqc_data.txt | cut -f2 > b_total_seqs.txt
!egrep "Sequences flagged as poor quality" ./before_qc_trimming/*/fastqc_data.txt | cut -f2 > b_poor_quality.txt
!egrep "Sequence length" ./before_qc_trimming/*/fastqc_data.txt | cut -f2 > b_seq_range.txt
!egrep "%GC" ./before_qc_trimming/*/fastqc_data.txt | cut -f2 > b_gc_content.txt
!paste b_filename.txt b_total_seqs.txt b_poor_quality.txt b_seq_range.txt b_gc_content.txt > before_all_report.txt
!cat before_all_report.txt

In [None]:
# Next, we will create the after trimming table
# what do you notice?
%cd $work_dir
!egrep "Filename" ./after_qc_trimming/*/fastqc_data.txt | cut -f2 > a_filename.txt
!egrep "Total Sequences" ./after_qc_trimming/*/fastqc_data.txt | cut -f2 > a_total_seqs.txt
!egrep "Sequences flagged as poor quality" ./after_qc_trimming/*/fastqc_data.txt | cut -f2 > a_poor_quality.txt
!egrep "Sequence length" ./after_qc_trimming/*/fastqc_data.txt | cut -f2 > a_seq_range.txt
!egrep "%GC" ./after_qc_trimming/*/fastqc_data.txt | cut -f2 > a_gc_content.txt
!paste a_filename.txt a_total_seqs.txt a_poor_quality.txt a_seq_range.txt a_gc_content.txt > after_all_report.txt
!cat after_all_report.txt
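The two report files can also be combined into a single before/after read-count table. The sketch below reads the first two columns (filename and total sequences) of each report and matches rows by sample. It assumes that both the raw and the trimmed file names start with `<accession>_<1 or 2>` (for example `ERR9752317_1`). Check that against your own file names, because the trimmed names depend on how your trimming script wrote them. All function names here are hypothetical.

```python
import csv
import os
import re


def load_report(path):
    """Read a tab-separated report into {filename: total_sequences}."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:
                counts[row[0]] = int(row[1])
    return counts


def sample_key(filename):
    """ASSUMPTION: names start with '<accession>_<1|2>', e.g. ERR9752317_1."""
    match = re.match(r"([A-Za-z]+\d+_[12])", filename)
    return match.group(1) if match else filename


def read_count_table(before, after):
    """Return rows of (sample, reads_before, reads_after, percent_kept)."""
    after_by_key = {sample_key(name): count for name, count in after.items()}
    rows = []
    for name, n_before in sorted(before.items()):
        key = sample_key(name)
        n_after = after_by_key.get(key)
        if n_after is not None and n_before:
            percent_kept = round(100 * n_after / n_before, 1)
        else:
            percent_kept = None  # no matching trimmed file was found
        rows.append((key, n_before, n_after, percent_kept))
    return rows


# Only runs once both report files from the cells above exist.
if os.path.exists("before_all_report.txt") and os.path.exists("after_all_report.txt"):
    table = read_count_table(load_report("before_all_report.txt"),
                             load_report("after_all_report.txt"))
    print("sample\tbefore\tafter\tpercent_kept")
    for row in table:
        print(*row, sep="\t")
```

A `None` in the last two columns means no trimmed file matched that sample, which most likely means the file-name assumption does not hold for your data.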

#### Do our sequences pass now? 

Let's check to see if our sequences are passing the quality checks after trimming.

You should see something like this for the "before" summary.txt file for one of your samples.

```
PASS	Basic Statistics	ERR9752317_1.fastq.gz
PASS	Per base sequence quality	ERR9752317_1.fastq.gz
PASS	Per sequence quality scores	ERR9752317_1.fastq.gz
FAIL	Per base sequence content	ERR9752317_1.fastq.gz
WARN	Per sequence GC content	ERR9752317_1.fastq.gz
PASS	Per base N content	ERR9752317_1.fastq.gz
WARN	Sequence Length Distribution	ERR9752317_1.fastq.gz
FAIL	Sequence Duplication Levels	ERR9752317_1.fastq.gz
PASS	Overrepresented sequences	ERR9752317_1.fastq.gz
PASS	Adapter Content	ERR9752317_1.fastq.gz
```

We are going to look at all of the files together, before and after...

Note that we are going to ignore "FAIL	Sequence Duplication Levels". Some of our samples have low complexity because they are from an infant gut that is just developing and has fewer microbes.


In [None]:
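# Notice that all of your files PASS Basic Statistics before and after QC.
# Fastqc's Basic Statistics module only reports summary values and never raises a warning or failure,
# so these counts simply confirm that every file has a report.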
!echo "Before QC Basic Statistics Passing count:"
!egrep "Basic Statistics" ./before_qc_trimming/*/summary.txt | egrep "PASS" | wc -l
!echo "After QC Basic Statistics Passing count:"
!egrep "Basic Statistics" ./after_qc_trimming/*/summary.txt | egrep "PASS" | wc -l

In [None]:
# Next we are going to check a category that might be failing.
# This is the "Per base sequence content" metric. We should get a graph that shows a pretty equal
# distribution of A, T, G, and C across our sequences (aka random). If we don't see this, we may
# still have adapter sequence in our reads...
# Notice that we have more files passing "Per base sequence content" after trimming.
# This is because we removed 10 base pairs at the beginning of the reads that were likely adapter.
!echo "Before Per base sequence content Passing count:"
!egrep "Per base sequence content" ./before_qc_trimming/*/summary.txt | egrep "PASS" | wc -l
!echo "After Per base sequence content Passing count:"
!egrep "Per base sequence content" ./after_qc_trimming/*/summary.txt | egrep "PASS" | wc -l
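Rather than running one `egrep` per metric, you can tally every fastqc test at once. The sketch below counts PASS, WARN, and FAIL for each test across all of the `summary.txt` files in a directory. It assumes the three tab-separated columns shown in the example above (status, test name, file name) and that you run it from your working directory. `tally_summaries` is a hypothetical name.

```python
from collections import Counter, defaultdict
from pathlib import Path


def tally_summaries(qc_dir):
    """Count PASS/WARN/FAIL per fastqc test across qc_dir/*/summary.txt."""
    tallies = defaultdict(Counter)
    for summary in sorted(Path(qc_dir).glob("*/summary.txt")):
        for line in summary.read_text().splitlines():
            parts = line.split("\t")
            if len(parts) >= 2:
                status, test_name = parts[0], parts[1]
                tallies[test_name][status] += 1
    return tallies


# ASSUMPTION: run from the 06_qc_trimming working directory.
for label in ("before_qc_trimming", "after_qc_trimming"):
    print(label)
    for test_name, counts in tally_summaries(label).items():
        print(f"  {test_name}: {dict(counts)}")
```

Comparing the two printouts side by side shows which tests improved after trimming and which (such as Sequence Duplication Levels) did not change.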

#### Time to explore the fastqc graphs

It is not nearly as fun to look at the QC results on the command line as it is using the interactive graphs that Fastqc produces. Using what you learned in the exercises, go explore the graphs in your home directory.

To do this, navigate to these folders in Jupyter and open the *.html files:

Before:

~/be487-fall-2024/assignments/06_qc_trimming/before_qc_trimming

and

After:

~/be487-fall-2024/assignments/06_qc_trimming/after_qc_trimming

And pick a few files to compare and contrast. Or, look at them all!

One thing you might notice is that the data from the SRA are really not that bad! This is because the studies that are successful and publish their data have only published the sequencing runs that worked. But, there is a deep dark secret. This is not always the case! Sometimes a sequencing run will fail and produce very low-quality reads. That run is discarded, and the sample may be re-sequenced. This can happen if there was a problem with the reagents or the sequencing chemistry.

Main point: always check the quality of your sequence data! Garbage in equals garbage out! Go yell at the sequencing center if you get a poor run...you deserve better.

### What data do you need to report on in your deep dive?

1. What tool and version did you use to look at the quality of the fastq files?
2. What tool, version, and options did you use to trim raw fastq files?
3. Create a table showing the read counts before and after trimming.
4. Look at a few of the before/after plots from fastqc, why did we use the trimming parameters we did? Describe the parameters used.

* Note: in this report, you are only going to report on your 8 samples out of the 56 total samples for your team project.
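For question 1, the fastqc version is recorded in the reports you already unzipped. The sketch below reads it from the first line of each `fastqc_data.txt`, assuming that line has the form `##FastQC<TAB>version`. `fastqc_version` is a hypothetical helper, and it returns `None` if the first line does not look like that. It assumes you run it from your working directory.

```python
from pathlib import Path


def fastqc_version(report_text):
    """Return the version from a '##FastQC<TAB>version' first line, else None."""
    lines = report_text.splitlines()
    first_line = lines[0] if lines else ""
    if first_line.startswith("##FastQC"):
        parts = first_line.split("\t")
        if len(parts) > 1:
            return parts[1].strip()
    return None


# ASSUMPTION: run from the 06_qc_trimming working directory, after unzipping the reports.
versions = {fastqc_version(path.read_text())
            for path in Path("after_qc_trimming").glob("*/fastqc_data.txt")}
print("fastqc version(s) found:", versions)
```

You can cross-check the result against what the tool itself reports on the command line.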

## Final Step
Copy your notebook to the current working directory

In [None]:
!cp ~/be487-fall-2024/assignments/06_qc_trimming/hw06_check.ipynb $work_dir