# Quality control
<br>
<div style="text-align: justify">SELEX was carried out by Indu Patwal (2014) with a 10-fold excess of DNA over histone octamer. Samples were taken from:
    <br>- The initial library (round 0)
    <br>- After adapter ligation (the initial SELEX input)
    <br>- Output from rounds 1, 3 and 6
<br>In total, 13 files were submitted to the EMBL sequencing service.
<br>
This step performs quality control on the raw sequencing data from SELEX. The raw Illumina data are paired-end reads in FASTQ files. To improve downstream analysis, the reads were filtered by removing the 5' adapters ligated during SELEX library preparation and by trimming low-quality 3' ends with a quality threshold of 20.</div>

The first step is to generate a QC report for each of the paired-end FASTQ files (sample_1_sequence.txt.gz and sample_2_sequence.txt.gz).

In [2]:
# Hide the input/output prompts in the rendered notebook
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))

In [None]:
%%bash
fastqc sample_1_sequence.txt.gz
fastqc sample_2_sequence.txt.gz

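[Cutadapt](https://cutadapt.readthedocs.io/en/stable/) removes regular 5' adapters when the -g (first read) and -G (second read) options are used. The adapter may be present in full, be truncated at its 5' end, or be preceded by other bases; in each case the adapter and everything before it are removed. For example, the following reads:
<br>ADAPTER*mysequence*
<br>DAPTER*mysequence*
<br>TER*mysequence*
<br>somethingADAPTER*mysequence*
<br>After adapter removal, all of these reads become *mysequence*.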
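
The four cases above can be mimicked with a small Python sketch. It is an illustration only, not Cutadapt's algorithm: it uses exact string matching, whereas Cutadapt's matching is error-tolerant. The function name `trim_5prime_adapter` and the `min_overlap` parameter are hypothetical; the default of 3 is an assumption meant to mirror Cutadapt's default minimum overlap.

```python
def trim_5prime_adapter(read, adapter, min_overlap=3):
    """Remove a regular 5' adapter using exact matching (illustrative only)."""
    # Case 1: the full adapter occurs somewhere in the read.
    # Remove it together with everything that precedes it.
    pos = read.find(adapter)
    if pos >= 0:
        return read[pos + len(adapter):]

    # Case 2: only the 3' part of the adapter is present at the read start.
    # Try the longest partial match first, down to min_overlap bases.
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.startswith(adapter[-k:]):
            return read[k:]

    # No adapter found: return the read unchanged.
    return read


examples = [
    "ADAPTERmysequence",
    "DAPTERmysequence",
    "TERmysequence",
    "somethingADAPTERmysequence",
]
trimmed = [trim_5prime_adapter(r, "ADAPTER") for r in examples]
print(trimmed)
```

In this toy setting every example read is reduced to `mysequence`.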

In [1]:
%%bash
cutadapt --help

cutadapt version 1.18

Copyright (C) 2010-2018 Marcel Martin <marcel.martin@scilifelab.se>

cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.


Nucleosomal DNA fragments are expected to be ~147 bp long. To obtain reads that potentially derive from such fragments, we gave priority to removing the SELEX adapter first and then keeping only reads of 60 to 80 bp. We did not trim bases from the 3' end beforehand, because we would not have known how many bases to trim while still retaining likely nucleosomal DNA. Within a single run, however, Cutadapt applies 3' quality trimming before adapter removal. We therefore removed the SELEX adapter and filtered for reads of 60 to 80 bp in one Cutadapt run, and performed 3' quality trimming in a separate, later run. The output files of this first quality control step are sample_1_QC1.fq.gz and sample_2_QC1.fq.gz.
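
The length filter applied to read pairs can be sketched in Python as follows. This is a minimal illustration under one assumption: a pair is discarded as soon as either mate falls outside the 60 to 80 bp window, which is how Cutadapt's default pair filtering is understood to behave here. The function name `keep_pair` is hypothetical.

```python
def keep_pair(read1, read2, min_len=60, max_len=80):
    """Return True only if both mates are within [min_len, max_len] bases."""
    # Assumption: a pair is dropped if EITHER mate fails the length filter.
    return all(min_len <= len(read) <= max_len for read in (read1, read2))


# Toy pairs of adapter-trimmed reads (sequences are placeholders).
pairs = [
    ("A" * 70, "C" * 65),  # both mates in range: kept
    ("A" * 59, "C" * 70),  # first mate too short: discarded
    ("A" * 80, "C" * 81),  # second mate too long: discarded
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept))
```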

In [None]:
%%bash
cutadapt -g GTCATAGCTGTTTCCTGTGTGAT -G GTCATAGCTGTTTCCTGTGTGAT --minimum-length 60 --maximum-length 80 -o sample_1_QC1.fq.gz -p sample_2_QC1.fq.gz sample_1_sequence.txt.gz sample_2_sequence.txt.gz

Cutadapt then trimmed low-quality bases from the 3' end of each read using a quality cutoff of 20, so that only high-quality bases were kept (see the [quality-trimming algorithm](https://cutadapt.readthedocs.io/en/stable/algorithms.html#quality-trimming-algorithm)).
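
The idea behind 3' quality trimming can be sketched in Python. This is a simplified reading of the algorithm described in the linked documentation, not Cutadapt's own code: the cutoff is subtracted from each quality, partial sums are accumulated from the 3' end, and the read is cut where the sum is smallest. The function name `quality_trim_index` is hypothetical, and qualities are assumed to be already decoded to integer Phred scores.

```python
def quality_trim_index(qualities, cutoff=20):
    """Return the number of 5' bases to keep after 3' quality trimming."""
    running = 0            # partial sum of (quality - cutoff) from the 3' end
    best = 0               # smallest partial sum seen so far
    cut = len(qualities)   # default: keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:
            # Good bases now outweigh the bad ones seen so far: stop scanning.
            break
        if running < best:
            best = running
            cut = i
    return cut


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
keep = quality_trim_index(quals, cutoff=10)
print(keep)
```

For the qualities 42, 40, 26, 27, 8, 7, 11, 4, 2, 3 and a cutoff of 10, the sketch returns 4, so only the first four bases would be kept.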

In [None]:
%%bash
cutadapt -q 20 -o sample_1_QC2.fq.gz -p sample_2_QC2.fq.gz sample_1_QC1.fq.gz sample_2_QC1.fq.gz

The final output files, sample_1_QC2.fq.gz and sample_2_QC2.fq.gz, were checked again with FastQC, which produced HTML reports.

In [None]:
%%bash
fastqc sample_1_QC2.fq.gz
fastqc sample_2_QC2.fq.gz