# Quality control and adapter trimming of Illumina reads

The first step with sequence data is always quality control (QC). We will use [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for this.  
As this notebook is again located in the `notebooks` folder, move to the main course folder to run the analysis.

Keep in mind that all steps will be done separately for the bacterial (16S) and fungal (ITS) data. 

The data to be analysed is in the folders created under the `data/` folder (`bac_data` and `fun_data`). The first step is to create output folders for fastqc for both the 16S and the ITS data. 

Move to the right folder. 

In [None]:
cd ../

Make the output folder for the bacterial reads.

In [None]:
mkdir data/bac_data/FASTQC

And for the fungal data. 

In [None]:
mkdir data/fun_data/FASTQC

Then run fastqc on the bacterial data.

In [None]:
!fastqc --quiet --threads 4 --outdir data/bac_data/FASTQC --format fastq data/bac_data/*.fastq.gz

Then do the same for fungal data.

In [None]:
!fastqc -h
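
The fungal run mirrors the bacterial one with the paths swapped. A minimal sketch, assuming the ITS reads in `data/fun_data` also end in `.fastq.gz`. It is written as plain shell (for a terminal or a `%%bash` cell), and the `command -v` guard is only there so the snippet fails gently when fastqc is not on the `PATH`:

```shell
# Same options as for the bacterial data, only the paths point to fun_data
out_dir=data/fun_data/FASTQC
mkdir -p "$out_dir"    # harmless if the folder already exists

if command -v fastqc >/dev/null 2>&1; then
    fastqc --quiet --threads 4 --outdir "$out_dir" --format fastq data/fun_data/*.fastq.gz
else
    echo "fastqc not found on PATH" >&2
fi
```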

After the QC has been done, we can combine the reports, separately for bacterial and fungal data.

First the bacterial data.

In [None]:
!multiqc --outdir data/bac_data/FASTQC --interactive data/bac_data/FASTQC/*

And then the fungal data.

In [None]:
!multiqc 
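
For the fungal reports, the bacterial command can be reused with the folder swapped. A sketch under the assumption that the fastqc output was written to `data/fun_data/FASTQC`; the guard only keeps the snippet from failing hard when multiqc or the folder is missing:

```shell
# Combine the fungal fastqc reports into one interactive report
qc_dir=data/fun_data/FASTQC

if command -v multiqc >/dev/null 2>&1 && [ -d "$qc_dir" ]; then
    multiqc --outdir "$qc_dir" --interactive "$qc_dir"/*
else
    echo "multiqc or $qc_dir is missing" >&2
fi
```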

After multiqc has completed, find the output files (`multiqc_report.html`) from the file browser on the left and download them to your own computer. Remember to change the names locally, otherwise the second one will overwrite the first if you download them to the same folder. 

After the initial quality control, we need to trim the PCR primers off our reads. We will use [cutadapt](https://cutadapt.readthedocs.io/en/stable/guide.html) for the job. Read the "Adapter types" section of the manual behind the link and think about where we would expect our primers to be and which options we should use to trim them. We will run cutadapt in paired-end mode, so we need to specify the outputs separately for the R1 and R2 reads. 

In [None]:
!cutadapt -h
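
In a typical amplicon library each read starts with a primer, which is the case the 5' options `-g` (R1) and `-G` (R2) cover. If the reads are longer than the amplicon, the reverse complement of the opposite primer can also show up at the 3' end, where `-a` (R1) and `-A` (R2) apply. A small sketch for getting the reverse complement of a primer, IUPAC ambiguity codes included. The sequence is a made-up placeholder, not one of the course primers:

```shell
# Reverse complement of a primer, including IUPAC ambiguity codes.
# The sequence below is a hypothetical placeholder, not a course primer.
primer="ACGGTY"

# rev reverses the string, tr swaps each base for its complement
revcomp=$(printf '%s\n' "$primer" | rev | tr 'ACGTRYKMBDHV' 'TGCAYRMKVHDB')
echo "$revcomp"
```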

Instead of running cutadapt on each sample separately, we can write a simple for loop that goes through all the samples. But first we need the sample names.  

First go to the right folder. 

In [None]:
cd data/bac_data

Then get the sample names from the forward read files. 

In [None]:
!ls SRR*_1.fastq.gz | cut -d "_" -f 1 > sample_names.txt

Then get the names for the ITS samples too.  
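
One possible way to do this, sketched as a small shell function (for a terminal or a `%%bash` cell). It assumes the ITS files follow the same `SRR*_1.fastq.gz` naming as the 16S files and that it is run from the main course folder. The function name `list_samples` is made up for this example:

```shell
# Print one sample name per forward-read file found in the given folder
list_samples() {
    for fwd in "$1"/SRR*_1.fastq.gz; do
        [ -e "$fwd" ] || continue        # the pattern matched nothing
        basename "$fwd" _1.fastq.gz      # drop the folder and the suffix
    done
}

# Write the ITS sample names next to the ITS reads
if [ -d data/fun_data ]; then
    list_samples data/fun_data > data/fun_data/sample_names.txt
fi
```

Using `basename` with a suffix keeps the whole sample name even if it contains an underscore, which `cut -d "_" -f 1` would cut off.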

Before running cutadapt, we need to make new folders for the trimmed reads under the `trimmed_data` folder. 
Again, call them `bac_data` and `fun_data`.
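
A sketch of this step, assuming it is run from the main course folder:

```shell
# -p creates trimmed_data itself if needed and does not fail on re-runs
mkdir -p trimmed_data/bac_data trimmed_data/fun_data
```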

Then we can run cutadapt in a loop going through the file with all the sample names. Run the command from the main course folder, so move back there first (`cd ../..`) if you are still in one of the data folders. 

In [None]:
!for sample in $(cat data/bac_data/sample_names.txt); do cutadapt -g PRIMER_SEQUENCE -G PRIMER_SEQUENCE -O 10 --cores 4 data/bac_data/${sample}_1.fastq.gz data/bac_data/${sample}_2.fastq.gz -o trimmed_data/bac_data/${sample}_trimmed_1.fastq.gz -p trimmed_data/bac_data/${sample}_trimmed_2.fastq.gz > trimmed_data/bac_data/${sample}.log; done 
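
If cutadapt complains about a missing input file, it can help to check that every name in `sample_names.txt` has both an R1 and an R2 file. A minimal sketch of such a check (for a terminal or a `%%bash` cell). The helper name `check_pairs` is made up for this example:

```shell
# Report samples whose forward or reverse read file is missing.
# Usage: check_pairs <sample_names_file> <data_folder>
check_pairs() {
    missing=0
    while read -r sample; do
        for read_no in 1 2; do
            file="$2/${sample}_${read_no}.fastq.gz"
            if [ ! -f "$file" ]; then
                echo "Missing: $file" >&2
                missing=1
            fi
        done
    done < "$1"
    return "$missing"    # 0 if all pairs are complete, 1 otherwise
}

# Check the bacterial data if the sample list exists
if [ -f data/bac_data/sample_names.txt ]; then
    check_pairs data/bac_data/sample_names.txt data/bac_data && echo "All read pairs found"
fi
```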

When the trimming is done, run fastqc and multiqc again on the trimmed data to make sure everything looks ok. 

In [None]:
!fastqc

In [None]:
!multiqc
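
A possible version of these two steps for the trimmed bacterial reads, reusing the options from the first QC round. The `FASTQC` subfolder under `trimmed_data/bac_data` is an assumption, any output folder works. The fungal data would follow the same pattern with `fun_data`:

```shell
# QC of the trimmed bacterial reads; the FASTQC subfolder name is an assumption
trimmed_dir=trimmed_data/bac_data
qc_dir="$trimmed_dir"/FASTQC
mkdir -p "$qc_dir"

if command -v fastqc >/dev/null 2>&1 && command -v multiqc >/dev/null 2>&1; then
    # One fastqc report per trimmed file, then a combined multiqc report
    fastqc --quiet --threads 4 --outdir "$qc_dir" --format fastq "$trimmed_dir"/*.fastq.gz
    multiqc --outdir "$qc_dir" --interactive "$qc_dir"/*
else
    echo "fastqc and/or multiqc not found on PATH" >&2
fi
```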

__And we are done.__