# Quality control and adapter trimming of Illumina reads

The first step with sequence data is always quality control (QC). We will use [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for this.  
As this notebook is again located in the `notebooks` folder, move to the main course folder before running the analysis.

Keep in mind that all steps will be done separately for the bacterial (16S) and fungal (ITS) data. 

The data to be analysed are in the folders created under the `data/` folder (`bac_data` and `fun_data`). The first step is to create fastqc output folders for both the 16S and ITS data. 

Move to the right folder. 

In [1]:
cd ../

/scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology


In [3]:
ls data/bac_data/

A001-A-agri-bac-TGAAGGAT-AACTGACT-MMB17-course-run20230203R_S1_L001_R1_001.fastq.gz*
A001-A-agri-bac-TGAAGGAT-AACTGACT-MMB17-course-run20230203R_S1_L001_R2_001.fastq.gz*
A003-A-fore-bac-TGAAGGAT-GCACTCTG-MMB17-course-run20230203R_S3_L001_R1_001.fastq.gz*
A003-A-fore-bac-TGAAGGAT-GCACTCTG-MMB17-course-run20230203R_S3_L001_R2_001.fastq.gz*
A005-A-park-bac-TGAAGGAT-CGTACACA-MMB17-course-run20230203R_S5_L001_R1_001.fastq.gz*
A005-A-park-bac-TGAAGGAT-CGTACACA-MMB17-course-run20230203R_S5_L001_R2_001.fastq.gz*
A007-A-NC-bac-TGAAGGAT-GTGTTCGC-MMB17-course-run20230203R_S7_L001_R1_001.fastq.gz*
A007-A-NC-bac-TGAAGGAT-GTGTTCGC-MMB17-course-run20230203R_S7_L001_R2_001.fastq.gz*
A009-B-park-bac-CACTTCGT-AACTGACT-MMB17-course-run20230203R_S9_L001_R1_001.fastq.gz*
A009-B-park-bac-CACTTCGT-AACTGACT-MMB17-course-run20230203R_S9_L001_R2_001.fastq.gz*


Make the output folder for the bacterial reads.

In [4]:
mkdir data/bac_data/FASTQC

And for the fungal data. 

In [5]:
mkdir data/fun_data/FASTQC

Then run fastqc on the bacterial data.

In [6]:
!fastqc --quiet --threads 4 --outdir data/bac_data/FASTQC --format fastq data/bac_data/*.fastq.gz

In [7]:
!fastqc --quiet --threads 4 --outdir data/fun_data/FASTQC --format fastq data/fun_data/*.fastq.gz

The second command above does the same for the fungal data. To see all the options fastqc accepts, print its help text.

In [None]:
!fastqc -h

After the QC has been done, we can combine the reports with multiqc, separately for the bacterial and fungal data.

First the bacterial data.

In [8]:
!multiqc --outdir data/bac_data/FASTQC --interactive data/bac_data/FASTQC/*


  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/bac_data/FASTQC/A001-A-agri-bac-TGAAGGAT-AACTGACT-MMB17-course-run20230203R_S1_L001_R1_001_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/bac_data/FASTQC/A001-A-agri-bac-TGAAGGAT-AACTGACT-MMB17-course-run20230203R_S1_L001_R1_001_fastqc.zip
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/bac_data/FASTQC/A001-A-agri-bac-TGAAGGAT-AACTGACT-MMB17-course-run20230203R_S1_L001_R2_001_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/bac_data/FASTQC/A001-A-agri-bac-TGAAGGAT-AACTGACT-MMB17-course-run20230203R_S1_L0

In [9]:
!multiqc --outdir data/fun_data/FASTQC --interactive data/fun_data/FASTQC/*


  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/fun_data/FASTQC/A002-A-agri-fun-TGAAGGAT-TGAGAGAA-MMB17-course-run20230203R_S2_L001_R1_001_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/fun_data/FASTQC/A002-A-agri-fun-TGAAGGAT-TGAGAGAA-MMB17-course-run20230203R_S2_L001_R1_001_fastqc.zip
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/fun_data/FASTQC/A002-A-agri-fun-TGAAGGAT-TGAGAGAA-MMB17-course-run20230203R_S2_L001_R2_001_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/fun_data/FASTQC/A002-A-agri-fun-TGAAGGAT-TGAGAGAA-MMB17-course-run20230203R_S2_L

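And then the fungal data, using the same command as in the previous cell.

In [None]:
!multiqc --outdir data/fun_data/FASTQC --interactive data/fun_data/FASTQC/*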

After multiqc has completed, find the output files (`multiqc_report.html`) in the file browser on the left and download them to your own computer. Both reports have the same file name, so rename them locally; otherwise the second one will overwrite the first if you download them to the same folder. 
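One way to avoid the name clash is to make a labelled copy of each report before downloading. The sketch below is only an illustration: `copy_report` is a hypothetical helper, not part of multiqc, and it assumes the reports sit in the `FASTQC` folders used in this notebook.

```shell
# Copy a MultiQC report to a new name that starts with a data set label.
# copy_report is a hypothetical helper written for this example.
copy_report() {
    report="$1"   # path to an existing multiqc_report.html
    label="$2"    # short label for the data set, e.g. bac or fun
    cp "$report" "$(dirname "$report")/${label}_multiqc_report.html"
}

# Usage, from the main course folder (paths as used in this notebook):
# copy_report data/bac_data/FASTQC/multiqc_report.html bac
# copy_report data/fun_data/FASTQC/multiqc_report.html fun
```

The original report is left in place, so nothing is lost if the label is mistyped.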

After the initial quality control, we need to trim the PCR primers off our reads. We will use [cutadapt](https://cutadapt.readthedocs.io/en/stable/guide.html) for the job. Read the "Adapter types" section of the manual behind the link and think about where we would expect our primers to be and which options we should use to trim them. We will run cutadapt in paired-end mode, so we need to specify the outputs separately for the R1 and R2 reads. 

In [None]:
!cutadapt -h

Instead of running each sample separately, we can write a simple for loop that processes every sample. But first we need the sample names.  

First go to the right folder. 

In [10]:
cd data/bac_data

/scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/bac_data


In [12]:
ls -ltr

total 98036
-rwxrwx--- 1 antkark project_2007145 2228015 Feb  6 11:15 A039-E-NC-bac-CTTGGTTG-GTGTTCGC-MMB17-course-run20230203R_S39_L001_R2_001.fastq.gz*
-rwxrwx--- 1 antkark project_2007145 2080765 Feb  6 11:15 A039-E-NC-bac-CTTGGTTG-GTGTTCGC-MMB17-course-run20230203R_S39_L001_R1_001.fastq.gz*
-rwxrwx--- 1 antkark project_2007145 3142349 Feb  6 11:15 A035-E-fore-bac-CTTGGTTG-GCACTCTG-MMB17-course-run20230203R_S35_L001_R2_001.fastq.gz*
-rwxrwx--- 1 antkark project_2007145 2790572 Feb  6 11:15 A035-E-fore-bac-CTTGGTTG-GCACTCTG-MMB17-course-run20230203R_S35_L001_R1_001.fastq.gz*
-rwxrwx--- 1 antkark project_2007145 3775093 Feb  6 11:15 A034-E-park-bac-CTTGGTTG-TGAGAGAA-MMB17-course-run20230203R_S34_L001_R2_001.fastq.gz*
-rwxrwx--- 1 antkark project_2007145 3380791 Feb  6 11:15 A034-E-park-bac-CTTGGTTG-TGAGAGAA-MMB17-course-run20230203R_S34_L001_R1_001.fastq.gz*
-rwxrwx--- 1 antkark project_20071

Then get the sample names from the forward read files. 

In [13]:
ls *_R1_001.fastq.gz | cut -d "-" -f 1,2,3,4 > sample_names.txt
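A quick way to see what a given `cut` call keeps is to try it on a single file name first. The sketch below uses one file name from the listing above and compares splitting on `_` with splitting on `-`. The variable names are only for illustration.

```shell
name="A001-A-agri-bac-TGAAGGAT-AACTGACT-MMB17-course-run20230203R_S1_L001_R1_001.fastq.gz"

# Split on "_" and keep the first field: everything before "_S1"
by_underscore=$(echo "$name" | cut -d "_" -f 1)

# Split on "-" and keep the first four dash-separated parts
by_dash=$(echo "$name" | cut -d "-" -f 1-4)

echo "$by_underscore"
echo "$by_dash"
```

The second form gives `A001-A-agri-bac`, which is the short name that appears in the trimmed file names later in this notebook (for example `A001-A-agri-bac_trimmed_1`).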

In [15]:
cd ../fun_data/

/scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/data/fun_data


In [32]:
ls *_R1_001.fastq.gz | cut -d "-" -f 1,2,3,4 > sample_names.txt


In [21]:
cd ../..

/scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology


In [None]:
%%bash
# -p also creates trimmed_data itself if it does not exist yet
mkdir -p trimmed_data/bac_data
mkdir -p trimmed_data/fun_data

In [33]:
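%%bash
# Trim the 16S primers from each read pair:
#   -g / -G  5' primer expected at the start of R1 / R2
#   -O 10    require at least 10 bases of overlap between read and primer
#   -o / -p  trimmed output for R1 / R2; the report goes to a per-sample log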
for sample in $(cat data/bac_data/sample_names.txt) 
do 
    cutadapt \
    -g CCTACGGGNGGCWGCAG \
    -G GACTACHVGGGTATCTAATCC \
    -O 10 \
    --cores 4 \
    data/bac_data/${sample}*_R1_001.fastq.gz \
    data/bac_data/${sample}*_R2_001.fastq.gz \
    -o trimmed_data/bac_data/${sample}_trimmed_1.fastq.gz \
    -p trimmed_data/bac_data/${sample}_trimmed_2.fastq.gz \
    > trimmed_data/bac_data/${sample}.log
done
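Before launching a loop like the one above, it can be worth checking that every name in `sample_names.txt` really matches both an R1 and an R2 file, since a name that matches nothing would make cutadapt fail for that sample. A minimal sketch, assuming the file naming used in this notebook. `check_pairs` is a hypothetical helper, not part of cutadapt.

```shell
# Report any sample that lacks an R1 or R2 file; return non-zero if one is missing.
# check_pairs is a hypothetical helper written for this example.
check_pairs() {
    dir="$1"      # folder with the raw reads, e.g. data/bac_data
    names="$2"    # file with one sample name per line
    status=0
    while read -r sample; do
        for mate in R1 R2; do
            # ls fails when the pattern matches no file
            if ! ls "$dir"/"$sample"*_"$mate"_001.fastq.gz > /dev/null 2>&1; then
                echo "missing $mate file for $sample"
                status=1
            fi
        done
    done < "$names"
    return "$status"
}

# Usage, from the main course folder:
# check_pairs data/bac_data data/bac_data/sample_names.txt
```

The file pattern is the same one the cutadapt loop uses, so a sample that passes this check will also be found by the loop.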

In [35]:
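%%bash
# Trim the ITS primers from each read pair:
#   -g / -G  5' primer expected at the start of R1 / R2
#   -a / -A  3' sequence to remove from the end of R1 / R2
#   -O 10    require at least 10 bases of overlap between read and primer
#   -o / -p  trimmed output for R1 / R2; the report goes to a per-sample log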
for sample in $(cat data/fun_data/sample_names.txt) 
do 
    cutadapt \
    -g TCCTCCGCTTATTGATATGC \
    -a CAAAGATTCGATGAYTCAC \
    -G GTGARTCATCGAATCTTTG \
    -A GCATATCAATAAGCGGAGGA \
    -O 10 \
    --cores 4 \
    data/fun_data/${sample}*_R1_001.fastq.gz \
    data/fun_data/${sample}*_R2_001.fastq.gz \
    -o trimmed_data/fun_data/${sample}_trimmed_1.fastq.gz \
    -p trimmed_data/fun_data/${sample}_trimmed_2.fastq.gz \
    > trimmed_data/fun_data/${sample}.log
done 
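In the ITS loop, the `-a` and `-A` sequences look like the reverse complements of the `-G` and `-g` primers. That is presumably there to catch reads that run through a short ITS region into the primer at the other end. The relationship can be checked with a small sketch. `revcomp` is a hypothetical helper name, and it assumes the `rev` and `tr` utilities are available.

```shell
# Reverse complement of a DNA sequence, including IUPAC ambiguity codes
# (R<->Y, K<->M, B<->V, D<->H; N, S and W are their own complements).
# revcomp is a hypothetical helper written for this example.
revcomp() {
    echo "$1" | rev | tr 'ACGTRYKMBVDH' 'TGCAYRMKVBHD'
}

revcomp GTGARTCATCGAATCTTTG     # the -G primer; compare with the -a sequence
revcomp TCCTCCGCTTATTGATATGC    # the -g primer; compare with the -A sequence
```

The two lines printed should equal the `-a` and `-A` values in the loop above (`CAAAGATTCGATGAYTCAC` and `GCATATCAATAAGCGGAGGA`).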

To recap the cells above: the sample names were also collected for the ITS samples.  

Before running cutadapt, we made new folders for the trimmed reads under the `trimmed_data` folder, 
again called `bac_data` and `fun_data`.

Then we ran cutadapt in a loop going through the file with all sample names. The command is run from the main folder. 

When the trimming is done, run fastqc and multiqc again on the trimmed data to make sure everything looks ok. 

In [36]:
%%bash
mkdir trimmed_data/bac_data/FASTQC
mkdir trimmed_data/fun_data/FASTQC

In [37]:
!fastqc --quiet --threads 4 --outdir trimmed_data/bac_data/FASTQC --format fastq trimmed_data/bac_data/*.fastq.gz

In [39]:
!fastqc --quiet --threads 4 --outdir trimmed_data/fun_data/FASTQC --format fastq trimmed_data/fun_data/*.fastq.gz

In [41]:
!multiqc --outdir trimmed_data/bac_data/FASTQC --interactive trimmed_data/bac_data/FASTQC/*


  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/bac_data/FASTQC/A001-A-agri-bac_trimmed_1_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/bac_data/FASTQC/A001-A-agri-bac_trimmed_1_fastqc.zip
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/bac_data/FASTQC/A001-A-agri-bac_trimmed_2_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/bac_data/FASTQC/A001-A-agri-bac_trimmed_2_fastqc.zip
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/bac_data/

In [42]:
!multiqc --outdir trimmed_data/fun_data/FASTQC --interactive trimmed_data/fun_data/FASTQC/*


  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/fun_data/FASTQC/A002-A-agri-fun_trimmed_1_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/fun_data/FASTQC/A002-A-agri-fun_trimmed_1_fastqc.zip
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/fun_data/FASTQC/A002-A-agri-fun_trimmed_2_fastqc.html
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/fun_data/FASTQC/A002-A-agri-fun_trimmed_2_fastqc.zip
|           multiqc | Search path : /scratch/project_2007145/antkark/MMB-117/MMB-117_EnvironmentalMicrobiology/trimmed_data/fun_data/

__And we are done.__