# Quality filtering and trimming

Raw sequence reads need to be filtered and trimmed to remove poor-quality reads, trim poor-quality read ends, and remove any retained adapter sequence.

Common options for this include *trimmomatic* or *bbduk*. An example is provided below using *trimmomatic*.

The parameters for filtering and trimming should be chosen for each dataset, based on, for example, read length, read quality, and the presence of adapters (which can be assessed via *fastQC*).
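
As a rough aid when choosing these parameters, the sketch below tabulates the read-length distribution of a FASTQ file with `awk`. The file `example.fastq` and its three reads are made up for illustration; for real data you would point the `awk` command at your own (decompressed) reads instead.

```shell
# Create a tiny illustrative FASTQ file (hypothetical data: three reads)
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' > "${tmpdir}/example.fastq"

# Sequence lines are every 4th line starting at line 2; tally their lengths
# Output columns: read length, number of reads of that length
length_table=$(awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l "\t" n[l] }' "${tmpdir}/example.fastq" | sort -n)

echo "${length_table}"

# Clean up the temporary example data
rm -r "${tmpdir}"
```

For gzipped input, the same `awk` command can read from a pipe, e.g. `zcat reads.fastq.gz | awk '...'`.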


***

## Trim and filter reads

#### Trimmomatic: prep raw data



Pre-processing: concatenating sample files from multiple lanes

- If samples have been run over multiple lanes, concatenate the files for each read (R1 and R2, separately) into single sets of paired read files per sample
- Also rename files to a simpler format for downstream use
  - For ease of writing for-loops and/or running slurm array jobs, it can be useful to name samples with a shared string followed by a number, e.g. `sample_1`, `sample_2`,... `sample_n`.

In [None]:
# Change to working directory
cd /working/dir/

# Make output directory for the concatenated files
mkdir -p 0b.Raw_concat

# Set up variables for input files path and output path
inpath=0a.Raw/hiseq/fastq
outpath=0b.Raw_concat

# For each of reads 1 and 2 (R1 and R2), concatenate files from multiple lanes (e.g. L001-L008) into single output files, and rename based on sampleIDs
for read in R1 R2;
do
    cat ${inpath}/*4462-40*_${read}_001.fastq.gz > ${outpath}/S1_${read}.fastq.gz
    cat ${inpath}/*4462-44*_${read}_001.fastq.gz > ${outpath}/S2_${read}.fastq.gz
    cat ${inpath}/*4462-48*_${read}_001.fastq.gz > ${outpath}/S3_${read}.fastq.gz
    cat ${inpath}/*4462-52*_${read}_001.fastq.gz > ${outpath}/S4_${read}.fastq.gz
done
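
Concatenating gzipped files with `cat` yields a valid multi-member gzip file, so the per-lane files do not need to be decompressed first. The sketch below illustrates this on two made-up single-read "lane" files and checks that the concatenated file contains the sum of the per-lane reads. All file names and reads here are hypothetical, and `gzip -dc` is used in place of `zcat` only for portability.

```shell
# Create two tiny gzipped FASTQ files standing in for two lanes (hypothetical data)
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "${tmpdir}/sample_L001_R1.fastq.gz"
printf '@r2\nTTGG\n+\nIIII\n' | gzip > "${tmpdir}/sample_L002_R1.fastq.gz"

# Count reads across the individual lane files (4 lines per FASTQ record)
lane_reads=0
for f in "${tmpdir}"/sample_L00*_R1.fastq.gz; do
    lane_reads=$(( lane_reads + $(gzip -dc "${f}" | wc -l) / 4 ))
done

# Concatenate the lanes, then count reads in the combined file
cat "${tmpdir}"/sample_L00*_R1.fastq.gz > "${tmpdir}/S1_R1.fastq.gz"
concat_reads=$(( $(gzip -dc "${tmpdir}/S1_R1.fastq.gz" | wc -l) / 4 ))

echo "Reads in lane files: ${lane_reads}; reads in concatenated file: ${concat_reads}"

# Clean up the temporary example data
rm -r "${tmpdir}"
```

Running the same comparison on real data is a quick way to confirm that the filename patterns in the loop above picked up every lane.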


#### Trimmomatic: run

Note: 

- We recommend also including a search for a truncated version of the adapters, as these are sometimes not picked up by *trimmomatic* or *fastQC* (this is the `ILLUMINACLIP` step in the script below).
  - It may be necessary to check with your sequencing provider which adapters were used, and then select the shared stretch of sequence to include in `iua.fna` in the script below.
  - For reference, the sequence included in the example below is based on a truncated section of the Illumina TruSeq adapters.
- Set `CROP` and `MINLEN` to values appropriate for your data (relative to the read length for this sequencing run).
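
One simple way to check whether a truncated adapter sequence is present in a set of reads (before or after trimming) is to count the reads that contain it. The sketch below is a minimal illustration using a made-up two-read FASTQ file; the adapter string matches the one written to `iua.fna` in the script below, while the file name and reads are hypothetical.

```shell
# Truncated adapter sequence (same string as in the example iua.fna)
adapter="AGATCGGAAGAG"

# Create a tiny illustrative FASTQ file: r1 contains the adapter, r2 does not
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGTAGATCGGAAGAGCACA\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' > "${tmpdir}/example.fastq"

# Count sequence lines (every 4th line, starting at line 2) that contain the adapter
adapter_hits=$(awk -v a="${adapter}" 'NR % 4 == 2 && index($0, a) { n++ } END { print n + 0 }' "${tmpdir}/example.fastq")

echo "Reads containing ${adapter}: ${adapter_hits}"

# Clean up the temporary example data
rm -r "${tmpdir}"
```

This only finds exact matches, so it is a rough check rather than a replacement for the adapter report in *fastQC*.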

Slurm array for 9 samples

- Change `#SBATCH --array=1-9` for required number of samples
  - (This is easiest if your sample files are numbered, e.g. `S1_R1.fastq.gz`, `S2_R1.fastq.gz`, etc.)
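
To illustrate how the array index maps onto numbered sample files, the sketch below simulates the task IDs that an array of `1-3` would generate and prints the input files each task would use. Within a real array job, slurm sets `SLURM_ARRAY_TASK_ID` for you; the loop variable `task_id` here is only a stand-in so the sketch can be run outside of slurm.

```shell
# Input directory used in the example scripts
inpath=0b.Raw_concat

# Simulate the task IDs generated by '#SBATCH --array=1-3'
for task_id in 1 2 3; do
    # Each task builds its own file names from its ID
    r1_file=${inpath}/S${task_id}_R1.fastq.gz
    r2_file=${inpath}/S${task_id}_R2.fastq.gz
    echo "Task ${task_id}: ${r1_file} ${r2_file}"
done
```

If your samples are not numbered consecutively, the range in `--array` would need to be adjusted to match the numbers that actually exist.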


In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J wgs_1_trimmomatic
#SBATCH --time 00:05:00
#SBATCH --mem=12GB
#SBATCH --array=1-9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH -e wgs_1_trimmomatic_%a.err
#SBATCH -o wgs_1_trimmomatic_%a.out

# Load module(s)
module purge
module load Trimmomatic/0.39-Java-1.8.0_144                

# Change to working directory
cd /working/dir/

# Make output directory 
mkdir -p 1a.QC_Filtered_trimmomatic/

# Set up variables for input path and output path
inpath=0b.Raw_concat
outpath=1a.QC_Filtered_trimmomatic

# Make adapter file if not already created
if [ ! -f iua.fna ]; then
    echo ">FastQC_adapter" > iua.fna
    echo "AGATCGGAAGAG" >> iua.fna
fi

# Quality filter and trim
srun trimmomatic PE -threads 10 -phred33 -quiet \
${inpath}/S${SLURM_ARRAY_TASK_ID}_R1.fastq.gz ${inpath}/S${SLURM_ARRAY_TASK_ID}_R2.fastq.gz \
${outpath}/S${SLURM_ARRAY_TASK_ID}_R1.fastq S${SLURM_ARRAY_TASK_ID}_R1.single1.fastq \
${outpath}/S${SLURM_ARRAY_TASK_ID}_R2.fastq S${SLURM_ARRAY_TASK_ID}_R2.single2.fastq \
ILLUMINACLIP:iua.fna:1:25:7 CROP:115 SLIDINGWINDOW:4:30 MINLEN:50

# Tidy up the singleton reads
cat S${SLURM_ARRAY_TASK_ID}_R1.single1.fastq S${SLURM_ARRAY_TASK_ID}_R2.single2.fastq \
> ${outpath}/S${SLURM_ARRAY_TASK_ID}_single.fastq

rm S${SLURM_ARRAY_TASK_ID}_R1.single1.fastq S${SLURM_ARRAY_TASK_ID}_R2.single2.fastq



***

## FastQC analysis of trimmed reads 

Examine the quality of the trimmed reads via *fastQC* and *multiQC*. 

Some important things to look for include: 

- Sequence counts: these indicate whether some samples have more or fewer sequences than others. This is useful to bear in mind if samples are missing later on; it may simply be that no quality sequences were recovered from those samples to begin with.
- Overall sequencing quality and quality scores across reads
  - including whether the *ends* of sequences have a high error rate and may require further trimming
- GC content of reads
- Retained adapter sequences which will require further trimming


#### FastQC

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J wgs_1_qc_fastqc
#SBATCH --time 01:00:00
#SBATCH --mem 1GB
#SBATCH --array=1-9
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 2
#SBATCH -e wgs_1_qc_fastqc_%a.err
#SBATCH -o wgs_1_qc_fastqc_%a.out

# Set up working directories
cd /working/dir/
mkdir -p 1a.QC_Filtered_trimmomatic/fastqc/

# load modules
module load FastQC/0.11.9
module load MultiQC/1.9-gimkl-2020a-Python-3.8.2

# Run fastqc on each sample
srun fastqc \
-o 1a.QC_Filtered_trimmomatic/fastqc/ \
1a.QC_Filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_R1.fastq 1a.QC_Filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_R2.fastq


#### MultiQC


In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J wgs_1_qc_multiqc
#SBATCH --time 00:10:00
#SBATCH --mem 1GB
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 2
#SBATCH -e wgs_1_qc_multiqc.err
#SBATCH -o wgs_1_qc_multiqc.out

# Set up working directories
cd /working/dir

# load modules
module load FastQC/0.11.9
module load MultiQC/1.9-gimkl-2020a-Python-3.8.2

# Run multiqc to generate report for all samples
srun multiqc -f \
-o 1a.QC_Filtered_trimmomatic/fastqc/ \
1a.QC_Filtered_trimmomatic/fastqc/


***

## Optional: Filter out host sequences


#### Preamble

Metagenome data derived from host-associated microbial communities should ideally be filtered to remove any reads originating from host DNA. This may improve the quality and efficiency of downstream data processing (since we no longer process data that we are unlikely to be interested in). It is also an important consideration when working with metagenomes that may include data of a sensitive nature, which may need to be removed before making the data publicly available. This is especially important for studies involving human subjects or samples derived from taonga species.

There are several approaches that can be used to achieve this. The general principle is to map your reads to a reference genome (e.g. human genome) and remove those reads that map to the reference from the dataset. 

The steps below provide an example using *BBMap* to map against a masked human reference genome and retain only those reads that do *not* map to the reference. Here we map the quality-filtered reads against a pre-prepared human genome in which certain sections have been masked, including those that: are presumed to be microbial contamination in the reference; have high homology to microbial genes/genomes (e.g. ribosomes); or are of low complexity. This ensures that reads that would normally map to these sections of the human genome are *not* removed from the dataset (as genuine microbial reads that we wish to retain might also map to these regions), while all reads mapping to the rest of the human genome are removed.

Notes: 

- The same process can be used to remove DNA matching other hosts (e.g. mouse). However, you would need to check whether a masked version of the reference genome has already been prepared and made available, or create a masked version yourself using *bbmask*. The creator of BBMap has made available masked human, mouse, cat, and dog genomes. More information, including links to these references and instructions on how to generate a masked genome for other taxa, can be found within [this thread](http://seqanswers.com/forums/showthread.php?t=42552).
- You can also map to a *non*-masked reference, with the caveat that you may filter out some genuinely microbial sequences that are similar to regions in the host genome.
- This process may be more complicated if a reference genome for your host taxon is not readily available. In this case an alternative method would need to be employed (for example: predicting taxonomy via Kraken2 and then filtering out all reads that are assigned to the phylum or kingdom of your host taxon).
- If you are interested in viruses, and the virus of interest happens to be integrated in the reference genome, then this data may be lost in this process.

#### Download reference genome

Download the host reference genome of interest. Select pre-prepared masked references are available [here](http://seqanswers.com/forums/showthread.php?t=42552), or download your own non-masked reference.

For reference:

- The masked human reference genome is available from [here](https://drive.google.com/file/d/0B3llHR93L14wd0pSSnFULUlhcUk/edit)
- The masked mouse reference genome is available from [here](https://drive.google.com/file/d/0B3llHR93L14wYmJYNm9EbkhMVHM/view)

#### Host filtering: Build BBMap index


In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J wgs_1_hostfilt_mapping_index
#SBATCH --time 00:20:00
#SBATCH --mem 23GB
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH -e wgs_1_hostfilt_mapping_index.err
#SBATCH -o wgs_1_hostfilt_mapping_index.out

# working directories
cd /working/dir/
mkdir -p 1b.QC_Filtered_host/
cd 1b.QC_Filtered_host/

# Load BBMap module
module purge
module load BBMap/38.81-gimkl-2020a

# Build BBMap index of reference genome
srun bbmap.sh ref=/path/to/reference/genome.fa.gz -Xmx23g


#### Host filtering: per-sample BBMap read mapping, slurm array

Note:

- This step outputs fastq files where reads that map to the reference genome have been filtered out.
- The output from `outu` is the filtered file for downstream use.
- Host filtering here is run as a two-step process for each sample: first on the paired reads (R1 and R2), and then again on the unpaired (single) reads file.
- The parameters are set based on the recommendations for host filtering outlined [here](http://seqanswers.com/forums/showthread.php?t=42552)


In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J wgs_1_hostfilt_mapping
#SBATCH --time 01:00:00
#SBATCH --mem 28GB
#SBATCH --ntasks 1
#SBATCH --array=1-9
#SBATCH --cpus-per-task 32
#SBATCH -e wgs_1_hostfilt_mapping_%a.err
#SBATCH -o wgs_1_hostfilt_mapping_%a.out

# Set up working directories
cd /working/dir/1b.QC_Filtered_host/

# Load BBMap module
module purge
module load BBMap/38.81-gimkl-2020a

## Run bbmap

# Paired reads (R1 and R2)
srun bbmap.sh -Xmx26g -t=32 \
minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 qtrim=rl trimq=10 untrim \
in1=../1a.QC_Filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_R1.fastq \
in2=../1a.QC_Filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_R2.fastq \
outu1=S${SLURM_ARRAY_TASK_ID}_R1_hostFilt.fastq \
outu2=S${SLURM_ARRAY_TASK_ID}_R2_hostFilt.fastq

# Unpaired (single) reads
srun bbmap.sh -Xmx26g -t=32 \
minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 qtrim=rl trimq=10 untrim \
in=../1a.QC_Filtered_trimmomatic/S${SLURM_ARRAY_TASK_ID}_single.fastq \
outu=S${SLURM_ARRAY_TASK_ID}_single_hostFilt.fastq


***

## Optional: Post-filtering assessment

It pays to check that the filtering and trimming process has done what you expected, and that sequences in the filtered and trimmed output files are now of the standard that you want to use for all downstream processing.

This can be done via:

1. Comparing the numbers of reads in the raw data, post-trimming, and post-host DNA removal to make sure it hasn't filtered out more reads than you'd expect.
1. Running the filtered outputs back through fastqc to check that sequences are generally of a length and quality that you'd expect after the trimming and filtering process (and that no adapter sequence has been retained).

#### Checking library sizes: Summary file of read counts during filtering process

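Each FASTQ record spans four lines, so the number of reads in a file is its line count divided by four (`(wc -l)/4`). Use this to summarise read counts for each sample at each of the filtering steps.

The script below writes the results to `Read_counts_summary.txt`.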


In [None]:
#!/bin/bash
#SBATCH -A your_project_account
#SBATCH -J wgs_1_qc_summary
#SBATCH --time 02:00:00
#SBATCH --mem 1GB
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH -e wgs_1_qc_summary.err
#SBATCH -o wgs_1_qc_summary.out

# Set up working directories
cd /working/dir/

# Set up variables for input files path and output path
raw_path=0b.Raw_concat
trimmed_path=1a.QC_Filtered_trimmomatic
host_filt_path=1b.QC_Filtered_host

# Set up read_counts_summary file headers
echo -e "Sample\traw_read_count\tTrimmed (paired)\tTrimmed (single)\tHost-filtered (paired)\tHost-filtered (single)" > Read_counts_summary.txt

# Summarise read counts for all samples and add to Read_counts_summary.txt
for i in {1..9};
do
    # Summarise the raw data, trimmed, and host-filtered files
    count_raw=$(($(zcat ${raw_path}/S${i}_R1.fastq.gz | wc -l)/4))
    count_trimmed_paired=$(($(cat ${trimmed_path}/S${i}_R1.fastq | wc -l)/4))
    count_trimmed_single=$(($(cat ${trimmed_path}/S${i}_single.fastq | wc -l)/4))
    count_hostFilt_paired=$(($(cat ${host_filt_path}/S${i}_R1_hostFilt.fastq | wc -l)/4))
    count_hostFilt_single=$(($(cat ${host_filt_path}/S${i}_single_hostFilt.fastq | wc -l)/4))
    # Write results to the summary file as a tab-separated row (matching the header; paired counts are taken from R1)
    echo -e "S${i}\t${count_raw}\t${count_trimmed_paired}\t${count_trimmed_single}\t${count_hostFilt_paired}\t${count_hostFilt_single}" >> Read_counts_summary.txt
done
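
Once the summary has been generated, it can be helpful to express the counts as percentages of the raw reads, to spot samples that lost more reads than expected. The sketch below is one way this might be done with `awk`. It assumes a tab-separated table with the same column order as the header above, and the counts in the example table are made up purely for illustration.

```shell
# Build a small example summary table (hypothetical counts, tab-separated)
tmpdir=$(mktemp -d)
summary="${tmpdir}/Read_counts_summary.txt"
printf 'Sample\traw_read_count\tTrimmed (paired)\tTrimmed (single)\tHost-filtered (paired)\tHost-filtered (single)\n' > "${summary}"
printf 'S1\t1000\t800\t100\t600\t50\n' >> "${summary}"
printf 'S2\t2000\t1500\t200\t1500\t200\n' >> "${summary}"

# Skip the header; for each sample report the paired reads retained after
# trimming (column 3) and after host filtering (column 5) as a % of raw (column 2)
retained=$(awk -F'\t' 'NR > 1 { printf "%s\t%.1f\t%.1f\n", $1, 100 * $3 / $2, 100 * $5 / $2 }' "${summary}")

echo "${retained}"

# Clean up the temporary example data
rm -r "${tmpdir}"
```

Note that this sketch does not guard against a raw read count of zero, which would cause a division error in `awk`.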


***