# ONT sequencing data processing for prokaryote isolates

# QC and sequence read trimming

***

## Assess quality of sequencing reads

There are several options for assessing read quality statistics. The examples below include *NanoStat*, *NanoPlot*, and *pycoQC*.

NOTE: 

- If downstream assemblies are very fragmented, you can apply a filter here (e.g. on read length or quality) to remove some of the poorer reads, which *may* improve the assemblies. Options for this include *NanoFilt* or *Filtlong*.
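
As a rough illustration of what a length or quality filter does, the sketch below scores a read by its mean Phred quality and keeps it only if it passes both thresholds. The function names and default thresholds are hypothetical, and this is not how *NanoFilt* or *Filtlong* are implemented. It assumes Phred+33 encoded quality strings and averages quality via per-base error probabilities rather than averaging the Phred scores directly.

```python
import math

def mean_phred(qual_string):
    """Mean read quality from a Phred+33 quality string.

    Scores are converted to error probabilities, averaged, and converted
    back, so a few very poor bases pull the mean down strongly.
    """
    if not qual_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - 33) / 10) for ch in qual_string]
    return -10 * math.log10(sum(probs) / len(probs))

def passes_filter(seq, qual_string, min_length=1000, min_quality=8.0):
    """Hypothetical filter: keep reads that are long enough and good enough."""
    return len(seq) >= min_length and mean_phred(qual_string) >= min_quality

# Toy reads: (sequence, quality string)
reads = [
    ("ACGT" * 500, "5" * 2000),   # 2000 bp, Q20 throughout: kept
    ("ACGT" * 500, "#" * 2000),   # 2000 bp, Q2 throughout: dropped on quality
    ("ACGT" * 50, "5" * 200),     # 200 bp, Q20: dropped on length
]
kept = [r for r in reads if passes_filter(*r)]
print(f"kept {len(kept)} of {len(reads)} reads")
```

Comparing the retained fraction at a few candidate thresholds is one way to judge how aggressive a filter would be before applying it to real data.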


#### Run NanoStat

This may take a few minutes per sample. If connection dropouts are an issue, you could run it remotely via, e.g., *slurm* or *tmux*.

In [None]:
# Working directory
cd /working/dir
mkdir -p 0.raw_data/3.NanoStat

# Load modules
module purge
module load NanoStat/1.5.0-gimkl-2020a-Python-3.8.2

# Run NanoStat on each barcode dataset
for i in {1..96}; do
    echo "S${i}"
    NanoStat -t 8 --tsv \
    -n 0.raw_data/3.NanoStat/S${i}.NanoStat.tsv \
    --fastq 0.raw_data/2.basecalled.demux/S${i}.fastq
done


#### Merge NanoStat results into one summary table

Example Python code to generate the summary table:

In [None]:
# Working directory
cd /working/dir

# Load python
module purge
module load Python/3.8.2-gimkl-2020a
python3

### Import required libraries
import pandas as pd
import numpy as np
from glob import glob

# Compile NanoStat results
results_list = []
file_count = len(glob('0.raw_data/3.NanoStat/*.NanoStat.tsv'))
# Samples are numbered S1..S<file_count>, so the range must include file_count
for i in range(1, file_count + 1):
    tmp_df = pd.read_csv('0.raw_data/3.NanoStat/S'+str(i)+'.NanoStat.tsv', sep='\t')
    tmp_df.index=['S'+str(i)]*len(tmp_df)
    tmp_df = tmp_df.pivot(columns='Metrics', values='dataset')
    results_list.append(tmp_df)

# Generate summary table and write out
results_df = pd.concat(results_list, axis=0)
results_df.index.name = 'SampleID'
results_df.to_csv('0.raw_data/3.NanoStat/summary_table_NanoStat.tsv', sep='\t')

quit()


#### Plot read metrics via NanoPlot

In [None]:
cd /work/dir
mkdir -p 0.raw_data/3.NanoPlot

module purge 
module load NanoPlot/1.41.0-gimkl-2022a-Python-3.10.5

reads_files=$(ls 0.raw_data/2.basecalled.demux/*.fastq)

NanoPlot -t 8 --tsv_stats \
--fastq ${reads_files} \
-o 0.raw_data/3.NanoPlot


#### Check raw reads via pycoQC

- Similar to FastQC, but for nanopore data.
- Note: this is run on the `sequencing_summary_<sequencing_run_ID>.txt` file generated during sequencing. As such, it may reflect the live basecalling and demultiplexing from the sequencing run rather than the re-basecalled and demultiplexed data, but it should still give a rough indication of data quality.

In [None]:
cd /work/dir
mkdir -p 0.raw_data/3.pycoQC

module purge
module load pycoQC/2.5.2-gimkl-2020a-Python-3.8.2

pycoQC -f 0.raw_data/sequencing_summary.txt -o 0.raw_data/3.pycoQC/pycoQC.html


***

## Quality trim and filter via *chopper*

Set the filtering and trimming parameters as appropriate for your data, based on the quality assessments above.

For these example data, based on the pycoQC output:
- Quality filtering was set to q >= 8.
- There appeared to be a number of short reads in the data (~100-200 bp), so the minimum read length was set to 100 to filter these out.
- Head and tail crop were set to 15 (picked arbitrarily, in case any junk remains at the read ends).
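
To make the head and tail crop concrete, the sketch below applies the settings listed above to a single toy read. The `crop_read` helper is hypothetical, and the order of operations (length check on the uncropped read, then cropping) is an assumption about how *chopper* behaves rather than something confirmed here; check the *chopper* documentation for the version you are running.

```python
def crop_read(seq, qual, headcrop=15, tailcrop=15, min_length=100):
    """Return the cropped (seq, qual), or None if the read is dropped.

    Assumption: the minimum length is tested on the uncropped read, and
    reads no longer than headcrop + tailcrop are dropped outright.
    """
    read_len = len(seq)
    if read_len < min_length or read_len <= headcrop + tailcrop:
        return None
    end = read_len - tailcrop
    return seq[headcrop:end], qual[headcrop:end]

seq = "A" * 15 + "C" * 90 + "G" * 15   # 120 bp toy read
qual = "5" * len(seq)                  # Q20 at every base (Phred+33)
cropped = crop_read(seq, qual)
print(len(cropped[0]))  # 90
```

Under that assumption, a retained read can end up shorter than the minimum length once 30 bases have been cropped, which is worth keeping in mind when reading the post-filter length distribution.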

In this example, stderr is also written to file (this includes read counts retained out of total per file): `1.trimmed_and_filtered_reads/1.chopper/chopper.readcounts.txt`
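
A small sketch of how that read-count file can be parsed. It assumes the layout written by this workflow: a `S<n>.fastq:` label line followed by a *chopper* line of the form `Kept <n> reads out of <m> reads`. The example text and the `parse_readcounts` name are made up for illustration.

```python
import re

KEPT_RE = re.compile(r"Kept (\d+) reads out of (\d+) reads")

def parse_readcounts(lines):
    """Map sample ID -> (kept, total) from label lines and 'Kept' lines."""
    counts = {}
    sample = None
    for line in lines:
        line = line.strip()
        if line.endswith(".fastq:"):
            # Label line written before each chopper run, e.g. "S1.fastq:"
            sample = line[: -len(".fastq:")]
        else:
            match = KEPT_RE.search(line)
            if match and sample is not None:
                counts[sample] = (int(match.group(1)), int(match.group(2)))
                sample = None
    return counts

# Made-up example content, not real results
example = """S1.fastq:
Kept 900 reads out of 1000 reads
S2.fastq:
Kept 450 reads out of 600 reads
""".splitlines()

for sample, (kept, total) in parse_readcounts(example).items():
    print(f"{sample}\t{kept}\t{total}\t{100 * kept / total:.1f}%")
```

Keying the counts by sample ID means a sample with no `Kept` line (e.g. a failed run) is simply absent from the result instead of shifting the counts of later samples.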

Note: later versions of *chopper* are available, but at the time of writing they were not working on NeSI.

In [None]:
cd /work/dir/
mkdir -p 1.trimmed_and_filtered_reads/1.chopper/fastq_files

module purge
module load chopper/0.5.0-GCC-11.3.0

# Create (or empty) the file that will collect chopper's stderr for the read counts summary
> 1.trimmed_and_filtered_reads/1.chopper/chopper.readcounts.txt

# run chopper
for i in {1..96}; do
    echo S${i}
    echo "S${i}.fastq:" >> 1.trimmed_and_filtered_reads/1.chopper/chopper.readcounts.txt
    cat 0.raw_data/2.basecalled.demux/S${i}.fastq | chopper --threads 8 -q 8 -l 100 --headcrop 15 --tailcrop 15 \
    > 1.trimmed_and_filtered_reads/1.chopper/fastq_files/S${i}.chopper.fastq 2>>1.trimmed_and_filtered_reads/1.chopper/chopper.readcounts.txt
done


#### Add filtered read counts to the summary table (written as `summary_table_NanoStatRaw_ChopperFiltered.tsv`)

In [None]:
# Working directory
cd /work/dir/

# Load python
module purge
module load Python/3.8.2-gimkl-2020a
python3

### Import required libraries
import pandas as pd
import numpy as np
from glob import glob

df1 = pd.read_csv('0.raw_data/3.NanoStat/summary_table_NanoStat.tsv', sep='\t').add_prefix('raw_reads_').rename(columns={'raw_reads_SampleID': 'SampleID'})

# Parse the chopper stderr log: each sample label line ("S<n>.fastq:") is
# followed by a line of the form "Kept <n> reads out of <m> reads"
sampleID = []
raw_read_count = []
filtered_read_count = []
with open('1.trimmed_and_filtered_reads/1.chopper/chopper.readcounts.txt') as file:
    for line in file:
        if 'fastq' in line:
            sampleID.append(line.split('.')[0])
        elif 'Kept' in line:
            fields = line.split()
            raw_read_count.append(int(fields[5]))
            filtered_read_count.append(int(fields[1]))

df2 = pd.DataFrame(data={'SampleID': sampleID, 'raw_reads_count': raw_read_count, 'filtered_reads_count': filtered_read_count})

# Join the chopper read counts onto the NanoStat summary by sample
df = pd.merge(df1, df2, on='SampleID')

# index=False avoids writing the unnamed row-number column
df.to_csv('1.trimmed_and_filtered_reads/summary_table_NanoStatRaw_ChopperFiltered.tsv', sep='\t', index=False)

quit()



#### Re-run summary stats: e.g. NanoPlot


In [None]:
cd /work/dir/1.trimmed_and_filtered_reads/
mkdir -p 2.chopper.NanoPlot

module purge 
module load NanoPlot/1.41.0-gimkl-2022a-Python-3.10.5

reads_files=$(ls 1.chopper/fastq_files/*.chopper.fastq)

NanoPlot -t 8 --tsv_stats \
--fastq ${reads_files} \
-o 2.chopper.NanoPlot


***