# Oxford Nanopore data processing: prokaryote isolate sequencing


***

## 1. Data prep and QC

## Index

- [1.1 General notes](#1.1-General-notes)
- [1.2 Data prep](#1.2-Data-prep)
- [1.3 Read QC](#1.3-Read-QC)

## 1.1 General notes

The examples below are based on sequencing of **12 isolates**, multiplexed during ONT library prep (each tagged with a unique barcode). Long-read sequencing data were generated on an Oxford Nanopore GridION and basecalled with the high accuracy model (HAC; Q9). In this example, basecalling was done in real time during sequencing, but you can also basecall the raw data separately afterwards. In our tests, comparable (generally identical) steps were also appropriate for processing data basecalled with the super accuracy model (SUP; Q10).

NOTE:

- Using HAC data can increase coverage (compared with SUP data), but at the expense of a lower quality threshold for individual reads. Coverage depth can be an important factor in the completeness and accuracy of genome assemblies downstream, and so can be important to consider when choosing between HAC and SUP data.
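
As a rough guide when weighing up coverage, mean depth can be approximated as the total number of bases sequenced divided by the expected genome size. The sketch below is illustrative only: the function name and the example numbers (500 Mbp of reads, a 5 Mbp genome) are assumptions, not values from this data set.

```python
def estimate_depth(total_bases, genome_size):
    """Approximate mean coverage depth as total bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of bases")
    return total_bases / genome_size


# Hypothetical example: 500 Mbp of reads for an isolate with a ~5 Mbp genome
print(f"{estimate_depth(500_000_000, 5_000_000):.0f}x")
```

Running the same calculation on the HAC and SUP read sets for one isolate gives a quick sense of how much depth is traded for per-read quality.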


## 1.2 Data prep

#### Concatenate read chunks for each barcode

Basecalled data for all barcodes are output from a MinION/GridION in chunks. The chunks for all *used* barcodes first need to be concatenated into a single file per barcode. If samples were split into replicates across multiple barcodes, these can also be concatenated together.

NOTE: 

- There can be a small number of reads assigned to unused barcodes. This reflects a low-level error rate in assigning reads to the correct barcode (cross-talk). Be aware that each of your sample data sets will likely contain a small fraction of incorrectly assigned reads, although these should be easy to spot downstream based on differential coverage between the correct sequences and those from cross-talk. For this step, we can simply ignore reads in the unused barcodes' directories.
- If you have run replicate samples across multiple barcodes (to increase data generated), and/or if you have the same samples run over multiple sequencing runs, you can choose whether to pool these here or instead process each separately and generate duplicate assemblies downstream. In this example, we will assume samples were run in duplicate on a single sequencing run, and we wish to pool these prior to assembly and downstream work.

This example assumes the raw data are contained within `/working/dir/0.ONT_data_HAC/*/fastq_pass/`

In [None]:
# Working directory
cd /working/dir
mkdir -p 1.ONT_data_HAC_concatenated/0.concat_barcodes

# Loop through each barcode and concatenate all of its chunks (fastq.gz files) into one fastq.gz file
for barcode_path in 0.ONT_data_HAC/*/fastq_pass/barcode*; do
    barcode=$(basename "${barcode_path}")
    # The wildcard run directory pools chunks for this barcode across every run present
    cat 0.ONT_data_HAC/*/fastq_pass/"${barcode}"/*.fastq.gz > 1.ONT_data_HAC_concatenated/0.concat_barcodes/"${barcode}".fastq.gz
done
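
A quick sanity check after concatenating is to confirm that no reads were lost. The Python sketch below counts reads in a gzipped FASTQ file. It assumes standard four-line FASTQ records, and the function name `count_fastq_reads` is made up for this example.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    n_lines = 0
    # gzip.open reads straight through files made by concatenating .gz chunks
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

For example, `count_fastq_reads('1.ONT_data_HAC_concatenated/0.concat_barcodes/barcode01.fastq.gz')` should equal the sum of the counts over that barcode's individual chunk files.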


#### Optional: Concatenate isolate replicates and rename as sequential isolate IDs

Concatenate replicates together for each isolate (if split across multiple barcodes during sequencing). You may also wish to rename isolates here for ease of downstream use. For example, the steps that follow assume that all samples are named with sequential isolate IDs (e.g. isolate_1, isolate_2, etc.).

A simple (incomplete) example using `cat` commands is given below. For a large number of samples, this may be cleaner to achieve via a loop over arrays of barcode IDs and sample numbers.

In [None]:
# Working directory
cd /working/dir/1.ONT_data_HAC_concatenated/
mkdir -p 1.concat_replicates

# Concatenate replicates of same samples, and rename as isolate_n
cat 0.concat_barcodes/barcode01.fastq.gz 0.concat_barcodes/barcode06.fastq.gz > 1.concat_replicates/isolate_1.fastq.gz
cat 0.concat_barcodes/barcode02.fastq.gz 0.concat_barcodes/barcode07.fastq.gz > 1.concat_replicates/isolate_2.fastq.gz
cat 0.concat_barcodes/barcode03.fastq.gz 0.concat_barcodes/barcode08.fastq.gz > 1.concat_replicates/isolate_3.fastq.gz
#... etc.
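
For larger sample sets, the same pooling can be driven by a single mapping of isolate IDs to barcodes. The sketch below is one possible approach in Python rather than bash. The function name `pool_replicates` is hypothetical, and the mapping only contains the three pairs shown in the `cat` commands above, so it would need extending to cover every isolate.

```python
import shutil
from pathlib import Path


def pool_replicates(isolate_barcodes, in_dir, out_dir):
    """Concatenate each isolate's barcode files into one <isolate>.fastq.gz."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for isolate, barcodes in isolate_barcodes.items():
        out_path = out_dir / f"{isolate}.fastq.gz"
        with open(out_path, "wb") as out_handle:
            for barcode in barcodes:
                # Byte-wise append, equivalent to `cat a.gz b.gz > out.gz`
                with open(in_dir / f"{barcode}.fastq.gz", "rb") as in_handle:
                    shutil.copyfileobj(in_handle, out_handle)


# Isolate ID -> replicate barcodes (pairs taken from the cat commands above)
isolate_barcodes = {
    "isolate_1": ["barcode01", "barcode06"],
    "isolate_2": ["barcode02", "barcode07"],
    "isolate_3": ["barcode03", "barcode08"],
}
```

Run from `/working/dir/1.ONT_data_HAC_concatenated/`, this would be called as `pool_replicates(isolate_barcodes, '0.concat_barcodes', '1.concat_replicates')`. Keeping the mapping in one place also documents which barcodes belong to which isolate.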


## 1.3 Read QC 

There are several options to assess read quality statistics. In this example we will generate basic read quality stats via *NanoStat*.

NOTE: 

- This may take a few minutes per sample. If connection dropouts are an issue, you could run this via, e.g., *slurm* or *tmux* so that the job survives a disconnect.
- If you get very fragmented assemblies downstream, you can opt to apply a filter here (e.g. a length or quality filter) to remove some of the poorer reads, which *may* improve assemblies downstream. Some options for this include *NanoFilt* or *Filtlong*.
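
To illustrate what a length and quality filter does, here is a minimal pure-Python sketch. It is not a replacement for *NanoFilt* or *Filtlong* and does not reproduce their exact behaviour. The function names and the default thresholds (1000 bp, Q10) are assumptions chosen for the example, and the code assumes four-line FASTQ records with Phred+33 qualities. Mean quality is computed here by averaging per-base error probabilities and converting back to a Phred score.

```python
import gzip
import math


def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over error probabilities."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_fastq(in_path, out_path, min_length=1000, min_quality=10.0):
    """Write reads passing both thresholds to out_path; return number kept."""
    kept = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            if len(seq) >= min_length and mean_read_quality(qual) >= min_quality:
                fout.write(f"{header}{seq}\n{plus}{qual}\n")
                kept += 1
    return kept
```

Comparing the returned count with the input read count shows how aggressive a given pair of thresholds is before committing to a re-assembly.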


#### Run *NanoStat* on data for all 12 isolates

In [None]:
# Working directory
cd /working/dir
mkdir -p 1.ONT_data_HAC_concatenated/1.concat_replicates/NanoStat

# Load modules
module purge
module load NanoStat/1.5.0-gimkl-2020a-Python-3.8.2

# Run NanoStat on each isolate dataset
for i in {1..12}; do
    NanoStat -t 8 --tsv \
    -n 1.ONT_data_HAC_concatenated/1.concat_replicates/NanoStat/isolate_${i}_NanoStat.tsv \
    --fastq 1.ONT_data_HAC_concatenated/1.concat_replicates/isolate_${i}.fastq.gz
done


#### Merge NanoStat results into one summary table

NOTE:

- This will ultimately be put into a script for ease of use. But for now we can use the python code below.

In [None]:
# Working directory
cd /working/dir

# Load python
module purge
module load Python/3.8.2-gimkl-2020a
python3

### Import required libraries
import pandas as pd
import numpy as np

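# Compile NanoStat results (one single-row table per isolate)
results_list = []
for i in range(1, 13):
    # Each NanoStat TSV is read with one metric per row, in columns 'Metrics' and 'dataset'
    tmp_df = pd.read_csv(f'1.ONT_data_HAC_concatenated/1.concat_replicates/NanoStat/isolate_{i}_NanoStat.tsv', sep='\t')
    # Label every row with the isolate ID, then pivot so each metric becomes a column
    tmp_df.index = [f'isolate_{i}'] * len(tmp_df)
    tmp_df = tmp_df.pivot(columns='Metrics', values='dataset')
    results_list.append(tmp_df)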

# Generate summary table and write out
results_df = pd.concat(results_list, axis=0)
results_df.index.name = 'isolateID'
results_df.to_csv('1.ONT_data_HAC_concatenated/1.concat_replicates/NanoStat/summary_table_NanoStat.tsv', sep='\t')

quit()


***