# ONT sequencing data processing for prokaryote isolates

# Basecalling and demultiplexing

Note: the example below is based on a sequencing run of one plate of 96 samples using multiplexing barcodes 1-96

***

## Data prep

#### Copy POD5 files

Note: If data is received from the sequencing provider already demultiplexed, that demultiplexing may have been based on live 'fast' basecalling during the sequencing run (and the 'alias' column in the *sequencing_summary.txt* files is _likely_ derived from this fast basecalling). In that case, it is preferable to _first_ redo basecalling with the SUP model and then redo the demultiplexing, so that demultiplexing is based on the highest-quality base calls.

If the data was provided demultiplexed, copy each set of raw data files into a single pod5 directory, ignoring the previous split by 'pass' and 'fail' and previous demultiplexing by barcode.

Note: the basecalling step here uses the *sample_sheet_<sequencing_run_ID>.csv* file generated during the sequencing run (this should be provided with the data delivery). Also copy the *sequencing_summary_<sequencing_run_ID>.txt* file if you wish to run *pycoQC* later.

In [None]:
cd /work/dir/
mkdir -p 0.raw_data/0.pod5

# pod5 files
cp -r /path/to/raw/data/pod5_pass/*/*.pod5 0.raw_data/0.pod5/
cp -r /path/to/raw/data/pod5_fail/*/*.pod5 0.raw_data/0.pod5/

# accessory files
cp /path/to/raw/data/sample_sheet.csv 0.raw_data/
cp /path/to/raw/data/sequencing_summary.txt 0.raw_data/
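Because the copy step flattens many per-barcode directories into one, any two files that share a name would silently overwrite each other. The sketch below is one way to check for that before copying. It is an illustrative helper rather than part of the original workflow, and the function name `find_name_collisions` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def find_name_collisions(source_dirs, pattern="*.pod5"):
    """Return {file name: [paths]} for file names found more than once.

    Each source directory is searched recursively, so files inside
    per-barcode subdirectories are included.
    """
    seen = defaultdict(list)
    for source in source_dirs:
        for path in sorted(Path(source).rglob(pattern)):
            seen[path.name].append(path)
    # Keep only the names that would overwrite each other in a flat directory
    return {name: paths for name, paths in seen.items() if len(paths) > 1}
```

For example, `find_name_collisions(['/path/to/raw/data/pod5_pass', '/path/to/raw/data/pod5_fail'])` returns an empty dictionary when every file name is unique, in which case the flat copy above is safe.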

#### Edit sample sheets

For formatting, see the [dorado sample sheet documentation](https://github.com/nanoporetech/dorado/blob/release-v0.7/documentation/SampleSheets.md).

Key points: 

- At a minimum a sample sheet must contain `kit`, `experiment_id` and one of `position_id` or `flow_cell_id`. All rows in a sample sheet must contain the same `experiment_id`.
- Add `barcode` and `alias` columns to enable renaming output files from dorado

The example below reformats the sample sheet via *python*. Update the 'kit', 'barcode', and 'alias' entries as appropriate for your sample number and sample IDs. This example uses kit 'SQK-RBK114-96', barcodes barcode01-barcode96, and sample IDs (stored in 'alias') S1-S96.

In [None]:
cd /work/dir

# Load python
module purge
module load Python/3.11.6-foss-2023a
python3

# Import required library
import pandas as pd

# plate 1: keep the run-level fields from the first row of the original sample sheet
run_cols = ['experiment_id', 'position_id', 'flow_cell_id', 'protocol_run_id', 'flow_cell_product_code']
df = pd.read_csv('0.raw_data/sample_sheet.csv')[run_cols].loc[[0]]
df['kit'] = 'SQK-RBK114-96'
# add barcode-to-sampleID mapping as 'alias' (barcode01-barcode96 -> S1-S96)
sample_numbers = pd.Series(range(1, 97)).astype(str)
barcodes = pd.DataFrame(data={'kit': 'SQK-RBK114-96', 'barcode': 'barcode' + sample_numbers.str.zfill(2), 'alias': 'S' + sample_numbers})
# merging on the shared 'kit' value repeats the run-level fields for every barcode row
df = pd.merge(df, barcodes, on='kit')
df.to_csv('0.raw_data/sample_sheet.formatted.csv', index=False)

quit()
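Before basecalling, it can be worth checking the formatted sheet against the key points listed above. The following is a minimal sketch using only the standard library. The function name `check_sample_sheet` is hypothetical, and the uniqueness check on `barcode` and `alias` is an assumption that fits this one-sample-per-barcode run rather than a rule taken from the dorado documentation.

```python
import csv
import io

REQUIRED = ("kit", "experiment_id")  # from the key points above


def check_sample_sheet(handle):
    """Return a list of problems found in a dorado-style sample sheet."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    rows = list(reader)
    problems = []

    for col in REQUIRED:
        if col not in columns:
            problems.append(f"missing required column: {col}")
    if "position_id" not in columns and "flow_cell_id" not in columns:
        problems.append("need one of: position_id, flow_cell_id")
    if len({row.get("experiment_id") for row in rows}) > 1:
        problems.append("experiment_id differs between rows")
    # Assumption for this run: each barcode and each alias is used once
    for col in ("barcode", "alias"):
        values = [row[col] for row in rows if col in row]
        if len(values) != len(set(values)):
            problems.append(f"duplicate values in column: {col}")
    return problems


# Hypothetical two-sample sheet, for illustration only
example = io.StringIO(
    "experiment_id,flow_cell_id,kit,barcode,alias\n"
    "run1,FC001,SQK-RBK114-96,barcode01,S1\n"
    "run1,FC001,SQK-RBK114-96,barcode02,S2\n"
)
print(check_sample_sheet(example))  # prints [] because no problems are found
```

To check the real file, pass an open handle instead, e.g. `open('0.raw_data/sample_sheet.formatted.csv', newline='')`.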


***

## Basecalling with dorado

#### Pre-download model

This example uses Dorado v0.7.3 and the model appropriate for this particular sequencing run. Check for newer Dorado versions and model updates (note: a model that exactly matches the sequencing kit used isn't always available; pick the most appropriate one).

In [None]:
cd /work/dir/0.raw_data
mkdir basecalling_model
cd basecalling_model

module purge
module load Dorado/0.7.3 
dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.0.0 
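When choosing a model, it helps to read the model name as a set of fields: analyte, pore type, chemistry, translocation speed, model type, and model version. The sketch below splits a name of the form used in this example into those fields. The regular expression is an assumption that covers DNA simplex names shaped like the one above and is not a complete parser for every dorado model name.

```python
import re

# Assumed layout: <analyte>_<pore>_<chemistry>_<speed>_<model type>@<version>
MODEL_PATTERN = re.compile(
    r"^(?P<analyte>[a-z]+)_(?P<pore>r[\d.]+)_(?P<chemistry>e[\d.]+)"
    r"_(?P<speed>\d+bps)_(?P<model_type>fast|hac|sup)@(?P<version>v[\d.]+)$"
)


def parse_model_name(name):
    """Split a model name into its named fields, or raise ValueError."""
    match = MODEL_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name}")
    return match.groupdict()


fields = parse_model_name("dna_r10.4.1_e8.2_400bps_sup@v5.0.0")
for key, value in fields.items():
    print(f"{key}: {value}")
```

Comparing the pore, chemistry, and speed fields with the flow cell and kit details from the run report is a quick way to confirm that the chosen model is a sensible match.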


#### Run basecalling

This basecalling example requires GPU access on NeSI to run.

Note: Using HAC (high-accuracy) basecalls can increase coverage compared with SUP (super-high accuracy) basecalls, but at the expense of lower quality for individual reads. Coverage depth can be an important factor in the completeness and accuracy of downstream genome assemblies, so it is worth considering when choosing between HAC and SUP data.
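To make the coverage trade-off concrete, mean coverage depth can be estimated as the total number of bases retained divided by the genome size. The numbers in the sketch below are hypothetical and chosen only to show the arithmetic. They are not measurements from this run or a statement of how much HAC and SUP yields differ.

```python
def mean_coverage(total_bases, genome_size):
    """Estimate mean coverage depth as retained bases / genome size."""
    return total_bases / genome_size


# Hypothetical values for a single isolate, for illustration only
genome_size = 5_000_000            # 5 Mbp genome
bases_retained_sup = 250_000_000   # bases kept after filtering SUP reads
bases_retained_hac = 400_000_000   # bases kept after filtering HAC reads

print(f"SUP: {mean_coverage(bases_retained_sup, genome_size):.0f}x")  # SUP: 50x
print(f"HAC: {mean_coverage(bases_retained_hac, genome_size):.0f}x")  # HAC: 80x
```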

In [None]:
#!/bin/bash -e
#SBATCH -A <>
#SBATCH -J dorado_basecalling
#SBATCH --time=2-12:00:00
#SBATCH --mem=40G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --output=dorado_basecalling.out
#SBATCH --error=dorado_basecalling.err
#SBATCH --partition=hgx
#SBATCH --gpus-per-node=A100:1

cd /work/dir/0.raw_data/
mkdir -p 1.basecalling

module purge
module load Dorado/0.7.3 

dorado basecaller \
  --kit-name SQK-RBK114-96 \
  --trim all \
  --device 'cuda:all' \
  --recursive \
  --sample-sheet sample_sheet.formatted.csv \
  basecalling_model/dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
  0.pod5/ \
  > 1.basecalling/basecalled.dorado0.7.3_sup.bam


Example runtime < 2.5 days; MaxRSS < 30 GB

***

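## Demultiplexing

Demultiplex by barcode/sample ID (assigned during the basecalling step above). The `--no-classify` flag makes `dorado demux` split reads on these existing assignments rather than re-classifying barcodes.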


In [None]:
#!/bin/bash -e
#SBATCH -A <>
#SBATCH -J dorado_demux
#SBATCH --time 00:30:00
#SBATCH --ntasks=1
#SBATCH --mem 4GB
#SBATCH --cpus-per-task=16
#SBATCH -e dorado_demux.err
#SBATCH -o dorado_demux.out

cd /work/dir/0.raw_data/
mkdir -p 2.basecalled.demux

module purge
module load Dorado/0.7.3 

dorado demux --no-classify --emit-fastq --output-dir 2.basecalled.demux 1.basecalling/basecalled.dorado0.7.3_sup.bam


Example runtime < 25 mins; MaxRSS < 2 GB
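After demultiplexing, a quick per-file read count can flag samples with few or no reads. The sketch below counts records in each FASTQ file of the output directory. It assumes uncompressed FASTQ with exactly four lines per record and a `.fastq` file extension, and it makes no assumption about how dorado names the individual files. Both function names are hypothetical.

```python
from pathlib import Path


def count_fastq_reads(fastq_path):
    """Count records in an uncompressed, four-lines-per-record FASTQ file."""
    with open(fastq_path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def reads_per_sample(demux_dir):
    """Return {file name: read count} for every .fastq file in demux_dir."""
    return {
        path.name: count_fastq_reads(path)
        for path in sorted(Path(demux_dir).glob("*.fastq"))
    }
```

Calling `reads_per_sample('2.basecalled.demux')` from `/work/dir/0.raw_data/` gives one entry per output file, which can then be compared against the 96 expected sample IDs.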

***