# Dorado 🐟 basecalling for Google Colab


---


...for when you don't have a GPU but really need those fast5s converted.

And if playing with a v0.0.1 release basecaller sounds like your jam. 

This note is based on the great work by Miles Benton, first written up in his GitHub [gist](https://gist.github.com/sirselim/13f70ae69f2a512e7d9e1f00f9704f53), which describes using Google Colab for Guppy basecalling of fast5 reads. For anything resembling production basecalling in a Colab environment, I'd recommend following the link and going through his notes. 

The Google Colab free tier has some limitations that stop it from being a great main bioinformatics platform, such as:



*   12 hour limit per session ⏲ - for reference, a bacterial genome fast5 directory about 30GB in size would take about 5 hours to basecall on a Colab runtime with a Tesla T4 GPU, using Guppy (6.3.8) with the SUP configuration.

*   Hardware availability fluctuates 🖥 - the disk, RAM and GPU available for a particular session are assigned by Google and are not under user control (a free-tier Google Colab session will generally have about 12GB of RAM).

*   Storage limitations 💾 - in most scenarios, bioinformatics work on Google Colab will require linking Google Drive. While the default storage for the free tier (volatile, since it's tied to the instance) usually hovers around 78+GB, getting larger data off your Google Colab instance will be excruciatingly slow, and will count toward the 12 hour limit already imposed on users. Additional Google Drive storage fees can also add up quickly. 
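
To judge whether a dataset is likely to fit inside one session, the reference figure above (about 30GB in about 5 hours) can be turned into a rough estimate. This is only a sketch: it assumes basecalling time scales linearly with fast5 size, and the function names and the one-hour overhead allowance are made up for illustration.

```python
# Rough session-budget estimate from the reference figure above:
# ~30 GB of fast5 in ~5 hours on a T4 with Guppy SUP.
# Assumption: basecalling time scales linearly with input size.
REFERENCE_GB = 30.0
REFERENCE_HOURS = 5.0
SESSION_LIMIT_HOURS = 12.0


def estimated_hours(size_gb: float) -> float:
    """Estimate basecalling hours for a fast5 directory of size_gb."""
    return size_gb * REFERENCE_HOURS / REFERENCE_GB


def fits_in_session(size_gb: float, overhead_hours: float = 1.0) -> bool:
    """True if basecalling plus a (hypothetical) setup/transfer overhead fits."""
    return estimated_hours(size_gb) + overhead_hours <= SESSION_LIMIT_HOURS


print(estimated_hours(30))   # 5.0
print(fits_in_session(90))   # False: about 15 hours of basecalling alone
```

Different models, GPUs and read lengths will shift these numbers, so treat the result as a sanity check rather than a guarantee.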



However, if you can work around the above limitations, a Colab note makes for a surprisingly good prototyping platform, with all the benefits of Jupyter notebooks and replaceable instance computing. The GPU access provided on all tiers is certainly irresistible for those of us without decent hardware for the latest crop of deep-learning-oriented bioinformatics tools.



First, let's see if our instance has a GPU.
``` 
%%shell
```
Typing the above on the first line of a code block tells the Colab note to treat the rest of the block as standard Linux shell commands.

*A Colab note treats anything typed into a code block without the shell call as Python.*

In [1]:
%%shell
nvidia-smi

Fri Oct  7 12:53:24 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces



If the instance does not have a GPU (in many cases it won't), you can change the runtime type using the Colab menu at the upper left. 

Click "Runtime", choose "Change runtime type" and then choose GPU. 

This will start a whole new runtime, and will wipe out any progress you made in the note. 
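
Since switching runtimes wipes the session, it can be handy to script the GPU check at the top of a note. The sketch below is one way to do that from Python; the helper name `gpu_names` is made up, and it relies only on the `nvidia-smi` query options `--query-gpu=name` and `--format=csv,noheader`.

```python
import shutil
import subprocess
from typing import List


def gpu_names() -> List[str]:
    """Return GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # tool is present but no usable GPU was found
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No GPU found: switch the runtime type to GPU")
```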

The shell prompt is essential for adding some flexibility to a Colab note, and is worth getting familiar with. 

Here's an example running ls to list the contents of the current directory, just like in any Linux shell environment. 

*The 'sample_data' directory you might run into is simply a default sample dataset Google generates for all Colab notes, and can be safely ignored.*

In [2]:
%%shell
ls

sample_data




Let's import TensorFlow into our runtime and check the name (and number) of our GPU device:

In [3]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'
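
The trailing index in that device name is what gets passed to Dorado later as `cuda:0`. A minimal sketch of the conversion, assuming TensorFlow's GPU index matches the CUDA device index (which should hold on a single-GPU Colab runtime); `tf_to_cuda_device` is a made-up helper name.

```python
def tf_to_cuda_device(tf_name: str) -> str:
    """Convert a TensorFlow name like '/device:GPU:0' to 'cuda:0'."""
    prefix = "/device:GPU:"
    if not tf_name.startswith(prefix):
        # tf.test.gpu_device_name() returns '' when no GPU is available
        raise ValueError(f"no GPU device in {tf_name!r}")
    index = int(tf_name[len(prefix):])
    return f"cuda:{index}"


print(tf_to_cuda_device("/device:GPU:0"))  # cuda:0
```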

And let's link this Colab runtime with our Google Drive account and mount the drive.

Running the code below will create a popup asking us to authorize Colab to access our Google Drive.

*Note that you'll have to re-authorize Colab access with every new runtime.* 

In [4]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive
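
Given the storage limits mentioned earlier, it may be worth checking free space on the instance disk before converting or basecalling anything large. A small sketch using the standard library; `free_gb` is a made-up helper, and the number it reports for a mounted Drive path may not reflect your actual Drive quota.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Return free space, in GB, on the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.free / 1e9


# On Colab the working directory is normally /content (the instance disk).
print(f"{free_gb('.'):.1f} GB free")
```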


Dorado is an alpha basecaller (v0.0.1 as of October 7th, 2022) based on libtorch, so it is written primarily in C++, with all the benefits and potential complexities that entails (though [an introduction to Dorado](https://www.youtube.com/watch?v=EraN4vQedtQ) suggests libtorch is a surprisingly comprehensible match for people already versed in PyTorch).

Dorado's code and binaries are available from the Nanopore GitHub [here](https://github.com/nanoporetech/dorado). 

It should also be noted that Dorado recommends working with the new [pod5](https://github.com/nanoporetech/pod5-file-format) file format, not fast5, for maximum performance. Apparently fast5 files are supported, but at the cost of basecalling speed. 

Let's start with installing a suite of pod5 tools so we can convert our sample fast5 file. 

In [5]:
%%shell
pip install pod5_format_tools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pod5_format_tools
  Downloading pod5_format_tools-0.0.23-py3-none-any.whl (18 kB)
Collecting pod5-format
  Downloading pod5_format-0.0.23-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
     |████████████████████████████████| 2.7 MB 45.0 MB/s 
Collecting ont-fast5-api
  Downloading ont_fast5_api-4.1.0-py3-none-any.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 54.5 MB/s 
Collecting progressbar33>=2.3.1
  Downloading progressbar33-2.4.tar.gz (10 kB)
Collecting pyarrow~=8.0.0
  Downloading pyarrow-8.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.3 MB)
     |████████████████████████████████| 29.3 MB 68.3 MB/s 
Collecting iso8601
  Downloading iso8601-1.1.0-py3-none-any.whl (9.9 kB)
Building wheels for collected packages: progressbar33
  Building wheel for progressbar33 (setup.py) ... done
  Created wheel



And now we can use the 'pod5-convert-from-fast5' tool to convert our fast5 to pod5. The command for conversion is:

```
pod5-convert-from-fast5 input_fast5_dir output_dir
```

*When working in a Colab environment, I'd recommend writing the pod5 output to a pre-existing directory in Google Drive, just to be on the safe side.*
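
One way to make sure the output directory exists before running the converter is to create it from a Python cell. A minimal sketch; `ensure_output_dir` is a made-up helper, and the path in the comment is the one used elsewhere in this note.

```python
from pathlib import Path


def ensure_output_dir(path: str) -> Path:
    """Create path (and any missing parents) if needed, then return it."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out


# Example call on Colab, after Google Drive has been mounted:
# ensure_output_dir("gdrive/MyDrive/data/fast5/tmp/pod5")
```

Because `exist_ok=True` is set, running the cell again in a fresh runtime is harmless.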

In [15]:
%%shell
pod5-convert-from-fast5 gdrive/MyDrive/data/fast5/tmp/*.fast5 pod5

Converting reads...
0/1/1 files	1100/4000 reads, 550.3 MSamples, 71.1 MB/s
0/1/1 files	2200/4000 reads, 1.1 GSamples, 70.2 MB/s
0/1/1 files	3200/4000 reads, 1.6 GSamples, 68.2 MB/s
1/1/1 files	4000/4000 reads, 2.0 GSamples
Conversion complete: 1963063745 samples
Close all deleted: <pod5_format.writer.Writer object at 0x7f477187bf10>




'output.pod5' is the default file name for pod5 conversion. By default, the converter also combines all fast5 files found in the input directory into a single .pod5 file.

In [23]:
%%shell
ls gdrive/MyDrive/data/fast5/tmp/pod5

output.pod5





A simple 'wget' command can download the Dorado binary tarball directly from the link on its GitHub page (listed under 'Installations'). The file arrives under a hashed name ending in .gz, so we rename it to the release name with a .tar.gz extension to make clear that it is a tarball before decompressing it. 

In [None]:
%%shell
wget https://nanoporetech.box.com/shared/static/h8eqc9htxk938jzpl4fch2rqlm48yeb0.gz

Next, let's rename the downloaded tarball to what it would be called if we had downloaded it on a local machine.

In [None]:
%%shell
mv h8eqc9htxk938jzpl4fch2rqlm48yeb0.gz dorado-0.0.1+4b67720-linux-x64.tar.gz

In [33]:
%%shell
tar zxvf dorado-0.0.1+4b67720-linux-x64.tar.gz

dorado-0.0.1+4b67720-Linux/hdf5/
dorado-0.0.1+4b67720-Linux/hdf5/lib/
dorado-0.0.1+4b67720-Linux/hdf5/lib/plugin/
dorado-0.0.1+4b67720-Linux/hdf5/lib/plugin/libvbz_hdf_plugin.a
dorado-0.0.1+4b67720-Linux/lib/
dorado-0.0.1+4b67720-Linux/lib/libtorch_cuda_cpp.so
dorado-0.0.1+4b67720-Linux/lib/libhdf5hl_fortran.so.8.0.1
dorado-0.0.1+4b67720-Linux/lib/libbackend_with_compiler.so
dorado-0.0.1+4b67720-Linux/lib/libc10.so
dorado-0.0.1+4b67720-Linux/lib/libhdf5_hl.so.8.0.1
dorado-0.0.1+4b67720-Linux/lib/libhdf5.so.8
dorado-0.0.1+4b67720-Linux/lib/libhdf5_hl_cpp.so.8
dorado-0.0.1+4b67720-Linux/lib/libcublasLt-17d45838.so.11
dorado-0.0.1+4b67720-Linux/lib/libhdf5hl_fortran.so
dorado-0.0.1+4b67720-Linux/lib/libtorch.so
dorado-0.0.1+4b67720-Linux/lib/libhdf5_hl_cpp.so
dorado-0.0.1+4b67720-Linux/lib/libnvrtc-builtins-4730a239.so.11.3
dorado-0.0.1+4b67720-Linux/lib/libtorch_cuda_linalg.so
dorado-0.0.1+4b67720-Linux/lib/libcudnn_ops_train.so.8
dorado-0.0.1+4b67720-Linux/lib/libhdf5_fortran.so.8
dorad
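
The download, rename and extract steps above can also be done from a Python cell with the standard library. This is only a sketch with made-up helper names (`fetch`, `extract_tarball`): it saves the file directly under the final name, and opens it with mode `r:gz`, which reads a gzip-compressed tarball regardless of the file extension.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List


def fetch(url: str, dest: str) -> Path:
    """Download url and save it directly under the file name dest."""
    urllib.request.urlretrieve(url, dest)
    return Path(dest)


def extract_tarball(archive: str, target_dir: str = ".") -> List[str]:
    """Extract a gzip-compressed tarball and return its member names."""
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names


# Example call with the URL and file name used in the cells above:
# url = "https://nanoporetech.box.com/shared/static/h8eqc9htxk938jzpl4fch2rqlm48yeb0.gz"
# archive = fetch(url, "dorado-0.0.1+4b67720-linux-x64.tar.gz")
# extract_tarball(str(archive))
```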



I'm interested in seeing how the SUP mode performs, so I'm downloading the dna_r9.4.1_e8_sup@v3.3 model for testing. 

Dorado currently offers 9 models for download:

*    dna_r10.4.1_e8.2_260bps_fast @ v3.5.2
*    dna_r10.4.1_e8.2_260bps_hac @ v3.5.2
*    dna_r10.4.1_e8.2_260bps_sup @ v3.5.2
*    dna_r10.4.1_e8.2_400bps_fast @ v3.5.2
*    dna_r10.4.1_e8.2_400bps_hac @ v3.5.2
*    dna_r10.4.1_e8.2_400bps_sup @ v3.5.2
*    dna_r9.4.1_e8_fast @ v3.4
*    dna_r9.4.1_e8_hac @ v3.3
*    dna_r9.4.1_e8_sup @ v3.3



In [34]:
%%shell
./dorado-0.0.1+4b67720-Linux/bin/dorado download --model dna_r9.4.1_e8_sup@v3.3

 - downloading dna_r9.4.1_e8_sup@v3.3 [200]




Let's see what our options are like with Dorado:

In [37]:
%%shell
./dorado-0.0.1+4b67720-Linux/bin/dorado basecaller --help 

Usage: dorado [options] model data 

Positional arguments:
model              	the basecaller model to run.
data               	the data directory.

Optional arguments:
-h --help          	shows help message and exits
-v --version       	prints version information and exits
-x --device        	device string in format "cuda:0,...,N", "cuda:all", "metal" etc.. [default: "cuda:all"]
-b --batchsize     	if 0 an optimal batchsize will be selected [default: 0]
-c --chunksize     	[default: 10000]
-o --overlap       	[default: 500]
-r --num_runners   	[default: 2]
--emit-fastq       	[default: false]
--remora-batchsize 	[default: 1000]
--remora-threads   	[default: 1]
--remora_models    	a comma separated list of remora models [default: ""]




And now, time to test out the actual basecalling. This particular Colab runtime has a Tesla T4 GPU, so I'm reducing the chunk size from the default 10000 to a more reasonable 2000. 

I'm also explicitly selecting the runtime's GPU with cuda:0 - the device number was identified earlier in the process via the TensorFlow gpu_device_name call. Note that the help text above lists --emit-fastq as false by default, so add that flag if the redirected output does not come out as FASTQ.

In [40]:
%%shell
pod5="gdrive/MyDrive/data/fast5/tmp/pod5"
dorado_001=./dorado-0.0.1+4b67720-Linux/bin/dorado
model="dna_r9.4.1_e8_sup@v3.3"
$dorado_001 basecaller $model $pod5 -c 2000 -x cuda:0 > demo.fastq

> Creating basecall pipeline
> Reads basecalled: 4000
> Samples/s: 1.121817e+06
> Finished
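
As a quick back-of-the-envelope check, the reported throughput can be combined with the sample count from the pod5 conversion step to estimate how long this run took. The sketch assumes that Samples/s is an average over the whole run and that every converted sample was basecalled.

```python
total_samples = 1_963_063_745     # "Conversion complete" line from the pod5 step
samples_per_second = 1.121817e6   # "Samples/s" line reported by Dorado

seconds = total_samples / samples_per_second
print(f"{seconds:.0f} s, about {seconds / 60:.0f} minutes")  # 1750 s, about 29 minutes
```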




If the dorado command above looks a little unexpected, it's just the inputs being passed in as shell variables. The shell on Colab really isn't as snappy as what we're used to on local hardware, and the usual tab-based path completion is missing. 

Declaring long directory names as variables and then calling them in a shell command makes things a little easier to type out and organize. 

On a local Linux machine, the above command would have looked like:

```
./dorado-0.0.1+4b67720-Linux/bin/dorado basecaller dna_r9.4.1_e8_sup@v3.3 gdrive/MyDrive/data/fast5/tmp/pod5 -c 2000 -x cuda:0 > demo.fastq
```
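
The same idea of naming the inputs first carries over to a Python cell. Below is a sketch that assembles the command as an argument list, using only flags shown in the help output above (`-c`, `-x`, `--emit-fastq`); `basecaller_cmd` is a made-up helper, not part of Dorado.

```python
import shlex
from typing import List


def basecaller_cmd(dorado: str, model: str, data: str,
                   chunksize: int = 10000, device: str = "cuda:all",
                   emit_fastq: bool = False) -> List[str]:
    """Build a 'dorado basecaller' argument list (defaults mirror the help text)."""
    cmd = [dorado, "basecaller", model, data, "-c", str(chunksize), "-x", device]
    if emit_fastq:
        cmd.append("--emit-fastq")
    return cmd


cmd = basecaller_cmd(
    "./dorado-0.0.1+4b67720-Linux/bin/dorado",
    "dna_r9.4.1_e8_sup@v3.3",
    "gdrive/MyDrive/data/fast5/tmp/pod5",
    chunksize=2000,
    device="cuda:0",
)
print(shlex.join(cmd))  # the command as it would be typed in a shell
```

To actually run it, the list could be passed to `subprocess.run(cmd, stdout=handle, check=True)`, with `handle` being an output file opened for writing.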

Let's use awk to check how many reads are in our basecalled file.

In [45]:
%%shell
awk 'END{print NR/4}' gdrive/MyDrive/data/fast5/dorado/demo.fastq

4000




4000 reads, matching the count Dorado reported at the end of its run. 
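
The awk one-liner assumes exactly four lines per read. A Python sketch of the same count that also complains when the line count is not a multiple of four (a hint that the file is truncated or not FASTQ); the helper name and the tiny demo file are made up.

```python
import tempfile


def count_fastq_records(path: str) -> int:
    """Count records in a FASTQ file that uses four lines per record."""
    with open(path) as handle:
        lines = sum(1 for _ in handle)
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


# Tiny made-up FASTQ file, only to demonstrate the call.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGT\n+\n!!!!\n@read2\nGG\n+\n!!\n")

print(count_fastq_records(tmp.name))  # 2
```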

Now let's check the file itself to make sure what we have is a real fastq file.

In [46]:
%%shell 
head gdrive/MyDrive/data/fast5/dorado/demo.fastq

@000fb14e-3e74-4741-89db-d60fc950c8af
GTTCAGTTGCCGTGTTGGGTGTTTAACCGTTTTTCGCATTTATCGTGAAACGCTGCGCGCGTTTTCGTGCCGCTTCAGGAGCATTCCCGAAAGCCGTGCAGTGGGGCGTCCCCTGCTGGAAGTGGTGCGCCGACATACCCTAGAAGCCCTGGCCGAGCGCGGCGGCGAACTGGAACTGGAAGCGGGTGGACGAGGCGCTGCGCGTGAAAGCCCGCCTGGAAGGTCAGGGCGGCACCCTGATCCTGGAAGACCTGAGCGAACACCGCCGCCGTGAAGCCGAACTGCGGGAAGCCACCGCCGTACTGTCACATGAATTCCGCACCCCCTGGGCAGCCATCCGGGGCGTGCTGGAAGCGCTGGAATATGACATGCCCCGCGAGTTGCAGCAGAACTTCGTGGCCCAGGGACTGCAGGAAGCTGAGCGCCTGGCCCGGCTGGTCGAAGATCTGGCGGTGGGGTTCCAGGCCTACGCGCACGCACACTGACGCTCTCGGAGGCCTTTGAACGCCGAGCGTTTGCTGGACACCGAATACGATCTGGGCCGCCCCACCAAGGGAGCCAAGCGCCAGGCCACCAAGCGGGGAGCTGCGCCGCGCCGCCGCGTTGCAACTCGTGTTCGGTGAAGATCTGGTGCGCGCCGATCCTGACAAGCTCCTGCCGTCCTGCTGAACCTGATCGAGAATGCGTTCAAATATGGTCCGGAAGGGGCACTGGTTGAAGTCGCCACCTCTGCACAGAACGCCTGGACCGAGGTGGCGGTGCTGGATCAGGGCGACCCCATCTCGGACACCGAGTACTCTTCCGGGCGCACACGCGGGGCCACGTGAGCGGTCAGGGCATTGGCATGGGCCTGTCGTCCGCAGCATCGTGCAGGGCTGGGGCGGTCAGGCCTGGGCGAACGGCGCGGCGACCGTAACGCCTTTTGCTTTACCGTGCCGGGCGTGGGCTGAGGGATCAGGAGT



And finally, let's count how many bases are in the newly generated fastq file.

In [47]:
%%shell 
cat gdrive/MyDrive/data/fast5/dorado/demo.fastq | paste - - - - | cut -f2 | tr -d '\n' | wc -c

148369415
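
Dividing that total by the 4000 reads gives a mean read length of roughly 37 kb. The sketch below does that arithmetic and shows a Python equivalent of the shell pipeline, again assuming four lines per record; `fastq_base_stats` and the demo file are made up for illustration.

```python
import tempfile
from typing import Tuple


def fastq_base_stats(path: str) -> Tuple[int, int]:
    """Return (reads, total bases) for a four-line-per-record FASTQ file."""
    reads = 0
    bases = 0
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of each record is the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases


# Mean read length from the numbers reported above.
print(f"{148_369_415 / 4000:.0f} bases per read on average")  # 37092

# Tiny made-up FASTQ file, only to demonstrate the call.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGT\n+\n!!!!\n@read2\nGG\n+\n!!\n")

print(fastq_base_stats(tmp.name))  # (2, 6)
```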


