How to acquire Remora models in the toml format that Dorado expects as input? #38

Closed
oneillkza opened this issue Oct 20, 2022 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@oneillkza

What's the timeline for getting support for modified basecalling models in Dorado?

(Or is this possible already?)

@sklages

sklages commented Oct 27, 2022

Well, listed under "Features":

  • Modified basecalling (Remora models).

@oneillkza
Author

Hmm, yes, the -h output does suggest this is possible, although it doesn't say what format it expects the Remora models in:

dorado basecaller -h
Usage: dorado [options] model data 

Positional arguments:
model              	the basecaller model to run.
data               	the data directory.


Optional arguments:
-h --help          	shows help message and exits
-v --version       	prints version information and exits
-x --device        	device string in format "cuda:0,...,N", "cuda:all", "metal" etc.. [default: "cuda:all"]
-b --batchsize     	if 0 an optimal batchsize will be selected [default: 0]
-c --chunksize     	[default: 10000]
-o --overlap       	[default: 500]
-r --num_runners   	[default: 2]
--emit-fastq       	[default: false]
--remora-batchsize 	[default: 1000]
--remora-threads   	[default: 1]
--remora_models    	a comma separated list of remora models [default: ""]

When I pass it an .onnx file from the Remora repository, Dorado treats the path as a directory that should contain a config.toml file:

> Creating basecall pipeline
toml::parse: file open error -> ../remora/r9_4_1_sup_5mc_5hmc.onnx/config.toml

But there are no toml files in the Remora repository or in Rerio, and none for the basecall models distributed with Guppy. There's also no clear documentation on this format, although it seems to be alluded to in nanoporetech/bonito#278, which talks about "a bonito basecalling model [tar+toml]".

So the question is, how does one acquire (or create?) Remora models in the tar+toml format that Dorado accepts?
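
The error above suggests that Dorado resolves each model argument as a directory and tries to open a config.toml inside it. A minimal pre-flight check could look like the sketch below. It assumes that the presence of config.toml is the only requirement that can be inferred from the message; the function name looks_like_dorado_model is hypothetical, and a real model directory presumably needs more than this one file.

```shell
# Hypothetical helper: succeed only if the argument is a directory that
# contains a config.toml, which is the file the error message shows
# Dorado trying to open. This does not validate the model itself.
looks_like_dorado_model() {
    local model_path="$1"
    [ -d "$model_path" ] && [ -f "$model_path/config.toml" ]
}

# Usage: check a path before handing it to dorado.
# looks_like_dorado_model ../remora/r9_4_1_sup_5mc_5hmc.onnx \
#     || echo "not a model directory" >&2
```

For the .onnx file in the report, the check fails at the directory test, which matches the behaviour seen in the error.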

oneillkza changed the title from "Remora support?" to "How to acquire Remora models in the toml format that Dorado expects as input?" on Oct 27, 2022
iiSeymour self-assigned this on Oct 29, 2022
iiSeymour added the enhancement (New feature or request) label on Oct 29, 2022
@iiSeymour
Member

I will add support for modification (mods) models to dorado download over the next few days.

@oneillkza
Author

Thanks! Looking forward to it!

@iiSeymour
Member

@oneillkza v0.0.2 has a matching 5mC model for each simplex model.

$ dorado download --list 
[2022-11-10 16:25:06.843] [info] > simplex models
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_260bps_fast@v3.5.2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_260bps_hac@v3.5.2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_260bps_sup@v3.5.2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_400bps_fast@v3.5.2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_400bps_hac@v3.5.2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_400bps_sup@v3.5.2
[2022-11-10 16:25:06.846] [info]  - dna_r9.4.1_e8_fast@v3.4
[2022-11-10 16:25:06.846] [info]  - dna_r9.4.1_e8_hac@v3.3
[2022-11-10 16:25:06.846] [info]  - dna_r9.4.1_e8_sup@v3.3
[2022-11-10 16:25:06.846] [info] > modification models
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_260bps_fast@v3.5.2_5mCG@v2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_260bps_hac@v3.5.2_5mCG@v2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_260bps_sup@v3.5.2_5mCG@v2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_400bps_fast@v3.5.2_5mCG@v2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2
[2022-11-10 16:25:06.846] [info]  - dna_r10.4.1_e8.2_400bps_sup@v3.5.2_5mCG@v2
[2022-11-10 16:25:06.846] [info]  - dna_r9.4.1_e8_fast@v3.4_5mCG@v0
[2022-11-10 16:25:06.846] [info]  - dna_r9.4.1_e8_hac@v3.4_5mCG@v0
[2022-11-10 16:25:06.846] [info]  - dna_r9.4.1_e8_sup@v3.4_5mCG@v0
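
To pull only the modification model names out of a listing like the one above, a small awk filter is enough. This is a sketch that assumes the log layout shown here (section headers beginning with "> " and one " - name" entry per line); the function name list_mod_models is hypothetical, and the layout may differ in other releases.

```shell
# Hypothetical helper: read the output of `dorado download --list` on
# stdin and print the names listed under "> modification models".
list_mod_models() {
    awk '
        /> modification models/ { in_mods = 1; next }  # start of the mods section
        /\[info\] > /           { in_mods = 0 }        # any other section header
        in_mods && / - /        { print $NF }          # model name is the last field
    '
}

# Usage (2>&1 in case the log is written to stderr):
# dorado download --list 2>&1 | list_mod_models
```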

In this release you have to specify the model manually like so:

$ dorado basecaller ${models}/dna_r10.4.1_e8.2_400bps_hac@v3.5.2 ${data} \
    --remora-models ${models}/dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2 > mods.sam

But I intend to simplify this with automatic model matching and a simpler CLI, i.e.:

$ dorado basecaller ${models}/dna_r10.4.1_e8.2_400bps_hac@v3.5.2 ${data} --mods 5mCG > mods.sam
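
Until automatic matching is available, the manual pairing shown above can be scripted. The sketch below assumes the naming pattern visible in the model list, "<simplex model>_<modification>@<version>", and that both models were downloaded into the same directory; the function name find_mods_model is hypothetical.

```shell
# Hypothetical helper: print the first directory under $1 whose name is
# "<simplex model>_<modification>@<version>", for example
# "dna_r10.4.1_e8.2_400bps_hac@v3.5.2_5mCG@v2". Returns 1 if none exists.
find_mods_model() {
    local models_dir="$1" simplex="$2" mod="$3" candidate
    for candidate in "$models_dir/${simplex}_${mod}@"*; do
        # An unmatched glob stays literal, so test that it is a real directory.
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage (paths are placeholders; flag spelling as in the command above):
# simplex=dna_r10.4.1_e8.2_400bps_hac@v3.5.2
# mods=$(find_mods_model "$models" "$simplex" 5mCG) &&
#     dorado basecaller "$models/$simplex" "$data" --remora-models "$mods" > mods.sam
```

Note that exact-prefix matching finds nothing for dna_r9.4.1_e8_hac@v3.3 and dna_r9.4.1_e8_sup@v3.3, because the list above names their 5mCG models with a v3.4 prefix; those pairs would still need to be specified by hand.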

@oneillkza
Author

Thanks @iiSeymour !

@jcolicchio-soundag

Does this also work with the Rerio Remora model that covers cytosines in all contexts?

@iiSeymour
Member

@jcolicchio-soundag not yet.

4 participants