From f1a746f40e0de64c5ffab4138709ff4c1148a8e8 Mon Sep 17 00:00:00 2001
From: Joyjit Daw
Date: Sun, 12 Nov 2023 21:42:39 -0500
Subject: [PATCH 1/5] add poly(a) documentation

- add entry to README
- make the tail estimation option public
---
 README.md                 |  5 +++++
 dorado/cli/basecaller.cpp | 10 ++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index a968830f..f02e6d2a 100644
--- a/README.md
+++ b/README.md
@@ -10,6 +10,7 @@ Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nan
 * [Duplex basecalling](#duplex) (watch the following video for an introduction to [Duplex](https://youtu.be/8DVMG7FEBys)).
 * Simplex [barcode classification](#barcode-classification).
 * Support for aligned read output in SAM/BAM.
+* Experimental support for [poly(A) tail estimation](#polyA-tail-estimation).
 * [POD5](https://github.com/nanoporetech/pod5-file-format) support for highest basecalling performance.
 * Based on libtorch, the C++ API for pytorch.
 * Multiple custom optimisations in CUDA and Metal for maximising inference performance.
@@ -203,6 +204,10 @@ unclassified.bam
 #### Using a Sample Sheet
 `dorado` is able to use a sample sheet to restrict the barcode classifications to only those present, and to apply aliases to the detected classifications. This is enabled by passing the path to a sample sheet to the `--sample-sheet` argument when using the `basecaller` or `demux` commands. See [here](documentation/SampleSheets.md) for more information.
 
+### poly(A) tail estimation
+
+Dorado has experimental support for estimating poly(A) tails for both DNA and RNA. This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.
+
 ## Available basecalling models
 
 To download all available Dorado models, run:
diff --git a/dorado/cli/basecaller.cpp b/dorado/cli/basecaller.cpp
index 8f734e49..a9860360 100644
--- a/dorado/cli/basecaller.cpp
+++ b/dorado/cli/basecaller.cpp
@@ -368,17 +368,15 @@ int basecaller(int argc, char* argv[]) {
     parser.visible.add_argument("--sample-sheet")
             .help("Path to the sample sheet to use.")
             .default_value(std::string(""));
-
-    cli::add_minimap2_arguments(parser, AlignerNode::dflt_options);
-    cli::add_internal_arguments(parser);
-
-    // Add hidden arguments that only apply to simplex calling.
-    parser.hidden.add_argument("--estimate-poly-a")
+    parser.visible.add_argument("--estimate-poly-a")
             .help("Estimate poly-A/T tail lengths (beta feature). Primarily meant for cDNA and "
                   "dRNA use cases.")
             .default_value(false)
             .implicit_value(true);
 
+    cli::add_minimap2_arguments(parser, AlignerNode::dflt_options);
+    cli::add_internal_arguments(parser);
+
     // Create a copy of the parser to use if the resume feature is enabled. Needed
     // to parse the model used for the file being resumed from. Note that this copy
     // needs to be made __before__ the parser is used.

From 35c74227ae62b4ee2cd5695d525fef419769e7f5 Mon Sep 17 00:00:00 2001
From: Joyjit Daw
Date: Mon, 13 Nov 2023 08:25:17 -0500
Subject: [PATCH 2/5] address review comments

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index f02e6d2a..87a620ba 100644
--- a/README.md
+++ b/README.md
@@ -204,9 +204,9 @@ unclassified.bam
 #### Using a Sample Sheet
 `dorado` is able to use a sample sheet to restrict the barcode classifications to only those present, and to apply aliases to the detected classifications. This is enabled by passing the path to a sample sheet to the `--sample-sheet` argument when using the `basecaller` or `demux` commands. See [here](documentation/SampleSheets.md) for more information.
 
-### poly(A) tail estimation
+### Poly(A) tail estimation
 
-Dorado has experimental support for estimating poly(A) tails for both DNA and RNA. This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.
+Dorado has experimental support for estimating poly(A) tail lengths for DNA and RNA. Note that Oxford Nanopore cDNA reads sequence in two different orientations and transcript poly(A) length estimation handles both (A and T homopolymers). This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.
 
 ## Available basecalling models
 

From 0d12d223547d030c89318f121eb7f69cbe806442 Mon Sep 17 00:00:00 2001
From: Joyjit Daw
Date: Mon, 13 Nov 2023 08:26:38 -0500
Subject: [PATCH 3/5] update SAM specs

---
 README.md            | 2 +-
 documentation/SAM.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 87a620ba..5c933bf0 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nan
 * [Duplex basecalling](#duplex) (watch the following video for an introduction to [Duplex](https://youtu.be/8DVMG7FEBys)).
 * Simplex [barcode classification](#barcode-classification).
 * Support for aligned read output in SAM/BAM.
-* Experimental support for [poly(A) tail estimation](#polyA-tail-estimation).
+* Experimental support for [poly(A) tail estimation](#polya-tail-estimation).
 * [POD5](https://github.com/nanoporetech/pod5-file-format) support for highest basecalling performance.
 * Based on libtorch, the C++ API for pytorch.
 * Multiple custom optimisations in CUDA and Metal for maximising inference performance.
diff --git a/documentation/SAM.md b/documentation/SAM.md
index 9b13f60d..3637616f 100644
--- a/documentation/SAM.md
+++ b/documentation/SAM.md
@@ -41,7 +41,7 @@
 | dx:i: | bool to signify duplex read _(only in duplex mode)_ |
 | pi:Z: | parent read id for a split read |
 | sp:i: | start coordinate of split read in parent read signal |
-| pt:i: | estimated poly(A) tail length in cDNA and dRNA reads |
+| pt:i: | estimated poly(A/T) tail length in cDNA and dRNA reads |
 
 #### Modified Base Tags
 

From 8bb045d27d8e0a26069122b922ac629701a582b1 Mon Sep 17 00:00:00 2001
From: Joyjit Daw
Date: Mon, 13 Nov 2023 09:06:42 -0500
Subject: [PATCH 4/5] fix options parsing bug

---
 dorado/cli/basecaller.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dorado/cli/basecaller.cpp b/dorado/cli/basecaller.cpp
index a9860360..2e7f35d1 100644
--- a/dorado/cli/basecaller.cpp
+++ b/dorado/cli/basecaller.cpp
@@ -460,7 +460,7 @@ int basecaller(int argc, char* argv[]) {
               parser.visible.get("--barcode-both-ends"),
               parser.visible.get("--no-trim"),
               parser.visible.get("--sample-sheet"), resume_parser,
-              parser.hidden.get("--estimate-poly-a"));
+              parser.visible.get("--estimate-poly-a"));
     } catch (const std::exception& e) {
         spdlog::error("{}", e.what());
         return 1;

From 1388ca3a1f09aa07068afa47bb54825a5d7db272 Mon Sep 17 00:00:00 2001
From: Joyjit Daw
Date: Mon, 13 Nov 2023 10:41:58 -0500
Subject: [PATCH 5/5] change wording for poly(A) support

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 5c933bf0..7baa08c2 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nan
 * [Duplex basecalling](#duplex) (watch the following video for an introduction to [Duplex](https://youtu.be/8DVMG7FEBys)).
 * Simplex [barcode classification](#barcode-classification).
 * Support for aligned read output in SAM/BAM.
-* Experimental support for [poly(A) tail estimation](#polya-tail-estimation).
+* Initial support for [poly(A) tail estimation](#polya-tail-estimation).
 * [POD5](https://github.com/nanoporetech/pod5-file-format) support for highest basecalling performance.
 * Based on libtorch, the C++ API for pytorch.
 * Multiple custom optimisations in CUDA and Metal for maximising inference performance.
@@ -206,7 +206,7 @@ unclassified.bam
 
 ### Poly(A) tail estimation
 
-Dorado has experimental support for estimating poly(A) tail lengths for DNA and RNA. Note that Oxford Nanopore cDNA reads sequence in two different orientations and transcript poly(A) length estimation handles both (A and T homopolymers). This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.
+Dorado has initial support for estimating poly(A) tail lengths for DNA and RNA. Note that Oxford Nanopore cDNA reads sequence in two different orientations and transcript poly(A) length estimation handles both (A and T homopolymers). This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.
 
 ## Available basecalling models
 