diff --git a/README.md b/README.md
index a968830f5..7baa08c24 100644
--- a/README.md
+++ b/README.md
@@ -10,6 +10,7 @@ Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nan
 * [Duplex basecalling](#duplex) (watch the following video for an introduction to [Duplex](https://youtu.be/8DVMG7FEBys)).
 * Simplex [barcode classification](#barcode-classification).
 * Support for aligned read output in SAM/BAM.
+* Initial support for [poly(A) tail estimation](#polya-tail-estimation).
 * [POD5](https://github.com/nanoporetech/pod5-file-format) support for highest basecalling performance.
 * Based on libtorch, the C++ API for pytorch.
 * Multiple custom optimisations in CUDA and Metal for maximising inference performance.
@@ -203,6 +204,10 @@ unclassified.bam
 #### Using a Sample Sheet
 `dorado` is able to use a sample sheet to restrict the barcode classifications to only those present, and to apply aliases to the detected classifications. This is enabled by passing the path to a sample sheet to the `--sample-sheet` argument when using the `basecaller` or `demux` commands. See [here](documentation/SampleSheets.md) for more information.

+### Poly(A) tail estimation
+
+Dorado has initial support for estimating poly(A) tail lengths for DNA and RNA. Note that Oxford Nanopore cDNA reads sequence in two different orientations and transcript poly(A) length estimation handles both (A and T homopolymers). This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.
+
 ## Available basecalling models

 To download all available Dorado models, run:
diff --git a/documentation/SAM.md b/documentation/SAM.md
index 12dda0238..70a3311c8 100644
--- a/documentation/SAM.md
+++ b/documentation/SAM.md
@@ -41,7 +41,7 @@
 | dx:i: | bool to signify duplex read _(only in duplex mode)_ |
 | pi:Z: | parent read id for a split read |
 | sp:i: | start coordinate of split read in parent read signal |
-| pt:i: | estimated poly(A) tail length in cDNA and dRNA reads |
+| pt:i: | estimated poly(A/T) tail length in cDNA and dRNA reads |

 #### Modified Base Tags

diff --git a/dorado/cli/basecaller.cpp b/dorado/cli/basecaller.cpp
index 8f734e49b..2e7f35d17 100644
--- a/dorado/cli/basecaller.cpp
+++ b/dorado/cli/basecaller.cpp
@@ -368,17 +368,15 @@ int basecaller(int argc, char* argv[]) {
     parser.visible.add_argument("--sample-sheet")
             .help("Path to the sample sheet to use.")
             .default_value(std::string(""));
-
-    cli::add_minimap2_arguments(parser, AlignerNode::dflt_options);
-    cli::add_internal_arguments(parser);
-
-    // Add hidden arguments that only apply to simplex calling.
-    parser.hidden.add_argument("--estimate-poly-a")
+    parser.visible.add_argument("--estimate-poly-a")
             .help("Estimate poly-A/T tail lengths (beta feature). Primarily meant for cDNA and "
                   "dRNA use cases.")
             .default_value(false)
             .implicit_value(true);

+    cli::add_minimap2_arguments(parser, AlignerNode::dflt_options);
+    cli::add_internal_arguments(parser);
+
     // Create a copy of the parser to use if the resume feature is enabled. Needed
     // to parse the model used for the file being resumed from. Note that this copy
     // needs to be made __before__ the parser is used.
@@ -462,7 +460,7 @@ int basecaller(int argc, char* argv[]) {
                 parser.visible.get<bool>("--barcode-both-ends"),
                 parser.visible.get<bool>("--no-trim"),
                 parser.visible.get<std::string>("--sample-sheet"), resume_parser,
-                parser.hidden.get<bool>("--estimate-poly-a"));
+                parser.visible.get<bool>("--estimate-poly-a"));
     } catch (const std::exception& e) {
         spdlog::error("{}", e.what());
         return 1;
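
Reviewer note: the README change says the estimate is written to the `pt:i` tag and that reads without an estimate carry no tag at all. As a rough sketch of what that means for downstream consumers, the helper below reads the tag from one SAM text record. It is not part of this patch or of Dorado; the function name `poly_a_tail_length` is hypothetical, and it assumes plain tab-separated SAM text as produced by, for example, `samtools view`.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper, not Dorado code: returns the integer value of the
// pt:i tag from one tab-separated SAM record, or std::nullopt when the tag
// is absent (i.e. the tail length could not be estimated for that read).
std::optional<int> poly_a_tail_length(const std::string& sam_record) {
    std::istringstream fields(sam_record);
    std::string field;
    int column = 0;
    while (std::getline(fields, field, '\t')) {
        // The first 11 columns are the mandatory SAM fields; optional tags follow.
        if (++column <= 11) {
            continue;
        }
        // Optional fields have the form TAG:TYPE:VALUE.
        if (field.rfind("pt:i:", 0) == 0) {
            return std::stoi(field.substr(5));
        }
    }
    return std::nullopt;
}
```

Returning `std::optional` keeps "no estimate" distinct from a tail length of zero, which mirrors the tag simply being omitted for reads that could not be estimated.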