diff --git a/CHANGELOG.md b/CHANGELOG.md
index ef8a7869..237b271b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,29 @@

 All notable changes to Dorado will be documented in this file.

+# [0.5.0] (5 Dec 2023)
+
+This release of Dorado introduces new, more accurate, and faster v4.3 basecalling models. It also enables hemi-methylation basecalling of duplex reads. Dorado now supports DNA primer and adapter trimming, custom barcode arrangements and sequences, and can automatically select the correct model for your data. Furthermore, this release introduces speed and memory enhancements for basecalling on Apple silicon and various stability improvements.
+
+* 14159695955dd0d08322f26b545069fbfecb5003 - Add v4.3 basecalling models
+* b7d4b380f17d4a15ed43d8d383cc770d121fca17 - Support for modified bases with duplex basecalling (hemi-methylation)
+* 30e639cf66c1c24d0f61f1e7b91c6ce5db2cf7bf - Primer and adapter trimming
+* fb85a70609eedfe895587275d06429515a1ce61e - Enable automatic model selection
+* 16e5b6ad577f5485eb3a78c755313fc8314b2b1c - Support for custom barcode arrangements and sequences
+* 46bbfddda06a7088f7031ef79eecf03b0f04660c - Add barcode column to summary file
+* e9f060c1afff8d72fd51da4201d3062d8c8a2064 - Improve the precision of read splitting
+* 4102ffc3454c609479665a337e1ad7c2f33b9d22 - Increase speed of v4.3 model execution
+* 0a0711012ad906f94aa6e26c3a6b540e5ccbcc0e - Prevent progress bar from `--resume-from` logging excessive dots
+* 20b5637dbbf944efcc3878c5271a8bd84d2b6eab - Ensure that aligner outputs SAM when not piped to a file
+* 942a35a69832883904a1116b9b21d5c1641d0e2b - Add `MN` tag to output BAM to help downstream tools interpret modified base tags
+* f0ac935035423d3b913940bf1b9b7fd50d832993 - Added modbase model name to BAM files in RG header section.
+* a7fa37132b0f442ce87a154e7f2db21dfaa66933 - Improve performance of HAC and SUP on Apple silicon
+* 152d5fdc782d14b1e9853d9242051d1f7064b63c - Improvements to auto batch sizing on Apple silicon
+* b0767a6f31cd7f084491b2b3313d33d048bcc5a0 - Fix bug causing segfault with `summary` command on Windows
+* 1c2c6a9e9bcf980702afec9b9f6a17cd27c3ae07 - Make AVX `reverse_complement` implementation preserve nucleotide case
+* 4a4dd1cffe9db32e4c58e79ca6dc5dc79125f0c9 - Use updated Koi functions for small LSTM layers, final convolutional layer in LSTM models, and final linear layer
+
+
 # [0.4.3] (14 Nov 2023)

 This release of Dorado introduces a new RNA m6A modified base model and initial support for poly(A)/poly(T) tail length estimation. It also introduces duplex performance enhancements and bug fixes to improve the stability of Dorado.
diff --git a/README.md b/README.md
index bd1986c5..c85f0a37 100644
--- a/README.md
+++ b/README.md
@@ -19,10 +19,10 @@ If you encounter any problems building or running Dorado, please [report an issu

 ## Installation

- - [dorado-0.4.3-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.4.3-linux-x64.tar.gz)
- - [dorado-0.4.3-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.4.3-linux-arm64.tar.gz)
- - [dorado-0.4.3-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.4.3-osx-arm64.zip)
- - [dorado-0.4.3-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.4.3-win64.zip)
+ - [dorado-0.5.0-linux-x64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.5.0-linux-x64.tar.gz)
+ - [dorado-0.5.0-linux-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.5.0-linux-arm64.tar.gz)
+ - [dorado-0.5.0-osx-arm64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.5.0-osx-arm64.zip)
+ - [dorado-0.5.0-win64](https://cdn.oxfordnanoportal.com/software/analysis/dorado-0.5.0-win64.zip)

 ## Platforms

@@ -77,13 +77,40 @@ To basecall a single file, simply replace the directory `pod5s/` with a path to
 If basecalling is interrupted, it is possible to resume basecalling from a BAM file. To do so, use the `--resume-from` flag to specify the path to the incomplete BAM file. For example:

 ```
-$ dorado basecaller hac pod5s --resume-from incomplete.bam > calls.bam
+$ dorado basecaller hac pod5s/ --resume-from incomplete.bam > calls.bam
 ```

 `calls.bam` will contain all of the reads from `incomplete.bam` plus the new basecalls *(`incomplete.bam` can be discarded after basecalling is complete)*.

 **Note: it is important to choose a different filename for the BAM file you are writing to when using `--resume-from`**. If you use the same filename, the interrupted BAM file will lose the existing basecalls and basecalling will restart from the beginning.

+### Adapter and primer trimming
+
+#### In-line with basecalling
+
+By default, `dorado basecaller` will attempt to detect any adapter or primer sequences at the beginning and end of reads, and remove them from the output sequence.
+
+This functionality can be altered by using either the `--trim` or `--no-trim` options with `dorado basecaller`. The `--no-trim` option will prevent the trimming of detected barcode sequences as well as the detection and trimming of adapter and primer sequences.
+
+The `--trim` option takes as its argument one of the following values:
+
+* `all` This is the same as the default behavior. Any detected adapters or primers will be trimmed, and if barcoding is enabled then any detected barcodes will be trimmed.
+* `primers` This will result in any detected adapters or primers being trimmed, but if barcoding is enabled the barcode sequences will not be trimmed.
+* `adapters` This will result in any detected adapters being trimmed, but primers will not be trimmed, and if barcoding is enabled then barcodes will not be trimmed either.
+* `none` This is the same as using the `--no-trim` option. Nothing will be trimmed.
+
+#### Trimming existing datasets
+
+Existing basecalled datasets can be scanned for adapter and/or primer sequences at either end, and any such sequences found can be trimmed. To do this, run:
+
+```
+$ dorado trim --output-dir
+```
+
+`` can either be an HTS format file (e.g. FASTQ, BAM, etc.) or a stream of an HTS format (e.g. the output of Dorado basecalling).
+
+The `--no-trim-primers` option can be used to prevent the trimming of primer sequences. In this case only adapter sequences will be trimmed.
+
 ### Modified basecalling

 Beyond the traditional A, T, C, and G basecalling, Dorado can also detect modified bases such as 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), and N6-methyladenosine (6mA). These modified bases play crucial roles in epigenetic regulation.
@@ -96,6 +123,8 @@ $ dorado basecaller hac,5mCG_5hmCG pod5s/ > calls.bam

 Refer to the [DNA models](#dna-models) table's _Compatible Modifications_ column to see available modifications that can be called with the `--modified-bases` option.

+Modified basecalling is also supported with [Duplex basecalling](#duplex), where it produces hemi-methylation calls.
+
 ### Duplex

 To run Duplex basecalling, run the command:
@@ -113,9 +142,12 @@ The `dx` tag in the BAM record for each read can be used to distinguish between

 Dorado will report the duplex rate as the number of nucleotides in the duplex basecalls multiplied by two and divided by the total number of nucleotides in the simplex basecalls. This value is a close approximation for the proportion of nucleotides which participated in a duplex basecall.

-Dorado duplex previously required a separate tool to perform duplex pair detection and read splitting, but this is now integrated into Dorado.
+Duplex basecalling can be performed with modified base detection, producing hemi-methylation calls for duplex reads:

-Note that modified basecalling is not yet supported in duplex mode.
+```
+$ dorado duplex hac,5mCG_5hmCG pod5s/
+```
+More information on how hemi-methylation calls are represented can be found in [page 7 of the SAM specification document (version aa7440d)](https://samtools.github.io/hts-specs/SAMtags.pdf) and [Modkit documentation](https://nanoporetech.github.io/modkit/intro_pileup_hemi.html).

 ### Alignment

@@ -152,7 +184,7 @@ Dorado supports barcode classification for existing basecalls as well as produci

 #### In-line with basecalling

-In this mode, reads are classified into their barcode groups during basecalling as part of the same command. To enable this, run
+In this mode, reads are classified into their barcode groups during basecalling as part of the same command. To enable this, run:
 ```
 $ dorado basecaller --kit-name
 ```
@@ -184,11 +216,12 @@ Existing basecalled datasets can be classified as well as demultiplexed into per
 $ dorado demux --kit-name --output-dir
 ```

-`` can either be an HTS format file (e.g. fastq, BAM, etc.) or a stream of an HTS format (e.g. the output of dorado basecalling).
+`` can either be an HTS format file (e.g. FASTQ, BAM, etc.) or a stream of an HTS format (e.g. the output of dorado basecalling).

 This results in multiple BAM files being generated in the output folder, one per barcode (formatted as `KITNAME_BARCODEXX.bam`) and one for all unclassified reads. As with the in-line mode, `--no-trim` and `--barcode-both-ends` are also available as additional options.

 Here is an example output folder
+
 ```
 $ dorado demux --kit-name SQK-RPB004 --output-dir /tmp/demux reads.fastq

@@ -201,9 +234,10 @@ unclassified.bam
 ```

 #### Using a sample sheet
+
 Dorado is able to use a sample sheet to restrict the barcode classifications to only those present, and to apply aliases to the detected classifications. This is enabled by passing the path to a sample sheet to the `--sample-sheet` argument when using the `basecaller` or `demux` commands. See [here](documentation/SampleSheets.md) for more information.

-### Custom barcodes
+#### Custom barcodes

 In addition to supporting the standard barcode kits from Oxford Nanopore, Dorado also supports specifying custom barcode kit arrangements and sequences. This is done by passing a barcode arrangement file via the `--barcode-arrangement` argument (either to `dorado demux` or `dorado basecaller`). Custom barcode sequences can optionally be specified via the `--barcode-sequences` option. See [here](documentation/CustomBarcodes.md) for more details.

@@ -211,6 +245,8 @@ In addition to supporting the standard barcode kits from Oxford Nanopore, Dorado

 Dorado has initial support for estimating poly(A) tail lengths for cDNA and RNA. Note that Oxford Nanopore cDNA reads are sequenced in two different orientations and Dorado poly(A) tail length estimation handles both (A and T homopolymers). This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.

+Note that if this option is used, then adapter and primer trimming will be automatically disabled.
+
 ## Available basecalling models

 To download all available Dorado models, run:
@@ -244,9 +280,9 @@ Below is a table of the available basecalling models and the modified basecallin

 | Basecalling Models | Compatible<br />Modifications | Modifications<br />Model<br />Version | Data<br />Sampling<br />Frequency |
 | :-------- | :------- | :--- | :--- |
-| **dna_r10.4.1_e8.2_400bps_fast@v4.3.0** | 5mCG_5hmCG | v2 | 5 kHz |
-| **dna_r10.4.1_e8.2_400bps_hac@v4.3.0** | 5mCG_5hmCG | v2 | 5 kHz |
-| **dna_r10.4.1_e8.2_400bps_sup@v4.3.0** | 5mCG_5hmCG<br />5mC_5hmC<br />5mC<br />6mA<br /> | v3.1<br />v1<br />v2<br />v3| 5 kHz |
+| **dna_r10.4.1_e8.2_400bps_fast@v4.3.0** | | | 5 kHz |
+| **dna_r10.4.1_e8.2_400bps_hac@v4.3.0** | 5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v1<br />v1<br />v1 | 5 kHz |
+| **dna_r10.4.1_e8.2_400bps_sup@v4.3.0** | 5mCG_5hmCG<br />5mC_5hmC<br />6mA<br /> | v1<br />v1<br />v1 | 5 kHz |
 | dna_r10.4.1_e8.2_400bps_fast@v4.2.0 | 5mCG_5hmCG | v2 | 5 kHz |
 | dna_r10.4.1_e8.2_400bps_hac@v4.2.0 | 5mCG_5hmCG | v2 | 5 kHz |
 | dna_r10.4.1_e8.2_400bps_sup@v4.2.0 | 5mCG_5hmCG<br />5mC_5hmC<br />5mC<br />6mA<br /> | v3.1<br />v1<br />v2<br />v3| 5 kHz |
diff --git a/cmake/DoradoVersion.cmake b/cmake/DoradoVersion.cmake
index 69da9eb6..009d868a 100644
--- a/cmake/DoradoVersion.cmake
+++ b/cmake/DoradoVersion.cmake
@@ -1,6 +1,6 @@
 set(DORADO_VERSION_MAJOR 0)
-set(DORADO_VERSION_MINOR 4)
-set(DORADO_VERSION_REV 3)
+set(DORADO_VERSION_MINOR 5)
+set(DORADO_VERSION_REV 0)

 find_package(Git QUIET)
 if(GIT_FOUND AND EXISTS "${PROJECT_SOURCE_DIR}/.git")