Merge branch 'jdaw/improve-README' into 'master'

Improve README based on GitHub requests

See merge request machine-learning/dorado!767
nanoporetech · Dec 18, 2023 · a3dfc94
2 parents 3bfb1f0 + 422a8b4
Showing 1 changed file with 12 additions and 8 deletions.
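
Most of the additions in this diff append `> <file>` to the example commands, since they write their result to standard output. The `--resume-from` note in the first hunk warns that reusing the interrupted file's name as the redirect target loses the existing basecalls. A minimal sketch of guarding against that in a wrapper script follows. `check_resume_paths` is a hypothetical helper, not part of Dorado, and it only compares the two path strings, so different spellings of the same file (for example `./calls.bam` and `calls.bam`) are not detected.

```shell
# Hypothetical helper (not part of dorado): returns non-zero when the
# resume file and the output file are given as the same path string.
check_resume_paths() {
    resume_file="$1"
    output_file="$2"
    if [ "$resume_file" = "$output_file" ]; then
        echo "error: output must differ from the --resume-from file: $resume_file" >&2
        return 1
    fi
    return 0
}

# Example usage (commented out so the sketch runs without dorado installed).
# The redirect is only set up if the check succeeds, so the interrupted
# file is not truncated when the names collide:
# check_resume_paths incomplete.bam calls.bam &&
#     dorado basecaller hac pod5s/ --resume-from incomplete.bam > calls.bam
```

Placing the check before `&&` matters because the shell opens and truncates the redirect target before the redirected command starts.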
diff --git a/README.md b/README.md
@@ -84,7 +84,7 @@ $ dorado basecaller hac pod5s/ --resume-from incomplete.bam > calls.bam
 
 **Note: it is important to choose a different filename for the BAM file you are writing to when using `--resume-from`**. If you use the same filename, the interrupted BAM file will lose the existing basecalls and basecalling will restart from the beginning.
 
-### Adapter and primer trimming
+### DNA adapter and primer trimming
 
 #### In-line with basecalling
 
@@ -104,13 +104,17 @@ The `--trim` option takes as its argument one of the following values:
 Existing basecalled datasets can be scanned for adapter and/or primer sequences at either end, and any sequences found can be trimmed. To do this, run:
 
 ```
-$ dorado trim --output-dir <output-folder-for-trimmed-bams> <reads>
+$ dorado trim <reads> > trimmed.bam
 ```
 
 `<reads>` can either be an HTS format file (e.g. FASTQ, BAM, etc.) or a stream of an HTS format (e.g. the output of Dorado basecalling).
 
 The `--no-trim-primers` option can be used to prevent the trimming of primer sequences. In this case only adapter sequences will be trimmed.
 
+### RNA adapter trimming
+
+Adapters for RNA002 and RNA004 kits are automatically trimmed during basecalling. However, unlike DNA adapters, RNA adapters cannot be trimmed post-basecalling.
+
 ### Modified basecalling
 
 Beyond the traditional A, T, C, and G basecalling, Dorado can also detect modified bases such as 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), and N<sup>6</sup>-methyladenosine (6mA). These modified bases play crucial roles in epigenetic regulation.
@@ -145,7 +149,7 @@ Dorado will report the duplex rate as the number of nucleotides in the duplex ba
 Duplex basecalling can be performed with modified base detection, producing hemi-methylation calls for duplex reads:
 
 ```
-$ dorado duplex hac,5mCG_5hmCG pod5s/
+$ dorado duplex hac,5mCG_5hmCG pod5s/ > duplex.bam
 ```
 More information on how hemi-methylation calls are represented can be found in [page 7 of the SAM specification document (version aa7440d)](https://samtools.github.io/hts-specs/SAMtags.pdf) and [Modkit documentation](https://nanoporetech.github.io/modkit/intro_pileup_hemi.html).
 
@@ -156,14 +160,14 @@ Dorado supports aligning existing basecalls or producing aligned output directly
 To align existing basecalls, run:
 
 ```
-$ dorado aligner <index> <reads> 
+$ dorado aligner <index> <reads> > aligned.bam
 ```
 where `index` is a reference to align to in (FASTQ/FASTA/.mmi) format and `reads` is a file in any HTS format.
 
 To basecall with alignment, in either simplex or duplex mode, run with the `--reference` option:
 
 ```
-$ dorado basecaller <model> <reads> --reference <index>
+$ dorado basecaller <model> <reads> --reference <index> > calls.bam
 ```
 
 Alignment uses [minimap2](https://github.com/lh3/minimap2) and by default uses the `map-ont` preset. This can be overridden with the `-k` and `-w` options to set kmer and window size respectively.
@@ -173,7 +177,7 @@ Alignment uses [minimap2](https://github.com/lh3/minimap2) and by default uses t
 The `dorado summary` command outputs a tab-separated file with read level sequencing information from the BAM file generated during basecalling. To create a summary, run:
 
 ```
-$ dorado summary <bam>
+$ dorado summary <bam> > summary.tsv
 ```
 
 Note that summary generation is only available for reads basecalled from POD5 files. Reads basecalled from .fast5 files are not compatible with the summary command.
@@ -186,7 +190,7 @@ Dorado supports barcode classification for existing basecalls as well as produci
 
 In this mode, reads are classified into their barcode groups during basecalling as part of the same command. To enable this, run:
 ```
-$ dorado basecaller <model> <reads> --kit-name <barcode-kit-name>
+$ dorado basecaller <model> <reads> --kit-name <barcode-kit-name> > calls.bam
 ```
 
 This will result in a single output stream with classified reads. The classification will be reflected in the read group name as well as in the `BC` tag of the output record.
@@ -311,7 +315,7 @@ Below is a table of the available basecalling models and the modified basecallin
 
 ### **RNA models:**
 
-**Note:** The BAM format does not support `U` bases. Therefore, when Dorado is performing RNA basecalling, the resulting output files will include `T` instead of `U`. This is consistent across output file types.
+**Note:** The BAM format does not support `U` bases. Therefore, when Dorado is performing RNA basecalling, the resulting output files will include `T` instead of `U`. This is consistent across output file types. The same applies to parsing inputs. Any input HTS file (e.g. FASTQ generated by `guppy`/`basecall_server`) with `U` bases is not handled by `dorado`.
 
 | Basecalling Models | Compatible<br />Modifications | Modifications<br />Model<br />Version | Data<br />Sampling<br />Frequency |
 | :-------- | :------- | :--- | :--- |