Commit

Merge branch 'jdaw/improve-README' into 'master'
Improve README based on GitHub requests

See merge request machine-learning/dorado!767
tijyojwad committed Dec 18, 2023
2 parents 3bfb1f0 + 422a8b4 commit a3dfc94
Showing 1 changed file with 12 additions and 8 deletions.
20 changes: 12 additions & 8 deletions README.md
@@ -84,7 +84,7 @@ $ dorado basecaller hac pod5s/ --resume-from incomplete.bam > calls.bam

**Note: it is important to choose a different filename for the BAM file you are writing to when using `--resume-from`**. If you use the same filename, the interrupted BAM file will lose the existing basecalls and basecalling will restart from the beginning.
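
One way to guard against this mistake is a small wrapper that refuses to run when the output path matches the `--resume-from` path. The sketch below is hypothetical and not part of Dorado; the function name `resume_basecalling` and its argument order are assumptions made for illustration. It only uses the `dorado basecaller ... --resume-from` invocation shown above.

```shell
# Hypothetical wrapper (not part of Dorado): refuse to resume when the
# output path is the same string as the --resume-from path.
# Usage: resume_basecalling <model> <reads> <interrupted.bam> <output.bam>
resume_basecalling() {
    model="$1"; reads="$2"; partial="$3"; out="$4"
    if [ "$partial" = "$out" ]; then
        # Return before the redirect below can truncate the interrupted file.
        echo "error: output must differ from the --resume-from file: $partial" >&2
        return 1
    fi
    dorado basecaller "$model" "$reads" --resume-from "$partial" > "$out"
}
```

For example, `resume_basecalling hac pod5s/ incomplete.bam calls.bam`. The check is a plain string comparison, so two different spellings of the same path (for example `./a.bam` and `a.bam`) would not be caught.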

### Adapter and primer trimming
### DNA Adapter and primer trimming

#### In-line with basecalling

@@ -104,13 +104,17 @@ The `--trim` option takes as its argument one of the following values:
Existing basecalled datasets can be scanned for adapter and/or primer sequences at either end, and any sequences found can be trimmed. To do this, run:

```
$ dorado trim --output-dir <output-folder-for-trimmed-bams> <reads>
$ dorado trim <reads> > trimmed.bam
```

`<reads>` can either be an HTS format file (e.g. FASTQ, BAM, etc.) or a stream of an HTS format (e.g. the output of Dorado basecalling).

The `--no-trim-primers` option can be used to prevent the trimming of primer sequences. In this case only adapter sequences will be trimmed.

### RNA Adapter trimming

Adapters for RNA002 and RNA004 kits are automatically trimmed during basecalling. However, unlike in DNA, the RNA adapter cannot be trimmed post-basecalling.

### Modified basecalling

Beyond the traditional A, T, C, and G basecalling, Dorado can also detect modified bases such as 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC), and N<sup>6</sup>-methyladenosine (6mA). These modified bases play crucial roles in epigenetic regulation.
@@ -145,7 +149,7 @@ Dorado will report the duplex rate as the number of nucleotides in the duplex ba
Duplex basecalling can be performed with modified base detection, producing hemi-methylation calls for duplex reads:

```
$ dorado duplex hac,5mCG_5hmCG pod5s/
$ dorado duplex hac,5mCG_5hmCG pod5s/ > duplex.bam
```
More information on how hemi-methylation calls are represented can be found in [page 7 of the SAM specification document (version aa7440d)](https://samtools.github.io/hts-specs/SAMtags.pdf) and [Modkit documentation](https://nanoporetech.github.io/modkit/intro_pileup_hemi.html).

@@ -156,14 +160,14 @@ Dorado supports aligning existing basecalls or producing aligned output directly
To align existing basecalls, run:

```
$ dorado aligner <index> <reads>
$ dorado aligner <index> <reads> > aligned.bam
```
where `index` is a reference to align to in (FASTQ/FASTA/.mmi) format and `reads` is a file in any HTS format.

To basecall with alignment with duplex or simplex, run with the `--reference` option:

```
$ dorado basecaller <model> <reads> --reference <index>
$ dorado basecaller <model> <reads> --reference <index> > calls.bam
```

Alignment uses [minimap2](https://github.com/lh3/minimap2) and by default uses the `map-ont` preset. This can be overridden with the `-k` and `-w` options to set kmer and window size respectively.
@@ -173,7 +177,7 @@ Alignment uses [minimap2](https://github.com/lh3/minimap2) and by default uses t
The `dorado summary` command outputs a tab-separated file with read level sequencing information from the BAM file generated during basecalling. To create a summary, run:

```
$ dorado summary <bam>
$ dorado summary <bam> > summary.tsv
```

Note that summary generation is only available for reads basecalled from POD5 files. Reads basecalled from .fast5 files are not compatible with the summary command.
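
As a quick sanity check on the redirected output, the column names and the number of data rows can be listed with standard tools. This is a minimal sketch that assumes the first line of the tab-separated file is a header row; the helper name `describe_tsv` is hypothetical and no specific column names are assumed.

```shell
# Print each column name on its own line, then the number of data rows.
# Assumes the first line of the TSV is a header row.
describe_tsv() {
    head -n 1 "$1" | tr '\t' '\n'
    awk 'END { print "rows: " (NR - 1) }' "$1"
}
```

For example, `describe_tsv summary.tsv` after running the command above.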
@@ -186,7 +190,7 @@ Dorado supports barcode classification for existing basecalls as well as produci

In this mode, reads are classified into their barcode groups during basecalling as part of the same command. To enable this, run:
```
$ dorado basecaller <model> <reads> --kit-name <barcode-kit-name>
$ dorado basecaller <model> <reads> --kit-name <barcode-kit-name> > calls.bam
```

This will result in a single output stream with classified reads. The classification will be reflected in the read group name as well as in the `BC` tag of the output record.
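
To get a rough view of how reads were distributed across barcodes, the `BC` tag can be tallied from the SAM text of the output. The sketch below is an illustration rather than a Dorado feature: the helper name `count_barcodes` is hypothetical, it assumes the tag is stored as a string (`BC:Z:`), and it relies on `samtools view` to print the records as text.

```shell
# Count records per BC:Z: tag value from SAM text on stdin.
# Header lines (starting with @) and records without a BC tag are skipped.
# Optional tags start at field 12 of a SAM record.
count_barcodes() {
    awk -F '\t' '!/^@/ {
                     for (i = 12; i <= NF; i++)
                         if ($i ~ /^BC:Z:/) n[substr($i, 6)]++
                 }
                 END { for (b in n) print n[b], b }' | sort -k2
}
```

For example, `samtools view calls.bam | count_barcodes` prints one line per tag value, with the count first.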
@@ -311,7 +315,7 @@ Below is a table of the available basecalling models and the modified basecallin

### **RNA models:**

**Note:** The BAM format does not support `U` bases. Therefore, when Dorado is performing RNA basecalling, the resulting output files will include `T` instead of `U`. This is consistent across output file types.
**Note:** The BAM format does not support `U` bases. Therefore, when Dorado is performing RNA basecalling, the resulting output files will include `T` instead of `U`. This is consistent across output file types. The same applies to parsing inputs. Any input HTS file (e.g. FASTQ generated by `guppy`/`basecall_server`) with `U` bases is not handled by `dorado`.
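
One possible workaround for inputs that contain `U` is to rewrite the sequence lines before handing the file to `dorado`. This is a sketch under stated assumptions, not something taken from the Dorado documentation: it assumes unwrapped FASTQ with exactly four lines per read, and the helper name `fastq_u_to_t` is hypothetical.

```shell
# Replace U with T (and u with t) on the sequence line of each FASTQ record.
# Assumes unwrapped FASTQ: exactly four lines per read, sequence on line 2.
# Header, separator, and quality lines are passed through unchanged.
fastq_u_to_t() {
    awk 'NR % 4 == 2 { gsub(/U/, "T"); gsub(/u/, "t") } { print }'
}
```

For example, `fastq_u_to_t < rna_with_u.fastq > rna_with_t.fastq`, where both file names are placeholders.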

| Basecalling Models | Compatible<br />Modifications | Modifications<br />Model<br />Version | Data<br />Sampling<br />Frequency |
| :-------- | :------- | :--- | :--- |
