Skip to content

Commit

Permalink
Resolved merge conflicts
Browse files Browse the repository at this point in the history
  • Loading branch information
hpendry-ont committed Mar 21, 2024
2 parents cc328db + dc22d7f commit cbdd51a
Show file tree
Hide file tree
Showing 52 changed files with 830 additions and 368 deletions.
46 changes: 27 additions & 19 deletions documentation/CustomBarcodes.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,11 +41,14 @@ last_index = 96
## Scoring options
[scoring]
min_soft_barcode_threshold = 0.2
min_hard_barcode_threshold = 0.2
min_soft_flank_threshold = 0.3
min_hard_flank_threshold = 0.3
min_barcode_score_dist = 0.1
max_barcode_penalty = 11
barcode_end_proximity = 75
min_barcode_penalty_dist = 3
min_separation_only_dist = 6
flank_left_pad = 5
flank_right_pad = 10
front_barcode_window = 175
rear_barcode_window = 175
```

#### Arrangement Options
Expand All @@ -72,25 +75,30 @@ The pre-built barcode sequence in Dorado can be found in this [file](../dorado/u
Dorado maintains a default set of parameters for scoring each barcode to determine the best classification. These parameters have been tuned based on barcoding kits from Oxford Nanopore. However, the default parameters may not be optimal for new arrangements and kits.

The classification heuristic applied by Dorado is the following -
1. Dorado calculates a separate score for the barcode flanks and the barcode sequence itself. The score is roughly `1.0f - edit_distance / length(target_seq)`, where the `target_seq` is either the reference flank or reference barcode sequence.
2. For double ended barcodes, the __best__ window (either front or rear) is chosen based on the flank scores.
3. After choosing the best window for an arrangement, each barcode candidate within the arrangement is sorted by the barcode score.
1. Dorado uses the flanking sequences defined in `maskX_front/rear` to find a window in the read where the barcode is situated.
2. For double ended barcodes, the __best__ window (either from the front or rear of the read) is chosen based on the alignment of the flanking mask sequences.
3. After choosing the best window for an arrangement, each barcode candidate within the arrangement is aligned to the subsequence within the window. The alignment may optionally consider additional bases from the preceding/succeeding flank (as specifed in the `flank_left_pad` and `flank_right_pad` parameters). The edit distance of this alignment is assigned as a penalty to each barcode.

Once barcodes are sorted by barcode score, the top candidate is checked against the following rules -
1. If the flank score is above `min_soft_flank_threshold` and the barcode score is above `min_hard_barcode_threshold`, then the barcode is kept as a candidate.
2. If the barcode score is above `min_soft_barcode_threshold` and the flank score is above `min_hard_flank_threshold`, then the barcode is kept as a candidate.
3. If the arrangement is double ended, if both the top and bottom barcode sequence scores are above `min_hard_barcode_threshold`, then the barcode is kept as a candidate.
Once barcodes are sorted by barcode penalty, the top candidate is checked against the following rules -
1. Is the barcode penalty below `max_barcode_penalty` and the distance between top 2 barcode penalties greater than `min_barcode_penalty_dist`?.
2. Is the barcode penalty above `max_barcode_penalty` but the distance between top 2 barcodes penalties greater then `min_separation_only_dist`?
3. Is the flank score below the `min_flank_score`?

If a candidate still remains and there are at least 2 scores candidates in the sorted list, the difference between the best and second best candidate is computed. Only
if that score is greater than `min_barcode_score_dist`, the best candidate is considered a hit. Otherwise the read is considered as `unclassified`.
If a candidate meets (1) or (2) AND (3), and the location of the start/end of the barcode construct is within `barcode_end_proximity` bases of the ends of the read, then it is considered a hit.

| Scoring option | Description |
| -- | -- |
| min_soft_barcode_threshold | If barcode score meets this threshold and flank score meets its hard threshold, consider a hit. Soft score is higher than hard score. |
| min_hard_barcode_threshold | Minimum score threshold a barcode must meet. |
| min_soft_flank_threshold | If flank score meets this threshold and barcode score meets its hard threshold, consider a hit. Soft score is higher than hard score. |
| min_hard_barcode_threshold | Minimum score threshold a flank must meet. |
| min_barcode_score_dist | Minimum distance between barcode scores of best and second best hits. |
| max_barcode_penalty | The maximum edit distance allowed for a classified barcode. Considered in conjunction with the `min_barcode_penalty_dist` parameter. |
| min_barcode_penalty_dist | The minimum penalty difference between top-2 barcodes required for classification. Used in conjunction with `max_barcode_cost`. |
| min_separation_only_dist | The minimum penalty difference between the top-2 barcodes required for classification when the `max_barcode_cost` is not met. |
| barcode_end_proximity | Proximity of the end of the barcode construct to the ends of the read required for classification. |
| flank_left_pad | Number of bases to use from preceding flank during barcode alignment. |
| flank_right_pad | Number of bases to use from succeeding flank during barcode alignment. |
| front_barcode_window | Number of bases at the front of the read within which to look for barcodes. |
| rear_barcode_window | Number of bases at the rear of the read within which to look for barcodes. |
| min_flank_score | Minimum score for the flank alignment. Score here is 1.f - (edit distance) / flank_length |

For `flank_left_pad` and `flank_right_pad`, something in the range of 5-10 bases is typically good. Note that errors from this padding region are also part of the barcode alignment penalty. Therefore a bigger padding region may require a higher `max_barcode_cost` for classification.

### Custom Sequences File

Expand Down
2 changes: 1 addition & 1 deletion dorado/3rdparty/ont-minimap2
9 changes: 6 additions & 3 deletions dorado/alignment/Minimap2Index.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -101,8 +101,11 @@ bool Minimap2Index::initialise(Minimap2Options options) {
m_mapping_options = std::make_optional<mm_mapopt_t>();

mm_set_opt(0, &m_index_options.value(), &m_mapping_options.value());
// Setting options to map-ont default till relevant args are exposed.
mm_set_opt("map-ont", &m_index_options.value(), &m_mapping_options.value());
if (mm_set_opt(options.mm2_preset.c_str(), &m_index_options.value(),
&m_mapping_options.value()) != 0) {
spdlog::error("Cannot set mm2 options with preset: {}", options.mm2_preset);
return false;
}

set_index_options(options);
set_mapping_options(options);
Expand Down Expand Up @@ -180,4 +183,4 @@ const mm_mapopt_t& Minimap2Index::mapping_options() const {

const Minimap2Options& Minimap2Index::get_options() const { return m_options; }

} // namespace dorado::alignment
} // namespace dorado::alignment
16 changes: 9 additions & 7 deletions dorado/alignment/Minimap2Options.h
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
#pragma once

#include <cstdint>
#include <string>
#include <tuple>

namespace dorado::alignment {
Expand All @@ -9,11 +10,12 @@ struct Minimap2IndexOptions {
short kmer_size;
short window_size;
uint64_t index_batch_size;
std::string mm2_preset;
};

inline bool operator<(const Minimap2IndexOptions& l, const Minimap2IndexOptions& r) {
return std::tie(l.kmer_size, l.window_size, l.index_batch_size) <
std::tie(r.kmer_size, r.window_size, r.index_batch_size);
return std::tie(l.kmer_size, l.window_size, l.index_batch_size, l.mm2_preset) <
std::tie(r.kmer_size, r.window_size, r.index_batch_size, r.mm2_preset);
}

inline bool operator>(const Minimap2IndexOptions& l, const Minimap2IndexOptions& r) {
Expand All @@ -29,8 +31,8 @@ inline bool operator>=(const Minimap2IndexOptions& l, const Minimap2IndexOptions
}

inline bool operator==(const Minimap2IndexOptions& l, const Minimap2IndexOptions& r) {
return std::tie(l.kmer_size, l.window_size, l.index_batch_size) ==
std::tie(r.kmer_size, r.window_size, r.index_batch_size);
return std::tie(l.kmer_size, l.window_size, l.index_batch_size, l.mm2_preset) ==
std::tie(r.kmer_size, r.window_size, r.index_batch_size, r.mm2_preset);
}

inline bool operator!=(const Minimap2IndexOptions& l, const Minimap2IndexOptions& r) {
Expand Down Expand Up @@ -87,7 +89,7 @@ inline bool operator==(const Minimap2Options& l, const Minimap2Options& r) {

inline bool operator!=(const Minimap2Options& l, const Minimap2Options& r) { return !(l == r); }

static constexpr Minimap2Options dflt_options{{15, 10, 16000000000ull},
{5, 500, 20000, false, false, true},
false};
static const Minimap2Options dflt_options{{15, 10, 16000000000ull, "lr:hq"},
{5, 500, 20000, false, false, true},
false};
} // namespace dorado::alignment
8 changes: 5 additions & 3 deletions dorado/cli/aligner.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -269,9 +269,11 @@ int aligner(int argc, char* argv[]) {

// Report progress during output file finalisation.
tracker.set_description("Sorting output files");
hts_file.finalise([&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
});
hts_file.finalise(
[&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
},
writer_threads);

tracker.summarize();

Expand Down
8 changes: 5 additions & 3 deletions dorado/cli/basecaller.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -319,9 +319,11 @@ void setup(std::vector<std::string> args,

// Report progress during output file finalisation.
tracker.set_description("Sorting output files");
hts_file.finalise([&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
});
hts_file.finalise(
[&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
},
thread_allocations.writer_threads);

// Give the user a nice summary.
tracker.summarize();
Expand Down
7 changes: 7 additions & 0 deletions dorado/cli/cli_utils.h
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,12 @@ void add_minimap2_arguments(ArgParser& parser, const Options& dflt) {
"specified as NUM,[NUM]")
.default_value(to_size(dflt.bandwidth) + "," + to_size(dflt.bandwidth_long));

// Setting options to lr:hq which is appropriate for high quality nanopore reads.
parser.visible.add_argument("--mm2-preset")
.help("minimap2 preset for indexing and mapping. Alias for the -x "
"option in minimap2.")
.default_value(dflt.mm2_preset);

parser.hidden.add_argument("--secondary-seq")
.help("minimap2 output seq/qual for secondary and supplementary alignments")
.default_value(false)
Expand Down Expand Up @@ -251,6 +257,7 @@ Options process_minimap2_arguments(const ArgParser& parser, const Options& dflt)
throw std::runtime_error("Wrong number of arguments for option '-r'.");
}
res.soft_clipping = parser.visible.get<bool>("Y");
res.mm2_preset = parser.visible.get<std::string>("mm2-preset");
res.secondary_seq = parser.hidden.get<bool>("secondary-seq");
res.print_aln_seq = parser.hidden.get<bool>("print-aln-seq");
return res;
Expand Down
12 changes: 9 additions & 3 deletions dorado/cli/demux.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,7 @@ int demuxer(int argc, char* argv[]) {
auto emit_summary = parser.visible.get<bool>("emit-summary");
auto threads(parser.visible.get<int>("threads"));
auto max_reads(parser.visible.get<int>("max-reads"));
auto no_trim(parser.visible.get<bool>("--no-trim"));

alignment::AlignmentProcessingItems processing_items{reads, recursive_input, output_dir, true};
if (!processing_items.initialise()) {
Expand Down Expand Up @@ -198,6 +199,12 @@ int demuxer(int argc, char* argv[]) {
}

add_pg_hdr(header.get());
if (!no_trim) {
// Remove SQ lines from header since alignment information
// is invalidated after trimming.
utils::strip_sq_hdr(header.get());
sam_hdr_remove_tag_id(header.get(), "HD", NULL, NULL, "SO");
}

auto barcode_sample_sheet = parser.visible.get<std::string>("--sample-sheet");
std::unique_ptr<const utils::SampleSheet> sample_sheet;
Expand All @@ -219,9 +226,8 @@ int demuxer(int argc, char* argv[]) {
}
pipeline_desc.add_node<BarcodeClassifierNode>(
{demux_writer}, demux_threads, kit_names,
parser.visible.get<bool>("--barcode-both-ends"),
parser.visible.get<bool>("--no-trim"), std::move(allowed_barcodes),
std::move(custom_kit), std::move(custom_barcode_file));
parser.visible.get<bool>("--barcode-both-ends"), no_trim,
std::move(allowed_barcodes), std::move(custom_kit), std::move(custom_barcode_file));
}

// Create the Pipeline from our description.
Expand Down
11 changes: 7 additions & 4 deletions dorado/cli/duplex.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -359,7 +359,8 @@ int duplex(int argc, char* argv[]) {
SamHdrPtr hdr(sam_hdr_init());
cli::add_pg_hdr(hdr.get(), args);

utils::HtsFile hts_file("-", output_mode, 4);
constexpr int WRITER_THREADS = 4;
utils::HtsFile hts_file("-", output_mode, WRITER_THREADS);

PipelineDescriptor pipeline_desc;
auto hts_writer = PipelineDescriptor::InvalidNodeHandle;
Expand Down Expand Up @@ -554,9 +555,11 @@ int duplex(int argc, char* argv[]) {

// Report progress during output file finalisation.
tracker.set_description("Sorting output files");
hts_file.finalise([&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
});
hts_file.finalise(
[&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
},
WRITER_THREADS);

tracker.summarize();
if (!dump_stats_file.empty()) {
Expand Down
14 changes: 11 additions & 3 deletions dorado/cli/trim.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
#include "read_pipeline/HtsReader.h"
#include "read_pipeline/HtsWriter.h"
#include "read_pipeline/ProgressTracker.h"
#include "utils/bam_utils.h"
#include "utils/basecaller_utils.h"
#include "utils/log_utils.h"
#include "utils/stats.h"
Expand Down Expand Up @@ -116,6 +117,11 @@ int trim(int argc, char* argv[]) {
HtsReader reader(reads[0], read_list);
auto header = SamHdrPtr(sam_hdr_dup(reader.header));
add_pg_hdr(header.get());
// Always remove alignment information from input header
// because at minimum the adapters are trimmed, which
// invalidates the alignment record.
utils::strip_sq_hdr(header.get());
sam_hdr_remove_tag_id(header.get(), "HD", NULL, NULL, "SO");

auto output_mode = OutputMode::BAM;

Expand Down Expand Up @@ -176,9 +182,11 @@ int trim(int argc, char* argv[]) {

// Report progress during output file finalisation.
tracker.set_description("Sorting output files");
hts_file.finalise([&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
});
hts_file.finalise(
[&](size_t progress) {
tracker.update_post_processing_progress(static_cast<float>(progress));
},
trim_writer_threads);
tracker.summarize();

spdlog::info("> finished adapter/primer trimming");
Expand Down
13 changes: 13 additions & 0 deletions dorado/data_loader/DataLoader.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
#include "models/kits.h"
#include "models/models.h"
#include "read_pipeline/ReadPipeline.h"
#include "read_pipeline/messages.h"
#include "utils/PostCondition.h"
#include "utils/time_utils.h"
#include "utils/types.h"
Expand Down Expand Up @@ -229,6 +230,7 @@ SimplexReadPtr process_pod5_read(
// used has a rapid adapter and which one.
const auto condition_info = models::ConditionInfo(get_chemistry_key(run_info_data));
new_read->read_common.rapid_chemistry = condition_info.rapid_chemistry();
new_read->read_common.chemistry = condition_info.chemistry();

pod5_end_reason_t end_reason_value{POD5_END_REASON_UNKNOWN};
char end_reason_string_value[200];
Expand Down Expand Up @@ -853,6 +855,7 @@ void DataLoader::load_pod5_reads_from_file_by_read_ids(const std::string& path,

for (auto& v : futures) {
auto read = v.get();
check_read(read);
m_pipeline.push_message(std::move(read));
m_loaded_read_count++;
}
Expand Down Expand Up @@ -911,6 +914,7 @@ void DataLoader::load_pod5_reads_from_file(const std::string& path) {

for (auto& v : futures) {
auto read = v.get();
check_read(read);
m_pipeline.push_message(std::move(read));
m_loaded_read_count++;
}
Expand Down Expand Up @@ -1023,6 +1027,15 @@ void DataLoader::load_fast5_reads_from_file(const std::string& path) {
}
}

void DataLoader::check_read(const SimplexReadPtr& read) {
if (read->read_common.chemistry == models::Chemistry::UNKNOWN &&
m_log_unknown_chemistry.exchange(false)) {
spdlog::warn(
"Could not determine sequencing Chemistry from read data - "
"some features might be disabled");
}
}

DataLoader::DataLoader(Pipeline& pipeline,
const std::string& device,
size_t num_worker_threads,
Expand Down
7 changes: 7 additions & 0 deletions dorado/data_loader/DataLoader.h
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
#pragma once
#include "models/models.h"
#include "read_pipeline/messages.h"
#include "utils/stats.h"
#include "utils/types.h"

Expand Down Expand Up @@ -75,6 +76,7 @@ class DataLoader {
void load_pod5_reads_from_file_by_read_ids(const std::string& path,
const std::vector<ReadID>& read_ids);
void load_read_channels(std::filesystem::path data_path, bool recursive_file_loading);

Pipeline& m_pipeline; // Where should the loaded reads go?
std::atomic<size_t> m_loaded_read_count{0};
std::string m_device;
Expand All @@ -87,6 +89,11 @@ class DataLoader {
std::unordered_map<int, std::vector<ReadSortInfo>> m_reads_by_channel;
std::unordered_map<std::string, size_t> m_read_id_to_index;
int m_max_channel{0};

// Issue warnings if read is potentially problematic
inline void check_read(const SimplexReadPtr& read);
// A flag to warn only once if the data chemsitry is known
std::atomic<bool> m_log_unknown_chemistry{true};
};

} // namespace dorado
Loading

0 comments on commit cbdd51a

Please sign in to comment.