Skip to content

Commit

Permalink
Merge branch 'jdaw/fix-rna-adapter-trimming' into 'master'
Browse files Browse the repository at this point in the history
[trim] Fix RNA adapter trim during polyA

Closes DOR-557

See merge request machine-learning/dorado!829
  • Loading branch information
tijyojwad committed Feb 5, 2024
1 parent f0f9883 commit 59a083c
Show file tree
Hide file tree
Showing 6 changed files with 35 additions and 11 deletions.
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -253,9 +253,9 @@ In addition to supporting the standard barcode kits from Oxford Nanopore, Dorado

### Poly(A) tail estimation

Dorado has initial support for estimating poly(A) tail lengths for cDNA and RNA. Note that Oxford Nanopore cDNA reads are sequenced in two different orientations and Dorado poly(A) tail length estimation handles both (A and T homopolymers). This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.
Dorado has initial support for estimating poly(A) tail lengths for cDNA (PCS and PCB kits) and RNA. Note that Oxford Nanopore cDNA reads are sequenced in two different orientations and Dorado poly(A) tail length estimation handles both (A and T homopolymers). This feature can be enabled by passing `--estimate-poly-a` to the `basecaller` command. It is disabled by default. The estimated tail length is stored in the `pt:i` tag of the output record. Reads for which the tail length could not be estimated will not have the `pt:i` tag.

Note that if this option is used, then adapter and primer trimming will be automatically disabled.
Note that if this option is used, then adapter/primer/barcode trimming will be automatically **disabled** for DNA.

## Available basecalling models

Expand Down
6 changes: 4 additions & 2 deletions dorado/read_pipeline/PolyACalculator.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -280,7 +280,9 @@ SignalAnchorInfo determine_signal_anchor_and_strand_cdna(const dorado::SimplexRe
// RNA polyA appears at the beginning of the strand. Since the adapter
// for RNA has been trimmed off already, the polyA search can begin
// from the start of the signal.
SignalAnchorInfo determine_signal_anchor_and_strand_drna() { return SignalAnchorInfo{false, 0, 0}; }
SignalAnchorInfo determine_signal_anchor_and_strand_drna(const dorado::SimplexRead& read) {
return SignalAnchorInfo{false, read.read_common.rna_adapter_end_signal_pos, 0};
}

} // namespace

Expand All @@ -303,7 +305,7 @@ void PolyACalculator::worker_thread() {
// Determine the strand direction, approximate base space anchor for the tail, and whether
// the final length needs to be adjusted depending on the adapter sequence.
auto [fwd, signal_anchor, trailing_Ts] =
m_is_rna ? determine_signal_anchor_and_strand_drna()
m_is_rna ? determine_signal_anchor_and_strand_drna(*read)
: determine_signal_anchor_and_strand_cdna(*read);

if (signal_anchor >= 0) {
Expand Down
5 changes: 5 additions & 0 deletions dorado/read_pipeline/ReadPipeline.h
Original file line number Diff line number Diff line change
Expand Up @@ -97,7 +97,12 @@ class ReadCommon {

size_t get_raw_data_samples() const { return is_duplex ? raw_data.size(1) : raw_data.size(0); }

// Track length of estimated polyA tail in bases.
int rna_poly_tail_length{-1};
// Track position of end of RNA adapter in signal space. If the RNA adapter is
// trimmed, this will be 0. Otherwise it will be the position in the signal
// where the adapter ends.
int rna_adapter_end_signal_pos{0};

// subread_id is used to track 2 types of offsprings of a read
// (1) read splits
Expand Down
13 changes: 10 additions & 3 deletions dorado/read_pipeline/ScalerNode.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -120,10 +120,17 @@ void ScalerNode::worker_thread() {
bool is_rna = (m_model_type == SampleType::RNA002 || m_model_type == SampleType::RNA004);
// Trim adapter for RNA first before scaling.
int trim_start = 0;
if (is_rna && m_trim_adapter) {
if (is_rna) {
trim_start = determine_rna_adapter_pos(*read, m_model_type);
read->read_common.raw_data =
read->read_common.raw_data.index({Slice(trim_start, at::indexing::None)});
if (m_trim_adapter) {
read->read_common.raw_data =
read->read_common.raw_data.index({Slice(trim_start, at::indexing::None)});
read->read_common.rna_adapter_end_signal_pos = 0;
} else {
// If RNA adapter isn't trimmed, track where the adapter signal is ending
// so it can be used during polyA estimation.
read->read_common.rna_adapter_end_signal_pos = trim_start;
}
}

assert(read->read_common.raw_data.dtype() == at::kShort);
Expand Down
Binary file added tests/data/poly_a/rna004_pod5/rna004_pt_50.pod5
Binary file not shown.
18 changes: 14 additions & 4 deletions tests/test_simple_basecaller_execution.sh
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,8 @@ dorado_bin=$(cd "$(dirname $1)"; pwd -P)/$(basename $1)
model_name=${2:-dna_r10.4.1_e8.2_400bps_hac@v4.1.0}
batch=${3:-384}
model_name_5k=${4:-dna_r10.4.1_e8.2_400bps_hac@v4.2.0}
model_name_5k_v43=${4:-dna_r10.4.1_e8.2_400bps_hac@v4.3.0}
model_name_5k_v43=${5:-dna_r10.4.1_e8.2_400bps_hac@v4.3.0}
model_name_rna004=${6:-rna004_130bps_hac@v3.0.1}
data_dir=$test_dir/data
output_dir_name=$(echo $RANDOM | head -c 10)
output_dir=${test_dir}/${output_dir_name}
Expand All @@ -30,6 +31,8 @@ $dorado_bin download --model ${model_name_5k} --directory ${output_dir}
model_5k=${output_dir}/${model_name_5k}
$dorado_bin download --model ${model_name_5k_v43} --directory ${output_dir}
model_5k_v43=${output_dir}/${model_name_5k_v43}
$dorado_bin download --model ${model_name_rna004} --directory ${output_dir}
model_rna004=${output_dir}/${model_name_rna004}

echo dorado basecaller test stage
$dorado_bin basecaller ${model} $data_dir/pod5 -b ${batch} --emit-fastq > $output_dir/ref.fq
Expand Down Expand Up @@ -234,13 +237,20 @@ if cmp --silent -- "$file1" "$file2"; then
fi

echo "dorado test poly(A) tail estimation"
$dorado_bin basecaller -b ${batch} ${model} $data_dir/poly_a/r10_cdna_pod5/ --estimate-poly-a > $output_dir/polya.bam
samtools quickcheck -u $output_dir/polya.bam
num_estimated_reads=$(samtools view $output_dir/polya.bam | grep pt:i: | wc -l | awk '{print $1}')
$dorado_bin basecaller -b ${batch} ${model} $data_dir/poly_a/r10_cdna_pod5/ --estimate-poly-a > $output_dir/cdna_polya.bam
samtools quickcheck -u $output_dir/cdna_polya.bam
num_estimated_reads=$(samtools view $output_dir/cdna_polya.bam | grep pt:i: | wc -l | awk '{print $1}')
if [[ $num_estimated_reads -ne "2" ]]; then
echo "2 poly(A) estimated reads expected. Found ${num_estimated_reads}"
exit 1
fi
$dorado_bin basecaller -b ${batch} ${model_rna004} $data_dir/poly_a/rna004_pod5/ --estimate-poly-a > $output_dir/rna_polya.bam
samtools quickcheck -u $output_dir/rna_polya.bam
num_estimated_reads=$(samtools view $output_dir/rna_polya.bam | grep pt:i: | wc -l | awk '{print $1}')
if [[ $num_estimated_reads -ne "1" ]]; then
echo "1 poly(A) estimated reads expected. Found ${num_estimated_reads}"
exit 1
fi

echo dorado basecaller barcoding read groups
test_barcoding_read_groups() (
Expand Down

0 comments on commit 59a083c

Please sign in to comment.