Question - Split on adapter #29

Open
jagos01 opened this issue Dec 19, 2022 · 10 comments

Comments

@jagos01

jagos01 commented Dec 19, 2022

Hello,
I am duplex basecalling with dorado. Can split_on_adapter accept unmapped bam files for input/output?
Thanks

@onordesjo

Hi,

Thanks for the question. It's not possible yet, though I suspect it would be useful. We hope to release an improved version of template/complement splitting today, which should work better than adapter splitting for duplex.

@jagos01
Author

jagos01 commented Dec 19, 2022

Thanks for your quick reply. I will try it out when it is released.

@onordesjo

Hi @jagos01, v0.2.20 is now out, and you can use it to recover reads that were not split.

Feel free to try it out by:

  1. simplex-calling (fast is ok):
$ dorado basecaller dna_r10.4.1_e8.2_400bps_fast@v4.0.0 pod5s/ --emit-moves > unmapped_reads_with_moves.sam
  2. run split_pairs like this:
$ duplex_tools split_pairs unmapped_reads_with_moves.sam pod5s/ pod5s_splitduplex/

This should give you new pod5s in the pod5s_splitduplex directory (with new read-ids), together with the pair_ids that correspond to the new read_ids.
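
A quick sanity check on the outputs of the two steps might look like the sketch below. The helper names `count_sam_records` and `count_pod5_files` are hypothetical (they are not part of dorado or duplex_tools), and the sketch assumes the file and directory names used in the commands above. It only counts records and files; it does not validate their contents.

```shell
# Count read records in a SAM file by skipping header lines.
# SAM header lines start with '@'; read names (QNAME) may not, so this is safe.
count_sam_records() {
    # grep -c exits non-zero when the count is 0, so do not treat that as an error
    grep -vc '^@' "$1" || true
}

# Count .pod5 files anywhere under a directory.
count_pod5_files() {
    find "$1" -type f -name '*.pod5' | wc -l | tr -d ' '
}

# Only report if the outputs from the two steps actually exist.
if [ -f unmapped_reads_with_moves.sam ] && [ -d pod5s_splitduplex ]; then
    echo "simplex reads:    $(count_sam_records unmapped_reads_with_moves.sam)"
    echo "split pod5 files: $(count_pod5_files pod5s_splitduplex)"
fi
```

An empty `pod5s_splitduplex` directory, or a SAM file with zero records, would suggest that one of the two steps did not run as intended.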

Feel free to try it out and let me know how things are working.

@jagos01
Author

jagos01 commented Dec 19, 2022

Hello @onordesjo, I followed the directions outlined in the readme for duplex calling with dorado. I generated the pair_id files for both step 2a and 2b. They contained 4667 and 7867 pairs respectively. When stereo basecalling those reads, dorado only basecalled 4114 and 1338 reads. Why is the number of stereo basecalled reads less than the number of read pairs?
Thanks
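
To put the gap in proportion, the fraction of pairs that came out of stereo basecalling can be worked out from the four counts quoted above. This is only a small arithmetic sketch; the `retention` helper is a hypothetical name, not a dorado or duplex_tools command.

```shell
# Print stereo-basecalled reads as a percentage of input pairs.
# Usage: retention <basecalled_reads> <input_pairs>
retention() {
    awk -v called="$1" -v pairs="$2" 'BEGIN { printf "%.1f%%\n", 100 * called / pairs }'
}

retention 4114 4667   # step 2a: prints 88.2%
retention 1338 7867   # step 2b: prints 17.0%
```

So roughly 88% of the 2a pairs were basecalled, against roughly 17% of the 2b pairs.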

@onordesjo

Hi @jagos01. Can I ask what type of data you have been looking at? Whole genome? Any amplification? Dorado does some filtering to ensure that bad pairs don't get through, so some loss is to be expected. I would expect fewer pairs to be generated in step 2b than in 2a, but with greater retention of good pairs. The 2a pairs would also need to be generated from the full dataset rather than a subset (or, alternatively, from a selection of channels).

Any of this information would help to explain what you are seeing.

@jagos01
Author

jagos01 commented Dec 19, 2022

Hello @onordesjo. This is bacterial whole-genome sequencing data. No amplification was carried out. The data is split over two runs (I had to restart the sequencer a couple of hours into the run). I was also expecting fewer pairs from 2b. 2a was generated from the complete data set.

@ollenordesjo
Contributor

ollenordesjo commented Dec 19, 2022 via email

@jagos01
Author

jagos01 commented Dec 19, 2022

I inspected the pod5 reads for each run and the unmapped BAM file contains reads from both runs.

@ollenordesjo
Contributor

ollenordesjo commented Dec 20, 2022 via email

@jagos01
Author

jagos01 commented Dec 20, 2022

Thanks, I have emailed a link to the bam file.


3 participants