Short basecalled R9.4.1 reads using both trained and default models in bonito, but not in guppy #376

Open
andreaswallberg opened this issue Dec 20, 2023 · 0 comments


Dear developers,

We are training our own models for non-model organisms using R9.4.1 data, focusing on data associated with protein-coding genes. When we basecall with bonito, using either our own trained model or the default models, the output reads are often only 50% as long as the reads guppy produces from the same FAST5 files. We also get many more reads, especially with our own model. Although our model improves the base-level quality of the reads, we are concerned about how short the resulting reads are.
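
To make the comparison concrete, here is a minimal sketch of how read count, total bases, median length and N50 could be tabulated for the output of each basecaller. It assumes both tools wrote standard four-line FASTQ records (plain or gzipped); the file names in the usage comment are hypothetical placeholders, and the helper names are ours, not part of bonito or guppy.

```python
import gzip
from statistics import median


def read_lengths(path):
    """Return sequence lengths from a FASTQ file (plain or gzipped).

    Assumes four lines per record, i.e. no wrapped sequence lines.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = []
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record holds the sequence
                lengths.append(len(line.strip()))
    return lengths


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Collect the statistics we compare between basecallers."""
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "median": median(lengths) if lengths else 0,
        "n50": n50(lengths),
    }


# Usage with hypothetical file names:
# for name in ("bonito.fastq.gz", "guppy.fastq.gz"):
#     print(name, summarize(read_lengths(name)))
```

If the total number of bases is similar between the two outputs while the read count is higher and the lengths are shorter, that would point towards reads being split rather than truncated; a clearly lower base total would be more consistent with clipping.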

We wonder whether the algorithm performs some kind of clipping, e.g. whether basecalling proceeds only until the local quality score has dropped by some amount relative to the overall quality of the read, and then terminates at that point.

We have also noticed that older versions of bonito had default settings that seemed better suited to R9 data, while newer versions seem to be adapted to R10 data. Perhaps this also reflects internal changes to the algorithms, such that the current implementation of bonito does not fit legacy R9.4.1 data as well as it used to.

For the legacy R9.4.1 data we are currently exploring, would you recommend using an older version of bonito?
