-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
homopolymer compression? #19
Comments
Hi @nextgenusfs, Q1: I'm fairly sure you can simply change one line (274) here from
to
and isONclust should still work but without homopolymer compression. Q2: Best, |
Fantastic thanks -- I'll give it a try and see if I find an improvement. |
Brilliant -- I just needed to also edit at the qualcomp lines below -- ie the hp_compressed error rate function. Hopefully that looks correct to you? |
Ah, don't know how I missed that, I was looking for the quality values too.. Sorry about that. Yes, your fix looks good from what I can see. |
Thanks @ksahlin for the great tool. I've been trying to use for a variety of clustering tasks using some amplicon data. I still have some uncertainty on how it is performing as I don't necessarily know exactly I should be expecting in my amplicon pool, but I do know that I see some "lumping" of closely related sequences that I would like to see come out in separate clusters.
So we are running all of our amplicons with the newer R10 flow cells and use high accuracy basecalling model -- the goal is to reduce errors in homopolymers. I notice this is different than at least your test case (using fast basecalling model). So as I looked at the code I see that the first step in clustering is actually to compress homopolymers: https://github.com/ksahlin/isONclust/blob/master/modules/cluster.py#L209. So I'm wondering what would happen if homopolymers were not compressed? Is compressing homopolymers required here just for kmer selection or is it being used because of higher error rates in these sequences?
Related point is that the error profiles are quite different between the ONT guppy basecall models, so I'm also then wondering about the role of the empirical distances that seem to be hard coded into isONclust -- would the basecall model (ie single read accuracy) alter these distances? I guess in other words should I expect differing performance using the FAST vs HAC basecall models? And if I try to remove the homopolymer compression step as a test, would that cause a problem?
The text was updated successfully, but these errors were encountered: