Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

homopolymer compression? #19

Open
nextgenusfs opened this issue Dec 7, 2021 · 4 comments
Open

homopolymer compression? #19

nextgenusfs opened this issue Dec 7, 2021 · 4 comments

Comments

@nextgenusfs
Copy link

Thanks @ksahlin for the great tool. I've been trying to use for a variety of clustering tasks using some amplicon data. I still have some uncertainty on how it is performing as I don't necessarily know exactly I should be expecting in my amplicon pool, but I do know that I see some "lumping" of closely related sequences that I would like to see come out in separate clusters.

So we are running all of our amplicons with the newer R10 flow cells and use high accuracy basecalling model -- the goal is to reduce errors in homopolymers. I notice this is different than at least your test case (using fast basecalling model). So as I looked at the code I see that the first step in clustering is actually to compress homopolymers: https://github.com/ksahlin/isONclust/blob/master/modules/cluster.py#L209. So I'm wondering what would happen if homopolymers were not compressed? Is compressing homopolymers required here just for kmer selection or is it being used because of higher error rates in these sequences?

Related point is that the error profiles are quite different between the ONT guppy basecall models, so I'm also then wondering about the role of the empirical distances that seem to be hard coded into isONclust -- would the basecall model (ie single read accuracy) alter these distances? I guess in other words should I expect differing performance using the FAST vs HAC basecall models? And if I try to remove the homopolymer compression step as a test, would that cause a problem?

@ksahlin
Copy link
Owner

ksahlin commented Dec 7, 2021

Hi @nextgenusfs,

Q1:
Homopolymer compression mainly helps for transcriptome data as it removes polyA tails.

I'm fairly sure you can simply change one line (274) here from

seq_hpol_comp = ''.join(ch for ch, _ in itertools.groupby(seq))

to

seq_hpol_comp = seq

and isONclust should still work but without homopolymer compression.

Q2:
The error probabilities are precomputed and hardcoded. They are agnostic to the model the basecaller uses. As long as the base caller gives roughly consistent Phred quality scores with the actual error rate, it should be fine regardless of the basecaller used.

Best,
Kristoffer

@nextgenusfs
Copy link
Author

Fantastic thanks -- I'll give it a try and see if I find an improvement.

@nextgenusfs
Copy link
Author

Brilliant -- I just needed to also edit at the qualcomp lines below -- ie the hp_compressed error rate function. Hopefully that looks correct to you?
https://github.com/nextgenusfs/isONclust/blob/master/modules/cluster.py#L291-L314

@ksahlin
Copy link
Owner

ksahlin commented Dec 7, 2021

Ah, don't know how I missed that, I was looking for the quality values too.. Sorry about that.

Yes, your fix looks good from what I can see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants