Expose CLI parameter for maximum coverage on a single k-mer #263
I am looking at implementing the ability to deal with targeted/amplicon sequencing in drprg. However, the major hurdle is pandora's inability to deal with quite varied coverage. There are lots of hard-coded values in pandora that create this restriction. I will discuss a few here.
The easiest solution here is to add a couple of new CLI parameters to set the multiplier for global coverage and the maximum allowed k-mer coverage, although this doesn't quite solve the 16-bit problem. It would also be great if we could estimate the genome size rather than the user needing to specify it... What are your thoughts on the correct way to allow pandora to handle this type of data?
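As a sketch of what such options might look like (a Python illustration; pandora's real CLI is C++, and these flag names are hypothetical, not pandora's actual options):

```python
import argparse

# Sketch only: pandora's real CLI is C++; these flag names are hypothetical.
parser = argparse.ArgumentParser(description="coverage options (illustrative)")
parser.add_argument(
    "--global-covg-multiplier", type=float, default=2.0,
    help="multiplier on expected depth used when capping global coverage "
         "(hypothetical name; the value is currently hard-coded)",
)
parser.add_argument(
    "--max-kmer-covg", type=int, default=65535,
    help="maximum allowed coverage on a single k-mer; coverage is stored "
         "in 16 bits, so values above 65535 cannot be represented anyway",
)

args = parser.parse_args(["--max-kmer-covg", "1000"])
print(args.max_kmer_covg)  # 1000
```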
A couple of things:
I think pandora, as it is now, only works well at a specific read-coverage setting. I'd downsample your targeted/amplicon sequencing to 300x for each region and run pandora. This is the practical solution, as it just adds one step to your pipeline (downsampling) before running pandora as is.
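The downsampling step could be sketched like this (a pure-Python illustration; in practice a dedicated tool such as samtools or rasusa would subsample the reads per region):

```python
import random

def downsample_reads(reads, current_depth, target_depth=300, seed=42):
    """Randomly keep a fraction of reads so that a region's depth drops
    from current_depth to roughly target_depth."""
    if current_depth <= target_depth:
        return list(reads)  # already at or below the target; keep all reads
    keep_fraction = target_depth / current_depth
    rng = random.Random(seed)  # seeded for reproducibility
    return [read for read in reads if rng.random() < keep_fraction]

reads = [f"read_{i}" for i in range(10_000)]  # toy stand-in for one amplicon
kept = downsample_reads(reads, current_depth=3000, target_depth=300)
print(len(kept))  # roughly 1,000 reads kept (about 10%)
```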
I think it would be better than the current approach, where we take a fixed genome size as input from the CLI. Estimating the breadth, I think, is not hard; I am thinking of:
This is better because now our estimated breadth is the size of a personalised reference for each sample. This deals with issues of the genome size being inaccurate (which it can be for species with a small core genome) and the pangenome being possibly incomplete.
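As a minimal sketch (invented data structures, not pandora's internals), the breadth estimate could be the summed length of loci that actually receive reads:

```python
def estimate_breadth(locus_lengths, reads_per_locus, min_reads=1):
    """Estimate the personalised-reference size as the total length of
    loci with at least `min_reads` reads mapped to them."""
    return sum(
        length
        for locus, length in locus_lengths.items()
        if reads_per_locus.get(locus, 0) >= min_reads
    )

locus_lengths = {"geneA": 1200, "geneB": 900, "geneC": 1500}
reads_per_locus = {"geneA": 350, "geneC": 4}  # geneB received no reads
print(estimate_breadth(locus_lengths, reads_per_locus))  # 2700
```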
Me too. I think >65k depth is an edge case, and we can either cap at 65k or ask the user to downsample to 65k.

**Current bugs with coverage**

Re-reading pandora's code because of this issue, I found some bugs:
When estimating the error rate, we compute coverage in a different way (we get the number of reads mapped to each PRG and divide by the number of PRGs). There might be more bugs, besides those outlined by Michael already.

**Possible solution outline**
There might be other issues as implementation goes on...

**Planned timeline**

The main decision and issue we will face is choosing how to compute the coverage of a PRG (item 3). Item 4 is not an issue; we will have CLI parameters for each filter, so we can choose which ones we apply. I have to finish the index refactoring and lazy loading in pandora before starting this, so early February is possible, especially because this would benefit roundhound. I think we could implement items 3 and 4 and test both on paper and drprg data (for item 4, we can activate each filter and see its effect). I mostly need to check with @iqbal-lab and @mbhall88 whether this looks like a plan we should follow, or if we should postpone or abandon it. I can't estimate how long this would take, as I think I'd need to be one week into the implementation for a good estimate.
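To make the coverage discrepancy described above concrete, here is a toy contrast of the two computations (simplified, invented inputs):

```python
def covg_reads_per_prg(reads_mapped_per_prg):
    """Coverage as used in error-rate estimation: the mean number of
    reads mapped per PRG (ignores read length and PRG length)."""
    return sum(reads_mapped_per_prg) / len(reads_mapped_per_prg)

def covg_depth(total_bases_mapped, genome_size):
    """Depth-style coverage: total mapped bases over the genome size."""
    return total_bases_mapped / genome_size

reads_per_prg = [100, 100, 100]  # 300 reads spread across 3 PRGs
total_bases = 300 * 150          # assume 150 bp reads
print(covg_reads_per_prg(reads_per_prg))  # 100.0 "reads per PRG"
print(covg_depth(total_bases, 4500))      # 10.0x depth: a very different number
```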
I'm happy to cap coverage at 65k rather than require the user to subsample. One other thing that wasn't mentioned by either of you is point 1 in my comment (the global coverage multiplier). I guess this is fixable with a CLI option, but I wasn't sure whether any of the above proposed changes would alter the need for this setting? In terms of your first bug about coverage @leoisl, I had that exact realisation last night. Of your proposed coverage calculation methods, I like 3.4 (post-filtering, obviously). For option 4, maybe leave it as is for now and we can revisit it afterwards? Best not to change too many things at once. I like the solution outline. One question about the planned timeline: would you go for this or #316 first? Being selfish, #316 would be the higher priority for me, but maybe this issue would address that too?
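For reference, the 65k figure corresponds to the largest value a 16-bit unsigned counter can hold (65,535); a saturating cap, sketched in Python rather than pandora's actual C++:

```python
UINT16_MAX = 65_535  # coverage is stored in a 16-bit unsigned integer

def add_coverage(current, increment):
    """Increment a k-mer's coverage, saturating at the 16-bit maximum
    instead of overflowing and wrapping around to a small value."""
    return min(current + increment, UINT16_MAX)

print(add_coverage(65_000, 10_000))  # 65535, not a wrapped-around value
print(add_coverage(10, 5))           # 15
```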
My vote would be to refactor these three conditions outright into new conditions, all controlled by CLI parameters. Maybe have
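A hypothetical sketch of what CLI-controlled conditions might look like (the flag names, thresholds, and `site` fields are all invented for illustration, not pandora's actual filters):

```python
import argparse

# Illustrative only: these flags and thresholds are invented, not pandora's.
parser = argparse.ArgumentParser()
parser.add_argument("--no-min-covg-filter", action="store_true",
                    help="disable the minimum-coverage condition")
parser.add_argument("--no-strand-bias-filter", action="store_true",
                    help="disable the strand-bias condition")
parser.add_argument("--no-gaps-filter", action="store_true",
                    help="disable the gaps condition")

def passes_filters(site, args):
    """Apply only the conditions that were not disabled on the CLI."""
    if not args.no_min_covg_filter and site["covg"] < 3:
        return False
    if not args.no_strand_bias_filter and site["strand_bias"] > 0.99:
        return False
    if not args.no_gaps_filter and site["gaps"] > 0.5:
        return False
    return True

site = {"covg": 10, "strand_bias": 1.0, "gaps": 0.1}
args = parser.parse_args(["--no-strand-bias-filter"])
print(passes_filters(site, args))  # True: the strand-bias condition is off
```

Keeping each condition behind its own flag means the effect of each filter can be measured independently, as suggested for item 4.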
I think these are different issues, so it makes sense to think about prioritisation. I don't have any preference; the priority for pandora now is to provide support for downstream tools, i.e. drprg and roundhound. I agree with whatever prioritisation you and @iqbal-lab define.
Yeah that sounds wise.
Okay, cool! Let me know if you need any more samples to test on. I have 9 in total with the same problem as #316.
Sounds good. We'll probably combine this with simulated data with perfect reads starting at every base, and see which do/do not map.
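The perfect-read simulation could be as simple as a sliding window, one error-free read starting at every base (a sketch, not the actual evaluation code):

```python
def perfect_reads(sequence, read_length):
    """Yield an error-free read starting at every base of `sequence`, so
    every position (away from the ends) is covered `read_length` times."""
    for start in range(len(sequence) - read_length + 1):
        yield sequence[start:start + read_length]

genome = "ACGTACGTACGTACGTACGT"  # toy 20 bp "genome"
reads = list(perfect_reads(genome, read_length=5))
print(len(reads))  # 16 reads: one per valid start position
```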
Originally posted by @rmcolq in #262 (comment)