
Expose CLI parameter for maximum coverage on a single k-mer #263

Open
mbhall88 opened this issue Feb 12, 2021 · 8 comments



mbhall88 commented Feb 12, 2021

https://github.com/rmcolq/pandora/blob/815da22867f2bdaa3e3bf40f382be830c1506b0a/src/estimate_parameters.cpp#L217-L231

In particular https://github.com/rmcolq/pandora/blob/815da22867f2bdaa3e3bf40f382be830c1506b0a/src/estimate_parameters.cpp#L227

"Magic" numbers always make me nervous. I appreciate that for any bacterial dataset, a k-mer coverage of 1000 or more is craziness. But if we ever want to look at viruses, or at bacteria/other organisms sequenced at very high depth, this could cause a lot of problems.
Question 3: should we expose this value as a CLI param or set it a little more "dynamically"?

Question 3: Yes, a parameter instead of a magic number would definitely be better.

Originally posted by @rmcolq in #262 (comment)
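As a minimal sketch of what exposing the cut-off could look like, assuming the value is passed down from a CLI flag into the histogram-building loop (the function and parameter names here are hypothetical, not pandora's actual code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: the hard-coded 1000 in estimate_parameters.cpp
// becomes a parameter (max_kmer_covg) that a CLI option would supply.
std::vector<uint32_t> build_covg_histogram(const std::vector<uint32_t>& kmer_covgs,
                                           uint32_t max_kmer_covg)
{
    std::vector<uint32_t> histogram(max_kmer_covg + 1, 0);
    for (const auto covg : kmer_covgs) {
        if (covg <= max_kmer_covg) { // was: a magic hard-coded threshold
            ++histogram[covg];
        }
    }
    return histogram;
}
```

The only behavioural change is that the threshold is now caller-controlled, so very-high-depth datasets can raise it without a recompile.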


mbhall88 commented Jan 5, 2023

I am looking at implementing the ability to deal with targeted/amplicon sequencing in drprg. However, the major hurdle is pandora's inability to deal with quite varied coverage. There are lots of hard-coded values in pandora that create this restriction. I will discuss a few here.

  1. https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/localPRG.cpp#L1731-L1736 In this section of the code, we skip/remove a PRG if its mean coverage is greater than 10 times the global coverage. When I tested drprg/pandora on a targeted sequencing sample I have, this was the first major issue I ran into: nearly all of my loci were considered absent. The global coverage was 223, but the mean coverage for a lot of my loci was way over 2230 (223x10) - some were as high as 43,000. As a test, I changed the magic number 10 to 1000 and reran. With that change, no genes were thrown away (apart from those that weren't in the amplicons). But I'm not sure just exposing this multiplier is the "correct" solution - the genome size used to calculate the global coverage matters here too...
  2. https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/pangenome/pangraph.cpp#L328-L334 This is just a logging message, but it uses a magic number that we should make configurable via the CLI options.
  3. In some places there are underlying assumptions that coverage fits into a 16-bit unsigned integer (but 32-bit in others): https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/kmergraphwithcoverage.cpp#L58-L61 This limits coverage to 65,535. In my test sample, I had k-mers with coverage over 100,000.
  4. https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/estimate_parameters.cpp#L228-L230 (this is what the original comment in this issue referenced): we don't count k-mers with coverage above 1000 when estimating parameters.

The easiest solution here is to make a couple of new CLI params to set the multiplier for global coverage and the maximum allowed kmer coverage. Although this doesn't quite solve the 16-bit problem. It would also be great if we could estimate the genome size rather than the user needing to specify it...

What are thoughts on the correct way to allow for handling this type of data in pandora?
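On point 3, one low-risk option in the spirit of capping rather than wrapping would be a saturating increment for the 16-bit counters. This is purely an illustrative sketch, not pandora code:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Illustrative sketch: increment a 16-bit coverage counter, saturating at
// 65,535 instead of silently wrapping around on overflow.
uint16_t saturating_increment(uint16_t covg, uint32_t amount)
{
    const uint32_t max = std::numeric_limits<uint16_t>::max(); // 65,535
    const uint32_t sum = static_cast<uint32_t>(covg) + amount;
    return static_cast<uint16_t>(sum > max ? max : sum);
}
```

This keeps the compact 16-bit storage while making >65k depth degrade gracefully to the cap.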

@iqbal-lab (Collaborator)

Couple of things

  1. Would it be very bad to estimate depth and breadth (i.e. mean number of reads at every position in a locus, and % of the locus with at least 10x depth) per locus/PRG and use that? It would solve this issue for targeted seq and for plasmids.
  2. If you have >65k depth, I see no harm in subsampling to 65k


leoisl commented Jan 6, 2023

I think pandora, as it is now, only works well within a specific read-coverage range. I'd downsample your targeted/amplicon sequencing to 300x for each region and run pandora. This is the practical solution, as it just adds one step (downsampling) to your pipeline before running pandora as is.

  1. Would it be very bad to estimate depth and breadth (i.e. mean number of reads at every position in a locus, and % of the locus with at least 10x depth) per locus/PRG and use that? It would solve this issue for targeted seq and for plasmids.

I think it would be better than the current approach, which takes a fixed genome size as input from the CLI. I don't think estimating the breadth is hard; I am thinking of:

  1. Map reads to genes;
  2. Remove FP genes;
  3. Infer ML seq for each gene;
  4. Sum the lengths of the ML seqs to estimate breadth.

This is better because the estimated breadth is then the size of a personalised reference for each sample. It deals with the genome size being inaccurate (which it can be for species with a small core genome) and with the pangenome being possibly incomplete.
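A sketch of step 4, assuming (as above) that FP genes have been removed and an ML sequence has been inferred per gene; the function name is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Illustrative sketch: the estimated breadth is the total length of the
// per-gene ML sequences, i.e. the size of the sample's personalised reference.
std::size_t estimate_breadth(const std::vector<std::string>& ml_seqs)
{
    return std::accumulate(ml_seqs.begin(), ml_seqs.end(), std::size_t{ 0 },
        [](std::size_t acc, const std::string& seq) { return acc + seq.size(); });
}
```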

2. If you have >65k depth, I see no harm in subsampling to 65k

Me too. I think >65k depth is an edge case, and we can either cap coverage at 65k or ask the user to downsample to 65k.

Current bugs with coverage

Re-reading the pandora code due to this issue, I found some bugs:

  1. [Possible bug] Coverage is based on the cumulative size of the reads processed divided by the genome size: https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/utils.cpp#L606 https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/utils.cpp#L665
    It would probably be more accurate to compute coverage from the regions of the reads that we actually map (or, even better, the regions we keep post-filtering).

When estimating the error rate, we compute coverage in a different way (we get the number of reads mapped to each PRG and divide by the number of PRGs):
https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/estimate_parameters.cpp#L235
I guess this works well for nanopore reads that cover whole loci, but probably not so well for Illumina reads...
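To make the difference concrete, here are the two estimates side by side (a sketch with illustrative names; neither function is pandora's actual code):

```cpp
#include <cassert>
#include <cstdint>

// utils.cpp-style estimate: total bases across all processed reads over the
// (user-supplied) genome size.
double covg_from_total_bases(uint64_t total_read_bases, uint64_t genome_size)
{
    return static_cast<double>(total_read_bases) / static_cast<double>(genome_size);
}

// estimate_parameters.cpp-style estimate: mean number of reads mapped per PRG.
// Reasonable when each read spans a whole locus (nanopore), much less so for
// short Illumina reads.
double covg_from_mean_reads_per_prg(uint64_t reads_mapped, uint64_t num_prgs)
{
    return static_cast<double>(reads_mapped) / static_cast<double>(num_prgs);
}
```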

There might be more bugs, besides those outlined by Michael already.

Possible solution outline

  1. I'd advocate removing the --genome-size and --max-covg parameters;
  2. When mapping, we map all reads. If the user wants to downsample, it should be done upstream with one of the several tools available for this;
  3. We estimate the coverage of each locus, for which we have several options:
    3.1. Coverage is number of reads mapped;
    3.2. Coverage is the cumulative size of the reads over the length of the ML sequence;
    3.3. Coverage is the cumulative size of the mapped regions over the length of the ML sequence;
    3.4. Coverage is the average kmer count on the ML path;
    3.5. Other options?
    I think we should be careful when calculating the coverage, as it is the most important statistic in pandora genotyping;
  4. We get a coverage distribution (likely normal?) among the loci in the sample, and remove the FP genes, where we can apply different filters:
    4.1. Read coverage filter (which is done currently): we can remove genes with very low or very high coverage;
    4.2. Length coverage filter: we could remove genes where <70% (a configurable CLI parameter) of the ML path/sequence has reads. During pandora discover, these reads are not removed, as they can encode denovo sequences; in map/compare they are.
    4.3. Other filters?
  5. We now have a personalised PanRG for the sample, with the ML sequences/paths for each PRG. We can then compute the global coverage;
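A sketch of what filter 4.2 could look like; the function name and the 0.70 example threshold are illustrative, with the threshold being the proposed CLI parameter:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of filter 4.2: keep a gene only if at least
// `min_fraction` of its ML path positions have read support.
bool passes_length_coverage_filter(const std::vector<uint32_t>& per_base_covg,
                                   double min_fraction)
{
    if (per_base_covg.empty()) {
        return false;
    }
    std::size_t covered = 0;
    for (const auto covg : per_base_covg) {
        if (covg > 0) {
            ++covered;
        }
    }
    return static_cast<double>(covered) / per_base_covg.size() >= min_fraction;
}
```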

There might be other issues as implementation goes on...

Planned timeline

The main decision and issue we will face is choosing how to compute the coverage of a PRG (item 3). Item 4 is not an issue; we will have CLI parameters for each filter, so we can choose which ones to apply.

I have to finish the index refactoring and lazy loading in pandora before starting this, so early February is possible, especially because this would benefit roundhound. I think we could implement items 3 and 4, and test both on the paper and drprg data (for item 4, we can activate each filter and see its effect).

Mostly I need to check with you @iqbal-lab and @mbhall88 whether this looks like a plan we should follow, or whether we should postpone or abandon it. I can't estimate how long this would take; I think I'd need to be one week into the implementation to give a good estimate.


mbhall88 commented Jan 8, 2023

I'm happy to cap coverage at 65k rather than require the user to subsample.

One other thing that neither of you mentioned is point 1 in my comment (the global covg multiplier). I guess it is fixable with a CLI option, but I wasn't sure whether any of the proposed changes above would remove the need for this setting?

Regarding your first bug about coverage @leoisl, I had that exact realisation last night. Of your proposed coverage calculation methods, I like 3.4 (post-filtering, obviously).

For option 4, maybe leave it as is for now and revisit it afterwards? Best not to change too many things at once.

I like the solution outline.

One question about the planned timeline, would you go for this or #316 first? Being selfish, #316 would be the higher priority for me, but maybe this issue would address that too?


leoisl commented Jan 9, 2023

One other thing that wasn't mentioned by either of you is point 1 in my comment (the global covg multiplier). I guess this is fixable with a CLI option, but wasn't sure if any of the above proposed changes would alter the need for this setting?

My vote would be to refactor these three conditions outright:
https://github.com/rmcolq/pandora/blob/48f63a5939c9378229699b62dc81a1cd8cf906c2/src/localPRG.cpp#L1725-L1743

into new conditions, all controlled by CLI parameters. Maybe have --minimum-gene-coverage and --minimum-fraction-gene-coverage parameters, and perhaps also a maximum to remove highly covered genes?
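As a sketch, the three conditions could collapse into one predicate driven by a parameter struct. The field names follow the flags suggested above; the defaults and the multiplier field are assumptions for illustration, not pandora's current values:

```cpp
#include <cassert>

// Illustrative sketch: CLI-controlled replacements for the hard-coded
// gene-removal conditions in localPRG.cpp.
struct GeneFilterParams {
    double min_gene_covg = 3.0;        // --minimum-gene-coverage (assumed default)
    double min_fraction_covg = 0.7;    // --minimum-fraction-gene-coverage (assumed)
    double max_covg_multiplier = 10.0; // replaces the magic 10x global-coverage cap
};

bool keep_gene(double mean_covg, double fraction_covered, double global_covg,
               const GeneFilterParams& params)
{
    return mean_covg >= params.min_gene_covg
        && fraction_covered >= params.min_fraction_covg
        && mean_covg <= params.max_covg_multiplier * global_covg;
}
```

With the thread's numbers: at the default 10x multiplier, a locus with mean coverage 43,000 against a global coverage of 223 is dropped; raising the multiplier (as the amplicon test did) keeps it.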

One question about the planned timeline, would you go for this or #316 first? Being selfish, #316 would be the higher priority for me, but maybe this issue would address that too?

I think these are different issues, so it makes sense to think about prioritisation. I don't have a preference; the priority for pandora now is to provide support for downstream tools, i.e. drprg and roundhound. I agree with whatever prioritisation you and @iqbal-lab define.


leoisl commented Jan 9, 2023

@mbhall88 Zam and I had our weekly meeting, and #316 is now second in priority, just after implementing the lazy loading feature.


mbhall88 commented Jan 9, 2023

Maybe have --minimum-gene-coverage and --minimum-fraction-gene-coverage parameters. Maybe also a maximum to remove highly covered genes?

Yeah that sounds wise.

Zam and I had our weekly meeting, and #316 is now second in priority, just after implementing lazy loading feature

Okay, cool! Let me know if you need any more samples to test on. I have 9 in total with the same problem as #316.

@iqbal-lab (Collaborator)

Sounds good. We'll probably combine this with simulated data, with perfect reads starting at every base, and see which do/don't map.
