Determine the optimum w and k sizes #6
Comments
For this analysis, I took 183 isolates from the head-to-head dataset that have a minimum of 3x coverage for both Illumina and Nanopore. I then run w
k
I then aggregate the number of false positive (FP) and false negative (FN) resistance calls across all drugs and plot them. In the above plot, the dashed lines with triangle points are FP counts and the solid lines with circle points are FN counts. The colours indicate the window size, with kmer size on the x-axis. Something weird seems to have happened in both technologies at k=17 w=14, as the FPs suddenly skyrocket. I will look into what caused this because, based on the trends, this looks like it might have been the best combination for both technologies. As it stands, k=17 and w=11 looks to be the best.
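The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the record layout, the `aggregate_errors` helper, and every number in `records` are hypothetical and only show the shape of the computation.

```python
from collections import defaultdict

# Hypothetical per-drug records:
# (technology, w, k, drug, false_positives, false_negatives).
# All values are made up for illustration only.
records = [
    ("illumina", 11, 17, "rifampicin", 1, 2),
    ("illumina", 11, 17, "isoniazid", 0, 3),
    ("illumina", 14, 17, "rifampicin", 9, 1),
    ("nanopore", 11, 17, "rifampicin", 2, 2),
]


def aggregate_errors(rows):
    """Sum FP and FN counts across all drugs for each (technology, w, k)."""
    totals = defaultdict(lambda: {"FP": 0, "FN": 0})
    for tech, w, k, _drug, fp, fn in rows:
        totals[(tech, w, k)]["FP"] += fp
        totals[(tech, w, k)]["FN"] += fn
    return dict(totals)


totals = aggregate_errors(records)
for (tech, w, k), counts in sorted(totals.items()):
    print(f"{tech} w={w} k={k} FP={counts['FP']} FN={counts['FN']}")
```

Keying on the technology as well as (w, k) keeps the Illumina and Nanopore totals separate, so each can be plotted as its own panel.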
The jump in FNs is effectively all in Rifampicin, and specifically in the mutation Example VCF entries (I have removed excess sample
sample
sample
For @lachlancoin's understanding, what you want to focus on here for the coverage is the

I don't really get why this particular window size (14) at that particular kmer size (17) would cause this. But as this allele is inside the RRDR, there could be something funky happening in the graph surrounding this position. I'm going to add in w=13 and w=15 to get a better idea of exactly where this breaks down. Any other ideas?
Waiting on resolution of iqbal-lab-org/pandora#293, as a few samples failed for that reason.
How is the graph built? Are you starting with a list of variants/VCF and implicitly allowing loads of recombination, or starting with an MSA of known haplotypes and using make_prg? And does drprg do de novo via de Bruijn or racon, or neither?
I started with the CRyPTIC VCF, subsampling 50 isolates from each lineage. I then apply the haplotype for each isolate to H37Rv, build an MSA from those sequences, and run make_prg. I then run de novo (dBG) on that graph, update the graph with make_prg, and hit the error when genotyping with pandora map. I only get it on 4 w-k combinations and, interestingly, all are when w=k. I don't hit this for any other w-k combination. In total, with the 181 isolates across the two sequencing modalities, there are ~7-8k jobs in the pipeline and only 4 hit this error. I'll skip them for now, but it seems important to understand what is causing this.
After the work to date on #11, these results have changed quite noticeably, mainly for FPs. After ranking the F1 scores for all of these combinations (and eyeballing), it seems the original default of w=14 and k=15 gives the best results. I'll revert to that.
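A ranking of w-k combinations by F1 score could look roughly like the sketch below. The `f1_score` and `rank_by_f1` helpers and all counts in `counts` are hypothetical placeholders, not the real results from this analysis; F1 is computed from true positives, false positives, and false negatives as 2TP / (2TP + FP + FN).

```python
def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def rank_by_f1(counts):
    """Rank (w, k) combinations from best to worst F1.

    `counts` maps (w, k) -> (tp, fp, fn).
    """
    scored = [(wk, f1_score(*c)) for wk, c in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical (w, k) -> (TP, FP, FN) counts, for illustration only.
counts = {
    (14, 15): (90, 5, 10),
    (11, 17): (88, 9, 12),
    (14, 17): (91, 40, 9),
}

ranking = rank_by_f1(counts)
for (w, k), score in ranking:
    print(f"w={w} k={k} F1={score:.3f}")
```

F1 weights FPs and FNs equally, so a combination with a spike in FPs drops down the ranking even if its FN count is slightly lower.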
After all the recent work I thought I would revisit this. Looking at the raw values, I thought I'd try out w=11 and k=15 (the current default is w=14 and k=15), and I get the following diff for Illumina (no change for Nanopore)
So I think I'll stick with w=11 and k=15.
Do a sweep through pandora window and kmer sizes and see which result in the best drprg performance for Illumina and Nanopore.
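As a rough illustration of what such a sweep involves, the sketch below enumerates one job per combination of window size, kmer size, sample, and technology. The `sweep_jobs` helper, the sample names, and the specific w and k values are assumptions for the example, not the grid actually used here.

```python
from itertools import product


def sweep_jobs(window_sizes, kmer_sizes, samples, technologies):
    """Return one job description per (w, k, sample, technology)."""
    return [
        {"w": w, "k": k, "sample": sample, "tech": tech}
        for w, k, sample, tech in product(
            window_sizes, kmer_sizes, samples, technologies
        )
    ]


# Hypothetical grid and sample names, for illustration only.
jobs = sweep_jobs(
    window_sizes=[11, 14],
    kmer_sizes=[15, 17],
    samples=["sample_a", "sample_b"],
    technologies=["illumina", "nanopore"],
)
print(len(jobs))  # 2 * 2 * 2 * 2 = 16 jobs
```

The job count grows multiplicatively with each axis, which is why a modest grid over a few hundred sample and technology pairs quickly reaches thousands of jobs.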