Population calling sample limit #41
Comments
Hi! There is no strict limit, but you will likely run into runtime performance issues when joint calling that many samples, unless your samples are haploid. I've not done extensive testing of this use case, but I would recommend joint calling small batches of your samples (maybe 5-10 samples at a time), and then joint calling all the samples using only the variants found in the batches. This approach is discussed here. Depending on your preference for runtime vs accuracy, you may want to adjust the Let me know if you run into trouble. Dan
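As a rough illustration of the batching step only, here is a minimal Python sketch that splits a sample list into batches of at most 10. The sample names and the `make_batches` helper are hypothetical, and the sketch deliberately does not build any caller command lines; the options for each joint-calling run should be taken from the wiki page mentioned above.

```python
from typing import List, Sequence


def make_batches(samples: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split samples into consecutive batches of at most batch_size (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(samples[i:i + batch_size]) for i in range(0, len(samples), batch_size)]


# Hypothetical sample names standing in for the 330 BAMs mentioned in this thread.
samples = [f"sample_{i:03d}" for i in range(330)]
batches = make_batches(samples, batch_size=10)

# Each batch would be joint called on its own; the variants found across all
# batches would then be used for a final joint call over every sample.
print(len(batches), "batches; last batch has", len(batches[-1]), "samples")
```

With 330 samples and a batch size of 10 this gives 33 batches; a smaller batch size trades more runs for less work per run.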
Oh, I totally missed the wiki pages! I've been looking at the manual. As it turns out, my samples are "pseudo" haploids. By pseudo I mean they are diploids that primarily reproduce through selfing and their genomes are largely homozygous. That being said, I typically use variant callers in haploid mode when available. I'll check out the batch iteration calling approach you linked. Before you responded, I tried a small test set using this command:
A quick note on the
The debug log is 2 GB, and I am wondering whether there are any specific parts of it that would be most useful to send your way. The tail end of the log was similar to the error message:
Here is the output when I grab the region where the run failed, if that's useful at all:
Thanks for the debug report. It looks like the problem occurs at
I tried the region you specified and did not run into any issues:
I ended up extending the region to capture the issue:
And got the same error:
Here is the tail end of the log file:
The BAM files can be found here. The samples I am using for this test run are XZ2020.bam, XZ1672.bam, QG2857.bam, NIC526.bam, JU3280.bam, JU3226.bam, and ECA744.bam. I figured it might be one BAM messing things up, so I repeated the above command (with the expanded region) while dropping BAM files until the run completed (e.g. run with 7 samples -> error, drop 1 BAM, run with 6 samples -> error, drop 1 BAM, run with 5 samples ...). The run finished without an error when I removed QG2857.bam. But the run also finished without an error when I ran QG2857.bam alone. Then I started adding samples back. Eventually, I figured out that the minimal number of samples required to generate the error is 3, and the samples are QG2857, JU3226, and JU3280. Here is the tail end of the log from that minimal sample set run:
Please let me know if you need any other information.
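For anyone repeating this kind of search, the drop-one-BAM procedure described above can be sketched in Python as follows. This is only an illustration: `minimal_failing_subset` and `simulated_fails` are hypothetical names, and the predicate simulates the behaviour reported in this thread (failure only when QG2857, JU3226, and JU3280 are all present) instead of actually running the caller.

```python
from typing import Callable, List, Sequence


def minimal_failing_subset(
    samples: Sequence[str], fails: Callable[[List[str]], bool]
) -> List[str]:
    """Greedily drop samples one at a time while the run still fails."""
    current = list(samples)
    if not fails(current):
        raise ValueError("the full sample set does not fail")
    for sample in list(samples):
        trial = [s for s in current if s != sample]
        # Keep the sample out only if the failure persists without it.
        if trial and fails(trial):
            current = trial
    return current


def simulated_fails(subset: List[str]) -> bool:
    # Assumption: stands in for "run the caller on these BAMs and check for the error".
    return {"QG2857", "JU3226", "JU3280"} <= set(subset)


all_samples = ["XZ2020", "XZ1672", "QG2857", "NIC526", "JU3280", "JU3226", "ECA744"]
culprits = minimal_failing_subset(all_samples, simulated_fails)
print(culprits)
```

A single pass like this needs one run per sample. It is only guaranteed to reach a minimal set when adding samples never makes a failing run succeed; otherwise a repeat pass over the result may be needed.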
Many thanks for the information. I've downloaded the BAMs but can't seem to find the reference genome
Hm, embedding the link isn't working, maybe because it is an FTP link? Anyway, the link to all the genomes, including WS256, is here: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/sequence/genomic/
Thanks. I've found the problem and have pushed a fix to the develop branch. I'll add it to the next release.
This should be resolved in v0.5.3-beta. Please re-open if not.
I re-ran the samples that caused the issue mentioned in this thread with 0.5.3, and did not run into an issue. Now I'll start looking into the large-cohort approach you suggested above. Thanks for your help!
Hello,
I just came across your bioRxiv preprint. The tool was very easy to install via conda, and I love the extensive documentation!
I am wondering if there is a limit to the population size when calling variants in population mode. I currently have 330 samples (100 Mb genome, coverage per sample ranges from 20X to 1000X, though I usually downsample high-coverage samples to 100X), but this number will increase over time.
Thanks