Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Population calling sample limit #41

Closed
Thatguy027 opened this issue Dec 7, 2018 · 9 comments
Closed

Population calling sample limit #41

Thatguy027 opened this issue Dec 7, 2018 · 9 comments
Assignees

Comments

@Thatguy027
Copy link

Hello,

I just came across your biorxiv preprint, it was very easy to install via conda, and I love the extensive documentation!

I am wondering if there is a limit to the population size when calling variants in population mode. I currently have 330 samples (100Mb genome, coverage per sample ranges from 20X-1000X, though I usually downsample high coverage samples to 100X), but this number will increase over time.

Thanks

@dancooke dancooke self-assigned this Dec 7, 2018
@dancooke
Copy link
Member

dancooke commented Dec 7, 2018

Hi!

There is no strict limit, but you will likely experience runtime performance issues joint calling that many samples, unless your samples are haploid.

I've not done extensive testing of this use-case, but I would recommend adopting a strategy of joint calling small batches of your samples (maybe 5-10 samples at a time), and then joint calling all the samples using only the variants found in the batches. This approach is discussed here.

Depending on your preference for runtime vs accuracy, you may want to adjust the max-joint-genotype option (default 1,000,000). Discussed here.

Let me know if you run into trouble.

Dan

@Thatguy027
Copy link
Author

Oh, I totally missed the wiki pages! I've been looking at the manual.

As it turns out, my samples are "pseudo" haploids. By pseudo I mean they are diploids that primarily reproduce through selfing and their genomes are largely homozygous. That being said, I typically use variant callers in haploid mode when available.

I'll check out the batch iteration calling approach you linked.

Before you responded, I tried a small test set using this command:

octopus -R WS256.fa --reads-file samples.txt -P 1 --filter-expression "AF < 0.5 | DP < 10 | QUAL < 30 | MQ < 40 | QD < 2 | AD < 0.5" -o test_joint.vcf --threads 40 --very-fast --debug

A quick note on the --filter-expression flag, I didn't notice anywhere in the manual that quotes "seem" to be required for the filter expression. I kept getting this error when attempting without quotes: -bash: 10: No such file or directory . Needless to say, I eventually realized quotes were necessary.

samples.txt contains the location of seven relatively low coverage bam files. I added the --debug flag after encountering an error. Here is the error:

[2018-12-07 14:48:11] <INFO>     III:1752466            32.0%           8m 43s           18m 29s
[2018-12-07 14:50:16] <INFO>               -             100%          10m 48s                 -
[2018-12-07 14:50:16] <INFO> Removed 16 temporary files
[2018-12-07 14:50:17] <EROR> A program error has occurred:
[2018-12-07 14:50:17] <EROR> 
[2018-12-07 14:50:17] <EROR>     Encountered an unknown error during calling. This means there is a
[2018-12-07 14:50:17] <EROR>     bug and your results are untrustworthy.
[2018-12-07 14:50:17] <EROR> 
[2018-12-07 14:50:17] <EROR> To help resolve this error run in debug mode and send the log file to
[2018-12-07 14:50:17] <EROR> https://github.com/luntergroup/octopus/issues.
[2018-12-07 14:50:17] <INFO> ------------------------------------------------------------------------

The debug log is 2Gb and I am wondering if there are any specific aspects of it that would be most useful to send your way. The tail end of the log was similar to the error message:

[2018-12-07 14:48:10] <DEBG> Measuring block III:1868270-1868278 containing 4 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868316-1868376 containing 3 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868425-1868427 containing 1 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868533-1868565 containing 4 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868590-1868597 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868647-1868648 containing 1 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868682-1868687 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868722-1868740 containing 3 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868755-1868764 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869023-1869048 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868810-1868844 containing 4 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1868864-1868898 containing 3 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869074-1869101 containing 4 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869129-1869130 containing 1 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869246-1869247 containing 1 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869176-1869233 containing 4 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869363-1869384 containing 3 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869440-1869441 containing 1 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869405-1869411 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869468-1869505 containing 3 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869619-1869620 containing 1 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869563-1869588 containing 4 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869670-1869684 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869712-1869719 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869739-1869740 containing 1 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869743-1869748 containing 5 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869749-1869782 containing 3 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869817-1869864 containing 4 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869923-1869929 containing 2 calls
[2018-12-07 14:48:10] <DEBG> Measuring block III:1869961-1869962 containing 1 calls
[2018-12-07 14:48:11] <INFO>     III:1752466            32.0%           8m 43s           18m 29s
[2018-12-07 14:50:16] <INFO>               -             100%          10m 48s                 -
[2018-12-07 14:50:16] <DEBG> Encountered an error whilst filtering, attempting to cleanup
[2018-12-07 14:50:16] <INFO> Removed 16 temporary files
[2018-12-07 14:50:17] <EROR> A program error has occurred:
[2018-12-07 14:50:17] <EROR> 
[2018-12-07 14:50:17] <EROR>     Encountered an unknown error during calling. This means there is a
[2018-12-07 14:50:17] <EROR>     bug and your results are untrustworthy.
[2018-12-07 14:50:17] <EROR> 
[2018-12-07 14:50:17] <EROR> To help resolve this error run in debug mode and send the log file to
[2018-12-07 14:50:17] <EROR> https://github.com/luntergroup/octopus/issues.
[2018-12-07 14:50:17] <INFO> ------------------------------------------------------------------------

Here is the output when I grab the region where the run failed, if that's useful at all:

 > cat octopus_debug.log | grep 1752466
[2018-12-07 14:15:29] <DEBG> Haplotype region is I:1752102-1752466
    III:1752466-1752466  -> A
    III:1752466-1752466  -> A
[2018-12-07 14:21:56] <DEBG> Active region is III:1752466-1752466
[2018-12-07 14:21:56] <DEBG> There are 155 active reads in III:1752466-1752466
    III:1752466-1752466  -> A
    < {III:1752466-1752466 A} > 1
[2018-12-07 14:21:56] <DEBG> Calling variants in region III:1752466-1752466
    [< {III:1752466-1752466 A} >(1)] 1.10957e-32
    [< {III:1752466-1752466 A} >(1)] 1
    [< {III:1752466-1752466 A} >(1)] 1
    [< {III:1752466-1752466 A} >(1)] 6.85601e-29
    [< {III:1752466-1752466 A} >(1)] 7.26886e-44
    [< {III:1752466-1752466 A} >(1)] 1.02189e-38
    [< {III:1752466-1752466 A} >(1)] 9.0466e-24
    III:1752466-1752466  -> A 1 1 0 0 0 0 0 
[2018-12-07 14:32:55] <DEBG> Haplotype region is V:17524308-17524667
[2018-12-07 14:32:55] <DEBG> Haplotype region is V:17524668-17525024
[2018-12-07 14:32:55] <DEBG> Haplotype region is V:17524669-17525027
[2018-12-07 14:48:10] <DEBG> Measuring block III:1752465-1752466 containing 1 calls
[2018-12-07 14:48:11] <INFO>     III:1752466            32.0%           8m 43s           18m 29s

@dancooke
Copy link
Member

dancooke commented Dec 8, 2018

Thanks for the debug report. It looks like the problem occurs at III:1869961-1869962 rather than III:1752466. If you're able to send me the BAM files, I can take a look myself, otherwise could you try just calling around that region (just add -T III:1869900-1870000 to your command) and see if you get the same error? If you also add the command --keep-unfiltered-calls then we might be able to see which call is failing.

@Thatguy027
Copy link
Author

I tried the region you specified and did not run into any issues:

octopus -R WS256.fa --reads-file samples.txt -P 1 --filter-expression "AF < 0.5 | DP < 10 | QUAL < 30 | MQ < 40 | QD < 2 | AD < 0.5" -o test_joint.vcf --threads 40 --very-fast --debug -T III:1869900-1870000 --keep-unfiltered-calls

I ended up extending the region to capture the issue:

octopus -R WS256.fa --reads-file samples.txt -P 1 --filter-expression "AF < 0.5 | DP < 10 | QUAL < 30 | MQ < 40 | QD < 2 | AD < 0.5" -o test_joint.vcf --threads 40 --very-fast --debug -T III:1709900-1970000 --keep-unfiltered-calls

And got the same error:

[2018-12-09 12:14:09] <INFO> Starting Call Set Refinement (CSR) filtering
[2018-12-09 12:14:09] <INFO> ------------------------------------------------------------------------
[2018-12-09 12:14:09] <INFO>      current      |                   |     time      |     estimated   
[2018-12-09 12:14:09] <INFO>      position     |     completed     |     taken     |     ttc         
[2018-12-09 12:14:09] <INFO> ------------------------------------------------------------------------
[2018-12-09 12:16:27] <INFO> Removed 3 temporary files
[2018-12-09 12:16:27] <EROR> A program error has occurred:
[2018-12-09 12:16:27] <EROR> 
[2018-12-09 12:16:27] <EROR>     Encountered an unknown error during calling. This means there is a
[2018-12-09 12:16:27] <EROR>     bug and your results are untrustworthy.
[2018-12-09 12:16:27] <EROR> 
[2018-12-09 12:16:27] <EROR> To help resolve this error run in debug mode and send the log file to
[2018-12-09 12:16:27] <EROR> https://github.com/luntergroup/octopus/issues.
[2018-12-09 12:16:27] <INFO> ------------------------------------------------------------------------

Here is the tail end of the log file:

[2018-12-09 12:14:09] <INFO> ------------------------------------------------------------------------
[2018-12-09 12:14:09] <INFO>      current      |                   |     time      |     estimated   
[2018-12-09 12:14:09] <INFO>      position     |     completed     |     taken     |     ttc         
[2018-12-09 12:14:09] <INFO> ------------------------------------------------------------------------
[2018-12-09 12:14:20] <DEBG> Fetched 1994155 unfiltered reads from III:1709833-1970066
[2018-12-09 12:14:20] <DEBG> In sample XZ2020
[2018-12-09 12:14:20] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:14:20] <DEBG> 50 failed the HasWellFormedCigar filter
[2018-12-09 12:14:20] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:14:20] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:14:20] <DEBG> In sample QG2857
[2018-12-09 12:14:20] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:14:20] <DEBG> 16333 failed the HasWellFormedCigar filter
[2018-12-09 12:14:20] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:14:20] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:14:20] <DEBG> In sample NIC526
[2018-12-09 12:14:20] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:14:20] <DEBG> 132 failed the HasWellFormedCigar filter
[2018-12-09 12:14:20] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:14:20] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:14:20] <DEBG> In sample XZ1672
[2018-12-09 12:14:20] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:14:20] <DEBG> 86 failed the HasWellFormedCigar filter
[2018-12-09 12:14:20] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:14:20] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:14:20] <DEBG> In sample JU3280
[2018-12-09 12:14:20] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:14:20] <DEBG> 16277 failed the HasWellFormedCigar filter
[2018-12-09 12:14:20] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:14:20] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:14:20] <DEBG> In sample JU3226
[2018-12-09 12:14:20] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:14:20] <DEBG> 74 failed the HasWellFormedCigar filter
[2018-12-09 12:14:20] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:14:20] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:14:20] <DEBG> In sample ECA744
[2018-12-09 12:14:20] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:14:20] <DEBG> 17262 failed the HasWellFormedCigar filter
[2018-12-09 12:14:20] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:14:20] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:14:20] <DEBG> There are 1943941 reads in III:1709833-1970066 after filtering
[2018-12-09 12:16:27] <DEBG> Encountered an error whilst filtering, attempting to cleanup
[2018-12-09 12:16:27] <INFO> Removed 3 temporary files
[2018-12-09 12:16:27] <EROR> A program error has occurred:
[2018-12-09 12:16:27] <EROR> 
[2018-12-09 12:16:27] <EROR>     Encountered an unknown error during calling. This means there is a
[2018-12-09 12:16:27] <EROR>     bug and your results are untrustworthy.
[2018-12-09 12:16:27] <EROR> 
[2018-12-09 12:16:27] <EROR> To help resolve this error run in debug mode and send the log file to
[2018-12-09 12:16:27] <EROR> https://github.com/luntergroup/octopus/issues.
[2018-12-09 12:16:27] <INFO> ------------------------------------------------------------------------

The bam files can be found here. The samples I am using for this test run are XZ2020.bam, XZ1672.bam, QG2857.bam, NIC526.bam, JU3280.bam, JU3226.bam, and ECA744.bam.

I figured it might be one BAM messing things up so I iterated the above command (with the expanded region) while dropping BAM files until the run completed (e.g. run with 7 samples -> error, drop 1 bam, run with 6 samples -> error, drop 1 bam, run with 5 sample...). The run finished without an error when I removed QG2857.bam. But the run also finished without an error when I ran QG2857.bam alone.

Then I started adding samples back. Eventually, I figured out that the minimal number of samples required to generate the error is 3 and the samples are: QG2857, JU3226, and JU3280.

Here is the tail end of the log from that minimal sample set run:

[2018-12-09 12:54:30] <INFO> ------------------------------------------------------------------------
[2018-12-09 12:54:30] <INFO>      current      |                   |     time      |     estimated   
[2018-12-09 12:54:30] <INFO>      position     |     completed     |     taken     |     ttc         
[2018-12-09 12:54:30] <INFO> ------------------------------------------------------------------------
[2018-12-09 12:54:36] <DEBG> Fetched 1191448 unfiltered reads from III:1709833-1970066
[2018-12-09 12:54:37] <DEBG> In sample QG2857
[2018-12-09 12:54:37] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:54:37] <DEBG> 16333 failed the HasWellFormedCigar filter
[2018-12-09 12:54:37] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:54:37] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:54:37] <DEBG> In sample JU3280
[2018-12-09 12:54:37] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:54:37] <DEBG> 16277 failed the HasWellFormedCigar filter
[2018-12-09 12:54:37] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:54:37] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:54:37] <DEBG> In sample JU3226
[2018-12-09 12:54:37] <DEBG> 0 failed the IsNotMarkedQcFail filter
[2018-12-09 12:54:37] <DEBG> 74 failed the HasWellFormedCigar filter
[2018-12-09 12:54:37] <DEBG> 0 failed the IsMapped filter
[2018-12-09 12:54:37] <DEBG> 0 failed the HasValidBaseQualities filter
[2018-12-09 12:54:37] <DEBG> There are 1158764 reads in III:1709833-1970066 after filtering
[2018-12-09 12:56:00] <DEBG> Encountered an error whilst filtering, attempting to cleanup
[2018-12-09 12:56:00] <INFO> Removed 3 temporary files
[2018-12-09 12:56:00] <EROR> A program error has occurred:
[2018-12-09 12:56:00] <EROR> 
[2018-12-09 12:56:00] <EROR>     Encountered an unknown error during calling. This means there is a
[2018-12-09 12:56:00] <EROR>     bug and your results are untrustworthy.
[2018-12-09 12:56:00] <EROR> 
[2018-12-09 12:56:00] <EROR> To help resolve this error run in debug mode and send the log file to
[2018-12-09 12:56:00] <EROR> https://github.com/luntergroup/octopus/issues.
[2018-12-09 12:56:00] <INFO> ------------------------------------------------------------------------

Please let me know if you need any other information.

@dancooke
Copy link
Member

Many thanks for the information. I've downloaded the BAMs but can't seem to find the reference genome WS256.fa. Could you point me to where I can download this?

@Thatguy027
Copy link
Author

Thatguy027 commented Dec 11, 2018

hm, embedding the link isn't working, maybe because it is an ftp? anyway the link to all the genomes, including WS256 is here:

ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/sequence/genomic/

@dancooke
Copy link
Member

Thanks. I've found the problem and have pushed a fix to the develop branch. I'll add to next release.

@dancooke dancooke added the bug label Dec 14, 2018
@dancooke
Copy link
Member

dancooke commented Jan 2, 2019

This should be resolved in v0.5.3-beta. Please re-open if not.

@dancooke dancooke closed this as completed Jan 2, 2019
@Thatguy027
Copy link
Author

I re-ran the samples that caused the issue mentioned in this thread with 0.5.3, and did not run into an issue. Now I'll start looking into a the large-cohort approach you suggested above. thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants