Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Octopus hangs after calling variants #11

Closed
jbedo opened this issue Nov 8, 2016 · 25 comments
Closed

Octopus hangs after calling variants #11

jbedo opened this issue Nov 8, 2016 · 25 comments
Assignees
Labels

Comments

@jbedo
Copy link
Contributor

jbedo commented Nov 8, 2016

I ran octopus for calling somatic variants with a normal/tumour pair and the variant calling appears to have finished, outputting the following line to stdout:

[2016-11-08 17:11:24] <INFO>                           -             100%           7.85h               -

Unfortunately the process has not exited and the final output file (as specified with -o flag) is still empty. It has been 12 hours since that last output line was printed so I doubt it will ever terminate. I attached to the process with GDB and obtained the following trace:

#0  0x00007fddcbd64f2c in hts_itr_query () from /nix/store/88syq7cidvhmgd54drajabja2qhjvcdv-htslib-1.3.2/lib/libhts.so.1
#1  0x00007fddcbd7340a in _reader_seek () from /nix/store/88syq7cidvhmgd54drajabja2qhjvcdv-htslib-1.3.2/lib/libhts.so.1
#2  0x00007fddcbd77853 in _reader_next_line () from /nix/store/88syq7cidvhmgd54drajabja2qhjvcdv-htslib-1.3.2/lib/libhts.so.1
#3  0x0000000000526ef8 in octopus::HtslibBcfFacade::count_records(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
#4  0x00000000005487f5 in octopus::VcfReader::count_records(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
#5  0x000000000055f39b in octopus::get_contig_count_map(std::vector<octopus::VcfReader, std::allocator<octopus::VcfReader> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) ()
#6  0x0000000000560bc9 in octopus::merge(std::vector<octopus::VcfReader, std::allocator<octopus::VcfReader> > const&, octopus::VcfWriter&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) ()
#7  0x000000000074226f in octopus::(anonymous namespace)::run_octopus_multi_threaded(octopus::GenomeCallingComponents&) ()
#8  0x000000000073badb in octopus::run_octopus(octopus::GenomeCallingComponents&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#9  0x00000000004736b4 in main ()`

Peaking in the temp directory seems to indicate the BCF files haven't been touched since that last message was printed to stdout. I will leave the process live for the moment in case you want more debugging output from GDB. Meanwhile, can I merge the temporary BCF files to obtain my variants?

@dancooke
Copy link
Member

dancooke commented Nov 8, 2016

Hi Justin!

This looks like a bug in octopus' VCF merge function. Could you please provide your command line prompt that you used, and if possible the temporary directory containing all the BCF files? It should be possible for you to manually merge the BCF files assuming they are not corrupted.

Please be aware that the somatic caller is currently some way off the maturity of the germline caller. I've just uploaded a new release which contains numerous fixes and improvements which drastically improves somatic calling performance from the last release, but there is still some way to go.

@jbedo
Copy link
Contributor Author

jbedo commented Nov 8, 2016

Octopus was invoked with:
octopus -R /data/hg38/hg38.fa -I tmp/9be570942bd7cfae6ed5c8c3b44b73bccb8e1c08.Samtools.AddRGLine.bam tmp/0828b0a2897927c2b831c1361f5fa7ec0ea31476.Samtools.AddRGLine.bam -N normal --fast --output tmp/3f9d7c093adadc2aea0e997d8bf405f0e355b993.Octopus.CallSomatic.vcf --threads 8
Unfortunately I can't provide the BCF files, I realise this makes debugging difficult. I wonder if this is actually a HtsLib bug though, as it appears to be stuck in a HtsLib function.

Thanks for the update to to somatic calling! I'll try it out. In the meantime, I've had a look at the temp BCF files and found a lot of genotype calls referencing an alt allele in the germline (e.g., 2|0) that is not present in the ALT column of the VCF file (only 1 alternate present). This seems to only occur with variants tagged with SOMATIC; does 2 mean something different in this context or is this another bug? I'll open up another issue if it is a bug.

@dancooke
Copy link
Member

dancooke commented Nov 8, 2016

Agh, the odd VCF format you're seeing was due to a bug which has been fixed in the new release. So it looks like htslib is getting into a muddle with octopus' (bad) VCF. Would you be able to try with the new release to see if the problem is resolved?

@jbedo
Copy link
Contributor Author

jbedo commented Nov 8, 2016

Sure, I'll get back to you in 10 hours ;).

@dancooke
Copy link
Member

dancooke commented Nov 8, 2016

Cheers! Apologies about the slow runtime, I'll be focusing on improving that over the next few weeks.

@jbedo
Copy link
Contributor Author

jbedo commented Nov 9, 2016

Looks like it still hangs in htslib. Inspecting the BCF files the strange genotyping is now fixed, but the SOMATIC mutations are still a little odd: all of them are tri-allelic. I think it's unlikely the only somatic mutations have copy number 3.

How does htslib handle multi-allelic sites? Maybe that's still causing it to hang. I've linked against htslib 1.3.1, maybe I should try again with htslib 1.3.2. Or should I be using the master branch of htslib?

@dancooke
Copy link
Member

dancooke commented Nov 9, 2016

The triploid somatic genotypes are not intended to give copy number information per se, rather they are used to distinguish novel somatic haplotypes from the germline background (they are always ploidy + 1). The third column in the phased genotype is always the somatic allele. Octopus does infer allele frequencies (a somatic allele frequency credible interval is given in the SCR sample field), but it doesn't try to infer copy numbers from these, maybe it's something we'll look at in the future.

I've not had this issue with htslib, I am using 1.3.2 so it's possible that a bug in 1.3.1 is causing this. Does the problem persist if you call over a smaller target region (e.g. with just one somatic)?

@jbedo
Copy link
Contributor Author

jbedo commented Nov 10, 2016

Thanks for the clarification, that makes a lot more sense.

I tried calling over a single chromosome and it completed successfully.

@jbedo
Copy link
Contributor Author

jbedo commented Nov 10, 2016

Ok after some experiments I can get it to hang if I run octopus on the unknown contigs in hg38, so everything except chr1-22, chrX, and chrY. There doesn't seem to be anything unusal about the VCF output that I dug out of the octopus temporary files for these chromosomes, other than the _ in the chromosome name (e.g., chrUn_KI270438v1).

@jbedo
Copy link
Contributor Author

jbedo commented Nov 10, 2016

Also confirmed bug is still present with htslib 1.3.2.

@dancooke
Copy link
Member

dancooke commented Nov 9, 2017

I'm not 100% sure, but I think I resolved this at some point over the last year. Will leave open for now.

@jbedo
Copy link
Contributor Author

jbedo commented Nov 30, 2017

Looks like some variant of this bug is still around. Octopus seems to create an "unfiltered" file and then hang. The output file I specify remains empty. I'll try and check if it's caused by the contigs as it previously was a year ago.

@dancooke
Copy link
Member

dancooke commented Dec 1, 2017

This is odd. Octopus shouldn't have any issues with the underscores in the contig names. I was trying to reproduce this bug today and did come across a new issue (#17), but I don't think it is related to this problem. I haven't managed to reproduce this issue yet, have you checked if you get the same behaviour running on the same contigs with a single sample?

I also had a quick look at the htslib method your GDB stack trace indicates where the process gets stuck (starting line 579 here). Although I cannot claim to fully understand this code, that while loop looks highly suspect. Even if octopus is calling this method with bad input, I don't think an infinite loop is a reasonable outcome.

@jbedo
Copy link
Contributor Author

jbedo commented Dec 1, 2017

Removing contigs didn't help this time, octopus still freezes after producing the "unfiltered" file. I'm currently using a cluster to execute this so debugging with gdb is difficult. I'll run a single sample and see if it terminates.

As for the while loop, I don't think it's going into any infinite loops as the CPU usage goes to 0%, suggesting some sort of deadlock.

@jbedo
Copy link
Contributor Author

jbedo commented Dec 5, 2017

Looks like I have the same problem with a single sample: the unfiltered file is produced then octopus hangs and the output file remains empty.

@dancooke
Copy link
Member

dancooke commented Dec 5, 2017

Very strange. The temporary file merging occurs in the main thread, once all other worker threads have finished, so I'm struggling to see how this can be a deadlock issue. Could you please confirm which OS, compiler, standard library, Boost, and htslib versions you're using?

@jbedo
Copy link
Contributor Author

jbedo commented Dec 5, 2017

I'm on linux, here are the dependencies being used:

glibc-2.26-75
zlib-1.2.11
bzip2-1.0.6.0.1
xz-5.2.3
openssl-1.0.2m
nghttp2-1.24.0-lib
libssh2-1.8.0
curl-7.56.1
htslib-1.6
bash-4.4-p12
gcc-6.4.0-lib
icu4c-59.1
boost-1.65.1

I compiled octopus with clang 3.9.1.

@jbedo
Copy link
Contributor Author

jbedo commented Dec 10, 2017

I reproduced the problem on my laptop, here's a stack trace:

#0 0x00007fd79ced90d9 in pthread_cond_destroy@@GLIBC_2.3.2 () from /nix/store/dps6gpjd9vmjylqgjhdbw6kyxfbfssn7-glibc-2.26-75/lib/libpthread.so.0 #1 0x0000000000818b03 in octopus::(anonymous namespace)::run_octopus_multi_threaded(octopus::GenomeCallingComponents&) () #2 0x000000000081155f in octopus::run_calling(octopus::GenomeCallingComponents&) () #3 0x000000000081cba6 in octopus::run_octopus(octopus::GenomeCallingComponents&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () #4 0x0000000000477d74 in main ()

HTH.

@dancooke
Copy link
Member

That is useful, thanks! Oddly, this stack trace points somewhere quite different to your previous one. From this Stack Overflow question, it looks some thread is waiting on one of the std::condition_variables that terminates before calling notify, then when the condition variable destructor is called (which just calls pthread_cond_destroy), we get undefined behaviour (hanging in this case). I'll try to track down where this could be happening!

@dancooke
Copy link
Member

dancooke commented Dec 12, 2017

Could you please try checking out the branch fix/hang-after-calling and running your example with the --debug option? This will produce a debug log file which could be quite large (~10GB). If the changes I've made haven't fixed the issue, it would be helpful if you could post the last few hundred lines of the debug log after it gets stuck. Cheers.

@dancooke
Copy link
Member

Just some other thoughts. I wonder if there is some issue with using Clang with libstdc++ which is really intended for GCC? If the new branch doesn't fix the issue, would it be possible for you to try building with GCC instead of Clang?

@jbedo
Copy link
Contributor Author

jbedo commented Dec 13, 2017

It doesn't compile for me with GCC for some reason, but I don't think the libstdc++ from gcc will cause any issues (especially a pthread issue). In any case, the branch partially worked! My first octopus run ran to completion, but the second sample crashed. I'll restart and see if it's a repeatable thing.

@jbedo
Copy link
Contributor Author

jbedo commented Dec 13, 2017

OK it doesn't hang anymore but now it just crashes in the call set refinement stage. No unfiltered files are being produced; does that mean octopus is getting further than before or not? Doesn't seem to be anything useful in the debug log:

[2017-12-13 07:40:45] 0 failed the IsNotOctopusDuplicate filter
[2017-12-13 07:40:45] 0 failed the HasValidBaseQualities filter
[2017-12-13 07:40:45] 0 failed the HasWellFormedCigar filter
[2017-12-13 07:40:45] 0 failed the IsMapped filter
[2017-12-13 07:40:45] 2 failed the IsProperTemplate filter
[2017-12-13 07:40:45] 0 failed the IsNotMarkedQcFail filter
[2017-12-13 07:40:45] 0 failed the IsNotMarkedDuplicate filter
[2017-12-13 07:40:45] There are 201 reads in chr1:5687474-5695533 after filtering
[2017-12-13 07:40:45] - 100% 31s -
[2017-12-13 07:40:45] Encountered an error whilst filtering, attempting to cleanup
[2017-12-13 07:40:45] Removed 912 temporary files
[2017-12-13 07:40:46] A program error has occurred:
[2017-12-13 07:40:46]
[2017-12-13 07:40:46] Encountered an unknown error during calling. This means there is a
[2017-12-13 07:40:46] bug and your results are untrustworthy.
[2017-12-13 07:40:46]
[2017-12-13 07:40:46] To help resolve this error run in debug mode and send the log file to
[2017-12-13 07:40:46] https://github.com/luntergroup/octopus/issues.
[2017-12-13 07:40:46] ------------------------------------------------------------------------

@dancooke
Copy link
Member

Ok great. The crash in the variant filtering is certainly an independent bug as this only starts once the calling has finished, and filtering isn't threaded yet. Unfortunately the 'cleanup' after the exception in the filtering removes the unfiltered calls - I'll probably change that behaviour. Would you mind opening a separate issue for the filtering crash? Probably if you just call the region chr1:5687474-5695533 you'll hit the same bug - in which case it would be very helpful if you could provide a BAM subset for this region with the issue. I don't envision it will be too difficult to resolve this one! In the meantime, you can just avoid filtering completely by specifying -f off. Many thanks for your help in resolving this issue!

@dancooke
Copy link
Member

Closing this issue as I believe it is fixed in v0.3.2-alpha (c27686f).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants