Node failures cause jiva replica on non-impacted nodes to crash #1612

ksatchit opened this Issue Jun 13, 2018 · 2 comments



ksatchit commented Jun 13, 2018



What happened:

  • Node failures on a Kubernetes cluster running OpenEBS volumes are seen to cause restarts (crashes) of jiva replicas on other, i.e., non-impacted, nodes. The panic received is provided below:
```
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7ffe001b6ab0 pc=0x7ffe001b6ab0]

runtime stack:
runtime.throw(0xbba4c1, 0x2a)
       /usr/local/go/src/runtime/panic.go:605 +0x95
       /usr/local/go/src/runtime/signal_unix.go:351 +0x2b8
runtime.nanotime(0x11c2b50, 0xdf8475800, 0x1, 0x8, 0x141bea0d10, 0xbe5c866cb14, 0x45d964b800, 0x0, 0x0, 0xbdb42b79887, ...)
       /usr/local/go/src/runtime/sys_linux_amd64.s:179 +0x26
       /usr/local/go/src/runtime/proc.go:3894 +0x1d2
       /usr/local/go/src/runtime/proc.go:1182 +0x11e
       /usr/local/go/src/runtime/proc.go:1152 +0x64
```
  • Typically (in most cases, using the recommended 3-replica storage classes) this leads to quorum violation: the replica reschedule triggered by the node failure is accompanied by the crash described above, leaving the volume presented read-only (RO) to the application pod.
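To confirm that a restarted replica actually crashed with this panic (rather than being evicted or rescheduled cleanly), one can grep the previous container's log for the signature. The pod name and the log capture below are hypothetical; here the log content is stubbed with the panic lines from above:

```shell
# Hypothetical: capture the crashed container's log, e.g.
#   kubectl logs <jiva-replica-pod> --previous > replica.log
# Stand in a sample log carrying the panic signature shown above.
cat > replica.log <<'EOF'
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7ffe001b6ab0 pc=0x7ffe001b6ab0]
EOF

# A crash of this kind shows both the fatal-error line and the signal line.
grep -cE 'fatal error|SIGSEGV' replica.log
```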

What you expected to happen:

  • Node failures impacting a single replica should not cause RO mounts on the application node; i.e., there should be no I/O impact.

How to reproduce it (as minimally and precisely as possible):

  • This issue is seen to occur on a Vagrant-based Kubernetes cluster (v1.8.8) built using kubeadm. The kube minions/hosts are Ubuntu Xenial Vagrant boxes with 2 vCPUs each and ~4G RAM.

  • A percona application deployment with a liveness probe comprising DB writes at a low delay interval (~1-10s) was helpful in reproducing this issue.

  • The node failure was effected using the `vagrant halt -f` command.
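For reference, a liveness probe of the kind described above might look like the following sketch. The image, credentials, table name, and exact command are assumptions for illustration, not the configuration actually used in the reproduction:

```yaml
# Hypothetical percona liveness probe: a DB write at a short interval,
# so a volume turning read-only causes the probe (and pod) to fail.
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # assumed credentials/table; replace with the deployment's own
    - mysql -uroot -p$MYSQL_ROOT_PASSWORD -e "REPLACE INTO test.liveness VALUES (1, NOW());"
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
```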

Anything else we need to know?:

  • This issue is seen on 0.6.0-RC1/2. It is not seen in 0.5.4 (where there are known cases of I/O errors, i.e., device RO)

  • A replica pod log with a backtrace from one of the reproduction attempts is provided in the comments.



ksatchit commented Jun 13, 2018



kmova commented Dec 7, 2018

Have not observed this issue in 0.7 and 0.8 testing, after jiva was refactored around the usage of locks and switched to a fixed ubuntu image instead of pulling the latest ubuntu image. Will re-open this issue if we hit it again.
