Node failures cause jiva replica on non-impacted nodes to crash #1612

Closed
ksatchit opened this Issue Jun 13, 2018 · 2 comments

ksatchit commented Jun 13, 2018

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

What happened:

  • Node failures on a Kubernetes cluster running OpenEBS volumes are seen to cause restarts (crashes) of jiva replicas on other, i.e., non-impacted, nodes. The panic received is provided below:
 fatal error: unexpected signal during runtime execution
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x7ffe001b6ab0 pc=0x7ffe001b6ab0]

 runtime stack:
 runtime.throw(0xbba4c1, 0x2a)
        /usr/local/go/src/runtime/panic.go:605 +0x95
 runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:351 +0x2b8
 runtime.nanotime(0x11c2b50, 0xdf8475800, 0x1, 0x8, 0x141bea0d10, 0xbe5c866cb14, 0x45d964b800, 0x0, 0x0, 0xbdb42b79887, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:179 +0x26
 runtime.sysmon()
        /usr/local/go/src/runtime/proc.go:3894 +0x1d2
 runtime.mstart1()
        /usr/local/go/src/runtime/proc.go:1182 +0x11e
 runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1152 +0x64
  • Typically (in most cases, using the recommended 3-replica storage classes), this leads to a quorum violation: the replica reschedule triggered by the node failure is accompanied by the crash described above, leaving the volume presented RO (read-only) to the application pod. An illustrative 3-replica storage class is sketched below.
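
For context, a minimal sketch of such a 3-replica jiva StorageClass, assuming the 0.5.x/0.6.x-style external provisioner name and parameter keys; these are illustrative and the exact keys may differ between releases:

 # Illustrative only: a jiva StorageClass requesting 3 replicas,
 # modeled on OpenEBS 0.5.x/0.6.x provisioner conventions.
 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: openebs-percona
 provisioner: openebs.io/provisioner-iscsi
 parameters:
   openebs.io/storage-pool: "default"
   openebs.io/jiva-replica-count: "3"   # with 3 replicas, losing a second one leads to the RO condition described above
   openebs.io/capacity: 5G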

What you expected to happen:

  • Node failures impacting a single replica should not cause RO mounts on the application node, i.e., there should be no I/O impact.

How to reproduce it (as minimally and precisely as possible):

  • This issue is seen to occur on a Vagrant-based Kubernetes cluster (v1.8.8) built using kubeadm. The kube minion hosts are Ubuntu Xenial Vagrant boxes with 2 vCPUs and ~4 GB RAM each.

  • A Percona application deployment with a liveness probe comprising DB writes at a low delay interval (~1-10 s) was helpful in reproducing this issue (a probe sketch is shown after this list).

  • The node failure was effected using the vagrant halt -f command.
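
As referenced above, a minimal sketch of the kind of liveness probe used on the Percona deployment; the image, probe command, credentials, and timings are illustrative assumptions rather than the exact manifest used in the reproduction:

 # Illustrative fragment of a Percona deployment spec: a liveness probe
 # that exercises DB writes at a short interval. The command, credentials,
 # and timings below are placeholders, not the exact reproduction manifest.
 containers:
 - name: percona
   image: percona
   env:
   - name: MYSQL_ROOT_PASSWORD
     value: k8sDem0            # placeholder; real deployments would use a Secret
   livenessProbe:
     exec:
       command:
       - /bin/sh
       - -c
       - mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "CREATE TABLE IF NOT EXISTS test.liveness (ts TIMESTAMP); INSERT INTO test.liveness VALUES (NOW());"
     initialDelaySeconds: 30
     periodSeconds: 5          # low delay-interval between probe-driven DB writes
     timeoutSeconds: 5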

Anything else we need to know?:

  • This issue is seen on 0.6.0-RC1/2. It is not seen in 0.5.4 (where there are known cases of I/O errors, i.e., device RO).

  • A replica pod log with backtrace from one of the reproduction attempts is provided in the comments.

ksatchit commented Jun 13, 2018

This comment has been minimized.

kmova commented Dec 7, 2018

Have not observed this issue in 0.7 and 0.8 testing, after jiva was refactored around the usage of locks and switched to using a fixed Ubuntu image instead of pulling the latest Ubuntu image. Will re-open this issue if we hit this again.
