Broker should recover from panics and not get into a crash loop #614
Labels
3.12 | release-1.4
Kubernetes 1.12 | Openshift 3.12 | Broker release-1.4
feature
lifecycle/stale
proposal
As outlined in bug #612, it is possible for a Job to panic, which causes the broker to crash. The broker pod is restarted; however, if the Job that caused the crash is still in a pending state, the broker attempts to recover and restart it. This creates the potential for a crash loop: the Job keeps panicking and the broker keeps crashing.
We should probably have some mechanism to guard against this situation.
One way to do this would be to use Go's built-in recover function in a deferred call, so a panicking goroutine is stopped before it takes down the whole process:
More info https://blog.golang.org/defer-panic-and-recover
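The deferred recover pattern might look something like this. This is a minimal sketch, not the broker's actual code: processJob and runJob are hypothetical names standing in for the broker's job handler and its wrapper.

```go
package main

import (
	"fmt"
	"log"
)

// processJob stands in for the broker's job handler; here it
// always panics, simulating the failure described in #612.
func processJob(id string) {
	panic("job " + id + " hit an unrecoverable state")
}

// runJob wraps the handler so a panic is converted into an error
// instead of crashing the whole broker process.
func runJob(id string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %s panicked: %v", id, r)
		}
	}()
	processJob(id)
	return nil
}

func main() {
	if err := runJob("demo"); err != nil {
		log.Printf("recovered: %v", err)
	}
	// The process reaches this line instead of crashing.
	fmt.Println("broker still running")
}
```

Note the deferred function must be registered inside the goroutine that panics; a recover in the parent goroutine would not catch it.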
Although this would stop the broker from crashing, exiting quickly when the broker enters an unexpected state (i.e. a panic) can be a good way to recover from a crash, as it forces a restart of the process from scratch. A drawback here is that the broker would be unavailable until the pod was restarted (unless there was more than one replica).
As we attempt to recover Jobs that are in an "in progress" state when the broker starts, we may also want a mechanism to discard / delete a Job if it appears during recovery more than a configured number of times. This could potentially be achieved with a counter field on the JobState that is incremented and persisted before each recovery attempt.
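A sketch of that counter, under the assumption of a new RecoveryAttempts field on JobState (the field name, the shouldRecover helper, and the limit of 3 are all hypothetical, not existing broker code):

```go
package main

import "fmt"

// maxRecoveryAttempts caps how many times a pending job is retried
// at broker startup; the value 3 is an arbitrary illustration and
// would presumably come from broker configuration.
const maxRecoveryAttempts = 3

// JobState is a hypothetical sketch of the persisted job record;
// RecoveryAttempts is the proposed new field.
type JobState struct {
	ID               string
	RecoveryAttempts int
}

// shouldRecover bumps the attempt counter and reports whether
// recovery should proceed. The counter must be persisted before the
// job is restarted, so it survives if the job crashes the broker again.
func shouldRecover(js *JobState) bool {
	if js.RecoveryAttempts >= maxRecoveryAttempts {
		// Discard: this job has already crashed the broker too often.
		return false
	}
	js.RecoveryAttempts++
	// persist(js) would happen here, before the job is re-run.
	return true
}

func main() {
	js := &JobState{ID: "job-1"}
	for i := 0; i < 5; i++ {
		fmt.Println(shouldRecover(js))
	}
}
```

The key detail is ordering: the incremented counter has to be written to storage before the job runs, otherwise a panic during the job would reset the count and the guard would never trip.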
Interested to hear others' thoughts.