Broker should recover from panics and not get into a crash loop #614
Labels
3.12 | release-1.4
Kubernetes 1.12 | Openshift 3.12 | Broker release-1.4
feature
lifecycle/stale
proposal
As outlined in bug #612, it is possible for a Job to panic, which causes the broker to crash. The broker pod is restarted; however, if the Job that caused the crash is still in a pending state, the broker attempts to recover and restart it. This creates the potential for a crash loop: the Job keeps panicking and the broker keeps crashing.
We should probably have some mechanism to guard against this situation.
One way to do this would be to use Go's built-in recover function in a deferred call, so a panicking goroutine is stopped before it takes down the whole process:
More info https://blog.golang.org/defer-panic-and-recover
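The deferred recover pattern might look something like this. This is a minimal sketch, not the broker's actual code: processJob and runJob are hypothetical names standing in for the broker's job handler and its wrapper.

```go
package main

import (
	"fmt"
	"log"
)

// processJob stands in for the broker's job handler; here it
// always panics, simulating the failure described in #612.
func processJob(id string) {
	panic("job " + id + " hit an unrecoverable state")
}

// runJob wraps the handler so a panic is converted into an error
// instead of crashing the whole broker process.
func runJob(id string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %s panicked: %v", id, r)
		}
	}()
	processJob(id)
	return nil
}

func main() {
	if err := runJob("demo"); err != nil {
		log.Printf("recovered: %v", err)
	}
	// The process reaches this line instead of crashing.
	fmt.Println("broker still running")
}
```

Note the deferred function must be registered inside the goroutine that panics; a recover in the parent goroutine would not catch it.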
Although this would stop the broker from crashing, exiting quickly when the broker enters an unexpected state (i.e. a panic) can be a good way to recover from a crash, as it forces a restart of the process from scratch. A drawback here is that the broker would be unavailable until the pod was restarted (unless there was more than one replica).
As we attempt to recover Jobs that are in an "in progress" state when the broker starts, we may also want a mechanism to discard / delete a Job if it appears during recovery more than a configured number of times. This could potentially be achieved with a counter field on the JobState that is incremented and persisted before each recovery attempt.
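A sketch of that counter, under the assumption of a new RecoveryAttempts field on JobState (the field name, the shouldRecover helper, and the limit of 3 are all hypothetical, not existing broker code):

```go
package main

import "fmt"

// maxRecoveryAttempts caps how many times a pending job is retried
// at broker startup; the value 3 is an arbitrary illustration and
// would presumably come from broker configuration.
const maxRecoveryAttempts = 3

// JobState is a hypothetical sketch of the persisted job record;
// RecoveryAttempts is the proposed new field.
type JobState struct {
	ID               string
	RecoveryAttempts int
}

// shouldRecover bumps the attempt counter and reports whether
// recovery should proceed. The counter must be persisted before the
// job is restarted, so it survives if the job crashes the broker again.
func shouldRecover(js *JobState) bool {
	if js.RecoveryAttempts >= maxRecoveryAttempts {
		// Discard: this job has already crashed the broker too often.
		return false
	}
	js.RecoveryAttempts++
	// persist(js) would happen here, before the job is re-run.
	return true
}

func main() {
	js := &JobState{ID: "job-1"}
	for i := 0; i < 5; i++ {
		fmt.Println(shouldRecover(js))
	}
}
```

The key detail is ordering: the incremented counter has to be written to storage before the job runs, otherwise a panic during the job would reset the count and the guard would never trip.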
Interested to hear others' thoughts.