Broker should recover from panics and not get into a crash loop #614

Closed
maleck13 opened this issue Jan 5, 2018 · 4 comments
Assignees
Labels
3.12 | release-1.4 Kubernetes 1.12 | Openshift 3.12 | Broker release-1.4 feature lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. proposal

Comments

@maleck13
Contributor

maleck13 commented Jan 5, 2018

Feature:

As outlined in bug #612, it is possible for a Job to panic, which causes the broker to crash. The broker pod is restarted; however, if the Job that caused the crash is still in a pending state, the broker attempts to recover and restart that Job. This creates the potential for the broker to enter a crash loop where the Job keeps panicking and the broker keeps crashing.

We should probably have some mechanism to guard against this situation.

One way to do this would be to use `recover`, a built-in Go function that stops a panicking goroutine from unwinding:
More info: https://blog.golang.org/defer-panic-and-recover
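As a rough illustration only (not the broker's actual code; `Job`, `runJob`, and the logging are placeholder names), a deferred `recover` around job execution could turn a panic into an ordinary error:

```go
package work

import (
	"fmt"
	"log"
)

// Job is a stand-in for the broker's job type; the real interface differs.
type Job interface {
	Run() error
}

// runJob executes a job and converts a panic inside the job into an
// ordinary error, so a misbehaving job cannot take the whole broker down.
func runJob(j Job) (err error) {
	defer func() {
		// recover only has an effect inside a deferred function; it
		// returns nil when no panic is in flight.
		if r := recover(); r != nil {
			log.Printf("job panicked, recovered: %v", r)
			err = fmt.Errorf("job panicked: %v", r)
		}
	}()
	return j.Run()
}
```

The trade-off below still applies: a recovered panic leaves the process running, but possibly in a state the code did not anticipate.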

Although this would stop the broker from crashing, exiting quickly when the broker enters an unexpected state (i.e. a panic) can be a nice way to recover from a crash, as it forces a restart of the process from scratch. A drawback here is that the broker would be unavailable until the pod was restarted (unless there was more than one replica).

As we attempt to recover Jobs that are in an "in progress" state when the broker starts, we may also want a mechanism to discard/delete a Job if it appears during recovery more than a configured number of times. This could potentially be achieved with a field on the JobState that is set and persisted before a Job recovery is attempted.
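A minimal sketch of that idea, assuming a hypothetical `RecoveryAttempts` field on `JobState` and a placeholder persistence callback (neither is the broker's real API):

```go
package work

// JobState, RecoveryAttempts, and the persist callback are hypothetical,
// named here only to illustrate the proposal.
type JobState struct {
	Token            string
	State            string // e.g. "in progress"
	RecoveryAttempts int    // bumped before each recovery attempt
}

const maxRecoveryAttempts = 3 // would come from broker configuration

// shouldRecover decides whether an in-progress job found at broker startup
// should be retried or discarded.
func shouldRecover(js *JobState, persist func(*JobState) error) (bool, error) {
	if js.RecoveryAttempts >= maxRecoveryAttempts {
		// Seen too many times: discard rather than risk another crash loop.
		return false, nil
	}
	// Record the attempt *before* running the job, so a panic during the
	// attempt still leaves the incremented count persisted.
	js.RecoveryAttempts++
	if err := persist(js); err != nil {
		return false, err
	}
	return true, nil
}
```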

Interested to know others' thoughts.

@rthallisey rthallisey added feature 3.10 | release-1.2 Kubernetes 1.10 | Openshift 3.10 | Broker release-1.2 labels Jan 9, 2018
@jmrodri jmrodri self-assigned this Mar 6, 2018
@jmrodri
Contributor

jmrodri commented Mar 6, 2018

requires proposal

@jmrodri jmrodri added 3.11 | release-1.3 Kubernetes 1.11 | Openshift 3.11 | Broker release-1.3 and removed 3.10 | release-1.2 Kubernetes 1.10 | Openshift 3.10 | Broker release-1.2 labels Jun 5, 2018
@jmrodri jmrodri added 3.12 | release-1.4 Kubernetes 1.12 | Openshift 3.12 | Broker release-1.4 and removed 3.11 | release-1.3 Kubernetes 1.11 | Openshift 3.11 | Broker release-1.3 labels Jul 24, 2018
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 24, 2020
@jmrodri
Contributor

jmrodri commented Sep 20, 2020

/close

@openshift-ci-robot

@jmrodri: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
