PoC recovering reconcile panics as errors #5992

Closed
wants to merge 1 commit into from


davidvossel
Member

We've encountered two recent virt-controller crashes (#5849 and #5694) that were caused by resources entering unexpected states. A panic in any one of our reconcile loops brings down the entire virt-controller process, which can then get stuck in a crash loop.

Using Go's built-in recover() function, we can catch panics that occur during our reconcile loops and treat them as errors, which then get rate limited.

This PR exists to illustrate how this can be done for the sake of discussion.
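
For readers following the discussion, here is a minimal sketch of the recover() pattern described above. It is not the diff from this PR: `reconcileFunc` and `recoverAsError` are hypothetical names chosen for illustration, and the rate-limited requeue is only indicated in a comment.

```go
package main

import (
	"errors"
	"fmt"
	"runtime/debug"
)

// reconcileFunc is a hypothetical stand-in for one controller's reconcile
// step for a single work-queue key.
type reconcileFunc func(key string) error

// recoverAsError runs reconcile and converts a panic into an ordinary error,
// so the caller can requeue the key instead of the whole process crashing.
// recover() only has an effect inside a deferred function, and only for a
// panic raised on the same goroutine.
func recoverAsError(key string, reconcile reconcileFunc) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Keep the stack trace; it is the main clue for debugging later.
			err = fmt.Errorf("panic reconciling %q: %v\n%s", key, r, debug.Stack())
		}
	}()
	return reconcile(key)
}

func main() {
	// A reconcile step that panics on an unexpected resource state.
	broken := func(key string) error {
		panic("unexpected resource state")
	}
	// A reconcile step that fails in the ordinary way.
	failing := func(key string) error {
		return errors.New("not ready yet")
	}

	for _, fn := range []reconcileFunc{broken, failing} {
		if err := recoverAsError("default/vmi-a", fn); err != nil {
			// A real controller would requeue the key with rate limiting here.
			fmt.Println("requeue after error")
		}
	}
}
```

With this wrapper, both the panicking and the failing reconcile step reach the same error path, which is exactly the behavior under discussion: the panic no longer terminates the process.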

NONE

Signed-off-by: David Vossel <davidvossel@gmail.com>
@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jul 7, 2021
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from davidvossel after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@davidvossel davidvossel marked this pull request as draft July 7, 2021 20:07
@kubevirt-bot kubevirt-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 7, 2021
@davidvossel
Member Author

I'm not convinced treating reconcile panics as errors is a good idea. For now I think we should still crash; however, we may choose to introduce a small delay in the crash handler to allow other controllers to perform some work before exiting. This could mitigate a situation where a crash occurs that would be resolved if only another controller had a chance to perform some work.
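
The delayed-crash idea above could look roughly like the sketch below. This is an assumption about one possible shape, not code from KubeVirt: `runWithDelayedCrash` and `crashDelay` are hypothetical names, and the delay value is arbitrary since no specific duration was proposed.

```go
package main

import (
	"fmt"
	"time"
)

// crashDelay is a hypothetical grace period chosen only for this demo.
const crashDelay = 50 * time.Millisecond

// runWithDelayedCrash runs work and, if it panics, waits for delay before
// re-raising the same panic value. The process still crashes; other
// goroutines simply get a short window to finish what they are doing.
func runWithDelayedCrash(delay time.Duration, work func()) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("reconcile panic: %v; exiting in %s\n", r, delay)
			time.Sleep(delay) // other controller goroutines keep running here
			panic(r)          // the panic is re-raised, not swallowed
		}
	}()
	work()
}

func main() {
	// Stand-in for another controller that gets to run during the delay.
	otherDone := make(chan struct{})
	go func() {
		fmt.Println("other controller: finished pending work")
		close(otherDone)
	}()

	// Demo only: catch the re-raised panic so this example exits cleanly.
	// A real process would omit this and terminate.
	defer func() {
		r := recover()
		<-otherDone
		fmt.Printf("process would crash here with: %v\n", r)
	}()

	runWithDelayedCrash(crashDelay, func() { panic("unexpected resource state") })
}
```

Unlike converting the panic into an error, this keeps the crash visible while still giving other controllers a chance to make progress first.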

@davidvossel davidvossel closed this Jul 9, 2021
Labels
dco-signoff: yes (Indicates the PR's author has DCO signed all their commits.)
do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
release-note-none (Denotes a PR that doesn't merit a release note.)
size/S
2 participants