Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NooBaa Component Rescheduling #672

Merged
merged 2 commits into from Aug 2, 2021
Merged

Conversation

baum
Copy link
Contributor

@baum baum commented Jun 24, 2021

Improve NooBaa pods recovery in case of a node failure. Find noobaa pods
running on the failing node and force delete them to speed up
rescheduling in healthy node.

Signed-off-by: Alexander Indenbaum aindenba@redhat.com

@baum baum requested review from dannyzaken and guymguym June 24, 2021 13:24
@baum baum force-pushed the node_watcher branch 13 times, most recently from 8a1ef23 to bef9402 Compare June 27, 2021 20:34
Copy link
Member

@guymguym guymguym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @baum
This PR looks great. Awesome doc!
Since this feature is a new controller, I want to first discuss its goal and scope.

Instead of node-watcher, I think this should be named and purposed as High Availability (HA) controller. Think about it this way - a reconciler's purpose is well defined by its target effect, not its source triggers, because that's the nature of reconcile - to always apply the desired state, based on any sources needed for the job.

So I would redefine this controller as the HA-controller, which is what we are trying to impose, and describe the problems that are currently in the scope - for now just node-shutdown/disconnect (or other not ready states?). I would also add section that walks through the flow of events on node shutdown - with and without this controller running.

About the code examples in "the solution" section in the doc - it makes the doc colorful and welcoming indeed, but I think that the code should appear once in the source tree rather than copied to the doc which is bound to become obsolete. I would compress those technical details to a a few sentences or bullets that go through the steps in the reconcile flow.

@baum baum force-pushed the node_watcher branch 6 times, most recently from 4426df0 to 0cf3ec7 Compare June 28, 2021 07:22
@baum
Copy link
Contributor Author

baum commented Jun 28, 2021

@guymguym sounds good, updated per your concerns

@baum baum requested a review from guymguym June 28, 2021 08:29
@baum baum force-pushed the node_watcher branch 2 times, most recently from 83be120 to 37980cb Compare June 28, 2021 10:49
pkg/hac/hac.go Outdated Show resolved Hide resolved
pkg/hac/hac.go Outdated Show resolved Hide resolved
@baum
Copy link
Contributor Author

baum commented Jul 4, 2021

Added High Availability (HA) Controller integration test, see here

GitHub Action CI integration test using KIND cluster.
Simulate a node failure by killing a cluster node.
Verify that NooBaa pods running on the failing node are deleted.

.github/workflows/run_hac_test.yml Outdated Show resolved Hide resolved
doc/high-availability-controller.md Outdated Show resolved Hide resolved
doc/high-availability-controller.md Outdated Show resolved Hide resolved
doc/high-availability-controller.md Outdated Show resolved Hide resolved
doc/high-availability-controller.md Outdated Show resolved Hide resolved
@baum baum force-pushed the node_watcher branch 7 times, most recently from 6ddfabc to f9ffb7c Compare July 7, 2021 07:00
@baum baum force-pushed the node_watcher branch 6 times, most recently from 164841f to 1dd7bab Compare July 22, 2021 14:34
Copy link
Contributor

@dannyzaken dannyzaken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@baum baum force-pushed the node_watcher branch 2 times, most recently from 70beade to 1d81979 Compare July 27, 2021 07:45
@baum baum force-pushed the node_watcher branch 2 times, most recently from 4604dda to 41078a5 Compare August 2, 2021 14:41
Alexander Indenbaum and others added 2 commits August 2, 2021 17:47
Improve NooBaa pods recovery in case of a node failure. Find noobaa pods
running on the failing node and force to delete them to speed up
rescheduling in healthy node.

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
Co-authored-by: liranmauda <liran.mauda@gmail.com>
Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
GitHub Action CI integration test using KIND cluster.
Simulate a node failure by killing a cluster node.
Verify that NooBaa pods running on the failing node are deleted.

Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants