New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NooBaa Component Rescheduling #672
Conversation
8a1ef23
to
bef9402
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @baum
This PR looks great. Awesome doc!
Since this feature is a new controller, I want to first discuss its goal and scope.
Instead of node-watcher, I think this should be named and purposed as High Availability (HA) controller. Think about it this way - a reconciler's purpose is well defined by its target effect, not its source triggers, because that's the nature of reconcile - to always apply the desired state, based on any sources needed for the job.
So I would redefine this controller as the HA-controller, which is what we are trying to impose, and describe the problems that are currently in the scope - for now just node-shutdown/disconnect (or other not ready states?). I would also add section that walks through the flow of events on node shutdown - with and without this controller running.
About the code examples in "the solution" section in the doc - it makes the doc colorful and welcoming indeed, but I think that the code should appear once in the source tree rather than copied to the doc which is bound to become obsolete. I would compress those technical details to a a few sentences or bullets that go through the steps in the reconcile flow.
4426df0
to
0cf3ec7
Compare
|
@guymguym sounds good, updated per your concerns |
83be120
to
37980cb
Compare
6ddfabc
to
f9ffb7c
Compare
164841f
to
1dd7bab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
70beade
to
1d81979
Compare
4604dda
to
41078a5
Compare
Improve NooBaa pods recovery in case of a node failure. Find noobaa pods running on the failing node and force to delete them to speed up rescheduling in healthy node. Signed-off-by: Alexander Indenbaum <aindenba@redhat.com> Co-authored-by: liranmauda <liran.mauda@gmail.com> Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
GitHub Action CI integration test using KIND cluster. Simulate a node failure by killing a cluster node. Verify that NooBaa pods running on the failing node are deleted. Signed-off-by: Alexander Indenbaum <aindenba@redhat.com>
Improve NooBaa pods recovery in case of a node failure. Find noobaa pods
running on the failing node and force delete them to speed up
rescheduling in healthy node.
Signed-off-by: Alexander Indenbaum aindenba@redhat.com