Bug 1841255: machine-config-daemon-firstboot.service: Make idempotent and block kubelet #1762
Conversation
@sinnykumari you added that
To elaborate, the original code removes the stamp file just before rebooting: 40d8225#diff-06b532a2a5a5d3ad1d5341b3e00945efR481
/bugzilla refresh
@cgwalters: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@cgwalters: This pull request references Bugzilla bug 1841255, which is valid. 3 validations were run on this bug.
Does this patch change that? I think we need to be
It does indirectly, because we will reboot if changes were required.
I mean in the failure path. Thinking more on this, I think it makes sense to have
FWIW e2e-metal-ipi passed on this, which is a positive sign.
Yep, true. But that's not the scenario in the current bug. I'm uncertain whether to try to roll up more changes here.
I thought about that too, but I want people to be able to log in over
cgwalters force-pushed from e6dfbdc to 75dbab9. The commit message:

machine-config-daemon-firstboot.service: Make idempotent and block kubelet

See https://bugzilla.redhat.com/show_bug.cgi?id=1840222

Something in the baremetal IPI stack is forcibly powering off nodes during the firstboot. This causes all sorts of problems, but we should be more robust in handling this.

The problem with `BindsTo=ignition-firstboot-complete.service` is twofold:

First, if the service fails, we don't run, and will silently continue on to e.g. `kubelet.service`. That's bad - we should not land user workloads until a node is up to date and secure.

Second, the binding is wrong because at some point we may move that service into the initramfs in CoreOS, and that would cause this to break.

The "stamp file" approach is a generally good method of achieving idempotence, and we already have one, so let's use it. We also add a `RequiredBy={kubelet,crio}.service` to ensure they don't run unless we succeed.
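The "stamp file" idempotence pattern the commit message refers to can be sketched as a small shell script. This is illustrative only; the path, messages, and `STAMP_PATH` variable are made up for the sketch and are not the MCO's actual implementation:

```shell
#!/bin/sh
# Sketch of the "stamp file" idempotence pattern: do the firstboot work
# at most once, and make every later invocation a cheap no-op.
# STAMP_PATH is a hypothetical override knob for this example.
set -eu
stamp="${STAMP_PATH:-./firstboot.stamp}"
if [ -e "$stamp" ]; then
    echo "stamp present; nothing to do"
else
    echo "performing firstboot work"
    # ... apply the pending machine config, then reboot ...
    touch "$stamp"
fi
```

The key property is that the stamp is created only after the work succeeds, so a forced power-off mid-run leaves the stamp absent and the work is retried on the next boot instead of being silently skipped.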
@cgwalters: This pull request references Bugzilla bug 1841255, which is valid. 3 validations were run on this bug.
Based on IRC chat I rolled in here a change so that even if the service fails (as opposed to not starting because of a dependency problem, which was the above BZ), we still won't start crio or kubelet.
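The gating being discussed here can be expressed in a systemd unit along these lines. This is an illustrative fragment with made-up unit and path names, not the MCO's actual unit file:

```ini
# Illustrative sketch: a oneshot firstboot unit that kubelet and crio
# hard-require, so they will not be started if it fails or never runs.
[Unit]
Description=Example firstboot configuration (sketch)
Before=kubelet.service crio.service
# Skip entirely once the stamp file exists (idempotence).
ConditionPathExists=!/var/lib/example-firstboot.stamp

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/example-firstboot.sh

[Install]
# RequiredBy (unlike WantedBy) adds a Requires= dependency to the listed
# units on enable, so a failure here blocks them from starting.
RequiredBy=kubelet.service crio.service
```

Note that `Requires=` alone does not impose ordering; the `Before=` line is what guarantees the firstboot work completes before kubelet and crio are started.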
Makes sense to me! /approve
/lgtm e2e-metal-ipi worked again (it's just finishing up now)
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, jlebon, stbenjam. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
OK, this clearly doesn't break the successful scenarios. I wanted to test how we handle failure, so I manually hacked up
In a real failure case, does oc debug node work that early?
No, it wouldn't - this one came from cluster-bot and I'm still trying to figure out where the private key for the configured ssh key is. Though I may just quickly hack my key into one of the nodes.
OK, I did some manual testing of failure scenarios; all looks right to me!
/test e2e-aws
1 similar comment
/test e2e-aws
/test e2e-gcp-upgrade
Upgrade failures are known flakes, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1812142
@cgwalters: /override requires a failed status context to operate on.
Only the following contexts were expected:
/override ci/prow/e2e-gcp-upgrade
@cgwalters: Overrode contexts on behalf of cgwalters: ci/prow/e2e-gcp-upgrade
@cgwalters: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1762. Bugzilla bug 1841255 has been moved to the MODIFIED state.
#904 (comment) was the reason, i.e. without BindsTo it caused a double reboot. I will check whether that still happens now.
Tested with this PR - I don't see a double reboot while applying kargs as day 1, which is good!
@cgwalters @jlebon this PR is likely the cause of https://bugzilla.redhat.com/show_bug.cgi?id=1842906 - Jerry did an amazing job debugging it, and it seems the removal of BindsTo is causing the firstboot provision to run again, which then triggers a backwards upgrade to the very initial state.
The current code has
So that should ensure that. Something must be going wrong that breaks it.
There's one last comment in https://bugzilla.redhat.com/show_bug.cgi?id=1842906 which clarifies what's happening better than I did in my previous comment.