Bug 1829642: templates: Add a special machine-config-daemon-firstboot-v42.service #1706
This is aiming to fix: https://bugzilla.redhat.com/show_bug.cgi?id=1829642 AKA openshift#1215 (comment) Basically we have our systemd units dynamically differentiate between "4.2" and "4.3 or above" by looking at the aleph version.
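To make the "look at the aleph version" idea concrete, here is a minimal sketch of the classification step. It is not the PR's implementation: the JSON layout (a `build` key whose first dot-separated field encodes the release, e.g. `42` for 4.2) and the function name `release_from_aleph` are assumptions for illustration only.

```python
import json


def release_from_aleph(aleph_json: str) -> str:
    """Classify a bootimage as "4.2" or "4.3 or above".

    ASSUMPTION: the aleph data is JSON with a "build" key such as
    "42.80.20191002.0", where the first field encodes the release.
    """
    build = json.loads(aleph_json)["build"]
    prefix = int(build.split(".")[0])
    if prefix == 42:
        return "4.2"
    if prefix >= 43:
        return "4.3 or above"
    return "older than 4.2"


# Hypothetical sample documents, not taken from a real bootimage.
sample_42 = release_from_aleph('{"build": "42.80.20191002.0"}')
sample_43 = release_from_aleph('{"build": "43.81.20200101.0"}')
```

In the PR itself the decision is made by the systemd units at boot, so this only illustrates the branching, not where it runs.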
A few other things we should fix here but let's do them separately, just noting for myself:
Also...random thought, why did we add the MCD to the host versus pulling it from a container image and just running that via
So it looks like the issue was a target change that we didn't hold backwards compatibility for?
We were pulling it as a container but decided to bake it in to the OS in #801. Essentially at that time
Well...at the time I did a lot of testing with 4.1 and I think we did 4.2 too - the thing is the code was only partially enabled partly because I think I hadn't fully tested enabling it and I was worried about issues like this. This change enabled it fully.
No worries and no blame! Things move fast 😄 Just making sure I understand what may have happened.
[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/usr/libexec/machine-config-daemon pivot
This may be dumb, but will this also run on 4.1 given this `ConditionPathExists`? 4.1 has the pivot service in the OS directly, so that's going to run, but will this run as well, or am I making this up?
That's a good question. So...I think what's going on here right now is you'll notice that `machine-config-daemon-host.service` does:
# If pivot exists, defer to it. Note similar code in update.go
ConditionPathExists=!/usr/lib/systemd/system/pivot.service
I think the idea there was that this `firstboot.service` will try to do the "other stuff" like kernel arguments and then defer the actual "os update" to `pivot.service`.
But I'm honestly not sure that really makes sense - we might as well take over the whole thing?
This needs investigation - but we're also doing the same thing that the current code is doing so we can't (shouldn't) be making anything worse, right?
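For readers unfamiliar with the negated condition quoted above: in systemd, a leading `!` on `ConditionPathExists=` inverts the test, so the unit is skipped when the path is present. A simplified model of that check (the helper name `condition_path_exists` is made up for illustration, and systemd's other prefixes such as `|` are ignored):

```python
import os
import tempfile


def condition_path_exists(spec: str) -> bool:
    """Simplified model of systemd's ConditionPathExists= check.

    A leading "!" negates the test, so "!/some/path" holds only when
    the path is absent.
    """
    negate = spec.startswith("!")
    path = spec[1:] if negate else spec
    exists = os.path.exists(path)
    return not exists if negate else exists


with tempfile.TemporaryDirectory() as root:
    # Stand-in for /usr/lib/systemd/system/pivot.service on a host.
    unit = os.path.join(root, "pivot.service")
    cond = "!" + unit
    absent_result = condition_path_exists(cond)   # no pivot.service: condition holds
    open(unit, "w").close()
    present_result = condition_path_exists(cond)  # pivot.service present: unit is skipped
```

This is why a 4.1 bootimage, which ships `pivot.service` on the host, would skip a unit carrying that condition.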
Yep agreed I think it makes sense and it’ll be skipped indeed 🤔
I don't know the answer to that, but this is what's done in OKD on FCOS, which doesn't ship with the MCD. How feasible would it be to change this in OCP to pull from an image there?
I should've read on a bit 😄
I can't remember either right now, but sure, that could be something we could try and move to (as OKD is doing, as Christian said 👍)
I think this works! I deployed this PR on an affected cluster, made sure the base AMI was from 4.2, then scaled up the machineset and:
All looks good! It joined the cluster.
/retitle Bug 1829642: templates: Add a special machine-config-daemon-firstboot-v42.service
@cgwalters: An error was encountered searching for bug 1829642 on the Bugzilla server at https://bugzilla.redhat.com:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
4.4 backport #1707
/bugzilla refresh
@kikisdeliveryservice: This pull request references Bugzilla bug 1829642, which is valid. 3 validation(s) were run on this bug
Has anyone had a chance to independently verify this yet?
@cgwalters: This pull request references Bugzilla bug 1829642, which is valid. 3 validation(s) were run on this bug
My previous cluster broke before it got to it, so I am trying again.
Hmm, these are weird test errors in e2e-aws; will take a look, but also /retest
Some tips for testing this out/reproducing. The most "customer-like" reproduction scenario is:
If it fails, the machine will be stuck in
But I am 95% sure it will also reproduce with just:
I'm testing that we successfully fail with the latter right now.
I tagged the release image from this PR to
And just for reference, I find this "tag release image from PR" workflow to be very useful, so here's a handy command to do it. First find the CI namespace from the artifacts, then:
I understand this will fix the m-c-d-firstboot-complete issue, but I am not sure it solves issues like the crio one in https://bugzilla.redhat.com/show_bug.cgi?id=1829642#c8 . It is a bit surprising that when a 4.2 cluster gets upgraded to 4.4 and we scale up, we use a 4.2-based bootimage to boot the node, but it uses the Ignition config served by 4.4. Shouldn't the bootimage and Ignition config be from the same time (i.e. the Ignition config that was served initially with the 4.2 bootimage)? Correct me if I misunderstood something.
Related discussion seems to have happened from #798 (comment) onward, and the related PR is #904. I vaguely remember that at that time we decided to go with supporting day-1 kargs for 4.3+ clusters, but we didn't foresee that we would have an upgrade issue like the one this PR is about.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, runcom The full list of commands accepted by this bot can be found here. The pull request process is described here
@cgwalters: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1706. Bugzilla bug 1829642 has been moved to the MODIFIED state. In response to this:
Also, add machine-config-daemon-firstboot-v42.service so that new nodes added through machine-api come up as expected on a cluster upgraded from 4.2 to 4.3.
Backport of PRs:
- openshift#1366
- openshift#1706
This came up in discussions around here: openshift#1706 (comment)
There's no reason to start crio either, and it's confusing to do so because in that bug it made us think there was a crio problem when there wasn't.
Add `crio-wipe.service` here too because it's part of `crio.service`.
I haven't verified this yet but I think this is the core fix for https://bugzilla.redhat.com/show_bug.cgi?id=1842906
Basically the MCS injects both `/etc/pivot/image-pullspec` which 4.1 bootimages need, *and* we get `/etc/ignition-machine-config-encapsulated.json` which starts `machine-config-daemon-firstboot.service`. We should only have one running.
I think we have a setup like this:
- 4.1: Uses `pivot.service` which lives on the host already
- 4.2: Uses `machine-config-daemon-firstboot-v42.service` from openshift#1706
- ≥4.3: Uses `machine-config-daemon-firstboot.service`
So let's stop starting this by default.
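The "only one should run" setup described in this commit message can be sketched as a single selection function. This is an illustration of the intended mapping only, not code from the MCO; the function name `firstboot_unit` and the tuple representation of the bootimage release are assumptions.

```python
from typing import Tuple


def firstboot_unit(bootimage_release: Tuple[int, int]) -> str:
    """Return the single unit expected to handle firstboot.

    The mapping follows the commit message: 4.1 uses pivot.service,
    4.2 uses the -v42 unit, and 4.3 or later uses the regular
    firstboot unit. Representing the release as (major, minor) is an
    assumption made for this sketch.
    """
    if bootimage_release <= (4, 1):
        return "pivot.service"
    if bootimage_release == (4, 2):
        return "machine-config-daemon-firstboot-v42.service"
    return "machine-config-daemon-firstboot.service"


# Exactly one unit is chosen per bootimage release.
chosen = {rel: firstboot_unit(rel) for rel in [(4, 1), (4, 2), (4, 3), (4, 4)]}
```

Writing it as one function makes the invariant explicit: each bootimage release maps to exactly one unit, which is what the commit aims for by no longer starting the firstboot unit by default.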
This is aiming to fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1829642
AKA
#1215 (comment)
Basically we have our systemd units dynamically differentiate between
"4.2" and "4.3 or above" by looking at the aleph version.
Background
In the beginning, we had this concept of `pivot.service` whose whole job was to pull a new `machine-os-content` and reboot, running `Before=kubelet.service`.
More: OSUpgrades.
That was simple, and worked well! Life was good. Everyone was relaxing on the beach.
Then, we needed to handle kernel arguments (specifically adding `nosmt`).
But a core problem here is that we have no mechanism to update bootimages by default - so every change we make that needs to happen on this "firstboot" becomes an upgrade hazard.
Further, we had a strong split between the `pivot` code (day 0) and what happened in the MCO. So we started baking in the MCD to the host; #868 is a notable commit here.
But still today, the MCO needs to handle having even very old bootimages (from 4.1) scale up and join the cluster.
#891 is a notable commit here; basically the MCS serves Ignition which contains a pair of systemd units:
- `machine-config-daemon-firstboot.service`: Replaces `pivot.service` - this handles everything the MCO knows how to handle day 1/2, such as kernel arguments too.
- `machine-config-daemon-host.service`: Run by rpm-ostree to perform upgrades. This is not enabled by default, but is started by the MCD (both via the `-firstboot.service` as well as "day 2" when the MCD runs as a daemonset).
Now wait...trying to re-understand the difference between them, I found #1215 (landed for 4.4).
Before that commit...we basically didn't run the new code AFAICS because the MCS still today writes `/etc/pivot/image-pullspec` (for compatibility with 4.1 bootimages). See this commit which added the code.