Bug 1841255: machine-config-daemon-firstboot.service: Make idempotent and block kubelet #1762

Conversation

cgwalters
Member

@cgwalters cgwalters commented May 28, 2020

See https://bugzilla.redhat.com/show_bug.cgi?id=1840222
Something in the baremetal IPI stack is forcibly powering off nodes
during the firstboot. This causes all sorts of problems, but
we should be more robust in handling this.

The problem with BindsTo=ignition-firstboot-complete.service
is twofold:

First, if the service fails, we don't run, and will silently
continue on to e.g. kubelet.service. That's bad - we should
not land user workloads until a node is up to date and secure.

Second, the binding is wrong because at some point we may
move that service into the initramfs in CoreOS, and that would
cause this to break.

The "stamp file" approach is a generally good method of achieving
idempotence, and we already have one, so let's use it.

We also add a RequiredBy={kubelet,crio}.service to ensure
they don't run unless we succeed.
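
Roughly, the relevant bits of the unit then look like this (a sketch for illustration only; the Description and Before= ordering shown here are assumptions, not copied from the shipped unit):

  [Unit]
  Description=Machine Config Daemon Firstboot
  # Stamp file: only run while the encapsulated MachineConfig is still present
  ConditionPathExists=/etc/ignition-machine-config-encapsulated.json
  # Assumed ordering so crio/kubelet wait for this unit when it does run
  Before=crio.service kubelet.service

  [Install]
  # If this unit fails, crio and kubelet will not start
  RequiredBy=crio.service kubelet.service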

@cgwalters cgwalters changed the title machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstb… machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service May 28, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2020
@cgwalters
Member Author

@sinnykumari you added that BindsTo= in a465088 - do you remember why?

@cgwalters
Member Author

To elaborate, the original code removes the stamp file just before rebooting: 40d8225#diff-06b532a2a5a5d3ad1d5341b3e00945efR481
The current code moves it to a backup file instead, see 6e8b587 - but either way, what we want is just the ConditionPathExists= check.

@cgwalters cgwalters changed the title machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service BZ 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service May 28, 2020
@cgwalters
Member Author

/bugzilla refresh

@openshift-ci-robot
Contributor

@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters cgwalters changed the title BZ 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service May 28, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 28, 2020
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1841255, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlebon
Member

jlebon commented May 28, 2020

First, if the service fails, we don't run, and will silently
continue on to e.g. kubelet.service. That's bad - we should
not land user workloads until a node is up to date and secure.

Does this patch change that? I think we need to be RequiredBy=kubelet.service or something, no? Or the atomic option is OnFailure=emergency.target. That would've made it really obvious that something went wrong early on.

@ashcrow ashcrow requested review from jlebon and removed request for ashcrow May 28, 2020 18:43
@cgwalters
Member Author

Does this patch change that? I think we need to be RequiredBy=kubelet.service or something, no? Or the atomic option is OnFailure=emergency.target. That would've made it really obvious that something went wrong early on.

It does indirectly, because we will reboot if changes were required.

@jlebon
Member

jlebon commented May 28, 2020

It does indirectly, because we will reboot if changes were required.

I mean in the failure path. If machine-config-daemon-firstboot.service fails, there's nothing preventing the system from continuing to boot, right?

Thinking more on this, I think it makes sense to have OnFailure=emergency.target? That service is like an extension of Ignition itself, and I think it makes sense to have the same semantics of "crash and burn if we fail to provision as per the spec".
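
Concretely, that would be something like this in the unit (just a sketch of the directive, not taken from an actual unit file):

  [Unit]
  # Fail hard: drop the system to the emergency target if firstboot provisioning fails
  OnFailure=emergency.target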

@stbenjam
Member

FWIW e2e-metal-ipi passed on this, which is a positive sign.

@cgwalters
Member Author

I mean in the failure path. If machine-config-daemon-firstboot.service fails, there's nothing preventing the system from continuing to boot, right?

Yep, true. But that's not the scenario in the current bug. I'm uncertain whether to try to roll up more changes here.

Thinking more on this, I think it makes sense to have OnFailure=emergency.target?

I thought about that too but I want people to be able to log in over ssh and debug.

@cgwalters cgwalters force-pushed the and-with-one-firstboot-bind-them branch from e6dfbdc to 75dbab9 on May 28, 2020 19:42
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1841255, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters cgwalters changed the title Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service Bug 1841255: machine-config-daemon-firstboot.service: Make idempotent and block kubelet May 28, 2020
@cgwalters
Member Author

Based on IRC chat I rolled in here a

  [Install]
  ...
  RequiredBy=crio.service kubelet.service

so that even if the service fails (as opposed to not starting because of a dependency problem, which was the above BZ), we still won't start crio or kubelet.

@jlebon
Member

jlebon commented May 28, 2020

Makes sense to me!

/approve

@stbenjam
Member

/lgtm

e2e-metal-ipi worked again (it's just finishing up now)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Member Author

OK, this clearly doesn't break the successful scenarios. I wanted to test how we handle failure, so I manually hacked up /etc/ignition-machine-config-encapsulated.json to have syntax errors and did systemctl restart machine-config-daemon-firstboot - but because that failed, it took down kubelet and hence my oc debug node/... session.

@stbenjam
Member

OK, this clearly doesn't break the successful scenarios. I wanted to test how we handle failure, so I manually hacked up /etc/ignition-machine-config-encapsulated.json to have syntax errors and did systemctl restart machine-config-daemon-firstboot - but because that failed, it took down kubelet and hence my oc debug node/... session.

In a real failure case, does oc debug node work that early?

@cgwalters
Member Author

In a real failure case, does oc debug node work that early?

No, it wouldn't - this one came from cluster-bot, and I'm still trying to figure out where the private key for the configured ssh key is. Though I may just quickly hack my key into one of the nodes.

@cgwalters
Member Author

OK I did some manual testing of failure scenarios, all looks right to me!

@stbenjam
Member

/test e2e-aws

1 similar comment
@stbenjam
Member

/test e2e-aws

@stbenjam
Member

/test e2e-gcp-upgrade

@cgwalters
Member Author

Upgrade failures are known flakes, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1812142
I'm just going to pull out the
/override e2e-gcp-upgrade
hammer 🔨 .

@openshift-ci-robot
Contributor

@cgwalters: /override requires a failed status context to operate on.
The following unknown contexts were given:

  • e2e-gcp-upgrade

Only the following contexts were expected:

  • ci/prow/e2e-aws
  • ci/prow/e2e-aws-scaleup-rhel7
  • ci/prow/e2e-gcp-op
  • ci/prow/e2e-gcp-upgrade
  • ci/prow/e2e-metal-ipi
  • ci/prow/images
  • ci/prow/unit
  • ci/prow/verify
  • tide

In response to this:

Upgrade failures are known flakes, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1812142
I'm just going to pull out the
/override e2e-gcp-upgrade
hammer 🔨 .

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/override ci/prow/e2e-gcp-upgrade

@openshift-ci-robot
Contributor

@cgwalters: Overrode contexts on behalf of cgwalters: ci/prow/e2e-gcp-upgrade

In response to this:

/override ci/prow/e2e-gcp-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit ef127d4 into openshift:master May 29, 2020
@openshift-ci-robot
Contributor

@cgwalters: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1762. Bugzilla bug 1841255 has been moved to the MODIFIED state.

In response to this:

Bug 1841255: machine-config-daemon-firstboot.service: Make idempotent and block kubelet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Contributor

@sinnykumari you added that BindsTo= in a465088 - do you remember why?

#904 (comment) was the reason, i.e. without BindsTo it caused a double reboot. I will check whether that is still the case now.

@sinnykumari
Contributor

Tested with this PR - I don't see a double reboot while applying kargs as day 1, which is good!
Not sure what changed in between, but we no longer run the m-c-d-host service on first boot, which used to race with the m-c-d-firstboot service.

@runcom
Member

runcom commented Jun 9, 2020

@cgwalters @jlebon this PR is likely the cause of https://bugzilla.redhat.com/show_bug.cgi?id=1842906 - Jerry did an amazing job debugging it, and it seems the removal of BindsTo causes the firstboot provisioning to run again, which then triggers a backwards upgrade to the very initial state.
Was it required to get rid of BindsTo here?

@cgwalters
Member Author

Jerry did an amazing job debugging it, and it seems the removal of BindsTo causes the firstboot provisioning to run again, which then triggers a backwards upgrade to the very initial state.

The current code has

	// Removing this file signals completion of the initial MC processing.
	if err := os.Rename(constants.MachineConfigEncapsulatedPath, constants.MachineConfigEncapsulatedBakPath); err != nil {
		return errors.Wrap(err, "failed to rename encapsulated MachineConfig after processing on firstboot")
	}

	dn.skipReboot = false
	return dn.reboot(fmt.Sprintf("Completing firstboot provisioning to %s", mc.GetName()))

So that should ensure that ConditionPathExists=/etc/ignition-machine-config-encapsulated.json no longer triggers.

Something must be going wrong that breaks that.

@runcom
Member

runcom commented Jun 9, 2020

There's one last comment in https://bugzilla.redhat.com/show_bug.cgi?id=1842906 which clarifies what's happening better than I did in my previous comment
