Bug 1841255: machine-config-daemon-firstboot.service: Make idempotent and block kubelet #1762

Conversation

cgwalters
Member

@cgwalters cgwalters commented May 28, 2020

See https://bugzilla.redhat.com/show_bug.cgi?id=1840222
Something in the baremetal IPI stack is forcibly powering off nodes
during the firstboot. This causes all sorts of problems, but
we should be more robust in handling this.

The problem with BindsTo=ignition-firstboot-complete.service
is twofold:

First, if the service fails, we don't run, and will silently
continue on to e.g. kubelet.service. That's bad - we should
not land user workloads until a node is up to date and secure.

Second, the binding is wrong because at some point we may
move that service into the initramfs in CoreOS, and that would
cause this to break.

The "stamp file" approach is a generally good method of achieving
idempotence, and we already have one, so let's use it.

We also add a RequiredBy={kubelet,crio}.service to ensure
they don't run unless we succeed.
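
Roughly, the relevant bits of the unit then look like this (a sketch for illustration only; the Description and Before= ordering shown here are assumptions, not copied from the shipped unit):

  [Unit]
  Description=Machine Config Daemon Firstboot
  # Stamp file: only run while the encapsulated MachineConfig is still present
  ConditionPathExists=/etc/ignition-machine-config-encapsulated.json
  # Assumed ordering so crio/kubelet wait for this unit when it does run
  Before=crio.service kubelet.service

  [Install]
  # If this unit fails, crio and kubelet will not start
  RequiredBy=crio.service kubelet.service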

@cgwalters cgwalters changed the title machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstb… machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service May 28, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2020
@cgwalters
Member Author

@sinnykumari you added that BindsTo= in a465088 - do you remember why?

@cgwalters
Member Author

To elaborate, the original code removes the stamp file just before rebooting: 40d8225#diff-06b532a2a5a5d3ad1d5341b3e00945efR481
The current code moves it to a backup file instead, see 6e8b587 - but either way, what we want is just the ConditionPathExists= check.

@cgwalters cgwalters changed the title machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service BZ 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service May 28, 2020
@cgwalters
Member Author

/bugzilla refresh

@openshift-ci-robot
Contributor

@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters cgwalters changed the title BZ 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service May 28, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 28, 2020
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1841255, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlebon
Member

jlebon commented May 28, 2020

First, if the service fails, we don't run, and will silently
continue on to e.g. kubelet.service. That's bad - we should
not land user workloads until a node is up to date and secure.

Does this patch change that? I think we need to be RequiredBy=kubelet.service or something, no? Or the atomic option is OnFailure=emergency.target. That would've made it really obvious that something went wrong early on.

@ashcrow ashcrow requested review from jlebon and removed request for ashcrow May 28, 2020 18:43
@cgwalters
Member Author

Does this patch change that? I think we need to be RequiredBy=kubelet.service or something, no? Or the atomic option is OnFailure=emergency.target. That would've made it really obvious that something went wrong early on.

It does indirectly, because we will reboot if changes were required.

@jlebon
Member

jlebon commented May 28, 2020

It does indirectly, because we will reboot if changes were required.

I mean in the failure path. If machine-config-daemon-firstboot.service fails, there's nothing preventing the system from continuing to boot, right?

Thinking more on this, I think it makes sense to have OnFailure=emergency.target? That service is like an extension of Ignition itself, and I think it makes sense to have the same semantics of "crash and burn if we fail to provision as per the spec".
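
Concretely, that would be something like this in the unit (just a sketch of the directive, not taken from an actual unit file):

  [Unit]
  # Fail hard: drop the system to the emergency target if firstboot provisioning fails
  OnFailure=emergency.target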

@stbenjam
Member

FWIW e2e-metal-ipi passed on this, which is a positive sign.

@cgwalters
Member Author

I mean in the failure path. If machine-config-daemon-firstboot.service fails, there's nothing preventing the system from continuing to boot, right?

Yep, true. But that's not the scenario in the current bug. I'm uncertain whether to try to roll up more changes here.

Thinking more on this, I think it makes sense to have OnFailure=emergency.target?

I thought about that too but I want people to be able to log in over ssh and debug.

@cgwalters cgwalters force-pushed the and-with-one-firstboot-bind-them branch from e6dfbdc to 75dbab9 on May 28, 2020 19:42
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1841255, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters cgwalters changed the title Bug 1841255: machine-config-daemon-firstboot.service: Drop BindsTo=ignition-firstboot-complete.service Bug 1841255: machine-config-daemon-firstboot.service: Make idempotent and block kubelet May 28, 2020
@cgwalters
Member Author

Based on IRC chat I rolled in here a

  [Install]
  ...
  RequiredBy=crio.service kubelet.service

so that even if the service fails (as opposed to not starting because of a dependency problem, which was the above BZ), we still won't start crio or kubelet.

@jlebon
Member

jlebon commented May 28, 2020

Makes sense to me!

/approve

@stbenjam
Member

/lgtm

e2e-metal-ipi worked again (it's just finishing up now)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Member Author

OK, this clearly doesn't break the successful scenarios. I wanted to test how we handle failure, so I manually hacked up /etc/ignition-machine-config-encapsulated.json to have syntax errors and did systemctl restart machine-config-daemon-firstboot - but because that failed, it took down kubelet and hence my oc debug node/... session.

@stbenjam
Member

OK, this clearly doesn't break the successful scenarios. I wanted to test how we handle failure, so I manually hacked up /etc/ignition-machine-config-encapsulated.json to have syntax errors and did systemctl restart machine-config-daemon-firstboot - but because that failed, it took down kubelet and hence my oc debug node/... session.

In a real failure case, does oc debug node work that early?

@cgwalters
Member Author

In a real failure case, does oc debug node work that early?

No, it wouldn't - this one came from cluster-bot, and I'm still trying to figure out where the private key for the configured ssh key is. Though I may just quickly hack my key into one of the nodes.

@cgwalters
Member Author

OK I did some manual testing of failure scenarios, all looks right to me!

@stbenjam
Member

/test e2e-aws

1 similar comment
@stbenjam
Member

/test e2e-aws

@stbenjam
Member

/test e2e-gcp-upgrade

@cgwalters
Member Author

Upgrade failures are known flakes, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1812142
I'm just going to pull out the
/override e2e-gcp-upgrade
hammer 🔨 .

@openshift-ci-robot
Contributor

@cgwalters: /override requires a failed status context to operate on.
The following unknown contexts were given:

  • e2e-gcp-upgrade

Only the following contexts were expected:

  • ci/prow/e2e-aws
  • ci/prow/e2e-aws-scaleup-rhel7
  • ci/prow/e2e-gcp-op
  • ci/prow/e2e-gcp-upgrade
  • ci/prow/e2e-metal-ipi
  • ci/prow/images
  • ci/prow/unit
  • ci/prow/verify
  • tide

In response to this:

Upgrade failures are known flakes, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1812142
I'm just going to pull out the
/override e2e-gcp-upgrade
hammer 🔨 .

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/override ci/prow/e2e-gcp-upgrade

@openshift-ci-robot
Contributor

@cgwalters: Overrode contexts on behalf of cgwalters: ci/prow/e2e-gcp-upgrade

In response to this:

/override ci/prow/e2e-gcp-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit ef127d4 into openshift:master May 29, 2020
@openshift-ci-robot
Contributor

@cgwalters: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1762. Bugzilla bug 1841255 has been moved to the MODIFIED state.

In response to this:

Bug 1841255: machine-config-daemon-firstboot.service: Make idempotent and block kubelet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Contributor

@sinnykumari you added that BindsTo= in a465088 - do you remember why?

#904 (comment) was the reason, i.e. without BindsTo it caused a double reboot. I will check whether that is still the case now.

@sinnykumari
Contributor

Tested with this PR - I don't see a double reboot while applying kargs as day 1, which is good!
Not sure what changed in between, but we no longer run the m-c-d-host service on first boot, which used to race with the m-c-d-firstboot service.

@runcom
Member

runcom commented Jun 9, 2020

@cgwalters @jlebon this PR is likely the cause of https://bugzilla.redhat.com/show_bug.cgi?id=1842906 - Jerry did an amazing job debugging it, and it seems the removal of BindsTo causes the firstboot provisioning to run again, which then triggers a backwards upgrade to the very initial state.
Was it required to get rid of BindsTo here?

@cgwalters
Member Author

Jerry did an amazing job debugging it, and it seems the removal of BindsTo causes the firstboot provisioning to run again, which then triggers a backwards upgrade to the very initial state.

The current code has

	// Removing this file signals completion of the initial MC processing.
	if err := os.Rename(constants.MachineConfigEncapsulatedPath, constants.MachineConfigEncapsulatedBakPath); err != nil {
		return errors.Wrap(err, "failed to rename encapsulated MachineConfig after processing on firstboot")
	}

	dn.skipReboot = false
	return dn.reboot(fmt.Sprintf("Completing firstboot provisioning to %s", mc.GetName()))

So that should ensure that ConditionPathExists=/etc/ignition-machine-config-encapsulated.json no longer triggers.

Something must be going wrong that breaks that.

@runcom
Member

runcom commented Jun 9, 2020

There's one last comment in https://bugzilla.redhat.com/show_bug.cgi?id=1842906 which clarifies what's happening better than I did in my previous comment
