OCPBUGS-17788: Improved error handling for missing MC #4096

cdoern · 2024-01-03T21:07:38Z

The error reported when a MachineConfig is missing is very vague, and it usually has little to do with the actual cause. Specify steps users can take to resolve the situation.

openshift-ci-robot · 2024-01-03T21:07:45Z

@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

the error for when a MachineConfig is missing is very vague. Usually the error has little to do with the cause. Specify steps users can take to alleviate the situations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

cdoern · 2024-01-04T14:21:36Z

/jira refresh

openshift-ci-robot · 2024-01-04T14:21:43Z

@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

cdoern · 2024-01-04T14:21:44Z

/retest-required

sergiordlr · 2024-01-08T13:00:39Z

To trigger the error, we created a MachineConfig and, while the node was rebooting, ran this command to repeatedly delete the rendered MC that the node needed:

watch -n 1 oc delete mc rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b

Eventually, the following errors are displayed in the daemon logs:

I0108 11:29:26.953477    1561 update.go:1608] Deleting stale data
I0108 11:29:26.953572    1561 update.go:2361] Removing SIGTERM protection
E0108 11:29:26.953616    1561 writer.go:226] Marking Degraded due to: A rendered MC was removed. Remove all currentConfig or desiredConfig annotations from the nodes using this MC. Also delete /etc/machine-config-daemon/currentconfig: could not apply update: error processing state and configs. Error: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found 
E0108 11:30:21.528565    1561 writer.go:226] Marking Degraded due to: prepping update: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found
I0108 11:31:21.547458    1561 daemon.go:711] Transitioned from degraded/unreconcilable reason A rendered MC was removed. Remove all currentConfig or desiredConfig annotations from the nodes using this MC. Also delete /etc/machine-config-daemon/currentconfig: could not apply update: error processing state and configs. Error: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found  -> prepping update: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found

Once we stopped deleting the needed MC, the cluster was able to apply the configuration and stopped being degraded.

Since we are only improving the log messages in this PR, I think that this small test is enough to add the qe-approved label.

/label qe-approved

openshift-ci-robot · 2024-01-08T13:00:45Z

@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

the error for when a MachineConfig is missing is very vague. Usually the error has little to do with the cause. Specify steps users can take to alleviate the situations.

cdoern · 2024-01-08T14:32:15Z

/retest-required

cdoern · 2024-01-09T18:51:28Z

/override ci/prow/e2e-gcp-op-single-node

openshift-ci · 2024-01-09T18:51:44Z

@cdoern: Overrode contexts on behalf of cdoern: ci/prow/e2e-gcp-op-single-node

In response to this:

/override ci/prow/e2e-gcp-op-single-node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang · 2024-01-09T22:50:30Z

pkg/daemon/daemon.go

+	if err != nil && apierrors.IsNotFound(err) {
+		// if this is first run, the MC we went to get DNE this means there was a content mismatch
+		// bwtn what user provided and what was rendered. State that here.
+		return fmt.Errorf("User provided configuration caused a mismatch in the rendered config. Please compare /etc/mcs-machine-config-content.json and the existing rendered config. %w", err)


So, this error actually triggers on the currentConfig, meaning that we never reach this line when we hit this issue.

I think you would need to add it to almost the beginning of the function, in getStateAndConfigs return if you want this. Maybe a better phrasing would be:

Could not find config %v in-cluster, this likely indicates the MachineConfigs in-cluster have changed during the install process. If you are seeing this when installing the cluster, please compare the in-cluster rendered machineconfigs to ...

or is that a bit too wordy? You are theoretically able to hit this later on if you delete a config while the node is rebooting, for example, so I don't want to confuse the user

yuqi-zhang · 2024-01-09T22:55:13Z

pkg/daemon/daemon.go

@@ -792,6 +792,10 @@ func (dn *Daemon) syncNode(key string) error {
 		}

 		if err := dn.triggerUpdate(ufc.currentConfig, ufc.desiredConfig, ufc.currentImage, ufc.desiredImage); err != nil {
+			if apierrors.IsNotFound(err) {
+				// if MC was not found, let user know where they can find more info on this.
+				return fmt.Errorf("A rendered MC was removed. Remove all currentConfig or desiredConfig annotations from the nodes using this MC. Also delete /etc/machine-config-daemon/currentconfig: %w ", err)


Hmm, this (and the next one) reads less like an error message and more like "what to do in case of this error".

The phrasing sounds like we want the user to delete the annotation entirely (which I think will break the node as well?) but telling them to switch to another rendered config might just get them into a worse spot if they are uncertain what they are doing.

It's probably best to either somehow use an alert->runbook method to self help or maybe just clarify what the error is here. It's better that they can reach out to support with the issue (a MachineConfig seems to have been deleted)

Also I feel like maybe we'd hit this more in the prepUpdateFromCluster above or other locations and not in this function? I feel like the timing would be very tight

cdoern · 2024-01-11T19:28:12Z

@yuqi-zhang i'll convert it to an alert

cdoern · 2024-01-17T14:39:28Z

/retest-required

cdoern · 2024-01-17T21:06:09Z

/test e2e-gcp-op-layering unit

cdoern · 2024-01-23T21:08:19Z

@yuqi-zhang can you PTAL at this? converted it to an alert

yuqi-zhang

Overall makes sense, some minor questions inline

install/0000_90_machine-config-operator_01_prometheus-rules.yaml

pkg/daemon/daemon.go

cdoern · 2024-01-25T14:48:07Z

/test verify

cheesesashimi · 2024-01-25T15:00:01Z

Overall, this looks good. I have some reservations about only returning a specific value from dn.triggerUpdate() (and everything else in that call chain) when an error has occurred. Instead, it would be more idiomatic to put the name of the missing MachineConfig into a custom error type and extract it when needed. Here's a way to do this:

// First, we create a custom error type to hold the missing MachineConfig name.
type ErrMissingMachineConfig struct {
	missingMC string
}

// Optional constructor.
func newErrMissingMachineConfig(missingMC string) error {
	return &ErrMissingMachineConfig{
		missingMC: missingMC,
	}
}

// This implements the error interface within Go.
func (e *ErrMissingMachineConfig) Error() string {
	return fmt.Sprintf("missing MachineConfig %s", e.missingMC)
}

// This is an optional accessor to get the missing MachineConfig.
func (e *ErrMissingMachineConfig) MissingMachineConfig() string {
	return e.missingMC
}

Next, within triggerUpdate() (and everything in that call path), we do something like this instead:

func (dn *Daemon) triggerUpdate() error {
	// ...
	desiredConfig, err = dn.mcLister.Get(dcAnnotation)
	if err != nil {
		// errors.Join() is a standard-library function added in Go 1.20, see: https://pkg.go.dev/errors#example-Join
		// It has a similar purpose to the AggregateError type found here: https://github.com/kubernetes/apimachinery/blob/master/pkg/util/errors/errors.go#L41-L46
		return errors.Join(newErrMissingMachineConfig(dcAnnotation), err)
	}
	// ...
}

Finally, to determine whether we have an ErrMissingMachineConfig, we can examine the error chain like this:

func caller() {
	err := dn.triggerUpdate()
	if err != nil {
		var missingMCErr *ErrMissingMachineConfig
		// Here, we can check if we have an ErrMissingMachineConfig error, extract
		// the missing MachineConfig from it, and report it. All without needing to
		// add an additional parameter onto everything :).
		if errors.As(err, &missingMCErr) {
			mcdMissingMC.WithLabelValues(missingMCErr.MissingMachineConfig()).Inc()
		}
	}

	// ...
}

cheesesashimi

Overall, this looks good and we're almost there! I have a few additional suggestions in the meantime.

pkg/daemon/daemon.go

the error for when a MachineConfig is missing is very vague. Usually the error has little to do with the cause. Specify steps users can take to alleviate the situations. add this in the form of a metric and an alert if the metric gets triggered.

Signed-off-by: Charlie Doern <cdoern@redhat.com>

cheesesashimi · 2024-01-26T14:31:06Z

/retest-required

cheesesashimi · 2024-01-26T16:38:07Z

/lgtm
/approve
/override ci/prow/e2e-gcp-op-single-node

openshift-ci · 2024-01-26T16:40:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cdoern, cheesesashimi, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [cdoern,cheesesashimi,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2024-01-26T16:40:48Z

@cheesesashimi: Overrode contexts on behalf of cheesesashimi: ci/prow/e2e-gcp-op-single-node

In response to this:

/lgtm
/approve
/override ci/prow/e2e-gcp-op-single-node

openshift-ci · 2024-01-26T19:29:47Z

@cdoern: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | `9e0bc36` | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | `9e0bc36` | link | false | `/test e2e-azure-ovn-upgrade-out-of-change` |

openshift-ci-robot · 2024-01-26T19:33:57Z

@cdoern: Jira Issue OCPBUGS-17788: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#4096

Jira Issue OCPBUGS-17788 has been moved to the MODIFIED state.

In response to this:

the error for when a MachineConfig is missing is very vague. Usually the error has little to do with the cause. Specify steps users can take to alleviate the situations.

openshift-merge-robot · 2024-01-31T12:49:18Z

Fix included in accepted release 4.16.0-0.nightly-2024-01-31-073538

openshift-ci bot requested review from cgwalters and dkhater-redhat January 3, 2024 21:09

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 3, 2024

cdoern force-pushed the proxy-err branch from 980436f to 8d3ccf9 Compare January 3, 2024 21:13

openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jan 4, 2024

openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 4, 2024

openshift-ci bot requested a review from sergiordlr January 4, 2024 14:22

openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jan 8, 2024

yuqi-zhang reviewed Jan 9, 2024

cdoern force-pushed the proxy-err branch 2 times, most recently from a6bff82 to 249d337 Compare January 15, 2024 15:22

yuqi-zhang approved these changes Jan 23, 2024

cdoern force-pushed the proxy-err branch from 249d337 to f06fb5a Compare January 25, 2024 14:03

cdoern force-pushed the proxy-err branch 2 times, most recently from 7faf3f5 to 33c459d Compare January 25, 2024 16:12

cheesesashimi suggested changes Jan 25, 2024

cdoern force-pushed the proxy-err branch from 33c459d to 9e0bc36 Compare January 25, 2024 19:31

openshift-ci bot assigned cheesesashimi Jan 26, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 26, 2024

openshift-merge-bot bot merged commit a460e63 into openshift:master Jan 26, 2024
14 of 16 checks passed
