OCPBUGS-17788: Improved error handling for missing MC #4096

Merged
merged 1 commit into from Jan 26, 2024

Conversation

cdoern
Contributor

@cdoern cdoern commented Jan 3, 2024

The error reported when a MachineConfig is missing is very vague, and it usually has little to do with the cause. Specify steps users can take to remedy the situation.

@openshift-ci-robot added the jira/severity-moderate (Referenced Jira bug's severity is moderate for the branch this PR is targeting), jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type), and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels Jan 3, 2024
@openshift-ci-robot
Contributor

@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) Jan 3, 2024
@cdoern
Contributor Author

cdoern commented Jan 4, 2024

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) Jan 4, 2024
@openshift-ci-robot
Contributor

@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

@openshift-ci-robot removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) Jan 4, 2024
@cdoern
Contributor Author

cdoern commented Jan 4, 2024

/retest-required

@openshift-ci openshift-ci bot requested a review from sergiordlr January 4, 2024 14:22
@sergiordlr

To trigger the error, we created a machine config and, while the node was rebooting, ran this command to repeatedly remove the needed rendered MC:

watch -n 1 oc delete mc rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b

Eventually, the following error is displayed in the daemon logs:

I0108 11:29:26.953477    1561 update.go:1608] Deleting stale data
I0108 11:29:26.953572    1561 update.go:2361] Removing SIGTERM protection
E0108 11:29:26.953616    1561 writer.go:226] Marking Degraded due to: A rendered MC was removed. Remove all currentConfig or desiredConfig annotations from the nodes using this MC. Also delete /etc/machine-config-daemon/currentconfig: could not apply update: error processing state and configs. Error: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found 
E0108 11:30:21.528565    1561 writer.go:226] Marking Degraded due to: prepping update: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found
I0108 11:31:21.547458    1561 daemon.go:711] Transitioned from degraded/unreconcilable reason A rendered MC was removed. Remove all currentConfig or desiredConfig annotations from the nodes using this MC. Also delete /etc/machine-config-daemon/currentconfig: could not apply update: error processing state and configs. Error: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found  -> prepping update: machineconfig.machineconfiguration.openshift.io "rendered-worker-647fbfa9f3a2caebb9d4882c86942d9b" not found

Once we stopped deleting the needed MC, the cluster was able to apply the configuration and stopped being degraded.

Since we are only improving the log messages in this PR, I think that this small test is enough to add the qe-approved label.

/label qe-approved

@openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) Jan 8, 2024
@openshift-ci-robot
Contributor

@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

@cdoern
Contributor Author

cdoern commented Jan 8, 2024

/retest-required

@cdoern
Contributor Author

cdoern commented Jan 9, 2024

/override ci/prow/e2e-gcp-op-single-node

Contributor

openshift-ci bot commented Jan 9, 2024

@cdoern: Overrode contexts on behalf of cdoern: ci/prow/e2e-gcp-op-single-node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

if err != nil && apierrors.IsNotFound(err) {
	// If this is the first run and the MC we tried to get does not exist, there was a content mismatch
	// between what the user provided and what was rendered. State that here.
	return fmt.Errorf("User provided configuration caused a mismatch in the rendered config. Please compare /etc/mcs-machine-config-content.json and the existing rendered config. %w", err)

So, this error actually triggers on the currentConfig, meaning that we don't get to this line if we have this issue.

I think you would need to add it to almost the beginning of the function, in getStateAndConfigs return if you want this. Maybe a better phrasing would be:

Could not find config %v in-cluster, this likely indicates the MachineConfigs in-cluster have changed during the install process. If you are seeing this when installing the cluster, please compare the in-cluster rendered machineconfigs to ...

or is that a bit too wordy? You are theoretically able to hit this later on if you delete a config while the node is rebooting, for example, so I don't want to confuse the user

@@ -792,6 +792,10 @@ func (dn *Daemon) syncNode(key string) error {
	}

	if err := dn.triggerUpdate(ufc.currentConfig, ufc.desiredConfig, ufc.currentImage, ufc.desiredImage); err != nil {
		if apierrors.IsNotFound(err) {
			// If the MC was not found, let the user know where they can find more info on this.
			return fmt.Errorf("A rendered MC was removed. Remove all currentConfig or desiredConfig annotations from the nodes using this MC. Also delete /etc/machine-config-daemon/currentconfig: %w ", err)

Hmm, this (and the next one) feels less like error messages and more like "what to do in case of this error".

The phrasing sounds like we want the user to delete the annotation entirely (which I think will break the node as well?) but telling them to switch to another rendered config might just get them into a worse spot if they are uncertain what they are doing.

It's probably best to either somehow use an alert->runbook method to self help or maybe just clarify what the error is here. It's better that they can reach out to support with the issue (a MachineConfig seems to have been deleted)

Also I feel like maybe we'd hit this more in the prepUpdateFromCluster above or other locations and not in this function? I feel like the timing would be very tight

@cdoern
Contributor Author

cdoern commented Jan 11, 2024

@yuqi-zhang i'll convert it to an alert

@cdoern force-pushed the proxy-err branch 2 times, most recently from a6bff82 to 249d337, January 15, 2024 15:22
@cdoern
Contributor Author

cdoern commented Jan 17, 2024

/retest-required

@cdoern
Contributor Author

cdoern commented Jan 17, 2024

/test e2e-gcp-op-layering unit

@cdoern
Contributor Author

cdoern commented Jan 23, 2024

@yuqi-zhang can you PTAL at this? converted it to an alert

Contributor

@yuqi-zhang yuqi-zhang left a comment


Overall makes sense, some minor questions inline

@cdoern
Contributor Author

cdoern commented Jan 25, 2024

/test verify

@cheesesashimi
Member

cheesesashimi commented Jan 25, 2024

Overall, this looks good. I have some reservations about only returning a specific value from dn.triggerUpdate() (and everything else in that call chain) when an error has occurred. Instead, it would be more idiomatic to put the name of the missing MachineConfig into a custom error type and extract it when needed. Here's a way to do this:

// First, we create a custom error type to hold the missing MachineConfig name.
type ErrMissingMachineConfig struct {
	missingMC string
}

// Optional constructor.
func newErrMissingMachineConfig(missingMC string) error {
	return &ErrMissingMachineConfig{
		missingMC: missingMC,
	}
}

// This implements the error interface within Go.
func (e *ErrMissingMachineConfig) Error() string {
	return fmt.Sprintf("missing MachineConfig %s", e.missingMC)
}

// This is an optional accessor to get the missing MachineConfig.
func (e *ErrMissingMachineConfig) MissingMachineConfig() string {
	return e.missingMC
}

Next, within triggerUpdate() (and everything in that call path), we do something like this instead:

func (dn *Daemon) triggerUpdate() error {
	// ...
	desiredConfig, err = dn.mcLister.Get(dcAnnotation)
	if err != nil {
		// errors.Join() is a new built-in, see: https://pkg.go.dev/errors#example-Join
		// It has a similar purpose to the AggregateError type found here: https://github.com/kubernetes/apimachinery/blob/master/pkg/util/errors/errors.go#L41-L46
		return errors.Join(newErrMissingMachineConfig(dcAnnotation), err)
	}
	// ...
}

Finally, to determine whether we have an ErrMissingMachineConfig, we can examine the error chain like this:

func caller() {
	err := dn.triggerUpdate()
	if err != nil {
		var missingMCErr *ErrMissingMachineConfig
		// Here, we can check if we have an ErrMissingMachineConfig error, extract
		// the missing MachineConfig from it, and report it. All without needing to
		// add an additional parameter onto everything :).
		if errors.As(err, &missingMCErr) {
			mcdMissingMC.WithLabelValues(missingMCErr.MissingMachineConfig()).Inc()
		}
	}

	// ...
}

@cdoern force-pushed the proxy-err branch 2 times, most recently from 7faf3f5 to 33c459d, January 25, 2024 16:12
Member

@cheesesashimi cheesesashimi left a comment


Overall, this looks good and we're almost there! I have a few additional suggestions in the meantime.

the error for when a MachineConfig is missing is very vague. Usually the error has little to do with the cause. Specify steps users can take to alleviate the situations.

add this in the form of a metric and an alert if the metric gets triggered.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cheesesashimi
Member

/retest-required

@cheesesashimi
Member

/lgtm
/approve
/override ci/prow/e2e-gcp-op-single-node

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) Jan 26, 2024
Contributor

openshift-ci bot commented Jan 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cdoern, cheesesashimi, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cdoern,cheesesashimi,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Jan 26, 2024

@cheesesashimi: Overrode contexts on behalf of cheesesashimi: ci/prow/e2e-gcp-op-single-node

Contributor

openshift-ci bot commented Jan 26, 2024

@cdoern: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 9e0bc36 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 9e0bc36 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |

@openshift-merge-bot openshift-merge-bot bot merged commit a460e63 into openshift:master Jan 26, 2024
14 of 16 checks passed
@openshift-ci-robot
Contributor

@cdoern: Jira Issue OCPBUGS-17788: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-17788 has been moved to the MODIFIED state.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-01-31-073538

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR.
6 participants