OCPBUGS-17788: Improved error handling for missing MC #4096
Conversation
@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
/retest-required
To trigger the error, we created a machine config and, while the node was rebooting, we ran this command to repeatedly remove the needed rendered MC.
Eventually, the following error is displayed in the daemon logs.
Once we stopped deleting the needed MC, the cluster was able to apply the configuration and stopped being degraded. Since we are only improving the log messages in this PR, I think this small test is enough to add the qe-approved label.
/label qe-approved
@cdoern: This pull request references Jira Issue OCPBUGS-17788, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
/retest-required
/override ci/prow/e2e-gcp-op-single-node
@cdoern: Overrode contexts on behalf of cdoern: ci/prow/e2e-gcp-op-single-node.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
pkg/daemon/daemon.go (Outdated)

```go
if err != nil && apierrors.IsNotFound(err) {
	// If this is the first run and the MC we went to get does not exist, this means there was a content
	// mismatch between what the user provided and what was rendered. State that here.
	return fmt.Errorf("User provided configuration caused a mismatch in the rendered config. Please compare /etc/mcs-machine-config-content.json and the existing rendered config. %w", err)
}
```
So, this error here actually triggers on the currentConfig, meaning that we don't get to this line if we have this issue. I think you would need to add it almost at the beginning of the function, in the getStateAndConfigs return, if you want this. Maybe a better phrasing would be:
Could not find config %v in-cluster; this likely indicates the MachineConfigs in-cluster have changed during the install process. If you are seeing this when installing the cluster, please compare the in-cluster rendered machineconfigs to ...
Or is that a bit too wordy? You are theoretically able to hit this later on if you delete a config while the node is rebooting, for example, so I don't want to confuse the user.
pkg/daemon/daemon.go (Outdated)

```diff
@@ -792,6 +792,10 @@ func (dn *Daemon) syncNode(key string) error {
 	}
 
 	if err := dn.triggerUpdate(ufc.currentConfig, ufc.desiredConfig, ufc.currentImage, ufc.desiredImage); err != nil {
+		if apierrors.IsNotFound(err) {
+			// if MC was not found, let user know where they can find more info on this.
+			return fmt.Errorf("A rendered MC was removed. Remove all currentConfig or desiredConfig annotations from the nodes using this MC. Also delete /etc/machine-config-daemon/currentconfig: %w", err)
+		}
```
Hmm, this (and the next one) feels less like an error message and more like "what to do in case of this error".
The phrasing sounds like we want the user to delete the annotation entirely (which I think would break the node as well?), and telling them to switch to another rendered config might just get them into a worse spot if they are uncertain what they are doing.
It's probably best to either use an alert -> runbook approach for self-help, or just clarify what the error is here. It's better that they can reach out to support with the issue (a MachineConfig seems to have been deleted).
Also, I feel like we'd hit this more in prepUpdateFromCluster above or in other locations, and not in this function? I feel like the timing would be very tight.
@yuqi-zhang I'll convert it to an alert.
Force-pushed from a6bff82 to 249d337.
/retest-required
/test e2e-gcp-op-layering unit
@yuqi-zhang can you PTAL at this? Converted it to an alert.
Overall makes sense, some minor questions inline
install/0000_90_machine-config-operator_01_prometheus-rules.yaml (Outdated; review thread resolved)
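For reference, a PrometheusRule of the sort added in that file generally looks like the sketch below. The metric name, alert name, and threshold here are assumptions for illustration (loosely modeled on the mcdMissingMC counter discussed in review), not the rule that actually merged:

```yaml
# Hypothetical sketch of a PrometheusRule; names and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: machine-config-daemon
spec:
  groups:
    - name: mcd-missing-mc
      rules:
        - alert: MCDMissingMC
          expr: mcd_missing_mc > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            message: "A rendered MachineConfig referenced by a node annotation was not found in-cluster."
```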
/test verify
Overall, this looks good. I have some reservations about only returning a specific value from triggerUpdate(), though. Instead, something like the following could work.

First, we create a custom error type to hold the missing MachineConfig name:

```go
type ErrMissingMachineConfig struct {
	missingMC string
}

// Optional constructor.
func newErrMissingMachineConfig(missingMC string) error {
	return &ErrMissingMachineConfig{
		missingMC: missingMC,
	}
}

// This implements the error interface within Go.
func (e *ErrMissingMachineConfig) Error() string {
	return fmt.Sprintf("missing MachineConfig %s", e.missingMC)
}

// This is an optional accessor to get the missing MachineConfig.
func (e *ErrMissingMachineConfig) MissingMachineConfig() string {
	return e.missingMC
}
```

Next, within triggerUpdate():

```go
func (dn *Daemon) triggerUpdate() error {
	// ...
	desiredConfig, err = dn.mcLister.Get(dcAnnotation)
	if err != nil {
		// errors.Join() is a new built-in, see: https://pkg.go.dev/errors#example-Join
		// It has a similar purpose to the AggregateError type found here:
		// https://github.com/kubernetes/apimachinery/blob/master/pkg/util/errors/errors.go#L41-L46
		return errors.Join(newErrMissingMachineConfig(dcAnnotation), err)
	}
	// ...
}
```

Finally, to determine whether we have an ErrMissingMachineConfig:

```go
func caller() {
	err := dn.triggerUpdate()
	if err != nil {
		var missingMCErr *ErrMissingMachineConfig
		// Here, we can check if we have an ErrMissingMachineConfig error, extract
		// the missing MachineConfig from it, and report it. All without needing to
		// add an additional parameter onto everything :).
		if errors.As(err, &missingMCErr) {
			mcdMissingMC.WithLabelValues(missingMCErr.MissingMachineConfig()).Inc()
		}
	}
	// ...
}
```
Force-pushed from 7faf3f5 to 33c459d.
Overall, this looks good and we're almost there! I have a few additional suggestions in the meantime.
The error for when a MachineConfig is missing is very vague, and usually the error has little to do with the cause. Specify steps users can take to alleviate the situation, and add this in the form of a metric plus an alert that fires when the metric is triggered.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
/retest-required
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cdoern, cheesesashimi, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@cheesesashimi: Overrode contexts on behalf of cheesesashimi: ci/prow/e2e-gcp-op-single-node.
@cdoern: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard.
Merged a460e63 into openshift:master.
@cdoern: Jira Issue OCPBUGS-17788: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-17788 has been moved to the MODIFIED state.
Fix included in accepted release 4.16.0-0.nightly-2024-01-31-073538