New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-16871: MCO - currentConfig missing on the filesystem #3963
Conversation
@inesqyx: This pull request references Jira Issue OCPBUGS-16871, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@inesqyx: This pull request references Jira Issue OCPBUGS-16871, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
4dcc55e
to
35128c8
Compare
/retest-required |
/test okd-scos-e2e-aws-ovn |
1 similar comment
/test okd-scos-e2e-aws-ovn |
pkg/daemon/daemon.go
Outdated
if currentOnDisk == nil { | ||
currentOnDisk, err = dn.substituteMissingODC() | ||
if err != nil { | ||
return fmt.Errorf("could not get the state: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will duplicate with the error message inside of the function. I think you can just return err directly here.
pkg/daemon/daemon.go
Outdated
@@ -2054,6 +2084,13 @@ func (dn *Daemon) prepUpdateFromCluster() (*updateFromCluster, error) { | |||
return nil, err | |||
} | |||
|
|||
if odc == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: why add this here? I think if the on-disk config is empty, this function should be fine, and we shouldn't make assumptions about the current state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took another look due to the panic Sergio has reported. I think it is necessary and reasonable to assume odc = current config + image reading from node annotation if the odc is missing from the disk.
It is necessary because below, there are several calls that dereference the odc to get the odc.currentImage and compares it to the desiredImage to determine whether an update is required. As a result, in the case that odc = nil, dereference the nil pointer would produce a panic. As a result, to fully take care of the odc = nil case, we also need to find a workaround for calls to an empty odc.
It is reasonable because by looking at the func that calls prepUpdateFromCluster, which is syncNode and runOnceFromMachineConfig, both are at places that are way ahead the next update or way after the previous update. As a result, they are at the "steady" state where the missing odc will only result from proactive removal after the previous update (e.g. human manipulation, etc). Assuming odc = current config + image from node annotation will not block any update or skip any necessary update.
I also think that this part of code is a bit messily written as for the image comparison, it is comparing odc.currentImage vs. desiredImage from node annotation, but for the config comparison, it is comparing current and desired config both from node annotation. By doing so, I think it is already assuming odc = current config from node annotation.
I feel like it can be condensed into:
currentImage, err := getNodeAnnotationExt(dn.node, constants.CurrentImageAnnotationKey, true)
if err != nil {
klog.Infof("%s is not set. any errors? %s", constants.CurrentImageAnnotationKey, err)
return nil, err
}
if desiredImage == currentImage && desiredConfigName == currentConfigName{
if state == constants.MachineConfigDaemonStateDone {
// No actual update to the config
klog.V(2).Info("No updating is required")
return nil, nil
}
// This seems like it shouldn't happen...let's just warn for now.
klog.Warningf("current+desiredConfig is %s, current+desiredImage is %s but state is %s", currentConfigName, currentImage, state)
}
@@ -1905,7 +1935,7 @@ func (dn *Daemon) updateConfigAndState(state *stateAndConfigs) (bool, error) { | |||
if err == nil { | |||
state.currentConfig = odc.currentConfig | |||
state.currentImage = odc.currentImage | |||
} else { | |||
} else if err != nil && !os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should actually hard fail here, since this is during the update flow, and that file should be there. If it isn't, there's something wrong during the update itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this nill allowing needs to be there because the function updateConfigAndState is called at two places: checkStateOnFirstRun & performPostConfigChangeAction (it makes sense to fail hard in performPostConfigChangeAction). But when called in checkStateOnFristRun which allows odc = nil, hard fail will degrade the node later. I recall I add this in b/c I think if the state.current = state.desired, it also means that the update is complete. Whether the currentconfig is on disk or not does not matter that much.
I1012 15:30:28.834489 37245 daemon.go:1872] Validating against current config rendered-worker-862813a853ebf95c28c17db49406ad29
I1012 15:30:28.834666 37245 daemon.go:1785] SSH key location ("/home/core/.ssh/authorized_keys.d/ignition") up-to-date!
I1012 15:30:29.170758 37245 rpm-ostree.go:308] Running captured: rpm-ostree kargs
I1012 15:30:29.202444 37245 update.go:1977] Validated on-disk state
I1012 15:30:29.203819 37245 daemon.go:1939] Error reading config from disk
E1012 15:30:29.203849 37245 writer.go:226] Marking Degraded due to: error reading config from disk: open /etc/machine-config-daemon/currentconfig: no such file or directory
I1012 15:30:31.218914 37245 daemon.go:670] Transitioned from state: Done -> Degraded
I1012 15:30:31.218944 37245 daemon.go:673] Transitioned from degraded/unreconcilable reason -> error reading config from disk: open /etc/machine-config-daemon/currentconfig: no such file or directory
W1012 15:30:31.219031 37245 daemon.go:1674] Failed to persist NIC names: open /etc/systemd/network: no such file or directory
I1012 15:30:31.222251 37245 daemon.go:1345] Previous boot ostree-finalize-staged.service appears successful
I1012 15:30:31.222263 37245 daemon.go:1462] Current+desired config: rendered-worker-862813a853ebf95c28c17db49406ad29
I1012 15:30:31.222266 37245 daemon.go:1478] state: Degraded
I1012 15:30:31.222282 37245 update.go:1962] Running: rpm-ostree cleanup -r
Deployments unchanged.
pkg/daemon/daemon.go
Outdated
@@ -1190,6 +1197,21 @@ func (dn *Daemon) onConfigDrift(err error) { | |||
} | |||
} | |||
|
|||
// substituteMissingODC fetch the current config through node annotations to respond to getCurrentConfigDisk | |||
// calls where the ODC is missing due to manual deletion and other reasons. | |||
func (dn *Daemon) substituteMissingODC() (*onDiskConfig, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of substituteMissingODC
you can just call this getCurrentConfigFromNode
or similar, to make it more descriptive of the functionality
pkg/daemon/daemon.go
Outdated
if odc == nil { | ||
odc, err = dn.substituteMissingODC() | ||
if err != nil { | ||
dn.exitCh <- fmt.Errorf("could not get current config from disk: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, I think this will duplicate error message here
35128c8
to
8d57504
Compare
/retest-required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few nits on formatting, otherwise lgtm
pkg/daemon/update.go
Outdated
return fmt.Errorf("could not apply update: setting node's state to Done failed. Error: %w", err) | ||
} | ||
|
||
if missingODC { | ||
return fmt.Errorf("error updating state.currentconfig from on-disck currentconfig") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo in on-disk
pkg/daemon/daemon.go
Outdated
@@ -1887,27 +1917,34 @@ func (dn *Daemon) isInDesiredConfig(state *stateAndConfigs) bool { | |||
} | |||
|
|||
// updateConfigAndState updates node to desired state, labels nodes as done and uncordon | |||
func (dn *Daemon) updateConfigAndState(state *stateAndConfigs) (bool, error) { | |||
func (dn *Daemon) updateConfigAndState(state *stateAndConfigs) (bool, bool, error) { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra whiteline
pkg/daemon/daemon.go
Outdated
currentConfig: state.currentConfig, | ||
} | ||
return tempConfig, nil | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra whiteline
While pre-merge testing, we got a panic following these steps
This is the MCD logs for the node:
|
8d57504
to
b4f8825
Compare
/lgtm /hold For another round of QE verification |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: inesqyx, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest-required |
@inesqyx: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Verified using IPI on GCP Steps:
We can add the qe-approved label. /label qe-approved |
@inesqyx: This pull request references Jira Issue OCPBUGS-16871, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest-required |
/unhold |
/retest-required |
Hmm, tide status seems stuck? /hold cancel see if that helps kick it |
@inesqyx: Jira Issue OCPBUGS-16871: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-16871 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Fix included in accepted release 4.15.0-0.nightly-2023-10-26-064434 |
- What I did
Set up a backup plan for on-disk current config reading, so that the call will reads the current config from the node annotation when the currentConfig is missing on the filesystem
- How to verify it
Before:
RESULT: DEGRADED
After:
RESULT: update went through; the mcp did not get degraded; currentConfig got written back to the disk at some point