
OCPBUGS-16871: MCO - currentConfig missing on the filesystem #3963

Merged
merged 1 commit into openshift:master on Oct 25, 2023

Conversation

inesqyx
Contributor

@inesqyx inesqyx commented Oct 10, 2023

- What I did
Set up a fallback for on-disk current config reading, so that the call reads the current config from the node annotation when the currentConfig is missing from the filesystem

- How to verify it
Before:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: DEGRADED

After:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: update went through; the mcp did not get degraded; currentConfig got written back to the disk at some point
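The fallback described above can be sketched in Go. This is a minimal illustration under stated assumptions, not the actual MCO code: `readCurrentConfig` and the annotation callback are hypothetical stand-ins for the real on-disk read and node-annotation lookup.

```go
// Hypothetical sketch of the currentConfig fallback. readCurrentConfig and
// the annotation callback are illustrative names, not real MCO helpers.
package main

import (
	"errors"
	"fmt"
	"os"
)

// readCurrentConfig prefers the on-disk file and falls back to the node
// annotation only when the file is missing; any other read error is fatal.
func readCurrentConfig(path string, annotation func() (string, error)) (string, error) {
	data, err := os.ReadFile(path)
	if err == nil {
		return string(data), nil
	}
	if !errors.Is(err, os.ErrNotExist) {
		return "", fmt.Errorf("could not read %s: %w", path, err)
	}
	// File was deleted (e.g. by hand): fall back to the node annotation.
	return annotation()
}

func main() {
	fromAnnotation := func() (string, error) { return "rendered-worker-abc123", nil }
	cfg, err := readCurrentConfig("/nonexistent/currentconfig", fromAnnotation)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg) // falls back to the annotation value
}
```

The key design point is that only `os.ErrNotExist` triggers the fallback; permission errors and other I/O failures still surface, so the daemon does not silently mask real problems.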

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 10, 2023
@openshift-ci-robot
Contributor

@inesqyx: This pull request references Jira Issue OCPBUGS-16871, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did
Set up a fallback for on-disk current config reading, so that the call reads the current config from the node annotation when the currentConfig is missing from the filesystem

- How to verify it
Before:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: DEGRADED

After:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: update went through; the mcp did not get degraded; currentConfig got written back to the disk at some point

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@inesqyx
Contributor Author

inesqyx commented Oct 10, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 10, 2023
@openshift-ci-robot
Contributor

@inesqyx: This pull request references Jira Issue OCPBUGS-16871, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@inesqyx
Contributor Author

inesqyx commented Oct 10, 2023

/retest-required

@inesqyx
Contributor Author

inesqyx commented Oct 10, 2023

/test okd-scos-e2e-aws-ovn

1 similar comment
@inesqyx
Contributor Author

inesqyx commented Oct 11, 2023

/test okd-scos-e2e-aws-ovn

if currentOnDisk == nil {
currentOnDisk, err = dn.substituteMissingODC()
if err != nil {
return fmt.Errorf("could not get the state: %w", err)
Contributor

This will duplicate the error message inside the function. I think you can just return err directly here.
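The reviewer's point is that wrapping an error whose message already carries the same prefix repeats it. A small illustration, with made-up function names:

```go
// Illustrates duplicated error wrapping: the callee already prefixes the
// error, so wrapping again in the caller repeats the message.
// inner/outerDuplicated/outerDirect are illustrative names only.
package main

import "fmt"

func inner() error {
	return fmt.Errorf("could not get the state: %w", fmt.Errorf("file missing"))
}

func outerDuplicated() error {
	if err := inner(); err != nil {
		return fmt.Errorf("could not get the state: %w", err) // duplicates the prefix
	}
	return nil
}

func outerDirect() error {
	return inner() // the reviewer's suggestion: just return err
}

func main() {
	fmt.Println(outerDuplicated()) // could not get the state: could not get the state: file missing
	fmt.Println(outerDirect())     // could not get the state: file missing
}
```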

@@ -2054,6 +2084,13 @@ func (dn *Daemon) prepUpdateFromCluster() (*updateFromCluster, error) {
return nil, err
}

if odc == nil {
Contributor

Question: why add this here? I think if the on-disk config is empty, this function should be fine, and we shouldn't make assumptions about the current state.

Contributor Author

@inesqyx inesqyx Oct 19, 2023

Took another look due to the panic Sergio reported. I think it is both necessary and reasonable to assume odc = current config + image read from the node annotation when the odc is missing from the disk.

It is necessary because, further down, several calls dereference the odc to get odc.currentImage and compare it to the desiredImage to determine whether an update is required. If odc = nil, dereferencing the nil pointer would panic. So to fully handle the odc = nil case, we also need a workaround for calls on an empty odc.

It is reasonable because the functions that call prepUpdateFromCluster, namely syncNode and runOnceFromMachineConfig, both run well ahead of the next update or well after the previous one. They are therefore in a "steady" state, where a missing odc can only result from proactive removal after the previous update (e.g. manual deletion). Assuming odc = current config + image from the node annotation will not block any update or skip any necessary update.

I also think this part of the code is a bit messy: for the image comparison it compares odc.currentImage vs. the desiredImage from the node annotation, but for the config comparison it compares the current and desired config both from the node annotation. By doing so, it is already assuming odc = current config from the node annotation.

I feel like it can be condensed into:

currentImage, err := getNodeAnnotationExt(dn.node, constants.CurrentImageAnnotationKey, true)
if err != nil {
	klog.Infof("%s is not set. any errors? %s", constants.CurrentImageAnnotationKey, err)
	return nil, err
}
if desiredImage == currentImage && desiredConfigName == currentConfigName {
	if state == constants.MachineConfigDaemonStateDone {
		// No actual update to the config
		klog.V(2).Info("No updating is required")
		return nil, nil
	}
	// This seems like it shouldn't happen...let's just warn for now.
	klog.Warningf("current+desiredConfig is %s, current+desiredImage is %s but state is %s", currentConfigName, currentImage, state)
}

@@ -1905,7 +1935,7 @@ func (dn *Daemon) updateConfigAndState(state *stateAndConfigs) (bool, error) {
if err == nil {
state.currentConfig = odc.currentConfig
state.currentImage = odc.currentImage
} else {
} else if err != nil && !os.IsNotExist(err) {
Contributor

I think we should actually hard fail here, since this is during the update flow, and that file should be there. If it isn't, there's something wrong during the update itself.

Contributor Author

@inesqyx inesqyx Oct 12, 2023

I think this nil-allowing needs to be there because updateConfigAndState is called in two places: checkStateOnFirstRun & performPostConfigChangeAction (it makes sense to fail hard in performPostConfigChangeAction). But when it is called in checkStateOnFirstRun, which allows odc = nil, a hard fail will degrade the node later. I recall I added this because if state.current = state.desired, that also means the update is complete, so whether the currentconfig is on disk or not does not matter that much.

I1012 15:30:28.834489   37245 daemon.go:1872] Validating against current config rendered-worker-862813a853ebf95c28c17db49406ad29
I1012 15:30:28.834666   37245 daemon.go:1785] SSH key location ("/home/core/.ssh/authorized_keys.d/ignition") up-to-date!
I1012 15:30:29.170758   37245 rpm-ostree.go:308] Running captured: rpm-ostree kargs
I1012 15:30:29.202444   37245 update.go:1977] Validated on-disk state
I1012 15:30:29.203819   37245 daemon.go:1939] Error reading config from disk
E1012 15:30:29.203849   37245 writer.go:226] Marking Degraded due to: error reading config from disk: open /etc/machine-config-daemon/currentconfig: no such file or directory
I1012 15:30:31.218914   37245 daemon.go:670] Transitioned from state: Done -> Degraded
I1012 15:30:31.218944   37245 daemon.go:673] Transitioned from degraded/unreconcilable reason  -> error reading config from disk: open /etc/machine-config-daemon/currentconfig: no such file or directory
W1012 15:30:31.219031   37245 daemon.go:1674] Failed to persist NIC names: open /etc/systemd/network: no such file or directory
I1012 15:30:31.222251   37245 daemon.go:1345] Previous boot ostree-finalize-staged.service appears successful
I1012 15:30:31.222263   37245 daemon.go:1462] Current+desired config: rendered-worker-862813a853ebf95c28c17db49406ad29
I1012 15:30:31.222266   37245 daemon.go:1478] state: Degraded
I1012 15:30:31.222282   37245 update.go:1962] Running: rpm-ostree cleanup -r
Deployments unchanged.
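The tolerance pattern discussed above (treat a missing currentconfig file as non-fatal on first run, but surface any other read error) can be sketched like this. The function name and path are illustrative, not the actual MCO code:

```go
// Sketch of tolerating a missing on-disk config: a missing file is
// reported via the bool, any other error is fatal. loadOnDiskConfig
// is a hypothetical name, not a real MCO function.
package main

import (
	"fmt"
	"os"
)

func loadOnDiskConfig(path string) (cfg string, found bool, err error) {
	data, err := os.ReadFile(path)
	if err == nil {
		return string(data), true, nil
	}
	if os.IsNotExist(err) {
		// Missing file: the caller may fall back to node annotations
		// instead of marking the node Degraded.
		return "", false, nil
	}
	return "", false, fmt.Errorf("error reading config from disk: %w", err)
}

func main() {
	cfg, found, err := loadOnDiskConfig("/nonexistent/currentconfig")
	fmt.Println(cfg, found, err)
}
```

A caller in the first-run path can then check `found` and fall back, while the post-config-change path can treat `!found` as a hard failure, matching the split the review discusses.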

@@ -1190,6 +1197,21 @@ func (dn *Daemon) onConfigDrift(err error) {
}
}

// substituteMissingODC fetch the current config through node annotations to respond to getCurrentConfigDisk
// calls where the ODC is missing due to manual deletion and other reasons.
func (dn *Daemon) substituteMissingODC() (*onDiskConfig, error) {
Contributor

I think instead of substituteMissingODC you can just call this getCurrentConfigFromNode or similar, to make it more descriptive of the functionality

if odc == nil {
odc, err = dn.substituteMissingODC()
if err != nil {
dn.exitCh <- fmt.Errorf("could not get current config from disk: %w", err)
Contributor

Same as above, I think this will duplicate error message here

@inesqyx
Contributor Author

inesqyx commented Oct 13, 2023

/retest-required

Contributor

@yuqi-zhang yuqi-zhang left a comment

a few nits on formatting, otherwise lgtm

return fmt.Errorf("could not apply update: setting node's state to Done failed. Error: %w", err)
}

if missingODC {
return fmt.Errorf("error updating state.currentconfig from on-disck currentconfig")
Contributor

nit: typo in on-disk

@@ -1887,27 +1917,34 @@ func (dn *Daemon) isInDesiredConfig(state *stateAndConfigs) bool {
}

// updateConfigAndState updates node to desired state, labels nodes as done and uncordon
func (dn *Daemon) updateConfigAndState(state *stateAndConfigs) (bool, error) {
func (dn *Daemon) updateConfigAndState(state *stateAndConfigs) (bool, bool, error) {

Contributor

nit: extra whiteline

currentConfig: state.currentConfig,
}
return tempConfig, nil

Contributor

nit: extra whiteline

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2023
@sergiordlr

sergiordlr commented Oct 17, 2023

While pre-merge testing, we got a panic following these steps

  1. Remove the /etc/machine-config-daemon/currentconfig file in a node
  2. Create the force file touch /run/machine-config-daemon-force in the same node

These are the MCD logs for the node:

I1017 09:17:24.049626   25183 daemon.go:1462] Current+desired config: rendered-worker-d3f5b6bf6e7c67a0cbed03d189b8130d
I1017 09:17:24.049631   25183 daemon.go:1478] state: Done
I1017 09:17:24.058252   25183 config_drift_monitor.go:246] Config Drift Monitor started
E1017 09:17:24.058406   25183 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 138 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x211d780?, 0x3a0f060})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00078c2a0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x211d780, 0x3a0f060})
	/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).prepUpdateFromCluster(0xc0000e2000)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:2109 +0x2f9
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).syncNode(0xc0000e2000, {0xc000775000, 0x38})
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:717 +0xa65
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc0000e2000)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:543 +0xda
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).worker(...)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:532
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004af2c0?, {0x2829100, 0xc0007daf00}, 0x1, 0xc0004ae300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004af2c0?, 0x3b9aca00, 0x0, 0x80?, 0xc000460710?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004607d0?, 0x91b246?, 0xc00023e700?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:1175 +0x4aa
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1e230b9]

goroutine 138 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00078c2a0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x211d780, 0x3a0f060})
	/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).prepUpdateFromCluster(0xc0000e2000)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:2109 +0x2f9
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).syncNode(0xc0000e2000, {0xc000775000, 0x38})
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:717 +0xa65
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).processNextWorkItem(0xc0000e2000)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:543 +0xda
github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).worker(...)
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:532
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0004af2c0?, {0x2829100, 0xc0007daf00}, 0x1, 0xc0004ae300)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0004af2c0?, 0x3b9aca00, 0x0, 0x80?, 0xc000460710?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004607d0?, 0x91b246?, 0xc00023e700?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).Run
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:1175 +0x4aa

@yuqi-zhang
Contributor

/lgtm

/hold

For another round of QE verification

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 19, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: inesqyx, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@inesqyx
Contributor Author

inesqyx commented Oct 20, 2023

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2023

@inesqyx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn b4f8825 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-gcp-op-layering b4f8825 link false /test e2e-gcp-op-layering

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sergiordlr

Verified using IPI on GCP

Steps:

  1. Upgrade from 4.14 -> 4.15+fix. PASS
  2. Remove /etc/machine-config-daemon/currentconfig and use force file. PASS (the configuration file is not restored, though that appears to be intended)
  3. Remove /etc/machine-config-daemon/currentconfig and create a MC with reboot+drain. PASS
  4. Remove /etc/machine-config-daemon/currentconfig and create a MC with osImage. PASS.
  5. Remove /etc/machine-config-daemon/currentconfig and create a MC without reboot nor drain (password). PASS.
  6. Remove /etc/machine-config-daemon/currentconfig and use on cluster build functionality. PASS.
  7. Remove /etc/machine-config-daemon/currentconfig in master pool and use a password MC (no reboot no drain). PASS.

We can add the qe-approved label.

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Oct 25, 2023
@openshift-ci-robot
Contributor

@inesqyx: This pull request references Jira Issue OCPBUGS-16871, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

- What I did
Set up a fallback for on-disk current config reading, so that the call reads the current config from the node annotation when the currentConfig is missing from the filesystem

- How to verify it
Before:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: DEGRADED

After:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: update went through; the mcp did not get degraded; currentConfig got written back to the disk at some point

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@inesqyx
Contributor Author

inesqyx commented Oct 25, 2023

/retest-required

@inesqyx
Contributor Author

inesqyx commented Oct 25, 2023

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 25, 2023
@inesqyx
Contributor Author

inesqyx commented Oct 25, 2023

/retest-required

@yuqi-zhang
Contributor

Hmm, tide status seems stuck?

/hold cancel

see if that helps kick it

@openshift-ci openshift-ci bot merged commit 76e4c18 into openshift:master Oct 25, 2023
12 of 14 checks passed
@openshift-ci-robot
Contributor

@inesqyx: Jira Issue OCPBUGS-16871: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-16871 has been moved to the MODIFIED state.

In response to this:

- What I did
Set up a fallback for on-disk current config reading, so that the call reads the current config from the node annotation when the currentConfig is missing from the filesystem

- How to verify it
Before:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: DEGRADED

After:

  1. remove the currentConfig (/etc/machine-config-daemon/currentConfig) from the node
  2. apply an update
  3. check the status of the MCO
    RESULT: update went through; the mcp did not get degraded; currentConfig got written back to the disk at some point

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-10-26-064434

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR