Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2012969: emit kube events for skipped and upgraded OSes #3153

Conversation

cheesesashimi
Copy link
Member

To provide better information to what the MCD is doing, the MCD should emit an event whenever a MachineConfig specifies that a new OS will be installed as well as whenever it does not.

- What I did

I added two events, OSUpgradeApplied and OSUpgradeSkipped which are emitted whenever the MCD determines an OS upgrade is required or not.

- How to verify it

  1. Spin up a test cluster.
  2. Watch the logs for a given node, e.g.: $ oc logs -n openshift-machine-config-operator -c machine-config-daemon -f pod/machine-config-daemon-8hqvd.
  3. Create and apply a new MachineConfig.
  4. Watch for the following events in the log (may need to start the MCD with the -v4) flag:
I0518 19:51:24.703522    2623 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-130-248.ec2.internal", UID:"cdac4d45-1c5f-4e31-b106-b9d76583d624", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NoOSUpgradeAvailable' New MachineConfig (rendered-infra-d4288fbd661c4a61fbc31a4614a9271f) has same OS (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e42eb8c6aa523511d928b3dde3a5e0251e6baaba0b60ed00b431e17698556777) as old MachineConfig (rendered-infra-f02bc763e0574470ff44e34981d34e12); skipping OS upgrade

For the upgrade cases:

  1. Create a new MachineConfig with a new .spec.osImageURL value.
  2. Watch the logs for a given node, e.g.: $ oc logs -n openshift-machine-config-operator -c machine-config-daemon -f pod/machine-config-daemon-8hqvd.
  3. Apply the new MachineConfig.
  4. Watch for the upgrade event to be emitted in the MCD log.

- Description for the changelog
Emit a Kube event when an OS upgrade is applied or skipped

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 18, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 18, 2022

@cheesesashimi: This pull request references Bugzilla bug 2012969, which is invalid:

  • expected the bug to target the "4.11.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2012969: emit kube events for skipped and upgraded OSes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from cgwalters and jkyros May 18, 2022 19:57
@kikisdeliveryservice
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 18, 2022
@kikisdeliveryservice
Copy link
Contributor

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 18, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 18, 2022

@kikisdeliveryservice: This pull request references Bugzilla bug 2012969, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (rioliu@redhat.com), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file changed out from under you due to changes in the daemon in b6cfd42

/lgtm cancel

@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 19, 2022
@@ -379,6 +379,9 @@ func (dn *CoreOSDaemon) applyOSChanges(mcDiff machineConfigDiff, oldConfig, newC
MCDPivotErr.WithLabelValues(nodeName, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
return err
}
dn.recorder.Eventf(getNodeRef(dn.node), corev1.EventTypeNormal, "OSUpgradeApplied", "OS upgrade applied; new MachineConfig (%s) has new OS (%s)", newConfig.Name, newConfig.Spec.OSImageURL)
Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no more recorder so we need something like:

Suggested change
dn.recorder.Eventf(getNodeRef(dn.node), corev1.EventTypeNormal, "OSUpgradeApplied", "OS upgrade applied; new MachineConfig (%s) has new OS (%s)", newConfig.Name, newConfig.Spec.OSImageURL)
dn.nodeWriter.Eventf(corev1.EventTypeNormal, "OSUpgradeApplied", "OS upgrade applied; new MachineConfig (%s) has new OS (%s)", newConfig.Name, newConfig.Spec.OSImageURL)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh that's right! Good catch! I'll make the changes.

@@ -379,6 +379,9 @@ func (dn *CoreOSDaemon) applyOSChanges(mcDiff machineConfigDiff, oldConfig, newC
MCDPivotErr.WithLabelValues(nodeName, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
return err
}
dn.recorder.Eventf(getNodeRef(dn.node), corev1.EventTypeNormal, "OSUpgradeApplied", "OS upgrade applied; new MachineConfig (%s) has new OS (%s)", newConfig.Name, newConfig.Spec.OSImageURL)
} else {
dn.recorder.Eventf(getNodeRef(dn.node), corev1.EventTypeNormal, "OSUpgradeSkipped", "OS upgrade skipped; new MachineConfig (%s) has same OS (%s) as old MachineConfig (%s)", newConfig.Name, newConfig.Spec.OSImageURL, oldConfig.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Suggested change
dn.recorder.Eventf(getNodeRef(dn.node), corev1.EventTypeNormal, "OSUpgradeSkipped", "OS upgrade skipped; new MachineConfig (%s) has same OS (%s) as old MachineConfig (%s)", newConfig.Name, newConfig.Spec.OSImageURL, oldConfig.Name)
dn.nodeWriter.Eventf(corev1.EventTypeNormal, "OSUpgradeSkipped", "OS upgrade skipped; new MachineConfig (%s) has same OS (%s) as old MachineConfig (%s)", newConfig.Name, newConfig.Spec.OSImageURL, oldConfig.Name)

Copy link
Contributor

@sinnykumari sinnykumari May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, instead of "same OS ", it should be "same OS image"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. That sounds better.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/os-upgrade-events branch 2 times, most recently from 5a4d04b to 758385c Compare May 19, 2022 14:08
@cheesesashimi
Copy link
Member Author

/retest-required

@kikisdeliveryservice
Copy link
Contributor

weird everything is failing

/retest-required

@cheesesashimi
Copy link
Member Author

Found the source of the failures:

May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: Changes queued for next boot. Run "systemctl reboot" to start a reboot
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: I0519 16:17:25.268339    2149 update.go:2002] Removing SIGTERM protection
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: panic: runtime error: invalid memory address or nil pointer dereference
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1796d0a]
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: goroutine 1 [running]:
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: github.com/openshift/machine-config-operator/pkg/daemon.(*CoreOSDaemon).applyOSChanges(0xc0007534f0, {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0xc0004ba680, ...)
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/pkg/daemon/update.go:382 +0x68a
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).update(0xc0001ca600, 0xc00030e374, 0xc000102680)
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/pkg/daemon/update.go:573 +0xd1b
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: github.com/openshift/machine-config-operator/pkg/daemon.(*Daemon).RunFirstbootCompleteMachineconfig(0xc0001ca600)
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/pkg/daemon/daemon.go:852 +0x172
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: main.runFirstBootCompleteMachineConfig(0xc0005dfce0, {0xc0005dfd08, 0x502d80, 0xc000092050})
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/firstboot_complete_machineconfig.go:39 +0x11d
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: main.executeFirstbootCompleteMachineConfig(0x2e0c4e0, {0x2e583a8, 0x0, 0x0})
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/firstboot_complete_machineconfig.go:44 +0xdd
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: github.com/spf13/cobra.(*Command).execute(0x2e0c4e0, {0x2e583a8, 0x0, 0x0})
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:860 +0x5f8
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: github.com/spf13/cobra.(*Command).ExecuteC(0x2e0c760)
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: github.com/spf13/cobra.(*Command).Execute(...)
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/vendor/github.com/spf13/cobra/command.go:902
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: k8s.io/component-base/cli.Run(0x2e0c760)
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/component-base/cli/run.go:105 +0x389
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]: main.main()
May 19 16:17:25 ip-10-0-131-92 machine-config-daemon[2149]:         /go/src/github.com/openshift/machine-config-operator/cmd/machine-config-daemon/main.go:28 +0x25
May 19 16:17:25 ip-10-0-131-92 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 19 16:17:25 ip-10-0-131-92 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
May 19 16:17:25 ip-10-0-131-92 systemd[1]: Failed to start Machine Config Daemon Firstboot.
May 19 16:17:25 ip-10-0-131-92 systemd[1]: machine-config-daemon-firstboot.service: Consumed 13.342s CPU time

During firstboot, dn.nodeWriter is nil, so we need an if dn.nodeWriter != nil { ... } before we attempt to emit an event.

For posterity, how I found this was:

  1. Upon failure, openshift-installer creates a log bundle.
  2. The install step collects this log bundle in the event of a failure and stores it in the GCP cloud bucket (for example).
  3. Retrieve this log bundle and untar it.
  4. Look in the journals path for the control plane nodes for the machine-config-daemon-firstboot.log file (e.g., log-bundle-20220519164705/control-plane/10.0.131.92/journals/machine-config-daemon-firstboot.log).

To provide better information to what the MCD is doing, the MCD should
emit an event whenever a MachineConfig specifies that a new OS will be
installed as well as whenever it does not.
Copy link
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test e2e-agnostic-upgrade

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2022
@kikisdeliveryservice
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, kikisdeliveryservice, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 23, 2022

@cheesesashimi: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 031234c into openshift:master May 23, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 23, 2022

@cheesesashimi: All pull requests linked via external trackers have merged:

Bugzilla bug 2012969 has been moved to the MODIFIED state.

In response to this:

Bug 2012969: emit kube events for skipped and upgraded OSes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Copy link
Contributor

@cheesesashimi do we need to backport it 4.10 since bug was reported during 4.10 timeframe?

@cheesesashimi cheesesashimi deleted the zzlotnik/os-upgrade-events branch March 21, 2024 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants