
Add status conditions and profile applied to Profile(s) #188

Merged

Conversation

@jmencak (Contributor) commented Dec 11, 2020

Changes:

  • report the Tuned profile currently applied by each of the containerized
    Tuned daemons managed by NTO
  • report two Profile status conditions, Applied and Degraded, in every
    Profile, indicating whether the Tuned profile was applied and whether
    there were issues during the profile application (see the sketch below)
  • clean up the ClusterOperator settings code; ClusterOperator now also
    reports Reason == ProfileDegraded for the Available condition if any of
    the Tuned Profiles failed to be applied cleanly for any of the
    containerized Tuned daemons managed by NTO
  • add an e2e test to check the status reporting functionality
  • enhance the e2e basic/available test to check that the Degraded
    condition is not set
  • use podman build --no-cache now; this works around issues such as
    containers/buildah#2837 ("Podman build wrongly uses stale cache layer
    although build-arg changed and, thus, produces incorrect image")
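
For reference, a minimal sketch of the shape such a per-Profile status condition could take. Apart from the Applied/Degraded condition names, the ProfileDegraded reason, the TunedDegraded constant and the Message field, all of which appear elsewhere in this PR, the names below are illustrative assumptions rather than the exact API added here:

// Sketch only: illustrative Go types for a per-Profile status condition.
package tunedv1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ConditionStatus is "True", "False" or "Unknown".
type ConditionStatus string

// ProfileConditionType distinguishes the per-Profile conditions.
type ProfileConditionType string

const (
	// TunedProfileApplied indicates that the Tuned profile was applied (name assumed).
	TunedProfileApplied ProfileConditionType = "Applied"
	// TunedDegraded indicates that there were issues during profile application.
	TunedDegraded ProfileConditionType = "Degraded"
)

// StatusCondition is one entry in a Profile's status.conditions list.
type StatusCondition struct {
	// Type of the condition: Applied or Degraded.
	Type ProfileConditionType `json:"type"`
	// Status of the condition: True, False or Unknown.
	Status ConditionStatus `json:"status"`
	// LastTransitionTime is the time of the last transition of this condition.
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
	// Reason is a brief, machine-readable CamelCase reason for the transition,
	// e.g. the ProfileDegraded reason mentioned in the change list above.
	// +optional
	Reason string `json:"reason,omitempty"`
	// Message is only to be consumed by humans.
	// +optional
	Message string `json:"message,omitempty"`
}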

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Dec 11, 2020
@jmencak (Contributor Author) commented Dec 11, 2020

/cc @marcel-apf
Marcel, please review from your perspective whether the newly created statuses fit your needs. Many thanks!

@jmencak (Contributor Author) commented Dec 11, 2020

@MarSik FYI

// This is only to be consumed by humans.
// +optional
Message string `json:"message,omitempty"`
}

@marcel-apf commented:

@jmencak looks good, I think we'd better merge it and consume it with PAO, and then we can go from there. We will have better visibility and we can come up with improvement ideas, if needed.

@jmencak (Contributor Author) replied:
Thanks, @marcel-apf. I'll aim to merge this immediately once 4.8 opens. @sjug / @dagrayvid, could you please provide a code review? Thank you!

pkg/tuned/tuned.go: outdated review thread (resolved)
@dagrayvid (Contributor) left a comment

Except for the previous comment, the code changes look good to me.

@jmencak (Contributor Author) commented Dec 15, 2020

/retest

3 similar /retest comments from @jmencak followed on Dec 15, 2020.

@openshift-merge-robot (Contributor) commented:

@jmencak: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-upgrade — Commit: 1520e93 — Rerun command: /test e2e-upgrade


@yanirq (Contributor) commented Dec 22, 2020

LGTM

@yanirq (Contributor) commented Dec 22, 2020

/retest

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) on Jan 20, 2021
@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar /retest comments from @openshift-bot followed.

@jmencak (Contributor Author) commented Jan 26, 2021

/hold
Bot, go easy with the retests, waiting for #195 to merge first. Then I'll rebase.

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Jan 26, 2021
@openshift-ci-robot removed the lgtm label (Indicates that a PR is ready to be merged) on Jan 27, 2021
@jmencak (Contributor Author) commented Jan 27, 2021

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Jan 27, 2021
Type: tunedv1.TunedDegraded,
}

if (status & scApplied) != 0 {
A reviewer (Contributor) asked:

Where are these coming from? (status, scApplied, scError, scWarn)

@jmencak (Contributor Author) replied:
As pointed out by Sebastian above, I "backported" part of this PR to fix rhbz1919970.
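
For readers with the same question, a minimal, self-contained sketch of how a bit-flag status summary like this could drive the two new conditions. The constant values and the helper are assumptions built only around the names mentioned in this thread (status, scApplied, scWarn, scError), not the actual code in pkg/tuned/tuned.go:

// Sketch only: illustrative bit flags for a single Tuned daemon run.
package main

import "fmt"

const (
	scApplied = 1 << iota // the requested Tuned profile was applied
	scWarn                // warnings were logged while applying the profile
	scError               // errors were logged while applying the profile
)

// conditionsFromStatus maps the bit-flag summary of a Tuned daemon run onto
// the Applied/Degraded Profile status conditions introduced by this PR.
func conditionsFromStatus(status int) (applied, degraded bool) {
	applied = (status & scApplied) != 0
	degraded = (status & (scError | scWarn)) != 0
	return applied, degraded
}

func main() {
	// Example: the profile was applied, but Tuned logged a warning.
	applied, degraded := conditionsFromStatus(scApplied | scWarn)
	fmt.Printf("Applied=%v Degraded=%v\n", applied, degraded)
}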

@yanirq (Contributor) commented Jan 28, 2021

Degrading the cluster operator (the NTO operator) if we get even one degraded profile status seems a bit harsh.
This can cause upgrade issues (@jmencak), but not only that. We could have a cluster with multiple nodes carrying different labels; most of them could run fine with their tuned settings, but hitting one profile in a degraded state would halt NTO completely.
Another concern: introducing yet another degraded state might leave end users suffering from it and having to fix it, since it can present itself as a blocker issue for them.

Looking at this from the Performance Addon Operator perspective - I think we should consume the tuned object status anyway if we want a 1:1 report of what might be wrong with the performance profiles applied.
@MarSik @fromanirh @cynepco3hahue agree/disagree?
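
Picking up on the point about consuming the status from PAO, a minimal sketch of how a consumer could read the per-node Profile conditions with a dynamic Kubernetes client. The tuned.openshift.io/v1 profiles resource and the openshift-cluster-node-tuning-operator namespace reflect how NTO exposes these objects; the kubeconfig handling and the exact condition field names are assumptions rather than anything prescribed by this PR:

// Sketch only: listing Profiles and printing any Degraded conditions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// profileGVR addresses the per-node Profile objects owned by NTO.
var profileGVR = schema.GroupVersionResource{
	Group:    "tuned.openshift.io",
	Version:  "v1",
	Resource: "profiles",
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	profiles, err := client.Resource(profileGVR).
		Namespace("openshift-cluster-node-tuning-operator").
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, p := range profiles.Items {
		conditions, _, _ := unstructured.NestedSlice(p.Object, "status", "conditions")
		for _, c := range conditions {
			cond, _ := c.(map[string]interface{})
			if cond["type"] == "Degraded" && cond["status"] == "True" {
				fmt.Printf("Profile %s is Degraded: %v\n", p.GetName(), cond["message"])
			}
		}
	}
}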

@jmencak (Contributor Author) commented Jan 28, 2021

> Degrading the cluster operator (the NTO operator) if we get even one degraded profile status seems a bit harsh. […]

I completely agree. If you believe that reporting the "error" status per Profile is enough, I'm more than happy not to touch the OperatorStatus.

@yanirq (Contributor) commented Jan 28, 2021

> I completely agree. If you believe that reporting the "error" status per Profile is enough, I'm more than happy not to touch the OperatorStatus.

@jmencak ack, I think we can remove the degradation of the cluster operator then. We can maybe keep status reporting under the operator itself without degrading it.

@jmencak (Contributor Author) commented Jan 29, 2021

> @jmencak ack, I think we can remove the degradation of the cluster operator then. […]

Done.
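
To make the agreed outcome concrete, a hedged sketch of the kind of condition NTO could publish after this change, using the openshift/api config/v1 types: the operator is not marked Degraded because of a single bad Profile, it stays Available and only surfaces the problem via the Reason. The ProfileDegraded reason comes from the PR description; the message text and the exact call site are assumptions:

// Sketch only: the ClusterOperator stays Available while hinting at a bad Profile.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func main() {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:    configv1.OperatorAvailable,
		Status:  configv1.ConditionTrue,          // not degraded by a single bad Profile
		Reason:  "ProfileDegraded",               // reason named in the PR description
		Message: "1/6 Profiles failed to be applied cleanly", // illustrative message
	}
	fmt.Printf("%s=%s (Reason=%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}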

@jmencak (Contributor Author) commented Jan 29, 2021

/test e2e-aws

@jmencak (Contributor Author) commented Jan 29, 2021

/retest

1 similar comment
@yanirq (Contributor) commented Jan 30, 2021

/retest

@yanirq (Contributor) commented Jan 31, 2021

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) on Jan 31, 2021
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jmencak, yanirq


@yanirq (Contributor) commented Feb 1, 2021

@jmencak only thing missing here is the requested bug label

@jmencak (Contributor Author) commented Feb 1, 2021

> @jmencak only thing missing here is the requested bug label

The bug label would be needed if we wanted this in 4.7. For 4.8, the bug label is not necessary; as soon as 4.8 opens, this will merge. And that was exactly the plan: to have this ready very early on for 4.8.

@jmencak (Contributor Author) commented Feb 8, 2021

Let's merge this:
/test all

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar /retest comment from @openshift-bot followed.

@openshift-merge-robot merged commit 84897ad into openshift:master on Feb 8, 2021
@jmencak deleted the 4.8-per-node-tuned-status branch on February 9, 2021, 08:23
Labels: approved, lgtm
8 participants