
Add status conditions and profile applied to Profile(s) #188

Merged

Conversation

@jmencak (Contributor) commented Dec 11, 2020

Changes:

  • report the Tuned profile currently applied by each of the containerized
    Tuned daemons managed by NTO
  • report two Profile status conditions, Applied and Degraded, in every
    Profile, indicating whether the Tuned profile was applied and whether
    there were issues during the profile application (see the sketch below)
  • clean up the ClusterOperator settings code; ClusterOperator now also
    reports Reason == ProfileDegraded for the Available condition if any of
    the Tuned Profiles failed to be applied cleanly for any of the
    containerized Tuned daemons managed by NTO
  • add an e2e test to check the status reporting functionality
  • enhance the e2e basic/available test to check that the Degraded
    condition is not set
  • use podman build --no-cache now; this works around issues such as
    containers/buildah#2837 ("Podman build wrongly uses stale cache layer
    although build-arg changed and, thus, produces incorrect image")
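
For reference, a minimal sketch of the shape such a per-Profile status condition could take. Apart from the Applied/Degraded condition names, the ProfileDegraded reason, the TunedDegraded constant and the Message field, all of which appear elsewhere in this PR, the names below are illustrative assumptions rather than the exact API added here:

// Sketch only: illustrative Go types for a per-Profile status condition.
package tunedv1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ConditionStatus is "True", "False" or "Unknown".
type ConditionStatus string

// ProfileConditionType distinguishes the per-Profile conditions.
type ProfileConditionType string

const (
	// TunedProfileApplied indicates that the Tuned profile was applied (name assumed).
	TunedProfileApplied ProfileConditionType = "Applied"
	// TunedDegraded indicates that there were issues during profile application.
	TunedDegraded ProfileConditionType = "Degraded"
)

// StatusCondition is one entry in a Profile's status.conditions list.
type StatusCondition struct {
	// Type of the condition: Applied or Degraded.
	Type ProfileConditionType `json:"type"`
	// Status of the condition: True, False or Unknown.
	Status ConditionStatus `json:"status"`
	// LastTransitionTime is the time of the last transition of this condition.
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
	// Reason is a brief, machine-readable CamelCase reason for the transition,
	// e.g. the ProfileDegraded reason mentioned in the change list above.
	// +optional
	Reason string `json:"reason,omitempty"`
	// Message is only to be consumed by humans.
	// +optional
	Message string `json:"message,omitempty"`
}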

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Dec 11, 2020
@jmencak (Contributor Author) commented Dec 11, 2020

/cc @marcel-apf
Marcel, please review from your perspective whether the newly created statuses fit your needs. Many thanks!

@jmencak (Contributor Author) commented Dec 11, 2020

@MarSik FYI

// This is only to be consumed by humans.
// +optional
Message string `json:"message,omitempty"`
}

@marcel-apf commented:

@jmencak looks good, I think we'd better merge it and consume it with PAO, and then we can go from there. We will have better visibility and we can come up with improvement ideas, if needed.

@jmencak (Contributor Author) replied:
Thanks, @marcel-apf. I'll aim to merge this immediately once 4.8 opens. @sjug / @dagrayvid, could you please provide a code review? Thank you!

pkg/tuned/tuned.go: outdated review thread (resolved)
@dagrayvid (Contributor) left a comment

Except for the previous comment, the code changes look good to me.

@jmencak (Contributor Author) commented Dec 15, 2020

/retest

3 similar /retest comments from @jmencak followed on Dec 15, 2020.

@openshift-merge-robot (Contributor) commented:

@jmencak: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-upgrade — Commit: 1520e93 — Rerun command: /test e2e-upgrade


@yanirq (Contributor) commented Dec 22, 2020

LGTM

@yanirq (Contributor) commented Dec 22, 2020

/retest

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) on Jan 20, 2021
@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar /retest comments from @openshift-bot followed.

@jmencak (Contributor Author) commented Jan 26, 2021

/hold
Bot, go easy with the retests, waiting for #195 to merge first. Then I'll rebase.

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Jan 26, 2021
@openshift-ci-robot removed the lgtm label (Indicates that a PR is ready to be merged) on Jan 27, 2021
@jmencak (Contributor Author) commented Jan 27, 2021

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Jan 27, 2021
Type: tunedv1.TunedDegraded,
}

if (status & scApplied) != 0 {
A reviewer (Contributor) asked:

Where are these coming from? (status, scApplied, scError, scWarn)

@jmencak (Contributor Author) replied:
As pointed out by Sebastian above, I "backported" part of this PR to fix rhbz1919970.
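
For readers with the same question, a minimal, self-contained sketch of how a bit-flag status summary like this could drive the two new conditions. The constant values and the helper are assumptions built only around the names mentioned in this thread (status, scApplied, scWarn, scError), not the actual code in pkg/tuned/tuned.go:

// Sketch only: illustrative bit flags for a single Tuned daemon run.
package main

import "fmt"

const (
	scApplied = 1 << iota // the requested Tuned profile was applied
	scWarn                // warnings were logged while applying the profile
	scError               // errors were logged while applying the profile
)

// conditionsFromStatus maps the bit-flag summary of a Tuned daemon run onto
// the Applied/Degraded Profile status conditions introduced by this PR.
func conditionsFromStatus(status int) (applied, degraded bool) {
	applied = (status & scApplied) != 0
	degraded = (status & (scError | scWarn)) != 0
	return applied, degraded
}

func main() {
	// Example: the profile was applied, but Tuned logged a warning.
	applied, degraded := conditionsFromStatus(scApplied | scWarn)
	fmt.Printf("Applied=%v Degraded=%v\n", applied, degraded)
}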

@yanirq (Contributor) commented Jan 28, 2021

Degrading the cluster operator (the NTO operator) if we get even one degraded profile status seems a bit harsh.
This can cause upgrade issues (@jmencak), but not only that. We could have a cluster with multiple nodes carrying different labels; most of them could run fine with their tuned settings, but hitting one profile in a degraded state would halt NTO completely.
Another concern: introducing yet another degraded state might leave end users suffering from it and having to fix it, since it can present itself as a blocker issue for them.

Looking at this from the Performance Addon Operator perspective - I think we should consume the tuned object status anyway if we want a 1:1 report of what might be wrong with the performance profiles applied.
@MarSik @fromanirh @cynepco3hahue agree/disagree?
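
Picking up on the point about consuming the status from PAO, a minimal sketch of how a consumer could read the per-node Profile conditions with a dynamic Kubernetes client. The tuned.openshift.io/v1 profiles resource and the openshift-cluster-node-tuning-operator namespace reflect how NTO exposes these objects; the kubeconfig handling and the exact condition field names are assumptions rather than anything prescribed by this PR:

// Sketch only: listing Profiles and printing any Degraded conditions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// profileGVR addresses the per-node Profile objects owned by NTO.
var profileGVR = schema.GroupVersionResource{
	Group:    "tuned.openshift.io",
	Version:  "v1",
	Resource: "profiles",
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	profiles, err := client.Resource(profileGVR).
		Namespace("openshift-cluster-node-tuning-operator").
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, p := range profiles.Items {
		conditions, _, _ := unstructured.NestedSlice(p.Object, "status", "conditions")
		for _, c := range conditions {
			cond, _ := c.(map[string]interface{})
			if cond["type"] == "Degraded" && cond["status"] == "True" {
				fmt.Printf("Profile %s is Degraded: %v\n", p.GetName(), cond["message"])
			}
		}
	}
}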

@jmencak (Contributor Author) commented Jan 28, 2021

> Degrading the cluster operator (the NTO operator) if we get even one degraded profile status seems a bit harsh. […]

I completely agree. If you believe that reporting the "error" status per Profile is enough, I'm more than happy not to touch the OperatorStatus.

@yanirq (Contributor) commented Jan 28, 2021

> I completely agree. If you believe that reporting the "error" status per Profile is enough, I'm more than happy not to touch the OperatorStatus.

@jmencak ack, I think we can remove the degradation of the cluster operator then. We can maybe keep status reporting under the operator itself without degrading it.

@jmencak (Contributor Author) commented Jan 29, 2021

> @jmencak ack, I think we can remove the degradation of the cluster operator then. […]

Done.
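
To make the agreed outcome concrete, a hedged sketch of the kind of condition NTO could publish after this change, using the openshift/api config/v1 types: the operator is not marked Degraded because of a single bad Profile, it stays Available and only surfaces the problem via the Reason. The ProfileDegraded reason comes from the PR description; the message text and the exact call site are assumptions:

// Sketch only: the ClusterOperator stays Available while hinting at a bad Profile.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func main() {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:    configv1.OperatorAvailable,
		Status:  configv1.ConditionTrue,          // not degraded by a single bad Profile
		Reason:  "ProfileDegraded",               // reason named in the PR description
		Message: "1/6 Profiles failed to be applied cleanly", // illustrative message
	}
	fmt.Printf("%s=%s (Reason=%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}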

@jmencak (Contributor Author) commented Jan 29, 2021

/test e2e-aws

@jmencak (Contributor Author) commented Jan 29, 2021

/retest

1 similar comment
@yanirq (Contributor) commented Jan 30, 2021

/retest

@yanirq (Contributor) commented Jan 31, 2021

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) on Jan 31, 2021
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jmencak, yanirq


@yanirq (Contributor) commented Feb 1, 2021

@jmencak only thing missing here is the requested bug label

@jmencak (Contributor Author) commented Feb 1, 2021

> @jmencak only thing missing here is the requested bug label

The bug label would be needed if we wanted this in 4.7. For 4.8, the bug label is not necessary; as soon as 4.8 opens, this will merge. And that was exactly the plan: to have this ready very early on for 4.8.

@jmencak (Contributor Author) commented Feb 8, 2021

Let's merge this:
/test all

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar /retest comment from @openshift-bot followed.

@openshift-merge-robot merged commit 84897ad into openshift:master on Feb 8, 2021
@jmencak deleted the 4.8-per-node-tuned-status branch on February 9, 2021, 08:23
Labels: approved, lgtm
8 participants