OCPNODE-1515 : Support Evented PLEG feature in Openshift #1458

Draft
wants to merge 2 commits into
base: master
from

Conversation

sairameshv
Member

@sairameshv sairameshv commented May 10, 2023

  1. With Evented PLEG, CRI-O sends container events to the Kubelet so that the pod cache can be updated based on the received events.
    More about Evented PLEG: KEP Reference
  2. This feature can be enabled in OCP by adding a new field to the node config custom resource.
    The MCO monitors that field and updates both the Kubelet and CRI-O configurations.
    Enhancement PR: OCPNODE-1525: Support Evented PLEG in Openshift enhancements#1368
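
To make the event-driven idea concrete, here is a toy Go sketch of a cache that is updated from a stream of container events rather than by periodic relisting. It is not kubelet or CRI-O code: `ContainerEvent`, `PodCache`, and `Consume` are hypothetical names invented for illustration, and the real CRI event API differs.

```go
package main

import "fmt"

// ContainerEvent is a hypothetical stand-in for a lifecycle event
// that a container runtime might report.
type ContainerEvent struct {
	PodUID      string
	ContainerID string
	State       string // e.g. "created", "running", "exited"
}

// PodCache maps pod UID -> container ID -> last observed state.
type PodCache map[string]map[string]string

// Apply records the state carried by a single event.
func (c PodCache) Apply(ev ContainerEvent) {
	if c[ev.PodUID] == nil {
		c[ev.PodUID] = make(map[string]string)
	}
	c[ev.PodUID][ev.ContainerID] = ev.State
}

// Consume applies events from the channel until it is closed.
func Consume(events <-chan ContainerEvent, cache PodCache) {
	for ev := range events {
		cache.Apply(ev)
	}
}

func main() {
	events := make(chan ContainerEvent, 3)
	events <- ContainerEvent{PodUID: "pod-a", ContainerID: "c1", State: "created"}
	events <- ContainerEvent{PodUID: "pod-a", ContainerID: "c1", State: "running"}
	events <- ContainerEvent{PodUID: "pod-b", ContainerID: "c2", State: "exited"}
	close(events)

	cache := PodCache{}
	Consume(events, cache)
	fmt.Println(cache["pod-a"]["c1"], cache["pod-b"]["c2"])
}
```

The cache only changes when an event arrives, which is the property the description above relies on.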

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 10, 2023
@openshift-ci
Contributor

openshift-ci bot commented May 10, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented May 10, 2023

Hello @sairameshv! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 10, 2023
@sairameshv sairameshv changed the title Support Evented PLEG feature in Openshift OCPNODE-1515 : Support Evented PLEG feature in Openshift May 10, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2023
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 24, 2023
@sairameshv sairameshv marked this pull request as ready for review May 24, 2023 16:11
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 24, 2023
@openshift-ci openshift-ci bot requested review from bparees and sjenning May 24, 2023 16:12
@sairameshv
Member Author

/test verify

config/v1/types_node.go (outdated)
@sairameshv sairameshv marked this pull request as draft May 25, 2023 15:55
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2023
@openshift-ci openshift-ci bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 26, 2023
@sairameshv sairameshv marked this pull request as ready for review May 26, 2023 11:08
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2023
@openshift-ci openshift-ci bot requested a review from JoelSpeed May 26, 2023 11:09
@sairameshv sairameshv force-pushed the evented_pleg branch 2 times, most recently from 7661898 to 6c1e4f0 Compare May 29, 2023 12:07
@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 29, 2023
@@ -42,6 +42,11 @@ type NodeSpec struct {
// the status and corresponding reaction of the cluster
// +optional
WorkerLatencyProfile WorkerLatencyProfileType `json:"workerLatencyProfile,omitempty"`

// EventedPleg enables event based PLEG between the kubelet and the CRI-O
Contributor

I think we want to expand this comment to provide more detail.

Questions I would be asking would be:

  • What are the valid values?
  • What happens if I don't specify any value? Is this behaviour likely to change over time?

Nit: the comment should start with the JSON tag version of the name, not the Go field version.
So I'd be expecting something more along the lines of:

Suggested change
- // EventedPleg enables event based PLEG between the kubelet and the CRI-O
+ // eventedPLEG enables event based PLEG between the kubelet and the CRI-O.
+ // Valid values are `Enabled`, `Disabled` and omitted.
+ // When omitted, this means no opinion and the platform is left to choose a reasonable default
+ // which is subject to change over time.
+ // The current default is `Disabled`.

Also, rather than saying CRI-O, is it worth spelling that out in human-readable prose?

I assume a user will know what PLEG is if they are turning this on?

Member Author

Updated as suggested, along with a reference to the KEP!

@@ -70,6 +75,17 @@ const (
DefaultUpdateDefaultReaction WorkerLatencyProfileType = "Default"
)

// +kubebuilder:validation:Enum=Enabled;Disabled;""
Contributor

If you are using omitempty, which I think you are doing because this is a workload API rather than a configuration API, you don't need to have "" in place. When the field is omitted, the enum validation will not execute.

Member Author

Yeah, removed the "" option. I think the nodes.config API is a configuration API, as this resource is unique cluster-wide.

Contributor

Is it a singleton resource, or is there one per Node in the cluster? If it's a singleton then yep, it's a configuration API, in which case I would suggest including "" in the enum and dropping omitempty. This makes the field easier for users to discover and configure.

Member Author

It is a singleton resource named "cluster" that holds some of the node-related configs such as cgroupMode, workerLatencyProfile, etc.
I agree that removing omitempty improves the discoverability of the API for the user. At the same time, I want to maintain consistency with the existing API fields, which are also optional.
WDYT?

Contributor

At the same time, I want to maintain consistency with the other API fields already present which are again optional

We tend to say we don't repeat the sins of the past here. We have conventions that evolve over time so new fields should follow current conventions even when they make the field look inconsistent with older fields.

IMO this should be no omitempty, but allow empty string on the enum please.

Member Author

Agreed with your point. Updated!
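
The serialization difference discussed in this thread can be sketched in Go as follows. This is only an illustration under stated assumptions: the type name `EventedPLEGMode`, its constants, and the two wrapper structs are hypothetical and are not the types in this PR. The kubebuilder marker is copied from the diff above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EventedPLEGMode is a hypothetical enum type for the new field.
type EventedPLEGMode string

const (
	EventedPLEGEnabled  EventedPLEGMode = "Enabled"
	EventedPLEGDisabled EventedPLEGMode = "Disabled"
)

// withOmitEmpty mirrors the older optional-field style:
// an unset value disappears from the serialized object.
type withOmitEmpty struct {
	EventedPleg EventedPLEGMode `json:"eventedPleg,omitempty"`
}

// withoutOmitEmpty follows the reviewer's suggestion: the field is
// always serialized, so "" must be an allowed enum value.
type withoutOmitEmpty struct {
	// +kubebuilder:validation:Enum=Enabled;Disabled;""
	EventedPleg EventedPLEGMode `json:"eventedPleg"`
}

// mustJSON marshals v and panics on error; fine for a small demo.
func mustJSON(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(withOmitEmpty{}))    // {}
	fmt.Println(mustJSON(withoutOmitEmpty{})) // {"eventedPleg":""}
}
```

Without omitempty, the unset field still appears in the object a user reads back, which is the discoverability benefit mentioned above.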

@JoelSpeed
Contributor

How does this interact with #1471? What's the goal in 4.14?

Looks like 1471 enabled the feature by default in tech preview clusters, so is the intention here that a user will have the choice? It will be on by default but can be disabled by using the fields added here?

@sairameshv
Member Author

How does this interact with #1471? What's the goal in 4.14?

Looks like 1471 enabled the feature by default in tech preview clusters, so is the intention here that a user will have the choice? It will be on by default but can be disabled by using the fields added here?

#1471 just provides a way to enable this feature via a feature gate. I don't think it enables EventedPLEG unless this PR and the related MCO PR are merged.
By default, EventedPLEG is disabled in the cluster.

@JoelSpeed
Contributor

I don't think it enables the EventedPLEG without getting this PR & the related MCO PR merged.

It does enable it for TechPreviewNoUpgrade clusters. The MCO passes the feature gate through to the kubelet, so it gets enabled by default on any TechPreviewNoUpgrade cluster. I'm guessing that wasn't the intention?

Note, this has been noticed because of debugging issues in openshift/machine-config-operator#3688

// eventedPleg enables the event based PLEG between the kubelet and CRI-O
// Reference: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3386-kubelet-evented-pleg/README.md
// Valid values are `Enabled`, `Disabled` and ""
// By default, the evented pleg feature is not enabled in the cluster
Contributor

We typically use particular wording for this. Can you update it, please?

Suggested change
- // By default, the evented pleg feature is not enabled in the cluster
+ // When omitted, this means no opinion and the platform is left to choose a reasonable default, which is subject to change over time.
+ // The current default value is Disabled.
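
As a rough Go sketch of the "no opinion" semantics in the suggested wording, a consumer of the field could resolve an empty value to the platform default. Everything here is an assumption for illustration: `EventedPLEGMode`, `currentDefault`, and `effectiveMode` are hypothetical names, not part of this PR or of the MCO.

```go
package main

import "fmt"

// EventedPLEGMode is a hypothetical enum type for the field.
type EventedPLEGMode string

const (
	EventedPLEGEnabled  EventedPLEGMode = "Enabled"
	EventedPLEGDisabled EventedPLEGMode = "Disabled"
)

// currentDefault is what the platform picks when the user has no opinion.
// Per the suggested doc comment it is Disabled and may change over time.
const currentDefault = EventedPLEGDisabled

// effectiveMode resolves the user-specified value to the mode to apply.
// An empty value means "no opinion" and yields the platform default.
func effectiveMode(specified EventedPLEGMode) (EventedPLEGMode, error) {
	switch specified {
	case "":
		return currentDefault, nil
	case EventedPLEGEnabled, EventedPLEGDisabled:
		return specified, nil
	default:
		return "", fmt.Errorf("unsupported eventedPleg value %q", specified)
	}
}

func main() {
	for _, v := range []EventedPLEGMode{"", EventedPLEGEnabled, EventedPLEGDisabled} {
		mode, err := effectiveMode(v)
		fmt.Printf("specified=%q effective=%q err=%v\n", v, mode, err)
	}
}
```

Keeping the default in one place means a later change of the platform default touches only `currentDefault`, while explicit `Enabled` or `Disabled` choices are left alone.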

@@ -42,6 +42,14 @@ type NodeSpec struct {
// the status and corresponding reaction of the cluster
// +optional
WorkerLatencyProfile WorkerLatencyProfileType `json:"workerLatencyProfile,omitempty"`

// eventedPleg enables the event based PLEG between the kubelet and CRI-O
// Reference: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3386-kubelet-evented-pleg/README.md
Contributor

Is there a user friendly explanation we can include? A link to a KEP is quite a lot of context.

Is the intention that, when a user enables this feature, CRI-O and the Kubelet are both configured? Is explicit configuration required for both?

EventedPLEG is an upstream feature gate, what happens when that is enabled by default?

@sairameshv sairameshv marked this pull request as draft January 24, 2024 11:12
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2024
Contributor

openshift-ci bot commented Jan 24, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sairameshv
Once this PR has been reviewed and has the lgtm label, please assign mfojtik for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1. In case of Evented PLEG, CRI-O sends container events to the Kubelet so that the pod cache can be updated based on the received events.
KEP Reference: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3386-kubelet-evented-pleg/README.md
2. This feature can be enabled in OCP by adding a new field in the node config custom resource
that the MCO monitors in order to update both the required Kubelet and CRI-O configurations
Enhancement PR: openshift/enhancements#1368

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Contributor

openshift-ci bot commented Mar 19, 2024

@sairameshv: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-upgrade-minor 8b0fb6b link true /test e2e-upgrade-minor

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2024
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.