
Update openshift/api to disable EventedPLEG featuregate in techpreview #317

Merged
merged 1 commit into openshift:master from bump-api on Jun 5, 2023

Conversation

JoelSpeed
Contributor

This currently also pulls in the changes from #314

See openshift/api#1483 for context on why this is being disabled temporarily

@openshift-ci openshift-ci bot requested review from deads2k and soltysh June 5, 2023 10:51
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2023
@sairameshv
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, sairameshv

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sinnykumari

Needs valid BZ

@deads2k deads2k added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 5, 2023
@deads2k
Contributor

deads2k commented Jun 5, 2023

freeing up featuregates.

@openshift-ci
Contributor

openshift-ci bot commented Jun 5, 2023

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 297b472 into openshift:master Jun 5, 2023
8 checks passed
@JoelSpeed
Contributor Author

JoelSpeed commented Jun 5, 2023

For context, if you observe these earlier test failures on tech preview, these have EventedPLEG=true in the kubelet config, which enables the featuregate on the kubelet level.

Some symptoms that we are seeing are pods stuck in leader election and pods with bind already in use issues.

Diving into this with the node team, we found that crio had created duplicate containers at the host level. For example, we had two conflicting machine-config-controller containers: one held leadership and the other did not, and one had bound its port while the other had not.

By putting in a hack, I prevented the EventedPLEG feature from being enabled in the kubelet, and this led to green presubmits for the tech preview jobs, showing that enabling this feature was causing issues.

Now that it has been taken out, we can try re-enabling it individually once openshift/machine-config-operator#3688 is merged, as we will be able to promote the feature via a single PR.

@JoelSpeed JoelSpeed deleted the bump-api branch June 5, 2023 13:25
@deads2k
Contributor

deads2k commented Jun 5, 2023

For context, if you observe these earlier test failures on tech preview, these have EventedPLEG=true in the kubelet config, which enables the featuregate on the kubelet level.

Some symptoms that we are seeing are pods stuck in leader election and pods with bind already in use issues.

@mrunalp before the eventedPLEG goes in again, we should have e2e tests that fail on this condition so we get a clear fix signal from payload tests.

@sairameshv
Member

I just want to highlight here that Evented PLEG was never enabled in the kubelet config, because the wiring to do that is still in progress.
Only the Tech Preview feature gate support to enable the evented pleg was added, which wouldn't interact with MCO code in any way to enable the feature in the kubelet config.
Also, the logs from Joel's cluster confirmed that it was the standard generic pleg being used on the node and not the evented pleg. cc: @harche

@JoelSpeed
Contributor Author

I just want to highlight here that Evented PLEG was never enabled in the kubelet config, because the wiring to do that is still in progress (openshift/api#1458).
Only the Tech Preview feature gate support (openshift/api#1471) to enable the evented pleg was added, which wouldn't interact with MCO code in any way to enable the feature in the kubelet config.

MCO, in master today, takes feature gates from the openshift/api repository and passes them directly through to the kubelet. So as soon as something exists in openshift/api, and is revendored into MCO, it gets passed through to Kubelet, no additional wiring needed.

In the future it will come from cluster-config-operator rather than being vendored, but will still be passed through.
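
To make that pass-through concrete, here is a minimal, hypothetical sketch (the types and names are illustrative only, not the actual MCO features controller code) of how every enabled or disabled cluster gate would land verbatim in kubelet.conf's featureGates map, with no filtering of OpenShift-only gates:

```go
package main

import "fmt"

// Illustrative types only: the real controller works with the openshift/api
// FeatureGate object and the MCO's own kubelet config rendering.
type clusterFeatureGates struct {
	Enabled  []string
	Disabled []string
}

// kubeletFeatureGates models the pass-through described above: every gate
// reported for the cluster lands verbatim in kubelet.conf's featureGates map,
// with no filtering of OpenShift-only gates.
func kubeletFeatureGates(cluster clusterFeatureGates) map[string]bool {
	gates := make(map[string]bool, len(cluster.Enabled)+len(cluster.Disabled))
	for _, name := range cluster.Enabled {
		gates[name] = true
	}
	for _, name := range cluster.Disabled {
		gates[name] = false
	}
	return gates
}

func main() {
	// As soon as EventedPLEG (or an OpenShift-only gate such as GatewayAPI)
	// appears in the cluster's enabled list, it is handed straight to the kubelet.
	fmt.Println(kubeletFeatureGates(clusterFeatureGates{
		Enabled: []string{"EventedPLEG", "GatewayAPI", "AzureWorkloadIdentity"},
	}))
}
```

Because this sketch has no allow-list of upstream kubelet gates, OpenShift-only gates such as GatewayAPI are passed along too, which lines up with the unknown-feature-gate warnings discussed further down.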

Both @sergiordlr and I grabbed kubelet configs from our clusters and could see EventedPLEG=true inside the kubelet config (example in Slack).

I think we need to have a chat about how the feature gates stuff works, as gating an upstream feature gate behind configuration in the API is not necessarily straightforward and needs some work in MCO to support it, if that's something we want to support.

@harche

harche commented Jun 5, 2023

@JoelSpeed Just want to point out that the diff in the Slack link is for the OpenShift feature gates, and not for the kubelet configuration.

Kubelet feature gates can be found at - https://github.com/kubernetes/kubernetes/blob/bba9833c398230f68cd30ff23e22f303e898e6a9/pkg/features/kube_features.go#L858

As you can see, some of the enabled features in that Slack message, such as AzureWorkloadIdentity and GatewayAPI, are not kubelet related but rather part of OpenShift feature gates.

If any of the OpenShift feature gates require changes on the node (such as additional kubelet arguments, crio settings, or kernel arguments), we usually use the MCO to handle that wiring. Good examples are WorkerLatencyProfiles and cgroup v2 support.

The point @sairameshv was trying to convey is that just having a feature gate declaration in the OpenShift API does not automatically set any values on the node-related components (e.g. kubelet, crio, kernel). We need to add some wiring in the background (usually in the MCO) to make it happen.

Looking at the logs on your cluster, it was very evident that evented pleg wasn't even being used on the nodes, and that is kind of expected, because there is no way for the MCO to know that it should be enabled in the kubelet as well as in crio: @sairameshv had not put in the required code to actually orchestrate that.

@JoelSpeed
Contributor Author

Just want to point out that the diff in the Slack link is for the OpenShift feature gates, and not for the kubelet configuration.

I"m sorry but that's not true. The excerpt was taken from the kubelet.conf file on a cluster, you can see a couple of messages on where Sergio posted the full kubelet configs for both a regular master and from the PR.

As you can see, some of the enabled features in that Slack message, such as AzureWorkloadIdentity and GatewayAPI, are not kubelet related but rather part of OpenShift feature gates.

Yes, you're right, but the MCO doesn't know that. The MCO has a features controller that generates a kubelet config using the feature gates enabled in the cluster. It blindly passes these to the kubelet, and you can see in the kubelet logs on my PR that the kubelet is warning about unknown feature gates. Ideally the MCO would filter out the OpenShift-only feature gates, but that means keeping a list of gates for it to filter, which is tricky to keep from bit-rotting.

For example, from one of the failed test runs, I've pulled out 98-master-generated-kubelet from [here]; decoding the base64 file, the featureGates are:

"featureGates": {
    "AdmissionWebhookMatchConditions": true,
    "AzureWorkloadIdentity": true,
    "BuildCSIVolumes": true,
    "CSIDriverSharedResource": true,
    "DynamicResourceAllocation": true,
    "EventedPLEG": true, <------------- EventedPLEG being passed to Kubelet in kubelet.conf file
    "ExternalCloudProviderAzure": true,
    "ExternalCloudProviderGCP": true,
    "GatewayAPI": true,
    "InsightsConfigAPI": true,
    "MachineAPIProviderOpenStack": true,
    "MatchLabelKeysInPodTopologySpread": true,
    "MaxUnavailableStatefulSet": true,
    "NodeSwap": true,
    "OpenShiftPodSecurityAdmission": true,
    "PDBUnhealthyPodEvictionPolicy": true,
    "PrivateHostedZoneAWS": true,
    "RetroactiveDefaultStorageClass": true,
    "SigstoreImageVerification": true
  },
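
As an aside, here is a rough sketch of how one could pull that featureGates block out of the rendered MachineConfig for inspection. It assumes the object has already been saved locally (for example with oc get machineconfig 98-master-generated-kubelet -o json > mc.json) and that kubelet.conf is carried as a base64 data URL in the Ignition files section; the field layout and file name are assumptions for illustration, not a supported tool:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	// mc.json is assumed to hold the output of:
	//   oc get machineconfig 98-master-generated-kubelet -o json
	raw, err := os.ReadFile("mc.json")
	if err != nil {
		panic(err)
	}

	// Only the fields we need from the MachineConfig / Ignition structure.
	var mc struct {
		Spec struct {
			Config struct {
				Storage struct {
					Files []struct {
						Path     string `json:"path"`
						Contents struct {
							Source string `json:"source"`
						} `json:"contents"`
					} `json:"files"`
				} `json:"storage"`
			} `json:"config"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(raw, &mc); err != nil {
		panic(err)
	}

	for _, f := range mc.Spec.Config.Storage.Files {
		if !strings.HasSuffix(f.Path, "kubelet.conf") {
			continue
		}
		// Ignition stores the file body as a data URL, e.g.
		// "data:text/plain;charset=utf-8;base64,<payload>".
		idx := strings.Index(f.Contents.Source, "base64,")
		if idx < 0 {
			continue // not base64-encoded; out of scope for this sketch
		}
		decoded, err := base64.StdEncoding.DecodeString(f.Contents.Source[idx+len("base64,"):])
		if err != nil {
			panic(err)
		}
		var kubeletConf struct {
			FeatureGates map[string]bool `json:"featureGates"`
		}
		if err := json.Unmarshal(decoded, &kubeletConf); err != nil {
			panic(err)
		}
		fmt.Println("EventedPLEG =", kubeletConf.FeatureGates["EventedPLEG"])
	}
}
```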

The point @sairameshv was trying to convey is that just having a feature gate declaration in the OpenShift API does not automatically set any values on the node-related components (e.g. kubelet, crio, kernel). We need to add some wiring in the background (usually in the MCO) to make it happen.

That's also not true. As mentioned above, the MCO blindly sends any and all feature gates configured in the FeatureGate status through to the kubelet. So by adding EventedPLEG into the config operator, the value was added to the status, and the MCO started passing it directly to the kubelet without any work done in the MCO, which is why we saw it enabled in our testing.

Looking at the logs on your cluster

I don't know what logs you're looking at; have you got a link? In my testing last week I was observing kubelet logs discarding unknown feature gates and enabling EventedPLEG.

@harche

harche commented Jun 6, 2023

@JoelSpeed thanks for that clarification.

When I logged into your system, the evented pleg was not in use; it was the generic pleg that was updating the cache. This was expected because, even if you enable evented pleg in the kubelet but not in crio, the kubelet will automatically fall back to the generic pleg.

Anyway, we will look into it more closely from the node side. Thanks for your help.
