
Update openshift/api to disable EventedPLEG featuregate in techpreview #317

Merged
merged 1 commit into openshift:master from bump-api on Jun 5, 2023

Conversation

JoelSpeed
Contributor

This currently also pulls in the changes from #314

See openshift/api#1483 for context on why this is being disabled temporarily

@openshift-ci openshift-ci bot requested review from deads2k and soltysh June 5, 2023 10:51
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2023
@sairameshv
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, sairameshv

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sinnykumari

Needs valid BZ

@deads2k deads2k added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 5, 2023
@deads2k
Contributor

deads2k commented Jun 5, 2023

freeing up featuregates.

@openshift-ci
Contributor

openshift-ci bot commented Jun 5, 2023

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 297b472 into openshift:master Jun 5, 2023
8 checks passed
@JoelSpeed
Contributor Author

JoelSpeed commented Jun 5, 2023

For context, if you observe these earlier test failures on tech preview, these have EventedPLEG=true in the kubelet config, which enables the featuregate on the kubelet level.

Some symptoms that we are seeing are pods stuck in leader election and pods with bind already in use issues.

Diving into this with the node team, we found that crio had created duplicate containers at the host level. For example, we had two conflicting machine-config-controller containers: one held leadership and the other did not, and one had bound its port while the other had not.

By putting in a hack, I prevented the EventedPLEG feature from being enabled in the kubelet, and this led to green presubmits for the tech preview jobs, showing that enabling this feature was causing issues.

Now that it has been taken out, we can try re-enabling it individually once openshift/machine-config-operator#3688 is merged, as we will be able to promote the feature via a single PR.

@JoelSpeed JoelSpeed deleted the bump-api branch June 5, 2023 13:25
@deads2k
Contributor

deads2k commented Jun 5, 2023

For context, if you observe these earlier test failures on tech preview, these have EventedPLEG=true in the kubelet config, which enables the featuregate on the kubelet level.

Some symptoms that we are seeing are pods stuck in leader election and pods with bind already in use issues.

@mrunalp before the eventedPLEG goes in again, we should have e2e tests that fail on this condition so we get a clear fix signal from payload tests.

@sairameshv
Member

I just want to highlight here that Evented PLEG was never enabled in the kubelet config, because the wiring to do that is still in progress.
Only the Tech Preview feature gate support to enable the evented pleg was added, which wouldn't interact with MCO code in any way to enable the feature in the kubelet config.
Also, the logs from Joel's cluster confirmed that it was the standard generic pleg being used on the node and not the evented pleg. cc: @harche

@JoelSpeed
Contributor Author

I just want to highlight here that Evented PLEG was never enabled in the kubelet config, because the wiring to do that is still in progress (openshift/api#1458).
Only the Tech Preview feature gate support (openshift/api#1471) to enable the evented pleg was added, which wouldn't interact with MCO code in any way to enable the feature in the kubelet config.

MCO, in master today, takes feature gates from the openshift/api repository and passes them directly through to the kubelet. So as soon as something exists in openshift/api, and is revendored into MCO, it gets passed through to Kubelet, no additional wiring needed.

In the future it will come from cluster-config-operator rather than being vendored, but will still be passed through.
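
To make that pass-through concrete, here is a minimal, hypothetical sketch (the types and names are illustrative only, not the actual MCO features controller code) of how every enabled or disabled cluster gate would land verbatim in kubelet.conf's featureGates map, with no filtering of OpenShift-only gates:

```go
package main

import "fmt"

// Illustrative types only: the real controller works with the openshift/api
// FeatureGate object and the MCO's own kubelet config rendering.
type clusterFeatureGates struct {
	Enabled  []string
	Disabled []string
}

// kubeletFeatureGates models the pass-through described above: every gate
// reported for the cluster lands verbatim in kubelet.conf's featureGates map,
// with no filtering of OpenShift-only gates.
func kubeletFeatureGates(cluster clusterFeatureGates) map[string]bool {
	gates := make(map[string]bool, len(cluster.Enabled)+len(cluster.Disabled))
	for _, name := range cluster.Enabled {
		gates[name] = true
	}
	for _, name := range cluster.Disabled {
		gates[name] = false
	}
	return gates
}

func main() {
	// As soon as EventedPLEG (or an OpenShift-only gate such as GatewayAPI)
	// appears in the cluster's enabled list, it is handed straight to the kubelet.
	fmt.Println(kubeletFeatureGates(clusterFeatureGates{
		Enabled: []string{"EventedPLEG", "GatewayAPI", "AzureWorkloadIdentity"},
	}))
}
```

Because this sketch has no allow-list of upstream kubelet gates, OpenShift-only gates such as GatewayAPI are passed along too, which lines up with the unknown-feature-gate warnings discussed further down.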

Both @sergiordlr and I grabbed kubelet configs from our clusters and could see EventedPLEG=true inside the kubelet config (example in Slack).

I think we need to have a chat about how the feature gates stuff works, as gating an upstream feature gate behind configuration in the API is not necessarily straightforward and needs some work in MCO to support it, if that's something we want to support.

@harche

harche commented Jun 5, 2023

@JoelSpeed Just want to point out that the diff in the Slack link is for the OpenShift feature gates, and not for the kubelet configuration.

Kubelet feature gates can be found at - https://github.com/kubernetes/kubernetes/blob/bba9833c398230f68cd30ff23e22f303e898e6a9/pkg/features/kube_features.go#L858

As you can see, some of the enabled features in that Slack message, such as AzureWorkloadIdentity and GatewayAPI, are not kubelet related but rather part of OpenShift feature gates.

If any of the OpenShift feature gates require changes on the node (such as additional kubelet arguments, crio settings, or kernel arguments), we usually use the MCO to handle that wiring. Good examples are WorkerLatencyProfiles and cgroup v2 support.

The point @sairameshv was trying to convey is that just having a feature gate declaration in the OpenShift API does not automatically set any values on the node-related components (e.g. kubelet, crio, kernel). We need to add some wiring in the background (usually in the MCO) to make it happen.

Looking at the logs on your cluster, it was very evident that evented pleg wasn't even being used on the nodes, and that is kind of expected, because there is no way for the MCO to know that it should be enabled in the kubelet as well as in crio: @sairameshv had not put in the required code to actually orchestrate that.

@JoelSpeed
Contributor Author

Just want to point out that the diff in the Slack link is for the OpenShift feature gates, and not for the kubelet configuration.

I"m sorry but that's not true. The excerpt was taken from the kubelet.conf file on a cluster, you can see a couple of messages on where Sergio posted the full kubelet configs for both a regular master and from the PR.

As you can see, some of the enabled features in that Slack message, such as AzureWorkloadIdentity and GatewayAPI, are not kubelet related but rather part of OpenShift feature gates.

Yes, you're right, but the MCO doesn't know that. The MCO has a features controller that generates a kubelet config using the feature gates enabled in the cluster. It blindly passes these to the kubelet, and you can see in the kubelet logs on my PR that the kubelet is warning about unknown feature gates. Ideally the MCO would filter out the OpenShift-only feature gates, but that means keeping a list of gates for it to filter, which is tricky to keep from bit-rotting.

For example, from one of the failed test runs, I've pulled out 98-master-generated-kubelet from [here]; decoding the base64 file, the featureGates are:

"featureGates": {
    "AdmissionWebhookMatchConditions": true,
    "AzureWorkloadIdentity": true,
    "BuildCSIVolumes": true,
    "CSIDriverSharedResource": true,
    "DynamicResourceAllocation": true,
    "EventedPLEG": true, <------------- EventedPLEG being passed to Kubelet in kubelet.conf file
    "ExternalCloudProviderAzure": true,
    "ExternalCloudProviderGCP": true,
    "GatewayAPI": true,
    "InsightsConfigAPI": true,
    "MachineAPIProviderOpenStack": true,
    "MatchLabelKeysInPodTopologySpread": true,
    "MaxUnavailableStatefulSet": true,
    "NodeSwap": true,
    "OpenShiftPodSecurityAdmission": true,
    "PDBUnhealthyPodEvictionPolicy": true,
    "PrivateHostedZoneAWS": true,
    "RetroactiveDefaultStorageClass": true,
    "SigstoreImageVerification": true
  },
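
As an aside, here is a rough sketch of how one could pull that featureGates block out of the rendered MachineConfig for inspection. It assumes the object has already been saved locally (for example with oc get machineconfig 98-master-generated-kubelet -o json > mc.json) and that kubelet.conf is carried as a base64 data URL in the Ignition files section; the field layout and file name are assumptions for illustration, not a supported tool:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

func main() {
	// mc.json is assumed to hold the output of:
	//   oc get machineconfig 98-master-generated-kubelet -o json
	raw, err := os.ReadFile("mc.json")
	if err != nil {
		panic(err)
	}

	// Only the fields we need from the MachineConfig / Ignition structure.
	var mc struct {
		Spec struct {
			Config struct {
				Storage struct {
					Files []struct {
						Path     string `json:"path"`
						Contents struct {
							Source string `json:"source"`
						} `json:"contents"`
					} `json:"files"`
				} `json:"storage"`
			} `json:"config"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(raw, &mc); err != nil {
		panic(err)
	}

	for _, f := range mc.Spec.Config.Storage.Files {
		if !strings.HasSuffix(f.Path, "kubelet.conf") {
			continue
		}
		// Ignition stores the file body as a data URL, e.g.
		// "data:text/plain;charset=utf-8;base64,<payload>".
		idx := strings.Index(f.Contents.Source, "base64,")
		if idx < 0 {
			continue // not base64-encoded; out of scope for this sketch
		}
		decoded, err := base64.StdEncoding.DecodeString(f.Contents.Source[idx+len("base64,"):])
		if err != nil {
			panic(err)
		}
		var kubeletConf struct {
			FeatureGates map[string]bool `json:"featureGates"`
		}
		if err := json.Unmarshal(decoded, &kubeletConf); err != nil {
			panic(err)
		}
		fmt.Println("EventedPLEG =", kubeletConf.FeatureGates["EventedPLEG"])
	}
}
```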

The point @sairameshv was trying to convey is that just having a feature gate declaration in the OpenShift API does not automatically set any values on the node-related components (e.g. kubelet, crio, kernel). We need to add some wiring in the background (usually in the MCO) to make it happen.

That's also not true. As mentioned above, the MCO blindly sends any and all feature gates configured in the FeatureGate status through to the kubelet. So by adding EventedPLEG into the config operator, the value was added to the status, and the MCO started passing it directly to the kubelet without any work done in the MCO, which is why we saw it enabled in our testing.

Looking at the logs on your cluster

I don't know what logs you're looking at; have you got a link? In my testing last week I was observing kubelet logs discarding unknown feature gates and enabling EventedPLEG.

@harche

harche commented Jun 6, 2023

@JoelSpeed thanks for that clarification.

When I logged into your system, the evented pleg was not in use; it was the generic pleg that was updating the cache. This was expected because, even if you enable evented pleg in the kubelet but not in crio, the kubelet will automatically fall back to the generic pleg.

Anyway, we will look into it more closely from the node side. Thanks for your help.
