kubelet: prepare DRA resources before CNI setup #114364

@bart0sh bart0sh commented Dec 8, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

It calls the DRA PrepareResources API before CNI is initialized, so that DRA can be used for network devices.

Which issue(s) this PR fixes:

Fixes #113785

Special notes for your reviewer:

This is a modified version of pohly#43 that keeps most of the logic in the container manager and adds only one call to the Kubelet, so that it can call the container manager's PrepareResources API.
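
A minimal sketch of the shape described above, not the code from this PR: the Kubelet exposes one pass-through method, the container manager keeps the DRA logic, and the runtime prepares resources before creating the sandbox (the step where CNI runs). The type names, the `syncPod` function and the `PrepareDynamicResources` signature are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for v1.Pod; only the name matters here.
type Pod struct{ Name string }

// errPluginUnavailable is an illustrative failure, not a real kubelet error.
var errPluginUnavailable = errors.New("plugin unavailable")

// containerManager is a stand-in for the component that owns the DRA logic.
type containerManager struct {
	prepareErr error     // error to return from PrepareDynamicResources
	log        *[]string // records the order of operations
}

func (cm *containerManager) PrepareDynamicResources(pod *Pod) error {
	*cm.log = append(*cm.log, "prepare:"+pod.Name)
	return cm.prepareErr
}

// RuntimeHelper is the narrow interface the runtime sees (assumed shape).
type RuntimeHelper interface {
	PrepareDynamicResources(pod *Pod) error
}

// kubelet adds a single pass-through call to the container manager.
type kubelet struct{ cm *containerManager }

func (kl *kubelet) PrepareDynamicResources(pod *Pod) error {
	return kl.cm.PrepareDynamicResources(pod)
}

// syncPod prepares resources first and only then "creates" the sandbox.
func syncPod(helper RuntimeHelper, pod *Pod, log *[]string) error {
	if err := helper.PrepareDynamicResources(pod); err != nil {
		return fmt.Errorf("prepare dynamic resources for %s: %w", pod.Name, err)
	}
	*log = append(*log, "sandbox:"+pod.Name)
	return nil
}

func main() {
	var log []string
	kl := &kubelet{cm: &containerManager{log: &log}}

	if err := syncPod(kl, &Pod{Name: "demo"}, &log); err != nil {
		fmt.Println("error:", err)
	}

	// A failing prepare step stops the pod before the sandbox is created.
	kl.cm.prepareErr = errPluginUnavailable
	if err := syncPod(kl, &Pod{Name: "broken"}, &log); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(log)
}
```

Because the sandbox step is reached only after a successful prepare call, a network device claimed through DRA would already be set up by the time CNI is invoked.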

Does this PR introduce a user-facing change?

The Dynamic Resource Allocation framework can be used for network devices.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 8, 2022
@k8s-ci-robot

Please note that we're already in Test Freeze for the release-1.26 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.26.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Thu Dec 8 11:56:23 UTC 2022.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 8, 2022

bart0sh commented Dec 8, 2022

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 8, 2022
@bart0sh bart0sh force-pushed the PR102-prepare-DRA-resources-before-CNI-setup branch from 995927f to ea40383 Compare December 8, 2022 12:39

bart0sh commented Dec 8, 2022

/retest

bart0sh commented Dec 9, 2022

/retest

bart0sh commented Dec 9, 2022

/assign @klueska

@k8s-ci-robot

@bart0sh: GitHub didn't allow me to request PR reviews from the following users: moshe010.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @moshe010

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bart0sh commented Dec 9, 2022

/cc @moshe010

klueska commented Dec 9, 2022

Before committing to this approach, I would like to explore the option of adding code to tolerate temporary failures in the Pod Admission loop. That way the dra manager could move into this loop and sit alongside all other related managers.

The devicemanager would also benefit from such a contribution, as all transient failures in communicating with a device plugin currently manifest as PodAdmission errors.

bart0sh commented Dec 12, 2022

/retest

bart0sh commented Dec 15, 2022

/retest

bart0sh commented Dec 15, 2022

/priority important-longterm
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Dec 15, 2022

bart0sh commented Dec 30, 2022

/retest

@klueska klueska left a comment

In general, this change looks good. Just a few comments.

That said, as mentioned previously, we should still spend some cycles looking into doing this in the pod admission loop (like all other resource managers), rather than in the container creation step. The blocking factor at the moment is that the pod admission loop doesn't support transient failures when calling out to the kubelet plugins, whereas the container creation step does.

Comment on lines 1036 to 1048
	if cm.draManager != nil {
		return cm.draManager.PrepareResources(pod, container)
	}

	return nil

Is there a reason we didn't need this check against nil previously, or was it an oversight?

@bart0sh bart0sh Jan 25, 2023

cm.draManager can't be nil when the DynamicResourceAllocation feature is enabled, so this check is equivalent to `if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.DynamicResourceAllocation) {`.
I decided to check for nil to be consistent with the rest of the code in this file, e.g. https://github.com/kubernetes/kubernetes/blob/release-1.26/pkg/kubelet/cm/container_manager_linux.go#L714

Changed the code to check whether the DynamicResourceAllocation feature is enabled instead. This is better because the check will hopefully need to be removed once DRA is GA, and a feature-gate check will be easier to find.
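
A small sketch of the feature-gate form of the guard, under stated assumptions: `featureGates` is a hypothetical map-based stand-in for `utilfeature.DefaultFeatureGate`, and the manager types are simplified placeholders rather than the kubelet's real ones. It relies on the invariant stated above, that the DRA manager is only created when the gate is on.

```go
package main

import "fmt"

// featureGates is a hypothetical stand-in for utilfeature.DefaultFeatureGate.
type featureGates map[string]bool

func (g featureGates) Enabled(name string) bool { return g[name] }

const DynamicResourceAllocation = "DynamicResourceAllocation"

// draManager is a simplified placeholder that records prepared pods.
type draManager struct{ prepared []string }

func (m *draManager) PrepareResources(pod string) error {
	m.prepared = append(m.prepared, pod)
	return nil
}

type containerManager struct {
	gates      featureGates
	draManager *draManager // non-nil only when the gate is enabled
}

func newContainerManager(gates featureGates) *containerManager {
	cm := &containerManager{gates: gates}
	if gates.Enabled(DynamicResourceAllocation) {
		cm.draManager = &draManager{}
	}
	return cm
}

// PrepareDynamicResources checks the gate explicitly, so the guard is easy
// to grep for and delete once the feature no longer needs a gate.
func (cm *containerManager) PrepareDynamicResources(pod string) error {
	if !cm.gates.Enabled(DynamicResourceAllocation) {
		return nil
	}
	return cm.draManager.PrepareResources(pod)
}

func main() {
	on := newContainerManager(featureGates{DynamicResourceAllocation: true})
	off := newContainerManager(featureGates{})
	fmt.Println(on.PrepareDynamicResources("pod-a"), on.draManager.prepared)
	fmt.Println(off.PrepareDynamicResources("pod-b"), off.draManager == nil)
}
```

Both guards behave the same while the invariant holds; the difference is only in how discoverable the check is when the gate is retired.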

Comment on lines 1044 to 1056
	if cm.draManager != nil {
		return cm.draManager.UnprepareResources(pod)
	}

	return nil

Likewise here


bart0sh commented Jan 25, 2023

> In general, this change looks good. Just a few comments.
>
> That said, as mentioned previously, we should still spend some cycles looking into doing this in the pod admission loop (like all other resource managers), rather than in the container creation step. The blocking factor at the moment is that the pod admission loop doesn't support transient failures when calling out to the kubelet plugins, whereas the container creation step does.

Agree, this needs to be done and I'm going to investigate it.

@bart0sh bart0sh force-pushed the PR102-prepare-DRA-resources-before-CNI-setup branch from 58b54e7 to c25222d Compare January 25, 2023 16:01
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 25, 2023

bart0sh commented Jan 25, 2023

/retest

bart0sh commented Jan 26, 2023

/retest

@bart0sh bart0sh force-pushed the PR102-prepare-DRA-resources-before-CNI-setup branch from c25222d to ce06591 Compare January 26, 2023 21:39

bart0sh commented Jan 27, 2023

@klueska fixed CI failures & rebased, PTAL

@bart0sh bart0sh force-pushed the PR102-prepare-DRA-resources-before-CNI-setup branch from ce06591 to b01a9d0 Compare February 1, 2023 17:40

bart0sh commented Feb 2, 2023

/retest

@bart0sh bart0sh force-pushed the PR102-prepare-DRA-resources-before-CNI-setup branch from b01a9d0 to 55059e6 Compare February 5, 2023 19:34
}

// UnprepareDynamicResources calls container Manager UnprepareDynamicResources API
// This method implements RuntimeHelper interface

Suggested change
- // This method implements RuntimeHelper interface
+ // This method implements the RuntimeHelper interface

@bart0sh bart0sh force-pushed the PR102-prepare-DRA-resources-before-CNI-setup branch from 55059e6 to 4f88332 Compare February 6, 2023 18:40
@k8s-ci-robot

k8s-ci-robot commented Feb 6, 2023

@bart0sh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
Commit: ea4038327ebb0a0f6b2e4e221c9a4630a3349e40
Details: link
Required: false
Rerun command: /test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

bart0sh commented Feb 7, 2023

/retest

bart0sh commented Feb 7, 2023

@klueska thank you for the review! I've updated the PR according to your suggestions. PTAL.

@klueska klueska left a comment

Thanks for the quick turn-around on the reviews. Looks great now.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2023
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 5cd8e0f886c19847029ece8324230e80e7b426ff

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bart0sh, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2023
@k8s-ci-robot k8s-ci-robot merged commit 5437d49 into kubernetes:master Feb 7, 2023
SIG Node PR Triage automation moved this from Needs Reviewer to Done Feb 7, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Feb 7, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/kubelet
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/feature: Categorizes issue or PR as related to a new feature.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

dynamic resource allocation: ensure PrepareResources called before CNI invoked
3 participants