Adjust GPU test to work with latest nvidia daemonset on AWS/ec2 #123776
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dims The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-kubernetes-e2e-ec2-device-plugin-gpu |
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
a19501d to ddda25f
/test pull-kubernetes-e2e-ec2-device-plugin-gpu |
ddda25f to 3085765
/kind bug |
/kind failing-test |
/assign @pacoxu |
/test pull-kubernetes-e2e-gce-device-plugin-gpu |
/hold we should release hold if |
/test pull-kubernetes-e2e-gce-device-plugin-gpu after kubernetes/test-infra#32188 was merged |
/test pull-kubernetes-e2e-ec2-device-plugin-gpu |
let's make sure both the |
pull-kubernetes-e2e-ec2-device-plugin-gpu is ✅ |
/test pull-kubernetes-e2e-gce-device-plugin-gpu |
/test pull-kubernetes-e2e-ec2-device-plugin-gpu |
This may have fixed the ec2 test. My run with #123788 failed in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/123788/pull-kubernetes-e2e-ec2-device-plugin-gpu/1765609958725914624. |
/test pull-kubernetes-e2e-gce-device-plugin-gpu |
/skip ( skipping the |
/hold cancel |
/lgtm |
LGTM label has been added. Git tree hash: f3e1519b753aeb5e2679e4a2b57161fff663708c
@dims: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Getting the GPU test to work on AWS/EC2, as there is some flakiness in the existing CI job.
- areGPUsAvailableOnAllSchedulableNodes: ignore the control plane node; it need not have a GPU (we are not going to schedule on that node).
- makeCudaAdditionDevicePluginTestPod: usually builds a pod with 2 containers (each uses 1 GPU). Since nodes with 2 Nvidia GPUs are costly on EC2/AWS, let's have an environment variable TEST_MAX_GPU_COUNT to use just 1 container (with 1 GPU) as the test workload.
- NVIDIA_DRIVER_INSTALLER_DAEMONSET: pointing it at the latest daemonset URL needs some accommodation in the namespace logic to make it work.
See the first clean run here:
https://testgrid.k8s.io/presubmits-ec2#pull-kubernetes-e2e-ec2-device-plugin-gpu&width=20
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: