GPU e2es are failing #47216
Comments
We had one successful run which included that commit: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-gpu/402
I built a cluster from HEAD and ran the e2es locally and they are passing.
That's odd. I'd like to understand why the test flaked. @mindprince can you look at the failed test logs to see if there are any clues?
When no nvidia device was attached, the -ne check had a syntax error: `sh: -ne: argument expected`
This resulted in 'Success' being echoed and the test passing incorrectly.
This was found while debugging issue kubernetes#47216
So, the regression is caused by #46744. Tests pass when run with:
They fail when run with:
This is the difference between the two commits:
Looks like the nvidia GPU devices are no longer being attached to the docker containers after the changes introduced by #46744. But why are the node e2es still passing?
#46744 seems to be an important feature. Can you try to identify the bug it introduced instead of attempting to revert it? @dchen1107 This issue needs to be fixed for v1.7.
It seems that moving the resource assignment after updateCreateConfig will fix this. Resources got a new value there but was then replaced: I'll try to prepare a PR.
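To make the ordering problem concrete, here is a minimal sketch of how a field set in one step can be lost when a later step replaces the whole struct. This is not the dockershim code: `Resources`, `HostConfig`, `assignDevices`, `buildBroken`, and `buildFixed` are hypothetical names, and this `updateCreateConfig` is only a simplified stand-in that is assumed to overwrite `Resources` wholesale.

```go
package main

import "fmt"

// Resources and HostConfig are simplified, hypothetical stand-ins for the
// container host configuration; only the fields needed here are modeled.
type Resources struct {
	Memory  int64
	Devices []string
}

type HostConfig struct {
	Resources Resources
}

// updateCreateConfig is a stand-in, NOT the real dockershim function.
// ASSUMPTION: it replaces the whole Resources struct, which drops anything
// that was assigned to Resources before it ran.
func updateCreateConfig(hc *HostConfig, memory int64) {
	hc.Resources = Resources{Memory: memory}
}

// assignDevices is a hypothetical stand-in for the device assignment step.
func assignDevices(hc *HostConfig, devices []string) {
	hc.Resources.Devices = devices
}

// buildBroken assigns devices first, so updateCreateConfig wipes them out.
func buildBroken(memory int64, devices []string) HostConfig {
	var hc HostConfig
	assignDevices(&hc, devices)
	updateCreateConfig(&hc, memory)
	return hc
}

// buildFixed runs updateCreateConfig first, then assigns devices, so both
// the memory setting and the device list survive.
func buildFixed(memory int64, devices []string) HostConfig {
	var hc HostConfig
	updateCreateConfig(&hc, memory)
	assignDevices(&hc, devices)
	return hc
}

func main() {
	devices := []string{"/dev/nvidia0"}
	fmt.Println("broken device count:", len(buildBroken(1024, devices).Resources.Devices))
	fmt.Println("fixed device count:", len(buildFixed(1024, devices).Resources.Devices))
}
```

In this model the broken ordering ends up with no devices attached, while the fixed ordering keeps the device list alongside the memory setting.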
Automatic merge from submit-queue (batch tested with PRs 47000, 47188, 47094, 47323, 47124)

Fix hostconfig device map logic in dockershim.

**What this PR does / why we need it**:
Fixes for device injection logic in dockershim, please help verify e2e run. Should do updateCreateConfig before Resources assignment.
Related change: https://github.com/kubernetes/kubernetes/pull/46744/files#diff-c7dd39479fd733354254e70845075db5L137

**Which issue this PR fixes** #47216

**Special notes for your reviewer**:

**Release note**:
```release-note
```
This can be closed now, the tests are green: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-gpu/462
…e-gpu
Automatic merge from submit-queue

Fix bad check in node e2e tests for GPUs.

When no nvidia device was attached, the -ne check had a syntax error: `sh: -ne: argument expected`
This resulted in `Success` being echoed and the test passing incorrectly.
This was found while debugging issue #47216

/release-note-none
/sig node
/area node-e2e
/kind bug
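The failure mode described here can be reproduced with a small sketch. The two shell snippets below are illustrative guesses, not the actual test command or the actual fix: `buggyCheck`, `saferCheck`, `runCheck`, and the `count` variable are hypothetical names. The sketch assumes a POSIX `sh` is on the PATH, and the exact error text printed by `sh` varies between shells.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// buggyCheck leaves $count unquoted. When count is empty the test expands to
// `[ -ne 1 ]`, which is a syntax error; `if` treats the non-zero status as
// "condition false", so the script falls through to `echo Success`.
const buggyCheck = `if [ $count -ne 1 ]; then echo Failure; exit 1; fi; echo Success`

// saferCheck quotes the variable and defaults it to 0, so a missing device
// count is compared as a number. This is one possible guard, not
// necessarily the one used in the real fix.
const saferCheck = `if [ "${count:-0}" -ne 1 ]; then echo Failure; exit 1; fi; echo Success`

// runCheck runs a script under sh with count set in the environment and
// returns whatever the script wrote to stdout, trimmed.
func runCheck(script, count string) string {
	cmd := exec.Command("sh", "-c", script)
	cmd.Env = append(os.Environ(), "count="+count)
	out, _ := cmd.Output() // the exit status is ignored; only stdout matters here
	return strings.TrimSpace(string(out))
}

func main() {
	fmt.Println("buggy, no device:", runCheck(buggyCheck, ""))
	fmt.Println("safer, no device:", runCheck(saferCheck, ""))
	fmt.Println("safer, one device:", runCheck(saferCheck, "1"))
}
```

With an empty count, the unquoted version still reports `Success`, which matches the false pass described above, while the guarded version reports `Failure`.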
Yay!!!
GPU e2es are failing since yesterday (6/8) - https://k8s-testgrid.appspot.com/google-gce#gce-gpu
The only GPU related PR that merged yesterday is #46087
cc @dchen1107 @mindprince