Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU e2es are failing #47216

Closed
vishh opened this issue Jun 9, 2017 · 8 comments
Closed

GPU e2es are failing #47216

vishh opened this issue Jun 9, 2017 · 8 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@vishh
Copy link
Contributor

vishh commented Jun 9, 2017

GPU e2es are failing since yesterday (6/8) - https://k8s-testgrid.appspot.com/google-gce#gce-gpu

The only GPU related PR that merged yesterday is #46087

cc @dchen1107 @mindprince

@vishh vishh added kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 9, 2017
@vishh vishh added this to the v1.7 milestone Jun 9, 2017
@vishh vishh self-assigned this Jun 9, 2017
@rohitagarwal003
Copy link
Member

We had one successful run which included that commit: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-gpu/402

@vishh
Copy link
Contributor Author

vishh commented Jun 9, 2017

I built a cluster from HEAD and ran the e2es locally and they are passing.

@vishh
Copy link
Contributor Author

vishh commented Jun 9, 2017

That's odd. I'd like to understand why the test flaked? @mindprince can you look at the failed test logs to see if there are any clues?

rohitagarwal003 added a commit to rohitagarwal003/kubernetes that referenced this issue Jun 12, 2017
When no nvidia device was attached, the -ne check had a syntax error:

    sh: -ne: argument expected

This resulted in 'Success' being echoed and the test passing incorrectly.
This was found while debugging issue kubernetes#47216
@rohitagarwal003
Copy link
Member

So, the regression is caused by #46744.

Tests pass when run with:

$ git checkout d16d64f62078eb419a0bf9c2d5b843176a63d1b9

They fail when run with:

$ git checkout 69a9759d90339e8d9623d71bf4032c3819879872

This the difference between the two commits:

$ git log d16d64f62078eb419a0bf9c2d5b843176a63d1b9...69a9759d90339e8d9623d71bf4032c3819879872 --oneline
69a9759d90 (HEAD) Merge pull request #46744 from karataliu/wincri4
5936e81b2e Add determinePodIPBySandboxID.
6d07fc2f44 Add updateCreateConfig.
9c2309b7cb Add os dependent getSecurityOpts helper method.
33c34f0ae4 Upgrade go-winio package to v0.4.2, supporting go v1.8 .

Looks like the nvidia GPU devices are no longer being attached to the docker containers after the changes introduced by #46744.

But why are node e2e's still passing?
There is a bug in them which causes them to pass incorrectly. Sent PR #47321 to fix that.

@vishh
Copy link
Contributor Author

vishh commented Jun 12, 2017

Looks like the nvidia GPU devices are no longer being attached to the docker containers after the changes introduced by #46744.

#46744 seems to be an important feature. Can you try to identify the bug it introduced instead of attempting to revert it?
FYI: The relevant device injection logic is here.
In case you did not know, you can use a local cluster for testing and development of GPU support @mindprince

@dchen1107 This issue needs to be fixed for v1.7.

@karataliu
Copy link
Contributor

Seems moving resource assignment after updateCreateConfig will do the fix.
https://github.com/kubernetes/kubernetes/pull/46744/files#diff-c7dd39479fd733354254e70845075db5R150

Resources got new value there but was then replaced:
https://github.com/kubernetes/kubernetes/pull/46744/files#diff-7a64798a1c76fb0f489279744c106b36R55

I'll try prepare a PR.

k8s-github-robot pushed a commit that referenced this issue Jun 13, 2017
Automatic merge from submit-queue (batch tested with PRs 47000, 47188, 47094, 47323, 47124)

Fix hostconfig device map logic in dockershim.

**What this PR does / why we need it**:
Fixes for device injection logic in dockershim , please help verify e2e run.

Should do updateCreateConfig before Resources assignment.

Related change:
https://github.com/kubernetes/kubernetes/pull/46744/files#diff-c7dd39479fd733354254e70845075db5L137


**Which issue this PR fixes**
#47216

**Special notes for your reviewer**:

**Release note**:
```release-note
```
@rohitagarwal003
Copy link
Member

k8s-github-robot pushed a commit that referenced this issue Jun 13, 2017
…e-gpu

Automatic merge from submit-queue

Fix bad check in node e2e tests for GPUs.

When no nvidia device was attached, the -ne check had a syntax error:

    sh: -ne: argument expected

This resulted in `Success` being echoed and the test passing incorrectly.
This was found while debugging issue #47216

/release-note-none
/sig node
/area node-e2e
/kind bug
@vishh
Copy link
Contributor Author

vishh commented Jun 13, 2017

Yay!!!.

@vishh vishh closed this as completed Jun 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.
Projects
None yet
Development

No branches or pull requests

4 participants