[sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload #109295
/triage accepted
The failing test seems to be expecting a GPU on the machine, and the image seems to be broken. I tried the image manually:
Logs:
/assign @xmcqueen as discussed at the SIG Node CI meeting, can you please take a look? I see that the pod now runs for a couple of minutes and then stops. I wonder if this is something environment-related - maybe the pod was evicted or something. I don't seem to be able to reproduce the failure of the pod locally.
@SergeyKanzhelev: GitHub didn't allow me to assign the following users: xmcqueen. Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
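For the eviction theory, node-level events usually say so explicitly. A quick check against the test cluster could look like this (pod and namespace names are placeholders, not taken from the actual run):

```bash
# Look for eviction or OOM events attached to the failing pod;
# <pod> and <ns> are placeholders for the actual names.
kubectl describe pod <pod> -n <ns> | tail -n 20
kubectl get events -n <ns> --sort-by=.lastTimestamp | grep -i -e evict -e oom
```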
This is no longer a flake. It has failed every run since the start of June. Locally this test runs perfectly, with a legitimate local run of the test suite code doing the full TensorFlow run of 300 epochs. The failure in the testgrid job seems to be fast (173 sec to build and fail; locally it's 160 sec before it even starts the TensorFlow pod). I wonder if there's a blocked port or something affecting the data-initializing Python program. It needs access to:
Here's the command it runs:
Is it feasible to add debugging code, like a pre-check for that HTTP port? It would be good to at least force it to fail with a meaningful error message.
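Something along these lines could serve as the pre-check. The dataset URL is an assumption based on the census data the upstream TF wide-deep example downloads; it is not confirmed from the test code:

```bash
# Hypothetical pre-check: fail fast with a clear message if the dataset
# host is unreachable from the test environment (URL is an assumption).
DATA_URL="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
if ! curl -fsSL --connect-timeout 10 --max-time 60 -o /dev/null "$DATA_URL"; then
  echo "pre-check failed: cannot reach dataset host at $DATA_URL" >&2
  exit 1
fi
```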
Note that HEAD is now at image version 1.2, and it is still failing per testgrid. The successful local runs of the test mentioned above used image version 1.2. Locally, image version 1.1 does produce the CUDA/GPU error shown above by @SergeyKanzhelev:
More info is needed to find the root cause of this test failure. I've put in a PR to get the test framework to preserve the container logs when the pod fails.
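Until that PR lands, the logs of an already-failed pod can usually be pulled by hand (names are placeholders):

```bash
# --previous fetches logs from the last terminated instance of the
# container, which is what we need once the pod has already failed.
kubectl logs <pod> -n <ns> -c <container> --previous
```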
The pod logs show that it is timing out on an HTTP connection, and it looks like it's the one where it tries to download the test data set for the TensorFlow run. You can see it here. Search down to
This test's dependence on that external data doesn't look quite right :/
I'm discussing this test with sig-testing. I'll post an update later.
Talked to @eddiezane from sig-testing:
So I see three choices:
@eddiezane what do you think?
Size of data: about 10 MB
- the tools and libs for the TensorFlow run:
- the training and eval files for the test run:
Here's a relevant issue in a related k8s repo about using/not using Docker registries: Registries used in k/k CI should be on Kubernetes Community infra
This issue looks like a good candidate for putting up a new Docker image containing the deps. This seems feasible and effective, and I'd prefer it to deleting the test. I'll discuss it a bit more with a few others, but this seems like the way to go forward. TBS, I plan to take the above-listed test requirements, build a Docker image containing all that is needed, and reset the test to pull the new image.
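As a rough sketch of baking the deps into the image, these are the kinds of commands a Dockerfile RUN step might execute at build time. The dataset URLs and paths are assumptions based on the upstream TF wide-deep example, not the actual PR:

```bash
# Vendor the ~10MB training/eval files into the image at build time so
# the test needs no network at run time. URLs and paths are assumptions.
mkdir -p /models/census_data
curl -fsSL -o /models/census_data/adult.data \
  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
curl -fsSL -o /models/census_data/adult.test \
  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
```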
I've got some Dockerfile changes to fix this. I've tested it locally with no network and it does the full 300 epochs. PR coming soon.
Currently the image builds for the first arch (amd64) but fails when it enters the cross-compiling phase (arm64), specifically when it tries to do apt installs from within the qemu-*-static emulator:
The other images in node-perf fail to build in the same way, as does the jessie-dnsutils image build. I have tried pulling in newer base images up to buster, but it did not fix the problem. Next steps:
- Focus on the version of qemu: qemu is version-locked in the Makefile and has not changed in two years.
- There was an upgrade recently and the images were building at that time; try to find clues over there.
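For reference, the cross-build can be reproduced outside the Makefile with a standard buildx invocation (registry and tag are placeholders):

```bash
# amd64 builds natively; the arm64 leg runs under qemu-user-static,
# and that is where the apt-get installs fail.
docker buildx build --platform linux/amd64,linux/arm64 \
  -t <staging-registry>/node-perf/tf-wide-deep:<tag> .
```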
I tried pinning many relevant versions of
Here's a clue: docker/buildx#495
This ugly hack gets it past the observed error:
Note the previous PR was missing an important change, but I do not see any feasible way that could explain the current problem with
FTR, I need to bump the VERSION. I did not bump the version in
I thought maybe the CI would bump it automatically.
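If a per-image VERSION file is what drives the published tag, the bump itself is a one-liner; the path below is an assumption about where the node-perf images live in the tree:

```bash
# Bump the image version that the image-building job publishes.
# The exact path is an assumption.
echo 1.3 > test/images/node-perf/tf-wide-deep/VERSION
```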
Fix node-perf test tf-wide-deep: bumped image version, and removed arm64 arch testing #109295
There's an image version in test/utils/manifest.go. It's still pulling the old version and failing. Bump it from 1.2 to 1.3, so one more commit is needed.
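A quick grep makes it easy to confirm nothing else still references the old tag before committing:

```bash
# Find any remaining references to the image that don't mention the
# new 1.3 tag; anything that shows up may need the same bump.
grep -rn "tf-wide-deep" test/ | grep -v "1.3"
```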
The new image is present and runs perfectly. Tested it on local Linux and in Google Cloud:
It ran to completion. Note the new image URL. It's NOT the same as the old one:
The podspec references the image here. Here's the list of registries. tf-wide-deep's URL is built here. The above
FTR: My "testing" above did not catch the current problem because I did not run the full e2e_test binary. I only tested the image via a
AND I should have read this way back at the beginning.
Local run of
This has to change too.
Local test using the published 1.3 image, run the standard way, was successful with the above:
ran the test successfully. So
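For anyone reproducing this, "the standard way" here means the node e2e harness; an invocation along these lines should focus just this test (the FOCUS string is an assumption):

```bash
# Run only the node performance tests via the node e2e harness.
make test-e2e-node FOCUS="Node Performance Testing" SKIP=""
```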
After the image promotion is merged, there's STILL one more commit coming:
It's fixed. The test is green again.
Which jobs are failing?
sig-node-containerd#node-kubelet-containerd-performance-test
Which tests are failing?
E2eNode Suite.[sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload
Since when has it been failing?
From the testgrid, it has been flaky (1/2 passing) since the start of history (03/20/2022) and has been failing constantly (0/2 passing) since 04/01/2022.
Testgrid link
https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-performance-test
Reason for failure (if possible)
Anything else we need to know?
No response
Relevant SIG(s)
/sig node