[sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload #109295

Closed
ruiwen-zhao opened this issue Apr 4, 2022 · 41 comments · Fixed by #112869, #113012 or #113282
Labels

  • kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@ruiwen-zhao
Contributor

Which jobs are failing?

sig-node-containerd#node-kubelet-containerd-performance-test

Which tests are failing?

E2eNode Suite.[sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload

Since when has it been failing?

According to testgrid, the test has been flaky (passing roughly 1 in 2 runs) since the start of recorded history (03/20/2022) and has been failing consistently (0/2) since 04/01/2022.

Testgrid link

https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-performance-test

Reason for failure (if possible)

E2eNode Suite: [sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload (37s)

{ Failure test/e2e_node/node_perf_test.go:172
wait for pod "tensorflow-wide-deep-pod" to succeed
Expected success, but got an error:
    <*errors.errorString | 0xc000722840>: {
        s: "pod \"tensorflow-wide-deep-pod\" failed with reason: \"\", message: \"\"",
    }
    pod "tensorflow-wide-deep-pod" failed with reason: "", message: ""
test/e2e/framework/pods.go:240}

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@ruiwen-zhao ruiwen-zhao added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Apr 4, 2022
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 4, 2022
@ehashman ehashman added this to Triage in SIG Node CI/Test Board Apr 4, 2022
@SergeyKanzhelev
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - To do in SIG Node CI/Test Board Apr 6, 2022
@SergeyKanzhelev
Member

The failing test seems to expect a GPU on the machine, and the image seems to be broken.

I tried the image manually:

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-wide-deep-ctn
spec:
  containers:
  - name: tensorflow-wide-deep-ctn
    image: k8s.gcr.io/e2e-test-images/node-perf/tf-wide-deep:1.1
    command: ["/bin/sh"]
    args: ["-c", "python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561"]

Logs:

kubectl logs tensorflow-wide-deep-ctn
2022-04-19 21:24:17.714218: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-04-19 21:24:17.714292: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "./data_download.py", line 71, in <module>
    tf.app.run(argv=[sys.argv[0]] + unparsed)
AttributeError: module 'tensorflow' has no attribute 'app'

@SergeyKanzhelev
Member

/assign @xmcqueen

As discussed at the SIG Node CI meeting, can you please take a look? I see that the pod now runs for a couple of minutes and then stops. I wonder if this is something environment related; maybe the pod was evicted or something. I don't seem to be able to reproduce the pod failure locally.
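One quick way to check the eviction theory on a reproducing node is to look at the pod's events; this is only a sketch, and the namespace name below is a placeholder taken from one run, since the suffix changes every run:

# look for Evicted/OOMKilled events on the failing pod (namespace is an assumption)
kubectl -n node-performance-testing-4710 describe pod tensorflow-wide-deep-pod | grep -A 10 "Events:"
kubectl -n node-performance-testing-4710 get events --field-selector involvedObject.name=tensorflow-wide-deep-pod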

@k8s-ci-robot
Contributor

@SergeyKanzhelev: GitHub didn't allow me to assign the following users: xmcqueen.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.

@xmcqueen
Contributor

xmcqueen commented Jul 1, 2022

This is no longer a flake. It has failed every run since the start of June.

Locally, this test passes with a correct run of the test suite code, completing the full TensorFlow run of 300 epochs.

The failure in the testgrid job seems to be fast (173 seconds to build and fail; locally it takes about 160 seconds before the TensorFlow pod even starts). I wonder if a port is blocked or something for the data-initializing Python program. It needs access to https://archive.ics.uci.edu/.

Here's the command it runs:

python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561

Is it feasible to add debugging code, like a pre-check for that HTTP endpoint (see the sketch below)? It would be good to at least force it to fail with a meaningful error message.
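For example, a pre-check could be prepended to the pod command so the failure is explicit; this is only a sketch, and the timeout value and error message are arbitrary:

# fail fast with a clear message if the dataset host is unreachable
curl -sf --max-time 30 -o /dev/null https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data \
  || { echo "pre-check failed: cannot reach archive.ics.uci.edu"; exit 1; }
python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561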

@xmcqueen
Contributor

xmcqueen commented Jul 1, 2022

Note that HEAD is now at image version 1.2, and it is still failing per testgrid. The successful local runs of the test mentioned above used image version 1.2.

Locally, image version 1.1 does produce the CUDA/GPU error shown above by @SergeyKanzhelev:

root@tensorflow-wide-deep-pod:/models-1.9.0/official/wide_deep# python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561
2022-07-02 00:17:50.063604: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-02 00:17:50.063660: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "./data_download.py", line 71, in <module>
    tf.app.run(argv=[sys.argv[0]] + unparsed)
AttributeError: module 'tensorflow' has no attribute 'app'
root@tensorflow-wide-deep-pod:/models-1.9.0/official/wide_deep#

@xmcqueen
Contributor

xmcqueen commented Jul 8, 2022

More information is needed to find the root cause of this test failure. I've put in a PR to get the test framework to preserve the container logs when the pod fails.
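Until that lands, the same container logs can be pulled by hand from a local repro; a minimal sketch (the namespace suffix changes every run, so the name below is an assumption):

# dump the failed pod's container logs from a local reproduction
kubectl -n node-performance-testing-4710 logs tensorflow-wide-deep-pod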

@xmcqueen
Contributor

The pod logs show that it is timing out on an HTTP connection, and it looks like it's the one where it tries to download the test dataset for the TensorFlow run. You can see it here; search down to "pod error detected". The data_download script in the image attempts to pull a dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. That URL is live and healthy, so the site might be blocking requests from Google Cloud.

> curl -s -w "%{http_code}\n" -I https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
HTTP/1.1 200 OK
Date: Tue, 12 Jul 2022 15:32:31 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips SVN/1.7.14 Phusion_Passenger/4.0.53 mod_perl/2.0.11 Perl/v5.16.3
Last-Modified: Sat, 10 Aug 1996 18:14:48 GMT
ETag: "3ca4a1-2fbb419259600"
Accept-Ranges: bytes
Content-Length: 3974305
Content-Type: application/x-httpd-php

200
>
W0712 03:50:49.666] 
W0712 03:50:49.666] Traceback (most recent call last):
W0712 03:50:49.666]   File "./data_download.py", line 71, in <module>
W0712 03:50:49.666]     tf.app.run(argv=[sys.argv[0]] + unparsed)
...
W0712 03:50:49.668]     raise URLError(err)
W0712 03:50:49.668] urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
W0712 03:50:49.668] 
W0712 03:50:49.668] Jul 12 03:43:04.398: FAIL: pod error: error while waiting for pod node-performance-testing-4710/tensorflow-wide-deep-pod to be Succeeded or Failed: pod "tensorflow-wide-deep-pod" failed with reason: "", message: ""
W0712 03:50:49.669] 

@aojea
Member

aojea commented Jul 12, 2022

Having this test depend on that external data doesn't look quite right :/

@xmcqueen
Contributor

I'm discussing this test with sig-testing. I'll post an update later.

@xmcqueen
Contributor

xmcqueen commented Jul 13, 2022

Talked to @eddiezane from sig-testing:

So I see three choices:

  • upgrade this thing to a more recent edition
  • delete it
  • grab everything that's needed to run this and keep it in-house <- clear winner here

@eddiezane, what do you think? Maybe the folks over at sig-testing have a place to store the artifacts in-house? They are not massive (about 10 MB; see below).

Total size of the data is about 10 MB:

the tools and libs for the tensorflow run:
wget https://github.com/tensorflow/models/archive/v1.9.0.tar.gz
924K Jul 12 17:42 v1.9.0.tar.gz

the training and eval files for the test run:
python ./data_download.py

3.8M Jul 12 17:38 adult.data
5.1K Jul 12 17:38 adult.names
1.9M Jul 12 17:38 adult.test
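A rough sketch of vendoring those artifacts at image-build time, so the test no longer needs network access at runtime (the directory layout here is an assumption, not the final Dockerfile change):

# pre-fetch the census dataset and the model code while building the image
mkdir -p census_data
wget -O census_data/adult.data https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
wget -O census_data/adult.test https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
wget https://github.com/tensorflow/models/archive/v1.9.0.tar.gz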

@xmcqueen
Contributor

Here's a relevant issue in a related k8s repo about using/not using docker registries:

Registries used in k/k CI should be on Kubernetes Community infra

kubernetes/k8s.io#1458

@xmcqueen
Contributor

This issue looks like a good fit for putting up a new Docker image containing the deps. That seems feasible and effective, and I'd prefer it to deleting the test. I'll discuss it a bit more with a few others, but this seems like the way forward: take the test requirements listed above, build a Docker image containing everything that is needed, and point the test at the new image.

#97027

@xmcqueen
Contributor

I've got some Dockerfile changes to fix this. I've tested them locally with no network, and the run completes the full 300 epochs.

docker run -it --entrypoint "" 68e5677a97f1 python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561 --data_dir ./census_data

PR coming soon.

@xmcqueen
Contributor

xmcqueen commented Oct 7, 2022

Currently the image builds for the first arch (amd64) but fails when it enters the cross-compiling phase (arm64), specifically when it tries to do apt installs from within the qemu-*-static emulator:

error: failed to solve: process "/dev/.buildkit_qemu_emulator /bin/sh -c apt-get update && apt-get install -y wget time" did not complete successfully: exit code: 100

The other images in node-perf fail to build in the same way.

The jessie-dnsutils image fails to build in the same way.

I have tried pulling in newer base images up to buster, but it did not fix the problem.

I first planned to focus on the version of qemu, but qemu is version-locked in the Makefile and has not changed in two years, so that looks like a dead end.

There was an upgrade recently and the images were building at that time. Try to find clues over there.

@xmcqueen
Contributor

xmcqueen commented Oct 7, 2022

I tried pinning many relevant versions of tonistiigi/binfmt from here but none helped.
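For reference, the emulators are typically (re)registered with something like the following; this is only a sketch, and the exact binfmt image tag pinned in the image-build Makefile may differ:

# refresh the qemu binfmt handlers used for cross-arch builds
docker run --privileged --rm tonistiigi/binfmt --install all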

Here's a clue: docker/buildx#495

This ugly hack gets it past the observed error:

+RUN ln -s /usr/bin/dpkg-split /usr/local/bin/dpkg-split
+RUN ln -s /usr/bin/dpkg-deb /usr/local/bin/dpkg-deb
+RUN ln -s /bin/rm /usr/local/bin/rm
+RUN ln -s /bin/tar /usr/local/bin/tar

@xmcqueen
Contributor

Note the previous PR was missing an important change, but I do not see any feasible way that could explain the current problem with /dev/.buildkit_qemu_emulator /bin/sh -c apt-get update && apt-get install -y wget time failing, so I have not felt any urgency to put up a PR for it.

For the record, I need to bump the VERSION file; I did not bump the version in ./node-perf/tf-wide-deep/VERSION. I thought maybe the CI would bump it automatically.
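The bump itself is a one-line change; a sketch, assuming 1.3 is the next tag:

# tag the next image build as 1.3
echo "1.3" > ./node-perf/tf-wide-deep/VERSION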

xmcqueen added a commit to xmcqueen/kubernetes that referenced this issue Oct 12, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Issues - In progress to Issues - To do in SIG Node CI/Test Board Oct 14, 2022
SIG Node CI/Test Board automation moved this from Issues - To do to Done Oct 19, 2022
k8s-ci-robot added a commit that referenced this issue Oct 19, 2022
Fix node-perf test tf-wide-deep: bumped image version, and removed arm64 arch testing #109295
@xmcqueen
Contributor

xmcqueen commented Oct 19, 2022

There's an image version reference in test/utils/manifest.go; the test is still pulling the old version and failing. That reference needs to be bumped from 1.2 to 1.3, so one more commit is needed.
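For reference, the remaining pinned tag can be located with something like this, run from the repo root (a sketch):

# find every place the tf-wide-deep image version is referenced
grep -rn "tf-wide-deep" test/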

@xmcqueen
Contributor

xmcqueen commented Oct 19, 2022

The new image is present and runs perfectly. I tested it on local Linux and in Google Cloud:

[bmcqueen@bmcqueen-ld2 test]$ docker run -it gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3 -- sh
Unable to find image 'gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3' locally
1.3: Pulling from k8s-staging-e2e-test-images/node-perf/tf-wide-deep
778066204fb7: Already exists
0036158cfada: Already exists
0e3a508508d3: Already exists
5c9866230de7: Already exists
4d2393ffbbeb: Already exists
f11c5852ef84: Pull complete
4d8b48c93c84: Pull complete
af1ffc5e891a: Extracting [=========================>                         ]  491.5kB/946.8kB
a4dfa6232e2c: Verifying Checksum
4f4fb700ef54: Download complete
.
.
.
clearly running
.
.
.

It ran to completion.

Note the new image URL; it is NOT the same as the old one:

docker run -it gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3 -- sh

@xmcqueen
Contributor

xmcqueen commented Oct 20, 2022

The podspec references the image here.

Here's the list of registries.

tf-wide-deep's URL is built here.

The docker run above has "staging" in it, which means image promotion is still needed: the image is already built and pushed to staging, and it must be promoted when this new version number goes out.

> docker pull gcr.io/e2e-test-images/node-perf/tf-wide-deep:1.3
Error response from daemon: Head "https://gcr.io/v2/e2e-test-images/node-perf/tf-wide-deep/manifests/1.3": unknown: Project 'project:e2e-test-images' not found or deleted.
> docker pull gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3
1.3: Pulling from k8s-staging-e2e-test-images/node-perf/tf-wide-deep
Digest: sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1
Status: Image is up to date for gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3
gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3
>

@xmcqueen
Contributor

xmcqueen commented Oct 20, 2022

For the record: my "testing" above did not catch the current problem because I did not run the full e2e test binary; I only tested the image via a docker run ....

And I should have read this way back at the beginning.

@xmcqueen
Contributor

Local run of e2e_node.test:

This has to change too.

@xmcqueen
Contributor

A local test using the published 1.3 image, run the standard way, was successful:

make test-e2e-node FOCUS='Node Performance Testing \[Serial\] \[Slow\] Run node performance testing with pre-defined workloads TensorFlow workload' SKIP="QQQQQ"

The test passed.

@xmcqueen
Contributor

After the image promotion is merged, there's STILL one more commit coming.

@xmcqueen
Contributor

xmcqueen commented Oct 23, 2022

  • it's promoted and is found at the expected URL
  • local containerd image cache cleared out
  • local test rerun, pulling from the new prod registry
  • local test using that new registry passes
> ctr -n k8s.io i ls | grep -i tf
REF                                                                                                                                            TYPE                                                      DIGEST                                                                  SIZE      PLATFORMS                                                                    LABELS
registry.k8s.io/e2e-test-images/node-perf/tf-wide-deep:1.3                                                                                     application/vnd.docker.distribution.manifest.list.v2+json sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1 203.6 MiB linux/amd64                                                                  io.cri-containerd.image=managed
registry.k8s.io/e2e-test-images/node-perf/tf-wide-deep@sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1                 application/vnd.docker.distribution.manifest.list.v2+json sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1 203.6 MiB linux/amd64                                                                  io.cri-containerd.image=managed
>

@xmcqueen
Contributor

xmcqueen commented Nov 2, 2022
