[sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload #109295

Closed
ruiwen-zhao opened this issue Apr 4, 2022 · 41 comments · Fixed by #112869, #113012 or #113282
Labels

  • kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@ruiwen-zhao
Contributor

Which jobs are failing?

sig-node-containerd#node-kubelet-containerd-performance-test

Which tests are failing?

E2eNode Suite.[sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload

Since when has it been failing?

According to testgrid, the test has been flaky (passing roughly 1 in 2 runs) since the start of recorded history (03/20/2022) and has been failing consistently (0/2) since 04/01/2022.

Testgrid link

https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-performance-test

Reason for failure (if possible)

E2eNode Suite: [sig-node] Node Performance Testing [Serial] [Slow] Run node performance testing with pre-defined workloads TensorFlow workload (37s)

{ Failure test/e2e_node/node_perf_test.go:172
wait for pod "tensorflow-wide-deep-pod" to succeed
Expected success, but got an error:
    <*errors.errorString | 0xc000722840>: {
        s: "pod \"tensorflow-wide-deep-pod\" failed with reason: \"\", message: \"\"",
    }
    pod "tensorflow-wide-deep-pod" failed with reason: "", message: ""
test/e2e/framework/pods.go:240}

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@ruiwen-zhao ruiwen-zhao added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Apr 4, 2022
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 4, 2022
@ehashman ehashman added this to Triage in SIG Node CI/Test Board Apr 4, 2022
@SergeyKanzhelev
Member

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - To do in SIG Node CI/Test Board Apr 6, 2022
@SergeyKanzhelev
Member

The failing test seems to expect a GPU on the machine, and the image seems to be broken.

I tried the image manually:

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-wide-deep-ctn
spec:
  containers:
  - name: tensorflow-wide-deep-ctn
    image: k8s.gcr.io/e2e-test-images/node-perf/tf-wide-deep:1.1
    command: ["/bin/sh"]
    args: ["-c", "python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561"]

Logs:

kubectl logs tensorflow-wide-deep-ctn
2022-04-19 21:24:17.714218: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-04-19 21:24:17.714292: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "./data_download.py", line 71, in <module>
    tf.app.run(argv=[sys.argv[0]] + unparsed)
AttributeError: module 'tensorflow' has no attribute 'app'

@SergeyKanzhelev
Member

/assign @xmcqueen

As discussed at the SIG Node CI meeting, can you please take a look? I see that the pod now runs for a couple of minutes and then stops. I wonder if this is something environment related; maybe the pod was evicted or something. I don't seem to be able to reproduce the pod failure locally.
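One quick way to check the eviction theory on a reproducing node is to look at the pod's events; this is only a sketch, and the namespace name below is a placeholder taken from one run, since the suffix changes every run:

# look for Evicted/OOMKilled events on the failing pod (namespace is an assumption)
kubectl -n node-performance-testing-4710 describe pod tensorflow-wide-deep-pod | grep -A 10 "Events:"
kubectl -n node-performance-testing-4710 get events --field-selector involvedObject.name=tensorflow-wide-deep-pod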

@k8s-ci-robot
Contributor

@SergeyKanzhelev: GitHub didn't allow me to assign the following users: xmcqueen.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.

@xmcqueen
Contributor

xmcqueen commented Jul 1, 2022

This is no longer a flake. It has failed every run since the start of June.

Locally, this test passes with a correct run of the test suite code, completing the full TensorFlow run of 300 epochs.

The failure in the testgrid job seems to be fast (173 seconds to build and fail; locally it takes about 160 seconds before the TensorFlow pod even starts). I wonder if a port is blocked or something for the data-initializing Python program. It needs access to https://archive.ics.uci.edu/.

Here's the command it runs:

python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561

Is it feasible to add debugging code, like a pre-check for that HTTP endpoint (see the sketch below)? It would be good to at least force it to fail with a meaningful error message.
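For example, a pre-check could be prepended to the pod command so the failure is explicit; this is only a sketch, and the timeout value and error message are arbitrary:

# fail fast with a clear message if the dataset host is unreachable
curl -sf --max-time 30 -o /dev/null https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data \
  || { echo "pre-check failed: cannot reach archive.ics.uci.edu"; exit 1; }
python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561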

@xmcqueen
Contributor

xmcqueen commented Jul 1, 2022

Note that HEAD is now at image version 1.2, and it is still failing per testgrid. The successful local runs of the test mentioned above used image version 1.2.

Locally, image version 1.1 does produce the CUDA/GPU error shown above by @SergeyKanzhelev:

root@tensorflow-wide-deep-pod:/models-1.9.0/official/wide_deep# python ./data_download.py && time -p python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561
2022-07-02 00:17:50.063604: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-02 00:17:50.063660: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "./data_download.py", line 71, in <module>
    tf.app.run(argv=[sys.argv[0]] + unparsed)
AttributeError: module 'tensorflow' has no attribute 'app'
root@tensorflow-wide-deep-pod:/models-1.9.0/official/wide_deep#

@xmcqueen
Contributor

xmcqueen commented Jul 8, 2022

More information is needed to find the root cause of this test failure. I've put in a PR to get the test framework to preserve the container logs when the pod fails.
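Until that lands, the same container logs can be pulled by hand from a local repro; a minimal sketch (the namespace suffix changes every run, so the name below is an assumption):

# dump the failed pod's container logs from a local reproduction
kubectl -n node-performance-testing-4710 logs tensorflow-wide-deep-pod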

@xmcqueen
Contributor

The pod logs show that it is timing out on an HTTP connection, and it looks like it's the one where it tries to download the test dataset for the TensorFlow run. You can see it here; search down to "pod error detected". The data_download script in the image attempts to pull a dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. That URL is live and healthy, so the site might be blocking requests from Google Cloud.

> curl -s -w "%{http_code}\n" -I https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
HTTP/1.1 200 OK
Date: Tue, 12 Jul 2022 15:32:31 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips SVN/1.7.14 Phusion_Passenger/4.0.53 mod_perl/2.0.11 Perl/v5.16.3
Last-Modified: Sat, 10 Aug 1996 18:14:48 GMT
ETag: "3ca4a1-2fbb419259600"
Accept-Ranges: bytes
Content-Length: 3974305
Content-Type: application/x-httpd-php

200
>
W0712 03:50:49.666] 
W0712 03:50:49.666] Traceback (most recent call last):
W0712 03:50:49.666]   File "./data_download.py", line 71, in <module>
W0712 03:50:49.666]     tf.app.run(argv=[sys.argv[0]] + unparsed)
...
W0712 03:50:49.668]     raise URLError(err)
W0712 03:50:49.668] urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
W0712 03:50:49.668] 
W0712 03:50:49.668] Jul 12 03:43:04.398: FAIL: pod error: error while waiting for pod node-performance-testing-4710/tensorflow-wide-deep-pod to be Succeeded or Failed: pod "tensorflow-wide-deep-pod" failed with reason: "", message: ""
W0712 03:50:49.669] 

@aojea
Member

aojea commented Jul 12, 2022

Having this test depend on that external data doesn't look quite right :/

@xmcqueen
Contributor

I'm discussing this test with sig-testing. I'll post an update later.

@xmcqueen
Contributor

xmcqueen commented Jul 13, 2022

Talked to @eddiezane from sig-testing:

So I see three choices:

  • upgrade this thing to a more recent edition
  • delete it
  • grab everything that's needed to run this and keep it in-house <- clear winner here

@eddiezane, what do you think? Maybe the folks over at sig-testing have a place to store the artifacts in-house? They are not massive (about 10 MB; see below).

Total size of the data is about 10 MB:

the tools and libs for the tensorflow run:
wget https://github.com/tensorflow/models/archive/v1.9.0.tar.gz
924K Jul 12 17:42 v1.9.0.tar.gz

the training and eval files for the test run:
python ./data_download.py

3.8M Jul 12 17:38 adult.data
5.1K Jul 12 17:38 adult.names
1.9M Jul 12 17:38 adult.test
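A rough sketch of vendoring those artifacts at image-build time, so the test no longer needs network access at runtime (the directory layout here is an assumption, not the final Dockerfile change):

# pre-fetch the census dataset and the model code while building the image
mkdir -p census_data
wget -O census_data/adult.data https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
wget -O census_data/adult.test https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
wget https://github.com/tensorflow/models/archive/v1.9.0.tar.gz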

@xmcqueen
Contributor

Here's a relevant issue in a related k8s repo about using/not using docker registries:

Registries used in k/k CI should be on Kubernetes Community infra

kubernetes/k8s.io#1458

@xmcqueen
Contributor

This issue looks like a good fit for putting up a new Docker image containing the deps. That seems feasible and effective, and I'd prefer it to deleting the test. I'll discuss it a bit more with a few others, but this seems like the way forward: take the test requirements listed above, build a Docker image containing everything that is needed, and point the test at the new image.

#97027

@xmcqueen
Contributor

I've got some Dockerfile changes to fix this. I've tested them locally with no network, and the run completes the full 300 epochs.

docker run -it --entrypoint "" 68e5677a97f1 python ./wide_deep.py --model_type=wide_deep --train_epochs=300 --epochs_between_evals=300 --batch_size=32561 --data_dir ./census_data

PR coming soon.

@xmcqueen
Contributor

xmcqueen commented Oct 7, 2022

Currently the image builds for the first arch (amd64) but fails when it enters the cross-compiling phase (arm64), specifically when it tries to do apt installs from within the qemu-*-static emulator:

error: failed to solve: process "/dev/.buildkit_qemu_emulator /bin/sh -c apt-get update && apt-get install -y wget time" did not complete successfully: exit code: 100

The other images in node-perf fail to build in the same way.

The jessie-dnsutils image fails to build in the same way.

I have tried pulling in newer base images up to buster, but it did not fix the problem.

I first planned to focus on the version of qemu, but qemu is version-locked in the Makefile and has not changed in two years, so that looks like a dead end.

There was an upgrade recently and the images were building at that time. Try to find clues over there.

@xmcqueen
Contributor

xmcqueen commented Oct 7, 2022

I tried pinning many relevant versions of tonistiigi/binfmt from here but none helped.
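For reference, the emulators are typically (re)registered with something like the following; this is only a sketch, and the exact binfmt image tag pinned in the image-build Makefile may differ:

# refresh the qemu binfmt handlers used for cross-arch builds
docker run --privileged --rm tonistiigi/binfmt --install all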

Here's a clue: docker/buildx#495

This ugly hack gets it past the observed error:

+RUN ln -s /usr/bin/dpkg-split /usr/local/bin/dpkg-split
+RUN ln -s /usr/bin/dpkg-deb /usr/local/bin/dpkg-deb
+RUN ln -s /bin/rm /usr/local/bin/rm
+RUN ln -s /bin/tar /usr/local/bin/tar

@xmcqueen
Contributor

Note the previous PR was missing an important change, but I do not see any feasible way that could explain the current problem with /dev/.buildkit_qemu_emulator /bin/sh -c apt-get update && apt-get install -y wget time failing, so I have not felt any urgency to put up a PR for it.

For the record, I need to bump the VERSION file; I did not bump the version in ./node-perf/tf-wide-deep/VERSION. I thought maybe the CI would bump it automatically.
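The bump itself is a one-line change; a sketch, assuming 1.3 is the next tag:

# tag the next image build as 1.3
echo "1.3" > ./node-perf/tf-wide-deep/VERSION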

xmcqueen added a commit to xmcqueen/kubernetes that referenced this issue Oct 12, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Issues - In progress to Issues - To do in SIG Node CI/Test Board Oct 14, 2022
SIG Node CI/Test Board automation moved this from Issues - To do to Done Oct 19, 2022
k8s-ci-robot added a commit that referenced this issue Oct 19, 2022
Fix node-perf test tf-wide-deep: bumped image version, and removed arm64 arch testing #109295
@xmcqueen
Contributor

xmcqueen commented Oct 19, 2022

There's an image version reference in test/utils/manifest.go; the test is still pulling the old version and failing. That reference needs to be bumped from 1.2 to 1.3, so one more commit is needed.
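For reference, the remaining pinned tag can be located with something like this, run from the repo root (a sketch):

# find every place the tf-wide-deep image version is referenced
grep -rn "tf-wide-deep" test/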

@xmcqueen
Contributor

xmcqueen commented Oct 19, 2022

The new image is present and runs perfectly. I tested it on local Linux and in Google Cloud:

[bmcqueen@bmcqueen-ld2 test]$ docker run -it gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3 -- sh
Unable to find image 'gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3' locally
1.3: Pulling from k8s-staging-e2e-test-images/node-perf/tf-wide-deep
778066204fb7: Already exists
0036158cfada: Already exists
0e3a508508d3: Already exists
5c9866230de7: Already exists
4d2393ffbbeb: Already exists
f11c5852ef84: Pull complete
4d8b48c93c84: Pull complete
af1ffc5e891a: Extracting [=========================>                         ]  491.5kB/946.8kB
a4dfa6232e2c: Verifying Checksum
4f4fb700ef54: Download complete
.
.
.
clearly running
.
.
.

It ran to completion.

Note the new image URL; it is NOT the same as the old one:

docker run -it gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3 -- sh

@xmcqueen
Contributor

xmcqueen commented Oct 20, 2022

The podspec references the image here.

Here's the list of registries.

tf-wide-deep's URL is built here.

The docker run above has "staging" in it, which means image promotion is still needed: the image is already built and pushed to staging, and it must be promoted when this new version number goes out.

> docker pull gcr.io/e2e-test-images/node-perf/tf-wide-deep:1.3
Error response from daemon: Head "https://gcr.io/v2/e2e-test-images/node-perf/tf-wide-deep/manifests/1.3": unknown: Project 'project:e2e-test-images' not found or deleted.
> docker pull gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3
1.3: Pulling from k8s-staging-e2e-test-images/node-perf/tf-wide-deep
Digest: sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1
Status: Image is up to date for gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3
gcr.io/k8s-staging-e2e-test-images/node-perf/tf-wide-deep:1.3
>

@xmcqueen
Contributor

xmcqueen commented Oct 20, 2022

For the record: my "testing" above did not catch the current problem because I did not run the full e2e test binary; I only tested the image via a docker run ....

And I should have read this way back at the beginning.

@xmcqueen
Contributor

Local run of e2e_node.test:

This has to change too.

@xmcqueen
Contributor

A local test using the published 1.3 image, run the standard way, was successful:

make test-e2e-node FOCUS='Node Performance Testing \[Serial\] \[Slow\] Run node performance testing with pre-defined workloads TensorFlow workload' SKIP="QQQQQ"

The test passed.

@xmcqueen
Contributor

After the image promotion is merged, there's STILL one more commit coming.

@xmcqueen
Contributor

xmcqueen commented Oct 23, 2022

  • it's promoted and is found at the expected URL
  • local containerd image cache cleared out
  • local test rerun, pulling from the new prod registry
  • local test using that new registry passes
> ctr -n k8s.io i ls | grep -i tf
REF                                                                                                                                            TYPE                                                      DIGEST                                                                  SIZE      PLATFORMS                                                                    LABELS
registry.k8s.io/e2e-test-images/node-perf/tf-wide-deep:1.3                                                                                     application/vnd.docker.distribution.manifest.list.v2+json sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1 203.6 MiB linux/amd64                                                                  io.cri-containerd.image=managed
registry.k8s.io/e2e-test-images/node-perf/tf-wide-deep@sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1                 application/vnd.docker.distribution.manifest.list.v2+json sha256:91ab3b5ee22441c99370944e2e2cb32670db62db433611b4e3780bdee6a8e5a1 203.6 MiB linux/amd64                                                                  io.cri-containerd.image=managed
>

@xmcqueen
Contributor

xmcqueen commented Nov 2, 2022
