pull-kubernetes-e2e-gce-etcd3: hack/e2e-internal/e2e-down.sh command killed by keyboard interrupt #47446
@dcbw There are no sig labels on this issue. Please add a sig label by:
Both of the linked runs failed after a timeout:
This is not a cluster lifecycle issue.
As @roberthbailey mentioned above, the tests simply timed out; the longest distributed test set took 1960.667636731s (see below). All tests passed, but the job hit the timeout. Should we bump the timeout value for this test suite? Also, #47479 doubled the namespace deletion timeout but did not increase the overall timeout for this test job, which might introduce more timeout issues. Separately, I spent some time looking at which test suites in this job use the most time. The top three are the following:
Each of the other test suites takes less than 300s, and most take less than 100s. My question is whether some of the StatefulSet tests should be marked as [Slow] and skipped in this test job. @kow3ns @janetkuo
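For context on what "marked as [Slow] and skipped" means mechanically: e2e jobs filter Ginkgo specs by matching their names against a skip regex, so a spec whose name carries a `[Slow]` tag can be excluded from a time-constrained job. A minimal sketch of that filtering, with illustrative spec names (not the project's actual harness code):

```go
package main

import (
	"fmt"
	"regexp"
)

// shouldRun reports whether a spec name passes the skip filter,
// mirroring how a ginkgo skip regex excludes tagged specs.
func shouldRun(specName, skipRegex string) bool {
	return !regexp.MustCompile(skipRegex).MatchString(specName)
}

func main() {
	skip := `\[Slow\]|\[Serial\]` // example skip regex for a fast presubmit job
	fmt.Println(shouldRun("StatefulSet Basic should scale up [Slow]", skip))
	fmt.Println(shouldRun("Pods should be restarted with a docker exec liveness probe", skip))
}
```

The first spec is filtered out (prints `false`), the second still runs (prints `true`).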
Even before #47479, we were bumping up against the 55m timeout.
A quick update on this:
I am removing sig/apps since the same set of tests is OK on gce-etcd3.
It's required to be confident that shared informers are not having their objects mutated by listeners, causing data inconsistency issues. I would have expected it to be on in the post-merge job as well, actually.
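The idea behind the detector can be sketched in a few lines. This is a simplified illustration, not the actual client-go implementation: on add, the detector stores a private deep copy of each cached object; a later check compares the live object against that copy and fails loudly if a listener mutated shared cache state.

```go
package main

import (
	"fmt"
	"reflect"
)

type entry struct {
	live map[string]string // object handed out to listeners
	snap map[string]string // private deep copy taken when cached
}

type mutationDetector struct{ entries []entry }

// Add caches an object along with a snapshot of its current state.
func (d *mutationDetector) Add(obj map[string]string) {
	snap := make(map[string]string, len(obj))
	for k, v := range obj {
		snap[k] = v
	}
	d.entries = append(d.entries, entry{live: obj, snap: snap})
}

// Check returns an error if any cached object no longer matches its snapshot.
func (d *mutationDetector) Check() error {
	for _, e := range d.entries {
		if !reflect.DeepEqual(e.live, e.snap) {
			return fmt.Errorf("cached object was mutated: %v (was %v)", e.live, e.snap)
		}
	}
	return nil
}

func main() {
	d := &mutationDetector{}
	pod := map[string]string{"name": "pod-a"}
	d.Add(pod)
	pod["name"] = "pod-b" // a listener mutating a shared informer object
	fmt.Println(d.Check() != nil)
}
```

The copy-and-compare work is also why enabling the detector is expected to cost runtime, which is relevant to the performance comparison discussed below in this thread.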
Here is a list of jobs with this enabled: https://github.com/kubernetes/test-infra/search?utf8=%E2%9C%93&q=ENABLE_CACHE_MUTATION_DETECTOR&type= One can see that only pull-* jobs have this enabled. This discrepancy might explain why pull-federation-* takes more time than ci-federation-*? cc/ @csbell Also, per my comment at #47446 (comment):
The above test configuration might still be contributing to the API performance degradation you reported at #47135 (comment), since in the pull tests more tests run concurrently, which means more API requests to the API server. We might be hitting some threshold or limit. cc/ @kubernetes/sig-api-machinery-misc @krzyzacy How safe is it to clone the CI test with the same configuration as the pull one, but without the cache mutation detector enabled? This should be temporary (~1 day) for the A/B comparison. My only concern is whether we would hit the quota limit for our test infra.
I can make one with cache-mutation and one without, and we can compare the performance.
The pull job has GINKGO_TOLERATE_FLAKES=y; the CI job does not. If any flake is encountered, that can lengthen the test run by the duration of the re-run test. Also, time to reach the "Running Suite: Kubernetes e2e suite" line from the start of build-log.txt:
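The arithmetic behind "a flake lengthens the run by the time of the re-run" can be sketched directly: a tolerated flake means the failed spec executes again, so its duration is paid roughly twice against the job's wall-clock budget. The numbers below are illustrative, not taken from the linked runs:

```go
package main

import "fmt"

// totalRuntime sums spec durations, counting flaky specs once extra
// to account for the re-run that flake tolerance triggers.
func totalRuntime(durations []float64, flaky []bool) float64 {
	total := 0.0
	for i, d := range durations {
		total += d
		if flaky[i] {
			total += d // re-run after the tolerated failure
		}
	}
	return total
}

func main() {
	durations := []float64{1960.7, 300, 100} // seconds, illustrative
	flaky := []bool{true, false, false}      // longest set flakes once
	fmt.Printf("%.1fs of a 3300s (55m) budget\n", totalRuntime(durations, flaky))
}
```

With these made-up numbers, a single flake in the longest test set alone pushes the total past the 55m budget, which matches the pattern of "all tests pass, but the job times out".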
Yesterday I encountered some test flakes in e2e-gce-etcd3 on PR #46792. All the failures mention timeouts, so I think they relate to this flake, #47446. The test did pass last night, after the timeout was increased. Here are the failed test runs' outputs:
@liggitt We started the etcd3-pr-validate job, which has identical settings to the pull job except for the cache mutation detector: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3-pr-validate/ Comparing the time spent on the above tests vs. ci-kubernetes-e2e-gce-etcd3 (https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/, please ignore the new flakiness), I saw roughly similar runtimes for those tests. @krzyzacy is going to enable the cache mutation detector to see if there is a performance hit.
After enabling the cache mutation detector at 15:00 today, the latency to finish the etcd3-pr-validate job increased noticeably. From https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3-pr-validate/
I believe the mystery is resolved, and am closing this issue.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46823/pull-kubernetes-e2e-gce-etcd3/35722/