
[Failing Test] timeouts in ci-kubernetes-e2e-gce-scale-performance #78734

Closed
alejandrox1 opened this issue Jun 5, 2019 · 17 comments


@alejandrox1 (Contributor) commented Jun 5, 2019

Which jobs are failing:
ci-kubernetes-e2e-gce-scale-performance

Which test(s) are failing:

  • testing/density/config.yaml
  • testing/load/config.yaml
  • ClusterLoaderV2

Since when has it been failing:
Due to previous issues with prow we cannot currently determine the exact moment these tests started failing (we are working on resolving this in testgrid), but they were passing on 5/30 and failing on 6/2.

Testgrid link:
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance

Reason for failure:
These failures seem to be related to pods not reaching the desired state within the timeout period.

W0604 03:17:19.141] E0604 03:17:19.140882 12669 wait_for_controlled_pods.go:497] WaitForControlledPodsRunning: test-jfyird-5/saturation-rc-0 timed out
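For context on what that check does, here is a minimal Go sketch of polling until a controller's pods are Running or a timeout is hit. It is illustrative only, not the actual ClusterLoaderV2 wait_for_controlled_pods.go code, and it assumes a recent client-go (where List takes a context) plus a hypothetical label selector:

```go
package waitpods

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForControlledPodsRunning polls until `want` pods matching `selector`
// (e.g. the pods owned by saturation-rc-0) are Running in namespace `ns`,
// or returns wait.ErrWaitTimeout after `timeout` - the kind of error that
// surfaces as "timed out" in the log line above.
func waitForControlledPodsRunning(c kubernetes.Interface, ns, selector string, want int, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err // a real implementation might tolerate transient API errors here
		}
		running := 0
		for _, p := range pods.Items {
			if p.Status.Phase == corev1.PodRunning {
				running++
			}
		}
		fmt.Printf("%s: %d/%d pods running\n", ns, running, want)
		return running >= want, nil
	})
}
```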

There were also issues with reaching prow:

I0604 03:41:37.997] error dialing prow@35.227.71.123:22: 'dial tcp 35.227.71.123:22: connect: connection timed out', retrying

and

W0604 03:43:44.069] E0604 03:43:44.069271 12669 profile.go:101] failed to gather profile for simple.profileConfig{componentName:"kube-scheduler", provider:"gce", host:"35.227.71.123", kind:"heap"}: failed to execute curl command on master through SSH: error getting SSH client to prow@35.227.71.123:22: 'dial tcp 35.227.71.123:22: connect: connection timed out'
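For background on that profile-gathering step: the framework runs a curl command on the master over SSH, and the SSH dial is what times out above. Below is a minimal, hedged sketch of that flow using golang.org/x/crypto/ssh; the key path, pprof endpoint, and port are illustrative assumptions, not the framework's actual configuration:

```go
package profiles

import (
	"fmt"
	"os"

	"golang.org/x/crypto/ssh"
)

// gatherHeapProfile mimics the failing step above: SSH to the master and run
// curl against the component's pprof endpoint. The ssh.Dial call is where
// "connection timed out" would surface.
func gatherHeapProfile(masterIP, user, keyPath string) ([]byte, error) {
	key, err := os.ReadFile(keyPath)
	if err != nil {
		return nil, err
	}
	signer, err := ssh.ParsePrivateKey(key)
	if err != nil {
		return nil, err
	}
	cfg := &ssh.ClientConfig{
		User:            user,
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable for a throwaway test VM, not for production
	}
	client, err := ssh.Dial("tcp", fmt.Sprintf("%s:22", masterIP), cfg)
	if err != nil {
		return nil, fmt.Errorf("error getting SSH client to %s@%s:22: %v", user, masterIP, err)
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return nil, err
	}
	defer session.Close()

	// e.g. kube-scheduler's heap profile; endpoint and port are illustrative.
	return session.Output("curl -s http://localhost:10251/debug/pprof/heap")
}
```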

It was mentioned in a previous issue, #76670, that #77127 may be a contributor to failures in this job.

/cc @kubernetes/sig-scalability-test-failures
/kind failing-test
/priority important-soon
/milestone v1.15

/cc @jimangel @smourapina @rarchk @alenkacz
/cc @wojtek-t

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone Jun 5, 2019

@alejandrox1 alejandrox1 added this to New (no response yet) in 1.16 CI Signal Jun 5, 2019

@oxddr (Contributor) commented Jun 5, 2019

/assign @oxddr

@k8s-ci-robot (Contributor) commented Jun 5, 2019

@oxddr: GitHub didn't allow me to assign the following users: oxddr.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @oxddr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@oxddr (Contributor) commented Jun 10, 2019

It seems that we have an actual problem with the test, not only issues with Prow (kubernetes/test-infra#12940). We have two runs where the apiserver blew up. @krzysied is looking into this and will hopefully be able to provide more insight by the end of the day (CET). Right now I'd claim we have a regression.

/cc @krzysied @mborsz
/assign @krzysied

@krzysied (Contributor) commented Jun 10, 2019

I've run the gce 5k test manually. It failed in a similar way to the prow job, so we have a regression.

I'm not sure yet what the root cause of the regression is.

The good news is that the regression affects the density test, which allows us to "quickly" verify whether a k8s build is good or bad. I will try to run a bisection to find the root cause.

@alejandrox1 alejandrox1 moved this from New (no response yet) to Under investigation (prioritized) in 1.16 CI Signal Jun 10, 2019

@alejandrox1 (Contributor, Author) commented Jun 11, 2019

@oxddr @krzysied any updates on this? 👀

@krzysied (Contributor) commented Jun 11, 2019

@alejandrox1 We are working on this. We haven't found the root cause yet.

I'm running a bisection to find the problematic commit. Based on the current results, it looks like it is something in the range 59f0f2d...7929c15.

@alejandrox1 (Contributor, Author) commented Jun 11, 2019

For more context on this issue, from kubernetes/test-infra#12940:

We suspect this might be caused by log-exporter not working correctly (e.g. due to lack of permissions to GCS). I am hoping that the current run will give us more insight.

The problem was first noted in kubernetes/test-infra#11594 (comment).

@oxddr (Contributor) commented Jun 11, 2019

@alejandrox1 the problem you mentioned has been mitigated by using a larger disk in the build cluster. At the moment we have a regression in Kubernetes itself - for an as-yet-unknown reason the apiserver is dying in the middle of the density tests.

@krzysied (Contributor) commented Jun 11, 2019

At the moment we have a regression in Kubernetes itself

+1. I'm running manual tests that don't use the test-infra infrastructure, and the kubetest version remains unchanged between runs. Some of the tests passed, some failed. This regression should be unrelated to test-infra.

About the regression: the current scope of the bisection is 163ef4d...5e18554, so only a few commits are left.
Most of them are comment/doc/test fixes.
My random guess would be b094dd9, but it still needs to be verified.

I will post the next update tomorrow.

@claurence commented Jun 11, 2019

@oxddr @krzysied are either of y'all able to join our burndown call tomorrow to get some clarity on this issue and on whether we need a fix for 1.15?

Burndown is at 9am PT

@oxddr (Contributor) commented Jun 11, 2019

That's 6pm CEST - I can make it then. Hopefully we'll be able to provide more insight tomorrow.

@krzysied (Contributor) commented Jun 12, 2019

It looks like the klog update is the root cause of the issue. We are going to revert the change to fix the tests - #78931.

@krzysied (Contributor) commented Jun 12, 2019

I've just finished testing HEAD without the klog update. The test passed without a problem.
It seems there are no other regressions (as far as the density test is concerned).

@liggitt (Member) commented Jun 12, 2019

hoisting discussion from slack for posterity:

  1. --log-file was not working properly, causing things to get logged multiple times to the same file (unknown if this issue pre-existed in glog, or if klog regressed in the <= 1.14 timeframe, but there were no --log-file related changes in klog between v1.14.0 and master as of the revert back to klog v0.3.1)

  2. this was noticed by scale tests when we started trying to use --log-file in https://github.com/kubernetes/kubernetes/pull/76396/files#diff-6e3b476b9225d1213dc6ad13e453fc16R2083, so that PR was reverted in #77904 to stop using --log-file

  3. an attempt was made to fix the multi-logging issue in klog (kubernetes/klog#65), included in 0.3.2, and bumped in kube in #78465

  4. that klog change by itself caused perf regressions (even before reintroducing the PR to start using --log-file again, which is still open at https://github.com/kubernetes/kubernetes/pull/78466/files#diff-6e3b476b9225d1213dc6ad13e453fc16R1863)

so that's where we are today... with the same klog handling of --log-file we had in kube v1.14, and without use of --log-file in our manifests (because it's buggy)
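For readers following the --log-file discussion above, here is a minimal, hedged sketch of how a component can wire klog's file logging. It assumes klog v0.3.x, where the flag is registered as log_file (Kubernetes components normalize underscores to dashes, which is where the --log-file spelling comes from); the file path is illustrative, and this is not the manifest change or any of the PRs referenced above:

```go
package main

import (
	"flag"

	"k8s.io/klog"
)

func main() {
	// klog registers its flags (v, logtostderr, log_file, ...) on the given FlagSet.
	klogFlags := flag.NewFlagSet("klog", flag.ExitOnError)
	klog.InitFlags(klogFlags)

	// Route output to a single file instead of stderr; the path is illustrative.
	_ = klogFlags.Set("logtostderr", "false")
	_ = klogFlags.Set("log_file", "/var/log/kube-scheduler.log")

	klog.Info("this line goes to the file configured via log_file")
	klog.Flush() // make sure buffered entries are written before exit
}
```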

@mm4tt (Contributor) commented Jun 14, 2019

Yesterday's run, with #78465 reverted, passed.
I believe we can close this one.


@alejandrox1 (Contributor, Author) commented Jun 17, 2019

The specific perf regression due to klog v0.3.2 was reverted.
However, this job is not completely green due to #79096.

/close

@k8s-ci-robot (Contributor) commented Jun 17, 2019

@alejandrox1: Closing this issue.

In response to this:

The specific perf regression due to klog v0.3.2 was reverted.
However, this job is not completely green due to #79096.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1.16 CI Signal automation moved this from Under investigation (prioritized) to Observing (observe test failure/flake before marking as resolved) Jun 17, 2019

@alejandrox1 alejandrox1 moved this from Observing (observe test failure/flake before marking as resolved) to Resolved (2+ weeks) in 1.16 CI Signal Jul 3, 2019
