
[Failing Test] testing/load/config.yaml #82538

Closed
soggiest opened this issue Sep 10, 2019 · 7 comments

@soggiest (Contributor) commented Sep 10, 2019

Which jobs are failing:
ci-kubernetes-e2e-gce-scale-performance

Which test(s) are failing:
testing/load/config.yaml
ClusterLoaderV2

Since when has it been failing:
09/09/2019

Testgrid link:
https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-master-scale-performance

Reason for failure:

[measurement call APIResponsiveness - APIResponsiveness error: top latency metric: there should be no high-latency requests, but: [got: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:perc50: 92.425ms, perc90: 991.234ms, perc99: 1.073737s Count:631}; expected perc99 <= 1s]]
error during /go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --experimental-gcp-snapshot-prometheus-disk=true --experimental-prometheus-disk-snapshot-name=ci-kubernetes-e2e-gce-scale-performance-1171106376953368576 --nodes=5000 --prometheus-scrape-etcd --provider=gce --report-dir=/workspace/_artifacts --testconfig=testing/density/config.yaml --testconfig=testing/load/config.yaml --testoverrides=./testing/density/5000_nodes/override.yaml --testoverrides=./testing/load/experimental/overrides/enable_configmaps.yaml --testoverrides=./testing/load/experimental/overrides/enable_secrets.yaml --testoverrides=./testing/load/experimental/overrides/enable_statefulsets.yaml: exit status 1

Anything else we need to know:

/milestone v1.16
/priority critical-urgent
/kind failing-test
/cc @kubernetes/sig-scalability-bugs
/cc @alejandrox1 @jimangel @alenkacz @Verolop
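
For context on the failure above: APIResponsiveness is a percentile-based check, failing the run when the 99th-percentile latency of some (resource, verb, scope) request group exceeds 1s, which is what the GET leases metric did here (1.073737s). Below is a minimal sketch of that kind of check; the type and function names are illustrative, not the actual ClusterLoaderV2 code.

```go
// Minimal sketch of the "perc99 <= 1s" style check reported above.
// latencyMetric and checkAPIResponsiveness are illustrative names,
// not the real ClusterLoaderV2 types.
package main

import (
	"fmt"
	"time"
)

// latencyMetric mirrors the fields printed in the failure message.
type latencyMetric struct {
	Resource, Subresource, Verb, Scope string
	Perc50, Perc90, Perc99             time.Duration
	Count                              int
}

// sloThreshold is the bound the failure message refers to: perc99 <= 1s.
const sloThreshold = time.Second

func checkAPIResponsiveness(metrics []latencyMetric) error {
	for _, m := range metrics {
		if m.Perc99 > sloThreshold {
			return fmt.Errorf("there should be no high-latency requests, but got %s %s (scope=%s) perc99=%v, expected <= %v",
				m.Verb, m.Resource, m.Scope, m.Perc99, sloThreshold)
		}
	}
	return nil
}

func main() {
	// The offending metric from this run: GET leases at namespace scope.
	metrics := []latencyMetric{{
		Resource: "leases", Verb: "GET", Scope: "namespace",
		Perc50: 92425 * time.Microsecond,   // 92.425ms
		Perc90: 991234 * time.Microsecond,  // 991.234ms
		Perc99: 1073737 * time.Microsecond, // 1.073737s
		Count:  631,
	}}
	if err := checkAPIResponsiveness(metrics); err != nil {
		fmt.Println("APIResponsiveness error:", err)
	}
}
```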

@alejandrox1 (Contributor) commented Sep 10, 2019

/assign @wojtek-t @oxddr
The run that failed is https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1171106376953368576
As per our last conversation, we opened this issue in reference to a new failure in the scale-performance job.
Do you consider this to be the same failure as the one in #82182?

@mm4tt (Contributor) commented Sep 10, 2019

/assign

This is not the same failure as #82182.
As far as we understand it, at this point it doesn't seem to be a release blocker, but rather a transient etcd issue.
We're still debugging this and will post a more detailed update tomorrow.

@jkaniuk commented Sep 10, 2019

Seems related to #82182.

The problem was caused by slow etcd, which resulted in apiserver unresponsiveness.

etcd had an ~11s window of being slow:

2019-09-10 03:39:36.560538 I | [...]
2019-09-10 03:39:40.262470 W | etcdserver: read-only range request "key:"/registry/health" " with result "error:context deadline exceeded" took too long (2.000059544s) to execute
...
2019-09-10 03:39:49.034906 W | etcdserver: read-only range request "key:"/registry/services/endpoints/kube-system/kube-controller-manager" " with result "error:context canceled" took too long (9.999530758s) to execute
2019-09-10 03:39:49.270347 W | wal: sync duration of 11.340251509s, expected less than 1s

and returned to normal within the next 1-2s.

During that time etcd-events.log does not show anything unusual, apart from ~100ms response times (probably due to the load on the apiserver):

2019-09-10 03:40:16.163860 W | etcdserver: read-only range request "key:"/registry/events/test-1hq69b-19/small-deployment-259-6775db8cb8-5mr62.15c2f6a6564842b9" " with result "range_response_count:1 size:593" took too long (101.135029ms) to execute

etcd Version: 3.3.15
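
For anyone reproducing this analysis from the dumped master logs, a rough way to locate such a window is to scan etcd.log for the "took too long (...) to execute" warnings quoted above and keep the ones over a threshold. The sketch below is not part of the test harness; the log path and the 1s cut-off are assumptions.

```go
// Rough helper for locating the slow window in an etcd log: scan for
// "took too long (<duration>) to execute" warnings and print the ones above
// a threshold. The log path and the 1s threshold are assumptions.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

var tookTooLong = regexp.MustCompile(`took too long \(([0-9.]+(?:µs|ms|s))\) to execute`)

func main() {
	f, err := os.Open("etcd.log") // illustrative path to the dumped master log
	if err != nil {
		panic(err)
	}
	defer f.Close()

	const threshold = time.Second
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // etcd log lines can be long
	for scanner.Scan() {
		line := scanner.Text()
		m := tookTooLong.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		if d, err := time.ParseDuration(m[1]); err == nil && d >= threshold {
			fmt.Println(line) // candidate line inside the slow window
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```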

@jkaniuk commented Sep 11, 2019

/assign

@jkaniuk commented Sep 11, 2019

One Persistent Disk attached to the master instance reported being unhealthy during the 03:40:20-03:42:10 GMT window.

This is one of the failed runs that could be avoided with better SLOs, as we should not fail the test on such a small window of unavailability.

We can close this as unrelated to Kubernetes.
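
Purely to illustrate the "better SLOs" point, and not an agreed-upon definition: one option would be to fail only when the perc99 breach persists longer than a small allowed window, so a brief disk hiccup like the one above would not fail the run on its own. The helper below and its 2-minute budget are assumptions.

```go
// Illustration only: fail the SLO check only if perc99 stays above the
// threshold for longer than maxBadWindow. The sample data and the 2-minute
// budget are assumptions, not an agreed-upon SLO definition.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	when   time.Time
	perc99 time.Duration
}

// violatesSLO returns true only if the breach persisted beyond maxBadWindow,
// so a short window of disk unavailability would not fail the run by itself.
func violatesSLO(samples []sample, threshold, maxBadWindow time.Duration) bool {
	var badStart *time.Time
	for _, s := range samples {
		if s.perc99 > threshold {
			if badStart == nil {
				t := s.when
				badStart = &t
			}
			if s.when.Sub(*badStart) > maxBadWindow {
				return true
			}
		} else {
			badStart = nil // breach ended; reset the window
		}
	}
	return false
}

func main() {
	start := time.Date(2019, 9, 10, 3, 40, 0, 0, time.UTC)
	samples := []sample{
		{start, 800 * time.Millisecond},
		{start.Add(1 * time.Minute), 1500 * time.Millisecond}, // unhealthy-disk window
		{start.Add(2 * time.Minute), 1200 * time.Millisecond},
		{start.Add(3 * time.Minute), 700 * time.Millisecond},
	}
	// Prints "false": the breach stayed within the 2-minute budget.
	fmt.Println(violatesSLO(samples, time.Second, 2*time.Minute))
}
```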

@liggitt (Member) commented Sep 12, 2019

> One Persistent Disk attached to the master instance reported being unhealthy during the 03:40:20-03:42:10 GMT window.
>
> This is one of the failed runs that could be avoided with better SLOs, as we should not fail the test on such a small window of unavailability.
>
> We can close this as unrelated to Kubernetes.

thanks for the investigation

/close

@k8s-ci-robot (Contributor) commented Sep 12, 2019

@liggitt: Closing this issue.

In response to this:

> One Persistent Disk attached to the master instance reported being unhealthy during the 03:40:20-03:42:10 GMT window.
>
> This is one of the failed runs that could be avoided with better SLOs, as we should not fail the test on such a small window of unavailability.
>
> We can close this as unrelated to Kubernetes.
>
> thanks for the investigation
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1.16 CI Signal automation moved this from New (no response yet) to Observing (observe test failure/flake before marking as resolved) Sep 12, 2019

@alejandrox1 moved this from Observing (observe test failure/flake before marking as resolved) to Resolved (week Sep 9) in 1.16 CI Signal Sep 12, 2019
