Single prober not discovered via Prometheus cause scalability tests failures #87143

Open
droslean opened this issue Jan 13, 2020 · 8 comments

Member

@droslean droslean commented Jan 13, 2020

Which jobs are failing:
gce-master-scale-performance (ci-kubernetes-e2e-gce-scale-performance)

Since when has it been failing:
12th Jan 09:02 PST

Testgrid link:
https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance

Reason for failure:

:0
[measurement call DnsLookupLatency - DnsLookupLatency error: timed out waiting for the condition
measurement call InClusterNetworkLatency - InClusterNetworkLatency error: timed out waiting for the condition
measurement call DnsLookupLatency - DnsLookupLatency error: measurement DnsLookupLatency has not been started
measurement call InClusterNetworkLatency - InClusterNetworkLatency error: measurement InClusterNetworkLatency has not been started]
:0

Anything else we need to know:
/sig scalability
/cc @kubernetes/ci-signal
/priority critical-urgent
/milestone v1.18

Contributor

@oxddr oxddr commented Jan 13, 2020

/assign

Contributor

@oxddr oxddr commented Jan 13, 2020

What happened: two prober pods started correctly (according to the kubelet logs) but were not visible in Prometheus, which we use to detect whether the probes have started correctly.

Two things:

  1. Should we wait for all pods to start? At 5k nodes we could allow a small number or percentage of prober pods to fail to start.
  2. If a measurement returns an error during its start phase, the test should fail fast; we shouldn't wait until the end of the test.
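
The two points above could be sketched roughly as follows. This is only an illustration in Python, not the test framework's actual code: `ProberStatus`, `enough_probers`, `start_measurements`, `MeasurementStartError` and the 0.1% default tolerance are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ProberStatus:
    expected: int    # prober pods the test created
    discovered: int  # prober targets visible in Prometheus


def enough_probers(status: ProberStatus, max_missing_fraction: float = 0.001) -> bool:
    """Return True if the share of undiscovered probers is within tolerance.

    max_missing_fraction is an assumed knob; 0.0 reproduces the strict
    "every prober must be discovered" behaviour.
    """
    if status.expected <= 0:
        raise ValueError("expected must be positive")
    missing = max(status.expected - status.discovered, 0)
    return missing <= status.expected * max_missing_fraction


class MeasurementStartError(RuntimeError):
    """Raised as soon as any measurement fails to start."""


def start_measurements(starters):
    """Start measurements in order and abort on the first failure.

    starters is a list of (name, callable) pairs. Failing here lets the
    caller stop the test immediately instead of discovering the problem
    when results are gathered at the end.
    """
    started = []
    for name, start in starters:
        try:
            start()
        except Exception as err:
            raise MeasurementStartError(f"{name} failed to start: {err}") from err
        started.append(name)
    return started
```

With these assumptions, 2 undiscovered probers out of 5000 would pass the default tolerance, while the same status with `max_missing_fraction=0.0` would not. Whether a count or a percentage is the better knob is left open, as in the question above.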

Contributor

@oxddr oxddr commented Jan 13, 2020

/priority important-soon
/remove-priority critical-urgent

This doesn't seem critical. The problem we have is with testing infrastructure.

Contributor

@k8s-ci-robot k8s-ci-robot commented Jan 13, 2020

@oxddr: Those labels are not set on the issue: priority/critical-urgent


@droslean droslean changed the title from "[Failing Test] gce-master-scale-performance (ci-kubernetes-e2e-gce-scale-performance)" to "[Flaky Test] gce-master-scale-performance (ci-kubernetes-e2e-gce-scale-performance)" on Jan 14, 2020

Member Author

@droslean droslean commented Jan 14, 2020

/remove-kind failing-test
/kind flake

Contributor

@oxddr oxddr commented Jan 14, 2020

Contributor

@oxddr oxddr commented Jan 14, 2020

/retitle Single prober not discovered via Prometheus cause scalability tests failures

@k8s-ci-robot k8s-ci-robot changed the title from "[Flaky Test] gce-master-scale-performance (ci-kubernetes-e2e-gce-scale-performance)" to "Single prober not discovered via Prometheus cause scalability tests failures" on Jan 14, 2020

Member Author

@droslean droslean commented Jan 14, 2020

Thanks @oxddr
