
Flaking Test: e2e-gci-gce-scalability #69473

Closed
jberkus opened this Issue Oct 5, 2018 · 8 comments

jberkus commented Oct 5, 2018

TestGrid: https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gce-100

Sample Failure: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/18372

This test has been flaky since 9/24. That's a bit of a problem for the 1.13 release, because this is the fastest-running of the e2e scalability tests, and we count on checking it first for scalability fixes.

The problem appears to be that the cluster of GCE VMs sometimes fails to deploy:

W1005 13:47:12.185] ERROR: (gcloud.compute.instance-groups.managed.wait-until-stable) Timeout while waiting for group to become stable.
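For reference, that is the cluster-up step timing out; it can be reproduced by hand with something like the following (the group, zone, and project names below are placeholders, not the job's actual configuration):

  # Placeholder names; the real values come from the job's cluster-up scripts.
  gcloud compute instance-groups managed wait-until-stable e2e-test-minion-group \
      --project=<scalability-project> \
      --zone=us-east1-b \
      --timeout=600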

Can someone from SIG-scalability take a look at this? Thanks!

/sig scalability
/kind flake
/priority important-soon

/assign @shyamjvs

jberkus commented Oct 6, 2018

... in fact, for the last 15 hours, it's been failing exactly every other run.

When I look at the end of the last successful run and the start of the next run, there's less than a 3-minute gap. Could it just be that the test runs are too close together, and the previous run's instances haven't all been torn down yet?
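One way to check that theory would be to list any instances still hanging around from the previous run right before the next one starts, e.g. (project and name filter below are guesses, not the job's actual values):

  # Placeholder project and name pattern; substitute the job's real project.
  gcloud compute instances list \
      --project=<scalability-project> \
      --filter="name~e2e" \
      --format="table(name,zone,status,creationTimestamp)"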

wojtek-t commented Oct 10, 2018

This one seems to be a configuration issue - I will take a look.

/assign @wojtek-t

wojtek-t commented Oct 10, 2018

I took a look at this internally - sometimes the clusters don't start due to lack of quota, so it is a configuration issue.

The question is what has changed recently. My hypothesis is that we are choosing projects for those suites somewhat randomly, and not all of them have enough quota.

@krzyzacy - do you remember any recent changes here off the top of your head?

wojtek-t commented Oct 10, 2018

kubernetes/test-infra#9567 broke our tests.

There are two problems with this PR:

  1. not all scalability-related jobs require the same type of projects
  2. some of the projects in that category don't have enough quota
wojtek-t commented Oct 10, 2018

The problematic thing is the quota for "Compute Engine API CPUs (all regions)", which is set to 64 in these projects:

k8s-e2e-gce-scalability-1-2
k8s-e2e-gci-gce-scale-1-5
k8s-jenkins-gci-scalability-2
k8s-jenkins-scalability-2

For some unknown reason, this quota limit is not even present in the other 4 projects.

So given that 4 out of 8 projects have broken quota, that explains the roughly 50% failure rate.
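For reference, the limit can be checked per project from the CLI, roughly like this (the metric name and format projection are from memory and may need adjusting):

  # Check the "CPUs (all regions)" quota in each of the affected projects.
  for p in k8s-e2e-gce-scalability-1-2 k8s-e2e-gci-gce-scale-1-5 \
           k8s-jenkins-gci-scalability-2 k8s-jenkins-scalability-2; do
    gcloud compute project-info describe --project="$p" \
        --format="yaml(quotas)" | grep -B1 -A1 CPUS_ALL_REGIONS
  done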

wojtek-t commented Oct 10, 2018

I have just bumped the quotas in all of those and it should now be fixed.

wojtek-t closed this Oct 10, 2018

krzyzacy commented Oct 10, 2018

cc @amwat

The quota is controlled by internal values.
