Router quota exceeded error causing GCE tests to fail #14611

cblecker · 2019-10-04T03:16:34Z

What happened:
W1004 01:31:15.149] Creating router [e2e-51036-95a39-nat-router]...
W1004 01:31:18.991] ....................failed.
W1004 01:31:19.173] ERROR: (gcloud.compute.routers.create) Quota 'ROUTERS' exceeded. Limit: 10.0 globally.

Please provide links to example occurrences, if any:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/51036/pull-kubernetes-e2e-gce-100-performance/1179926617770692608/
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/51036/pull-kubernetes-kubemark-e2e-gce-big/1179926617825218560/

Anything else we need to know?:
Potential boskos cleaning issue

BenTheElder · 2019-10-04T03:17:06Z

from boskos janitor logs:

jsonPayload: {
  error: "exit status 1"   
  level: "error"   
  msg: "failed to clean up project k8s-jkns-e2e-gke-ci-canary, error info: Activated service account credentials for: [pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com]
ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global 

To search the help text of gcloud commands, run:
  gcloud help -- SEARCH_TERMS
Error try to delete resources disks: CalledProcessError()
[=== Start Janitor on project 'k8s-jkns-e2e-gke-ci-canary' ===]
[=== Activating service_account /etc/service-account/service-account.json ===]
[=== Finish Janitor on project 'k8s-jkns-e2e-gke-ci-canary' with status 1 ===]
"   
 }

BenTheElder · 2019-10-04T03:18:15Z

cc @krzyzacy

BenTheElder · 2019-10-04T03:27:04Z

it looks like the image was last updated a month ago 0fd634d

gcloud compute disks delete --help
NAME
    gcloud compute disks delete - delete Google Compute Engine persistent disks

SYNOPSIS
    gcloud compute disks delete DISK_NAME [DISK_NAME ...] [--zone=ZONE]
        [GCLOUD_WIDE_FLAG ...]

DESCRIPTION
    gcloud compute disks delete deletes one or more Google Compute Engine
    persistent disks. Disks can be deleted only if they are not being used by
    any virtual machine instances.

POSITIONAL ARGUMENTS
     DISK_NAME [DISK_NAME ...]
        Names of the disks to delete.

FLAGS
     --zone=ZONE
        Zone of the disks to delete. If not specified and the compute/zone
        property isn't set, you may be prompted to select a zone.

        To avoid prompting when this flag is omitted, you can set the
        compute/zone property:

            $ gcloud config set compute/zone ZONE

# gcloud compute disks delete --help | tail
    --flags-file, --flatten, --format, --help, --log-http, --project, --quiet,
    --trace-token, --user-output-enabled, --verbosity. Run $ gcloud help for
    details.

NOTES
    These variants are also available:

        $ gcloud alpha compute disks delete
        $ gcloud beta compute disks delete

# gcloud compute disks delete --global
ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global 

To search the help text of gcloud commands, run:
  gcloud help -- SEARCH_TERMS
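
For illustration only, here is a minimal sketch of choosing a scope flag per resource type so that `--global` is never passed to a command that rejects it. The names `ACCEPTED_SCOPES` and `scope_flag` are hypothetical and are not taken from the janitor script. The `disks` row follows the help output quoted above (only `--zone`); the other rows are assumptions that should be checked against the installed gcloud.

```python
# Hypothetical table of the scope flags each delete command accepts.
# "disks" mirrors the help text quoted in this thread; the other rows
# are illustrative assumptions, not verified against any gcloud release.
ACCEPTED_SCOPES = {
    "disks": ("zone",),
    "routers": ("region",),
    "addresses": ("region", "global"),
}


def scope_flag(kind, resource):
    """Return the scope flag for one resource dict from `gcloud ... list --format=json`.

    Raises ValueError rather than falling back to a flag the command rejects.
    """
    accepted = ACCEPTED_SCOPES[kind]
    for scope in ("zone", "region"):
        url = resource.get(scope)
        if url and scope in accepted:
            # gcloud reports zone/region as a full URL; keep the last segment.
            return "--%s=%s" % (scope, url.rsplit("/", 1)[-1])
    if "global" in accepted:
        return "--global"
    raise ValueError("no usable scope for %s %r" % (kind, resource.get("name")))
```

With this shape, a disk that has no `zone` field raises an error in the caller instead of producing the `unrecognized arguments: --global` failure from gcloud.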

BenTheElder · 2019-10-04T03:41:52Z

I can't yet tell what actually broke here, or when. AFAICT we've been running an image from August since then without issues; also, the previous image has the same missing --global flag ...

cblecker · 2019-10-04T03:46:41Z

https://storage.googleapis.com/k8s-gubernator/triage/index.html?ci=0&pr=1&text=hack%2Fe2e-internal%2Fe2e-up.sh#ea6679417165f10786e6

BenTheElder · 2019-10-04T03:52:47Z

$ kubectl get po -n=test-pods -l=app=boskos-janitor-nongke
NAME                                     READY   STATUS    RESTARTS   AGE
boskos-janitor-nongke-7c78646b5d-8rwjm   1/1     Running   0          6d6h
boskos-janitor-nongke-7c78646b5d-wm2fn   1/1     Running   0          6d9h
boskos-janitor-nongke-7c78646b5d-xj5sl   1/1     Running   0          6d8h
boskos-janitor-nongke-7c78646b5d-xl2bw   1/1     Running   0          6d3h

BenTheElder · 2019-10-04T03:55:38Z

I exec'd into the pods and, unsurprisingly, they do seem to be running the janitor script from when the image was updated, so I don't think any terribly recent changes were actually deployed.

BenTheElder · 2019-10-04T03:57:32Z

@krzyzacy feel free to punt this back, but I don't feel that I have the context on what happened here.
@dims can you fill us in on the router issue with cluster-api-provider-gcp?

krzyzacy · 2019-10-04T04:26:19Z

we are obviously not cleaning up routers in https://github.com/kubernetes/test-infra/blob/master/boskos/janitor/gcp_janitor.py#L35-L67

It also seems gcloud deprecated some flags (that --global one), but that should be unrelated.
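
A rough sketch of what cleaning up routers could look like, assuming the input is the JSON printed by `gcloud compute routers list --project=PROJECT --format=json`. The function name `router_delete_commands` is hypothetical, and this is not the code in gcp_janitor.py; it only builds the command lines and leaves running them to the caller.

```python
import json


def router_delete_commands(project, routers_json):
    """Build one `gcloud compute routers delete` command per listed router."""
    commands = []
    for router in json.loads(routers_json):
        # Routers are regional; gcloud reports the region as a full URL.
        region = router["region"].rsplit("/", 1)[-1]
        commands.append([
            "gcloud", "compute", "routers", "delete", router["name"],
            "--project=" + project,
            "--region=" + region,
            "--quiet",  # skip the interactive confirmation prompt
        ])
    return commands
```

Each returned list can be passed to `subprocess.check_call`. Building the commands separately from running them makes a dry run possible before anything is deleted.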

dims · 2019-10-04T13:19:31Z

Thanks @krzyzacy Sen!

@BenTheElder the new CAPG job, which uses boskos to acquire a project to create the actual cluster (it uses kind to bootstrap and then GCP to run the actual cluster), seems to have ended up with some problems. I do try to clean that up here, but some runs may have run into trouble and ended up leaking.
https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/master/hack/ci/e2e-conformance.sh#L101-L105

dims · 2019-10-04T13:21:32Z

@BenTheElder @krzyzacy Here's a fix for one more thing that could leak:
#14617

BenTheElder · 2019-10-04T15:44:36Z

waiting for #14617 to merge and then we need to update the deployment

BenTheElder · 2019-10-04T16:16:34Z

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/post-test-infra-deploy-prow/1180153743044251652

bad bazel version :/

BenTheElder · 2019-10-04T16:37:36Z

if anyone can see https://github.com/kubernetes/test-infra/compare/master...BenTheElder:github-compare-is-broken-ugh?expand=1 or https://github.com/kubernetes/test-infra/compare/master...BenTheElder:upgrade-gcloud-bazel?expand=1 I can't file the PR because github is erroring

BenTheElder · 2019-10-04T16:39:49Z

... after a few minutes of server errors: #14622, #14623

BenTheElder · 2019-10-04T18:00:36Z

ok so we have the gcloud bump in, running a new https://prow.k8s.io/?job=ci-test-infra-autobump-prow and then will let prow bump / deploy

BenTheElder · 2019-10-04T18:11:23Z

#14598

BenTheElder · 2019-10-08T04:42:03Z

see kubernetes/kubernetes#83493 for the real root cause 🤦‍♂

TL;DR: these scale presubmits use a fixed GCP project. I've bumped the quota 3x, from 10 to 30, but I have no idea if that's sufficient.

So far I've observed through manual polling a max of 16/30.
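
The manual polling could be scripted along these lines. This is a sketch that assumes the input is the JSON from `gcloud compute project-info describe --project=PROJECT --format=json`, whose `quotas` list carries `metric`, `limit`, and `usage` fields. The helper name `quota_usage` is hypothetical.

```python
import json


def quota_usage(project_info_json, metric="ROUTERS"):
    """Return (usage, limit) for one project-level quota metric, or None if absent."""
    info = json.loads(project_info_json)
    for quota in info.get("quotas", []):
        if quota.get("metric") == metric:
            return quota["usage"], quota["limit"]
    return None
```

Calling this on a timer and logging the result would show how close the project gets to the limit without anyone having to check by hand.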

BenTheElder · 2019-10-09T22:03:16Z

AFAICT this is fixed.

cblecker added the kind/bug, area/boskos, and kind/failing-test labels Oct 4, 2019

BenTheElder added the kind/oncall-hotlist label Oct 4, 2019

BenTheElder self-assigned this Oct 4, 2019

BenTheElder assigned dims and krzyzacy and unassigned BenTheElder Oct 4, 2019

krzyzacy mentioned this issue Oct 4, 2019

Let's clean up routers as well #14614

Merged

BenTheElder mentioned this issue Oct 4, 2019

bump boskos/janitor to v20191004-585224c4f #14620

Merged

alejandrox1 mentioned this issue Oct 4, 2019

check that N job pods succeeded instead of exactly N pods existing a… kubernetes/kubernetes#83456

Merged

This was referenced Oct 6, 2019

pull-kubernetes-e2e-gce-100-performance fails kubernetes/kubernetes#83493

Closed

pr:pull-kubernetes-e2e-gce-100-performance flaked 43 times in the past week kubernetes/kubernetes#83529

Closed

Jefftree mentioned this issue Oct 7, 2019

Move privilege e2e test to common kubernetes/kubernetes#83211

Merged

quinton-hoole mentioned this issue Oct 7, 2019

job controller support modify sync flag kubernetes/kubernetes#79264

Closed

BenTheElder assigned BenTheElder and unassigned dims and krzyzacy Oct 8, 2019

BenTheElder closed this as completed Oct 9, 2019

This was referenced Jan 17, 2020

Add more boskos projects tektoncd/plumbing#29

Closed

Boskos seems to be wedged tektoncd/plumbing#186

Closed
