
Migrate k8s-infra-prow-build to a nodepool with more IOPS #1187

Closed
spiffxp opened this issue Aug 28, 2020 · 38 comments
Assignees: spiffxp
Labels:
  • area/prow: Setting up or working with prow in general, prow.k8s.io, prow build clusters
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
Milestone: v1.21

Comments

@spiffxp
Member

spiffxp commented Aug 28, 2020

This is followup to #1168 and #1173, made possible by quota changes done via #1132

My goal is to make these graphs go down
[screenshot: throttled I/O graphs for k8s-infra-prow-build, 2020-08-28]

Our jobs are hitting I/O limits (both IOPS and throughput). That was made extra clear last weekend when we switched to larger nodes, thus causing more jobs to share the same amount of I/O.

We're seeing more jobs scheduled into the cluster now that v1.20 PRs are being merged. While our worst case node performance is about the same, we are seeing more throttling across the cluster in aggregate.

Kubernetes doesn't give us a way to provision I/O, so we're left optimizing per-node performance. Based on https://cloud.google.com/compute/docs/disks/performance I think we can get just under 2x the IOPS for a ~14% increase in cluster cost.

From there, going to the next tier would require a 90% increase in cost for only 66% more performance.

Ideally we would use local SSD, but:

@spiffxp
Member Author

spiffxp commented Aug 28, 2020

/area prow
/sig testing
/wg k8s-infra

@k8s-ci-robot added the area/prow, sig/testing, and wg/k8s-infra labels on Aug 28, 2020
@spiffxp
Member Author

spiffxp commented Aug 28, 2020

Opened #1186 to start migrating to the first option (14% more cost for ~100% more IOPS).
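
For reference, a minimal sketch of what provisioning a node pool along those lines with gcloud might look like. This is illustrative only: the actual change went in via #1186, and the pool name, machine type, disk size, and autoscaling bounds below are assumptions rather than the exact values used.

# Hedged sketch: a replacement node pool with pd-ssd boot disks to raise
# per-node IOPS/throughput. Names and sizes are illustrative assumptions.
gcloud container node-pools create pool-ssd-example \
  --project=k8s-infra-prow-build \
  --cluster=prow-build \
  --region=us-central1 \
  --machine-type=n1-highmem-8 \
  --disk-type=pd-ssd \
  --disk-size=500 \
  --enable-autoscaling --min-nodes=1 --max-nodes=80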

@spiffxp
Member Author

spiffxp commented Aug 28, 2020

New nodepool is up, old nodepool cordoned

spiffxp@cloudshell:~ (k8s-infra-prow-build)$ for n in $(k get nodes -l cloud.google.com/gke-nodepool=pool3-20200824192452986800000001 -o=name); do k cordon $n; done
node/gke-prow-build-pool3-2020082419245298-65156f0e-0cwp cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-4xr9 cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-7bnv cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-7x2j cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-9q3g cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-fmpb cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-v5l5 cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-z4lq cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-5sqt cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-6kr5 cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-7wr8 cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-b5ns cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-h4v9 cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-nb31 cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-0sn6 cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-8qq0 cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-b496 cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-jt7h cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-lk1n cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-lnkj cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-mrt2 cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-qwl2 cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-s9x6 cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-tkwx cordoned
node/gke-prow-build-pool3-2020082419245298-f02ec66b-x2g2 cordoned

@spiffxp
Member Author

spiffxp commented Aug 28, 2020

And I forgot to disable autoscaling for pool3 until just now

spiffxp@cloudshell:~ (k8s-infra-prow-build)$ for n in $(k get nodes -l cloud.google.com/gke-nodepool=pool3-20200824192452986800000001 -o=name); do k cordon $n; done | grep -v already
node/gke-prow-build-pool3-2020082419245298-65156f0e-4bsj cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-bzg3 cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-l8n1 cordoned
node/gke-prow-build-pool3-2020082419245298-65156f0e-n6mz cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-1bgk cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-968j cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-fpqb cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-nv8r cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-qgx2 cordoned
node/gke-prow-build-pool3-2020082419245298-d2051b67-s4kj cordoned

@spiffxp
Member Author

spiffxp commented Aug 28, 2020

Deleted boskos

$ date; k delete pod -n test-pods -l app=boskos; date
Fri 28 Aug 2020 08:01:08 PM UTC
pod "boskos-564f5594dd-sk2jv" deleted
Fri 28 Aug 2020 08:01:17 PM UTC

$ date; k delete pod -n test-pods -l app=boskos-janitor; date
Fri 28 Aug 2020 08:01:38 PM UTC
pod "boskos-janitor-58c6d75dc9-684lm" deleted
pod "boskos-janitor-58c6d75dc9-s9b6s" deleted
Fri 28 Aug 2020 08:06:47 PM UTC

$ date; k delete pod -n test-pods -l app=boskos-reaper; date
Fri 28 Aug 2020 08:16:51 PM UTC
pod "boskos-reaper-56b467f9d8-c6rfw" deleted
Fri 28 Aug 2020 08:16:57 PM UTC

Waiting on the following to finish up:

$ date; k get pods -n test-pods --field-selector=status.phase=Running -o=json | jq -r '.items | map(select(.spec.nodeName | match("pool3")))[] | "\(.status.startTime) \(.metadata.labels["prow.k8s.io/job"]) \(.metadata.name) \(.spec.nodeName)"' | sort | tee old-nodepool-jobs
Fri 28 Aug 2020 08:43:02 PM UTC
2020-08-28T16:05:56Z ci-kubernetes-e2e-gce-cos-k8sbeta-serial 53dbcab4-e948-11ea-aa1a-c6580c04344b gke-prow-build-pool3-2020082419245298-f02ec66b-0sn6
2020-08-28T16:12:56Z ci-kubernetes-e2e-gci-gce-serial 4e00d2eb-e949-11ea-aa1a-c6580c04344b gke-prow-build-pool3-2020082419245298-f02ec66b-tkwx
2020-08-28T17:34:56Z ci-kubernetes-e2e-gce-cos-k8sstable1-serial c2d6b387-e954-11ea-aa1a-c6580c04344b gke-prow-build-pool3-2020082419245298-d2051b67-b5ns
2020-08-28T18:26:56Z ci-kubernetes-gce-conformance-latest-1-16 067d54f8-e95c-11ea-aa1a-c6580c04344b gke-prow-build-pool3-2020082419245298-f02ec66b-0sn6
2020-08-28T18:47:48Z ci-kubernetes-gce-conformance-latest-kubetest2 8a2d1fdd-e95e-11ea-aa1a-c6580c04344b gke-prow-build-pool3-2020082419245298-65156f0e-bzg3

@spiffxp
Member Author

spiffxp commented Aug 28, 2020

Trimmed empty nodes
[screenshot: node counts after trimming empty nodes, 2020-08-28]

@spiffxp
Member Author

spiffxp commented Aug 28, 2020

Removed old nodepool with #1188

@spiffxp
Member Author

spiffxp commented Aug 28, 2020

Holding this open to see what impact, if any, this has on the graphs shown in the description

https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&timeDomain=1w

@spiffxp
Member Author

spiffxp commented Aug 31, 2020

No real change in the graphs, other than a reflection of PR traffic. This certainly didn't make things worse, and the added cost isn't urgent, so I'm not inclined to roll back at the moment.

[screenshot: throttled I/O graphs after the migration, 2020-08-31]

Supposedly a PER_GB throttle reason means the fix is to increase the persistent disk size (ref: https://cloud.google.com/compute/docs/disks/review-disk-metrics#throttling_metrics)

One option would be to increase disk size to the next "tier" and see what happens. But I think I'd like to do a little more reading and focused testing to understand what's going on, and what options we have.
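
In the meantime, a quick way to see what's currently provisioned behind the nodes (the PER_GB limits scale with disk size) is something like the sketch below; the name filter is an assumption about how the GKE boot disks are named.

# Hedged sketch: list boot disk size/type for the build cluster's nodes, since
# PER_GB throttling limits scale with provisioned capacity.
gcloud compute disks list --project=k8s-infra-prow-build \
  --filter='name ~ gke-prow-build' \
  --format='table(name,zone.basename(),sizeGb,type.basename())'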

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 29, 2020
@spiffxp
Member Author

spiffxp commented Jan 20, 2021

/remove-lifecycle rotten
https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/local-ssd

It's possible (under pre-GA terms) to create node pools with Local SSD as of GKE 1.18. We'd need to upgrade the cluster to that version first.
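
A rough sketch of what creating such a pool with gcloud might look like; the --ephemeral-storage flag spelling follows the pre-GA docs linked above and should be verified against the current gcloud version, and the pool name and machine type are assumptions.

# Hedged sketch: node pool whose ephemeral storage is backed by local SSDs.
# Flag spelling per the pre-GA docs; verify against your gcloud version.
gcloud beta container node-pools create pool-localssd-example \
  --project=k8s-infra-prow-build \
  --cluster=prow-build \
  --region=us-central1 \
  --machine-type=n1-highmem-8 \
  --ephemeral-storage local-ssd-count=2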

@k8s-ci-robot removed the lifecycle/rotten label on Jan 20, 2021
@spiffxp
Member Author

spiffxp commented Jan 21, 2021

I'm really interested in seeing this happen, but I can't guarantee I'll have bandwidth this cycle, so I'm leaving this out of the milestone. It's gated on migrating the cluster to 1.18.

@spiffxp
Member Author

spiffxp commented Jan 22, 2021

/priority backlog
/milestone v1.21
Changed my mind about milestoning; I'll assign low priority.

@k8s-ci-robot added the priority/backlog label on Jan 22, 2021
@k8s-ci-robot added this to the v1.21 milestone on Jan 22, 2021
@spiffxp
Member Author

spiffxp commented Sep 27, 2021

Provisioning a nodepool with ephemeral-ssd-storage=true as a taint: #2835

I'll try cutting over some canary presubmits to see what the behavior is
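
A quick way to sanity-check that the taint is in place on the new pool, so only jobs with a matching toleration (the canary presubmits) schedule there; the nodepool name below is the generated one referenced later in this thread.

# Hedged sketch: show each new-pool node and the taint keys it carries.
kubectl get nodes -l cloud.google.com/gke-nodepool=pool5-20210927201718330300000001 \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'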

@spiffxp
Member Author

spiffxp commented Sep 27, 2021

kubernetes/test-infra#23783 will cut over:

  • pull-kubernetes-e2e-gce-canary
  • pull-kubernetes-integration-go-canary
  • pull-kubernetes-e2e-kind-canary
  • pull-kubernetes-unit-go-canary
  • pull-kubernetes-verify-go-canary

All of these are manually triggered.

@spiffxp
Member Author

spiffxp commented Sep 27, 2021

After an initial round of canary jobs against a single PR, I have kicked off the canary jobs against a handful of arbitrary kubernetes/kubernetes PRs to trigger autoscaling and evaluate node disk usage under some level of concurrency/load.

https://console.cloud.google.com/monitoring/dashboards/builder/f0163540-a8b7-4618-8308-66652d3d4794?project=k8s-infra-prow-build&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=1h is the dashboard I'm using to watch the pot boil

The old pool is on the left (pool4) and the new pool is on the right (pool5). By default, Google Cloud Monitoring doesn't appear to let me manually set the Y-axis scales, so I added an arbitrary threshold to each graph to give them the same scale. You can see we're experiencing far less throttling with the new pool.

[screenshot: throttled I/O dashboards, pool4 (left) vs pool5 (right), 2021-09-27]

@spiffxp
Member Author

spiffxp commented Sep 27, 2021

Need to cost out and estimate quota before rolling this out more generally. The numbers look good enough that I'm interested in doing so. However...

I legitimately can't tell whether there's any immediately obvious speedup from doing this. I'll let the other jobs finish and take a quick look at PR history tomorrow. The only other thing this might allow is lowering some CPU/memory resource limits to pack jobs more densely, if they're in fact not going to be as noisy to each other. That will probably require more attention than I have time for right now.

@spiffxp
Member Author

spiffxp commented Sep 28, 2021

https://cloud.google.com/compute/disks-image-pricing#localssdpricing

Local SSDs are $30/mo each, so 2 = $60/mo
Zonal SSDs are $0.17/GB/mo, so 500GB = $85/mo
Zonal standard PDs are $0.04/GB/mo, so 100GB = $4/mo

https://cloud.google.com/compute/vm-instance-pricing

n1-highmem-8 instances are ~$241/mo

pool4 instances are n1-highmem-8 + 500GB pd-ssd = $241 + $85 = $326/mo
pool5 instances are n1-highmem-8 + 100GB pd-standard + 2x local SSD = $241 + $4 + $60 = $305/mo

That's about 7% savings; we could bring that to ~16% if we used only 1 local SSD. Our total spend just for k8s-infra-prow-build over the last year was ~$258K, so 7% savings would be ~$18K and 16% would be ~$40K. Not nothing, but not hugely significant compared with our total budget.
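
As a quick sanity check on those percentages, a sketch of the arithmetic using the rounded prices above (it lands at roughly 6-7% and ~16%, in the same ballpark):

# Rough per-node monthly cost (USD) with the rounded prices quoted above.
pool4=$((241 + 85))             # n1-highmem-8 + 500GB pd-ssd
pool5=$((241 + 4 + 60))         # n1-highmem-8 + 100GB pd-standard + 2x local SSD
pool5_one_ssd=$((241 + 4 + 30)) # hypothetical: only 1 local SSD
awk -v a="$pool4" -v b="$pool5" -v c="$pool5_one_ssd" 'BEGIN {
  printf "pool4=$%d/mo  pool5=$%d/mo  pool5(1 ssd)=$%d/mo\n", a, b, c
  printf "savings: %.1f%% (2 local SSDs), %.1f%% (1 local SSD)\n", 100*(a-b)/a, 100*(a-c)/a
}'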

Quota:

  • pool4 is capped at 80 instances * 3 zones * 2 local ssds * 375GB/ssd = 180,000 GB
  • current quota for us-central1 is 100,000 GB
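
To see where that quota currently stands, something like the following (LOCAL_SSD_TOTAL_GB is the regional quota metric for local SSD capacity):

# Hedged sketch: inspect the local SSD quota for us-central1 before requesting a bump.
gcloud compute regions describe us-central1 \
  --project=k8s-infra-prow-build --format=json \
  | jq '.quotas[] | select(.metric == "LOCAL_SSD_TOTAL_GB")'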

Conclusions:

  • bump quota to 200,000 GB local SSD
  • pool5's current configuration doesn't appear to impact performance negatively
  • pool5 will be slightly cheaper than before
  • let's move forward with this, and consider denser/higher-CPU nodes in the future

@spiffxp
Member Author

spiffxp commented Sep 28, 2021

Looks like I'm going to have to force recreation of the node pool to drop the taint

$ gcloud beta container node-pools update pool5-20210927201718330300000001 --project=k8s-infra-prow-build --zone=us-central1 --cluster=prow-build --node-taints=""
ERROR: (gcloud.beta.container.node-pools.update) ResponseError: code=400, message=Updates for 'taints' are not supported in node pools with autoscaling enabled (as a workaround, consider temporarily disabling autoscaling or recreating the node pool with the updated values.).
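
For the record, the workaround the error suggests would look roughly like the sketch below (temporarily disable autoscaling, clear the taints, re-enable); the min/max node counts are assumptions, and the pool ended up being recreated instead (see the migration below).

# Hedged sketch of the error's suggested workaround, not what was actually run.
gcloud container clusters update prow-build \
  --project=k8s-infra-prow-build --region=us-central1 \
  --node-pool=pool5-20210927201718330300000001 --no-enable-autoscaling
gcloud beta container node-pools update pool5-20210927201718330300000001 \
  --project=k8s-infra-prow-build --region=us-central1 \
  --cluster=prow-build --node-taints=""
gcloud container clusters update prow-build \
  --project=k8s-infra-prow-build --region=us-central1 \
  --node-pool=pool5-20210927201718330300000001 \
  --enable-autoscaling --min-nodes=1 --max-nodes=80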

@aojea
Member

aojea commented Sep 28, 2021

Out of curiosity, why 2 local SSDs?

@spiffxp
Member Author

spiffxp commented Sep 28, 2021

Why not less:

  • I wanted to match the 500GB amount we used previously, and 375GB was too little
  • I wanted to over-provision capacity so as not to saturate a single ssd

Why not more:

  • Based on experiment (see dashboard pic above), it seemed like there was an order of magnitude improvement in IOPS/throttled bytes, but no real corresponding change in build performance
  • Anything more starts to raise our costs, and I'm starting to get slightly price sensitive

@spiffxp
Member Author

spiffxp commented Sep 28, 2021

OK, migration to the new nodepool with local SSDs for ephemeral storage is complete; see #2839 for details.

Throttled I/O is way down post-migration
[screenshot: throttled I/O post-migration, 2021-09-28]

Throttled bytes from old pool nodes on the left, new pool nodes on the right
[screenshot: throttled bytes, old pool nodes (left) vs new pool nodes (right), 2021-09-28]

@spiffxp
Member Author

spiffxp commented Sep 28, 2021

I'll hold this open for a day to see if this had any negative impact, but I otherwise consider this issue closed.

It'll take a bit to determine whether this has had any impact on job/build time. Again, my guess based on a brief survey of yesterday's canary jobs is negligible impact at best, but hopefully fewer noisy neighbors. We lack a great way to display this data at present, though I suspect it will be available in some form in the k8s-gubernator:build BigQuery dataset.

@spiffxp
Member Author

spiffxp commented Sep 29, 2021

/close
Calling this done

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close
Calling this done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp
Member Author

spiffxp commented Sep 29, 2021

/assign
/remove-help
