
Investigate AWS 5k Node e2e Scale costs #6165

Closed
BenTheElder opened this issue Dec 6, 2023 · 27 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@BenTheElder
Member

BenTheElder commented Dec 6, 2023

We're spending around 50% of our AWS budget on the 5k node scale tests, which is disproportionate.

We should investigate whether there are knobs we can tune in the cluster configuration to bring this down, or otherwise reduce the run interval again.

For a very rough comparison: in November 2022 the 5k node project on GCP spent ~$22k, while we spent > $142k in the 5k node AWS account for November 2023. These aren't equivalent, but I wouldn't expect an order-of-magnitude difference, which suggests there's room to run the job more efficiently (e.g. perhaps we have public IPs on each worker node?). I know there was some tuning on the kube-up jobs.

/sig k8s-infra scalability
cc @dims

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Dec 6, 2023
@ameukam
Member

ameukam commented Dec 6, 2023

cc @hakman @hakuna-matatah @upodroid

@ameukam
Member

ameukam commented Dec 6, 2023

AFAIK, we are talking about a single test: ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2

@ameukam
Member

ameukam commented Dec 6, 2023

The test runs in a single account k8s-infra-e2e-boskos-scale-001.

Breaking down some services used:

| Service | November 2023 (USD) |
| --- | ---: |
| EC2-Instances | 118,494.91 |
| EC2-Other | 15,299.39 |
| Support (Business) | 8,603.50 |
| Elastic Load Balancing | 258.91 |
| Key Management Service | 13.66 |
| S3 | 3.16 |
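
For reference, a breakdown like this can be pulled straight from the Cost Explorer API; a minimal sketch, assuming Cost Explorer is enabled for the account and the caller has `ce:GetCostAndUsage` permission (the SERVICE dimension names won't exactly match the console labels above):

```sh
# Unblended cost for November 2023, grouped by service.
aws ce get-cost-and-usage \
  --time-period Start=2023-11-01,End=2023-12-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output table
```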

@ameukam
Member

ameukam commented Dec 6, 2023

So what's EC2-Other?

[screenshot: Cost Explorer breakdown of the EC2-Other usage types]

So EBS volumes on the EC2 instances and Data Transfer within the region.
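
To see how much of that is sheer volume footprint, something like this can summarize the attached EBS volumes while a test cluster is up; a rough sketch, assuming credentials for the scale account (it only sees volumes that exist at the moment it runs):

```sh
# Count volumes and total provisioned GiB per volume type in the current region.
aws ec2 describe-volumes \
  --query 'Volumes[].[VolumeType,Size]' \
  --output text |
awk '{count[$1]++; gb[$1]+=$2}
     END {for (t in count) printf "%s: %d volumes, %d GiB\n", t, count[t], gb[t]}'
```

Grouping the same Cost Explorer call as above by `USAGE_TYPE` instead of `SERVICE` would also split the EBS spend from the intra-region data transfer.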

@hakuna-matatah

hakuna-matatah commented Dec 6, 2023

  • One reason was that we were running 2 periodics vs 1 periodic on GCP.

  • Another reason is that we kicked off one-off tests using presubmits as well, while making code changes and fixing some issues, before we saw successful runs.

  • We are also running StatefulSets in our cl2 tests; I haven't looked into whether GCP is running StatefulSets as part of its scale tests.

@BenTheElder
Member Author

BenTheElder commented Dec 6, 2023

One reason was that we were running 2 periodics vs 1 periodic on GCP.

I don't think that was true in November '22?
But either way that doesn't account for using ~50% of budget on AWS and a much smaller fraction on GCP.

EDIT: These two jobs run 5k nodes in k8s-infra-e2e-scale-5k-project:
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-correctness
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance

Another reason is that we kicked off one-off tests using presubmits as well, while making code changes and fixing some issues, before we saw successful runs.

My understanding is that was true in the past, but in November we were not developing anymore?
This was discussed in today's meeting.

We are also running StatefulSets in our cl2 tests; I haven't looked into whether GCP is running StatefulSets as part of its scale tests.

If this can account for a dramatic cost increase, we should consider not doing it.
I doubt it, since the GCP tests have been extremely long-running (like > 10h), so it's mostly down to the size and shape of the cluster under test (or something else comparable, like making sure we have reliable cleanup of resources when the test is done).

EDIT: I remembered wrong; the GCE jobs take 3-4h.

I suspect we can run smaller worker nodes or something else along those lines?

@BenTheElder
Member Author

BenTheElder commented Dec 6, 2023

Aside: the GCP / AWS comparison is only to set a sense of scale by which this seems to be running much more expensively than expected. I don't expect identical costs and we're not attempting to compare platforms ... it will always be apples to oranges between kube-up on GCE and kops on EC2, but the difference is so large that I suspect we're missing something about running these new AWS scale test jobs cost-effectively.

The more meaningful data point is consuming ~50% of the AWS budget just for this account, which is not expected.

So EBS volumes on the EC2 instances and Data Transfer within the region.

Maybe smaller worker node volumes are in order? Or a different volume type?

It looks like the spend on this account is by far mostly EC2 instance costs though, so probably we should revisit the node sizes / machine types?
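
If we revisit those, the knobs for a kops-based job are mostly on the `kops create cluster` side. A rough sketch of the kind of flags involved (flag names vary by kOps version, the cluster name is made up, and the sizes are illustrative rather than what the job uses today):

```sh
# Illustrative only: smaller worker instance type and root volumes for a 5k-node cluster.
kops create cluster \
  --name scale-test.example.k8s.local \
  --cloud aws \
  --zones us-east-2a \
  --node-count 5000 \
  --node-size m6i.large \
  --node-volume-size 20 \
  --master-size c5.4xlarge \
  --master-volume-size 64 \
  --yes
```

Whether 5k nodes still pass the cl2 SLOs on smaller instance types is exactly what would need validating.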

@dims
Member

dims commented Dec 6, 2023

cc @hakuna-matatah @mengqiy

@hakuna-matatah

hakuna-matatah commented Dec 6, 2023

I don't think that was true in November '22?
But either way that doesn't account for using ~50% of budget on AWS and a much smaller fraction on GCP.

I think we had been running 2 tests until 2 days back; this PR moved it to one test every 24 hrs.

My understanding is that was true in the past, but in November we were not developing anymore?
This was discussed in today's meeting.

I remember kicking off one-off tests when experimenting with Networking SLOs in November.

Maybe smaller worker node volumes are in order? Or a different volume type?

We can definitely improve the cost basis of the instance types we choose; we are having a discussion on Slack around this here: https://kubernetes.slack.com/archives/C09QZTRH7/p1701906151681289?thread_ts=1701899708.933279&cid=C09QZTRH7

@hakman
Member

hakman commented Dec 7, 2023

I think there are some aspects where kOps may be improved to reduce the cost and duration of the 5k tests:

  • instance root volume size - the default is 48 GB, which seems way too much for this use case
  • node registration - probably client-side throttling when so many nodes try to join at the same time
  • collecting the logs and data from the nodes after tests - done sequentially, so we limited it to 500 nodes, but we can probably do it in parallel (see the sketch after this list)
  • resource cleanup - there is a lot of room for optimisation there; because of the many AWS API requests, they get throttled often
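
For the log-collection bullet above, a minimal sketch of pulling a log file from many nodes concurrently instead of one at a time; the `nodes.txt` list, the SSH user, and the log path are assumptions for illustration, not what the job does today:

```sh
# Hypothetical: nodes.txt holds one node address per line.
# Fetch kubelet logs from up to 50 nodes at a time instead of sequentially.
mkdir -p node-logs
xargs -P 50 -I {} sh -c '
  mkdir -p "node-logs/{}" &&
  scp -o StrictHostKeyChecking=no "ubuntu@{}:/var/log/kubelet.log" "node-logs/{}/"
' < nodes.txt
```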

@hakuna-matatah

resource cleanup - there is a lot of room for optimisation there; because of the many AWS API requests, they get throttled often

Do we know how long this step is taking today? I want to compare it with internal EKS 5k test run teardown times. And what throttling are we experiencing, in terms of APIs? We can try to increase the limits if the current limits are not enough for the deletion process.

@hakman
Member

hakman commented Dec 7, 2023

Dumping logs and other info: 60 min
Resource cleanup: 25 min

@hakuna-matatah

Dumping logs and other info: 60 min

I was wondering if it's feasible to keep just 500 nodes, clean up the other 4,500, and then dump the logs for those 500? That way we would save 60 mins on 4,500 nodes? WDYT?

@sftim
Contributor

sftim commented Dec 7, 2023

Could we use spot instances at all? If not, let's document why not.

@marseel
Member

marseel commented Dec 7, 2023

Could we use spot instances at all? If not, let's document why not.

Currently I don't think it's possible, as the scale test does not tolerate spot instances well. That's definitely something we would like to improve, though.

@sftim
Contributor

sftim commented Dec 7, 2023

We can definitely do one of those two options 😉

@BenTheElder
Member Author

BenTheElder commented Dec 7, 2023

For log dumping: the GCE scale test log dumper SSHes to the nodes and then actually pushes the logs out from each node, so the local command in the e2e pod is relatively cheap/quick (plus the parallel dumping mentioned above).

https://github.com/kubernetes/test-infra/blob/master/logexporter/cluster/log-dump.sh

To do this, it currently writes a file to the results with a link to where the logs will be dumped in a separate bucket, and just grants the test cluster nodes access to the scale log bucket.

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1732445180289617920/artifacts/master-and-node-logs.link.txt

In this way it dumps logs from all nodes:

$ gsutil ls gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1732445180289617920 | wc -l
    5003
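
A rough sketch of the per-node push that this approach implies (the real implementation is the log-dump script linked above; the log paths here are assumptions, while the bucket and run ID are the ones from the link above):

```sh
# Runs on each node (e.g. from a DaemonSet pod), so the e2e pod never handles the bytes.
# Assumes gsutil is present on the node and the node's service account can write to the bucket.
NODE="$(hostname)"
BUCKET="gs://k8s-infra-scalability-tests-logs"
RUN="ci-kubernetes-e2e-gce-scale-correctness/1732445180289617920"
tar czf "/tmp/${NODE}-logs.tar.gz" /var/log/kubelet.log /var/log/kube-proxy.log 2>/dev/null
gsutil cp "/tmp/${NODE}-logs.tar.gz" "${BUCKET}/${RUN}/${NODE}/"
```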

@marseel
Member

marseel commented Dec 7, 2023

For log dumping: the GCE scale test log dumper SSHes to the nodes and then actually pushes the logs out from each node, so the local command in the e2e pod is relatively cheap/quick (plus the parallel dumping mentioned above).

IIRC the SSH approach is only used for nodes where the logexporter DaemonSet failed: https://github.com/kubernetes/test-infra/blob/master/logexporter/cluster/logexporter-daemonset.yaml

from scale test logs:

Dumping logs from nodes to GCS directly at 'gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1732082847704944640' using logexporter

@hakuna-matatah

Looks like there is another improvement made by @rifelpet w.r.t. parallelizing resource dumps to save cost here; linking it as it's related to the discussion we are having here.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 1, 2024
@hakuna-matatah

How does the cost look nowadays, given the tests are pretty stable and there's no dev work either? Do we need one more round of optimization efforts? Or are we within the budget allocation now?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 1, 2024
@ameukam
Member

ameukam commented May 2, 2024

@hakuna-matatah I think costs are currently OK. We are within the monthly budget.

@dims I think we are done here?

@dims
Member

dims commented May 2, 2024

@ameukam Yes, we are done here for now.

@ameukam
Member

ameukam commented May 2, 2024

/close

@k8s-ci-robot
Contributor

@ameukam: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hakuna-matatah

Another improvement that reduced the test duration by ~30 mins: kubernetes/kops#16532
