Investigate AWS 5k Node e2e Scale costs #6165
AFAIK, we are talking about a single test: ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2. The test runs in a single account. Breaking down some of the services used: the main line items are EBS volumes on the EC2 instances and Data Transfer within the region.
I don't think that was true in November '22? EDIT: These two jobs run 5k nodes in
My understanding is that it was true in the past, but by November we were no longer developing?
If this can account for a dramatic cost increase, we should consider not doing it. EDIT: I remembered wrong, the GCE jobs take 3-4h. I suspect we can run smaller worker nodes or something else along those lines?
Aside: the GCP / AWS comparison is only meant to set a sense of scale, by which this seems to be running much more expensively than expected. I don't expect identical costs and we're not attempting to compare platforms; it will always be apples to oranges between kube-up on GCE and kops on EC2, but the difference is so large that I suspect we're missing something about running these new AWS scale test jobs cost-effectively. The more meaningful datapoint is that this one account consumes ~50% of the AWS budget, which is not expected.
Maybe smaller worker node volumes are in order? Or a different volume type? That said, the spend on this account is by far mostly EC2 instance costs, so we should probably revisit the node sizes / machine types?
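For illustration, a kOps InstanceGroup tweak along these lines might look like the sketch below. The specific machine type, volume size, and node counts are assumptions for the example, not the job's actual configuration:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  role: Node
  machineType: t3.medium  # assumption: a smaller instance type for scale-test workers
  rootVolumeSize: 20      # GiB; shrink the per-node root volume
  rootVolumeType: gp3     # gp3 is typically cheaper per GiB than gp2
  minSize: 5000
  maxSize: 5000
```

With 5,000 nodes, even small per-node savings on the root volume or instance type multiply into a meaningful share of the monthly bill.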
I think we had been running 2 tests until 2 days ago; this PR moved it to one test every 24 hrs.
I remember kicking off one-off tests when experimenting with Networking SLOs in November.
We can definitely improve the cost basis of the instance types we choose; we are having a discussion on Slack around this here - https://kubernetes.slack.com/archives/C09QZTRH7/p1701906151681289?thread_ts=1701899708.933279&cid=C09QZTRH7
I think there are some aspects where kOps could be improved to reduce the cost and duration of the 5k tests:
Do we know how long this step takes today? I want to compare it with internal EKS 5k test runs.
Dumping logs and other info: 60 min
I was wondering if it's feasible to keep just 500 nodes, clean up the other 4,500, and then dump the logs for those 500? That way we would save the 60 minutes spent on 4,500 nodes? WDYT?
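As a back-of-envelope estimate of that suggestion (assuming, purely for illustration, that dump time scales roughly linearly with node count; the thread only states that the full dump takes ~60 minutes):

```python
# Estimate the time saved by dumping logs for only 500 of 5,000 nodes.
# The 60-minute figure comes from the thread; linear scaling is an assumption.
total_nodes = 5000
kept_nodes = 500
full_dump_minutes = 60

# Assume dump time is roughly proportional to the number of nodes dumped.
reduced_dump_minutes = full_dump_minutes * kept_nodes / total_nodes
saved_minutes = full_dump_minutes - reduced_dump_minutes

print(f"reduced dump: {reduced_dump_minutes:.0f} min")  # 6 min
print(f"saved: {saved_minutes:.0f} min")                # 54 min
```

On top of the shorter dump, the 4,500 terminated nodes would also stop accruing EC2 instance cost during that window.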
Could we use spot instances at all? If not, let's document why not.
Currently, I don't think it's possible as the scale test does not tolerate spot instances well. Definitely, that's something we would like to improve though. |
We can definitely do one of those two options 😉
For log dumping: the GCE scale test log dumper is https://github.com/kubernetes/test-infra/blob/master/logexporter/cluster/log-dump.sh. It currently writes a file to the results with a link to where the logs will be dumped in a separate bucket, and grants the test cluster nodes access to the scale log bucket. In this way it dumps all nodes:
IIRC the ssh path is only used for nodes where the logexporter daemonset failed: https://github.com/kubernetes/test-infra/blob/master/logexporter/cluster/logexporter-daemonset.yaml. From the scale test logs:
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
How does the cost look now a days given tests are pretty stable and no dev work as well ? Do we need one more round of optimization efforts ? Or are we within the budget allocation now ? |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
@hakuna-matatah I think costs are currently OK. We are within the monthly budget. @dims I think we are done here?
@ameukam yes we are done here for now.
/close
@ameukam: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Another improvement that reduced the test duration by ~30 minutes - kubernetes/kops#16532
We're spending around 50% of our AWS budget on the 5k node scale tests, which is disproportionate.
We should investigate whether there are knobs we can tune in the cluster configuration to bring this down, or otherwise reduce the interval again.
For a very rough comparison: in November 2022 the 5k node project on GCP spent ~$22k, while we spent > $142k in the 5k node AWS account for November 2023. These aren't equivalent, but I wouldn't expect this order-of-magnitude difference, which suggests there's room to run the job more efficiently (e.g. perhaps we have public IPs on each worker node?). I know there was some tuning on the kube-up jobs.
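To make the gap concrete, a quick ratio from the figures quoted above (these are not equivalent setups, so this is only a sense-of-scale number):

```python
# Rough spend ratio from the figures in this issue.
gcp_nov_2022 = 22_000   # ~$22k on GCP, November 2022
aws_nov_2023 = 142_000  # >$142k on AWS, November 2023 (so the ratio is a lower bound)

ratio = aws_nov_2023 / gcp_nov_2022
print(f"AWS spend is at least {ratio:.1f}x the GCP spend")  # ~6.5x
```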
/sig k8s-infra scalability
cc @dims