Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scalability tests #1931

Merged
merged 5 commits into from Apr 17, 2024
Merged

Scalability tests #1931

merged 5 commits into from Apr 17, 2024

Conversation

trasc
Copy link
Contributor

@trasc trasc commented Mar 29, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add scalability test to be able to detect regressions in Kueue's overall scheduling performance.

Which issue(s) this PR fixes:

Relates to #1912

Special notes for your reviewer:

This should be the fist stage of this change, once this is merged and we have a CI job set up, we should refine the generator config and the expected metrics value.

Does this PR introduce a user-facing change?

Added scalability test for scheduling performance

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 29, 2024
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 29, 2024
Copy link

netlify bot commented Mar 29, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit f3a7d05
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/661fd89884e7b600089b4b59

@alculquicondor
Copy link
Contributor

/assign @mimowo

@trasc trasc force-pushed the scalability-tests branch 2 times, most recently from 29752d1 to bc15fd0 Compare April 4, 2024 12:26
@trasc
Copy link
Contributor Author

trasc commented Apr 4, 2024

/test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 5, 2024
@trasc trasc changed the title Scalability tests Scalabilit tests Apr 5, 2024
@trasc trasc marked this pull request as ready for review April 5, 2024 11:56
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mimowo April 5, 2024 11:56
@mimowo
Copy link
Contributor

mimowo commented Apr 8, 2024

ack

@trasc trasc changed the title Scalabilit tests Scalability tests Apr 9, 2024
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I would like to steer the first iteration of the PR in this direction:

  • drop custom code to track individual workloads, events, queues with controllers
  • we only track metrics, scraped in 1s intervals by default. A report is saved for a subset of metrics, only guage and counter, in csv as time series. I'm ok with the metrics scraping as a follow up.

I think this approach will be much easier to maintain going forward. It will also create a healthy incentive to add metrics, which will be used on production clusters.

test/scalability/README.md Outdated Show resolved Hide resolved
test/scalability/README.md Outdated Show resolved Hide resolved
test/scalability/README.md Show resolved Hide resolved
test/scalability/README.md Outdated Show resolved Hide resolved
test/scalability/default_rangespec.yaml Outdated Show resolved Hide resolved
test/scalability/checker/checker_test.go Outdated Show resolved Hide resolved
test/scalability/default_generator_config.yaml Outdated Show resolved Hide resolved
test/scalability/default_rangespec.yaml Outdated Show resolved Hide resolved
test/scalability/runner/recorder/recorder.go Outdated Show resolved Hide resolved
@trasc
Copy link
Contributor Author

trasc commented Apr 10, 2024

Generally I would like to steer the first iteration of the PR in this direction:

  • drop custom code to track individual workloads, events, queues with controllers
  • we only track metrics, scraped in 1s intervals by default. A report is saved for a subset of metrics, only guage and counter, in csv as time series. I'm ok with the metrics scraping as a follow up.

I think this approach will be much easier to maintain going forward. It will also create a healthy incentive to add metrics, which will be used on production clusters.

Let's go with the set of features already developed and add new functionality in follow-ups.

@mimowo
Copy link
Contributor

mimowo commented Apr 10, 2024

Let's go with the set of features already developed and add new functionality in follow-ups.

I would like to drop the bits indicated in the comments to simplify the code, there is a maintenance cost if we commit them.

@trasc
Copy link
Contributor Author

trasc commented Apr 10, 2024

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2024
@trasc
Copy link
Contributor Author

trasc commented Apr 11, 2024

Relying on kueue's internal metrics in this scenario is problematic because:

  • Adds extra CPU consumption in minimalkueue: Serving the scraping will add extra CPU consumption and make the critical path (scheduling) less relevant in the overall consumption.
  • Are unflexible: No workload class specific measurements can be extracted. In my opinion being able track the time to admission for specific type of workload (having different priorities) is important.
  • Are not very precise:
    • Scraping at 1s intervals while the workloads we are using run for at most 1s (90% of them way under) can miss important hot spots.
    • With some exceptions, the timings computed for the metrics are based on some k8s api stored time (creation, condition last transition) which have seconds resolution.

It needs to be mentioned that the runner components that are extracting the measurements have additional attributes, (managing when a workload is declared finished, managing workload's eviction , determine when the scenario execution has ended.)

@trasc
Copy link
Contributor Author

trasc commented Apr 16, 2024

Split done, #1987 contains the runner extracted metrics.

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, thanks for the cleanup.

test/scalability/runner/main.go Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved
test/scalability/README.md Show resolved Hide resolved
@trasc
Copy link
Contributor Author

trasc commented Apr 16, 2024

LGTM overall, thanks for the cleanup.

We should not look at it as a cleanup, is just a split in two PRs.

@mimowo
Copy link
Contributor

mimowo commented Apr 16, 2024

We should not look at it as a cleanup, is just a split in two PRs.

Right, but let's park the second part until we play more with the tool, and have some idea if this is really needed, because it adds significant complexity to the code.

@mimowo
Copy link
Contributor

mimowo commented Apr 16, 2024

/lgtm
/approve

/assign @alculquicondor
For the top-level Makefile

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 16, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 7798498d85c2e3dc8dce9db1952b349225ae70d3

Makefile Show resolved Hide resolved
endif

ifdef SCALABILITY_KUEUE_LOGS
SCALABILITY_EXTRA_ARGS += --withLogs=true --logToFile=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the file that it goes to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bin/run-scalability/minimalkueue.err.log and bin/run-scalability/minimalkueue.out.log

(Just a side note, 1820 was really good, only from log size POV it got the size of bin/run-scalability/minimalkueue.err.log from around 3GB to under 100MB )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @gabesaba, as the person who discovered the issue :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, actually, @gabesaba's was #1897
It was probably a combination of both.

@@ -0,0 +1,64 @@
# Scalability test

Is a test meant to detect regressions int the Kueue's overall scheduling capabilities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should put all of this inside test/performance/scheduling?

The existing tests in test/performance are scalability tests as well, just a different level.

Another potential directory structure could be:

test/performance (for this tool) and test/performance_e2e (for the cl2 based things).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can but let's come up with a precise naming scheme, not only for the code location but also the artifacts and make targets, otherwise it will be hard to follow the terminology.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the target names should match. Maybe we can put all the targets for "performance" inside its own Makefile. so it would look like:

make test/performance/jobs test/performance/scheduling

But we can leave them for a follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with the follow-up then

test/scalability/README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,31 @@
- className: cohort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no templates for the objects, right? They are all generated in code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In short yes, we can extend the schema for this file if needed, but to keep it simple for now is better to just "hardcode" the genaration of namespace, LQs, ResourceFlavor ....

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2024
@alculquicondor
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2024
test/scalability/README.md Outdated Show resolved Hide resolved
test/scalability/checker/checker_test.go Outdated Show resolved Hide resolved
test/scalability/default_rangespec.yaml Outdated Show resolved Hide resolved
test/scalability/minimalkueue/main.go Outdated Show resolved Hide resolved
test/scalability/runner/controller/controller.go Outdated Show resolved Hide resolved
@alculquicondor
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: d263b36621e23db124057908d6673e1576c52513

@k8s-ci-robot k8s-ci-robot merged commit c691eef into kubernetes-sigs:main Apr 17, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Apr 17, 2024
@trasc trasc deleted the scalability-tests branch April 17, 2024 15:02
vsoch pushed a commit to researchapps/kueue that referenced this pull request Apr 18, 2024
* [testing] Make RestConfigToKubeConfig an util function.

* [scalability] Initial implementation.

* Review remarks.

* Review Remarks

* Review Remarks
@alculquicondor
Copy link
Contributor

/release-note-edit

Added scalability test for scheduling performance

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants