Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler integration benchmark improvements #107677

Merged
merged 6 commits into from
Feb 14, 2022

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Jan 20, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

I was running the test/integration/scheduler_perf and encountered various issues in it and some of the scheduler code. See the individual commit messages for details.

Special notes for your reviewer:

I still have problems running all benchmarks in a single "go test" invocation as discussed in https://kubernetes.slack.com/archives/C09TP78DV/p1642623117030500 but this PR helps a bit.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 20, 2022
@k8s-ci-robot k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 20, 2022
@alculquicondor
Copy link
Member

alculquicondor commented Jan 21, 2022

/assign @sanposhiho
for a first pass for sig-scheduling

@pohly
Copy link
Contributor Author

pohly commented Jan 21, 2022

/retest

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 22, 2022
@alculquicondor
Copy link
Member

/approve

@pohly pohly force-pushed the scheduler-integration-benchmark branch from b4d4076 to 07bdfc7 Compare January 25, 2022 19:33
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 25, 2022
@pohly
Copy link
Contributor Author

pohly commented Jan 25, 2022

I've rebased due to a conflict.

Can I get this reviewed? @alculquicondor, perhaps you can start for SIG Scheduling?

@alculquicondor
Copy link
Member

I already reviewed and only left 1 comment.

But lgtm anyways

/lgtm

@pohly
Copy link
Contributor Author

pohly commented Jan 28, 2022

/retest

@oomichi
Copy link
Member

oomichi commented Feb 3, 2022

./test part seems good for me.

/lgtm
/approve

if err := c.EndpointReconciler.RemoveEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts); err == nil {
klog.Error("Found stale data, removed previous endpoints on kubernetes service, apiserver didn't exit successfully previously")
} else if !storage.IsNotFound(err) {
if err := c.EndpointReconciler.RemoveEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts); err != nil && !storage.IsNotFound(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea wasn't that your code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, why do we need to change it?
#105531

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the commit message: be4bf5c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea : does that explain it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, I understand now, this thing bit me twice this week
The test is using the masterCount reconciler, not the lease reconciler. The lease reconciler is the default ...
The integration framework should use the lease reconciler ... I think I will write something to try to deprecate the masterCount reconciler, it is flaky :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EndpointReconcilerType: string(reconcilers.LeaseEndpointReconcilerType),

Copy link
Member

@aojea aojea Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, found the inconsistency

diff --git a/pkg/controlplane/instance.go b/pkg/controlplane/instance.go
index 424cd0ce54b..6426038e3b2 100644
--- a/pkg/controlplane/instance.go
+++ b/pkg/controlplane/instance.go
@@ -274,9 +274,9 @@ func (c *Config) createEndpointReconciler() reconcilers.EndpointReconciler {
        klog.Infof("Using reconciler: %v", c.ExtraConfig.EndpointReconcilerType)
        switch c.ExtraConfig.EndpointReconcilerType {
        // there are numerous test dependencies that depend on a default controller
-       case "", reconcilers.MasterCountReconcilerType:
+       case reconcilers.MasterCountReconcilerType:
                return c.createMasterCountReconciler()
-       case reconcilers.LeaseEndpointReconcilerType:
+       case "", reconcilers.LeaseEndpointReconcilerType:
                return c.createLeaseReconciler()
        case reconcilers.NoneEndpointReconcilerType:
                return c.createNoneReconciler()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the code and deprecate the master-count
#107952

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @aojea is going to address the false error message, I took this commit out of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR was closed without a conclusion, but the issue about the error message is still open: #107953

// line. k8s.io/component-base/logs called klog.InitFlags for us.
v := flag.Lookup("v").Value
v.Set("4")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? This is affecting everybody who imports. Before it was just effective if you created an apiserver.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "why" is that the previous approach didn't work for -v0 on the command line, which is relevant for benchmarks.

I was assuming that setting -v4 as default would be desirable for all integration tests. If that is not the case, then startAPIServerOrDie must allow controlling whether it raises the verbosity.

Would controlplane.Config be a suitable place for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would controlplane.Config be a suitable place for that?

Hmm, probably not.

Copy link
Contributor Author

@pohly pohly Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler than an explicit parameter would be a global variable in controlplane_utils.go:

// MinVerbosity determines the minimum klog verbosity when running tests that involve the apiserver.
// This overrides the -v value from the command line, i.e. -v0 has no effect when MinVerbosity is 4 (the default).
// Tests can opt out of this by setting MinVerbosity to zero before starting the control plane or
// choose some different minimum verbosity.
var MinVerbosity = 4

@dims: I think you added that code originally. Any thoughts?

Copy link
Contributor Author

@pohly pohly Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the init function with the global variable.

GetNamespaceLabelsSnapshot has a fallback when it gets errors when looking up a
namespace, therefore reporting the error is more informational than a real
error. In particular, not finding the namespace is normal when running
test/integration/scheduler_perf and happens so frequently that there is a lot
of output on stderr:

E0120 12:19:09.204768   95305 plugin.go:138] "getting namespace, assuming empty set of namespace labels" err="namespace \"namespace-1\" not found" namespace="namespace-1"
When running an integration test that measures performance, like for example
test/integration/scheduler_perf, running etcd with debug level output is
undesirable because it creates additional load on the system and isn't
realistic.

The default is still "debug", but ETCD_LOGLEVEL=warn can be used to override
that.
The output redirection in the info message did not match the actual
invocation (stdout and stderr swapped).
"-bench=PerfScheduling/Preemption/500Nodes" ran both the
PerfScheduling/Preemption/500Nodes and the
PerfScheduling/PreemptionPVs/500Nodes benchmark.

This can be avoided by choosing names where none is the prefix of another.
Occasionally, writing as JSON failed because a NaN float couldn't be
encoded. The extended log message helps understand where that comes from, for
example:

F0120 20:24:45.515745  511835 scheduler_perf_test.go:540] BenchmarkPerfScheduling: unable to write measured data {Version:v1 DataItems:[{Data:map[Average:35.714285714285715 Perc50:2 Perc90:36 Perc95:412 Perc99:412] Unit:pods/s Labels:map[Metric:SchedulingThroughput Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2]} {Data:map[Average:27.863967530999993 Perc50:13.925925925925926 Perc90:30.06711409395973 Perc95:31.85682326621924 Perc99:704] Unit:ms Labels:map[Metric:scheduler_e2e_scheduling_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2]} {Data:map[Average:11915.651577744 Perc50:15168.796680497926 Perc90:19417.759336099585 Perc95:19948.87966804979 Perc99:20373.77593360996] Unit:ms Labels:map[Metric:scheduler_pod_scheduling_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2]} {Data:map[Average:1.1865832049999983 Perc50:0.7636363636363637 Perc90:2.891903719912473 Perc95:3.066958424507659 Perc99:5.333333333333334] Unit:ms Labels:map[Metric:scheduler_framework_extension_point_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2 extension_point:Filter]} {Data:map[Average:NaN Perc50:NaN Perc90:NaN Perc95:NaN Perc99:NaN] Unit:ms Labels:map[Metric:scheduler_framework_extension_point_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2 extension_point:Score]}]}: json: unsupported value: NaN
@pohly pohly force-pushed the scheduler-integration-benchmark branch from 0d7bcb3 to 26dac27 Compare February 7, 2022 08:16
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2022
@pohly
Copy link
Contributor Author

pohly commented Feb 7, 2022

@sttts : can you take another look?

@pohly
Copy link
Contributor Author

pohly commented Feb 8, 2022

/retest

@sttts
Copy link
Contributor

sttts commented Feb 11, 2022

Change looks good now from apiserver point of view.

@alculquicondor @oomichi can you re-lgtm?

@pohly pohly force-pushed the scheduler-integration-benchmark branch from 26dac27 to 770b2b4 Compare February 11, 2022 15:19
This provides a mechanism for overriding the forced increase of the klog
verbosity to 4 when starting the apiserver and uses that for the scheduler_perf
benchmark. Other tests run as before.

A global variable was used because adding an explicit parameter to several
helper functions would have caused a lot of code churn (test ->
integration/util.StartApiserver ->
integration/framework.RunAnAPIServerUsingServer ->
integration/framework.startAPIServerOrDie).
@pohly pohly force-pushed the scheduler-integration-benchmark branch from 770b2b4 to e1e84c8 Compare February 11, 2022 15:58
@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2022
@pohly
Copy link
Contributor Author

pohly commented Feb 12, 2022

/retest

@sttts
Copy link
Contributor

sttts commented Feb 14, 2022

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, oomichi, pohly, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2022
@k8s-ci-robot k8s-ci-robot merged commit 21c0f6f into kubernetes:master Feb 14, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants