-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scheduler integration benchmark improvements #107677
scheduler integration benchmark improvements #107677
Conversation
/assign @sanposhiho |
/retest |
cb5302b
to
1a57e1a
Compare
/approve |
b4d4076
to
07bdfc7
Compare
I've rebased due to a conflict. Can I get this reviewed? @alculquicondor, perhaps you can start for SIG Scheduling? |
I already reviewed and only left 1 comment. But lgtm anyways /lgtm |
/retest |
/lgtm |
pkg/controlplane/controller.go
Outdated
if err := c.EndpointReconciler.RemoveEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts); err == nil { | ||
klog.Error("Found stale data, removed previous endpoints on kubernetes service, apiserver didn't exit successfully previously") | ||
} else if !storage.IsNotFound(err) { | ||
if err := c.EndpointReconciler.RemoveEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts); err != nil && !storage.IsNotFound(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea wasn't that your code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, why do we need to change it?
#105531
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the commit message: be4bf5c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea : does that explain it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heh, I understand now, this thing bit me twice this week
The test is using the masterCount reconciler, not the lease reconciler. The lease reconciler is the default ...
The integration framework should use the lease reconciler ... I think I will write something to try to deprecate the masterCount reconciler, it is flaky :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EndpointReconcilerType: string(reconcilers.LeaseEndpointReconcilerType), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, found the inconsistency
diff --git a/pkg/controlplane/instance.go b/pkg/controlplane/instance.go
index 424cd0ce54b..6426038e3b2 100644
--- a/pkg/controlplane/instance.go
+++ b/pkg/controlplane/instance.go
@@ -274,9 +274,9 @@ func (c *Config) createEndpointReconciler() reconcilers.EndpointReconciler {
klog.Infof("Using reconciler: %v", c.ExtraConfig.EndpointReconcilerType)
switch c.ExtraConfig.EndpointReconcilerType {
// there are numerous test dependencies that depend on a default controller
- case "", reconcilers.MasterCountReconcilerType:
+ case reconcilers.MasterCountReconcilerType:
return c.createMasterCountReconciler()
- case reconcilers.LeaseEndpointReconcilerType:
+ case "", reconcilers.LeaseEndpointReconcilerType:
return c.createLeaseReconciler()
case reconcilers.NoneEndpointReconcilerType:
return c.createNoneReconciler()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the code and deprecate the master-count
#107952
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @aojea is going to address the false error message, I took this commit out of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR was closed without a conclusion, but the issue about the error message is still open: #107953
// line. k8s.io/component-base/logs called klog.InitFlags for us. | ||
v := flag.Lookup("v").Value | ||
v.Set("4") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? This is affecting everybody who imports. Before it was just effective if you created an apiserver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "why" is that the previous approach didn't work for -v0 on the command line, which is relevant for benchmarks.
I was assuming that setting -v4 as default would be desirable for all integration tests. If that is not the case, then startAPIServerOrDie
must allow controlling whether it raises the verbosity.
Would controlplane.Config
be a suitable place for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would controlplane.Config be a suitable place for that?
Hmm, probably not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler than an explicit parameter would be a global variable in controlplane_utils.go:
// MinVerbosity determines the minimum klog verbosity when running tests that involve the apiserver.
// This overrides the -v value from the command line, i.e. -v0 has no effect when MinVerbosity is 4 (the default).
// Tests can opt out of this by setting MinVerbosity to zero before starting the control plane or
// choose some different minimum verbosity.
var MinVerbosity = 4
@dims: I think you added that code originally. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the init function with the global variable.
GetNamespaceLabelsSnapshot has a fallback when it gets errors when looking up a namespace, therefore reporting the error is more informational than a real error. In particular, not finding the namespace is normal when running test/integration/scheduler_perf and happens so frequently that there is a lot of output on stderr: E0120 12:19:09.204768 95305 plugin.go:138] "getting namespace, assuming empty set of namespace labels" err="namespace \"namespace-1\" not found" namespace="namespace-1"
When running an integration test that measures performance, like for example test/integration/scheduler_perf, running etcd with debug level output is undesirable because it creates additional load on the system and isn't realistic. The default is still "debug", but ETCD_LOGLEVEL=warn can be used to override that.
The output redirection in the info message did not match the actual invocation (stdout and stderr swapped).
"-bench=PerfScheduling/Preemption/500Nodes" ran both the PerfScheduling/Preemption/500Nodes and the PerfScheduling/PreemptionPVs/500Nodes benchmark. This can be avoided by choosing names where none is the prefix of another.
Occasionally, writing as JSON failed because a NaN float couldn't be encoded. The extended log message helps understand where that comes from, for example: F0120 20:24:45.515745 511835 scheduler_perf_test.go:540] BenchmarkPerfScheduling: unable to write measured data {Version:v1 DataItems:[{Data:map[Average:35.714285714285715 Perc50:2 Perc90:36 Perc95:412 Perc99:412] Unit:pods/s Labels:map[Metric:SchedulingThroughput Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2]} {Data:map[Average:27.863967530999993 Perc50:13.925925925925926 Perc90:30.06711409395973 Perc95:31.85682326621924 Perc99:704] Unit:ms Labels:map[Metric:scheduler_e2e_scheduling_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2]} {Data:map[Average:11915.651577744 Perc50:15168.796680497926 Perc90:19417.759336099585 Perc95:19948.87966804979 Perc99:20373.77593360996] Unit:ms Labels:map[Metric:scheduler_pod_scheduling_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2]} {Data:map[Average:1.1865832049999983 Perc50:0.7636363636363637 Perc90:2.891903719912473 Perc95:3.066958424507659 Perc99:5.333333333333334] Unit:ms Labels:map[Metric:scheduler_framework_extension_point_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2 extension_point:Filter]} {Data:map[Average:NaN Perc50:NaN Perc90:NaN Perc95:NaN Perc99:NaN] Unit:ms Labels:map[Metric:scheduler_framework_extension_point_duration_seconds Name:BenchmarkPerfScheduling/PreemptionPVs/500Nodes/namespace-2 extension_point:Score]}]}: json: unsupported value: NaN
0d7bcb3
to
26dac27
Compare
@sttts : can you take another look? |
/retest |
Change looks good now from apiserver point of view. @alculquicondor @oomichi can you re-lgtm? |
26dac27
to
770b2b4
Compare
This provides a mechanism for overriding the forced increase of the klog verbosity to 4 when starting the apiserver and uses that for the scheduler_perf benchmark. Other tests run as before. A global variable was used because adding an explicit parameter to several helper functions would have caused a lot of code churn (test -> integration/util.StartApiserver -> integration/framework.RunAnAPIServerUsingServer -> integration/framework.startAPIServerOrDie).
770b2b4
to
e1e84c8
Compare
/lgtm |
/retest |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, oomichi, pohly, sttts The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
I was running the
test/integration/scheduler_perf
and encountered various issues in it and some of the scheduler code. See the individual commit messages for details.Special notes for your reviewer:
I still have problems running all benchmarks in a single "go test" invocation as discussed in https://kubernetes.slack.com/archives/C09TP78DV/p1642623117030500 but this PR helps a bit.
Does this PR introduce a user-facing change?