scheduler-perf: run as integration tests #118202

pohly · 2023-05-23T10:00:38Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This has two purposes:

- By running in the pre-merge pull-kubernetes-integration job, changes that break scheduler_perf might be caught before they get merged.
- Eventually (see WIP: integration: run with race detection enabled #116980), these integration tests will run with race detection enabled. This fills a gap in the test coverage for the kube-scheduler code because unit tests are often too simplistic to expose races and E2E testing runs without race detection.

Special notes for your reviewer:

Split into multiple stand-alone commits to simplify reviews, but probably none of those are worth merging on their own.

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot · 2023-05-23T10:00:46Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly · 2023-05-23T17:49:04Z

FAIL: TestScheduling/SchedulingPreferredPodAffinity/500Nodes

But there's no failure message 😢

/retest

pohly · 2023-05-23T18:26:19Z

Here's why the failure is not shown:

clipping failure message in test case : TestScheduling/SchedulingPreferredPodAffinity/500Nodes

That's from cmd/prune-junit-xml/prunexml.go.

Verbosity is a bit high. The V(5) log messages from graph_builder.go are too much. I'm not sure yet where that gets increased. The default should be -v2.

pohly · 2023-05-24T13:15:21Z

The verbosity problem was fixed by reconfiguring defaults for integration tests, and the test failure should be fixed by disabling the sampling interval check. That check is only relevant when we measure performance and is more likely to fail when many Go tests run in parallel.

dims · 2023-06-06T16:15:27Z

@Huang-Wei @ahg-g @alculquicondor can one of you please approve/lgtm?

Huang-Wei · 2023-06-06T16:54:11Z

@kerthcet could you take a look, as you recently reviewed similar PRs in perf tests? Thanks!

pohly · 2023-06-21T12:14:49Z

I've gone through the review feedback and tried to address everything. For now I continue to use gomega.Eventually.

@alculquicondor: if you still prefer wait.Until, then I'll switch to that.

Each benchmark test case runs with a fresh etcd instance. Therefore it is not necessary to delete objects after a run. A future unit test might reuse etcd, therefore cleanup is optional.

kerthcet · 2023-06-26T05:41:04Z

/retest

kerthcet · 2023-06-26T05:42:31Z

Addressed all my concerns, so
/approve

k8s-ci-robot · 2023-06-26T05:42:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, mimani68, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kerthcet · 2023-06-26T05:45:03Z

/hold
for @alculquicondor

pohly · 2023-06-27T14:05:18Z

/retest

alculquicondor · 2023-06-27T17:08:56Z

test/integration/scheduler_perf/util.go

 // - k8s api server
 // - scheduler
 // It returns regular and dynamic clients, and destroyFunc which should be used to
 // remove resources after finished.
 // Notes on rate limiter:
 //   - client rate limit is set to 5000.
-func mustSetupScheduler(ctx context.Context, b *testing.B, config *config.KubeSchedulerConfiguration, enabledFeatures map[featuregate.Feature]bool) (informers.SharedInformerFactory, clientset.Interface, dynamic.Interface) {
+func mustSetupScheduler(ctx context.Context, tb testing.TB, config *config.KubeSchedulerConfiguration, enabledFeatures map[featuregate.Feature]bool) (informers.SharedInformerFactory, clientset.Interface, dynamic.Interface) {


It still says Scheduler here.
Also please update the comment

alculquicondor · 2023-06-27T17:15:12Z

test/integration/util/util.go

@@ -127,7 +132,10 @@ func StartFakePVController(ctx context.Context, clientSet clientset.Interface, i
 			claimRef := obj.Spec.ClaimRef
 			pvc, err := clientSet.CoreV1().PersistentVolumeClaims(claimRef.Namespace).Get(ctx, claimRef.Name, metav1.GetOptions{})
 			if err != nil {


This should avoid any races:

Suggested change

if err != nil {

if err != nil && errors.Is(err, context.Canceled) {

We have to return when there was an error, otherwise the code below would use a nil pvc. What needs to be fixed is the check whether that error should be logged. I had the logic backwards. It now is:

if err != nil {
	// Note that the error can be anything, because components like
	// apiserver are also shutting down at the same time, but this
	// check is conservative and only ignores the "context canceled"
	// error while shutting down.
	if ctx.Err() == nil || !errors.Is(err, context.Canceled) {
		klog.Errorf("error while getting %v/%v: %v", claimRef.Namespace, claimRef.Name, err)
	}
	return
}

alculquicondor · 2023-06-27T17:15:37Z

test/integration/util/util.go

@@ -136,7 +144,10 @@ func StartFakePVController(ctx context.Context, clientSet clientset.Interface, i
 				metav1.SetMetaDataAnnotation(&pvc.ObjectMeta, pvutil.AnnBindCompleted, "yes")
 				_, err := clientSet.CoreV1().PersistentVolumeClaims(claimRef.Namespace).Update(ctx, pvc, metav1.UpdateOptions{})
 				if err != nil {
-					klog.Errorf("error while updating %v/%v: %v", claimRef.Namespace, claimRef.Name, err)
+					if ctx.Err() != nil {


alculquicondor · 2023-06-27T17:19:25Z

test/integration/scheduler_perf/scheduler_perf_test.go

+		func() {
+			_, ctx := ktesting.NewTestContext(t)
+			// 30 minutes is for *all* tests using this configuration.
+			ctx, cancel := context.WithTimeout(ctx, 30*time.Minute)


why not just context.WithCancel then?

alculquicondor · 2023-06-27T17:20:19Z

test/integration/scheduler_perf/scheduler_perf_test.go

+		}
+	}
+
+	// We need to wait here because even with deletion time stamp set,


Suggested change

// We need to wait here because even with deletion time stamp set,

// We need to wait here because even with deletion timestamp set,

alculquicondor · 2023-06-27T17:24:05Z

test/integration/scheduler_perf/scheduler_perf_test.go

+		pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
+		if err != nil {
+			tb.Fatalf("failed to list pods in %q: %v", namespace, err)
+		}
+		for _, pod := range pods.Items {
+			if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, deleteNow); err != nil {
+				tb.Fatalf("failed to delete pod %q in namespace %q: %v", pod.Name, namespace, err)
+			}
+		}


Can we use a single DeleteCollection? IIUC, it also accepts DeleteOptions

Good idea, changed.

kerthcet · 2023-06-28T03:00:43Z

Seems forgot to push the commit...

Once the context is canceled, the controller can stop processing events. Without this change it prints errors when the apiserver is already down.

This becomes relevant when doing more fine-grained leak checking.

Merely deleting the namespace is not enough:
- Workloads might rely on the garbage collector to get rid of obsolete objects, so we should run it to be on the safe side.
- Pods must be force-deleted because kubelet is not running.
- Finally, the namespace controller is needed to get rid of deleted namespaces.

This runs workloads that are labeled as "integration-test". The apiserver and scheduler are only started once per unique configuration, followed by each workload using that configuration. This makes execution faster. In contrast to benchmarking, we care less about starting with a clean slate for each test.

…imeout This is done for the sake of consistency. The failure message becomes less useful.

pohly · 2023-06-28T07:38:24Z

Seems forgot to push the commit...

I didn't quite finish yesterday after leaving my initial replies. I think I addressed everything now and pushed: https://github.com/kubernetes/kubernetes/compare/2ff7891706089c1a3ae58b6cf6cd116b8e462dab..0d41d509d2d96ccc3473924cb4e1b8e1b3e4c170

alculquicondor

/lgtm

k8s-ci-robot · 2023-06-28T12:27:30Z

LGTM label has been added.

Git tree hash: b66b490fb759b8567b59abbd483c435bbdb2fc06

alculquicondor · 2023-06-28T12:27:48Z

/hold cancel

k8s-ci-robot requested review from chendave and dchen1107 May 23, 2023 10:01

pohly force-pushed the scheduler-perf-unit-test branch 2 times, most recently from 9b65d9e to 1ba89fd Compare May 23, 2023 15:34

mimani68 approved these changes May 23, 2023

pohly changed the title ~~scheduler-perf: run as integration tests~~ WIP: scheduler-perf: run as integration tests May 23, 2023

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 23, 2023

pohly force-pushed the scheduler-perf-unit-test branch 2 times, most recently from 06b49cb to af79207 Compare May 24, 2023 09:07

pohly changed the title ~~WIP: scheduler-perf: run as integration tests~~ scheduler-perf: run as integration tests May 24, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 24, 2023

pohly force-pushed the scheduler-perf-unit-test branch from af79207 to 95c8e50 Compare May 24, 2023 13:13

pohly force-pushed the scheduler-perf-unit-test branch from 62844bc to 0358fdb Compare June 21, 2023 12:14

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 21, 2023

scheduler_perf: skip expensive cleanup during benchmarks

c91c578

Each benchmark test case runs with a fresh etcd instance. Therefore it is not necessary to delete objects after a run. A future unit test might reuse etcd, therefore cleanup is optional.

pohly force-pushed the scheduler-perf-unit-test branch from 0358fdb to 7066d05 Compare June 22, 2023 07:00

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2023

alculquicondor reviewed Jun 27, 2023

kerthcet mentioned this pull request Jun 28, 2023

Use gomega Eventually in scheduler integration tests #116956

Closed

pohly added 5 commits June 28, 2023 08:14

test/integration: avoid errors in fake PC controller during shutdown

2e7f373

Once the context is canceled, the controller can stop processing events. Without this change it prints errors when the apiserver is already down.

scheduler_perf: fix goroutine leak in runWorkload

d9c16a1

This becomes relevant when doing more fine-grained leak checking.

scheduler_perf: replace gomega.Eventually with wait.PollUntilContextT…

0d41d50

…imeout This is done for the sake of consistency. The failure message becomes less useful.

pohly force-pushed the scheduler-perf-unit-test branch from 2ff7891 to 0d41d50 Compare June 28, 2023 07:38

alculquicondor reviewed Jun 28, 2023

k8s-ci-robot assigned alculquicondor Jun 28, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 28, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2023

k8s-ci-robot merged commit c78204d into kubernetes:master Jun 28, 2023
13 checks passed

k8s-ci-robot added this to the v1.28 milestone Jun 28, 2023
