Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler-perf: run as integration tests #118202

Merged
merged 6 commits into from Jun 28, 2023

Conversation

pohly
Copy link
Contributor

@pohly pohly commented May 23, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This has two purposes:

  • By running in the pre-merge pull-kubernetes-integration, changes that break scheduler_perf might be caught before they get merged.
  • Eventually (see WIP: integration: run with race detection enabled #116980), these integration tests will run with race detection enabled. This fills a gap in the test coverage for the kube-scheduler code because unit tests are often too simplistic to expose races and E2E testing runs without race detection.

Special notes for your reviewer:

Split into multiple stand-alone commits to simplify reviews, but probably none of those are worth merging by their own.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2023
@pohly pohly force-pushed the scheduler-perf-unit-test branch 2 times, most recently from 9b65d9e to 1ba89fd Compare May 23, 2023 15:34
@pohly
Copy link
Contributor Author

pohly commented May 23, 2023

FAIL: TestScheduling/SchedulingPreferredPodAffinity/500Nodes

But there's no failure message 😢

/retest

@pohly
Copy link
Contributor Author

pohly commented May 23, 2023

Here's why the failure is not shown:

clipping failure message in test case : TestScheduling/SchedulingPreferredPodAffinity/500Nodes

That's from cmd/prune-junit-xml/prunexml.go.

Verbosity is a bit high. The V(5) log message from graph_builder.go are too much. I'm not sure yet where that gets increased. The default should be -v2.

@pohly pohly changed the title scheduler-perf: run as integration tests WIP: scheduler-perf: run as integration tests May 23, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 23, 2023
@pohly pohly force-pushed the scheduler-perf-unit-test branch 2 times, most recently from 06b49cb to af79207 Compare May 24, 2023 09:07
@pohly pohly changed the title WIP: scheduler-perf: run as integration tests scheduler-perf: run as integration tests May 24, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 24, 2023
@pohly pohly force-pushed the scheduler-perf-unit-test branch from af79207 to 95c8e50 Compare May 24, 2023 13:13
@pohly
Copy link
Contributor Author

pohly commented May 24, 2023

The verbosity problem was fixed by reconfiguring defaults for integration tests and the test failure should be fixed by disabling the sampling interval check. It's only relevant when we measure performance and is more likely to fail when many Go tests run in parallel.

@dims
Copy link
Member

dims commented Jun 6, 2023

@Huang-Wei @ahg-g @alculquicondor can one of you please approve/lgtm?

@Huang-Wei
Copy link
Member

@kerthcet could you take a look as you recently reviewed similar PRs in pert tests. Thanks!

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 21, 2023
@pohly
Copy link
Contributor Author

pohly commented Jun 21, 2023

I've gone through the review feedback and tried to address everything. For now I continue to use gomega.Eventually.

@alculquicondor: if you still prefer wait.Until, then I'll switch to that.

Each benchmark test case runs with a fresh etcd instance. Therefore it is not
necessary to delete objects after a run.

A future unit test might reuse etcd, therefore cleanup is optional.
@kerthcet
Copy link
Member

/retest

@kerthcet
Copy link
Member

kerthcet commented Jun 26, 2023

Addressed all my concerns, so
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, mimani68, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kerthcet
Copy link
Member

/hold
for @alculquicondor

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2023
@pohly
Copy link
Contributor Author

pohly commented Jun 27, 2023

/retest

// - k8s api server
// - scheduler
// It returns regular and dynamic clients, and destroyFunc which should be used to
// remove resources after finished.
// Notes on rate limiter:
// - client rate limit is set to 5000.
func mustSetupScheduler(ctx context.Context, b *testing.B, config *config.KubeSchedulerConfiguration, enabledFeatures map[featuregate.Feature]bool) (informers.SharedInformerFactory, clientset.Interface, dynamic.Interface) {
func mustSetupScheduler(ctx context.Context, tb testing.TB, config *config.KubeSchedulerConfiguration, enabledFeatures map[featuregate.Feature]bool) (informers.SharedInformerFactory, clientset.Interface, dynamic.Interface) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still says Scheduler here.
Also please update the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -127,7 +132,10 @@ func StartFakePVController(ctx context.Context, clientSet clientset.Interface, i
claimRef := obj.Spec.ClaimRef
pvc, err := clientSet.CoreV1().PersistentVolumeClaims(claimRef.Namespace).Get(ctx, claimRef.Name, metav1.GetOptions{})
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should avoid any races:

Suggested change
if err != nil {
if err != nil && errors.Is(err, context.Canceled) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to return when there was an error, otherwise the code below would use a nil pvc. What needs to be fixed is the check whether that error should be logger. I had the logic backwards. It now is:

			if err != nil {
				// Note that the error can be anything, because components like
				// apiserver are also shutting down at the same time, but this
				// check is conservative and only ignores the "context canceled"
				// error while shutting down.
				if ctx.Err() == nil || !errors.Is(err, context.Canceled) {
					klog.Errorf("error while getting %v/%v: %v", claimRef.Namespace, claimRef.Name, err)
				}
				return
			}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg

@@ -136,7 +144,10 @@ func StartFakePVController(ctx context.Context, clientSet clientset.Interface, i
metav1.SetMetaDataAnnotation(&pvc.ObjectMeta, pvutil.AnnBindCompleted, "yes")
_, err := clientSet.CoreV1().PersistentVolumeClaims(claimRef.Namespace).Update(ctx, pvc, metav1.UpdateOptions{})
if err != nil {
klog.Errorf("error while updating %v/%v: %v", claimRef.Namespace, claimRef.Name, err)
if ctx.Err() != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

func() {
_, ctx := ktesting.NewTestContext(t)
// 30 minutes is for *all* tests using this configuration.
ctx, cancel := context.WithTimeout(ctx, 30*time.Minute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just context.WithCancel then?

}
}

// We need to wait here because even with deletion time stamp set,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We need to wait here because even with deletion time stamp set,
// We need to wait here because even with deletion timestamp set,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 1209 to 1212
pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
if err != nil {
tb.Fatalf("failed to list pods in %q: %v", namespace, err)
}
for _, pod := range pods.Items {
if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, deleteNow); err != nil {
tb.Fatalf("failed to delete pod %q in namespace %q: %v", pod.Name, namespace, err)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a single DeleteCollection? IIUC, it also accepts DeleteOptions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, changed.

@kerthcet
Copy link
Member

Seems forgot to push the commit...

Once the context is canceled, the controller can stop processing
events. Without this change it prints errors when the apiserver is already
down.
This becomes relevant when doing more fine-grained leak checking.
Merely deleting the namespace is not enough:
- Workloads might rely on the garbage collector to get rid of obsolete objects,
  so we should run it to be on the safe side.
- Pods must be force-deleted because kubelet is not running.
- Finally, the namespace controller is needed to get rid of
  deleted namespaces.
This runs workloads that are labeled as "integration-test". The apiserver and
scheduler are only started once per unique configuration, followed by each
workload using that configuration. This makes execution faster. In contrast to
benchmarking, we care less about starting with a clean slate for each test.
…imeout

This is done for the sake of consistency. The failure message becomes less
useful.
@pohly
Copy link
Contributor Author

pohly commented Jun 28, 2023

Seems forgot to push the commit...

I didn't quite finish yesterday after leaving my initial replies. I think I addressed everything now and pushed: https://github.com/kubernetes/kubernetes/compare/2ff7891706089c1a3ae58b6cf6cd116b8e462dab..0d41d509d2d96ccc3473924cb4e1b8e1b3e4c170

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 28, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: b66b490fb759b8567b59abbd483c435bbdb2fc06

@alculquicondor
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2023
@k8s-ci-robot k8s-ci-robot merged commit c78204d into kubernetes:master Jun 28, 2023
13 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants