Switches from UID to pointer for identifying cycle IDs #39

Merged
merged 2 commits into kubernetes-sigs:main from reset-bench
Jun 15, 2023

Conversation

codefromthecrypt
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Before, we used the UID to identify which scheduling cycle is in progress. However, when a scheduling cycle fails, the same UID can recur on the next attempt.

This change uses the pod pointer instead, as it differs per run and is easier to compare in wasm. We use the lower bits because they are unlikely to conflict, are easier to handle in wasm, and cannot fully identify the real host memory address.

Finally, this resets the pointer between benchmark runs, to make sure a run isn't accidentally faster. Before, runs were faster because they collided on the same UID; resetting the pointer avoids re-creating that issue.
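As a rough illustration (not the exact code in this PR), deriving a cycle ID from the pod pointer's low bits could look like the sketch below. The cycleID name matches the function referenced in the diff later in this conversation, but the bit width and implementation details here are assumptions:

```go
package main

import (
	"fmt"
	"unsafe"

	v1 "k8s.io/api/core/v1"
)

// cycleID derives a per-scheduling-cycle identifier from the pod's pointer.
// Unlike the UID, the pointer differs on each retry of a failed cycle, and
// truncating to the low 32 bits keeps the value cheap to pass into wasm as a
// single i32 while not revealing the full host memory address.
func cycleID(pod *v1.Pod) uint32 {
	return uint32(uintptr(unsafe.Pointer(pod)))
}

func main() {
	pod := &v1.Pod{}
	fmt.Printf("cycle ID: %#x\n", cycleID(pod))
}
```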

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

I noticed curious results in the prior PR. The current results look a lot more realistic.

Does this PR introduce a user-facing change?

NONE

What are the benchmark results of this change?

goos: darwin
goarch: arm64
pkg: sigs.k8s.io/kube-scheduler-wasm-extension/internal/e2e
BenchmarkPluginFilter
BenchmarkPluginFilter/noop-wat
BenchmarkPluginFilter/noop-wat/params:_small
BenchmarkPluginFilter/noop-wat/params:_small-12         	 4514036	       271.1 ns/op
BenchmarkPluginFilter/noop-wat/params:_real
BenchmarkPluginFilter/noop-wat/params:_real-12          	 4434201	       263.2 ns/op
BenchmarkPluginFilter/noop
BenchmarkPluginFilter/noop/params:_small
BenchmarkPluginFilter/noop/params:_small-12             	 3605126	       335.3 ns/op
BenchmarkPluginFilter/noop/params:_real
BenchmarkPluginFilter/noop/params:_real-12              	 3170498	       339.0 ns/op
BenchmarkPluginFilter/test
BenchmarkPluginFilter/test/params:_small
BenchmarkPluginFilter/test/params:_small-12             	  154036	      7219 ns/op
BenchmarkPluginFilter/test/params:_real
BenchmarkPluginFilter/test/params:_real-12              	    9352	    121479 ns/op
BenchmarkPluginScore
BenchmarkPluginScore/noop-wat
BenchmarkPluginScore/noop-wat/params:_small
BenchmarkPluginScore/noop-wat/params:_small-12          	 4664058	       257.6 ns/op
BenchmarkPluginScore/noop-wat/params:_real
BenchmarkPluginScore/noop-wat/params:_real-12           	 4602486	       261.2 ns/op
BenchmarkPluginScore/noop
BenchmarkPluginScore/noop/params:_small
BenchmarkPluginScore/noop/params:_small-12              	 3169807	       377.1 ns/op
BenchmarkPluginScore/noop/params:_real
BenchmarkPluginScore/noop/params:_real-12               	 3494535	       344.6 ns/op
BenchmarkPluginScore/test
BenchmarkPluginScore/test/params:_small
BenchmarkPluginScore/test/params:_small-12              	  283779	      4048 ns/op
BenchmarkPluginScore/test/params:_real
BenchmarkPluginScore/test/params:_real-12               	   22026	     54050 ns/op
BenchmarkPluginFilterAndScore
BenchmarkPluginFilterAndScore/noop-wat
BenchmarkPluginFilterAndScore/noop-wat/params:_small
BenchmarkPluginFilterAndScore/noop-wat/params:_small-12 	 3251036	       367.1 ns/op
BenchmarkPluginFilterAndScore/noop-wat/params:_real
BenchmarkPluginFilterAndScore/noop-wat/params:_real-12  	 3233397	       369.3 ns/op
BenchmarkPluginFilterAndScore/noop
BenchmarkPluginFilterAndScore/noop/params:_small
BenchmarkPluginFilterAndScore/noop/params:_small-12     	 2127667	       565.0 ns/op
BenchmarkPluginFilterAndScore/noop/params:_real
BenchmarkPluginFilterAndScore/noop/params:_real-12      	 2246898	       531.6 ns/op
BenchmarkPluginFilterAndScore/test
BenchmarkPluginFilterAndScore/test/params:_small
BenchmarkPluginFilterAndScore/test/params:_small-12     	  105128	     10904 ns/op
BenchmarkPluginFilterAndScore/test/params:_real
BenchmarkPluginFilterAndScore/test/params:_real-12      	    5520	    211922 ns/op

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 15, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 15, 2023
// assignedToBindingPod has the guest instances for Pods in binding cycle.
assignedToBindingPod map[types.UID]guest
}
mux sync.RWMutex
Contributor Author

So this iteration uses a simple lock for now, under the assumption that access to one plugin isn't concurrent enough to warrant the additional complexity of read/write locks, where you often need a write lock anyway. We can optimize this later, but for now, simpler is better IMHO.
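As a sketch of that trade-off (the pool type and method names below are illustrative and may not match the PR exactly), a pool guarded by a single plain mutex might look like:

```go
package pool

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// guest is a placeholder for the wasm guest instance type used by the plugin.
type guest struct{}

// guestPool sketches the "simple lock" idea above: one plain mutex instead of
// an RWMutex, since most accesses mutate the maps anyway.
type guestPool struct {
	mux sync.Mutex

	// assignedToBindingPod has the guest instances for Pods in binding cycle.
	assignedToBindingPod map[types.UID]guest
}

// getForBinding returns the guest associated with a binding Pod, if any.
func (p *guestPool) getForBinding(uid types.UID) (guest, bool) {
	p.mux.Lock()
	defer p.mux.Unlock()
	g, ok := p.assignedToBindingPod[uid]
	return g, ok
}
```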

@sanposhiho (Member) left a comment

I haven't gone through it all yet, but I left one comment for now.

defer pl.pool.assignedToSchedulingPodLock.Unlock()

g, err := pl.pool.getOrCreateGuest(ctx, pod.GetUID())
g, err := pl.pool.getForScheduling(ctx, cycleID(pod))
Member

assignedToSchedulingPodLock was the lock that protected the guest from being used in parallel (because Filter can be called in parallel during preemption).
And here, because you removed assignedToSchedulingPodLock, the guest fetched here can be used in parallel during preemption, right? Maybe we should keep assignedToSchedulingPodLock, or have something in getForScheduling to prevent that?

Contributor Author

I see. Indeed, the pool lock won't prevent overlapping calls that use the guest.

So I think I need to revise the docs, because scheduling is also not one-at-a-time.

Contributor Author

So I'm thinking that if the desire is one and only one guest handling scheduling cycles, we need something like a pool.doWithScheduling(func(guest)) helper.

This would serialize activity, but prevent having multiple layers of locks (in the plugin and also in the pool).

However, I think it may be simpler and easier to allow multiple scheduling cycles to co-exist, since the comment suggests exactly that can happen during pre-emption.

We would change getForScheduling to be backed by a map, so that in the case of pre-emption there is no conflict on a single guest, and no fighting under concurrency. For example, in our current design this sequence seems possible: PreFilter(id=1) initializes state, PreFilter(id=2) clobbers id 1's state with its own during pre-emption, and then Filter(id=1) has no state anymore.

TL;DR: I think we should be OK with another guest. The main issue is that we may end up with a garbage-collection problem, because there's no "finished" state I'm aware of in the scheduling cycle. So we might end up with dangling guests if we allow more than one in the scheduling cycle.

It would be great to be able to know whether something was a pre-emption, because then we could simply special-case it.

Contributor Author

I think the state problem is too abstract for me to have brought up yet, because we don't implement it in this plugin yet.

I will fix the overlapping problem in this PR and bear the details about state in mind later!

Contributor Author

I changed the impl to serialize access during scheduling and added some notes about my understanding of it. Thanks for all the help, and feel free to correct more.
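For readers following along, a helper that serializes the scheduling cycle could be shaped roughly like the sketch below. The doWithScheduling name comes from the earlier comment; the receiver type, signature, and locking details are assumptions and may differ from what was actually merged:

```go
package pool

import "sync"

// guest is a placeholder for the wasm guest instance type used by the plugin.
type guest struct{}

type guestPool struct {
	schedulingMux   sync.Mutex
	schedulingGuest *guest
}

// doWithScheduling serializes access to the guest used during the scheduling
// cycle, so overlapping calls (e.g. Filter during preemption) cannot clobber
// each other's state on the same guest.
func (p *guestPool) doWithScheduling(fn func(*guest)) {
	p.schedulingMux.Lock()
	defer p.schedulingMux.Unlock()

	if p.schedulingGuest == nil {
		// In the real pool this would create or reuse a wasm guest instance.
		p.schedulingGuest = &guest{}
	}
	fn(p.schedulingGuest)
}
```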

Member

Thanks for the change! The doWithScheduling looks great, and the note on it is correct.

> It would be great to be able to know whether something was a pre-emption, because then we could simply special-case it.

Just one idea: technically, we can notice that preemption has started via state.Clone() here:
https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/scheduler/framework/preemption/preemption.go#L574

It's called before SelectVictimsOnNode in every goroutine (created here). (SelectVictimsOnNode is the one we talked about in Slack; it calls Filter(), AddPod(), and RemovePod().)
And state.Clone() internally calls the Clone() methods implemented on the data each plugin stores:
https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/scheduler/framework/cycle_state.go#L82

So, if we put something with a Clone() method into the CycleState in PreFilter, that Clone() will be executed before Filter(), AddPod(), and RemovePod() are called from each goroutine. We can therefore regard Clone() as a trigger, and may want to use it to prepare for the Filter() etc. calls that come soon after.
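As a sketch of this Clone()-as-trigger idea (not something the plugin implements yet; the key and type names below are hypothetical):

```go
package plugin

import "k8s.io/kubernetes/pkg/scheduler/framework"

// stateKey is a hypothetical CycleState key for this plugin's data.
const stateKey framework.StateKey = "wasm-preemption-signal"

// preemptionSignal implements framework.StateData. The scheduler clones the
// CycleState before running SelectVictimsOnNode in each preemption goroutine,
// so Clone() fires just before the Filter()/AddPod()/RemovePod() calls.
type preemptionSignal struct {
	onClone func()
}

func (s *preemptionSignal) Clone() framework.StateData {
	if s.onClone != nil {
		s.onClone() // treat the clone as a "preemption is starting" trigger
	}
	return &preemptionSignal{onClone: s.onClone}
}

// In PreFilter, the plugin would write the signal into the CycleState, e.g.:
//
//	state.Write(stateKey, &preemptionSignal{onClone: prepareGuestForPreemption})
```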

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 15, 2023
@sanposhiho (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: codefromthecrypt, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit 84d22b7 into kubernetes-sigs:main Jun 15, 2023
5 checks passed
@sanposhiho
Member

Thanks for this, as well as the discussion. I'm glad that, even though we're entering the jungle of the complex preemption area, we have some clear insights about the problems to solve.

@codefromthecrypt codefromthecrypt deleted the reset-bench branch June 15, 2023 22:56
@codefromthecrypt
Contributor Author

Thanks, @sanposhiho. I learned a lot too, and any tricks to identify pre-emption sound really useful. I think we'll end up revisiting the design at least a couple of times, especially when we get pre-emption tests in!
