Switches from UID to pointer for identifying cycle IDs #39

Merged
merged 2 commits into kubernetes-sigs:main from reset-bench
Jun 15, 2023

Conversation

codefromthecrypt
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Before, we used the UID to identify which scheduling cycle is in progress. However, when a scheduling cycle fails, the same UID can recur on the next attempt.

This change uses the pod pointer instead, as it differs per run and is easier to compare in wasm. We use the lower bits because they are unlikely to conflict, are easier to handle in wasm, and cannot fully identify the real host memory address.

Finally, this resets the pointer between benchmark runs, to make sure a run isn't accidentally faster. Before, runs were faster because they collided on the same UID; resetting the pointer avoids re-creating that issue.
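As a rough illustration (not the exact code in this PR), deriving a cycle ID from the pod pointer's low bits could look like the sketch below. The cycleID name matches the function referenced in the diff later in this conversation, but the bit width and implementation details here are assumptions:

```go
package main

import (
	"fmt"
	"unsafe"

	v1 "k8s.io/api/core/v1"
)

// cycleID derives a per-scheduling-cycle identifier from the pod's pointer.
// Unlike the UID, the pointer differs on each retry of a failed cycle, and
// truncating to the low 32 bits keeps the value cheap to pass into wasm as a
// single i32 while not revealing the full host memory address.
func cycleID(pod *v1.Pod) uint32 {
	return uint32(uintptr(unsafe.Pointer(pod)))
}

func main() {
	pod := &v1.Pod{}
	fmt.Printf("cycle ID: %#x\n", cycleID(pod))
}
```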

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

I noticed curious results in the prior PR. The current results look a lot more realistic.

Does this PR introduce a user-facing change?

NONE

What are the benchmark results of this change?

goos: darwin
goarch: arm64
pkg: sigs.k8s.io/kube-scheduler-wasm-extension/internal/e2e
BenchmarkPluginFilter
BenchmarkPluginFilter/noop-wat
BenchmarkPluginFilter/noop-wat/params:_small
BenchmarkPluginFilter/noop-wat/params:_small-12         	 4514036	       271.1 ns/op
BenchmarkPluginFilter/noop-wat/params:_real
BenchmarkPluginFilter/noop-wat/params:_real-12          	 4434201	       263.2 ns/op
BenchmarkPluginFilter/noop
BenchmarkPluginFilter/noop/params:_small
BenchmarkPluginFilter/noop/params:_small-12             	 3605126	       335.3 ns/op
BenchmarkPluginFilter/noop/params:_real
BenchmarkPluginFilter/noop/params:_real-12              	 3170498	       339.0 ns/op
BenchmarkPluginFilter/test
BenchmarkPluginFilter/test/params:_small
BenchmarkPluginFilter/test/params:_small-12             	  154036	      7219 ns/op
BenchmarkPluginFilter/test/params:_real
BenchmarkPluginFilter/test/params:_real-12              	    9352	    121479 ns/op
BenchmarkPluginScore
BenchmarkPluginScore/noop-wat
BenchmarkPluginScore/noop-wat/params:_small
BenchmarkPluginScore/noop-wat/params:_small-12          	 4664058	       257.6 ns/op
BenchmarkPluginScore/noop-wat/params:_real
BenchmarkPluginScore/noop-wat/params:_real-12           	 4602486	       261.2 ns/op
BenchmarkPluginScore/noop
BenchmarkPluginScore/noop/params:_small
BenchmarkPluginScore/noop/params:_small-12              	 3169807	       377.1 ns/op
BenchmarkPluginScore/noop/params:_real
BenchmarkPluginScore/noop/params:_real-12               	 3494535	       344.6 ns/op
BenchmarkPluginScore/test
BenchmarkPluginScore/test/params:_small
BenchmarkPluginScore/test/params:_small-12              	  283779	      4048 ns/op
BenchmarkPluginScore/test/params:_real
BenchmarkPluginScore/test/params:_real-12               	   22026	     54050 ns/op
BenchmarkPluginFilterAndScore
BenchmarkPluginFilterAndScore/noop-wat
BenchmarkPluginFilterAndScore/noop-wat/params:_small
BenchmarkPluginFilterAndScore/noop-wat/params:_small-12 	 3251036	       367.1 ns/op
BenchmarkPluginFilterAndScore/noop-wat/params:_real
BenchmarkPluginFilterAndScore/noop-wat/params:_real-12  	 3233397	       369.3 ns/op
BenchmarkPluginFilterAndScore/noop
BenchmarkPluginFilterAndScore/noop/params:_small
BenchmarkPluginFilterAndScore/noop/params:_small-12     	 2127667	       565.0 ns/op
BenchmarkPluginFilterAndScore/noop/params:_real
BenchmarkPluginFilterAndScore/noop/params:_real-12      	 2246898	       531.6 ns/op
BenchmarkPluginFilterAndScore/test
BenchmarkPluginFilterAndScore/test/params:_small
BenchmarkPluginFilterAndScore/test/params:_small-12     	  105128	     10904 ns/op
BenchmarkPluginFilterAndScore/test/params:_real
BenchmarkPluginFilterAndScore/test/params:_real-12      	    5520	    211922 ns/op

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 15, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 15, 2023
// assignedToBindingPod has the guest instances for Pods in binding cycle.
assignedToBindingPod map[types.UID]guest
}
mux sync.RWMutex
Contributor Author

So this iteration uses a simple lock for now, under the assumption that access to one plugin isn't concurrent enough to warrant the additional complexity of read/write locks, where you often need a write lock anyway. We can optimize this later, but for now, simpler is better IMHO.
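As a sketch of that trade-off (the pool type and method names below are illustrative and may not match the PR exactly), a pool guarded by a single plain mutex might look like:

```go
package pool

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// guest is a placeholder for the wasm guest instance type used by the plugin.
type guest struct{}

// guestPool sketches the "simple lock" idea above: one plain mutex instead of
// an RWMutex, since most accesses mutate the maps anyway.
type guestPool struct {
	mux sync.Mutex

	// assignedToBindingPod has the guest instances for Pods in binding cycle.
	assignedToBindingPod map[types.UID]guest
}

// getForBinding returns the guest associated with a binding Pod, if any.
func (p *guestPool) getForBinding(uid types.UID) (guest, bool) {
	p.mux.Lock()
	defer p.mux.Unlock()
	g, ok := p.assignedToBindingPod[uid]
	return g, ok
}
```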

@sanposhiho (Member) left a comment

I haven't gone through it all yet, but I left one comment for now.

defer pl.pool.assignedToSchedulingPodLock.Unlock()

g, err := pl.pool.getOrCreateGuest(ctx, pod.GetUID())
g, err := pl.pool.getForScheduling(ctx, cycleID(pod))
Member

assignedToSchedulingPodLock was the lock that protected the guest from being used in parallel (because Filter can be called in parallel during preemption).
And here, because you removed assignedToSchedulingPodLock, the guest fetched here can be used in parallel during preemption, right? Maybe we should keep assignedToSchedulingPodLock, or have something in getForScheduling to prevent that?

Contributor Author

I see. Indeed, the pool lock won't prevent overlapping calls that use the guest.

So I think I need to revise the docs, because scheduling is also not one-at-a-time.

Contributor Author

So I'm thinking that if the desire is one and only one guest handling scheduling cycles, we need something like a pool.doWithScheduling(func(guest)) helper.

This would serialize activity, but prevent having multiple layers of locks (in the plugin and also in the pool).

However, I think it may be simpler and easier to allow multiple scheduling cycles to co-exist, since the comment suggests exactly that can happen during pre-emption.

We would change getForScheduling to be backed by a map, so that in the case of pre-emption there is no conflict on a single guest, and no fighting under concurrency. For example, in our current design this sequence seems possible: PreFilter(id=1) initializes state, PreFilter(id=2) clobbers id 1's state with its own during pre-emption, and then Filter(id=1) has no state anymore.

TL;DR: I think we should be OK with another guest. The main issue is that we may end up with a garbage-collection problem, because there's no "finished" state I'm aware of in the scheduling cycle. So we might end up with dangling guests if we allow more than one in the scheduling cycle.

It would be great to be able to know whether something was a pre-emption, because then we could simply special-case it.

Contributor Author

I think the state problem is too abstract for me to have brought up yet, because we don't implement it in this plugin yet.

I will fix the overlapping problem in this PR and bear the details about state in mind later!

Contributor Author

I changed the impl to serialize access during scheduling and added some notes about my understanding of it. Thanks for all the help, and feel free to correct more.
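For readers following along, a helper that serializes the scheduling cycle could be shaped roughly like the sketch below. The doWithScheduling name comes from the earlier comment; the receiver type, signature, and locking details are assumptions and may differ from what was actually merged:

```go
package pool

import "sync"

// guest is a placeholder for the wasm guest instance type used by the plugin.
type guest struct{}

type guestPool struct {
	schedulingMux   sync.Mutex
	schedulingGuest *guest
}

// doWithScheduling serializes access to the guest used during the scheduling
// cycle, so overlapping calls (e.g. Filter during preemption) cannot clobber
// each other's state on the same guest.
func (p *guestPool) doWithScheduling(fn func(*guest)) {
	p.schedulingMux.Lock()
	defer p.schedulingMux.Unlock()

	if p.schedulingGuest == nil {
		// In the real pool this would create or reuse a wasm guest instance.
		p.schedulingGuest = &guest{}
	}
	fn(p.schedulingGuest)
}
```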

Member

Thanks for the change! The doWithScheduling looks great, and the note on it is correct.

> It would be great to be able to know whether something was a pre-emption, because then we could simply special-case it.

Just one idea: technically, we can notice that preemption has started via state.Clone() here:
https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/scheduler/framework/preemption/preemption.go#L574

It's called before SelectVictimsOnNode in every goroutine (created here). (SelectVictimsOnNode is the one we talked about in Slack; it calls Filter(), AddPod(), and RemovePod().)
And state.Clone() internally calls the Clone() methods implemented on the data each plugin stores:
https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/scheduler/framework/cycle_state.go#L82

So, if we put something with a Clone() method into the CycleState in PreFilter, that Clone() will be executed before Filter(), AddPod(), and RemovePod() are called from each goroutine. We can therefore regard Clone() as a trigger, and may want to use it to prepare for the Filter() etc. calls that come soon after.
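As a sketch of this Clone()-as-trigger idea (not something the plugin implements yet; the key and type names below are hypothetical):

```go
package plugin

import "k8s.io/kubernetes/pkg/scheduler/framework"

// stateKey is a hypothetical CycleState key for this plugin's data.
const stateKey framework.StateKey = "wasm-preemption-signal"

// preemptionSignal implements framework.StateData. The scheduler clones the
// CycleState before running SelectVictimsOnNode in each preemption goroutine,
// so Clone() fires just before the Filter()/AddPod()/RemovePod() calls.
type preemptionSignal struct {
	onClone func()
}

func (s *preemptionSignal) Clone() framework.StateData {
	if s.onClone != nil {
		s.onClone() // treat the clone as a "preemption is starting" trigger
	}
	return &preemptionSignal{onClone: s.onClone}
}

// In PreFilter, the plugin would write the signal into the CycleState, e.g.:
//
//	state.Write(stateKey, &preemptionSignal{onClone: prepareGuestForPreemption})
```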

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 15, 2023
@sanposhiho (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: codefromthecrypt, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit 84d22b7 into kubernetes-sigs:main Jun 15, 2023
5 checks passed
@sanposhiho
Member

Thanks for this, as well as the discussion. I'm glad that, even though we're entering the jungle of the complex preemption area, we have some clear insights about the problems to solve.

@codefromthecrypt codefromthecrypt deleted the reset-bench branch June 15, 2023 22:56
@codefromthecrypt
Contributor Author

Thanks, @sanposhiho. I learned a lot too, and any tricks to identify pre-emption sound really useful. I think we'll end up revisiting the design at least a couple of times, especially when we get pre-emption tests in!
