Switches from UID to pointer for identifying cycle IDs #39
Conversation
Before, we used the UID to identify which scheduling cycle is in progress. However, when a scheduling cycle fails, the same UID can recur the next time. This change uses the pod's pointer instead, as it is different per run and easier to compare in wasm. We use the lower bits because they are unlikely to conflict, are easier to use in wasm, and cannot fully identify the real host memory address. Finally, this resets the pointer between benchmark runs, to make sure a run isn't accidentally faster: before, it was faster due to colliding on the same UID, and resetting the pointer avoids re-creating that issue.
Signed-off-by: Adrian Cole <adrian@tetrate.io>
// assignedToBindingPod has the guest instances for Pods in binding cycle.
assignedToBindingPod map[types.UID]guest
mux sync.RWMutex
}
So this iteration uses a simple lock for now. This is under the assumption that access to one plugin isn't concurrent enough to warrant the additional complexity of separate read and write locks, where often you need a write lock anyway. We can optimize this later, but for now, simple IMHO is better.
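The "simple lock" approach above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the type and field names (`guestPool`, `assignedToBindingPod` keyed here by a `uint32` cycle ID) are simplified assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

type guest struct{ id uint32 }

// guestPool guards its map with one plain sync.Mutex, on the assumption
// that access isn't concurrent enough to justify an RWMutex, whose
// read-path benefit is small when most operations take the write lock.
type guestPool struct {
	mux                  sync.Mutex
	assignedToBindingPod map[uint32]*guest // keyed by cycle ID
}

func (p *guestPool) put(id uint32, g *guest) {
	p.mux.Lock()
	defer p.mux.Unlock()
	p.assignedToBindingPod[id] = g
}

func (p *guestPool) get(id uint32) *guest {
	p.mux.Lock()
	defer p.mux.Unlock()
	return p.assignedToBindingPod[id]
}

func main() {
	p := &guestPool{assignedToBindingPod: map[uint32]*guest{}}
	p.put(1, &guest{id: 1})
	fmt.Println(p.get(1).id) // 1
}
```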
I haven't gone through all of it yet, but I left one comment for now.
scheduler/plugin/plugin.go
-	defer pl.pool.assignedToSchedulingPodLock.Unlock()
-	g, err := pl.pool.getOrCreateGuest(ctx, pod.GetUID())
+	g, err := pl.pool.getForScheduling(ctx, cycleID(pod))
assignedToSchedulingPodLock was the lock that protected the guest from being used in parallel (because Filter can be called in parallel during preemption). And here, because you removed assignedToSchedulingPodLock, the guest obtained here can be used in parallel during preemption, right? Maybe we should keep assignedToSchedulingPodLock? (Or have something to prevent that in getForScheduling?)
I see. Indeed, the pool lock won't prevent overlapping calls from using the guest. So I think I need to revise the docs, because scheduling is also not one at a time.
So I'm thinking that if the desire is one and only one guest handling scheduling cycles, we need something like a pool.doWithScheduling(func(guest)) sort of thing. This would serialize activity, but prevent having multiple layers of locks (in the plugin and also the pool).
However, I think it may be simpler and easier to allow multiple scheduling cycles to co-exist, since the comment suggests exactly that can happen during pre-emption.
We would change getForScheduling to be backed by a map, so that in the case of pre-emption there is no conflict on one guest, and also no fighting over it under concurrency. For example, in our current design this seems possible: PreFilter(id=1) initializes state, PreFilter(id=2) clobbers id 1's state with its own state during pre-emption, then Filter(id=1) has no state anymore.
So, TL;DR: I think we should be OK with another guest. The main issue is that we may end up with a garbage collection problem, because there's no "finished" state I'm aware of in the scheduling cycle. So, we might end up with dangling guests if we allow more than one in the scheduling cycle.
It would be great to be able to know if something was a pre-emption, because then we could simply special-case it.
I think the state problem was too abstract for me to have brought up yet, because we don't implement it in this plugin yet.
I will fix the overlapping problem in this PR and bear in mind the details about state later!
I changed the implementation to serialize access during scheduling, and put in some notes about my understanding of it. Thanks for all the help, and feel free to correct me further.
Thanks for the change! The doWithScheduling looks great, and the note on it is correct.
> It would be great to be able to know if something was a pre-emption, because then we could simply special-case it.
Just one idea. Technically, we can notice that preemption starts with state.Clone() here:
https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/scheduler/framework/preemption/preemption.go#L574
It's called before SelectVictimsOnNode in every goroutine (created here). (SelectVictimsOnNode is the one we talked about in Slack; it calls Filter(), AddPod(), and RemovePod().)
And state.Clone() internally calls the Clone() methods implemented on the data stored by each plugin:
https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/scheduler/framework/cycle_state.go#L82
So, if we put something with a Clone() method into the CycleState in PreFilter, that Clone() will be executed before Filter(), AddPod(), and RemovePod() are called from each goroutine. So, we can regard Clone() as a trigger, and may want to use it to prepare for the calls to Filter() etc. coming soon after.
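The Clone()-as-trigger idea above could be sketched like this. To stay self-contained, this mirrors the shape of the scheduler framework's StateData interface with local, lower-cased names; `cycleData` and `onPreempt` are illustrative assumptions, not the plugin's real types.

```go
package main

import "fmt"

// stateData mirrors framework.StateData (a single Clone method),
// simplified here to avoid importing kubernetes.
type stateData interface{ clone() stateData }

// cycleData is what PreFilter would store in the CycleState. Because the
// framework clones all stored data when preemption clones the state,
// the clone call itself can act as a "preemption is starting" signal.
type cycleData struct {
	cycleID   uint32
	onPreempt func(cycleID uint32) // hook to prepare for parallel Filter/AddPod/RemovePod
}

func (d *cycleData) clone() stateData {
	if d.onPreempt != nil {
		d.onPreempt(d.cycleID)
	}
	return &cycleData{cycleID: d.cycleID, onPreempt: d.onPreempt}
}

func main() {
	preempted := false
	d := &cycleData{cycleID: 42, onPreempt: func(uint32) { preempted = true }}
	_ = d.clone() // stands in for state.Clone() cloning each stored entry
	fmt.Println(preempted) // true
}
```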
Signed-off-by: Adrian Cole <adrian@tetrate.io>
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: codefromthecrypt, sanposhiho
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks for having this, as well as some discussion. I'm glad that, even though we're entering the jungle of the complex preemption area, we have some clear insights about the problems to solve.
Thanks, @sanposhiho. I learned a lot too, and any tricks to identify pre-emption sound really useful. I think we'll end up revisiting the design a couple of times at least, especially when we get pre-emption tests in!
What type of PR is this?
/kind bug
What this PR does / why we need it:
Before, we used the UID to identify which scheduling cycle is in progress. However, when a scheduling cycle fails, the same UID can recur the next time.
This change uses the pod's pointer instead, as it is different per run and easier to compare in wasm. We use the lower bits because they are unlikely to conflict, are easier to use in wasm, and cannot fully identify the real host memory address.
Finally, this resets the pointer between benchmark runs, to make sure a run isn't accidentally faster. Before, it was faster due to colliding on the same UID; by resetting the pointer, we avoid re-creating that issue.
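The pointer-lower-bits idea described above can be sketched as follows. This is a minimal illustration under assumed names (`pod` stands in for the scheduler's `*v1.Pod`, and `cycleID` is a hypothetical helper), not the PR's actual code.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pod stands in for the scheduler's *v1.Pod; names are illustrative.
type pod struct{ name string }

// cycleID derives a per-run identifier from the pod's pointer value.
// Two live objects never share an address, so unlike a UID this cannot
// repeat when the same pod is retried after a failed cycle. Keeping only
// the lower 32 bits makes the value cheap to pass into wasm and avoids
// exposing the full host memory address.
func cycleID(p *pod) uint32 {
	return uint32(uintptr(unsafe.Pointer(p))) // truncate to the low bits
}

func main() {
	a, b := &pod{name: "a"}, &pod{name: "b"}
	// Distinct live pods yield distinct IDs in practice, since their
	// addresses within one process heap differ in the low bits.
	fmt.Println(cycleID(a) != cycleID(b))
}
```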
Which issue(s) this PR fixes:
NONE
Special notes for your reviewer:
I noticed curious results in the prior PR. The current results look a lot more realistic.
Does this PR introduce a user-facing change?
NONE
What are the benchmark results of this change?