do not use the clock in __dealloc__ to prevent deadlock #5621
Conversation
Thanks for opening your first pull request here! 💖 Please check out our contributing guidelines.
(force-pushed e2480b0 to dcea1e8)
If I'm reading your changes correctly, when deallocating from … The problem with this fix is that if you call … Concurrency is really difficult, in particular using … I'm not super familiar with the graphics code, but it'd look something like … but instead of doing … The alternative is to make the clock lock a reentrant lock. The problem with this is that we'd have to go through all the code and check that it's safe. I'm guessing I didn't use a RLock to begin with probably because I didn't think it was safe. And I have a feeling that we could never make it safe for all the places the lock is used.
@matham In Kivy 1.10 our framework (Mission Pinball Framework, MPF for short) will hang every 10-30 minutes, which is unacceptable for a pinball machine. With this fix our user base did not get a single hang within the last week. At least for our use case this looks super safe. I see your concern. However, since gl_xx cannot/should not be called from another thread, there is also no way to allocate objects such as FBOs or shaders from another thread, because that would also invoke gl_xx commands. This might still be theoretically unsafe, but practically I don't see many cases where you could create graphics objects in another thread.
@matham I quickly checked the Clock, and RLock looks reasonable. Afaik the only limitation would be that the lock has to be released in the same thread, which looks fine for all methods (everything releases the lock before returning). Consequently, all cases where RLock will not block but Lock will would currently result in a deadlock anyway. So I guess that would be fine. I did a build with RLock instead of Lock on master and will run my test case for the rest of the day to make sure that it also prevents our deadlock. So far it works fine. Will report tonight.
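The RLock semantics discussed here can be sketched in plain Python (an illustration using the standard library, not Kivy code): an RLock can be re-acquired by the thread that already holds it, but it must be released by that same thread, which is the limitation mentioned above.

```python
import threading

rlock = threading.RLock()
rlock.acquire()

# Re-acquiring from the owning thread succeeds (a plain Lock would block here):
assert rlock.acquire(blocking=False)
rlock.release()

# ...but releasing from a different thread raises RuntimeError.
errors = []

def release_from_other_thread():
    try:
        rlock.release()
    except RuntimeError as e:
        errors.append(type(e).__name__)

t = threading.Thread(target=release_from_other_thread)
t.start()
t.join()
assert errors == ["RuntimeError"]

rlock.release()  # the owner releases the lock it still holds
```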
(force-pushed 764d3b8 to 6b44fef)
@tito Any thoughts on this?
Your patch does solve the issue, yes, but then it consumes CPU every second even if the app does nothing. So you replaced a trigger implementation with a polling one. It would be best to get back to a trigger approach :) BUT, @matham, if the clock can be executed from another thread, then it's an API break IMO. Any callback executed from the Clock was ensured to run in the main thread, because Clock.tick is done in the main thread itself. If that doesn't work, then the Clock is broken. We have a … Whatever the reason is, the Clock should never execute a callback in any thread other than the "tick"-related thread. Even more, it should not execute callbacks outside the "tick" method.
And the fact that the graphics code segfaults really highlights this issue.
@tito: We cannot use the clock from the destructor because GC might trigger it from within the clock and deadlock the Clock. The clock will only trigger and schedule callbacks in the main thread. However, the GC might trigger the destructor in any thread. That's why we require this quirk via the Clock callback.
Then the implementation of the clock is the issue IMO, not the usage. I guess the Clock must be fixed to be reentrant. Unsure if RLock itself will be sufficient.
If we want to go that way, RLock would be the solution. But unfortunately it is not enough, because the linked list inside the Clock is not reentrant safe. We would trade the deadlock for potential data corruption during Clock calls from GC. However, GC is exactly the only place where this would happen, because it can only be triggered from within the Clock. So a reentrant Clock would be nice, but (a) very hard to get right and (b) very hard to test, since this will never occur outside GC. I'm not sure that effort would be worth it only to be able to use the clock from within GC.
I don't think that would be that hard to get. The trigger just adds an event into the list. Accessing the list is under lock, but maybe with just a few cases we can rlock it correctly. Touching the GC in the clock because we hit the issue is madness :)
If we change to use a list for scheduling, then that would basically drop the performance by a lot. The linked list together with the lock currently used is what ensures it's not slow. Also, the previous code using a list actually didn't safely unschedule (I'm pretty sure: we copied the list of events to be processed, so any unschedule would not have been noticed), but if it did, it would have needed locks, so it would have had the same issue.

There's nothing wrong with the current clock code in the sense that it doesn't have a bug; it's simply that it's never safe to access a lock from within … This kind of issue is well known and Python itself is not immune from it: https://bugs.python.org/issue14976#msg275377 and https://www.pytennessee.org/schedule/presentation/159/. The only solution, other than making the clock code re-entrant, which I think is quite difficult, is to simply add anything …

But I do see the issue with wasting cycles on the periodic checking; that's why I wanted your feedback. Although, doesn't Cache do the same thing, where it periodically checks for expiration? We could also have a list that Clock always processes after each tick and add it to that list. I'm not sure what else we can do.
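The deadlock pattern described here can be reproduced in a few lines of plain Python (a standalone sketch, not Kivy code): a `__del__` that tries to take a lock already held by the current thread can never succeed. A timeout is used so the demo terminates instead of hanging; note that CPython runs `__del__` immediately when the refcount hits zero, which is what makes the scenario deterministic here.

```python
import threading

lock = threading.Lock()
results = []

class Victim:
    def __del__(self):
        # Like scheduling a Clock event from __dealloc__: try to take
        # the lock that the "clock" is already holding in this thread.
        # A blocking acquire() here would deadlock forever; the timeout
        # only exists so this demo can finish and report the failure.
        results.append(lock.acquire(timeout=0.05))

v = Victim()
with lock:       # the "clock" holds its lock while processing events
    del v        # CPython runs __del__ right here, inside the locked region

assert results == [False]
```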
@matham What about a Clock.trigger_from_gc/trigger_lock_free then? That would fill a list, and the clock would check that at the start of the tick. I can implement that if you want.
Yeah, that sounds good. But it may just be better to use a simple function …
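The idea proposed here could look roughly like the following sketch (a hypothetical simplification, with invented names; it is not the real Kivy Clock): destructors only append to a plain Python list, which is atomic under the GIL, and the tick drains that list without the destructor ever touching the clock's lock.

```python
import threading

class MiniClock:
    """Simplified sketch of a GC-safe scheduling queue (not Kivy code)."""

    def __init__(self):
        self._lock = threading.Lock()   # guards the normal event structures
        self._del_queue = []            # GC-safe queue: appending needs no lock

    def schedule_del_safe(self, callback):
        # Safe to call from __del__/__dealloc__: list.append is atomic
        # under the GIL and never acquires self._lock.
        self._del_queue.append(callback)

    def tick(self):
        with self._lock:
            pass  # normal event processing would happen here (elided)
        # Drain only the callbacks present at the start of this frame,
        # in the order they were scheduled.
        callbacks = self._del_queue[:]
        del self._del_queue[:len(callbacks)]
        for callback in callbacks:
            callback()

clock = MiniClock()
ran = []
clock.schedule_del_safe(lambda: ran.append("a"))
clock.schedule_del_safe(lambda: ran.append("b"))
clock.tick()
assert ran == ["a", "b"]
```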
(I rewrote this comment many times, changing my opinion after trying to simulate the issue with threads, reading the Clock code, etc.) So the whole issue is just located at .get_callback() in … So I wonder about delocking that loop: …
Whichever way my mind goes around it, there is always a way it actually breaks and gets inconsistencies.
But the way we loop over the events doesn't care about any of it. Would it be possible to: …
@matham ?
I didn't understand what you mean by "either rework the loop to be as great as the process_event". Regarding the second solution: it is a nice solution, but the first issue that comes to mind is, what if you cancel and then reschedule the event in this frame? We guarantee now that events are executed in the order they were scheduled. I suppose that we could mark it as canceled, but then move it to the end of the linked list the next time it's scheduled. But I don't understand how any of this solves the original issue? The original problem had to do with scheduling, or triggering an event, not unscheduling. So the problem is with this part of the code: https://github.com/kivy/kivy/blob/master/kivy/_clock.pyx#L57. For that we definitely do need the lock, because it's supposed to be thread safe, so multiple threads can trigger or schedule an event.
(force-pushed 6b44fef to c60fd51)
@matham I changed the PR accordingly. Please have a look.
Overall it looks great, thanks! There's just a few small changes for polish.
@@ -205,6 +205,7 @@ cdef class CyClockBase(object):
        self._lock = Lock()
        self._lock_acquire = self._lock.acquire
        self._lock_release = self._lock.release
        self._del_queue = []
Please declare it here https://github.com/kivy/kivy/blob/master/kivy/_clock.pxd#L83 as cdef public list _del_queue.
@@ -248,6 +249,13 @@ cdef class CyClockBase(object):
        ev.release()
        return ev

    cpdef schedule_del_safe(self, callback):
        '''Schedule a callback. Might be called from GC and cannot be cancelled.
It'd be nice to expand a little explaining what the problem is. E.g. something like:
Schedule a callback. Might be called from GC and cannot be cancelled.
It's unsafe to call various kinds of code, such as code that takes a lock, from `__del__` or `__dealloc__` methods. Since Kivy's Clock uses a lock, it's generally unsafe to call it from these methods. Instead, use this method, which is thread safe and `__del__`/`__dealloc__` safe, to schedule the callback in the kivy thread. It'll be executed in order after the normal events are processed.
The callback takes no parameters and cannot be canceled.
kivy/_clock.pyx
        while self._del_queue:
            callback = self._del_queue.pop()
            callback()
A slightly more performant approach is:
    callbacks = self._del_queue[:]
    del self._del_queue[:len(callbacks)]
    for callback in callbacks:
        callback()
The reason is that it's cheaper than popping one callback at a time. Notice how it makes sure to delete only the callbacks it'll process in this frame (this wasn't an issue with your code, though). Also, this processes them in the order they were scheduled.
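The ordering difference between the two approaches is easy to see with strings standing in for callbacks (illustrative only):

```python
# Popping from the end processes callbacks in reverse (LIFO) order:
queue = ["a", "b", "c"]            # callbacks, in the order they were scheduled
popped = []
while queue:
    popped.append(queue.pop())
assert popped == ["c", "b", "a"]

# Copying, then deleting only the processed prefix, keeps FIFO order
# and would leave anything appended mid-processing for the next frame:
queue = ["a", "b", "c"]
callbacks = queue[:]
del queue[:len(callbacks)]
assert callbacks == ["a", "b", "c"]
assert queue == []
```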
@matham The error is not where you linked. The error is that the GC is triggered when calling get_callback(), and we lock that part. Everywhere else, it looks like the lock doesn't have any impact on the Python-related code, only Cython. And you're right, I forgot about the order of execution. So I guess adding a new API works, and it will solve the user issue, but I'm unsure if it will still work for all cases.
@tito There's also https://github.com/kivy/kivy/blob/master/kivy/_clock.pyx#L54. But even if we move that out of the lock, are you sure that none of these Cython attribute accesses and object comparisons execute bytecode? I'm also not sure how maintainable it'd be to assume no Python bytecode can be executed while the lock is held in the clock. It just happens that it's currently in Cython rather than in Python, but someone may later add some other code while the lock is held, forgetting about this, or we may want some feature that requires Python code. I think this PR should solve the issue, as long as we remember to not do anything else from …
@matham Good point. OK for the PR.
@matham I did the changes as you requested.
@jabdoa2 Did you forget to push?
Oh, looks like I pushed to the wrong branch. Will correct that tomorrow.
Congrats on merging your first pull request! 🎉🎉🎉
Thanks for the code and your patience!
Prevents hangs such as in #5605
In 1.10 the Clock always holds a lock when registering or unregistering callbacks. So if garbage collection occurs while the clock holds the lock in the same thread (such as in #5605), the process will deadlock. This PR changes the behavior in __dealloc__ to perform the work synchronously, without using the clock, to prevent this issue.