
do not use the clock in __dealloc__ to prevent deadlock #5621

Merged: 3 commits into kivy:master on Mar 2, 2018

Conversation

@jabdoa2 (Contributor) commented Feb 16, 2018

Prevents hangs such as the one in #5605.
In 1.10 the Clock always holds a lock when registering or unregistering callbacks. So if garbage collection occurs while the clock holds that lock in the same thread (such as in #5605), the process will deadlock. This PR changes the behavior in __dealloc__ to perform the work synchronously, without using the clock, to prevent this issue.
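
For context, here is a minimal, self-contained sketch of the failure mode described above (plain Python, not the actual Kivy code; TinyClock and Widgetish are made-up names). A non-reentrant lock is held while callbacks run; dropping the last reference to an object inside that window runs its __del__, which calls back into the clock and tries to take the same lock again. A timeout is used only so the sketch prints instead of hanging:

    import threading

    class TinyClock:
        """Toy scheduler that, like Kivy 1.10's Clock, uses a plain Lock."""

        def __init__(self):
            self._lock = threading.Lock()
            self._events = []

        def schedule(self, callback):
            # With a real (blocking) acquire this would hang forever when
            # called re-entrantly; the timeout is only here for the demo.
            if not self._lock.acquire(timeout=1):
                print("would deadlock: lock already held by this thread")
                return
            try:
                self._events.append(callback)
            finally:
                self._lock.release()

        def tick(self):
            with self._lock:                  # lock held while callbacks run
                events, self._events = self._events, []
                for callback in events:
                    callback()                # __del__ may run in here

    class Widgetish:
        def __init__(self, clock):
            self._clock = clock

        def __del__(self):
            # Scheduling cleanup through the clock from __del__ is the bug.
            self._clock.schedule(lambda: None)

    clock = TinyClock()
    holder = [Widgetish(clock)]
    clock.schedule(holder.clear)   # dropping the last reference runs __del__
    clock.tick()                   # prints "would deadlock: ..."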

welcome bot commented Feb 16, 2018

Thanks for opening your first pull request here! 💖 Please check out our contributing guidelines.

@matham (Member) commented Feb 23, 2018

If I'm reading your changes correctly, when deallocating from __dealloc__ you call gl_dealloc directly rather than scheduling it through the clock. The reason is that __dealloc__ could be executed while unschedule is being called, so we're already holding the lock when __dealloc__ runs, and we deadlock when trying to acquire the lock again.

The problem with this fix is that if you call unschedule from another thread, you'll be executing __dealloc__ and then gl_dealloc from that second thread. However, gl functions are not allowed to be called from a thread outside the main thread (that's why it was scheduled through the clock to begin with, I'm assuming, so that it happens in the main thread), otherwise it may crash or become corrupt.

Concurrency is really difficult, in particular when using __del__ or __dealloc__. One (somewhat complex) solution is, instead of calling schedule like we do currently, to just add the code to be called to a queue that kivy services, say, once every few seconds. Then you don't need to schedule anything, because kivy will call the freeing code from its thread when it gets to it. Using e.g. a list, it should be safe.

I'm not super familiar with the graphics code, but it'd look something like this: instead of doing self.trigger_gl_dealloc() you'd always do Clock.clock_service_queue.append(self.trigger_gl_dealloc), and kivy would have previously scheduled a safe function that services this queue periodically.
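
A rough sketch of that queue-servicing idea, under the assumption that such a queue existed (gl_dealloc_queue and _service_del_queue are illustrative names, not existing Kivy API; Clock.schedule_interval is the real scheduling call):

    from kivy.clock import Clock

    # Hypothetical queue that destructors append to instead of touching the clock.
    gl_dealloc_queue = []

    def _service_del_queue(dt):
        # Runs in the main thread via the normal Clock tick, so GL calls are safe.
        while gl_dealloc_queue:
            gl_dealloc_queue.pop(0)()

    # Service the queue every few seconds; this periodic cost is the
    # trade-off discussed later in the thread.
    Clock.schedule_interval(_service_del_queue, 5.0)

    # In __dealloc__/__del__ you would then only do:
    #     gl_dealloc_queue.append(self.trigger_gl_dealloc)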

The alternative is to make the clock lock a reentrant lock. The problem with this is that we'd have to go through all the code and check that it's safe. I'm guessing I didn't use an RLock to begin with because I didn't think it was safe, and I have a feeling we could never make it safe for all the places the lock is used.
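
For comparison, the reentrancy difference in plain Python (this is just standard-library behavior, not Kivy code):

    import threading

    lock = threading.Lock()
    lock.acquire()
    # A plain Lock cannot be taken twice by the same thread: this re-acquire
    # fails (it would block forever without the timeout).
    print(lock.acquire(timeout=0.1))   # False
    lock.release()

    rlock = threading.RLock()
    rlock.acquire()
    # An RLock tracks its owner, so the same thread may re-acquire it.
    print(rlock.acquire(timeout=0.1))  # True
    rlock.release()
    rlock.release()                    # one release per acquire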

@jabdoa2 (Contributor, Author) commented Feb 23, 2018

@matham In Kivy 1.10 our framework (Mission Pinball Framework, MPF for short) will hang every 10-30 minutes, which is unacceptable for a pinball machine. With this fix our user base did not get a single hang within the last week. At least for our use case this looks super safe.

I see your concern. However, since gl_xx cannot/should not be called from another thread, there is also no way to allocate objects such as FBOs or shaders from another thread, because that would also invoke gl_xx commands. This might still be theoretically unsafe, but in practice I don't see many cases where you could create graphics objects in another thread.

@jabdoa2 (Contributor, Author) commented Feb 23, 2018

@matham I quickly checked the Clock and RLock looks reasonable. Afaik the only limitation would be that the lock has to be released in the same thread, which looks fine for all methods (everything releases the lock before returning). Consequently, all cases where an RLock would not block but a Lock would, currently result in a deadlock anyway. So I guess that would be fine.

I did a build with RLock instead of Lock on master and will run my test case for the rest of the day to make sure it also prevents our deadlock. So far it works fine. Will report tonight.

@jabdoa2 (Contributor, Author) commented Feb 23, 2018

@tito Any thoughts on this?

@tito (Member) commented Feb 26, 2018

Your patch does solve the issue, but it then consumes CPU every second even if the app does nothing. So you changed the implementation from a trigger to a poll. It would be best to get back to a trigger approach :)

BUT, @matham, if the clock can execute callbacks from another thread, then it's an API break IMO. Any callback executed by the Clock was guaranteed to run in the main thread, because Clock.tick itself is done in the main thread. If that doesn't hold, then the Clock is broken. We have a @mainthread decorator, and in a lot of apps I'm scheduling (sometimes unschedule first, then schedule) a callback to be called in the main thread.

Whatever the reason is, the Clock should never execute a callback anywhere other than the "tick"-related thread. Even more, it should not execute callbacks outside the "tick" method.
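
For reference, the @mainthread contract being relied on here, as it is typically used (update_label is just an illustrative callback):

    from kivy.clock import mainthread

    @mainthread
    def update_label(label, text):
        # The decorator schedules the call through the Clock, so the body
        # always runs on the next frame in the main thread, regardless of
        # which thread invoked update_label().
        label.text = text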

@tito (Member) commented Feb 26, 2018

And the fact that the graphics code segfaults really highlights this issue.

@tito added labels on Feb 26, 2018: Component: core-providers (kivy/core), Priority: High (Should be fixed ASAP), Status: Needs-analysis (Issue needs to be analyzed if it's real), Status: On-hold (PR is on hold and should not be merged until issues inside are resolved)

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

@tito: We cannot use the clock from the destructor because GC might trigger the destructor from within the clock and deadlock it. The clock will only trigger and schedule callbacks in the main thread. However, the GC might run the destructor in any thread. That's why we need this quirk instead of going through the Clock callback.

@tito (Member) commented Feb 26, 2018

Then the implementation of the clock is the issue IMO, not the usage. I guess the Clock must be fixed to be reentrant. Unsure if RLock itself will be sufficient.

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

If we want to go that way, RLock would be the solution. But unfortunately it is not enough, because the linked list inside the Clock is not reentrant safe. We would trade the deadlock for potential data corruption during Clock calls from GC. However, GC is exactly the only place where this would happen, because it can only be triggered from within the Clock. So a reentrant Clock would be nice but (a) very hard to get right and (b) very hard to test, since this will never occur outside GC. I'm not sure that effort would be worth it just to be able to use the clock from within GC.

@tito (Member) commented Feb 26, 2018

I don't think that would be that hard to get right. The trigger just adds an event to the list. Accessing the list is under a lock, but maybe, with just a few cases to handle, we can RLock it correctly.

Touching the GC in the clock because we hit the issue is madness :)

@matham (Member) commented Feb 26, 2018

If we change to use a plain list for scheduling, then that would basically drop the performance by a lot. The linked list together with the lock currently used is what ensures it's not slow. Also, the previous code using a list actually didn't safely unschedule (I'm pretty sure - we copied the list of events to be processed, so any unschedule would not have been noticed), but if it had, it would have needed locks, so it would have had the same issue.

There's nothing wrong with the current clock code in the sense that it doesn't have a bug; it's simply that it's never safe to acquire a lock from within __dealloc__ or __del__, because of deadlock issues. Turning it into an RLock would fix this issue but is not at all simple, because the nature of __del__ means it can be called at any line, which makes it really difficult to protect against corruption. I don't think we can easily switch to an RLock.

This kind of issue is well known and python itself is not immune to it: https://bugs.python.org/issue14976#msg275377 and https://www.pytennessee.org/schedule/presentation/159/. The only solution, other than making the clock code re-entrant, which I think is quite difficult, is to simply add anything __dealloc__ needs to process to a list so that it doesn't interact with the clock or anything else, which is what this PR is doing.

But I do see the issue with wasting cycles on the periodic checking; that's why I wanted your feedback. Although, doesn't Cache do the same thing, where it periodically checks for expiration? We could also have a list that Clock always processes after each tick, and add callbacks to that list. I'm not sure what else we can do.

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

@matham what about a Clock.trigger_from_gc/trigger_lock_free then? That would fill a list and the clock would check that at the start of the tick. I can implement that if you want.

@matham (Member) commented Feb 26, 2018

Yeah, that sounds good. But it may just be better to use a simple function, schedule_del_safe, that takes a callback and adds it to a list of callbacks to be run after the normal callbacks are processed during a tick, rather than going through the typical event scheduling, triggering, etc. Maybe that's what you meant? We'll never need to cancel such a callback, so there's no reason to use a typical clock event that uses trigger etc.
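
In plain-Python terms, the shape being proposed looks roughly like this (a sketch only; the real implementation is the Cython code reviewed further down):

    class ClockSketch:
        def __init__(self):
            self._del_queue = []

        def schedule_del_safe(self, callback):
            # Safe from __del__/__dealloc__ and from any thread:
            # a list append takes no lock.
            self._del_queue.append(callback)

        def tick(self):
            self._process_events()              # normal scheduled events first
            callbacks = self._del_queue[:]      # then drain the del-safe queue
            del self._del_queue[:len(callbacks)]
            for callback in callbacks:
                callback()                      # callbacks take no parameters

        def _process_events(self):
            pass  # stand-in for the real event processing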

@tito (Member) commented Feb 26, 2018

(I rewrote this comment many times, changing my opinion after trying to simulate the issue with threads, reading the Clock code etc.)

So the whole issue is just located at .get_callback() in Clock.unschedule. Everything else seems ok.

So I wonder: what about taking that loop out of the lock?

                while ev is not None:
                    if ev.get_callback() == callback:
                        events.append(ev)
                    ev = ev.next

However my mind goes around it, there is always a way it actually breaks and produces inconsistencies.
My understanding:

  • This loop calls cancel() on all the events that match the callback (if all=True, otherwise just the first one)
  • The cancel() is marvelous, and works with a cap_event

But the way we loop over the events doesn't take any of that into account. Would it be possible to:

  • either rework the loop to be as great as the process_event?
  • or just "mark" the event to be cancelled, and actually do it in the tick?

The second solution is nice, looks simple to implement, and doesn't really have a direct impact on performance (a rough sketch follows below). With it we don't care about looping over the current event list, as "deletion" would happen in the tick. And additions are appended to the end, so no issue there.
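
A rough sketch of that second option (mark now, delete during the tick), using hypothetical names since this is not the approach that was merged and none of these attributes are the real _clock.pyx ones:

    def unschedule_mark_only(clock, callback, all=True):
        # Only flag matching events; no list surgery happens here, so there
        # is no conflict with code already running under the clock lock.
        ev = clock.root_event
        while ev is not None:
            if ev.get_callback() == callback:
                ev.cancelled = True
                if not all:
                    break
            ev = ev.next

    def tick_deferred_removal(clock):
        # Deletion is deferred to the tick, which already owns the list.
        ev = clock.root_event
        while ev is not None:
            nxt = ev.next
            if ev.cancelled:
                clock.remove_event(ev)     # hypothetical unlink helper
            else:
                ev.tick_event()            # hypothetical "run this event"
            ev = nxt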

@matham ?

@matham (Member) commented Feb 26, 2018

I didn't understand what you mean by "either rework the loop to be as great as the process_event".

Regarding the second solution, it is a nice solution, but the first issue that comes to mind is: what if you cancel and then reschedule the event in the same frame? We currently guarantee that events are executed in the order they were scheduled. I suppose we could mark it as canceled, but then move it to the end of the linked list the next time it's scheduled.

But I don't understand how any of this solves the original issue. The original problem had to do with scheduling or triggering an event, not with unscheduling. So the problem is with this part of the code: https://github.com/kivy/kivy/blob/master/kivy/_clock.pyx#L57. For that we definitely do need the lock, because it's supposed to be thread safe so that multiple threads can trigger or schedule an event.

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

@matham I changed the PR accordingly. Please have a look

@matham (Member) left a review:

Overall it looks great, thanks! There are just a few small changes for polish.

@@ -205,6 +205,7 @@ cdef class CyClockBase(object):
         self._lock = Lock()
         self._lock_acquire = self._lock.acquire
         self._lock_release = self._lock.release
+        self._del_queue = []
@matham (Member) commented:

Please declare it here https://github.com/kivy/kivy/blob/master/kivy/_clock.pxd#L83 as cdef public list _del_queue.

@@ -248,6 +249,13 @@ cdef class CyClockBase(object):
         ev.release()
         return ev

+    cpdef schedule_del_safe(self, callback):
+        '''Schedule a callback. Might be called from GC and cannot be cancelled.

@matham (Member) commented:

It'd be nice to expand a little explaining what the problem is. E.g. something like:

Schedule a callback. Might be called from GC and cannot be cancelled.

It's unsafe to call various kinds of code, such as code that uses a lock, from `__del__` or `__dealloc__` methods. Since Kivy's Clock uses a lock, it's generally unsafe to call it from these methods. Instead, use this method, which is thread safe and `__del__`/`__dealloc__` safe, to schedule the callback in the kivy thread. It'll be executed in order after the normal events are processed.

The callback takes no parameters and cannot be canceled.

kivy/_clock.pyx Outdated
while self._del_queue:
callback = self._del_queue.pop()
callback()

@matham (Member) commented Feb 26, 2018:

A slightly more performant approach is:

callbacks = self._del_queue[:]
del self._del_queue[:len(callbacks)]
for callback in callbacks:
    callback()

The reason is that it's cheaper than popping one callback at a time. Notice how it makes sure to delete only the callbacks it will process in this frame (this wasn't an issue with your code though). Also, this processes them in the order they were scheduled.

@tito (Member) commented Feb 27, 2018

@matham The error is not where you linked. The error is that the GC is triggered when calling get_callback(), and we hold the lock around that part. Everywhere else, it looks like the lock doesn't have any impact on the Python-related code, only on Cython.

And you're right, I forgot about the order of execution. So I guess adding the new API works; it will solve the user's issue, though I'm unsure whether it will work for all cases.

@matham (Member) commented Feb 28, 2018

@tito There's also https://github.com/kivy/kivy/blob/master/kivy/_clock.pyx#L54. But even if we move that out of the lock, are you sure that none of these cython attribute accesses and object comparisons execute bytecode?

I'm also not sure how maintainable it would be to assume no python bytecode can be executed while the lock is held in the clock. It just happens that the code is currently in cython rather than in python, but someone may later add some other code while the lock is held, forgetting about this, or we may want some feature that requires python code.

I think this PR should solve the issue, as long as we remember not to do anything else from __dealloc__. That should be the rule anyway: never do much in __dealloc__.
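
As a usage note, with the API added in this PR a destructor is expected to do no more than hand off a small callback. A sketch, where GLResourceHolder and _release_gl_id are illustrative names rather than actual Kivy graphics classes, and Clock.schedule_del_safe is the method this PR adds:

    from functools import partial
    from kivy.clock import Clock

    def _release_gl_id(gl_id):
        # Runs later, in the Kivy thread, after the normal events of a tick;
        # real code would free the GL resource here.
        pass

    class GLResourceHolder:
        def __init__(self, gl_id):
            self.gl_id = gl_id

        def __del__(self):
            # Do as little as possible: no locks, no Clock events. The
            # callback captures only the id, not self, so the object is
            # not resurrected by the queue.
            Clock.schedule_del_safe(partial(_release_gl_id, self.gl_id))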

@tito (Member) commented Feb 28, 2018

@matham good point. OK for the PR

@jabdoa2 (Contributor, Author) commented Mar 1, 2018

@matham I made the changes you requested.

@matham (Member) commented Mar 1, 2018

@jabdoa2 Did you forget to push?

@jabdoa2 (Contributor, Author) commented Mar 1, 2018

Oh, looks like I pushed to the wrong branch. Will correct that tomorrow.

@jabdoa2 (Contributor, Author) commented Mar 2, 2018

@tito @matham here we go

@matham merged commit 1ccd294 into kivy:master on Mar 2, 2018
welcome bot commented Mar 2, 2018

Congrats on merging your first pull request! 🎉🎉🎉

@matham (Member) commented Mar 2, 2018

Thanks for the code and your patience!
