
do not use the clock in __dealloc__ to prevent deadlock #5621

Merged: 3 commits into kivy:master on Mar 2, 2018

Conversation

@jabdoa2 (Contributor) commented Feb 16, 2018

Prevents hangs such as the one in #5605.
In 1.10 the Clock always holds a lock when registering or unregistering callbacks. So if garbage collection occurs while the clock holds that lock in the same thread (such as in #5605), the process will deadlock. This PR changes the behavior in __dealloc__ to perform the work synchronously, without using the clock, to prevent this issue.
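
For context, here is a minimal, self-contained sketch of the failure mode described above (plain Python, not the actual Kivy code; TinyClock and Widgetish are made-up names). A non-reentrant lock is held while callbacks run; dropping the last reference to an object inside that window runs its __del__, which calls back into the clock and tries to take the same lock again. A timeout is used only so the sketch prints instead of hanging:

    import threading

    class TinyClock:
        """Toy scheduler that, like Kivy 1.10's Clock, uses a plain Lock."""

        def __init__(self):
            self._lock = threading.Lock()
            self._events = []

        def schedule(self, callback):
            # With a real (blocking) acquire this would hang forever when
            # called re-entrantly; the timeout is only here for the demo.
            if not self._lock.acquire(timeout=1):
                print("would deadlock: lock already held by this thread")
                return
            try:
                self._events.append(callback)
            finally:
                self._lock.release()

        def tick(self):
            with self._lock:                  # lock held while callbacks run
                events, self._events = self._events, []
                for callback in events:
                    callback()                # __del__ may run in here

    class Widgetish:
        def __init__(self, clock):
            self._clock = clock

        def __del__(self):
            # Scheduling cleanup through the clock from __del__ is the bug.
            self._clock.schedule(lambda: None)

    clock = TinyClock()
    holder = [Widgetish(clock)]
    clock.schedule(holder.clear)   # dropping the last reference runs __del__
    clock.tick()                   # prints "would deadlock: ..."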

welcome bot commented Feb 16, 2018

Thanks for opening your first pull request here! 💖 Please check out our contributing guidelines.

@matham (Member) commented Feb 23, 2018

If I'm reading your changes correctly, when deallocating from __dealloc__ you call gl_dealloc directly rather than scheduling it through the clock. The reason is that __dealloc__ could be executed while unschedule is being called, so we're already holding the lock when __dealloc__ runs, and we deadlock when trying to acquire the lock again.

The problem with this fix is that if you call unschedule from another thread, you'll be executing __dealloc__ and then gl_dealloc from that second thread. However, gl functions are not allowed to be called from a thread outside the main thread (that's why it was scheduled through the clock to begin with, I'm assuming, so that it happens in the main thread), otherwise it may crash or become corrupt.

Concurrency is really difficult, in particular when using __del__ or __dealloc__. One (somewhat complex) solution is, instead of calling schedule like we do currently, to just add the code to be called to a queue that kivy services, say, once every few seconds. Then you don't need to schedule anything, because kivy will call the freeing code from its thread when it gets to it. Using e.g. a list, it should be safe.

I'm not super familiar with the graphics code, but it'd look something like this: instead of doing self.trigger_gl_dealloc() you'd always do Clock.clock_service_queue.append(self.trigger_gl_dealloc), and kivy would have previously scheduled a safe function that services this queue periodically.
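
A rough sketch of that queue-servicing idea, under the assumption that such a queue existed (gl_dealloc_queue and _service_del_queue are illustrative names, not existing Kivy API; Clock.schedule_interval is the real scheduling call):

    from kivy.clock import Clock

    # Hypothetical queue that destructors append to instead of touching the clock.
    gl_dealloc_queue = []

    def _service_del_queue(dt):
        # Runs in the main thread via the normal Clock tick, so GL calls are safe.
        while gl_dealloc_queue:
            gl_dealloc_queue.pop(0)()

    # Service the queue every few seconds; this periodic cost is the
    # trade-off discussed later in the thread.
    Clock.schedule_interval(_service_del_queue, 5.0)

    # In __dealloc__/__del__ you would then only do:
    #     gl_dealloc_queue.append(self.trigger_gl_dealloc)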

The alternative is to make the clock lock a reentrant lock. The problem with this is that we'd have to go through all the code and check that it's safe. I'm guessing I didn't use an RLock to begin with because I didn't think it was safe, and I have a feeling we could never make it safe for all the places the lock is used.
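
For comparison, the reentrancy difference in plain Python (this is just standard-library behavior, not Kivy code):

    import threading

    lock = threading.Lock()
    lock.acquire()
    # A plain Lock cannot be taken twice by the same thread: this re-acquire
    # fails (it would block forever without the timeout).
    print(lock.acquire(timeout=0.1))   # False
    lock.release()

    rlock = threading.RLock()
    rlock.acquire()
    # An RLock tracks its owner, so the same thread may re-acquire it.
    print(rlock.acquire(timeout=0.1))  # True
    rlock.release()
    rlock.release()                    # one release per acquire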

@jabdoa2 (Contributor, Author) commented Feb 23, 2018

@matham In Kivy 1.10 our framework (Mission Pinball Framework, MPF for short) will hang every 10-30 minutes, which is unacceptable for a pinball machine. With this fix our user base did not get a single hang within the last week. At least for our use case this looks super safe.

I see your concern. However, since gl_xx cannot/should not be called from another thread, there is also no way to allocate objects such as FBOs or shaders from another thread, because that would also invoke gl_xx commands. This might still be theoretically unsafe, but in practice I don't see many cases where you could create graphics objects in another thread.

@jabdoa2 (Contributor, Author) commented Feb 23, 2018

@matham I quickly checked the Clock and RLock looks reasonable. Afaik the only limitation would be that the lock has to be released in the same thread, which looks fine for all methods (everything releases the lock before returning). Consequently, all cases where an RLock would not block but a Lock would, currently result in a deadlock anyway. So I guess that would be fine.

I did a build with RLock instead of Lock on master and will run my test case for the rest of the day to make sure it also prevents our deadlock. So far it works fine. Will report tonight.

@jabdoa2 (Contributor, Author) commented Feb 23, 2018

@tito Any thoughts on this?

@tito (Member) commented Feb 26, 2018

Your patch does solve the issue, but it then consumes CPU every second even if the app does nothing. So you changed the implementation from a trigger to a poll. It would be best to get back to a trigger approach :)

BUT, @matham, if the clock can execute callbacks from another thread, then it's an API break IMO. Any callback executed by the Clock was guaranteed to run in the main thread, because Clock.tick itself is done in the main thread. If that doesn't hold, then the Clock is broken. We have a @mainthread decorator, and in a lot of apps I'm scheduling (sometimes unschedule first, then schedule) a callback to be called in the main thread.

Whatever the reason is, the Clock should never execute a callback anywhere other than the "tick"-related thread. Even more, it should not execute callbacks outside the "tick" method.
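
For reference, the @mainthread contract being relied on here, as it is typically used (update_label is just an illustrative callback):

    from kivy.clock import mainthread

    @mainthread
    def update_label(label, text):
        # The decorator schedules the call through the Clock, so the body
        # always runs on the next frame in the main thread, regardless of
        # which thread invoked update_label().
        label.text = text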

@tito (Member) commented Feb 26, 2018

And the fact that the graphics code segfaults really highlights this issue.

@tito added labels on Feb 26, 2018: Component: core-providers (kivy/core), Priority: High (Should be fixed ASAP), Status: Needs-analysis (Issue needs to be analyzed if it's real), Status: On-hold (PR is on hold and should not be merged until issues inside are resolved)

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

@tito: We cannot use the clock from the destructor because GC might trigger the destructor from within the clock and deadlock it. The clock will only trigger and schedule callbacks in the main thread. However, the GC might run the destructor in any thread. That's why we need this quirk instead of going through the Clock callback.

@tito (Member) commented Feb 26, 2018

Then the implementation of the clock is the issue IMO, not the usage. I guess the Clock must be fixed to be reentrant. Unsure if RLock itself will be sufficient.

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

If we want to go that way, RLock would be the solution. But unfortunately it is not enough, because the linked list inside the Clock is not reentrant safe. We would trade the deadlock for potential data corruption during Clock calls from GC. However, GC is exactly the only place where this would happen, because it can only be triggered from within the Clock. So a reentrant Clock would be nice but (a) very hard to get right and (b) very hard to test, since this will never occur outside GC. I'm not sure that effort would be worth it just to be able to use the clock from within GC.

@tito (Member) commented Feb 26, 2018

I don't think that would be that hard to get right. The trigger just adds an event to the list. Accessing the list is under a lock, but maybe, with just a few cases to handle, we can RLock it correctly.

Touching the GC in the clock because we hit the issue is madness :)

@matham (Member) commented Feb 26, 2018

If we change to use a plain list for scheduling, then that would basically drop the performance by a lot. The linked list together with the lock currently used is what ensures it's not slow. Also, the previous code using a list actually didn't safely unschedule (I'm pretty sure - we copied the list of events to be processed, so any unschedule would not have been noticed), but if it had, it would have needed locks, so it would have had the same issue.

There's nothing wrong with the current clock code in the sense that it doesn't have a bug; it's simply that it's never safe to acquire a lock from within __dealloc__ or __del__, because of deadlock issues. Turning it into an RLock would fix this issue but is not at all simple, because the nature of __del__ means it can be called at any line, which makes it really difficult to protect against corruption. I don't think we can easily switch to an RLock.

This kind of issue is well known and python itself is not immune to it: https://bugs.python.org/issue14976#msg275377 and https://www.pytennessee.org/schedule/presentation/159/. The only solution, other than making the clock code re-entrant, which I think is quite difficult, is to simply add anything __dealloc__ needs to process to a list so that it doesn't interact with the clock or anything else, which is what this PR is doing.

But I do see the issue with wasting cycles on the periodic checking; that's why I wanted your feedback. Although, doesn't Cache do the same thing, where it periodically checks for expiration? We could also have a list that Clock always processes after each tick, and add callbacks to that list. I'm not sure what else we can do.

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

@matham what about a Clock.trigger_from_gc/trigger_lock_free then? That would fill a list and the clock would check that at the start of the tick. I can implement that if you want.

@matham (Member) commented Feb 26, 2018

Yeah, that sounds good. But it may just be better to use a simple function, schedule_del_safe, that takes a callback and adds it to a list of callbacks to be run after the normal callbacks are processed during a tick, rather than going through the typical event scheduling, triggering, etc. Maybe that's what you meant? We'll never need to cancel such a callback, so there's no reason to use a typical clock event that uses trigger etc.
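
In plain-Python terms, the shape being proposed looks roughly like this (a sketch only; the real implementation is the Cython code reviewed further down):

    class ClockSketch:
        def __init__(self):
            self._del_queue = []

        def schedule_del_safe(self, callback):
            # Safe from __del__/__dealloc__ and from any thread:
            # a list append takes no lock.
            self._del_queue.append(callback)

        def tick(self):
            self._process_events()              # normal scheduled events first
            callbacks = self._del_queue[:]      # then drain the del-safe queue
            del self._del_queue[:len(callbacks)]
            for callback in callbacks:
                callback()                      # callbacks take no parameters

        def _process_events(self):
            pass  # stand-in for the real event processing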

@tito (Member) commented Feb 26, 2018

(I rewrote this comment many times, changing my opinion after trying to simulate the issue with threads, reading the Clock code etc.)

So the whole issue is just located at .get_callback() in Clock.unschedule. Everything else seems ok.

So I wonder: what about taking that loop out of the lock?

                while ev is not None:
                    if ev.get_callback() == callback:
                        events.append(ev)
                    ev = ev.next

However my mind goes around it, there is always a way it actually breaks and produces inconsistencies.
My understanding:

  • This loop calls cancel() on all the events that match the callback (if all=True, otherwise just the first one)
  • The cancel() is marvelous, and works with a cap_event

But the way we loop over the events doesn't take any of that into account. Would it be possible to:

  • either rework the loop to be as great as the process_event?
  • or just "mark" the event to be cancelled, and actually do it in the tick?

The second solution is nice, looks simple to implement, and doesn't really have a direct impact on performance (a rough sketch follows below). With it we don't care about looping over the current event list, as "deletion" would happen in the tick. And additions are appended to the end, so no issue there.
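
A rough sketch of that second option (mark now, delete during the tick), using hypothetical names since this is not the approach that was merged and none of these attributes are the real _clock.pyx ones:

    def unschedule_mark_only(clock, callback, all=True):
        # Only flag matching events; no list surgery happens here, so there
        # is no conflict with code already running under the clock lock.
        ev = clock.root_event
        while ev is not None:
            if ev.get_callback() == callback:
                ev.cancelled = True
                if not all:
                    break
            ev = ev.next

    def tick_deferred_removal(clock):
        # Deletion is deferred to the tick, which already owns the list.
        ev = clock.root_event
        while ev is not None:
            nxt = ev.next
            if ev.cancelled:
                clock.remove_event(ev)     # hypothetical unlink helper
            else:
                ev.tick_event()            # hypothetical "run this event"
            ev = nxt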

@matham ?

@matham (Member) commented Feb 26, 2018

I didn't understand what you mean by "either rework the loop to be as great as the process_event".

Regarding the second solution, it is a nice solution, but the first issue that comes to mind is: what if you cancel and then reschedule the event in the same frame? We currently guarantee that events are executed in the order they were scheduled. I suppose we could mark it as canceled, but then move it to the end of the linked list the next time it's scheduled.

But I don't understand how any of this solves the original issue. The original problem had to do with scheduling or triggering an event, not with unscheduling. So the problem is with this part of the code: https://github.com/kivy/kivy/blob/master/kivy/_clock.pyx#L57. For that we definitely do need the lock, because it's supposed to be thread safe so that multiple threads can trigger or schedule an event.

@jabdoa2 (Contributor, Author) commented Feb 26, 2018

@matham I changed the PR accordingly. Please have a look

@matham (Member) left a review:

Overall it looks great, thanks! There are just a few small changes for polish.

@@ -205,6 +205,7 @@ cdef class CyClockBase(object):
         self._lock = Lock()
         self._lock_acquire = self._lock.acquire
         self._lock_release = self._lock.release
+        self._del_queue = []
@matham (Member) commented:

Please declare it here https://github.com/kivy/kivy/blob/master/kivy/_clock.pxd#L83 as cdef public list _del_queue.

@@ -248,6 +249,13 @@ cdef class CyClockBase(object):
         ev.release()
         return ev

+    cpdef schedule_del_safe(self, callback):
+        '''Schedule a callback. Might be called from GC and cannot be cancelled.

@matham (Member) commented:

It'd be nice to expand a little explaining what the problem is. E.g. something like:

Schedule a callback. Might be called from GC and cannot be cancelled.

It's unsafe to call various kinds of code, such as code that uses a lock, from `__del__` or `__dealloc__` methods. Since Kivy's Clock uses a lock, it's generally unsafe to call it from these methods. Instead, use this method, which is thread safe and `__del__`/`__dealloc__` safe, to schedule the callback in the kivy thread. It'll be executed in order after the normal events are processed.

The callback takes no parameters and cannot be canceled.

kivy/_clock.pyx Outdated
while self._del_queue:
callback = self._del_queue.pop()
callback()

@matham (Member) commented Feb 26, 2018:

A slightly more performant approach is:

callbacks = self._del_queue[:]
del self._del_queue[:len(callbacks)]
for callback in callbacks:
    callback()

The reason is that it's cheaper than popping one callback at a time. Notice how it makes sure to delete only the callbacks it will process in this frame (this wasn't an issue with your code though). Also, this processes them in the order they were scheduled.

@tito (Member) commented Feb 27, 2018

@matham The error is not where you linked. The error is that the GC is triggered when calling get_callback(), and we hold the lock around that part. Everywhere else, it looks like the lock doesn't have any impact on the Python-related code, only on Cython.

And you're right, I forgot about the order of execution. So I guess adding the new API works; it will solve the user's issue, though I'm unsure whether it will work for all cases.

@matham (Member) commented Feb 28, 2018

@tito There's also https://github.com/kivy/kivy/blob/master/kivy/_clock.pyx#L54. But even if we move that out of the lock, are you sure that none of these cython attribute accesses and object comparisons execute bytecode?

I'm also not sure how maintainable it would be to assume no python bytecode can be executed while the lock is held in the clock. It just happens that the code is currently in cython rather than in python, but someone may later add some other code while the lock is held, forgetting about this, or we may want some feature that requires python code.

I think this PR should solve the issue, as long as we remember not to do anything else from __dealloc__. That should be the rule anyway: never do much in __dealloc__.
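
As a usage note, with the API added in this PR a destructor is expected to do no more than hand off a small callback. A sketch, where GLResourceHolder and _release_gl_id are illustrative names rather than actual Kivy graphics classes, and Clock.schedule_del_safe is the method this PR adds:

    from functools import partial
    from kivy.clock import Clock

    def _release_gl_id(gl_id):
        # Runs later, in the Kivy thread, after the normal events of a tick;
        # real code would free the GL resource here.
        pass

    class GLResourceHolder:
        def __init__(self, gl_id):
            self.gl_id = gl_id

        def __del__(self):
            # Do as little as possible: no locks, no Clock events. The
            # callback captures only the id, not self, so the object is
            # not resurrected by the queue.
            Clock.schedule_del_safe(partial(_release_gl_id, self.gl_id))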

@tito (Member) commented Feb 28, 2018

@matham good point. OK for the PR

@jabdoa2 (Contributor, Author) commented Mar 1, 2018

@matham I made the changes you requested.

@matham (Member) commented Mar 1, 2018

@jabdoa2 Did you forget to push?

@jabdoa2 (Contributor, Author) commented Mar 1, 2018

Oh, looks like I pushed to the wrong branch. Will correct that tomorrow.

@jabdoa2 (Contributor, Author) commented Mar 2, 2018

@tito @matham here we go

@matham merged commit 1ccd294 into kivy:master on Mar 2, 2018
welcome bot commented Mar 2, 2018

Congrats on merging your first pull request! 🎉🎉🎉

@matham (Member) commented Mar 2, 2018

Thanks for the code and your patience!
