Synchrotrons can deadlock on state_resolve_lock whilst calculating /sync #2505
Comments
Based on timestamps, the event which triggered the state resolution that blew everything up seems to be:
(possibly similar to #1981, although the logs mentioned there don't seem to show a permanent deadlock)
ah, right - yup, it's the same. The deadlock looks like it happens after a request melts with maximum recursion, presumably whilst holding the lock => game over.
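To illustrate the failure mode described above, here is a minimal sketch using the standard library's `asyncio.Lock`. It is not Synapse's code (Synapse uses Twisted Deferreds and its own lock implementation); `recurse_forever`, `unsafe_holder`, `safe_holder` and the other names are hypothetical. It only shows that if an exception such as a recursion error escapes between acquire and release, the lock stays held forever, whereas releasing in a `finally`-style construct lets later waiters proceed.

```python
import asyncio


def recurse_forever(n=0):
    # Hypothetical stand-in for whatever exhausted the stack.
    return recurse_forever(n + 1)


async def unsafe_holder(lock):
    await lock.acquire()
    recurse_forever()  # raises RecursionError
    lock.release()     # never reached, so the lock stays held


async def safe_holder(lock):
    # "async with" releases the lock even when the body raises.
    async with lock:
        recurse_forever()


async def try_acquire(lock, timeout=0.1):
    # Returns True if a later caller can still obtain the lock.
    try:
        await asyncio.wait_for(lock.acquire(), timeout)
    except asyncio.TimeoutError:
        return False
    lock.release()
    return True


async def demo(holder):
    lock = asyncio.Lock()
    try:
        await holder(lock)
    except RecursionError:
        pass  # the request "melts", as in the logs
    return await try_acquire(lock)


def run_demo():
    # (can a later waiter acquire after unsafe_holder?, after safe_holder?)
    return asyncio.run(demo(unsafe_holder)), asyncio.run(demo(safe_holder))
```

With the unsafe holder, every later waiter times out; with the safe holder, the next waiter acquires the lock normally.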
Another weird aspect is that some of the lost requests don't appear to block on acquiring the lock, but after having released it:
I can't see enough detail in the logs (especially with broken log contexts) to understand what's gone wrong. So hopefully the best bet here will be to make twisted actually log where we run out of stack. (Unless this is caused by recursive deferreds or a very long chain of deferreds).
I've captured the logs for posterity at ~/bug-2505.log on matrix@hera
It's possible that one reason we don't get a useful stacktrace from that 'maximum recursion depth exceeded' is https://twistedmatrix.com/trac/ticket/9301.
It just happened again; captured logs at ~/bug-2505-2.log. An observation:
I've been working on some patches to synapse and twisted which might help track down the stack overflow. Now deployed on matrix.org; let's see if they help.
1. Make it not blow out the stack when there are more than 50 things waiting for a lock. Fixes #2505.
2. Make it not mess up the log contexts.
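As a rough illustration of why many waiters on one lock can exhaust the stack, here is a toy sketch in plain Python. It is not Synapse's lock and not the actual patch; `RecursiveLock`, `IterativeLock` and `measure` are hypothetical names. The assumption is that each release synchronously runs the next waiter, whose own release runs the one after, so nesting grows by one level per waiter. Draining the queue in a loop keeps the depth constant.

```python
from collections import deque


class RecursiveLock:
    """Toy lock: releasing runs the next waiter from inside release."""

    def __init__(self):
        self.held = False
        self.waiters = deque()
        self.depth = 0      # current nesting of _release() calls
        self.max_depth = 0  # deepest nesting observed

    def run(self, work):
        if self.held:
            self.waiters.append(work)  # queue behind the current holder
            return
        self.held = True
        work()
        self._release()

    def _release(self):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        self.held = False
        if self.waiters:
            self.held = True
            self.waiters.popleft()()
            self._release()  # one level deeper for every queued waiter
        self.depth -= 1


class IterativeLock(RecursiveLock):
    """Same interface, but waiters are drained in a loop."""

    def _release(self):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        while self.waiters:
            self.waiters.popleft()()  # lock stays held while draining
        self.held = False
        self.depth -= 1


def measure(lock_cls, n_waiters):
    lock = lock_cls()
    done = []

    def first():
        # While the first holder runs, n_waiters callers queue up behind it.
        for i in range(n_waiters):
            lock.run(lambda i=i: done.append(i))
        done.append("first")

    lock.run(first)
    return lock.max_depth, len(done)
```

In this toy, `measure(RecursiveLock, 50)` nests 51 levels deep while `measure(IterativeLock, 50)` stays at depth 1, and both run all 51 pieces of work. In a real Deferred callback chain each level presumably costs several stack frames, which would make the limit reachable with far fewer waiters than the interpreter's nominal recursion limit.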
fixed by #2532, hopefully
Just had a situation with thousands of stuck /sync processes on 3 out of 4 synchrotrons due to something bad happening in #riot:matrix.org's state. Looks like it might be whilst under relatively heavy load?
It starts off with a normal /sync request (taken from today's synchrotron3.log):
(why do we see this logged multiple times for the same GET?)
Then, it tries to resolve state for #riot:matrix.org:
...at which point any other client /syncing on #riot:matrix.org gets stuck behind the same state_resolve_lock:
etc.
Eventually the original request releases the lock and everyone else catches up:
until one suddenly gets wedged after acquiring the lock:
...and nobody ever manages to acquire it again. None of the /sync requests ever return, and so we leak FDs and RAM until being restarted, and the clients infinispin.