
Synchrotrons can deadlock on state_resolve_lock whilst calculating /sync #2505

Closed
ara4n opened this issue Oct 5, 2017 · 9 comments


ara4n commented Oct 5, 2017

Just had a situation with thousands of stuck /sync requests on 3 out of 4 synchrotrons, due to something bad happening in #riot:matrix.org's state. It looks like it may only happen whilst under relatively heavy load?

It starts off with a normal /sync request (taken from today's synchrotron3.log):

2017-10-05 16:09:34,222 - synapse.access.http.8088 - 59 - INFO - GET-5196546 - 2a00:23c4:4988:ca00:74b4:d8de:bbe7:66ff - 8088 - Received request: GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s288049908_301254689_217064_54609738_15239112_147463_1860637_2358024&access_token=<redacted>
2017-10-05 16:09:34,223 - synapse.handlers.sync - 547 - INFO - GET-5196546 - Calculating sync response for DomainSpecificString(localpart=u'ptyx', domain=u'matrix.org')
...
2017-10-05 16:09:37,150 - synapse.handlers.sync - 547 - INFO - GET-5196546 - Calculating sync response for DomainSpecificString(localpart=u'ptyx', domain=u'matrix.org')
...
2017-10-05 16:09:38,126 - synapse.handlers.sync - 547 - INFO - GET-5196546 - Calculating sync response for DomainSpecificString(localpart=u'ptyx', domain=u'matrix.org')

(why do we see this logged multiple times for the same GET?)

Then, it tries to resolve state for #riot:matrix.org:

2017-10-05 16:09:40,166 - synapse.util.async - 206 - INFO - GET-5196546 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:40,166 - synapse.state - 320 - INFO - GET-5196546 - Resolving state for !DgvjtOljKujDBrxyHk:matrix.org with 2 groups
2017-10-05 16:09:40,194 - synapse.state - 335 - INFO - GET-5196546 - Resolving conflicted state for u'!DgvjtOljKujDBrxyHk:matrix.org'
2017-10-05 16:09:40,205 - synapse.state - 495 - INFO - GET-5196546 - Asking for 32 conflicted events

...at which point any other client /syncing on #riot:matrix.org gets stuck behind the same state_resolve_lock:

2017-10-05 16:09:40,219 - synapse.util.async - 198 - INFO - GET-5196414 - Waiting to acquire linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:40,231 - synapse.util.async - 198 - INFO - GET-5196518 - Waiting to acquire linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:40,243 - synapse.util.async - 198 - INFO - GET-5196672 - Waiting to acquire linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:40,254 - synapse.util.async - 198 - INFO - GET-5196396 - Waiting to acquire linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:40,266 - synapse.util.async - 198 - INFO - GET-5196951 - Waiting to acquire linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])

etc.
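
To make the queueing behaviour concrete, here is a minimal sketch of how a per-key linearizer of this shape works (a simplified illustration, not Synapse's actual synapse.util.async.Linearizer, and ignoring log contexts): each caller parks itself behind the previous caller's deferred for the same key, so every /sync that needs the same pair of state groups queues behind whoever is currently resolving them.

from contextlib import contextmanager
from twisted.internet import defer

class MiniLinearizer(object):
    """Simplified per-key lock: callers queueing on the same key run one at a time."""

    def __init__(self):
        self.key_to_defer = {}  # key -> Deferred fired when the current holder releases

    @defer.inlineCallbacks
    def queue(self, key):
        prev = self.key_to_defer.get(key)
        release_d = defer.Deferred()
        self.key_to_defer[key] = release_d

        if prev is not None:
            yield prev  # "Waiting to acquire linearizer lock ..."

        # "Acquired linearizer lock ..."
        @contextmanager
        def ctx():
            try:
                yield
            finally:
                # "Releasing linearizer lock ...": wake the next waiter, if any
                release_d.callback(None)

        defer.returnValue(ctx())

# usage inside an inlineCallbacks function:
#     with (yield linearizer.queue(frozenset([84420041, 84417790]))):
#         ... resolve state ...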

Eventually the original request releases the lock and everyone else catches up:

2017-10-05 16:09:40,979 - synapse.state - 506 - INFO - GET-5196546 - Asking for 3 auth events
2017-10-05 16:09:41,001 - synapse.util.async - 213 - INFO - GET-5196546 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,001 - synapse.util.async - 206 - INFO - GET-5196414 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,001 - synapse.util.async - 213 - INFO - GET-5196414 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,001 - synapse.util.async - 206 - INFO - GET-5196518 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,001 - synapse.util.async - 213 - INFO - GET-5196518 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,002 - synapse.util.async - 206 - INFO - GET-5196672 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,002 - synapse.util.async - 213 - INFO - GET-5196672 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])

until one suddenly gets wedged after acquiring the lock:

2017-10-05 16:09:41,011 - synapse.util.async - 206 - INFO - GET-5196351 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,011 - synapse.util.async - 213 - INFO - GET-5196351 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,011 - synapse.util.async - 206 - INFO - GET-5196407 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,011 - synapse.util.async - 213 - INFO - GET-5196407 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,011 - synapse.util.async - 206 - INFO - GET-5196251 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,041 - synapse.util.metrics - 101 - WARNING -  - Context has unexpectedly changed from 'GET-5196351@7fae4c810600' to 'sentinel'. ('load_filtered_recents')
2017-10-05 16:09:41,057 - synapse.util.metrics - 101 - WARNING -  - Context has unexpectedly changed from 'GET-5196827@7fae4b60be50' to 'sentinel'. ('load_filtered_recents')
2017-10-05 16:09:41,073 - synapse.util.metrics - 101 - WARNING -  - Context has unexpectedly changed from 'GET-5196477@7fadca02a9f0' to 'sentinel'. ('load_filtered_recents')
2017-10-05 16:09:41,090 - synapse.util.metrics - 101 - WARNING -  - Context has unexpectedly changed from 'GET-5196714@7fad34462670' to 'sentinel'. ('load_filtered_recents')
2017-10-05 16:09:41,106 - synapse.util.metrics - 101 - WARNING -  - Context has unexpectedly changed from 'GET-5196413@7fad6400e590' to 'sentinel'. ('load_filtered_recents')

...and nobody ever manages to acquire it again. None of the /sync requests ever return, so we leak FDs and RAM until the synchrotron is restarted, and the clients spin forever.


ara4n commented Oct 5, 2017

Based on timestamps, the event which triggered the state resolution that blew everything up seems to be:

{
  "origin_server_ts": 1507219766326,
  "sender": "@voyager:t2bot.io",
  "event_id": "$15072197664496JPsoP:t2bot.io",
  "unsigned": {
    "prev_content": {
      "membership": "join",
      "avatar_url": "mxc://t2bot.io/hcSELkhLCNMRxLLTXKffPPSn",
      "displayname": "Matrix Traveler"
    },
    "prev_sender": "@voyager:t2bot.io",
    "replaces_state": "$1490057728164ohjjO:t2bot.io",
    "age": 2588788
  },
  "state_key": "@voyager:t2bot.io",
  "content": {
    "membership": "join",
    "avatar_url": "mxc://t2bot.io/hcSELkhLCNMRxLLTXKffPPSn",
    "displayname": "Matrix Traveler (bot)"
  },
  "membership": "join",
  "type": "m.room.member",
  "room_id": "!DgvjtOljKujDBrxyHk:matrix.org"
}


ara4n commented Oct 5, 2017

(possibly similar to #1981, although the logs mentioned there don't seem to show a permanent deadlock)


ara4n commented Oct 5, 2017

Ah, right - yup, it's the same. The deadlock looks like it happens after a request blows up with a maximum-recursion error, presumably whilst holding the lock => game over.

2017-10-05 16:09:58,433 - twisted - 131 - CRITICAL -  - Unhandled error in Deferred:
2017-10-05 16:09:58,433 - twisted - 131 - CRITICAL -  - 
Traceback (most recent call last):
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1030, in _cbDeferred
    self.errback(failure.Failure(FirstError(result, index)))
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 500, in errback
    self._startRunCallbacks(fail)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 567, in _startRunCallbacks
    self._runCallbacks()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 666, in _runCallbacks
    current.result = failure.Failure(captureVars=self.debug)
RuntimeError: maximum recursion depth exceeded
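
This traceback is consistent with a long chain of Deferreds firing each other synchronously: each callback fires the next Deferred inside its own frame, so the stack grows with the length of the chain until Python's recursion limit trips. A standalone reproduction sketch (not taken from Synapse):

from twisted.internet import defer

# Build a chain in which each Deferred's callback synchronously fires the next.
chain = [defer.Deferred() for _ in range(5000)]
for i in range(len(chain) - 1):
    chain[i].addCallback(lambda _, nxt=chain[i + 1]: nxt.callback(None))

# Firing the first one walks the whole chain in a single, ever-deepening stack,
# hits "maximum recursion depth exceeded" partway down, and the Deferreds past
# that point never fire.
chain[0].callback(None)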


ara4n commented Oct 5, 2017

Another weird aspect is that some of the lost requests don't appear to get stuck acquiring the lock, but rather after having released it:

2017-10-05 16:09:33,091 - synapse.access.http.8088 - 59 - INFO - GET-5196407 - ::ffff:185.159.157.11 - 8088 - Received request: GET /_matrix/client/r0/sync?filter=6&timeout=30000&since=s288049887_301254688_217060_54609726_15239107_147463_1860637_2358024&access_token=<redacted>
2017-10-05 16:09:33,092 - synapse.handlers.sync - 547 - INFO - GET-5196407 - Calculating sync response for DomainSpecificString(localpart=u'anonyx', domain=u'matrix.org')
2017-10-05 16:09:38,076 - synapse.handlers.sync - 547 - INFO - GET-5196407 - Calculating sync response for DomainSpecificString(localpart=u'anonyx', domain=u'matrix.org')
2017-10-05 16:09:40,709 - synapse.util.async - 198 - INFO - GET-5196407 - Waiting to acquire linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,011 - synapse.util.async - 206 - INFO - GET-5196407 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])
2017-10-05 16:09:41,011 - synapse.util.async - 213 - INFO - GET-5196407 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84420041L, 84417790L])

I can't see enough detail in the logs (especially with the broken log contexts) to understand what's gone wrong, so the best bet here is probably to make Twisted actually log where we run out of stack (unless this is caused by recursive deferreds or a very long chain of deferreds).
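
One low-effort knob (not necessarily what was deployed here) is Twisted's built-in deferred debugging, which records where each Deferred was created and fired and includes that history when it reports an unhandled error, at a significant CPU cost; bumping the recursion limit can also buy some headroom as a stopgap:

import sys
from twisted.internet import defer

defer.setDebugging(True)     # attach creation/invocation tracebacks to Deferred errors
sys.setrecursionlimit(5000)  # stopgap: more stack headroom before the chain blows up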


ara4n commented Oct 5, 2017

I've captured the logs for posterity at ~/bug-2505.log on matrix@hera


richvdh commented Oct 6, 2017

It's possible that one reason we don't get a useful stacktrace from that 'maximum recursion depth exceeded' is https://twistedmatrix.com/trac/ticket/9301.


ara4n commented Oct 6, 2017

It just happened again; captured logs at ~/bug-2505-2.log. Some observations:

2017-10-06 14:38:47,257 - synapse.util.async - 213 - INFO - GET-2174893 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84658537L, 84658541L])
2017-10-06 14:38:47,257 - synapse.util.async - 206 - INFO - GET-2174862 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84658537L, 84658541L])
2017-10-06 14:38:47,257 - synapse.util.async - 213 - INFO - GET-2174862 - Releasing linearizer lock 'state_resolve_lock' for key frozenset([84658537L, 84658541L])
2017-10-06 14:38:47,257 - synapse.util.async - 206 - INFO - GET-2174933 - Acquired linearizer lock 'state_resolve_lock' for key frozenset([84658537L, 84658541L])
2017-10-06 14:38:49,151 - synapse.util.async - 198 - INFO - GET-2174854 - Waiting to acquire linearizer lock 'state_resolve_lock' for key frozenset([84658537L, 84658541L])
  • the first request there completes
  • the second one never completes, despite releasing the lock - so I think it's the one which hits the maximum-recursion error
  • the third one never completes, nor releases the lock
  • the fourth one never completes, nor acquires the lock


richvdh commented Oct 10, 2017

I've been working on some patches to synapse and twisted which might help track down the stack overflow. Now deployed on matrix.org; let's see if they help.

richvdh added a commit that referenced this issue Oct 11, 2017
1. make it not blow out the stack when there are more than 50 things waiting
   for a lock. Fixes #2505.

2. Make it not mess up the log contexts.
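
For context, one general way to stop a long waiter queue from blowing the stack (sketched here against the MiniLinearizer above, with a hypothetical hand_off helper, and not necessarily how #2532 actually does it) is to hand the lock to the next waiter via the reactor rather than synchronously, so each waiter resumes on a fresh stack instead of nesting inside the previous holder's release:

from twisted.internet import reactor

def hand_off(release_d):
    """Wake the next waiter from the reactor, not from the current frame.

    Firing release_d.callback(None) directly resumes the next waiter (and, if
    it releases immediately, the one after that, and so on) inside this frame,
    which is what grows the stack; going via callLater(0) gives each hand-off
    a fresh stack.
    """
    reactor.callLater(0, release_d.callback, None)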

richvdh commented Oct 12, 2017

fixed by #2532, hopefully

richvdh closed this as completed Oct 12, 2017