This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

"Fatal Python error: Cannot recover from stack overflow" from logger #4202

Closed
superboum opened this issue Nov 19, 2018 · 14 comments · Fixed by #8268

Comments

@superboum
Contributor

superboum commented Nov 19, 2018

Description

The server crashes and stops with error Fatal Python error: Cannot recover from stack overflow.
The full stacktrace can be found here: matrix-stackoverflow.log

Steps to reproduce

The only thing I can suggest that might help reproduce the bug is to edit log.yaml and set every log level to INFO or even DEBUG.

I expect my Matrix server not to crash.

Version information

  • Homeserver: my personal homeserver (im.deuxfleurs.fr)
  • Version: v0.33.9rc1
  • Install method: Homemade Docker (with Python 3 + downloading synapse archive from Github)
  • Platform: Running in Docker, on a Debian Buster host, on a Scaleway VPS
@neilisfragile
Contributor

I can't reproduce this error - can you try against Python 3.6? It may be version-dependent.

@neilisfragile
Contributor

related to #4191?

@kiorky

kiorky commented Nov 23, 2018

@neilisfragile seems a dup; I'll try to switch our Docker image to py3.6, which should then reproduce your error exactly.

kiorky added a commit to corpusops/docker-matrix that referenced this issue Nov 23, 2018
@superboum
Contributor Author

superboum commented Nov 26, 2018

I encountered this error during a period of instability with Matrix, with some strange encoding-related errors and error loops (an error handler that failed, triggering another error handler that failed, and so on).
However, I was not able to clearly identify the problem at the time, and even now I struggle to understand why Matrix went wild. It seems that setting the log verbosity to ERROR mitigated the problem.

I am a bit lost, as some stacktraces seem normal, such as this one:

2018-11-26 08:53:36,253 - twisted - 244 - CRITICAL - - Unexpected exception from twisted.web.client.FileBodyProducer.stopProducing
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/_newclient.py", line 195, in _callAppFunction
    function()
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/client.py", line 1087, in stopProducing
    self._task.stop()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 497, in stop
    self._checkFinish()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 507, in _checkFinish
    raise self._completionState
twisted.internet.task.TaskStopped

In my case, I suppose that the error reported here was triggered by the following error, which also seems to be reported in #4191:

2018-11-26 09:04:47,977 - synapse.state.v1 - 196 - ERROR - PUT-260673-$1543223055125NpvOU:encom.eu.org-$2458-4-4-6T06252.913452|Marina's_Silly_Donkey_Sits_Snowpain's_Good_Newt:ponies.im- Failed to resolve state
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 364, in _enqueue_events
    for e in res if e
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: {"$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im": _EventCacheEntry(event=<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>, redacted_event=None)}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 197, in _get_events
    defer.returnValue(events)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: [<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>]

During handling of the above exception, another exception occurred:

[...]

  File "/usr/local/lib/python3.5/dist-packages/synapse/state/v1.py", line 301, in key_func
    return -int(e.depth), hashlib.sha1(e.event_id.encode('ascii')).hexdigest()
builtins.UnicodeEncodeError: 'ascii' codec can't encode character '\u205f' in position 42: ordinal not in range(128)

And a log verbosity greater than ERROR. But this is only speculation.
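The failing expression in that last frame is easy to reproduce in isolation. The sketch below uses a made-up event ID containing U+205F (the character from the traceback); encoding it as ASCII raises the same UnicodeEncodeError, while encoding as UTF-8 succeeds (shown only as an illustration, not as the actual fix adopted in #4226):

```python
import hashlib

# Hypothetical event ID containing U+205F (MEDIUM MATHEMATICAL SPACE),
# like the ones in the traceback above.
event_id = "$2458|Queen\u205fChrysalis:ponies.im"

# This mirrors the failing expression in synapse/state/v1.py:
try:
    hashlib.sha1(event_id.encode('ascii')).hexdigest()
except UnicodeEncodeError as exc:
    print("same failure as above:", exc)

# Encoding as UTF-8 instead succeeds:
digest = hashlib.sha1(event_id.encode('utf-8')).hexdigest()
print(digest)  # a 40-character hex digest
```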

But the stacktrace is huge, and I struggle to find the useful information in it.
Here are the current logs of my server, which look a lot like #4191 but also seem a bit different (maybe due to Python 3?).
stderr.txt

Sorry for this not-so-useful bug report; I think you could close it in favour of #4191, even though I suspect that this error can cause many different stacktraces and even crash the whole server.
Thanks for the help, and if you want me to repost some elements from the last logfile I attached in #4191, just ask.

@kiorky

kiorky commented Nov 28, 2018

For us the current workaround was to filter at CRITICAL; ERROR was not sufficient.

@richvdh
Member

richvdh commented Nov 29, 2018

2018-11-26 08:53:36,253 - twisted - 244 - CRITICAL - - Unexpected exception from twisted.web.client.FileBodyProducer.stopProducing
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/_newclient.py", line 195, in _callAppFunction
    function()
  File "/usr/local/lib/python3.5/dist-packages/twisted/web/client.py", line 1087, in stopProducing
    self._task.stop()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 497, in stop
    self._checkFinish()
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/task.py", line 507, in _checkFinish
    raise self._completionState
twisted.internet.task.TaskStopped

This is #4003.

2018-11-26 09:04:47,977 - synapse.state.v1 - 196 - ERROR - PUT-260673-$1543223055125NpvOU:encom.eu.org-$2458-4-4-6T06252.913452|Marina's_Silly_Donkey_Sits_Snowpain's_Good_Newt:ponies.im- Failed to resolve state
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 364, in _enqueue_events
    for e in res if e
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: {"$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im": _EventCacheEntry(event=<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>, redacted_event=None)}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.5/dist-packages/synapse/storage/events_worker.py", line 197, in _get_events
    defer.returnValue(events)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: [<FrozenEvent event_id='$2458-4-4-1T34409.752380|Queen Chrysalis's\u205fUpset\u205fKangaroo\u205fSteals\u205fCranky Doodle Ryan\u2019s\u205fFantastic\u205fLeopard:ponies.im', type='m.room.member', state_key='@/dev/ponies:ponies.im'>]

During handling of the above exception, another exception occurred:

[...]

  File "/usr/local/lib/python3.5/dist-packages/synapse/state/v1.py", line 301, in key_func
    return -int(e.depth), hashlib.sha1(e.event_id.encode('ascii')).hexdigest()
builtins.UnicodeEncodeError: 'ascii' codec can't encode character '\u205f' in position 42: ordinal not in range(128)

This is #4226, and not related to #4191 afaict.

@richvdh
Member

richvdh commented Nov 29, 2018

A couple of things which might be worth trying here:

Let me know if either of the above help.

@superboum
Contributor Author

superboum commented Nov 29, 2018

I have already set the encoding for my log file.

log.yaml

version: 1

formatters:
  precise:
    format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(request)s- %(message)s'

filters:
  context:
    (): synapse.util.logcontext.LoggingContextFilter
    request: ""

handlers:
  file:
    class: logging.handlers.RotatingFileHandler
    formatter: precise
    filename: /var/log/matrix-synapse/homeserver.log
    maxBytes: 104857600
    backupCount: 10
    filters: [context]
    level: ERROR
    encoding: utf8
  console:
    class: logging.StreamHandler
    formatter: precise
    level: ERROR

loggers:
  synapse:
    level: ERROR

  synapse.storage.SQL:
    level: ERROR

  ldap3:
    level: ERROR
  ldap_auth_provider:
    level: ERROR

root:
  level: ERROR
  handlers: [file, console]

I was running synapse with /usr/bin/python3 -m synapse.app.homeserver -c /etc/matrix-synapse/homeserver.yaml; I have changed it to /usr/bin/python3 -m synapse.app.homeserver -n -c /etc/matrix-synapse/homeserver.yaml as suggested.

If I still encounter bugs, I will try switching from ERROR to CRITICAL in log.yaml.
Anyway, I will update this comment after a while to report how synapse behaves.

@richvdh
Member

richvdh commented Nov 29, 2018

for what it's worth, I think most of those level settings are irrelevant, because they will be inherited from the root logger.
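That inheritance behaviour is standard Python logging: a logger whose level is left at NOTSET walks up its ancestors until it finds one with an explicit level, so the per-logger ERROR settings are redundant once root is already at ERROR. A minimal illustration:

```python
import logging

# Give the root logger an explicit level, as the root: section of
# log.yaml does.
logging.getLogger().setLevel(logging.ERROR)

# A child logger with no level of its own inherits root's level:
child = logging.getLogger("synapse.storage.SQL")
print(child.level)                # 0  (NOTSET: nothing set explicitly)
print(child.getEffectiveLevel())  # 40 (logging.ERROR, inherited)
```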

@julianuu

I get the same log as here

The server crashes and stops with error Fatal Python error: Cannot recover from stack overflow.
The full stacktrace can be found here: matrix-stackoverflow.log

Not sure if this is of any use, but for me it was enough to increase the level to ERROR in the file handler and disable the level in root to make synapse run again:

handlers:
    file:
        class: logging.handlers.RotatingFileHandler
        formatter: precise
        filename: /var/log/synapse/homeserver.log
        maxBytes: 104857600
        backupCount: 10
        filters: [context]
        encoding: utf8
        level: ERROR
    console:
        class: logging.StreamHandler
        formatter: precise
        filters: [context]
        level: INFO

loggers:
    synapse:
        level: INFO

    synapse.storage.SQL:
        # beware: increasing this to DEBUG will make synapse log sensitive
        # information such as access tokens.
        level: INFO

root:
    #level: ERROR
    handlers: [file, console]

@julianuu

julianuu commented Feb 1, 2020

OK, no: it just happened again.

@superboum
Contributor Author

Since the creation of this bug report, I have updated matrix many times and have not encountered this bug anymore. Feel free to close this issue.

@richvdh richvdh changed the title Python 3 and v0.33.9rc1 - Fatal Python error: Cannot recover from stack overflow. "Fatal Python error: Cannot recover from stack overflow" from logger Feb 3, 2020
@richvdh
Member

richvdh commented Feb 3, 2020

so the basis of the problem is:

  • if the log subsystem encounters an error when writing a log line, it attempts to write an error message to stderr
  • synapse redirects stderr to the twisted logger (at level ERROR) unless --no-redirect-stdio (aka -n) is set on the commandline
  • so the error message is sent back through the log subsystem
  • boom

So there are two problems here. One is that stderr is piped back through the log subsystem, which would be fixed by something like #1539 (comment); the other is that there was an initial error which set off the cascade.
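The feedback loop can be sketched in plain Python logging, independent of synapse or twisted. `BrokenStream` and `LoggerWriter` below are made-up stand-ins: the first simulates whatever caused the initial logging error, the second simulates the stderr-to-logger redirection. A call counter stands in for the real stack overflow so the sketch terminates:

```python
import io
import logging
import sys

class BrokenStream(io.TextIOBase):
    """A log destination whose writes always fail, standing in for the
    initial error (encoding problem, full disk, ...)."""
    def write(self, msg):
        raise ValueError("simulated failure writing log line")

class LoggerWriter(io.TextIOBase):
    """Crude stand-in for the stdio redirection: anything written to
    this fake stderr is fed back into the logging subsystem."""
    def __init__(self, limit=20):
        self.calls = 0
        self.limit = limit
    def write(self, msg):
        self.calls += 1
        if self.calls > self.limit:
            # Guard replacing the real-life "Cannot recover from
            # stack overflow" crash.
            raise RuntimeError("runaway logging loop detected")
        logging.getLogger("demo").error("stderr said: %s", msg.rstrip())
        return len(msg)

logger = logging.getLogger("demo")
logger.addHandler(logging.StreamHandler(BrokenStream()))
logger.propagate = False

fake_stderr = LoggerWriter()
old_stderr, sys.stderr = sys.stderr, fake_stderr
try:
    # emit() fails -> handleError() writes to "stderr" -> which logs
    # -> emit() fails again -> ... boom.
    logger.error("first message")
except RuntimeError:
    pass
finally:
    sys.stderr = old_stderr

print("the logger was re-entered", fake_stderr.calls, "times")
```

Running synapse with `-n` breaks the cycle at the `LoggerWriter` step, which is why it is the suggested diagnostic below.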

Not sure if this is of any use, but for me it was enough to increase the level to ERROR in the file handler and disable the level in root to make synapse run again

This sort of thing can reduce the amount of logging that is written, thus reducing the likelihood of hitting an error in the log subsystem; but as you have discovered, it is not a complete solution.

It is tricky to figure out what is causing the initial error, beyond obvious things like /var/log/synapse running out of disk space. The easiest approach is probably:

  • make sure synapse is configured not to daemonize itself (ie, don't use --daemonize, or if you're using synctl, set --no-daemonize).
  • set --no-redirect-stdio on the commandline
  • remove console from the list of handlers under the root logger, to reduce the amount of stuff going to stderr
  • pipe stderr to somewhere useful
  • leave it running
  • see what comes up in stderr

@richvdh
Member

richvdh commented Sep 7, 2020

closing as a dup of #4240

@richvdh richvdh closed this as completed Sep 7, 2020