Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Frequent crashes and spikes in RAM use, several critical errors in logs #3188

Closed
JoKeyser opened this issue May 3, 2018 · 23 comments
Closed

Comments

@JoKeyser
Copy link

JoKeyser commented May 3, 2018

Description

Since a few days, I notice frequent disconnects to my homeserver; mainly on Riot Android, but also some on Riot desktop.

In Synapse's logs, I find several errors that may hint at the problem(s), see below. Sorry that it's a fairly long dump, but I can't judge what parts are useful. Let me know if I can check/add more.

2018-05-03 23:02:00,056 - twisted - 131 - CRITICAL - - Unexpected exception from synapse.http.client.FileBodyProducer.stopProducing
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/policies.py", line 125, in connectionLost
    self.wrappedProtocol.connectionLost(reason)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 964, in dispatcher
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1634, in _connectionLost_TRANSMITTING
    self._currentRequest.stopWriting()
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 865, in stopWriting
    _callAppFunction(self.bodyProducer.stopProducing)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 194, in _callAppFunction
    function()
  File "/usr/lib/python2.7/dist-packages/twisted/web/client.py", line 1086, in stopProducing
    self._task.stop()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 497, in stop
    self._checkFinish()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 507, in _checkFinish
    raise self._completionState
twisted.internet.task.TaskStopped: 

There are a bunch of ones like this:

2018-05-03 23:29:58,103 - synapse.http.server - 191 - ERROR - GET-638- Failed handle request synapse.rest.media.v1.download_resource._async_render_GET on <synapse.rest.media.v1.download_resource.DownloadResource instance at 0x7faf9a3e1d88>: <SynapseRequest at 0x7faf0ecca710 method=GET uri=/_matrix/media/v1/download/matrix.org/jdVtvZBLTBLNCdnSXfkfOibx clientproto=HTTP/1.1 site=8008>: Traceback (most recent call last):
Failure: exceptions.Exception: Consumer asked us to stop producing
2018-05-03 23:29:58,107 - twisted - 131 - CRITICAL - GET-638- Unhandled Error
Traceback (most recent call last):
Failure: exceptions.RuntimeError: Producer was not unregistered for /_matrix/media/v1/download/matrix.org/jdVtvZBLTBLNCdnSXfkfOibx
2018-05-03 23:50:30,669 - synapse.http.server - 191 - ERROR - GET-148- Failed handle request synapse.rest.media.v1.thumbnail_resource._async_render_GET on <synapse.rest.media.v1.thumbnail_resource.ThumbnailResource instance at 0x7fc773a48a28>: <SynapseRequest at 0x7fc73817d200 method=GET uri=/_matrix/media/v1/thumbnail/matrix.org/YqvlPbTsQhGREHeVRXWsHqgL?width=36&height=36&method=crop clientproto=HTTP/1.1 site=8008>: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1442, in gotResult
    _inlineCallbacks(r, g, deferred)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/synapse/http/server.py", line 165, in wrapped_request_handler
    yield request_handler(self, request)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/thumbnail_resource.py", line 80, in _async_render_GET
    width, height, method, m_type
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/thumbnail_resource.py", line 233, in _respond_remote_thumbnail
    media_info = yield self.media_repo.get_remote_media_info(server_name, media_id)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/media_repository.py", line 273, in get_remote_media_info
    server_name, media_id,
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/media_repository.py", line 323, in _get_remote_me
    server_name, media_id, file_id
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/media_repository.py", line 429, in _download_remo
    filesystem_id=file_id,
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 424, in _simple_insert
    self._simple_insert_txn, table, values,
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 320, in runInteraction
    result = yield self.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 365, in runWithConnection
    inner_func, *args, **kwargs
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 307, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 298, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 361, in inner_func
    return func(conn, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 316, in inner_func
    func, *args, **kwargs
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 241, in _new_transaction
    r = func(txn, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 444, in _simple_insert_txn
    txn.execute(sql, vals)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 82, in execute
    self._do_execute(self.txn.execute, sql, *args)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 112, in _do_execute
    sql, *args
2018-05-04 00:14:07,385 - synapse.federation.federation_server - 620 - ERROR - PUT-1236- Failed to handle edu 'm.receipt'
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/federation/federation_server.py", line 616, in on_edu
    yield handler(origin, content)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/handlers/receipts.py", line 84, in _received_remote_receipt
    yield self._handle_new_receipts(receipts)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/handlers/receipts.py", line 101, in _handle_new_receipts
    room_id, receipt_type, user_id, event_ids, data
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/receipts.py", line 456, in insert_receipt
    stream_id=stream_id,
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 323, in runInteraction
    after_callback(*after_args, **after_kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/receipts.py", line 308, in _invalidate_get_users_with_receipts_in_room
    if user_id in res:
TypeError: argument of type 'ObservableDeferred' is not iterable
2018-05-04 00:19:13,429 - synapse.federation.transport.server - 188 - ERROR - PUT-1913- authenticate_request failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/federation/transport/server.py", line 181, in new_func
    origin = yield authenticator.authenticate_request(request, content)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/federation/transport/server.py", line 145, in authenticate_request
    yield self.keyring.verify_json_for_server(origin, json_request)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/dist-packages/synapse/crypto/keyring.py", line 768, in _handle_key_deferred
    Codes.UNAUTHORIZED,
SynapseError: 401: Invalid signature for server MY_DOMAIN with key ed25519:a_RBEL
2018-05-04 00:23:36,277 - twisted - 131 - CRITICAL - - Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 103, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 86, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
    why = selectable.doWrite()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1168, in doWrite
    "doWrite called on a %s" % reflect.qual(self.__class__))
exceptions.RuntimeError: doWrite called on a twisted.internet.udp.Port

Steps to reproduce

Running Synapse... :/.

Version information

If not matrix.org:

  • Version: Synapse/0.27.2 and 0.28.1.
  • Install method: Debian's apt.
  • Platform: Debian buster, amd64, on dedicated hardware with 4 cores and 4 GiB, http://www.pcengines.ch/apu2c4.htm.
@neilisfragile
Copy link
Contributor

Thanks for the report - as a first step can you upgrade to 0.28.1 and confirm that the problem persists?

@JoKeyser
Copy link
Author

@neilisfragile thanks, will do.

@JoKeyser
Copy link
Author

JoKeyser commented May 15, 2018

@neilisfragile Since the upgrade to 0.28.1, things have indeed improved significantly. Still, I'm getting frequent connection problems (Riot complains "Connection to server has been lost"). However it's at least usable again now.

When I grep for critical errors, I still get 3 kinds in the last 2 days:

2018-05-15 09:52:44,310 - twisted - 131 - CRITICAL - GET-24- Unhandled Error
Traceback (most recent call last):
Failure: exceptions.RuntimeError: Producer was not unregistered for /_matrix/media/v1/download/matrix.org/HfvPXmCyHffdceDsODrunDyB

[...intermittend other WARNINGs...]

--
    server_name, key_ids
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/dist-packages/synapse/crypto/keyring.py", line 665, in get_server_verify_key_v1_direct
    raise KeyLookupError("TLS certificate doesn't match")
KeyLookupError: TLS certificate doesn't match
2018-05-15 14:49:08,575 - synapse.federation.transport.server - 189 - ERROR - PUT-2819- authenticate_request failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/federation/transport/server.py", line 182, in new_func
    origin = yield authenticator.authenticate_request(request, content)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/federation/transport/server.py", line 146, in authenticate_request
    yield self.keyring.verify_json_for_server(origin, json_request)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/crypto/keyring.py", line 752, in _handle_key_deferred
    Codes.UNAUTHORIZED,
SynapseError: 401: No key for riot.waytt.cf with id ['ed25519:a_SWPL']
2018-05-15 15:34:16,944 - twisted - 131 - CRITICAL - - Unhandled error in Deferred:
2018-05-15 15:34:16,946 - twisted - 131 - CRITICAL - - 
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 204, in loop
    time_now - time_then, limit=3
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 140, in interval
    for name, (count, cum_time) in self.current_counters.iteritems():
RuntimeError: dictionary changed size during iteration
2018-05-15 18:43:47,672 - twisted - 131 - CRITICAL - - Unexpected exception from synapse.http.client.FileBodyProducer.stopProducing
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/policies.py", line 125, in connectionLost
    self.wrappedProtocol.connectionLost(reason)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 964, in dispatcher
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1634, in _connectionLost_TRANSMITTING
    self._currentRequest.stopWriting()
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 865, in stopWriting
    _callAppFunction(self.bodyProducer.stopProducing)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 194, in _callAppFunction
    function()
  File "/usr/lib/python2.7/dist-packages/twisted/web/client.py", line 1086, in stopProducing
    self._task.stop()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 497, in stop
    self._checkFinish()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/task.py", line 507, in _checkFinish
    raise self._completionState
twisted.internet.task.TaskStopped: 

There are also several ERRORs, e.g. synapse.federation.federation_server - 171 - ERROR - PUT-16- Failed to handle PDU, or synapse.federation.transport.server - 189 - ERROR - PUT-169- authenticate_request failed, or synapse.crypto.keyring - 747 - ERROR - PUT-43- Got Exception when downloading keys for riot.waytt.cf: KeyLookupError TLS certificate doesn't match, ...
However I'm not sure whether I should dump the logs about those as well...? They seem fairly normal to me.

@JoKeyser
Copy link
Author

I forgot to say that synapse periodically eats up all my RAM (4 GiB) and then apparently crashes and gets restarted every few minutes. (I've watched it for a while with htop). Happy to test/log this better, if it helps.
I also should clarifiy that while the update has improved things, it's still pretty flaky, and nearly unusable.

@neilisfragile
Copy link
Contributor

I'm guessing that your errors are the symptom rather than the cause. Synapse is can be very memory hungry but your box spec should be fine (depending on exactly what you are doing with it).

Can you send me logs of the time leading up to a crash?

@JoKeyser
Copy link
Author

JoKeyser commented May 16, 2018

@neilisfragile thanks for reading and your input. Now that I watched it more carefully, I'm certain that the "connection problems" in the issue's title are due to Synapse crashing or spiking into >90% of RAM use.

The funny thing is that in the last crashes I've observed now (with htop and concurrent tail -f /var/log/matrix-synapse/homeserver.log), there are no errors except for a bunch of WARNINGs. I assume those can't really explain the crash.

My box doesn't do too much else. Without Synapse running, RAM use is at about 525 MB, and I set the SYNAPSE_CACHE_FACTOR at a moderate 0.5. I've now lowered it to 0.25, but RAM still spikes to 3.6 out of 3.8 GB and clients disconnect. My first impression is that crashes are less frequent now. To be fair; I've joined a bunch of large rooms on matrix.org.

One curious WARNING I just found is this, not sure what to make of it:

2018-05-17 00:25:57,998 - synapse.app.homeserver - 455 - WARNING - - Error reporting stats: 522: <!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<meta http-equiv="refresh" content="0">

<title>matrix.org | 522: Connection timed out</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css" media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>




</head>
<body>
<div id="cf-wrapper">

    

    <div id="cf-error-details" class="cf-error-details-wrapper">
        <div class="cf-wrapper cf-error-overview">
            <h1>
              
              <span class="cf-error-type">Error</span>
              <span class="cf-error-code">522</span>
              <small class="heading-ray-id">Ray ID: 41c142917aaf2c06 &bull; 2018-05-16 22:25:55 UTC</small>
            </h1>
            <h2 class="cf-subheadline">Connection timed out</h2>
        </div><!-- /.error-overview -->
        
        <div class="cf-section cf-highlight cf-status-display">
            <div class="cf-wrapper">
                <div class="cf-columns cols-3">
                  
<div id="cf-browser-status" class="cf-column cf-status-item cf-browser-status ">
  <div class="cf-icon-error-container">
    <i class="cf-icon cf-icon-browser"></i>
    <i class="cf-icon-status cf-icon-ok"></i>
  </div>
  <span class="cf-status-desc">You</span>
  <h3 class="cf-status-name">Browser</h3>
  <span class="cf-status-label">Working</span>
</div>

<div id="cf-cloudflare-status" class="cf-column cf-status-item cf-cloudflare-status ">
  <div class="cf-icon-error-container">
    <i class="cf-icon cf-icon-cloud"></i>
    <i class="cf-icon-status cf-icon-ok"></i>
  </div>
  <span class="cf-status-desc">Amsterdam</span>
  <h3 class="cf-status-name">Cloudflare</h3>
  <span class="cf-status-label">Working</span>
</div>

<div id="cf-host-status" class="cf-column cf-status-item cf-host-status cf-error-source">
  <div class="cf-icon-error-container">
    <i class="cf-icon cf-icon-server"></i>
    <i class="cf-icon-status cf-icon-error"></i>
  </div>
  <span class="cf-status-desc">matrix.org</span>
  <h3 class="cf-status-name">Host</h3>
  <span class="cf-status-label">Error</span>
</div>

                </div>
              
            </div>
        </div><!-- /.status-display -->

        <div class="cf-section cf-wrapper">
            <div class="cf-columns two">
                <div class="cf-column">
                    <h2>What happened?</h2>
                    <p>The initial connection between Cloudflare's network and the origin web server timed out. As a result, the web page can not be displayed.</p>
                </div>
              
                <div class="cf-column">
                    <h2>What can I do?</h2>
                          <h5>If you're a visitor of this website:</h5>
      <p>Please try again in a few minutes.</p>

      <h5>If you're the owner of this website:</h5>
      <p><span>Contact your hosting provider letting them know your web server is not completing requests. An Error 522 means that the request was able to connect to your web server, but that the request didn't finish. The most likely cause is that something on your server is hogging resources.</span> <a href="https://support.cloudflare.com/hc/en-us/articles/200171906-Error-522">Additional troubleshooting information here.</a></p>
                </div>
            </div>
              
        </div><!-- /.section -->

        <div class="cf-error-footer cf-wrapper">
  <p>
    <span class="cf-footer-item">Cloudflare Ray ID: <strong>41c142917aaf2c06</strong></span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Your IP</span>: 2001:984:e8e3:1:43ac:64b9:43b0:7da4</span>
    <span class="cf-footer-separator">&bull;</span>
    <span class="cf-footer-item"><span>Performance &amp; security by</span> <a href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link" target="_blank">Cloudflare</a></span>
    
  </p>
</div><!-- /.error-footer -->


    </div><!-- /#cf-error-details -->
</div><!-- /#cf-wrapper -->
</body>
</html>

@JoKeyser
Copy link
Author

JoKeyser commented May 16, 2018

Maybe to provide more context, sorry for not mentioning it yet: Synapse is running against postgres, and the client port behind apache2.

@JoKeyser JoKeyser changed the title Client connection issues, found several errors in logs, some critical Frequent crashes and spikes in RAM use, several critical errors in logs May 17, 2018
@JoKeyser
Copy link
Author

Changed the title to more closely reflect what's going on.

@neilisfragile
Copy link
Contributor

Okay thanks, let me take a closer look, I'll come back to you.

I agree the Cloud Flare error is odd - I'm assuming is it coming from matrix.org (we use CF for ddos protection). Though there is no reason to think it relates to your instability.

We have some performance improvements about to land rsn in response to resource constraints in running matrix.org - these will likely help the situation.

@JoKeyser
Copy link
Author

JoKeyser commented May 24, 2018

For the record, it got worse again: I can't connect to my Synapse anymore in any meaningful way, Riot is constantly showing that it's connectivity with the server has been lost.
There are no new critical errors that I haven't reported above (they just keep repeating).
There are some errors I haven't mentioned here yet; I doubt they explain the problem on their own, but maybe provide further clues:

2018-05-21 10:16:49,721 - synapse.storage.events - 195 - ERROR - None- IntegrityError, retrying.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/storage/events.py", line 193, in f
    res = yield func(self, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/events.py", line 445, in _persist_events
    new_forward_extremeties=new_forward_extremeties,
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 320, in runInteraction
    result = yield self.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 365, in runWithConnection
    inner_func, *args, **kwargs
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 307, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 298, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 361, in inner_func
    return func(conn, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 316, in inner_func
    func, *args, **kwargs
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 241, in _new_transaction
    r = func(txn, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/util/logutils.py", line 84, in wrapped
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/events.py", line 691, in _persist_events_txn
    events_and_contexts=events_and_contexts,
  File "/usr/lib/python2.7/dist-packages/synapse/storage/events.py", line 979, in _update_outliers_txn
    "state_group": state_group_id,
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 444, in _simple_insert_txn
    txn.execute(sql, vals)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 82, in execute
    self._do_execute(self.txn.execute, sql, *args)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 112, in _do_execute
    sql, *args
IntegrityError: duplicate key value violates unique constraint "ex_outlier_stream_pkey"
DETAIL:  Key (event_stream_ordering)=(640187) already exists.
2018-05-22 09:46:56,264 - synapse.storage.events - 195 - ERROR - None- IntegrityError, retrying.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/storage/events.py", line 193, in f
    res = yield func(self, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/events.py", line 445, in _persist_events
    new_forward_extremeties=new_forward_extremeties,
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 320, in runInteraction
    result = yield self.runWithConnection(inner_func, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 365, in runWithConnection
    inner_func, *args, **kwargs
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 307, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/usr/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py", line 298, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 361, in inner_func
    return func(conn, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 316, in inner_func
    func, *args, **kwargs
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 241, in _new_transaction
    r = func(txn, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/util/logutils.py", line 84, in wrapped
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/events.py", line 713, in _persist_events_txn
    self._store_event_state_mappings_txn(txn, events_and_contexts)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/state.py", line 797, in _store_event_state_mappings_txn
    for event_id, state_group_id in state_groups.iteritems()
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 484, in _simple_insert_many_txn
    txn.executemany(sql, vals)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 85, in executemany
    self._do_execute(self.txn.executemany, sql, *args)
  File "/usr/lib/python2.7/dist-packages/synapse/storage/_base.py", line 112, in _do_execute
    sql, *args
IntegrityError: null value in column "state_group" violates not-null constraint
DETAIL:  Failing row contains ($15269724413749517KpLDY:matrix.org, null).
018-05-23 23:53:48,393 - synapse.storage._base - 249 - WARNING - None- [TXN OPERROR] {_get_state_groups_from_groups-cbf} out of memory
DETAIL:  Failed on request of size 106.
2018-05-24 11:06:21,606 - synapse.http.server - 89 - ERROR - GET-599- Failed handle request via <function _async_render_GET at 0x7fb648111578>: <SynapseRequest at 0x7fb5b1c65cf8 method=GET uri=/_matrix/media/v1/thumbnail/matrix.trancendances.fr/OWzRqPgBMXSpxqktvuVXVRcl?width=14&height=14&method=crop clientproto=HTTP/1.1 site=8008>: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1442, in gotResult
    _inlineCallbacks(r, g, deferred)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/synapse/http/server.py", line 66, in wrapped_request_handler
    yield h(self, request)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/thumbnail_resource.py", line 82, in _async_render_GET
    width, height, method, m_type
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/thumbnail_resource.py", line 243, in _respond_remote_thumbnail
    width, height, method, m_type, thumbnail_infos
  File "/usr/lib/python2.7/dist-packages/synapse/rest/media/v1/thumbnail_resource.py", line 294, in _select_thumbnail
    return min(info_list2)[-1]
exceptions.ValueError: min() arg is an empty sequence
DETAIL:  Failing row contains ($15271243501925vkKvC:ajann.xyz, null).
2018-05-24 21:59:06,259 - synapse.federation.federation_server - 171 - ERROR - PUT-117- Failed to handle PDU $15271892817205zhJvr:disroot.org
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/federation/federation_server.py", line 163, in process_pdus_for_room
    transaction.origin, pdu
FirstError: FirstError[#0, [Failure instance: Traceback: <type 'exceptions.Exception'>: Context for new extremity event $1527187997498474PPuaY:matrix.org has no state group
2018-05-24 23:41:24,430 - synapse.federation.federation_server - 171 - ERROR - PUT-221- Failed to handle PDU $1527193467605750SdnUC:matrix.org
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/synapse/federation/federation_server.py", line 163, in process_pdus_for_room
    transaction.origin, pdu
ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]

@neilisfragile
Copy link
Contributor

Hi, thanks for the update. My hunch is that the errors are symptoms of high load rather than a cause.

It's not clear to me why you are seeing such extreme perf problems though - can you give a sense of the sort of rooms you are federating against - I'm interested in size and number rather than the specific room names.

Our perf improvements have sat behind recent GDPR work, but are still very much in the works.

Best I can suggest at this point is further reduction of SYNAPSE_CACHE_FACTOR

@neilisfragile
Copy link
Contributor

Looking into SYNAPSE_CACHE_FACTOR a bit more closely, I'd actually suggest that you do the opposite of my advice and see what happens if you increase it.

For well spec'd instances (such as yours) increasing the factor can actually reduce overall RAM load since Synapse spends less effort calculating results it could have retrieved from the cache.

Suggest trying 1 and see what happens.

@JoKeyser
Copy link
Author

@neilisfragile thanks for your help. I'm the only user on my server, so everything I do is in federated rooms. The breakdown of user number of joined room is: About 20 1:1 chats, 3x with 10-100 users, 7x 100-1k users, 3x with 2k-7k, and 1x with 15k. I'd be happy to leave the bigger ones to see if it helps, probably there's a server command to do so?

I've set SYNAPSE_CACHE_FACTOR = 1, but even after 12+ hours I still cannot connect, and it keeps crashing (oom_reaper). The only apparent improvement is that only one kind of CRITICAL errors seems to persist, namely Unhandled error in Deferred.

Due to the lack of other reports like mine, I suspect it's the unique state of my server. It was working fine for months, and I didn't add big rooms recently.
Would there be a way to start fresh(er) without losing my joined rooms etc?

@bjo81
Copy link

bjo81 commented May 26, 2018

You are not the only one ;) We have a federated synapse running, a room with ~45 users (in total, matrix + IRC) is bridged to IRC, and some users also use the IRC-bridge to connect to ~10 IRC channels.

A month ago or so, synapse got very memory hungry. It eats up to 5 GB RAM (on a 6 GB VM) and gets killed, after increasing the memory to 8GB it get's killed after consuming 7 GB. Playing around with SYNAPSE_CACHE_FACTOR with values between 0.25 and 1 did not have any useful effect.

@JoKeyser
Copy link
Author

@bjo81 ok great to know that this isn't just my own problem. Based on your experiments with adding RAM, I will now refrain from attempting to add swap space to my btrfs partition, thanks for the info.
Could you maybe share more details about your install? My problems also started about 4 weeks ago, maybe it was the same "event" in our Debian installs, if you also use that distro? Let's try to exclude some possibilities...

@bjo81
Copy link

bjo81 commented May 27, 2018

@JoKeyser We use Ubuntu Xenial as distro, and there are no real errors in the log:

grep error homeserver.log
2018-05-27 08:46:51,329 - synapse.federation.transaction_queue - 542 - WARNING - - TX [cloud.mkleen.de] Failed to send transaction: An error occurred while connecting: 113: No route to host.
2018-05-27 08:47:57,212 - synapse.federation.transaction_queue - 542 - WARNING - - TX [hamburg.freifunk.net] Failed to send transaction: An error occurred while connecting: 113: No route to host.
2018-05-27 11:53:19,135 - twisted - 131 - CRITICAL - - Unhandled error in Deferred:
2018-05-27 15:25:28,015 - twisted - 131 - CRITICAL - - Unhandled error in Deferred:

@JoKeyser
Copy link
Author

JoKeyser commented May 27, 2018

Inspired by #1884 I changed the logging to only console, not journal or file (after adding filters: [context] to the console handler entry), and this actually improves my situation: Synapse still OOMs but more rarely, and I can connect in between. At least that means that it's not my DB state per se.
I got some few new messages from federated rooms, I hope to get more over the course of the next hours.

Edit 1: And as short-term workaround, I consider to follow the suggestion in #3038 to leave some bigger rooms and then use the synapse janitor script from https://github.com/xwiki-labs/synapse_scripts...

Edit 2: Unfortunately, it's back to unusable after all... cannot log in since a few hours.

@bjo81
Copy link

bjo81 commented May 28, 2018

Interesting, I also used a config from #1884 now and the mem usage bounces between 1 und 5 GB, but no OOM yet.

@JoKeyser
Copy link
Author

JoKeyser commented May 31, 2018

Yesterday I used the synapse janitor script, which deleted 28 unused rooms (that I forgot about). I also upgraded to Synapse 0.30.1. Unfortunately still no improvement after 12+ hours. Only the OOM repeats are faster now again, about every 2-3 minutes...

Edit: Things actually improved eventually after several days; I can login about 10% of the time and receive some new messages and can send. OOMs still persist though.

@bjo81
Copy link

bjo81 commented Jul 18, 2018

We are running our instance now in a VM with 12 GB RAM and synapses keeps running stable with a usage of 5 GB.

@JoKeyser
Copy link
Author

For me, there was a relapse, constant OOMs and no ability to login anymore in between. Who knows, maybe it would work if I had more RAM, but as is, it's unusable. I'm out of my wits... is there a good way to backup the entire database and start fresh, with the same server keys?

@richvdh
Copy link
Member

richvdh commented Oct 3, 2018

@JoKeyser did this ever get resolved?

@JoKeyser
Copy link
Author

JoKeyser commented Oct 5, 2018

@richvdh thanks for asking. Unfortunately no, but I resorted to restore from an old backup, and since then it works fine since a few weeks. RAM use is much lower than it used to be, great work. I think this issue can be closed.

@JoKeyser JoKeyser closed this as completed Oct 5, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants