
Server connect rate-limit / preallocating reserved connections (hot spares)? #478

Open
jeffdoering opened this issue Apr 17, 2020 · 2 comments


@jeffdoering

We are seeing application-induced login storms against our PG database: applications with aggressive timeouts drop their connections to the bouncer when queries temporarily get slower (for various reasons). That part is more or less fine, although not ideal.

Unfortunately, the dropped client connections lead the bouncer to shut down server connections, logging closing because: client unexpected eof (age=3413s). Establishing new connections is expensive on PG, which is why the bouncer is so valuable to us. In this case the situation spirals: clients drop connections due to slowness, the spike of new connections to the DB drives up its CPU, slowing things further, and the bouncer's pool of server connections is essentially torn down.

As the connection pool shrinks, clients hit hard timeout failures rather than killing further connections, so the pool starts to grow again, but it struggles to open new connections successfully because the high PG CPU load makes everything slow, and the cycle repeats.

Ultimately the original slow-down is the root problem, but the login storms appear to significantly amplify the impact and cause the system to thrash.

We are looking at improving the client logic to not drop connections so aggressively, since that causes the loss of the backend connection. However, that may be unavoidable in various cases.

We are interested in whether there are configuration patterns that accomplish something like a rate-limit on new connections to PG (letting clients time out waiting for connections if needed), and ideally a mechanism to slowly pre-allocate hot-spare connections that can replace dropped ones. That way, as long as the initial slowdown and connection loss are small enough, they can be immediately mitigated by draining the pool of hot spares, which is then slowly replenished.

I don't think pgbouncer provides this pattern now, but I am interested in options for mitigating it, assuming we will have transient slowdowns and want to protect the overall infrastructure.

@eulerto
Member

eulerto commented May 30, 2020

It seems you should tune the server_idle_timeout parameter to retain connections for a longer period. Another idea is to set min_pool_size so that you always have at least that number of open connections.
You did not provide details about your setup, but if you have set reserve_pool_size and all regular pool connections are in use, it will take some time to open the connections from this "reserve pool".
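Putting those suggestions together, a minimal pgbouncer.ini sketch might look like the following. The parameter names are real pgbouncer settings; the database entry, host, and all numeric values are placeholder assumptions to be tuned for the actual workload:

```ini
[databases]
; placeholder database entry
mydb = host=10.0.0.5 dbname=mydb

[pgbouncer]
; keep idle server connections around much longer so a burst of
; client disconnects does not tear the whole pool down
server_idle_timeout = 3600

; always keep at least this many server connections open per pool,
; which approximates a floor of "hot spare" connections
min_pool_size = 20

; extra connections allowed once the normal pool is exhausted,
; used only after clients have waited reserve_pool_timeout seconds
reserve_pool_size = 5
reserve_pool_timeout = 3
```

Note that min_pool_size keeps connections warm but does not rate-limit how fast new ones are opened after a storm tears them down.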

@markokr
Contributor

markokr commented May 30, 2020

The mistake here seems to be the clients simply closing slow connections. They should instead keep the connections open and send a cancel request. That signals to the Postgres backends that the client has lost interest in the result and that they should stop.

If, whenever the database gets slow, clients start launching additional backends running the same query, without waiting for the result of the first one and without canceling it, there cannot be a good outcome.
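For reference, the cancel request markokr describes is a small, fixed-format message in the PostgreSQL wire protocol: the client opens a *new* connection and sends a 16-byte CancelRequest packet containing a magic code plus the target backend's process ID and secret key (both received in the BackendKeyData message at startup). Most drivers expose this directly (e.g. psycopg2's connection.cancel()), so you rarely build it by hand; the sketch below just shows the packet layout, with placeholder pid/key values:

```python
import struct

# Magic CancelRequest code from the PostgreSQL frontend/backend protocol:
# the value (1234 << 16) | 5678, chosen to never collide with a protocol
# version number.
CANCEL_REQUEST_CODE = 80877102

def build_cancel_request(backend_pid: int, secret_key: int) -> bytes:
    """Build the 16-byte CancelRequest packet: Int32 length (16),
    Int32 cancel code, Int32 pid, Int32 key, in network byte order."""
    return struct.pack("!iiii", 16, CANCEL_REQUEST_CODE, backend_pid, secret_key)

# placeholder pid/secret; real values come from BackendKeyData
packet = build_cancel_request(backend_pid=4242, secret_key=123456789)
print(len(packet))  # 16
```

Because the cancel goes over a separate connection, the client can keep its original pooled connection open while the server abandons the slow query, which is exactly what avoids the teardown/reconnect storm described in this issue.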
