This repository has been archived by the owner on Aug 26, 2022. It is now read-only.

bug 1434296: Use persistent DB connections #4644

Merged
jwhitlock merged 1 commit into mdn:master from conn-max-age-1434296 on Jan 30, 2018


jwhitlock
Contributor

Use persistent DB connections by setting CONN_MAX_AGE, defaulting to 60 seconds.

AWS RDS limits the maximum number of connections based on the instance size. Our current instance allows 1320 simultaneous connections. My estimate is that we would use 1010 connections, which is below this limit:

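| Deployment | Pod Count | Per Pod | Total |
|---|---|---|---|
| api | 2 | 4 | 8 |
| celery-beat | 1 | 1 | 1 |
| celery-cam | 1 | 1 | 1 |
| celery-worker | 10 | 4 | 40 |
| kumascript | 6 | 0 | 0 |
| web | 20 | 8 | 960 |
| **Total** | | | 1010 |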
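
The headroom implied by this estimate can be checked with a few lines of Python. This is only a sketch: it takes the Total column as given, and the names `estimated_connections` and `RDS_MAX_CONNECTIONS` are made up for illustration.

```python
# Per-deployment connection totals, copied from the Total column above.
estimated_connections = {
    "api": 8,
    "celery-beat": 1,
    "celery-cam": 1,
    "celery-worker": 40,
    "kumascript": 0,
    "web": 960,
}

# Connection limit quoted above for the current RDS instance.
RDS_MAX_CONNECTIONS = 1320

total = sum(estimated_connections.values())
headroom = RDS_MAX_CONNECTIONS - total
print(f"estimated={total} headroom={headroom}")
```

Under these numbers the estimate leaves 310 connections spare, which is why connections that linger during a rolling deployment are a concern.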

Another limiting factor is that MySQL will close idle connections, set by wait_timeout. It is currently set to 28800, or 8 hours. The CONN_MAX_AGE should be below this number, and we may want to lower MySQL's wait_timeout so that idle connections are released sooner. I believe this was much lower in SCL3, based on experience with interactive sessions, which made the default CONN_MAX_AGE a better choice for that environment.
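
The constraint that CONN_MAX_AGE stays below wait_timeout could be enforced where the setting is loaded. The following is a minimal sketch, not code from this PR: `check_conn_max_age` and `MYSQL_WAIT_TIMEOUT` are hypothetical names, and the sketch assumes both values are expressed in seconds.

```python
# Value of MySQL's wait_timeout quoted above, in seconds (8 hours).
MYSQL_WAIT_TIMEOUT = 28800


def check_conn_max_age(conn_max_age, wait_timeout=MYSQL_WAIT_TIMEOUT):
    """Return conn_max_age if it is below wait_timeout, else raise ValueError.

    In Django, CONN_MAX_AGE=None means connections never expire, so None can
    never be below wait_timeout and is rejected here as well.
    """
    if conn_max_age is not None and 0 <= conn_max_age < wait_timeout:
        return conn_max_age
    raise ValueError(
        f"CONN_MAX_AGE={conn_max_age!r} must be >= 0 and below "
        f"wait_timeout={wait_timeout}"
    )


# The default proposed in this PR passes the check.
CONN_MAX_AGE = check_conn_max_age(60)
```

If wait_timeout is lowered later, passing the new value as `wait_timeout` keeps the two settings from drifting out of order unnoticed.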

@jgmize and @metadave may have feedback from an SRE perspective on this change.

Contributor

escattone left a comment

Thanks for the database max-connection analysis, and for this change! I really like the idea of moving toward longer-lived connections.

return None


CONN_MAX_AGE = config('CONN_MAX_AGE', default=60,
Contributor

I'm fine with starting at 60 seconds for the default, but I lean towards going even higher (like maybe something between 5 and 15 minutes?), just so we can reduce the overhead of establishing new connections as much as possible.

I suspect that if we go beyond a certain value, we may have to contend with firewalls. In my past experience at two different employers, we had persistent DB connections that connected through a firewall, and discovered that the firewall silently dropped idle connections (without informing either end) after some period of time (it was somewhere between 30 minutes and an hour if I remember correctly, and of course is configurable per firewall). We got around the issue by pinging the connection periodically to keep it alive.
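
The keep-alive workaround described above can be sketched as follows. This is an illustration under stated assumptions, not what either employer ran: the `KeepAlive` class, the injected `ping` callable, and the 600-second threshold are all hypothetical. In a real setup `ping` might run a trivial query such as `SELECT 1`.

```python
import time


class KeepAlive:
    """Ping an idle connection before a firewall would silently drop it."""

    def __init__(self, ping, max_idle_seconds, clock=time.monotonic):
        self._ping = ping                  # callable that exercises the connection
        self._max_idle = max_idle_seconds  # ping once idle for this long
        self._clock = clock                # injectable so the logic can be tested
        self._last_used = clock()

    def touch(self):
        """Record that the connection just carried traffic."""
        self._last_used = self._clock()

    def maybe_ping(self):
        """Ping if idle for at least max_idle_seconds; return True if pinged."""
        if self._clock() - self._last_used >= self._max_idle:
            self._ping()
            self.touch()
            return True
        return False


# Usage with a fake clock, so no real waiting is needed.
now = [0.0]
pings = []
keeper = KeepAlive(
    ping=lambda: pings.append(now[0]),
    max_idle_seconds=600,
    clock=lambda: now[0],
)
now[0] = 300.0
pinged_early = keeper.maybe_ping()  # idle for 300s, below the threshold
now[0] = 900.0
pinged_late = keeper.maybe_ping()   # idle for 900s, so a ping is sent
```

The threshold would need to sit comfortably below the firewall's idle timeout, whatever that turns out to be.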

Contributor Author

We can play with the settings in prod, and adjust the default as needed. I'd like to only have a minute to shoot ourselves in the foot if, for example, a web pod doesn't properly close the connection as a new deployment is being rolled out. We're a little close to the max connections for my comfort...

Contributor

bookshelfdave commented Jan 30, 2018

+1, we can try lowering wait_timeout for the mdn-stage-params RDS parameter group before changing it on prod (a DB reboot may be required).

@jwhitlock
Contributor Author

I'd like to ship this when we're around to monitor, and then tweak CONN_MAX_AGE and wait_timeout if we have problems, especially around rolling deployments. We can experiment with a per-connection wait_timeout as well (in a new code push).
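
One way a per-connection wait_timeout might look in Django settings is through the MySQL backend's `init_command` option, which runs a statement when each connection is opened. This is a sketch of the idea rather than the change that was eventually made: `SESSION_WAIT_TIMEOUT`, its value of 300, and the database name are placeholders.

```python
# Hypothetical per-connection timeout in seconds; not a value from this PR.
SESSION_WAIT_TIMEOUT = 300

# Default proposed in this PR.
CONN_MAX_AGE = 60

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",  # placeholder database name
        "CONN_MAX_AGE": CONN_MAX_AGE,
        "OPTIONS": {
            # Executed once per new connection, overriding the server-wide
            # wait_timeout for this session only.
            "init_command": f"SET SESSION wait_timeout={SESSION_WAIT_TIMEOUT}",
        },
    }
}
```

Setting the timeout per session avoids touching the RDS parameter group, at the cost of one extra statement each time a connection is established.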

@jwhitlock jwhitlock merged commit b3a52d0 into mdn:master Jan 30, 2018
@jwhitlock jwhitlock deleted the conn-max-age-1434296 branch January 30, 2018 22:30