Prevent excess DB re-connection for services
Previously, the pool size for Postgres DB connections was set to 1. This caused SQLAlchemy to create a new connection from scratch whenever more than one thread was simultaneously using Butlers that share a connection pool (as is typical in service use cases).
dhirving committed Apr 24, 2024
1 parent 5a9335e commit 4dc22d0
Showing 3 changed files with 46 additions and 2 deletions.
1 change: 1 addition & 0 deletions doc/changes/DM-44050.bugfix.md
@@ -0,0 +1 @@
Postgres database connections are now checked for liveness before they are used, significantly reducing the chance of exceptions being thrown due to stale connections.
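As a rough sketch of what the liveness check buys: before handing out a pooled connection, SQLAlchemy's `pool_pre_ping` issues a lightweight round trip to the server and transparently reconnects if the connection has gone stale. The toy below is purely illustrative, not SQLAlchemy's implementation; `StubConnection` and `is_alive` are invented names standing in for the real ping.

```python
# Toy illustration of connection liveness checking ("pre-ping").
# This is NOT SQLAlchemy's implementation; the hypothetical
# is_alive() probe stands in for the lightweight query that
# pool_pre_ping sends before a pooled connection is reused.

class StubConnection:
    def __init__(self):
        # e.g. dropped by a firewall or proxy after sitting idle
        self.closed_by_network = False

    def is_alive(self):
        # The real pre-ping does a round trip to the DB server here.
        return not self.closed_by_network


def checkout_with_pre_ping(pooled):
    """Return a live connection, silently replacing stale ones."""
    if pooled is not None and pooled.is_alive():
        return pooled
    # Stale connection: discard it and establish a fresh one,
    # instead of letting the caller hit an exception mid-query.
    return StubConnection()


stale = StubConnection()
stale.closed_by_network = True
conn = checkout_with_pre_ping(stale)  # caller never sees the stale one
```

The trade-off, as the code comments in this commit note, is one extra round trip to the database per checkout.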
1 change: 1 addition & 0 deletions doc/changes/DM-44050.perf.md
@@ -0,0 +1 @@
Increased the Postgres connection pool size, fixing an issue where multi-threaded services would re-create the database connection excessively.
46 changes: 44 additions & 2 deletions python/lsst/daf/butler/registry/databases/postgresql.py
@@ -159,12 +159,54 @@ def makeEngine(
) -> sqlalchemy.engine.Engine:
return sqlalchemy.engine.create_engine(
uri,
pool_size=1,
# Prevent stale database connections from throwing exceptions, at
# the expense of a round trip to the database server each time we
# check out a session. Many services using the Butler operate in
# networks connections are frequently dropped.
# networks where connections are dropped when idle for some time.
pool_pre_ping=True,
# This engine and database connection pool can be shared between
# multiple Butler instances created via Butler.clone() or
# LabeledButlerFactory, and typically these will be used from
# multiple threads simultaneously. So we need to configure
# SQLAlchemy to pool connections for multi-threaded usage.
#
# This is not the maximum number of active connections --
# SQLAlchemy allows some additional overflow configured via the
# max_overflow parameter. pool_size is only the maximum number
# saved in the pool during periods of lower concurrency.
#
# This specific value for pool size was chosen somewhat arbitrarily
# -- there has not been any formal testing done to profile database
# concurrency. The value chosen may be somewhat lower than is
# optimal for service use cases. Some considerations:
#
# 1. Connections are only created as they are needed, so in typical
# single-threaded Butler use only one connection will ever be
# created. Services with low peak concurrency may never create
# this many connections.
# 2. Most services using the Butler (including Butler
# server) are using FastAPI, which uses a thread pool of 40 by
# default. So when running at max concurrency we may have:
# * 10 connections checked out from the pool
# * 10 "overflow" connections re-created each time they are
# used.
# * 20 threads queued up, waiting for a connection, and
# potentially timing out if the other threads don't release
# their connections in a timely manner.
# 3. The main Butler databases at SLAC are run behind pgbouncer,
# so we can support a larger number of simultaneous connections
# than if we were connecting directly to Postgres.
#
# See
# https://docs.sqlalchemy.org/en/20/core/pooling.html#sqlalchemy.pool.QueuePool.__init__
# for more information on the behavior of this parameter.
pool_size=10,
# In combination with pool_pre_ping, prevent SQLAlchemy from
# unnecessarily reviving pooled connections that have gone stale.
# Setting this to true makes it always re-use the most recent
# known-good connection when possible, instead of cycling to other
# connections in the pool that we may no longer need.
pool_use_lifo=True,
)

@classmethod
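The sizing and LIFO behavior described in the comments above can be illustrated with a toy model. This `MiniPool` is hypothetical and greatly simplified, not SQLAlchemy's `QueuePool`, but it mimics the `pool_size` / `max_overflow` / `pool_use_lifo` semantics: up to `pool_size` connections are retained for reuse, up to `max_overflow` extra connections may be created under load but are discarded when returned, and checkout reuses the most recently returned connection first.

```python
# Hypothetical, simplified model of SQLAlchemy QueuePool sizing
# semantics -- for illustration only, not the real implementation.
import queue


class MiniPool:
    def __init__(self, pool_size=10, max_overflow=10):
        self.pool_size = pool_size        # connections kept when idle
        self.max_overflow = max_overflow  # extra, non-pooled connections
        self._idle = queue.LifoQueue()    # pool_use_lifo=True analogue
        self._created = 0

    def checkout(self):
        try:
            # LIFO: most recently returned connection is reused first,
            # letting older connections age out of the pool.
            return self._idle.get_nowait()
        except queue.Empty:
            if self._created >= self.pool_size + self.max_overflow:
                raise RuntimeError("pool exhausted; caller must wait")
            self._created += 1
            return f"conn-{self._created}"

    def checkin(self, conn):
        if self._idle.qsize() < self.pool_size:
            self._idle.put_nowait(conn)  # keep for reuse
        else:
            self._created -= 1  # overflow connection: close and discard


pool = MiniPool(pool_size=2, max_overflow=1)
a, b, c = pool.checkout(), pool.checkout(), pool.checkout()  # c is overflow
pool.checkin(a)
pool.checkin(b)
pool.checkin(c)  # pool already holds 2, so c is discarded, not pooled
reused = pool.checkout()  # LIFO: hands back b, the last one returned
```

This mirrors the scenario in the comments: with `pool_size=10` and the default `max_overflow=10`, a 40-thread FastAPI service at peak load has 10 pooled connections, 10 overflow connections re-created on each use, and 20 threads waiting.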
