Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use keepalives #147

Open
rod-glover opened this issue Mar 8, 2023 · 1 comment
Open

Use keepalives #147

rod-glover opened this issue Mar 8, 2023 · 1 comment

Comments

@rod-glover
Copy link
Contributor

Note: This issue requires only a documentation change. Code does not need to be changed.

Something in the path to our database server, and/or the database engine itself, kills jobs that run longer than 120 minutes. This is the default keepalive time in both Linux and the database engine. One way to address this problem is to add keepalives on the client side at an interval shorter than the default (or whatever they happen to be set to be).

The PostgreSQL documentation for Database Connection Control Functions contains information on this. Note the following; this is a client-side setting, not a database engine setting:

The following functions deal with making a connection to a PostgreSQL backend server.

The documentation lists the functions and the parameters that can be passed to them. These parameters are listed in section 34.1.2, and include the following keepalive parameters.

keepalives
Controls whether client-side TCP keepalives are used. The default value is 1, meaning on, but you can change this to 0, meaning off, if keepalives are not wanted. This parameter is ignored for connections made via a Unix-domain socket.

keepalives_idle
Controls the number of seconds of inactivity after which TCP should send a keepalive message to the server. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPIDLE or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_interval
Controls the number of seconds after which a TCP keepalive message that is not acknowledged by the server should be retransmitted. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPINTVL or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_count
Controls the number of TCP keepalives that can be lost before the client's connection to the server is considered dead. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPCNT or an equivalent socket option is available; on other systems, it has no effect.

By setting these parameters to values that send keepalives at intervals shorter than the critical interval of 7200 s = 120 min, we can prevent the connection from being dropped.

Further, these parameters can be set in the database connection strings, which enables all clients to be set up appropriately without code change (which would be impossible with psql, for example). For example:

postgresql://user:password@db.pcic.uvic.ca/crmp?keepalives=1&keepalives_idle=300&keepalives_interval=300&keepalives_count=9

The 300 s (5 min) interval is very conservative, but given the minimal overhead for keepalives, it should not cause problems.

@rod-glover
Copy link
Contributor Author

rod-glover commented Mar 8, 2023

This has been verified in two ways:

  • An explicit change to the code for script manage-views (on a dead-end branch), which fixed the problem for those queries, which are very long-running.
  • A test run of psql connections from my work laptop, one with the recommended parameters, and one without. Each ran the query select pg_sleep(2.2*60*60); (2.2 hr). The connection with parameters succeeded without error; the connection without got SSL SYSCALL error: Connection timed out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant