Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keep-alive option #38

Merged
merged 2 commits into from
Sep 27, 2014
Merged

Add keep-alive option #38

merged 2 commits into from
Sep 27, 2014

Conversation

flowchartsman
Copy link
Contributor

done by eschewing net.Dial shorthand and creating a new dialer and setting the keepalive option

jackc added a commit to jackc/pgx_notify_lost_connection that referenced this pull request Sep 26, 2014
Vendored pgx with jackc/pgx#38 included
@jackc
Copy link
Owner

jackc commented Sep 26, 2014

I did a test of this functionality by testing a connection to a PostgreSQL server on a different computer, then pulling the network cable from that computer. I got some surprising results.

Here's the test program I used: https://github.com/jackc/pgx_notify_lost_connection

Without your patch, it hung, apparently forever (I let it go 15 minutes). With your patch with a KeepAlive of 5 seconds, it hung for 10 minutes, and then noticed the connection was dead.

I found this post that explains what is happening: http://felixge.de/2014/08/26/tcp-keepalive-with-golang.html

I'm not sure I want to use the library the post references -- it doesn't support Windows and I try to avoid external dependencies.

I suppose timing out after a platform specific amount of time is better than never timing out at all. But it feels wrong to expose control of keep-alive as a number that can't accurately reflect what the true timeout value is. I wonder if it would be a good idea to not expose any configuration and just to turn on keep alive on with a timeout of 1 minute (or 5 or 10...). The libpq C client library enables keep alives by default so presumably it is safe (http://www.postgresql.org/docs/9.3/static/libpq-connect.html). In this way we would not set any expectations of how quickly dead connections will be noticed, but we can be sure that they will be noticed eventually (this is also on my mind because it occurred to me while working on this that a dropped connection at the wrong time could cause Query or Exec to lockup as well).

@flowchartsman
Copy link
Contributor Author

I was able to get fairly quick detection of a lost connection with a for-select that attempted a nop ( --; ) every two minutes and otherwise sat in a 30 second WaitForNotification loop. I think perhaps that turning keep-alives on by default is good, but I'm unsure what the best hard-coded value would be in that case. Thoughts? It also occurs to me that the null query functionality I added earlier today to support the aforementioned NOP could help detect dropped connections as well.

@jackc
Copy link
Owner

jackc commented Sep 27, 2014

It probably was the nop that was detecting the lost connections, because that is causing a write on the connection. That would detect the broken connection regardless of keep-alive. Where the keep-alive would help is a WaitForNotification loop that did not include the nop (because that only reads and never tries to send on the connection).

The problem with just using keep-alive is the actual time it takes from broken connection to detection is highly variable based on OS and OS configuration (and can easily be many minutes until lost connection detection). So to get consistent and shorter latencty detection of lost connection in a WaitForNotification loop it does seem necessary to poll in some manner.

This makes me think we have two issues here. One is the specific case of WaitForNotification detecting a lost connection, and two is the general case of a lost connection.

For WaitForNotification it bothers me to require users of pgx to include the nop request in their WaitForNotification loop. I think an internal nop (which may not have to be an Exec -- there may be a lower level nop message we can send -- Flush might do it) would be a good idea.

For the general case of the lost connection, I think we do need to have keep-alive on. Here's a possible scenario: A user sends a select statement that will return 100 rows. After receiving 50 of them the connection silently is lost. pgx could currently hang forever waiting to read. The keep-alive time doesn't matter a great deal because this is just a fail-safe to ensure we eventually notice the lost connection. For a real lost or slow connection detection we would need to use Set[Read|Write]Deadline.

I'm going to merge this pull request and then change keep-alive to always on. I created issue #40 for WaitForNotification detecting lost connections quicker to continue the discussion there.

jackc added a commit that referenced this pull request Sep 27, 2014
@jackc jackc merged commit a724d7b into jackc:master Sep 27, 2014
@flowchartsman flowchartsman deleted the keepalive branch September 28, 2014 15:13
jackc added a commit that referenced this pull request Dec 4, 2021
Handle IPv6 in connection URLs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants