Hydra write to database: broken pipe #1599
Using these settings is the appropriate way of handling this. You did not provide the version string, but we're using best effort defaults to work around a broken connection pool.
Unless the driver gives us an indicator that a connection should be retried, this would not make sense, because errors such as 404, 409, or 400 should not be retried with (exponential) backoff, or else all of these errors would take forever to surface. On top of that, the sql connection pool (from Go) is actually capable of handling retries in connection-failure scenarios:
So this really looks like a driver issue to me, and not an issue we want to solve with "code on top" to fix something, especially if I would be open to set |
Seems reasonable; from the upstream issue it looks like the driver isn't properly marking those connections, so the sql.DB client can't retry them. |
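For context, the hook database/sql gives drivers for this is driver.ErrBadConn: when a driver returns it before anything was written to the server, the pool discards that connection and retries the statement on another one. A minimal sketch of that contract -- a hypothetical helper for illustration, not actual lib/pq code:

    package pqwrap // hypothetical package, for illustration only

    import (
        "database/sql/driver"
        "errors"
        "syscall"
    )

    // maybeBadConn maps transport-level failures (broken pipe, connection
    // reset) on a pooled connection to driver.ErrBadConn. database/sql then
    // throws the connection away and transparently retries the statement on
    // another connection instead of returning the error to the caller.
    func maybeBadConn(err error) error {
        if errors.Is(err, syscall.EPIPE) || errors.Is(err, syscall.ECONNRESET) {
            // Only safe if the request provably never reached the server;
            // otherwise a retry could execute the statement twice.
            return driver.ErrBadConn
        }
        return err
    }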
I'm not sure what a reasonable default would be. In my case I just have a Postgres box at a cloud provider and have run into this issue, but I haven't found the source of the connection closing yet. The value would need to be lower than whatever is closing the connections; in my case I went to the extreme (10s) since I need it to always work (and my load is pretty low right now). So again, not sure we want to set a default timeout (maybe just have some docs explaining the situation?). It seems that pgx (another Postgres driver for Go -- https://github.com/jackc/pgx) has this retry logic implemented -- jackc/pgx@4868929 So maybe a simpler solution is swapping drivers :/ |
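For reference, the knob that a connection-lifetime DSN option ultimately maps onto is the standard pool setting on *sql.DB. A minimal sketch (the DSN and the 10s value are placeholders mirroring the workaround above, not suggested defaults):

    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/lib/pq" // the Postgres driver discussed in this thread
    )

    func main() {
        // Placeholder DSN -- substitute real credentials/host.
        db, err := sql.Open("postgres", "postgres://hydra:secret@db:5432/hydra?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        // Recycle connections before whatever sits between us and Postgres
        // (load balancer, proxy, server idle timeout) silently drops them.
        db.SetConnMaxLifetime(10 * time.Second)
        db.SetMaxIdleConns(5)

        if err := db.Ping(); err != nil {
            log.Fatal(err)
        }
    }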
Interesting, I didn't know this driver existed. Swapping the driver is definitely a bit of work because we have to handle error codes correctly etc. But if lib/pq fails to deliver it's a viable solution. |
It seems like |
I checked around a bit, and it doesn't seem that what you're trying to do (retrying on e.g. broken pipe errors) is something most sql driver implementations support, as it's not possible to distinguish whether the server accepted the request or not. From the official Go MySQL driver:
It does appear that there's a PR for checking MySQL pruned connections ( go-sql-driver/mysql#934 ) but I'm not sure if that's even an issue with PostgreSQL. If it was, and this was unsolved, I'm sure we could convince the maintainers to accept a similar PR. |
pgx also has a stdlib interface (https://github.com/jackc/pgx/tree/master/stdlib) exposing the same Go database/sql interface. TBH I haven't used it with CockroachDB, but I would expect it to work (since it speaks the PG API); as for mysql
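Roughly what the swap looks like, assuming pgx's stdlib package (the import path shown is the v4 module; earlier versions live at github.com/jackc/pgx/stdlib) and a placeholder DSN:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" database/sql driver
    )

    func main() {
        // Same database/sql interface as lib/pq; only the import and the
        // driver name change. The DSN is a placeholder.
        db, err := sql.Open("pgx", "postgres://hydra:secret@db:5432/hydra?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if err := db.Ping(); err != nil {
            log.Fatal(err)
        }
    }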
IIRC there's a way to check at the syscall level -- but I'm not sure if that's easily accessible on the Go side of things (I assume it could be). This is really only an issue for idle connections that turn out to be broken when used. Now that I've set my idle timeout very short, I just don't see this issue anymore. It seems like we'd want |
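For what it's worth, the syscall-level check being referred to is roughly what go-sql-driver/mysql#934 does: a non-blocking one-byte read on the idle socket before reusing it. A Unix-only sketch, not tied to any particular driver:

    package connprobe

    import (
        "errors"
        "io"
        "net"
        "syscall"
    )

    var errUnexpectedRead = errors.New("unexpected read from socket")

    // connCheck reports whether an idle TCP connection still looks usable.
    // It performs a non-blocking one-byte read: EAGAIN/EWOULDBLOCK means the
    // socket is idle and healthy; EOF or unexpected data means it is not.
    func connCheck(conn net.Conn) error {
        sysConn, ok := conn.(syscall.Conn)
        if !ok {
            return nil // not a raw socket; nothing we can probe
        }
        rawConn, err := sysConn.SyscallConn()
        if err != nil {
            return err
        }

        var checkErr error
        err = rawConn.Read(func(fd uintptr) bool {
            var buf [1]byte
            n, rerr := syscall.Read(int(fd), buf[:])
            switch {
            case n == 0 && rerr == nil:
                checkErr = io.EOF // peer closed the connection
            case n > 0:
                checkErr = errUnexpectedRead // data we were not expecting
            case rerr == syscall.EAGAIN || rerr == syscall.EWOULDBLOCK:
                checkErr = nil // nothing to read: connection looks fine
            default:
                checkErr = rerr
            }
            return true // one probe only; do not wait for readiness
        })
        if err != nil {
            return err
        }
        return checkErr
    }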
Do you know if |
Anyways, I'm closing this here. The proposed solution to retry on failure is not an acceptable solution, as I've laid out in previous comments. This needs to be fixed in the driver itself. We could switch to |
If switching to pgx/stdlib is a one-liner, I'd be up for that |
Closes #1599 Co-authored-by: Gorka Lerchundi Osa <glertxundi@gmail.com>
Describe the bug
Requests to Hydra (at least when running with Postgres) can sometimes hit an "idle" (but actually closed) connection to the DB. In this scenario the caller gets back:
I've run into similar errors in other apps, and the issue is upstream in lib/pq as well (lib/pq#870).
I was able to work around this issue by setting
max_conn_lifetime=10s
in the DSN, as it avoids connections being held open that have actually been closed on the remote side.
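For example, a full DSN with the parameter appended (credentials and host are placeholders; the max_conn_lifetime query parameter at the end is the relevant part):

    postgres://hydra:secret@postgres:5432/hydra?sslmode=require&max_conn_lifetime=10s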
Reproducing the bug
This has been annoying to reproduce on-demand. I've noticed it when tokens get refreshed from the IDP -- specifically after a long time (8hrs in this case), long enough for the connection to have been closed out.
Expected behavior
I would expect that L4 (transport-level) errors such as this would be transparently retried instead of bubbling all the way up to the client.