
Vault can't properly handle AWS RDS PostgreSQL Multi-AZ failover #6792

Closed
zenathar opened this issue May 28, 2019 · 27 comments
Labels
bug, community-sentiment, secret/database

Comments

@zenathar

zenathar commented May 28, 2019

Environment:

  • Vault Version: 1.1.2
  • Operating System/Architecture: centos-release-7-2.1511.el7.centos.2.10.x86_64

Vault Config File:

backend "consul" {
  address = "127.0.0.1:8500"
  path = "vault"
}

listener "tcp" {
  address = "x.x.x.x:8200"
  tls_disable = 1
}

ha_backend "consul" {
  api_addr = "http:/[[vault_dns_name]]:8200"
  cluster_addr = "http://[[vault_dns_name]]:8201"
}

cluster_name = "xxx"
ui = false

Startup Log Output:

May 28 11:26:35 [censored] systemd[1]: Started Vault.
May 28 11:26:35 [censored] systemd[1]: Starting Vault...
May 28 11:26:35 [censored] vault[2140]: ==> Vault server configuration:
May 28 11:26:35 [censored] vault[2140]: HA Storage: consul
May 28 11:26:35 [censored] vault[2140]: Api Address: http://[censored]:8200
May 28 11:26:35 [censored] vault[2140]: Cgo: disabled
May 28 11:26:35 [censored] vault[2140]: Cluster Address: https://[censored]:8201
May 28 11:26:35 [censored] vault[2140]: Listener 1: tcp (addr: "[censored]:8200", cluster address: "[censored]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
May 28 11:26:35 [censored] vault[2140]: Log Level: info
May 28 11:26:35 [censored] vault[2140]: Mlock: supported: true, enabled: true
May 28 11:26:35 [censored] vault[2140]: Storage: consul
May 28 11:26:35 [censored] vault[2140]: Version: Vault v1.1.2
May 28 11:26:35 [censored] vault[2140]: Version Sha: 0082501623c0b704b87b1fbc84c2d725994bac54
May 28 11:26:35 [censored] vault[2140]: ==> Vault server started! Log data will stream in below:
May 28 11:26:35 [censored] vault[2140]: 2019-05-28T11:26:35.921Z [WARN]  storage.consul: appending trailing forward slash to path
May 28 11:26:35 [censored] vault[2140]: 2019-05-28T11:26:35.923Z [WARN]  no `api_addr` value specified in config or in VAULT_API_ADDR; falling back to detection if possible, but this value should be manually set
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.819Z [INFO]  core: vault is unsealed
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.821Z [INFO]  core.cluster-listener: starting listener: listener_address=[censored]:8201
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.821Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[censored]:8201
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.821Z [INFO]  core: entering standby mode
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.844Z [INFO]  core: acquired lock, enabling active operation
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.890Z [INFO]  core: post-unseal setup starting
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.892Z [INFO]  core: loaded wrapping token key
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.892Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.896Z [INFO]  core: successfully mounted backend: type=kv path=secret/
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.896Z [INFO]  core: successfully mounted backend: type=system path=sys/
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.896Z [INFO]  core: successfully mounted backend: type=identity path=identity/
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.896Z [INFO]  core: successfully mounted backend: type=database path=database/
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.896Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.911Z [INFO]  core: successfully enabled credential backend: type=token path=token/
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.911Z [INFO]  core: successfully enabled credential backend: type=approle path=approle/
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.911Z [INFO]  core: restoring leases
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.911Z [INFO]  rollback: starting rollback manager
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.918Z [INFO]  identity: entities restored
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.920Z [INFO]  identity: groups restored
May 28 11:27:05 [censored] vault[2140]: 2019-05-28T11:27:05.922Z [INFO]  core: post-unseal setup complete
May 28 11:27:06 [censored] vault[2140]: 2019-05-28T11:27:06.067Z [INFO]  expiration: lease restore complete

Expected Behavior:
After a Multi-AZ failover of PostgreSQL on AWS RDS, Vault should generate new credentials properly when requested.

Actual Behavior:
After a Multi-AZ failover, Vault hangs for the maximum amount of time (90s) when generating new credentials, then times out. Credentials are only generated properly again about 5-20 minutes after the failover.
Additionally, there is no traffic seen in tcpdump for either IP address of the AWS RDS PostgreSQL instance - the old one (before failover) or the new one (after failover).
The issue does not occur with AWS RDS MySQL.

Steps to Reproduce:

  1. Create an AWS RDS PostgreSQL instance with Multi-AZ enabled.
  2. Execute the following statement on the newly created database:
CREATE ROLE vault_root ROLE root; GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO vault_root; GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO vault_root; ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO vault_root; ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO vault_root;
  3. Create the PostgreSQL database config as described:
./vault read database/config/db
Key                                   Value
---                                   -----
allowed_roles                         [db_rw]
connection_details                    map[max_open_connections:4 connection_url:postgresql://root:*****@[db_url]:5432/db max_connection_lifetime:5s max_idle_connections:-1]
plugin_name                           postgresql-database-plugin
root_credentials_rotate_statements    []
  4. Create the PostgreSQL database role as described:
[root@aint2vault01b vault]# ./vault read database/roles/db_rw
Key                      Value
---                      -----
creation_statements      [CREATE ROLE "{{name}}" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE vault_root; GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO "{{name}}"; GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO "{{name}}";]
db_name                 db
default_ttl              1h
max_ttl                  800000h
renew_statements         []
revocation_statements    []
rollback_statements      []
  5. Reboot the AWS RDS PostgreSQL database and tick the "Reboot With Failover?" checkbox.
  6. Wait a minute or two and try to read database/creds/db_rw (see the polling sketch below).
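
For reference, here is a minimal Go sketch of the polling in step 6 (hedged: it assumes VAULT_ADDR and VAULT_TOKEN are exported and that the role lives at database/creds/db_rw, and uses the official github.com/hashicorp/vault/api client). Requesting credentials every few seconds and logging the elapsed time makes the post-failover hang easy to observe:

package main

import (
    "log"
    "time"

    vault "github.com/hashicorp/vault/api"
)

func main() {
    // DefaultConfig picks up VAULT_ADDR; NewClient picks up VAULT_TOKEN.
    client, err := vault.NewClient(vault.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }
    for {
        start := time.Now()
        secret, err := client.Logical().Read("database/creds/db_rw")
        switch {
        case err != nil:
            log.Printf("creds request failed after %s: %v", time.Since(start), err)
        case secret == nil:
            log.Printf("no data returned after %s", time.Since(start))
        default:
            log.Printf("issued %v in %s", secret.Data["username"], time.Since(start))
        }
        time.Sleep(5 * time.Second)
    }
}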

Important Factoids:

References:

@melkorm

melkorm commented Oct 2, 2019

More details on this issue.

Steps to reproduce

Using https://gist.github.com/melkorm/25ce9f0d3840d29caa3491a47129e00f we can reproduce this by running setup-server.sh and then setup-db.sh with the correct hostname and password for RDS; after running these commands we can watch the credentials issued by Vault using watch.sh.
After this we can reboot RDS with failover in the AWS console and watch the Vault logs.

Observations:

  • It looks like none of the configuration parameters control how long it takes before the connection hits the driver: bad connection error,
  • Vault behaves correctly if it doesn't run into the driver: bad connection state

driver: bad connection logs: https://gist.github.com/melkorm/d6a46b37ba2618222a89c5476d34ab06

If you look through the logs you can find this part:

2019-10-01T15:50:03.346Z [TRACE] secrets.database.database_0f754b69.postgresql-database-plugin: create user: transport=builtin status=finished err="driver: bad connection" took=5m13.5570328s
2019-10-01T15:50:03.347Z [TRACE] secrets.database.database_0f754b69.postgresql-database-plugin: create user: transport=builtin status=finished err="context canceled" took=4m11.3820182s
2019-10-01T15:50:03.349Z [TRACE] secrets.database.database_0f754b69.postgresql-database-plugin: create user: transport=builtin status=finished err="context canceled" took=3m9.3568063s
2019-10-01T15:50:03.349Z [TRACE] secrets.database.database_0f754b69.postgresql-database-plugin: create user: transport=builtin status=finished err="context canceled" took=1m5.1224078s
2019-10-01T15:50:03.351Z [TRACE] secrets.database.database_0f754b69.postgresql-database-plugin: create user: transport=builtin status=finished err="context canceled" took=2m7.2336053s
2019-10-01T15:50:04.857Z [TRACE] secrets.database.database_0f754b69.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=4.5787007s

which shows that Vault blocks all incoming new password requests, even when DNS already points to the new IP, and waits for driver: bad connection to occur for several minutes (I've already seen 10 minutes and 5 minutes).

Below are the logs when Vault correctly times out and uses the new RDS instance:

2019-10-01T15:34:58.571Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=1.4509824s
2019-10-01T15:35:04.782Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:35:06.118Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=1.3357046s
2019-10-01T15:35:08.540Z [DEBUG] rollback: attempting rollback: path=auth/token/
2019-10-01T15:35:08.540Z [DEBUG] rollback: attempting rollback: path=secret/
2019-10-01T15:35:08.540Z [DEBUG] rollback: attempting rollback: path=sys/
2019-10-01T15:35:08.540Z [DEBUG] rollback: attempting rollback: path=cubbyhole/
2019-10-01T15:35:08.540Z [DEBUG] rollback: attempting rollback: path=identity/
2019-10-01T15:35:08.540Z [DEBUG] rollback: attempting rollback: path=database/
2019-10-01T15:35:12.304Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:35:13.715Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=1.4107558s
2019-10-01T15:35:19.990Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:35:21.375Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err="read tcp 172.17.0.2:34286->10.5.8.165:5432: i/o timeout" took=1.3850439s
2019-10-01T15:35:27.494Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:35:29.499Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err="dial tcp 10.5.8.165:5432: i/o timeout" took=2.0057219s
2019-10-01T15:35:35.631Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:35:37.637Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err="dial tcp 10.5.8.165:5432: i/o timeout" took=2.0058206s
2019-10-01T15:35:43.754Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:35:45.762Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err="dial tcp 10.5.8.165:5432: i/o timeout" took=2.0082028s
2019-10-01T15:35:52.002Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:35:54.006Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err="dial tcp 10.5.8.165:5432: i/o timeout" took=2.0043805s
2019-10-01T15:36:00.248Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:36:01.794Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=1.5458048s
2019-10-01T15:36:07.932Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:36:08.467Z [DEBUG] rollback: attempting rollback: path=secret/
2019-10-01T15:36:08.468Z [DEBUG] rollback: attempting rollback: path=identity/
2019-10-01T15:36:08.468Z [DEBUG] rollback: attempting rollback: path=cubbyhole/
2019-10-01T15:36:08.468Z [DEBUG] rollback: attempting rollback: path=sys/
2019-10-01T15:36:08.468Z [DEBUG] rollback: attempting rollback: path=database/
2019-10-01T15:36:08.469Z [DEBUG] rollback: attempting rollback: path=auth/token/
2019-10-01T15:36:09.247Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=1.3147729s
2019-10-01T15:36:15.433Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:36:16.892Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=1.4587123s
2019-10-01T15:36:23.059Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started
2019-10-01T15:36:24.367Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=finished err=<nil> took=1.308189s
2019-10-01T15:36:30.502Z [TRACE] secrets.database.database_3c1d9906.postgresql-database-plugin: create user: transport=builtin status=started

I've tested this with Vault versions v1.0.1 and v1.2.3.

I will dig more into this issue and try to rebuild Vault with more debug points to find exactly where the timeout is not respected, but it would also be nice to get more information from the team so perhaps we can resolve this faster :) Perhaps the issue is in https://github.com/lib/pq or in https://github.com/golang/go itself.

PS. I also found that Go (https://golang.org/src/database/sql/sql.go#L777) handles driver.ErrBadConn differently than any other error.
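
To illustrate what that special handling means, here is a simplified sketch (made-up names, not the stdlib source): database/sql silently retries an operation on another pooled connection when the driver reports driver.ErrBadConn, and only forces a brand-new connection after repeated failures, so a connection that hangs for minutes before finally reporting ErrBadConn stretches the total request time even further.

package badconn

import (
    "database/sql/driver"
    "errors"
)

// maxBadConnRetries mirrors the constant used by database/sql.
const maxBadConnRetries = 2

// retryOnBadConn shows the shape of the stdlib behavior: only
// driver.ErrBadConn triggers a silent retry; any other error is
// returned to the caller immediately.
func retryOnBadConn(op func(forceNewConn bool) error) error {
    var err error
    for i := 0; i < maxBadConnRetries; i++ {
        err = op(false) // try a cached/pooled connection first
        if !errors.Is(err, driver.ErrBadConn) {
            return err
        }
    }
    return op(true) // last attempt on a brand-new connection
}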

@tj13

tj13 commented Oct 12, 2019

Any update on this issue?

@bandesz

bandesz commented Oct 21, 2019

lib/pq does not handle query timeouts correctly and it can hang for a long time during an AZ failover. Here is the related issue: lib/pq#450.

@melkorm

melkorm commented Oct 21, 2019

@bandesz thanks for the link to the related issue! I had discovered this myself and wanted to create a similar issue, so it's great that it already exists. However, after some research it looks like Vault doesn't provide credentials concurrently, which means the application is not able to recover from this error and Vault is left in a broken state - it can't serve credentials for this db.

Vault could handle this and accept new credential requests while the old connection is broken, simply by allowing new requests - the connection pool can handle it. Currently that's not possible; please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/vault-tool/GXK3EMW7GGM/ii5DMAFODwAJ

lib/pq#450 is already 3 years old, and I guess there are two issues here: one related to Vault and the other to lib/pq.

@michelvocks
Contributor

michelvocks commented Nov 7, 2019

Hi @zenathar & @melkorm!

#5731 has been merged recently, which looks like a solution for this issue.
I will close this issue for now but please let me know if the issue is still present.

Cheers,
Michel

@bandesz

bandesz commented Nov 7, 2019

@michelvocks that seems to be an unrelated issue, can you please link the correct one?

@michelvocks
Contributor

@bandesz Sorry. It was #5731.

@melkorm

melkorm commented Nov 17, 2019

@bandesz @michelvocks Hey, can we actually reopen this, as it was a mistake?

@michelvocks
Contributor

Hi @melkorm!

Yes. Sorry for the inconvenience!

Cheers,
Michel

@michelvocks michelvocks reopened this Nov 18, 2019
@michelvocks michelvocks added the bug and secret/database labels Nov 18, 2019
@tyrannosaurus-becks tyrannosaurus-becks self-assigned this Mar 31, 2020
@tyrannosaurus-becks
Contributor

Thanks for providing such clear steps to reproduce this. I'm able to reproduce it locally. I've been testing and I found that if I comment out these lines, the problem disappears. Going to see if I can find an easy workaround for it.

@melkorm

melkorm commented Apr 1, 2020

@tyrannosaurus-becks thank you for looking into it 🥇

I think this works because we always open a new connection when asking for credentials, so we don't run into broken connections 🤔
but it's weird that we don't run into https://github.com/hashicorp/vault/blob/master/plugins/database/postgresql/postgresql.go#L125 as this lock should still be held while the previous query/connection hangs - or am I misunderstanding it?

I think the proper way to fix this issue is to replace/change the underlying Postgres library, as it can't handle timeouts correctly; even if we fix it on the Vault side so it doesn't block by creating new connections, we can still run into the issue of exhausting all available connections.

PS. I am afraid that there is no simple workaround for it :(

@tyrannosaurus-becks
Contributor

I have a working branch going here for anyone following along. I've found that if we simply don't cache connections, we eliminate the problem, so I'm working on making a setting for that.

@frittentheke

frittentheke commented Apr 1, 2020

@tyrannosaurus-becks Even if you do not reuse old connections, the issue would still be that individual connections do not time out (and do not receive any other lower-layer error such as a TCP reset). That is what half-open TCP connections are like: silent ;-)

While there is a long-outstanding PR to add this (lib/pq#792), why not add a timeout wrapper around the function doing the database querying and have that time out (and potentially retry)? A rough sketch of the idea follows.
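
A minimal sketch of that timeout-wrapper idea (names and durations are illustrative, not Vault's actual plugin code): run the statement in its own goroutine and stop waiting after a hard deadline, so a half-open TCP connection cannot block credential creation even when the driver ignores the context deadline.

package querytimeout

import (
    "context"
    "database/sql"
    "errors"
    "time"
)

var errQueryDeadline = errors.New("statement did not return before the deadline")

func execWithDeadline(db *sql.DB, timeout time.Duration, query string, args ...interface{}) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    done := make(chan error, 1)
    go func() {
        // ExecContext should already honor ctx; the select below is the
        // safety net for drivers that keep blocking on a dead socket.
        _, err := db.ExecContext(ctx, query, args...)
        done <- err
    }()

    select {
    case err := <-done:
        return err
    case <-time.After(timeout + time.Second):
        // The goroutine is abandoned along with its stuck connection;
        // the caller can fail fast and retry on a fresh connection.
        return errQueryDeadline
    }
}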

Otherwise individual API requests that Vault received from its clients time out and run into an error - and while this heals itself on the client's next try, since a new connection is used every time as with your PR, it seems kind of unclean not to at least retry talking to the SQL backend once.

@tyrannosaurus-becks
Contributor

@melkorm and @frittentheke, what do you think of #8660? I did testing and replicated the issue, and with the linked code there was no delay in returning creds after failover. Vault was already closing connections if the ping failed; this simply does that every time instead of only sometimes.

@tyrannosaurus-becks
Contributor

Of course with the code, I needed to use a slightly different config:

vault write database/config/mypostgresqldatabase \
    plugin_name=postgresql-database-plugin \
    allowed_roles="my-role" \
    connection_url="postgresql://{{username}}:{{password}}@test-issue-6792.redacted.us-east-1.rds.amazonaws.com:5432/" \
    username="postgres" \
    password="redacted" \
    disable_connection_caching=true

@frittentheke

@tyrannosaurus-becks yes, it's a 99% improvement, as it will always use a fresh connection and only an individual query that comes in right when the database is being switched might fail.

But if the underlying driver (pq in this case) had the said timeout, it would fail and a retry on this individual connection or query could happen as well. The more I think about it, though, your solution might just be that 99%, and in any case it is a good addition to the options for configuring the SQL backend.

@melkorm

melkorm commented Apr 1, 2020

@tyrannosaurus-becks I will test it tomorrow 👍

My only concern is that previously, when I was testing it, the code was hanging on the lock rather than the actual query, so even with a fresh connection we still wait for the lock to be released 🤔 From my understanding Vault can't provide credentials asynchronously and each API call waits for the others; if you could clarify this it would be great, as I may be misunderstanding something - you can find more details on this here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/vault-tool/GXK3EMW7GGM

@tyrannosaurus-becks
Contributor

tyrannosaurus-becks commented Apr 1, 2020

Ah! Thanks! In my testing I didn't encounter the locking issues you mention. I'll be very curious if you encounter them again in your testing.

As for the question in that thread, about why the lock exists, I think it's simply because of the connection being cached. If the lock weren't there, there'd be a race with the connection caching. If you do still encounter locking issues, let me know and I'll scratch my head and see if I can come up with any other options. I could maybe run the client's queries in a goroutine where, if the client doesn't return after a certain amount of time, it releases the lock and moves on - something roughly along the lines of the sketch below....
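
One possible shape of that idea (purely hypothetical, not Vault's code): use a 1-slot channel as the lock so that waiting for it can itself be bounded, which keeps new credential requests from queuing forever behind a request whose connection is hung. A plain sync.Mutex cannot be acquired with a timeout, which is why a channel is used here.

package lockwait

import (
    "errors"
    "time"
)

var errLockTimeout = errors.New("another request is still holding the database lock")

// timedLock is a lock whose acquisition can give up after a deadline.
type timedLock chan struct{}

func newTimedLock() timedLock { return make(timedLock, 1) }

func (l timedLock) acquire(d time.Duration) error {
    select {
    case l <- struct{}{}: // got the lock
        return nil
    case <-time.After(d): // stop waiting; let the caller error out and retry
        return errLockTimeout
    }
}

func (l timedLock) release() { <-l }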

@melkorm

melkorm commented Apr 2, 2020

Hey, so I've given it more thought and tested it locally, and hopefully I have some more information.

@tyrannosaurus-becks The patch makes things better, but after running it for a while and triggering RDS failovers I ran into this: status=finished err="pq: tuple concurrently updated" took=866.536553ms. More logs can be found here: https://gist.github.com/melkorm/86576974ac8d2560981e09a0f2c3440b

Also, I think this fix addresses situations where we run into the failover before we get the lock, e.g.:

  1. We hit Vault for credentials -> the database is under failover, we get a timeout
  2. We make the next API call and get a new connection and freshly baked credentials
  3. Any subsequent call also works, as we get fresh connections

But if the failover happens after we get the connection and acquire the lock, we land back in the socket timeout game: https://gist.github.com/melkorm/37660c59bc260543aa54e347c9fdb6cb :(

I am wondering why Vault even tries to cache connections and lock things, when Go's database/sql handles the connection pool and database transactions can handle consistency 🤔

If we could get rid of the timeout when we are in the middle of a transaction ... but on the other hand that feels more like quick-fixing than a proper solution - perhaps Vault could use a patched lib/pq version with timeouts properly implemented? This is what we thought of doing if nothing better comes along :/

Cheers, and thank you for working on this 🙏

@tyrannosaurus-becks
Contributor

That makes sense. I'm catching up to your level of context on the issue. I think I'm going to close the associated PR here because I don't think it gets us to where we need to go.

I was wondering, do you think that if Vault used the DialTimeout method located here, that might help with the hanging issues?

As for the locks, I hear you, if the underlying pq library handles it, there's no reason for us to do the same because it forces synchronous connections unnecessarily.

@melkorm

melkorm commented Apr 3, 2020

I was wondering, do you think that if Vault used the DialTimeout method located here, that might help with the hanging issues?

This would resolve issues around connecting to the database, and it can already be achieved by specifying it at the connection URL level, e.g.: postgresql://{{username}}:{{password}}@test-issue-6792.redacted.us-east-1.rds.amazonaws.com:5432/db_name?connect_timeout=1
See here: https://github.com/lib/pq/blob/356e267cd3f45ee1303b7d8765e3583cb949950e/conn.go#L341
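
As a small hedged sketch of that point (illustrative DSN and function names, not Vault's plugin code): connect_timeout only bounds the dial phase in lib/pq, so new connections fail fast, but reads on an already-established (half-open) connection are not covered, which is why it doesn't solve the whole failover hang on its own.

package conntimeout

import (
    "context"
    "database/sql"
    "time"

    _ "github.com/lib/pq"
)

func open(dsn string) (*sql.DB, error) {
    // Example DSN: "postgresql://user:pass@host:5432/db?connect_timeout=1"
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        return nil, err
    }
    // sql.Open is lazy; PingContext forces an actual dial so a dead
    // endpoint fails here instead of on the first credential request.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    return db, db.PingContext(ctx)
}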

I will think more about locks/pq and update this answer later.

So after thinking about it for a while, I think there are a few solutions we could consider.

  1. Use a patched https://github.com/lib/pq with "Added read_timeout and write_timeout" (lib/pq#792) - this would resolve this issue and perhaps a few more, and make the Vault Postgres database plugin more stable.

  2. Use an alternative Postgres library. Looking through the https://github.com/lib/pq issues, it seems not maintained well enough compared to alternatives like https://github.com/jackc/pgx and https://github.com/go-pg/pg. Those are just examples and seem more high-level (ORMs & stuff), but we could pick something more low-level that supports timeouts and is more dependable and better maintained.

  3. Rethink Vault's connection caching and locking. Reading through http://go-database-sql.org/connection-pool.html, it makes sense to leave the connection pool to database/sql rather than reimplementing it 🤔 (see the sketch below). I bet there was a case for this in the earlier days of Go, but it seems a bit off now.
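
For point 3, a small sketch of what "leave the pool to database/sql" could look like (the values simply mirror the knobs already shown in the config above: max_open_connections, max_idle_connections, max_connection_lifetime); database/sql then owns connection reuse and the replacement of connections that report driver.ErrBadConn.

package poolsketch

import (
    "database/sql"
    "time"

    _ "github.com/lib/pq"
)

func openPool(dsn string) (*sql.DB, error) {
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        return nil, err
    }
    db.SetMaxOpenConns(4)                  // max_open_connections
    db.SetMaxIdleConns(0)                  // keep no idle conns around a failover
    db.SetConnMaxLifetime(5 * time.Second) // max_connection_lifetime
    return db, nil
}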

Let me know what you think @tyrannosaurus-becks

@pbernal pbernal added this to the triaged milestone May 11, 2020
@srikiraju

Can this issue occur with other RDS backends like MySQL? We had a similar issue with RDS MySQL, and it might be related to this.

@aphorise
Contributor

@tyrannosaurus-becks - hey what happened to the WIP?

@CleverDBA

Modifying these Vault database connection parameters reduces the "5 minute hang" to under a minute, but then you basically lose connection pooling:

connection_url="[whatever your url is]?connect_timeout=5"
max_idle_connections=-1
max_connection_lifetime="1s"

It would be great if Vault handled lost database connections more gracefully. An RDS Postgres failover can complete in a matter of seconds, but Vault's default behavior is to hang for 5 or 10 minutes when one occurs.

@hsimon-hashicorp hsimon-hashicorp added the community-sentiment label Jan 18, 2022
@hsimon-hashicorp hsimon-hashicorp removed this from the triaged milestone Jan 18, 2022
@aphorise
Contributor

aphorise commented Sep 5, 2022

There have been a lot of changes since 1.1.2, including, for example, in the most recent 1.11.x:

database & storage: Change underlying driver library from lib/pq to pgx. This change affects Redshift & Postgres database secrets engines, and CockroachDB & Postgres storage engines [GH-15343]

Hey @zenathar, I was interested to know whether you've retested this flow and if it's still applicable?

@melkorm

melkorm commented Sep 25, 2022

@aphorise I can't really reproduce it atm, but looking at the pgx code it looks like they are at least trying to handle such cases (https://github.com/jackc/pgx/blob/d7c7ddc594209e641b6066b625973e8d7d711142/internal/nbconn/nbconn.go#L62), so in my opinion this issue could be closed and reopened if someone hits it again.

Thank you for replacing pq with pgx as I can imagine it was a lot of work 💪🏼 🎉

@hsimon-hashicorp
Contributor

As per the last comment, I'm going to go ahead and close this issue now. Please feel free to open a new one as needed. Thanks!
