Vault can't properly handle AWS RDS PostgreSQL multi-AZ failover #6792
Comments
More details on this issue.

Steps to reproduce: using https://gist.github.com/melkorm/25ce9f0d3840d29caa3491a47129e00f we can reproduce this by running the gist.

Observations:
driver: bad connection logs: https://gist.github.com/melkorm/d6a46b37ba2618222a89c5476d34ab06 If you look through the logs you can find this part:
which shows that Vault blocks all incoming new-password requests, even after DNS points to the new IP, and keeps waiting. Below are the logs from a case where Vault correctly times out and uses the new RDS instance:
I've tested it with the above. I will dig more into this issue and try to rebuild Vault with more debug points to find exactly where the timeout is not respected, but it would also be nice to get more information from the team so perhaps we can resolve this faster :) Perhaps the issue is in https://github.com/lib/pq or in https://github.com/golang/go itself. PS. Also found how Go handles this at https://golang.org/src/database/sql/sql.go#L777
Any update on this issue?
lib/pq does not handle query timeouts correctly and it can hang for a long time during an AZ failover. Here is the related issue: lib/pq#450.
@bandesz thanks for the link to the related issue! I had discovered this myself and wanted to create a similar issue, but it's great that it already exists. After some research, though, it looks like Vault doesn't serve credentials concurrently, which means the application is not able to recover from this error and Vault is left in a broken state: it can't serve credentials for this database. Vault could handle it and accept new requests for credentials while the old connection is broken, simply by allowing new requests, since the connection pool can handle it; currently that's not possible. Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/vault-tool/GXK3EMW7GGM/ii5DMAFODwAJ. lib/pq#450 is already 3 years old, and I guess there are two issues here: one related to Vault and the other to lib/pq.
@michelvocks that seems to be an unrelated issue; can you please link the correct one?
@bandesz @michelvocks Hey, can we actually reopen this, as it was closed by mistake?
Hi @melkorm! Yes. Sorry for the inconvenience! Cheers,
Thanks for providing such clear steps to reproduce this. I'm able to reproduce it locally. I've been testing and I found that if I comment out these lines, the problem disappears. Going to see if I can find an easy workaround for it.
@tyrannosaurus-becks thank you for looking into it 🥇 I think this works because we always open a new connection when asking for credentials, so we don't run into broken connections 🤔 I think the proper way to fix this issue is to replace or change the underlying Postgres library, as it can't handle timeouts correctly; even if we fix this on the Vault side so it won't block by creating new connections, we can run into the issue of exhausting all available connections. PS. I am afraid there is no simple workaround for it :(
I have a working branch going here for anyone following along. I've found that if we simply don't cache connections, we eliminate the problem, so I'm working on making a setting for that.
@tyrannosaurus-becks Even if you do not reuse old connections, the issue would still be that individual connections do not time out (and do not receive any other lower-layer error such as a TCP reset). That is what half-open TCP connections are like: silent ;-) While there is a long-outstanding PR to add this (lib/pq#792), why not add a timeout wrapper around the function doing the database querying and have that time out (and potentially retry)? Otherwise individual API requests that Vault received from its clients time out and run into an error. While that heals things on the next try the client makes, with a new connection every time as in your PR, it seems kind of unclean not to at least retry once to talk to the SQL backend.
@melkorm and @frittentheke, what do you think of #8660? I did testing and replicated the issue, and with the linked code I was able to have no delay returning creds after failover. Vault was already closing connections if the ping failed; this simply does that every time instead of sometimes.
Of course, with that code I needed to use a slightly different config:
@tyrannosaurus-becks yes, it's a 99% improvement, as it will always use a fresh connection, and only an individual query that came in right when the database is being switched might fail. But if the underlying driver (pq in this case) had the said timeout, it would fail, and a retry on that individual connection or query could happen as well. The more I think about it, though, your solution might just be that 99%, and in any case a good addition to the options for configuring the SQL backend.
@tyrannosaurus-becks I will test it tomorrow 👍 My only concern is that previously, when I was testing, the code was hanging on the lock, not the actual query, so even with a fresh connection we still wait for the lock to be released 🤔 From my understanding, Vault can't provide credentials concurrently and each API call waits for the others. If you could clarify that, it would be great, as I may be misunderstanding something; you can find more details on this here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/vault-tool/GXK3EMW7GGM
Ah! Thanks! In my testing I didn't encounter the locking issues you mention; I'll be very curious whether you encounter them again in your testing. As for the question in that thread, why the lock exists: I think it's simply because the connection is cached. If the lock weren't there, there'd be a race on the cached connection. If you do still encounter locking issues, let me know and I'll scratch my head and see if I can come up with any other options. I could maybe run the client's queries in a goroutine where, if the client doesn't return after a certain amount of time, it releases the lock and moves on...
Hey, so I've given it more thought and tested it locally, and hopefully I have some more information. @tyrannosaurus-becks The patch makes things better, but after running it for a moment and triggering RDS failovers I ran into this: I also think that this fix addresses situations where we run into the failover before we get the lock, e.g.:
but if the failover happens after we get a connection and acquire the lock, we land back in the socket-timeout game https://gist.github.com/melkorm/37660c59bc260543aa54e347c9fdb6cb :( I am wondering why Vault even tries to cache connections and lock things when the Go db plugin handles the connection pool and database transactions can handle consistency 🤔 If we could get rid of the timeout when we are in the middle of a transaction... but on the other hand it feels more like quick-fixing than a proper solution. Perhaps Vault could use a patched lib/pq version with timeouts properly implemented? That is what we thought to do if nothing better happens :/ Cheers, and thank you for working on this 🙏
That makes sense. I'm catching up to your level of context on the issue. I think I'm going to close the associated PR here because I don't think it gets us where we need to go. I was wondering: do you think that if Vault used the … As for the locks, I hear you; if the underlying …
This would resolve issues around connecting to the database, and currently it can be achieved by specifying it at the database URL level, e.g.:
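The original URL example was not captured in this thread. For illustration, libpq-style connection strings accept a `connect_timeout` parameter (in seconds), so a Vault `connection_url` using it could look like the following, with a placeholder host and database name:

```
connection_url="postgresql://{{username}}:{{password}}@your-rds-endpoint:5432/mydb?connect_timeout=5"
```

Note that `connect_timeout` only bounds connection establishment; it does not help with queries already in flight on a half-open connection.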
So after thinking about it for a while, I think there are a few solutions we could consider.
Let me know what you think @tyrannosaurus-becks
Can this issue occur with other RDS backends like MySQL? We had a similar issue with RDS MySQL and it might be related to this.
@tyrannosaurus-becks hey, what happened to the WIP?
Modifying these Vault storage parameters reduces the "5 minute hang" to under a minute, but then you basically lose connection pooling:
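The parameter list referred to above did not survive in this thread. For illustration only, Vault's database connection configuration exposes pool-tuning options along these lines; the values here are merely examples, and setting idle connections to 0 is what effectively disables pooling:

```
max_open_connections = 2
max_idle_connections = 0
max_connection_lifetime = "30s"
```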
It would be great if Vault handled lost database connections more gracefully. An RDS Postgres failover can happen in a matter of seconds, but Vault's default behavior is to hang for 5 or 10 minutes when an RDS Postgres failover occurs.
There have been a lot of changes since 1.1.2, including, for example, in the most recent 1.11.x:
Hey @zenathar, I was interested to know if you've retested this flow or if it's still applicable?
@aphorise Can't really reproduce it at the moment, but looking at the pgx code it looks like they are at least trying to handle such cases: https://github.com/jackc/pgx/blob/d7c7ddc594209e641b6066b625973e8d7d711142/internal/nbconn/nbconn.go#L62 So in my opinion this issue could be closed and reopened if someone hits it again. Thank you for replacing pq with pgx; I can imagine it was a lot of work 💪🏼 🎉
As per the last comment, I'm going to go ahead and close this issue now. Please feel free to open a new one as needed. Thanks! |
Environment:
Vault Config File:
Startup Log Output:
Expected Behavior:
After a multi-AZ failover of PostgreSQL on AWS RDS, Vault should properly generate new credentials when requested.
Actual Behavior:
After a multi-AZ failover, Vault hangs for the maximum amount of time (90s) when generating new credentials, then times out. Credentials are properly generated only about 5-20 minutes after the failover.
Additionally, there is no traffic seen in tcpdump for either IP address of the AWS RDS PostgreSQL instance: the old one (before) or the new one (after failover).
The issue does not occur on AWS RDS MySQL.
Steps to Reproduce:
database/creds/db_rw
Important Factoids:
References: