postgresql HA backend - vault auto seals when it cannot open connection to database due to exhausted local ports #11936

Open
write0nly opened this issue Jun 24, 2021 · 4 comments

Comments

@write0nly

This issue was caught in QA/stress testing and is not really expected in a production environment; however, it could also be forced by users who can log in to the vault server host.

Because Vault (when using the postgresql backend) opens short-lived connections to PostgreSQL to create and delete leases and entries in the DB, if too many connections are left in TIME_WAIT or CLOSE_WAIT, Vault can cycle through the entire local port range available for outgoing connections and eventually run out of ports, printing errors like the following:

Jun 24 19:38:50 vault-2 vault[32594]: 2021-06-24T19:38:50.179+0100 [ERROR] core: failed to create token: error="failed to persist accessor index entry: dial tcp 10.0.0.1:5432: connect: cannot assign requested address"
Jun 24 19:38:50 vault-2 vault[32594]: 2021-06-24T19:38:50.841+0100 [ERROR] core: failed to create token: error="failed to persist accessor index entry: dial tcp 10.0.0.1:5432: connect: cannot assign requested address"

After some time of this, Vault errors out and seals itself. In the case where Vault tries to expire leases on startup and has too many of them (say 200k that need to expire), it exhausts the ports and then seals itself. This makes Vault unusable, because it keeps re-sealing itself over and over again.

If the user is persistent and keeps unsealing Vault in a loop, Vault will eventually reach a stable point once the number of leases drops below 10k, which can be seen with this query:

vault=> select path, count(path) from vault_kv_store group by path having count(path) > 500;
                path                | count
------------------------------------+-------
 /sys/expire/id/auth/approle/login/ |  1826
 /sys/token/accessor/               |  3300
 /sys/token/id/                     |  3968
(3 rows)

Steps to reproduce the behavior:

  1. Have a Vault cluster (tested on v1.7.2) using the postgresql storage backend. This was tested in a cluster but may also happen with a single Vault node.

  2. Run vault write auth/approle/login role_id=... secret_id=... in a loop millions of times until there are 200k+ outstanding Vault tokens waiting to expire.

  3. Stop all Vaults in the cluster, for example due to an upgrade.

  4. Make sure you have a large number of outstanding leases in the DB:

vault=> select path, count(path) from vault_kv_store group by path having count(path) > 500;
                path                | count
------------------------------------+--------
 /sys/expire/id/auth/approle/login/ | 226473
 /sys/token/accessor/               | 227460
 /sys/token/id/                     | 227691
(3 rows)
  5. Restart and unseal the Vaults.

Expected behavior
Vault gets unsealed and works normally, expiring leases in the background.

Observed behaviour
Vault frantically tries to expire leases and delete lease entries from the tables, rapidly cycling through and exhausting all source ports to the PostgreSQL server. When no ports are available any more, Vault starts erroring and then seals itself.

Vault server configuration used:

cluster_name            = "test"
log_level               = "trace"
pid_file                = "/run/vault_pgsql.pid"

ui                      = true
disable_mlock           = true
verbose_oidc_logging    = true
raw_storage_endpoint    = true

# must have full protocol in the string
cluster_addr      = "https://..."
api_addr          = "https://..."

tls_require_and_verify_client_cert = "false"

listener "tcp" {
  address = "10.0.0.10:9999"
  tls_disable  = "false"
  tls_disable_client_certs = "true"
  tls_cert_file ="/etc/vault.d/tls/vault.crt"
  tls_key_file = "/etc/vault.d/tls/vault.key"
}

storage "postgresql" {
    connection_url = "postgres://vault_user:password@dbhost:5432/vault?sslmode=disable"
    ha_enabled     = true
}
@write0nly
Author

For the record, this seems to happen because the connection pool is too small by default (unset?). If we set max_idle_connections > max_parallel, the connections are not torn down and there is no churn. It has the obvious downside of keeping many connections open, but maybe max_parallel can be lowered too.

The following settings worked flawlessly.

    max_idle_connections = 256
    max_parallel = 128

IMHO this could become:
1- change the default so that max_idle_connections >= max_parallel
2- document this clearly on the postgresql backend page
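
For reference, a minimal sketch of what the full storage stanza could look like with these settings applied (the connection_url is the placeholder from the config above, and the two tuning values are the ones reported to work here):

storage "postgresql" {
    connection_url       = "postgres://vault_user:password@dbhost:5432/vault?sslmode=disable"
    ha_enabled           = true

    # Keep up to 256 idle connections pooled and reused instead of being
    # torn down after each operation, so the ephemeral port range is not
    # churned through.
    max_idle_connections = 256

    # Cap the number of parallel operations (and therefore connections)
    # Vault issues against the storage backend.
    max_parallel         = 128
}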

@ncabatoff
Collaborator

> For the record, this seems to happen because the connection pool is too small by default (unset?). If we set max_idle_connections > max_parallel, the connections are not torn down and there is no churn. It has the obvious downside of keeping many connections open, but maybe max_parallel can be lowered too.
>
> The following settings worked flawlessly.
>
>     max_idle_connections = 256
>     max_parallel = 128
>
> IMHO this could become:
> 1- change the default so that max_idle_connections >= max_parallel
> 2- document this clearly on the postgresql backend page

Hi @write0nly,

This suggestion makes good sense to me; I'm all for it. I'm not sure when we'll get to it, though, so feel free to submit a PR if you get impatient.

@heatherezell
Contributor

Hi @write0nly - following up on Nick's comment, was this work that you'd be interested in taking up and filing a PR for? Please let us know how we can help. Thanks!

@icy

icy commented Aug 3, 2022

> For the record, this seems to happen because the connection pool is too small by default (unset?). If we set max_idle_connections > max_parallel, the connections are not torn down and there is no churn. It has the obvious downside of keeping many connections open, but maybe max_parallel can be lowered too.
>
> The following settings worked flawlessly.
>
>     max_idle_connections = 256
>     max_parallel = 128
>
> IMHO this could become: 1- change the default so that max_idle_connections >= max_parallel 2- document this clearly on the postgresql backend page

Thanks for this. We have a small setup with the mysql backend and we faced the same issue. In our case, the following configuration also works smoothly:

    max_idle_connections = 10
    max_parallel = 5
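
For the mysql backend, a comparable sketch; the connection details below are placeholders, and the assumption (as reported in this comment) is that the same two tuning parameters apply:

storage "mysql" {
    # Placeholder connection details.
    address              = "dbhost:3306"
    username             = "vault_user"
    password             = "password"
    database             = "vault"

    # Same idea as the postgresql case: keep connections pooled rather
    # than opening and closing one per operation.
    max_idle_connections = 10
    max_parallel         = 5
}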
