Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vault leaking file descriptors in Standby Mode #9276

Closed
pierresouchay opened this issue Jun 20, 2020 · 9 comments
Closed

Vault leaking file descriptors in Standby Mode #9276

pierresouchay opened this issue Jun 20, 2020 · 9 comments
Labels
bug Used to indicate a potential bug core Issues and Pull-Requests specific to Vault Core

Comments

@pierresouchay
Copy link

pierresouchay commented Jun 20, 2020

Describe the bug

In standby mode, Vault is leaking descriptors (while the leader is fine)
This happens only in one of our clusters, but vault is leaking around 10 fd/s until it reaches its limit (16k).

The file descriptors are not shown with ss command but can be shown with lsof -U vault:

vault   1353 vault *190u     sock                0,9       0t0    110799 protocol: TCP
vault   1353 vault *191u     sock                0,9       0t0    110801 protocol: TCP
vault   1353 vault *192u     sock                0,9       0t0    105365 protocol: TCP
vault   1353 vault *193u     sock                0,9       0t0    105367 protocol: TCP
vault   1353 vault *194u     sock                0,9       0t0    110803 protocol: TCP
vault   1353 vault *195u     sock                0,9       0t0    105369 protocol: TCP
vault   1353 vault *196u     sock                0,9       0t0    110805 protocol: TCP
sudo lsof -u vault|wc -l
16388

99% of them are showing "protocol: TCP":

sudo lsof -u vault|grep "protocol: TCP"|wc -l
16365

To Reproduce
Until unsealed, no problem, but when unsealling the vault, it starts leaking FDs.

Expected behavior
A clear and concise description of what you expected to happen.

Environment:

Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    4
Threshold       2
Version         1.4.0-criteo1
Cluster Name    vault-cluster-2c1f174d
Cluster ID      a238e897-cf90-684d-90e9-0030d0a6de47
HA Enabled      true
HA Cluster      https://consulkv06-par.central.criteo.prod:8201
HA Mode         active

Vau

Vault server configuration file(s):

{
  "api_addr": "http://mymachine.example.acme.org:8200",
  "disable_mlock": true,
  "ui": true,
  "plugin_directory": "/var/opt/vault/plugins/",
  "listener": {
    "tcp": {
      "address": "0.0.0.0:8200",
      "tls_disable": "true"
    }
  },
  "storage": {
    "consul": {
      "address": "http://localhost:9500",
      "token": "xxxx-xxxx-xxxx-xxxxx-xxxxx"
    }
  },
  "ha_storage": {
    "consul": {
      "address": "http://localhost:9500",
      "path": "ha-vault",
      "token": "xxxx-xxxxx-xxxx-xxxx-xxxx-xxx"
    }
  },
  "telemetry": {
    "prometheus_retention_time": "24h",
    "disable_hostname": true
  }
}

Looks a bit similar to #3633 and #3244

@ncabatoff
Copy link
Contributor

Hi @pierresouchay,

This looks like a custom build/fork of Vault. Can you reproduce this with one of our binaries?

@pierresouchay
Copy link
Author

@ncabatoff We are still investigating, we will keep you posted
cc @tionebsalocin

@ncabatoff ncabatoff added bug Used to indicate a potential bug core Issues and Pull-Requests specific to Vault Core waiting-for-response labels Jul 6, 2020
@ncabatoff
Copy link
Contributor

I wonder if this is related to #9383?

@pierresouchay
Copy link
Author

@ncabatoff Indeed that's possible we are using plugins!

@sorindumitru
Copy link
Contributor

@ncabatoff @pierresouchay I ran my reproducer locally and this does seem to be the case. After each Setup() failure for a plugin I get the following extra fds in vault:

vault   2730652 sdumitru   11r     FIFO               0,13       0t0 8877065 pipe
vault   2730652 sdumitru   12u     IPv4            8877068       0t0     TCP localhost:14505->localhost:50884 (ESTABLISHED)
vault   2730652 sdumitru   13r     FIFO               0,13       0t0 8877066 pipe
vault   2730652 sdumitru   14u     unix 0x000000005e240baf       0t0 8877071 /tmp/plugin047720296 type=STREAM
vault   2730652 sdumitru   15u     unix 0x00000000caae83d0       0t0 8875129 type=STREAM
vault   2730652 sdumitru   16u     unix 0x00000000b7849ca2       0t0 8873950 /tmp/plugin047720296 type=STREAM

The easiest way to reproduce it for me was to run with TRACE logging, an external authentication plugin and this patch to vault:

diff --git a/sdk/plugin/middleware.go b/sdk/plugin/middleware.go
index 04a6f4c50..12da9dbd4 100644
--- a/sdk/plugin/middleware.go
+++ b/sdk/plugin/middleware.go
@@ -2,6 +2,7 @@ package plugin

 import (
        "context"
+       "fmt"
        "time"

        log "github.com/hashicorp/go-hclog"
@@ -81,13 +82,23 @@ func (b *backendTracingMiddleware) InvalidateKey(ctx context.Context, key string
        b.next.InvalidateKey(ctx, key)
 }

+var fail int = 0
+
 func (b *backendTracingMiddleware) Setup(ctx context.Context, config *logical.BackendConfig) (err error) {
        defer func(then time.Time) {
                b.logger.Trace("setup", "status", "finished", "err", err, "took", time.Since(then))
        }(time.Now())

        b.logger.Trace("setup", "status", "started")
-       return b.next.Setup(ctx, config)
+       err = b.next.Setup(ctx, config)
+
+       if b.Type() == logical.TypeCredential {
+               if fail >= 1 {
+                       err = fmt.Errorf("just giving up")
+               }
+               fail = fail + 1
+       }
+       return err
 }

 func (b *backendTracingMiddleware) Type() logical.BackendType {

I used that fail condition to test a bunch of cases. With this i just curl the authentication endpoint for my plugin and that triggers a leak.

@ncabatoff
Copy link
Contributor

Further evidence: lsof -U shows only unix domain sockets, which IIRC are only used in Vault for talking to plugins.

@pierresouchay
Copy link
Author

@ncabatoff yes, same here

@pierresouchay
Copy link
Author

@ncabatoff The explanation from @sorindumitru looks very plausible (see #9276 (comment) )
Do you need more information to move forward on this?

@ncabatoff
Copy link
Contributor

Nope, we already did. There was a CLA snafu so the fix PR was resubmitted as #9557. Should be in 1.5.1. I guess I'll close this issue, as we agree this should be the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug core Issues and Pull-Requests specific to Vault Core
Projects
None yet
Development

No branches or pull requests

4 participants