
"Too many open files" on Vault #867

Closed
brockoffdev opened this issue Dec 16, 2015 · 13 comments


@brockoffdev commented Dec 16, 2015

Hi All-

Strange issue I recently ran into with Vault (v0.4.0) and Consul (v0.5.2). The system had been up and running for a solid month (save for the downtime to upgrade to 0.4.0) with no problems until today. This morning we noted an inability to reach the Vault servers, and upon checking the logs we saw a large number of errors of the form "too many open files":

16:48:52 http: Accept error: accept tcp [::]:8200: accept4: too many open files; retrying in 1s

We were able to rectify the issue simply by restarting Vault (which transitioned mastership to our backup Vault master machine, so failover worked very well), but we are still working to find a root cause. We noted that this appears to be a Consul connection-limit issue (see: hashicorp/consul#688), but were hoping someone here may have run into it before. Any thoughts?

@jefferai (Member) commented Dec 16, 2015

Hi @brockoffdev,

We've seen this in the past when very large numbers of leases all expiring at once tried to open huge numbers of connections to Consul, but we actually implemented a connection pool in the Consul physical backend to prevent that. So offhand I don't have a good idea.
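
For illustration only, here is a minimal sketch of the general pattern such a pool relies on (a counting semaphore that caps concurrent backend operations); this is an assumption about the shape of the fix, not Vault's actual Consul backend code:

```go
// Minimal sketch of a bounded "connection pool": a counting semaphore caps
// how many backend operations may be in flight at once, so a burst of
// expiring leases can't exhaust file descriptors. Illustrative only; not
// Vault's actual Consul physical backend code.
package main

import (
	"fmt"
	"sync"
)

type boundedPool struct {
	sem chan struct{} // capacity == max concurrent backend operations
}

func newBoundedPool(max int) *boundedPool {
	return &boundedPool{sem: make(chan struct{}, max)}
}

// do acquires a slot, runs the backend operation, then releases the slot.
func (p *boundedPool) do(op func()) {
	p.sem <- struct{}{}        // blocks while max operations are in flight
	defer func() { <-p.sem }() // release the slot
	op()
}

func main() {
	pool := newBoundedPool(8) // e.g. at most 8 concurrent Consul requests

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			pool.do(func() {
				// placeholder for a Consul read/write over one connection
				fmt.Println("lease expiration", i, "handled")
			})
		}(i)
	}
	wg.Wait()
}
```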

Raising the ulimit for the user running Vault may help, but first it would be nice to understand what the open files are and where they are coming from. Various combinations of lsof flags may help a lot; if your Vault user is vault, try:

  • lsof -a -i -uvault
  • lsof -a -U -uvault
  • lsof -uvault

to see open network connections, open Unix sockets, and all open files by that user, respectively. Hopefully that will find something immediately apparent; even if it's Vault itself, it would be helpful to figure out where Vault is receiving connections from (which aren't being dropped/closed) or making connections to (potentially too many at a time).
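
If it helps to watch the count over time without re-reading lsof output, one option on Linux is to count the entries under /proc/<pid>/fd. A small hypothetical helper (takes the target PID as its only argument, roughly equivalent to `lsof -p <pid> | wc -l`):

```go
// fdcount: counts a process's open file descriptors on Linux by listing
// /proc/<pid>/fd. Illustrative sketch only.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	entries, err := os.ReadDir(filepath.Join("/proc", os.Args[1], "fd"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading fd dir:", err)
		os.Exit(1)
	}
	fmt.Printf("pid %s has %d open file descriptors\n", os.Args[1], len(entries))
}
```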

Let me know what you find out!

@brockoffdev (Author) commented Dec 16, 2015

Ah, so interesting, it does appear that Vault is establishing a pretty large number of TCP connections:

vault     17229 vault  136u  IPv6 2582322      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3746 (ESTABLISHED)

vault     17229 vault  137u  IPv6 2582324      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3747 (ESTABLISHED)

vault     17229 vault  138u  IPv6 2582326      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3748 (ESTABLISHED)

vault     17229 vault  139u  IPv6 2582328      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3749 (ESTABLISHED)

vault     17229 vault  140u  IPv6 2582330      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64015 (ESTABLISHED)

vault     17229 vault  141u  IPv6 2582331      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64016 (ESTABLISHED)

vault     17229 vault  142u  IPv6 2582388      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3761 (ESTABLISHED)

vault     17229 vault  143u  IPv4 2582419      0t0  TCP ip-10-10-3-86.ec2.internal:50248->205.251.242.222:https (ESTABLISHED)

vault     17229 vault  144u  IPv4 2582422      0t0  TCP ip-10-10-3-86.ec2.internal:50249->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  145u  IPv4 2582425      0t0  TCP ip-10-10-3-86.ec2.internal:50250->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  146u  IPv4 2582428      0t0  TCP ip-10-10-3-86.ec2.internal:50251->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  147u  IPv4 2582431      0t0  TCP ip-10-10-3-86.ec2.internal:50252->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  148u  IPv4 2582434      0t0  TCP ip-10-10-3-86.ec2.internal:50253->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  149u  IPv4 2582437      0t0  TCP ip-10-10-3-86.ec2.internal:50254->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  150u  IPv4 2582440      0t0  TCP ip-10-10-3-86.ec2.internal:50255->205.251.242.200:https (ESTABLISHED)

vault     17229 vault  151u  IPv6 2582544      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-0-171.ec2.internal:3781 (ESTABLISHED)

vault     17229 vault  152u  IPv6 2582588      0t0  TCP ip-10-10-3-86.ec2.internal:8200->ip-10-10-2-188.ec2.internal:64054 (ESTABLISHED)

vault     17229 vault  153u  IPv4 2582633      0t0  TCP ip-10-10-3-86.ec2.internal:41462->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  154u  IPv4 2582636      0t0  TCP ip-10-10-3-86.ec2.internal:41463->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  155u  IPv4 2582639      0t0  TCP ip-10-10-3-86.ec2.internal:41464->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  156u  IPv4 2582642      0t0  TCP ip-10-10-3-86.ec2.internal:41465->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  157u  IPv4 2582645      0t0  TCP ip-10-10-3-86.ec2.internal:41466->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  158u  IPv4 2582648      0t0  TCP ip-10-10-3-86.ec2.internal:41467->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  159u  IPv4 2582651      0t0  TCP ip-10-10-3-86.ec2.internal:41468->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  160u  IPv4 2582654      0t0  TCP ip-10-10-3-86.ec2.internal:41469->72.21.214.99:https (ESTABLISHED)

vault     17229 vault  161u  IPv4 2582661      0t0  TCP ip-10-10-3-86.ec2.internal:46620->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  162u  IPv4 2582664      0t0  TCP ip-10-10-3-86.ec2.internal:46621->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  163u  IPv4 2582667      0t0  TCP ip-10-10-3-86.ec2.internal:46622->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  164u  IPv4 2582670      0t0  TCP ip-10-10-3-86.ec2.internal:46623->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  165u  IPv4 2582673      0t0  TCP ip-10-10-3-86.ec2.internal:46624->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  166u  IPv4 2582676      0t0  TCP ip-10-10-3-86.ec2.internal:46625->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  167u  IPv4 2582679      0t0  TCP ip-10-10-3-86.ec2.internal:46626->iam.amazonaws.com:https (ESTABLISHED)

vault     17229 vault  168u  IPv4 2582682      0t0  TCP ip-10-10-3-86.ec2.internal:46627->iam.amazonaws.com:https (ESTABLISHED)

It should be noted that ip-10-10-3-86.ec2.internal references both the Vault master and the consul agent running on the Vault box; the vault user has an unlimited ulimit. But 168 connections open...

@jefferai (Member) commented Dec 16, 2015

My main takeaway was "unlimited ulimit and they're still seeing this...?" :-)

168 seems a lot-ish, but ulimit for a normal user is usually (these days) 2048 or 4096...so there's certainly no slam-dunk here.

What about non-Vault processes on the box? If the user has no limit, it's possible that the box itself has run out of file handles.
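
For reference, a rough way to compare the per-process descriptor limit with system-wide file-handle usage on Linux is to read RLIMIT_NOFILE and /proc/sys/fs/file-nr; a hedged, Linux-only sketch:

```go
// Rough sketch (Linux-only): print the current process's RLIMIT_NOFILE and
// the system-wide open-file counters from /proc/sys/fs/file-nr, to help
// decide whether a per-user ulimit or the whole box is the bottleneck.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		os.Exit(1)
	}
	fmt.Printf("per-process fd limit: soft=%d hard=%d\n", rlim.Cur, rlim.Max)

	// /proc/sys/fs/file-nr: allocated handles, free handles, system-wide max.
	data, err := os.ReadFile("/proc/sys/fs/file-nr")
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading file-nr:", err)
		os.Exit(1)
	}
	fmt.Printf("system-wide (allocated free max): %s", data)
}
```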

@brockoffdev (Author) commented Dec 16, 2015

Thanks @jefferai, appreciate the help, btw!

Nope, the vault user is by far using the most connections; nothing else is even coming close. I'll keep monitoring for the time being. :)

@jefferai (Member) commented Dec 16, 2015

OK. Try using whatever tools you can think of to find where the large numbers of open files, Unix sockets, or IP sockets on the system are coming from. I'm fine assuming it's Vault until proven otherwise, but I can't think of a change in 0.4 offhand that should cause any such problem, and in fact we explicitly put changes into 0.4 to prevent this kind of problem (for the consul and etcd backends). So I'm currently mystified.

@brockoffdev (Author) commented Dec 16, 2015

Yeah, definitely a curious bug. I wish I could go back and see if 0.3.1 had any of these same issues, but I know this had not come up prior to the 0.4.0 upgrade.

@brockoffdev (Author) commented Dec 16, 2015

@jefferai so, very curious thing...I noticed that most of the open connections were being maintained by the AWS backend on Vault (see: iam.amazonaws.com, 205.251.242.200). Just as a test (since our aws mount isn't critical), I decided to unmount our aws mount point, restart Vault, and remount the aws mount point.

Sure enough, the issue seems to be rectified! I'm going to keep an eye on the issue moving forward just to be sure, but it would seem that something with the aws backend caused the connection pool to become overloaded. Thoughts?

@jefferai (Member) commented Dec 16, 2015

There are only two things that changed specific to the AWS backend during the 0.4 release cycle:

  1. We changed how the http client was being created to fix #700 -- a problem you're very aware of. I believe you have been using the custom binary I created, which already had these changes in it, and if so, that likely eliminates this as the problem.
  2. We updated to Amazon's new official aws-sdk-go, which hit 1.0 and had very many internal changes.

I would bet a decent sum that (2) is the cause of your problem. I can tell you that since 0.4 was released they've already gone from 1.0.2 to 1.0.5, so they're updating it at a good clip. Unfortunately the release notes only list feature improvements, not bug fixes.

If you have a Go dev environment, I'd be interested in seeing how things behave with 1.0.5 -- I could bump the dep in our tree if you like. Or, I could make you a binary with a newer version of their SDK.

@brockoffdev (Author) commented Dec 16, 2015

@jefferai thanks much for that. Let's keep that idea on hold for now. I've been doing some testing against the newly created AWS backend and all seems to be fine. The connection pool has reached at most 45 connections, down considerably from the 165 I saw earlier...and the 245 at its peak. It seems like the aws-sdk-go client may be a good place to investigate...

@jefferai (Member) commented Dec 17, 2015

OK. I'll keep this issue open; if you think it's resolved feel free to close it (or let me know and I'll close it).

@jefferai (Member) commented Dec 17, 2015

@brockoffdev While it could be a combination of multiple factors, it's actually possible it's related to (1) :-(

See fsouza/go-dockerclient#434 -- it seems that Golang doesn't properly close idle connections when a client goes out of scope/is GC'd. I'm trying to get more info from the reporter.
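
To illustrate the failure mode: if code creates a fresh http.Client (and thus a fresh Transport) per call and never closes idle keep-alive connections, each transport can leave sockets open until it is finalized. Below is a minimal sketch of the reuse-and-cleanup pattern that avoids this; it is not the actual go-dockerclient or aws-sdk-go code, and the endpoint is just an example taken from the lsof output above:

```go
// Sketch of the usual fix for leaked idle connections: reuse one
// http.Client/Transport for all requests instead of creating a new one per
// call, and explicitly drop idle keep-alive connections when done.
package main

import (
	"io"
	"net/http"
	"time"
)

// A single shared client: its Transport keeps a bounded pool of keep-alive
// connections that are reused across requests rather than leaked.
var sharedClient = &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 4,
		IdleConnTimeout:     90 * time.Second,
	},
}

func fetch(url string) error {
	resp, err := sharedClient.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Draining the body lets the connection return to the idle pool
	// instead of being abandoned.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	for i := 0; i < 10; i++ {
		_ = fetch("https://iam.amazonaws.com/") // example endpoint from the lsof output
	}
	// When the client is no longer needed, close its idle connections
	// explicitly rather than relying on garbage collection.
	sharedClient.Transport.(*http.Transport).CloseIdleConnections()
}
```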

@jefferai closed this in bf246af Dec 17, 2015
@jefferai (Member) commented Dec 17, 2015

@brockoffdev If that is the case it would seem pure luck that you didn't hit it with the 0.3.1 build.

@alexw23 commented Dec 24, 2015

I'm getting this issue also.
