Kernel Memory Leak: possible conflict between LuaJIT or OpenResty and the consul server agent #1476
jmealo
changed the title from
Service agent uses 400-500MB of ram in a 7 node, single data center configuration
to
Service agent uses 400-500MB of ram in a 7 node, single data center configuration with no local checks
Dec 8, 2015
jmealo
changed the title from
Service agent uses 400-500MB of ram in a 7 node, single data center configuration with no local checks
to
Server agent uses 400-500MB of ram in a 7 node, single data center configuration with no local checks
Dec 8, 2015
|
Hi @jmealo - the best way to see where the memory is going is to enable the agent's debug endpoints and pull a heap profile from /debug/pprof/heap. Consul 0.6 also has a number of fixes to improve memory usage, including a new in-memory database that's more efficient, as well as automatic reduction of large receive buffers for idle connections to the servers (these can definitely hit a few hundred megabytes if your servers have gotten a huge burst of RPC traffic). |
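(For reference, a minimal sketch of pulling that profile, assuming the agent's HTTP API is on the default 127.0.0.1:8500 and the agent config includes the enable_debug option; adjust addresses and paths to your own setup.)

    # With { "enable_debug": true } in the agent config, the runtime
    # profiling endpoints are exposed on the HTTP API. Grab a
    # human-readable heap profile from the local agent:
    curl -s "http://127.0.0.1:8500/debug/pprof/heap?debug=1" > heap.txt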
jmealo
commented
Dec 9, 2015
|
@slackpad I'll definitely be documenting this, as the OOM killer just took out one of my load balancers. This configuration worked flawlessly on 512 MB of RAM prior to introducing Consul. Now, even with a 1024 MB box, consul is able to take the whole node down. |
jmealo
commented
Dec 9, 2015
|
After running only 6 minutes, here's the output of /debug/pprof/heap?debug=1: node-2.txt |
|
That one looks pretty normal with objects built up in the state store after initialization and some other startup items - 10 megs worth of stuff. |
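(If the text dump isn't enough, the same endpoint can be fed to Go's pprof tool for an interactive view; a generic sketch that assumes a local Go toolchain and that the consul binary lives at /usr/local/bin/consul.)

    # Interactive heap analysis against the running agent; 'top' lists the
    # biggest live allocators inside the process.
    go tool pprof /usr/local/bin/consul "http://127.0.0.1:8500/debug/pprof/heap"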
jmealo
commented
Dec 9, 2015
|
@slackpad: I upgraded to 0.6.0 and the problem persists. Here's the output of /debug/pprof/heap?debug=1 after running for 8 hours 38 minutes: output.txt
Output of
Output of
Output of
Output of
|
|
Hi @jmealo - that latest dump looks pretty normal as well. |
jmealo
commented
Dec 10, 2015
|
This is on a different node, but it's still a 1GB node running NGINX and Consul:
consul:
nginx:
I finally have some cached memory on this one, rather than it all disappearing. I had my provider switch one of the nodes to a different hypervisor. If the problem doesn't persist, I'll close the ticket. So far it's been like clockwork. |
|
Ok cool - yeah these numbers all look pretty good - thanks for providing all the debug info! |
jmealo
commented
Dec 10, 2015
|
@slackpad We just hit 92% memory usage again. Attached are the outputs of all of the commands I've provided before, as well as consul_debug.txt |
|
Hmm - I still don't see anything that points to Consul (or any of your other processes for that matter):
That's showing an RSS for Consul of about 38 MB. And that roughly matches your debug logs:
Though from looking at the debug log it had just done a Raft snapshot, so it may have freed up some stuff. What is your access pattern like against Consul - do you have periods of huge write volumes in a short time window? |
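(For anyone comparing numbers later in the thread, the per-process figures being discussed here can be gathered with standard tools; nothing below is specific to Consul.)

    # Resident set size the kernel reports for the consul process:
    grep VmRSS /proc/$(pidof consul)/status

    # The same total broken down by mapping (the "pmap output" referenced below):
    pmap -x $(pidof consul) | tail -n 1

    # System-wide view, to check whether the missing memory is attributable to any process at all:
    free -m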
jmealo
commented
Dec 10, 2015
|
@slackpad: I don't think so, we're not heavily using it right now. We have 7 nodes and 48 health checks total. |
jmealo
commented
Dec 10, 2015
|
Judging by the fact that we only have this issue on the load balancers, I'm getting ready to close this ticket. All 7 nodes are running as server agents with one of the load balancers typically being the leader. Would being a leader cause a drastic change in shared memory usage? Does this look like a kernel issue? I just don't know how to figure out where the memory is going. |
|
Not drastic - all the servers will make the same changes to their data store. Writes and some queries will go to the leader so it'll have more active connections and a little more GC churn, but shouldn't be way out of family with the other Consul servers. |
jmealo
commented
Dec 10, 2015
•
|
On the bright side, the memory usage seems to have improved greatly in 0.6.x. |
jmealo
closed this
Dec 10, 2015
jmealo
commented
Dec 10, 2015
|
@slackpad Thanks for all of your help/patience. This is what the memory usage looks like: it presents itself like a classic memory leak; however, it doesn't seem to be driven by requests or traffic, but rather by time. Since consul runs health checks at regular intervals, that alone might create enough traffic to influence the graph. |
|
If the health checks have text output that varies from run to run (like timestamps) and/or a lot of output, that could cause a lot of churn in the Raft log, even if the checks are always passing. That might be something to check. If there was a lot of churn in the health checks (and if they run pretty often), it could build up a big Raft log that eventually gets compacted during a snapshot. The Raft log would hold each delta change with the new check output, and when that gets compacted you'd only be left with the latest update to the check in the in-memory data store. I'm not sure that the big Raft log would occupy a ton of RAM, though, so this is kind of speculative. One final thing: you could dump telemetry on one of the Consul servers and we could look for fishiness there - https://www.consul.io/docs/agent/telemetry.html. Be aware that this might include some service names and other info, so be careful if you post it. |
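(A sketch of the telemetry dump mentioned above, based on the linked telemetry docs: the agent dumps its current metrics when it receives SIGUSR1. Note the later comment in this thread that the dump does not go through the syslog writer, so look wherever the agent's own output stream is captured.)

    # Ask the running agent to dump its current telemetry counters and samples:
    kill -USR1 $(pidof consul)
    # The dump is written to the consul process's own stdout/stderr rather than
    # the syslog writer, so check however that stream is being captured.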
jmealo
reopened this
Dec 15, 2015
jmealo
commented
Dec 15, 2015
|
@slackpad: It certainly seems like it is raft using up the memory (see output below). I'm not sure why
|
jmealo
closed this
Dec 15, 2015
jmealo
reopened this
Dec 15, 2015
jmealo
commented
Dec 15, 2015
|
Here is the
Here is the
And another node that does not have this behavior:
|
jmealo
commented
Dec 15, 2015
|
@slackpad |
|
@jmealo yes for telemetry it won't go to the syslog writer, it'll be written to your Consul server's |
jmealo
changed the title from
Server agent uses 400-500MB of ram in a 7 node, single data center configuration with no local checks
to
Server agent causes OOM error on Ubuntu 14.04 when run in conjunction with OpenResty (NGINX)
Dec 16, 2015
jmealo
commented
Dec 16, 2015
|
@slackpad: It's been 1 day and 20 minutes since I disabled consul on my cluster and we've experienced no memory issues. Prior to disabling consul I had to reboot the servers every 6 hours to clear memory. I'll rest easier if we let it go another couple of days, but this configuration worked for 2-3 months without Consul and removing it seems to immediately solve the problem, so I can say with some certainty that Consul is the issue. I don't know why only the load balancers have this issue; the only difference in their configuration is that the allowed number of open files is significantly higher, which could possibly be allowing something to leak handles. If it helps at all, this happened in both 0.5.x and 0.6.x, so the large changes in 0.6.x probably don't have anything to do with it. |
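(Since the only configuration difference called out is the open-file limit, one cheap thing to watch for a handle leak is the descriptor count over time; a generic sketch, not specific to this setup.)

    # Current descriptor counts for the consul process and one nginx process:
    ls /proc/$(pidof consul)/fd | wc -l
    ls /proc/$(pidof -s nginx)/fd | wc -l

    # The ceiling those counts are allowed to grow toward:
    grep 'Max open files' /proc/$(pidof consul)/limits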
|
@jmealo do you have any updates on this one? I think it's unusual in general to run application stuff like load balancers on your Consul server nodes; typically folks want to keep them isolated. In the pmap outputs above the RSS values for Consul look to be ~40 MB, which is pretty reasonable. |
jmealo
commented
Jan 8, 2016
|
@slackpad Thank you for following up. As excited as we were about consul, we ended up reimplementing everything we needed from it in Node in about a day, and we've had zero infrastructure issues since we got rid of Consul. My only update is that Consul was the cause of the OOM issues. I guess it's safe to close this ticket; however, the problem exists and causes catastrophic failures that could take every node in a cluster down one-by-one within a few minutes of each other if consul is started at the same time on each node where this condition is present. This would obviously be bad, and it would also likely require the user to redo the bootstrap process. That might be a good thing, because if the cluster goes down and comes back up and consul isn't working, it won't take the cluster down again. We sank a couple of weeks of troubleshooting and hours of downtime in production into this, and I hope that nobody else goes down that rabbit hole. If you think of it, whenever the problem is finally solved or encountered on your end, please ping me :-) |
jmealo
closed this
Jan 8, 2016
|
I'm glad you found a solution @jmealo. If you ever get around to it we'd love to look into this further with you, but if you're happy with your current solution it's best to leave it as-is. We work personally (1-on-1) with a handful of the largest websites out there running gigantic Consul clusters, and in the past we have found memory leaks that we've fixed straight away (none recently to my knowledge). These clusters generally run with a long uptime and their memory usage is stable. They use every feature of Consul (at least prior to 0.6; folks are still adopting 0.6 features) and have never experienced memory issues. None of this means that you didn't have a real issue, but I'm confident that in the general case Consul is extremely stable and memory is well under control. There may have been something unique about your setup we missed through the back and forth here, though. I think @slackpad was correct overall and I mirror his view that all your debug output seems to point to Consul using a very reasonable amount of memory. So we must be missing some data somewhere. Good luck and let us know if we can help in the future. (EDIT: sorry "james" on GitHub for pinging you with this, I meant @slackpad) |
jmealo
commented
Jan 8, 2016
|
@mitchellh I'd really like to see this fixed. My Node solution simply parses my consul health checks and runs them using Node instead of Consul. The health checks are unchanged except for one that checked a socket.io server by requesting a specific URL. I can say with high confidence that the health checks were not the issue, as they are running from Node without issue. As part of our consul troubleshooting, the socket.io test was disabled without relief. The issue at hand here is that consul/go-raft/Go was causing a memory leak that:
My primary concern is that another end user will go absolutely bonkers trying to figure this out. I myself was convinced that consul was not the issue based on our troubleshooting. I went to DigitalOcean, KVM, the Linux kernel, NGINX, OpenResty, and consul trying to figure out the memory usage issue, to no avail. If I can get a box set up and recreate the conditions, can you delegate engineering time to work on this? I'm not sure if I can reproduce it without the other 7 nodes in the cluster, but it might be worth a try. The stack that caused this will be open sourced in the not so distant future (a year tops). I may be able to provide bootstrap scripts for an entire cluster identical to this one at some point if you want to test it internally. Thanks, EDIT: Additional info on largest health check response and wording. |
|
@jmealo I'll go ahead and kick this one back open. If you can get a repro environment that won't disrupt your operations and that we can poke at, I can give you some cycles to try to see what's going on. Appreciate you offering to set that up! If you had a health check that was logging on the order of megabytes and churning, it could possibly cause some bad GC behavior, depending on how often it was running. That would be something to take a look at. With regards to a richer HTTP check, we've gotten requests for that but have generally avoided adding more features in favor of people just scripting their own checks. |
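(On the check-output churn point above, a minimal sketch of a script check whose output stays constant between runs, so repeated passes don't write new deltas into the Raft log each interval; the path and URL are hypothetical. Consul treats exit 0 as passing, 1 as warning, and anything else as critical.)

    #!/bin/sh
    # /usr/local/bin/check_nginx.sh (hypothetical) - fixed output, no timestamps,
    # so a passing check looks identical from one interval to the next.
    if curl -fsS -o /dev/null "http://127.0.0.1/"; then
        echo "nginx ok"
        exit 0
    else
        echo "nginx not responding"
        exit 2
    fi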
slackpad
reopened this
Jan 8, 2016
jmealo
commented
Apr 7, 2016
|
@slackpad @mitchellh From the looks of it, this was a kernel memory leak caused by non-process code (which is why the memory could not be attributed to any process). That being said, using OpenResty and consul together can reproduce the issue; using either in isolation does not. The issue did not happen when running vanilla NGINX with consul. The only difference is that OpenResty is serving 1-2k connections while NGINX was in the low hundreds. Since both the primary and failover load balancers would fail after a set amount of time regardless of connections or load, I think it's safe to say that the network traffic has nothing to do with it. In the interest of saving everyone time (unless someone wants to learn how to diagnose kernel memory leaks in KVM on a cloud provider), I'd be willing to give it another go when Ubuntu 16.04 LTS is released. Does that sound like a good plan? |
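(For whoever revisits this: kernel memory that never shows up in any process RSS usually surfaces in the slab accounting, so a generic starting point for the next attempt would be something like the following.)

    # Kernel slab usage; an unattributed leak often shows up as SUnreclaim
    # climbing steadily over time:
    grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo

    # Which slab caches are growing (needs root):
    slabtop -o | head -n 20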
jmealo
changed the title from
Server agent causes OOM error on Ubuntu 14.04 when run in conjunction with OpenResty (NGINX)
to
Kernel Memory Leak: possible conflict between LuaJIT or OpenResty and the consul server agent
Apr 11, 2016
|
Hi @jmealo thanks for the update - that sounds like a good plan! |
|
Haven't heard more about this one so we will close it out for now, but please let us know if you have more information or need help. |
jmealo commented Dec 8, 2015
This is not related to #1060.
I'm running the official linux x64 binary release of v0.5.2 on Ubuntu 14.04. I have a single datacenter cluster with seven nodes (all running as server agents).
Here is an example of one of the misbehaving server agents:
cat /proc/$(pidof consul)/status
consul --version:
/etc/consul.d/server/config.json
{ "server": true, "datacenter": "redacted", "data_dir": "/var/consul", "encrypt": "redacted", "log_level": "INFO", "enable_syslog": true, "retry_join": [ "redacted", "redacted", "redacted", "redacted", "redacted", "redacted", ], "bind_addr": "redacted", "ui_dir": "/var/www/consul" }/etc/consul.d/server/nginx.json
{ "service": { "name": "nginx", "tags": ["www"], "port": 80 } }uname -a: