dj: memory leak #978
Comments
Hey @K1773R, that's frustrating! I haven't noticed that. It's possible that upstart is killing and restarting them for me when RAM runs low, but I haven't observed that. Are you running any unusual Agents? I suspect it's some Agent that I don't use often. |
Are the ones I listed unusual agents? |
XMPP seems the most likely. If you disable it, does memory stabilize? |
It leaks even when there are no XMPP agents... |
That's very odd. @dsander, have you seen memory issues with DJ? |
@cantino I did not really monitor the memory usage on my server, but I doubt the issue is DJ itself. @K1773R debugging memory leaks in Ruby is a bit annoying; if you notice the leak within a few minutes, the quickest way would probably be to disable all but one agent type, restart your worker, and check if the leak still exists. If that is not an option and you are using 2.1+, I will try to add some sort of memory profiling option tomorrow. |
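A minimal sketch of that kind of periodic object counting, using only Ruby's built-in `GC` and `ObjectSpace` (the log path and the way it gets invoked are assumptions for illustration, not part of Huginn or any branch):

```ruby
# memory_snapshot.rb -- illustrative sketch, not part of Huginn.
# Appends one JSON line per snapshot; object counts that keep growing
# across snapshots point at retained (leaked) objects.
require 'json'
require 'time'

def log_memory_snapshot(path = 'log/memory_snapshots.log')
  GC.start # collect first, so only retained objects are counted
  snapshot = {
    time:    Time.now.utc.iso8601,
    rss_kb:  `ps -o rss= -p #{Process.pid}`.to_i, # resident set size, in KB
    objects: ObjectSpace.count_objects             # counts per object type
  }
  File.open(path, 'a') { |f| f.puts(snapshot.to_json) }
end

log_memory_snapshot # e.g. call this every 5 minutes from a scheduler hook
```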
I'm using Ruby 2.2.2p95 |
I may have just run into this (where 'just' == sometime early last evening). Two virtual CPUs, four job runners, and both jobs and the web user interface died with a signal 137 after staying online for something like four days running. I had to kill the delayed_job processes and restart from scratch to get Huginn running again. As an experiment I've cut my delayed job fleet to three to see how long it takes to run out of RAM. Ruby v2.1.5p273. |
Did this issue start recently? |
The hang-up happened some time last night (maybe 1900 hours PDT). I last updated my running install on 19 August (commit c3e9545), after which memory utilization started creeping up until something broke. Going through log/development.log, there isn't anything in there because the error message was emitted by the shell Huginn was started from. It didn't say which process received the out-of-memory signal, though. I can fire up a few more DJs to artificially use up RAM and see what part falls over if it would help. |
What commit were you running before this, that seemed more stable? And are you running in RAILS_ENV=production or RAILS_ENV=development? |
I was running commit 9e8f9bf in development mode. Still in development mode. |
I have a stack trace from two days ago, from approximately the time of the crash. I don't know if it'll help debug this issue (I'm working to reproduce it independently, which is why I jumped into this ticket's discussion thread) but I can throw it on pastebin if it'll help. |
Sure, let's see it. |
Well, we shouldn't be running Huginn in development mode for production use, but there was really very little code change between your two commits. The only thing we did that I could imagine affecting RAM was to remove Spring. You should ensure Spring isn't still running. |
Here it is: http://pastebin.com/TcBR22tq Removing Spring has done a world of good - things are much more stable now! I've been keeping an eye on it, and it appears to be gone for good. |
That looks like the box running out of RAM or disk. Was this correlated with ballooning RAM? If so, do you know how high it got? |
That was my analysis also - the amount of free RAM hit zero and the OOM killer took out a process. I don't know which one because this is all the kernel message buffer has in it:
So, some aspect of Huginn got killed but it's not clear which one it was. I'm trying to recreate this problem on my instance to help debug it, so I just captured the ruby processes running right now, and if one mysteriously dies I can compare the lists and see which one was the victim. |
@K1773R I updated my memory profile branch. If you import this scenario, the MemoryProfileAgent will count the allocated objects every 5 minutes and send the data to a PeakDetectorAgent; when a peak is detected, a memory dump will be written to the configured directory. It works with multiple background workers, but the results will probably look a bit strange because the job will run on a random worker process. To get a general overview (see how many objects are leaked over time) it would probably be best to run just one worker for a few hours. To analyze a memory dump, a small script is added to the bin directory: it gives an overview of how many objects have been allocated (and still are) in each garbage collection generation. I ran the branch overnight on my production server and also see a small leak of objects, but was not able to find the source; if you are running more agents the results might be more conclusive. |
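The actual agent and analysis script live in dsander's branch; as a rough sketch of the per-generation analysis described above, a heap dump written with `ObjectSpace.dump_all` can be grouped by GC generation roughly like this (file names and the invocation are assumptions):

```ruby
# analyze_heap.rb -- rough sketch, not the script from the memory profile branch.
# A dump can be produced with:
#   require 'objspace'
#   ObjectSpace.trace_object_allocations_start
#   ... run the workload ...
#   ObjectSpace.dump_all(output: File.open('tmp/memory_dump.json', 'w'))
# Each line of the dump is a JSON object; "generation" is the GC run in which
# the object was allocated. A generation whose count keeps growing between
# dumps is a good place to look for the leak.
require 'json'

generations = Hash.new(0)

File.foreach(ARGV.fetch(0, 'tmp/memory_dump.json')) do |line|
  begin
    obj = JSON.parse(line)
    generations[obj['generation'].to_i] += 1
  rescue JSON::ParserError
    next # skip malformed lines
  end
end

generations.sort.each do |generation, count|
  puts format('generation %5d: %8d objects', generation, count)
end
```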
@dsander switching to a single worker is currently not really an option. |
My branch is based on the current master branch and just adds the memory profile agent. As long as you delete that agent before you switch back there should not be any problems. |
Data!
dj.1 was killed with a signal 137 (out of memory). The `system` process detected this and sent SIGTERMs to the web server, scheduler, Twitter subsystem, and dj0.1 and dj2.1. However, not everything is responding to SIGTERM correctly. From my currently running process list:
|
How up to date is @dsander's fork of the codebase? |
I'm trying to recreate the OOM problem by force by manually starting DJs. I'm capturing stdout to a logfile. When I have something I'll comment on this issue. |
When jobs aren't running, memory utilization doesn't change (e.g., 485 MB free and holding). I'm up to 9 delayed job runners now with my usual agent load of approximately 750 agents. |
Results! I'll post what I got in separate Screen terminals in successive comments.
|
Nothing emitted on stderr. Nothing in ~/stderr.log. |
In one of the extra workers I'd spawned:
|
The same stack trace happened in all of the extra DJs I spawned to force this to happen. It killed almost all of the spares I spawned to test this; a few survived because they were running at the time the others died, but now that the scheduler's died it's sitting there waiting for things to do. Looking at the process list:
|
You need to watch the disk usage though; by default the PeakDetectorAgent in the scenario will trigger a dump for every detected peak. I ran out of space on my root disk once 😄 |
Thanks! Just added a monitor to keep an eye on that. |
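For anyone else running the profiling scenario, a trivial guard like the following is one way to avoid filling the disk with dumps; the dump directory and threshold are assumptions, and this is not part of the branch:

```ruby
# Returns false when the filesystem holding the dump directory is low on space.
# Relies on portable `df -Pk` output (sizes in 1 KB blocks, column 4 = free).
def enough_disk_for_dump?(dir = 'tmp/memory_dumps', min_free_mb = 500)
  free_mb = `df -Pk #{dir}`.lines.last.split[3].to_i / 1024
  free_mb >= min_free_mb
end

warn 'skipping memory dump, disk is nearly full' unless enough_disk_for_dump?
```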
I'm seeing the memory leak too after a few days running. I suspect it may be the cause of some of my crashes. According to #1114, this will happen with a completely empty agent too, which makes me suspect DJ or the scheduler? |
My instance (now running with the multithreaded scheduler) is doing the same thing. I've had two lockups in as many days. Until I purged the pending jobs this morning, I had a backlog of 7900 jobs waiting to run. I'm interested to see if commit 07e2d9c fixes it. |
I'm reading in the Heroku discussion forum that some people were experiencing memory leaks running the latest versions of Ruby. Has anybody tried running Huginn with Ruby 2.0.0? More info: |
The leak's happening in v2.1.5, too. |
Are you still seeing a steady climb of memory? |
Yes. However, JabberAgent's been locking up Huginn way more than OOMs have been killing Huginn lately (two weeks or thereabouts). |
@virtadpt thanks for telling me; I was just about to update Huginn. A lockup due to JabberAgent would be horrible for me. I'll try as soon as the problem with JabberAgent is fixed. |
Pretty sure it is not an issue with the Jabber agent in general; as far as I can remember we did not change it for at least a few months, and I am using it as a receiver (JabberAgent::Worker) without any issues. |
Insofar as the memory leak, I think the article that @jn11 linked to has a lot to do with it. I'm using JabberAgent as a sender, which might be the problem. But that's a separate issue (literally). |
Thought of this issue when I read this: How Ruby 2.2 can cause an out-of-memory server crash |
@brianpetro versions below 2.2 are affected too |
I think this would only happen if recursion was interrupted by a signal, so the GC got disabled? Then we wouldn't see a slow climb in memory, but probably a fairly quick one? But I could totally be wrong. If you're having memory issues, would you mind trying the newer version of Ruby and seeing if it helps? It'd be awesome if this fixed our issues. |
@cantino Jumping in here to help troubleshoot. As I was fiddling around with Huginn today I noticed that Celluloid 0.16.0 gets installed during setup. There is a known memory leak with that version of Celluloid, see celluloid/celluloid#455. You might look at requiring Celluloid version 0.15.2 or 0.17.2, see https://gist.github.com/gazay/3b518f72266b5a7e88ff
I also bookmarked a list of "leaky gems"; delayed_job and therubyracer, both of which are in the Huginn Gemfile, are listed. See https://github.com/ASoftCo/leaky-gems
When I set up Huginn today I noticed that delayed_job version 4.0.6 was installed -- that version is known to have memory leaks: collectiveidea/delayed_job#776
therubyracer was installed with gem version 0.12.2, which fixes the memory leak that was previously reported, AND ~> 0.12.2 is specified in the Gemfile, so that one should be fine. rubyjs/therubyracer#336
Hope that helps! |
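If someone wants to experiment with pinning those gems, the Gemfile entries might look roughly like this; the exact "fixed" versions are assumptions to be checked against each project's changelog, not tested recommendations:

```ruby
# Gemfile (excerpt) -- illustrative pins only, verify upstream before using.
gem 'celluloid',    '~> 0.17.2'   # 0.16.0 has the actor leak from celluloid#455
gem 'delayed_job',  '>= 4.1.0'    # 4.0.6 appears on the leaky-gems list
gem 'therubyracer', '~> 0.12.2'   # already pinned in Huginn; 0.12.2 fixes the old leak
```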
Thanks @wasafiri, this is very helpful! I'll try updating the Gemfile. |
I haven't had issues recently, but it wasn't affecting me much. @K1773R? |
I think so. I'm running the threaded job runner (have been since I got Huginn into production mode) and it's been remarkably stable since I disabled all of my JabberAgents in favor of BeeperAgent instances. |
@cantino I should have time for an upgrade at the weekend. I'll report back. |
I've been having memory leak issues for the past couple of months. I'm using the latest codebase on an ubuntu xenial box with 1 GB memory. The logs don't seem to show any errors. |
That's not good. I've seen my RAM usage climb once in the last month I think. Seems like there's still some subtle issue. |
I did some monitoring of my installation and it looks like the memory usage stabilized after a few days. The amount of RAM used for the twitter stream process still surprises me a bit.
|
How high does your RAM usage get @HyShai? I know that Ruby's generational garbage collector can take a while before it reaches a steady state, which is always higher than where it starts. |
delayed_job is using more and more memory over time. It starts at ~150 MB and rises forever, slowly though. As I have 16 workers, I'm affected much faster.
Any way to debug this? I'm currently just restarting the workers every X hours.
I'm using Web, Mail, MailDigest, XMPP, and Trigger agents, with Web + Trigger being the most used.
Can foreman handle memory usage of dj? I.e., restart a dj if memory is higher than X and no jobs are currently running? Restarting a worker after X jobs might also be an option.
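As far as I know foreman itself doesn't restart processes based on memory, so the "restart dj if mem is higher than X" idea would need a small wrapper; a sketch under stated assumptions (pid-file layout, threshold, and the supervising process are all assumptions, not Huginn's setup):

```ruby
# restart_fat_workers.rb -- illustrative sketch, not part of Huginn.
# Sends TERM to any delayed_job worker whose resident memory exceeds a limit;
# delayed_job traps TERM and exits after finishing its current job, which
# roughly matches "restart a dj if mem is higher than X and no jobs running".
MAX_RSS_MB = 300

def rss_mb(pid)
  `ps -o rss= -p #{pid}`.to_i / 1024
end

# assumes daemonized workers (script/delayed_job -n 16 start) writing one
# pid file per worker into tmp/pids/
Dir.glob('tmp/pids/delayed_job.*.pid').each do |pid_file|
  pid = File.read(pid_file).to_i
  usage = rss_mb(pid)
  next if usage < MAX_RSS_MB

  warn "worker #{pid} is using #{usage} MB, asking it to exit"
  Process.kill('TERM', pid)
  # whatever supervises the workers (upstart, systemd, a cron job re-running
  # `script/delayed_job start`) is expected to bring up a replacement
end
```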