Memory-Related Outage on really long running tasks mixed with ActiveRecord batch creations #1759

Closed
HoneyryderChuck opened this Issue Jun 4, 2014 · 7 comments

HoneyryderChuck commented Jun 4, 2014

I've recently migrated from Resque to Sidekiq as my background job queue. The multi-threaded model fits my needs quite well, but I'm running into some issues with my tasks, which can be quite long and do a lot of work. This is more or less the summary:

(By the way, I'm running the default 25 parallel threads.)

  • I pass a list of hostnames to my worker.
  • The worker starts and uses another library to connect concurrently to the remote devices through SSH (this library uses celluloid-io).
  • After the connections are established, a lot of information is fetched and stored in collections (in memory).
  • I also connect to certain databases to fill the information gaps.
  • After I have fetched all the information I need, I build AR objects with it and start batch-saving them.

The first problem is memory usage. Every time I start a job, it needs a lot of memory to proceed. I keep the data in memory before I send it to the DB (to make the write transactional), and it has to stay there for a variable amount of time. Meanwhile, other tasks may be triggered, and suddenly I have n tasks bloating my system. At some point the process disappears (does Sidekiq do this automatically, after reaching some memory threshold?).

Second, I haven't been able to figure out the cause. I've read somewhere that Sidekiq had some memory leak issues (I'm using 2.17.7, by the way), but I don't know whether the problem lies instead in other libraries I might be using. The concern here is less thread safety than libraries that leak memory. How can I best check what is bloating the memory in a currently running Sidekiq process? Is there a way or a tool to inspect the running process and see which data or file descriptors can't be garbage collected? I mean, the problem might start with leaks caused by me :)
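One low-tech way to look for retained objects from inside the process is to compare `ObjectSpace.count_objects` before and after a piece of work. The following is a minimal sketch, not a profiler: the helper name `object_count_delta` and the `retained` array (which simulates a leak) are made up for illustration, and the counts only say which object types grew, not who holds them.

```ruby
# Runs the block and returns [block_result, delta], where delta maps each
# object type (e.g. :T_STRING) to the change in its slot count.
def object_count_delta
  GC.start                              # start from a freshly collected heap
  before = ObjectSpace.count_objects
  result = yield
  GC.start                              # drop garbage so only retained objects remain
  after = ObjectSpace.count_objects

  delta = {}
  after.each do |type, count|
    change = count - before.fetch(type, 0)
    delta[type] = change unless change.zero?
  end
  [result, delta]
end

# Hypothetical leak: strings that stay reachable after the work is done.
retained = []

_, delta = object_count_delta do
  10_000.times { |i| retained << "host-#{i}" }
end

# Show the object types that grew the most, ignoring heap bookkeeping entries.
delta.reject { |type, _| [:TOTAL, :FREE].include?(type) }
     .sort_by { |_, change| -change }
     .first(5)
     .each { |type, change| puts "#{type}: #{change}" }
```

Wrapping a single job run in such a helper gives a rough idea of whether strings, arrays, or data objects from native extensions are the ones piling up.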

Third, I'm quite suspicious that AR is my main offender here. Every job can potentially create x jobs plus associations and so on, and batch-save them (taking counter caches and other callbacks into account). And I don't know, ActiveRecord is not very good at this, I guess. I mean, this is the main reason one learns never to call Model.all carelessly. So I'm also looking for a best practice for such cases. How can I create batches of interrelated objects, make the whole process transactional and (most importantly) not bloat memory?

Owner

mperham commented Jun 4, 2014

Memory bloat is something many Rails applications suffer from. We at The Clymb have a serious case of bloat in our pumas and sidekiqs too. Our workaround: have monit monitor the amount of memory consumed and restart the process if it reaches a certain threshold.

We've never tracked down the cause of our bloat as there are no good tools for tracking down CPU or memory problems in MRI. You are better off moving to JRuby for the JVM's great toolset.

BTW when reporting problems like this, you need to supply version numbers for your entire stack. Ruby 1.9 is very different from Ruby 2.1, Rails 3.0 vs 4.1, etc.

Author

HoneyryderChuck commented Jun 4, 2014

Sorry, I completely forgot that. Currently I'm still on Ruby 1.9 and Rails 3.2.x. I know 2.1 has some new tools for tracking memory bloat; unfortunately I cannot use them. About JRuby, I'll have to convince the sysadmins to support the JDK. Still, Java has historically suffered from bloat. Maybe the tools would be handy, though.

Restarting the workers seems a decent-enough workaround. But the main question was, am I doing something wrong in this flow? From what I understood from the many Resque vs. Sidekiq comparisons, Sidekiq (especially on MRI) thrives when a lot of IO takes place, which is my case, I guess. But is it ideal for such long-running tasks, which involve more than sending an email? (I mean, the duration of the task may vary quite a bit.)

Another, more direct question would be: is there any way to tune the GC to make it more responsive to such behaviour? You know Sidekiq quite well, so maybe you have already seen some memory usage improvements by tweaking the GC here and there.

One more thing: concerning batch saving of models, have you experienced (or heard of) improvements by moving from AR to something else (say, Sequel)?

Owner

mperham commented Jun 4, 2014

Your GC should definitely be tuned on MRI. Read this and develop your own tuning: http://samsaffron.com/archive/2014/04/08/ruby-2-1-garbage-collection-ready-for-production
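Before changing any GC setting it helps to have a baseline to compare against. A minimal sketch, assuming only the standard `GC.count` counter; the helper name `gc_runs_during` and the workload are hypothetical stand-ins for a real job:

```ruby
# Runs the block and reports how many GC cycles happened while it ran.
def gc_runs_during
  before = GC.count
  result = yield
  [result, GC.count - before]
end

# Hypothetical workload standing in for a job that builds many objects.
records, gc_runs = gc_runs_during do
  Array.new(200_000) { |i| { hostname: "host-#{i}", ports: i % 48 } }
end

puts "built #{records.size} records, GC ran #{gc_runs} times"
```

Running the same measurement before and after a tuning change shows whether the change actually reduced the number of collections for that workload.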

Long-running tasks are perfectly fine as long as they're idempotent (can be run many times) so you can restart Sidekiq at any time. Otherwise you're better off running a rake task for each long job.
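To illustrate what idempotent means for a job like this one, here is a minimal plain-Ruby sketch. `CompletedHosts` and `InventoryJob` are hypothetical names, the in-memory set stands in for a durable store such as a database table, and nothing here uses Sidekiq's API:

```ruby
require "set"

# Hypothetical stand-in for a durable record of finished work.
class CompletedHosts
  def initialize
    @done = Set.new
  end

  def done?(host)
    @done.include?(host)
  end

  def mark_done(host)
    @done.add(host)
  end
end

class InventoryJob
  attr_reader :processed

  def initialize(store)
    @store = store
    @processed = []
  end

  # Safe to run again after a restart: finished hosts are skipped.
  def perform(hostnames)
    hostnames.each do |host|
      next if @store.done?(host)
      collect(host)
      @store.mark_done(host)
    end
  end

  private

  # Placeholder for the real work (SSH, parsing, saving).
  def collect(host)
    @processed << host
  end
end

store = CompletedHosts.new
job = InventoryJob.new(store)
job.perform(%w[a.example b.example])
job.perform(%w[a.example b.example c.example]) # simulated re-run
p job.processed
```

Note that a crash between `collect` and `mark_done` would still repeat the work for that one host, so the work itself also has to tolerate being done twice.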

If you want to insert lots of data, it's best to drop down to raw SQL with a multi-INSERT statement or use something like MySQL's LOAD DATA INFILE and skip AR altogether. You'll see a 100x performance improvement.
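A multi-INSERT statement can be assembled by hand. The sketch below is only an illustration: `multi_insert_sql`, the `devices` table and its columns are hypothetical, and `toy_quote` is a deliberately simple quoting function. In a Rails app one would more likely pass the database connection's own quoting method instead.

```ruby
# Builds one INSERT statement covering many rows.
# `quote` is any callable that turns a Ruby value into a SQL literal.
def multi_insert_sql(table, columns, rows, quote)
  raise ArgumentError, "no rows given" if rows.empty?

  values = rows.map do |row|
    raise ArgumentError, "row width mismatch" unless row.size == columns.size
    "(" + row.map { |value| quote.call(value) }.join(", ") + ")"
  end

  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values.join(', ')}"
end

# Toy quoting for the example only: handles nil, numbers and strings.
toy_quote = lambda do |value|
  case value
  when nil     then "NULL"
  when Numeric then value.to_s
  else "'" + value.to_s.gsub("'", "''") + "'"
  end
end

rows = [["sw1.example", 24], ["o'brien.example", 48]]
sql = multi_insert_sql("devices", %w[hostname port_count], rows, toy_quote)
puts sql
```

For very large batches the rows can be split with `each_slice` so that each statement stays below the server's packet or statement size limit.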

Author

HoneyryderChuck commented Jun 5, 2014

I've narrowed the problem down to a few Nokogiri calls... I guess this has been happening quite a lot. If you have any clue which is the most stable, leak-free version of Nokogiri around, I would very much appreciate it :)

About the multi-INSERT, that might be the way to go. But I'll almost have to build a stored procedure of sorts to handle it, because there are a few callbacks happening in the AR model, namely counter_caches. Performance boost, maintenance hell. Is there a perfect world?
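One possible middle ground for the counter caches, sketched under assumptions: tally the counter changes in Ruby from the rows about to be bulk-inserted, then apply one update per parent. The helper `counter_deltas` and the row shape (`device_id`, `name`) are hypothetical.

```ruby
# Counts how many new child rows each parent id receives.
def counter_deltas(rows, foreign_key)
  rows.each_with_object(Hash.new(0)) do |row, counts|
    counts[row.fetch(foreign_key)] += 1
  end
end

# Hypothetical child rows destined for a bulk insert.
rows = [
  { device_id: 1, name: "eth0" },
  { device_id: 1, name: "eth1" },
  { device_id: 2, name: "eth0" }
]

deltas = counter_deltas(rows, :device_id)
deltas.each { |parent_id, count| puts "parent #{parent_id}: +#{count}" }

# With ActiveRecord, each delta could then be applied inside the same
# transaction, e.g. via update_counters on a hypothetical Device model:
#   Device.update_counters(parent_id, interfaces_count: count)
```

This keeps the counter logic in one place in Ruby, at the cost of one extra UPDATE per affected parent.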

Owner

mperham commented Jun 7, 2014

Thanks for the nokogiri tip, we're going to upgrade to the latest version and see if our bloat gets any better.

mperham closed this Jun 7, 2014

Author

HoneyryderChuck commented Jun 20, 2014

One more question regarding this issue: if for some reason monit restarts the Sidekiq process WHILE jobs are being processed, what happens? Are the jobs pushed back to the queue before the process is killed? The monit restart is done by calling the sidekiqctl stop command. Do you know some way of avoiding this? I would like to requeue the currently running jobs before killing the process. I'm talking about Sidekiq 2.17.6/7, by the way.

Owner

mperham commented Jun 20, 2014

@TiagoCardoso1983 This is covered in the FAQ.
