Improve Ruby Reaper performance under heavy load #663

Closed
francesmcmullin opened this issue Dec 1, 2021 · 3 comments · Fixed by #666

@francesmcmullin
Contributor

Is your feature request related to a problem? Please describe.
I find the Ruby reaper runs slowly and takes up a lot of memory when I'm processing a large volume of jobs (hundreds of thousands in a few hours). I've updated our configuration so that the reaper resurrector brings the reaper back if it fails entirely, and it mostly doesn't hit the timeout, but ideally it would run a bit faster with a smaller footprint.

Describe the solution you'd like
I suggest that the reaper should work through the oldest digests first, and that it should avoid loading all digests in ruby at once. Here's the code I'm interested in:

conn.zrevrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
  next if belongs_to_job?(digest)
  memo << digest
  break memo if memo.size >= reaper_count
end

Currently, using zrevrange means we go from the highest score to the lowest. Since the current timestamp is generally used as a digest's score, this means going from newest to oldest. It's certainly not perfect, but a better general heuristic when seeking stale digests is to go from oldest to newest, which we can do by using zrange instead of zrevrange.
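
For illustration, a minimal sketch of this first change in isolation; it is the same snippet as above with only the Redis command swapped:

conn.zrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
  next if belongs_to_job?(digest)
  memo << digest
  break memo if memo.size >= reaper_count
end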

Second, perhaps more laboriously, I suggest paging through digests rather than loading the whole set. It might look similar to this:

page = 0
per = reaper_count * 2
orphans = []
# ZRANGE treats both indexes as inclusive, so subtract 1 to keep pages from overlapping.
digests = conn.zrange(digests.key, page * per, (page + 1) * per - 1)

while digests.size > 0
  digests.each do |digest|
    next if belongs_to_job?(digest)

    orphans << digest
    break if orphans.size >= reaper_count
  end

  break if orphans.size >= reaper_count

  # Fetch the next (older-to-newer) page of digests.
  page += 1
  digests = conn.zrange(digests.key, page * per, (page + 1) * per - 1)
end

orphans

Describe alternatives you've considered
I've considered switching to the Lua reaper, but I was concerned about blocking redis. I'm also thinking about changing some of our application logic so we don't lean quite so heavily on unique jobs, but that will take a bit longer to develop.

Additional context
I'm happy to provide more detail on how we're using sidekiq-unique-jobs in case that's helpful. We tend to process large volumes of jobs (e.g., 300,000) in a short amount of time (e.g., 2 hours) and then have long periods with much less activity.

@mhenrixon mhenrixon self-assigned this Dec 1, 2021
mhenrixon added a commit that referenced this issue Dec 1, 2021
Closes #663

Signed-off-by: mhenrixon <mikael@mhenrixon.com>
mhenrixon added a commit that referenced this issue Dec 1, 2021
* Improve reaper performance under heavy load

Closes #663

Signed-off-by: mhenrixon <mikael@mhenrixon.com>

* Mandatory linter commit
@mhenrixon
Owner

❤️ thank you again for the bug report and the solution! I noticed the issue yesterday when the last entry in the sorted set was deleted instead of the first one.

Today I also fixed that bug, all thanks to you! It has been released as v7.1.12.

@francesmcmullin
Contributor Author

francesmcmullin commented Dec 3, 2021

Just a brief report back on this one - for our use case this cut reaper run time by 60%, thank you so much for moving so quickly! 👏🏻 👏🏻 👏🏻

Separately, just so you're aware: while the reaper now runs very effectively for us even when there is a large number of digests to clean (e.g., 150,000), it struggles a lot if our Sidekiq queues are very full (e.g., 5,000 jobs in queues). For our use case this is a surge which will pass, so it's fine for the reaper to give up (or time out) and be resurrected later, after the surge has passed, but it might be a problem for other use cases.

@mhenrixon
Owner

Hi @francesmcmullin, can you open a separate issue for this one?

I believe we could skip reaping based on how large the queues are.

The most frequent use case would, and should, be until_executed, and with that lock there shouldn't be a need for reaping while so many jobs are being queued successfully.

Rather, it should be done as cleanup after the surge.
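
As a rough illustration of that idea (not part of the gem; the threshold and helper below are hypothetical and only use Sidekiq's standard queue API), something like this could gate the reaper:

require "sidekiq/api"

# Illustrative threshold; tune for your workload.
MAX_ENQUEUED_FOR_REAPING = 1_000

# True while the queues are saturated and reaping should be skipped.
def queues_too_busy?
  Sidekiq::Queue.all.sum(&:size) > MAX_ENQUEUED_FOR_REAPING
end

# e.g. wherever the cleanup is scheduled:
#   run_reaper unless queues_too_busy?

The reaper would then only run once the surge has passed.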
