Improve Ruby Reaper performance under heavy load #663

Closed
francesmcmullin opened this issue Dec 1, 2021 · 3 comments · Fixed by #666

@francesmcmullin
Contributor

Is your feature request related to a problem? Please describe.
I find the Ruby reaper runs slowly and takes up a lot of memory when I'm processing a large volume of jobs (hundreds of thousands in a few hours). I've updated our configuration so that the reaper resurrector brings the reaper back if it fails entirely, and it mostly doesn't hit the timeout, but ideally it would run a bit faster with a smaller footprint.

Describe the solution you'd like
I suggest that the reaper should work through the oldest digests first, and that it should avoid loading all digests in ruby at once. Here's the code I'm interested in:

conn.zrevrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
  next if belongs_to_job?(digest)
  memo << digest
  break memo if memo.size >= reaper_count
end

Currently, using zrevrange means we go from the highest score to the lowest. Since the current timestamp is generally used as a digest's score, this means going from newest to oldest. It's certainly not perfect, but a better general heuristic when seeking stale digests is to go from oldest to newest, which we can do by using zrange instead of zrevrange.
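
For illustration, a minimal sketch of this first change in isolation; it is the same snippet as above with only the Redis command swapped:

conn.zrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
  next if belongs_to_job?(digest)
  memo << digest
  break memo if memo.size >= reaper_count
end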

Second, perhaps more laboriously, I suggest paging through digests rather than loading the whole set. It might look similar to this:

page = 0
per = reaper_count * 2
orphans = []
# ZRANGE treats both indexes as inclusive, so subtract 1 to keep pages from overlapping.
digests = conn.zrange(digests.key, page * per, (page + 1) * per - 1)

while digests.size > 0
  digests.each do |digest|
    next if belongs_to_job?(digest)

    orphans << digest
    break if orphans.size >= reaper_count
  end

  break if orphans.size >= reaper_count

  # Fetch the next (older-to-newer) page of digests.
  page += 1
  digests = conn.zrange(digests.key, page * per, (page + 1) * per - 1)
end

orphans

Describe alternatives you've considered
I've considered switching to the Lua reaper, but I was concerned about blocking redis. I'm also thinking about changing some of our application logic so we don't lean quite so heavily on unique jobs, but that will take a bit longer to develop.

Additional context
I'm happy to provide more detail on how we're using sidekiq-unique-jobs in case that's helpful. We tend to process large volumes of jobs (e.g., 300,000) in a short amount of time (e.g., 2 hours) and then have long periods with much less activity.

@mhenrixon mhenrixon self-assigned this Dec 1, 2021
mhenrixon added a commit that referenced this issue Dec 1, 2021
Closes #663

Signed-off-by: mhenrixon <mikael@mhenrixon.com>
mhenrixon added a commit that referenced this issue Dec 1, 2021
* Improve reaper performance under heavy load

Closes #663

Signed-off-by: mhenrixon <mikael@mhenrixon.com>

* Mandatory linter commit
@mhenrixon
Owner

❤️ thank you again for the bug report and the solution! I noticed the issue yesterday when the last entry in the sorted set was deleted instead of the first one.

Today I also fixed that bug, all thanks to you! It has been released as v7.1.12.

@francesmcmullin
Contributor Author

francesmcmullin commented Dec 3, 2021

Just a brief report back on this one - for our use case this cut reaper run time by 60%, thank you so much for moving so quickly! 👏🏻 👏🏻 👏🏻

Separately, just so you're aware: while the reaper now runs very effectively for us even when there is a large number of digests to clean (e.g., 150,000), it struggles a lot if our Sidekiq queues are very full (e.g., 5,000 jobs in queues). For our use case this is a surge which will pass, so it's fine for the reaper to give up (or time out) and be resurrected later, after the surge has passed, but it might be a problem for other use cases.

@mhenrixon
Owner

Hi @francesmcmullin, can you open a separate issue for this one?

I believe we could skip reaping based on how large the queues are.

The most frequent use case would, and should, be until_executed, and with that lock there shouldn't be a need for reaping while so many jobs are being queued successfully.

Rather, it should be done as cleanup after the surge.
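
As a rough illustration of that idea (not part of the gem; the threshold and helper below are hypothetical and only use Sidekiq's standard queue API), something like this could gate the reaper:

require "sidekiq/api"

# Illustrative threshold; tune for your workload.
MAX_ENQUEUED_FOR_REAPING = 1_000

# True while the queues are saturated and reaping should be skipped.
def queues_too_busy?
  Sidekiq::Queue.all.sum(&:size) > MAX_ENQUEUED_FOR_REAPING
end

# e.g. wherever the cleanup is scheduled:
#   run_reaper unless queues_too_busy?

The reaper would then only run once the surge has passed.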
