
Memory bloat / dangling keys / reaper not cleaning orphans #637

Closed
sharq1 opened this issue Sep 15, 2021 · 26 comments · Fixed by #726

Comments

@sharq1

sharq1 commented Sep 15, 2021

Describe the bug
The uniquejobs:... keys do not get removed from Redis, so over a few months they consumed almost all of our Redis memory (500 MB). See how usage grew over 9 weeks:

[Screenshot: Redis memory growth over 9 weeks]

I have just manually removed all keys matching uniquejobs:* and it released nearly all the memory that was being used:

[Screenshot: Redis memory usage dropping after the manual cleanup]
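For reference, a cleanup along those lines can be scripted; here is a minimal sketch assuming a plain redis-rb connection (the connection URL and batch size are just placeholders):

# One-off cleanup: batch-delete all uniquejobs:* keys via SCAN.
require "redis"

redis   = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379/0"))
cursor  = "0"
removed = 0

loop do
  cursor, keys = redis.scan(cursor, match: "uniquejobs:*", count: 1_000)
  unless keys.empty?
    redis.unlink(*keys) # non-blocking delete (Redis 4+); use del on older servers
    removed += keys.size
  end
  break if cursor == "0"
end

puts "Removed #{removed} uniquejobs keys"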

Expected behavior
No uniquejobs keys are left dangling in Redis once there are no enqueued jobs.

Current behavior
Lots of dangling keys are left in Redis.

Additional context
We were on version 7.0.1 for a while and switched to 7.1.5 just days ago.

Here's our Sidekiq initializer:

Sidekiq.configure_server do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  # Registers the gem's server hooks, including the orphan reaper, on startup
  SidekiqUniqueJobs::Server.configure(config)
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end
end
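For completeness, the orphan reaper can also be tuned via SidekiqUniqueJobs.configure; a minimal sketch using the v7 option names (the values below are only examples):

SidekiqUniqueJobs.configure do |config|
  config.reaper          = :ruby # orphan reaper implementation (:ruby or :lua); nil/false disables it
  config.reaper_count    = 1_000 # max orphaned locks deleted per run
  config.reaper_interval = 600   # seconds between reaper runs
  config.reaper_timeout  = 150   # seconds before a reaper run times out
end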

Some dangling keys found when no jobs were enqueued:

"uniquejobs:22593cd0a41a588fac3281643d33e118:RUN", "uniquejobs:3126d0f7a392524f753789d57780c470:RUN:INFO", "uniquejobs:3b8cbe9c7c4553c72860b9b4987b5d06:RUN", "uniquejobs:54acd73863fa5871768cb083188d831d:RUN", "uniquejobs:b4191ec985c34b244af30ab3143cabad:RUN:INFO", "uniquejobs:e5c365ec72e9029139f446f4af3167ae:RUN", "uniquejobs:555f124f477550c264931b9f63a4151e:RUN:PRIMED", "uniquejobs:c2eec995a5bd4b25b2724a29e90140cd:RUN:PRIMED", "uniquejobs:c61e62e95f5bd06f182b25f278c1df25:RUN:PRIMED", "uniquejobs:212996c302973dcf0644c806f3e11e79:RUN:INFO", "uniquejobs:81e90987d7df3df2a3a99aa4e17e2df9:RUN:PRIMED", "uniquejobs:38f8ac8163e0d13fa03529b7fbfab2b0:RUN:INFO", "uniquejobs:dd8f271e16866d31a7449d1d76560e7d:RUN:INFO", "uniquejobs:12179e1549a813de224efbffce519dbf:RUN:INFO", "uniquejobs:01c8c00a98662b4105cbb3b3da1462a7:RUN:INFO", "uniquejobs:43f4e427ed0027409842e3dae8fb843f:RUN", "uniquejobs:2909c6d6374d70bffe48c40f8af024ed:RUN", "uniquejobs:4a0c9d496a8d6ff7ec4788b57a3e6091:RUN:PRIMED", "uniquejobs:6ae85a603fd21f16c5c4c954d549655c:RUN:INFO", "uniquejobs:7b9862a384420e4dc1d6a23ca11a7706:RUN:PRIMED", "uniquejobs:cb31ed8134ed21e7a3a3c29064379e91:RUN", "uniquejobs:cd7776a82ab53042f6029c829f3b6ee6:RUN", "uniquejobs:a1793460fb38bb3a29e7e5385169dcec:RUN", "uniquejobs:93dd635f5c4a92021228d0637696fde6:RUN:INFO", "uniquejobs:b47459c06d2beac78937b2e318ff4363:RUN", "uniquejobs:ed56a8dd005fb21b2a34f5da9e047537:RUN:PRIMED", "uniquejobs:8ccf96815ca0af4e85fce6bb28b2b8c5:RUN:PRIMED", "uniquejobs:f005969e5292e90ee26f24971eaa3a07:RUN", "uniquejobs:f1ab984cf3029db6cb8efb0c1efcf7ff:RUN:INFO", "uniquejobs:2445ad2c006ee45d00e1660b1b2d8b13:RUN:PRIMED", "uniquejobs:d5f3f639190f3cb1d1b640f1f33fb65a:RUN:PRIMED", "uniquejobs:3a19cbcb65db4246a694c901b500b9f8:RUN:INFO", "uniquejobs:ff72e0459affd6506f23233bb8f4b907:RUN:PRIMED", "uniquejobs:a6ca3f3eccc0850c53454693feb1aac2:RUN", "uniquejobs:dab684c5685db372216af1bb0bf058a9:RUN:PRIMED", "uniquejobs:9bd285eeb11f9d7363cd03226c74252d:RUN:INFO", "uniquejobs:54040d7ce8076c69ea89780acb3907ac:RUN", "uniquejobs:965c2ac877ca1f0d2153be10c1c3b0f8:RUN:PRIMED", "uniquejobs:14eebb7367f944c7f525a771301ab4f5:RUN", "uniquejobs:0a862bd9b0851e14ccdbb87ba4efbf08:RUN:PRIMED"

Possibly related issue

We've also been having issues with the until_and_while_executing lock: on 7.0.1 we discovered that around 4% of our jobs did not perform because of it. On 7.1.5 it looks better, but some jobs are still missed. I've added additional logging to determine the cause, but given all the issues, I'm afraid we'll have to stop or pause using the gem until it's stable.

That said, I really appreciate your hard work. Please let me know if I can be of any help in determining the cause of these issues.

@sharq1 changed the title from "Memory bloat / dangling keys / reaper not working" to "Memory bloat / dangling keys / reaper not cleaning orphans" on Sep 15, 2021
@sharq1
Author

sharq1 commented Sep 15, 2021

Is it possible that, because of this bloat, the newly generated random UIDs occasionally collided with some of the existing orphaned keys? We had roughly 6.5 million of these orphaned keys when I removed them. It seems unlikely that this would account for 4% of cases, but after the upgrade to 7.1.5 the number of lost jobs dropped to around 0.3%.

@sharq1
Author

sharq1 commented Sep 21, 2021

We've disabled locks on the worker that was problematic, and we're not having issues with missed jobs anymore. However, we now have to come up with an alternative strategy, either until this gets fixed (any clues as to whether it will? 🙂) or permanently. Not to mention having to fix/recalculate nearly all of our historical records in different parts of the system, which will take days... 😬

@mhenrixon
Owner

Sorry @sharq1, life got in the way this summer: #601 (comment)

I'll be dealing with these topics in the next few days

@sharq1
Author

sharq1 commented Sep 21, 2021

I'm sorry to hear that... Hope the bad luck is over for you.
Thanks for the heads up.

@mhenrixon
Owner

@sharq1 are you on Sidekiq 6.2.2? If so, they renamed Sidekiq::Job to Sidekiq::JobRecord, and I'll add extensions for that.

@sharq1
Author

sharq1 commented Sep 21, 2021

Yes, although we switched from 6.2.1 to 6.2.2 at around the same time we disabled locks for some of the jobs (workers); for others we still have locks. Will sidekiq-unique-jobs still function regardless, or is it a breaking change?

@mhenrixon
Owner

Yes, but if you, for example, delete things via the Sidekiq::API classes (used by sidekiq/web), then the related locks won't be deleted.

@mhenrixon
Owner

v7.1.6 should work with the latest Sidekiq version.

@mhenrixon
Owner

Hi @sharq1 I will have a deeper look into this later today.

@bdarcet

bdarcet commented Oct 15, 2021

We pretty much have the same issue on our end. If one of our Sidekiq jobs crashes (OOM in our case), the uniquejobs:key:.. and uniquejobs:key:..:LOCKED keys are not cleared, and the job will never rerun until we clear the keys manually.
We are considering adding another daily job to clean those keys, but I guess the reaper should be handling that?
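Such a workaround could look roughly like the sketch below: a scheduled Sidekiq worker that invokes the gem's reaper manually (the class name and scheduling are placeholders; in principle the built-in reaper should make this unnecessary):

# Hypothetical scheduled worker that runs the orphan reaper by hand.
class ReapOrphanedLocksWorker
  include Sidekiq::Worker

  def perform
    Sidekiq.redis do |conn|
      SidekiqUniqueJobs::Orphans::Reaper.call(conn)
    end
  end
end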

Versions:
sidekiq-unique-jobs: 7.1.5
sidekiq: 6.2.2

@mhenrixon
Owner

@bdarcet I strongly recommend you upgrade to 7.1.8 because 7.1.5 is not compatible with the sidekiq version you are using.

@fotinakis

Just wanted to note we encountered a similar issue with orphaned keys not getting cleaned up, having just started using sidekiq-unique-jobs.

sidekiq-unique-jobs (7.1.25)
sidekiq-pro (5.3.1)
sidekiq (6.5.1)

The job is simple; it only has:

sidekiq_options lock: :until_expired, lock_ttl: 30.minutes

It is a heavy-load job, executed millions of times in a 4-5 hour window.
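For context, the full worker declaration looks roughly like this (the class name and body are placeholders):

class HeavyLoadWorker
  include Sidekiq::Worker
  # :until_expired holds the lock until the TTL runs out, regardless of job completion.
  sidekiq_options lock: :until_expired, lock_ttl: 30.minutes # 30.minutes assumes ActiveSupport

  def perform(payload_id)
    # ... heavy work ...
  end
end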

We found tens of thousands of these PRIMED orphaned keys, which remained in Redis indefinitely:

[Screenshot: orphaned uniquejobs:...:PRIMED keys in Redis]

The reaper never cleaned them up, so we deleted them manually.

@mhenrixon
Owner

There has been some work done in #724 which should help greatly in your case as well.

@mhenrixon
Owner

When #725 is merged, it would be great to get some feedback on how it performs for you, @fotinakis. Any insights would be greatly appreciated, preferably before I release it, since we changed how the expired keys are stored.

@fotinakis

Amazing, thanks @mhenrixon! I'm going to be able to run a large-scale test today, and will report the results to you here.

@fotinakis

fotinakis commented Jul 12, 2022

@mhenrixon @francesmcmullin

Test results: unfortunately, we saw similar memory-leak behavior under load with :until_expired locks on the latest main branch.

Setup:

  • Job lock options:
sidekiq_options lock: :until_expired, lock_ttl: 30.minutes
  • sidekiq (6.5.1)
  • sidekiq-pro (5.5.1)
  • sidekiq-unique-jobs (main @ b999c8b)

Result:

  • ~800k unique jobs run (out of an unknown number of non-unique attempts, which would have been many multiples larger than this)
  • Peak of ~5000 unique jobs/sec run, but more like 500-1000 jobs/sec avg

[Screenshot]

[Screenshot: Redis memory usage. The red line marks ~30 minutes after the jobs finished. Expected: memory usage should have dropped back down to starting levels.]

[Screenshot: Redis key count, showing the same pattern as memory usage.]

[Screenshot: ~18k orphaned keys in Redis of one form.]

[Screenshot: ~200k orphaned keys of another form.]

--

Calling the reaper directly reports nothing to delete:

> POOL2.with { |r| puts SidekiqUniqueJobs::Orphans::Reaper.call(r) };
2022-07-12T20:42:09.000Z pid=14081 tid=12wt INFO: Nothing to delete; exiting.
2022-07-12T20:42:09.001Z pid=14081 tid=12wt INFO: Nothing to delete; exiting.

(one complexity to note: we do use sidekiq sharding to support multiple redis dbs, but I don't think that causes any additional issues here)

--

Any ideas? (I don't know enough about how sidekiq-unique-jobs's internals work, but happy to help debug or test again)

@fotinakis

Quick followup:

It looks like the keys prefixed with b- are Sidekiq Pro Batch keys, and these can be cleaned up via Sidekiq Pro's "LINGER" setting (sidekiq/sidekiq#3083).

So the only orphans of concern above are the uniquejobs:..:PRIMED keys.

@mhenrixon
Owner

Awesome! I'll get on those lingering keys then!

@fotinakis

Great, thanks @mhenrixon! Do you have any thoughts on an ETA for fixing this? We're using it at such a scale that I'm somewhat blocked on turning on a big system until this particular issue is resolved or we work around it.

We've also just joined as a GitHub sponsor! 💯 Really appreciate your work on this fantastic open source project.

@mhenrixon
Owner

@fotinakis I'll get started first thing on Tuesday.

This week I have off to spend some quality time with my wife. Our kids are in Sweden with my parents and yesterday was our fifth wedding anniversary!

Sincerely appreciate the support and the kind words 🙏.

Gets me very excited about fixing problems ❤️

@mhenrixon
Owner

@fotinakis Monday I'll get started! Just had a look at the calendar and my birthday isn't until Tuesday, so I have the whole of Monday to look at those dangling keys.

@fotinakis

Hey @mhenrixon — appreciate it. No rush from our end, we have implemented a workaround in the meantime. 👍 Will be good to fix that orphan cleanup whenever you get to it.

@mhenrixon
Owner

@fotinakis I have a PR (#726) that should hopefully fix the problem. Would appreciate help battle testing the change. I am a little unsure about the potential effect of concurrency here.

@mhenrixon
Owner

Please check out v7.1.26 (https://github.com/mhenrixon/sidekiq-unique-jobs/releases/tag/v7.1.26), as it has quite a few fixes for until_and_while_executing and also until_expired.

@mhenrixon
Owner

Version v7.1.27 should be even better, especially if you are using redis-namespace.

That said, redis-namespace is not recommended. It turned out it is not up to date with redis-rb's keyword arguments.

Who knows what else is failing silently with it.

@joshmn

joshmn commented Aug 18, 2022

@fotinakis similar throughput to yours and similar issues here. We're on 7.1.27. What were your workarounds? I woke up to a nightmare :(
