
Memory bloat / dangling keys / reaper not cleaning orphans #637

Closed
sharq1 opened this issue Sep 15, 2021 · 26 comments · Fixed by #726

Comments

@sharq1

sharq1 commented Sep 15, 2021

Describe the bug
The uniquejobs:... keys do not get removed from Redis, so over a few months they consumed almost all of our Redis memory (500 MB). See how usage grew over 9 weeks:

[Screenshot: Redis memory growth over 9 weeks]

I have just manually removed all keys matching uniquejobs:* and it released nearly all the memory that was being used:

[Screenshot: Redis memory usage dropping after the manual cleanup]
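For reference, a cleanup along those lines can be scripted; here is a minimal sketch assuming a plain redis-rb connection (the connection URL and batch size are just placeholders):

# One-off cleanup: batch-delete all uniquejobs:* keys via SCAN.
require "redis"

redis   = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379/0"))
cursor  = "0"
removed = 0

loop do
  cursor, keys = redis.scan(cursor, match: "uniquejobs:*", count: 1_000)
  unless keys.empty?
    redis.unlink(*keys) # non-blocking delete (Redis 4+); use del on older servers
    removed += keys.size
  end
  break if cursor == "0"
end

puts "Removed #{removed} uniquejobs keys"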

Expected behavior
No uniquejobs keys are left dangling in Redis once there are no enqueued jobs.

Current behavior
Lots of dangling keys are left in Redis.

Additional context
We were on version 7.0.1 for a while and switched to 7.1.5 just days ago.

Here's our Sidekiq initializer:

Sidekiq.configure_server do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  # Registers the gem's server hooks, including the orphan reaper, on startup
  SidekiqUniqueJobs::Server.configure(config)
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end
end
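For completeness, the orphan reaper can also be tuned via SidekiqUniqueJobs.configure; a minimal sketch using the v7 option names (the values below are only examples):

SidekiqUniqueJobs.configure do |config|
  config.reaper          = :ruby # orphan reaper implementation (:ruby or :lua); nil/false disables it
  config.reaper_count    = 1_000 # max orphaned locks deleted per run
  config.reaper_interval = 600   # seconds between reaper runs
  config.reaper_timeout  = 150   # seconds before a reaper run times out
end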

Some dangling keys found when no jobs were enqueued:

"uniquejobs:22593cd0a41a588fac3281643d33e118:RUN", "uniquejobs:3126d0f7a392524f753789d57780c470:RUN:INFO", "uniquejobs:3b8cbe9c7c4553c72860b9b4987b5d06:RUN", "uniquejobs:54acd73863fa5871768cb083188d831d:RUN", "uniquejobs:b4191ec985c34b244af30ab3143cabad:RUN:INFO", "uniquejobs:e5c365ec72e9029139f446f4af3167ae:RUN", "uniquejobs:555f124f477550c264931b9f63a4151e:RUN:PRIMED", "uniquejobs:c2eec995a5bd4b25b2724a29e90140cd:RUN:PRIMED", "uniquejobs:c61e62e95f5bd06f182b25f278c1df25:RUN:PRIMED", "uniquejobs:212996c302973dcf0644c806f3e11e79:RUN:INFO", "uniquejobs:81e90987d7df3df2a3a99aa4e17e2df9:RUN:PRIMED", "uniquejobs:38f8ac8163e0d13fa03529b7fbfab2b0:RUN:INFO", "uniquejobs:dd8f271e16866d31a7449d1d76560e7d:RUN:INFO", "uniquejobs:12179e1549a813de224efbffce519dbf:RUN:INFO", "uniquejobs:01c8c00a98662b4105cbb3b3da1462a7:RUN:INFO", "uniquejobs:43f4e427ed0027409842e3dae8fb843f:RUN", "uniquejobs:2909c6d6374d70bffe48c40f8af024ed:RUN", "uniquejobs:4a0c9d496a8d6ff7ec4788b57a3e6091:RUN:PRIMED", "uniquejobs:6ae85a603fd21f16c5c4c954d549655c:RUN:INFO", "uniquejobs:7b9862a384420e4dc1d6a23ca11a7706:RUN:PRIMED", "uniquejobs:cb31ed8134ed21e7a3a3c29064379e91:RUN", "uniquejobs:cd7776a82ab53042f6029c829f3b6ee6:RUN", "uniquejobs:a1793460fb38bb3a29e7e5385169dcec:RUN", "uniquejobs:93dd635f5c4a92021228d0637696fde6:RUN:INFO", "uniquejobs:b47459c06d2beac78937b2e318ff4363:RUN", "uniquejobs:ed56a8dd005fb21b2a34f5da9e047537:RUN:PRIMED", "uniquejobs:8ccf96815ca0af4e85fce6bb28b2b8c5:RUN:PRIMED", "uniquejobs:f005969e5292e90ee26f24971eaa3a07:RUN", "uniquejobs:f1ab984cf3029db6cb8efb0c1efcf7ff:RUN:INFO", "uniquejobs:2445ad2c006ee45d00e1660b1b2d8b13:RUN:PRIMED", "uniquejobs:d5f3f639190f3cb1d1b640f1f33fb65a:RUN:PRIMED", "uniquejobs:3a19cbcb65db4246a694c901b500b9f8:RUN:INFO", "uniquejobs:ff72e0459affd6506f23233bb8f4b907:RUN:PRIMED", "uniquejobs:a6ca3f3eccc0850c53454693feb1aac2:RUN", "uniquejobs:dab684c5685db372216af1bb0bf058a9:RUN:PRIMED", "uniquejobs:9bd285eeb11f9d7363cd03226c74252d:RUN:INFO", "uniquejobs:54040d7ce8076c69ea89780acb3907ac:RUN", "uniquejobs:965c2ac877ca1f0d2153be10c1c3b0f8:RUN:PRIMED", "uniquejobs:14eebb7367f944c7f525a771301ab4f5:RUN", "uniquejobs:0a862bd9b0851e14ccdbb87ba4efbf08:RUN:PRIMED"

Possibly related issue

We've also been having issues with the until_and_while_executing lock: on 7.0.1 we discovered that around 4% of our jobs did not perform because of it. On 7.1.5 it looks better, but some jobs are still missed. I've added additional logging to determine the cause, but given all the issues, I'm afraid we'll have to stop or pause using the gem until it's stable.

That said, I really appreciate your hard work. Please let me know if I can be of any help in determining the cause of these issues.

@sharq1 changed the title from "Memory bloat / dangling keys / reaper not working" to "Memory bloat / dangling keys / reaper not cleaning orphans" on Sep 15, 2021
@sharq1
Author

sharq1 commented Sep 15, 2021

Is it possible that, because of this bloat, the newly generated random UIDs occasionally collided with some of the existing orphaned keys? We had roughly 6.5 million of these orphaned keys when I removed them. It seems unlikely that this would account for 4% of cases, but after the upgrade to 7.1.5 the number of lost jobs dropped to around 0.3%.

@sharq1
Author

sharq1 commented Sep 21, 2021

We've disabled locks on the worker that was problematic, and we're not having issues with missed jobs anymore. However, we now have to come up with an alternative strategy, either until this gets fixed (any clues as to whether it will? 🙂) or permanently. Not to mention having to fix/recalculate nearly all of our historical records in different parts of the system, which will take days... 😬

@mhenrixon
Owner

Sorry @sharq1, life got in the way this summer: #601 (comment)

I'll be dealing with these topics in the next few days

@sharq1
Author

sharq1 commented Sep 21, 2021

I'm sorry to hear that... Hope the bad luck is over for you.
Thanks for the heads up.

@mhenrixon
Owner

@sharq1 are you on Sidekiq 6.2.2? If so, they renamed Sidekiq::Job to Sidekiq::JobRecord, and I'll add extensions for that.

@sharq1
Author

sharq1 commented Sep 21, 2021

Yes, although we switched from 6.2.1 to 6.2.2 at around the same time we disabled locks for some of the jobs (workers); for others we still have locks. Will sidekiq-unique-jobs still function regardless, or is it a breaking change?

@mhenrixon
Owner

Yes, but if you, for example, delete things via the Sidekiq::API classes (used by sidekiq/web), then the related locks won't be deleted.

@mhenrixon
Owner

v7.1.6 should work with the latest Sidekiq version.

@mhenrixon
Owner

Hi @sharq1 I will have a deeper look into this later today.

@bdarcet

bdarcet commented Oct 15, 2021

We pretty much have the same issue on our end. If one of our Sidekiq jobs crashes (OOM in our case), the uniquejobs:key:.. and uniquejobs:key:..:LOCKED keys are not cleared, and the job will never rerun until we clear the keys manually.
We are considering adding another daily job to clean those keys, but I guess the reaper should be handling that?
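Such a workaround could look roughly like the sketch below: a scheduled Sidekiq worker that invokes the gem's reaper manually (the class name and scheduling are placeholders; in principle the built-in reaper should make this unnecessary):

# Hypothetical scheduled worker that runs the orphan reaper by hand.
class ReapOrphanedLocksWorker
  include Sidekiq::Worker

  def perform
    Sidekiq.redis do |conn|
      SidekiqUniqueJobs::Orphans::Reaper.call(conn)
    end
  end
end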

Versions:
sidekiq-unique-jobs: 7.1.5
sidekiq: 6.2.2

@mhenrixon
Owner

@bdarcet I strongly recommend you upgrade to 7.1.8 because 7.1.5 is not compatible with the sidekiq version you are using.

@fotinakis

Just wanted to note we encountered a similar issue with orphaned keys not getting cleaned up, having just started using sidekiq-unique-jobs.

sidekiq-unique-jobs (7.1.25)
sidekiq-pro (5.3.1)
sidekiq (6.5.1)

The job is simple; it only has:

sidekiq_options lock: :until_expired, lock_ttl: 30.minutes

It is a heavy-load job, executed millions of times in a 4-5 hour window.
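For context, the full worker declaration looks roughly like this (the class name and body are placeholders):

class HeavyLoadWorker
  include Sidekiq::Worker
  # :until_expired holds the lock until the TTL runs out, regardless of job completion.
  sidekiq_options lock: :until_expired, lock_ttl: 30.minutes # 30.minutes assumes ActiveSupport

  def perform(payload_id)
    # ... heavy work ...
  end
end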

We found tens of thousands of these PRIMED orphaned keys, which remained in Redis indefinitely:

[Screenshot: orphaned uniquejobs:...:PRIMED keys in Redis]

The reaper never cleaned them up, so we deleted them manually.

@mhenrixon
Owner

There has been some work done in #724 which should help greatly in your case as well.

@mhenrixon
Owner

When #725 is merged, it would be great to get some feedback on how it performs for you, @fotinakis. Any insights would be greatly appreciated, preferably before I release it, since we changed how the expired keys are stored.

@fotinakis

Amazing, thanks @mhenrixon! I'm going to be able to run a large-scale test today, and will report the results to you here.

@fotinakis

fotinakis commented Jul 12, 2022

@mhenrixon @francesmcmullin

Test results: unfortunately, we saw similar memory-leak behavior under load with :until_expired locks on the latest main branch.

Setup:

  • Job lock options:
sidekiq_options lock: :until_expired, lock_ttl: 30.minutes
  • sidekiq (6.5.1)
  • sidekiq-pro (5.5.1)
  • sidekiq-unique-jobs (main @ b999c8b)

Result:

  • ~800k unique jobs run (out of an unknown number of non-unique attempts, which would have been many multiples larger than this)
  • Peak of ~5000 unique jobs/sec run, but more like 500-1000 jobs/sec avg

[Screenshot]

[Screenshot: Redis memory usage. The red line marks ~30 minutes after the jobs finished. Expected: memory usage should have dropped back down to starting levels.]

[Screenshot: Redis key count, showing the same pattern as memory usage.]

[Screenshot: ~18k orphaned keys in Redis of one form.]

[Screenshot: ~200k orphaned keys of another form.]

--

Calling the reaper directly reports nothing to delete:

> POOL2.with { |r| puts SidekiqUniqueJobs::Orphans::Reaper.call(r) };
2022-07-12T20:42:09.000Z pid=14081 tid=12wt INFO: Nothing to delete; exiting.
2022-07-12T20:42:09.001Z pid=14081 tid=12wt INFO: Nothing to delete; exiting.

(one complexity to note: we do use sidekiq sharding to support multiple redis dbs, but I don't think that causes any additional issues here)

--

Any ideas? (I don't know enough about how sidekiq-unique-jobs's internals work, but happy to help debug or test again)

@fotinakis

Quick followup:

It looks like the keys prefixed with b- are Sidekiq Pro Batch keys, and these can be cleaned up via Sidekiq Pro's "LINGER" setting (sidekiq/sidekiq#3083).

So the only orphans of concern above are the uniquejobs:..:PRIMED keys.

@mhenrixon
Owner

Awesome! I'll get on those lingering keys then!

@fotinakis

Great, thanks @mhenrixon! Do you have any thoughts on an ETA for fixing this? We're using it at such a scale that I'm somewhat blocked on turning on a big system until this particular issue is resolved or we work around it.

We've also just joined as a GitHub sponsor! 💯 Really appreciate your work on this fantastic open source project.

@mhenrixon
Owner

@fotinakis I'll get started first thing on Tuesday.

This week I have off to spend some quality time with my wife. Our kids are in Sweden with my parents and yesterday was our fifth wedding anniversary!

Sincerely appreciate the support and the kind words 🙏.

Gets me very excited about fixing problems ❤️

@mhenrixon
Owner

@fotinakis Monday I'll get started! Just had a look at the calendar and my birthday isn't until Tuesday, so I have the whole of Monday to look at those dangling keys.

@fotinakis

Hey @mhenrixon — appreciate it. No rush from our end, we have implemented a workaround in the meantime. 👍 Will be good to fix that orphan cleanup whenever you get to it.

@mhenrixon
Owner

@fotinakis I have a PR (#726) that should hopefully fix the problem. Would appreciate help battle testing the change. I am a little unsure about the potential effect of concurrency here.

@mhenrixon
Owner

Please check out v7.1.26 (https://github.com/mhenrixon/sidekiq-unique-jobs/releases/tag/v7.1.26), as it has quite a few fixes for until_and_while_executing and also until_expired.

@mhenrixon
Owner

Version v7.1.27 should be even better, especially if you are using redis-namespace.

That said, redis-namespace is not recommended. It turned out it is not up to date with redis-rb's keyword arguments.

Who knows what else is failing silently with it.

@joshmn

joshmn commented Aug 18, 2022

@fotinakis similar throughput to yours and similar issues here. We're on 7.1.27. What were your workarounds? I woke up to a nightmare :(
