
Confused about sidekiq enqueued keys and memory usage #3978

Closed
AlexEvanczuk opened this Issue Sep 26, 2018 · 12 comments


AlexEvanczuk commented Sep 26, 2018

Ruby version: 2.3.5
Sidekiq / Pro / Enterprise version(s):
enterprise, "4.2.10"

I'm looking into some issues on our production application: starting a month ago, it began using Redis excessively even when nothing special was happening.
[screenshot: Redis memory usage graph, 2018-09-25]

This corresponds to an operation that queued up millions of Sidekiq jobs, exceeding the Redis memory and causing Sidekiq to drop jobs. We ended up having to manually delete jobs using the Sidekiq API. However, it appears that some jobs were left over. A couple of days ago Sidekiq queued up and tried to run 500 instances of the jobs that were queued up more than a month ago (all immediately failing because the worker class has since been deleted).

The Sidekiq API is not showing anything, so I looked at our Redis keys to see what I could find.

# SCAN-based iteration avoids blocking Redis the way `KEYS *` would
all_keys = []
$redis.scan_each(match: '*') do |key|
  all_keys << key
end

PRODUCTION 27> all_keys.count
18539

all_keys.map{|k| k.split(":").first}.group_by(&:itself).transform_values(&:count)

{
               "app_name_sidekiq_jobs_development" => 14095,
                                    "cache-production" => 4370,
                                     ... [ OTHER UNRELATED THINGS ] ...
}

$redis.mget('app_name_sidekiq_jobs_development:unique:f93b7540c29b11e99cb33d08d6ec191538bf8ab8')
[
    [0] "4692669172.565897"
]
workers = Sidekiq::Workers.new
workers.each do |process_id, thread_id, work|
  if Time.at(work['run_at']).to_datetime < 1.month.ago
    puts "ProcessId: #{process_id}, Q: #{work['queue']}, JID: #{work['payload']['jid']}, Retry: #{work['payload']['retry']}, RunAt: #{work['run_at'] ? Time.at(work['run_at']).to_datetime : 'nil'},   EnqueuedAt: #{work['payload']['enqueued_at'] ? Time.at(work['payload']['enqueued_at']) : 'nil'}"
  end
end

[
    [0] "0:15:439b379dfc19",
    [1] "1:15:ee39c095b9dd",
    [2] "2:14:e4bb2d1626fe",
    [3] "3:15:5a02e3edfdd5"
]

What is going on here? Why do we have 14K Sidekiq keys with the development suffix? Why aren't they showing up in the API, and why are their values just decimal numbers?

Is this something that Sidekiq::Workers.new.prune would help with?

Thank you!!

mperham commented Sep 26, 2018

(This comment has been minimized.)

AlexEvanczuk commented Sep 26, 2018

Thanks @mperham. I knew there was something fishy going on, and it seems clear now that you say it.

Appreciate your quick response and all the hard work on this service.

AlexEvanczuk commented Sep 27, 2018

Hi @mperham -

I am opening this up again because we realized that these keys were actually production keys based on our configuration:

sidekiq:
  namespace: # YML logic that results in 'development'
  load_periodic_jobs: true
  load_product_periodic_jobs: true

I realize this is not good practice; it was likely a mistake in our initial configuration, and migrating to a new namespace is a future project for us.

For now, we'd like to analyze what is using so much space and maybe clear out any cruft data. Does Sidekiq offer any first-class ways to analyze Redis usage? If not, how might you recommend doing this? Based on the graph above, I think it might have to do with a huge increase in the number of processed jobs. However, I would have thought that Sidekiq would hold the number of processed jobs and other stats in a single key, in O(1) space.

Any advice on this would be greatly appreciated.

Alex

AlexEvanczuk reopened this Sep 27, 2018

mperham commented Sep 27, 2018

Sidekiq does not offer such tools. You are looking for a Redis RDB analyzer:

https://github.com/mperham/sidekiq/wiki/Using-Redis#notes

Your inclination is correct: processed jobs are discarded and do not take memory.

AlexEvanczuk commented Sep 28, 2018

Thanks for the reference to that tool. I hooked up a snapshot of our DB to these analysis tools and found that the vast majority of keys are Sidekiq-related keys of the form sidekiq_jobs_development:unique:.*? (~12K keys) and stat:.*? (~2K keys).

I do not think that the 12K unique keys are necessary, although I don't know the internals of Sidekiq. We have fewer than 2K jobs scheduled, and very often we drop to 0 jobs, yet our memory usage has remained constant. Is it possible Sidekiq is holding onto unique keys for some reason? Is there some way to look into which jobs, if any, are using the unique keys?

I also found that Sidekiq jobs and their parameters seem to be stored across a number of different keys. For example, instead of a single key/value pair for a job, it looks like Sidekiq holds the parameters across many different key/value pairs whose keys reference the unique job ID. Does this sound accurate? Is there (internal?) documentation on this? I am curious about how Sidekiq uses Redis.

Thanks again,
Alex

mperham commented Sep 28, 2018

Unique keys can linger if jobs die or are stuck in the retry queue. They persist for the unique_for duration: if it's 4.hours, the key's TTL is 4 hours.
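
One way to audit which unique locks are lingering is to check their TTLs. The sketch below keeps the selection logic in plain Ruby; the commented line shows how the input might be built with a redis-rb client (`$redis` and the key pattern are assumptions carried over from earlier in this thread, not Sidekiq API):

```ruby
# Hedged sketch: flag unique-lock keys whose TTL suggests they will linger.
# `key_ttls` is an array of [key, ttl_in_seconds] pairs. With a live
# redis-rb client you might build it like this (not run here):
#   key_ttls = $redis.scan_each(match: "*:unique:*").map { |k| [k, $redis.ttl(k)] }
LONG_TTL = 48 * 3600 # 48 hours, the threshold mperham mentions below

def lingering_locks(key_ttls, threshold = LONG_TTL)
  # Redis TTL returns -1 for keys with no expiry at all
  key_ttls.select { |_key, ttl| ttl == -1 || ttl > threshold }
end

sample = [["ns:unique:aaa", -1], ["ns:unique:bbb", 600], ["ns:unique:ccc", 200_000]]
lingering_locks(sample).map(&:first) # => ["ns:unique:aaa", "ns:unique:ccc"]
```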

stat:* is normal, those keys are how we track the jobs processed/failed per day for the Dashboard.

Sidekiq does not duplicate job data in Redis; job data is only ever in one place at a time: in a queue, the retry set, the dead set, or a private queue (with super_fetch).
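
That single-location model can be illustrated in plain Ruby. This is a simplification for illustration only, using the payload fields visible later in this thread; it is not Sidekiq's actual implementation, and the worker class name is hypothetical:

```ruby
require "json"
require "securerandom"

# Illustrative sketch: a queued Sidekiq job is one JSON blob sitting in a
# single Redis list named "queue:<name>"; nothing is duplicated elsewhere.
job = {
  "class"       => "HardWorker",     # hypothetical worker class
  "args"        => [287_937],
  "retry"       => true,
  "queue"       => "audits",
  "jid"         => SecureRandom.hex(12),
  "created_at"  => Time.now.to_f,
  "enqueued_at" => Time.now.to_f
}

payload = JSON.generate(job)
# With redis-rb this would be a single LPUSH (not run here):
#   $redis.lpush("queue:#{job['queue']}", payload)
JSON.parse(payload)["class"] # => "HardWorker"
```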

AlexEvanczuk commented Sep 28, 2018

(This comment has been minimized.)

mperham commented Sep 28, 2018

Using a long expiry is not a good idea; the wiki tries to make it clear that unique locks are best-effort and shouldn't be used as a replacement for idempotency. The unique lock is only removed (by default) when the job succeeds. If the job is somehow lost or dies, the lock will (effectively) never expire and unlock.

I will add an expiry validation that unique locks longer than X hours are discouraged, where X is maybe 48.

It's not possible to map a unique lock back to the job that created it as it's a one-way hash. If you delete the locks, you'll be able to run those jobs again.
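
The one-way nature can be sketched like this. The digest input format below is a guess for illustration, not Sidekiq Enterprise's actual locking scheme; only the "digests don't invert" point carries over:

```ruby
require "digest"

# Hypothetical sketch of why a unique lock can't be mapped back to a job:
# the key embeds a digest of the job identity, and digests don't invert.
# The exact input format is an assumption, NOT Sidekiq Enterprise's scheme.
def unique_lock_key(namespace, klass, args)
  digest = Digest::SHA1.hexdigest("#{klass}|#{args.inspect}")
  "#{namespace}:unique:#{digest}"
end

key = unique_lock_key("app_name_sidekiq_jobs_development", "SomeWorker", [42])
# The 40-char hex digest reveals nothing about "SomeWorker" or [42];
# deleting such a key simply allows an identical job to be enqueued again.
```

Note the 40-character hex key seen earlier in the thread (`...:unique:f93b...8ab8`) is consistent with a SHA1-sized digest.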

AlexEvanczuk commented Sep 28, 2018

We are doing this as a debouncing mechanism: we use an "always unique" unique_for value and then perform_in 10.minutes, which debounces for 10 minutes. It sounds like the suggestion is to use another debouncing mechanism instead. For example, Postgres offers locking strategies that could be used similarly, where we only enqueue the job if there is no Postgres debouncing lock (and the lock is dropped after the job completes).

Thanks for the information regarding lock -> job identification. I think I might just delete the unique keys in this case.

mperham commented Sep 28, 2018

Sure, if you want to debounce for 10 minutes, set unique_for: 10.minutes.
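
A minimal worker sketch of that suggestion (the worker name and queue are hypothetical, and this assumes Sidekiq Enterprise unique jobs are enabled so that unique_for takes effect):

```ruby
class AuditBackfillWorker # hypothetical name
  include Sidekiq::Worker
  # The unique lock expires after 10 minutes, so duplicate enqueues within
  # that window are dropped -- a 10-minute debounce without an "always
  # unique" lock that can outlive a lost job.
  sidekiq_options queue: "audits", unique_for: 10.minutes

  def perform(record_id)
    # ... do the work ...
  end
end

# Further enqueues with the same args within 10 minutes are no-ops
# while the lock is held.
AuditBackfillWorker.perform_in(10.minutes, 287_937)
```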

mperham closed this Sep 28, 2018

AlexEvanczuk commented Oct 3, 2018

@mperham I have 2 questions if you don't mind!

It looks like the increase is the result of queues holding hundreds of thousands of jobs in what my colleague tells me is a reliable-fetch queue.

> lrange hawaiian_ice_sidekiq_jobs_development:queue:audits_1 0 9
 1) "{\"class\":\"PaperTrailEeErIdBackfiller\",\"args\":[287937],\"retry\":true,\"queue\":\"audits\",\"jid\":\"ec0751f4e052a90667548252\",\"created_at\":1533964394.776366,\"enqueued_at\":1533964394.7887251}"

It's also possible to use the Redis Ruby client:

$redis.lrange('hawaiian_ice_sidekiq_jobs_development:queue:audits_1', 0, 9)

all_keys.select{|k| k.include?('queue')}.map{|k| [k, ($redis.llen(k) rescue "another data type for key")]}.sort_by{|arr| arr[1].to_i}
    ... [entries for smaller queues omitted] ...
    [24] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_2",
        [1] 381184
    ],
    [25] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_1",
        [1] 462850
    ],
    [26] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_0",
        [1] 761760
    ],
    [27] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_3",
        [1] 780285
    ]
]

Ideally I'd like to just delete these queues using Sidekiq::Queue.new('audits_0').clear. My question is: why aren't these showing up in the Sidekiq UI as scheduled jobs?

Another question: if the scheduled set is empty, is it generally safe to delete all unique keys? I don't have a deep enough understanding of Sidekiq to know whether this will cause unexpected consequences beyond simply unblocking jobs from running. No jobs should run when the keys are deleted if there is nothing in the scheduled set. Can you confirm whether deleting the keys has any other side effects beyond unblocking jobs from running?

mperham commented Oct 4, 2018

Jobs can be in the scheduled set, dead set, retry set, etc. while holding a unique lock. It's safe to delete a unique lock from Sidekiq's perspective -- all that might happen is a duplicate enqueue.
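
A hedged sketch of such a cleanup. With a real redis-rb client you would pass `$redis`; `TinyStore` below is a stand-in so the sketch runs without a server (it is not part of Sidekiq or redis-rb), and the namespace is whatever prefix your keys carry:

```ruby
# Minimal in-memory stand-in for the two redis-rb calls the sketch uses.
class TinyStore
  def initialize(keys); @keys = keys; end
  def scan_each(match:, &blk)
    prefix = match.chomp("*")
    @keys.select { |k| k.start_with?(prefix) }.each(&blk)
  end
  def del(key); @keys.delete(key); end
  def size; @keys.size; end
end

# Delete every unique-lock key under the given namespace; return the count.
def purge_unique_locks(redis, namespace)
  count = 0
  redis.scan_each(match: "#{namespace}:unique:*") do |key|
    redis.del(key)
    count += 1
  end
  count
end

store = TinyStore.new(["ns:unique:a", "ns:unique:b", "ns:queue:default"])
purge_unique_locks(store, "ns") # => 2; the queue key is untouched
```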

Those queues are private queues from reliable_fetch. reliable_fetch has been deprecated for two years now in favor of super_fetch. The UI does not show the contents of private queues. If you are still using reliable_fetch, I'd urge you to update Sidekiq Pro and upgrade to super_fetch immediately.
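
For reference, switching to super_fetch in Sidekiq Pro is a server-configuration change along these lines (a sketch; consult the Sidekiq Pro wiki for the exact steps for your version):

```ruby
# config/initializers/sidekiq.rb -- sketch of enabling super_fetch
Sidekiq.configure_server do |config|
  config.super_fetch!
end
```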
