
Confused about sidekiq enqueued keys and memory usage #3978

Closed
AlexEvanczuk opened this Issue Sep 26, 2018 · 12 comments


AlexEvanczuk commented Sep 26, 2018

Ruby version: 2.3.5
Sidekiq / Pro / Enterprise version(s):
enterprise, "4.2.10"

I'm looking into some issues on our production application: starting a month ago, it began using Redis excessively even when nothing special was happening.
[screenshot: Redis memory usage graph, 2018-09-25]

This corresponds to an operation that queued up millions of Sidekiq jobs, exceeding the Redis memory and causing Sidekiq to drop jobs. We ended up having to manually delete jobs using the Sidekiq API. However, it appears that some jobs were left over. A couple of days ago Sidekiq queued up and tried to run 500 instances of the jobs that were queued up more than a month ago (all immediately failing because the worker class has since been deleted).

The Sidekiq API is not showing anything, so I looked at our Redis keys to see what I could find.

# SCAN-based iteration avoids blocking Redis the way `KEYS *` would
all_keys = []
$redis.scan_each(match: '*') do |key|
  all_keys << key
end

PRODUCTION 27> all_keys.count
18539

all_keys.map{|k| k.split(":").first}.group_by(&:itself).transform_values(&:count)

{
               "app_name_sidekiq_jobs_development" => 14095,
                                    "cache-production" => 4370,
                                     ... [ OTHER UNRELATED THINGS ] ...
}

$redis.mget('app_name_sidekiq_jobs_development:unique:f93b7540c29b11e99cb33d08d6ec191538bf8ab8')
[
    [0] "4692669172.565897"
]
workers = Sidekiq::Workers.new
workers.each do |process_id, thread_id, work|
  if Time.at(work['run_at']).to_datetime < 1.month.ago
    puts "ProcessId: #{process_id}, Q: #{work['queue']}, JID: #{work['payload']['jid']}, Retry: #{work['payload']['retry']}, RunAt: #{work['run_at'] ? Time.at(work['run_at']).to_datetime : 'nil'},   EnqueuedAt: #{work['payload']['enqueued_at'] ? Time.at(work['payload']['enqueued_at']) : 'nil'}"
  end
end

[
    [0] "0:15:439b379dfc19",
    [1] "1:15:ee39c095b9dd",
    [2] "2:14:e4bb2d1626fe",
    [3] "3:15:5a02e3edfdd5"
]

What is going on here? Why do we have 14K Sidekiq keys with the development suffix? Why aren't they showing up in the API, and why are their values just decimal numbers?

Is this something that Sidekiq::Workers.new.prune would help with?

Thank you!!

mperham commented Sep 26, 2018

(This comment has been minimized.)

AlexEvanczuk commented Sep 26, 2018

Thanks @mperham. I knew there was something fishy going on, and it seems clear now that you say it.

Appreciate your quick response and all the hard work on this service.

AlexEvanczuk commented Sep 27, 2018

Hi @mperham -

I am opening this up again because we realized that these keys were actually production keys based on our configuration:

sidekiq:
  namespace: # YML logic that results in 'development'
  load_periodic_jobs: true
  load_product_periodic_jobs: true

I realize this is not good practice; it was likely a mistake in our initial configuration, and migrating to a new namespace is a future project for us.

For now, we'd like to analyze what is using so much space and maybe clear out any cruft data. Does Sidekiq offer any first-class ways to analyze Redis usage? If not, how might you recommend doing this? Based on the graph above, I think it might have to do with a huge increase in the number of processed jobs. However, I would have thought that Sidekiq would hold the number of processed jobs and other stats in a single key, in O(1) space.

Any advice on this would be greatly appreciated.

Alex

AlexEvanczuk reopened this Sep 27, 2018

mperham commented Sep 27, 2018

Sidekiq does not offer such tools. You are looking for a Redis RDB analyzer:

https://github.com/mperham/sidekiq/wiki/Using-Redis#notes

Your inclination is correct: processed jobs are discarded and do not take memory.

AlexEvanczuk commented Sep 28, 2018

Thanks for the reference to that tool. I hooked up a snapshot of our DB to these analysis tools and found that the vast majority of keys are Sidekiq-related keys of the form sidekiq_jobs_development:unique:.*? (~12K keys) and stat:.*? (~2K keys).

I do not think that the 12K unique keys are necessary, although I don't know the internals of Sidekiq. We have fewer than 2K jobs scheduled, and very often we drop to 0 jobs, yet our memory usage has remained constant. Is it possible Sidekiq is holding onto unique keys for some reason? Is there some way to look into which jobs, if any, are using the unique keys?

I also found that Sidekiq jobs and their parameters seem to be stored across a number of different keys. For example, instead of a single key/value pair for a job, it looks like Sidekiq holds the parameters across many different key/value pairs whose keys reference the unique job ID. Does this sound accurate? Is there (internal?) documentation on this? I am curious about how Sidekiq uses Redis.

Thanks again,
Alex

mperham commented Sep 28, 2018

Unique keys can linger if jobs die or are stuck in the retry queue. They persist for the unique_for duration: if it's 4.hours, the key's TTL is 4 hours.
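
One way to audit which unique locks are lingering is to check their TTLs. The sketch below keeps the selection logic in plain Ruby; the commented line shows how the input might be built with a redis-rb client (`$redis` and the key pattern are assumptions carried over from earlier in this thread, not Sidekiq API):

```ruby
# Hedged sketch: flag unique-lock keys whose TTL suggests they will linger.
# `key_ttls` is an array of [key, ttl_in_seconds] pairs. With a live
# redis-rb client you might build it like this (not run here):
#   key_ttls = $redis.scan_each(match: "*:unique:*").map { |k| [k, $redis.ttl(k)] }
LONG_TTL = 48 * 3600 # 48 hours, the threshold mperham mentions below

def lingering_locks(key_ttls, threshold = LONG_TTL)
  # Redis TTL returns -1 for keys with no expiry at all
  key_ttls.select { |_key, ttl| ttl == -1 || ttl > threshold }
end

sample = [["ns:unique:aaa", -1], ["ns:unique:bbb", 600], ["ns:unique:ccc", 200_000]]
lingering_locks(sample).map(&:first) # => ["ns:unique:aaa", "ns:unique:ccc"]
```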

stat:* is normal, those keys are how we track the jobs processed/failed per day for the Dashboard.

Sidekiq does not duplicate job data in Redis; job data is only ever in one place at a time: in a queue, the retry set, the dead set, or a private queue (with super_fetch).
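
That single-location model can be illustrated in plain Ruby. This is a simplification for illustration only, using the payload fields visible later in this thread; it is not Sidekiq's actual implementation, and the worker class name is hypothetical:

```ruby
require "json"
require "securerandom"

# Illustrative sketch: a queued Sidekiq job is one JSON blob sitting in a
# single Redis list named "queue:<name>"; nothing is duplicated elsewhere.
job = {
  "class"       => "HardWorker",     # hypothetical worker class
  "args"        => [287_937],
  "retry"       => true,
  "queue"       => "audits",
  "jid"         => SecureRandom.hex(12),
  "created_at"  => Time.now.to_f,
  "enqueued_at" => Time.now.to_f
}

payload = JSON.generate(job)
# With redis-rb this would be a single LPUSH (not run here):
#   $redis.lpush("queue:#{job['queue']}", payload)
JSON.parse(payload)["class"] # => "HardWorker"
```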

AlexEvanczuk commented Sep 28, 2018

(This comment has been minimized.)

mperham commented Sep 28, 2018

Using a long expiry is not a good idea; the wiki tries to make it clear that unique locks are best-effort and shouldn't be used as a replacement for idempotency. The unique lock is only removed (by default) when the job succeeds. If the job is somehow lost or dies, the lock will (effectively) never expire and unlock.

I will add an expiry validation that unique locks longer than X hours are discouraged, where X is maybe 48.

It's not possible to map a unique lock back to the job that created it as it's a one-way hash. If you delete the locks, you'll be able to run those jobs again.
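
The one-way nature can be sketched like this. The digest input format below is a guess for illustration, not Sidekiq Enterprise's actual locking scheme; only the "digests don't invert" point carries over:

```ruby
require "digest"

# Hypothetical sketch of why a unique lock can't be mapped back to a job:
# the key embeds a digest of the job identity, and digests don't invert.
# The exact input format is an assumption, NOT Sidekiq Enterprise's scheme.
def unique_lock_key(namespace, klass, args)
  digest = Digest::SHA1.hexdigest("#{klass}|#{args.inspect}")
  "#{namespace}:unique:#{digest}"
end

key = unique_lock_key("app_name_sidekiq_jobs_development", "SomeWorker", [42])
# The 40-char hex digest reveals nothing about "SomeWorker" or [42];
# deleting such a key simply allows an identical job to be enqueued again.
```

Note the 40-character hex key seen earlier in the thread (`...:unique:f93b...8ab8`) is consistent with a SHA1-sized digest.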

AlexEvanczuk commented Sep 28, 2018

We are doing this as a debouncing mechanism: we use an "always unique" unique_for value and then perform_in 10.minutes, which debounces for 10 minutes. It sounds like the suggestion is to use another debouncing mechanism instead. For example, Postgres offers locking strategies that could be used similarly, where we only enqueue the job if there is no Postgres debouncing lock (and the lock is dropped after the job completes).

Thanks for the information regarding lock -> job identification. I think I might just delete the unique keys in this case.

mperham commented Sep 28, 2018

Sure, if you want to debounce for 10 minutes, set unique_for: 10.minutes.
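
A minimal worker sketch of that suggestion (the worker name and queue are hypothetical, and this assumes Sidekiq Enterprise unique jobs are enabled so that unique_for takes effect):

```ruby
class AuditBackfillWorker # hypothetical name
  include Sidekiq::Worker
  # The unique lock expires after 10 minutes, so duplicate enqueues within
  # that window are dropped -- a 10-minute debounce without an "always
  # unique" lock that can outlive a lost job.
  sidekiq_options queue: "audits", unique_for: 10.minutes

  def perform(record_id)
    # ... do the work ...
  end
end

# Further enqueues with the same args within 10 minutes are no-ops
# while the lock is held.
AuditBackfillWorker.perform_in(10.minutes, 287_937)
```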

mperham closed this Sep 28, 2018

AlexEvanczuk commented Oct 3, 2018

@mperham I have 2 questions if you don't mind!

It looks like the increase is the result of queues holding hundreds of thousands of jobs in what my colleague tells me is a reliable-fetch queue.

> lrange hawaiian_ice_sidekiq_jobs_development:queue:audits_1 0 9
 1) "{\"class\":\"PaperTrailEeErIdBackfiller\",\"args\":[287937],\"retry\":true,\"queue\":\"audits\",\"jid\":\"ec0751f4e052a90667548252\",\"created_at\":1533964394.776366,\"enqueued_at\":1533964394.7887251}"

It's also possible to use the Redis Ruby client:

$redis.lrange('hawaiian_ice_sidekiq_jobs_development:queue:audits_1', 0, 9)

all_keys.select{|k| k.include?('queue')}.map{|k| [k, ($redis.llen(k) rescue "another data type for key")]}.sort_by{|arr| arr[1].to_i}
    ... [entries for smaller queues omitted] ...
    [24] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_2",
        [1] 381184
    ],
    [25] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_1",
        [1] 462850
    ],
    [26] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_0",
        [1] 761760
    ],
    [27] [
        [0] "hawaiian_ice_sidekiq_jobs_development:queue:audits_3",
        [1] 780285
    ]
]

Ideally I'd like to just delete these queues using Sidekiq::Queue.new('audits_0').clear. My question is: why aren't these showing up in the Sidekiq UI as scheduled jobs?

Another question: if the scheduled set is empty, is it generally safe to delete all unique keys? I don't have a deep enough understanding of Sidekiq to know whether this will cause unexpected consequences beyond simply unblocking jobs from running. No jobs should run when the keys are deleted if there is nothing in the scheduled set. Can you confirm whether deleting the keys has any other side effects beyond unblocking jobs from running?

mperham commented Oct 4, 2018

Jobs can be in the scheduled set, dead set, retry set, etc. while holding a unique lock. It's safe to delete a unique lock from Sidekiq's perspective -- all that might happen is a duplicate enqueue.
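
A hedged sketch of such a cleanup. With a real redis-rb client you would pass `$redis`; `TinyStore` below is a stand-in so the sketch runs without a server (it is not part of Sidekiq or redis-rb), and the namespace is whatever prefix your keys carry:

```ruby
# Minimal in-memory stand-in for the two redis-rb calls the sketch uses.
class TinyStore
  def initialize(keys); @keys = keys; end
  def scan_each(match:, &blk)
    prefix = match.chomp("*")
    @keys.select { |k| k.start_with?(prefix) }.each(&blk)
  end
  def del(key); @keys.delete(key); end
  def size; @keys.size; end
end

# Delete every unique-lock key under the given namespace; return the count.
def purge_unique_locks(redis, namespace)
  count = 0
  redis.scan_each(match: "#{namespace}:unique:*") do |key|
    redis.del(key)
    count += 1
  end
  count
end

store = TinyStore.new(["ns:unique:a", "ns:unique:b", "ns:queue:default"])
purge_unique_locks(store, "ns") # => 2; the queue key is untouched
```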

Those queues are private queues from reliable_fetch. reliable_fetch has been deprecated for two years now in favor of super_fetch. The UI does not show the contents of private queues. If you are still using reliable_fetch, I'd urge you to update Sidekiq Pro and upgrade to super_fetch immediately.
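
For reference, switching to super_fetch in Sidekiq Pro is a server-configuration change along these lines (a sketch; consult the Sidekiq Pro wiki for the exact steps for your version):

```ruby
# config/initializers/sidekiq.rb -- sketch of enabling super_fetch
Sidekiq.configure_server do |config|
  config.super_fetch!
end
```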
