Slow evalsha causing timeouts #668
At first glance, there is nothing there that points to sidekiq-unique-jobs. I'd venture as far as to suggest you look into monitoring Redis. Keep in mind that with 60 workers, you'll be hitting Redis pretty hard. If you also have Rails or something else hitting Redis from the other side, you could be running out of connections. If you are on something like Heroku, you likely need to increase your timeouts to Redis. There is also the whole matter of connection pools (Sidekiq usually does a good job, but you might still want to look into it). As for timeouts, I believe I saw something like 8 seconds for Heroku, but I could be mistaken.
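If it helps, a minimal sketch of what raising those settings can look like (values are illustrative, not recommendations):

```ruby
# Sketch: raising the Redis network timeout and pool wait in Sidekiq.
# Both options are passed through to the underlying Redis connection.
Sidekiq.configure_server do |config|
  config.redis = {
    url: ENV["REDIS_URL"],
    network_timeout: 5, # socket read/write timeout, in seconds
    pool_timeout: 5     # how long to wait for a free pooled connection
  }
end

Sidekiq.configure_client do |config|
  config.redis = { url: ENV["REDIS_URL"], network_timeout: 5 }
end
```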
Figured it out! First of all, thanks for your reply. I've continued monitoring and digging, and I kept being pointed in the direction of that Lua script being called through evalsha. This particular queue receives 2 types of jobs: a worker that fetches some data (without any uniqueness config), and a cleanup worker, which had a uniqueness constraint. What happened is that every "fetching" worker would enqueue said cleanup worker x time in the future. But that cleanup worker had the following config:
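(A sketch of the shape of that setup, since a replace-on-conflict uniqueness strategy is the kind of thing that ends up calling the deletion script; the lock strategy and values here are assumptions for illustration:)

```ruby
# Hypothetical reconstruction for illustration only -- the lock
# strategy, conflict handling, and timing are assumptions.
class CleanupWorker
  include Sidekiq::Worker

  sidekiq_options lock: :until_executed,
                  on_conflict: :replace

  def perform(resource_id)
    # ... cleanup logic ...
  end
end

# Each fetching worker schedules the cleanup some time in the future:
CleanupWorker.perform_in(60 * 60, resource_id) # one hour out
```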
As the queue was being processed, at 50-100 jobs a second, Redis clogged up, which I think makes sense, considering https://github.com/mhenrixon/sidekiq-unique-jobs/blob/f09d41b818286ed5549ac90ccc2ea386be9b5ca4/lib/sidekiq_unique_jobs/lua/shared/_delete_from_queue.lua has to iterate over (a part of) 500,000+ jobs in steps of 50 (see the sketch below). Moving the cleanup job to a short queue completely fixed it. Is this expected behavior?
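Schematically, that chunked scan translates to something like this Ruby (a simplification for illustration, not the gem's actual code):

```ruby
# Rough Ruby equivalent of walking a queue list in steps of 50,
# looking for the entry that contains a given digest. Worst case
# this touches every job in the queue -- O(n) per deletion.
PER = 50

def delete_from_queue(redis, queue, digest)
  offset = 0
  loop do
    chunk = redis.lrange(queue, offset, offset + PER - 1)
    return nil if chunk.empty?

    if (item = chunk.find { |job| job.include?(digest) })
      redis.lrem(queue, 1, item)
      return item
    end

    offset += PER
  end
end
```

At 50-100 deletions a second against a 500,000-entry queue, that pattern would keep one Redis core pinned, which lines up with the 100% CPU reported above.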
I was bit pretty hard by this. I had a couple of workers that used the same replace-on-conflict uniqueness setup on scheduled jobs. Do you think a separate set should be used for unique scheduled jobs, maybe keyed on unique ID? Something with constant-time deletes would be super beneficial for the schedule queue.
Uff! Sorry to hear that, @ezekg. I just saw your status update on Twitter/X. I wasn't aware you were using the gem as well. I got some heavy hitters using the gem, and at that scale the scheduled queue would be a problem.

It is late right now, and I have been working on recursion and converting MongoDB to PostgreSQL, but I will look at this first thing tomorrow morning. I could log a warning as I reschedule the job in case the queue size hits some number. That said, monitoring the Sidekiq latency and queue size is likely a great idea; see the sketch below.

If you want to jump on a call and discuss our options, I'm always down. I have been working alone for so long, I wouldn't mind some company.
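Something along these lines, using Sidekiq's public API (the thresholds are arbitrary examples):

```ruby
require "sidekiq/api"

# Minimal monitoring sketch: warn before the schedule set grows big
# enough to make digest scans expensive. Thresholds are arbitrary.
scheduled_size = Sidekiq::ScheduledSet.new.size
queue_latency  = Sidekiq::Queue.new("default").latency # in seconds

warn "schedule set is getting large: #{scheduled_size}" if scheduled_size > 50_000
warn "queue latency is high: #{queue_latency}s" if queue_latency > 30
```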
Not your fault. We typically had ~20k scheduled jobs in there, but as a new feature was adopted, the queue size also started to grow. The straw that broke the camel's back was a new customer adopting the feature for a large user base, which ballooned the queue from ~20k to 250k in short order.

As it grew, our performance dropped and Redis read/write/connect timeouts started to happen, but I couldn't figure out why. We've handled that load before without a hitch, and our Redis instance is beefy, but this time something was different. The difference was the workload: nobody else pushed that many jobs to the schedule queue, and so each time a job was pushed, Redis would block while it iterated the schedule queue trying to find any unique jobs to replace on conflict.

Don't feel bad. It's absolutely not your fault, and this gem has been monumental in scaling Keygen (which is why we sponsor it!). I just wanted to share so that we could think of a solution that could scale for the next users. 🙂 I knew my heavy usage of scheduled jobs was going to bite me eventually, and oh did it.
My idea was to move this part to Ruby to avoid doing such a long-running task in Lua. Or perhaps check the latency or the size of the queues before calling Lua, and either helpfully crash or use a Ruby implementation instead. I initially went with Lua because it ensures the job isn't picked up by Sidekiq while I am trying to replace it (scripts execute atomically in Redis). Maybe that's an edge case, and it might be possible to solve with a reliable fetcher.
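For the Ruby route, one non-blocking shape would be an incremental ZSCAN over the schedule set, so each Redis call stays short even though the total work is still O(n). A sketch, assuming "schedule" as the sorted set Sidekiq uses for scheduled jobs:

```ruby
# Sketch: many short ZSCAN calls instead of one long Lua script.
# Redis can serve other clients between iterations, but note this
# loses Lua's atomicity -- Sidekiq could pick the job up mid-scan,
# which is exactly the edge case mentioned above.
def find_scheduled_job(redis, digest)
  redis.zscan_each("schedule", match: "*#{digest}*") do |member, _score|
    return member
  end
  nil
end

# Usage (hypothetical):
# payload = find_scheduled_job(redis, digest)
# redis.zrem("schedule", payload) if payload
```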
Yeah, I guess moving to Ruby could potentially be better, since it'd be non-blocking compared to a single Lua script. But it'd still be iterating over potentially millions of scheduled jobs, and I think that's where the inefficiency lies. Iterating a queue that big just hammers Redis if it's done too often, and in my case it was being done tens of times per second. And the job being searched for was likely towards the end of the queue, since these jobs were being scheduled far out.

Not sure what else you could do, though. You can't apply any clever search algorithm, since the data is essentially random: finding the job with a given unique digest requires a string search over the members. If the schedule queue were a different data structure (it's a sorted set right now, IIRC), e.g. a hash keyed on jid, it'd make O(1) lookups (and deletions/replacements) possible by keeping an internal mapping of unique job digest to jid, but right now that's not how Sidekiq is built.

The only other solution I can think of would be to keep a separate queue for unique jobs, with a better data structure, and maybe tell Sidekiq to pull from that queue as well when looking for scheduled jobs.
Thinking some more: if we kept an internal mapping of unique job digest to job payload, we could do a fast ZREM of the exact member. But if we kept a mapping of digest to score (timestamp) of the job in Sidekiq's schedule set, we could narrow the ZRANGE down to just that score instead of scanning the whole set.

Here's some pseudo-code to show what I'm thinking:

```lua
-- Pseudo-code: "digest-scores" is a hash, populated at enqueue time,
-- mapping each unique digest to the score the job was scheduled at.
local function delete_from_sorted_set(name, digest)
  local per = 50

  -- Look up the score recorded for this digest.
  local score = redis.call("HGET", "digest-scores", digest)
  if not score then
    return nil
  end

  -- Only fetch members in a small score window, instead of
  -- iterating the entire sorted set.
  local items = redis.call("ZRANGE", name, score, score + per - 1, "BYSCORE")

  for _, item in pairs(items) do
    if string.find(item, digest) then
      redis.call("ZREM", name, item)
      return item
    end
  end
end
```

May need to be adjusted to loop if more than 50 jobs are scheduled at the exact same time (score). What do you think? Would reduce the search space if we used the score.
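For that to work, something has to record the score at enqueue time. A minimal sketch of that side as a Sidekiq client middleware (the digest-scores hash and the middleware are hypothetical, not part of the gem, and the digest key in the payload varies by gem version):

```ruby
# Hypothetical client middleware that records each unique scheduled
# job's score, so the Lua sketch above can narrow its ZRANGE.
class RecordDigestScore
  def call(worker_class, job, queue, redis_pool)
    digest = job["lock_digest"] || job["unique_digest"] # version-dependent key
    if digest && job["at"] # "at" is the epoch score Sidekiq schedules with
      redis_pool.with do |redis|
        redis.hset("digest-scores", digest, job["at"])
      end
    end
    yield
  end
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add RecordDigestScore
  end
end
```

The entries would also need a matching HDEL when a job runs or is removed, otherwise the digest-scores hash grows without bound.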
Hi Mikael, first of all thank you so much for your work on this lib! We are very satisfied with it. I've tried my best below to be as precise and provide as much context as possible, as I believe the issue is related to sidekiq-unique-jobs, but I am aware I could be mistaken.
Describe the bug
Our workers are reporting a lot of `Redis::TimeoutError` (connection timed out) errors. On the server I can see Redis is using 100% CPU (one core). The slowlog is filled with entries that I can to some degree trace back to sidekiq-unique-jobs through the `delete_job_by_digest` argument (see fields 1, 4, and 11 of the slowlog entries).

When the queues are small, this does not seem to be an issue, but when we increased from 180,000 to 500,000 enqueued jobs, the amount of timeouts increased tremendously. We used to have 60 concurrent workers of this class without any issue. What stands out for me: we have not defined any unique-jobs-specific config in the worker, and the worker itself does nothing complex.
Expected behavior

No timeouts.

Current behavior

±150 very slow `evalsha` calls.

Worker class
Additional context

SidekiqUniqueJobs was updated today from 7.0.8 to 7.1.12, to no avail. Sidekiq 6.3.1. `redis-cli info` returns:

Does this ring any bells? Thanks a lot in advance. If I can provide any more info, please let me know.