lost jobs in busy queue - reliable_fetch not working #1527
Can you provide more info? Sidekiq/Pro/Ruby versions? Is it always the same type of job or all different types? Are you using any Sidekiq extensions or middleware? @jonhyman has been running 100,000+ jobs/min in his system with reliable fetch with great success now.
Sidekiq Pro. @jonhyman - whoa! We've been struggling to get that kind of performance... we're processing about 100k jobs per hour and struggling to scale up to that. Definitely could use your advice. Any chance you're on Heroku? Here's my sidekiq config:
Here's a barrage of questions: How often is this happening? Do you have logging around errors Sidekiq sees? What else is connecting to your Redis server, and do you notice any errors around that? The path I'm trying to go down here is whether there are connectivity issues to Redis which might cause some jobs to get "stuck". We are not on Heroku; we had been entirely on AWS, but now we're almost done migrating to physical servers at Rackspace. What I've seen before is connectivity issues to the AZ where my Redis was located.

The interesting thing to me is that you say that "sometimes they process" -- are you sure about that, or do you just see them disappear from the Busy page? That is, do you have logging or a way to confirm that the job finished? If the job is "stuck processing", can you TTIN the worker to get a backtrace of what the threads are doing?

As for "[restarting] cleared the busy queue without processing any of the jobs" -- do you know if the jobs processed, via logging? The "busy queue" on the Sidekiq dashboard is just a Redis set that workers add to before a job starts and remove from when it finishes. If the worker is hanging, I'd expect a restart of Sidekiq to re-insert the jobs from the local queue into the main queue; then when the process exits it might clear from the busy queue.

When you restart the process, what log statements do you see from Sidekiq? Sidekiq should log how many quiet threads it was shutting down when you INT/TERM it, and if it has to force-kill threads, it should tell you how many and the jobs in each. What do those logs say?
@jonhyman - This is happening randomly... maybe once or twice a day... but when it does, the jobs don't finish processing. We know this because we added logging around the jobs. When it happens we typically see a lot of these errors:
There are a couple of jobs where we have to establish a connection to a new database, and I think this might be what's causing the issue. It seems like calling establish_connection may be clearing out the existing connection pool. I have yet to investigate or verify this. When we establish another connection we are doing the following:
Any thoughts?
Sorry, we don't use ActiveRecord, so I do not have experience with it.

Sent from my mobile device
If you are connecting to multiple databases, you need to ensure that each database connection pool is sized correctly, not just the default pool. That AR establish_connection code looks pretty suspect. I wonder if you can monkeypatch AR to change the pool size default of 5 to 30?
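A minimal sketch of that idea, assuming you funnel every connection spec through a wrapper before it reaches establish_connection (`with_pool_default` is a hypothetical helper, not ActiveRecord API):

```ruby
# Hypothetical helper: make sure every spec passed to establish_connection
# carries a pool at least as large as Sidekiq's concurrency, rather than
# falling back to ActiveRecord's default of 5.
def with_pool_default(spec, min_pool: 30)
  pool = spec.fetch(:pool, 5)
  spec.merge(pool: [pool, min_pool].max)
end

spec = { adapter: "postgresql", database: "reports", pool: 5 }
with_pool_default(spec) # the pool is raised to 30
# ActiveRecord::Base.establish_connection(with_pool_default(spec))
```

The idea is just to centralize the pool sizing in one place so a forgotten `pool:` key in one spec can't silently fall back to 5.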
Can you provide the source for the worker in question?
@mperham - here is the basic worker that just uploads a file to S3. I can try to monkeypatch AR, but it does seem like it's setting the pool size correctly. The strange thing is that despite setting all the configs to increase timeout, pool, etc. before establishing any connections, I'm still randomly getting this. There doesn't appear to be any reason why.

Here's an example of my worker.
There's no perform method. It's pretty important we see the exact code, not pseudo-code. Have you tried without hirefire just to make sure it's not introducing some weirdness?
Oh, perhaps you're using
I'm just calling it using delay:
I have tried using scaling methods other than hirefire and got the same issues. I suppose it could be a result of adding and removing workers, but I haven't been able to test that.
My gut feeling is that this is an HTTP timeout issue, but I have no data to back that up. I don't know what store_in_s3 is doing, but make sure it has proper read and write timeouts set.
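For example, with plain Net::HTTP the relevant knobs look like this (the endpoint and values are illustrative; the actual store_in_s3 implementation may use an S3 client with its own timeout options):

```ruby
require "net/http"
require "uri"

# Explicit timeouts so a hung upload fails fast instead of pinning a
# Sidekiq thread indefinitely (values are illustrative).
uri  = URI.parse("https://s3.amazonaws.com/")
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl      = true
http.open_timeout = 5    # seconds to establish the TCP connection
http.read_timeout = 30   # seconds to wait on a blocked read
```

Without these, a socket that never returns data can hold a worker thread for far longer than the job should take, which looks exactly like a job "stuck" on the Busy page.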
This is the ActiveRecord connection pool method that is throwing the error... and when I lose jobs from the busy queue, or the busy queue gets stuck, I have a lot of those database connection errors. I don't believe it's an HTTP timeout issue, because I'm seeing nothing related to that in my logs. I'm guessing that there's something going on that's not thread-safe in ActiveRecord, or in the way I'm connecting via ActiveRecord.
I don't know if a lot of people are using 4.1 yet, so it's possible it's a Rails pool issue. Can you give a full stack trace? Determining who's holding onto the connections and why they haven't been returned to the pool will require a lot of debugging on your part; I can't help with that. One thing you can do is further increase the pool size and see what effect that has: does it increase the time between error storms?
Did this get resolved? I saw you opened another, different issue now.
@mperham I'm having the same problem. Right now we're chasing down a bug with Heroku's Ruby 2.0 installation. When this happens, Sidekiq crashes and any jobs being worked get stuck in the busy queue. When the dyno reboots, these jobs never get worked again but just stay in the queue forever. I played around with some code to detect these zombie jobs 'til I realized reliable fetch should be preventing it:

```ruby
workers    = Sidekiq::Workers.new
client     = Sidekiq::Client.new
stale_time = 10.minutes.ago

workers.each do |name, work, started_at|
  time = Time.parse(started_at)
  next unless time < stale_time

  queue   = work.fetch("queue")
  payload = work.fetch("payload")
  jid     = payload.fetch("jid")
  bid     = payload.fetch("bid")
  payload["resurrected"] = true # was going to use this for a check to prevent infinite re-resurrection of the offending job
  puts "Detected #{payload.fetch('class')} zombie job on #{queue} worker set, pushing back into queue (job #{jid}, batch #{bid})"
  client.push(payload)
end; nil

workers.prune(10.minutes)
```

sidekiq (2.17.7). If it helps, here's how I'm able to reproduce the segfault:

```ruby
require "net/https"
require "uri"

login_url = "https://publishers.chitika.com/login"
uri       = URI.parse(login_url)

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true

request  = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
```
@subelsky when the dyno reboots, does its hostname change? If so, then I would expect the jobs to stay stuck in the queue. |
it's definitely possible - I'm using the
Glad to see this is not just me. We're still having the same issues as @subelsky... sounds like the exact same problem. Also using the

@subelsky -- we're actually seeing the jobs eventually disappear from the busy queue (after an hour or so), but they often never get processed.
Yep, if the hostname is dynamic, reliable fetch doesn't work well as designed. You'd need to "rescue" those orphaned working queues manually and put the jobs back into the main queue. I don't have a workaround for this at the moment. If someone can enter a new issue, I hope to get to this as part of my Sidekiq Pro 2.0 effort in April/May.
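A sketch of that rescue logic, using an in-memory stand-in for the two Redis lists (against real Redis you'd use RPOPLPUSH for the same atomic tail-to-head move; the queue names here are hypothetical, not Sidekiq Pro's actual key format):

```ruby
# Orphaned private working queue left behind by a dead dyno, plus the main
# queue its jobs should return to. Plain arrays stand in for Redis lists:
# pop takes from the tail and unshift pushes to the head, mirroring RPOPLPUSH.
queues = {
  "queue:default_old-hostname" => ["job1", "job2"],
  "queue:default"              => []
}

def requeue_orphans(queues, orphan, main)
  queues[main].unshift(queues[orphan].pop) until queues[orphan].empty?
  queues
end

requeue_orphans(queues, "queue:default_old-hostname", "queue:default")
# queues["queue:default"] now holds both jobs; the orphan queue is empty
```

Moving one job at a time (rather than reading the whole list and deleting it) means a crash mid-rescue can't lose jobs: every job is always in exactly one of the two lists.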
@mperham a quick workaround would be to let them pass in the hostname via something like |
Though I just realized that if the value of $DYNO gets reassigned to a worker that processes a different queue, you'll get jobs stuck anyway. That is, let's say that $DYNO == 5. A worker processing the "foo" queue fails, but you decide to scale up workers which process the "bar" queue before that "foo" worker gets replaced. A new "bar" worker with $DYNO == 5 will be consuming the "bar" queue, and the "foo" job will get stuck. |
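That failure mode can be made concrete with a tiny sketch (both key formats are hypothetical, not what Sidekiq Pro actually uses):

```ruby
# Hypothetical naming schemes for the per-process working queue.
def key_by_dyno(dyno)
  "queue:working_#{dyno}"
end

def key_by_queue_and_dyno(queue, dyno)
  "queue:#{queue}_#{dyno}"
end

# Keyed by dyno alone, a new "bar" worker reusing DYNO=5 gets the same key
# as the dead "foo" worker and would claim its private queue:
key_by_dyno(5) == key_by_dyno(5)

# Including the queue avoids the false match, but the orphaned "foo" key
# now has no owner at all, so its jobs are still stuck:
key_by_queue_and_dyno("foo", 5) != key_by_queue_and_dyno("bar", 5)
```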
There's no easy solution here. I'll continue to think about ways we can improve reliable fetch to work with Heroku and other systems with dynamic hostnames. Giving people more knobs to twiddle (like -h) often means more ways to misconfigure and break things. I prefer to have things auto-configure if possible.
@mperham In sync; the more I thought about it, the more brittle it became.
@mperham I can write up some zombie-job rescue code, but is there something missing from my code above? Seems like it was not copacetic with batches. Maybe I should push them without the original
Thanks! May I close this issue @anazar? Can you confirm your lost jobs are due to changing hostnames? |
@mperham I'm not having this problem anymore; the fix works great.
We are still noticing a large number of jobs stuck in the busy queue. These are high-priority jobs, like emails, that need to be processed immediately but instead sit unprocessed in the busy queue for a while. Sometimes they process after 30 minutes or more... other times they just get removed from the busy queue without processing.
As a test we tried to restart our heroku workers. This cleared the busy queue without processing any of the jobs.
We're using reliable_fetch and reliable_push as well.