lost jobs in busy queue - reliable_fetch not working #1527

Closed
anazar opened this issue Mar 3, 2014 · 28 comments

anazar commented Mar 3, 2014

We are still noticing a large number of jobs stuck in the busy queue. These are high-priority jobs, like emails, that need to be processed immediately, but instead they sit unprocessed in the busy queue for a while. Sometimes they process after 30 minutes or more; other times they just get removed from the busy queue without processing.

As a test we tried to restart our heroku workers. This cleared the busy queue without processing any of the jobs.

We're using reliable_fetch and reliable_push as well.

mperham commented Mar 3, 2014

Can you provide more info? Sidekiq/Pro/Ruby versions? Is it always the same type of job or all different types? Are you using any Sidekiq extensions or middleware?

@jonhyman has been running 100,000+ jobs/min in his system with reliable fetch with great success now.

anazar commented Mar 3, 2014

Sidekiq pro
Sidekiq master
Ruby 2.0.0
Rails 4.1.0.rc1
Hirefire (to scale worker dynos... we're scaling up to 5)

@jonhyman - whoa! We've been struggling to get that kind of performance... we're processing about 100k jobs per hour and struggling to scale up to anything like that. We could definitely use your advice. Any chance you're on Heroku?

Here's my sidekiq config:

require 'sidekiq'
require 'sidekiq/pro/api'
require 'sidekiq/pro/reliable_push'

max_workers = 5
sidekiq_concurrency_number = 25

Sidekiq.configure_server do |config|
  require 'sidekiq/pro/reliable_fetch'

  # Size the ActiveRecord pool a little larger than Sidekiq's concurrency so
  # worker threads aren't left waiting on a database connection.
  sidekiq_pool = sidekiq_concurrency_number + 5

  if defined?(ActiveRecord::Base)
    Rails.logger.debug("Setting custom connection pool size of #{sidekiq_pool} for Sidekiq Server")
    conf = Rails.application.config.database_configuration[Rails.env]
    conf['reaping_frequency']       = 20 # seconds
    conf['dead_connection_timeout'] = 600
    conf['checkout_timeout']        = 20
    conf['timeout']                 = 30
    conf['pool']                    = sidekiq_pool
    ActiveRecord::Base.establish_connection(conf)

    Rails.logger.info("Connection pool size for Sidekiq Server is now: #{ActiveRecord::Base.connection.pool.instance_variable_get('@size')}")
  end
end

jonhyman commented Mar 3, 2014

Here's a barrage of questions: How often is this happening? Do you have logging around errors Sidekiq sees? What else is connecting to your Redis server, and do you notice any errors around that? The path I'm trying to go down is whether there are connectivity issues to Redis which might cause some jobs to get "stuck".

We are not on Heroku; we had been entirely on AWS, but now we're almost done migrating to physical servers at Rackspace. What I've seen before is connectivity issues to the AZ where my Redis was located. My best guess was that an rpoplpush into the local queue finished on the server but never made its way back to the client because it timed out. In that case, you wouldn't see anything in your "Busy" queue on the Sidekiq dashboard, but the job would be "stuck". To combat this, we wrote custom functional monitors that run every 30 minutes to look for "stuck" jobs and alert us. It happens infrequently, but it does happen.

The interesting thing to me is that you say "sometimes they process" -- are you sure about that, or do you just see them disappear from the Busy page? That is, do you have logging or another way to confirm that the job finished? If a job is "stuck processing", can you TTIN the worker to get a backtrace of what the threads are doing?

As for "[restarting] cleared the busy queue without processing any of the jobs": do you know from your logging whether those jobs processed? The "busy queue" on the Sidekiq dashboard is just a Redis set that workers add to before a job starts and remove from when it finishes. If the worker is hanging, I'd expect a restart of Sidekiq to re-insert the jobs from the local queue into the main queue, and when the process exits the entries might clear from the busy set. When you restart the process, what log statements do you see from Sidekiq? Sidekiq should log how many quiet threads it was shutting down when you INT/TERM it, and if it has to force-kill threads, it should tell you how many and the job each one was running. What do those logs say?
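
For reference, a minimal sketch of the TTIN step mentioned above, assuming you can signal the Sidekiq process directly (the pid is hypothetical; use whatever your process monitor reports). Sidekiq's TTIN handler logs a backtrace for each of its threads:

# Ask a running Sidekiq process to dump every thread's backtrace to its log.
sidekiq_pid = 12345 # hypothetical pid of the Sidekiq process
Process.kill("TTIN", sidekiq_pid)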

anazar commented Mar 7, 2014

@jonhyman - This is happening randomly... maybe once or twice a day... but when it does, the jobs don't finish processing. We know this because we added logging around the jobs.

When it happens we are typically seeing a lot of these errors:

ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds

There are a couple of jobs where we have to establish a connection to a different database, and I think this might be what's causing the issue. It seems like calling establish_connection may be clearing out the existing connection pool. I have yet to investigate or verify this.

When we establish another connection we are doing the following:

  establish_connection ENV['OTHER_DATABASE_URL']
  config = Rails.application.config.database_configuration[Rails.env]
  config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 # seconds
  config['pool']            = ENV['DB_POOL'] || 30
  config['checkout_timeout'] = 10
  config['dead_connection_timeout'] = 600
  config['timeout'] = 15
  ActiveRecord::Base.establish_connection(config)

Any thoughts?
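
A rough sketch of one way to keep the secondary database off the default pool, assuming the second ActiveRecord::Base.establish_connection call really is replacing the pool the workers are using (OtherDb and ExternalRecord are hypothetical names):

# Give the other database its own abstract base class and pool, leaving
# ActiveRecord::Base untouched.
class OtherDb < ActiveRecord::Base
  self.abstract_class = true
  establish_connection(ENV['OTHER_DATABASE_URL'])
end

# Models backed by the other database inherit from OtherDb and check
# connections out of OtherDb's pool, not the default one.
class ExternalRecord < OtherDb
  self.table_name = 'external_records'
end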

jonhyman commented Mar 7, 2014

Sorry, we don't use ActiveRecord, so I don't have experience with it.

mperham commented Mar 7, 2014

If you are connecting to multiple databases, you need to ensure that each database connection pool is sized correctly, not just the default pool. That AR establish_connection code looks pretty suspect.

I wonder if you can monkeypatch AR to change the pool size default of 5 to 30?
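
A very rough sketch of that monkeypatch idea, under the assumption that Rails 4.1's ConnectionPool reads its size from spec.config[:pool] and falls back to 5; treat it as untested:

# Hypothetical initializer: raise the implicit pool-size default for any
# connection spec that doesn't set one explicitly.
module LargerDefaultPool
  def initialize(spec)
    spec.config[:pool] ||= 30
    super
  end
end

# send(:prepend, ...) because Module#prepend is still private on Ruby 2.0.
ActiveRecord::ConnectionAdapters::ConnectionPool.send(:prepend, LargerDefaultPool)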

mperham commented Mar 7, 2014

Can you provide the source for the worker in question?

anazar commented Mar 7, 2014

@mperham - here is the basic worker; it just uploads a file to S3. I can try to monkeypatch AR, but it does seem like it's setting the pool size correctly.

The strange thing is that despite setting all the configs to increase the timeout, pool size, etc. before establishing any connections, I'm still randomly getting this error. There doesn't appear to be any reason why.

ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)

Here's an example of my worker.

require 'open-uri'

class S3Worker

  def self.upload_to_s3 upload_id

    upload = Upload.where(id: upload_id).first

    return if !upload

    begin
      file_content = open(upload.file_url).read
      return false if !file_content
    rescue
      return false
    end

    s3_key = upload.s3_path

    success = upload.store_in_s3(file_content, s3_key)

    if success
      upload.s3 = s3_key
      upload.save
      puts "Uploaded: #{upload.id}"
    end

    return success
  end

end

mperham commented Mar 7, 2014

There's no perform method. It's pretty important we see the exact code, not pseudo-code.

Have you tried without Hirefire just to make sure it's not introducing some weirdness?

mperham commented Mar 7, 2014

Oh, perhaps you're using .delay?

anazar commented Mar 7, 2014

I'm just calling it using delay:

S3Worker.delay_for(10.minutes, queue: :seldom).upload_to_s3 upload.id

I have tried using scaling methods other than hirefire and got the same issues. I suppose it could be a result of adding and removing workers but I haven't been able to test that.
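
For comparison, the same job written as an explicit Sidekiq worker with a perform method; S3UploadWorker is a hypothetical name, and the enqueue call below mirrors the delay_for call above:

class S3UploadWorker
  include Sidekiq::Worker
  sidekiq_options queue: :seldom

  def perform(upload_id)
    # Delegate to the existing class method so the upload logic stays in one place.
    S3Worker.upload_to_s3(upload_id)
  end
end

# Enqueue with the same 10-minute delay:
S3UploadWorker.perform_in(10.minutes, upload.id)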

mperham commented Mar 7, 2014

My gut feeling is that this is an HTTP timeout issue, but I have no data to back that up. I don't know what store_in_s3 is doing, but make sure it has proper read and write timeouts set.
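
As one illustration of that advice, the open-uri download in the worker above can be given an explicit read timeout (the 30-second value is arbitrary; timeouts for store_in_s3 itself depend on whichever S3 client it wraps):

require 'open-uri'

# Cap how long the download can block a Sidekiq thread; without a timeout,
# a slow remote host can tie up the thread (and any database connection it
# has checked out) indefinitely.
file_content = open(upload.file_url, read_timeout: 30).read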

anazar commented Mar 7, 2014

This is the ActiveRecord connection pool method that is throwing the error; when I lose jobs from the busy queue, or the busy queue gets stuck, I see a lot of those database connection errors:
https://github.com/rails/rails/blob/4-1-stable/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L176

I don't believe it's an HTTP timeout issue because I'm seeing nothing related to that in my logs.

I'm guessing that there's something going on that's not thread-safe in ActiveRecord, or in the way I'm connecting via ActiveRecord.

mperham commented Mar 7, 2014

I don't know if a lot of people are using 4.1 yet so it's possible it's a Rails pool issue.

Can you give a full stack trace? Determining who's holding onto the connections and why they haven't been returned to the pool will require a lot of debugging on your part; I can't help with that. One thing you can do is further increase the pool size and see what effect that has: does it increase the time between error storms?

mperham commented Mar 13, 2014

Did this get resolved? I saw you opened another, different issue now.

@subelsky

@mperham I'm having the same problem. Right now we're chasing down a bug with Heroku's Ruby 2.0 installation where Net::HTTP can cause a segfault when connecting to certain SSL hosts.

When this happens, Sidekiq crashes and any jobs being worked get stuck in the busy queue. When the dyno reboots, these jobs never get worked again; they just stay in the queue forever. I played around with some code to detect these zombie jobs until I realized reliable fetch should be preventing it.

workers = Sidekiq::Workers.new
client = Sidekiq::Client.new

stale_time = 10.minutes.ago

workers.each do |name,work,started_at|
  time = Time.parse(started_at) 
  next unless time < stale_time

  queue = work.fetch("queue")
  payload = work.fetch("payload")

  jid = payload.fetch("jid")
  bid = payload.fetch("bid")
  payload["resurrected"] = true # was going to use this for a check to prevent infinite re-resurrection of the offending job

  puts "Detected #{payload.fetch('class')} zombie job on #{queue} worker set, pushing back into queue (job #{jid}, batch #{bid})"
  client.push(payload)
end ; nil

workers.prune(10.minutes)

sidekiq (2.17.7)
sidekiq-pro (1.4.3)
ruby 2.0.0p451 on heroku

If it helps, here's how I'm able to reproduce the segfault:

require "net/https"
require "uri"
login_url = "https://publishers.chitika.com/login"
uri = URI.parse(login_url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)

@jonhyman

@subelsky when the dyno reboots, does its hostname change? If so, then I would expect the jobs to stay stuck in the queue.

@subelsky

It's definitely possible - I'm using the sidekiq -i $DYNO trick, plus we are always tweaking the number of workers we have running with Hirefire.

anazar commented Mar 17, 2014

Glad to see this is not just me. We're still having the same issues as @subelsky... it sounds like the exact same problem. We're also using the sidekiq -i $DYNO trick.

@subelsky -- we're actually seeing the jobs eventually disappear from the busy queue (after an hour or so), but they often never get processed.

mperham commented Mar 17, 2014

Yep, if the hostname is dynamic, reliable fetch as designed doesn't work well. You'd need to "rescue" those orphaned working queues manually and put the jobs back into the main queue. I don't have a workaround for this at the moment. If someone can enter a new issue, I hope to get to this as part of my Sidekiq Pro 2.0 effort in April/May.
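
A rough sketch of that manual rescue; the orphaned key name below is invented for illustration, since reliable fetch derives its working-queue names from the hostname and index, so check Redis (for example, KEYS 'queue:*') for the keys your dead dynos actually left behind:

require 'sidekiq'

orphaned_key = 'queue:seldom_some-dead-dyno.1' # hypothetical; look up the real key
main_key     = 'queue:seldom'

Sidekiq.redis do |conn|
  # RPOPLPUSH atomically moves one job at a time until the orphan is empty.
  while conn.rpoplpush(orphaned_key, main_key)
  end
end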

@jonhyman

@mperham a quick workaround would be to let them pass in the hostname via something like -h instead of using Socket.gethostname. Heroku users could then do something like -i $DYNO -h heroku. Though if you have a real solution in mind, I think that's preferable to a hack.

@jonhyman

Though I just realized that if the value of $DYNO gets reassigned to a worker that processes a different queue, you'll get jobs stuck anyway. That is, let's say that $DYNO == 5. A worker processing the "foo" queue fails, but you decide to scale up workers which process the "bar" queue before that "foo" worker gets replaced. A new "bar" worker with $DYNO == 5 will be consuming the "bar" queue, and the "foo" job will get stuck.

mperham commented Mar 18, 2014

There's no easy solution here. I'll continue to think about ways we can improve reliable fetch to work with Heroku and other systems with dynamic hostnames. Giving people more knobs to twiddle (like -h) often means more ways to misconfigure and break things. I prefer to have things auto-configure if possible.

@jonhyman

@mperham We're in sync; the more I thought about it, the more brittle it became.

@subelsky

@mperham I can write up some zombie-job rescue code, but is there something missing from my code above? It seems like it was not copacetic with batches. Maybe I should push them without the original jid, but I was trying to disturb the messages as little as possible.

@subelsky

@mperham per your request I made a new issue #1573

mperham commented Mar 18, 2014

Thanks! May I close this issue @anazar? Can you confirm your lost jobs are due to changing hostnames?

@subelsky

@mperham I'm not having this problem any more; the fix works great.

seuros closed this as completed Jul 8, 2014