lost jobs in busy queue - reliable_fetch not working #1527

Closed
anazar opened this issue Mar 3, 2014 · 28 comments

anazar commented Mar 3, 2014

We are still noticing a large number of jobs stuck in the busy queue. These are high-priority jobs, like emails, that need to be processed immediately, but instead they sit unprocessed in the busy queue for a while. Sometimes they process after 30 minutes or more; other times they just get removed from the busy queue without processing.

As a test we tried to restart our heroku workers. This cleared the busy queue without processing any of the jobs.

We're using reliable_fetch and reliable_push as well.

mperham commented Mar 3, 2014

Can you provide more info? Sidekiq/Pro/Ruby versions? Is it always the same type of job or all different types? Are you using any Sidekiq extensions or middleware?

@jonhyman has been running 100,000+ jobs/min in his system with reliable fetch with great success now.

anazar commented Mar 3, 2014

Sidekiq pro
Sidekiq master
Ruby 2.0.0
Rails 4.1.0.rc1
Hirefire (to scale worker dynos... we're scaling up to 5)

@jonhyman - whoa! We've been struggling to get that kind of performance... we're processing about 100k jobs per hour and struggling to scale up to anything like that. We could definitely use your advice. Any chance you're on Heroku?

Here's my sidekiq config:

require 'sidekiq'
require 'sidekiq/pro/api'
require 'sidekiq/pro/reliable_push'

max_workers = 5
sidekiq_concurrency_number = 25

Sidekiq.configure_server do |config|
  require 'sidekiq/pro/reliable_fetch'

  # Size the ActiveRecord pool a little larger than Sidekiq's concurrency so
  # worker threads aren't left waiting on a database connection.
  sidekiq_pool = sidekiq_concurrency_number + 5

  if defined?(ActiveRecord::Base)
    Rails.logger.debug("Setting custom connection pool size of #{sidekiq_pool} for Sidekiq Server")
    conf = Rails.application.config.database_configuration[Rails.env]
    conf['reaping_frequency']       = 20 # seconds
    conf['dead_connection_timeout'] = 600
    conf['checkout_timeout']        = 20
    conf['timeout']                 = 30
    conf['pool']                    = sidekiq_pool
    ActiveRecord::Base.establish_connection(conf)

    Rails.logger.info("Connection pool size for Sidekiq Server is now: #{ActiveRecord::Base.connection.pool.instance_variable_get('@size')}")
  end
end

jonhyman commented Mar 3, 2014

Here's a barrage of questions: How often is this happening? Do you have logging around errors Sidekiq sees? What else is connecting to your Redis server, and do you notice any errors around that? The path I'm trying to go down is whether there are connectivity issues to Redis which might cause some jobs to get "stuck".

We are not on Heroku; we had been entirely on AWS, but now we're almost done migrating to physical servers at Rackspace. What I've seen before is connectivity issues to the AZ where my Redis was located. My best guess was that an rpoplpush into the local queue finished on the server but never made its way back to the client because it timed out. In that case, you wouldn't see anything in your "Busy" queue on the Sidekiq dashboard, but the job would be "stuck". To combat this, we wrote custom functional monitors that run every 30 minutes to look for "stuck" jobs and alert us. It happens infrequently, but it does happen.

The interesting thing to me is that you say "sometimes they process" -- are you sure about that, or do you just see them disappear from the Busy page? That is, do you have logging or another way to confirm that the job finished? If a job is "stuck processing", can you TTIN the worker to get a backtrace of what the threads are doing?

As for "[restarting] cleared the busy queue without processing any of the jobs": do you know from your logging whether those jobs processed? The "busy queue" on the Sidekiq dashboard is just a Redis set that workers add to before a job starts and remove from when it finishes. If the worker is hanging, I'd expect a restart of Sidekiq to re-insert the jobs from the local queue into the main queue, and when the process exits the entries might clear from the busy set. When you restart the process, what log statements do you see from Sidekiq? Sidekiq should log how many quiet threads it was shutting down when you INT/TERM it, and if it has to force-kill threads, it should tell you how many and the job each one was running. What do those logs say?
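
For reference, a minimal sketch of the TTIN step mentioned above, assuming you can signal the Sidekiq process directly (the pid is hypothetical; use whatever your process monitor reports). Sidekiq's TTIN handler logs a backtrace for each of its threads:

# Ask a running Sidekiq process to dump every thread's backtrace to its log.
sidekiq_pid = 12345 # hypothetical pid of the Sidekiq process
Process.kill("TTIN", sidekiq_pid)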

anazar commented Mar 7, 2014

@jonhyman - This is happening randomly... maybe once or twice a day... but when it does, the jobs don't finish processing. We know this because we added logging around the jobs.

When it happens we are typically seeing a lot of these errors:

ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds

There are a couple of jobs where we have to establish a connection to a different database, and I think this might be what's causing the issue. It seems like calling establish_connection may be clearing out the existing connection pool. I have yet to investigate or verify this.

When we establish another connection we are doing the following:

  establish_connection ENV['OTHER_DATABASE_URL']
  config = Rails.application.config.database_configuration[Rails.env]
  config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 # seconds
  config['pool']            = ENV['DB_POOL'] || 30
  config['checkout_timeout'] = 10
  config['dead_connection_timeout'] = 600
  config['timeout'] = 15
  ActiveRecord::Base.establish_connection(config)

Any thoughts?
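
A rough sketch of one way to keep the secondary database off the default pool, assuming the second ActiveRecord::Base.establish_connection call really is replacing the pool the workers are using (OtherDb and ExternalRecord are hypothetical names):

# Give the other database its own abstract base class and pool, leaving
# ActiveRecord::Base untouched.
class OtherDb < ActiveRecord::Base
  self.abstract_class = true
  establish_connection(ENV['OTHER_DATABASE_URL'])
end

# Models backed by the other database inherit from OtherDb and check
# connections out of OtherDb's pool, not the default one.
class ExternalRecord < OtherDb
  self.table_name = 'external_records'
end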

jonhyman commented Mar 7, 2014

Sorry, we don't use ActiveRecord, so I don't have experience with it.

mperham commented Mar 7, 2014

If you are connecting to multiple databases, you need to ensure that each database connection pool is sized correctly, not just the default pool. That AR establish_connection code looks pretty suspect.

I wonder if you can monkeypatch AR to change the pool size default of 5 to 30?
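
A very rough sketch of that monkeypatch idea, under the assumption that Rails 4.1's ConnectionPool reads its size from spec.config[:pool] and falls back to 5; treat it as untested:

# Hypothetical initializer: raise the implicit pool-size default for any
# connection spec that doesn't set one explicitly.
module LargerDefaultPool
  def initialize(spec)
    spec.config[:pool] ||= 30
    super
  end
end

# send(:prepend, ...) because Module#prepend is still private on Ruby 2.0.
ActiveRecord::ConnectionAdapters::ConnectionPool.send(:prepend, LargerDefaultPool)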

mperham commented Mar 7, 2014

Can you provide the source for the worker in question?

anazar commented Mar 7, 2014

@mperham - here is the basic worker; it just uploads a file to S3. I can try to monkeypatch AR, but it does seem like it's setting the pool size correctly.

The strange thing is that despite setting all the configs to increase the timeout, pool size, etc. before establishing any connections, I'm still randomly getting this error. There doesn't appear to be any reason why.

ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.000 seconds)

Here's an example of my worker.

require 'open-uri'

class S3Worker

  def self.upload_to_s3 upload_id

    upload = Upload.where(id: upload_id).first

    return if !upload

    begin
      file_content = open(upload.file_url).read
      return false if !file_content
    rescue
      return false
    end

    s3_key = upload.s3_path

    success = upload.store_in_s3(file_content, s3_key)

    if success
      upload.s3 = s3_key
      upload.save
      puts "Uploaded: #{upload.id}"
    end

    return success
  end

end

mperham commented Mar 7, 2014

There's no perform method. It's pretty important we see the exact code, not pseudo-code.

Have you tried without Hirefire just to make sure it's not introducing some weirdness?

mperham commented Mar 7, 2014

Oh, perhaps you're using .delay?

anazar commented Mar 7, 2014

I'm just calling it using delay:

S3Worker.delay_for(10.minutes, queue: :seldom).upload_to_s3 upload.id

I have tried using scaling methods other than hirefire and got the same issues. I suppose it could be a result of adding and removing workers but I haven't been able to test that.
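
For comparison, the same job written as an explicit Sidekiq worker with a perform method; S3UploadWorker is a hypothetical name, and the enqueue call below mirrors the delay_for call above:

class S3UploadWorker
  include Sidekiq::Worker
  sidekiq_options queue: :seldom

  def perform(upload_id)
    # Delegate to the existing class method so the upload logic stays in one place.
    S3Worker.upload_to_s3(upload_id)
  end
end

# Enqueue with the same 10-minute delay:
S3UploadWorker.perform_in(10.minutes, upload.id)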

mperham commented Mar 7, 2014

My gut feeling is that this is an HTTP timeout issue, but I have no data to back that up. I don't know what store_in_s3 is doing, but make sure it has proper read and write timeouts set.
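
As one illustration of that advice, the open-uri download in the worker above can be given an explicit read timeout (the 30-second value is arbitrary; timeouts for store_in_s3 itself depend on whichever S3 client it wraps):

require 'open-uri'

# Cap how long the download can block a Sidekiq thread; without a timeout,
# a slow remote host can tie up the thread (and any database connection it
# has checked out) indefinitely.
file_content = open(upload.file_url, read_timeout: 30).read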

anazar commented Mar 7, 2014

This is the ActiveRecord connection pool method that is throwing the error; when I lose jobs from the busy queue, or the busy queue gets stuck, I see a lot of those database connection errors:
https://github.com/rails/rails/blob/4-1-stable/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L176

I don't believe it's an HTTP timeout issue because I'm seeing nothing related to that in my logs.

I'm guessing that there's something going on that's not thread-safe in ActiveRecord, or in the way I'm connecting via ActiveRecord.

mperham commented Mar 7, 2014

I don't know if a lot of people are using 4.1 yet so it's possible it's a Rails pool issue.

Can you give a full stack trace? Determining who's holding onto the connections and why they haven't been returned to the pool will require a lot of debugging on your part; I can't help with that. One thing you can do is further increase the pool size and see what effect that has: does it increase the time between error storms?

mperham commented Mar 13, 2014

Did this get resolved? I saw you opened another, different issue now.

@subelsky

@mperham I'm having the same problem. Right now we're chasing down a bug with Heroku's Ruby 2.0 installation where Net::HTTP can cause a segfault when connecting to certain SSL hosts.

When this happens, Sidekiq crashes and any jobs being worked get stuck in the busy queue. When the dyno reboots, these jobs never get worked again; they just stay in the queue forever. I played around with some code to detect these zombie jobs until I realized reliable fetch should be preventing it.

workers = Sidekiq::Workers.new
client = Sidekiq::Client.new

stale_time = 10.minutes.ago

workers.each do |name,work,started_at|
  time = Time.parse(started_at) 
  next unless time < stale_time

  queue = work.fetch("queue")
  payload = work.fetch("payload")

  jid = payload.fetch("jid")
  bid = payload.fetch("bid")
  payload["resurrected"] = true # was going to use this for a check to prevent infinite re-resurrection of the offending job

  puts "Detected #{payload.fetch('class')} zombie job on #{queue} worker set, pushing back into queue (job #{jid}, batch #{bid})"
  client.push(payload)
end ; nil

workers.prune(10.minutes)

sidekiq (2.17.7)
sidekiq-pro (1.4.3)
ruby 2.0.0p451 on heroku

If it helps, here's how I'm able to reproduce the segfault:

require "net/https"
require "uri"
login_url = "https://publishers.chitika.com/login"
uri = URI.parse(login_url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)

@jonhyman

@subelsky when the dyno reboots, does its hostname change? If so, then I would expect the jobs to stay stuck in the queue.

@subelsky

It's definitely possible - I'm using the sidekiq -i $DYNO trick, plus we are always tweaking the number of workers we have running with Hirefire.

anazar commented Mar 17, 2014

Glad to see this is not just me. We're still having the same issues as @subelsky... it sounds like the exact same problem. We're also using the sidekiq -i $DYNO trick.

@subelsky -- we're actually seeing the jobs eventually disappear from the busy queue (after an hour or so), but they often never get processed.

mperham commented Mar 17, 2014

Yep, if the hostname is dynamic, reliable fetch as designed doesn't work well. You'd need to "rescue" those orphaned working queues manually and put the jobs back into the main queue. I don't have a workaround for this at the moment. If someone can enter a new issue, I hope to get to this as part of my Sidekiq Pro 2.0 effort in April/May.
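
A rough sketch of that manual rescue; the orphaned key name below is invented for illustration, since reliable fetch derives its working-queue names from the hostname and index, so check Redis (for example, KEYS 'queue:*') for the keys your dead dynos actually left behind:

require 'sidekiq'

orphaned_key = 'queue:seldom_some-dead-dyno.1' # hypothetical; look up the real key
main_key     = 'queue:seldom'

Sidekiq.redis do |conn|
  # RPOPLPUSH atomically moves one job at a time until the orphan is empty.
  while conn.rpoplpush(orphaned_key, main_key)
  end
end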

@jonhyman

@mperham a quick workaround would be to let them pass in the hostname via something like -h instead of using Socket.gethostname. Heroku users could then do something like -i $DYNO -h heroku. Though if you have a real solution in mind, I think that's preferable to a hack.

@jonhyman

Though I just realized that if the value of $DYNO gets reassigned to a worker that processes a different queue, you'll get jobs stuck anyway. That is, let's say that $DYNO == 5. A worker processing the "foo" queue fails, but you decide to scale up workers which process the "bar" queue before that "foo" worker gets replaced. A new "bar" worker with $DYNO == 5 will be consuming the "bar" queue, and the "foo" job will get stuck.

mperham commented Mar 18, 2014

There's no easy solution here. I'll continue to think about ways we can improve reliable fetch to work with Heroku and other systems with dynamic hostnames. Giving people more knobs to twiddle (like -h) often means more ways to misconfigure and break things. I prefer to have things auto-configure if possible.

@jonhyman

@mperham We're in sync; the more I thought about it, the more brittle it became.

@subelsky

@mperham I can write up some zombie-job rescue code, but is there something missing from my code above? It seems like it was not copacetic with batches. Maybe I should push them without the original jid, but I was trying to disturb the messages as little as possible.

@subelsky

@mperham per your request I made a new issue #1573

mperham commented Mar 18, 2014

Thanks! May I close this issue @anazar? Can you confirm your lost jobs are due to changing hostnames?

@subelsky

@mperham I'm not having this problem any more; the fix works great.

seuros closed this as completed Jul 8, 2014