
Lock not getting properly cleared for some jobs #560

Closed
giuliana-quickmail opened this issue Jan 11, 2021 · 7 comments

@giuliana-quickmail

Describe the bug

The locks for some jobs were not removed after the jobs completed. This led to new jobs being rejected even though they were unique and had no similar jobs in the queue.

This problem only affected some jobs immediately after we did a release adding the lock to the worker. We suspect the bug is related to the fact that we had many duplicate jobs scheduled and enqueued at the time of the release.

Expected behavior

New jobs should not be rejected when they are unique.

Having duplicate jobs already scheduled/enqueued when introducing the lock should not cause locks to get stuck. Alternatively, there should be a warning in the documentation if this is a known problem.

Current behavior

The situation is similar to #379.

We have a cron job running every 10 minutes, as well as other processes in the app, that schedule jobs like so:

InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_async(inbox_id: inbox.id)

The job also re-enqueues itself like so:

InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(number_of_sec, inbox_id: inbox.id, check_at: check_at)

As explained above, we noticed that a number of jobs had not run for hours after introducing the lock. This was critical, so we didn't have much time to debug further. We took the following steps to solve the immediate problem:
We rolled back the release, manually removed uniquejobs Redis keys, and released again.
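
Roughly, the manual key removal was something like the sketch below. This is only an illustration: it assumes the default uniquejobs lock prefix and the redis-rb gem, and it deletes every unique-jobs key, so we only ran it while the affected workers were stopped.

require 'redis'

redis = Redis.new # assumed to point at the same Redis instance Sidekiq uses

# Every key created by sidekiq-unique-jobs (digests, QUEUED/PRIMED lists,
# INFO hashes, ...) shares the "uniquejobs" prefix by default.
redis.scan_each(match: 'uniquejobs:*') do |key|
  redis.del(key)
end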

Jobs seem to be running fine for now. However, we're not 100% sure what caused the bug, and we definitely want to prevent jobs from getting stuck this way again.

Worker class

# frozen_string_literal: true

module InboxManagement
  module Job
    class ProcessSendQueue
      include Sidekiq::Worker

      sidekiq_options lock: :until_executed, unique_args: ->(args) { args }, on_conflict: :log

      # ------------------------------------------------------------------------------
      def perform(options)
        puts "Calling InboxManagement::Job::ProcessSendQueue with options #{options}"

        inbox_id = options.fetch('inbox_id')
        inbox = Inbox.find_by(id: inbox_id)

        throttle = inbox.get_throttling_time_in_sec
        check_at = Time.now.utc.to_i + throttle + 1

        if inbox.is_locked?
          # Too soon to process. Make sure there is at least 1 sendqueue processing item in the job queue
          delay = inbox.seconds_before_can_send + 1
          puts "Inbox(#{inbox.id}) #{inbox.email} - Too soon, retry in #{delay} seconds"
          inbox.unlock
          InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(delay, inbox_id: inbox.id, check_at: check_at)
          return
        end

        # --- Perform processing here ---

        # Always have a sendqueue check as last step
        puts "Inbox(#{inbox.id}) #{inbox.email} - schedule next processsendqueue in #{throttle}+1"
        inbox.unlock
        InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(throttle + 1, inbox_id: inbox.id, check_at: check_at)
      end
    end
  end
end

Additional context

gem 'sidekiq', '~> 6.1.2'
gem 'sidekiq-scheduler', '~> 2.2.2'
gem 'sidekiq-status', '~> 1.1.4'
gem 'sidekiq-unique-jobs', '7.0.0.beta27'

We also found many warnings like these in the server logs right after the release:

330 <190>1 2021-01-07T19:03:37.142273+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: d3ec01c74beff41ed222804c)
328 <190>1 2021-01-07T19:03:37.1459+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:37800753c47b7564a86721f0b28bf311 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:37800753c47b7564a86721f0b28bf311, job_id: b26be9241d357a2e043c89b3)
330 <190>1 2021-01-07T19:03:37.245831+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d569e7b8d189a774f2c2e67784c70285 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d569e7b8d189a774f2c2e67784c70285, job_id: c7e29af7ae83436fd3ccccc7)
330 <190>1 2021-01-07T19:03:37.250079+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: 3e8eb90413f35ebc7b6a14d7)
330 <190>1 2021-01-07T19:03:37.278398+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: b5430cfead1ab9cbfda7d698)
330 <190>1 2021-01-07T19:03:37.288918+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: 3d9191a2ca6333480c0a8ee0)
330 <190>1 2021-01-07T19:03:37.298174+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: 4d19b48aeb0e69413577ca04)
330 <190>1 2021-01-07T19:03:37.312854+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: f13c240ee433061cbcd60c13)
330 <190>1 2021-01-07T19:03:37.316287+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: a70cf0d6887a0920d7a1addc)
330 <190>1 2021-01-07T19:03:37.340184+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:f687ad9fffe7614804d7e54799106bbe WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:f687ad9fffe7614804d7e54799106bbe, job_id: a1981f5e045d2e02ee97a1ca)
330 <190>1 2021-01-07T19:03:37.388268+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:0bb75f49f2a1d29faafe872698124f9c WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:0bb75f49f2a1d29faafe872698124f9c, job_id: 39a716b48c4c466dc5025e81)

One of the jobs affected by the bug is listed there: uniquejobs:1c3f9b4b86ec4a62d650a48096f90d97. We found it odd that these warnings do not have a corresponding INFO log saying the job will be skipped, and that the lock type until_executed is not displayed before the digest.
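
If it is relevant: the "0s" in those warnings appears to reflect the client-side lock_timeout, which defaults to 0 (do not wait for the lock at all). A minimal sketch of raising it, purely as an illustration (the 5 seconds is an arbitrary value, not something we have tested):

sidekiq_options lock: :until_executed,
                lock_timeout: 5, # seconds to wait when acquiring the lock; default is 0
                unique_args: ->(args) { args },
                on_conflict: :log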

@mhenrixon (Owner)

If you change your worker configuration to something like:

SidekiqUniqueJobs.config.lock_info = true

module InboxManagement
  module Job
    class ProcessSendQueue
      include Sidekiq::Worker

      sidekiq_options lock: :until_executed, lock_info: true, lock_prefix: :psq, on_conflict: :log
    end
  end
end

There should be a lock_info page where you can check what is going on with the locks:

[screenshot of the lock_info page]

From there you should be able to provide some more details. The lock_prefix: :psq is just a suggestion; it makes the locks a little easier to search for when debugging.
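
The global lock_info setting normally lives in an initializer; a minimal sketch, assuming a standard Rails setup (the file name is just a convention):

# config/initializers/sidekiq_unique_jobs.rb
SidekiqUniqueJobs.config.lock_info = true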

@mhenrixon (Owner)

Further info: there was a fix in v7.0.0.beta28 that addresses a problem with reaping some jobs. What are your settings for the gem itself?

@giuliana-quickmail (Author)

Thanks for the suggestions, @mhenrixon.

Settings are:

Sidekiq.configure_server do |config|
  config.death_handlers << lambda { |job, ex|
    puts "*** DEAD JOB *** #{job['class']} #{job['jid']} just died with error #{ex.message}."
    SidekiqUniqueJobs::Digests.del(digest: job['unique_digest']) if job['unique_digest']
  }
end

@mhenrixon (Owner)

For v7 it is lock_digest in the death handler. @giuliana-quickmail
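
A minimal sketch of the adjusted death handler, assuming the Digests#delete_by_digest API described in the v7 documentation:

Sidekiq.configure_server do |config|
  config.death_handlers << lambda { |job, _ex|
    digest = job['lock_digest'] # v7 key; v6 used unique_digest
    SidekiqUniqueJobs::Digests.new.delete_by_digest(digest) if digest
  }
end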

@mhenrixon (Owner)

As I commented in #524 (comment), you can provide a lock_ttl: 20.minutes or something similar to let the lock expire automatically after 20 minutes.
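
For example (20.minutes needs ActiveSupport; in plain Ruby, pass the number of seconds):

sidekiq_options lock: :until_executed,
                lock_ttl: 20 * 60, # lock expires automatically after 20 minutes
                on_conflict: :log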

@mhenrixon (Owner)

With #571, v7.0.1 now automatically configures the death handler and cleanup of orphaned locks according to the guide.

@giuliana-quickmail (Author)

Thanks, @mhenrixon! For now, we fixed the death handler config and jobs seem to be running smoothly, so we'll stick to 7.0.0.beta27 as we're devoting all resources to other initiatives.
