
Lock not getting properly cleared for some jobs #560

Closed
giuliana-quickmail opened this issue Jan 11, 2021 · 7 comments

@giuliana-quickmail

Describe the bug

The locks for some jobs were not removed after the jobs completed. This led to new jobs being rejected even though they were unique and had no similar jobs in the queue.

This problem only affected some jobs immediately after we did a release adding the lock to the worker. We suspect the bug is related to the fact that we had many duplicate jobs scheduled and enqueued at the time of the release.

Expected behavior

New jobs should not be rejected when they are unique.

Having duplicate jobs already scheduled/enqueued when introducing the lock should not cause locks to get stuck. Alternatively, there should be a warning in the documentation if this is a known problem.

Current behavior

The situation is similar to #379.

We have a cron job running every 10 minutes, as well as other processes in the app, that schedule jobs like so:

InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_async(inbox_id: inbox.id)

The job also re-enqueues itself like so:

InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(number_of_sec, inbox_id: inbox.id, check_at: check_at)

As explained above, we noticed that a number of jobs had not run for hours after introducing the lock. This was critical, so we didn't have much time to debug further. We took the following steps to solve the immediate problem:
We rolled back the release, manually removed uniquejobs Redis keys, and released again.
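
Roughly, the manual key removal was something like the sketch below. This is only an illustration: it assumes the default uniquejobs lock prefix and the redis-rb gem, and it deletes every unique-jobs key, so we only ran it while the affected workers were stopped.

require 'redis'

redis = Redis.new # assumed to point at the same Redis instance Sidekiq uses

# Every key created by sidekiq-unique-jobs (digests, QUEUED/PRIMED lists,
# INFO hashes, ...) shares the "uniquejobs" prefix by default.
redis.scan_each(match: 'uniquejobs:*') do |key|
  redis.del(key)
end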

Jobs seem to be running fine for now. However, we're not 100% sure what caused the bug, and we definitely want to prevent jobs from getting stuck this way again.

Worker class

# frozen_string_literal: true

module InboxManagement
  module Job
    class ProcessSendQueue
      include Sidekiq::Worker

      sidekiq_options lock: :until_executed, unique_args: ->(args) { args }, on_conflict: :log

      # ------------------------------------------------------------------------------
      def perform(options)
        puts "Calling InboxManagement::Job::ProcessSendQueue with options #{options}"

        inbox_id = options.fetch('inbox_id')
        inbox = Inbox.find_by(id: inbox_id)

        throttle = inbox.get_throttling_time_in_sec
        check_at = Time.now.utc.to_i + throttle + 1

        if inbox.is_locked?
          # Too soon to process. Make sure there is at least 1 sendqueue processing item in the job queue
          delay = inbox.seconds_before_can_send + 1
          puts "Inbox(#{inbox.id}) #{inbox.email} - Too soon, retry in #{delay} seconds"
          inbox.unlock
          InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(delay, inbox_id: inbox.id, check_at: check_at)
          return
        end

        # --- Perform processing here ---

        # Always have a sendqueue check as last step
        puts "Inbox(#{inbox.id}) #{inbox.email} - schedule next processsendqueue in #{throttle}+1"
        inbox.unlock
        InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(throttle + 1, inbox_id: inbox.id, check_at: check_at)
      end
    end
  end
end

Additional context

gem 'sidekiq', '~> 6.1.2'
gem 'sidekiq-scheduler', '~> 2.2.2'
gem 'sidekiq-status', '~> 1.1.4'
gem 'sidekiq-unique-jobs', '7.0.0.beta27'

We also found many warnings like these in the server logs right after the release:

330 <190>1 2021-01-07T19:03:37.142273+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: d3ec01c74beff41ed222804c)
328 <190>1 2021-01-07T19:03:37.1459+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:37800753c47b7564a86721f0b28bf311 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:37800753c47b7564a86721f0b28bf311, job_id: b26be9241d357a2e043c89b3)
330 <190>1 2021-01-07T19:03:37.245831+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d569e7b8d189a774f2c2e67784c70285 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d569e7b8d189a774f2c2e67784c70285, job_id: c7e29af7ae83436fd3ccccc7)
330 <190>1 2021-01-07T19:03:37.250079+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: 3e8eb90413f35ebc7b6a14d7)
330 <190>1 2021-01-07T19:03:37.278398+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: b5430cfead1ab9cbfda7d698)
330 <190>1 2021-01-07T19:03:37.288918+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: 3d9191a2ca6333480c0a8ee0)
330 <190>1 2021-01-07T19:03:37.298174+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: 4d19b48aeb0e69413577ca04)
330 <190>1 2021-01-07T19:03:37.312854+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: f13c240ee433061cbcd60c13)
330 <190>1 2021-01-07T19:03:37.316287+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: a70cf0d6887a0920d7a1addc)
330 <190>1 2021-01-07T19:03:37.340184+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:f687ad9fffe7614804d7e54799106bbe WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:f687ad9fffe7614804d7e54799106bbe, job_id: a1981f5e045d2e02ee97a1ca)
330 <190>1 2021-01-07T19:03:37.388268+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:0bb75f49f2a1d29faafe872698124f9c WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:0bb75f49f2a1d29faafe872698124f9c, job_id: 39a716b48c4c466dc5025e81)

One of the jobs affected by the bug is listed there: uniquejobs:1c3f9b4b86ec4a62d650a48096f90d97. We found it odd that these warnings do not have a corresponding INFO log saying the job will be skipped, and that the lock type until_executed is not displayed before the digest.
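
If it is relevant: the "0s" in those warnings appears to reflect the client-side lock_timeout, which defaults to 0 (do not wait for the lock at all). A minimal sketch of raising it, purely as an illustration (the 5 seconds is an arbitrary value, not something we have tested):

sidekiq_options lock: :until_executed,
                lock_timeout: 5, # seconds to wait when acquiring the lock; default is 0
                unique_args: ->(args) { args },
                on_conflict: :log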

@mhenrixon (Owner)

If you change your worker configuration to something like:

SidekiqUniqueJobs.config.lock_info = true

module InboxManagement
  module Job
    class ProcessSendQueue
      include Sidekiq::Worker

      sidekiq_options lock: :until_executed, lock_info: true, lock_prefix: :psq, on_conflict: :log
    end
  end
end

There should be a lock_info page where you can check what is going on with the locks:

[screenshot of the lock_info page]

From there you should be able to provide some more details. The lock_prefix: :psq is just a suggestion; it makes the locks a little easier to search for when debugging.
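
The global lock_info setting normally lives in an initializer; a minimal sketch, assuming a standard Rails setup (the file name is just a convention):

# config/initializers/sidekiq_unique_jobs.rb
SidekiqUniqueJobs.config.lock_info = true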

@mhenrixon (Owner)

Further info: there was a fix in v7.0.0.beta28 that addresses a problem with reaping some jobs. What are your settings for the gem itself?

@giuliana-quickmail (Author)

Thanks for the suggestions, @mhenrixon.

Settings are:

Sidekiq.configure_server do |config|
  config.death_handlers << lambda { |job, ex|
    puts "*** DEAD JOB *** #{job['class']} #{job['jid']} just died with error #{ex.message}."
    SidekiqUniqueJobs::Digests.del(digest: job['unique_digest']) if job['unique_digest']
  }
end

@mhenrixon (Owner)

For v7 it is lock_digest in the death handler. @giuliana-quickmail
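
A minimal sketch of the adjusted death handler, assuming the Digests#delete_by_digest API described in the v7 documentation:

Sidekiq.configure_server do |config|
  config.death_handlers << lambda { |job, _ex|
    digest = job['lock_digest'] # v7 key; v6 used unique_digest
    SidekiqUniqueJobs::Digests.new.delete_by_digest(digest) if digest
  }
end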

@mhenrixon (Owner)

As I commented in #524 (comment), you can provide a lock_ttl: 20.minutes or something similar to let the lock expire automatically after 20 minutes.
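
For example (20.minutes needs ActiveSupport; in plain Ruby, pass the number of seconds):

sidekiq_options lock: :until_executed,
                lock_ttl: 20 * 60, # lock expires automatically after 20 minutes
                on_conflict: :log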

@mhenrixon (Owner)

With #571, v7.0.1 now automatically configures the death handler and cleanup of orphaned locks according to the guide.

@giuliana-quickmail (Author)

Thanks, @mhenrixon! For now, we fixed the death handler config and jobs seem to be running smoothly, so we'll stick to 7.0.0.beta27 as we're devoting all resources to other initiatives.
