
Please try v7.1.5 #617

Open

mhenrixon opened this issue Jun 29, 2021 · 31 comments

@mhenrixon
Owner

mhenrixon commented Jun 29, 2021

I added an UPGRADING.md document that I will use to guide people through upgrading.

A ton of bug fixes went into this release! Even UntilAndWhileExecuting should be less problematic now.

Finally, a special shoutout to my sponsors. You make all of this work possible and worth it!

Link to the release

@mhenrixon mhenrixon self-assigned this Jun 29, 2021
@mhenrixon mhenrixon pinned this issue Jun 29, 2021
@ArturT
Contributor

ArturT commented Jun 30, 2021

After the gem upgrade I noticed this issue (in the rails console):

3.0.1 :001 > Worker.perform_async("223")
/Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:in `>': no implicit conversion of Integer into Hash (TypeError)
3.0.1 :002 > Worker.perform_async(223)
/Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:in `>': no implicit conversion of Integer into Hash (TypeError)

I've checked the source code of: /Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291

  def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
    synchronize do |client|
      command = [:brpoplpush, source, destination, timeout]
      require 'pry'; binding.pry # I added it for debugging
      timeout += client.timeout if timeout > 0 # this line is failing on timeout > 0
      client.call_with_timeout(command, timeout)
    end
  end

I then ran:

3.0.1 :003 > Worker.perform_async(221)

From: /Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1292 Redis#brpoplpush:

    1288: def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
    1289:   synchronize do |client|
    1290:     command = [:brpoplpush, source, destination, timeout]
    1291:     require 'pry'; binding.pry
 => 1292:     timeout += client.timeout if timeout > 0
    1293:     client.call_with_timeout(command, timeout)
    1294:   end
    1295: end

[1] pry(#<Redis>)> deprecated_timeout
=> {:timeout=>0}
[2] pry(#<Redis>)> client.timeout
=> 5.0
[3] pry(#<Redis>)> timeout
=> {:timeout=>0}
[4] pry(#<Redis>)> timeout += client.timeout
NoMethodError: undefined method `+' for {:timeout=>0}:Hash
from (pry):4:in `block in brpoplpush'
[5] pry(#<Redis>)> timeout > 0
TypeError: no implicit conversion of Integer into Hash
from (pry):5:in `>'
So the check timeout > 0 fails because timeout is {:timeout=>0}.

Is it possible that sidekiq-unique-jobs is passing the timeout down to redis-rb in the wrong format?
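
For anyone following along, here is a standalone sketch of how the new redis-rb signature produces exactly this error when a caller still passes an old-style options hash positionally (the hash literal is my guess at what the calling code does, not the gem's actual code):

# Same signature as redis-rb 4.3.1's Redis#brpoplpush, body reduced to the failing check.
def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
  # In Ruby 3.0 an explicit hash argument stays positional, so it lands in
  # deprecated_timeout and then becomes the default for the timeout: keyword.
  timeout += 5.0 if timeout > 0 # Hash > Integer raises the TypeError seen above
  [source, destination, timeout]
end

brpoplpush("queued", "primed", timeout: 0)     # keyword form works => ["queued", "primed", 0]
brpoplpush("queued", "primed", { timeout: 0 }) # TypeError: no implicit conversion of Integer into Hash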

@mhenrixon
Owner Author

Must be, I'll see if I can replicate this in a test.

@mhenrixon
Owner Author

@ArturT what is your worker configuration? I just need the sidekiq_options, really.

@mhenrixon
Owner Author

mhenrixon commented Jun 30, 2021

@ArturT I tried to replicate the problem in both the rails console and in tests.

[Screenshots of the rails console and test runs attached.]

@ArturT
Contributor

ArturT commented Jun 30, 2021

Here is my worker:

class MyWorker
    include Sidekiq::Worker

    sidekiq_options queue: :my_queue_name,
      # Locks when the client pushes the job to the queue.
      # The queue will be unlocked when the server starts processing the job.
      # The server then goes on to creating a runtime lock for the job to prevent
      # simultaneous jobs from being executed. As soon as the server starts
      # processing a job, the client can push the same job to the queue.
      # https://github.com/mhenrixon/sidekiq-unique-jobs#until-and-while-executing
      lock: :until_and_while_executing,
      # search in logs SidekiqUniqueJobs keyword to find duplicates
      log_duplicate_payload: true

    def perform(id)
    end
end

I'll try to debug it more later on when I find some time.

@mhenrixon
Owner Author

It's good you had me take a look at this, because I discovered that I had unintentionally created a breaking change. It is fixed in 7.1.1, released just now.

@mhenrixon mhenrixon changed the title Please try v7.1.0 Please try v7.1.1 Jun 30, 2021
@ArturT
Contributor

ArturT commented Jun 30, 2021

I can confirm v7.1.1 works fine now. Thanks :)

@ArturT
Contributor

ArturT commented Jul 1, 2021

@mhenrixon I have a different error this time.

Some of the jobs are timing out due to problems connecting to Redis. A higher timeout is recommended for the RedisCloud addon on Heroku, so I use a network_timeout of 5 seconds. But it seems like the default 1-second timeout is now being used (see the error message):

ConnectionPool::TimeoutError: Waited 1 sec
+58 non-project frames

 1: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:77 in block (2 levels) in pop
 2: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:69 in loop
 3: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:69 in block in pop
 4: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:68 in synchronize
 5: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:68 in pop
 6: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:78 in checkout
 7: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:60 in block in with
 8: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59 in handle_interrupt
 9: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59 in with
10: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq.rb:95 in redis
11: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/connection.rb:19 in redis
12: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/locksmith.rb:104 in execute
13: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/lock/until_executed.rb:30 in execute
14: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware/server.rb:25 in call
15: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware.rb:41 in block in call
16: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:123 in block in with_logging_context
17: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/logger.rb:11 in with
18: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:138 in call
19: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:138 in with_configured_loggers_context
20: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:122 in with_logging_context
21: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware.rb:40 in call
22: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
23: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-status-2.0.1/lib/sidekiq-status/server_middleware.rb:42 in call
24: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
25: /app/vendor/bundle/ruby/3.0.0/gems/rollbar-3.2.0/lib/rollbar/plugins/sidekiq/plugin.rb:11 in call
26: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke

Is it possible that sidekiq-unique-jobs no longer respects the 5-second timeout?

This is my config:

# config/initializers/sidekiq.rb

sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing occasional timeout errors for Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config

...
end

...

@mhenrixon
Owner Author

@ArturT I can't think of a single thing. I've checked both the connection_pool gem and the sidekiq gem, and this behavior seems like a mystery to me.

@ArturT
Contributor

ArturT commented Jul 2, 2021

I suspect this could be a coincidence. Maybe RedisCloud had temporary issues yesterday/overnight that caused timeouts, and this revealed a misconfiguration in my project.

I found out that I have a separate redis config for the sidekiq server and the sidekiq client. In the server config I had a timeout: 5 option instead of network_timeout: 5 (I suspect the wrong key name caused the default 1s timeout to be used instead of 5s). I've updated the config and also upgraded sidekiq-unique-jobs to 7.1.2. I hope this will help.

@mhenrixon
Owner Author

Thanks for reporting back, @ArturT! I always wondered why sidekiq uses network_timeout: to set timeout: on the redis connection in redis-rb. It seems like it would be better to just pass timeout: through, since that is obviously done anyway.

@ArturT
Contributor

ArturT commented Jul 2, 2021

I was wrong about the misconfiguration in my config. In one place I use Redis.new, which requires the timeout key; the network_timeout key is specific to Sidekiq.

I found another place where I use ConnectionPool, so I added the missing timeout key there. But this seems to be the config for the sidekiq client (not the server), so I'm not sure it will help, because the exceptions happened only in sidekiq workers (not web requests).

Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }

  # Add 2 more connections to the pool just in case.
  # This should help with error ConnectionPool::TimeoutError when connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2

  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)

...
end

@mhenrixon
Owner Author

Hi @ArturT, it seems like pool_timeout in sidekiq translates to timeout in connection_pool, and connect_timeout translates to timeout in Redis, so indeed you must be using both. If you increase connect_timeout to 5, then you would still have to configure pool_timeout (in the server block).

Super confusing; I can't think of a single time when I would want my pool timeout to be less than the connect timeout. Quite the contrary: maybe I'd want to wait an additional second in the pool timeout.

@mhenrixon
Owner Author

mhenrixon commented Jul 2, 2021

sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing occasional timeout errors for Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5,
  pool_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config
end

redis_conn  = lambda do
  Redis.new(timeout: 5, url: Secret.sidekiq_redis_url, namespace: REDIS_SIDEKIQ_NAMESPACE)
end

Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }

  # Add 2 more connections to the pool just in case.
  # This should help with error ConnectionPool::TimeoutError when connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2

  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)
end

Try something like this @ArturT

@ArturT
Contributor

ArturT commented Jul 2, 2021

@mhenrixon That's a good catch with pool_timeout. I added it to my config. If I see any exceptions, I will report back. Thank you for the help.

@benedikt

benedikt commented Jul 5, 2021

We started seeing a lot of ConnectionPool::TimeoutError errors after upgrading to 7.1.2. Downgrading to 7.0.12 resolved the problem. Our timeout is at 1s as well, but it used to be fine and it's fine again after the downgrade.

Is there any chance that a connection isn't checked back into the pool?

@mhenrixon
Owner Author

No, that would not be the case, @benedikt. I recommend you check what @ArturT did above and the conclusions we came to while debugging his issues (which, by the way, were not related to this gem).

It could be that you need an additional connection due to the upgrade.

@benedikt

benedikt commented Jul 5, 2021

@mhenrixon I don't think increasing pool_timeout is necessarily a good solution. Yes, it might decrease the chance of getting ConnectionPool::TimeoutError errors, but wouldn't it also increase total execution time? Sidekiq would have to sit there and wait for a connection to become available.

Please correct me if I'm wrong, but the way I understand it, pool_timeout is the amount of time Sidekiq waits to get a connection from the pool. Ideally there should always be enough connections around for all threads, so getting a connection should be (almost) instant once the pool is maxed out.

Sidekiq tries to automatically configure the pool to be big enough. Seeing the ConnectionPool::TimeoutError pop up makes me think that it suddenly isn't big enough anymore?
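
For illustration, a tiny standalone script (just the connection_pool gem, no Sidekiq; the pool size and sleep values are arbitrary) that reproduces the same error whenever more threads want a connection than the pool holds:

require "connection_pool"

# Two connections, three threads that each hold a connection for two seconds.
pool = ConnectionPool.new(size: 2, timeout: 1) { Object.new }

threads = Array.new(3) do
  Thread.new do
    pool.with { sleep 2 } # the third checkout waits more than 1s and times out
  end
end

threads.each do |thread|
  begin
    thread.join
  rescue ConnectionPool::TimeoutError => e
    puts "#{e.class}: #{e.message}" # => ConnectionPool::TimeoutError: Waited 1 sec
  end
end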

It could be that you need an additional connection due to the upgrade.

Did something change in this release that uses an additional connection? #604 maybe? Is there a way to create a new connection outside of the pool for this?

@ArturT
Contributor

ArturT commented Jul 5, 2021

Maybe there is something to what @benedikt said. It seems like I had the default 1-second pool_timeout forever, until the upgrade to the new sidekiq-unique-jobs version revealed the ConnectionPool::TimeoutError issue. Probably having a low pool_timeout=1 limit is just fine.

Is it possible that sidekiq does not create a large enough pool of connections and sidekiq-unique-jobs consumes them, so that the whole pool is busy and the pool_timeout=1 limit is reached?
Sidekiq might not take into account the connections needed by sidekiq-unique-jobs. Or maybe there is some leak of connections in sidekiq-unique-jobs?

@benedikt

benedikt commented Jul 5, 2021

It seems like #604 added a SidekiqUniqueJobs::TimerTask, which under the hood uses a new Thread to repeatedly execute reaper_registered?. This increases the total number of threads that use the pool.

I believe Sidekiq itself needs two additional connections on top of the number of worker threads it creates.

With the SidekiqUniqueJobs::ReaperResurrector and the SidekiqUniqueJobs::OrphanManager each creating an additional thread, the pool is too small. When all worker threads, Sidekiq itself, the ReaperResurrector and the OrphanManager try to use a Redis connection, one of them will have to wait for a new connection to become available.

My guess is that this was okay until now because Sidekiq doesn't really use one of the two additional connections, so it was fine until the SidekiqUniqueJobs::ReaperResurrector was added.
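
To make the "extra thread, extra connection" point concrete, here is a sketch of the pattern in question using Concurrent::TimerTask from concurrent-ruby (the interval and the block are placeholders, not the gem's actual reaper code):

require "concurrent"
require "connection_pool"

POOL = ConnectionPool.new(size: 5, timeout: 1) { Object.new }

# A periodic task like this runs on its own thread and checks a connection out
# of the same pool the worker threads use, so it has to be budgeted for.
resurrector = Concurrent::TimerTask.new(execution_interval: 60) do
  POOL.with { |conn| conn } # e.g. check whether the reaper is still registered
end
resurrector.execute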

@mhenrixon
Owner Author

Point taken, I'll have a look at making sure we only use a connection when needed.

@mhenrixon
Owner Author

There is no leaking of connections from what I can tell. That said, the extra work does require more redis connections. If you didn't already have a ton of trouble with the reaper, you can turn the resurrector off with SidekiqUniqueJobs.config.reaper_resurrector_enabled = false.
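
For reference, a minimal initializer sketch (the file name is just a convention, and the block form of SidekiqUniqueJobs.configure is equivalent to the one-liner above):

# config/initializers/sidekiq_unique_jobs.rb
SidekiqUniqueJobs.configure do |config|
  config.reaper_resurrector_enabled = false # skip the resurrector's timer thread and its Redis usage
end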

@mhenrixon
Owner Author

Let me know if that helps

@benedikt

benedikt commented Jul 8, 2021

Upgraded to the latest version and disabled the reaper resurrector. No issues so far 👍

@mhenrixon
Owner Author

Thank you for the confirmation, @benedikt! I will disable it by default in the next release and refactor the resurrector.

@ArturT
Contributor

ArturT commented Jul 10, 2021

Just to let you know: despite a 5-second pool timeout, I got the exception ConnectionPool::TimeoutError: Waited 5 sec last night, after a week without problems.

@woodhull

woodhull commented Jul 15, 2021

We've got a busy sidekiq worker pool, and I'm also seeing the same ConnectionPool::TimeoutError issues as others under version 7.1.2 of this gem.

We've tried setting the sidekiq connection pool to a larger size than the default with some code like this, but we're still seeing the issue.

Sidekiq.configure_server do |config|
  # as of sidekiq-unique-jobs 7.1.1, the SidekiqUniqueJobs::ReaperResurrector and the SidekiqUniqueJobs::OrphanManager
  # each need a connection beyond the thread concurrency count + 5 connections required for the standard operation of
  # sidekiq.
  #
  # See: https://github.com/mhenrixon/sidekiq-unique-jobs/issues/617
  config.redis = sidekiq_redis_config.merge(size: Sidekiq.options[:concurrency] + 7)
  
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  SidekiqUniqueJobs::Server.configure(config)
end

However, this does not seem to have helped much. Our experience is that the rate of exceptions starts a few hours after a new code deploy / sidekiq process restart and then accelerates from there, until the sidekiq worker instance is unable to find connections to process its own jobs. It feels to us like there must be some issue with this version of the gem that causes connections to be used but not returned to the pool.

We're planning on disabling the reaper process as our next attempt at correcting this issue.

@mhenrixon
Owner Author

@woodhull #617 (comment)

@millerjs
Contributor

millerjs commented Jul 19, 2021

@mhenrixon First off, thanks for the awesome gem!

We were seeing this issue in production after upgrading and had to revert back to our much older version once we started noticing jobs being dropped.

I believe I was able to isolate the circumstance by introducing latency in the redis connection for until_and_while_executing jobs.

It appears that if the latency is somewhere around 18ms, then the job will be rejected. This was tested over localhost on OSX with sidekiq-unique-jobs version 7.1.2 and SidekiqUniqueJobs.config.reaper_resurrector_enabled = false in the initializer.

gem 'sidekiq-ent', '~> 1.8.1'
gem 'sidekiq-pro', '~> 5.0.0'
gem 'sidekiq-unique-jobs', '~> 7.1.2'

the sidekiq portion of this test is:

class TestWorker
  include Sidekiq::Worker

  def perform(timestamp)
    Rails.logger.info("performing")
    File.write('/tmp/timestamp', timestamp)
    sleep(10)
  end
end

class UniqueUntilAndWhileExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_and_while_executing
end

class UniqueUntilExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executing
end

class UniqueUntilExecutedTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executed
end

And executed in a rails console

def test(worker)
  timestamp = Time.now.to_i.to_s
  p worker.perform_async(timestamp)
  puts "check #{timestamp}?"
  gets
  p File.read('/tmp/timestamp') == timestamp
end

test(TestWorker)
test(UniqueUntilAndWhileExecutingTestWorker)
test(UniqueUntilExecutingTestWorker)
test(UniqueUntilExecutedTestWorker)

The ruby TCP proxy script used to introduce latency is in the following gist: https://gist.github.com/millerjs/8ab3a4ce63c4502da4c34db017ae417d

ruby ~/bp/scratch/latency_proxy.rb localhost:6379 8379

with the following results

  - test(TestWorker)
    - 300ms latency, fails 0/10

  - test(UniqueUntilAndWhileExecutingTestWorker)

    | latency | failure rate (fails / tries) | note       |
    |---------+------------------------------+------------|
    | 0 ms    | fails 0/1                   |            |
    | 10 ms   | fails 0/2                   |            |
    | 15 ms   | fails 0/2                   |            |
    | 18 ms   | fails 3/4                   | threshold? |
    | 20 ms   | fails 3/3                   |            |
    | 25 ms   | fails 1/1                   |            |
    | 50 ms   | fails 1/1                   |            |
    | 150 ms  | fails 1/2                   |            |
    | 230 ms  | fails 1/1                   |            |
    | 300 ms  | fails 4/4                   |            |

  - test(UniqueUntilExecutingTestWorker)
    - 300ms latency, fails 0/3

  - test(UniqueUntilExecutedTestWorker)
    - 300ms latency, fails 0/3

I'm happy to provide more info, logs, redis keys, etc. if that will help debug this issue.

edit: If I crank CLOCK_DRIFT_FACTOR to 1.0, this seems to "resolve" the latency issue.
edit 2: Does this just mean I need to set a high lock_ttl? If so, why just until_and_while_executing?
edit 3: It looks like runtime_lock.execute { return yield } is failing to acquire the lock. I checked WhileExecuting, and it appears to have the same issue in this scenario.

@millerjs
Contributor

millerjs commented Jul 20, 2021

@mhenrixon the README says lock_timeout: nil # lock indefinitely, but

def pop_queued(conn, wait = nil)
  wait ||= config.timeout if config.wait_for_lock?
  if wait.nil?
    rpoplpush(conn)

executes a non-blocking call (rpoplpush) if config.timeout is nil.

Are these two things contradictory?
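
For context, the difference between the two calls at the redis-rb level (a plain sketch; the key names are placeholders):

require "redis"

redis = Redis.new
redis.rpoplpush("queued", "primed")              # non-blocking: returns nil right away when "queued" is empty
redis.brpoplpush("queued", "primed", timeout: 5) # blocking: waits up to 5s; timeout: 0 would wait forever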

edit 1

it looks like method(:primed_async) in

def execute(&block)
  raise SidekiqUniqueJobs::InvalidArgument, "#execute needs a block" unless block
  redis(redis_pool) do |conn|
    lock!(conn, method(:primed_async), &block)
  end
end

isn't respecting any lock_timeout wait settings. If .lock_timeout is a positive integer, then the blocking pop will succeed, but the promise will have already returned a nil value

primed_jid = Concurrent::Promises
  .future(conn) { |red_con| pop_queued(red_con, wait) }
  .value(add_drift(wait || config.ttl))
handle_primed(primed_jid, &block)

so the primed_jid will be nil and the execute block will not be yielded (i.e. the job will be silently dropped)

def handle_primed(primed_jid)
  return yield job_id if [job_id, "1"].include?(primed_jid)

This explains why upping the ttl or the CLOCK_DRIFT_FACTOR will resolve the latency-dropped-job issue while upping the lock_timeout does not.
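
A toy model of that race (concurrent-ruby only; the sleep stands in for Redis latency and the numbers are arbitrary):

require "concurrent"

# The "blocking pop" would deliver a jid after 0.3s of latency,
# but the promise is only given 0.1s, so .value returns nil instead.
primed_jid = Concurrent::Promises
  .future { sleep 0.3; "jid-123" }  # stands in for pop_queued / brpoplpush
  .value(0.1)                       # stands in for add_drift(wait || config.ttl)

primed_jid # => nil, so handle_primed never yields and the job is silently dropped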

@millerjs
Contributor

millerjs commented Jul 20, 2021

@mhenrixon I believe the solution (at least to the problem I'm seeing) is to ensure that config.timeout gets passed in as the wait, so the promise waits for it plus the drift, e.g. something like

diff --git a/lib/sidekiq_unique_jobs/locksmith.rb b/lib/sidekiq_unique_jobs/locksmith.rb
index f90f1ce0..41fb02ab 100644
--- a/lib/sidekiq_unique_jobs/locksmith.rb
+++ b/lib/sidekiq_unique_jobs/locksmith.rb
@@ -102,7 +102,7 @@ module SidekiqUniqueJobs
       raise SidekiqUniqueJobs::InvalidArgument, "#execute needs a block" unless block

       redis(redis_pool) do |conn|
-        lock!(conn, method(:primed_async), &block)
+        lock!(conn, method(:primed_async), config.timeout, &block)
       end
     end

though that doesn't handle the timeout.nil? case, where the promise should wait indefinitely?

I think a workaround for now would be to set both the ttl and the timeout higher than the max latency expected.
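
A sketch of that workaround using the gem's documented sidekiq_options keys (the values are arbitrary; they just need to exceed the worst-case Redis round-trip):

class UniqueUntilAndWhileExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium,
                  lock: :until_and_while_executing,
                  lock_ttl: 60,    # seconds the lock may live
                  lock_timeout: 5  # seconds to wait when acquiring the runtime lock
end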

sorry for the comment spam

edit: Pull request with the above #624

millerjs added a commit to millerjs/sidekiq-unique-jobs that referenced this issue Jul 20, 2021
relates to mhenrixon#617

There is an issue where an integer `config.timeout` was being
respected in `pop_queued`, but the promise in `primed_async` was only
waiting the `config.ttl` plus the drift.

This means that high latency redis connections will return a `nil`
`primed_jid` and while_executing jobs will get dropped.

The proposed solution is to ensure that `config.timeout` gets passed in
as the `wait` argument to the bound `method(:primed_async)`, which
allows the `Concurrent::Promises` to timeout after the same duration
(plus drift) as the `brpoplpush` call.

The issue triage and approach is documented in:

  - mhenrixon#617 (comment)
  - mhenrixon#617 (comment)
  - mhenrixon#617 (comment)
mhenrixon pushed a commit that referenced this issue Jul 21, 2021
@mhenrixon mhenrixon changed the title Please try v7.1.1 Please try v7.1.4 Jul 21, 2021
@mhenrixon mhenrixon changed the title Please try v7.1.4 Please try v7.1.5 Jul 28, 2021
@mhenrixon mhenrixon unpinned this issue Jul 12, 2022