Same job (identical JID) runs twice #3303

Closed

krallin opened this issue Jan 2, 2017 · 7 comments

krallin commented Jan 2, 2017

Hi there,

Ruby version: ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-linux]
Sidekiq / Pro / Enterprise version(s): sidekiq (3.5.0)

A little (hopefully useful) context: we're running Sidekiq in a slightly unconventional setup. Sidekiq is used to coordinate jobs across processing clusters that run in different AWS regions and each run their own Redis instances (we have several hundred of these).

Occasionally, these Redis instances may be unreachable or laggy for various reasons:

  • They can be geographically pretty far away
  • There's other stuff running on these Redis instances, which can occasionally compete for resources

Bottom line: now and then it takes long enough to enqueue a job to Redis that the client times out.

This is pretty apparent in our logs: most of the time, it takes 0.25 seconds to enqueue a job, but now and then, it takes ~ 5.25 seconds (5 seconds happens to be the default timeout in redis-rb, which we use).

Unfortunately, when it does take ~ 5.25 seconds, we end up with the same job (same JID) running twice on the destination instance concurrently. Here are some examples from our logs:

2016-12-28T23:05:45.137Z 14090 TID-m2cek Sweetness::Jobs::Delegate JID-5fe3127779677d1e133d1b40 INFO: start
2016-12-28T23:05:45.160Z 14090 TID-1emhzg Sweetness::Jobs::Delegate JID-5fe3127779677d1e133d1b40 INFO: start
2016-12-28T23:05:46.920Z 14090 TID-1emhzg Sweetness::Jobs::Delegate JID-5fe3127779677d1e133d1b40 INFO: done: 1.761 sec
2016-12-28T23:05:47.052Z 14090 TID-m2cek Sweetness::Jobs::Delegate JID-5fe3127779677d1e133d1b40 INFO: done: 1.916 sec
2016-12-28T23:14:19.572Z 18537 TID-fstk8 Sweetness::Jobs::Delegate JID-66a2c3098751a54024b37cb4 INFO: start
2016-12-28T23:14:19.587Z 18537 TID-fgiwo Sweetness::Jobs::Delegate JID-66a2c3098751a54024b37cb4 INFO: start
2016-12-28T23:14:21.296Z 18537 TID-fstk8 Sweetness::Jobs::Delegate JID-66a2c3098751a54024b37cb4 INFO: done: 1.723 sec
2016-12-28T23:14:21.370Z 18537 TID-fgiwo Sweetness::Jobs::Delegate JID-66a2c3098751a54024b37cb4 INFO: done: 1.784 sec
2016-12-29T00:47:06.545Z 8580 TID-fasn4 Sweetness::Jobs::Delegate JID-d28199210162a8974746661d INFO: start
2016-12-29T00:47:06.559Z 8580 TID-e73h4 Sweetness::Jobs::Delegate JID-d28199210162a8974746661d INFO: start
2016-12-29T00:47:07.616Z 8580 TID-e73h4 Sweetness::Jobs::Delegate JID-d28199210162a8974746661d INFO: done: 1.057 sec
2016-12-29T00:47:07.906Z 8580 TID-fasn4 Sweetness::Jobs::Delegate JID-d28199210162a8974746661d INFO: done: 1.361 sec

In each of these instances, the two threads running the same job at the same time are in the same process. I should also mention that retries are disabled here.

I suspect this is caused by the fact that redis-rb automatically re-runs a command when it times out.

I tried reproducing this locally by using socat TCP-LISTEN:6379,fork,bind=0.0.0.0,reuseaddr STDOUT to emulate an unresponsive Redis, and it appears that redis-rb will indeed push twice in this scenario (here's the output when I use Redis.new.lpush('foo', 'bar')):

*3
$5
lpush
$3
foo
$3
bar
*3
$5
lpush
$3
foo
$3
bar

(of course, since socat posing as Redis is completely unresponsive even after the retry, this takes 10 seconds before timing out on the client side, but the second *3 line does show up 5 seconds after the first one)

For comparison, using Redis.new(reconnect_attempts: 0).lpush('foo', 'bar') does throw an error after the 5 seconds timeout, without retrying the LPUSH operation a second time.
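
For reference, here's a consolidated sketch of that local reproduction (the same socat command and client calls as above, just in one runnable file):

# Terminal 1: emulate a Redis server that accepts connections but never replies.
#   socat TCP-LISTEN:6379,fork,bind=0.0.0.0,reuseaddr STDOUT
require 'redis'

# Default client: after the 5s read timeout, redis-rb reconnects and re-sends
# the command, so socat prints the LPUSH payload twice.
begin
  Redis.new.lpush('foo', 'bar')
rescue StandardError => e
  puts "default client: #{e.class}"
end

# Reconnects disabled: the client raises after the 5s timeout and the LPUSH
# is written only once.
begin
  Redis.new(reconnect_attempts: 0).lpush('foo', 'bar')
rescue StandardError => e
  puts "reconnect_attempts: 0: #{e.class}"
end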

We're planning to deploy the reconnect_attempts: 0 workaround throughout our infrastructure soon (and we're happy to report back on whether it solves the issue). Since I believe it's a design goal for Sidekiq that the same JID cannot execute concurrently across two workers (from #2398), I thought you'd like to hear about this failure scenario (and perhaps you'd want to set reconnect_attempts to 0 by default in Sidekiq?).
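
In case it's useful to others hitting this, here's roughly what the workaround looks like applied through Sidekiq's configuration (a sketch; it assumes Sidekiq passes extra options such as reconnect_attempts straight through to the redis-rb client):

# Sketch only: assumes Sidekiq forwards unrecognized redis options
# (here, reconnect_attempts) to Redis.new unchanged.
redis_options = { url: ENV['REDIS_URL'], reconnect_attempts: 0 }

Sidekiq.configure_client do |config|
  config.redis = redis_options
end

Sidekiq.configure_server do |config|
  config.redis = redis_options
end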

Thanks!


mperham commented Jan 2, 2017

This is some great detective work, thanks for investigating. What you write makes perfect sense. It's quite possible this is a cause of some of the mysterious problems that other people have seen throughout the years.

My takeaway is that I should have investigated each option provided by redis-rb. The default values are not necessarily appropriate for Sidekiq; I had assumed that network issues would bubble up immediately.


krallin commented Jan 2, 2017

Thanks for the quick response @mperham!

We've rolled out the workaround I mentioned earlier today — I'll monitor and report here whether we see this problem again (we were seeing this quite infrequently, so it might take a little while before I can confirm whether the issue is indeed resolved).


krallin commented Jan 9, 2017

Just a quick heads up that we've confirmed that setting reconnect_attempts: 0 ensures jobs aren't run multiple times (we get an exception on the client-side, and the job executes — but only once — on the server-side, which is what we expected).


mperham commented Jan 11, 2017

@krallin Many people have been having issues with this fix in #3311 and it's possible I will need to revert it. Can you describe your Redis and networking setup in more detail? Perhaps this behavior is also a function of your deployment architecture.


krallin commented Jan 12, 2017

Hey @mperham,

Sure.

I suspect the issue here (we're investigating this on our end as well; more on that below) is that, as you observed in #3311, Redis connections aren't as robust as we'd want them to be: they may break over time due to timeouts, or because the process forked and held on to old connections.

Here is my understanding of the problem so far:

When Sidekiq tries to reuse an existing connection from the pool, that connection may or may not still be alive. If it is, the request goes through and everything is fine. If it isn't, you now get an error, whereas before this fix you'd have had a transparent reconnect and everything would have been fine.

Now, even if the connection is alive, you can still get a timeout when writing the push to Redis (that's the problem I brought up in this issue). What happens then is essentially undetermined (the push may or may not have reached Redis, and we can't know which), but it depends on reconnect_attempts: with transparent reconnects (reconnect_attempts: 1), we may push the job once or twice; without them (reconnect_attempts: 0), we may push it zero times or once.

Note that in either case, redis-rb will "fix" the connection before it attempts to use it again (if it didn't, that would be a pretty major problem here): it disconnects if it gets an error, or if we try to reuse a connection that was still waiting on data.


Now, as I mentioned, we're investigating this on our end as well...

It does appear that making this change replaced some cases where we had jobs running twice concurrently (these cause major problems for us right now, so given the choice we'll always err on the side of caution) with cases where jobs are not enqueued at all (which isn't as problematic for our use case, but still something we'd like to minimize). I haven't looked closely at whether the rate is similar (I suspect it might be), largely because it's no longer as big a problem.

At least for our use case, I think a clear improvement would be to retry only when we know for sure that it's safe to do so, which would be a good middle ground between the two options we have right now (never retry with reconnect_attempts: 0, or always retry with reconnect_attempts: 1).

For now, I'm planning to PING Redis before using the connection (for our use case, the extra round-trip latency doesn't matter at all). That doesn't guarantee the connection will stay healthy, but it should ensure we don't reuse a connection that died hours ago. In other words, it gives us one reconnect attempt (since redis-rb will auto-reconnect if that PING fails) before we send the real traffic through.

Here's the implementation I have for now. I have confirmed that this recovers from killing all clients in Redis (CLIENT KILL TYPE normal) then attempting to push from Sidekiq, whereas the standard ConnectionPool does not (it'll attempt to reuse the dead connection, fail with Redis::ConnectionError: Connection lost (ECONNRESET), and then not retry).

require 'connection_pool'
require 'redis'
require 'time' # for Time#iso8601

module Operations
  class PingPool < ConnectionPool
    # This is a Redis connection pool that pings connections before using them.
    # For us, this works because we'd rather accept a small delay connecting to
    # Redis if that means we're more confident that we're not reusing a
    # connection that was already dead.
    # See: https://github.com/mperham/sidekiq/issues/3303
    class << self
      def create(redis_url, redis_timeout, pool_size, pool_timeout)
        # We disable auto-reconnect because that can result in a job
        # being pushed multiple times. See the aforementioned Sidekiq issue.
        new(size: pool_size, timeout: pool_timeout) do
          Redis.new(
            url: redis_url,
            timeout: redis_timeout,
            reconnect_attempts: 0,
            driver: 'ruby'
          )
        end
      end
    end

    def checkout(*)
      conn = super

      begin
        Rails.logger.info("#{prefix}: PING")
        r = conn.ping
      rescue StandardError => e
        # If the connection was dead when we PING'd, then redis-rb will
        # auto-reconnect when we try to actually use this connection.
        Rails.logger.info("#{prefix}: PING: Error: #{e.class}: #{e}")
      else
        Rails.logger.info("#{prefix}: PING: #{r}")
      end

      conn
    end

    def prefix
      # We can't reference a Sidekiq logger from the pool, since it will be
      # shared across several jobs (and even by the cron pushers), but we can
      # log the current thread (in the format Sidekiq does), so that it's
      # easier to tie things together.
      tid = Thread.current.object_id.to_s(36)
      "#{Time.now.utc.iso8601(3)} #{::Process.pid} TID-#{tid}"
    end
  end
end
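
For completeness, a usage sketch (the values below are illustrative, not what we actually run):

# Illustrative values: 5s redis timeout, 25 connections, 5s pool checkout timeout.
pool = Operations::PingPool.create(ENV['REDIS_URL'], 5, 25, 5)

pool.with do |redis|
  redis.lpush('some:list', 'payload')
end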

Looking at #3311, there's also at least Redis::InheritedError, which is most definitely an exception that could be retried safely: at that point, redis-rb hasn't even tried to talk to Redis yet.
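
Here's a rough sketch of what I mean by retrying only when it's safe (the exception list is just a starting point based on where redis-rb raises these errors; I haven't battle-tested it):

require 'redis'

# Only retry errors that redis-rb raises *before* any bytes reach Redis,
# so a retry can never double-enqueue a job. Anything raised after the
# command may have been written (timeouts, connection resets) is not retried.
SAFE_TO_RETRY = [Redis::InheritedError, Redis::CannotConnectError].freeze

def push_with_safe_retry(attempts: 2)
  tries = 0
  begin
    yield
  rescue *SAFE_TO_RETRY
    tries += 1
    retry if tries < attempts
    raise
  end
end

# Usage, with a pool like the one above:
#   push_with_safe_retry { pool.with { |redis| redis.lpush('some:list', 'payload') } }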

That being said, pinging before using the connection most likely isn't the right option for everyone. I suspect a more generally applicable and robust solution might be for Sidekiq's job push to be idempotent as far as Redis is concerned (in which case it's always safe to retry)... but I imagine this might represent quite a bit of work!
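
To illustrate the idea (purely hypothetical; the key names and Lua script below are mine, not Sidekiq's): if Redis itself deduplicates on the JID, the push becomes safe to retry any number of times.

# Hypothetical sketch of a JID-deduplicated push. A second attempt with the
# same JID is a no-op on the Redis side, so retrying is always safe.
IDEMPOTENT_PUSH = <<-LUA
  -- KEYS[1] = queue list, KEYS[2] = dedup marker for this JID
  -- ARGV[1] = serialized job payload, ARGV[2] = marker TTL in seconds
  if redis.call('SET', KEYS[2], '1', 'NX', 'EX', ARGV[2]) then
    redis.call('LPUSH', KEYS[1], ARGV[1])
    return 1
  end
  return 0
LUA

def idempotent_push(redis, queue, jid, payload)
  redis.eval(IDEMPOTENT_PUSH, keys: ["queue:#{queue}", "pushed:#{jid}"], argv: [payload, 3600])
end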


Finally, to answer your questions about our setup: we basically have a lot of Redis servers (several hundred as of right now) that Sidekiq pushes jobs to. Most of these aren't particularly busy, and right now we don't enforce any connection timeouts on them (with one exception!).

We're currently on version 3.3.1 of the redis-rb gem.

I'm happy to continue trying out some things here (with the caveat that we're still on Sidekiq 3.x); just let me know if I can help provide more information 😄.


mperham commented Jan 13, 2017

For now, I believe the easiest thing to do is roll back the Sidekiq default for reconnect_attempts to 1. You and others with the same problem can elect to go with 0 and deal with the issues that entails. I wish there were a safer solution, but most people's pushes to Redis don't take 5+ seconds and trigger timeouts.


krallin commented Jan 13, 2017

That's fair; thanks for the heads up @mperham!
