Same job (identical JID) runs twice #3303
Comments
This is some great detective work, thanks for investigating. What you write makes perfect sense. It's quite possible this is a cause of some of the mysterious problems that other people have seen over the years. My takeaway is that I should have investigated each option provided by redis-rb: the default values are not necessarily appropriate for Sidekiq, and I had assumed that network issues would bubble up immediately.
Thanks for the quick response @mperham! We rolled out the workaround I mentioned above earlier today. I'll monitor and report back here on whether we see this problem again (we were seeing it quite infrequently, so it might take a little while before I can confirm whether the issue is indeed resolved).
Just a quick heads up that we've confirmed that setting `reconnect_attempts: 0` resolved this issue on our end.
Hey @mperham,

Sure. I suspect the issue here (we're investigating this on our end as well; more on that below) is that, as you observed in #3311, Redis connections aren't as robust as we'd want them to be: they may break over time due to timeouts, or because the process forked while holding on to old connections.

Here is where my understanding of the problem stands so far. When Sidekiq tries to reuse an existing connection from the pool, that connection may or may not still be alive. If it is, the request goes through and everything is fine. If it's not, you get an error, whereas before this fix you'd have had a transparent reconnect and everything would have been fine.

Now, if the connection is alive, you can still get a timeout when writing the push to Redis (that's the problem I brought up in this issue). What happens then is pretty much undetermined, as the push may or may not have gone through to Redis, and we can't know which.

Note that in either case, redis-rb will "fix" the connection before it attempts to use it again (it would admittedly be a pretty major problem otherwise): it'll disconnect if it gets an error, or if we attempt to reuse a connection that was still waiting on some data.

Now, as I mentioned, we're investigating this as well. It does appear that making this change replaced some cases where we had jobs running twice concurrently (these cause major problems for us right now, so given a choice we'll always err on the side of caution) with cases where jobs are not being enqueued at all (which isn't as problematic for our use case, but still something we'd like to minimize).
I haven't looked closely at whether the rate is similar (I suspect it might be), largely because it's not as big a problem anymore.

At least for our use case, I think a clear improvement would be to retry when we know for sure that it's safe to do so (and not in other cases), which would be a good middle ground between the two options we have right now (never retrying, with `reconnect_attempts: 0`, versus redis-rb's default of always retrying once).

For now, I'm planning to try pinging connections before reusing them. Here's the implementation I have for now; I have confirmed that it recovers from killing all clients in Redis.
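The actual implementation referenced above wasn't captured in this thread. As a rough illustration of the "retry only when it's safe" idea, here's a hedged sketch (the names `with_live_connection` and `DeadConnection` are hypothetical, not from the thread): PING a pooled connection before handing it out, and retry only the checkout, since PING has no side effects, while the caller's command is issued exactly once.

```ruby
DeadConnection = Class.new(StandardError)

# Hypothetical sketch: `pool` is any object whose `with` method yields a
# connection (connection_pool-style). Only the side-effect-free PING is
# ever retried; the caller's command runs once on a known-good socket.
def with_live_connection(pool, checkout_retries: 1)
  attempts = 0
  begin
    pool.with do |conn|
      begin
        conn.ping # raises if the socket is dead
      rescue StandardError => e
        # Wrap the failure so we never accidentally retry the command below.
        raise DeadConnection, e.message
      end
      return yield(conn) # issued exactly once; never retried here
    end
  rescue DeadConnection
    attempts += 1
    retry if attempts <= checkout_retries
    raise
  end
end
```

A timeout raised by the command itself still propagates to the caller, which is the point: that case is ambiguous and must not be blindly retried.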
That being said (and this also applies to #3311), I suspect a more generally applicable and robust solution here (pinging before using the connection most likely isn't the right option for everyone) might be for Sidekiq's job push to be idempotent as far as Redis is concerned (in which case it's always safe to retry)... but I imagine this might represent quite a bit of work!

Finally, to answer your questions about our setup: we basically have a lot of Redis servers (several hundred as of right now) that Sidekiq pushes jobs to. Most of these aren't particularly busy, and right now we don't enforce any connection timeouts on them (with a single exception!). We currently use version 3.3.1 of the redis-rb gem.

I'm happy to continue trying out some things here (with the caveat that we're still on Sidekiq 3.x); just let me know if I can help provide more information 😄.
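For illustration only (this is not Sidekiq's actual code), one way to make a push idempotent from Redis's point of view is to guard it with a per-JID key: SETNX plus LPUSH run atomically inside one Lua script, so replaying the whole script after an ambiguous timeout can never enqueue the job twice. The key names, TTL, and helper name below are assumptions.

```ruby
# Hypothetical sketch: an idempotent enqueue guarded by a per-JID key.
IDEMPOTENT_PUSH = <<~LUA
  -- KEYS[1] = queue list, KEYS[2] = per-JID guard key
  -- ARGV[1] = serialized job payload, ARGV[2] = guard TTL in seconds
  if redis.call('SETNX', KEYS[2], 1) == 1 then
    redis.call('EXPIRE', KEYS[2], ARGV[2])
    return redis.call('LPUSH', KEYS[1], ARGV[1])
  end
  return 0
LUA

# `conn` is any redis-rb style client responding to eval(script, keys:, argv:).
# Returns the new queue length on the first push, 0 on a replayed attempt.
def idempotent_push(conn, queue, jid, payload, guard_ttl: 3600)
  conn.eval(IDEMPOTENT_PUSH,
            keys: ["queue:#{queue}", "jid-guard:#{jid}"],
            argv: [payload, guard_ttl])
end
```

With this in place, a client-side retry after a write timeout is safe: either the first attempt landed (and the replay returns 0) or it didn't (and the replay performs the push).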
For now, I believe the easiest thing to do is roll back the Sidekiq default for `reconnect_attempts`.
That's fair; thanks for the heads up @mperham!
Hi there,
Ruby version:
ruby 2.2.4p230 (2015-12-16 revision 53155) [x86_64-linux]
Sidekiq / Pro / Enterprise version(s):
sidekiq (3.5.0)
A little (hopefully useful) context: we're running Sidekiq in a slightly unconventional setup. Sidekiq is used to coordinate jobs across processing clusters that run in different AWS regions and each run their own Redis instance (we have several hundred of these).
Occasionally, these Redis instances may be unreachable or laggy for various reasons.
Bottom line: now and then it can take a little while to enqueue jobs to Redis, long enough that things time out.
This is pretty apparent in our logs: most of the time, it takes 0.25 seconds to enqueue a job, but now and then, it takes ~ 5.25 seconds (5 seconds happens to be the default timeout in redis-rb, which we use).
Unfortunately, when it does take ~ 5.25 seconds, we end up with the same job (same JID) running twice on the destination instance concurrently. Here are some examples from our logs:
In each of these instances, the two threads running the same job at the same time are in the same process. I should also mention that retries are disabled here.
I suspect this is caused by the fact that `redis-rb` automatically re-runs a command when it times out. I tried reproducing this locally by using `socat TCP-LISTEN:6379,fork,bind=0.0.0.0,reuseaddr STDOUT` to emulate an unresponsive Redis, and it appears that `redis-rb` will indeed push twice in this scenario (here's the output when I use `Redis.new.lpush('foo', 'bar')`). Of course, since socat posing as Redis is completely unresponsive even after the retry, this takes 10 seconds before timing out on the client side, but the second `*3` line does show up 5 seconds after the first one.

For comparison, using `Redis.new(reconnect_attempts: 0).lpush('foo', 'bar')` does throw an error after the 5-second timeout, without retrying the `LPUSH` operation a second time.

We're planning to deploy the `reconnect_attempts: 0` workaround throughout our infrastructure soon (and we're happy to report whether that solves the issue), but I believe it's a design goal for Sidekiq that the same JID cannot execute concurrently across two workers (from here: #2398), so I thought perhaps you'd like to hear about this failure scenario! (And perhaps you'd want to set `reconnect_attempts` to 0 by default in Sidekiq?)

Thanks!
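The `reconnect_attempts: 0` workaround can be applied through Sidekiq's Redis options hash, which Sidekiq forwards to the underlying redis-rb client. A sketch (the URL is a placeholder; adjust for your deployment):

```ruby
require "sidekiq"

# Disable redis-rb's transparent retry so an ambiguous timeout surfaces
# as an error instead of silently re-running the command, which is what
# can double-push a job. Keys in this hash are passed to the Redis client.
Sidekiq.configure_client do |config|
  config.redis = { url: "redis://localhost:6379/0", reconnect_attempts: 0 }
end

Sidekiq.configure_server do |config|
  config.redis = { url: "redis://localhost:6379/0", reconnect_attempts: 0 }
end
```

The trade-off discussed in this thread applies: with retries disabled, a timed-out push raises to the caller, so some jobs may fail to enqueue rather than be enqueued twice.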