Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gracefully handle connection errors #2550

Merged
merged 2 commits into from Sep 11, 2015
Merged

gracefully handle connection errors #2550

merged 2 commits into from Sep 11, 2015

Conversation

mperham
Copy link
Collaborator

@mperham mperham commented Sep 11, 2015

ref: redis/redis-rb#543

When AWS ElastiCache fails over from one node to another node, redis clients can be left connected to the original node which is now read-only. Sidekiq will then be unable to continue accepting or processing jobs until the connections are reset.

It seems that the best place to handle the exception would be within the sidekiq code as there are reasons why a general redis client might want to continue holding a connection to a read-only redis slave.

the specific error raised when this happens is:
Redis::CommandErrorREADONLY You can't write against a read only slave.

Ideally when this type of error occurs, sidekiq should drop the redis connection and retry the operation.

@mperham
Copy link
Collaborator

mperham commented Sep 11, 2015

Unfortunately it's not as easy as that. connection_pool manages the connections and assumes that connections are self-healing. Furthermore, Sidekiq has no use at all for a readonly Redis connection. In a perfect world, redis-rb would reconnect upon any READONLY errors. I'll ponder if there's something I can do to make it better.

@ryansch
Copy link
Contributor

ryansch commented Sep 11, 2015

Seems like we'd have to create a shim/proxy around the client object in RedisConnection not unlike how Redis::Namespace works.

Edit: I had success using connection_pool this way in https://github.com/aceofsales/faraday_persistent_excon/blob/master/lib/faraday_persistent_excon/connection_pools.rb#L33

@mperham
Copy link
Collaborator

mperham commented Sep 11, 2015

I was thinking something like:

module Sidekiq
  def self.redis
    @redis_pool.with do |conn|
      begin
        yield conn
      rescue Redis::BaseError => ex
        (conn.disconnect; retry) if ex.message =~ /READONLY/
      end
    end 
  end
end

@ryansch
Copy link
Contributor

ryansch commented Sep 11, 2015

👍 for retry

@mperham
Copy link
Collaborator

mperham commented Sep 11, 2015

I'm not sure 100% of Sidekiq's redis operations go through Sidekiq.redis but most of them do.

mperham added a commit that referenced this pull request Sep 11, 2015
gracefully handle connection errors
@mperham mperham merged commit adb8e1e into master Sep 11, 2015
@mperham mperham deleted the readonly_disconnect branch September 11, 2015 18:07
@craigmcnamara
Copy link

I have an issue redis-rb#550 open for this and a patch gem specifically for Redis on Elasticache https://github.com/craigmcnamara/redis-elasticache

@mperham
Copy link
Collaborator

mperham commented Oct 10, 2015

3.5.1 was just released with a EC "READONLY" fix

On Oct 10, 2015, at 01:17, Craig McNamara notifications@github.com wrote:

I have an issue redis-rb#550 open for this and a patch gem specifically for Redis on Elasticache https://github.com/craigmcnamara/redis-elasticache


Reply to this email directly or view it on GitHub.

franzliedke added a commit to franzliedke/sidekiq that referenced this pull request Sep 6, 2021
These errors can occur during Sidekiq's long-running job fetching
command. This uses Redis' blocking BRPOP primitive. On failover in a
cluster setup, these commands are interrupted by the server.

This error causes the worker threads to be restarted, but as they are
bubbled up to the top, they cause a lot of spam in our error logging
systems. As related errors from other commands are being handled (see
sidekiq#2550 and sidekiq#4495) this way, it seems senbile to also handle this one.
franzliedke added a commit to franzliedke/sidekiq that referenced this pull request Sep 6, 2021
These errors can occur during Sidekiq's long-running job fetching
command. This uses Redis' blocking BRPOP primitive. On failover in a
cluster setup, these commands are interrupted by the server.

This error causes the worker threads to be restarted, but as they are
bubbled up to the top, they cause a lot of spam in our error logging
systems. As related errors from other commands are being handled (see
sidekiq#2550 and sidekiq#4495) this way, it seems senbile to also handle this one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants