New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection Recovery #201

Open
allen-servedio opened this Issue Oct 4, 2017 · 6 comments

Comments

Projects
None yet
4 participants
@allen-servedio

allen-servedio commented Oct 4, 2017

I am not sure if (or should) anything can be done here, but I am seeing this scenario and wonder if there is a way to have the system recover from this more quickly:

  1. Service has started normally and is able to connect to redis and make requests.
  2. Either redis goes down and/or there is something else that breaks the connection between the service and redis (so a network partition, etc..)
  3. Redis comes back up (or network partition is resolved).
  4. Next attempt by service to contact redis will fail with an EOFException (see below). Any subsequent requests will work because this failure appears to trigger EITHER a reconnect or throwing away the connection (I did not dig into the code to see which).

Is it possible (and is it a good idea?) for Carmine to try to refresh the connection and retry the call once, when this happens, before giving up and failing?

I have not tried this in a highly concurrent environment yet where I would have several connections in my pool so I don't know if this would also happen for each connection in the pool...

Here is the error that I see (truncated to knock out my code):

#error {
 :cause nil
 :via
 [{:type java.io.EOFException
   :message nil
   :at [java.io.DataInputStream readByte DataInputStream.java 267]}]
 :trace
 [[java.io.DataInputStream readByte DataInputStream.java 267]
  [taoensso.carmine.protocol$get_unparsed_reply invokeStatic protocol.clj 122]
  [taoensso.carmine.protocol$get_unparsed_reply invoke protocol.clj 106]
  [taoensso.carmine.protocol$eval20396$get_parsed_reply__20397 invoke protocol.clj 198]
  [taoensso.carmine.protocol$execute_requests invokeStatic protocol.clj 320]
  [taoensso.carmine.protocol$execute_requests invoke protocol.clj 284]
  [taoensso.carmine.protocol$_with_replies invokeStatic protocol.clj 342]
  [taoensso.carmine.protocol$_with_replies invoke protocol.clj 328]
@aravindbaskaran

This comment has been minimized.

aravindbaskaran commented May 6, 2018

@allen-servedio Just to get more context - Is this within a with-new-pubsub-listener or with-new-listener?

If yes, I have the same questions, have been trying to address it here - #207. Or if that is an overkill, have added some extra macros that address your questions here - https://github.com/aravindbaskaran/redis-pubsub

Do let me know your thoughts on that?

@kendagriff

This comment has been minimized.

kendagriff commented Oct 24, 2018

I'll pile on—we're also getting frequent EOFExceptions, including SocketTimeoutExceptions and "Carmine connection error"s. Our application is NOT using pub/sub, only blocking stuff, and we have a super basic setup.

(def conn {:pool {} :spec {:uri (env :redis-url)}})

We are, however, on Heroku which has a default connection timeout of 300 (seconds). Is that possibly impacting the situation? We're trying to decide whether we need to simply wrap our wcar macro in with-retries type of stuff.

@ptaoussanis

This comment has been minimized.

Owner

ptaoussanis commented Oct 24, 2018

Hi all-

Haven't had an opportunity to look at this properly yet, but just checking in the meantime: has anyone tried tuning their connection pool config? Carmine doesn't have a built-in retry mechanism, but the pool's behaviour is fairly configurable (for example- you could setup test-before-use).

@kendagriff

This comment has been minimized.

kendagriff commented Oct 24, 2018

Thanks for the speedy response. I'm seeing test-on-borrow?, test-on-return?, and test-while-idle?. Is that what you're talking about?

@kendagriff

This comment has been minimized.

kendagriff commented Oct 24, 2018

And I'm assuming they have a performance trade-off?

@kendagriff

This comment has been minimized.

kendagriff commented Oct 29, 2018

For what it's worth, Heroku responded to my questions about dropped connections:

There are many factors that can cause a Redis connection to die, including networking blips on either the Redis or Dyno side, or network blips in the underlying platform (AWS). But your database connection pool should protect you against this because if a connection fails it should be replaced. You might want to check your client to see if it has connection testing, or periodic pinging configuration options. This can sometimes prevent connections from dying and detect failures early.

@ptaoussanis: Just making sure I understand correctly: When a connection in the pool goes bad, that's only discoverable by a failed query, right?

I implemented a with-retries-type strategy (using again), and it's made a huge improvement. Having said that, however, is there not a case for implementing a retry strategy in carmine? test-on-borrow?, from the little research I've done, appears to introduce a big performance trade-off, which you wouldn't want turned on to address occasional network blips.

I can totally appreciate a case against a retry strategy too (given that retry strategy implies, well, a strategy, and apps might differ in how they want that set up); just curious what your thoughts are...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment