
Issues with Database Outage Recovery in Lagom 1.4.0 #1250

Closed
octonato opened this issue Mar 9, 2018 · 8 comments

@octonato
Member

octonato commented Mar 9, 2018

So far we believe there are two bugs involved:

  • Slick DB not recovering after a long period of failures
  • RestartSource stopping its retries after a long period of failures
@octonato
Member Author

octonato commented Mar 9, 2018

After some further investigation, I came to the conclusion that there is nothing wrong with the RestartSource. It does restart forever.

All eyes are now on a possible deadlock in the Slick DB. Note that the Source that fails to be restarted depends on two calls to the DB: one to fetch the Offset and another for the journal query.

RestartSource.withBackoff(
  config.minBackoff,
  config.maxBackoff,
  config.randomBackoffFactor
) { () =>
  val handler = processor().buildHandler
  val futureOffset = handler.prepare(tag) // <- this hits the DB
  Source.fromFuture(futureOffset).flatMapConcat {
    offset =>
      val eventStreamSource = eventStreamFactory(tag, offset)
      val userlandFlow = handler.handle()
      eventStreamSource.via(userlandFlow) // <- hits the DB again
  }
}

If the DB is somehow blocked, the Future with the Offset may never complete, which means the Source wrapped by RestartSource will never fail because it never really started. It will just hang forever in time and space.
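
To make that concrete, here is a minimal, self-contained sketch (not Lagom code; the never-completing Promise stands in for handler.prepare(tag) when Slick's AsyncExecutor has no free threads). The wrapped Source neither emits nor fails, so the backoff never kicks in:

import scala.concurrent.{ Future, Promise }
import scala.concurrent.duration._

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{ RestartSource, Sink, Source }

object HangingRestartSourceDemo extends App {
  implicit val system: ActorSystem = ActorSystem("demo")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Stand-in for handler.prepare(tag) when the DB thread pool is exhausted:
  // a Future that is never completed.
  val neverCompleting: Future[Long] = Promise[Long]().future

  // The inner Source neither emits nor fails, so RestartSource has nothing
  // to back off from and the stream just hangs.
  RestartSource
    .withBackoff(minBackoff = 3.seconds, maxBackoff = 30.seconds, randomFactor = 0.2) { () =>
      Source.fromFuture(neverCompleting)
    }
    .runWith(Sink.foreach(offset => println(s"got offset $offset"))) // never prints
}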

octonato self-assigned this Mar 9, 2018
@octonato
Member Author

Found the error in Slick. This is a regression that was introduced when fixing slick/slick#1274.

I have an open PR for Slick:
slick/slick#1876

@octonato
Member Author

In Lagom, we still need to fix the way we are using Source.fromFuture(futureOffset) (see code fragment above).

If futureOffset never completes, the Source will never start and the backoff will never restart it.

@ignasi35
Contributor

@TimMoore if the problem is that handler.prepare(tag) doesn't time out (doesn't complete in some scenarios), then adding a timeout from the outside would just hold more connections.

@octonato
Member Author

@TimMoore I fixed it locally using:

// needs akka.pattern.after, an implicit ExecutionContext and
// java.util.concurrent.TimeoutException in scope
val eventualTimeout =
  after(duration = 3.seconds, using = context.system.scheduler) {
    Future.failed(new TimeoutException("Timeout while fetching read-side processor offset"))
  }
// handler.prepare returns a Java CompletionStage, hence the toScala conversion
val futureOffset = handler.prepare(tag).toScala
// whichever completes first wins: the real offset or the timeout failure
val offsetWithTimeout = Future.firstCompletedOf(Seq(futureOffset, eventualTimeout))
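
As a sketch of how this plugs into the stream from the first comment (reusing the names from that fragment), the timeout-protected future is what gets wrapped, so a timeout failure propagates into the stream and the backoff can restart it:

// If the timeout fires first, this Source fails and RestartSource restarts it.
Source.fromFuture(offsetWithTimeout).flatMapConcat { offset =>
  eventStreamFactory(tag, offset).via(handler.handle())
}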

@ignasi35, I rechecked the code and this is not true. The reason this particular future never completes is that Slick's AsyncExecutor doesn't have any threads available for it.

Whenever we get a Future from another library, we can't assume that it will complete, so I think it's fair to say that one should always protect against it. Actually, we should bring that fix into Akka Streams: a variant of Source.fromFuture that also takes a timeout and fails the Source if the future never completes. I will raise an issue there to see what they think.

@octonato
Member Author

Oh, it seems that there is already something we can use.

Source.fromFuture(fut).completionTimeout(3.seconds)
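
Applied to the read-side stream from the first comment, that would look roughly like this (same names as before; the 3-second timeout is illustrative, not necessarily what the eventual fix uses):

RestartSource.withBackoff(
  config.minBackoff,
  config.maxBackoff,
  config.randomBackoffFactor
) { () =>
  val handler = processor().buildHandler
  Source
    .fromFuture(handler.prepare(tag))
    // fail the Source if the offset is not fetched in time,
    // so the backoff has a failure to react to and can restart it
    .completionTimeout(3.seconds)
    .flatMapConcat { offset =>
      eventStreamFactory(tag, offset).via(handler.handle())
    }
}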

@octonato
Member Author

There is an open PR (#1278) that fixes the RestartSource issue. I'm keeping this open to track the fix in Slick.
