Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent never ending wait for cancelation of a failed (timed-out) attempt to acquire a connection. #2470

Merged
merged 3 commits into from Mar 28, 2019

Conversation

Projects
None yet
4 participants
@RafalSumislawski
Copy link
Contributor

RafalSumislawski commented Mar 23, 2019

This is a workaround for typelevel/cats-effect#487 causing blaze client to wait indefinitely if a requestTimeout occurs while manager.borrow(key) is running and the borrow eventually fails. This is what happens when we try to send a request to an unreachable host (incorrect IP address, firewalls, etc.).

IMO this is a solution for #2338. It literally implements "a reasonable timeout, even if hardcoded" as @rossabaker described it ;).

This is also a partial solution for #2386 . The requestTimeout still isn't enforced, instead we wait for an OS level timeout of an attempt to establish TCP connection, but at least the request fails eventually, instead of waiting indefinitely. I want to follow up with a more complete solution, but I feel it requires making some design decisions, which should be discussed with the maintainers.

RafalSumislawski added some commits Mar 23, 2019

Work around typelevel/cats-effect#487 causing a never ending wait for…
… cancelation of a failed (timed-out) attempt to acquire a connection.

#2386 #2338
invalidate(next.connection)
case _ =>

This comment has been minimized.

Copy link
@kevinmeredith

kevinmeredith Mar 26, 2019

Contributor

Could you please explain, @RafalSumislawski, why it's not necessary to explicitly handle the case (Left(error), _)) => ... case?

In other words, what's the import of returning IO.unit if, as I understand, manager.borrow(key).attempt returns an F(Left(RuntimeException(...)))?

This comment has been minimized.

Copy link
@RafalSumislawski

RafalSumislawski Mar 26, 2019

Author Contributor

The lines 65-70 are the release function. If acquiring the connection failed, there's nothing to release. There's also no need to propagate the error here, as it will already be propagated by case Left(t) => F.raiseError(t) on line 88. So we do nothing aka return IO.unit.

@@ -156,5 +157,29 @@ class ClientTimeoutSpec extends Http4sSpec {

c.fetchAs[String](FooRequest).unsafeRunSync must_== "done"
}

// Regression test for: https://github.com/http4s/http4s/issues/2386

This comment has been minimized.

Copy link
@kevinmeredith

kevinmeredith Mar 27, 2019

Contributor

Nice, well-commented test!

@rossabaker
Copy link
Member

rossabaker left a comment

Thanks! This looks good to me. Mild concern about test runtimes below.

"Eventually timeout on connect timeout" in {
val manager = ConnectionManager.basic[IO, BlazeConnection[IO]]({ _ =>
// In a real use case this timeout is under OS's control (AsynchronousSocketChannel.connect)
IO.sleep(2.seconds) *> IO.raiseError[BlazeConnection[IO]](new IOException())

This comment has been minimized.

Copy link
@rossabaker

rossabaker Mar 27, 2019

Member

It ate my comment.

What are the actual requirements for sleeping here? Somewhat longer than requestTimeout, less than the argument to unsafeRunTimed? I'm concerned about blocking a thread for two seconds, which slows down test runs and bogs other things down. I think the way we do this contributes to our instability in blaze tests on Travis CI.

The trick is that if the timeout is too aggressive, we run into race conditions, and if it's too slow, we have slow CI and can starve other tests. If we were smarter about cats.effect.Timer, we'd have less problems like this, but that would require more plumbing into blaze. That's well beyond the scope of this ticket.

Anyway, I'd like to make these timeouts as tight as we can without running into noisy failures. Just making sure I understand the ordering.

This comment has been minimized.

Copy link
@RafalSumislawski

RafalSumislawski Mar 28, 2019

Author Contributor

The requirement is that requestTimeout < IO.sleep < unsafeRunTimed. For as long as everything works this tests will run for the time of IO.sleep, so this it the number that we should care the most about. requestTimeout can be near zero. unsafeRunTimed affects only run time of failing test, so we can be a bit more generous with it.

Maybe we could go with something like requestTimeout = 50.millis, IO.sleep(1.second), unsafeRunTimed(2.seconds) ? That would make this test case the fastest in the ClientTimeoutSpec.

// if establishing connection fails first then it's an IOException

// The expected behaviour is that the requestTimeout will happen first, but fetchAs will additionally wait for the IO.sleep(2.seconds) to complete.
c.fetchAs[String](FooRequest).unsafeRunTimed(5.seconds).get must throwA[TimeoutException]

This comment has been minimized.

Copy link
@kevinmeredith

kevinmeredith Mar 28, 2019

Contributor

Per your PR's description:

The requestTimeout still isn't enforced, instead we wait for an OS level timeout of an attempt to establish TCP connection, but at least the request fails eventually, instead of waiting indefinitely.

Why wouldn't an IOException be thrown since "the requestTimeout still isn't enforced?"

This comment has been minimized.

Copy link
@RafalSumislawski

RafalSumislawski Mar 28, 2019

Author Contributor

In short: The requestTimeout is why the request fails, so that's what is returned., but the value of the requestTimeout is not when the request completes.

In detail - here's what happens:

  • sending a request (including establishing connection), and the requestTimeout are started in parallel racing with each other
  • after 1 second the requestTimeout is hit, it wins the race. The racePair completes with TimeoutException ( ) ,
  • fiber.cancel is called (
    case Right((fiber, t)) => fiber.cancel >> F.raiseError(t)
    ) attempting to cancel the proces of sending the request. At that time that process is still during the acuire part of a Bracket which is uninterruptable. So the fiber.cancel waits until the aquire completes and then interrupts the Bracket.
  • after 2 seconds from the start of the test, the aquire completes with an IOException. The fiber.cancelcompletes and gets flatMapped with F.raiseError(t), t being the original TimeoutException. That becomes the final result of the request.

This comment has been minimized.

Copy link
@kevinmeredith

kevinmeredith Mar 28, 2019

Contributor

Thanks for that detailed explanation, Rafał!

reformat code
reduce timeouts in test
@rossabaker
Copy link
Member

rossabaker left a comment

👍 on green.

@ChristopherDavenport ChristopherDavenport merged commit 4cd1ea2 into http4s:master Mar 28, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.