Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent never ending wait for cancelation of a failed (timed-out) attempt to acquire a connection. #2470

Merged
merged 3 commits into from Mar 28, 2019

Conversation

@RafalSumislawski
Copy link
Contributor

@RafalSumislawski RafalSumislawski commented Mar 23, 2019

This is a workaround for typelevel/cats-effect#487 causing blaze client to wait indefinitely if a requestTimeout occurs while manager.borrow(key) is running and the borrow eventually fails. This is what happens when we try to send a request to an unreachable host (incorrect IP address, firewalls, etc.).

IMO this is a solution for #2338. It literally implements "a reasonable timeout, even if hardcoded" as @rossabaker described it ;).

This is also a partial solution for #2386 . The requestTimeout still isn't enforced, instead we wait for an OS level timeout of an attempt to establish TCP connection, but at least the request fails eventually, instead of waiting indefinitely. I want to follow up with a more complete solution, but I feel it requires making some design decisions, which should be discussed with the maintainers.

invalidate(next.connection)
case _ =>
Copy link
Contributor

@kevinmeredith kevinmeredith Mar 26, 2019

Could you please explain, @RafalSumislawski, why it's not necessary to explicitly handle the case (Left(error), _)) => ... case?

In other words, what's the import of returning IO.unit if, as I understand, manager.borrow(key).attempt returns an F(Left(RuntimeException(...)))?

Copy link
Contributor Author

@RafalSumislawski RafalSumislawski Mar 26, 2019

The lines 65-70 are the release function. If acquiring the connection failed, there's nothing to release. There's also no need to propagate the error here, as it will already be propagated by case Left(t) => F.raiseError(t) on line 88. So we do nothing aka return IO.unit.

@@ -156,5 +157,29 @@ class ClientTimeoutSpec extends Http4sSpec {

c.fetchAs[String](FooRequest).unsafeRunSync must_== "done"
}

// Regression test for: https://github.com/http4s/http4s/issues/2386
Copy link
Contributor

@kevinmeredith kevinmeredith Mar 27, 2019

Nice, well-commented test!

Copy link
Member

@rossabaker rossabaker left a comment

Thanks! This looks good to me. Mild concern about test runtimes below.

"Eventually timeout on connect timeout" in {
val manager = ConnectionManager.basic[IO, BlazeConnection[IO]]({ _ =>
// In a real use case this timeout is under OS's control (AsynchronousSocketChannel.connect)
IO.sleep(2.seconds) *> IO.raiseError[BlazeConnection[IO]](new IOException())
Copy link
Member

@rossabaker rossabaker Mar 27, 2019

It ate my comment.

What are the actual requirements for sleeping here? Somewhat longer than requestTimeout, less than the argument to unsafeRunTimed? I'm concerned about blocking a thread for two seconds, which slows down test runs and bogs other things down. I think the way we do this contributes to our instability in blaze tests on Travis CI.

The trick is that if the timeout is too aggressive, we run into race conditions, and if it's too slow, we have slow CI and can starve other tests. If we were smarter about cats.effect.Timer, we'd have less problems like this, but that would require more plumbing into blaze. That's well beyond the scope of this ticket.

Anyway, I'd like to make these timeouts as tight as we can without running into noisy failures. Just making sure I understand the ordering.

Copy link
Contributor Author

@RafalSumislawski RafalSumislawski Mar 28, 2019

The requirement is that requestTimeout < IO.sleep < unsafeRunTimed. For as long as everything works this tests will run for the time of IO.sleep, so this it the number that we should care the most about. requestTimeout can be near zero. unsafeRunTimed affects only run time of failing test, so we can be a bit more generous with it.

Maybe we could go with something like requestTimeout = 50.millis, IO.sleep(1.second), unsafeRunTimed(2.seconds) ? That would make this test case the fastest in the ClientTimeoutSpec.

// if establishing connection fails first then it's an IOException

// The expected behaviour is that the requestTimeout will happen first, but fetchAs will additionally wait for the IO.sleep(2.seconds) to complete.
c.fetchAs[String](FooRequest).unsafeRunTimed(5.seconds).get must throwA[TimeoutException]
Copy link
Contributor

@kevinmeredith kevinmeredith Mar 28, 2019

Per your PR's description:

The requestTimeout still isn't enforced, instead we wait for an OS level timeout of an attempt to establish TCP connection, but at least the request fails eventually, instead of waiting indefinitely.

Why wouldn't an IOException be thrown since "the requestTimeout still isn't enforced?"

Copy link
Contributor Author

@RafalSumislawski RafalSumislawski Mar 28, 2019

In short: The requestTimeout is why the request fails, so that's what is returned., but the value of the requestTimeout is not when the request completes.

In detail - here's what happens:

  • sending a request (including establishing connection), and the requestTimeout are started in parallel racing with each other
  • after 1 second the requestTimeout is hit, it wins the race. The racePair completes with TimeoutException ( ) ,
  • fiber.cancel is called (
    case Right((fiber, t)) => fiber.cancel >> F.raiseError(t)
    ) attempting to cancel the proces of sending the request. At that time that process is still during the acuire part of a Bracket which is uninterruptable. So the fiber.cancel waits until the aquire completes and then interrupts the Bracket.
  • after 2 seconds from the start of the test, the aquire completes with an IOException. The fiber.cancelcompletes and gets flatMapped with F.raiseError(t), t being the original TimeoutException. That becomes the final result of the request.

Copy link
Contributor

@kevinmeredith kevinmeredith Mar 28, 2019

Thanks for that detailed explanation, Rafał!

reduce timeouts in test
Copy link
Member

@rossabaker rossabaker left a comment

👍 on green.

@ChristopherDavenport ChristopherDavenport merged commit 4cd1ea2 into http4s:master Mar 28, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
RafalSumislawski added a commit to RafalSumislawski/http4s that referenced this issue May 3, 2019
- remove the workarounds for typelevel/cats-effect#487 introduced in http4s#2470 as the issue is fix in cats-effects 1.3.0
aeons added a commit that referenced this issue May 4, 2019
…and-remove-workarounds

Use cats-effect 1.3.0 and remove workarounds introduced in #2470
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants