
Fixed issue in PoolManager which caused requests not to ever be executed under some circumstances #2562

Merged: 8 commits into http4s:series/0.20 on May 15, 2019

Conversation

6 participants
@aldiyen (Contributor) commented May 9, 2019

I believe this properly fixes the issue described in PR #2546. I'm not terribly familiar with this code, so it's possible that there was a reason why numConnectionsCheckHolds was being called outside of the monadic context inside of createConnection, in which case I think an equivalent solution could be localized within borrow.

rossabaker and others added some commits May 2, 2019

  • Merge pull request #2559 from gvolpe/docs/link-talk-and-library (Linking http4s-tracer and blog posts / talk about error handling with…)
  • Merge pull request #2554 from scala-steward/update/sbt-native-packager-1.3.21 (Update sbt-native-packager to 1.3.21)
  • Fixed issue in PoolManager where numConnectionsCheckHolds() was called inside of createConnection() outside of the asynchronous context, which meant the connection count was being checked before decrConnection()'s effect had been executed and things were added to the wait queue when they were intended to be run

@aldiyen aldiyen changed the title Fixed issue in PoolManager which caused requests not to be executed under some circumstances Fixed issue in PoolManager which caused requests not to ever be executed under some circumstances May 9, 2019

```scala
} else {
  addToWaitQueue(key, callback)
}
F.ifM(F.delay(numConnectionsCheckHolds(key)))(
```

@jmcardon (Member) commented May 9, 2019

numConnectionsCheckHolds accesses a mutable map, but it does not mutate it. I don't think this requires delay.

Every call within the if is already thunked within delay as well.

@aldiyen (Contributor, Author) commented May 9, 2019

The problem is that createConnection is called after decrConnection; however, decrConnection decrements the connection count asynchronously, while createConnection checks the connection count synchronously. As a result, if the connection pool is at maxTotalConnections, createConnection adds to the wait queue when the code expects it to create a new connection.

The unit tests I added fail without this change. I’m open to considering alternative solutions if someone can point me in the right direction here.
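A minimal sketch of the race in plain Scala (hypothetical names, not the http4s code): the decrement is scheduled asynchronously while the check runs synchronously, so the check observes the stale count:

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.{Executors, TimeUnit}

object StaleCheck {
  val maxTotal = 1
  val curTotal = new AtomicInteger(maxTotal) // pool starts "full"

  // Returns what the synchronous check saw, and the count after the
  // asynchronous decrement has actually completed.
  def run(): (Boolean, Int) = {
    val ec = Executors.newSingleThreadExecutor()
    // decrConnection analogue: scheduled to run later, like a thunked effect
    ec.submit(new Runnable {
      def run(): Unit = { Thread.sleep(50); curTotal.decrementAndGet() }
    })
    // createConnection's old check: synchronous, sees the stale count
    val checkHolds = curTotal.get() < maxTotal
    ec.shutdown()
    ec.awaitTermination(1, TimeUnit.SECONDS)
    (checkHolds, curTotal.get())
  }
}
```

The check returns false (stale view) even though the count ends at 0 once the decrement lands, which is why the request is parked on the wait queue instead of getting a fresh connection.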

@jmcardon (Member) commented May 9, 2019

You are likely right; this seems to be more of a synchronization issue than an issue with thunking a boolean check.

ifM and delay only thunk statements; they do not guarantee any sort of atomicity in execution. Your tests are likely passing only because the thunking forces execution through the evaluator and slows it down (so you observe it passing under a single-fiber test), but under many concurrent calls to borrow this will likely still fail. The real problem is allocated, a mutable map that is not thread safe.

There are a few solutions:

  • An fs2.Ref holding a Map instead of mutable.Map.
  • An AtomicReference instead of a Ref (at the cost of handling low-level synchronization yourself, so Ref is suggested).
  • A @volatile reference, if you deem you don't need compareAndSet semantics.
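As a sketch of the second option (the names and the simplified PoolState are hypothetical, not the http4s API): an AtomicReference over an immutable snapshot makes the check and the increment a single atomic step via a compareAndSet loop, so no other thread can interleave between them:

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical simplified state: total count plus per-key allocations
final case class PoolState(curTotal: Int, allocated: Map[String, Int])

object AtomicPool {
  val maxTotal = 10
  val state = new AtomicReference(PoolState(0, Map.empty))

  // Atomically check-and-increment: retry the CAS loop on contention.
  @annotation.tailrec
  def tryAcquire(key: String): Boolean = {
    val s = state.get()
    if (s.curTotal >= maxTotal) false
    else {
      val next = PoolState(
        s.curTotal + 1,
        s.allocated.updated(key, s.allocated.getOrElse(key, 0) + 1))
      if (state.compareAndSet(s, next)) true
      else tryAcquire(key) // another thread won the race; re-read and retry
    }
  }
}
```

The key property is that the predicate and the update refer to the same snapshot `s`; with the current mutable map, another thread can mutate state between the two.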
@aldiyen (Contributor, Author) commented May 9, 2019

Hm, you’re probably right. Unfortunately, state information is spread across both the current total connection count and the map of RequestKey to connection, which I think implies a more significant refactoring of PoolManager’s state management than even what you’ve just proposed.

@aldiyen (Contributor, Author) commented May 10, 2019

Seems like @rossabaker is going to resolve this by replacing PoolManager with a better-managed keyed pool, which seems good. Closing this PR in favor of that solution.

@aldiyen aldiyen closed this May 10, 2019

@jmcardon (Member) commented May 11, 2019

Thanks for the attempt @aldiyen! If it weren't for this PR we might not have caught the small synchronization issue.

@rossabaker (Member) commented May 13, 2019

I should note that the keypool is non-trivial. The current implementation only enforces limits on the idle connections, not overall. And retrofitting it is proving to be a struggle. If there is a quick win here, it's still welcome, even if I believe in the keypool in the long term.

@aldiyen (Contributor, Author) commented May 13, 2019

My best guess as to why it's never returning in some cases is that the connection count is being decremented before the callback is being added to the wait queue. Something like:

  1. The connection count is exactly at maxTotal
  2. The close-connection async operation is created
  3. The addToWaitQueue async operation is created
  4. The close-connection async operation thunks, but nothing is in the wait queue
  5. addToWaitQueue thunks, but nothing ever checks the wait queue again

This may not be an issue in situations with more activity going on in parallel, because I assume something will eventually check the wait queue, but under certain types of utilization I imagine it would block real applications indefinitely.

I think the change I've made here will actually alleviate this problem for that use case, as it fixes the order of operations. That said, it seems likely that some very specific parallelized utilization of the pool could still lock it up like this if things line up just right, even with that change. The only guaranteed non-halting solution that comes to mind offhand is to make the code in borrow for the None if curTotal == maxTotal case create a new connection without considering whether it would exceed maxTotal. This would let the pool exceed its configured limits, but given the lack of synchronization on the state, that is likely possible as-is anyway.
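The lost-wakeup sequence described in the numbered steps above can be sketched in plain Scala (hypothetical names, not the real pool code): the release drains the wait queue before the waiter is enqueued, so the callback is never completed and nothing ever drains the queue again:

```scala
import scala.collection.mutable

object LostWakeup {
  val waitQueue = mutable.Queue.empty[String]
  var completed = List.empty[String]

  // Release analogue: completes everything currently waiting
  def releaseConnection(): Unit =
    while (waitQueue.nonEmpty) completed ::= waitQueue.dequeue()

  // Returns (completed callbacks, callbacks stuck in the queue)
  def run(): (List[String], List[String]) = {
    releaseConnection()           // step 4: drains an *empty* queue
    waitQueue.enqueue("callback") // step 5: enqueued after the drain
    // nothing checks the wait queue again: "callback" hangs forever
    (completed, waitQueue.toList)
  }
}
```

With the order swapped (enqueue first, then release), the callback would be completed; the bug is purely one of operation ordering on unsynchronized state.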

@jmcardon (Member) commented May 13, 2019

See, the thing is I don't believe it fixes any order of operations, because there's no specific guarantee that F.delay does anything other than, by thunking, create a literal delay of hopping through the interpreter.

Now, checking a mutable data structure is actually an effect, so it's not as if F.delay(checkMutableThing) is wrong to do, but there's no compiler optimization applied to this snippet where that matters (it would matter if a compiler optimization reordered pure statements, in the form of common subexpression elimination).

> I think the change I've made here will actually alleviate this problem for that use case, as it fixes the order of operations

Every call to createConnection follows a thunk, connected by *>. F.delay doesn't guarantee any such behavior will be fixed, at all. Like I said before, no operation order is fixed by this, since there are absolutely no atomicity guarantees between the check and the update. In certain cases the queue may be updated, and in others it likely won't be; it depends entirely on operation ordering between threads and how the JVM chooses to update the variable. We have no way to enforce such guarantees in the current code.

> This may not be an issue in situations with more activity going on in parallel, because I assume something will eventually check the wait queue, but under certain types of utilization I imagine it would block real applications indefinitely.

This is likely an issue no matter the type of use. We simply cannot make any guarantees about the state of the queue when we're very clearly using a mutable, non-thread-safe data structure that is updated from different threads. Without any atomicity or synchronization guarantees, we have absolutely no guarantee of what the state of the waitQueue will be at any point in time.

@aldiyen (Contributor, Author) commented May 14, 2019

Hm, interesting. Forgive my ignorance; my real-world experience with Cats is fairly limited, so I have probably been operating under some faulty assumptions.

Are you saying that in code like decrConnection(key) *> createConnection(k, cb) the decrConnection(key) operation could be evaluated prior to the createConnection(k, cb) operation? Or simply that these two operations, while ordered relative to one another, are not guaranteed to execute anywhere near each other in time because the underlying threadpool may be doing many other things in between these two operations?

@jmcardon (Member) commented May 14, 2019

> Are you saying that in code like decrConnection(key) *> createConnection(k, cb) the decrConnection(key) operation could be evaluated prior to the createConnection(k, cb) operation?

It for sure will be evaluated before createConnection.

What I'm getting at is that in this snippet:

```scala
incrConnection(key) *> F.start {
  Async.shift(executionContext) *> builder(key).attempt.flatMap {
    case Right(conn) =>
      F.delay(callback(Right(NextConnection(conn, fresh = true))))
    case Left(error) =>
      disposeConnection(key, None) *> F.delay(callback(Left(error)))
  }
}.void
```

The shift is what causes a potentially incorrect view of total allocated connections, because the operation following that shift may not see the state of allocated that it should. It's hard to say how operation reordering happens without explicitly printing things and observing the state across threads.

numConnectionsCheckHolds is:

```scala
private def numConnectionsCheckHolds(key: RequestKey): Boolean =
  curTotal < maxTotal && allocated.getOrElse(key, 0) < maxConnectionsPerRequestKey(key)
```

^ allocated is not being updated at all; no state is being modified. Thunking this operation doesn't guarantee anything about how the following shifted operation will view allocated. There's no significant semantic difference between if (numConnectionsCheckHolds(key)) and F.ifM(F.delay(...)) that gives us a nice ordering guarantee; ultimately all you're doing is making this check hop through an additional thunk, and the happens-before guarantees are the same.
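To make the first half of this concrete: within a single run, a *>-style combinator does guarantee left-before-right evaluation; what it cannot give is cross-thread atomicity. A minimal hand-rolled thunk type (a sketch, not cats-effect) demonstrates the ordering guarantee:

```scala
// A tiny thunked-effect type; `*>` evaluates the left side, discards its
// result, then evaluates the right side, exactly in that order.
final case class Thunked[A](unsafeRun: () => A) {
  def *>[B](that: Thunked[B]): Thunked[B] =
    Thunked(() => { unsafeRun(); that.unsafeRun() })
}

object SequencingDemo {
  // Records the order in which the two "effects" actually execute
  def run(): List[String] = {
    val log = scala.collection.mutable.Buffer.empty[String]
    val decr = Thunked(() => log += "decrConnection")
    val create = Thunked(() => log += "createConnection")
    (decr *> create).unsafeRun()
    log.toList
  }
}
```

decrConnection is always logged before createConnection here; the bug in the pool is not this sequencing, but that a second thread can mutate the shared state between the two steps.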

@aldiyen (Contributor, Author) commented May 14, 2019

I see, okay, that is what I thought. Where in the shifted code does it check allocated, though? I'm fairly certain that the issue described in #2546 and reproduced in the tests added in this PR occurs in the operations preceding the incrConnection(key) *> F.start { ... } portion, which is also the only place I see the check against allocated being made (inside numConnectionsCheckHolds).

I do see that it could choose the "wrong" path after thunking F.delay(numConnectionsCheckHolds(key)), in the sense that it could lead to a situation with more connections than maxTotal should allow, but that seems preferable to a situation where borrow returns an effect that might never be evaluated.

@rossabaker rossabaker reopened this May 15, 2019

@rossabaker (Member) commented May 15, 2019

The more ambitious fix is taking too long, so I'm reconsidering this.

@jmcardon (Member) commented May 15, 2019

I'm alright with approving this; I just don't think it will fix anything, @rossabaker.

I think it's pretty clear the issue is with the internals of the objects.

@rossabaker (Member) left a comment

To show that it fixes the problem, I added a unit test for the scenario reported in #2546. It fails on series/0.20.x, and passes on top of @aldiyen's commit.

I agree with the overall commentary on the pool, but this is low risk and a demonstrable improvement.

@rossabaker rossabaker changed the base branch from master to series/0.20 May 15, 2019

@jmcardon (Member) left a comment

I'll give it my 👍 .

I'm not convinced this fixes the issue, and I do think it might be partially coincidental that it's working this way (i.e., the extra interpreter hop adds a delay that fixes the synchronization issue here, but won't fix it in all situations), but I do appreciate this PR a lot, since it actually found the root cause and is a fairly quick bandaid.

Thank you @aldiyen!

@rossabaker rossabaker merged commit 1f9dd2d into http4s:series/0.20 May 15, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)