x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely #33425
Comments
/cc @bradfitz @tombergan |
Any other information I can provide to help speed up any investigation? I've seen some outages in production caused by this issue, and want to help get a fix out. |
Anything we can help contribute to help with this? Since a single bad backend can deadlock the whole transport, the impact if quite large for us. |
Is this a dup of #32388? |
It looks similar to #32388, but I don't see any mention of the I took the in-review change and applied it to my repro, and it didn't seem to help. Presumably, there's more changes required.
The deadlock is caused by:
And getting the
So the blocked |
The mentioned change doesn't fix anything because its incomplete. All of these problems are all the same with no good solution because eventually the write lock will become the issue that backs up to the mu given the relationship that exists between them. |
Yeah, I dug through the code and was unable to find any easy way to fix the issue. I think the main issue is that In this scenario, it'd be nice to know if the lock was held since we can skip any connection with the lock held (probably safe to assume it's not usable right now). To prototype this idea, I used a huge hack that lets me check whether a lock is held, and return a
With this change, my repro above passes everytime. A less hacky of my approach can be done using a channel with size 1 (blocking write to |
@prashantv While your solution fixes idleState it breaks everything else that relies on the read mutex. |
Can you expand on what's broken by this? The The only thing that's affected is calls to In fact, I've run the HTTP2 tests multiple times with my change without issues:
Pushed a branch with my change if you want to test it: My change is still a hack, so it doesn't fully solve the problem (it's still possible that |
Here is a simple issue that you have: |
My mistake, I meant to set locked = 1 while holding the lock. Once that's flipped, apart from not fixing the issue 100%, are there other things that are broken? You mentioned "everything that relies on read mutex", not exactly sure what you meant by "read mutex". I want to understand if the problem being solved (Making This uses a channel to implement a lock that supports a |
And eventually the channel can be blocked. You can't just short circuit idleStateLocked, it actually computes the answer. Returning an incorrect value will break other guarantees. |
I assume the guarantee you mean is that we'll end up with more connections than required, since my Looking at the change linked to the other issue, I'm not sure if there's a clear path to fixing the underlying locking issues. I can try to investigate that approach, but in the meantime, we're stuck with deadlocks. Is there a short-term workaround? E.g., write deadlines? A workaround (even if it triggers more connections than necessary) would be preferable to a deadlock in production. |
The only solution I can think of right now is to build your own client pool which tracks outstanding requests per client. Don't share the transport. |
Alternatively, how do we like the idea of per-backend/address lock, instead of a shared lock for the entire client connection pool? A proof-of-concept: https://github.com/jaricftw/go-http2-stuck-transport/compare/client-conn-pool?expand=1, which ensures a single stuck backend won't block requests to other backends. |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Used the HTTP/2 library to make calls as a client. One of the backends stalled and stopped reading from the TCP connection (although it was still running and the TCP connection was active).
This caused the TCP write buffer on the client to fill up.
What did you expect to see?
I expected that all future calls to the blocked server would timeout eventually, and that calls to any other server would not be affected.
What did you see instead?
All calls on the transport are blocked forever, including calls to working backends. Stack traces should the calls are blocked on a mutex, so any timeouts are ignored.
Repro
Repro: https://github.com/prashantv/go-http2-stuck-repro/
The repro sets up 2 "backends":
stuckURL
: Accepts a TCP connection and does not read from the connection, to simulate a stuck backendnormal1URL
: Echoes the request body back to the caller.stuckURL
with a payload that never completes. This causes the TCP write buffer to fill up, and theWrite
to block, while holding the*ClientConn.wmu
(write lock on the connection). We sleep a second to make sure the TCP buffer fills up, and the write is blocked.Stack trace
stuckURL
is made, which is blocked inroundTrip
trying to get the*ClientConn.wmu
(since (1) holds this lock), and with the*ClientConn.mu
lock held.Stack trace
stuckURL
grabs the connection pool lock,*clientConnPool.mu
, then iterates over all connections to that address to check theidleState()
which tries to grab*ClientConn.mu
. This blocks since (2) is already holding this lock.Stack trace
*clientConnPool
(e.g., any call through the same client/transport) block on trying to get*clientConnPool.mu
.Stack trace
I've reduced the size of TCP read/write buffers to trigger the issue more quickly.
The repro includs a flag for controlling steps 2/3. Setting the
--num-stuck-calls
to 2 or higher will reproduce the issue (the echo calls will fail), while 0 or 1 will cause the test to pass.The text was updated successfully, but these errors were encountered: