-
Notifications
You must be signed in to change notification settings - Fork 18k
net/http: tls handshake panic with custom dialer #58469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
What is |
Unfortunately no, it is non-trivial for me to provide transportDialFunc, which is a custom dialer that orchestrates multiple tcp connections between various nodes. I theorized that this may be due to the dialer taking longer than the TLS handshake timeout, therefore it may be possible to create a failing test case by doing a simple dial after a time.Sleep() longer than the TLS handshake timeout. Irrespective of the contents of transportDialFunc, it seems prudent to add the |
The relevant part of http code is below. Lines 1621 to 1634 in e03ee85
|
To add to what @seankhliao said, the only way to construct a Hence this seems like a bug in your |
TL;DR Observations: (1) The only reference I can find to (2) I might argue it would be safer to check for
(3) That dial() implementation is only ever called from In addTLS() the pconn.conn is used as the conn member in tlsConn. Subsequently, that's the proximal cause of the nil dereference in Since I can't find anywhere that explicitly sets tlsConn.conn after the initial construction, the nil (or non-nil bad) value must have come from dial(). (4) Note that this panic would require that the handshakeCtx is Done() possibly due to the incoming parent context being Done(). Note: The comments in Anyway, the parent context to handshakeCtx is the "top level"
Given these values either the handshakeCtx or parent context were canceled during the handshake in less than 5 seconds, or the system was so busy that it scheduled the handshakeCtx's goroutine prior to the (5) Anyway, my dialer function looks like this. I guess there is a race condition in this code, right? Specifically, if the wg.Wait() wakes up transportDialFunc() first, and err is set to nil in the goroutine second, err might be returned as nil irrespective of ret's value.
|
Where exactly is The crux of my question is, why are you calling |
I'll answer you below, but I think the more relevant item for this issue is whether dial() should check the (nil,nil) case for a custom DialContext()? I would say "yes" The code in 5 above is simplified, hiding implementation details of the app that are not immediately relevant here. SetupStream() runs in a goroutine for all the usual reasons such as: Needing to run in a context where no locks are held, needing to perform tasks that could execute in parallel with those that run after readiness signalled by wg.Done(), etc. etc. As you gleaned, that callback in fact does happen within SetupStream ... but really it could happen any time. The point is we don't have any guarantees as to when PS: I recognize the logical issue here, of relying both on the synchronous return value of the function, and waiting on the waitgroup. Both are necessary, so I'll be modiying the code accordingly. |
My intent was to identity the root cause of the issue. The lack of a With regards to if err := SetupStream(&foo); err != nil {
return nil, err
}
if foo.StreamConn == nil {
return nil, errors.New(...)
}
return foo.StreamConn, nil |
I agree with you. I said what you said, differently.
I think you're making the wrong assessment about the nil check because we are no longer talking about the same nil check. The nil check I'm talking about now is not the same one I suggested when I opened the issue. Instead it's the one in 2 above. It would prevent this panic, and instead (correctly) behave as if the dialer failed to provide a net.Conn. Explicitly, I'm talking about doing:
instead of:
|
Right, that's what I was referring to. You'd get an error back that says |
transportDialFunc is not in fact "totally" broken, because the nil net.Conn is the significant return value -- that said, as we've discussed, it may return (nil,nil) instead of (nil,!nil), but if it returns a nil net.Conn, it's a failure to dial, and the (nil,nil) should (as I think we agree) not cause a panic. I'll send in a PR. Just for completeness, another user seems to concur about this bug/improvement: https://groups.google.com/g/golang-nuts/c/qCn3BDUq74U |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
yes, 1.20.0
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
What did you expect to see?
No panics from inside golang stdlib.
What did you see instead?
Proposed fix:
In crypto/tls/conn.go around line 1441:
I believe the problem may be that this should read:
I have not deeply understood the net.Conn or tls dialer internals, but, I believe the issue here stems from the possibility that the tls handshake timer may expire before the dialer is actually finished dialing. Whether this implies that there should also be a deeper/upstream fix is beyond me.
I've prepared all of my credentials so that I can submit a related PR to googlesource.com & gerrit etc. once I get some feedback.
I would appreciate any tips on how to work around this issue in the meantime. I'm not sure how to replicate or override a core golang package to patch in this fix??
The text was updated successfully, but these errors were encountered: