Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConnectStart/ConnectDone() called multiple times, causing panic #266

Closed
lizthegrey opened this issue Oct 31, 2019 · 4 comments · Fixed by #283
Closed

ConnectStart/ConnectDone() called multiple times, causing panic #266

lizthegrey opened this issue Oct 31, 2019 · 4 comments · Fixed by #283

Comments

@lizthegrey
Copy link
Member

Stack trace from @austinlparker:

11:19 AM
11:19 AM
goroutine 592 [running]:
11:19 AM
go.opentelemetry.io/plugin/httptrace.(*clientTracer).close(0xc00055ec00, 0x828607, 0xc)
11:19 AM
	/app/pkg/mod/go.opentelemetry.io@v0.0.0-20191023164856-8cb378678945/plugin/httptrace/clienttrace.go:79 +0x21f
11:19 AM
go.opentelemetry.io/plugin/httptrace.(*clientTracer).connectDone(...)
11:19 AM
	/app/pkg/mod/go.opentelemetry.io@v0.0.0-20191023164856-8cb378678945/plugin/httptrace/clienttrace.go:125
11:19 AM
net.(*sysDialer).dialSingle.func1(0xc000493cc0, 0xc0004d0f80, 0xc000512bc0, 0xe, 0xc0001b9e00)
11:19 AM
	/tmp/go/src/net/dial.go:571 +0x6b
11:19 AM
net.(*sysDialer).dialSingle(0xc0004d0f80, 0x8b1900, 0xc000483f00, 0x8ad2e0, 0xc000485e30, 0x8b4140, 0xc00031e008, 0x0, 0x0)
11:19 AM
	/tmp/go/src/net/dial.go:594 +0x2fa
11:19 AM
net.(*sysDialer).dialSerial(0xc0004d0f80, 0x8b1900, 0xc000483f00, 0xc000529770, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
11:19 AM
	/tmp/go/src/net/dial.go:546 +0x221
11:19 AM
net.(*sysDialer).dialParallel.func1(0x8b1900, 0xc000483f00, 0xc000529701)
11:19 AM
	/tmp/go/src/net/dial.go:467 +0x95
11:19 AM
created by net.(*sysDialer).dialParallel
11:19 AM
	/tmp/go/src/net/dial.go:482 +0x25a
11:19 AM
exit status 2

My analysis:

Okay, so this happens in https://github.com/open-telemetry/opentelemetry-go/blob/master/plugin/httptrace/clienttrace.go#L80

this happens when the close() call happens _before_ the open() has happened. because concurrency and goroutines and delayed reporting of HTTP events :/

One solution is to just... not panic and drop the span. I prefer that one. but the other is to actually block and wait for the open span to happen before the close span call is run, if it isn't in the map. *cry*

this used to be worse, it would close the last span reported, rather than closing the correct span. so if run in the wrong order, you'd see open-x, open-y, close-y (meant to close-x), close-??? (uh oh) happening. because of the ordering not being open-close-open-close.

we now deal correctly with open-open-close-open-close-close, but not with close-open.
@lizthegrey
Copy link
Member Author

#273 (comment)

Further diagnosis from myself and @paivagustavo: this is due to the IPv6/IPv4 racing behavior of golang resulting in multiple connect() calls being made, one per destination address.

@lizthegrey lizthegrey changed the title httpclient tracing fails at high concurrency/contention. ConnectStart/ConnectDone() called multiple times, causing panic Nov 3, 2019
@lizthegrey
Copy link
Member Author

Verified this issue goes away when I explicitly connect to 127.0.0.1 and not to localhost which resolves both to 127.0.0.1 and ::1 causing the fallback after 300ms delay. This now no longer is a blocker bug for me delivering the OTel workshop on Tuesday :)

To trip this bug, the remote host needs to fail to respond after 300ms on IPv4 and needs to also have an IPv6 address for the hostname. So this is indeed an edge case, but still worth fixing.

@paivagustavo
Copy link
Member

As @lizthegrey said, the clientTracer.connectStart(network, addr string) can be called multiples times for a list of different address. I thought of two ways to solve this:

The first and easier way is to add the addr to the name of the span. The downside here is that the span name would have high cardinality, the result should be something similar to:
image

The second way is to change how we track the state of these spans on the clientTracer and add the addr as an Attribute of the span.

Particularly, I like the second approach most.

@lizthegrey
Copy link
Member Author

I agree that the second approach feels more correct to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants