
x/tools/gopls: deadlock due to telemetry worker #33692

Closed
muirdm opened this issue Aug 16, 2019 · 9 comments
Comments

@muirdm muirdm commented Aug 16, 2019

My gopls on master (caa95bb40b630f80d344d1f710f7e39be971d3e8) keeps locking up. The debug interface doesn't even respond. I attached via dlv and I saw goroutines were blocked here:

/Users/muir/projects/tools/internal/telemetry/export/worker.go:38 golang.org/x/tools/internal/telemetry/export.Do

Furthermore, one of the goroutines was the telemetry worker goroutine itself. It was invoking a task that published to the workQueue channel (which must have been full, causing a deadlock).

(dlv) bt
0  0x0000000001030baf in runtime.gopark
   at /usr/local/go/src/runtime/proc.go:302
1  0x0000000001006c6b in runtime.goparkunlock
   at /usr/local/go/src/runtime/proc.go:307
2  0x0000000001006c6b in runtime.chansend
   at /usr/local/go/src/runtime/chan.go:236
3  0x0000000001006a25 in runtime.chansend1
   at /usr/local/go/src/runtime/chan.go:127
4  0x0000000001314b62 in golang.org/x/tools/internal/telemetry/export.Do
   at /Users/muir/projects/tools/internal/telemetry/export/worker.go:38
5  0x000000000131421f in golang.org/x/tools/internal/telemetry/export.Metric
   at /Users/muir/projects/tools/internal/telemetry/export/export.go:91
6  0x000000000139dc03 in golang.org/x/tools/internal/telemetry/metric.(*Int64Data).modify.func1
   at /Users/muir/projects/tools/internal/telemetry/metric/metric.go:231
7  0x0000000001315527 in golang.org/x/tools/internal/telemetry/export.init.0.func1
   at /Users/muir/projects/tools/internal/telemetry/export/worker.go:21
8  0x000000000105d7e1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:1337
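
For illustration, here is a minimal, self-contained sketch (an assumed shape, not the actual worker.go code) of the failure mode: a single worker goroutine drains a bounded queue, and a task running on that worker enqueues more work onto the same queue. Once the buffer is full, the send blocks forever, because the only goroutine that could drain the queue is the one doing the send:

```go
package main

import "fmt"

func main() {
	// Single consumer of a small bounded queue, standing in for the
	// telemetry worker. Illustration only, not the real worker.go.
	workQueue := make(chan func(), 1)

	go func() {
		for task := range workQueue {
			task()
		}
	}()

	// A task that re-enqueues work from inside the worker, mimicking a
	// metric update that calls export.Do while already on the worker goroutine.
	var task func()
	task = func() {
		fmt.Println("task running on the worker, enqueueing follow-up work")
		workQueue <- task // fits in the buffer
		workQueue <- task // buffer full and the only consumer is busy here: deadlock
	}

	workQueue <- task
	select {} // park main; the runtime reports "all goroutines are asleep - deadlock!"
}
```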

/cc @ianthehat

@gopherbot gopherbot added this to the Unreleased milestone Aug 16, 2019
@gopherbot gopherbot added the gopls label Aug 16, 2019
@gopherbot gopherbot commented Aug 16, 2019

Thank you for filing a gopls issue! Please take a look at the Troubleshooting section of the gopls Wiki page, and make sure that you have provided all of the relevant information here.

@odeke-em odeke-em commented Aug 17, 2019

Thank you for reporting this issue @muirdm! I shall send a fix for this immediately.

@gopherbot gopherbot commented Aug 17, 2019

Change https://golang.org/cl/190637 mentions this issue: internal/telemetry/export: add load-shedding for workQueue

@muirdm muirdm commented Aug 17, 2019

Thanks for looking at this. It looks like your change would avoid the deadlock, but it would introduce substantial latency under heavy contention. There is also guaranteed loss of telemetry data whenever it hits the current deadlock case.

It seems to me that the telemetry worker needs to be rearchitected in a more fundamental way. I'm sure @ianthehat already has great ideas, but I'm going to say my idea anyway: we could accumulate all telemetry messages for a single request in memory, and then kick off a goroutine to send them after the request completes. This way the telemetry messages for a given request are still sent in order, but they are always handled asynchronously with respect to LSP requests. If there is some requirement that the telemetry messages be processed "live" then this obviously wouldn't work.
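
A rough sketch of what that could look like, with made-up names rather than anything from the gopls codebase: accumulate events per request in memory, then flush them on a separate goroutine so the LSP handler never blocks on telemetry.

```go
// Hypothetical per-request batching sketch; names are illustrative only.
package telemetrysketch

import "sync"

type Event struct {
	Name string
	Args map[string]interface{}
}

type RequestRecorder struct {
	mu     sync.Mutex
	events []Event
}

// Record appends an event; called synchronously while handling the request.
func (r *RequestRecorder) Record(ev Event) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events = append(r.events, ev)
}

// Flush hands the accumulated events to the exporter on a separate
// goroutine once the request completes, so the LSP response is never
// blocked on telemetry and ordering within the request is preserved.
func (r *RequestRecorder) Flush(export func([]Event)) {
	r.mu.Lock()
	batch := r.events
	r.events = nil
	r.mu.Unlock()
	go export(batch)
}
```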

@odeke-em odeke-em commented Aug 17, 2019

> Thanks for looking at this. It looks like your change would avoid the deadlock, but it would introduce substantial latency under heavy contention. There is also guaranteed loss of telemetry data whenever it hits the current deadlock case.

Right, it implements load shedding as a best effort and drops overflowing events, since the queue is a circular buffer of sorts, but it guarantees there are no deadlocks.
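
As a rough illustration of the load-shedding idea (not the actual code in CL 190637), the enqueue path can use a non-blocking send and simply drop the event when the queue is full:

```go
package exportsketch

// enqueue sketches load shedding on a bounded queue: if the buffer is full,
// drop the event instead of blocking the caller, trading lost telemetry for
// a guarantee that no goroutine ever waits on the queue.
func enqueue(workQueue chan<- func(), task func()) (accepted bool) {
	select {
	case workQueue <- task:
		return true
	default:
		return false // queue full: shed the event rather than risk a deadlock
	}
}
```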

> and then kick off a goroutine to send them after the request completes. This way the telemetry messages for a given request are still sent in order, but they are always handled asynchronously with respect to LSP requests. If there is some requirement that the telemetry messages be processed "live" then this obviously wouldn't work.

Events are processed online, but the suggested batching would mean that the global ordering of events is lost. Also, when metrics are collected, we would now have to include timestamps too.

Anyway, good thoughts and concerns @muirdm! We can discuss more as we work on this issue.

@gopherbot gopherbot commented Aug 18, 2019

Change https://golang.org/cl/190737 mentions this issue: internal/telemetry: change concurrency model

@ianthehat ianthehat commented Aug 18, 2019

I think the load shedding may cause other significant issues: incorrect counts for rare events would be very bad, but much worse are memory leaks caused by unmatched start/finish span pairs, for instance.
The concurrency model was always temporary; it was the quickest thing to implement that would be adequate while I got the API right.
I have put up a CL that might take us a bit further and fixes the deadlock, but the final design needs more thought.
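
For reference, a hedged sketch of what a mutex-serialized exporter could look like (illustrative names only, not the CL's actual code): each caller runs the export synchronously under a lock, so events stay globally ordered and there is no bounded channel left to fill up and deadlock on.

```go
package exportsketch

import "sync"

var (
	mu       sync.Mutex
	exporter func(event interface{})
)

// Do runs the export on the caller's goroutine, serialized by the mutex,
// instead of handing the event to a worker via a queue.
func Do(event interface{}) {
	mu.Lock()
	defer mu.Unlock()
	if exporter != nil {
		exporter(event)
	}
}
```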

@odeke-em odeke-em commented Aug 19, 2019

Thank you for the discourse @muirdm and @ianthehat! Sure, for now we can use the mutex approach Ian has posted up.
