x/telemetry/config: add gopls/*/latency histograms #63129
Labels: gopls, telemetry, Telemetry-Proposal
Counter names
gopls/completion/latency:{<10ms, <50ms, <100ms, <200ms, <500ms, <1s, <5s, >=5s}
gopls/definition/latency:{<10ms, <50ms, <100ms, <200ms, <500ms, <1s, <5s, >=5s}
gopls/hover/latency:{<10ms, <50ms, <100ms, <200ms, <500ms, <1s, <5s, >=5s}
gopls/implementations/latency:{<10ms, <50ms, <100ms, <200ms, <500ms, <1s, <5s, >=5s}
gopls/references/latency:{<10ms, <50ms, <100ms, <200ms, <500ms, <1s, <5s, >=5s}
gopls/symbol/latency:{<10ms, <50ms, <100ms, <200ms, <500ms, <1s, <5s, >=5s}
Description
These counters measure observed server-side latency for various latency-sensitive LSP operations. For pragmatic reasons, and to reduce variability, timing starts when gopls actually begins handling the request; it does not include time spent in the jsonrpc2 queue.
Bucketing is approximately exponential, though adjusted to capture meaningful boundaries. For example, the default completion budget is 100ms, and we typically think of anything faster than 200ms as OK. 500ms, 1s, and 5s are various landmarks of slowness.
More precisely, the <10ms bucket captures latency in the range [0, 10ms), and each subsequent bucket <V captures the range [P, V), where P is the previous bucket's endpoint and V is the current bucket's endpoint. The >=5s bucket captures everything else. Is there a different convention I should be following for bucket naming?
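To make the boundaries concrete, here is a rough sketch of that bucketing (illustrative only, not the actual gopls code; the helper name latencyBucket is an assumption):

```go
package sketch

import "time"

// latencyBucket maps a duration to its bucket suffix: each duration falls
// into the first bucket whose upper bound exceeds it, and anything at or
// above 5s falls into ">=5s".
func latencyBucket(d time.Duration) string {
	switch {
	case d < 10*time.Millisecond:
		return "<10ms"
	case d < 50*time.Millisecond:
		return "<50ms"
	case d < 100*time.Millisecond:
		return "<100ms"
	case d < 200*time.Millisecond:
		return "<200ms"
	case d < 500*time.Millisecond:
		return "<500ms"
	case d < time.Second:
		return "<1s"
	case d < 5*time.Second:
		return "<5s"
	default:
		return ">=5s"
	}
}
```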
The set of operations in this initial batch of instrumentation was chosen, based on our experience, as the most critical latency-sensitive operations we support. (CC @adonovan for his opinion on this.)
Rationale
The latency of these operations greatly affects the experience of using gopls. With the large variety of editing environments that gopls supports, it is impossible to adequately benchmark these operations in every scenario. Furthermore, users tend not to file issues for incremental performance regressions ("boiling the frog"), and we've seen cases where a significant regression went unnoticed by the team and largely unreported for months (for example #62665).
By capturing these latency distributions, we can form a picture of the typical user experience, identify the frequency of outliers / atypical experiences, catch performance regressions following releases, and better prioritize our performance work.
Do the counters carry sensitive user information?
No.
How?
We'll instrument timing around these operations in the relevant gopls request handlers.
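For example, a wrapper along these lines could record the measurement (a minimal sketch using golang.org/x/telemetry/counter and the latencyBucket helper sketched above; the instrument wrapper and the shape of the handler plumbing are illustrative assumptions, not the actual gopls code):

```go
import (
	"fmt"
	"time"

	"golang.org/x/telemetry/counter"
)

// instrument times fn and increments the bucketed latency counter for the
// given operation, e.g. "gopls/completion/latency:<100ms". Timing starts
// here, after the request has been dequeued, so jsonrpc2 queue time is
// excluded.
func instrument(operation string, fn func()) {
	start := time.Now()
	fn()
	elapsed := time.Since(start)
	counter.Inc(fmt.Sprintf("gopls/%s/latency:%s", operation, latencyBucket(elapsed)))
}
```

A handler would then wrap its work, e.g. instrument("completion", func() { /* handle textDocument/completion */ }).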
Proposed Graph Config
New or Update
New