x/tools/gopls: express completion scoring in terms of down-ranking factors #63282
Comments
I'd be curious to see how often the chosen completion candidate is a deep candidate. Do we have any data on that? It would also be good to know at what point in our search we typically find the winning candidate (e.g. do we normally find it during the first 10ms of deep searching?).
Stopping earlier for leaf objects makes sense, but for intermediate objects like structs it is hard to stop early since one of the struct fields could be the needle in the haystack (and the intermediate objects are what explode the search space). Whether it is score or something else, using heuristics to prune/limit the search space would certainly help a lot.
Hopefully most of the ad hoc score adjustments have
@muirdm thanks for the response. Responses inline:
No! Until very recently (i.e. the last two weeks) we had no telemetry in gopls that could measure these things across a larger sample size. However, we are adding opt-in telemetry in the next release, and have already instrumented completion latency (#62665 (comment)), which I believe would have immediately highlighted the regression of #62665. I think we can also instrument selected completion items (provided the instrumentation contains no information other than metadata such as "is deep"). I think we need to gather data from real usage, because synthetic usage (or even the Go team's usage) tends to be very skewed.
Indeed. We always include all first-level struct fields (that was the "fix" that led to the regression of #62665). However, you are right that we may want to continue searching children of the current candidate even if the candidate is excluded. We'd have to refactor to more cleanly separate searching from scoring.
Yes, there are a good number of @rank tests, though I think we should generate more before starting to tweak the scoring.
Indeed, this is on my list of things to do. I've been porting the completion tests from the old marker framework to the new marker framework, and was struck by how heavyweight the @item annotations are. In many cases, we only care about the existence and rank of candidates.
While investigating #62665, I noticed that there were lots of opportunities to optimize deep completion (see also #63263). However, one obstacle to optimization is the fact that completion candidate score adjustment occurs in multiple places, and is a combination of multiplication by numbers >1 and <1, explicit setting of score, and subtraction.
It's clear that a lot of thought and experimentation went into the current scoring, and it works well overall. However, the way scoring is expressed makes it hard to reason about and hard to refactor (because operations don't commute), and therefore hard to improve.
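To illustrate the non-commutativity, here is a small, purely hypothetical example (not actual gopls code) showing that mixing subtraction and multiplication makes the result depend on evaluation order:

```go
package main

import "fmt"

func main() {
	// The same three adjustments, applied in two different orders.
	a := 1.0
	a *= 1.5 // a boost (>1)
	a -= 0.3 // a penalty by subtraction
	a *= 0.5 // a penalty by multiplication (<1)

	b := 1.0
	b *= 0.5
	b -= 0.3
	b *= 1.5

	fmt.Println(a, b) // the results differ, so reordering adjustments changes rankings
}
```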
Additionally, it means that we can never determine that the score is already too low and stop doing work. For example, in one of our kubernetes benchmarks we spend approximately 80ms in type checking and "normal" completion combined, and then another 100-200ms doing deep completion. All of that extra work produces at most 3 additional candidates. It is likely that by doing cheaper scoring operations first, we could immediately invalidate most of the candidates without having to do more expensive operations such as fuzzy matching and type inference.
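As a sketch of the kind of early cutoff this would enable (the candidate type and the helpers cheapFactors, fuzzyMatchFactor, and inferredTypeFactor below are hypothetical stand-ins, not existing gopls API):

```go
// candidate is a hypothetical stand-in for a completion candidate.
type candidate struct {
	baseScore float64
	score     float64
}

// shouldKeep applies cheap down-ranking factors first and abandons the
// candidate as soon as its score falls below cutoff, so that expensive
// steps such as fuzzy matching and type inference only run for
// candidates that can still place in the results.
func shouldKeep(cand *candidate, cutoff float64) bool {
	s := cand.baseScore
	for _, f := range cheapFactors(cand) { // cheap checks, e.g. depth or deprecation
		s *= f
		if s < cutoff {
			return false // already too low; skip the expensive work
		}
	}
	s *= fuzzyMatchFactor(cand)   // expensive
	s *= inferredTypeFactor(cand) // expensive
	cand.score = s
	return s >= cutoff
}
```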
I therefore believe that a first step toward improving completions should be to enforce the following rule: scores only go down, by multiplicative factors. In other words:
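(The following is a sketch of that rule; the names are illustrative rather than actual gopls code.)

```go
// score applies only down-ranking factors, each in the range [0, 1],
// to a candidate's base score. Because every adjustment is a
// multiplication by a factor <= 1, the adjustments commute and the
// score can only decrease.
func score(base float64, factors ...float64) float64 {
	s := base
	for _, f := range factors { // invariant: 0 <= f <= 1
		s *= f
	}
	return s
}
```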
I think we can make this change in the existing codebase without changing anything else: each scoring operation needs to be recalibrated to express its outcome using a down-ranking factor.
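For example (a hypothetical adjustment with made-up numbers), a penalty that is currently applied by subtraction might be recalibrated into a factor with a comparable effect:

```go
// Before: a penalty by subtraction, whose effect depends on when it runs
// relative to the other adjustments.
func penalizeUnimportedOld(score float64) float64 { return score - 0.1 }

// After: the same intent expressed as a down-ranking factor in [0, 1],
// which commutes with every other factor.
func penalizeUnimportedNew(score float64) float64 { return score * 0.9 }
```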
Unfortunately, there is a lot of knowledge embedded in the current scoring, and therefore making this change naively is likely to result in inferior completions in the short term. @pjweinb has been looking at some large scale testing that may be useful for calibrating completion results, and may help us make this change without regression (or, perhaps, while simultaneously improving results).
CC @adonovan @muirdm @heschi, who have experience with the completion code. WDYT?