Fix server hang after "Running scheduled diagnostics refresh" #3628

andrewbranch merged 15 commits into main
Conversation
- Make RefreshDiagnostics fire-and-forget: send the request but don't wait for a response (the client always returns null). Any response from the client is silently ignored by the read loop.
- Add a 1s timeout to updateWatch for WatchFiles/UnwatchFiles client requests to prevent holding watchesMu indefinitely if the client is slow or unresponsive.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/9d82d127-ea83-455b-bd36-297c9fe864b7
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Pull request overview
Addresses an LSP server hang triggered by “Running scheduled diagnostics refresh” by removing an unnecessary client-response wait and bounding how long watch registration holds a critical mutex.
Changes:
- Make RefreshDiagnostics send workspace/diagnostic/refresh as fire-and-forget (no response channel / no wait).
- Add a 1s timeout for WatchFiles/UnwatchFiles client requests made while holding session.watchesMu.
- Introduce a sendClientRequestFireAndForget helper for client requests whose responses aren't needed (see the sketch below).
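The helper's exact shape isn't shown in this thread. As a rough, self-contained sketch of the fire-and-forget pattern, with all names (requestMessage, server, outgoing, pending) being illustrative stand-ins rather than the repository's actual API:

package main

import "fmt"

// Illustrative stand-ins for the server's JSON-RPC plumbing.
type requestMessage struct {
	ID     int
	Method string
	Params any
}

type server struct {
	nextID   int
	outgoing chan *requestMessage
	pending  map[int]chan any // response waiters, keyed by request ID
}

// sendClientRequestFireAndForget queues a request without registering a
// response waiter. The read loop routes responses via pending[id]; finding
// no entry, it simply drops the reply, so no goroutine ever blocks on it.
func (s *server) sendClientRequestFireAndForget(method string, params any) {
	s.nextID++
	// Deliberately no s.pending[s.nextID] entry.
	s.outgoing <- &requestMessage{ID: s.nextID, Method: method, Params: params}
}

func main() {
	s := &server{outgoing: make(chan *requestMessage, 1), pending: map[int]chan any{}}
	s.sendClientRequestFireAndForget("workspace/diagnostic/refresh", nil)
	fmt.Println((<-s.outgoing).Method) // workspace/diagnostic/refresh
}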
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/project/session.go | Adds a watch request timeout context for client watcher register/unregister calls to avoid indefinite mutex holds. |
| internal/lsp/server.go | Changes diagnostics refresh to fire-and-forget and adds a helper to send client requests without awaiting responses. |
@copilot fix review comments
…lock, rollback + retry on WatchFiles failure

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/b1cea66d-c309-4c26-97cc-f815ab98e5fe
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Addressed all review comments in
…apshot updates re-attempt naturally

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/02f6f16f-047c-455f-981d-b83603fc1df8
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/b572b8f8-1888-4b53-a86f-59468b89d099
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
All review comments have been addressed across
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
internal/project/session.go:1203
- pending watcher IDs are only cleared on successful registration in the newWatcher path. If a watcher times out (MarkPending) and is later removed (updateWatch called with newWatcher == nil), its pending entry is never cleared, so watchRegistry.pending can grow over the lifetime of a session. Consider calling ClearPending for oldWatcher.Watchers().WatcherID when removing/unwatching, or otherwise pruning pending entries when the corresponding watcher is no longer present.
if oldWatcher != nil {
w := oldWatcher.Watchers()
watchers := append(w.WorkspaceWatchers, w.OutsideWorkspaceWatchers...)
if len(watchers) > 0 {
var removedIDs []WatcherID
for _, watcher := range watchers {
if id, removed := session.watches.Release(watcher); removed {
removedIDs = append(removedIDs, id)
}
}
for _, id := range removedIDs {
if err := session.client.UnwatchFiles(watchCtx, id); err != nil {
errors = append(errors, err)
} else if logger != nil && newWatcher == nil {
logger.Log(fmt.Sprintf("Removed watch: %s", id))
}
}
}
}
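If that suggestion were adopted, the removal path might clear the marker roughly like this. This is a sketch against the names in the snippet above; the exact call site is a judgment call for the real code:

// Sketch: when tearing down a watcher rather than replacing it, also drop
// any stale pending marker so watchRegistry.pending can't grow unboundedly.
if oldWatcher != nil && newWatcher == nil {
	session.watches.ClearPending(oldWatcher.Watchers().WatcherID)
}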
// Retry config watchers whose IDs didn't change but whose previous registration failed.
for path, newEntry := range newSnapshot.ConfigFileRegistry.configs {
	if oldEntry, ok := oldSnapshot.ConfigFileRegistry.configs[path]; ok {
		if oldEntry.rootFilesWatch.ID() == newEntry.rootFilesWatch.ID() && s.watches.IsPending(newEntry.rootFilesWatch.ID()) {
			errors = append(errors, updateWatch(ctx, s, s.logger, nil, newEntry.rootFilesWatch)...)
		}
The pending retry path calls updateWatch(ctx, s, ..., nil, newWatcher). If the original updateWatch call partially succeeded (some WatchFiles calls returned nil before a later one timed out), the registry will still contain refcounts for the successful watchers. Retrying with oldWatcher==nil will Acquire again and permanently increment those refcounts (no matching Release), which can prevent future UnwatchFiles from ever firing and leak client registrations. To keep refcounts consistent, consider making the failure path roll back the entire parent watcher’s bookkeeping when any error occurs (not just the failed watcher), or implement a retry helper that registers only missing keys without incrementing refcounts for already-tracked entries.
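One way to realize the roll-back-everything suggestion, sketched against the names used in this PR's diffs. It is simplified: it omits unregistering client watchers that succeeded before the failure:

// Sketch: remember every Acquire in this batch; on any WatchFiles error,
// Release them all and mark the parent watcher pending so the next
// snapshot update retries the whole registration from a clean slate.
var acquired []*lsproto.FileSystemWatcher
var newWatchers collections.OrderedMap[WatcherID, *lsproto.FileSystemWatcher]
for i, watcher := range watchers {
	globId := WatcherID(fmt.Sprintf("%s.%d", w.WatcherID, i))
	acquired = append(acquired, watcher)
	if session.watches.Acquire(watcher, globId) {
		newWatchers.Set(globId, watcher)
	}
}
for id, watcher := range newWatchers.Entries() {
	if err := session.client.WatchFiles(watchCtx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
		for _, a := range acquired {
			session.watches.Release(a) // undo every increment, not just one
		}
		session.watches.MarkPending(w.WatcherID)
		errors = append(errors, err)
		break
	}
}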
// watchRegistry tracks the current watch globs and how many individual
// WatchedFiles reference each glob. It guards concurrent access with a mutex
// and provides ref-count helpers so callers don't manipulate the map directly.
// It also tracks which parent watcher IDs have pending (failed) registrations
// so that updateWatches can retry them even when the watcher identity hasn't changed.
type watchRegistry struct {
	mu      sync.Mutex
	entries map[fileSystemWatcherKey]*fileSystemWatcherValue
	pending map[WatcherID]struct{}
}

func newWatchRegistry() *watchRegistry {
	return &watchRegistry{
		entries: make(map[fileSystemWatcherKey]*fileSystemWatcherValue),
		pending: make(map[WatcherID]struct{}),
	}
}

// Acquire increments the ref count for a watcher. If this is the first
// reference (count goes from 0 to 1), it returns true so the caller knows
// to register the watcher with the client.
func (r *watchRegistry) Acquire(watcher *lsproto.FileSystemWatcher, id WatcherID) (isNew bool) {
	key := toFileSystemWatcherKey(watcher)
	value := r.entries[key]
	if value == nil {
		value = &fileSystemWatcherValue{id: id}
		r.entries[key] = value
	}
	value.count++
	return value.count == 1
}
watchRegistry is documented as guarding concurrent access with its mutex, but Acquire/Release/MarkPending/ClearPending do not lock r.mu (only IsPending does). Since session.go now locks session.watches.mu directly, the locking discipline is split across call sites, which makes it easy to accidentally call these methods without holding the mutex (data race) and undermines the stated encapsulation. Consider either (1) moving the locking inside watchRegistry methods (and removing external r.mu access), or (2) updating the type/method docs and API to make the required locking explicit (e.g., AcquireLocked/ReleaseLocked or a WithLock helper) so future callers can’t misuse it.
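Option (1) might look roughly like this, moving the lock into each method so call sites never touch r.mu. A sketch based on the snippet above:

// Sketch: lock inside every method so the registry fully owns its mutex.
func (r *watchRegistry) Acquire(watcher *lsproto.FileSystemWatcher, id WatcherID) (isNew bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	key := toFileSystemWatcherKey(watcher)
	value := r.entries[key]
	if value == nil {
		value = &fileSystemWatcherValue{id: id}
		r.entries[key] = value
	}
	value.count++
	return value.count == 1
}

func (r *watchRegistry) MarkPending(id WatcherID) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pending[id] = struct{}{}
}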
  session.watches.mu.Lock()
  defer session.watches.mu.Unlock()
  // Use a timeout for client requests to prevent holding the mutex indefinitely
  // if the client is slow or unresponsive.
  watchCtx, watchCancel := context.WithTimeout(ctx, watchRequestTimeout)
  defer watchCancel()
  if newWatcher != nil {
  	w := newWatcher.Watchers()
  	watchers := append(w.WorkspaceWatchers, w.OutsideWorkspaceWatchers...)
  	if len(watchers) > 0 {
  		var newWatchers collections.OrderedMap[WatcherID, *lsproto.FileSystemWatcher]
  		for i, watcher := range watchers {
- 			key := toFileSystemWatcherKey(watcher)
- 			value := session.watches[key]
  			globId := WatcherID(fmt.Sprintf("%s.%d", w.WatcherID, i))
- 			if value == nil {
- 				value = &fileSystemWatcherValue{id: globId}
- 				session.watches[key] = value
- 			}
- 			value.count++
- 			if value.count == 1 {
+ 			if session.watches.Acquire(watcher, globId) {
  				newWatchers.Set(globId, watcher)
  			}
  		}
  		for id, watcher := range newWatchers.Entries() {
- 			if err := session.client.WatchFiles(ctx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+ 			if err := session.client.WatchFiles(watchCtx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+ 				session.watches.Release(watcher)
  				errors = append(errors, err)
  			} else if logger != nil {
There was a problem hiding this comment.
updateWatch creates a single watchCtx with a 1s deadline and reuses it for all WatchFiles/UnwatchFiles calls in this update. If multiple registrations/unregistrations are needed, later calls may inherit very little time remaining (e.g., first call takes 900ms → second gets ~100ms), causing avoidable timeouts even when the client is responsive. Consider creating a fresh per-request timeout context inside the WatchFiles/UnwatchFiles loops (or otherwise resetting the deadline per call) while still ensuring the mutex is not held indefinitely.
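A per-request deadline could be scoped like this, sketched against the loop above. watchRequestTimeout is the existing 1s constant from the diff:

// Sketch: give each client call its own fresh 1s budget instead of sharing
// one deadline across the whole batch.
for id, watcher := range newWatchers.Entries() {
	reqCtx, cancel := context.WithTimeout(ctx, watchRequestTimeout)
	err := session.client.WatchFiles(reqCtx, id, []*lsproto.FileSystemWatcher{watcher})
	cancel() // release the timer promptly rather than deferring inside the loop
	if err != nil {
		session.watches.Release(watcher)
		errors = append(errors, err)
	}
}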
On WatchFiles error, updateWatch rolls back bookkeeping via session.watches.Release(watcher) exactly once for that watcher. If the input watcher list contains duplicate patterns (same fileSystemWatcherKey), Acquire will increment the refcount multiple times but only one client registration is attempted; a failed registration would then leave the refcount >0, preventing future retries from ever treating that watch as “new”. This is reachable because some glob lists are only sorted (not deduped) before becoming watchers. Consider deduping watcher keys before calling Acquire, or tracking how many increments were applied per key and fully rolling them back on registration failure.
@copilot address review comments
…up watchers, rollback all on failure, stronger test assertions

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/81267cbb-20f5-41d0-b92b-9f1b30604e5d
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Addressed all five review comments in
@copilot I don't really understand the "dedupe watchers by key to prevent refcount inflation from duplicate patterns" issue. Previously, if a watcher produced duplicate patterns, the fileSystemWatcherValue corresponding to that pattern would get reffed once for each instance. During deregistration, the ref count would decrement once for each instance. The registrations were already deduped, so why does it matter what the actual ref counts were as long as they were balanced? It looks to me like you've added another unnecessary layer of deduping, but please explain if I'm misunderstanding something.
You're right that in the normal lifecycle the refcounts are balanced — register and deregister both iterate the same list (including duplicates), so the increments and decrements match. The issue is specifically with the rollback path. On failure, lines 1184-1186 release the failed watcher exactly once, even though duplicate patterns were acquired once per instance, so a nonzero refcount is left behind and later retries never treat that watch as new. An alternative would be to track all acquired watchers (including duplicates) and Release each one on rollback. If you prefer that approach I can switch to it — the dedup just seemed simpler and avoids the extra Acquire/Release cycles for duplicates.
@copilot understood. Instead, let's move the deduplication into Watchers().
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/dd1001aa-1bbc-4e9f-a0a3-edc2a7e29627
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>

Moved in
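Per the PR description, the dedup landed in Watchers() via slices.Compact(slices.Sorted(...)). As a self-contained illustration of that idiom (the globs variable is made up for the example):

package main

import (
	"fmt"
	"slices"
)

func main() {
	// Overlapping roots can yield the same glob more than once.
	globs := []string{"**/*.ts", "**/tsconfig.json", "**/*.ts"}
	// Sort, then drop adjacent duplicates: one registry key per pattern,
	// so refcounts can never be inflated by repeated Acquire calls.
	unique := slices.Compact(slices.Sorted(slices.Values(globs)))
	fmt.Println(unique) // [**/*.ts **/tsconfig.json]
}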
@copilot move mutex locking into watchRegistry methods.
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/6ef1f6c5-5c45-4461-bf8c-933c384252fe
Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>

Done in
Two unbounded waits on the critical path fired by Running scheduled diagnostics refresh can deadlock the server when the client is slow to respond, leaving the LSP completely unresponsive until manually restarted.

RefreshDiagnostics — fire and forget
- workspace/diagnostic/refresh always returns null; waiting for it is unnecessary

updateWatch — per-request 1s timeout on client requests with rollback
- WatchFiles/UnwatchFiles calls hold session.watchesMu while synchronously waiting on client/registerCapability responses
- updateWatch callers queue behind the mutex, and if requestQueue fills up, the read loop can't deliver responses — creating a self-locking deadlock
- Each WatchFiles/UnwatchFiles call gets its own 1s context.WithTimeout so earlier calls don't consume the deadline for later ones
- On WatchFiles failure, ALL newly-acquired watchers in that batch are rolled back (not just the failed one) to keep refcounts clean for retry
- updateWatches retries them on the next snapshot update, even when the watcher identity hasn't changed

WatchedFiles.Watchers() — deduplicated at the source
- Watchers() now deduplicates patternsInsideWorkspace and directoriesOutsideWorkspace via slices.Compact(slices.Sorted(...)) before creating FileSystemWatcher objects

watchRegistry — encapsulated watch bookkeeping
- New watchRegistry type in internal/project/watch.go encapsulates the ref-counted map and mutex and provides Acquire and Release methods
- MarkPending, ClearPending, and IsPending methods track watcher IDs whose registration failed
- Methods lock mu internally, fully encapsulating the locking discipline so callers never need to manage the mutex directly
- updateWatch now delegates all bookkeeping to the registry instead of manipulating the map directly, making the function easier to follow
- updateWatches checks IsPending for all watcher types (config, program files, typings, auto-imports) and retries as fresh adds when the watcher identity is unchanged but the previous registration failed

Tests
- New TestUpdateWatchTimeoutAndRollback in internal/project/watchtimeout_test.go uses synctest and projecttestutil to verify timeout, rollback, and retry behavior with real session operations
- Uses DidOpenFile to open a file (triggering watches that time out with a mocked slow client), then DidChangeFile with a single character edit (which does not change any watcher identity), followed by GetLanguageService to flush the pending change and trigger updateWatches
- Verifies that updateWatches retries the failed registrations and that the retry succeeds, even though the watcher identities are unchanged
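The repository's actual test isn't reproduced in this thread. As a rough sketch of how testing/synctest makes a 1s timeout testable without real waiting (synctest.Run is the Go 1.24 experimental entry point, enabled via GOEXPERIMENT=synctest; the slow-client stand-in below is hypothetical, not the projecttestutil API):

//go:build goexperiment.synctest

package project_test

import (
	"context"
	"testing"
	"testing/synctest"
	"time"
)

// slowWatchFiles stands in for a client that never answers a WatchFiles
// request.
func slowWatchFiles(ctx context.Context) error {
	select {
	case <-time.After(time.Hour): // the client "never" replies
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func TestWatchRequestTimeoutSketch(t *testing.T) {
	synctest.Run(func() {
		// Inside the bubble the clock is virtual: once every goroutine is
		// durably blocked, time jumps to the next timer, so this 1s
		// deadline fires immediately in wall-clock terms.
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		if err := slowWatchFiles(ctx); err != context.DeadlineExceeded {
			t.Errorf("expected context.DeadlineExceeded, got %v", err)
		}
	})
}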