
Startup of insiders is hung when opening anything #142786

Closed
rchiodo opened this issue Feb 10, 2022 · 20 comments
Labels: bug · confirmed · electron · electron-17-update · extension-host · insiders-released · mitigated · upstream · verified · windows
Milestone: February 2022


rchiodo commented Feb 10, 2022

Does this issue occur when all extensions are disabled?: Can't tell; it went away on restart.

  • VS Code Version: 1.65.0-insider (user setup)
    Commit: 82a8bec
    Date: 2022-02-10T05:16:36.093Z
    Electron: 16.0.8
    Chromium: 96.0.4664.110
    Node.js: 16.9.1
    V8: 9.6.180.21-electron.0
    OS: Windows_NT x64 10.0.19044

Steps to Reproduce:

  1. Open a folder
  2. Open a notebook

I get this:

[screenshot of the warning message]


roblourens commented Feb 11, 2022

I hit this once too. That warning message implies that loading extensions is stuck, cc @deepak1556 in case this is related to the electron update?

roblourens assigned alexdima and unassigned rebornix Feb 11, 2022
@deepak1556

This seems to be the same as #140114, but we didn't have a clear repro at that time. @alexdima are there any additional logs we can obtain to narrow down where exactly the extension host got stuck when trying to load extensions?

deepak1556 self-assigned this Feb 14, 2022
deepak1556 added the electron-16-update, extension-host, and info-needed labels Feb 14, 2022
@alexdima

Another occurrence in #142766

alexdima reopened this Feb 22, 2022

alexdima commented Feb 22, 2022

@bpasero @deepak1556 Do we plan to ship Electron 16 on stable? This might be a blocker.

alexdima added this to the February 2022 milestone Feb 22, 2022

deepak1556 commented Feb 22, 2022

@alexdima yup the current plan is to ship this milestone.

Maybe we can add some additional logging to help narrow down where in the extension load code path things are getting stuck; that would help if we get similar reports once this ships to stable. So far we have had only a single occurrence of this issue, so there was not much we could do to diagnose it, and I didn't want to block the update because of that.

Also, if we revert to Electron 13 we will be on an EOL release line (https://www.electronjs.org/blog/electron-17-0#end-of-support-for-13xy), anything > 13 will have render process reuse enabled by default, and the downside of reverting would be not being able to self-host the change with Insiders.


alexdima commented Feb 22, 2022

@deepak1556 I was able to reproduce when running scripts\code.bat and doing a few Ctrl+R reloads on a Windows VM (i.e. a slow machine). It may take 10 reloads, but it repros eventually. I am using a TS file with a compilation error, and the hint that it has reproduced is that the compilation error doesn't show; there is no other indication that something is wrong. I believe this might be the same weird behavior around named pipes / sockets that do not deliver data that we saw some time ago when we met to try out Playwright for Electron.

@bpasero @deepak1556 Could you please also try to reproduce?

@deepak1556

Oh, I was not aware of that, will give it a try. Thanks for the steps!

deepak1556 added the confirmation-pending label and removed the info-needed label Feb 22, 2022

bpasero commented Feb 23, 2022

I tried to reproduce in my Windows 11 ARM VM and was not successful. Tried a few window reloads on a small TypeScript project with an editor open that has a compile error.

@deepak1556

@bpasero I am also on a Windows 11 ARM VM and couldn't repro it at first; the trick is to have some memory-intensive task running in the background (building Electron does it for me). I am now able to repro the issue.

deepak1556 added the bug, confirmed, and windows labels and removed the confirmation-pending label Feb 23, 2022
@deepak1556

Created a minimal repro, https://gist.github.com/deepak1556/59f981cd8a55cda92beadd6bf97330a9, that shows a ping/pong via Node.js sockets between the renderer process and a child process forked from the main process. After a couple of reloads, the server/renderer gets into a state where async tasks get queued up and are not processed until another explicit task is executed to flush the queue.

An example state when the event loop is paused:

Forked process: Child process created
Forked process: Connected to named pipe as client
[4124:0223/200645.092:INFO:CONSOLE(35)] "Connected to named pipe as server", source: file:///C:/Users/demohan/github/electron-quick-start/renderer.js (35)
Forked process: rec'd from server: pong!
[4124:0223/200658.631:INFO:CONSOLE(48)] "rec'd from client: ping 0", source: file:///C:/Users/demohan/github/electron-quick-start/renderer.js (48

At this point, executing a setImmediate(() => { console.log('hello') }) resumes the loop for some time, but it gets back into the bad state after a few more tasks. At first I thought it might be a side effect of the occlusion tracker on Windows (https://chromium.googlesource.com/chromium/src/+/master/docs/windows_native_window_occlusion_tracking.md), which puts the renderer into a suspended state, but the bug repros even when launching with --disable-features=CalculateNativeWinOcclusion. I am pretty confident that this bug is what we are seeing at larger scale with our extension host setup. Continuing to investigate why the Electron renderer gets into this state; it looks like a return of electron/electron#1968 from the past with the render process reuse changes.

deepak1556 added the upstream, electron, and electron-blocker labels Feb 23, 2022

zcbenz commented Feb 25, 2022

It seems that after reloading the window, the IOCP will get into a weird state and event loop integration will stop functioning, so a workaround is to recreate the IOCP for each reload:

--- a/shell/common/node_bindings.cc
+++ b/shell/common/node_bindings.cc
@@ -576,7 +576,11 @@ void NodeBindings::LoadEnvironment(node::Environment* env) {
 }

 void NodeBindings::PrepareMessageLoop() {
-#if !BUILDFLAG(IS_WIN)
+#if BUILDFLAG(IS_WIN)
+  if (uv_loop_->iocp && uv_loop_->iocp != INVALID_HANDLE_VALUE)
+    CloseHandle(uv_loop_->iocp);
+  uv_loop_->iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 2);
+#else
   int handle = uv_backend_fd(uv_loop_);

   // If the backend fd hasn't changed, don't proceed.

I'm still trying to figure out the root cause.

@deepak1556

Currently testing the above workaround with vscode.

@deepak1556

After a bit of testing, the extension host no longer becomes unresponsive when the IOCP is reinitialized on reload. However, we now get a prompt that says "Reloading the window is taking longer than expected...", and it seems the issue arises from the pending operation join.closeStorage. I will have to investigate why that regressed.

@zcbenz instead of reinitializing the IOCP for the same-process reload case, can we instead close the default uv loop, also perform the background thread cleanup that is usually done only in ~NodeBindings, and reinitialize after the reload? Since we destroy the node environment for reload, I feel this should be safe to do. Thoughts?


zcbenz commented Feb 26, 2022

When navigating, Chromium may load the new page before the old page is destroyed, so it is possible for multiple environments to share the same loop in one process, and we probably don't have a good point in time to close the default uv loop. Reinitializing the IOCP has the same problem, though.

In theory, destroying and initializing Node.js environments should not affect the uv loop at all; reading the code, I couldn't find anywhere that closes or re-initializes the uv loop. Yet somehow having multiple Node.js environments makes our node integration run into weird states. There have been some efforts to work around it (electron/electron#25869 and electron/electron#27582), but I think they do not fix the root cause.


zcbenz commented Mar 2, 2022

It seems that the node integration only breaks when the first node environment is destroyed; everything works fine if we just create and destroy subframes or child windows. So I wonder if we can work around the problem by creating a dummy node environment in the renderer before any web frame is created and keeping it alive forever.

The root cause is still mysterious to me though.

bpasero added this to 🏃 In Progress in Electron Integration Mar 3, 2022
@deepak1556

@zcbenz I looked a bit more into the root issue and have some updates.

a) First, let me walk through the basics of Electron's event loop integration on Windows and make sure I haven't misunderstood any of the components.

  1. Electron polls the IOCP on a background thread via GetQueuedCompletionStatus to know when a completion packet is available for an async I/O operation posted to this IOCP by the libuv event loop.

  2. Once a packet is available, Electron posts it back into the queue via PostQueuedCompletionStatus, interrupts the main thread via the Chromium task runner, and runs the libuv event loop on the main thread, which then processes the completion packet.

  3. The IOCP is created by libuv, and only once, when the first call to uv_default_loop is made: https://github.com/libuv/libuv/blob/4296fec7f50145e2a307f3db7ae22984713976a7/src/win/core.c#L236

  4. Even across reloads, when the process is reused the IOCP created in the previous step remains the same, as the default loop never gets destroyed.

b) Next, a few principles of IOCP usage stated in https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports that are relevant for this bug (a small standalone illustration follows the list):

  1. A thread (either one created by the main thread or the main thread itself) uses the GetQueuedCompletionStatus function to wait for a completion packet to be queued to the I/O completion port, rather than waiting directly for the asynchronous I/O to complete. Threads that block their execution on an I/O completion port are released in last-in-first-out (LIFO) order, and the next completion packet is pulled from the I/O completion port's FIFO queue for that thread. This means that, when a completion packet is released to a thread, the system releases the last (most recent) thread associated with that port, passing it the completion information for the oldest I/O completion.

  2. Although any number of threads can call GetQueuedCompletionStatus for a specified I/O completion port, when a specified thread calls GetQueuedCompletionStatus the first time, it becomes associated with the specified I/O completion port until one of three things occurs: The thread exits, specifies a different I/O completion port, or closes the I/O completion port. In other words, a single thread can be associated with, at most, one I/O completion port.

  3. When a completion packet is queued to an I/O completion port, the system first checks how many threads associated with that port are running. If the number of threads running is less than the concurrency value (discussed in the next section), one of the waiting threads (the most recent one) is allowed to process the completion packet. When a running thread completes its processing, it typically calls GetQueuedCompletionStatus again, at which point it either returns with the next completion packet or waits if the queue is empty.

  4. The most important property of an I/O completion port to consider carefully is the concurrency value. The concurrency value of a completion port is specified when it is created with CreateIoCompletionPort via the NumberOfConcurrentThreads parameter. This value limits the number of runnable threads associated with the completion port. When the total number of runnable threads associated with the completion port reaches the concurrency value, the system blocks the execution of any subsequent threads associated with that completion port until the number of runnable threads drops below the concurrency value.

  5. The system also allows a thread waiting in GetQueuedCompletionStatus to process a completion packet if another running thread associated with the same I/O completion port enters a wait state for other reasons, for example the SuspendThread function. When the thread in the wait state begins running again, there may be a brief period when the number of active threads exceeds the concurrency value. However, the system quickly reduces this number by not allowing any new active threads until the number of active threads falls below the concurrency value.
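
To make these rules concrete, here is a small, self-contained Win32 illustration; it is hypothetical demo code (not from libuv, Electron, or VS Code) showing a port created with NumberOfConcurrentThreads = 1, two worker threads blocked in GetQueuedCompletionStatus, and completion packets queued by hand via PostQueuedCompletionStatus. While one worker is runnable, the other stays parked on the port, which is the blocking behaviour described in b.4.

// Hypothetical standalone illustration of points b.1-b.4; not Electron code.
#include <windows.h>
#include <cstdint>
#include <cstdio>

static HANDLE port;  // completion port shared by both workers

static DWORD WINAPI Worker(LPVOID arg) {
  DWORD bytes;
  ULONG_PTR key;
  OVERLAPPED* overlapped;
  for (;;) {
    // The first call here associates this thread with `port` (b.2); waiting
    // threads are released in LIFO order as packets arrive (b.1).
    GetQueuedCompletionStatus(port, &bytes, &key, &overlapped, INFINITE);
    printf("worker %d processed packet %llu\n",
           (int)(intptr_t)arg, (unsigned long long)key);
    // Busy work keeps this thread runnable; with a concurrency value of 1 the
    // other worker stays blocked in GetQueuedCompletionStatus meanwhile (b.4).
    for (volatile unsigned long long spin = 0; spin < 100000000ULL; ++spin) {
    }
  }
}

int main() {
  // NumberOfConcurrentThreads = 1, the value libuv passes for its default loop.
  port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 1);
  for (intptr_t i = 0; i < 2; ++i)
    CreateThread(nullptr, 0, Worker, (LPVOID)i, 0, nullptr);
  // Queue a few completion packets by hand instead of doing real async I/O.
  for (ULONG_PTR k = 0; k < 4; ++k)
    PostQueuedCompletionStatus(port, 0, k, nullptr);
  Sleep(3000);  // give the workers time to drain the queue, then exit
  return 0;
}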

c) Finally, what is happening in the process reuse scenario on Windows:

  1. Libuv creates the IOCP with a concurrency value of 1, i.e. only one active running thread can process a completion packet at a time (point b.4).

  2. Although the polling thread NodeBindings::EmbedThreadRunner makes the call to GetQueuedCompletionStatus, on creation it first waits on a semaphore that gets signalled from the main thread only when NodeBindings::UvRunOnce is called, which we trigger for the first time when the process is created via NodeBindings::RunMessageLoop. This causes the main thread to call uv_run, which in turn calls GetQueuedCompletionStatus, and hence the main thread becomes associated with the IOCP (point b.2).

  3. But since the background thread is in a wait state on the semaphore, we still satisfy points b.3 and b.5 and hence are able to process the IOCP on this thread once the signal arrives from the main thread via NodeBindings::PollEvents.

  4. Now, on reload, when reinitializing the default node environment we call NodeBindings::PrepareMessageLoop and NodeBindings::RunMessageLoop again, which creates a new background thread, so the thread count grows with the number of reloads. In this situation more than one background thread can end up operating on the IOCP due to a race, and a thread ends up suspended in its GetQueuedCompletionStatus call because of point b.4.

At this point the background thread is unable to signal back to the main thread, and we see the user-facing bug that none of the node callbacks get executed, as in the repro. Another test that confirmed this: expose NodeBindings::UvRunOnce to user land, and when the test enters the error state, call UvRunOnce explicitly; you will see that the main thread is still able to complete the GetQueuedCompletionStatus calls and trigger the callbacks. This confirms that the IOCP is healthy and the problem is only that the background threads got suspended.

The workaround posted in #142786 (comment) helped because, on reload, although there are multiple background threads, only the main thread and the newly created background thread are associated with the new IOCP, so they continue to work based on the previously listed principles.

Based on the above results, I now try to maintain a single background thread across reloads into the same process, and this always works fine. There was an earlier attempt in Electron that would have achieved the same thing (electron/electron#27582), but it was later reverted (electron/electron#28175). The cause identified for the hang before the revert was incorrect: the problem was not with the IOCP. Rather, in electron/electron#27582, although we short-circuited NodeBindings::PrepareMessageLoop, we still ended up calling NodeBindings::RunMessageLoop on reload, which caused an incorrect semaphore count.

The semaphore dance works like this (https://github.com/libuv/libuv/blob/4296fec7f50145e2a307f3db7ae22984713976a7/src/win/thread.c#L309-L333):

In Main thread
  PrepareMessageLoop creates embed_sem_ // initialized with value 0; any value > 0 will signal a thread waiting on this semaphore https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createsemaphorea

In Background thread
  EmbedThreadRunner waits on embed_sem_ // once this gets signalled, the count is reduced by 1, which brings it back to 0 https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject

In Main thread
  UvRunOnce posts to embed_sem_ // increases the count by 1 and signals the waiting thread https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-releasesemaphore

With electron/electron#27582 there can be a situation where, with active I/O tasks pending on reload, the semaphore value has already been incremented to 1 by the main thread to signal the background thread, and another call to NodeBindings::RunMessageLoop increments it to 2. Even if the background thread then decrements it after its wait call, it will be signalled forever inside its loop because the value never returns to 0, and we end up in a situation where two threads are operating on the same IOCP and things break. The fix here is simply to not call NodeBindings::RunMessageLoop along with NodeBindings::PrepareMessageLoop on reload; by maintaining a single polling thread it fixes both the bug mentioned in this issue and electron/electron#28154. A compilable sketch of the handoff, and of how the extra post breaks it, follows.
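
For reference, here is a minimal compilable sketch of that handoff. It is an illustration under stated assumptions, not Electron's actual implementation: embed_sem, EmbedThreadRunner and UvRunOnce mirror the names in node_bindings.cc, while PollIocp and ScheduleUvRunOnMainThread are hypothetical stand-ins for the GetQueuedCompletionStatus poll and the Chromium task-runner hop.

// Hedged sketch of the embed_sem_ handoff; PollIocp() and
// ScheduleUvRunOnMainThread() are hypothetical placeholders, not real APIs.
#include <windows.h>

static HANDLE embed_sem;  // counterpart of embed_sem_, created with count 0

static void PollIocp() { /* GetQueuedCompletionStatus on the uv loop's IOCP */ }
static void ScheduleUvRunOnMainThread() { /* post a task that calls UvRunOnce */ }

// Background thread, analogous to NodeBindings::EmbedThreadRunner.
static DWORD WINAPI EmbedThreadRunner(LPVOID) {
  for (;;) {
    // Blocks while the count is 0; a release wakes it and drops the count to 0.
    WaitForSingleObject(embed_sem, INFINITE);
    PollIocp();
    ScheduleUvRunOnMainThread();
  }
}

// Main thread, analogous to NodeBindings::UvRunOnce.
static void UvRunOnce() {
  // uv_run(uv_loop_, UV_RUN_NOWAIT) would drain completion packets here.
  ReleaseSemaphore(embed_sem, 1, nullptr);  // count 0 -> 1, release the poller
}

int main() {
  // PrepareMessageLoop: the semaphore starts at 0, so the poller waits.
  embed_sem = CreateSemaphoreA(nullptr, /*lInitialCount=*/0, /*lMaximumCount=*/2, nullptr);
  CreateThread(nullptr, 0, EmbedThreadRunner, nullptr, 0, nullptr);
  UvRunOnce();  // RunMessageLoop: in the real code this first pass starts the ping-pong

  // The regression: a second RunMessageLoop on reload, while an I/O task is
  // still in flight, amounts to an extra ReleaseSemaphore. The count reaches 2,
  // the background thread's wait stops blocking, and a second polling thread
  // ends up associated with the same concurrency-1 IOCP.
  Sleep(1000);
  return 0;
}

The point of the sketch is the counting: one post per completed uv_run pass keeps exactly one poller attached to the concurrency-1 port, and the extra post on reload is what lets a second poller through.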

@deepak1556

I will put up a PR in Electron to further discuss this solution.

deepak1556 removed the electron-blocker label Mar 10, 2022
bpasero moved this from 🏃 In Progress to ✔️ Done / Deferred in Electron Integration Mar 10, 2022
deepak1556 added the mitigated and electron-17-update labels and removed the electron-16-update label Mar 17, 2022
alexdima removed their assignment Mar 24, 2022
rzhao271 added the author-verification-requested label Mar 25, 2022

rchiodo commented Mar 25, 2022

/verified

Definitely working for me, although I haven't repro'd this in a while.

rchiodo added the verified label Mar 25, 2022
github-actions bot locked and limited conversation to collaborators May 9, 2022