
Startup of insiders is hung when opening anything #142786

Closed
rchiodo opened this issue Feb 10, 2022 · 20 comments
Labels: bug · confirmed · electron · electron-17-update · extension-host · insiders-released · mitigated · upstream · verified · windows
Milestone: February 2022


rchiodo commented Feb 10, 2022

Does this issue occur when all extensions are disabled?: Can't tell; it went away on restart.

  • VS Code Version: 1.65.0-insider (user setup)
    Commit: 82a8bec
    Date: 2022-02-10T05:16:36.093Z
    Electron: 16.0.8
    Chromium: 96.0.4664.110
    Node.js: 16.9.1
    V8: 9.6.180.21-electron.0
    OS: Windows_NT x64 10.0.19044

Steps to Reproduce:

  1. Open a folder
  2. Open a notebook

I get this:

[screenshot of the warning message]


roblourens commented Feb 11, 2022

I hit this once too. That warning message implies that loading extensions is stuck, cc @deepak1556 in case this is related to the electron update?

roblourens assigned alexdima and unassigned rebornix Feb 11, 2022
@deepak1556

This seems to be the same as #140114, but we didn't have a clear repro at that time. @alexdima are there any additional logs we can obtain to narrow down where exactly the extension host got stuck when trying to load extensions?

deepak1556 self-assigned this Feb 14, 2022
deepak1556 added the electron-16-update, extension-host, and info-needed labels Feb 14, 2022
@alexdima

Another occurrence in #142766

alexdima reopened this Feb 22, 2022

alexdima commented Feb 22, 2022

@bpasero @deepak1556 Do we plan to ship Electron 16 on stable? This might be a blocker.

alexdima added this to the February 2022 milestone Feb 22, 2022

deepak1556 commented Feb 22, 2022

@alexdima yup the current plan is to ship this milestone.

Maybe we can add some additional logging to help narrow down where in the extension load code path things are getting stuck; that would help if we get similar reports once this ships to stable. So far we have had only a single occurrence of this issue, so there was not much we could do to diagnose it, and I didn't want to block the update because of that.

Also, if we revert to Electron 13 we will be on an EOL release line (https://www.electronjs.org/blog/electron-17-0#end-of-support-for-13xy), anything > 13 will have render process reuse enabled by default, and the downside of reverting would be not being able to self-host the change with Insiders.


alexdima commented Feb 22, 2022

@deepak1556 I was able to reproduce when running scripts\code.bat and doing a few Ctrl+R reloads on a Windows VM (i.e. a slow machine). It may take 10 reloads, but it repros eventually. I am using a TS file with a compilation error, and the hint that it has reproduced is that the compilation error doesn't show; there is no other indication that something is wrong. I believe this might be the same weird behavior around named pipes / sockets that do not deliver data that we saw some time ago when we met to try out Playwright for Electron.

@bpasero @deepak1556 Could you please also try to reproduce?

@deepak1556

Oh, I was not aware of that, will give it a try. Thanks for the steps!

deepak1556 added the confirmation-pending label and removed the info-needed label Feb 22, 2022

bpasero commented Feb 23, 2022

I tried to reproduce in my Windows 11 ARM VM and was not successful. Tried a few window reloads on a small TypeScript project with an editor open that has a compile error.

@deepak1556

@bpasero I am also on a Windows 11 ARM VM and couldn't repro it at first; the trick is to have some memory-intensive task running in the background (building Electron does it for me). I am now able to repro the issue.

deepak1556 added the bug, confirmed, and windows labels and removed the confirmation-pending label Feb 23, 2022
@deepak1556

Created a minimal repro, https://gist.github.com/deepak1556/59f981cd8a55cda92beadd6bf97330a9, that shows a ping/pong via Node.js sockets between the renderer process and a child process forked from the main process. After a couple of reloads, the server/renderer gets into a state where async tasks get queued up and are not processed until another explicit task is executed to flush the queue.

An example state when the event loop is paused:

Forked process: Child process created
Forked process: Connected to named pipe as client
[4124:0223/200645.092:INFO:CONSOLE(35)] "Connected to named pipe as server", source: file:///C:/Users/demohan/github/electron-quick-start/renderer.js (35)
Forked process: rec'd from server: pong!
[4124:0223/200658.631:INFO:CONSOLE(48)] "rec'd from client: ping 0", source: file:///C:/Users/demohan/github/electron-quick-start/renderer.js (48

At this point, executing a setImmediate(() => { console.log('hello') }) resumes the loop for some time, but it gets back into the bad state after a few more tasks. At first I thought it might be a side effect of the occlusion tracker on Windows (https://chromium.googlesource.com/chromium/src/+/master/docs/windows_native_window_occlusion_tracking.md), which puts the renderer into a suspended state, but the bug repros even when launching with --disable-features=CalculateNativeWinOcclusion. I am pretty confident that this bug is what we are seeing at larger scale with our extension host setup. Continuing to investigate why the Electron renderer gets into this state; it looks like a return of electron/electron#1968 from the past with the render process reuse changes.

deepak1556 added the upstream, electron, and electron-blocker labels Feb 23, 2022

zcbenz commented Feb 25, 2022

It seems that after reloading the window, the IOCP will get into a weird state and event loop integration will stop functioning, so a workaround is to recreate the IOCP for each reload:

--- a/shell/common/node_bindings.cc
+++ b/shell/common/node_bindings.cc
@@ -576,7 +576,11 @@ void NodeBindings::LoadEnvironment(node::Environment* env) {
 }

 void NodeBindings::PrepareMessageLoop() {
-#if !BUILDFLAG(IS_WIN)
+#if BUILDFLAG(IS_WIN)
+  if (uv_loop_->iocp && uv_loop_->iocp != INVALID_HANDLE_VALUE)
+    CloseHandle(uv_loop_->iocp);
+  uv_loop_->iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 2);
+#else
   int handle = uv_backend_fd(uv_loop_);

   // If the backend fd hasn't changed, don't proceed.

I'm still trying to figure out the root cause.

@deepak1556

Currently testing the above workaround with vscode.

@deepak1556

After a bit of testing, the extension host no longer becomes unresponsive when the IOCP is reinitialized on reload. However, we now get a prompt that says "Reloading the window is taking longer than expected...", and it seems the issue arises from the pending operation join.closeStorage. I will have to investigate why that regressed.

@zcbenz instead of reinitializing the IOCP for the same-process reload case, can we instead close the default uv loop, also perform the background thread cleanup that is usually done only in ~NodeBindings, and reinitialize after the reload? Since we destroy the node environment for reload, I feel this should be safe to do. Thoughts?


zcbenz commented Feb 26, 2022

When navigating, Chromium may load the new page before the old page is destroyed, so it is possible for multiple environments to share the same loop in one process, and we probably don't have a good point in time to close the default uv loop. Reinitializing the IOCP has the same problem, though.

In theory, destroying and initializing Node.js environments should not affect the uv loop at all; reading the code, I couldn't find anywhere that closes or re-initializes the uv loop. Yet somehow having multiple Node.js environments makes our node integration run into weird states. There have been some efforts to work around it (electron/electron#25869 and electron/electron#27582), but I think they do not fix the root cause.


zcbenz commented Mar 2, 2022

It seems that the node integration only breaks when the first node environment is destroyed; everything works fine if we just create and destroy subframes or child windows. So I wonder if we can work around the problem by creating a dummy node environment in the renderer before any web frame is created and keeping it alive forever.

The root cause is still mysterious to me though.

bpasero added this to 🏃 In Progress in Electron Integration Mar 3, 2022
@deepak1556

@zcbenz I looked a bit more into the root issue and have some updates.

a) First, let me walk through the basics of Electron's event loop integration on Windows and make sure I haven't misunderstood any of the components.

  1. Electron polls the IOCP on a background thread via GetQueuedCompletionStatus to know when a completion packet is available for an async I/O operation posted to this IOCP by the libuv event loop.

  2. Once a packet is available, Electron posts it back into the queue via PostQueuedCompletionStatus, interrupts the main thread via the Chromium task runner, and runs the libuv event loop on the main thread, which then processes the completion packet.

  3. The IOCP is created by libuv, and only once, when the first call to uv_default_loop is made: https://github.com/libuv/libuv/blob/4296fec7f50145e2a307f3db7ae22984713976a7/src/win/core.c#L236

  4. Even across reloads, when the process is reused the IOCP created in the previous step remains the same, as the default loop never gets destroyed.

b) Next, a few principles of IOCP usage stated in https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports that are relevant for this bug (a small standalone illustration follows the list):

  1. A thread (either one created by the main thread or the main thread itself) uses the GetQueuedCompletionStatus function to wait for a completion packet to be queued to the I/O completion port, rather than waiting directly for the asynchronous I/O to complete. Threads that block their execution on an I/O completion port are released in last-in-first-out (LIFO) order, and the next completion packet is pulled from the I/O completion port's FIFO queue for that thread. This means that, when a completion packet is released to a thread, the system releases the last (most recent) thread associated with that port, passing it the completion information for the oldest I/O completion.

  2. Although any number of threads can call GetQueuedCompletionStatus for a specified I/O completion port, when a specified thread calls GetQueuedCompletionStatus the first time, it becomes associated with the specified I/O completion port until one of three things occurs: The thread exits, specifies a different I/O completion port, or closes the I/O completion port. In other words, a single thread can be associated with, at most, one I/O completion port.

  3. When a completion packet is queued to an I/O completion port, the system first checks how many threads associated with that port are running. If the number of threads running is less than the concurrency value (discussed in the next section), one of the waiting threads (the most recent one) is allowed to process the completion packet. When a running thread completes its processing, it typically calls GetQueuedCompletionStatus again, at which point it either returns with the next completion packet or waits if the queue is empty.

  4. The most important property of an I/O completion port to consider carefully is the concurrency value. The concurrency value of a completion port is specified when it is created with CreateIoCompletionPort via the NumberOfConcurrentThreads parameter. This value limits the number of runnable threads associated with the completion port. When the total number of runnable threads associated with the completion port reaches the concurrency value, the system blocks the execution of any subsequent threads associated with that completion port until the number of runnable threads drops below the concurrency value.

  5. The system also allows a thread waiting in GetQueuedCompletionStatus to process a completion packet if another running thread associated with the same I/O completion port enters a wait state for other reasons, for example the SuspendThread function. When the thread in the wait state begins running again, there may be a brief period when the number of active threads exceeds the concurrency value. However, the system quickly reduces this number by not allowing any new active threads until the number of active threads falls below the concurrency value.
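
To make these rules concrete, here is a small, self-contained Win32 illustration; it is hypothetical demo code (not from libuv, Electron, or VS Code) showing a port created with NumberOfConcurrentThreads = 1, two worker threads blocked in GetQueuedCompletionStatus, and completion packets queued by hand via PostQueuedCompletionStatus. While one worker is runnable, the other stays parked on the port, which is the blocking behaviour described in b.4.

// Hypothetical standalone illustration of points b.1-b.4; not Electron code.
#include <windows.h>
#include <cstdint>
#include <cstdio>

static HANDLE port;  // completion port shared by both workers

static DWORD WINAPI Worker(LPVOID arg) {
  DWORD bytes;
  ULONG_PTR key;
  OVERLAPPED* overlapped;
  for (;;) {
    // The first call here associates this thread with `port` (b.2); waiting
    // threads are released in LIFO order as packets arrive (b.1).
    GetQueuedCompletionStatus(port, &bytes, &key, &overlapped, INFINITE);
    printf("worker %d processed packet %llu\n",
           (int)(intptr_t)arg, (unsigned long long)key);
    // Busy work keeps this thread runnable; with a concurrency value of 1 the
    // other worker stays blocked in GetQueuedCompletionStatus meanwhile (b.4).
    for (volatile unsigned long long spin = 0; spin < 100000000ULL; ++spin) {
    }
  }
}

int main() {
  // NumberOfConcurrentThreads = 1, the value libuv passes for its default loop.
  port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 1);
  for (intptr_t i = 0; i < 2; ++i)
    CreateThread(nullptr, 0, Worker, (LPVOID)i, 0, nullptr);
  // Queue a few completion packets by hand instead of doing real async I/O.
  for (ULONG_PTR k = 0; k < 4; ++k)
    PostQueuedCompletionStatus(port, 0, k, nullptr);
  Sleep(3000);  // give the workers time to drain the queue, then exit
  return 0;
}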

c) Finally, what is happening in the process reuse scenario on Windows:

  1. Libuv creates the IOCP with a concurrency value of 1, i.e. only one active running thread can process a completion packet at a time (point b.4).

  2. Although the polling thread NodeBindings::EmbedThreadRunner makes the call to GetQueuedCompletionStatus, on creation it first waits on a semaphore that gets signalled from the main thread only when NodeBindings::UvRunOnce is called, which we trigger for the first time when the process is created via NodeBindings::RunMessageLoop. This causes the main thread to call uv_run, which in turn calls GetQueuedCompletionStatus, and hence the main thread becomes associated with the IOCP (point b.2).

  3. But since the background thread is in a wait state on the semaphore, we still satisfy points b.3 and b.5 and hence are able to process the IOCP on this thread once the signal arrives from the main thread via NodeBindings::PollEvents.

  4. Now, on reload, when reinitializing the default node environment we call NodeBindings::PrepareMessageLoop and NodeBindings::RunMessageLoop again, which creates a new background thread, so the thread count grows with the number of reloads. In this situation more than one background thread can end up operating on the IOCP due to a race, and a thread ends up suspended in its GetQueuedCompletionStatus call because of point b.4.

At this point the background thread is unable to signal back to the main thread, and we see the user-facing bug that none of the node callbacks get executed, as in the repro. Another test that confirmed this: expose NodeBindings::UvRunOnce to user land, and when the test enters the error state, call UvRunOnce explicitly; you will see that the main thread is still able to complete the GetQueuedCompletionStatus calls and trigger the callbacks. This confirms that the IOCP is healthy and the problem is only that the background threads got suspended.

The workaround posted in #142786 (comment) helped because, on reload, although there are multiple background threads, only the main thread and the newly created background thread are associated with the new IOCP, so they continue to work based on the previously listed principles.

Based on the above results, I now try to maintain a single background thread across reloads into the same process, and this always works fine. There was an earlier attempt in Electron that would have achieved the same thing (electron/electron#27582), but it was later reverted (electron/electron#28175). The cause identified for the hang before the revert was incorrect: the problem was not with the IOCP. Rather, in electron/electron#27582, although we short-circuited NodeBindings::PrepareMessageLoop, we still ended up calling NodeBindings::RunMessageLoop on reload, which caused an incorrect semaphore count.

The semaphore dance works like this (https://github.com/libuv/libuv/blob/4296fec7f50145e2a307f3db7ae22984713976a7/src/win/thread.c#L309-L333):

In Main thread
  PrepareMessageLoop creates embed_sem_ // initialized with value 0; any value > 0 will signal a thread waiting on this semaphore https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createsemaphorea

In Background thread
  EmbedThreadRunner waits on embed_sem_ // once this gets signalled, the count is reduced by 1, which brings it back to 0 https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject

In Main thread
  UvRunOnce posts to embed_sem_ // increases the count by 1 and signals the waiting thread https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-releasesemaphore

With electron/electron#27582 there can be a situation where, with active I/O tasks pending on reload, the semaphore value has already been incremented to 1 by the main thread to signal the background thread, and another call to NodeBindings::RunMessageLoop increments it to 2. Even if the background thread then decrements it after its wait call, it will be signalled forever inside its loop because the value never returns to 0, and we end up in a situation where two threads are operating on the same IOCP and things break. The fix here is simply to not call NodeBindings::RunMessageLoop along with NodeBindings::PrepareMessageLoop on reload; by maintaining a single polling thread it fixes both the bug mentioned in this issue and electron/electron#28154. A compilable sketch of the handoff, and of how the extra post breaks it, follows.
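
For reference, here is a minimal compilable sketch of that handoff. It is an illustration under stated assumptions, not Electron's actual implementation: embed_sem, EmbedThreadRunner and UvRunOnce mirror the names in node_bindings.cc, while PollIocp and ScheduleUvRunOnMainThread are hypothetical stand-ins for the GetQueuedCompletionStatus poll and the Chromium task-runner hop.

// Hedged sketch of the embed_sem_ handoff; PollIocp() and
// ScheduleUvRunOnMainThread() are hypothetical placeholders, not real APIs.
#include <windows.h>

static HANDLE embed_sem;  // counterpart of embed_sem_, created with count 0

static void PollIocp() { /* GetQueuedCompletionStatus on the uv loop's IOCP */ }
static void ScheduleUvRunOnMainThread() { /* post a task that calls UvRunOnce */ }

// Background thread, analogous to NodeBindings::EmbedThreadRunner.
static DWORD WINAPI EmbedThreadRunner(LPVOID) {
  for (;;) {
    // Blocks while the count is 0; a release wakes it and drops the count to 0.
    WaitForSingleObject(embed_sem, INFINITE);
    PollIocp();
    ScheduleUvRunOnMainThread();
  }
}

// Main thread, analogous to NodeBindings::UvRunOnce.
static void UvRunOnce() {
  // uv_run(uv_loop_, UV_RUN_NOWAIT) would drain completion packets here.
  ReleaseSemaphore(embed_sem, 1, nullptr);  // count 0 -> 1, release the poller
}

int main() {
  // PrepareMessageLoop: the semaphore starts at 0, so the poller waits.
  embed_sem = CreateSemaphoreA(nullptr, /*lInitialCount=*/0, /*lMaximumCount=*/2, nullptr);
  CreateThread(nullptr, 0, EmbedThreadRunner, nullptr, 0, nullptr);
  UvRunOnce();  // RunMessageLoop: in the real code this first pass starts the ping-pong

  // The regression: a second RunMessageLoop on reload, while an I/O task is
  // still in flight, amounts to an extra ReleaseSemaphore. The count reaches 2,
  // the background thread's wait stops blocking, and a second polling thread
  // ends up associated with the same concurrency-1 IOCP.
  Sleep(1000);
  return 0;
}

The point of the sketch is the counting: one post per completed uv_run pass keeps exactly one poller attached to the concurrency-1 port, and the extra post on reload is what lets a second poller through.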

@deepak1556

I will put up a PR in Electron to further discuss this solution.

deepak1556 removed the electron-blocker label Mar 10, 2022
bpasero moved this from 🏃 In Progress to ✔️ Done / Deferred in Electron Integration Mar 10, 2022
deepak1556 added the mitigated and electron-17-update labels and removed the electron-16-update label Mar 17, 2022
alexdima removed their assignment Mar 24, 2022
rzhao271 added the author-verification-requested label Mar 25, 2022

rchiodo commented Mar 25, 2022

/verified

Definitely working for me, although I haven't repro'd this in a while.

rchiodo added the verified label Mar 25, 2022
github-actions bot locked and limited conversation to collaborators May 9, 2022