Isolate plugins in out-of-process COM host #40120
Pull request overview
This PR moves WSL plugins from in-process LoadLibrary inside wslservice.exe to isolated out-of-process wslpluginhost.exe COM local servers, aiming to keep the service alive even if a plugin crashes.
Changes:
- Introduces `wslpluginhost.exe` (COM local server) that loads one plugin DLL and forwards lifecycle events, while proxying plugin API callbacks back to the service.
- Adds new COM contracts (`IWslPluginHost`, `IWslPluginHostCallback`) and consolidates proxy/stub generation into `wslserviceproxystub.dll`.
- Updates service-side plugin management and adds a new `shared_mutex` path intended to avoid re-entrancy/deadlock on COM RPC callback threads.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/windows/wslpluginhost/exe/resource.h | Adds resource IDs for new host executable. |
| src/windows/wslpluginhost/exe/PluginHost.h | Declares COM class implementing IWslPluginHost plus API callback stubs. |
| src/windows/wslpluginhost/exe/PluginHost.cpp | Implements plugin DLL loading, lifecycle dispatch, and callback forwarding to service. |
| src/windows/wslpluginhost/exe/main.rc | Adds version/icon resources for wslpluginhost.exe. |
| src/windows/wslpluginhost/exe/main.cpp | Implements COM local server entrypoint and class factory registration. |
| src/windows/wslpluginhost/exe/CMakeLists.txt | Adds build target for wslpluginhost.exe. |
| src/windows/wslpluginhost/CMakeLists.txt | Wires new subdirectory into build. |
| src/windows/service/stub/CMakeLists.txt | Adds MIDL proxy/stub sources for WslPluginHost.idl into wslserviceproxystub. |
| src/windows/service/inc/WslPluginHost.idl | Defines out-of-proc COM interfaces for plugin hosting + callbacks. |
| src/windows/service/inc/CMakeLists.txt | Adds new wslpluginhostidl MIDL generation target. |
| src/windows/service/exe/PluginManager.h | Refactors plugin manager to track out-of-proc hosts and adds callback implementation type. |
| src/windows/service/exe/PluginManager.cpp | Implements COM activation of hosts, job object assignment, and service-side callback handlers. |
| src/windows/service/exe/LxssUserSession.h | Adds shared_mutex and makes plugin-callback methods private/friend-only. |
| src/windows/service/exe/LxssUserSession.cpp | Switches plugin callback locking from m_instanceLock to m_callbackLock and gates VM teardown. |
| src/windows/service/exe/CMakeLists.txt | Adds dependency on wslpluginhostidl. |
| src/windows/common/precomp.h | Adds <shared_mutex> include for new locking. |
| msipackage/package.wix.in | Installs wslpluginhost.exe and registers COM AppID/CLSID/interfaces for activation and proxy/stub. |
| msipackage/CMakeLists.txt | Adds wslpluginhost.exe to packaged binaries and build dependencies. |
| CMakeLists.txt | Adds subdirectory for wslpluginhost and adjusts global include directories. |
Comments suppressed due to low confidence (2)
src/windows/service/exe/LxssUserSession.cpp:3644
- CreateLinuxProcess now only takes m_callbackLock, but it calls _RunningInstance(), which is annotated `_Requires_lock_held_(m_instanceLock)` and reads m_runningInstances. This is both a locking-contract violation and can race with writers that still use m_instanceLock only. Refactor so callback code can safely read the running-instance map (e.g., provide a callback-safe lookup guarded by m_callbackLock, and ensure all writes to m_runningInstances/m_utilityVm also take m_callbackLock exclusively after m_instanceLock per the stated lock ordering).
// Shared lock prevents _VmTerminate from destroying the VM or instances
// while we use them. See MountRootNamespaceFolder for rationale.
std::shared_lock lock(m_callbackLock);
RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
src/windows/service/exe/LxssUserSession.cpp:2614
- m_runningInstances is updated here without taking m_callbackLock, but plugin callbacks now read m_runningInstances under m_callbackLock (and intentionally do not take m_instanceLock). Because these are different locks, this doesn’t provide synchronization and can lead to data races/UB when a callback runs concurrently with instance creation/termination. Writers that mutate m_runningInstances (and m_utilityVm if accessed by callbacks) need to also take m_callbackLock (exclusive) in the documented order m_instanceLock → m_callbackLock.
// This needs to be done before plugins are notified because they might try to run a command inside the distribution.
m_runningInstances[registration.Id()] = instance;
if (version == LXSS_WSL_VERSION_2)
{
auto cleanupOnFailure =
wil::scope_exit_log(WI_DIAGNOSTICS_INFO, [&]() { m_runningInstances.erase(registration.Id()); });
m_pluginManager.OnDistributionStarted(&m_session, instance->DistributionInformation());
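The lock ordering suggested in the comment above (writers take m_instanceLock and then m_callbackLock exclusively; callbacks take only m_callbackLock in shared mode) might look roughly like the following. This is a portable sketch and not code from the PR: `InstanceTable`, `Instance`, and all method names are hypothetical, and the integer key stands in for the distribution GUID.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a running distribution instance.
struct Instance
{
    std::string name;
};

// Hypothetical container; only the locking shape mirrors the review comment.
class InstanceTable
{
public:
    // Writer path (service thread): m_instanceLock first, then m_callbackLock exclusive.
    void Add(int id, std::string name)
    {
        std::lock_guard<std::recursive_mutex> instanceLock(m_instanceLock);
        std::unique_lock<std::shared_mutex> callbackLock(m_callbackLock);
        m_runningInstances[id] = std::make_shared<Instance>(Instance{std::move(name)});
    }

    // Same ordering for removal, so readers never observe a half-updated map.
    void Remove(int id)
    {
        std::lock_guard<std::recursive_mutex> instanceLock(m_instanceLock);
        std::unique_lock<std::shared_mutex> callbackLock(m_callbackLock);
        m_runningInstances.erase(id);
    }

    // Callback path (COM RPC thread): shared m_callbackLock only, never m_instanceLock.
    // Returning a shared_ptr keeps the instance alive after the lock is released.
    std::shared_ptr<Instance> FindForCallback(int id) const
    {
        std::shared_lock<std::shared_mutex> lock(m_callbackLock);
        const auto it = m_runningInstances.find(id);
        if (it == m_runningInstances.end())
        {
            return nullptr;
        }
        return it->second;
    }

private:
    std::recursive_mutex m_instanceLock;
    mutable std::shared_mutex m_callbackLock;
    std::map<int, std::shared_ptr<Instance>> m_runningInstances;
};
```

Because every mutation holds m_callbackLock exclusively, a callback that holds it in shared mode has a synchronized view of the map without touching m_instanceLock.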
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3643
- CreateLinuxProcess now only takes m_callbackLock, but it reads m_runningInstances via _RunningInstance(), which is annotated `_Requires_lock_held_(m_instanceLock)` and the map itself is `_Guarded_by_(m_instanceLock)`. This is a real race/contract violation (and can also break static analysis). You’ll need a consistent locking strategy for callback threads (e.g., make all accesses/mutations of m_runningInstances + m_utilityVm also take m_callbackLock in the documented order, or refactor callbacks to avoid touching m_runningInstances without m_instanceLock).
// Shared lock prevents _VmTerminate from destroying the VM or instances
// while we use them. See MountRootNamespaceFolder for rationale.
std::shared_lock lock(m_callbackLock);
RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3653
- `CreateLinuxProcess` now only takes `m_callbackLock` but calls `_RunningInstance(Distro)` (which is annotated `_Requires_lock_held_(m_instanceLock)` and touches `m_lockedDistributions`). This violates the locking contract and can race with instance state changes or deadlock if someone later adds the required lock. Consider adding a callback-safe lookup that is guarded by `m_callbackLock` only (and does not call `_EnsureNotLocked`), or refactoring `_RunningInstance` / the guarded annotations so callback code never needs `m_instanceLock`.
if (Distro == nullptr)
{
*Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
}
else
{
const auto distro = _RunningInstance(Distro);
THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
THROW_HR_IF(WSL_E_WSL2_NEEDED, !wsl2Distro);
Hey @benhillis 👋, following up on this PR. It currently has merge conflicts that need to be resolved, and the CI build didn't run (shows action_required). There are also 20 unresolved review threads, including some security-relevant findings (TOCTOU on DLL signature validation, …
CI is all green now and conflicts are resolved. Would love to get some eyes on this when someone has a chance. The main design change is moving plugin loading out of wslservice into separate COM hosts so a bad plugin can't take down the service.
Moves WSL plugin DLLs out of wslservice.exe and into a separate
wslpluginhost.exe COM server, so plugin code can no longer crash or
destabilize the service. Plugins are activated via CLSCTX_LOCAL_SERVER
and reached through a versioned COM interface (WslPluginHost.idl); a
host process is created per-plugin and tied to a service-owned job
object so all hosts terminate cleanly when wslservice exits.
Service-side changes
- New PluginHostCallbackImpl exposes the plugin->service API surface
(MountFolder, ExecuteBinary, ExecuteBinaryInDistribution) over COM.
- New m_callbackLock (std::shared_mutex) on LxssUserSessionImpl:
callbacks acquire shared; _VmTerminate acquires exclusive after
OnVmStopping notification fires to drain in-flight callbacks before
destroying m_utilityVm.
- Plugin hook dispatch is serialized via m_hookLock and the
g_hookThreadId handshake used by PluginError.
- g_pluginHost is atomic so plugin worker threads can call API stubs
from any thread (cross-apartment, post-hook).
- New IsHostCrash() detection recognises RPC_E_DISCONNECTED,
RPC_E_SERVER_DIED, RPC_E_SERVER_DIED_DNE, CO_E_OBJNOTCONNECTED,
RPC_S_SERVER_UNAVAILABLE, RPC_S_CALL_FAILED, RPC_S_CALL_FAILED_DNE
and RPC_E_CALL_REJECTED as 'host died, log and continue' rather than
fatal plugin errors.
- COM is initialised on the callback / activation / hook dispatch
paths; PluginHost keep-alive ref is tied to ctor/dtor; missing
return in EnsureInitialized fixed; unused IDL surface trimmed and
job-object failures surfaced from the plugin loader; wWinMain exit
code fixed.
Test coverage
- WSL1 plugin tests broadened alongside the refactor.
- Input validation tightened.
- New plugin tests covering the isolation + locking surface:
* HostCrashIsolation: kills wslpluginhost.exe mid-OnVmStarted and
verifies wslservice survives and m_initOnce stays sticky.
* ConcurrentCallbacks: 4 plugin threads behind a start-gate hammer
MountFolder + ExecuteBinary, exercising shared-mode m_callbackLock.
* AsyncApiCallFromWorker: plugin worker thread calls into the
service post-hook (cross-apartment, non-COM-initialized thread).
* CallbacksDuringTerminationDoNotCrash: detached workers race
_VmTerminate's exclusive m_callbackLock acquire / m_utilityVm.reset(),
with an OnVmStopping-set stop signal so they exit deterministically.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Moves plugin DLLs from LoadLibrary in wslservice.exe to isolated wslpluginhost.exe processes via COM. A crashing plugin kills only its host process — the service logs it and continues. Plugin API is unchanged; Defender and existing plugins work unmodified.
Callbacks (MountFolder, ExecuteBinary) arrive on a different COM thread, so they can't re-enter the recursive mutex. Instead of restructuring every call site, callbacks use a separate `shared_mutex`: shared for reads, exclusive in `_VmTerminate` before destroying the VM.
Plugin hosts are in a job object for automatic cleanup on service exit. COM activation is SYSTEM-only via AppID. Proxy/stub is consolidated into wslserviceproxystub.dll. One new exe, no new DLLs.
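The shared/exclusive split described above, where callbacks hold the lock in shared mode and teardown takes it exclusively to drain in-flight callbacks before destroying the VM, can be sketched in portable C++ as follows. The types `FakeVm` and `VmSession` and the helper `CountLateSuccesses` are hypothetical and exist only for illustration; the real code lives in LxssUserSessionImpl and returns HRESULTs.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the utility VM.
struct FakeVm
{
    std::atomic<int> mounts{0};
    void Mount() { ++mounts; }
};

class VmSession
{
public:
    VmSession() : m_vm(std::make_unique<FakeVm>()) {}

    // Callback path: shared lock, so many callbacks can run at once.
    // Returns false once the VM has been torn down.
    bool MountFolderCallback()
    {
        std::shared_lock<std::shared_mutex> lock(m_callbackLock);
        if (!m_vm)
        {
            return false;
        }
        m_vm->Mount();
        return true;
    }

    // Teardown path: the exclusive lock waits for in-flight callbacks to finish.
    void Terminate()
    {
        std::unique_lock<std::shared_mutex> lock(m_callbackLock);
        m_vm.reset();
    }

private:
    std::shared_mutex m_callbackLock;
    std::unique_ptr<FakeVm> m_vm;
};

// Races worker threads against Terminate() and counts callbacks that
// succeeded even though they started after Terminate() had returned.
int CountLateSuccesses(int workerCount)
{
    VmSession session;
    std::atomic<bool> terminated{false};
    std::atomic<int> lateSuccesses{0};

    std::vector<std::thread> workers;
    for (int i = 0; i < workerCount; ++i)
    {
        workers.emplace_back([&] {
            for (int n = 0; n < 1000; ++n)
            {
                const bool afterTerminate = terminated.load();
                if (session.MountFolderCallback() && afterTerminate)
                {
                    ++lateSuccesses;
                }
            }
        });
    }

    session.Terminate();
    terminated.store(true);
    for (auto& worker : workers)
    {
        worker.join();
    }
    return lateSuccesses.load();
}
```

Calling `CountLateSuccesses(4)` should return 0: any callback that begins after `Terminate()` has returned sees a null VM and fails cleanly instead of touching freed state.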
Host-crash classification
`IsHostCrash` recognises the "server died" HRESULTs surfaced by RPC/COM when the plugin host process is gone or unreachable: `RPC_E_DISCONNECTED`, `RPC_E_SERVER_DIED`, `RPC_E_SERVER_DIED_DNE`, `CO_E_OBJNOTCONNECTED`, `RPC_S_SERVER_UNAVAILABLE`, `RPC_S_CALL_FAILED`, `RPC_S_CALL_FAILED_DNE`. Without `RPC_S_CALL_FAILED`, a plugin host that died mid-call would surface as `Wsl/Service/RPC_S_CALL_FAILED` to the user instead of being logged and skipped; this was caught by the new HostCrashIsolation test below.
`RPC_E_CALL_REJECTED` is not classified as a host crash: it's a transient COM busy state (an STA message filter rejecting the call) rather than a "server process died" signal, and the plugin host is MTA without a message filter so it shouldn't surface in this codebase. Treating it as a crash would silently skip future legitimate calls.
New tests
- HostCrashIsolation: kills wslpluginhost.exe mid-OnVmStarted and verifies wslservice survives and `m_initOnce` stays sticky.
- ConcurrentCallbacks: 4 plugin threads behind a start-gate hammer MountFolder + ExecuteBinary, exercising shared-mode `m_callbackLock`.
- AsyncApiCallFromWorker: plugin worker thread calls into the service post-hook (cross-apartment, non-COM-initialized thread).
- CallbacksDuringTerminationDoNotCrash: detached workers race `_VmTerminate`'s exclusive `m_callbackLock` acquire / `m_utilityVm.reset()`, with an OnVmStopping-set stop signal so they exit deterministically.
Existing WSL1 plugin tests broadened alongside the refactor.
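The host-crash classification described above could be expressed along these lines. This is a hedged sketch rather than the PR's `IsHostCrash`: the name `IsHostCrashSketch`, the `HResult` alias, and the `k`-prefixed constants are hypothetical, defined locally so the example builds without Windows headers. The numeric values are the ones `<winerror.h>` defines, and wrapping the `RPC_S_*` Win32 codes with the `HRESULT_FROM_WIN32` mapping is an assumption about how the real function compares them.

```cpp
#include <cstdint>

// Local stand-in for HRESULT so the sketch compiles on any platform.
using HResult = std::uint32_t;

// Mirrors HRESULT_FROM_WIN32 for non-zero codes: severity bit | FACILITY_WIN32 (7) | code.
constexpr HResult HrFromWin32(std::uint32_t code)
{
    return code == 0 ? 0u : ((code & 0xFFFFu) | (7u << 16) | 0x80000000u);
}

constexpr HResult kRpcECallRejected = 0x80010001u;     // RPC_E_CALL_REJECTED
constexpr HResult kRpcEServerDied = 0x80010007u;       // RPC_E_SERVER_DIED
constexpr HResult kRpcEServerDiedDne = 0x80010012u;    // RPC_E_SERVER_DIED_DNE
constexpr HResult kRpcEDisconnected = 0x80010108u;     // RPC_E_DISCONNECTED
constexpr HResult kCoEObjNotConnected = 0x800401FDu;   // CO_E_OBJNOTCONNECTED
constexpr HResult kRpcSServerUnavailable = HrFromWin32(1722); // RPC_S_SERVER_UNAVAILABLE
constexpr HResult kRpcSCallFailed = HrFromWin32(1726);        // RPC_S_CALL_FAILED
constexpr HResult kRpcSCallFailedDne = HrFromWin32(1727);     // RPC_S_CALL_FAILED_DNE

// True for "the host process is gone or unreachable" results, which the
// service would log and skip instead of treating as a fatal plugin error.
constexpr bool IsHostCrashSketch(HResult hr)
{
    switch (hr)
    {
    case kRpcEDisconnected:
    case kRpcEServerDied:
    case kRpcEServerDiedDne:
    case kCoEObjNotConnected:
    case kRpcSServerUnavailable:
    case kRpcSCallFailed:
    case kRpcSCallFailedDne:
        return true;
    default:
        // Includes RPC_E_CALL_REJECTED: a transient busy state, not a dead host.
        return false;
    }
}

static_assert(kRpcSCallFailed == 0x800706BEu, "Win32 1726 maps to 0x800706BE");
```

Keeping the classifier `constexpr` lets the mapping be checked at compile time, as the trailing `static_assert` does for `RPC_S_CALL_FAILED`.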