Windows Socket IO causes VM crash #495

Closed
akgrant43 opened this issue Oct 25, 2022 · 3 comments · Fixed by #497

Comments

@akgrant43
Contributor

Socket IO on Windows can crash the VM with an access violation due to a race condition on memory freeing in aioWin.c.

The sequence of events that can lead to the crash is:

  1. allHandles is malloc'd in aioPoll() and passed to sliceWaitForMultipleObjects().
  2. sliceWaitForMultipleObjects() stores a pointer to allHandles in sliceData->handles.
  3. waitHandlesThreadFunction() is then called from one or more threads and copies the data from allHandles.
  4. aioPoll() then waits for an event using WaitForMultipleObjectsEx().
  5. Once WaitForMultipleObjectsEx() returns, aioPoll() checks the results and frees allHandles.

However, this assumes that every thread gets a chance to run before WaitForMultipleObjectsEx() returns. In fact, WaitForMultipleObjectsEx() returns either after a timeout or as soon as the first thread signals an event, since it is called with bWaitAll = FALSE.

On a machine with only a couple of cores and a large number of open sockets, allHandles can be freed before every thread has had enough CPU time, causing an access violation.
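A minimal C sketch of the racy pattern described above (names, slicing, and error handling are simplified and hypothetical; the real code lives in aioWin.c): the workers read through pointers into the shared allHandles buffer, while the main thread frees that buffer as soon as its own WaitForMultipleObjectsEx() call returns, which may happen before every worker has been scheduled.

```c
#include <windows.h>
#include <stdlib.h>
#include <string.h>

#define HANDLES_PER_SLICE 8          /* illustrative slice size */

typedef struct {
    HANDLE *handles;                 /* points into the shared allHandles buffer */
    DWORD   count;
} SliceData;

static DWORD WINAPI waitHandlesThread(LPVOID arg)
{
    SliceData *slice = (SliceData *)arg;
    HANDLE local[HANDLES_PER_SLICE];
    /* RACE: by the time this thread is scheduled, the main thread may
     * already have freed the buffer that slice->handles points into. */
    memcpy(local, slice->handles, slice->count * sizeof(HANDLE));
    WaitForMultipleObjectsEx(slice->count, local, FALSE, INFINITE, TRUE);
    return 0;
}

/* The buggy pattern: error handling omitted for brevity. */
static void pollSketch(HANDLE *registered, DWORD total, DWORD timeoutMs)
{
    HANDLE *allHandles = malloc(total * sizeof(HANDLE));
    memcpy(allHandles, registered, total * sizeof(HANDLE));

    DWORD nSlices = (total + HANDLES_PER_SLICE - 1) / HANDLES_PER_SLICE;
    SliceData *slices = malloc(nSlices * sizeof(SliceData));
    HANDLE *threads = malloc(nSlices * sizeof(HANDLE));

    for (DWORD i = 0; i < nSlices; i++) {
        slices[i].handles = allHandles + i * HANDLES_PER_SLICE;
        slices[i].count   = (i == nSlices - 1)
            ? total - i * HANDLES_PER_SLICE : HANDLES_PER_SLICE;
        threads[i] = CreateThread(NULL, 0, waitHandlesThread, &slices[i], 0, NULL);
    }

    /* bWaitAll = FALSE: this returns on the FIRST signalled thread (or on
     * timeout), not once every worker has copied its slice. */
    WaitForMultipleObjectsEx(nSlices, threads, FALSE, timeoutMs, TRUE);

    /* Workers that have not been scheduled yet now read freed memory. */
    free(allHandles);
    free(slices);
    free(threads);
}
```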

The solution is to allocate a buffer per thread and copy the relevant portion of allHandles into it prior to spawning the thread.
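A sketch of that approach under the same hypothetical structure as above (this is not the actual patch in #497): each worker gets its own heap-allocated copy of its slice, filled in before CreateThread(), so freeing allHandles afterwards cannot invalidate memory a worker still needs.

```c
#include <windows.h>
#include <stdlib.h>
#include <string.h>

#define HANDLES_PER_SLICE 8

typedef struct {
    HANDLE handles[HANDLES_PER_SLICE];   /* private copy owned by the worker */
    DWORD  count;
} OwnedSlice;

static DWORD WINAPI waitOwnedHandlesThread(LPVOID arg)
{
    OwnedSlice *slice = (OwnedSlice *)arg;
    WaitForMultipleObjectsEx(slice->count, slice->handles, FALSE, INFINITE, TRUE);
    free(slice);                         /* safe: no other thread touches this buffer */
    return 0;
}

/* count must be <= HANDLES_PER_SLICE. The slice is copied BEFORE the thread
 * is spawned, so the caller may free allHandles at any time afterwards. */
static HANDLE spawnSliceWorker(const HANDLE *allHandles, DWORD offset, DWORD count)
{
    OwnedSlice *slice = malloc(sizeof(OwnedSlice));
    slice->count = count;
    memcpy(slice->handles, allHandles + offset, count * sizeof(HANDLE));

    HANDLE thread = CreateThread(NULL, 0, waitOwnedHandlesThread, slice, 0, NULL);
    if (thread == NULL)
        free(slice);                     /* thread never started: reclaim the copy */
    return thread;                       /* caller can wait on it, as aioPoll() does */
}
```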

Because not all threads get a chance to complete before WaitForMultipleObjectsEx() returns, socket IO may also go unrecognised, leaving the socket unread (or unwritten).

Polling all sockets even on timeout ensures that all IO is recognised. checkEventsInHandles() takes less than 1 ms, so the overhead is minimal, and essentially no work is done when no handles are registered for asynchronous IO.
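In the main wait loop this amounts, roughly, to running the handle scan unconditionally after the wait rather than only when a worker signalled an event. A sketch, with checkEventsInHandles() stubbed out (the real function is in aioWin.c; its exact signature here is an assumption):

```c
#include <windows.h>

/* Stand-in for the real checkEventsInHandles() in aioWin.c, which scans the
 * registered handles and signals any pending socket IO. */
static void checkEventsInHandles(void) { /* ... */ }

static void waitThenPoll(HANDLE *workerThreads, DWORD nThreads, DWORD timeoutMs)
{
    DWORD result = WaitForMultipleObjectsEx(nThreads, workerThreads,
                                            FALSE, timeoutMs, TRUE);

    /* Poll unconditionally: even on WAIT_TIMEOUT, or when only the first
     * worker woke up, IO pending on the remaining handles is still noticed.
     * The scan is cheap (under 1 ms) and does nothing when no handles are
     * registered for asynchronous IO. */
    checkEventsInHandles();

    if (result == WAIT_TIMEOUT) {
        /* no worker signalled within timeoutMs; the poll above still ran */
    }
}
```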

A PR is on the way.

akgrant43 added a commit to akgrant43/opensmalltalk-vm that referenced this issue Oct 25, 2022
Socket IO on Windows can crash the VM with an access violation due to a race condition on memory freeing in aioWin.c.

Fixes: pharo-project#495
@guillep
Member

guillep commented Oct 25, 2022

Is this the bug you talked to me about at ESUG? Super good catch!

@chisandrei
Contributor

chisandrei commented Oct 25, 2022

In case someone wants to reproduce this kind of crash locally, we were using the setup here: https://github.com/feenkcom/pharo-socket-stability-experiment/tree/main/rust-worker. The idea there is to spawn thousands of small workers in Rust that connect through sockets to a runner in Pharo.
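For illustration only, here is a rough Winsock equivalent of that load pattern in C (the actual workers in the linked repository are written in Rust; the port and connection count below are placeholders): open a large number of client connections against a listener in the Pharo image and hold them open while the image services the IO.

```c
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>

#pragma comment(lib, "ws2_32.lib")

#define NUM_CLIENTS 1000                     /* placeholder worker count */
#define RUNNER_PORT 8080                     /* placeholder Pharo listener port */

int main(void)
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(RUNNER_PORT);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    SOCKET socks[NUM_CLIENTS];
    int nOpen = 0;
    for (int i = 0; i < NUM_CLIENTS; i++) {
        SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (s == INVALID_SOCKET)
            break;
        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            closesocket(s);
            continue;
        }
        send(s, "ping", 4, 0);               /* small request, like the workers */
        socks[nOpen++] = s;                  /* hold the connection open */
    }
    printf("%d connections open\n", nOpen);

    Sleep(60 * 1000);                        /* keep the handle count high while
                                                the Pharo side services the IO */
    for (int i = 0; i < nOpen; i++)
        closesocket(socks[i]);
    WSACleanup();
    return 0;
}
```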

@akgrant43
Contributor Author

Hi @guillep,

Yes, this is the issue I mentioned. :-)

Prior to the fix, if we had about 5 incoming sockets there was a reasonable chance the VM would crash. By the time that got up to 25 sockets it was almost certain to crash.

With the fix we regularly have 960 open incoming connections, and so far no crashes.

tesonep added a commit that referenced this issue Dec 14, 2022
Windows Socket IO causes VM crash (Issue #495)