Windows Socket IO causes VM crash #495

Closed
akgrant43 opened this issue Oct 25, 2022 · 3 comments · Fixed by #497

Comments

@akgrant43
Contributor

Socket IO on Windows can crash the VM with an access violation due to a race condition on memory freeing in aioWin.c.

The sequence of events that can lead to the crash is:

  1. allHandles is malloc'd in aioPoll() and passed to sliceWaitForMultipleObjects().
  2. sliceWaitForMultipleObjects() stores a pointer to allHandles in sliceData->handles.
  3. waitHandlesThreadFunction() is then called from one or more threads and copies the data from allHandles.
  4. aioPoll() then waits for an event using WaitForMultipleObjectsEx().
  5. Once WaitForMultipleObjectsEx() returns, aioPoll() checks the results and frees allHandles.

However, this assumes that every thread gets a chance to run before WaitForMultipleObjectsEx() returns. In fact, WaitForMultipleObjectsEx() returns either after a timeout or as soon as the first thread signals an event, since it is called with bWaitAll = FALSE.

On a machine with only a couple of cores and a large number of open sockets, allHandles can be freed before every thread has had enough CPU time, causing an access violation.
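A minimal C sketch of the racy pattern described above (names, slicing, and error handling are simplified and hypothetical; the real code lives in aioWin.c): the workers read through pointers into the shared allHandles buffer, while the main thread frees that buffer as soon as its own WaitForMultipleObjectsEx() call returns, which may happen before every worker has been scheduled.

```c
#include <windows.h>
#include <stdlib.h>
#include <string.h>

#define HANDLES_PER_SLICE 8          /* illustrative slice size */

typedef struct {
    HANDLE *handles;                 /* points into the shared allHandles buffer */
    DWORD   count;
} SliceData;

static DWORD WINAPI waitHandlesThread(LPVOID arg)
{
    SliceData *slice = (SliceData *)arg;
    HANDLE local[HANDLES_PER_SLICE];
    /* RACE: by the time this thread is scheduled, the main thread may
     * already have freed the buffer that slice->handles points into. */
    memcpy(local, slice->handles, slice->count * sizeof(HANDLE));
    WaitForMultipleObjectsEx(slice->count, local, FALSE, INFINITE, TRUE);
    return 0;
}

/* The buggy pattern: error handling omitted for brevity. */
static void pollSketch(HANDLE *registered, DWORD total, DWORD timeoutMs)
{
    HANDLE *allHandles = malloc(total * sizeof(HANDLE));
    memcpy(allHandles, registered, total * sizeof(HANDLE));

    DWORD nSlices = (total + HANDLES_PER_SLICE - 1) / HANDLES_PER_SLICE;
    SliceData *slices = malloc(nSlices * sizeof(SliceData));
    HANDLE *threads = malloc(nSlices * sizeof(HANDLE));

    for (DWORD i = 0; i < nSlices; i++) {
        slices[i].handles = allHandles + i * HANDLES_PER_SLICE;
        slices[i].count   = (i == nSlices - 1)
            ? total - i * HANDLES_PER_SLICE : HANDLES_PER_SLICE;
        threads[i] = CreateThread(NULL, 0, waitHandlesThread, &slices[i], 0, NULL);
    }

    /* bWaitAll = FALSE: this returns on the FIRST signalled thread (or on
     * timeout), not once every worker has copied its slice. */
    WaitForMultipleObjectsEx(nSlices, threads, FALSE, timeoutMs, TRUE);

    /* Workers that have not been scheduled yet now read freed memory. */
    free(allHandles);
    free(slices);
    free(threads);
}
```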

The solution is to allocate a buffer per thread and copy the relevant portion of allHandles into it prior to spawning the thread.
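A sketch of that approach under the same hypothetical structure as above (this is not the actual patch in #497): each worker gets its own heap-allocated copy of its slice, filled in before CreateThread(), so freeing allHandles afterwards cannot invalidate memory a worker still needs.

```c
#include <windows.h>
#include <stdlib.h>
#include <string.h>

#define HANDLES_PER_SLICE 8

typedef struct {
    HANDLE handles[HANDLES_PER_SLICE];   /* private copy owned by the worker */
    DWORD  count;
} OwnedSlice;

static DWORD WINAPI waitOwnedHandlesThread(LPVOID arg)
{
    OwnedSlice *slice = (OwnedSlice *)arg;
    WaitForMultipleObjectsEx(slice->count, slice->handles, FALSE, INFINITE, TRUE);
    free(slice);                         /* safe: no other thread touches this buffer */
    return 0;
}

/* count must be <= HANDLES_PER_SLICE. The slice is copied BEFORE the thread
 * is spawned, so the caller may free allHandles at any time afterwards. */
static HANDLE spawnSliceWorker(const HANDLE *allHandles, DWORD offset, DWORD count)
{
    OwnedSlice *slice = malloc(sizeof(OwnedSlice));
    slice->count = count;
    memcpy(slice->handles, allHandles + offset, count * sizeof(HANDLE));

    HANDLE thread = CreateThread(NULL, 0, waitOwnedHandlesThread, slice, 0, NULL);
    if (thread == NULL)
        free(slice);                     /* thread never started: reclaim the copy */
    return thread;                       /* caller can wait on it, as aioPoll() does */
}
```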

Because not all threads get a chance to complete before WaitForMultipleObjectsEx() returns, socket IO may also go unrecognised, leaving the socket unread (or unwritten).

Polling all sockets even on timeout ensures that all IO is recognised. checkEventsInHandles() takes less than 1 ms, so the overhead is minimal, and essentially no work is done when no handles are registered for asynchronous IO.
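In the main wait loop this amounts, roughly, to running the handle scan unconditionally after the wait rather than only when a worker signalled an event. A sketch, with checkEventsInHandles() stubbed out (the real function is in aioWin.c; its exact signature here is an assumption):

```c
#include <windows.h>

/* Stand-in for the real checkEventsInHandles() in aioWin.c, which scans the
 * registered handles and signals any pending socket IO. */
static void checkEventsInHandles(void) { /* ... */ }

static void waitThenPoll(HANDLE *workerThreads, DWORD nThreads, DWORD timeoutMs)
{
    DWORD result = WaitForMultipleObjectsEx(nThreads, workerThreads,
                                            FALSE, timeoutMs, TRUE);

    /* Poll unconditionally: even on WAIT_TIMEOUT, or when only the first
     * worker woke up, IO pending on the remaining handles is still noticed.
     * The scan is cheap (under 1 ms) and does nothing when no handles are
     * registered for asynchronous IO. */
    checkEventsInHandles();

    if (result == WAIT_TIMEOUT) {
        /* no worker signalled within timeoutMs; the poll above still ran */
    }
}
```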

A PR is on the way.

akgrant43 added a commit to akgrant43/opensmalltalk-vm that referenced this issue Oct 25, 2022
Socket IO on Windows can crash the VM with an access violation due to a race condition on memory freeing in aioWin.c.

Fixes: pharo-project#495
@guillep
Member

guillep commented Oct 25, 2022

Is this the bug you talked to me about at ESUG? Super good catch!

@chisandrei
Contributor

chisandrei commented Oct 25, 2022

In case someone wants to reproduce this kind of crash locally, we were using the setup here: https://github.com/feenkcom/pharo-socket-stability-experiment/tree/main/rust-worker. The idea there is to spawn thousands of small workers in Rust that connect through sockets to a runner in Pharo.
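For illustration only, here is a rough Winsock equivalent of that load pattern in C (the actual workers in the linked repository are written in Rust; the port and connection count below are placeholders): open a large number of client connections against a listener in the Pharo image and hold them open while the image services the IO.

```c
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>

#pragma comment(lib, "ws2_32.lib")

#define NUM_CLIENTS 1000                     /* placeholder worker count */
#define RUNNER_PORT 8080                     /* placeholder Pharo listener port */

int main(void)
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(RUNNER_PORT);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    SOCKET socks[NUM_CLIENTS];
    int nOpen = 0;
    for (int i = 0; i < NUM_CLIENTS; i++) {
        SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (s == INVALID_SOCKET)
            break;
        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            closesocket(s);
            continue;
        }
        send(s, "ping", 4, 0);               /* small request, like the workers */
        socks[nOpen++] = s;                  /* hold the connection open */
    }
    printf("%d connections open\n", nOpen);

    Sleep(60 * 1000);                        /* keep the handle count high while
                                                the Pharo side services the IO */
    for (int i = 0; i < nOpen; i++)
        closesocket(socks[i]);
    WSACleanup();
    return 0;
}
```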

@akgrant43
Contributor Author

Hi @guillep,

Yes, this is the issue I mentioned. :-)

Prior to the fix, if we had about 5 incoming sockets there was a reasonable chance the VM would crash. By the time that got up to 25 sockets it was almost certain to crash.

With the fix we regularly have 960 open incoming connections, and so far no crashes.

tesonep added a commit that referenced this issue Dec 14, 2022
Windows Socket IO causes VM crash (Issue #495)