Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition between Apple SSL destroy and event callback #2825

Merged
merged 22 commits into from Feb 9, 2022

Conversation

sauwming
Copy link
Member

A race condition can occur between destroying of Apple SSL socket and its event callback. Specifically, the race happens between ssl_destroy() and ssl_network_event_poll() in ssl_sock_apple.m. ssl_destroy() will try to remove all events that belong to the socket by calling event_manager_remove_events(ssock), but at the same time, the event polling may be in the process of calling those event's callback.

Unfortunately, there doesn't seem to be a simple solution for this. Holding the event manager's lock when calling the callbacks will cause deadlock, while adding reference to the socket's group lock prior to calling the callback may have been too late since the destroying process has already started (in fact, ssl_destroy() is called by the group lock's destructor handler).

Note: Since event callback is unique to Apple SSL, this shouldn't affect other SSL backends.

@nanangizz
Copy link
Member

What about introducing flags such as is_destroying & is_event_running, and both operations (destroy & callback) should check & update the flags inside the lock, can this trick help?

@sauwming
Copy link
Member Author

Okay, it's finally ready for review.

A few notes:

  • is_event_running
    I was thinking the same thing. We used cb_mutex in pjmedia_event, but as with any mutex, it causes deadlock such as the one described in Avoid deadlock on unsubscribe event #2771. So I thought of using a flag this time. But ssl_network_event_poll() is called by pj_ioqueue_poll() and it can be multithreaded, so a boolean flag won't work, while an integer counter will be unfeasible (waiting for the event counter to be zero before destroying may never happen under busy load).

  • is_destroying
    I can't find any place to set and check this flag to reliably and completely prevent the race. The destruction can happen any time when the callback is being called. So checking prior to the callback will not totally eliminate the problem, but may help filter further event postings. So yes, this flag does has some use.

The ideal solution, I believe, is to use the group lock, similar to timer heap. There is one complication, though. The correct place to add the group lock's ref, in order to prevent ssl sock being destroyed while in the event list, is when posting the event (similar to timer when scheduling). But event_manager_post_event() is called within a gcd, which means we cannot call pj_grp_lock_add_ref() there, unless we do thread registration. But the point of using our own event dispatch is to avoid this thread registration! (see the super long discussion here: #2482 (review))

So, I'm forced to add the ref in the polling instead. There's still (theoretically) a tiny window when ssl socket can be destroyed right before adding the group lock's ref, but with the help of the is_closing flag, hopefully the chance is eliminated.

@sauwming sauwming mentioned this pull request Dec 27, 2021
@sauwming sauwming merged commit 74c0309 into master Feb 9, 2022
@sauwming sauwming deleted the ssl_apple-race branch February 9, 2022 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants