Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demonstrates possible thread safety or resource exhaustion issue on OSX #169

Closed
wants to merge 6 commits into from
Closed

Conversation

neallester
Copy link

Description

This change adds a test which demonstrates a possible thread safety or resource exhaustion issue on OSX. If the issue occurs on linux it does not occur as frequently as on OSX (it did not occur in 8 test runs).

On my OSX system (which is accessing a remote mongo instance on cloud.mongodb.com) running this test reliably produces the exception provided below.

Evidence pointing toward thread issue as opposed to resource exhaustion:

The issue seems to occur even though the test has a reasonable number of sockets open at time of failure (8 - 20). This is the number of sockets reported by netstat -vnp tcp | grep pid where PID is the value reported by Activity Monitor for the process named xctest.

xctest(11165,0x700007cc8000) malloc: *** error for object 0x102493c70: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug

#0 0x00007fff60c90b6e in __pthread_kill ()
#1 0x00007fff60e5b080 in pthread_kill ()
#2 0x00007fff60bec1ae in abort ()
#3 0x00007fff60cea822 in free ()
#4 0x00007fff44d3a920 in SSLRecordServiceWriteQueueInternal ()
#5 0x00007fff44b538a7 in SSLServiceWriteQueue ()
#6 0x00007fff44b57150 in SSLRead ()
#7 0x00000001047db637 in closure #5 in MongoSocket.init(address:port:options:onRead:onError:) at /Users/neal/xcode/MongoKitten/Sources/MongoSocket/Socket.swift:290
#8 0x00000001047dba01 in partial apply for closure #5 in MongoSocket.init(address:port:options:onRead:onError:) ()
#9 0x00000001047dba3d in thunk for @escaping @callee_guaranteed () -> () ()
#10 0x00007fff60b1ce0e in _dispatch_block_async_invoke2 ()
#11 0x00007fff60b06db8 in _dispatch_client_callout ()
#12 0x00007fff60b19e81 in _dispatch_continuation_pop ()
#13 0x00007fff60b09081 in _dispatch_source_invoke ()
#14 0x00007fff60b1b07a in _dispatch_queue_serial_drain ()
#15 0x00007fff60b0e166 in _dispatch_queue_invoke ()
#16 0x00007fff60b1bf0d in _dispatch_root_queue_drain_deferred_wlh ()
#17 0x00007fff60b1fd21 in _dispatch_workloop_worker_thread ()
#18 0x00007fff60e57fd2 in _pthread_wqthread ()
#19 0x00007fff60e57be9 in start_wqthread ()

Checklist:

  • [N/A ] If applicable, I have updated the documentation accordingly.
  • If applicable, I have added tests to cover my changes.

@Joannis
Copy link
Member

Joannis commented Jun 12, 2018

I think the resource exhaustion is caused by the test, not MongoKitten, since you're creating a concurrent queue This is an issue we've had in many places/libs (Vapor, MK3) before.

let workQueue = DispatchQueue (label: "work", attributes: .concurrent)

Then after creating a concurrent queue you dispatch 500 async operations.

while trialCounter < 500 {
    ...
    workQueue.async { ... }
}

My reasoning

This ends up creating (up to) 500 threads to handle the urgent need for concurrency. Meaning that whatever little resources there are left for MongoKitten, they're unlikely to get a thread. This means that the 500 threads are now waiting for a MK4 which is waiting for those threads 😬

There may be other reasons to cause this to fail, though. But the above is my best guess.

Solution

  • Remove the .concurrent flag from the DispatchQueue
  • Use less async calls in general.

Creating 500 threads (and their overhead) won't benefit you, since they all boil back down to the (8-20) threads MongoKitten creates and their associated sockets. Creating more than this many threads waiting for MongoKitten can be problematic and has (for this reason) influenced us changing the architecture with MK5 to a single async (non-blocking) connection.

@neallester
Copy link
Author

The actual behavior of the test is not entirely consistent with the idea that threads are the resource being exhausted. If it were a thread exhaustion problem you would expect to see the number of threads rise (quickly) until the limit was reached and then the test fail. This is not what occurs. What occurs is that during execution the number of threads (almost) instantly reaches a specific value and then remains constant (on my system 79 - 80 threads). The test then runs for a very variable duration until the failure occurs. Also, it is possible to trigger the same error with fewer async calls although it doesn't occur as reliably which is why I added more async calls. I understand that GCD performs async calls using a thread pool so 1 async call != 1 thread, and it appears that in this case GCD is correctly managing the thread pool by queuing requests until a thread is available to execute them. Also, the failure occurs on OSX where GCD is very mature.

This behavior, and the error message, are more consistent with the idea that occasionally a thread will release a socket after another thread has started to release the same socket but before the release has completed. Whichever thread looses the race to finish the release throws the error.

However, it is also possible that the system is having a problem creating a socket (due to resource exhaustion) but somehow doesn't raise an error until it goes to release the socket. This seems unlikely because of the relatively small number of sockets with open connections (8-20) at the time of failure.

@Joannis
Copy link
Member

Joannis commented Jun 19, 2018

Sorry for the late reply.

What occurs is that during execution the number of threads (almost) instantly reaches a specific value and then remains constant (on my system 79 - 80 threads).
This much makes sense, having seen the code. You're creating 500 async requests, which quickly hits a limit in Dispatch. At this point we have the main thread + the 5-20 MongoKitten threads for the connections and 59-39 threads for your requests.

The test then runs for a very variable duration until the failure occurs.

Slowly but surely the possibility arises that when MongoKitten completed a request and is now idle, the thread is rescheduled for your task (executing a Query to MongoKitten). This may not always happen, but it's not unlikely either that MongoKitten's precense gets swamped by yours.

Not sure if it works, but a possible fix could be to execute your queries asynchronously on a background thread, not a normal one. Another thing you could try is not asynchronously dispatching this many threads and seeing what happens.

However

Having had a second look at your problem, it may indeed be a thread safety issue because you're doing requests from many simultaneous threads and the crash is in the SSL Connection. It may be free-ing the same pointer twice somehow. Definitely worth an investigation.

@Joannis Joannis closed this Nov 11, 2019
@neallester neallester deleted the socketIssue branch November 11, 2019 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants