Careful SMP locking - Fix very occasional hangs
Louis Zulli reported that Stockfish suffers from very occasional hangs on his 20-core machine.

Careful SMP debugging revealed that this was caused by a "ghost split point slave": a thread was marked as a split point slave, but wasn't actually working on it. The only logical explanation for this is double booking, where due to an SMP race the same thread is booked for two different split points simultaneously. Due to the very intermittent nature of the problem, we can't say exactly how this happens.

The current handling of thread-specific variables is risky, though. Volatile variables are in some cases changed without the spinlock being held. In that case the standard doesn't give us any kind of guarantee about how the updated values are propagated to other threads.

We resolve the situation by enforcing very strict locking rules:
- Values of key thread variables (splitPointsSize, activeSplitPoint, searching) can only be changed while the thread-specific spinlock is held.
- Structural changes to splitPoints[] are only allowed while the thread-specific spinlock is held.
- Thread booking decisions (per split point) can only be made while the thread-specific spinlock is held.

With these changes, hangs no longer occurred during two days of torture testing on Zulli's machine.

We probably pay a slight performance penalty in SMP mode due to the extra locking.

STC (7 threads):
ELO: -1.00 +-2.2 (95%) LOS: 18.4%
Total: 30000 W: 4538 L: 4624 D: 20838

However, stability is worth more than 1-2 Elo points in this case.

No functional change

Resolves #422
613dc66
@zamar, I see that one of the synchronization barriers added here was part of the original SF-6 code, but it was removed in this patch:
81c7975
i.e. the one before decrementing splitPointsSize.
I guess you decided to remove it because searching == true, which means no other thread will try to book this thread.
But I was wondering: if the objective is to make the volatile variable 'searching' visible to other threads, isn't it more appropriate to add a write memory barrier on line 208? (Or is this automatically taken care of for volatile variables?) How does adding a spinlock.acquire() here make a difference? I'm confused... Can you help me understand why you removed it and then added it back again?
PS: I'm not an expert in multithreaded programming :-).
@zamar, ok, I understand... one more question:
sp.spinlock.acquire(); --> line 199 of thread.cpp
sp.spinlock.release(); --> line 220 of thread.cpp
Do you see a scenario where this is really required? All slaves have already exited the split point and allSlavesSearching == false, so I don't see a need for this either, since no other thread will now try to modify the split point variables. But I may be missing something.
Overall, I see a need for the two acquire/release pairs in search.cpp (lines 1625 and 1637),
but I presume the remaining ones added in this patch are just for additional safety. Can you confirm?
I'm hoping to write a faster patch, so I'm trying to understand which additions you think are really required and which were just for safety...
@zamar, yes, of course I understand... you tried the safest approach and it worked the very first time... it's tough to narrow this down without a high-end machine to try out different things... Let me think more about this and see if I can improve things. Thanks a lot for your input!
@zamar, I've submitted a new test to really speed up splitting and late joining (if it works!). I think it should also reduce any hang scenario:
http://tests.stockfishchess.org/tests/view/55f32bce0ebc5976a2d6db5f
If this doesn't work, then we'll have to rely on @LouisZulli's tests to narrow down what causes the hang.
Instead of preventing a hang, can't you make something to undo a hang?
The logic is that you're paying a lot to prevent something that is in fact rare. A better solution might be to just let it occur and then, from time to time, run a simple check that undoes the hang and syncs the threads again. Sorry if I'm talking about the impossible; I don't know whether anything can still be done once a hang has occurred.
Right now the patch is gutted, and there is evidence (from the Sugar engine) that the patch as it was really adds a lot of Elo.