Careful SMP locking - Fix very occasional hangs
Louis Zulli reported that Stockfish suffers from very occasional hangs on his 20-core machine.

Careful SMP debugging revealed that this was caused by a "ghost split point slave", where a thread
was marked as a split point slave but wasn't actually working on it.

The only logical explanation for this was double booking, where, due to an SMP race, the same thread
is booked for two different split points simultaneously.

Due to the very intermittent nature of the problem, we can't say exactly how this happens.

The current handling of thread-specific variables is risky, though. Volatile variables are in some
cases changed without the spinlock being held. In that case the standard doesn't give us any
guarantees about how the updated values are propagated to other threads.

We resolve the situation by enforcing very strict locking rules (a sketch of the pattern follows the list):
- Values of key thread variables (splitPointsSize, activeSplitPoint, searching)
  can only be changed when the thread-specific spinlock is held.
- Structural changes to splitPoints[] are only allowed when the thread-specific spinlock is held.
- Thread booking decisions (per split point) can only be made when the thread-specific spinlock is held.
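
As a minimal illustration of the intended pattern, here is a hypothetical sketch. The Spinlock stand-in and the helper functions are invented for this example; only the member names (splitPointsSize, activeSplitPoint, searching) mirror the real code.

#include <atomic>

// Hypothetical stand-in for the Stockfish Spinlock class.
struct Spinlock {
    std::atomic<bool> locked{false};
    void acquire() { while (locked.exchange(true, std::memory_order_acquire)) {} }
    void release() { locked.store(false, std::memory_order_release); }
};

struct SplitPoint;  // defined elsewhere in the real code

struct Thread {
    Spinlock    spinlock;                    // the thread-specific lock
    int         splitPointsSize = 0;
    SplitPoint* activeSplitPoint = nullptr;
    bool        searching = false;

    // Every change to the key variables happens with the lock held,
    // so the release also publishes the new values to other threads.
    void leave_split_point(SplitPoint* parent) {   // hypothetical helper
        spinlock.acquire();
        --splitPointsSize;                   // structural change under the lock
        activeSplitPoint = parent;
        spinlock.release();
    }

    void mark_searching(bool b) {            // hypothetical helper
        spinlock.acquire();
        searching = b;                       // 'searching' only changes under the lock
        spinlock.release();
    }
};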

With these changes, hangs no longer occurred during two days of torture testing on Zulli's machine.

We probably have a slight performance penalty in SMP mode due to more locking.

STC (7 threads):
ELO: -1.00 +-2.2 (95%) LOS: 18.4%
Total: 30000 W: 4538 L: 4624 D: 20838

However, stability is worth more than 1-2 Elo points in this case.

No functional change

Resolves #422
zamar committed Sep 10, 2015
1 parent 3e2591d commit 613dc66
Showing 2 changed files with 8 additions and 2 deletions.
src/search.cpp: 2 additions, 0 deletions
@@ -1622,6 +1622,7 @@ void Thread::idle_loop() {
else
assert(false);

spinlock.acquire();
assert(searching);

searching = false;
@@ -1633,6 +1634,7 @@ void Thread::idle_loop() {
// After releasing the lock we can't access any SplitPoint related data
// in a safe way because it could have been released under our feet by
// the sp master.
spinlock.release();
sp->spinlock.release();

// Try to late join to another split point if none of its slaves has
src/thread.cpp: 6 additions, 2 deletions
@@ -145,6 +145,7 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
SplitPoint& sp = splitPoints[splitPointsSize];

sp.spinlock.acquire(); // No contention here until we don't increment splitPointsSize
spinlock.acquire();

sp.master = this;
sp.parentSplitPoint = activeSplitPoint;
@@ -167,6 +168,7 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
++splitPointsSize;
activeSplitPoint = &sp;
activePosition = nullptr;
spinlock.release();

// Try to allocate available threads
Thread* slave;
@@ -194,6 +196,9 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes

Thread::idle_loop(); // Force a call to base class idle_loop()

sp.spinlock.acquire();
spinlock.acquire();

// In the helpful master concept, a master can help only a sub-tree of its
// split point and because everything is finished here, it's not possible
// for the master to be booked.
@@ -205,15 +210,14 @@ void Thread::split(Position& pos, Stack* ss, Value alpha, Value beta, Value* bes
// We have returned from the idle loop, which means that all threads are
// finished. Note that decreasing splitPointsSize must be done under lock
// protection to avoid a race with Thread::can_join().
sp.spinlock.acquire();

--splitPointsSize;
activeSplitPoint = sp.parentSplitPoint;
activePosition = &pos;
pos.set_nodes_searched(pos.nodes_searched() + sp.nodes);
*bestMove = sp.bestMove;
*bestValue = sp.bestValue;

spinlock.release();
sp.spinlock.release();
}
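
Condensed, the acquire/release ordering that Thread::split() ends up with after this patch looks roughly like the sketch below. The types and split_sketch() are hypothetical simplifications; only the lock ordering mirrors the diff above.

#include <atomic>

// Hypothetical stand-in types; the real classes carry much more state.
struct Spinlock {
    std::atomic<bool> locked{false};
    void acquire() { while (locked.exchange(true, std::memory_order_acquire)) {} }
    void release() { locked.store(false, std::memory_order_release); }
};

struct SplitPoint { Spinlock spinlock; };

struct Thread {
    Spinlock spinlock;
    int splitPointsSize = 0;

    void split_sketch(SplitPoint& sp) {
        sp.spinlock.acquire();   // protect the split point itself
        spinlock.acquire();      // protect this thread's own variables
        ++splitPointsSize;       // structural changes under both locks
        spinlock.release();      // slaves may now be booked onto sp
        sp.spinlock.release();   // (in the real code: after booking slaves)

        // ... slaves search; this thread runs its idle loop ...

        sp.spinlock.acquire();   // re-acquire both locks in the same order
        spinlock.acquire();
        --splitPointsSize;       // tear-down also under both locks
        spinlock.release();
        sp.spinlock.release();   // release in reverse order
    }
};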


9 comments on commit 613dc66

@ajithcj

@zamar, I see that one of the synchronization barriers added here was part of the original SF-6 code, but it was removed in this patch:
81c7975

i.e. the one before decrementing splitPointsSize.
I guess you decided to remove it because searching is true, which means no other thread will try to book this thread.

But I was wondering: if the objective is to make the volatile variable 'searching' visible to other threads, isn't it more appropriate to add a write memory barrier on line 208? (Or is this automatically taken care of for volatile variables?) How does adding a spinlock.acquire() here make a difference? I'm confused... Can you help me understand why you removed it and added it back again?

PS: I'm not an expert in multithreaded programming :-).
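
(A generic illustration of the distinction being asked about, using standard C++11 primitives; this is hypothetical example code, not from Stockfish.)

#include <atomic>

int payload = 0;                        // data another thread should see

volatile bool ready_volatile = false;   // plain volatile: no inter-thread ordering
std::atomic<bool> ready_atomic{false};  // atomic: well-defined ordering

void publish_volatile() {
    payload = 42;
    // In ISO C++ a volatile write is not a synchronization operation:
    // another thread may see ready_volatile == true yet still read a stale payload.
    ready_volatile = true;
}

void publish_release() {
    payload = 42;
    // A release store (or a release fence) guarantees that payload is visible
    // to a thread that reads ready_atomic with acquire semantics and sees true.
    ready_atomic.store(true, std::memory_order_release);
}

// Acquiring and releasing a spinlock around the writes has the same effect,
// because the lock itself is built on acquire/release atomics; that is what
// routing every update through the thread's spinlock buys in this patch.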

@zamar (Author) commented on 613dc66 Sep 11, 2015 via email

@ajithcj

@zamar, ok, I understand... one more question:
sp.spinlock.acquire(); --> line 199 of thread.cpp
sp.spinlock.release(); --> line 220 of thread.cpp

Do you see a scenario where this pair is really required? All slaves have already exited the split point and allSlavesSearching is false, so I don't see a need for it: no other thread will now try to modify split point variables. But I may be missing something.

Overall, I see a need for the two acquire/release pairs in search.cpp (lines 1625 and 1637), but I presume the remaining ones added in this patch are just for additional safety. Can you confirm?

I'm hoping to write a faster patch, so I'm trying to understand which additions you think are really required and which are just for safety...

@zamar (Author) commented on 613dc66 Sep 11, 2015 via email

@zamar (Author) commented on 613dc66 Sep 11, 2015 via email

@ajithcj

@zamar, yes, of course I understand... you tried the safest approach and it worked the very first time... it's tough to narrow this down without a high-end machine to try out different things... Let me think more about this and see if I can improve things. Thanks a lot for your input!

@ajithcj

@zamar, I've submitted a new test to really speed up splitting and late joining (if it works!). I think it should also reduce any hang scenario:
http://tests.stockfishchess.org/tests/view/55f32bce0ebc5976a2d6db5f
If this doesn't work, then we'll have to rely on @LouisZulli's tests to narrow down what causes the hang.

@AlexandreMasta

Instead of preventing a hang, can't you do something to undo a hang?
The logic is that you are spending a lot to prevent something that is in fact rare. A better solution might be to just let it occur and then, from time to time, run a simple check, undo the hang, and sync the threads again. Sorry if I'm talking about the impossible; I don't know whether anything can be done once a hang has occurred.

@AlexandreMasta

Right now the patch is gutted, and there is evidence (the Sugar engine) that the patch as it originally was really adds a lot of Elo.
