-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduction of the hang #2309
Comments
Example of output: |
Well, just to note that their implementation is simply wrong. Correct ones must ensure a lock is held during the period of unlocking external lock, entering wait and other notify calls, that is probably not possible without an internal lock in the conditional variable implementation itself. |
Example of output with the deterministic hang with one thread: |
@noobpwnftw |
Their recent master branch has reverted the changes that broke their implementation, however no releases have been made yet, these bugged ones will be out in the wild.
Best is just to advice users avoid using MINGW once and for all. |
In the example I have the hang is with only one thread? |
@snicolet With |
@snicolet I'm trying to work with you to figure out the problem with hanging I reported in #2305. You said to compile with DEBUG_HANG 1, USE_CUSTOM_CONDITION_VARIABLE 1 Where do I specify these commands? I compile with: make profile-build ARCH=x86-64-bmi2 COMP=gcc -j 36 And can I compile any version that hangs for me or only for this one: https://github.com/snicolet/Stockfish/tree/hang3 |
@hero2017 |
I think making |
You are talking about this commit? |
given that mingw fixed the bug in their implementation, should we just document somewhere which version of mingw is working, which is known buggy, and move on? I think it is not a good idea to try and reimplement something like a condition_variable, because on one system the implementation was buggy. |
On the other hand, snicolet/Stockfish@7da2b2b is quite a simplification, removing the mutex and the condition variable. It is much easier to reason about the code. |
I agree that's easier to follow, and at first sight, it could just be tested as a simplification. However, the condition_variable implies (following cppreference) that the thread will be suspended, while this_thread::yield() is an implementation-dependent scheduler hint. Are we sure that we won't be consuming resources of the opponent while we're waiting for our next 'go' command (on multiple implementations)? I would be happy if this were the case. |
@vondele @noobpwnftw You are right, my atomic(bool) version is not good enough. It is not a problem of implicit mutex around the atomic (atomic(bool) is not guaranted to be lock-free by the standard, but in practice it is lock-free on all modern CPUs, see But the semi-buzy waiting loops are consuming a little bit more of CPU than I expected, because it is not really blocking threads but doing a semi-active waiting loop. I called thread::yield to be polite to other threads but I tested it to be significantly more than 0% CPU time even in idle state. It would pass the simplification test easily against master, because it's cheating on the scheduler :-) |
I am now convinced too that the best solution is to document the library bug somewhere in our wiki on the page "building Stockfish on Windows", to provide links to proper versions of the DLL and to move on. It's a pity for the clarity of the atomic(bool) code :-) |
IMO the original version is also very clear a demonstration of how to use signaling instead of a polling loop. Pity is that some implementations broke the standard, and we are affected. |
I'd suggest replacing custom mutex with standard mutex (I'm sure there is no measurable performance difference at all, in context of how it's currently used in SF. Yes, if you run that mutex lock million times in a loop, there will be a difference, but SF doesn't do that). And replace |
@mooskagh, I agree with that (and made a branch #2307 (comment) ) however, it was tested to be 2.5% slower (#2291 (comment) ). Personally, I would take the little performance hit. If not, we should at least augment our Mutex implementation with the needed deletion of operator= |
@vondele I however fail to see why using standard mutex could cause a slowdown: |
no idea, I was also not expecting any significant change. It is indeed not on a hot path. Maybe it requires one more round of testing. Edit: note the bench was done with a default depth of 13.... that's very low, especially threaded, and maybe more sensitive to overhead. I think should be repeated with e.g. depth 20 or so. |
@mooskagh I agree to use more fool-proof methods within STL, yet this time they messed up |
@vondele @mooskagh new speedup test master vs noCustomMutex using dynamic binaries w/ msys2 g++, older libwinpthread $ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex.exe 10
run base test diff
1 1821485 1818860 -2625
2 1814875 1813199 -1676
3 1817987 1814254 -3733
4 1827135 1812951 -14184
5 1817800 1823239 +5439
6 1825939 1819672 -6267
7 1827261 1805849 -21412
8 1820172 1818923 -1249
9 1825814 1827891 +2077
10 1826065 1813757 -12308
base = 1822453 +/- 2822
test = 1816859 +/- 3836
diff = -5593 +/- 5046
speedup = -0.003069 same result using $ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex.exe 50
run base test diff
1 1935737 1943663 +7926
2 1945433 1944548 -885
3 1949873 1948984 -889
4 1944548 1938372 -6176
5 1947207 1940133 -7074
6 1945433 1948984 +3551
7 1945433 1938372 -7061
8 1949873 1947207 -2666
9 1946320 1948095 +1775
10 1941896 1938372 -3524
11 1945433 1946320 +887
12 1948095 1942779 -5316
13 1940133 1946320 +6187
14 1942779 1948095 +5316
15 1951655 1948095 -3560
16 1949873 1948984 -889
17 1944548 1939252 -5296
18 1941896 1945433 +3537
19 1928746 1946320 +17574
20 1949873 1943663 -6210
21 1948095 1946320 -1775
22 1947207 1949873 +2666
23 1941896 1947207 +5311
24 1949873 1943663 -6210
25 1944548 1944548 +0
26 1945433 1944548 -885
27 1948095 1942779 -5316
28 1946320 1948984 +2664
29 1947207 1942779 -4428
30 1933985 1946320 +12335
31 1948984 1940133 -8851
32 1941896 1942779 +883
33 1943663 1937493 -6170
34 1943663 1948984 +5321
35 1948095 1943663 -4432
36 1933109 1930489 -2620
37 1944548 1947207 +2659
38 1944548 1951655 +7107
39 1945433 1941014 -4419
40 1945433 1948095 +2662
41 1948095 1943663 -4432
42 1941896 1944548 +2652
43 1944548 1949873 +5325
44 1941896 1945433 +3537
45 1950763 1944548 -6215
46 1946320 1924402 -21918
47 1943663 1945433 +1770
48 1948095 1941896 -6199
49 1946320 1946320 +0
50 1944548 1951655 +7107
base = 1944859 +/- 1250
test = 1944365 +/- 1381
diff = -493 +/- 1745
speedup = -0.000254 In the #2291 (comment) the mean speed of the noCustomMutex binary was slow (1854418). I made that test downloading the zip from the branch, now I used git and rebased the noCustomMutex on master.
$ git log --oneline --graph -n 6
* 74c51ad6 (HEAD -> noCustomMutex) Remove custom mutex implementation. No functional change.
* a858defd (vdv-upstream/master, upstream/master, origin/master, origin/HEAD, master) Raise stack size to 8MB for pthreads
* 7b064752 Scale down endgame factor when shuffling
* 843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
* e5cfa14f Assorted trivial cleanups
* a83d1a0e Use queens of either color in RookOnQueenFile
$ git log --oneline --graph -n 6
* a858defd (HEAD -> master, vdv-upstream/master, upstream/master, origin/master, origin/HEAD) Raise stack size to 8MB for pthreads
* 7b064752 Scale down endgame factor when shuffling
* 843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
* e5cfa14f Assorted trivial cleanups
* a83d1a0e Use queens of either color in RookOnQueenFile
* 8a04b3a1 Update Makefile documentation |
So the custom Mutex class can be removed entirely, that is good. |
new speedup test w/ g++ 8.1.0 built by mingw-w64, static binaries
$ bash bench-parallel.sh ./stockfish.exe ./stockfish-noCustomMutex.exe 50
run base test diff
1 1956123 1959712 +3589
2 1958813 1968743 +9930
3 1948984 1964217 +15233
4 1914056 1920941 +6885
5 1954333 1969650 +15317
6 1947207 1960611 +13404
7 1944548 1964217 +19669
8 1958813 1959712 +899
9 1954333 1958813 +4480
10 1944548 1969650 +25102
11 1958813 1959712 +899
12 1958813 1963314 +4501
13 1964217 1962412 -1805
14 1957019 1963314 +6295
15 1961511 1961511 +0
16 1959712 1960611 +899
17 1943663 1964217 +20554
18 1942779 1970559 +27780
19 1959712 1960611 +899
20 1949873 1961511 +11638
21 1955227 1965120 +9893
22 1957019 1967836 +10817
23 1961511 1960611 -900
24 1958813 1961511 +2698
25 1952547 1957019 +4472
26 1948984 1955227 +6243
27 1960611 1961511 +900
28 1952547 1965120 +12573
29 1957916 1957916 +0
30 1937493 1969650 +32157
31 1942779 1948984 +6205
32 1952547 1966025 +13478
33 1957916 1954333 -3583
34 1951655 1966025 +14370
35 1958813 1955227 -3586
36 1958813 1966025 +7212
37 1959712 1969650 +9938
38 1955227 1963314 +8087
39 1953439 1970559 +17120
40 1945433 1962412 +16979
41 1960611 1968743 +8132
42 1959712 1971468 +11756
43 1960611 1968743 +8132
44 1938372 1954333 +15961
45 1955227 1960611 +5384
46 1950763 1956123 +5360
47 1948095 1966025 +17930
48 1948095 1967836 +19741
49 1955227 1967836 +12609
50 1954333 1963314 +8981
base = 1952958 +/- 2367
test = 1962262 +/- 2164
diff = 9304 +/- 2228
speedup = 0.004764
$ bash bench-parallel.sh ./stockfish.exe ./stockfish-noCustomMutex.exe 10
run base test diff
1 1847383 1851318 +3935
2 1847705 1852742 +5037
3 1848091 1856245 +8154
4 1849187 1846032 -3155
5 1849897 1854427 +4530
6 1852677 1852677 +0
7 1849897 1856830 +6933
8 1850995 1857286 +6291
9 1847190 1856700 +9510
10 1852418 1854816 +2398
base = 1849544 +/- 1244
test = 1853907 +/- 2126
diff = 4363 +/- 2367
speedup = 0.002359 |
@vondele @noobpwnftw @snicolet noCustomMutex not rebased is a 2.7% slowdown wrt update master, noCustomMutex binary has nearly the same slow speed showed in test #2291 (comment) I fear that my speedup tests in #2291 could be not valid: the branches downloaded as zip perhaps were not aligned with the master. Let me know if you want repeat some speedup tests.
$ git log --oneline -n 6
78f680be (HEAD -> noCustomMutex, vdv-upstream/noCustomMutex) Remove custom mutex implementation. No functional change.
843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
e5cfa14f Assorted trivial cleanups
a83d1a0e Use queens of either color in RookOnQueenFile
8a04b3a1 Update Makefile documentation
db00e162 Add sse4 if bmi2 is enabled |
@ppigazzini thanks for the repeated testing, yes, these results after rebasing make sense. I'll make a pull request with that patch. |
as part of the investigation of the hang caused by an incorrect implementation of condition_variable in libwinpthread, it was realized that our custom Mutex implementation is no longer needed. Prior to lazySMP this custom implementation resulted in a 30% speedup, but now no speed difference can be measured (see official-stockfish#2309 (comment) and official-stockfish#2309 (comment)) as no mutex is used on the hot path in lazySMP. This removes platform-specific code, which is thus less tested. No functional change.
@vondele nobody asked for tests on Linux, so I suppose that we are testing custom Windows code. Anyway here the speedup tests (
|
@ppigazzini yes, this is code under _WIN32 only. Under the PR #2313 there is the question if there is any effect when TB are enabled. Not sure if you can test that, but if you can that would be useful. |
in your script you can use something like:
to run bench with more special options. |
With the Here Document syntax you can pass the input directly to stockfish. This works w/ a script and w/ the command line: ./stockfish << EOF
setoption name SyzygyPath value C:\tablebases\wdl345;C:\tablebases\wdl6;D:\tablebases\dtz345;D:\tablebases\dtz6
bench 128 4 20 default depth
ucinewgame
EOF |
@vondele I never used tablebases, I don't know if depth 20 is useful for your test
script: #!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value C:\tablebases\wdl345; C:\tablebases\dtz345
bench 128 4 $depth default depth
quit
EOF
}
if [[ $# -ne 4 ]]; then
cat << EOF
usage: $0 base test depth n_runs
fast test:
$0 ./stockfish_base ./stockfish_test 13 10
slow test:
$0 ./stockfish_base ./stockfish_test 20 10
EOF
exit 1
fi
base=$1
test=$2
depth=$3
n_runs=$4
# temporary files initialization
cat /dev/null > base000.txt
cat /dev/null > test000.txt
cat /dev/null > tmp000.txt
# preload of CPU/cache/memory
($base bench >/dev/null 2>&1)&
($test bench >/dev/null 2>&1)&
wait
# bench loop: SMP bench with background subshells
for k in `seq 1 $n_runs`; do
printf "run %3d /%3d\n" $k $n_runs
# swap the execution order to avoid bias
if [ $((k%2)) -eq 0 ]; then
(_bench $base base000.txt)&
(_bench $test test000.txt)&
wait
else
(_bench $test test000.txt)&
(_bench $base base000.txt)&
wait
fi
done
# text processing to extract nps values
cat base000.txt | grep second | grep -Eo '[0-9]{1,}' > base001.txt
cat test000.txt | grep second | grep -Eo '[0-9]{1,}' > test001.txt
for k in `seq 1 $n_runs`; do
echo $k >> tmp000.txt
done
printf "\nrun\tbase\ttest\tdiff\n"
paste tmp000.txt base001.txt test001.txt | awk '{printf "%3d %d %d %+d\n", $1, $2, $3, $3-$2}'
paste base001.txt test001.txt | awk '{printf "%d\t%d\t%d\n", $1, $2, $2-$1}' > tmp000.txt
# compute: sample mean, 1.96 * std of sample mean (95% of samples), speedup
# std of sample mean = sqrt(NR/(NR-1)) * (std population) / sqrt(NR)
cat tmp000.txt | awk '{sum1 += $1 ; sumq1 += $1**2 ;sum2 += $2 ; sumq2 += $2**2 ;sum3 += $3 ; sumq3 += $3**2 } END {printf "\nbase = %10d +/- %d\ntest = %10d +/- %d\ndiff = %10d +/- %d\nspeedup = %.6f\n\n", sum1/NR , 1.96 * sqrt(sumq1/NR - (sum1/NR)**2)/sqrt(NR-1) , sum2/NR , 1.96 * sqrt(sumq2/NR - (sum2/NR)**2)/sqrt(NR-1) , sum3/NR , 1.96 * sqrt(sumq3/NR - (sum3/NR)**2)/sqrt(NR-1) , (sum2 - sum1)/sum1 }'
# remove temporary files
rm -f base000.txt test000.txt tmp000.txt base001.txt test001.txt |
@ppigazzini did I understand correctly you ran the test with table bases? If so, I think this indicates that it is performance neutral also with TB. However:
The space should be removed after the (as a minor nit, I would also use the _bench() function for the 'preload' phase, so the accesses to the files have been done as well). |
@vondele ok, I'll redo the test w/ your corrections. |
script snippet code #!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value D:\tablebases\wdl345;D:\tablebases\dtz345
bench 128 4 $depth default depth
EOF
}
if [[ $# -ne 4 ]]; then
cat << EOF
usage: $0 base test depth n_runs
fast test:
$0 ./stockfish_base ./stockfish_test 13 10
slow test:
$0 ./stockfish_base ./stockfish_test 20 10
EOF
exit 1
fi
base=$1
test=$2
depth=$3
n_runs=$4
# preload of CPU/cache/memory
(_bench $base base000.txt)&
(_bench $test test000.txt)&
wait
# temporary files initialization
cat /dev/null > base000.txt
cat /dev/null > test000.txt
cat /dev/null > tmp000.txt |
@vondele you are correct, the space count
|
Any chance you could run the parallel case for longer, or redo. Right now we have |
@vondele I ran also the single thread bench because a multi thread one is always noisy. |
|
Closing this issue now that the reasons for the hang are well understood (buggy external pthread library for GCC 9.1 and 9.2 compiler suite on MinGW64). |
As part of the investigation of the hang caused by an incorrect implementation of condition_variable in libwinpthread, it was realized that our custom Mutex implementation is no longer needed. Prior to lazySMP this custom implementation resulted in a 30% speedup, but now no speed difference can be measured as no mutex is used on the hot path in lazySMP. #2291 #2309 (comment) #2309 (comment) The interest of this patch is that it removes platform-specific code, which is always less tested. No functional change.
This is a follow-up of the hang report in #2291 and the discussion in the pull request #2307
I have pushed a branch here with debugging facilities for our threads, the searching flag, the condition variable states, that sort of things: https://github.com/snicolet/Stockfish/tree/hang3
Features:
The UCI thread is numbered 999999 instead of 0 in master. I changed that because we have two threads with idx 0 in master, and it made it a little bit confusing to see which was doing what in the log messages.
a DEBUG_HANG macro in misc.h to show/hide all the debug messages in the console
#define DEBUG_HANG 1 && sync_cout << "[DEBUG_HANG] "
to show the messages#define DEBUG_HANG 0 && sync_cout << "[DEBUG_HANG] "
to hide the messagesa USE_CUSTOM_CONDITION_VARIABLE macro in thread_win32_osx.h to add the possibility to use a custom condition variable implementation.
a) if USE_CUSTOM_CONDITION_VARIABLE is 0, the code is the same as in current master.
b) if USE_CUSTOM_CONDITION_VARIABLE is 1, we use the version provided by @noobpwnftw (a copy of the buggy DLL implementation). But now this compiles on all systems with the same code path: in particular, I managed to simulate the repeatable and deterministic hang with 1 thread on my Mac, on the first position of bench.
notify_one() now has a new parameter with the thread index to help debugging
wait() has now two new parameters: thread index and sleep duration during the condition loop. This is used in the idle loop in line 179 of thread.cpp :
Examples:
• to see the normal flow of the threads synchronization, compile with
DEBUG_HANG 1
,USE_CUSTOM_CONDITION_VARIABLE 0
, and then type• to see the buggy flow and the deterministic hang, compile with
DEBUG_HANG 1
,USE_CUSTOM_CONDITION_VARIABLE 1
, and then typeHope this helps
The text was updated successfully, but these errors were encountered: