Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduction of the hang #2309

Closed
snicolet opened this issue Sep 18, 2019 · 40 comments
Closed

Reduction of the hang #2309

snicolet opened this issue Sep 18, 2019 · 40 comments

Comments

@snicolet
Copy link
Member

snicolet commented Sep 18, 2019

This is a follow-up of the hang report in #2291 and the discussion in the pull request #2307

I have pushed a branch here with debugging facilities for our threads, the searching flag, the condition variable states, that sort of things: https://github.com/snicolet/Stockfish/tree/hang3

Features:

  1. The UCI thread is numbered 999999 instead of 0 in master. I changed that because we have two threads with idx 0 in master, and it made it a little bit confusing to see which was doing what in the log messages.

  2. a DEBUG_HANG macro in misc.h to show/hide all the debug messages in the console
    #define DEBUG_HANG 1 && sync_cout << "[DEBUG_HANG] " to show the messages
    #define DEBUG_HANG 0 && sync_cout << "[DEBUG_HANG] "to hide the messages

  3. a USE_CUSTOM_CONDITION_VARIABLE macro in thread_win32_osx.h to add the possibility to use a custom condition variable implementation.
    a) if USE_CUSTOM_CONDITION_VARIABLE is 0, the code is the same as in current master.
    b) if USE_CUSTOM_CONDITION_VARIABLE is 1, we use the version provided by @noobpwnftw (a copy of the buggy DLL implementation). But now this compiles on all systems with the same code path: in particular, I managed to simulate the repeatable and deterministic hang with 1 thread on my Mac, on the first position of bench.

  4. notify_one() now has a new parameter with the thread index to help debugging

  5. wait() has now two new parameters: thread index and sleep duration during the condition loop. This is used in the idle loop in line 179 of thread.cpp :

cv.wait(this->thread_index(), lk, [&]{ return searching; }, 1000); 

Examples:

• to see the normal flow of the threads synchronization, compile with DEBUG_HANG 1, USE_CUSTOM_CONDITION_VARIABLE 0, and then type

./stockfish
bench
stop
quit

• to see the buggy flow and the deterministic hang, compile with DEBUG_HANG 1, USE_CUSTOM_CONDITION_VARIABLE 1, and then type

./stockfish
bench
stop
quit

Hope this helps

@snicolet
Copy link
Member Author

@noobpwnftw
Copy link
Contributor

Well, just to note that their implementation is simply wrong. Correct ones must ensure a lock is held during the period of unlocking external lock, entering wait and other notify calls, that is probably not possible without an internal lock in the conditional variable implementation itself.
Problem this will cause is 1) notifies get lost(not the case with semaphores since they support releasing before there are waiters), 2) wrong threads consume notifies(if a thread has taken the external lock and there is an outstanding notify, this thread is guaranteed to receive it, in MINGW's case it is broken and caused the hangs, adding a delay after unlocking the external lock and before trying to consume the notify can make it reliably reproducible).

@snicolet
Copy link
Member Author

snicolet commented Sep 18, 2019

Example of output with the deterministic hang with one thread:
https://gist.github.com/snicolet/59686f8db4055be1a579d08a65417e4c

@snicolet
Copy link
Member Author

snicolet commented Sep 18, 2019

@noobpwnftw
My aim is to decorticate the hang to understand exactly the data race scenario, because I would like to rewrite our synchronization with something else than condition variables (atomic<bool> comes to mind, as in snicolet/Stockfish@7da2b2b ). Any racing scenario leading to dead lock with the buggy condition variable implementation is a good test for the robustness of the alternative, IMHO. Many many thanks for your code, of course!!!!

@noobpwnftw
Copy link
Contributor

Their recent master branch has reverted the changes that broke their implementation, however no releases have been made yet, these bugged ones will be out in the wild.
There are two ways to reduce the impact it caused but none would guarantee something else won't break given that their entire C++ runtime is also linked to it.

  1. Fire extra notifies in case of a wrong thread consumed them.
  2. Remove the usage of conditional variables or implement or own.

Best is just to advice users avoid using MINGW once and for all.
They seem to have poor versioning system as there is not an obvious way to tell what version their files are built upon, and no testing on whether their implementations are standard-compliant.

@snicolet
Copy link
Member Author

snicolet commented Sep 18, 2019

  1. wrong threads consume notifies

In the example I have the hang is with only one thread?

@noobpwnftw
Copy link
Contributor

@snicolet With bench, there are two consumers, UCI thread and MainThread.
In the same run you will not observe hangs with only go command, because then there will be only one consumer.

@hero2017
Copy link

@snicolet I'm trying to work with you to figure out the problem with hanging I reported in #2305.

You said to compile with DEBUG_HANG 1, USE_CUSTOM_CONDITION_VARIABLE 1

Where do I specify these commands? I compile with:

make profile-build ARCH=x86-64-bmi2 COMP=gcc -j 36

And can I compile any version that hangs for me or only for this one: https://github.com/snicolet/Stockfish/tree/hang3

@snicolet
Copy link
Member Author

@hero2017
DEBUG_HANG macro in misc.h
USE_CUSTOM_CONDITION_VARIABLE macro in thread_win32_osx.h

@noobpwnftw
Copy link
Contributor

I think making searching an atomic type is not desired because it makes the threads that are in idle state busy with locking/unlocking and yield is a system call.
Making extra notifications on the other hand is considerably cheaper, since it only turns into a loop when a thread consumes a notify where it shouldn't.

@snicolet
Copy link
Member Author

[mingw-w64] recent master branch has reverted the changes that broke their implementation[...]

You are talking about this commit?
https://sourceforge.net/p/mingw-w64/mingw-w64/ci/330025c54b85512d54b6960fad07498365c8fee3/

@vondele
Copy link
Member

vondele commented Sep 19, 2019

given that mingw fixed the bug in their implementation, should we just document somewhere which version of mingw is working, which is known buggy, and move on? I think it is not a good idea to try and reimplement something like a condition_variable, because on one system the implementation was buggy.

@snicolet
Copy link
Member Author

snicolet commented Sep 19, 2019

On the other hand, snicolet/Stockfish@7da2b2b is quite a simplification, removing the mutex and the condition variable. It is much easier to reason about the code.

@vondele
Copy link
Member

vondele commented Sep 19, 2019

I agree that's easier to follow, and at first sight, it could just be tested as a simplification.

However, the condition_variable implies (following cppreference) that the thread will be suspended, while this_thread::yield() is an implementation-dependent scheduler hint. Are we sure that we won't be consuming resources of the opponent while we're waiting for our next 'go' command (on multiple implementations)? I would be happy if this were the case.

@snicolet
Copy link
Member Author

snicolet commented Sep 19, 2019

@vondele @noobpwnftw You are right, my atomic(bool) version is not good enough.

It is not a problem of implicit mutex around the atomic (atomic(bool) is not guaranted to be lock-free by the standard, but in practice it is lock-free on all modern CPUs, see std::atomic<bool>{}.is_lock_free()).

But the semi-buzy waiting loops are consuming a little bit more of CPU than I expected, because it is not really blocking threads but doing a semi-active waiting loop. I called thread::yield to be polite to other threads but I tested it to be significantly more than 0% CPU time even in idle state.

It would pass the simplification test easily against master, because it's cheating on the scheduler :-)

@snicolet
Copy link
Member Author

snicolet commented Sep 19, 2019

I am now convinced too that the best solution is to document the library bug somewhere in our wiki on the page "building Stockfish on Windows", to provide links to proper versions of the DLL and to move on. It's a pity for the clarity of the atomic(bool) code :-)

@noobpwnftw
Copy link
Contributor

IMO the original version is also very clear a demonstration of how to use signaling instead of a polling loop. Pity is that some implementations broke the standard, and we are affected.

@mooskagh
Copy link

I'd suggest replacing custom mutex with standard mutex (I'm sure there is no measurable performance difference at all, in context of how it's currently used in SF. Yes, if you run that mutex lock million times in a loop, there will be a difference, but SF doesn't do that).

And replace std::condition_variable_any with std::condition_variable, which is much easier for libraries to get right.

@vondele
Copy link
Member

vondele commented Sep 19, 2019

@mooskagh, I agree with that (and made a branch #2307 (comment) ) however, it was tested to be 2.5% slower (#2291 (comment) ).

Personally, I would take the little performance hit. If not, we should at least augment our Mutex implementation with the needed deletion of operator=

@noobpwnftw
Copy link
Contributor

@vondele I however fail to see why using standard mutex could cause a slowdown:
We do not use it anywhere in the hot path, at most it slows our sync_cout a bit and TB probing, provided that bench doesn't probe TBs, where does this 2.5% slowdown come from?

@vondele
Copy link
Member

vondele commented Sep 19, 2019

no idea, I was also not expecting any significant change. It is indeed not on a hot path. Maybe it requires one more round of testing.

Edit: note the bench was done with a default depth of 13.... that's very low, especially threaded, and maybe more sensitive to overhead. I think should be repeated with e.g. depth 20 or so.

@noobpwnftw
Copy link
Contributor

@mooskagh I agree to use more fool-proof methods within STL, yet this time they messed up pthread_cond_wait which I believe would be used in both places. :)

@ppigazzini
Copy link
Contributor

ppigazzini commented Sep 20, 2019

@vondele @mooskagh new speedup test master vs noCustomMutex using bench 16 1 20

dynamic binaries w/ msys2 g++, older libwinpthread

$ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex.exe 10
run     base    test    diff
  1  1821485  1818860  -2625
  2  1814875  1813199  -1676
  3  1817987  1814254  -3733
  4  1827135  1812951  -14184
  5  1817800  1823239  +5439
  6  1825939  1819672  -6267
  7  1827261  1805849  -21412
  8  1820172  1818923  -1249
  9  1825814  1827891  +2077
 10  1826065  1813757  -12308

base =    1822453 +/- 2822
test =    1816859 +/- 3836
diff =      -5593 +/- 5046
speedup = -0.003069

same result using bench

$ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex.exe 50
run     base    test    diff
  1  1935737  1943663  +7926
  2  1945433  1944548  -885
  3  1949873  1948984  -889
  4  1944548  1938372  -6176
  5  1947207  1940133  -7074
  6  1945433  1948984  +3551
  7  1945433  1938372  -7061
  8  1949873  1947207  -2666
  9  1946320  1948095  +1775
 10  1941896  1938372  -3524
 11  1945433  1946320  +887
 12  1948095  1942779  -5316
 13  1940133  1946320  +6187
 14  1942779  1948095  +5316
 15  1951655  1948095  -3560
 16  1949873  1948984  -889
 17  1944548  1939252  -5296
 18  1941896  1945433  +3537
 19  1928746  1946320  +17574
 20  1949873  1943663  -6210
 21  1948095  1946320  -1775
 22  1947207  1949873  +2666
 23  1941896  1947207  +5311
 24  1949873  1943663  -6210
 25  1944548  1944548  +0
 26  1945433  1944548  -885
 27  1948095  1942779  -5316
 28  1946320  1948984  +2664
 29  1947207  1942779  -4428
 30  1933985  1946320  +12335
 31  1948984  1940133  -8851
 32  1941896  1942779  +883
 33  1943663  1937493  -6170
 34  1943663  1948984  +5321
 35  1948095  1943663  -4432
 36  1933109  1930489  -2620
 37  1944548  1947207  +2659
 38  1944548  1951655  +7107
 39  1945433  1941014  -4419
 40  1945433  1948095  +2662
 41  1948095  1943663  -4432
 42  1941896  1944548  +2652
 43  1944548  1949873  +5325
 44  1941896  1945433  +3537
 45  1950763  1944548  -6215
 46  1946320  1924402  -21918
 47  1943663  1945433  +1770
 48  1948095  1941896  -6199
 49  1946320  1946320  +0
 50  1944548  1951655  +7107

base =    1944859 +/- 1250
test =    1944365 +/- 1381
diff =       -493 +/- 1745
speedup = -0.000254

In the #2291 (comment) the mean speed of the noCustomMutex binary was slow (1854418). I made that test downloading the zip from the branch, now I used git and rebased the noCustomMutex on master.

  • noCustomMutex rebased on master:
$ git log --oneline --graph -n 6
* 74c51ad6 (HEAD -> noCustomMutex) Remove custom mutex implementation. No functional change.
* a858defd (vdv-upstream/master, upstream/master, origin/master, origin/HEAD, master) Raise stack size to 8MB for pthreads
* 7b064752 Scale down endgame factor when shuffling
* 843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
* e5cfa14f Assorted trivial cleanups
* a83d1a0e Use queens of either color in RookOnQueenFile
  • master:
$ git log --oneline --graph -n 6
* a858defd (HEAD -> master, vdv-upstream/master, upstream/master, origin/master, origin/HEAD) Raise stack size to 8MB for pthreads
* 7b064752 Scale down endgame factor when shuffling
* 843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
* e5cfa14f Assorted trivial cleanups
* a83d1a0e Use queens of either color in RookOnQueenFile
* 8a04b3a1 Update Makefile documentation

@noobpwnftw
Copy link
Contributor

So the custom Mutex class can be removed entirely, that is good.

@ppigazzini
Copy link
Contributor

ppigazzini commented Sep 20, 2019

new speedup test w/ g++ 8.1.0 built by mingw-w64, static binaries

  • default bench
$ bash bench-parallel.sh ./stockfish.exe ./stockfish-noCustomMutex.exe 50
run     base    test    diff
  1  1956123  1959712  +3589
  2  1958813  1968743  +9930
  3  1948984  1964217  +15233
  4  1914056  1920941  +6885
  5  1954333  1969650  +15317
  6  1947207  1960611  +13404
  7  1944548  1964217  +19669
  8  1958813  1959712  +899
  9  1954333  1958813  +4480
 10  1944548  1969650  +25102
 11  1958813  1959712  +899
 12  1958813  1963314  +4501
 13  1964217  1962412  -1805
 14  1957019  1963314  +6295
 15  1961511  1961511  +0
 16  1959712  1960611  +899
 17  1943663  1964217  +20554
 18  1942779  1970559  +27780
 19  1959712  1960611  +899
 20  1949873  1961511  +11638
 21  1955227  1965120  +9893
 22  1957019  1967836  +10817
 23  1961511  1960611  -900
 24  1958813  1961511  +2698
 25  1952547  1957019  +4472
 26  1948984  1955227  +6243
 27  1960611  1961511  +900
 28  1952547  1965120  +12573
 29  1957916  1957916  +0
 30  1937493  1969650  +32157
 31  1942779  1948984  +6205
 32  1952547  1966025  +13478
 33  1957916  1954333  -3583
 34  1951655  1966025  +14370
 35  1958813  1955227  -3586
 36  1958813  1966025  +7212
 37  1959712  1969650  +9938
 38  1955227  1963314  +8087
 39  1953439  1970559  +17120
 40  1945433  1962412  +16979
 41  1960611  1968743  +8132
 42  1959712  1971468  +11756
 43  1960611  1968743  +8132
 44  1938372  1954333  +15961
 45  1955227  1960611  +5384
 46  1950763  1956123  +5360
 47  1948095  1966025  +17930
 48  1948095  1967836  +19741
 49  1955227  1967836  +12609
 50  1954333  1963314  +8981

base =    1952958 +/- 2367
test =    1962262 +/- 2164
diff =       9304 +/- 2228
speedup = 0.004764
  • bench 16 1 20
$ bash bench-parallel.sh ./stockfish.exe ./stockfish-noCustomMutex.exe 10
run     base    test    diff
  1  1847383  1851318  +3935
  2  1847705  1852742  +5037
  3  1848091  1856245  +8154
  4  1849187  1846032  -3155
  5  1849897  1854427  +4530
  6  1852677  1852677  +0
  7  1849897  1856830  +6933
  8  1850995  1857286  +6291
  9  1847190  1856700  +9510
 10  1852418  1854816  +2398

base =    1849544 +/- 1244
test =    1853907 +/- 2126
diff =       4363 +/- 2367
speedup = 0.002359

@ppigazzini
Copy link
Contributor

ppigazzini commented Sep 20, 2019

@vondele @noobpwnftw @snicolet noCustomMutex not rebased is a 2.7% slowdown wrt update master, noCustomMutex binary has nearly the same slow speed showed in test #2291 (comment)

I fear that my speedup tests in #2291 could be not valid: the branches downloaded as zip perhaps were not aligned with the master.

Let me know if you want repeat some speedup tests.

$ bash bench-parallel.sh ./stockfish-dyn.exe ./stockfish-dyn-noCustomMutex-norebased.exe 50

run     base    test    diff
  1  1948095  1893260  -54835
  2  1940133  1879025  -61108
  3  1937493  1877364  -60129
  4  1947207  1877364  -69843
  5  1944548  1883189  -61359
  6  1932235  1890732  -41503
  7  1935737  1895794  -39943
  8  1948095  1882355  -65740
  9  1939252  1894949  -44303
 10  1947207  1879025  -68182
 11  1945433  1880688  -64745
 12  1945433  1876535  -68898
 13  1934860  1889051  -45809
 14  1950763  1884860  -65903
 15  1910631  1883189  -27442
 16  1945433  1888211  -57222
 17  1939252  1894104  -45148
 18  1924402  1881521  -42881
 19  1939252  1891574  -47678
 20  1926137  1855229  -70908
 21  1941896  1894949  -46947
 22  1934860  1882355  -52505
 23  1928746  1894949  -33797
 24  1935737  1893260  -42477
 25  1948095  1881521  -66574
 26  1945433  1880688  -64745
 27  1939252  1892416  -46836
 28  1945433  1894104  -51329
 29  1939252  1887372  -51880
 30  1939252  1894104  -45148
 31  1945433  1887372  -58061
 32  1938372  1894104  -44268
 33  1948095  1894104  -53991
 34  1929617  1888211  -41406
 35  1933109  1886534  -46575
 36  1949873  1895794  -54079
 37  1945433  1893260  -52173
 38  1940133  1887372  -52761
 39  1948984  1889051  -59933
 40  1948095  1892416  -55679
 41  1942779  1892416  -50363
 42  1941014  1891574  -49440
 43  1944548  1895794  -48754
 44  1948984  1889051  -59933
 45  1947207  1888211  -58996
 46  1933985  1886534  -47451
 47  1937493  1883189  -54304
 48  1928746  1878194  -50552
 49  1948095  1892416  -55679
 50  1941014  1893260  -47754

base =    1940411 +/- 2200
test =    1887451 +/- 2077
diff =     -52959 +/- 2659
speedup = -0.027293
$ git log --oneline -n 6
78f680be (HEAD -> noCustomMutex, vdv-upstream/noCustomMutex) Remove custom mutex implementation. No functional change.
843a6c43 (vdv-upstream/noCustom) Introduce midgame initiative
e5cfa14f Assorted trivial cleanups
a83d1a0e Use queens of either color in RookOnQueenFile
8a04b3a1 Update Makefile documentation
db00e162 Add sse4 if bmi2 is enabled

@vondele
Copy link
Member

vondele commented Sep 21, 2019

@ppigazzini thanks for the repeated testing, yes, these results after rebasing make sense. I'll make a pull request with that patch.

vondele added a commit to vondele/Stockfish that referenced this issue Sep 21, 2019
as part of the investigation of the hang caused by an incorrect implementation of condition_variable in libwinpthread, it was realized that our custom Mutex implementation is no longer needed. Prior to lazySMP this custom implementation resulted in a 30% speedup, but now no speed difference can be measured (see official-stockfish#2309 (comment) and official-stockfish#2309 (comment)) as no mutex is used on the hot path in lazySMP.

This removes platform-specific code, which is thus less tested.

No functional change.
@ppigazzini
Copy link
Contributor

@vondele nobody asked for tests on Linux, so I suppose that we are testing custom Windows code.

Anyway here the speedup tests (bench 16 1 20) for Linux and WSL:

  • Ubuntu 18.04
$ bash bench-parallel.sh ./stockfish ./stockfish-noCustomMutex 10

run     base    test    diff
  1  1244629  1223008  -21621
  2  1231417  1246473  +15056
  3  1200954  1220446  +19492
  4  1236184  1236011  -173
  5  1222895  1212823  -10072
  6  1242296  1223431  -18865
  7  1229903  1222923  -6980
  8  1204556  1228620  +24064
  9  1263767  1247177  -16590
 10  1262112  1248969  -13143

base =    1233871 +/- 13015
test =    1230988 +/- 7956
diff =      -2883 +/- 10379
speedup = -0.002337
  • WSL (Ubuntu 18.04)
$ bash bench-parallel.sh ./stockfish ./stockfish-noCustomMutex 10

run     base    test    diff
  1  1870721  1869864  -857
  2  1869402  1871250  +1848
  3  1869930  1873233  +3303
  4  1869468  1870193  +725
  5  1868084  1874028  +5944
  6  1868150  1872175  +4025
  7  1872770  1870721  -2049
  8  1865652  1866769  +1117
  9  1870853  1873035  +2182
 10  1870787  1870457  -330

base =    1869581 +/- 1212
test =    1871172 +/- 1301
diff =       1590 +/- 1491
speedup = 0.000851

@vondele
Copy link
Member

vondele commented Sep 21, 2019

@ppigazzini yes, this is code under _WIN32 only. Under the PR #2313 there is the question if there is any effect when TB are enabled. Not sure if you can test that, but if you can that would be useful.

@vondele
Copy link
Member

vondele commented Sep 21, 2019

in your script you can use something like:

cat << EOF > inp
setoption name SyzygyPath value C:\tablebases\wdl345;C:\tablebases\wdl6;D:\tablebases\dtz345;D:\tablebases\dtz6
bench 128 4 20 default depth
ucinewgame
EOF

cat inp | ./stockfish

to run bench with more special options.

@ppigazzini
Copy link
Contributor

With the Here Document syntax you can pass the input directly to stockfish. This works w/ a script and w/ the command line:

./stockfish << EOF
setoption name SyzygyPath value C:\tablebases\wdl345;C:\tablebases\wdl6;D:\tablebases\dtz345;D:\tablebases\dtz6
bench 128 4 20 default depth
ucinewgame
EOF

@ppigazzini
Copy link
Contributor

@vondele I never used tablebases, I don't know if depth 20 is useful for your test

  • bench 16 1 20 default depth
$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10

run     base    test    diff
  1  1776866  1774840  -2026
  2  1778657  1776448  -2209
  3  1760029  1759737  -292
  4  1758626  1755069  -3557
  5  1792387  1789418  -2969
  6  1785797  1786580  +783
  7  1784954  1781830  -3124
  8  1795244  1790811  -4433
  9  1795914  1793298  -2616
 10  1793298  1792084  -1214

base =    1782177 +/- 8503
test =    1780011 +/- 8383
diff =      -2165 +/- 968
speedup = -0.001215
  • bench 128 4 20 default depth
$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10

run     base    test    diff
  1  4170820  4161026  -9794
  2  4148547  4128555  -19992
  3  4199758  4207346  +7588
  4  4188851  4063993  -124858
  5  4147759  3886493  -261266
  6  4059020  4137195  +78175
  7  4101758  4066646  -35112
  8  4018434  3962576  -55858
  9  3943653  4148282  +204629
 10  4208268  4026774  -181494

base =    4118686 +/- 54076
test =    4078888 +/- 61135
diff =     -39798 +/- 81034
speedup = -0.009663

script:

#!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value C:\tablebases\wdl345; C:\tablebases\dtz345
bench 128 4 $depth default depth
quit
EOF
}

if [[ $# -ne 4 ]]; then
cat << EOF
usage: $0 base test depth n_runs
fast test:
$0 ./stockfish_base ./stockfish_test 13 10
slow test:
$0 ./stockfish_base ./stockfish_test 20 10
EOF
exit 1
fi

base=$1
test=$2
depth=$3
n_runs=$4

# temporary files initialization
cat /dev/null > base000.txt
cat /dev/null > test000.txt
cat /dev/null > tmp000.txt

# preload of CPU/cache/memory
($base bench >/dev/null 2>&1)&
($test bench >/dev/null 2>&1)&
wait

# bench loop: SMP bench with background subshells
for k in `seq 1 $n_runs`; do
    printf "run %3d /%3d\n" $k $n_runs

    # swap the execution order to avoid bias
    if [ $((k%2)) -eq 0 ]; then
        (_bench $base base000.txt)&
        (_bench $test test000.txt)&
        wait
    else
        (_bench $test test000.txt)&
        (_bench $base base000.txt)&
        wait
    fi
  done

# text processing to extract nps values
cat base000.txt | grep second | grep -Eo '[0-9]{1,}' > base001.txt
cat test000.txt | grep second | grep -Eo '[0-9]{1,}' > test001.txt

for k in `seq 1 $n_runs`; do
    echo $k >> tmp000.txt
done

printf "\nrun\tbase\ttest\tdiff\n"
paste tmp000.txt base001.txt test001.txt | awk '{printf "%3d  %d  %d  %+d\n", $1, $2, $3, $3-$2}'
paste base001.txt test001.txt | awk '{printf "%d\t%d\t%d\n", $1, $2, $2-$1}' > tmp000.txt

# compute: sample mean, 1.96 * std of sample mean (95% of samples), speedup
# std of sample mean = sqrt(NR/(NR-1)) * (std population) / sqrt(NR)
cat tmp000.txt | awk '{sum1 += $1 ; sumq1 += $1**2 ;sum2 += $2 ; sumq2 += $2**2 ;sum3 += $3 ; sumq3 += $3**2 } END {printf "\nbase = %10d +/- %d\ntest = %10d +/- %d\ndiff = %10d +/- %d\nspeedup = %.6f\n\n", sum1/NR , 1.96 * sqrt(sumq1/NR - (sum1/NR)**2)/sqrt(NR-1) , sum2/NR , 1.96 * sqrt(sumq2/NR - (sum2/NR)**2)/sqrt(NR-1) , sum3/NR  , 1.96 * sqrt(sumq3/NR - (sum3/NR)**2)/sqrt(NR-1) , (sum2 - sum1)/sum1 }'

# remove temporary files
rm -f base000.txt test000.txt tmp000.txt base001.txt test001.txt

@vondele
Copy link
Member

vondele commented Sep 22, 2019

@ppigazzini did I understand correctly you ran the test with table bases? If so, I think this indicates that it is performance neutral also with TB. However:

setoption name SyzygyPath value C:\tablebases\wdl345; C:\tablebases\dtz345

The space should be removed after the ; it might not have found the dtz files.

(as a minor nit, I would also use the _bench() function for the 'preload' phase, so the accesses to the files have been done as well).

@ppigazzini
Copy link
Contributor

@vondele ok, I'll redo the test w/ your corrections.

@ppigazzini
Copy link
Contributor

@vondele

$ ./stockfish-dyn
Stockfish 200919 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name SyzygyPath value D:\tablebases\wdl345;D:\tablebases\dtz345
info string Found 145 tablebases
  • bench 16 1 20 default depth
$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10

run     base    test    diff
  1  1779758  1776668  -3090
  2  1804813  1803345  -1468
  3  1800294  1808002  +7708
  4  1806774  1802367  -4407
  5  1797617  1806099  +8482
  6  1800476  1800355  -121
  7  1803711  1801330  -2381
  8  1804935  1805425  +490
  9  1810463  1802917  -7546
 10  1808248  1804935  -3313

base =    1801708 +/- 5352
test =    1801144 +/- 5518
diff =       -564 +/- 3154
speedup = -0.000313
  • bench 128 4 20 default depth
$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 10

run     base    test    diff
  1  3963877  3934106  -29771
  2  4115473  4198858  +83385
  3  4105453  4076136  -29317
  4  4260282  4082531  -177751
  5  4092370  4054341  -38029
  6  4161935  4128838  -33097
  7  4219421  4180818  -38603
  8  4172937  4202060  +29123
  9  4226200  4096386  -129814
 10  4419807  4172611  -247196

base =    4173775 +/- 74988
test =    4112668 +/- 51324
diff =     -61107 +/- 60514
speedup = -0.014641

script snippet code

#!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value D:\tablebases\wdl345;D:\tablebases\dtz345
bench 128 4 $depth default depth
EOF
}

if [[ $# -ne 4 ]]; then
cat << EOF
usage: $0 base test depth n_runs
fast test:
$0 ./stockfish_base ./stockfish_test 13 10
slow test:
$0 ./stockfish_base ./stockfish_test 20 10
EOF
exit 1
fi

base=$1
test=$2
depth=$3
n_runs=$4

# preload of CPU/cache/memory
(_bench $base base000.txt)&
(_bench $test test000.txt)&
wait

# temporary files initialization
cat /dev/null > base000.txt
cat /dev/null > test000.txt
cat /dev/null > tmp000.txt

@ppigazzini
Copy link
Contributor

@vondele you are correct, the space count

$ ./stockfish-dyn
Stockfish 200919 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name SyzygyPath value D:\tablebases\dtz345; D:\tablebases\wdl345
info string Found 0 tablebases

$ ./stockfish-dyn
Stockfish 200919 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name SyzygyPath value D:\tablebases\dtz345;D:\tablebases\wdl345
info string Found 145 tablebases

@vondele
Copy link
Member

vondele commented Sep 22, 2019

Any chance you could run the parallel case for longer, or redo. Right now we have
diff = -61107 +/- 60514
which is still pretty uncertain.

@ppigazzini
Copy link
Contributor

@vondele I ran also the single thread bench because a multi thread one is always noisy.
Anyway I started a new 4 threads bench with 100 iterations.

@ppigazzini
Copy link
Contributor

@vondele

#!/bin/bash
_bench () {
$1 << EOF > /dev/null 2>> $2
setoption name SyzygyPath value D:\tablebases\wdl345;D:\tablebases\dtz345
bench 128 4 $depth default depth
EOF
}
$ bash bench-parallel.sh ./stockfish-dyn ./stockfish-dyn-noCustomMutex 20 100

run     base    test    diff
  1  4028411  3988264  -40147
  2  4156718  3988999  -167719
  3  4050217  4158813  +108596
  4  4075110  3978422  -96688
  5  4176691  4126898  -49793
  6  4178989  4166439  -12550
  7  4364471  4085587  -278884
  8  4264779  4172560  -92219
  9  4190011  4155278  -34733
 10  4094093  4267374  +173281
 11  4172055  4104206  -67849
 12  4090508  4303651  +213143
 13  4107431  4267457  +160026
 14  4135663  4156299  +20636
 15  4084970  4120275  +35305
 16  4120186  4157927  +37741
 17  4076877  4185759  +108882
 18  4079553  4123875  +44322
 19  4145038  4218113  +73075
 20  4248117  4401430  +153313
 21  4347036  4188048  -158988
 22  4203408  4123750  -79658
 23  4137384  4104131  -33253
 24  4146845  4202925  +56080
 25  4131730  4267634  +135904
 26  4120837  4135726  +14889
 27  4153182  4174302  +21120
 28  4245287  4202612  -42675
 29  4192294  4190502  -1792
 30  4192049  4160818  -31231
 31  4213037  4374423  +161386
 32  4208960  4302531  +93571
 33  4212250  4175957  -36293
 34  4171165  4210936  +39771
 35  4269870  4113833  -156037
 36  4198744  4205718  +6974
 37  4132380  4153210  +20830
 38  4159630  4213226  +53596
 39  4212473  4297142  +84669
 40  4132997  4295945  +162948
 41  4146415  4195987  +49572
 42  4105173  4155161  +49988
 43  4176861  4260353  +83492
 44  4296667  4147231  -149436
 45  4173323  4332942  +159619
 46  4387542  4219682  -167860
 47  4249836  4405287  +155451
 48  4431291  4171066  -260225
 49  4154054  4129157  -24897
 50  4152967  4273454  +120487
 51  4262182  4164639  -97543
 52  4009779  4198586  +188807
 53  4381828  4136225  -245603
 54  4119134  4147451  +28317
 55  4195190  4128934  -66256
 56  4207024  4292524  +85500
 57  4278349  4101010  -177339
 58  4288053  4175406  -112647
 59  4230221  4155640  -74581
 60  4182236  4233959  +51723
 61  4230142  4192374  -37768
 62  4076088  4098255  +22167
 63  4147207  4094302  -52905
 64  4193308  4159861  -33447
 65  4159802  4115800  -44002
 66  4229147  3973837  -255310
 67  4104917  4145674  +40757
 68  4277626  4159346  -118280
 69  4178098  4369557  +191459
 70  4215423  4082302  -133121
 71  4124281  4191971  +67690
 72  4205383  4237827  +32444
 73  4085438  4197867  +112429
 74  4340990  4129501  -211489
 75  4123887  4258141  +134254
 76  4114655  4134322  +19667
 77  3998934  4048559  +49625
 78  4006870  4142400  +135530
 79  4149633  4064129  -85504
 80  3987847  4083555  +95708
 81  4079099  4019175  -59924
 82  4069896  4075554  +5658
 83  4046002  4006692  -39310
 84  4145279  4120274  -25005
 85  4208562  4065266  -143296
 86  4058375  4099311  +40936
 87  4241879  4136356  -105523
 88  4110025  3983375  -126650
 89  4260752  4015073  -245679
 90  3977838  3978929  +1091
 91  3968656  4016923  +48267
 92  4117722  4197500  +79778
 93  4125479  4150627  +25148
 94  4013158  4225544  +212386
 95  4036261  4076191  +39930
 96  4203485  4119686  -83799
 97  4019309  4061404  +42095
 98  4107073  4309239  +202166
 99  4148978  4182035  +33057
100  4222571  4165275  -57296

base =    4161596 +/- 18125
test =    4161296 +/- 18229
diff =       -299 +/- 22476
speedup = -0.000072

@snicolet
Copy link
Member Author

Closing this issue now that the reasons for the hang are well understood (buggy external pthread library for GCC 9.1 and 9.2 compiler suite on MinGW64).

snicolet pushed a commit that referenced this issue Sep 26, 2019
As part of the investigation of the hang caused by an incorrect implementation
of condition_variable in libwinpthread, it was realized that our custom Mutex
implementation is no longer needed. Prior to lazySMP this custom implementation
resulted in a 30% speedup, but now no speed difference can be measured as no
mutex is used on the hot path in lazySMP.

#2291
#2309 (comment)  #2309 (comment)

The interest of this patch is that it removes platform-specific code, which is
always less tested.

No functional change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants