Stockfish hang in TCEC S16 Division P Game 51 #2291
Comments
I used this as the opening book:
|
can you give the precise sequence of uci commands to reproduce the hang + info on the system you run on. |
I entered that FEN (but without any "Result") and SF_19082608 is up to d=82, 9.5 billion nodes, and no crash yet. |
Apologies, stall was due to bad memory setting by me :( |
right now, I see no issues with this particular fen... will need more info to see if this is an issue / what it is. |
When it happened at TCEC I would guess it was a hang, there was a msg from cutechess terminating the engine process. It had just under 17 minutes to use, but the longest recent thinks were around 400 and 300 seconds, depths 86/94 and 90/106, 33bn and 22bn nodes. |
well, recent multi-threaded tests are all crash free, for 100000s of games. So if it is a real issue, will be hard to find. |
Sure. Let's wait to see what TCEC say ... |
Perhaps an issue linked to low available memory in comparison with high multithreading needs? Edit: especially breadcrumbs? I don't know how much memory they need |
breadcrumbs use 16K memory, independent of the number of threads ... generally threads don't need much memory (mostly the history tables). It is also no reason to hang. |
Just curious: how did TCEC know it was hung, since the result appears to have been adjudicated on TCEC, and did TCEC stop play before the time on the clock was used up? If it was hung, it would eventually have lost by time forfeit (I would think), and there would have been no need to adjudicate the result. So I am curious as to why the result was adjudicated, and whether they adjudicated the result before time expired? |
It ran out of time. Edit: and was then unresponsive to cutechess so cutechess killed it. |
That was unfortunate, as it was a dead draw at that point. No issue here on my machine (slightly modified)
|
In the absence of other evidence TCEC have ruled it as a stockfish crash (well, hang really). A viewer said this was the last output, posted after about 22s of thought: <Stockfish 190826(13): info depth 83 seldepth 114 multipv 1 score cp -41 nodes 2661493724 nps 121020995 hashfull 97 tbhits 1495869 time 21992 pv e7f8 g1f2 a5c7 f2f3 f8e7 c6b7 c7a5 g3g4 h5g4 f3g3 a5c3 g3g4 c3a5 b7a6 a5c3 a6b5 c3d2 b5c6 d2e1 c6a8 e1a5 g4h5 a5d2 a8c6 d2e1 h5g4 e1d2 c6b5 d2a5 g4h3 a5d2 b5a6 d2c3 a6c8 c3e1 c8b7 e1d2 h3g2 d2c3 b7c6 c3d2 g2f1 d2c3 c6a8 c3d2 f1f2 e7f8 f2e2 d2b4 e2f3 f8e7 f3f2 b4d2 a8b7 e7f8 f2f3 d2e1 b7c6 f8e7 f3e2 e1b4 c6b5 ............ Roughly 16 mins later, after running out of time:
So it sounds like something happened while it was in its normal window of thinking time that made it not send any more output to cutechess. That could have been when it tried to send its move, of course, or it could be some kind of internal infinite loop, or ... In fact previous moves took 23/22/36/22/21 seconds, so 22s sounds like when it tried to send a move back to cutechess? |
This explanation seems the most logical, especially since the position has no reason to lead to infinite loops. But is it an SF issue or a cutechess issue? I don't know ... Perhaps SF sent the move but cutechess didn't receive it. |
Whole log for this game is at https://tcec-chess.com/crash/s16divp_51.7z (= http://tinyurl.com/y597lyyp listed in Crash info) |
Maybe not infinite loops, but search explosion? The last output received from Stockfish was at depth 82 and selective depth 114 after (almost) 22 seconds, and then nothing happened in like 15 minutes before Stockfish was flagged:
But already at depth 29 we were reaching selective depth 113:
Maybe I am wrong, but the sort of position where this happened (shuffling in a drawn opposite-colored-bishops endgame with score 0.41, Stockfish having not yet shown 0.00) seems to hint that maybe we observed a search explosion problem somewhere, maybe related to the shuffling patch? I wasn't really active when the shuffling patch was developed, but did we have problems like that during its development or tuning? |
It went fine through depth 85. Then I ran it again and I got some sort of memory leak at depth 79. Speed went down to a crawl. Third time seems to be just fine again. I am using Fritz 16 interface. |
unlikely that shuffle extensions are related. Threads will quit search if they observe the signal stop. I.e. pass at line 1199 in search. Similarly, stop will be signaled by the mainthread if that one reaches line 1828. It isn't very easy to prove this happens always, but maybe there is some very rare case (that I can't see, but why is the cycle detection code before the time check)? Memory corruption could of course cause anything, e.g. if some random write causes ponder to be true, this would happen. |
@Mindbreaker1 with or without TB ? |
With 6-man and less |
I had the hash set at 12+ GB |
Hmmm, maybe our code is not robust enough in a time emergency, by the way. It is true that we exit the search at line 1199 in any thread if the main thread has signaled in check_time(), but line 1199 is after closing a subtree, so there could still be an (infinite) chain of subtree openings before we stop. I would prefer to add these lines around line 1199, when entering the search function, to be honest:
Of course that would just be to be 100% sure that our time emergency procedure is correct in case of explosion problem, but that doesn't fix the underlying problem. |
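The exact lines referred to above are not included in this thread; purely as an illustration of the idea, here is a hypothetical, self-contained toy (not Stockfish's actual search) showing how checking the shared stop flag on entry to the recursive search bounds the work done after stop is signalled, since no new subtree can be opened:

```cpp
#include <atomic>

std::atomic<bool> stop{false};
long nodes = 0;

// Toy recursive search: check the stop flag on *entry*, so once the time
// manager signals stop, no new subtree is opened.
int search(int depth) {
    if (stop.load(std::memory_order_relaxed))
        return 0;                           // bail out before opening a subtree
    ++nodes;
    if (depth == 0)
        return 1;                           // leaf node
    int value = 0;
    for (int move = 0; move < 3; ++move) {  // pretend each node has 3 moves
        value += search(depth - 1);
        if (depth == 5 && move == 0)
            stop = true;                    // simulate check_time() firing mid-search
    }
    return value;
}
```

Without the entry check, sibling subtrees could still be opened and fully searched after the flag is set; with it, the remaining recursion collapses immediately.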
I really think this wasn't a time emergency. max_time (which is checked by check_time) must have been much shorter than the available time on the clock (I would need to verify, but I'm rather sure; additionally we have the 1s move overhead). I agree we should be defensive in the code, but before we rush to a solution we should understand what is going on. |
Well, I got another unusual result. It stayed on 79 much longer, but everything else looked fine. |
@vondele What makes me really suspicious is the depth 29 / selective depth 119 reported in the logs |
I had a 10-minute gap in writes to the log while testing (11 slow cores, but still). Note the ^C, because nothing seemed to be happening, and the date at that point compared to the output of ls: $ tail -f sfmaster.log Output continued very shortly after: $ tail -f sfmaster.log Does hashfull 1000 mean the hash is 100% full? |
Yes hashfull 1000 is 100% utilization of hash. Hashfull 123 is 12.3% full. |
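To make the permille figure concrete, here is an illustrative sketch (similar in spirit to how engines estimate it, not Stockfish's exact code): sample a fixed number of table entries, here the first 1000, and count how many are occupied.

```cpp
#include <array>

// Illustrative hashfull estimate: count occupied slots among 1000 sampled
// entries. The result is directly a permille value (123 -> 12.3% full).
int hashfull(const std::array<bool, 1000>& occupied) {
    int filled = 0;
    for (bool slot : occupied)
        filled += slot ? 1 : 0;
    return filled;  // out of 1000 samples -> permille
}
```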
I put some debug output in and maximumtime looks to be around say 15-20% of total time, so if the check is made, it should stop us in plenty of time. |
that's quite usual with the shuffle extensions, and was discussed in depth. This should still not cause the time_check mechanism to fail (if that's what is going on). |
@vondele
|
I think I can make it not hang under any conditions, without damaging performance, but it may look ugly. |
Updated my msys2 installation this morning, compiled current master and experienced no problems so far. |
@joergoster what is your CPU? At the moment we have hangs on Intel 3770k, AMD FX, Intel Xeon (bmi2) and the TCEC CPU.
|
@ppigazzini CPU i5-4570 However, I just noticed that the compiles only run under the msys2 prompt. |
@noobpwnftw #2307 so far so good (400/1000). Dynamic build w/ msys gcc 9.2 and latest msys2 libwinpthread. |
@noobpwnftw 1000 games completed successfully.
|
Just for reference, here is a version using atomic instead of condition variable: https://github.com/snicolet/Stockfish/tree/hang Does it hang? |
@snicolet No, it runs fine w/ the latest libwinpthread (ie the problematic one).
|
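For illustration, the condition-variable-free approach in the linked "hang" branch can be sketched like this (hypothetical names, not the branch's actual code): wait on a std::atomic flag with a yield loop, which sidesteps the buggy libwinpthread condition_variable implementation entirely.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> searching{false};

// Wait for the search to finish by polling an atomic flag instead of
// blocking on a condition_variable; no pthread condvar is involved.
void wait_for_search_finished() {
    while (searching.load(std::memory_order_acquire))
        std::this_thread::yield();  // be polite to the scheduler while spinning
}

void search_thread() {
    // ... perform the search work here ...
    searching.store(false, std::memory_order_release);  // announce completion
}
```

The trade-off is that the waiting thread burns a little CPU while spinning, but the synchronization no longer depends on the platform's condition-variable implementation.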
@noobpwnftw here is the script to test the speed:
#!/bin/bash
if [[ $# -ne 3 ]]; then
echo "usage:" $0 "base test n_runs"
echo "example:" $0 "./stockfish_base ./stockfish_test 10"
exit 1
fi
base=$1
test=$2
n_runs=$3
# temporary files initialization
cat /dev/null > base000.txt
cat /dev/null > test000.txt
cat /dev/null > tmp000.txt
# preload of CPU/cache/memory
($base bench >/dev/null 2>&1)&
($test bench >/dev/null 2>&1)&
wait
# bench loop: SMP bench with background subshells
for k in `seq 1 $n_runs`;
do
printf "run %3d /%3d\n" $k $n_runs
# swap the execution order to avoid bias
if [ $((k%2)) -eq 0 ];
then
($base bench >/dev/null 2>> base000.txt)&
($test bench >/dev/null 2>> test000.txt)&
wait
else
($test bench >/dev/null 2>> test000.txt)&
($base bench >/dev/null 2>> base000.txt)&
wait
fi
done
# text processing to extract nps values
cat base000.txt | grep second | grep -Eo '[0-9]{1,}' > base001.txt
cat test000.txt | grep second | grep -Eo '[0-9]{1,}' > test001.txt
for k in `seq 1 $n_runs`;
do
echo $k >> tmp000.txt
done
printf "\nrun\tbase\ttest\tdiff\n"
paste tmp000.txt base001.txt test001.txt | awk '{printf "%3d %d %d %+d\n", $1, $2, $3, $3-$2}'
paste base001.txt test001.txt | awk '{printf "%d\t%d\t%d\n", $1, $2, $2-$1}' > tmp000.txt
# compute: sample mean, 1.96 * std of sample mean (95% of samples), speedup
# std of sample mean = sqrt(NR/(NR-1)) * (std population) / sqrt(NR)
cat tmp000.txt | awk '{sum1 += $1 ; sumq1 += $1**2 ;sum2 += $2 ; sumq2 += $2**2 ;sum3 += $3 ; sumq3 += $3**2 } END {printf "\nbase = %10d +/- %d\ntest = %10d +/- %d\ndiff = %10d +/- %d\nspeedup = %.6f\n\n", sum1/NR , 1.96 * sqrt(sumq1/NR - (sum1/NR)**2)/sqrt(NR-1) , sum2/NR , 1.96 * sqrt(sumq2/NR - (sum2/NR)**2)/sqrt(NR-1) , sum3/NR , 1.96 * sqrt(sumq3/NR - (sum3/NR)**2)/sqrt(NR-1) , (sum2 - sum1)/sum1 }'
# remove temporary files
rm -f base000.txt test000.txt tmp000.txt base001.txt test001.txt |
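The statistics in the final awk one-liner can be cross-checked with a small reimplementation (hypothetical helper, same expression as the awk END block: mean, and 1.96 times sqrt(sumsq/n - mean^2)/sqrt(n-1) for the 95% interval of the sample mean):

```cpp
#include <cmath>
#include <vector>

struct Summary { double mean; double ci95; };

// Sample mean and 1.96-sigma interval of the mean, matching the awk formula:
// 1.96 * sqrt(sumsq/n - mean^2) / sqrt(n - 1).
Summary summarize(const std::vector<double>& samples) {
    double sum = 0, sumsq = 0;
    for (double v : samples) { sum += v; sumsq += v * v; }
    double n    = static_cast<double>(samples.size());
    double mean = sum / n;
    double sd   = std::sqrt(sumsq / n - mean * mean);  // population std dev
    return { mean, 1.96 * sd / std::sqrt(n - 1) };
}
```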
Seeing how @noobpwnftw closed his PR, I guess a practical question is, if SF makes it to the superfinals, do we 1) downgrade libwinpthread 2) use a different build of SF (e.g. abrok) 3) change the custom implementation of mutex/conditional variable in SF or 4) something else? |
If we are keeping MINGW compiles, then |
After db00e16 the speedup of the msys2 build (pgo & lto) vs the Abrok one (pgo & lto) is only 0.26%:
$ bash bench-parallel.sh ./stockfish_19091615_x64_bmi2.exe ./stockfish_msys2.exe 100
base = 1793776 +/- 9052
test = 1798456 +/- 9169
diff = 4679 +/- 3190
speedup = 0.002609 |
@ppigazzini Interesting script for the parallel execution, I do almost the same locally to test speed, but I like to a) show the md5 signature of the two binaries (to avoid looking for a speed difference when the binaries are the same :-))
# show the md5 signatures of the two tested binaries
md5 $testdir/testedpatch
md5 $testdir/master
# two long bench runs (depth 22), in parallel
$testdir/testedpatch bench 16 1 22 > /dev/null && echo "testedpatch" &
$testdir/master bench 16 1 22 > /dev/null && echo "master" |
@snicolet not a big difference in precision
$ bash bench-parallel.sh ./stockfish_msys2_ss3.exe ./stockfish_msys2_ss4.exe 10
run base test diff
1 1620088 1747310 +127222
2 1771216 1679313 -91903
3 1740901 1762447 +21546
4 1745168 1755928 +10760
5 1760994 1768283 +7289
6 1769748 1773421 +3673
7 1760269 1759544 -725
8 1762447 1765360 +2913
9 1754485 1750173 -4312
10 1751608 1753765 +2157
base = 1743692 +/- 27586
test = 1751554 +/- 16520
diff = 7862 +/- 32465
speedup = 0.004509
$ bash bench-parallel.sh ./stockfish_msys2_ss3.exe ./stockfish_msys2_ss4.exe 10
run base test diff
1 1663089 1655786 -7303
2 1662813 1639512 -23301
3 1663466 1658005 -5461
4 1663692 1672258 +8566
5 1731483 1740323 +8840
6 1785393 1784149 -1244
7 1779448 1778672 -776
8 1750255 1748615 -1640
9 1778701 1773629 -5072
10 1728167 1764137 +35970
base = 1720650 +/- 32824
test = 1721508 +/- 35972
diff = 857 +/- 9455
speedup = 0.000499 |
I have written some code to try to get a reduction of the hang: #2309 |
in the forum thread about the preparation for the super-final binary, Pasquale made the remark that the Abrok versions are built using MinGW too:
Now I am confused, is there a risk that the Abrok builds use the buggy condition variable implementation too? https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/xc3DM_xzxYA |
I think so, depends on what libraries are being used... that's why it needs testing. |
@ppigazzini @CoffeeOne @Krgp |
I think it says which version of mingw is used at the top of the abrok web page? Does that help? I thought it was an old version but I'm not sure, I'm not an expert on that stuff.
|
On the top of the Abrok page it says: "They are compiled with gcc/mingw 7.3 on Ubuntu 18.04." But how can we be sure? This is the problem, if the Abrok people decided to update their compiler but didn't change the page... Is there a way to take a binary and tell which compiler produced it? |
@snicolet different versions of MinGW; only the msys2 one uses the bugged libwinpthread:
gcc version 9.2.0 (Rev2, Built by MSYS2 project)
gcc version 8.1.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)
gcc version 7.3-posix 20180312 (GCC) Ubuntu 18.04
Nevertheless @CoffeeOne tested that the proposed bugfix is working fine |
@snicolet Ubuntu 19.04 uses this: But the PGO build works only w/ Ubuntu 18.04, so if Abrok upgrades to a newer Ubuntu version, SF will be terribly slow :)
|
@snicolet please freeze, in due time, the commit that you want to use in the SuFi. |
See pull request #2327 for an attempt to show at startup which compiler was used to compile Stockfish. |
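One way such a startup report can work (a minimal sketch of the general technique, not necessarily the code in #2327) is to build the string from the compiler's predefined preprocessor macros:

```cpp
#include <sstream>
#include <string>

// Report the compiler that built this binary via predefined macros.
// Illustrative sketch only; macro coverage here is deliberately minimal.
std::string compiler_info() {
    std::ostringstream os;
#if defined(__clang__)
    os << "clang++ " << __clang_major__ << "." << __clang_minor__;
#elif defined(__GNUC__)
    os << "g++ " << __GNUC__ << "." << __GNUC_MINOR__;
#elif defined(_MSC_VER)
    os << "MSVC " << _MSC_VER;
#else
    os << "unknown compiler";
#endif
#if defined(__MINGW32__)
    os << " on MinGW";  // defined by both 32- and 64-bit MinGW toolchains
#endif
    return os.str();
}
```

Printing this string on the "uci" handshake would have answered the Abrok question above directly.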
Closing this issue now that the reasons for the hang are well understood (buggy external pthread library for GCC 9.1 and 9.2 compiler suite on MinGW64). What remains to be done is documenting the reasons properly in our Wiki pages. Thanks to everybody here for all the investigations and testing during the last two weeks! |
As part of the investigation of the hang caused by an incorrect implementation of condition_variable in libwinpthread, it was realized that our custom Mutex implementation is no longer needed. Prior to lazySMP this custom implementation resulted in a 30% speedup, but now no speed difference can be measured, since no mutex is used on the hot path in lazySMP. #2291 #2309 (comment) #2309 (comment) The benefit of this patch is that it removes platform-specific code, which is always less well tested. No functional change.
In the TCEC Archive here
Reproduces with a few seconds to move for me: