Stockfish much slower speeds than expected #2619
Comments
Is this related to #2448? It is almost certainly related to the fact that you run on Win'10 Pro.
Possibly yes. I thought it might be Win'10 too, so I just tried Win'2012R2 and there's almost no difference in speed. Vondele got 186,000 kN/s with 128 cores, although that was on Linux. But I'm quite sure that Win'2012R2 wouldn't degrade SF this much. I was expecting ~250,000 kN/s with 256 threads, not 80,000 kN/s. Could performance suffer so much because I'm testing with 2166MHz RAM and only 4 DIMMs instead of all 16 DIMM slots populated at 3200MHz? I doubt it. I currently don't have 3200MHz RAM so I can't say for sure.
No, I'm sure it is not the memory frequency.
(Or almost sure... 4 DIMMs, I don't know.)
I guess it is a bit of a stretch, but can you try with Linux installed?
Thanks, I may try Linux. But I'd be more curious whether you also get low performance when using SF in Windows. Do you have a Windows install that you can quickly boot into? For the purposes of buying this hardware, Linux won't work for me. So if you get 240 instead of 270 in Windows, then I know 100% it's something with my system. Btw, what motherboard do you have, and do you use all 16 DIMMs with 3200MHz?
No, Windows is not possible for me...
OK, that's fine. I'll try Linux. Could you please tell me your motherboard and whether you use all 16 DIMMs with 3200MHz RAM? I'm just trying to figure out why I have much slower speeds.
I don't know the hardware details, I just had remote access. The Linux test will tell a lot, I assume.
@hero2017 Before installing and trying Linux, can you answer the following questions: It would also help to know the kN/s numbers for 16T/32T/48T/64T/72T/96T/128T, so you/we would know at which thread count the expected speed gain starts to fail.
This system has 4 NUMA nodes. Here are the results from Windows 10 Enterprise that you requested. A bit better, but still far from the expected 250,000 kN/s range. Using the Aquarium (64-bit) GUI, though, I always get about 75,000-80,000 kN/s at various positions, regardless of whether SMT is on or off:
bench 1024 128 26
bench 1024 96 26
bench 1024 72 26
bench 1024 64 26
bench 1024 48 26
bench 1024 32 26:
bench 1024 16 26:
256 cores (SMT/HT on):
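For anyone repeating this thread sweep, here is a minimal Python sketch of how the bench summaries could be collected and compared. It assumes the summary lines look like `Total time (ms) : N`, `Nodes searched : N` and `Nodes/second : N`, matching the totals quoted in this thread. The function names are hypothetical and not part of Stockfish.

```python
import re

# Summary fields printed at the end of a bench run (format assumed from
# the "Total time (ms) : ..." lines quoted in this thread).
SUMMARY_PATTERNS = {
    "total_time_ms": r"Total time \(ms\)\s*:\s*(\d+)",
    "nodes_searched": r"Nodes searched\s*:\s*(\d+)",
    "nodes_per_second": r"Nodes/second\s*:\s*(\d+)",
}


def parse_bench_summary(text):
    """Return whichever summary fields are present in `text` as integers."""
    result = {}
    for name, pattern in SUMMARY_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            result[name] = int(match.group(1))
    return result


def scaling_efficiency(nps_by_threads):
    """Per-thread speed relative to the smallest thread count measured.

    1.0 means perfect linear scaling; lower values show where the
    expected gain starts to fail.
    """
    base_threads = min(nps_by_threads)
    base_per_thread = nps_by_threads[base_threads] / base_threads
    return {
        threads: (nps / threads) / base_per_thread
        for threads, nps in sorted(nps_by_threads.items())
    }
```

Feeding `scaling_efficiency` the Nodes/second value for each of the 16/32/48/64/72/96/128 thread runs would show at which thread count the per-thread speed drops off.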
So that suggests that there are probably two independent effects. One is the GUI, which might be doing something wrong (giving you 75000000), and the other shows up outside of the GUI. I suggest focusing on outside of the GUI first. Outside of the GUI, the numbers are somewhat better, 131610538 and 147601713, but not quite as good as the numbers posted in #2448 (186631199 and 259440481). Not sure if the latter difference is OS or hardware related (not all DIMM slots occupied).
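To put those figures side by side, here is a small Python sketch that converts nodes per second to kN/s and expresses each measured number as a fraction of the corresponding #2448 number. Pairing the values in the order they are quoted is an assumption, and the helper names are hypothetical.

```python
def to_knps(nps):
    """Convert nodes per second to kN/s (thousands of nodes per second)."""
    return nps / 1000


def fraction_of_reference(observed_nps, reference_nps):
    """Observed speed as a fraction of the reference speed."""
    return observed_nps / reference_nps


# Numbers quoted above; pairing by order of appearance is an assumption.
observed = [131_610_538, 147_601_713]
reference = [186_631_199, 259_440_481]

for obs, ref in zip(observed, reference):
    print(
        f"{to_knps(obs):,.0f} kN/s is "
        f"{fraction_of_reference(obs, ref):.0%} of {to_knps(ref):,.0f} kN/s"
    )
```

With that pairing the measured speeds come out at roughly 71% and 57% of the #2448 results, so the gap widens at the higher number.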
Thanks. I'm focusing on outside the GUI. I even compiled SF manually using ARCH=x86-64-modern and added large pages, but the issue remains. As for the DIMMs, I just can't see SF slowing down so much just because not all DIMM slots are being used or because they're 2166MHz. That's why I'm not ready to buy 16 x 3200MHz RAM yet, as that's an expensive upgrade and if it doesn't improve the speeds then it's a total waste. However, at this point I'm not sure what else to try other than Win'2019 Server, but that should not make a difference.
I've now installed Win'2019 Std and with SMT/HT enabled I got: bench 1024 256 26 Total time (ms) : 283712 And this is with LP enabled. This has to be a NUMA/processor group issue in Windows only. If it helps you to know, Win'2019 is reporting 4 processors even though I only have 2. This is likely because SMT is on, and I don't think SF is taking advantage of it all.
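One possible reading of the "4 processors" report: Windows schedules logical processors in processor groups of at most 64, so 256 hardware threads need at least four groups. Whether that is what Win'2019 is displaying here is an assumption. The Python sketch below only shows the arithmetic, and the function name is hypothetical.

```python
import math

# Windows processor groups hold at most 64 logical processors each.
MAX_LOGICAL_PROCESSORS_PER_GROUP = 64


def min_processor_groups(logical_processors):
    """Smallest number of processor groups needed for this many threads."""
    return math.ceil(logical_processors / MAX_LOGICAL_PROCESSORS_PER_GROUP)


# Dual 64-core system: 128 threads with SMT off, 256 with SMT on.
print(min_processor_groups(128))
print(min_processor_groups(256))
```

A process that stays inside a single group can use at most 64 of those threads, which is why engines on machines like this need to spread their threads across groups explicitly.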
Your numbers are worse than a 3970X running Windows 10 Pro. I’m curious as to how much a system like that would cost, if you are in a position to share. Not the exact amount, just a ballpark number. Linux is faster than Windows on the 3970X.
Windows 10 Pro on a 3970X with Large Pages enabled: Total time (ms) : 57605
Yes, no kidding. I'm in Canada so the price was even worse, but after duties, taxes, and all accessories to build this system it's about $10,000 CAD, not including RAM yet. Obviously I didn't spend this much to get 171,000 kN/s or, even worse, 80,000 kN/s in Windows, which is the primary purpose of this build. So as you can see, without the help of the Stockfish team I'm screwed with this system. If it helps, I'd be willing to provide access to it so the team can add/update any code for NUMA/processor groups and do testing. I installed Linux CentOS (8.1). Downloaded Stockfish Linux Modern (abrok): ./stockfish_20040717_x64_modern An improvement, but still far from 250,000 kN/s, which is about what I should get with SMT enabled.
on the linux install can you |
Sure, it's quite large so I've attached it here: info.txt
Below is Linux with SMT/HT disabled. Basically no difference between 128 and 256 threads in terms of nps, BUT total time is much lower than with HT enabled: 48626 with HT off compared to 74175 with HT on: ./stockfish_20040717_x64_modern
@hero2017 I hope it gets resolved to your satisfaction. I believe 240M/250M nps should be within reach under Linux.
I don't think there's anything more for me to try. My only hope is that it's because I'm using 2166MHz RAM. I spent so much money I might as well fork out another $2K for 16*3200MHz RAM and pray this makes a big difference. Other than that, I take it you don't see me doing anything wrong with my testing under Linux. Curious, how do I get the latest SF dev with LP so that I can try testing that, perhaps without compiling it myself under Linux?
@hero2017 If you do not disable THP (transparent huge pages) under Linux, then it should transparently make applications use large pages where possible. You might need to let the engine run for a while before the page migrations take place.
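To check how THP is configured on a given Linux install, the kernel exposes the setting in `/sys/kernel/mm/transparent_hugepage/enabled`, with the active mode shown in square brackets (for example `always [madvise] never`). Below is a minimal Python sketch that reads it. The function names are hypothetical, and the file may be absent on kernels built without THP.

```python
from pathlib import Path

# Standard sysfs location of the THP setting on Linux.
THP_ENABLED_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(contents):
    """Return the bracketed (active) mode, e.g. 'madvise', or None."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_mode(path=THP_ENABLED_PATH):
    """Read the active THP mode, or None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

A result of `never` would mean THP has been disabled. With `madvise`, only applications that explicitly request huge pages get them.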
So, I asked around: populating only 4 DIMM slots out of 16 will be bad for bandwidth, probably only 1/4 (the dual-socket EPYC 7742 should have 16 memory channels, and you might be using only 4, IIUC). Additionally, the frequency is a further factor of 2/3. So you might be at 1/6 of the memory bandwidth of the system I tested. I don't know exactly how that impacts the performance of SF, but it could have a significant impact.
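The 1/6 estimate is just the product of the two factors. Here is a rough Python sketch of that arithmetic, assuming bandwidth scales linearly with both the number of populated channels and the memory frequency (a simplification). The function name is hypothetical.

```python
def relative_memory_bandwidth(channels_used, channels_total, mem_mhz, reference_mhz):
    """Estimated bandwidth relative to a fully populated reference system.

    Assumes linear scaling with populated channels and with frequency.
    """
    return (channels_used / channels_total) * (mem_mhz / reference_mhz)


# 4 of 16 channels populated, 2166MHz instead of 3200MHz.
estimate = relative_memory_bandwidth(4, 16, 2166, 3200)
print(f"about {estimate:.2f} of the reference bandwidth")
```

That gives about 0.17, i.e. close to the 1/4 × 2/3 = 1/6 figure.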
@noobpwnftw I didn't disable anything. New CentOS, downloaded the Linux engine and ran the benchmark. That's it. @vondele, @noobpwnftw But here's what I've learned since. I've installed Win'10 Enterprise again. Now I get about 180,000 kN/s with LP enabled, which still isn't 250,000 kN/s, but that could be because I'm only using 8 of 16 2166MHz DIMMs (4 channel) instead of 16 DIMMs @ 3200MHz (8 channel). However, I get this speed only in the Aquarium GUI. When I quit the GUI, double-click the SF exe and run bench 1024 256 26, I always get only about 50,000 kN/s. Could there be something wrong with the bench section of the code, maybe not using the threads properly, or NUMA, or something?
Guys, in Aquarium I'm now getting about 180,000 kN/s with HT and LP enabled (although I don't see any difference with LP disabled) with the latest SF dev using Win'10 Ent for many positions. The reason my bench results were so poor was that I was running them with 1024 MB hash. Check this out, with LP enabled and 32GB hash: =========================== Now with asmFish: asmFishW_2017-05-22_popcnt Total time (ms) : 108936 Quite a big difference! That's about 30% faster. Could it be a NUMA/processor groups bug in SF with this particular system? Btw, I also noticed that with asmFishW I get 3.25 GHz per node and with SF I only get 2.70 GHz per node. Lastly, if HT is enabled, SF reports 5% faster speeds than without HT. However, with HT enabled, asmFish reports 30% faster speeds than without HT.
I was asked for Linux results earlier, so here they are from Ubuntu 18.04 on this machine, using today's SF which I compiled on this machine: (SMT enabled in BIOS)
This was from the 2nd run. The 1st run gave me only 98000000.
If I disable SMT (HT) in BIOS I get about 171,000,000 nps, so it seems that with HT SF really suffers on this machine. Let me know what else I can do to help resolve the slow speed issue on this machine, especially with HT enabled, even under Ubuntu. I have Win'10 Ent and Ubuntu 18.04 dual booting so I can run tests in either OS easily.
Is the speed issue now resolved? If so, please close this issue.
I'll consider this fixed as #2656 gets merged.
Wow. I have waited years for Large Pages support in SF. I ran a quick test between this version with LP and the next version without LP, and the difference is night and day on this dual EPYC 7742 system. On my previous hardware in years past, LP usually provided a nice 10-15% speed boost. On my current hardware, with 4 NUMA nodes and 256 threads, I'm getting an awesome boost: stockfish_20051319_x64_modern.exe (no LP):
hello i am samer707
Author: Sami Kiminki
@samer707 we'll track that in the other issue. Just keep all info there.
I'm trying to test it on a Win'10 Pro PC (x64) for the first time and, checking Resource Monitor, I only get about 80,000 kN/s in SF (trying the modern version from abrok since popcnt is supposed to be faster). I have set the engine to 128 threads as this is a dual EPYC 7742 (retail, not ES or QS, 64 cores*2). Hash is set to 8192 MB for testing purposes. SMT is disabled. 32GB of RAM.
NOTE: I've also tried enabling SMT in BIOS, and SF is running at 100% CPU with 256 threads. But the speed only goes to about 90,000 kN/s. According to benchmarks on ipman chess amd I should be getting about 190,000 kN/s using 128 threads or 270,000 kN/s using 256 threads (with SMT enabled).