
Possible fix for CCCC slowdown #3288

Closed

Conversation

mstembera
Contributor

Don't unnecessarily refresh TT entries to lower memory bandwidth.
No functional change
bench: 4109336
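
For readers without the diff at hand, the idea can be sketched as follows. This is a simplified model of a Stockfish TT entry, not the exact patch; the field names follow the style of tt.h at the time, and the 0xF8/0x7 masks assume the then-current packing of generation in the upper bits and PV flag + bound in the lower 3 bits:

#include <cstdint>

struct TTEntry {
    uint16_t key16;
    uint8_t  genBound8;  // upper bits: generation, lower 3 bits: PV flag + bound
    // move, value, eval and depth fields omitted in this sketch
};

// generation8 is the table-wide counter, bumped for each new search.
// Master refreshed on every probe hit:
//     tte->genBound8 = uint8_t(generation8 | (tte->genBound8 & 0x7));
// which dirties the cache line even when nothing changes. The patch skips
// the store when the entry's generation is already current:
inline void refresh(TTEntry* tte, uint8_t generation8) {
    if ((tte->genBound8 & 0xF8) != generation8)   // only write if stale
        tte->genBound8 = uint8_t(generation8 | (tte->genBound8 & 0x7));
}

In the common case the refresh then becomes a read instead of a read-modify-write, which is what lowers the memory traffic.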

@vondele
Member

vondele commented Jan 8, 2021

It would be good to have some performance numbers for this somehow, since we don't have testing access to that machine. What might be interesting is to look at some hardware counters and see how much this influences the read/write stats.

@mstembera
Contributor Author

So I instrumented the code, and it looks like during normal game play under STC conditions the check filters out 87% of the refreshes. While we don't have direct access to the machine, it is likely CCCC will update SF from abrok before they get the memory upgrade, especially if the commit message says it may fix the slowdown. So we have a short, unique window of opportunity to find out whether this matters on systems with low memory bandwidth relative to processor performance. If it doesn't, we won't have to worry about it biting us in the future, and we will of course revert. If it does, we have a fix. Both are good outcomes.
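
A minimal way to obtain a number like that, reusing the TTEntry sketch above, is a pair of relaxed atomic counters around the guard. This is hypothetical instrumentation, not the code actually used for the measurement:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> hits{0}, writes{0};   // hypothetical counters

inline void refresh_counted(TTEntry* tte, uint8_t generation8) {
    hits.fetch_add(1, std::memory_order_relaxed);
    if ((tte->genBound8 & 0xF8) != generation8) {
        writes.fetch_add(1, std::memory_order_relaxed);
        tte->genBound8 = uint8_t(generation8 | (tte->genBound8 & 0x7));
    }
}

// At exit, 1.0 - writes / (double)hits is the fraction of refreshes the
// check filters out (~87% under STC game play, per the figure above).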

@vondele
Member

vondele commented Jan 8, 2021

Let me try to run a benchmark with many threads (but better memory bandwidth) and see what happens.

@vondele
Member

vondele commented Jan 8, 2021

No difference as far as I can see on this system:

for i in `seq 1 10`; do ./stockfish.patch < inp | grep -B1 bestmove | grep -o 'nps [0-9]* '; done
nps 174982601 
nps 175098662 
nps 173938262 
nps 175775665 
nps 175922876 
nps 177840037 
nps 176150106 
nps 178405315 
nps 177367900 
nps 176789379 
for i in `seq 1 10`; do ./stockfish.master < inp | grep -B1 bestmove | grep -o 'nps [0-9]* '; done
nps 177126350 
nps 174348947 
nps 176363794 
nps 178662346 
nps 173078670 
nps 175764493 
nps 174525722 
nps 176623130 
nps 172222530 
nps 177420439 
$ cat inp
setoption name Threads value 248
setoption name Hash value 50000
go depth 40
ucinewgame

@mstembera
Contributor Author

mstembera commented Jan 8, 2021

Using bench produces quite a different refresh pattern compared to game play, because the starting bench positions are completely different from each other, unlike consecutive positions in a game, which keep probing (and refreshing) the same TT entries.
[Edit] Your results look consistent. We shouldn't expect any improvement on your system, since it doesn't have the issue in the first place.

@CoffeeOne

CoffeeOne commented Jan 9, 2021

I think this patch is worth merging.
I have quad-Opteron machines with Opteron 8439SE CPUs, and I thought I should be able to measure a speedup (if the patch is good), because the machine uses DDR2 ECC RAM, which should be slow enough.
So I tried a normal bench between Stockfish master and the "LowerMemBandwith" branch, on Windows with gcc 10.2: 10 runs of bench 8192 24 26 each.
[image: bench results, master vs. LowerMemBandwith]

So the branch seems to be slightly faster, but of course I am aware that the runs were very noisy.

But I was just thinking: I have some other good testing options.
That machine has an optimal memory configuration: Each CPU has a pair of 4 GB DDR2-800 modules with CL5, so 32 GB RAM in total and 8 (!) memory channels in total.
I can

  1. Artificially reduce the memory speed to DDR2-533 (so externally 266 MHz, but then it's CL4 only :D)
  2. Remove all the RAM from CPU2, CPU3 and CPU4. This not only reduces the total amount of RAM to 8 GB, it dramatically reduces memory bandwidth, too.
  3. Combine 1) and 2)

Any comments?

EDIT:
I don't know if it is important: the builds were done with
make -j profile-build ARCH=x86-64-sse3-popcnt
since the modern target no longer runs on this machine.

@vondele
Member

vondele commented Jan 9, 2021

@CoffeeOne if you could do experiment 2 (depopulate some memory channels) that would be very interesting. I think the 'reasonable' setup would be to remove 1 module per CPU.

@mstembera
Contributor Author

I also think this is worth merging because it is known not to be harmful, and we only have a short window of opportunity for CCCC to pick it up from abrok before they upgrade their memory. After that we can easily revert, depending on the results. I'm not against any additional testing or investigation; I just wouldn't wait until we miss our chance at CCCC, where the problem is known to occur.

For testing, I think the most important thing is that we test on a machine configuration that actually experiences a memory-bandwidth slowdown; the CCCC machine runs something like 2x slower than expected. If removing some memory from your machine causes a slowdown that replicates this condition, that should be good.

@vondele
Member

vondele commented Jan 9, 2021

I don't like the idea of committing to master to see if things help for some online tournament. If nobody can reproduce it locally, it is too much of a corner case to worry about.

@CoffeeOne

Hi,
I tried 2.)
I removed all RAM except the 2 modules connected to CPU1. That's the bare minimum; otherwise the machine will not boot.
@vondele Removing just 1 module per CPU is not possible with this hardware; the modules must always be added in pairs.

So I tried again with the same executables:
[image: bench results with depopulated memory channels]

I had to lower the TT size, because the machine now has only 8 GB.

So the branch is still faster, but now only by ~0.5% (before, it was 2%).

@mstembera
Contributor Author

@CoffeeOne Thanks for testing. Since the nps for master is not much lower than it was before you removed the memory, it still doesn't look like we are reproducing a bandwidth-limited setup; the CCCC machine is about 2x slower than expected.
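
One way to sanity-check whether a box is actually bandwidth limited, independent of Stockfish, is a crude STREAM-style triad. This is a hypothetical helper, not part of the PR, and a real check would run one pinned copy per NUMA node:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = std::size_t(1) << 26;   // 512 MiB per array
    std::vector<double> a(N, 1.0), b(N, 2.0), c(N, 0.5);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i)
        a[i] = b[i] + 3.0 * c[i];                 // triad: 2 reads + 1 write
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.1f GB/s\n", 3.0 * N * sizeof(double) / secs / 1e9);
}

Comparing the reported figure against the theoretical peak of the installed DIMM configuration shows how much headroom the machine has left for TT traffic.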

@mstembera
Contributor Author

@vondele Do you have a way to remove memory from your machine? Since it's virtually the same as the CCCC machine, you may have the best shot at recreating the issue.

@vondele
Member

vondele commented Jan 10, 2021

No, I can't.

@vondele
Member

vondele commented Jan 30, 2021

I'll close this, as it is not obviously a fix.
