Add support for Windows large pages #2656
Conversation
(Force-pushed from 0a5283a to a388358)
It seems that there is a hang in AppVeyor. I'll try to figure out if I can reproduce it on my box.
@skiminki yes, fixing CI will be needed of course. I'm reluctant to add things that introduce new UCI options, especially multi-state ones. Can't the code automatically detect whether large pages are supported and do the right thing if they are (not)? Basically, make 'winLargePages == "Enabled"' the default, falling back to normal allocation if it fails. In other words, only a modified aligned_ttmem_alloc (and probably a needed aligned_ttmem_free). TT.resize(0), if needed, is presumably best set in main().
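For reference, a minimal sketch of the suggested shape, assuming a hypothetical try_large_page_alloc() helper and 64-byte alignment; not the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

void* try_large_page_alloc(size_t allocSize);  // returns nullptr on failure

void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {

    // Try large pages first; nullptr means the privilege is missing or
    // the OS cannot currently satisfy the request.
    mem = try_large_page_alloc(allocSize);
    if (mem)
        return mem;  // large-page allocations are already suitably aligned

    // Fall back to a regular allocation with manual alignment
    constexpr size_t alignment = 64;
    mem = std::malloc(allocSize + alignment - 1);
    std::uintptr_t ptr = reinterpret_cast<std::uintptr_t>(mem);
    return reinterpret_cast<void*>((ptr + alignment - 1) & ~std::uintptr_t(alignment - 1));
}
```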
There are basically three reasons why I added the UCI option:
But still, since people need to go to group policy editors and such, it's not like anyone would accidentally use large pages on Windows. So yes, I think we could delete the UCI option and have the code behave as if it were always in 'Auto' mode. Deleting code from a patch is always an easy thing to do anyway. We can also have all that logic in aligned_ttmem_alloc() and delete the winLargePages extra parameter. However, I would like to retain the text I wrote in the readme on how to enable large pages, since it's far from obvious. I really wish Windows had something similar to transparent huge pages in Linux; then all this would "just work" without the extra hoops... I marked this PR as work in progress until I (or someone else) have figured out what chokes CI.
Some documentation of the feature will indeed be needed, since it clearly requires some tricks by the user. Probably a short note in the readme with a link to the wiki? Anyway, that can be figured out later, when the text can reflect the implementation. Concerning the CI, you might want to check (by adding another commit) whether it is just a CI fluke or a real issue.
This might be a real issue, since it was the second time CI choked on this. I'll investigate in a couple of days.
Today I tried the branch skiminki:windows-large-pages on a 4-node machine with Xeon E7-4870 CPUs. If this is merged, I think it should be done with a UCI option, because there should be a possibility to disallow large pages usage. For example, TCEC does not allow large pages as far as I know.
@CoffeeOne Thanks for posting the performance numbers. Why should whether large pages are used be a user flag, or something to disallow? It really should be an implementation detail. IMO, if the OS allows it (via the user setting the needed permissions), a program can just make use of it. That of course assumes that the implementation is robust and there are no significant side effects. TCEC moved to Linux, so that won't matter.
Yes, you are right, TCEC is not relevant anymore for this pull request :)
Thanks @CoffeeOne for the numbers, and for confirming that there don't seem to be performance surprises wrt NUMA. I'm a bit busy this week, but I'll try to provide the next patch iteration at some point anyway. Just to confirm, the plan is to get rid of the new UCI option, and then figure out CI if needed. I have some ideas about what might go wrong. I'll also move the readme update into a separate patch.
I believe large pages are a good idea, but they should strictly be a UCI option. I have three objections:
@DragonMist, concerning:
Let me do the following:
Also, I concur that the CPU temp increase is because the CPU is doing more work, visible as higher nps. Large pages reduce the number of TLB misses, and every TLB miss essentially means that the CPU will stall (= idle) until the TLB miss has been served. Less idling = more power consumed. It's completely up to the cooling system of the box whether there's a noticeable temp increase or not, and whether that temp increase matters. (For example, I don't see any difference on my box, and even if there were a +5C bump, the temps would still be well below limits.)
A quick follow-up to item 2. Since this was brought up, I did a quick check and at least the following engines always try to use large pages when available as far as I can tell:
No one has complained about their attempts to use large pages, not even TCEC during its Windows era. That said, I wrote my patch strictly based on the Microsoft online documentation. In particular, my code is not based on the code of any of those engines.
(Force-pushed from a388358 to bf3ed12)
(Force-pushed from 6e4fc72 to 6969e75)
This revision should pass CI. I removed the WIP tag.
One more question: what is the minimum Windows version that should be supported? GetLargePageMinimum() requires Windows Vista/Server 2k3 or later. Is that OK, or should we use GetProcAddress() to resolve the function? The other new WinAPI functions (VirtualAlloc etc.) are for Windows XP/Server 2k3 or later. The above applies only to the 64-bit binary; the 32-bit binary of course continues to use malloc/free.
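For illustration, a sketch of the GetProcAddress() alternative asked about above, so the binary would still load on pre-Vista x64 Windows. The zero fallback (treat large pages as unavailable) is an assumption, not something from this PR:

```cpp
#include <windows.h>

typedef SIZE_T (WINAPI *GetLargePageMinimum_t)(void);

SIZE_T largePageMinimum() {
    HMODULE k32 = GetModuleHandle(TEXT("kernel32.dll"));
    GetLargePageMinimum_t fn =
        (GetLargePageMinimum_t)GetProcAddress(k32, "GetLargePageMinimum");
    return fn ? fn() : 0;  // 0: large pages unavailable on this Windows version
}
```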
There are very few versions of Windows that have x64 builds earlier than Vista/2k3. I know there is a 64-bit XP, but barely anyone is using it.
@skiminki I don't know much about Windows... So my naive question would be: what's the oldest version that is still supported by Microsoft? Is the functionality absent on any officially supported version? If not, I would guess we're fine.
I think Windows 8.1 / Server 2k8 is the oldest still supported? https://support.microsoft.com/en-us/help/13853/windows-lifecycle-fact-sheet . As far as I can tell, all supported Windows versions should be happy with this patch. But I'm not a Windows expert, either. In fact, this is my first time touching Windows code since about 2005...
OK, CI is now happy. Do you want me to fix the comment? Other than that, I should be done with this PR, unless there's still something more to fix.
@skiminki I think the code is now in shape. If I commit, I might move some of the docs to the wiki and make the comment in the readme more concise. I'll take care of the comment. However, this now needs testing. I think normal fishtest is not appropriate, as there are too few Windows machines. I also don't have Windows access, so some people with Windows machines will need to speak up (@CoffeeOne @DragonMist @hero2017 @MichaelB7 etc.). I guess this needs:
I still think the user should get feedback on whether or not large pages are actually being used, since it is not guaranteed they will always be enabled: it depends on the amount of RAM and its usage at the time the hash is initialized, even when 'lock pages' is enabled in Windows. Something like this works (tested):
I believe this concludes the testing. If I gathered everything correctly, the conclusion is:
I totally agree that if we do the automagic large pages, then some kind of user feedback is needed. But if we do large pages via a UCI option, then we probably don't need the feedback, since SF either launches or it doesn't.
Yeah, but let's do that in a follow-up patch? Anyway, I'm thinking that we might get more predictable NUMA results if we do the NUMA assignment during alloc using VirtualAllocExNuma, rather than doing the implicit assignment in the hash clear. But this may require more experimenting. At this point I think we need guidance from @vondele on how we should proceed. My suggestion is to go with this patch as a baseline, and add a true/false UCI option for "Windows Large Pages". The default should be false.
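For context, a small sketch of what the VirtualAllocExNuma idea above could look like: bind pages to a NUMA node at allocation time instead of relying on first-touch during hash clear. The wrapper name and any splitting policy are assumptions, not part of this patch:

```cpp
#include <windows.h>

void* alloc_on_numa_node(SIZE_T size, DWORD node) {
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                              node);  // preferred NUMA node for these pages
}
```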
Some questions/remarks:
Actually, even when large pages are enabled via a UCI option and 'lock pages' is enabled, it is still possible for large pages not to be used: if memory usage rises above 75%, in many cases large page allocation will fail even when 'lock pages' is set. Since it does require overt action by the user to set 'lock pages' in Windows to enable large pages, I do not believe a UCI option is necessary, but providing user feedback is necessary and probably sufficient. That way the user can either close the apps that are using large amounts of memory or do a reboot. A user can always disable Windows 'lock pages' to disable large pages if it was set previously. Of course, if the user never wanted large pages and never enabled Windows 'lock pages', they will not have to do anything at all.
@vondele It is a graceful transition to normal allocation, maybe a slight 200 to 300 millisecond delay. Currently there is no indicator of the transition, hence my request for some type of notification that it was successful; if the normal allocation occurred, no notification would be required, as it could be assumed large pages were either not used due to failure or not requested. Currently my suggested code reports both outcomes, but it can be easily modified to report success only.
@MichaelB7 knowing it is graceful is important. Concerning the output, any reason you send it to std::cerr, or could it as well be sent to std::cout (via sync_cout)? What do you achieve with the added memtest variable? Basically, could we use this patch instead:

```diff
diff --git a/src/misc.cpp b/src/misc.cpp
index ba21a6df0..0f6d983b1 100644
--- a/src/misc.cpp
+++ b/src/misc.cpp
@@ -359,6 +359,10 @@ void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {
 
   // try allocate large pages
   mem = aligned_ttmem_alloc_large_pages(allocSize);
+  if (mem)
+      sync_cout << "info string Hash table allocation: Windows large pages used." << sync_endl;
+  else
+      sync_cout << "info string Hash table allocation: Windows large pages not used." << sync_endl;
 
   // fall back to regular allocation if necessary
   if (!mem)
```
@vondele The current patch (with your changes) produces this output when running bench:
It would be nicer if we could get that down to one message when there is no change to hash size. I'm not sure of the best way to do that.
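A rough sketch (not the PR's code) of one way to get there: report only when the outcome changes between reallocations. sync_cout/sync_endl are Stockfish's existing helpers from misc.h:

```cpp
#include "misc.h"  // for sync_cout / sync_endl

static int lastLargePages = -1;  // -1: nothing reported yet

void report_large_pages(bool used) {
    if (int(used) != lastLargePages)
        sync_cout << "info string Hash table allocation: Windows large pages "
                  << (used ? "used." : "not used.") << sync_endl;
    lastLargePages = int(used);
}
```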
@vondele code:
@vondele I personally like to see the MB output when setting hash, but that is simply my preference.
It doesn't do what we want directly. However, we should be able to do the following:
Unfortunately, it's not completely clear to me whether the above is supported by Windows for NUMA (but it should be supported for just about everything else), so a bit of experimentation would be needed. Definitely for a later PR.
You get the output multiple times because Stockfish allocates the hash multiple times. As simple as that. Here is when the allocs happen in a regular run:
If we had an option for using large pages, there would also be a resize on:
At some point I offered a patch to fix this (deferred hash allocation), which would only allocate the hash when needed. I wrote that patch to dramatically speed up SF launch on big hash sizes. The measured speedup was 2..12x on TCEC hardware, depending on the order of setoption "Hash" and setoption "Threads". The points when we need the hash allocated/resized (if it already isn't) are as follows:
This patch was rejected due to the concern that SF could lose on time if broken GUI/tournament software omitted the isready command. E.g.:
However, only a bit later I realized that even if we allocated the hash in setoption, we'd still get the same timeout if the GUI just spams the commands without isready, as there is no feedback from the setoption command. Anyway, the TCEC fix to avoid the 1-minute launch time was that I hacked the TCEC cutechess-cli to force 'Threads' always before 'Hash', which otherwise wouldn't be guaranteed at all. This reduced the launch time to around 8 secs. (Options are in an ordered map or hash map in cutechess, IIRC.) Back to the subject matter: if we want to avoid the "Hash table allocation: Windows large pages used" message being written multiple times, then IMO we should avoid multiple resizes, instead of hiding the fact that we do multiple resizes. For reference, the patch for deferred allocs is here: skiminki@f078e43 . (This patch also avoids redundant clears, but that can be taken away.)
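A minimal sketch of the deferred-allocation idea (hypothetical names; the linked commit is the real patch): setoption only records the requested size, and the table is (re)allocated lazily at the points where it is actually needed:

```cpp
#include <cstddef>

class TranspositionTable {
public:
    // Cheap; safe to call straight from the setoption handler
    void setSize(std::size_t mb) { requestedMB = mb; }

    // Called at the points where the hash must really exist
    // (isready, go, bench, ...)
    void ensureAllocated() {
        if (allocatedMB != requestedMB) {
            resize(requestedMB);       // the actual, expensive allocation
            allocatedMB = requestedMB;
        }
    }

private:
    void resize(std::size_t /*mb*/) { /* allocate and clear the table */ }

    std::size_t requestedMB = 16;
    std::size_t allocatedMB = 0;
};
```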
@skiminki I think we should have the output every time we reallocate the hash. So please add the code snippet I posted to this PR. I believe that would be the state ready for merging. I have not seen a reason for a UCI option so far; this seems to work reasonably well, and users have control anyway. We're still months away from a release, so we can adjust as needed.
Agreed. I'll update the patches, probably later today. IIRC there were also some other minor things; I'll collect them too. I won't be pushing for the deferred hash resize patch, unless someone else wants it, too.
Indeed, and it needs their action to set the Windows policy.
It will.
@vondele I am surprised that you answered by fully editing my original comment. Why? Was my comment inappropriate for some reason?
@Coolchessguykevin no, sorry, not at all; I must have made a wrong click. (I didn't even know I could edit others' comments.) I also don't see a way to revert it.
(Force-pushed from 6969e75 to af5e263)
Full diff to previous version:
@skiminki thanks for the patch, and everybody else for testing. |
@skiminki Thank you for finally adding large pages to the official SF release. I've been pushing for this and NUMA for years, tired of adding this manually to every release. On my hardware the speedup is astronomical. See my test results in #2619. Again, thank you sir.
As is with Linux, large pages may add a significant bump to nps numbers, especially on large hashes.

On my Windows box, go depth 30 on startpos increases the speed from 13.8 Mnps to 14.6 Mnps with 32 GB hash when large pages are enabled. This is roughly a +6% speed increase.

No functional change
Note: Since this patch switches from malloc() to VirtualAlloc() even without large pages enabled, this patch should be tested on a NUMA machine for speed regressions before merge.
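To illustrate what that switch means in practice, a hedged sketch with assumed wrapper names: VirtualAlloc returns page-aligned memory even without MEM_LARGE_PAGES, and such memory must be released with VirtualFree rather than free():

```cpp
#include <windows.h>

void* ttmem_alloc_regular(SIZE_T allocSize) {
    return VirtualAlloc(nullptr, allocSize,
                        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}

void ttmem_free(void* mem) {
    if (mem)
        VirtualFree(mem, 0, MEM_RELEASE);  // size must be 0 with MEM_RELEASE
}
```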