
Add support for Windows large pages #2656

Closed

Conversation

@skiminki (Contributor) commented May 1, 2020

As with Linux, large pages may add a significant bump to nps
numbers, especially with large hash sizes.

On my Windows box, go depth 30 on startpos increases the speed from
13.8 Mnps to 14.6 Mnps with 32 GB hash when large pages are
enabled. This is roughly a +6% speed increase.

No functional change

Note: Since this patch switches from malloc() to VirtualAlloc() even when large pages are not enabled, it should be tested on a NUMA machine for speed regressions before merging.
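
For illustration, a minimal sketch of the general Windows large-page allocation approach described here (based on the Microsoft documentation; not the exact code in this PR, and the helper name is made up):

#include <cstddef>
#include <windows.h>

// Try to allocate 'allocSize' bytes backed by large pages; returns nullptr on
// failure so the caller can fall back to a regular allocation.
static void* try_alloc_large_pages(size_t allocSize) {

    HANDLE hToken;
    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &hToken))
        return nullptr;

    void* mem = nullptr;
    TOKEN_PRIVILEGES tp{};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;

    // This succeeds only if the user has been granted "Lock Pages in Memory"
    if (   LookupPrivilegeValue(NULL, SE_LOCK_MEMORY_NAME, &tp.Privileges[0].Luid)
        && AdjustTokenPrivileges(hToken, FALSE, &tp, 0, NULL, NULL)
        && GetLastError() == ERROR_SUCCESS)
    {
        size_t largePageSize = GetLargePageMinimum();
        if (largePageSize)
        {
            // round the allocation up to a multiple of the minimum large page size
            allocSize = ((allocSize + largePageSize - 1) / largePageSize) * largePageSize;
            mem = VirtualAlloc(NULL, allocSize,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        }
    }

    CloseHandle(hToken);
    return mem;   // nullptr => caller falls back to a normal allocation
}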

@skiminki force-pushed the windows-large-pages branch 2 times, most recently from 0a5283a to a388358 on May 1, 2020 15:56
@skiminki (Contributor, Author) commented May 1, 2020

It seems there is a hang in AppVeyor. I'll try to figure out whether I can reproduce it on my box.

@vondele (Member) commented May 2, 2020

@skiminki yes, fixing CI will be needed, of course.

I'm reluctant to add things that introduce new UCI options, especially multi-state ones. Can't the code automatically detect whether large pages are supported and do the right thing either way? Basically, make 'winLargePages == "Enabled"' the default, falling back to normal allocation if it fails. That would mean only a modified aligned_ttmem_alloc (and probably a matching aligned_ttmem_free). TT.resize(0), if needed, is presumably best set in main().
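
For reference, a rough sketch of the shape suggested above (function names as used in this PR; bodies simplified and assumed, not the final code):

#include <cstddef>
#include <windows.h>

void* aligned_ttmem_alloc_large_pages(size_t allocSize); // large-page helper from this PR

void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {

    // try large pages first; nullptr means unavailable or not permitted
    mem = aligned_ttmem_alloc_large_pages(allocSize);

    // fall back to a regular allocation if necessary
    if (!mem)
        mem = VirtualAlloc(NULL, allocSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

    // VirtualAlloc returns page-aligned memory, so no extra alignment is needed
    return mem;
}

void aligned_ttmem_free(void* mem) {
    VirtualFree(mem, 0, MEM_RELEASE);
}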

@skiminki changed the title from "Add support for Windows large pages" to "Add support for Windows large pages [WiP]" on May 2, 2020
@skiminki (Contributor, Author) commented May 2, 2020

There are basically three reasons why I added the UCI option:

  • I wasn't completely sure whether the 'Auto' mode would sometimes trigger UAC popups or some other nonsense for regular users due to the permission checks. Although, now that I think of it, this shouldn't ever happen.
  • DragonMist66 suggested in TCEC chat that large pages are not always wanted. He cited something about EGTB memory creep being problematic with large pages. However, I'm not enabling large pages for Syzygy mappings here, so perhaps this won't be an issue.
  • Other engines (e.g., BrainFish) have that option.

But still, since people need to go to group policy editors and so on, it's not like anyone would accidentally use large pages on Windows. So yes, I think we could delete the UCI option and have the code behave as if it were always in 'Auto' mode. Deleting code from a patch is always easy anyway. We can also have all that logic in aligned_ttmem_alloc() and delete the winLargePages extra parameter.

However, I would like to retain the text I wrote in the readme on how to enable large pages, since it's far from obvious. But I so much wish that Windows had something similar to transparent huge pages in Linux, so all this would "just work" without the extra hoops...

I marked this PR as work in progress until I (or someone else) figure out what is choking CI.

@vondele (Member) commented May 2, 2020

Some documentation of the feature will indeed be needed, since it clearly needs some tricks by the user. Probably a short note in the readme with a link to the wiki? Anyway, that can be figured out later, when the text can reflect the implementation.

Concerning the CI, you might want to check (by adding another commit) whether it is just a CI fluke or a real issue.

@skiminki (Contributor, Author) commented May 2, 2020

This might be a real issue, since it was the second time CI choked on this. I'll investigate in a couple of days.

@CoffeeOne commented May 3, 2020

Today I tried the branch skiminki:windows-large-pages on a 4-node machine with Xeon E7-4870 CPUs.
I tested it with a long series of bench 16384 80 26 (using 80 threads).
It was about 10% faster than master. Please note that such tests are very noisy, but I can post more details if requested.

If this gets merged, I think it should be done with a UCI option, because there should be a possibility to disallow large pages usage. For example, TCEC does not allow large pages as far as I know.
So in my opinion, there should be 2 options: AUTO and OFF.

@vondele (Member) commented May 3, 2020

@CoffeeOne Thanks for posting the performance numbers.

Why should whether large pages are used be a user flag, or something that can be disallowed? It really should be an implementation detail. IMO, if the OS allows it (via the user setting the needed permissions), a program can just make use of it. That of course assumes the implementation is robust and there are no significant side effects.

TCEC moved to Linux, so that won't matter.

@CoffeeOne commented
Yes, you are right, TCEC is not relevant for this pull request anymore :)
I also see no point in forbidding the usage (when it is technically possible).

@skiminki (Contributor, Author) commented May 3, 2020

Thanks @CoffeeOne for the numbers, and for confirming that there don't seem to be performance surprises wrt NUMA.

I'm a bit busy this week, but I'll try to provide the next patch iteration at some point anyway. Just to confirm, the plan is to get rid of the new UCI option, and then figure out CI if needed. I have some ideas about what might be going wrong. I'll also move the Readme update into a separate patch.

@DragonMist commented May 3, 2020

I believe large pages are a good idea, but they should strictly be a UCI option. I have 3 objections:

  1. on Windows, even if allowed through group policy, when used in engine vs engine matches large pages will become unavailable after a while - not sure if that would skew anything, but it is surely undesirable
  2. despite TCEC being on Linux, if the rules of the competition are strictly "no large pages", then we have a problem.
  3. in any case, the user should be warned that using large pages will produce higher CPU temperatures (3-5 degrees).
    Edit: as for the memory leak mentioned before by @skiminki, it is there even without large pages, and I cannot remember whether large pages worsen the problem or not.

@vondele (Member) commented May 4, 2020

@DragonMist concerning:

  1. that's a real problem, and if it is the case it would speak against this feature even under the control of an option. I actually don't see why this would happen: as long as the hash/threads are not frequently changed, this allocated memory is never reallocated.
  2. that rule would be a bit arbitrary. Fortunately, it can be controlled by the user by setting the system policy.
  3. the CPU temp increase just reflects the binary running more efficiently (more instructions per second keeps the CPU busy); in principle, other changes to the code could lead to this as well.

@skiminki (Contributor, Author) commented May 4, 2020

Let me do the following:

  • I'll reduce this PR to large pages support always in "Auto" mode, no new UCI options.
  • I'll do a follow-up PR that adds the UCI option for disabling large pages on Windows. We can then have a discussion on whether we want the UCI option or not. I don't mind if that PR gets rejected.

Also, I concur that the CPU temp increase is because the CPU is doing more work, visible as higher nps. Large pages reduce the number of TLB misses, and every TLB miss essentially means that the CPU stalls (= idles) until the miss has been served. Less idling = more power consumed. It's completely up to the cooling system of the box whether there's a noticeable temp increase, and whether that increase matters. (For example, I don't see any difference on my box, and even if there were a +5C bump, the temps would still be well below limits.)
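
As a back-of-the-envelope illustration of why this helps (page sizes assumed: 4 KiB standard pages and 2 MiB large pages, typical for x86-64):

#include <cstdio>

int main() {
    const unsigned long long hashBytes = 32ULL << 30;   // 32 GiB hash
    const unsigned long long smallPage =  4ULL << 10;   //  4 KiB standard page
    const unsigned long long largePage =  2ULL << 20;   //  2 MiB large page

    // number of distinct pages the TLB has to cover when probing the hash
    std::printf("4 KiB pages: %llu\n", hashBytes / smallPage);   // 8388608
    std::printf("2 MiB pages: %llu\n", hashBytes / largePage);   // 16384
    return 0;
}

Far fewer pages means far fewer TLB misses for the same random hash probes.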

@skiminki (Contributor, Author) commented May 4, 2020

A quick follow-up to item 2. Since this was brought up, I did a quick check, and as far as I can tell at least the following engines always try to use large pages when available:

  • Fizbo 2
  • Houdini (leaked sources)

No one has complained about their attempt to use large pages, not even TCEC during its Windows era.

That said, I wrote my patch strictly based on the Microsoft online documentation. In particular, my code is not based on the code of either of those engines.

@skiminki force-pushed the windows-large-pages branch 3 times, most recently from 6e4fc72 to 6969e75 on May 4, 2020 19:59
@skiminki changed the title from "Add support for Windows large pages [WiP]" to "Add support for Windows large pages" on May 4, 2020
@skiminki (Contributor, Author) commented May 4, 2020

This revision should pass CI. I removed the WiP tag.

@skiminki (Contributor, Author) commented May 4, 2020

One more question: What is the minimum Windows version that should be supported? GetLargePageMinimum() requires Windows Vista/Server 2k3 or later. Is that ok or should we use GetProcAddress() to resolve the function?

The other new WinAPI functions (VirtualAlloc etc.) are available from Windows XP/Server 2k3 onwards.

The above applies only to the 64-bit binary. The 32-bit binary of course continues to use malloc/free.
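
If pre-Vista 64-bit Windows ever had to be supported, the function could be resolved at run time instead of being linked directly; a hedged sketch (helper name made up):

#include <windows.h>

static SIZE_T large_page_minimum_or_zero() {

    // GetLargePageMinimum lives in kernel32.dll on Vista/Server 2003 and later
    HMODULE k32 = GetModuleHandle(TEXT("kernel32.dll"));
    if (!k32)
        return 0;

    typedef SIZE_T (WINAPI *GetLargePageMinimum_t)(void);
    GetLargePageMinimum_t fn =
        (GetLargePageMinimum_t)GetProcAddress(k32, "GetLargePageMinimum");

    return fn ? fn() : 0;   // 0 => treat large pages as unsupported
}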

@noobpwnftw (Contributor) commented
There are very few versions of Windows that have x64 builds earlier than Vista/2k3. I know there is a 64-bit XP, but barely anyone is using it.

@vondele (Member) commented May 4, 2020

@skiminki I don't know much about Windows... So my naive question would be: what's the oldest version that is still supported by Microsoft? Is the functionality absent on any officially supported versions? If not, I would guess we're fine.

@skiminki (Contributor, Author) commented May 4, 2020

I think Windows 8.1 / Server 2k8 is the oldest still supported? https://support.microsoft.com/en-us/help/13853/windows-lifecycle-fact-sheet . As far as I can tell, all supported Windows versions should be happy with this patch.

But I'm not a Windows expert, either. In fact, this is my first time touching Windows code since about 2005...

@skiminki (Contributor, Author) commented May 4, 2020

Ok, CI is now happy. Do you want me to fix the comment? Other than that, I should be done with this PR, unless there's still something more to fix.

@vondele (Member) commented May 4, 2020

@skiminki I think the code is now in shape.

If I commit, I might move some of the docs to the wiki and make the comment in the readme more concise. I'll take care of the comment.

However, this now needs testing. I think normal fishtest is not appropriate, as there are too few Windows machines. I also don't have Windows access, so some people with Windows machines will need to speak up (@CoffeeOne @DragonMist @hero2017 @MichaelB7 etc.).

I guess this needs

  • speed comparison without large pages enabled between master and this branch
  • speed comparison with large pages enabled
  • A long running engine - engine match to check for any problems with large pages.

@vondele added the WIP label on May 10, 2020
@MichaelB7 (Contributor) commented May 10, 2020

I still think the user should get feedback on whether or not large pages are actually being used, since it is not guaranteed they will always be enabled: depending on the amount of RAM and its usage at the time the hash is initialized, large pages can fail even when lock pages in Windows is enabled.

Something like this works (tested):

using namespace std;

namespace {
  size_t memtest = 0; // last hash size for which large pages were reported
}

void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {

  // try to allocate large pages
  mem = aligned_ttmem_alloc_large_pages(allocSize);
  if (mem && memtest != allocSize)
  {
      std::cerr << "info string Large Pages set to " << (allocSize >> 20) << " Mb" << std::endl;
      memtest = allocSize;
  }
  else if (!mem)
  {
      mem = VirtualAlloc(NULL, allocSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
      std::cerr << "info string " << (allocSize >> 20)
                << " Mb for Large Page memory for transposition table not set, switching to default memory" << std::endl;
  }

  // NOTE: VirtualAlloc returns memory at a page boundary, so no need to align for
  // cache lines
  return mem;
}

@skiminki (Contributor, Author) commented May 10, 2020

I believe this concludes the testing. If I gathered everything correctly, the conclusion is:

  • large pages disabled is never a big regression wrt malloc(). In some cases it is a bit faster, and in some cases a bit slower. This is usually within +/- 3%
  • large pages disabled somehow fixes the NUMA regression wrt malloc() on that Opteron box, giving a significant boost for large hashes
  • large pages enabled is sometimes a regression (small hashes on that Opteron box). However, large pages are usually a win, giving around a +5..20% perf boost compared to large pages disabled.

I still think the user should get feedback whether or not large pages is actually being used

I totally agree that if we do the automagic large pages, then some kind of user feedback is needed. But if we do large pages via a UCI option, then we probably don't need the feedback, since SF either launches or it doesn't.

@skiminki Of course further improvements of windows numa performance would be very welcome.

Yeah, but let's do that in a follow-up patch?

Anyways, I'm thinking that we might get more predictable NUMA results if we do the NUMA assignment during alloc using VirtualAllocExNuma, rather than doing the implicit assignment in the hash clear. But this may require more experimenting.

At this point I think we need guidance from @vondele on how we should proceed. My suggestion is to go with this patch as a baseline, and add a true/false UCI option for "Windows Large Pages". The default should be false.

@vondele (Member) commented May 10, 2020

Some questions / remarks:

  • I'd like to know the effect of a small number of threads (8) with a small hash (<=1GB) on a NUMA machine, with lp enabled. This could reflect the fishtest situation. I assume non-NUMA machines cannot be negatively impacted.
  • I don't think a small hash with many threads on a NUMA machine is an important test case. I had a quick look at VirtualAllocExNuma, but I don't think it will help. I believe what we need is an allocation of the hash spread evenly across all the NUMA nodes the threads are bound to. I couldn't see that option, at least on first reading.
  • I would like to know the behavior of the system when large page allocation fails (despite proper permissions), i.e. can we trigger the error condition of receiving a nullptr near line 345 of misc.cpp? Is that process 'graceful', or does the system become unusable before it defaults to normal allocation (or a crash)? @MichaelB7 should a simple check + error output near that line be sufficient?

@MichaelB7 (Contributor) commented
@skiminki

I totally agree that if we do the automagic large pages, then some kind of user feedback is needed. But if we do large pages via an UCI option, then we probably don't need the feedback, since SF either launches or not.

Actually, even when large pages are enabled via a UCI option and lock pages are enabled, it is still possible for large pages not to be used: if memory usage is pushed above 75%, in many cases large pages will not be enabled even when lock pages is set.

Since it requires overt action by the user to set lock pages in Windows to enable large pages, I do not believe a UCI option is necessary, but providing user feedback is necessary and probably sufficient. That way the user can either close the apps that are using large amounts of memory or do a reboot. A user can always disable Windows lock pages to disable large pages, if it was set previously. Of course, if a user never wanted large pages and never enabled Windows lock pages, they will not have to do anything at all.

@CoffeeOne commented
Some questions remarks:

* I'd like to know the effect of small number of threads (8) with small hash (<=1GB) on a NUMA machine, with lp enabled. This could reflect fishtest situation. I assume non-NUMA machines can not be negatively impacted.

A test like this one?
Again using the 8-node Opteron 6386 SE: bench with 512 MB hash, 8 threads, depth 24:
(attached chart: 6386lp8thread)
Large pages disabled and master are on par; the large-pages-enabled version is a tiny bit faster.

It's not really fishtest conditions though, because that machine would have to run 7 8-thread games in parallel (when put on fishtest with concurrency 63), so those 8-thread workers will for sure be slower then.

* I don't think small hash many threads on a NUMA machine is an important testcase. I had a quick look at VirtualAllocExNuma, but I don't think it will help. I believe what we need is an allocation of the hash spread evenly across all the NUMA nodes the threads are bound to. I couldn't see that option, at least on first reading.

I fully agree. The majority of 4-node or 8-node machines will have >=64GB, so the user of the engine will always try to use high values. When all CPUs are used, it makes no sense to "save" some RAM.

@MichaelB7 (Contributor) commented
Some questions remarks:
...
I would like to know the behavior of the system when large page allocation fails (despite proper permissions). I.e. can we trigger the error condition of receiving a nullptr near line 345 of misc.cpp. Is that process 'graceful' or does the system become unusable first before it defaults to normal allocation (or a crash). @MichaelB7 should a simple check + error output near that line be sufficient?

@vondele It is a graceful transition to normal allocation, with maybe a slight 200 to 300 millisecond delay. Currently there is no indicator of the transition, hence my request for some type of notification that it was successful; if the normal allocation occurred, no notification would be required, as it could be assumed large pages were either not set due to failure or not requested. Currently my suggested code reports on both, but it can easily be modified to report success only.

@vondele (Member) commented May 11, 2020

@MichaelB7 knowing it is graceful is important. Concerning the output, is there any reason you send it to std::cerr, or could it just as well be sent to std::cout (via sync_cout)? What do you achieve with the added memtest variable? Basically, could we use this patch instead:

diff --git a/src/misc.cpp b/src/misc.cpp
index ba21a6df0..0f6d983b1 100644
--- a/src/misc.cpp
+++ b/src/misc.cpp
@@ -359,6 +359,10 @@ void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {
 
   // try allocate large pages
   mem = aligned_ttmem_alloc_large_pages(allocSize);
+  if (mem)
+     sync_cout << "info string Hash table allocation: Windows large pages used." << sync_endl;
+  else
+     sync_cout << "info string Hash table allocation: Windows large pages not used." << sync_endl;
 
   // fall back to regular allocation if necessary
   if (!mem)

@MichaelB7 (Contributor) commented May 12, 2020

@vondele
We could use your patch - with sync_cout - and your patch works as designed. One thing I was trying to do was to reduce the number of times the message is generated when there is no change to the hash size.

The current patch (with your changes) produces this output when running bench:

$  Stockfish bench 16 1 13
Stockfish 110520 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
info string Hash table allocation: Windows large pages used.
info string Hash table allocation: Windows large pages used.
info string Hash table allocation: Windows large pages used.

It would be nicer if we could get that down to one message when there is no change to the hash size. I'm not sure of the best way to do that.

@MichaelB7 (Contributor) commented May 12, 2020

@vondele
I rewrote my patch using memtest, incorporating your changes from above, and now we only get the output once:

$  Stockfish bench 16 1 13
Stockfish 110520 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
info string Hash table allocation: Windows large pages used.

Position: 1/47

...

code:

using namespace std;

namespace {
  size_t memtest = 0; // last hash size for which a message was printed
}

// inside aligned_ttmem_alloc(), after the large-page attempt:
  if (mem && memtest != allocSize)
  {
      sync_cout << "info string Hash table allocation: Windows large pages used." << sync_endl;
      memtest = allocSize;
  }
  else if (!mem && memtest != allocSize)
  {
      mem = VirtualAlloc(NULL, allocSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
      sync_cout << "info string Hash table allocation: Windows large pages not used." << sync_endl;
      memtest = allocSize;
  }

@MichaelB7 (Contributor) commented May 12, 2020

@vondele
Here's the output from running bench where I deliberately made large pages fail, even though they were initially set up successfully.
The second message was output due to the change in hash size from the default to the one used by bench.
It failed because I was using a very large amount of memory in another program to force the failure.

$  Stockfish bench 24000 1 13
Stockfish 110520 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
info string Hash table allocation: Windows large pages used.
info string Hash table allocation: Windows large pages not used.

I personally like to see the Mb output when setting the hash, but that is simply my preference.

@skiminki (Contributor, Author) commented
had a quick look at VirtualAllocExNuma, but I don't think it will help.

It doesn't do what we want directly. However, we should be able to do the following:

  • Allocate virtual address space only (i.e., no backing memory) with VirtualAlloc()
  • Allocate NUMA memory backing within that allocation with VirtualAllocExNuma(), so that the backing memory is spread evenly between the NUMA domains. (This is what the multi-threaded thread binding in hash clear tries to do, AFAICT)

Unfortunately, it's not completely clear to me whether the above is supported by Windows for NUMA (but it should be supported for just about everything else), so a bit of experimentation would be needed. Definitely for a later PR.
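
An untested sketch of that two-step idea (whether Windows accepts per-node commits inside one reservation is exactly the open question; names are illustrative):

#include <cstddef>
#include <windows.h>

void* alloc_spread_over_nodes(size_t allocSize, ULONG numNodes) {

    // 1. reserve the address range only, with no backing memory yet
    char* base = (char*)VirtualAlloc(NULL, allocSize, MEM_RESERVE, PAGE_READWRITE);
    if (!base)
        return nullptr;

    // 2. commit one slice of the reservation per NUMA node
    //    (real code would round slices to the page size and handle the remainder)
    size_t slice = allocSize / numNodes;
    for (ULONG node = 0; node < numNodes; ++node)
        if (!VirtualAllocExNuma(GetCurrentProcess(), base + node * slice, slice,
                                MEM_COMMIT, PAGE_READWRITE, node))
        {
            VirtualFree(base, 0, MEM_RELEASE);   // back out on failure
            return nullptr;
        }

    return base;
}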

I rewrote my patch using memtest, using your changes from above and now we only get the output once

You get the output multiple times because Stockfish allocates the hash multiple times. As simple as that. Here is when the allocations happen in a regular run:

  1. Stockfish process is launched. Default alloc to 16 MB (IIRC)
  2. setoption name "Hash" value "xxxxx"
  3. setoption name "Threads" value "xxx"

If we had an option for using large pages, there would also be a resize on:

  1. setoption name "Windows Large Pages" value "true"

At some point I offered a patch to fix this (deferred hash allocation), which would only allocate the hash when needed. I wrote that patch to dramatically speed up SF launch with big hash sizes. The measured speedup was 2..12x on TCEC hardware, depending on the order of setoption "Hash" and setoption "Threads".

The points where we need the hash allocated/resized (if it isn't already) are as follows:

  • isready command
  • any command that might launch a search, particularly the "go" commands
  • bench

This patch was rejected due to the concern that SF could lose on time if broken GUI/tournament software omitted the isready command. E.g.:

uci
setoption name "Hash" value "65536"   # this takes a couple of secs on any HW
# isready missing here
go wtime 1000 winc 100 btime 1000 binc 100 # only now hash is getting allocated, timeout before that's completed

However, only a bit later I realized that even if we allocated the hash in setoption, we'd still get the same timeout if the GUI just spams the commands without isready, as there is no feedback from the setoption command. Anyway, the TCEC fix to avoid the 1-minute launch time was that I just hacked the TCEC cutechess-cli to force 'Threads' always before 'Hash', which otherwise wouldn't be guaranteed at all. This reduced the launch time to around 8 secs. (Options are in an ordered map or hash map in cutechess, IIRC.)

Back to the subject matter. If we want to avoid the "Hash table allocation: Windows large pages used" message being written multiple times, then IMO we should avoid the multiple resizes, instead of hiding the fact that we do multiple resizes.

For reference, the patch for deferred allocs is here: skiminki@f078e43 . (This patch also avoids redundant clears, but that can be taken away.)
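
For illustration only, a purely hypothetical shape of such deferred allocation (not the linked patch; names like pendingResize are made up):

#include <cstddef>

class TranspositionTable {
    std::size_t requestedMB   = 16;    // remembered from 'setoption name Hash'
    bool        pendingResize = true;
public:
    void resize(std::size_t mbSize) {  // setoption only records the request
        requestedMB   = mbSize;
        pendingResize = true;
    }
    void ensure_allocated() {          // called on isready, go and bench
        if (pendingResize) {
            // ... the expensive allocation + clear would happen here ...
            pendingResize = false;
        }
    }
};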

@vondele (Member) commented May 12, 2020

@skiminki I think we should have the output every time we reallocate the hash. So please add the code snippet I posted to this PR. I believe that would be the state ready for merging.

I have not seen a reason for a UCI option so far; this seems to work reasonably well, and users have control anyway. We're still months away from a release, so we can adjust as needed.

@skiminki (Contributor, Author) commented
I think we should have the output every time we reallocate the hash.

Agreed. I'll update the patches, probably later today. IIRC there were also some other minor things; I'll collect them too.

I won't be pushing for the deferred hash resize patch, unless someone else wants it, too.

@Coolchessguykevin commented May 12, 2020

That is, this LP function is really specific and only for advanced users.

Indeed, and it needs their action to set the Windows policy.

If with this patch introduction the Stockfish will default work exactly the same as before

it will.

@Coolchessguykevin commented
@vondele I am surprised that you answered by fully editing my original comment. Why? Was my comment inappropriate for some reason?

@vondele (Member) commented May 12, 2020

@Coolchessguykevin no, sorry, not at all; I must have made a wrong click. (I didn't even know I could edit others' comments.) I also don't see a way to revert it.

@skiminki (Contributor, Author) commented
Full diff to previous version:

diff --git a/Readme.md b/Readme.md
index 5eafd956..68ff8417 100644
--- a/Readme.md
+++ b/Readme.md
@@ -177,17 +177,12 @@ recommended.
 The use of large pages requires "Lock Pages in Memory" privilege. See
 [Enable the Lock Pages in Memory Option (Windows)](https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/enable-the-lock-pages-in-memory-option-windows)
 on how to enable this privilege. Logout/login may be needed
-afterwards.
-
-To detect whether large pages are in use, tool [Sysinternals
-RamMap](https://docs.microsoft.com/en-us/sysinternals/downloads/rammap)
-may be used. After launching the engine, see the 'Large Page' row in
-tab 'Use Counts'. The number should match with the Stockfish hash
-size. Note that the tool does not refresh the contents periodically.
+afterwards. To determine whether large pages are in use, see the
+engine log.
 
 Due to memory fragmentation, memory with large pages may not be always
-possible to allocate. When this is the case, reboot may be
-needed.
+possible to allocate even when enabled. When this is the case, reboot
+may be needed.
 
 ## Compiling Stockfish yourself from the sources
 
diff --git a/src/misc.cpp b/src/misc.cpp
index ba21a6df..4f43b174 100644
--- a/src/misc.cpp
+++ b/src/misc.cpp
@@ -359,6 +359,10 @@ void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {
 
   // try allocate large pages
   mem = aligned_ttmem_alloc_large_pages(allocSize);
+  if (mem)
+      sync_cout << "info string Hash table allocation: Windows large pages used." << sync_endl;
+  else
+      sync_cout << "info string Hash table allocation: Windows large pages not used." << sync_endl;
 
   // fall back to regular allocation if necessary
   if (!mem)
@@ -385,7 +389,7 @@ void* aligned_ttmem_alloc(size_t allocSize, void*& mem) {
 /// aligned_ttmem_free will free the previously allocated ttmem
 #if defined(_WIN64)
 
-void aligned_ttmem_free(void *mem) {
+void aligned_ttmem_free(void* mem) {
 
   if (!VirtualFree(mem, 0, MEM_RELEASE))
   {

@vondele added the "to be merged" label and removed the WIP label on May 13, 2020
@vondele closed this in d476342 on May 13, 2020
@vondele (Member) commented May 13, 2020

@skiminki thanks for the patch, and everybody else for testing.

@hero2017 commented
@skiminki Thank you for finally adding large pages to the official SF release. I've been pushing for this and for NUMA for years, and I'm tired of adding this manually to every release. On my hardware the speedup is astronomical. See my test results in #2619. Again, thank you, sir.
