
PRs for pytorch #1396

Open · wants to merge 2 commits into master

Conversation

@xuhancn commented May 20, 2023

I'm working on improving pytorch's Windows performance; detailed info here: pytorch/pytorch#62387


The tc_malloc library is the best-performing candidate for the final solution, but I need to upstream modifications in two places.

  1. gperftools has a sub-library named "logging", which has the same name as a library in another pytorch submodule, Google's XNNPACK.
    Two libraries with the same name in one CMake project cause the build to fail.
    I added a "gpf_" prefix to the logging library to indicate that it belongs to gperftools.
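
For illustration, the collision and the proposed rename look roughly like this in CMake. This is a sketch only; the target and source-file names are illustrative, not gperftools' or XNNPACK's actual CMakeLists contents:

```cmake
# Two sub-projects in one build tree each defining a target named
# "logging" makes CMake fail with a duplicate-target error:
#
#   add_library(logging logging.cc)   # from gperftools
#   add_library(logging logging.c)    # from XNNPACK -> "add_library cannot
#                                     # create target ... already exists"
#
# The fix in this PR's spirit: prefix the gperftools target so the
# names no longer clash within the combined project.
add_library(gpf_logging logging.cc)
target_link_libraries(tcmalloc_minimal PRIVATE gpf_logging)
```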

  2. I will call tc_malloc and tc_free manually, so I do not need tc_malloc to automatically hook the system malloc/free functions. Automatic hooking would break software stack balancing.

Please review and comment on this PR. If this PR is merged, I will integrate gperftools into pytorch to improve pytorch's performance.

@xuhancn (Author) commented May 26, 2023

@alk, could you please help review this PR?

@alk (Contributor) commented Jun 21, 2023

Hi. Apologies for the delay. Can you please describe the motivation for each of these changes specifically? Why rename a library, and why do you need a define to omit the tcmalloc guard thingy?

Also do note that gperftools' "support" for cmake is best-effort.

Also, I see that your intention is to integrate with pytorch. I know nothing specifically about this project, but I do see that it ships as a loadable python extension. So please beware that replacing malloc in a loadable module is super tricky. I think Windows' approach to DLLs makes it less hard, but super-care still needs to be taken. And you'll need to use the "override" model, not the "patch" model (which may have its own challenges too).

Making it work on ELF (or ELF-like, e.g. OSX) platforms will be super-hard. It is usually much easier to just have your users LD_PRELOAD a faster malloc.

@alk (Contributor) commented Jun 21, 2023

Ah, I missed part of your description above. So the logging thing appears to be an artifact of how you integrate the cmake build. I wonder if there is a cleaner way. Sure, the name thing is just a name, but there should be some more principled way that doesn't let gperftools' cmake details 'leak' into whatever you're doing in your project.

As for the second commit, can you please elaborate on that "balancing" thingy? I am not familiar with this notion of "software stack balancing".

@alk (Contributor) commented Jun 21, 2023

Also, I am genuinely curious. Somehow malloc performance also affects some TF benchmarks (or used to, a few years back).

But I am quite puzzled how an ML workload, which should be super-bottlenecked on number crunching, ends up at least partially dependent on memory allocator performance. This does look like a possible opportunity to beef up your project's performance if you can find how (or whether) it can depend less on dynamic memory allocation.

@xuhancn (Author) commented Jun 27, 2023

> Also, I am genuinely curious. Somehow malloc performance also affects some TF benchmarks (or used to, a few years back).
>
> But I am quite puzzled how an ML workload, which should be super-bottlenecked on number crunching, ends up at least partially dependent on memory allocator performance. This does look like a possible opportunity to beef up your project's performance if you can find how (or whether) it can depend less on dynamic memory allocation.

In order to figure out the malloc performance gap, I wrote a project, bench_malloc. Looking at the results chart, we can see that the biggest gap is in the "malloc & access" data. It is interesting:

  1. The malloc function does not actually allocate physical memory; it only updates the allocator's metadata.
  2. So when is the memory actually allocated? When it is first accessed.
  3. When memory has been malloced but does not physically exist yet, the OS triggers a page_fault exception and then prepares the memory page for the access.
  4. A DL tensor is a large memory area, so it triggers a lot of page_faults.
  5. If DL tensors are created/destroyed frequently, this causes a big memory overhead.
  6. tc_malloc/mimalloc seem to have a mechanism to re-use already-malloced memory, which improves performance.
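
The lazy-commit behavior in points 1–3 can be sketched in a few lines. This is a minimal Python illustration (not the PR's code, and not using tc_malloc); it uses an anonymous mmap, where the kernel likewise only reserves address space up front and supplies physical pages on first access:

```python
# Sketch: reserving memory is cheap; first-touch pays the page faults.
import mmap
import time

PAGE = 4096
SIZE = 64 * 1024 * 1024  # 64 MiB, roughly tensor-sized

def allocate():
    # Anonymous mmap: only address space is reserved here; physical
    # pages are supplied on first access (soft page fault per page).
    return mmap.mmap(-1, SIZE)

def touch(buf):
    # Writing one byte per page forces the kernel to fault in and
    # zero each page; returns the number of pages touched.
    for off in range(0, SIZE, PAGE):
        buf[off] = 1
    return SIZE // PAGE

t0 = time.perf_counter()
buf = allocate()
t1 = time.perf_counter()
pages = touch(buf)
t2 = time.perf_counter()
print(f"reserve: {t1 - t0:.6f}s, first-touch of {pages} pages: {t2 - t1:.6f}s")
```

On Linux you can additionally watch the soft page-fault counter (`ru_minflt` from `resource.getrusage`) jump during the first-touch loop but not on a second pass over the same buffer.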

@xuhancn (Author) commented Jun 27, 2023

> Ah, I missed part of your description above. So the logging thing appears to be an artifact of how you integrate the cmake build. I wonder if there is a cleaner way. Sure, the name thing is just a name, but there should be some more principled way that doesn't let gperftools' cmake details 'leak' into whatever you're doing in your project.
>
> As for the second commit, can you please elaborate on that "balancing" thingy? I am not familiar with this notion of "software stack balancing".

On the second commit: "module_enter_exit_hook" initializes and hooks the malloc/free and new/delete functions of all loaded modules.
In some Windows scenarios this is equivalent to building with the "/MT" parameter, which embeds a private heap into each binary. That crashes when a std::string is accessed across module boundaries.
Please see: https://stackoverflow.com/questions/74241837/msvc-crt-debug-heap-assert-passing-c-stl-object-across-binary-boundaries

Mimalloc has an option to enable its hook functions: https://github.com/microsoft/mimalloc/blob/master/CMakeLists.txt#L10C8-L10C29 and it is turned off by default.

My second commit wants to add a similar switch to turn the hook functions off.
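
Such a switch could look something like the following CMake fragment. The option and macro names here are hypothetical, not the ones in the actual commit; the shape mirrors how mimalloc gates its override behind an option():

```cmake
# Hypothetical sketch: let integrators opt out of the automatic
# malloc/free hooking and call tc_malloc/tc_free explicitly instead.
option(GPERFTOOLS_HOOK_SYSTEM_MALLOC
       "Automatically hook system malloc/free (OFF = call tc_malloc/tc_free manually)"
       ON)

if(NOT GPERFTOOLS_HOOK_SYSTEM_MALLOC)
  # Illustrative macro name; the real commit would define its own guard
  # to compile out the module enter/exit hooking.
  add_compile_definitions(TCMALLOC_DISABLE_SYSTEM_HOOKS)
endif()
```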

@alk (Contributor) commented Jun 27, 2023

For the "hook" thingy, I think I understood. Why not simply use WIN32_OVERRIDE_ALLOCATORS then?

For the perf thingy, a couple of things. First, your microbenchmark might be (and likely is) not representative of the actual performance of whatever workload you're optimizing. So be careful interpreting the numbers there.

Second, I don't doubt that gperftools will be faster in some cases. And yes, we do occasionally behave less aggressively than others w.r.t. returning large allocations back to the kernel (and thus causing page faults and page zeroing when the memory is allocated again). But my point is: whatever is being re-initialized, which includes those larger allocations too, shouldn't be happening as often as it appears to (and we know it is happening, because there is a perf impact). If you apply your work towards avoiding this, you are likely to reap benefits much bigger than playing with malloc tweaks. I.e. if the matrices and whatnot are allocated more or less once, then malloc perf, page faults, etc. won't matter.
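
The allocate-once advice can be illustrated with a small sketch. This is a hedged Python analogy, not pytorch or gperftools code: `bytearray(SIZE)` zero-fills its buffer, so allocating fresh on every iteration re-touches every page each time, while a reused buffer pays that cost only once:

```python
# Sketch: per-iteration allocation vs. allocate-once-and-reuse.
import time

SIZE = 16 * 1024 * 1024  # 16 MiB working buffer
ITERS = 20

def fresh_each_time():
    # Worst case: a new large buffer every iteration, so every
    # iteration zero-fills (touches) all of its pages again.
    for _ in range(ITERS):
        buf = bytearray(SIZE)
        buf[0] = 1
    return ITERS

def reuse_one_buffer():
    # Allocate once; the pages stay committed across iterations.
    buf = bytearray(SIZE)
    for _ in range(ITERS):
        buf[0] = 1
    return ITERS

for fn in (fresh_each_time, reuse_one_buffer):
    t0 = time.perf_counter()
    n = fn()
    print(f"{fn.__name__}: {n} iterations in {time.perf_counter() - t0:.4f}s")
```

The reuse variant is typically much faster, which is exactly the argument that buffer reuse can dwarf any gain from swapping the allocator.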

@xuhancn (Author) commented Jun 27, 2023

> For the "hook" thingy, I think I understood. Why not simply use WIN32_OVERRIDE_ALLOCATORS then?
>
> For the perf thingy, a couple of things. First, your microbenchmark might be (and likely is) not representative of the actual performance of whatever workload you're optimizing. So be careful interpreting the numbers there.
>
> Second, I don't doubt that gperftools will be faster in some cases. And yes, we do occasionally behave less aggressively than others w.r.t. returning large allocations back to the kernel (and thus causing page faults and page zeroing when the memory is allocated again). But my point is: whatever is being re-initialized, which includes those larger allocations too, shouldn't be happening as often as it appears to (and we know it is happening, because there is a perf impact). If you apply your work towards avoiding this, you are likely to reap benefits much bigger than playing with malloc tweaks. I.e. if the matrices and whatnot are allocated more or less once, then malloc perf, page faults, etc. won't matter.

  1. I tried to use "WIN32_OVERRIDE_ALLOCATORS", but it ran into some build failures.
  2. The conclusion that tc_malloc is the best-performing option is not from my bench_malloc tool; it is from an existing case, Bad performance of stock model on Windows compared to Linux pytorch/pytorch#62387, and a brief RFC here: [RFC] Add third-party malloc library to improve pytorch memory performance on Windows pytorch/pytorch#102534.
  3. I have added mimalloc to pytorch now, and that PR was merged. I want to add tc_malloc as well, and then run a series of pytorch benchmarks to select the better one.
  4. On frequent mallocs: pytorch also prepares memory for its submodules, such as oneDNN. In actual scenarios there are a lot of reorder operations and make-memory-contiguous operations. These operations need a lot of temporary buffers, and that is necessary.

@alk (Contributor) commented Jul 3, 2023

BTW, over here: #1392 I see what looks like a more logical way to integrate gperftools into an "outer" cmake project, i.e. without having to rename any internal libraries etc.

You'd still be facing whatever overriding issue there is, which I am unfortunately not yet able to understand. But at least one of your patches won't be needed.
