-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PRs for pytorch #1396
base: master
Are you sure you want to change the base?
PRs for pytorch #1396
Conversation
@alk could you please help on review this PR? |
Hi. Apologies for delay. Can you please describe motivation for each of those changes specifically? Why rename a library and why you need to have define to omit tcmalloc guard thingy. Also do note that gperftools' "support" for cmake is best-effort. Also I see that your intention is to integrate with pytorch. I know nothing specifically about this project but I do see that it ships as loadable python extension. So please beware that replacing malloc in loadable module is super tricky. I think windows' approach to dlls makes it less hard, but still super-care needs to be taken. And you'll need to use "override" not "patch" model (which may have it's own challenges too). Making it work on elf (or elf-like, e.g. osx) platforms will be super-hard. Usually much easier to just have your users LD_PRELOAD faster malloc. |
Ah, I missed part of your description above. So the logging thing appears to be some artifact of how you integrate cmake thingy. I wonder if there is cleaner way. Sure, that name thing is just name, but there should be some more principled way that doesn't have gperftools' cmake details 'leak' into whatever you're doing in your project. As for the second commit, can you please elaborate on that "balancing" thingy? I am not familiar with this notion of "software stack balancing". |
Also I am genuinely curious. Somehow malloc performance also affects some TF benchmarks (or used to affect few years back). But I am quite puzzled how come ML workload which should be super-bottlenecked on number crunching stuff, actually ends up at least partially depending on memory allocator performance. Does look like possible opportunity to beef up your project's performance if you find how/if it can depend less on dynamic memory allocation. |
In order to figure out the malloc performance gap, I wrote a project bench_malloc. Please the result pic:
|
The second commit, "module_enter_exit_hook" will initial and hook all loaded modules "malloc/free" & "new/delete" functions. Mimalloc has a option to enable hook functions: https://github.com/microsoft/mimalloc/blob/master/CMakeLists.txt#L10C8-L10C29 and it is default turned off. My second commit is want to add a switch to turn hook functions off. |
For the "hook" thingy, I think I understood. Why then not simply use WIN32_OVERRIDE_ALLOCATORS ? For the perf thingy couple things. First, your microbenchmark might be (and likely is) not representative of actual performance of whatever workload you're optimizing. So be careful interpreting the numbers there. Second, I don't doubt that gperftools will be in some cases faster. And yes we do occasionally behave less aggressive than others w.r.t. returning large allocations back to kernel (and then causing page faults and pages zeroings when memory is allocated back). But my point is, whatever is being re-initialized which includes those larger allocations too, shouldn't be happening as often as it appears to (and we this is the case because there is perf impact). If you apply your work towards avoiding this, you are likely to reap benefits much much bigger than playing with malloc tweakings. I.e. whatever matrices and what not should be allocated more or less once, then malloc perf and page faults etc won't matter. |
|
BTW over here: #1392 is see what looks like more logical way to integrate gperftools with whatever "outer" cmake project. I.e. without having to rename any internal libraries etc. You'd still be facing whatever overring issue there is, which I am unforunately not able to understand yet. But at least one of your patches won't be needed. |
I'm working on improve pytorch Windows version performance, detailed info here: pytorch/pytorch#62387
Tc_malloc library is the best performance candidate for final solution. But I need upstream two place modification.
gperftools has a sub-library "logging", which is same name as pytorch's another sub module , google's "XNNPACK".
Two same name libraries in the same CMake project will cause build fail.
I add a "gpf_" prefix for logging library to indicate this library is belong to gperftools.
I will call tc_malloc and tc_free manually. And I didn't need tc_malloc automatic hook system malloc/free functions. It would break software stack balancing.
Please review and comment for this PR. If this PR was merged, I will integrate gperftools to pytorch to improve pytorch performance.