
PRs for pytorch #1396

Open · wants to merge 2 commits into master

Conversation

@xuhancn commented May 20, 2023

I'm working on improving pytorch's Windows performance; detailed info here: pytorch/pytorch#62387


The tc_malloc library is the best-performing candidate for the final solution, but I need to upstream modifications in two places.

  1. gperftools has a sub-library named "logging", which has the same name as a library in another pytorch submodule, Google's XNNPACK.
    Two libraries with the same name in one CMake project cause the build to fail.
    I added a "gpf_" prefix to the logging library to indicate that it belongs to gperftools.
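
For illustration, the collision and the proposed rename look roughly like this in CMake. This is a sketch only; the target and source-file names are illustrative, not gperftools' or XNNPACK's actual CMakeLists contents:

```cmake
# Two sub-projects in one build tree each defining a target named
# "logging" makes CMake fail with a duplicate-target error:
#
#   add_library(logging logging.cc)   # from gperftools
#   add_library(logging logging.c)    # from XNNPACK -> "add_library cannot
#                                     # create target ... already exists"
#
# The fix in this PR's spirit: prefix the gperftools target so the
# names no longer clash within the combined project.
add_library(gpf_logging logging.cc)
target_link_libraries(tcmalloc_minimal PRIVATE gpf_logging)
```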

  2. I will call tc_malloc and tc_free manually, so I do not need tc_malloc to automatically hook the system malloc/free functions. Automatic hooking would break software stack balancing.

Please review and comment on this PR. If this PR is merged, I will integrate gperftools into pytorch to improve pytorch's performance.

@xuhancn (Author) commented May 26, 2023

@alk, could you please help review this PR?

@alk (Contributor) commented Jun 21, 2023

Hi. Apologies for the delay. Can you please describe the motivation for each of these changes specifically? Why rename a library, and why do you need a define to omit the tcmalloc guard thingy?

Also do note that gperftools' "support" for cmake is best-effort.

Also, I see that your intention is to integrate with pytorch. I know nothing specifically about this project, but I do see that it ships as a loadable python extension. So please beware that replacing malloc in a loadable module is super tricky. I think Windows' approach to DLLs makes it less hard, but super-care still needs to be taken. And you'll need to use the "override" model, not the "patch" model (which may have its own challenges too).

Making it work on ELF (or ELF-like, e.g. OSX) platforms will be super-hard. It is usually much easier to just have your users LD_PRELOAD a faster malloc.

@alk (Contributor) commented Jun 21, 2023

Ah, I missed part of your description above. So the logging thing appears to be an artifact of how you integrate the cmake build. I wonder if there is a cleaner way. Sure, the name thing is just a name, but there should be some more principled way that doesn't let gperftools' cmake details 'leak' into whatever you're doing in your project.

As for the second commit, can you please elaborate on that "balancing" thingy? I am not familiar with this notion of "software stack balancing".

@alk (Contributor) commented Jun 21, 2023

Also, I am genuinely curious. Somehow malloc performance also affects some TF benchmarks (or used to, a few years back).

But I am quite puzzled how an ML workload, which should be super-bottlenecked on number crunching, ends up at least partially dependent on memory allocator performance. This does look like a possible opportunity to beef up your project's performance if you can find how (or whether) it can depend less on dynamic memory allocation.

@xuhancn (Author) commented Jun 27, 2023

> Also, I am genuinely curious. Somehow malloc performance also affects some TF benchmarks (or used to, a few years back).
>
> But I am quite puzzled how an ML workload, which should be super-bottlenecked on number crunching, ends up at least partially dependent on memory allocator performance. This does look like a possible opportunity to beef up your project's performance if you can find how (or whether) it can depend less on dynamic memory allocation.

In order to figure out the malloc performance gap, I wrote a project, bench_malloc. Looking at the results chart, we can see that the biggest gap is in the "malloc & access" data. It is interesting:

  1. The malloc function does not actually allocate physical memory; it only updates the allocator's metadata.
  2. So when is the memory actually allocated? When it is first accessed.
  3. When memory has been malloced but does not physically exist yet, the OS triggers a page_fault exception and then prepares the memory page for the access.
  4. A DL tensor is a large memory area, so it triggers a lot of page_faults.
  5. If DL tensors are created/destroyed frequently, this causes a big memory overhead.
  6. tc_malloc/mimalloc seem to have a mechanism to re-use already-malloced memory, which improves performance.
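
The lazy-commit behavior in points 1–3 can be sketched in a few lines. This is a minimal Python illustration (not the PR's code, and not using tc_malloc); it uses an anonymous mmap, where the kernel likewise only reserves address space up front and supplies physical pages on first access:

```python
# Sketch: reserving memory is cheap; first-touch pays the page faults.
import mmap
import time

PAGE = 4096
SIZE = 64 * 1024 * 1024  # 64 MiB, roughly tensor-sized

def allocate():
    # Anonymous mmap: only address space is reserved here; physical
    # pages are supplied on first access (soft page fault per page).
    return mmap.mmap(-1, SIZE)

def touch(buf):
    # Writing one byte per page forces the kernel to fault in and
    # zero each page; returns the number of pages touched.
    for off in range(0, SIZE, PAGE):
        buf[off] = 1
    return SIZE // PAGE

t0 = time.perf_counter()
buf = allocate()
t1 = time.perf_counter()
pages = touch(buf)
t2 = time.perf_counter()
print(f"reserve: {t1 - t0:.6f}s, first-touch of {pages} pages: {t2 - t1:.6f}s")
```

On Linux you can additionally watch the soft page-fault counter (`ru_minflt` from `resource.getrusage`) jump during the first-touch loop but not on a second pass over the same buffer.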

@xuhancn (Author) commented Jun 27, 2023

> Ah, I missed part of your description above. So the logging thing appears to be an artifact of how you integrate the cmake build. I wonder if there is a cleaner way. Sure, the name thing is just a name, but there should be some more principled way that doesn't let gperftools' cmake details 'leak' into whatever you're doing in your project.
>
> As for the second commit, can you please elaborate on that "balancing" thingy? I am not familiar with this notion of "software stack balancing".

On the second commit: "module_enter_exit_hook" initializes and hooks the malloc/free and new/delete functions of all loaded modules.
In some Windows scenarios this is equivalent to building with the "/MT" parameter, which embeds a private heap into each binary. That crashes when a std::string is accessed across module boundaries.
Please see: https://stackoverflow.com/questions/74241837/msvc-crt-debug-heap-assert-passing-c-stl-object-across-binary-boundaries

Mimalloc has an option to enable its hook functions: https://github.com/microsoft/mimalloc/blob/master/CMakeLists.txt#L10C8-L10C29 and it is turned off by default.

My second commit wants to add a similar switch to turn the hook functions off.
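
Such a switch could look something like the following CMake fragment. The option and macro names here are hypothetical, not the ones in the actual commit; the shape mirrors how mimalloc gates its override behind an option():

```cmake
# Hypothetical sketch: let integrators opt out of the automatic
# malloc/free hooking and call tc_malloc/tc_free explicitly instead.
option(GPERFTOOLS_HOOK_SYSTEM_MALLOC
       "Automatically hook system malloc/free (OFF = call tc_malloc/tc_free manually)"
       ON)

if(NOT GPERFTOOLS_HOOK_SYSTEM_MALLOC)
  # Illustrative macro name; the real commit would define its own guard
  # to compile out the module enter/exit hooking.
  add_compile_definitions(TCMALLOC_DISABLE_SYSTEM_HOOKS)
endif()
```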

@alk (Contributor) commented Jun 27, 2023

For the "hook" thingy, I think I understood. Why not simply use WIN32_OVERRIDE_ALLOCATORS then?

For the perf thingy, a couple of things. First, your microbenchmark might be (and likely is) not representative of the actual performance of whatever workload you're optimizing. So be careful interpreting the numbers there.

Second, I don't doubt that gperftools will be faster in some cases. And yes, we do occasionally behave less aggressively than others w.r.t. returning large allocations back to the kernel (and thus causing page faults and page zeroing when the memory is allocated again). But my point is: whatever is being re-initialized, which includes those larger allocations too, shouldn't be happening as often as it appears to (and we know it is happening, because there is a perf impact). If you apply your work towards avoiding this, you are likely to reap benefits much bigger than playing with malloc tweaks. I.e. if the matrices and whatnot are allocated more or less once, then malloc perf, page faults, etc. won't matter.
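
The allocate-once advice can be illustrated with a small sketch. This is a hedged Python analogy, not pytorch or gperftools code: `bytearray(SIZE)` zero-fills its buffer, so allocating fresh on every iteration re-touches every page each time, while a reused buffer pays that cost only once:

```python
# Sketch: per-iteration allocation vs. allocate-once-and-reuse.
import time

SIZE = 16 * 1024 * 1024  # 16 MiB working buffer
ITERS = 20

def fresh_each_time():
    # Worst case: a new large buffer every iteration, so every
    # iteration zero-fills (touches) all of its pages again.
    for _ in range(ITERS):
        buf = bytearray(SIZE)
        buf[0] = 1
    return ITERS

def reuse_one_buffer():
    # Allocate once; the pages stay committed across iterations.
    buf = bytearray(SIZE)
    for _ in range(ITERS):
        buf[0] = 1
    return ITERS

for fn in (fresh_each_time, reuse_one_buffer):
    t0 = time.perf_counter()
    n = fn()
    print(f"{fn.__name__}: {n} iterations in {time.perf_counter() - t0:.4f}s")
```

The reuse variant is typically much faster, which is exactly the argument that buffer reuse can dwarf any gain from swapping the allocator.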

@xuhancn (Author) commented Jun 27, 2023

> For the "hook" thingy, I think I understood. Why not simply use WIN32_OVERRIDE_ALLOCATORS then?
>
> For the perf thingy, a couple of things. First, your microbenchmark might be (and likely is) not representative of the actual performance of whatever workload you're optimizing. So be careful interpreting the numbers there.
>
> Second, I don't doubt that gperftools will be faster in some cases. And yes, we do occasionally behave less aggressively than others w.r.t. returning large allocations back to the kernel (and thus causing page faults and page zeroing when the memory is allocated again). But my point is: whatever is being re-initialized, which includes those larger allocations too, shouldn't be happening as often as it appears to (and we know it is happening, because there is a perf impact). If you apply your work towards avoiding this, you are likely to reap benefits much bigger than playing with malloc tweaks. I.e. if the matrices and whatnot are allocated more or less once, then malloc perf, page faults, etc. won't matter.

  1. I tried to use "WIN32_OVERRIDE_ALLOCATORS", but it ran into some build failures.
  2. The conclusion that tc_malloc is the best-performing option is not from my bench_malloc tool; it is from an existing case, Bad performance of stock model on Windows compared to Linux pytorch/pytorch#62387, and a brief RFC here: [RFC] Add third-party malloc library to improve pytorch memory performance on Windows pytorch/pytorch#102534.
  3. I have added mimalloc to pytorch now, and that PR was merged. I want to add tc_malloc as well, and then run a series of pytorch benchmarks to select the better one.
  4. On frequent mallocs: pytorch also prepares memory for its submodules, such as oneDNN. In actual scenarios there are a lot of reorder operations and make-memory-contiguous operations. These operations need a lot of temporary buffers, and that is necessary.

@alk (Contributor) commented Jul 3, 2023

BTW, over here: #1392 I see what looks like a more logical way to integrate gperftools into an "outer" cmake project, i.e. without having to rename any internal libraries etc.

You'd still be facing whatever overriding issue there is, which I am unfortunately not yet able to understand. But at least one of your patches won't be needed.
