[tcmalloc] O(n) address-ordered best-fit over PageHeap::large_ becomes major scalability bottleneck on fragmented heap #535
Comments
Hi, I'm experiencing performance slowdowns to a standstill, similar to issues #663 and #713, which are marked as duplicates of this one. Was there a resolution or workaround for this problem, or is it still open?
No fixes yet. But you can help by describing your use case more precisely. There is a new page heap implementation in the works, so describing your workload would help us ensure it meets your needs.
Hi. Consider testing the patch on this branch: https://github.com/alk/gperftools/tree/wip-log-n-page-heap. It implements O(log n) manipulation of the set of large spans, so it should fix the scaling issue. It seems to work in my testing.
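The core idea behind such an O(log n) page heap (sketched here from the description above, not taken from the actual branch) is to keep free large spans in an ordered set so that address-ordered best-fit becomes a single lower_bound instead of a linear scan. A minimal sketch, with a hypothetical simplified Span:

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical, simplified Span; the real tcmalloc Span carries much more state.
struct Span {
  uintptr_t start;   // first page number
  uintptr_t length;  // number of pages
};

// Order by (length, start): lower_bound() then finds the smallest span that
// fits, with address order breaking ties -- the same address-ordered best-fit
// policy the old O(n) scan implemented, but in O(log n).
struct SpanBestFitLess {
  bool operator()(const Span* a, const Span* b) const {
    if (a->length != b->length) return a->length < b->length;
    return a->start < b->start;
  }
};

using SpanSet = std::set<Span*, SpanBestFitLess>;

// Returns (and removes) the best-fit span with at least n pages, or nullptr.
Span* AllocLarge(SpanSet& free_spans, uintptr_t n) {
  Span key{0, n};  // start 0: ties at exactly n pages prefer any real span
  SpanSet::iterator it = free_spans.lower_bound(&key);
  if (it == free_spans.end()) return nullptr;
  Span* best = *it;
  free_spans.erase(it);
  return best;
}
```

Note this is only the lookup side; the real page heap also has to carve the chosen span and reinsert the leftover.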
Hi Aliaksey, thanks for the quick responses. I'm running a multi-threaded computer vision program with frequent allocation and deallocation of images. It uses 5-10GB of memory and usually works great, but about 1 in 50 instances slow down to a standstill after running for 24h. Since my last email, I ran it with TCMALLOC_TRANSFER_NUM_OBJ=40 and TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=2199023255552, as suggested in #713. This improved it considerably, but I still found occasional slowdowns. I will run your patch and report back if that solves it conclusively.
Thanks. With the latest code, lowering TRANSFER_NUM_OBJ should no longer be necessary. I'd also strongly suggest checking for system-level issues if the symptom is a near-full standstill after a while. In particular, check for swapping-related issues, e.g. by observing the rate of major page faults (and minor ones too, just in case).
If the problem still occurs, I would like you to report more details. Ideally I'd have the output of MallocExtension::instance()->GetStats and a CPU profile from a process exhibiting the problem.
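One way to watch the fault rates mentioned above, besides tools like vmstat, is to read them from the process itself. A self-contained, Linux-only sketch (not part of gperftools) that parses the counters out of /proc/self/stat, fields 10 and 12 per proc(5):

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Linux-only sketch: read the current process's minor/major page fault
// counters from /proc/self/stat (fields 10 and 12 per proc(5)).
// Returns false if the file can't be read or parsed.
bool GetPageFaults(long* minflt, long* majflt) {
  std::ifstream f("/proc/self/stat");
  if (!f) return false;
  std::string line;
  std::getline(f, line);
  // Field 2 (comm) is parenthesized and may contain spaces; skip past it.
  std::string::size_type pos = line.rfind(')');
  if (pos == std::string::npos) return false;
  std::istringstream in(line.substr(pos + 1));
  std::string tok;
  for (int field = 3; field <= 12; ++field) {  // state is field 3
    if (!(in >> tok)) return false;
    if (field == 10) *minflt = std::stol(tok);
    if (field == 12) *majflt = std::stol(tok);
  }
  return true;
}
```

Sampling this periodically and looking at the delta gives the fault rate; a steadily climbing major-fault count while the app is at a standstill points at swapping rather than the allocator.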
@alk any update on the WIP std::set<> based page-heap implementation you linked above? Curious if you have plans to finish it up. AllocLarge continues to be a concurrency bottleneck for us in long-running server processes. |
There is an ongoing effort at Google to rewrite the page heap for hugepage friendliness, so I didn't want to touch the page heap. But since that effort is taking a bit longer than we expected, I think it is okay to accept simple fixes for the existing page heap. So if you want to pick up and polish this code, I'll happily take such a patch (or maybe it already looks perfect and works on your workloads, in which case just let me know). I was thinking we should switch std::set to some intrusive balanced tree implementation (e.g. Boost has AVL and red-black trees), but it is probably okay without.
I.e. for the next couple of months I'll be busy with some other work and cannot spend much time on the page heap. But I will be happy to review finished work :)
Yea, I was also thinking that an intrusive tree (e.g. from Boost) would be great. But I didn't think a Boost dependency in gperftools would be acceptable, and copy-paste importing the boost::intrusive::* data structures is also non-trivial. As with most things in Boost there is some viral templating going on, and we'd need to import half of Boost :)
Played a bit with your branch and also compared it against implementations based on some other data structures, using your 'random-mallocer' benchmark as the baseline. Of course all variants are orders of magnitude better than the O(n) scan, but I was also able to note some differences between the alternatives. Baseline results:
By using a set<pair<Length,Span*>> instead of a set<Span*>, we can avoid an extra indirection (and likely cache miss) on most comparisons. Since the set's allocator also allocates from a small set of pages, it's likely that many of the set nodes can share TLB entries and avoid TLB misses too. The cost is an extra 8 bytes per large span, but given that large spans are always at least 1MB, I don't think that's of much concern (0.0008% overhead). This speeds things up by about 10%:
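A sketch of that keying scheme (simplified, with the custom PageHeapAllocator omitted): since std::pair compares lexicographically, the set naturally gives (length, address) order, and comparisons read the length stored in the tree node instead of chasing the Span pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

typedef uintptr_t Length;  // page count

struct Span {
  Length length;
  // ... real spans carry the start page, list pointers, etc.
};

// Keying the set on pair<Length, Span*> duplicates the length into the tree
// node, so most comparisons during a lookup never dereference the Span
// pointer: fewer cache misses. pair's operator< compares lexicographically,
// giving (length, address) order.
using SpanSet = std::set<std::pair<Length, Span*>>;

// Best-fit: first entry with length >= n. nullptr serves as the lowest
// tie-break value (raw-pointer ordering is technically implementation-
// defined, but this is the usual idiom).
std::pair<Length, Span*> AllocLarge(SpanSet& free_spans, Length n) {
  SpanSet::iterator it =
      free_spans.lower_bound(std::make_pair(n, (Span*)nullptr));
  if (it == free_spans.end()) return std::make_pair(Length(0), (Span*)nullptr);
  std::pair<Length, Span*> best = *it;
  free_spans.erase(it);
  return best;
}
```

The 8 extra bytes per span mentioned above are the Length copy duplicated into the set key.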
I also tried boost::intrusive::set which gave similar results to the above:
... but of course it involves pulling in Boost, which is a bit of a pain. It also expanded the size of the Span struct for the member hook, which wouldn't cooperate with being put inside a union. A few other things I experimented with, with negative results:
All of the above were run with a relatively small number of benchmark iterations. Perhaps if you run it longer things get more fragmented and the results could differ. But seeing as all the solutions are orders of magnitude better than what's in tcmalloc today, and the std::set with length caching isn't too complex, I think we should move ahead with cleaning up that code and merging it. My hacky code is posted at toddlipcon@1bb3db5, but I can take on cleaning it up and merging it with your original patch. In terms of testing, do you think any specific new tests should be added, or do the existing system tests cover this enough? I'll try it out on an app or two here just to make sure it doesn't explode.
This is awesome. Ship it :) In terms of tests lets make sure new code is covered. I don't think anything beyond that is necessary yet |
Working on cleaning up the patch... one question about your code: in PageHeap::ReleaseAtLeastNPages, we previously popped from the back of the large normal span list. That effectively released the least-recently-used span, since any allocation from a large span would prepend the post-carved "leftover" span back to the front. That's a good heuristic because it would likely favor span sizes that weren't useful for allocations and reduce fragmentation over time, right? In the new code you're using rbegin() in this method, which will always pick the largest NORMAL span to release. I'm afraid of a couple of items here:
Curious whether you agree with the above points, and if so, what you think the best approach is. Choice 'a' seems simplest (just a matter of removing a single character from your code), but I figured I'd get your take rather than sneak in a small change that may have larger repercussions.
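For illustration only (this is a guess at the shape of the code in question, not the actual patch): on a set ordered by (length, address), the two release policies under discussion differ only in which end of the set is popped.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

typedef uintptr_t Length;
struct Span { Length length; };

// Free large spans ordered by (length, address), as in the patch.
using SpanSet = std::set<std::pair<Length, Span*>>;

// What rbegin() in ReleaseAtLeastNPages does: always hand back the
// largest NORMAL span.
Span* PickLargest(const SpanSet& s) {
  return s.empty() ? nullptr : s.rbegin()->second;
}

// The opposite policy: release the smallest free span, i.e. the sizes
// least likely to satisfy future large allocations.
Span* PickSmallest(const SpanSet& s) {
  return s.empty() ? nullptr : s.begin()->second;
}
```

Neither matches the old LRU-ish behavior exactly, since the set no longer preserves insertion order at all.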
This is implemented via std::set with custom STL allocator that delegates to PageHeapAllocator. Free large spans are not linked together via linked list, but inserted into std::set. Spans also store iterators to std::set positions pointing to them. So that removing span from set is fast too. Patch implemented by Aliaksey Kandratsenka and Todd Lipcon based on earlier research and experimentation by James Golick. Addresses issue #535 [alkondratenko@gmail.com: added Todd's fix for building on OSX] [alkondratenko@gmail.com: removed unnecessary Span constructor] [alkondratenko@gmail.com: added const for SpanSet comparator] [alkondratenko@gmail.com: added operator != for STLPageHeapAllocator]
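The commit's "spans store iterators to std::set positions" trick can be sketched like this (hypothetical simplified types, not the actual patch code): keeping the iterator inside the span makes removal a direct erase with no tree search.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

typedef uintptr_t Length;

struct Span;  // forward-declared so the set type can name it
using SpanSet = std::set<std::pair<Length, Span*>>;

// Hypothetical simplified Span: while a free large span sits in the set,
// it remembers its own position there.
struct Span {
  Length length;
  SpanSet::iterator pos;  // valid only while the span is in the set
};

void InsertSpan(SpanSet& s, Span* sp) {
  // insert() returns pair<iterator, bool>; keep the iterator.
  sp->pos = s.insert(std::make_pair(sp->length, sp)).first;
}

// Removal needs no search: erase(iterator) is amortized O(1), and std::set
// iterators stay valid across unrelated insertions and erasures.
void RemoveSpan(SpanSet& s, Span* sp) {
  s.erase(sp->pos);
}
```

This is why the commit message can claim that removing a span from the set is fast too, not just the best-fit lookup.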
Done :) Many thanks. I think the new release logic is harmless, or possibly an improvement.
Originally reported on Google Code with ID 532 by jamesgolick on 2013-05-21 23:38:00.