[tcmalloc] O(n) address-ordered best-fit over PageHeap::large_ becomes major scalability bottleneck on fragmented heap #535

Closed
alk opened this Issue Aug 23, 2015 · 70 comments

@alk
Contributor

alk commented Aug 23, 2015

Originally reported on Google Code with ID 532

We run tcmalloc on our very high throughput MySQL master. Join buffers > kMaxPages are
allocated by a large percentage of client threads. Over time, as fragmentation increases,
the large_.returned list grows to contain thousands of Spans, which makes PageHeap::AllocLarge
extremely slow and turns pageheap_lock into a major source of contention, eventually
slowing down MySQL considerably.
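For context, the O(n) search in question is essentially a linear best-fit scan over every free large span. A simplified sketch (the Span type and BestFit helper here are hypothetical stand-ins, not tcmalloc's actual internals):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for tcmalloc's internals: an address-ordered
// best-fit scan visits every free large span, so its cost grows linearly
// with fragmentation, all while holding pageheap_lock.
struct Span {
    size_t start;   // first page number of the span
    size_t length;  // span length in pages
};

Span* BestFit(const std::vector<Span*>& large, size_t n) {
    Span* best = nullptr;
    for (Span* s : large) {              // O(n) over all free large spans
        if (s->length < n) continue;     // too small for this request
        if (best == nullptr ||
            s->length < best->length ||  // tighter fit wins
            (s->length == best->length && s->start < best->start)) {
            best = s;                    // lower address breaks ties
        }
    }
    return best;
}
```

With thousands of Spans on the combined lists, every large allocation pays this full scan under the lock, which is the contention described above.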

I have written a patch which maintains a doubly linked skiplist to compute address-ordered
best-fit in amortized O(log(n)) instead of O(n) when the combined size of the large_
normal and returned lists exceeds a configurable threshold. We're running it in production
now, and according to perf top, the percentage of time spent in PageHeap::AllocLarge
has gone from ~8% overall to ~0.4%.

However, while the time complexity of the skiplist is favorable, its space complexity
(each node is 168 bytes) is obviously substantially less favorable than something like
a red/black tree. I haven't had a chance to do my own measurements yet, but it seems
from everything I've read that allocator metadata size has been shown to have a substantial
impact on CPU cache efficiency, so while this patch may lessen a scalability bottleneck,
it may have negative performance effects. I'm up for implementing a different data
structure for this if that has a better chance of being accepted.

I've also written a small test program that attempts to create a sizeable large_ list,
and then times 100000 allocations and frees from the large list. It's far from perfect,
but it's something.

Patch: https://github.com/jamesgolick/gperftools/compare/master...skiplist-trunk
Test program: https://gist.github.com/jamesgolick/cffa78146af264c191fb

Reported by jamesgolick on 2013-05-21 23:38:00

@alk

Contributor

alk commented Aug 23, 2015

Note that https://github.com/jamesgolick/gperftools/commit/1a87924c05ce33eb9b2b5ba78c38ba9d8653838b
is needed to run the test program against trunk (to display the length of the combined
large lists).

Reported by jamesgolick on 2013-05-22 16:49:00

@alk

Contributor

alk commented Aug 23, 2015

I was thinking about an rbtree, i.e. either from libbsd or from the STL (which would
require some carefulness with its allocators).

Reported by alkondratenko on 2013-07-06 19:57:07

@alk

Contributor

alk commented Aug 23, 2015

Reported by alkondratenko on 2013-07-06 23:12:33

  • Status changed: Accepted
@alk

Contributor

alk commented Aug 23, 2015

Forgot to say: I'll take a look at your stuff soon, by this coming Saturday at the latest.
Thanks a lot for taking the time to do something about this issue.

Reported by alkondratenko on 2013-07-08 17:21:26

@alk

Contributor

alk commented Aug 23, 2015

I think this is good enough. And I think it's quite a strong candidate for the 2.2 release.

I was thinking that maybe we should look at the entire page-span allocator as a whole
and implement an efficient, scalable data structure for it. Our competition is using
rb-trees and b-trees. We could also consider a simpler, dlmalloc-like organization
of the page-span size-class bins.

And btw, independently of tcmalloc, for a Couchbase Server experiment I've implemented
a simple rb-tree based allocator here: https://github.com/alk/minimalloc

Anyway I think dealing with large_ spans is a good start and we can always continue
improving later.

Having said all that, I think we should also double-check whether there are other things
we should consider improving instead of, or in addition to, scalable large_ span searching.

In particular, I'm wondering how allocating chunks of more than 1 MB can be a noticeable
bottleneck even after your optimization. My understanding is that actually using such a
big chunk is probably going to take at least hundreds of thousands of cycles, so optimized
malloc-ing of those chunks should not even be noticeable in a profile. Unless I'm missing
something (e.g. big chunks are allocated, yet only a fraction of each is actually used),
or there is some other reason for the slowness.

Reported by alkondratenko on 2013-07-13 22:15:26

@alk

Contributor

alk commented Aug 23, 2015

It's hard to know the answer to that question without seeing the test program you're
using.

I have some other ideas, like adding some size classes above 1MB, but this is still
important and needs solving either way.

I have some patches that take a running sample of large allocations and attempt to
determine which additional size classes are worth adding to the thread caches based
on that data (and then add them). I think there's something there, but a lot more
thought needs to go into it.

I do think that this is a good solution for now, and we can think about adding more
stuff later, unless you've got a better idea that we can implement immediately?

Incidentally, I'd like to get a lot more involved in this project. Is there anything
you need help with?

Reported by jamesgolick on 2013-07-15 23:16:03

@alk

Contributor

alk commented Aug 23, 2015

I was wondering why, in your reported case, the amount of time spent allocating spans
out of large_ dropped only to 0.4%.

It's good that you have a tool to record page span allocations. I think that would
be really valuable so that we can capture some particular allocation pattern and then
explore it offline using potentially different allocation implementations.

So with that tool maybe we can dig more into why it's just 0.4% and not something far
smaller.

Regarding help: take a look at the tickets. There are a few OS X tickets that may need
some help. Otherwise, consider reviewing the things in the following threads: https://groups.google.com/forum/#!topic/google-perftools/8AZz3GNxH-o
and https://groups.google.com/forum/#!topic/google-perftools/PQR4zQ6CQYs

Also, given that we're nearly at RC, any further testing on any platforms you care about
is nice help as well.

Reported by alkondratenko on 2013-07-16 15:10:27

@alk

Contributor

alk commented Aug 23, 2015

I don't think that it's still particularly a bottleneck, actually - it's just that MySQL
allocates a lot of big chunks.

For example, tcmalloc::SLL_Next(void*) is currently at 0.5% in perf top, and PageHeap::AllocLarge
isn't on the first page at all.

Reported by jamesgolick on 2013-07-19 18:35:55

@alk

Contributor

alk commented Aug 23, 2015

Looks like we're having some misunderstanding.

I'm simply wondering why you saw such a high percentage of time spent in this particular
code without the patch. I.e., after allocating a big buffer I expect the application to
do something reasonably heavy with all of it. And why, with the patch, it only went down
to 0.4%.

I'm wondering because there may be something else asking for attention in this particular
case.

Reported by alkondratenko on 2013-07-24 02:50:40

@alk

Contributor

alk commented Aug 23, 2015

Been on vacation for a while, so sorry for the delay, but...

As for why GetBestFit can still be a bottleneck: perf top shows that most of the time
in that function is spent in this comparison:

https://github.com/jamesgolick/gperftools/commit/011657b62e85102346f4dc2219bfa2d22fa8a89d#L3R123

about 25% addressing the Span object:

 27.05 │      mov    0x8(%rax),%rax

and 50% mispredicting the branch:

 44.08 │      cmp    -0x30(%rbp),%rax

Did my addition of a pointer to the Span objects, or the size of the skip-list nodes,
cause some regression with respect to them fitting in cache? I'm not sure. Ideas?

Reported by jamesgolick on 2013-09-28 18:26:35

@alk

Contributor

alk commented Aug 23, 2015

Interesting result. Is there any reasonably simple test I can run myself to dig deeper?

Reported by alkondratenko on 2013-09-29 02:12:02

@alk

Contributor

alk commented Aug 23, 2015

I was using something like this https://gist.github.com/jamesgolick/9a02b796693b044d21e2
to create some fragmentation and inflate the size of the large lists - perhaps just
modify the thread_loops to run forever?

Reported by jamesgolick on 2013-09-29 05:42:50

@alk

Contributor

alk commented Aug 23, 2015

Been thinking about this a lot, and I'm really not sure how to write a test to expose
this issue.

Is there any data that I can gather from my running mysqld that you think might give
us some clues?

Reported by jamesgolick on 2013-09-30 05:04:00

@alk

Contributor

alk commented Aug 23, 2015

Perhaps you can try to record allocation/free requests via hooks somehow. The challenge
is obviously going to be to efficiently and correctly record temporal relations between
different threads.

Reported by alkondratenko on 2013-09-30 18:19:42

@alk

Contributor

alk commented Aug 23, 2015

Any other ideas for how we can design a test to expose this? This ticket addresses a
major scalability issue and it would be great to find a way to be able to get the patch
ready for acceptance.

Reported by jamesgolick on 2013-10-03 01:35:09

@alk

Contributor

alk commented Aug 23, 2015

Maybe you have some particular MySQL benchmark that exposes this condition?

Reported by alkondratenko on 2013-10-03 02:44:29

@alk

Contributor

alk commented Aug 23, 2015

The following demonstrates O(n) very well too: https://gist.github.com/alk/6956552

I've tried your patch (rebased against master). And:

*) it fails to compile (due to a missing stdio.h include for fprintf)

*) make check and the program above crash with a segmentation fault

Let's get your code into commit-ready shape (no fprintfs please, and no crashes) and
try to exercise it as much as we can.

Reported by alkondratenko on 2013-10-13 00:31:40

@alk

Contributor

alk commented Aug 23, 2015

Ok - I'm on it.

Reported by jamesgolick on 2013-10-16 10:00:33

@alk

Contributor

alk commented Aug 23, 2015

Okay, fixed. New patch is here: https://github.com/jamesgolick/gperftools/compare/trunk...skiplist-trunk

I'm also working on implementing a left-leaning red black tree to compare performance.

Reported by jamesgolick on 2013-10-30 16:22:19

@alk

Contributor

alk commented Aug 23, 2015

Note that you can take one from jemalloc. Or from libbsd (as I did for my minimalloc
experiment).

Unless of course you actually want to implement it yourself.

BTW, the jemalloc author has a blog post somewhere on the performance of various rbtree
implementation options. Might be worth checking out.

Reported by alkondratenko on 2013-10-30 16:36:36

@alk

Contributor

alk commented Aug 23, 2015

So, I used the jemalloc code, and got a 20% improvement over my skip list using your
random_mallocer as a benchmark. I would imagine that the reduced space complexity would
make an even bigger difference in a more realistic test because there would be so much
less cache contention walking the tree.

https://github.com/jamesgolick/gperftools/compare/trunk...llrb

The only thing is that we probably don't want to use the jemalloc code, since it's
implemented as an 800-line set of macros. If we want to go in this direction, we can:
1) expand the jemalloc macros and clean up the code, 2) find another rbtree to use, or
3) implement our own.

Reported by jamesgolick on 2013-11-01 16:33:20

@alk

Contributor

alk commented Aug 23, 2015

I'll need clean patches without traces of "internal" fixes (like fprintf).

Also please do rebase against latest master:

git rebase --onto googlecode/master jamesgolick/trunk

Reported by alkondratenko on 2013-11-17 03:05:47

@alk

Contributor

alk commented Aug 23, 2015

The llrb functions, as well as rb.h, should be included from the .cc file, not the .h.

Reported by alkondratenko on 2013-11-17 03:33:52

@alk

Contributor

alk commented Aug 23, 2015

There are also fixable warnings:

src/page_heap.h: In constructor 'tcmalloc::PageHeap::PageHeap()':
src/page_heap.h:303:11: warning: 'tcmalloc::PageHeap::scavenge_counter_' will be initialized
after [-Wreorder]
   int64_t scavenge_counter_;
           ^
src/page_heap.h:243:10: warning:   'size_t tcmalloc::PageHeap::large_lists_size_' [-Wreorder]
   size_t large_lists_size_;
          ^
src/page_heap.cc:64:1: warning:   when initialized here [-Wreorder]
 PageHeap::PageHeap()
 ^
In file included from src/static_vars.h:43:0,
                 from src/page_heap.cc:43:
src/page_heap.h:306:7: warning: 'tcmalloc::PageHeap::release_index_' will be initialized
after [-Wreorder]
   int release_index_;
       ^
src/page_heap.h:246:8: warning:   'bool tcmalloc::PageHeap::using_large_llrb_' [-Wreorder]
   bool using_large_llrb_;
        ^
src/page_heap.cc:64:1: warning:   when initialized here [-Wreorder]
 PageHeap::PageHeap()
 ^

Reported by alkondratenko on 2013-11-17 03:35:12

@alk

Contributor

alk commented Aug 23, 2015

In general I do like the code. Just please make it super clean, without unrelated
changes etc. I believe the C-macro-based red-black tree code from jemalloc is fine.

BTW, what does LLRB mean?

Also, I'd not bother with the linked list and kLargeLLRBThreshold. Consider making
everything simpler by always keeping large spans in the red-black tree.

I think we're close to merge-able code. Thanks a lot for your effort here.

Reported by alkondratenko on 2013-11-17 03:46:16

@alk

Contributor

alk commented Aug 23, 2015

There's a compilation error:

In file included from src/page_heap.cc:41:0:
src/page_heap.cc: In member function 'tcmalloc::Span* tcmalloc::PageHeap::AllocLarge(Length)':
src/page_heap.cc:219:10: error: 'bestNormal' was not declared in this scope
   ASSERT(bestNormal == NULL);
          ^
src/internal_logging.h:112:9: note: in definition of macro 'CHECK_CONDITION'
   if (!(cond)) {                                                         \
         ^
src/page_heap.cc:219:3: note: in expansion of macro 'ASSERT'
   ASSERT(bestNormal == NULL);
   ^
In file included from src/llrb.h:41:0,
                 from src/page_heap.h:45,
                 from src/static_vars.h:43,
                 from src/page_heap.cc:43:


This happens when -DNDEBUG is not passed to the preprocessor.

Reported by alkondratenko on 2013-11-17 03:52:30

@alk

Contributor

alk commented Aug 23, 2015

No prob. There's still one bug right now, where the test program attempts to free an
invalid pointer. I'm trying to figure out whether it's another bug in the test program
or not. Running this test program against my llrb branch takes a minute or two; running
against master takes 30+ hours :-), so I'll let you know what the result is soon.

Reported by jamesgolick on 2014-07-21 02:36:05

@alk

Contributor

alk commented Aug 23, 2015

The second bug appears to be caused by the unsignedness of i in the countdown loop.
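The bug class in question: with an unsigned index, `i >= 0` is always true, so a naive countdown loop never terminates and `i--` wraps around to SIZE_MAX. A minimal sketch with a safe idiom (the SumReversed helper is hypothetical, not the actual test program):

```cpp
#include <cstddef>
#include <vector>

// Buggy pattern, for reference: since size_t is unsigned, the condition
// never becomes false and the index wraps, reading out of bounds:
//
//     for (size_t i = v.size() - 1; i >= 0; i--)  // BUG: never exits
//
// A safe countdown idiom tests before decrementing:
int SumReversed(const std::vector<int>& v) {
    int sum = 0;
    for (size_t i = v.size(); i-- > 0; ) {  // visits v[size-1] .. v[0]
        sum += v[i];
    }
    return sum;
}
```

The `i-- > 0` form also handles the empty-vector case correctly, since the test fires before the first array access.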

Reported by alkondratenko on 2014-07-21 02:44:39

@alk

Contributor

alk commented Aug 23, 2015

Yeah, you're right. This is running consistently with no crashes now for me. What are
the next steps?

Reported by jamesgolick on 2014-07-21 03:12:30

@alk

Contributor

alk commented Aug 23, 2015

The code needs cleanup. I'm seeing:

* Trailing whitespace

* spurious whitespace-only changes (span.h)

* unused fields (large_lists_size, kLargeLLRBThreshold) and methods (InitializeLargeLLRB,
LLRB::Include)

There are seemingly larger problems as well:

* leaking of rbn_right_red and friends to llrb.h (and details of RB_COMPACT)

* LLRB::GetBestFit looks wrong. Consider a case where the best node is somewhere in the
middle of the tree, with both left and right pointers non-nil. My understanding is that
the rb.h API is able to handle "find the least node that's larger than a given value".
I used it in minimalloc (although my code is based on the rbtree from libbsd).

* it seems best if we embed LLRBNode into Span, for performance, memory usage, etc.


And in addition to that, we really need at least a basic unit test that covers this
new code. I'm sure it must be possible to write a unit test that only deals with the
LLRB and doesn't require much of the rest of the code.

And thanks for working on this. It is really helping massively.

Reported by alkondratenko on 2014-07-22 05:20:23

@alk

Contributor

alk commented Aug 23, 2015

Ah. Forgot another important downside:

* the old code searched for a match among normal spans first and only used returned
spans after that. Your code does not keep that important behavior.

Reported by alkondratenko on 2014-07-22 05:42:41

@alk

Contributor

alk commented Aug 23, 2015

I don't see any trailing whitespace over here. Can you point it out?

I fixed the unused fields and functions.

I think GetBestFit is actually right, because it has the second condition on the leftward
movement regarding page size. I'll write some tests to try to prove that. There is a
function in rb.h that will do something *similar* but not quite the same; I could add a
function to rb.h that does what we want, if you'd prefer it there.

Regarding embedding LLRBNode into Span, I'm not sure. It actually seems to me like
having the tree nodes packed together might be faster?

Regarding the old code matching normal spans first: unless I'm misreading it, I don't
think that's what it actually does? It seems like it prefers the left-most address over
a span from the normal list. If you want to prefer spans from the normal list, we can
make two trees, or change the algorithm a little.

Reported by jamesgolick on 2014-07-22 14:02:38

@alk

Contributor

alk commented Aug 23, 2015

>> I don't see any trailing whitespace over here. Can you point it out?

GetBestFit has a few trailing tabs on its first if line

>> Regarding embedding llrbnode in to span, I'm not sure. It actually seems to me like
having the tree nodes packed together might be faster?

Probably not. In particular, during tree traversal you have to touch Span fields (size
& addr) anyway. So you're just adding another pointer and extra memory or cache access
latency.

>> Regarding the old code matching normal spans first, unless I'm misreading it I don't
think that's what it actually does?

You're right; due to the address-ordered best fit, it's not aggressive enough. We
definitely should be doing something more aggressive there, because preferring a normal
span to returned space actually is the right thing to do. The simplest thing we can do
is tweak the comparison so that it looks at size first, then normal/returned, then
address. But I'd also consider something even stronger, i.e. preferring a normal span
that is slightly larger than the best returned span.
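The tweaked comparison could look something like this (a sketch with hypothetical Span fields, not tcmalloc's actual struct):

```cpp
#include <cstddef>

// Hypothetical Span fields for illustration: order candidates by size
// first, then prefer "normal" spans (still mapped) over "returned" ones
// (given back to the OS), then break remaining ties by address.
struct Span {
    size_t start;    // first page number
    size_t length;   // length in pages
    bool returned;   // true if the pages were released to the OS
};

bool SpanLess(const Span& a, const Span& b) {
    if (a.length != b.length) return a.length < b.length;   // size first
    if (a.returned != b.returned) return !a.returned;       // normal first
    return a.start < b.start;                               // then address
}
```

Used as the tree's strict weak ordering, this makes a best-fit search naturally pick a normal span before a returned span of the same size.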

And some more thoughts.

* Ideally we'd find some way to avoid exposing rb.h in headers. Or maybe we could
expose just the smallest part of it. If we don't embed LLRBNode into Span, we can
actually do that right now (llrb.h doesn't really need the full definition of this
struct). But I think embedding is the right thing to do, so we might want to split out
the part of rb.h that defines the structures to include in headers.

* GetBestFit is certainly not right. Consider another case: you're walking to the left
and find that your current node's left child is smaller than the size you need, and you
stop right there, ignoring the possibility that the right subtree of that child has a
better fit.

* GetBestFit apparently accesses rbn_right_red incorrectly, without taking into account
the redness bit stored in its least-significant bits.

* I'm slightly concerned about a couple of cases of this pattern: walk the tree to find
something, then walk the tree again to delete it. But this is probably not a big deal.

* rb.h lacks copyright header. Should be easy to fix.
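For reference, the "least node that still fits" search mentioned above can be done in a single O(log n) descent by remembering the best candidate instead of stopping early. A sketch with a hypothetical plain-BST node type, not the actual rb.h macros (the tree is assumed ordered by length):

```cpp
#include <cstddef>

// Hypothetical node type for illustration; rb.h's red-black nodes would
// carry the same keys plus color bits.
struct Node {
    size_t length;
    Node* left;
    Node* right;
};

// Find the least node with length >= n. When the current node fits, record
// it and keep looking left for a smaller fit; otherwise go right. Because
// the candidate is remembered rather than committed to, a too-small left
// child never cuts off a better fit hiding in its right subtree.
Node* LowerBound(Node* root, size_t n) {
    Node* best = nullptr;
    for (Node* cur = root; cur != nullptr; ) {
        if (cur->length >= n) {
            best = cur;        // candidate; try to improve on the left
            cur = cur->left;
        } else {
            cur = cur->right;  // too small; only the right side can fit
        }
    }
    return best;               // least fitting node, or nullptr
}
```

This is the standard ordered-tree lower-bound walk; the same shape should be expressible over the real red-black nodes.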

Reported by alkondratenko on 2014-07-22 17:13:12

@alk

Contributor

alk commented Aug 23, 2015

Hi:

My best friend, James Golick, passed away unexpectedly nearly two weeks ago. He and I
discussed this patch at length, and I would like to pick up the work where he left off,
as I know how much it meant to him to have this merged.

As such I have:

* Added a license blurb and copyright attribution to rb.h (I contacted the original
author to verify).
* Split out structure definitions from rb.h into rb_defs.h. I used the copyright attribution
of the original rb.h, as I didn't really do anything except copy code around.

You can find these changes here: https://github.com/ice799/gperftools/compare/googlecode-master...llrb

From reading this thread, it seems the remaining pieces are:

* Some sort of unit test for this.
* Some issues with GetBestFit.

He had removed the spurious whitespace and unused fields and methods himself.

Please let me know if my understanding of what is remaining to have this merged is
correct.

Thanks.

Reported by ice799 on 2015-01-09 04:23:40

@alk

Contributor

alk commented Aug 23, 2015

I'm very sorry to hear that.

Indeed, those two issues remain. For GetBestFit, we need it to find the least-address
chunk that is larger than the allocation request. That is doable efficiently in O(log n)
time.

Reported by alkondratenko on 2015-01-10 19:27:19

@hendrik42

hendrik42 commented May 30, 2017

Hi, I'm experiencing performance slowdowns to a standstill, similar to issues 663 and 713, which are marked as duplicates of this one.

Was there a resolution or workaround to this problem or is it still open?

@alk

Contributor

alk commented May 30, 2017

@alk

Contributor

alk commented Jun 4, 2017

Hi. Consider testing the patch on this branch: https://github.com/alk/gperftools/tree/wip-log-n-page-heap. It implements O(log n) manipulation of the set of large spans, so it should fix the scaling issues. It seems to work in my testing.

@hendrik42

hendrik42 commented Jun 6, 2017

Hi Aliaksey,

thanks for the quick responses. I'm running a multi-threaded computer vision program with frequent allocation and deallocation of images. It uses 5-10GB of memory and usually works great, but about 1 in 50 instances slow down to a standstill after running it for 24h.

Since my last email, I ran it with TCMALLOC_TRANSFER_NUM_OBJ=40 and
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=2199023255552, as suggested in #713.

This improved it considerably, but I still found occasional slowdowns. I will run your patch and report back if that solves it conclusively.

@alk

Contributor

alk commented Jun 6, 2017

cloudera-gerrit pushed a commit to cloudera/kudu that referenced this issue Oct 13, 2017

KUDU-2184. Avoid allocations larger than 1MB
tcmalloc doesn't perform well for allocations larger than 1MB[1]. We
currently use large allocations for full arenas (in particular the MRS)
which can cause large latency blips when a thread tries to allocate a
new arena component.

This just changes the max buffer size for all arenas to be 1MB and adds
a DCHECK so we avoid this problem in the future.

Additionally I increased the buffer size for the /memz output so that we
can get more stats on the occupancy of different allocator size classes.

[1] gperftools/gperftools#535

Change-Id: I19c8ae9ce268e2623a89624e19db673931a093c0
Reviewed-on: http://gerrit.cloudera.org:8080/8239
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <davidralves@gmail.com>
@toddlipcon

Contributor

toddlipcon commented Jan 19, 2018

@alk any update on the WIP std::set<> based page-heap implementation you linked above? Curious if you have plans to finish it up. AllocLarge continues to be a concurrency bottleneck for us in long-running server processes.

@alk

Contributor

alk commented Feb 4, 2018

There is an ongoing effort at Google to rewrite the page heap for hugepage friendliness, so I didn't want to touch the page heap. But since that effort is taking a bit longer than we expected, I think it is okay to accept simple fixes for the existing page heap. So if you want to pick up and polish this code, I'll happily take such a patch (or maybe it already looks perfect and works on your workloads; then just let me know). I was thinking we should switch the std::set to some intrusive balanced-tree implementation (e.g. boost has AVL and RB trees), but it is probably okay without.

@alk

Contributor

alk commented Feb 4, 2018

I.e., for the next couple of months I'll be busy with some other work and cannot spend much time on the page heap. But I will be happy to review finished work :)

@toddlipcon

Contributor

toddlipcon commented Feb 5, 2018

Yea, I was also thinking that an intrusive tree (eg from boost) would be great. But I didn't think that a boost dependency in gperftools would be acceptable. It seems like copy-paste importing boost::intrusive::* data structures is also non-trivial. As with most things in boost there is some viral templating going on and we'd need to import half of Boost :)

@toddlipcon

Contributor

toddlipcon commented Feb 11, 2018

I played a bit with your branch and also compared it with implementations based on some other data structures, using your 'random-mallocer' benchmark as a baseline. Of course all variants are orders of magnitude better than the O(n) code, but I was also able to note some differences between the various alternatives. Baseline results:

 Performance counter stats for './random-mallocer -n 10000':

    13,194,426,376      cycles
        55,064,295      cache-misses

       6.618084836 seconds time elapsed

By using a set<pair<Length,Span*>> instead of a set<Span*>, we can avoid an extra indirection/cache miss on most comparisons. Since the allocator for the set also allocates from a small set of pages, it's likely that many of the set nodes can share TLB entries and avoid TLB misses too. The cost is an extra 8 bytes per large span, but given that large spans are always at least 1MB, I don't think that's much of a concern (~0.0008% overhead).
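A sketch of that keying scheme (with a hypothetical simplified Span, not the actual patch): a lower_bound on the (length, pointer) pair finds the best fit without dereferencing any Span during the walk. Note the probe key relies on nullptr comparing no greater than any real pointer, which holds on mainstream platforms.

```cpp
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical stand-in for tcmalloc's Span. Duplicating the length into
// the set key means ordering comparisons during the tree walk read only
// the pair stored in the set node and never chase the Span pointer.
struct Span {
    size_t start;
    size_t length;
};

using SpanSet = std::set<std::pair<size_t, Span*>>;

// Best fit: the first entry whose length >= n.
Span* BestFit(const SpanSet& large, size_t n) {
    auto it = large.lower_bound(std::make_pair(n, static_cast<Span*>(nullptr)));
    return it == large.end() ? nullptr : it->second;
}
```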

This speeds things up by about 10%:

 Performance counter stats for './random-mallocer -n 10000':

    11,998,739,572      cycles
        52,524,568      cache-misses

       6.065321629 seconds time elapsed

I also tried boost::intrusive::set which gave similar results to the above:

 Performance counter stats for './random-mallocer -n 10000':

    11,972,782,765      cycles
        56,905,719      cache-misses

       6.022657601 seconds time elapsed

... but of course it involves pulling in boost, which is a bit of a pain. It also expanded the size of the Span struct for the member hook, which wouldn't cooperate with being put inside a union.

A few other things I experimented with with negative results:

  • trying to remember the iterator after the removed span and passing it back into std::set::insert as a hint in Carve(). I'm not sure if I implemented it wrong or if it just isn't effective, but it caused a 5-10% slowdown.
  • boost::intrusive::splay_set: 20% slower
  • boost::intrusive::avl_set: 5-10% slower

All the above were only run with a relatively small number of iterations of the benchmark. Perhaps if you run it longer things get more fragmented and the results could differ. But, seeing as all the solutions are orders of magnitude better than what's in tcmalloc today, and the std::set with the length caching isn't too complex, I think we should move ahead with cleaning up that code and merging it. My hacky code is posted at toddlipcon@1bb3db5 but I can take on cleaning it up and merging with your original patch.

In terms of testing, do you think there are any new specific tests to be added or do the existing system tests cover this enough? I'll try it out on an app or two here just to make sure it doesn't explode.

@alk

Contributor

alk commented Feb 11, 2018

This is awesome. Ship it :)

In terms of tests, let's make sure the new code is covered. I don't think anything beyond that is necessary yet.

@toddlipcon

Contributor

toddlipcon commented Feb 11, 2018

Working on cleaning up the patch... one question about your code:

In PageHeap::ReleaseAtLeastNPages, we previously would pop from the back of the large normal span list. That effectively released the least-recently-used span, since any allocation from a large span would previously prepend the post-carved "leftover" span back to the front. That's a good heuristic because it would likely favor span sizes that weren't useful for allocations and reduce fragmentation over time, right?

In the new code it seems you're using rbegin() in this method. That will always pick the largest NORMAL span to release. I'm afraid of a couple items here:

  1. This may exacerbate issues like #754 since, even if we just need to release a few MB, we may choose to release hundreds of MB if there is some large span available. As we saw in the discussion of that issue, madvising large amounts of memory can be quite slow.
  2. I would guess that this will result in an accumulation of a larger number of small spans over time rather than tending towards a smaller number of big spans, since we're always releasing the "best" span we have.

Curious whether you agree with the above points, and if so what you think the best approach is:
a) release the smallest span in the large list
b) still pick the largest span, but carve off the appropriate amount if the size of the span is much larger than the requested number of pages-to-release
c) attempt to keep some recency stats of each Span, and then iterate over the first few smallest spans to try to find one which is "not-that-recently-used"
d) use 'best fit' to release the first span that is larger than the requested number of pages

It seems choice 'a' is the simplest (just a matter of removing a single character from your code) but figured I'd get your take rather than try to sneak a small change in that may have larger repercussions.

toddlipcon added a commit to toddlipcon/gperftools that referenced this issue Feb 12, 2018

Implemented O(log n) searching among large spans
This is implemented via std::set with custom STL allocator that
delegates to PageHeapAllocator. Free large spans are not linked
together via linked list, but inserted into std::set. Spans also store
iterators to std::set positions pointing to them. So that removing
span from set is fast too.

Patch implemented by Aliaksey Kandratsenka and Todd Lipcon based on
earlier research and experimentation by James Golick.

Addresses issue gperftools#535

alk added a commit that referenced this issue Feb 25, 2018

Implemented O(log n) searching among large spans
This is implemented via std::set with custom STL allocator that
delegates to PageHeapAllocator. Free large spans are not linked
together via linked list, but inserted into std::set. Spans also store
iterators to std::set positions pointing to them. So that removing
span from set is fast too.

Patch implemented by Aliaksey Kandratsenka and Todd Lipcon based on
earlier research and experimentation by James Golick.

Addresses issue #535

[alkondratenko@gmail.com: added Todd's fix for building on OSX]
[alkondratenko@gmail.com: removed unnecessary Span constructor]
[alkondratenko@gmail.com: added const for SpanSet comparator]
[alkondratenko@gmail.com: added operator != for STLPageHeapAllocator]
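The "Spans also store iterators" part of the commit message can be sketched as follows (hypothetical field and function names, not the merged code):

```cpp
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical sketch of the removal path: each free large Span stores the
// std::set iterator pointing at itself, so taking it out of the set needs
// no second tree search.
struct Span;
using SpanSet = std::set<std::pair<size_t, Span*>>;

struct Span {
    size_t start;
    size_t length;
    SpanSet::iterator pos;  // where this span currently sits in the set
};

void InsertSpan(SpanSet& spans, Span* s) {
    s->pos = spans.insert(std::make_pair(s->length, s)).first;
}

void RemoveSpan(SpanSet& spans, Span* s) {
    spans.erase(s->pos);  // erase by iterator: no extra lookup needed
}
```

Erasing through the stored iterator avoids the find-then-delete double walk flagged earlier in the thread.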
@alk

Contributor

alk commented Feb 25, 2018

Done :) Much thanks. I think the new release logic is harmless, or possibly an improvement.

@alk alk closed this Feb 25, 2018
