I'm working on an application with a memory limit of about 500MB (FLAGS_tcmalloc_heap_limit_mb). It only mallocs objects of two sizes, and frees them randomly.
Consider the following scenario:
This is more noticeable in a multi-threaded environment. Any ideas? Thanks.
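For illustration, here is a minimal sketch of the kind of allocation pattern described above. The two request sizes, the object count, and the survivor ratio are all assumed for the example, not taken from the original report:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <random>
#include <vector>

// Hypothetical sketch of the pattern described above: only two malloc
// sizes, freed in random order. Patterns like this tend to leave spans
// mostly-but-not-entirely free, so resident memory can stay near its
// peak even after most objects are freed.
size_t run_pattern() {
  constexpr size_t kSmall = 64, kLarge = 4096;  // assumed size classes
  constexpr size_t kCount = 100000;             // assumed object count
  std::vector<void*> live;
  live.reserve(kCount);
  for (size_t i = 0; i < kCount; ++i)
    live.push_back(std::malloc(i % 2 ? kSmall : kLarge));

  // Free ~90% of the objects in random order; the random survivors are
  // what keep many spans pinned.
  std::mt19937 rng(42);
  std::shuffle(live.begin(), live.end(), rng);
  const size_t survivors = kCount / 10;
  while (live.size() > survivors) {
    std::free(live.back());
    live.pop_back();
  }
  // Only ~10% of the bytes are live here, yet the heap high-water mark
  // (and often RSS) still reflects the earlier peak.
  const size_t still_live = live.size();
  for (void* p : live) std::free(p);
  return still_live;
}
```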
You're right. In the worst case, and even in some real-world cases, tcmalloc can fragment memory a lot. The same is true (sometimes to a lesser extent) of any malloc implementation; jemalloc can be worse in some cases and better in others.
glibc malloc, which is still somewhat based on Doug Lea's malloc, is actually decent at handling corner cases, but it too may fragment memory a lot at times.
If your app triggers one of the hard corner cases for malloc implementations, then the only advice I can give you is to adapt your app somehow.
Thanks for your reply.
That's bad news. I'll have to do something to flush the ThreadCache and TCEntry in some cases.
On another note, I have read through the entire TCMalloc source code.
I'm wondering whether there is any way to disable or shrink the TCEntry to reduce span holding without slowing things down too much.
I noticed that Span->objects is a simple singly linked free list, fetched as a range and released one object at a time.
It could carry a richer data structure when the object size is > 8 bytes.
Following the idea of skip lists, I tried organizing it into skip-list nodes to support range fetching.
Unfortunately, as you know, while a skip list has excellent lookup performance,
prepending is relatively slow because several levels have to be maintained.
And the number of levels is limited by the object size: smaller objects mean a longer free list, but fewer levels.
On average, the skip list performs better when span->objects is long enough (in my tests, with page size == 64KB and object size < 256 bytes).
So I tried organizing it into a tiny, alternative skip list with just 2 levels, plus a tail pointer and an "objects length" field.
This balances the lookup and prepend operations, and when the number of objects being fetched is >= the list length, the whole list can be detached directly.
In my random "fetch from" and "release to" span testing, it speeds things up by about 20-30%.
Maybe it's a good idea.
It needs about 10-12 extra bytes per Span on a 64-bit system for the tail pointer and length field.
(Maybe we could compact Span->length from uintptr_t to uint32_t or uint16_t to save some memory?)
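To make the shape of this concrete, here is a hypothetical sketch of such a two-level free list (this is not the actual Span code, and the promotion interval is an assumed value): level 1 is the ordinary singly linked object list; every kSkip-th prepended node is also linked into level 2; a tail pointer plus a length field let "fetch N >= length" detach the whole list in O(1). For brevity the partial fetch below walks level 1; the full proposal would follow the level-2 links to skip ahead.

```cpp
#include <cstddef>

struct Node {
  Node* next = nullptr;  // level-1 link
  Node* skip = nullptr;  // level-2 link (set only on promoted nodes)
};

class TwoLevelFreeList {
 public:
  static constexpr size_t kSkip = 4;  // promotion interval (assumed value)

  void push(Node* n) {  // prepend one released object
    n->next = head_;
    n->skip = nullptr;
    if (head_ == nullptr) tail_ = n;
    head_ = n;
    ++length_;
    if (++pushes_ % kSkip == 0) {  // every kSkip-th prepend also goes
      n->skip = skip_head_;        // into the level-2 list
      skip_head_ = n;
    }
  }

  // Detach up to n objects from the head; returns the detached sublist
  // and stores its actual length in *got.
  Node* fetch(size_t n, size_t* got) {
    if (n == 0) { *got = 0; return nullptr; }
    if (n >= length_) {  // whole list: O(1) thanks to tail_ and length_
      Node* h = head_;
      *got = length_;
      head_ = tail_ = skip_head_ = nullptr;
      length_ = 0;
      return h;
    }
    // Partial fetch: walk level 1 (the full proposal would use the
    // level-2 links here to skip kSkip nodes at a time).
    Node* h = head_;
    Node* cur = head_;
    for (size_t i = 1; i < n; ++i) {
      if (cur == skip_head_) skip_head_ = skip_head_->skip;  // demote
      cur = cur->next;
    }
    if (cur == skip_head_) skip_head_ = skip_head_->skip;
    head_ = cur->next;
    cur->next = nullptr;
    length_ -= n;
    *got = n;
    return h;
  }

  size_t length() const { return length_; }

 private:
  Node* head_ = nullptr;
  Node* tail_ = nullptr;
  Node* skip_head_ = nullptr;  // level-2 head (nearest to head_)
  size_t length_ = 0;
  size_t pushes_ = 0;
};
```

Because prepends happen at the head, the level-2 list naturally stays in list order (most recently promoted node nearest the head), which keeps both promotion on push and demotion on fetch O(1) per node touched.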
You can disable the thread cache if that's a problem for you. Just build with -DTCMALLOC_SMALL_BUT_SLOW. But as far as I understand, it will not in general affect the worst cases of fragmentation. That worst-case fragmentation occurs from spans that have most, but not all, of their objects free while we need objects of another size class. And that doesn't look like something that can be fixed with an improved freelist representation.
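For reference, passing that define through a gperftools source build might look like the following (the exact configure invocation is an assumption and may differ for your setup):

```shell
# Assumed gperftools source-tree build; adjust flags for your environment.
./configure CXXFLAGS="-DTCMALLOC_SMALL_BUT_SLOW -O2"
make -j"$(nproc)"
```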
Regarding this idea, I was thinking about something along these lines, but very different. I haven't had time in the last few months to actually implement it, though. All I have is some prototyping code that I used to see how quick and compact the code could be, and so far it looks promising. It can be seen here: https://gist.github.com/alk/09a387957fc78aa25b29
The idea is inspired by "binary representations" chapter from Okasaki's "Purely Functional Data Structures" book.
My thinking so far is that this representation has a chance to be fast for the "push/pop one object" case as well as for "get N objects" and "add N objects", with all cases doing just a few memory accesses. It only needs two words per object, plus some manageable overhead for per-thread free lists, transfer caches (which in this case could be just a single free list per size class and per CPU, I think), and the free lists in spans.
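To give a flavor of the binary-representation idea (this is a toy sketch, unrelated to the prototype in the gist, and it stores ints rather than object pointers): keep the free list as a "binary number" of chunks, where digit i is either empty or holds exactly 2^i objects. Pushing one object is a binary increment (a carry merges equal-size chunks), popping is a decrement (a borrow splits a chunk), and fetching N objects can hand over whole chunks matching N's binary digits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy sketch of an Okasaki-style binary-representation free list.
// digits_[i] holds either 0 or exactly 2^i objects.
class BinaryFreeList {
 public:
  void push(int obj) {
    std::vector<int> carry{obj};  // a chunk of size 1
    for (size_t i = 0;; ++i) {
      if (i == digits_.size()) digits_.push_back({});
      if (digits_[i].empty()) {   // digit is 0: place the chunk here
        digits_[i] = std::move(carry);
        break;
      }
      // digit is 1: merge the two equal-size chunks and carry upward
      digits_[i].insert(digits_[i].end(), carry.begin(), carry.end());
      carry = std::move(digits_[i]);
      digits_[i].clear();
    }
    ++size_;
  }

  int pop() {
    assert(size_ > 0);
    size_t i = 0;
    while (digits_[i].empty()) ++i;  // find the lowest set digit
    std::vector<int> chunk = std::move(digits_[i]);
    digits_[i].clear();
    int obj = chunk.back();
    chunk.pop_back();
    // Redistribute the remaining 2^i - 1 objects into digits 0..i-1
    // (a binary "borrow").
    for (size_t j = i; j-- > 0;) {
      const size_t half = size_t{1} << j;
      digits_[j].assign(chunk.end() - half, chunk.end());
      chunk.resize(chunk.size() - half);
    }
    --size_;
    return obj;
  }

  size_t size() const { return size_; }

 private:
  std::vector<std::vector<int>> digits_;
  size_t size_ = 0;
};
```

Both operations touch at most O(log n) chunks, and whole chunks can be transferred between caches without walking their contents.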
But all that needs more work, and I could be wrong. In the next few months I'm unlikely to have time to work on this idea.
In any case, feel free to pursue the skip-list idea if you think it'll work fine.