I'm working on an application whose memory is limited to about 500MB (FLAGS_tcmalloc_heap_limit_mb). It mallocs in just two sizes, and frees randomly.
Consider the following scenario:
Build with TCMALLOC_LARGE_PAGES64K defined (64KB pages), and run in a multi-threaded environment (for example, 10 threads).
Every thread keeps mallocing 128 bytes (size class 9) until allocation fails (malloc returns NULL). This means the central cache will New lots of Spans of 1 page (64KB) each --> fetched by the ThreadCache --> handed to the application.
Now the ThreadCache is empty, and the Spans' freelists above are empty too (every object is in use).
Free this memory randomly. The blocks flow application --> ThreadCache->list_ --> released back to the central cache.
Finally, ThreadCache->list_ holds some cached blocks that still belong to Spans. This keeps Span->refcount != 0, so many Spans can't be Deleted in ReleaseToSpans().
Each Span holds at least 1 page (64KB). So if, for example, one ThreadCache->list_ holds 500 blocks, it could pin 400 Spans --> 400 pages --> 400 * 64KB = 25MB. And we have 10 threads: 25MB * 10 = 250MB of memory that can't be used by any other size class.
But we limit TCMalloc to 500MB. If I then stop mallocing 128 bytes and instead malloc 65536 bytes (size class 72) for other purposes, I can't use the entire 500MB of memory, only about 250MB.
Isn't that a big waste?
This is more obvious in a multi-threaded environment. Any ideas? Thanks.
You're right. In the worst case, and even in some real-world cases, tcmalloc can fragment memory a lot. This is also true (sometimes to a lesser extent) of any malloc implementation. jemalloc could be even worse in some cases and better in others.
glibc malloc, which is still somewhat based on Doug Lea's malloc, is actually decent at dealing with corner cases, but it may still fragment memory a lot sometimes.
If your app triggers some hard corner cases for malloc implementations, the only advice I can give you is to adapt your app somehow.
Changed the title to "TCMalloc could hold most of the memory useless" on Jan 25, 2016.
Thanks for your reply.
That's bad news. I'll have to do something to flush the ThreadCache and TCEntry in some cases.
On another note, I have read the source code of TCMalloc in its entirety.
I'm wondering whether there's any chance to disable/reduce the TCEntry to decrease span holding, without slowing things down too much.
I noticed that Span->objects is a simple freelist (a singly linked list), fetched as a range and released one object at a time.
It could carry a richer data structure when the object size is > 8 bytes.
Following the idea of a skip list, I tried organizing it into skip-list nodes for range fetching.
Unfortunately, as you know, a skip list has excellent lookup performance,
but prepend is relatively slow because it has to maintain several levels.
And the number of levels is limited by the object size: smaller objects give a longer freelist but allow fewer levels.
On average, the skip list performs better when span->objects is long enough (in my tests, with pagesize == 64KB and object size < 256 bytes).
So I tried organizing it into a tiny, alternative skip list with just 2 levels, plus a tail pointer and an object-count field.
It balances the lookup and prepend operations, and when the number of objects to fetch is >= the list length, the whole list can be detached directly.
In my random fetch-from-span and release-to-span testing, it speeds things up by about 20-30%.
Maybe, it's a good idea.
It needs about 10-12 more bytes per Span on a 64-bit system for the tail pointer and length field.
(Maybe we could shrink Span->length from uintptr_t to uint32_t or uint16_t to save some memory?)
You can disable the thread cache if that's a problem for you: just build with -DTCMALLOC_SMALL_BUT_SLOW. But as far as I understand, it will not in general affect the worst cases of fragmentation. That worst-case fragmentation comes from spans that have most, but not all, of their objects free while we need objects of another size class, and that doesn't look like something that can be fixed by an improved freelist representation.
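For reference, one way to pass that define when building gperftools from source (a sketch; adjust flags to your own build setup):

```shell
# Illustrative: build gperftools' tcmalloc with the small-but-slow
# configuration by defining TCMALLOC_SMALL_BUT_SLOW at compile time.
# Run from a gperftools source checkout.
./configure CXXFLAGS="-O2 -DTCMALLOC_SMALL_BUT_SLOW"
make
```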
Regarding this idea: I was thinking about something along these lines, but very different. I haven't had time in recent months to actually implement it. All I have is some prototyping code that I used to see how quick and compact the code could be, and so far it looks promising. It can be seen here: https://gist.github.com/alk/09a387957fc78aa25b29
The idea is inspired by the "binary representations" chapter of Okasaki's "Purely Functional Data Structures" book.
My thinking so far is that this representation has a chance to be fast for the "push/pop one object" case as well as for "get N objects" and "add N objects", with all cases doing just a few memory accesses. It needs only two words per object plus some manageable overhead for per-thread freelists, transfer caches (which in this case could be just a single freelist per size class and per CPU, I think), and the freelists in spans.
But all of that needs more work, and I could be wrong. In the next few months I'm unlikely to have time to work on this idea.
In any case, feel free to pursue the skip-list idea if you think it'll work well.