expose hints that will allow applications to perform runtime active defragmentation #566

Open
oranagra opened this Issue Jan 18, 2017 · 26 comments

oranagra commented Jan 18, 2017

Hi, I recently added a mechanism for runtime active defragmentation to Redis, for which I added some code to jemalloc so that the application can decide which pointers are worth moving and which aren't.
This hint could be merged into jemalloc so that other applications can choose to use it, and with your assistance it could perhaps be improved so that defragmentation is more efficient.

First, I'll describe what I already did:
After checking that the allocation is in a small bin (not large or huge), we make sure it's not in the current run (the one used for new allocations) and that it doesn't reside in a fully utilized run. Then we need to decide which non-full runs to evict and which non-full runs to fill.
Ideally, the least amount of "work" is done by moving allocations from runs with low %run_utilization into runs with high %run_utilization, or, in other words, by re-allocating pointers that reside in runs whose %run_utilization is smaller than the average (i.e. the %bin_utilization).
Additionally, when re-allocating we must make sure to bypass the thread cache, otherwise we'll "end up eating our own shit".
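To make the decision above concrete, here is a small self-contained sketch of choosing eviction candidates by comparing %run_utilization against %bin_utilization. This is not jemalloc's actual code; `run_info`, `bin_utilization`, and `should_move` are illustrative names for the quantities described above:

```c
#include <stddef.h>

/* Illustrative model of a bin's runs: region counts only. */
typedef struct { int used; int total; } run_info;

/* %bin_utilization: regions in use across all runs of the bin. */
double bin_utilization(const run_info *runs, size_t nruns) {
    int used = 0, total = 0;
    for (size_t i = 0; i < nruns; i++) {
        used += runs[i].used;
        total += runs[i].total;
    }
    return total ? (double)used / total : 0.0;
}

/* Evict from a run only if it is not fully utilized and its
 * %run_utilization is below the bin average. */
int should_move(run_info run, double bin_util) {
    if (run.used == run.total)  /* fully utilized: nothing to gain */
        return 0;
    return (double)run.used / run.total < bin_util;
}
```

For example, with runs of 3/4, 4/4, and 1/4 regions used, the bin utilization is 8/12 ≈ 0.67, so only the 1/4 run is an eviction candidate.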

If you want, you can take a quick look at the raw code I added: https://github.com/antirez/redis/pull/3720/files
(it also contains a showcase with malloc_stats output).
This is currently implemented on top of jemalloc 4.0.3, but I can PR it against the latest.

improvements:

  1. when calculating the % bin utilization, it may be better to exclude full runs, so that we know whether a run is less utilized than the average among the non-full runs only, rather than against the entire bin utilization. For that, all we need is the count of non-full runs in the bin.

  2. when we re-allocate a pointer, it always moves into the non-full run with the lowest address. There's a chance that this allocation will move out of a run which is about to become the current run, and we'll end up moving another allocation into the place we just evicted.

Consider these to be runs in a certain bin (sorted by address):
[xxx-] [xxxx] [xa--] [bx--] [xxxx] [xxx-]
'x' = used, '-' = unused, 'a'/'b' = used allocations that we'll want to move.

In this case, assuming we first consider moving 'a' before we consider moving 'b', it would be more efficient not to move 'a' and then move 'b' into its place, but instead to move just 'b' into the unused space in the first run.

In order to do that, I think we need to sort all the non-full runs of each bin by address and divide that list into two groups at a split point determined by the average bin utilization (a % utilization that excludes full runs): runs below that point in the address space, and runs above it.
Then we need to move only pointers that reside in runs above that point.
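As a rough sketch of that split-point idea (hypothetical names; jemalloc keeps no such sorted list today), the non-full runs could be sorted by address and the split index derived from the average non-full-run utilization:

```c
#include <stdint.h>
#include <stdlib.h>

/* Ascending comparison of run base addresses for qsort. */
int cmp_addr(const void *a, const void *b) {
    uintptr_t x = *(const uintptr_t *)a, y = *(const uintptr_t *)b;
    return (x > y) - (x < y);
}

/* Sort the bin's non-full runs by address and return the index of the
 * first "eviction source" run: the first ceil(nruns * avg_util) runs
 * (avg_util excludes full runs) act as fill targets; allocations in
 * runs at or above the returned index are the ones to move. */
size_t defrag_split_index(uintptr_t *run_addrs, size_t nruns,
                          double avg_util) {
    qsort(run_addrs, nruns, sizeof(*run_addrs), cmp_addr);
    double exact = (double)nruns * avg_util;
    size_t split = (size_t)exact;
    if (exact > (double)split)
        split++;                 /* ceil */
    return split < nruns ? split : nruns;
}
```

For four non-full runs at 50% average utilization, the first two (lowest-addressed) runs are fill targets and the last two are eviction sources.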

Please note that even the form I already implemented (for which I didn't change anything in existing parts of the code) is enough to be very beneficial to the application.
I exposed the % utilization to the application so that I could play with some algorithms and measure statistics, but if we want an API that can easily be improved in the future, we need to expose a single boolean result and make the decision inside the function.

Please let me know what you think.
Thanks.

jasone commented Feb 9, 2017

Given that you have the ability to relocate allocations, why isn't it sufficient to simply relocate everything (as mentioned in #562)? This should cause nearly optimal compaction, with no need for special APIs, and no increase in operational complexity.

It's conceivable that whatever solution we come up with for #327 will somehow indirectly expose information that approximates the utilization stat, or perhaps the iteration order will be inadvertently related to extent age, but I'm wary of adding an API for reporting per slab utilization stats.

oranagra commented Feb 10, 2017

Hi, thanks for replying.
In an application like Redis (which may hold 100GB or more in RAM, across millions of allocations), memcpy-ing all the data in RAM to new addresses would be very inefficient,
especially considering that the fragmentation may be confined to one or two problematic bins.

For instance, imagine the application populates millions of 150-byte allocations, then the pattern changes and it makes many 300-byte allocations while freeing half of the 150-byte ones.
This creates massive fragmentation, and to resolve it we only need to move about half of the 150-byte allocations, and none of the 300-byte allocations.
Obviously we also don't want to relocate all the other allocations in bins that don't cause fragmentation, nor the ones in large and huge bins.

One more consideration is that an application can monitor the amount of external fragmentation it has and only start active defragmentation when it passes a certain threshold (say 15%). Here again, it would be very inefficient to relocate all the application's memory to resolve a low-fragmentation issue, and if we chose that path, we'd only start defragmentation when fragmentation reaches 100% overhead. That in turn creates a severe cycle of high and low fragmentation, causing a high worst-case memory usage and bursty, unstable CPU performance, rather than keeping both fragmentation and defrag effort constantly low.

My suggestion is not to expose internals such as bin and run utilization.
Instead, a single function that takes a pointer and returns a boolean should be enough.
Internally, this function will check whether the pointer at hand is from a bin that causes fragmentation, and if it's in a run whose %utilization is lower than the %bin utilization it will return true.
This way the API is very simple and doesn't bind you to the internal implementation at all:
if the implementation someday changes, the logic inside the function is updated and the API remains the same.

I'll be happy to create a pull request for that; as you can see in my Redis PR, all I added is a 20-line function, and I didn't have to touch any of jemalloc's existing logic or data structures.

davidtgoldblatt commented Feb 12, 2017

What about adding a MALLOCX_DEFRAG flag that can be passed to rallocx (perhaps combined with some configuration options that would let the user control the ratio of fragmentation to allocation size before we do the move)? Does that leave you enough flexibility (e.g. do you need to do any internal pointer updates or concurrency control before moving)?

This is also a pain for C++ programs to use (where a copy is even less frequently a plain memcpy), but it lets us defer a decision on how to support APIs we're not sure about (#567).

oranagra commented Feb 13, 2017

I think the idea of using rallocx with a DEFRAG flag makes a lot of sense.
The application can compare the old pointer with the new pointer, and if they differ it knows the realloc wasn't a no-op and it needs to update the references (you may notice I did the same in my activeDefragAlloc, which uses je_get_defrag_hint).
Please note that you must make sure the old pointer doesn't go back into the tcache, and that the new pointer isn't taken from there.
Also, theoretically, the size of the memcpy can be the size argument the user passes to rallocx rather than the bin size (no need to copy the internal fragmentation if the user knows how much data is there; I'm not sure what you currently do in realloc).
I don't think you actually need to expose any configuration to the application; doing so would limit the ability to change the decision algorithm later on.
Please take another look at my initial comment, which has some ideas on how to improve the defrag decision and its efficiency.

davidtgoldblatt commented Feb 15, 2017

We got the chance to talk about this today. One alternate strategy would be to use the stats interfaces to identify the fragmented size classes, and then do the "malloc new; memcpy; free old" loop for items in that size class (which can be determined with sallocx or application-level segregated sizes). This would do some extra copying, but only for those allocations in size classes which are already fragmented, which are probably going to need lots of copying to defragment anyways. This saves us from committing to an API, lets the application pick acceptable fragmentation thresholds, and works for C++ users (or others for whom moving an allocation requires nontrivial logic).
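A minimal sketch of that loop, with standard `malloc`/`free` standing in for jemalloc's `mallocx(size, MALLOCX_TCACHE_NONE)` and `dallocx(ptr, MALLOCX_TCACHE_NONE)` (and with the size assumed known to the application, as it would be via `sallocx`), might look like:

```c
#include <stdlib.h>
#include <string.h>

/* "malloc new; memcpy; free old" for a single allocation. In real use
 * the tcache must be bypassed (MALLOCX_TCACHE_NONE) so the freed
 * region is not immediately handed back for the next allocation. */
void *defrag_copy(void *ptr, size_t size) {
    void *newptr = malloc(size);     /* stand-in for mallocx(...) */
    if (newptr == NULL)
        return ptr;                  /* keep the old block on failure */
    memcpy(newptr, ptr, size);
    free(ptr);                       /* stand-in for dallocx(...) */
    return newptr;
}
```

The application would run this over every pointer whose size class the stats flagged as fragmented.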

Could you try this out and see if it gives acceptable performance?

oranagra commented Feb 15, 2017

Hi,
This approach requires the application to be aware of jemalloc internals and to implement relatively complicated logic:
for each allocation it considers moving, it needs to query the size of the allocation, query the stats interface (does that also require an epoch refresh and string parsing?), and distinguish between large, huge, and small bins (the first two don't have a fragmentation issue and no utilization stat, right?).
I.e. you're moving the problem, and the complicated decision, from jemalloc's code into the application.

Also, in order to defrag, the application MUST use MALLOCX_TCACHE_NONE, otherwise it will end up doing nothing, since the allocations it frees will be fed right back to it the next time it attempts to defrag another pointer of that bin. Can C++ applications use that? If not, there's no reason to consider them at all.

In any case, as you pointed out, what you're offering is even less optimal than what I already have; it doesn't even avoid moving allocations that are already in the current run, or allocations residing in runs that are completely full. That's a huge waste. I was hoping that with your help we could improve on what I already have and reduce the excess work even further (the ideas I wrote in the first post).

Besides that, iterating over all the pointers the application holds is quite intensive, and during this iteration it's quite a waste to skip pointers in size classes with a low fragmentation ratio (once you've already found the pointer, queried its size, and asked jemalloc for the stats of that size class). If during that iteration I skip defragging pointers in bins with a low fragmentation overhead (e.g. 9%), and the fragmentation then rises (to 11%, which may be above the threshold), I'll need another full iteration over all the pointers in the program in order to defrag those. That's a huge waste too.

I think the rallocx(MALLOCX_DEFRAG) idea is the best one: it hides the decision algorithm, and also the action itself (TCACHE_NONE), from the application, and it's very easy to use.
The fact that it doesn't expose any internals (or even stats) to the application also means you'll be flexible to change the internals in the future, and able to improve the decision algorithm without breaking the API.

jasone commented Feb 15, 2017

@oranagra, the approach @davidtgoldblatt outlined isn't as onerous as you appear to think, because it's reasonably straightforward to 1) memoize the size class info, 2) compute the per-size-class fragmentation stats once per defragmentation pass, and 3) simply mallocx-copy-dallocx the objects in size classes with unacceptably high fragmentation. That said, I agree with you that this is far less than optimal, particularly if you are striving for low fragmentation; it only works well if fragmentation is high (>> 50%?).

Regarding the existing proposals, I see the following issues:

  • je_get_defrag_hint() returns whether the pointer resides in the current run, which in my opinion is too much information (exposes internals), and it can be counter-productive to base decisions on this anyway.
  • je_get_defrag_hint() returns fixed-point values, which throws away information about the raw numerator/denominator that may be useful for other use cases.
  • rallocx(... MALLOCX_DEFRAG) has undefined semantics if the size passed in is different from the allocation's current size, which suggests to me that this functionality doesn't belong in rallocx(). Furthermore, this interface doesn't provide for specifying what the acceptable fragmentation level is (and a separate API for such specification would make concurrent use brittle).

I propose a slightly different API that provides generic raw information on which a defragmentation pass can be built:

  • sallocx(ptr, ...) --> usize [already exists]
  • ptr --> "slab utilization" --> {active, total}
  • usize --> "size class utilization" --> {active, total} [already exists, but in a form that requires mallctl epoch update, so a more immediate alternative may be warranted]

Whether this API would be implemented via mallctl() or an experimental API depends on how #567 shakes out.

oranagra commented Feb 15, 2017

@jasone the main inefficiency in that idea, as I see it, is that it reallocs (memcpys) all allocations, even ones in full bins.
In Redis, we obviously need the CPU at 100% to process client commands; we can't afford to take long breaks in order to defrag. So we need defragmentation to be as efficient as possible (moving only what needs to move), and we need to defrag continuously in order to keep fragmentation low.

What I like about the rallocx(MALLOCX_DEFRAG) idea is that there's no configuration or stats to expose: no need for fixed-point values or a raw numerator/denominator, nor for returning info about the current run.

I don't think we need an API to define the acceptable fragmentation ratio, and here's why:

What I think needs to be done (what we do now in Redis) is that the application monitors the global fragmentation ratio (stats.active / stats.allocated) and decides whether it needs to defrag.
Once it decides that it wants to defrag, it starts iterating over its pointers (which are not sorted by size class).
Then we want to move every allocation that can reduce fragmentation; there's no need for thresholds in the jemalloc code.
I.e. the defragger is either idle or active. When it is active, it progressively iterates over all pointers, a few at a time (in parallel with other tasks); if it finds an allocation that can reduce fragmentation, we want to move it even if that specific bin doesn't have a high fragmentation ratio.
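The monitoring side of this can be sketched in a few lines (in practice `active` and `allocated` would come from jemalloc's stats mallctls; the 1.15 threshold is just the 15% example from earlier):

```c
#include <stddef.h>

/* Global fragmentation ratio: stats.active / stats.allocated
 * (1.0 means no external waste). */
double frag_ratio(size_t active, size_t allocated) {
    return allocated ? (double)active / (double)allocated : 1.0;
}

/* The defragger is either idle or active: active whenever the
 * overhead exceeds the application's threshold (e.g. 1.15 for 15%). */
int defrag_active(size_t active, size_t allocated, double threshold) {
    return frag_ratio(active, allocated) > threshold;
}
```

Once active, the loop offers each pointer to the boolean hint and moves only the ones it approves.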

Pseudo code for one case would be:

int je_get_defrag_hint(void* ptr) {
  if (ptr in huge or large bins)
    return 0;
  if (ptr in currun || run_util == 100%)
    return 0;
  return run_util < bin_util;
}

Pseudo code for the other case:

void* rallocx(void* ptr, size_t size, int flags) {
  if (flags & MALLOCX_DEFRAG) {
    if (ptr in huge or large bins)
      return ptr;
    if (ptr in currun || run_util == 100%)
      return ptr;
    if (run_util >= bin_util)
      return ptr;
    newptr = mallocx(size, MALLOCX_TCACHE_NONE);
    memcpy(newptr, ptr, MIN(size, origsize));
    dallocx(ptr, MALLOCX_TCACHE_NONE);
    return newptr;
  }
  ....
}

The above tests for currun are actually not that important and could be dropped, but in my eyes they're a placeholder for the future improvements that I suggested in my first post,
i.e. moving from high addresses to low addresses and not necessarily according to bin_util > run_util.

This is one reason to leave the decision out of the application's hands and not to expose any per-run utilization ratios; instead we should just return a boolean hint saying whether the allocation needs to move.
This way we can improve the decision without breaking the API, and the user also doesn't depend on the allocator's internal design (e.g. how the next 'currun' is chosen).

jasone commented Feb 20, 2017

I do not want to codify a defragmentation algorithm directly in jemalloc. jemalloc can provide mechanisms by which redis or other applications can defragment, but the policy we're discussing here is only one of multiple potentially useful possibilities, and it is unclear that any single policy is sufficiently universal for direct inclusion. The policy that redis settles on long-term should be iterated on within the redis codebase, rather than jemalloc's.

The basic APIs I suggested provide the building blocks that are central to (almost trivial!) implementation of your proposed policy in redis. However, I don't think we should commit to supporting these APIs long-term until we have solid evidence of their usefulness, and we have an introspection/iteration API (#327) with which they fit cohesively. Let's get the plan and infrastructure in place for experimental APIs (#567), then come back to this issue as an early test case for experimental API development.

oranagra commented Feb 20, 2017

Hi, sorry to bother you again; just one final word on something that maybe wasn't clear enough from my messages above:

I'm not sure what would qualify for the title of "defragmentation algorithm", and where it should be implemented...

The "defragmentation algorithm" for an application is something specific to the application (I had to write some 500 lines of code in Redis in order to know when to start defragmentation, and how to properly find and fix all pointers).

However, the decision of which pointer to move and which to skip is tightly coupled with jemalloc's internal implementation. The approach I took was to look at the run utilization, but that approach is actually wrong and wasteful (I may be moving some pointers multiple times). The right approach (considering jemalloc's current allocation logic) is to evict non-full runs at high addresses into non-full runs at low addresses, because currently the next currun is selected according to its relative address.
So ideally I should have used the relative address in the decision and not the % utilization, but this would of course change if the selection of the next currun ever changes.

So once again, I'm arguing that the decision is tightly coupled with the allocator internals and may need to change when they change, which is why I think the boolean decision should be implemented inside jemalloc.
Additionally, if we do want to improve my current approach and implement the decision based on address space, I believe jemalloc currently doesn't have the means (the information isn't collected or stored) to support that type of design, so jemalloc's internals would need to change a bit.

I'm not saying that I expect you to do that; if I were you, I wouldn't change the internals for that reason. But I am saying that the decision is in the allocator's domain, not the application's.

On second thought, maybe I'm misinterpreting your aim...
If you're saying that you don't specifically care about defragmentation at all, and the only reason you're considering exposing more information to the application is that it can serve other purposes as well, then that makes sense to me as well.

in any case, thanks for everything.

jasone commented Feb 20, 2017

@oranagra, thank you for your continued efforts to communicate your aims and methods for achieving them. I am very interested in enhancing jemalloc to meet those aims, because it's a great opportunity for redis and jemalloc to work cohesively and lay memory out more efficiently than either could in isolation. Please understand that my goal in designing the necessary interfaces is to end up with something resilient that allows experimentation and adaptation, in the context of redis or any other application with the ability to relocate allocations.

Let me describe briefly why an application might prefer a very different defragmentation approach than what we have discussed here thus far. Many applications exhibit strong temporal locality; allocation requests n and n+1 are often related and therefore accessed close together in time. For such an application, we will degrade the correlated physical locality if we move allocations n and n+1 to non-adjacent locations. Such scattering is a common side effect of the defragmentation algorithm we have discussed thus far. Sliding compaction does not have this downside, but it also requires more relocation. So, we have at least two viable, justifiable defragmentation approaches, but neither is universally applicable. There are surely additional possibilities that are superior under various application loads.

I don't know yet whether we can develop straightforward yet flexible APIs that support diverse defragmentation approaches, but if so, I much prefer that outcome to codifying multiple defragmentation approaches directly in jemalloc. In any case, I think there's a good chance we can develop some building blocks that have uses beyond those we're currently discussing.

@jasone jasone added the feature label May 23, 2017

daverigby commented Jul 14, 2017

FWIW, we at Couchbase implemented our defragmenter on top of jemalloc using a very naive algorithm:

Simply look for allocations which have been around for a long time, then reallocate them with the tcache disabled. If we get a lower-addressed pointer, we memcpy and free the old one.

It's inefficient (insofar as we may reallocate things which are already in packed runs), but it's simple and required no changes to jemalloc. It just relies on the guarantee that jemalloc will assign the lowest available address for a given size. Note this is all done within a specific arena.
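For reference, the naive rule above amounts to something like the following sketch (standard `malloc`/`free` stand in for the tcache-disabled jemalloc calls; `size` is assumed known to the caller):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reallocate and keep the copy only if it landed at a lower address;
 * otherwise discard the new block and keep the original. */
void *defrag_if_lower(void *ptr, size_t size) {
    void *newptr = malloc(size);
    if (newptr == NULL)
        return ptr;
    if ((uintptr_t)newptr < (uintptr_t)ptr) {
        memcpy(newptr, ptr, size);
        free(ptr);
        return newptr;           /* moved down the address space */
    }
    free(newptr);                /* no improvement; keep the original */
    return ptr;
}
```

Either way the caller gets back a valid pointer to the same contents.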

davidtgoldblatt commented Jul 14, 2017

I'd actually advise against that these days: we prefer old memory over low-addressed memory (if you get memory from sbrk it's the same, of course, but depending on various config settings you might not), so you're potentially mixing layout strategies within a single program, which might work great, but might also blow up in weird ways.

The fully extent-ized internals in 5.0 onwards make a fragmentation-query API much more sane, so I think we should re-evaluate something along these lines. It'll be a few weeks before the team gets to talk about it on our end, though.

BTW, trying jemalloc 5 in general might be a good idea if fragmentation is a significant problem for you. We've seen some big wins in RSS (at the cost of increases in virtual memory use).

davidtgoldblatt commented Jul 14, 2017

(also, Redis folks: congrats on the release!)

daverigby commented Jul 15, 2017

@davidtgoldblatt: ah, interesting to know.

I originally developed our defragmenter on jemalloc 3.x, and we're currently using something like v4.3 (off the top of my head) - when does the "lowest available address" guarantee no longer apply?

davidtgoldblatt commented Jul 15, 2017

Looks like they were added in 5c77af9, which first shipped in 4.4.

daverigby commented Jul 15, 2017

Cool, thanks for that. I'm away for a week, but when I get back it would be good to discuss how you'd recommend handling defragmentation in newer versions.

oranagra commented Jul 15, 2017

This is a good example of why it is wise to hide the internals from the application and only expose a boolean hint. This way the logic gets updated along with jemalloc's internals, and the application is only responsible for finding the pointers that need updating.

davidtgoldblatt commented Jul 17, 2017

I think we're going to start playing with this in an experimental way sometime soon.

BTW, I was thinking about this more, and I realized that you could get some batching benefits in your defragmentation code, @oranagra. You could create an alloc tcache and a dalloc tcache with the mallctl. You get the defragmented memory from the alloc tcache and free the fragmented memory into the dalloc one (and, once you're done defragging, you free both tcaches). This lets you do most of your work in the fast-path tcache code, grabbing the compacted memory from, and releasing the fragmented memory to, the arena only in batches.
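A toy model of that batching idea (this is not jemalloc's tcache; a real implementation would create the two tcaches via mallctl("tcache.create", ...) and select them per call with MALLOCX_TCACHE(tc)), just to illustrate how the arena is touched only once per batch:

```c
#include <stdlib.h>

#define BATCH 8

/* Toy cache: a small stack of pointers plus a counter of how many
 * times we had to go to the (simulated) arena. */
typedef struct {
    void *slots[BATCH];
    int n;
    int arena_trips;
} toy_tcache;

/* Alloc side: refill the whole batch from the arena only when empty. */
void *tc_alloc(toy_tcache *c, size_t size) {
    if (c->n == 0) {
        for (int i = 0; i < BATCH; i++)
            c->slots[c->n++] = malloc(size);
        c->arena_trips++;
    }
    return c->slots[--c->n];
}

/* Dalloc side: return the whole batch to the arena only when full. */
void tc_free(toy_tcache *c, void *ptr) {
    c->slots[c->n++] = ptr;
    if (c->n == BATCH) {
        while (c->n > 0)
            free(c->slots[--c->n]);
        c->arena_trips++;
    }
}
```

Eight allocations followed by eight frees cost one arena round-trip on each side instead of sixteen.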

oranagra commented Jul 17, 2017

@davidtgoldblatt won't the alloc tcache always be empty, since we never free anything into it?
BTW, I think the most inefficient thing here is the memcpy, not the arena locking.
That's why it is so important to move only the things that we really have to, and to avoid putting things back into a run we just evicted.
See my suggestions for improving the decision algorithm (at the top).

davidtgoldblatt commented Jul 17, 2017

The tcaches batch operations, so the alloc tcache can grab a bunch of items from the arena when it's empty and needs to refill.

I think you're probably right about the memcpy cost dominating, but those CASs can add up. (And, it's an avenue to speed things up on your end without being blocked on progress on ours).

daverigby commented Aug 15, 2017

@davidtgoldblatt re: your comment on 14th July, with respect to how we defrag in Couchbase:

we at Couchbase implemented our defragmenter on top of jemalloc using a very naive algorithm: Simply look for allocations which have been around for a long time, and then reallocate them with the tcache disabled. If we get a lower addressed pointer we memcpy and free the old one.

I'd advise against that actually these days -- we prefer old memory over low-addressed memory (if you get it from sbrk it's the same of course, but depending on various config settings you might not), so you're potentially mixing layout strategies within a single program, which might work great, but might also blow up in weird ways.

I've checked the code, and we actually unconditionally replace the old allocation with the new one we just got from je_malloc(). As such, it sounds like even from 4.4 onwards (specifically 5c77af9), our simple approach should still work?

davidtgoldblatt commented Aug 15, 2017

Yep, that should work. You could also do the "allocate from one tcache, free into another" trick I mentioned, which could cut down on your malloc-related costs (you'd still have to pay the memcpy, of course, which is probably more painful).

daverigby commented Aug 15, 2017

Cool, thanks for the clarification.

I'll take a look at the dual-tcache allocation; however, the way we have things structured at the moment, the actual reallocation code doesn't know anything about jemalloc, so it'd require some refactoring to expose an old and a new tcache to that part of the code.

davidtgoldblatt commented Aug 15, 2017

When we move defragmentation support in-tree, we'll also add some mallctls to make all allocs/dallocs defragmenting ones (to deal with this problem: that having to make the defragmenting calls explicit is abstraction-hostile). So unless those code paths are showing up in profiling, just waiting for us to get to it seems reasonable to me.

daverigby commented Aug 15, 2017

Yeah, defragmentation is working fine at present on 4.3.1, but it's perhaps less aggressive (in terms of how quickly we attempt to defrag items) than would be ideal. However, that's more a function of the cost to us of scanning and checking which items should be defragmented (and the memcpy), not of jemalloc's performance directly.

Certainly we'll look to use the standard API in the future as/when it's available.
