Avoid recomputing size on allocation and reallocation #1074

gnzlbg · 2017-11-14T21:45:55Z

A single nallocx call takes ~10-15ns in my test system.

Currently, Rust needs to, call nallocx on allocation:

let flags = align_to_flags(align);
let p = mallocx(size, flags) as *mut u8;
if !p.is_null() {
   // unnecessary nallocx call?
    real_size = nallocx(size, flags) as usize;
    return (p, real_size);
}

and reallocation:

let flags = align_to_flags(align);
let ptr = rallocx(ptr as *mut c_void, new_size, flags) as *mut u8;
if !p.is_null() {
   // unnecessary nallocx call?
   real_size = nallocx(size, flags) as usize;
   return (p, real_size);
}

Given that:

rust always asks for aligned allocations,
mallocx always internally computes the real size for aligned allocations (see: https://github.com/jemalloc/jemalloc/blob/dev/src/jemalloc.c#L1777)
rallocx always internally computes the real size for aligned allocations (see:

jemalloc/src/arena.c

Line 1815 in b5d071c

size_t usize = sz_s2u(size);

)

it seems that we could save those 10ns if jemalloc wouldn't throw this information away.

Would it be possible to provide a pair of mallocx/rallocx methods that always return the allocation size for aligned allocations?

For example, something like this:

void* smallocx(size_t size, int flags, size_t* usize);
void* srallocx(void *ptr, size_t size, int flags, size_t* usize);

(I don't know whether this has a cost for unaligned allocations, or what that cost might be, but it might be worth it to check that out as well).

The text was updated successfully, but these errors were encountered:

davidtgoldblatt · 2017-11-14T23:06:10Z

Can you share your test system details + how you're building and testing nallocx speed? That sounds high to me.

nallocx has an __attribute__((pure)) attached to it, so that the result need not be computed if not used; can Rust do something similar to propagate that information to the Rust-side compiler, so that you only pay the cost if necessary? The nallocx computations are also highly pipelinable, so I think that the "true" cost is likely lower than might be reflected in microbenchmarks.

Committing to a new allocation API is a big enough deal for us that I can't really see us doing it for this case. In some of the other threads, we've suggested fastpathing "freebie" alignment values into the malloc / nallocx-with-zero-flags pathways; have you tried this?

Does nallocx actually show up in application-level profiles?

gnzlbg · 2017-11-15T10:13:23Z

@davidtgoldblatt you can find the benchmarks in my gbg branch of the jemallocator rust crate. To reproduce my results you just need to:

git clone git@github.com:gnzlbg/jemallocator.git
git checkout gbg
cargo bench

This should work as in on any Linux and MacOSX system; I am not sure about Windows but I think it should also work as is there although you might need MinGW.

I've uploaded my results to a gist.

You can find the compilation options for the benches in the Cargo.toml file, but in a nutshell it uses O3, Link Time Optimization (LTO), and panic=abort (no unwinding, no exceptions, ...).

The benchmarks have a warm up phase that runs until the number of nanoseconds/iteration (_ ns/iter) and the standard deviation (+/- _)converge. After the warmup the sampling starts from zero until these converge again.

The effect of nallocx is studied by doing a roundtrip of calling mallocx and then sdallocx w/o a call to nallocx inbetween. That is, the first benchmark (roundtrip_) just calls mallocx and then sdallocx. The second benchmark (roundtrip_nallocx_) calls mallocx, then nallocx to obtain the real size, stores it in a local variable, and calls sdallocx with it.

The results are provided for powers of two, even, and odd allocations. These are some of the results (the rest are in the gist) sorted to show the differences:

// powers of two:

test roundtrip_pow2_4bytes               ... bench:          28 ns/iter (+/- 6)
test roundtrip_nallocx_pow2_4bytes       ... bench:          35 ns/iter (+/- 33)

test roundtrip_pow2_128bytes             ... bench:          25 ns/iter (+/- 9)
test roundtrip_nallocx_pow2_128bytes     ... bench:          32 ns/iter (+/- 25)

test roundtrip_pow2_1024bytes            ... bench:          25 ns/iter (+/- 5)
test roundtrip_nallocx_pow2_1024bytes    ... bench:          31 ns/iter (+/- 11)

test roundtrip_pow2_4096bytes            ... bench:          29 ns/iter (+/- 16)
test roundtrip_nallocx_pow2_4096bytes    ... bench:          32 ns/iter (+/- 29)

test roundtrip_pow2_65536bytes           ... bench:         176 ns/iter (+/- 82)
test roundtrip_nallocx_pow2_65536bytes   ... bench:         179 ns/iter (+/- 157)

test roundtrip_pow2_4194304bytes         ... bench:       2,517 ns/iter (+/- 707)
test roundtrip_nallocx_pow2_4194304bytes ... bench:       2,265 ns/iter (+/- 1,096)

// even:

test roundtrip_even_24bytes              ... bench:          26 ns/iter (+/- 10)
test roundtrip_nallocx_even_24bytes      ... bench:          35 ns/iter (+/- 7)

test roundtrip_even_10000bytes           ... bench:          29 ns/iter (+/- 1)
test roundtrip_nallocx_even_10000bytes   ... bench:          42 ns/iter (+/- 10)

test roundtrip_even_1000000bytes         ... bench:         168 ns/iter (+/- 61)
test roundtrip_nallocx_even_1000000bytes ... bench:         260 ns/iter (+/- 48)

// odd: even size - 1

test roundtrip_odd_1000000bytes          ... bench:         165 ns/iter (+/- 112)
test roundtrip_nallocx_odd_1000000bytes  ... bench:         253 ns/iter (+/- 33)

test roundtrip_odd_10000bytes            ... bench:          30 ns/iter (+/- 23)
test roundtrip_nallocx_odd_10000bytes    ... bench:          37 ns/iter (+/- 20)

test roundtrip_odd_24bytes               ... bench:          28 ns/iter (+/- 19)
test roundtrip_nallocx_odd_24bytes       ... bench:          32 ns/iter (+/- 18)

Can you share your test system details + how you're building and testing nallocx speed? That sounds high to me.

My test system is an: Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz with 8 GB 1600 MHz DDR3.

You can try to reproduce the results by yourself, but as you see, at least on my system, the call to nallocx negatively impacts both the number of ns per iteratation as well as the std deviation.

nallocx has an attribute((pure)) attached to it, so that the result need not be computed if not used; can Rust do something similar to propagate that information to the Rust-side compiler, so that you only pay the cost if necessary?

Rust does already do this. Note that in this particular synthetic test, the allocation sizes are a compile-time constant.

However, the code inside mallocx is different enough from nallocx that at least with the current state of LTO the LLVM optimizer cannot currently recognize that the call to nallocx is not necessary for our use cases. I expect this to mirror the situations where the size is only known at run-time.

Does nallocx actually show up in application-level profiles?

The main problem is that since calling nallocx is not free (as costing exactly zero cycles), we need to expose to our users a second allocation API that does not call it. This duplicates our allocator API surface which is a price that everybody pays (e.g. API-wise "should I use this or that?").

The nallocx computations are also highly pipelinable, so I think that the "true" cost is likely lower than might be reflected in microbenchmarks.

This is a synthetic test specially crafted to measure nallocx cost, so take its results with a grain of salt. While I agree with you in that in some cases the cost of nallocx is negligible, in other cases (as shown above) it is surprisingly large.

Committing to a new allocation API is a big enough deal for us that I can't really see us doing it for this case.

I understand that.

I think it should be possible to refactor mallocx internally into a separate private function that does not throw away the size in the cases in which it is computed and to make mallocx use that instead, without any performance hit whatsoever to mallocx.

After that, what I am asking for becomes "exposing" this internal function so that we can just reuse the size that mallocx is for our use cases already computing and avoid an extra nallocx call. That might allow us to cut our allocation API in half.

If we had an API that returns the allocation size, building a smaller API on top that drops it is easy, and if that happens within jemalloc it'll probably zero cost in most cases (or we could make it zero cost by only returning the size when it is zero-cost to do so).

For the other way around to be true, the compiler would need to be able to completely eliminate the nallocx call when it is not needed. This doesn't seem to be going to be the case for the foreseeable future.

davidtgoldblatt · 2017-11-15T17:22:49Z

The main problem is that since calling nallocx is not free (as costing exactly zero cycles), we need to expose to our users a second allocation API that does not call it.

But what I mean w.r.t. pure is, can't you just call it, and then let the user ignore the result if they don't need it? The optimizer can then remove the call to nallocx.

gnzlbg · 2017-11-15T17:42:14Z

I think that we could do this (or something like this because we call into jemalloc via our C FFI). But I am still worried about the users that do need it.

…

On Wed 15. Nov 2017 at 18:22, David Goldblatt ***@***.***> wrote: The main problem is that since calling nallocx is not free (as costing exactly zero cycles), we need to expose to our users a second allocation API that does not call it. But what I mean w.r.t. pure is, can't you just call it, and then let the user ignore the result if they don't need it? The optimizer can then remove the call to nallocx. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#1074 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA3Npo7MpYuVapgtnxXe-eFcAnhB0bFpks5s2x3vgaJpZM4QeC4I> .

gnzlbg · 2017-11-15T17:46:34Z

Also, this optimization relies on everything being inlined properly and the optimizer doing its thing. It should be the case, but it is not a 100% reliable guarantee.

…

On Wed 15. Nov 2017 at 18:40, Gonzalo BG ***@***.***> wrote: I think that we could do this (or something like this because we call into jemalloc via our C FFI). But I am still worried about the users that do need it. On Wed 15. Nov 2017 at 18:22, David Goldblatt ***@***.***> wrote: > The main problem is that since calling nallocx is not free (as costing > exactly zero cycles), we need to expose to our users a second allocation > API that does not call it. > > But what I mean w.r.t. pure is, can't you just call it, and then let the > user ignore the result if they don't need it? The optimizer can then remove > the call to nallocx. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#1074 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AA3Npo7MpYuVapgtnxXe-eFcAnhB0bFpks5s2x3vgaJpZM4QeC4I> > . >

gnzlbg · 2017-11-15T22:12:05Z

Sorry that I was answering from my phone before, but now that I have a laptop I want to reinforce that we should definetely propagate nallocx "pureness" within Rust.

But many fundamental data-structures (like vector) want to make use of the real capacity being allocated. The benchmarks do show that in the best cases the price of nallocx is 20% of the whole allocation price, and in the worst cases this blows up to 80-90% of the cost of a mallocx call.

This looks like way too much for me.

gnzlbg · 2017-11-16T16:30:24Z

Off-topic: @davidtgoldblatt working on improving this I just found out thatnallocx is pure and not const. That is, while nallocx has no side-effects, and a call to nallocx can be eliminated if its result is not used, two consecutive calls to nallocx with the same arguments are not guaranteed to return the same result, so two consecutive calls to nallocx cannot be folded into a single call. This is because pure allows the result of nallocx to depend on the state of global variables, and the values in these global variables can change between two consecutive nallocx calls (if for example a different thread changes their value in a properly synchronized way).

In Rust terms, nallocx cannot be a const fn (which is closer to GCC's const attribute than to its pure attribute), so we might need to do something language-wise to be able to apply the optimization that you suggest.

gnzlbg · 2017-11-16T19:43:09Z

So I wanted to see how easy/hard would be to see any improvement. For that I've prototyped the most low-effort solution I could think of here and benchmarked it for two cases: flags = 0 and 32byte alignment (flags =5).

The results aren't bad:

Flags == 0

Powers of two

test rt_pow2_4bytes_mallocx          ... bench:          23 ns/iter (+/- 7)
test rt_pow2_4bytes_nallocx          ... bench:          29 ns/iter (+/- 14)
test rt_pow2_4bytes_smallocx         ... bench:          24 ns/iter (+/- 9)

test rt_pow2_128bytes_mallocx        ... bench:          23 ns/iter (+/- 14)
test rt_pow2_128bytes_nallocx        ... bench:          29 ns/iter (+/- 20)
test rt_pow2_128bytes_smallocx       ... bench:          24 ns/iter (+/- 14)

test rt_pow2_1024bytes_mallocx       ... bench:          23 ns/iter (+/- 16)
test rt_pow2_1024bytes_nallocx       ... bench:          29 ns/iter (+/- 19)
test rt_pow2_1024bytes_smallocx      ... bench:          24 ns/iter (+/- 17)

test rt_pow2_4096bytes_mallocx       ... bench:          23 ns/iter (+/- 19)
test rt_pow2_4096bytes_nallocx       ... bench:          29 ns/iter (+/- 10)
test rt_pow2_4096bytes_smallocx      ... bench:          24 ns/iter (+/- 8)

test rt_pow2_65536bytes_mallocx      ... bench:         214 ns/iter (+/- 106)
test rt_pow2_65536bytes_nallocx      ... bench:         229 ns/iter (+/- 90)
test rt_pow2_65536bytes_smallocx     ... bench:         214 ns/iter (+/- 152)

Even

test rt_odd_24bytes_mallocx          ... bench:          23 ns/iter (+/- 19)
test rt_odd_24bytes_nallocx          ... bench:          29 ns/iter (+/- 5)
test rt_odd_24bytes_smallocx         ... bench:          24 ns/iter (+/- 10)

test rt_even_10000bytes_mallocx      ... bench:          28 ns/iter (+/- 4)
test rt_even_10000bytes_nallocx      ... bench:          35 ns/iter (+/- 26)
test rt_even_10000bytes_smallocx     ... bench:          29 ns/iter (+/- 19)

test rt_even_1000000bytes_mallocx    ... bench:         211 ns/iter (+/- 169)
test rt_even_1000000bytes_nallocx    ... bench:         219 ns/iter (+/- 109)
test rt_even_1000000bytes_smallocx   ... bench:         212 ns/iter (+/- 54)

Odd (even number - 1)

test rt_odd_24bytes_mallocx          ... bench:          23 ns/iter (+/- 19)
test rt_odd_24bytes_nallocx          ... bench:          29 ns/iter (+/- 5)
test rt_odd_24bytes_smallocx         ... bench:          24 ns/iter (+/- 10)

test rt_odd_10000bytes_mallocx       ... bench:          28 ns/iter (+/- 19)
test rt_odd_10000bytes_nallocx       ... bench:          35 ns/iter (+/- 20)
test rt_odd_10000bytes_smallocx      ... bench:          32 ns/iter (+/- 18)

test rt_odd_1000000bytes_mallocx     ... bench:         218 ns/iter (+/- 151)
test rt_odd_1000000bytes_nallocx     ... bench:         241 ns/iter (+/- 31)
test rt_odd_1000000bytes_smallocx    ... bench:         214 ns/iter (+/- 111)

Flags == 5 (32byte alignment)

Powers of two

test rt_pow2_4bytes_mallocx          ... bench:          33 ns/iter (+/- 25)
test rt_pow2_4bytes_nallocx          ... bench:          39 ns/iter (+/- 27)
test rt_pow2_4bytes_smallocx         ... bench:          33 ns/iter (+/- 26)

test rt_pow2_128bytes_mallocx        ... bench:          32 ns/iter (+/- 9)
test rt_pow2_128bytes_nallocx        ... bench:          40 ns/iter (+/- 23)
test rt_pow2_128bytes_smallocx       ... bench:          32 ns/iter (+/- 8)

test rt_pow2_1024bytes_mallocx       ... bench:          32 ns/iter (+/- 4)
test rt_pow2_1024bytes_nallocx       ... bench:          39 ns/iter (+/- 27)
test rt_pow2_1024bytes_smallocx      ... bench:          32 ns/iter (+/- 14)

test rt_pow2_4096bytes_mallocx       ... bench:          32 ns/iter (+/- 16)
test rt_pow2_4096bytes_nallocx       ... bench:          40 ns/iter (+/- 27)
test rt_pow2_4096bytes_smallocx      ... bench:          32 ns/iter (+/- 19)

test rt_pow2_65536bytes_mallocx      ... bench:         231 ns/iter (+/- 125)
test rt_pow2_65536bytes_nallocx      ... bench:         243 ns/iter (+/- 48)
test rt_pow2_65536bytes_smallocx     ... bench:         233 ns/iter (+/- 156)

Even

test rt_even_24bytes_mallocx         ... bench:          37 ns/iter (+/- 33)
test rt_even_24bytes_nallocx         ... bench:          40 ns/iter (+/- 14)
test rt_even_24bytes_smallocx        ... bench:          33 ns/iter (+/- 15)

test rt_even_10000bytes_mallocx      ... bench:          39 ns/iter (+/- 10)
test rt_even_10000bytes_nallocx      ... bench:          47 ns/iter (+/- 42)
test rt_even_10000bytes_smallocx     ... bench:          39 ns/iter (+/- 10)

test rt_even_1000000bytes_mallocx    ... bench:         215 ns/iter (+/- 16)
test rt_even_1000000bytes_nallocx    ... bench:         245 ns/iter (+/- 71)
test rt_even_1000000bytes_smallocx   ... bench:         235 ns/iter (+/- 103)

Odd

test rt_odd_24bytes_mallocx          ... bench:          34 ns/iter (+/- 19)
test rt_odd_24bytes_nallocx          ... bench:          39 ns/iter (+/- 22)
test rt_odd_24bytes_smallocx         ... bench:          33 ns/iter (+/- 15)

test rt_odd_10000bytes_mallocx       ... bench:          39 ns/iter (+/- 31)
test rt_odd_10000bytes_nallocx       ... bench:          47 ns/iter (+/- 34)
test rt_odd_10000bytes_smallocx      ... bench:          39 ns/iter (+/- 12)

test rt_odd_1000000bytes_mallocx     ... bench:         225 ns/iter (+/- 91)
test rt_odd_1000000bytes_nallocx     ... bench:         248 ns/iter (+/- 65)
test rt_odd_1000000bytes_smallocx    ... bench:         236 ns/iter (+/- 170)

I said this is the lowest-effort possible solution because it just computes the usize from the ind at the top-level if flags == 0, and just returning the usize otherwise (which is computed anyways).

I haven't invested much time in this, but from looking into the library for the last 45 minutes I cannot seem to find an easy way to obtain usize from within iallocztm in what appears to be the fast path. @davidtgoldblatt is this possible, and if so, could you maybe explain how?

gnzlbg · 2017-11-21T21:26:59Z

Apache Arrow had a similar issue: https://issues.apache.org/jira/browse/ARROW-464

The performance cost of nallocx is unacceptable for their purposes.

davidtgoldblatt · 2017-11-21T23:33:17Z

There's not a ton of info in that link, but it seems likely that the bulk of their regression was caused by the extra allocations from lowering the growth factor of their dynamic array.

Similarly, an extra 7ns of overhead on a 10k allocation doesn't worry me; the cost of actually touching all that memory will be large enough that the difference is lost in the noise.

The cost of nallocx for large allocations is unnecessarily high right now, and I have some ideas on how to improve it if you're feeling motivated, but again: I'm suspicious of the claim that this is something any real application will be able to notice.

Adding extra allocation functions isn't without cost; it takes up memory, may spend cycles (depending on how we would refactor to allow it), can't get deprecated without an ABI breakage, and limits our implementation flexibility.

I'm especially skeptical of benefit of this when compared against things that could be done to reduce costs on Rust's side. Calling malloc instead of mallocx (in cases where the implicit alignment is no less than the required one) will shave off a few nanoseconds. Teaching the rust compiler about the pure attribute will let it elide most of these calls.

I think I'm coming off more dead-set against this than I really am, but table stakes for adding this sort of API are a end-to-end performance issue in a "real" application that can be fixed by adding these functions, that isn't better fixed at some other layer of the stack.

gnzlbg · 2017-11-21T23:47:34Z

I understand.

The cost of nallocx for large allocations is unnecessarily high right now, and I have some ideas on how to improve it if you're feeling motivated, but again: I'm suspicious of the claim that this is something any real application will be able to notice.

I am feeling motivated so please point me in the right direction here, making nallocx closer to zero-cost would reduce the need for an alternative API. That would be preferable since current users of nallocx would benefit from this.

Similarly, an extra 7ns of overhead on a 10k allocation doesn't worry me; the cost of actually touching all that memory will be large enough that the difference is lost in the noise.

It does look large if one takes into account that the allocations itself take 20-30ns, but the synthetic benchmarks I was using did not touch the memory at all, so this makes sense.

I'm especially skeptical of benefit of this when compared against things that could be done to reduce costs on Rust's side.

Note that none of this applies if the user needs the result of nallocx. I intend to use the result of nallocx on every vector and string allocation, which are widely-used data-structures in almost every Rust application. If nallocx isn't free, then whether it is worth it to do this depends on the trade-off nallocx costs' vs nallocx savings (e.g. if the extra memory used prevents a sub-sequent reallocation of the vector or string). This is neither a Rust specific nor application specific problem.

gnzlbg · 2018-05-16T22:29:25Z

cc @davidtgoldblatt @interwq @jasone

P0901r0 talks about this: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html

and basically suggests adding to c++ an operator new() overload that maps 1:1 to the API that I've implemented, and suggested adding here.

davidtgoldblatt · 2018-05-17T20:40:42Z

I talked with the authors in Jacksonville, and they convinced me that this was a good idea (although, for telemetry and feedback about fragmentation; I still think the cycle count reasons are pretty goofy).

I'm not sure how this would be structured, though. I don't think we'd want to commit to it at the hard ABI level, except via whatever operator new overload it shows up as in the C++ version. If this were only available with a config option (and possibly some sort of user-hostile name mangling, to prevent relying on it from version to version) would that work for you?

gnzlbg · 2018-05-17T21:13:43Z

@davidtgoldblatt I think that exposing this behind a config option would be a good first step, making it clear that this new API is experimental and if it does not prove itself could be removed in the future.

Rust could start using it in its bundled jemalloc version (when dynamically linking we would just continue using mallocx + nallocx). If the API is removed from jemalloc in the future, Rust can just stop using it without any issues. This might allow testing the new APIs on larger-ish projects like servo.

If you are going to really consider going forward with this, keep in mind that Rust could technically benefit from versions of all jemalloc allocation APIs that would return the usable size directly (e.g. calloc and rallocx come to mind, xallocx already does this).

FWIW I also agree that the cycle count wins are negligible compared with the cost of touching the memory.

davidtgoldblatt · 2018-05-17T21:19:06Z

BTW, if you're looking for something that will speed things up for rust, I'd really recommend using malloc directly, where correct. At one point (IIRC), the Rust runtime was doing all of its allocations with unnecessary alignment parameters (e.g. calling mallocx with an alignment that was already guaranteed by the type). That will cause fast-path regressions, and I would expect it to be fairly straightforward to fix.

gnzlbg · 2018-05-17T21:30:15Z

At one point (IIRC), the Rust runtime was doing all of its allocations with unnecessary alignment parameters (e.g. calling mallocx with an alignment that was already guaranteed by the type). That will cause fast-path regressions, and I would expect it to be fairly straightforward to fix.

We tuned something there last year (https://github.com/alexcrichton/jemallocator/issues/25). This is in a nutshell how it looks like today: https://github.com/alexcrichton/jemallocator/blob/master/src/lib.rs#L73

/// Computes `flags` from `align`.
///
/// Equivalent to the MALLOCX_ALIGN(a) macro.
#[inline]
#[allow(non_snake_case)]
pub fn MALLOCX_ALIGN(aling: usize) -> c_int {
    aling.trailing_zeros() as c_int
}

// The minimum alignment guaranteed by the architecture. This value is used to
// add fast paths for low alignment values. In practice, the alignment is a
// constant at the call site and the branch will be optimized out.
#[cfg(all(any(target_arch = "arm",
              target_arch = "mips",
              target_arch = "mipsel",
              target_arch = "powerpc")))]
const MIN_ALIGN: usize = 8;
#[cfg(all(any(target_arch = "x86",
              target_arch = "x86_64",
              target_arch = "aarch64",
              target_arch = "powerpc64",
              target_arch = "powerpc64le",
              target_arch = "mips64",
              target_arch = "s390x",
              target_arch = "sparc64")))]
const MIN_ALIGN: usize = 16;

fn layout_to_flags(align: usize, size: usize) -> c_int {
    // If our alignment is less than the minimum alignment then we may not
    // have to pass special flags asking for a higher alignment. If the
    // alignment is greater than the size, however, then this hits a sort of odd
    // case where we still need to ask for a custom alignment. See #25 for more
    // info.
    if align <= MIN_ALIGN && align <= size {
        0
    } else {
        ffi::MALLOCX_ALIGN(align)
    }
}

/// Handle to the jemalloc allocator
///
/// This type and a reference to this type both implement the `Alloc` trait,
///
/// allowing usage of this `Jemalloc` type both in collections and as a global
/// allocator.
pub struct Jemalloc;

unsafe impl GlobalAlloc for Jemalloc {
    #[inline]
    unsafe fn alloc(&self, layout: Layout) -> *mut Opaque {
        let flags = layout_to_flags(layout.align(), layout.size());
        let ptr = ffi::mallocx(layout.size(), flags);
        ptr as *mut Opaque
    }

    #[inline]
    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut Opaque {
        let ptr = if layout.align() <= MIN_ALIGN && layout.align() <= layout.size() {
            ffi::calloc(1, layout.size())
        } else {
            let flags = layout_to_flags(layout.align(), layout.size()) | ffi::MALLOCX_ZERO;
            ffi::mallocx(layout.size(), flags)
        };
        ptr as *mut Opaque
    }

    #[inline]
    unsafe fn dealloc(&self, ptr: *mut Opaque, layout: Layout) {
        let flags = layout_to_flags(layout.align(), layout.size());
        ffi::sdallocx(ptr as *mut c_void, layout.size(), flags)
    }

    #[inline]
    unsafe fn realloc(&self,
                      ptr: *mut Opaque,
                      layout: Layout,
                      new_size: usize) -> *mut Opaque {
        let flags = layout_to_flags(layout.align(), new_size);
        let ptr = ffi::rallocx(ptr as *mut c_void, new_size, flags);
        ptr as *mut Opaque
    }
}

Note that we call calloc when the alignment flags aren't necessary, but for some reason we call mallocx with 0 as flag when the alignment isn't necessary. Since we can't "call" C macros via FFI, we have a function that should be equivalent to MALLOCX_ALIGN but... if the macro changes we have to update it and we haven't done that in a while.

There is a second allocation API below that does the same, but also uses xallocx, nallocx, rallocx, to implement other allocation functions that return the "excess" (..._excess), which are the ones used by Vec, String, etc.

gnzlbg · 2018-05-17T21:33:57Z

I think we were a bit unsure of whether the MIN_ALIGN optimization was worth it. Same applies to the calloc optimization. This happened before my time and I never got around to benchmark it (benchmarking allocators is hard).

gnzlbg mentioned this issue Nov 16, 2017

Allocator traits and std::heap rust-lang/rust#32838

Open

12 tasks

gnzlbg mentioned this issue May 16, 2018

Add missing _excess methods to Alloc and use them in RawVec rust-lang/rust#50738

Closed

gnzlbg mentioned this issue Jul 10, 2018

Add experimental API to support P0901r0: "smallocx_return_t smallocx(size, flags)" #1270

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid recomputing size on allocation and reallocation #1074

Avoid recomputing size on allocation and reallocation #1074

gnzlbg commented Nov 14, 2017 •

edited

davidtgoldblatt commented Nov 14, 2017

gnzlbg commented Nov 15, 2017 •

edited

davidtgoldblatt commented Nov 15, 2017

gnzlbg commented Nov 15, 2017 via email

gnzlbg commented Nov 15, 2017 via email

gnzlbg commented Nov 15, 2017

gnzlbg commented Nov 16, 2017 •

edited

gnzlbg commented Nov 16, 2017

gnzlbg commented Nov 21, 2017

davidtgoldblatt commented Nov 21, 2017

gnzlbg commented Nov 21, 2017

gnzlbg commented May 16, 2018

davidtgoldblatt commented May 17, 2018

gnzlbg commented May 17, 2018

davidtgoldblatt commented May 17, 2018

gnzlbg commented May 17, 2018 •

edited

gnzlbg commented May 17, 2018 •

edited

Avoid recomputing size on allocation and reallocation #1074

Avoid recomputing size on allocation and reallocation #1074

Comments

gnzlbg commented Nov 14, 2017 • edited

davidtgoldblatt commented Nov 14, 2017

gnzlbg commented Nov 15, 2017 • edited

davidtgoldblatt commented Nov 15, 2017

gnzlbg commented Nov 15, 2017 via email

gnzlbg commented Nov 15, 2017 via email

gnzlbg commented Nov 15, 2017

gnzlbg commented Nov 16, 2017 • edited

gnzlbg commented Nov 16, 2017

Flags == 0

Powers of two

Even

Odd (even number - 1)

Flags == 5 (32byte alignment)

Powers of two

Even

Odd

gnzlbg commented Nov 21, 2017

davidtgoldblatt commented Nov 21, 2017

gnzlbg commented Nov 21, 2017

gnzlbg commented May 16, 2018

davidtgoldblatt commented May 17, 2018

gnzlbg commented May 17, 2018

davidtgoldblatt commented May 17, 2018

gnzlbg commented May 17, 2018 • edited

gnzlbg commented May 17, 2018 • edited

gnzlbg commented Nov 14, 2017 •

edited

gnzlbg commented Nov 15, 2017 •

edited

gnzlbg commented Nov 16, 2017 •

edited

gnzlbg commented May 17, 2018 •

edited

gnzlbg commented May 17, 2018 •

edited