Avoid recomputing size on allocation and reallocation #1074
Can you share your test system details + how you're building and testing nallocx speed? That sounds high to me.
Committing to a new allocation API is a big enough deal for us that I can't really see us doing it for this case. In some of the other threads, we've suggested fast-pathing "freebie" alignment values into the malloc / nallocx-with-zero-flags pathways; have you tried this? Does nallocx actually show up in application-level profiles?
@davidtgoldblatt you can find the benchmarks in my fork:

```
git clone git@github.com:gnzlbg/jemallocator.git
git checkout gbg
cargo bench
```

This should work as-is on any Linux and MacOSX system; I am not sure about Windows, but I think it should also work there, although you might need MinGW. I've uploaded my results to a gist, where you can also find the compilation options for the benches. The benchmarks have a warm-up phase that runs until the number of nanoseconds per iteration stabilizes. The effect of `nallocx` is the difference between each `roundtrip_*` benchmark and its `roundtrip_nallocx_*` counterpart.

The results are provided for powers-of-two, even, and odd allocation sizes. These are some of the results (the rest are in the gist), sorted to show the differences:

```
// powers of two:
test roundtrip_pow2_4bytes ... bench: 28 ns/iter (+/- 6)
test roundtrip_nallocx_pow2_4bytes ... bench: 35 ns/iter (+/- 33)
test roundtrip_pow2_128bytes ... bench: 25 ns/iter (+/- 9)
test roundtrip_nallocx_pow2_128bytes ... bench: 32 ns/iter (+/- 25)
test roundtrip_pow2_1024bytes ... bench: 25 ns/iter (+/- 5)
test roundtrip_nallocx_pow2_1024bytes ... bench: 31 ns/iter (+/- 11)
test roundtrip_pow2_4096bytes ... bench: 29 ns/iter (+/- 16)
test roundtrip_nallocx_pow2_4096bytes ... bench: 32 ns/iter (+/- 29)
test roundtrip_pow2_65536bytes ... bench: 176 ns/iter (+/- 82)
test roundtrip_nallocx_pow2_65536bytes ... bench: 179 ns/iter (+/- 157)
test roundtrip_pow2_4194304bytes ... bench: 2,517 ns/iter (+/- 707)
test roundtrip_nallocx_pow2_4194304bytes ... bench: 2,265 ns/iter (+/- 1,096)
// even:
test roundtrip_even_24bytes ... bench: 26 ns/iter (+/- 10)
test roundtrip_nallocx_even_24bytes ... bench: 35 ns/iter (+/- 7)
test roundtrip_even_10000bytes ... bench: 29 ns/iter (+/- 1)
test roundtrip_nallocx_even_10000bytes ... bench: 42 ns/iter (+/- 10)
test roundtrip_even_1000000bytes ... bench: 168 ns/iter (+/- 61)
test roundtrip_nallocx_even_1000000bytes ... bench: 260 ns/iter (+/- 48)
// odd: even size - 1
test roundtrip_odd_1000000bytes ... bench: 165 ns/iter (+/- 112)
test roundtrip_nallocx_odd_1000000bytes ... bench: 253 ns/iter (+/- 33)
test roundtrip_odd_10000bytes ... bench: 30 ns/iter (+/- 23)
test roundtrip_nallocx_odd_10000bytes ... bench: 37 ns/iter (+/- 20)
test roundtrip_odd_24bytes ... bench: 28 ns/iter (+/- 19)
test roundtrip_nallocx_odd_24bytes ... bench: 32 ns/iter (+/- 18)
```

My test system is an Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz with 8 GB of 1600 MHz DDR3. You can try to reproduce the results yourself, but as you can see, at least on my system the call to `nallocx` adds roughly 3-10 ns per allocation/deallocation round trip.
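For clarity, each roundtrip iteration is just an allocate/free pair, and the `nallocx` variant adds one size query on top. A minimal sketch of what each iteration does, with the FFI declarations written out by hand (the real bindings live in the jemallocator crate):

```rust
use libc::{c_int, c_void};

// Hand-written declarations of jemalloc's non-standard API.
extern "C" {
    fn mallocx(size: usize, flags: c_int) -> *mut c_void;
    fn nallocx(size: usize, flags: c_int) -> usize;
    fn sdallocx(ptr: *mut c_void, size: usize, flags: c_int);
}

// One iteration of the plain roundtrip benchmark: allocate + sized free.
unsafe fn roundtrip(size: usize) {
    let p = mallocx(size, 0);
    sdallocx(p, size, 0);
}

// The nallocx variant: the same roundtrip plus one usable-size query,
// which is what the extra nanoseconds in the numbers above measure.
unsafe fn roundtrip_nallocx(size: usize) -> usize {
    let usable = nallocx(size, 0);
    let p = mallocx(size, 0);
    sdallocx(p, size, 0);
    usable
}
```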
Rust already does this. Note that in this particular synthetic test the allocation sizes are compile-time constants; however, the code inside jemalloc sits behind a C FFI boundary, so the optimizer cannot exploit that.

The main problem is that, since calling `nallocx` is not free (as in costing exactly zero cycles), we need to expose to our users a second allocation API that does not call it.

This is a synthetic test specially crafted to measure the overhead of `nallocx`.
I understand that. I think it should be possible to refactor the allocation path so that the usable size it already computes internally is returned by an internal function. After that, what I am asking for becomes "exposing" this internal function so that we can just reuse the size that `mallocx` already computed.

If we had an API that returns the allocation size, building a smaller API on top that drops it is easy, and if that happens within jemalloc it'll probably be zero cost in most cases (or we could make it zero cost by only returning the size when it is zero-cost to do so). For the other way around to be true, the compiler would need to be able to completely eliminate the extra `nallocx` call, which it cannot do across the FFI boundary.
But what I mean w.r.t. `pure` is: can't you just call it, and then let the user ignore the result if they don't need it? The optimizer can then remove the call to `nallocx`.
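In other words, the debated design is a single size-returning entry point where callers who don't care drop the size. A sketch with hypothetical wrapper names (not jemalloc API):

```rust
use libc::{c_int, c_void};

extern "C" {
    fn mallocx(size: usize, flags: c_int) -> *mut c_void;
    fn nallocx(size: usize, flags: c_int) -> usize;
}

// Hypothetical single entry point: always return the usable size too.
unsafe fn alloc_with_size(size: usize, flags: c_int) -> (*mut c_void, usize) {
    // If the compiler knew nallocx were pure, an unused result would make
    // this call dead code, and the optimizer could delete it.
    (mallocx(size, flags), nallocx(size, flags))
}

// Callers that don't need the size simply drop it.
unsafe fn alloc(size: usize, flags: c_int) -> *mut c_void {
    alloc_with_size(size, flags).0
}
```

The catch, raised below, is that across a C FFI boundary Rust cannot declare `nallocx` pure, so today the dead call is not actually removable.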
I think that we could do this (or something like this, because we call into jemalloc via our C FFI).
But I am still worried about the users that do need it.
Also, this optimization relies on everything being inlined properly and the optimizer doing its thing. It should be the case, but it is not a 100% reliable guarantee.
Sorry that I was answering from my phone before; now that I have a laptop I want to reinforce the point. For users that do not need the usable size we should definitely propagate the `pure` attribute so the call can be optimized out. But many fundamental data structures (like vector) want to make use of the real capacity being allocated, and the benchmarks do show that in the best cases the price of `nallocx` is a significant fraction (roughly 20-30%) of the allocation itself. This looks like way too much to me.
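As an illustration (a toy sketch, not actual `Vec` internals): a growable buffer that gets the usable size back from the allocator can adopt it as its capacity, instead of either wasting the slack or paying for a separate `nallocx` call on every growth:

```rust
use libc::{c_int, c_void};

extern "C" {
    fn mallocx(size: usize, flags: c_int) -> *mut c_void;
    fn nallocx(size: usize, flags: c_int) -> usize;
}

// jemalloc rounds requests up to a size class, so the usable size can
// exceed the request (e.g. a 10000-byte request lands in a larger class).
struct Buf {
    ptr: *mut u8,
    len: usize,
    cap: usize, // the *real* capacity reported by the allocator
}

impl Buf {
    unsafe fn grow(&mut self, min_cap: usize) {
        let request = min_cap.max(self.cap * 2).max(8);
        let ptr = mallocx(request, 0) as *mut u8;
        let usable = nallocx(request, 0); // the call this thread wants to avoid
        // (copying `self.len` bytes and freeing the old block elided)
        self.ptr = ptr;
        self.cap = usable; // free extra capacity: fewer future reallocations
    }
}
```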
Off-topic: @davidtgoldblatt, while working on improving this I just found out that `nallocx` is declared `pure` in the jemalloc headers. In Rust terms, there is no attribute we can attach to the `extern` declaration to convey that, so the optimizer has to assume the call has side effects and cannot remove it.
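Concretely (the C declaration in the comment is paraphrased from jemalloc's public header; the Rust side is a sketch):

```rust
use libc::c_int;

extern "C" {
    // jemalloc.h declares this roughly as:
    //   JEMALLOC_EXPORT size_t je_nallocx(size_t size, int flags)
    //       JEMALLOC_ATTR(pure);
    // Stable Rust has no equivalent of C's __attribute__((pure)) for
    // extern fns (nightly's `#[ffi_pure]` is the closest thing), so from
    // Rust's point of view this call is opaque and cannot be eliminated.
    fn nallocx(size: usize, flags: c_int) -> usize;
}
```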
So I wanted to see how easy or hard it would be to get any improvement. For that I've prototyped the most low-effort solution I could think of here and benchmarked it for two cases: `flags == 0` and 32-byte alignment (`flags == 5`). The results aren't bad:

**Flags == 0, powers of two**

```
test rt_pow2_4bytes_mallocx ... bench: 23 ns/iter (+/- 7)
test rt_pow2_4bytes_nallocx ... bench: 29 ns/iter (+/- 14)
test rt_pow2_4bytes_smallocx ... bench: 24 ns/iter (+/- 9)
test rt_pow2_128bytes_mallocx ... bench: 23 ns/iter (+/- 14)
test rt_pow2_128bytes_nallocx ... bench: 29 ns/iter (+/- 20)
test rt_pow2_128bytes_smallocx ... bench: 24 ns/iter (+/- 14)
test rt_pow2_1024bytes_mallocx ... bench: 23 ns/iter (+/- 16)
test rt_pow2_1024bytes_nallocx ... bench: 29 ns/iter (+/- 19)
test rt_pow2_1024bytes_smallocx ... bench: 24 ns/iter (+/- 17)
test rt_pow2_4096bytes_mallocx ... bench: 23 ns/iter (+/- 19)
test rt_pow2_4096bytes_nallocx ... bench: 29 ns/iter (+/- 10)
test rt_pow2_4096bytes_smallocx ... bench: 24 ns/iter (+/- 8)
test rt_pow2_65536bytes_mallocx ... bench: 214 ns/iter (+/- 106)
test rt_pow2_65536bytes_nallocx ... bench: 229 ns/iter (+/- 90)
test rt_pow2_65536bytes_smallocx ... bench: 214 ns/iter (+/- 152)
```

**Flags == 0, even**

```
test rt_odd_24bytes_mallocx ... bench: 23 ns/iter (+/- 19)
test rt_odd_24bytes_nallocx ... bench: 29 ns/iter (+/- 5)
test rt_odd_24bytes_smallocx ... bench: 24 ns/iter (+/- 10)
test rt_even_10000bytes_mallocx ... bench: 28 ns/iter (+/- 4)
test rt_even_10000bytes_nallocx ... bench: 35 ns/iter (+/- 26)
test rt_even_10000bytes_smallocx ... bench: 29 ns/iter (+/- 19)
test rt_even_1000000bytes_mallocx ... bench: 211 ns/iter (+/- 169)
test rt_even_1000000bytes_nallocx ... bench: 219 ns/iter (+/- 109)
test rt_even_1000000bytes_smallocx ... bench: 212 ns/iter (+/- 54)
```

**Flags == 0, odd (even number - 1)**

```
test rt_odd_24bytes_mallocx ... bench: 23 ns/iter (+/- 19)
test rt_odd_24bytes_nallocx ... bench: 29 ns/iter (+/- 5)
test rt_odd_24bytes_smallocx ... bench: 24 ns/iter (+/- 10)
test rt_odd_10000bytes_mallocx ... bench: 28 ns/iter (+/- 19)
test rt_odd_10000bytes_nallocx ... bench: 35 ns/iter (+/- 20)
test rt_odd_10000bytes_smallocx ... bench: 32 ns/iter (+/- 18)
test rt_odd_1000000bytes_mallocx ... bench: 218 ns/iter (+/- 151)
test rt_odd_1000000bytes_nallocx ... bench: 241 ns/iter (+/- 31)
test rt_odd_1000000bytes_smallocx ... bench: 214 ns/iter (+/- 111)
```

**Flags == 5 (32-byte alignment), powers of two**

```
test rt_pow2_4bytes_mallocx ... bench: 33 ns/iter (+/- 25)
test rt_pow2_4bytes_nallocx ... bench: 39 ns/iter (+/- 27)
test rt_pow2_4bytes_smallocx ... bench: 33 ns/iter (+/- 26)
test rt_pow2_128bytes_mallocx ... bench: 32 ns/iter (+/- 9)
test rt_pow2_128bytes_nallocx ... bench: 40 ns/iter (+/- 23)
test rt_pow2_128bytes_smallocx ... bench: 32 ns/iter (+/- 8)
test rt_pow2_1024bytes_mallocx ... bench: 32 ns/iter (+/- 4)
test rt_pow2_1024bytes_nallocx ... bench: 39 ns/iter (+/- 27)
test rt_pow2_1024bytes_smallocx ... bench: 32 ns/iter (+/- 14)
test rt_pow2_4096bytes_mallocx ... bench: 32 ns/iter (+/- 16)
test rt_pow2_4096bytes_nallocx ... bench: 40 ns/iter (+/- 27)
test rt_pow2_4096bytes_smallocx ... bench: 32 ns/iter (+/- 19)
test rt_pow2_65536bytes_mallocx ... bench: 231 ns/iter (+/- 125)
test rt_pow2_65536bytes_nallocx ... bench: 243 ns/iter (+/- 48)
test rt_pow2_65536bytes_smallocx ... bench: 233 ns/iter (+/- 156)
```

**Flags == 5, even**

```
test rt_even_24bytes_mallocx ... bench: 37 ns/iter (+/- 33)
test rt_even_24bytes_nallocx ... bench: 40 ns/iter (+/- 14)
test rt_even_24bytes_smallocx ... bench: 33 ns/iter (+/- 15)
test rt_even_10000bytes_mallocx ... bench: 39 ns/iter (+/- 10)
test rt_even_10000bytes_nallocx ... bench: 47 ns/iter (+/- 42)
test rt_even_10000bytes_smallocx ... bench: 39 ns/iter (+/- 10)
test rt_even_1000000bytes_mallocx ... bench: 215 ns/iter (+/- 16)
test rt_even_1000000bytes_nallocx ... bench: 245 ns/iter (+/- 71)
test rt_even_1000000bytes_smallocx ... bench: 235 ns/iter (+/- 103)
```

**Flags == 5, odd**

```
test rt_odd_24bytes_mallocx ... bench: 34 ns/iter (+/- 19)
test rt_odd_24bytes_nallocx ... bench: 39 ns/iter (+/- 22)
test rt_odd_24bytes_smallocx ... bench: 33 ns/iter (+/- 15)
test rt_odd_10000bytes_mallocx ... bench: 39 ns/iter (+/- 31)
test rt_odd_10000bytes_nallocx ... bench: 47 ns/iter (+/- 34)
test rt_odd_10000bytes_smallocx ... bench: 39 ns/iter (+/- 12)
test rt_odd_1000000bytes_mallocx ... bench: 225 ns/iter (+/- 91)
test rt_odd_1000000bytes_nallocx ... bench: 248 ns/iter (+/- 65)
test rt_odd_1000000bytes_smallocx ... bench: 236 ns/iter (+/- 170)
```

I said this is the lowest-effort possible solution because it just computes the usable size again internally after the allocation, instead of reusing the size the allocation path already derived. I haven't invested much time in this, but from looking into the library for the last 45 minutes I cannot seem to find an easy way to obtain that size directly from the internal allocation path.
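For the record, the prototype's entry point looks roughly like this (`smallocx` is the experimental name from my branch, not an official jemalloc API, and the exact signature here is an assumption):

```rust
use libc::{c_int, c_void};

extern "C" {
    // mallocx plus "also tell me the usable size" in one call (prototype).
    fn smallocx(size: usize, flags: c_int, usable: *mut usize) -> *mut c_void;
    fn sdallocx(ptr: *mut c_void, size: usize, flags: c_int);
}

unsafe fn example() {
    let mut usable = 0usize;
    let p = smallocx(10_000, 0, &mut usable);
    assert!(usable >= 10_000); // rounded up to the size class
    sdallocx(p, usable, 0);
}
```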
Apache Arrow had a similar issue: https://issues.apache.org/jira/browse/ARROW-464. The performance cost of calling `nallocx` on every allocation was measurable for them as well.
There's not a ton of info in that link, but it seems likely that the bulk of their regression was caused by the extra allocations from lowering the growth factor of their dynamic array. Similarly, an extra 7 ns of overhead on a 10k allocation doesn't worry me; the cost of actually touching all that memory will be large enough that the difference is lost in the noise.

The cost of nallocx for large allocations is unnecessarily high right now, and I have some ideas on how to improve it if you're feeling motivated, but again: I'm suspicious of the claim that this is something any real application will be able to notice.

Adding extra allocation functions isn't without cost; it takes up memory, may spend cycles (depending on how we would refactor to allow it), can't get deprecated without an ABI breakage, and limits our implementation flexibility. I'm especially skeptical of the benefit of this when compared against things that could be done to reduce costs on Rust's side. Calling malloc instead of mallocx (in cases where the implicit alignment is no less than the required one) will shave off a few nanoseconds. Teaching the Rust compiler about jemalloc's size classes (so that a size query on a compile-time-constant size folds to a constant) would help too.

I think I'm coming off more dead-set against this than I really am, but table stakes for adding this sort of API are an end-to-end performance issue in a "real" application that can be fixed by adding these functions, and that isn't better fixed at some other layer of the stack.
I understand.

I am feeling motivated, so please point me in the right direction here; making `nallocx` cheaper for large allocations sounds worth doing.

The overhead does look large if one takes into account that the allocations themselves take 20-30 ns, but the synthetic benchmarks I was using did not touch the memory at all, so this makes sense.

Note that none of this applies if the user actually needs the result of `nallocx`.
cc @davidtgoldblatt @interwq @jasone

P0901r0 talks about this: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0901r0.html. It basically suggests adding to C++ an `operator new()` overload that maps 1:1 to the API that I've implemented and suggested adding here.
I talked with the authors in Jacksonville, and they convinced me that this is a good idea (although for telemetry and feedback about fragmentation; I still think the cycle-count reasons are pretty goofy). I'm not sure how this would be structured, though. I don't think we'd want to commit to it at the hard ABI level, except via whatever `operator new` overload it shows up as in the C++ version. If this were only available behind a config option (and possibly with some sort of user-hostile name mangling, to prevent relying on it from version to version), would that work for you?
@davidtgoldblatt I think that exposing this behind a config option would be a good first step, making it clear that this new API is experimental and could be removed in the future if it does not prove itself. Rust could start using it in its bundled jemalloc version (when dynamically linking against a system jemalloc we would just continue using `nallocx`).

If you are going to really consider going forward with this, keep in mind that Rust could technically benefit from versions of all jemalloc allocation APIs that return the usable size directly (e.g. `mallocx` and `rallocx`).

FWIW I also agree that the cycle-count wins are negligible compared with the cost of touching the memory.
BTW, if you're looking for something that will speed things up for Rust, I'd really recommend using malloc directly, where correct. At one point (IIRC), the Rust runtime was doing all of its allocations with unnecessary alignment parameters (e.g. calling mallocx with an alignment that was already guaranteed by the type). That will cause fast-path regressions, and I would expect it to be fairly straightforward to fix.
We tuned something there last year (https://github.com/alexcrichton/jemallocator/issues/25). In a nutshell, this is how it looks today (https://github.com/alexcrichton/jemallocator/blob/master/src/lib.rs#L73):

```rust
/// Computes `flags` from `align`.
///
/// Equivalent to the MALLOCX_ALIGN(a) macro.
#[inline]
#[allow(non_snake_case)]
pub fn MALLOCX_ALIGN(align: usize) -> c_int {
    align.trailing_zeros() as c_int
}
// The minimum alignment guaranteed by the architecture. This value is used to
// add fast paths for low alignment values. In practice, the alignment is a
// constant at the call site and the branch will be optimized out.
#[cfg(all(any(target_arch = "arm",
target_arch = "mips",
target_arch = "mipsel",
target_arch = "powerpc")))]
const MIN_ALIGN: usize = 8;
#[cfg(all(any(target_arch = "x86",
target_arch = "x86_64",
target_arch = "aarch64",
target_arch = "powerpc64",
target_arch = "powerpc64le",
target_arch = "mips64",
target_arch = "s390x",
target_arch = "sparc64")))]
const MIN_ALIGN: usize = 16;
fn layout_to_flags(align: usize, size: usize) -> c_int {
// If our alignment is less than the minimum alignment then we may not
// have to pass special flags asking for a higher alignment. If the
// alignment is greater than the size, however, then this hits a sort of odd
// case where we still need to ask for a custom alignment. See #25 for more
// info.
if align <= MIN_ALIGN && align <= size {
0
} else {
ffi::MALLOCX_ALIGN(align)
}
}
/// Handle to the jemalloc allocator
///
/// This type and a reference to this type both implement the `Alloc` trait,
///
/// allowing usage of this `Jemalloc` type both in collections and as a global
/// allocator.
pub struct Jemalloc;
unsafe impl GlobalAlloc for Jemalloc {
#[inline]
unsafe fn alloc(&self, layout: Layout) -> *mut Opaque {
let flags = layout_to_flags(layout.align(), layout.size());
let ptr = ffi::mallocx(layout.size(), flags);
ptr as *mut Opaque
}
#[inline]
unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut Opaque {
let ptr = if layout.align() <= MIN_ALIGN && layout.align() <= layout.size() {
ffi::calloc(1, layout.size())
} else {
let flags = layout_to_flags(layout.align(), layout.size()) | ffi::MALLOCX_ZERO;
ffi::mallocx(layout.size(), flags)
};
ptr as *mut Opaque
}
#[inline]
unsafe fn dealloc(&self, ptr: *mut Opaque, layout: Layout) {
let flags = layout_to_flags(layout.align(), layout.size());
ffi::sdallocx(ptr as *mut c_void, layout.size(), flags)
}
#[inline]
unsafe fn realloc(&self,
ptr: *mut Opaque,
layout: Layout,
new_size: usize) -> *mut Opaque {
let flags = layout_to_flags(layout.align(), new_size);
let ptr = ffi::rallocx(ptr as *mut c_void, new_size, flags);
ptr as *mut Opaque
}
}
```

Note that we call `mallocx` with `flags == 0` instead of calling `malloc` when no special alignment is required. There is a second allocation API below that does the same, but also uses `nallocx` to return the usable size to the caller.
I think we were a bit unsure of whether the `flags == 0` path of `mallocx` is actually as fast as plain `malloc`.
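If it is not, switching is mechanical. A sketch of the `malloc` fast path suggested above, with a minimal inlined copy of the `layout_to_flags` logic from the snippet (x86_64 `MIN_ALIGN` assumed):

```rust
use libc::{c_int, c_void};

extern "C" {
    fn malloc(size: usize) -> *mut c_void;
    fn mallocx(size: usize, flags: c_int) -> *mut c_void;
}

const MIN_ALIGN: usize = 16; // x86_64 value from the snippet above

fn layout_to_flags(align: usize, size: usize) -> c_int {
    if align <= MIN_ALIGN && align <= size {
        0
    } else {
        align.trailing_zeros() as c_int
    }
}

// Take jemalloc's malloc fast path whenever no alignment flags are needed,
// and fall back to mallocx otherwise.
unsafe fn alloc(size: usize, align: usize) -> *mut c_void {
    match layout_to_flags(align, size) {
        0 => malloc(size),
        flags => mallocx(size, flags),
    }
}
```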
A single `nallocx` call takes ~10-15 ns on my test system. Currently, Rust needs to call `nallocx` on allocation and on reallocation. Given that jemalloc already computes the usable size internally (see `jemalloc/src/arena.c`, line 1815 at commit b5d071c), it seems that we could save those 10 ns if jemalloc wouldn't throw this information away.

Would it be possible to provide a pair of `mallocx`/`rallocx` methods that always return the allocation size for aligned allocations? For example, something like this:
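A sketch of what such a pair could look like (the names and signatures here are hypothetical, not existing jemalloc API):

```rust
use libc::{c_int, c_void};

// Hypothetical size-returning variants of mallocx/rallocx: each writes the
// usable size of the returned block through `usable`, reusing the value the
// allocation path already computed instead of a separate nallocx call.
extern "C" {
    fn smallocx(size: usize, flags: c_int, usable: *mut usize) -> *mut c_void;
    fn srallocx(ptr: *mut c_void, size: usize, flags: c_int,
                usable: *mut usize) -> *mut c_void;
}
```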
(I don't know whether this has a cost for unaligned allocations, or what that cost might be, but it might be worth it to check that out as well).