Simplify `CAMLalign` and improve alignment of `caml_stat_block` with C11 `max_align_t` and `aligned_alloc` #13139

MisterDA · 2024-04-30T20:13:10Z

Some cleanups removing checks and workarounds for older compilers, assuming that the compiler supports C11 or C++11 out of the box. We may use _Alignas (since C11) or alignas (since C23) directly, and use the max_align_t type. Unfortunately, support for max_align_t is missing from the Windows C standard library.
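As a rough illustration of the selection described above, here is a minimal sketch of a macro that picks `alignas` or `_Alignas` from the language version. `SKETCH_ALIGNAS`, `SKETCH_ALIGNOF` and `struct sketch_buffer` are hypothetical names for this example, not the runtime's actual `CAMLalign` definition, and the sketch assumes a standard library that provides `max_align_t`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the runtime's CAMLalign. */
#if defined(__cplusplus) || \
    (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L)
  /* C++11 and later, or C23: alignas and alignof are keywords. */
  #define SKETCH_ALIGNAS(x) alignas(x)
  #define SKETCH_ALIGNOF(x) alignof(x)
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
  /* C11/C17: use the _Alignas and _Alignof keywords directly. */
  #define SKETCH_ALIGNAS(x) _Alignas(x)
  #define SKETCH_ALIGNOF(x) _Alignof(x)
#else
  #error "this sketch assumes C11, C++11, or later"
#endif

/* A buffer whose payload is aligned like max_align_t. */
struct sketch_buffer {
  char tag;
  SKETCH_ALIGNAS(max_align_t) char bytes[32];
};

/* Compile-time check that the alignment request was honoured. */
static_assert(offsetof(struct sketch_buffer, bytes)
                % SKETCH_ALIGNOF(max_align_t) == 0,
              "bytes must be aligned like max_align_t");
```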

xavierleroy

Looks good to me. +1 for using _Alignas and alignas in preference to attributes. One suggestion below concerning the MSVC fallback.

xavierleroy · 2024-05-31T13:17:31Z

runtime/memory.c


 struct pool_block {
 #ifdef DEBUG
  intnat magic;
 #endif
  struct pool_block *next;
  struct pool_block *prev;
-  union max_align data[];  /* not allocated, used for alignment purposes */
+  max_align_t data[]; /* not allocated, used for alignment purposes */
 };


One downside with this use of double as fallback max_align_t type is that double has rather low alignment: 8 on x86_64, 4 on x86_32, while most SSE vector instructions require 16-alignment. (In Linux x86_64 and macOS x86_64, max_align_t has 16 alignment.) What about using an explicit 16 alignment as the fallback case?

    struct pool_block {
    #ifdef DEBUG
      intnat magic;
    #endif
      struct pool_block *next;
      struct pool_block *prev;
    #ifdef HAVE_MAX_ALIGN_T
      max_align_t data[]; /* not allocated, used for alignment purposes */
    #else
      CAMLalign(16) char data[]; /* 16 is a reasonable alignment default */
    #endif
    };
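To see what the explicit 16-byte fallback would do to the layout, here is a minimal sketch with a stand-in structure. `struct sketch_pool_block` is a hypothetical name, the debug field is left out, and standard `alignas` is used in place of `CAMLalign`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pool_block, without the debug field. */
struct sketch_pool_block {
  struct sketch_pool_block *next;
  struct sketch_pool_block *prev;
  alignas(16) char data[]; /* explicit 16-byte alignment, as suggested */
};

/* The flexible array member starts on a 16-byte offset... */
static_assert(offsetof(struct sketch_pool_block, data) % 16 == 0,
              "data must start at a multiple of 16");
/* ...and the whole structure inherits the 16-byte alignment. */
static_assert(alignof(struct sketch_pool_block) >= 16,
              "the struct inherits the alignment of data");
```

The offset only translates into a 16-aligned address when the allocator hands back a 16-aligned block in the first place.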

I'd tend to align (ha, ha) with your suggestion, but: MSVC C++ cstddef.h header uses double alignment for max_align_t, and clang-cl uses double too, which I think makes it a reasonable default for Windows.
As for a general fallback, I hope that other compilers+libc aren't as buggy, and I wonder if we do need to provide a definition, or rather catch a missing definition as a compilation failure.

> MSVC C++ cstddef.h header uses double alignment for max_align_t

It is in error, then. The program below, compiled with MSVC, prints 16, showing that there are types with alignment greater than 8.

    #include <iostream>
    #include <xmmintrin.h>

    int main() {
      std::cout << alignof(__m128) << '\n';
      return 0;
    }

In this PR, you're not trying to emulate whatever strange choices MSVC does, but to make sure OCaml's pool allocator works as intended. If the intent is to align for the maximal alignment constraint of the target platform, the alignment must be >= 16 on x86, because SSE instructions. If the intent is to align for the biggest datatype OCaml stores in heap blocks, word-alignment is enough and you don't need to add anything to struct pool_block to guarantee it, as it already contains two word-sized pointers.

What about defaulting to long double rather than double then?

> It is in error, then. The program below, compiled with MSVC, prints 16, showing that there are types with alignment greater than 8.

At least for C++, I don't think there's an error.

> The type max_align_t is a POD type whose alignment requirement is at least as great as that of every scalar type, and whose alignment requirement is supported in every context. [C++11]

    #include <iostream>
    #include <xmmintrin.h>
    #include <cstddef>
    #include <type_traits>

    int main() {
      std::cout << alignof(__m128) << std::endl
                << alignof(std::max_align_t) << std::endl
                << std::is_scalar<__m128>() << std::endl;
      return 0;
    }

shows 16, 8, 0, indicating that __m128 isn't considered a scalar type, and thus the definition of std::max_align_t is consistent (with respect to the __m128 type).

The definition for C is more vague and I guess it could be argued that 16 would be a correct value.

> max_align_t which is an object type whose alignment is as great as is supported by the implementation in all contexts; [C11]
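For the C side of the question, a small probe like the following can report what a given toolchain chooses. This is only a sketch: `sketch_report_alignments` is a hypothetical helper, and it assumes a C11 library that defines `max_align_t` in `<stddef.h>`, which the description above says the Windows C standard library does not.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

/* Prints the alignments under discussion and returns the alignment
   of max_align_t. The values depend on the compiler and the libc. */
size_t sketch_report_alignments(void) {
  printf("alignof(double)      = %zu\n", alignof(double));
  printf("alignof(long double) = %zu\n", alignof(long double));
  printf("alignof(max_align_t) = %zu\n", alignof(max_align_t));
  return alignof(max_align_t);
}
```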

> What about defaulting to long double rather than double then?

long double is identical to double under MSVC ¹.

Thanks for the thorough review. I think the real question is indeed whether the intent is to align for the maximal alignment constraint of the target platform, or to align for the biggest datatype OCaml stores in heap blocks. None of the fields in the former union max_align had a 16-byte alignment, and everything worked well, didn't it? Now it's not clear to me that this field is actually needed.

Footnotes

1. https://learn.microsoft.com/en-us/cpp/c-language/type-long-double?view=msvc-170

We could use the same fallback as GCC and clang do. They don't seem to account for vector extensions (also, why stop at __m128 when there's also 256-bits vectors?). My macOS M1 which supports NEON still defines alignof(max_align_t) == 8.

    #if !defined(HAVE_MAX_ALIGN_T)
    typedef struct {
      CAMLalign(long long) long long ll;
      CAMLalign(long double) long double ld;
    } max_align_t;
    #endif

MinGW-w64 defines a 16-byte alignment for max_align_t, but unlike MSVC, it supports long double, aligned to 16 bytes.

The original justification comes from this comment:

> That trick with max_align is not needed at all for correctness (void* would do just fine as far as correctness goes); however, it is supposed to increase the chances of getting a more favourable alignment of data in terms of performance (it is important, since data is going to be accessed much more often than the header of the block).

This patch would move data from an 8-byte boundary to a 16-byte boundary. I don't know if it would affect performance, but it would likely waste a bit of space.

Thinking about it, the data field type probably shouldn't be max_align_t but rather char as in your first suggestion.

CAMLalign(max_align_t) char data[];

"this comment" refers to the OCaml 4 memory allocator, which has been extensively rewritten for OCaml 5. Please see #12212 and the corresponding OCaml 5 code to check how alignment is actually handled in pool blocks.

I have the impression that you are talking about two different things, which very conveniently have the exact same name in the codebase. In memory.c (which the PR is hacking on), "pool" refers to a single global pool of blocks (struct pool_block) that have been allocated with caml_stat_alloc by the runtime, and is used to make sure that they are all freed on program termination -- I think that it is only used in cleanup-at-exit mode? There is a single memory pool, which is basically a doubly-linked list, and each caml_stat_alloc adds one element to it.
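To make the shape of that global pool concrete, here is a minimal sketch of a doubly-linked list of block headers. The names (`sketch_node`, `sketch_pool`, `sketch_link`, `sketch_unlink`) are hypothetical, and the circular layout with a sentinel is an assumption of this example rather than a description of memory.c.

```c
#include <assert.h>
#include <stddef.h>

/* Header placed in front of every tracked allocation. */
struct sketch_node {
  struct sketch_node *next;
  struct sketch_node *prev;
};

/* Circular list with a sentinel: empty when it points to itself. */
static struct sketch_node sketch_pool = { &sketch_pool, &sketch_pool };

/* Called on allocation: insert the new header right after the sentinel. */
void sketch_link(struct sketch_node *n) {
  assert(n != NULL);
  n->next = sketch_pool.next;
  n->prev = &sketch_pool;
  sketch_pool.next->prev = n;
  sketch_pool.next = n;
}

/* Called on free: splice the header out of the list. */
void sketch_unlink(struct sketch_node *n) {
  assert(n != NULL);
  n->prev->next = n->next;
  n->next->prev = n->prev;
}

int sketch_pool_is_empty(void) {
  return sketch_pool.next == &sketch_pool;
}
```

Walking the list from the sentinel at exit would reach every block that is still linked, which is what lets them all be freed on termination.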

In shared_heap.c, struct pool refers to one block or "slab" of memory, of size 4096 words, owned by a domain-local caml_heap_state structure; each pool has its own free list. I think that the alignment constraints that @xavierleroy has in mind apply to values stored in the shared_heap.c pools, but that those are in fact currently not stored in the global memory.c pool, as they are allocated by caml_mem_map in shared_heap.c:pool_acquire. (Large objects are allocated differently in large_allocate, and oddly enough they seem to use malloc and not caml_stat_alloc.)

Thanks for the clarification, it makes more sense now.

So, we're talking about caml_stat_alloc, which should have the same good properties as malloc, in particular: align sufficiently so that all data types supported by the target architecture (not just those used by the OCaml runtime system, because third-party OCaml-C stubs) can be stored safely. For x86, it means 16-alignment, because of SSE 128-bit vector operations. (256- and 512-bit AVX operations support less aligned accesses, albeit slowly.) For the other OCaml target architectures, 8-alignment seems enough, e.g. for ARM64's load/store register pair instructions, and for ARM NEON, but I'm not sure.

(Note that I wrote "stored safely", i.e. the program doesn't crash, not "stored efficiently" with the alignment that gives best hardware performance for vector instructions. The latter may require specific allocators besides malloc, for the reason below.)

Giving struct pool_block an alignment greater than the one guaranteed by malloc is useless. E.g. if struct pool_block is 16-aligned and malloc returns an address = 8 mod 16 (as can happen in 32-bit Windows, I heard), you'll add 16 to this address and get something that is not 16 aligned, but 8 bytes are wasted.
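The arithmetic behind that example can be spelled out in a few lines. This is just a sketch with made-up addresses; `sketch_data_residue` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Residue, modulo `align`, of the address where the data part starts
   when a header of `header_size` bytes is placed at address `base`. */
uintptr_t sketch_data_residue(uintptr_t base, uintptr_t header_size,
                              uintptr_t align) {
  assert(align != 0);
  return (base + header_size) % align;
}
```

With a 16-byte header, a base address of `0x1008` (8 mod 16) gives a data address of `0x1018`, which is still 8 mod 16, while a base of `0x1000` gives `0x1010`, which is 16-aligned. The header size shifts the address but cannot repair the residue inherited from malloc.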

For this reason, I don't believe in the "align on cache lines" argument of "this comment": cache lines are typically 64-byte wide, sometimes 128 or even 256. malloc will not return 64-byte aligned pointers, at least not for small- to medium-sized blocks, because it would waste too much space, and trying to realign to 64 afterward would waste much space too.

What does this mean for this PR?

The target systems we care about are 64-bit architecture with alignment constraints <= 16 and malloc returning 16-aligned blocks. In this case, the data part of struct pool_block is naturally 16-aligned (because the two pointers before use 16 bytes), and nothing needs to be done. Aligning data using max_align_t should have no effect. (*)

For a 32-bit architecture with 16-byte alignment constraints and malloc returning 16-aligned blocks (e.g. Linux x86-32), aligning data to 16 seems preferable to me and can be achieved by using max_align_t.

For a 32-bit architecture with 16-byte alignment constraints and malloc returning 8-aligned blocks (perhaps Windows 32 bits, not sure): no amount of alignment constraints in struct pool_block will give 16-aligned data fields, so you could just as well put no alignment constraints.

(*) There's still an issue with the pesky magic number added in debug mode, which throws alignment off. I wish we would just remove it, as I don't think it adds anything to the debug mechanisms built into most malloc implementations.

That's all (and everything) I had to say about this PR. Now I'm done with this discussion and this PR.

Thank you for the review and the explanations, there's a lot to be learned here.

gasche · 2024-06-04T15:04:50Z

I would be happy to approve and merge this PR if the recommendations of Xavier are implemented:

> The target systems we care about are 64-bit architecture with alignment constraints <= 16 and malloc returning 16-aligned blocks. In this case, the data part of struct pool_block is naturally 16-aligned (because the two pointers before use 16 bytes), and nothing needs to be done. Aligning data using max_align_t should have no effect. (*)
>
> For a 32-bit architecture with 16-byte alignment constraints and malloc returning 16-aligned blocks (e.g. Linux x86-32), aligning data to 16 seems preferable to me and can be achieved by using max_align_t.
>
> For a 32-bit architecture with 16-byte alignment constraints and malloc returning 8-aligned blocks (perhaps Windows 32 bits, not sure): no amount of alignment constraints in struct pool_block will give 16-aligned data fields, so you could just as well put no alignment constraints.

(The footnote (*) is about the magic number. Given that this is a debug-only feature, which results at worst in a bit of wasted space due to extra alignment, it seems reasonable to do the easy thing which is to leave it alone.)

I am not sure whether the current state of the PR corresponds to this explicit design choice. (I think it does, but in a non-obvious way.) @MisterDA, can you confirm? Could you maybe include a comment that explains the current design/intent, possibly by just quoting the description of Xavier above? (If you just quote, you could mark him as author of the corresponding git commit.)

MisterDA · 2024-06-06T07:47:26Z

> I am not sure whether the current state of the PR corresponds to this explicit design choice. (I think it does, but in a non-obvious way.) @MisterDA, can you confirm?

I'm not sure either, I'll need some time to convince myself.

> Could you maybe include a comment that explains the current design/intent, possibly by just quoting the description of Xavier above?

I'm thinking of turning the bullet points into some sort of static assertions, to be added to the code or the configure script.
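One possible shape for such static assertions, as a sketch only: `struct sketch_block` is a hypothetical stand-in for struct pool_block, and the sketch assumes `max_align_t` is available. Only the layout can be checked at compile time; the alignment that malloc actually returns cannot, so that half of the reasoning would still need a configure-time or run-time check.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pool_block. */
struct sketch_block {
  struct sketch_block *next;
  struct sketch_block *prev;
  max_align_t data[]; /* not allocated, used for alignment purposes */
};

/* The data part starts on a max_align_t boundary inside the block. */
static_assert(offsetof(struct sketch_block, data)
                % alignof(max_align_t) == 0,
              "data must be aligned like max_align_t");

/* The header holds at least the two list pointers. */
static_assert(offsetof(struct sketch_block, data) >= 2 * sizeof(void *),
              "data must come after both pointers");

/* The block as a whole asks for max_align_t alignment. */
static_assert(alignof(struct sketch_block) == alignof(max_align_t),
              "block alignment follows max_align_t");
```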

Explanations from Xavier Leroy at ocaml#13139 (comment)

- The target systems we care about are 64-bit architecture with alignment constraints <= 16 and malloc returning 16-aligned blocks. In this case, the `data` part of `struct pool_block` is naturally 16-aligned (because the two pointers before use 16 bytes), and nothing needs to be done. Aligning data using `max_align_t` should have no effect.
- For a 32-bit architecture with 16-byte alignment constraints and malloc returning 16-aligned blocks (e.g. Linux x86-32), aligning `data` to 16 seems preferable to me and can be achieved by using `max_align_t`.
- For a 32-bit architecture with 16-byte alignment constraints and malloc returning 8-aligned blocks (perhaps Windows 32 bits, not sure): no amount of alignment constraints in `struct pool_block` will give 16-aligned `data` fields, so you could just as well put no alignment constraints.

Unfortunately, MSVC C11 support is incomplete and doesn't define `max_align_t`.

- https://developercommunity.visualstudio.com/t/max_align_t-is-not-provided-by-stddefh/10299797
- https://developercommunity.visualstudio.com/t/stdc11-should-add-max-align-t-to-stddefh/1386891

ghost · 2024-06-19T06:46:05Z

Coming back to this PR:

- On one hand, the proposed change to use max_align_t is technically correct (and even if, on some compilers, max_align_t is not good enough for SSE data, it is not worse than the existing union).
- On the other hand, Xavier's point stands: as long as pool_block is obtained with malloc, it makes no sense to have pool_block require a larger alignment than what malloc provides.

A somewhat simple way to address this would be to:

- remove that unused field from struct pool_block altogether
- rewrite SIZEOF_POOL_BLOCK as a computation rounding sizeof(struct pool_block) to the intended alignment (supposedly 16 on amd64 due to SSE, max_align_t elsewhere).
- allocate these pool_block using C11 aligned_alloc, using the above decided alignment.
- replace the use of the removed data field of struct pool_block with pointer arithmetic using SIZEOF_POOL_BLOCK above.

This could be done either in this PR or in a forthcoming PR (since this one currently doesn't change anything about the actual alignment of these allocations).
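A minimal sketch of what that proposal could look like, under several assumptions: the alignment is fixed at 16, the names (`SKETCH_ALIGN`, `SKETCH_SIZEOF_HDR`, `sketch_stat_alloc`, `sketch_stat_free`) are hypothetical, the list linking and overflow checks are omitted, and the platform provides C11 `aligned_alloc`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_ALIGN ((size_t)16) /* assumed alignment, e.g. amd64 with SSE */
#define SKETCH_ROUND_UP(n) \
  (((n) + SKETCH_ALIGN - 1) & ~(SKETCH_ALIGN - 1))

static_assert((SKETCH_ALIGN & (SKETCH_ALIGN - 1)) == 0,
              "the rounding trick needs a power of two");

/* Header without any data field. */
struct sketch_hdr {
  struct sketch_hdr *next;
  struct sketch_hdr *prev;
};

/* Header size rounded up to the intended alignment. */
#define SKETCH_SIZEOF_HDR SKETCH_ROUND_UP(sizeof(struct sketch_hdr))

/* Returns a pointer to the data part, or NULL on failure. */
void *sketch_stat_alloc(size_t sz) {
  /* C11 asks for a size that is a multiple of the alignment. */
  size_t total = SKETCH_ROUND_UP(SKETCH_SIZEOF_HDR + sz);
  char *raw = aligned_alloc(SKETCH_ALIGN, total);
  if (raw == NULL) return NULL;
  return raw + SKETCH_SIZEOF_HDR; /* pointer arithmetic, no data field */
}

/* Recovers the start of the block from the data pointer and frees it. */
void sketch_stat_free(void *data) {
  if (data != NULL) free((char *)data - SKETCH_SIZEOF_HDR);
}
```

Because both the block and the rounded header size are multiples of the alignment, the data pointer keeps the alignment of the block itself.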


MisterDA · 2024-06-20T09:07:49Z

I agree with @dustanddreams's conclusions. I'd like to experiment with aligned_alloc in a follow-up PR, if possible.

MisterDA · 2024-06-24T11:41:54Z

Unfortunately we cannot use aligned_alloc as the blocks can be resized. This is currently done with realloc, but there is no POSIX or C11 aligned realloc. We cannot hand-roll our own realloc, because the original size of the data is lost (unless we duplicate its size in struct pool_block). Does that make sense?

ghost · 2024-06-24T11:47:51Z

> Unfortunately we cannot use aligned_alloc as the blocks can be resized. This is currently done with realloc, but there is no POSIX or C11 aligned realloc. We cannot hand-roll our own realloc, because the original size of the data is lost (unless we duplicate its size in struct pool_block). Does that make sense?

That's unfortunate.

This means the (few) current users of caml_stat_resize and caml_stat_resize_noexc need to be inspected; if they do not require any particular alignment the use of realloc is safe and a comment should be added to mention that it's ok; otherwise we'll have to roll our own realloc-with-alignment flavour. (and yes, this involves figuring out what the original allocation size was - glibc has malloc_usable_size but that's a non-portable interface)
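For reference, a realloc-with-alignment flavour that records the requested size in the header might look like the sketch below. All names (`rs_hdr`, `rs_alloc`, `rs_resize`, `rs_free`) are hypothetical, the alignment is assumed to be 16, overflow checks are omitted, and C11 `aligned_alloc` is assumed to be available.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RS_ALIGN ((size_t)16) /* assumed alignment for this sketch */
#define RS_ROUND_UP(n) (((n) + RS_ALIGN - 1) & ~(RS_ALIGN - 1))

static_assert((RS_ALIGN & (RS_ALIGN - 1)) == 0,
              "the rounding trick needs a power of two");

/* The header remembers the requested size so a resize can copy it. */
struct rs_hdr { size_t size; };
#define RS_HDR RS_ROUND_UP(sizeof(struct rs_hdr))

void *rs_alloc(size_t sz) {
  char *raw = aligned_alloc(RS_ALIGN, RS_ROUND_UP(RS_HDR + sz));
  if (raw == NULL) return NULL;
  ((struct rs_hdr *)raw)->size = sz;
  return raw + RS_HDR;
}

void rs_free(void *data) {
  if (data != NULL) free((char *)data - RS_HDR);
}

/* Allocate a fresh aligned block, copy the old contents, free the old
   block. On failure the old block is left untouched, like realloc. */
void *rs_resize(void *data, size_t new_sz) {
  if (data == NULL) return rs_alloc(new_sz);
  size_t old_sz = ((struct rs_hdr *)((char *)data - RS_HDR))->size;
  void *fresh = rs_alloc(new_sz);
  if (fresh == NULL) return NULL;
  memcpy(fresh, data, old_sz < new_sz ? old_sz : new_sz);
  rs_free(data);
  return fresh;
}
```

The cost is visible in the header: every block carries an extra size word, rounded up to the alignment.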


MisterDA · 2024-07-04T13:46:32Z

I've added more code that uses C11 aligned_alloc, free or Microsoft _aligned_malloc, _aligned_free, _aligned_realloc.
The default alignment of struct pool_block and its data field is now the alignment of max_align_t in the general case, or 16 on amd64 if alignof(max_align_t) is smaller than 16.
As the libc doesn't provide an aligned realloc (Microsoft does), this needs a little custom realloc implementation in caml_stat_resize to keep the buffer aligned. We've considered using malloc_usable_size or Apple's malloc_size, but they're non-portable and intended towards telemetry rather than custom realloc, so we need to keep track of the requested alloc size, and copy the data manually.
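The alignment rule described above could be written roughly as follows. This is a sketch, not the code in the PR: `SKETCH_POOL_ALIGN` is a hypothetical name, the architecture test macros are an assumption, and the sketch requires `max_align_t`, so it would need a separate fallback where the C library does not define it.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
  /* amd64: use at least 16 bytes, because of SSE. */
  #define SKETCH_POOL_ALIGN \
    (alignof(max_align_t) < 16 ? (size_t)16 : alignof(max_align_t))
#else
  /* General case: follow max_align_t. */
  #define SKETCH_POOL_ALIGN alignof(max_align_t)
#endif

static_assert((SKETCH_POOL_ALIGN & (SKETCH_POOL_ALIGN - 1)) == 0,
              "the chosen alignment must be a power of two");
```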

MisterDA · 2024-07-08T07:29:57Z

@xavierleroy May I ask for another review? I hope to have addressed your concerns by switching to aligned_alloc.

xavierleroy · 2024-07-16T08:06:32Z

I had a quick look at the current code, so maybe I missed something.

I think it's a good idea to use _aligned_malloc for Win32, because it addresses both a Win32-specific problem (malloc result possibly insufficiently aligned) and a MSVC-specific problem (max_align_t not defined).

For all other platforms, I'd rather keep your original code, the one that uses plain malloc and max_align_t, under the assumption that all other platforms do the right thing, namely:

- max_align_t is the biggest alignment enforced by the processor, and
- malloc returns max_align_t-aligned blocks.

Relatedly, your original code avoids the need for storing the size in the block header, something I'm not a fan of because it increases memory usage.

In our public headers, we're using either:

- C23 where `alignas` is a keyword;
- C++11 or later where `alignas` is also available;
- C11/C17 where `_Alignas` is available.


For C++, MSVC defines `using max_align_t = double`. LLVM's clang-cl copies this. It's unlikely that we need to carry a fallback implementation for other compilers. If so, the following could be used:

    typedef struct {
      alignas(long long) long long ll;
      alignas(long double) long double ld;
    } max_align_t;

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/ginclude/stddef.h
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__stddef_max_align_t.h
https://en.cppreference.com/w/c/types/max_align_t

This ensures that the data field is always aligned to the best boundary.

MisterDA · 2024-07-17T08:57:06Z

Thanks for the review. I've kept aligned malloc/realloc/free on Windows, and regular malloc/realloc/free on other platforms, which removes the need to keep track of the allocation size.
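The platform split can be pictured with the sketch below. The wrapper names and the choice of 16 as the Windows alignment are assumptions of this example, not the merged code; the Windows branch relies on the CRT's `_aligned_malloc`, `_aligned_realloc` and `_aligned_free` mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h> /* _aligned_malloc, _aligned_realloc, _aligned_free */
#endif

#define SKETCH_ALIGN 16 /* assumed alignment, used on Windows only */

static_assert((SKETCH_ALIGN & (SKETCH_ALIGN - 1)) == 0,
              "_aligned_malloc expects a power of two");

void *sketch_raw_alloc(size_t sz) {
#ifdef _WIN32
  return _aligned_malloc(sz, SKETCH_ALIGN);
#else
  return malloc(sz); /* trusted to return suitably aligned blocks */
#endif
}

void *sketch_raw_resize(void *p, size_t sz) {
#ifdef _WIN32
  return _aligned_realloc(p, sz, SKETCH_ALIGN); /* keeps the alignment */
#else
  return realloc(p, sz);
#endif
}

void sketch_raw_free(void *p) {
#ifdef _WIN32
  _aligned_free(p); /* blocks from _aligned_malloc need _aligned_free */
#else
  free(p);
#endif
}
```

Both branches have a native resize, so neither needs the allocation size stored in the block header.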

ghost approved these changes May 2, 2024


MisterDA force-pushed the alignas-simplify-CAMLalign branch from 0d98742 to a49b31b Compare May 14, 2024 08:20

gasche assigned xavierleroy May 15, 2024

MisterDA force-pushed the alignas-simplify-CAMLalign branch from a49b31b to d226d67 Compare May 21, 2024 06:48

xavierleroy approved these changes May 31, 2024


MisterDA force-pushed the alignas-simplify-CAMLalign branch 2 times, most recently from 3b09e44 to bc1c885 Compare June 3, 2024 14:04

xavierleroy removed their assignment Jun 4, 2024

MisterDA force-pushed the alignas-simplify-CAMLalign branch from bc1c885 to fd228f1 Compare June 7, 2024 09:05

MisterDA force-pushed the alignas-simplify-CAMLalign branch from fd228f1 to 5403677 Compare June 20, 2024 09:04

MisterDA force-pushed the alignas-simplify-CAMLalign branch from 5403677 to d5cdc51 Compare July 4, 2024 12:39

MisterDA changed the title ~~Simplify CAMLalign and use C11 max_align_t~~ Simplify CAMLalign and improve alignment of caml_stat_block with C11 max_align_t and aligned_alloc Jul 10, 2024

MisterDA added 5 commits July 17, 2024 10:35

Simplify CAMLalign

ed2aef4


Remove magic debug number from struct pool_block

84a3773

Simplify struct pool_block access and allocation

0e25650

MisterDA force-pushed the alignas-simplify-CAMLalign branch from d5cdc51 to 7a55003 Compare July 17, 2024 08:53

Use aligned alloc for struct pool_block on Windows

484e863


MisterDA force-pushed the alignas-simplify-CAMLalign branch from 7a55003 to 484e863 Compare July 17, 2024 08:56

shindere merged commit 714d09b into ocaml:trunk Jul 19, 2024
15 checks passed

MisterDA deleted the alignas-simplify-CAMLalign branch July 19, 2024 09:05
