
Add AddressSpaceManager #214

Merged
merged 8 commits into from
Jun 22, 2020
Conversation

mjp41
Member

@mjp41 mjp41 commented Jun 19, 2020

This change brings in a new approach to managing address space.
It wraps the Pal with a power-of-two reservation system that
guarantees all returned blocks are naturally aligned to their size. It
either lets the Pal perform aligned requests, or over-allocates and
splits into power-of-two blocks.
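As an illustration of the over-allocate-and-trim half of that approach, here is a minimal hypothetical sketch (this is not snmalloc's code; `reserve_aligned_sketch` is an invented name and `malloc` stands in for the Pal's page-aligned reservation primitive):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: obtain a block whose address is naturally aligned to
// its power-of-two size, given only a page-aligned reservation primitive.
// We over-allocate by 2x the size so an aligned block must lie inside the
// region, then round up to the alignment boundary.
void* reserve_aligned_sketch(size_t size)
{
  // size must be a power of two.
  assert((size & (size - 1)) == 0);
  void* base = malloc(size * 2); // stand-in for the Pal's reserve call
  uintptr_t p = reinterpret_cast<uintptr_t>(base);
  uintptr_t aligned = (p + size - 1) & ~(size - 1);
  // The slack before `aligned` and after `aligned + size` would be kept as
  // smaller power-of-two blocks for later requests, not returned to the OS.
  return reinterpret_cast<void*>(aligned);
}
```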

@mjp41 mjp41 force-pushed the poweroftwo branch 2 times, most recently from 0ffe3e1 to 5a1f6ac on June 19, 2020 10:50
Collaborator

@davidchisnall davidchisnall left a comment


Looks like a really nice cleanup!

```
Only one of these needs to be implemented, depending on whether the underlying
system can provide strongly aligned memory regions.
If the system guarantees only page alignment, implement the second and snmalloc
will over-allocate and then trim the requested region.
If the system guarantees only page alignment, implement the second. The Pal is
```
Collaborator


Please can you document that the second one does not commit and why (I presume because the caller will always use a subset of the allocated space and so it makes no sense to atomically commit it and then decommit a chunk)?

```
namespace snmalloc
{
  template<typename Pal>
  class AddressSpaceManager : Pal
```
Collaborator


Please can we have a doc comment explaining what this class is for?

It looks as if we have one, but inside the class? I've only ever seen doc comments there in the Windows NT codebase, I don't think any open source tooling can handle them in that location.

```
// There are a maximum of two blocks for any size/align in a range.
// One before the point of maximum alignment, and one after.
// As we can add multiple ranges the second entry's block may contain
// a pointer to subsequent blocks.
```
Collaborator


Please make this and the comment below into doc comments and expand them. It isn't completely clear to me what this is. I think it is an array of doubly linked lists, indexed by the log2(size) of the allocation (though it seems to be larger than it needs to be: if bits::BITS is 64, we have at most 51 used entries in this with 4KiB pages, even on systems that use the full 4KiB range).

Why are you using a size-two array instead of a pair?
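For readers following the thread, the structure being discussed can be sketched roughly as follows. This is a hypothetical toy model, not the PR's code: it assumes the array is indexed by log2 of the block size, `add_block_sketch` is an invented helper, and the chaining of further blocks through the second slot is deliberately not modelled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: an array indexed by log2(block size); each entry
// holds at most two free blocks of that size/alignment. In the real code,
// when both slots are full, the second slot's block itself stores a pointer
// to a chain of further blocks.
constexpr size_t BITS = 64; // stand-in for bits::BITS
std::array<std::array<void*, 2>, BITS> ranges{};

// Insert a block into the entry for its alignment, filling the first empty
// slot. Returns false when both slots are taken (where the real code would
// chain through the block's own storage instead).
bool add_block_sketch(size_t align_bits, void* block)
{
  for (auto& slot : ranges[align_bits])
  {
    if (slot == nullptr)
    {
      slot = block;
      return true;
    }
  }
  return false;
}
```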

```
}

// Look for larger block and split up recursively
void* bigger = remove_block(align_bits + 1);
```
Collaborator


It looks as if there's no combine operation, what is the fragmentation implication here? Should we always round up requested allocation sizes to a multiple of the superslab size?

Member Author


snmalloc never returns address space to the OS, so we never have the chance to consolidate.

You should ask for the size you need; it will do its best based on the alignment.
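The recursive split being discussed can be sketched as follows. This is a hypothetical toy model, not the PR's implementation: `remove_block_sketch` and the single-slot `blocks` free list are invented for illustration, and locking and the two-slot ranges structure are omitted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: if no free block of the requested power-of-two size
// exists, take a block of twice the size, keep its upper half on the free
// structure, and return the naturally aligned lower half.
constexpr size_t BITS = 64;
void* blocks[BITS] = {}; // toy free list: one block per size class

void* remove_block_sketch(size_t align_bits)
{
  if (blocks[align_bits] != nullptr)
  {
    void* b = blocks[align_bits];
    blocks[align_bits] = nullptr;
    return b;
  }
  // Look for a larger block and split it up recursively.
  if (align_bits + 1 >= BITS)
    return nullptr;
  void* bigger = remove_block_sketch(align_bits + 1);
  if (bigger == nullptr)
    return nullptr;
  // Keep the upper half for later requests of this size.
  blocks[align_bits] =
    reinterpret_cast<char*>(bigger) + (size_t(1) << align_bits);
  // The lower half inherits the larger block's alignment.
  return bigger;
}
```

Note that, as the thread observes, there is no inverse combine operation: splits are permanent, which is why fragmentation only matters up to the sizes actually requested.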

```
// considerably, and should never be on the fast path.
std::atomic_flag spin_lock = ATOMIC_FLAG_INIT;

inline void check_block(void* base, size_t align_bits)
```
Collaborator


Please add doc comments for these methods.

Member Author


I disagree with writing doc comments for most of these methods as the name is as descriptive as the comment I would write.

Collaborator


I had to read the implementations of these to understand why they are used.

Member Author

@mjp41 mjp41 Jun 19, 2020


Code doesn't normally document "why they are used" but "what it does". Personally, I find the name as descriptive as the comment. If it has to document its use site, then in this instance it is okay, as most are used in a single place, but normally not. I could inline them, but the code becomes much less readable.

```
{
  constexpr size_t min_size =
    bits::is64() ? bits::one_at_bit(32) : bits::one_at_bit(28);
```
Collaborator


What are these magic numbers doing? Ensuring that we allocate at least 256MiB chunks on 32-bit and 4GiB chunks on 64-bit platforms? Why?

Member Author


These control how much we over-allocate, and the platform decides on this: on OE the Pal can return the whole heap; on 32-bit we over-allocate, but not too much; on 64-bit we can go larger. The numbers might need refinement.
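The policy described here can be sketched as a simple clamp. This is a hypothetical illustration of the sizing rule, not the PR's code: `reservation_size` is an invented name, and `is64`/`one_at_bit` are local stand-ins for the `bits::` helpers in the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the over-allocation policy: round reservations up
// to a platform-dependent minimum so later requests can be carved out of the
// same large block. Constants mirror the ones in the diff: 4GiB on 64-bit,
// 256MiB on 32-bit.
constexpr bool is64 = sizeof(void*) == 8;
constexpr size_t one_at_bit(size_t b) { return size_t(1) << b; }
constexpr size_t min_size = is64 ? one_at_bit(32) : one_at_bit(28);

size_t reservation_size(size_t request)
{
  return request < min_size ? min_size : request;
}
```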

Collaborator


I see. There's a non-trivial overhead from the kernel having to maintain virtual memory metadata for these large chunks, so a comment here explaining the tradeoff is probably a good idea.

Member Author


If you think other numbers would work better, I'm happy to change them.

```
@@ -24,87 +24,28 @@ namespace snmalloc

  // There are a maximum of two blocks for any size/align in a range.
  // One before the point of maximum alignment, and one after.
  static inline std::array<std::array<void*, 2>, bits::BITS> ranges;
  static inline void* heap_base = nullptr;
```
Collaborator


Missing doc comment and the comment above doesn't seem to reflect the code.

src/pal/pal_open_enclave.h (thread resolved)
src/pal/pal_posix.h (outdated, thread resolved)
src/pal/pal_posix.h (thread resolved)
@mjp41 mjp41 requested a review from davidchisnall June 19, 2020 14:20
@mjp41 mjp41 marked this pull request as ready for review June 19, 2020 14:20
@mjp41
Member Author

mjp41 commented Jun 19, 2020

@achamayou, @anakrish this should bring the OE memory requirements down further. The minimum should now be 256KiB * (N + 1) for N threads. Obviously, if you use a lot of memory, then you will need more, but these are the minimum requirements to set up the global and per-thread data structures. This improves a little on #212, which gives 256KiB * (N + 2).

@mjp41
Member Author

mjp41 commented Jun 19, 2020

Addresses #213

@mjp41
Member Author

mjp41 commented Jun 19, 2020

So I ran the mimalloc benchmark suite to see about perf: -Old is current master and -New is this PR; -oe is the Open Enclave superslab size, -1MiB the 1MiB superslab size, and -16MiB the 16MiB superslab size.

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims

cfrac mi           09.30 3440 9.26 0.00 2 379
cfrac sn-1MiB-Old  09.36 12724 9.33 0.00 2 439
cfrac sn-16MiB-Old 09.33 20128 9.31 0.00 3 224
cfrac sn-1MiB-New  09.24 8652 9.23 0.00 0 439
cfrac sn-16MiB-New 09.27 9972 9.27 0.00 0 227

espresso mi           07.26 5920 7.18 0.04 2 609
espresso sn-1MiB-Old  07.17 11856 7.13 0.04 0 222
espresso sn-16MiB-Old 07.10 21980 7.02 0.08 0 225
espresso sn-1MiB-New  07.08 10508 7.02 0.05 0 403
espresso sn-16MiB-New 07.05 12020 6.96 0.09 0 227

barnes mi           04.42 66476 4.38 0.02 2 220
barnes sn-1MiB-Old  04.41 102284 4.37 0.04 0 282
barnes sn-16MiB-Old 04.35 98448 4.32 0.02 0 279
barnes sn-1MiB-New  04.40 71604 4.38 0.01 0 272
barnes sn-16MiB-New 04.32 77992 4.31 0.01 0 276

leanN mi           35.55 578036 123.18 1.67 73 12243
leanN sn-1MiB-Old  34.53 562624 120.71 1.38 0 15012
leanN sn-16MiB-Old 35.31 597084 123.09 1.37 0 1195
leanN sn-1MiB-New  34.42 529072 121.08 1.33 0 1658
leanN sn-16MiB-New 35.19 588352 122.86 1.23 0 1348

redis mi           6.433 38972 2.70 0.50 9 837
redis sn-1MiB-Old  5.906 40964 2.46 0.49 0 1090
redis sn-16MiB-Old 5.928 50132 2.37 0.58 0 303
redis sn-1MiB-New  5.939 40804 2.43 0.53 0 542
redis sn-16MiB-New 6.028 46080 2.52 0.49 0 303

alloc-test1 mi           05.61 17896 5.60 0.00 1 1284
alloc-test1 sn-1MiB-Old  05.84 24528 5.83 0.00 0 539
alloc-test1 sn-16MiB-Old 05.75 29568 5.74 0.00 0 264
alloc-test1 sn-1MiB-New  05.89 16040 5.87 0.01 0 285
alloc-test1 sn-16MiB-New 05.72 22136 5.71 0.00 0 272

alloc-testN mi           05.34 54900 62.86 0.08 0 327
alloc-testN sn-1MiB-Old  05.57 33536 63.52 0.07 0 617
alloc-testN sn-16MiB-Old 05.37 70748 63.03 0.08 0 310
alloc-testN sn-1MiB-New  05.36 33772 63.21 0.03 0 345
alloc-testN sn-16MiB-New 05.54 59036 63.25 0.09 0 342

larsonN mi           6.453 117144 59.19 0.27 0 18338
larsonN sn-1MiB-Old  6.270 132312 59.56 0.21 0 5336
larsonN sn-16MiB-Old 6.323 171076 59.57 0.26 0 3530
larsonN sn-1MiB-New  6.342 124856 59.46 0.29 0 3789
larsonN sn-16MiB-New 5.735 162320 58.52 0.21 0 3828

sh6benchN mi           00.20 222692 1.74 0.09 0 12429
sh6benchN sn-1MiB-Old  00.18 286428 1.79 0.08 0 5242
sh6benchN sn-16MiB-Old 00.17 303048 1.73 0.08 0 380
sh6benchN sn-1MiB-New  00.16 266076 1.79 0.08 0 477
sh6benchN sn-16MiB-New 00.16 280092 1.75 0.06 0 394

sh8benchN mi           00.68 237300 5.32 0.09 0 1290
sh8benchN sn-1MiB-Old  00.56 228392 5.27 0.07 0 1591
sh8benchN sn-16MiB-Old 00.57 246828 5.20 0.09 0 355
sh8benchN sn-1MiB-New  00.55 192948 5.28 0.09 0 694
sh8benchN sn-16MiB-New 00.57 226288 5.26 0.07 0 387

xmalloc-testN mi           0.590 142300 55.04 1.33 0 5693
xmalloc-testN sn-1MiB-Old  0.517 154072 40.77 6.41 0 1319
xmalloc-testN sn-16MiB-Old 0.489 316952 42.40 6.72 0 463
xmalloc-testN sn-1MiB-New  0.512 132628 41.88 6.19 0 504
xmalloc-testN sn-16MiB-New 0.498 224844 41.55 6.65 0 453

cache-scratch1 mi           02.50 5988 2.50 0.00 0 232
cache-scratch1 sn-1MiB-Old  02.50 12092 2.49 0.00 0 230
cache-scratch1 sn-16MiB-Old 02.50 20760 2.50 0.00 0 230
cache-scratch1 sn-1MiB-New  02.51 8200 2.50 0.00 0 234
cache-scratch1 sn-16MiB-New 02.50 12104 2.50 0.00 0 235

cache-scratchN mi           00.22 6148 2.55 0.00 0 298
cache-scratchN sn-1MiB-Old  00.23 26208 2.48 0.01 0 268
cache-scratchN sn-16MiB-Old 00.23 54792 2.49 0.02 0 264
cache-scratchN sn-1MiB-New  00.21 22008 2.47 0.00 0 275
cache-scratchN sn-16MiB-New 00.22 36408 2.49 0.01 0 270

mstressN mi           03.33 1570420 5.64 0.58 0 7282
mstressN sn-1MiB-Old  03.32 1902200 5.62 0.59 0 3317
mstressN sn-16MiB-Old 03.23 2056812 5.64 0.61 0 1387
mstressN sn-1MiB-New  03.18 1674780 5.50 0.62 0 2192
mstressN sn-16MiB-New 03.26 2030676 5.64 0.64 0 1485

rptestN mi           6.520 690848 13.22 1.66 1 16475
rptestN sn-1MiB-Old  5.266 727324 12.26 0.71 0 1552
rptestN sn-16MiB-Old 5.316 822288 12.89 0.75 0 693
rptestN sn-1MiB-New  5.082 587432 12.27 0.85 0 1173
rptestN sn-16MiB-New 5.378 820680 12.96 0.79 0 762

The drop in RSS for this change is pretty good. I think this is interacting much better with transparent huge pages: we have a less fragmented address space, so the kernel can consolidate pages better.

Overall, I would say performance is not really changed, possibly slightly improved, but would need a lot more runs to confirm.

Here is a second run, where I added the Open Enclave slab sizes into the mix.

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims

cfrac mi           09.32 3600 9.31 0.00 0 380
cfrac sn-1MiB-Old  09.20 12724 9.19 0.00 0 438
cfrac sn-oe-New    09.27 6568 9.26 0.00 0 428
cfrac sn-1MiB-New  09.19 8724 9.18 0.00 0 442
cfrac sn-16MiB-New 09.20 9996 9.19 0.00 0 226

espresso mi           07.18 6076 7.11 0.06 0 613
espresso sn-1MiB-Old  07.03 12396 6.95 0.08 0 397
espresso sn-oe-New    07.10 8632 7.03 0.06 0 467
espresso sn-1MiB-New  07.06 10444 7.01 0.05 0 402
espresso sn-16MiB-New 07.09 11756 7.04 0.04 0 222

barnes mi           04.28 66404 4.26 0.02 0 219
barnes sn-1MiB-Old  04.30 98240 4.26 0.04 0 276
barnes sn-oe-New    04.30 69580 4.27 0.02 0 272
barnes sn-1MiB-New  04.29 71700 4.28 0.01 0 275
barnes sn-16MiB-New 04.29 77768 4.27 0.02 0 268

leanN mi           34.63 575208 122.10 1.47 0 12855
leanN sn-1MiB-Old  34.18 552928 120.09 1.39 0 15361
leanN sn-oe-New    33.88 494868 119.89 1.21 0 1898
leanN sn-1MiB-New  34.30 546048 119.39 1.46 0 1774
leanN sn-16MiB-New 35.76 567808 124.91 1.45 0 1307

redis mi           6.581 39032 2.61 0.68 0 841
redis sn-1MiB-Old  5.966 40960 2.53 0.44 0 1095
redis sn-oe-New    6.070 36652 2.54 0.49 0 524
redis sn-1MiB-New  6.056 40784 2.55 0.46 0 539
redis sn-16MiB-New 5.934 46048 2.50 0.46 0 304

alloc-test1 mi           05.78 17948 5.77 0.00 0 1286
alloc-test1 sn-1MiB-Old  05.94 24468 5.92 0.01 0 535
alloc-test1 sn-oe-New    06.04 15116 6.03 0.00 0 593
alloc-test1 sn-1MiB-New  05.81 16040 5.79 0.01 0 289
alloc-test1 sn-16MiB-New 05.69 23512 5.69 0.00 0 266

alloc-testN mi           05.49 54796 63.29 0.05 0 325
alloc-testN sn-1MiB-Old  05.40 42424 63.65 0.05 0 463
alloc-testN sn-oe-New    05.40 21176 64.07 0.06 0 613
alloc-testN sn-1MiB-New  05.59 32496 64.77 0.07 0 353
alloc-testN sn-16MiB-New 05.49 58948 64.88 0.07 0 339

larsonN mi           6.607 115488 59.26 0.33 0 18414
larsonN sn-1MiB-Old  6.366 150108 58.50 0.28 0 5022
larsonN sn-oe-New    7.135 123188 59.49 0.27 0 3645
larsonN sn-1MiB-New  6.317 141968 58.52 0.25 0 3806
larsonN sn-16MiB-New 5.760 159512 59.34 0.27 0 3855

sh6benchN mi           00.17 222656 1.72 0.06 0 12430
sh6benchN sn-1MiB-Old  00.17 287096 1.77 0.08 0 4768
sh6benchN sn-oe-New    00.17 254924 1.79 0.16 0 936
sh6benchN sn-1MiB-New  00.16 265700 1.78 0.07 0 485
sh6benchN sn-16MiB-New 00.17 281780 1.73 0.08 0 390

sh8benchN mi           00.63 233332 5.27 0.08 0 1288
sh8benchN sn-1MiB-Old  00.54 231784 5.28 0.08 0 1935
sh8benchN sn-oe-New    00.56 172572 5.51 0.15 0 891
sh8benchN sn-1MiB-New  00.53 195040 5.28 0.07 0 701
sh8benchN sn-16MiB-New 00.56 224780 5.17 0.07 0 390

xmalloc-testN mi           0.596 147556 53.98 1.48 0 6002
xmalloc-testN sn-1MiB-Old  0.483 249292 42.31 6.74 0 1628
xmalloc-testN sn-oe-New    0.480 84940 42.95 6.69 0 910
xmalloc-testN sn-1MiB-New  0.479 132560 42.59 6.90 0 487
xmalloc-testN sn-16MiB-New 0.482 231012 42.61 6.64 0 476

cache-scratch1 mi           02.50 6032 2.50 0.00 0 231
cache-scratch1 sn-1MiB-Old  02.47 12356 2.46 0.00 0 229
cache-scratch1 sn-oe-New    02.48 5868 2.47 0.00 0 233
cache-scratch1 sn-1MiB-New  02.52 8028 2.52 0.00 0 237
cache-scratch1 sn-16MiB-New 02.49 12052 2.49 0.00 0 232

cache-scratchN mi           00.22 6136 2.52 0.00 0 295
cache-scratchN sn-1MiB-Old  00.22 30328 2.47 0.01 0 263
cache-scratchN sn-oe-New    00.21 11700 2.45 0.01 0 277
cache-scratchN sn-1MiB-New  00.23 21960 2.50 0.00 0 272
cache-scratchN sn-16MiB-New 00.22 36528 2.49 0.01 0 263

mstressN mi           03.23 1570560 5.50 0.56 0 6281
mstressN sn-1MiB-Old  03.26 1903220 5.50 0.56 0 3556
mstressN sn-oe-New    03.24 1688088 5.49 0.82 0 3135
mstressN sn-1MiB-New  03.14 1652280 5.51 0.54 0 2172
mstressN sn-16MiB-New 03.13 1920300 5.51 0.59 0 1437

rptestN mi           6.557 697372 13.52 1.67 0 16580
rptestN sn-1MiB-Old  5.379 680712 12.25 0.75 0 6936
rptestN sn-oe-New    5.289 555860 12.23 1.47 0 2754
rptestN sn-1MiB-New  5.060 593612 12.34 0.90 0 1180
rptestN sn-16MiB-New 5.250 831004 12.88 0.82 0 780

The new configuration is generally performing well, except for Larson.

The two tables are both just single runs, so pretty noisy.

@mjp41
Member Author

mjp41 commented Jun 22, 2020

I disabled transparent huge pages and there is no change above noise:


# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims

cfrac mi          09.06 3548 9.02 0.00 2 379
cfrac sn-1MiB-Old 09.00 4588 8.97 0.01 3 444
cfrac sn-1MiB-New 09.05 4708 9.01 0.00 2 442

espresso mi          07.21 4332 7.12 0.04 2 705
espresso sn-1MiB-Old 07.00 4740 6.92 0.07 0 508
espresso sn-1MiB-New 06.88 4908 6.84 0.04 0 516

barnes mi          04.45 58904 4.40 0.03 2 15621
barnes sn-1MiB-Old 04.42 60056 4.38 0.04 0 15689
barnes sn-1MiB-New 04.41 60048 4.37 0.03 0 15677

leanN mi          34.33 488328 119.38 1.44 72 119745
leanN sn-1MiB-Old 34.25 490920 120.12 1.43 0 120573
leanN sn-1MiB-New 34.73 498556 121.58 1.63 0 122882

redis mi          6.317 29692 2.64 0.52 9 6698
redis sn-1MiB-Old 6.001 31216 2.49 0.51 0 6842
redis sn-1MiB-New 5.782 31312 2.39 0.50 0 6848

alloc-test1 mi          05.52 12780 5.50 0.00 1 2565
alloc-test1 sn-1MiB-Old 05.71 12832 5.71 0.00 0 2555
alloc-test1 sn-1MiB-New 05.74 12644 5.74 0.00 0 2553

alloc-testN mi          05.40 14204 62.21 0.07 0 2948
alloc-testN sn-1MiB-Old 05.27 15208 62.56 0.02 0 3185
alloc-testN sn-1MiB-New 05.41 15180 62.46 0.04 0 3180

larsonN mi          6.287 107820 59.36 0.27 0 27881
larsonN sn-1MiB-Old 5.878 128988 58.41 0.28 0 31602
larsonN sn-1MiB-New 5.864 131100 58.23 0.34 0 32101

sh6benchN mi          00.17 216536 1.71 0.11 0 53833
sh6benchN sn-1MiB-Old 00.17 251076 1.74 0.14 0 62127
sh6benchN sn-1MiB-New 00.17 251040 1.74 0.11 0 62127

sh8benchN mi          00.67 126472 5.53 0.08 0 31335
sh8benchN sn-1MiB-Old 00.54 182560 5.14 0.13 0 45026
sh8benchN sn-1MiB-New 00.58 183960 5.24 0.12 0 45435

xmalloc-testN mi          0.582 147328 55.01 1.07 0 36528
xmalloc-testN sn-1MiB-Old 0.460 96780 44.10 6.37 0 23624
xmalloc-testN sn-1MiB-New 0.497 95440 42.41 6.60 0 23314

cache-scratch1 mi          02.40 3844 2.40 0.00 0 235
cache-scratch1 sn-1MiB-Old 02.45 3952 2.45 0.00 0 238
cache-scratch1 sn-1MiB-New 02.42 3944 2.42 0.00 0 239

cache-scratchN mi          00.22 4088 2.47 0.00 0 301
cache-scratchN sn-1MiB-Old 00.22 4112 2.47 0.00 0 296
cache-scratchN sn-1MiB-New 00.22 4012 2.46 0.00 0 292

mstressN mi          03.62 1518980 5.79 0.81 0 379672
mstressN sn-1MiB-Old 03.59 1558820 5.78 0.83 0 389436
mstressN sn-1MiB-New 03.59 1569840 5.80 0.84 0 393388

rptestN mi          6.164 708528 13.95 1.65 1 176769
rptestN sn-1MiB-Old 5.086 583608 12.65 0.95 0 145320
rptestN sn-1MiB-New 5.073 602924 12.60 1.02 0 150666

@davidchisnall
Collaborator

Nice! It looks as if the 1MiB slab size has a noticeable impact on RSS but a small (and sometimes positive) impact on performance. I wonder if we should consider adjusting the default. The smallest superpage size on x86 is 2MiB and 1MiB on ARM, perhaps we should consider adding a superpage_size constant to the AAL and default to that if the user doesn't override it? We may want to make it larger on CHERI to trade off the larger pagemap against the larger slab size.

@mjp41
Member Author

mjp41 commented Jun 22, 2020

@davidchisnall I agree we should look at the defaults. I think assessing that needs some time invested in statistically significant benchmarking of parameter sweeps, which I don't have time to set up currently. I.e. it is not part of the PR ;-)

@mjp41 mjp41 merged commit e16f2af into microsoft:master Jun 22, 2020
@mjp41 mjp41 deleted the poweroftwo branch June 22, 2020 11:36