Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: difficulty handling system page aligned stacks #41008

Closed
4a6f656c opened this issue Aug 24, 2020 · 17 comments
Closed

runtime: difficulty handling system page aligned stacks #41008

4a6f656c opened this issue Aug 24, 2020 · 17 comments

Comments

@4a6f656c
Copy link
Contributor

@4a6f656c 4a6f656c commented Aug 24, 2020

The current runtime stack allocation is based on mheap, which uses its own fixed page size (4096 bytes). On some systems there is a requirement to have stacks be system page aligned (for example, 16KB alignment for OpenBSD/octeon). In this case, if mheap provides memory that is not 16KB aligned various things fail.

There are a couple of options to address this:

  1. Make it possible to system allocate the stacks (i.e. malloc) - this is the simplest option.

  2. Over allocate via allocManual() and round to system page alignment during stack allocation - this should not require changes to mheap, however would require changes to type stack, which would need to track the actual allocation (address and size), in addition to the lo/hi pointers (this is needed in order for stackfree() to correctly return the allocation). Two additional pointers in this struct are probably acceptable, however various assembly knows about its size and offset.

  3. Change mheap to provide system page size aligned allocations - this would be doable, but seems to require a fair amount of effort, however avoids the need to change type stack (and various assembly).

For the time being I plan on using (1) for the openbsd/mips64 port, however we may want to consider alternatives at a later date.

@gopherbot
Copy link

@gopherbot gopherbot commented Aug 24, 2020

Change https://golang.org/cl/250183 mentions this issue: runtime: use stacks from system for openbsd/mips64

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Aug 26, 2020

On some systems there is a requirement to have stacks be system page aligned

I'm not sure I understand this requirement. How does the system ensure it? What does the system check?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 26, 2020

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Aug 26, 2020

How big is the physical page size on the system? If we set in the runtime the physical page size as at least 16KB, then all mmap should be 16K aligned, including MAP_STACK. I assume the actual goroutine stack does not necessarily start at the beginning of the mmap, and it would probably be fine.

@cagedmantis cagedmantis added this to the Backlog milestone Aug 26, 2020
@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Oct 18, 2020

How big is the physical page size on the system? If we set in the runtime the physical page size as at least 16KB, then all mmap should be 16K aligned, including MAP_STACK.

@cherrymui - OpenBSD on Octeon runs with 16KB physical pages. Changing the runtime page size to 16KB would be one solution (however that seemed like a larger hammer than might be acceptable).

I assume the actual goroutine stack does not necessarily start at the beginning of the mmap, and it would probably be fine.

I believe this is correct. The failures are when, for example, setting an alternate signal stack and passing a stack that is not physical page aligned.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Oct 21, 2020

The runtime should query the physical page size at startup and only do physical-page-aligned mmaps, regardless of runtime's internal page size. If that is not the case, we probably should fix that.

Maybe we just need to round the address when calling sigaltstack?

@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Oct 24, 2020

The runtime should query the physical page size at startup and only do physical-page-aligned mmaps, regardless of runtime's internal page size. If that is not the case, we probably should fix that.

Two mmaps are required on OpenBSD - the first is done for allocation, the second is to mark an allocation as being stack (MAP_STACK). The problem is that due to the way stack allocation is currently performed, we end up in the following situation:

  1. Go allocates memory for mheap via mmap.
  2. A stack allocation is requested from mheap - this allocation is part of the earlier mmap, but rounded to the mheap fixed page size (4096 bytes) (i.e. we're no longer physical page aligned).
  3. On OpenBSD, this stack allocation needs to be marked as MAP_STACK - Go calls mmap with the 4096 byte rounded allocation, which fails as the allocation is not physical page aligned.

I'm not sure how we would fix/avoid this situation without (2) or (3) from the original issue comment.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Oct 28, 2020

Can we round the mmap in (3) to physical page size? I guess it would not be a problem if we map more?

@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Oct 29, 2020

Can we round the mmap in (3) to physical page size? I guess it would not be a problem if we map more?

Hrmmm, we can likely round a stack allocation down to get physical page alignment and then round up to get a multiple of physical page size. The problem would then potentially be two back to back stack allocations, that would then have overlapping mmaps - I'll have to see what the kernel says if we do that (and particularly if we remove one of them again).

@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Oct 29, 2020

So I do not think this is going to work - if we do an mmap (96 physical pages) we get:

0000002758FA0000   1536K read/write          [ anon ]

We then mmap again one physical page into the previous allocation and 32 physical pages in size:

0000002758FA0000     16K read/write          [ anon ]
0000002758FA4000    512K read/write          [ anon ]  <- stack mapped a
0000002759024000   1008K read/write          [ anon ]

We then mmap again, 32 physical pages into the original allocation and 32 physical pages in size (resulting in a single physical page overlap with the previous allocation, as would be required in the round down/up scenario):

0000002758FA0000     16K read/write          [ anon ]
0000002758FA4000    496K read/write          [ anon ] <- stack mapped a
0000002759020000    512K read/write          [ anon ] <- stack mapped b
00000027590A0000    512K read/write          [ anon ]

We then stop using the second stack and remove MAP_STACKMAP:

0000002758FA0000     16K read/write          [ anon ]
0000002758FA4000    496K read/write          [ anon ] <- stack mapped a
0000002759020000    512K read/write          [ anon ]
00000027590A0000    512K read/write          [ anon ]

If the top physical page of the first stack mapped allocation is then used and a system call is made (or a signal occurs), the process will now be terminated. So we'd be back to having to over allocate in order to guarantee there was no overlap in the case of rounding down/up. I'll take another look to see if this is possible in combination.

@aclements
Copy link
Member

@aclements aclements commented Oct 30, 2020

I think basing this off stackFromSystem is the wrong approach, I'm afraid. This approach is going to cause all goroutine stacks to start at 16 KiB on openbsd/mips64, rather than the usual 2 KiB; it's going to throw off stack memory metrics; and it's going to dramatically change scavenging behavior for stack memory. stackFromSystem is really meant for runtime debugging (I'm fine with adding the osStackAlloc on that path, since that does fix that debugging mode on openbsd, but as a separate change).

If I understand the underlying problem, we need to mark MAP_STACK regions on physical page boundaries (which happens to work right now only when the physical page size is <= the runtime page size).

I just discussed this with @cherrymui and @mknyszek and they came up with several ideas. I'll summarize here, but let them weigh in with more details:

  • If we're switching to libc calls instead of raw syscalls on OpenBSD (are we?), then do we actually need g stacks to be MAP_STACK, or is it enough to have the g0 stacks be MAP_STACK where we do the libc call, and to have the sigaltstacks mapped MAP_STACK? [@cherrymui]
  • Use a different register as the stack pointer and leave the OS-expected stack pointer register pointing at the g0 stack. [@cherrymui]
  • Map the Go heap MAP_STACK since Go already mitigates the sorts of C bugs this is meant to defend against.
  • Add a new spanAllocType, say spanAllocStackPhysPageAligned, and modify allocSpan to recognize this type, skip the page cache, allocate a larger range from the heap, and return any unaligned part back to the heap. [@mknyszek]
  • Add a way to free part of a span, then in stack allocation ask for a large span, and free the unaligned part.
@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Nov 1, 2020

If the top physical page of the first stack mapped allocation is then used and a system call is made (or a signal occurs), the process will now be terminated. So we'd be back to having to over allocate in order to guarantee there was no overlap in the case of rounding down/up. I'll take another look to see if this is possible in combination.

To close this out, from an operating system perspective this appears to work, however the over allocation required is pretty substantial and there are parts of Go's runtime that seem to dislike this (I've not tracked down exactly why this is yet).

@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Nov 1, 2020

I think basing this off stackFromSystem is the wrong approach, I'm afraid. This approach is going to cause all goroutine stacks to start at 16 KiB on openbsd/mips64, rather than the usual 2 KiB; it's going to throw off stack memory metrics; and it's going to dramatically change scavenging behavior for stack memory. stackFromSystem is really meant for runtime debugging (I'm fine with adding the osStackAlloc on that path, since that does fix that debugging mode on openbsd, but as a separate change).

@aclements thanks for the detailed response - I agree that stackFromSystem is not ideal, but it does allow things to work :)

If I understand the underlying problem, we need to mark MAP_STACK regions on physical page boundaries (which happens to work right now only when the physical page size is <= the runtime page size).

This is correct.

I just discussed this with @cherrymui and @mknyszek and they came up with several ideas. I'll summarize here, but let them weigh in with more details:

  • If we're switching to libc calls instead of raw syscalls on OpenBSD (are we?), then do we actually need g stacks to be MAP_STACK, or is it enough to have the g0 stacks be MAP_STACK where we do the libc call, and to have the sigaltstacks mapped MAP_STACK? [@cherrymui]

Yes, we need to switch to libc calls (see issue #36435), although this is currently blocked on #39257 (and https://go-review.googlesource.com/c/go/+/249978). This is likely the cleanest and easiest solution, providing that the g0 stacks are always used for libc calls (and therefore system calls). The g0 stack is presumably the OS allocated stack, which will already be MAP_STACK, at which point this entire issue should be moot.

  • Use a different register as the stack pointer and leave the OS-expected stack pointer register pointing at the g0 stack. [@cherrymui]

Possible, but seems like a lot of effort and a wasted register.

  • Map the Go heap MAP_STACK since Go already mitigates the sorts of C bugs this is meant to defend against.

This is a pretty big hammer that would allow a ROP stack to be created and run from any part of the heap - while Go mitigates against buffer overflows and makes it harder to gain control flow, it does not necessarily help when cgo or unsafe are in use.

  • Add a new spanAllocType, say spanAllocStackPhysPageAligned, and modify allocSpan to recognize this type, skip the page cache, allocate a larger range from the heap, and return any unaligned part back to the heap. [@mknyszek]

Hrmmm, that could work reasonably well. I'll have a poke at this.

  • Add a way to free part of a span, then in stack allocation ask for a large span, and free the unaligned part.

That likely works but seems like more effort the doing it in allocSpan.

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 1, 2020

Change https://golang.org/cl/266919 mentions this issue: runtime: allow physical page aligned stacks to be allocated

@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Nov 1, 2020

  • Add a new spanAllocType, say spanAllocStackPhysPageAligned, and modify allocSpan to recognize this type, skip the page cache, allocate a larger range from the heap, and return any unaligned part back to the heap. [@mknyszek]

Hrmmm, that could work reasonably well. I'll have a poke at this.

This seems to work reasonably well - https://go-review.googlesource.com/c/go/+/266919

@aclements
Copy link
Member

@aclements aclements commented Nov 2, 2020

If we're switching to libc calls instead of raw syscalls on OpenBSD (are we?), then do we actually need g stacks to be MAP_STACK, or is it enough to have the g0 stacks be MAP_STACK where we do the libc call, and to have the sigaltstacks mapped MAP_STACK? [@cherrymui]

@cherrymui looked into this a little and determined that this won't work. She points out that the kernel patch mentions "We can also perform this check on standard synchronous traps, for instance page faults." These happen all the time while running on the g stack, so the g stack has to be MAP_STACK.

@4a6f656c
Copy link
Contributor Author

@4a6f656c 4a6f656c commented Nov 2, 2020

If we're switching to libc calls instead of raw syscalls on OpenBSD (are we?), then do we actually need g stacks to be MAP_STACK, or is it enough to have the g0 stacks be MAP_STACK where we do the libc call, and to have the sigaltstacks mapped MAP_STACK? [@cherrymui]

@cherrymui looked into this a little and determined that this won't work. She points out that the kernel patch mentions "We can also perform this check on standard synchronous traps, for instance page faults." These happen all the time while running on the g stack, so the g stack has to be MAP_STACK.

Ah, correct - checks are performed on both syscalls and page faults, so a page fault occurring on a g stack would be problematic without it being MAP_STACK. Additionally, sigaltstack() marks the passed memory as MAP_STACK itself so the caller does not actually need to do this ahead of time.

@gopherbot gopherbot closed this in 5ca43ac Nov 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants