
gollvm: find a better way to deal with g #37295

Open · erifan opened this issue Feb 19, 2020 · 4 comments

erifan (Contributor) commented Feb 19, 2020

Currently, gollvm stores the current g in TLS. The runtime.getg() function returns the current g. This function is inlined for better performance, and the inlining is disabled by the GoSafeGetg pass in some situations. Cherry described the situation in that pass:

within a function,

    load g
    call mcall(...)
    load g

may be compiled to

    leaq    g@TLS, %rdi
    call    __tls_get_addr
    movq    %rax, %rbx     // cache in a callee-save register %rbx
    ... use g in %rax ...
    call    foo
    ... use g in %rbx ...

This is incorrect if a thread switch happens at the call of foo.

A practical example of this situation: gofrontend/chan.go#154
By disabling the inlining of the second and subsequent getg calls within a block, the GoSafeGetg pass fixed this issue on linux/amd64. But on linux/arm64, LLVM caches the TLS base address across the entire function, so the g obtained by the second and subsequent getg calls in the above situation may still be wrong.

As far as I know, this kind of optimization is common in LLVM and GCC, and it is correct and very beneficial for C/C++. Until C/C++ introduces a concept like goroutines, I don't expect this optimization to change. See the similar issues: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461, https://bugs.llvm.org/show_bug.cgi?id=19177.
Currently gollvm only supports linux/amd64 and linux/arm64, but as more platforms are supported in the future, this issue will likely appear on more of them, so I think we need to find a better way to store the current g.

At present, the methods I can think of are as follows:

  1. Keep the current practice: store g in TLS, and try to prevent caching of the TLS base address.
  2. Follow the practice of main Go: reserve a register to hold g.
  3. Store g in a suitable location on the stack.

@thanm @cherrymui @ianlancetaylor Any suggestions?
cherrymui (Contributor) commented Feb 19, 2020

Personally, I'd really like a way to tell the backends in GCC and LLVM not to cache TLS addresses. I don't think this is specific to Go: in C, the same problem can occur when getcontext/setcontext is used with TLS variables. But getcontext/setcontext are not very commonly used, so, as you said, this probably won't happen soon.

> Follow the practice of main go, reserve a register to store g.

I thought about that. The tricky part is that the libgo runtime relies on C for many things, including external C code, libbacktrace, libffi, libgcc, and the syscall wrappers from libc. If we reserve a register globally, all of that C code needs to be compiled in a special way, or we need wrappers that save/restore the reserved register at C library boundaries.

> Store g in a suitable location on the stack.

I'm not sure how that would work. It also seems to require special compilation of the C code.

Maybe a possibility is to keep the current approach and add a machine-IR pass, running after CSE, that inlines getg calls back in. That would be machine dependent, and I'm not sure it can be done for gccgo.

toothrot added this to the gollvm milestone Feb 19, 2020
erifan (Contributor, Author) commented Mar 3, 2020

Hi Cherry, I also thought about many possible solutions, including:

  1. Callee-saved register. This is essentially the current main Go approach. We would need to handle many special cases (TSan, MSan, VDSO, etc.), and as you said, in gollvm we may also need special compilation for a lot of C code. This method is too cumbersome.
  2. A free system register. No such register has been found so far.
  3. Somewhere on the stack. This method requires some very ugly restrictions on the goroutine's stack: for example, put g at the bottom of the stack, require the stack to be aligned to 2MB, and cap the stack size at 2MB. Then we can compute the position of g from the value of SP. I mention this only as an option; we know it is very ugly.
  4. TLS. This method needs to deal with the CSE optimization caching the TLS base address, but at present it is the most feasible approach. On linux/amd64 we wrote a pass to disable the CSE optimization, but maybe we can disable it in a simpler way: inline assembly.

Here is an example on Linux/arm64:

#include <stdio.h>

__thread long long g = 1;

void schedule_this_routine_to_a_different_thread() {
  printf("schedule_this_routine_to_a_different_thread\n");
}

// The following getg simulates the getg in gollvm
long long getg() {
  return g;
}

// getg_new is implemented with inline assembly
long long getg_new() {
  long long out;
  asm volatile(
      "mrs %[out], tpidr_el0\n\t"
      "add %[out], %[out], #:tprel_hi12:g, lsl #12\n\t"
      "add %[out], %[out], #:tprel_lo12_nc:g\n\t"
      "ldr %[out], [%[out]]"
      : [out] "=r"(out));
  return out;
}

long long test_getg() {
  long long v1, v2;
  v1 = getg();
  schedule_this_routine_to_a_different_thread();
  v2 = getg();
  return v1 + v2;
}

long long test_getg_new() {
  long long v1, v2;
  v1 = getg_new();
  schedule_this_routine_to_a_different_thread();
  v2 = getg_new();
  return v1 + v2;
}

int main() {
  int res = 0;
  res += test_getg();
  res += test_getg_new();
  return res;
}

Compile the above getg.c file with "clang -O2 -o getg getg.c".

The objdump result:

  00000000004005f0 <test_getg>:
  4005f0:       a9be4ff4        stp     x20, x19, [sp, #-32]!
  4005f4:       a9017bfd        stp     x29, x30, [sp, #16]
  4005f8:       d53bd048        mrs     x8, tpidr_el0
  4005fc:       91400108        add     x8, x8, #0x0, lsl #12
  400600:       91004113        add     x19, x8, #0x10
  400604:       f9400274        ldr     x20, [x19]
  400608:       90000000        adrp    x0, 400000 <g+0x400000>
  40060c:       911e8000        add     x0, x0, #0x7a0
  400610:       910043fd        add     x29, sp, #0x10
  400614:       97ffffa3        bl      4004a0 <puts@plt>
  400618:       f9400268        ldr     x8, [x19]
  40061c:       a9417bfd        ldp     x29, x30, [sp, #16]
  400620:       8b140100        add     x0, x8, x20
  400624:       a8c24ff4        ldp     x20, x19, [sp], #32
  400628:       d65f03c0        ret

  000000000040062c <test_getg_new>:
  40062c:       f81e0ff3        str     x19, [sp, #-32]!
  400630:       90000000        adrp    x0, 400000 <g+0x400000>
  400634:       911e8000        add     x0, x0, #0x7a0
  400638:       a9017bfd        stp     x29, x30, [sp, #16]
  40063c:       910043fd        add     x29, sp, #0x10
  400640:       d53bd053        mrs     x19, tpidr_el0
  400644:       91400273        add     x19, x19, #0x0, lsl #12
  400648:       91004273        add     x19, x19, #0x10
  40064c:       f9400273        ldr     x19, [x19]
  400650:       97ffff94        bl      4004a0 <puts@plt>
  400654:       d53bd048        mrs     x8, tpidr_el0
  400658:       91400108        add     x8, x8, #0x0, lsl #12
  40065c:       91004108        add     x8, x8, #0x10
  400660:       f9400108        ldr     x8, [x8]
  400664:       a9417bfd        ldp     x29, x30, [sp, #16]
  400668:       8b130100        add     x0, x8, x19
  40066c:       f84207f3        ldr     x19, [sp], #32
  400670:       d65f03c0        ret

As you can see, getg_new is inlined, and it also avoids the CSE optimization problem, so we would no longer need the GoSafeGetg pass. I'm not familiar with x86 assembly, so I didn't write an x86 example, but I think this approach applies to other architectures as well.
Of course, this method is not perfect: if there is no thread switch between multiple getg calls, some unnecessary instructions will be executed. But I think this performance loss is very small, and it is rare for a function to call getg many times.
What do you think of this method? If it looks OK to you, I am willing to send a patch to fix this problem.

erifan (Contributor, Author) commented Mar 6, 2020

Inline assembly may not apply to dynamic linking 😢

gopherbot commented Apr 17, 2020

Change https://golang.org/cl/228737 mentions this issue: gollvm: implement getg function with inline assembly for arm64

pingw33n pushed a commit to pingw33n/gollvm that referenced this issue Oct 6, 2020
The getg function is called frequently in the runtime, and its performance
is critical, so the call to getg is turned into a load operation to inline
it. To avoid the thread-pointer caching problem when a thread switch
happens between two calls of this function, only the first one is inlined,
through the GoSafeGetg pass. But this implementation is not feasible on
linux/arm64, because it is based on the assumption that the llvm backend's
CSE optimization of the thread pointer only happens within a block, while
on linux/arm64 this optimization occurs across the entire function.
Expanding the GoSafeGetg pass to the entire function range is not a good
solution, because a large number of getg function calls would still not
be inlined.
This CL simulates the C implementation of getg through inline assembly.
This not only ensures correctness, but also ensures that every getg
function call in Go files is inlined.
The disadvantage is that if no thread switch occurs between two getg
function calls, the second call could theoretically be optimized away,
but not in this implementation: the full instruction sequence of getg is
executed on every call.

This CL also adds a unit test case for this implementation.

Updates golang/go#37295

Change-Id: If9e47b2afeb420a1d0316f2a82602b18bed82477
Reviewed-on: https://go-review.googlesource.com/c/gollvm/+/228737
Reviewed-by: Than McIntosh <thanm@google.com>