runtime: preserve extra M across calls from C to Go #51676

Open
doujiang24 opened this issue Mar 15, 2022 · 14 comments
Labels
help wanted NeedsInvestigation

doujiang24 (Contributor) commented Mar 15, 2022

Every call from C to an exported Go function makes 5 sigprocmask calls and 3 sigaltstack calls.

Syscalls during needm:

rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[], NULL, 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
sigaltstack({ss_sp=0xc00003e000, ss_flags=0, ss_size=32768}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

Syscalls during dropm:

rt_sigprocmask(SIG_SETMASK, ~[], NULL, 8) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
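
To make the two traces concrete, here is a minimal C sketch that replays the same eight calls in the same order. It is not the runtime's code: the function name replay_signal_syscalls is hypothetical, it assumes Linux, and it uses sigprocmask, which is only appropriate in a single-threaded caller (a threaded program would use pthread_sigmask).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: replays the syscall pattern
 * from the needm and dropm traces (5 sigprocmask + 3 sigaltstack calls).
 * Returns 0 if every call succeeds, -1 otherwise. */
int replay_signal_syscalls(void *alt_stack, size_t alt_size) {
    sigset_t saved, all;
    stack_t old_ss, new_ss, off_ss;

    sigfillset(&all);

    /* needm: read the current mask, block all signals, query the current
     * alternate stack, install a new one, then restore the mask. */
    if (sigprocmask(SIG_SETMASK, NULL, &saved) != 0) return -1;
    if (sigprocmask(SIG_SETMASK, &all, NULL) != 0) return -1;
    if (sigaltstack(NULL, &old_ss) != 0) return -1;
    new_ss.ss_sp = alt_stack;
    new_ss.ss_flags = 0;
    new_ss.ss_size = alt_size;
    if (sigaltstack(&new_ss, NULL) != 0) return -1;
    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0) return -1;

    /* dropm: block all signals, disable the alternate stack, restore the mask. */
    if (sigprocmask(SIG_SETMASK, &all, NULL) != 0) return -1;
    off_ss.ss_sp = NULL;
    off_ss.ss_flags = SS_DISABLE;
    off_ss.ss_size = 0;
    if (sigaltstack(&off_ss, NULL) != 0) return -1;
    if (sigprocmask(SIG_SETMASK, &saved, NULL) != 0) return -1;

    return 0;
}
```

Calling this helper in a loop with a 32768-byte buffer (the size seen in the trace) and timing it gives a rough feel for the per-call overhead, though the numbers will not match the runtime exactly.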

We could add a PreBindExtraM function that binds an extra M once, after the Go shared object is loaded and before any exported Go function is called, so that later calls skip these syscalls.
Nothing changes for programs that do not call PreBindExtraM.

Background:
We are building a Go extension for Envoy, which relies heavily on cgo.

doujiang24 (Contributor, Author) commented Mar 15, 2022

I finished a draft PR for this proposal: #51679
Any feedback is welcome, thanks!

With PreBindExtraM, calling Go from C is ~30x faster in the following simple test case:

hello.go

package main

import "C"

//export AddFromGo
func AddFromGo(a int64, b int64) int64 {
    return a + b
}

func main() {}

hello.c

#include <stdio.h>
#include "libgo-hello.h"
#include <stdlib.h>

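int main(int argc, char **argv) {
    long a = 2;
    long b = 3;
    long max = 1;

    /* Optional first argument: number of loop iterations. */
    if (argc > 1) {
        max = strtol(argv[1], NULL, 10);
    }

    printf("max loop: %ld\n", max);

    /* Proposed function from the draft PR; remove this call for the baseline run. */
    PreBindExtraM();

    long r = 0;
    for (long i = 0; i < max; i++) {
        r = AddFromGo(a, b);
    }

    printf("%ld + %ld = %ld\n", a, b, r);
    return 0;
}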

Benchmark with PreBindExtraM:

$ time ./hello 1000000
max loop: 1000000
2 + 3 = 5

real    0m0.150s
user    0m0.156s
sys     0m0.010s

Benchmark without PreBindExtraM (the call is simply removed):

$ time ./hello 1000000
max loop: 1000000
2 + 3 = 5

real    0m5.088s
user    0m1.536s
sys     0m4.116s

ianlancetaylor (Contributor) commented Mar 15, 2022

Even after looking at the pull request I'm not sure precisely what you are proposing.

Is user code expected to call PreBindExtraM? What is the exact semantics of that function? How would you write user documentation for it? Thanks.

@ianlancetaylor ianlancetaylor changed the title proposal: cgo: add PreBindExtraM to reduce signal syscall. proposal: cmd/cgo: add PreBindExtraM to reduce signal syscall Mar 15, 2022
doujiang24 (Contributor, Author) commented Mar 16, 2022

@ianlancetaylor Thanks.

> Is user code expected to call PreBindExtraM? What is the exact semantics of that function? How would you write user documentation for it? Thanks.

Yes, user code has to call PreBindExtraM to enable this optimization, as shown in hello.c.
Without the additional call to PreBindExtraM, everything works as before; nothing changes.

Let me try to write some documentation for it:

When a C process calls an exported Go function, the flow is, in short:

  1. bind an extra M (and also a P, which does not matter here),
  2. execute the Go function,
  3. drop the extra M (and P).

Step 1 (needm) and step 3 (dropm) together make the signal syscalls shown above: five sigprocmask calls and three sigaltstack calls.

To avoid these syscalls, cgo also generates a built-in C function, PreBindExtraM.
You can call PreBindExtraM to pre-bind an extra M after loading the Go shared object and before calling any exported Go function.
Once an extra M is pre-bound, steps 1 and 3 are skipped on every call to an exported Go function.
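
The flow described above can be sketched as a toy model in C. Everything here is hypothetical (model_needm, model_dropm, model_prebind_extra_m and the counters are illustrative names, not runtime or cgo symbols); the point is only to show that once a thread is pre-bound, the per-call bind and drop steps no longer run.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in the Go runtime or in cgo. */
static _Thread_local bool m_bound; /* does this thread currently hold an extra M? */
static int needm_count;            /* how many times step 1 (bind) ran */
static int dropm_count;            /* how many times step 3 (drop) ran */

static void model_needm(void) { m_bound = true;  needm_count++; }
static void model_dropm(void) { m_bound = false; dropm_count++; }

/* Models the proposed PreBindExtraM: bind once, ahead of any call. */
void model_prebind_extra_m(void) {
    if (!m_bound) {
        model_needm();
    }
}

/* Models one call from C into an exported Go function. */
long model_call_add(long a, long b) {
    bool bound_here = false;
    if (!m_bound) {          /* step 1: bind only if no M is attached yet */
        model_needm();
        bound_here = true;
    }
    long r = a + b;          /* step 2: run the "Go" function */
    if (bound_here) {        /* step 3: drop only what this call bound */
        model_dropm();
    }
    return r;
}
```

Without the pre-bind call, each model_call_add increments both counters; after model_prebind_extra_m, further calls leave them unchanged. Note that the model never drops the pre-bound M, so it does nothing to release it when the thread exits.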

aclements (Member) commented Mar 16, 2022

I haven't thought through this deeply, but is the TODO(rsc) comment on dropm relevant to this case? It seems like if the runtime could use TLS to bind an M to a C thread, we wouldn't need to manipulate the sigaltstack so frequently. But I may be wrong about that.

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Mar 16, 2022
ianlancetaylor (Contributor) commented Mar 16, 2022

OK, I think that in effect what the suggested change does is, for a thread created by C, set the g TLS variable to a newly created G and an associated M. However, there is no way to actually release that G and M if the thread exits.

So, I agree: the TODO by @rsc is a better approach. With that approach, the first time a C thread calls into Go we allocate a G and M and set the g TLS variable. Then we just keep that around, but if the thread exits we release that G and M and put the M back on the extram list.

Note that we will get into trouble if the C thread calls Go code, then disables the signal stack, then calls Go code again. Perhaps that case is not worth worrying about.

I'm going to take this out of the proposal process because I think we can get the same effect without an API change.

@ianlancetaylor ianlancetaylor changed the title proposal: cmd/cgo: add PreBindExtraM to reduce signal syscall runtime: preserve extra M across calls from C to Go Mar 16, 2022
@ianlancetaylor ianlancetaylor removed this from Incoming in Proposals Mar 16, 2022
@ianlancetaylor ianlancetaylor added the NeedsInvestigation label Mar 16, 2022
@ianlancetaylor ianlancetaylor removed this from the Proposal milestone Mar 16, 2022
@ianlancetaylor ianlancetaylor added this to the Backlog milestone Mar 16, 2022
thepudds commented Mar 16, 2022

> Then we just keep that around, but if the thread exits we release that G and M and put the M back on the extram list.

Sorry for the basic question, but does the runtime already track when a thread created by C exits?

thepudds commented Mar 16, 2022

To partly answer my own question, it looks like registering a destructor which would be called on thread exit would be part of the work here...

ianlancetaylor (Contributor) commented Mar 16, 2022

Yes, we would use pthread_key_create with a destructor function. We wouldn't actually track when a thread exits as such.
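
For readers unfamiliar with the mechanism, here is a minimal C sketch of a pthread key whose destructor fires at thread exit. The names (exit_key, on_thread_exit, destructor_runs_for_thread) are hypothetical and the destructor only bumps a counter; it shows the POSIX behavior being relied on, not runtime code. The destructor runs only for threads that stored a non-NULL value under the key.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_key_t exit_key;
static pthread_once_t exit_key_once = PTHREAD_ONCE_INIT;
static int destructor_runs; /* read only after pthread_join, so no race here */

/* Called by the pthread library when a thread exits with a non-NULL value
 * stored under exit_key. */
static void on_thread_exit(void *value) {
    (void)value;
    destructor_runs++;
}

static void make_key(void) {
    pthread_key_create(&exit_key, on_thread_exit);
}

static void *worker(void *arg) {
    pthread_once(&exit_key_once, make_key);
    if (arg != NULL) {
        /* Storing a non-NULL value is what arms the destructor for this thread. */
        pthread_setspecific(exit_key, arg);
    }
    return NULL;
}

/* Runs one thread to completion and returns how many times the destructor
 * fired for it (expected: 1 if a value was set, 0 otherwise), or -1 on error. */
int destructor_runs_for_thread(int set_value) {
    static int marker;
    pthread_t t;
    int before = destructor_runs;

    if (pthread_create(&t, NULL, worker, set_value ? &marker : NULL) != 0) return -1;
    if (pthread_join(t, NULL) != 0) return -1;
    return destructor_runs - before;
}
```

This is why nothing needs to watch for thread exit explicitly: the pthread library invokes the destructor itself, and a thread that never set the value is left alone.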

doujiang24 (Contributor, Author) commented Mar 17, 2022

Oh, agreed, the TODO by @rsc is a better approach. Using pthread_key_create to register a destructor is a good idea.

> set the g TLS variable to a newly created G and an associated M.

Does it need to create a new g? Maybe using g0 would be a better choice, as is done now.

Is the following change heading in the right direction? I would love to give it a try. Thanks.

  1. Call pthread_key_create to register a destructor when the Go shared object is loaded, maybe in the x_cgo_sys_thread_create function.
  2. Call needm in _cgo_wait_runtime_init_done when the thread's key value is NULL, and set the key value to a non-NULL value.
  3. Call dropm when the destructor runs.

In short, every exported Go function tries to pre-bind an M on entry, and the destructor drops the M so that it does not leak.

ianlancetaylor (Contributor) commented Mar 17, 2022

> Does it need to create a new g? Maybe using g0 would be a better choice, as is done now.

Yes, that is the right thing to do.

Your set of steps sounds basically right.

doujiang24 (Contributor, Author) commented Mar 18, 2022

Okay, stepping out of the pre-bind rabbit hole: maybe it is simpler (or what you expected) to change step 2 to the following?
  2. Skip dropm when a destructor is registered.

aclements (Member) commented Mar 18, 2022

I'm not sure I completely understand what you mean, but I think that's the right direction. cgocallback already calls needm if the g TLS isn't set, so it's probably easiest to let it keep doing that, rather than moving responsibility for that to _cgo_wait_runtime_init_done, and just leave that g/m set. That also means we don't need to access this new pthread_key's value from Go; we're only using it for its destructor.

_cgo_wait_runtime_init_done might be a good place to ensure the pthread_key is set to a non-NULL value for that thread (otherwise the destructor won't be called), and possibly a good place to ensure the pthread_key has been created in the first place.

Creating the m in _cgo_wait_runtime_init_done would probably work, but it's sort of on the wrong side of the language divide.
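
As a rough illustration of this division of labor, the following C sketch models a thread that binds on its first call, keeps the binding across later calls, and releases it from a pthread key destructor at thread exit. All names are hypothetical stand-ins (exported_add plays the role of an exported Go function, ensure_destructor_armed the role of the check suggested for _cgo_wait_runtime_init_done), and the "bind" and "release" steps are just counters.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names come from the Go runtime. */
static pthread_key_t m_key;
static pthread_once_t m_key_once = PTHREAD_ONCE_INIT;
static _Thread_local bool has_m;   /* models "the g TLS variable is set" */
static atomic_int needm_total;     /* number of binds across all threads */
static atomic_int dropm_total;     /* number of releases across all threads */

/* Destructor: models dropping the M when the C thread exits. */
static void release_m(void *unused) {
    (void)unused;
    has_m = false;
    atomic_fetch_add(&dropm_total, 1);
}

static void create_m_key(void) {
    pthread_key_create(&m_key, release_m);
}

/* C-side entry check: make sure the key exists and holds a non-NULL value
 * for this thread, otherwise the destructor would never be called. */
static void ensure_destructor_armed(void) {
    pthread_once(&m_key_once, create_m_key);
    if (pthread_getspecific(m_key) == NULL) {
        pthread_setspecific(m_key, &m_key); /* any non-NULL pointer will do */
    }
}

/* Stand-in for an exported Go function called from C. */
long exported_add(long a, long b) {
    ensure_destructor_armed();
    if (!has_m) {                  /* bind on the first call only */
        has_m = true;
        atomic_fetch_add(&needm_total, 1);
    }
    return a + b;                  /* note: no release on the way out */
}

static void *caller(void *arg) {
    long *io = arg;
    long sum = 0;
    for (long i = 0; i < *io; i++) {
        sum += exported_add(2, 3);
    }
    *io = sum;
    return NULL;
}

/* Runs one C thread that makes `calls` calls; returns the sum, or -1 on error. */
long run_c_thread(long calls) {
    pthread_t t;
    long io = calls;
    if (pthread_create(&t, NULL, caller, &io) != 0) return -1;
    if (pthread_join(t, NULL) != 0) return -1;
    return io;
}
```

In this model a thread making 1000 calls binds once and releases once, and the key's value is never read for anything other than the NULL check, which matches the idea of using the key only for its destructor.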

doujiang24 (Contributor, Author) commented Mar 21, 2022

Yeah, I mean keeping needm in cgocallback and skipping dropm when a destructor has been registered with pthread_key_create, since I noticed the following comment on dropm in the source code:

// We may have to keep the current version on systems with cgo
// but without pthreads, like Windows.

> _cgo_wait_runtime_init_done might be a good place to ensure the pthread_key is set to a non-NULL value for that thread

Yeah, this sounds better than x_cgo_sys_thread_create. I will give it a try. Thanks.

doujiang24 (Contributor, Author) commented Mar 21, 2022

I have implemented the new approach in CL 387415.
Please take a look and tell me whether it is heading in the right direction. If so, I'll continue to improve it.
Any feedback is welcome, thanks.

In CL 387415, we introduced two variables:

  1. x_cgo_pthread_key_created indicates whether the destructor has been registered,
  2. x_cgo_dropm holds the address of the cgodropm function, since I found it hard to import the Go function cgodropm into gcc_libinit.c.
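
One plausible reading of this two-variable design, sketched in C without reproducing the CL: a flag that one side sets and the other consults, plus a function pointer through which the C destructor reaches a function it cannot link against directly. The names below (key_created, dropm_hook, thread_exit_destructor and so on) are hypothetical stand-ins for the roles of x_cgo_pthread_key_created and x_cgo_dropm, not the CL's actual code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, for illustration only. */
static bool key_created;           /* role of x_cgo_pthread_key_created */
static void (*dropm_hook)(void);   /* role of x_cgo_dropm: address of the drop function */

static int drops;                  /* counts calls made through the hook */

/* Stand-in for the function whose address would be stored in the hook. */
static void fake_dropm(void) {
    drops++;
}

/* C side: record that the key and its destructor were registered. */
void mark_key_created(void) {
    key_created = true;
}

/* Other side: after a callback returns, skip the per-call drop when the
 * destructor will take care of it at thread exit. */
bool should_skip_dropm(void) {
    return key_created;
}

/* Body of the pthread key destructor: call through the pointer, if one
 * has been published. */
void thread_exit_destructor(void *unused) {
    (void)unused;
    if (dropm_hook != NULL) {
        dropm_hook();
    }
}
```

The indirection through a pointer means the C file needs no link-time reference to the drop function; it only needs the address to have been stored before the first thread exits.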
