Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel build broken on MacOS because of thread-local exec_ctx #13856

Closed
vjpai opened this issue Dec 21, 2017 · 20 comments · Fixed by #13929
Closed

Bazel build broken on MacOS because of thread-local exec_ctx #13856

vjpai opened this issue Dec 21, 2017 · 20 comments · Fixed by #13929

Comments

@vjpai
Copy link
Member

vjpai commented Dec 21, 2017

Please answer these questions before submitting your issue.

Should this be an issue in the gRPC issue tracker?

Yes, for tracking reasons

What version of gRPC and what language are you using?

1.9.0-dev

What operating system (Linux, Windows, …) and version?

MacOS

What runtime / compiler are you using (e.g. python version or version of gcc)

$ bazel info
release: release 0.8.1-homebrew
$ clang -v
Apple LLVM version 9.0.0 (clang-900.0.38)

What did you do?

bazel build //:grpc

What did you expect to see?

Successful build as was possible before exec_ctx became a thread-local variable

What did you see instead?

$ bazel build //:grpc
INFO: Analysed target //:grpc (17 packages loaded).
INFO: Found 1 target...
ERROR: /Users/vpai/Git/grpc/BUILD:224:1: Linking of rule '//:grpc' failed (Exit 1)
clang: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
ld: illegal thread local variable reference to regular symbol __ZN9grpc_core7ExecCtx9exec_ctx_E for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Anything else we should know about your project / environment?

A similar issue is reported (and unresolved) at tensorflow/serving#1 and other places in Github

@vjpai
Copy link
Member Author

vjpai commented Jan 3, 2018

After discussing this with @yashykt : it seems similar in some ways to #13884 . I can confirm that if I manually change port_platform.h to change all of the Mac build versions to GPR_PTHREAD_TLS instead of GPR_GCC_TLS, it builds fine. That said, I know that that's suboptimal, so we can figure out what the version issue (or whatever) is so that we can make this use GPR_GCC_TLS when possible.

@yashykt
Copy link
Member

yashykt commented Jan 3, 2018

Also, it looks like the issue crops up only while linking which means that thread local support is not working for externally defined thread local variables on that version.
So, another solution is to remove the inline functions in exec_ctx.h which are accessing the thread local variable. (might have a negligible/small affect on performance though).

@vjpai
Copy link
Member Author

vjpai commented Jan 4, 2018

Indeed, this is the only appearance of thread-local storage in any .h file. All the other uses are in a .cc file. This is probably significant.

@gnossen
Copy link
Contributor

gnossen commented Apr 27, 2020

This appears to still be a problem in certain cases. Repro (on Mac):

cd bazel/test/python_test_repo/
bazel build //:import_moved_test

This results in the same linker error as above.

CC @vjpai

@gnossen gnossen reopened this Apr 27, 2020
@jtattermusch
Copy link
Contributor

Context:
AFAIK the bazel build on mac is still broken if you use GCC thread-locals in a header, which has been worked around by Vijay by switching from using GCC thread-locals to pthread thread locals for our C++ builds (I think in #13929). For python we've never adjusted the build, which is why it's failing.

@jtattermusch
Copy link
Contributor

Because of this problem, we keep seeing assertion failed: 0 == pthread_setspecific(tls->key, (void*)value) assertion in some MacOS tests.
Basically our code is not always working reliably in situations when thread locals require initialization but because of the bazel build limitation, we are required to use pthread thread locals on Mac, which results in problems.

@gnossen
Copy link
Contributor

gnossen commented Jun 11, 2020

@veblush Any progress on this? Another user has run into this problem: #23096

@veblush
Copy link
Contributor

veblush commented Jul 22, 2020

Oh I didn't realize that I was assigned to this problem. I don't know the problem much so can you help me understand the problem? Is this something happening now? and is bazel build on the mac completely broken now because of this?

@jtattermusch
Copy link
Contributor

The problem in short:

  1. the exec_ctx.h header is currenly using thread-local variable (I've been told it's for performance reasons)
  2. for some reason bazel build on MacOS won't work when GCC thread locals are using in a header (which is what this issue is about)
  3. because of that, we switched our bazel build on MacOS to pthread thread locals instead (see When building with bazel on a Mac, workaround bazelbuild/bazel#4341 #13929). So build currently works (and we were able to fix our mac tests too), but it's awkward that we cannot use the gcc thread locals and it's causing friction for the users and for the team as well.

@jtattermusch
Copy link
Contributor

A few more observations:

  • this problem only exhibits when the library that contains exec_ctx (in case of our bazel build it's //:grpc_base_c) is linked by another binary/library dynamically. If it's linked statically (linkstatic=True), it works around the problem. Unfortunately this cannot be used as a general solution as we cannot dictate whether users should link against grpc statically or dynamically. (

  • I tried using C++11's thread_local instead of GCC's __thread and it seems to fix the problem, but there are some performance considerations around using thread_local vs __thread when shared libraries come into play (based on my reading up, I'm no expert here).

@yashykt @vjpai would using C++11 thread_local in our mac builds be an acceptable alternative from the performance perspective? (https://stackoverflow.com/questions/13106049/what-is-the-performance-penalty-of-c11-thread-local-variables-in-gcc-4-8 btw, some wrapped language do rely on using dlopen / dlsym to load C core).

@yashykt
Copy link
Member

yashykt commented Aug 27, 2020

That's a good suggestion @jtattermusch. I think using C++11 thread_local would be a good idea.

@jtattermusch
Copy link
Contributor

@yashykt can you please comment more on the performance implications of using thread_local? (sounds like it could be slow when used in a shared library, see my comment above).

Would we only use thread_local for the specific configuration that's broken (bazel build on macos), or whenever on mac?
Depending on the performance characteristics of thread_local, using __thread when building on macos with make/cmake vs thread_local when building with bazel sounds like a better option, but there's also value in all the builds being consistent (= all mac builds using the same compilation options), because too much differentiation leads to increased complexity and harder maintenance (and we really don't want that).

@gnossen
Copy link
Contributor

gnossen commented Sep 16, 2020

@yashykt @veblush Ping. The folks from github.com/googleapis/googleapis have run into this.

@yashykt
Copy link
Member

yashykt commented Sep 17, 2020

We have not focused any performance effort on MacOS, so we wouldn't be able to have concrete data for whether it will have a noticeable effect, but just based on the fear that dynamic initialization might be slower than static initialization, I don't think that matters to us since we mostly care about per-call performance.

@veblush
Copy link
Contributor

veblush commented Sep 17, 2020

@gnossen What problem did they have? Was it performance-related?

We tried to replace gpr_tls with C++ thread_local completely but couldn't because old version of xcode gRPC supports doesn't work well with C++ thread_local. (#20413)

@gnossen
Copy link
Contributor

gnossen commented Sep 18, 2020

No. It was a build failure on Mac OS of exactly the sort mentioned in the original post here. They're seeing this problem with Bazel on Mac OS when pulling in gRPC Python (which is dynamically linked). It seems that the gcc TLS implementation is being picked instead of the pthread implementation. I've confirmed that switching to the pthread implementation fixes the problem.

@veblush
Copy link
Contributor

veblush commented Sep 30, 2020

@gnossen Thanks! :grpc is now okay but :import_moved_test isn't.

@veblush
Copy link
Contributor

veblush commented Sep 30, 2020

I made #24247 unifying TLS implementation across all platforms but it needs some changes on our supported platforms. It would take some time to bring all stakeholders on the same page.

@jtattermusch
Copy link
Contributor

jtattermusch commented Oct 12, 2020

@veblush
Looks like this might have been fixed by this change: https://github.com/grpc/grpc/pull/24247/files#diff-dd49c38c34f9401e10f1921931d6c121R211?

We are now using C++11 TLS on mac, and the problem we were experiencing under bazel was only for GCC_TLS.

@jtattermusch
Copy link
Contributor

I confirmed that both of these work on my Mac

bazel build //:grpc
cd bazel/test/python_test_repo/
bazel build //:import_moved_test

So I'm pretty confident that this has been fixed by #24247. Thanks @veblush for the work!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.