ansible: use gcc 4.9 on CentOS 6 #809

Merged: 1 commit merged into nodejs:master from seishun:centos6-gcc49 on Nov 13, 2017

Conversation

@seishun (Contributor) commented Jul 22, 2017

Currently CentOS 6 machines use gcc 4.8, while Node.js requires "gcc and g++ 4.9.4 or newer". See #762.

devtoolset-3 has gcc 4.9.1, too old.
devtoolset-5 has "gcc (GCC) 5.3.1 20160406", a snapshot that still predates the gcc 4.9.4 release (August 2016).
devtoolset-6 has "gcc (GCC) 6.3.1 20170216", good enough.

The CERN repo doesn't have devtoolset newer than 2, so I replaced that with Software Collections.
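
(In practice that means installing the centos-release-scl repo package and then devtoolset-6-gcc / devtoolset-6-gcc-c++ via the existing yum tasks; I'm going by the SCL docs for those package names, so double-check against the diff.)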

@rvagg (Member) commented Aug 11, 2017

Heads-up that we're dealing with a glibc bug on CentOS 6 for 8.x builds on the Linux distributions: nodesource/distributions#513. Waiting for Chris to shave that yak so we can understand more.

@chrislea commented:

Here is what I know thus far:

If you compile 8.3.0 on CentOS 6 with gcc = 4.8.2 from devtoolset-2, the binary works fine.

If you compile 8.3.0 on CentOS 6 with gcc = 4.9.2 from devtoolset-3, the binary does not work and gives the error:

./node: error while loading shared libraries: cannot allocate memory in static TLS block

You see the same bad behavior if you use the compilers from devtoolset-4 or devtoolset-6. The issue is constrained specifically to CentOS 6. CentOS 7 does not have this problem, nor does any version of Fedora, Debian or Ubuntu that we tend to care about.

The error appears to be due to this glibc bug (sourceware #14898), which wasn't actually fixed until glibc 2.17, meaning that fix is not at this point going to make it back to CentOS 6.
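
If the linked glibc bug (14898) is indeed the culprit, the trigger reduces to a one-byte thread-local variable. A hypothetical sketch of the pattern (assumed, not the actual V8 code):

    // Hypothetical: a single one-byte TLS object, the case glibc < 2.17
    // mishandles when allocating the static TLS block.
    __thread bool g_in_wasm = false;   // sizeof == 1: triggers the bug
    __thread int  g_in_wasm_ok = 0;    // sizeof > 1: allocates fine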

This is problematic as the compiler requirement recently moved to "4.9.x" basically. So the options seem to be:

  1. Back the compiler requirement down. I'm not sure this will actually be technically possible going forward if V8 starts using language features that gcc 4.8.x doesn't have.
  2. Stop supporting CentOS 6. Rather undesirable considering the amount of enterprise deployments still using it.
  3. Some technical (or other) fix for this that's past my understanding.

Happy to provide more info if I have it, just let me know.

@seishun (Contributor, Author) commented Aug 11, 2017

Back the compiler requirement down.

In other words, continue working around compiler bugs due to a glibc bug that was fixed 4.5 years ago.

I'd say we should just raise the minimum supported glibc version to 2.17 (which came out on 25 Dec 2012). Enterprises still using CentOS 6 would need to upgrade to CentOS 7 or get a newer version of glibc.

cc @nodejs/build

@chrislea commented:

I'd say we should just raise the minimum supported glibc version to 2.17 (which came out on 25 Dec 2012). Enterprises still using CentOS 6 would need to upgrade or get a newer version of glibc.

Just speaking from experience, that won't happen. If we do that, we should understand that we are dropping support for CentOS 6 / RHEL 6.

I'm not saying that to imply it's the wrong course of action, but we shouldn't kid ourselves about the consequences.

@gibfahn (Member) commented Aug 11, 2017

Enterprises still using CentOS 6 would need to upgrade or get a newer version of glibc.

AIUI successfully upgrading glibc is basically impossible (or at least not something you'd ever contemplate for a production system).

@seishun (Contributor, Author) commented Aug 11, 2017

It's possible to install a newer version in parallel: https://unix.stackexchange.com/a/299665/164374

@chrislea commented Aug 11, 2017

@seishun It's possible, yes, but that seems like a really bad thing for the Node Foundation to suggest to end users, for several reasons.

  1. As soon as you do that, you're putting the Node project in the line of sight for support for something other than installing Node itself, which is definitely not desirable.
  2. Even worse: the inevitable case will happen where somebody installs an out-of-band version of glibc, adjusts their system to use it for Node (via LD_LIBRARY_PATH or somesuch), forgets they did that, and then some other application stops working properly as a result. The Node project will be seen as a bad actor for having suggested doing it in the first place.
  3. Probably worst of all, one of the reasons we're having the discussion is that we don't want to back the compiler requirement back down to 4.8.x. So if we suggest this, we're basically saying "We aren't willing to work around issues with old glibc implementations anymore because it's just too painful for us. So instead we are pushing that pain point onto our end users." This is not a good strategy for making end users happy. You'd never see Oracle suggesting such a thing for a Java build, for example. If you're at the point where you're telling an end user to compile system level libraries from source just to use the application software you're shipping, you've gone down a bad road.

There's a bit of a pickle here already of course, because 8.3.0 is already out in the wild, and was built with a compiler that doesn't meet the current compiler requirements. But that's probably a small issue in the grand scheme of things.

I still think that if we're not willing to use the older compiler, and there's no way to code around this, then it's time we simply say CentOS 6 / RHEL 6 is too old and it's not supported anymore. That day is going to come eventually, and it's going to come before those distros reach EOL, so if we're at that point so be it.

@seishun (Contributor, Author) commented Aug 11, 2017

the Node project will be seen as a bad actor for having suggested doing it in the first place.

The Node project shouldn't suggest anything. It should just state the glibc and compiler requirements and let users (or repo maintainers) figure out how to satisfy them.

So if we suggest this, we're basically saying "We aren't willing to work around issues with old glibc implementations anymore because it's just too painful for us. So instead we are pushing that pain point onto our end users." This is not a good strategy for making end users happy. You'd never see Oracle suggesting such a thing for a Java build, for example.

Java is maintained by Oracle, so they can follow whatever policy they like. But Node is mostly maintained by unpaid volunteers and having to work around old compiler bugs will turn people away. Requiring a glibc version that's less than 5 years old is not an unreasonable demand.

@chrislea commented:

Hey, like I said, I'm not in any way trying to suggest that saying "EL6 is too old, sorry" is the wrong answer here. Nor is it my decision to make. All I was trying to point out was that if that's what's happening, I think the project should be clear about it.

@refack (Contributor) commented Aug 11, 2017

I'd like to restate the working assumption that "a feature-frozen system is a feature-frozen system": if a user decided to freeze their OS for stability's sake, it is fairly safe to assume they have also frozen their Node version.
So the combination of an old OS + a new major Node version is highly unlikely.
Hence we should only support old OSs for old Node versions.
Ref: nodejs/node#12672

@chrislea commented:

Oh man @refack my life would be so much easier if ^^^ was true. ;-)

@refack (Contributor) commented Aug 11, 2017

Oh man @refack my life would be so much easier if ^^^ was true. ;-)

Well then, you are our gatekeeper; personally I haven't seen many "minor version upgrade broke my system" bugs in the core repo. So thank you, and sorry 🤗
But then again, I'm gonna sue you for attempting to penetrate my system with
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash - 😉
Then again, maybe I've just been preoccupied with "Windows installer broke my system" bugs.

@rvagg (Member) commented Aug 12, 2017

It's hard to overstate how poorly this will go down: "Enterprises still using CentOS 6 would need to upgrade to CentOS 7". Folks are still using EL5 in production out there. In very large companies these kinds of moves are very complicated and therefore happen very slowly. We're a little too comfortable catering to nimble startups and early-adopter types who are used to working on the cutting edge, or close to it. Further out than that, we hit real-world constraints where these things are really painful.

Does clang offer us a get-out-of-jail here perchance?

@bnoordhuis (Member) commented:

There might be a simple workaround. Can someone try this patch?

diff --git a/deps/v8/src/trap-handler/handler-shared.cc b/deps/v8/src/trap-handler/handler-shared.cc
index 7b399f5eea..e4f0136be7 100644
--- a/deps/v8/src/trap-handler/handler-shared.cc
+++ b/deps/v8/src/trap-handler/handler-shared.cc
@@ -23,7 +23,9 @@ namespace v8 {
 namespace internal {
 namespace trap_handler {
 
-THREAD_LOCAL bool g_thread_in_wasm_code = false;
+// Note: sizeof(g_thread_in_wasm_code) must be > 1 to work around
+// https://sourceware.org/bugzilla/show_bug.cgi?id=14898
+THREAD_LOCAL int g_thread_in_wasm_code = false;
 
 size_t gNumCodeObjects = 0;
 CodeProtectionInfoListEntry* gCodeObjects = nullptr;
diff --git a/deps/v8/src/trap-handler/trap-handler.h b/deps/v8/src/trap-handler/trap-handler.h
index 5494c5fdb3..ed9459918b 100644
--- a/deps/v8/src/trap-handler/trap-handler.h
+++ b/deps/v8/src/trap-handler/trap-handler.h
@@ -65,7 +65,7 @@ inline bool UseTrapHandler() {
   return FLAG_wasm_trap_handler && V8_TRAP_HANDLER_SUPPORTED;
 }
 
-extern THREAD_LOCAL bool g_thread_in_wasm_code;
+extern THREAD_LOCAL int g_thread_in_wasm_code;
 
 inline bool IsThreadInWasm() { return g_thread_in_wasm_code; }
 

@bnoordhuis (Member) commented:

Does clang offer us a get-out-of-jail here perchance?

Unlikely. It's a glibc bug; gcc is not to blame.

@seishun (Contributor, Author) commented Aug 12, 2017

@bnoordhuis Builds and runs fine with the patch.

@rvagg Without undermining your point about large and slow companies, comparing "working on the cutting edge" with "using an EOL distro that was superseded 6 years ago" is a bit of a false dichotomy.

@bnoordhuis (Member) commented:

Thanks. I'll check if upstream is amenable to this change.

@bnoordhuis (Member) commented:

https://chromium-review.googlesource.com/c/612351 - let's wait and see.

@seishun (Contributor, Author) commented Aug 12, 2017

I'm curious why this isn't an issue with gcc 4.8.2 though.

@FireBurn commented:

It's an issue for me with GCC 4.8.2 on RHEL6

@chrislea commented:

Confirming (also) that 8.3.0 builds and runs on EL6 using that patch with the compilers from

devtoolset-3: gcc = 4.9.2
devtoolset-4: gcc = 5.3.1
devtoolset-6: gcc = 6.3.1

As seems often the case, a great big thanks to @bnoordhuis for saving the day.

A question (again I suppose for you Ben) since I'm not very familiar with how quickly things move through V8: is it reasonable to assume that they'll include this patch and that it will make it to the mainstream V8 that gets shipped with Node Soon (tm)? I ask because on the RPM packages side, it's fairly easy for me to use this patch and put in logic like "if el6 use the patch else don't". But if this is going to land in mainline next week or something, it's probably not worth it to do that, rebuild everything, and then back that out when it's no longer needed.
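
(For the EL6-only case I'd gate the patch in the spec file with something like %if 0%{?rhel} == 6 around the %patch line; hypothetical macro usage on my part, but that's the usual dist-conditional pattern.)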

Thanks again!

@refack (Contributor) commented Aug 12, 2017

But if this is going to land in mainline next week or something, it's probably not worth it to do that, rebuild everything, and then back that out when it's no longer needed.

V8 CLs can land very fast, if they are "amenable" to the actual change.
P.S. @bnoordhuis, didn't they give you Dry Run permissions?

@bnoordhuis (Member) commented:

@chrislea What Refael said. If I get a LGTM, I can land the patch and cherry-pick it into node. That would then go out in the next release.

@refack I have, but this is one of those it-either-compiles-or-it-doesn't things. I didn't feel it was worth spending Watts on a dry run.

@chrislea commented:

Okay, thanks Ben. If that's the case I'll do a one-off rebuild just for the EL6 RPM packages and this should be settled.

gibfahn added a commit to gibfahn/node that referenced this pull request Aug 12, 2017
Also updates glibc version to 2.17. This means the minimum supported
distros are RHEL/CentOS 7 and Ubuntu 14.04.

| Distro       | Kernel | glibc |
|  ---         |  ---   |  ---  |
| Current      | 2.6.32 | 2.12  |
| RHEL6        | 2.6.32 | 2.12  |
| New          | 3.10.0 | 2.17  |
| RHEL7        | 3.10.0 | 2.17  |
| Ubuntu 14.04 | 3.13   | 2.19  |
| Ubuntu 16.04 | 4.4    | 2.23  |
| Latest       | 4.12.5 | 2.26  |

Refs: nodejs/build#809
kisg pushed a commit to paul99/v8mips that referenced this pull request Aug 17, 2017
glibc before 2.17 has a bug that makes it impossible to execute binaries
that have single-byte thread-local variables:

    % node --version
    node: error while loading shared libraries: cannot allocate memory
    in static TLS block

Work around that by making the one instance in the V8 code base an int.

See: https://sourceware.org/bugzilla/show_bug.cgi?id=14898
See: nodesource/distributions#513
See: nodejs/build#809
Change-Id: Iefd8009100cd93e26cf8dc5dc03f2d622b423385
Reviewed-on: https://chromium-review.googlesource.com/612351
Commit-Queue: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-by: Eric Holk <eholk@chromium.org>
Cr-Commit-Position: refs/heads/master@{#47400}
@refack (Contributor) commented Oct 25, 2017

@seishun since ATM we don't have a CentOS6 x32 test machine, and only one release machine, we probably don't want this to run on the x32 machines, just in case.

To sum up the discussion, we need these bot setups:

| CentOS6 | x64       | x32 |
| ---     | ---       | --- |
| Release | 4.8 & 4.9 | 4.8 |
| Test    | 4.8 & 4.9 | 4.8 |

@seishun (Contributor, Author) commented Oct 25, 2017

@refack Why do we need 4.8 on 64-bit?

@refack (Contributor) commented Oct 25, 2017

@refack Why do we need 4.8 on 64-bit?

Testing & Building node v6

@seishun (Contributor, Author) commented Oct 25, 2017

Makes sense. Does that mean the repo should have separate ansible scripts for 4.8 and >=4.9?

@seishun (Contributor, Author) commented Oct 26, 2017

(edited because I realized there is no 32-bit CentOS 6 test machine)

On the other hand, we'll already have gcc 4.8 on CentOS 5 if we want to make sure node 6.x builds with gcc 4.8. Is there really a need to use gcc 4.8 on all other test machines for node 6.x?

@refack (Contributor) commented Oct 26, 2017

@sgallagher hi, I just saw you commenting on the c-ares thread and was wondering if you have any input on this?

@sgallagher commented:

@refack Sorry, I've scanned the messages, but I'm not sure what question you want me to answer. Could you summarize, please?

@refack (Contributor) commented Oct 26, 2017

@sgallagher we want to upgrade to GCC 4.9.4, but we couldn't find an RPM that will cross-compile on CentOS6 x64 for a 32-bit target.
I was wondering if you know of a simple way to achieve that (the optimal scenario being a 4-way side-by-side: one host that can use both 4.9.4 and 4.8, for both x32 and x64)?

@codonell commented:

I'm sure @sgallagher would agree with me here...

As an upstream senior developer for glibc, and the glibc team lead for Red Hat Enterprise Linux, I strongly suggest that issues like this be taken upstream for your distribution. If you are supporting CentOS 6, take it upstream to RHEL6, and file a bug. That way my team can evaluate the direct impact of the problem, perhaps even providing a workaround (if one exists). Your bug always adds to the data we have in our tracker, and that helps us evaluate risk vs. reward for a fix in RHEL6 (in production phase 3, accepting only urgent bug fixes).

The upstream glibc bug 14898 was a trivial fix of a single internal constant (changed to -1) to allow a single-byte TLS variable to exist. It is a fix with limited scope and little risk, and it would fix this use case. The question to really answer is: is this the problem you are seeing? If it is, then a trivial workaround is to increase the size of the object beyond 1 byte, which safely changes the size of the allocated object so it doesn't clash with the internal constant (which is also 1).

So you have a workable solution to keep running on RHEL6 IMO.

Lastly, I do not think this is a compiler issue, but it might be; perhaps the compiler was able to optimize away a TLS variable that was never used, leaving you with just one variable of size 1 byte. It's hard for me to evaluate this in this ticket (there is a lot of noise here). Again, filing a ticket upstream would help us get to the bottom of this particular use case.

@sgallagher commented:

@refack I would suggest you take a look at the mock tool in CentOS. Its whole purpose is to work with yum to set up a chroot environment for doing clean compilations. (It's what all of the builders in Fedora use to create RPMs.)

You can use both the epel-6-x86_64 and epel-6-i386 configs to build for 64-bit and 32-bit, respectively.
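
For example, something like mock -r epel-6-i386 --rebuild node-x.y.z.src.rpm (hypothetical source RPM name) sets up a clean 32-bit buildroot and rebuilds the package inside it.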

Feel free to reach out to me privately if you need help setting it up.

@refack (Contributor) commented Oct 26, 2017

@codonell thanks for the reply. The TLS issue was indeed a blocker, but as you state, it was resolved. IMHO what we are facing now is more of a configuration/ABI-compatibility issue; that is, we are now stuck at the link stage:

  g++ -pthread -rdynamic -m32  -o /home/centos/node/out/Release/openssl-cli -Wl,--start-group ... /home/centos/node/out/Release/obj.target/deps/openssl/libopenssl.a -ldl -Wl,--end-group
/opt/rh/devtoolset-6/root/usr/libexec/gcc/x86_64-redhat-linux/6.2.1/ld: skipping incompatible /opt/rh/devtoolset-6/root/usr/lib/gcc/x86_64-redhat-linux/6.2.1/libstdc++_nonshared.a when searching for -lstdc++_nonshared
/opt/rh/devtoolset-6/root/usr/libexec/gcc/x86_64-redhat-linux/6.2.1/ld: cannot find -lstdc++_nonshared
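
(My current guess is that the 32-bit devtoolset C++ runtime simply isn't installed on that host; a package along the lines of devtoolset-6-libstdc++-devel.i686, assuming the SCL repos carry i686 packages, should provide the missing libstdc++_nonshared.a. Not verified yet.)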

@refack (Contributor) commented Oct 26, 2017

P.S. @codonell @sgallagher are there any obvious caveats to building on CentOS/RHEL6 with a non-"canonical" GCC toolchain (e.g. 4.9.4)?

@sgallagher commented:

@refack Do you mean a custom-made toolchain, or one of those from Red Hat's Developer Toolset releases? The latter, at least, are fully supported by Red Hat for production use (with a Red Hat subscription, of course).

@refack (Contributor) commented Oct 26, 2017

Do you mean a custom-made toolchain, or one of those from Red Hat's Developer Toolset releases? The latter, at least, are fully supported by Red Hat for production use (with a Red Hat subscription, of course).

AFAIK, since Node is an OSS project there's a tendency to use FOSS, so until now we used devtoolset-2 as built by SLC (Scientific Linux CERN). Now we're considering devtoolset-6 as built by mlampe on copr.
I'm assuming any answers are general observations and not official RH support.

@codonell commented:

@refack Your question about building with a non-canonical toolchain is a very good one. In the case of devtoolset, the tooling is designed specifically to layer a new compiler (and parts of the runtime) on top of system runtimes. If you build a new toolchain from whole cloth, you are in a very different position, a position where you become responsible for bundling (a) the compiler runtime, (b) the language runtimes and (c) any new dependencies therein. Therefore using devtoolset (developer toolset) vs. using a freshly built gcc (which may need a new binutils, and whose generated code may or may not need a new glibc) are very different questions. I would suggest staying with devtoolset as much as you can since the results are predictable and well studied.

@seishun (Contributor, Author) commented Nov 6, 2017

@refack ping... what's the status of this?

refack merged commit f6f19ab into nodejs:master on Nov 13, 2017
@refack (Contributor) commented Nov 14, 2017

summary:

  1. uninstalled devtoolset-2
  2. removed centos6-64-gcc44 label
  3. updated git to rh-git29 (added to profile.d and sysconfig/jenkins)
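
(The profile.d part is the usual SCL enablement hook, i.e. sourcing scl_source enable rh-git29, or equivalent.)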

seishun deleted the centos6-gcc49 branch on November 14, 2017
@seishun (Contributor, Author) commented Nov 14, 2017

Ran CI with a commit that fails on gcc < 4.9. Looks good: https://ci.nodejs.org/job/node-test-commit-linux/14092/nodes=centos6-64/

Just curious, how did you solve the problem of testing and releasing on Node versions < 9.x?

Also, while I appreciate your work here, upgrading gcc on select machines doesn't help much if it's not going to be upgraded on all machines used for CI on master, which is why I'm going to wait until #797 (comment) is resolved before continuing with the other machines that require upgrading.

@refack (Contributor) commented Nov 14, 2017

Just curious, how did you solve the problem of testing and releasing on Node versions < 9.x?

I tested the binary produced by devtoolset-6 on vanilla CentOS6 and it "just works" (as @codonell commented in #809 (comment)).

Also, while I appreciate your work here, upgrading gcc on select machines doesn't help much if it's not going to be upgraded on all machines used for CI on master.

👍 I'll use https://ci.nodejs.org/job/node-test-commit/14000/ as the basis for an upgrade task list
