Hardcoded location of linker will fail on Redhat 7 #2

Closed
surak opened this issue Aug 21, 2020 · 3 comments

@surak

surak commented Aug 21, 2020

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Redhat 7
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.15.3+nv20.07
  • Python version: 3.8.5
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 9.3.0
  • CUDA/cuDNN version: CUDA 11.0.207, cuDNN 8.0.1.13
  • GPU model and memory: Nvidia A100

Describe the current behavior

(...)  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -shared -o bazel-out/k8-py2-opt/bin/tensorflow/python/_tf_stack.so '-Wl,-rpath,$ORIGIN/,-rpath,$ORIGIN/..' -Wl,--version-script bazel-out/k8-py2-opt/bin/tensorflow/python/_tf_stack-version-script.lds -Wl,-no-as-needed -Wl,-z,relro,-z,now '-Wl,--build-id=md5' '-Wl,--hash-style=gnu' -no-canonical-prefixes -fno-canonical-system-headers -B/usr/bin -Wl,--gc-sections -Wl,@bazel-out/k8-py2-opt/bin/tensorflow/python/_tf_stack.so-2.params)
Execution platform: @bazel_tools//platforms:host_platform
/usr/bin/ld.gold: --push-state: unknown option
/usr/bin/ld.gold: use the --help option for usage information
collect2: error: ld returned 1 exit status
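
The failure above comes from /usr/bin/ld.gold rejecting --push-state. A quick way to see whether a particular linker binary knows an option is to look for it in the linker's --help output. The sketch below is only an illustration: the function names and the sample help text are hypothetical, and it assumes the linker lists its supported options under --help, as the error message suggests.

```python
import re
import subprocess


def advertises_option(help_text: str, option: str) -> bool:
    """Return True if `option` appears as a whole option name in --help text."""
    # Reject matches that are part of a longer option name,
    # e.g. "--push-state" must not match inside "--push-state-foo".
    pattern = r"(?<![\w-])" + re.escape(option) + r"(?![\w-])"
    return re.search(pattern, help_text) is not None


def linker_advertises(linker_path: str, option: str) -> bool:
    """Run `<linker_path> --help` and look for `option` in its output."""
    result = subprocess.run(
        [linker_path, "--help"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return advertises_option(result.stdout, option)


if __name__ == "__main__":
    # Made-up help text for illustration only; real output is much longer.
    sample_help = (
        "  --gc-sections  Remove unused sections\n"
        "  --push-state   Save the state of flags related to input files\n"
    )
    print(advertises_option(sample_help, "--push-state"))
```

Calling `linker_advertises("/usr/bin/ld.gold", "--push-state")` on the build host, and again with the linker that ships alongside the compiler, would show which of the two the link step can safely use.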

Describe the expected behavior

The link step succeeds when a fixed ld.gold version, one that accepts --push-state, is used instead of the one hardcoded at /usr/bin.

Code to reproduce the issue

Compile with Redhat's standard GCC.

Other info / logs

The solution is to use a properly patched libtool and remove the hardcoded path to /usr/bin. The fine gentlemen of the EasyBuild project have a patch that does exactly this: https://github.com/easybuilders/easybuild-easyconfigs/blob/master/easybuild/easyconfigs/t/TensorFlow/TensorFlow-1.13.1_remove_usrbin_from_linker_bin_path_flag.patch

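
The effect of dropping the hardcoded /usr/bin search path can be sketched as follows. This is not the EasyBuild patch itself, whose contents are not reproduced here. It is a hypothetical Python helper (`strip_usrbin_prefix` is an invented name) that removes the `-B/usr/bin` flag seen in the failing link command, so that GCC falls back to its default search for the linker.

```python
from typing import List


def strip_usrbin_prefix(args: List[str], prefix: str = "/usr/bin") -> List[str]:
    """Return a copy of `args` without the `-B<prefix>` search-path flag."""
    cleaned = []
    skip_next = False
    for i, arg in enumerate(args):
        if skip_next:
            # This argument was the prefix belonging to a separate "-B".
            skip_next = False
            continue
        # Joined form, as in the failing command: -B/usr/bin
        # (a trailing slash is tolerated).
        if arg.startswith("-B") and arg[2:].rstrip("/") == prefix:
            continue
        # Separated form, handled defensively: "-B" followed by the prefix.
        if arg == "-B" and i + 1 < len(args) and args[i + 1].rstrip("/") == prefix:
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned


if __name__ == "__main__":
    # Flags taken from the failing link command in this issue (shortened).
    link_args = ["-shared", "-no-canonical-prefixes", "-B/usr/bin", "-Wl,--gc-sections"]
    print(strip_usrbin_prefix(link_args))
```

In a real build the flag would have to be removed where the build configuration generates it. Filtering the arguments like this only illustrates the intended result: other `-B` prefixes are left untouched and only the /usr/bin one disappears.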
@JanuszL

JanuszL commented Jan 8, 2021

Hi,
The requirements are:

  • Ubuntu 18.04 or later (64-bit)
  • GPU support requires a CUDA®-enabled card
  • For NVIDIA GPUs, the r450 driver must be installed

So that could be a reason why it doesn't work on RHEL 7.

@surak

surak commented Jan 8, 2021

We have 4 thousand A100 GPUs in a single system. I won't change the OS because someone at NVidia can't fix a bug where things are hardcoded.

If that is the kind of fix on offer, this can be closed.

@surak surak closed this as completed Jan 8, 2021
@JanuszL

JanuszL commented Jan 8, 2021

"someone at NVidia can't fix a bug where things are hardcoded."

It is more than that. nvidia-tensorflow is built on an Ubuntu 18.04 system, so all dependencies (the linker, the libraries, and their versions) depend on it.
Making it distribution-independent is not an easy task; you can read more about this in PEP. We are aware of this limitation and have a few ideas on how to address it, but it is hard to commit to any official timeline.
Still, you can use a containerized environment.
