Struggling to switch users and maintain full cuda support #108

Closed
njacobson-nci opened this issue Mar 27, 2023 · 12 comments

@njacobson-nci

I'm trying to run this stack for a few different users and want to be able to switch the username of the notebook user when I start the container.

When I do this, terminals and notebooks spawned under the switched user don't source the jovyan .bashrc, and bitsandbytes fails to find libcudart.so.

The command I'm using (the most basic version, to rule out anything in my custom install and deployment):
docker run --gpus all -it -p 8848:8888 --user root -e NB_USER="njacobson" -e CHOWN_HOME=yes -w "/home/njacobson" cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only

I attach to the container as root and run:
mamba install cudatoolkit -y
python -m pip install bitsandbytes

python -m bitsandbytes
This fails to initialize because it can't find libcudart.so.

If I source /home/jovyan/.bashrc first, python -m bitsandbytes works.

Running a jupyter notebook via jupyterlab and importing bitsandbytes also fails.
As does a terminal spawned via jupyterlab unless I source the jovyan .bashrc.

I've tried copying the jovyan .bashrc into my home directory and chowning it. This fixes new terminals, but new notebooks still can't import bitsandbytes.

nvidia-smi, nvcc, and torch.cuda.is_available() all work in notebooks.
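
A quick way to narrow this down is to check, from inside the failing notebook or terminal, whether any directory on LD_LIBRARY_PATH actually contains libcudart.so. The snippet below is a minimal sketch of such a check; `find_library` is a hypothetical helper written for this illustration, not part of bitsandbytes or the image, and it only inspects the environment variable (it does not consult ldconfig or other loader search rules).

```python
import os
from pathlib import Path


def find_library(name_prefix, search_path):
    """Return files whose name starts with name_prefix in each directory of a
    colon-separated search path. Empty and non-existent entries are skipped."""
    matches = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue
        directory = Path(entry)
        if not directory.is_dir():
            continue
        matches.extend(sorted(directory.glob(name_prefix + "*")))
    return matches


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print("LD_LIBRARY_PATH =", path or "(unset or empty)")
    for hit in find_library("libcudart.so", path):
        print("found:", hit)
```

Running it once in a terminal where bitsandbytes works and once in a notebook where it fails shows whether the two environments see the same search path.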

Not sure if this belongs here or with docker-stacks, but figured I'd start here.

Thanks!

@benz0li
Contributor

benz0li commented Mar 29, 2023

@njacobson-nci The problem is that LD_LIBRARY_PATH is not preserved, which is essential for CUDA images.

And the default LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 must be set/extended to LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64 beforehand. [1]

FYI @mathbunnyru

Footnotes

  1. LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64
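
The extension described in this comment amounts to appending the CUDA directories to the existing value. As a rough sketch, assuming a colon-separated path and a hypothetical helper name `extend_search_path`, the same merge can be written so that empty entries and duplicates are dropped:

```python
def extend_search_path(current, extra_dirs):
    """Append extra_dirs to a colon-separated path, dropping empty entries
    and duplicates while keeping the original order."""
    merged = []
    for entry in current.split(":") + list(extra_dirs):
        if entry and entry not in merged:
            merged.append(entry)
    return ":".join(merged)


default = "/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
extended = extend_search_path(
    default, ["/usr/local/cuda/lib", "/usr/local/cuda/lib64"]
)
print(extended)
```

With the default value quoted above, this produces the same string as the footnote.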

@mathbunnyru

@benz0li as far as I understand, in docker-stacks images we do not rely on LD_LIBRARY_PATH.
And --preserve-env doesn't preserve this environment variable.
I'm not sure we should explicitly preserve it.

https://www.sudo.ws/docs/man/sudoers.man/

> The dynamic linker on most operating systems will remove variables that can control dynamic linking from the environment of set-user-ID executables, including sudo.

@benz0li
Contributor

benz0li commented Mar 29, 2023

> I'm not sure we should explicitly preserve it.

@mathbunnyru For the jupyter/docker-stacks you are not supposed to.

@benz0li
Contributor

benz0li commented Mar 29, 2023

But if someone builds the jupyter/docker-stacks on top of nvidia/cuda images, LD_LIBRARY_PATH must be preserved – i.e. start.sh modified accordingly.
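
The thread does not show the exact start.sh change, so the following is only a sketch of the idea: since the variable does not survive the user switch on its own, it can be handed to sudo explicitly as a VAR=value argument. `build_sudo_command` is a hypothetical helper for illustration, and whether sudo accepts such arguments depends on the sudoers policy in the image.

```python
import os


def build_sudo_command(user, command, env=None, keep=("PATH", "LD_LIBRARY_PATH")):
    """Build an argv list that runs `command` as `user` via sudo and passes
    the variables named in `keep` explicitly as VAR=value arguments."""
    env = os.environ if env is None else env
    argv = ["sudo", "--preserve-env", "--set-home", "--user", user]
    # Variables that are not set in `env` are left out rather than passed empty.
    argv += [f"{name}={env[name]}" for name in keep if name in env]
    return argv + list(command)


if __name__ == "__main__":
    example_env = {
        "PATH": "/opt/conda/bin:/usr/bin",
        "LD_LIBRARY_PATH": "/usr/local/cuda/lib64",
    }
    print(build_sudo_command("njacobson", ["jupyter", "lab"], env=example_env))
```

The function only builds the argument list; nothing is executed, so it can be inspected safely before being wired into a startup script.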

@mathbunnyru

Thanks @benz0li. It makes sense to me 👍

@njacobson-nci
Author

Setting LD_LIBRARY_PATH in the jupyter notebook does resolve the issue in notebooks.

It doesn't appear to be required in a terminal after sourcing the jovyan bashrc, and it's not set in that terminal session either.

Changing start.sh so that it doesn't check for an existing /home/${NB_USER} lets the script copy over the jovyan directory correctly, and new terminals are then set up properly, but it doesn't fix notebooks. I'll try preserving LD_LIBRARY_PATH and see how that goes.

import os

os.environ["LD_LIBRARY_PATH"] = "/opt/conda/lib/:/usr/local/cuda/lib64/lib"

import bitsandbytes

@njacobson-nci
Author

@benz0li Applying the fix you provided does copy LD_LIBRARY_PATH over to the environment of the Jupyter notebooks, but libcudart.so is still not found.

'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64',

This path doesn't exist in this image or any of the other ones I've used recently. It is the default LD_LIBRARY_PATH on the 11.6.2-cudnn-runtime base image, but those folders don't exist on that image either.
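
To see at a glance which entries of a search path are stale, the entries can be split by whether the directory exists. This is a small sketch; `split_existing` is a hypothetical helper name chosen for this illustration.

```python
import os


def split_existing(search_path):
    """Partition a colon-separated path into (existing, missing) directories.
    Empty entries are ignored."""
    existing, missing = [], []
    for entry in search_path.split(":"):
        if not entry:
            continue
        (existing if os.path.isdir(entry) else missing).append(entry)
    return existing, missing


if __name__ == "__main__":
    found, stale = split_existing(os.environ.get("LD_LIBRARY_PATH", ""))
    print("existing:", found)
    print("missing: ", stale)
```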

@benz0li
Contributor

benz0li commented Mar 29, 2023

> This path doesn't exist in this image or any of the other ones I've used recently. It is the default LD_LIBRARY_PATH on the 11.6.2-cudnn-runtime base image, but those folders don't exist on that image either.

That is correct. See also jupyter/docker-stacks#1792 (comment).
ℹ️ These paths /usr/local/nvidia/lib:/usr/local/nvidia/lib64 are kept for legacy reasons.

> @benz0li Applying the fix you provided does copy LD_LIBRARY_PATH over to the environment of the Jupyter notebooks, but libcudart.so is still not found.
>
> 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64',

You need to set/extend the path to LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64 beforehand. [1]

Footnotes

  1. LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64

@njacobson-nci
Author

njacobson-nci commented Mar 29, 2023

Updating start.sh to set LD_LIBRARY_PATH as you described here still has issues, but that might be a bitsandbytes thing.
'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64',

Appending /opt/conda/lib/ does let notebooks find libcudart.so.

This is what I added to start.sh to fix it for now:
LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64:/opt/conda/lib/" \

It still seems like there is a deeper issue with switching users in this manner, and I wonder if there are further bugs that will be experienced when applying this fix.

In a notebook, if I run "! ll", the alias is not found despite being defined in the jovyan/njacobson .bashrc. Is that expected?
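
One likely explanation, not confirmed anywhere in this thread: a notebook's "!" runs the command through a non-interactive shell, and bash neither reads ~/.bashrc nor expands aliases in non-interactive mode by default. The sketch below asks bash directly whether alias expansion is enabled; `alias_expansion_enabled` is a hypothetical helper, and it returns None if bash is not installed.

```python
import shutil
import subprocess


def alias_expansion_enabled(interactive=False):
    """Ask bash whether the expand_aliases option is on.
    Returns True or False, or None if bash cannot be found."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    args = [bash]
    if interactive:
        args.append("-i")
    # `shopt -q NAME` prints nothing and exits 0 only if the option is enabled.
    args += ["-c", "shopt -q expand_aliases"]
    result = subprocess.run(args, stderr=subprocess.DEVNULL)
    return result.returncode == 0


if __name__ == "__main__":
    print("non-interactive bash expands aliases:", alias_expansion_enabled())
```

If this is the cause, the alias being missing under "!" would be expected behavior and unrelated to the user switch.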

@benz0li
Contributor

benz0li commented Mar 29, 2023

@njacobson-nci I can't help you any further as my images only use Python – and don't have Conda / Mamba installed.

@njacobson-nci
Author

Understood, I appreciate the help very much!

@benz0li
Contributor

benz0li commented Mar 29, 2023

P.S.: You can always install Conda / Mamba on user level in my images.
