Issues with installing the Docker container in new version 1.0.1 #2929

Open
AlexanderRitter02 opened this issue Feb 19, 2024 · 5 comments


@AlexanderRitter02

Describe the bug

I was using the previous 2023 version (0.3.4) of Nerfstudio in the docker container.

Upgrading to the new 1.0.1 caused issues. The image pull worked,
but when I try to run the container I get an nvidia-container-cli mount error:

> docker run --gpus all -v O:\nerfstudio_docker_new:/workspace/ \ 
-v O:\nerfstudio_cache_new:/home/user/.cache/ -p 7007:7007 --rm -it --shm-size=12gb dromni/nerfstudio:1.0.1

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/f4d3ddcd6eb213766b91e92283207cef5bdbd09256bc6b285600d02a8a8bd747/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.

To Reproduce
Steps to reproduce the behavior:

  1. Run docker pull dromni/nerfstudio:1.0.1
  2. Then start the interactive container using the command from https://github.com/nerfstudio-project/nerfstudio/blob/main/docs/quickstart/installation.md#using-an-interactive-container
  3. Result: It fails to create the task for the container with the error specified above

Expected behavior
To start the interactive container without issues just as it does with version 0.3.4.

Additional context
Docker running on Windows 10 with WSL2

@dsjstc

dsjstc commented Mar 28, 2024

I can confirm this behaviour on Win11/wsl2 with nerfstudio 1.0.2 as well.

I wondered if the Docker Desktop nvidia runtime might be attempting to re-install the file, but even specifying DOCKER_RUNTIME=runc didn't correct the error.

@wvnuw

wvnuw commented Apr 10, 2024

Have you found a solution to this? I get the same error when I try to run dromni/nerfstudio:1.0.3. I am using Windows 11 and my NVIDIA driver version is 536.25. I don't get the error when I use dromni/nerfstudio:0.3.1.

@AlexanderRitter02
Author

AlexanderRitter02 commented Apr 10, 2024

@dsjstc @wvnuw Yes, I found the solution to this. Sorry for having forgotten about posting here.

Cause

The issue seems to be that the images were built incorrectly: Nvidia libraries such as "/usr/lib/x86_64-linux-gnu/libcuda.so.1" (and others) already exist in the image as "ghost" libraries,
according to sources in NVIDIA/nvidia-container-toolkit#289.
The nvidia runtime will only inject the working files into the container if these "ghost" libraries are not present in the image.

This may have been caused by building the image with the nvidia runtime. See below for a possible permanent solution on Nerfstudio's side and the quick workaround that I used.
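
To check whether an image is affected, something like the following could be run inside a container started without GPU support (no --gpus flag), so that the runtime does not try to mount anything. This is only a rough sketch: the file name patterns are the ones removed in the workaround in this comment, other driver libraries may be involved as well, and the function name find_ghost_libs is made up for this example.

```python
import fnmatch
import os

# Patterns taken from the workaround in this thread; the list may be incomplete.
GHOST_PATTERNS = (
    "libcuda.so.1",
    "libnvidia-*.so.1",
    "libnvcuvid.so.1",
    "libcudadebugger.so.1",
)


def find_ghost_libs(lib_dir="/usr/lib/x86_64-linux-gnu"):
    """Return the sorted paths in lib_dir whose names match a driver-library pattern."""
    if not os.path.isdir(lib_dir):
        return []
    matches = [
        name
        for name in os.listdir(lib_dir)
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in GHOST_PATTERNS)
    ]
    return sorted(os.path.join(lib_dir, name) for name in matches)


if __name__ == "__main__":
    # Any path printed here already exists in the image and may block the mount.
    for path in find_ghost_libs():
        print(path)
```

If this prints any paths in a container that was started without the nvidia runtime, those files are baked into the image and are candidates for the mount error above.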

Workaround

You can solve this personally by creating a derivative image of Nerfstudio:

  1. Create a new Dockerfile with the following contents, which removes all problematic libraries in the new image:
    (For the purposes of this example the Dockerfile will be named nerfstudio_1.0.2_fixed)

    FROM dromni/nerfstudio:1.0.2
    
    RUN sudo rm -rf /usr/lib/x86_64-linux-gnu/libcuda.so.1 \
        /usr/lib/x86_64-linux-gnu/libnvidia-*.so.1 \
        /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 \
        /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
    
  2. Build the Dockerfile using your WSL distribution (e.g. Ubuntu).
    Make sure that you are NOT building with nvidia set as your default runtime (check in daemon.json).
    Because the file is not named Dockerfile, pass its name with -f.

    sudo docker build -f nerfstudio_1.0.2_fixed -t "nerfstudio_1.0.2_fixed:Dockerfile" .
    
  3. You can now run it using your usual command, just switch out the image name for the name of the newly built image (here nerfstudio_1.0.2_fixed:Dockerfile):

    docker run --gpus all -v O:\ns_docker:/workspace/ -v O:\ns_cache:/home/user/.cache/ -p 7007:7007 \
      --rm -it --shm-size=12gb nerfstudio_1.0.2_fixed:Dockerfile
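
Since the same error was reported in this thread for 1.0.1, 1.0.2 and 1.0.3, the Dockerfile from step 1 could be generated per tag. A minimal sketch, assuming the same set of libraries is affected in each release (not verified here); render_fixed_dockerfile is a hypothetical helper name.

```python
# Library patterns copied from the Dockerfile in step 1 of the workaround.
GHOST_LIB_PATTERNS = (
    "libcuda.so.1",
    "libnvidia-*.so.1",
    "libnvcuvid.so.1",
    "libcudadebugger.so.1",
)
LIB_DIR = "/usr/lib/x86_64-linux-gnu"


def render_fixed_dockerfile(tag):
    """Return Dockerfile text that derives from dromni/nerfstudio:<tag> and deletes the listed libraries."""
    paths = " \\\n    ".join(f"{LIB_DIR}/{pattern}" for pattern in GHOST_LIB_PATTERNS)
    return f"FROM dromni/nerfstudio:{tag}\n\nRUN sudo rm -rf {paths}\n"


if __name__ == "__main__":
    # Assumption: 1.0.3 needs the same removals as 1.0.2.
    print(render_fixed_dockerfile("1.0.3"), end="")
```

Redirect the output to a file and build it as in step 2, adjusting the file and image names to the tag you chose.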
    

Permanent Solution

A permanent solution would require the Nerfstudio Project building the Docker container without these libraries.

According to: NVIDIA/nvidia-container-toolkit#289 (comment)

Having these „ghost“ libraries in your image is most often the result of building the image with nvidia set as the default runtime in docker. Building MUST be done without nvidia set as the runtime.

So they'd need to make sure that nvidia is NOT set as the default runtime in the daemon.json file while building it for release.
However, I have no idea how that would affect the container on other operating systems.
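
For anyone who wants to check this before building, here is a small sketch that inspects the text of a daemon.json. It assumes the default runtime is configured through the "default-runtime" key, and nvidia_is_default_runtime is a made-up name for this example. The file location varies by setup (commonly /etc/docker/daemon.json on Linux), so the function takes the file contents rather than a path.

```python
import json


def nvidia_is_default_runtime(daemon_json_text):
    """Return True if the daemon config text sets "default-runtime" to "nvidia"."""
    if not daemon_json_text.strip():
        # An empty or missing config means Docker uses its built-in default runtime.
        return False
    config = json.loads(daemon_json_text)
    return config.get("default-runtime") == "nvidia"


if __name__ == "__main__":
    example = '{"default-runtime": "nvidia", "runtimes": {"nvidia": {"path": "nvidia-container-runtime"}}}'
    print(nvidia_is_default_runtime(example))
```

Having nvidia listed under "runtimes" is not flagged by this check. Only the default matters for the build, according to the quoted comment.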

@wvnuw

wvnuw commented Apr 11, 2024

@AlexanderRitter02 Thank you! This worked wonders.

@GateraGael

Hello, I can confirm that this solution also works for Docker Desktop (Windows 11).
