
Updating to Cuda 11.1 and Ubuntu 20.04 #30

Closed
jnfinitym opened this issue Sep 28, 2020 · 18 comments

@jnfinitym

With all the shiny new GPUs coming out recently, I propose updating to use images that run on CUDA 11.1.
I will try to do that in a fork over the next few weeks. If the maintainer(s) here think this is a good plan, I am happy to submit a pull request once it is done and, as soon as I get my card, test it on an RTX 3080 to make sure it runs as it should.
In the same breath, I also propose moving the images to the new Ubuntu LTS.

@mathematicalmichael
Contributor

We're working on the LTS as part of a larger effort in #27, but unfortunately I've been short on time. It will happen though. Congrats on landing one; I feel it'll be a while before I'm able to.
That said, we plan to support several CUDA versions. While I personally like being on the cutting edge as well, I've encountered lots of projects that rely on older versions, so it's good to have options. The idea behind investing in the CI/CD effort is that it should make it easier to validate the different build combinations (at least that they complete) instead of doing so locally as we have been.

@ChristophSchranz
Collaborator

Thank you @jnfinitym!

I'm looking forward to seeing your PR :)

@Manouchehri

I'd be happy to test any PRs with my RTX 3090.

@Manouchehri

Not sure if this is because of the outdated library or not (I'm new to CUDA), but this is what happens with the existing build:

>>> import tensorflow
>>> from tensorflow.python.client import device_lib
>>> tensorflow.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> print(device_lib.list_local_devices())
RuntimeErrorTraceback (most recent call last)

<ipython-input-4-57062467377b> in <module>
----> 1 print(device_lib.list_local_devices())


/opt/conda/lib/python3.7/site-packages/tensorflow/python/client/device_lib.py in list_local_devices(session_config)
     41     serialized_config = session_config.SerializeToString()
     42   return [
---> 43       _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
     44   ]


RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

@mathematicalmichael
Contributor

My kernel just updated to CUDA 11.0, and I had an existing jupyter container running that was based on an older CUDA version. @Manouchehri I ran your code without any runtime errors.

In [3]: tensorflow.config.list_physical_devices('GPU')                                     
2020-10-06 05:41:19.030089: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-06 05:41:19.045054: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-06 05:41:19.045258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.875GHz coreCount: 24 deviceMemorySize: 5.77GiB deviceMemoryBandwidth: 268.26GiB/s
2020-10-06 05:41:19.045302: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-06 05:41:19.080290: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-06 05:41:19.100038: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-06 05:41:19.105031: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-06 05:41:19.140467: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-06 05:41:19.145395: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-06 05:41:19.145608: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-10-06 05:41:19.145620: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

I think the cuDNN issue has been resolved with the latest image; this is an older one. Something to consider for sure... people's computers are updating their drivers automatically, and usually containers don't care, but here they do...

Anyway, how did you set up the nvidia drivers on your computer? I'll push images with newer versions of CUDA this week and ping you, but this would help debug for now.

@Manouchehri any tips for getting the card? Been incredibly challenging.

@mathematicalmichael
Contributor

mathematicalmichael commented Oct 6, 2020

I was also able to run this very basic example, which tells me that somehow my nvidia libs updated and, despite using an old container, I could still do math on the GPU. YMMV.
[screenshot of the notebook output]
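For reference, a minimal sketch of that kind of sanity check (my reconstruction, not the exact cells from the screenshot):

import tensorflow as tf

# Run a small matrix multiplication explicitly on the first GPU and
# confirm that the result tensor actually lives on the GPU device.
with tf.device('/GPU:0'):
    a = tf.random.normal((1000, 1000))
    b = tf.random.normal((1000, 1000))
    c = tf.matmul(a, b)
print(c.device)                 # expected: .../device:GPU:0
print(float(tf.reduce_sum(c)))  # forces execution of the matmul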

@Manouchehri

Manouchehri commented Oct 6, 2020

@mathematicalmichael

Anyway, how did you set up the nvidia drivers on your computer? I'll push images with newer versions of cuda this week and ping you, but this would help debug for now.

I installed the beta drivers off of Nvidia's website (I thought the beta would be required, as support was only just added in the 455.23.04 release).

wget "https://us.download.nvidia.com/XFree86/Linux-x86_64/455.23.04/NVIDIA-Linux-x86_64-455.23.04.run"
chmod +x NVIDIA-Linux-x86_64-455.23.04.run
sudo ./NVIDIA-Linux-x86_64-455.23.04.run # I kept the defaults, except I said "no" to having my Xorg config updated. It's a headless VM.
dave@ubuntu:~$ nvidia-smi
Tue Oct  6 11:53:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04    Driver Version: 455.23.04    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:0C:00.0 Off |                  N/A |
| 52%   65C    P2   339W / 350W |    656MiB / 24268MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8106      C   /opt/conda/bin/python             253MiB |
|    0   N/A  N/A     20367      C   ...13/Core_22.fah/FahCore_22      401MiB |
+-----------------------------------------------------------------------------+

@Manouchehri any tips for getting the card? Been incredibly challenging.

I joined the community NVIDIA Discord, and on the RTX 3090 launch day someone shared a link that would add the card directly to your cart, so you only had to load two or three pages. That basically cut the number of clicks in half.

It was a last minute change on the web store, so I don't think most bot authors had a chance to update their scripts before us humans grabbed all of them. (My order was placed at 9:13 AM EST, so it definitely didn't sell out in seconds like the RTX 3080.)

@Manouchehri

Using tensorflow/tensorflow:nightly-gpu-jupyter instead of gpu-jupyter seems to work, so my guess is that mismatched library versions are causing the problem in #30 (comment).

docker run --gpus all -d -it -p 127.0.0.1:8888:8888 -v $(pwd)/data:/mnt/space/ml -e GRANT_SUDO=yes --name tf-nightly-gpu-jupyter_1 tensorflow/tensorflow:nightly-gpu-jupyter
>>> import tensorflow
>>> from tensorflow.python.client import device_lib
>>> tensorflow.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> print(device_lib.list_local_devices())
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3954530472912642823
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 23087702400
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12185603361499007737
physical_device_desc: "device: 0, name: GeForce RTX 3090, pci bus id: 0000:0c:00.0, compute capability: 8.6"
]

@mathematicalmichael
Contributor

@Manouchehri are there any upsides to using this project instead of tensorflow's image? I wasn't aware they had one like that. I suppose it includes some different packages, but practically speaking, how different are they?


@edurenye
Contributor

I'm trying to update the base image to FROM nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 or any newer image.
When running:

RUN conda install --quiet --yes \
     pytorch \
     torchvision \
     cudatoolkit=11.0 -c pytorch

Fails with:

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.31=0
  - feature:|@/linux-64::__glibc==2.31=0
  - configurable-http-proxy -> nodejs -> __glibc[version='>=2.17,<3.0.a0']
  - cudatoolkit=11.0 -> __glibc[version='>=2.17,<3.0.a0']
  - jupyterhub=1.3.0 -> nodejs[version='>=12'] -> __glibc[version='>=2.17,<3.0.a0']
  - nodejs -> __glibc[version='>=2.17,<3.0.a0']
  - pytorch -> cudatoolkit[version='>=11.0,<11.1'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0']
  - torchvision -> cudatoolkit[version='>=11.0,<11.1'] -> __glibc[version='>=2.17,<3.0.a0']

Your installed version is: 2.31

Seems to be related to: pytorch/vision#3264 and pytorch/vision#3207
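One possible workaround (an assumption on my side, not something verified in this thread) would be to skip conda for these two packages and install the CUDA 11.0 wheels with pip instead, which sidesteps the conda solver's __glibc constraint; the version pins below are illustrative:

RUN pip install --no-cache-dir \
     torch==1.7.1+cu110 \
     torchvision==0.8.2+cu110 \
     -f https://download.pytorch.org/whl/torch_stable.html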

@ChristophSchranz
Collaborator

There also seem to be version conflicts on WSL 2 with CUDA 11.1:
NVIDIA/nvidia-docker#1458

As soon as PyTorch, Tensorflow and WSL 2 support CUDA 11, we will update.

Has anyone experienced severe disadvantages with CUDA 10.1? Maybe switching to 10.2 with 10.2-cudnn8-runtime-ubuntu18.04 would be an intermediate option?

@ChristophSchranz
Collaborator

CUDA 10.2 seems to work well.
We will stay at 10.2 until CUDA 11.X works.

@ChristophSchranz
Collaborator

ChristophSchranz commented Apr 13, 2021

Good news for this issue:
with the latest updates, Tensorflow supports CUDA 11.0 with cuDNN 8.0 (Tensorflow-supports).

I've created a branch v1.4_cuda-11.0_ubuntu-18.04 for images based on nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu18.04
The resulting images are tagged with cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-18.04
Let me know if they work as expected (they do on my machine).

@ChristophSchranz
Collaborator

I've created images for CUDA 11.0 and Ubuntu 20.04 that are available on Dockerhub:

  • v1.4_cuda-11.0_ubuntu-20.04 (full image)
  • v1.4_cuda-11.0_ubuntu-20.04_python-only (only with a python interpreter and without Julia and R)
  • v1.4_cuda-11.0_ubuntu-20.04_slim (only with a python interpreter and without additional packages)

I think I can close this issue now. If a new CUDA version is supported (especially by Tensorflow), you can reopen this issue.
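To try one of them, here is a minimal sketch mirroring the docker run command used earlier in this thread (port and flags are just one reasonable choice):

docker run --gpus all -d -it -p 127.0.0.1:8888:8888 --name gpu-jupyter_1 cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-20.04

The TensorFlow check from the earlier comments (device_lib.list_local_devices()) should then list the GPU.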

@xhejtman

Hello, is the version v1.4_cuda-11.0_ubuntu-20.04 expected to work on CUDA 11.0?

It seems it is still linked against CUDA 10.1:
$ python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-04-20 18:59:44.572229: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-20 18:59:44.572341: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
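One way to double-check which CUDA version the installed TensorFlow was built against (a diagnostic sketch; tf.sysconfig.get_build_info() is available from TensorFlow 2.3 onwards):

import tensorflow as tf

# Reports the CUDA/cuDNN versions TensorFlow was compiled against,
# independent of what the base image or host driver provides.
info = tf.sysconfig.get_build_info()
print(info.get("cuda_version"), info.get("cudnn_version"))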

@ChristophSchranz
Collaborator

You are right. The problem was that TensorFlow was not updated, and the older version depends on CUDA 10.1.

The update is on the way.

@ChristophSchranz
Collaborator

Commit e6300cd should have solved this issue. The images are currently being built and pushed.
