
Updating to Cuda 11.1 and Ubuntu 20.04 #30

Closed
jnfinitym opened this issue Sep 28, 2020 · 18 comments

@jnfinitym

With all the shiny new GPUs coming out recently, I propose updating to use images that run on CUDA 11.1.
I will try to do that in a fork over the next few weeks. If the maintainer(s) here think this is a good plan, I am happy to submit a pull request once it is done and, as soon as I get my card, test it on an RTX 3080 to make sure it runs as it should.
In the same breath, I also propose moving the images to the new Ubuntu LTS.

@mathematicalmichael
Contributor

We're working on the LTS as part of a larger effort in #27, but unfortunately I've been short on time. It will happen though. Congrats on landing one; I feel it'll be a while before I'm able to.
That said, we plan to support several CUDA versions. While I personally like being on the cutting edge as well, I've encountered lots of projects that rely on older versions, so it's good to have options. The idea behind investing in the CI/CD effort is that it should make it easier to validate the different build combinations (at least that they complete) instead of doing so locally as we have been.

@ChristophSchranz
Collaborator

Thank you @jnfinitym!

I'm looking forward to seeing your PR :)

@Manouchehri

I'd be happy to test any PRs with my RTX 3090.

@Manouchehri

Not sure if this is because of the outdated library or not (I'm new to CUDA), but this is what happens with the existing build:

>>> import tensorflow
>>> from tensorflow.python.client import device_lib
>>> tensorflow.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> print(device_lib.list_local_devices())
RuntimeErrorTraceback (most recent call last)

<ipython-input-4-57062467377b> in <module>
----> 1 print(device_lib.list_local_devices())


/opt/conda/lib/python3.7/site-packages/tensorflow/python/client/device_lib.py in list_local_devices(session_config)
     41     serialized_config = session_config.SerializeToString()
     42   return [
---> 43       _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
     44   ]


RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

@mathematicalmichael
Contributor

My kernel just updated to CUDA 11.0, and I had an existing jupyter container running that was based on an older CUDA version. @Manouchehri I ran your code without any runtime errors.

In [3]: tensorflow.config.list_physical_devices('GPU')                                     
2020-10-06 05:41:19.030089: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-06 05:41:19.045054: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-06 05:41:19.045258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.875GHz coreCount: 24 deviceMemorySize: 5.77GiB deviceMemoryBandwidth: 268.26GiB/s
2020-10-06 05:41:19.045302: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-06 05:41:19.080290: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-06 05:41:19.100038: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-06 05:41:19.105031: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-06 05:41:19.140467: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-06 05:41:19.145395: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-06 05:41:19.145608: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-10-06 05:41:19.145620: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

I think the cuDNN issue has been resolved with the latest image; this is an older one. Something to consider for sure... people's computers are updating their drivers automatically, and usually containers don't care, but here they do...

Anyway, how did you set up the nvidia drivers on your computer? I'll push images with newer versions of CUDA this week and ping you, but this would help debug for now.

@Manouchehri any tips for getting the card? Been incredibly challenging.

@mathematicalmichael
Contributor

mathematicalmichael commented Oct 6, 2020

I was also able to run this very basic example, which tells me that somehow my nvidia libs updated and, despite using an old container, I could still do math on the GPU. YMMV.
[screenshot of the notebook output]
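For reference, a minimal sketch of that kind of sanity check (my reconstruction, not the exact cells from the screenshot):

import tensorflow as tf

# Run a small matrix multiplication explicitly on the first GPU and
# confirm that the result tensor actually lives on the GPU device.
with tf.device('/GPU:0'):
    a = tf.random.normal((1000, 1000))
    b = tf.random.normal((1000, 1000))
    c = tf.matmul(a, b)
print(c.device)                 # expected: .../device:GPU:0
print(float(tf.reduce_sum(c)))  # forces execution of the matmul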

@Manouchehri

Manouchehri commented Oct 6, 2020

@mathematicalmichael

Anyway, how did you set up the nvidia drivers on your computer? I'll push images with newer versions of cuda this week and ping you, but this would help debug for now.

I installed the beta drivers off of Nvidia's website (I thought the beta would be required, as support was only just added in the 455.23.04 release).

wget "https://us.download.nvidia.com/XFree86/Linux-x86_64/455.23.04/NVIDIA-Linux-x86_64-455.23.04.run"
chmod +x NVIDIA-Linux-x86_64-455.23.04.run
sudo ./NVIDIA-Linux-x86_64-455.23.04.run # I kept the defaults, except I said "no" to having my Xorg config updated. It's a headless VM.
dave@ubuntu:~$ nvidia-smi
Tue Oct  6 11:53:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.04    Driver Version: 455.23.04    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:0C:00.0 Off |                  N/A |
| 52%   65C    P2   339W / 350W |    656MiB / 24268MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      8106      C   /opt/conda/bin/python             253MiB |
|    0   N/A  N/A     20367      C   ...13/Core_22.fah/FahCore_22      401MiB |
+-----------------------------------------------------------------------------+

@Manouchehri any tips for getting the card? Been incredibly challenging.

I joined the community NVIDIA Discord, and on the RTX 3090 launch day someone shared a link that would add the card directly to your cart, so you only had to load two or three pages. That basically cut the number of clicks in half.

It was a last minute change on the web store, so I don't think most bot authors had a chance to update their scripts before us humans grabbed all of them. (My order was placed at 9:13 AM EST, so it definitely didn't sell out in seconds like the RTX 3080.)

@Manouchehri

Using tensorflow/tensorflow:nightly-gpu-jupyter instead of gpu-jupyter seems to work, so my guess is that mismatched library versions are causing the problem in #30 (comment).

docker run --gpus all -d -it -p 127.0.0.1:8888:8888 -v $(pwd)/data:/mnt/space/ml -e GRANT_SUDO=yes --name tf-nightly-gpu-jupyter_1 tensorflow/tensorflow:nightly-gpu-jupyter
>>> import tensorflow
>>> from tensorflow.python.client import device_lib
>>> tensorflow.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> print(device_lib.list_local_devices())
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3954530472912642823
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 23087702400
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12185603361499007737
physical_device_desc: "device: 0, name: GeForce RTX 3090, pci bus id: 0000:0c:00.0, compute capability: 8.6"
]

@mathematicalmichael
Contributor

@Manouchehri are there any upsides to using this project instead of tensorflow's image? I wasn't aware they had one like that. I suppose it includes some different packages, but practically speaking, how different are they?


@edurenye
Contributor

I'm trying to update the base image to FROM nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 or any newer image.
When running:

RUN conda install --quiet --yes \
     pytorch \
     torchvision \
     cudatoolkit=11.0 -c pytorch

Fails with:

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.31=0
  - feature:|@/linux-64::__glibc==2.31=0
  - configurable-http-proxy -> nodejs -> __glibc[version='>=2.17,<3.0.a0']
  - cudatoolkit=11.0 -> __glibc[version='>=2.17,<3.0.a0']
  - jupyterhub=1.3.0 -> nodejs[version='>=12'] -> __glibc[version='>=2.17,<3.0.a0']
  - nodejs -> __glibc[version='>=2.17,<3.0.a0']
  - pytorch -> cudatoolkit[version='>=11.0,<11.1'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0']
  - torchvision -> cudatoolkit[version='>=11.0,<11.1'] -> __glibc[version='>=2.17,<3.0.a0']

Your installed version is: 2.31

Seems to be related to: pytorch/vision#3264 and pytorch/vision#3207
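One possible workaround (an assumption on my side, not something verified in this thread) would be to skip conda for these two packages and install the CUDA 11.0 wheels with pip instead, which sidesteps the conda solver's __glibc constraint; the version pins below are illustrative:

RUN pip install --no-cache-dir \
     torch==1.7.1+cu110 \
     torchvision==0.8.2+cu110 \
     -f https://download.pytorch.org/whl/torch_stable.html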

@ChristophSchranz
Collaborator

There also seem to be version conflicts on WSL 2 with CUDA 11.1:
NVIDIA/nvidia-docker#1458

As soon as PyTorch, Tensorflow and WSL 2 support CUDA 11, we will update.

Has anyone experienced severe disadvantages with CUDA 10.1? Maybe switching to 10.2 with 10.2-cudnn8-runtime-ubuntu18.04 would be an intermediate option?

@ChristophSchranz
Collaborator

CUDA 10.2 seems to work well.
We will stay at 10.2 until CUDA 11.X works.

@ChristophSchranz
Collaborator

ChristophSchranz commented Apr 13, 2021

Good news for this issue:
with the latest updates, Tensorflow supports CUDA 11.0 with cuDNN 8.0 (Tensorflow-supports).

I've created a branch v1.4_cuda-11.0_ubuntu-18.04 for images based on nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu18.04
The resulting images are tagged with cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-18.04
Let me know if they work as expected (they do on my machine).

@ChristophSchranz
Collaborator

I've created images for CUDA 11.0 and Ubuntu 20.04 that are available on Dockerhub:

  • v1.4_cuda-11.0_ubuntu-20.04 (full image)
  • v1.4_cuda-11.0_ubuntu-20.04_python-only (only with a python interpreter and without Julia and R)
  • v1.4_cuda-11.0_ubuntu-20.04_slim (only with a python interpreter and without additional packages)

I think I can close this issue now. If a new CUDA version is supported (especially by Tensorflow), you can reopen this issue.
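To try one of them, here is a minimal sketch mirroring the docker run command used earlier in this thread (port and flags are just one reasonable choice):

docker run --gpus all -d -it -p 127.0.0.1:8888:8888 --name gpu-jupyter_1 cschranz/gpu-jupyter:v1.4_cuda-11.0_ubuntu-20.04

The TensorFlow check from the earlier comments (device_lib.list_local_devices()) should then list the GPU.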

@xhejtman

Hello, is the version v1.4_cuda-11.0_ubuntu-20.04 expected to work on CUDA 11.0?

It seems it is still linked against CUDA 10.1:
$ python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-04-20 18:59:44.572229: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-04-20 18:59:44.572341: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
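One way to double-check which CUDA version the installed TensorFlow was built against (a diagnostic sketch; tf.sysconfig.get_build_info() is available from TensorFlow 2.3 onwards):

import tensorflow as tf

# Reports the CUDA/cuDNN versions TensorFlow was compiled against,
# independent of what the base image or host driver provides.
info = tf.sysconfig.get_build_info()
print(info.get("cuda_version"), info.get("cudnn_version"))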

@ChristophSchranz
Collaborator

You are right. The problem was that TensorFlow was not updated, and the older version depends on CUDA 10.1.

The update is on the way.

@ChristophSchranz
Collaborator

Commit e6300cd should have solved this issue. The images are currently being built and pushed.
