
Add cuda12 variant of tensorflow-notebook #2100

Merged
merged 13 commits into from Mar 26, 2024

Conversation

ChristofKaufmann
Contributor

Describe your changes

  • This adds a cuda12 variant of the tensorflow-notebook, analogous to eccda24 for the pytorch-notebook.
  • The CPU version now uses the tensorflow-cpu wheel (to reduce the image size).
  • Regarding a cuda11 variant: the current version of TensorFlow is 2.16.1, and the last PyPI wheel compatible with CUDA 11.8 seems to be TensorFlow 2.14.1 (according to the officially tested versions). I still tried TensorFlow 2.16.1 with CUDA 11.8.0, but it didn't work. The current version of tensorflow-gpu on conda-forge is 2.15.0 and has a CUDA 11.8 build. So if you want a cuda11 variant, I can try to use the conda-forge version for that, but its TensorFlow version is not up to date.

Issue ticket if applicable

Fixes: #2095, #1557.

Checklist (especially for first-time contributors)

  • I have performed a self-review of my code
  • If it is a core feature, I have added thorough tests
  • I will try not to use force-push to make the review process easier for reviewers
  • I have updated the documentation for significant changes

@mathbunnyru
Member

Could you please fix tests?

docs/using/selecting.md (resolved review thread)
images/tensorflow-notebook/cuda12/Dockerfile (resolved review thread)
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# Install CUDA libs and cuDNN with mamba
RUN mamba install --yes -c nvidia/label/cuda-12.3.2 \
Member

Is there a chance not to hardcode the minor.patch version here?
Also, 12.4 has been released.

Contributor Author

I used 12.3 because it is listed in the tested build configurations. There seems to be no label like cuda-12. Without a label, i.e. just -c nvidia, I got the latest CUDA version, which is currently 12.4. It worked, but I thought it was risky: a new CUDA release might be incompatible (though I guess not before 13.x), and we do not have a unit test to catch that, e.g. one checking the output of python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))". Should we add a unit test like this?
Secondly, the pytorch-notebook pins the minor version as well.
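As a sketch of what such a test could look like: the check boils down to inspecting the output of tf.config.list_physical_devices('GPU'), which prints [] when no GPU is visible. The helper names below are hypothetical, and the subprocess call naturally requires TensorFlow in the image under test:

```python
import subprocess
import sys


def gpu_device_list(python: str = sys.executable) -> str:
    """Run the GPU check in a fresh interpreter and return its stdout.

    Hypothetical helper: requires TensorFlow installed in the target
    interpreter; TensorFlow prints "[]" when it sees no GPU.
    """
    code = "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
    result = subprocess.run([python, "-c", code], capture_output=True, text=True)
    return result.stdout.strip()


def has_gpu(device_list: str) -> bool:
    # "[]" (or empty output) means no visible GPU; otherwise TensorFlow
    # prints a list of PhysicalDevice entries
    return device_list not in ("", "[]")
```

On a runner without a GPU, has_gpu(gpu_device_list()) would be False, so such a test could at most verify that the import succeeds.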

Member

I used 12.3, because it is listed in the tested build configurations.

There is a chance that the table might be slightly outdated.

There seems to be no label like cuda-12. Without a label, i.e. just -c nvidia, I got the latest CUDA version, which is currently 12.4.

Can we use something like nvidia/label/cuda-12.*?

Should we add a unit test like this?

I am not sure if this test will work with a regular GitHub-hosted ubuntu runner.
Also, we currently don't have a way to run a test for a variant image (but it would not be difficult to add something like this).

Contributor Author

There is a chance that the table might be slightly outdated.

I looked at the table just a few hours after the release and it was up to date. But it won't ever be tested against a newer version.

Can we use something like nvidia/label/cuda-12.*?

No, there is no such label. Here is a list.

I am not sure if this test will work with a regular GitHub-hosted ubuntu runner.

Right, that is always problematic. Sorry.

Member

Can we use something like nvidia/label/cuda-12.*?

No, there is no such label. Here is a list.

I meant that maybe mamba supports label patterns (I have no idea whether it does).
In that case we would be fine with the existing tags and wouldn't need to hardcode a particular version.

Contributor Author

Regardless of the labels, I noticed that all package versions are also available in the main label. The relationship between labels and versions is a bit strange: labels are formatted as major.minor.patch, while package versions are major.minor.build. Usually, when the patch number increments, the build number just continues:

  • label: cuda-11.6.0, version: 11.6.55
  • label: cuda-11.6.1, version: 11.6.112
  • label: cuda-11.6.2, version: 11.6.124

There is one exception: labels cuda-11.4.3 and cuda-11.4.4 both have version 11.4.152.
Nevertheless, I just tried:

  • mamba install -c nvidia 'cuda-nvcc<13' and got version 12.4.99 (which is also in label cuda-12.4.0)
  • mamba install -c nvidia 'cuda-nvcc<12' and got version 11.8.89 (which is also in label cuda-11.8.0)
  • mamba install -c nvidia 'cuda-nvcc<11.5' and got version 11.4.152 (which is also in labels cuda-11.4.3 and cuda-11.4.4)
  • mamba install -c nvidia 'cuda-nvcc=12.3' and got version 12.3.107 (which is also in label cuda-12.3.2)

So we could use 'cuda-nvcc<13' to reduce maintenance work (avoiding a version bump for every TensorFlow release), but the resulting versions are not officially tested by TensorFlow (I am not sure whether incompatibilities can occur with new minor versions). Using something like 'cuda-nvcc=12.3' is more work (while still avoiding the patch version), but officially tested.
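The difference between the two constraint styles tried above can be illustrated with a small sketch (this mimics only the ordering and prefix semantics, not mamba's actual match-spec resolver):

```python
def parse(version: str) -> tuple[int, ...]:
    # "12.4.99" -> (12, 4, 99)
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, constraint: str) -> bool:
    """Check 'cuda-nvcc<13'-style ('<' is a strict upper bound) and
    'cuda-nvcc=12.3'-style ('=' matches a version prefix) constraints."""
    if constraint.startswith("<"):
        return parse(version) < parse(constraint[1:])
    if constraint.startswith("="):
        prefix = constraint[1:].split(".")
        return version.split(".")[: len(prefix)] == prefix
    raise ValueError(f"unsupported constraint: {constraint}")
```

With this, '<13' admits any 12.x build (including future untested minors), while '=12.3' stays on the tested minor but still floats over patch/build numbers.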

Member

I'm ok with cuda-nvcc=12.3, but please add a comment explaining why we chose this version.

Contributor Author

This did not work, since the dependencies have no version constraints, so using cuda-nvcc=12.3 with the nvidia channel resulted in a mixture of 12.3 and 12.4 packages. NVIDIA is quite sloppy with their packaging.

Then I noticed that cudnn from the nvidia channel is outdated; apparently they dropped support three years ago. The cudnn from conda-forge is quite up to date, but there is no CUDA 12 build yet, only CUDA 11.8.

So I would like to go with the new installation method supported by TensorFlow, which is basically just pip install tensorflow[and-cuda]. This also has the advantage that the installed CUDA version is always the officially tested one, meaning less maintenance for you. Usually the paths to the nvidia libs should be found automatically, but in 2.16.1 there seems to be a bug, so we have to add them ourselves; I prepared an activation script for that. With this method, LD_LIBRARY_PATH is also not polluted, because the paths from the pip installation contain only the nvidia libs. Before, we added ${CONDA_DIR}/lib/, which contains quite a lot of libraries.
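The path handling described above could look roughly like this; the function names are hypothetical, but pip does place the CUDA libraries under site-packages/nvidia/&lt;package&gt;/lib/, which is what an activation script would have to collect:

```python
from pathlib import Path


def nvidia_lib_dirs(site_packages: str) -> list[str]:
    """Collect the lib/ directories of pip-installed nvidia-* packages.

    Unlike prepending ${CONDA_DIR}/lib/, these directories contain
    only the NVIDIA libraries, so LD_LIBRARY_PATH stays clean.
    """
    return sorted(str(p) for p in Path(site_packages).glob("nvidia/*/lib") if p.is_dir())


def ld_library_path(site_packages: str) -> str:
    # The value an activation script could export as LD_LIBRARY_PATH
    return ":".join(nvidia_lib_dirs(site_packages))
```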

Member

Ok, let’s try it

images/tensorflow-notebook/cuda12/Dockerfile (resolved review thread)
NVIDIA_DRIVER_CAPABILITIES="compute,utility"

# Install Tensorflow with pip
RUN pip install --no-cache-dir tensorflow && \
Member

Maybe we can use tensorflow from mamba?
I think in such a case we won't even need to list the dependencies.

Collaborator

  • The current version of tensorflow-gpu on conda-forge is 2.15.0 and has a CUDA 11.8 build. So if you want a cuda11 variant, I can try to use the conda-forge version for that, but the TensorFlow version is not up-to-date.

This is the kind of complexity I recall from getting a tensorflow GPU image working: the conda-forge version is often outdated, and installing with pip for GPU support was complicated.

I think if we have to choose between install complexity and relying on something that is regularly outdated, the install complexity may be preferred; otherwise we introduce things we can't control. At the same time, the fact that it is outdated relates to how complicated it may be to keep installing something that works over time, which is a burden we would then be taking on.

This PR will probably demonstrate the current maturity of the upstream tensorflow GPU packaging. If it is as bad as I experienced a while back, then I think it is better not to try to maintain a tensorflow GPU image, to avoid making this project too hard to maintain as a whole.

Contributor Author

Yes, it was very complicated, and the conda-forge tensorflow-gpu package helped by providing cudatoolkit and cudnn within the same toolchain. But I just tried to install tensorflow-gpu from conda-forge into scipy-notebook, and it failed due to conflicts. So maybe things have changed, and nowadays installing tensorflow from PyPI and cuda/cudnn from NVIDIA's conda channel is the easiest way. For maintenance, I imagine using the CUDA version from the tested build configurations for each new TensorFlow release should work.

Member

In my opinion, we can't tell whether the current maturity of the tensorflow packages is better until we merge this PR, have weekly builds, and see several releases of cuda/cudnn and of the tensorflow package itself.

So I am ok with how the Dockerfile currently looks, but we need to see whether it is still ok after a few releases.

Member

I am ready to give it a try as a maintainer (and I can always disable the build just by changing a few lines in the docker.yml config-like file).

tests/docker-stacks-foundation/test_packages.py (resolved review thread)
@mathbunnyru
Member

@mathbunnyru, @consideRatio, @yuvipanda, and @manics, please vote 👍 to accept this change and 👎 not to accept it (use a reaction to this message)
The voting deadline is the 11th of April (a month since I posted this message).
The change is accepted, if there are at least 2 positive votes.

We can have a discussion until the deadline, so please express your opinions.

As this is very similar to the pytorch-notebook, I won't wait until the deadline if there are 2 positive votes before it.

ChristofKaufmann and others added 2 commits March 11, 2024 11:07
Co-authored-by: Ayaz Salikhov <mathbunnyru@users.noreply.github.com>
Co-authored-by: Ayaz Salikhov <mathbunnyru@users.noreply.github.com>
@consideRatio
Collaborator

I voted 👀 for now. I'd like to see that this seems reasonable to install and maintain long term, i.e. that the upstream tensorflow GPU packages make it easy enough, because historically it has been a mess in my experience, and I don't want this project to take on maintaining functionality that is too messy. If the implementation looks not-messy, I'd be 👍; but if it is messy, this would be a very notable commitment, and we had better protect the project's limited maintenance capacity from taking on such a burden.

ChristofKaufmann and others added 2 commits March 11, 2024 11:24
Co-authored-by: Ayaz Salikhov <mathbunnyru@users.noreply.github.com>
@mathbunnyru
Member

I cleaned the aarch64 machines, so builds should work better. Unfortunately, docker is the worst at cleaning its own cache.

@manics
Contributor

manics commented Mar 12, 2024

I don't feel qualified to give a 👍 or 👎. @consideRatio has already highlighted the main issues around long-term maintainability, so I think the decision should come from those who have ultimate responsibility for maintaining it.

@ChristofKaufmann
Contributor Author

I spent some time finding the best approach regarding maintainability. It now looks quite similar to the PyTorch cuda variant, except that for TensorFlow:

@mathbunnyru
Member

You can't choose between CUDA 11 and 12 – it will use the officially tested build configuration.

Why can't you pin the version when using pip?

@ChristofKaufmann
Contributor Author

The extra "and-cuda" is defined here. There, the package versions of the dependencies are pinned to the ones listed in the tested build configuration. There is no additional "and-cuda11" extra; that's what I meant.

@mathbunnyru
Member

The extra "and-cuda" is defined here. There the package versions of the dependencies are fixed to the ones listed in tested build configuration. There is no additional "and-cuda11" extra. That's what I meant.

I guess there are many libraries installed alongside tensorflow.
They probably have cu12 either in their name or in their version.
We can limit the version of one such library, and then pip will have to choose a proper cuda version.

At least I think it is worth trying.

@ChristofKaufmann
Contributor Author

I tried to use the -cu11 packages (except for nvidia-nvjitlink-cu12, since it is new in CUDA 12) using:
CUDA11_DEPS=$(wget -qO- https://pypi.org/pypi/tensorflow/json | grep -o -e '[a-z_-]\+-cu12' | sed 's/-cu12/-cu11/; s/nvidia-nvjitlink-cu11//' | xargs)

However, it does not work: the tensorflow wheel is linked against the CUDA 12 libraries. It expects, e.g., libcudart.so.12, while only libcudart.so.11 is present.

Full import errors with TF_CPP_MAX_VLOG_LEVEL=3
2024-03-15 02:38:23.246671: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcudart.so.12'; dlerror: libcudart.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.246811: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcublas.so.12'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.246910: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcublasLt.so.12'; dlerror: libcublasLt.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.247014: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcufft.so.11'; dlerror: libcufft.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.261102: I external/local_tsl/tsl/platform/default/dso_loader.cc:59] Successfully opened dynamic library libcusolver.so.11
2024-03-15 02:38:23.261261: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcusparse.so.12'; dlerror: libcusparse.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.261470: I external/local_tsl/tsl/platform/default/dso_loader.cc:59] Successfully opened dynamic library libcudnn.so.8
2024-03-15 02:38:23.261483: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

If you look at the conda-forge tensorflow-gpu package files, there are, e.g., cuda120 (12.0) and cuda118 (11.8) builds. So the library version is fixed at build time, and the PyPI tensorflow package provides only a CUDA 12 build.
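For reference, the -cu12 to -cu11 rewrite attempted in the shell pipeline above can be sketched in Python (the package names below are illustrative samples, not the actual tensorflow[and-cuda] pins):

```python
import re


def cu11_names(dep_names: list[str]) -> list[str]:
    """Rewrite -cu12 wheel names to their -cu11 counterparts,
    dropping nvidia-nvjitlink, which only exists for CUDA 12."""
    rewritten = []
    for name in dep_names:
        if name.startswith("nvidia-nvjitlink"):
            continue  # new in CUDA 12, has no -cu11 variant
        rewritten.append(re.sub(r"-cu12$", "-cu11", name))
    return rewritten
```

As described above, the renamed packages install fine but cannot satisfy a wheel that is linked against the CUDA 12 shared libraries.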

@mathbunnyru
Member


Thanks. In that case, let's rename the variant to simply cuda, since we don't have any control over the cuda version. And please update the docs to mention this.

@mathbunnyru
Member

#2100 (comment)

@yuvipanda what do you think about this PR?

@twalcari
Contributor

While I have no official vote here, I would like to express my full support for this PR. Given that TensorFlow is mainly used for GPU-accelerated applications, it makes a lot of sense to have a GPU-capable docker image available.

The current non-GPU images feel like a neutered alternative: fine for some preliminary exploration of how TensorFlow works, but ineffective for any real-world application.

@yuvipanda
Contributor

Sorry for the delay, @mathbunnyru.

I'm +1 on this change because it's using the upstream supported way to install tensorflow - the and-cuda variant (as described in https://www.tensorflow.org/install/pip#linux).

The only (non-blocking) concern I have is that it's based on the scipy-notebook image, which installs packages primarily from conda-forge. And tensorflow has some dependencies (particularly numpy) that are already in the base image. So the question is, what happens if a newer (or older) version of numpy is required by tensorflow than what we get from conda-forge? Would mixing pip and conda like this cause issues? In my experience, it mostly does not (I literally did this with tensorflow in another project a few months ago). And I'd rather us do this if it means we can directly use the method maintained by upstream. It's also what we do for pytorch now.

So overall, +1 from me. Thank you for this contribution, @ChristofKaufmann! And thanks for your stewardship, @mathbunnyru

@mathbunnyru
Member

So the question is, what happens if a newer (or older) version of numpy is required by tensorflow than what we get from conda-forge?

We can always pin versions in some images if we need to and have no other choice.

I think numpy is so widely used that the conda-forge team puts a lot of effort into releasing new versions, so we won't even have to wait long.
But we'll only see this in practice once we merge and gain some experience.

So, let's try to merge this one 🙂

@mathbunnyru mathbunnyru merged commit b9553a8 into jupyter:main Mar 26, 2024
74 checks passed
@yuvipanda
Contributor

But we'll only see this in practice once we merge and gain some experience.

Big big +1! Thank you :)

@ChristofKaufmann
Contributor Author

Thank you for your helpful comments improving the code, @mathbunnyru!

Successfully merging this pull request may close these issues.

Add container images for the GPU version of TensorFlow and PyTorch Notebook
6 participants