Pin jaxlib to use cuda120 build with cuda-nvcc and update README note #549

weiji14 · 2024-05-21T21:19:37Z

Adding a explicit pin on jaxlib=*=cuda120* to pick up the cuda120 build version, ~~as the conda-lock solver was picking up the cpu build of jaxlib instead of the cuda build in #514.~~ See:

pangeo-docker-images/ml-notebook/conda-lock.yml

Lines 4166 to 4182 in 6d4c2ab

    
           - name: jaxlib 
        
             version: 0.4.23 
        
             manager: conda 
        
             platform: linux-64 
        
             dependencies: 
        
               libabseil: '>=20230802.1,<20230803.0a0' 
        
               libgcc-ng: '>=12' 
        
               libgrpc: '>=1.59.3,<1.60.0a0' 
        
               libstdcxx-ng: '>=12' 
        
               libzlib: '>=1.2.13,<1.3.0a0' 
        
               ml_dtypes: '>=0.2.0' 
        
               numpy: '>=1.23.5,<2.0a0' 
        
               openssl: '>=3.2.1,<4.0a0' 
        
               python: '>=3.11,<3.12.0a0' 
        
               python_abi: 3.11.* 
        
               scipy: '>=1.9' 
        
             url: https://conda.anaconda.org/conda-forge/linux-64/jaxlib-0.4.23-cpu_py311hc0fb0b9_0.conda

Edit: the cuda120 build is actually picked up automatically now as of the 2024.06.02 tag, but setting the cuda120 pin still to be sure that this doesn't break in the future.

Note that jaxlib-0.4.23-cuda120py* has an explicit runtime dependency on cuda-nvcc since conda-forge/jaxlib-feedstock#241, so this should mean users won't have to install cuda-nvcc explicitly anymore.

The cuda-nvcc workaround in the main README.md file has also been removed, in place of a message recommending users to use ml-notebook>=2024.06.02.

Fixes #438

The conda-lock solver was picking up the cpu build of jaxlib instead of the cuda build. Adding a explicit pin here to pick up the cuda120 build version.

pangeo-bot · 2024-05-21T21:19:48Z

/condalock
Automatically locking new conda environment, building, and testing images...

github-actions · 2024-05-21T21:19:49Z

👈 Try on Mybinder.org!

weiji14 · 2024-05-21T21:25:34Z

Pulling in jaxlib with the cuda120 build doesn't work yet, see https://github.com/pangeo-data/pangeo-docker-images/actions/runs/9181646795/job/25248830043#step:4:56:

INFO:conda_lock.conda_lock:Using virtual packages from virtual-packages.yml
Locking dependencies for ['linux-64']...
INFO:conda_lock.conda_solver:linux-64 using specs ['cuda-version >=12.0', 'flax >=0.8.0', 'jax', 'jaxlib >=0.4.23 cuda120*', 'jupyterlab-nvdashboard', 'keras-cv', 'tensorflow >=2.15.0 cuda120*', 'adlfs', 'argopy', 'awscli', 'black', 'boto3', 'bottleneck', 'cartopy', 'cdsapi', 'cfgrib', 'cf_xarray', 'ciso', 'cmocean', 'dask-ml', 'datashader', 'descartes', 'earthaccess', 'eofs', 'erddapy', 'esmpy', 'fastjmd95', 'flox', 'fsspec', 'gcm_filters', 'gcsfs', 'gh', 'gh-scoped-creds', 'geocube', 'geopandas', 'geopy', 'geoviews-core', 'git-lfs', 'gsw', 'h5netcdf', 'h5py', 'holoviews', 'hvplot', 'intake', 'intake-esm', 'intake-geopandas', 'intake-stac', 'intake-xarray', 'ipdb', 'ipykernel', 'ipyleaflet', 'ipytree', 'ipywidgets', 'jupyterlab_code_formatter', 'jupyterlab-git', 'jupyterlab-lsp', 'jupyterlab-myst', 'jupyter-panel-proxy', 'jupyter-resource-usage', 'kerchunk', 'line_profiler', 'lxml', 'lz4', 'matplotlib-base', 'memory_profiler', 'metpy', 'nb_conda_kernels', 'nbstripout', 'nc-time-axis', 'netcdf4', 'numbagg', 'numcodecs', 'numpy', 'numpy_groupies', 'odc-stac', 'pandas', 'panel', 'parcels', 'param', 'pop-tools', 'pyarrow', 'pycamhd', 'pydap', 'pystac', 'pystac-client', 'python-blosc', 'python-gist', 'python-graphviz', 'python-lsp-ruff', 'python-xxhash', 'rasterio', 'rechunker', 'rio-cogeo', 'rioxarray', 'ruff', 's3fs', 'satpy', 'scikit-image', 'scikit-learn', 'scipy', 'seaborn', 'sparse', 'snakeviz', 'stackstac', 'tiledb-py', 'timezonefinder', 'watermark', 'xarray', 'xarrayutils', 'xarray-datatree', 'xarray_leaflet', 'xarray-spatial', 'xbatcher', 'xcape', 'xclim', 'xesmf', 'xgboost', 'xgcm', 'xhistogram', 'xmip', 'xmitgcm', 'xpublish', 'xrft', 'xskillscore', 'xxhash', 'zarr', 'python 3.11.*', 'pangeo-notebook 2024.05.20.*', 'pip']
Failed to parse json, Expecting value: line 1 column 1 (char 0)
Could not lock the environment for platform linux-64
Could not solve for environment specs
The following packages are incompatible
├─ jaxlib >=0.4.23 cuda120* is installable with the potential options
│  ├─ jaxlib 0.4.23 would require
│  │  └─ libabseil >=20240116.1,<20240117.0a0 , which can be installed;
│  ├─ jaxlib 0.4.23 would require
│  │  └─ libabseil >=20240116.2,<20240117.0a0 , which can be installed;
│  └─ jaxlib 0.4.23 would require
│     └─ python_abi 3.12.* *_cp312, which requires
│        └─ python 3.12.* *_cpython, which can be installed;
├─ python 3.11**  is not installable because it conflicts with any installable versions previously reported;
└─ tensorflow >=2.15.0 cuda120* is not installable because it requires
   └─ tensorflow-base [2.15.0 cuda120py310heceb7ac_2|2.15.0 cuda120py310heceb7ac_3|...|2.15.0 cuda120py39hf42b710_3], which requires
      └─ libabseil >=20230802.1,<20230803.0a0 , which conflicts with any installable versions previously reported.
{
    "success": false
}

Need to wait for newer version of tensorflow on conda-forge to use libabseil>=20240116, wait for conda-forge/tensorflow-feedstock#372 or conda-forge/tensorflow-feedstock#385

weiji14 · 2024-05-21T21:45:43Z

README.md

@@ -54,7 +54,7 @@ The primary use of these Docker images is running on Pangeo Cloud deployments wi
 * Since 2020.10.16, [mamba](https://github.com/mamba-org/mamba) is installed into the base-image and conda-lock environment and is used by default to solve for a compatible environment (see #146)
 * For a simple list of packages for a given image, you can use a link like this: https://github.com/pangeo-data/pangeo-docker-images/blob/2020.10.08/pangeo-notebook/packages.txt
 * To compare changes between two images, you can use a link like this: https://github.com/pangeo-data/pangeo-docker-images/compare/2020.10.03..2020.10.08
-* Our `ml-notebook` image now contains JAX and TensorFlow with XLA enabled. Due to licensing issues, conda-forge does not have `ptxas`, but `ptxas` is needed for XLA to work correctly. Should you like to use JAX and/or TensorFlow with XLA optimization, please install `ptxas` on your own, for example, by `conda install -c nvidia cuda-nvcc`. At the time of writing (October 2022), JAX throws a compilation error if the `ptxas` version is higher than the driver version. There does not exist an easy solution for K80 GPUs, but in the case of T4 GPUs, you should install `conda install -c nvidia cuda-nvcc==11.6.*` to be safe. Alternatively for any GPU, you could set an environment variable to resolve the error caused by JAX: `XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"`. The aforementioned error will be removed (and likely turned into a warning) in a future version of JAX. See https://github.com/google/jax/issues/12776#issuecomment-1276649134
+* Our `ml-notebook` image now contains JAX and TensorFlow with XLA enabled. Due to licensing issues, conda-forge does not have `ptxas`, but `ptxas` is needed for XLA to work correctly. Should you like to use JAX and/or TensorFlow with XLA optimization, please install `ptxas` on your own, for example, by `conda install -c nvidia cuda-nvcc`. At the time of writing (October 2022), JAX throws a compilation error if the `ptxas` version is higher than the driver version. There does not exist an easy solution for K80 GPUs, but in the case of T4 GPUs or newer, you should install `conda install -c nvidia cuda-nvcc==12.*` to be safe. Alternatively for any GPU, you could set an environment variable to resolve the error caused by JAX: `XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"`. The aforementioned error will be removed (and likely turned into a warning) in a future version of JAX. See https://github.com/google/jax/issues/12776#issuecomment-1276649134


It might be ok to remove this note, once we have cuda-nvcc pulled in as a dependency of jaxlib. Ideally, we'll add a unit test to https://github.com/pangeo-data/pangeo-docker-images/blob/master/tests/test_ml-notebook.py (maybe using the snippet from #387 (comment)) to ensure that jax works.

For the uninitiated, I'm not even sure what XLA is :) In light of that, could add a sentence here to give context for why this matters (or just link https://openxla.org/xla).

To be honest, I didn't know what the XLA acronym stands for either 😆 I've added a link to that XLA page in commit a5fb3e9, though users shouldn't need to dig into this too much since we're using XLA-enabled builds of Tensorflow by default now.

Regression test to ensure that JaX and cuda-nvcc are installed and compatible on GPU devices.

Since we're using a version of GitHub Actions without a GPU, the JaX random number generator test on GPU cannot be properly tested, so wrapping the check in a try-except block. Also tidied up some import statements and docstrings in the test file.

Since `cuda-nvcc` is installed with jaxlib's cuda120 builds, the workaround to conda install cuda-nvcc should not be needed anymore.

weiji14 · 2024-06-07T03:52:10Z

tests/test_ml-notebook.py

+    # Test running on GPU (need to run locally)
+    try:
+        gpu_device = jax.devices("gpu")[0]
+        with jax.default_device(gpu_device):
+            key = random.key(seed=24)
+            x = random.normal(key=key)
+            np.testing.assert_allclose(x, -1.168644)
+    except RuntimeError:  # Unknown backend: 'gpu' requested
+        logging.log(level=logging.INFO, msg="JAX was not tested on a GPU device")


This doesn't work on GitHub Actions without a GPU, but I ran this locally using:

mamba create -n ml-notebook --file https://raw.githubusercontent.com/pangeo-data/pangeo-docker-images/2024.06.02/ml-notebook/conda-linux-64.lock mamba activate ml-notebook pytest --verbose tests/test_ml-notebook.py

The tests passed on my computer with an NVIDIA GPU, so should be ok I think.

If there were Azure credits for Pangeo we could add GPU runners https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/managing-larger-runners !

scottyhq

Thanks for the notes @weiji14, sorry I overlooked approving this until now. merge when you're happy with it!

Link to https://openxla.org/xla so that people know what XLA is.

weiji14 · 2024-06-29T00:29:51Z

ml-notebook/environment.yml

@@ -8,6 +8,7 @@ dependencies:
 - cuda-version>=12.0
 - flax>=0.8.0
 - jax
+ - jaxlib>=0.4.23=cuda120*


This pin on using a cuda120* build of jaxlib might actually need to go away if we're planning on adding multi-arch builds of ml-notebook (see #399 (review) for context), but will deal with this later since the same pin is present on Tensorflow.

weiji14 added 2 commits May 22, 2024 09:11

Pin jaxlib to use cuda120 build

c3f34a5

The conda-lock solver was picking up the cpu build of jaxlib instead of the cuda build. Adding a explicit pin here to pick up the cuda120 build version.

Update note on cuda-nvcc workaround

f7c2bbf

weiji14 self-assigned this May 21, 2024

weiji14 mentioned this pull request May 21, 2024

cuda-nvcc missing again #438

Closed

weiji14 commented May 21, 2024

View reviewed changes

weiji14 added 4 commits June 6, 2024 12:05

Merge branch 'master' into jaxlib-cuda-build

d62d7f2

Add unit test for jax random number generator

1cf6637

Regression test to ensure that JaX and cuda-nvcc are installed and compatible on GPU devices.

Update note in README to say cuda-nvcc workaround is no longer needed

5049f6c

Since `cuda-nvcc` is installed with jaxlib's cuda120 builds, the workaround to conda install cuda-nvcc should not be needed anymore.

weiji14 changed the title ~~Pin jaxlib to use cuda120 build with cuda-nvcc dependency~~ Pin jaxlib to use cuda120 build with cuda-nvcc and update README note Jun 7, 2024

weiji14 commented Jun 7, 2024

View reviewed changes

weiji14 marked this pull request as ready for review June 7, 2024 03:52

scottyhq approved these changes Jun 27, 2024

View reviewed changes

weiji14 added 2 commits June 29, 2024 12:15

Merge branch 'master' into jaxlib-cuda-build

9c5dd55

Add link to XLA (Accelerated Linear Algebra) page

a5fb3e9

Link to https://openxla.org/xla so that people know what XLA is.

weiji14 commented Jun 29, 2024

View reviewed changes

weiji14 merged commit 09649eb into master Jul 1, 2024
5 checks passed

weiji14 deleted the jaxlib-cuda-build branch July 1, 2024 01:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pin jaxlib to use cuda120 build with cuda-nvcc and update README note #549

Pin jaxlib to use cuda120 build with cuda-nvcc and update README note #549

weiji14 commented May 21, 2024 •

edited

Loading

pangeo-bot commented May 21, 2024

github-actions bot commented May 21, 2024

weiji14 commented May 21, 2024

weiji14 May 21, 2024

scottyhq Jun 27, 2024

weiji14 Jun 29, 2024

weiji14 Jun 7, 2024

scottyhq Jun 27, 2024

scottyhq left a comment

weiji14 Jun 29, 2024

	- name: jaxlib
	version: 0.4.23
	manager: conda
	platform: linux-64
	dependencies:
	libabseil: '>=20230802.1,<20230803.0a0'
	libgcc-ng: '>=12'
	libgrpc: '>=1.59.3,<1.60.0a0'
	libstdcxx-ng: '>=12'
	libzlib: '>=1.2.13,<1.3.0a0'
	ml_dtypes: '>=0.2.0'
	numpy: '>=1.23.5,<2.0a0'
	openssl: '>=3.2.1,<4.0a0'
	python: '>=3.11,<3.12.0a0'
	python_abi: 3.11.*
	scipy: '>=1.9'
	url: https://conda.anaconda.org/conda-forge/linux-64/jaxlib-0.4.23-cpu_py311hc0fb0b9_0.conda

Pin jaxlib to use cuda120 build with cuda-nvcc and update README note #549

Pin jaxlib to use cuda120 build with cuda-nvcc and update README note #549

Conversation

weiji14 commented May 21, 2024 • edited Loading

pangeo-bot commented May 21, 2024

github-actions bot commented May 21, 2024

weiji14 commented May 21, 2024

weiji14 May 21, 2024

Choose a reason for hiding this comment

scottyhq Jun 27, 2024

Choose a reason for hiding this comment

weiji14 Jun 29, 2024

Choose a reason for hiding this comment

weiji14 Jun 7, 2024

Choose a reason for hiding this comment

scottyhq Jun 27, 2024

Choose a reason for hiding this comment

scottyhq left a comment

Choose a reason for hiding this comment

weiji14 Jun 29, 2024

Choose a reason for hiding this comment

weiji14 commented May 21, 2024 •

edited

Loading