triton_autotuner: Rounding modifier required for instruction 'cvt' #15900
Comments
@KeremTurgutlu How do we know that it originates in the triton_autotuner?
@hawkinsp Is it hard to propagate the C++ stack trace along with the error?
I am not 100% sure that's the root cause, but I should've probably pasted this as well:
I was able to successfully run the code with a from-scratch NVIDIA driver, CUDA (12.1), cuDNN, and JAX installation.
nvcc is not installed.
Oh, this is CUDA 12, which would probably explain it: we had some bugs filed on CUDA 12 before.
I think this is actually a case of too old a CUDA installation, not the other way around. The image is named for CUDA 11.3 (cu113), but JAX is built for CUDA 11.8 (or CUDA 12), and if I recall correctly Ampere GPU support wasn't added until well after 11.3. Can you update to CUDA 11.8 or newer?
Sorry if it was not clear, but what I wanted to mention was that the issue was fixed when I installed CUDA 12 from scratch instead of using the Google Cloud image.
Recently got this error (might be related to the 0.4.9 release; looking into it):
Edit: tried again after recreating the instance, and I wasn't able to reproduce the error.
Getting a similar error when trying to run a custom model on an A6000: https://pastebin.com/raw/MUsYZje8
Update: it seems to only happen with bfloat16; it works fine with float32.
@euclaise Do you have another copy of
@hawkinsp I don't, but it should be whatever is used here: https://hub.docker.com/r/runpod/pytorch/
After some testing, it appears to have been caused by me accidentally mixing
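Since the symptom above only appeared with bfloat16, one quick way to catch accidental dtype mixing before it reaches the compiler is to scan the parameter tree for inconsistent dtypes. This is a hedged sketch, not anything from the issue itself; `find_mixed_dtypes` and the example tree are hypothetical, and numpy stands in for whatever array type the model actually uses.

```python
import numpy as np

def find_mixed_dtypes(params):
    """Walk a (possibly nested) dict of arrays and return the set of dtypes found."""
    dtypes = set()

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        else:
            dtypes.add(np.asarray(node).dtype)

    walk(params)
    return dtypes

# Hypothetical parameter tree that accidentally mixes two precisions.
params = {
    "encoder": {"kernel": np.zeros((2, 2), np.float32)},
    "decoder": {"kernel": np.zeros((2, 2), np.float16)},
}

mixed = find_mixed_dtypes(params)
print(sorted(str(d) for d in mixed))  # → ['float16', 'float32']
```

If the returned set has more than one element, some subtree was cast (or left uncast) by mistake, which is the kind of mismatch described in the comment above.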
Description
Getting the following error when trying to run code on an A100 80GB Google Cloud Debian Deep Learning image (c0-deeplearning-common-cu113-v20230501-debian-10). This code is tested and works on TPU (using the t5x library). I don't know whether this error is related to my setup, but after creating the instance and before running the code, these are the steps I took:
1. Created a new conda environment with Python 3.9.
2. Installed the latest JAX with CUDA support: pip install jax[cuda] -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html, which is 4.8.0 as of writing.
3. Cloned the t5x library and installed an editable local version: -e git+https://github.com/google-research/t5x.git@2b010160e7fe8a4505a6d1032a7b737a633636e5#egg=t5x.
4. Installed an extra dependency: pip install t5.
5. Upgraded the cuDNN library to 8.6.0, as JAX complained it requires at least that version, by manually downloading cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz and then running the following:
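After a manual cuDNN install like step 5, it is worth confirming that the header actually in place reports the version JAX demands. This is a hedged sketch under the assumption that the tarball's cudnn_version.h defines CUDNN_MAJOR/MINOR/PATCHLEVEL as plain #define lines (the standard layout); the sample text below is illustrative, not taken from the author's machine.

```python
import re

def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the contents of cudnn_version.h."""
    fields = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"#define\s+{name}\s+(\d+)", header_text)
        fields[name] = int(match.group(1))
    return (fields["CUDNN_MAJOR"], fields["CUDNN_MINOR"], fields["CUDNN_PATCHLEVEL"])

# Illustrative header snippet matching the 8.6.0 archive named above.
sample = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 0
"""

print(parse_cudnn_version(sample))  # → (8, 6, 0)
```

In practice you would read the header from wherever the archive was copied (commonly under the CUDA toolkit's include directory) instead of the inline sample; a result below (8, 6, 0) would mean the upgrade did not take effect on the path JAX is loading from.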
The following is the error I get when running a t5x pretraining script using train.py.
What jax/jaxlib version are you using?
jax==0.4.7 jaxlib==0.4.7+cuda11.cudnn86
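The jaxlib version string reported above encodes which CUDA and cuDNN builds the wheel was compiled against, so parsing it is a quick way to spot a mismatch with the system installation. A minimal sketch, assuming the `version+cudaXX.cudnnYY` local-version format shown in this report; `parse_jaxlib_build` is a hypothetical helper, not a JAX API.

```python
import re

def parse_jaxlib_build(version):
    """Split a jaxlib version string of the form '0.4.7+cuda11.cudnn86'."""
    match = re.fullmatch(r"([\d.]+)\+cuda(\d+)\.cudnn(\d+)", version)
    if match is None:
        return None  # no CUDA/cuDNN build tag present (e.g. a CPU wheel)
    return {
        "jaxlib": match.group(1),
        "cuda": int(match.group(2)),
        "cudnn": int(match.group(3)),
    }

print(parse_jaxlib_build("0.4.7+cuda11.cudnn86"))
# → {'jaxlib': '0.4.7', 'cuda': 11, 'cudnn': 86}
```

Here the wheel expects CUDA 11 with cuDNN 8.6, which is why an image shipping CUDA 11.3 or an older cuDNN triggers the complaints described earlier in the thread.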
Which accelerator(s) are you using?
GPU
Additional system info
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03) [GCC 11.3.0] on linux
NVIDIA GPU info