[Request for Error Troubleshooting] RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc) #5

Open
kskim-phd opened this issue Feb 25, 2024 · 6 comments

@kskim-phd

Dear Team,

Thank you for sharing this great platform!

I set up the environment according to your instructions, but I get the following error when I run the training command:

python train_net.py \
    --config-file configs/esm35_evo1_initial_pdbonly.yaml \
    --num-gpus 1 SOLVER.SEQ_PER_BATCH 1 \
    OUTPUT_DIR output/esm35_evo1/initial_pdbonly

This is the error: RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

[Image: Screenshot from 2024-02-25 12-00-09]

I ran this code on a single A6000 GPU.

Could you advise on how to solve this issue, if possible? I would deeply appreciate it.

@jmlee4967
Contributor

Thank you for your interest in this project!

This project has been tested with CUDA 11.3 and PyTorch 1.12, and it depends on other projects (e.g., fair-esm), so different versions may lead to errors.
This error may indicate a mismatch between the ESM version and the CUDA version, but your environment details (CUDA, PyTorch, and package versions) would help us understand the issue properly.
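
For anyone gathering the environment details requested above, a minimal sketch along these lines may help. It uses only the Python standard library. The package list is an assumption: fair-esm, xformers, and biotite are named in this thread, while torch and numpy are guesses at what else is relevant. The function name `collect_versions` is hypothetical and not part of the project.

```python
import platform
from importlib import metadata

# Assumed package list: fair-esm, xformers and biotite are named in this
# thread; torch and numpy are guesses at other relevant dependencies.
PACKAGES = ["torch", "xformers", "fair-esm", "biotite", "numpy"]


def collect_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed are reported as 'not installed'
    instead of raising, so the report is always complete.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Print one "name: version" line per entry, ready to paste into a reply.
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name}: {version}")
```

This does not report the CUDA version itself. With PyTorch installed, `torch.version.cuda` gives the CUDA version that PyTorch was built against, which is worth including alongside this output.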

@kskim-phd
Author

kskim-phd commented Mar 4, 2024

Dear Team,
Yes, I will check the mismatch issue.

Thank you for your valuable comments.

In fact, I hit a setup error when I ran the original install command you provided, under CUDA 11.3 and PyTorch 1.12:

pip install -v -U git+https://github.com/facebookresearch/xformers.git@b31f4a1#egg=xformers

Therefore, I had no choice but to use the following alternative commands:

wget https://anaconda.org/xformers/xformers/0.0.22/download/linux-64/xformers-0.0.22-py39_cu11.6.2_pyt1.12.1.tar.bz2
conda install ./xformers-0.0.22-py39_cu11.6.2_pyt1.12.1.tar.bz2

However, this build requires different CUDA and PyTorch versions.

I will see whether there is any other way to solve this issue on my end.

I would greatly appreciate any comments you could provide, if possible.

Thank you!
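
The conda package filename above encodes the Python, CUDA, and PyTorch versions the build targets, so it can be compared against the versions the maintainers say they tested (CUDA 11.3, PyTorch 1.12). The sketch below is a hypothetical helper, not part of the project. It assumes the filename follows the pattern seen in this thread and compares only major.minor versions, which is an assumption about what counts as a mismatch.

```python
import re

# Versions the maintainers report testing with (taken from this thread).
TESTED = {"cuda": "11.3", "torch": "1.12"}

# Assumed filename layout, based on the single example in this thread:
#   xformers-<version>-py<python tag>_cu<cuda>_pyt<torch>.tar.bz2
BUILD_RE = re.compile(
    r"xformers-(?P<xformers>[\d.]+)-py(?P<python>\d+)"
    r"_cu(?P<cuda>[\d.]+)_pyt(?P<torch>[\d.]+)\.tar\.bz2$"
)


def parse_build(filename):
    """Extract the version tags from a conda xformers package filename."""
    match = BUILD_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognised package filename: {filename}")
    return match.groupdict()


def major_minor(version):
    """Reduce '11.6.2' to '11.6' so patch releases compare as equal."""
    return ".".join(version.split(".")[:2])


def mismatches(build):
    """Return {component: (build version, tested version)} where they differ."""
    return {
        key: (build[key], wanted)
        for key, wanted in TESTED.items()
        if major_minor(build[key]) != wanted
    }


if __name__ == "__main__":
    build = parse_build("xformers-0.0.22-py39_cu11.6.2_pyt1.12.1.tar.bz2")
    print(build)
    print(mismatches(build))
```

For the file in this thread, the build tag reads CUDA 11.6.2 and PyTorch 1.12.1, so under the major.minor comparison only the CUDA version differs from the tested setup. Whether that difference is what triggers the cuBLAS error is not established here.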

@resplendentHSHI

resplendentHSHI commented Mar 4, 2024

Hi, I had this exact issue. My problem arose from using the latest version of biotite. For me, the exact biotite version required was not available for Python 3.8. You could try recreating the environment with Python 3.9.

There are a few other issues in the code, such as np.int being phased out of NumPy and PIL now using BILINEAR instead of LINEAR, but those are easy fixes.

After I changed to the correct biotite version, this error went away. Hope this helps!
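
The two "easy fixes" mentioned above are simple renames: `np.int` was a plain alias for the builtin `int`, and `Image.LINEAR` was an older alias of `Image.BILINEAR` in PIL. The thread does not show where these occur in the project, so the sketch below is only a hypothetical helper (`patch_source` is not part of the project) that applies both renames to a string of Python source text.

```python
import re

# (pattern, replacement) pairs for the two renames described in this thread.
# The \b word boundaries keep np.int32, np.int64 and np.int_ untouched, and
# leave an existing Image.BILINEAR as it is.
RENAMES = [
    (re.compile(r"\bnp\.int\b"), "int"),
    (re.compile(r"\bImage\.LINEAR\b"), "Image.BILINEAR"),
]


def patch_source(source):
    """Return source text with the deprecated names replaced."""
    for pattern, replacement in RENAMES:
        source = pattern.sub(replacement, source)
    return source


if __name__ == "__main__":
    # Made-up example lines, not taken from the project's code.
    example = (
        "idx = values.astype(np.int)\n"
        "ids = values.astype(np.int64)\n"
        "img = img.resize(size, Image.LINEAR)\n"
    )
    print(patch_source(example))
```

Editing the affected lines by hand works just as well. A text-level patch like this should be reviewed before use, since it also rewrites matches inside comments and strings.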

@kskim-phd
Author

Great! I sincerely thank you for your valuable comments!

@resplendentHSHI

One more note: you might have some success with xformers v0.0.21 or v0.0.20, since v0.0.22 was released after solvent was published. For reference, in case other people also have this issue, I was able to run solvent on both a 3080 and a 4090.

@kskim-phd
Author

kskim-phd commented Mar 4, 2024

Thank you for the great suggestion. I will check whether it works using biotite with Python 3.9, and with xformers v0.0.21 or v0.0.20 as well!
