
Failure on first epoch #4

Closed
nhalsteadvt opened this issue Jan 19, 2021 · 11 comments

@nhalsteadvt

It opens the folder where the picture should be saved, but this error shows up immediately:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

torch version: 1.7.1
torch.cuda.is_available() == True

What am I missing?
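A quick way to narrow down errors like this is to check which CUDA build the installed PyTorch wheel was compiled against. The sketch below uses the standard `torch` attributes; the major-version comparison helper is only a rough heuristic, not an official compatibility rule:

```python
def same_cuda_major(torch_cuda, system_cuda):
    """Rough heuristic: cuBLAS failures sometimes come from a major-version
    mismatch between the CUDA build PyTorch was compiled for and the
    driver/toolkit on the machine (e.g. '11.1' vs '11.2' share major '11')."""
    return torch_cuda.split(".")[0] == system_cuda.split(".")[0]

try:
    import torch  # the actual diagnostic; requires PyTorch to be installed

    print("torch:", torch.__version__)             # e.g. 1.7.1
    print("built for CUDA:", torch.version.cuda)   # e.g. 10.2 or 11.0 (None on CPU-only builds)
    print("GPU visible:", torch.cuda.is_available())
except ImportError:
    pass  # the version-comparison helper above is still usable on its own
```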

@lucidrains
Owner

@nhalsteadvt hmm, could you try upgrading your torchvision?

@nhalsteadvt
Copy link
Author

@lucidrains I thought 1.7.1 was the latest version after checking back here.
I think the error might be between PyTorch and CUDA, but I couldn't find which versions this repo needs.

@lucidrains
Owner

@nhalsteadvt what is your current cuda version? I'm running 10.2

@nhalsteadvt
Author

@lucidrains I believe I'm running 11.2. I'll try reinstalling PyTorch with CUDA 10.2, or making 10.2 (which I have installed) the active version.

@lucidrains
Owner

@nhalsteadvt Ohh sorry, I'm actually running 11.1, so it should be fine!

@enricoros
Contributor

Verified working with CUDA 10.1 and PyTorch for CUDA 10.1 as well.

@enricoros
Contributor

@nhalsteadvt Still experiencing issues?

@nhalsteadvt
Author

@enricoros Yeah, it's a different error now, about CUDA running out of memory. I thought 15.8 GB of usable RAM was enough, but it seems something else is wrong.

"RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 4.08 GiB already allocated; 1.16 MiB free; 4.18 GiB reserved in total by PyTorch)"

This stuff really isn't my strong suit, but it looks like I don't have something configured right to use my GPU. I have 70 GB of storage space, if that means anything.
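Worth noting: the "CUDA out of memory" error refers to the GPU's dedicated memory (the 6.00 GiB in the message above), not system RAM or disk space. A sketch of how one might inspect it from PyTorch (the `torch.cuda` calls are the standard API; the byte-formatting helper is just for readability):

```python
def gib(n_bytes):
    """Convert bytes to GiB, matching the units in the error message."""
    return round(n_bytes / 2**30, 2)

try:
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name, gib(props.total_memory), "GiB total")
        print("allocated:", gib(torch.cuda.memory_allocated(0)), "GiB")
        print("reserved: ", gib(torch.cuda.memory_reserved(0)), "GiB")
except ImportError:
    pass  # sketch only; requires PyTorch and a CUDA device to print anything
```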

@enricoros
Contributor

@nhalsteadvt It depends on the memory on your video card. With 8 GB of video memory (RTX 2070) I can run size=128 and size=256 images with no problem, but you need more memory for size=512 (it stops after hundreds of iterations). What video card do you have? As an alternative, you can run this project using the "simplified notebook" linked on the home page, where the cards are NVIDIA T4s on Google Cloud.

@nhalsteadvt
Author

nhalsteadvt commented Jan 22, 2021

@enricoros I've been using the notebook a bit, so that's cool. Task Manager says I have 7.9 GB of shared memory between my Intel and Nvidia graphics cards. However, the DirectX Diagnostic tool says I have 8095 MB (about 7.9 GB) of shared memory.
How would I alter the image size?

edit: error now says "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)"

@enricoros
Contributor

@nhalsteadvt For the Task Manager stats, look at the "Dedicated GPU memory" value. The shared GPU memory doesn't mean much (mine shows 32 GB shared; I don't know where that comes from); it's the dedicated memory that counts. For example, when running the code right now, I see "Dedicated GPU memory: 7.4/8.0 GB", as roughly 90% of the GPU memory is allocated for this operation.

As for the CUDA errors: you should make sure that the CUDA version installed on your system matches what PyTorch expects. For instance, I don't run the latest CUDA; I have a stable one (10.2 on Windows), available here: https://developer.nvidia.com/cuda-10.2-download-archive. Then, when downloading PyTorch, I select the same combination (Windows, CUDA 10.2) on the website. Finally, I even download the cuDNN build that matches the CUDA version here: https://developer.nvidia.com/rdp/cudnn-download#a-collapse805-102 (selecting 10.2). Yeah, it ain't pretty to get a system working nicely.
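The version check described above can be scripted: `nvcc --version` reports the installed toolkit release, which you can compare against `torch.version.cuda`. A small sketch, assuming nvcc's usual output format (the parsing helper is an illustration, not part of any official tooling):

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract '10.2' from a line like 'Cuda compilation tools, release 10.2, V10.2.89'."""
    m = re.search(r"release (\d+\.\d+)", output)
    return m.group(1) if m else None


def installed_cuda_toolkit() -> Optional[str]:
    """Return the system CUDA toolkit version, or None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    return parse_nvcc_release(out)
```

If `installed_cuda_toolkit()` and `torch.version.cuda` disagree, reinstalling PyTorch with the matching selector from the official download page (as described above) is the usual fix.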
