
Failure on first epoch #4

Closed
nhalsteadvt opened this issue Jan 19, 2021 · 11 comments

@nhalsteadvt

It opens the folder where the picture should be saved, but this error shows up immediately:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

torch version: 1.7.1
torch.cuda.is_available() == True

What am I missing?
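A quick way to narrow down errors like this is to check which CUDA build the installed PyTorch wheel was compiled against. The sketch below uses the standard `torch` attributes; the major-version comparison helper is only a rough heuristic, not an official compatibility rule:

```python
def same_cuda_major(torch_cuda, system_cuda):
    """Rough heuristic: cuBLAS failures sometimes come from a major-version
    mismatch between the CUDA build PyTorch was compiled for and the
    driver/toolkit on the machine (e.g. '11.1' vs '11.2' share major '11')."""
    return torch_cuda.split(".")[0] == system_cuda.split(".")[0]

try:
    import torch  # the actual diagnostic; requires PyTorch to be installed

    print("torch:", torch.__version__)             # e.g. 1.7.1
    print("built for CUDA:", torch.version.cuda)   # e.g. 10.2 or 11.0 (None on CPU-only builds)
    print("GPU visible:", torch.cuda.is_available())
except ImportError:
    pass  # the version-comparison helper above is still usable on its own
```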

@lucidrains
Owner

@nhalsteadvt hmm, could you try upgrading your torchvision?

@nhalsteadvt
Copy link
Author

@lucidrains I thought 1.7.1 was the latest version after checking back here.
I think the error might be between PyTorch and CUDA, but I couldn't find which versions this repo needs.

@lucidrains
Owner

@nhalsteadvt what is your current cuda version? I'm running 10.2

@nhalsteadvt
Author

@lucidrains I believe I'm running 11.2. I'll try reinstalling PyTorch with CUDA 10.2, or making 10.2 (which I have installed) the active version.

@lucidrains
Owner

@nhalsteadvt Ohh sorry, I'm actually running 11.1, so it should be fine!

@enricoros
Contributor

Verified working with CUDA 10.1 and PyTorch for CUDA 10.1 as well.

@enricoros
Contributor

@nhalsteadvt Still experiencing issues?

@nhalsteadvt
Author

@enricoros Yeah, it's a different error now, about CUDA running out of memory. I thought 15.8 GB of usable RAM was enough, but it seems something else is wrong.

"RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 4.08 GiB already allocated; 1.16 MiB free; 4.18 GiB reserved in total by PyTorch)"

This stuff really isn't my strong suit, but it looks like I don't have something configured right to use my GPU. I have 70 GB of storage space, if that means anything.
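Worth noting: the "CUDA out of memory" error refers to the GPU's dedicated memory (the 6.00 GiB in the message above), not system RAM or disk space. A sketch of how one might inspect it from PyTorch (the `torch.cuda` calls are the standard API; the byte-formatting helper is just for readability):

```python
def gib(n_bytes):
    """Convert bytes to GiB, matching the units in the error message."""
    return round(n_bytes / 2**30, 2)

try:
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name, gib(props.total_memory), "GiB total")
        print("allocated:", gib(torch.cuda.memory_allocated(0)), "GiB")
        print("reserved: ", gib(torch.cuda.memory_reserved(0)), "GiB")
except ImportError:
    pass  # sketch only; requires PyTorch and a CUDA device to print anything
```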

@enricoros
Contributor

@nhalsteadvt It depends on the memory on your video card. With 8 GB of video memory (RTX 2070) I can run size=128 and size=256 images with no problem, but you need more memory for size=512 (it stops after hundreds of iterations). What video card do you have? As an alternative, you can run this project using the "simplified notebook" linked on the home page, where the cards are NVIDIA T4s on Google Cloud.

@nhalsteadvt
Author

nhalsteadvt commented Jan 22, 2021

@enricoros I've been using the notebook a bit, so that's cool. Task Manager says I have 7.9 GB of shared memory between my Intel and Nvidia graphics cards. However, the DirectX Diagnostic tool says I have 8095 MB (about 7.9 GB) of shared memory.
How would I alter the image size?

edit: error now says "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)"

@enricoros
Contributor

@nhalsteadvt For the Task Manager stats, look at the "Dedicated GPU memory" value. The shared GPU memory doesn't mean much (mine shows 32 GB shared; I don't know where that comes from); it's the dedicated memory that counts. For example, when running the code right now, I see "Dedicated GPU memory: 7.4/8.0 GB", as roughly 90% of the GPU memory is allocated for this operation.

As for the CUDA errors: you should make sure that the CUDA version installed on your system matches what PyTorch expects. For instance, I don't run the latest CUDA; I have a stable one (10.2 on Windows), available here: https://developer.nvidia.com/cuda-10.2-download-archive. Then, when downloading PyTorch, I select the same combination (Windows, CUDA 10.2) on the website. Finally, I even download the cuDNN build that matches the CUDA version here: https://developer.nvidia.com/rdp/cudnn-download#a-collapse805-102 (selecting 10.2). Yeah, it ain't pretty to get a system working nicely.
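The version check described above can be scripted: `nvcc --version` reports the installed toolkit release, which you can compare against `torch.version.cuda`. A small sketch, assuming nvcc's usual output format (the parsing helper is an illustration, not part of any official tooling):

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract '10.2' from a line like 'Cuda compilation tools, release 10.2, V10.2.89'."""
    m = re.search(r"release (\d+\.\d+)", output)
    return m.group(1) if m else None


def installed_cuda_toolkit() -> Optional[str]:
    """Return the system CUDA toolkit version, or None if nvcc is not on PATH."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    return parse_nvcc_release(out)
```

If `installed_cuda_toolkit()` and `torch.version.cuda` disagree, reinstalling PyTorch with the matching selector from the official download page (as described above) is the usual fix.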
