
train_ngp_nerf_occ.py: RuntimeError: CUDA error: invalid configuration argument #207

Open
sararoma95 opened this issue Apr 24, 2023 · 18 comments · Fixed by #211

Comments

@sararoma95 commented Apr 24, 2023

Hello,

It may be an easy problem to solve, but I have not been able to do it.

When running python examples/train_ngp_nerf_occ.py --scene lego --data_root path
I get the following error when I evaluate the model (L. 236):

RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This does not happen when using train_ngp_nerf_prop.py

Thanks!

@liruilong940607 (Collaborator) commented Apr 25, 2023

This is weird. May I know your CUDA and PyTorch versions?

python -c "import torch; print(torch.__version__)"
nvcc --version

And what's your GPU?

nvidia-smi

My guess is that this is related to the GPU you are using.

@aiyb1314

Hello, I have the same problem!
PyTorch version: 2.0.0+cu118; CUDA compilation tools release 11.8
GPU: 12.0

@liruilong940607 (Collaborator)

Sorry, I meant which NVIDIA card you are using, e.g. a V100?

@sararoma95 (Author) commented Apr 25, 2023 via email

@aiyb1314

RTX 3090

@filipgu commented Apr 28, 2023

Also having problems on an A100 with occupancy grids. The rays are inside the bounding box, but the ray indices, starts, and ends come back with 0 in the batch dimension.
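
For context, a minimal self-contained check for that state (the function name is just illustrative; it only inspects the tensors returned by the occupancy-grid sampling step):

import torch

def samples_are_empty(ray_indices: torch.Tensor,
                      t_starts: torch.Tensor,
                      t_ends: torch.Tensor) -> bool:
    # All three tensors come back with a 0-sized batch dimension when the
    # occupancy grid skips every ray; this is the state that later trips the
    # "invalid configuration argument" error further down the pipeline.
    return ray_indices.numel() == 0 and t_starts.numel() == 0 and t_ends.numel() == 0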

@mJones00 commented May 2, 2023

I also had this exact issue on my 3090, although it was working fine in another conda environment. It turns out that when I set up the new environment I forgot to specify the torch version, so I ended up with torch 2.0.0 and CUDA 11.7, and tiny-cuda-nn was compiled against that combination. Installing torch 1.13.0 with pytorch-cuda 11.7 fixed the issue for me.

@liruilong940607 (Collaborator)

I can reproduce this error with torch 2.0.0. With torch 1.13.0 everything seems to work fine.

I'll come back to this issue once I figure out why this happens. In the meantime, using torch 1.13.0 seems to be a workaround.

@sararoma95 (Author) commented May 2, 2023 via email

@liruilong940607 (Collaborator)

Just fixed it on the master branch!

@1z1y commented May 4, 2023

I also have this problem with torch 1.10 and CUDA 11.3.

@AIBluefisher commented Jul 5, 2023

I have this problem with torch 1.11 and CUDA 11.3 on an NVIDIA 4090.
Stack trace:

  File "/home/xxx/anaconda3/envs/conerf/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/.../model/render_utils.py", line 772, in render_sdf_image_with_xxx
    weights, _ = render_weight_from_alpha(
  File "/home/xxx/anaconda3/envs/conerf/lib/python3.9/site-packages/nerfacc/volrend.py", line 305, in render_weight_from_alpha
    trans = render_transmittance_from_alpha(
  File "/home/xxx/anaconda3/envs/conerf/lib/python3.9/site-packages/nerfacc/volrend.py", line 201, in render_transmittance_from_alpha
    packed_info = pack_info(ray_indices, n_rays)
  File "/home/xxx/anaconda3/envs/conerf/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxx/anaconda3/envs/conerf/lib/python3.9/site-packages/nerfacc/pack.py", line 43, in pack_info
    chunk_cnts = torch.zeros((n_rays,), device=device, dtype=dtype)
RuntimeError: CUDA error: invalid configuration argument

@AIBluefisher

I'm here to contribute more data points. I encountered the same issue with torch 1.11 and CUDA 11.3 on an NVIDIA 4090. The issue always happens in the render_image() function during validation. Versions of nerfacc I have tried so far: 0.3.2, 0.3.3, 0.3.4, and 0.3.5.

@ahmadki commented Jul 27, 2023

I've encountered the same problem, and after two days of debugging, I believe I've figured it out.

The error is related to neither the GPU model nor the CUDA version. This is not even a NerfAcc bug.

The error message that I see is similar to the one reported by @AIBluefisher:

File "/home/xxx/anaconda3/envs/conerf/lib/python3.9/site-packages/nerfacc/pack.py", line 43, in pack_info
    chunk_cnts = torch.zeros((n_rays,), device=device, dtype=dtype)
RuntimeError: CUDA error: invalid configuration argument

But CUDA entered an invalid state before this point, and torch.zeros is likely just the first CUDA kernel called after the error. Unfortunately, setting CUDA_LAUNCH_BLOCKING=1 doesn't help pinpoint the source of the issue.

I added torch.zeros((1000,), device="cuda", dtype=torch.int64) in different parts of my code to trace back to the function that was putting CUDA into this state. In my case it was the TorchNGP GridEncoder, specifically when it is invoked with an empty positions tensor (positions.shape=[0, 3]).
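
For reference, a minimal sketch of that kind of probe (the helper name and error message are just illustrative):

import torch

def cuda_probe(tag: str) -> None:
    # Launch a tiny kernel and synchronize so that any sticky error left by an
    # earlier kernel surfaces here, at a known location, instead of at some
    # unrelated later call such as the torch.zeros inside pack_info.
    try:
        torch.zeros((1000,), device="cuda", dtype=torch.int64)
        torch.cuda.synchronize()
    except RuntimeError as err:
        raise RuntimeError(f"CUDA was already in a bad state before: {tag}") from err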

The positions tensor is empty because the OccGridEstimator was being called before training and before any call to update_every_n_steps, which leaves the binary grid all zeros (in other words, the OccGridEstimator skips all rays). This is particularly common with PyTorch Lightning, where the framework runs sanity-check iterations before training.

As a fix, I simply bypassed the NeRF / TorchNGP when the input positions are empty:

if positions.shape[0] == 0:
    sigma = torch.zeros(0, device=positions.device)
    color = torch.zeros(0, 3, device=positions.device)
    return sigma, color

Alternatively, it might be worth exploring a call to update_every_n_steps right after initializing the OccGridEstimator.
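
A rough sketch of that alternative, assuming the nerfacc 0.5.x OccGridEstimator API (the aabb, resolution, and occ_eval_fn below are placeholders, not values taken from this repo's examples):

import torch
from nerfacc import OccGridEstimator

device = "cuda"
estimator = OccGridEstimator(
    roi_aabb=[-1.5, -1.5, -1.5, 1.5, 1.5, 1.5], resolution=128, levels=1
).to(device)

def occ_eval_fn(x: torch.Tensor) -> torch.Tensor:
    # Placeholder density query; in practice this would be something like
    # radiance_field.query_density(x) * render_step_size.
    return torch.ones_like(x[..., :1])

# step=0 triggers an update immediately (step % n == 0), so the binary grid is
# populated before any sanity-check validation pass runs.
estimator.update_every_n_steps(step=0, occ_eval_fn=occ_eval_fn, occ_thre=1e-2)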

@weihan1 commented Aug 8, 2023

Hi, I also encountered this issue:

File "/miniconda3/envs/multiview/lib/python3.8/site-packages/nerfacc/vol_rendering.py", line 338, in render_visibility visibility, packed_info_visible = _C.rendering_alphas_forward( File "miniconda3/envs/multiview/lib/python3.8/site-packages/nerfacc/cuda/__init__.py", line 13, in call_cuda return getattr(_C, name)(*args, **kwargs) RuntimeError: CUDA error: invalid configuration argument

I'm using PyTorch 1.12.1 with CUDA 11.6, CUDA compilation tools release 11.4, on an NVIDIA RTX A6000.

@Rlee719 commented Sep 21, 2023

Same problem in version 0.5.2: volrend.accumulate_along_rays cannot handle empty input.
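
A minimal sketch of a guard around that call until empty inputs are handled upstream (the wrapper name is illustrative; shapes follow the flattened-samples convention where weights is (n_samples,) and rgbs is (n_samples, 3)):

import torch
from nerfacc import volrend

def safe_accumulate(weights: torch.Tensor, rgbs: torch.Tensor,
                    ray_indices: torch.Tensor, n_rays: int) -> torch.Tensor:
    # If the occupancy grid produced no samples at all, skip the call entirely
    # and return zero-valued outputs for every ray instead.
    if ray_indices.numel() == 0:
        return torch.zeros(n_rays, rgbs.shape[-1], device=weights.device)
    return volrend.accumulate_along_rays(
        weights, values=rgbs, ray_indices=ray_indices, n_rays=n_rays
    )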

@stevehan00

I faced the same issue using torch 1.12.1 + CUDA 11.3 on an RTX 4090, but it was solved by switching to torch 1.13.1 + CUDA 11.6.

@mashad98 commented Apr 7, 2024

Does anyone have an idea how I can solve this? I am not a developer but a designer using SVD.
