OOM when validating #32

Open
huangfuyang opened this issue Jul 18, 2022 · 6 comments

@huangfuyang

Each time I run the lego or fox sample, the GPU memory warning pops up immediately when it starts validating. Memory usage during training is about 5400 MB out of 12 GB, but validation fills up the memory and the whole training run gets suspended as a result. Is this normal, or is there something I can do to reduce the memory usage?

It prints:
[w 0718 17:48:56.120434 88 cuda_device_allocator.cc:29] Unable to alloc cuda device memory, use unify memory instead. This may cause low performance.

THANKS A LOT

@Gword
Collaborator

Gword commented Jul 19, 2022

Generally, the memory usage of the lego dataset is about 8 GB. You can pull the latest version and try it, or you can try reducing n_rays_per_batch in ngp_base.py to lower the memory usage during validation.
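
For reference, this is a one-line change. A minimal sketch, assuming n_rays_per_batch is a plain top-level setting in projects/ngp/configs/ngp_base.py (the numbers below are only illustrative):

```python
# projects/ngp/configs/ngp_base.py -- sketch, only the relevant line shown.
# Fewer rays per batch means smaller per-batch buffers during validation
# and rendering, at the cost of more batches per image.
n_rays_per_batch = 2048  # e.g. halved from an assumed default of 4096
```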

@huangfuyang
Author

> Generally, the memory usage of the lego dataset is about 8 GB. You can pull the latest version and try it, or you can try reducing n_rays_per_batch in ngp_base.py to lower the memory usage during validation.

Thanks for your suggestion. I removed the validation stage and training seems fine now, but the rendered result is messy (see attached image lego_r_14). Any idea what could cause this?

@Gword
Collaborator

Gword commented Jul 19, 2022

What is your operating system, GPU and CUDA version?

@huangfuyang
Author

> What is your operating system, GPU and CUDA version?

Ubuntu 18.04, GTX 1080 Ti, CUDA 11.2

@LeoRainly

> Generally, the memory usage of the lego dataset is about 8 GB. You can pull the latest version and try it, or you can try reducing n_rays_per_batch in ngp_base.py to lower the memory usage during validation.

I also hit this when training:
[i 0414 07:15:02.026124 56 compiler.py:955] Jittor(1.3.7.13) src: /usr/local/lib/python3.8/dist-packages/jittor
[i 0414 07:15:02.032066 56 compiler.py:956] g++ at /usr/bin/g++(9.4.0)
[i 0414 07:15:02.032149 56 compiler.py:957] cache_path: /root/.cache/jittor/jt1.3.7/g++9.4.0/py3.8.10/Linux-3.10.0-1x47/IntelRXeonRSilxb7/default
[i 0414 07:15:02.038467 56 init.py:411] Found nvcc(11.3.109) at /usr/local/cuda/bin/nvcc.
[i 0414 07:15:02.045088 56 init.py:411] Found addr2line(2.34) at /usr/bin/addr2line.
[i 0414 07:15:02.432493 56 compiler.py:1010] cuda key:cu11.3.109_sm_75
[i 0414 07:15:02.667230 56 init.py:227] Total mem: 62.39GB, using 16 procs for compiling.
[i 0414 07:15:02.872833 56 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0414 07:15:03.160802 56 init.cc:62] Found cuda archs: [75,]
[i 0414 07:15:03.355404 56 compile_extern.py:522] mpicc not found, distribution disabled.
[i 0414 07:15:05.770059 56 init.py:6] JNeRF(0.1.3.0) at /data3/liuyu/develop/JNeRF/python/jnerf
[i 0414 07:15:05.990121 56 cuda_flags.cc:39] CUDA enabled.
Loading config from: ./projects/ngp/configs/ngp_base.py
load train data
100%| 200/200 [00:07<00:00, 26.77it/s]
load val data
100%| 10/10 [00:01<00:00, 6.31it/s]
 10%| 4094/40000 [00:50<07:12, 82.97it/s]
/data3/liuyu/develop/JNeRF/python/jnerf/runner/runner.py:191: RuntimeWarning: invalid value encountered in cast
ndarr = (img*255+0.5).clip(0, 255).astype('uint8')
STEP=4096 | LOSS=nan | VAL PSNR=nan
 20%| 8192/40000 [01:53<07:02, 75.34it/s] STEP=8192 | LOSS=nan | VAL PSNR=nan
 31%| 12288/40000 [02:58<06:02, 76.48it/s] STEP=12288 | LOSS=nan | VAL PSNR=nan
 41%| 16377/40000 [04:08<05:30, 71.40it/s] STEP=16384 | LOSS=nan | VAL PSNR=nan
 51%| 20475/40000 [05:26<05:19, 61.18it/s] STEP=20480 | LOSS=nan | VAL PSNR=nan
 61%| 24572/40000 [06:53<04:34, 56.16it/s] STEP=24576 | LOSS=nan | VAL PSNR=nan
 72%| 28670/40000 [08:48<04:37, 40.85it/s] STEP=28672 | LOSS=0.08867161720991135 | VAL PSNR=15.269938468933105
 82%| 32767/40000 [10:40<02:56, 40.92it/s] STEP=32768 | LOSS=0.08883365988731384 | VAL PSNR=10.358024597167969
 92%| 36863/40000 [12:31<01:42, 30.54it/s] STEP=36864 | LOSS=0.08840325474739075 | VAL PSNR=10.357995986938477
100%| 40000/40000 [13:57<00:00, 47.79it/s]
load test data
100%| 200/200 [00:07<00:00, 27.13it/s]
rendering testset...
 0%| 0/200 [00:00<?, ?it/s]
[w 0414 07:29:31.006825 56 cuda_device_allocator.cc:30] Unable to alloc cuda device memory, use unify memory instead. This may cause low performance.
[i 0414 07:29:31.006853 56 cuda_device_allocator.cc:32]
=== display_memory_info ===
total_cpu_ram: 62.39GB total_device_ram: 7.795GB
hold_vars: 96 lived_vars: 243 lived_ops: 158
name: sfrl is_device: 1 used: 1.546GB(49.2%) unused: 1.597GB(50.8%) total: 3.144GB
name: sfrl is_device: 1 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 81.33MB(80.5%) unused: 19.67MB(19.5%) total: 101MB
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 11.11GB gpu: 7.076GB cpu: 4.031GB
free: cpu( 11.7GB) gpu(209.4MB)
swap: total( 0 B) last( 0 B)
[w 0414 07:29:31.008137 56 cuda_device_allocator.cc:30] Unable to alloc cuda device memory, use unify memory instead. This may cause low performance.
[i 0414 07:29:31.008161 56 cuda_device_allocator.cc:32]
=== display_memory_info ===
total_cpu_ram: 62.39GB total_device_ram: 7.795GB
hold_vars: 96 lived_vars: 236 lived_ops: 158
name: sfrl is_device: 1 used: 3.454GB(68.4%) unused: 1.598GB(31.6%) total: 5.052GB
name: sfrl is_device: 1 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 81.33MB(80.5%) unused: 19.67MB(19.5%) total: 101MB
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 13.02GB gpu: 8.984GB cpu: 4.031GB
free: cpu( 11.7GB) gpu(209.4MB)
swap: total( 0 B) last( 0 B)
100%| 200/200 [05:19<00:00, 1.60s/it]
TOTAL TEST PSNR====11.312249183654785

But the result rendered from params.pkl is a video of nothing.
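
For what it's worth, the RuntimeWarning at runner.py:191 in the log above means the rendered float image already contains NaN/Inf before the uint8 cast, which is consistent with LOSS=nan. A minimal sketch of a guard one could put around that cast to fail loudly instead (the helper name to_uint8 is made up here; the cast expression is the one from the traceback):

```python
import numpy as np

def to_uint8(img: np.ndarray) -> np.ndarray:
    """Convert a float image in [0, 1] to uint8, failing loudly on NaN/Inf."""
    # If training diverged (LOSS=nan), the rendered image contains NaN/Inf,
    # and casting it straight to uint8 is what raises
    # "invalid value encountered in cast".
    if not np.isfinite(img).all():
        raise ValueError("rendered image contains NaN/Inf; training likely diverged")
    return (img * 255 + 0.5).clip(0, 255).astype('uint8')
```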

@ZhengHFei

Have you solved this problem? I have the same issue.
