OOM when validating #32

Open
huangfuyang opened this issue Jul 18, 2022 · 6 comments

@huangfuyang

Each time I run the lego or fox sample, the GPU memory warning pops up immediately when it starts validating. Memory usage during training is about 5400 MB out of 12 GB, but validation fills up the memory and the whole training run gets suspended as a result. Is this normal, or is there something I can do to reduce the memory usage?

It prints:
[w 0718 17:48:56.120434 88 cuda_device_allocator.cc:29] Unable to alloc cuda device memory, use unify memory instead. This may cause low performance.

THANKS A LOT

@Gword
Collaborator

Gword commented Jul 19, 2022

Generally, the memory usage of the lego dataset is about 8 GB. You can pull the latest version and try it, or you can try reducing n_rays_per_batch in ngp_base.py to lower the memory usage during validation.
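
For reference, this is a one-line change. A minimal sketch, assuming n_rays_per_batch is a plain top-level setting in projects/ngp/configs/ngp_base.py (the numbers below are only illustrative):

```python
# projects/ngp/configs/ngp_base.py -- sketch, only the relevant line shown.
# Fewer rays per batch means smaller per-batch buffers during validation
# and rendering, at the cost of more batches per image.
n_rays_per_batch = 2048  # e.g. halved from an assumed default of 4096
```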

@huangfuyang
Author

> Generally, the memory usage of the lego dataset is about 8 GB. You can pull the latest version and try it, or you can try reducing n_rays_per_batch in ngp_base.py to lower the memory usage during validation.

Thanks for your suggestion. I removed the validation stage and training seems fine now, but the rendered result is messy (see attached image lego_r_14). Any idea what could cause this?

@Gword
Collaborator

Gword commented Jul 19, 2022

What is your operating system, GPU and CUDA version?

@huangfuyang
Author

> What is your operating system, GPU and CUDA version?

Ubuntu 18.04, GTX 1080 Ti, CUDA 11.2

@LeoRainly

> Generally, the memory usage of the lego dataset is about 8 GB. You can pull the latest version and try it, or you can try reducing n_rays_per_batch in ngp_base.py to lower the memory usage during validation.

I also hit this when training:
[i 0414 07:15:02.026124 56 compiler.py:955] Jittor(1.3.7.13) src: /usr/local/lib/python3.8/dist-packages/jittor
[i 0414 07:15:02.032066 56 compiler.py:956] g++ at /usr/bin/g++(9.4.0)
[i 0414 07:15:02.032149 56 compiler.py:957] cache_path: /root/.cache/jittor/jt1.3.7/g++9.4.0/py3.8.10/Linux-3.10.0-1x47/IntelRXeonRSilxb7/default
[i 0414 07:15:02.038467 56 init.py:411] Found nvcc(11.3.109) at /usr/local/cuda/bin/nvcc.
[i 0414 07:15:02.045088 56 init.py:411] Found addr2line(2.34) at /usr/bin/addr2line.
[i 0414 07:15:02.432493 56 compiler.py:1010] cuda key:cu11.3.109_sm_75
[i 0414 07:15:02.667230 56 init.py:227] Total mem: 62.39GB, using 16 procs for compiling.
[i 0414 07:15:02.872833 56 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0414 07:15:03.160802 56 init.cc:62] Found cuda archs: [75,]
[i 0414 07:15:03.355404 56 compile_extern.py:522] mpicc not found, distribution disabled.
[i 0414 07:15:05.770059 56 init.py:6] JNeRF(0.1.3.0) at /data3/liuyu/develop/JNeRF/python/jnerf
[i 0414 07:15:05.990121 56 cuda_flags.cc:39] CUDA enabled.
Loading config from: ./projects/ngp/configs/ngp_base.py
load train data
100%| 200/200 [00:07<00:00, 26.77it/s]
load val data
100%| 10/10 [00:01<00:00, 6.31it/s]
 10%| 4094/40000 [00:50<07:12, 82.97it/s]
/data3/liuyu/develop/JNeRF/python/jnerf/runner/runner.py:191: RuntimeWarning: invalid value encountered in cast
ndarr = (img*255+0.5).clip(0, 255).astype('uint8')
STEP=4096 | LOSS=nan | VAL PSNR=nan
 20%| 8192/40000 [01:53<07:02, 75.34it/s] STEP=8192 | LOSS=nan | VAL PSNR=nan
 31%| 12288/40000 [02:58<06:02, 76.48it/s] STEP=12288 | LOSS=nan | VAL PSNR=nan
 41%| 16377/40000 [04:08<05:30, 71.40it/s] STEP=16384 | LOSS=nan | VAL PSNR=nan
 51%| 20475/40000 [05:26<05:19, 61.18it/s] STEP=20480 | LOSS=nan | VAL PSNR=nan
 61%| 24572/40000 [06:53<04:34, 56.16it/s] STEP=24576 | LOSS=nan | VAL PSNR=nan
 72%| 28670/40000 [08:48<04:37, 40.85it/s] STEP=28672 | LOSS=0.08867161720991135 | VAL PSNR=15.269938468933105
 82%| 32767/40000 [10:40<02:56, 40.92it/s] STEP=32768 | LOSS=0.08883365988731384 | VAL PSNR=10.358024597167969
 92%| 36863/40000 [12:31<01:42, 30.54it/s] STEP=36864 | LOSS=0.08840325474739075 | VAL PSNR=10.357995986938477
100%| 40000/40000 [13:57<00:00, 47.79it/s]
load test data
100%| 200/200 [00:07<00:00, 27.13it/s]
rendering testset...
 0%| 0/200 [00:00<?, ?it/s]
[w 0414 07:29:31.006825 56 cuda_device_allocator.cc:30] Unable to alloc cuda device memory, use unify memory instead. This may cause low performance.
[i 0414 07:29:31.006853 56 cuda_device_allocator.cc:32]
=== display_memory_info ===
total_cpu_ram: 62.39GB total_device_ram: 7.795GB
hold_vars: 96 lived_vars: 243 lived_ops: 158
name: sfrl is_device: 1 used: 1.546GB(49.2%) unused: 1.597GB(50.8%) total: 3.144GB
name: sfrl is_device: 1 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 81.33MB(80.5%) unused: 19.67MB(19.5%) total: 101MB
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 11.11GB gpu: 7.076GB cpu: 4.031GB
free: cpu( 11.7GB) gpu(209.4MB)
swap: total( 0 B) last( 0 B)
[w 0414 07:29:31.008137 56 cuda_device_allocator.cc:30] Unable to alloc cuda device memory, use unify memory instead. This may cause low performance.
[i 0414 07:29:31.008161 56 cuda_device_allocator.cc:32]
=== display_memory_info ===
total_cpu_ram: 62.39GB total_device_ram: 7.795GB
hold_vars: 96 lived_vars: 236 lived_ops: 158
name: sfrl is_device: 1 used: 3.454GB(68.4%) unused: 1.598GB(31.6%) total: 5.052GB
name: sfrl is_device: 1 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 3.912GB(99.5%) unused: 20.97MB(0.521%) total: 3.933GB
name: sfrl is_device: 0 used: 81.33MB(80.5%) unused: 19.67MB(19.5%) total: 101MB
name: temp is_device: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: temp is_device: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 13.02GB gpu: 8.984GB cpu: 4.031GB
free: cpu( 11.7GB) gpu(209.4MB)
swap: total( 0 B) last( 0 B)
100%| 200/200 [05:19<00:00, 1.60s/it]
TOTAL TEST PSNR====11.312249183654785

But the result rendered from params.pkl is a video of nothing.
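
For what it's worth, the RuntimeWarning at runner.py:191 in the log above means the rendered float image already contains NaN/Inf before the uint8 cast, which is consistent with LOSS=nan. A minimal sketch of a guard one could put around that cast to fail loudly instead (the helper name to_uint8 is made up here; the cast expression is the one from the traceback):

```python
import numpy as np

def to_uint8(img: np.ndarray) -> np.ndarray:
    """Convert a float image in [0, 1] to uint8, failing loudly on NaN/Inf."""
    # If training diverged (LOSS=nan), the rendered image contains NaN/Inf,
    # and casting it straight to uint8 is what raises
    # "invalid value encountered in cast".
    if not np.isfinite(img).all():
        raise ValueError("rendered image contains NaN/Inf; training likely diverged")
    return (img * 255 + 0.5).clip(0, 255).astype('uint8')
```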

@ZhengHFei

Have you solved this problem? I have the same issue.
