occupies the entire GPU #146

Open
Terry10086 opened this issue Nov 10, 2023 · 0 comments

Terry10086 commented Nov 10, 2023

Thank you for your brilliant work! I have a few questions.
Q1: When I run the following commands, training occupies the entire GPU even with a small batch_size, and evaluation fills up all of my GPUs.
python -m train --gin_configs=configs/blender_refnerf.gin --gin_bindings="Config.data_dir = '${DATA_DIR}/${SCENE}'" --gin_bindings="Config.checkpoint_dir = '${CHECKPOINT_DIR}'" --logtostderr
python -m eval --logtostderr --gin_configs=configs/blender_refnerf.gin --gin_bindings="Config.data_dir = './data/car'" --gin_bindings="Config.checkpoint_dir = './logs/shinyblender/car/checkpoint_240000'"
Is there a way to restrict it to a single GPU and have it allocate only the VRAM it actually needs, the way PyTorch does?
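
For reference, would the following be the right approach? From the JAX GPU memory allocation docs (nothing specific to this repo), restricting the visible devices and disabling preallocation should give behaviour closer to PyTorch's on-demand allocation; the GPU index below is just an example for my machine.
export CUDA_VISIBLE_DEVICES=0                # expose only one GPU to JAX
export XLA_PYTHON_CLIENT_PREALLOCATE=false   # allocate VRAM on demand instead of preallocating most of the card
# or cap the preallocated fraction instead:
# export XLA_PYTHON_CLIENT_MEM_FRACTION=.5
python -m train --gin_configs=configs/blender_refnerf.gin --gin_bindings="Config.data_dir = '${DATA_DIR}/${SCENE}'" --gin_bindings="Config.checkpoint_dir = '${CHECKPOINT_DIR}'" --logtostderr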

Q2: During evaluation I hit the error below. It only happens when the second GPU is partially occupied by other jobs, and it does not look like an out-of-memory (OOM) problem, since batch_size is only 1024. Is there a way to resolve this?
Here is the GPU usage (red marks the usage from my evaluation) and the reported errors. If GPU 1 is completely free, the evaluation runs normally.
[screenshot: GPU usage during evaluation]

(multinerf) yangtongyu@amax21-1:~/nerf/multinerf$ python -m eval --logtostderr  --gin_configs=configs/blender_refnerf.gin   --gin_bindings="Config.data_dir = './data/car'"   --gin_bindings="Config.checkpoint_dir = './logs/shinyblender/car/checkpoint_240000'" 
2023-11-11 01:46:55.062487: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-11 01:46:55.062513: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-11 01:46:55.062530: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-11 01:46:55.582288: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I1111 01:46:56.709442 140481812828288 xla_bridge.py:633] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I1111 01:46:56.710078 140481812828288 xla_bridge.py:633] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2023-11-11 01:47:25.524115: W external/xla/xla/service/gpu/nvptx_compiler.cc:679] The NVIDIA driver's CUDA version is 11.4 which is older than the ptxas CUDA version (11.8.89). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
I1111 01:47:35.500932 140481812828288 checkpoints.py:1062] Restoring orbax checkpoint from logs/shinyblender/car/checkpoint_240000
I1111 01:47:35.501093 140481812828288 type_handlers.py:233] OCDBT is initialized successfully.
I1111 01:47:35.503083 140481812828288 checkpointer.py:98] Restoring item from logs/shinyblender/car/checkpoint_240000.
W1111 01:47:36.265129 140481812828288 transform_utils.py:229] The transformations API will eventually be replaced by an upgraded design. The current API will not be removed until this point, but it will no longer be actively worked on.
I1111 01:47:36.397814 140481812828288 checkpointer.py:100] Finished restoring checkpoint from logs/shinyblender/car/checkpoint_240000.
Evaluating checkpoint at step 240000.
Evaluating image 1/200
------------------------------------------------ 1024
Rendering chunk 0/624
2023-11-11 01:47:39.531097: E external/xla/xla/stream_executor/cuda/cuda_blas.cc:190] failed to create cublas handle: the library was not initialized
2023-11-11 01:47:39.531121: E external/xla/xla/stream_executor/cuda/cuda_blas.cc:193] Failure to initialize cublas may be due to OOM (cublas needs some free memory when you initialize it, and your deep-learning framework may have preallocated more than its fair share), or may be because this binary was not built with support for the GPU in your machine.
2023-11-11 01:47:39.531158: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 1 failed: INTERNAL: Failed to instantiate GPU graphs: Failed to initialize BLAS support
2023-11-11 01:47:49.498202: E external/xla/xla/service/rendezvous.cc:31] This thread has been waiting for 10 seconds and may be stuck:
2023-11-11 01:47:49.531332: F external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2862] Replicated computation launch failed, but not all replicas terminated. Aborting process to work around deadlock. Failure message (there may have been multiple failures, see the error log for all failures): 

Failed to instantiate GPU graphs: Failed to initialize BLAS support
Fatal Python error: Aborted

Thread 0x00007fc1fbfff700 (most recent call first):
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/threading.py", line 312 in wait
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/queue.py", line 140 in put
  File "/home/yangtongyu/nerf/multinerf/internal/datasets.py", line 361 in run
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/threading.py", line 980 in _bootstrap_inner
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/threading.py", line 937 in _bootstrap

Current thread 0x00007fc4788d1080 (most recent call first):
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/interpreters/pxla.py", line 1152 in __call__
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/profiler.py", line 340 in wrapper
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/api.py", line 1794 in cache_miss
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 177 in reraise_with_filtered_traceback
  File "/home/yangtongyu/nerf/multinerf/internal/models.py", line 674 in render_image
  File "/home/yangtongyu/nerf/multinerf/eval.py", line 101 in main
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 254 in _run_main
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/site-packages/absl/app.py", line 308 in run
  File "/home/yangtongyu/nerf/multinerf/eval.py", line 263 in <module>
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 87 in _run_code
  File "/home/yangtongyu/software/anaconda3/envs/multinerf/lib/python3.9/runpy.py", line 197 in _run_module_as_main
Aborted (core dumped)
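
If it helps, the workaround I am considering is to hide the partially occupied GPU before launching eval, so that (as far as I can tell) the pmap replicas only land on the fully free device; the device index below is just a placeholder for my setup.
CUDA_VISIBLE_DEVICES=0 python -m eval --logtostderr --gin_configs=configs/blender_refnerf.gin --gin_bindings="Config.data_dir = './data/car'" --gin_bindings="Config.checkpoint_dir = './logs/shinyblender/car/checkpoint_240000'"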

Q3: What is the difference between render_chunk_size and batch_size in configs.py? Does render_chunk_size act as the batch size during evaluation?
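
My current guess is that batch_size is the number of rays sampled per training step, while render_chunk_size only controls how many rays are pushed through the model at a time when rendering full images, so it should affect memory but not results. If that is right, would a binding like the one below (value chosen arbitrarily) be the supported way to reduce memory during eval?
python -m eval --logtostderr --gin_configs=configs/blender_refnerf.gin --gin_bindings="Config.render_chunk_size = 4096" --gin_bindings="Config.data_dir = './data/car'" --gin_bindings="Config.checkpoint_dir = './logs/shinyblender/car/checkpoint_240000'"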
