
tensorflow.python.framework.errors_impl.InvalidArgumentError error in synthesis step #11

Closed
cjw531 opened this issue Sep 14, 2021 · 3 comments

cjw531 commented Sep 14, 2021

Hi,
With your debugging help in the other issues, I was able to get to the last step.

I have 3 questions about this last step:

1. If I use a single 2080 Ti here (as you set gpus='0'), I get an OOM allocation error, so I assigned three 2080 Tis instead. Is this an acceptable approach? You did not seem to allow multiple GPUs for computing the geometry buffers. Also, should I consider using imh=256 instead of 512 to reduce memory usage? The error message is as follows:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[68361728,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Mul]
2. What I initially did was copy the whole script (steps 1, 2, and 3) for the final stage and run it with $ bash ./script.sh. However, this produced an error saying that the ckpt-2 and ckpt-10 files, which should already exist, could not be found. So I split it into three separate scripts and was able to get through the shape pre-training and joint optimization. I hope my way of running it did not cause the TensorFlow warning below:
The calling iterator did not fully read the dataset being cached. 
In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. 
This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. 
You should use `dataset.take(k).cache().repeat()` instead.
3. I am getting the following error in the very last step and cannot complete your hotdog example ("Simultaneous Relighting and View Synthesis (testing)"):
[test] Restoring trained model
[models/base] Trainable layers registered:
        ['net_normal_mlp_layer0', 'net_normal_mlp_layer1', 'net_normal_mlp_layer2', 'net_normal_mlp_layer3', 'net_normal_out_layer0', 'net_lvis_mlp_layer0', 'net_lvis_mlp_layer1', 'net_lvis_mlp_layer2', 'net_lvis_mlp_layer3', 'net_lvis_out_layer0']
[models/base] Trainable layers registered:
        ['net_brdf_mlp_layer0', 'net_brdf_mlp_layer1', 'net_brdf_mlp_layer2', 'net_brdf_mlp_layer3', 'net_brdf_out_layer0']
[models/base] Trainable layers registered:
        ['net_albedo_mlp_layer0', 'net_albedo_mlp_layer1', 'net_albedo_mlp_layer2', 'net_albedo_mlp_layer3', 'net_albedo_out_layer0', 'net_brdf_z_mlp_layer0', 'net_brdf_z_mlp_layer1', 'net_brdf_z_mlp_layer2', 'net_brdf_z_mlp_layer3', 'net_brdf_z_out_layer0', 'net_normal_mlp_layer0', 'net_normal_mlp_layer1', 'net_normal_mlp_layer2', 'net_normal_mlp_layer3', 'net_normal_out_layer0', 'net_lvis_mlp_layer0', 'net_lvis_mlp_layer1', 'net_lvis_mlp_layer2', 'net_lvis_mlp_layer3', 'net_lvis_out_layer0']
[test] Running inference
Inferring Views:   0%|                                                     | 0/200 [00:00<?, ?it/s]
2021-09-14 01:46:33.905210: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-09-14 01:47:05.401366: W tensorflow/core/kernels/data/cache_dataset_ops.cc:794] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Inferring Views:   0%|                                                     | 0/200 [02:22<?, ?it/s]
Traceback (most recent call last):
  File "/home/jiwonchoi/code/nerfactor/nerfactor/test.py", line 209, in <module>
    app.run(main)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/jiwonchoi/code/nerfactor/nerfactor/test.py", line 192, in main
    brdf_z_override=brdf_z_override)
  File "/home/jiwonchoi/code/nerfactor/nerfactor/models/nerfactor.py", line 266, in call
    relight_probes=relight_probes)
  File "/home/jiwonchoi/code/nerfactor/nerfactor/models/nerfactor.py", line 362, in _render
    rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1606, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1181, in concat_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat

It seems that the line rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1) is what causes the issue, but I am not sure how to debug this.
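Judging by the N=0 in the ConcatV2 message, my guess is that rgb_probes is an empty list at this point, so there is nothing to concatenate. A minimal guard I am thinking of adding right before that line to confirm this (the check and its message are my own addition; rgb_probes and tf are the names already used in nerfactor/models/nerfactor.py):

# In _render() of nerfactor/models/nerfactor.py, just before the failing concat.
# Purely for debugging: turns the opaque ConcatV2 error into an explicit message
# when no relit results were produced (e.g., because no light probes were loaded).
if not rgb_probes:
    raise ValueError(
        "rgb_probes is empty -- no relighting results were rendered; "
        "check that the test light probes were found and loaded")
rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1)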

Thank you in advance.

@XiaoKangW

@cjw531 I also have the same problem as you. I had to adjust the batch size, but there are still memory problems.

@Jiangyu1181

@cjw531 I also have the same problem as you, but no_batch=True is set, so I can't change the batch size.

@hdupuyang

It seems that you haven't downloaded the light probes. You can download them from the author's project page, and then set the 'test_envmap_dir' term in lr5e-3.ini.
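For example, something along these lines in lr5e-3.ini (the path is just a placeholder for wherever you put the downloaded probes; check the neighboring entries in the file for the exact format):

test_envmap_dir = /path/to/downloaded/light-probes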
