
tensorflow.python.framework.errors_impl.InvalidArgumentError error in synthesis step #11

Closed
cjw531 opened this issue Sep 14, 2021 · 3 comments

cjw531 commented Sep 14, 2021

Hi,
With your debugging help in the other issues, I was able to get to the last step.

I have 3 questions about this last step:

1. If I use a single 2080 Ti here (as you set gpus='0'), I get an OOM allocation error, so I assigned three 2080 Tis instead. Is this an acceptable approach? You did not seem to allow multiple GPUs for computing the geometry buffers. Also, should I consider using imh=256 instead of 512 to reduce memory usage? The error message is as follows:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[68361728,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Mul]
2. What I initially did was copy the whole script (steps 1, 2, and 3) for the final stage and run it with $ bash ./script.sh. However, this produced an error saying that the ckpt-2 and ckpt-10 files, which should already exist, could not be found. So I split it into three separate scripts and was able to get through the shape pre-training and joint optimization. I hope my way of running it did not cause the TensorFlow warning below:
The calling iterator did not fully read the dataset being cached. 
In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. 
This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. 
You should use `dataset.take(k).cache().repeat()` instead.
3. I am getting the following error in the very last step and cannot complete your hotdog example ("Simultaneous Relighting and View Synthesis (testing)"):
[test] Restoring trained model
[models/base] Trainable layers registered:
        ['net_normal_mlp_layer0', 'net_normal_mlp_layer1', 'net_normal_mlp_layer2', 'net_normal_mlp_layer3', 'net_normal_out_layer0', 'net_lvis_mlp_layer0', 'net_lvis_mlp_layer1', 'net_lvis_mlp_layer2', 'net_lvis_mlp_layer3', 'net_lvis_out_layer0']
[models/base] Trainable layers registered:
        ['net_brdf_mlp_layer0', 'net_brdf_mlp_layer1', 'net_brdf_mlp_layer2', 'net_brdf_mlp_layer3', 'net_brdf_out_layer0']
[models/base] Trainable layers registered:
        ['net_albedo_mlp_layer0', 'net_albedo_mlp_layer1', 'net_albedo_mlp_layer2', 'net_albedo_mlp_layer3', 'net_albedo_out_layer0', 'net_brdf_z_mlp_layer0', 'net_brdf_z_mlp_layer1', 'net_brdf_z_mlp_layer2', 'net_brdf_z_mlp_layer3', 'net_brdf_z_out_layer0', 'net_normal_mlp_layer0', 'net_normal_mlp_layer1', 'net_normal_mlp_layer2', 'net_normal_mlp_layer3', 'net_normal_out_layer0', 'net_lvis_mlp_layer0', 'net_lvis_mlp_layer1', 'net_lvis_mlp_layer2', 'net_lvis_mlp_layer3', 'net_lvis_out_layer0']
[test] Running inference
Inferring Views:   0%|                                                     | 0/200 [00:00<?, ?it/s]
2021-09-14 01:46:33.905210: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-09-14 01:47:05.401366: W tensorflow/core/kernels/data/cache_dataset_ops.cc:794] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Inferring Views:   0%|                                                     | 0/200 [02:22<?, ?it/s]
Traceback (most recent call last):
  File "/home/jiwonchoi/code/nerfactor/nerfactor/test.py", line 209, in <module>
    app.run(main)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/jiwonchoi/code/nerfactor/nerfactor/test.py", line 192, in main
    brdf_z_override=brdf_z_override)
  File "/home/jiwonchoi/code/nerfactor/nerfactor/models/nerfactor.py", line 266, in call
    relight_probes=relight_probes)
  File "/home/jiwonchoi/code/nerfactor/nerfactor/models/nerfactor.py", line 362, in _render
    rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1606, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1181, in concat_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/home/jiwonchoi/.conda/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat

It seems that the line rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1) is what causes the issue, but I am not sure how to debug this.
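Judging by the N=0 in the ConcatV2 message, my guess is that rgb_probes is an empty list at this point, so there is nothing to concatenate. A minimal guard I am thinking of adding right before that line to confirm this (the check and its message are my own addition; rgb_probes and tf are the names already used in nerfactor/models/nerfactor.py):

# In _render() of nerfactor/models/nerfactor.py, just before the failing concat.
# Purely for debugging: turns the opaque ConcatV2 error into an explicit message
# when no relit results were produced (e.g., because no light probes were loaded).
if not rgb_probes:
    raise ValueError(
        "rgb_probes is empty -- no relighting results were rendered; "
        "check that the test light probes were found and loaded")
rgb_probes = tf.concat([x[:, None, :] for x in rgb_probes], axis=1)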

Thank you in advance.

@XiaoKangW

@cjw531 I also have the same problem as you. I had to adjust the batch size, but there are still memory problems.

@Jiangyu1181

@cjw531 I also have the same problem as you, but no_batch=True is set, so I can't change the batch size.

@hdupuyang

It seems that you haven't downloaded the light probes. You can download them from the author's project page, and then set the 'test_envmap_dir' term in lr5e-3.ini.
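For example, something along these lines in lr5e-3.ini (the path is just a placeholder for wherever you put the downloaded probes; check the neighboring entries in the file for the exact format):

test_envmap_dir = /path/to/downloaded/light-probes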
