Question about running 'training_S3DIS.py' #11
Comments
This error seems to happen because the validation projection indices are wrong. I changed the implementation of these indices recently. Did you try the latest version of the code? On my computer, I started from scratch with the current implementation and did not get this error. If you want to retry, you should delete the Best,
Hi, @HuguesTHOMAS Sorry for overlooking your updated code. I will try again following your advice. Thank you so much again. Best,
Hi @HuguesTHOMAS, I ran into a similar problem in the middle of training (at around iteration 20000) with the latest code. I didn't modify the default code. Would you mind helping us figure out how to solve the problem? Many thanks~ Best,
Hi @HuguesTHOMAS, I also have the same problem when running validation with the S3DIS dataset. I pulled your changes today, deleted the precomputed files, and ran it again. The problem still persists. Thanks again. My setup:
P.S. (off topic) I saw that you have reported a bug with 1.13.0 and CUDA 10 regarding matrix multiplications. Fortunately, I did not encounter it on my setup with the above-mentioned software versions:
@nejcd, @kentangSJTU, |
Hi, @HuguesTHOMAS, sorry for not making the question clear.
In the main thread, the error happens in layer_0, but in my case it happens in layer_2, so it seems to be different. The second part occurs during handling of the above exception, and its error message is very similar to what is reported above. The last part is the same as what is reported in the main thread, "IndexError: arrays used as indices must be of integer (or boolean) type", with exactly the same line number (i.e. Line 806 in trainer.py). The loss didn't become NaN, and I use TF 1.12.0 + CUDA 9.0, as suggested by the official documentation. Thanks a lot for your reply~ I have an update: by temporarily disabling the code around Line 806 of trainer.py, I am able to train the model normally. But in testing, the same error happens again, when the script is calculating Reprojection Vote #15. Thus, I believe this error is related to testing rather than training. Best,
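For readers puzzling over the multi-part traceback: Python prints "During handling of the above exception, another exception occurred" whenever a new exception is raised inside an `except` block. The sketch below is a pure-Python analogy, not KPConv code. `run_epoch` and `validate` are hypothetical names, and `StopIteration` stands in for TensorFlow's `OutOfRangeError`. It assumes, as the traceback suggests, that the trainer treats "End of sequence" as the normal end of an epoch and runs validation from inside that handler.

```python
def validate(indices):
    # Stand-in for the validation step. Indexing a plain list with a float
    # raises TypeError (NumPy raises IndexError in the analogous case).
    preds = [10, 20, 30]
    return [preds[i] for i in indices]


def run_epoch(batches, val_indices):
    it = iter(batches)
    steps = 0
    while True:
        try:
            next(it)          # one "training step"
            steps += 1
        except StopIteration:
            # End of epoch: harmless by itself. Validation runs inside the
            # handler, so any error it raises is chained to StopIteration
            # and both tracebacks are printed.
            return steps, validate(val_indices)


print(run_epoch([1, 2, 3], [0, 2]))   # validation succeeds

try:
    run_epoch([1, 2, 3], [0.0])       # validation fails inside the handler
except TypeError as err:
    print("real error:", err)
    print("chained to:", type(err.__context__).__name__)
```

Under that assumption, the `OutOfRangeError` parts of the traceback are only context, and the exception that actually stops the run is the final `IndexError`.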
@kentangSJTU, Thank you for the details. It seems that there is a problem with the reprojection indices, which is not surprising, as I changed this part of the code very recently. As this happens in the middle of the training, it could be caused by a particular input, for example one with empty reprojection indices or something similar that is not handled well. I am going to run the code myself and see if I can find what causes the error. Best,
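One way an empty set of reprojection indices could produce exactly this message: NumPy gives an empty array created without an explicit dtype the default dtype `float64`, and float arrays are rejected as indices even when they contain no elements. The sketch below illustrates that behaviour together with one possible guard. It is not the fix applied in the repository. `reproject` is a hypothetical helper, and the idea that the stored indices were empty float arrays is only a guess based on the comment above.

```python
import numpy as np


def reproject(sub_preds, proj_inds):
    """Map predictions on subsampled points back to the original points.

    Hypothetical helper: casts the indices to an integer dtype first, so an
    empty float64 array (NumPy's default for np.array([])) does not raise.
    """
    proj_inds = np.asarray(proj_inds)
    if proj_inds.dtype.kind not in "iub":
        # Refuse to silently truncate genuinely fractional indices.
        if proj_inds.size and not np.all(proj_inds == np.floor(proj_inds)):
            raise ValueError("projection indices are not whole numbers")
        proj_inds = proj_inds.astype(np.int64)
    return sub_preds[proj_inds].astype(np.int32)


sub_preds = np.array([3, 1, 4, 1, 5])
empty = np.array([])              # dtype is float64, not an integer type

try:
    sub_preds[empty]
except IndexError as err:
    print("plain indexing failed:", err)

print(reproject(sub_preds, empty).shape)   # empty result, no exception
print(reproject(sub_preds, [4, 0, 0]))
```

A quick check on an affected setup would be to print the `dtype` of each entry of `dataset.validation_proj` before line 806 of trainer.py runs.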
The bug has been fixed. The validation and test should work now on all datasets, but you will have to delete the |
Thanks a lot for your reply~ |
@HuguesTHOMAS, as I understand it, NaNs start to appear during training and it is not possible to train with an affected version? All the training runs I have done converged nicely, so I assumed that the version and hardware (GPU GTX 1080ti) I have are not affected. If I am missing something, please let me know, or tell me how I should test it.
Hi, @HuguesTHOMAS ,
Firstly, thanks for your great work on KPConv. I ran into a problem when running 'training_S3DIS.py'. The error output is below:
Traceback (most recent call last):
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
return fn(*args)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,3], [?,3], [?,3], [?,3], [?,3], ..., [?], [?,3], [?,3,3], [?], [?]], output_types=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: optimizer/gradients/KernelPointNetwork/layer_0/resnetb_1/conv2/concat_1_grad/GatherV2_2/axis/_222 = _HostSendT=DT_INT32, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1469_...rV2_2/axis", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hwk/KPConv/utils/trainer.py", line 261, in train
_, L_out, L_reg, L_p, probs, labels, acc = self.sess.run(ops, {model.dropout_prob: 0.5})
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,3], [?,3], [?,3], [?,3], [?,3], ..., [?], [?,3], [?,3,3], [?], [?]], output_types=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: optimizer/gradients/KernelPointNetwork/layer_0/resnetb_1/conv2/concat_1_grad/GatherV2_2/axis/_222 = _HostSendT=DT_INT32, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1469_...rV2_2/axis", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Caused by op 'IteratorGetNext', defined at:
File "training_S3DIS.py", line 213, in <module>
dataset.init_input_pipeline(config)
File "/home/hwk/KPConv/datasets/common.py", line 749, in init_input_pipeline
self.flat_inputs = iter.get_next()
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 410, in get_next
name=name)), self._output_types,
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2069, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/hwk/anaconda3/envs/py3.5/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()
OutOfRangeError (see above for traceback): End of sequence
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,3], [?,3], [?,3], [?,3], [?,3], ..., [?], [?,3], [?,3,3], [?], [?]], output_types=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: optimizer/gradients/KernelPointNetwork/layer_0/resnetb_1/conv2/concat_1_grad/GatherV2_2/axis/_222 = _HostSendT=DT_INT32, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1469_...rV2_2/axis", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "training_S3DIS.py", line 244, in <module>
trainer.train(model, dataset)
File "/home/hwk/KPConv/utils/trainer.py", line 347, in train
self.cloud_validation_error(model, dataset)
File "/home/hwk/KPConv/utils/trainer.py", line 806, in cloud_validation_error
preds = (sub_preds[dataset.validation_proj[i_val]]).astype(np.int32)
IndexError: arrays used as indices must be of integer (or boolean) type
I am looking forward to your reply.