About training on COCO #8

Closed
BruceLeeeee opened this issue Jul 1, 2019 · 9 comments

BruceLeeeee commented Jul 1, 2019

Hi, thanks for your work. I tried to train on the COCO dataset, changing only the dataset in the default config, but I encountered the following error:

07-01 15:14:24 Initialize saver ...
07-01 15:14:27 Initialize all variables ...
07-01 15:14:39 Initialized model weights from /root/lsh2/PoseFix_RELEASE/main/../data/imagenet_weights/resnet_v1_152.ckpt ...
07-01 15:14:55 Start training ...
2019-07-01 15:15:19.420659: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at scatter_nd_op.cc:119 : Invalid argument: indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
	 [[{{node tower_0/ScatterNd}} = ScatterNd[T=DT_FLOAT, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/Cast, tower_0/concat_8, tower_0/ScatterNd/shape)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/lsh2/PoseFix_RELEASE/main/train.py", line 31, in <module>
    trainer.train()
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 449, in train
    [self.graph_ops[0], self.lr, *self.summary_dict.values()], feed_dict=feed_dict)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
	 [[node tower_0/ScatterNd (defined at /root/lsh2/PoseFix_RELEASE/main/model.py:108)  = ScatterNd[T=DT_FLOAT, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/Cast, tower_0/concat_8, tower_0/ScatterNd/shape)]]

Caused by op 'tower_0/ScatterNd', defined at:
  File "/root/lsh2/PoseFix_RELEASE/main/train.py", line 30, in <module>
    trainer = Trainer(Model(), cfg)
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 195, in __init__
    super(Trainer, self).__init__(net, cfg, data_iter, log_name='train_logs.txt')
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 125, in __init__
    self.build_graph()
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 142, in build_graph
    self.graph_ops = self._make_graph()
  File "/root/lsh2/PoseFix_RELEASE/main/../lib/tfflat/base.py", line 382, in _make_graph
    self.net.make_network(is_train=True)
  File "/root/lsh2/PoseFix_RELEASE/main/model.py", line 156, in make_network
    self.render_onehot_heatmap(target_coord, cfg.output_shape),\
  File "/root/lsh2/PoseFix_RELEASE/main/model.py", line 108, in render_onehot_heatmap
    heatmap = tf.scatter_nd(indices, probs, (batch_size, *output_shape, cfg.num_kps))
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 7077, in scatter_nd
    "ScatterNd", indices=indices, updates=updates, shape=shape, name=name)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[3] = [0, 0, 159, 3] does not index into shape [16,96,72,17]
	 [[node tower_0/ScatterNd (defined at /root/lsh2/PoseFix_RELEASE/main/model.py:108)  = ScatterNd[T=DT_FLOAT, Tindices=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/Cast, tower_0/concat_8, tower_0/ScatterNd/shape)]]

Thanks for your time.
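
For readers following the traceback: the message says that the fourth index tuple, [0, 0, 159, 3], does not fit inside a tensor of shape [16, 96, 72, 17], because its third component (159) is not smaller than the third dimension (72). Below is a minimal pure-Python sketch of that kind of bounds check. The helper name `first_out_of_bounds` and the first three index tuples are hypothetical; only the failing tuple and the shape come from the log.

```python
def first_out_of_bounds(indices, shape):
    """Return (position, index) of the first index outside `shape`, or None."""
    for pos, index in enumerate(indices):
        # Every component must satisfy 0 <= i < dim for its dimension.
        if len(index) != len(shape) or any(
            i < 0 or i >= dim for i, dim in zip(index, shape)
        ):
            return pos, index
    return None


# Shape taken from the error message above.
shape = (16, 96, 72, 17)

# The first three tuples are made up for illustration;
# the last one is the tuple reported in the log.
indices = [[0, 0, 10, 0], [0, 0, 20, 1], [0, 0, 30, 2], [0, 0, 159, 3]]

print(first_out_of_bounds(indices, shape))
```

Running a check like this on the indices before they reach the scatter op is one way to find which coordinates fall outside the heatmap.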

coordxyz commented Jul 5, 2019

I had the same problem. Did you solve it, and if so, how? Thanks!

Owner

mks0601 commented Jul 5, 2019

Can you describe the problem in more detail? What did you change from the original code?

Owner

mks0601 commented Jul 5, 2019

Did you train the model on COCO and then try to test on another dataset?

@BruceLeeeee
Author

@bhyzhao @mks0601 I only changed Config.dataset to COCO. I think the error arises because the indices passed to tf.scatter_nd in render_onehot_heatmap() are not checked for out-of-bounds values. If I clip the values of target_coord, it works.

Owner

mks0601 commented Jul 9, 2019

Which version of TF are you using? In my experience, and according to the docs (https://www.tensorflow.org/api_docs/python/tf/scatter_nd), out-of-bounds indices are ignored on GPU.
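
For context, the tf.scatter_nd documentation linked above notes that an out-of-bounds index returns an error on CPU and is ignored on GPU, and the traceback earlier in this thread shows the ScatterNd node placed on `device:CPU:0`. The pure-Python sketch below imitates the two behaviours on a small 2-D grid. It is only an illustration under that assumption, not TensorFlow's implementation; the function `scatter_nd_2d` and its `strict` flag are hypothetical.

```python
def scatter_nd_2d(indices, updates, shape, strict):
    """Scatter-add `updates` into a zero grid of `shape` (rows, cols).

    strict=True  -> raise on an out-of-bounds index (CPU-like behaviour)
    strict=False -> silently skip the index (GPU-like behaviour)
    """
    rows, cols = shape
    out = [[0.0] * cols for _ in range(rows)]
    for (r, c), value in zip(indices, updates):
        if not (0 <= r < rows and 0 <= c < cols):
            if strict:
                raise IndexError(
                    f"index {[r, c]} does not index into shape {list(shape)}"
                )
            continue  # lenient mode: drop the update
        out[r][c] += value  # duplicate indices accumulate
    return out


indices = [(0, 1), (2, 5)]  # (2, 5) lies outside a 3x3 grid
updates = [1.0, 2.0]

lenient = scatter_nd_2d(indices, updates, (3, 3), strict=False)
print(lenient)

try:
    scatter_nd_2d(indices, updates, (3, 3), strict=True)
except IndexError as err:
    print("strict mode failed:", err)
```

If the op ends up on CPU even with a GPU build installed, the strict path is the one that applies, which would be consistent with the error reported here.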

@BruceLeeeee
Author

I have tested on tensorflow==1.12, tensorflow==1.14, and tensorflow-gpu==1.14, and all of them give the same error. According to the docs it should work, but I don't know why it doesn't.

Owner

mks0601 commented Jul 10, 2019

I also used 1.12 when implementing PoseFix. That is weird. Can you tell me how you clipped the coordinates?

Author

BruceLeeeee commented Jul 11, 2019

I am not sure whether this is correct. I think those out-of-bounds points are invalid, so clipping them would not affect the loss, right?

```python
def render_onehot_heatmap(self, coord, output_shape):
    batch_size = tf.shape(coord)[0]

    # Rescale keypoint coordinates from input resolution to heatmap resolution.
    x = tf.reshape(coord[:,:,0] / cfg.input_shape[1] * output_shape[1],[-1])
    y = tf.reshape(coord[:,:,1] / cfg.input_shape[0] * output_shape[0],[-1])
    x_floor = tf.floor(x)
    y_floor = tf.floor(y)

    # Clip so that both floor and floor+1 stay inside the heatmap.
    x_floor = tf.clip_by_value(x_floor, 0, output_shape[1] - 2)  # fix out-of-bounds x
    y_floor = tf.clip_by_value(y_floor, 0, output_shape[0] - 2)  # fix out-of-bounds y

    # Batch and joint indices, repeated once for each of the four corners.
    indices_batch = tf.expand_dims(tf.to_float(
        tf.reshape(
            tf.transpose(
                tf.tile(tf.expand_dims(tf.range(batch_size),0), [cfg.num_kps,1]),
                [1,0]),
            [-1])),1)
    indices_batch = tf.concat([indices_batch, indices_batch, indices_batch, indices_batch], axis=0)
    indices_joint = tf.to_float(tf.expand_dims(tf.tile(tf.range(cfg.num_kps),[batch_size]),1))
    indices_joint = tf.concat([indices_joint, indices_joint, indices_joint, indices_joint], axis=0)

    # Four neighbouring pixels: left-top, left-bottom, right-top, right-bottom.
    indices_lt = tf.concat([tf.expand_dims(y_floor,1), tf.expand_dims(x_floor,1)], axis=1)
    indices_lb = tf.concat([tf.expand_dims(y_floor+1,1), tf.expand_dims(x_floor,1)], axis=1)
    indices_rt = tf.concat([tf.expand_dims(y_floor,1), tf.expand_dims(x_floor+1,1)], axis=1)
    indices_rb = tf.concat([tf.expand_dims(y_floor+1,1), tf.expand_dims(x_floor+1,1)], axis=1)

    indices = tf.concat([indices_lt, indices_lb, indices_rt, indices_rb], axis=0)
    indices = tf.cast(tf.concat([indices_batch, indices, indices_joint], axis=1),tf.int32)

    # Bilinear weights for the four neighbouring pixels.
    prob_lt = (1 - (x - x_floor)) * (1 - (y - y_floor))
    prob_lb = (1 - (x - x_floor)) * (y - y_floor)
    prob_rt = (x - x_floor) * (1 - (y - y_floor))
    prob_rb = (x - x_floor) * (y - y_floor)
    probs = tf.concat([prob_lt, prob_lb, prob_rt, prob_rb], axis=0)

    heatmap = tf.scatter_nd(indices, probs, (batch_size, *output_shape, cfg.num_kps))
    normalizer = tf.reshape(tf.reduce_sum(heatmap,axis=[1,2]),[batch_size,1,1,cfg.num_kps])
    normalizer = tf.where(tf.equal(normalizer,0),tf.ones_like(normalizer),normalizer)
    heatmap = heatmap / normalizer

    return heatmap
```
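
To make the effect of the clipping concrete, here is a pure-Python sketch of the same corner and weight computation for a single keypoint. The helper `bilinear_corners` and the example coordinates are hypothetical; the clipping bounds and the weight formulas follow the snippet above. It suggests that clipping keeps all four indices inside the heatmap, but that for an out-of-bounds keypoint the weights are no longer in [0, 1], so such targets are only harmless if they are excluded from the loss.

```python
import math


def bilinear_corners(x, y, width, height):
    """Clipped corner indices and weights for one keypoint.

    Returns a list of ((row, col), weight) in the order
    left-top, left-bottom, right-top, right-bottom.
    """
    # Clip the floor so that floor and floor + 1 are both valid indices.
    x0 = min(max(math.floor(x), 0), width - 2)
    y0 = min(max(math.floor(y), 0), height - 2)
    dx, dy = x - x0, y - y0
    return [
        ((y0, x0), (1 - dx) * (1 - dy)),
        ((y0 + 1, x0), (1 - dx) * dy),
        ((y0, x0 + 1), dx * (1 - dy)),
        ((y0 + 1, x0 + 1), dx * dy),
    ]


WIDTH, HEIGHT = 72, 96  # heatmap size from the error message

inside = bilinear_corners(10.25, 20.5, WIDTH, HEIGHT)   # in-bounds keypoint
outside = bilinear_corners(159.0, 0.0, WIDTH, HEIGHT)   # x = 159 as in the log

for name, corners in (("inside", inside), ("outside", outside)):
    print(name, corners)
```

For the in-bounds point the four weights are ordinary bilinear weights that sum to 1. For the out-of-bounds point the indices are valid after clipping, but the weights include large positive and negative values.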

Owner

mks0601 commented Jul 12, 2019

Yes, they would not affect the loss, because there is also target_valid.
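
As a rough illustration of how a validity mask can keep such points out of the loss, here is a pure-Python sketch. It assumes that target_valid acts as a per-keypoint 0/1 flag; the function `masked_mse` and the example values are hypothetical, and the actual PoseFix loss may be computed differently.

```python
def masked_mse(pred, target, valid):
    """Mean squared error over the entries whose validity flag is set."""
    total, count = 0.0, 0
    for p, t, v in zip(pred, target, valid):
        if v:  # invalid entries contribute nothing
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


pred = [0.5, 0.25, 0.0]
target = [1.0, 0.25, 89.0]  # last target is a distorted, out-of-bounds value
valid = [1, 1, 0]           # ...but it is flagged as invalid

print(masked_mse(pred, target, valid))
```

With the last entry masked out, the distorted target value has no influence on the result.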
