minimal example running on NCS #8

Closed
matpalm opened this issue Sep 6, 2018 · 13 comments

Comments

@matpalm
Owner

commented Sep 6, 2018

cc @squeakus

finally have a version of this network running on the NCS :)

( this image was calculated from the stick )

[image: from_ncs]

currently all code is on a hacky branch ncs_poc just to prove things work, have to clean up a fair bit and merge everything back to master.

see this README for repro instructions

@squeakus

Contributor

commented Sep 6, 2018

Happy days! I'm delighted you have got it working! I am currently collecting another dataset but hope to get it running on the NCS soon.

@squeakus

Contributor

commented Sep 19, 2018

I finally got a chance to test it out tonight. It looks like it is working but I had to remove an encoding layer as my images are 640x480 and a 512 patch size is too big. I changed it to look like this:
input (?, 256, 256, 3) #196608
e1 (?, 127, 127, 16) #258064
e2 (?, 63, 63, 32) #127008
e3 (?, 31, 31, 64) #61504
e4 (?, 15, 15, 128) #28800
d1 (?, 31, 31, 64) #61504
d2 (?, 63, 63, 32) #127008
d3 (?, 127, 127, 16) #258064
logits (?, 127, 127, 1) #16129

Is that the correct way to do it?

Also I uncommented the modelTester code (I love me some stats) and got the following error:
full res test model...
WARNING: ncs_hacktastic
input (?, 480, 640, 3) #921600
e1 (?, 239, 319, 16) #1219856
e2 (?, 119, 159, 32) #605472
e3 (?, 59, 79, 64) #298304
e4 (?, 29, 39, 128) #144768
d1 (?, 59, 79, 64) #298304
d2 (?, 119, 159, 32) #605472
d3 (?, 239, 319, 16) #1219856
logits (?, 239, 319, 1) #76241
ValueError: logits and labels must have the same shape ((?, 239, 319, 1) vs (?, 127, 127, 1))

It seems to be picking up the image size, not the patch size. How do I best mix patches with the testing code?

@matpalm

Owner Author

commented Sep 19, 2018

yeah that (256,256) -> (127,127) all looks good.
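The sizes in the dump above are consistent with each encoder layer being an unpadded 3x3 convolution with stride 2. The 3x3 kernel matches the variable shapes listed later in this thread; the stride and padding are assumptions inferred from the numbers, not confirmed against the repo. A minimal sketch of that arithmetic, with `conv_out` and `encoder_shapes` as hypothetical helper names:

```python
def conv_out(size, kernel=3, stride=2):
    """Output length along one axis of an unpadded (VALID) convolution."""
    return (size - kernel) // stride + 1


def encoder_shapes(height, width, filters=(16, 32, 64, 128)):
    """Return (h, w, channels, num_values) for each assumed encoder layer."""
    shapes = []
    h, w = height, width
    for channels in filters:
        h, w = conv_out(h), conv_out(w)
        shapes.append((h, w, channels, h * w * channels))
    return shapes


# Compare against the e1..e4 rows of the 256x256 patch model dump.
for name, shape in zip(("e1", "e2", "e3", "e4"), encoder_shapes(256, 256)):
    print(name, shape)
```

Calling `encoder_shapes(480, 640)` gives the 239x319, 119x159, 59x79 and 29x39 sizes seen in the full-resolution dump, so the same rule appears to explain both cases.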

with respect to the (127,127) you're sadly hitting some hard coded stuff i have in there... it's this bit of code which is an explicit slice/reshape workaround for the size/shape of the 2d output being wrong

it's clumsy i know, but that could be configurable (until there's a fix..)

@squeakus

Contributor

commented Sep 20, 2018

ahh, would it be quicker for me to just crop the test images to 127,127 or will it work if I change the shape of the output?

@matpalm

Owner Author

commented Sep 20, 2018

changing the code to match your size would probably be the quickest...

@squeakus

Contributor

commented Sep 22, 2018

I finally got a bit of time this morning and managed to get it working from start to NCS finish! Unfortunately the results were not great. I went back to train.py and uncommented the test code to see how well the training was working. There was an issue with the training network set up for a certain patch size and the test network being used on the full image so I turned off the patches and changed the image shape in data.py to resize to 239x319. The network topology and labels now match up:
patch train model...
input (?, 480, 640, 3) #921600
e1 (?, 239, 319, 16) #1219856
e2 (?, 119, 159, 32) #605472
e3 (?, 59, 79, 64) #298304
e4 (?, 29, 39, 128) #144768
d1 (?, 59, 79, 64) #298304
d2 (?, 119, 159, 32) #605472
d3 (?, 239, 319, 16) #1219856
logits (?, 239, 319, 1) #76241

but when I run the training I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
[[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

so 240x320 is 76800 but I cannot see anywhere in the code where the tensor is being set to 19200 and I am starting to realise that tensorflow is difficult to debug to say the least!

Do you have any suggestions to see where this is getting set, or for debugging tensorflow models?
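One generic way to narrow down an error like this is to check sizes on the Python side, before the data reaches the graph, so the failure names the offending array. The sketch below assumes the labels pass through NumPy at some point in loading, which may not be how this repo's pipeline works; `checked_reshape` is a hypothetical helper, not something from the repo or from TensorFlow.

```python
import numpy as np


def checked_reshape(array, shape, name="tensor"):
    """Reshape, but fail with a message that names the array and both sizes."""
    expected = int(np.prod(shape))
    if array.size != expected:
        raise ValueError(
            "%s: got %d values with shape %s, cannot reshape to %s (%d values)"
            % (name, array.size, array.shape, tuple(shape), expected))
    return array.reshape(shape)


# Reproduce the mismatch from the error message with dummy data.
try:
    checked_reshape(np.zeros((120, 160)), (240, 320, 1), name="label")
except ValueError as err:
    print(err)
```

The message then reports the actual shape of the incoming array, which the TensorFlow error does not.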

@squeakus

Contributor

commented Sep 23, 2018

I thought it may have been the shape of my images so I resized them to match the patch size. I also resized my labels to 64x64 with nearest neighbour interpolation. I get the same error despite the dumped shapes of the models being identical.

So it works with the patch flag but not without. This means it is either something wrong with my labels or i'm missing something in xys_iterator. I tried tfdbg but it is hard to see what is going on....
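For reference, nearest neighbour resizing of a label mask, as mentioned above, can be sketched in plain NumPy. This is only an illustration under the assumption that labels are 2D (or height x width x channels) arrays; it is not the code used in the repo, and `resize_nearest` is a hypothetical name.

```python
import numpy as np


def resize_nearest(label, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = label.shape[:2]
    # For each output row/column, pick the source index it maps back to.
    rows = (np.arange(out_h) * in_h) // out_h
    cols = (np.arange(out_w) * in_w) // out_w
    return label[rows[:, None], cols]


# Dummy binary mask at the full image size, resized to the logits size.
label = np.zeros((480, 640), dtype=np.uint8)
label[100:200, 300:400] = 1
small = resize_nearest(label, 239, 319)
print(small.shape, np.unique(small))
```

Because values are copied rather than averaged, a binary mask stays binary after resizing, which is why nearest neighbour is a common choice for labels.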

@matpalm

Owner Author

commented Sep 24, 2018

yeah, it's been a nightmare to debug... i've also made this repo more complicated than it needs to be because i've been conflating two things: 1) running a patch batched model with fixed sized inference to run on the NCS and 2) training patch based and running on arbitrary sized output for my meta learning experiments. i should really move 2) into its own repo since it requires different things than 1) on the data pipeline.... but that's an aside...

are you trying to run with an output of (239,319) on the NCS? i recall having a problem where i couldn't get anything over (127,127) as output on the stick...

can you share a larger stack trace around the tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800 line ?

@squeakus

Contributor

commented Sep 25, 2018

It seemed to compile and run with 239,319 but the output was a mess. I will resize it if I hit the same 127,127 limitation.

Here is the stack trace and some additional information. Going by the error message, the tensor being passed has 19200 values (120 x 160) and the requested shape has 76800 (240 x 320), so I think the output of a particular layer might be the wrong size. I used the slim model analyzer to get more info but still cannot see anything wrong.

$ ./train.py --run $RUN --steps $STEPS --train-steps 1000 --train-image-dir $DATADIR/train/ --test-image-dir $DATADIR/test/ --label-dir $DATADIR/labels/ --no-use-batch-norm --no-use-skip-connections --width 640 --height 480 --label-rescale 0.25
opts Namespace(base_filter_size=8, batch_size=32, flip_left_right=False, height=480, label_dir='data/1850/labels/', label_rescale=0.25, learning_rate=0.001, no_use_batch_norm=True, no_use_skip_connections=True, patch_width_height=None, random_rotate=False, run='r2', secs=None, steps=2000, test_image_dir='data/1850/test/', train_image_dir='data/1850/train/', train_steps=1000, width=640)
len(rgb_filenames) 1401 NO CACHE
WARNING: ncs_hacktastic
patch train model...
input (?, 480, 640, 3) #921600
e1 (?, 239, 319, 16) #1219856
e2 (?, 119, 159, 32) #605472
e3 (?, 59, 79, 64) #298304
e4 (?, 29, 39, 128) #144768
d1 (?, 59, 79, 64) #298304
d2 (?, 119, 159, 32) #605472
d3 (?, 239, 319, 16) #1219856
logits (?, 239, 319, 1) #76241

Variables: name (type shape) [size]

train_test_model/e1/weights:0 (float32_ref 3x3x3x16) [432, bytes: 1728]
train_test_model/e1/biases:0 (float32_ref 16) [16, bytes: 64]
train_test_model/e2/weights:0 (float32_ref 3x3x16x32) [4608, bytes: 18432]
train_test_model/e2/biases:0 (float32_ref 32) [32, bytes: 128]
train_test_model/e3/weights:0 (float32_ref 3x3x32x64) [18432, bytes: 73728]
train_test_model/e3/biases:0 (float32_ref 64) [64, bytes: 256]
train_test_model/e4/weights:0 (float32_ref 3x3x64x128) [73728, bytes: 294912]
train_test_model/e4/biases:0 (float32_ref 128) [128, bytes: 512]
train_test_model/d1/weights:0 (float32_ref 3x3x64x128) [73728, bytes: 294912]
train_test_model/d1/biases:0 (float32_ref 64) [64, bytes: 256]
train_test_model/d2/weights:0 (float32_ref 3x3x32x64) [18432, bytes: 73728]
train_test_model/d2/biases:0 (float32_ref 32) [32, bytes: 128]
train_test_model/d3/weights:0 (float32_ref 3x3x16x32) [4608, bytes: 18432]
train_test_model/d3/biases:0 (float32_ref 16) [16, bytes: 64]
train_test_model/d4/weights:0 (float32_ref 3x3x16x1) [144, bytes: 576]
train_test_model/d4/biases:0 (float32_ref 1) [1, bytes: 4]
Total size of variables: 194465
Total bytes of variables: 777860
2018-09-25 07:46:07.989531: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-25 07:46:07.989915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.34GiB
2018-09-25 07:46:07.989926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-09-25 07:46:08.140031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-25 07:46:08.140055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-09-25 07:46:08.140059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-09-25 07:46:08.140224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10009 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
[[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer/_11 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_66_train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./train.py", line 96, in <module>
sess.run(train_op)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/jonathan/.local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 19200 values, but the requested shape has 76800
[[Node: Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32](arg1, Reshape_1/shape)]]
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,480,640,3], [?,239,319,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer/_11 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_66_train_test_model/d1/Shape-0-0-VecPermuteNCHWToNHWC-LayoutOptimizer", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Exit 1
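The two numbers in the error can be checked against the command line above with a little arithmetic. One reading that fits, assuming `--label-rescale` scales each side of the 480x640 label, is that the loaded labels have 120 x 160 = 19200 values while the reshape asks for 240 x 320 = 76800, which is what a rescale of 0.5 would produce. This is inferred from the numbers only and has not been verified against the repo; `label_size` is a hypothetical helper.

```python
def label_size(height, width, rescale):
    """Label (h, w, num_values) if each side is scaled by `rescale`."""
    h, w = int(height * rescale), int(width * rescale)
    return h, w, h * w


print(label_size(480, 640, 0.25))  # (120, 160, 19200): values in the input tensor
print(label_size(480, 640, 0.5))   # (240, 320, 76800): values in the requested shape
print(239 * 319)                   # 76241: values in the logits, matching neither
```

If that reading is right, running with `--label-rescale 0.5` or changing the size the reshape expects would be the two places to look.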

@matpalm

Owner Author

commented Oct 2, 2018

thanks for waiting jono, i still haven't had a chance to look at this yet... hopefully this afternoon the planets will align for some free time :D

@squeakus

Contributor

commented Oct 3, 2018

No rush! I only get a chance to look at it at the weekend atm

@squeakus

Contributor

commented Nov 2, 2018

I am going to try rewriting the code over the weekend to work with my images. Is ncs_poc still the latest version, or should I be working off the master branch?

@matpalm

Owner Author

commented Nov 2, 2018

Yeah. I still haven't merged it back yet sorry (since it also needs some clean up) but it demonstrates the things I needed to do. Good luck!

@matpalm matpalm closed this Feb 19, 2019
