
Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis. #7

Closed
tomas-wood opened this issue Oct 22, 2018 · 9 comments


@tomas-wood

tomas-wood commented Oct 22, 2018

Getting this error when I run blocks_test.py, modules_test.py, and utils_tf_test.py.

2018-10-22 14:07:06.293160: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis. Error: Pack node (data_dicts_to_graphs_tuple/stack) axis attribute is out of bounds: 0

Was using tensorflow version 1.13.0-dev20181022.
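For context on what the failing stage does: judging by its name, RemoveStackStridedSliceSameAxis rewrites a stack followed by a strided slice along that same new axis back into the original operand. A minimal sketch of that identity in NumPy terms (illustration only, not the actual grappler code):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Stacking two arrays along a new axis and then slicing that same
# axis back out returns the original operand unchanged; the grappler
# stage rewrites this stack-then-slice pattern into a direct use of
# the input.
stacked = np.stack([a, b], axis=0)   # shape (2, 2)
assert np.array_equal(stacked[0], a)
assert np.array_equal(stacked[1], b)
print("stack/slice round-trip OK")
```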

@IMBurbank

How are you running the tests? What environment? What commands?

They all pass in my dev environment.

I ran the tests as follows:

Clone and enter repo

git clone https://github.com/deepmind/graph_nets.git
cd graph_nets/

I use docker images with all the dependencies included so I don't have to worry about system incompatibilities or version conflicts. If you have docker, you can try the Graph Nets images I'm currently hosting to see if it's an issue with your local dev environment.

# CPU Image
docker run --rm -u $(id -u):$(id -g) -p 8888:8888 -v $(pwd):/my-devel -it imburbank/graph_nets bash -l

# GPU image
docker run --rm --runtime=nvidia --user $(id -u):$(id -g) -p 8888:8888 -v $(pwd):/my-devel -it imburbank/graph_nets:latest-gpu bash -l

Then I ran each test

python graph_nets/tests/blocks_test.py
python graph_nets/tests/modules_test.py
python graph_nets/tests/utils_tf_test.py
...etc.

@tomas-wood
Author

Hi @IMBurbank thank you for commenting.

I was just cd-ing into graph_nets/tests and running python blocks_test.py after installing. I'm pulling your docker images right now and will try them out. Alright, I tried them out and it got pretty ugly.

I realized I had SSH'd into the wrong machine and had installed the TensorFlow binaries through pip instead of building them myself with bazel as I always do. Though in this case it seems like the new pip-installed TensorFlow binary isn't sending me the mangled stack traces my own build is.

Running with my own compiled binaries (no docker, no conda env, just Ubuntu 16.04) gave me something similar to your docker image.

2018-10-22 14:49:38.296960: I tensorflow/stream_executor/stream.cc:1960] stream 0x8cda6960 did not wait for stream: 0x18423e90
2018-10-22 14:49:38.296978: I tensorflow/stream_executor/stream.cc:4793] stream 0x8cda6960 did not memcpy host-to-device; source: 0x7fc8f2c00000
2018-10-22 14:49:38.297064: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so(+0x6ba3ee)[0x7fd10b6163ee]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd15e6f4390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fd15e34e428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fd15e35002a]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x4fadaa7)[0x7fd110eddaa7]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so(+0x5f75ff)[0x7fd10b5535ff]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x241)[0x7fd10b5ee581]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x37)[0x7fd10b5ec317]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fd11d65fc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fd15e6ea6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fd15e42041d]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	
	
	gsignal
	abort
	
	
	Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
	std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
	
	
	clone
*** End stack trace ***

Aborted (core dumped)

It looks like it couldn't find BLAS with your docker image, and in my local environment I'm having trouble getting data from the CPU to the GPU because of a misbehaving stream.

@IMBurbank

As long as Docker is working, your local installations of python, conda, bazel, tensorflow, etc. won't matter. Everything needed to run the tests is already in the container environments.

Let's start with CPU (I'm not sure if you have GPU configured).

  1. Make sure you're in your normal local environment at a location where you can download the graph_nets repository.

  2. Clone a fresh version of graph_nets to make sure tests are passing with the current build.

git clone https://github.com/deepmind/graph_nets.git

  3. Enter the graph_nets project directory.

cd graph_nets/

  4. Run the CPU docker image with a bash command to enter the container.

docker run --rm -u $(id -u):$(id -g) -p 8888:8888 -v $(pwd):/my-devel -it imburbank/graph_nets bash -l

  5. In that same terminal, so that you're using the container environment, run the tests.

python graph_nets/tests/blocks_test.py
python graph_nets/tests/modules_test.py
python graph_nets/tests/utils_tf_test.py

This will not use your locally-compiled tensorflow. The tests should pass. From there, you may be able to work on isolating the problem in your local dev environment.

I would recommend trying the tests on your local dev system with a standard tensorflow package and seeing if they pass. If they do, move to the next link in the chain with your compiled tensorflow.

@tomas-wood
Author

I'll try out your CPU version, but I have the GPU configured. My locally installed tensorflow-r1.10 build works on the GPU: all tests pass, and I've run lots of code with it. If it's causing the problem, I'm only seeing it when trying to run the tests in graph_nets. I also know how docker works. I'm not a complete idiot (just a touch, now and then, for character).

Looks like your CPU binaries work. Doesn't really do me a bit of good, but they work. Kudos.

The thing you recommend, trying the tests on my local dev system with standard TensorFlow (no GPU) installed with pip, is what produced the Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis. errors I first reported.
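If the warning itself is the blocker, one possible workaround while the upstream bug is fixed is to turn off the arithmetic optimizer entirely in the session config. This is a TF 1.x config fragment sketched from the RewriterConfig proto, not verified against that exact nightly:

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Disable the whole ArithmeticOptimizer pass (which contains the
# RemoveStackStridedSliceSameAxis stage) via grappler's RewriterConfig.
rewrite_options = rewriter_config_pb2.RewriterConfig(
    arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF)
graph_options = tf.GraphOptions(rewrite_options=rewrite_options)
config = tf.ConfigProto(graph_options=graph_options)

# Sessions created with this config skip the arithmetic rewrite pass.
sess = tf.Session(config=config)
```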

@tomas-wood
Author

Okay, I figured it out: it's related to this issue. Merci!

@IMBurbank

IMBurbank commented Oct 22, 2018

To run the GPU version, follow the exact same steps again, but swap in the GPU image at the docker run step:

Run the GPU docker image with a bash command to enter the container.

docker run --rm --runtime=nvidia --user $(id -u):$(id -g) -p 8888:8888 -v $(pwd):/my-devel -it imburbank/graph_nets:latest-gpu bash -l

That should duplicate a standard environment running tensorflow_gpu, tensorflow_probability_gpu, graph_nets and the standard dependencies.


I see you got it worked out. Cheers!

@tomas-wood
Author

I'm still using nvidia-docker because I'm trapped in the past lol

@abh2424

abh2424 commented Mar 22, 2019

I am facing a similar error while running my object detection Python file. I have completed all the steps given above by @IMBurbank, but the error is still the same.

What is the top-level directory of the model you are using: ./models/research
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Raspbian OS (Linux)
TensorFlow installed from (source or binary): binary (pip3)
TensorFlow version (use command below): 1.13.1
Bazel version (if compiling from source): 0.8.0
CUDA/cuDNN version: none
GPU model and memory: CPU only

Please help me @IMBurbank

@cutemuggle

I am facing a similar error while running my object detection Python file. Could you please tell me how to solve it? Thanks a lot! @abh2424 @tomas-wood
