
Failed to run the R-FCN 3K demo - nnvm check failed #3

Closed
Lancelot365 opened this issue Jun 18, 2018 · 13 comments

Comments

@Lancelot365

Hi,

First of all, thanks for providing this nicely organized repo.
I have successfully run the SNIPER demo without any problems. However, I am not able to run the R-FCN 3K demo. There is a strange error originating from the asnumpy() function.
The error message I get is:
SNIPER/CRCNN-mxnet/3rdparty/nnvm/include/nnvm/tuple.h:438: Check failed: dim == static_cast<int>(ndim()) (2 vs. 1) dimension do not match target dimension 2 vs 1

I am using CUDA 9.1 with cuDNN 7.0.2 on a p3.16xlarge machine.

I have tried different configurations with CMake, e.g., with/without MKL, but I still cannot figure out what the problem is.

I can easily convert other NDArrays to numpy arrays without any problem. I just cannot convert the output of the R-FCN network.

Please advise.
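For reference, a minimal sketch of the conversion pattern that fails; `mod` and the output index follow the demo script, everything else is illustrative. Note that asnumpy() blocks on MXNet's asynchronous execution, so the check can fail there even though it originates inside an operator's forward pass:

```python
import mxnet as mx

# Converting an ordinary NDArray to numpy works fine:
a = mx.nd.ones((2, 3))
print(a.asnumpy())

# But forcing evaluation of the R-FCN output fails with the nnvm check
# (`mod` is the bound Module from the demo script):
# out = mod.get_outputs()[4].asnumpy()  # MXNetError from nnvm/tuple.h:438
```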

@bharatsingh430
Collaborator

We use the Python 2.7 MXNet on AWS and compile our MXNet with OpenBLAS. Please check the Makefile in the MXNet repo we provide and use it to compile on AWS.

@bharatsingh430
Collaborator

I think there was a linking issue in the repo; it should be fixed now. Otherwise, the MXNet branch should be explicitly changed and cloned.

@Lancelot365
Author

Thanks for your quick comment.
I did compile with OpenBLAS. Still no luck with the newest update.
Here are some more error messages in case you are curious.
I have also tried removing the asnumpy() in out = mod.get_outputs()[4].asnumpy() and applying it to pooled_feat instead; that produces the same error.

Load data from cache.
Extracting features of 1332 images...
Traceback (most recent call last):
  File "demo.py", line 284, in <module>
    main()
  File "demo.py", line 174, in main
    out = mod.get_outputs()[4].asnumpy()
  File "/home/SNIPER/SNIPER-mxnet/python/mxnet/ndarray/ndarray.py", line 1876, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/SNIPER/SNIPER-mxnet/python/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [04:03:29] /home/SNIPER/SNIPER-mxnet/3rdparty/nnvm/include/nnvm/tuple.h:438: Check failed: dim == static_cast<int>(ndim()) (2 vs. 1) dimension do not match target dimension 2 vs 1

[bt] (0) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f11a864eb8b]
[bt] (1) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mshadow::Tensor<mshadow::gpu, 2, float> mxnet::TBlob::get<mshadow::gpu, 2, float>(mshadow::Stream<mshadow::gpu>*) const+0xaf6) [0x7f11aa37d9b6]
[bt] (2) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::MultiProposalGPUOp<mshadow::gpu>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x12c0) [0x7f11ab551710]
[bt] (3) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x35b) [0x7f11a87b999b]
[bt] (4) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x7f11a870dd30]
[bt] (5) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(+0xb865c8) [0x7f11a86d75c8]
[bt] (6) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x429) [0x7f11a87d6549]
[bt] (7) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f11a87da82b]
[bt] (8) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f11a87daaae]
[bt] (9) /home/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f11a87d6aaa]

@Lancelot365
Author

May I know the CUDA and cuDNN versions you are testing on?
Thanks

@Lancelot365
Author

One of my friends also tried this morning in a clean CUDA 9 environment; he got the same error.

@bharatsingh430
Collaborator

We will check it and get back to you; it is hard to look into these things right now due to CVPR.

@Lancelot365
Author

@bharatsingh430 Sure, enjoy!

@Lancelot365
Author

@bharatsingh430 @mahyarnajibi
After some tests, I am pretty sure the error comes from the MultiProposal op. I can output relu1 without any problems, but once rois is involved, nnvm reports this check failure.
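A minimal sketch of how the failing op can be localized by pulling intermediate outputs; `sym` is assumed to be the demo's R-FCN symbol, and the output names ('relu1_output', 'rois_output') are assumptions based on the comment above:

```python
internals = sym.get_internals()        # all intermediate symbols of the network
print(internals.list_outputs()[-10:])  # inspect the available output names

relu1 = internals['relu1_output']      # binding up to relu1 evaluates fine
rois = internals['rois_output']        # anything involving rois (MultiProposal)
                                       # triggers the nnvm check failure
```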

@bharatsingh430
Collaborator

bharatsingh430 commented Jun 22, 2018

Thanks for the catch. I think we made some last-minute merges in the CUDA layers. We will fix it in a day or so.

@bharatsingh430
Collaborator

By the way, did you use the right MXNet branch for 3K?

@Lancelot365
Author

I have tried all three branches with different build options; no luck with any of them.

@Lancelot365
Author

@bharatsingh430 @mahyarnajibi
I think I found the problem.
'im_info' in data has the shape (3,), but inside the MultiProposal op it is accessed as a 2D tensor, which causes the dimension mismatch.
If we reshape the NDArray to (3, 1) or (1, 3), everything seems to be fine (see the sketch below).

There is another small bug: line 119 should be
im_info_list.append(one['im_info'])
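A minimal sketch of the workaround, combining both fixes above (`one` and `im_info_list` follow the demo's naming; the shape handling is an assumption based on the error, not the official patch):

```python
# Hypothetical excerpt around demo.py line 119: append im_info per image.
im_info = one['im_info']
# MultiProposal reads im_info as a 2D tensor, e.g. (1, 3) for
# (height, width, scale); the demo hands it a 1D (3,) NDArray,
# which trips the nnvm dimension check (2 vs 1).
if len(im_info.shape) == 1:
    im_info = im_info.reshape((1, 3))
im_info_list.append(im_info)
```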

@henrylee2570
Collaborator

@Lancelot365 Thank you for the catch; the im_info issue has been fixed.
