
About the Configuration #6

Closed
Angzz opened this issue Jun 21, 2018 · 9 comments

Comments

Angzz commented Jun 21, 2018

Hi, what CUDA version and Python version does this project require?

Angzz commented Jun 22, 2018

After installation, I encountered a problem when running demo.py:

mxnet.base.MXNetError: [07:35:37] /liang_volume/SNIPER/SNIPER-mxnet/src/operator/nn/batch_norm.cu:527: Check failed: err == cudaSuccess (7 vs. 0) Name: BatchNormalizationUpdateOutput ErrStr:too many resources requested for launch

Stack trace returned 10 entries:
[bt] (0) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f70067b0f7c]
[bt] (1) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f70067b2328]
[bt] (2) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::BatchNormForwardImpl<mshadow::gpu, mshadow::half::half_t, float>(mshadow::Stream<mshadow::gpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x31e) [0x7f70096675fe]
[bt] (3) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::BatchNormForward<mshadow::gpu, mshadow::half::half_t, float>(mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x301) [0x7f7009667de1]
[bt] (4) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::BatchNormCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x852) [0x7f700965f492]
[bt] (5) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x7f700823a050]
[bt] (6) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(+0x2559258) [0x7f700820f258]
[bt] (7) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x429) [0x7f70067ba489]
[bt] (8) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f70067be99b]
[bt] (9) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f70067bec1e]

Can you give me some suggestions?

CUDA VERSION : 9.0
CUDNN VERSION : 7.0

Angzz commented Jun 22, 2018

BTW, I am using 4 TITAN Xp GPUs, each with 12 GB of memory.

@Lancelot365

This seems to be a GPU memory issue. Have you checked the memory usage before it crashed?
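For reference, a minimal sketch (not from the thread) of how the memory usage could be logged right before or while the demo runs, assuming nvidia-smi is available on the PATH:

```python
# Hypothetical helper: log per-GPU memory usage via nvidia-smi.
# The query fields used here are standard nvidia-smi options.
import subprocess

def log_gpu_memory():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"])
    for line in out.decode().strip().splitlines():
        idx, used, total = [s.strip() for s in line.split(",")]
        print("GPU %s: %s / %s MiB used" % (idx, used, total))

if __name__ == "__main__":
    log_gpu_memory()
```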

@bharatsingh430 (Collaborator)

If it’s a memory issue, you can reduce the number of concurrent jobs.

mahyarnajibi commented Jun 22, 2018

Hey Ang,
It might be the case that your admin set the NVIDIA EXCLUSIVE_PROCESS flag, which allows only one process to run on each GPU at a time. The demo will be simplified in the next commit to avoid such problems.
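Not from the thread, but a quick way to check whether the GPUs are in this mode, again assuming nvidia-smi is available:

```python
# Hedged sketch: query the compute mode of each GPU.
# "Exclusive_Process" means only one process may use a GPU at a time,
# which matches the restriction described above; "Default" allows many.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"])
print(out.decode())
```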

Angzz commented Jun 24, 2018

@mahyarnajibi Thanks, I am looking forward to your next commit.

@mahyarnajibi (Owner)

Hey @Angzz,
Can you pull the master branch of both SNIPER and SNIPER-mxnet, recompile MXNet using make instead of cmake (the README has also been updated), and see if the problem still exists when you run the demo?
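As a hedged sanity check (not from the thread), after rebuilding SNIPER-mxnet with make one could confirm that Python picks up the locally built copy and that a GPU BatchNorm forward pass runs, since that is where the original error occurred:

```python
# Hypothetical check: verify the rebuilt MXNet is the one being imported
# and that a GPU BatchNorm forward pass completes without the CUDA error.
import mxnet as mx

print(mx.__file__)     # should point inside SNIPER/SNIPER-mxnet/python/mxnet
print(mx.__version__)

ctx = mx.gpu(0)
x = mx.nd.ones((1, 3, 8, 8), ctx=ctx)
gamma = mx.nd.ones((3,), ctx=ctx)
beta = mx.nd.zeros((3,), ctx=ctx)
mean = mx.nd.zeros((3,), ctx=ctx)
var = mx.nd.ones((3,), ctx=ctx)

y = mx.nd.BatchNorm(x, gamma, beta, mean, var, fix_gamma=False)
y.wait_to_read()       # force the GPU kernel to actually execute
print("BatchNorm forward OK:", y.shape)
```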

@Jerry3062

@mahyarnajibi
Hi:
I also ran into the same problem.
Recently I compiled the new master, and then the demo ran successfully. Thank you very much.
However, while compiling the code I hit a few issues, such as /home/jerry/SNIPER/SNIPER-mxnet/3rdparty/mshadow/mshadow/./base.h:155:23: fatal error: cblas.h: No such file or directory. I solved that by linking several files into the /usr/include folder. When I then ran demo.py, more errors were raised, which I fixed with the last two links:
libopenblas.so -> /home/jerry/softs/OpenBLAS/libopenblas.so*
libopenblas.so.0 -> libopenblas.so*
I think my method is clumsy (I'm not experienced with C and Linux), but I hope it helps.

@mahyarnajibi (Owner)

Feel free to re-open the issue if the update did not solve the problem.
