
About the Configuration #6

Closed
Angzz opened this issue Jun 21, 2018 · 9 comments

Comments

Angzz commented Jun 21, 2018

Hi, what CUDA version and Python version does this project require?

Angzz commented Jun 22, 2018

After installation, I encountered a problem when running demo.py:

mxnet.base.MXNetError: [07:35:37] /liang_volume/SNIPER/SNIPER-mxnet/src/operator/nn/batch_norm.cu:527: Check failed: err == cudaSuccess (7 vs. 0) Name: BatchNormalizationUpdateOutput ErrStr:too many resources requested for launch

Stack trace returned 10 entries:
[bt] (0) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f70067b0f7c]
[bt] (1) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f70067b2328]
[bt] (2) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::BatchNormForwardImpl<mshadow::gpu, mshadow::half::half_t, float>(mshadow::Stream<mshadow::gpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x31e) [0x7f70096675fe]
[bt] (3) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::BatchNormForward<mshadow::gpu, mshadow::half::half_t, float>(mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x301) [0x7f7009667de1]
[bt] (4) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::BatchNormCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x852) [0x7f700965f492]
[bt] (5) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x7f700823a050]
[bt] (6) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(+0x2559258) [0x7f700820f258]
[bt] (7) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x429) [0x7f70067ba489]
[bt] (8) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f70067be99b]
[bt] (9) /liang_volume/SNIPER/SNIPER-mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f70067bec1e]

Can you give me some suggestions?

CUDA VERSION : 9.0
CUDNN VERSION : 7.0

Angzz commented Jun 22, 2018

BTW, I am using 4 TITAN Xp GPUs, each with 12 GB of memory.

@Lancelot365

This seems to be a GPU memory issue. Have you checked the memory usage before it crashed?
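For reference, a minimal sketch (not from the thread) of how the memory usage could be logged right before or while the demo runs, assuming nvidia-smi is available on the PATH:

```python
# Hypothetical helper: log per-GPU memory usage via nvidia-smi.
# The query fields used here are standard nvidia-smi options.
import subprocess

def log_gpu_memory():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"])
    for line in out.decode().strip().splitlines():
        idx, used, total = [s.strip() for s in line.split(",")]
        print("GPU %s: %s / %s MiB used" % (idx, used, total))

if __name__ == "__main__":
    log_gpu_memory()
```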

@bharatsingh430 (Collaborator)

If it’s a memory issue, you can reduce the number of concurrent jobs.

mahyarnajibi commented Jun 22, 2018

Hey Ang,
It might be the case that your admin set the NVIDIA EXCLUSIVE_PROCESS flag, which allows only one process to run on each GPU at a time. The demo will be simplified in the next commit to avoid such problems.
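Not from the thread, but a quick way to check whether the GPUs are in this mode, again assuming nvidia-smi is available:

```python
# Hedged sketch: query the compute mode of each GPU.
# "Exclusive_Process" means only one process may use a GPU at a time,
# which matches the restriction described above; "Default" allows many.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"])
print(out.decode())
```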

Angzz commented Jun 24, 2018

@mahyarnajibi Thanks, I am looking forward to your next commit.

@mahyarnajibi (Owner)

Hey @Angzz,
Can you pull the master branch of both SNIPER and SNIPER-mxnet, recompile MXNet using make instead of cmake (the README has also been updated), and see if the problem still exists when you run the demo?
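As a hedged sanity check (not from the thread), after rebuilding SNIPER-mxnet with make one could confirm that Python picks up the locally built copy and that a GPU BatchNorm forward pass runs, since that is where the original error occurred:

```python
# Hypothetical check: verify the rebuilt MXNet is the one being imported
# and that a GPU BatchNorm forward pass completes without the CUDA error.
import mxnet as mx

print(mx.__file__)     # should point inside SNIPER/SNIPER-mxnet/python/mxnet
print(mx.__version__)

ctx = mx.gpu(0)
x = mx.nd.ones((1, 3, 8, 8), ctx=ctx)
gamma = mx.nd.ones((3,), ctx=ctx)
beta = mx.nd.zeros((3,), ctx=ctx)
mean = mx.nd.zeros((3,), ctx=ctx)
var = mx.nd.ones((3,), ctx=ctx)

y = mx.nd.BatchNorm(x, gamma, beta, mean, var, fix_gamma=False)
y.wait_to_read()       # force the GPU kernel to actually execute
print("BatchNorm forward OK:", y.shape)
```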

@Jerry3062

@mahyarnajibi
Hi:
I also ran into the same problem.
Recently I compiled the new master, and then the demo ran successfully. Thank you very much.
However, while compiling the code I hit a few issues, such as /home/jerry/SNIPER/SNIPER-mxnet/3rdparty/mshadow/mshadow/./base.h:155:23: fatal error: cblas.h: No such file or directory. I solved that by linking several files into the /usr/include folder. When I then ran demo.py, more errors were raised, which I fixed with the last two links:
libopenblas.so -> /home/jerry/softs/OpenBLAS/libopenblas.so*
libopenblas.so.0 -> libopenblas.so*
I think my method is clumsy (I'm not experienced with C and Linux), but I hope it helps.

@mahyarnajibi (Owner)

Feel free to re-open the issue if the update did not solve the problem.
