Can somebody help me? Neither cuDNN 5.1 nor cuDNN 6 works after recompiling ... #20

Open
sheldon606 opened this issue Jun 5, 2017 · 3 comments

Comments

@sheldon606

An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 6 entries:
[bt] (0) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fcf6d268f7c]
[bt] (1) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x4ac) [0x7fcf6e08cd4c]
[bt] (2) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvvEZZN5mxnet6engine23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlvE_E9_M_invokeERKSt9_Any_data+0x60) [0x7fcf6e08fef0]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fcfe2828c80]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fcfe90d16ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fcfe8e0782d]
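
The error message above suggests switching to the synchronous engine for debugging. A minimal sketch of doing that from Python follows. The helper names `use_naive_engine` and `restore_engine` are made up for illustration, and it assumes the variable has to be set before `mxnet` is imported so the engine sees it when it is created.

```python
import os

def use_naive_engine(env=os.environ):
    """Set MXNET_ENGINE_TYPE to NaiveEngine and return the previous value (or None)."""
    previous = env.get("MXNET_ENGINE_TYPE")
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return previous

def restore_engine(previous, env=os.environ):
    """Put MXNET_ENGINE_TYPE back to what it was before debugging."""
    if previous is None:
        env.pop("MXNET_ENGINE_TYPE", None)
    else:
        env["MXNET_ENGINE_TYPE"] = previous

# Hypothetical usage: set the variable first, then import mxnet and run training.
previous = use_naive_engine()
# import mxnet as mx   # assumption: imported only after the variable is set
# ... run the failing script under gdb or with a plain Python traceback ...
restore_engine(previous)
```

With the synchronous engine the failing operator should show up directly in the backtrace instead of inside the threaded engine's worker.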

@sheldon606
Author

Stack trace returned 10 entries:
[bt] (0) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x3e4) [0x7f6e47f1fde4]
[bt] (1) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x57) [0x7f6e47f23817]
[bt] (2) /home/sheldon606/mxnet/python/mxnet/../../lib/libmxnet.so(+0x1433443) [0x7f6e47f72443]
[bt] (3) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor19InitDataEntryMemoryEPSt6vectorINS_7NDArrayESaIS3_EE+0x2284) [0x7f6e47f7b1c4]
[bt] (4) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorINS_7NDArrayESaISI_EESM_RKSH_INS_9OpReqTypeESaISN_EESM_PNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryESI_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISA_IKSV_SI_EEE+0x5bb) [0x7f6e47f8154b]
[bt] (5) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(ZN5mxnet8Executor4BindEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorINS_7NDArrayESaISH_EESL_RKSG_INS_9OpReqTypeESaISM_EESL_PS0+0x684) [0x7f6e47f81c74]
[bt] (6) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0x2425) [0x7f6e47f506f5]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f6e7cfcfe40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f6e7cfcf8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f6e5fdd43df]

Traceback (most recent call last):
  File "experiments/fcis/fcis_end2end_train_test.py", line 13, in <module>
    train_end2end.main()
  File "experiments/fcis/../../fcis/train_end2end.py", line 181, in main
    config.TRAIN.lr, config.TRAIN.lr_step)
  File "experiments/fcis/../../fcis/train_end2end.py", line 173, in train_net
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "experiments/fcis/../../fcis/core/module.py", line 932, in fit
    for_training=True, force_rebind=force_rebind)
  File "experiments/fcis/../../fcis/core/module.py", line 840, in bind
    for_training, inputs_need_grad, force_rebind=False, shared_module=None)
  File "experiments/fcis/../../fcis/core/module.py", line 397, in bind
    state_names=self._state_names)
  File "experiments/fcis/../../fcis/core/DataParallelExecutorGroup.py", line 178, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "experiments/fcis/../../fcis/core/DataParallelExecutorGroup.py", line 278, in bind_exec
    shared_group))
  File "experiments/fcis/../../fcis/core/DataParallelExecutorGroup.py", line 613, in _bind_ith_exec
    grad_req=self.grad_req, shared_exec=shared_exec)
  File "/home/shel/mxnet/python/mxnet/symbol.py", line 1413, in bind
    ctypes.byref(handle)))
  File "/home/shel/mxnet/python/mxnet/base.py", line 85, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:32:26] src/storage/./pooled_storage_manager.h:84: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage23GPUPooledStorageManager5AllocEm+0x3e4) [0x7f6e47f1fde4]
[bt] (1) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x57) [0x7f6e47f23817]
[bt] (2) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(+0x1433443) [0x7f6e47f72443]
[bt] (3) /home/sheld/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor19InitDataEntryMemoryEPSt6vectorINS_7NDArrayESaIS3_EE+0x2284) [0x7f6e47f7b1c4]
[bt] (4) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet4exec13GraphExecutor4InitEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS4_St4lessISsESaISt4pairIKSsS4_EEERKSt6vectorINS_7NDArrayESaISI_EESM_RKSH_INS_9OpReqTypeESaISN_EESM_PNS_8ExecutorERKSt13unordered_mapINS2_9NodeEntryESI_NS2_13NodeEntryHashENS2_14NodeEntryEqualESaISA_IKSV_SI_EEE+0x5bb) [0x7f6e47f8154b]
[bt] (5) /home/shel/mxnet/python/mxnet/../../lib/libmxnet.so(ZN5mxnet8Executor4BindEN4nnvm6SymbolERKNS_7ContextERKSt3mapISsS3_St4lessISsESaISt4pairIKSsS3_EEERKSt6vectorINS_7NDArrayESaISH_EESL_RKSG_INS_9OpReqTypeESaISM_EESL_PS0+0x684) [0x7f6e47f81c74]
[bt] (6) /home/sheldon606/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0x2425) [0x7f6e47f506f5]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f6e7cfcfe40]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7f6e7cfcf8ab]
[bt] (9) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48f) [0x7f6e5fdd43df]

@sheldon606
Author

The second comment shows the error I get after compiling MXNet without cuDNN. The demo works fine for me, but training no longer works.
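
The trace without cuDNN ends in `cudaMalloc failed: out of memory` while the executor is being bound, so one thing worth checking is how much GPU memory is actually free before training starts. A rough sketch, assuming `nvidia-smi` is on the PATH; the function names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of MXNet or FCIS.

```python
import subprocess

def parse_gpu_memory(csv_text):
    """Turn 'used, total' lines (MiB, no header, no units) into (used, total, free) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows

def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage; returns one tuple per GPU."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ])
    return parse_gpu_memory(out.decode())

# Hypothetical usage before launching training:
# for index, (used, total, free) in enumerate(query_gpu_memory()):
#     print("GPU %d: %d MiB free of %d MiB" % (index, free, total))
```

If another process already holds most of the memory, or the free amount is small compared with what training needs, that would be consistent with the demo working while training fails at bind time. This is only a diagnostic and does not confirm the cause.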

@shuishui602

@sheldon606 Have you solved this problem?
