convolution core dump #624

fengrenguang · 2019-12-26T06:59:10Z

Summary

Provide a short summary of the issue. Sections below provide guidance on what
factors are considered important to reproduce an issue.
Creating a primitive in one thread and executing it (forward) in a different thread may lead to a core dump.

The error message:
mkldnn/mkldnn/common/memory_tracking.hpp:240: void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const: Assertion `size() == 0' failed.

Version

Report DNNL version and githash. Version information is printed to stdout
in verbose mode.
0.21.0

Environment

DNNL includes hardware-specific optimizations and may behave
differently depending on the compiler and build environment. Include
the following information to help reproduce the issue:

CPU make and model (try lscpu; if your lscpu does not list CPU flags,
try running cat /proc/cpuinfo | grep flags | sort -u)
Intel Xeon CPU E5-2630 v4 @ 2.2GHz
OS version (uname -a)
Linux sdw2 2.6.32-696.16.1.el6.x86_64 #1 SMP Wed Nov 15 16:51:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Compiler version (gcc --version)
gcc version 8.2.0 (GCC)
CMake version (cmake --version)

cmake version 3.5.0-rc3

CMake output log
git hash (git log -1 --format=%H)

Steps to reproduce

Please check that the issue is reproducible with the latest revision on
master. Include all the steps to reproduce the issue.

You can use verbose mode and benchdnn to validate correctness of all primitives
the library supports. If this does not work, a short C/C++ program or modified
unit tests demonstrating the issue will greatly help with the investigation.

Observed behavior

Document behavior you observe. For performance defects, like performance
regressions or a function being slow, provide a log including output generated
by your application in verbose mode.

Expected behavior

Document behavior you expect.
How can this problem be solved?


emfomenk · 2019-12-26T17:26:54Z

Hi @fengrenguang,

Could you please share the call stack? Run the application under gdb and, when the assertion is hit, type bt to print the backtrace.
Also, could you please share the problem parameters? E.g., if a convolution causes the issue, share the ic, oc, mb, ih, kh, etc., as well as the memory formats you used to create the convolution.

angus1121 · 2020-01-09T02:39:55Z

Hi @emfomenk,
I am working with fengrenguang. Here are the problem parameters:
Input blob: format mkldnn_nChw16c, shape 1 32 68 120
Output blob: format mkldnn_nChw16c, shape 1 64 68 120
Kernel shape: 64 32 3 3
The error message is: mkldnn/mkldnn/common/memory_tracking.hpp:240: void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const: Assertion `size() == 0' failed.

Here is the call stack:
#0 0x00007ffff572f428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff573102a in __GI_abort () at abort.c:89
#2 0x00007ffff5727bd7 in __assert_fail_base (fmt=, assertion=assertion@entry=0x7ffff7b82048 "size() == 0",
file=file@entry=0x7ffff7b804b0 "../../../code/src/libcnn/layer/3rdparty/mkldnn/mkldnn/common/memory_tracking.hpp", line=line@entry=240,
function=function@entry=0x7ffff7be9ac0 <_ZZNK6mkldnn4impl15memory_tracking10registry_t3getERKjPvE19__PRETTY_FUNCTION__> "void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const")
at assert.c:92
#3 0x00007ffff5727c82 in __GI___assert_fail (assertion=0x7ffff7b82048 "size() == 0", file=0x7ffff7b804b0 "../../../code/src/libcnn/layer/3rdparty/mkldnn/mkldnn/common/memory_tracking.hpp", line=240,
function=0x7ffff7be9ac0 <_ZZNK6mkldnn4impl15memory_tracking10registry_t3getERKjPvE19__PRETTY_FUNCTION__> "void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const") at assert.c:101
#4 0x00007ffff7673336 in mkldnn::impl::cpu::_jit_avx512_core_fp32_wino_conv_4x3_t::_execute_data_W_S_G_D(float*, float*, float*, float*, mkldnn::impl::memory_tracking::grantor_t const&) const ()
from ../../../lib/X64_LINUX/libcnn_v3.7.so
#5 0x00007ffff768a88f in mkldnn::impl::cpu::jit_avx512_core_fp32_wino_conv_4x3_fwd_t::execute(mkldnn::impl::event_t*) const () from ../../../lib/X64_LINUX/libcnn_v3.7.so
#6 0x00007ffff7677463 in mkldnn::impl::cpu::cpu_engine_t::submit(mkldnn_primitive*, mkldnn::impl::event_t*, mkldnn::impl::nstl::vector<mkldnn::impl::event_t*>&) ()
from ../../../lib/X64_LINUX/libcnn_v3.7.so
#7 0x00007ffff7440a3f in mkldnn::impl::stream_eager_t::rerun_impl(mkldnn_primitive**) () from ../../../lib/X64_LINUX/libcnn_v3.7.so
#8 0x00007ffff743f4f3 in mkldnn_stream::rerun(mkldnn_primitive**) () from ../../../lib/X64_LINUX/libcnn_v3.7.so
#9 0x00007ffff7432767 in CConvLayerMkldnn::Forward() () from ../../../lib/X64_LINUX/libcnn_v3.7.so

rsdubtso · 2020-01-09T05:10:48Z

Hi @angus1121. Thanks for the update. There seems to be an inconsistency with the original report: it mentions an Intel Xeon CPU E5-2630 v4 @ 2.2GHz, which supports only AVX2, while the backtrace is from the AVX-512 Winograd implementation.

rsdubtso · 2020-01-09T05:36:33Z

I tried to reproduce this with 0.20 and 0.20.6 but could not.

$ MKLDNN_VERBOSE=1 ./tests/benchdnn/benchdnn --conv --alg=WINO mb1_ic32ih68iw120_oc64oh68ow120_kh3kw3ph1pw1
mkldnn_verbose,info,Intel(R) MKL-DNN v0.20.0 (Git Hash d89bf4babd7cce7efa6613387dca79c123164084),Intel(R) AVX512-Deep Learning Boost (Intel(R) AVX512-DL Boost)
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,1x32x68x120,10.082
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,64x32x3x3,9.99292
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,1x64x68x120,10.3831
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_x out:f32_x,num:1,64,0.0891113
mkldnn_verbose,exec,convolution,jit_wino_4x3:avx512_core,forward_training,fsrc:nChw16c fwei:OIhw16i16o fbia:x fdst:nChw16c,alg:convolution_winograd,mb1_ic32oc64_ih68oh68kh3sh1dh0ph1_iw120ow120kw3sw1dw0pw1,2.64893
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw16c out:f32_nchw,num:1,1x64x68x120,0.327881
0:PASSED __REPRO: --alg=wino mb1ic32ih68iw120oc64oh68ow120kh3kw3ph1pw1n"wip"
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 failed:0

@angus1121, @fengrenguang: Please reproduce this issue using a standalone MKL-DNN build and post detailed instructions here. Until then, there's nothing we can do.

angus1121 · 2020-01-09T06:43:37Z

Hi @rsdubtso, the core dump happens on both machines, so I used the AVX-512 machine I have. When I create and use the mkldnn_stream in the same thread, it works. It dumps core when I create the mkldnn_stream in one thread and use it in another thread. The version we use is 0.21.0.

rsdubtso · 2020-01-09T15:47:42Z

Thanks, @angus1121, for the reminder that the issue involves multiple threads; I should have noticed this in the original post. In that case, what you need to do in 0.x is pass -DMKLDNN_ENABLE_CONCURRENT_EXEC=TRUE to cmake when building MKL-DNN. In 1.x you have more options.

vpirogov · 2020-01-23T21:16:00Z

Closing due to lack of activity. Feel free to submit a new issue or reopen this one if the issue is not resolved.

fengrenguang added the sighting label Dec 26, 2019

rsdubtso added the not a bug label and removed the sighting label Jan 10, 2020

rsdubtso self-assigned this Jan 10, 2020

vpirogov closed this as completed Jan 23, 2020

