convolution core dump #624

Closed
fengrenguang opened this issue Dec 26, 2019 · 7 comments

@fengrenguang

Summary

Provide a short summary of the issue. Sections below provide guidance on what
factors are considered important to reproduce an issue.
Creating a primitive in one thread and running the forward pass in a different thread may lead to a core dump.

Error message:
mkldnn/mkldnn/common/memory_tracking.hpp:240: void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const: Assertion `size() == 0' failed.

Version

Report DNNL version and githash. Version information is printed to stdout
in verbose mode.
0.21.0

Environment

DNNL includes hardware-specific optimizations and may behave
differently depending on the compiler and build environment. Include
the following information to help reproduce the issue:

  • CPU make and model (try lscpu; if your lscpu does not list CPU flags,
    try running cat /proc/cpuinfo | grep flags | sort -u)
    intel xeon CPU E5-2630 V4 @2.2GHz
  • OS version (uname -a)
    Linux sdw2 2.6.32-696.16.1.el6.x86_64 #1 SMP Wed Nov 15 16:51:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Compiler version (gcc --version)
    gcc version 8.2.0 (GCC)
  • CMake version (cmake --version)

cmake version 3.5.0-rc3

  • CMake output log
  • git hash (git log -1 --format=%H)

Steps to reproduce

Please check that the issue is reproducible with the latest revision on
master. Include all the steps to reproduce the issue.

You can use verbose mode
and benchdnn
to validate correctness of all primitives the library supports. If this does not
work a short C/C++ program or modified unit tests demonstrating the issue
will greatly help with the investigation.

Observed behavior

Document behavior you observe. For performance defects, like performance
regressions or a function being slow, provide a log including output generated
by your application in
verbose mode.

Expected behavior

Document behavior you expect.
No core dump. How can this problem be solved?

@fengrenguang fengrenguang added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Dec 26, 2019
@emfomenk

Hi @fengrenguang,

Could you please share the call stack? Run the application under gdb and, when the assertion is hit, type bt to see the trace.
Also, could you please share the problem parameters? E.g., if a convolution causes the issue, share ic, oc, mb, ih, kh, etc., as well as the memory formats you used to create the convolution.

@angus1121

Hi @emfomenk,
I am working with fengrenguang. Here are the problem parameters:
inblob format: mkldnn_nChw16c, shape: 1 32 68 120
outblob format: mkldnn_nChw16c, shape: 1 64 68 120
kernel shape: 64 32 3 3
The error information is mkldnn/mkldnn/common/memory_tracking.hpp:240: void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const: Assertion `size() == 0' failed.

The call stack follows:
#0 0x00007ffff572f428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff573102a in __GI_abort () at abort.c:89
#2 0x00007ffff5727bd7 in __assert_fail_base (fmt=, assertion=assertion@entry=0x7ffff7b82048 "size() == 0",
file=file@entry=0x7ffff7b804b0 "../../../code/src/libcnn/layer/3rdparty/mkldnn/mkldnn/common/memory_tracking.hpp", line=line@entry=240,
function=function@entry=0x7ffff7be9ac0 <_ZZNK6mkldnn4impl15memory_tracking10registry_t3getERKjPvE19__PRETTY_FUNCTION__> "void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const")
at assert.c:92
#3 0x00007ffff5727c82 in __GI___assert_fail (assertion=0x7ffff7b82048 "size() == 0", file=0x7ffff7b804b0 "../../../code/src/libcnn/layer/3rdparty/mkldnn/mkldnn/common/memory_tracking.hpp", line=240,
function=0x7ffff7be9ac0 <_ZZNK6mkldnn4impl15memory_tracking10registry_t3getERKjPvE19__PRETTY_FUNCTION__> "void* mkldnn::impl::memory_tracking::registry_t::get(const key_t&, void*) const") at assert.c:101
#4 0x00007ffff7673336 in mkldnn::impl::cpu::_jit_avx512_core_fp32_wino_conv_4x3_t::_execute_data_W_S_G_D(float*, float*, float*, float*, mkldnn::impl::memory_tracking::grantor_t const&) const ()
from ../../../lib/X64_LINUX/libcnn_v3.7.so
#5 0x00007ffff768a88f in mkldnn::impl::cpu::jit_avx512_core_fp32_wino_conv_4x3_fwd_t::execute(mkldnn::impl::event_t*) const () from ../../../lib/X64_LINUX/libcnn_v3.7.so
#6 0x00007ffff7677463 in mkldnn::impl::cpu::cpu_engine_t::submit(mkldnn_primitive*, mkldnn::impl::event_t*, mkldnn::impl::nstl::vector<mkldnn::impl::event_t*>&) ()
from ../../../lib/X64_LINUX/libcnn_v3.7.so
#7 0x00007ffff7440a3f in mkldnn::impl::stream_eager_t::rerun_impl(mkldnn_primitive**) () from ../../../lib/X64_LINUX/libcnn_v3.7.so
#8 0x00007ffff743f4f3 in mkldnn_stream::rerun(mkldnn_primitive**) () from ../../../lib/X64_LINUX/libcnn_v3.7.so
#9 0x00007ffff7432767 in CConvLayerMkldnn::Forward() () from ../../../lib/X64_LINUX/libcnn_v3.7.so

@rsdubtso

rsdubtso commented Jan 9, 2020

Hi @angus1121. Thanks for the update. There seems to be some inconsistency with the original report which mentioned 'Intel Xeon CPU E5-2630 V4 @2.2GHz' which supports AVX2 only, while the backtrace is for AVX-512 Winograd.

@rsdubtso

rsdubtso commented Jan 9, 2020

I tried reproducing this with 0.20 and 0.20.6 but no luck.

$ MKLDNN_VERBOSE=1 ./tests/benchdnn/benchdnn --conv --alg=WINO mb1_ic32ih68iw120_oc64oh68ow120_kh3kw3ph1pw1
mkldnn_verbose,info,Intel(R) MKL-DNN v0.20.0 (Git Hash d89bf4babd7cce7efa6613387dca79c123164084),Intel(R) AVX512-Deep Learning Boost (Intel(R) AVX512-DL Boost)
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,1x32x68x120,10.082
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,64x32x3x3,9.99292
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,1x64x68x120,10.3831
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_x out:f32_x,num:1,64,0.0891113
mkldnn_verbose,exec,convolution,jit_wino_4x3:avx512_core,forward_training,fsrc:nChw16c fwei:OIhw16i16o fbia:x fdst:nChw16c,alg:convolution_winograd,mb1_ic32oc64_ih68oh68kh3sh1dh0ph1_iw120ow120kw3sw1dw0pw1,2.64893
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw16c out:f32_nchw,num:1,1x64x68x120,0.327881
0:PASSED __REPRO: --alg=wino mb1ic32ih68iw120oc64oh68ow120kh3kw3ph1pw1n"wip"
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 failed:0

@angus1121 , @fengrenguang: Please reproduce this issue using a standalone MKL-DNN build and report the detailed instructions here. Until then, there's nothing we can do.

@angus1121

Hi @rsdubtso, the core dump happened on both machines, so I used the AVX-512 machine that I have. When I create and use the mkldnn_stream in one thread, it works. It core dumps when I create the mkldnn_stream in one thread and use it in another thread. The version we use is 0.21.0.

@rsdubtso

rsdubtso commented Jan 9, 2020

Thanks, @angus1121, and thanks for the reminder that the issue involves multiple threads. I should have noticed this from the original post. In 0.x, what you need to do is pass -DMKLDNN_ENABLE_CONCURRENT_EXEC=TRUE to cmake when building MKL-DNN. In 1.x you have more options.

@rsdubtso rsdubtso added not a bug and removed sighting Suspicious library behavior. Should be promoted to a bug when confirmed labels Jan 10, 2020
@rsdubtso rsdubtso self-assigned this Jan 10, 2020
@vpirogov

Closing due to lack of activity. Feel free to submit a new issue or reopen this one if the issue is not resolved.
