Arithmetic exception in Xbyak::util::Cpu::Cpu() when libmkldnn running in virtual machine KVM #215

Closed
moting9 opened this issue Apr 16, 2018 · 5 comments
Labels
bug A confirmed library bug

Comments

@moting9

moting9 commented Apr 16, 2018

We built Intel Caffe with MKL-DNN as the default engine. Running the caffe time command inside a KVM guest fails with a floating-point exception (SIGFPE).

gdb output:

Program received signal SIGFPE, Arithmetic exception.
0x00007fffee1faacc in Xbyak::util::Cpu::Cpu() ()
from /home/intel/caffe/external/mkldnn/install/lib/libmkldnn.so.0

KVM guest CPU configuration (lscpu output):

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2499.996
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1

@emfomenk

emfomenk commented Apr 16, 2018

Hi @moting9,

Thanks for the report!
Did you try the latest master?
If it still fails, could you please build mkl-dnn in debug mode (-DCMAKE_BUILD_TYPE=Debug), run it under gdb, and point to the exact place and circumstances of this FPE?

@moting9
Author

moting9 commented Apr 16, 2018

Thanks for the quick response!
I tried the latest mkl-dnn with the debug build type set. It now aborts on an assertion:

[root@iZhp37p4s3u8hd5b67tpboZ caffe]# ./build/tools/caffe time -model ../intel-caffe-model-v1/caffe_tianrang.prototxt
caffe: /home/work/caffe/external/mkldnn/src/src/cpu/xbyak/xbyak_util.h:180: void Xbyak::util::Cpu::setCacheHierarchy(): Assertion `smt_width != 0' failed.
Aborted

@moting9
Author

moting9 commented Apr 16, 2018

Backtrace from the debug build:

(gdb) bt
#0 Xbyak::util::Cpu::setCacheHierarchy (this=0x7fffee86f7e0 <mkldnn::impl::cpu::(anonymous namespace)::cpu>) at /home/work/caffe/external/mkldnn/src/src/cpu/xbyak/xbyak_util.h:179
#1 0x00007fffee188384 in Xbyak::util::Cpu::Cpu (this=0x7fffee86f7e0 <mkldnn::impl::cpu::(anonymous namespace)::cpu>) at /home/work/caffe/external/mkldnn/src/src/cpu/xbyak/xbyak_util.h:405
#2 0x00007fffee17fa6e in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at /home/work/caffe/external/mkldnn/src/src/cpu/jit_generator.hpp:168
#3 0x00007fffee17faaa in _GLOBAL__sub_I_jit_avx2_1x1_conv_kernel_f32.cpp(void) () at /home/work/caffe/external/mkldnn/src/src/cpu/jit_avx2_1x1_conv_kernel_f32.cpp:677
#4 0x00007ffff7dea503 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5 0x00007ffff7ddc1aa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#6 0x0000000000000004 in ?? ()
#7 0x00007fffffffe7e8 in ?? ()
#8 0x00007fffffffe80d in ?? ()
#9 0x00007fffffffe812 in ?? ()
#10 0x00007fffffffe819 in ?? ()
#11 0x0000000000000000 in ?? ()

@vpirogov
Member

Based on this backtrace the issue is the same as #208 and is resolved in revision a5f6077. Would you please update the library to the latest revision from the master branch (not v0.13) and verify that it is resolved?
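
For readers landing here, the two symptoms in this thread are consistent with one cause: the debug build stops at the smt_width != 0 assertion, while a release build (where assert is compiled out) would go on to divide by that zero, which raises SIGFPE on x86. The sketch below illustrates a defensive guard for that pattern. It is not xbyak's code and not necessarily what revision a5f6077 does; the function name cores_sharing_cache and the fall-back-to-1 policy are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not part of xbyak): computes how many cores share a
// cache level from the number of logical threads sharing it and the SMT width.
// Assumption: a hypervisor may report an SMT width of 0, so we treat 0 as
// "no SMT information" and fall back to 1 instead of dividing by zero.
inline unsigned cores_sharing_cache(unsigned logical_threads, unsigned smt_width) {
    if (smt_width == 0) {
        smt_width = 1;  // guard: an unguarded division here would raise SIGFPE
    }
    const unsigned cores = logical_threads / smt_width;
    return cores == 0 ? 1u : cores;  // never report fewer than one core
}
```

With this guard, a guest that reports no SMT width is treated as having one thread per core, which may over-count cores but avoids crashing during static initialization.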

vpirogov added the bug (A confirmed library bug) label Apr 16, 2018
@sinkingsugar

sinkingsugar commented Dec 31, 2018

@vpirogov I am still seeing a similar issue when running in a VM (KVM, QEMU, host CPU features),
e.g.:

Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
zsh: floating point exception (core dumped)  python

Pretty sure it's mkl-dnn, because I had to skip using it when building our framework:
https://github.com/fragcolor-xyz/nimtorch
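
To check whether a given guest is exposed to the same condition, it can help to look at what the virtual CPU reports for its topology. The sketch below assumes the zero SMT width comes from CPUID leaf 0x0B (extended topology enumeration), which this thread does not confirm. It uses the __get_cpuid_count helper from the GCC/Clang header cpuid.h; the struct and function names are hypothetical.

```cpp
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang only; provides __get_cpuid_count
#endif

// Hypothetical container for one sub-leaf of CPUID leaf 0x0B.
struct TopologyLevel {
    unsigned level_type;     // ECX[15:8]: 0 = invalid, 1 = SMT, 2 = core
    unsigned logical_count;  // EBX[15:0]: logical processors at this level
};

// Decode the raw EBX/ECX register values returned by CPUID leaf 0x0B.
inline TopologyLevel decode_leaf_0b(std::uint32_t ebx, std::uint32_t ecx) {
    return TopologyLevel{(ecx >> 8) & 0xFFu, ebx & 0xFFFFu};
}

// Query one sub-leaf (0 is normally the SMT level, 1 the core level).
// Returns false if the leaf is unsupported or this is not an x86 build.
inline bool query_leaf_0b(unsigned subleaf, TopologyLevel &out) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x0B, subleaf, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    out = decode_leaf_0b(ebx, ecx);
    return true;
#else
    (void)subleaf;
    (void)out;
    return false;
#endif
}
```

Calling query_leaf_0b(0, level) inside the guest and finding logical_count equal to 0 would suggest the hypervisor is not populating this leaf, in which case any code that divides by that value needs a guard.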
