test_batch_normalization is failing #245

Closed
gcp opened this issue May 24, 2018 · 8 comments
Labels
bug A confirmed library bug

Comments

gcp commented May 24, 2018

gcc version 7.3.0
various CPUs (Xeon(R) CPU E3-1240 v3, Ryzen 1700)
Ubuntu 16.04
No MKL
Current git master (72236df)

Doing a simple cmake .. && make && make test produces a test failure:

22/37 Test #22: test_batch_normalization ......................***Failed 45.53 sec
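
(To get more detail out of that failure, a couple of standard options, assuming the usual CTest/gtest layout of this build tree:)

# re-run only the failing test and show its output
ctest -R test_batch_normalization --output-on-failure

# or run the gtest binary directly to see which individual case fails
tests/gtests/test_batch_normalization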

@emfomenk

Hi @gcp,

Could you please provide a bit more information? Please dump the output of the test so that we can at least see what exactly fails. Please also set the MKLDNN_VERBOSE environment variable to 2 to get more detailed output.
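
For example, something along these lines (a minimal sketch; the test binary path follows the build tree layout used above and may differ):

# rerun just the batch normalization gtest with verbose primitive logging
MKLDNN_VERBOSE=2 tests/gtests/test_batch_normalization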

gcp commented May 25, 2018

https://sjeng.org/ftp/work/log.txt.gz

relevant part when run with MKLDNN_VERBOSE:

[ RUN      ] GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,0.0490723
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,1.323
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,0.0959473
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,1.10303
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,0.0629883
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,1.59814
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,0.0549316
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,1.229
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,0.0678711
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,1.17798
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,0.052002
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,1.09985
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:3,mb2ic192ih56iw56,0.0419922
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:3,mb2ic192ih56iw56,1.11719
mkldnn_verbose,create,batch_normalization,jit:avx2,backward_data,fdata:nChw8c fdiff:nChw8c,flags:0,mb2ic192ih56iw56,0.0610352
mkldnn_verbose,exec,batch_normalization,jit:avx2,backward_data,fdata:nChw8c fdiff:nChw8c,flags:0,mb2ic192ih56iw56,1.55908
The difference between (out_diff_src - ref_diff_src) / norm_max and 0. is 0.81878101825714111, which exceeds eps, where
(out_diff_src - ref_diff_src) / norm_max evaluates to -0.81878101825714111,
0. evaluates to 0, and
eps evaluates to 0.62720000743865967.

On one of the machines the test passes today (!). Could this be related to uninitialized memory or saving/restoring CPU state?

I will see if I can make the failure reappear, to rule out the possibility that I had ssh'ed into the wrong machine when I reported it as reproducible on multiple systems.

gcp commented May 25, 2018

I managed to reproduce the failure on both systems again. It does not happen if they are idle, but it does happen if other machine learning tasks are running (PyTorch+cuDNN or TF+cuDNN). It is very unlikely to be related to heat or stability: the CPUs are from different vendors, not overclocked, in well-cooled workstations, and it is exactly the same test that fails.
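
(A rough sketch of how the load dependence can be checked, assuming the stress utility is installed; the load level is arbitrary and this is not necessarily how it was done here:)

# generate background CPU load
stress --cpu 8 --timeout 600 &

# repeat the suspect case until it fails
tests/gtests/test_batch_normalization \
  --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 \
  --gtest_repeat=-1 --gtest_break_on_failure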

Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz is on:
Linux mozwell 4.15.0-20-generic #21~16.04.1-Ubuntu SMP Wed Apr 25 02:42:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
which is an Ubuntu-built kernel. The NVIDIA driver is 390.48.

AMD Ryzen 7 1700 Eight-Core Processor is on:
Linux beast 4.15.13-ryzen #24 SMP Tue Mar 27 09:27:37 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
which is built directly from kernel.org. The NVIDIA driver is 390.30.

My best guess is that there is some issue with memory initialization or processor state.

gcp commented May 25, 2018

I got a report that this also reproduces on a 56-core AVX-512 machine, without NVIDIA drivers installed.

sneves commented May 25, 2018

I can reproduce this on an Ubuntu machine similar to Gian-Carlo's. This box is a Skylake with 8 logical cores. Running the following command (on an otherwise idle machine) shows no failures:

OMP_NUM_THREADS=8 tests/gtests/test_batch_normalization --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 --gtest_repeat=-1

However, the command

OMP_NUM_THREADS=9 tests/gtests/test_batch_normalization --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 --gtest_repeat=-1

fails almost immediately, in the manner described above.

This was also reproduced on a dual-socket Skylake-SP machine with 56 logical cores in total. There the behavior is different: thread counts below the number of logical cores also fail. In particular, running the test with 4, 6, 9, 10, 11, 12, or 13 threads fails, and beyond that it seems to stop failing.
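
(A rough sketch of one way to run such a thread-count sweep; the upper bound is arbitrary and this is not necessarily the exact procedure used:)

# try increasing thread counts and report which ones fail
for t in $(seq 1 16); do
  OMP_NUM_THREADS=$t tests/gtests/test_batch_normalization \
    --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 \
    > /dev/null 2>&1 || echo "failed with $t threads"
done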

vpirogov added the bug (A confirmed library bug) label on May 29, 2018

nastafie commented Jun 6, 2018

Thanks for the details. I've reproduced the issue on an Intel(R) Xeon(R) Platinum 8164 CPU and am looking into it.

beew commented Jun 12, 2018

I am on Ubuntu 16.04 and have the same issue with mkl-dnn-0.14. mkl-dnn is compiled against the small library it downloads (mklml_lnx_2018.0.3.20180406).

But I also have the intel-mkl library (2018.2-199) and OpenBLAS installed on my system, and I switch between them with update-alternatives.

Interestingly, test_batch_normalization fails if the system intel-mkl is set as the default BLAS at run time, but it passes if OpenBLAS is the default. Moreover, OpenBLAS appears to be slightly faster on other tests too. (OpenBLAS is self-compiled, not Ubuntu's stock version.)
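
(For anyone trying to reproduce the BLAS switch: on Ubuntu/Debian it is roughly the following; the exact alternative names vary by release.)

# show and switch the system-wide BLAS/LAPACK implementation
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so.3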

@vpirogov
Member

Resolved in 1ce0f0b
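
(A sketch of how the fix could be verified locally; assumes an out-of-source build directory as used earlier in the thread:)

git fetch origin
git checkout 1ce0f0b
mkdir -p build && cd build && cmake .. && make -j
ctest -R test_batch_normalization --output-on-failure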
