test_batch_normalization is failing #245

Closed
gcp opened this issue May 24, 2018 · 8 comments
Labels
bug A confirmed library bug

Comments

gcp commented May 24, 2018

gcc version 7.3.0
various CPUs (Xeon(R) CPU E3-1240 v3, Ryzen 1700)
Ubuntu 16.04
No MKL
Current git master (72236df)

Doing a simple cmake .. && make && make test produces a test failure:

22/37 Test #22: test_batch_normalization ......................***Failed 45.53 sec
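
(To get more detail out of that failure, a couple of standard options, assuming the usual CTest/gtest layout of this build tree:)

# re-run only the failing test and show its output
ctest -R test_batch_normalization --output-on-failure

# or run the gtest binary directly to see which individual case fails
tests/gtests/test_batch_normalization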

@emfomenk

Hi @gcp,

Could you please provide a bit more information? Please dump the output of the test so that we can at least see what exactly fails. Please also set the MKLDNN_VERBOSE environment variable to 2 to get more detailed output.
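
For example, something along these lines (a minimal sketch; the test binary path follows the build tree layout used above and may differ):

# rerun just the batch normalization gtest with verbose primitive logging
MKLDNN_VERBOSE=2 tests/gtests/test_batch_normalization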

gcp commented May 25, 2018

https://sjeng.org/ftp/work/log.txt.gz

relevant part when run with MKLDNN_VERBOSE:

[ RUN      ] GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,0.0490723
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,1.323
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,0.0959473
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:0,mb2ic192ih56iw56,1.10303
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,0.0629883
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,1.59814
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,0.0549316
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:1,mb2ic192ih56iw56,1.229
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,0.0678711
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_inference,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,1.17798
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,0.052002
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:2,mb2ic192ih56iw56,1.09985
mkldnn_verbose,create,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:3,mb2ic192ih56iw56,0.0419922
mkldnn_verbose,exec,batch_normalization,jit:avx2,forward_training,fdata:nChw8c fdiff:undef,flags:3,mb2ic192ih56iw56,1.11719
mkldnn_verbose,create,batch_normalization,jit:avx2,backward_data,fdata:nChw8c fdiff:nChw8c,flags:0,mb2ic192ih56iw56,0.0610352
mkldnn_verbose,exec,batch_normalization,jit:avx2,backward_data,fdata:nChw8c fdiff:nChw8c,flags:0,mb2ic192ih56iw56,1.55908
The difference between (out_diff_src - ref_diff_src) / norm_max and 0. is 0.81878101825714111, which exceeds eps, where
(out_diff_src - ref_diff_src) / norm_max evaluates to -0.81878101825714111,
0. evaluates to 0, and
eps evaluates to 0.62720000743865967.

On one of the machines the test passes today (!). Could this be related to uninitialized memory or saving/restoring CPU state?

I will see if I can make the failure reappear, to rule out the possibility that I had ssh'ed into the wrong machine when I reported it as reproducible on multiple systems.

gcp commented May 25, 2018

I managed to reproduce the failure on both systems again. It does not happen if they are idle, but it does happen if other machine learning tasks are running (PyTorch+cuDNN or TF+cuDNN). It is very unlikely to be related to heat or stability: the CPUs are from different vendors, not overclocked, in well-cooled workstations, and it is exactly the same test that fails.
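
(A rough sketch of how the load dependence can be checked, assuming the stress utility is installed; the load level is arbitrary and this is not necessarily how it was done here:)

# generate background CPU load
stress --cpu 8 --timeout 600 &

# repeat the suspect case until it fails
tests/gtests/test_batch_normalization \
  --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 \
  --gtest_repeat=-1 --gtest_break_on_failure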

Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz is on:
Linux mozwell 4.15.0-20-generic #21~16.04.1-Ubuntu SMP Wed Apr 25 02:42:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
which is an Ubuntu-built kernel. The NVIDIA driver is 390.48.

AMD Ryzen 7 1700 Eight-Core Processor is on:
Linux beast 4.15.13-ryzen #24 SMP Tue Mar 27 09:27:37 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
which is built directly from kernel.org. The NVIDIA driver is 390.30.

My best guess is that there is some issue with memory initialization or processor state.

gcp commented May 25, 2018

I got a report that this also reproduces on a 56-core AVX-512 machine, without NVIDIA drivers installed.

sneves commented May 25, 2018

I can reproduce this on an Ubuntu machine similar to Gian-Carlo's. This box is a Skylake with 8 logical cores. Running the following command (on an otherwise idle machine) shows no failures:

OMP_NUM_THREADS=8 tests/gtests/test_batch_normalization --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 --gtest_repeat=-1

However, the command

OMP_NUM_THREADS=9 tests/gtests/test_batch_normalization --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 --gtest_repeat=-1

fails almost immediately, in the manner described above.

This was also reproduced on a dual-socket Skylake-SP machine with 56 logical cores in total. There the behavior is different: thread counts below the number of logical cores also fail. In particular, running the test with 4, 6, 9, 10, 11, 12, or 13 threads fails, and beyond that it seems to stop failing.
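
(A rough sketch of one way to run such a thread-count sweep; the upper bound is arbitrary and this is not necessarily the exact procedure used:)

# try increasing thread counts and report which ones fail
for t in $(seq 1 16); do
  OMP_NUM_THREADS=$t tests/gtests/test_batch_normalization \
    --gtest_filter=GoogleNet_Blocked_8/bnrm_test_float.TestsBnrm/2 \
    > /dev/null 2>&1 || echo "failed with $t threads"
done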

vpirogov added the bug (A confirmed library bug) label on May 29, 2018

nastafie commented Jun 6, 2018

Thanks for the details. I've reproduced the issue on an Intel(R) Xeon(R) Platinum 8164 CPU and am looking into it.

beew commented Jun 12, 2018

I am on Ubuntu 16.04 and have the same issue with mkl-dnn-0.14. mkl-dnn is compiled against the small library it downloads (mklml_lnx_2018.0.3.20180406).

But I also have the intel-mkl library (2018.2-199) and OpenBLAS installed on my system, and I switch between them with update-alternatives.

Interestingly, test_batch_normalization fails if the system intel-mkl is set as the default BLAS at run time, but it passes if OpenBLAS is the default. Moreover, OpenBLAS appears to be slightly faster on other tests too. (OpenBLAS is self-compiled, not Ubuntu's stock version.)
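
(For anyone trying to reproduce the BLAS switch: on Ubuntu/Debian it is roughly the following; the exact alternative names vary by release.)

# show and switch the system-wide BLAS/LAPACK implementation
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so.3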

@vpirogov
Member

Resolved in 1ce0f0b
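
(A sketch of how the fix could be verified locally; assumes an out-of-source build directory as used earlier in the thread:)

git fetch origin
git checkout 1ce0f0b
mkdir -p build && cd build && cmake .. && make -j
ctest -R test_batch_normalization --output-on-failure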
