test_batch_normalization is failing #245
Comments
Hi @gcp, could you please provide a bit more information? Please dump the output of the test so that we can see exactly what fails, and set the MKLDNN_VERBOSE environment variable to 2 to get more verbose output.
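One way to capture that output is sketched below in Python. It assumes the failing test is launched as a standalone binary whose path is passed on the command line; the helper name `run_with_verbose` and the log file name are made up for illustration. Only the MKLDNN_VERBOSE variable and the value 2 come from this thread.

```python
import os
import subprocess
import sys


def run_with_verbose(cmd, log_path, level="2"):
    """Run cmd with MKLDNN_VERBOSE set and save stdout+stderr to log_path.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Copy the current environment and add the verbosity setting.
    env = dict(os.environ, MKLDNN_VERBOSE=level)
    result = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        universal_newlines=True,
    )
    with open(log_path, "w") as log:
        log.write(result.stdout)
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # The test command is taken from the command line; no path is assumed.
    code = run_with_verbose(sys.argv[1:], "log.txt")
    print("exit code:", code)
```

Saved as a script, it would be invoked with the path to the test binary as its argument, and the resulting log.txt can then be attached to the issue.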
Full log: https://sjeng.org/ftp/work/log.txt.gz. Relevant part when run with MKLDNN_VERBOSE:
On one of the machines the test passes today (!). Could this be related to uninitialized memory or to saving/restoring CPU state? I will see if I can make the failure reappear, to rule out the possibility that I had ssh'ed into the wrong machine when I reported that it reproduces on multiple machines.
I managed to reproduce the failure on both systems again. It does not happen when they are idle, but it does happen when other machine learning tasks are running (PyTorch+cuDNN or TF+cuDNN). It is very unlikely to be related to heat or stability: the CPUs come from different vendors, are not overclocked, sit in well-cooled workstations, and exactly the same test fails on both. Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz is on: AMD Ryzen 7 1700 Eight-Core Processor is on: My best guess is that there is some issue with memory initialization or processor state.
I got a report that this also reproduces on a 56-core AVX-512 machine without NVIDIA drivers installed.
I can reproduce this on an Ubuntu machine similar to Gian-Carlo's. This box is a Skylake with 8 logical cores. Running the following command (on an otherwise idle machine) shows no failures:
However, the command
fails almost immediately, in the manner described above. This was also reproduced on a dual-socket Skylake-SP machine with 56 logical cores in total. There the behavior is different: thread counts below the number of logical cores also fail. In particular, running the test with 4, 6, 9, 10, 11, 12, or 13 threads fails; beyond that it seems to stop failing.
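The exact commands used above are not preserved in this thread. As a hedged sketch of this kind of experiment, the Python helper below reruns a command under several thread counts and counts the failing runs. It assumes the thread count is set through the OpenMP OMP_NUM_THREADS environment variable, which may not be how the original runs were configured; `sweep_threads` and its parameters are hypothetical names.

```python
import os
import subprocess
import sys


def sweep_threads(cmd, thread_counts, repeats=3):
    """Return {thread_count: number_of_failing_runs} for cmd.

    Assumption: the program under test honors OMP_NUM_THREADS.
    """
    failures = {}
    for n in thread_counts:
        env = dict(os.environ, OMP_NUM_THREADS=str(n))
        failed = 0
        # Repeat each setting, since the failure is intermittent.
        for _ in range(repeats):
            result = subprocess.run(
                cmd,
                env=env,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            if result.returncode != 0:
                failed += 1
        failures[n] = failed
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    # The test command comes from the command line; no path is assumed.
    for n, failed in sweep_threads(sys.argv[1:], range(1, 17)).items():
        print("threads=%d failing_runs=%d" % (n, failed))
```

Repeating each setting matters here because a single passing run does not show that a given thread count is safe.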
Thanks for the details. I've reproduced the issue on an Intel(R) Xeon(R) Platinum 8164 CPU and am looking into it.
I am on Ubuntu 16.04 and have the same issue with mkl-dnn-0.14. mkl-dnn is compiled with the small library it downloaded (mklml_lnx_2018.0.3.20180406), but I also have the intel-mkl library (2018.2-199) and openblas installed on my system, and I switch between them with update-alternatives. Interestingly, test_batch_normalization fails if the system intel-mkl is set as the default BLAS at run time, but it passes if openblas is set as the default BLAS. Moreover, openblas appears to be slightly faster on other tests too. (openblas is self-compiled, not Ubuntu's stock version.)
Resolved in 1ce0f0b.
gcc version 7.3.0
various CPUs (Xeon(R) CPU E3-1240 v3, Ryzen 1700)
Ubuntu 16.04
No MKL
Current git master (72236df)
Doing a simple cmake .. && make && make test produces a test failure:
22/37 Test #22: test_batch_normalization ......................***Failed 45.53 sec