Version 1.8.3
Overview: while v1.9 is in the works, this release fixes two issues and pushes for improved (OSX with Intel Compiler) and wider OS/compiler coverage (MinGW, BSD; see Compatibility). Among the minor or exotic issues resolved in this release, stand-alone JIT-generated matrix transposes (out-of-place) are now limited to matrix shapes for which only a reasonable amount of code is generated. A rare synchronization issue, reproduced with CP2K/smp in LIBXSMM v1.8.1 (and likely earlier), has been resolved since the previous release (v1.8.2).
JIT code generation/dispatch performance: JIT-generating code (non-transposed GEMMs) is known to be blazingly fast, which this release (re-)confirms with the extended dispatch microbenchmark. Single-threaded (uncontended) code generation of matrix kernels with M,N,K := 4...64 (uniformly distributed random numbers) takes less than 25 µs on typical systems. Non-cached code dispatch takes less than 50x longer than calling a function that does nothing, whereas cached code dispatch takes less than 15x longer than an empty function. Code dispatch is therefore roughly three orders of magnitude faster than code generation (nanoseconds vs. microseconds).
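The measurement idea behind these numbers can be illustrated with a small, self-contained toy. This is not LIBXSMM's dispatch microbenchmark and does not use the LIBXSMM API. The names `Registry`, `generate_kernel`, `random_shapes`, and `run` are hypothetical, and a Python closure stands in for JIT-generated code. The sketch only shows the methodology under those assumptions: draw M, N, K uniformly from 4...64, time the first request per shape (which includes generation), time repeated requests (lookup only), and compare both against a call to an empty function. It does not model the difference between cached and non-cached dispatch, and absolute timings in Python are not comparable to the figures above.

```python
import random
import time


def empty():
    """Baseline: a function that does nothing."""
    return None


def generate_kernel(m, n, k):
    """Stand-in for code generation: builds a naive C += A * B closure."""
    def kernel(a, b, c):
        # a is m x k, b is k x n, c is m x n (lists of rows)
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[i][p] * b[p][j]
                c[i][j] += acc
    return kernel


class Registry:
    """Toy code registry: generates on first request, reuses afterwards."""

    def __init__(self):
        self._kernels = {}
        self.generated = 0  # number of kernels actually generated

    def dispatch(self, m, n, k):
        key = (m, n, k)
        kernel = self._kernels.get(key)
        if kernel is None:
            kernel = generate_kernel(m, n, k)
            self._kernels[key] = kernel
            self.generated += 1
        return kernel


def random_shapes(count, lo=4, hi=64, seed=0):
    """Draw (M, N, K) triples uniformly from lo..hi (inclusive)."""
    rng = random.Random(seed)
    return [(rng.randint(lo, hi), rng.randint(lo, hi), rng.randint(lo, hi))
            for _ in range(count)]


def seconds_per_call(fn, args_list):
    """Average wall-clock time of fn(*args) over all argument tuples."""
    start = time.perf_counter()
    for args in args_list:
        fn(*args)
    return (time.perf_counter() - start) / len(args_list)


def run(count=1000):
    shapes = random_shapes(count)
    registry = Registry()
    first = seconds_per_call(registry.dispatch, shapes)   # includes generation
    again = seconds_per_call(registry.dispatch, shapes)   # lookups only
    base = seconds_per_call(empty, [()] * count)          # empty-call baseline
    return {"first_request": first, "repeated_request": again, "empty_call": base}


if __name__ == "__main__":
    for name, seconds in run().items():
        print(f"{name}: {seconds * 1e9:.0f} ns per call")
```

Running the script prints one average per category. Dividing `first_request` and `repeated_request` by `empty_call` gives ratios analogous in spirit to the "x longer than an empty function" figures quoted above.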
INTRODUCED
- Support for mixing C and C++ code when using header-only based LIBXSMM.
- Issue 202: reintroduced copy-update with LIBXSMM's install target (make).
- Experimental: sketched Python support built into LIBXSMM (PYMOD=1).
IMPROVEMENTS / CHANGES
- Completed revision of synchronization layer (started in v1.8.2); initial documentation.
- Reduced TRACE output caused by self-watching (internal) initialization/termination.
- Wider OS validation incl. more exotic sets (MinGW in addition to Cygwin, BSD).
- Prevented production code (non-debug) on 32-bit platforms (compilation error).
- Increased test variety while staying within same turnaround time limit.
- Continued to close implementation gaps (synchronization primitives).
- Sparse SOA domain received fixes/improvements driven by EDGE.
- More readable code snippets in documentation (reduced width).
- Initial preparation for JIT-generating SSE code (disabled).
- Improved detection of OpenBLAS library (Makefile.inc).
- Updated (outdated) support for Intel Compiler (OSX).
- Compliant soname under Linux and OSX.
FIXES
- Fixed selection of statically generated code targeting Skylake server (SKX).
- Sparse SOA domain: resolved issues pointed out by static analysis.
- Fixed support for JIT-generated matrix transpose (code size).
- Fixed selecting an incorrect prefetch strategy (BGEMM).