The "MM" stands for Matrix Multiplication, and the "S" clarifies the working domain, i.e., Small Matrix Multiplication. The latter also means the name is neither a variation of "MXM" nor an eXtreme Small Matrix Multiplication but rather about Intel Architecture (x86) - and no, the library is 64‑bit only. The spelling of the name might follow the syllables of libx\/smm, libx'smm, or libx‑smm.
NOTE: the library does not support 32-bit architectures (64-bit only).
When characterizing the problem size using the M, N, and K parameters, a problem size suitable for LIBXSMM falls approximately within (M · N · K)^(1/3) <= 128 (which illustrates that non-square matrices or even "tall and skinny" shapes are covered as well). The library is typically used to generate code up to the specified threshold. Raising the threshold not only risks generating excessive amounts of code (due to unrolling in the M or K dimension), but the generated code also lacks a tiling scheme to effectively utilize the cache hierarchy. For auto-dispatched problem sizes above the configurable threshold (explicitly JIT'ted code is not subject to the threshold), LIBXSMM falls back to BLAS. In terms of GEMM, the supported kernels are limited to Alpha := 1, Beta := { 1, 0 }, and TransA := 'N'.
NOTE: Alpha, Beta, and TransA are limited to 1, { 1, 0 }, and 'N', respectively.
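As a minimal sketch of how such a JIT kernel is obtained and invoked, the following assumes LIBXSMM's classic dispatch API for double precision (`libxsmm_dmmdispatch`, as documented for the 1.x series); the surrounding function and the fallback are illustrative only:

```c
#include <libxsmm.h>

/* Minimal sketch: JIT-dispatch a small DGEMM kernel, C := C + A * B.
 * NULL arguments request defaults: LDx derived from M/N/K, Alpha=1, Beta=1. */
void small_gemm(const double* a, const double* b, double* c,
                libxsmm_blasint m, libxsmm_blasint n, libxsmm_blasint k)
{
  const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(
    m, n, k, NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
    NULL/*alpha=1*/, NULL/*beta=1*/, NULL/*flags*/, NULL/*prefetch*/);
  if (NULL != kernel) {
    kernel(a, b, c); /* execute the JIT-generated kernel */
  }
  else {
    /* no kernel for this shape/ISA: fall back, e.g., to a BLAS dgemm call */
  }
}
```

Passing NULL for the leading dimensions, Alpha, and Beta requests the defaults, which match the limitations stated in the note above.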
In recent years, new workloads such as deep learning, and more specifically convolutional neural networks (CNNs), have emerged and are pushing the limits of today's hardware. One of the expensive kernels is a small convolution with certain kernel sizes (3, 5, or 7) such that calculation in frequency space is not the most efficient method when compared with direct convolutions. LIBXSMM's current support for convolutions aims at an easy-to-use invocation of small (direct) convolutions, which are intended for CNN training and classification. The interface is currently ramping up, and the functionality is quickly growing toward a broader set of use cases.
For cache-tiled or parallelized routines, please rely on, for example, OpenBLAS or the Intel Math Kernel Library (Intel MKL). It is possible to reuse LIBXSMM's kernels for big(ger) matrix multiplications; however, such an implementation is out of scope for LIBXSMM's core functionality.
If the application uses BLAS to carry out matrix multiplications, one may use the Call Wrapper and measure the application performance, e.g., time to solution. However, the latter can improve significantly when using LIBXSMM's API directly. To check whether there are applicable GEMM calls, the Verbose Mode can help to collect insight. Further, when an application uses Intel MKL 11.2 (or higher), running the application with the environment variable MKL_VERBOSE=1 (`env MKL_VERBOSE=1 ./workload > verbose.txt`) can collect a similar insight (`grep -a "MKL_VERBOSE DGEMM(N,N" verbose.txt | cut -d'(' -f2 | cut -d, -f3-5`).
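For illustration, the kind of call the wrapper can intercept is an ordinary BLAS DGEMM invocation like the hypothetical application snippet below (the `dgemm_` prototype assumes the Fortran ABI with 32-bit integers; `multiply` is a placeholder name, not part of LIBXSMM):

```c
/* Hypothetical application code: a plain BLAS DGEMM call (Fortran ABI).
 * With the call wrapper in place, eligible calls (TransA='N', Alpha=1,
 * Beta in {0,1}, problem size below the threshold) can be serviced by LIBXSMM. */
void dgemm_(const char* transa, const char* transb,
            const int* m, const int* n, const int* k,
            const double* alpha, const double* a, const int* lda,
            const double* b, const int* ldb,
            const double* beta, double* c, const int* ldc);

void multiply(const double* a, const double* b, double* c, int m, int n, int k)
{
  const double alpha = 1.0, beta = 1.0;
  dgemm_("N", "N", &m, &n, &k, &alpha, a, &m, b, &k, &beta, c, &m);
}
```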
One may have a look at issue #120 or #282, but in summary:
- Binary compatibility is not continuously tested (only manually, and only for a subset of the API, namely the SMM domain).
- Major versions are likely to break binary compatibility with existing integrations (which is typical).
- Minor versions may break binary compatibility of recently introduced features (which may not be typical).
- Update and patch versions are binary compatible but may only be released on request (per issue).
LIBXSMM's API for Small Matrix Multiplications (SMMs) is considered stable, and all major known applications (e.g., CP2K, EDGE, NEK5K, and SeisSol) either rely on SMMs or are able (and willing) to benefit from an improved API in any of the other domains (e.g., DL). Until at least v2.0, LIBXSMM does not track or maintain binary compatibility, and hence the SONAME follows the semantic version. A list of public functions is maintained (but there is no distinction for the small subset of them that is only meant for communication between LIBXSMM and LIBXSMM/ext).
I am relying on a prebuilt version of CP2K (or another application); is LIBXSMM incorporated, and which version is it?
This can be determined using the environment variable `LIBXSMM_VERBOSE=2` (or higher verbosity). It is not even required to use an input or workload, since the information in question is presented when the program terminates. For example:

```
LIBXSMM_VERBOSE=1 exe/Linux-x86-64-intelx/cp2k.psmp
[...]
LIBXSMM_VERSION: release-1.11
LIBXSMM_TARGET: clx
```
I am relying on a prebuilt version of an application, and I am concerned about optimal compiler flags.
LIBXSMM uses JIT-generated code according to the CPUID of the system, which is independent of the compiler flags used to build the library. If LIBXSMM was incorporated per classic ABI, the environment variable `LIBXSMM_DUMP_BUILD=1` allows printing the build flags at termination of the application. This output can yield hints about flags suitable for building the application (if similar to the flags used for LIBXSMM).
The answer here focuses on the actual runtime support rather than the supported compiler toolchains used to build the library. All flavors of Linux are supported (if the library was successfully built), which includes installations running a security-hardened Linux kernel (SELinux). Apple macOS (formerly OS X) is supported, which also includes more recent versions with System Integrity Protection (SIP) enabled. The BSD OS is likely supported, but building the library is only occasionally validated. Microsoft Windows is supported for non-JIT operation and for most of the JIT kernels (e.g., GEMM and MATCOPY); the prefetch signature is not supported. There is currently no support for JIT in the DNN domain (no further check is performed, i.e., it crashes at runtime). See also issue #71.
The library generates acceptable code when using `M=1` or `N=1`. For example, building with `make M=16 N=1 K=16 AVX=2` and inspecting the assembly (build directory), or dumping/disassembling the JIT code (see reference documentation), shows the minimum number of load/store instructions. Given that GEMV is a memory-bound operation, this suggests reasonable code quality. LIBXSMM selects from multiple microkernels (specific to each ISA extension) using a fixed scheme/heuristic, which should be acceptable for GEMV. The sample code under samples/smm provides ready-to-use benchmark drivers that can help to compare the performance with LAPACK/BLAS. The aforementioned benchmarks exercise streaming of all possible combinations of operands.
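As a brief sketch of the GEMV mapping, y := A · x corresponds to GEMM with N=1; the code below again assumes the classic dispatch API (`libxsmm_dmmdispatch`), and `gemv_like` is a placeholder name:

```c
#include <libxsmm.h>

/* Sketch: GEMV-like operation via the GEMM interface with N=1.
 * A is M-by-K (column-major), x is a K-vector, y is an M-vector. */
void gemv_like(const double* a, const double* x, double* y,
               libxsmm_blasint m, libxsmm_blasint k)
{
  const libxsmm_blasint n = 1;
  const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(
    m, n, k, NULL, NULL, NULL, NULL/*alpha=1*/, NULL/*beta=1*/, NULL, NULL);
  if (NULL != kernel) kernel(a, x, y); /* y += A * x (Beta=1 accumulates) */
}
```

With Beta=0 (pass a pointer to 0.0 instead of NULL), y would be overwritten rather than accumulated, in line with the Beta := { 1, 0 } limitation noted earlier.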
This question refers to the following kinds of element types in the GEMM interface of LIBXSMM:
- Complex types: complex numbers in single- and double-precision,
- Mixed types: e.g., real double-precision and complex double-precision.

There are no (immediate) plans to support more types for the GEMM part. Please note that LIBXSMM indeed supports lower-precision GEMM (wgemm).
All feedback and issue reports are handled openly; they are welcome and considered (answered and collected). However, we do not seek "feature votes", since the development of the library is not a democratic process.