# memcpy-gemm
memcpy-gemm provides libraries and binaries for testing the communication
and compute performance of GPUs. The primary binary generated for this purpose
is the memcpy_gemm program. memcpy_gemm supports testing communication speed
between any GPU and its host or peers, and can test the compute
performance in half, single, or double precision for matrix multiplication.
## Requirements:
memcpy-gemm requires the following tools for installation:
* bazel
* autoconf
memcpy-gemm depends on the following libraries. When building, bazel
will pull the versions specified in the WORKSPACE file from internet sources:
* abseil
* glog
* googletest
* libnuma
* half
* re2
Currently, libnuma depends on autoconf for configuration, which we wrap in bazel
at build time.
memcpy-gemm also requires a CUDA installation. The default location is
/usr/local/cuda. To specify a different directory, export a `CUDA_PATH`
environment variable. Note that bazel will create a symlink to this path
as "cuda", which is how CUDA headers are referenced in the code. memcpy-gemm
is compatible with CUDA >= 9.0, and makes no guarantees about long-term support
for older CUDA versions going forward.
A few CUDA sources in memcpy-gemm need to be compiled with NVCC. If CUDA_PATH is
correctly set (or the default location is correct), NVCC will be found and
invoked automatically. NVCC only works with certain host compilers, so you
may need to change the host compiler that bazel uses. The quickest
way to do this is via the CC and CXX environment variables.
## Building:
With the proper dependencies installed, building the memcpy_gemm binary
should be as simple as:
`bazel build :memcpy_gemm`
and the memcpy_gemm binary is placed in the bazel-bin directory.
If CUDA and/or host compilers need to be configured, a build invocation may
look something like:
```
# In this example, we have a custom CUDA location, and we need to specify
# GCC 9, because the default gcc (10 in this example) doesn't work with CUDA 11
# NVCC
CUDA_PATH=/usr/local/cuda-11.0 CC=gcc-9 CXX=g++-9 bazel build :memcpy_gemm
```
To run tests, use "bazel test :memcpy_gemm_test" or "bazel test :all". Note that
the memcpy_gemm_test assumes that one GPU is available, but makes no assumptions
about its performance characteristics.
## Creating a static build:
By default, when built on most Linux systems, memcpy-gemm dynamically links to
libgcc, libstdc++, and libc. While some dynamic dependencies on libc are unavoidable,
memcpy-gemm will link statically to libstdc++ and libgcc if compiled in the static configuration:
`bazel build --config=static :memcpy_gemm`
This mode is useful for sidestepping dependencies when cross-compiling.
### Configuring static build
memcpy-gemm uses its own description of the C++ toolchain for static builds. On some Linux systems,
the toolchain include path to the standard C++ library may be incorrectly configured. When building in
static mode, this can lead to errors such as:
> this rule is missing dependency declarations for the following files included by (some library)
> '/usr/lib/....some header'
> '/usr/lib/....other header'
In the short term, this issue can be fixed by modifying the C++ toolchain descriptor file.
Open /(path/to/memcpy-gemm/source)/toolchain/cc_toolchain_config.bzl. Search for the
`cxx_builtin_include_directories` variable, and add the path to your headers to the list.
## Running:
To see the list of options for memcpy_gemm, run "/path/to/memcpy_gemm --help".
### Supported compute types
memcpy-gemm supports a variety of input/output/compute precisions. The
supported precisions depend on the GPU architecture and CUDA version. With CUDA
9 or later, the following data type combinations are allowed (cc = compute capability
of the device):
| `input_precision` | `output_precision` | `compute_precision` | Restrictions | Notes |
| ----------------- | ------------------ | ------------------- | ------------ | ----- |
| single | single | single | none | same as fp_precision = 'single' |
| double | double | double | none | same as fp_precision = 'double' |
| half | half | half | cc >= 6.0 | same as fp_precision = 'half' |
| half | single | single | cc >= 6.0 | commonly called 'mixed precision' |
| int8 | int32 | int32 | cc >= 7.0 | |
| int8 | single | single | cc >= 7.0 | On cc >= 7.5, can use tensor cores. |
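The table above can be encoded as a small lookup for checking a configuration before launching a run. This is only an illustrative sketch: the names `SUPPORTED` and `is_supported` are hypothetical and not part of memcpy-gemm, and the compute capability is assumed to be passed as a `(major, minor)` tuple.

```python
# Hypothetical helper, not part of memcpy-gemm.
# Keys are (input_precision, output_precision, compute_precision) rows from
# the table above; values are the minimum compute capability as (major, minor).
# (0, 0) stands for "no restriction".
SUPPORTED = {
    ("single", "single", "single"): (0, 0),
    ("double", "double", "double"): (0, 0),
    ("half", "half", "half"): (6, 0),
    ("half", "single", "single"): (6, 0),
    ("int8", "int32", "int32"): (7, 0),
    ("int8", "single", "single"): (7, 0),
}


def is_supported(input_precision, output_precision, compute_precision, cc):
    """Return True if the combination is listed and cc meets its minimum."""
    key = (input_precision, output_precision, compute_precision)
    min_cc = SUPPORTED.get(key)
    if min_cc is None:
        # Combination does not appear in the table.
        return False
    # Tuples compare element by element, so (7, 5) >= (7, 0) holds.
    return tuple(cc) >= min_cc
```

For example, `is_supported("half", "single", "single", (7, 0))` is true, while `is_supported("int8", "int32", "int32", (6, 1))` is false because the int8 rows require cc >= 7.0.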
### Memory Usage
memcpy-gemm allocates CPU and GPU memory for both memcpy and GEMM operations. For
memcpy tests, memcpy-gemm allocates a buffer on both the source and destination
devices (be it a CPU or GPU), with the size equal to the number of kibibytes
specified by the buffer_size_KiB flag. Buffers are not reused between flows, so
(for example) with flows 'c0-g0-a0 g0-g1-a1', GPU0 will allocate 2 buffers, one
for each flow.
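As a rough way to estimate memcpy buffer usage per device, the rule above can be sketched as follows. The helper `memcpy_buffer_bytes` is hypothetical, and it assumes that the first two dash-separated fields of each flow name the source and destination devices; the third field is ignored here because its meaning is not described in this section.

```python
from collections import Counter


def memcpy_buffer_bytes(flows, buffer_size_kib):
    """Estimate memcpy buffer bytes per device (hypothetical helper).

    Assumption: each flow looks like 'src-dst-x', where src and dst are the
    source and destination devices. The third field is not interpreted.
    """
    buffers = Counter()
    for flow in flows.split():
        source, destination = flow.split("-")[:2]
        # One buffer on each end of the flow; buffers are not shared
        # between flows, so every flow adds to the count.
        buffers[source] += 1
        buffers[destination] += 1
    return {dev: n * buffer_size_kib * 1024 for dev, n in buffers.items()}
```

With flows `'c0-g0-a0 g0-g1-a1'` and a 1024 KiB buffer, this gives one buffer (1 MiB) for `c0` and `g1`, and two buffers (2 MiB) for `g0`, matching the example above.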
If the --gemm option is set to true, GEMM calls allocate two input arrays and one
output array on each device specified in the --gpus flag. The total number of bytes
allocated on each device is then:

    bytes = (dim1_matrix1 * dim2_matrix1 * sizeof(input precision)) + \
            (dim1_matrix2 * dim2_matrix2 * sizeof(input precision)) + \
            (dim1_output_matrix * dim2_output_matrix * sizeof(output precision))

If M=N=K, using N as the dimension, this simplifies to:

    bytes = 2 * (N^2 * sizeof(input precision)) + N^2 * sizeof(output precision)

The sizes of the supported precisions are:
* int8 - 1 byte
* half - 2 bytes
* bf16 - 2 bytes
* single - 4 bytes
* int32 - 4 bytes
* double - 8 bytes
In addition, expect a small GPU memory overhead from cuBLAS initialization.
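The GEMM formula and precision sizes above can be turned into a quick estimate. This is a minimal sketch, not part of memcpy-gemm: `SIZEOF` and `gemm_bytes` are hypothetical names, and it assumes the usual GEMM convention that the first input is M x K, the second is K x N, and the output is M x N. The cuBLAS overhead is not included.

```python
# Bytes per element, copied from the list of precision sizes above.
SIZEOF = {"int8": 1, "half": 2, "bf16": 2, "single": 4, "int32": 4, "double": 8}


def gemm_bytes(m, n, k, input_precision, output_precision):
    """Estimate GEMM array bytes allocated on one GPU (hypothetical helper).

    Assumption: matrix1 is m x k, matrix2 is k x n, output is m x n.
    """
    in_size = SIZEOF[input_precision]
    out_size = SIZEOF[output_precision]
    inputs = (m * k + k * n) * in_size
    output = m * n * out_size
    return inputs + output


# Mixed precision (half inputs, single output) with M = N = K = 8192:
# 2 * 8192^2 * 2 + 8192^2 * 4 = 536870912 bytes (512 MiB).
print(gemm_bytes(8192, 8192, 8192, "half", "single"))
```

Multiplying the result by the number of devices passed to --gpus gives the total across GPUs, since each listed device allocates its own arrays.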