This repository exposes source code to reproduce experiments as presented in "OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF".
This repository contains submodules. To get all contents, please clone with
git clone --recursive https://github.com/inetrg/Agere-LNCS-2017.git. The command will clone and checkout a dedicated CAF version, VAST version and the Indexing repository.
In addition to a host that runs OpenCL, you will need the following projects. The script
build_deps.sh will build them for you. If you want to do this manually, see
All project require Cmake (>3.1) and a C++ compiler to build. CAF specifically depends on C++11 while VAST requires a C++14 compatible compiler.
For the measurements presented in the Paper, we used the following hosts:
- a late 2013 iMac with a 3.5 GHz Intel Core i7 running OS X 10.12 and OpenCL version 1.2
- NVIDIA GeForce GTX 780M GPU with 4096 MB memory
- a Server with two twelve-core Intel Xeon CPUs clocked at 2.5 GHz running Linux (CentOS 7)
- Tesla C2075 GPU with Nvidia driver version 375.20
- Xeon Phi 5110P coprocessor using the Intel OpenCL Runtime 14.2
The paper present multiple benchmarks which are listed in separate sections below. The benchmark implementations are located in the
benchmarks folder with an exception of the benchmark related to bitmap indexing which is located in the separate
indexing repository that is included as a submodule.
list_devices included in the benchmarking programs can list the OpenCL devices available on your system.
Use Case: Indexing
The paper (Section 4.2) shows a graph on indexing that compares OpenCL Actors to a CPU implementation using VAST. There is a program to generate test data (
generate) in the build with the indexing project. Per default it generates values with 16 bit cardinality.
The programs for the comparison are the
vst executables. Both are available in the related files in the
src folder. The script
measure.sh in the indexing folder will create test data and perform the measurements for you.
If you want to perform different measurements, each program accepts a file with input data using the option
-f. Per default,
phases will choose the first available OpenCL device. You can also specify a device name in the code (line 337) or pass it as an argument via
-d. For further information see
Pipeline Messaging Overhead
The indexing project further builds the
overhead executable to estimate the time for passing messages between OpenCL actor stages (shortly mentioned in Section 3.6). It accepts the argument
--use_mapping_functions to decide whether the mapping functions or round trip time for a complete calculation is used for the estimate. Additionally, the argument
-i I determines how many iterations to execute.
The program prints the mean over
I iterations in microseconds.
This benchmark is presented in Section 5.1. It is measured by two programs, one for core actors (
bench_spawn_core) and one for OpenCL actors (
The OpenCL benchmark requires a matrix size > 0 (
-s N) as well as an iteration count (
-i I) that specifies how many actors will be spawned. The core actor benchmark only requires the iteration count (
-i I) and ignores the size argument. The paper includes measurements for values of I from 100,000 to 1,000,000 in steps of 100,000.
Each program prints the time to create
I actors in microseconds.
This benchmark is presented in Section 5.2 of the paper. It requires some adjustments to CAF as the timepoints used for measurement are usually included. The following external variables must be added to the command class
In lines 44 and 45:
static std::chrono::high_resolution_clock::time_point a; static std::chrono::high_resolution_clock::time_point b;
In line 100 at the beginning of the first
a = std::chrono::high_resolution_clock::now();
And in line 229 in the first line of the
b = std::chrono::high_resolution_clock::now();
The related program is called
bench_overhead and requires a matrix size as input (
-s N). The paper includes measurements for sizes of
N = 1000, 4000, 8000 and 12000. The benchmarks prints three measurements in microseconds: the total runtime, the time spent in OpenCL and the difference between both values (separated by commas).
The benchmark is presented in Section 5.3 of the paper. The programs for the comparison are
bench_native_comparison for native OpenCL and
bench_caf_comparison for the OpenCL actor. Both require a matrix size
-s N and an iteration count
-i I. Optionally, a device can be specified with
-d D. The paper uses a size of
N equal to 1,000 and iterations
I from 1,000 to 10,000 in steps of 1,000.
Each program prints the runtime for
I iterations in microseconds.
Scaling in a heterogeneous setup
This benchmark is implemented in the program
bench_matrix_offloading which calculates an image of Mandelbrot set. Configuration arguments are the width (
-W WIDTH) and height (
-H HEIGHT) of the image as well as the percentage that is offloaded with OpenCL (
--with-opencl=PERCENTAGE). Additionally, the program accepts a device name (
-d D) and a number of iterations to perform (
The graphs in the paper were calculated with a width and height of 16,000 for 100 and 1,000 iterations while moving the image from the CPU to an OpenCL device in steps of 10%.
The Origin project file in the repository includes the data and the graphs in the paper.
build_deps.sh will build all project for you, but this can be done manually via the following commands:
cd actor-framework ./configure --build-type=release make -j 4
cd vast ./configure --build-type=release --with-caf=../actor-framework/build make -j 4
cd indexing ./configure --build-type=release --with-caf=../actor-framework/build --with-vast=../vast/build make -j 4
cd benchmarks ./configure --build-type=release --with-caf=../actor-framework/build make -j 4
For more CAF-related benchmarks, see the CAF Benchmark Suit. It compares CAF to various actor implementations such as Scala or Erlang. For more general actor benchmarks, take a look at the Savina benchmark suite.