Pure is a parallel programming model and runtime system. Pure enables programmers to improve performance of parallel applications on multicore clusters with minimal additional programming effort.
- System Overview
- Example Application Pseudocode
- Directory Contents
- Installation
- Writing and Compiling Pure Applications
- Academic Papers
Pure is a parallel programming model and runtime system explicitly designed to take advantage of shared memory within nodes. Its programming model is mostly message passing, enhanced with optional tasks that put otherwise idle cores to work. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting for messages to arrive, and (b) by using efficient lock-free data structures in shared memory to achieve high-performance messaging and collective operations between the ranks within a node.
In our PPoPP'24 paper, we showed significant speedups from Pure, including speedups of up to 2.1× on the CoMD molecular dynamics and the miniAMR adaptive mesh refinement applications, scaling up to 4,096 cores. Further microbenchmarks in the paper show speedups over MPI from 2× to 17× on communication and collective operations running on 2 to 65,536 cores.
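To make the idea of lock-free messaging through shared memory more concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer, a common building block for passing messages between two threads without locks. It is an illustration only and not Pure's implementation; the class name SpscRing and its methods try_push and try_pop are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical single-producer/single-consumer queue (not a Pure type).
// Exactly one thread may call try_push and exactly one may call try_pop.
template <typename T, std::size_t Capacity>
class SpscRing {
  public:
    // Sender side: returns false when the ring is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % kSlots;
        // Acquire pairs with the receiver's release store to head_, so the
        // slot we are about to overwrite has already been read.
        if (next == head_.load(std::memory_order_acquire)) {
            return false;
        }
        slots_[tail] = value;
        // Release publishes the slot contents before the new tail is visible.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Receiver side: returns std::nullopt when the ring is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return std::nullopt;
        }
        T value = slots_[head];
        head_.store((head + 1) % kSlots, std::memory_order_release);
        return value;
    }

  private:
    // One slot always stays empty so that "full" and "empty" are distinguishable.
    static constexpr std::size_t kSlots = Capacity + 1;
    std::array<T, kSlots> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

In use, the sending thread retries try_push while the ring is full and the receiving thread retries try_pop while it is empty. Neither side ever takes a lock; the two atomic indices are the only shared state.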
In this section we show a simple 1-D Jacobi-like stencil, first as an MPI program and then as the equivalent Pure program. The example is meant to illustrate the key features of Pure: messaging and optional task execution. See the Pure paper for more detail. Note that this code is slightly cleaned up for readability; see tests/jacobi_with_tasks for the runnable versions.

    #include "mpi.h"

    void rand_stencil_mpi(double* const a, size_t arr_sz, size_t iters, int my_rank,
                          int n_ranks) {
        double temp[arr_sz];
        for (auto it = 0; it < iters; ++it) {
            for (auto i = 0; i < arr_sz; ++i) {
                temp[i] = random_work(a[i]);
            }
            for (auto i = 1; i < arr_sz - 1; ++i) {
                a[i] = (temp[i - 1] + temp[i] + temp[i + 1]) / 3.0;
            }
            if (my_rank > 0) {
                MPI_Send(&temp[0], 1, MPI_DOUBLE, my_rank - 1, 0, MPI_COMM_WORLD);
                double neighbor_hi_val;
                MPI_Recv(&neighbor_hi_val, 1, MPI_DOUBLE, my_rank - 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                a[0] = (neighbor_hi_val + temp[0] + temp[1]) / 3.0;
            }  // ends if not first rank
            if (my_rank < n_ranks - 1) {
                MPI_Send(&temp[arr_sz - 1], 1, MPI_DOUBLE, my_rank + 1, 0,
                         MPI_COMM_WORLD);
                double neighbor_lo_val;
                MPI_Recv(&neighbor_lo_val, 1, MPI_DOUBLE, my_rank + 1, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                a[arr_sz - 1] =
                    (temp[arr_sz - 2] + temp[arr_sz - 1] + neighbor_lo_val) /
                    3.0;
            }  // ends if not last rank
        }  // ends for all iterations
    }

N.B. See tests/jacobi_with_tasks/baseline for the runnable version of this code.

    #include "pure.h"

    void rand_stencil_pure(double* const a, size_t arr_sz, size_t iters,
                           int my_rank, int n_ranks) {
        double temp[arr_sz];
        PureTask rand_work_task = [a, temp, arr_sz,
                                   my_rank](chunk_id_t start_chunk,
                                            chunk_id_t end_chunk,
                                            std::optional<void*> cont_params) {
            auto [min_idx, max_idx] =
                pure_aligned_idx_range<double>(arr_sz, start_chunk, end_chunk);
            for (auto i = min_idx; i <= max_idx; ++i) {
                temp[i] = random_work(a[i]);
            }
        };  // ends defining the Pure Task rand_work_task
        for (auto it = 0; it < iters; ++it) {
            rand_work_task.execute();  // execute all chunks of rand_work_task
            for (auto i = 1; i < arr_sz - 1; ++i) {
                a[i] = (temp[i - 1] + temp[i] + temp[i + 1]) / 3.0;
            }
            if (my_rank > 0) {
                pure_send_msg(&temp[0], 1, PURE_DOUBLE, my_rank - 1, 0,
                              PURE_COMM_WORLD);
                double neighbor_hi_val;
                pure_recv_msg(&neighbor_hi_val, 1, PURE_DOUBLE, my_rank - 1, 0,
                              PURE_COMM_WORLD);
                a[0] = (neighbor_hi_val + temp[0] + temp[1]) / 3.0;
            }  // ends if not first rank
            if (my_rank < n_ranks - 1) {
                pure_send_msg(&temp[arr_sz - 1], 1, PURE_DOUBLE, my_rank + 1, 0,
                              PURE_COMM_WORLD);
                double neighbor_lo_val;
                pure_recv_msg(&neighbor_lo_val, 1, PURE_DOUBLE, my_rank + 1, 0,
                              PURE_COMM_WORLD);
                a[arr_sz - 1] =
                    (temp[arr_sz - 2] + temp[arr_sz - 1] + neighbor_lo_val) /
                    3.0;
            }  // ends if not last rank
        }  // ends for all iterations
    }

N.B. See tests/jacobi_with_tasks/pure for the runnable version of this code.
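The chunk mechanics in the Pure example above can be illustrated with a small standalone sketch. It assumes that a task's index space is split into roughly equal chunks and that threads claim chunks from a shared atomic counter, so an otherwise idle thread can run chunks the owner has not started yet. The names IdxRange, chunk_idx_range, ChunkedTask, and visit_counts are hypothetical, the sketch ignores whatever alignment pure_aligned_idx_range performs, and it is not a description of how the Pure runtime is implemented.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Inclusive index range, mirroring the [min_idx, max_idx] loop bounds above.
struct IdxRange {
    std::size_t min_idx;
    std::size_t max_idx;
};

// Hypothetical helper: maps chunks [start_chunk, end_chunk] of an n-element
// array split into num_chunks pieces onto an inclusive index range.
IdxRange chunk_idx_range(std::size_t n, std::size_t num_chunks,
                         std::size_t start_chunk, std::size_t end_chunk) {
    assert(num_chunks > 0 && n >= num_chunks);  // assumption: no empty chunks
    assert(start_chunk <= end_chunk && end_chunk < num_chunks);
    const std::size_t lo = start_chunk * n / num_chunks;
    const std::size_t hi = (end_chunk + 1) * n / num_chunks;  // exclusive
    return {lo, hi - 1};
}

// Hypothetical task whose chunks can be claimed by any thread.
class ChunkedTask {
  public:
    using Body = std::function<void(std::size_t, std::size_t)>;

    ChunkedTask(std::size_t n, std::size_t num_chunks, Body body)
        : n_(n), num_chunks_(num_chunks), body_(std::move(body)) {}

    // Claims the next unclaimed chunk and runs it; returns false when no
    // chunks remain. fetch_add hands each chunk id to exactly one caller.
    bool run_one_chunk() {
        const std::size_t c =
            next_chunk_.fetch_add(1, std::memory_order_relaxed);
        if (c >= num_chunks_) {
            return false;
        }
        const IdxRange r = chunk_idx_range(n_, num_chunks_, c, c);
        body_(r.min_idx, r.max_idx);
        return true;
    }

  private:
    std::size_t n_;
    std::size_t num_chunks_;
    Body body_;
    std::atomic<std::size_t> next_chunk_{0};
};

// Runs every chunk once and reports how often each element was visited.
std::vector<int> visit_counts(std::size_t n, std::size_t num_chunks) {
    std::vector<int> hits(n, 0);
    ChunkedTask task(n, num_chunks, [&hits](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i <= hi; ++i) {
            ++hits[i];
        }
    });
    while (task.run_one_chunk()) {
    }
    return hits;
}
```

Here a single thread drains the task, but the owner and any helper threads could call run_one_chunk concurrently, since chunks cover disjoint index ranges. A real runtime would also have to track when every claimed chunk has finished before the task is considered complete; this sketch leaves that out.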
This repository is organized into the following structure:
- src: Source code for the Pure runtime system
  - runtime: Key runtime system source; key files include PureProcess, PureThread, and PurePipeline.
  - transport: Key messaging and collective implementations
  - support: Helpers, debugging, and benchmarking infrastructure
  - Makefile: Build infrastructure for libpure
- include: Header files for the Pure runtime system and applications
- test: Several complete Pure applications and their baseline (MPI) analogs
- support: Miscellaneous tools for building, analyzing, and debugging Pure applications
  - runtime: Pure runtime tools, including purify_all.rb, which is an MPI-to-Pure source-level translator
  - Makefile_includes: Defines most of the Pure build system; see variables for relevant configuration options.
  - misc: Various Pure tools (perf-based profiling tools, clang-based sanitizer tools, debugging and profiling visualization tools, etc.)
  - experiments: Pure experiment infrastructure for running many (hundreds, thousands) of jobs via SLURM in parallel and combining and analyzing the results. Includes an optional web-based results reporting system.
    - benchmark_helpers.rb: Defines a DSL for Pure experiments and is the main driver of the Pure experiment framework
    - machine_helpers.rb: Tooling for configuring Pure to specific machines and system software
    - combiner_helpers.rb: Tools to combine results from independently-run experiments
  - R_helpers/: Data analysis helpers for the Pure experiment data collection tools
- lib: Auto-generated directory where libpure is stored
- build: Auto-generated directory where object files are stored
Pure is mostly implemented as a native C++17 library that is compatible with most systems. Pure does require a small number of dependencies, which must be installed on your system before Pure can be compiled and installed.
The foundational elements of Pure leverage shared memory to achieve improved performance relative to MPI; all of these foundational elements are written in native C++17. Pure does leverage a few external libraries: MPI for cross-node message-passing communication, and libjson and Boost for non-performance-critical logging and debugging infrastructure.
Before proceeding, please ensure that the following are installed on your system:
- Any C++17-compatible compiler. Pure has been tested using gcc, clang, and Intel C compiler on Linux and OSX.
- MPI (2.0 or greater). Pure uses MPI by default for communication between nodes. The use of MPI, however, is transparent to the Pure application programmer. Pure has been tested with MPICH and Intel MPI, although any compatible implementation should work.
- Ruby is used as the scripting language for many Pure tools. Many Ruby Gems must be installed (you will be prompted to install them as you run various tools).
- R is used to analyze experimental data and generate performance results (optional)
- Boost: boost::hash is used in non-performance-critical code to help manage memory deallocations; requires version 1.63+.
- jsoncpp to generate rank-level stats files
- jemalloc (optional) if you would like to use jemalloc instead of the standard allocator.
- Set the CPL environment variable to your Pure root (i.e., the directory containing this file). For example, if you use Bash, put the following line in your .bash_profile or .bashrc: export CPL=path/to/pure
- Install jsoncpp, which we use to create JSON-based statistics for profiling purposes. Ideally, just install jsoncpp using your favorite package manager (e.g., apt-get or brew). You may also use the included build_jsoncpp script: ./$CPL/support/misc/build_jsoncpp.sh && cd $CPL/src/3rd_party/jsoncpp && python amalgamate.py
- Install Boost and update BOOST_INCLUDE and BOOST_LIB_DIR in support/Makefile_includes/Makefile.misk.mk. Note: Boost 1.63 has been tested, and newer versions should work.
- You will likely have to fix some Makefile variables in support/Makefile_includes/Makefile.misk.mk, and possibly in other files, to ensure that important variables are set correctly for your system. These include cc, CC, MPICH_PATH, MPIRUN, CFLAGS, CXXFLAGS, LFLAGS, and NPROC (used to get the number of processors on your system, which is system-dependent). We recommend creating a new ifeq section in Makefile.misc.mk to select for your system (we use the OS environment variable, but feel free to select on a different unique system specifier).
- Add #include "pure.h" to the C++ source files that make calls to the Pure runtime.
- Build your Pure applications using the provided Make-based Pure build system. Generally, configure your application by setting the Make variables listed in test/Makefile.include.mk that you wish to change from the default. Then, include ../../Makefile.include.mk at the bottom of your application Makefile. When you use the provided Pure build infrastructure, which we highly recommend, libpure will automatically be built and linked into your application executable. Note that you can choose whether you prefer a static or dynamic libpure using the LINK_TYPE application Makefile variable (LINK_TYPE = static or LINK_TYPE = dynamic). The build system also adds the necessary header file search paths. See the example programs in test/*/pure/.
- After you configure your application Makefile, run make to build your code and make run to run your application.
- N.B. Pure's build system includes an extensive set of build targets to help build, run, debug, and profile your applications. You can browse the targets in test/Makefile.include.mk and support/Makefile_includes/*.mk.
This distribution also includes infrastructure to build and profile non-Pure applications. This is useful as it allows you to create "baseline" applications to compare against Pure and use the same profiling infrastructure to time and compare baseline applications to Pure applications.
To compile non-Pure applications, the application developer needs to:
- Add #include "pure_application_helpers.h" to the C++ source files that make calls to the helpers provided by Pure. Note that these helpers do not provide Pure runtime functionality but rather general application helpers related to benchmarking, profiling, and writing clean code.
- Build your non-Pure applications using the provided Make-based build system. Generally, configure your application by setting the Make variables listed in support/Makefile_includes/Makefile.nonpure.mk that you wish to change from the default. Then, include $(CPL)/support/Makefile_includes/Makefile.nonpure.mk at the bottom of your application Makefile. See the example programs in test/*/baseline/.
- Run make to build your code and make run to run your application.
- N.B. Pure's build system includes an extensive set of compile-time options and build targets. The options modify the behavior of the Pure runtime, and the targets help you build, run, debug, and profile your applications. You can browse both in support/Makefile_includes/*.mk.
You can find simple Pure programs in the test directory. We have additional programs that we are in the process of adding to this repository.
All options for the Pure runtime system are controlled using compile-time flags, which are typically specified in the application Makefile. Most of these variables have reasonable defaults in tests/Makefile.include.mk, but you can override them to test out different options. The Pure library and the application are built and stored in a directory whose name encodes the state of all of these configuration options as a SHA1 hash, so your system can hold pre-built copies of libpure and application binaries for different configurations at the same time. Many of the general options are available for both Pure and "non-Pure" (i.e., MPI) applications to ensure that the same basic options are used when comparing performance.
Your application Makefile must specify the following:
- TOTAL_THREADS: The total number of ranks in your application (possibly spread out over multiple machines). [Type: integer]
- RUN_ARGS: Command-line arguments passed to your application (i.e., readable using argv)
- ENABLE_HYPERTHREADS: Specifies whether or not to run ranks on logical cores (aka "HyperThreads"). [Type: 0 or 1]
- PURE_USER_CODE_SRCS: Space-separated list of C or C++ source files that are run through the MPI-to-Pure source translator. For example, source_file.cpp will be rewritten and saved as source_file.purified.cpp. [Type: text]
- NON_PURIFIED_SOURCES: Space-separated list of C or C++ source files that are not run through the MPI-to-Pure source translator. These files should make any calls to the Pure runtime explicitly. [Type: text]
- BIN_NAME: The name of your binary file. [Type: text]
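As an illustration of the required variables above, an application Makefile might look roughly like the fragment below. The variable names and the include line come from this document; every value (rank count, arguments, file names, binary name) is a made-up placeholder, and the relative include path assumes the application lives two directories below test/Makefile.include.mk.

```make
# Hypothetical application Makefile; all values are placeholders.
TOTAL_THREADS = 8
RUN_ARGS = 1024 100
ENABLE_HYPERTHREADS = 0

# Translated by the MPI-to-Pure source translator.
PURE_USER_CODE_SRCS = stencil.cpp
# Compiled as-is; must call the Pure runtime explicitly.
NON_PURIFIED_SOURCES = main.cpp

BIN_NAME = stencil_demo

# Optional override of a default (static or dynamic).
LINK_TYPE = static

# Must come last so the settings above are visible to the build system.
include ../../Makefile.include.mk
```

With a Makefile along these lines in place, make builds libpure and the application, and make run launches it.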
- PURE_NUM_PROCS: Manually specifies the number of processes running across your entire system. Use AUTO to have the system choose a reasonable default based on other settings. Default: AUTO. [Type: integer if not AUTO]
- PURE_RT_NUM_THREADS: Manually specifies the number of threads (and therefore ranks) to run per Pure process. Default: AUTO. [Type: integer if not AUTO]
- THREADS_PER_NODE_LIMIT: When PURE_NUM_PROCS and PURE_RT_NUM_THREADS are set to AUTO, limits the number of ranks (threads) on a node to the set amount instead of the number of cores on that node (real or virtual, depending on the value of ENABLE_HYPERTHREADS).
- PROCESS_CHANNEL_VERSION: Specifies the version of the intra-process message data structure that is used. We recommend using version 40 when not using Pure Tasks and version 460 to enable work stealing when using Pure Tasks. There are a number of other versions; see support/Makefile_includes/determine_preprocessor_vars.rb for details. Version 411 is a useful mode to test that Pure Tasks are producing the correct value when work stealing is disabled. [Type: integer]
- PCV_4_NUM_CONT_CHUNKS: The number of "chunks" a Pure Task is broken into, which is relevant when work stealing is enabled. [Type: 0 or 1]
- DEBUG: Builds runtime and application in debug mode. Includes many runtime error checks, and builds with debugging symbols (-g) and no compiler optimization (-O0, -fno-omit-frame-pointer, etc.). [Type: 0 or 1]
- PROFILE: Builds runtime and application with debugging symbols but also with compiler optimizations (-O3). Useful for performance profiling. [Type: 0 or 1]
- RELEASE: Builds runtime and application with no debugging symbols and with all compiler optimizations (-O3, -march=native, etc.). Most likely to provide optimal runtime performance. [Type: 0 or 1]
- DISABLE_PURIFICATION: Disables the MPI-to-Pure source code translator. Default: 0. [Options: 0 or 1]
- PAUSE_FOR_DEBUGGER_ATTACH: Pauses the application upon startup, before the application code runs, to give you a chance to attach your debugger to a Pure process. It prints out the pid of up to four Pure processes.
- ASAN: Enables the -fsanitize=address compiler flag, which compiles and runs your application with Address Sanitizer enabled. Useful for finding memory leaks, double frees, out-of-bounds accesses, use-after-frees, etc.
- TSAN: Enables the -fsanitize=thread compiler flag, which compiles and runs your application with Thread Sanitizer enabled. Useful for finding data races in your application (including the Pure runtime).
- MSAN: Enables the -fsanitize=memory compiler flag, which compiles and runs your application with Memory Sanitizer enabled. Useful for detecting uninitialized reads.
- UBSAN: Enables the -fsanitize=undefined compiler flag, which compiles and runs your application with Undefined Behavior Sanitizer enabled. Useful for detecting undefined behavior such as dereferencing misaligned or null pointers or signed integer overflow.
- COLLECT_THREAD_TIMELINE_DETAIL: Runs the application while collecting cycle-based trace timepoints, which can then be fed to the Pure Timeline profiler. Default: 0. [Options: 0 or 1]
- DO_PRINT_CONT_DEBUG_INFO: Prints out Pure Task debugging logs, which specify which Pure ranks execute which chunks of a particular Pure Task. Useful for debugging and for getting a sense of the distribution of chunk execution.
- NUMA_SCHEME: Specifies the manner in which ranks are laid out and pinned on the cores of a system. Specific to each system you run on. See support/Makefile_includes/Makefile.cpu_config.mk. Examples: bind_sequence, bind_alternating, none, etc.
- USE_JEMALLOC: Uses the jemalloc memory allocator instead of the default allocator.
- USER_CFLAGS: Additional compiler flags to add when compiling C source files.
- USER_CXXFLAGS: Additional compiler flags to add when compiling C++ source files.
- USER_LFLAGS: Additional linker flags to add when linking.
- LINK_TYPE: Determines if libpure is built as a static or dynamic library. Default: dynamic. [Options: static or dynamic]
- PROCESS_CHANNEL_BUFFERED_MSG_SIZE: Number of usable entries in the lock-free circular buffer that is used for intra-node messaging ("Process Channels").
- BUFFERED_CHAN_MAX_PAYLOAD_BYTES: Threshold, in bytes, between using "buffered" and "rendezvous" style point-to-point messaging. If the message size is equal to or less than BUFFERED_CHAN_MAX_PAYLOAD_BYTES, Pure will use the buffered (i.e., copy twice) approach. Default: 8192. [Type: integer]
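The threshold rule for BUFFERED_CHAN_MAX_PAYLOAD_BYTES can be written down as a one-line decision. The sketch below only restates the rule described above; the enum SendMode, the constant name, and the function choose_send_mode are hypothetical and are not part of the Pure API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for the two point-to-point messaging styles.
enum class SendMode { Buffered, Rendezvous };

// Default value taken from the option description above.
constexpr std::size_t kBufferedChanMaxPayloadBytes = 8192;

// Messages at or below the threshold take the buffered (copy twice) path;
// anything larger takes the rendezvous path.
constexpr SendMode choose_send_mode(
    std::size_t payload_bytes,
    std::size_t max_buffered_bytes = kBufferedChanMaxPayloadBytes) {
    return payload_bytes <= max_buffered_bytes ? SendMode::Buffered
                                               : SendMode::Rendezvous;
}
```

For example, with the default threshold a message of 1,024 doubles (8,192 bytes) would still be buffered, while one more double pushes the message onto the rendezvous path.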
The Pure build system comes with many make-driven tools to help debug, profile, and run Pure applications. See below for some of the most useful targets. Run these commands from the application directory (where the application's Makefile is).
- make: Default target; builds the application, including libpure
- make run: Builds and runs the application
- make vars: Prints out the current configuration of key build parameters
- make clean: Deletes object files, libraries (i.e., libpure), and application executables
- make clean_test: Deletes application object files and executables
- make gdb: Loads the application in gdb. Tip: define commands to be run when gdb first loads by defining USER_GDB_COMMANDS in your application Makefile.
- make gdb-run: Like the gdb target, but immediately runs the program in gdb.
- make lldb: Runs the application in lldb.
- make valgrind: Checks for memory leaks with valgrind memcheck. Note: We recommend running with ASAN=1 in your Makefile instead of this.
- make massif: Profiles the heap using valgrind massif. Also see other related targets: massif-stack, ms_print, ms_print_stack, ms_print_totals
- make profile: Performs performance counter-based profiling of the application using Linux perf. By default, measures cycles (cycles:ppp), overridable with the DEFAULT_PERF_EVENTS environment variable. Run make profile-report to see the results of the profile.
- make profile-report: Views the perf-based results of a profile collected with make profile
- make flamegraph: Generates a Flamegraph for your application using perf. Defaults to visualizing cycles.
- make thread-timeline: Runs Pure's thread-timeline tool, which visualizes application and runtime event durations in an interactive web-based interface.
- make profile-stat: Views performance counter stats (using perf stat)
- make profile-c2c: Runs the perf-c2c cacheline contention analyzer on your Pure application.
- make libpure: Builds libpure only
- make tidy: Runs clang-tidy on the codebase
- make purify-all: Runs the MPI-to-Pure source-level translator on Pure application code. The common build commands above run it automatically, so it is usually not run by itself.
- make ranks-on-topo: Creates a PDF showing the Pure ranks on top of the CPU topology. Useful if you are using custom rank layouts and want to make sure your rank layout is as you intended.
- make bloaty: Profiles the binary size using Google Bloaty
- make list-targets: Lists the possible targets of the Pure build system. Note: the list in this document is probably more helpful, as it includes a description of each target.
PPoPP'24 "Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes"