An optimized C library for math, parallel processing and data movement

PAL: The Parallel Architectures Library


The Parallel Architectures Library (PAL) is a compact C library with optimized routines for math, synchronization, and inter-processor communication.


  1. Why?

  2. Design goals

  3. License

  4. Contribution Wanted!

  5. A Simple Example

  6. Build Instructions

  7. Library API reference
    7.0 Syntax
    7.1 Program Flow
    7.2 Data Movement
    7.3 Synchronization
    7.4 Basic Math
    7.5 Basic DSP
    7.6 Image Processing
    7.7 FFT (FFTW)
    7.8 Linear Algebra (BLAS)
    7.9 System Calls

  8. Status Report

  9. Benchmarking

##Why?

Any sane and informed person knows that the future of computing is massively parallel. Unfortunately, the energy needed to escape the current "von Neumann potential well" seems to be approaching infinity. The legacy programming stack is so effective and so easy to use that developers and companies simply cannot afford to choose the better (parallel) solution. To make parallel computing ubiquitous, our only choice is to rewrite the whole software stack from scratch, including algorithms, run-times, libraries, and applications. The goal of the Parallel Architectures Library project is to establish the lowest layer of this brave new programming stack.

##Design Goals

  • Fast (Super fast but no "belt AND suspenders")
  • Compact (Small enough to work for memory limited processors with <32KB RAM)
  • Scalable (Thread and data scalable)
  • Portable (Portable across different ISAs and systems)
  • Permissive (Apache 2.0 license to maximize industry adoption)

##License

The PAL source code is licensed under the Apache License, Version 2.0. See LICENSE for full license text unless otherwise specified.

##Contribution

Our goal is to make PAL a broad community project from day one. If just 100 people contribute one function each, we'll be done in a couple of days! If you know C, you are ready to contribute!

Instructions for contributing can be found HERE.

##Build Instructions

###Install Prerequisites

$ sudo apt-get install libtool build-essential pkg-config autoconf automake doxygen

###Build Sequence

$ ./bootstrap
$ ./configure
$ make


To run the automated unit tests you need to run

$ make check

##A Simple Example

The following sample shows how to use PAL to launch a simple task on a remote processor within the system. The program flow should be familiar to anyone who has used accelerator programming frameworks.

Manager Code

#include <pal.h>
#include <stdio.h>
#define N 16
int main(int argc, char *argv[])
{
    // Stack variables
    char *file = "./hello_task.elf";
    char *func = "main";
    int status, i, all, nargs = 1;
    char *args[nargs];
    char argbuf[20];

    // References as opaque structures
    p_dev_t dev0;
    p_prog_t prog0;
    p_team_t team0;
    p_mem_t mem[4];

    // Execution setup
    dev0 = p_init(P_DEV_DEMO, 0);        // initialize device and team
    prog0 = p_load(dev0, file, func, 0); // load a program from file system
    all = p_query(dev0, P_PROP_NODES);   // find number of nodes in system
    team0 = p_open(dev0, 0, all);        // create a team

    // Running program
    for (i = 0; i < all; i++) {
        sprintf(argbuf, "%d", i); // string args needed to run main as is
        args[0] = argbuf;
        status = p_run(prog0, team0, i, 1, nargs, args, 0);
    }
    p_wait(team0);    // not needed
    p_close(team0);   // close team
    p_finalize(dev0); // finalize memory

    return 0;
}

Worker Code (hello_task.elf)

#include <stdio.h>
int main(int argc, char *argv[])
{
    int pid = 0; // placeholder index; this sample does not read argv
    printf("--Processor %d says hello!--\n", pid);
    return 0;
}
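The worker above prints processor 0 every time because pid is hard-coded. Since the manager formats each node index as a string before calling p_run(), a worker could recover its index roughly as sketched here. The helper name `parse_pid` is hypothetical and not part of PAL, and the sketch assumes the index arrives as the last entry of argv; how PAL maps the manager's args onto the worker's argv is not stated in this document.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of PAL): recover the processor index
 * from the string argument that the manager formatted with sprintf().
 * Assumption: the index is the last entry of argv.
 * Returns -1 if there is no argument or it is not a valid
 * non-negative decimal integer. */
int parse_pid(int argc, char *argv[])
{
    char *end;
    long val;

    if (argc < 1 || argv == NULL || argv[argc - 1] == NULL)
        return -1;

    errno = 0;
    val = strtol(argv[argc - 1], &end, 10);

    /* Reject overflow, empty input, and trailing garbage. */
    if (errno != 0 || end == argv[argc - 1] || *end != '\0')
        return -1;
    if (val < 0 || val > INT_MAX)
        return -1;

    return (int)val;
}
```

In the worker, `int pid = parse_pid(argc, argv);` would then replace the hard-coded `int pid = 0;`, with a negative result treated as "index unknown".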



These program flow functions are used to manage the system and to execute programs. All PAL objects are referenced via handles (opaque objects).

p_init() initialize the run time
p_query() query a device object
p_load() load binary elf file into memory
p_run() run a program on a team of processors
p_open() open a team of processors
p_append() add members to team
p_remove() remove members from team
p_close() close a team of processors
p_barrier() team barrier
p_wait() wait for team to finish
p_fence() memory fence
p_finalize() cleans up run time
p_error() get error code (if any).
p_mem_error() get error code for a memory object (if any).

These functions are used for creating memory objects. The functions return a unique PAL handle for each new memory object. This handle can then be used by functions like p_read() and p_write() to access data within the memory object.

p_malloc() allocate memory on local processor
p_rmalloc() allocate memory on remote processor
p_free() free memory

The data movement functions move blocks of data between opaque memory objects and locations specified by pointers. The memory object is specified by a PAL handle returned by a previous API call. The exception is the p_memcpy function which copies blocks of bytes within a shared memory architecture only.

p_gather() gather operation
p_memcpy() fast memcpy()
p_read() read from a memory object
p_scatter() scatter operation
p_write() write to a memory object
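To make the handle-based model concrete, here is a minimal host-only sketch of reading and writing through an opaque memory object, plus scatter and gather built on top of it. Everything named `sketch_*` is hypothetical: the struct is not PAL's p_mem_t and the signatures are assumptions, since this document does not list the real prototypes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an opaque memory object. Callers pass the
 * handle around and never dereference `base` themselves. */
typedef struct {
    unsigned char *base;
    size_t size;
} sketch_mem_t;

/* Allocate a zero-filled object; base is NULL (size 0) on failure. */
sketch_mem_t sketch_malloc(size_t size)
{
    sketch_mem_t m = { calloc(size, 1), size };
    if (m.base == NULL)
        m.size = 0;
    return m;
}

void sketch_free(sketch_mem_t *m)
{
    free(m->base);
    m->base = NULL;
    m->size = 0;
}

/* Copy nb bytes from src into the object at byte offset off.
 * Returns 0 on success, -1 if the range falls outside the object. */
int sketch_write(sketch_mem_t *m, const void *src, size_t nb, size_t off)
{
    assert(m != NULL && src != NULL);
    if (off > m->size || nb > m->size - off)
        return -1;
    if (nb > 0)
        memcpy(m->base + off, src, nb);
    return 0;
}

/* Copy nb bytes out of the object, starting at byte offset off. */
int sketch_read(const sketch_mem_t *m, void *dst, size_t nb, size_t off)
{
    assert(m != NULL && dst != NULL);
    if (off > m->size || nb > m->size - off)
        return -1;
    if (nb > 0)
        memcpy(dst, m->base + off, nb);
    return 0;
}

/* Scatter: chunk i of src (chunk bytes each) goes to object dst[i]. */
int sketch_scatter(sketch_mem_t dst[], size_t n, const void *src, size_t chunk)
{
    const unsigned char *p = src;
    for (size_t i = 0; i < n; i++)
        if (sketch_write(&dst[i], p + i * chunk, chunk, 0) != 0)
            return -1;
    return 0;
}

/* Gather: the inverse, collecting one chunk from each object into dst. */
int sketch_gather(const sketch_mem_t src[], size_t n, void *dst, size_t chunk)
{
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
        if (sketch_read(&src[i], p + i * chunk, chunk, 0) != 0)
            return -1;
    return 0;
}
```

The bounds checks live inside the read and write calls, which is one reason to route all access through a handle rather than a raw pointer: the object knows its own size.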

The synchronization functions are useful for program sequencing and resource locking in shared memory systems.

p_mutex_lock() lock a mutex
p_mutex_trylock() try locking a mutex once
p_mutex_unlock() unlock (clear) a mutex
p_mutex_init() initialize a mutex
p_atomic_add() atomic fetch and add
p_atomic_sub() atomic fetch and sub
p_atomic_and() atomic fetch and 'and'
p_atomic_xor() atomic fetch and 'xor'
p_atomic_or() atomic fetch and 'or'
p_atomic_swap() atomic exchange
p_atomic_compswap() atomic compare and exchange
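For readers unfamiliar with these primitives, the semantics can be sketched with standard C11 atomics. This is not PAL's implementation and the `sketch_*` names and signatures are hypothetical; it only illustrates "fetch and add" (return the old value, then update), compare and exchange, and a try-once versus spinning mutex lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Fetch and add: atomically add n and return the value seen BEFORE the add. */
int sketch_atomic_add(atomic_int *v, int n)
{
    assert(v != NULL);
    return atomic_fetch_add(v, n);
}

/* Compare and exchange: store `desired` only if *v equals `expected`.
 * Returns the value observed in *v, so the caller can tell whether the
 * swap happened (return value == expected) or not. */
int sketch_atomic_compswap(atomic_int *v, int expected, int desired)
{
    assert(v != NULL);
    /* On failure, C11 writes the current value back into `expected`. */
    atomic_compare_exchange_strong(v, &expected, desired);
    return expected;
}

/* A minimal spin mutex built on atomic_flag. */
typedef struct {
    atomic_flag held;
} sketch_mutex_t;

void sketch_mutex_init(sketch_mutex_t *m)
{
    atomic_flag_clear(&m->held);
}

/* Try once: true if the lock was acquired, false if already held. */
bool sketch_mutex_trylock(sketch_mutex_t *m)
{
    return !atomic_flag_test_and_set(&m->held);
}

/* Spin until the lock is acquired. */
void sketch_mutex_lock(sketch_mutex_t *m)
{
    while (!sketch_mutex_trylock(m)) {
        /* busy-wait */
    }
}

void sketch_mutex_unlock(sketch_mutex_t *m)
{
    atomic_flag_clear(&m->held);
}
```

A spin lock is used here only because it is the shortest correct illustration; whether PAL's mutexes spin or block is not specified in this document.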

The math functions replace the traditional math lib functions and extend them to include support for data as well as task parallelism.

p_abs() absolute value
p_absdiff() absolute difference
p_add() add
p_acos() arc cosine
p_acosh() arc hyperbolic cosine
p_asin() arc sine
p_asinh() arc hyperbolic sine
p_cbrt() cube root
p_cos() cosine
p_cosh() hyperbolic cosine
p_div() division
p_dot() dot product
p_exp() exponential
p_ftoi() float to integer conversion
p_itof() integer to float conversion
p_inv() inverse
p_invcbrt() inverse cube root
p_invsqrt() inverse square root
p_ln() natural log
p_log10() denary log
p_max() finds max val
p_min() finds min val
p_mean() mean operation
p_median() finds middle value
p_mode() finds most common value
p_mul() multiplication
p_popcount() count the number of bits set
p_pow() element raised to a power
p_rand() random number generator
p_randinit() init random number generator
p_sort() heap sort
p_sin() sine
p_sinh() hyperbolic sine
p_sqrt() square root
p_stddev() calculates standard deviation
p_sub() subtract
p_sum() sum of all vector elements
p_sumsq() sum of all squared elements
p_tan() tangent
p_tanh() hyperbolic tangent

The digital signal processing (DSP) functions follow the same convention as the math function set.

p_acorr() autocorrelation (r[j] = sum ( x[j+k] * x[k] ), k=0..(n-j-1))
p_conv() convolution: r[j] = sum ( h[k] * x[j-k]), k=0..(nh-1)
p_xcorr() correlation: r[j] = sum ( x[j+k] * y[k]), k=0..(nx+ny-1)
p_fir() FIR filter direct form: r[j] = sum ( h[k] * x [j-k]), k=0..(nh-1)
p_firdec() FIR filter with decimation: r[j] = sum ( h[k] * x [j*D-k]), k=0..(nh-1)
p_firint() FIR filter with interpolation: r[j] = sum ( h[k] * x [j*D-k]), k=0..(nh-1)
p_firsym() FIR symmetric form
p_iir() IIR filter

The image processing functions follow the same convention as the math function set.

p_box3x3() box filter (3x3)
p_conv2d() 2d convolution
p_gauss3x3() gaussian blur filter (3x3)
p_median3x3() median filter (3x3)
p_laplace3x3() laplace filter (3x3)
p_prewitt3x3() prewitt filter (3x3)
p_sad8x8() sum of absolute differences (8x8)
p_sad16x16() sum of absolute differences (16x16)
p_sobel3x3() sobel filter (3x3)
p_scharr3x3() scharr filter (3x3)
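As an illustration of what the 3x3 filters and the SAD routines compute, here is an unoptimized reference sketch of a box filter and a sum of absolute differences. The `ref_*` names, the row-major float image layout, and the choice to produce only the (rows-2) x (cols-2) interior (no border padding) are all assumptions made for this sketch, not PAL's documented behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 3x3 box filter: each output pixel is the mean of the 3x3 window whose
 * top-left corner is at (i, j) in the input. Input is rows x cols,
 * row-major; output is (rows-2) x (cols-2), row-major. */
void ref_box3x3_f32(const float *x, float *r, size_t rows, size_t cols)
{
    assert(rows >= 3 && cols >= 3);
    size_t orows = rows - 2;
    size_t ocols = cols - 2;

    for (size_t i = 0; i < orows; i++) {
        for (size_t j = 0; j < ocols; j++) {
            float acc = 0.0f;
            for (size_t di = 0; di < 3; di++)
                for (size_t dj = 0; dj < 3; dj++)
                    acc += x[(i + di) * cols + (j + dj)];
            r[i * ocols + j] = acc / 9.0f;
        }
    }
}

/* Sum of absolute differences between two equally sized 8-bit blocks
 * of n pixels (n = 64 for an 8x8 block, 256 for 16x16). */
unsigned ref_sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    return acc;
}
```

The other 3x3 filters in the list differ mainly in the weights applied to the nine window pixels, so the same loop structure applies.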


  • FFT (FFTW): an FFTW-like interface
  • Linear Algebra (BLAS): a port of the BLIS library?
  • System Calls: Bionic libc implementation as a starting point