Test platform:

  • Azure VM | Standard NC24_Promo (24 vcpus, 220 GiB memory)
  • 4 x Tesla K80, which has compute capability 3.7. You can find out what card you got via lshw -C display; you can find out the compute capability of your card here.
  • Linux (Ubuntu 18.04)
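
As a complement to lshw, here is a minimal sketch that counts the NVIDIA devices visible on the PCI bus. It works before the driver from step 2 is installed. It assumes lspci (from the pciutils package) is available; gpu_count is just an illustrative variable name. Any NVIDIA PCI device is counted, so treat the number as a rough check.

```shell
# Count NVIDIA devices on the PCI bus. lspci ships with the pciutils package
# (an assumption about your image); if lspci is missing, the count is simply 0.
gpu_count=$(lspci 2>/dev/null | grep -ci 'nvidia' || true)
echo "NVIDIA PCI devices found: ${gpu_count}"
```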

1. Install prerequisites

build essentials (gcc, make, ...), cmake, libelf.

cd $HOME
sudo apt update
sudo apt install build-essential
sudo snap install cmake --classic
sudo apt install -y libelf-dev libffi-dev
sudo apt install -y pkg-config
sudo apt install libnuma-dev

2. Install CUDA 10.1

(recall: this is for an Ubuntu 18.04 machine)

sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.105-418.39_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-1-local-10.1.105-418.39/
sudo apt update
sudo apt install cuda


rm cuda-repo-ubuntu1804-10-1-local-10.1.105-418.39_1.0-1_amd64.deb
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
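
The two exports above only last for the current shell session. One possible way to make them persistent is sketched below: it appends each line to a startup file unless the exact line is already there. RC_FILE and add_line are hypothetical names introduced for this sketch, and ~/.bashrc is an assumption about your shell setup.

```shell
# Startup file to update; RC_FILE is a hypothetical variable, defaulting to ~/.bashrc.
RC_FILE="${RC_FILE:-$HOME/.bashrc}"

# Append a line to RC_FILE unless an identical line already exists.
add_line() {
    # -x matches whole lines only, -F treats the pattern as a literal string
    grep -qxF "$1" "$RC_FILE" 2>/dev/null || echo "$1" >> "$RC_FILE"
}

# Single quotes keep $PATH unexpanded, so it is evaluated at login time.
add_line 'export PATH=/usr/local/cuda-10.1/bin:$PATH'
add_line 'export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH'
```

Running the snippet twice leaves the file unchanged the second time.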

If everything went right, you should see something like

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

3. Install LLVM 11

git clone
cd llvm-project
git checkout 3d9bb031d13c884122a5da456659e47dd52ec1f7
cd ..

Let's then build the compiler

mkdir build; cd build

Next CMake needs to generate Makefiles which will eventually be used for compilation:

cmake                                                                          \
  -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;libcxx;libcxxabi;lld;openmp" \
  -DCMAKE_BUILD_TYPE=Release                                                   \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX"                                          \
  -DCMAKE_INSTALL_PREFIX=$HOME/llvm                                            \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_37                                      \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,37,50,52,60,61,70,75            \
  -DCMAKE_C_COMPILER=gcc                                                       \
  -DCMAKE_CXX_COMPILER=g++                                                     \
  -DLLVM_ENABLE_BINDINGS=OFF                                                   \
  -G "Unix Makefiles" $HOME/llvm-project/llvm

Now it's finally time to actually compile

make -j 24

Note: here I'm using -j 24 because the test platform has 24 physical cores. This may take a while...
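
If your machine has a different core count, you can let the shell pick the number of jobs instead of hard-coding 24. A minimal sketch, assuming GNU coreutils (which provides nproc) is installed; JOBS is an illustrative variable name:

```shell
# nproc (GNU coreutils) prints the number of processing units available.
JOBS=$(nproc)
echo "Building with ${JOBS} parallel jobs"

# Only start the build where CMake has already generated a Makefile.
if [ -f Makefile ]; then
    make -j "${JOBS}"
fi
```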

Once finished, we have to install it

make -j 24 install


export PATH=$HOME/llvm/bin:$PATH

Let's now rebuild the OpenMP runtime libraries with Clang

cd $HOME ; mkdir build-openmp; cd build-openmp

And then:

cmake                                                                          \
  -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;libcxx;libcxxabi;lld;openmp" \
  -DCMAKE_BUILD_TYPE=Release                                                   \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX"                                          \
  -DCMAKE_INSTALL_PREFIX=$HOME/llvm                                            \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_37                                      \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,37,50,52,60,61,70,75            \
  -DCMAKE_C_COMPILER=clang                                                      \
  -DCMAKE_CXX_COMPILER=clang++                                                  \
  -DLLVM_ENABLE_BINDINGS=OFF                                                   \
  -G "Unix Makefiles" ../llvm-project/llvm

And finally, we actually rebuild and reinstall the OpenMP runtime libraries:

make -j 24
make -j 24 install
cd $HOME
rm -rf llvm-project build build-openmp

If everything went smoothly, you should see something like

$ clang --version
clang version 11.0.0 ( afdb2ef2ed9debd419a29b78c23e4b84ce67ab0c)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/devito/llvm/bin
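
clang --version only tells you that the compiler is installed. The sketch below also checks that the OpenMP host and offloading runtimes landed under the install prefix. The three file names are assumptions about what this LLVM build installs, and check_omp_libs is a hypothetical helper name.

```shell
# Report which OpenMP runtime libraries exist under an install prefix.
# The library names below are assumptions about this particular LLVM build.
check_omp_libs() {
    prefix="${1:-$HOME/llvm}"
    for lib in libomp.so libomptarget.so libomptarget.rtl.cuda.so; do
        if [ -e "$prefix/lib/$lib" ]; then
            echo "found   $lib"
        else
            echo "missing $lib"
        fi
    done
}

check_omp_libs "$HOME/llvm"
```

If a program later fails to load libomp.so or libomptarget.so at run time, adding $HOME/llvm/lib to LD_LIBRARY_PATH is the usual remedy.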

4. Trying OpenMP 5.0 GPU offloading

Let's first install nvtop following the instructions here:

sudo apt install libncurses5-dev
git clone
mkdir -p nvtop/build && cd nvtop/build
cmake ..
sudo make install
cd $HOME
rm -rf nvtop

Now let's test offloading. We are going to use the following toy app:

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <n>\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[1]);

    double* x = (double*)malloc(sizeof(double) * n);
    double* y = (double*)malloc(sizeof(double) * n);

    double idrandmax = 1.0 / RAND_MAX;
    double a = idrandmax * rand();
    for (int i = 0; i < n; i++)
    {
        x[i] = idrandmax * rand();
        y[i] = idrandmax * rand();
    }

    /* Copy x and y to the device, run y += a * x there, copy both back */
    #pragma omp target data map(tofrom: x[0:n],y[0:n])
    {
        #pragma omp target
        #pragma omp for
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    double avg = 0.0, min = y[0], max = y[0];
    for (int i = 0; i < n; i++)
    {
        avg += y[i];
        if (y[i] > max) max = y[i];
        if (y[i] < min) min = y[i];
    }

    printf("min = %f, max = %f, avg = %f\n", min, max, avg / n);
    free(x);
    free(y);
    return 0;
}

Let's save it as omp-offloading.c and compile. Fingers crossed.

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_37 -Wall -O3 omp-offloading.c -o omp-offloading.o

No errors? Good! Some warnings about an old compute capability? That's OK too. The important thing is to see no errors at this point.

And now we run it while keeping nvtop open in another terminal. You should see the GPU utilization spike to 100%!

./omp-offloading.o 10000000
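
A utilization spike in nvtop is indirect evidence. As a more direct check, the sketch below writes a tiny program that asks the OpenMP runtime where a target region executed, using the standard routines omp_get_num_devices() and omp_is_initial_device(). It then runs the program with OMP_TARGET_OFFLOAD=MANDATORY, which OpenMP 5.0 defines as "fail rather than fall back to the host". The file name offload-check.c is made up for this example, and whether the runtime built above honours the environment variable is an assumption.

```shell
# Write a small test program (offload-check.c is a hypothetical file name).
cat > offload-check.c <<'EOF'
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int on_host = 1;

    /* omp_is_initial_device() returns nonzero when executed on the host. */
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }

    printf("devices visible: %d\n", omp_get_num_devices());
    printf("target region ran on the %s\n", on_host ? "host" : "device");
    return 0;
}
EOF

# Same compiler flags as for the toy app; skipped when clang is not on PATH.
if command -v clang >/dev/null 2>&1; then
    clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_37 \
        -Wall -O3 offload-check.c -o offload-check \
        && OMP_TARGET_OFFLOAD=MANDATORY ./offload-check \
        || echo "offload check did not build or run"
fi
```

If offloading works, the second line of output should say the region ran on the device.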

5. Install a CUDA-aware MPI

We will install OpenMPI following the instructions here.

OpenMPI recommends using UCX 1.4 built with GDRcopy for the most up-to-date set of MPI features and for better performance. Let's first install GDRcopy:

cd $HOME
sudo apt install check libsubunit0 libsubunit-dev
git clone
mv gdrcopy gdrcopy_src
mkdir gdrcopy
cd gdrcopy_src
make PREFIX=$HOME/gdrcopy CUDA=/usr/local/cuda-10.1 all install
sudo ./
cd $HOME
rm -rf gdrcopy_src

and add it to the system's PATH:

export PATH=$HOME/gdrcopy/bin:$PATH

You may want to check the GDRcopy installation by running the programs sanity, copybw, and copylat as shown here.

Next, let's install UCX as shown here:

cd $HOME
tar xzf ucx-1.8.0.tar.gz
cd ucx-1.8.0
./contrib/configure-release --prefix=$HOME/ucx --with-cuda=/usr/local/cuda-10.1 --with-gdrcopy=$HOME/gdrcopy
make -j 24 install
cd ..
rm -rf ucx-1.8.0 ucx-1.8.0.tar.gz

and then

export PATH=$HOME/ucx/bin:$PATH

If everything was ok, you should see something like this

$ ucx_info -v
# UCT version=1.8.0 revision c30b7da
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/devito/ucx --with-cuda=/usr/local/cuda-10.1 --with-gdrcopy=/home/devito/gdrcopy
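
The version banner shows the configure flags, but not whether UCX actually detected the GPUs. As a rough sketch, you can list the devices and transports UCX found and filter for the CUDA and GDRcopy ones. check_ucx_cuda is a hypothetical helper name, and the exact transport names printed depend on your UCX build.

```shell
# List UCX transports related to CUDA or GDRcopy, if ucx_info is installed.
check_ucx_cuda() {
    if ! command -v ucx_info >/dev/null 2>&1; then
        echo "ucx_info not found on PATH"
        return 0
    fi
    # -d prints the devices and transports UCX detected on this machine.
    ucx_info -d | grep -i -E 'cuda|gdr' || echo "no CUDA or GDRcopy transports listed"
}

check_ucx_cuda
```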

Finally, let's install OpenMPI

cd $HOME
tar xzf openmpi-4.0.4.tar.gz
cd openmpi-4.0.4/
./configure --prefix=$HOME/openmpi --with-cuda=/usr/local/cuda-10.1 --with-ucx=$HOME/ucx
make -j 24 install
cd $HOME
rm -rf openmpi-4.0.4 openmpi-4.0.4.tar.gz

and then

export PATH=$HOME/openmpi/bin:$PATH

If everything was ok, you should see something like this

$ mpirun --version
mpirun (Open MPI) 4.0.4

Report bugs to
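
mpirun --version does not say whether CUDA support was compiled in. A minimal sketch of a check, using the ompi_info query suggested in the Open MPI documentation; check_mpi_cuda is a hypothetical helper name. A line ending in :true means the library was built with CUDA support.

```shell
# Ask Open MPI whether it was built with CUDA support.
check_mpi_cuda() {
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info not found on PATH"
        return 0
    fi
    ompi_info --parsable --all | grep 'mpi_built_with_cuda_support:value' \
        || echo "no CUDA support flag reported"
}

check_mpi_cuda
```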

6. Running a Devito example on multiple GPUs

Install pip3 and mpi4py

sudo apt install python3-pip
pip3 install mpi4py

Clone the Devito repository, check out the right branch, and install Devito

git clone
cd devito
pip3 install -e .

To use Devito on multiple GPUs, your code must pass the gpu-direct option to the Operator. The Operator must look like

op = Operator(eq, opt=('advanced', {'gpu-direct': True}))

Here is a test example

from devito import Grid, TimeFunction, Eq, Operator
grid = Grid(shape=(4, 4))
u = TimeFunction(name='u', grid=grid, space_order=2, time_order=0)
u.data[:] = 1
eq = Eq(u.forward, u.dx+1)
op = Operator(eq, opt=('advanced', {'gpu-direct': True}))

Save this example to a Python file. Finally, let's run it, appending the file name to the command below:

DEVITO_PLATFORM=nvidiaX DEVITO_ARCH=clang DEVITO_LANGUAGE=openmp DEVITO_MPI=1 OMPI_CC=clang mpirun -np 4 python3
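
If you run several examples in one session, the one-line prefix gets tedious. A sketch of the same settings as exports is shown below. DEVITO_SCRIPT is a hypothetical variable that you set to the path of your example file; it is not something Devito reads. OMPI_CC tells Open MPI's compiler wrapper which C compiler to use.

```shell
# Same settings as the one-liner above, exported once for the session.
export DEVITO_PLATFORM=nvidiaX
export DEVITO_ARCH=clang
export DEVITO_LANGUAGE=openmp
export DEVITO_MPI=1
export OMPI_CC=clang

# DEVITO_SCRIPT is a hypothetical variable holding the path of the example file.
if [ -n "${DEVITO_SCRIPT:-}" ]; then
    mpirun -np 4 python3 "$DEVITO_SCRIPT"
fi
```

On a single node the exported variables reach all four ranks. Across several nodes you would probably need to forward them explicitly, for example with mpirun's -x option.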
