# Manual building of an OpenACC code

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Data Management](./Data_management.ipynb)

---

During the training course, the examples are built simply by executing the code cells.
Even though the command line is always printed, we think it is important to practice the build process by hand.

## Build with NVIDIA compilers

The compilers are:

- nvc: C compiler
- nvc++: C++ compiler
- nvfortran: Fortran compiler

### Compiler options for OpenACC

- `-acc`: the compiler will recognize the OpenACC directives

    OpenACC can also generate code for multicore CPUs, in a way similar to OpenMP.

    Some interesting options are:
  - `-acc=gpu`: to build for GPU
  - `-acc=multicore`: to build for CPU (multithreaded)
  - `-acc=host`: to build for CPU (sequential)
  - `-acc=noautopar`: disable the automatic parallelization inside `parallel` regions (the default is `-acc=autopar`)

All options can be found in the [documentation](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#acc-cmdln-opts).

- `-gpu`: GPU-specific options to be passed to the compiler

    Some interesting options are:

  - `-gpu=ccXX`: specify the compute capability for which the code has to be built

      The list is available at [https://developer.nvidia.com/cuda-gpus#compute](https://developer.nvidia.com/cuda-gpus#compute).
  - `-gpu=managed`: enable NVIDIA Unified Memory (dynamically allocated data is migrated automatically, so you can omit explicit data transfers, but it might fail sometimes)
  - `-gpu=pinned`: enable _pinned_ memory, which can improve the performance of data transfers
  - `-gpu=lineinfo`: generate debugging line information; less overhead than `-g`

All options can be found in the [documentation](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#gpu).

- `-Minfo`: the compiler prints information about the optimizations it uses
  - `-Minfo=accel`: information about OpenACC (mandatory in this training course!)
  - `-Minfo=all`: all optimizations are printed (OpenACC, vectorization, FMA, ...). We recommend using this option.

### Other useful compiler options

- `-o exec_name`: name of the executable
- `-Ox`: level of optimization (0 <= x <= 4)
- `-Og`: enable only the optimizations that do not interfere with debugging
- `-fast`: roughly equivalent to `-O2 -Munroll=c:1 -Mnoframe -Mlre` (the exact set depends on the compiler version and target)
- `-g`: add debugging symbols
- `-gopt`: include symbolic debugging information in the object file while generating the same optimized code as when `-g` is not specified

Several sub-options of the same flag can be given as a comma-separated list (for example `-acc=gpu,noautopar`).

### Examples

For instance, to compile a C source file for an NVIDIA V100 GPU (Compute Capability 7.0), run the following command:
```bash
nvc -acc=gpu,noautopar -gpu=cc70,managed -Minfo=all mysource.c -o myprog
```

The example below shows how to compile for the following setup:

- OpenACC for GPU `-acc=gpu`
- Compile for Volta architecture `-gpu=cc70`
- Activate optimizations `-fast`
- Print optimizations and OpenACC information `-Minfo=all`

```make
ACCFLAGS = -acc=gpu -gpu=cc70
OPTFLAGS = -fast
INFOFLAGS = -Minfo=all

myacc_exec: myacc.f90
	nvfortran -o myacc_exec $(ACCFLAGS) $(OPTFLAGS) $(INFOFLAGS) myacc.f90
```

## Build with GCC compilers

The compilers are:

- gcc: C compiler
- g++: C++ compiler
- gfortran: Fortran compiler

### Compiler options for OpenACC

- `-fopenacc`: the compiler will recognize the OpenACC directives
- `-foffload`: controls code generation for the accelerator. The host and accelerator compilers are separate
  - `-foffload=nvptx-none`: compile for NVIDIA devices

    It can also be used to pass options to the accelerator compiler, such as optimization levels or libraries to link (`-foffload=-O3 -foffload=-lm`).
    Several options can be passed at once by enclosing them in double quotes.

### Other useful compiler options

- `-o exec_name`: name of the executable
- `-Ox`: level of optimization (0 <= x <=3)
- `-g`: add debugging symbols

### Example

The example shows how to compile for the following setup:

- OpenACC for GPU `-fopenacc`
- Compile for NVIDIA GPU `-foffload=nvptx-none`
- Activate optimizations `-O3 -foffload=-O3`
- Print optimization information `-fopt-info`

```make
ACCFLAGS = -fopenacc -foffload=nvptx-none
OPTFLAGS = -O3 -foffload=-O3
INFOFLAGS = -fopt-info

myacc_exec: myacc.c
	gcc -o myacc_exec $(ACCFLAGS) $(OPTFLAGS) $(INFOFLAGS) myacc.c
```

## Exercise

- Execute the following cell, which writes its content to a file (put the file name you want after `writefile`).
- Open a terminal (File -> New -> Terminal).
- Load the compiler you wish to use (for example: `module load nvidia-compiler/21.7`).
- Use the information above to compile the file. You might need to change the file name from "exercise" to "exercise.c" or "exercise.f90".
- To check that the code ran on the GPU, run `export NVCOMPILER_ACC_TIME=1` before executing it; the runtime then prints a timing summary of the accelerated regions.
- Execute the code with `srun -n 1 --cpus-per-task=10 -A for@v100 --gres=gpu:1 --time=00:03:00 --hint=nomultithread --qos=qos_gpu-dev time <executable_name>`
- Bonus: Compile the code without OpenACC support and compare the elapsed time in both cases.

Example stored in: `../../examples/C/Manual_building_exercise.c`

```c
%%writefile exercise
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void inplace_sum(double* A, double* B, size_t size)
{
    #pragma acc parallel loop present(A[0:size], B[0:size])
    for (size_t i=0; i<size; ++i)
        A[i] += B[i];
}

int main(void)
{
    size_t size = (size_t) 1e9;
    double* A = (double*) malloc(size*sizeof(double));
    double* B = (double*) malloc(size*sizeof(double));
    double sum = 0.0;

    #pragma acc data create(A[0:size], B[0:size])
    {
        #pragma acc parallel loop present(A[0:size], B[0:size])
        for (size_t i=0; i<size; ++i)
        {
            A[i] = sin(M_PI*(double)i/(double)size)*sin(M_PI*(double)i/(double)size);
            B[i] = cos(M_PI*(double)i/(double)size)*cos(M_PI*(double)i/(double)size);
        }

        inplace_sum(A, B, size);

        #pragma acc parallel loop present(A[0:size], B[0:size]) reduction(+:sum)
        for (size_t i=0; i<size; ++i)
            sum += A[i];
    }
    printf("This should be close to 1.0: %f\n", sum/(double) size);
    return 0;
}
```
