# GPU Programming with OpenMP Offloading

Based on the material
> Programação Paralela Híbrida: MPI + OpenMP Offloading
>
> Calebe P. Bianchini, Evaldo B. Costa, Gabriel P. Silva
>
> DOI: https://doi.org/10.5753/sbc.16010.0.3

and on the official repository of the short course:
> https://github.com/Programacao-Paralela-e-Distribuida/SSCAD24-MPI-OpenMP/

### Quick environment setup

(Remember to select a suitable runtime, one with a GPU.)


In [None]:
!add-apt-repository -y ppa:ubuntu-toolchain-r/test &> /dev/null
!apt install -y gcc-13 g++-13 gcc-13-offload-nvptx libgomp1 &> /dev/null
!ln -sfnv /usr/bin/gcc-13 /usr/bin/gcc &> /dev/null
!gcc --version
!nvcc --version
!nvidia-smi

gcc (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Fri Oct 18 19:43:05 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      

## The Pi problem

1. Creating the source file

In [None]:
%%writefile pi.c

#include <stdio.h>
#include <omp.h>
#include <math.h>

int main() {
  long int num_steps = 10000000000; // Number of integration steps
  double step = 1.0 / (double) num_steps;
  double pi = 0.0;
  double begin, end;
  begin = omp_get_wtime();
  #pragma omp target data map(tofrom: pi) map(to:num_steps, step) device(1) // Data-mapping region for device number 1
  #pragma omp target teams distribute parallel for reduction(+:pi) // Offload the loop and split it across teams and threads
     for (long int i = 0; i < num_steps; i++) {
          double x = (i + 0.5) * step;
          pi += 4.0 / (double) (1.0 + x * x);
     }
  pi *= step;
  end = omp_get_wtime();
  printf("Valor de Pi calculado: %2.15f\n", pi);
  printf("Tempo de execução: %f segundos\n", end - begin);
  return 0;
}

Writing pi.c


2. Compiling and running

In [None]:
!gcc -fopenmp -fno-lto -fstack-protector pi.c -o pi
!./pi

Valor de Pi calculado: 3.141592653589563
Tempo de execução: 32.103054 segundos


3. Profiling

In [None]:
!nvprof ./pi

==7635== NVPROF is profiling process 7635, command: ./pi
Valor de Pi calculado: 3.141592653589563
Tempo de execução: 31.964028 segundos
==7635== Profiling application: ./pi
==7635== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   74.06%  94.080ms         1  94.080ms  94.080ms  94.080ms  cuCtxCreate
                   25.87%  32.868ms         1  32.868ms  32.868ms  32.868ms  cuCtxDestroy
                    0.06%  71.073us        16  4.4420us     153ns  65.610us  cuDeviceGetAttribute
                    0.00%  5.5540us         1  5.5540us  5.5540us  5.5540us  cuDeviceGetPCIBusId
                    0.00%  4.9520us         1  4.9520us  4.9520us  4.9520us  cuDeviceGetName
                    0.00%  4.8550us         2  2.4270us     686ns  4.1690us  cuCtxGetDevice
                    0.00%  2.0840us         4     521ns     185ns  1.1200us  cuDeviceGetCount
                    0.00%  1.8600us    