# GPU offloading in Fortran
This notebook walks through a few test programs for studying GPU offloading in Fortran.

The main routine we explore, benchmark2d2, consists of a 2d array update (4-point averaging):
```
  iter=0
  do while (iter < iter_max)
    do j=1,n-2;do i=1,m-2
      AN(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
    enddo; enddo
    do j=1,n-2;do i=1,m-2
      A(i,j) = AN(i,j)
    enddo; enddo
    iter = iter+1
  enddo
```
The runtime is proportional to the size of the array (m\*n) as well as the number of iterations (iter_max).   
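
For comparison with the loop above, here is a minimal sketch of how the same update could be written with `do concurrent`. It is not the notebook's `benchmark2d2_docon` source: the module and subroutine names are hypothetical, and the array bounds `0:m-1, 0:n-1` are an assumption inferred from the loop limits.

```fortran
module stencil_docon_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical routine: 4-point averaging written with do concurrent.
  ! Assumes A and AN are dimensioned (0:m-1, 0:n-1).
  subroutine average4_docon(A, AN, m, n, iter_max)
    integer, intent(in) :: m, n, iter_max
    real(dp), intent(inout) :: A(0:m-1, 0:n-1)
    real(dp), intent(inout) :: AN(0:m-1, 0:n-1)
    integer :: i, j, iter

    do iter = 1, iter_max
      ! Every (i,j) reads only A and writes only AN, so the
      ! iterations are independent of each other.
      do concurrent (j = 1:n-2, i = 1:m-2)
        AN(i,j) = 0.25_dp*(A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
      end do
      ! Copy back in a second loop, as in the two-loop form above.
      do concurrent (j = 1:n-2, i = 1:m-2)
        A(i,j) = AN(i,j)
      end do
    end do
  end subroutine average4_docon
end module stencil_docon_sketch
```

Compiled without any offload flag this runs as ordinary loops on the CPU; with `nvfortran -stdpar` (the flag used in the cells below) the compiler may offload the two `do concurrent` loops.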

We also explore the timings of a simpler, embarrassingly parallel array update:
```
do j=0,n-1;do i=0,m-1
     iter=0
     do while (iter < iter_max)
        A(i,j) = A(i,j)*(A(i,j)-1.0)
        iter = iter+1
     enddo
enddo; enddo
```

We realize that such an embarrassingly parallel workload is usually not representative of those that arise in science and engineering. It was designed only as a starting point to try various offloading schemes and compare their potential utility in Fortran.
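
As an illustration of one offloading scheme for this simple update, the sketch below uses an OpenMP `target` directive. The module and subroutine names are hypothetical, the `0:m-1, 0:n-1` bounds are assumed from the loop limits above, and the choice of clauses (`teams`, `collapse(2)`, `map`) is only one plausible combination, not necessarily what the benchmark routines use.

```fortran
module simple_update_omp_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical routine: the simple update offloaded via OpenMP target.
  subroutine update_omp_target(A, m, n, iter_max)
    integer, intent(in) :: m, n, iter_max
    real(dp), intent(inout) :: A(0:m-1, 0:n-1)
    integer :: i, j, iter

    ! collapse(2) merges the j and i loops into one iteration space;
    ! map(tofrom: A) copies A to the device and back.
    !$omp target teams distribute parallel do collapse(2) private(iter) map(tofrom: A)
    do j = 0, n-1
      do i = 0, m-1
        ! Each element is iterated independently of all others.
        do iter = 1, iter_max
          A(i,j) = A(i,j)*(A(i,j) - 1.0_dp)
        end do
      end do
    end do
    !$omp end target teams distribute parallel do
  end subroutine update_omp_target
end module simple_update_omp_sketch
```

Without OpenMP enabled the directives are plain comments and the routine runs serially on the CPU; with `nvfortran -mp=gpu` (as in the cells below) the loop nest is a candidate for offload.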


## Executive Summary
- Offload via the Fortran "do concurrent" scheme seems to be on par with openACC "managed mode" in speed.
- Offload via the openMP "target" scheme seems to be on par with openACC "non-managed mode" in speed.
- Both of these schemes are available and were tested in nvfortran for Nvidia GPUs and ifx for Intel GPUs.
- However, cross-compatibility (using ifx to offload to an Nvidia device, or nvfortran to offload to an Intel device) was not tested.


### Simple array test on gfdl gpubox

#### The bad
- For the more complex problem the GPU answers (final_sum) are not repeatable and are also too different from CPU answers! Why?

#### The good
- For the simpler problem the GPU and CPU answers are the same.
- 'do concurrent' is much faster than openmp offload, particularly for larger problems. Why?
- Note the 4000x speedup on GPU relative to single thread CPU for a fully vectorizable subroutine. Ain't that weird?
- The GPU/CPU speedup reduces to 15x for a more realistic subroutine like 4-point average (Laplace operator).
- Timing to solution scales linearly with problem size for 'do concurrent' but lags behind for openmp. Why? 
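
The scaling remark can be checked with a little arithmetic on the `benchmark2d` timings from the 06/30/2023 cell further down (sizes 1e8, 2e8 and 1e9, 2000 iterations). The sketch below divides each time by the number of element updates; the module and routine names are hypothetical, and only the timing values come from this notebook.

```fortran
module scaling_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Seconds spent per array element per iteration.
  pure function cost_per_update(time_s, nelem, niter) result(c)
    real(dp), intent(in) :: time_s, nelem
    integer, intent(in) :: niter
    real(dp) :: c
    c = time_s/(nelem*real(niter, dp))
  end function cost_per_update

  ! Print the normalized cost for the timings quoted in the 06/30/2023 cell.
  subroutine print_scaling()
    real(dp), parameter :: nelem(3)   = [1.0e8_dp, 2.0e8_dp, 1.0e9_dp]
    real(dp), parameter :: t_docon(3) = [0.189_dp, 0.444_dp, 1.995_dp]
    real(dp), parameter :: t_omp(3)   = [0.458_dp, 0.893_dp, 10.061_dp]
    integer :: k
    print '(a)', '      size   docon(s/update)   omp_gpu(s/update)'
    do k = 1, 3
      print '(es10.2, 2es18.3)', nelem(k), &
            cost_per_update(t_docon(k), nelem(k), 2000), &
            cost_per_update(t_omp(k), nelem(k), 2000)
    end do
  end subroutine print_scaling
end module scaling_sketch
```

Calling `print_scaling()` shows the `do concurrent` cost staying close to 1e-12 s per update across the three sizes, while the openmp cost roughly doubles at the largest size.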

### Gotchas
#### The Fortran Golden Rule "i-inner-do" seems to break for openmp gpu offload. 
Fortran uses [column-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order), i.e., arrays A(i,j) are arranged in memory so that A(i+1,j) is "next to" A(i,j). So in an i,j loop accessing A(i,j), i should be the inner index to take advantage of the closer spacing in memory. I.e., a construct like
```
do j=1,Nj;do i=1,Ni; A(i,j)=1; enddo;enddo
```
is "better" than 
```
do i=1,Ni;do j=1,Nj; A(i,j)=1; enddo;enddo
```
This is evident from the timings for the CPU code below where subroutines ending with "swapij" are of the second kind (j-inner).

But that rule seems to break for GPU offload via openmp, where j-inner seems to be faster than i-inner, as with C arrays (which are row-major). This is a hint that the compiler might be translating Fortran to C.
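
A small CPU-only sketch for trying the loop-order effect in isolation is given below. All names are hypothetical and the routine is not part of the benchmark programs; it simply times the two loop orders with `system_clock` after a warm-up pass.

```fortran
module loop_order_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! i is the inner loop: consecutive iterations touch adjacent memory.
  subroutine fill_i_inner(A, ni, nj)
    integer, intent(in) :: ni, nj
    real(dp), intent(out) :: A(ni, nj)
    integer :: i, j
    do j = 1, nj
      do i = 1, ni
        A(i,j) = real(i + j, dp)
      end do
    end do
  end subroutine fill_i_inner

  ! j is the inner loop: consecutive iterations are ni elements apart.
  subroutine fill_j_inner(A, ni, nj)
    integer, intent(in) :: ni, nj
    real(dp), intent(out) :: A(ni, nj)
    integer :: i, j
    do i = 1, ni
      do j = 1, nj
        A(i,j) = real(i + j, dp)
      end do
    end do
  end subroutine fill_j_inner

  ! Time both orders on the same array and print the results.
  subroutine compare_loop_orders(ni, nj)
    use iso_fortran_env, only: int64
    integer, intent(in) :: ni, nj
    real(dp), allocatable :: A(:,:)
    integer(int64) :: t0, t1, t2, rate
    allocate(A(ni, nj))
    A = 0.0_dp                      ! warm-up so page faults are not timed
    call system_clock(t0, rate)
    call fill_i_inner(A, ni, nj)
    call system_clock(t1)
    call fill_j_inner(A, ni, nj)
    call system_clock(t2)
    print '(a,f9.4,a,f9.4)', 'i-inner (s): ', real(t1 - t0, dp)/real(rate, dp), &
          '   j-inner (s): ', real(t2 - t1, dp)/real(rate, dp)
    print '(a,f12.1)', 'check A(ni,nj) = ', A(ni, nj)
  end subroutine compare_loop_orders
end module loop_order_sketch
```

A call such as `call compare_loop_orders(5000, 5000)` gives a quick check on a given machine; the size of the gap depends on the compiler, the optimization level and the cache sizes.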

#### nvfortran -acc is default, -acc -ta=nvidia:managed is also probably default as we don't need data movement with ACC
#### nvfortran -stdpar speeds up openmp offload. Why? What does -stdpar do?
```
size      time(s) iterations initial_sum      final_sum       #ompthr   subroutine
100000000 994.535  2000    0.000066406416776  0.001709011693696 1  benchmark2d2_omp_gpu_without-stdpar
100000000  95.527  2000    0.000066406416776  0.001709011693696 1  benchmark2d2_omp_gpu_WITH-stdpar
```    

Hint found [here](https://developer.nvidia.com/blog/using-fortran-standard-parallel-programming-for-gpu-acceleration/): 
For nvfortran, activating standard parallelism (-stdpar=gpu) automatically activates managed memory. To use OpenACC directives to control data movement along with do concurrent, use the following flags: -acc=gpu -gpu=nomanaged.
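
To make the hint concrete, here is a sketch of `do concurrent` loops wrapped in an OpenACC data region, so that the arrays would stay on the device across iterations instead of relying on managed memory. This is an untested assumption based on the quoted hint: the names are hypothetical, the `0:m-1, 0:n-1` bounds are assumed, and whether the compiler honors the data region for `do concurrent` depends on the flags mentioned above.

```fortran
module docon_acc_data_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical routine: 4-point averaging with explicit data movement.
  subroutine average4_docon_accdata(A, AN, m, n, iter_max)
    integer, intent(in) :: m, n, iter_max
    real(dp), intent(inout) :: A(0:m-1, 0:n-1)
    real(dp), intent(inout) :: AN(0:m-1, 0:n-1)
    integer :: i, j, iter

    ! copy(A): send A to the device on entry and bring it back on exit.
    ! create(AN): AN is scratch space, allocated on the device only.
    !$acc data copy(A) create(AN)
    do iter = 1, iter_max
      do concurrent (j = 1:n-2, i = 1:m-2)
        AN(i,j) = 0.25_dp*(A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
      end do
      do concurrent (j = 1:n-2, i = 1:m-2)
        A(i,j) = AN(i,j)
      end do
    end do
    !$acc end data
  end subroutine average4_docon_accdata
end module docon_acc_data_sketch
```

Without OpenACC enabled the directives are comments and the routine behaves like a plain CPU loop, which makes it easy to compare answers between the two builds.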

### Some timings

In [2]:
##nvFORTRAN
!date; source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d ; ./gpu_offload_test2d

Mon Dec 11 13:14:48 EST 2023
     subroutine Aij <-- (Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000   367.745    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_cpu
     100000000   230.126    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu
     100000000  3261.778    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu_swapij
     100000000    55.858    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu
     100000000    62.188    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_collapse2
     100000000     9.631    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_collapse2_teams
     100000000     8.819    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_collapse2_loop
     100000000    14.100    20

In [1]:
##nvFORTRAN
!date; source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d ; ./gpu_offload_test2d

Mon Dec 11 12:48:13 EST 2023
     subroutine Aij <-- (Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000    62.348    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu
     100000000    56.317    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_collapse2
     100000000     9.638    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_collapse2_teams
     100000000     8.822    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_collapse2_loop
     100000000    14.069    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_swapij
     100000000    14.100    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_swapij_collapse2
     100000000    55.945    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_swapij_collap

### Old gpubox 07/18/2023

In [26]:
##nvFORTRAN
!date; source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d ; ./gpu_offload_test2d

Tue Jul 18 16:18:08 EDT 2023
     subroutine Aij <-- (Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000   649.363    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_cpu
     100000000   962.242    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu
     100000000  7108.872    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu_swapij
     100000000    54.111    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu
     100000000    19.068    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_swapij
     100000000    31.745    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_acc_gpu
     100000000    25.128    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_acc_gpu_swapij
     100000000    29.988    2000    0.000066406416776    

In [25]:
##nvFORTRAN
!date; source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d ; ./gpu_offload_test2d

Mon Jul 17 16:42:29 EDT 2023
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000   603.275    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_cpu
     100000000  1054.563    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu
     100000000  7094.529    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu_swapij
     100000000    54.180    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu
     100000000    19.085    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_swapij
     100000000    31.399    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_acc_gpu
     100000000    27.769    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_acc_gpu_swapij
     100000000    30.245    2000    0.000066

In [24]:
##nvFORTRAN
!date; source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d ; ./gpu_offload_test2d

Mon Jul 17 12:35:17 EDT 2023
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000   829.885    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_cpu
     100000000   700.527    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu
     100000000  7106.718    2000    0.000066406416776    0.001709011693696    2     benchmark2d2_omp_cpu_swapij
     100000000    54.155    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu
     100000000    19.996    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_omp_gpu_swapij
     100000000    32.078    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_acc_gpu
     100000000    29.752    2000    0.000066406416776    0.001709011693696    1     benchmark2d2_acc_gpu_swapij
     100000000    29.807    2000    0.000066

In [19]:
##nvFORTRAN
!date; source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d ; ./gpu_offload_test2d

Wed Jul 12 18:03:13 EDT 2023
     fully vectorizable subroutine Aij=Aij*(Aij-1)
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000   429.859    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_cpu
     100000000   433.338    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_cpu_swapij
     100000000   215.287    2000    0.000066406416776    0.000002652284758    2     benchmark2d_omp_cpu
     100000000     0.506    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_gpu
     100000000     0.446    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_gpu_swapij
     100000000     0.350    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_gpu_subij
     100000000     0.201    2000    0.000066406416776    0.000002652284758    1     benchmark2d_docon
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
   

In [23]:
##Single precision
##nvFORTRAN
!date; source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d_float.f90 -o gpu_offload_test2d_float ; ./gpu_offload_test2d_float

Fri Jul 14 13:39:49 EDT 2023
     fully vectorizable subroutine Aij=Aij*(Aij-1)
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000   429.885    2000    0.000066406610131    0.000002652286184    1     benchmark2d_omp_cpu
     100000000   432.379    2000    0.000066406610131    0.000002652286184    1     benchmark2d_omp_cpu_swapij
     100000000   215.354    2000    0.000066406610131    0.000002652286184    2     benchmark2d_omp_cpu
     100000000     0.351    2000    0.000066406610131    0.000002652286184    1     benchmark2d_omp_gpu
     100000000     0.299    2000    0.000066406610131    0.000002652286184    1     benchmark2d_omp_gpu_swapij
     100000000     0.198    2000    0.000066406610131    0.000002652286184    1     benchmark2d_omp_gpu_subij
     100000000     0.160    2000    0.000066406610131    0.000002652286184    1     benchmark2d_docon
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
   

In [20]:
#nvC
!date; source envs.gpubox; nvc -mp=gpu  gpu_offload_test2d.c  -o gpu_offload_test2d_nvc ; ./gpu_offload_test2d_nvc

Thu Jul 13 10:56:26 EDT 2023
     fully vectorizable subroutine Aij=Aij*(Aij-1)
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
   100000000   767.603    2000    0.000066406416084    0.000002652284971    1           benchmark2d_omp_cpu
   100000000   396.312    2000    0.000066406416084    0.000002652284971    2           benchmark2d_omp_cpu
   100000000     0.673    2000    0.000066406416084    0.000002652284971    1           benchmark2d_omp_gpu
   100000000     0.474    2000    0.000066406416084    0.000002652284971    1    benchmark2d_omp_gpu_swapij
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4 
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
   100000000  1398.757    2000    0.000066406416084    0.001709011675887    1         benchmark2d_2_omp_cpu
   100000000   865.764    2000    0.000066406416084    0.001709011675887    2         benchmark2d_2_omp_cpu
   1

In [22]:
##Single precision
#nvC
!date; source envs.gpubox; nvc -mp=gpu  gpu_offload_test2d_float.c  -o gpu_offload_test2d_float_nvc ; ./gpu_offload_test2d_float_nvc

Fri Jul 14 10:41:46 EDT 2023
     fully vectorizable subroutine Aij=Aij*(Aij-1)
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
   100000000  1089.756    2000    0.000066406610131    0.000002652286412    1           benchmark2d_omp_cpu
   100000000   552.592    2000    0.000066406610131    0.000002652286412    2           benchmark2d_omp_cpu
   100000000     1.420    2000    0.000066406610131    0.000002652286412    1           benchmark2d_omp_gpu
   100000000     1.217    2000    0.000066406610131    0.000002652286412    1    benchmark2d_omp_gpu_swapij
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4 
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
   100000000  1487.218    2000    0.000066406610131    0.001702437060885    1         benchmark2d_2_omp_cpu
   100000000   744.085    2000    0.000066406610131    0.001702437060885    2         benchmark2d_2_omp_cpu
   1

In [21]:
#CLANG
!date; ulimit -s unlimited; /home/Niki.Zadeh/opt/llvm/install/bin/clang  -L/opt/gcc/11.3.0/lib64  -lm -O3 -fopenmp  -fopenmp-targets=nvptx64  gpu_offload_test2d.c  -o gpu_offload_test2d_clang; ./gpu_offload_test2d_clang

Thu Jul 13 15:46:17 EDT 2023
     fully vectorizable subroutine Aij=Aij*(Aij-1)
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
   100000000   430.594    2000    0.000066406416811    0.000002652284762    1           benchmark2d_omp_cpu
   100000000   215.794    2000    0.000066406416811    0.000002652284762    2           benchmark2d_omp_cpu
   100000000    19.083    2000    0.000066406416811    0.000002652284762    1           benchmark2d_omp_gpu
   100000000    19.126    2000    0.000066406416811    0.000002652284762    1    benchmark2d_omp_gpu_swapij
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4 
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
   100000000   642.038    2000    0.000066406416811    0.001709011694582    1         benchmark2d_2_omp_cpu
   100000000   279.017    2000    0.000066406416811    0.001709011694582    2         benchmark2d_2_omp_cpu
   1

## 06/30/2023

In [7]:
!source envs.gpubox; nvfortran -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d ; ./gpu_offload_test2d

#Time to solution for a few simple problems.
#
#fully vectorizable subroutine Aij=Aij*(Aij-1)
#     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
#     100000000   429.689    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp                                   
#     100000000   235.189    2000    0.000066406416776    0.000002652284758    2     benchmark2d_omp                                   
##
#     100000000     0.458    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_gpu                               
#     100000000     0.353    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_gpu_subij                         
#     100000000     0.189    2000    0.000066406416776    0.000002652284758    1     benchmark2d_docon                
#
#     200000000     0.893    2000    0.000033203208388    0.000001326142379    1     benchmark2d_omp_gpu                               
#     200000000     0.737    2000    0.000033203208388    0.000001326142379    1     benchmark2d_omp_gpu_subij                         
#     200000000     0.444    2000    0.000033203208388    0.000001326142379    1     benchmark2d_docon                                 
#
#    1000000000    10.061    2000    0.000006640641678    0.000000265228476    1     benchmark2d_omp_gpu                               
#    1000000000     9.850    2000    0.000006640641678    0.000000265228476    1     benchmark2d_omp_gpu_subij                         
#    1000000000     1.995    2000    0.000006640641678    0.000000265228476    1     benchmark2d_docon                                 
#
#     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
#     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
#     100000000   620.114    2000    0.000066406416776    0.002417809226842    1     benchmark2d2_omp_cpu                              
#     100000000    38.148    2000    0.000066406416776    0.001984566887412    1     benchmark2d2_omp_gpu                              
#     100000000     4.982    2000    0.000066406416776    0.001835117539473    1     benchmark2d2_docon 
#     100000000   618.343    2000    0.000066406416776    0.002417809226842    1     benchmark2d2_omp_cpu                              
#     100000000    38.014    2000    0.000066406416776    0.001984539257347    1     benchmark2d2_omp_gpu                              
#     100000000     5.014    2000    0.000066406416776    0.001835053249559    1     benchmark2d2_docon                                
#     100000000    38.217    2000    0.000066406416776    0.001984416922111    1     benchmark2d2_omp_gpu                              
#     100000000     4.969    2000    0.000066406416776    0.001835279846870    1     benchmark2d2_docon   
#     100000000    38.183    2000    0.000066406416776    0.001984440023023    1     benchmark2d2_omp_gpu                              
#     100000000     5.052    2000    0.000066406416776    0.001835121322498    1     benchmark2d2_docon 
#     100000000    38.132    2000    0.000066406416776    0.001984400154502    1     benchmark2d2_omp_gpu                              
#     100000000     4.991    2000    0.000066406416776    0.001835447374488    1     benchmark2d2_docon             

     fully vectorizable subroutine Aij=Aij*(Aij-1)
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000     0.505    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_gpu                               
     100000000     0.381    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp_gpu_subij                         
     100000000     0.235    2000    0.000066406416776    0.000002652284758    1     benchmark2d_docon                                 
     non-vectorizable subroutine Aij=(Ai-1,j + Ai+1,j + Ai,j-1 + Ai,j+1)/4
     size        time(s) iterations initial_sum          final_sum        #ompthr    subroutine
     100000000   618.343    2000    0.000066406416776    0.002417809226842    1     benchmark2d2_omp_cpu                              
     100000000    38.014    2000    0.000066406416776    0.001984539257347    1     benchmark2d2_omp_gpu                              
     10

## Older results

### openMP on CPU

In [None]:
!source envs.gpubox; \rm gpu_offload_benchmark2d ; nvfortran -mp gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d

In [15]:
##This test could take too long to run on CPU
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
!echo '16' | ./gpu_offload_benchmark2d
#      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
#    1600000000 12306.969    2000    0.000016602849567    0.000000663121053    1     benchmark2d_omp 
#     100000000   770.307    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp
#     100000000   387.471    2000    0.000066406416776    0.000002652284758    2     benchmark2d_omp
#     100000000   200.788    2000    0.000066406416776    0.000002652284758    4     benchmark2d_omp
#     100000000    99.781    2000    0.000066406416776    0.000002652284758    8     benchmark2d_omp
#     100000000    54.811    2000    0.000066406416776    0.000002652284758   16     benchmark2d_omp


Lmod is automatically replacing "gcc/11.3.0" with "nvhpc-no-mpi/22.5".


Due to MODULEPATH changes, the following have been reloaded:
  1) hdf5/1.12.2     2) netcdf/4.9.0     3) openmpi/4.1.4

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000    58.132    2000    0.000066406416776    0.000002652284758   16     benchmark2d_omp                                   


### openMP offload to GPU

In [2]:
!source envs.gpubox; \rm gpu_offload_benchmark2d ; nvfortran -mp=gpu gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [3]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-1' | ./gpu_offload_benchmark2d
    
#  1600000000     9.358    2000    0.000016602849567    0.000000663121053    0     benchmark2d_omp_gpu
#   100000000     1.438    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.438    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               
     100000000     1.470    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               
     100000000     1.476    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               
     100000000     1.450    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               


In [4]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-2' | ./gpu_offload_benchmark2d
    
# 1600000000     7.342    2000    0.000016602849567    0.000000663121053    0     benchmark2d_omp_gpu_teams 
#  100000000     1.086    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.174    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         
     100000000     1.086    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         
     100000000     1.193    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         
     100000000     1.222    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         


The following is a hack to use 2 GPU devices by dividing the 2d array into 2 blocks. It gets better performance for larger arrays.
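
One possible way to express such a two-block split is sketched below for the simple update, using the OpenMP `device` clause. This is not the source of `benchmark2d_omp_gpu_teams_2devs`: the names are hypothetical, the split along `j` and the use of `nowait` plus `taskwait` are assumptions, and support for `nowait` on `target` may vary between compilers.

```fortran
module two_device_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical routine: split the columns of A between devices 0 and 1.
  subroutine update_two_devices(A, m, n, iter_max)
    integer, intent(in) :: m, n, iter_max
    real(dp), intent(inout) :: A(0:m-1, 0:n-1)
    integer :: idev, jlo, jhi, i, j, iter

    do idev = 0, 1
      ! Block 0 gets columns 0..n/2-1, block 1 gets n/2..n-1.
      jlo = idev*(n/2)
      jhi = merge(n/2 - 1, n - 1, idev == 0)
      ! Only this block's columns are mapped to device idev.
      ! nowait lets the host launch the second block without waiting.
      !$omp target teams distribute parallel do collapse(2) private(iter) &
      !$omp&   device(idev) map(tofrom: A(:, jlo:jhi)) nowait
      do j = jlo, jhi
        do i = 0, m-1
          do iter = 1, iter_max
            A(i,j) = A(i,j)*(A(i,j) - 1.0_dp)
          end do
        end do
      end do
      !$omp end target teams distribute parallel do
    end do
    ! Wait for both deferred target regions to finish.
    !$omp taskwait
  end subroutine update_two_devices
end module two_device_sketch
```

Because each element is updated independently, no halo exchange between the two blocks is needed here; the 4-point averaging routine would need extra care at the block boundary.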

In [5]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-3' | ./gpu_offload_benchmark2d
    
#1600000000     9.016    2000    0.000016602849567    0.000000663121053    0     benchmark2d_omp_gpu_teams_2devs
# 100000000     1.153    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.164    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   
     100000000     1.348    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   
     100000000     1.153    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   
     100000000     1.303    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   


## openACC offload to GPU
### A) managed mode

In [6]:
!source envs.gpubox;\rm ./gpu_offload_benchmark2d ; nvfortran -mp -acc -ta=nvidia:managed gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [7]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-5' | ./gpu_offload_benchmark2d
    
#1600000000     2.979    2000    0.000016602849567    0.000000663121053    0     benchmark2d_acc
# 100000000     0.666    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.726    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     0.748    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     0.666    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     0.741    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   


The following is a hack to use 2 GPU devices by dividing the 2d array into 2 blocks. It gets better performance for larger arrays.

In [8]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-52' | ./gpu_offload_benchmark2d

#1600000000     2.155    2000    0.000016602849567    0.000000663121053    0     benchmark2d_acc_2dev   
# 100000000     0.573    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev    

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.631    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              
     100000000     0.573    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              
     100000000     0.666    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              
     100000000     0.668    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              


### B) non-managed mode

In [9]:
!source envs.gpubox;\rm ./gpu_offload_benchmark2d ; nvfortran -mp -acc gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [10]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-5' | ./gpu_offload_benchmark2d
    
#1600000000     7.428    2000    0.000016602849567    0.000000663121053    0     benchmark2d_acc
# 100000000     1.114    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.169    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     1.114    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     1.291    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     1.206    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   


## do concurrent

In [11]:
!source envs.gpubox; \rm gpu_offload_benchmark2d ; nvfortran -mp -stdpar gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [12]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-4' | ./gpu_offload_benchmark2d
    
#1600000000     2.941    2000    0.000016602849567    0.000000663121053    0     benchmark2d_docon   
# 100000000     0.709    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon 

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.778    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 
     100000000     0.718    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 
     100000000     0.784    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 
     100000000     0.709    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 


The following is a hack to use 2 GPU devices by dividing the 2d array into 2 blocks. It gets better performance for larger arrays.

In [13]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-42' | ./gpu_offload_benchmark2d
    
# 1600000000     2.152    2000    0.000016602849567    0.000000663121053    0     benchmark2d_docon_2dev_hack
#  100000000     0.643    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.692    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       
     100000000     0.695    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       
     100000000     0.696    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       
     100000000     0.643    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       


### Quick test for gpu offload single precision array
This is a Fortran program to quickly test gpu offload via openmp and do concurrent. It has been run on Intel (ifx) and Nvidia (nvfortran) GPU platforms.

## Intel GPUs
Here are some timing numbers I obtained on the Intel devcloud platform for Intel GPUs under Intel ifx, and some corresponding numbers for Nvidia at the same sizes. Intel GPUs on this platform could not handle double precision arrays, so I had to reduce the real\*8 used above to real. These GPUs also could not handle the array size used above, so I had to reduce the size. For details see [test_omp_gpu.f90](https://github.com/nikizadehgfdl/platforms/blob/master/samples/gpu/openmp/test_omp_gpu.f90).

```
Intel GPU under Intel ifx -O2
!      size      time(s) iterations initial_sum     final_sum        omp_nthreads    subroutine
!      16000000     0.727     200    0.000165990830283    0.000015737356080    1     benchmark2d_omp_gpu
!      16000000     0.069     200    0.000165990830283    0.000015737356080    1     benchmark2d_docon        
!      16000000     1.355    2000    0.000165990830283    0.000006629713880    1     benchmark2d_omp_gpu
!      16000000     0.502    2000    0.000165990830283    0.000006629713880    1     benchmark2d_docon
!     100000000     4.387    2000    0.000066406377300    0.000002652283911    1     benchmark2d_omp_gpu
!     100000000     3.140    2000    0.000066406377300    0.000002652283911    1     benchmark2d_docon
!     400000000    14.626    2000    0.000033204814827    0.000001326215852    1     benchmark2d_omp_gpu
!     400000000    12.551    2000    0.000033204814827    0.000001326215852    1     benchmark2d_docon   
!     16000000 bombs
Nvidia GPU under nvfortran
!      16000000     0.023     200    0.000165991063113    0.000015737357899    1     benchmark2d_omp_gpu              
!      16000000     0.024     200    0.000165991063113    0.000015737357899    1     benchmark2d_docon  
!      16000000     0.080    2000    0.000165991063113    0.000006629711152    1     benchmark2d_omp_gpu
!      16000000     0.035    2000    0.000165991063113    0.000006629711152    1     benchmark2d_docon
!     100000000     0.245    2000    0.000066406464612    0.000002652284593    1     benchmark2d_omp_gpu             
!     100000000     0.144    2000    0.000066406464612    0.000002652284593    1     benchmark2d_docon
!
!     100000000   215.581    2000    0.000066406464612    0.000002652284593    2     benchmark2d_omp_cpu
!     100000000   429.280    2000    0.000066406464612    0.000002652284593    1     benchmark2d_omp_cpu
!
!     400000000     0.822    2000    0.000033204873034    0.000001326215056    1     benchmark2d_omp_gpu
!     400000000     0.430    2000    0.000033204873034    0.000001326215056    1     benchmark2d_docon
!    1600000000     3.203    2000    0.000016602891264    0.000000663109006    1     benchmark2d_omp_gpu
!    1600000000     1.919    2000    0.000016602891264    0.000000663109006    1     benchmark2d_docon   
```

In [14]:
!source envs.gpubox;\rm ./gpu_offload_test2d; nvfortran -O2 -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d;./gpu_offload_test2d



rm: cannot remove ‘./gpu_offload_test2d’: No such file or directory
      size      time(s) iterations initial_sum     final_sum        omp_nthreads    subroutine
     100000000     0.563    2000    0.000066406464612    0.000002652284593    1     benchmark2d_omp_gpu                               
     100000000     0.440    2000    0.000066406464612    0.000002652284593    1     benchmark2d_docon                                 
