# GPU offloading in Fortran
This notebook walks through a few test programs for studying GPU offloading in Fortran. The main routine is a simple 2D array calculation:
```
do j = 0, n-1
   do i = 0, m-1
      iter = 0
      do while (iter < iter_max)
         A(i,j) = A(i,j)*(A(i,j) - 1.0)
         iter = iter + 1
      enddo
   enddo
enddo
```

The runtime is proportional to the size of the array (m\*n) and to the number of iterations (iter_max).  
The code inside the ij loop is meant to simulate an ij-independent workload. Such an embarrassingly parallel workload is usually not representative of the ones that arise in science and engineering. It was designed only as a starting point for trying various offloading schemes and comparing their potential utility in Fortran.
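
To make the proportionality claim easy to check on a CPU, here is a minimal, self-contained sketch that wraps the kernel in a timed subroutine and runs it for three small cases. The program name, the case sizes, and the `random_number` initialization are assumptions made for illustration; they are not taken from `gpu_offload_benchmark2d.f90`.

```fortran
program benchmark2d_scaling
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none

  ! Hypothetical small cases; each one performs m*n*iter_max updates.
  call run_case(500, 500, 200)
  call run_case(1000, 1000, 200)   ! 4x the array size of the first case
  call run_case(500, 500, 800)     ! 4x the iterations of the first case

contains

  subroutine run_case(m, n, iter_max)
    integer, intent(in) :: m, n, iter_max
    real(real64), allocatable :: A(:,:)
    real(real64) :: initial_sum
    integer :: i, j, iter
    integer(int64) :: t0, t1, rate

    allocate(A(0:m-1, 0:n-1))
    call random_number(A)          ! assumed initialization, values in [0,1)
    initial_sum = sum(A)

    call system_clock(t0, rate)
    do j = 0, n-1
       do i = 0, m-1
          iter = 0
          do while (iter < iter_max)
             A(i,j) = A(i,j)*(A(i,j) - 1.0_real64)
             iter = iter + 1
          end do
       end do
    end do
    call system_clock(t1)

    ! Columns: size, time(s), iterations, initial sum, final sum
    print '(i12, f10.3, i8, 2es22.12)', m*n, &
         real(t1 - t0, real64)/real(rate, real64), iter_max, initial_sum, sum(A)
  end subroutine run_case

end program benchmark2d_scaling
```

If the runtime scales as stated, the second and third cases should each take roughly four times as long as the first, although timings this short are noisy.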

## Executive Summary
- Offloading via the Fortran "do concurrent" scheme is on par with OpenACC "managed mode" in speed (best times of 0.709 s and 0.666 s for 1e8 elements and 2000 iterations on the Nvidia GPU).
- Offloading via the OpenMP "target" scheme is on par with OpenACC "non-managed mode" in speed (best times of 1.086 s for the teams variant and 1.114 s for the same case).
- Both the "do concurrent" and the OpenMP "target" schemes are available, and were tested, in nvfortran for Nvidia GPUs and in ifx for Intel GPUs.
- Cross-compatibility (using ifx to offload to an Nvidia device, or nvfortran to offload to an Intel device) was not tested.


### OpenMP on CPU

In [None]:
!source envs.gpubox; \rm gpu_offload_benchmark2d ; nvfortran -mp gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d

In [15]:
##This test could take too long to run on CPU
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
!echo '16' | ./gpu_offload_benchmark2d
#      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
#    1600000000 12306.969    2000    0.000016602849567    0.000000663121053    1     benchmark2d_omp 
#     100000000   770.307    2000    0.000066406416776    0.000002652284758    1     benchmark2d_omp
#     100000000   387.471    2000    0.000066406416776    0.000002652284758    2     benchmark2d_omp
#     100000000   200.788    2000    0.000066406416776    0.000002652284758    4     benchmark2d_omp
#     100000000    99.781    2000    0.000066406416776    0.000002652284758    8     benchmark2d_omp
#     100000000    54.811    2000    0.000066406416776    0.000002652284758   16     benchmark2d_omp


Lmod is automatically replacing "gcc/11.3.0" with "nvhpc-no-mpi/22.5".


Due to MODULEPATH changes, the following have been reloaded:
  1) hdf5/1.12.2     2) netcdf/4.9.0     3) openmpi/4.1.4

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000    58.132    2000    0.000066406416776    0.000002652284758   16     benchmark2d_omp                                   
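
For orientation, a CPU-threaded version of the kernel could look like the sketch below. It is not the actual `benchmark2d_omp` routine, whose source is not shown here; the program name, sizes, and initialization are assumptions. The `!$` conditional-compilation lines are only active when OpenMP is enabled (for example with `-mp` in nvfortran), so the same file also builds as a serial program.

```fortran
program benchmark2d_omp_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  !$ use omp_lib, only: omp_get_max_threads
  implicit none
  ! Hypothetical problem size, much smaller than the notebook runs.
  integer, parameter :: m = 1000, n = 1000, iter_max = 200
  real(real64), allocatable :: A(:,:)
  integer :: i, j, iter, nthreads

  allocate(A(0:m-1, 0:n-1))
  call random_number(A)            ! assumed initialization

  nthreads = 1
  !$ nthreads = omp_get_max_threads()

  ! Every (i,j) element is independent, so both loops can be collapsed
  ! into one iteration space and shared among the threads.
  !$omp parallel do collapse(2) private(iter)
  do j = 0, n-1
     do i = 0, m-1
        do iter = 1, iter_max
           A(i,j) = A(i,j)*(A(i,j) - 1.0_real64)
        end do
     end do
  end do
  !$omp end parallel do

  print *, 'threads =', nthreads, ' final sum =', sum(A)
end program benchmark2d_omp_sketch
```

The thread count can be set with the standard `OMP_NUM_THREADS` environment variable. In the recorded runs above, the time drops from 770.307 s on 1 thread to 54.811 s on 16 threads, a speedup of about 14x.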


### OpenMP offload to GPU

In [2]:
!source envs.gpubox; \rm gpu_offload_benchmark2d ; nvfortran -mp=gpu gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [3]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-1' | ./gpu_offload_benchmark2d
    
#  1600000000     9.358    2000    0.000016602849567    0.000000663121053    0     benchmark2d_omp_gpu
#   100000000     1.438    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.438    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               
     100000000     1.470    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               
     100000000     1.476    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               
     100000000     1.450    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu                               


In [4]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-2' | ./gpu_offload_benchmark2d
    
# 1600000000     7.342    2000    0.000016602849567    0.000000663121053    0     benchmark2d_omp_gpu_teams 
#  100000000     1.086    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.174    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         
     100000000     1.086    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         
     100000000     1.193    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         
     100000000     1.222    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams                         
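
The two routine names above suggest one variant that uses a plain `target` region and one that adds `teams`. The sketch below shows what such a pair could look like; it is an assumption based on the names only, not the code of `benchmark2d_omp_gpu` or `benchmark2d_omp_gpu_teams`. Program name, sizes, and initialization are hypothetical.

```fortran
program benchmark2d_omp_gpu_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: m = 1000, n = 1000, iter_max = 200
  real(real64), allocatable :: A(:,:), B(:,:)
  integer :: i, j, iter

  allocate(A(0:m-1, 0:n-1), B(0:m-1, 0:n-1))
  call random_number(A)            ! assumed initialization
  B = A                            ! same input for both variants

  ! Variant 1: a target region containing an ordinary parallel loop.
  ! This runs in a single team on the device.
  !$omp target map(tofrom: A)
  !$omp parallel do collapse(2) private(iter)
  do j = 0, n-1
     do i = 0, m-1
        do iter = 1, iter_max
           A(i,j) = A(i,j)*(A(i,j) - 1.0_real64)
        end do
     end do
  end do
  !$omp end parallel do
  !$omp end target

  ! Variant 2: distribute the collapsed loop over a league of teams,
  ! then over the threads within each team.
  !$omp target teams distribute parallel do collapse(2) private(iter) map(tofrom: B)
  do j = 0, n-1
     do i = 0, m-1
        do iter = 1, iter_max
           B(i,j) = B(i,j)*(B(i,j) - 1.0_real64)
        end do
     end do
  end do
  !$omp end target teams distribute parallel do

  print *, 'max |A - B| between the two variants =', maxval(abs(A - B))
end program benchmark2d_omp_gpu_sketch
```

Without an OpenMP flag the directives are treated as comments and both loops run serially on the host, which is a convenient way to check the results before offloading.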


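The following is a hack to use 2 GPU devices by dividing the 2D array into 2 blocks. In the runs recorded below it is faster than the single-device `benchmark2d_omp_gpu` routine at 1.6e9 elements (9.016 s vs 9.358 s), but not faster than the single-device teams variant (7.342 s).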

In [5]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-3' | ./gpu_offload_benchmark2d
    
#1600000000     9.016    2000    0.000016602849567    0.000000663121053    0     benchmark2d_omp_gpu_teams_2devs
# 100000000     1.153    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.164    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   
     100000000     1.348    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   
     100000000     1.153    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   
     100000000     1.303    2000    0.000066406416776    0.000002652284758    0     benchmark2d_omp_gpu_teams_2devs                   
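
One way to split the array between two devices with OpenMP is to launch one asynchronous `target` region per block of columns and wait for both. The sketch below illustrates that idea only; it is not the code of `benchmark2d_omp_gpu_teams_2devs`, and the names `half`, `dev0`, and `dev1`, as well as the fallback to a single device, are hypothetical.

```fortran
program benchmark2d_two_devices_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  !$ use omp_lib, only: omp_get_num_devices
  implicit none
  integer, parameter :: m = 1000, n = 1000, iter_max = 200
  real(real64), allocatable :: A(:,:)
  integer :: i, j, iter, half, ndev, dev0, dev1

  allocate(A(0:m-1, 0:n-1))
  call random_number(A)            ! assumed initialization

  ndev = 0
  !$ ndev = omp_get_num_devices()
  dev0 = 0
  dev1 = merge(1, 0, ndev >= 2)    ! reuse device 0 if fewer than 2 devices
  half = n/2

  ! First block of columns (j = 0 .. half-1) on device dev0.
  ! nowait turns the target region into a deferred task.
  !$omp target teams distribute parallel do collapse(2) private(iter) &
  !$omp&   map(tofrom: A(:, 0:half-1)) device(dev0) nowait
  do j = 0, half-1
     do i = 0, m-1
        do iter = 1, iter_max
           A(i,j) = A(i,j)*(A(i,j) - 1.0_real64)
        end do
     end do
  end do
  !$omp end target teams distribute parallel do

  ! Second block of columns (j = half .. n-1) on device dev1.
  !$omp target teams distribute parallel do collapse(2) private(iter) &
  !$omp&   map(tofrom: A(:, half:n-1)) device(dev1) nowait
  do j = half, n-1
     do i = 0, m-1
        do iter = 1, iter_max
           A(i,j) = A(i,j)*(A(i,j) - 1.0_real64)
        end do
     end do
  end do
  !$omp end target teams distribute parallel do

  ! Wait for both deferred target tasks before using A on the host.
  !$omp taskwait

  print *, 'devices found =', ndev, ' final sum =', sum(A)
end program benchmark2d_two_devices_sketch
```

Splitting along the second index keeps each block contiguous in memory, because Fortran arrays are stored column by column.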


## OpenACC offload to GPU
### A) managed mode

In [6]:
!source envs.gpubox;\rm ./gpu_offload_benchmark2d ; nvfortran -mp -acc -ta=nvidia:managed gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [7]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-5' | ./gpu_offload_benchmark2d
    
#1600000000     2.979    2000    0.000016602849567    0.000000663121053    0     benchmark2d_acc
# 100000000     0.666    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.726    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     0.748    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     0.666    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     0.741    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   


The following is a hack to use 2 GPU devices by dividing the 2D array into 2 blocks. The gain grows with the array size: the best recorded times are 0.573 s vs 0.666 s at 1e8 elements and 2.155 s vs 2.979 s at 1.6e9 elements.

In [8]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-52' | ./gpu_offload_benchmark2d

#1600000000     2.155    2000    0.000016602849567    0.000000663121053    0     benchmark2d_acc_2dev   
# 100000000     0.573    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev    

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.631    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              
     100000000     0.573    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              
     100000000     0.666    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              
     100000000     0.668    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc_2dev                              


### B) non-managed mode

In [9]:
!source envs.gpubox;\rm ./gpu_offload_benchmark2d ; nvfortran -mp -acc gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [10]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-5' | ./gpu_offload_benchmark2d
    
#1600000000     7.428    2000    0.000016602849567    0.000000663121053    0     benchmark2d_acc
# 100000000     1.114    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     1.169    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     1.114    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     1.291    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
     100000000     1.206    2000    0.000066406416776    0.000002652284758    0     benchmark2d_acc                                   
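
The same routine, `benchmark2d_acc`, is timed in both the managed and the non-managed build; only the compile flags differ. A minimal OpenACC version of the kernel might look like the sketch below. It is an illustration under assumed names, sizes, and initialization, not the routine's actual source.

```fortran
program benchmark2d_acc_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: m = 1000, n = 1000, iter_max = 200
  real(real64), allocatable :: A(:,:)
  integer :: i, j, iter

  allocate(A(0:m-1, 0:n-1))
  call random_number(A)            ! assumed initialization

  ! copy(A): transfer A to the device on entry and back on exit.
  ! collapse(2): treat the i and j loops as one parallel iteration space.
  !$acc parallel loop collapse(2) copy(A)
  do j = 0, n-1
     do i = 0, m-1
        ! The iteration loop carries a dependence on A(i,j), so it is
        ! marked sequential within each (i,j) work item.
        !$acc loop seq
        do iter = 1, iter_max
           A(i,j) = A(i,j)*(A(i,j) - 1.0_real64)
        end do
     end do
  end do
  !$acc end parallel loop

  print *, 'final sum =', sum(A)
end program benchmark2d_acc_sketch
```

In the non-managed build the `copy(A)` clause is what moves the data explicitly. In the managed build, allocatable data is expected to be placed in unified memory and migrated on demand, so the clause should matter much less there; this description of the managed behavior is my understanding of the NVHPC compilers rather than something measured in this notebook.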


## do concurrent

In [11]:
!source envs.gpubox; \rm gpu_offload_benchmark2d ; nvfortran -mp -stdpar gpu_offload_benchmark2d.f90 -o gpu_offload_benchmark2d





In [12]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-4' | ./gpu_offload_benchmark2d
    
#1600000000     2.941    2000    0.000016602849567    0.000000663121053    0     benchmark2d_docon   
# 100000000     0.709    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon 

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.778    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 
     100000000     0.718    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 
     100000000     0.784    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 
     100000000     0.709    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon                                 
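
With `do concurrent` no directives are needed: the loop header itself asserts that the (i,j) iterations are independent, and `-stdpar` lets nvfortran offload it. The sketch below shows one possible form of the kernel; it is not the source of `benchmark2d_docon`, and the program name, sizes, and initialization are assumptions.

```fortran
program benchmark2d_docon_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: m = 1000, n = 1000, iter_max = 200
  real(real64), allocatable :: A(:,:)
  integer :: i, j, iter

  allocate(A(0:m-1, 0:n-1))
  call random_number(A)            ! assumed initialization

  ! Both indices go into a single do concurrent header.
  ! iter is defined in every iteration before it is referenced, which
  ! the standard permits; its value is undefined after the loop.
  do concurrent (j = 0:n-1, i = 0:m-1)
     do iter = 1, iter_max
        A(i,j) = A(i,j)*(A(i,j) - 1.0_real64)
     end do
  end do

  print *, 'final sum =', sum(A)
end program benchmark2d_docon_sketch
```

Compilers that support the Fortran 2018 locality specifiers also accept `local(iter)` on the `do concurrent` header, which states the privatization of `iter` explicitly.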


The following is a hack to use 2 GPU devices by dividing the 2D array into 2 blocks. The gain grows with the array size: the best recorded times are 0.643 s vs 0.709 s at 1e8 elements and 2.152 s vs 2.941 s at 1.6e9 elements.

In [13]:
print("      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine")
for i in range(1,5):
    !echo '-42' | ./gpu_offload_benchmark2d
    
# 1600000000     2.152    2000    0.000016602849567    0.000000663121053    0     benchmark2d_docon_2dev_hack
#  100000000     0.643    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack

      size      time(s) iterations initial_sum          final_sum        omp_nthreads    subroutine
     100000000     0.692    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       
     100000000     0.695    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       
     100000000     0.696    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       
     100000000     0.643    2000    0.000066406416776    0.000002652284758    0     benchmark2d_docon_2dev_hack                       


### Quick test for GPU offload with a single-precision array
This is a Fortran program to quickly test GPU offload via OpenMP and do concurrent. It has been run on Intel (ifx) and Nvidia (nvfortran) GPU platforms.

## Intel GPUs
Here are some timing numbers I obtained on the Intel DevCloud platform for Intel GPUs under Intel ifx, together with corresponding numbers for Nvidia GPUs at the same sizes. The Intel GPUs on this platform could not handle double-precision arrays, so I had to change the real\*8 used above to real. These GPUs also could not handle the array size used above, so I had to reduce the size. For details see [test_omp_gpu.f90](https://github.com/nikizadehgfdl/platforms/blob/master/samples/gpu/openmp/test_omp_gpu.f90).

```
Intel GPU under Intel ifx -O2
!      size      time(s) iterations initial_sum     final_sum        omp_nthreads    subroutine
!      16000000     0.727     200    0.000165990830283    0.000015737356080    1     benchmark2d_omp_gpu
!      16000000     0.069     200    0.000165990830283    0.000015737356080    1     benchmark2d_docon        
!      16000000     1.355    2000    0.000165990830283    0.000006629713880    1     benchmark2d_omp_gpu
!      16000000     0.502    2000    0.000165990830283    0.000006629713880    1     benchmark2d_docon
!     100000000     4.387    2000    0.000066406377300    0.000002652283911    1     benchmark2d_omp_gpu
!     100000000     3.140    2000    0.000066406377300    0.000002652283911    1     benchmark2d_docon
!     400000000    14.626    2000    0.000033204814827    0.000001326215852    1     benchmark2d_omp_gpu
!     400000000    12.551    2000    0.000033204814827    0.000001326215852    1     benchmark2d_docon   
!     16000000 bombs
Nvidia GPU under nvfortran
!      16000000     0.023     200    0.000165991063113    0.000015737357899    1     benchmark2d_omp_gpu              
!      16000000     0.024     200    0.000165991063113    0.000015737357899    1     benchmark2d_docon  
!      16000000     0.080    2000    0.000165991063113    0.000006629711152    1     benchmark2d_omp_gpu
!      16000000     0.035    2000    0.000165991063113    0.000006629711152    1     benchmark2d_docon
!     100000000     0.245    2000    0.000066406464612    0.000002652284593    1     benchmark2d_omp_gpu             
!     100000000     0.144    2000    0.000066406464612    0.000002652284593    1     benchmark2d_docon
!
!     100000000   215.581    2000    0.000066406464612    0.000002652284593    2     benchmark2d_omp_cpu
!     100000000   429.280    2000    0.000066406464612    0.000002652284593    1     benchmark2d_omp_cpu
!
!     400000000     0.822    2000    0.000033204873034    0.000001326215056    1     benchmark2d_omp_gpu
!     400000000     0.430    2000    0.000033204873034    0.000001326215056    1     benchmark2d_docon
!    1600000000     3.203    2000    0.000016602891264    0.000000663109006    1     benchmark2d_omp_gpu
!    1600000000     1.919    2000    0.000016602891264    0.000000663109006    1     benchmark2d_docon   
```
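
Switching between `real*8` and `real` is easier when the precision is a single named constant. The sketch below uses a hypothetical kind parameter `wp` and runs the kernel once through OpenMP `target` and once through `do concurrent`, mirroring the two routines in the table. It is an illustration only and not the contents of `test_omp_gpu.f90`.

```fortran
program offload_precision_sketch
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  ! Working precision: real32 for single precision,
  ! change to real64 to get the double-precision version.
  integer, parameter :: wp = real32
  integer, parameter :: m = 1000, n = 1000, iter_max = 200
  real(wp), allocatable :: A(:,:), B(:,:)
  integer :: i, j, iter

  allocate(A(0:m-1, 0:n-1), B(0:m-1, 0:n-1))
  call random_number(A)            ! assumed initialization
  B = A                            ! same input for both schemes

  ! OpenMP target offload on A.
  !$omp target teams distribute parallel do collapse(2) private(iter) map(tofrom: A)
  do j = 0, n-1
     do i = 0, m-1
        do iter = 1, iter_max
           A(i,j) = A(i,j)*(A(i,j) - 1.0_wp)
        end do
     end do
  end do
  !$omp end target teams distribute parallel do

  ! do concurrent on B.
  do concurrent (j = 0:n-1, i = 0:m-1)
     do iter = 1, iter_max
        B(i,j) = B(i,j)*(B(i,j) - 1.0_wp)
     end do
  end do

  print *, 'bytes per element =', storage_size(A)/8
  print *, 'max |A - B|       =', maxval(abs(A - B))
end program offload_precision_sketch
```

Writing the literal as `1.0_wp` keeps the whole expression in the chosen precision, so no other line has to change when `wp` does.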

In [14]:
!source envs.gpubox;\rm ./gpu_offload_test2d; nvfortran -O2 -mp=gpu -stdpar gpu_offload_test2d.f90 -o gpu_offload_test2d;./gpu_offload_test2d



rm: cannot remove ‘./gpu_offload_test2d’: No such file or directory
      size      time(s) iterations initial_sum     final_sum        omp_nthreads    subroutine
     100000000     0.563    2000    0.000066406464612    0.000002652284593    1     benchmark2d_omp_gpu                               
     100000000     0.440    2000    0.000066406464612    0.000002652284593    1     benchmark2d_docon                                 
