# Using CUDA libraries

---
Requirements:

- [Get started](./Get_started.ipynb)
- [Atomic operations](./Atomic_operations.ipynb)
- [Data Management](./Data_management.ipynb)

---

OpenACC is interoperable with CUDA and with GPU-accelerated libraries.
This means that the device (GPU) pointer of a variable created with OpenACC can be passed to a CUDA function.

## `acc host_data use_device`

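To call a CUDA function, the host needs to retrieve the address of your variable on the GPU.
The `host_data use_device` directive provides it: inside the block, the listed variables refer to their device addresses.
For example: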
```fortran
real, dimension(system_size) :: array
!$acc enter data create(array(:))

!$acc host_data use_device(array)
    ! inside the block `array` stores the address on the GPU
    call cuda_function(array)
!$acc end host_data
```
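
The fragment above can be turned into a complete program. The sketch below is only an illustration: `scale_on_device` is a hypothetical stand-in for a CUDA library routine, written with OpenACC and a `deviceptr` clause so that it expects a device address, just as a CUDA function would. The names `device_stub`, `scale_on_device` and `factor` are assumptions and do not come from any library.

```fortran
module device_stub
    implicit none
contains
    ! Hypothetical stand-in for a CUDA library routine.
    ! The deviceptr clause states that `a` is already a device address.
    subroutine scale_on_device(a, n, factor)
        integer, intent(in) :: n
        real, intent(inout) :: a(n)
        real, intent(in)    :: factor
        integer :: i
        !$acc parallel loop deviceptr(a)
        do i = 1, n
            a(i) = factor * a(i)
        enddo
    end subroutine scale_on_device
end module device_stub

program host_data_sketch
    use device_stub
    implicit none
    integer, parameter :: system_size = 1000
    real, dimension(system_size) :: array

    array = 1.0
    ! Create the device copy and initialize it from the host
    !$acc enter data copyin(array(:))

    ! Inside the block, `array` designates the device address
    !$acc host_data use_device(array)
    call scale_on_device(array, system_size, 2.0)
    !$acc end host_data

    ! Bring the result back to the host and release the device copy
    !$acc exit data copyout(array(:))
    print *, "min/max after scaling:", minval(array), maxval(array)
end program host_data_sketch
```

Without `host_data`, the call would receive the host address and the `deviceptr` routine would dereference it on the GPU. When the file is compiled without OpenACC support, the directives are ignored and the same code runs on the host.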

## Example with CURAND

The pseudo-random number generators of the standard libraries cannot (as of 2021) be used inside OpenACC compute regions.
One solution is to use CURAND from NVIDIA.

In this example, we use CURAND to generate a large array of random integers in [0,9].
The number of occurrences of each value is then counted on the GPU with OpenACC.
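
The counting step on its own can be sketched as follows. This is a simplified illustration, not the course example: it assumes the random values are drawn on the host with the Fortran intrinsic `random_number` (instead of CURAND) and uses a much smaller, arbitrary `nshots`. Only the histogram loop is offloaded.

```fortran
program histogram_sketch
    implicit none
    integer, parameter   :: nshots = 100000   ! arbitrary size for the sketch
    real, allocatable    :: u(:)
    integer, allocatable :: shots(:)
    integer              :: histo(10)
    integer              :: i

    allocate(u(nshots), shots(nshots))

    ! Host-side generator: 0 <= u < 1
    call random_number(u)
    ! Map to integers in [0,9]; min() guards against rounding up to 10
    shots = min(int(10.0 * u), 9)
    histo = 0

    ! histo is initialized on the host, so it must be copied in AND out
    !$acc parallel loop copyin(shots(:)) copy(histo(:))
    do i = 1, nshots
        ! Several iterations may hit the same bin: the update must be atomic
        !$acc atomic update
        histo(shots(i) + 1) = histo(shots(i) + 1) + 1
    enddo

    do i = 1, 10
        print *, i - 1, histo(i)
    enddo
    print *, "total counted:", sum(histo), " expected:", nshots
    deallocate(u, shots)
end program histogram_sketch
```

The `copy` clause matters here: with `copyout`, the zeros written on the host would never reach the device and the counters would start from uninitialized memory.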

The code that generates the list of integers is provided, but explaining it is beyond the scope of this training course.

Example stored in: `../../examples/Fortran/Using_CUDA_random_example.f90`

In [None]:
%%idrrun -a --options "-Mcudalib=curand"
program using_cuda
    use openacc
    use openacc_curand
    use, intrinsic :: ISO_FORTRAN_ENV , only : REAL32, INT32 
    implicit none      

    integer(kind=INT32), dimension(:), allocatable :: shots
    integer(kind=INT32)                            :: histo(10)
    integer(kind=INT32)                            :: nshots
    type(curandStateXORWOW)                        :: h
    integer(kind=INT32)                            :: seed,seq,offset    
    integer(kind=INT32)                            :: i

    do i = 1, 10
        histo(i) = 0
    enddo

    nshots = 1000000000
    !  Allocate memory for the random numbers
    allocate(shots(nshots))

    ! NVIDIA curand will create our initial random data
    ! histo is zeroed on the host, so it is copied in as well as out
    !$acc parallel create(shots(:)) copy(histo(:))
    seed = 1234
    seq = 0
    offset = 0
    !$acc loop vector
    do i = 1, 32
        call curand_init(seed, seq, offset, h)
    enddo

    !$acc loop
    do i = 1, nshots
        shots(i) = abs(curand(h))
    enddo

    !  Count the number of times each digit was drawn
    !$acc loop
    do i = 1, nshots
        shots(i) = mod(shots(i),10) + 1
        !$acc atomic update
        histo(shots(i)) = histo(shots(i)) + 1
    enddo
    !$acc end parallel
    !  Print results
    do i = 1, 10 
        write(0,"(i2,a2,i10,a2,f5.3,a1)") i,": ",histo(i)," (",dble(histo(i))/dble(1e9),")" 
    enddo
    deallocate(shots)
end program using_cuda

It is also possible to write an interface to call CUDA library functions and your own CUDA kernels directly.
The example below reproduces the CURAND example above, this time calling the CURAND host API through a C wrapper instead of the `openacc_curand` module.

In [None]:
%%idrrun -n -l cuda --options "-arch=sm_70" --object --keep Using_CUDA_random_function.cu
#include <stdio.h>
#include <stdlib.h>   // exit, EXIT_FAILURE
#include <curand.h>

// Fill d_buffer (a device pointer) with num uniform floats in (0, 1]
extern "C" void fill_rand(float *d_buffer, int num, void *stream)
{
  curandGenerator_t gen;
  int status;

  // Create generator
  status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);

  // Set CUDA stream
  status |= curandSetStream(gen, (cudaStream_t)stream);

  // Set seed
  status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

  // Generate num random numbers
  status |= curandGenerateUniform(gen, d_buffer, num);

  // Cleanup generator
  status |= curandDestroyGenerator(gen);

  if (status != CURAND_STATUS_SUCCESS) {
      printf ("curand failure!\n");
      exit (EXIT_FAILURE);
  }
}

In [None]:
%%idrrun -a -l fortran --options "-Mcudalib=curand Using_CUDA_random_function.cu.o"
program using_cuda
    use ISO_C_BINDING
    use openacc
    use, intrinsic :: ISO_FORTRAN_ENV , only : REAL32, INT32
    implicit none

    interface
        subroutine fill_rand(positions, length, stream) BIND(C,NAME='fill_rand')
            use ISO_C_BINDING
            use openacc
            implicit none
            type (C_PTR)   , value          :: positions
            integer (C_INT), value          :: length
            integer(acc_handle_kind), value :: stream
        end subroutine fill_rand
     end interface


    ! target is required by the Fortran standard for the argument of C_LOC
    integer(C_INT), dimension(:), allocatable, target :: shots
    integer(C_INT)                            :: histo(10)
    integer(C_INT)                            :: nshots
    integer(acc_handle_kind)                  :: stream

    integer(kind=INT32)                       :: i

    do i = 1, 10
        histo(i) = 0
    enddo

    nshots = 1000000000
    !  Allocate memory for the random numbers
    allocate(shots(nshots))

    ! OpenACC may not use the default CUDA stream so we must query it
    stream = acc_get_cuda_stream(acc_async_sync)

    ! histo is zeroed on the host, so it is copied in as well as out
    !$acc data create(shots(:)) copy(histo(:))
    ! NVIDIA CURAND will create our initial random data
    !$acc host_data use_device(shots)
    call fill_rand(C_LOC(shots), nshots, stream)
    !$acc end host_data

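    !  Count the number of times each digit was drawn.
    !  Note: fill_rand wrote 32-bit floats in (0,1] into shots; their bit
    !  patterns are reinterpreted here as (positive) integers.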
    !$acc parallel loop present(shots(:), histo(:))
    do i = 1, nshots
        shots(i) = mod(shots(i),10) + 1
        !$acc atomic update
        histo(shots(i)) = histo(shots(i)) + 1
    enddo
    !$acc end data
    !  Print results
    do i = 1, 10
        write(0,"(i2,a2,i10,a2,f5.3,a1)") i,": ",histo(i)," (",dble(histo(i))/dble(1e9),")"
    enddo
    deallocate(shots)
end program using_cuda