# Advanced loop configuration

---
**Requirements:**

- [Get started](./Get_started.ipynb)
- [Variables_status](./Variables_status.ipynb)
- [Data management](./Data_management.ipynb)

---

Different levels of parallelism are generated by the _gang_, _worker_ and _vector_ clauses.
The _loop_ directive is responsible for sharing the parallelism across the different levels.

The degree of parallelism at a given level is determined by the number of gangs, the number of workers and the vector length. By default, these numbers are chosen by the implementation, and the choice depends not only on the target architecture but also on the portion of code to which the parallelism is applied.
Modifying this default behavior is generally not recommended, as the defaults usually give good performance.

It is however possible to specify the number of gangs, the number of workers and the vector length with the _num_gangs_, _num_workers_ and _vector_length_ clauses. These clauses are allowed on the _parallel_ and _kernels_ constructs. You might want to use them in order to:

- debug: restrict the execution to a single gang (without the restriction on workers and vectors that the _serial_ construct imposes), or vary the degree of parallelism in order to expose a race condition
- limit the number of gangs to lower the memory footprint when you have to privatize arrays
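As an illustration of the debugging use case, here is a minimal sketch that restricts a region to a single gang while keeping vector parallelism inside it. The program name, the array `a` and the size `n` are hypothetical and only serve the example.

```fortran
program single_gang_debug
    implicit none
    integer, parameter :: n = 1000   ! hypothetical problem size
    integer            :: i
    integer            :: a(n)

    ! num_gangs(1) restricts the region to one gang.
    ! The vector loop inside that gang is still executed in parallel.
    !$acc parallel num_gangs(1) copyout(a(:))
    !$acc loop vector
    do i = 1, n
        a(i) = 2*i
    enddo
    !$acc end parallel

    print *, a(1), a(n)
end program single_gang_debug
```

Running the same region with several values of _num_gangs_ and comparing the results is one way to look for a race condition: a result that changes with the number of gangs suggests a missing privatization or reduction.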

## Syntax

The clauses that set the number of gangs, the number of workers and the vector length are _num_gangs_, _num_workers_ and _vector_length_.

The argument of these clauses can be a literal value, a variable or, more generally, an integer expression such as `size_j/2`. A literal value must be a positive integer and a variable must be a scalar integer.

```fortran
!$acc parallel num_gangs(3500) vector_length(128)
!$acc loop gang
do j = 1, size_j
    !$acc loop vector
    do i = 1, size_i
        ! A fabulous calculation
    enddo
enddo
!$acc end parallel

!$acc parallel loop gang num_gangs(size_j/2) vector_length(128)
do j = 1, size_j
    !$acc loop vector
    do i = 1, size_i
        ! A fabulous calculation
    enddo
enddo
!$acc end parallel loop
```

## Restrictions

The restrictions described here are for NVIDIA architectures.

- The number of gangs is limited to 2³¹-1 (65535 if the compute capability is lower than 3.0).
- The product num_workers x vector_length cannot exceed 1024 (512 if the compute capability is lower than 2.0).
- For good performance, set vector_length to a multiple of 32 (up to 1024).
- Using routines with a _vector_ level of parallelism or higher sets the _vector_length_ to 32 (compiler limitation).

These restrictions can vary with the architecture. For future implementations, refer to the "CUDA C Programming Guide" (Section G, "Features and Technical Specifications").
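The restrictions above can be checked on the host before launching a region. The following is a minimal sketch in plain Fortran; the program name and the constants `max_threads_per_gang` and `vector_multiple` are hypothetical names that simply hold the limits quoted above for recent NVIDIA GPUs, and they should be adjusted for other architectures.

```fortran
program check_configuration
    implicit none
    ! Limits taken from the list above (assumed: recent NVIDIA GPU).
    integer, parameter :: max_threads_per_gang = 1024
    integer, parameter :: vector_multiple      = 32
    integer            :: nworkers, nvectors

    nworkers = 4
    nvectors = 128

    if (nworkers * nvectors > max_threads_per_gang) then
        ! The product num_workers x vector_length is too large.
        print *, "rejected: num_workers x vector_length exceeds", max_threads_per_gang
    else if (mod(nvectors, vector_multiple) /= 0) then
        ! Allowed, but likely to perform poorly.
        print *, "warning: vector_length is not a multiple of", vector_multiple
    else
        print *, "accepted:", nworkers, "workers x", nvectors, "vector lanes"
    endif
end program check_configuration
```

With the values chosen here (4 workers, vector length 128), the product is 512 and the vector length is a multiple of 32, so the configuration takes the "accepted" branch.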

## Example

Example stored in: `../../examples/Fortran/Loop_configuration_example.f90`

```fortran
%%idrrun -a
program loop_configuration
    use ISO_FORTRAN_ENV, only : INT32, REAL64
    use openacc
    implicit none
    integer(kind=INT32), parameter        :: n = 500
    integer(kind=INT32), dimension(n,n,n) :: table
    integer(kind=INT32)                   :: ngangs, nworkers, nvectors
    integer(kind=INT32)                   :: i, j, k

    ngangs   = 450
    nworkers = 4
    nvectors = 16

    !$acc parallel loop gang num_gangs(ngangs) num_workers(nworkers) vector_length(nvectors) copyout(table(:,:,:))
    do k = 1, n
        !$acc loop worker
        do j = 1, n
            !$acc loop vector
            do i = 1, n
                table(i,j,k) = i + j*1000 + k*1000*1000
            enddo
        enddo
    enddo

    print *, table(1,1,1), table(n,n,n)
end program loop_configuration
```

## Exercise

A simple exercise would be to modify the value of the _num_gangs_ clause (and also vary the vector length), then compare the execution times.
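Such a comparison could be sketched as follows, using the standard `system_clock` intrinsic. The program name, the list `trials` of gang counts and the reduced size `n = 200` are assumptions made for this illustration. The measured time includes the `copyout` transfer, and the first trial may also include the device initialization.

```fortran
program time_num_gangs
    use ISO_FORTRAN_ENV, only : INT32, INT64, REAL64
    implicit none
    integer(kind=INT32), parameter   :: n = 200
    ! Hypothetical gang counts to compare.
    integer(kind=INT32), parameter   :: trials(4) = [1, 10, 100, 200]
    integer(kind=INT32), allocatable :: table(:,:,:)
    integer(kind=INT32)              :: ngangs, i, j, k, t
    integer(kind=INT64)              :: t_start, t_end, rate

    allocate(table(n,n,n))
    call system_clock(count_rate=rate)

    do t = 1, size(trials)
        ngangs = trials(t)
        call system_clock(t_start)
        !$acc parallel loop gang num_gangs(ngangs) vector_length(128) copyout(table(:,:,:))
        do k = 1, n
            !$acc loop worker
            do j = 1, n
                !$acc loop vector
                do i = 1, n
                    table(i,j,k) = i + j*1000 + k*1000*1000
                enddo
            enddo
        enddo
        call system_clock(t_end)
        ! Elapsed wall-clock time in seconds for this gang count.
        print '(a,i6,a,f10.6,a)', "num_gangs = ", ngangs, "  time = ", &
              real(t_end - t_start, REAL64) / real(rate, REAL64), " s"
    enddo
end program time_num_gangs
```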

For a change, we will do an exercise that does not make physical sense. It can however come in handy, especially if you try the practical work on HYDRO. In this exercise, you will have to:

- parallelize a few lines of code
- make sure that it reproduces the CPU results
- manually modify the number of gangs
- observe that, in this code, the number of gangs is limited by the system size

Example stored in: `../../examples/Fortran/Loop_configuration_exercise.f90`

```fortran
%%idrrun --cliopts "500"
program Loop_configuration
    use ISO_FORTRAN_ENV, only : INT32, REAL64, INT64
    implicit none
    integer(kind=INT32 ), parameter              :: system_size = 50000
    real   (kind=REAL64), dimension(system_size) :: array
    real   (kind=REAL64), dimension(system_size) :: table
    real   (kind=REAL64)                         :: sum_val, res, norm, time
    integer(kind=INT32 )                         :: i, j, length, ngangs, numarg
    character(len=:)    , allocatable            :: arg1

    numarg = command_argument_count()
    if (numarg .ne. 1) then
        write(0,*) "Error, you should provide an argument of integer kind to specify the number of gangs that will be used"
        stop
    endif
    call get_command_argument(1,LENGTH=length)
    allocate(character(len=length) :: arg1)
    call get_command_argument(1,VALUE=arg1)
    read(arg1,'(i10)') ngangs

    norm    = 1.0_real64 / (int(system_size, INT64) * int(system_size, INT64))
    res     = 0.0_real64 ! to compare CPU and GPU quickly

    do j = 1, system_size
        sum_val = 0.0_real64
        do i = 1, system_size
            table(i) = (i+j) * norm
        enddo

        do i = 1, system_size
            sum_val  = sum_val + table(i)
        enddo
        array(j) = sum_val
        res = res + sum_val
    enddo

    print *, "result: ",res

!    do i = 1, system_size
!        print *, i,array(i)
!    enddo

end program Loop_configuration
```

### Solution

Example stored in: `../../examples/Fortran/Loop_configuration_solution.f90`

```fortran
%%idrrun -a --cliopts "500"
program Loop_configuration
    use ISO_FORTRAN_ENV, only : INT32, REAL64, INT64
    implicit none
    integer(kind=INT32 ), parameter              :: system_size = 50000
    real   (kind=REAL64), dimension(system_size) :: array
    real   (kind=REAL64), dimension(system_size) :: table
    real   (kind=REAL64)                         :: sum_val, res, norm, time
    integer(kind=INT32 )                         :: i, j, length, ngangs, numarg
    character(len=:)    , allocatable            :: arg1

    numarg = command_argument_count()
    if (numarg .ne. 1) then
        write(0,*) "Error, you should provide an argument of integer kind to specify the number of gangs that will be used"
        stop
    endif
    call get_command_argument(1,LENGTH=length)
    allocate(character(len=length) :: arg1)
    call get_command_argument(1,VALUE=arg1)
    read(arg1,'(i10)') ngangs

    norm    = 1.0_real64 / (int(system_size, INT64) * int(system_size, INT64))
    res     = 0.0_real64 ! to compare CPU and GPU quickly

    !$acc parallel num_gangs(ngangs) copyout(array(:)) private(table(:))
    !$acc loop gang reduction(+:res)
    do j = 1, system_size
        sum_val = 0.0_real64
        !$acc loop vector
        do i = 1, system_size
            table(i) = (i+j) * norm
        enddo

        !$acc loop vector reduction(+:sum_val)
        do i = 1, system_size
            sum_val  = sum_val + table(i)
        enddo
        array(j) = sum_val
        res = res + sum_val
    enddo
    !$acc end parallel

    print *, "result: ",res

!    do i = 1, system_size
!        print *, i,array(i)
!    enddo

end program Loop_configuration
```