# Asynchronous and concurrent execution on GPUs
>*Melina Abeling, Julian Aeissen and Michele Pagani*. Supervised by *Oliver Fuhrer*.

## Introduction

<div style='text-align: justify;
            width: 90%;'>
In order to better represent Earth's complex climate, the resolution of climate models is continuously increasing, which improves the skill of climate and weather forecasts. In numerical weather prediction, a grid consisting of cells or points, for example an icosahedral grid (ICON model), is projected onto the globe to discretize and solve the governing equations, which are often differential equations, over the whole grid (Zängl et al., 2015; Wahib & Maruyama, 2014). Higher spatial resolution requires more computing power, because a finer grid consists of more cells and points for which calculations have to be carried out.
Here, high-performance computing and especially graphics processing units (GPUs), hardware that accelerates computing partly through its ability to process many tasks in parallel, are important to meet the demand for better computing performance.
One method commonly used to solve the necessary partial differential equations in climate models is the stencil motif (Schäfer & Fey, 2011; Wahib & Maruyama, 2014).
Stencil algorithms are discrete in space and time and uniformly update the value of each grid point based on its own value and those of its neighbouring grid points (Schäfer & Fey, 2011).
<br> <br>
XXXXXX
<br> <br>
This project aims to investigate how the maximum efficiency for tasks of various sizes differs between asynchronous and concurrent execution on GPUs, and whether it is possible at all to reach full utilization of the available hardware. For this, a simple 2D Jacobi stencil and a Gaussian stencil are used to investigate which execution mode leads to maximum efficiency in which scenario, and in which scenarios maximum efficiency cannot be reached.
<br> <br>
XXXXXXXX
</div>

***

## Methods

### Initial fields
*TODO*
> Square and grid, to better check the correctness of the stencil and of the computations

![Initial Fields](images/InitialField.png)


<div style='text-align: justify;
            width: 90%;'>
    
To investigate whether, or rather when, asynchronous or concurrent execution on GPUs is preferable, different scenarios were applied. Two stencils of different size were used in order to see the influence of stencil size on the performance of GPU execution (see DIAGRAM).

### PICTURE

#### 2D Jacobi stencil
The Jacobi stencil calculates a weighted average of the centre grid point and its four nearest neighbours and is therefore a five-point stencil. In one time step, two multiplications and four additions are performed per grid point. With double-precision numbers at least 8 bytes (a single value) have to be read or written, yielding an arithmetic intensity of < 0.5 FLOP/Byte.
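
The five-point update described above can be sketched in a few lines of NumPy. This is only an illustration and not necessarily the implementation used in this project: the function name `jacobi_step`, the weights `w_center` and `w_neighbour`, and the choice to keep the boundary values fixed are assumptions made for the example.

```python
import numpy as np


def jacobi_step(field, w_center=0.5, w_neighbour=0.125):
    """Apply one five-point Jacobi update to the interior of a 2D field.

    The weights are hypothetical example values (w_center + 4 * w_neighbour = 1).
    Boundary cells are copied unchanged, which is an assumption of this sketch.
    """
    new = field.copy()
    # Sum of the four nearest neighbours: three additions per grid point.
    neighbours = (
        field[:-2, 1:-1]    # north
        + field[2:, 1:-1]   # south
        + field[1:-1, :-2]  # west
        + field[1:-1, 2:]   # east
    )
    # Two multiplications and one more addition per grid point.
    new[1:-1, 1:-1] = w_center * field[1:-1, 1:-1] + w_neighbour * neighbours
    return new
```

With these example weights a constant field stays constant, which gives a quick correctness check for the stencil.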

![Jacobi](images/JacobiStencil_Scheme.png)

### PICTURE

#### Effect of the Jacobi Stencil after 1000 iterations

![Effect Jacobi](images/EffectStencilA.png)


#### Gaussian 5x5 stencil
The Gaussian 5x5 stencil, on the other hand, is much larger, being a 25-point stencil. It is a discrete approximation of the 2D Gaussian filter (blur). In one time step (grid update), 24 FLOP are performed per grid point (an addition, a multiplication, or an addition followed by a multiplication). Again, at least one number is read or written in double precision, yielding an arithmetic intensity of < 3 FLOP/Byte, which can be considerably larger than for the smaller Jacobi stencil.
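
A 25-point update of this kind can be sketched as follows. The kernel is an assumption: the sketch uses the common binomial approximation of a Gaussian (outer product of `[1, 4, 6, 4, 1]`, normalized to sum to one), which may differ from the coefficients used in this project. The names `gaussian_kernel_5x5` and `gauss_step` are hypothetical, and the two outermost rows and columns are left unchanged.

```python
import numpy as np


def gaussian_kernel_5x5():
    """Binomial 5x5 approximation of a 2D Gaussian (assumed coefficients)."""
    row = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
    kernel = np.outer(row, row)
    return kernel / kernel.sum()


def gauss_step(field, kernel):
    """Apply one 5x5 stencil update to the interior of a 2D field.

    The two outermost rows and columns are copied unchanged (assumption).
    """
    ny, nx = field.shape
    new = field.copy()
    acc = np.zeros((ny - 4, nx - 4), dtype=field.dtype)
    # Accumulate the 25 weighted, shifted views of the old field.
    for dj in range(5):
        for di in range(5):
            acc += kernel[dj, di] * field[dj:dj + ny - 4, di:di + nx - 4]
    new[2:-2, 2:-2] = acc
    return new
```

Because the kernel is normalized, a single non-zero value in the interior is spread over its 5x5 neighbourhood while the total sum of the field is preserved.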

![Gauss Scheme](images/GaussStencilScheme.png)

#### Effect of the Gaussian blur after 1000 iterations

![Effect Gauss](images/EffectStencilB.png)


#### GPU parallelization

To investigate the impact of different levels of concurrency on GPU performance, two approaches were applied. In the first approach, the performance at different levels of parallelization was compared: the total field was divided into a varying number of tiles, and each tile was assigned to a different stream on the GPU. In the second approach, the field was again divided into tiles, but the tiles were executed sequentially in a for loop, which represents different tasks for the CPU.

In both approaches the field is tiled into equal portions. The first analysis compares the performance at different levels of parallelization (GPU level vs. stream level, with one tile per stream). The second analysis looks at the parallelization of multiple independent tasks with streams (a fixed number of tiles, subdivided among the streams in different ways).
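
The tiling itself can be illustrated with a small CPU-only sketch. It splits the interior of the field into roughly equal tiles and updates them one after another in a for loop; on the GPU, each loop iteration would instead be submitted to a CUDA stream. The helper names `tile_slices`, `jacobi_tile` and `jacobi_tiled`, the Jacobi weights and the fixed boundary are assumptions made for this example and are not taken from the project code.

```python
import numpy as np


def tile_slices(ny, nx, tiles_y, tiles_x):
    """Split the interior (without the 1-cell boundary) into tiles_y x tiles_x tiles."""
    y_edges = np.linspace(1, ny - 1, tiles_y + 1).astype(int)
    x_edges = np.linspace(1, nx - 1, tiles_x + 1).astype(int)
    return [
        (slice(y_edges[j], y_edges[j + 1]), slice(x_edges[i], x_edges[i + 1]))
        for j in range(tiles_y)
        for i in range(tiles_x)
    ]


def jacobi_tile(old, new, sy, sx, w_center=0.5, w_neighbour=0.125):
    """Five-point update of one tile; neighbours are read from the old field."""
    y0, y1, x0, x1 = sy.start, sy.stop, sx.start, sx.stop
    new[y0:y1, x0:x1] = w_center * old[y0:y1, x0:x1] + w_neighbour * (
        old[y0 - 1:y1 - 1, x0:x1]    # north
        + old[y0 + 1:y1 + 1, x0:x1]  # south
        + old[y0:y1, x0 - 1:x1 - 1]  # west
        + old[y0:y1, x0 + 1:x1 + 1]  # east
    )


def jacobi_tiled(field, tiles_y, tiles_x):
    """One Jacobi step computed tile by tile (each tile is an independent task)."""
    new = field.copy()
    for sy, sx in tile_slices(*field.shape, tiles_y, tiles_x):
        # On the GPU, this call would be launched in the stream assigned to the tile.
        jacobi_tile(field, new, sy, sx)
    return new
```

Since every tile reads only from the old field and writes only to its own part of the new field, the tiles are independent within one iteration, and the result does not depend on the number of tiles or on the order in which they are processed.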

<img src="images/tiledGrid.png" alt="drawing" width="500"/>

</div>


## Results
<div style='text-align: justify;
            width: 90%;'>

### GPU vs Streams 
#### Performance over concurrency


As expected, streams come with overhead, so dividing the field into tiles and computing each tile in a different stream leads to worse performance than computing the whole field in a single stream.

![Parallel exec](images/gpu_Parallel.png)


#### Performance over grid size

Figure XXXXXX shows the execution time of 10 iterations of the Jacobi stencil and the Gaussian stencil for different total field sizes. Since the curves flatten towards small grid sizes (up to field sizes of 1024x1024 the execution time stays almost constant), the overhead of the program can be read off directly. Unsurprisingly, the overhead grows with the number of streams. Depending on the number of streams, the actual execution time of the stencil calculations becomes relevant only for grid sizes larger than 1024x1024.


![grid Size A](images/gpu_fieldSize.png)
![grid Size B](images/gpu_fieldSizeB.png)

    
    
</div>





***

## Conclusion

<div style='text-align: justify;
            width: 90%;'>
This work demonstrated how 

<br>    <br> 

Further investigations on how 
</div>



***

## References

<div style='text-align: justify;
            width: 90%;'>
Alizadeh, O. (2022). Advances and challenges in climate modeling. Climatic Change, 170(1), 18. https://doi.org/10.1007/s10584-021-03298-4 
<br> <br>
Schäfer, A., & Fey, D. (2011). High performance stencil code algorithms for GPGPUs [Proceedings of the International Conference on Computational Science, ICCS 2011]. Procedia Computer Science, 4, 2027–2036. https://doi.org/10.1016/j.procs.2011.04.221
<br> <br>
Wahib, M., & Maruyama, N. (2014). Scalable kernel fusion for memory-bound GPU applications. SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 191–202. https://doi.org/10.1109/SC.2014.21
<br> <br>
Zängl, G., Reinert, D., Rípodas, P., & Baldauf, M. (2015). The ICON (ICOsahedral Non-hydrostatic) modelling framework of DWD and MPI-M: Description of the non-hydrostatic dynamical core. Quarterly Journal of the Royal Meteorological Society, 141(687), 563–579. https://doi.org/10.1002/qj.2378

</div>
