# GPU Programming

So far we have looked at frameworks that let us parallelize our code over CPU cores.
In this part of the notes we shift attention to Graphics Processing Unit (GPU) programming. In general, the GPU
provides much higher instruction throughput and memory bandwidth than a CPU within a similar price and power envelope [1].
An application that leverages these capabilities can run faster on the GPU than on the CPU.
We will look at three different ways of programming GPUs:

- <a href="https://developer.nvidia.com/cuda-toolkit">CUDA toolkit</a>
- <a href="https://www.khronos.org/opencl/">OpenCL</a>
- <a href="https://www.openacc.org/">OpenACC</a>

- **CUDA:** Compute Unified Device Architecture was introduced by Nvidia in
late 2006 as one of the first credible systems for GPU programming that broke
free of the “code-it-as-graphics” approach used until then. CUDA provides two
sets of APIs (a low-level and a higher-level one), and it is freely available for
Windows, Mac OS X, and Linux operating systems. Although it can be
considered too verbose, for example requiring explicit memory transfers
between the host and the GPU, it is the basis for the implementation of
higher-level third-party APIs and libraries. CUDA, which as of
Summer 2014 was in its sixth incarnation, is specific to Nvidia hardware.

- **OpenCL:** Open Computing Language is an open standard for writing
programs that can execute across a variety of heterogeneous platforms that
include CPUs, GPUs, DSPs, and other processors. OpenCL is supported by both
Nvidia and AMD. It is the primary development platform for AMD GPUs.
OpenCL’s programming model closely matches the one offered by CUDA.

- **OpenACC:** An open specification for an API that allows the use of compiler
directives (e.g., `#pragma acc`, in a similar fashion to OpenMP) to
automatically map computations to GPUs or multicore chips according to a
programmer’s hints.
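
To make the directive-based style concrete, here is a minimal sketch of a SAXPY loop annotated for OpenACC. The function name `saxpy` and the choice of data clauses are illustrative assumptions, not something prescribed by these notes. A compiler without OpenACC support ignores the unknown pragma (possibly with a warning) and runs the loop serially on the CPU, which gives the same result.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// SAXPY: y[i] = a * x[i] + y[i] for every index both vectors share.
// The function name and the data clauses are illustrative assumptions.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = std::min(x.size(), y.size());
    const float* xp = x.data();
    float* yp = y.data();

    // Hint to an OpenACC compiler: run the iterations in parallel on the
    // accelerator, copying x in and copying y in and back out.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Note that the loop body is ordinary C++; only the pragma line expresses the request for parallel execution and the required data movement.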

One of the major differences between GPUs and CPUs is that the former devote a large portion of their silicon real estate to compute logic, whereas
conventional CPUs devote large portions of it to on-chip cache memory. As a result, contemporary GPUs have hundreds or even thousands of cores.

To put all this computational power to use, we must create at least one separate thread for each core. Even more threads are needed so that computation
can be overlapped with memory transfers. This mandates a shift in the programming paradigm we employ: going from a handful of threads to thousands
requires a different way of partitioning and processing workloads.
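
One common way to partition work across so many threads is to give each logical thread a single element, with threads organized into fixed-size groups. The sketch below emulates that index mapping serially on the CPU; it is not GPU code. The names `groups_needed`, `scale_one`, and `scale_all` are hypothetical, and `group_size` is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Number of fixed-size thread groups needed to cover n elements (rounds up).
// Assumes group_size > 0.
std::size_t groups_needed(std::size_t n, std::size_t group_size) {
    return (n + group_size - 1) / group_size;
}

// Work of one logical thread: it handles exactly one element, selected by
// its group index and its position (lane) inside the group.
void scale_one(std::size_t group, std::size_t lane, std::size_t group_size,
               float factor, std::vector<float>& data) {
    const std::size_t i = group * group_size + lane;
    if (i < data.size()) {  // the last group may be only partially used
        data[i] *= factor;
    }
}

// CPU stand-in for a parallel launch: visits every (group, lane) pair in
// turn. On a GPU these invocations would run concurrently.
void scale_all(std::vector<float>& data, float factor, std::size_t group_size) {
    const std::size_t groups = groups_needed(data.size(), group_size);
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t lane = 0; lane < group_size; ++lane) {
            scale_one(g, lane, group_size, factor, data);
        }
    }
}
```

The bounds check in `scale_one` matters because the number of elements is rarely an exact multiple of the group size, so some logical threads in the last group have no element to process.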

---
**Remark**


A discrete GPU has its own memory, separate from the host’s main memory.
Having disjoint memories means that data must be explicitly transferred between
the host and the device whenever data need to be processed by the GPU or results collected by the
CPU. Memory access is already a serious bottleneck in GPU utilization, despite
phenomenal speeds such as a maximum theoretical memory bandwidth of
336 GB/s for Nvidia’s GTX TITAN Black and 320 GB/s for AMD’s Radeon R9
290X, so communicating data over a relatively slow peripheral bus like PCIe
is a major problem. In subsequent sections we examine how this communication
overhead can be reduced or “hidden” by overlapping it with computation.
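
A rough back-of-the-envelope estimate shows the size of the gap. The sketch below uses the 336 GB/s device memory figure quoted above and assumes a PCIe 3.0 x16 link with a theoretical peak of about 15.75 GB/s per direction; that bus figure and the function names are assumptions made for illustration, and latency and protocol overhead are ignored.

```cpp
// Theoretical peak bandwidths in GB/s (1 GB = 1e9 bytes).
constexpr double kDeviceMemoryGBs = 336.0;  // GTX TITAN Black, from the text
constexpr double kPcieGBs = 15.75;          // assumption: PCIe 3.0 x16, one direction

// Seconds needed to move `bytes` at `gb_per_s`, ignoring latency and overhead.
double transfer_seconds(double bytes, double gb_per_s) {
    return bytes / (gb_per_s * 1e9);
}

// How many times slower the bus is than device memory for the same data.
double bus_slowdown(double device_gb_per_s, double bus_gb_per_s) {
    return device_gb_per_s / bus_gb_per_s;
}
```

Under these assumptions, moving 1 GB takes about 3 ms within device memory but about 63.5 ms over the bus, a factor of roughly 21. Real transfers achieve less than the theoretical peak, so the practical gap may differ.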


---

A second characteristic of GPU computation is that GPU devices may not adhere
to the same floating-point representation and accuracy standards as typical CPUs.
This can lead to the accumulation of errors and the production of inaccurate results.
Although this is a problem that has been addressed by the latest chip offerings by
Nvidia and AMD, it is always recommended, as a precaution during development,
to verify the results produced by a GPU program against the results produced by an
equivalent CPU program. For example, CUDA code samples available with Nvidia’s
SDK typically follow this approach.
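
Such a check can be sketched as a tolerance-based comparison between the values copied back from the device and a CPU reference computed in double precision. The function name `results_match` and the default tolerances are assumptions chosen for illustration; suitable tolerances depend on the computation being verified.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when every device value lies within a mixed absolute and
// relative tolerance of the corresponding CPU reference value.
// The name and the default tolerances are illustrative assumptions.
bool results_match(const std::vector<float>& device,
                   const std::vector<double>& reference,
                   double rel_tol = 1e-5, double abs_tol = 1e-8) {
    if (device.size() != reference.size()) {
        return false;
    }
    for (std::size_t i = 0; i < device.size(); ++i) {
        const double diff =
            std::fabs(static_cast<double>(device[i]) - reference[i]);
        const double bound = abs_tol + rel_tol * std::fabs(reference[i]);
        if (!(diff <= bound)) {  // written this way so NaN is rejected too
            return false;
        }
    }
    return true;
}
```

Comparing with a tolerance rather than with `==` is deliberate: even a correct GPU implementation may round differently or evaluate operations in a different order than the CPU version, so bitwise equality is usually too strict.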

## References

1. <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html">CUDA C++ Programming Guide</a>