### Kernel summary

**Function type qualifiers:**
- $\_\_$global$\_\_$ is the qualifier for kernels (which are called from the host and executed on the device).
- $\_\_$host$\_\_$ functions are called from the host and execute on the host. (This is the default qualifier and is often omitted.)
- $\_\_$device$\_\_$ functions are called from the device and execute on the device. (A function that is called from a kernel needs the $\_\_$device$\_\_$ qualifier.)
- Prepending both $\_\_$host$\_\_$ and $\_\_$device$\_\_$ causes the system to compile separate host and device versions of the function, so the same function can be called from either side.


Kernels cannot return a value, so the return type is always void, and kernel declarations start as follows:
<br>
- $\_\_$global$\_\_$ void aKernel(typedArgs)
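
The qualifiers above can be seen together in a minimal sketch. The function names (`square`, `squarePlusOne`, `squarePlusOneKernel`, `runSquarePlusOne`) and the block size of 256 are illustrative assumptions, not part of the original notes, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Compiled for both host and device, so it can be called from either side.
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper: callable from kernels and other device functions.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// Kernel: launched from the host, executed on the device, returns void.
__global__ void squarePlusOneKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = squarePlusOne(in[i]);
}

// Ordinary host function (the implicit __host__ qualifier is omitted).
std::vector<float> runSquarePlusOne(const std::vector<float> &h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                        // assumed block size
    int blocks = (n + threads - 1) / threads; // enough blocks to cover n
    squarePlusOneKernel<<<blocks, threads>>>(d_in, d_out, n);

    // A blocking cudaMemcpy waits for the preceding kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `runSquarePlusOne({1.0f, 2.0f, 3.0f})` from `main` should give one output per input, each equal to the input squared plus one.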

### Summary of previous discussions

- cudaMalloc() allocates device memory
- cudaMemcpy() transfers data to or from a device.
- cudaFree() frees device memory that is no longer in use.
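- $\_\_$syncthreads() synchronizes threads within a block. Each thread waits at this barrier; once all threads in the block have reached it, execution resumes normally.
- cudaDeviceSynchronize() blocks the host until all previously launched device work has completed, which effectively synchronizes all threads in a grid.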
- cudaMallocManaged() allocates unified (managed) memory. Unified memory relieves you from having to create separate copies of an array (on the host and the device) and from explicitly calling for data transfers between CPU and GPU. Instead, you create a single managed array that can be accessed from both host and device. In reality, the data in the array still needs to be transferred between host and device, but the CUDA system schedules and executes those transfers so you don't have to.
- $\_\_$constant$\_\_$ - stored in constant memory (a cached region of device memory), read-only for threads, written by the host
- $\_\_$shared$\_\_$ - stored in on-chip shared memory (much lower latency than global memory), accessible by all threads in the same thread block, lifetime: block lifetime
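
As a rough illustration of how several of these pieces fit together, the sketch below sums a scaled array block by block. The names `scale`, `scaledBlockSum`, `scaledSum`, and `BLOCK_SIZE` are hypothetical, and `cudaMemcpyToSymbol` (the runtime call the host uses to write a $\_\_$constant$\_\_$ variable) is not covered in the notes above. It assumes `n > 0` and omits error checking.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two here

// Read-only for kernels; the host writes it with cudaMemcpyToSymbol.
__constant__ float scale;

// Each block sums its scaled inputs in shared memory (tree reduction).
__global__ void scaledBlockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[BLOCK_SIZE];  // one copy per block, block lifetime
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? scale * in[i] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads tile

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();  // barrier is outside the branch: every thread reaches it
    }
    if (tid == 0) blockSums[blockIdx.x] = tile[0];
}

// Host helper: fills a managed array with ones and returns hostScale * n.
float scaledSum(int n, float hostScale)
{
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *in = nullptr, *blockSums = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&blockSums, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;  // host writes managed memory directly

    cudaMemcpyToSymbol(scale, &hostScale, sizeof(float));
    scaledBlockSum<<<blocks, BLOCK_SIZE>>>(in, blockSums, n);
    cudaDeviceSynchronize();  // wait before the host reads blockSums

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += blockSums[b];
    cudaFree(in);
    cudaFree(blockSums);
    return total;
}
```

The final pass over `blockSums` is done on the host to keep the sketch short; a second kernel launch would be another option.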


### Communication Pattern

Let's briefly review the different communication patterns seen in parallel computing. A communication pattern describes how tasks are mapped to memory. In CUDA, this means how threads are mapped to the memory they communicate through.

#### Map

- In the map pattern, the program has many data elements, such as the elements of an array, the entries in a matrix, or the pixels in an image.
- Map applies the same function to each piece of data. This means that each thread reads from and writes to one specific place in memory, giving a one-to-one correspondence between inputs and outputs.
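
A minimal sketch of map, assuming the function applied to each element is squaring. The names `mapSquare` and `mapSquareHost` are illustrative, and managed memory is used only to keep the host code short.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Map: thread i reads in[i] and writes out[i], nothing else.
__global__ void mapSquare(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

std::vector<float> mapSquareHost(const std::vector<float> &h_in)
{
    int n = static_cast<int>(h_in.size());
    if (n == 0) return {};

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    std::copy(h_in.begin(), h_in.end(), in);

    int threads = 256;                        // assumed block size
    int blocks = (n + threads - 1) / threads;
    mapSquare<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();                  // wait before reading out on the host

    std::vector<float> h_out(out, out + n);
    cudaFree(in);
    cudaFree(out);
    return h_out;
}
```

Because no two threads touch the same location, map needs no synchronization between threads.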

#### Gather
- Suppose you want each thread to compute and store the average across a range of data elements.
- For example, suppose you want to blur an image by setting each pixel to the mean of its neighbouring pixels.
- This operation is called gather, because each thread gathers input data elements from several different places to compute a single output.
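
A one-dimensional version of the blur can be sketched as a three-point average. The names `gatherAverage3` and `gatherAverage3Host` are hypothetical, and clamping at the array boundaries is an assumed choice, not something the notes specify.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Gather: thread i reads three inputs and writes exactly one output.
__global__ void gatherAverage3(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int left  = (i > 0) ? i - 1 : 0;          // clamp at the left edge
    int right = (i < n - 1) ? i + 1 : n - 1;  // clamp at the right edge
    out[i] = (in[left] + in[i] + in[right]) / 3.0f;
}

std::vector<float> gatherAverage3Host(const std::vector<float> &h_in)
{
    int n = static_cast<int>(h_in.size());
    if (n == 0) return {};

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    std::copy(h_in.begin(), h_in.end(), in);

    int threads = 256;                        // assumed block size
    int blocks = (n + threads - 1) / threads;
    gatherAverage3<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    std::vector<float> h_out(out, out + n);
    cudaFree(in);
    cudaFree(out);
    return h_out;
}
```

Input locations are read by several threads, but each output location has a single writer, so no atomics are needed.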

#### Scatter
- Now suppose that you want to do the opposite operation. Each thread reads one input, takes a fraction of its value, and adds it to the neighbouring points of the output.
- When each thread needs to write its output to a different place, or to multiple places, we call this the scatter operation. Because several threads may write to the same location, these writes can conflict.
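
One way to sketch scatter: each thread keeps half of its input value at its own position and adds a quarter to each neighbour. The weights, the names `scatterSpread` and `scatterSpreadHost`, and the use of `atomicAdd` to make concurrent writes to the same output safe are all assumptions made for this illustration. Contributions that would fall outside the array are dropped.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Scatter: thread i reads one input and adds to up to three outputs.
// out must be zero-initialized before the launch.
__global__ void scatterSpread(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    // Neighbouring threads target the same out[] entries, so use atomics.
    atomicAdd(&out[i], 0.5f * v);
    if (i > 0)     atomicAdd(&out[i - 1], 0.25f * v);
    if (i < n - 1) atomicAdd(&out[i + 1], 0.25f * v);
}

std::vector<float> scatterSpreadHost(const std::vector<float> &h_in)
{
    int n = static_cast<int>(h_in.size());
    if (n == 0) return {};

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    std::copy(h_in.begin(), h_in.end(), in);
    std::fill(out, out + n, 0.0f);            // accumulators start at zero

    int threads = 256;                        // assumed block size
    int blocks = (n + threads - 1) / threads;
    scatterSpread<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    std::vector<float> h_out(out, out + n);
    cudaFree(in);
    cudaFree(out);
    return h_out;
}
```

Compared with gather, the roles are reversed: each input has a single reader, while each output may have several writers.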

#### Transpose
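- Transpose is a pattern that can be very useful in array, matrix, image, and data structure manipulation. For example, we might have a 2D array, such as an image, stored in row-major order, and want it in column-major order instead.
- Each thread reads an element from one location and writes it to a different location, so the mapping between inputs and outputs is still one-to-one.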
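
A naive transpose can be sketched as follows. The names `transposeNaive` and `transposeHost` and the 16×16 block shape are illustrative assumptions. The input is a `height` × `width` row-major array, and the output is its `width` × `height` row-major transpose.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Transpose: the element at (row y, column x) moves to (row x, column y).
__global__ void transposeNaive(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row in the input
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}

std::vector<float> transposeHost(const std::vector<float> &h_in, int width, int height)
{
    int n = width * height;
    if (n <= 0 || static_cast<int>(h_in.size()) != n) return {};

    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    std::copy(h_in.begin(), h_in.end(), in);

    dim3 threads(16, 16);                            // assumed block shape
    dim3 blocks((width + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y);
    transposeNaive<<<blocks, threads>>>(in, out, width, height);
    cudaDeviceSynchronize();

    std::vector<float> h_out(out, out + n);
    cudaFree(in);
    cudaFree(out);
    return h_out;
}
```

This version favours clarity over speed: its reads are contiguous along rows but its writes are strided. A common refinement stages tiles in shared memory so that both sides access global memory contiguously.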

<img src="fig/memory-hierarchy-in-gpus-2.png" width="500"/>