# Memory management in OpenCL

In previous lessons we looked at straightforward ways in which memory was allocated on the host and then copied to the device for use as global memory by the kernel. In the introduction we briefly covered the five memory spaces that are accessible to an OpenCL program:

* Host memory
* Global memory
* Local (shared) memory
* Private memory
* Constant memory

**Host memory** is the main memory of the host and is usually the largest memory space, while **global memory** is the largest and slowest memory space available on the compute device. **Local** and **constant** memory are usually placed in the small, fast caches on the compute device. **Private memory** is usually located in the registers, which are normally the fastest and smallest memory spaces available on the compute device. A programmer has some degree of control over where memory is stored during the operation of an OpenCL program. The diagram below shows what memory is available for access by both host and kernel threads (work-items) at runtime.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/memory_spaces.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Access to memory from kernel and host threads.</figcaption>
</figure>

Kernel threads (work-items) can access *global*, *constant*, *local* and *private* memory, whereas host threads can only access *host* and *global* memory. Private memory is exclusive to a single kernel thread, meaning that no other kernel thread can access the same private memory. Local memory is accessible to all kernel threads in a workgroup, but not to kernel threads from another workgroup. *Global* and *constant* memory are accessible from all kernel threads.

## Memory access from the host

From the introduction we know that Buffers are allocated on the host and they are migrated in and out of the compute device when they are needed. Here are some ways we can create Buffers and transfer memory in and out of them. 

### Buffer creation

Thus far we have been creating Buffers with the [clCreateBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clCreateBuffer.html) function using the **CL_MEM_READ_WRITE** flag. For example, this code creates a Buffer that has read-write access from the kernel, but no additional functionality.

```C++
    cl_mem buffer_C = clCreateBuffer(context, 
                                     CL_MEM_READ_WRITE, 
                                     nbytes_C, 
                                     NULL, 
                                     &errcode);
```

#### IO permission flags

We can choose other IO flags to tell the OpenCL implementation how the Buffer is to be used. This may unlock additional optimisations.

| **Allocation flag** | **Functionality** | 
| :- | :- | 
|CL_MEM_READ_WRITE| Read-write access from a kernel | 
|CL_MEM_WRITE_ONLY| Write-only access from a kernel | 
|CL_MEM_READ_ONLY | Read-only access from a kernel | 
|CL_MEM_HOST_WRITE_ONLY | Write-only access from the host | 
|CL_MEM_HOST_READ_ONLY | Read-only access from the host | 
|CL_MEM_HOST_NO_ACCESS | No access from the host | 

Common-sense rules apply in the use of these flags: for example, **CL_MEM_WRITE_ONLY** is incompatible with **CL_MEM_READ_WRITE**, and behaviour is undefined if a kernel tries to write to a buffer that has been created as **CL_MEM_READ_ONLY**.
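The compatibility rules can be sketched as a small host-side check. This is only an illustration in plain C++ so that it compiles without the OpenCL headers: the type `mem_flags`, the `MEM_*` constants with their bit values, and the helper functions are all hypothetical stand-ins, and real code would use `cl_mem_flags` and the `CL_MEM_*` constants from `CL/cl.h`. It assumes that at most one kernel-access flag and at most one host-access flag may be set at a time.

```C++
#include <cassert>
#include <cstdint>

// Stand-in flag type and bit values, chosen for illustration only.
// Real code would use cl_mem_flags and the CL_MEM_* constants from CL/cl.h.
using mem_flags = std::uint64_t;
constexpr mem_flags MEM_READ_WRITE      = 1u << 0;
constexpr mem_flags MEM_WRITE_ONLY      = 1u << 1;
constexpr mem_flags MEM_READ_ONLY       = 1u << 2;
constexpr mem_flags MEM_HOST_WRITE_ONLY = 1u << 3;
constexpr mem_flags MEM_HOST_READ_ONLY  = 1u << 4;
constexpr mem_flags MEM_HOST_NO_ACCESS  = 1u << 5;

// Number of bits that are set in both flags and mask.
int count_set(mem_flags flags, mem_flags mask) {
    int n = 0;
    for (mem_flags bits = flags & mask; bits != 0; bits >>= 1) {
        n += static_cast<int>(bits & 1u);
    }
    return n;
}

// True when at most one kernel-access flag and at most one
// host-access flag is present in the combination.
bool access_flags_compatible(mem_flags flags) {
    const mem_flags kernel_mask = MEM_READ_WRITE | MEM_WRITE_ONLY | MEM_READ_ONLY;
    const mem_flags host_mask =
        MEM_HOST_WRITE_ONLY | MEM_HOST_READ_ONLY | MEM_HOST_NO_ACCESS;
    return count_set(flags, kernel_mask) <= 1 && count_set(flags, host_mask) <= 1;
}

// Returns flags unchanged; in debug builds the assert stops the program
// if the combination is inconsistent.
mem_flags checked_flags(mem_flags flags) {
    assert(access_flags_compatible(flags));
    return flags;
}
```

Flags are combined with bitwise OR. A kernel input that the host only ever writes could, for instance, combine the read-only kernel flag with the host write-only flag, which `access_flags_compatible` accepts, whereas combining the write-only and read-write kernel flags is rejected.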

#### Using host memory

The flag **CL_MEM_USE_HOST_PTR** tells the OpenCL implementation to use host memory as the backing store for the Buffer. One must make sure that enough host memory is allocated to cover the memory used by the buffer and that the host memory is not de-allocated while the buffer is using it. OpenCL implementations are free to allocate caches on the compute device for temporary usage and then synchronize as required. Memory synchronization can be explicitly done using **mapping**, which will be discussed shortly.

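Similarly, the flag **CL_MEM_COPY_HOST_PTR** creates an OpenCL buffer and copies memory from a host pointer during buffer creation. Once the copy has finished the buffer no longer depends on the host allocation, so the application is free to modify or de-allocate that memory. OpenCL does not free it for you.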

Both **CL_MEM_USE_HOST_PTR** and **CL_MEM_COPY_HOST_PTR** require a host pointer to be passed into the call to [clCreateBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clCreateBuffer.html).

#### Creating buffers for asynchronous copies

Pinned memory is host memory that cannot be paged out to swap. It enables fast Direct Memory Access (DMA) transfers between host and device; however, the OS limits it to a fraction of the available memory. Transfers between host and device, for example using [clEnqueueReadBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueReadBuffer.html) or [clEnqueueWriteBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueWriteBuffer.html), are synchronous when the blocking argument is set to **CL_TRUE**, meaning the call does not return until the transfer completes. The flag **CL_MEM_ALLOC_HOST_PTR** asks the implementation to allocate host-accessible memory as the backing store for the OpenCL buffer, and on many implementations this memory is pinned. Combined with non-blocking transfers, this enables asynchronous copies so that IO movement can occur at the same time as compute.

### Explicit memory movement

The OpenCL 1.2 and earlier standards require explicit memory transfers between an OpenCL buffer and the host. You have the option of copying either contiguous or rectangular regions of allocated memory. By rectangular I mean that if the memory allocation is interpreted as being **folded into a multidimensional array**, then a rectangular copy would copy a rectangular region of that array.

#### Contiguous copies

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/contiguous_memory_copy.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Contiguous memory copy.</figcaption>
</figure>

If you need to copy contiguous chunks of memory, then [clEnqueueWriteBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueWriteBuffer.html) **writes to** the OpenCL buffer from host memory and [clEnqueueReadBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueReadBuffer.html) reads **from** a buffer to host memory. The function [clEnqueueCopyBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueCopyBuffer.html) performs a copy of contiguous memory between two OpenCL buffers **within the same OpenCL context**. All three functions accept a starting offset (in bytes) within the OpenCL buffer, and [clEnqueueCopyBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueCopyBuffer.html) takes separate offsets for the source and destination buffers.
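Because the offset and size of a contiguous copy are given in bytes rather than elements, it is easy to be off by a factor of `sizeof(T)`. The following is a minimal sketch of the conversion in plain C++. The struct `ByteRange` and the helper `element_range` are hypothetical names introduced here for illustration and are not part of OpenCL.

```C++
#include <cassert>
#include <cstddef>

// Byte offset and byte count describing a contiguous run of elements.
struct ByteRange {
    std::size_t offset;  // starting offset in bytes
    std::size_t size;    // number of bytes to copy
};

// Byte range covering `count` elements of type T, starting at element
// `first`, in an allocation that holds `total_elements` elements.
template <typename T>
ByteRange element_range(std::size_t first, std::size_t count,
                        std::size_t total_elements) {
    // The requested run must lie inside the allocation.
    assert(first + count <= total_elements);
    return ByteRange{first * sizeof(T), count * sizeof(T)};
}
```

For a buffer of 16 floats, `element_range<float>(4, 8, 16)` describes elements 4 to 11. Its `offset` and `size` fields are the values one would pass as the `offset` and `size` arguments of [clEnqueueWriteBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueWriteBuffer.html) or [clEnqueueReadBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueReadBuffer.html).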

#### Rectangular copies

Sometimes a contiguous copy is not sufficient, particularly when you treat the allocation as a multi-dimensional array and wish to copy a rectangular region. The function [clEnqueueWriteBufferRect](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueWriteBufferRect.html) **writes** a 3D rectangular region to the OpenCL buffer from host memory and [clEnqueueReadBufferRect](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueReadBufferRect.html) **reads** a 3D rectangular region from the OpenCL buffer into host memory.

<figure style="margin-left:auto; margin-right:auto; width:80%;">
    <img style="vertical-align:middle" src="../images/rectangular_memory_copy.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Rectangular memory copy.</figcaption>
</figure>

I found the documentation on these functions quite confusing, particularly when translating from one indexing system to another. It is important to remember that the word **"row"** in the documentation refers to the dimension along which memory is contiguous. Row pitch is the number of bytes along a row of **the memory allocation**, and slice pitch is the number of bytes in a slice of **the memory allocation**. Within the memory allocation a **region** is selected for the copy; it is of size (nbytes along the row, nrows, nslices). If you are only doing a 2D copy then use a value of 1 for **nslices**. The region is located at an **origin** within the memory allocation, which has units of (offset in bytes along a row, row id, slice id). Row and slice ids start at 0.

A rectangular copy is usually more efficient than enqueuing numerous calls to the contiguous memory copy functions to cover the same region. In similar fashion, the function [clEnqueueCopyBufferRect](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueCopyBufferRect.html) copies 3D rectangular regions between OpenCL buffers.
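To make the origin, region and pitch conventions concrete, here is a host-side model of the addressing, written in plain C++ with no OpenCL calls. The function `copy_rect` is a hypothetical helper for illustration only and is not how an OpenCL implementation performs the copy. It assumes the byte offset of a row start is `slice id * slice pitch + row id * row pitch + offset in bytes along the row`, with a separate origin and pair of pitches for the source and destination allocations.

```C++
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-side model of a rectangular copy between two allocations.
// origin = (offset in bytes along a row, row id, slice id)
// region = (nbytes along the row, nrows, nslices)
void copy_rect(const void* src, void* dst,
               const std::size_t src_origin[3], const std::size_t dst_origin[3],
               const std::size_t region[3],
               std::size_t src_row_pitch, std::size_t src_slice_pitch,
               std::size_t dst_row_pitch, std::size_t dst_slice_pitch) {
    // The copied part of each row must fit inside a row of each allocation.
    assert(src_origin[0] + region[0] <= src_row_pitch);
    assert(dst_origin[0] + region[0] <= dst_row_pitch);

    const unsigned char* src_bytes = static_cast<const unsigned char*>(src);
    unsigned char* dst_bytes = static_cast<unsigned char*>(dst);

    for (std::size_t s = 0; s < region[2]; ++s) {      // slices
        for (std::size_t r = 0; r < region[1]; ++r) {  // rows
            const std::size_t src_off = (src_origin[2] + s) * src_slice_pitch
                                      + (src_origin[1] + r) * src_row_pitch
                                      + src_origin[0];
            const std::size_t dst_off = (dst_origin[2] + s) * dst_slice_pitch
                                      + (dst_origin[1] + r) * dst_row_pitch
                                      + dst_origin[0];
            // Memory is contiguous along a row, so one memcpy per row.
            std::memcpy(dst_bytes + dst_off, src_bytes + src_off, region[0]);
        }
    }
}
```

As a usage example, take a 4x4 array of floats (row pitch `4*sizeof(float)`, slice pitch `16*sizeof(float)`). Copying the 2x2 block that starts at row 2, column 1 uses a source origin of `(1*sizeof(float), 2, 0)` and a region of `(2*sizeof(float), 2, 1)`. Only the first component of origin and region is in bytes; the others count rows and slices.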

#### Buffer mapping

With the contiguous and rectangular copies there are two memory spaces in use, one for the OpenCL buffer and one for the host. It is possible to map (or make available) an OpenCL buffer as an allocation of host memory, thus avoiding the explicit memory transfer. This approach can be particularly beneficial when a CPU or integrated GPU is employed, as the memory for the buffer is already on the host and no actual transfer from the device is needed. Buffer mapping can be useful for discrete GPUs as well, because it gives the OpenCL implementation the opportunity to optimise transfers and synchronisation between the OpenCL buffer and the host. The command [clEnqueueMapBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueMapBuffer.html) maps an OpenCL buffer into host memory. It uses the following mapping flags:

| **Mapping flag** | **Functionality** | 
| :- | :- | 
|CL_MAP_READ | The buffer is being mapped for reading | 
|CL_MAP_WRITE | The buffer is being mapped for writing | 
|CL_MAP_WRITE_INVALIDATE_REGION | The buffer is going to be written by the host soon and its current contents do not need to be preserved (a potential source of optimisation) | 

When a buffer is mapped, access to it from an OpenCL kernel is considered to be undefined behaviour. The [clEnqueueUnmapMemObject](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueUnmapMemObject.html) function unmaps the memory from the host and makes it available to kernels again.

#### Initialisation

Sometimes you need to fill or initialise the contents of a buffer. The [clEnqueueFillBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueFillBuffer.html) command fills a buffer with a repeated user-defined pattern, for example the value 0.
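The effect of a pattern fill can be modelled on the host in a few lines of plain C++. The function `fill_with_pattern` below is a hypothetical stand-in for illustration and not an OpenCL call. It mirrors the requirement of [clEnqueueFillBuffer](https://www.khronos.org/registry/OpenCL/sdk/3.0/docs/man/html/clEnqueueFillBuffer.html) that the byte offset and byte size of the filled region are both multiples of the pattern size.

```C++
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-side model of a pattern fill: repeat `pattern` (pattern_size bytes)
// across `size` bytes of `buffer`, starting `offset` bytes from its start.
void fill_with_pattern(void* buffer, const void* pattern,
                       std::size_t pattern_size,
                       std::size_t offset, std::size_t size) {
    assert(pattern_size > 0);
    // Offset and size must both be whole multiples of the pattern size.
    assert(offset % pattern_size == 0 && size % pattern_size == 0);

    unsigned char* bytes = static_cast<unsigned char*>(buffer);
    for (std::size_t pos = 0; pos < size; pos += pattern_size) {
        std::memcpy(bytes + offset + pos, pattern, pattern_size);
    }
}
```

To zero elements 2 to 4 of an array of floats, for example, the pattern is a single `float` holding 0, the offset is `2*sizeof(float)` and the size is `3*sizeof(float)`. Elements outside that range are left untouched.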

## Memory access from kernels

### Global

### Local

### Shared

### Private

### Accessing vector elements 

## Using shared memory

### Allocation 

### Synchronisation

### Use within a kernel