# Unified Shared Memory (USM) (C/C++)

#### Sections
- [Learning Objectives](#Learning-Objectives)
- [What is Unified Shared Memory](#What-is-Unified-Shared-Memory?)
- [Allocating Unified Shared Memory](#Allocating-Unified-Shared-Memory)
- _Code:_ [Lab Exercise: Shared Memory Allocation](#Lab-Exercise:-Shared-Memory-Allocation)
- _Code:_ [Explicit USM](#USM-Explicit-Data-Movement)

## Learning Objectives
* Use the Unified Shared Memory feature to simplify OpenMP* Offload programming
* Understand the implicit and explicit ways of moving data using USM

### Prerequisites
A basic understanding of OpenMP constructs is assumed for this module. You should also have already gone through the [Introduction to OpenMP Offload module](../intro/intro.ipynb) and the [Managing Device Data module](../datatransfer/datatransfer.ipynb), which covered the basics of using Jupyter notebooks with the Intel® DevCloud and introduced the OpenMP `target` and `map` constructs.

## What is Unified Shared Memory?

Unified Shared Memory (USM) is a data management feature currently supported by the Intel® oneAPI DPC++/C++ Compiler. USM is a
__pointer-based approach__ that should be familiar to C and C++ programmers who use `malloc`
or `new` to allocate data. USM __simplifies development__ when __porting existing
C/C++ code__ to support OpenMP Offload.

### Developer View of USM

The picture below shows the __developer view of memory__ without USM and with USM.

With USM, the developer can reference the same memory object in both host and device code.

![Developer View of USM](Assets/usm_dev_view.png)

***
## Allocating Unified Shared Memory
In the previous modules, we used the `map` clause with the `target`, `target data`, and `target enter/exit data` pragmas to map memory between the host and device data environments. We can also use OpenMP USM routines to simplify the management of host and device memory.
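For comparison, the `map`-based approach from the previous modules might look like the following minimal sketch. The function name `add_with_map` is hypothetical and is not part of the lab files. If the code is built without OpenMP support, the pragma is ignored and the loop simply runs on the host.

```cpp
// Hypothetical helper (not part of the lab files): adds y into x inside a
// target region, using map clauses to describe the data movement.
void add_with_map(float *x, const float *y, int n) {
  // x is copied to the device and back; y is only copied to the device.
#pragma omp target map(tofrom : x[0:n]) map(to : y[0:n])
  {
    for (int i = 0; i < n; i++) x[i] += y[i];
  }
}
```

Every array touched in the target region needs its own `map` clause with an explicit length, which is the bookkeeping that the USM allocation routines aim to reduce.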

### Types of USM

USM provides different types of memory to allow both explicit and implicit models for managing memory.

Device memory can be allocated for explicit user control of data movement. Host and shared memory are provided to allow implicit accesses from the accelerator device.

The following table illustrates the properties of the different USM memories and how they can be allocated.

|Type | Location | Accessible From |  Allocation Routine |
|:----:|:----:|:----:|:----|
|Host | Host | Host or Device | omp_target_alloc_host(size,device_num) |
|Device |Device | Device | omp_target_alloc_device(size,device_num) |
|Shared | Host or Device | Host or Device | omp_target_alloc_shared(size,device_num) |

Memory allocated with any of the above routines can be freed with the `omp_target_free(pointer, device_num)` call.
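The allocate, check, and free pattern can be sketched as follows. The routines in the table are provided by the Intel compiler; to keep this sketch buildable with other OpenMP compilers, it uses the standard `omp_target_alloc` routine instead, which is an assumption made for illustration only. The helper names `pick_device`, `alloc_floats`, and `free_floats` are hypothetical, and the `_OPENMP` guard falls back to plain host memory when the code is compiled without OpenMP.

```cpp
#include <cstddef>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: use the default device if one exists, otherwise the
// host (the same selection logic used by the explicit USM example).
int pick_device() {
#ifdef _OPENMP
  return (omp_get_num_devices() > 0) ? omp_get_default_device()
                                     : omp_get_initial_device();
#else
  return 0;  // placeholder; ignored by the fallbacks below
#endif
}

// Hypothetical helper: allocate `count` floats on device `device_num`.
// Returns nullptr on failure so the caller can check before use.
float *alloc_floats(std::size_t count, int device_num) {
#ifdef _OPENMP
  return static_cast<float *>(
      omp_target_alloc(count * sizeof(float), device_num));
#else
  (void)device_num;  // no devices without OpenMP; use host memory
  return static_cast<float *>(std::malloc(count * sizeof(float)));
#endif
}

// Hypothetical helper: release memory from alloc_floats, passing the same
// device number that was used for the allocation.
void free_floats(float *ptr, int device_num) {
#ifdef _OPENMP
  omp_target_free(ptr, device_num);
#else
  (void)device_num;
  std::free(ptr);
#endif
}
```

With the Intel compiler, the same pattern should apply with `omp_target_alloc_shared`, `omp_target_alloc_host`, or `omp_target_alloc_device` substituted for `omp_target_alloc`: check the returned pointer, and free it with the device number it was allocated for.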

## Lab Exercise: Shared Memory Allocation 

In this exercise, you will use the shared allocation routine to highlight the usage of Unified Shared Memory. Shared memory is accessible from both the host and device. Its location is managed by the runtime and can reside on the host and/or the device.

The primary source file, main.cpp, is written for you. It includes alloc_func.cpp, which you will write. If you would like to see the contents of main.cpp, execute the following cell.


In [None]:
#See the contents of main.cpp
%pycat main.cpp

In the cell below, the shared allocation routine is used to allocate shared memory for the float arrays `x` and `y`. It uses `deviceId` as the device number and `ARRAY_SIZE * sizeof(float)` as the size of each array in bytes.

Execute the cell below to write the allocation code to file.

In [None]:
%%writefile lab/alloc_func.cpp
//Allocate Shared Memory 
float *x =
    (float *)omp_target_alloc_shared(ARRAY_SIZE * sizeof(float), deviceId);
float *y =
    (float *)omp_target_alloc_shared(ARRAY_SIZE * sizeof(float), deviceId);

### Compile the Code
Next, compile the code using *compile_c.sh*. If you would like to see the contents of compile_c.sh, execute the following cell.

In [None]:
# Optional: Run this cell to see the contents of compile_c.sh
%pycat compile_c.sh

Execute the following cell to perform the compilation.

In [None]:
!chmod 755 compile_c.sh; ./compile_c.sh;

### Execute the code
Next, run the code using the script *run.sh*.

In [None]:
# Optional: Run this cell to see the contents of run.sh
%pycat run.sh

Execute the following cell to execute the compiled program. Look for the passed message.

_If the Jupyter cells are not responsive, or if they error out when you compile the samples, please restart the Kernel and compile the samples again._

In [None]:
! chmod 755 q; chmod 755 run.sh;if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi

### USM Explicit Data Movement
The code below shows an implementation of USM using <code>omp_target_alloc_device</code>, in which data movement between host and device must be explicitly managed by the developer using <code>omp_target_memcpy</code>. This gives developers more control over data movement, which may improve performance.

The code below demonstrates USM explicit data movement. Inspect the code; no modifications are necessary:

1. Inspect the code cell below and run ▶ the cell to save the code to file.

2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/usm_explicit.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>  // malloc, free, EXIT_SUCCESS

#pragma omp requires unified_shared_memory

constexpr int ARRAY_SIZE = 256;

void init1(float *x, int N) {
  for (int i = 0; i < N; i++) x[i] = 1.0;
}
void init2(float *x, int N) {
  for (int i = 0; i < N; i++) x[i] = 2.0;
}
int main() {
  int deviceId = (omp_get_num_devices() > 0) ? omp_get_default_device()
                                             : omp_get_initial_device();

  // Allocate memory on host
  float *x = (float *)malloc(ARRAY_SIZE * sizeof(float));
  float *y = (float *)malloc(ARRAY_SIZE * sizeof(float));

  double tb, te;
  int correct_count = 0;

  init1(x, ARRAY_SIZE);
  init1(y, ARRAY_SIZE);

  printf("Number of OpenMP Devices: %d\n", omp_get_num_devices());

  tb = omp_get_wtime();

  // Allocate memory on device
  float *x_dev =
      (float *)omp_target_alloc_device(ARRAY_SIZE * sizeof(float), deviceId);
  float *y_dev =
      (float *)omp_target_alloc_device(ARRAY_SIZE * sizeof(float), deviceId);

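  // Explicit data movement from host to device. The last two arguments are
  // the destination and source device numbers; the host is identified by
  // omp_get_initial_device(), which is not necessarily 0.
  int error = omp_target_memcpy(x_dev, x, ARRAY_SIZE * sizeof(float), 0, 0,
                                deviceId, omp_get_initial_device());
  error = omp_target_memcpy(y_dev, y, ARRAY_SIZE * sizeof(float), 0, 0,
                            deviceId, omp_get_initial_device());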

#pragma omp target
  {
    for (int i = 0; i < ARRAY_SIZE; i++) x_dev[i] += y_dev[i];
  }

  // Explicit data movement from device to host (destination is the host)
  error = omp_target_memcpy(x, x_dev, ARRAY_SIZE * sizeof(float), 0, 0,
                            omp_get_initial_device(), deviceId);
  error = omp_target_memcpy(y, y_dev, ARRAY_SIZE * sizeof(float), 0, 0,
                            omp_get_initial_device(), deviceId);

  init2(y, ARRAY_SIZE);

  // Explicit data movement from host to device (source is the host)
  error = omp_target_memcpy(x_dev, x, ARRAY_SIZE * sizeof(float), 0, 0,
                            deviceId, omp_get_initial_device());
  error = omp_target_memcpy(y_dev, y, ARRAY_SIZE * sizeof(float), 0, 0,
                            deviceId, omp_get_initial_device());

#pragma omp target
  {
    for (int i = 0; i < ARRAY_SIZE; i++) x_dev[i] += y_dev[i];
  }
  // Explicit data movement from device to host (destination is the host)
  error = omp_target_memcpy(x, x_dev, ARRAY_SIZE * sizeof(float), 0, 0,
                            omp_get_initial_device(), deviceId);
  error = omp_target_memcpy(y, y_dev, ARRAY_SIZE * sizeof(float), 0, 0,
                            omp_get_initial_device(), deviceId);

  te = omp_get_wtime();

  printf("Time of kernel: %lf seconds\n", te - tb);

  for (int i = 0; i < ARRAY_SIZE; i++)
    if (x[i] == 4.0) correct_count++;

  printf("Test: %s\n", (correct_count == ARRAY_SIZE) ? "PASSED!" : "Failed");

  omp_target_free(x_dev, deviceId);
  omp_target_free(y_dev, deviceId);
  free(x);
  free(y);

  return EXIT_SUCCESS;
}

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_usm_explicit.sh; if [ -x "$(command -v qsub)" ]; then ./q run_usm_explicit.sh; else ./run_usm_explicit.sh; fi

_If the Jupyter cells are not responsive, or if they error out when you compile the code samples, please restart the Jupyter Kernel: 
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again_.

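The explicit USM example stores the return value of `omp_target_memcpy` in `error` but never inspects it. Since `omp_target_memcpy` returns zero on success, a checked copy might be sketched as below. The helper names `host_device` and `copy_checked` are hypothetical, and the `_OPENMP` guard is an added assumption so that the sketch also builds without OpenMP, in which case it falls back to `memcpy`.

```cpp
#include <cstddef>
#include <cstring>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: device number that identifies the host.
int host_device() {
#ifdef _OPENMP
  return omp_get_initial_device();
#else
  return 0;  // placeholder; ignored by the fallback below
#endif
}

// Hypothetical helper: copy `bytes` bytes from `src` (on src_device) to
// `dst` (on dst_device) and report whether the copy succeeded.
bool copy_checked(void *dst, const void *src, std::size_t bytes,
                  int dst_device, int src_device) {
#ifdef _OPENMP
  // Both offsets are zero: copy from the start of src to the start of dst.
  return omp_target_memcpy(dst, src, bytes, 0, 0, dst_device, src_device) == 0;
#else
  (void)dst_device;
  (void)src_device;
  std::memcpy(dst, src, bytes);
  return true;
#endif
}
```

In the example, a host-to-device copy would then read `copy_checked(x_dev, x, ARRAY_SIZE * sizeof(float), deviceId, host_device())`, and the program could report a failure instead of silently computing on stale data.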
# Summary
USM makes OpenMP Offload easier to use. Its implicit data movement model, based on host and shared allocations, helps you get code working quickly, while its explicit model, based on device allocations and `omp_target_memcpy`, gives you direct control over data movement.

<html><body><span style="color:Red"><h1>Reset Notebook</h1></span></body></html>

##### If you are experiencing any issues with your notebook, or just want to start fresh, run the cell below.

In [None]:
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets
button = widgets.Button(
    description='Reset Notebook',
    disabled=False,
    button_style='', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='This will update this notebook, overwriting any changes.',
    icon='check' # (FontAwesome names without the `fa-` prefix)
)
out = widgets.Output()
def on_button_clicked(_):
      # "linking function with output"
      with out:
          # what happens when we press the button
          clear_output()
          !rsync -a --size-only /data/oneapi_workshop/OpenMP_Offload/datatransfer/ ~/OpenMP_Offload/datatransfer
          print('Notebook reset -- now click reload on browser.')
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])

***

@Intel Corporation | [\*Trademark](https://www.intel.com/content/www/us/en/legal/trademarks.html)