# Introduction to Data Parallel Control (dpctl) 

## Sections
- [Introduction to Data Parallel Control (dpctl)](#Introduction-to-Data-Parallel-Control-(dpctl))
    - _Code:_ [dpctl.device_context()](#dpctl.device_context())    
- [Managing SYCL USM memory using dpctl.memory](#Managing-SYCL-USM-memory-using-dpctl.memory)
    - _Code:_ [dpctl Memory API](#dpctl-Memory-API)   
    - _Code:_ [Memory Management in DPPY](#Memory-Management-in-DPPY)

## Learning Objectives

* Utilize __Data Parallel Control (dpctl)__ to manage different devices
* Use the classes and functions of dpctl
* Use dpctl.memory to create Python objects backed by SYCL USM memory

## Introduction to Data Parallel Control (dpctl) 
Dpctl provides a lightweight Python wrapper over a subset of DPC++/SYCL’s API. The goal of dpctl is not (yet) to provide an abstraction for every SYCL function. Dpctl is intended to provide a common runtime to manage specific SYCL resources, such as devices and USM memory, for SYCL-based Python packages and extension modules.

The main features presently provided by dpctl are:

1. Python wrapper classes for the main SYCL runtime classes mentioned in Section 4.6 of SYCL provisional 2020 spec (https://bit.ly/3asQx07): `platform`, `device`, `context`, `device_selector`, and `queue`.
1. A USM memory manager to create Python objects that use SYCL USM for data allocation.


Dpctl is available as part of the oneAPI Intel Distribution of Python (IDP). Once oneAPI is installed, dpctl can be used after setting up the IDP environment included with oneAPI.

## Managing SYCL devices using dpctl

### dpctl.device_context()
Yields a SYCL queue corresponding to the input filter string.

This context manager “activates”, i.e., sets as the currently usable queue, the SYCL queue defined by the “backend:device type:device id” tuple. The activated queue is yielded by the context manager and can also be accessed by any subsequent call to dpctl.get_current_queue() inside the context manager’s scope. The yielded queue is removed as the currently usable queue on exiting the context manager.

To create a scope within which the OpenCL GPU queue is the currently active queue, do the following.
```
import dpctl
with dpctl.device_context("opencl:gpu"):
    pass
```

### Classes

* dpctl.SyclContext : A Python class representing cl::sycl::context
* dpctl.SyclDevice : A Python class representing cl::sycl::device
* dpctl.SyclEvent : A Python class representing cl::sycl::event
* dpctl.SyclPlatform : A Python class representing cl::sycl::platform
* dpctl.SyclQueue : A Python class representing cl::sycl::queue

## dpctl SyclDevice

This is the Python equivalent of the cl::sycl::device class.
There are two ways of creating a SyclDevice instance:

* By directly passing in a filter string to the class constructor. The filter string needs to conform to the [DPC++ filter selector SYCL extension.](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/FilterSelector/FilterSelector.adoc "Filter Selection")

```
import dpctl

# Create a SyclDevice with an explicit filter string,
# in this case the first level_zero gpu device.
level_zero_gpu = dpctl.SyclDevice("level_zero:gpu:0")
level_zero_gpu.print_device_info()
```

* The other way is by calling one of the device selector helper functions as shown below

```
import dpctl

# Create a SyclDevice of type GPU based on whatever is returned
# by the SYCL `gpu_selector` device selector class.
# d = dpctl.select_cpu_device()
# d = dpctl.select_accelerator_device()
# d = dpctl.select_host_device()
# d = dpctl.select_default_device()
d = dpctl.select_gpu_device()
d.print_device_info()

```

* dpctl.get_devices(backend=backend_type.all, device_type=device_type_t.all) returns a list of dpctl.SyclDevice instances selected based on the given dpctl.device_type and dpctl.backend_type values.

    * backend (optional) – Defaults to dpctl.backend_type.all. A dpctl.backend_type enum value or a string that specifies a SYCL backend. Currently, accepted values are: “cuda”, “opencl”, “level_zero”, or “all”.

    * device_type (optional) – Defaults to dpctl.device_type.all. A dpctl.device_type enum value or a string that specifies a SYCL device type. Currently, accepted values are: “gpu”, “cpu”, “accelerator”, “host_device”, or “all”.
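
Filter strings such as "level_zero:gpu:0" pack a backend, a device type, and a device number into colon-separated fields, and a string such as "gpu,cpu" lists alternatives. The sketch below illustrates that structure in plain Python. The helper `parse_filter` and the two value sets are assumptions made for illustration (the sets are taken from the lists above and may not be exhaustive); this is not how dpctl or the SYCL runtime parses filters.

```python
# Illustrative value sets (assumption: the real extension may accept more).
BACKENDS = {"cuda", "opencl", "level_zero"}
DEVICE_TYPES = {"gpu", "cpu", "accelerator"}


def parse_filter(filter_string):
    """Split comma-separated 'backend:device_type:device_num' alternatives.

    Every field is optional; a missing field is reported as None.
    """
    selectors = []
    for alternative in filter_string.split(","):
        spec = {"backend": None, "device_type": None, "device_num": None}
        for token in alternative.strip().lower().split(":"):
            if token in BACKENDS:
                spec["backend"] = token
            elif token in DEVICE_TYPES:
                spec["device_type"] = token
            elif token.isdigit():
                spec["device_num"] = int(token)
            else:
                raise ValueError("unrecognized filter field: {!r}".format(token))
        selectors.append(spec)
    return selectors


print(parse_filter("level_zero:gpu:0"))
print(parse_filter("gpu,cpu"))
```

The second call returns two selectors in order, mirroring the "GPU if present, otherwise CPU" reading of "gpu,cpu".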


## dpctl sample code

The example below uses the dpctl API to select SYCL devices in several ways and to print the properties of each selected device.

The code below demonstrates dpctl device selection. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/simple_dpctl_queue.py

#                      Data Parallel Control (dpctl)
#
# Copyright 2020-2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Examples illustrating SYCL device selection features provided by dpctl.
"""

import dpctl


def print_device(d):
    "Display information about given device argument."
    if type(d) is not dpctl.SyclDevice:
        raise ValueError
    print("Name: ", d.name)
    print("Vendor: ", d.vendor)
    print("Driver version: ", d.driver_version)
    print("Backend: ", d.backend)
    print("Max EU: ", d.max_compute_units)


def create_default_device():
    """
    Create default SyclDevice using `cl::sycl::default_selector`.
    Device created can be influenced by environment variable
    SYCL_DEVICE_FILTER, which determines SYCL devices seen by the
    SYCL runtime.
    """
    d1 = dpctl.SyclDevice()
    d2 = dpctl.select_default_device()
    assert d1 == d2
    print_device(d1)
    return d1


def create_gpu_device():
    """
    Create a GPU device.
    Device created can be influenced by environment variable
    SYCL_DEVICE_FILTER, which determines SYCL devices seen by the
    SYCL runtime.
    """
    d1 = dpctl.SyclDevice("gpu")
    d2 = dpctl.select_gpu_device()
    assert d1 == d2
    print_device(d1)
    return d1


def create_gpu_device_if_present():
    """
    Select from union of two selections using default_selector.
    If a GPU device is available, it will be selected, if not,
    a CPU device will be selected, if available, otherwise an error
    will be raised.
    Device created can be influenced by environment variable
    SYCL_DEVICE_FILTER, which determines SYCL devices seen by the
    SYCL runtime.
    """
    d = dpctl.SyclDevice("gpu,cpu")
    print("Selected " + ("GPU" if d.is_gpu else "CPU") + " device")


def custom_select_device():
    """
    Programmatically select among available devices.
    Device created can be influenced by environment variable
    SYCL_DEVICE_FILTER, which determines SYCL devices seen by the
    SYCL runtime.
    """
    # select devices that support half-precision computation
    devs = [d for d in dpctl.get_devices() if d.has_aspect_fp16]
    # choose the device with highest default_selector score
    max_score = 0
    selected_dev = None
    for d in devs:
        if d.default_selector_score > max_score:
            max_score = d.default_selector_score
            selected_dev = d
    if selected_dev:
        print_device(selected_dev)
    else:
        print("No device with half-precision support is available.")
    return selected_dev


if __name__ == "__main__":
    create_default_device()
    create_gpu_device()
    create_gpu_device_if_present()
    custom_select_device()

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_dpctl_queue.sh; if [ -x "$(command -v qsub)" ]; then ./q run_dpctl_queue.sh; else ./run_dpctl_queue.sh; fi
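
The selection loop in `custom_select_device` can also be written with the built-in `max`. The sketch below uses made-up dictionaries in place of dpctl.SyclDevice objects, so the device names and scores are purely hypothetical; only the selection logic is the point.

```python
# Hypothetical device records standing in for dpctl.SyclDevice objects.
devices = [
    {"name": "cpu-0", "has_aspect_fp16": False, "default_selector_score": 300},
    {"name": "gpu-0", "has_aspect_fp16": True, "default_selector_score": 500},
    {"name": "gpu-1", "has_aspect_fp16": True, "default_selector_score": 400},
]


def select_fp16_device(devs):
    """Return the fp16-capable record with the highest positive score."""
    candidates = [
        d
        for d in devs
        if d["has_aspect_fp16"] and d["default_selector_score"] > 0
    ]
    # default=None mirrors the "no suitable device" branch of the loop.
    return max(candidates, key=lambda d: d["default_selector_score"], default=None)


selected = select_fp16_device(devices)
print(selected["name"] if selected else "No device with half-precision support")
```

Like the original loop, this keeps the first device among those that share the highest score and returns `None` when nothing qualifies.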

## dpctl SyclQueue

dpctl.SyclQueue is a Python class representing cl::sycl::queue. There are multiple ways to create a dpctl.SyclQueue object:

* Invoking the constructor with no arguments creates a queue for the device chosen by the default selector.


```
import dpctl

# Create a default SyclQueue
q = dpctl.SyclQueue()
print(q.sycl_device)
```

* Invoking the constructor with a filter selector string creates a queue for the device corresponding to that filter string.

```
import dpctl

# Create in-order SyclQueue for either gpu, or cpu device
q = dpctl.SyclQueue("gpu,cpu", property="in_order")
print([q.sycl_device.is_gpu, q.sycl_device.is_cpu])
```

* Invoking the constructor with a dpctl.SyclDevice object creates a queue for that device, automatically finding/creating a dpctl.SyclContext for the given device.

```
import dpctl

d = dpctl.SyclDevice("gpu")
q = dpctl.SyclQueue(d)
ctx = q.sycl_context
print(q.sycl_device == d)
print(any([ d == ctx_d for ctx_d in ctx.get_devices()]))
```

The example below shows several ways of creating a dpctl queue.

The code below demonstrates dpctl queue creation. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/dpctl_queue_2.py

#                      Data Parallel Control (dpctl)
#
# Copyright 2020-2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import dpctl


def create_default_queue():
    """Create a queue from default selector."""
    q = dpctl.SyclQueue()
    # Queue is out-of-order by default
    print("Queue {} is in order: {}".format(q, q.is_in_order))


def create_queue_from_filter_selector():
    """Create queue for a GPU device or,
    if it is not available, for a CPU device.
    Create in-order queue with profiling enabled.
    """
    q = dpctl.SyclQueue("gpu,cpu", property=("in_order", "enable_profiling"))
    print("Queue {} is in order: {}".format(q, q.is_in_order))
    # display the device used
    print("Device targeted by the queue:")
    q.sycl_device.print_device_info()


def create_queue_from_device():
    """
    Create a queue from SyclDevice instance.
    """
    cpu_d = dpctl.SyclDevice("opencl:cpu:0")
    q = dpctl.SyclQueue(cpu_d, property="enable_profiling")
    assert q.sycl_device == cpu_d
    print(
        "Number of devices in SyclContext " "associated with the queue: ",
        q.sycl_context.device_count,
    )


def create_queue_from_subdevice():
    """
    Create a queue from a sub-device.
    """
    cpu_d = dpctl.SyclDevice("opencl:cpu:0")
    sub_devs = cpu_d.create_sub_devices(partition=4)
    q = dpctl.SyclQueue(sub_devs[0])
    # a single-device context is created automatically
    print(
        "Number of devices in SyclContext " "associated with the queue: ",
        q.sycl_context.device_count,
    )


def create_queue_from_subdevice_multidevice_context():
    """
    Create a queue from a sub-device, using a context shared by all sub-devices.
    """
    cpu_d = dpctl.SyclDevice("opencl:cpu:0")
    sub_devs = cpu_d.create_sub_devices(partition=4)
    ctx = dpctl.SyclContext(sub_devs)
    q = dpctl.SyclQueue(ctx, sub_devs[0])
    # the queue uses the explicitly created multi-device context
    print(
        "Number of devices in SyclContext " "associated with the queue: ",
        q.sycl_context.device_count,
    )


if __name__ == "__main__":
    create_default_queue()
    create_queue_from_filter_selector()
    create_queue_from_device()
    create_queue_from_subdevice()
    create_queue_from_subdevice_multidevice_context()   

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_dpctl_queue2.sh; if [ -x "$(command -v qsub)" ]; then ./q run_dpctl_queue2.sh; else ./run_dpctl_queue2.sh; fi
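
Both sub-device examples call `create_sub_devices(partition=4)`. Assuming an integer partition requests equal partitioning in the sense of SYCL's `partition_equally` (each sub-device receives that many compute units and any remainder is left unassigned), the number of sub-devices to expect can be estimated with integer division. The helper name `equal_partition` is hypothetical.

```python
def equal_partition(max_compute_units, count):
    """Return (number of sub-devices, compute units left unassigned).

    Assumes equal partitioning with `count` compute units per sub-device.
    """
    if count <= 0:
        raise ValueError("count must be a positive integer")
    return divmod(max_compute_units, count)


# Example: a device reporting 10 compute units, split into groups of 4.
n_sub_devices, leftover = equal_partition(10, 4)
print(n_sub_devices, leftover)
```

On a real device, `max_compute_units` would come from the `max_compute_units` property used in `print_device` earlier in this module.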

## Unified Shared Memory

Unified Shared Memory (USM) is a DPC++ tool for data management. USM is a __pointer-based approach__ that should be familiar to C and C++ programmers who use malloc or new to allocate data. USM __simplifies development__ for the programmer when __porting existing C/C++ code__ to DPC++.

### Developer view of USM
The picture below shows the __developer view of memory__ without USM and with USM. With USM, the developer can reference the same memory object in host and device code.

<img src="Assets/usm.png">

## Managing SYCL USM memory using dpctl.memory


dpctl.memory provides Python objects that act as untyped containers of bytes for each kind of USM pointer: shared, device, and host. Shared and host pointers are accessible from both the host and a device, while device pointers are accessible only from the device. Python objects corresponding to shared and host pointers implement the Python buffer protocol, so USM memory can be manipulated through NumPy arrays or the bytearray, memoryview, and array.array classes.

* dpctl.memory.MemoryUSMDevice: Allocates nbytes of USM device memory only accessible from the device.
* dpctl.memory.MemoryUSMHost: Allocates nbytes of USM host memory accessible from both host and a device.
* dpctl.memory.MemoryUSMShared: Allocates nbytes of USM shared memory accessible from both host and a device.
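
Because host-accessible USM objects expose the buffer protocol, other Python objects can view the same bytes without copying. The sketch below shows the mechanism with the standard library only; a `bytearray` stands in for a host-accessible USM allocation (an assumption made so that the example runs without a SYCL device).

```python
import array

nbytes = 32
# Stand-in for a host-accessible USM allocation of nbytes bytes.
usm_like = bytearray(nbytes)

# A memoryview shares the allocation; cast() reinterprets the raw bytes
# as 8-byte floats, so 32 bytes become 4 elements.
view = memoryview(usm_like).cast("d")
view[0] = 1.5
view[3] = -2.0

# array.array copies the bytes out instead of sharing them.
snapshot = array.array("d", usm_like)

# Writing through the view changes the underlying bytes, not the copy.
view[1] = 7.0

print(len(view), snapshot.tolist(), list(view))
```

The same distinction between a zero-copy view and a copy applies when a NumPy array is built with `buffer=` on top of a USM-shared object, as the sample code in this section does.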


| Type | Class | Description | Accessible on Host | Accessible on Device |
|:---|:---|:---|:---:|:---:|
| Device | MemoryUSMDevice | Allocation on device (explicit) | NO | YES |
| Host | MemoryUSMHost | Allocation on host (implicit) | YES | YES |
| Shared | MemoryUSMShared | Allocation can migrate between host and device (implicit) | YES | YES |



The following methods are commonly used with the above classes:
* copy_from_device(): Copy SYCL memory underlying the argument object into the memory of the instance.
* copy_from_host(): Copy content of Python buffer provided by obj to instance memory.
* copy_to_host(): Copy content of instance’s memory into memory of obj.

### dpctl Memory API
The code below demonstrates usage of the dpctl Memory API. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/simple_dpctl.py

#                      Data Parallel Control (dpctl)
#
# Copyright 2020-2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Demonstrates host to device copy functions using dpctl.memory.
"""

import numpy as np

import dpctl.memory as dpmem

ms = dpmem.MemoryUSMShared(32)
md = dpmem.MemoryUSMDevice(32)

host_buf = np.random.randint(0, 42, dtype=np.uint8, size=32)

# copy host byte-like object to USM-device buffer
md.copy_from_host(host_buf)

# copy USM-device buffer to USM-shared buffer in parallel using
# sycl::queue::memcpy.
ms.copy_from_device(md)

# build numpy array reusing host-accessible USM-shared memory
X = np.ndarray((len(ms),), buffer=ms, dtype=np.uint8)

# Display Python object NumPy ndarray is viewing into
print("numpy.ndarray.base: ", X.base)
print("")

# Print content of the view
print("View..........: ", X)

# Print content of the original host buffer
print("host_buf......: ", host_buf)

# use copy_to_host to retrieve memory of USM-device memory
print("copy_to_host(): ", md.copy_to_host())



#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_simple_dpctl.sh; if [ -x "$(command -v qsub)" ]; then ./q run_simple_dpctl.sh; else ./run_simple_dpctl.sh; fi

### Memory Management in DPPY

numba-dppy uses DPC++'s USM shared memory allocator (memory_alloc) to transfer data between host and device in both directions. Using the USM shared memory allocator gives seamless interoperability between numba-dppy and other SYCL-based Python extensions, and across multiple kernels written with the numba_dppy.kernel decorator.

numba-dppy uses the USM memory manager provided by dpctl and supports the SYCL USM Array Interface protocol to enable zero-copy data exchange across USM memory-backed Python objects.

USM pointers make sense within a SYCL context and can be of four allocation types: host, device, shared, or unknown. Host applications, including CPython interpreter, can work with USM pointers of type host and shared as if they were ordinary host pointers. Accessing device USM pointers by host applications is not allowed.

The code below demonstrates USM memory management. Inspect the code; no modifications are necessary:
1. Inspect the code cell below and click run ▶ to save the code to file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/dpctl_mem_sample.py
# Copyright 2020, 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from time import time
from math import sqrt
import numpy as np
import argparse
import numba_dppy as dppy
import dpctl
import dpctl.memory as dpctl_mem

parser = argparse.ArgumentParser(description="Program to compute pairwise distance")

parser.add_argument("-n", type=int, default=10, help="Number of points")
parser.add_argument("-d", type=int, default=3, help="Dimensions")
parser.add_argument("-r", type=int, default=1, help="repeat")
parser.add_argument("-l", type=int, default=1, help="local_work_size")

args = parser.parse_args()

# Global work size is equal to the number of points
global_size = args.n
# Local Work size is optional
local_size = args.l

X = np.random.random((args.n, args.d))
D = np.empty((args.n, args.n))


@dppy.kernel
def pairwise_distance(X, D, xshape0, xshape1):
    """
    A Euclidean pairwise distance computation implemented as
    a ``kernel`` function.
    """
    idx = dppy.get_global_id(0)

    # for i in range(xshape0):
    for j in range(X.shape[0]):
        d = 0.0
        for k in range(X.shape[1]):
            tmp = X[idx, k] - X[j, k]
            d += tmp * tmp
        D[idx, j] = sqrt(d)


def driver():
    # measure running time
    times = list()

    xbuf = dpctl_mem.MemoryUSMShared(X.size * X.dtype.itemsize)
    x_ndarray = np.ndarray(X.shape, buffer=xbuf, dtype=X.dtype)
    np.copyto(x_ndarray, X)

    dbuf = dpctl_mem.MemoryUSMShared(D.size * D.dtype.itemsize)
    d_ndarray = np.ndarray(D.shape, buffer=dbuf, dtype=D.dtype)
    np.copyto(d_ndarray, D)

    for repeat in range(args.r):
        start = time()
        pairwise_distance[global_size, local_size](
            x_ndarray, d_ndarray, X.shape[0], X.shape[1]
        )
        end = time()

        total_time = end - start
        times.append(total_time)

    np.copyto(X, x_ndarray)
    np.copyto(D, d_ndarray)

    return times


def main():
    times = None

    # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
    # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
    device = dpctl.select_default_device()
    print("Using device ...")
    device.print_device_info()

    with dpctl.device_context(device):
        times = driver()

    times = np.asarray(times, dtype=np.float32)
    print("Average time of %d runs is = %fs" % (args.r, times.mean()))

    print("Done...")


if __name__ == "__main__":
    main()

#### Build and Run
Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_dpctl_mem.sh; if [ -x "$(command -v qsub)" ]; then ./q run_dpctl_mem.sh; else ./run_dpctl_mem.sh; fi
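
To check the kernel's output on small inputs, it can help to compare against a plain-Python reference. The sketch below repeats the arithmetic of `pairwise_distance` on the host with nested loops; the function name `pairwise_distance_reference` and the sample points are illustrative and not part of the sample.

```python
from math import sqrt


def pairwise_distance_reference(points):
    """Return the n x n matrix of Euclidean distances between points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):  # plays the role of the kernel's global id
        for j in range(n):
            d = 0.0
            for a, b in zip(points[i], points[j]):
                tmp = a - b
                d += tmp * tmp
            dist[i][j] = sqrt(d)
    return dist


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D_ref = pairwise_distance_reference(points)
for row in D_ref:
    print(row)
```

The outer loop over `i` corresponds to the work items that the kernel launches in parallel, one per point.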

# Summary
In this module you learned the following:

* The classes and functions of __Data Parallel Control (dpctl)__
* How to use the dpctl Memory Python API
* How to perform memory management in DPPY