www.locuz.com

# Intel oneAPI Series - Deep Dive (Live Demo and Hands-on): oneAPI and DPC++ on Intel DevCloud

Presented By: Mandeep Kumar



Converge to the Cloud

# Agenda



- Overview of Intel oneAPI and Data Parallel C++
- Introduction to the Intel DevCloud
- Setup of Intel DevCloud and JupyterLab Environment
- DPC++ Program Structure
- Demonstration of Intel VTune Profiler on Intel DevCloud
- Demonstration of Intel DPC++ Compatibility Tool on Intel DevCloud



# **Learning Objectives**



- Articulate how oneAPI can help to solve the challenges of programming in a heterogeneous world
- Understand the DPC++ language and programming model
- Profile a DPC++ application using Intel VTune Profiler on Intel DevCloud
- Learn how to migrate CUDA code to Data Parallel C++ using the Intel DPC++ Compatibility tool



# Programming Challenges for Multiple Architectures



- Growth in specialized workloads
- Variety of data-centric hardware required
- Separate programming models and toolchains for each architecture are required today
- Software development complexity limits freedom of architectural choice





# Introducing oneAPI



- Cross-architecture programming that delivers freedom to choose the best hardware
- Based on industry standards and open specifications
- Exposes cutting-edge performance features of latest hardware
- Compatible with existing high-performance languages and programming models including C++, OpenMP, Fortran, and MPI





# oneAPI Industry Initiative: Break the Chains of Proprietary Lock-in



- A cross-architecture language based on C++ and SYCL standards
- Powerful libraries designed for acceleration of domain-specific functions
- Low-level hardware abstraction layer
- Open to promote community and industry collaboration
- Enables code reuse across architectures and vendors







# Data Parallel C++: Standards-based, Cross-architecture Language



### Parallelism, productivity and performance for CPUs and Accelerators

- Delivers accelerated computing by exposing hardware features
- Allows code reuse across hardware targets, while permitting custom tuning for specific accelerators
- Provides an open, cross-industry alternative to single-architecture proprietary lock-in

### Based on C++ and SYCL

- Delivers C++ productivity benefits, using common, familiar C and C++ constructs
- Incorporates SYCL from the Khronos Group to support data parallelism and heterogeneous programming

### Community Project to drive language enhancements

- Provides extensions to simplify data parallel programming
- Continues evolution through open and cooperative development

Apply your skills to the next innovation, not rewriting software for the next hardware platform





### Intel oneAPI Toolkits



A complete set of proven developer tools expanded from CPU to XPU



# Intel® oneAPI Base Toolkit

**Native Code Developers** 



A core set of high-performance tools for building C++, Data Parallel C++ applications & oneAPI library-based applications

# Add-on Domain-specific Toolkits

Specialized Workloads



# Intel® oneAPI Tools for HPC

Deliver fast Fortran, OpenMP & MPI applications that scale



### Intel® oneAPI Tools for IoT

Build efficient, reliable solutions that run at network's edge



### Intel® oneAPI Rendering Toolkit

Create performant, high-fidelity visualization applications

# Toolkits powered by oneAPI

Data Scientists & AI Developers



### Intel® AI Analytics Toolkit

Accelerate machine learning & data science pipelines with optimized DL frameworks & high-performing Python libraries



### Intel® Distribution of OpenVINO<sup>TM</sup> Toolkit

Deploy high performance inference & applications from edge to cloud



# Transition from Parallel Studio XE with <u>no</u> Disruption
















# Intel oneAPI Toolkits Free Availability



### **Get Started Quickly**

Code Samples, Quick-start Guides, Webinars, Training

software.intel.com/oneapi







### oneAPI Available on: Intel DevCloud



A development sandbox to develop, test and run workloads across a range of Intel CPUs, GPUs, and FPGAs using Intel's oneAPI software.

### Get Up & Running In Seconds!

software.intel.com/devcloud/oneapi





### Create an Intel DevCloud Account



### software.intel.com/devcloud/oneapi





### Subject: Intel® DevCloud Account Confirmation - Email Verification

Thank you for registering for an Intel® DevCloud Account.

Please verify your email address by clicking the link below. The link will expire in 5 days.

### Verify your email


### Subject: Welcome to Intel® DevCloud

Your Intel® DevCloud account has been activated! You are now able to develop, test, and run your workloads across a range of Intel® CPUs, GPUs, FPGAs and Edge Devices using the latest Intel® software.

Access a variety of tools to get started by visiting https://software.intel.com/devcloud
Your access to Intel DevCloud expires on

You can extend your access within 30 days of expiration. For your privacy, your account and data will be deleted upon expiration, so make sure to back up any data you wish to preserve before then.

Thank you for using Intel® DevCloud.




# Access a variety of tools to get started



### devcloud.intel.com/oneapi



### and FPGA Architectures

- Intel® Xeon® Scalable 6128 processors
- Intel® Xeon® Scalable 8256 processors
- Intel® Xeon® E-2176 P630 processors (with Intel® Graphics Technology)

- Intel® Iris® Xe MAX

- Intel® Arria® 10 FPGAs
- Intel® Stratix® 10 FPGAs

### What You Get

- Free access to Intel® oneAPI toolkits and components and the latest Intel® hardware
- 220 GB of file storage
- 120 days of access (extensions available)
- Terminal Interface (Linux\*)
- Microsoft Visual Studio\* Code integration
- Remote Desktop for Intel® oneAPI Rendering Toolkit

### Why oneAPI?

- Freedom of choice for accelerated computing across multiple architectures: CPU, GPU, and FPGA
- An open alternative to proprietary lock-in
- Data Parallel C++ (DPC++): an open, standards-based evolution of ISO C++ and Khronos SYCL\*
- Optimized libraries for API-based programming
- Advanced analysis and debug tools
- CUDA\* source code migration
- Additional support for OpenCL and RTL development on



# **Choose your Connection Method**









# **Setup of Intel DevCloud and JupyterLab Environment**



# Launch JupyterLab and select Terminal














### Connect with Linux/macOS SSH Client



### devcloud.intel.com/oneapi/documentation/connect-with-ssh-linux-macos/





# A Complete DPC++ Program



### Single source

 Host code and heterogeneous accelerator kernels can be mixed in same source files

### Familiar C++

 Library constructs add functionality, such as:

| Construct     | Purpose         |
|---------------|-----------------|
| queue         | Work targeting  |
| malloc_shared | Data management |
| parallel_for  | Parallelism     |

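In the listing below, the lambda passed to `parallel_for` is accelerator device code; everything else is host code.

```
#include <CL/sycl.hpp>
#include <iostream>
constexpr int N = 16;
using namespace sycl;

int main() {
  // Host code: create a queue on the default device
  queue q;
  // Host code: allocate memory shared between host and device
  int *data = malloc_shared<int>(N, q);
  // Accelerator device code: one kernel instance per index
  q.parallel_for(N, [=](auto i) {
      data[i] = i;
  }).wait();
  // Host code: read the results and release the allocation
  for (int i = 0; i < N; i++) std::cout << data[i] << "\n";
  free(data, q);
  return 0;
}
```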





# **SYCL Classes**



### Device



- The device class represents the capabilities of the accelerators in a oneAPI system.
- The device class contains member functions for querying information about the device, which is useful for DPC++ programs where multiple devices are created.
- The function get\_info gives information about the device:

```
queue q;
device my_device = q.get_device();
std::cout << "Device: " << my_device.get_info<info::device::name>() << std::endl;
```



### **Device Selector**



- The device\_selector class enables the runtime selection of a particular device to execute kernels based upon user-provided heuristics.
- The following code sample shows use of the standard device selectors (default\_selector, cpu\_selector, gpu\_selector, ...). A derived device\_selector is shown in the Custom Device Selector section.

```
default_selector selector;
// host_selector selector;
// cpu_selector selector;
// gpu_selector selector;
queue q(selector);
std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;
```



### Queue



- A queue is the mechanism through which work is submitted to a device.
- A queue submits command groups to be executed by the SYCL runtime.
- A queue maps to exactly one device; multiple queues can map to the same device.

```
queue q;

q.submit([&](handler& h) {
    // COMMAND GROUP CODE
});
```



# Choosing Where Device Kernels Run



### Work is submitted to queues

- Each queue is associated with exactly one device (e.g. a specific GPU or FPGA)
- You can:
  - Decide which device a queue is associated with (if you want)
  - Have as many queues as desired for dispatching work in heterogeneous systems

| Goal | Code |
|------|------|
| Create a queue targeting any device | `queue();` |
| Create a queue targeting a pre-configured class of devices | `queue(cpu_selector{});`<br>`queue(gpu_selector{});`<br>`queue(intel::fpga_selector{});`<br>`queue(accelerator_selector{});`<br>`queue(host_selector{});` (always available) |
| Create a queue targeting a specific device (custom criteria) | `class custom_selector : public device_selector { int operator()(const device& dev) const override { /* any logic you want */ } };`<br>`queue(custom_selector{});` |



### Kernel



- The kernel class encapsulates methods and data for executing code on the device when a command group is instantiated.
- A kernel object is not explicitly constructed by the user.
- A kernel object is constructed when a kernel dispatch function, such as parallel\_for, is called.

```
q.submit([&](handler& h) {
    h.parallel_for(range<1>(N), [=](id<1> i) {
        A[i] = B[i] + C[i];
    });
});
```



# DPC++ language and runtime



- The DPC++ language and runtime consist of a set of C++ classes, templates, and libraries
- Application scope and command group scope:
  - Code that executes on the host
  - The full capabilities of C++ are available at application and command group scope
- Kernel scope:
  - Code that executes on the device.
  - At kernel scope there are limitations in accepted C++



### Parallel Kernels



- A parallel kernel allows multiple instances of an operation to execute in parallel.
- It is useful for offloading a basic for-loop in which each iteration is completely independent and the iterations can execute in any order.
- Parallel kernels are expressed using the parallel\_for function.

### for-loop in CPU application

```
for(int i=0; i < 1024; i++){
    a[i] = b[i] + c[i];
}
```

### Offload to accelerator using parallel_for

```
h.parallel_for(range<1>(1024), [=](id<1> i){
    A[i] = B[i] + C[i];
});
```
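The loop can be offloaded only because its iterations are independent. As a rough illustration, the following host-only sketch in plain C++ (no SYCL runtime involved; `run_in_order` and `vector_add` are hypothetical helper names introduced here, not part of DPC++) runs the same per-index body forwards and backwards and arrives at the same result:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: calls body(i) once for every index listed in `order`.
// A real device may run these calls concurrently and in any order.
template <typename Body>
void run_in_order(const std::vector<std::size_t>& order, Body body) {
  for (std::size_t i : order) body(i);
}

// Element-wise a[i] = b[i] + c[i], visiting indices forwards or backwards.
std::vector<int> vector_add(const std::vector<int>& b,
                            const std::vector<int>& c,
                            bool reversed) {
  assert(b.size() == c.size());  // both inputs must have the same length
  std::vector<std::size_t> order(b.size());
  std::iota(order.begin(), order.end(), std::size_t{0});  // 0, 1, 2, ...
  if (reversed) std::reverse(order.begin(), order.end());
  std::vector<int> a(b.size());
  // Each call touches only element i, so the visiting order cannot matter.
  run_in_order(order, [&](std::size_t i) { a[i] = b[i] + c[i]; });
  return a;
}
```

If the body read or wrote a neighbouring element (for example `a[i] = a[i - 1] + b[i]`), the two orders would disagree, and the loop would not be a candidate for a basic parallel kernel.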



### **Basic Parallel Kernels**



The functionality of basic parallel kernels is exposed via the range, id, and item classes:

- The range class describes the iteration space of a parallel execution.
- The id class indexes an individual instance of a kernel in a parallel execution.
- The item class represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range.
```
h.parallel_for(range<1>(1024), [=](id<1> idx){

// CODE THAT RUNS ON DEVICE
});
```

```
h.parallel_for(range<1>(1024), [=](item<1> item){
    auto idx = item.get_id();
    auto R = item.get_range();
    // CODE THAT RUNS ON DEVICE
});
```
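The relationship between a range, an id, and an item carries over to more than one dimension. The sketch below is a host-only model in plain C++, not the SYCL classes themselves: `Range2`, `Id2`, `Item2`, and `enumerate_items` are hypothetical stand-ins introduced here. It assumes the row-major linearization that SYCL uses for two dimensions, where the last dimension varies fastest.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the roles played by range<2> and id<2>.
using Range2 = std::array<std::size_t, 2>;
using Id2 = std::array<std::size_t, 2>;

// Hypothetical stand-in for item<2>: one kernel instance plus its range.
struct Item2 {
  Id2 id;
  Range2 range;
  // Row-major linear index: dimension 1 varies fastest.
  std::size_t linear_id() const {
    assert(id[0] < range[0] && id[1] < range[1]);
    return id[0] * range[1] + id[1];
  }
};

// Lists every item of a 2-D iteration space in row-major order.
std::vector<Item2> enumerate_items(Range2 r) {
  std::vector<Item2> items;
  items.reserve(r[0] * r[1]);
  for (std::size_t i = 0; i < r[0]; ++i)
    for (std::size_t j = 0; j < r[1]; ++j)
      items.push_back(Item2{{i, j}, r});
  return items;
}
```

For a 2 x 3 range this yields six items, and the item with id (1, 1) has linear index 1 * 3 + 1 = 4.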



### ND-Range Kernels



Basic parallel kernels are an easy way to parallelize a for-loop, but they do not allow performance optimization at the hardware level.

An ND-range kernel is another way to express parallelism; it enables low-level performance tuning by providing access to local memory and by mapping executions to compute units on the hardware.

- The entire iteration space is divided into smaller groups called work-groups. Work-items within a work-group are scheduled on a single compute unit on the hardware.
- Grouping kernel executions into work-groups allows control of resource usage and load-balanced work distribution.



**ND-Range** 



### ND-Range Kernels



The functionality of nd\_range kernels is exposed via the nd\_range and nd\_item classes:

- The nd\_range class represents a grouped execution range, using the global execution range and the local execution range of each work-group.
- The nd\_item class represents an individual instance of a kernel function and allows querying the work-group range and index.
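The arithmetic behind this grouping can be sketched on the host in plain C++. This is only a model of the one-dimensional case, not the SYCL API: `NdItem1`, `group_count`, and `decompose` are hypothetical names introduced here, and the sketch assumes the global size is an exact multiple of the work-group size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for what a 1-D nd_item reports to one work-item.
struct NdItem1 {
  std::size_t global_id;  // index within the whole iteration space
  std::size_t group_id;   // which work-group the work-item belongs to
  std::size_t local_id;   // position inside that work-group
};

// Number of work-groups for a given global and local (work-group) size.
std::size_t group_count(std::size_t global_size, std::size_t local_size) {
  assert(local_size > 0 && global_size % local_size == 0);
  return global_size / local_size;
}

// Splits a global index into its work-group index and local index.
NdItem1 decompose(std::size_t global_id, std::size_t local_size) {
  assert(local_size > 0);
  return NdItem1{global_id, global_id / local_size, global_id % local_size};
}
```

With a global size of 1024 and a work-group size of 64 there are 16 work-groups, and global index 130 lands in work-group 2 at local index 2, since 2 * 64 + 2 = 130.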



# **Buffer Memory Model**



**Buffers:** Encapsulate data in a SYCL application

Across both devices and host!

Accessors: Mechanism to access buffer data

 Create data dependencies in the SYCL graph that order kernel executions

```
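queue q;
std::vector<int> v(N, 10);  // N is assumed to be defined, as in the earlier listing
{
    buffer buf(v);          // the buffer takes over the vector's data
    q.submit([&](handler& h) {
        accessor a(buf, h, write_only);
        h.parallel_for(N, [=](auto i) { a[i] = i; });
    });
}                           // buffer destruction waits for the kernel and writes back to v
for (int i = 0; i < N; i++) std::cout << v[i] << " ";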
```
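The scoped block in the listing matters: the host vector only reliably holds the results once the buffer has been destroyed. The following host-only C++ sketch models that write-back-on-destruction behaviour. It is not the SYCL implementation; `WriteBackBuffer` and `fill_through_buffer` are hypothetical names introduced here, and a private copy stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of a buffer that owns a working copy of the
// data and copies it back to the host container when it is destroyed.
class WriteBackBuffer {
 public:
  explicit WriteBackBuffer(std::vector<int>& host)
      : host_(host), device_copy_(host) {}
  ~WriteBackBuffer() { host_ = device_copy_; }  // write-back happens here
  WriteBackBuffer(const WriteBackBuffer&) = delete;
  WriteBackBuffer& operator=(const WriteBackBuffer&) = delete;

  // Plays the role of an accessor: all reads and writes go to the copy.
  int& operator[](std::size_t i) {
    assert(i < device_copy_.size());
    return device_copy_[i];
  }
  std::size_t size() const { return device_copy_.size(); }

 private:
  std::vector<int>& host_;
  std::vector<int> device_copy_;
};

// Writes 0, 1, 2, ... through the buffer and returns what the host vector
// contained just before the buffer went out of scope.
std::vector<int> fill_through_buffer(std::vector<int>& v) {
  std::vector<int> seen_before_write_back;
  {
    WriteBackBuffer buf(v);
    for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = static_cast<int>(i);
    seen_before_write_back = v;  // still the old values in this model
  }  // buf destroyed: the results are copied back into v
  return seen_before_write_back;
}
```

In this model the host vector still shows its initial values while the buffer is alive and shows 0, 1, 2, ... afterwards. In real SYCL code the host data should simply not be touched while the buffer exists.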



# **DPC++ Code Anatomy**



- DPC++ programs must include the `CL/sycl.hpp` header.
- A `using namespace sycl;` directive is recommended to avoid repeatedly qualifying names with `sycl::`.

```
#include <CL/sycl.hpp>
using namespace sycl;
```



### DPC++ Code Anatomy



```
// N is assumed to be defined elsewhere, as in the earlier listings
void dpcpp_code(int* a, int* b, int* c) {
  // Step 1: set up a DPC++ device queue
  queue q;
  // Step 2: set up buffers for the input and output vectors
  buffer buf_a(a, range<1>(N));
  buffer buf_b(b, range<1>(N));
  buffer buf_c(c, range<1>(N));
  // Step 3: submit a command group function object to the queue
  q.submit([&](handler &h) {
    // Step 4: create device accessors to buffers allocated in global memory
    accessor A(buf_a, h, read_only);
    accessor B(buf_b, h, read_only);
    accessor C(buf_c, h, write_only);
    // Steps 5 and 6: specify the device kernel body as a lambda function
    h.parallel_for(range<1>(N), [=](auto i) {
      C[i] = A[i] + B[i];
    });
  });
}  // buf_c is destroyed here, which copies the results back to c
```

Step 1: create a device queue (the developer can specify a device type via a device selector or use the default selector)

Step 2: create buffers (they represent both host and device memory)

Step 3: submit a command group for (asynchronous) execution

Step 4: create buffer accessors to access buffer data on the device

Step 5: send a kernel (lambda) for execution

Step 6: write a kernel

- Kernel invocations are executed in parallel.
- The kernel is invoked for each element of the range.
- Each kernel invocation has access to its invocation id.

Done!

The results are copied to vector `c` when the `buf_c` buffer is destroyed.



### **Custom Device Selector**



- The following code shows a class derived from **device\_selector** that implements a device selection heuristic. It prioritizes an Intel GPU because the integer rating returned for it is higher than the rating for any other GPU, a CPU, or another accelerator.

```
#include <CL/sycl.hpp>
#include <iostream>
#include <string>
using namespace cl::sycl;

class my_device_selector : public device_selector {
public:
  // Return a rating for each device; the highest rating wins
  int operator()(const device& dev) const override {
    int rating = 0;
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find("Intel") != std::string::npos))
      rating = 3;  // Intel GPU
    else if (dev.is_gpu()) rating = 2;  // any other GPU
    else if (dev.is_cpu()) rating = 1;  // CPU
    return rating;
  }
};

int main() {
  my_device_selector selector;
  queue q(selector);
  std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;
  return 0;
}
```
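The rating heuristic can be exercised without any accelerator by modelling it in plain C++. The sketch below is an illustration only: `DeviceInfo`, `rate`, and `pick_device` are hypothetical names introduced here, the device names in the usage note are made up, and the tie-breaking rule (the earliest device wins) is an assumption of this sketch rather than documented runtime behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical device descriptor; a real program would query the device.
struct DeviceInfo {
  std::string name;
  bool is_gpu;
  bool is_cpu;
};

// Same rating heuristic as the selector above: Intel GPU > GPU > CPU > other.
int rate(const DeviceInfo& dev) {
  if (dev.is_gpu && dev.name.find("Intel") != std::string::npos) return 3;
  if (dev.is_gpu) return 2;
  if (dev.is_cpu) return 1;
  return 0;
}

// Returns the index of the highest-rated device.
// Assumption of this sketch: on a tie, the earliest device is kept.
std::size_t pick_device(const std::vector<DeviceInfo>& devices) {
  assert(!devices.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < devices.size(); ++i)
    if (rate(devices[i]) > rate(devices[best])) best = i;
  return best;
}
```

Given a list containing a CPU, a non-Intel GPU, and a GPU whose name contains "Intel", `pick_device` returns the index of the Intel GPU; if that entry is removed, it falls back to the remaining GPU.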





# **Hands-on Coding on Intel DevCloud**



### Intel VTune Profiler: DPC++ Profiling - Tune for CPU, GPU & FPGA



### Analyze Data Parallel C++ (DPC++)

See the lines of DPC++ that consume the most time

### Tune for Intel CPUs, GPUs & FPGAs

Optimize for any supported hardware accelerator

### **Optimize Offload**

Tune OpenMP offload performance

### Wide Range of Performance Profiles

CPU, GPU, FPGA, threading, memory, cache, storage...

### **Supports Popular Languages**

DPC++, C, C++, Fortran, Python, Go, Java, or a mix









# Hands-on Intel VTune Profiler on Intel DevCloud



# Intel DPC++ Compatibility Tool: Minimizes Code Migration Time



- Assists developers migrating code written in CUDA to DPC++ once, generating human readable code wherever possible
- ~80-90% of code typically migrates automatically
- Inline comments are provided to help developers finish porting the application

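# Intel DPC++ Compatibility Tool Usage Flow

Figure (usage flow): the developer's CUDA source code goes into the Compatibility Tool, which produces human-readable DPC++ with inline comments; the developer then completes the coding and tunes to the desired performance, yielding the final DPC++ source code.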





# Hands-on Intel DPC++ Compatibility Tool on Intel DevCloud







# Thanks!

