# Data Parallel C++

- Data Parallel C++ (DPC++) is a __unified programming language__ for __diverse architectures__ such as CPUs, GPUs, FPGAs, and other accelerators.
- DPC++ is based on __industry standards__ and __open specifications__.
- DPC++ = __C++__ + the __SYCL__ standard + extensions.
- DPC++ simplifies __expressing parallelism__.

# Hello World

Each code example does the following:
1. __allocate__ an array
2. __initialize__ each element of the array to its index
3. do a __computation__ on each element of the array (multiply by 2)
4. __print__ the output array

There are three code examples doing the above computation, using __C++__, __DPC++__, and __SYCL__:

1. The first code example uses __C++__ to allocate the data array, initialize the data, and compute using a `for` loop.
2. The second code example uses __DPC++__ to allocate memory using __Unified Shared Memory__, initialize the data, and compute by __offloading to an accelerator__ device (GPU) using `parallel_for`.
3. The third code example uses __SYCL__ to allocate and initialize the data, create a __buffer and accessor__ to move the data between host and device, and compute by __offloading to an accelerator__ device (GPU) using `parallel_for`.
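
The step that changes between the examples is the computation: the `for` loop body becomes a function that handles one index, and `parallel_for` invokes it for every index in a range. The sketch below models that idea in plain C++ so it compiles without a SYCL toolchain. `for_each_index` and `double_all` are hypothetical names introduced for illustration; the helper runs serially and is only a stand-in for `parallel_for`, not a description of how the SYCL runtime works.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of SYCL or DPC++): calls kernel(i) once for
// each index in [0, n). It runs serially here; parallel_for makes the same
// per-index calls but lets the runtime execute them concurrently on a device.
template <typename Kernel>
void for_each_index(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) kernel(i);
}

// Doubles every element by handing the loop body to for_each_index.
std::vector<float> double_all(std::vector<float> data) {
    float *p = data.data();
    // The kernel touches exactly one element and does not depend on any other
    // index, which is what makes it safe to run the calls in any order.
    for_each_index(data.size(), [=](std::size_t i) { p[i] *= 2; });
    return data;
}
```

Calling `double_all` on the values 0, 1, 2, 3 performs the same computation as the examples in this notebook. The property to notice is that the kernel is independent per index, which is the condition a loop body has to meet before it can be offloaded this way.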

## C++

#### Serial Computation on CPU


In [None]:
%%writefile simple_cpp.cpp
#include <cstdlib>
#include <iostream>

static const int N = 4;

int main(){
    //# Memory Allocation
    float *data = (float*)malloc(sizeof(float)*N);

    //# Initialization
    for(int i=0; i<N; i++) data[i] = i;

    //# Serial Computation
    for(int i=0; i<N; i++) data[i] *= 2;

    //# Print Output
    for(int i=0; i<N; i++) std::cout << data[i] << std::endl;

    free(data);
}


#### Build and Run

In [None]:
!./q run_hello_cpp.sh

## DPC++

#### Parallel Computation Offloaded to Device

The code below shows a DPC++ implementation of Hello World that uses Unified Shared Memory and other simplifications of the SYCL standard.

In [None]:
%%writefile simple_dpcpp.cpp
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

static const int N = 4;

int main(){
    //# define queue which has default device associated for offload
    queue q;
    std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    //# USM Allocation enables data access on host and device
    float *data = static_cast<float*>(malloc_shared(N * sizeof(float), q));

    //# Initialization
    for(int i=0; i<N; i++) data[i] = i;

    //# Offload parallel computation to the queue's device (a GPU, if that is the default device)
    q.parallel_for(range<1>(N), [=] (id<1> i){
        data[i] *= 2;
    }).wait();

    //# Print Output
    for(int i=0; i<N; i++) std::cout << data[i] << std::endl;

    free(data, q);
}


#### Build and Run

In [None]:
!./q run_hello_dpcpp.sh

## SYCL

#### Parallel Computation Offloaded to GPU

The code below shows a SYCL implementation of Hello World using buffers and accessors.


In [None]:
%%writefile simple_sycl.cpp
#include <CL/sycl.hpp>
#include <cstdlib>
#include <iostream>
using namespace cl::sycl;

static const int N = 4;

int main(){
    //# define queue which has default device associated for offload
    queue q;
    std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

    //# Memory Allocation
    float *data = (float*)malloc(sizeof(float)*N);
    
    //# Initialization
    for(int i=0; i<N; i++) data[i] = i;

    {
        //# Create buffer; the host pointer must not be used while the buffer is alive
        buffer<float> data_buffer(data, range<1>(N));

        //# Submit command group to the queue's device
        q.submit([&](handler &h){
            //# Create accessor to request read-write access to the buffer on the device
            auto DATA = data_buffer.get_access<access::mode::read_write>(h);

            //# Parallel Computation on device
            h.parallel_for<class kernel_doubler>(range<1>(N), [=] (id<1> i){
                DATA[i] *= 2;
            });
        });
    } //# Buffer goes out of scope: waits for the kernel and copies the results back to data

    //# Print Output
    for(int i=0; i<N; i++) std::cout << data[i] << std::endl;

    //# Safe to free only after the buffer has been destroyed
    free(data);
}
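
A buffer created from a host pointer takes over that memory for its lifetime and copies any modified contents back to the pointer when it is destroyed, so the host array is only guaranteed to hold the results once the buffer is gone. The toy model below illustrates that ordering in plain C++. `ToyBuffer` and `double_through_buffer` are hypothetical names and are not the `sycl::buffer` API; a real SYCL implementation may use the host memory directly, so this models only the write-back guarantee and not what the runtime does internally.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy stand-in for a buffer (NOT the sycl::buffer API): it works on a private
// copy of the host array and writes the copy back when it is destroyed.
class ToyBuffer {
public:
    ToyBuffer(float *host, std::size_t n) : host_(host), copy_(host, host + n) {}
    ~ToyBuffer() { std::copy(copy_.begin(), copy_.end(), host_); }
    ToyBuffer(const ToyBuffer &) = delete;
    ToyBuffer &operator=(const ToyBuffer &) = delete;

    float &operator[](std::size_t i) { return copy_[i]; }
    std::size_t size() const { return copy_.size(); }

private:
    float *host_;              // where results are written back on destruction
    std::vector<float> copy_;  // private working copy of the data
};

// Doubles the host array through a scoped ToyBuffer.
std::vector<float> double_through_buffer(std::vector<float> data) {
    {
        ToyBuffer buf(data.data(), data.size());
        for (std::size_t i = 0; i < buf.size(); ++i) buf[i] *= 2;
        // In this toy, data still holds the old values at this point.
    }   // buf is destroyed here, so the results are copied back into data
    return data;
}
```

The same reasoning explains why host memory passed to a buffer should be freed only after the buffer has been destroyed: the write-back needs a valid destination.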


#### Build and Run

In [None]:
!./q run_hello_sycl.sh