**DPC++**

DPC++ (Data Parallel C++) is Intel's SYCL-based language for parallel programming across CPUs, GPUs, and other accelerators. What follows is a short walkthrough of a sample DPC++ program.

Prerequisites:

1) C++

2) A basic understanding of parallel programming

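The program performs a scaled **vector addition** (SAXPY).
Given two vectors X and Y and a scalar a, it computes aX+Y and accumulates the result into a third vector Z, so each element is updated as Z[i] += a*X[i] + Y[i].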

The operation as an ordinary serial for loop:

In [None]:

    for (size_t i = 0; i < length; ++i) {
        Z[i] += A * X[i] + Y[i];
    }
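To make the serial version concrete, here is a minimal host-only sketch that wraps the loop in a function. The name saxpy_serial and the use of std::vector<float> are assumptions made for this example, not part of the original program. It can serve as a reference result when checking the DPC++ version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for the update Z[i] += A * X[i] + Y[i].
// saxpy_serial is a hypothetical name chosen for this sketch.
void saxpy_serial(float A,
                  const std::vector<float>& X,
                  const std::vector<float>& Y,
                  std::vector<float>& Z) {
    // All three vectors must have the same length.
    assert(X.size() == Z.size() && Y.size() == Z.size());
    for (std::size_t i = 0; i < Z.size(); ++i) {
        Z[i] += A * X[i] + Y[i];
    }
}
```

For example, with A = 3, every X[i] = 1, every Y[i] = 2, and Z starting at 0, one call leaves every Z[i] equal to 5. A second call leaves 10, because the loop accumulates into Z rather than overwriting it.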


This is the "parallel for" loop for the operation:



In [None]:
h.parallel_for<class saxpy>(sycl::range<1>{length}, [=](sycl::id<1> it) {
    const size_t i = it[0];  // index of this work-item in the 1-D range
    Z[i] += A * X[i] + Y[i];
});
    


Points to note:

1) The loop body is expressed as a lambda: the code inside the { } that follows [=] (sycl::id<1> it).

2) The iteration space is expressed with a sycl::range and the loop index with a sycl::id. Here these are sycl::range<1>{length} and sycl::id<1> it.

3) "it" is the loop index, and the <1> in range<1> and id<1> means the iteration space is 1-dimensional.

4) The <class saxpy> template argument to parallel_for is just a way to name the kernel.
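One way to picture what these points mean is to emulate parallel_for on the host. The sketch below is plain C++, not SYCL: serial_for and run_saxpy are hypothetical names, raw pointers stand in for the accessors, and the calls run one after another, whereas a real SYCL runtime is free to run them concurrently and in any order.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for parallel_for: calls `kernel` once for every
// index in [0, n). This sketch runs the calls sequentially on the host.
template <typename Kernel>
void serial_for(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) {
        kernel(i);
    }
}

// Runs the SAXPY loop body as a lambda and returns the updated Z.
// Assumes X and Y are at least as long as Z.
std::vector<float> run_saxpy(float A,
                             std::vector<float> X,
                             std::vector<float> Y,
                             std::vector<float> Z) {
    const float* x = X.data();
    const float* y = Y.data();
    float* z = Z.data();
    // [=] copies A and the three pointers into the lambda, much as the
    // SYCL kernel captures A and its accessors by value.
    serial_for(Z.size(), [=](std::size_t i) {
        z[i] += A * x[i] + y[i];
    });
    return Z;
}
```

Because each call touches only its own index i, the calls do not depend on one another, which is what makes the loop body safe to run in parallel.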



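**SYCL Queues** (for selecting a device such as a CPU or GPU)

Whenever we want to compute on a device, we need to create a work queue. The lines below are alternatives, so only one of them would appear in a given scope. These selector classes come from SYCL 1.2.1; SYCL 2020 deprecates them in favor of objects such as sycl::default_selector_v and sycl::gpu_selector_v, and it removes the host device.

In [None]:
sycl::queue q(sycl::default_selector{});     // let the runtime choose a device

sycl::queue q(sycl::host_selector{});        // run on the CPU without a runtime (e.g. no OpenCL)
sycl::queue q(sycl::cpu_selector{});         // run on the CPU with a runtime (e.g. OpenCL)
sycl::queue q(sycl::gpu_selector{});         // run on the GPU
sycl::queue q(sycl::accelerator_selector{}); // run on an FPGA or other accelerator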

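Managing **data** using **buffers**

The canonical way to manage data in SYCL is with buffers. A buffer wraps host memory, and the runtime moves the data to and from the device as kernels need it.

In [None]:
// T is a data type, e.g. float
std::vector<T> h_X(length, xval);  // host vector with every element set to xval
// range<1> makes this a 1-dimensional buffer, which suits a 1-D array
sycl::buffer<T,1> d_X { h_X.data(), sycl::range<1>(h_X.size()) };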

**Controlling device execution**

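We use the submit method to enqueue work on the device queue, q. submit takes a function (here a lambda) that receives an opaque handler, h, and we launch kernels against that handler, in this case via parallel_for. The call to q.wait() then blocks the host until the submitted work has finished.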



In [None]:
q.submit([&](sycl::handler& h) {
    ...
    h.parallel_for<class nstream>( sycl::range<1>{length}, [=] (sycl::id<1> it) {
        ....
    });
});
q.wait();

**Compute kernels and buffers**

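The final step is to request accessors inside the command group. An accessor states how the kernel will use a buffer (read, write, or both), and the kernel reads and writes the data through it: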


In [None]:
q.submit([&](sycl::handler& h) {

    auto X = d_X.template get_access<sycl::access::mode::read>(h);        // read-only view of X
    auto Y = d_Y.template get_access<sycl::access::mode::read>(h);        // read-only view of Y
    auto Z = d_Z.template get_access<sycl::access::mode::read_write>(h);  // Z is read and updated: Z += A*X + Y

    h.parallel_for<class nstream>( sycl::range<1>{length}, [=] (sycl::id<1> it) {
        ...
    });
});

**Whole Program**

In [None]:
    // Fragment from inside main(): length, xval, yval, zval and aval are
    // assumed to be defined earlier, and the SYCL header to be included.
    std::vector<float> h_X(length, xval);
    std::vector<float> h_Y(length, yval);
    std::vector<float> h_Z(length, zval);

    try {

        sycl::queue q(sycl::default_selector{});

        const float A(aval);

        // Buffers wrap the host vectors for the lifetime of this scope.
        sycl::buffer<float,1> d_X { h_X.data(), sycl::range<1>(h_X.size()) };
        sycl::buffer<float,1> d_Y { h_Y.data(), sycl::range<1>(h_Y.size()) };
        sycl::buffer<float,1> d_Z { h_Z.data(), sycl::range<1>(h_Z.size()) };

        q.submit([&](sycl::handler& h) {

            auto X = d_X.template get_access<sycl::access::mode::read>(h);
            auto Y = d_Y.template get_access<sycl::access::mode::read>(h);
            auto Z = d_Z.template get_access<sycl::access::mode::read_write>(h);

            h.parallel_for<class nstream>(sycl::range<1>{length}, [=](sycl::id<1> it) {
                const size_t i = it[0];
                Z[i] += A * X[i] + Y[i];
            });
        });
        q.wait();
    }   // d_Z is destroyed here, which copies the results back into h_Z
    catch (const sycl::exception& e) {
        std::cout << e.what() << std::endl;
        return 1;
    }
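
Once the try block has ended, h_Z holds the results, so the host can check them. The following is a minimal sketch of such a check, assuming (as in the program above) that every element of each vector starts from the same value. count_mismatches, expected_after_one_run, and the tolerance parameter are hypothetical names introduced here, not part of the original program.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Value every element of Z should hold after one kernel run when X, Y and Z
// were filled uniformly with xval, yval and zval.
float expected_after_one_run(float aval, float xval, float yval, float zval) {
    return zval + aval * xval + yval;
}

// Counts the elements of Z that differ from `expected` by more than `tol`.
// The negated comparison means NaN values are counted as mismatches too.
std::size_t count_mismatches(const std::vector<float>& Z, float expected, float tol) {
    std::size_t bad = 0;
    for (float z : Z) {
        if (!(std::fabs(z - expected) <= tol)) {
            ++bad;
        }
    }
    return bad;
}
```

In the program above this could be called after the catch block, for instance as count_mismatches(h_Z, expected_after_one_run(aval, xval, yval, zval), tol) with a tolerance chosen to suit the data; a result of 0 means every element matched.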