# Introduction to oneDPL

##### Sections
- _Code_: [Hello, PSTL!](#Hello,-PSTL!)
- _Code_: [Fancy Iterators](#Fancy-iterators)
- _Code_: [Monte Carlo Example](#Monte-Carlo-Pi)

***
# Hello, PSTL!

To begin, we'll build and run a very simple C++ application that invokes the classic STL algorithm ``sort`` on an input sequence of random integers.

To make this exercise more interesting, we parallelize the sort at a high level: we add a parallel execution policy and run the algorithm a second time. Adding a parallel execution policy is as easy as passing `std::execution::par` as the first argument. This tells the implementation that it is safe to execute the algorithm's work in parallel on multiple threads. It is a permission granted by the user, not a requirement: the runtime remains free to run the algorithm sequentially. A high-quality implementation such as Parallel STL with TBB as a backend can benefit by splitting the sort into finer-grained tasks, which TBB's scheduler then executes on multiple threads. Of course, that adds some overhead, such as maintaining a thread pool and synchronizing the tasks, so it is not advisable to add a parallel execution policy to each and every algorithm. It is the user's responsibility (yes, that's you) to ensure that a parallel execution policy is legal and beneficial.

Consequently, our simple example times both runs so that we can compare them. Note that we intentionally skip any warm-up for the parallel execution.

Inspect the code below; no modifications are necessary.
1. Inspect the code cell below that writes `lab/sort.cpp`, then click Run ▶ to save the code to a file.
2. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file.

What speedup do you expect? As of September 2020, the DevCloud configuration we are likely running on is a system with two Intel Xeon processors with 6 cores each (12 hardware threads).

What happens if we decrease the size of the input sequence to, say, 50?

In [None]:
%%writefile lab/sort.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>
#include <random>

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

int main() {
  std::default_random_engine e1(0);
  std::uniform_int_distribution<int> d(0, 5000);

  // initialize input vector with random numbers
  std::vector<int> data(5000000);
  for(auto& i:data) {
    i = d(e1);
  }

  {
    std::vector<int> input(data);
    std::cout << "Running sequentially:\n";
    auto st0 = std::chrono::high_resolution_clock::now();
    std::sort(input.begin(), input.end());
    auto st1 = std::chrono::high_resolution_clock::now();
    // duration<double> yields seconds regardless of the clock's tick period
    std::cout << "Serial time   = " << std::chrono::duration<double>(st1 - st0).count() << " seconds\n";
  }

  {
    std::vector<int> input(data);
    std::cout << "\nRunning in parallel:\n";
    auto pt0 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, input.begin(), input.end());
    auto pt1 = std::chrono::high_resolution_clock::now();
    // duration<double> yields seconds regardless of the clock's tick period
    std::cout << "Parallel time = " << std::chrono::duration<double>(pt1 - pt0).count() << " seconds\n";
  }
    
  return 0;
}

In [None]:
%%writefile scripts/run_sort.sh
#!/bin/bash
#==========================================
# Copyright (c) 2020 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#==========================================

source /opt/intel/inteloneapi/setvars.sh > /dev/null 2>&1
/bin/echo "##" $(whoami) is compiling oneDPL example
rm -rf bin/sort
dpcpp lab/sort.cpp -o bin/sort -tbb
bin/sort

Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 scripts/run_sort.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_sort.sh; else ./scripts/run_sort.sh; fi

***
# Fancy iterators

Next, we demonstrate how special iterators, so-called fancy iterators, extend the applicability of classic STL algorithms. Imagine the following problem: a logical sequence of pairs (key and data elements) has to be sorted by key. The elements are stored in two containers: one vector that holds the keys and another that holds the data elements.

oneDPL, as part of Intel oneAPI, supports fancy iterators that can be used with the C++17 parallel algorithms. For example, it provides a `counting_iterator` that represents a linearly increasing sequence. Compared to a classic vector that holds the same values, this has one big advantage: the sequence needs no representation in memory, which can save memory bandwidth.

In our example we use a `zip_iterator`, which ties two or more sequences together. The resulting iterator can then be used as the input or output of an STL algorithm, as in the sort-by-key exercise below.

Complete the instructions in the following code template to implement a sort by key using a zip iterator.

In [None]:
%%writefile lab/fancy_sort.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>

#include <oneapi/dpl/iterator>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

int main() {

  {
    using oneapi::dpl::make_zip_iterator;
    std::vector<int> keys = { 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    std::vector<int> data = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    // *** Step 1: Create a zip iterator from keys.begin() and data.begin().
    auto zip_in = data.begin();
    const size_t n = std::distance(keys.begin(), keys.end());  

    std::cout << "\nRunning in parallel:\n";
    auto pt0 = std::chrono::high_resolution_clock::now();
    // *** Step 2: Replace left and right side of comparison with get<>() to extract and compare key
    auto custom_func = [](auto l, auto r){ using std::get; return l < r;};
    // *** Step 3: Replace all instances of iterator data.begin() with your zip iterator
    std::sort(std::execution::par, zip_in, zip_in + n, custom_func);
    auto pt1 = std::chrono::high_resolution_clock::now();
    std::cout << "Parallel time = " << std::chrono::duration<double>(pt1 - pt0).count() << " seconds\n";
      
    std::cout << "Sorted keys:" ;
    for(auto i:keys)
      std::cout << i << " ";
    std::cout << "\n" << "Sorted data:";
    for(auto i:data)
      std::cout << i << " ";
    std::cout << "\n";
  }
    
  return 0;
}

### Solution (Don't peek unless you have to)

In [None]:
%%writefile lab/fancy_sort.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>

#include <oneapi/dpl/iterator>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

int main() {

  {
    using oneapi::dpl::make_zip_iterator;
    std::vector<int> keys = { 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    std::vector<int> data = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    auto zip_in = make_zip_iterator(keys.begin(), data.begin());
    const size_t n = std::distance(keys.begin(), keys.end());  
    auto custom_func = [](auto l, auto r){ using std::get; return get<0>(l) < get<0>(r);};
      
    std::cout << "\nRunning in parallel:\n";
    auto pt0 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, zip_in, zip_in + n, custom_func);
    auto pt1 = std::chrono::high_resolution_clock::now();
    std::cout << "Parallel time = " << std::chrono::duration<double>(pt1 - pt0).count() << " seconds\n";
      
    std::cout << "Sorted keys:" ;
    for(auto i:keys)
      std::cout << i << " ";
    std::cout << "\n" << "Sorted data:";
    for(auto i:data)
      std::cout << i << " ";
    std::cout << "\n";
  }
    
  return 0;
}

In [None]:
%%writefile scripts/run_fancy_sort.sh
#!/bin/bash
#==========================================
# Copyright (c) 2020 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#==========================================

source /opt/intel/inteloneapi/setvars.sh > /dev/null 2>&1
/bin/echo "##" $(whoami) is compiling oneDPL example
rm -rf bin/fancy_sort
dpcpp lab/fancy_sort.cpp -o bin/fancy_sort -tbb
bin/fancy_sort

Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 scripts/run_fancy_sort.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_fancy_sort.sh; else ./scripts/run_fancy_sort.sh; fi

***
# Monte Carlo Pi

Let's recreate the DPC++ direct programming example using the DPC++ library (oneDPL). We'll use the algorithm ``transform_reduce``, which, given a parallel execution policy, can be used to parallelize the computation of a dot product, for example, or a Monte Carlo simulation that computes pi.

In [None]:
%%writefile lab/monte_carlo_pi.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <cmath>       // fabs
#include <functional>  // std::plus
#include <iostream>
#include <vector>

#include <CL/sycl.hpp>

#include <oneapi/dpl/random>

#include <oneapi/dpl/iterator>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

#define N 500
#define LOCAL_N 1000

#define SEED 667
#define PI 3.1415926535897932384626433832795

int main() {
    
    using oneapi::dpl::counting_iterator;
    double estimated_pi = 3.14;
    {
        sycl::queue q; //(sycl::gpu_selector{});
        auto policy = oneapi::dpl::execution::make_device_policy(q);
 
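        // The initial value must be a float: with an int (0) the result type would be int
        // and the fractional partial sums would be truncated.
        auto sum = transform_reduce( policy, counting_iterator<int>(0), counting_iterator<int>(N), 0.0f, std::plus<float>{}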
                                    , [=](auto n){
                                        float local_sum = 0.0f;
                                        // Get random coords
                                        // Engine seeded with SEED and offset by the work-item index n
                                        oneapi::dpl::minstd_rand engine(SEED, n);
                                        oneapi::dpl::uniform_real_distribution<float> distr(-1.0f,1.0f);
                                        for(int i = 0; i < LOCAL_N; ++i) {
                                            float x = distr(engine);
                                            float y = distr(engine);
                                            auto hypotenuse_sqr = x * x + y * y;
                                            if (hypotenuse_sqr <= 1.0)
                                                local_sum += 1.0;
                                        }
                                        return local_sum / (float)LOCAL_N;
                                    });

        estimated_pi = 4.0 * (float)sum / N;
    }
        
    // Printing Results
    std::cout << "Estimated value of Pi = " << estimated_pi << std::endl;
    std::cout << "Exact value of Pi = " << PI << std::endl;
    std::cout << "Absolute error = " << fabs( PI-estimated_pi ) << std::endl;

    return 0;
}

In [None]:
%%writefile scripts/run_monte_carlo_pi.sh
#!/bin/bash
#==========================================
# Copyright (c) 2020 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#==========================================

source /opt/intel/inteloneapi/setvars.sh > /dev/null 2>&1
/bin/echo "##" $(whoami) is compiling oneDPL example
rm -rf bin/monte_carlo_pi
dpcpp lab/monte_carlo_pi.cpp -o bin/monte_carlo_pi -tbb
bin/monte_carlo_pi

Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 scripts/run_monte_carlo_pi.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_monte_carlo_pi.sh; else ./scripts/run_monte_carlo_pi.sh; fi
