# Introduction to Parallel STL

##### Sections
- _Code_: [Hello, PSTL!](#Hello,-PSTL!)
- _Code_: [Fancy Iterators](#Fancy-iterators)
- _Code_: [Dot Product](#Dot-product)

The exercises in this notebook mainly consist of three parts: a description in the first cell, a code example in the second cell, and a final cell that compiles and runs the exercise on the DevCloud batch system. You can modify the code inline and run the cell to create the C++ source file. An optional solution cell may follow the code example; running it overwrites the code snippet with a solution.

# Hello, PSTL!

To begin, we'll build and run a very simple C++ application that invokes the classic STL algorithm ``sort`` on an input sequence of random integers.

To make this exercise more interesting, we parallelize our sort at a high level. To do that, we simply add a parallel execution policy and run the algorithm a second time. Adding a parallel execution policy is as easy as inserting `std::execution::par` as the first argument. This tells the compiler and runtime that it is safe to execute iterations in parallel on multiple threads of execution. It is a hint from the user, not a requirement: the runtime is not obliged to apply any optimization. A high-quality implementation such as Parallel STL with TBB as a backend can benefit by splitting the sort into finer-grained tasks, which TBB's scheduler then executes on multiple threads. Of course, that adds some overhead, such as maintaining a thread pool or synchronizing the tasks, so it is not advisable to add a parallel execution policy to each and every algorithm. It is the user's responsibility (yes, that's you) to ensure that a parallel optimization hint is legal and beneficial.

Consequently, in our simple example we time both runs so that we can compare them. Note that we intentionally skip any warm-up for the parallel execution.
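As a side note on timing, the sketch below shows one way to measure a single run with `std::chrono`. The helper names `elapsed_seconds` and `time_sort` are hypothetical and not part of the notebook. Converting through `std::chrono::duration<double>` yields seconds regardless of the clock's native tick period, whereas scaling `count()` by `1e-9` assumes the clock ticks in nanoseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper (not part of the notebook): runs `work` once and
// returns the elapsed wall-clock time in seconds.
template <typename Work>
double elapsed_seconds(Work&& work) {
  const auto t0 = std::chrono::steady_clock::now();
  work();
  const auto t1 = std::chrono::steady_clock::now();
  // duration<double> converts from the clock's tick period to seconds
  return std::chrono::duration<double>(t1 - t0).count();
}

// Sorts a private copy of `data` sequentially and returns the sort time.
double time_sort(std::vector<int> data) {
  const double seconds =
      elapsed_seconds([&] { std::sort(data.begin(), data.end()); });
  assert(std::is_sorted(data.begin(), data.end()));
  return seconds;
}
```

For very small inputs a single measurement like this is dominated by noise, so it is usually worth repeating the run several times and averaging.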

Inspect the code below - there are no modifications necessary. Run the first cell to create the file, then run the cell below it to compile and execute the code.
1. Inspect the code cell below, then click run ▶ to save the code to a file
2. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

What are your expectations regarding the speedup? In the current DevCloud configuration (as of Sept'20), we're likely running on a system with two Intel Xeon processors with 6 cores each (12 hardware threads).

What happens if we decrease the size of the input sequence to, say, 50?

In [None]:
%%writefile lab/sort.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>
#include <random>

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

int main() {
  std::default_random_engine e1(0);
  std::uniform_int_distribution<int> d(0, 5000);

  // initialize input vector with random numbers
  std::vector<int> data(5000000);
  for(auto& i:data) {
    i = d(e1);
  }

  {
    std::vector<int> input(data);
    std::cout << "Running sequentially:\n";
    auto st0 = std::chrono::high_resolution_clock::now();
    std::sort(input.begin(), input.end());
    auto st1 = std::chrono::high_resolution_clock::now();
    std::cout << "Serial time   = " << 1e-9 * (st1-st0).count() << " seconds\n";
  }

  {
    std::vector<int> input(data);
    std::cout << "\nRunning in parallel:\n";
    auto pt0 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, input.begin(), input.end());
    auto pt1 = std::chrono::high_resolution_clock::now();
    std::cout << "Parallel time = " << 1e-9 * (pt1-pt0).count() << " seconds\n";
  }
    
  return 0;
}

Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 run_sort.sh; if [ -x "$(command -v qsub)" ]; then ./q run_sort.sh; else ./run_sort.sh; fi

# Fancy iterators

Next, we want to demonstrate how the applicability of classic STL algorithms can be extended by special iterators, so-called fancy iterators. Imagine the following problem: a logical sequence of pairs (key and data elements) has to be sorted by key. The elements are stored in two containers: one vector holds the keys and another vector holds the data elements.

oneDPL, as part of Intel oneAPI, supports fancy iterators that can be used with the C++17 parallel algorithms. For example, it provides a `counting_iterator` that represents a linearly increasing sequence. Compared with a classic vector that holds the data, this has one big advantage: the sequence needs no representation in memory, which can save memory bandwidth.
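To make the idea concrete, here is a minimal sketch of a counting iterator written in plain standard C++. The type `counting_iter` and the function `sum_first_n` are hypothetical names for illustration only; this is not oneDPL's implementation, and it supports only the input-iterator operations that `std::accumulate` needs.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>

// Hypothetical minimal counting iterator, for illustration only.
// The "element" is the iterator's own counter, so nothing is stored in memory.
struct counting_iter {
  using iterator_category = std::input_iterator_tag;
  using value_type = std::size_t;
  using difference_type = std::ptrdiff_t;
  using pointer = const std::size_t*;
  using reference = std::size_t;  // returned by value, not by reference

  std::size_t value;

  reference operator*() const { return value; }
  counting_iter& operator++() { ++value; return *this; }
  counting_iter operator++(int) { counting_iter old = *this; ++value; return old; }
  bool operator==(const counting_iter& other) const { return value == other.value; }
  bool operator!=(const counting_iter& other) const { return value != other.value; }
};

// Sums 0 + 1 + ... + (n - 1) without materializing the sequence.
std::size_t sum_first_n(std::size_t n) {
  const std::size_t total =
      std::accumulate(counting_iter{0}, counting_iter{n}, std::size_t{0});
  // Closed-form cross-check; holds while n * (n - 1) fits in std::size_t
  assert(total == n * (n - 1) / 2);
  return total;
}
```

Calling `sum_first_n(5)` adds up 0, 1, 2, 3 and 4 without ever allocating a five-element container.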

In our example we use a `zip_iterator`, which ties two or more sequences together. The resulting iterator can then be used as the input or output of an STL algorithm, as in the exercise below.
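For comparison, here is a sketch of one common way to sort by key in plain standard C++ without a zip iterator: sort a permutation of indices by key, then apply it to both vectors. The function `sort_by_key` is a hypothetical helper, not a oneDPL or standard library function. It uses `std::stable_sort` so that elements with equal keys keep their original relative order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: sorts `keys` and reorders `data` the same way,
// using only the standard library (no zip iterator).
void sort_by_key(std::vector<int>& keys, std::vector<int>& data) {
  assert(keys.size() == data.size());

  // order[i] will hold the original position of the i-th smallest key
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t l, std::size_t r) { return keys[l] < keys[r]; });

  // Apply the permutation through temporary copies
  std::vector<int> sorted_keys(keys.size());
  std::vector<int> sorted_data(data.size());
  for (std::size_t i = 0; i < order.size(); ++i) {
    sorted_keys[i] = keys[order[i]];
    sorted_data[i] = data[order[i]];
  }
  keys = std::move(sorted_keys);
  data = std::move(sorted_data);
}
```

This approach needs an extra index vector and two temporary copies. A zip iterator avoids that bookkeeping by letting the algorithm treat the two vectors as a single sequence of pairs.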

Complete the instructions in the following code template to implement a sort by key using a zip iterator.

In [5]:
%%writefile lab/fancy_sort.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>

#include <oneapi/dpl/iterator>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

int main() {

  {
    using oneapi::dpl::make_zip_iterator;
    std::vector<int> keys = { 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    std::vector<int> data = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    // *** Step 1: create a zip iterator with iterator to begin of keys and data as input.
    auto zip_in = data.begin();
    const size_t n = std::distance(keys.begin(), keys.end());  

    std::cout << "\nRunning in parallel:\n";
    auto pt0 = std::chrono::high_resolution_clock::now();
    // *** Step 2: Replace left and right side of comparison with get<>() to extract and compare key
    auto custom_func = [](auto l, auto r){ using std::get; return l < r;};
    // *** Step 3: Replace all instances of iterator data.begin() with your zip iterator
    std::sort(std::execution::par, zip_in, zip_in + n, custom_func);
    auto pt1 = std::chrono::high_resolution_clock::now();
    std::cout << "Parallel time = " << 1e-9 * (pt1-pt0).count() << " seconds\n";
      
    std::cout << "Sorted keys:" ;
    for(auto i:keys)
      std::cout << i << " ";
    std::cout << "\n" << "Sorted data:";
    for(auto i:data)
      std::cout << i << " ";
    std::cout << "\n";
  }
    
  return 0;
}


# Solution

In [6]:
%%writefile lab/fancy_sort.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>

#include <oneapi/dpl/iterator>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>

int main() {

  {
    using oneapi::dpl::make_zip_iterator;
    std::vector<int> keys = { 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};
    std::vector<int> data = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    auto zip_in = make_zip_iterator(keys.begin(), data.begin());
    const size_t n = std::distance(keys.begin(), keys.end());  
    auto custom_func = [](auto l, auto r){ using std::get; return get<0>(l) < get<0>(r);};
      
    std::cout << "\nRunning in parallel:\n";
    auto pt0 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, zip_in, zip_in + n, custom_func);
    auto pt1 = std::chrono::high_resolution_clock::now();
    std::cout << "Parallel time = " << 1e-9 * (pt1-pt0).count() << " seconds\n";
      
    std::cout << "Sorted keys:" ;
    for(auto i:keys)
      std::cout << i << " ";
    std::cout << "\n" << "Sorted data:";
    for(auto i:data)
      std::cout << i << " ";
    std::cout << "\n";
  }
    
  return 0;
}


Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 run_fancy_sort.sh; if [ -x "$(command -v qsub)" ]; then ./q run_fancy_sort.sh; else ./run_fancy_sort.sh; fi

# Dot product

In this exercise we'll demonstrate another algorithm: `transform_reduce`. It has been available since C++17 and can be used to parallelize the computation of an `inner_product`.

https://en.cppreference.com/w/cpp/algorithm/transform_reduce

While exploring the parallel execution policies introduced by C++17, we shouldn't forget to mention `par_unseq`. This policy is the least restrictive because it combines the `par` and `unseq` policies. The `unseq` policy allows the interleaving of iterations on a single thread, which is useful for vectorization. As a standalone policy, `unseq` has been available since C++20 and can be especially useful for nested algorithms. The combined policy enables both multi-threading and vectorization for a specific algorithm.
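The relationship between the two algorithms can be sketched without any execution policy, using the sequential C++17 overload of `std::transform_reduce` from `<numeric>`. The function names `dot_inner_product` and `dot_transform_reduce` are hypothetical. The key difference is that `std::inner_product` accumulates strictly from left to right, while `std::transform_reduce` is allowed to regroup and reorder the additions, which is what makes a parallel version possible.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Dot product with the classic, strictly left-to-right algorithm.
double dot_inner_product(const std::vector<double>& a,
                         const std::vector<double>& b) {
  assert(a.size() == b.size());
  return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// Same computation with transform_reduce: multiply element-wise (transform),
// then add the products (reduce). The operations are spelled out here; they
// are also the defaults when both are omitted.
double dot_transform_reduce(const std::vector<double>& a,
                            const std::vector<double>& b) {
  assert(a.size() == b.size());
  return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0,
                               std::plus<>(), std::multiplies<>());
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs. For small integer-valued inputs such as those used in this exercise, the results are identical.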

Modify the following example by replacing the sequential policies according to the instructions in the code.

In [None]:
%%writefile lab/dot_product.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>

#include <oneapi/dpl/iterator>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/numeric>

auto exec = std::execution::seq;

int main() {
  const size_t N = 100000;
  const size_t runs = 10;
  double r2, r1, r;

  // initialize input vectors
  std::vector<double> a(N);
  std::vector<double> b(N);

  auto input = oneapi::dpl::counting_iterator<size_t>(0);
  std::for_each_n(std::execution::par_unseq, input, N, [&](size_t i){ a[i] = 1.0; b[i] = 2.0; });
 
  // First run inner product
  auto st0 = std::chrono::high_resolution_clock::now();
  for(int k = 0; k <= runs; ++k) {
    if(k==1) st0 = std::chrono::high_resolution_clock::now();
    r1 = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
  }
  auto st1 = std::chrono::high_resolution_clock::now();

  // Second run parallel transform reduce
  auto pt0 = std::chrono::high_resolution_clock::now();
  for(int k = 0; k <= runs; ++k) {
    if(k==1) pt0 = std::chrono::high_resolution_clock::now();
    // *** Step 1: Replace the first parameter by execution policy par.
    r = std::transform_reduce(exec,
                              a.begin(), a.end(),
                              b.begin(), 0.0);
  }
  auto pt1 = std::chrono::high_resolution_clock::now();
  
  // Third run parallel and unsequenced transform reduce
  auto pt2 = std::chrono::high_resolution_clock::now();
  for(int k = 0; k <= runs; ++k) {
    if(k==1) pt2 = std::chrono::high_resolution_clock::now();
    // *** Step 2: Replace the first parameter by execution policy par_unseq.
    r2 = std::transform_reduce(exec,
                              a.begin(), a.end(),
                              b.begin(), 0.0);
  }
  auto pt3 = std::chrono::high_resolution_clock::now();

  std::cout << "Serial time    = " << 1e-6 * (st1-st0).count() / runs << " milliseconds\n";
  std::cout << "Parallel time  = " << 1e-6 * (pt1-pt0).count() / runs << " milliseconds\n"; 
  std::cout << "Par_unseq time = " << 1e-6 * (pt3-pt2).count() / runs << " milliseconds\n";
    
  std::cout << "Result: " << r << ", " << r1 << ", " << r2 << "\n";
    
  return 0;
}

# Solution

In [None]:
%%writefile lab/dot_product.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <vector>

#include <oneapi/dpl/iterator>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/numeric>

auto exec = std::execution::seq;

int main() {
  const size_t N = 100000;
  const size_t runs = 10;
  double r2, r1, r;

  // initialize input vectors
  std::vector<double> a(N);
  std::vector<double> b(N);

  auto input = oneapi::dpl::counting_iterator<size_t>(0);
  std::for_each_n(std::execution::par_unseq, input, N, [&](size_t i){ a[i] = 1.0; b[i] = 2.0; });
 
  // First run inner product
  auto st0 = std::chrono::high_resolution_clock::now();
  for(int k = 0; k <= runs; ++k) {
    if(k==1) st0 = std::chrono::high_resolution_clock::now();
    r1 = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
  }
  auto st1 = std::chrono::high_resolution_clock::now();

  // Second run parallel transform reduce
  auto pt0 = std::chrono::high_resolution_clock::now();
  for(int k = 0; k <= runs; ++k) {
    if(k==1) pt0 = std::chrono::high_resolution_clock::now();
    // *** Step 1: Replace the first parameter by execution policy par.
    r = std::transform_reduce(std::execution::par,
                              a.begin(), a.end(),
                              b.begin(), 0.0);
  }
  auto pt1 = std::chrono::high_resolution_clock::now();
  
  // Third run parallel and unsequenced transform reduce
  auto pt2 = std::chrono::high_resolution_clock::now();
  for(int k = 0; k <= runs; ++k) {
    if(k==1) pt2 = std::chrono::high_resolution_clock::now();
    // *** Step 2: Replace the first parameter by execution policy par_unseq.
    r2 = std::transform_reduce(std::execution::par_unseq,
                              a.begin(), a.end(),
                              b.begin(), 0.0);
  }
  auto pt3 = std::chrono::high_resolution_clock::now();

  std::cout << "Serial time    = " << 1e-6 * (st1-st0).count() / runs << " milliseconds\n";
  std::cout << "Parallel time  = " << 1e-6 * (pt1-pt0).count() / runs << " milliseconds\n"; 
  std::cout << "Par_unseq time = " << 1e-6 * (pt3-pt2).count() / runs << " milliseconds\n";
    
  std::cout << "Result: " << r << ", " << r1 << ", " << r2 << "\n";
    
  return 0;
}

Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 run_dot_product.sh; if [ -x "$(command -v qsub)" ]; then ./q run_dot_product.sh; else ./run_dot_product.sh; fi

Next, turn off auto-vectorization and re-run. The run script has already been modified for this.

In [None]:
! chmod 755 q; chmod 755 run_dot_product_no_vec.sh; if [ -x "$(command -v qsub)" ]; then ./q run_dot_product_no_vec.sh; else ./run_dot_product_no_vec.sh; fi
