# oneTBB Loop Partitioners

##### Sections
- [The partitioners available in oneTBB](#The-partitioners-available-in-oneTBB)
- _Code_: [Trying out partitioners](#Trying-out-partitioners)

## Learning Objectives
* Learn to use partitioners to change how oneTBB divides ranges into chunks of work

## The partitioners available in oneTBB

oneTBB provides several partitioners that can be used with its loop algorithms:
`parallel_for`, `parallel_reduce` and `parallel_scan`. The table below summarizes their
behaviors.

![Partitioner Table](img/partitioners.png)

In this module, we will use different partitioners and grainsizes and see their effect
on the distribution of tasks to threads and the size of tasks created by a simple use of `tbb::parallel_for`.
In real applications, we need to consider both
application characteristics and the platform characteristics
to select the best partitioner for our problem.

`auto_partitioner` is the default for TBB loop algorithms and is typically a good choice.
It tries to create chunks that are small enough to allow for dynamic load balancing, without
creating very small tasks that lead to excessive scheduling overheads.

A `simple_partitioner` is useful when you want to carefully control grainsizes and do not 
want TBB to automatically determine the size of chunks distributed to threads. This can be
useful, for example, when implementing cache-oblivious algorithms, where it is important to
divide the iteration space into small enough pieces to benefit from cache-obliviousness.

`affinity_partitioner` is useful when you can benefit from repeating the same distribution of
tasks to threads on multiple invocations of loop nests. It determines chunk sizes using an
algorithm similar to `auto_partitioner` but stores the pattern and attempts to recreate it on
future executions.

Finally, `static_partitioner` has the lowest overheads since, as its name implies, it doesn't
do dynamic balancing of the load. When we have a well-balanced workload running on an unloaded
machine, static partitioning can provide excellent performance. However, if the workload is
imbalanced or some cores are more loaded than others, we lose the benefits of dynamic load
balancing.

## Trying out partitioners

In this section, we will execute a very simple example that runs a `tbb::parallel_for` that
spins for roughly 100 nanoseconds for each loop iteration. Since all of the iterations spin
for roughly the same amount of time, there is no load imbalance among the iterations. The
example records the assignment of iterations to threads, and we will visualize this assignment
using a Python script. The output also states how many loop body tasks were created and
how long the loop took to execute.

The test loop, found in `common/partitioner_test.h`, is reproduced below:

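```cpp
// Counts how many times the loop body is invoked, i.e. how many chunks are executed.
std::atomic<int> num_body_executions = 0;

// Note: `v`, the buffer that records a thread index per iteration, is used
// below but its declaration is not part of this excerpt.
template<typename Partitioner>
static auto run_test(const std::string &partitioner_name, int gs = 1) {
  auto t0 = std::chrono::high_resolution_clock::now();
  tbb::parallel_for(tbb::blocked_range<int>(0,
                                            std::thread::hardware_concurrency() * 1000,
                                            gs),
    [v](const tbb::blocked_range<int> &b) {
      const int time_per_iteration = 100; // 100 ns
      ++num_body_executions;
      for (int i = b.begin(); i < b.end(); ++i) {
        // Spin until roughly time_per_iteration clock ticks (assumed to be ns) have elapsed.
        auto start = std::chrono::high_resolution_clock::now();
        while ((std::chrono::high_resolution_clock::now() - start).count() < time_per_iteration);
        // Record which thread executed iteration i.
        v[i] = tbb::this_task_arena::current_thread_index();
      }
    }, Partitioner{}
  );
  double execution_time = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
  // load() copies the counter value; std::atomic itself cannot be copied into a pair.
  return std::make_pair( execution_time, num_body_executions.load() );
}
```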

In the code above, `Partitioner` is the partitioner type passed as a template argument to `run_test`,
and `gs` is an optional grainsize that defaults to 1. The test executes the parallel loop over
`std::thread::hardware_concurrency() * 1000` iterations, and each iteration spins for about 100 ns.
The output is written to a CSV file so that it can be viewed as a 2D graph showing which
threads executed which iterations. Each thread appears as a different color.
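
The busy-wait in the loop body compares a raw tick count against 100, which only means 100 ns if the clock ticks in nanoseconds. As an alternative sketch (not the lab's code), the hypothetical helper `spin_for` below compares `std::chrono` durations directly and uses `steady_clock`, so the unit is explicit.

```cpp
#include <chrono>

// Hypothetical helper: busy-waits for at least `target` and returns the
// time actually spent spinning.
std::chrono::nanoseconds spin_for(std::chrono::nanoseconds target) {
  const auto start = std::chrono::steady_clock::now();
  auto elapsed = std::chrono::steady_clock::now() - start;
  while (elapsed < target) {
    elapsed = std::chrono::steady_clock::now() - start;
  }
  return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
}
```

A call such as `spin_for(std::chrono::nanoseconds(100))` would stand in for the inner `while` loop. The actual delay is always somewhat longer than requested because reading the clock has its own cost.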

Perform the following steps to complete this exercise:
1. Inspect the code cell below, then click run ▶ to save the code to a file
2. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code and generate the output graph. Note the distribution of colors, the number of body tasks executed and the execution time.
3. Modify the grainsize argument, trying values such as 10, 100 and 1000. Then re-run the cell in the __Build and Run__ section to generate a new graph and results.
4. Pass a grainsize of 1 but use different partitioners, such as `auto_partitioner` and `static_partitioner`. After each change, re-run the cell in the __Build and Run__ section to generate a new graph and results.

```cpp
%%writefile lab/partitioner.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include "../common/partitioner_test.h"
#include <iostream>

int main() {
  // STEP 3: Try different grainsize values, maybe 10, 100 and 1000
  // STEP 4: Pass a grainsize of 1, but change the partitioner type to auto_partitioner or static_partitioner
  auto r = run_test<tbb::simple_partitioner>(/* title in chart */ "simple_partitioner", /* grainsize */ 1);
  std::cout << "Wallclock time == " << r.first << "\n"
            << "Number of body executions == " << r.second << "\n";
}
```

```python
!chmod 755 q; chmod 755 ./scripts/run_partitioner.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_partitioner.sh; else ./scripts/run_partitioner.sh; fi
%run common/plot_tids.py
```