# oneTBB Controlling Maximum Concurrency

##### Sections
- [Controlling the maximum concurrency used by oneTBB](#Controlling-the-maximum-concurrency-used-by-oneTBB)
- _Code_: [A simple scalability study](#A-simple-scalability-study)
- _Code_: [Dividing up resources with more than one task_arena](#Dividing-up-resources-with-more-than-one-task_arena)

## Learning Objectives
* Learn how ``tbb::global_control`` and ``tbb::task_arena`` work together to set the maximum available concurrency for computations

# Controlling the maximum concurrency used by oneTBB

oneTBB has a global pool of worker threads and one or more task arenas. Together, the pool of worker threads 
and the task arenas are referred to as the TBB scheduler. If we do nothing special, the oneTBB library 
creates a thread pool with P-1 worker threads, where P is the number of hardware threads supported by the CPUs (`std::thread::hardware_concurrency()`). And, by default, all TBB tasks are shared in a single, implicit task 
arena that contains enough slots for the main thread and the P-1 worker threads.

Typically, these defaults are the best configuration, providing good scalability while avoiding oversubscription. 
Whenever we execute a TBB algorithm, use a TBB flow graph or create tasks, all the work is scheduled onto this 
limited set of threads, which the OS in turn schedules onto the CPU cores. 

However, there are scenarios that can benefit from changing these defaults.

Some reasons we might change the number of threads used by a parallel computation include:

* We know that our computations have limited scalability and that using more than a certain number of threads will just add scheduling overheads.
* Our tasks consume a lot of resources when executing, and so we need to limit the number of tasks that execute concurrently.
* We want to leave room on our platform for other work, and so we want to limit the number of cores used concurrently by a particular computation.

oneTBB provides two classes that can be used to limit the maximum concurrency 
used by parallel computations: ``tbb::global_control`` and ``tbb::task_arena``.  Each serves a different 
purpose but together they determine the maximum number of threads our parallel computations can use
concurrently.

In this module, we work through a few exercises to better understand how to use `tbb::global_control` 
and `tbb::task_arena` objects.

## A simple scalability study

In our first set of exercises, we will run a very simple parallel loop that executes its loop body 1 million times, spinning in each invocation of the body for about 1 microsecond. The sum of the times for all the iterations therefore equals about 1 second. We will make a series of modifications to this example to better understand the impact of `tbb::global_control` and `tbb::task_arena` objects.

### The default case

If we do nothing special, oneTBB creates P-1 workers and executes tasks in an implicit task arena with P slots. In this section, we build and execute a base case that uses these defaults. Our test tracks the number of unique threads that participate in the computation and prints the execution time and the number of participating threads. When we run the base case, we should expect the number of threads used in the computation to equal `std::thread::hardware_concurrency()` and the time to execute the parallel loop to be roughly 1 second / P. We shouldn't expect the time to be exactly
1/P seconds, since there is overhead in parallel scheduling and the spin loop in our test is not precise: it executes for *at least* 1 microsecond,
not *exactly* 1 microsecond. We should therefore expect the time to be bigger than 1/P seconds, but not a lot bigger. You can see the details of the
function `run_test` in common/test_function.h.
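
The estimate above can be written down explicitly. The sketch below is not the code in common/test_function.h (which is not reproduced here); `spin_for_at_least` and `ideal_parallel_seconds` are hypothetical helper names used only for illustration. It shows why a spin of *at least* 1 microsecond pushes the measured time above the ideal 1/P seconds.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: busy-wait until at least `min_duration` has elapsed
// and return how long the spin actually took. The result is never shorter
// than requested, which is one reason measured times exceed the ideal.
inline std::chrono::steady_clock::duration
spin_for_at_least(std::chrono::nanoseconds min_duration) {
  const auto start = std::chrono::steady_clock::now();
  auto elapsed = std::chrono::steady_clock::now() - start;
  while (elapsed < min_duration) {
    elapsed = std::chrono::steady_clock::now() - start;
  }
  return elapsed;
}

// Hypothetical helper: ideal run time if the work splits perfectly across
// `threads` threads with no scheduling overhead.
inline double ideal_parallel_seconds(long iterations,
                                     double seconds_per_iteration,
                                     int threads) {
  assert(threads > 0);
  return iterations * seconds_per_iteration / threads;
}
```

For 1 million iterations of 1 microsecond each on 8 threads, `ideal_parallel_seconds(1000000, 1e-6, 8)` evaluates to about 0.125 seconds; the measured time should be somewhat larger.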

Inspect the code below; no modifications are necessary. Then complete the following steps:
1. Inspect the code cell below, then click run ▶ to save the code to a file
2. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

In [None]:
%%writefile lab/scalability-no-controls.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  run_test(); // warm-up run
  auto t0 = std::chrono::high_resolution_clock::now();
  auto num_participating_threads = run_test(); // test run
  auto sec = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
  std::cout << "Ran test on hw with " << hw_threads << " threads using "
            << num_participating_threads << " threads. Time == " << sec << " seconds." << std::endl
            << "1/" << hw_threads << " == " << 1.0/hw_threads << std::endl;
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_scalability-no.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_scalability-no.sh; else ./scripts/run_scalability-no.sh; fi

### tbb::global_control and tbb::task_arena

The class `tbb::global_control` is used to change the setting of global control variables used by the oneTBB library. 
You can find detailed documentation about the `global_control` class [here](https://spec.oneapi.com/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html).

```cpp
// Defined in header <tbb/global_control.h>
namespace tbb {
  class global_control {
  public:
    enum parameter {
      max_allowed_parallelism,
      thread_stack_size
    };

    global_control(parameter p, size_t value);
    ~global_control();

    static size_t active_value(parameter param);
  };
} // namespace tbb
```

In this module, we focus on using `global_control` to set `max_allowed_parallelism`, which controls the maximum number of 
threads in a process that can concurrently execute TBB tasks. Control variables can be created in different threads, and may 
have nested or overlapping scopes. However, at any point in time each controlled parameter has a single active 
value that applies to the whole process. This value is selected from all currently existing control variables by 
applying a parameter-specific selection rule. For `max_allowed_parallelism` the rule is that the selected value is the 
minimum of the `max_allowed_parallelism` values.

Using the minimum value makes a lot of sense. In the common case, we want TBB to use as many threads as it has available to 
execute our parallel work. It is only in exceptional cases that a developer chooses to restrict `max_allowed_parallelism`.
As we noted above this might be because the developer knows that the parallelism won't scale beyond a certain point or that the 
number of concurrent tasks must be limited to avoid using too many resources (such as memory). So, if there are conflicting
requests on how much the parallelism must be limited, to be safe, TBB conservatively chooses the most restrictive request.
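
The selection rule can be modeled in a few lines of plain C++. This is only a model of the rule described above, not the library's implementation; `selected_max_allowed_parallelism` and its parameters are hypothetical names, and falling back to a default when no control object exists is an assumption made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Model only: given the values requested by all currently live
// global_control objects for max_allowed_parallelism, return the value
// that would be active. `default_value` stands in for the library default
// used when no control object exists (an assumption of this sketch).
inline std::size_t selected_max_allowed_parallelism(
    const std::vector<std::size_t>& live_requests,
    std::size_t default_value) {
  assert(default_value > 0);
  if (live_requests.empty()) return default_value;
  // The most restrictive (smallest) request wins.
  return *std::min_element(live_requests.begin(), live_requests.end());
}
```

In this model, live requests of 4, 2 and 16 select 2, while a single request of 16 on a machine whose default is 8 selects 16.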

The class `tbb::task_arena` represents a place where threads may share and execute tasks. You can find detailed documentation
about the `task_arena` class [here](https://spec.oneapi.com/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_arena_cls.html). 
The number of threads that may simultaneously execute tasks in a `task_arena` is limited by its concurrency limit. Each user 
thread that invokes any parallel construct outside an explicit task_arena uses an implicit task arena representation object
associated with the calling thread. The tasks spawned or enqueued into one arena cannot be executed in another arena.
The interfaces needed for the exercises in this module are shown below.

```cpp
// Defined in header <tbb/task_arena.h>

namespace tbb {
    class task_arena {
    public:
        task_arena(int max_concurrency = automatic, unsigned reserved_for_masters = 1);
        int max_concurrency() const;
        template<typename F> auto execute(F&& f) -> decltype(f());
        template<typename F> void enqueue(F&& f);
    };
} // namespace tbb
```

The classes `global_control` and `task_arena` work together. The number of TBB worker threads participating in 
executing tasks in a specific task arena will not exceed that task arena's concurrency limit and the number of 
threads participating in work across all task arenas will not exceed `max_allowed_parallelism`.
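
As a rough mental model, the two limits combine like a minimum. The function below is a plain C++ sketch with a hypothetical name, not a oneTBB call, and it considers only a single arena with no other arenas competing for threads.

```cpp
#include <algorithm>
#include <cassert>

// Model only: upper bound on the number of threads that can work in one
// arena, given that arena's concurrency limit and the active
// max_allowed_parallelism value.
inline int max_participating_threads(int arena_concurrency_limit,
                                     int max_allowed_parallelism) {
  assert(arena_concurrency_limit > 0 && max_allowed_parallelism > 0);
  return std::min(arena_concurrency_limit, max_allowed_parallelism);
}
```

With 8 hardware threads, an arena limit of 16 and the default global limit of 8 gives 8 in this model; raising both limits to 16 gives 16.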

### Use tbb::global_control to do a scalability study

To demonstrate the effects of `global_control`, let's change our earlier example to execute the test function
after setting `max_allowed_parallelism` to values 1 to `2*std::thread::hardware_concurrency()`. We will *not*
introduce an explicit `task_arena` in this exercise.  As a result our test will use the default task arena, which
has a concurrency limit of `std::thread::hardware_concurrency()`.  We should therefore expect that our test will
be restricted when we set `max_allowed_parallelism` to values below `std::thread::hardware_concurrency()`. 
However, when we set `max_allowed_parallelism` to values larger than  `std::thread::hardware_concurrency()`,
our test will still be limited to at most `std::thread::hardware_concurrency()` due to the limits on the task
arena.

For this exercise, complete the following steps:
1. Inspect the code cell below and make the following modifications.
  1. Change the value passed to the global control object from INCORRECT_VALUE to `i`.
2. When the modifications are complete, click run ▶ to save the code to a file.
3. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

In [None]:
%%writefile lab/scalability-gc.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  for (int i = 1; i <= 2*hw_threads; ++i) {
    // STEP A: Change the global_control object to allow at most i concurrent threads
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, INCORRECT_VALUE);
    run_test(); // warm-up run
    auto t0 = std::chrono::high_resolution_clock::now();
    auto num_participating_threads = run_test(); // test run
    auto sec = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
    std::cout << "Ran test on hw with " << hw_threads << " threads using "
              << num_participating_threads << " threads. Time == " << sec << " seconds." << std::endl
              << "1/" << hw_threads << " == " << 1.0/hw_threads << std::endl;
  }
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_scalability-gc.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_scalability-gc.sh; else ./scripts/run_scalability-gc.sh; fi

### global_control scalability solution (Don't peek unless you have to)

In [None]:
%%writefile solutions/scalability-gc-solved.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  for (int i = 1; i <= 2*hw_threads; ++i) {
    // STEP A: Use i instead of INCORRECT_VALUE to vary the value from 1 to 2*hw_threads
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, i);
    run_test(); // warm-up run
    auto t0 = std::chrono::high_resolution_clock::now();
    auto num_participating_threads = run_test(); // test run
    auto sec = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
    std::cout << "Ran test on hw with " << hw_threads << " threads using "
              << num_participating_threads << " threads. Time == " << sec << " seconds." << std::endl
              << "1/" << hw_threads << " == " << 1.0/hw_threads << std::endl;
  }
}

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_scalability-gc-solved.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_scalability-gc-solved.sh; else ./scripts/run_scalability-gc-solved.sh; fi

### Use an explicit tbb::task_arena to do a scalability study

To demonstrate the effects of `task_arena`, let's change our example to now execute the test function
using explicit task_arena objects, varying the concurrency limit for the task arenas from
1 to `2*std::thread::hardware_concurrency()`. We will *not* use `global_control` to change
the default for `max_allowed_parallelism`.  The default is `std::thread::hardware_concurrency()`.
Just like in our previous exercise, we should therefore expect that our test will
be restricted when we use a `task_arena` with a concurrency limit below 
`std::thread::hardware_concurrency()`. However, when using values larger than 
`std::thread::hardware_concurrency()`, our test will still be limited to at most 
`std::thread::hardware_concurrency()` due to the default `max_allowed_parallelism` value.

For this exercise, complete the following steps:
1. Inspect the code cell below and make the following modifications.
  1. Change the value passed to the task_arena object from INCORRECT_VALUE to `i`.
2. When the modifications are complete, click run ▶ to save the code to a file.
3. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

In [None]:
%%writefile lab/scalability-ta.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  for (int i = 1; i <= 2*hw_threads; ++i) {
    // STEP A: Change the task_arena to use at most i threads
    tbb::task_arena ta(INCORRECT_VALUE);
    ta.execute([]() {
      run_test(); // warm-up run
    });
    ta.execute([hw_threads]() {
      auto t0 = std::chrono::high_resolution_clock::now();
      auto num_participating_threads = run_test(); // test run
      auto sec = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      std::cout << "Ran test on hw with " << hw_threads << " threads using "
                << num_participating_threads << " threads. Time == " << sec << " seconds." << std::endl
                << "1/" << hw_threads << " == " << 1.0/hw_threads << std::endl;
    });
  }
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_scalability-ta.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_scalability-ta.sh; else ./scripts/run_scalability-ta.sh; fi

### task_arena scalability solution (Don't peek unless you have to)

In [None]:
%%writefile solutions/scalability-ta-solved.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  for (int i = 1; i <= 2*hw_threads; ++i) {
    // STEP A: Change the task_arena to use at most i threads
    tbb::task_arena ta(i);
    ta.execute([]() {
      run_test(); // warm-up run
    });
    ta.execute([hw_threads]() {
      auto t0 = std::chrono::high_resolution_clock::now();
      auto num_participating_threads = run_test(); // test run
      auto sec = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      std::cout << "Ran test on hw with " << hw_threads << " threads using "
                << num_participating_threads << " threads. Time == " << sec << " seconds." << std::endl
                << "1/" << hw_threads << " == " << 1.0/hw_threads << std::endl;
    });
  }
}

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_scalability-ta-solved.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_scalability-ta-solved.sh; else ./scripts/run_scalability-ta-solved.sh; fi

### Combining tbb::global_control and tbb::task_arena to do a scalability study

From the previous two exercises, we can see that the oneTBB library is biased towards
preventing oversubscription of cores. If we just set the concurrency limit on an 
explicit task_arena or just use a global_control object to set `max_allowed_parallelism`,
the oneTBB library restricts our parallel computation to at most 
`std::thread::hardware_concurrency()` threads.

If we really want to use more than `std::thread::hardware_concurrency()` threads, we
need to use both `tbb::global_control` and `tbb::task_arena`. First we must increase
the total number of available threads by setting `max_allowed_parallelism` to a value
larger than `std::thread::hardware_concurrency()` and then execute our computation in a 
`task_arena` with a concurrency limit greater than `std::thread::hardware_concurrency()`.

For this exercise, complete the following steps:
1. Inspect the code cell below and make the following modifications.
  1. Change the value passed to the global control object from INCORRECT_VALUE to `2*std::thread::hardware_concurrency()`.
  2. Change the value passed to the task_arena object from INCORRECT_VALUE to `i`.
2. When the modifications are complete, click run ▶ to save the code to a file.
3. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

In [None]:
%%writefile lab/scalability-gc-ta.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  // STEP A: Set the max_allowed_parallelism to 2*hw_threads
  tbb::global_control gc(tbb::global_control::max_allowed_parallelism, INCORRECT_VALUE);

  for (int i = 1; i <= 2*hw_threads; ++i) {
    // STEP B: Set the limit on the concurrency for the task_arena to i
    tbb::task_arena ta(INCORRECT_VALUE);
    ta.execute([]() {
      run_test(); // warm-up run
    });
    ta.execute([hw_threads]() {
      auto t0 = std::chrono::high_resolution_clock::now();
      auto num_participating_threads = run_test(); // test run
      auto sec = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      std::cout << "Ran test on hw with " << hw_threads << " threads using "
                << num_participating_threads << " threads. Time == " << sec << " seconds." << std::endl
                << "1/" << hw_threads << " == " << 1.0/hw_threads << std::endl;
    });
  }
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_scalability-gc-ta.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_scalability-gc-ta.sh; else ./scripts/run_scalability-gc-ta.sh; fi

### global_control plus task_arena scalability solution (Don't peek unless you have to)

In [None]:
%%writefile solutions/scalability-gc-ta-solved.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  // STEP A: Set the max_allowed_parallelism to 2*hw_threads
  tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 2*hw_threads);

  for (int i = 1; i <= 2*hw_threads; ++i) {
    // STEP B: Set the limit on the concurrency for the task_arena to i
    tbb::task_arena ta(i);
    ta.execute([]() {
      run_test(); // warm-up run
    });
    ta.execute([hw_threads]() {
      auto t0 = std::chrono::high_resolution_clock::now();
      auto num_participating_threads = run_test(); // test run
      auto sec = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      std::cout << "Ran test on hw with " << hw_threads << " threads using "
                << num_participating_threads << " threads. Time == " << sec << " seconds." << std::endl
                << "1/" << hw_threads << " == " << 1.0/hw_threads << std::endl;
    });
  }
}

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_scalability-gc-ta-solved.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_scalability-gc-ta-solved.sh; else ./scripts/run_scalability-gc-ta-solved.sh; fi

## Dividing up resources with more than one task_arena

A `tbb::task_arena` represents a place where threads may share and execute tasks and we can have more than one
`task_arena` active at a time. When there is more than one `task_arena`, including implicit and explicit arenas,
the available threads are divided up among them, generally in proportion to the concurrency limits
of the `task_arena` objects.  Using this behavior, we can divide up our threads as we see fit. For example, we
could create one `task_arena` object that has a concurrency limit of two threads and another one that has a limit
of P-2 threads.
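
The proportional division described above can be approximated with a small calculation. This is a simplified model under stated assumptions, not the scheduler's actual algorithm: `proportional_shares` is a hypothetical name, every arena is assumed to have enough work to use its full limit, and threads left over by integer rounding are ignored.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Model only: divide `available` threads among arenas in proportion to
// their concurrency limits. If the limits fit within `available`, each
// arena simply gets its limit.
inline std::vector<int> proportional_shares(const std::vector<int>& limits,
                                            int available) {
  assert(available >= 0);
  const long long total = std::accumulate(limits.begin(), limits.end(), 0LL);
  std::vector<int> shares;
  shares.reserve(limits.size());
  for (int limit : limits) {
    if (total <= available) {
      shares.push_back(limit);
    } else {
      // Integer division rounds down; leftover threads are not assigned.
      shares.push_back(static_cast<int>(
          static_cast<long long>(available) * limit / total));
    }
  }
  return shares;
}
```

In this model, two arenas that each allow 8 threads on an 8-thread machine get 4 threads apiece, while limits of 2 and 6 are both satisfied in full.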

Let's start by executing code that uses two explicit task_arena objects, but uses the default concurrency limit 
for each (which is `std::thread::hardware_concurrency()`).  We will invoke `run_test` in one arena and `run_test_2`
in the other. Both of these functions do the same thing but use different static variables for tracking the number
of threads that participate. We should expect that `run_test` and `run_test_2` complete in roughly the same amount 
of time.

Inspect the code below; no modifications are necessary. Then complete the following steps:
1. Inspect the code cell below, then click run ▶ to save the code to a file
2. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

In [None]:
%%writefile lab/split-default.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  tbb::task_arena ta1;
  tbb::task_arena ta2;

  // warm-up
  tbb::parallel_invoke(
    [&]() { ta1.execute([]() { run_test(); }); },
    [&]() { ta2.execute([]() { run_test_2(); }); }
  );
    
  // run test
  int num_threads_1, num_threads_2;
  double time_of_1, time_of_2;
  tbb::parallel_invoke(
    [&]() { 
      ta1.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_1 = run_test();
        time_of_1 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      }); 
    },
    [&]() { 
      ta2.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_2 = run_test_2(); 
        time_of_2 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();          
      }); 
    }
  );
    
  std::cout << "Ran test on hw with " << hw_threads << " threads\n"
            << " test_1 ran using " << num_threads_1 << " threads in " << time_of_1 << " seconds.\n"
            << " test_2 ran using " << num_threads_2 << " threads in " << time_of_2 << " seconds.\n";
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_split-default.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_split-default.sh; else ./scripts/run_split-default.sh; fi

Surprisingly, you might see that more threads participated in one arena than the other, but 
that the total execution times are similar. If so, this is because we did not set any 
concurrency limit for either `task_arena` object.  If the work was completed in one arena 
before the other, the threads that were in the arena that finished first can migrate to 
the other arena to help finish the remaining work. This is typically the behavior we want.
Why not let idle threads help out with remaining work?

### Splitting the threads up evenly

But, let's say we want to prevent the idle threads from helping out when they finish the work in their arena.
We could set the concurrency limit in each arena to P/2.

For this exercise, complete the following steps:
1. Inspect the code cell below and make the following modifications.
  1. Change the value passed to each task_arena object from INCORRECT_VALUE to `hw_threads/2`.
2. When the modifications are complete, click run ▶ to save the code to a file.
3. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

In [None]:
%%writefile lab/split-evenly.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  // STEP A: pass hw_threads/2 to the constructor of each arena
  tbb::task_arena ta1(INCORRECT_VALUE);
  tbb::task_arena ta2(INCORRECT_VALUE);

  // warm-up
  tbb::parallel_invoke(
    [&]() { ta1.execute([]() { run_test(); }); },
    [&]() { ta2.execute([]() { run_test_2(); }); }
  );
    
  // run test
  int num_threads_1, num_threads_2;
  double time_of_1, time_of_2;
  tbb::parallel_invoke(
    [&]() { 
      ta1.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_1 = run_test();
        time_of_1 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      }); 
    },
    [&]() { 
      ta2.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_2 = run_test_2(); 
        time_of_2 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();          
      }); 
    }
  );
    
  std::cout << "Ran test on hw with " << hw_threads << " threads\n"
            << " test_1 ran using " << num_threads_1 << " threads in " << time_of_1 << " seconds.\n"
            << " test_2 ran using " << num_threads_2 << " threads in " << time_of_2 << " seconds.\n";
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_split-evenly.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_split-evenly.sh; else ./scripts/run_split-evenly.sh; fi

### split-evenly solution (Don't peek unless you have to)

In [None]:
%%writefile solutions/split-evenly-solved.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  // STEP A: pass hw_threads/2 to the constructor of each arena
  tbb::task_arena ta1(hw_threads/2);
  tbb::task_arena ta2(hw_threads/2);

  // warm-up
  tbb::parallel_invoke(
    [&]() { ta1.execute([]() { run_test(); }); },
    [&]() { ta2.execute([]() { run_test_2(); }); }
  );
    
  // run test
  int num_threads_1, num_threads_2;
  double time_of_1, time_of_2;
  tbb::parallel_invoke(
    [&]() { 
      ta1.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_1 = run_test();
        time_of_1 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      }); 
    },
    [&]() { 
      ta2.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_2 = run_test_2(); 
        time_of_2 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();          
      }); 
    }
  );
    
  std::cout << "Ran test on hw with " << hw_threads << " threads\n"
            << " test_1 ran using " << num_threads_1 << " threads in " << time_of_1 << " seconds.\n"
            << " test_2 ran using " << num_threads_2 << " threads in " << time_of_2 << " seconds.\n";
}

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_split-evenly-solved.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_split-evenly-solved.sh; else ./scripts/run_split-evenly-solved.sh; fi

### Splitting the threads up unevenly

And of course, once we explicitly set the concurrency limit, we don't have to split the threads up
evenly.
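
When computing an uneven split from the hardware thread count, it may help to guard against small machines. The helper below is a sketch with a hypothetical name, `split_limits`; clamping both limits to at least 1 is a defensive assumption of this sketch (a zero or negative limit is treated as not meaningful), not something the exercise code does.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical helper: choose concurrency limits for two arenas, giving
// `small_share` threads to the first and the remainder to the second.
// Both limits are clamped to at least 1, so on very small machines the
// two limits may add up to more than hw_threads.
inline std::pair<int, int> split_limits(int hw_threads, int small_share) {
  assert(hw_threads > 0);
  const int first = std::max(1, std::min(small_share, hw_threads));
  const int second = std::max(1, hw_threads - first);
  return {first, second};
}
```

With 8 hardware threads, `split_limits(8, 2)` yields limits of 2 and 6, which could then be passed to the two `tbb::task_arena` constructors.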

For this exercise, complete the following steps:
1. Inspect the code cell below and make the following modifications.
  1. Change the value passed to one of the task_arena objects from INCORRECT_VALUE to 2.
  2. Change the value passed to the other task_arena object from INCORRECT_VALUE to `hw_threads-2`.
2. When the modifications are complete, click run ▶ to save the code to a file.
3. Run ▶ the cell in the __Build and Run__ section below the code snippet to compile and execute the code in the saved file

In [None]:
%%writefile lab/split-unevenly.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  // STEP A: pass 2 to the constructor of one arena
  tbb::task_arena ta1(INCORRECT_VALUE);
  // STEP B: pass hw_threads-2 to the constructor of the other
  tbb::task_arena ta2(INCORRECT_VALUE);

  // warm-up
  tbb::parallel_invoke(
    [&]() { ta1.execute([]() { run_test(); }); },
    [&]() { ta2.execute([]() { run_test_2(); }); }
  );
    
  // run test
  int num_threads_1, num_threads_2;
  double time_of_1, time_of_2;
  tbb::parallel_invoke(
    [&]() { 
      ta1.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_1 = run_test();
        time_of_1 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      }); 
    },
    [&]() { 
      ta2.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_2 = run_test_2(); 
        time_of_2 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();          
      }); 
    }
  );
    
  std::cout << "Ran test on hw with " << hw_threads << " threads\n"
            << " test_1 ran using " << num_threads_1 << " threads in " << time_of_1 << " seconds.\n"
            << " test_2 ran using " << num_threads_2 << " threads in " << time_of_2 << " seconds.\n";
}

### Build and Run
Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_split-unevenly.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_split-unevenly.sh; else ./scripts/run_split-unevenly.sh; fi

### split-unevenly solution (Don't peek unless you have to)

In [None]:
%%writefile solutions/split-unevenly-solved.cpp
//==============================================================
// Copyright (c) 2020 Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================

#include <chrono>
#include <iostream>
#include <thread>
#include <tbb/tbb.h>

#include "../common/test_function.h"

#define INCORRECT_VALUE hw_threads

int main() {
  const int hw_threads = std::thread::hardware_concurrency();
  // STEP A: pass 2 to the constructor of one arena
  tbb::task_arena ta1(2);
  // STEP B: pass hw_threads-2 to the constructor of the other
  tbb::task_arena ta2(hw_threads-2);

  // warm-up
  tbb::parallel_invoke(
    [&]() { ta1.execute([]() { run_test(); }); },
    [&]() { ta2.execute([]() { run_test_2(); }); }
  );
    
  // run test
  int num_threads_1, num_threads_2;
  double time_of_1, time_of_2;
  tbb::parallel_invoke(
    [&]() { 
      ta1.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_1 = run_test();
        time_of_1 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();
      }); 
    },
    [&]() { 
      ta2.execute([&]() {
        auto t0 = std::chrono::high_resolution_clock::now();
        num_threads_2 = run_test_2(); 
        time_of_2 = 1e-9*(std::chrono::high_resolution_clock::now() - t0).count();          
      }); 
    }
  );
    
  std::cout << "Ran test on hw with " << hw_threads << " threads\n"
            << " test_1 ran using " << num_threads_1 << " threads in " << time_of_1 << " seconds.\n"
            << " test_2 ran using " << num_threads_2 << " threads in " << time_of_2 << " seconds.\n";
}

In [None]:
! chmod 755 q; chmod 755 ./scripts/run_split-unevenly-solved.sh; if [ -x "$(command -v qsub)" ]; then ./q scripts/run_split-unevenly-solved.sh; else ./scripts/run_split-unevenly-solved.sh; fi