Benchmarks

Giora Kosoi edited this page May 1, 2023 · 9 revisions

Whenever we attempt to solve a problem by parallel computation, we split it into a sequential part and a concurrent part. We then distribute the concurrent part for computation, in our case across threads.

Our expectation is that the performance of the concurrent part will improve as we add processors, that is, we expect our system to scale when we increase computational power.

Another expectation is that the distribution method itself will not create excessive overhead. The goal of this benchmark suite is to verify that the command executor thread pool scales and to find the workload size at which the distribution overhead cancels out the benefits of parallel computation.
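The scaling expectation can be framed with Amdahl's law: if a fraction s of the work is sequential, the speedup from n workers is bounded by 1 / (s + (1 - s) / n). A minimal sketch:

```rust
// Amdahl's law: the speedup of a program whose sequential fraction is `s`
// when the remaining (1 - s) of the work is spread across `n` workers.
fn amdahl_speedup(s: f64, n: u32) -> f64 {
    1.0 / (s + (1.0 - s) / n as f64)
}

fn main() {
    // A fully parallel job (s = 0) scales linearly with workers.
    println!("{}", amdahl_speedup(0.0, 4)); // prints "4"
    // With 10% sequential work, 4 workers give ~3.08x, not 4x.
    println!("{:.2}", amdahl_speedup(0.1, 4)); // prints "3.08"
}
```

This is why the benchmarks below compare against an inline baseline: any serialized dispatch overhead effectively grows the sequential fraction and caps the achievable scale factor.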

Design

The benchmark is designed around and implemented with benchmark-rs, which creates a series of measurements for increasing workloads. The workload unit in all tests is the generation of a vector of N random strings and sorting it. Each benchmark runs the same computation for the same range of workloads, with the number of threads doubled in each consecutive benchmark. The baseline benchmark performs the computation inline in the calling thread, without any synchronization overhead.

To find the break-even point, that is, the point at which the command executor no longer improves performance, the suite is run for workloads where N varies by orders of magnitude.
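The workload unit described above can be sketched roughly as follows. This is a dependency-free illustration, not the benchmark's actual code: the string generation here uses a tiny xorshift PRNG instead of whatever random source the real benchmark uses.

```rust
// One workload unit: generate `n` random strings of `len` bytes and sort them.
// A tiny xorshift64 PRNG keeps the sketch dependency-free; the actual
// benchmark's random string generation may differ.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

fn workload_unit(n: usize, len: usize, rng: &mut XorShift64) -> Vec<String> {
    let mut v: Vec<String> = (0..n)
        .map(|_| {
            (0..len)
                .map(|_| (b'a' + (rng.next() % 26) as u8) as char)
                .collect()
        })
        .collect();
    v.sort();
    v
}

fn main() {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15); // any nonzero seed
    let sorted = workload_unit(10_000, 100, &mut rng);
    assert_eq!(sorted.len(), 10_000);
    assert!(sorted.windows(2).all(|w| w[0] <= w[1]));
}
```

Each benchmark then submits this unit as the body of a command, either inline (baseline) or through the thread pool.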

Analysis

The benchmark was run on a MacBook Pro without reaching memory or CPU limits. See system details below.

Software

System Software Overview:

  System Version: macOS 12.2.1 (21D62)
  Kernel Version: Darwin 21.3.0
  Boot Volume: Macintosh HD
  Boot Mode: Normal
  Computer Name: *****
  User Name: *****
  Secure Virtual Memory: Enabled
  System Integrity Protection: Enabled
  Time since boot: 55 days 21:54

Hardware

Hardware Overview:

  Model Name: MacBook Pro
  Model Identifier: MacBookPro18,2
  Chip: Apple M1 Max
  Total Number of Cores: 10 (8 performance and 2 efficiency)
  Memory: 32 GB      

Scalability

Work unit 10000

We start our tests with a work unit of 10000: in each command we generate 10000 random strings, 100 bytes each, and sort them. Each test consists of creating a thread pool and dispatching 200 commands to it. The approximate duration of each command's computation is 0.008 sec. We then increase the work unit size in steps of 10000, leading to a duration of up to 0.2 seconds per command.
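The test harness can be approximated with std threads and a channel. This is a hypothetical stand-in for illustration, not the command executor's actual API:

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Dispatch `commands` closures to a pool of `threads` workers that pull
// work off a shared queue; a stand-in for the command executor pool.
fn run_pool(threads: usize, commands: usize, work: fn()) {
    let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
    let rx = Arc::new(Mutex::new(rx));
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one command.
                let cmd = match rx.lock().unwrap().recv() {
                    Ok(cmd) => cmd,
                    Err(_) => break, // channel closed: shut down
                };
                cmd();
            })
        })
        .collect();
    for _ in 0..commands {
        tx.send(Box::new(work)).unwrap();
    }
    drop(tx); // close the channel so workers exit
    for w in workers {
        w.join().unwrap();
    }
}

fn main() {
    use std::sync::atomic::{AtomicUsize, Ordering};
    static DONE: AtomicUsize = AtomicUsize::new(0);
    run_pool(4, 200, || {
        DONE.fetch_add(1, Ordering::SeqCst);
    });
    assert_eq!(DONE.load(Ordering::SeqCst), 200);
}
```

Note the shared mutex around the receiver: every dispatched command pays for that lock, which is exactly the kind of contention the break-even analysis below measures.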

We can see from the charts below that our computation scales with the number of threads, although the degradation at the upper end needs more investigation.
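The scale factor in the charts is, presumably, the ratio of the baseline (inline) time to the pool's time for the same 200 commands, so a value near the thread count means near-linear scaling:

```rust
// Scale factor: how many times faster the pool completes the same
// 200 commands than the inline baseline (assumed definition).
fn scale_factor(baseline_secs: f64, pool_secs: f64) -> f64 {
    baseline_secs / pool_secs
}

fn main() {
    // Hypothetical numbers: a 1.6 s baseline vs 0.42 s with 4 threads.
    println!("{:.1}", scale_factor(1.6, 0.42)); // prints "3.8"
}
```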

Time to complete 200 commands as function of work unit size

Blocking Queue - 200 commands - work unit 10000

Scale factor for 1, 2 and 4 threads respectively as function of work unit size

Scalability - Blocking Queue - 200 cmd - work unit 10000

Scale factor for 1, 2 and 4 threads respectively as function of work unit execution duration

Scale - duration - Blocking Queue - 200 cmd - work unit 10000

Work unit 1000

Test with a work unit of 1000. The time range is from 0.8 to 17 milliseconds per unit of workload execution.

Time to complete 200 commands as function of work unit size

Blocking Queue - 200 commands - work unit 1000

Scale factor for 1, 2 and 4 threads respectively as function of work unit size

Scalability - Blocking Queue - 200 cmd - work unit 1000

Scale factor for 1, 2 and 4 threads respectively as function of work unit execution duration

Scale - duration - Blocking Queue - 200 cmd - work unit 1000

Work unit 100

Test with a work unit of 100. The time range is from 0.07 to 1.6 milliseconds per unit of workload execution.

Time to complete 200 commands as function of work unit size

Blocking Queue - 200 commands - work unit 100

Scale factor for 1, 2 and 4 threads respectively as function of work unit size

Scalability - Blocking Queue - 200 cmd - work unit 100

Scale factor for 1, 2 and 4 threads respectively as function of work unit execution duration

Scale - duration - Blocking Queue - 200 cmd - work unit 100

Work unit 10

Test with a work unit of 10. The time range is from 19 to 159 microseconds per unit of workload execution.

Time to complete 200 commands as function of work unit size

Blocking Queue - 200 commands - work unit 10

Scale factor for 1, 2 and 4 threads respectively as function of work unit size

Scalability - Blocking Queue - 200 cmd - work unit 10

Scale factor for 1, 2 and 4 threads respectively as function of work unit execution duration

Scale - duration - Blocking Queue - 200 cmd - work unit 10

Below the Break Even Point

Work unit 1

Test with a work unit of 1. The time range is from 7 to 15 microseconds per unit of workload execution. In this test we can see clearly that when the execution of a workload unit takes less than 10 microseconds, the distribution overhead, especially lock contention, cancels out the benefits of parallel computation.
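A rough model makes this concrete: if each command costs t seconds of useful work plus a serialized dispatch overhead o, the effective scale factor on n threads is about t / (t/n + o). The overhead value below is assumed for illustration, not measured:

```rust
// Effective scale factor when each command does `t` seconds of work and
// pays a serialized dispatch overhead `o` (a rough model, not measured).
fn effective_scale(t: f64, o: f64, n: f64) -> f64 {
    t / (t / n + o)
}

fn main() {
    let o = 10e-6; // assumed 10 µs of serialized overhead per command
    // 8 ms units (work unit 10000): overhead negligible, near-linear scaling.
    println!("{:.2}", effective_scale(8e-3, o, 4.0)); // prints "3.98"
    // 7 µs units (work unit 1): overhead dominates, slower than inline.
    println!("{:.2}", effective_scale(7e-6, o, 4.0)); // prints "0.60"
}
```

A result below 1.0 is precisely the behavior observed here: dispatching through the pool is slower than running the command inline.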

Time to complete 200 commands as function of work unit size

Blocking Queue - 200 commands - work unit 1

Scale factor for 1, 2 and 4 threads respectively as function of work unit size

Scalability - Blocking Queue - 200 cmd - work unit 1

Scale factor for 1, 2 and 4 threads respectively as function of work unit execution duration

Scale - duration - Blocking Queue - 200 cmd - work unit 1

Next steps

  • Check how to bring the break-even point down to smaller workloads
  • Experiment with a non-blocking queue