Benchmarks
Whenever we attempt to solve a problem by parallel computation, we split it into the sequential and concurrent parts. Then we distribute the concurrent part for computation, in our case using threads.
Our expectation is that the performance of the concurrent part will improve as we add processors, that is, we expect our system to scale when we increase computational power.
Another expectation is that the distribution method itself will not create excessive overhead. The goal of this benchmark suite is to verify that the use of the command executor thread pool is scalable, and to find the workload size at which the distribution overhead cancels out the benefits of parallel computation.
The benchmark suite is designed around and implemented with benchmark-rs, which creates a series of measurements for increasing workloads. The workload unit in all tests is the generation of a vector of N
random strings and sorting it.
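As an illustration, a single workload unit could be sketched as follows. This is a sketch, not the suite's actual code: the string length (100 bytes) matches the description below, but the RNG here is a simple linear congruential generator standing in for whatever random source the suite actually uses.

```rust
// Sketch of one workload unit: generate n pseudo-random strings and sort them.
// The LCG below is an illustrative stand-in for the suite's real RNG.
fn workload_unit(n: usize, seed: u64) -> Vec<String> {
    let mut state = seed;
    let mut next = move || {
        // Linear congruential generator (Knuth's MMIX constants).
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };
    let mut strings: Vec<String> = (0..n)
        .map(|_| {
            // Build a 100-byte string of lowercase ASCII letters.
            (0..100)
                .map(|_| char::from(b'a' + (next() % 26) as u8))
                .collect()
        })
        .collect();
    strings.sort();
    strings
}

fn main() {
    let sorted = workload_unit(1000, 42);
    // The sorted vector is in non-decreasing order.
    assert!(sorted.windows(2).all(|w| w[0] <= w[1]));
    println!("sorted {} strings", sorted.len());
}
```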
Each benchmark runs the same computation over the same range of workloads, with the number of threads doubled in each consecutive benchmark. The baseline benchmark performs the computation inline in the calling thread, without any synchronization overhead.
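The two benchmark modes can be illustrated with standard-library primitives. This is not the command-executor API, only a minimal sketch of the same idea: the baseline runs every command inline, while the pooled variant distributes the same closures across worker threads through a shared channel.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Command = Box<dyn FnOnce() + Send>;

// Baseline: execute every command in the calling thread.
fn run_inline(commands: Vec<Command>) {
    for cmd in commands {
        cmd(); // no synchronization overhead
    }
}

// Pooled: distribute commands to `workers` threads over one shared channel.
fn run_pooled(commands: Vec<Command>, workers: usize) {
    let (tx, rx) = mpsc::channel::<Command>();
    let rx = Arc::new(Mutex::new(rx));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Contention on this lock is part of the distribution
                // overhead that the benchmarks measure.
                let cmd = match rx.lock().unwrap().recv() {
                    Ok(cmd) => cmd,
                    Err(_) => break, // channel closed: no more commands
                };
                cmd();
            })
        })
        .collect();
    for cmd in commands {
        tx.send(cmd).unwrap();
    }
    drop(tx); // close the channel so workers exit their loops
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    let make = |n: usize| -> Vec<Command> {
        (0..n).map(|i| Box::new(move || { let _ = i * i; }) as Command).collect()
    };
    run_inline(make(8));
    run_pooled(make(8), 4);
    println!("ok");
}
```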
To find the break-even point, that is, the point where using the command executor yields no performance improvement, the suite is run with workloads where N
spans different orders of magnitude.
The benchmark was run on a MacBook Pro, without reaching memory or CPU limits. See system details below.
System Software Overview:
System Version: macOS 12.2.1 (21D62)
Kernel Version: Darwin 21.3.0
Boot Volume: Macintosh HD
Boot Mode: Normal
Computer Name: *****
User Name: *****
Secure Virtual Memory: Enabled
System Integrity Protection: Enabled
Time since boot: 55 days 21:54
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,2
Chip: Apple M1 Max
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 32 GB
We start our tests with a work unit of 10000, that is, we generate 10000 random strings, 100 bytes each, and sort them within each command. Each test consists of creating a thread pool and dispatching 200 commands to it. The approximate duration of each command's computation is 0.008 sec. We then increase the work unit size in steps of 10000, up to a duration of 0.2 seconds per command.
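A back-of-the-envelope check for the smallest configuration above: 200 commands at roughly 0.008 s each take about 1.6 s sequentially, so an ideal run on 8 threads should take about one eighth of that, ignoring distribution overhead. This helper only does that arithmetic; the figures are the ones stated above.

```rust
// Ideal wall-clock time for `commands` commands of `secs_per_command`
// seconds each, split evenly across `threads` threads (no overhead).
fn ideal_parallel_time(commands: u32, secs_per_command: f64, threads: u32) -> f64 {
    (commands as f64 * secs_per_command) / threads as f64
}

fn main() {
    let sequential = ideal_parallel_time(200, 0.008, 1);
    let eight_threads = ideal_parallel_time(200, 0.008, 8);
    println!(
        "sequential: {:.2} s, ideal on 8 threads: {:.2} s",
        sequential, eight_threads
    );
    // prints: sequential: 1.60 s, ideal on 8 threads: 0.20 s
}
```

Measured times that land close to this ideal indicate good scaling; the gap between the ideal and the measured numbers is the distribution overhead the suite is designed to expose.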
We can see from the charts below that our computation scales with the number of threads, although the performance degradation at the upper end needs further investigation.
Test with work unit of 1000. Time range is from 0.8 to 17 milliseconds per unit of workload execution
Test with work unit of 100. Time range is from 0.07 to 1.6 milliseconds per unit of workload execution
Test with work unit of 10. Time range is from 19 to 159 microseconds per unit of workload execution
Test with work unit of 1. Time range is from 7 to 15 microseconds per unit of workload execution. In this test we can see clearly that when the execution of a workload unit takes less than 10 microseconds, the distribution overhead, especially lock contention, cancels out the benefits of parallel computation.
- Check how to bring the break-even point down to smaller workloads
- Experiment with a non-blocking queue