Conversation

@nhorman (Contributor) commented Jun 26, 2025

In doing some testing with the handshake test, I was getting results that made very little sense to me: a significant decrease in usecs/handshake, but consistent handshakes/second counts, even as the overall time of the test (as measured by time) dropped.

I think what we're currently doing to measure the handshakes per second, while not wrong per se, is a bit counter-intuitive. We are currently computing this value by summing the time each thread runs and dividing that by the fixed number of handshakes we perform. While that's sensible, it means that two cpus running in parallel for a total process run time of X seconds produce a cumulative run time of 2X seconds, which, while true, doesn't accurately reflect the number of handshakes a multi-threaded process can complete in a given amount of wall clock time.

I'm proposing that we alter the computation so that handshakes per second are computed as:
sum(i=1..N)(handshakes thread[i] completes / time thread[i] executes)
and usecs per handshake are computed as:
sum(i=1..N)(time thread[i] executes / handshakes thread[i] completes) / N
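
For concreteness, here is a minimal standalone sketch of the proposed computation. It is illustrative only, not the perf tool code; thread_usecs, threadcount and num_calls are placeholder names, the timings are example values, and num_calls is assumed to be split evenly across the threads:

    /*
     * Minimal sketch of the proposed per-thread computation.  Not code
     * from the perf tools: thread_usecs holds example per-thread run
     * times in usecs, and num_calls is split evenly across the threads.
     */
    #include <stdio.h>

    int main(void)
    {
        double thread_usecs[] = { 700000.0, 710000.0, 1050000.0, 1100000.0 };
        int threadcount = sizeof(thread_usecs) / sizeof(thread_usecs[0]);
        int num_calls = 10000;          /* total handshakes across all threads */
        double calls_per_thread = (double)num_calls / threadcount;
        double persec = 0.0, avcalltime = 0.0;
        int i;

        for (i = 0; i < threadcount; i++) {
            /* handshakes/sec: sum of each thread's own completion rate */
            persec += calls_per_thread / (thread_usecs[i] / 1e6);
            /* usecs/handshake: average of each thread's per-call cost */
            avcalltime += thread_usecs[i] / calls_per_thread;
        }
        avcalltime /= threadcount;

        printf("%.2f handshakes/sec, %.2f usecs/handshake\n", persec, avcalltime);
        return 0;
    }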

@nhorman self-assigned this Jun 26, 2025
@nhorman moved this to Waiting Review in Development Board Jun 26, 2025
Comment on lines -399 to +403
    ttime = times[0];
    for (i = 1; i < threadcount; i++)
        ttime = ossl_time_add(ttime, times[i]);
    avcalltime = (double)0;
    persec += (double)0;
    for (i = 0; i < threadcount; i++) {
        avcalltime += ((double)ossl_time2ticks(times[i]) / (num_calls / threadcount)) / (double)OSSL_TIME_US;
        persec += (((num_calls * OSSL_TIME_SECOND) / (double)threadcount) / (double)ossl_time2ticks(times[i]));
@t8m (Member) commented Jun 26, 2025

I do not think this is a correct change. Either we want to compute the wall clock time of all the threads spent in the computation or we want to compute how much time it took to perform the num_calls operations regardless of the number of threads involved. This is something in the middle that has no real meaning IMO.

Furthermore we will lose consistency in the graphs tracking the library changes over time.

I instead propose to use the existing persec number (which measures the number of executions per second for the whole process) for comparison across number of threads.

@nhorman (Contributor, Author) replied:

I can accept a different change: it makes more intuitive sense to me to compute stats using wall clock time rather than per-thread time, as that is representative of what an application sees overall.

I disagree that the current persec (handshakes per second) number accurately measures the number of executions for the process, because we are assuming the process time is the sum of each thread's execution time (i.e. 2 threads running for 2 seconds of wall clock time on 2 cpus is counted as 4 total seconds, which stands in stark contrast to the wall clock run time of the process, which would be 2 seconds). This discrepancy becomes even more pronounced at higher thread counts.

Insofar as consistency is concerned, I get that, but I feel like the implication there is that, even if the stats are not useful, we can't change them, which seems to make it impossible to fix or change anything.

Member replied:

OK, I can accept that we should change all tests to use the wall clock time of the process instead of this sum of wall clock times of all threads. But we need to at least do it consistently across all the test apps. And if we are changing it, we should IMO also change the tested numbers of threads, because the current 1, 10, 100, 500 does not give enough granularity, and the difference between 100 and 500 does not add much information.

Something like 1, 4, 16, 64, 256 would IMO be much better for logarithmic scale graphs.

@nhorman (Contributor, Author) replied:

Agreed, that makes sense. I'll close this shortly in favor of a new PR that addresses all tests.

Member replied:

I agree that we need more steps than 1, 10, 100 and 500. The proposal to increase by a factor of 4 works for me, but I think I prefer a factor of 2. I don't see the point of logarithmic scale graphs, so anything with more points in between works for me. I'm also wondering if there is a point in testing this with more threads than the available number of cores.

@kroeckx (Member) commented Jun 26, 2025

It's my understanding that not all threads run for the same amount of time, and getting good stats for that is going to be complicated. Maybe we should instead run for a fixed amount of time.

@t8m (Member) commented Jun 27, 2025

> It's my understanding that not all threads run for the same amount of time, and getting good stats for that is going to be complicated. Maybe we should instead run for a fixed amount of time.

We've agreed with @nhorman above that we will stop measuring per-thread time, as that is just confusing, and start measuring the time of the whole process execution, as is currently done in the handshake test for the persec measurement.

@kroeckx (Member) commented Jun 27, 2025

If all threads don't run for the same amount of time, it means that the number of threads running is not constant, and so as more threads finish their work, the remaining threads become faster in the case of contention.

@nhorman (Contributor, Author) commented Jun 27, 2025

Just to give some data behind the thread run time question: when running with 10 threads, this is what I get for the run time of each thread in usecs

LD_LIBRARY_PATH=/home/nhorman/git/openssl ./handshake /home/nhorman/git/openssl/test/certs/ 10
688606 694903 696000 697386 698750 701242 701172 716509 1091164 1094711

That's an average of 778044.3 usecs with a standard deviation of 157592.062 usecs, so I read that as most of our samples being within a single standard deviation of the average time, with two threads getting blocked for a significantly longer time (around 0.3 seconds, it looks like to me).

And with 20 threads:

350791 355443 354310 354701 356165 355986 366422 442179 513128 549581 558936 559382 565822 569341 571090 629740 674608 678816 680041 681293

That's an average of 508388.75 usecs with a standard deviation of 125520.353 usecs. I interpret that as our run times becoming slightly more consistent, but with more threads still getting blocked on a lock for a long period of time (note that several threads run in about 350k usecs, while others take 550k usecs, about 0.2 seconds longer).

So yeah, the conclusion I draw is that our thread timing is fairly inconsistent.
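
For reference, here is a small standalone sketch of how these summary figures can be reproduced from the per-thread times. It is not code from the perf tools; it uses the 10-thread sample above and the population standard deviation (dividing by N), which appears to match the quoted numbers:

    /*
     * Standalone sketch: compute the mean and (population) standard
     * deviation of the per-thread run times listed above.
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double usecs[] = { 688606, 694903, 696000, 697386, 698750,
                           701242, 701172, 716509, 1091164, 1094711 };
        size_t n = sizeof(usecs) / sizeof(usecs[0]);
        double sum = 0.0, sqsum = 0.0, mean, stddev;
        size_t i;

        for (i = 0; i < n; i++)
            sum += usecs[i];
        mean = sum / n;

        for (i = 0; i < n; i++)
            sqsum += (usecs[i] - mean) * (usecs[i] - mean);
        stddev = sqrt(sqsum / n);

        printf("mean = %.1f usecs, stddev = %.3f usecs\n", mean, stddev);
        return 0;
    }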

@t8m (Member) commented Jun 27, 2025

The more I think about this, the more I believe that measuring the wall clock time of each thread is wrong for the summary numbers. IMO what we are interested in is the run time of the process.

An alternative could be to run all threads until the total expected number of iterations is reached and terminate them only then. However, you would then need to use atomics on the total count, and that might slightly skew the numbers for fast tests such as the EVP fetch test. Which we might or might not care about.
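
A minimal sketch of that alternative, assuming C11 atomics and pthreads; the names target_iterations and do_one_iteration are placeholders, not anything from the perf tools:

    /*
     * Sketch: all threads keep drawing iteration slots from a shared
     * atomic counter until the total target is reached, and only then
     * exit.  Illustrative only.
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static atomic_long total_count = 0;
    static const long target_iterations = 100000;

    static void do_one_iteration(void)
    {
        /* stand-in for the operation being benchmarked (handshake, fetch, ...) */
    }

    static void *worker(void *arg)
    {
        (void)arg;
        /* fetch_add hands out slots until the shared target is hit */
        while (atomic_fetch_add(&total_count, 1) < target_iterations)
            do_one_iteration();
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        int i;

        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        printf("performed %ld iterations\n", target_iterations);
        return 0;
    }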

@nhorman (Contributor, Author) commented Jun 27, 2025

But the run time of the process is the wall clock time, for all intents and purposes. If a process has two threads that run on two cpus, and thread A runs for 2 seconds while thread B runs for 4 seconds, then from the standpoint of the user the process executes for a total of 4 seconds. The way we gather our stats currently, we would report that the process run time was 6 seconds, which is actually the cumulative cpu time spent on the process and is very counter-intuitive.

@t8m (Member) commented Jun 27, 2025

You're probably misunderstanding me. In the previous comment I agreed that measuring the run time of each thread is counterintuitive and that we should change the measurement to measure just the process run time.

What I do NOT want us to do is to measure the average run time of a thread and call that the run time of the process, as that would not be a representative number of anything real. For example, if all the threads ran sequentially (which is of course completely unlikely, but...), then the average run time with 10 threads could be, say, 10 seconds, while the total run time would be 100 seconds. Of course that is an extreme case, and in reality the difference would be smaller.

@nhorman (Contributor, Author) commented Jun 27, 2025

Ok, yes, I misunderstood you, apologies. I agree, computing the mean of the cumulative thread run times is a meaningless statistic. What I would like to see is the run time of the whole process (i.e. what the user experiences). This should be represented by the duration variable in all of our tests, as it measures the time from when the first thread is created until the last thread exits (i.e. the last pthread_join returns). I think we're on the same page.

I'm working on some changes now that I think we can all agree with (I hope). I'm adding a -w option to each test to report measurements against wall clock time instead of cumulative cpu time (i.e. what we have now). My intent is to default this option to off, so that our current stats dashboard remains consistent, but it gives us the option to measure against wall clock time for the purposes of various PR tests, so that results are a bit more intuitive.
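
As an illustration of the wall clock measurement described above, here is a minimal sketch that timestamps before the first pthread_create and after the last pthread_join; the worker function and thread count are placeholders, not the actual perf tool code:

    /*
     * Sketch: measure wall clock duration for the whole process by
     * timestamping before the first thread is created and after the
     * last pthread_join() returns.  Illustrative only.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NUM_THREADS 4

    static void *worker(void *arg)
    {
        (void)arg;   /* stand-in for the benchmarked work */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        struct timespec start, end;
        double wall_usecs;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &start);

        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        clock_gettime(CLOCK_MONOTONIC, &end);

        wall_usecs = (end.tv_sec - start.tv_sec) * 1e6
                     + (end.tv_nsec - start.tv_nsec) / 1e3;
        printf("wall clock duration: %.0f usecs\n", wall_usecs);
        return 0;
    }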

Labels: none yet
Projects: Development Board (Status: Done)
Participants: 4