Conversation

@nhorman (Contributor) commented Jun 26, 2025

In doing some testing with the handshake test, I was getting results that made very little sense to me: a significant decrease in usecs/handshake, but consistent handshakes/second counts, even as the overall time of the test (as measured by time) dropped.

I think what we're currently doing to measure the handshakes per second, while not wrong per se, is a bit counter-intuitive. We are currently computing this value by summing the time each thread runs and dividing that by the fixed number of handshakes we perform. While that's sensible, it means that two cpus running in parallel for a total process run time of X seconds produce a cumulative run time of 2X seconds, which, while true, doesn't accurately reflect the number of handshakes a multi-threaded process can complete in a given amount of wall clock time.

I'm proposing that we alter the computation so that handshakes per second are computed as:
sum(i=1..N)(handshakes thread[i] completes / time thread[i] executes)
and usecs per handshake are computed as:
sum(i=1..N)(time thread[i] executes / handshakes thread[i] completes) / N
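
For concreteness, here is a minimal standalone sketch of the proposed computation. It is illustrative only, not the perf tool code; thread_usecs, threadcount and num_calls are placeholder names, the timings are example values, and num_calls is assumed to be split evenly across the threads:

    /*
     * Minimal sketch of the proposed per-thread computation.  Not code
     * from the perf tools: thread_usecs holds example per-thread run
     * times in usecs, and num_calls is split evenly across the threads.
     */
    #include <stdio.h>

    int main(void)
    {
        double thread_usecs[] = { 700000.0, 710000.0, 1050000.0, 1100000.0 };
        int threadcount = sizeof(thread_usecs) / sizeof(thread_usecs[0]);
        int num_calls = 10000;          /* total handshakes across all threads */
        double calls_per_thread = (double)num_calls / threadcount;
        double persec = 0.0, avcalltime = 0.0;
        int i;

        for (i = 0; i < threadcount; i++) {
            /* handshakes/sec: sum of each thread's own completion rate */
            persec += calls_per_thread / (thread_usecs[i] / 1e6);
            /* usecs/handshake: average of each thread's per-call cost */
            avcalltime += thread_usecs[i] / calls_per_thread;
        }
        avcalltime /= threadcount;

        printf("%.2f handshakes/sec, %.2f usecs/handshake\n", persec, avcalltime);
        return 0;
    }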

@nhorman self-assigned this Jun 26, 2025
@nhorman moved this to Waiting Review in Development Board Jun 26, 2025
Comment on lines -399 to +403
    ttime = times[0];
    for (i = 1; i < threadcount; i++)
        ttime = ossl_time_add(ttime, times[i]);
    avcalltime = (double)0;
    persec += (double)0;
    for (i = 0; i < threadcount; i++) {
        avcalltime += ((double)ossl_time2ticks(times[i]) / (num_calls / threadcount)) / (double)OSSL_TIME_US;
        persec += (((num_calls * OSSL_TIME_SECOND) / (double)threadcount) / (double)ossl_time2ticks(times[i]));
@t8m (Member) commented Jun 26, 2025

I do not think this is a correct change. Either we want to compute the wall clock time of all the threads spent in the computation or we want to compute how much time it took to perform the num_calls operations regardless of the number of threads involved. This is something in the middle that has no real meaning IMO.

Furthermore we will lose consistency in the graphs tracking the library changes over time.

I instead propose to use the existing persec number (which measures the number of executions per second for the whole process) for comparison across number of threads.

@nhorman (Contributor, Author) replied:

I can accept a different change: it makes more intuitive sense to me to compute stats using wall clock time rather than per-thread time, as that is representative of what an application sees overall.

I disagree that the current persec (handshakes per second) number accurately measures the number of executions for the process, because we are assuming the process time is the sum of each thread's execution time (i.e. 2 threads running for 2 seconds of wall clock time on 2 cpus is counted as 4 total seconds, which stands in stark contrast to the wall clock run time of the process, which would be 2 seconds). This discrepancy becomes even more pronounced at higher thread counts.

Insofar as consistency is concerned, I get that, but I feel like the implication there is that, even if the stats are not useful, we can't change them, which seems to make it impossible to fix or change anything.

Member replied:

OK, I can accept that we should change all tests to use the wall clock time of the process instead of this sum of wall clock times of all threads. But we need to at least do it consistently across all the test apps. And if we are changing it, we should IMO also change the tested numbers of threads, because the current 1, 10, 100, 500 does not give enough granularity, and the difference between 100 and 500 does not add much information.

Something like 1, 4, 16, 64, 256 would IMO be much better for logarithmic scale graphs.

@nhorman (Contributor, Author) replied:

Agreed, that makes sense. I'll close this shortly in favor of a new PR that addresses all tests.

Member replied:

I agree that we need more steps than 1, 10, 100 and 500. The proposal to increase by a factor of 4 works for me, but I think I prefer a factor of 2. I don't see the point of logarithmic scale graphs, so anything with more points in between works for me. I'm also wondering if there is a point in testing this with more threads than the available number of cores.

@kroeckx (Member) commented Jun 26, 2025

It's my understanding that not all threads run for the same amount of time, and getting good stats for that is going to be complicated. Maybe we should instead run for a fixed amount of time.

@t8m (Member) commented Jun 27, 2025

> It's my understanding that not all threads run for the same amount of time, and getting good stats for that is going to be complicated. Maybe we should instead run for a fixed amount of time.

We've agreed with @nhorman above that we will stop measuring per-thread time, as that is just confusing, and start measuring the time of the whole process execution, as is currently done in the handshake test for the persec measurement.

@kroeckx (Member) commented Jun 27, 2025

If all threads don't run for the same amount of time, it means that the number of threads running is not constant, and so as more threads finish their work, the remaining threads become faster in the case of contention.

@nhorman (Contributor, Author) commented Jun 27, 2025

Just to give some data behind the thread run time question: when running with 10 threads, this is what I get for the run time of each thread in usecs

LD_LIBRARY_PATH=/home/nhorman/git/openssl ./handshake /home/nhorman/git/openssl/test/certs/ 10
688606 694903 696000 697386 698750 701242 701172 716509 1091164 1094711

That's an average of 778044.3 usecs with a standard deviation of 157592.062 usecs, so I read that as most of our samples being within a single standard deviation of the average time, with two threads getting blocked for a significantly longer time (around 0.3 seconds, it looks like to me).

And with 20 threads:

350791 355443 354310 354701 356165 355986 366422 442179 513128 549581 558936 559382 565822 569341 571090 629740 674608 678816 680041 681293

That's an average of 508388.75 usecs with a standard deviation of 125520.353 usecs. I interpret that as our run times becoming slightly more consistent, but with more threads still getting blocked on a lock for a long period of time (note that several threads run in about 350k usecs, while others take 550k usecs, about 0.2 seconds longer).

So yeah, the conclusion I draw is that our thread timing is fairly inconsistent.
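
For reference, here is a small standalone sketch of how these summary figures can be reproduced from the per-thread times. It is not code from the perf tools; it uses the 10-thread sample above and the population standard deviation (dividing by N), which appears to match the quoted numbers:

    /*
     * Standalone sketch: compute the mean and (population) standard
     * deviation of the per-thread run times listed above.
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double usecs[] = { 688606, 694903, 696000, 697386, 698750,
                           701242, 701172, 716509, 1091164, 1094711 };
        size_t n = sizeof(usecs) / sizeof(usecs[0]);
        double sum = 0.0, sqsum = 0.0, mean, stddev;
        size_t i;

        for (i = 0; i < n; i++)
            sum += usecs[i];
        mean = sum / n;

        for (i = 0; i < n; i++)
            sqsum += (usecs[i] - mean) * (usecs[i] - mean);
        stddev = sqrt(sqsum / n);

        printf("mean = %.1f usecs, stddev = %.3f usecs\n", mean, stddev);
        return 0;
    }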

@t8m (Member) commented Jun 27, 2025

The more I think about this, the more I believe that measuring the wall clock time of each thread is wrong for the summary numbers. IMO what we are interested in is the run time of the process.

An alternative could be to run all threads until the total expected number of iterations is reached and terminate them only then. However, you would then need to use atomics on the total count, and that might slightly skew the numbers for fast tests such as the EVP fetch test. Which we might or might not care about.
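
A minimal sketch of that alternative, assuming C11 atomics and pthreads; the names target_iterations and do_one_iteration are placeholders, not anything from the perf tools:

    /*
     * Sketch: all threads keep drawing iteration slots from a shared
     * atomic counter until the total target is reached, and only then
     * exit.  Illustrative only.
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static atomic_long total_count = 0;
    static const long target_iterations = 100000;

    static void do_one_iteration(void)
    {
        /* stand-in for the operation being benchmarked (handshake, fetch, ...) */
    }

    static void *worker(void *arg)
    {
        (void)arg;
        /* fetch_add hands out slots until the shared target is hit */
        while (atomic_fetch_add(&total_count, 1) < target_iterations)
            do_one_iteration();
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        int i;

        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        printf("performed %ld iterations\n", target_iterations);
        return 0;
    }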

@nhorman (Contributor, Author) commented Jun 27, 2025

But the run time of the process is the wall clock time, for all intents and purposes. If a process has two threads that run on two cpus, and thread A runs for 2 seconds while thread B runs for 4 seconds, then from the standpoint of the user the process executes for a total of 4 seconds. The way we gather our stats currently, we would report that the process run time was 6 seconds, which is actually the cumulative cpu time spent on the process and is very counter-intuitive.

@t8m (Member) commented Jun 27, 2025

You're probably misunderstanding me. In the previous comment I agreed that measuring the run time of each thread is counterintuitive and that we should change the measurement to measure just the process run time.

What I do NOT want us to do is to measure the average run time of a thread and call that the run time of the process, as that would not be a representative number of anything real. For example, if all the threads ran sequentially (which is of course completely unlikely, but...), then the average run time with 10 threads could be, say, 10 seconds, while the total run time would be 100 seconds. Of course that is an extreme case, and in reality the difference would be smaller.

@nhorman (Contributor, Author) commented Jun 27, 2025

Ok, yes, I misunderstood you, apologies. I agree, computing the mean of the cumulative thread run times is a meaningless statistic. What I would like to see is the run time of the whole process (i.e. what the user experiences). This should be represented by the duration variable in all of our tests, as it measures the time from when the first thread is created until the last thread exits (i.e. the last pthread_join returns). I think we're on the same page.

I'm working on some changes now that I think we can all agree with (I hope). I'm adding a -w option to each test to report measurements against wall clock time instead of cumulative cpu time (i.e. what we have now). My intent is to default this option to off, so that our current stats dashboard remains consistent, but it gives us the option to measure against wall clock time for the purposes of various PR tests, so that results are a bit more intuitive.
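
As an illustration of the wall clock measurement described above, here is a minimal sketch that timestamps before the first pthread_create and after the last pthread_join; the worker function and thread count are placeholders, not the actual perf tool code:

    /*
     * Sketch: measure wall clock duration for the whole process by
     * timestamping before the first thread is created and after the
     * last pthread_join() returns.  Illustrative only.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NUM_THREADS 4

    static void *worker(void *arg)
    {
        (void)arg;   /* stand-in for the benchmarked work */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        struct timespec start, end;
        double wall_usecs;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &start);

        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        clock_gettime(CLOCK_MONOTONIC, &end);

        wall_usecs = (end.tv_sec - start.tv_sec) * 1e6
                     + (end.tv_nsec - start.tv_nsec) / 1e3;
        printf("wall clock duration: %.0f usecs\n", wall_usecs);
        return 0;
    }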

Labels: none yet
Projects: Development Board (Status: Done)
Participants: 4