Make our handshake results more intuitive #29
Conversation
In doing some testing with the handshake test, I was getting results that made very little sense to me: a significant decrease in usecs/handshake, but consistent handshakes/second counts, with the overall time of the test (as measured by time) dropping.

I think what we're currently doing to measure handshakes per second, while not wrong per se, is a bit counter-intuitive. We are currently computing this value by summing the time each thread runs and dividing that by the fixed number of handshakes we perform. While that's sensible, it means that two CPUs running in parallel for a total process run time of X seconds produce a cumulative run time of 2X seconds, which, while true, doesn't accurately reflect the number of handshakes a multi-threaded process can complete in a given amount of wall clock time.

I'm proposing that we alter the computation so that handshakes per second are computed as:

    sum(i=1..N)(handshakes thread[i] completes / time thread[i] executes)

and usecs per handshake as:

    sum(i=1..N)(time thread[i] executes / handshakes thread[i] completes) / N
    ttime = times[0];
    for (i = 1; i < threadcount; i++)
        ttime = ossl_time_add(ttime, times[i]);
    avcalltime = (double)0;
    persec += (double)0;
    for (i = 0; i < threadcount; i++) {
        avcalltime += ((double)ossl_time2ticks(times[i]) / (num_calls / threadcount)) / (double)OSSL_TIME_US;
        persec += (((num_calls * OSSL_TIME_SECOND) / (double)threadcount) / (double)ossl_time2ticks(times[i]));
I do not think this is a correct change. Either we want to compute the wall clock time all the threads spent in the computation, or we want to compute how much time it took to perform the num_calls operations regardless of the number of threads involved. This is something in the middle that has no real meaning IMO.
Furthermore, we will lose consistency in the graphs tracking the library changes over time.
I instead propose to use the existing persec number (which measures the number of executions per second for the whole process) for comparison across numbers of threads.
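To make the distinction concrete, here is a rough sketch (not the actual patch) of the two quantities being discussed, using the same OSSL_TIME helpers as the diff above. The variable names (times, threadcount, num_calls, duration) follow the existing test code; the header path is the one from OpenSSL's internal time API, so adjust it to wherever the perf tools pull OSSL_TIME from.

    #include <stddef.h>
    #include "internal/time.h"   /* OSSL_TIME, ossl_time_add(), ossl_time2ticks() */

    /* current scheme: num_calls divided by the *sum* of per-thread run times */
    static double persec_from_cumulative_time(const OSSL_TIME *times,
                                              int threadcount, size_t num_calls)
    {
        OSSL_TIME ttime = times[0];
        int i;

        for (i = 1; i < threadcount; i++)
            ttime = ossl_time_add(ttime, times[i]);
        return ((double)num_calls * OSSL_TIME_SECOND)
               / (double)ossl_time2ticks(ttime);
    }

    /* alternative: num_calls divided by the wall clock time of the whole process */
    static double persec_from_wall_clock(OSSL_TIME duration, size_t num_calls)
    {
        return ((double)num_calls * OSSL_TIME_SECOND)
               / (double)ossl_time2ticks(duration);
    }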
I can accept a different change: computing the stats from wall clock time rather than per-thread time makes more intuitive sense to me, as that is going to be representative of what an application sees overall.
I disagree that the current persec (handshakes per second) number accurately measures the number of executions for the process, as we are assuming the process time is the sum of each thread's execution time (i.e. 2 threads running for 2 seconds of wall clock time on 2 CPUs is counted as 4 total seconds, which stands in stark contrast to the wall clock run time of the process, which would be 2 seconds). This discrepancy becomes quite pronounced at higher thread counts.
Insofar as consistency is concerned, I get that, but the implication there seems to be that, even if the stats are not useful, we can't change them, which would make it impossible to fix or change anything.
OK, I can accept that we should change all tests to use wall clock time of the process instead of this sum of wall clock times of all threads. But we need to at least do it consistently across all the test apps. And if we are changing it, we should IMO also change the tested number of threads, because the current 1, 10, 100, 500 does not give enough granularity, and the step from 100 to 500 does not add much information.
Something like 1, 4, 16, 64, 256 would IMO be much better for logarithmic-scale graphs.
Agreed, makes sense. I'll close this in favor of a new PR that addresses all tests shortly.
I agree that we need more steps than 1, 10, 100 and 500. The proposal to increase by a factor of 4 works for me, but I think I prefer a factor of 2. I don't see the point of logarithmic-scale graphs, so anything with more points in between works for me. I'm also wondering whether there is a point in testing this with more threads than the available number of cores.
It's my understanding that not all threads run for the same amount of time, and getting good stats for that is going to be complicated. Maybe we should instead run for a fixed amount of time.
We've agreed with @nhorman above that we will stop measuring per-thread time, as that is just confusing, and start measuring the time of the whole process execution, as is currently done in the handshake test for the persec measurement.
If all threads don't run for the same amount of time, it means that the number of threads running is not constant, and so as more threads finish their work, the remaining threads get faster in the case of contention.
Just to give some data behind the thread run time question: when running with 10 threads, this is what I get for the run time of each thread in usecs. That's an average of 778044.3 usecs with a standard deviation of 157592.062 usecs, so I read that as most of our samples being within a single standard deviation of the average time, with two threads getting blocked for a significantly greater time (0.3 seconds, it looks like to me).

And with 20 threads: that's an average of 508388.75 usecs with a standard deviation of 125520.353 usecs. I interpret that as our run time becoming slightly more consistent, but more threads still becoming blocked on a lock for a long period of time (note that several threads run in about 350k usecs, while others take 550k usecs, about 0.2 seconds longer).

So yeah, the conclusion I draw is that our thread timing is fairly inconsistent.
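For reference, a standard deviation like the ones quoted above can be computed from the per-thread times roughly as follows; this is illustrative only and not part of the test code.

    #include <math.h>
    #include <stddef.h>

    /* population mean and standard deviation of the per-thread run times (usecs) */
    static void thread_time_stats(const double *usecs, size_t n,
                                  double *mean, double *stddev)
    {
        double sum = 0.0, sqsum = 0.0;
        size_t i;

        for (i = 0; i < n; i++)
            sum += usecs[i];
        *mean = sum / (double)n;
        for (i = 0; i < n; i++)
            sqsum += (usecs[i] - *mean) * (usecs[i] - *mean);
        *stddev = sqrt(sqsum / (double)n);
    }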
The more I think about this, the more I believe that measuring the wall clock time of each thread is wrong for the summary numbers. IMO what we are interested in is the run time of the process. An alternative could be to run all threads until the total expected number of iterations is reached and terminate them only then. However, then you would need to use atomics on the total count, and that might slightly skew the numbers for fast tests such as the EVP fetch test. Which we might care about, or maybe not.
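For illustration, the atomic-counter alternative mentioned above could look roughly like this. do_one_operation() is a placeholder for whichever test's work item (a handshake, an EVP fetch, ...), and num_calls for the total iteration target; neither is an existing function or variable being proposed here.

    #include <stdatomic.h>
    #include <stddef.h>

    static atomic_size_t total_done;     /* shared across all worker threads */
    static size_t num_calls;             /* total iterations wanted for the whole run */

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            /* claim one iteration; every thread keeps going until the global
             * target is reached, so no thread goes idle early */
            size_t n = atomic_fetch_add_explicit(&total_done, 1,
                                                 memory_order_relaxed);
            if (n >= num_calls)
                break;
            do_one_operation();          /* placeholder for the test's work item */
        }
        return NULL;
    }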
But the run time of the process is the wall clock time, for all intents and purposes. If a process has two threads that run on two CPUs, and thread A runs for 2 seconds while thread B runs for 4 seconds, the process run time, from the standpoint of the user, is 4 seconds. The way we gather our stats currently, we would report that the process run time was 6 seconds, which is actually the cumulative CPU time spent on the process, and that is very counter-intuitive.
You're probably misunderstanding me. In the previous comment I agreed that measuring the run time of each thread is counter-intuitive and that we should change the measurement to measure just the process run time. What I do NOT want us to do is to measure the average run time of a thread and call that the run time of the process, as that would not be representative of anything real. For example, if all the threads ran sequentially (which is of course completely unlikely, but...) then the average run time with 10 threads could be, say, 10 seconds while the total run time would be 100 seconds. Of course that is an extreme, and in reality the gap would be smaller.
OK, yes, I misunderstood you, apologies. I agree, computing the mean of the cumulative thread run times is a meaningless statistic. What I would like to see is the run time of the whole process (i.e. what the user experiences). This should be represented by the duration variable in all of our tests, as it measures the time from when the first thread is created until the last thread exits (i.e. the last pthread_join returns). I think we're on the same page. I'm working on some changes now that I think we can all agree with (I hope). I'm adding a -w option to each test, to report measurements against wall clock time instead of cumulative CPU time (i.e. what we have now). My intent is to default this option to off, so that our current stats dashboard remains consistent, but it gives us the option to measure against wall clock time for the purposes of various PR tests, so that results are a bit more intuitive.
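A rough sketch of what that could look like, assuming the measurement described above (duration spanning first thread creation to last join) and the proposed -w toggle; worker, report_stats_against and cumulative_thread_time are hypothetical helpers, and the header path follows OpenSSL's internal time API.

    #include <pthread.h>
    #include "internal/time.h"   /* OSSL_TIME, ossl_time_now(), ossl_time_subtract() */

    static int wall_clock;               /* set to 1 when -w is given, default off */

    static void run_threads_and_report(pthread_t *threads, int threadcount)
    {
        int i;
        OSSL_TIME start, duration;

        start = ossl_time_now();          /* just before the first thread is created */
        for (i = 0; i < threadcount; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < threadcount; i++)
            pthread_join(threads[i], NULL);
        duration = ossl_time_subtract(ossl_time_now(), start);

        if (wall_clock)
            report_stats_against(duration);                 /* what the user experiences */
        else
            report_stats_against(cumulative_thread_time()); /* current behaviour */
    }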