proposal: testing: add (*PB).NextIndex() (int, bool) method #43030
Can you show sample documentation for the proposed method? Could a particular benchmark solve this problem by using its own local variable?
The documentation for
There is definitely a performance advantage to adding the proposed method. Furthermore, there is a programmatic advantage to keeping it inside the testing package, rather than making each parallel benchmark repeatedly manage its own.
Why is this part of Next? Why not just have a pb.Index that returns the current goroutine's index?
@rsc is there a provable superiority (programmatic or performance) of having a separate Index?
Having

```go
b.RunParallel(func(pb *testing.PB) {
	for {
		i, ok := pb.NextIndex()
		if !ok {
			break
		}
		// ...
	}
})
```

instead of

```go
b.RunParallel(func(pb *testing.PB) {
	for pb.Next() {
		i := pb.Index()
		// ...
	}
})
```

makes me ask the inverse question: is there a provable superiority (programmatic or performance) of having the combined NextIndex? I'm unconvinced the loss of readability is justified by having one less function call per iteration (hypothetical, considering inlining).
@antichris, of course, this raises the question of how I implemented these functions. The code snippets below should give you an idea:
@chabbimilind, with your code unaltered, on average over 50 runs I see at most a 2% performance loss on my machine. There must be something wrong with yours. If Index is simplified to

```go
func (pb *PB) Index() uint64 { return pb.where - 1 }
```

there is no performance loss at all; there's actually even a 0.42% gain, according to my comparison. That might be due to the elimination of a check. This, I believe, is the way it is meant to be used:

```go
// ...
for pb.Next() {
	i := pb.Index()
	// use i
}
// ...
```

The removal of those checks (and the overhead) it had in your version might make it less safe to use improperly: it would return -1 before the first call to Next. Even you said so yourself.

Looking at the code sample I suppose you were referring to (which is partially reproduced just above here), I can't really understand what you mean by "unlike how I use", though.

Oh, it did, my friend. Indeed it did.
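For context, here is a self-contained sketch of the two API shapes being compared in this thread. The `where` and `bN` fields are hypothetical stand-ins chosen to match the `pb.where - 1` one-liner quoted above; the real `testing.PB` internals differ, so this only illustrates the API shapes, not the actual implementation:

```go
package main

import "fmt"

// PB is a hypothetical stand-in for testing.PB; the real type's
// internals differ. where counts iterations handed out so far.
type PB struct {
	where uint64 // iterations handed out so far (hypothetical field)
	bN    uint64 // total iterations to run (hypothetical field)
}

// NextIndex is the originally proposed combined form: it advances the
// iteration and returns the iteration index along with an ok flag.
func (pb *PB) NextIndex() (uint64, bool) {
	if pb.where >= pb.bN {
		return 0, false
	}
	i := pb.where
	pb.where++
	return i, true
}

// Next mirrors the existing advance-only form.
func (pb *PB) Next() bool {
	_, ok := pb.NextIndex()
	return ok
}

// Index is the simplified accessor discussed above; it is only
// meaningful after a successful Next, as noted in the thread.
func (pb *PB) Index() uint64 { return pb.where - 1 }

func main() {
	pb := &PB{bN: 3}
	for pb.Next() {
		fmt.Println(pb.Index()) // prints 0, 1, 2
	}
}
```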
@antichris how many threads (CPU cores) does your machine have? Are you proposing returning a single value?
4 cores, 8 threads, according to the manufacturer's datasheet. I think I know where you're headed with this, though. That's why I rewrote your benchmark code as "proper" Go benchmarks, so that they could be executed with all the wonderful features that test flags provide, including (but not limited to) ergonomic control over the run count and the CPU count:

```go
package main_test

import "testing"

func BenchmarkNextIndex(b *testing.B) {
	for i := 0; i < 100000; i++ {
		b.RunParallel(func(pb *testing.PB) {
			sum := uint64(0)
			for {
				v, ok := pb.NextIndex()
				if !ok {
					break
				}
				sum += v
			}
		})
	}
}

func BenchmarkIndex(b *testing.B) {
	for i := 0; i < 100000; i++ {
		b.RunParallel(func(pb *testing.PB) {
			sum := uint64(0)
			for pb.Next() {
				v, _ := pb.Index()
				// v := pb.Index() // Use this line instead of the above
				// when a single value is returned.
				sum += v
			}
		})
	}
}
```

The invocation I used for running the benchmarks:

```
go test -bench=. -cpu=1,2,4,8 -count=10 | tee results
```

I did not want to babysit my machine (it has a tendency toward a sharp performance drop when putting the screen to sleep or waking it up) for more than half an hour waiting for 400 runs to complete, hence the `-count=10`. The benchmark results were renamed to matching ones when comparing/summarizing.
Those "noticeable performance losses" seem to be very slight in general, and practically non-existent in the case of the simplified, single-value Index.

Could you please be so kind as to run some benchmarks yourself? Otherwise, based on my experience with benchmarking this issue, I believe it is only fair to assume that you cherry-picked your 32-percent-performance-loss result in bad faith and, having grown so attached to the design you originally proposed, are unwilling to concede it for a potentially more sensible and versatile one.

I have a question for @rsc, though: what about the fact that the
No, not really. I believe failure, in this case, is something one would have checked for via a
Da-yumn, @chabbimilind! If that ain't what they call "butthurt", I sincerely have no clue what impression you were trying to make with that. 😜 Just breathe, friend, relax and focus: having your proposal revised and improved by the gopher community is not the worst thing that could have happened to you today, nor is it a damning statement about your qualities as an individual and a member of society.
There's already a Next, so adding just Index avoids having two different Next operations.
It is also obviously possible to implement these in such a way that there is no performance difference, so I'm not placing much weight on the benchmarks showing a difference.
What I don't see here is much indication that this use of the extra index is something that is generally advisable in parallel benchmarks. My understanding of the parallel benchmark functionality was to be able to measure how fast it would be to run b.N iterations in parallel goroutines, keeping all the goroutines busy even when there is obviously some variation in how quickly each executes, so that you can't just hand out b.N/#goroutines to each one.

The more I think about this, the more I think we've gotten derailed by performance and API and have failed to recognize that adding either Index or NextIndex would be a mistake. If the work done for iteration i is not interchangeable with the work done for iteration j, that's not a valid benchmark at all.

My first objection to the stencilParallel example is that it probably matters which indexes a goroutine is writing. If pb.Next is doing some kind of fine-grained iteration where it splits the work across G goroutines and hands out k, k+G, k+2G, k+3G in a single goroutine until we get close to the end, then all the different goroutines are going to have cache collisions writing to nearby entries in "out". On the other hand, if pb.Next works on blocks and hands out mostly contiguous indexes to particular goroutines, then there will be few cache collisions. The reported speed of a benchmarked function should clearly not depend on the exact iteration order returned by package testing. And of course the way to make that happen is to hide the iteration order completely, as we do now.

But my second, more serious objection is that the stencilSerial example is not even a correct single-threaded benchmark. It's really important that b.N only change the number of times the code runs, not what code runs. In particular, it is a mistake to change the size of the data being processed based on b.N, because cache behavior changes non-linearly with data size, which then invalidates "divide total time by b.N to find per-operation time". When the data size is changing too, per-operation time is not actually a meaningful value - it's not constant.

I think if we were going to adopt Index as an API, we would need a compelling, correct example. We don't have one.
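To make the point above concrete, here is a sketch of the benchmark shape that avoids the problem: the data size is a fixed constant chosen by the author (the `size` value and `stencil` helper are illustrative, not from the original issue), and b.N only controls how many times the whole operation repeats:

```go
package main

import (
	"fmt"
	"testing"
)

// stencil computes the 3-point sum out[i] = in[i-1] + in[i] + in[i+1]
// for all interior indices; boundary elements are left untouched.
func stencil(in, out []float64) {
	for i := 1; i < len(in)-1; i++ {
		out[i] = in[i-1] + in[i] + in[i+1]
	}
}

// benchStencil fixes the data size up front, independent of b.N, so
// total time divided by b.N remains a meaningful per-operation time.
func benchStencil(b *testing.B) {
	const size = 1 << 16 // illustrative size, NOT derived from b.N
	in := make([]float64, size)
	out := make([]float64, size)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		stencil(in, out)
	}
}

func main() {
	// testing.Benchmark lets this run outside of `go test`.
	fmt.Println(testing.Benchmark(benchStencil))
}
```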
/cc @dvyukov |
There is something I don't fully understand in the problem statement: for parallel algorithms, don't we want to measure the production parallelization rather than some random scheme provided by the testing package? If we get some parallel speed-up, it does not seem to be useful, because the actual production code does not have any parallelization. Perhaps you want some parallel algorithm support package that can be used in production code, and then measured in benchmarks. RunParallel is not intended for parallel algorithms; it's intended for measuring N logically independent copies of an operation to check how well that scales.

The proposed NextIndex scheme also seems to imply dependence of data size on the number of iterations. That's generally a wrong thing because of the associated non-linear effects. E.g. a faster version may trigger a 10x larger data size and end up being slower... A better scheme is to do the whole large operation b.N times on a data set of the target size (or sizes). But per-se unique index assignment can be done with pb.Next as follows:
Based on the discussion, this seems like a likely decline. |
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

(go env output elided)

What did you do?
`testing.PB.Next()` only returns a boolean indicating whether there exists another iteration when running parallel benchmarks. This is insufficient in many situations where one needs to know the current iteration index.

See the Go code in the attached stencil.go.zip file.

Consider the serial version of the 3-point stencil example, which reads from an input slice `in` and writes to an output slice `out`, computing `in[i-1] + in[i] + in[i+1]` for iteration index `i` and writing the result to `out[i]`.

Try to convert it to a parallel benchmark; here is a first racy version, which is definitely incorrect.
The `for i := 0; pb.Next(); i++` loop counter is local to each goroutine, so multiple goroutines will overwrite the same index `out[i]`, and not all indices will be read/updated; hence it is functionally incorrect. Furthermore, the cacheline ping-ponging will deteriorate the performance of this parallel code (shown at the end of this issue).
A somewhat sane parallel version is to create the `in` and `out` slices inside each goroutine, as shown below. However, this benchmark has several problems.
We need a `func (pb *PB) NextIndex() (int, bool)` API that can return the iteration index, so that each iteration inside the parallel region can take the appropriate action. See the code below, with the proposed `NextIndex` API in action.

It is worth observing the running time of each of these examples (I have implemented `NextIndex` locally) when scaling GOMAXPROCS from 1 to 8:

- `stencilSerial`, as expected, stays put.
- `stencilParallelRacy` has poor scalability due to heavy cacheline conflicts (and, of course, an incorrect result).
- `stencilParallel` has pathetic parallel scaling due to bloated memory needs and the associated initialization in each goroutine.
- `stencilParallelNextIndex` shows higher (although not perfectly linear) throughput with more GOMAXPROCS and is much more desirable when writing parallel benchmarks.

This is a small change that introduces a new API.
What did you expect to see?
NA
What did you see instead?
NA