Arrival rate fixes #2038
Conversation
originally discussed in this comment #1957 (comment)
…tion Reusing the work done in #1957
Codecov Report
@@ Coverage Diff @@
## master #2038 +/- ##
==========================================
- Coverage 71.75% 71.73% -0.03%
==========================================
Files 179 179
Lines 14104 14100 -4
==========================================
- Hits 10121 10115 -6
- Misses 3345 3347 +2
Partials 638 638
This helps with the fact that there is currently no guarantee that, just because we have added 100 VUs, any of them have actually started and are waiting for an iteration. This likely has some downside in creating more channels when adding VUs back and forth, but that should be minimal compared to the amount of work needed to call a JS function. Additionally, for high-rate runs it was previously possible for not all VUs to be available, leading to k6 not actually being able to run iterations at the given rate, even though it would have been able to if all the VUs had actually started.
LGTM, nicely done!
	for range p.iterations {
		atomic.AddUint64(&p.running, uint64(1))
		runfn(ctx, avu)
		atomic.AddUint64(&p.running, ^uint64(0))
	}
}()
<-ch
I am confused by this 😕 I read your explanation in 5df0e6b, but if this extra bit of waiting helps all that much, then we probably have a bigger conceptual problem, since this is not very different from sleep(time.Millisecond). Can you point me to what exactly failed when this wasn't present?
This is basically waiting for the goroutine to have started; only then does it report that it has added a VU.
What fails are all the usual TestConstantArrivalRateRunCorrectTiming tests, as shown in https://github.com/k6io/k6/runs/2640090549?check_suite_focus=true, which, at least in my testing, happens because we:
- add all the VUs
- try to start an iteration
- none of the VUs has yet gotten to the line `for range p.iterations`, as Go probably hadn't even initialized the goroutine yet.
This is particularly easy to reproduce with db89d213 and `go test -count 1 -cpu 1 -race -run TestConstantArrivalRateRunCorrectTiming`, specifically the `-cpu 1` part, as with `-cpu 8` I can't reproduce it in practice - probably because there is always a free slot for that goroutine to start immediately, or at least fast enough.
This is because we no longer have a "semaphore" (using the channel `activeVUs`) but instead have workers. The semaphore was initialized synchronously, while the workers are asynchronous for the most part - we just say "start this goroutine". With this change we at least wait for the "start the goroutine" part, instead of hoping that Go will schedule it fast enough. Which is totally different from what `time.Sleep(time....)` would do.
My point was that if the CPU contention was so heavy, it's possible that a goroutine wouldn't be allowed to run for a long while, even after it was started, so it might not get to run the `for` loop after it closes the channel 🤷♂️
I get how this is different from a `Sleep`, but as far as I understand, it's not actually locking anything that raced before... So if it failed before, this doesn't guarantee it will work afterwards? I guess it would just greatly reduce the chances of a test failing, which might be good enough, though a better solution would be to improve the test, with #1986 or something else...
So yeah, I don't mind this code, I just didn't understand what it solved.
The `go` statement explicitly doesn't wait for the goroutine to start. So with `-cpu 1` (which is effectively what the Windows tests do) there is one CPU that can execute one goroutine, and that CPU is already busy executing the code that initializes the VUs and calls `addVU` (which starts a goroutine with the `go` statement). It then goes on to say "begin an iteration", and at that point nothing guarantees that the executing goroutine has yielded between calling `addVU` and this moment. None of the `addVU` goroutines would have had a CPU to run on at all, so the `select` with `default` just goes to the `default` case. This is also not guaranteed if we sleep for an hour or something like that (although I would expect it to happen ;) ).
In the same way, currently nothing guarantees that after calling `close(ch)` the goroutine will actually get to `range p.iterations` without yielding - it just makes it a lot more likely. (To be honest, as far as I know `close(ch)` should not yield, so we should always get to `range p.iterations`, although I have no proof, so 🤷. IMO it makes more sense to execute until you need to block instead of yielding whenever.) So much more likely that I haven't had a failure of this kind at all after these changes, and even parallelizing the tests has been more stable (having run them 5 times).
On the other hand, the change to not start a goroutine on each iteration makes it less likely that any of those goroutine starts will be a bit slow (because of GC or just CPU contention) and fail to execute an iteration within 15ms of the time it needs to, which was the original issue in #1715, as opposed to the one currently being fixed, where we just miss the first iteration.
While I think #1986 will make the test even less likely to fail, it will not fix the underlying problem that this patch tries to fix in a production call, as there will still be nothing to guarantee that the goroutine will be waiting on the channel to receive the "start an iteration" event. But it will mask it ;). Also, to be honest, given the current congestion on `time.` calls that I see in every profile I make, I expect that adding an interface call in the middle will have... detrimental outcomes for performance, but that will need to be checked whenever someone finds the time to rewrite the code to use it.
LGTM 🤞