Arrival rate fixes #2038
Conversation
originally discussed in this comment #1957 (comment)
…tion Reusing the work done in #1957
Codecov Report
@@ Coverage Diff @@
## master #2038 +/- ##
==========================================
- Coverage 71.75% 71.73% -0.03%
==========================================
Files 179 179
Lines 14104 14100 -4
==========================================
- Hits 10121 10115 -6
- Misses 3345 3347 +2
Partials 638 638
This helps with the fact that there is currently no guarantee that, just because we have added 100 VUs, any of them have actually started and are waiting for an iteration. This likely has some downside in creating more channels when adding VUs back and forth, but that should be minimal compared to the amount of work needed to call a JS function. Additionally, for high-rate runs it was previously possible for not all VUs to be available, leading to k6 not actually being able to run iterations at the given rate, even though it would have been able to if all the VUs had actually started.
LGTM, nicely done!
	for range p.iterations {
		atomic.AddUint64(&p.running, uint64(1))
		runfn(ctx, avu)
		atomic.AddUint64(&p.running, ^uint64(0))
	}
}()
<-ch
I am confused by this 😕 I read your explanation in 5df0e6b, but if this extra bit of waiting helps all that much, then we probably have a bigger conceptual problem, since this is not very different from sleep(time.Millisecond). Can you point me to what exactly failed when this wasn't present?
This is basically waiting for the goroutine to have started; only then does it report that it has added a VU.
What fails are all the usual TestConstantArrivalRateRunCorrectTiming tests, as shown in https://github.com/k6io/k6/runs/2640090549?check_suite_focus=true, which, at least in my testing, happens because we:
- add all the VUs
- try to start an iteration
- none of the VUs has yet gotten to the line `for range p.iterations`, as Go probably hadn't even initialized the goroutine yet.
This is particularly easy to reproduce with db89d213 and `go test -count 1 -cpu 1 -race -run TestConstantArrivalRateRunCorrectTiming`, specifically the `-cpu 1` part, as with `-cpu 8` I can't reproduce it in practice - probably because there is always a free slot for that goroutine to start immediately, or at least fast enough.
This is because we no longer have a "semaphore" (using the channel `activeVUs`) but instead have workers. The semaphore was initialized synchronously, while the workers are asynchronous for the most part - we just say "start this goroutine". With this change we at least wait for the "start the goroutine" part, instead of hoping that Go will schedule it fast enough. Which is totally different from what `time.Sleep(time....)` would do.
My point was that if the CPU contention was so heavy, it's possible that a goroutine wouldn't be allowed to run for a long while, even after it was started, so it might not get to run the `for` loop after it closes the channel 🤷♂️
I get how this is different from a `Sleep`, but as far as I understand, it's not actually locking anything that raced before... So if it failed before, this doesn't guarantee it will work afterwards? I guess it would just greatly reduce the chances of a test failing, which might be good enough, though a better solution would be to improve the test, with #1986 or something else...
So yeah, I don't mind this code, I just didn't understand what it solved.
The `go` statement explicitly doesn't wait for the goroutine to start. So with `-cpu 1` (which is effectively what the Windows tests do) there is one CPU that can execute one goroutine, and that CPU is already busy executing the code that initializes the VUs and calls `addVU` (which starts a goroutine with the `go` statement). It then goes on to say "begin an iteration", and at that point nothing guarantees that the executing goroutine has yielded between calling `addVU` and this moment. None of the `addVU` goroutines would have had a CPU to run on at all, so the `select` with `default` just goes to the `default` case. This is also not guaranteed if we sleep for an hour or something like that (although I would expect it to happen ;) ).
In the same way, currently nothing guarantees that after calling `close(ch)` the goroutine will actually get to `range p.iterations` without yielding - it just makes it a lot more likely. (To be honest, as far as I know `close(ch)` should not yield, so we should always get to `range p.iterations`, although I have no proof, so 🤷. IMO it makes more sense to execute until you need to block instead of yielding whenever.) So much more likely that I haven't had a failure of this kind at all after these changes, and even parallelizing the tests has been more stable (having run them 5 times).
On the other hand, the change to not start a goroutine on each iteration makes it less likely that any of those goroutine starts will be a bit slow (because of GC or just CPU contention) and fail to execute an iteration within 15ms of the time it needs to, which was the original issue in #1715, as opposed to the one currently being fixed, where we just miss the first iteration.
While I think #1986 will make the test even less likely to fail, it will not fix the underlying problem that this patch tries to fix in a production call, as there will still be nothing to guarantee that the goroutine will be waiting on the channel to receive the "start an iteration" event. But it will mask it ;). Also, to be honest, given the current congestion on `time.` calls that I see in every profile I make, I expect that adding an interface call in the middle will have... detrimental outcomes for performance, but that will need to be checked whenever someone finds the time to rewrite the code to use it.
LGTM 🤞