Arrival rate fixes #2038

Merged
merged 3 commits into master from arrivalRateFixes on May 25, 2021

Conversation

mstoykov
Collaborator

No description provided.

@mstoykov mstoykov added this to the v0.33.0 milestone May 21, 2021
@mstoykov mstoykov requested review from imiric and na-- May 21, 2021 13:16

codecov-commenter commented May 21, 2021

Codecov Report

Merging #2038 (f5dee9e) into master (5fe5e5a) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head f5dee9e differs from pull request most recent head 5df0e6b. Consider uploading reports for the commit 5df0e6b to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2038      +/-   ##
==========================================
- Coverage   71.75%   71.73%   -0.03%     
==========================================
  Files         179      179              
  Lines       14104    14100       -4     
==========================================
- Hits        10121    10115       -6     
- Misses       3345     3347       +2     
  Partials      638      638              
Flag       Coverage Δ
ubuntu     71.66% <100.00%> (-0.04%) ⬇️
windows    71.37% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                           Coverage Δ
lib/executor/constant_arrival_rate.go    96.25% <100.00%> (-0.16%) ⬇️
lib/executor/ramping_arrival_rate.go     96.15% <100.00%> (+0.05%) ⬆️
output/cloud/output.go                   81.15% <0.00%> (-0.58%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5fe5e5a...5df0e6b. Read the comment docs.

This helps with the fact that there is currently no guarantee that, just
because we have added 100 VUs, any of them has actually started waiting
for an iteration.

This likely has some downside in creating more channels when adding VUs
back and forth, but it should be minimal compared to the amount of work
needed to call a JS function.

Additionally, for high-rate runs this could have meant that not all VUs
were available, leading to k6 not actually being able to run iterations
at the given rate, even though it would have been able to if all the VUs
had actually started.
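
As a rough illustration of the pattern described above, here is a minimal, self-contained sketch with made-up names (addWorker, runIteration, iterations); it is not the actual k6 code. Each worker goroutine closes a started channel before it begins waiting for iterations, and the adding side blocks on that channel, so a VU only counts as added once its goroutine is actually running.

package main

import "sync"

// runIteration is a placeholder for activating a VU and running one
// iteration of the JS function.
func runIteration() {}

// addWorker is a hypothetical stand-in for the executor's addVU helper:
// it starts a worker goroutine and only returns once that goroutine has
// signalled, by closing started, that it is actually running.
func addWorker(iterations <-chan struct{}, wg *sync.WaitGroup) {
    wg.Add(1)
    started := make(chan struct{})
    go func() {
        defer wg.Done()
        close(started) // signal that this goroutine is now running
        for range iterations {
            runIteration()
        }
    }()
    <-started // block until the worker goroutine has actually started
}

func main() {
    iterations := make(chan struct{})
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        addWorker(iterations, &wg)
    }
    for i := 0; i < 10; i++ {
        iterations <- struct{}{} // dispatch an iteration to any free worker
    }
    close(iterations)
    wg.Wait()
}
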
@imiric imiric (Contributor) left a comment

LGTM, nicely done!

    for range p.iterations {
        atomic.AddUint64(&p.running, uint64(1))
        runfn(ctx, avu)
        // adding ^uint64(0) (max uint64) wraps around, i.e. it decrements p.running
        atomic.AddUint64(&p.running, ^uint64(0))
    }
}()
<-ch // block until the new goroutine has closed ch, i.e. until it has actually started
Member

I am confused by this 😕 I read your explanation in 5df0e6b, but if this extra bit of waiting helps all that much, then we probably have a bigger conceptual problem, since this is not very different from sleep(time.Millisecond). Can you point me to what exactly failed when this wasn't present?

Collaborator Author

This is basically waiting for the goroutine to have started; only then does it report that it has added a VU.

What fails are the usual TestConstantArrivalRateRunCorrectTiming tests, as shown in https://github.com/k6io/k6/runs/2640090549?check_suite_focus=true, which at least in my testing happens because we:

  1. add all the VUs
  2. try to start an iteration
  3. none of the VUs has yet reached the line for range p.iterations, as its goroutine probably hasn't even been initialized yet.

This is particularly easy to reproduce with db89d213 and go test -count 1 -cpu 1 -race -run TestConstantArrivalRateRunCorrectTiming, specifically because of the -cpu 1; with -cpu 8 I can't reproduce it in practice, probably because there is always a free slot for that goroutine to start immediately, or at least fast enough.

This is because we no longer have a "semaphore" (using the channel activeVUs) but instead have workers. The semaphore was initialized synchronously, while the workers are asynchronous for the most part: we just say "start this goroutine", and with this change we at least wait for the "start the goroutine" part instead of hoping that Go will schedule it fast enough. Which is totally different from what time.Sleep(time....) would do.

Member

My point was that if the CPU contention was so heavy, it's possible that a goroutine wouldn't be allowed to run for a long while, even after it was started, so it might not get to run the for loop after it closes the channel 🤷‍♂️

I get how this is different from a Sleep, but as far as I understand, it's not actually locking anything that raced before... So if it failed before, this doesn't guarantee it will work afterwards? I guess it would just greatly reduce the chances of a test failing, which might be good enough, though a better solution would be to improve the test, with #1986 or something else...

So yeah, I don't mind this code, I just didn't understand what it solved.

Collaborator Author

The go statement explicitly doesn't wait for the goroutine to start. So with -cpu 1 (which is effectively what the Windows tests do) there is one CPU that can execute one goroutine, and it is already executing the code that initializes the VUs and calls addVU (which starts a goroutine with the go statement). It then goes on to "begin an iteration", and nothing guarantees that the executing goroutine has yielded between calling addVU and that moment, so none of the addVU goroutines may have had a CPU to run on at all, at which point the select with default just takes the default case. This is not guaranteed even if we sleep for an hour or something like that (although I would expect it to happen ;) )

In the same way, nothing currently guarantees that after calling close(ch) the goroutine will actually get to range p.iterations without yielding; it just makes it a lot more likely. (To be honest, as far as I know close(ch) should not yield, so we should always get to range p.iterations, although I have no proof, so 🤷; IMO it makes more sense to execute until you need to block instead of yielding whenever.) So much more likely that I haven't had a failure of this kind at all after these changes, and even parallelizing the tests has been more stable (having run them 5 times).

On the other hand, the change to not start a goroutine on each iteration makes it less likely that one of those goroutine starts will be a bit slow (because of GC or just CPU contention) and will not manage to execute an iteration within 15ms of the time it needs to, which was the original issue in #1715, as opposed to the one currently being fixed, where we just miss the first iteration.

While I think #1986 will make this even less likely to fail in the test, it will not fix the underlying problem that this patch tries to fix for a production run, as there will still be nothing to guarantee that the goroutine is waiting on the channel to receive the "start an iteration" event. But it will mask it ;). Also, to be honest, given the congestion on time. calls that I see in every profile I make, I expect that adding an interface call in the middle will have ... detrimental outcomes for performance, but that will need to be checked whenever someone finds the time to rewrite the code to use it.
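
To make the scheduling argument above concrete, here is a small hypothetical standalone program (not k6 code; the work channel and the messages are made up) that mimics the -cpu 1 situation: the worker goroutine is started with a go statement, but the non-blocking select runs before the scheduler has given that goroutine a chance to reach its channel receive, so the default case is usually taken.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(1) // roughly what running the tests with -cpu 1 does

    work := make(chan struct{})

    // "addVU": the go statement only schedules the worker goroutine,
    // it does not wait for it to start running.
    go func() {
        for range work {
        }
    }()

    // "start an iteration" with a non-blocking send, similar to the
    // executor's select with a default case.
    select {
    case work <- struct{}{}:
        fmt.Println("iteration dispatched")
    default:
        // With a single P and no yield in between, the worker usually
        // hasn't started running yet, so this branch is taken.
        fmt.Println("no worker ready, first iteration missed")
    }
    close(work)
}

Adding the started-channel handshake changes this: blocking on <-started forces the main goroutine to yield, which on a single P lets the worker run up to its channel receive before the select executes, so the dispatch usually succeeds. As noted in the comment above, this is scheduler behaviour rather than a language-level guarantee.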

@na-- na-- (Member) left a comment

LGTM 🤞

@mstoykov mstoykov merged commit 632774a into master May 25, 2021
@mstoykov mstoykov deleted the arrivalRateFixes branch May 25, 2021 11:10