
Move User selection responsibility from worker to master in order to fix unbalanced distribution of users and uneven ramp-up #1621

Merged
149 commits merged into locustio:master on Jul 5, 2021

Conversation

@mboutet (Contributor) commented Nov 8, 2020

Fixes #1618
Fixes #896
Fixes #1635

There are quite a lot of changes in this PR, but they were required to solve these issues.

Changes

  • New logic for computing the distribution of users (locust/distribution.py).
  • New logic for dispatching the users (locust/dispatch.py). The master is now responsible for computing the distribution of users and then dispatching a portion of this distribution to each worker. The spawn rate is also respected in distributed mode.
  • Possible breaking change: the spawning_complete event now receives user_class_occurrences instead of user_count.
  • Behaviour change: users are no longer stopped at a rate of spawn_rate. See the docstring of locust.dispatch.dispatch_users for the rationale behind this decision. The correct way to ramp down is to use the load test shape feature.
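Since ramp-down no longer follows spawn_rate, a gradual ramp-down has to be expressed as a load test shape. A minimal sketch of the tick logic (written as a plain function with illustrative stage values; in a real locustfile you would put this logic in a LoadTestShape.tick() override):

```python
# Stage table: (end_time_s, target_users, spawn_rate). Values are illustrative.
STAGES = [
    (120, 20, 1),  # until t=120 s: hold 20 users
    (180, 10, 1),  # until t=180 s: ramp down to 10 users
    (240, 0, 1),   # until t=240 s: ramp down to 0 users
]

def tick(run_time, stages=STAGES):
    """Same contract as Locust's LoadTestShape.tick(): return a
    (user_count, spawn_rate) tuple for the current run time, or None
    to stop the test."""
    for end_time, users, rate in stages:
        if run_time < end_time:
            return users, rate
    return None
```

The staircase shape comes entirely from the stage table, so the ramp-down profile is explicit and reproducible rather than a side effect of spawn_rate.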

Demo

Example of a real distributed load test shape with 5 workers and a stop timeout of 30s:
(Web UI chart screenshots omitted.)

Load test shape profile for this run:

| duration | users | spawn rate |
| --- | --- | --- |
| 2m | 5 | 1 |
| 2m | 10 | 1 |
| 2m | 15 | 1 |
| 2m | 20 | 1 |
| 5m | 25 | 1 |
| 2m | 20 | 1 |
| 2m | 15 | 1 |
| 2m | 10 | 1 |
| 2m | 1 | 1 |
| 5m | 25 | 5 |
| 2m | 1 | 1 |
| 5m | 25 | 5 |
| 2m | 1 | 1 |
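For reference, the profile above in code form: a small helper (mine, not from the PR) that converts the (duration, users, spawn rate) rows into the cumulative (end_time, users, spawn_rate) schedule that a LoadTestShape.tick() implementation would typically check against:

```python
# The 13 stages of the run above, as (duration, users, spawn_rate) rows.
PROFILE = [
    ("2m", 5, 1), ("2m", 10, 1), ("2m", 15, 1), ("2m", 20, 1),
    ("5m", 25, 1), ("2m", 20, 1), ("2m", 15, 1), ("2m", 10, 1),
    ("2m", 1, 1), ("5m", 25, 5), ("2m", 1, 1), ("5m", 25, 5), ("2m", 1, 1),
]

def to_schedule(profile):
    """Convert duration-based rows into cumulative end times in seconds."""
    schedule, t = [], 0
    for duration, users, rate in profile:
        t += int(duration[:-1]) * 60  # "2m" -> 120 seconds
        schedule.append((t, users, rate))
    return schedule
```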

The ramp-up of each stage takes around 4-5 s, which is consistent with the spawn rate of 1. Also, we can see that the users are gracefully stopped (or killed after the stop timeout of 30 s). That's why the ramp-downs show small staircases instead of a single drop.

Other

I hope I did not break anything. All the tests are passing, but while running some real load tests (such as the one above), it seems like the users are not always properly stopped. I added an integration test in test_runners.TestMasterWorkerRunners.test_distributed_shape_with_stop_timeout and I can't seem to reproduce the issue there, so I don't think this problem is caused by my changes. Let me know if you think it might be. Otherwise, I will open a new issue with some ideas that might help with this.

@mboutet mboutet changed the title Simplify and fix distribution of spawned users Better distribution of users and fix distributed hatch rate Nov 13, 2020
@codecov bot commented Nov 17, 2020

Codecov Report

Merging #1621 (a2d74e9) into master (487098e) will decrease coverage by 0.11%.
The diff coverage is 90.93%.

❗ Current head a2d74e9 differs from pull request most recent head 7d67e23. Consider uploading reports for the commit 7d67e23 to get more accurate results
Impacted file tree graph

```
@@            Coverage Diff             @@
##           master    #1621      +/-   ##
==========================================
- Coverage   80.58%   80.46%   -0.12%
==========================================
  Files          30       32       +2
  Lines        2683     2851     +168
  Branches      412      462      +50
==========================================
+ Hits         2162     2294     +132
- Misses        427      455      +28
- Partials       94      102       +8
```
| Impacted Files | Coverage Δ |
| --- | --- |
| locust/event.py | 95.12% <ø> (+2.43%) ⬆️ |
| locust/main.py | 20.25% <ø> (ø) |
| locust/runners.py | 81.43% <87.87%> (-3.15%) ⬇️ |
| locust/dispatch.py | 92.30% <92.30%> (ø) |
| locust/distribution.py | 100.00% <100.00%> (ø) |
| locust/env.py | 96.92% <100.00%> (+0.20%) ⬆️ |
| locust/input_events.py | 26.66% <100.00%> (+0.99%) ⬆️ |
| locust/stats.py | 88.82% <100.00%> (ø) |
| locust/user/users.py | 97.40% <100.00%> (+1.62%) ⬆️ |
| locust/user/task.py | 94.14% <0.00%> (-2.13%) ⬇️ |
| ... and 9 more | |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55525a5...7d67e23.

(Resolved review threads on locust/env.py and locust/input_events.py.)
@mboutet mboutet marked this pull request as ready for review November 17, 2020 23:45
@mboutet mboutet requested a review from cyberw November 17, 2020 23:45
@mboutet (Contributor, Author) commented Nov 18, 2020

One of the tests is flaky; I'll look into it tomorrow.

@cyberw (Collaborator) commented Nov 18, 2020

Phew. That's a lot of changes and a lot of new code. Is there any way you could simplify it a little bit? I don't feel I can review it in its current state.

Does it really need to be recursive?

Is there really any benefit to using the "postfix" style for-loops? Imho, they make things harder to read.

I don't like the changes to input handling. Having two ways of handling input makes no sense. Try to stick as closely as possible to the current solution to avoid changing something that really isn't related to the distribution of users.

(Resolved review thread on locust/dispatch.py.)
@mboutet mboutet marked this pull request as draft November 18, 2020 21:59
@mboutet (Contributor, Author) commented Nov 18, 2020

  • I reworked the distribution code. It is now simpler (in my opinion) and no longer uses the recursive pattern.
  • I also restored the original input_listener and only modified the anonymous functions given in the key_to_func dictionary.

@mboutet mboutet marked this pull request as ready for review November 18, 2020 22:59
@mboutet mboutet requested a review from cyberw November 18, 2020 22:59
@mboutet mboutet marked this pull request as draft November 18, 2020 23:14
@mboutet (Contributor, Author) commented Nov 18, 2020

I need to perform more testing on my side. When I run some real scenarios with the web ui, the chart for the spawned users shows weird patterns (e.g. users decreasing before going up again).

Fixed.

@mboutet mboutet marked this pull request as ready for review November 18, 2020 23:46
@cyberw (Collaborator) commented Nov 19, 2020

build is still failing (intermittently?)

@cyberw (Collaborator) commented Nov 19, 2020

To be frank, I'm still a little skeptical of such a big change and I don't think I'll have time to review it any time soon. Maybe @heyman has some input?

@mboutet (Contributor, Author) commented Nov 19, 2020

> build is still failing (intermittently?)

The new integration test TestMasterWorkerRunners.test_distributed_shape_with_stop_timeout is still flaky. I'll make it robust so that it passes every time. It's a bit tricky with the timing and async nature of the code.

@mboutet (Contributor, Author) commented Nov 19, 2020

> To be frank, I'm still a little skeptical of such a big change and I don't think I'll have time to review it any time soon.

I don't know what to tell you; the previous implementation had flaws when running in distributed mode as well as with the load test shape, the latter misbehaving in both modes. The problems are well explained in the issues this PR is trying to solve.

The two-year-old issue #896 is labelled as low prio, but I personally think this is a bug. Say you have 100 workers and want to hatch 1000 users at a rate of 10/s: with the previous implementation, you'd get an effective rate of 1000/s ten times in a row, which does not make any sense and can screw up your load test. Now, imagine even more workers... it's clear the implementation was flawed and did not scale.
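One way to model that claim (my simplified reconstruction of the pre-PR behaviour, assuming the master split both the user count and the spawn rate evenly across workers, and that workers tick in lockstep):

```python
def old_per_worker_bursts(total_users, spawn_rate, num_workers):
    """With the old scheme, each worker gets spawn_rate/num_workers users/s.
    Below 1 user/s per worker, a worker spawns one user every
    num_workers/spawn_rate seconds -- and since every worker does so at the
    same moments, users arrive in synchronized bursts instead of smoothly.

    Returns (burst_size, seconds_between_bursts, number_of_bursts)."""
    users_per_worker = total_users // num_workers
    per_worker_rate = spawn_rate / num_workers
    seconds_between_spawns = 1 / per_worker_rate
    return num_workers, seconds_between_spawns, users_per_worker

# 100 workers, 1000 users, requested rate 10/s:
# bursts of 100 users at once, every 10 s, repeated 10 times --
# an instantaneous rate far above the requested 10 users/s.
```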

In my opinion, and after a considerable amount of thinking and studying of the runners codebase, the most reliable, and probably most precise, way to handle this is to have the master manage the state of users and their dispatch, as this PR does.

Regarding the alternative of having the master send the user count along with a delay so that the worker waits before spawning: I don't think this would scale or work very well, because of the unpredictable nature of the network, with variable latencies and response times. Furthermore, each worker might take an unpredictable amount of time to process each received spawn message (because it is busy with another task, for example). Given these aspects, it would be very difficult for the master to compute this delay accurately.

I think a good place to start is by looking at the tests in test_dispatch.py and test_distribution.py so it is easier to understand the new behaviours.

I think the most challenging changes to make sense of are in the runners module. So, I'll try to summarize the changes here:

  • Before, Runner.spawn_users() and Runner.stop_users() took a relative number of users to spawn or stop and were responsible for computing the number of users of each class to start/stop (using Runner.weight_users()). Now, Runner.spawn_users() and Runner.stop_users() are given a dictionary with exactly the number of users of each class to start or stop.
  • Before, the Runner.start() function would simply check whether stopping or spawning users was necessary and then call either Runner.spawn_users() or Runner.stop_users(). Now, Runner.start() has become the "brain": it is responsible for computing the user distribution and dispatching the users to start and stop (with a delay in between). With the new code, a single call to Runner.start() can both start and stop users depending on the user class, which explains the added complexity needed to handle starting and stopping users at the same time.
  • Before, the worker runner used Runner.start() directly. This is no longer appropriate since the worker is now "brainless": it receives the pre-computed occurrences for each user class without any spawn rate, and its job is simply to alter its state to match the desired state as quickly as possible. For that matter, a new WorkerRunner.start_worker() has been added to clearly differentiate it from the Runner.start() method, which is never used by the worker class.
  • Runner.weight_users() has been replaced by the weight_users() function in the distribution.py module. The main difference from the previous function is that when the user count is greater than or equal to the number of user classes, it is guaranteed that at least one user of each class will exist. Also, the new function is more performant and probably needs less memory because it doesn't create the user objects; it only computes the desired counts. You can see in test_distribution_large_number_of_users that even with a ridiculously large number it finishes instantly, whereas the previous one runs endlessly. I chose to factor this function out into its own module in order not to add more code to the runners.py module, which is already quite large. I also find a standalone function easier to test because it does not require setting up the Environment and WorkerRunner objects, which are unnecessary for testing the behaviour.
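To illustrate the kind of computation the new distribution code performs (this is my own largest-remainder sketch, not the actual locust/distribution.py implementation): spread a target user count across the user classes in proportion to their weights, without ever instantiating a user object:

```python
def occurrences_from_weights(weights, user_count):
    """Distribute user_count among classes proportionally to their weights,
    using largest-remainder rounding so the counts always sum to user_count."""
    total_weight = sum(weights.values())
    quotas = {name: user_count * w / total_weight for name, w in weights.items()}
    counts = {name: int(q) for name, q in quotas.items()}
    # Hand out the units lost to rounding, largest fractional part first.
    leftover = user_count - sum(counts.values())
    for name in sorted(quotas, key=lambda n: quotas[n] - counts[n], reverse=True):
        if leftover == 0:
            break
        counts[name] += 1
        leftover -= 1
    return counts
```

Because only integers and ratios are manipulated (no user objects), this kind of computation stays fast even for very large user counts, which is the property test_distribution_large_number_of_users exercises.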

Another note is that Runner.spawn_users() and Runner.stop_users() are not intended to be used outside the runner classes. The proper way to control the number of running users and spawn rate is to use Runner.start() since it will take into account the current state of the running users and issue the proper calls to Runner.spawn_users() and Runner.stop_users(). That is why the input_listener needed to be updated to no longer use Runner.spawn_users() and Runner.stop_users().

@cyberw (Collaborator) commented Nov 19, 2020

I agree that this is an area that needs fixes, it is just that I don't have time to review it. I have spent way too much time on the project in the last six months anyway :)

Low prio is just my way to mark tickets as non-critical (and while this is definitely something that we want to fix, it is not critical) (edit: I renamed "low prio" to "non-critical" to make that more clear)

If @heyman, or maybe even one of the other main contributors (doesn't have to be someone with merge permissions), does a first sweep of feedback, I can give it a last look and help with the actual clicking of the merge button :)

Maybe @max-rocket-internet is interested in this kind of feature?

@max-rocket-internet (Contributor) commented:

> Maybe @max-rocket-internet is interested in this kind of feature?

Not really. As I understand it, this PR is about the distribution of user classes, not locust users, right? In all our load tests, I think only 1 or 2 are using more than a single HttpUser. We just use the task decorator to split functionality by ratio of requests.

> The correct way to have a ramp down is to use the load test shape feature.

By this logic you could say the same about ramping up.

@mboutet (Contributor, Author) commented Nov 25, 2020

> Not really.

I see that you showed some interest in solving the distributed hatch rate issue #896, which is addressed by this PR.

> As I understand this PR is about the distribution of user classes, not locust users, right?

It also addresses the issue of fair dispatch of the users in distributed mode.

> We just use task decorator to split functionality by ratio of requests.

I think this works well for most use cases, especially when each task is more or less "equal" in terms of runtime. However, I found that when dealing with unbalanced tasks, i.e. some tasks take only a few seconds or less whereas others can take several minutes, this approach was not working very well. The problem is that, eventually, a large number of users get "stuck" on these long-running tasks even if they have a lower weight compared to the short-running tasks. By separating them into different user classes, it is possible to get much more control over how frequently each task is executed.
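A back-of-the-envelope model of that problem (the helper and numbers are mine, for illustration): with task weights alone, a task's share of a user's time is proportional to weight × duration, so a rarely picked long task still dominates:

```python
def time_share(tasks):
    """tasks: {name: (weight, mean_duration_s)}. Returns the expected
    fraction of a single user's time spent in each task when tasks are
    picked with probability proportional to their weights."""
    total = sum(w * d for w, d in tasks.values())
    return {name: w * d / total for name, (w, d) in tasks.items()}

# A task picked only 10% of the time but running 60x longer than the other
# still consumes ~87% of the user's time (60 / (9*1 + 1*60) = 60/69):
shares = time_share({"short": (9, 1.0), "long": (1, 60.0)})
```

Splitting the long-running task into its own User class with a small weight caps how many users can be stuck in it at once, which task weights alone cannot do.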

> By this logic you could say the same about ramping up.

Absolutely, and I think people should do so. As I see it, the spawn rate is mainly there as a convenience. For instance, it's useful when one only wishes to ramp up linearly to a certain steady-state number of users.

@cyberw (Collaborator) commented Jul 1, 2021

I'm liking the progress on this, and I have run a few basic tests myself again; so far everything looks stable. A few more thoughts:

  1. We still need to reduce the amount of logging somehow. Running headless is too noisy now. Moving the "Updating running test with x users" message to debug level might be a start, but I'm not sure what would be the best approach. Maybe moving both messages to debug makes sense.
```
[2021-07-01 16:17:42,502] lars-mbp.local/INFO/locust.main: No run time limit set, use CTRL+C to interrupt.
[2021-07-01 16:17:42,503] lars-mbp.local/INFO/locust.main: Starting Locust 1.6.0.2
Name                                                          # reqs      # fails  |     Avg     Min     Max  Median  |   req/s failures/s
--------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------
Aggregated                                                         0     0(0.00%)  |       0       0       0       0  |    0.00    0.00

[2021-07-01 16:17:42,505] lars-mbp.local/INFO/locust.runners: Updating running test with 1 users
[2021-07-01 16:17:42,505] lars-mbp.local/INFO/locust.runners: Spawning additional 1 (0 already running)...
[2021-07-01 16:17:43,506] lars-mbp.local/INFO/locust.runners: Updating running test with 2 users
[2021-07-01 16:17:43,506] lars-mbp.local/INFO/locust.runners: Spawning additional 1 (1 already running)...
Name                                                          # reqs      # fails  |     Avg     Min     Max  Median  |   req/s failures/s
--------------------------------------------------------------------------------------------------------------------------------------------
GET 1                                                              1     0(0.00%)  |     227     227     227     227  |    0.00    0.00
GET 2                                                              1     0(0.00%)  |     194     194     194     194  |    0.00    0.00
--------------------------------------------------------------------------------------------------------------------------------------------
Aggregated                                                         2     0(0.00%)  |     210     194     227     194  |    0.00    0.00

[2021-07-01 16:17:44,506] lars-mbp.local/INFO/locust.runners: Updating running test with 3 users
[2021-07-01 16:17:44,506] lars-mbp.local/INFO/locust.runners: Spawning additional 1 (2 already running)...
[2021-07-01 16:17:45,508] lars-mbp.local/INFO/locust.runners: Updating running test with 4 users
[2021-07-01 16:17:45,508] lars-mbp.local/INFO/locust.runners: Spawning additional 1 (3 already running)...
Name                                                          # reqs      # fails  |     Avg     Min     Max  Median  |   req/s failures/s
--------------------------------------------------------------------------------------------------------------------------------------------
GET 1                                                              3     0(0.00%)  |     295     227     335     320  |    1.00    0.00
GET 2                                                              1     0(0.00%)  |     194     194     194     194  |    0.00    0.00
--------------------------------------------------------------------------------------------------------------------------------------------
Aggregated                                                         4     0(0.00%)  |     270     194     335     230  |    1.00    0.00

[2021-07-01 16:17:46,508] lars-mbp.local/INFO/locust.runners: Updating running test with 5 users
[2021-07-01 16:17:46,508] lars-mbp.local/INFO/locust.runners: Spawning additional 1 (4 already running)...
[2021-07-01 16:17:47,509] lars-mbp.local/INFO/locust.runners: Updating running test with 6 users
[2021-07-01 16:17:47,509] lars-mbp.local/INFO/locust.runners: Spawning additional 1 (5 already running)...
...
```
  2. Should we reduce the default weight of users from 10 to 1? I think the old value (which doesn't make much sense) was set because of rounding issues or something (which the updated code seems to have no problems with). We should do this at the same time so we do all the breaking changes at once (I could do it in a separate PR though, if that is better for you).

When master sends a non-null `host` to the workers, set this `host` for the users.
Other logging related code was also refactored to be cleaner and/or be more exact.
@mboutet (Contributor, Author) commented Jul 2, 2021

> We still need to reduce the amount of logging somehow

I published locust-mboutet-1.6.0.4 in which I made a small refactor for the logging. I ran headless in both local and distributed mode and I think that the INFO logging is now more manageable and less verbose. If one wishes to see more details, DEBUG provides all the low-level logs. Let me know what you think.

> Should we reduce the default weight of users from 10 to 1?

I was not even aware the default weights were set to 10 😛 I went back in the history to when weight was added in 2011 (I hadn't realised Locust was that "old" 😮) and it seems it's been set to 10 from the start. Anyway, we could probably change this in a separate PR. Perhaps a v2 branch could be created into which all the PRs for v2 are merged, what do you think? That way, if there are fixes to be made to the master branch in the meantime, they can be done and you can publish other 1.x.y releases.

@cyberw (Collaborator) commented Jul 2, 2021

> > We still need to reduce the amount of logging somehow
>
> I published locust-mboutet-1.6.0.4 in which I made a small refactor for the logging. I ran headless in both local and distributed mode and I think that the INFO logging is now more manageable and less verbose. If one wishes to see more details, DEBUG provides all the low-level logs. Let me know what you think.

Looks good! Is there a better word than "Updating test with X users", though? Maybe "Target set to X users" or something. "Updating" (to me at least) implies that there was already some relevant state even before (which isn't really the case when running headless).

> > Should we reduce the default weight of users from 10 to 1?
>
> I was not even aware the default weights were set to 10 😛 I went back in the history to when weight was added in 2011 (I hadn't realised Locust was that "old" 😮) and it seems it's been set to 10 from the start. Anyway, we could probably change this in a separate PR. Perhaps a v2 branch could be created into which all the PRs for v2 are merged, what do you think? That way, if there are fixes to be made to the master branch in the meantime, they can be done and you can publish other 1.x.y releases.

Hmm. @heyman , do you have the reason for the 10 default weight? I can worry about changing this (worst case it might be something for 3.x :) )

I think we're good to go soon, and having a "flat" structure of only PRs to master is good for the changelog etc., so I'm leaning towards no "v2" branch (possibly one or more 2.0bX prereleases, but that's all). I will also make a 1.6.x branch just before we merge this PR, in case we need to build a release on the old code.

@mboutet (Contributor, Author) commented Jul 3, 2021

> Is there a better word than "Updating test with X users" though? Maybe "Target set to X users" or something.

I think that Updating test to %d users using a %.2f spawn rate or Updating user count to %d using a %.2f spawn rate could be better. The transitive verb Updating implies that the test will soon be updated to reach the indicated target user count, whereas Target set to X users could be interpreted as the test having already been updated to run X users. I'm not a native speaker, but that's how I interpret it.

> "Updating" (to me at least) implies that there was already some relevant state even before (which isn't really the case when running headless)

Perhaps our interpretations differ, but there is always a prior state. When this is the first spawn and the test has just started, the prior state is "0 users".

> do you have the reason for the 10 default weight?

Just my 2 cents, but for the new code, the weight doesn't matter. The weights could be as small as 0.00005, 0.007, etc. and it won't cause any issues (unless you go so small as to mess with float precision). The only breakage it will cause is for people having user classes both with and without explicit weights, in which case they will need to explicitly set weight = 10 to keep the previous behaviour.

```python
# can easily do 200/s. However, 200/s with 50 workers and 20 user classes will likely make the dispatch very
# slow because of the required computations. I (@mboutet) doubt that many Locust users are spawning
# that rapidly. If so, then they'll likely open issues on GitHub in which case I'll (@mboutet) take a look.
if spawn_rate > 100:
```
A Collaborator commented on this code:
Sounds like you are worried about a completely different issue (high load on master) than what this code was trying to warn about (high load on workers due to lots of Users/HTTP sessions being started at the same time)?

I added this warning because people (actually mostly people who had misunderstood the load model) kept filing bugs/SO issues about poor performance on workers during fast ramp-up, so I'd like to keep that. If you want to add a separate warning about high total spawn rate that is fine.

@cyberw (Collaborator) commented Jul 3, 2021

> > "Updating" (to me at least) implies that there was already some relevant state even before (which isn't really the case when running headless)
>
> Perhaps our interpretations differ, but there is always a prior state. When this is the first spawn and the test has just started, the prior state is "0 users".

Probably just a language thing. But as a person who is just starting a run (at least in my mind), the previous state is not that there are "zero users running"; it is that there is not even a test running at all (so the user count is undefined/impossible, not zero). Maybe "Ramping to X users..."?

> > do you have the reason for the 10 default weight?
>
> Just my 2 cents, but for the new code, the weight doesn't matter. The weights could be as small as 0.00005, 0.007, etc. and it won't cause any issues (unless you go so small as to mess with float precision). The only breakage it will cause is for people having user classes both with and without explicit weights, in which case they will need to explicitly set weight = 10 to keep the previous behaviour.

👍 I've added a PR and will merge it after this.

@mboutet (Contributor, Author) commented Jul 3, 2021

Ramping to %d users using a %.2f spawn rate works for me. I'll push this tomorrow along with the spawn rate warning.

@mboutet (Contributor, Author) commented Jul 4, 2021

I published locust-mboutet-1.6.0.5 with the modified log message and the reinstated worker spawn rate warning as discussed.

(Resolved review thread on locust/runners.py.)
@cyberw (Collaborator) commented Jul 4, 2021

👍 looks really good now. I’ll try to get my colleague to try this out (check the slack thread @DennisKrone ) next week and if he finds no issues I’ll merge and make a first prerelease build (probably 2.0b0). I’ll merge the other breaking changes after that and make new prerelease builds that include those.

Also, I have looked into how to make prerelease builds for each commit, so once 2.0 (release version) is done I’ll add that.

@mboutet (Contributor, Author) commented Jul 4, 2021

Great! Thanks for your valuable feedback these past weeks. And thank you as well, @domik82 and @dannyfreak, for your extensive tests.

I've uploaded locust-mboutet-1.6.0.6 containing the latest changes.

@cyberw (Collaborator) commented Jul 5, 2021

I'm ready to merge this now. There's nothing more that needs to happen first, right @mboutet ?

I have also prepared a branch that adds version checking between master/worker, which I intend to merge before 2.0 release.

@cyberw cyberw changed the title Better distribution of users and fix distributed hatch rate Move User selection responsibility from worker to master in order to fix unbalanced distribution of users and uneven ramp-up Jul 5, 2021
@cyberw (Collaborator) commented Jul 5, 2021

Bombs away!

@cyberw cyberw merged commit 04271c4 into locustio:master Jul 5, 2021
@mboutet (Contributor, Author) commented Jul 5, 2021

@domik82 & @dannyfreak, I discussed by DM with @cyberw the performance issue with the master at the kind of scale you have. I think this should be addressed in a separate PR, as I don't have a quick fix for you at the moment. 160-300 workers with 5-10 user classes and 25,000 total users is, in my opinion, next-level in terms of difficulty. It's not easy to ensure all the constraints (distribution, balanced users, round-robin, etc.) are respected while keeping a low computational footprint. I've profiled the code using line_profiler to find the hotspots and optimized a few of them, but at the end of the day, it's Python.

So, let's find a way to handle such large-scale runs in a separate PR. Hopefully, you can get familiar with the code and contribute to improving this use case.
