Clustering and distributed execution #140
This is something that isn't top priority at the moment, but it's going to take a lot of design work, so I'd like to get the ball rolling on the actual planning.
My current idea is to have a cluster of k6 instances coordinated through etcd.
Each instance registers itself, along with how many VUs it can handle, exposes an API to talk to the cluster, and triggers a leader election. The information registered in etcd might have a structure like:
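The example structure didn't survive into this thread; as a purely hypothetical illustration (field names are mine, not from the proposal), each node's entry under a key like `/k6/nodes/<id>` could be a JSON blob along these lines:

```go
// Hypothetical registry entry, stored as JSON under /k6/nodes/<id>.
// Field names are illustrative, not from the original proposal.
type NodeInfo struct {
	ID     string `json:"id"`      // unique node identifier
	Addr   string `json:"addr"`    // address of the node's cluster API
	MaxVUs int    `json:"max_vus"` // how many VUs this node says it can handle
	VUs    int    `json:"vus"`     // VUs currently assigned by the leader
}
```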
Running a test on the cluster is then a matter of calling the cluster API on any node.
When test data is loaded, each node instantiates an engine and its maximum number of VUs right away, and watches its own registry entry for changes. The elected leader then takes care of patching the other nodes' entries to distribute VUs evenly across the cluster.
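As a rough sketch of that watch loop, assuming the hypothetical `NodeInfo` entry above, etcd's Go client (`go.etcd.io/etcd/clientv3`), and some `setVUs` hook on the local engine:

```go
// watchSelf scales the local engine whenever the leader patches this node's
// registry entry. Sketch only: cli, nodeID and setVUs are assumed to exist.
func watchSelf(cli *clientv3.Client, nodeID string, setVUs func(int)) {
	for resp := range cli.Watch(context.Background(), "/k6/nodes/"+nodeID) {
		for _, ev := range resp.Events {
			var info NodeInfo
			if err := json.Unmarshal(ev.Kv.Value, &info); err != nil {
				continue // ignore malformed registry entries
			}
			setVUs(info.VUs) // scale the local engine up or down
		}
	}
}
```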
WRT #202 / request for comments:
I don't have direct experience with etcd, so I don't have an opinion there, but the notion of using existing, proven software is clearly sound, so I'd welcome that.
My initial thought is: how would someone know how many VUs a machine can handle? If there's an automated estimation based on system resources (though obviously, e.g., a CPU core on AWS is definitely not equal to a CPU core on dedicated hardware), that would at least provide consistency, which would be good. I'm aware, though, that it'd be quite easy to overload a load generator with work and thus skew its output, as it would lack the system resources to measure accurately.

However it's implemented, I would imagine it's going to be necessary to allow users to set/amend the VU capacity, and a good user guide would help a lot, i.e. defining a way in which users can calculate/estimate it. That might be the best way to start, actually: keeping it simple and iterating/adding from there.
Honestly, your best shot is probably trial and error. The limiting factor typically isn't CPU power, but rather local socket usage and, to a lesser extent, RAM usage, both of which vary somewhat between scripts. A good start would be to just split your desired number of VUs across as many hosts as you want and see if it flies.
Linux has a maximum of ~64K ephemeral ports (and the default range is considerably smaller), so that's effectively your connection limit per source IP. I'd also tune the kernel's TIME_WAIT handling, etc.
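For reference, the knobs being alluded to live in sysctl; something like the following (values are illustrative, not a recommendation) widens the ephemeral range and relaxes TIME_WAIT:

```
# /etc/sysctl.d/99-loadgen.conf -- illustrative values only
# Widen the ephemeral port range (modern kernels default to 32768-60999):
net.ipv4.ip_local_port_range = 1024 65535
# Allow reusing sockets in TIME_WAIT for new outbound connections:
net.ipv4.tcp_tw_reuse = 1
# Shorten the FIN timeout so sockets are reclaimed sooner:
net.ipv4.tcp_fin_timeout = 30
```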
Thanks for this, I am very interested!
I was wondering if there would be a way to avoid adding a new service to the pool.
The need for extra shared metadata is going to be there, because you would probably want to direct the load test output to a single InfluxDB instance, so maybe we can save this kind of metadata directly there?
I know that this would force folks to stick with InfluxDB, but if we used etcd, people would in any case need to custom-tailor something to collect and aggregate results.
@arichiardi I think it's more important that we look at how we can best implement this, using all available tools, rather than looking at how to minimise dependencies right off the bat. I'm not saying we should introduce dependencies for the sake of it, but we want this done right.
The current requirements for this to be implemented are as follows:
Prerequisite: Leader assignment
Most of the requirements below share one prerequisite: we need a central point to make all decisions from. The leader doesn't need a lot of processing power; it just needs to keep an eye on things, so to speak.
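etcd (proposed above as the coordination store) ships a recipe for exactly this; a minimal sketch using its Go client's `concurrency` package, with made-up endpoints and key names:

```go
package main

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session keeps a lease alive; if the leader dies, its candidacy
	// expires and another node wins the election automatically.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	election := concurrency.NewElection(sess, "/k6/leader")
	// Campaign blocks until this node becomes the leader.
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("this node is now the leader")
}
```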
Spreading VUs across instances.
The algorithm for this could simply spread VUs evenly across all available instances, respecting their caps. We could possibly do some weighting; e.g. between an instance with a max of 1000 VUs and one with a max of 2000 VUs, the latter could get twice as many VUs allocated to it.
Possible implementations I can see:
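One hypothetical shape for it (a sketch, not k6's actual allocator): weight each instance by its declared cap and hand out the rounding leftovers afterwards.

```go
// allocate splits total VUs across instances proportionally to their caps,
// never exceeding any cap. Hypothetical sketch, not k6's actual allocator.
func allocate(total int, caps []int) []int {
	capSum := 0
	for _, c := range caps {
		capSum += c
	}
	if capSum == 0 {
		return make([]int, len(caps)) // no capacity registered at all
	}
	if capSum < total {
		total = capSum // the cluster can't run more VUs than its combined cap
	}
	out := make([]int, len(caps))
	assigned := 0
	for i, c := range caps {
		out[i] = total * c / capSum // floor of the proportional share
		assigned += out[i]
	}
	// Hand out the rounding leftovers to instances with spare capacity.
	for i := 0; assigned < total; i = (i + 1) % len(caps) {
		if out[i] < caps[i] {
			out[i]++
			assigned++
		}
	}
	return out
}
```

With the 1000/2000 example above, `allocate(300, []int{1000, 2000})` returns `[100, 200]`, i.e. the 2x weighting.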
Central execution of thresholds, from a data source.
This would be fairly simple using something like InfluxDB; we can parse threshold snippets for the variables they refer to (there's some code for that already), then query them out of the database, using the test's starting timestamp as a delimiter.
We could do something with shipping samples back to the master, but that feels... a little silly.
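To make the query side concrete, here's a sketch using the official InfluxDB Go client (`github.com/influxdata/influxdb/client/v2`); the database name and measurement are made up:

```go
package main

import (
	"fmt"
	"log"
	"time"

	client "github.com/influxdata/influxdb/client/v2"
)

// queryMetric pulls a threshold variable's samples recorded since the test
// started, using the starting timestamp as the lower bound of the query.
func queryMetric(c client.Client, metric string, testStart time.Time) (*client.Response, error) {
	cmd := fmt.Sprintf("SELECT mean(%q) FROM %q WHERE time >= %d",
		"value", metric, testStart.UnixNano())
	return c.Query(client.NewQuery(cmd, "k6", "ns"))
}

func main() {
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://localhost:8086"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	resp, err := queryMetric(c, "http_req_duration", time.Now().Add(-5*time.Minute))
	if err != nil {
		log.Fatal(err)
	}
	if resp.Error() != nil {
		log.Fatal(resp.Error())
	}
	fmt.Println(resp.Results)
}
```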
Distributed rate limiting.
Distributed data storage.
We need to be able to store two kinds of things:
This can be anything that can store keys and values of arbitrary size.
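In interface terms, that boils down to something like this hypothetical sketch (etcd, a database, or even a shared filesystem could sit behind it):

```go
// ClusterStore is the minimal contract implied above: arbitrary keys mapped
// to values of arbitrary size, plus change notification for coordination.
// Hypothetical sketch, not an existing k6 interface.
type ClusterStore interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, error)
	// Watch reports changes under a key prefix, e.g. for VU rebalancing.
	Watch(prefix string) (<-chan KeyValue, error)
}

type KeyValue struct {
	Key   string
	Value []byte
}
```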
Have you considered implementing something similar to what was done for Locust?
They have a master/slave architecture where synchronization happens via ZMQ (TCP), which is lightweight enough.
One advantage, in this case, is that there's no need to introduce a hard dependency: the master/slave synchronization can be implemented via ZMQ, HTTP, or whatever network protocol you might consider.
IMHO, the only disadvantage of the Locust implementation is that it's a stateful system, where the master must always be started first and can't recover if a slave disappears and then comes back.
I would rather see a stateless system that can handle connectivity issues gracefully.
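For what it's worth, the stateless behavior mostly comes down to periodic re-registration: if every worker announces itself on a timer, the master can restart and rebuild its view, and workers can drop off and rejoin freely. A minimal sketch over plain HTTP (the endpoint and payload are made up for illustration; ZMQ would work the same way):

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// heartbeat re-registers this worker with the coordinator every few seconds.
// If the coordinator restarts, it rebuilds its worker list from heartbeats
// alone; if this worker dies, the coordinator simply ages its entry out.
func heartbeat(coordinatorURL, workerID string, maxVUs int) {
	payload, _ := json.Marshal(map[string]interface{}{
		"id":      workerID,
		"max_vus": maxVUs,
	})
	for range time.Tick(5 * time.Second) {
		resp, err := http.Post(coordinatorURL+"/register", "application/json",
			bytes.NewReader(payload))
		if err != nil {
			log.Printf("coordinator unreachable, will retry: %v", err)
			continue // no fatal state; just try again on the next tick
		}
		resp.Body.Close()
	}
}

func main() {
	heartbeat("http://coordinator:6565", "worker-1", 1000)
}
```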
@coderlifter, thanks. We haven't finalized the k6 distributed execution design yet, so we'll definitely consider this approach when we get to it. We'll post a final design/RFC here when we start implementing this, so it can be discussed by any interested people.