This repository has been archived by the owner on Apr 2, 2024. It is now read-only.

general remarks about Loads architecture / next steps #262

Open
tarekziade opened this issue Apr 30, 2014 · 8 comments

Comments

@tarekziade
Contributor

brain dump -- would love some feedback @ametaireau @Natim @jbonacci @rfk @bbangert

So, after a few months of work, I've realized it takes a lot of effort to maintain a consistent cluster where we have agents spread across several boxes and a broker linked to them.

The main issue is that once some load tests are running, all the results are sent in real time to the broker via a chain of zeromq publisher sockets.

That leads to two problems:

  1. the broker becomes a bottleneck when it's bombarded with results. Even though I slimmed down the size of those results, it's still extra load
  2. when the network partitions, we lose the ability to know what's going on, and it's very hard to build a system that gets back to normal 100% of the time. We're almost there (thanks to 0mq queues), but there are so many edge cases of possible breakage depending on when the network partition happens.

I think a much more robust system would be to drop the PUB/SUB system for results and use a shared database. We'd let the database system deal with all the network partitioning issues, and the broker would simply drive the agents to run the tests.

In case the broker can't reach an agent, well, the agent is on its own - working on the test and reporting back to the DB. Our web dashboard can then just run DB queries to display results - like it does now, but without asking the broker anymore (right now the broker provides APIs to query/fill the DB and to run tests).

That would separate the concerns:

  • the broker is just there to reach out to agents and hand them work, without worrying about the output anymore
  • the agents would work autonomously once they get a job to do, and just report to the DB. Once it's over, they can tell the broker they are available again
  • the web dashboard can interact with the DB to display live results
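As a rough sketch of that split, here is what agent-side reporting could look like, using an in-memory sqlite database as a stand-in for the shared store (the table layout, column names, and the `report_result` helper are all assumptions for illustration, not the actual Loads schema):

```python
import sqlite3
import uuid

# Stand-in for the shared results database (sqlite here; the real
# system would use a networked store, e.g. DynamoDB).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE results (
    run_id TEXT, agent_id TEXT, status TEXT, elapsed REAL)""")

def report_result(run_id, agent_id, status, elapsed):
    """Hypothetical agent-side reporting: write one result row
    straight to the DB, bypassing the broker entirely."""
    db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
               (run_id, agent_id, status, elapsed))
    db.commit()

# Each test run already has a UUID, so agents just insert under it.
run_id = str(uuid.uuid4())
report_result(run_id, "agent-1", "success", 0.042)

# The dashboard then queries the DB directly instead of the broker.
rows = db.execute("SELECT status, COUNT(*) FROM results "
                  "WHERE run_id = ? GROUP BY status", (run_id,)).fetchall()
print(rows)  # → [('success', 1)]
```

The broker never sees a single result; it only learns when an agent is free again.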

I am not sure what database system we want yet. Step 1 could be to extract everything related to the DB from the broker and run it as its own process - then change the agents so they interact with that process when results are to be published.

@rfk
Contributor

rfk commented May 1, 2014

This sounds like a good architecture to me - as long as the db can keep up! I guess you already have a UUID for each test run so conceptually, it's just having the agents insert individual results under this key. Would you still use zmq to push results into the db process?

@tarekziade
Contributor Author

Yes, each result is unique, so we won't have any conflicts. I guess DynamoDB could work there.

> Would you still use zmq to push results into the db process?

I would keep zmq for all the client/broker/agent communication, but would use a plain TCP client to send the data to the DB - see #263 for the new results publication flow
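A minimal sketch of what such a plain-TCP publisher could look like, assuming a hypothetical DB-writer process that accepts newline-delimited JSON (the function names, port, and wire format are assumptions here, not the flow from #263):

```python
import json
import socket
import threading

def publish_results(results, host, port):
    """Hypothetical agent-side publisher: stream results to the
    DB-writer process over an ordinary socket, no zmq involved."""
    with socket.create_connection((host, port)) as sock:
        for result in results:
            sock.sendall((json.dumps(result) + "\n").encode("utf-8"))

# Tiny stand-in for the DB-writer process, just to exercise the client.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # pick any free port
server.listen(1)
received = []

def db_writer():
    conn, _ = server.accept()
    with conn, conn.makefile("r") as lines:
        for line in lines:
            received.append(json.loads(line))  # would be a DB insert

t = threading.Thread(target=db_writer)
t.start()
publish_results([{"run_id": "abc", "status": "success", "elapsed": 0.042}],
                "127.0.0.1", server.getsockname()[1])
t.join()
print(received)
```

Keeping this channel dumb (one JSON line per result) means any language's load program or agent can speak it.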

@almet
Contributor

almet commented May 12, 2014

The reasoning sounds good to me. In particular, I think one key insight here is that we don't really need to store a lot of duplicated data.

For instance, we could just store an incrementing count of successes, plus the different errors, maybe recording when the first error occurred and when the last one did.

In a discussion we had with Tarek, I think I understood the goal was to have the data aggregated by the "test" program itself before sending it to stdout. Now that I think of it, I would do this aggregation in the agent code rather than in the test program, in order to stay with a really simple test program protocol.

Otherwise, looks good to me!

@tarekziade
Contributor Author

> Now that I think of it, I would do this aggregation in the agent code rather than in the test program, in order to stay with a really simple test program protocol.

The problem here is that for very intensive load testing you will probably bust the stdin pipe, because the pipe's buffer is limited in size and it only drains as fast as the agent dequeues data to send it to the database. Once the buffer fills up, everything blocks and we're in trouble.

If it's well documented I don't think it's that hard imho

@almet
Contributor

almet commented May 12, 2014

Oh, that's right, it's not ultra complicated, it's just some complexity that I think should be avoided if possible: less to do for the implementers means more implementations!

Isn't there any way to tweak the max size of the pipe (can't we use all the free RAM)?
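For what it's worth, Linux does let you grow a pipe's kernel buffer via `fcntl` with `F_SETPIPE_SZ`, but only up to `/proc/sys/fs/pipe-max-size` - so it buys headroom, not unbounded RAM. A Linux-only sketch (the named constants exist in Python's `fcntl` module from 3.10; the raw Linux values are used as a fallback):

```python
import fcntl
import os

# F_SETPIPE_SZ / F_GETPIPE_SZ are Linux-specific fcntl commands.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

r, w = os.pipe()
fcntl.fcntl(w, F_SETPIPE_SZ, 1 << 18)   # ask for a 256 KiB buffer
size = fcntl.fcntl(w, F_GETPIPE_SZ)     # what the kernel actually gave us
print(size)

# The kernel caps this at /proc/sys/fs/pipe-max-size, so a slow reader
# will still block the writer eventually.
```

So a bigger pipe delays the stall; it doesn't remove it.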

@tarekziade
Contributor Author

tl;dr: if the agent can't keep up the pace, we're asking for trouble, because we may run day-long tests.

Let's imagine a Go program that sends several thousand results per second for 24 hours. Even if we use the RAM, if the Python agent can't keep up the pace, the queue will grow and eventually eat all the RAM. We will also be unable to provide almost-live feedback on the test, and every report will start to lag like hell.

And the other problem is that we will end up spending CPU on the agent's queue work instead of leaving as much CPU as possible to the load-generating program.

Asking the program to aggregate per second is "free".
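To illustrate the point: a per-second aggregator inside the test program collapses thousands of result lines into one line per second before they ever hit stdout. A sketch (the `SecondAggregator` class and its JSON line format are hypothetical, not the Loads protocol):

```python
import io
import json
import time
from collections import Counter

class SecondAggregator:
    """Hypothetical per-second aggregation for the test program: buffer
    counters in memory and flush one JSON line per elapsed second,
    instead of one line per request."""

    def __init__(self, out):
        self.out = out
        self.counts = Counter()
        self.window = int(time.time())

    def record(self, status):
        now = int(time.time())
        if now != self.window:
            self.flush()            # emit the finished one-second window
            self.window = now
        self.counts[status] += 1

    def flush(self):
        if self.counts:
            line = json.dumps({"ts": self.window,
                               "counts": dict(self.counts)})
            self.out.write(line + "\n")
            self.counts.clear()

# Simulate 10,000 requests; in the real test program `out` would be
# sys.stdout, read line by line by the agent.
out = io.StringIO()
agg = SecondAggregator(out)
for _ in range(10_000):
    agg.record("success")
agg.flush()
lines = out.getvalue().splitlines()
print(len(lines))  # a handful of lines at most, not 10,000
```

The aggregation cost is a couple of dict increments per request, which is why doing it in the test program is essentially free compared to shipping raw results through the pipe.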

@almet
Contributor

almet commented May 12, 2014

Gotcha. That works for me.
