This repository has been archived by the owner on Apr 2, 2024. It is now read-only.

general remarks about Loads architecture / next steps #262

Open
tarekziade opened this issue Apr 30, 2014 · 8 comments

Comments

@tarekziade
Contributor

brain dump -- would love some feedback @ametaireau @Natim @jbonacci @rfk @bbangert

So, after a few months of work, I've realized it takes a lot of effort to maintain a consistent cluster where we have agents spread across several boxes and a broker linked to them.

The main issue is that once some load tests are running, all the results are sent in real time to the broker via a chain of zeromq publisher sockets.

That leads to two problems:

  1. the broker becomes a bottleneck when it's bombarded with results. Even though I slimmed down the size of those results, it's still extra load
  2. when the network partitions, we lose the ability to know what's going on, and it's very hard to build a system that gets back to normal 100% of the time. We're almost there (thanks to 0mq queues), but there are so many edge cases of possible breakage depending on when the network partition happens.

I think a much more robust system would be to drop the PUB/SUB system for results and use a shared database. We'd let the database system deal with all the network partitioning issues, and the broker would simply drive the agents to run the tests.

In case the broker can't reach an agent, well, the agent is on its own - working on the test and reporting back to the DB. Our web dashboard can then just run DB queries to display results - like it does now, but without asking the broker anymore (right now the broker provides APIs to query/fill the DB and to run tests).

That would separate the concerns:

  • the broker is just there to reach out to agents and hand them work, without worrying about the output anymore
  • the agents would work autonomously once they get a job to do, and just report to the DB. Once it's over, they can tell the broker they are available again
  • the web dashboard can interact with the DB to display live results
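As a rough sketch of that split, here is what agent-side reporting could look like, using an in-memory sqlite database as a stand-in for the shared store (the table layout, column names, and the `report_result` helper are all assumptions for illustration, not the actual Loads schema):

```python
import sqlite3
import uuid

# Stand-in for the shared results database (sqlite here; the real
# system would use a networked store, e.g. DynamoDB).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE results (
    run_id TEXT, agent_id TEXT, status TEXT, elapsed REAL)""")

def report_result(run_id, agent_id, status, elapsed):
    """Hypothetical agent-side reporting: write one result row
    straight to the DB, bypassing the broker entirely."""
    db.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
               (run_id, agent_id, status, elapsed))
    db.commit()

# Each test run already has a UUID, so agents just insert under it.
run_id = str(uuid.uuid4())
report_result(run_id, "agent-1", "success", 0.042)

# The dashboard then queries the DB directly instead of the broker.
rows = db.execute("SELECT status, COUNT(*) FROM results "
                  "WHERE run_id = ? GROUP BY status", (run_id,)).fetchall()
print(rows)  # → [('success', 1)]
```

The broker never sees a single result; it only learns when an agent is free again.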

I am not sure what database system we want yet. Step 1 could be to extract everything related to the DB from the broker and run it as its own process - then change the agents so they interact with that process when results are to be published.

@rfk
Contributor

rfk commented May 1, 2014

This sounds like a good architecture to me - as long as the db can keep up! I guess you already have a UUID for each test run so conceptually, it's just having the agents insert individual results under this key. Would you still use zmq to push results into the db process?

@tarekziade
Contributor Author

Yes, each result is unique, so we won't have any conflicts. I guess DynamoDB could work there.

> Would you still use zmq to push results into the db process?

I would keep zmq for all the client/broker/agent communication, but would use a plain TCP client to send the data to the DB - see #263 for the new results publication flow
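A minimal sketch of what such a plain-TCP publisher could look like, assuming a hypothetical DB-writer process that accepts newline-delimited JSON (the function names, port, and wire format are assumptions here, not the flow from #263):

```python
import json
import socket
import threading

def publish_results(results, host, port):
    """Hypothetical agent-side publisher: stream results to the
    DB-writer process over an ordinary socket, no zmq involved."""
    with socket.create_connection((host, port)) as sock:
        for result in results:
            sock.sendall((json.dumps(result) + "\n").encode("utf-8"))

# Tiny stand-in for the DB-writer process, just to exercise the client.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # pick any free port
server.listen(1)
received = []

def db_writer():
    conn, _ = server.accept()
    with conn, conn.makefile("r") as lines:
        for line in lines:
            received.append(json.loads(line))  # would be a DB insert

t = threading.Thread(target=db_writer)
t.start()
publish_results([{"run_id": "abc", "status": "success", "elapsed": 0.042}],
                "127.0.0.1", server.getsockname()[1])
t.join()
print(received)
```

Keeping this channel dumb (one JSON line per result) means any language's load program or agent can speak it.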

@almet
Contributor

almet commented May 12, 2014

The reasoning sounds good to me. In particular, I think one key insight here is that we don't really need to store a lot of duplicated data.

For instance, we could just store an incrementing count of successes, plus the different errors, maybe recording when the first error occurred and when the last one did.

In a discussion we had with Tarek, I think I understood the goal was to have the data aggregated by the "test" program itself before sending it to stdout. Now that I think of it, I would do this aggregation in the agent code rather than in the test program, in order to stay with a really simple test program protocol.

Otherwise, looks good to me!

@tarekziade
Contributor Author

> Now that I think of it, I would do this aggregation in the agent code rather than in the test program, in order to stay with a really simple test program protocol.

The problem here is that for very intensive load testing you will probably bust the stdin pipe, because the pipe's buffer is limited in size and it only drains as fast as the agent dequeues data to send it to the database. Once the buffer fills up, everything blocks and we're in trouble.

If it's well documented I don't think it's that hard imho

@almet
Contributor

almet commented May 12, 2014

Oh, that's right, it's not ultra complicated, it's just some complexity that I think should be avoided if possible: less to do for the implementers means more implementations!

Isn't there any way to tweak the max size of the pipe (can't we use all the free RAM)?
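For what it's worth, Linux does let you grow a pipe's kernel buffer via `fcntl` with `F_SETPIPE_SZ`, but only up to `/proc/sys/fs/pipe-max-size` - so it buys headroom, not unbounded RAM. A Linux-only sketch (the named constants exist in Python's `fcntl` module from 3.10; the raw Linux values are used as a fallback):

```python
import fcntl
import os

# F_SETPIPE_SZ / F_GETPIPE_SZ are Linux-specific fcntl commands.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

r, w = os.pipe()
fcntl.fcntl(w, F_SETPIPE_SZ, 1 << 18)   # ask for a 256 KiB buffer
size = fcntl.fcntl(w, F_GETPIPE_SZ)     # what the kernel actually gave us
print(size)

# The kernel caps this at /proc/sys/fs/pipe-max-size, so a slow reader
# will still block the writer eventually.
```

So a bigger pipe delays the stall; it doesn't remove it.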

@tarekziade
Contributor Author

tl;dr: if the agent can't keep up the pace, we're asking for trouble, because we may run day-long tests.

Let's imagine a Go program that sends several thousand results per second for 24 hours. Even if we use the RAM, if the Python agent can't keep up the pace, the queue will grow and eventually eat all the RAM. We will also be unable to provide almost-live feedback on the test, and every report will start to lag like hell.

And the other problem is that we will end up spending CPU on the agent's queue work instead of leaving as much CPU as possible to the load-generating program.

Asking the program to aggregate per second is "free".
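To illustrate the point: a per-second aggregator inside the test program collapses thousands of result lines into one line per second before they ever hit stdout. A sketch (the `SecondAggregator` class and its JSON line format are hypothetical, not the Loads protocol):

```python
import io
import json
import time
from collections import Counter

class SecondAggregator:
    """Hypothetical per-second aggregation for the test program: buffer
    counters in memory and flush one JSON line per elapsed second,
    instead of one line per request."""

    def __init__(self, out):
        self.out = out
        self.counts = Counter()
        self.window = int(time.time())

    def record(self, status):
        now = int(time.time())
        if now != self.window:
            self.flush()            # emit the finished one-second window
            self.window = now
        self.counts[status] += 1

    def flush(self):
        if self.counts:
            line = json.dumps({"ts": self.window,
                               "counts": dict(self.counts)})
            self.out.write(line + "\n")
            self.counts.clear()

# Simulate 10,000 requests; in the real test program `out` would be
# sys.stdout, read line by line by the agent.
out = io.StringIO()
agg = SecondAggregator(out)
for _ in range(10_000):
    agg.record("success")
agg.flush()
lines = out.getvalue().splitlines()
print(len(lines))  # a handful of lines at most, not 10,000
```

The aggregation cost is a couple of dict increments per request, which is why doing it in the test program is essentially free compared to shipping raw results through the pipe.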

@almet
Contributor

almet commented May 12, 2014

Gotcha. That works for me.
