
Tsung fails to use all system resources #297

Open
manukoshe opened this issue Mar 8, 2018 · 10 comments

Comments

@manukoshe

Hi,

I have the following setup:
6 nodes with 8 CPU/8GB RAM
On the main node, where the scripts are run, I set a weight of 25% to equalize the load with the other nodes.
Tsung generates load perfectly until about 50% of the CPU is used, then it fails to generate more load.
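For reference, node weighting like this is normally expressed in the `<clients>` section of the Tsung scenario; note that Tsung weights are relative, not percentages. Host names and limits below are hypothetical:

```xml
<!-- Hypothetical hosts: equal weights spread users evenly across the
     6 nodes; lower the controller's weight to shift load to the others. -->
<clients>
  <client host="controller" weight="1" maxusers="30000" cpu="8"/>
  <client host="node2"      weight="1" maxusers="30000" cpu="8"/>
  <client host="node3"      weight="1" maxusers="30000" cpu="8"/>
  <!-- ... remaining nodes ... -->
</clients>
```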
All machines are tuned with following settings:
"profiles::sysctl::_config": {
"net.ipv4.tcp_tw_reuse": {
"value": "1",
"permanent": true
},
"net.ipv4.tcp_tw_recycle": {
"value": "1",
"permanent": true
},
"net.ipv4.ip_local_port_range": {
"value": "1024 65535",
"permanent": true
},
"fs.file-max": {
"value": "500000",
"permanent": true
},
"net.ipv4.tcp_fin_timeout": {
"value": "10",
"permanent": true
}

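The same tuning, written as an ordinary sysctl fragment (equivalent to the hiera snippet above; note that `net.ipv4.tcp_tw_recycle` was removed in Linux 4.12 and is known to break clients behind NAT):

```
# /etc/sysctl.d/99-tsung.conf -- same values as above; apply with `sysctl --system`
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65535
fs.file-max = 500000
net.ipv4.tcp_fin_timeout = 10
```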
Network capacity is much greater than required. Other monitoring tools also confirm that the CPU is only at 50%. There are about 280 unknown errors, which I assume is an acceptable number for such a setup? Loglevel is set to error only.
Attaching screenshots of generated load and resources used.
[Screenshots: Tsung graph reports, 2018-03-08 14:10–14:11]

Any ideas what might be wrong? Tsung version is 1.6, Erlang/OTP 16.

@manukoshe
Author

Could it be due to CPU load reaching 8 (i.e. 1 for each core)?

@manukoshe
Author

manukoshe commented Mar 9, 2018

Same test and configuration, except run on a single machine:
[Screenshots: single-machine Tsung graph reports]

It's 3 times more effective! Am I doing something wrong, or is Tsung very inefficient in a distributed configuration?

@manukoshe
Author

manukoshe commented Mar 28, 2018

Did some benchmarking: load generation capacity drops significantly with each machine added to the distributed configuration. Network latency between the Tsung nodes also has a very big impact. In the end, a single machine can generate about 60% of the load that 6 machines of the same CPU/RAM produce in a distributed setup. This makes Tsung's distributed load generation a bit useless (at least for some scenarios).

I suspect this might be due to the batch size of users sent from the controller to the slave nodes (I observed cases where the batch size is only 1k users, which causes very frequent communication between nodes and thus decreased performance?). I understand that an increased batch size would probably make the load distribution less even/accurate.

Maybe the development team could consider a parameter to control the controller's instruction batch size? Or some other improvements to reduce the amount of communication between the controller and slave nodes and increase distributed load performance?

@tisba
Collaborator

tisba commented Mar 28, 2018

Hey @manukoshe.

Can you share your tsung configuration?

AFAIK there are some "well-known" issues, where there is more coordination required between controllers and generators than probably necessary. #237 is one step to solve one of those issues (decentralised file_server access).

AFAIK user arrivals are also partly coordinated by the controller.

Depending on how you model your test case, we've had quite good experience so far in running really large tests (~1M connected users, >300k req/s, using 100+ generator nodes).

@manukoshe
Author

manukoshe commented Mar 28, 2018

Thanks for the comment; I will send you the setup privately.

I observed a 5–10% drop in maximum generated requests for 1 machine vs a 1 controller/1 slave setup in the same data center.

I observed a ~20% drop in maximum generated requests for 1 machine vs a 1 controller/1 slave setup in different geographical locations.

Would be nice if anyone else could benchmark and share results for their scenarios.

@manukoshe
Author

Additional findings:
No matter how many machines I used, I hit a ~120k req/s limit per cluster configuration: I got the same 120k req/s maximum with either 1 master/3 slaves or 1 master/11 slaves. There seems to be some kind of limitation in the tsung controller, the metrics reporting, or somewhere else.

I was able to generate much more load by separately starting 3 tsung clusters of 1 master/3 slaves each (12 VMs total).
However, this is not very convenient because:

  • test scripts must be replicated to 3 machines
  • tests must be logged in to and run manually on 3 machines instead of 1
  • test reports are split (not a big issue in my case, since I use Grafana for monitoring)
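The multi-cluster workaround can be scripted so that only the launch is run from one place; the host names and scenario path here are hypothetical:

```shell
#!/bin/sh
# Start one tsung run per independent cluster master.
# Hypothetical hosts/path; printed as a dry run -- drop the `echo`
# (keeping the ssh command) to actually launch the runs.
for master in master-a master-b master-c; do
  echo "ssh $master tsung -f /root/load/scenario.xml start"
done
```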

Hopefully these issues will be addressed in later Tsung versions.

@tisba
Collaborator

tisba commented May 3, 2018

Hey @manukoshe. Again, it would be very helpful if you could share your test configuration. There are issues regarding scalability, but many of them can currently be mitigated. It would be nice to know what the potential problem is, that you are currently running into.

@manukoshe
Author

Hi, I can't share the test script in public. Please respond to my messages on LinkedIn or Twitter :)

@tisba
Collaborator

tisba commented May 30, 2018

I took a look at what @manukoshe sent me. I'm pretty sure the problem is the currently centralized nature of the file_server. An optimised, distributed local_file_server is WIP and there is a PR for it: #237.

If you have the possibility, @manukoshe, you're very welcome to check out #237, compile tsung, and give it a try. The change to your configuration should be rather simple.
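A sketch of fetching and building the PR branch using GitHub's read-only pull-request refs (printed as a dry run; `pr-237` is just an arbitrary local branch name):

```shell
#!/bin/sh
# GitHub exposes every PR at refs/pull/<N>/head; fetch PR #237 into a
# local branch and build. Echoed as a dry run -- remove the echos to run.
ref="pull/237/head:pr-237"
echo "git clone https://github.com/processone/tsung.git && cd tsung"
echo "git fetch origin $ref && git checkout pr-237"
echo "./configure && make && make install"
```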

@MoorthiRaj

Hi @tisba,

I have applied the patches in my local machines' tsung directories successfully. When I start tsung I get the following error:

Starting Tsung
Log directory is: /root/.tsung/log/20200925-2042
Config Error, aborting ! {{case_clause,"local_file"},
 [{ts_config,parse,2,
      [{file,"src/tsung_controller/ts_config.erl"},{line,1051}]},
  {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
  {ts_config,handle_read,3,
      [{file,"src/tsung_controller/ts_config.erl"},{line,85}]},
  {ts_config,read,2,
      [{file,"src/tsung_controller/ts_config.erl"},{line,70}]},
  {ts_config_server,handle_call,3,
      [{file,"src/tsung_controller/ts_config_server.erl"},{line,209}]},
  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},
  {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}

Can you please tell me where I should apply the patches?
