Load-testing Erlang servers for fun and profit
With this project you'll be able to set up, with minor effort, an Erlang TCP server, any number of clients connecting to it, and monitor everything as it happens. This way you can introduce bottlenecks, bugs, or any other issue you want into the server and teach how to analyze the metrics/logs, debug, and resolve the issue.
- Handles client subscribe/unsubscribe.
- Broadcasts messages received from one client to the rest.
- Sends a message to the server every N seconds.
- Receives messages from the server (originally sent by another client).
- Monitoring of clients
- Monitoring of server
- Serve dashboard to visualize metrics
Establishing a connection:
- Client connects to server
- Server responds with a banner: `MRPS server 0.1`
Messages a client can send:
- `SEND<msg>`: the server sends `<msg>` to all the other connected clients. They receive: `Received message: <msg>`
- `COUNT`: the server responds with the number of connected clients
- `EXIT`: closes the connection to the server
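For example, a hypothetical session (assuming the server is reachable on localhost, on the port the Ranch listener binds, and that a second client is already connected) might look like:

```
$ nc localhost 6969
MRPS server 0.1
COUNT
2
SEND hello
Message sent
EXIT
```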
- Terraform 0.11.11
- Ansible 2.7.5
- Scaleway account (this can be changed to any other provider in Terraform configuration)
Set the following environment variables:
- `TF_VAR_organization`: your Scaleway organization ID
- `TF_VAR_token`: your Scaleway API access token, generated by you
- `TF_VAR_ssh_key`: your SSH key to access provisioned servers
- `TF_VAR_region`: (OPTIONAL) the Scaleway region, by default `par1`
- `TF_VAR_count`: (OPTIONAL) the number of client servers to provision, by default 1
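For instance, a shell snippet setting these variables might look like the following (all values are placeholders; substitute your own credentials):

```shell
# Placeholder values -- substitute your own credentials before running
export TF_VAR_organization="your-organization-id"
export TF_VAR_token="your-api-token"
export TF_VAR_ssh_key="ssh-rsa AAAAB3Nza-your-public-key user@host"
export TF_VAR_region="par1"  # optional, this is the default
export TF_VAR_count="1"      # optional, defaults to 1
```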
If you are using (or want) a different provider, please check the corresponding documentation on how to set it up at Terraform Providers.
$> make infra
This will provision your servers using the Terraform configuration. You will be asked to confirm that you want to perform the actions; type `yes` to confirm.
Plan: 8 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
Terraform will output the IPs of the provisioned servers:
Outputs:
client_ips = [
192.168.1.208
]
server_ip = 192.168.1.100
monitor_ip = 192.168.1.133
Install everything you need and set up your clients, server, and monitoring by running:
$> make ops
$> make run
This will run Tsung using its configuration.
One of the core strengths of the Erlang programming language, more specifically of the Erlang/OTP runtime system, is its support for concurrency and distribution, making it ideally suited for soft real-time, high-availability applications.
Inspired by this post from the Phoenix team, we decided to build a TCP server to try our hand at the C10k problem, namely to serve and maintain as many connections as possible on commodity hardware. This is our journal of the project.
To build our server we could have rolled our own socket pool using OTP's `gen_tcp` module, but we chose to use Ranch, a minimal, low-latency library from the creator of Cowboy.
Once the Ranch application is started, we can start listening for connections:
start(_StartType, _StartArgs) ->
    ranch:start_listener(mrps, 10, ranch_tcp,
        [{port, 6969}, {max_connections, infinity}],
        mrps_protocol, []),
    mrps_sup:start_link().
Here we start 10 acceptor processes that listen on TCP port `6969` for an unlimited number of connections. `mrps_protocol` is our protocol handler, which defines the logic executed by each connection process.
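The handler module itself is not shown in the post; a minimal sketch, assuming Ranch 1.x's `start_link/4`/`ranch:accept_ack/1` contract and a hypothetical `mrps_register` implementation of the Register behaviour discussed later, might look like:

```erlang
%% Sketch of a Ranch 1.x protocol handler (not the project's exact module).
%% `mrps_register` is a hypothetical Register implementation.
-module(mrps_protocol).
-behaviour(ranch_protocol).
-export([start_link/4]).
-export([init/4]).

start_link(Ref, Socket, Transport, Opts) ->
    Pid = spawn_link(?MODULE, init, [Ref, Socket, Transport, Opts]),
    {ok, Pid}.

init(Ref, Socket, Transport, _Opts) ->
    %% Acknowledge the connection to Ranch before using the socket
    ok = ranch:accept_ack(Ref),
    Register = mrps_register,
    Register:store_client(self()),
    %% Greet the client with the banner, then enter the receive loop
    Transport:send(Socket, <<"MRPS server 0.1\n">>),
    loop(Socket, Transport, Register).
```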
Our server supports some basic commands:
- Accept incoming connections
- Return count of connected clients
- Broadcast messages received from one client to the rest
- Close connection
The core of the protocol implementation is a loop
function that handles communication between client and server:
loop(Socket, Transport, Register) ->
    ok = Transport:setopts(Socket, [{active, once}]),
    receive
        {msg, Message} ->
            Transport:send(Socket, ["Received message:\n", Message]),
            loop(Socket, Transport, Register);
        {tcp, Socket, <<"SEND", Message/binary>>} ->
            Register:for_each(send_msg(Message, self())),
            Transport:send(Socket, <<"Message sent\n">>),
            loop(Socket, Transport, Register);
        {tcp, Socket, <<"COUNT\n">>} ->
            Count = integer_to_binary(Register:count()),
            Transport:send(Socket, [Count, <<"\n">>]),
            loop(Socket, Transport, Register);
        {tcp, Socket, <<"EXIT\n">>} ->
            close(Socket, Transport, Register);
        {tcp, Socket, _Data} ->
            loop(Socket, Transport, Register);
        {tcp_closed, Socket} ->
            close(Socket, Transport, Register);
        {tcp_error, Socket, _Reason} ->
            close(Socket, Transport, Register)
    end.
The `Register` we use here is an abstraction (a behaviour) that maintains a list of connected clients and provides a way to iterate through them. Later, we discuss the different implementations we tried out.
Our server listens for the following commands:
- `SEND<msg>` => the server sends `<msg>` to all the other connected clients. They receive: `Received message: <msg>`
- `COUNT` => the server responds with the number of connected clients
- `EXIT` => closes the connection to the server
To broadcast the `<msg>` to the rest of the clients, we send a message to each connection process:
Client ! {msg, Message}
The message is matched by the `loop` function, which then sends it to the client:
{msg, Message} ->
    Transport:send(Socket, ["Received message:\n", Message]),
    loop(Socket, Transport, Register);
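The `send_msg/2` helper is not shown in the post; a plausible sketch, assuming it returns a closure for `Register:for_each/1` that skips the sender, is:

```erlang
%% Hypothetical helper: returns a fun that forwards Message to a
%% client's connection process, skipping the sending process itself.
send_msg(Message, Self) ->
    fun(Client) when Client =/= Self ->
            Client ! {msg, Message};
       (_Other) ->
            ok
    end.
```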
In order to broadcast messages our application needed a way to store all connected clients. We first created a behaviour that defined the following callbacks:
behaviour_info(callbacks) ->
    [{init, 1},
     {store_client, 1},
     {remove_client, 1},
     {for_each, 1},
     {count, 0}];
behaviour_info(_Other) ->
    undefined.
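On OTP R15 and later, the same contract is more commonly declared with `-callback` attributes; an equivalent sketch (the type specs are our assumption, not taken from the project):

```erlang
%% Equivalent behaviour declaration using -callback attributes;
%% argument and return types are assumed for illustration.
-callback init(Args :: term()) -> term().
-callback store_client(Client :: pid()) -> term().
-callback remove_client(Client :: pid()) -> term().
-callback for_each(Func :: fun((pid()) -> any())) -> term().
-callback count() -> non_neg_integer().
```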
We tested out several alternatives.
A simple yet effective way to store global state comes built into the Erlang system: Erlang Term Storage, or `ets`, a module that provides dynamic tables that store tuples indexed by a key.
Nothing beats `ets` tables for simplicity:
init(_Args) ->
    ets:new(clients, [set, public, named_table]).

store_client(Client) ->
    ets:insert(clients, {Client}).

remove_client(Client) ->
    ets:delete(clients, Client).

for_each(Func) ->
    ets:foldl(
        fun({Client}, _Acc) -> Func(Client) end,
        [],
        clients
    ).

count() ->
    ets:info(clients, size).
Here the `init` callback returns a table named `clients` with `public` access, so that connection processes can register each client.
We tested out the different types of `ets` tables:

- `set`: each key must be unique
- `ordered_set`: keys are unique and are stored/accessed in order
- `bag`: keys can repeat, but the tuples to store must be unique
- `duplicate_bag`: keys can repeat and identical tuples can be stored
We found no meaningful difference in performance between the table types, although it's been reported that `duplicate_bag` (since it enforces fewer constraints) might offer some improvement.
The rest of the API is self-explanatory.
Erlang provides another solution in the pg2 module, which implements process groups. It handles distributed named process groups, where messages can be sent to one, some, or all group members.
Clients are added to a global group, and then removed automatically when their connection process ends.
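A pg2-backed implementation of the behaviour could look roughly like the following (note that `pg2` was removed in OTP 24 in favour of `pg`; this sketch assumes an older release):

```erlang
%% Sketch of a pg2-backed Register. pg2 monitors group members, so
%% clients are dropped from the group when their process dies.
init(_Args) ->
    pg2:create(clients).

store_client(Client) ->
    pg2:join(clients, Client).

remove_client(Client) ->
    pg2:leave(clients, Client).

for_each(Func) ->
    lists:foreach(Func, pg2:get_members(clients)).

count() ->
    length(pg2:get_members(clients)).
```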
These are two third-party libraries that provide a global process registry, with strong support for distribution and concurrent reads. They are well tested and maintained; perhaps a bit excessive for such a simple project. An extensive comparison of the different process registries can be found here.
The corresponding behaviour implementations are relatively straightforward.
One of our goals was to have a one-click solution for creating infrastructure, installing needed software and deploying code to run our server.
To this end, we used Terraform to create and manage the needed cloud resources, and Ansible as a provisioning and configuration management tool.
From the official documentation:
Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.
As stated, it is a tool to create, manage and tear down cloud resources. It features a declarative syntax with which we describe the end state of our infrastructure, and the needed operations (creation, destruction, scaling) are handled automatically.
For our project, we used Scaleway, a cloud provider that offers ARMv7 machines.
Here's how we create a server:
resource "scaleway_server" "server" {
  name       = "load-server"
  image      = "31dfef82-9b45-4b01-9656-031617f36599"
  type       = "C1"
  depends_on = ["scaleway_ssh_key.ssh_key"]
}
Since we wanted the ability to create clients dynamically, we can pass the number of clients as a parameter to Terraform:
$> terraform apply -var "count=5"
And here's how we create the client servers:
resource "scaleway_server" "clients" {
  count      = "${var.count}"
  name       = "load-client-${count.index}"
  image      = "31dfef82-9b45-4b01-9656-031617f36599"
  type       = "C1"
  depends_on = ["scaleway_server.server"]
}
Terraform provides a way to output generated values (IPs in our case) so we can use them with our provisioning tool. Here's how we populate Ansible's inventory:
data "template_file" "ansible_inventory_tpl" {
  template = "${file("devops/inventory.tpl")}"

  vars {
    server_ip  = "${scaleway_ip.server_ip.ip}"
    client_ips = "${join("\n", scaleway_ip.client_ips.*.ip)}"
    monitor_ip = "${scaleway_ip.monitor_ip.ip}"
  }
}
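The `inventory.tpl` template itself is not shown; assuming Ansible's INI inventory format and hypothetical group names, it could look roughly like:

```
[server]
${server_ip}

[clients]
${client_ips}

[monitor]
${monitor_ip}
```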
Among the many alternatives in the field (Chef, Puppet, Salt) we chose Ansible, an agentless provisioning and configuration management tool that leverages Python and ssh.
The syntax is quite easy to grasp, consisting of `yaml` files (playbooks) and an inventory file that defines the hosts to provision/configure.
Playbooks are run against said hosts and take care of installing packages, creating users, configuring the system, etc.
Here's a task that installs Tsung on clients:
---
- name: Install Tsung on clients
  hosts: clients
  remote_user: root
  tasks:
    - name: install tsung
      apt:
        name: tsung
        state: latest
It is a comprehensive tool, allowing for a variety of system administration tasks that go above and beyond what we needed for this project.
In order to gain insights from our tests, we also set up a monitoring system. In particular, we used Prometheus, an open-source systems monitoring and alerting toolkit.
It consists of multiple components:
- the main Prometheus server which scrapes and stores time series data
- special-purpose exporters
- an alertmanager
- client libraries for instrumenting application code
Each machine in our project collects metrics through the official node_exporter and the Erlang client library.
The data is then fed to the visualization and analytics platform Grafana.
With our system already in place, we began testing our server with Tsung, a distributed load-testing tool written in Erlang.
Tsung requires testing specifications written in XML that describe the number of users to create, and how those users communicate with the server.
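As an illustration only (not the project's actual configuration), a minimal Tsung specification using the raw TCP plugin against the example server IP from the Terraform output might look like:

```xml
<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="notice">
  <clients>
    <client host="localhost" use_controller_vm="true"/>
  </clients>
  <servers>
    <!-- Server IP and port assumed from the examples above -->
    <server host="192.168.1.100" port="6969" type="tcp"/>
  </servers>
  <load>
    <!-- Spawn 50 new users per second for one minute -->
    <arrivalphase phase="1" duration="1" unit="minute">
      <users arrivalrate="50" unit="second"/>
    </arrivalphase>
  </load>
  <sessions>
    <session name="mrps" probability="100" type="ts_raw">
      <request> <raw data="SEND hello" ack="local"></raw> </request>
    </session>
  </sessions>
</tsung>
```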