Mconf Scalability


This document describes the infrastructure used for scaling Mconf. It also describes how the Mconf Network (see more about it at http://mconf.org/about/mconf-network) works today.

Methodology

The scaling of Mconf has been divided into two phases with two distinct objectives:

  • Objective 1: increase the limit of simultaneous users in a web conference infrastructure.

    In other words: use several web conference servers in a single infrastructure to increase the limit of users imposed when using a single server. The difference between this infrastructure and a single web conference server shouldn't be noticeable either to the users or to the applications that use the system (e.g. Mconf-Web and BigBlueButton's integration for Moodle). The infrastructure should be as dynamic as possible (e.g. servers can be added and removed) and as automatic as possible (e.g. new servers can register themselves in the "cloud" of servers).

  • Objective 2: increase the limit of users in a single web conference.

    This implies increasing the limit of users that a single Mconf-Live server supports and/or enabling a single meeting to be held using more than one server.

Work on objective 1 started with the second phase of Mconf (Oct. 2011), and objective 2 should start when the first one is completed.

Architecture

The image below shows an overview of the current infrastructure. This architecture solves the problem of our objective 1: use several web conference servers in a single infrastructure.

Scalable infrastructure overview

1. Front-end

The front-end is no different from the current front-ends. Users (with web and mobile devices) will join the conferences through any application that's integrated with Mconf-Live. Examples are the Mconf web portal at http://mconf.org and any of the applications listed at http://www.bigbluebutton.org/open-source-integrations/.

Instead of communicating directly with a web conference server, these front-end applications will use the Mconf Load Balancer, which implements the Mconf-Live/BigBlueButton API (read more about it below). This means that the front-end applications will not have to be modified to use the Mconf infrastructure.

However, the Load Balancer's API currently has some very small differences from the Mconf-Live/BigBlueButton API, and more could be added in the future. Even with these small differences, it is extremely important that existing front-end applications do not need to be adapted to use the Mconf infrastructure. Moreover, nothing should be removed from the API. You can see the documentation of the API in this page.
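To make this concrete, the sketch below builds a create call against the load balancer exactly as it would be built against a plain BigBlueButton server: the checksum appended to each call is the SHA-1 of the call name, the query string and the shared secret (salt). The load balancer URL and salt here are hypothetical placeholders, not real credentials.

```ruby
require 'digest'
require 'net/http'
require 'uri'

# Hypothetical values: use your load balancer's address and the salt
# issued to your institution.
LB_URL = 'http://lb.example.org/bigbluebutton/api'
SALT   = 'abcdefghijklmnopqrstuvwxyz123456'

# BigBlueButton-style signing: checksum = SHA1(callName + queryString + salt).
def api_url(call, params)
  query    = URI.encode_www_form(params)
  checksum = Digest::SHA1.hexdigest("#{call}#{query}#{SALT}")
  "#{LB_URL}/#{call}?#{query}&checksum=#{checksum}"
end

# Creating a meeting through the load balancer looks exactly like
# creating it on a single Mconf-Live/BigBlueButton server.
response = Net::HTTP.get(URI(api_url('create', meetingID: 'meeting-01', name: 'Demo')))
puts response
```

Because the signing scheme is unchanged, an application integrated with BigBlueButton only needs to point its existing API URL and salt at the load balancer.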

2. Web conference back-end

This is where most of our work is currently directed. The back-end consists of a "cloud" of web conference servers plus two new components: the Mconf Monitoring Server (Nagios) and the Mconf Load Balancer. These components are being developed by Mconf and are explained below.

Mconf Monitoring Server (Nagios)

The Mconf Monitoring Server uses Nagios, an open source monitoring system that's very flexible and has a large user base. This server monitors the cloud of web conference servers and provides access to all the monitoring data, both to users (through a graphical administration interface) and to applications (through an API).

Checks/reports

Nagios provides plugins for several things that should be checked ("checks" or "reports") in a server: HTTP, PING, FTP, etc. These checks can be passive or active: passive checks are sent from a web conference server to the Monitoring Server, while active checks are initiated by the Monitoring Server against a web conference server. For Mconf the following metrics are being monitored (a sketch of one such check follows the list):

  • Memory Report: Current usage and total of RAM.
  • Processor Report: Current CPU load.
  • Network Report: Input and output bandwidth usage.
  • Disk Report: Hard disk usage and total space available.
  • BigBlueButton Info: Reports the number of users, meetings, users with audio and users with video.
  • BigBlueButton API: Checks if the web conference API is responding correctly.
  • BigBlueButton Demo Installed: Checks if the demos are installed (they shouldn't be).
  • BigBlueButton Version: Checks if the version of the web conference server is correct.
  • Live Notes Server: Checks if the live notes server (the server that handles the notes module) is running.
  • Port check - Desktop Sharing: Checks if the port used by desktop sharing is open.
  • Port check - RTMP: Checks if the port used by RTMP is open.
  • Port check - SIP: Checks if the port used by SIP is open.
  • (There's also a default service that checks if the server is online)
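As an illustration, an active check like "BigBlueButton API" can be written as a standard Nagios plugin: print a one-line status and exit with code 0 (OK), 1 (WARNING) or 2 (CRITICAL). This is only a sketch with a hypothetical server address, not the actual plugin used by Mconf:

```ruby
require 'net/http'
require 'uri'

# Hypothetical server; the API root of a BigBlueButton server answers
# with <returncode>SUCCESS</returncode> when it is healthy.
url = URI('http://bbb.example.org/bigbluebutton/api')

begin
  response = Net::HTTP.get_response(url)
  if response.is_a?(Net::HTTPSuccess) && response.body.include?('SUCCESS')
    puts 'BBB API OK'
    exit 0   # Nagios: OK
  else
    puts 'BBB API WARNING: unexpected response'
    exit 1   # Nagios: WARNING
  end
rescue StandardError => e
  puts "BBB API CRITICAL: #{e.message}"
  exit 2     # Nagios: CRITICAL
end
```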

The intervals at which the Monitoring Server receives this information are configurable. For example, we could use 10 seconds for memory, processor and network reports, and 30 seconds for web conference reports. Also, some of the reports listed above are passive: there's a very simple application installed in the web conference servers that periodically sends information to the configured Monitoring Server. Other reports, such as "BigBlueButton Info", are active: the Monitoring Server periodically consults each web conference server to get the data.
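Passive checks are commonly submitted through Nagios's NSCA add-on: the monitored host pipes a tab-separated line (host, service, status, message) into the send_nsca utility. Below is a minimal sketch of a passive "Memory Report" sender, with hypothetical host and service names and an assumed send_nsca setup:

```ruby
# Minimal sketch of a passive memory reporter. Assumes the standard
# send_nsca utility is installed and configured on the node; host name,
# service name and threshold are hypothetical.
meminfo  = File.read('/proc/meminfo')
total    = meminfo[/MemTotal:\s+(\d+)/, 1].to_i   # in kB
free     = meminfo[/MemFree:\s+(\d+)/, 1].to_i    # in kB
used_pct = 100.0 * (total - free) / total

status  = used_pct > 90 ? 2 : 0                   # 0 = OK, 2 = CRITICAL
message = format('MEM %.1f%% used', used_pct)

# send_nsca reads "host<TAB>service<TAB>status<TAB>message" on stdin.
IO.popen(['send_nsca', '-H', 'monitor.example.org'], 'w') do |nsca|
  nsca.puts "bbb01\tMemory Report\t#{status}\t#{message}"
end
```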

The interactions between the web conference servers and the Monitoring Server are shown in item 2.a in the architecture image.

Database and graphics

The Monitoring Server was integrated with an RRDtool database using a plugin called nagiosgraph. This plugin stores all the information received from the monitored servers and also displays it graphically with charts such as the examples in the images below.

Example of an RRD chart in Nagios (1)

Example of an RRD chart in Nagios (2)

This provides us with historical data for every single metric being monitored and for every server. We can, for instance, see the number of meetings and users in the past year, the CPU usage of a server in the past month, or the bandwidth usage of all servers in the past week.

Source Code

As already mentioned, the Monitoring Server uses Nagios, an open source monitoring tool that can be found at: http://www.nagios.org/

Also, all the plugins and tools used with Nagios that were implemented by Mconf are open source and available at: https://github.com/mconf/chef-recipes. We use and maintain recipes to install and configure these monitoring components.

Mconf Load Balancer

Another important element in the infrastructure is the Mconf Load Balancer. It is responsible for deciding on which server a meeting should be created in order to balance the load among all servers in the cloud. It uses the metrics collected by the monitoring system (item 2.b in the architecture image above) to constantly evaluate the servers in the infrastructure, so when a user requests a new web conference room, the load balancer uses all the monitoring information to select the most appropriate server to run the conference (item 2.c in the architecture image above).

More details about how the load balancer works are given below.

API

The API is the entry point for other applications to communicate with the load balancer and the web conference back-end. It can be used to create meetings, see the meetings that are running and get more information about these meetings.

The load balancer implements the same API available in the web conference servers, but with a few changes. Mconf-Live implements the same API as BigBlueButton, with a few additions of its own. You can find the documentation of the BigBlueButton API at this page, and the changes made to it by Mconf-Live at this page.

Since the load balancer implements the same API, applications that are already integrated with BigBlueButton or Mconf-Live don't need to be changed to work with the load balancer. This is one of the most important features of the load balancer, since it enables all the integrations that BigBlueButton already has (see this page) to be used in this infrastructure.

The few changes that were made in the Load Balancer's API should not break any integration that already works with Mconf-Live/BigBlueButton. These changes are documented in this page.

Moreover, the load balancer implements the mobile API, so Mconf-Mobile and BBB-Android can also connect to the load balancer. See more about this API in the Mconf-Mobile page or in the source code at GitHub.

Multiple clients (salts or institutions)

To access the API, the client applications need to know the salt of the load balancer, just as they would to access Mconf-Live/BigBlueButton (see more about it here).

To allow multiple applications (also called "clients") to access the same infrastructure, the load balancer is able to create multiple salts. Each application/client/institution has its own salt, and with this the load balancer is able to distinguish the client that is accessing it.

The mobile applications, however, use a different salt to access a web conference server (see more at this page). This salt has only 5 characters to make it easier to type on a mobile device. In the load balancer, the first 5 characters of the standard salt are used as the key for mobile applications. So if the standard API salt is "abcdefghijklmnopqrstuvwxyz123456", the mobile salt will be "abcde".
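Expressed in code, the derivation above is simply:

```ruby
standard_salt = 'abcdefghijklmnopqrstuvwxyz123456'
mobile_salt   = standard_salt[0, 5]   # => "abcde"
```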

Algorithms

The algorithms are the logic behind the load balancing. Several algorithms are implemented in the load balancer, but only one is in use. Some of these algorithms are:

  • Select the server with the lowest CPU load;
  • Select the server with the fewest users;
  • Select the server that is geographically nearest to the client.

The algorithm currently in use is:

  • Order the servers by proximity to the client. If there is more than one server within a ~300 km radius, select the one with the lowest CPU load at the moment. Servers that are not up, that are not responding properly to the monitoring server, or that have a CPU load higher than ~70% are not considered.
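A sketch of one possible reading of this rule, assuming each server record carries coordinates, a liveness flag and the latest CPU load from the monitoring data (this is an illustration, not the actual load balancer code):

```ruby
Server = Struct.new(:name, :lat, :lon, :up, :responding, :cpu_load)

RADIUS_KM    = 300
MAX_CPU_LOAD = 70

# Great-circle distance between two points, in km (haversine formula).
def distance_km(lat1, lon1, lat2, lon2)
  rad  = Math::PI / 180
  dlat = (lat2 - lat1) * rad
  dlon = (lon2 - lon1) * rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dlon / 2)**2
  6371 * 2 * Math.asin(Math.sqrt(a))
end

def select_server(servers, client_lat, client_lon)
  # Discard servers that are down, not reporting or overloaded.
  candidates = servers.select { |s| s.up && s.responding && s.cpu_load <= MAX_CPU_LOAD }
  return nil if candidates.empty?

  # Order by proximity to the client.
  sorted  = candidates.sort_by { |s| distance_km(s.lat, s.lon, client_lat, client_lon) }
  nearest = distance_km(sorted.first.lat, sorted.first.lon, client_lat, client_lon)

  # Among servers roughly as close as the nearest one (~300 km radius),
  # pick the one with the lowest CPU load.
  nearby = sorted.take_while { |s| distance_km(s.lat, s.lon, client_lat, client_lon) <= nearest + RADIUS_KM }
  nearby.min_by(&:cpu_load)
end
```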

Dashboard

The Load Balancer also has a Dashboard, a web page that shows the state of all servers in the network in real time. You can see the dashboard of the Mconf Network at: http://lb.mconf.org.

The main idea of the dashboard is to provide an easy way for administrators to see the state of the network and quickly detect possible problems. The dashboard doesn't show all the information available in the monitoring server, but a reduced set including only what's really important, presented in a compact and easy-to-use interface.

Chef: software updates

Before the distributed network, the steps we took to update the version of a web conference server were all manual: ssh into the server machine and run a series of commands to update it. If anything went wrong, we had to figure it out and fix it during the update. This approach would be terrible in a distributed network, where we have several web conference servers instead of a single one. It is even worse in the Mconf Network: since we don't have ssh access to all servers (each institution is responsible for its own servers), it would demand more from the administrators of each institution.

To simplify and automate the process, we started using a tool called Chef. With Chef, a set of Ruby scripts (called recipes) is written to automate the installation of everything that runs in a web conference server. So, starting from a computer with a basic OS installed, we can install the Chef client and run the recipes, and it will become a web conference server.
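For illustration, a Chef recipe is declarative Ruby: each resource describes a desired state and the Chef client converges the node to it. The package, template and service names below are hypothetical, not the actual Mconf recipes (those are linked in the Resources section below):

```ruby
# Hypothetical fragment of a recipe; resource names are illustrative.
package 'bigbluebutton' do
  action :install
end

template '/etc/bigbluebutton/bigbluebutton.properties' do
  source 'bigbluebutton.properties.erb'
  mode '0644'
end

service 'bbb' do
  action [:enable, :start]
end
```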

The image below shows how the process of developing new recipes and updating the web conference servers works:

Updating the cloud with Chef

It starts in a development environment, where new recipes are written to install the new features developed in the web conference software. These recipes are pushed to a git repository, where they become available to the Chef server. The Chef server is then updated: this can be done in more than one way; currently a developer logs into it and downloads the recipes from the git repository.

Meanwhile, the nodes (the web conference servers) periodically consult the Chef server to see if there's anything new for them. They always fetch the recipes and execute them, so if there's anything new it will be installed. If nothing changed in the recipes, they detect it and don't change anything on the node.
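On each node this polling is typically just the Chef client running periodically; the relevant settings live in the client's configuration file (values below are illustrative):

```ruby
# /etc/chef/client.rb - illustrative values
chef_server_url 'https://chef.example.org'
node_name       'bbb01'
interval        1800   # check the Chef server every 30 minutes
splay           300    # random extra delay to avoid synchronized runs
```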

This was just a brief description of how it currently works in the Mconf Network; Chef is very flexible and can be configured in different ways. In short, the main advantage of using Chef is that updates can be triggered remotely for all nodes in the network and don't require manual configuration on the nodes. Also, the scripts are written in Ruby, a dynamic, easy-to-use programming language.

For the Mconf Network, this reduces the work needed from the administrators of each institution, since the updates are automatic and less error-prone (all recipes are tested before an update).

Resources

All recipes written for Mconf are open source and available at: https://github.com/mconf/chef-recipes

Recording architecture

The Mconf Network supports recordings in a distributed way. Mconf-Live, the base web conference system, supports recordings natively, but since it's part of a distributed environment there are many issues involved.

The first issue is that the Mconf Network is a dynamic environment: nodes can be available now and unavailable later, so nobody can assume that a given server will always be available to serve recordings. The second is that institutions must have control over their recordings, which means that each institution should apply a proper backup policy and must be responsible for the availability of the server where its recordings are hosted.

Therefore we designed a novel architecture to support recordings in a distributed way, making it transparent for the users and keeping the same API. The figure below shows in a general way how it works.

Overview of the recording architecture

The recording server is a Mconf-Live server with two differences: it has the capability to process and host recordings, and it has an RSA key pair. The key pair is used to encrypt and decrypt the recording files on their way from the Mconf-Live server to the recording server, as explained below. Each partner of the Mconf Network that wants to provide recording capabilities for its users must make a recording server available (it cannot be the same machine as the Mconf-Live server!). The recording server can be used to host recordings created by more than one web portal - for instance, if the partner has both Moodle and Mconf-Web integrated with the network, it can use the same recording server for both.

When a user opens a room in a web portal, they can choose to record the session. The record option is sent to the load balancer as part of the create API call. The load balancer identifies which web portal is requesting the room and checks if the owner institution has a recording server connected to the network. If the institution doesn't have a recording server, the load balancer removes the record flag and the session won't be recorded at all. If the institution has a recording server, the create call is forwarded to the most suitable Mconf-Live server with additional data, including owner institution metadata (such as name and ID) and the public key of the recording server.

The Mconf-Live server runs the meeting and, during the meeting, the recording files are stored temporarily on the server's disk. When the conference finishes, the Mconf-Live server packs all the files related to the recording into a .zip file and encrypts it with the public key received from the load balancer. The encrypted file is then made available through the API, and all temporary files are removed from the server.
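One detail worth noting: RSA alone can only encrypt payloads smaller than the key size, so an archive like this is normally protected with a hybrid scheme in which a random symmetric key encrypts the file and the RSA public key encrypts that symmetric key. The sketch below assumes such a hybrid scheme; the actual file layout used by Mconf-Live is not specified here:

```ruby
require 'openssl'

# Hypothetical paths; the .zip is assumed to have been packed already.
zip_path   = 'recording-abc123.zip'
public_key = OpenSSL::PKey::RSA.new(File.read('recording_server_pub.pem'))

# Encrypt the archive with a random AES key...
cipher = OpenSSL::Cipher.new('AES-256-CBC')
cipher.encrypt
aes_key = cipher.random_key
iv      = cipher.random_iv
encrypted_zip = cipher.update(File.binread(zip_path)) + cipher.final

# ...and protect the AES key with the recording server's RSA public key,
# so only the recording server can recover it.
encrypted_key = public_key.public_encrypt(aes_key)

File.binwrite("#{zip_path}.enc", encrypted_zip)
File.binwrite("#{zip_path}.key", encrypted_key)
File.binwrite("#{zip_path}.iv",  iv)
```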

The recording server periodically polls the load balancer for new recordings. When a Mconf-Live server exposes a new encrypted recording, it becomes visible to the load balancer and, consequently, to the recording server. If the recording server notices a recording that has not yet been processed, it downloads the encrypted recording, decrypts it and processes it. At this point the cycle is complete - the recording created on a Mconf-Live server is now available as a processed recording on the recording server of the owner institution.

When users want to watch a recording, they are redirected to the recording playback hosted on the recording server. The playback is implemented in HTML5.

Testing

To test the distributed infrastructure, at Mconf we developed a command line web conference client called BBBot. This application was built in Java, using the same libraries used in Mconf-Mobile and BBB-Android. The bot is able to create and join web conferences, just like a standard browser client, and also to send and receive audio and video.

This application can also be used by anyone who wants to test Mconf-Live/BigBlueButton servers, whether in a distributed environment or on a single server.

BBBot is open source and is available at: https://github.com/mconf/bbbot

Recommendations

A list of technical tips for people using or about to use Mconf's solution for scalability. Especially useful for institutions that want to join the Mconf Network.

  • Learn how the API works by reading the API documentation mentioned in the API section above.
  • Set the HTTP header x-forwarded-for in your requests to the load balancer. Its value should be the client's IP, so that the load balancer can properly select an adequate server for them. The load balancer needs this IP to determine the location of the client; otherwise it will only know the location of the server that sent the request (your Moodle server, for instance). If you're using Mconf-Web you don't need to do anything, since it already sets this header by default.
  • When creating a meeting, use meetingID as a random globally unique identifier (GUID), and name as the proper human-readable name of the meeting. Consider that the create call can fail if a meetingID is duplicated. A sketch covering the last two tips follows this list.
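The sketch below illustrates the last two tips; the load balancer address, salt and client IP are hypothetical, and SecureRandom.uuid generates the random GUID:

```ruby
require 'digest'
require 'net/http'
require 'securerandom'
require 'uri'

LB_URL = 'http://lb.example.org/bigbluebutton/api'   # hypothetical
SALT   = 'abcdefghijklmnopqrstuvwxyz123456'          # hypothetical

# A random GUID as meetingID, plus a human-readable name.
params   = { meetingID: SecureRandom.uuid, name: 'Weekly meeting' }
query    = URI.encode_www_form(params)
checksum = Digest::SHA1.hexdigest("create#{query}#{SALT}")
uri      = URI("#{LB_URL}/create?#{query}&checksum=#{checksum}")

request = Net::HTTP::Get.new(uri)
# Forward the end user's IP so the load balancer can geolocate them.
request['x-forwarded-for'] = '198.51.100.7'          # the client's IP

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts response.body
```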