Scaling to 1M

Lloyd Hilaiel edited this page Nov 15, 2011 · 3 revisions

This page presents a methodology for estimating BrowserID capacity, and presents a plan for how to scale out capacity to handle 1M active users.

Overview

An "active user" for the purposes of these estimates is a user with two devices, who uses 4 different BrowserID enabled sites, and visits her sites about 40 times a day evenly spread out over site and device. Users are required to re-authenticate to BrowserID once a week per device. Finally, "typical sites" remember user authentication for about 2 hours before requiring re-authentication via browserid. Finally, we expect about a 20% growth rate per month.

For the purposes of this discussion, there are 6 interesting types of activities that occur on browserid:

  • signup - A user creates a new browserid "account" via the dialog
  • reset_pass - A user resets their password through the dialog
  • add_email - A user adds a new email to their browserid "account"
  • reauth - A user re-authenticates to browserid
  • signin - A user uses browserid to sign into a site.
  • certify - A user must generate a keypair and get it certified, occurs when no local key exists.
  • include_only - An RP includes include.js.

The discussion below concludes that 1M active users corresponds to:

  • 2.10 signup activites per second
  • 0.42 reset_pass activites per second
  • 0.84 add_email activites per second
  • 1.68 reauth activites per second
  • 93.93 signin activites per second

Caveats

First, This is guesswork and estimation. The goals here are to guess at conservative initial hardware requirements and to have a means of quantifying the performance of the system. Once we have significant usage of the system we can use live data as a basis of capacity planning rather than simulated load.

Basic Architecture

At present there are three services that constitue a BrowserID node capable of serving static resources and dynamic WSAPI calls:

  1. mysql - persistence
  2. nginx - reverse proxy for node.js api server which handles all static requests
  3. node.js server - which handles all api calls.

Additionally there is the verifier, which is itself a node.js server.

In the current production all four of these services are run on a single node, which is in turn behind a load balancer.

Projected Load by Area

Static Resources

Static resources served from browserid servers fall into three groups:

  1. include.js served to RPs at the time their page with a login button is loaded
  2. dialog resources served to RPs at the time that a user clicks on a login button on
  3. site resources served to visitors of browserid.org (using the manage page or viewing the site).

Generally all of these resources are to be completely static. They can be cached and served by nginx or even higher levels of middleware. If/when required we can offload them to an SSL CDN. The question is when to do this?

So how much bandwidth will be consumed by static resources:

  1. include.js is 4k minified and gzipped. It will be served to about 450 active browserid users per second by RPs that use browserid. Further, it will be served even more frequently by these same RPs as very likely only a subset of their users will authenticate with BrowserID. Assume that 10% of users of RPs that use browserid authenticate with browserid. That reasoning yields:

    4k * 450 * 10 = 18M/sec

  2. dialog resources are 100k minified and gzipped. these will be served to about 100 users per second.

    100k * 100 = 10M/sec

  3. the manage page is also about 100k of resources. 4 users per second will be directed these resources as part of the sign in flow, and additionally can conservatively estimate organic manage page usage to be about 5% of the traffic to dialog.

    100k * 4 + 10M * .05 = 900k/sec

Given these estimates, with a million active, we can expect about 30M/sec of traffic to be static resource requests. This estimate ignores browser caching and assumes we need to serve all static resources afresh.

The only work to do here is to ensure all static resources are properly cached in the nginx layer with correct cache headers. Further that resources are cached and gzipped properly.

Verification

Verification of an assertion takes on the order of 500ms on a 2.66Ghz Core 2 processor.

With 1M active we should expect about 100 verification actions per second, which will require 50 cores to satisfy. If we want to run the service at 20%, that's about 250 cores.

The other consideration to verification is that this is one part that will move off of our service and onto RPs, which is beneficial to them as it limits their reliance on external servers, probably decreases latency for users, and at smaller scale has a much better maintenence cost.

Another consideration here is that this is one area where a better native implementation can have a huge impact.

Open work for the verifier is to ensure that it can saturate all available cores. This might be as simple as running multiple instances on different ports and letting nginx or the load balancer round robin.

Certification

XXX: Write me

WSAPI calls

Not considering verifier calls, a simulated load of 1M active users

XXX: update me

Email

We've agreed to outsource email to a different vendor. Given the numbers above, we can expect to send about 4 emails per second per million active users.

Persistence

With 1M active users we run 33 QPS against MYSQL. This is a completely managable quantity. All we should do in this area is load up the database with sample data and figure out what our size requirments are. Also we should go through every distinct type of query and ensure that there are no table scans and figure out what a reasonable upper bound is on a typical node.

Outside of scaling to multiple nodes for redundancy, the mysql work to get to 1M is largely just making sure we haven't done anything stupid.

tl;dr;

With 1M active users we:

  • serve 30M/sec in static resources
  • run 33 QPS through mysql
  • saturate 250 cores with verification requests
  • saturate 2.1 cores with WSAPI requests w/ 12 round bcrypt

For the purposes of getting to 1M, I would suggest that we acquire at least 4 machines in at least 2 colos having at least 24 cores between them. 3 4-core boxes each in 2 colos or 2 8-core boxes in 2 colos both would work equally well.

I suggest for simplicities sake we continue to run multiple nodes which run all services on them, and later we can incrementally move distinct services (like database or verification) with unique requirements onto more optimized clusters.

To repeat myself, this plan should hold us over long enough that we can make future decisions based on emperical data rather than lloyd's hokey load generation.

Finally, I have been careful to avoid proposing optimizations as a required part of the plan to get to 1M, however I think we can invest time in several different areas and improve our cost model.