Economics of Amazon EMR

Paul Houle edited this page Sep 3, 2013 · 4 revisions


This article contains a simple analysis of the economics of EMR that gives you some idea of how to estimate and control cost. The key concept is that you should understand costs and take steps to minimize them. We present "seat-of-the-pants" methods aimed at small groups and rapid growth, and outline the methods that would be used if reducing cost were seen as more important than increasing revenue.

Risk and Politics

The flexibility of EC2 creates the risk that an EC2-authorized user can incur large costs quickly. If a person, for instance, creates an instance that costs $3 an hour to run and forgets about it, he could run up a $2160 bill in a month (more than his health insurance premiums!) -- and this situation could go on for months if the bill keeps being paid.
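The arithmetic behind that figure is worth keeping at hand when you review your running machines. A minimal sketch, assuming a 30-day month and a flat hourly rate (the function name is made up for illustration):

```python
def forgotten_instance_cost(hourly_rate, days=30):
    """Cost of leaving one instance running around the clock for `days` days."""
    return hourly_rate * 24 * days

# The $3/hour instance from the text, left running for a 30-day month.
print(forgotten_instance_cost(3.00))
```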

Be you a moonlighter or a venture capitalist, you will find Amazon Web Services expensive if you don't have financial controls.

If you are an individual or responsible for a small group you should monitor the status of your EC2 system at least once a day and account for every running machine. As your organization gets larger, a larger system of controls is necessary.

I am no fan of paperwork. However, if you control your EC2 costs proactively, you can maintain the maximum amount of autonomy, in your situation, for your AWS use.

Minimize the cost of failure

It is cheap to make mistakes against a 7.5 cent cluster, but it is costly to make mistakes with a $25 cluster. Large clusters must be protected against simple errors using technological, procedural and human development methods.

Careful investigation of the cause of failures in your organization and a systematic campaign to reduce them using checklists and other mechanisms is worthwhile for all your endeavors, including cloud cluster computing.

Latency and Cost for Ephemeral Clusters

By the seat of your pants

The most remarkable feature of EMR billing is that Amazon bills in hourly increments. Cluster initialization seems to take 5 minutes or so with the Amazon distribution, so if one of your job's input paths is missing, it will take about 5 minutes to fail, but you'll be charged for the whole hour.

If the job ran for 59 minutes, you'd get billed for just the first hour -- and you would have struck the perfect bargain. You're gambling, however, that the job doesn't go into overtime, runs for 61 minutes and then you get charged for the next hour, doubling the cost of the job!

You can definitely buy more machines and get a job done in less than an hour, but the cost-efficiency goes down. For instance, if you double the number of machines so that a job that took just under an hour completes in 30 minutes, you still pay a full hour for every machine, so you roughly double the cost. Pay for speed if you wish, but be aware of the cost. Once the run time gets below 10 minutes, you're mostly paying for overhead and it isn't worth going further.
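The hourly rounding described above can be written down in a few lines. This is only a sketch of the billing rule as this article describes it; the node counts and the $0.50 hourly rate are made-up numbers, and the function names are not part of any AWS API.

```python
import math

def billed_hours(runtime_minutes):
    """Whole hours charged: any started hour counts as a full hour."""
    return max(1, math.ceil(runtime_minutes / 60))

def job_cost(runtime_minutes, nodes, hourly_rate):
    """Cost of one ephemeral cluster that runs for `runtime_minutes`."""
    return billed_hours(runtime_minutes) * nodes * hourly_rate

RATE = 0.50  # hypothetical price per node-hour

print(job_cost(5, 10, RATE))   # fails after 5 minutes, still pays one hour
print(job_cost(59, 10, RATE))  # the "perfect bargain"
print(job_cost(61, 10, RATE))  # two minutes later, twice the price
print(job_cost(30, 20, RATE))  # twice the nodes, half the time, twice the price
```

With these numbers the 5-minute failure and the 59-minute run cost the same, while both the 61-minute run and the 20-node, 30-minute run cost twice as much.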

Similar steps in cost occur at every hour boundary, but they matter less as the run time gets long, because one extra billed hour is a smaller fraction of the total. It's fair to say that you can turn a 12 hour job into a 3 hour job by quadrupling the number of compute nodes without a large increase in cost.
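To see where the steps fall for a particular job, you can sweep over cluster sizes. The sketch below assumes the work divides perfectly evenly across nodes and that startup takes a fixed 5 minutes; real jobs rarely scale this cleanly, and the 6950 node-minute workload is an invented example.

```python
import math

def runtime_minutes(nodes, work_node_minutes, startup_minutes=5.0):
    """Estimated wall-clock time, assuming perfectly linear scaling."""
    return startup_minutes + work_node_minutes / nodes

def billed_node_hours(nodes, work_node_minutes, startup_minutes=5.0):
    """Node-hours charged when every started hour is billed in full."""
    minutes = runtime_minutes(nodes, work_node_minutes, startup_minutes)
    return nodes * max(1, math.ceil(minutes / 60))

# Hypothetical job: 6950 node-minutes of work, about 11h40m on 10 nodes.
WORK = 6950
for nodes in (10, 20, 40, 80, 160):
    print(nodes, round(runtime_minutes(nodes, WORK)), billed_node_hours(nodes, WORK))
```

In this example 10, 20 and 40 nodes are all billed 120 node-hours, so the job gets four times faster for free, while 80 and 160 nodes are billed 160 node-hours. Multiply node-hours by your hourly rate to get dollars.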

What a real analysis would look like

A serious financial analysis would contain a probability distribution for the job completion time. If we try to complete the job in 55 minutes, there is a certain probability it will go over the hour, but if we aimed for 50 minutes we'd buy more machines and have a lower probability of going over. Finding the right target (i.e. cluster size) is the basis for a simple problem in decision theory.
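A rough way to explore this trade-off is a small Monte Carlo simulation. Everything in the sketch below is an assumption made for illustration: run time is modeled as a planned time multiplied by log-normal noise with a spread of about 10%, scaling is perfectly linear, startup takes 5 minutes, and the price is 1 unit per node-hour. Your own jobs will need a distribution fitted to your own history.

```python
import math
import random

def simulate(nodes, work_node_minutes, startup_minutes=5.0, sigma=0.1,
             hourly_rate=1.0, trials=20000, seed=0):
    """Return (planned minutes, P(run exceeds one hour), expected cost)."""
    rng = random.Random(seed)  # fixed seed so results are repeatable
    planned = startup_minutes + work_node_minutes / nodes
    total_cost = 0.0
    overruns = 0
    for _ in range(trials):
        # lognormvariate(0, sigma) has median 1, so `planned` is the median run time
        actual = planned * rng.lognormvariate(0.0, sigma)
        hours = max(1, math.ceil(actual / 60))
        total_cost += hours * nodes * hourly_rate
        if actual > 60:
            overruns += 1
    return planned, overruns / trials, total_cost / trials

# Hypothetical job with 900 node-minutes of work:
# 18 nodes targets 55 minutes, 20 nodes targets 50 minutes.
for nodes in (18, 20, 22):
    planned, p_over, cost = simulate(nodes, 900)
    print(nodes, round(planned, 1), round(p_over, 3), round(cost, 2))
```

Under these assumptions the 55-minute target goes over the hour often enough that the larger, 50-minute cluster comes out cheaper on average. The decision-theory problem is to pick the cluster size that minimizes expected cost, or expected cost plus whatever you think a late job costs you.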

Semi-Permanent EMR Clusters

Note that an EMR cluster can also be configured to continue running after the job steps complete. By adding additional job steps, you could run several jobs inside a billing period; for instance, over a three hour period you could run a number of jobs of lengths running from 10-45 minutes and wind up with few wasted CPU minutes.
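The saving is easy to estimate. The sketch below compares starting a fresh cluster for every job against running the same jobs back to back as steps on one cluster, assuming a 5 minute startup and hourly billing as described earlier; the job lengths are invented and results are in billed hours per node.

```python
import math

def separate_clusters_hours(job_minutes, startup_minutes=5):
    """Billed hours per node when each job gets its own ephemeral cluster."""
    return sum(max(1, math.ceil((startup_minutes + m) / 60)) for m in job_minutes)

def shared_cluster_hours(job_minutes, startup_minutes=5):
    """Billed hours per node when all jobs run as steps on one cluster."""
    return max(1, math.ceil((startup_minutes + sum(job_minutes)) / 60))

jobs = [10, 45, 30, 25, 40, 20]  # hypothetical job lengths in minutes
print(separate_clusters_hours(jobs))  # one billed hour per job
print(shared_cluster_hours(jobs))     # 175 minutes in total
```

Here six separate clusters are billed six hours per node, while one shared cluster is billed three.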

Reserved Instances, Spot Instances, ...

If you are racking up and unracking clusters very often, you might consider buying EC2 reserved instances. (These are a must if you run web or database servers in EC2.)

The calculations are tricky, because multiple reserved instance types exist, but for some of you this may improve the competitiveness of EC2 vs. a departmental cluster.

You don't need to configure anything in Haruhi to use reserved instances. For instance, if you have 5 reserved instances for c1.xlarge, and two of them are in use outside of EMR, there are three reserved instances available. If you buy a cluster with 6 instances of c1.xlarge you will be charged the reserved instance rate for the first 3, and the other 3 will be charged at the higher on-demand rate.
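The split in that example can be sketched as follows. This is a simplification for back-of-the-envelope estimates, not a description of how AWS computes a bill: it ignores any other matching conditions that apply to reserved instances, and the hourly rates are placeholders.

```python
def split_reserved(cluster_size, reserved_owned, reserved_in_use_elsewhere):
    """Return (instances at reserved rate, instances at on-demand rate)."""
    available = max(0, reserved_owned - reserved_in_use_elsewhere)
    at_reserved = min(cluster_size, available)
    return at_reserved, cluster_size - at_reserved

def hourly_cluster_cost(cluster_size, reserved_owned, reserved_in_use_elsewhere,
                        reserved_rate, on_demand_rate):
    """Estimated cost of one cluster-hour given the two rates."""
    at_reserved, at_on_demand = split_reserved(
        cluster_size, reserved_owned, reserved_in_use_elsewhere)
    return at_reserved * reserved_rate + at_on_demand * on_demand_rate

# The example from the text: 6-node cluster, 5 reservations, 2 busy elsewhere.
print(split_reserved(6, 5, 2))
# Placeholder rates, not real c1.xlarge prices.
print(hourly_cluster_cost(6, 5, 2, reserved_rate=0.25, on_demand_rate=0.5))
```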

EMR users can experience dramatic savings by using Spot instances. These can be configured in your applicationContext.xml by adding instance groups to your cluster configuration. If your bid price is less than the current spot price, you run the risk of never getting an instance, or having an instance terminated out from underneath you.

Hadoop can deal with either situation well, particularly if it is just a matter of losing a TaskTracker. If you spend some time thinking about probability distributions and your utility function, you can certainly improve the "efficient frontier" of job completion time over cost compared to the methods above.