# Amazon Installations: Elastic Compute Cloud

## Usage Notes

Amazon Elastic Compute Cloud (EC2) is Amazon's service for requesting virtual machines that are created in Amazon's cloud.

This notebook covers the aspects of a virtual machine that you should consider when requesting a cluster of virtual machines. The same considerations apply when requesting a single instance.

## Notebook Imports

In [None]:
%matplotlib inline

In [None]:
from __future__ import division, print_function
from aws_base import *
from aws_group import *
from aws_iam import *
from aws_request import *
from aws_spot import *
from aws_util import *
from aws_volumes import *
from IPython.display import display
from matplotlib import pyplot
import numpy  # used below for numpy.mean and numpy.ndarray

## Identify the AMI

If you already have an AMI you want to use, set it here. The only requirement is that the AMI uses Amazon Linux or Ubuntu, since those are the two operating systems handled by the installation steps later in this notebook (otherwise, future attempts to install software won't work). If you do not set one, one will automatically be chosen for you later in the notebook.

* https://console.aws.amazon.com/ec2/v2/home#Images

In [None]:
image_id = None

## Check Spot Instance Pricing

### Consider Instance Types

It's important to consider multiple instance types for our workload, because the smallest instance type is not always the cheapest. It is not uncommon for the price of a lower-tier EC2 instance type to spike, because programs that request that type by default drive up the demand for it. A good reference for the instance types is maintained by the community of AWS users.

* http://www.ec2instances.info/

For general-purpose usage, many applications are CPU-bound while still needing a decent amount of memory, so you will want to consider the general-purpose ``m1``, ``m3`` and ``m4`` instances for those tasks. However, for many Liferay-related purposes, you can lean towards something more specific.

Below are a few examples of how you might decide on an instance type.

* A standard Liferay instance that you use for cluster testing is CPU-bound while needing a moderate amount of memory, so something from the `c3` or `c4` family is appropriate.
* Liferay builds require a substantial amount of memory, but since they're running in the background, you can trade away some CPU capacity. For that reason, something from the `r4` family is appropriate, though you may switch to an `m3` or `m4` instance type if you find yourself needing a better trade-off balance.
* Liferay upgrades require about the same amount of CPU power as a standard Liferay instance, but they need a lot more memory to succeed. When taking pricing into consideration, you would consider a larger variant in the `r4` family.
* For purposes of deep learning, your only options are in the `p2` family.

In [None]:
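# Trailing commas keep the list valid when more than one entry is uncommented
# (without them, Python silently concatenates adjacent string literals)
instance_types = [
    'm1.small',      # Cheap SOCKS5 proxy
#    'c3.large',      # Liferay cluster node
#    'r4.large',      # Liferay bisect
#    'r4.xlarge',     # Liferay upgrade
#    'p2.xlarge',     # FastAI personal use
#    'p2.16xlarge',   # FastAI workshop
]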

### Retrieve Price History

When considering the instance type, one good metric is the price of those instance types across several days. By default, the AWS GUI shows you the price history for 1 day, and it also lets you look at the history for 7 days (one week) or 30 days (one month).

In [None]:
day_count = 7

Now we'll actually retrieve that price history. We'll keep things as a list so that names (such as instance types and availability zones) stay sorted during our analysis.

In [None]:
instances_price_history = [
    (instance_type, get_region_price_history(instance_type, day_count))
        for instance_type in instance_types
]
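
The shape of the data returned by `get_region_price_history` is not shown in this notebook, but the plotting code later iterates over `(zone_name, price_history)` pairs where `price_history` has `dates` and `prices` lists. As a minimal sketch, assuming raw records shaped like the `SpotPriceHistory` entries returned by `aws ec2 describe-spot-price-history` and a timestamp format of `%Y-%m-%dT%H:%M:%S.%fZ`, grouping them into that shape might look like the following. The function name `group_price_history` and the sample records are hypothetical, and the real helper may work differently.

```python
from collections import defaultdict
from datetime import datetime

def group_price_history(records):
    """Group raw spot price records by availability zone, sorted by zone name."""
    zones = defaultdict(list)

    for record in records:
        # Assumed timestamp format; adjust it if your CLI version prints another one
        timestamp = datetime.strptime(record['Timestamp'], '%Y-%m-%dT%H:%M:%S.%fZ')
        zones[record['AvailabilityZone']].append((timestamp, float(record['SpotPrice'])))

    history = []

    for zone_name in sorted(zones):
        points = sorted(zones[zone_name])  # oldest observation first
        history.append((zone_name, {
            'dates': [date for date, price in points],
            'prices': [price for date, price in points],
        }))

    return history

# Hypothetical records, for illustration only
sample_records = [
    {'AvailabilityZone': 'us-east-1b', 'SpotPrice': '0.0100', 'Timestamp': '2017-03-02T00:00:00.000Z'},
    {'AvailabilityZone': 'us-east-1a', 'SpotPrice': '0.0080', 'Timestamp': '2017-03-02T06:00:00.000Z'},
    {'AvailabilityZone': 'us-east-1a', 'SpotPrice': '0.0070', 'Timestamp': '2017-03-01T00:00:00.000Z'},
]

sample_history = group_price_history(sample_records)
```

Keeping the result as a sorted list of pairs, rather than a dictionary, preserves a stable zone order for the plot legends.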

### Plot Price History

There are a lot of ways to plot the price history. Since it's essentially prices over time, you could also perform time series analysis on the spot instance prices.

For simplicity, though, we'll just plot the price trends over time relative to a target price, which is essentially how much you intend to pay per node in your cluster per hour of runtime. This will be used when plotting the graph to make it easier for you to decide on the instance type.

In [None]:
target_price = None

If you didn't choose one in the previous step, this will use the average of the per-instance target prices for the instance types that you've chosen. The per-instance target price is described in the previous notebook on spot instances, and is essentially a percentage of the on-demand price based on the instance size.

In [None]:
target_prices = get_target_prices()

if target_price is None:
    target_price = round(
        numpy.mean([
            target_prices[instance_type] for instance_type in instance_types
        ]), 3)

target_price

Cap the y-axis of the graph at the on-demand price for your largest instance type (because prices above that value are not meaningful), unless nothing in your price history comes close. Also add a dashed line at the target price so that you can see how long the prices stay below it.

In [None]:
on_demand_prices = get_on_demand_prices()

max_demand_price = max([
    on_demand_prices[instance_type] for instance_type in instance_types
])

instance_type_count = len(instance_types)
figure, subplots = pyplot.subplots(
    instance_type_count, figsize = (16, 3 * instance_type_count),
    sharex = True, sharey = True)

if not isinstance(subplots, numpy.ndarray):
    subplots = [subplots]

best_historic_date = None
best_historic_price = 0.0

# Create subplots for each of the instance types, and within each subplot, create a line
# graph representing the price history in each availability zone for that instance type

for i in range(instance_type_count):

    subplot = subplots[i]
    instance_type, instance_price_history = instances_price_history[i]

    subplot.set_title(instance_type)

    zone_names = []

    for zone_name, price_history in instance_price_history:
        zone_names.append(zone_name)

        min_historic_date = min(price_history['dates'])

        if best_historic_date is None:
            best_historic_date = min_historic_date
        else:
            best_historic_date = max(best_historic_date, min_historic_date)

        max_historic_price = max(price_history['prices'])
        best_historic_price = max(best_historic_price, max_historic_price)

        subplot.plot(price_history['dates'], price_history['prices'])

    box = subplot.get_position()
    subplot.set_position([box.x0, box.y0, box.width * 0.8, box.height])

    subplot.legend(
        zone_names, loc = 'center left', bbox_to_anchor = (1, 0.5),
        fancybox = True)

# Normalize the subplots so that you can meaningfully compare them relative to the target
# bid that you've set across instance types.

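for subplot in subplots:

    subplot.axhline(y = target_price, color = 'black', ls = 'dashed')

    # Start at the latest first observation so every zone has data across the x-axis
    subplot.set_xlim(left = best_historic_date)

    # Use the highest price seen across all zones and instance types (not just the
    # last one plotted), capped at the on-demand price
    best_ymax = max(best_historic_price, target_price) * 1.5
    best_ymax = min(best_ymax, max_demand_price)

    subplot.set_ylim(bottom = 0.0, top = best_ymax)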
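
The dashed line lets you judge by eye how long prices stay below the target. If you would rather have a number, a minimal sketch is to compute the share of the observed time span during which the price was at or below the target. This assumes each recorded price stays in effect until the next recorded change. The function name `fraction_below_target` and the sample values are hypothetical and are not part of the notebook's helper modules.

```python
from datetime import datetime

def fraction_below_target(dates, prices, target_price, end_date):
    """Return the share of the observed time span priced at or below the target.

    Each price is assumed to stay in effect until the next recorded change,
    and the last price until end_date. Dates must be sorted oldest first.
    """
    total_seconds = 0.0
    below_seconds = 0.0

    # Pair every price with the moment the next price takes over
    boundaries = list(dates[1:]) + [end_date]

    for start, end, price in zip(dates, boundaries, prices):
        seconds = (end - start).total_seconds()
        total_seconds += seconds

        if price <= target_price:
            below_seconds += seconds

    return below_seconds / total_seconds if total_seconds > 0 else 0.0

# Hypothetical price changes for a single availability zone
sample_dates = [datetime(2017, 3, 1), datetime(2017, 3, 2), datetime(2017, 3, 4)]
sample_prices = [0.010, 0.030, 0.012]

sample_fraction = fraction_below_target(
    sample_dates, sample_prices, 0.015, datetime(2017, 3, 5))
```

In this sample the price is below the target for two of the four days covered, so the function returns one half.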

## Confirm Pricing Choices

### Select Instance Type

Now that we've looked at the price history, we can choose our desired instance type.

In [None]:
desired_instance_type = 'm1.small'

We'll make sure that we didn't make a typo.

In [None]:
assert desired_instance_type in instance_types

If you are setting it up as a one-time test, then you should create a spot instance request (set the following variable to `False`) in order to limit the cost of running the instance.

If you wish to reuse this machine again in the future or if this is a time-consuming test where you do not want to risk early termination of the EC2 instance, you will need to be able to start and stop the instance. If that's the case, you should create an on-demand instance (set the following variable to `True`).

In [None]:
is_on_demand = False

### Confirm Instance Type

Note that `t2` instances cannot be requested as spot instances, so we will force `is_on_demand` to `True` if you chose that instance type.

In [None]:
if not is_on_demand and desired_instance_type.startswith('t2.'):
    is_on_demand = True

is_on_demand

If you haven't chosen an `image_id`, set your operating system. The code is able to figure out the proper `image_id` for Amazon Linux or Ubuntu, so specify one of the two.

In [None]:
linux_type = 'amazon'
#linux_type = 'ubuntu'

If you have not set an AMI, we will choose a default one based on your instance type. This choice will be made based on the virtualization options available for your instance type.

In [None]:
if image_id is None:
    virtualization_type = get_virtualization_type(desired_instance_type)
    image_id = get_default_image_id(virtualization_type, linux_type)

image_id

The following will confirm that your instance type can be created with the virtualization type required by your AMI.

In [None]:
ami_json = aws(
    'ec2', 'describe-images', '--image-id', image_id,
    '--region', region)

image = ami_json['Images'][0]
image_id = image['ImageId']

image_virtualization_type = image['VirtualizationType']
instance_virtualization_type = get_virtualization_type(
    desired_instance_type, image_virtualization_type)

assert image_virtualization_type == instance_virtualization_type

### Identify Availability Zone

If you are fixed to an availability zone due to an Amazon Elastic Block Store (EBS) volume, please set the availability zone containing this volume below.

In [None]:
desired_zone_name = None

If you did not set an availability zone, we will automatically select one based on the instance type that you selected.

In [None]:
if desired_zone_name is None:
    instance_price_history = None

    for candidate_type, candidate_price_history in instances_price_history:
        if candidate_type == desired_instance_type:
            instance_price_history = candidate_price_history
            break

    df, desired_zone_name = choose_availability_zone(
        desired_instance_type, instance_price_history, target_price)

    display(df)

desired_zone_name
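
The implementation of `choose_availability_zone` is not shown in this notebook. As a rough sketch of one possible selection rule, and not necessarily the one the helper uses, you could prefer zones whose peak price never exceeded the target and then take the lowest mean price among them. The function name `choose_zone_by_mean_price` and the sample history are hypothetical.

```python
def choose_zone_by_mean_price(instance_price_history, target_price):
    """Pick a zone name from (zone_name, price_history) pairs.

    Zones that never exceeded the target price are preferred; ties are broken
    by the lowest mean price. Every zone is assumed to have at least one price.
    """
    summaries = []

    for zone_name, price_history in instance_price_history:
        prices = price_history['prices']
        mean_price = sum(prices) / float(len(prices))
        exceeded_target = max(prices) > target_price
        summaries.append((exceeded_target, mean_price, zone_name))

    # Tuples compare element by element: False sorts before True, then lower mean first
    return min(summaries)[2]

# Hypothetical price history for a single instance type
sample_zone_history = [
    ('us-east-1a', {'prices': [0.008, 0.050]}),
    ('us-east-1b', {'prices': [0.010, 0.012]}),
    ('us-east-1c', {'prices': [0.011, 0.013]}),
]

sample_zone_name = choose_zone_by_mean_price(sample_zone_history, 0.015)
```

Here `us-east-1a` has the lowest single observation but spiked above the target, so the rule passes over it in favor of the steadier `us-east-1b`.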

### Confirm Bid Price

From our desired instance type, we will identify what our code has chosen for the bid price (essentially, it will always be the on-demand price). Note that the code below will throw an error if you made a typo in the name of your desired instance type.

In [None]:
bid_price = on_demand_prices[desired_instance_type]
bid_price

## Create a Placement Group

In order to ensure we have good networking in our cluster, we'll need to create a placement group.

* http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html

For convenience, we'll need to specify the name we'll use. This name should be unique per region.

In [None]:
placement_group_name = None

If the placement group doesn't exist, we'll create it.

In [None]:
placement_group = None

if placement_group_name is not None:
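    try:
        matching_placement_groups = aws(
            'ec2', 'describe-placement-groups', '--region', region,
            '--group-names', placement_group_name)
    except Exception:
        # The lookup fails when the group does not exist yet
        matching_placement_groups = None

    # Check the list inside the response rather than the response itself
    if not matching_placement_groups or not matching_placement_groups.get('PlacementGroups'):
        aws(
            'ec2', 'create-placement-group', '--region', region,
            '--group-name', placement_group_name, '--strategy', 'cluster')

        matching_placement_groups = aws(
            'ec2', 'describe-placement-groups', '--region', region,
            '--group-names', placement_group_name)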

    placement_group = matching_placement_groups['PlacementGroups'][0]

placement_group

We can only use a placement group for certain instance types, as described in the documentation. Let's confirm that we have a valid type.

In [None]:
# Only certain instance families support placement groups

unavailable_instance_families = set([
    'c1', 'm1', 'm2', 'm3', 't1', 't2'
])

desired_instance_family = desired_instance_type.split('.')[0]
is_placement_group_instance_family = desired_instance_family not in unavailable_instance_families

# Only large and higher instance types support placement groups

desired_instance_subtype = desired_instance_type.split('.')[1]
is_placement_group_instance_subtype = 'large' in desired_instance_subtype

# Find out if it's supported

is_placement_group_allowed = is_placement_group_instance_family and is_placement_group_instance_subtype
is_placement_group_allowed

## Size the Cluster

### Specify Cluster Size

Now that you've identified what kind of node you want, the next step is to identify how many you want. A single spot request can generate more than one node, and you're probably using this notebook to understand how to configure a cluster that has more than one node.

In [None]:
cluster_size = 1

### Specify Volume Size

Cost of storage increases linearly with the number of nodes, so you may want to limit your volume size. However, chances are you have a large number of nodes because the amount of data you need to process is larger than what can fit on a standard node.

To avoid having things crash due to lacking disk space, increase this to whatever you need for your workload. Note that if you are using a `c3` or `m3` instance and you do not need the data to persist across server restarts, local storage on the machine may be able to offset some of your needs.

In [None]:
volume_size = 20
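
To make the linear growth concrete, here is a small worked estimate. The rate of $0.10 per GB-month is an assumption used only for illustration, so check the current EBS pricing for your region and volume type. The function name `monthly_storage_cost` is hypothetical.

```python
def monthly_storage_cost(cluster_size, volume_size, price_per_gb_month):
    """Estimate the monthly storage cost in dollars for identically sized volumes."""
    return cluster_size * volume_size * price_per_gb_month

# Assumed rate, for illustration only
assumed_price_per_gb_month = 0.10

single_node_cost = monthly_storage_cost(1, 20, assumed_price_per_gb_month)
ten_node_cost = monthly_storage_cost(10, 20, assumed_price_per_gb_month)
```

Under that assumed rate, a single node with a 20 GB volume costs about $2 per month in storage, and a ten-node cluster costs about $20.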

## Confirm the Request

We now have all the details we need in order to issue our spot request.

In [None]:
request_specification = {
    'ImageId': image_id,
    'KeyName': private_key_name,
    'SecurityGroups': security_group_names,
    'InstanceType': desired_instance_type,
    'Placement': {
        'AvailabilityZone': desired_zone_name
    },
    'BlockDeviceMappings': get_block_devices(image_id, desired_instance_type, volume_size)
}

if instance_profile_arn is not None:
    request_specification['IamInstanceProfile'] = {
        'Arn': instance_profile_arn
    }

if is_placement_group_allowed and placement_group is not None:
    request_specification['Placement']['GroupName'] = placement_group_name

if is_on_demand:
    request_specification['DisableApiTermination'] = True
    request_specification['InstanceInitiatedShutdownBehavior'] = 'stop'

## Issue the Request

We'll make a request for our application, and to distinguish it from other requests that are cached, we'll name it `app`.

In [None]:
if is_on_demand:
    app_request = OnDemandInstanceRequest('app')
else:
    app_request = SpotInstanceRequest('app')

app_request.request(bid_price, cluster_size, request_specification)
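
The `SpotInstanceRequest` class comes from the notebook's helper modules and its implementation is not shown here. As a sketch of what a one-time spot request could look like when issued through the AWS CLI directly, the following builds the argument list for `aws ec2 request-spot-instances` without running it. The function name `build_spot_request_arguments`, the placeholder AMI ID, and the sample price are hypothetical.

```python
import json

def build_spot_request_arguments(bid_price, cluster_size, request_specification):
    """Build the argument list for `aws ec2 request-spot-instances`."""
    return [
        'ec2', 'request-spot-instances',
        '--spot-price', str(bid_price),
        '--instance-count', str(cluster_size),
        '--type', 'one-time',
        # The launch specification is passed as a JSON document
        '--launch-specification', json.dumps(request_specification, sort_keys=True),
    ]

# Placeholder values, for illustration only
sample_specification = {
    'ImageId': 'ami-00000000',
    'InstanceType': 'c3.large',
}

sample_arguments = build_spot_request_arguments(0.105, 2, sample_specification)
```

The resulting list could be passed to a wrapper such as the notebook's `aws` function, assuming it forwards its arguments to the CLI unchanged.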

## Confirm Fulfillment

Here, we make sure that the request has been fulfilled and that the instances are accessible by installing `awscli` on all machines.

In [None]:
app_instances = app_request.get_fulfilled()
app_host_names = [instance['PublicDnsName'] for instance in app_instances]

if linux_type == 'ubuntu':
    install_awscli('ubuntu', app_host_names)
    extra_storage('ubuntu', app_host_names)
else:
    install_awscli('ec2-user', app_host_names)
    extra_storage('ec2-user', app_host_names)

app_host_names

### Internal Host Names

Many applications provide you with their private host name in Amazon's internal network rather than their external public host names. If you are not using SSH tunneling, this is problematic.

To alleviate this, update your hosts file to point the internal host names to the external IP addresses, though you will need to clean out this file every time you create a new cluster.

Run the following lines to find out what you would need to add to your `/etc/hosts` file on Linux and Mac OS X or `C:\Windows\System32\drivers\etc\hosts` on Windows.

In [None]:
if assume_ssh_tunnel:
    print('You will need to use SSH tunneling to reach the services on this instance')
else:
    hosts_entries = [
        instance['PublicIpAddress'] + '\t' + instance['PrivateDnsName']
            for instance in app_instances
    ]

    print('\n'.join(hosts_entries))