# `ec2` spot price history

## initialization

In [None]:
import datetime

import boto3
import pandas as pd

## preamble: what is spot pricing

when we created our `ec2` instances, we went with the "free tier" option, so we didn't care much about what the instance we built cost. That's not always an option -- for example, suppose you want to have a computer with more than one processor and more than 1 GiB of memory (not an unreasonable ask!). you might end up spending quite a bit for an *on-demand* instance -- one you can access any time you want -- with better specs.

check out the pricing information [here](https://aws.amazon.com/ec2/pricing/on-demand/) for more info about these costs.

there are other options, however:

1. reserved instances: you pay up front to have one machine with some properties *for as long as you want*. the up-front cost is high, but the day-to-day cost is much lower
2. spot prices: you agree to pay *up to a certain price*. you can use your instance up until the market demand for that resource passes your limit price, after which point your machine becomes unavailable to you until the market price goes back down.

spot pricing provides an interesting opportunity: suppose I want to have a very powerful machine, but I don't anticipate I will use it often, and I'm okay if it is unavailable from time to time (a good example: distributed computing with "worker nodes" in a hadoop environment). maybe I can save a lot of money by picking a price much lower than the "on-demand" price?

let's download the timeseries of spot prices to get an idea about just what type of savings are possible.

## downloading spot price information

we know from [the `ec2 api` documentation](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Welcome.html) that we can get spot pricing via the `aws api` using the [`DescribeSpotPriceHistory`](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeSpotPriceHistory.html) endpoint.

let's use the `boto3 python` library to hit up that `api` and do a bit of analysis.

in the following, we will go through the steps required to download spot price history and load it into a dataframe. once we have an end-to-end process set up, we'll use the intermediate steps to create a single function that we can use to pull down spot pricing programmatically and repeatedly.

### setup

first things first: let's create a `boto session` object named `session`

In [None]:
session = boto3.session.Session()

### investigate the `ec2 resource`

as we mentioned in lecture, it is always preferable to use the `ec2` `resource` whenever possible. following that guideline, let's use the `session.resource` method of the `session` object we created above to build an `ec2` resource object, and explore the methods of that resource to see if there is anything `spot`-related

In [None]:
ec2 = session.resource('ec2')

In [None]:
assert isinstance(ec2, boto3.resources.factory.ServiceResource)

In [None]:
dir(ec2)

I could read through that list looking for the word "spot", but I'm *pretty* lazy. I'll let `python` do it for me. 

for example, if I wanted to find any element in the `dir` of `ec2` which involves `vpc`s, I could do:

In [None]:
[elem for elem in dir(ec2) if 'vpc' in elem.lower()]

In [None]:
assert len([elem for elem in dir(ec2) if 'vpc' in elem.lower()]) == 8

do the same, but for things related to `'spot'` pricing:

In [None]:
[elem for elem in dir(ec2) if 'spot' in elem.lower()]

huh... nothing.

at this point, I would crack open [the `boto3 ec2 resource`](http://boto3.readthedocs.io/en/latest/reference/services/ec2.html) documentation to see if there is any discussion of `spot` pricing. there is, but only associated with `client` objects -- not `resource` objects.

whenever we

1. can't make a `resource` object for a particular service, or
2. can't use a `resource` method or attribute to obtain the information we desire

then and only then we will use an `ec2 client` and an exact endpoint instead

### `ec2 client` function

use the `session.client` method of our `session` object to create a *new* `ec2` item

In [None]:
ec2 = session.client('ec2')

In [None]:
import botocore
assert isinstance(ec2, botocore.client.BaseClient)

#### exploring `ec2`'s `describe_spot_price_history`

as with all `boto` clients, the member functions of the `ec2` client we just created are in a one-to-one mapping with the available `api` endpoints. We are looking to describe spot price history, and sure enough:

In [None]:
ec2.describe_spot_price_history?

it looks like it accepts many parameters, but there are a few that jump out at me:

##### `StartTime` and `EndTime`

we can pull all spot prices between these two `datetime` objects

##### `InstanceTypes`

we can select spot pricing for various instance types

##### `MaxResults` and `NextToken`

if you read the documentation, you'll see that the response you get from this request will not be *all* spot prices in your time window, but the first `MaxResults` items. The returned message will also give you a `NextToken`, which you can use on your *next* call to the `api` to effectively say "I've already received `MaxResults` records, so start there".

this arrangement -- where you receive information in chunks and have to keep track of which chunk you last received and which you need next -- is often called "pagination" (because you are receiving data one "page" at a time).

let's look at the spot prices for the month of September of this year, and for a particularly beefy instance type: [`m4.16xlarge`](https://aws.amazon.com/ec2/pricing/on-demand/). this instance type has 64 virtual `cpu`s, 256 `GiB` of memory, and usually costs about $3.20 per hour.

In [None]:
t0 = datetime.datetime(2018, 9, 1)
t1 = t0 + datetime.timedelta(days=30)
instancetypes = ['m4.16xlarge']

In [None]:
assert t0.year == datetime.datetime.now().year

In [None]:
resp = ec2.describe_spot_price_history(
    StartTime=t0,
    EndTime=t1,
    InstanceTypes=instancetypes
)
resp

this response contains 1000 prices:

In [None]:
prices = resp['SpotPriceHistory']
len(prices)

the `response` also contains a `NextToken` value, which is a way the responding server tells us that 1000 prices is only a chunk of the full set of prices, and we are not finished yet:

In [None]:
resp['NextToken']

we are supposed to use this token value in a "next" `request` to the `api` to "pick up where we left off" -- more on that just below.

returning to the prices, the first price in that list of `response` prices is

In [None]:
prices[0]

#### pagination and `paginator`s

at this point, we could create a loop which would take the `NextToken` from one request and use it in the next request, and that would be perfectly fine.
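as a rough sketch of what that loop could look like: the `collect_spot_prices` helper below keeps calling an endpoint and feeding each `NextToken` into the next request until the token comes back empty. so that it runs without `aws` credentials, it is demonstrated against `fake_describe`, a hypothetical stand-in for `ec2.describe_spot_price_history` that serves five made-up records two at a time. both names are illustrations, not part of `boto3`.

```python
def collect_spot_prices(describe, **kwargs):
    """call `describe` repeatedly, following NextToken until it runs out"""
    prices = []
    token = None
    while True:
        # only send NextToken once the server has given us one
        extra = {'NextToken': token} if token else {}
        resp = describe(**kwargs, **extra)
        prices.extend(resp['SpotPriceHistory'])
        token = resp.get('NextToken')
        if not token:
            return prices


# hypothetical stand-in for ec2.describe_spot_price_history:
# 5 fake records, delivered 2 per "page"
FAKE_RECORDS = [{'SpotPrice': str(1.0 + i / 10)} for i in range(5)]


def fake_describe(NextToken=None, **kwargs):
    start = int(NextToken) if NextToken else 0
    end = start + 2
    resp = {'SpotPriceHistory': FAKE_RECORDS[start:end]}
    # an empty token signals "no more pages"
    resp['NextToken'] = str(end) if end < len(FAKE_RECORDS) else ''
    return resp


all_prices = collect_spot_prices(fake_describe, InstanceTypes=['m4.16xlarge'])
print(len(all_prices))
```

with a real client you would pass `ec2.describe_spot_price_history` as `describe`, along with the same `StartTime`, `EndTime`, and `InstanceTypes` arguments used above.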

however, because this paradigm and process is so common, `boto3` has implemented a special way of handling that via `paginator` objects. a `paginator` is effectively a wrapper around a given endpoint (here, e.g., `describe_spot_price_history`) which handles this `NextToken` iteration logic for us -- pretty cool.

let's create a `paginator` for this `api` endpoint and look at the first two pages

In [None]:
ec2.get_paginator?

In [None]:
paginator = ec2.get_paginator('describe_spot_price_history')
help(paginator)

note that the only method available to `paginator` is `paginate`. the documentation for that `paginate` method is effectively just the documentation for the `describe_spot_price_history` api, except that the discussion about `NextToken` has dropped out.

the `paginate` method creates an iterator object:

In [None]:
pageiter = paginator.paginate(
    StartTime=t0,
    EndTime=t1,
    InstanceTypes=instancetypes
)
pageiter

#### iterating through spot price pages

`pageiter` is *an `iterator`*, so whenever we use it we will be looping through it:

```python
for page in pageiter:
    # do something ...
```

it also means that it is *stateful*, so we should re-create it each time we want to use it. in this sense it is very different from a list (which you could iterate through as many times as you want). you can't re-use an iterator after it's been used once because *internally* it thinks it has done all of the work it can, and will no longer return anything.
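to see the general `python` behavior being described, here is a small sketch using a plain generator (`make_pages` is a made-up name, and this is not the `boto3` object itself). it only illustrates how single-use iterators differ from lists.

```python
def make_pages():
    # a generator function: each call returns a fresh, single-use iterator
    for page_number in (1, 2, 3):
        yield page_number


pages = make_pages()
first_pass = list(pages)    # consumes the iterator
second_pass = list(pages)   # nothing left to yield

as_list = [1, 2, 3]
list_first = list(as_list)
list_second = list(as_list)  # a list can be walked again and again

# next() pulls a single item out of a fresh iterator
first_item = next(make_pages())

print(first_pass, second_pass, list_second, first_item)
```

calling `make_pages()` again is the generator equivalent of re-creating the page iterator before each use.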

It is possible to get the first item in any iterator by beginning to iterate through it (in a `for` loop) and immediately `break`ing.

In [None]:
for page in pageiter:
    break
page

so, the individual elements in the `paginator.paginate` iterator are effectively identical to the regular `api` endpoint response items -- cool.

let's practice using one of these iterators one time, just printing how many prices we get for each page

In [None]:
# we create a new iterator every time
pageiter = paginator.paginate(
    StartTime=t0,
    EndTime=t1,
    InstanceTypes=instancetypes
)

for page in pageiter:
    print(len(page['SpotPriceHistory']))

1,551 spot prices were registered in September and delivered to us in 1,000-price chunks.

we can use a list comprehension to load all of those items into a single list of dictionaries

In [None]:
# we have to create a new iterator every time
pageiter = paginator.paginate(
    StartTime=t0,
    EndTime=t1,
    InstanceTypes=instancetypes
)

pricehistory = [
    price
    for page in pageiter
    for price in page['SpotPriceHistory']
]

len(pricehistory)

In [None]:
pricehistory[:3]

### loading a list of `dict` items into a `pandas` dataframe

it just so happens that lists of dictionary items (such as the `pricehistory` list we just created) are one of the most basic input structures for a `pandas` dataframe:

In [None]:
df = pd.DataFrame(pricehistory)

In [None]:
assert df.shape == (1551, 5)

print out the first 10 rows:

In [None]:
df.head(10)

there's only one weird thing going on here, and it's not immediately obvious from the above. let's look at the `dtypes` (data types) attribute of our data frame:

In [None]:
df.dtypes

the `dtype` of the `SpotPrice` column is "object", which is the `dtype` of *strings*, not numbers, in `pandas`. for more details on `dtypes`, please refer to [the `pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes)

the reason these items appear as `object` and not `float` is because the `api` returns them as quoted strings.

let's fix this.

recall from the last homework that whenever we want to update a column in a dataframe we should use the `df.loc` indexer:

```python
df.loc[:, 'column_to_be_updated'] = newvalues
```

let's use the `astype` method of the `df.SpotPrice` column to replace the string values in column `SpotPrice` with `float` values:

In [None]:
df.loc[:, 'SpotPrice'] = df.SpotPrice.astype('float')

In [None]:
assert df.dtypes.SpotPrice == 'float64'
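one caveat, offered as a hedge rather than a rule: on some recent `pandas` versions, a whole-column `.loc[:, col] = ...` assignment may write the new values into the existing column and keep its old `dtype`, in which case the assertion above would fail. if that happens, assigning the column by name is an alternative. the sketch below uses made-up records shaped like the `api` response, and `pd.to_numeric` in place of `astype`.

```python
import pandas as pd

# hypothetical records shaped like the api response: prices are strings
records = [
    {'AvailabilityZone': 'us-east-1a', 'SpotPrice': '1.021300'},
    {'AvailabilityZone': 'us-east-1b', 'SpotPrice': '0.998700'},
]
toy = pd.DataFrame(records)
before = toy['SpotPrice'].dtype   # a string-like dtype, not a number

# assigning the whole column by name replaces it, dtype included
toy['SpotPrice'] = pd.to_numeric(toy['SpotPrice'])
after = toy['SpotPrice'].dtype

print(before, after)
```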

## putting it all together into a function

given all the various steps we took above, let's build a function to create a dataframe object with spot prices on an arbitrary set of instance types and between arbitrary start and end times.

In [None]:
PROD_DESC_DEFAULTS = ['SUSE Linux', 'Linux/UNIX', 'Windows']

def spot(t0, t1, instancetypes, productdescriptions=None):
    """return a dataframe of spot prices between t0 and t1"""
    # the describe_spot_price_history endpoint has a ProductDescriptions
    # parameter which allows users to filter down the types of products they
    # would like returned. this line sets a default list of products in case
    # the user provides none (will be used in the paginate call below)
    productdescriptions = productdescriptions or PROD_DESC_DEFAULTS

    # construct a boto3 session
    session = boto3.session.Session()

    # construct an ec2 client object from that session
    ec2 = session.client('ec2')

    # create a paginator object
    paginator = ec2.get_paginator('describe_spot_price_history')

    # loop through every page the paginator returns, flatten the
    # spot price dictionaries into a single list, and
    # create a dataframe from that list of dicts
    df = pd.DataFrame([
        px
        for page in paginator.paginate(
            StartTime=t0,
            EndTime=t1,
            InstanceTypes=instancetypes,
            ProductDescriptions=productdescriptions
        )
        for px in page['SpotPriceHistory']
    ])

    # convert the elements of column `SpotPrice` to `float`s. assigning
    # the column by name replaces it outright, dtype included
    df['SpotPrice'] = df.SpotPrice.astype(float)
    return df

above we saw that there were 1551 prices in the month of September -- let's confirm that our function creates a dataframe with that many items as well

In [None]:
df = spot(t0=t0, t1=t1, instancetypes=instancetypes)
assert df.shape[0] == 1551

## downloading and investigating data

use your `spot` function above to create a dataframe `df60` with the spot pricing for the 60 days leading up to October 31st for the same instance type (`m4.16xlarge`)

In [None]:
t1 = datetime.datetime(2018, 11, 1)
t0 = t1 - datetime.timedelta(days=60)
instancetypes = ['m4.16xlarge']

df60 = spot(t0, t1, instancetypes)

In [None]:
df60.head()

In [None]:
df60.ProductDescription.unique()

In [None]:
df60.shape

we can use the `df60.groupby` method to group records by `AvailabilityZone`, and then perform aggregation calculations on the `SpotPrice` values within those groups using the `agg` method:

In [None]:
help(df60.groupby)

In [None]:
df60.groupby('AvailabilityZone').SpotPrice.agg(['min', 'max', 'mean', 'std'])

use the `df60.groupby` method to group records based on both `ProductDescription` and `AvailabilityZone` (hint: look at the examples in the documentation above), and perform the same aggregation calculations on the `SpotPrice` column within those groups

In [None]:
g = df60.groupby(['ProductDescription', 'AvailabilityZone']).SpotPrice.agg(['min', 'max', 'mean', 'std'])
g

In [None]:
assert g.shape == (18, 4)

## plotting data

if you were able to do the above, one thing you will have noticed is that the average spot price for different product descriptions was *hugely* different (Windows machines cost around `$`4, but Linux options cost around $1), and within a given `ProductDescription` (e.g. just Linux/UNIX machines) there was even a fair amount of variation from availability zone to availability zone.

what if one availability zone is always cheaper than another? what if the pricing is cyclical, so that some time of day is always cheapest? that'd be good to know!

let's plot those different timeseries values and see if anything pops out.

first, let's limit ourselves to just Linux/UNIX machines. create a dataframe `dflinux` which has only the elements of `df60` above where the `ProductDescription` is `"Linux/UNIX"`

In [None]:
dflinux = df60[df60.ProductDescription == 'Linux/UNIX']

In [None]:
assert dflinux.ProductDescription.nunique() == 1

now, let's set up our plotly graphing

In [None]:
import plotly.graph_objs as go
import plotly.offline

plotly.offline.init_notebook_mode()

we'd like to create a separate line plot for each availability zone's timeseries.

To do this, we can *iterate* through `groupby` objects using the following syntax:

```python
for (grpIndexValues, groupRecordsDataframe) in df.groupby('myGroupbyColumn'):
    # do something with the common index
    # do something with the records for that common index
```

to get an idea of what the `grpIndexValues` and `groupRecordsDataframe` in the above look like, we can do the following:

In [None]:
for (idx, grp) in dflinux.groupby('AvailabilityZone'):
    break

idx

In [None]:
grp.head()

In [None]:
grp.shape

so in availability zone `us-east-1a` we have 180 price records. the `groupby` iteration gives us the index that defines this group (the availability zone, `us-east-1a`), and also the "chunk" of records (as a dataframe) that is all of the records in `dflinux` with that availability zone.

as an example where we do things with the index and group chunks:

In [None]:
for (idx, grp) in dflinux.groupby('AvailabilityZone'):
    print(idx)
    print(grp.shape)
    print(grp.SpotPrice.mean())
    print()

we can use this fact to create a different `plotly` [line object using the `Scatter` method](https://plot.ly/python/line-and-scatter/#line-and-scatter-plots) for each `AvailabilityZone` value.

In [None]:
data = [
    go.Scatter(
        x=grp.Timestamp,
        y=grp.SpotPrice,
        name=idx
    )
    for (idx, grp) in dflinux.groupby('AvailabilityZone')
]

In [None]:
# compare against dflinux, the dataframe the traces were built from
assert len(data) == dflinux.AvailabilityZone.nunique()
assert {_.name for _ in data} == set(dflinux.AvailabilityZone.unique().tolist())

now, we render the plot!

In [None]:
layout = go.Layout(
    title="Spot pricing of Linux machines in us-east-1",
    xaxis={'title': 'time (UTC)'},
    yaxis={'title': 'price (USD)', 'range': [0.9, 1.3]}
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)

it is also helpful to look at the distributions of prices as box plots. use the `go.Box` object in `plotly` to create `Box` data traces in the same way we did for the `Scatter` objects above:

In [None]:
data = [
    go.Box(
        x=grp.SpotPrice,
        name=idx,
    )
    for (idx, grp) in dflinux.groupby('AvailabilityZone')
]

In [None]:
layout = go.Layout(
    title="Spot pricing of Linux machines in us-east-1",
    xaxis={'title': 'price (USD)'},
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)
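the earlier question about whether pricing is cyclical can also be checked numerically rather than by eye. here is a minimal sketch, assuming a dataframe with `Timestamp` and `SpotPrice` columns like the ones above. the four records in `toy` are made up for illustration, and in practice you would run the same `groupby` on one availability zone's records.

```python
import pandas as pd

# hypothetical price records; the real ones come from the spot function
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2018-10-01 02:15', '2018-10-01 14:40',
        '2018-10-02 02:05', '2018-10-02 14:55',
    ]),
    'SpotPrice': [1.00, 1.20, 1.02, 1.22],
})

# average price per hour of day (0-23), cheapest hour first
by_hour = (
    toy.groupby(toy['Timestamp'].dt.hour)['SpotPrice']
    .mean()
    .sort_values()
)
cheapest_hour = by_hour.index[0]
print(by_hour)
```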

## choosing your favorite option

finally, based on what's above, just think about the following questions:

1. which availability zone would you pick *just based on this 60-day snapshot of price history*?
1. which would you definitely not?
1. roughly how much would you save on a machine you had running for one month if you used spot pricing in that availability zone vs. the quoted price online of $3.20?
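for the last question, the arithmetic can be sketched as follows. the $3.20 on-demand price comes from the text. the 30-day, always-on month and the `example_spot` price are assumptions made for illustration, so substitute the mean you actually computed for your chosen availability zone.

```python
ON_DEMAND_PER_HOUR = 3.20       # quoted on-demand price from the text
HOURS_PER_MONTH = 24 * 30       # assumption: a 30-day month, always on


def monthly_savings(spot_per_hour):
    """USD saved per month by paying spot_per_hour instead of on-demand"""
    return (ON_DEMAND_PER_HOUR - spot_per_hour) * HOURS_PER_MONTH


# hypothetical average spot price; replace with your own number
example_spot = 1.00
savings = monthly_savings(example_spot)
print(round(savings, 2))
```

this ignores any hours during which the spot instance would have been unavailable, so treat the result as an upper bound on what you would actually pay less for.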

# Exam stuff

In [None]:
t0 = datetime.datetime(2017, 9, 1)
t1 = datetime.datetime(2017, 10, 1)
instancetypes = ['m4.16xlarge']
prods = ['Linux/UNIX']

dfnix = spot(t0, t1, instancetypes, productdescriptions=prods)

In [None]:
dfnix.head()

In [None]:
dfnix.shape