Pangeo Billing Analysis
=======================

This is a simple notebook demonstrating how to access the billing logs for the Pangeo GCP account.
The analysis below investigates the per-cluster costs of the Kubernetes clusters running on GCP.
The data is stored in Google BigQuery, and we access the tables directly using pandas-gbq.

In [None]:
%matplotlib inline

import pydata_google_auth
import pandas as pd
import pandas_gbq

import matplotlib.pyplot as plt

projectid = "pangeo-181919"
table = 'pangeo-181919.pangeo_kubernetes_logs.gcp_billing_export_v1_016C8D_761AEE_B0C379'

#### Authenticate to GCP
We authenticate explicitly through a browser-based flow and assign the resulting credentials to the `pandas_gbq` context. You will likely need to open a link in your browser and return with an authentication code.

In [None]:
credentials = pydata_google_auth.get_user_credentials(
    ['https://www.googleapis.com/auth/cloud-platform'],
)
pandas_gbq.context.credentials = credentials

pandas-gbq lets us send queries directly to BigQuery and returns the result as a pandas DataFrame. Below we extract the full table, but you could change the query to extract only a subset of the records.

In [None]:
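data_frame = pandas_gbq.read_gbq(f'SELECT * FROM `{table}`',
                                 project_id=projectid,
                                 dialect='standard')
# some minor data cleaning: index by usage start time, convert to UTC and drop
# the timezone, then sort so that date-based slicing works later on
df = data_frame.set_index('usage_start_time').tz_convert(None).sort_index()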
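As a rough sketch of extracting only a subset of the records, the query can name the columns it needs and filter on a date range. The helper `build_billing_query` and its parameters are hypothetical names introduced here for illustration. The column names are the ones this notebook already uses (`usage_start_time`, `cost`, `labels`), and the date bounds are just an example.

```python
def build_billing_query(table, start, end,
                        columns=("usage_start_time", "cost", "labels")):
    """Hypothetical helper: build a standard-SQL query for selected columns
    and a half-open time range [start, end)."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} FROM `{table}` "
        f"WHERE usage_start_time >= TIMESTAMP('{start}') "
        f"AND usage_start_time < TIMESTAMP('{end}')"
    )


query = build_billing_query(
    "pangeo-181919.pangeo_kubernetes_logs.gcp_billing_export_v1_016C8D_761AEE_B0C379",
    start="2019-04-01",
    end="2019-05-01",
)
print(query)
# To run it, pass `query` to read_gbq in place of the SELECT * string,
# keeping project_id and dialect='standard' unchanged.
```

Selecting fewer columns and rows reduces the amount of data BigQuery has to scan and the size of the DataFrame held in memory.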

In [None]:
display(df.head())

### Daily Costs

In the cell below, we first calculate the total daily cost of all GCP services, then we plot the results. As you can see, we started paying closer attention to our burn rate in January 2019 and made significant improvements over the next 4 months.

In [None]:
df.cost.resample('1D').sum().plot()
plt.title('Daily GCP Costs -- All Services')
plt.ylabel('Cost (USD)')

### Group costs by cluster

We have been running a number of Kubernetes clusters, mostly hosting JupyterHubs but also our public BinderHub deployment. In March 2019, we gave each of these clusters a label so we could better track their relative and absolute costs.

In [None]:
def get_cluster(items):
    '''helper function to extract cluster label'''
    d = {i['key']: i['value'] for i in items}
    
    return d.get('cluster', 'none')
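To illustrate what the helper does, here is a small self-contained sketch that repeats `get_cluster` and applies it to hand-written label lists. The label values (`example-hub`, `env`, `prod`) are made up for illustration. The only assumption is that each billing row carries a list of `key`/`value` records, as the helper itself expects.

```python
def get_cluster(items):
    '''helper function to extract cluster label'''
    d = {i['key']: i['value'] for i in items}
    return d.get('cluster', 'none')


# A row whose labels include a cluster name (values are hypothetical)
labelled = [{'key': 'cluster', 'value': 'example-hub'},
            {'key': 'env', 'value': 'prod'}]
# A row with labels but no 'cluster' key, and a row with no labels at all
unlabelled = [{'key': 'env', 'value': 'prod'}]
empty = []

print(get_cluster(labelled))    # example-hub
print(get_cluster(unlabelled))  # none
print(get_cluster(empty))       # none
```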

For this example, we want to determine how much we spent on each of our individual Kubernetes clusters in April 2019. We use the `get_cluster` helper function to extract the cluster label and then a pandas groupby to find the total cost per cluster for the month. We drop the `'none'` label because it corresponds to costs other than Kubernetes clusters (e.g. cloud storage).

In [None]:
# time range -- you can change this if you want!
# note: label-based slicing is inclusive, so this range also covers all of 2019-05-01
tslice = slice('2019-04-01', '2019-05-01')
# get the clusters
clusters = df.loc[tslice].labels.map(get_cluster)
# groupby cluster and sum over time
cluster_costs = df.loc[tslice]['cost'].groupby(clusters).sum().drop(index='none')
display(cluster_costs)
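The same label-then-groupby pattern can be checked on a tiny synthetic table. This is only a sketch: the cluster names (`hub-a`, `hub-b`) and costs are invented, and the toy frame has just the two columns the analysis uses.

```python
import pandas as pd


def get_cluster(items):
    '''helper function to extract cluster label'''
    d = {i['key']: i['value'] for i in items}
    return d.get('cluster', 'none')


# Synthetic billing rows (all values are made up for illustration)
toy = pd.DataFrame(
    {
        'cost': [1.0, 2.5, 4.0, 0.5],
        'labels': [
            [{'key': 'cluster', 'value': 'hub-a'}],
            [{'key': 'cluster', 'value': 'hub-b'}],
            [{'key': 'cluster', 'value': 'hub-a'}],
            [],  # unlabelled cost, e.g. storage
        ],
    },
    index=pd.to_datetime(['2019-04-01', '2019-04-02',
                          '2019-04-03', '2019-04-04']),
)

# Map each row to its cluster label, then sum costs per label
toy_clusters = toy['labels'].map(get_cluster)
toy_costs = toy['cost'].groupby(toy_clusters).sum().drop(index='none')
print(toy_costs)
```

Here `hub-a` sums to 5.0 and `hub-b` to 2.5, while the unlabelled row is grouped under `'none'` and then dropped.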

Finally, we plot these results using pandas/matplotlib. As you can see, in April 2019, we had 3 clusters that cost about 150 USD each to keep running, and 3 additional clusters that cost between 645 USD and 1084 USD.

In [None]:
cluster_costs.sort_values().plot.bar()
plt.ylabel('Cost (USD)')
plt.title(f'Pangeo Kubernetes Costs ({tslice.start} - {tslice.stop})')