
# **Slot Machine**<br>
If you use BigQuery with the on-demand billing model, where you are billed by bytes scanned, it may be a good idea to switch to the Editions model, where you are billed for the slots you use.

There are two quite complicated questions, though:


1.   Will moving to Editions save me money?
2.   How many slots should I reserve, and what is the optimal max_slots value?

There is no way to be 100% accurate, but this set of queries aims to reduce the guesswork and guide you toward better answers to the two questions above.





**Usage**<br>
You should run this in the project you want to test.
The process creates several tables and views. If any of them already exists, it stops with an error, since we don't want to accidentally delete a user's important tables.

**Step 1**<br>
First we set some parameters such as region, timeframe and BigQuery edition. We also detect the current project id.

Here is a short description of the parameters:

- **region**: The region where your dataset is located. INFORMATION_SCHEMA is a regional data source.
- **dataset**: The dataset where you want the tables and views related to Slot Machine to be created. It's a good idea to create a dedicated dataset so we won't accidentally overwrite a production table or view.
- **start and end timestamp**: We analyze the behavior in a limited timeframe. It should be long enough to contain all regular activity, so it should cover at least a few days.
- **edition**: BigQuery compute has three editions when using reservations, which are priced differently. Choosing the edition affects total price and available features. You can read more [here](https://cloud.google.com/bigquery/docs/editions-intro).
- **verbose**: If true, Slot machine will print out the queries it runs plus additional interim data. If false it will operate quietly without printing out every query and interim result.

In [None]:
import google.auth
project_id = google.auth.default()[1]
region = "US" # @param {"type":"string", "placeholder": "Enter dataset region"}
dataset = "test2" # @param {"type":"string", "placeholder": "Enter the target dataset where objects will be created"}
start_timestamp = "2024-09-23" # @param {"type":"date"}
end_timestamp = "2024-09-30" # @param {"type":"date"}
edition = "Standard" # @param ["Standard", "Enterprise", "Enterprise plus"]
verbose = True # @param {"type":"boolean"}
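
Before running the later steps, it can help to sanity-check these parameters. The following is a minimal sketch, not part of the original notebook: `validate_params` and `VALID_EDITIONS` are hypothetical names, and the check assumes the same `YYYY-MM-DD` date format used above.

```python
from datetime import datetime

# Edition names as offered by the notebook's dropdown (assumed to be the full list).
VALID_EDITIONS = ("Standard", "Enterprise", "Enterprise plus")

def validate_params(start_timestamp, end_timestamp, edition, date_format="%Y-%m-%d"):
    """Return the timeframe length in hours, or raise ValueError on bad input."""
    start = datetime.strptime(start_timestamp, date_format)
    end = datetime.strptime(end_timestamp, date_format)
    if end <= start:
        raise ValueError("end_timestamp must be later than start_timestamp")
    if edition not in VALID_EDITIONS:
        raise ValueError("unknown edition: " + edition)
    return (end - start).total_seconds() / 3600

print(validate_params("2024-09-23", "2024-09-30", "Standard"))  # 7 days -> 168.0
```

Failing early here is cheaper than discovering a reversed date range after several tables have already been created.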

**Step 2**<br>
Here we import some Python packages and create the dataset if it does not already exist.

In [None]:
import pandas as pd
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()
my_dataset = bigquery.Dataset(project_id+"."+dataset)
my_dataset.location = region


try:
    client.get_dataset(project_id+"."+dataset)
    print("Dataset {} already exists".format(dataset))
except NotFound:
    print("Dataset {} is not found".format(dataset), "creating it.")
    dataset_object = client.create_dataset(my_dataset, timeout=30)

**Step 3**<br>
Create a table with the on-demand consumption for reference. The table name is bytes.<br>
We select from INFORMATION_SCHEMA.JOBS to find the total bytes billed during the timeframe and the corresponding cost in USD.<br>Write down the result.

In [None]:
print(dataset)
query = "create table "+dataset+".bytes as "+"""SELECT
  SUM(total_bytes_billed/1024/1024/1024/1024) AS total_tb_billed,
  SUM(total_bytes_billed/1024/1024/1024/1024)*6.25 as cost_usd
FROM """+project_id+".region-"+region+""".INFORMATION_SCHEMA.JOBS
WHERE
  creation_time BETWEEN CAST(\""""+start_timestamp+"""\" AS TIMESTAMP)
  AND """+"CAST(\""+end_timestamp+"\" AS TIMESTAMP)"

if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
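
The arithmetic inside that query can be reproduced locally. This is a small sketch of the same conversion (bytes to TiB, then TiB to USD); the function name `on_demand_cost_usd` is hypothetical, and the 6.25 USD rate is simply the value hard-coded in the query above, so adjust it if your region is priced differently.

```python
TIB = 1024 ** 4               # bytes in one tebibyte
ON_DEMAND_USD_PER_TIB = 6.25  # rate taken from the query above

def on_demand_cost_usd(total_bytes_billed, usd_per_tib=ON_DEMAND_USD_PER_TIB):
    """Convert billed bytes to an on-demand cost in USD."""
    return total_bytes_billed / TIB * usd_per_tib

print(on_demand_cost_usd(2 * TIB))  # 2 TiB billed -> 12.5
```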

**Step 4**<br>
Then we query INFORMATION_SCHEMA.JOBS_TIMELINE on the same time range to see how many slots were used in every second of the time range. We send the result to a table called slots.

In [None]:
query = "create table "+dataset+".slots as "+"""SELECT
  period_start,
  SUM(period_slot_ms/1000) AS total_slot_ms,
FROM
  """+project_id+".region-"+region+""".INFORMATION_SCHEMA.JOBS_TIMELINE
WHERE
  period_start BETWEEN CAST(\""""+start_timestamp+"""\" AS TIMESTAMP)
  AND """+"CAST(\""+end_timestamp+"\" AS TIMESTAMP)"""

query = query + """
GROUP BY
  period_start
ORDER BY
  period_start DESC"""

if verbose:
  print(query)
df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
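
Note that although the column is named total_slot_ms, after dividing by 1000 it holds slots: each JOBS_TIMELINE row covers one second, so slot-milliseconds divided by 1000 gives the average number of slots a job used in that second. The sketch below mimics the GROUP BY on plain tuples; `slots_per_second` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def slots_per_second(timeline_rows):
    """Sum slot usage of all jobs per one-second period.

    timeline_rows: iterable of (period_start, period_slot_ms) pairs.
    """
    totals = defaultdict(float)
    for period_start, period_slot_ms in timeline_rows:
        totals[period_start] += period_slot_ms / 1000  # slot-ms -> slots
    return dict(totals)

rows = [
    ("2024-09-23 10:00:00", 1500),  # job A
    ("2024-09-23 10:00:00", 500),   # job B, same second
    ("2024-09-23 10:00:01", 250),
]
print(slots_per_second(rows))
```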

**Step 5**

Create a jobs table that will later be used to chart the number of jobs that ran in each time period, along with the foreground/background breakdown.

In [None]:
query = "create table "+dataset+""".jobs as SELECT
  timeline.job_id,
  timestamp_trunc(timeline.period_start, MINUTE) period_start,
  jobs.user_email,
  jobs.Job_type,
  CASE
    WHEN CONTAINS_SUBSTR(jobs.user_email, 'gserviceaccount') THEN 'background'
    ELSE 'foreground'
END
  AS query_type
FROM
    `"""+project_id+".region-"+region+""".INFORMATION_SCHEMA.JOBS_TIMELINE` AS timeline
  JOIN
    `"""+project_id+".region-"+region+""".INFORMATION_SCHEMA.JOBS` AS jobs
  ON
    timeline.job_id = jobs.job_id
WHERE
  timeline.period_start BETWEEN CAST("""+"\""+start_timestamp+"\""+""" AS TIMESTAMP)
  AND CAST("""+"\""+end_timestamp+"\""+" AS TIMESTAMP)"

if verbose:
  print(query)
df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
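
The foreground/background split in the CASE expression is a simple heuristic: any job submitted by a service account is treated as background. A rough Python equivalent follows (the function name `classify_query_type` is hypothetical); it lowercases the address because CONTAINS_SUBSTR matches case-insensitively.

```python
def classify_query_type(user_email):
    """Mirror the CASE expression: service accounts count as background."""
    if "gserviceaccount" in user_email.lower():
        return "background"
    return "foreground"

print(classify_query_type("etl@my-project.iam.gserviceaccount.com"))  # background
print(classify_query_type("alice@example.com"))                       # foreground
```

Keep in mind that a human may run queries through a service account (for example from a BI tool), so the split is an approximation.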

**Step 5**<br>
We create a dynamic query that creates buckets incremented by 50 from 0 to the highest number of slots used during the selected time period.<br>
The original time periods are seconds, so we aggregate the data at minute granularity, and for each minute we take the high watermark of slot usage (since the BigQuery autoscaler's minimum is 1 minute).<br><br>
Then we assign each minute to a specific bucket according to the maximum slots that it used.<br>
We create a view called bucketed to hold the result and enable further calculations.

In [None]:
import math

query1 = "select max(total_slot_ms) as max_slots from "+dataset+".slots"
df = pd.io.gbq.read_gbq(query=query1, project_id=project_id, dialect='standard')
# Round up so the top bucket still covers a fractional peak (e.g. 100.5 slots).
max_slots = math.ceil(df['max_slots'].iloc[0])
query2 = "create view "+dataset+".bucketed"+""" as SELECT
  timestamp_trunc(period_start, MINUTE) as period_start,
  max(case
when total_slot_ms = 0 then 0\n"""
for i in range(0, max_slots, 50):
  line = "when total_slot_ms between "+str(i)+" and "+str(i+50)+" then "+str(i+50)+'\n'
  query2 = query2 + line
query2 = query2 + """  end) as bucket
FROM """+dataset+""".slots
  group by timestamp_trunc(period_start, MINUTE)"""

if verbose:
  print(query2)

df = pd.io.gbq.read_gbq(query=query2, project_id=project_id, dialect='standard')
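
The generated CASE expression is equivalent to rounding each per-second value up to the next multiple of 50 and then keeping the highest bucket seen in each minute. Here is a compact sketch of that logic in plain Python; `bucket_for`, `bucket_per_minute` and the sample data are illustrative names and values, not part of the notebook.

```python
import math
from collections import defaultdict

BUCKET_SIZE = 50  # same increment as the generated query

def bucket_for(slots, bucket_size=BUCKET_SIZE):
    """Round slot usage up to the next bucket boundary (0 stays 0)."""
    if slots <= 0:
        return 0
    return math.ceil(slots / bucket_size) * bucket_size

def bucket_per_minute(slots_by_second):
    """Keep the highest bucket reached in each minute (the high watermark).

    slots_by_second maps 'YYYY-MM-DD HH:MM:SS' strings to slots used.
    """
    buckets = defaultdict(int)
    for timestamp, slots in slots_by_second.items():
        minute = timestamp[:16]  # truncate to 'YYYY-MM-DD HH:MM'
        buckets[minute] = max(buckets[minute], bucket_for(slots))
    return dict(buckets)

sample = {
    "2024-09-23 10:00:00": 12.0,
    "2024-09-23 10:00:30": 120.5,
    "2024-09-23 10:01:05": 0.0,
}
print(bucket_per_minute(sample))
```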

**Step 6**<br>
Here we create a view called buckets_count that shows how many time periods (minutes) fall into each bucket.<br>
Then we print the contents of the view.

In [None]:
query = "create view "+dataset+""".buckets_count as SELECT
  bucket,
  COUNT(*) as periods,
FROM `"""+project_id+"."+dataset+""".bucketed`
where bucket is not null
GROUP BY
  bucket"""

if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
query = 'select * from '+dataset+".buckets_count order by bucket"

if verbose:
  print(query)
df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
print(df)

For our next calculations we need to know two numbers:


1.   How many hours there are in our timeframe
2.   How many time periods we have

So steps 7 and 8 find those values.





**Step 7**<br>
Calculate how many hours we have in the chosen time frame.

In [None]:
from datetime import datetime
date_format = '%Y-%m-%d'
diff = datetime.strptime(end_timestamp,date_format) - datetime.strptime(start_timestamp, date_format)
hours = diff.days * 24 + diff.seconds / 3600
if verbose:
  print("hours: "+str(hours))

**Step 8**<br>
Calculate how many time periods (minutes) we have.

In [None]:
query = "select count(*) as periods from "+dataset+".bucketed"
df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
periods = df['periods'].iloc[0]
if verbose:
  print("periods: "+str(periods))

**Step 9**<br>
This is the heart of our calculation.<br><br>
We know how many hours were in the selected timeframe and how many time periods we had during this timeframe.
So we can calculate the percentage of the total time that each bucket was in use.<br>
If we know the percentage of time and how many hours we had in the timeframe, then we can calculate how many hours each bucket was active.<br>
And we know how much each slot-hour costs for each BigQuery edition, so we can calculate the cost each bucket incurred.

So here we create a view called calculated that holds the bucket, the percentage of all time that this bucket was used, how many hours it was used, and how much we would be charged for it if we use the selected edition.

In [None]:
if edition == 'Enterprise':
  price = 0.06
elif edition == 'Enterprise plus':
  price = 0.1
else:
  price = 0.04

query = "create view "+dataset+""".calculated as SELECT
  bucket,
  round(periods/"""+str(periods)+"""*100, 3) as percentage,
  round(periods/"""+str(periods)+"*100/100*"+str(hours)+""", 3) as hours,
  round(periods/"""+str(periods)+"*100/100*"+str(hours)+"*bucket*"+str(price)+""", 3) as cost_usd
FROM
  `"""+project_id+"."+dataset+""".buckets_count`
  order by bucket"""

if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
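
To make the formula concrete, here is the same calculation for a single bucket in plain Python. It is a sketch only: `bucket_cost` is a hypothetical helper, the slot-hour prices are the ones hard-coded in the cell above, and the sample numbers are invented.

```python
# Slot-hour prices copied from the cell above.
PRICE_PER_SLOT_HOUR = {"Standard": 0.04, "Enterprise": 0.06, "Enterprise plus": 0.1}

def bucket_cost(bucket, bucket_periods, total_periods, hours, edition):
    """Return (percentage, hours, cost_usd) for one bucket."""
    share = bucket_periods / total_periods           # fraction of all minutes
    bucket_hours = share * hours                     # hours spent in this bucket
    cost = bucket_hours * bucket * PRICE_PER_SLOT_HOUR[edition]
    return round(share * 100, 3), round(bucket_hours, 3), round(cost, 3)

# 2520 of 10080 minutes (a 168-hour week) peaked in the 100-slot bucket.
print(bucket_cost(100, 2520, 10080, 168, "Standard"))  # (25.0, 42.0, 168.0)
```

In this example the bucket was active 25% of the time, which is 42 hours, and 42 hours at 100 slots and 0.04 USD per slot-hour comes to 168 USD.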

**Let's see what it looks like.**

In [None]:
query = "select * from "+dataset+".calculated"
calculated = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')

In [None]:
print(calculated)

**Step 10**<br>
Here is a visualization of the histogram that shows the distribution of the slot buckets and the time spent in each of them.<br>
There should be a "sweet spot" where most of the buckets under it are heavily used and above it there is only slight usage.<br>
This should give you a sense of where that sweet spot is.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

calculated.plot(x="bucket", y="percentage", kind="line",figsize=(15,9))

In [None]:
# Plot slot usage over time (total_slot_ms holds the slots used in each second).
query = "select period_start, total_slot_ms from "+dataset+".slots order by period_start"
df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
df.plot(x="period_start", y="total_slot_ms", kind="line",figsize=(15,9))

**Step 11**<br>
Use the following query to identify where slot usage drops below 1% of the time.

In [None]:
query = "select max(bucket) as recommended from "+dataset+".calculated where percentage > 1"

if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
print(df)
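
The same rule can be applied to the `calculated` rows without another round trip to BigQuery. A minimal sketch, assuming the rows are available as (bucket, percentage) pairs; `recommended_max_slots` and the sample values are illustrative only.

```python
def recommended_max_slots(rows, threshold_pct=1.0):
    """Return the highest bucket used more than threshold_pct of the time.

    rows: iterable of (bucket, percentage) pairs. Returns None if no bucket qualifies.
    """
    candidates = [bucket for bucket, percentage in rows if percentage > threshold_pct]
    return max(candidates) if candidates else None

rows = [(50, 62.0), (100, 30.5), (150, 6.2), (200, 0.9), (250, 0.4)]
print(recommended_max_slots(rows))  # 150
```

Raising or lowering `threshold_pct` is a quick way to see how sensitive the recommendation is to the 1% cutoff.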

**Step 12**<br>
The following steps try to find the optimal max_slots that best balances cost and performance.<br>
As a first step choose the max_slots value you want to check.

In [None]:
max_slots = 750 # @param {"type":"number"}

**Explanation**<br>
The trade-off is cost vs. performance. If we choose the right max_slots, then we can reduce the cost while only a small number of queries will decrease in performance. If these are not time-critical queries (such as ETLs, background jobs, etc.), then we may want to "sacrifice" them in return for lower cost.

**Step 13**<br>
We try to calculate how much we will pay if we choose the above max_slots.<br>
This is a naive approach, as it ignores all the buckets above max_slots, as if they all had 0 percent.

In [None]:
answer = calculated[calculated['bucket'] <= max_slots]
summed = answer['cost_usd'].sum()
print('Total cost estimation for the time frame using maximum slots '+str(max_slots)+' is '+str(summed)+' USD.' )
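
One possible refinement, offered here only as an assumption and not as the notebook's method, is to bill the minutes that peaked above max_slots at the cap instead of dropping them. The sketch below compares both estimates on invented (bucket, hours) pairs; `naive_cost` and `capped_cost` are hypothetical helpers. It still ignores that throttled work may spill into later minutes, so treat the result as a rough estimate.

```python
def naive_cost(rows, max_slots, price):
    """Sum the cost of buckets at or below max_slots (the approach above)."""
    return sum(hours * bucket * price for bucket, hours in rows if bucket <= max_slots)

def capped_cost(rows, max_slots, price):
    """Also count hours spent above the cap, billed at max_slots."""
    return sum(hours * min(bucket, max_slots) * price for bucket, hours in rows)

rows = [(50, 100.0), (100, 50.0), (150, 18.0)]  # (bucket, hours active)
print(round(naive_cost(rows, 100, 0.04), 2))   # 400.0
print(round(capped_cost(rows, 100, 0.04), 2))  # 472.0
```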

**Step 14**<br>
Finally, we want to see which queries will be the most affected by the slot decrease (the ones whose slot consumption is above max_slots).
Often you will find that those queries are not time sensitive (such as background processes), and you can sacrifice their performance to lower cost.

In [None]:
query = """SELECT
  *
FROM (
  SELECT
    timeline.job_id AS job_id,
    jobs.query AS query,
    jobs.job_type AS job_type,
    ROUND(MAX(timeline.period_slot_ms/1000)) AS total_slot_ms,
    COUNT(timeline.job_id) AS slices
  FROM
    `"""+project_id+"`.`region-"+region+"""`.INFORMATION_SCHEMA.JOBS_TIMELINE AS timeline
  JOIN
    `"""+project_id+"`.`region-"+region+"""`.INFORMATION_SCHEMA.JOBS AS jobs
  ON
    timeline.job_id = jobs.job_id
  WHERE
    period_start BETWEEN CAST(\""""+start_timestamp+"""\" AS TIMESTAMP)
  AND """+"CAST(\""+end_timestamp+"\" AS TIMESTAMP)"""+"""
  GROUP BY
    job_id,
    query,
    job_type
  ORDER BY
    slices DESC)
WHERE
  total_slot_ms>"""+str(max_slots)

if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
print(df)


**Step 15**

Here we calculate the optimal baseline slots value, if any.

In [None]:
query = """SELECT
  bucket,
  cost_usd,
  MAX(percentage) AS percentage
FROM
  """+dataset+""".calculated
WHERE
  percentage>60
GROUP BY
  bucket, cost_usd"""

if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')
if df.empty:
  print('It is recommended not to set a baseline.')
else:
  bucket = df['bucket'][0]
  cost = df['cost_usd'][0]
  percentage = df['percentage'][0]
  if percentage>60 :
    print("Setting a baseline of "+str(bucket)+" with 3 Year commitment will save you "+str(cost*0.4)+" dollars for the tested period.")
  if percentage>80 :
    print("Setting a baseline of "+str(bucket)+" with 1 Year commitment will save you "+str(cost*0.2)+" dollars for the tested period.")

**Step 16**

Here we check how many queries were run during our test period. We also break them down into background queries that were scheduled or triggered by automatic tools such as Cloud Scheduler, Composer, etc. (and hence are less time sensitive) and foreground queries that were run interactively by human users.

In [None]:
query = "CREATE OR REPLACE VIEW "+dataset+""".jobs_count AS
SELECT
  TIMESTAMP_TRUNC(period_start, HOUR) AS hour,
  SUM(foreground) AS fg_count,
  SUM(background) AS bg_count
FROM (
  SELECT
    period_start,
    CASE query_type
      WHEN 'foreground' THEN 1
      ELSE 0
  END
    AS foreground,
    CASE query_type
      WHEN 'background' THEN 1
      ELSE 0
  END
    AS background
  FROM
    """+dataset+""".jobs)
GROUP BY
  hour
ORDER BY
  hour"""

if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')


In [None]:
import matplotlib.pyplot as plt
import numpy as np

query = "SELECT hour, fg_count, bg_count FROM "+dataset+".jobs_count  ORDER BY hour"
if verbose:
  print(query)

df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')

# Plotting the stacked bar chart
fig, ax = plt.subplots(figsize=(33,7))

ax.bar(df['hour'], df['bg_count'], label='Background', width = 0.03)
ax.bar(df['hour'], df["fg_count"], bottom=df['bg_count'], label='Foreground', width = 0.03)

# Adding labels and title
ax.set_xlabel('Hour')
ax.set_ylabel('Queries')
ax.set_title('Query count per hour')
ax.legend()


# Show the plot
plt.show()

cnt_bg = df["bg_count"].sum()
cnt_fg = df["fg_count"].sum()
total = cnt_fg + cnt_bg
print (str(round(cnt_bg/(total/100),2))+" percent of the queries were background queries and "+str(round(cnt_fg/(total/100),2))+ " were foreground queries")


**Step 17**<br>
Cleanup

In [None]:
queries = []
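queries.append("drop view "+dataset+".calculated")
queries.append("drop view "+dataset+".buckets_count")
queries.append("drop view "+dataset+".bucketed")
# Also remove the objects created in the jobs steps (view first, then its table).
queries.append("drop view "+dataset+".jobs_count")
queries.append("drop table "+dataset+".jobs")
queries.append("drop table "+dataset+".bytes")
queries.append("drop table "+dataset+".slots")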

for query in queries:
  df = pd.io.gbq.read_gbq(query=query, project_id=project_id, dialect='standard')

print ("Delete complete.")