# BigQuery - Parameterized Queries

This notebook shows SQL and BigQuery support that goes beyond exploratory queries. It shows how to parameterize queries and organize them into modules for use elsewhere in the notebook and, potentially, for deployment to BigQuery from within a notebook. Query deployment is a work in progress; modular, parameterized queries are a step in that direction.

# Setup

Import the libraries and take a first look at the schema and data.

In [1]:
import gcp
import gcp.bigquery as bq

In [2]:
%%sql
SELECT * FROM [cloud-datalab:sampledata.requestlogs_20140615] LIMIT 5

In [3]:
%%sql
SELECT endpoint, count(status) statcount
FROM [cloud-datalab:sampledata.requestlogs_20140615]
WHERE status = 401
GROUP BY endpoint
ORDER BY endpoint

# Module and Parameters

The following cell uses the name StatusQueries to group a set of related queries into a module. Outside the cell, a query must be referred to through the module name, for example StatusQueries.CountForStatus.

The queries themselves are defined for later use with the DEFINE QUERY construct; they are not executed immediately. They contain named parameters such as "endpt", which are given default values at the top of the cell and are referenced with a '$' prefix inside the SQL statement. Because the queries are not strings, replacing parameters does not require awkward string processing.

In [4]:
%%sql --module StatusQueries
endpt = 'Admin'
stat = 401

DEFINE QUERY ErrorCountByEndpt
SELECT status, count(status) statcount
FROM [cloud-datalab:sampledata.requestlogs_20140615]
WHERE endpoint = $endpt
AND status > 399
GROUP BY status
ORDER BY status

DEFINE QUERY CountForStatus
SELECT endpoint, count(status) statcount
FROM [cloud-datalab:sampledata.requestlogs_20140615]
WHERE status = $stat
GROUP BY endpoint
ORDER BY endpoint

The queries above are plain SQL queries. They are not bound to any particular SQL implementation. To run them, we need to pick a SQL implementation as the executor. Let's use BigQuery, since the data we are working with is in BigQuery.

As a first step, let's look at the expanded query that would be executed, first with the default value and then with a non-default value. Note that the single-line syntax is %bigquery, while multi-line cells use %%bigquery, following IPython conventions. Non-default values can be specified in either JSON or YAML; this example uses JSON:

In [5]:
%bigquery pipeline --query StatusQueries.ErrorCountByEndpt

SELECT status, count(status) statcount
FROM [cloud-datalab:sampledata.requestlogs_20140615]
WHERE endpoint = "Admin"
AND status > 399
GROUP BY status
ORDER BY status
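Datalab performs this expansion for you. Purely to make the rule concrete, the following Python sketch mimics it: each `$name` is replaced by its value rendered as a SQL literal. This is an illustration under stated assumptions, not Datalab's implementation; the helper name `expand_query` and the use of `json.dumps` to produce double-quoted string literals are choices made for this example.

```python
import json
import re

def expand_query(sql, values):
    """Replace each $name in sql with values[name] rendered as a SQL literal."""
    def render(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("no value for parameter $" + name)
        value = values[name]
        # Strings become double-quoted literals; numbers are inserted as-is.
        return json.dumps(value) if isinstance(value, str) else str(value)

    return re.sub(r"\$(\w+)", render, sql)

# Query text and default values copied from the StatusQueries module.
sql = (
    "SELECT status, count(status) statcount\n"
    "FROM [cloud-datalab:sampledata.requestlogs_20140615]\n"
    "WHERE endpoint = $endpt\n"
    "AND status > 399\n"
    "GROUP BY status\n"
    "ORDER BY status"
)
default_sql = expand_query(sql, {"endpt": "Admin", "stat": 401})
print(default_sql)
```

Values that the query does not reference (here `stat`) are simply ignored, while a placeholder without a value raises an error in this sketch.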


In [6]:
%%bigquery pipeline --query StatusQueries.ErrorCountByEndpt
{
  "endpt": "Other"
}

SELECT status, count(status) statcount
FROM [cloud-datalab:sampledata.requestlogs_20140615]
WHERE endpoint = "Other"
AND status > 399
GROUP BY status
ORDER BY status
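The JSON body of the cell only needs to name the parameters that differ from the defaults. A minimal sketch of that merge step, assuming a hypothetical helper `merge_overrides` (not part of Datalab) and assuming that unknown parameter names should be rejected, could look like this:

```python
import json

def merge_overrides(defaults, cell_body):
    """Return defaults updated with the JSON object found in cell_body."""
    # An empty cell body means "use the defaults unchanged".
    overrides = json.loads(cell_body) if cell_body.strip() else {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Design choice of this sketch: fail loudly on misspelled names.
        raise ValueError("unknown parameters: " + ", ".join(sorted(unknown)))
    merged = dict(defaults)
    merged.update(overrides)
    return merged

defaults = {"endpt": "Admin", "stat": 401}
merged = merge_overrides(defaults, '{ "endpt": "Other" }')
print(merged)
```

Only `endpt` changes; `stat` keeps its default value.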


The same can be accomplished in Python code with the more conventional APIs, as shown below.

In [7]:
bq.Query(StatusQueries.ErrorCountByEndpt)

In [8]:
bq.Query(StatusQueries.ErrorCountByEndpt, endpt='Other')

Let's execute the query and put the results into a dataframe.

In [9]:
df = bq.Query(StatusQueries.ErrorCountByEndpt, endpt='Other').to_dataframe()
df[:10]

| | status | statcount |
|---|---|---|
| 0 | 400 | 1427 |
| 1 | 401 | 20 |
| 2 | 404 | 11860 |
| 3 | 405 | 27 |
| 4 | 500 | 325 |
| 5 | 503 | 36 |


# Exploratory Query in Module

Sometimes it is convenient to try out the queries defined in a module in the same cell that defines it. To do so, add a query without a name at the end. When the cell is executed, the module definition is added to the notebook state and the last, unnamed query is also run.
Here we reuse one of the queries under a different module name and add one more query, without a name, at the end.

In [10]:
%%sql --module StatusQueries2
endpt = 'Admin'
stat = 401

DEFINE QUERY ErrorCountByEndpt
SELECT status, count(status) statcount
FROM [cloud-datalab:sampledata.requestlogs_20140615]
WHERE endpoint = $endpt
AND status > 399
GROUP BY status
ORDER BY status

SELECT *
FROM $ErrorCountByEndpt
WHERE status = 500

The last query, which has no "DEFINE QUERY" prefix and no name, becomes the default query of the module. It can be executed from another cell as follows.

In [11]:
%bigquery execute --query StatusQueries2
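The default query of StatusQueries2 refers to `$ErrorCountByEndpt`, which is a named query rather than a scalar parameter. One plausible way to read such a reference is as a parenthesized subquery. The sketch below illustrates that reading in plain Python; it is an assumption about the shape of the expanded SQL, not Datalab's actual code, and the helper name `expand_refs` is hypothetical.

```python
import re

# Named queries of the module, keyed by name (text copied from the cell above).
queries = {
    "ErrorCountByEndpt": (
        "SELECT status, count(status) statcount\n"
        "FROM [cloud-datalab:sampledata.requestlogs_20140615]\n"
        "WHERE endpoint = $endpt\n"
        "AND status > 399\n"
        "GROUP BY status\n"
        "ORDER BY status"
    ),
}

# The unnamed (default) query of the module.
default_query = "SELECT *\nFROM $ErrorCountByEndpt\nWHERE status = 500"

def expand_refs(sql, queries):
    """Replace $Name with a parenthesized subquery when Name is a named query."""
    def render(match):
        name = match.group(1)
        if name in queries:
            return "(" + expand_refs(queries[name], queries) + ")"
        # Scalar parameters such as $endpt are left for a later step.
        return match.group(0)

    return re.sub(r"\$(\w+)", render, sql)

expanded = expand_refs(default_query, queries)
print(expanded)
```

In this sketch the scalar placeholder `$endpt` survives the expansion, so parameter values can still be supplied afterwards.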