# BigQuery - Data Transforms with SQL

This notebook demonstrates how to run a SQL query in BigQuery and save the results to another table. This lets you perform ETL-like tasks entirely within BigQuery, such as cleaning data or preparing it for further analysis.

This notebook uses a sample dataset of web server request logs.

Related Links:

* [BigQuery](https://cloud.google.com/bigquery/)
* BigQuery [SQL reference](https://cloud.google.com/bigquery/query-reference)

----

NOTE:

* If you're new to notebooks, or want to explore additional samples, see the full [list](..) of notebooks.

In [1]:
import gcp
import gcp.bigquery as bq

# Extract

The source data is a table of request logs with the schema shown below. A separate "extract" step is not required because the data is already in BigQuery.

In [2]:
logs_table = bq.Table('cloud-datalab:sampledata.requestlogs_20140616')
logs_table.schema

# Transform

This data needs to be shaped for tracking errors and their associated endpoints over time. In this simple example, we use a query that keeps only requests with an HTTP status of 400 or higher (client and server errors) and selects just the columns relevant to error tracking. For more complex transformations, see the query composition and UDF sample notebooks.

In [3]:
%%sql --module log_transform
SELECT endpoint, method, status, timestamp
FROM $logs_table
WHERE status >= 400
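
To make the semantics of the query concrete, here is a minimal pure-Python sketch of the same filter-and-project step, run against in-memory rows rather than BigQuery. Only the four column names come from the query; the sample rows, their values, and the extra `latency` column are hypothetical and exist just to show that unselected columns are dropped.

```python
# Columns selected by the query above.
KEPT_COLUMNS = ("endpoint", "method", "status", "timestamp")

# Hypothetical sample rows (not taken from the real requestlogs table).
SAMPLE_ROWS = [
    {"endpoint": "Home", "method": "GET", "status": 200,
     "timestamp": "2014-06-16 00:00:01", "latency": 12},
    {"endpoint": "Missing", "method": "GET", "status": 404,
     "timestamp": "2014-06-16 00:00:02", "latency": 3},
    {"endpoint": "Admin", "method": "POST", "status": 500,
     "timestamp": "2014-06-16 00:00:03", "latency": 250},
    {"endpoint": "Old", "method": "GET", "status": 302,
     "timestamp": "2014-06-16 00:00:04", "latency": 5},
]


def transform(rows):
    """Keep rows with status >= 400 and project the selected columns."""
    return [
        {column: row[column] for column in KEPT_COLUMNS}
        for row in rows
        if row["status"] >= 400
    ]


errors_only = transform(SAMPLE_ROWS)
for row in errors_only:
    print(row)
```

Note that the `status >= 400` condition drops redirects (3xx) along with successful (2xx) responses.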

Let's sample the query results before loading the data.

In [4]:
query = bq.Query(log_transform, logs_table=logs_table)
query.sample()

# Load

First, create a dataset in the current project. This operation is idempotent: if the dataset already exists, it is returned as is.

In [5]:
target_ds = bq.DataSet('output').create()

A new dataset now exists in the current project for the output. You can verify this in the BigQuery console.

The next step is to execute the log_transform query and load the results directly into a BigQuery table in the newly created dataset. To keep this step idempotent, we specify that the table should be overwritten if it already exists. We use the execute_async method, which returns a job that we then wait on.

In [6]:
job = query.execute_async('output.transformedlogs_20140616', table_mode='overwrite')
job.wait()

Job job_mXbMEeuqcpST2_07_PWOKM7SGHM completed
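
To illustrate why overwriting makes the load safe to re-run, here is a minimal in-memory sketch that models a table store as a Python dictionary. It is not how BigQuery or the `gcp` library works internally; the `write_table` helper, the `"append"` mode name, and the sample row are assumptions made for illustration only.

```python
def write_table(store, name, rows, mode):
    """Write rows into a dict-backed 'table' using the given mode."""
    if mode == "overwrite":
        # Replace whatever was there, so repeated runs converge to one state.
        store[name] = list(rows)
    elif mode == "append":
        # Add to existing contents, so repeated runs accumulate duplicates.
        store.setdefault(name, []).extend(rows)
    else:
        raise ValueError("unknown mode: {}".format(mode))
    return store[name]


# Hypothetical transformed row.
rows = [{"endpoint": "Admin", "status": 500}]

overwrite_store = {}
write_table(overwrite_store, "transformedlogs", rows, "overwrite")
write_table(overwrite_store, "transformedlogs", rows, "overwrite")
overwrite_twice = len(overwrite_store["transformedlogs"])

append_store = {}
write_table(append_store, "transformedlogs", rows, "append")
write_table(append_store, "transformedlogs", rows, "append")
append_twice = len(append_store["transformedlogs"])

print(overwrite_twice, append_twice)
```

Running the load twice in overwrite mode leaves a single copy of the rows, while the append-style mode in this sketch doubles them.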

Job completion time depends on how much data is processed. Once the job completes, check whether it reported any errors:

In [7]:
job.errors
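
In a scripted pipeline, you may prefer to fail fast instead of inspecting the value by eye. The following is a minimal sketch of such a check. The `ensure_job_succeeded` helper and the `JobFailedError` class are hypothetical names, not part of the `gcp` library, and the sketch assumes that `job.errors` is empty or `None` when the job succeeded.

```python
class JobFailedError(RuntimeError):
    """Raised when a job reports one or more errors (hypothetical helper type)."""


def ensure_job_succeeded(job):
    """Return the job unchanged, or raise if it reports any errors.

    Assumes `job.errors` is falsy (None or empty) on success.
    """
    errors = getattr(job, "errors", None)
    if errors:
        raise JobFailedError("job reported errors: {!r}".format(errors))
    return job
```

Under those assumptions, calling `ensure_job_succeeded(job)` right after `job.wait()` stops the notebook before any later cell reads a table that was never written.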

Finally, preview a few rows of the new table to confirm that the results were loaded:

In [8]:
%%sql
SELECT *
FROM [output.transformedlogs_20140616]
LIMIT 5