# BigQuery APIs

Google Cloud Datalab provides an integrated environment for working with Google BigQuery, covering both ad hoc, exploratory work and pipeline development. This notebook introduces some of the APIs that Cloud Datalab provides for working with BigQuery.

You've already seen the use of `%%sql` in the [Hello BigQuery](Hello BigQuery.ipynb) notebook, and various `%%bigquery` commands in the [BigQuery Commands](BigQuery Commands.ipynb) notebook. These BigQuery commands are built using the same BigQuery APIs that are available for your own use.

## Importing the API

The Cloud Datalab APIs are provided in the `datalab` Python library, and the BigQuery functionality is contained within the `datalab.bigquery` module.

In [1]:
import datalab.bigquery as bq

# Querying Data

The most important BigQuery-related API is the one that allows you to execute a SQL query. The `bq.Query` class provides that functionality.

In [2]:
# Create and run a SQL query
bq.Query('SELECT * FROM [cloud-datalab-samples:httplogs.logs_20140615] LIMIT 3').results()

timestamp,latency,status,method,endpoint
2014-06-15 07:00:00.003772,122,200,GET,Interact3
2014-06-15 07:00:00.428897,144,200,GET,Interact3
2014-06-15 07:00:00.536486,48,200,GET,Interact3


## SQL

In the example above, the SQL was written as a Python string literal. In Cloud Datalab, you can also specify your query as vanilla SQL. Here is the equivalent query:

In [3]:
%%sql --module logspreview
SELECT * FROM [cloud-datalab-samples:httplogs.logs_20140615] LIMIT 3

In [4]:
# Create a query using the SQL module defined above.
q = bq.Query(logspreview)

# Run the query, with caching turned off (for sample purposes only), so we're sure to be
# able to retrieve metadata, such as bytes processed from the resulting query job.
results = q.results(use_cache = False)
results

timestamp,latency,status,method,endpoint
2014-06-15 07:00:00.003772,122,200,GET,Interact3
2014-06-15 07:00:00.428897,144,200,GET,Interact3
2014-06-15 07:00:00.536486,48,200,GET,Interact3


The results object is an instance of the `QueryResultsTable` class. It can be enumerated in the same manner as a regular Python list, and it also exposes metadata about the results, such as the SQL that was run and the associated query job.

In [5]:
# Inspecting the results, and the associated job
print results.sql
print str(results.length) + ' rows'
print str(results.job.bytes_processed) + ' bytes processed'

SELECT * FROM [cloud-datalab-samples:httplogs.logs_20140615] LIMIT 3
3 rows
24152138 bytes processed


In [6]:
# Inspect the programmatic representation.
# Converting the QueryResultsTable to a vanilla list enables viewing the literal data,
# as well as side-stepping the HTML rendering seen above.
list(results)

[{u'endpoint': u'Interact3',
  u'latency': 122,
  u'method': u'GET',
  u'status': 200,
  u'timestamp': datetime.datetime(2014, 6, 15, 7, 0, 0, 3772)},
 {u'endpoint': u'Interact3',
  u'latency': 144,
  u'method': u'GET',
  u'status': 200,
  u'timestamp': datetime.datetime(2014, 6, 15, 7, 0, 0, 428897)},
 {u'endpoint': u'Interact3',
  u'latency': 48,
  u'method': u'GET',
  u'status': 200,
  u'timestamp': datetime.datetime(2014, 6, 15, 7, 0, 0, 536486)}]
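
Because each row is a plain dictionary keyed by column name, ordinary Python is enough for light post-processing. The following is a minimal sketch: the `rows` list is copied from the output above as a stand-in for `list(results)`, so it runs without a BigQuery connection, and the variable names are illustrative only.

```python
import datetime

# Stand-in for list(results): the three rows shown in the output above.
rows = [
    {'endpoint': 'Interact3', 'latency': 122, 'method': 'GET', 'status': 200,
     'timestamp': datetime.datetime(2014, 6, 15, 7, 0, 0, 3772)},
    {'endpoint': 'Interact3', 'latency': 144, 'method': 'GET', 'status': 200,
     'timestamp': datetime.datetime(2014, 6, 15, 7, 0, 0, 428897)},
    {'endpoint': 'Interact3', 'latency': 48, 'method': 'GET', 'status': 200,
     'timestamp': datetime.datetime(2014, 6, 15, 7, 0, 0, 536486)},
]

# Pull one column out of the rows and aggregate it locally.
latencies = [row['latency'] for row in rows]
average_latency = sum(latencies) / float(len(latencies))

# Find the row with the highest latency.
slowest = max(rows, key=lambda row: row['latency'])

print('average latency: %.1f' % average_latency)
print('slowest request at: %s' % slowest['timestamp'].isoformat())
```

For anything beyond a handful of rows, it is usually better to push the aggregation into the SQL itself so that only the summary is retrieved into the notebook.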

## Sampling with bq.Query

The `Query` class has a number of other methods, including `sample`, which retrieves a sample of the query's results.

In [7]:
%%sql --module logs
SELECT * FROM [cloud-datalab-samples:httplogs.logs_20140615]

In [8]:
# Use a hash-based sampling strategy that hashes the timestamp and takes a 1% sample.
# By default, all fields are chosen, but a particular projection can be specified as well.
# Further, limit to 10, since, in this case, the only use of the sampled results is to display a table.
sampling = bq.Sampling.hashed('timestamp', percent=1, count=10, fields = ['timestamp', 'latency'])
sample = bq.Query(logs).sample(sampling = sampling)
sample

timestamp,latency
2014-06-15 07:00:05.449186,6
2014-06-15 07:00:05.908400,5
2014-06-15 07:00:09.078710,30
2014-06-15 07:00:18.609836,28
2014-06-15 07:00:18.861028,119
2014-06-15 07:00:25.316129,712
2014-06-15 07:00:28.423380,211
2014-06-15 07:00:46.074430,501
2014-06-15 07:00:51.734565,124
2014-06-15 07:00:53.029076,121


In [9]:
# Sampling is implemented using standard SQL constructs, and is performed in BigQuery,
# thereby limiting the results retrieved into the notebook.
print sample.sql

SELECT timestamp,latency FROM (SELECT * FROM [cloud-datalab-samples:httplogs.logs_20140615]) WHERE ABS(HASH(timestamp)) % 100 < 1 LIMIT 10
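
To make the shape of that generated SQL concrete, here is a small sketch that wraps a query in the same hash-based filter. The `build_hashed_sample_sql` helper is hypothetical and written only for illustration; it mirrors the SQL printed above and is not the library's implementation.

```python
def build_hashed_sample_sql(sql, field, percent, count=None, fields=None):
    """Wrap `sql` in a hash-based sampling query (hypothetical helper).

    Rows are kept when the hash of `field`, taken modulo 100, falls below
    `percent`, so the same rows are selected every time the query runs.
    """
    projection = ','.join(fields) if fields else '*'
    sampled = 'SELECT %s FROM (%s) WHERE ABS(HASH(%s)) %% 100 < %d' % (
        projection, sql, field, percent)
    if count:
        sampled += ' LIMIT %d' % count
    return sampled


logs_sql = 'SELECT * FROM [cloud-datalab-samples:httplogs.logs_20140615]'
sample_sql = build_hashed_sample_sql(
    logs_sql, 'timestamp', percent=1, count=10, fields=['timestamp', 'latency'])
print(sample_sql)
```

Hashing a field, rather than picking rows at random, makes the sample repeatable: rerunning the query selects the same subset of rows.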


# Datasets and Tables

In addition to executing queries, you can programmatically access BigQuery objects such as datasets, tables, and their schemas.

## Listing Resources

In [10]:
datasets = bq.Datasets(project_id = 'cloud-datalab-samples')
for ds in datasets:
  print ds.name

DatasetName(project_id=u'cloud-datalab-samples', dataset_id=u'appenginelogs')
DatasetName(project_id=u'cloud-datalab-samples', dataset_id=u'carprices')
DatasetName(project_id=u'cloud-datalab-samples', dataset_id=u'httplogs')


In [11]:
sample_dataset = list(datasets)[1]
tables = sample_dataset.tables()
for table in tables:
  print '%s (%d rows - %d bytes)' % (table.name.table_id, table.metadata.rows, table.metadata.size)

testing (100 rows - 4586 bytes)
training (417 rows - 19086 bytes)


In [12]:
table = bq.Table('cloud-datalab-samples:httplogs.logs_20140615')
fields = [field.name for field in table.schema]
fields

[u'timestamp', u'latency', u'status', u'method', u'endpoint']
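
The table names used in this notebook follow a `project:dataset.table` pattern, with the project part omitted in names such as `sample.sample_table`. As a rough sketch of how such a name breaks down into the `project_id`, `dataset_id` and `table_id` parts seen in the outputs above, consider the following. Both `ParsedTableName` and `parse_table_name` are hypothetical names introduced here and are not part of `datalab.bigquery`.

```python
import collections

# Hypothetical container; the field names mirror the project_id, dataset_id
# and table_id attributes seen in the outputs above.
ParsedTableName = collections.namedtuple(
    'ParsedTableName', ['project_id', 'dataset_id', 'table_id'])


def parse_table_name(name):
    """Split '[project:]dataset.table' into its parts (assumed format)."""
    project_id = None
    if ':' in name:
        # Everything before the first colon is the project.
        project_id, name = name.split(':', 1)
    # The remainder is the dataset and table, separated by a dot.
    dataset_id, table_id = name.split('.', 1)
    return ParsedTableName(project_id, dataset_id, table_id)


full = parse_table_name('cloud-datalab-samples:httplogs.logs_20140615')
short = parse_table_name('sample.sample_table')
print(full)
print(short)
```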

## Creating Resources

In [13]:
# Create a new dataset (this will be deleted later in the notebook)
sample_dataset = bq.Dataset('sample')
sample_dataset.create(friendly_name = 'Sample Dataset', description = 'Created from Sample Notebook')
sample_dataset.exists()

True

In [14]:
# To create a table, we also need to create a schema.
# It's easiest to create a schema from some existing data, so this
# example infers the schema from a sample row.
sample_row = {
  'name': 'string value',
  'value': 0,
  'flag': True
}
sample_schema = bq.Schema.from_data([sample_row])

sample_table = bq.Table("sample.sample_table").create(schema = sample_schema, overwrite = True)
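
To give a feel for what inferring a schema from data involves, here is a minimal, self-contained sketch that maps Python values to BigQuery type names. It is an assumption about the general approach, not the implementation of `bq.Schema.from_data`; the `infer_field_type` and `infer_schema` helpers are hypothetical.

```python
import datetime


def infer_field_type(value):
    """Map a Python value to a BigQuery type name (illustrative mapping)."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return 'BOOLEAN'
    if isinstance(value, int):
        return 'INTEGER'
    if isinstance(value, float):
        return 'FLOAT'
    if isinstance(value, datetime.datetime):
        return 'TIMESTAMP'
    # Fall back to STRING for anything else.
    return 'STRING'


def infer_schema(rows):
    """Return a list of {'name', 'type'} dicts inferred from the first row."""
    first = rows[0]
    return [{'name': name, 'type': infer_field_type(first[name])}
            for name in sorted(first)]


sample_row = {
    'name': 'string value',
    'value': 0,
    'flag': True
}
inferred = infer_schema([sample_row])
for field in inferred:
    print('%s: %s' % (field['name'], field['type']))
```

The order of the type checks matters: testing for `int` first would classify `True` as an `INTEGER`.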

You can run the cell below to see the contents of the new dataset:

In [None]:
list(sample_dataset.tables())

## Deleting Resources

In [16]:
# Clear out sample resources
sample_dataset.delete(delete_contents = True)

# Looking Ahead

This notebook covered a small subset of the APIs. Subsequent notebooks cover additional capabilities, such as importing data into and exporting data from BigQuery tables.