# BigQuery - Extracting Data

This notebook demonstrates how to run a SQL query to extract data from BigQuery and stream the results to a local file, or to a file in Google Cloud Storage (GCS), rather than into memory. This may be useful for result sets that are too large to fit into memory all at once.

This notebook uses a sample dataset of request logs data from a web server.

Related Links:

* [BigQuery](https://cloud.google.com/bigquery/)
* BigQuery [SQL reference](https://cloud.google.com/bigquery/query-reference)

----

NOTE:

* If you're new to notebooks, or want to explore additional samples, see the full [list](..) of notebooks.

In [1]:
import gcp
import gcp.bigquery as bq
import gcp.storage as gcs

# Querying BigQuery

First we'll define the query we want to use to retrieve results.

In [2]:
%%sql --module data
SELECT * FROM [cloud-datalab:sampledata.requestlogs_20140615] LIMIT 100000

Let's sample it to get a sense of the data that will be extracted when the full query is executed.

In [3]:
bq.Query(data).sample()

# Extracting Results to a Local File

You can use the `to_file` method on the query object to stream the results down to a file. Note that it can take a while for all the data to become available locally.

In [4]:
bq.Query(data).to_file('/tmp/data.csv')

'/tmp/data.csv'

To verify, let's open the file and read the first few lines. We expect to see the column headers in the first line, followed by individual data rows like those sampled earlier.

In [5]:
with open('/tmp/data.csv') as datafile:
    head = [next(datafile) for x in xrange(6)]
print ''.join(head)

timestamp,latency,status,method,endpoint
2014-06-15 07:00:00.003772,122,200,GET,Interact3
2014-06-15 07:00:00.428897,144,200,GET,Interact3
2014-06-15 07:00:00.536486,48,200,GET,Interact3
2014-06-15 07:00:00.652760,28,405,GET,Interact2
2014-06-15 07:00:00.670100,103,200,GET,Interact3
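
The list comprehension with `next` raises `StopIteration` if the file has fewer lines than requested. As a hedged alternative, `itertools.islice` simply stops early. The sketch below uses an in-memory stream as a stand-in for the open file so that it runs on its own; `head` is a hypothetical helper name, not part of the `gcp` library.

```python
import io
from itertools import islice

def head(lines, n=6):
    """Return up to the first n lines from an iterable of lines."""
    return list(islice(lines, n))

# Stand-in for an open file: an in-memory stream holding only three lines.
short_file = io.StringIO(
    u"timestamp,latency,status,method,endpoint\n"
    u"2014-06-15 07:00:00.003772,122,200,GET,Interact3\n"
    u"2014-06-15 07:00:00.428897,144,200,GET,Interact3\n")

# Asking for six lines from a three-line stream returns three lines, no error.
first_lines = head(short_file)
print(''.join(first_lines))
```

In the notebook, the same helper could be passed the `datafile` object opened on `/tmp/data.csv`.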



In [6]:
%%bash
ls -l /tmp/data.csv

-rw-r--r-- 1 root root 4668222 Sep 29 22:50 /tmp/data.csv
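
Once the results are on disk, they can be processed one row at a time instead of being loaded whole, which is the point of extracting to a file in the first place. The following is a minimal sketch using only the standard library. It writes a small stand-in file containing the rows shown above so that it runs on its own; `count_statuses` and the temporary path are illustrative assumptions, and in the notebook you would pass `/tmp/data.csv` instead.

```python
import collections
import csv
import os
import tempfile

# Stand-in for /tmp/data.csv: the header and rows shown in the output above.
SAMPLE = """timestamp,latency,status,method,endpoint
2014-06-15 07:00:00.003772,122,200,GET,Interact3
2014-06-15 07:00:00.428897,144,200,GET,Interact3
2014-06-15 07:00:00.536486,48,200,GET,Interact3
2014-06-15 07:00:00.652760,28,405,GET,Interact2
2014-06-15 07:00:00.670100,103,200,GET,Interact3
"""

def count_statuses(path):
    """Stream a CSV extract row by row and tally the 'status' column."""
    counts = collections.Counter()
    with open(path) as datafile:
        # DictReader yields one row at a time, keyed by the header line,
        # so only a single row is held in memory at any point.
        for row in csv.DictReader(datafile):
            counts[row['status']] += 1
    return counts

# Hypothetical temporary file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w') as f:
    f.write(SAMPLE)

status_counts = count_statuses(sample_path)
os.remove(sample_path)
print(dict(status_counts))
```

Memory use stays roughly constant regardless of file size, since the counter grows only with the number of distinct status values.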


# Extracting Results to a GCS Bucket

Instead of streaming results into the local VM, it might be more useful to stream the results into GCS, and then read from GCS as needed.

For the purposes of this sample, the code uses the current project ID to build a reference to a writable GCS bucket, specifically the same bucket that contains these sample notebooks. For this job, we'll have BigQuery write its output using 2 workers by passing two destination patterns.

In [7]:
gcs_bucket_name = gcp.Context.default().project_id + '-datalab'
gcs_bucket = gcs.Bucket(gcs_bucket_name)
gcs_path_prefix = 'gs://%s/data/logs' % gcs_bucket_name

In [8]:
if not gcs_bucket.exists():
    gcs_bucket.create()
bq.Query(data).extract([gcs_path_prefix + '-1-*.csv', gcs_path_prefix + '-2-*.csv'])

Job job_HwYXcx7B-gXzYNkJCQXNvqFUKP4 completed

Once the job is complete, we can list the objects in the GCS bucket to find the files that were written.

In [9]:
map(lambda item: item.key, list(gcs_bucket.items(prefix='data/')))

[u'data/logs-1-000000000000.csv',
 u'data/logs-1-000000000001.csv',
 u'data/logs-2-000000000000.csv']
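
The object names above suggest how the destination patterns are expanded: the `*` in each pattern is replaced with a zero-padded, 12-digit shard number, starting at zero. The sketch below reproduces that naming scheme locally. `shard_name` is a hypothetical helper for illustration, not part of the `gcp` library, and the number of shards per pattern is decided by BigQuery, not by the caller.

```python
def shard_name(pattern, shard_index):
    """Replace the single '*' in an extract pattern with a 12-digit shard number."""
    if pattern.count('*') != 1:
        raise ValueError('expected exactly one wildcard in %r' % pattern)
    return pattern.replace('*', '%012d' % shard_index)

# Rebuild the three object keys listed above: two shards for the first
# pattern and one shard for the second.
names = [shard_name('data/logs-1-*.csv', i) for i in range(2)]
names.append(shard_name('data/logs-2-*.csv', 0))
print(names)
```

A helper like this can be handy for predicting which keys to look for, but listing the bucket, as done above, remains the reliable way to discover what was actually written.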