# UDFs using Code in Cloud Storage

This notebook shows you how to use JavaScript UDFs (user-defined functions) in Google BigQuery that reference JavaScript code stored in Google Cloud Storage. Storing your UDF support code in Cloud Storage lets you reuse well-tested code and share it across multiple notebooks.

Before using this tutorial, you should go through the [UDFs in BigQuery](notebooks/datalab/tutorials/BigQuery/UDFs%20in%20BigQuery.ipynb) tutorial, which discusses how to use UDFs in notebooks without external code, and the [UDF Testing in the Notebook](notebooks/datalab/tutorials/BigQuery/UDF%20Testing%20in%20the%20Notebook.ipynb) tutorial, which shows you how to run and test your JavaScript code in the notebook.

You can read more about UDFs in the [BigQuery documentation](https://cloud.google.com/bigquery/user-defined-functions).


## Scenario

This notebook repeats a scenario presented in other notebooks: looking at anonymized logs that originated in Google App Engine.

With BigQuery, it is possible to store your UDFs in Cloud Storage and reference them from there. Cloud Datalab takes a different approach: it requires that UDFs be defined in a notebook, not in Cloud Storage, but allows them to use support code stored in Cloud Storage. You can therefore factor out the bulk of your code into a JavaScript library in Cloud Storage and keep the UDF in the notebook as a schema specification plus a thin wrapper function around that code.

## Refactoring to a Thin Wrapper

Let's revisit an earlier UDF and look at how we can move its code to Cloud Storage. As mentioned, the UDF itself must be defined in the notebook, including the jsdoc comments that define the input and output schema, but its body can be a thin wrapper function. So let's start by refactoring the UDF into a thin wrapper and testing it.

In [1]:
%%bigquery udf -m extract_params

/**
 * A helper function to split a set of URL query parameters into an object
 * as key/value properties.
 */
function getParameters(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = {};
  var match;
  while ((match = re.exec(path)) != null) {
    result[match[1]] = decodeURIComponent(match[2]);
  }
  return result;  
}

/**
 * The main part of the original UDF is now factored out into 
 * this function.
 */
function extractParams(r, emitFn) {
  var q = getParameters(r.path);
  var split = r.path.indexOf('?');
  r.event = r.path.substr(5, split - 5);
  r.project = q.project;
  r.instance = q.instance;
  r.user = q.user;
  r.page = q.page;
  r.path = q.path;
  r.version = q.version;
  r.release = q.release;
  emitFn(r);
}

/**
 * Our thin wrapper UDF function, which needs the jsdoc schema 
 * definition comments:
 *
 * @param {{timestamp: timestamp, method: string, status: integer, latency: float,
 *     path: string}} r
 * @param function({{timestamp: timestamp, method: string, status:integer, latency: float,
 *      path: string, event: string, project: string, instance: string, user: string,
 *      page: string, version: string, release: string}}) emitFn
 */
function(r, emitFn) {
  extractParams(r, emitFn);
}


In [2]:
%%sql 

SELECT * FROM extract_params([cloud-datalab-samples:appenginelogs.sample_logs_20151027])
LIMIT 5

timestamp,method,status,latency,path,event,project,instance,user,page,version,release
2015-10-27 01:03:10.959946,POST,204,0.003195,38,page,36,40,131,detail,0.1.1,alpha
2015-10-27 00:57:44.694484,POST,204,0.003418,3,start,143,215,2,master,0.1.1,alpha
2015-10-27 22:00:47.660171,POST,204,0.00337,3,signin,5,2,54,master,0.1.1,alpha
2015-10-27 20:10:19.547390,POST,204,0.00368,3,start,149,232,2,master,0.1.1,alpha
2015-10-27 01:24:18.065954,POST,204,0.003023,6,page,20,29,42,detail,0.1.1,alpha
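
Because the refactored functions are plain JavaScript, the same logic can also be exercised outside BigQuery before running a query. The following is a minimal sketch, not part of the original notebook: it repeats the two functions from the cell above and calls `extractParams` with a hypothetical record and a stand-in `emitFn` that collects rows in an array. The sample path is an assumption modeled on the first output row (a `/log/` prefix followed by the event name and a query string); the real log paths may differ.

```javascript
// Same helper as in the UDF cell: parse URL query parameters into an object.
function getParameters(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = {};
  var match;
  while ((match = re.exec(path)) != null) {
    result[match[1]] = decodeURIComponent(match[2]);
  }
  return result;
}

// Same record transformation as in the UDF cell.
function extractParams(r, emitFn) {
  var q = getParameters(r.path);
  var split = r.path.indexOf('?');
  r.event = r.path.substr(5, split - 5);
  r.project = q.project;
  r.instance = q.instance;
  r.user = q.user;
  r.page = q.page;
  r.path = q.path;
  r.version = q.version;
  r.release = q.release;
  emitFn(r);
}

// Hypothetical input record; the path format is an assumption.
var sampleRecord = {
  timestamp: '2015-10-27 01:03:10.959946',
  method: 'POST',
  status: 204,
  latency: 0.003195,
  path: '/log/page?project=36&instance=40&user=131&page=detail' +
        '&path=38&version=0.1.1&release=alpha'
};

// Stand-in for the emitFn that BigQuery supplies: collect emitted rows.
var emitted = [];
extractParams(sampleRecord, function (row) {
  emitted.push(row);
});

console.log(emitted[0].event, emitted[0].project, emitted[0].path);
```

Run with a JavaScript engine such as Node.js, this prints `page 36 38`. In this local run the parsed values are strings, and `extractParams` modifies the input record in place before emitting it.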


## Moving the Code to Cloud Storage

Now that the testing is done, let's create a file in Cloud Storage to hold the bulk of the code. We can do that from the notebook. The name of the bucket is project-dependent, so you will need to complete and execute this code yourself.

In [None]:
from datalab.context import Context
import datalab.storage as gs

code = """
/**
 * A helper function to split a set of URL query parameters into an object
 * as key/value properties.
 */
function getParameters(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = {};
  var match;
  while ((match = re.exec(path)) != null) {
    result[match[1]] = decodeURIComponent(match[2]);
  }
  return result;  
}

function extractParams(r, emitFn) {
  var q = getParameters(r.path);
  var split = r.path.indexOf('?');
  r.event = r.path.substr(5, split - 5);
  r.project = q.project;
  r.instance = q.instance;
  r.user = q.user;
  r.page = q.page;
  r.path = q.path;
  r.version = q.version;
  r.release = q.release;
  emitFn(r);
}
"""

# Get a bucket in the current project
project = Context.default().project_id
sample_bucket_name = project + '-datalab-udf-samples'

# Create the storage bucket and code library object
sample_bucket = gs.Bucket(sample_bucket_name)
sample_bucket.create()
sample_item = sample_bucket.item('udf_library.js')
sample_item.write_to(code, 'application/javascript')

# Print the URI of the library object to use in @import
print(sample_item.uri)

Once the code is copied to Cloud Storage, we can refer to it in the UDF jsdoc comment header using `@import`. You can have more than one `@import` if necessary. In the cell below, change the `@import` line to refer to your project; you can use the URI printed by the cell above as its argument.

In [4]:
%%bigquery udf -m externalized_udf

/**
 * The next line imports the code from Cloud Storage. Replace
 * YOUR-PROJECT-NAME-HERE in the bucket name with your project ID.
 *
 * @import gs://YOUR-PROJECT-NAME-HERE-datalab-udf-samples/udf_library.js
 * @param {{timestamp: timestamp, method: string, status: integer, latency: float,
 *     path: string}} r
 * @param function({{timestamp: timestamp, method: string, status:integer, latency: float,
 *      path: string, event: string, project: string, instance: string, user: string,
 *      page: string, version: string, release: string}}) emitFn
 */
function(r, emitFn) {
  extractParams(r, emitFn);
}

And now we can test it:

In [5]:
%%sql 

SELECT * FROM externalized_udf([cloud-datalab-samples:appenginelogs.sample_logs_20151027])
LIMIT 5

timestamp,method,status,latency,path,event,project,instance,user,page,version,release
2015-10-27 01:03:10.959946,POST,204,0.003195,38,page,36,40,131,detail,0.1.1,alpha
2015-10-27 00:57:44.694484,POST,204,0.003418,3,start,143,215,2,master,0.1.1,alpha
2015-10-27 22:00:47.660171,POST,204,0.00337,3,signin,5,2,54,master,0.1.1,alpha
2015-10-27 20:10:19.547390,POST,204,0.00368,3,start,149,232,2,master,0.1.1,alpha
2015-10-27 01:24:18.065954,POST,204,0.003023,6,page,20,29,42,detail,0.1.1,alpha


## Cleaning Up

Since this is a tutorial, we should clean up the objects we created in Cloud Storage. The library object is deleted first because a bucket must be empty before it can be deleted.

In [6]:
sample_item.delete()
sample_bucket.delete()