# UDFs in BigQuery

This notebook shows how to use JavaScript UDFs (user-defined functions) in BigQuery. UDFs let you operate on one row of a table at a time, performing arbitrary transformations. A UDF is similar to the "Map" function in MapReduce: it takes a single row as input and produces zero or more rows as output. The output can have a different schema than the input.

Note that your UDFs should be stateless: do not rely on global state through which the output of one call can affect the result of a later call. This is not because persistent state cannot exist (it can; the function itself and any functions it calls are examples of that), but because your computation is distributed across multiple nodes, so you cannot guarantee that such state is consistent across them.
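
To make the risk concrete, here is a minimal plain-JavaScript sketch that simulates it outside BigQuery. Everything in it (`makeWorker`, `run`, the `seq` field, and the round-robin assignment of rows to workers) is hypothetical and only stands in for the way rows might be spread across nodes; it is not how BigQuery schedules work.

```javascript
// Hypothetical simulation: each "worker" stands in for one node that has its
// own copy of the UDF's state.
function makeWorker() {
  var counter = 0; // state local to this worker only
  return function statefulUdf(r, emitFn) {
    counter += 1;
    emitFn({ path: r.path, seq: counter });
  };
}

// Sends row i to worker i % workers.length and collects the emitted rows.
function run(rows, workers) {
  var out = [];
  rows.forEach(function (row, i) {
    workers[i % workers.length](row, function (result) { out.push(result); });
  });
  return out;
}

var rows = [{ path: '/a' }, { path: '/b' }, { path: '/c' }, { path: '/d' }];

var oneNode = run(rows, [makeWorker()]);
var twoNodes = run(rows, [makeWorker(), makeWorker()]);

console.log(oneNode.map(function (r) { return r.seq; }));  // [ 1, 2, 3, 4 ]
console.log(twoNodes.map(function (r) { return r.seq; })); // [ 1, 1, 2, 2 ]
```

The same function produces different "row numbers" depending on how the rows are split, which is why a UDF's output should depend only on the row it is given.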

As an aside, you may sometimes see the term TVF (table-valued function) in BigQuery error messages that actually concern your UDF.

You can read more about UDFs in the [BigQuery user-defined functions documentation](https://cloud.google.com/bigquery/user-defined-functions).


## Scenario

In this notebook we are going to look at some anonymized logs that originated in Google App Engine. These logs include the path of the URI that was requested, and this path contains a number of query parameters we are interested in. We will create a UDF that extracts the query parameters and produces a table with each parameter in a separate column, which makes the logs much easier to use.


## Examining the Data

We are going to look at a week of logs from October 2015. These logs were imported from Google App Engine, with a few relevant fields extracted and anonymized. There is a separate table for each day. Let's look at an example, starting with the schema and then some sample rows:

In [1]:
%bigquery schema --table cloud-datalab-samples:appenginelogs.sample_logs_20151027

In [2]:
%bigquery sample --count 5 --table cloud-datalab-samples:appenginelogs.sample_logs_20151027

timestamp,method,status,latency,path
2015-10-27 01:03:10.959946,POST,204,0.003195,/log/page?project=36&instance=40&user=131&page=detail&path=38&version=0.1.1&release=alpha
2015-10-27 00:57:44.694484,POST,204,0.003418,/log/start?project=143&instance=215&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 22:00:47.660171,POST,204,0.00337,/log/signin?project=5&instance=2&user=54&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 20:10:19.547390,POST,204,0.00368,/log/start?project=149&instance=232&user=2&page=master&path=3&version=0.1.1&release=alpha
2015-10-27 01:24:18.065954,POST,204,0.003023,/log/page?project=20&instance=29&user=42&page=detail&path=6&version=0.1.1&release=alpha


You can see we have five columns, with the `path` column being the one that needs the most processing. Each path has the form `/log/<event>?<params>`, where the parameters are `key=value` pairs whose keys are project, instance, user, page, path, version, and release. We are going to extract the event and each of these parameters into a separate column.

## Creating and Testing the UDF

UDFs are JavaScript functions that take a row object and an emitter function as input; they perform some computation and then call the emitter function to output a result row object. If you have read the standard BigQuery documentation on UDFs, you will have seen the `defineFunction` call that is needed to define the UDF, including its input fields, output schema, and so on. Cloud Datalab is a bit simpler: we use jsdoc-style `@param` comments to achieve the same end. Another important thing to note is that the UDF should not have a name or be assigned to a variable. It should ideally be stateless, but it can call support functions as demonstrated below.
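
The row-and-emitter contract can be tried out in plain JavaScript before involving BigQuery. The sketch below is an illustration under stated assumptions: `applyUdf`, `onlyPosts`, and `explodeParams` are hypothetical names, and the functions are assigned to variables only so they can be called locally (in a Datalab UDF cell the function stays anonymous, as noted above). It shows one function that emits zero or one row per input, and one that emits several rows with a different schema.

```javascript
// Hypothetical local driver: calls a UDF-style function once per row and
// collects everything passed to the emitter.
function applyUdf(udf, rows) {
  var out = [];
  rows.forEach(function (r) {
    udf(r, function (result) { out.push(result); });
  });
  return out;
}

// Emits zero rows for non-POST requests and one row otherwise (a filter).
var onlyPosts = function (r, emitFn) {
  if (r.method === 'POST') {
    emitFn({ path: r.path });
  }
};

// Emits one row per query parameter, using a different schema (key, value).
var explodeParams = function (r, emitFn) {
  var query = r.path.split('?')[1] || '';
  query.split('&').forEach(function (pair) {
    if (pair === '') { return; } // no query string, nothing to emit
    var kv = pair.split('=');
    emitFn({ key: kv[0], value: kv[1] });
  });
};

var rows = [
  { method: 'POST', path: '/log/start?project=143&user=2' },
  { method: 'GET', path: '/log/page' }
];

console.log(applyUdf(onlyPosts, rows).length); // 1
console.log(applyUdf(explodeParams, rows));
// [ { key: 'project', value: '143' }, { key: 'user', value: '2' } ]
```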

In [3]:
%%bigquery udf --module extract_params

/**
 * A helper function to split a set of URL query parameters into an object
 * as key/value properties.
 */
function getParameters(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = {};
  var match;
  while ((match = re.exec(path)) != null) {
    result[match[1]] = decodeURIComponent(match[2]);
  }
  return result;  
}

/**
 * Our UDF function, which takes a row r and emitter function emitFn.
 * We assume each row r has the five columns from our input (timestamp,
 * method, status, latency and path). We will parse path from the input
 * and add its constituent parts, then call the emitter.
 *
 * Note: we re-use r for the output as we are keeping its fields but we
 * could have created a new object if that was more appropriate.
 *
 * We define the two parameters below and specify the schema of the input row and
 * the output row.
 *
 * @param {{timestamp: timestamp, method: string, status: integer, latency: float,
 *     path: string}} r
 * @param function({{timestamp: timestamp, method: string, status:integer, latency: float,
 *      path: string, event: string, project: string, instance: string, user: string,
 *      page: string, version: string, release: string}}) emitFn
 */
function(r, emitFn) {
  var q = getParameters(r.path);
  var split = r.path.indexOf('?');
  r.event = r.path.substr(5, split - 5);
  r.project = q.project;
  r.instance = q.instance;
  r.user = q.user;
  r.page = q.page;
  r.path = q.path;  // replaces the full request path with its 'path' query parameter
  r.version = q.version;
  r.release = q.release;
  emitFn(r);
}
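
Before running the UDF in BigQuery, the same logic can be exercised locally, for example in Node.js. The sketch below copies the helper and the UDF body from the cell above; the name `extractParams` and the `emitted` array are assumptions added only so the function can be called outside Datalab, and the input row is the first sample row shown earlier.

```javascript
// Copy of the helper from the UDF cell.
function getParameters(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = {};
  var match;
  while ((match = re.exec(path)) != null) {
    result[match[1]] = decodeURIComponent(match[2]);
  }
  return result;
}

// Same body as the UDF, but given a (hypothetical) name for local use.
function extractParams(r, emitFn) {
  var q = getParameters(r.path);
  var split = r.path.indexOf('?');
  r.event = r.path.substr(5, split - 5); // text between '/log/' and '?'
  r.project = q.project;
  r.instance = q.instance;
  r.user = q.user;
  r.page = q.page;
  r.path = q.path;
  r.version = q.version;
  r.release = q.release;
  emitFn(r);
}

// A collecting emitter stands in for BigQuery's emitter function.
var emitted = [];
extractParams({
  timestamp: '2015-10-27 01:03:10.959946',
  method: 'POST',
  status: 204,
  latency: 0.003195,
  path: '/log/page?project=36&instance=40&user=131&page=detail&path=38&version=0.1.1&release=alpha'
}, function (row) { emitted.push(row); });

console.log(emitted[0].event, emitted[0].project, emitted[0].path);
// page 36 38
```

Note that all extracted values are strings here, which matches the output schema declared in the `@param` comment.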


Now we can try calling the UDF from a query. We call it much like any other function, but it must be applied to a tabular argument: a table or table decorator, a query, or another UDF.

In [4]:
%%sql 

SELECT * FROM extract_params([cloud-datalab-samples:appenginelogs.sample_logs_20151027])
LIMIT 5

timestamp,method,status,latency,path,event,project,instance,user,page,version,release
2015-10-27 01:03:10.959946,POST,204,0.003195,38,page,36,40,131,detail,0.1.1,alpha
2015-10-27 00:57:44.694484,POST,204,0.003418,3,start,143,215,2,master,0.1.1,alpha
2015-10-27 22:00:47.660171,POST,204,0.00337,3,signin,5,2,54,master,0.1.1,alpha
2015-10-27 20:10:19.547390,POST,204,0.00368,3,start,149,232,2,master,0.1.1,alpha
2015-10-27 01:24:18.065954,POST,204,0.003023,6,page,20,29,42,detail,0.1.1,alpha
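
The sample rows above are all well formed. In standard JavaScript, `decodeURIComponent` throws a `URIError` when it meets malformed percent-encoding (for example a lone `%`), so a single bad log line could make the helper throw. How such an error surfaces in a BigQuery query is not covered here; the sketch below is one possible defensive variant, where `safeDecode` and `getParametersSafe` are hypothetical names and falling back to the raw value is an assumption about what is desirable.

```javascript
// Decodes a query-parameter value, falling back to the raw text if the
// percent-encoding is malformed.
function safeDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (e) {
    return value;
  }
}

// Variant of getParameters that uses safeDecode instead of decoding directly.
function getParametersSafe(path) {
  var re = /[?&]([^=]*)=([^&]*)/g;
  var result = {};
  var match;
  while ((match = re.exec(path)) != null) {
    result[match[1]] = safeDecode(match[2]);
  }
  return result;
}

console.log(safeDecode('a%20b')); // a b
console.log(safeDecode('100%'));  // 100%
console.log(getParametersSafe('/log/page?user=100%&page=detail'));
// { user: '100%', page: 'detail' }
```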


## Next Steps

You can learn how to test your UDF in the notebook by following the tutorial [UDF Testing in the Notebook](notebooks/datalab/tutorials/BigQuery/UDF%20Testing%20in%20the%20Notebook.ipynb). If you have code that you use regularly in your UDFs, you can factor it out, put it in Google Cloud Storage, and import it; this is covered in the tutorial [UDFs using Code in Cloud Storage](notebooks/datalab/tutorials/BigQuery/UDFs%20using%20Code%20in%20Cloud%20Storage.ipynb).