# Higher-Order Functions and SQL UDFs

Here is the schema for the bookstore dataset used in this notebook:

![bookstore dataset schema](../Includes/images/image1.png)

In [0]:
-- Dataset containing three tables: customers, orders, and books
%run ../Includes/Copy-Datasets

In [0]:
SELECT * FROM orders

The `books` column has a complex data type: it is an array of structs. To work directly with such a type, **higher-order functions** are needed.

**Higher-order functions** let you work directly with hierarchical data such as arrays and maps.
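
To make "array of structs" concrete, the sketch below builds a literal value of that shape and reads it with ordinary element access. The name `sample_orders`, the field names `book_id`, `quantity`, and `subtotal`, and all values are made up for illustration; the real `books` column may differ.

```sql
-- Hypothetical inline data: one order holding an array of two structs.
-- Field names and values are assumptions, not taken from the dataset.
WITH sample_orders AS (
  SELECT
    'O1' AS order_id,
    array(
      named_struct('book_id', 'B01', 'quantity', 1, 'subtotal', 25),
      named_struct('book_id', 'B02', 'quantity', 3, 'subtotal', 90)
    ) AS books
)
SELECT
  order_id,
  size(books)      AS book_count,     -- number of elements in the array
  books[0].book_id AS first_book_id,  -- index into the array, then into the struct
  books.quantity   AS all_quantities  -- a field taken from an array of structs is itself an array
FROM sample_orders;
```

Plain element access like this only reaches fixed positions or whole fields. Applying a condition or a calculation to every element requires a lambda, which is what higher-order functions provide.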

## FILTER Function

`FILTER()` filters an array using a given lambda function, keeping only the elements for which the lambda returns true.

The query below creates a new column (`multiple_copies`) by filtering the `books` column to keep only the books with a quantity greater than or equal to 2, that is, books bought in multiple copies.

In [0]:
SELECT order_id, books,
  FILTER(books, i -> i.quantity >= 2) AS multiple_copies
FROM orders

A new column (`multiple_copies`) is created, containing an array with only the filtered elements. Many empty arrays also appear in this new column. In this case, it is useful to add a `WHERE` clause so that only non-empty arrays are shown in the returned column.

This can be done with a subquery (a query nested inside another query), so that the `WHERE` clause can be applied to the size of the returned column:

In [0]:
SELECT order_id, multiple_copies
FROM (
  SELECT order_id, FILTER(books, i -> i.quantity >= 2) AS multiple_copies
  FROM orders
)
WHERE size(multiple_copies) > 0;

The empty arrays no longer appear in the result.
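
As a self-contained illustration of the same pattern, the sketch below runs `FILTER` on inline data, so it does not depend on the `orders` table. The name `sample_orders`, its field names, and its values are hypothetical. It also uses `exists`, another Spark SQL higher-order function, as one possible way to drop the unwanted rows without a subquery.

```sql
-- Hypothetical inline data: O1 contains a book bought in 3 copies, O2 does not.
WITH sample_orders AS (
  SELECT 'O1' AS order_id,
         array(named_struct('book_id', 'B01', 'quantity', 1),
               named_struct('book_id', 'B02', 'quantity', 3)) AS books
  UNION ALL
  SELECT 'O2' AS order_id,
         array(named_struct('book_id', 'B03', 'quantity', 1)) AS books
)
SELECT order_id,
       -- keep only the elements whose quantity is at least 2
       FILTER(books, b -> b.quantity >= 2) AS multiple_copies
FROM sample_orders
-- keep only the rows where at least one element satisfies the same condition
WHERE exists(books, b -> b.quantity >= 2);
```

With this sample data, only order `O1` should be returned, and its `multiple_copies` array should hold the single `B02` element. The trade-off is that the condition is written twice, whereas the subquery form states it once.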

## TRANSFORM Function

`TRANSFORM()` applies a transformation to every item in an array and returns an array of the transformed values.

In [0]:
SELECT order_id, books, TRANSFORM(
  books, b -> CAST(b.subtotal * 0.8 AS INT)
) AS subtotal_after_discount
FROM orders

In this example, a 20% discount is applied to the `subtotal` value of each book in the `books` array. The new column `subtotal_after_discount` contains an array with one transformed value for each element of the `books` array.
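
The sketch below repeats the transformation on inline data and adds a variant whose lambda returns a struct, so each discounted value stays next to its book identifier. The name `sample_orders`, the field names `book_id` and `discounted`, and the values are assumptions made for illustration.

```sql
-- Hypothetical inline data with two books in one order.
WITH sample_orders AS (
  SELECT 'O1' AS order_id,
         array(named_struct('book_id', 'B01', 'subtotal', 25),
               named_struct('book_id', 'B02', 'subtotal', 90)) AS books
)
SELECT order_id,
       -- one integer per element: 25 * 0.8 = 20 and 90 * 0.8 = 72
       TRANSFORM(books, b -> CAST(b.subtotal * 0.8 AS INT)) AS subtotal_after_discount,
       -- same calculation, but the lambda returns a struct instead of a bare number
       TRANSFORM(books, b -> named_struct(
         'book_id', b.book_id,
         'discounted', CAST(b.subtotal * 0.8 AS INT)
       )) AS books_after_discount
FROM sample_orders;
```

Returning a struct is worth considering when the transformed values need to be matched back to their books later, since a bare array of numbers relies on position alone.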

## User Defined Functions (UDFs)

These functions let you register a custom combination of SQL logic as a function in a database, making it reusable in any SQL query. Additionally, SQL UDFs leverage Spark SQL directly, preserving all of Spark's optimizations when the custom logic is applied to large datasets.

A UDF requires:
* A function name
* Optional parameters
* Type to be returned
* Some custom logic

In [0]:
-- This function receives an email (string), splits it at '@', takes the second element (the domain name)
-- and prepends 'http://www.' to it
CREATE OR REPLACE FUNCTION get_url(email STRING)
RETURNS STRING
RETURN concat("http://www.", split(email, "@")[1])

In [0]:
-- Applying the created UDF to get the URL from the 'customers' table
SELECT email, get_url(email) domain
FROM customers
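
A quick way to check a UDF is to call it on a literal value. The sketch below assumes `get_url` has already been created by the cell above; the email address is a made-up example.

```sql
-- Assumes get_url already exists in the current database.
-- The address is hypothetical.
SELECT get_url('jane.doe@example.org') AS url;  -- http://www.example.org
```

An address without an `@` has no second element after the split, so the result depends on how the session handles out-of-range array indexes (a `NULL` or an error). Validating the input first may be worth considering for real data.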

Note: UDFs are permanent objects that are persisted to the database, so they can be used in different Spark sessions and notebooks.

In [0]:
DESCRIBE FUNCTION get_url

In [0]:
DESCRIBE FUNCTION EXTENDED get_url

More complex logic can be applied in a UDF. For example:

In [0]:
-- Applying a standard SQL 'CASE ... WHEN' expression to evaluate multiple conditions
-- (OR REPLACE lets the cell be re-run without failing because the function already exists)
CREATE OR REPLACE FUNCTION site_type(email STRING)
RETURNS STRING
RETURN CASE
  WHEN email LIKE "%.com" THEN "Commercial business"
  WHEN email LIKE "%.org" THEN "Non-profit organization"
  WHEN email LIKE "%.edu" THEN "Educational institution"
  ELSE concat("Unknown extension for domain ", split(email, "@")[1])
END

In [0]:
SELECT email, site_type(email) as domain_category
FROM customers
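
To see which branch of the `CASE` expression each kind of address reaches, the function can be run against a small inline table. This sketch assumes `site_type` has already been created by the cell above; the alias `t` and the three addresses are hypothetical.

```sql
-- Assumes site_type already exists in the current database.
-- The addresses are made-up examples.
SELECT email, site_type(email) AS domain_category
FROM VALUES
  ('ann@shop.com'),    -- matches the '%.com' branch
  ('bob@campus.edu'),  -- matches the '%.edu' branch
  ('cy@agency.gov')    -- matches no branch, so it falls through to ELSE
AS t(email);
```

`LIKE` compares case-sensitively, so an address such as `ANN@SHOP.COM` would also reach the `ELSE` branch. Lower-casing the input inside the function is one option if that matters.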

Everything is evaluated natively in Spark, so it is optimized for parallel execution.

In [0]:
-- Dropping UDFs
DROP FUNCTION get_url;
DROP FUNCTION site_type;
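
A plain `DROP FUNCTION` fails when the function does not exist, for example if the cleanup cell is run twice. A minimal sketch of a cleanup that can be re-run, using the same two function names:

```sql
-- IF EXISTS turns the drop into a no-op when the function is already gone.
DROP FUNCTION IF EXISTS get_url;
DROP FUNCTION IF EXISTS site_type;
```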