<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# User-Defined Functions

##### Objectives
1. Define a function
1. Create and apply a UDF
1. Register the UDF to use in SQL
1. Create and register a UDF with Python decorator syntax
1. Create and apply a Pandas (vectorized) UDF

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.udf.html?#pyspark.sql.functions.udf" target="_blank">UDF Registration (`spark.udf`)</a>: `register`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>: `udf`
- <a href="https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#use-udf-with-dataframes" target="_blank">Python UDF Decorator</a>: `@udf`
- <a href="https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html#pandas-user-defined-functions" target="_blank">Pandas UDF Decorator</a>: `@pandas_udf`

### User-Defined Function (UDF)
A custom column transformation function

- Can’t be optimized by Catalyst Optimizer
- Function is serialized and sent to executors
- Row data is deserialized from Spark's native binary format to pass to the UDF, and the results are serialized back into Spark's native format
- For Python UDFs, there is additional interprocess communication overhead between the executor and a Python interpreter running on each worker node
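
To make the first bullet concrete: when a built-in function can express the same logic, Catalyst can optimize it and no Python UDF is involved. The sketch below is one possible built-in equivalent of a first-letter UDF, using `substring` from `pyspark.sql.functions`. The helper name `firstLetterBuiltin` and the `firstLetter` alias are illustrative only and not part of the course material. The import sits inside the helper only so that the definition can be loaded without a Spark installation.

```python
def firstLetterBuiltin(df, columnName="email"):
    """Return a DataFrame with the first character of columnName.

    Hypothetical helper: it uses only built-in column functions, so no
    Python UDF is involved and Catalyst can optimize the expression.
    """
    # Imported lazily so this definition loads even where PySpark is absent.
    from pyspark.sql.functions import col, substring

    # substring positions are 1-based: start at 1, take 1 character.
    return df.select(substring(col(columnName), 1, 1).alias("firstLetter"))
```

Once the sales data has been loaded into `salesDF`, this could be called as `display(firstLetterBuiltin(salesDF))`.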

In [0]:
%run ./Includes/Classroom-Setup

For this demo, we're going to use the sales data.

In [0]:
salesDF = spark.read.parquet(salesPath)
display(salesDF)

### Define a function

Define a function (on the driver) to get the first letter of a string from the `email` field.

In [0]:
def firstLetterFunction(email):
    return email[0]

firstLetterFunction("annagray@kaufman.com")
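
One thing to keep in mind before wrapping this function as a UDF: Spark passes SQL `NULL` values to a Python UDF as `None`, and `email[0]` raises an error for `None` or for an empty string. The following is a minimal sketch of a defensive variant. The name `safeFirstLetter` is hypothetical, and this notebook does not say whether the sales data actually contains missing emails.

```python
from typing import Optional


def safeFirstLetter(email: Optional[str]) -> Optional[str]:
    """Return the first character of email, or None if there is none."""
    # None (a SQL NULL) and "" are both falsy, so both map to None.
    if not email:
        return None
    return email[0]


print(safeFirstLetter("annagray@kaufman.com"))  # a
print(safeFirstLetter(None))                    # None
print(safeFirstLetter(""))                      # None
```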

### Create and apply UDF
Wrap the function as a UDF. Spark serializes the function and sends it to the executors so that it can transform DataFrame records. Because no return type is given here, the UDF returns a string column by default.

In [0]:
from pyspark.sql.functions import udf
firstLetterUDF = udf(firstLetterFunction)

Apply the UDF on the `email` column.

In [0]:
from pyspark.sql.functions import col

display(salesDF.select(firstLetterUDF(col("email"))))

### Register UDF to use in SQL
Register the UDF using `spark.udf.register` to also make it available for use in the SQL namespace.

In [0]:
salesDF.createOrReplaceTempView("sales")

firstLetterUDF = spark.udf.register("sql_udf", firstLetterFunction)

In [0]:
# You can still apply the UDF from Python
display(salesDF.select(firstLetterUDF(col("email"))))

In [0]:
%sql
-- You can now also apply the UDF from SQL
SELECT sql_udf(email) AS firstLetter FROM sales

### Use Decorator Syntax (Python Only)

Alternatively, you can define and register a UDF using <a href="https://realpython.com/primer-on-python-decorators/" target="_blank">Python decorator syntax</a>. The `@udf` decorator parameter is the Column datatype the function returns.

Because the decorator rebinds the name `firstLetterUDF` to the UDF itself, you can no longer call it as a plain Python function on a string (for example, `firstLetterUDF("annagray@kaufman.com")` will not work).

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> This example also uses <a href="https://docs.python.org/3/library/typing.html" target="_blank">Python type hints</a>, which were introduced in Python 3.5. Type hints are not required for this example, but instead serve as "documentation" to help developers use the function correctly. They are used in this example to emphasize that the UDF processes one record at a time, taking a single `str` argument and returning a `str` value.

In [0]:
# Takes a str and returns a str; "string" declares the return type
@udf("string")
def firstLetterUDF(email: str) -> str:
    return email[0]

And let's use our decorator UDF here.

In [0]:
from pyspark.sql.functions import col

salesDF = spark.read.parquet("/mnt/training/ecommerce/sales/sales.parquet")
display(salesDF.select(firstLetterUDF(col("email"))))
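
If you still need to call the plain Python function (for unit tests, say), one option is to skip the decorator and keep the two names separate. This is a sketch of that pattern, not part of the original lesson. `firstLetterPlain` and `makeFirstLetterUDF` are hypothetical names, and the import is placed inside the factory only so the plain function can be used without PySpark installed.

```python
def firstLetterPlain(email: str) -> str:
    """Plain Python function; still callable locally on a string."""
    return email[0]


def makeFirstLetterUDF():
    """Wrap firstLetterPlain as a Spark UDF that returns a string column."""
    # Lazy import: only needed when a UDF is actually requested.
    from pyspark.sql.functions import udf

    return udf(firstLetterPlain, "string")


# The undecorated function works on ordinary Python values.
print(firstLetterPlain("annagray@kaufman.com"))  # a
```

In a Spark session, `salesDF.select(makeFirstLetterUDF()(col("email")))` would then apply the wrapped version to the `email` column.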

### Pandas/Vectorized UDFs

Pandas UDFs, available in Python since Spark 2.3, improve the efficiency of UDFs. They use Apache Arrow to transfer data and Pandas to operate on a whole batch of values at a time.

* <a href="https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html" target="_blank">Blog post</a>
* <a href="https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html?highlight=arrow" target="_blank">Documentation</a>

<img src="https://databricks.com/wp-content/uploads/2017/10/image1-4.png" alt="Benchmark" width ="500" height="1500">

The user-defined functions are executed using: 
* <a href="https://arrow.apache.org/" target="_blank">Apache Arrow</a>, an in-memory columnar data format that is used in Spark to efficiently transfer data between JVM and Python processes with near-zero (de)serialization cost
* Pandas inside the function, to work with Pandas instances and APIs

<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> As of Spark 3.0, you should **always** define your Pandas UDF using Python type hints.

In [0]:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Takes and returns a pd.Series of strings; "string" declares the return type
@pandas_udf("string")
def vectorizedUDF(email: pd.Series) -> pd.Series:
    return email.str[0]

# Alternatively
# def vectorizedUDF(email: pd.Series) -> pd.Series:
#     return email.str[0]
# vectorizedUDF = pandas_udf(vectorizedUDF, "string")

In [0]:
display(salesDF.select(vectorizedUDF(col("email"))))
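
The practical difference from a row-at-a-time UDF is how often the Python function is invoked: once per value versus once per batch of values. The following plain-Python sketch imitates that, with lists standing in for `pd.Series` batches. It does not use Spark, Arrow, or Pandas. The batch size of 3, the sample emails after the first, and all function names are made up for illustration.

```python
from typing import List

callCounts = {"row": 0, "batch": 0}


def firstLetterRow(email: str) -> str:
    """Row-at-a-time style: one call handles one value."""
    callCounts["row"] += 1
    return email[0]


def firstLetterBatch(emails: List[str]) -> List[str]:
    """Vectorized style: one call handles a whole batch of values."""
    callCounts["batch"] += 1
    return [email[0] for email in emails]


emails = ["annagray@kaufman.com", "bob@example.com", "carol@example.com",
          "dave@example.com", "erin@example.com"]

# Row-at-a-time: the function is invoked once for every element.
rowResult = [firstLetterRow(email) for email in emails]

# Batched: split into chunks of at most 3 and invoke once per chunk.
batchSize = 3
batchResult = []
for start in range(0, len(emails), batchSize):
    batchResult.extend(firstLetterBatch(emails[start:start + batchSize]))

print(rowResult == batchResult)  # True
print(callCounts)                # {'row': 5, 'batch': 2}
```

Both styles produce the same values. The batched version simply makes fewer calls into Python, and that is where a vectorized UDF saves time.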

We can also register these Pandas UDFs to the SQL namespace.

In [0]:
spark.udf.register("sql_vectorized_udf", vectorizedUDF)

In [0]:
%sql
-- Use the Pandas UDF from SQL
SELECT sql_vectorized_udf(email) AS firstLetter FROM sales

# Sort Day Lab

##### Tasks
1. Define a UDF to label the day of week
1. Apply the UDF to label and sort by day of week
1. Plot active users by day of week as a bar graph

Start with a DataFrame of the average number of active users by day of week.

This was the resulting `df` in a previous lab.

In [0]:
from pyspark.sql.functions import approx_count_distinct, avg, col, date_format, to_date

df = (spark
      .read
      .parquet(eventsPath)
      .withColumn("ts", (col("event_timestamp") / 1e6).cast("timestamp"))
      .withColumn("date", to_date("ts"))
      .groupBy("date").agg(approx_count_distinct("user_id").alias("active_users"))
      .withColumn("day", date_format(col("date"), "E"))
      .groupBy("day").agg(avg(col("active_users")).alias("avg_users"))
     )

display(df)

### 1. Define UDF to label day of week

Use the **`labelDayOfWeek`** function provided below to create the UDF **`labelDowUDF`**.

In [0]:
def labelDayOfWeek(day: str) -> str:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow.get(day) + "-" + day
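
The numeric prefix exists because day abbreviations sort alphabetically, not chronologically. Here is a quick plain-Python check of that reasoning. The function is repeated so the snippet runs on its own, and the `days` list is only sample input for the demonstration.

```python
def labelDayOfWeek(day: str) -> str:
    # Same function as in the lab, repeated to keep this snippet standalone.
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow.get(day) + "-" + day


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Sorting the raw abbreviations gives alphabetical order.
print(sorted(days))
# ['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed']

# Sorting the labeled values gives Monday-to-Sunday order.
print(sorted(labelDayOfWeek(day) for day in days))
# ['1-Mon', '2-Tue', '3-Wed', '4-Thu', '5-Fri', '6-Sat', '7-Sun']
```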

In [0]:
# ANSWER
labelDowUDF = spark.udf.register("labelDow", labelDayOfWeek)

### 2. Apply UDF to label and sort by day of week
- Update the **`day`** column by applying the UDF and replacing this column
- Sort by **`day`**
- Plot as a bar graph

In [0]:
# ANSWER
finalDF = (df
           .withColumn("day", labelDowUDF(col("day")))
           .sort("day")
          )
display(finalDF)

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>