# User-Defined Functions
While Apache Spark has a plethora of built-in functions, it also lets data engineers
and data scientists define their own. These are known as user-defined functions
(UDFs).

## Spark SQL UDFs
The benefit of creating your own PySpark or Scala UDFs is that you (and others) will
be able to make use of them within Spark SQL itself. For example, a data scientist can
wrap an ML model within a UDF so that a data analyst can query its predictions in
Spark SQL without necessarily understanding the internals of the model.
Here’s a simplified example of creating a Spark SQL UDF. Note that UDFs operate per
session and they will not be persisted in the underlying metastore:

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

In [3]:
spark = SparkSession.builder.appName("udf").getOrCreate()


In [5]:
def cubeFunc(val):
    return val ** 3

In [6]:
spark.udf.register("cubed", cubeFunc, LongType())

<function __main__.cubeFunc(val)>

In [7]:
spark.range(1,10).createOrReplaceTempView("udf_test")

In [9]:
spark.sql("select id, cubed(id) from udf_test").show()

+---+---------+
| id|cubed(id)|
+---+---------+
|  1|        1|
|  2|        8|
|  3|       27|
|  4|       64|
|  5|      125|
|  6|      216|
|  7|      343|
|  8|      512|
|  9|      729|
+---+---------+

In [10]:
def rectangle_area(w,h):
    return w * h

In [11]:
spark.sql("select id as width, cubed(id) as height from udf_test").createOrReplaceTempView("rectangle_dimensions")

In [12]:
spark.udf.register("calc_rectangle_area",rectangle_area, LongType())

<function __main__.rectangle_area(w, h)>

In [13]:
spark.sql("select width, height, calc_rectangle_area(width, height) from rectangle_dimensions").show()

+-----+------+----------------------------------+
|width|height|calc_rectangle_area(width, height)|
+-----+------+----------------------------------+
|    1|     1|                                 1|
|    2|     8|                                16|
|    3|    27|                                81|
|    4|    64|                               256|
|    5|   125|                               625|
|    6|   216|                              1296|
|    7|   343|                              2401|
|    8|   512|                              4096|
|    9|   729|                              6561|
+-----+------+----------------------------------+



## Evaluation order and null checking in Spark SQL
Spark SQL (this includes SQL, the DataFrame API, and the Dataset API) does not
guarantee the order of evaluation of subexpressions. For example, the following query
does not guarantee that the s IS NOT NULL clause is executed prior to the
strlen(s) > 1 clause:

spark.sql("SELECT s FROM test1 WHERE s IS NOT NULL AND strlen(s) > 1")

Therefore, to perform proper null checking, it is recommended that you do one of
the following:
1. Make the UDF itself null-aware and do null checking inside the UDF.
2. Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a
conditional branch.
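
The two options above can be sketched as follows. This is a minimal illustration rather than a definitive recipe: the names `strlen_null_safe`, `strlen_safe`, and `run_null_check_demo` are hypothetical, and the three-row `test1` view is made up for the example. The second query reuses the null-aware UDF inside the CASE WHEN branch, on the assumption that the planner may still evaluate a Python UDF before the conditional is applied.

```python
from typing import Optional


def strlen_null_safe(s: Optional[str]) -> Optional[int]:
    """Option 1: the null check lives inside the UDF body."""
    # A SQL NULL arrives in Python as None; return None so the result is NULL too.
    return None if s is None else len(s)


def run_null_check_demo(spark):
    """Build both query shapes against an existing SparkSession (hypothetical helper)."""
    # Imported here so the pure-Python helper above can be exercised without Spark.
    from pyspark.sql.types import IntegerType

    # Small illustrative view containing one NULL value.
    spark.createDataFrame(
        [("a",), ("spark",), (None,)], "s string"
    ).createOrReplaceTempView("test1")
    spark.udf.register("strlen_safe", strlen_null_safe, IntegerType())

    # Option 1: no separate IS NOT NULL clause is needed; NULL > 1 is not true,
    # so rows with a NULL s are filtered out.
    option1 = spark.sql("SELECT s FROM test1 WHERE strlen_safe(s) > 1")

    # Option 2: the null check and the UDF call sit in one conditional expression.
    option2 = spark.sql(
        "SELECT s FROM test1 "
        "WHERE CASE WHEN s IS NOT NULL THEN strlen_safe(s) > 1 ELSE false END"
    )
    return option1, option2
```

Calling `run_null_check_demo(spark)` with the session created earlier returns two DataFrames that can be inspected with `.show()`.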

## Speeding up and distributing PySpark UDFs with Pandas UDFs
A long-standing issue with PySpark UDFs was that they were slower than Scala UDFs.
This was because PySpark UDFs required data movement between the JVM and Python,
which was quite expensive. To resolve this problem, Pandas UDFs (also known as
vectorized UDFs) were introduced as part of Apache Spark 2.3. A Pandas UDF uses
Apache Arrow to transfer data and Pandas to work with the data. You define a Pandas
UDF with pandas_udf, either as a decorator or by wrapping the function directly.
Once the data is in Apache Arrow format, there is no longer any need to
serialize/pickle the data, as it is already in a format consumable by the Python
process. Instead of operating on individual inputs row by row, you are operating on
a Pandas Series or DataFrame (i.e., vectorized execution).

Starting with Apache Spark 3.0 (with Python 3.6 and above), Pandas UDFs are split
into two API categories: Pandas UDFs and Pandas Function APIs.

## Pandas UDFs
With Apache Spark 3.0, the Pandas UDF type is inferred from the Python type hints
on the function, such as pandas.Series, pandas.DataFrame, Tuple, and Iterator.
Previously you needed to manually define and specify each Pandas UDF type.
Currently, the supported cases of Python type hints in Pandas UDFs are Series to
Series, Iterator of Series to Iterator of Series, Iterator of Multiple Series to
Iterator of Series, and Series to Scalar (a single value).
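
As a rough sketch of what those four type-hint shapes look like, the functions below are written as plain pandas functions first and only wrapped with `pandas_udf` inside a helper. All function names here (`cube`, `cube_batches`, `multiply_batches`, `mean_of`, `build_pandas_udfs`) are hypothetical, and the return types passed to `pandas_udf` are assumptions chosen for integer input columns.

```python
from typing import Iterator, Tuple

import pandas as pd


def cube(n: pd.Series) -> pd.Series:
    """Series to Series: one output value per input value."""
    return n ** 3


def cube_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Iterator of Series to Iterator of Series: one output batch per input batch."""
    for batch in batches:
        yield batch ** 3


def multiply_batches(
    batches: Iterator[Tuple[pd.Series, pd.Series]]
) -> Iterator[pd.Series]:
    """Iterator of Multiple Series to Iterator of Series: several input columns."""
    for left, right in batches:
        yield left * right


def mean_of(n: pd.Series) -> float:
    """Series to Scalar: reduces a whole Series to a single value."""
    return float(n.mean())


def build_pandas_udfs():
    """Wrap the functions above as Pandas UDFs (needs pyspark and pyarrow installed)."""
    # Imported here so the plain pandas functions can be tried without Spark.
    from pyspark.sql.functions import pandas_udf

    # The UDF type is inferred from each function's type hints;
    # only the Spark return type is given explicitly.
    return {
        "cube": pandas_udf(cube, "long"),
        "cube_batches": pandas_udf(cube_batches, "long"),
        "multiply_batches": pandas_udf(multiply_batches, "long"),
        "mean_of": pandas_udf(mean_of, "double"),
    }
```

With a running session, something like `udfs = build_pandas_udfs()` followed by `spark.range(1, 4).select(udfs["cube"]("id"))` would apply the Series to Series variant.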

## Pandas Function APIs
Pandas Function APIs allow you to directly apply a local Python function to a
PySpark DataFrame where both the input and output are Pandas instances. For
Spark 3.0, the supported Pandas Function APIs are grouped map, map, and cogrouped
map.
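
The three APIs named above might be exercised along these lines. This is an illustrative sketch under stated assumptions: the helper functions, column names (`id`, `k`, `v`, `label`), and sample rows are all made up, and the Spark calls are kept inside `run_function_api_demo` so the pandas logic can be checked on its own.

```python
from typing import Iterator

import pandas as pd


def keep_even_ids(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Map: receives batches of rows as pandas DataFrames, yields DataFrames."""
    for pdf in batches:
        yield pdf[pdf["id"] % 2 == 0]


def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Grouped map: receives all rows of one group as a single pandas DataFrame."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def count_both_sides(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Cogrouped map: receives the rows of one key from each of two DataFrames."""
    # Either side can be empty for a given key, so take the key from whichever has rows.
    key = left["k"].iloc[0] if len(left) else right["k"].iloc[0]
    return pd.DataFrame({"k": [key], "n_left": [len(left)], "n_right": [len(right)]})


def run_function_api_demo(spark):
    """Apply the functions above through the Spark 3.0 Pandas Function APIs."""
    # Map
    evens = spark.range(1, 10).mapInPandas(keep_even_ids, schema="id long")

    # Grouped map
    values = spark.createDataFrame(
        [(1, 1.0), (1, 3.0), (2, 5.0)], "k long, v double"
    )
    centered = values.groupBy("k").applyInPandas(
        subtract_group_mean, schema="k long, v double"
    )

    # Cogrouped map
    labels = spark.createDataFrame(
        [(1, "x"), (2, "y"), (2, "z")], "k long, label string"
    )
    counts = values.groupBy("k").cogroup(labels.groupBy("k")).applyInPandas(
        count_both_sides, schema="k long, n_left long, n_right long"
    )
    return evens, centered, counts
```

Unlike a Pandas UDF, each of these is called on a DataFrame (or grouped DataFrame) rather than used as a column expression, and the output schema is passed explicitly.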

For more information, refer to “Redesigned Pandas UDFs with Python Type Hints”
on page 354 in Chapter 12.

In [18]:
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

In [19]:
def cube_udf(n: pd.Series) -> pd.Series:
    return n ** 3

In [22]:
# pandas_udf depends on PyArrow and raises an error if it is not installed.
# Install it first if needed: !pip install pyarrow
get_cube_of = pandas_udf(cube_udf, LongType())

In [25]:
nums = pd.Series([1,2,3])
print(cube_udf(nums))    # a plain local pandas function call, not a Pandas UDF

0     1
1     8
2    27
dtype: int64


In [28]:
# Now let's switch to a Spark DataFrame. We can execute this function as a
# Spark vectorized UDF as follows:

df = spark.range(1,4)
df.select("id", get_cube_of("id")).show()  # runs as a vectorized Pandas UDF


# As opposed to a local function, using a vectorized UDF will result in the execution of
# Spark jobs; the previous local function is a Pandas function executed only on the
# Spark driver. This becomes more apparent when viewing the Spark UI for one of the
# stages of this pandas_udf function (Figure 5-1).
# For a deeper dive into Pandas UDFs, refer to the pandas user-defined
# functions documentation.

+---+------------+
| id|cube_udf(id)|
+---+------------+
|  1|           1|
|  2|           8|
|  3|          27|
+---+------------+