# Building and Using PySpark UDFs

Explore user-defined functions (UDFs) in PySpark, understand when to reach for them, and see how to keep performance and testing in check.


## When to Use UDFs

- Express custom business logic that is hard to model using built-in functions.
- Integrate external libraries or Python-only algorithms into Spark pipelines.
- Prototype transformations quickly before rewriting them with Spark SQL functions.


## Setup and Shared Dataset

Load the shared orders dataset we use throughout the course so we can attach UDF-generated columns.


In [None]:
from pathlib import Path
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName('PySparkUDFsTutorial').getOrCreate()

repo_root = Path.cwd()
if (repo_root / 'notebooks').exists():
    data_path = repo_root / 'notebooks' / 'data' / 'orders_demo.csv'
else:
    data_path = Path('..') / 'data' / 'orders_demo.csv'

orders_df = (
    spark.read
    .option('header', True)
    .option('inferSchema', True)
    .csv(str(data_path))
    .withColumn('order_date', F.to_date('order_date'))
)
orders_df.orderBy('order_date', 'region').show()


## Scalar UDF Basics

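A scalar UDF is called once per row and returns one value for that row. Wrap a Python function with `pyspark.sql.functions.udf` and declare its return type (it defaults to `StringType` when omitted) to use it in DataFrame expressions. To call the same function from SQL text, it additionally has to be registered with `spark.udf.register`.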


In [None]:
from pyspark.sql.functions import udf

@udf(returnType=T.StringType())
def classify_region(region: str) -> str:
    # Called once per row; a null region arrives as None and falls through to 'inland'.
    if region in {'east', 'west'}:
        return 'coastal'
    return 'inland'

with_udf = orders_df.withColumn('region_type', classify_region(F.col('region')))
with_udf.orderBy('order_date', 'region').show()


## Pandas UDF for Vectorized Logic

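Pandas UDFs (a.k.a. vectorized UDFs) operate on batches of data instead of single rows, which usually performs better than a row-at-a-time UDF. Spark uses Apache Arrow to move each batch between the JVM and Python, so PyArrow must be installed; the function receives and returns pandas objects such as `pandas.Series`.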


In [None]:
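import pandas as pd
from pyspark.sql.functions import pandas_udf

# Series-to-Series pandas UDF: Spark passes each batch of the column as a
# pandas Series and expects a Series of the same length back.
@pandas_udf('double')
def add_priority_bonus(orders: pd.Series) -> pd.Series:
    return orders * 1.1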

with_bonus = with_udf.withColumn('orders_with_bonus', add_priority_bonus(F.col('orders')))
with_bonus.orderBy('order_date', 'region').show()
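
To make the "batch" idea concrete, the body of a Series-to-Series pandas UDF can be exercised as an ordinary pandas function, without Spark. The sketch below is illustrative only: `add_priority_bonus_batch` and the sample values are hypothetical names and data, not part of the tutorial dataset.

```python
import pandas as pd


def add_priority_bonus_batch(orders: pd.Series) -> pd.Series:
    """Plain-pandas copy of the UDF body: one Series in, one Series out."""
    return orders * 1.1


# Hypothetical batch standing in for a chunk of the `orders` column.
batch = pd.Series([10, 20, 30])
result = add_priority_bonus_batch(batch)

# The whole batch is transformed in one vectorized operation, and the
# output has exactly one value per input value.
print(len(result) == len(batch))
print(result.dtype)
```

Inside Spark, the decorated function is invoked the same way, once per Arrow batch rather than once per row.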


## When to Prefer Built-ins Over UDFs

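- Built-in functions are Catalyst expressions, so Spark can optimize them and run them inside the JVM. A Python UDF is opaque to the optimizer, which is why built-ins are usually faster.
- Use UDFs only when necessary. Once the logic stabilizes, consider rewriting a UDF with built-in SQL functions or window expressions.
- Watch out for serialization overhead: every UDF call moves data between the JVM and Python workers. Pandas UDFs reduce this cost by transferring Arrow batches, but the gain still needs careful benchmarking.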


## Testing UDFs

Combine Python unit tests with DataFrame comparisons. Here we reuse the testing helpers from earlier tutorials.


In [None]:
from pyspark.testing.utils import assertDataFrameEqual

expected_rows = [
    ('east', 'coastal'),
    ('north', 'inland'),
    ('south', 'inland'),
]
expected_df = spark.createDataFrame(expected_rows, schema=['region', 'region_type']).orderBy('region')
actual_df = with_udf.select('region', 'region_type').distinct().orderBy('region')

assertDataFrameEqual(actual_df, expected_df)
print('Region classification UDF behaves as expected.')
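
The cell above covers the DataFrame comparison. The plain Python unit-test half might look like the following sketch, which assumes the classification logic is kept in an undecorated function (the name `classify_region_py` is hypothetical) so it can be tested without a Spark session.

```python
def classify_region_py(region):
    """Hypothetical undecorated copy of the classify_region logic."""
    if region in {'east', 'west'}:
        return 'coastal'
    return 'inland'


def test_classify_region_py():
    # Coastal regions named in the rule.
    assert classify_region_py('east') == 'coastal'
    assert classify_region_py('west') == 'coastal'
    # Everything else, including a missing value, is treated as inland.
    assert classify_region_py('north') == 'inland'
    assert classify_region_py(None) == 'inland'


test_classify_region_py()
print('Pure-Python classification checks passed.')
```

The same function can then be wrapped for Spark with `udf(classify_region_py, T.StringType())`, so the unit test and the DataFrame test exercise identical logic.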


## Exercises

- Implement a scalar UDF that categorizes `orders` into buckets such as `low`, `medium`, and `high`.
- Write a Pandas UDF that normalizes orders within each region and compare it against a built-in window function approach.
- Benchmark a UDF against a built-in function using `timeit` or Spark’s event logs to understand the performance trade-offs.


In [None]:
spark.stop()
