# PySpark: Zero to Hero
## Module 18: User Defined Functions (UDFs)

Spark provides a vast library of built-in functions, but sometimes you need to apply custom logic that isn't available out of the box. This is where **User Defined Functions (UDFs)** come in.

However, UDFs in PySpark come with a performance cost. In this notebook, we will learn how to create them and look at the execution architecture that makes them slow.

### Agenda:
1.  **What is a UDF?** Extending Spark's capabilities.
2.  **Creating a Python UDF:** The `udf()` function.
3.  **Registering UDFs for SQL:** Using `spark.udf.register`.
4.  **Performance Implications:** Serialization/Deserialization overhead.
5.  **Best Practices:** Why Native Functions are better.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, expr
from pyspark.sql.types import DoubleType, StringType

spark = SparkSession.builder \
    .appName("UDF_Deep_Dive") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

In [None]:
# Reading the Employee dataset
file_path = "data/input/employee.csv"

# Load data
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(file_path)

df.show(5)
df.printSchema()

## 1. Creating a UDF for DataFrame API

To create a UDF, we follow two steps:
1.  Define a standard Python function.
2.  Convert it to a Spark UDF using `udf(function, return_type)`.

**Scenario:** Let's calculate a **10% Bonus** based on the Salary.

In [None]:
# Step 1: Define a standard Python function
def calculate_bonus(salary):
    if salary is None:
        return 0.0
    return salary * 0.1

# Step 2: Wrap it as a UDF for the DataFrame API
# Declare the return type explicitly; if omitted, udf() defaults to StringType.
# The function must return Python floats to match DoubleType.
bonus_udf = udf(calculate_bonus, DoubleType())

# Step 3: Apply the UDF
df_with_bonus = df.withColumn("bonus", bonus_udf(col("salary")))

print("--- DataFrame with UDF Calculated Bonus ---")
df_with_bonus.show(5)
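
Because step 1 is an ordinary Python function, its logic and return type can be checked before Spark is involved. The sketch below repeats `calculate_bonus` and runs it on a few hypothetical salaries (the sample values are illustrative and not taken from `employee.csv`). Checking that every result is a `float` is worthwhile because a standard Python UDF whose return value does not match the declared type (here `DoubleType`) typically yields null instead of raising an error.

```python
def calculate_bonus(salary):
    """Return 10% of salary, or 0.0 when the salary is missing."""
    if salary is None:
        return 0.0
    return salary * 0.1


# Hypothetical sample values: an int, a float, and a missing salary.
sample_salaries = [50000, 72500.0, None]

# Plain-Python call; no SparkSession is needed at this stage.
bonuses = [calculate_bonus(s) for s in sample_salaries]

# Every result should be a float so that it matches DoubleType later on.
all_floats = all(isinstance(b, float) for b in bonuses)

print(bonuses)
print("all results are floats:", all_floats)
```

If `all_floats` is `False`, it is usually easier to fix the function here than to debug unexpected nulls in a DataFrame afterwards.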

## 2. Registering UDFs for Spark SQL

If you want to use your Python function inside a SQL query (e.g., `spark.sql("SELECT ...")`) or inside `expr()`, you must register it with the SparkSession.

**Syntax:** `spark.udf.register("sql_function_name", python_function, return_type)`

In [None]:
# Register the function for use in SQL/Expressions
spark.udf.register("calculate_bonus_sql", calculate_bonus, DoubleType())

# Usage Method A: Using expr() inside withColumn
df_sql_udf = df.withColumn("bonus_sql", expr("calculate_bonus_sql(salary)"))

print("--- Bonus calculated using SQL Registered UDF ---")
df_sql_udf.show(5)

# Usage Method B: Using spark.sql()
df.createOrReplaceTempView("employees")
spark.sql("SELECT name, salary, calculate_bonus_sql(salary) as bonus FROM employees").show(5)

## 3. Why are Python UDFs slower?

When you run a standard Spark operation (like `.filter()` or `.select()`) with built-in expressions, your Python code only builds the query plan; the actual work runs inside the JVM (Java Virtual Machine) on the executors.

When you use a **Python UDF**:
1.  **Serialization:** Spark (JVM) serializes the input rows into a format Python can read (Pickle).
2.  **Process Spin-up:** Each executor starts (or reuses) a separate **Python Worker Process** and sends it the serialized data.
3.  **Execution:** The Python worker deserializes the data and calls your function row-by-row.
4.  **Deserialization:** The results are pickled, sent back to the JVM, and converted back to Spark's internal format.

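This **inter-process data movement** and **Serialization/Deserialization** add overhead to every batch of rows. In addition, the UDF is a black box to the Catalyst Optimizer, so Spark cannot optimize the logic inside it.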
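
The cost of the round trip can be felt even without Spark. The following is a rough plain-Python analogy, not Spark's actual code path: it applies `calculate_bonus` once directly and once with `pickle` calls standing in for the JVM-to-Python hand-off. The input list is hypothetical, and a real Python UDF additionally moves the bytes between two processes, so treat the timings as illustrative only.

```python
import pickle
import time


def calculate_bonus(salary):
    """Same logic as the notebook's UDF: 10% of salary, 0.0 when missing."""
    if salary is None:
        return 0.0
    return salary * 0.1


# Hypothetical input: 200,000 salaries with a missing value every 1,000th row.
salaries = [None if i % 1000 == 0 else float(30_000 + i) for i in range(200_000)]

# Path 1: call the function in-process, with no serialization at all.
start = time.perf_counter()
direct = [calculate_bonus(s) for s in salaries]
direct_seconds = time.perf_counter() - start

# Path 2: imitate the hand-off that surrounds a Python UDF call.
start = time.perf_counter()
payload = pickle.dumps(salaries)                  # sender serializes the input rows
received = pickle.loads(payload)                  # worker deserializes them
results = [calculate_bonus(s) for s in received]  # worker runs the function row by row
reply = pickle.dumps(results)                     # worker serializes the results
round_trip = pickle.loads(reply)                  # sender deserializes the results
round_trip_seconds = time.perf_counter() - start

print(f"in-process: {direct_seconds:.4f}s")
print(f"round trip: {round_trip_seconds:.4f}s")
print(f"serialized input size: {len(payload):,} bytes")
```

Both paths produce identical values; the only difference is the extra serialization work, which is the part a native Spark expression avoids.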

## 4. Best Practice: Use Native Expressions

Whenever possible, avoid UDFs and use the built-in Spark SQL functions (`pyspark.sql.functions`) instead. They run directly in the JVM, and the Catalyst Optimizer can analyze and optimize them as part of the query plan.

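Let's compute the same bonus without a UDF. One difference to keep in mind: for a null salary the native expression returns null, whereas `calculate_bonus` returns `0.0`. If you need identical output, wrap the expression in `coalesce(..., lit(0.0))`.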

In [None]:
# Native Spark approach
# Spark evaluates this expression entirely inside the JVM; no Python worker is involved.
# Note: a null salary produces a null bonus here (the UDF returned 0.0).

df_native = df.withColumn("bonus_native", col("salary") * 0.1)

print("--- Bonus calculated using Native Expressions (FASTEST) ---")
df_native.show(5)

## Summary

1.  **Flexibility:** UDFs allow you to implement complex logic not available in standard Spark functions.
2.  **Implementation:**
    *   `udf()` for DataFrame API.
    *   `spark.udf.register()` for Spark SQL.
3.  **Performance:** Python UDFs are slower due to serialization overhead between JVM and Python processes.
4.  **Optimization:** Prefer **Native Spark Functions** (e.g., `col("salary") * 0.1`) over UDFs whenever an equivalent exists.

**Next Steps:**
In the next module, we will explore **Vectorized UDFs (Pandas UDFs)**, which reduce the overhead of standard Python UDFs by processing data in batches using Apache Arrow.