# User-Defined Functions (UDFs)
A user-defined function (UDF) lets you apply custom logic to your data when the built-in functions are not enough.

- Write a function in Python, Scala, Java, or R.
- Register it with Spark.
- Call it like any built-in Spark function in SQL or DataFrame operations.

## Why Use UDFs?
**Reasons to use UDFs:**

1. **Custom logic**: You can encode business-specific calculations or transformations that are too complex for standard functions.
1. **Reusability**: Instead of writing the same code over and over, define the logic once and reuse it.
1. **Readability**: A well-named function makes a complex transformation easier to understand.

**When to avoid UDFs:**

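- UDFs can be **slower** than built-in functions: Spark treats them as a “black box” that the Catalyst optimizer cannot inspect, and Python UDFs also pay the cost of moving data between the JVM and Python worker processes.
- **Prefer built-in Spark SQL functions whenever they can express the logic**; they usually run faster.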

## UDFs in PySpark

PySpark UDFs are Python functions that you register so they can be used in Spark DataFrame transformations.

**How it works:**

- You define a Python function.
- You use `pyspark.sql.functions.udf()` to convert it into a UDF.
- You specify the return type (it defaults to `StringType` if omitted).
- You call it with `.withColumn()`, `.select()`, or `.filter()`.

#### Create Sample DataFrame

In [0]:
%python
data = [
    (1, "Alice", 45),
    (2, "Bob", 75),
    (3, "Charlie", 30)
]

df = spark.createDataFrame(data, ["id", "name", "score"])
df.show()

#### Define a Python Function
We want to label scores as "Low" (below 40), "Medium" (40 to 69), or "High" (70 and above):

In [0]:
%python
def label_score(score):
    if score >= 70:
        return "High"
    elif score >= 40:
        return "Medium"
    else:
        return "Low"
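One detail worth checking before registering this function: Spark passes a SQL `NULL` to a Python UDF as `None`, and `None >= 70` raises a `TypeError` in Python 3. The sample data here has no missing scores, so the function above works as written. The following is a minimal sketch of a null-tolerant variant; the name `label_score_null_safe` is a hypothetical choice for illustration, not part of the original notebook.

```python
from typing import Optional


def label_score_null_safe(score: Optional[int]) -> Optional[str]:
    """Label a score, passing missing values through unchanged."""
    # Spark hands SQL NULL to a Python UDF as None; returning None
    # produces NULL in the output column.
    if score is None:
        return None
    if score >= 70:
        return "High"
    elif score >= 40:
        return "Medium"
    else:
        return "Low"


# Plain-Python check, no Spark session needed.
print([label_score_null_safe(s) for s in (45, 75, 30, None)])
# → ['Medium', 'High', 'Low', None]
```

Returning `None` for missing input is one possible policy; mapping missing scores to a label such as "Unknown" would work equally well if that suits the data better.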

#### Register the UDF
Import `udf` and `StringType`, then wrap the function and declare its return type:

In [0]:
%python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

label_score_udf = udf(label_score, StringType())

#### Use the UDF in a DataFrame

In [0]:
%python
df_labeled = df.withColumn("score_label", label_score_udf(df["score"]))
df_labeled.show()

## What is a Pandas UDF?
**Pandas UDFs** (also called _vectorized_ UDFs) are often much faster than regular PySpark UDFs because they use Apache Arrow to transfer data between the JVM and Python in columnar batches.

- A Pandas UDF processes entire _batches_ (chunks) of data at once rather than one row at a time.
- You write the logic using **Pandas Series** instead of single values.
- They are still opaque to the Catalyst optimizer, so the speedup comes from cheaper data transfer and vectorized pandas operations, not from better query planning.

#### Define Pandas UDF
Key points:
- Use the `pandas_udf` decorator.
- Input: a `pandas.Series`
- Output: a `pandas.Series` of the same length

In [0]:
%python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def label_score_pandas_udf(scores: pd.Series) -> pd.Series:
    return scores.apply(
        lambda s: "High" if s >= 70 else "Medium" if s >= 40 else "Low"
    )

#### Apply Pandas UDF

In [0]:
%python
df_labeled = df.withColumn("score_label", label_score_pandas_udf(df["score"]))
df_labeled.show()
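The `scores.apply(...)` call in the Pandas UDF above still evaluates a Python lambda once per element inside each batch, so it gains from Arrow transfer but not from vectorized computation. A minimal sketch of a fully vectorized version of the labeling logic follows, assuming NumPy is available alongside pandas. The function name `label_scores_vectorized` is hypothetical, and the block uses plain pandas so it can be checked without a Spark session.

```python
import numpy as np
import pandas as pd


def label_scores_vectorized(scores: pd.Series) -> pd.Series:
    """Label a whole batch of scores without a per-element Python call."""
    values = scores.to_numpy()
    # np.select picks the choice for the first condition that is True,
    # so the order mirrors if / elif / else. NaN (a missing score) fails
    # both comparisons and falls through to "Low" in this sketch.
    labels = np.select(
        [values >= 70, values >= 40],
        ["High", "Medium"],
        default="Low",
    )
    # Keep the input index so the output lines up with the input batch.
    return pd.Series(labels, index=scores.index)


sample = pd.Series([45, 75, 30])
print(label_scores_vectorized(sample).tolist())
```

To use it in Spark, the same function can be wrapped with `pandas_udf(label_scores_vectorized, returnType=StringType())` and applied with `.withColumn()` exactly like `label_score_pandas_udf`.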

## SQL UDFs

SQL user-defined functions (UDFs) let you encapsulate custom logic in a reusable SQL function.

Unlike **external UDFs** (written in Python, Scala, Java, or R), SQL UDFs are defined purely in SQL. Because the logic is transparent to Spark's Catalyst Optimizer, **SQL UDFs usually perform better** on large datasets than external UDFs (which behave like opaque black boxes).

### Creating SQL UDFs
To create a SQL UDF, you must define:

- A function name
- Optional input parameters
- The return data type
- A SQL expression that implements the function logic

**Create a temporary view with product discounts**

In [0]:
CREATE OR REPLACE TEMP VIEW product_discounts AS
SELECT * FROM VALUES
  (101, 'Pencil', 15),
  (102, 'Notebook', 45),
  (103, 'Laptop', 75),
  (104, 'Desk', 55)
AS t(product_id, product_name, discount_percent);

#### Create the UDF
This UDF returns a string label for the discount level: 'High' for 70 and above, 'Medium' for 40 to 69, and 'Low' otherwise.

In [0]:
CREATE OR REPLACE FUNCTION discount_category(pct INT)
RETURNS STRING
RETURN
  CASE
    WHEN pct >= 70 THEN 'High'
    WHEN pct >= 40 THEN 'Medium'
    ELSE 'Low'
  END;

#### Inspect the UDF
Use `DESCRIBE FUNCTION EXTENDED` to view the function's metadata:

In [0]:
DESCRIBE FUNCTION EXTENDED discount_category;

#### Apply the UDF in a query

In [0]:
SELECT
  product_id,
  product_name,
  discount_percent,
  discount_category(discount_percent) AS discount_label
FROM product_discounts
ORDER BY product_id;

#### Drop the UDF

In [0]:
DROP FUNCTION IF EXISTS discount_category;