**⭐ 1. What This Pattern Solves**

PySpark UDFs can fail on bad or malformed input (e.g., invalid strings, nulls, or unexpected types). An unhandled exception in a single row fails its task and, once retries are exhausted, the whole job. Wrapping the logic in try/except keeps the job running and lets you return a default value or record the error instead.

**Use-cases:**

Parsing JSON strings in a column

Converting strings to numeric types with potential bad formatting

Complex business logic where exceptions may occur per row

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT
    COALESCE(TRY_CAST(amount AS INT), 0) AS safe_amount
FROM raw_table;


**⭐ 3. Core Idea**

Wrap row-level transformations in a try/except inside a UDF. You can return a default value or log the error without failing the whole job.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def safe_parse(x):
    try:
        return int(x)
    except Exception:
        return 0  # default value

safe_parse_udf = udf(safe_parse, IntegerType())

df_transformed = df.withColumn("safe_amount", safe_parse_udf("amount"))

**⭐ 5. Detailed Example**

In [0]:
data = [("100",), ("200",), ("abc",), (None,)]
df_raw = spark.createDataFrame(data, ["amount"])

# UDF with try/except
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def safe_parse(x):
    try:
        return int(x)
    except Exception:
        return 0

safe_parse_udf = udf(safe_parse, IntegerType())
df_safe = df_raw.withColumn("safe_amount", safe_parse_udf("amount"))

df_safe.show()

In [0]:
+------+-----------+
|amount|safe_amount|
+------+-----------+
|   100|        100|
|   200|        200|
|   abc|          0|
|  null|          0|
+------+-----------+


**⭐ 6. Mini Practice Problems**

Write a UDF to safely parse float values from strings, returning -1 for invalid values.

Parse dates from a string column, returning None if parsing fails.

Create a UDF that divides two columns, returning 0 if division by zero occurs.
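One possible approach to the three exercises above is sketched below. The names `safe_float`, `safe_date`, `safe_divide`, and `make_udfs` are hypothetical, the date format `%Y-%m-%d` is an assumption (the exercise does not specify one), and treating null operands like a division by zero is also an assumption. The parsing logic lives in plain Python functions so it can be checked without a Spark session; `make_udfs` imports PySpark only when it is called.

```python
from datetime import datetime


def safe_float(x):
    """Parse a float from a string; return -1.0 for invalid input."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return -1.0


def safe_date(x, fmt="%Y-%m-%d"):
    """Parse a date string; return None if parsing fails (fmt is an assumed format)."""
    try:
        return datetime.strptime(x, fmt).date()
    except (TypeError, ValueError):
        return None


def safe_divide(a, b):
    """Divide a by b; return 0.0 on division by zero.

    Null operands also fall back to 0.0 (an assumption beyond the exercise).
    """
    try:
        return a / b
    except (ZeroDivisionError, TypeError):
        return 0.0


def make_udfs():
    """Wrap the functions above as Spark UDFs with explicit return types."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DateType, DoubleType

    return {
        "safe_float": udf(safe_float, DoubleType()),
        "safe_date": udf(safe_date, DateType()),
        "safe_divide": udf(safe_divide, DoubleType()),
    }
```

With a DataFrame at hand, usage might look like `df.withColumn("ratio", make_udfs()["safe_divide"]("a", "b"))`, where `a` and `b` are placeholder column names.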

**⭐ 7. Full Data Engineering Problem**

Scenario: You ingest API response data with a price field that may contain non-numeric values or nulls. Build a pipeline that:

Reads raw JSON.

Safely converts price to float using a UDF with try/except.

Adds a price_usd column converting local currency to USD.

Aggregates total revenue per day.

Writes the result to Delta.
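The steps above could be wired together roughly as follows. This is a sketch under several assumptions that the scenario does not state: the timestamp column is called `event_time`, the currency conversion is a single constant `usd_rate`, the paths are supplied by the caller, the write mode is `overwrite`, and the Delta Lake package is available on the cluster. The function names `safe_price` and `build_daily_revenue` are hypothetical.

```python
def safe_price(x):
    """Convert a raw price to float; return None when it cannot be parsed."""
    try:
        return float(x)
    except (TypeError, ValueError):
        return None


def build_daily_revenue(spark, input_path, output_path, usd_rate):
    """Read raw JSON, clean price, convert to USD, aggregate per day, write Delta.

    Assumed input columns: price, event_time (hypothetical names).
    usd_rate is an assumed constant local-currency-to-USD rate.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    safe_price_udf = F.udf(safe_price, DoubleType())

    raw = spark.read.json(input_path)
    cleaned = (
        # Cast to string first so the UDF sees the same type for every row.
        raw.withColumn("price", safe_price_udf(F.col("price").cast("string")))
        .withColumn("price_usd", F.col("price") * F.lit(usd_rate))
        .withColumn("event_date", F.to_date("event_time"))
    )
    daily = cleaned.groupBy("event_date").agg(
        F.sum("price_usd").alias("total_revenue_usd")
    )
    daily.write.format("delta").mode("overwrite").save(output_path)
    return daily
```

Returning None rather than 0 for unparseable prices is a design choice here: `sum` skips nulls, so bad rows do not contribute to revenue and remain distinguishable from genuine zero prices.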

**⭐ 8. Time & Space Complexity**

UDFs run row by row in Python worker processes, so they are slower than native Spark functions (e.g., cast, try_cast): every value is serialized between the JVM and Python.

Complexity: O(n) for n rows.

Memory: low, unless logging errors extensively per row.
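For a conversion as simple as the integer parse used in this pattern, the per-row Python overhead can usually be avoided with built-in functions. A minimal sketch, assuming a Spark version in which the SQL function `try_cast` is available; the helper names `safe_int_expr` and `add_safe_amount` are hypothetical, and edge cases (for example strings such as "1.5" or values with surrounding whitespace) may not match Python's `int()` exactly.

```python
def safe_int_expr(column, default=0):
    """Build a SQL expression that casts to INT and falls back to a default."""
    return f"coalesce(try_cast(`{column}` AS INT), {int(default)})"


def add_safe_amount(df, column="amount", out="safe_amount"):
    """Add a safely parsed integer column using only built-in functions."""
    from pyspark.sql import functions as F

    # F.expr parses the SQL string into a Column evaluated inside the JVM.
    return df.withColumn(out, F.expr(safe_int_expr(column)))
```

Because the expression is evaluated by Spark itself, no data leaves the JVM and the optimizer can see through it, which is not possible with an opaque Python UDF.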

**⭐ 9. Common Pitfalls**

Using Python UDFs unnecessarily — built-in Spark functions are faster.

Returning None inconsistently → may break downstream type expectations.

Logging inside a UDF for every row → the UDF runs on executors, so messages land in executor logs (not the driver) and can flood them.

Forgetting to specify the return type in the UDF (the default is StringType).
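One way to sidestep the per-row logging pitfall is to return the error alongside the value and inspect it as ordinary data. The sketch below assumes this struct-based variant is acceptable for the pipeline; `parse_with_error` and `make_parse_udf` are hypothetical names, and the field names `value` and `error` are illustrative.

```python
def parse_with_error(x):
    """Return (value, error): value is None on failure, error is None on success."""
    try:
        return (int(x), None)
    except (TypeError, ValueError) as exc:
        return (None, f"{type(exc).__name__}: {exc}")


def make_parse_udf():
    """Wrap parse_with_error as a UDF with an explicit struct return type."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    schema = StructType([
        StructField("value", IntegerType(), True),
        StructField("error", StringType(), True),
    ])
    return udf(parse_with_error, schema)
```

After `df.withColumn("parsed", make_parse_udf()("amount"))`, good values are available as `parsed.value`, and failures can be counted or sampled with a filter on `parsed.error IS NOT NULL` instead of being searched for in logs.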