**⭐ 1. What This Pattern Solves**

Encapsulates common transformations or logic into functions to avoid code duplication, improve readability, and make pipelines maintainable.

Use-cases:

Standardizing column cleaning logic (trim, lowercase)

Reusable aggregations (sum, count per key)

Generic join functions for multiple tables

UDFs for repeated business rules

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- SQL view or stored procedure
CREATE VIEW cleaned_customers AS
SELECT TRIM(LOWER(name)) AS clean_name, age
FROM raw_customers;

**⭐ 3. Core Idea**

Wrap repeatable DataFrame transformations in Python functions that take a DataFrame as input and return a transformed DataFrame.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
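from pyspark.sql.functions import col, trim, lower

def clean_string_column(df, col_name, new_col_name):
    # Lowercase and trim `col_name`, writing the result to `new_col_name`
    return df.withColumn(new_col_name, trim(lower(col(col_name))))

# Usage (df_raw is any DataFrame with a string column "name")
df_cleaned = clean_string_column(df_raw, "name", "clean_name")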

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, lower

spark = SparkSession.builder.getOrCreate()

data = [(" Alice ", 30), ("BOB", 25)]
df_raw = spark.createDataFrame(data, ["name", "age"])

# Reusable function
def clean_string_column(df, col_name, new_col_name):
    return df.withColumn(new_col_name, trim(lower(col(col_name))))

df_cleaned = clean_string_column(df_raw, "name", "clean_name")
df_cleaned.show()


Output:
+-------+---+----------+
|   name|age|clean_name|
+-------+---+----------+
| Alice | 30|     alice|
|    BOB| 25|       bob|
+-------+---+----------+

**⭐ 6. Mini Practice Problems**

Write a reusable function to cast multiple columns to IntegerType.

Create a function that drops nulls from a list of columns.

Build a function that aggregates count per key for any DataFrame.

**⭐ 7. Full Data Engineering Problem**

Scenario: Standardizing ETL transformations across multiple pipelines:

Raw tables: customers, transactions, products.

Apply common cleaning: trim & lowercase string columns, fill nulls for critical numeric columns.

Use reusable functions to avoid repeating the same code in each pipeline.

Aggregate metrics and write to Silver Delta tables.

**⭐ 8. Time & Space Complexity**

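Wrapping a transformation in a function adds no runtime cost: the function only extends Spark's lazy query plan, so complexity is that of the underlying transformation. Row-wise operations such as trim and lower are O(n) in the number of rows for each column transformed.

Minimal memory overhead: no data is processed until an action (such as show or a write) runs, so chained functions are optimized and executed together as one plan.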
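
Chaining can be sketched with a small helper that applies a list of DataFrame-to-DataFrame functions in order. The name `apply_steps` is hypothetical. The helper only calls each function on the previous result, so it needs no Spark import and works with any such functions.

```python
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def apply_steps(df: T, steps: Iterable[Callable[[T], T]]) -> T:
    """Apply each step to the result of the previous one, left to right."""
    return reduce(lambda acc, step: step(acc), steps, df)


# Possible usage with the function from the sections above (not executed here):
# df_cleaned = apply_steps(
#     df_raw,
#     [lambda d: clean_string_column(d, "name", "clean_name")],
# )
```

With Spark DataFrames, each step only adds to the query plan, so the whole chain still runs as a single job when an action is called.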

**⭐ 9. Common Pitfalls**

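Hardcoding column names inside the function → reduces reusability; pass them as parameters instead.

Returning None for invalid columns → the next transformation fails with an unrelated AttributeError; raise a descriptive error instead.

Wrapping too many transformations into a single function → harder to test and debug; keep each function to one logical step.

Forgetting the return statement → the function returns None, and because DataFrames are immutable, the caller's DataFrame is never updated.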
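
To avoid the silent-None pitfall, a reusable function can validate its inputs up front and fail loudly. The following is a minimal sketch. The helper name `require_columns` is hypothetical, and it relies only on the `columns` attribute that PySpark DataFrames expose.

```python
def require_columns(df, required):
    """Return df unchanged, or raise ValueError naming any missing columns."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(
            f"Missing columns: {missing}; available: {list(df.columns)}"
        )
    return df


# Possible usage inside a transformation (not executed here):
# def clean_string_column(df, col_name, new_col_name):
#     require_columns(df, [col_name])
#     return df.withColumn(new_col_name, trim(lower(col(col_name))))
```

Because the helper returns the DataFrame when validation passes, it can also sit at the start of a chain of transformations.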