**⭐ 1. What This Pattern Solves**

withColumn() adds a new derived column to a DataFrame or overwrites an existing one.
Used in:

- Enriching Bronze → Silver datasets
- Normalizing fields
- Standardizing formats
- Feature engineering
- Business-rule calculations
- Safety checks before joins

It is one of the core transformations in most ETL jobs.

**⭐ 2. SQL Equivalent**

```sql
%sql
SELECT *,
       salary * 1.1 AS salary_bonus,
       UPPER(name) AS name_upper
FROM employees;
```

PySpark equivalent:

```python
from pyspark.sql import functions as F

df.withColumn("salary_bonus", F.col("salary") * 1.1) \
  .withColumn("name_upper", F.upper("name"))
```

**⭐ 3. Core Idea**

withColumn():

- Adds new columns
- Overwrites existing columns when the name already exists
- Works with expressions: math, string ops, date ops
- Works with UDFs, conditional logic, nested fields

Each call returns a new DataFrame with one column added or replaced; the original DataFrame is not modified.

**⭐ 4. Template Code (MEMORIZE THIS)**

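```python
# expr and expr2 are placeholders for any Column expression
df.withColumn("new_col", expr) \
  .withColumn("another_col", expr2)
```

```python
# F.expr parses a SQL expression string into a Column
df = df.withColumn("col", F.expr("expression"))
```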

**⭐ 5. Detailed Example**

```text
+----+-------+--------+
| id | name  | salary |
+----+-------+--------+
| 1  | Alice | 1000   |
| 2  | Bob   | 1500   |
+----+-------+--------+
```


```python
out = df.withColumn("name_upper", F.upper("name")) \
        .withColumn("salary_bonus", F.col("salary") * 1.1)
```

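Result (existing columns are kept; salary_bonus is a double, shown rounded):

```text
+----+-------+--------+------------+--------------+
| id | name  | salary | name_upper | salary_bonus |
+----+-------+--------+------------+--------------+
| 1  | Alice | 1000   | ALICE      | 1100.0       |
| 2  | Bob   | 1500   | BOB        | 1650.0       |
+----+-------+--------+------------+--------------+
```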


**⭐ 6. Mini Practice Problems**

1. Add a column age_group: "ADULT" if age ≥ 18, else "MINOR".
2. Add a column year extracted from event_date.
3. Overwrite column amount to convert USD → EUR using rate 0.9.

**⭐ 7. Full Data Engineering Problem**

**Scenario:**
You’re enriching a Bronze Orders table.
You must produce the Silver version with:

- order_year extracted from order_timestamp
- total_amount = unit_price * quantity
- is_high_value = total_amount > 1000
- customer_state converted to uppercase

**Task:**
Write the full withColumn() chain.

This mirrors a typical enrichment step in retail order-processing pipelines.

**⭐ 8. Time & Space Complexity**

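| Operation      | Complexity                                                                                  |
| -------------- | ------------------------------------------------------------------------------------------- |
| Adding columns | **O(n)** in the number of rows; row-level expressions are evaluated per row with no shuffle |
| Memory         | Grows with the size of the added columns; nothing is computed until an action runs          |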


**⭐ 9. Common Pitfalls**

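- ❌ Adding many columns through separate withColumn calls, especially in a loop (each call adds a projection, which can bloat the query plan); prefer a single select
- ❌ Using UDFs where built-in functions exist
- ❌ Forgetting that withColumn overwrites a column when the name already exists
- ❌ Creating unnecessary intermediate columns
- ❌ Long chained expressions without aliases → unreadable ETL code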