**⭐ 1. What This Pattern Solves**

Implements IF / ELSE and SQL-style CASE WHEN logic in PySpark using F.when() and .otherwise().
Used in:

- Data cleaning
- Categorizing values (age groups, risk bands, salary levels)
- Replacing invalid or NULL values
- Creating rule-based business columns
- Standardizing formats

Any time you need IF / ELSE logic on a DataFrame column, this is the standard pattern.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT
    CASE
        WHEN score >= 90 THEN 'A'
        WHEN score >= 75 THEN 'B'
        ELSE 'C'
    END AS grade
FROM exams;


In [0]:
# PySpark equivalent; assumes `from pyspark.sql import functions as F`
grade = (
    F.when(F.col("score") >= 90, "A")
    .when(F.col("score") >= 75, "B")
    .otherwise("C")
)

**⭐ 3. Core Idea**

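Build one column expression by chaining conditions:

- Start with F.when(condition, value)
- Add further branches with .when(condition, value)
- Finish with .otherwise(default) for rows that match no branch

Branches are checked in order and the first match wins, just like SQL CASE WHEN. Then wrap the expression inside withColumn() to add a derived column.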

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
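# withColumn() returns a new DataFrame, so keep the result.
df = df.withColumn(
    "new_col",
    F.when(condition1, value1)         # checked first
     .when(condition2, value2)         # checked only if condition1 did not match
     .otherwise(default_value)         # used when no condition matches
)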

**⭐ 5. Detailed Example**

Input (df):
+----+-------+
| id | score |
+----+-------+
| 1  | 92    |
| 2  | 80    |
| 3  | 60    |
+----+-------+


In [0]:
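from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 92), (2, 80), (3, 60)], ["id", "score"])

# Conditions are checked top to bottom; the first match sets the grade.
out = df.withColumn(
    "grade",
    F.when(F.col("score") >= 90, "A")
     .when(F.col("score") >= 75, "B")
     .otherwise("C")
)
out.show()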

Output (out):
+----+-------+-------+
| id | score | grade |
+----+-------+-------+
| 1  | 92    | A     |
| 2  | 80    | B     |
| 3  | 60    | C     |
+----+-------+-------+


**⭐ 6. Mini Practice Problems**

1. Create column risk_level = "HIGH" if score > 0.8, else "LOW".
2. Create column age_group: "SENIOR" if age ≥ 65, "ADULT" if age is 18–64, else "MINOR".
3. Replace NULL city values with "UNKNOWN".

**⭐ 7. Full Data Engineering Problem**

**Scenario:**
A bronze patient dataset contains bmi, age, and gender.
The silver table requires a derived health flag, evaluated in this order:

- "AT_RISK" if bmi > 30 or age > 60
- "NORMAL" if bmi is between 18.5 and 30 (inclusive)
- "UNDERWEIGHT" otherwise

Also standardize gender:

- Keep "M" or "F"
- Map all other values to "UNKNOWN"

**Task:**
Write PySpark conditional transformations for both derived columns.

Rule-based flags like these are a common building block of health-risk stratification in insurance and provider analytics.

**⭐ 8. Time & Space Complexity**

| Operation               | Complexity                                                                   |
| ----------------------- | ---------------------------------------------------------------------------- |
| Conditional expressions | **O(n)** for n rows; each row checks branches in order until the first match |
| Memory                  | Low/Medium; adds one column to each row                                      |


**⭐ 9. Common Pitfalls**

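❌ Forgetting .otherwise() → rows that match no condition get NULL
❌ Using Python if/else instead of F.when() → a Column cannot be used as a Python boolean, so this raises an error
❌ Overlapping conditions → Spark picks the first match and ignores the rest, so order branches from most to least specific
❌ Large nested conditions instead of separate, smaller transformations
❌ Not normalizing case before comparisons (e.g., "m" vs. "M")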