**⭐ 1. What This Pattern Solves**

`expr()` and `selectExpr()` let you write SQL expressions directly inside PySpark transformations.

Use this pattern when:

- The SQL version is easier to write than the Column API version
- You want a complex expression on one line
- You are migrating SQL logic to PySpark
- You need functions that are not easily accessible via `F.*`
- You are working with window functions inside transformations

This pattern is essential for SQL-first data engineers.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT 
    id,
    salary * 1.1 AS bonus,
    CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;

In [0]:
# PySpark equivalent of the SQL query above: each string is one SELECT item
df.selectExpr(
    "id",
    "salary * 1.1 AS bonus",
    "CONCAT(first_name, ' ', last_name) AS full_name"
)

In [0]:
from pyspark.sql import functions as F
df.withColumn("bonus", F.expr("salary * 1.1"))

**⭐ 3. Core Idea**

`expr()` lets you put SQL functions inside PySpark:

- Math expressions
- String expressions
- Date expressions
- CASE WHEN logic
- Built-in SQL functions
- Window expressions (inside `selectExpr()`)

You write the SQL expression as a string, and Spark parses it into the same Catalyst plan that the Column API would produce.
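The expression categories listed above can be combined in a single projection. The following is a minimal sketch rather than a definitive implementation: the column names (`salary`, `first_name`, `last_name`, `hire_date`, `dept`) and the helper name `enrich` are assumptions made for illustration and do not come from a real table.

```python
# Hypothetical column names, chosen only to illustrate each expression category.
EXPRESSIONS = [
    "id",
    # Math expression
    "salary * 1.1 AS bonus",
    # String expression
    "upper(concat(first_name, ' ', last_name)) AS full_name",
    # Date expression
    "year(hire_date) AS hire_year",
    # CASE WHEN logic
    "CASE WHEN salary >= 100000 THEN 'high' ELSE 'standard' END AS salary_band",
    # Built-in SQL function
    "coalesce(dept, 'unknown') AS dept_filled",
    # Window expression with an inline window specification
    "row_number() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank_in_dept",
]


def enrich(df):
    """Apply every expression in one projection; `df` is a Spark DataFrame."""
    return df.selectExpr(*EXPRESSIONS)
```

Assuming a DataFrame `df` with those columns, `enrich(df).show()` would display the result. Keeping the expressions in a plain list makes them easy to review next to the SQL they were migrated from.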

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
## Pattern A — Inside withColumn()

df.withColumn("new_col", F.expr("sql_expression_here"))

In [0]:
## Pattern B — Using selectExpr()

df.selectExpr(
    "col1",
    "sql_expression AS new_col",
    "CASE WHEN ... END AS flag"
)

**⭐ 5. Detailed Example**

In [0]:
+----+----------+-----------+
| id | first    | last      |
+----+----------+-----------+
| 1  | Alice    | Smith     |
| 2  | Bob      | Johnson   |
+----+----------+-----------+


In [0]:
from pyspark.sql import functions as F

out = df.withColumn(
    "full_name",
    F.expr("CONCAT(first, ' ', last)")
)

In [0]:
+----+----------+-----------+-------------+
| id | first    | last      | full_name   |
+----+----------+-----------+-------------+
| 1  | Alice    | Smith     | Alice Smith |
| 2  | Bob      | Johnson   | Bob Johnson |
+----+----------+-----------+-------------+


**⭐ 6. Mini Practice Problems (Active Recall)**

1. Create column `total = price * quantity` using `expr()`.
2. Extract the year from `order_date` using SQL syntax.
3. Build `full_address` with `"CONCAT(street, ', ', city, ', ', state)"`.

**⭐ 7. Full Data Engineering Problem**

**Scenario:**
You have a Bronze Sales table.
You must create the Silver enriched version using SQL-style expressions:

- `gross_amount` = `unit_price * qty`
- `net_amount` = `unit_price * qty * (1 - discount)`
- `order_year` from `order_timestamp`
- `is_weekend` using SQL `CASE WHEN` + `dayofweek(order_timestamp)`

**Task:**
Write the full PySpark transformation using ONLY expr() or selectExpr().

This mirrors enterprise sales pipelines in retail, travel, and e-commerce.
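One possible shape for a solution, offered as a sketch under stated assumptions: the Bronze table is assumed to contain exactly the columns named in the scenario (`unit_price`, `qty`, `discount`, `order_timestamp`), `discount` is assumed to be a fraction between 0 and 1, and the function names `silver_expressions` and `bronze_to_silver` are hypothetical. Try the task yourself before reading it.

```python
def silver_expressions():
    """SQL expressions that derive the Silver columns from the Bronze columns."""
    return [
        # Keep every Bronze column as-is.
        "*",
        "unit_price * qty AS gross_amount",
        # Assumes `discount` is stored as a fraction (0.15 means 15%).
        "unit_price * qty * (1 - discount) AS net_amount",
        "year(order_timestamp) AS order_year",
        # Spark's dayofweek() returns 1 for Sunday and 7 for Saturday.
        "CASE WHEN dayofweek(order_timestamp) IN (1, 7) "
        "THEN true ELSE false END AS is_weekend",
    ]


def bronze_to_silver(bronze_df):
    """Project the Bronze DataFrame into its enriched Silver form."""
    return bronze_df.selectExpr(*silver_expressions())
```

Because everything goes through `selectExpr()`, the transformation uses no Column API calls, which matches the constraint in the task.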

**⭐ 8. Time & Space Complexity**

| Operation           | Complexity                                                                |
| ------------------- | ------------------------------------------------------------------------- |
| `expr()` evaluation | **O(n)**: a row-wise expression is computed once per row                  |
| Memory              | Grows with the number and size of new columns                             |


**⭐ 9. Common Pitfalls**

❌ Using Python string ops instead of SQL ops
❌ Forgetting you can use CASE WHEN inside expr()
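❌ Mixing SQL syntax with Column syntax incorrectly, such as passing a `Column` object where `expr()` expects a SQL string
❌ Omitting quotes around string literals inside an expression, so Spark tries to resolve them as column names
❌ Misusing selectExpr("*", "..."): without an alias the column gets an auto-generated name such as `(salary * 1.1)`, and an alias that `*` already includes → duplicate col names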
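
The duplicate-column pitfall can be guarded against with a small wrapper. This is a minimal sketch, assuming the default case-insensitive column resolution in Spark; the helper name `add_derived_column` is hypothetical and not part of the PySpark API.

```python
def add_derived_column(df, expression, alias):
    """Append `expression AS alias` to all existing columns of a Spark DataFrame.

    Raises ValueError if the alias would collide with an existing column.
    """
    # Spark resolves column names case-insensitively by default,
    # so compare lower-cased names.
    existing = {name.lower() for name in df.columns}
    if alias.lower() in existing:
        raise ValueError(f"Column {alias!r} already exists; pick another alias")
    # Always alias the expression so the output column has a predictable name.
    return df.selectExpr("*", f"{expression} AS {alias}")
```

For example, `add_derived_column(df, "salary * 1.1", "bonus")` would add `bonus`, while reusing `salary` as the alias would raise instead of silently producing two `salary` columns.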