**⭐ 1. What This Pattern Solves**

Converts columns from one data type to another.

Used for:

- Fixing schema mismatches across Bronze → Silver
- Preparing data for joins
- Converting strings to dates/timestamps
- Ensuring numeric operations don’t fail
- Enforcing correct schema for Delta Lake tables
- Cleaning messy input (CSV/JSON ingestion)

Casting is one of the most frequent transformations in real pipelines.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT 
    CAST(order_date AS DATE) AS order_date,
    CAST(amount AS DOUBLE) AS amount_dbl
FROM orders;

In [0]:
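from pyspark.sql import functions as F  # F is the alias used in every snippet here
# PySpark equivalent of the SQL query above
df.withColumn("order_date", F.col("order_date").cast("date")) \
  .withColumn("amount_dbl", F.col("amount").cast("double"))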

**⭐ 3. Core Idea**

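PySpark offers three equivalent ways to cast a column:

- F.col("x").cast("type")
- F.col("x").astype("type") (an alias of cast)
- F.expr("CAST(x AS type)")

In each form the target type can be given as a SQL type name (date, timestamp, string, double, int, boolean).

Cast explicitly whenever a column's ingested type differs from the type that downstream logic expects.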

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
## Pattern A — Standard Cast

df.withColumn("col", F.col("col").cast("new_type"))

In [0]:
## Pattern B — Multiple Casts in One Select

df.select(
    F.col("a").cast("int").alias("a_int"),
    F.col("b").cast("timestamp").alias("b_ts")
)

In [0]:
## Pattern C — Using SQL

df.withColumn("col", F.expr("CAST(col AS new_type)"))

**⭐ 5. Detailed Example**

In [0]:
+----+----------+--------------+
| id | amount   | order_dt     |
+----+----------+--------------+
| 1  | "100.5"  | "2025-01-01" |
| 2  | "250.75" | "2025-01-02" |
+----+----------+--------------+


In [0]:
out = df.withColumn("amount", F.col("amount").cast("double")) \
        .withColumn("order_dt", F.col("order_dt").cast("date"))

In [0]:
+----+--------+------------+
| id | amount | order_dt   |
+----+--------+------------+
| 1  | 100.5  | 2025-01-01 |
| 2  | 250.75 | 2025-01-02 |
+----+--------+------------+


**⭐ 6. Mini Practice Problems**

1. Convert column price_str (string) → double.

2. Convert event_time (string) → timestamp.

3. Cast id to long so it matches the schema of another table.

**⭐ 7. Full Data Engineering Problem**

**Scenario:**
The Bronze e-commerce events table stores every field as STRING.
Before writing to Silver, you must cast:

- event_time → timestamp
- order_date → date
- amount → double
- quantity → integer
- is_return → boolean
- customer_id → long

This is a common scenario when ingesting raw CSV or JSON logs.

**Task:**
Write the PySpark casting block for all fields.

**⭐ 8. Time & Space Complexity**

| Operation       | Complexity                                         |
| --------------- | -------------------------------------------------- |
| Casting columns | **O(n)** per column, where n is the number of rows |
| Memory          | Low: no extra columns unless you create new ones   |


**⭐ 9. Common Pitfalls**

❌ Forgetting to cast before numeric operations
❌ Casting invalid strings: with ANSI mode off the result is a silent NULL; with spark.sql.ansi.enabled=true the query fails instead
❌ Mismatched types between join keys
❌ Confusing date with timestamp
❌ Using Python int() instead of .cast() (int() cannot convert a Column; use Spark's built-in cast)