# 🚀 ADVANCED TRANSFORMATIONS WITH PYSPARK

---

## 📋 **DAY 3 - LESSON 1: ADVANCED TRANSFORMATIONS**

### **🎯 OBJECTIVES:**

1. **Complex Transformations** - withColumn, select, expr
2. **Conditional Logic** - when/otherwise, case statements
3. **String Operations** - concat, split, regex, substring
4. **Date/Time Operations** - date_add, datediff, date_format
5. **Array Operations** - explode, array functions
6. **Struct Operations** - nested data handling
7. **User Defined Functions (UDFs)** - Python UDFs, Pandas UDFs
8. **Performance Best Practices** - Avoid UDFs when possible

---

## 🔧 **SETUP SPARK SESSION**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
from datetime import datetime, timedelta
import pandas as pd

spark = SparkSession.builder \
    .appName("AdvancedTransformations") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")

✅ Spark Session Created
Spark Version: 3.5.1
Master: spark://spark-master:7077
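
The session above hard-codes the MinIO access key and secret, which is fine for a local sandbox but easy to leak once the notebook is shared. A minimal sketch of one alternative, assuming the values are exported as environment variables. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` and the helpers `s3a_conf_from_env` and `apply_conf` are hypothetical, not something this lesson defines:

```python
import os


def s3a_conf_from_env(env=None):
    """Build the S3A settings used above from environment variables.

    The variable names are hypothetical. The endpoint default mirrors the
    notebook's value; the credentials deliberately have no default.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("MINIO_ACCESS_KEY", "MINIO_SECRET_KEY") if k not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def apply_conf(builder, conf):
    """Apply each key/value pair through the builder's .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark available, this could be used as `apply_conf(SparkSession.builder.appName("AdvancedTransformations"), s3a_conf_from_env()).getOrCreate()`.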


---

## 📊 **1. CREATE SAMPLE DATASET**

Create a complex dataset to practice the transformations on.

In [2]:
# Create complex sample data
data = [
    ("ORD001", "CUST001", "John Doe", "john.doe@email.com", 
     ["Product A", "Product B"], [100.0, 200.0], "2024-01-15 10:30:00", "USA", "completed"),
    
    ("ORD002", "CUST002", "Jane Smith", "jane.smith@email.com", 
     ["Product C"], [150.0], "2024-01-16 14:20:00", "UK", "pending"),
    
    ("ORD003", "CUST003", "Bob Johnson", "bob.johnson@email.com", 
     ["Product A", "Product C", "Product D"], [100.0, 150.0, 250.0], "2024-01-17 09:15:00", "Canada", "completed"),
    
    ("ORD004", "CUST001", "John Doe", "john.doe@email.com", 
     ["Product B", "Product D"], [200.0, 250.0], "2024-01-18 16:45:00", "USA", "cancelled"),
    
    ("ORD005", "CUST004", "Alice Brown", "alice.brown@email.com", 
     ["Product A"], [100.0], "2024-01-19 11:00:00", "USA", "completed"),
    
    ("ORD006", "CUST005", "Charlie Wilson", "charlie.wilson@email.com", 
     ["Product B", "Product C"], [200.0, 150.0], "2024-01-20 13:30:00", "UK", "completed"),
    
    ("ORD007", "CUST002", "Jane Smith", "jane.smith@email.com", 
     ["Product D"], [250.0], "2024-01-21 15:20:00", "UK", "pending"),
    
    ("ORD008", "CUST006", "David Lee", "david.lee@email.com", 
     ["Product A", "Product B", "Product C", "Product D"], [100.0, 200.0, 150.0, 250.0], 
     "2024-01-22 10:10:00", "Canada", "completed"),
]

schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("customer_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("products", ArrayType(StringType()), True),
    StructField("prices", ArrayType(DoubleType()), True),
    StructField("order_timestamp", StringType(), True),
    StructField("country", StringType(), True),
    StructField("status", StringType(), True)
])

df = spark.createDataFrame(data, schema)

print("📊 SAMPLE DATASET:")
df.show(truncate=False)
print(f"\nTotal rows: {df.count()}")
df.printSchema()

📊 SAMPLE DATASET:

+--------+-----------+--------------+------------------------+--------------------------------------------+----------------------------+-------------------+-------+---------+
|order_id|customer_id|customer_name |email                   |products                                    |prices                      |order_timestamp    |country|status   |
+--------+-----------+--------------+------------------------+--------------------------------------------+----------------------------+-------------------+-------+---------+
|ORD001  |CUST001    |John Doe      |john.doe@email.com      |[Product A, Product B]                      |[100.0, 200.0]              |2024-01-15 10:30:00|USA    |completed|
|ORD002  |CUST002    |Jane Smith    |jane.smith@email.com    |[Product C]                                 |[150.0]                     |2024-01-16 14:20:00|UK     |pending  |
|ORD003  |CUST003    |Bob Johnson   |bob.johnson@email.com   |[Product A, Product C, Product D]           |[100.0, 150.0, 250
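
`products` and `prices` are parallel arrays: the i-th price belongs to the i-th product, and later steps that pair them up rely on that. A small pre-flight check in plain Python, run on the raw tuples before `createDataFrame`, can catch a mismatch early. This is only a sketch: `check_parallel_arrays` is a hypothetical helper and the two rows are a shortened copy of the sample data.

```python
def check_parallel_arrays(rows, products_idx=4, prices_idx=5):
    """Return the order_ids whose products and prices arrays differ in length."""
    return [
        row[0]
        for row in rows
        if len(row[products_idx]) != len(row[prices_idx])
    ]


sample_rows = [
    ("ORD001", "CUST001", "John Doe", "john.doe@email.com",
     ["Product A", "Product B"], [100.0, 200.0], "2024-01-15 10:30:00", "USA", "completed"),
    ("ORD002", "CUST002", "Jane Smith", "jane.smith@email.com",
     ["Product C"], [150.0], "2024-01-16 14:20:00", "UK", "pending"),
]

bad_orders = check_parallel_arrays(sample_rows)
print(bad_orders)  # an empty list means every order is consistent
```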

---

## 🔄 **2. BASIC TRANSFORMATIONS REVIEW**

A review of the basic transformations.

In [3]:
# 2.1 withColumn - Add/modify columns
print("🔹 withColumn - Add new column:")
df_with_col = df.withColumn("order_year", lit(2024))
df_with_col.select("order_id", "order_year").show(5)

# 2.2 select - Select specific columns
print("\n🔹 select - Select columns:")
df.select("order_id", "customer_name", "country").show(5)

# 2.3 selectExpr - Select with SQL expressions
print("\n🔹 selectExpr - SQL expressions:")
df.selectExpr(
    "order_id",
    "customer_name",
    "upper(country) as country_upper"
).show(5)

# 2.4 drop - Remove columns
print("\n🔹 drop - Remove columns:")
df_dropped = df.drop("email")
print(f"Columns before: {df.columns}")
print(f"Columns after: {df_dropped.columns}")

# 2.5 withColumnRenamed - Rename columns
print("\n🔹 withColumnRenamed - Rename column:")
df_renamed = df.withColumnRenamed("customer_name", "full_name")
df_renamed.select("order_id", "full_name").show(5)

🔹 withColumn - Add new column:
+--------+----------+
|order_id|order_year|
+--------+----------+
|  ORD001|      2024|
|  ORD002|      2024|
|  ORD003|      2024|
|  ORD004|      2024|
|  ORD005|      2024|
+--------+----------+
only showing top 5 rows


🔹 select - Select columns:
+--------+-------------+-------+
|order_id|customer_name|country|
+--------+-------------+-------+
|  ORD001|     John Doe|    USA|
|  ORD002|   Jane Smith|     UK|
|  ORD003|  Bob Johnson| Canada|
|  ORD004|     John Doe|    USA|
|  ORD005|  Alice Brown|    USA|
+--------+-------------+-------+
only showing top 5 rows


🔹 selectExpr - SQL expressions:
+--------+-------------+-------------+
|order_id|customer_name|country_upper|
+--------+-------------+-------------+
|  ORD001|     John Doe|          USA|
|  ORD002|   Jane Smith|           UK|
|  ORD003|  Bob Johnson|       CANADA|
|  ORD004|     John Doe|          USA|
|  ORD005|  Alice Brown|          USA|
+--------+-------------+-------------+
only showing top 5 rows
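
Step 2.1 fills `order_year` with the literal 2024, which only holds while every order falls in that year. A hedged sketch of deriving it from `order_timestamp` instead: `order_year_column` is a hypothetical helper that builds the Spark expression (it imports PySpark lazily and needs an active session when called), and `order_year_py` is a plain-Python reference for the same logic that can be checked without a cluster.

```python
from datetime import datetime


def order_year_py(ts):
    """Plain-Python reference: year of a 'yyyy-MM-dd HH:mm:ss' string."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").year


def order_year_column(ts_col="order_timestamp"):
    """Spark equivalent; call only with an active SparkSession."""
    from pyspark.sql import functions as F  # lazy import: needs pyspark installed
    return F.year(F.to_timestamp(F.col(ts_col), "yyyy-MM-dd HH:mm:ss"))


print(order_year_py("2024-01-15 10:30:00"))  # 2024
```

In the notebook this would read `df.withColumn("order_year", order_year_column())`.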

---

## 🎯 **3. CONDITIONAL LOGIC - WHEN/OTHERWISE**

Handling complex conditional logic.

In [4]:
# 3.1 Simple when/otherwise
print("🔹 Simple when/otherwise:")
df_status = df.withColumn(
    "status_label",
    when(col("status") == "completed", "✅ Completed")
    .when(col("status") == "pending", "⏳ Pending")
    .when(col("status") == "cancelled", "❌ Cancelled")
    .otherwise("❓ Unknown")
)

df_status.select("order_id", "status", "status_label").show()

# 3.2 Multiple conditions with AND/OR
print("\n🔹 Multiple conditions:")
df_priority = df.withColumn(
    "priority",
    when(
        (col("status") == "completed") & (col("country") == "USA"),
        "High"
    ).when(
        (col("status") == "pending") | (col("status") == "cancelled"),
        "Medium"
    ).otherwise("Low")
)

df_priority.select("order_id", "status", "country", "priority").show()

# 3.3 Nested when/otherwise
print("\n🔹 Nested conditions:")
df_category = df.withColumn(
    "customer_category",
    when(col("country") == "USA",
        when(col("status") == "completed", "US-Premium")
        .otherwise("US-Standard")
    ).when(col("country") == "UK",
        when(col("status") == "completed", "UK-Premium")
        .otherwise("UK-Standard")
    ).otherwise("International")
)

df_category.select("order_id", "country", "status", "customer_category").show()

# 3.4 Using expr for complex logic
print("\n🔹 Using expr:")
df_expr = df.withColumn(
    "discount",
    expr("""
        CASE 
            WHEN status = 'completed' AND country = 'USA' THEN 0.15
            WHEN status = 'completed' THEN 0.10
            WHEN status = 'pending' THEN 0.05
            ELSE 0.0
        END
    """)
)

df_expr.select("order_id", "status", "country", "discount").show()

🔹 Simple when/otherwise:
+--------+---------+------------+
|order_id|   status|status_label|
+--------+---------+------------+
|  ORD001|completed| ✅ Completed|
|  ORD002|  pending|   ⏳ Pending|
|  ORD003|completed| ✅ Completed|
|  ORD004|cancelled| ❌ Cancelled|
|  ORD005|completed| ✅ Completed|
|  ORD006|completed| ✅ Completed|
|  ORD007|  pending|   ⏳ Pending|
|  ORD008|completed| ✅ Completed|
+--------+---------+------------+


🔹 Multiple conditions:
+--------+---------+-------+--------+
|order_id|   status|country|priority|
+--------+---------+-------+--------+
|  ORD001|completed|    USA|    High|
|  ORD002|  pending|     UK|  Medium|
|  ORD003|completed| Canada|     Low|
|  ORD004|cancelled|    USA|  Medium|
|  ORD005|completed|    USA|    High|
|  ORD006|completed|     UK|     Low|
|  ORD007|  pending|     UK|  Medium|
|  ORD008|completed| Canada|     Low|
+--------+---------+-------+--------+


🔹 Nested conditions:
+--------+-------+---------+----------
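
A `when(...).when(...).otherwise(...)` chain is evaluated top to bottom and the first matching branch wins, much like `if/elif/else`. The plain-Python sketch below mirrors the `priority` rule from 3.2 and the `discount` CASE expression from 3.4 so the branch order can be reasoned about without a cluster. The function names are hypothetical, and the three rows are taken from the sample data.

```python
def priority(status, country):
    """Mirror of the 3.2 chain: the first matching branch wins."""
    if status == "completed" and country == "USA":
        return "High"
    elif status in ("pending", "cancelled"):
        return "Medium"
    return "Low"


def discount(status, country):
    """Mirror of the 3.4 CASE expression."""
    if status == "completed" and country == "USA":
        return 0.15  # must come before the generic 'completed' branch
    elif status == "completed":
        return 0.10
    elif status == "pending":
        return 0.05
    return 0.0


rows = [
    ("ORD001", "completed", "USA"),
    ("ORD003", "completed", "Canada"),
    ("ORD004", "cancelled", "USA"),
]
for order_id, status, country in rows:
    print(order_id, priority(status, country), discount(status, country))
```

If the first two `discount` branches were swapped, completed US orders would match the generic branch first and receive 0.10, which is why the most specific condition goes on top.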

---

## 📝 **4. STRING OPERATIONS**

Advanced string processing.

In [5]:
# 4.1 Basic string functions
print("🔹 Basic string functions:")
df_string = df.select(
    "order_id",
    "customer_name",
    upper(col("customer_name")).alias("name_upper"),
    lower(col("customer_name")).alias("name_lower"),
    initcap(col("customer_name")).alias("name_initcap"),
    length(col("customer_name")).alias("name_length")
)

df_string.show(truncate=False)

# 4.2 Concatenation
print("\n🔹 String concatenation:")
df_concat = df.select(
    "order_id",
    "customer_name",
    "country",
    # Method 1: concat
    concat(col("customer_name"), lit(" - "), col("country")).alias("name_country_1"),
    # Method 2: concat_ws (with separator)
    concat_ws(" | ", col("customer_name"), col("country")).alias("name_country_2"),
    # Method 3: format_string
    format_string("%s from %s", col("customer_name"), col("country")).alias("name_country_3")
)

df_concat.show(truncate=False)

# 4.3 Substring and split
print("\n🔹 Substring and split:")
df_substr = df.select(
    "order_id",
    "customer_name",
    # Substring (start from position 1, length 4)
    substring(col("customer_name"), 1, 4).alias("first_4_chars"),
    # Split by space
    split(col("customer_name"), " ").alias("name_parts"),
    # Get first name (first element of split)
    split(col("customer_name"), " ").getItem(0).alias("first_name"),
    # Get last name (last element of split; element_at is 1-based, -1 means last)
    element_at(split(col("customer_name"), " "), -1).alias("last_name")
)

df_substr.show(truncate=False)

# 4.4 Regex operations
print("\n🔹 Regex operations:")
df_regex = df.select(
    "order_id",
    "email",
    # Extract domain from email
    regexp_extract(col("email"), r"@(.+)", 1).alias("email_domain"),
    # Replace the domain part of the address
    regexp_replace(col("email"), r"@.+", "@company.com").alias("email_masked"),
    # Check if matches pattern
    col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$").alias("is_valid_email")
)

df_regex.show(truncate=False)

# 4.5 Trim and pad
print("\n🔹 Trim and pad:")
df_trim = df.select(
    "order_id",
    # Left pad with zeros (total length 10)
    lpad(col("order_id"), 10, "0").alias("order_id_padded"),
    # Right pad with spaces
    rpad(col("customer_name"), 20, " ").alias("name_padded"),
    # Trim
    trim(col("customer_name")).alias("name_trimmed"),
    ltrim(col("customer_name")).alias("name_ltrimmed"),
    rtrim(col("customer_name")).alias("name_rtrimmed")
)

df_trim.show(truncate=False)

🔹 Basic string functions:
+--------+--------------+--------------+--------------+--------------+-----------+
|order_id|customer_name |name_upper    |name_lower    |name_initcap  |name_length|
+--------+--------------+--------------+--------------+--------------+-----------+
|ORD001  |John Doe      |JOHN DOE      |john doe      |John Doe      |8          |
|ORD002  |Jane Smith    |JANE SMITH    |jane smith    |Jane Smith    |10         |
|ORD003  |Bob Johnson   |BOB JOHNSON   |bob johnson   |Bob Johnson   |11         |
|ORD004  |John Doe      |JOHN DOE      |john doe      |John Doe      |8          |
|ORD005  |Alice Brown   |ALICE BROWN   |alice brown   |Alice Brown   |11         |
|ORD006  |Charlie Wilson|CHARLIE WILSON|charlie wilson|Charlie Wilson|14         |
|ORD007  |Jane Smith    |JANE SMITH    |jane smith    |Jane Smith    |10         |
|ORD008  |David Lee     |DAVID LEE     |david lee     |David Lee     |9          |
+--------+--------------+--------------+--------------+---
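
The patterns in 4.4 are simple enough to behave the same in Python's `re` module as in the Java regex engine Spark uses, so they can be tried locally first. The sketch below rests on that assumption (it would not hold for patterns that use engine-specific syntax). `email_domain` and `is_valid_email` are hypothetical helper names.

```python
import re

EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def email_domain(email):
    """Like regexp_extract(email, r"@(.+)", 1): group 1, or "" when nothing matches."""
    match = re.search(r"@(.+)", email)
    return match.group(1) if match else ""


def is_valid_email(email):
    """Like col("email").rlike(pattern) with the anchored pattern from 4.4."""
    return EMAIL_PATTERN.search(email) is not None


print(email_domain("john.doe@email.com"))    # email.com
print(is_valid_email("john.doe@email.com"))  # True
print(is_valid_email("not-an-email"))        # False
```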

---

## 📅 **5. DATE/TIME OPERATIONS**

Working with dates and times.

In [7]:
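# 5.1 Parse timestamp
# Parse the string once with the full pattern, then derive the date from the
# parsed timestamp. Passing "yyyy-MM-dd" to to_date() on a string that still
# carries a time part only works with spark.sql.legacy.timeParserPolicy=LEGACY.
print("🔹 Parse timestamp:")
df_date = df.withColumn(
    "order_datetime",
    to_timestamp(col("order_timestamp"), "yyyy-MM-dd HH:mm:ss")
).withColumn(
    "order_date",
    to_date(col("order_datetime"))
)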

df_date.select("order_id", "order_timestamp", "order_datetime", "order_date").show()

# 5.2 Extract date parts
print("\n🔹 Extract date parts:")
df_parts = df_date.select(
    "order_id",
    "order_datetime",
    year(col("order_datetime")).alias("year"),
    month(col("order_datetime")).alias("month"),
    dayofmonth(col("order_datetime")).alias("day"),
    dayofweek(col("order_datetime")).alias("day_of_week"),  # 1=Sunday, 7=Saturday
    dayofyear(col("order_datetime")).alias("day_of_year"),
    weekofyear(col("order_datetime")).alias("week_of_year"),
    quarter(col("order_datetime")).alias("quarter"),
    hour(col("order_datetime")).alias("hour"),
    minute(col("order_datetime")).alias("minute"),
    second(col("order_datetime")).alias("second")
)

df_parts.show()

# 5.3 Date arithmetic
print("\n🔹 Date arithmetic:")
df_calc = df_date.select(
    "order_id",
    "order_date",
    # Add days
    date_add(col("order_date"), 7).alias("delivery_date"),
    # Subtract days
    date_sub(col("order_date"), 3).alias("preparation_date"),
    # Add months
    add_months(col("order_date"), 1).alias("next_month"),
    # Days between dates
    datediff(current_date(), col("order_date")).alias("days_since_order"),
    # Months between dates
    months_between(current_date(), col("order_date")).alias("months_since_order")
)

df_calc.show()

# 5.4 Date formatting
print("\n🔹 Date formatting:")
df_format = df_date.select(
    "order_id",
    "order_datetime",
    # Format date
    date_format(col("order_datetime"), "yyyy-MM-dd").alias("date_iso"),
    date_format(col("order_datetime"), "dd/MM/yyyy").alias("date_eu"),
    date_format(col("order_datetime"), "MM/dd/yyyy").alias("date_us"),
    date_format(col("order_datetime"), "EEEE, MMMM dd, yyyy").alias("date_full"),
    date_format(col("order_datetime"), "HH:mm:ss").alias("time_only"),
    date_format(col("order_datetime"), "yyyy-MM-dd HH:mm:ss").alias("datetime_full")
)

df_format.show(truncate=False)

# 5.5 Current date/time functions
print("\n🔹 Current date/time:")
df_current = df.select(
    "order_id",
    current_date().alias("today"),
    current_timestamp().alias("now"),
    unix_timestamp().alias("unix_timestamp"),
    from_unixtime(unix_timestamp()).alias("from_unix")
)

df_current.show(truncate=False)

# 5.6 Truncate date
print("\n🔹 Truncate date:")
df_trunc = df_date.select(
    "order_id",
    "order_datetime",
    date_trunc("year", col("order_datetime")).alias("year_start"),
    date_trunc("month", col("order_datetime")).alias("month_start"),
    date_trunc("week", col("order_datetime")).alias("week_start"),
    date_trunc("day", col("order_datetime")).alias("day_start"),
    date_trunc("hour", col("order_datetime")).alias("hour_start")
)

df_trunc.show(truncate=False)

🔹 Parse timestamp:

+--------+-------------------+-------------------+----------+
|order_id|    order_timestamp|     order_datetime|order_date|
+--------+-------------------+-------------------+----------+
|  ORD001|2024-01-15 10:30:00|2024-01-15 10:30:00|2024-01-15|
|  ORD002|2024-01-16 14:20:00|2024-01-16 14:20:00|2024-01-16|
|  ORD003|2024-01-17 09:15:00|2024-01-17 09:15:00|2024-01-17|
|  ORD004|2024-01-18 16:45:00|2024-01-18 16:45:00|2024-01-18|
|  ORD005|2024-01-19 11:00:00|2024-01-19 11:00:00|2024-01-19|
|  ORD006|2024-01-20 13:30:00|2024-01-20 13:30:00|2024-01-20|
|  ORD007|2024-01-21 15:20:00|2024-01-21 15:20:00|2024-01-21|
|  ORD008|2024-01-22 10:10:00|2024-01-22 10:10:00|2024-01-22|
+--------+-------------------+-------------------+----------+


🔹 Extract date parts:
+--------+-------------------+----+-----+---+-----------+-----------+------------+-------+----+------+------+
|order_id|     order_datetime|year|month|day|day_of_week|day_of_year|week_of_year|quarter|hour|minute|second|
+--------
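
Spark's `dayofweek` numbers days from 1 = Sunday to 7 = Saturday, which differs from Python's `date.weekday()` (0 = Monday). When results are cross-checked in plain Python, that offset is an easy source of off-by-one errors. A small sketch using only the standard library; `spark_dayofweek` is a hypothetical helper that reproduces Spark's numbering, and the date is ORD001's order date from the sample data.

```python
from datetime import date, timedelta


def spark_dayofweek(d):
    """Spark's dayofweek numbering: 1 = Sunday ... 7 = Saturday."""
    return d.isoweekday() % 7 + 1


order_date = date(2024, 1, 15)                # ORD001, a Monday
print(spark_dayofweek(order_date))            # 2
print(order_date.weekday())                   # 0 (Python counts from Monday = 0)
print(order_date + timedelta(days=7))         # 2024-01-22, like date_add(order_date, 7)
print((date(2024, 1, 22) - order_date).days)  # 7, like datediff(end, start)
```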

---

## 🔢 **6. NUMERIC OPERATIONS**

Arithmetic and mathematical operations.

In [12]:
# First, calculate the total amount from the prices array
print("🔹 Calculating total amount from prices array:")
# Approach 1: use explode and sum (easier to follow)
df_numeric = df.withColumn(
    "price_element",
    expr("explode(prices)")
).groupBy("order_id").agg(
    expr("sum(price_element)").alias("total_amount")
)

# Join back to the original DataFrame to keep the other columns
df_numeric = df.join(df_numeric, on="order_id")

# 6.1 Basic math operations
print("🔹 Basic math operations:")
df_math = df_numeric.select(
    "order_id",
    "total_amount",
    # Arithmetic
    (col("total_amount") + 10).alias("amount_plus_10"),
    (col("total_amount") - 10).alias("amount_minus_10"),
    (col("total_amount") * 1.1).alias("amount_with_tax"),
    (col("total_amount") / 2).alias("amount_half"),
    (col("total_amount") % 100).alias("amount_mod_100")
)

df_math.show()

# 6.2 Rounding functions
print("\n🔹 Rounding functions:")
df_round = df_numeric.select(
    "order_id",
    "total_amount",
    round(col("total_amount"), 0).alias("rounded"),
    round(col("total_amount"), 2).alias("rounded_2dp"),
    ceil(col("total_amount")).alias("ceiling"),
    floor(col("total_amount")).alias("floor"),
    bround(col("total_amount"), 0).alias("banker_rounded")  # Banker's rounding
)

df_round.show()

# 6.3 Mathematical functions
print("\n🔹 Mathematical functions:")
df_math_func = df_numeric.select(
    "order_id",
    "total_amount",
    abs(col("total_amount")).alias("absolute"),
    sqrt(col("total_amount")).alias("square_root"),
    pow(col("total_amount"), 2).alias("squared"),
    log(col("total_amount")).alias("natural_log"),
    log10(col("total_amount")).alias("log_base_10"),
    exp(lit(1)).alias("exponential")
)

df_math_func.show()

# 6.4 Statistical functions
print("\n🔹 Statistical functions:")
df_stats = df_numeric.select(
    "order_id",
    "total_amount",
    # Min/Max with literal
    greatest(col("total_amount"), lit(200)).alias("max_with_200"),
    least(col("total_amount"), lit(400)).alias("min_with_400")
)

df_stats.show()

# 6.5 Null handling in numeric operations
print("\n🔹 Null handling:")
df_null = df_numeric.select(
    "order_id",
    "total_amount",
    coalesce(col("total_amount"), lit(0)).alias("amount_or_zero"),
    nvl(col("total_amount"), lit(0)).alias("amount_nvl"),
    when(col("total_amount").isNull(), 0).otherwise(col("total_amount")).alias("amount_when")
)

df_null.show()

🔹 Calculating total amount from prices array:
🔹 Basic math operations:
+--------+------------+--------------+---------------+------------------+-----------+--------------+
|order_id|total_amount|amount_plus_10|amount_minus_10|   amount_with_tax|amount_half|amount_mod_100|
+--------+------------+--------------+---------------+------------------+-----------+--------------+
|  ORD001|       300.0|         310.0|          290.0|             330.0|      150.0|           0.0|
|  ORD003|       500.0|         510.0|          490.0|             550.0|      250.0|           0.0|
|  ORD002|       150.0|         160.0|          140.0|             165.0|       75.0|          50.0|
|  ORD004|       450.0|         460.0|          440.0|495.00000000000006|      225.0|          50.0|
|  ORD008|       700.0|         710.0|          690.0| 770.0000000000001|      350.0|           0.0|
|  ORD005|       100.0|         110.0|           90.0|110.00000000000001|       50.0|           0.0|
|  ORD006|    

🔹 Calculating total amount from prices array:

+--------+--------------------+------------+
|order_id|              prices|total_amount|
+--------+--------------------+------------+
|  ORD008|[100.0, 200.0, 15...|       700.0|
|  ORD005|             [100.0]|       100.0|
|  ORD006|      [200.0, 150.0]|       350.0|
|  ORD007|             [250.0]|       250.0|
|  ORD001|      [100.0, 200.0]|       300.0|
|  ORD003|[100.0, 150.0, 25...|       500.0|
|  ORD002|             [150.0]|       150.0|
|  ORD004|      [200.0, 250.0]|       450.0|
+--------+--------------------+------------+



In [11]:
df_numeric.show()

+--------+-----------+--------------+--------------------+--------------------+--------------------+-------------------+-------+---------+------------+
|order_id|customer_id| customer_name|               email|            products|              prices|    order_timestamp|country|   status|total_amount|
+--------+-----------+--------------+--------------------+--------------------+--------------------+-------------------+-------+---------+------------+
|  ORD001|    CUST001|      John Doe|  john.doe@email.com|[Product A, Produ...|      [100.0, 200.0]|2024-01-15 10:30:00|    USA|completed|       300.0|
|  ORD003|    CUST003|   Bob Johnson|bob.johnson@email...|[Product A, Produ...|[100.0, 150.0, 25...|2024-01-17 09:15:00| Canada|completed|       500.0|
|  ORD002|    CUST002|    Jane Smith|jane.smith@email.com|         [Product C]|             [150.0]|2024-01-16 14:20:00|     UK|  pending|       150.0|
|  ORD004|    CUST001|      John Doe|  john.doe@email.com|[Product B, Produ...|      [20
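
Two details in the outputs above are worth a closer look. The `amount_with_tax` column shows values such as `495.00000000000006` because `DoubleType` is a binary floating-point type, exactly like a Python `float`. And `round` and `bround` differ only on ties: `round` rounds half up, `bround` rounds half to even. The sketch below reproduces both effects with the standard library only. It is an illustration in plain Python, not Spark code, and `round_to_int` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# 1.1 has no exact binary representation, so the product carries a tiny error.
print(450.0 * 1.1)                         # 495.00000000000006
print(Decimal("450.00") * Decimal("1.1"))  # 495.000 (exact decimal arithmetic)


def round_to_int(value, mode):
    """Round a decimal string to an integer using the given tie-breaking mode."""
    return Decimal(value).quantize(Decimal("1"), rounding=mode)


print(round_to_int("2.5", ROUND_HALF_UP))    # 3  (the rule round() uses in Spark)
print(round_to_int("2.5", ROUND_HALF_EVEN))  # 2  (the rule bround() uses in Spark)
print(round_to_int("3.5", ROUND_HALF_EVEN))  # 4
```

For monetary columns, one option in Spark is to cast to a decimal type before doing arithmetic, for example `col("total_amount").cast("decimal(10,2)")`. The precision and scale shown here are an assumption and would need to fit the real data.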

---

## 📦 **7. ARRAY OPERATIONS**

Working with arrays - VERY IMPORTANT!

In [14]:
# 7.1 Basic array functions
print("🔹 Basic array functions:")
df_array = df.select(
    "order_id",
    "products",
    "prices",
    # Array size
    size(col("products")).alias("num_products"),
    # Get element by index (0-based)
    col("products").getItem(0).alias("first_product"),
    col("prices").getItem(0).alias("first_price"),
    # Check if array contains element
    array_contains(col("products"), "Product A").alias("has_product_a")
)

df_array.show(truncate=False)

# 7.2 Explode - Convert array to rows
print("\n🔹 Explode array to rows:")
df_exploded = df.select(
    "order_id",
    "customer_name",
    explode(col("products")).alias("product")
)

df_exploded.show(truncate=False)
print(f"Original rows: {df.count()}, After explode: {df_exploded.count()}")

# 7.3 Explode with position
print("\n🔹 Explode with position:")
df_exploded_pos = df.select(
    "order_id",
    posexplode(col("products")).alias("position", "product")
)

df_exploded_pos.show(truncate=False)

# 7.4 Explode multiple arrays together
print("\n🔹 Explode multiple arrays (zip):")
df_zipped = df.select(
    "order_id",
    "customer_name",
    explode(arrays_zip(col("products"), col("prices"))).alias("item")
).select(
    "order_id",
    "customer_name",
    col("item.products").alias("product"),
    col("item.prices").alias("price")
)

df_zipped.show(truncate=False)

# 7.5 Array aggregations
print("\n🔹 Array aggregations:")
df_agg = df.select(
    "order_id",
    "products",
    "prices",
    # Sum of array elements
    expr("aggregate(prices, CAST(0 AS DOUBLE), (acc, x) -> acc + x)").alias("total_amount"),
    # Average
    (expr("aggregate(prices, CAST(0 AS DOUBLE), (acc, x) -> acc + x)") / size(col("prices"))).alias("avg_price"),
    # Min/Max
    array_min(col("prices")).alias("min_price"),
    array_max(col("prices")).alias("max_price")
)

df_agg.show(truncate=False)

# 7.6 Array transformations
print("\n🔹 Array transformations:")
df_transform = df.select(
    "order_id",
    "products",
    "prices",
    # Transform each element (add 10% tax)
    expr("transform(prices, x -> x * 1.1)").alias("prices_with_tax"),
    # Filter array elements
    expr("filter(prices, x -> x > 150)").alias("expensive_items"),
    # Check if any element matches condition
    expr("exists(prices, x -> x > 200)").alias("has_expensive_item"),
    # Check if all elements match condition
    expr("forall(prices, x -> x > 0)").alias("all_positive")
)

df_transform.show(truncate=False)

# 7.7 Array sorting and distinct
print("\n🔹 Array sorting and distinct:")
df_sort = df.select(
    "order_id",
    "products",
    "prices",
    # Sort array
    array_sort(col("products")).alias("products_sorted"),
    array_sort(col("prices")).alias("prices_sorted"),
    # Remove duplicates
    array_distinct(col("products")).alias("products_unique"),
    # Reverse array
    reverse(col("products")).alias("products_reversed")
)

df_sort.show(truncate=False)

# 7.8 Array union, intersect, except
print("\n🔹 Array set operations:")
# Create sample data for set operations
df_sets = spark.createDataFrame([
    ("ORD001", ["A", "B", "C"], ["B", "C", "D"]),
    ("ORD002", ["X", "Y"], ["Y", "Z"])
], ["order_id", "array1", "array2"])

df_set_ops = df_sets.select(
    "order_id",
    "array1",
    "array2",
    # Union (combine arrays)
    array_union(col("array1"), col("array2")).alias("union"),
    # Intersect (common elements)
    array_intersect(col("array1"), col("array2")).alias("intersect"),
    # Except (elements in array1 but not in array2)
    array_except(col("array1"), col("array2")).alias("except")
)

df_set_ops.show(truncate=False)

🔹 Basic array functions:
+--------+--------------------------------------------+----------------------------+------------+-------------+-----------+-------------+
|order_id|products                                    |prices                      |num_products|first_product|first_price|has_product_a|
+--------+--------------------------------------------+----------------------------+------------+-------------+-----------+-------------+
|ORD001  |[Product A, Product B]                      |[100.0, 200.0]              |2           |Product A    |100.0      |true         |
|ORD002  |[Product C]                                 |[150.0]                     |1           |Product C    |150.0      |false        |
|ORD003  |[Product A, Product C, Product D]           |[100.0, 150.0, 250.0]       |3           |Product A    |100.0      |true         |
|ORD004  |[Product B, Product D]                      |[200.0, 250.0]              |2           |Product B    |200.0      |false        |
|ORD00
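
The higher-order array functions used in 7.4 to 7.6 map closely onto familiar Python idioms, which can help when reading the lambda syntax inside `expr(...)`. The sketch below works on ORD001's two arrays in plain Python. It is an analogy rather than Spark code, and it uses `zip_longest` because `arrays_zip` pads a shorter array with nulls instead of truncating.

```python
from itertools import zip_longest

products = ["Product A", "Product B"]  # ORD001
prices = [100.0, 200.0]

items = list(zip_longest(products, prices))   # arrays_zip(products, prices)
total = sum(prices, 0.0)                      # aggregate(prices, 0.0, (acc, x) -> acc + x)
with_tax = [x * 1.1 for x in prices]          # transform(prices, x -> x * 1.1)
expensive = [x for x in prices if x > 150]    # filter(prices, x -> x > 150)
has_expensive = any(x > 200 for x in prices)  # exists(prices, x -> x > 200)
all_positive = all(x > 0 for x in prices)     # forall(prices, x -> x > 0)

print(items)
print(total, expensive, has_expensive, all_positive)
```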

---

## 🏗️ **8. STRUCT OPERATIONS**

Working with nested data (structs).

In [15]:
# 8.1 Create struct
print("🔹 Create struct:")
df_struct = df.select(
    "order_id",
    # Create struct from columns
    struct(
        col("customer_id"),
        col("customer_name"),
        col("email")
    ).alias("customer_info"),
    # Create struct with named fields
    struct(
        col("products").alias("items"),
        col("prices").alias("amounts")
    ).alias("order_details")
)

df_struct.show(truncate=False)
df_struct.printSchema()

# 8.2 Access struct fields
print("\n🔹 Access struct fields:")
df_access = df_struct.select(
    "order_id",
    # Method 1: Dot notation
    col("customer_info.customer_name").alias("name_1"),
    # Method 2: getField
    col("customer_info").getField("email").alias("email_1"),
    # Access nested array in struct
    col("order_details.items").alias("products")
)

df_access.show(truncate=False)

# 8.3 Flatten struct
print("\n🔹 Flatten struct:")
df_flatten = df_struct.select(
    "order_id",
    "customer_info.*",  # Flatten all fields
    "order_details.*"
)

df_flatten.show(truncate=False)

# 8.4 Complex nested structure
print("\n🔹 Complex nested structure:")
df_complex = df.select(
    "order_id",
    struct(
        col("customer_id"),
        col("customer_name"),
        struct(
            col("email"),
            col("country")
        ).alias("contact")
    ).alias("customer"),
    struct(
        col("products"),
        col("prices"),
        col("status")
    ).alias("order")
)

df_complex.show(truncate=False)
df_complex.printSchema()

# Access deeply nested fields
print("\n🔹 Access deeply nested fields:")
df_deep = df_complex.select(
    "order_id",
    col("customer.customer_name").alias("name"),
    col("customer.contact.email").alias("email"),
    col("customer.contact.country").alias("country"),
    col("order.status").alias("status")
)

df_deep.show(truncate=False)

🔹 Create struct:
+--------+---------------------------------------------------+----------------------------------------------------------------------------+
|order_id|customer_info                                      |order_details                                                               |
+--------+---------------------------------------------------+----------------------------------------------------------------------------+
|ORD001  |{CUST001, John Doe, john.doe@email.com}            |{[Product A, Product B], [100.0, 200.0]}                                    |
|ORD002  |{CUST002, Jane Smith, jane.smith@email.com}        |{[Product C], [150.0]}                                                      |
|ORD003  |{CUST003, Bob Johnson, bob.johnson@email.com}      |{[Product A, Product C, Product D], [100.0, 150.0, 250.0]}                  |
|ORD004  |{CUST001, John Doe, john.doe@email.com}            |{[Product B, Product D], [200.0, 250.0]}                                    |
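
A struct column behaves much like a nested dictionary, and the dotted paths passed to `col(...)` in 8.4 (for example `customer.contact.email`) are paths through that nesting. The plain-Python sketch below flattens a nested record into such dotted paths. `flatten` is a hypothetical helper, and the record is a hand-written stand-in for one row of `df_complex`.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. customer.contact.email."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into the nested struct
        else:
            flat[path] = value
    return flat


row = {
    "order_id": "ORD001",
    "customer": {
        "customer_id": "CUST001",
        "customer_name": "John Doe",
        "contact": {"email": "john.doe@email.com", "country": "USA"},
    },
}

flat_row = flatten(row)
print(sorted(flat_row))
print(flat_row["customer.contact.email"])  # john.doe@email.com
```

Unlike this recursive helper, `"customer_info.*"` in Spark expands only one level, so a nested struct such as `contact` would stay a struct until it is expanded in turn.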


---

## 🎨 **9. USER DEFINED FUNCTIONS (UDFs)**

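⚠️ **WARNING:** Python UDFs are SLOW! Every row is serialized between the JVM and a Python worker, and the optimizer cannot see inside the function. Use built-in functions when possible!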

In [17]:
# 9.1 Simple Python UDF
print("🔹 Simple Python UDF:")

from pyspark.sql.functions import udf

# Define Python function
def categorize_amount(amount):
    if amount < 200:
        return "Low"
    elif amount < 400:
        return "Medium"
    else:
        return "High"

# Register as UDF
categorize_udf = udf(categorize_amount, StringType())

# Use UDF
df_with_amount = df.withColumn(
    "total_amount",
    expr("aggregate(prices, CAST(0 AS DOUBLE), (acc, x) -> acc + x)")
)

df_udf = df_with_amount.withColumn(
    "amount_category",
    categorize_udf(col("total_amount"))
)

df_udf.select("order_id", "total_amount", "amount_category").show()

# 9.2 UDF with multiple inputs
print("\n🔹 UDF with multiple inputs:")

def calculate_discount(amount, country, status):
    discount = 0.0
    if status == "completed":
        if country == "USA":
            discount = 0.15
        elif country == "UK":
            discount = 0.10
        else:
            discount = 0.05
    return amount * discount

discount_udf = udf(calculate_discount, DoubleType())

df_discount = df_with_amount.withColumn(
    "discount_amount",
    discount_udf(col("total_amount"), col("country"), col("status"))
)

df_discount.select("order_id", "total_amount", "country", "status", "discount_amount").show()

# 9.3 UDF with complex return type
print("\n🔹 UDF with complex return type:")

# Returns a dict whose keys match the StructType field names declared below
def parse_email(email):
    if email:
        parts = email.split("@")
        return {"username": parts[0], "domain": parts[1] if len(parts) > 1 else None}
    return {"username": None, "domain": None}

email_schema = StructType([
    StructField("username", StringType(), True),
    StructField("domain", StringType(), True)
])

parse_email_udf = udf(parse_email, email_schema)

df_email = df.withColumn(
    "email_parts",
    parse_email_udf(col("email"))
).select(
    "order_id",
    "email",
    "email_parts.*"
)

df_email.show(truncate=False)

# 9.4 Register UDF for SQL
print("\n🔹 Register UDF for SQL:")

spark.udf.register("categorize_amount_sql", categorize_amount, StringType())

df_with_amount.createOrReplaceTempView("orders")

df_sql_udf = spark.sql("""
    SELECT 
        order_id,
        total_amount,
        categorize_amount_sql(total_amount) as category
    FROM orders
""")

df_sql_udf.show()

🔹 Simple Python UDF:
+--------+------------+---------------+
|order_id|total_amount|amount_category|
+--------+------------+---------------+
|  ORD001|       300.0|         Medium|
|  ORD002|       150.0|            Low|
|  ORD003|       500.0|           High|
|  ORD004|       450.0|           High|
|  ORD005|       100.0|            Low|
|  ORD006|       350.0|         Medium|
|  ORD007|       250.0|         Medium|
|  ORD008|       700.0|           High|
+--------+------------+---------------+


🔹 UDF with multiple inputs:
+--------+------------+-------+---------+---------------+
|order_id|total_amount|country|   status|discount_amount|
+--------+------------+-------+---------+---------------+
|  ORD001|       300.0|    USA|completed|           45.0|
|  ORD002|       150.0|     UK|  pending|            0.0|
|  ORD003|       500.0| Canada|completed|           25.0|
|  ORD004|       450.0|    USA|cancelled|            0.0|
|  ORD005|       100.0|    USA|completed|           15.0|
|  ORD006|       350.0|
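
Every UDF in this section has a built-in equivalent that keeps the work inside the JVM, where Spark can optimize it. The sketch below is one possible rewrite, assuming `df_with_amount` still carries the `total_amount`, `country`, `status`, and `email` columns used above. The helper name `add_builtin_columns` is illustrative and not part of the lesson code.

```python
def add_builtin_columns(df):
    """Recreate the section 9 UDF columns with built-in Spark functions only."""
    # Imported inside the helper so it can be defined before Spark is set up
    from pyspark.sql import functions as F

    amount = F.col("total_amount")

    # Same thresholds as categorize_amount; a NULL amount stays NULL
    category = (
        F.when(amount.isNull(), F.lit(None).cast("string"))
        .when(amount < 200, "Low")
        .when(amount < 400, "Medium")
        .otherwise("High")
    )

    # Same rates as calculate_discount: only completed orders get a discount
    rate = F.when(
        F.col("status") == "completed",
        F.when(F.col("country") == "USA", 0.15)
        .when(F.col("country") == "UK", 0.10)
        .otherwise(0.05),
    ).otherwise(0.0)

    # Elements 0 and 1 of the "@"-split, mirroring parse_email.
    # Assumption: ANSI mode is off (the Spark 3.5 default), so a missing
    # element 1 yields NULL instead of an error.
    email_parts = F.split(F.col("email"), "@")

    return (
        df.withColumn("amount_category", category)
        .withColumn("discount_amount", amount * rate)
        .withColumn("username", email_parts.getItem(0))
        .withColumn("domain", email_parts.getItem(1))
    )
```

Calling `add_builtin_columns(df_with_amount).show()` should give the same categories and discounts as the UDF versions for the rows shown above, without starting a Python worker for each task.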

---

## ⚡ **10. PANDAS UDFs (VECTORIZED UDFs)**

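✅ **MUCH FASTER** than regular UDFs in most cases: data moves between the JVM and Python in Arrow batches instead of row by row. Pandas and PyArrow must be installed on every executor, not just on the driver.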

In [21]:
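# 10.1 Pandas UDF (Scalar)
print("🔹 Pandas UDF (Scalar):")

from pyspark.sql.functions import pandas_udf
import numpy as np
import pandas as pd

@pandas_udf(StringType())
def categorize_amount_pandas(amounts: pd.Series) -> pd.Series:
    # np.select evaluates the thresholds on the whole batch at once,
    # instead of calling a Python lambda per row via Series.apply
    labels = np.select([amounts < 200, amounts < 400], ["Low", "Medium"], default="High")
    return pd.Series(labels, index=amounts.index)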

df_pandas_udf = df_with_amount.withColumn(
    "amount_category_pandas",
    categorize_amount_pandas(col("total_amount"))
)

df_pandas_udf.select("order_id", "total_amount", "amount_category_pandas").show()

# # 10.2 Pandas UDF with multiple columns
# print("\n🔹 Pandas UDF with multiple columns:")

# @pandas_udf(DoubleType())
# def calculate_discount_pandas(amounts: pd.Series, countries: pd.Series, statuses: pd.Series) -> pd.Series:
#     def calc(amount, country, status):
#         if status == "completed":
#             if country == "USA":
#                 return amount * 0.15
#             elif country == "UK":
#                 return amount * 0.10
#             else:
#                 return amount * 0.05
#         return 0.0
    
#     return pd.Series([calc(a, c, s) for a, c, s in zip(amounts, countries, statuses)])

# df_pandas_discount = df_with_amount.withColumn(
#     "discount_pandas",
#     calculate_discount_pandas(col("total_amount"), col("country"), col("status"))
# )

# df_pandas_discount.select("order_id", "total_amount", "country", "status", "discount_pandas").show()

# # 10.3 Performance comparison: Regular UDF vs Pandas UDF
# print("\n⚡ Performance Comparison:")
# print("Regular UDF: serializes and processes one row at a time (slow)")
# print("Pandas UDF: processes Arrow batches as pandas Series (fast)")
# print("Speedup: roughly 3-100x, depending on the workload and data size")

🔹 Pandas UDF (Scalar):


26/01/08 17:08:10 WARN TaskSetManager: Lost task 0.0 in stage 148.0 (TID 159) (172.18.0.7 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1231, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1067, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 529, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 174, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", li

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1231, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1067, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 529, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 174, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 472, in loads
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'pandas'
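
The traceback ends in `ModuleNotFoundError: No module named 'pandas'`, raised by the Python worker on an executor (`172.18.0.7 executor 0`). The driver imported pandas without trouble, so the package is missing from the executor's Python environment. Pandas UDFs also need PyArrow on every node. Below is a minimal diagnostic sketch, assuming the `spark` session from the setup cell; `missing_modules` and `missing_on_executors` are hypothetical helper names.

```python
REQUIRED = ("pandas", "pyarrow")  # packages Pandas UDFs need on every node

def missing_modules(names=REQUIRED):
    """Return the names that cannot be imported in the current interpreter."""
    from importlib.util import find_spec
    return [name for name in names if find_spec(name) is None]

def missing_on_executors(spark, num_tasks=4):
    """Run the same check inside executor Python workers and collect the results."""
    import socket

    def check(_):
        # Executes in a Python worker on an executor, not on the driver
        return (socket.gethostname(), tuple(missing_modules()))

    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return sorted(set(rdd.map(check).collect()))

# Driver-side check; runs without a cluster
print("Missing on driver:", missing_modules())
```

`missing_on_executors(spark)` returns `(hostname, missing_packages)` pairs. Tasks are not guaranteed to land on every executor, so treat an all-clear result as a hint rather than proof. If packages are reported missing, installing pandas and PyArrow in the workers' Python environment and restarting the workers is the usual fix.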


---

## 🎯 **11. BEST PRACTICES & PERFORMANCE TIPS**

In [None]:
print("="*80)
print("🎯 TRANSFORMATION BEST PRACTICES")
print("="*80)

print("""
✅ DO:
1. Use built-in functions whenever possible (much faster than UDFs)
2. Prefer Pandas UDFs over regular UDFs when a UDF is unavoidable (often 3-100x faster)
3. Chain transformations; lazy evaluation lets Spark optimize the whole plan
4. Use selectExpr for complex SQL expressions
5. Use expr() for complex logic instead of nested when/otherwise
6. Leverage array functions instead of explode when possible
7. Use broadcast for small lookup tables
8. Cache intermediate results if reused multiple times

❌ DON'T:
1. Use Python UDFs unless absolutely necessary (very slow!)
2. Use collect() on large datasets (brings all data to the driver)
3. Use rdd.map() when DataFrame operations are available
4. Create too many small partitions (scheduling overhead)
5. Use UDFs for operations that can be done with built-in functions
6. Explode large arrays unnecessarily (data multiplication)
7. Use nested loops in UDFs (extremely slow)
8. Ignore null handling (NULLs reach Python UDFs as None and cause errors)

⚡ PERFORMANCE RANKING (Fastest to Slowest):
1. Built-in functions (SQL expressions) ⚡⚡⚡⚡⚡
2. Pandas UDFs (Vectorized) ⚡⚡⚡⚡
3. Python UDFs ⚡⚡
4. RDD operations ⚡
""")

print("="*80)
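
Tips 1 and 7 combine naturally: the country-based discount from section 9.2 can be a small lookup table joined with a broadcast hint instead of a UDF. The sketch below assumes `df_with_amount` has `total_amount`, `country`, and `status` columns. `DISCOUNT_RATES`, `DEFAULT_RATE`, and `add_discount` are illustrative names, not part of the lesson code.

```python
# Rates copied from calculate_discount in section 9.2
DISCOUNT_RATES = {"USA": 0.15, "UK": 0.10}
DEFAULT_RATE = 0.05  # any other country, completed orders only

def add_discount(spark, df):
    """Add discount_amount via a broadcast join against a small rate table."""
    from pyspark.sql import functions as F

    rates = spark.createDataFrame(list(DISCOUNT_RATES.items()), ["country", "rate"])

    # The broadcast hint ships the tiny table to every executor, avoiding a shuffle
    joined = df.join(F.broadcast(rates), on="country", how="left")

    # Countries without a match get the default rate; other statuses get no discount
    rate = F.when(
        F.col("status") == "completed",
        F.coalesce(F.col("rate"), F.lit(DEFAULT_RATE)),
    ).otherwise(F.lit(0.0))

    return joined.withColumn("discount_amount", F.col("total_amount") * rate).drop("rate")
```

With the rates held in a table, changing them means editing data rather than code, and the query stays fully visible to the optimizer. Usage would be `add_discount(spark, df_with_amount).show()`.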

---

## 💾 **12. SAVE TRANSFORMED DATA**

In [None]:
# Create final transformed dataset
df_final = df.withColumn(
    "order_datetime",
    to_timestamp(col("order_timestamp"), "yyyy-MM-dd HH:mm:ss")
).withColumn(
    "total_amount",
    expr("aggregate(prices, CAST(0 AS DOUBLE), (acc, x) -> acc + x)")
).withColumn(
    "num_products",
    size(col("products"))
).withColumn(
    "avg_price",
    col("total_amount") / col("num_products")
).withColumn(
    "order_date",
    to_date(col("order_datetime"))
).withColumn(
    "order_year",
    year(col("order_datetime"))
).withColumn(
    "order_month",
    month(col("order_datetime"))
).withColumn(
    "amount_category",
    when(col("total_amount") < 200, "Low")
    .when(col("total_amount") < 400, "Medium")
    .otherwise("High")
)

print("✅ FINAL TRANSFORMED DATA:")
df_final.show(truncate=False)

# Save to MinIO
output_path = "s3a://warehouse/transformed_orders/"

df_final.write \
    .mode("overwrite") \
    .partitionBy("order_year", "order_month") \
    .parquet(output_path)

print(f"\n✅ Data saved to: {output_path}")

# Verify
df_verify = spark.read.parquet(output_path)
print(f"\n✅ Verification: {df_verify.count()} rows loaded")
df_verify.show(5)
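
`partitionBy("order_year", "order_month")` writes Hive-style directories of the form `order_year=<year>/order_month=<month>/` under the output path, and filters on those columns let Spark skip whole directories when reading. The sketch below illustrates both. `partition_dir` and `read_month` are hypothetical helpers, and the year and month values are placeholders, not taken from the dataset.

```python
def partition_dir(base_path, year, month):
    """Build the Hive-style directory Spark creates for one year/month partition."""
    return f"{base_path.rstrip('/')}/order_year={year}/order_month={month}/"

def read_month(spark, base_path, year, month):
    """Read one month by filtering on the partition columns (partition pruning)."""
    from pyspark.sql import functions as F

    return spark.read.parquet(base_path).where(
        (F.col("order_year") == year) & (F.col("order_month") == month)
    )

# Placeholder year and month, for illustration only
print(partition_dir("s3a://warehouse/transformed_orders/", 2024, 1))
```

Reading from the base path and filtering, rather than pointing `spark.read.parquet` at a single partition directory, keeps `order_year` and `order_month` available as columns in the result.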

---

## 🎓 **KEY TAKEAWAYS**

### **✅ What You Learned:**

1. **Conditional Logic** - when/otherwise, case statements
2. **String Operations** - concat, split, regex, substring
3. **Date/Time** - date_add, date_diff, date_format, date_trunc
4. **Numeric Operations** - round, ceil, floor, math functions
5. **Array Operations** - explode, array functions, transformations
6. **Struct Operations** - nested data handling
7. **UDFs** - Python UDFs (slow) vs Pandas UDFs (fast)
8. **Performance** - Always prefer built-in functions!

### **🚀 Next Steps:**
- **Day 3 - Lesson 2:** Aggregations (groupBy, pivot, rollup, cube)
- **Day 3 - Lesson 3:** Window Functions (ranking, lag/lead, running totals)
- **Day 3 - Lesson 4:** Joins (inner, outer, broadcast, optimization)

---

In [None]:
# Cleanup
spark.stop()
print("✅ Spark session stopped")
print("\n🎉 DAY 3 - LESSON 1 COMPLETED!")