# 🔗 JOINS WITH PYSPARK

---

## 📋 **DAY 3 - LESSON 4: JOINS**

### **🎯 OBJECTIVES:**

1. **Join Types** - inner, left, right, full, cross, semi, anti
2. **Join Conditions** - single, multiple, complex
3. **Broadcast Joins** - Optimize small table joins
4. **Join Strategies** - broadcast hash, shuffle hash, sort-merge
5. **Performance** - Skew handling, partitioning
6. **Real-World Patterns** - Dimension tables, fact tables

---

## 🔧 **SETUP**

In [1]:
from pyspark.sql import SparkSession
# Note: the star import shadows Python built-ins such as sum, min, max and round
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("Joins") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.sql.autoBroadcastJoinThreshold", "10485760") \
    .getOrCreate()

print("✅ Spark Session Created")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/10 16:40:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session Created


---

## 📊 **1. CREATE SAMPLE DATASETS**

In [2]:
# Orders (Fact table)
orders_data = [
    ("ORD001", "CUST001", "PROD001", "2024-01-15", 2, 1200.0),
    ("ORD002", "CUST002", "PROD002", "2024-01-16", 1, 800.0),
    ("ORD003", "CUST001", "PROD003", "2024-01-17", 3, 150.0),
    ("ORD004", "CUST003", "PROD001", "2024-01-18", 1, 1200.0),
    ("ORD005", "CUST004", "PROD004", "2024-01-19", 2, 600.0),
    ("ORD006", "CUST999", "PROD999", "2024-01-20", 1, 500.0),  # No matching customer/product
]

orders = spark.createDataFrame(orders_data, 
    ["order_id", "customer_id", "product_id", "order_date", "quantity", "amount"])

# Customers (Dimension table)
customers_data = [
    ("CUST001", "John Doe", "john@email.com", "USA"),
    ("CUST002", "Jane Smith", "jane@email.com", "UK"),
    ("CUST003", "Bob Johnson", "bob@email.com", "Canada"),
    ("CUST004", "Alice Brown", "alice@email.com", "USA"),
    ("CUST005", "Charlie Wilson", "charlie@email.com", "UK"),  # No orders
]

customers = spark.createDataFrame(customers_data,
    ["customer_id", "customer_name", "email", "country"])

# Products (Dimension table)
products_data = [
    ("PROD001", "Laptop", "Electronics", 1200.0),
    ("PROD002", "Phone", "Electronics", 800.0),
    ("PROD003", "Shirt", "Clothing", 50.0),
    ("PROD004", "Tablet", "Electronics", 600.0),
    ("PROD005", "Shoes", "Clothing", 120.0),  # No orders
]

products = spark.createDataFrame(products_data,
    ["product_id", "product_name", "category", "price"])

print("📊 ORDERS:")
orders.show()

print("\n👥 CUSTOMERS:")
customers.show()

print("\n📦 PRODUCTS:")
products.show()

📊 ORDERS:
+--------+-----------+----------+----------+--------+------+
|order_id|customer_id|product_id|order_date|quantity|amount|
+--------+-----------+----------+----------+--------+------+
|  ORD001|    CUST001|   PROD001|2024-01-15|       2|1200.0|
|  ORD002|    CUST002|   PROD002|2024-01-16|       1| 800.0|
|  ORD003|    CUST001|   PROD003|2024-01-17|       3| 150.0|
|  ORD004|    CUST003|   PROD001|2024-01-18|       1|1200.0|
|  ORD005|    CUST004|   PROD004|2024-01-19|       2| 600.0|
|  ORD006|    CUST999|   PROD999|2024-01-20|       1| 500.0|
+--------+-----------+----------+----------+--------+------+


👥 CUSTOMERS:
+-----------+--------------+-----------------+-------+
|customer_id| customer_name|            email|country|
+-----------+--------------+-----------------+-------+
|    CUST001|      John Doe|   john@email.com|    USA|
|    CUST002|    Jane Smith|   jane@email.com|     UK|
|    CUST003|   Bob Johnson|    bob@email.com| Canada|
|    CUST004|   Alice Brown|  alice@email.com|    USA|
|    CUST005|Charlie Wilson|charlie@email.com|     UK|
+-----------+--------------+-----------------+-------+


📦 PRODUCTS:
+----------+------------+-----------+------+
|product_id|product_name|   category| price|
+----------+------------+-----------+------+
|   PROD001|      Laptop|Electronics|1200.0|
|   PROD002|       Phone|Electronics| 800.0|
|   PROD003|       Shirt|   Clothing|  50.0|
|   PROD004|      Tablet|Electronics| 600.0|
|   PROD005|       Shoes|   Clothing| 120.0|
+----------+------------+-----------+------+



---

## 🔗 **2. BASIC JOIN TYPES**

In [3]:
# 2.1 INNER JOIN (default)
print("🔹 INNER JOIN - Only matching rows:")
inner_join = orders.join(customers, "customer_id", "inner")
inner_join.show()
print(f"Rows: {inner_join.count()}")

# 2.2 LEFT JOIN (LEFT OUTER)
print("\n🔹 LEFT JOIN - All orders + matching customers:")
left_join = orders.join(customers, "customer_id", "left")
left_join.show()
print(f"Rows: {left_join.count()}")

# 2.3 RIGHT JOIN (RIGHT OUTER)
print("\n🔹 RIGHT JOIN - All customers + matching orders:")
right_join = orders.join(customers, "customer_id", "right")
right_join.show()
print(f"Rows: {right_join.count()}")

# 2.4 FULL OUTER JOIN
print("\n🔹 FULL OUTER JOIN - All orders + all customers:")
full_join = orders.join(customers, "customer_id", "full")
full_join.show()
print(f"Rows: {full_join.count()}")

# 2.5 LEFT SEMI JOIN (like IN / EXISTS: returns left columns only)
print("\n🔹 LEFT SEMI JOIN - Orders with existing customers:")
semi_join = orders.join(customers, "customer_id", "left_semi")
semi_join.show()
print(f"Rows: {semi_join.count()}")

# 2.6 LEFT ANTI JOIN (like NOT EXISTS; NOT IN behaves differently with NULL keys)
print("\n🔹 LEFT ANTI JOIN - Orders without customers:")
anti_join = orders.join(customers, "customer_id", "left_anti")
anti_join.show()
print(f"Rows: {anti_join.count()}")

# 2.7 CROSS JOIN (Cartesian product)
print("\n🔹 CROSS JOIN - All combinations:")
# Use small sample to avoid explosion
small_orders = orders.limit(2)
small_customers = customers.limit(2)
cross_join = small_orders.crossJoin(small_customers)
cross_join.show()
print(f"Rows: {cross_join.count()} (2 orders × 2 customers)")

🔹 INNER JOIN - Only matching rows:
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+
|customer_id|order_id|product_id|order_date|quantity|amount|customer_name|          email|country|
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+
|    CUST001|  ORD001|   PROD001|2024-01-15|       2|1200.0|     John Doe| john@email.com|    USA|
|    CUST001|  ORD003|   PROD003|2024-01-17|       3| 150.0|     John Doe| john@email.com|    USA|
|    CUST002|  ORD002|   PROD002|2024-01-16|       1| 800.0|   Jane Smith| jane@email.com|     UK|
|    CUST003|  ORD004|   PROD001|2024-01-18|       1|1200.0|  Bob Johnson|  bob@email.com| Canada|
|    CUST004|  ORD005|   PROD004|2024-01-19|       2| 600.0|  Alice Brown|alice@email.com|    USA|
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+

Rows: 5

🔹 LEFT JOIN - All orders + matching customers:
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+
|customer_id|order_id|product_id|order_date|quantity|amount|customer_name|          email|country|
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+
|    CUST001|  ORD001|   PROD001|2024-01-15|       2|1200.0|     John Doe| john@email.com|    USA|
|    CUST001|  ORD003|   PROD003|2024-01-17|       3| 150.0|     John Doe| john@email.com|    USA|
|    CUST002|  ORD002|   PROD002|2024-01-16|       1| 800.0|   Jane Smith| jane@email.com|     UK|
|    CUST004|  ORD005|   PROD004|2024-01-19|       2| 600.0|  Alice Brown|alice@email.com|    USA|
|    CUST999|  ORD006|   PROD999|2024-01-20|       1| 500.0|         NULL|           NULL|   NULL|
|    CUST003|  ORD004|   PROD001|2024-01-18|       1|1200.0|  Bob Johnson|  bob@email.com| Canada|
+-----------+--------+----------+----------+------
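
The captured output above is cut off after the left join, so the remaining row counts are not visible. As a sanity check, here is a minimal plain-Python model of the six keyed join types. It is a sketch of the semantics only, not Spark code, and is meant to run as a standalone script: `model_join` and `row_counts` are hypothetical names, and the keys are copied from the sample data in section 1.

```python
# Plain-Python model of join semantics (illustrative only, not Spark code).
# Keys are copied from the sample orders and customers above.
order_keys = [
    ("ORD001", "CUST001"), ("ORD002", "CUST002"), ("ORD003", "CUST001"),
    ("ORD004", "CUST003"), ("ORD005", "CUST004"), ("ORD006", "CUST999"),
]
customer_ids = ["CUST001", "CUST002", "CUST003", "CUST004", "CUST005"]


def model_join(how):
    """Return (order_id, customer_id) rows; None marks a NULL-padded order side."""
    known = set(customer_ids)
    ordered = {cust for _, cust in order_keys}
    # Inner: one output row per matching (order, customer) pair.
    inner = [(o, c) for o, c in order_keys for cid in customer_ids if c == cid]
    unmatched_orders = [(o, c) for o, c in order_keys if c not in known]
    unmatched_customers = [(None, cid) for cid in customer_ids if cid not in ordered]
    results = {
        "inner": inner,
        "left": inner + unmatched_orders,
        "right": inner + unmatched_customers,
        "full": inner + unmatched_orders + unmatched_customers,
        # Semi and anti joins only filter the left side.
        "left_semi": [(o, c) for o, c in order_keys if c in known],
        "left_anti": unmatched_orders,
    }
    return results[how]


row_counts = {
    how: len(model_join(how))
    for how in ("inner", "left", "right", "full", "left_semi", "left_anti")
}
print(row_counts)
```

In this model the counts come out as inner 5, left 6, right 6, full 7, left_semi 5 and left_anti 1. Inner and left_semi agree here only because `customer_id` is unique in the customers data; with duplicate customer rows an inner join would repeat orders, while a semi join would not.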

---

## 🎯 **3. JOIN CONDITIONS**

In [4]:
# 3.1 Single column join
print("🔹 Single column join:")
join1 = orders.join(customers, "customer_id")
join1.show()

# 3.2 Multiple columns join
print("\n🔹 Multiple columns join:")
# Create sample data with composite key
orders_comp = orders.withColumn("region", lit("US"))
customers_comp = customers.withColumn("region", lit("US"))

join2 = orders_comp.join(customers_comp, ["customer_id", "region"])
join2.show()

# 3.3 Join with different column names
print("\n🔹 Join with different column names:")
orders_renamed = orders.withColumnRenamed("customer_id", "cust_id")
join3 = orders_renamed.join(customers, orders_renamed.cust_id == customers.customer_id)
join3.show()

# 3.4 Complex join conditions
print("\n🔹 Complex join conditions:")
join4 = orders.join(
    customers,
    (orders.customer_id == customers.customer_id) & (orders.amount > 500)
)
join4.show()

# 3.5 Join with expressions
print("\n🔹 Join with expressions:")
join5 = orders.join(
    products,
    (orders.product_id == products.product_id) & (orders.amount == products.price * orders.quantity)
)
join5.show()

🔹 Single column join:
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+
|customer_id|order_id|product_id|order_date|quantity|amount|customer_name|          email|country|
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+
|    CUST001|  ORD001|   PROD001|2024-01-15|       2|1200.0|     John Doe| john@email.com|    USA|
|    CUST001|  ORD003|   PROD003|2024-01-17|       3| 150.0|     John Doe| john@email.com|    USA|
|    CUST002|  ORD002|   PROD002|2024-01-16|       1| 800.0|   Jane Smith| jane@email.com|     UK|
|    CUST003|  ORD004|   PROD001|2024-01-18|       1|1200.0|  Bob Johnson|  bob@email.com| Canada|
|    CUST004|  ORD005|   PROD004|2024-01-19|       2| 600.0|  Alice Brown|alice@email.com|    USA|
+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+


🔹 Multiple columns join:
+-----------+------+--------+----------+----------+---
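
The output for the expression join (3.5) is not visible above. One way to reason about it: a join condition is just a boolean predicate evaluated over pairs of rows. The sketch below models that in plain Python as a standalone script. `inner_join`, `same_product` and `amount_matches_price` are hypothetical helpers, the rows are copied from the sample data, and the nested loop is only a mental model (Spark normally plans equality conditions as keyed joins rather than looping over all pairs).

```python
# Plain-Python model: a join condition is a predicate over (left_row, right_row).
order_rows = [
    {"order_id": "ORD001", "product_id": "PROD001", "quantity": 2, "amount": 1200.0},
    {"order_id": "ORD002", "product_id": "PROD002", "quantity": 1, "amount": 800.0},
    {"order_id": "ORD003", "product_id": "PROD003", "quantity": 3, "amount": 150.0},
    {"order_id": "ORD004", "product_id": "PROD001", "quantity": 1, "amount": 1200.0},
    {"order_id": "ORD005", "product_id": "PROD004", "quantity": 2, "amount": 600.0},
    {"order_id": "ORD006", "product_id": "PROD999", "quantity": 1, "amount": 500.0},
]
product_rows = [
    {"product_id": "PROD001", "price": 1200.0},
    {"product_id": "PROD002", "price": 800.0},
    {"product_id": "PROD003", "price": 50.0},
    {"product_id": "PROD004", "price": 600.0},
    {"product_id": "PROD005", "price": 120.0},
]


def inner_join(left, right, condition):
    """Naive nested-loop inner join: keep every pair the condition accepts."""
    return [(lrow, rrow) for lrow in left for rrow in right if condition(lrow, rrow)]


def same_product(order, product):
    return order["product_id"] == product["product_id"]


def amount_matches_price(order, product):
    # Mirrors join5: key equality plus an expression over columns from both sides.
    return same_product(order, product) and (
        order["amount"] == product["price"] * order["quantity"]
    )


key_only_ids = [o["order_id"] for o, _ in inner_join(order_rows, product_rows, same_product)]
expression_ids = [
    o["order_id"] for o, _ in inner_join(order_rows, product_rows, amount_matches_price)
]
print(key_only_ids)
print(expression_ids)
```

In this model the key-only join keeps ORD001 to ORD005, while the extra expression narrows the result to ORD002, ORD003 and ORD004, the orders whose `amount` equals `price * quantity` in the sample data.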

---

## 📡 **4. BROADCAST JOINS**

In [5]:
# 4.1 Automatic broadcast (small side estimated below spark.sql.autoBroadcastJoinThreshold, 10MB here)
# Note: DataFrames built from local Python data carry no usable size estimate,
# so the plan for this example can still show SortMergeJoin instead of BroadcastHashJoin.
print("🔹 Automatic broadcast join:")
auto_broadcast = orders.join(customers, "customer_id")
auto_broadcast.explain()
auto_broadcast.show()

# 4.2 Explicit broadcast hint
print("\n🔹 Explicit broadcast join:")
broadcast_join = orders.join(broadcast(customers), "customer_id")
broadcast_join.explain()
broadcast_join.show()

# 4.3 Multiple broadcasts
print("\n🔹 Multiple broadcast joins:")
multi_broadcast = orders \
    .join(broadcast(customers), "customer_id") \
    .join(broadcast(products), "product_id")

multi_broadcast.show()

# 4.4 When to use broadcast
print("""
📝 BROADCAST JOIN GUIDELINES:

✅ USE BROADCAST WHEN:
- Small table (< 10MB by default, configurable via spark.sql.autoBroadcastJoinThreshold)
- Dimension tables (customers, products, categories)
- Lookup tables
- Reference data

❌ DON'T BROADCAST WHEN:
- Large tables (> 100MB)
- Both tables are large
- Memory is tight (the driver and every executor hold a full copy)

⚡ BENEFITS:
- No shuffle of the large table (faster)
- Less network I/O
- Better for small dimension tables
""")

🔹 Automatic broadcast join:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [customer_id#1, order_id#0, product_id#2, order_date#3, quantity#4L, amount#5, customer_name#13, email#14, country#15]
   +- SortMergeJoin [customer_id#1], [customer_id#12], Inner
      :- Sort [customer_id#1 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(customer_id#1, 200), ENSURE_REQUIREMENTS, [plan_id=2815]
      :     +- Filter isnotnull(customer_id#1)
      :        +- Scan ExistingRDD[order_id#0,customer_id#1,product_id#2,order_date#3,quantity#4L,amount#5]
      +- Sort [customer_id#12 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(customer_id#12, 200), ENSURE_REQUIREMENTS, [plan_id=2816]
            +- Filter isnotnull(customer_id#12)
               +- Scan ExistingRDD[customer_id#12,customer_name#13,email#14,country#15]


+-----------+--------+----------+----------+--------+------+-------------+---------------+-------+
|customer_id|order_id|prod
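
To make the "no shuffle" benefit concrete, here is a minimal plain-Python sketch of the idea behind a broadcast hash join: build a hash table from the small side once, hand a copy to every partition of the large side, and probe it locally. This is a conceptual model under stated assumptions, not Spark's implementation. `broadcast_hash_join` and the three-way partition layout are hypothetical, and the rows are copied from the sample data.

```python
# Conceptual model of a broadcast hash join (not Spark internals).
customer_rows = [
    ("CUST001", "John Doe"),
    ("CUST002", "Jane Smith"),
    ("CUST003", "Bob Johnson"),
    ("CUST004", "Alice Brown"),
    ("CUST005", "Charlie Wilson"),
]

# Hypothetical layout: the large (orders) side is already spread over 3 partitions.
order_partitions = [
    [("ORD001", "CUST001"), ("ORD002", "CUST002")],
    [("ORD003", "CUST001"), ("ORD004", "CUST003")],
    [("ORD005", "CUST004"), ("ORD006", "CUST999")],
]


def broadcast_hash_join(partitions, small_rows):
    """Inner join each partition against a lookup table built from the small side."""
    lookup = dict(small_rows)  # built once, then "broadcast" to every partition
    joined_partitions = []
    for partition in partitions:
        # Each partition probes its local copy; order rows never change partition.
        joined_partitions.append(
            [
                (order_id, cust_id, lookup[cust_id])
                for order_id, cust_id in partition
                if cust_id in lookup
            ]
        )
    return joined_partitions


joined = broadcast_hash_join(order_partitions, customer_rows)
for index, rows in enumerate(joined):
    print(index, rows)
```

The order rows stay in the partition they started in, which is why only the small table travels over the network. The cost is that every partition needs room for the whole lookup table, which is the memory constraint listed in the guidelines.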

---

## 🔄 **5. MULTIPLE JOINS**

In [6]:
# 5.1 Chain joins
print("🔹 Chain joins - Orders + Customers + Products:")
full_data = orders \
    .join(customers, "customer_id", "left") \
    .join(products, "product_id", "left")

full_data.show(truncate=False)

# 5.2 Select specific columns after joins
print("\n🔹 Select specific columns:")
result = orders \
    .join(customers, "customer_id") \
    .join(products, "product_id") \
    .select(
        orders.order_id,
        orders.order_date,
        customers.customer_name,
        customers.country,
        products.product_name,
        products.category,
        orders.quantity,
        orders.amount
    )

result.show(truncate=False)

# 5.3 Alias tables to avoid ambiguity
print("\n🔹 Using aliases:")
o = orders.alias("o")
c = customers.alias("c")
p = products.alias("p")

result_alias = o \
    .join(c, o.customer_id == c.customer_id) \
    .join(p, o.product_id == p.product_id) \
    .select(
        col("o.order_id"),
        col("o.order_date"),
        col("c.customer_name"),
        col("c.country"),
        col("p.product_name"),
        col("p.category"),
        col("o.quantity"),
        col("o.amount")
    )

result_alias.show(truncate=False)

🔹 Chain joins - Orders + Customers + Products:
+----------+-----------+--------+----------+--------+------+-------------+---------------+-------+------------+-----------+------+
|product_id|customer_id|order_id|order_date|quantity|amount|customer_name|email          |country|product_name|category   |price |
+----------+-----------+--------+----------+--------+------+-------------+---------------+-------+------------+-----------+------+
|PROD004   |CUST004    |ORD005  |2024-01-19|2       |600.0 |Alice Brown  |alice@email.com|USA    |Tablet      |Electronics|600.0 |
|PROD999   |CUST999    |ORD006  |2024-01-20|1       |500.0 |NULL         |NULL           |NULL   |NULL        |NULL       |NULL  |
|PROD001   |CUST003    |ORD004  |2024-01-18|1       |1200.0|Bob Johnson  |bob@email.com  |Canada |Laptop      |Electronics|1200.0|
|PROD001   |CUST001    |ORD001  |2024-01-15|2       |1200.0|John Doe     |john@email.com |USA    |Laptop      |Electronics|1200.0|
|PROD003   |CUST001    |ORD003  |

---

## 🎯 **6. REAL-WORLD PATTERNS**

In [7]:
# 6.1 Star schema join (fact + dimensions)
print("🔹 Star schema - Fact table + Dimension tables:")
star_schema = orders \
    .join(broadcast(customers), "customer_id", "left") \
    .join(broadcast(products), "product_id", "left") \
    .select(
        orders.order_id,
        orders.order_date,
        customers.customer_name,
        customers.country,
        products.product_name,
        products.category,
        orders.quantity,
        orders.amount,
        (orders.quantity * products.price).alias("calculated_amount")
    )

star_schema.show(truncate=False)

# 6.2 Find missing references
print("\n🔹 Find orders without customers:")
missing_customers = orders.join(customers, "customer_id", "left_anti")
missing_customers.show()

print("\n🔹 Find customers without orders:")
customers_no_orders = customers.join(orders, "customer_id", "left_anti")
customers_no_orders.show()

# 6.3 Aggregate after join
print("\n🔹 Revenue by country:")
revenue_by_country = orders \
    .join(customers, "customer_id") \
    .groupBy("country") \
    .agg(
        count("order_id").alias("num_orders"),
        sum("amount").alias("total_revenue"),
        avg("amount").alias("avg_order_value")
    ) \
    .orderBy(desc("total_revenue"))

revenue_by_country.show()

# 6.4 Revenue by category
print("\n🔹 Revenue by category:")
revenue_by_category = orders \
    .join(products, "product_id") \
    .groupBy("category") \
    .agg(
        count("order_id").alias("num_orders"),
        sum("amount").alias("total_revenue"),
        sum("quantity").alias("total_quantity")
    ) \
    .orderBy(desc("total_revenue"))

revenue_by_category.show()

# 6.5 Top customers by country
print("\n🔹 Top customers by country:")
from pyspark.sql.window import Window

customer_revenue = orders \
    .join(customers, "customer_id") \
    .groupBy("customer_id", "customer_name", "country") \
    .agg(sum("amount").alias("total_spent")) \
    .withColumn(
        "rank",
        row_number().over(Window.partitionBy("country").orderBy(desc("total_spent")))
    ) \
    .filter(col("rank") <= 2) \
    .orderBy("country", "rank")

customer_revenue.show()

🔹 Star schema - Fact table + Dimension tables:
+--------+----------+-------------+-------+------------+-----------+--------+------+-----------------+
|order_id|order_date|customer_name|country|product_name|category   |quantity|amount|calculated_amount|
+--------+----------+-------------+-------+------------+-----------+--------+------+-----------------+
|ORD001  |2024-01-15|John Doe     |USA    |Laptop      |Electronics|2       |1200.0|2400.0           |
|ORD002  |2024-01-16|Jane Smith   |UK     |Phone       |Electronics|1       |800.0 |800.0            |
|ORD003  |2024-01-17|John Doe     |USA    |Shirt       |Clothing   |3       |150.0 |150.0            |
|ORD004  |2024-01-18|Bob Johnson  |Canada |Laptop      |Electronics|1       |1200.0|1200.0           |
|ORD005  |2024-01-19|Alice Brown  |USA    |Tablet      |Electronics|2       |600.0 |1200.0           |
|ORD006  |2024-01-20|NULL         |NULL   |NULL        |NULL       |1       |500.0 |NULL             |
+--------+----------+--
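
The aggregate outputs for this cell are cut off above. As a cross-check for 6.3 (revenue by country), the sketch below recomputes the same join-then-aggregate logic in plain Python from the sample rows. It is a standalone model, not Spark code. `amounts_by_country` and `revenue_by_country_model` are hypothetical names, and `math.fsum` is used instead of the built-in `sum` because the notebook's star import replaces `sum` with the Spark column function.

```python
import math
from collections import defaultdict

# (order_id, customer_id, amount) copied from the sample orders.
order_rows = [
    ("ORD001", "CUST001", 1200.0),
    ("ORD002", "CUST002", 800.0),
    ("ORD003", "CUST001", 150.0),
    ("ORD004", "CUST003", 1200.0),
    ("ORD005", "CUST004", 600.0),
    ("ORD006", "CUST999", 500.0),
]
# customer_id -> country copied from the sample customers.
country_of = {
    "CUST001": "USA",
    "CUST002": "UK",
    "CUST003": "Canada",
    "CUST004": "USA",
    "CUST005": "UK",
}

amounts_by_country = defaultdict(list)
for _order_id, customer_id, amount in order_rows:
    country = country_of.get(customer_id)
    if country is not None:  # inner join: ORD006 has no customer and is dropped
        amounts_by_country[country].append(amount)

# (country, num_orders, total_revenue, avg_order_value), highest revenue first.
revenue_by_country_model = sorted(
    (
        (country, len(amounts), math.fsum(amounts), math.fsum(amounts) / len(amounts))
        for country, amounts in amounts_by_country.items()
    ),
    key=lambda row: row[2],
    reverse=True,
)
for row in revenue_by_country_model:
    print(row)
```

Because 6.3 uses an inner join, the 500.0 from ORD006 is missing from every total. Switching to a left join and grouping on a coalesced country would keep that revenue under an "Unknown" bucket.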

---

## ⚡ **7. JOIN PERFORMANCE OPTIMIZATION**

In [8]:
print("="*80)
print("⚡ JOIN PERFORMANCE OPTIMIZATION")
print("="*80)

print("""
1️⃣ BROADCAST HASH JOIN:
   - Best for: Small table (< 10MB by default) × Large table
   - The large table is not shuffled
   - Use: broadcast(small_df)

2️⃣ SHUFFLE HASH JOIN:
   - Best for: One side small enough to hash per partition, but too big to broadcast
   - Requires a shuffle of both sides, but no sort
   - Not the default: Spark prefers sort-merge unless hinted (SHUFFLE_HASH, Spark 3.0+)
     or spark.sql.join.preferSortMergeJoin is set to false

3️⃣ SORT MERGE JOIN:
   - Best for: Two large tables
   - Both sides are shuffled and sorted on the join key
   - Default for equi-joins that cannot be broadcast

4️⃣ OPTIMIZATION TIPS:

✅ DO:
- Filter before join (reduce data size)
- Use broadcast for small dimensions
- Partition (or bucket) on join keys
- Cache tables that are joined repeatedly
- Use appropriate join type
- Select only needed columns

❌ DON'T:
- Join large tables without filtering
- Use cross join on large tables
- Broadcast large tables
- Join on columns the data is not partitioned by when it can be avoided (forces a full shuffle)
- Select all columns (*) unnecessarily

5️⃣ DATA SKEW HANDLING:
- Salting: Add a random salt to skewed keys (and replicate the other side once per salt)
- Adaptive Query Execution (AQE): splits skewed partitions automatically
  (spark.sql.adaptive.skewJoin.enabled, Spark 3.0+)
- Repartition before join
- Skew join hints exist only in some vendor runtimes (e.g. Databricks), not in open-source Spark

6️⃣ MONITORING:
- Use .explain() to see join strategy
- Check Spark UI for shuffle size
- Monitor task duration
- Look for data skew
""")

# Example: Check join strategy
print("\n🔹 Check join strategy:")
orders.join(broadcast(customers), "customer_id").explain()

print("\n" + "="*80)
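
Salting is the least obvious item in the skew list, so here is a minimal plain-Python sketch of the idea, meant to run as a standalone script. Everything in it is an assumption for illustration: the `HOT`/`COLD` keys, `NUM_SALTS`, and the helpers `plain_join` and `salted_join` are hypothetical, and the salt is assigned round-robin to keep the example deterministic (real jobs usually draw it at random). The salt is stored as a separate key component here rather than as a string prefix; the effect on the join key is the same.

```python
from collections import Counter

NUM_SALTS = 4  # hypothetical salt count; tune to the observed skew

# Skewed fact rows: 90 of 100 rows share the key "HOT".
fact_rows = [("HOT", i) for i in range(90)] + [("COLD", i) for i in range(90, 100)]
dim_rows = [("HOT", "hot-dim"), ("COLD", "cold-dim")]


def plain_join(facts, dims):
    """Reference result: ordinary keyed inner join."""
    lookup = dict(dims)
    return sorted((key, value, lookup[key]) for key, value in facts)


def salted_join(facts, dims, num_salts):
    # Large side: attach a salt so one hot key becomes num_salts distinct join keys.
    salted_facts = [((key, i % num_salts), value) for i, (key, value) in enumerate(facts)]
    # Small side: replicate every row once per salt so each salted key still matches.
    salted_dims = {(key, salt): dim for key, dim in dims for salt in range(num_salts)}
    joined = sorted(
        (key, value, salted_dims[(key, salt)]) for (key, salt), value in salted_facts
    )
    rows_per_join_key = Counter(join_key for join_key, _ in salted_facts)
    return joined, rows_per_join_key


unsalted_rows_per_key = Counter(key for key, _ in fact_rows)
salted_result, salted_rows_per_key = salted_join(fact_rows, dim_rows, NUM_SALTS)

print(unsalted_rows_per_key.most_common(1))  # largest group before salting
print(salted_rows_per_key.most_common(1))    # largest group after salting
```

The joined rows are identical with and without the salt, but the largest group of rows sharing one join key drops from 90 to 23 in this toy data, so no single task has to process the whole hot key. The price is that the small side grows by a factor of `NUM_SALTS`.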


---

## 🎯 **8. ADVANCED JOIN PATTERNS**

In [9]:
# 8.1 Self join
print("🔹 Self join - Find customers in same country:")
c1 = customers.alias("c1")
c2 = customers.alias("c2")

same_country = c1.join(
    c2,
    (col("c1.country") == col("c2.country")) & (col("c1.customer_id") < col("c2.customer_id"))
).select(
    col("c1.customer_name").alias("customer_1"),
    col("c2.customer_name").alias("customer_2"),
    col("c1.country")
)

same_country.show()

# 8.2 Inequality join
print("\n🔹 Inequality join - Products cheaper than order amount:")
cheaper_products = orders.join(
    products,
    products.price < orders.amount
).select(
    orders.order_id,
    orders.amount.alias("order_amount"),
    products.product_name,
    products.price
)

cheaper_products.show()

# 8.3 Conditional join with coalesce
print("\n🔹 Handle nulls after left join:")
with_defaults = orders \
    .join(customers, "customer_id", "left") \
    .select(
        orders.order_id,
        coalesce(customers.customer_name, lit("Unknown Customer")).alias("customer_name"),
        coalesce(customers.country, lit("Unknown")).alias("country"),
        orders.amount
    )

with_defaults.show()

# 8.4 Union after joins
print("\n🔹 Union after joins:")
usa_orders = orders.join(customers, "customer_id").filter(col("country") == "USA")
uk_orders = orders.join(customers, "customer_id").filter(col("country") == "UK")

combined = usa_orders.union(uk_orders)
combined.show()

🔹 Self join - Find customers in same country:
+----------+--------------+-------+
|customer_1|    customer_2|country|
+----------+--------------+-------+
|Jane Smith|Charlie Wilson|     UK|
|  John Doe|   Alice Brown|    USA|
+----------+--------------+-------+


🔹 Inequality join - Products cheaper than order amount:
+--------+------------+------------+-----+
|order_id|order_amount|product_name|price|
+--------+------------+------------+-----+
|  ORD001|      1200.0|       Phone|800.0|
|  ORD001|      1200.0|       Shirt| 50.0|
|  ORD002|       800.0|       Shirt| 50.0|
|  ORD003|       150.0|       Shirt| 50.0|
|  ORD001|      1200.0|      Tablet|600.0|
|  ORD001|      1200.0|       Shoes|120.0|
|  ORD002|       800.0|      Tablet|600.0|
|  ORD002|       800.0|       Shoes|120.0|
|  ORD003|       150.0|       Shoes|120.0|
|  ORD004|      1200.0|       Phone|800.0|
|  ORD004|      1200.0|       Shirt| 50.0|
|  ORD005|       600.0|       Shirt| 50.0|
|  ORD006|       500.0|      
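
The `c1.customer_id < c2.customer_id` condition in the self join (8.1) is what keeps the result down to two rows. The plain-Python sketch below shows what that condition filters out. It is a standalone model rather than Spark code: `same_country_pairs` and its `dedupe` flag are hypothetical, and the rows are copied from the sample customers.

```python
# Plain-Python model of the self join in 8.1.
customer_rows = [
    ("CUST001", "John Doe", "USA"),
    ("CUST002", "Jane Smith", "UK"),
    ("CUST003", "Bob Johnson", "Canada"),
    ("CUST004", "Alice Brown", "USA"),
    ("CUST005", "Charlie Wilson", "UK"),
]


def same_country_pairs(rows, dedupe):
    """Pair customers that share a country; dedupe applies the id1 < id2 rule."""
    pairs = []
    for id1, name1, country1 in rows:
        for id2, name2, country2 in rows:
            if country1 != country2:
                continue
            if dedupe and not id1 < id2:
                continue  # drops self-pairs and the mirrored copy of each pair
            pairs.append((name1, name2, country1))
    return pairs


all_pairs = same_country_pairs(customer_rows, dedupe=False)
unique_pairs = same_country_pairs(customer_rows, dedupe=True)
print(len(all_pairs))
print(unique_pairs)
```

Without the inequality, every customer also pairs with itself and every real pair appears twice, giving 9 rows for this data. With it, only the two pairs shown in the notebook output remain (row order in Spark is not guaranteed).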

---

## 💾 **9. SAVE JOINED DATA**

In [None]:
# Create enriched dataset
enriched_orders = orders \
    .join(customers, "customer_id", "left") \
    .join(products, "product_id", "left") \
    .select(
        orders.order_id,
        orders.order_date,
        orders.customer_id,
        coalesce(customers.customer_name, lit("Unknown")).alias("customer_name"),
        coalesce(customers.country, lit("Unknown")).alias("country"),
        orders.product_id,
        coalesce(products.product_name, lit("Unknown")).alias("product_name"),
        coalesce(products.category, lit("Unknown")).alias("category"),
        orders.quantity,
        orders.amount
    )

print("✅ ENRICHED ORDERS:")
enriched_orders.show(truncate=False)

# Save to MinIO
output_path = "s3a://warehouse/enriched_orders/"

enriched_orders.write \
    .mode("overwrite") \
    .partitionBy("country") \
    .parquet(output_path)

print(f"\n✅ Enriched orders saved to: {output_path}")

# Verify
df_verify = spark.read.parquet(output_path)
print(f"\n✅ Verification: {df_verify.count()} rows loaded")
df_verify.show(5)

---

## 🎓 **KEY TAKEAWAYS**

### **✅ What You Learned:**

1. **Join Types** - inner, left, right, full, semi, anti, cross
2. **Join Conditions** - single, multiple, complex
3. **Broadcast Joins** - Optimize small table joins
4. **Multiple Joins** - Chain joins, star schema
5. **Performance** - Strategies, optimization, skew handling
6. **Real Patterns** - Fact/dimension, aggregations, top-N

### **📊 Join Types Summary:**

| Join Type | Returns | Use Case |
|-----------|---------|----------|
| **INNER** | Matching rows only | Standard join |
| **LEFT** | All left + matching right | Keep all orders |
| **RIGHT** | All right + matching left | Keep all customers |
| **FULL** | All rows from both | Complete picture |
| **SEMI** | Left rows with match | Filter (like IN / EXISTS) |
| **ANTI** | Left rows without match | Filter (like NOT EXISTS) |
| **CROSS** | Cartesian product | All combinations |
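
A note on the ANTI row: an anti join follows `NOT EXISTS` semantics, which differ from `NOT IN` as soon as the right side contains a NULL key. The sketch below demonstrates the difference with SQLite through Python's built-in `sqlite3` module, chosen only because it runs anywhere; the three-valued NULL logic is standard SQL, and Spark's `left_anti` matches the `NOT EXISTS` result as far as I know. The two-row tables and the NULL customer row are hypothetical additions for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id TEXT, customer_id TEXT);
    CREATE TABLE customers (customer_id TEXT);
    INSERT INTO orders VALUES ('ORD001', 'CUST001'), ('ORD006', 'CUST999');
    -- Hypothetical dirty data: one customer row with a NULL key.
    INSERT INTO customers VALUES ('CUST001'), (NULL);
    """
)

# NOT IN: comparing against a NULL yields UNKNOWN, so no row can pass the filter.
not_in_rows = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE customer_id NOT IN (SELECT customer_id FROM customers)"
).fetchall()

# NOT EXISTS: a NULL key simply never matches, so the orphan order is returned.
not_exists_rows = conn.execute(
    "SELECT order_id FROM orders o "
    "WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.customer_id = o.customer_id)"
).fetchall()
conn.close()

print(not_in_rows)
print(not_exists_rows)
```

With the NULL present, `NOT IN` returns no rows at all, while `NOT EXISTS` still reports ORD006 as an order without a customer. That is why an anti join is the safer tool for finding missing references.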

### **⚡ Performance Checklist:**

```python
# ✅ Good: filter early and broadcast the small side
df.filter(...).join(broadcast(small_df), "key")

# ❌ Bad: join two large tables first, filter afterwards
df.join(large_df, "key").filter(...)
```

### **🚀 Next:** Day 4 - Performance Optimization

---

In [None]:
spark.stop()
print("✅ Spark session stopped")
print("\n🎉 DAY 3 COMPLETED! 🎉")
print("\n📚 You've mastered:")
print("  ✅ Advanced Transformations")
print("  ✅ Aggregations")
print("  ✅ Window Functions")
print("  ✅ Joins")
print("\n🚀 Next: Day 4 - Performance Optimization")