# 🚀 PARTITIONING & BUCKETING - IN PRACTICE

---

## 📋 **DAY 4 - LESSON 1: PARTITIONING & BUCKETING**

### **🎯 OBJECTIVES:**

1. **Understand partitioning** - When to use it and why
2. **Understand bucketing** - How it differs from partitioning
3. **Combine the two** - Practical best practices
4. **Realistic data** - 1,000,000 e-commerce order records
5. **Compare performance** - With concrete measurements

---

## 📊 **IN PRODUCTION PRACTICE:**

### **1. PARTITIONING - Used in about 90% of cases:**
- ✅ **When:** Queries usually filter on a specific column
- ✅ **Examples:** `WHERE date = '2024-01-15'`, `WHERE country = 'USA'`
- ✅ **Benefit:** Partition pruning → less data is read
- ✅ **Use cases:** Log data, transaction data, time-series
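To make partition pruning concrete, here is a minimal pure-Python sketch (not Spark's actual implementation). It assumes Hive-style directory names such as `country=USA/year=2024/month=1`, and the helper names `parse_partition` and `prune` are hypothetical. The point is that the filter is evaluated against directory names only, so directories that do not match never need to be opened.

```python
def parse_partition(path):
    """Turn 'country=USA/year=2024/month=1' into {'country': 'USA', ...}."""
    return dict(part.split("=", 1) for part in path.strip("/").split("/"))


def prune(paths, **wanted):
    """Keep only the partition paths whose key=value pairs match the filter."""
    kept = []
    for path in paths:
        values = parse_partition(path)
        if all(values.get(key) == str(value) for key, value in wanted.items()):
            kept.append(path)
    return kept


# Hypothetical layout: 7 countries x 3 months = 21 partition directories.
countries = ["USA", "UK", "Germany", "France", "Canada", "Japan", "Australia"]
all_paths = [
    f"country={c}/year=2024/month={m}" for c in countries for m in (1, 2, 3)
]

selected = prune(all_paths, country="USA", year=2024, month=1)
print(f"Directories scanned: {len(selected)} of {len(all_paths)}")
print(selected)
```

A filter on a column that is not part of the directory layout cannot be answered this way, which is why the choice of partition columns should follow the most common query filters.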

### **2. BUCKETING - Used in about 10% of cases:**
- ✅ **When:** Two large tables are joined frequently
- ✅ **Example:** `orders JOIN customers ON customer_id`
- ✅ **Benefit:** No shuffle at join time → often quoted as 10-100x faster on large tables; the actual gain depends on data size and on whether Spark would otherwise broadcast the smaller side
- ✅ **Use cases:** Fact-dimension joins, large table joins
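The reason bucketing can avoid a shuffle is that both tables place a given key in the same bucket number. The sketch below illustrates the idea in plain Python under stated assumptions: `zlib.crc32` is only a stand-in for Spark's own hash function, and `bucket_of`, `bucketize`, and the tiny tables are hypothetical.

```python
import zlib


def bucket_of(key, num_buckets=20):
    """Deterministic bucket id; crc32 stands in for Spark's own hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, key_index, num_buckets=20):
    """Group rows into num_buckets lists by the hash of their join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key_index], num_buckets)].append(row)
    return buckets


# Tiny hypothetical tables: (order_id, customer_id) and (customer_id, tier).
orders = [("ORD1", "CUST00001"), ("ORD2", "CUST00002"), ("ORD3", "CUST00001")]
customers = [("CUST00001", "Gold"), ("CUST00002", "Silver")]

order_buckets = bucketize(orders, key_index=1)
customer_buckets = bucketize(customers, key_index=0)

# Join bucket i of orders only with bucket i of customers: no cross-bucket
# data movement is needed because equal keys always hash to the same bucket.
joined = []
for o_bucket, c_bucket in zip(order_buckets, customer_buckets):
    tiers = dict(c_bucket)
    for order_id, customer_id in o_bucket:
        if customer_id in tiers:
            joined.append((order_id, customer_id, tiers[customer_id]))

print(sorted(joined))
```

This only works when both sides use the same key, the same hash, and the same number of buckets, which is why the lesson buckets both tables by `customer_id` with 20 buckets.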

### **3. COMBINING BOTH - Best Practice:**
```python
# Partition by date (frequent filter columns)
# Bucket by customer_id (frequent join key)
df.write \
    .partitionBy("year", "month", "day") \
    .bucketBy(20, "customer_id") \
    .sortBy("customer_id") \
    .saveAsTable("orders")
```
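Combining `partitionBy` with `bucketBy` multiplies the number of output files, so it is worth estimating that number before writing. The sketch below is a rough back-of-the-envelope calculation, assuming each write task may emit one file per (partition directory, bucket) pair it holds rows for. The function name `max_output_files` and the task count of 4 are hypothetical.

```python
def max_output_files(num_partition_dirs, num_buckets, num_write_tasks=1):
    """Upper bound on files: every task may write every (directory, bucket) pair."""
    return num_partition_dirs * num_buckets * num_write_tasks


total_rows = 1_000_000   # rows in the orders dataset used in this lesson
partition_dirs = 90      # about one year/month/day directory per day in Q1
buckets = 20             # as in bucketBy(20, "customer_id")

files = max_output_files(partition_dirs, buckets)
print(f"{files:,} files if every day has rows in every bucket (single task)")
print(f"Average rows per file: {total_rows / files:,.0f}")

# With 4 write tasks that each hold rows for every pair, the bound grows 4x.
print(f"Upper bound with 4 tasks: {max_output_files(partition_dirs, buckets, 4):,}")
```

A few hundred rows per Parquet file is very small, so for a dataset of this size a coarser partition (for example by month) or fewer buckets would probably be more reasonable. Note also that `bucketBy` is only supported together with `saveAsTable`, which is why the snippet above writes a table rather than a plain path.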

---

## 🔧 **SETUP**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import random
from datetime import datetime, timedelta
import time
import builtins

spark = SparkSession.builder \
    .appName("PartitioningBucketing") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .enableHiveSupport() \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Default Parallelism: {spark.sparkContext.defaultParallelism}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/11 09:04:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session Created
Spark Version: 3.5.1
Default Parallelism: 2
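
The session above hard-codes the MinIO credentials, which is fine for a local sandbox but is best avoided elsewhere. One possible alternative, sketched below, reads them from environment variables. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` and the helper `s3a_conf` are assumptions for illustration, not names that Spark or MinIO define.

```python
import os


def s3a_conf(env=None):
    """Build the S3A settings used above from environment variables."""
    env = os.environ if env is None else env
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Demo with an explicit mapping so the sketch runs without real credentials.
demo = s3a_conf({"MINIO_ACCESS_KEY": "example-key", "MINIO_SECRET_KEY": "example-secret"})
for key in sorted(demo):
    # Only the setting names are printed; secrets should never be logged.
    print(key)
```

Each pair could then be applied to the builder with `.config(key, value)` in a loop before calling `getOrCreate()`.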


---

## 📊 **1. GENERATE REALISTIC DATA - 1,000,000 E-COMMERCE ORDERS**

In [2]:
print("🔹 Generating realistic e-commerce data...")

# Realistic data distributions
countries = [
    ("USA", 0.40),      # 40% orders from USA
    ("UK", 0.20),       # 20% from UK
    ("Germany", 0.15),  # 15% from Germany
    ("France", 0.10),   # 10% from France
    ("Canada", 0.08),   # 8% from Canada
    ("Japan", 0.05),    # 5% from Japan
    ("Australia", 0.02) # 2% from Australia
]

categories = [
    ("Electronics", 0.35),  # 35% Electronics
    ("Clothing", 0.25),     # 25% Clothing
    ("Books", 0.15),        # 15% Books
    ("Home", 0.15),         # 15% Home
    ("Sports", 0.10)        # 10% Sports
]

products = {
    "Electronics": [("Laptop", 1200), ("Phone", 800), ("Tablet", 600), ("Headphones", 150), ("Camera", 900)],
    "Clothing": [("Shirt", 50), ("Pants", 80), ("Jacket", 150), ("Shoes", 120), ("Hat", 30)],
    "Books": [("Novel", 20), ("Textbook", 60), ("Comic", 15), ("Magazine", 10), ("Cookbook", 35)],
    "Home": [("Lamp", 45), ("Chair", 200), ("Table", 350), ("Bed", 800), ("Sofa", 1200)],
    "Sports": [("Ball", 25), ("Racket", 80), ("Bike", 500), ("Weights", 150), ("Mat", 40)]
}

channels = [("Online", 0.70), ("Store", 0.30)]  # 70% online, 30% store

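# Helper function for weighted random choice.
# `builtins.sum` is used explicitly because `from pyspark.sql.functions import *`
# shadows Python's built-in `sum` with the Spark column function.
def weighted_choice(choices):
    total = builtins.sum(w for c, w in choices)
    r = random.uniform(0, total)
    upto = 0
    for c, w in choices:
        if upto + w >= r:
            return c
        upto += w
    return choices[-1][0]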

# Generate 1,000,000 orders over about 90 days (Q1 2024)
start_date = datetime(2024, 1, 1)
num_orders = 1000000

data = []
customer_id_pool = [f"CUST{i:05d}" for i in range(1, 2001)]  # 2000 customers

for i in range(num_orders):
    # Realistic date distribution (more recent orders)
    days_offset = int(random.triangular(0, 90, 75))  # Skewed towards recent
    order_date = start_date + timedelta(days=days_offset)
    
    # Select country, category, channel
    country = weighted_choice(countries)
    category = weighted_choice(categories)
    channel = weighted_choice(channels)
    
    # Select product and price
    product, base_price = random.choice(products[category])
    
    # Realistic quantity (1-5, mostly 1-2)
    quantity = random.choices([1, 2, 3, 4, 5], weights=[50, 30, 12, 5, 3])[0]
    
    # Price with some variation (±10%); builtins.round avoids the shadowed Spark `round`
    price = builtins.round(base_price * random.uniform(0.9, 1.1), 2)
    amount = builtins.round(price * quantity, 2)
    
    # Customer ID (some customers order multiple times)
    customer_id = random.choice(customer_id_pool)
    
    # Order status
    status = random.choices(
        ["completed", "pending", "cancelled", "returned"],
        weights=[80, 10, 7, 3]
    )[0]
    
    data.append((
        f"ORD{i+1:06d}",
        customer_id,
        order_date.strftime("%Y-%m-%d"),
        country,
        category,
        product,
        quantity,
        price,
        amount,
        channel,
        status
    ))

# Create DataFrame
schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("customer_id", StringType(), False),
    StructField("order_date", StringType(), False),
    StructField("country", StringType(), False),
    StructField("category", StringType(), False),
    StructField("product", StringType(), False),
    StructField("quantity", IntegerType(), False),
    StructField("price", DoubleType(), False),
    StructField("amount", DoubleType(), False),
    StructField("channel", StringType(), False),
    StructField("status", StringType(), False)
])

df = spark.createDataFrame(data, schema) \
    .withColumn("order_date", to_date(col("order_date"))) \
    .withColumn("year", year(col("order_date"))) \
    .withColumn("month", month(col("order_date"))) \
    .withColumn("day", dayofmonth(col("order_date")))

print(f"\n✅ Generated {df.count():,} orders")
print(f"Current partitions: {df.rdd.getNumPartitions()}")

# Show sample
print("\n📊 SAMPLE DATA:")
df.show(10, truncate=False)

# Show statistics
print("\n📈 DATA STATISTICS:")
df.groupBy("country").count().orderBy(desc("count")).show()
df.groupBy("category").count().orderBy(desc("count")).show()
df.groupBy("year", "month").count().orderBy("year", "month").show()

🔹 Generating realistic e-commerce data...

26/01/11 09:05:21 WARN TaskSetManager: Stage 0 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.

✅ Generated 1,000,000 orders
Current partitions: 4

📊 SAMPLE DATA:

26/01/11 09:05:26 WARN TaskSetManager: Stage 3 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.

+---------+-----------+----------+-------+-----------+----------+--------+-------+-------+-------+---------+----+-----+---+
|order_id |customer_id|order_date|country|category   |product   |quantity|price  |amount |channel|status   |year|month|day|
+---------+-----------+----------+-------+-----------+----------+--------+-------+-------+-------+---------+----+-----+---+
|ORD000001|CUST01239  |2024-02-03|Germany|Sports     |Ball      |1       |22.62  |22.62  |Online |completed|2024|2    |3  |
|ORD000002|CUST00285  |2024-03-06|USA    |Clothing   |Pants     |3       |74.18  |222.54 |Store  |completed|2024|3    |6  |
|ORD000003|CUST00835  |2024-03-03|UK     |Electronics|Headphones|1       |162.67 |162.67 |Store  |completed|2024|3    |3  |
|ORD000004|CUST00980  |2024-03-06|Canada |Home       |Sofa      |2       |1303.56|2607.12|Online |completed|2024|3    |6  |
|ORD000005|CUST00663  |2024-03-01|USA    |Electronics|Headphones|1       |143.95 |143.95 |Online |completed|2024|3    |1  |
|ORD0000

26/01/11 09:05:28 WARN TaskSetManager: Stage 4 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

+---------+------+
|  country| count|
+---------+------+
|      USA|399528|
|       UK|199510|
|  Germany|150386|
|   France|100203|
|   Canada| 80198|
|    Japan| 50070|
|Australia| 20105|
+---------+------+



26/01/11 09:05:31 WARN TaskSetManager: Stage 7 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.
26/01/11 09:05:32 WARN TaskSetManager: Stage 10 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.


+-----------+------+
|   category| count|
+-----------+------+
|Electronics|349980|
|   Clothing|250461|
|      Books|149850|
|       Home|149614|
|     Sports|100095|
+-----------+------+




+----+-----+------+
|year|month| count|
+----+-----+------+
|2024|    1|142112|
|2024|    2|391549|
|2024|    3|466339|
+----+-----+------+
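
The hand-written `weighted_choice` helper in the cell above could also be expressed with the standard library's `random.choices`, which accepts weights directly and can draw many samples in one call. A minimal sketch using the same country weights follows; the seed and the sample size are arbitrary choices for the demo.

```python
import random
from collections import Counter

countries = [
    ("USA", 0.40), ("UK", 0.20), ("Germany", 0.15), ("France", 0.10),
    ("Canada", 0.08), ("Japan", 0.05), ("Australia", 0.02),
]

names = [name for name, _ in countries]
weights = [weight for _, weight in countries]

rng = random.Random(42)  # fixed seed so the demo is repeatable
sample = rng.choices(names, weights=weights, k=100_000)

# Compare the observed share of each country with its target weight.
shares = {name: count / len(sample) for name, count in Counter(sample).items()}
for name, weight in countries:
    print(f"{name:<10} target={weight:.2f} observed={shares[name]:.3f}")
```

The observed shares should land close to the targets, in the same way that the country counts printed above track the 40/20/15/10/8/5/2 split.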



                                                                                

---

## 🗂️ **2. PARTITIONING - IN PRACTICE**

### **Question: When should you use partitioning?**

**✅ Use it when:**
1. Queries usually filter on a specific column (date, country, category)
2. The column has low cardinality (< 1000 unique values)
3. The data is distributed fairly evenly across values

**❌ Do NOT use it when:**
1. The column has high cardinality (user_id, order_id)
2. The data is skewed (one value holds 90% of the rows)
3. Queries do not filter on that column
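
The checklist above can be turned into a quick sanity check on a candidate column. The sketch below is plain Python with the thresholds taken from the list (fewer than 1000 distinct values, no single value holding 90% of the rows); the function name `partition_column_report` is hypothetical, and the country counts are the ones printed in the statistics earlier in this lesson.

```python
from collections import Counter


def partition_column_report(values, max_cardinality=1000, max_top_share=0.9):
    """Check a candidate partition column against the rules of thumb above."""
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    top_share = top_count / len(values)
    return {
        "cardinality": len(counts),
        "top_value": top_value,
        "top_share": top_share,
        "suitable": len(counts) < max_cardinality and top_share < max_top_share,
    }


# Country counts taken from the statistics printed earlier in this lesson.
country_counts = {
    "USA": 399528, "UK": 199510, "Germany": 150386, "France": 100203,
    "Canada": 80198, "Japan": 50070, "Australia": 20105,
}
country_values = [c for c, n in country_counts.items() for _ in range(n)]
print(partition_column_report(country_values))

# A unique id per row has cardinality equal to the row count: a poor choice.
order_ids = [f"ORD{i:06d}" for i in range(1, 5001)]
print(partition_column_report(order_ids))
```

On a real DataFrame the same numbers could be obtained with a `groupBy(column).count()`, as the statistics cell already does for `country`.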

In [3]:
# 2.1 Write WITHOUT partitioning
print("🔹 Scenario 1: NO PARTITIONING")
path_no_partition = "s3a://warehouse/orders_no_partition/"

start = time.time()
df.write.mode("overwrite").parquet(path_no_partition)
write_time_no_part = time.time() - start

print(f"✅ Write time: {write_time_no_part:.2f}s")
print(f"✅ Saved to: {path_no_partition}")

# 2.2 Write WITH date partitioning (BEST PRACTICE)
print("\n🔹 Scenario 2: PARTITION BY DATE (year/month)")
path_date_partition = "s3a://warehouse/orders_by_date/"

start = time.time()
df.repartition("year", "month") \
    .write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(path_date_partition)
write_time_date = time.time() - start

print(f"✅ Write time: {write_time_date:.2f}s")
print(f"✅ Saved to: {path_date_partition}")

# 2.3 Write WITH country partitioning
print("\n🔹 Scenario 3: PARTITION BY COUNTRY")
path_country_partition = "s3a://warehouse/orders_by_country/"

start = time.time()
df.repartition("country") \
    .write.mode("overwrite") \
    .partitionBy("country") \
    .parquet(path_country_partition)
write_time_country = time.time() - start

print(f"✅ Write time: {write_time_country:.2f}s")
print(f"✅ Saved to: {path_country_partition}")

# 2.4 Write WITH multi-level partitioning (BEST FOR PRODUCTION)
print("\n🔹 Scenario 4: MULTI-LEVEL PARTITION (country/year/month)")
path_multi_partition = "s3a://warehouse/orders_multi_partition/"

start = time.time()
df.repartition("country", "year", "month") \
    .write.mode("overwrite") \
    .partitionBy("country", "year", "month") \
    .parquet(path_multi_partition)
write_time_multi = time.time() - start

print(f"✅ Write time: {write_time_multi:.2f}s")
print(f"✅ Saved to: {path_multi_partition}")

print("""
📝 PARTITION STRUCTURE:

orders_multi_partition/
├── country=USA/
│   ├── year=2024/
│   │   ├── month=1/
│   │   │   ├── part-00000.parquet (400 orders)
│   │   │   └── part-00001.parquet (450 orders)
│   │   ├── month=2/
│   │   │   └── part-00000.parquet (380 orders)
│   │   └── month=3/
│   │       └── part-00000.parquet (420 orders)
├── country=UK/
│   └── year=2024/
│       ├── month=1/
│       ├── month=2/
│       └── month=3/
└── ...
""")

🔹 Scenario 1: NO PARTITIONING

26/01/11 09:05:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
26/01/11 09:05:36 WARN TaskSetManager: Stage 13 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.

✅ Write time: 9.85s
✅ Saved to: s3a://warehouse/orders_no_partition/

🔹 Scenario 2: PARTITION BY DATE (year/month)

26/01/11 09:05:44 WARN TaskSetManager: Stage 14 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.

✅ Write time: 4.25s
✅ Saved to: s3a://warehouse/orders_by_date/

🔹 Scenario 3: PARTITION BY COUNTRY

26/01/11 09:05:48 WARN TaskSetManager: Stage 17 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.

✅ Write time: 3.99s
✅ Saved to: s3a://warehouse/orders_by_country/

🔹 Scenario 4: MULTI-LEVEL PARTITION (country/year/month)

26/01/11 09:05:52 WARN TaskSetManager: Stage 20 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.

✅ Write time: 4.52s
✅ Saved to: s3a://warehouse/orders_multi_partition/

📝 PARTITION STRUCTURE:

orders_multi_partition/
├── country=USA/
│   ├── year=2024/
│   │   ├── month=1/
│   │   │   ├── part-00000.parquet (400 orders)
│   │   │   └── part-00001.parquet (450 orders)
│   │   ├── month=2/
│   │   │   └── part-00000.parquet (380 orders)
│   │   └── month=3/
│   │       └── part-00000.parquet (420 orders)
├── country=UK/
│   └── year=2024/
│       ├── month=1/
│       ├── month=2/
│       └── month=3/
└── ...
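
Each write above is timed once, and the first write probably also pays one-off start-up costs (the S3A metrics warning appears only there), so single numbers such as 9.85s versus 4.25s are hard to compare. A slightly more robust approach is to repeat the measurement and report the median. In the sketch below, `time_runs` is a hypothetical helper and the workload is a placeholder standing in for a Spark action.

```python
import statistics
import time


def time_runs(action, repeats=5):
    """Run `action` several times and return (median_seconds, all_timings)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Placeholder workload; in the notebook this would be something like
# lambda: df.write.mode("overwrite").parquet(path_no_partition)
median_s, runs = time_runs(lambda: sum(range(100_000)), repeats=5)
print(f"median={median_s:.4f}s over {len(runs)} runs")
```

`time.perf_counter` is used instead of `time.time` because it is intended for measuring short durations.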



---

## ⚡ **3. PARTITION PRUNING - PERFORMANCE COMPARISON**

In [4]:
print("="*80)
print("⚡ PERFORMANCE COMPARISON: Partition Pruning")
print("="*80)

# Query: Find USA orders in January 2024
filter_condition = (col("country") == "USA") & (col("year") == 2024) & (col("month") == 1)

# 3.1 Query WITHOUT partitioning (FULL SCAN)
print("\n🔹 Test 1: NO PARTITIONING (Full Scan)")
start = time.time()
df_no_part = spark.read.parquet(path_no_partition).filter(filter_condition)
count_no_part = df_no_part.count()
time_no_part = time.time() - start

print(f"Results: {count_no_part:,} orders")
print(f"Time: {time_no_part:.2f}s")
print("Explain:")
df_no_part.explain()

# 3.2 Query WITH date partitioning
print("\n🔹 Test 2: DATE PARTITIONING (Partition Pruning)")
start = time.time()
df_date_part = spark.read.parquet(path_date_partition).filter(filter_condition)
count_date_part = df_date_part.count()
time_date_part = time.time() - start

print(f"Results: {count_date_part:,} orders")
print(f"Time: {time_date_part:.2f}s")
print(f"Speedup: {time_no_part/time_date_part:.2f}x faster")
print("Explain:")
df_date_part.explain()

# 3.3 Query WITH multi-level partitioning (BEST)
print("\n🔹 Test 3: MULTI-LEVEL PARTITIONING (Best Pruning)")
start = time.time()
df_multi_part = spark.read.parquet(path_multi_partition).filter(filter_condition)
count_multi_part = df_multi_part.count()
time_multi_part = time.time() - start

print(f"Results: {count_multi_part:,} orders")
print(f"Time: {time_multi_part:.2f}s")
print(f"Speedup: {time_no_part/time_multi_part:.2f}x faster than no partition")
print("Explain:")
df_multi_part.explain()

# Summary
print("\n" + "="*80)
print("📊 PERFORMANCE SUMMARY")
print("="*80)
print(f"No Partitioning:      {time_no_part:.2f}s (baseline)")
print(f"Date Partitioning:    {time_date_part:.2f}s ({time_no_part/time_date_part:.1f}x faster)")
print(f"Multi-level Partition: {time_multi_part:.2f}s ({time_no_part/time_multi_part:.1f}x faster)")
print("\n💡 Multi-level partitioning reads ONLY relevant partitions!")
print("   - Reads: country=USA/year=2024/month=1/ only")
print("   - Skips: All other countries, years, months")

⚡ PERFORMANCE COMPARISON: Partition Pruning

🔹 Test 1: NO PARTITIONING (Full Scan)
Results: 56,729 orders
Time: 1.17s
Explain:
== Physical Plan ==
*(1) Filter (((((isnotnull(country#303) AND isnotnull(year#311)) AND isnotnull(month#312)) AND (country#303 = USA)) AND (year#311 = 2024)) AND (month#312 = 1))
+- *(1) ColumnarToRow
   +- FileScan parquet [order_id#300,customer_id#301,order_date#302,country#303,category#304,product#305,quantity#306,price#307,amount#308,channel#309,status#310,year#311,month#312,day#313] Batched: true, DataFilters: [isnotnull(country#303), isnotnull(year#311), isnotnull(month#312), (country#303 = USA), (year#31..., Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3a://warehouse/orders_no_partition], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(year), IsNotNull(month), EqualTo(country,USA), EqualTo(year,2024),..., ReadSchema: struct<order_id:string,customer_id:string,order_date:date,country:string,category:string,product:...
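
The plan for Test 1 shows `PartitionFilters: []`, meaning no directory-level pruning happened and the predicates were only pushed down as data filters. A small helper can pull that field out of the `explain()` text so the three tests can be compared at a glance. This is a sketch: `partition_filters` is a hypothetical helper, the regular expression assumes the field contains no nested square brackets, and the second plan string is a made-up illustration of what a pruned scan might look like rather than output from this notebook.

```python
import re


def partition_filters(plan_text):
    """Return the entries of the PartitionFilters field from explain() text."""
    match = re.search(r"PartitionFilters: \[(.*?)\]", plan_text)
    if match is None:
        return None  # field not present in this plan text
    body = match.group(1).strip()
    return [item.strip() for item in body.split(", ")] if body else []


# Fragment copied from the Test 1 plan above.
unpartitioned_plan = (
    "Location: InMemoryFileIndex(1 paths)[s3a://warehouse/orders_no_partition], "
    "PartitionFilters: [], PushedFilters: [IsNotNull(country)]"
)

# Illustrative only: not taken from this notebook's output.
pruned_plan = (
    "PartitionFilters: [isnotnull(country#1), (country#1 = USA)], "
    "PushedFilters: []"
)

print(partition_filters(unpartitioned_plan))
print(partition_filters(pruned_plan))
```

An empty list indicates a full scan of every directory, while a non-empty list indicates that Spark can skip directories before reading any file.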




---

## 🪣 **4. BUCKETING - IN PRACTICE**

### **Question: When should you use bucketing?**

**✅ Use it when:**
1. Two large tables are joined frequently on the same column
2. Both tables are large (> 1GB)
3. The join key has high cardinality (customer_id, product_id)

**❌ Do NOT use it when:**
1. One table is small (< 100MB) → use a broadcast join instead
2. The tables are not joined frequently
3. The schema changes frequently
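
The checklist above can be written down as a small decision helper. This is only a sketch of the lesson's rules of thumb: the function name `suggest_join_layout` is hypothetical, and the 100MB and 1GB thresholds come from the list, not from Spark itself.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def suggest_join_layout(left_bytes, right_bytes, joined_often,
                        broadcast_limit=100 * MB, large_table=1 * GB):
    """Encode the checklist above; thresholds are the lesson's rules of thumb."""
    smaller = min(left_bytes, right_bytes)
    if smaller < broadcast_limit:
        return "broadcast join (no bucketing needed)"
    if joined_often and smaller > large_table:
        return "bucket both tables on the join key"
    return "plain shuffle join"


print(suggest_join_layout(50 * GB, 20 * MB, joined_often=True))
print(suggest_join_layout(50 * GB, 5 * GB, joined_often=True))
print(suggest_join_layout(50 * GB, 5 * GB, joined_often=False))
```

Keep in mind that Spark's default `spark.sql.autoBroadcastJoinThreshold` is 10MB, so a table close to 100MB is only broadcast automatically if that setting is raised or a broadcast hint is used.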

In [5]:
# 4.1 Create customers table (dimension)
print("🔹 Creating customers table...")

customer_data = []
for i in range(1, 2001):  # 2000 customers
    customer_data.append((
        f"CUST{i:05d}",
        f"Customer {i}",
        f"customer{i}@email.com",
        random.choice([c for c, _ in countries]),
        random.choice(["Gold", "Silver", "Bronze"])
    ))

customers = spark.createDataFrame(customer_data,
    ["customer_id", "customer_name", "email", "country", "tier"])

print(f"✅ Created {customers.count():,} customers")
customers.show(5)

# 4.2 Write orders WITHOUT bucketing
print("\n🔹 Scenario 1: NO BUCKETING")
path_orders_no_bucket = "s3a://warehouse/orders_no_bucket/"
path_customers_no_bucket = "s3a://warehouse/customers_no_bucket/"

df.write.mode("overwrite").parquet(path_orders_no_bucket)
customers.write.mode("overwrite").parquet(path_customers_no_bucket)
print("✅ Saved without bucketing")

# 4.3 Write orders WITH bucketing
print("\n🔹 Scenario 2: WITH BUCKETING")

# Orders table (bucketed by customer_id)
df.write.mode("overwrite") \
    .bucketBy(20, "customer_id") \
    .sortBy("customer_id") \
    .option("path", "s3a://warehouse/orders_bucketed/") \
    .saveAsTable("orders_bucketed")

# Customers table (bucketed by customer_id)
customers.write.mode("overwrite") \
    .bucketBy(20, "customer_id") \
    .sortBy("customer_id") \
    .option("path", "s3a://warehouse/customers_bucketed/") \
    .saveAsTable("customers_bucketed")

print("✅ Saved with bucketing (20 buckets)")

print("""
📝 BUCKETING STRUCTURE:

orders_bucketed/
├── part-00000-bucket-00.parquet (customers with hash % 20 == 0)
├── part-00001-bucket-01.parquet (customers with hash % 20 == 1)
├── part-00002-bucket-02.parquet (customers with hash % 20 == 2)
├── ...
└── part-00019-bucket-19.parquet (customers with hash % 20 == 19)

customers_bucketed/
├── part-00000-bucket-00.parquet (same customers as orders bucket 0)
├── part-00001-bucket-01.parquet (same customers as orders bucket 1)
├── ...
└── part-00019-bucket-19.parquet (same customers as orders bucket 19)

💡 Key point: Customers in bucket 0 of orders will ALWAYS be in bucket 0 of customers!
   → No shuffle needed for join!
""")

🔹 Creating customers table...
✅ Created 2,000 customers
+-----------+-------------+-------------------+-------+------+
|customer_id|customer_name|              email|country|  tier|
+-----------+-------------+-------------------+-------+------+
|  CUST00001|   Customer 1|customer1@email.com|Germany|  Gold|
|  CUST00002|   Customer 2|customer2@email.com|     UK|  Gold|
|  CUST00003|   Customer 3|customer3@email.com|     UK|  Gold|
|  CUST00004|   Customer 4|customer4@email.com|Germany|  Gold|
|  CUST00005|   Customer 5|customer5@email.com| France|Silver|
+-----------+-------------+-------------------+-------+------+
only showing top 5 rows


🔹 Scenario 1: NO BUCKETING

26/01/11 09:05:59 WARN TaskSetManager: Stage 39 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.

✅ Saved without bucketing

🔹 Scenario 2: WITH BUCKETING

26/01/11 09:06:03 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
26/01/11 09:06:03 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
26/01/11 09:06:05 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
26/01/11 09:06:05 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore UNKNOWN@172.18.0.11
26/01/11 09:06:06 WARN HadoopFSUtils: The directory s3a://warehouse/orders_bucketed was not found. Was it deleted very recently?
26/01/11 09:06:06 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
26/01/11 09:06:07 WARN TaskSetManager: Stage 41 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.
26/01/11 09:06:17 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.

✅ Saved with bucketing (20 buckets)

📝 BUCKETING STRUCTURE:

orders_bucketed/
├── part-00000-bucket-00.parquet (customers with hash % 20 == 0)
├── part-00001-bucket-01.parquet (customers with hash % 20 == 1)
├── part-00002-bucket-02.parquet (customers with hash % 20 == 2)
├── ...
└── part-00019-bucket-19.parquet (customers with hash % 20 == 19)

customers_bucketed/
├── part-00000-bucket-00.parquet (same customers as orders bucket 0)
├── part-00001-bucket-01.parquet (same customers as orders bucket 1)
├── ...
└── part-00019-bucket-19.parquet (same customers as orders bucket 19)

💡 Key point: Customers in bucket 0 of orders will ALWAYS be in bucket 0 of customers!
   → No shuffle needed for join!



---

## ⚡ **5. BUCKETING PERFORMANCE - JOIN COMPARISON**

In [6]:
print("="*80)
print("⚡ PERFORMANCE COMPARISON: Complex Queries with Bucketing")
print("="*80)

# Paths
path_orders_no_bucket = "s3a://warehouse/orders_no_bucket/"
path_customers_no_bucket = "s3a://warehouse/customers_no_bucket/"

# =============================================================================
# TEST 1: SIMPLE JOIN (Baseline)
# =============================================================================
print("\n" + "="*80)
print("🔹 TEST 1: SIMPLE JOIN")
print("="*80)

print("\n📊 Scenario 1A: WITHOUT BUCKETING (Shuffle Join)")
orders_no_bucket = spark.read.parquet(path_orders_no_bucket)
customers_no_bucket = spark.read.parquet(path_customers_no_bucket)

start = time.time()
result_simple_no_bucket = orders_no_bucket.join(customers_no_bucket, "customer_id")
count_simple_no_bucket = result_simple_no_bucket.count()
time_simple_no_bucket = time.time() - start

print(f"✅ Results: {count_simple_no_bucket:,} rows")
print(f"✅ Time: {time_simple_no_bucket:.2f}s")
print("\n📋 Execution Plan:")
result_simple_no_bucket.explain()

print("\n📊 Scenario 1B: WITH BUCKETING (No Shuffle)")
orders_bucketed = spark.table("orders_bucketed")
customers_bucketed = spark.table("customers_bucketed")

start = time.time()
result_simple_bucketed = orders_bucketed.join(customers_bucketed, "customer_id")
count_simple_bucketed = result_simple_bucketed.count()
time_simple_bucketed = time.time() - start

print(f"✅ Results: {count_simple_bucketed:,} rows")
print(f"✅ Time: {time_simple_bucketed:.2f}s")
print(f"🚀 Speedup: {time_simple_no_bucket/time_simple_bucketed:.2f}x faster")
print("\n📋 Execution Plan:")
result_simple_bucketed.explain()

# =============================================================================
# TEST 2: JOIN + AGGREGATION (More Complex)
# =============================================================================
print("\n" + "="*80)
print("🔹 TEST 2: JOIN + AGGREGATION")
print("Query: Total sales by customer tier")
print("="*80)

print("\n📊 Scenario 2A: WITHOUT BUCKETING")
start = time.time()
result_agg_no_bucket = orders_no_bucket \
    .join(customers_no_bucket, "customer_id") \
    .groupBy("tier") \
    .agg(
        count("order_id").alias("total_orders"),
        sum("amount").alias("total_sales"),
        avg("amount").alias("avg_order_value"),
        countDistinct("customer_id").alias("unique_customers")
    ) \
    .orderBy(desc("total_sales"))

result_agg_no_bucket.show()
time_agg_no_bucket = time.time() - start

print(f"✅ Time: {time_agg_no_bucket:.2f}s")
print("\n📋 Execution Plan:")
result_agg_no_bucket.explain()

print("\n📊 Scenario 2B: WITH BUCKETING")
start = time.time()
result_agg_bucketed = orders_bucketed \
    .join(customers_bucketed, "customer_id") \
    .groupBy("tier") \
    .agg(
        count("order_id").alias("total_orders"),
        sum("amount").alias("total_sales"),
        avg("amount").alias("avg_order_value"),
        countDistinct("customer_id").alias("unique_customers")
    ) \
    .orderBy(desc("total_sales"))

result_agg_bucketed.show()
time_agg_bucketed = time.time() - start

print(f"✅ Time: {time_agg_bucketed:.2f}s")
print(f"🚀 Speedup: {time_agg_no_bucket/time_agg_bucketed:.2f}x faster")

# =============================================================================
# TEST 3: JOIN + FILTER + WINDOW FUNCTION (Very Complex)
# =============================================================================
print("\n" + "="*80)
print("🔹 TEST 3: JOIN + FILTER + WINDOW FUNCTION")
print("Query: Top 3 orders per customer tier with running total")
print("="*80)

from pyspark.sql.window import Window

print("\n📊 Scenario 3A: WITHOUT BUCKETING")
start = time.time()

window_spec = Window.partitionBy("tier").orderBy(desc("amount"))

result_window_no_bucket = orders_no_bucket \
    .join(customers_no_bucket, "customer_id") \
    .filter(col("status") == "completed") \
    .withColumn("rank", row_number().over(window_spec)) \
    .withColumn("running_total", sum("amount").over(
        window_spec.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )) \
    .filter(col("rank") <= 3) \
    .select("tier", "order_id", "customer_name", "amount", "rank", "running_total") \
    .orderBy("tier", "rank")

result_window_no_bucket.show(20, truncate=False)
time_window_no_bucket = time.time() - start

print(f"✅ Time: {time_window_no_bucket:.2f}s")
print("\n📋 Execution Plan:")
result_window_no_bucket.explain()

print("\n📊 Scenario 3B: WITH BUCKETING")
start = time.time()

result_window_bucketed = orders_bucketed \
    .join(customers_bucketed, "customer_id") \
    .filter(col("status") == "completed") \
    .withColumn("rank", row_number().over(window_spec)) \
    .withColumn("running_total", sum("amount").over(
        window_spec.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )) \
    .filter(col("rank") <= 3) \
    .select("tier", "order_id", "customer_name", "amount", "rank", "running_total") \
    .orderBy("tier", "rank")

result_window_bucketed.show(20, truncate=False)
time_window_bucketed = time.time() - start

print(f"✅ Time: {time_window_bucketed:.2f}s")
print(f"🚀 Speedup: {time_window_no_bucket/time_window_bucketed:.2f}x faster")

# =============================================================================
# TEST 4: MULTIPLE JOINS (Most Complex)
# =============================================================================
print("\n" + "="*80)
print("🔹 TEST 4: MULTIPLE JOINS")
print("Query: Orders with customer info and self-join for repeat customers")
print("="*80)

print("\n📊 Scenario 4A: WITHOUT BUCKETING")
start = time.time()

# Self-join to find repeat customers
repeat_customers_no_bucket = orders_no_bucket \
    .groupBy("customer_id") \
    .agg(count("order_id").alias("order_count")) \
    .filter(col("order_count") > 1)

result_multi_no_bucket = orders_no_bucket \
    .join(customers_no_bucket, "customer_id") \
    .join(repeat_customers_no_bucket, "customer_id") \
    .select(
        "customer_id",
        "customer_name",
        "tier",
        "order_count",
        "order_id",
        "amount"
    ) \
    .orderBy(desc("order_count"), "customer_id")

count_multi_no_bucket = result_multi_no_bucket.count()
time_multi_no_bucket = time.time() - start

print(f"✅ Results: {count_multi_no_bucket:,} rows (repeat customers)")
print(f"✅ Time: {time_multi_no_bucket:.2f}s")
result_multi_no_bucket.show(10, truncate=False)
print("\n📋 Execution Plan:")
result_multi_no_bucket.explain()

print("\n📊 Scenario 4B: WITH BUCKETING")
start = time.time()

# Self-join to find repeat customers
repeat_customers_bucketed = orders_bucketed \
    .groupBy("customer_id") \
    .agg(count("order_id").alias("order_count")) \
    .filter(col("order_count") > 1)

result_multi_bucketed = orders_bucketed \
    .join(customers_bucketed, "customer_id") \
    .join(repeat_customers_bucketed, "customer_id") \
    .select(
        "customer_id",
        "customer_name",
        "tier",
        "order_count",
        "order_id",
        "amount"
    ) \
    .orderBy(desc("order_count"), "customer_id")

count_multi_bucketed = result_multi_bucketed.count()
time_multi_bucketed = time.time() - start

print(f"✅ Results: {count_multi_bucketed:,} rows (repeat customers)")
print(f"✅ Time: {time_multi_bucketed:.2f}s")
print(f"🚀 Speedup: {time_multi_no_bucket/time_multi_bucketed:.2f}x faster")
result_multi_bucketed.show(10, truncate=False)
print("\n📋 Execution Plan:")
result_multi_bucketed.explain()

# =============================================================================
# SUMMARY
# =============================================================================
print("\n" + "="*80)
print("📊 COMPREHENSIVE PERFORMANCE SUMMARY")
print("="*80)

summary_data = [
    ("Simple Join", time_simple_no_bucket, time_simple_bucketed, 
     time_simple_no_bucket/time_simple_bucketed),
    ("Join + Aggregation", time_agg_no_bucket, time_agg_bucketed,
     time_agg_no_bucket/time_agg_bucketed),
    ("Join + Filter + Window", time_window_no_bucket, time_window_bucketed,
     time_window_no_bucket/time_window_bucketed),
    ("Multiple Joins", time_multi_no_bucket, time_multi_bucketed,
     time_multi_no_bucket/time_multi_bucketed)
]

summary_df = spark.createDataFrame(summary_data,
    ["Query Type", "Without Bucketing (s)", "With Bucketing (s)", "Speedup (x)"])

summary_df.show(truncate=False)

print("""
💡 KEY INSIGHTS (expected on tables too large to broadcast):

1. SIMPLE JOIN:
   - Bucketing can eliminate the shuffle
   - Expected speedup: 2-5x (depends on data size)

2. JOIN + AGGREGATION:
   - Bucketing helps the join, but groupBy still shuffles
   - Expected speedup: 1.5-3x (less than simple join)

3. JOIN + WINDOW:
   - Bucketing helps the join, the window function adds overhead
   - Expected speedup: 1.5-2.5x (complex operations)

4. MULTIPLE JOINS:
   - Bucketing helps most when several joins share the same key
   - Expected speedup: 3-10x (best case scenario)

🎯 CONCLUSION:
   - Bucketing is MOST effective for JOIN-heavy workloads on large tables
   - Complex transformations (window, aggregation) reduce the speedup
   - Multiple joins benefit the most from bucketing
   - If Spark broadcasts the small side (BroadcastHashJoin in the plan),
     the unbucketed join has no shuffle either and bucketing may not help
""")

⚡ PERFORMANCE COMPARISON: Complex Queries with Bucketing

🔹 TEST 1: SIMPLE JOIN

📊 Scenario 1A: WITHOUT BUCKETING (Shuffle Join)
✅ Results: 1,000,000 rows
✅ Time: 1.03s

📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [customer_id#562, order_id#561, order_date#563, country#564, category#565, product#566, quantity#567, price#568, amount#569, channel#570, status#571, year#572, month#573, day#574, customer_name#590, email#591, country#592, tier#593]
   +- BroadcastHashJoin [customer_id#562], [customer_id#589], Inner, BuildRight, false
      :- Filter isnotnull(customer_id#562)
      :  +- FileScan parquet [order_id#561,customer_id#562,order_date#563,country#564,category#565,product#566,quantity#567,price#568,amount#569,channel#570,status#571,year#572,month#573,day#574] Batched: true, DataFilters: [isnotnull(customer_id#562)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3a://warehouse/orders_no_bucket], PartitionFilters: [], Pu

                                                                                

✅ Results: 1,000,000 rows
✅ Time: 2.24s
🚀 Speedup: 0.46x faster

📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [customer_id#643, order_id#642, order_date#644, country#645, category#646, product#647, quantity#648, price#649, amount#650, channel#651, status#652, year#653, month#654, day#655, customer_name#671, email#672, country#673, tier#674]
   +- BroadcastHashJoin [customer_id#643], [customer_id#670], Inner, BuildRight, false
      :- Filter isnotnull(customer_id#643)
      :  +- FileScan parquet spark_catalog.default.orders_bucketed[order_id#642,customer_id#643,order_date#644,country#645,category#646,product#647,quantity#648,price#649,amount#650,channel#651,status#652,year#653,month#654,day#655] Batched: true, Bucketed: false (disabled by query planner), DataFilters: [isnotnull(customer_id#643)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[s3a://warehouse/orders_bucketed], PartitionFilters: [], PushedFilters: [IsNotNull(custo

                                                                                

+------+------------+--------------------+-----------------+----------------+
|  tier|total_orders|         total_sales|  avg_order_value|unique_customers|
+------+------------+--------------------+-----------------+----------------+
|  Gold|      335250|2.2730479550000015E8|  678.01579567487|             671|
|Silver|      334029|2.2582160076000017E8|676.0538778369548|             668|
|Bronze|      330721|2.2469461636000004E8|679.4083724952453|             661|
+------+------------+--------------------+-----------------+----------------+

✅ Time: 1.55s

📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [total_sales#762 DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(total_sales#762 DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [plan_id=1384]
      +- HashAggregate(keys=[tier#593], functions=[count(order_id#561), sum(amount#569), avg(amount#569), count(distinct customer_id#562)])
         +- Exchange hashpartitioning(tier#593, 200), ENSURE

                                                                                

+------+------------+--------------------+-----------------+----------------+
|  tier|total_orders|         total_sales|  avg_order_value|unique_customers|
+------+------------+--------------------+-----------------+----------------+
|  Gold|      335250|2.2730479549999997E8|678.0157956748694|             671|
|Silver|      334029|2.2582160076000002E8|676.0538778369544|             668|
|Bronze|      330721|2.2469461635999998E8|679.4083724952452|             661|
+------+------------+--------------------+-----------------+----------------+

✅ Time: 2.57s
🚀 Speedup: 0.60x faster

🔹 TEST 3: JOIN + FILTER + WINDOW FUNCTION
Query: Top 3 orders per customer tier with running total

📊 Scenario 3A: WITHOUT BUCKETING


                                                                                

+------+---------+-------------+-------+----+-------------+
|tier  |order_id |customer_name|amount |rank|running_total|
+------+---------+-------------+-------+----+-------------+
|Bronze|ORD452283|Customer 232 |6597.55|1   |6597.55      |
|Bronze|ORD641034|Customer 514 |6596.2 |2   |13193.75     |
|Bronze|ORD641577|Customer 1875|6595.75|3   |19789.5      |
|Gold  |ORD147344|Customer 1486|6598.75|1   |6598.75      |
|Gold  |ORD430966|Customer 185 |6598.6 |2   |13197.35     |
|Gold  |ORD995627|Customer 98  |6598.3 |3   |19795.65     |
|Silver|ORD279088|Customer 1434|6599.15|1   |6599.15      |
|Silver|ORD709309|Customer 696 |6597.9 |2   |13197.05     |
|Silver|ORD851611|Customer 1154|6597.25|3   |19794.3      |
+------+---------+-------------+-------+----+-------------+

✅ Time: 2.59s

📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [tier#593 ASC NULLS FIRST, rank#1007 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(tier#593 ASC NULLS FI

                                                                                

+------+---------+-------------+-------+----+-------------+
|tier  |order_id |customer_name|amount |rank|running_total|
+------+---------+-------------+-------+----+-------------+
|Bronze|ORD452283|Customer 232 |6597.55|1   |6597.55      |
|Bronze|ORD641034|Customer 514 |6596.2 |2   |13193.75     |
|Bronze|ORD641577|Customer 1875|6595.75|3   |19789.5      |
|Gold  |ORD147344|Customer 1486|6598.75|1   |6598.75      |
|Gold  |ORD430966|Customer 185 |6598.6 |2   |13197.35     |
|Gold  |ORD995627|Customer 98  |6598.3 |3   |19795.65     |
|Silver|ORD279088|Customer 1434|6599.15|1   |6599.15      |
|Silver|ORD709309|Customer 696 |6597.9 |2   |13197.05     |
|Silver|ORD851611|Customer 1154|6597.25|3   |19794.3      |
+------+---------+-------------+-------+----+-------------+

✅ Time: 3.27s
🚀 Speedup: 0.79x faster

🔹 TEST 4: MULTIPLE JOINS
Query: Orders with customer info and self-join for repeat customers

📊 Scenario 4A: WITHOUT BUCKETING
✅ Results: 1,000,000 rows (repeat custom

                                                                                

✅ Results: 1,000,000 rows (repeat customers)
✅ Time: 2.74s
🚀 Speedup: 0.44x faster


                                                                                

+-----------+-------------+----+-----------+---------+-------+
|customer_id|customer_name|tier|order_count|order_id |amount |
+-----------+-------------+----+-----------+---------+-------+
|CUST01202  |Customer 1202|Gold|576        |ORD503067|115.22 |
|CUST01202  |Customer 1202|Gold|576        |ORD751145|57.46  |
|CUST01202  |Customer 1202|Gold|576        |ORD518016|544.76 |
|CUST01202  |Customer 1202|Gold|576        |ORD770335|73.13  |
|CUST01202  |Customer 1202|Gold|576        |ORD508687|277.66 |
|CUST01202  |Customer 1202|Gold|576        |ORD752640|31.21  |
|CUST01202  |Customer 1202|Gold|576        |ORD510849|146.03 |
|CUST01202  |Customer 1202|Gold|576        |ORD755664|1087.12|
|CUST01202  |Customer 1202|Gold|576        |ORD511206|957.78 |
|CUST01202  |Customer 1202|Gold|576        |ORD756682|866.61 |
+-----------+-------------+----+-----------+---------+-------+
only showing top 10 rows


📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [order
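
The plan fragments above help explain the sub-1x results: both the unbucketed and the bucketed simple join run as `BroadcastHashJoin`, and the bucketed scan reports `Bucketed: false (disabled by query planner)`, so bucketing was never exercised. As a rough sketch, a small helper can flag these markers in plan text. `summarize_plan` is a hypothetical helper, the marker strings are taken from the output above (plus `SortMergeJoin`, the operator name Spark prints for sort-merge joins), and `sample_plan` is abbreviated from that output rather than a full plan.

```python
def summarize_plan(plan_text: str) -> dict:
    """Flag join strategy and shuffle markers found in a Spark physical plan string."""
    return {
        "broadcast_join": "BroadcastHashJoin" in plan_text,
        "sort_merge_join": "SortMergeJoin" in plan_text,
        "bucketing_disabled": "Bucketed: false" in plan_text,
        # Each hash-partitioned Exchange is one shuffle
        "hash_shuffles": plan_text.count("Exchange hashpartitioning"),
    }


# Abbreviated from the bucketed simple-join plan printed above (not a full plan)
sample_plan = """
+- BroadcastHashJoin [customer_id#643], [customer_id#670], Inner, BuildRight, false
   :- Filter isnotnull(customer_id#643)
   :  +- FileScan parquet spark_catalog.default.orders_bucketed[...] Batched: true, Bucketed: false (disabled by query planner)
"""

summary = summarize_plan(sample_plan)
print(summary)
```

With a live session, the plan text could be captured from what `result_multi_bucketed.explain()` prints, for example with `contextlib.redirect_stdout`; that capture step is an assumption and is not part of the notebook.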

---

## 🎯 **6. COMBINING PARTITIONING + BUCKETING (BEST PRACTICE)**

### **In Production Practice:**
- ✅ **Partition** by the columns you filter on most often (date, country)
- ✅ **Bucket** by the columns you join on most often (customer_id, product_id)
- ✅ Combine both for maximum optimization!
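
Why bucketing can remove the join shuffle: each row is assigned to a bucket by hashing the bucket column modulo the bucket count, so two tables bucketed on the same key with the same number of buckets keep matching keys in same-numbered buckets. The pure-Python sketch below illustrates the idea only. `bucket_of` and `bucketize` are hypothetical helpers, the sample rows are made up, and `zlib.crc32` is a stand-in hash (Spark uses its own hash function, so real bucket numbers will differ).

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 20  # same count as bucketBy(20, "customer_id") in this lesson


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stand-in bucket assignment: hash the key, then take it modulo the bucket count."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, key_index):
    """Group rows by the bucket of the value at key_index."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets


# Made-up sample rows: (order_id, customer_id) and (customer_id, customer_name)
orders = [("ORD1", "CUST00001"), ("ORD2", "CUST00002"), ("ORD3", "CUST00001")]
customers = [("CUST00001", "Customer 1"), ("CUST00002", "Customer 2")]

order_buckets = bucketize(orders, key_index=1)
customer_buckets = bucketize(customers, key_index=0)

# Join bucket by bucket: bucket N of orders only ever needs bucket N of customers
joined = []
for bucket_id, order_rows in order_buckets.items():
    names = dict(customer_buckets.get(bucket_id, []))
    for order_id, customer_id in order_rows:
        joined.append((order_id, customer_id, names[customer_id]))

joined.sort()
print(joined)
```

Because both sides use the same hash and the same bucket count, every lookup in `names` succeeds without consulting any other bucket, which is the property a bucketed join relies on.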

In [7]:
print("🔹 BEST PRACTICE: Partition + Bucketing")

# Write orders with BOTH partitioning AND bucketing
df.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .bucketBy(20, "customer_id") \
    .sortBy("customer_id") \
    .option("path", "s3a://warehouse/orders_optimized/") \
    .saveAsTable("orders_optimized")

print("✅ Saved with partition + bucketing")

print("""
📝 OPTIMIZED STRUCTURE:

orders_optimized/
├── year=2024/
│   ├── month=1/
│   │   ├── part-00000-bucket-00.parquet
│   │   ├── part-00001-bucket-01.parquet
│   │   ├── ...
│   │   └── part-00019-bucket-19.parquet
│   ├── month=2/
│   │   ├── part-00000-bucket-00.parquet
│   │   └── ...
│   └── month=3/
│       └── ...

💡 BENEFITS:
1. Query filter by date → Partition pruning (read only relevant months)
2. Join on customer_id → No shuffle (bucketing)
3. Best of both worlds!

EXAMPLE QUERY:
SELECT o.*, c.customer_name
FROM orders_optimized o
JOIN customers_bucketed c ON o.customer_id = c.customer_id
WHERE o.year = 2024 AND o.month = 1

→ Only reads year=2024/month=1/ partition (partition pruning)
→ No shuffle in join (bucketing)
→ SUPER FAST! ⚡
""")

# Test the optimized query
print("\n🔹 Test optimized query:")

# ✅ FIX: Use aliases to avoid ambiguous reference
orders_alias = spark.table("orders_optimized").alias("o")
customers_alias = spark.table("customers_bucketed").alias("c")

start = time.time()
result_optimized = orders_alias \
    .filter((col("o.year") == 2024) & (col("o.month") == 1)) \
    .join(customers_alias, "customer_id") \
    .select(
        col("o.order_id"),
        col("c.customer_name"),
        col("o.amount"),
        col("o.country").alias("order_country"),
        col("c.tier").alias("customer_tier"),
        col("o.status")
    )

count_optimized = result_optimized.count()
time_optimized = time.time() - start

print(f"Results: {count_optimized:,} rows")
print(f"Time: {time_optimized:.2f}s")
result_optimized.show(10, truncate=False)

print("""
💡 NOTE: Ambiguous Reference Fix
- Both tables have 'country' column
- Solution: Use table aliases (o.country, c.country)
- Best practice: Always use aliases in joins!
""")

🔹 BEST PRACTICE: Partition + Bucketing


26/01/11 09:06:46 WARN TaskSetManager: Stage 94 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

✅ Saved with partition + bucketing

🔹 Test optimized query:


                                                                                

Results: 142,112 rows
Time: 1.87s




+---------+-------------+------+-------------+-------------+---------+
|order_id |customer_name|amount|order_country|customer_tier|status   |
+---------+-------------+------+-------------+-------------+---------+
|ORD500797|Customer 2   |549.31|Canada       |Gold         |completed|
|ORD526698|Customer 2   |29.86 |USA          |Gold         |returned |
|ORD532090|Customer 2   |87.62 |USA          |Gold         |completed|
|ORD532866|Customer 2   |155.71|USA          |Gold         |completed|
|ORD543840|Customer 2   |128.83|USA          |Gold         |completed|
|ORD558522|Customer 2   |60.56 |USA          |Gold         |completed|
|ORD559461|Customer 2   |128.49|USA          |Gold         |completed|
|ORD592262|Customer 2   |98.88 |USA          |Gold         |cancelled|
|ORD597011|Customer 2   |490.66|USA          |Gold         |returned |
|ORD600595|Customer 2   |117.53|USA          |Gold         |completed|
+---------+-------------+------+-------------+-------------+---------+
only showing top 10 rows
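
The partition pruning used by the query above works on directory names: with a Hive-style layout such as `year=2024/month=1/`, directories whose key values fail the filter can be skipped before any file is opened. The sketch below mimics that selection in plain Python. `parse_partition` and `prune` are hypothetical helpers, and the path list (including the `year=2023` entry) is made up for illustration.

```python
def parse_partition(path: str) -> dict:
    """Extract key=value segments from a Hive-style partition path."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune(paths, **wanted):
    """Keep only the paths whose partition values match every requested key."""
    wanted = {key: str(value) for key, value in wanted.items()}
    return [
        path for path in paths
        if all(parse_partition(path).get(key) == value for key, value in wanted.items())
    ]


# Illustrative directory listing (not read from the real warehouse)
paths = [
    "orders_optimized/year=2023/month=12/",
    "orders_optimized/year=2024/month=1/",
    "orders_optimized/year=2024/month=2/",
    "orders_optimized/year=2024/month=3/",
]

selected = prune(paths, year=2024, month=1)
print(selected)
```

A filter on a non-partition column such as `country` cannot be answered this way; it still requires reading the files inside the selected directories.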

                                                                                

In [8]:
print("="*80)
print("⚡ ULTIMATE TEST: PARTITION + BUCKETING COMBINED")
print("="*80)

# =============================================================================
# SETUP: Create all 4 scenarios
# =============================================================================
print("\n🔹 Creating test scenarios...")

# Scenario 1: No optimization
path_no_opt = "s3a://warehouse/orders_no_optimization/"
df.write.mode("overwrite").parquet(path_no_opt)
customers.write.mode("overwrite").parquet("s3a://warehouse/customers_no_opt/")

# Scenario 2: Partition only
path_partition_only = "s3a://warehouse/orders_partition_only/"
df.repartition("year", "month") \
    .write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(path_partition_only)

# Scenario 3: Bucketing only (already created)
# orders_bucketed, customers_bucketed

# Scenario 4: Partition + Bucketing (already created)
# orders_optimized

print("✅ All scenarios ready")

# =============================================================================
# TEST QUERY: Filter by date + Join + Aggregation
# =============================================================================
print("\n" + "="*80)
print("🔹 TEST QUERY:")
print("Find total sales by customer tier for USA orders in January 2024")
print("="*80)

filter_date = (col("year") == 2024) & (col("month") == 1)
filter_country = col("country") == "USA"

# -----------------------------------------------------------------------------
# Scenario 1: NO OPTIMIZATION
# -----------------------------------------------------------------------------
print("\n📊 Scenario 1: NO OPTIMIZATION (Baseline)")
print("   - No partitioning → Full scan")
print("   - No bucketing → Full shuffle in join")

start = time.time()
result_no_opt = spark.read.parquet(path_no_opt) \
    .filter(filter_date & filter_country) \
    .join(spark.read.parquet("s3a://warehouse/customers_no_opt/"), "customer_id") \
    .groupBy("tier") \
    .agg(
        count("order_id").alias("total_orders"),
        sum("amount").alias("total_sales"),
        avg("amount").alias("avg_order_value")
    ) \
    .orderBy(desc("total_sales"))

result_no_opt.show()
time_no_opt = time.time() - start

print(f"✅ Time: {time_no_opt:.2f}s (baseline)")
print("\n📋 Execution Plan:")
result_no_opt.explain()

# -----------------------------------------------------------------------------
# Scenario 2: PARTITION ONLY
# -----------------------------------------------------------------------------
print("\n📊 Scenario 2: PARTITION ONLY")
print("   - Partitioning → Partition pruning (fast filter)")
print("   - No bucketing → Full shuffle in join")

start = time.time()
result_partition_only = spark.read.parquet(path_partition_only) \
    .filter(filter_date & filter_country) \
    .join(spark.read.parquet("s3a://warehouse/customers_no_opt/"), "customer_id") \
    .groupBy("tier") \
    .agg(
        count("order_id").alias("total_orders"),
        sum("amount").alias("total_sales"),
        avg("amount").alias("avg_order_value")
    ) \
    .orderBy(desc("total_sales"))

result_partition_only.show()
time_partition_only = time.time() - start

print(f"✅ Time: {time_partition_only:.2f}s")
print(f"🚀 Speedup vs baseline: {time_no_opt/time_partition_only:.2f}x")
print("\n📋 Execution Plan:")
result_partition_only.explain()

# -----------------------------------------------------------------------------
# Scenario 3: BUCKETING ONLY
# -----------------------------------------------------------------------------
print("\n📊 Scenario 3: BUCKETING ONLY")
print("   - No partitioning → Full scan")
print("   - Bucketing → No shuffle in join")

start = time.time()
result_bucketing_only = spark.table("orders_bucketed") \
    .filter(filter_date & filter_country) \
    .join(spark.table("customers_bucketed"), "customer_id") \
    .groupBy("tier") \
    .agg(
        count("order_id").alias("total_orders"),
        sum("amount").alias("total_sales"),
        avg("amount").alias("avg_order_value")
    ) \
    .orderBy(desc("total_sales"))

result_bucketing_only.show()
time_bucketing_only = time.time() - start

print(f"✅ Time: {time_bucketing_only:.2f}s")
print(f"🚀 Speedup vs baseline: {time_no_opt/time_bucketing_only:.2f}x")
print("\n📋 Execution Plan:")
result_bucketing_only.explain()

# -----------------------------------------------------------------------------
# Scenario 4: PARTITION + BUCKETING (ULTIMATE)
# -----------------------------------------------------------------------------
print("\n📊 Scenario 4: PARTITION + BUCKETING ⚡⚡⚡")
print("   - Partitioning → Partition pruning (fast filter)")
print("   - Bucketing → No shuffle in join")
print("   - BEST OF BOTH WORLDS!")

orders_opt = spark.table("orders_optimized").alias("o")
customers_opt = spark.table("customers_bucketed").alias("c")

start = time.time()
result_optimized = orders_opt \
    .filter((col("o.year") == 2024) & (col("o.month") == 1) & (col("o.country") == "USA")) \
    .join(customers_opt, "customer_id") \
    .groupBy("c.tier") \
    .agg(
        count("o.order_id").alias("total_orders"),
        sum("o.amount").alias("total_sales"),
        avg("o.amount").alias("avg_order_value")
    ) \
    .orderBy(desc("total_sales"))

result_optimized.show()
time_optimized = time.time() - start

print(f"✅ Time: {time_optimized:.2f}s")
print(f"🚀 Speedup vs baseline: {time_no_opt/time_optimized:.2f}x")
print(f"🚀 Speedup vs partition only: {time_partition_only/time_optimized:.2f}x")
print(f"🚀 Speedup vs bucketing only: {time_bucketing_only/time_optimized:.2f}x")
print("\n📋 Execution Plan:")
result_optimized.explain()

# =============================================================================
# DETAILED COMPARISON
# =============================================================================
print("\n" + "="*80)
print("📊 DETAILED PERFORMANCE COMPARISON")
print("="*80)

comparison_data = [
    ("No Optimization", time_no_opt, 1.0, "Full scan + Full shuffle", "❌❌"),
    ("Partition Only", time_partition_only, time_no_opt/time_partition_only,
     "Partition pruning + Full shuffle", "✅❌"),
    ("Bucketing Only", time_bucketing_only, time_no_opt/time_bucketing_only,
     "Full scan + No shuffle", "❌✅"),
    ("Partition + Bucketing", time_optimized, time_no_opt/time_optimized,
     "Partition pruning + No shuffle", "✅✅")
]

comparison_df = spark.createDataFrame(comparison_data,
    ["Strategy", "Time (s)", "Speedup (x)", "Optimization", "Status"])

comparison_df.show(truncate=False)

# Visualization
print("\n📈 SPEEDUP VISUALIZATION:")
print("="*80)
for strategy, time_val, speedup, opt, status in comparison_data:
    bar = "█" * int(speedup * 10)
    print(f"{strategy:25s} {status} {bar} {speedup:.2f}x ({time_val:.2f}s)")

print("\n" + "="*80)
print("💡 KEY INSIGHTS:")
print("="*80)
print("""
1. PARTITION ONLY:
   ✅ Pros: Fast filter (partition pruning)
   ❌ Cons: Still shuffles in join
   📊 Expected speedup on large tables: 2-3x
   🎯 Use when: Filter queries dominate

2. BUCKETING ONLY:
   ✅ Pros: No shuffle in join
   ❌ Cons: Still scans all data for filter
   📊 Expected speedup on large tables: 2-4x
   🎯 Use when: Join queries dominate

3. PARTITION + BUCKETING:
   ✅ Pros: Fast filter + No shuffle
   ❌ Cons: More complex setup
   📊 Expected speedup on large tables: 5-15x (the two effects compound)
   🎯 Use when: Both filter and join are frequent

4. REAL-WORLD RECOMMENDATION:
   - Start with PARTITIONING (easier, covers 90% cases)
   - Add BUCKETING if joins are slow
   - Combine both for production systems with heavy workloads

5. TRADE-OFFS:
   - Partition: Easy to implement, flexible
   - Bucketing: Complex, requires saveAsTable (a catalog table, not a plain path write)
   - Combined: Best performance, but hardest to maintain

6. READ THE MEASURED NUMBERS FIRST:
   - The ranges above are rules of thumb, not results from this notebook
   - On small data Spark may broadcast the customers table; bucketing is then
     not used and the measured speedup can drop below 1.00x
""")

# =============================================================================
# COST-BENEFIT ANALYSIS
# =============================================================================
print("\n" + "="*80)
print("💰 COST-BENEFIT ANALYSIS")
print("="*80)

cost_benefit = [
    ("No Optimization", "None", "None", "Baseline", "Simple queries"),
    ("Partition Only", "Low", "Medium", "2-3x faster", "Filter-heavy workloads"),
    ("Bucketing Only", "High", "Medium", "2-4x faster", "Join-heavy workloads"),
    ("Partition + Bucketing", "Very High", "Very High", "5-15x faster", "Production systems")
]

cost_df = spark.createDataFrame(cost_benefit,
    ["Strategy", "Setup Cost", "Maintenance", "Expected Benefit (large tables)", "Best For"])

cost_df.show(truncate=False)

print("""
🎯 DECISION MATRIX:

Data Size < 1GB:
└── No optimization needed

Data Size 1-10GB:
├── Filter queries → Partition only
└── Join queries → Consider bucketing

Data Size > 10GB:
├── Filter + Join → Partition + Bucketing
└── Complex analytics → Partition + Bucketing + Caching

Query Pattern:
├── 90% filter, 10% join → Partition only
├── 10% filter, 90% join → Bucketing only
└── 50% filter, 50% join → Partition + Bucketing

Team Expertise:
├── Beginner → Partition only (easier)
├── Intermediate → Partition + Bucketing (if needed)
└── Advanced → Full optimization (partition + bucket + cache)
""")

⚡ ULTIMATE TEST: PARTITION + BUCKETING COMBINED

🔹 Creating test scenarios...


26/01/11 09:07:02 WARN TaskSetManager: Stage 101 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.
26/01/11 09:07:05 WARN TaskSetManager: Stage 103 contains a task of very large size (16859 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

✅ All scenarios ready

🔹 TEST QUERY:
Find total sales by customer tier for USA orders in January 2024

📊 Scenario 1: NO OPTIMIZATION (Baseline)
   - No partitioning → Full scan
   - No bucketing → Full shuffle in join
+------+------------+--------------------+-----------------+
|  tier|total_orders|         total_sales|  avg_order_value|
+------+------------+--------------------+-----------------+
|  Gold|       19012|1.2947813300000008E7| 681.033731327583|
|Silver|       18884|       1.280248714E7|677.9542014403728|
|Bronze|       18833|1.2779198519999996E7|678.5535241331703|
+------+------------+--------------------+-----------------+

✅ Time: 0.84s (baseline)

📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [total_sales#1705 DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(total_sales#1705 DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [plan_id=4037]
      +- HashAggregate(keys=[tier#1659], functions=[count(order_id#1626), su

                                                                                

+------+------------+--------------------+-----------------+
|  tier|total_orders|         total_sales|  avg_order_value|
+------+------------+--------------------+-----------------+
|  Gold|       19012|        1.29478133E7|681.0337313275826|
|Silver|       18884|1.2802487140000004E7| 677.954201440373|
|Bronze|       18833|1.2779198519999998E7|678.5535241331704|
+------+------------+--------------------+-----------------+

✅ Time: 2.02s
🚀 Speedup vs baseline: 0.41x

📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [total_sales#1930 DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(total_sales#1930 DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [plan_id=4466]
      +- HashAggregate(keys=[tier#674], functions=[count(order_id#642), sum(amount#650), avg(amount#650)])
         +- Exchange hashpartitioning(tier#674, 200), ENSURE_REQUIREMENTS, [plan_id=4463]
            +- HashAggregate(keys=[tier#674], functions=[partial_count(order_id#642), pa

                                                                                

+------+------------+--------------------+-----------------+
|  tier|total_orders|         total_sales|  avg_order_value|
+------+------------+--------------------+-----------------+
|  Gold|       19012|1.2947813299999999E7|681.0337313275825|
|Silver|       18884|1.2802487140000002E7|677.9542014403729|
|Bronze|       18833|1.2779198519999998E7|678.5535241331704|
+------+------------+--------------------+-----------------+

✅ Time: 1.71s
🚀 Speedup vs baseline: 0.49x
🚀 Speedup vs partition only: 0.48x
🚀 Speedup vs bucketing only: 1.18x

📋 Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [total_sales#2032 DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(total_sales#2032 DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [plan_id=4677]
      +- HashAggregate(keys=[tier#674], functions=[count(order_id#1497), sum(amount#1505), avg(amount#1505)])
         +- Exchange hashpartitioning(tier#674, 200), ENSURE_REQUIREMENTS, [plan_id=4674]
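
Each timing above comes from a single run that includes query planning, S3A reads, and any first-run warm-up, so differences of a second or less are noisy. A steadier measurement might discard warm-up runs and report the median of several repeats. The sketch below assumes two hypothetical helpers: `bench`, which times any zero-argument callable, and `describe_speedup`, which labels ratios below 1 as slowdowns instead of printing them as "faster". The 0.84s and 1.71s values are copied from the output above.

```python
import statistics
import time


def bench(action, warmup: int = 1, repeats: int = 5) -> float:
    """Run `action` a few times untimed, then return the median of the timed runs in seconds."""
    for _ in range(warmup):
        action()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def describe_speedup(baseline_s: float, candidate_s: float) -> str:
    """Describe candidate vs. baseline, naming a slowdown as a slowdown."""
    ratio = baseline_s / candidate_s
    if ratio >= 1:
        return f"{ratio:.2f}x faster"
    return f"{1 / ratio:.2f}x slower"


# Baseline vs. partition + bucketing, using the times printed above
label = describe_speedup(0.84, 1.71)
print(label)

# Stand-in workload so the sketch runs without a Spark session
elapsed = bench(lambda: sum(range(10_000)), warmup=1, repeats=3)
print(f"median: {elapsed:.6f}s")
```

In the notebook, a call such as `bench(lambda: result_no_opt.count())` could replace the manual `time.time()` bookkeeping; whether `count()` or `show()` is the right action to time depends on what the comparison is meant to capture.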
         

---

## 📊 **7. COMPARISON SUMMARY**

In [9]:
print("="*80)
print("📊 FINAL PERFORMANCE COMPARISON")
print("="*80)

print("""
SCENARIO 1: FILTER QUERY (Find USA orders in Jan 2024)
├── No Partitioning:      Reads ALL data → Filter
├── Date Partitioning:    Reads ONLY Jan 2024 data
└── Multi-level Partition: Reads ONLY USA/2024/Jan data ⚡ FASTEST

SCENARIO 2: JOIN QUERY (Orders JOIN Customers)
├── No Bucketing:    Full shuffle of both tables
└── With Bucketing:  No shuffle, local joins ⚡ FASTEST

SCENARIO 3: FILTER + JOIN (Best Practice)
└── Partition + Bucketing: Partition pruning + No shuffle ⚡⚡ SUPER FAST

""")

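# Create comparison table
# NOTE: the "Overall Speedup" values are hard-coded expectations for large tables,
# not measurements from this notebook. In the recorded runs the bucketed variants
# were slower than the baseline, and the visible join plans use BroadcastHashJoin
# with bucketing disabled by the query planner.
comparison_data = [
    ("No Optimization", "Full Scan", "Full Shuffle", "Baseline"),
    ("Partitioning Only", "Partition Pruning", "Full Shuffle", "2-5x faster"),
    ("Bucketing Only", "Full Scan", "No Shuffle", "3-10x faster"),
    ("Partition + Bucketing", "Partition Pruning", "No Shuffle", "10-100x faster ⚡")
]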

comparison_df = spark.createDataFrame(comparison_data,
    ["Strategy", "Filter Performance", "Join Performance", "Overall Speedup"])

comparison_df.show(truncate=False)

print("""
💡 KEY TAKEAWAYS:

1. PARTITIONING:
   ✅ Use for: Filter queries (WHERE date = ..., WHERE country = ...)
   ✅ Benefit: Partition pruning (read less data)
   ✅ Best for: Time-series data, geo data

2. BUCKETING:
   ✅ Use for: Join queries (frequent joins on same column)
   ✅ Benefit: No shuffle in joins
   ✅ Best for: Fact-dimension joins, large table joins

3. COMBINE BOTH:
   ✅ Partition by: Columns you filter on (date, country)
   ✅ Bucket by: Columns you join on (customer_id, product_id)
   ✅ Result: Maximum performance! ⚡⚡⚡

4. REAL-WORLD USAGE:
   📊 90% cases: Use partitioning (easier, more common)
   📊 10% cases: Use bucketing (complex, specific use cases)
   📊 5% cases: Use both (large-scale production systems)
""")


+---------------------+------------------+----------------+----------------+
|Strategy             |Filter Performance|Join Performance|Overall Speedup |
+---------------------+------------------+----------------+----------------+
|No Optimization      |Full Scan         |Full Shuffle    |Baseline        |
|Partitioning Only    |Partition Pruning |Full Shuffle    |2-5x faster     |
|Bucketing Only       |Full Scan         |N
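
The "Overall Speedup" column in this table is hard-coded, so it is worth setting it beside what the notebook actually measured. The sketch below recomputes the ratios for Tests 1 to 3 from the times printed in the earlier outputs. The `measured` dictionary is transcribed by hand from those outputs, and Test 4 is left out because its unbucketed time is cut off in the captured output.

```python
# (seconds without bucketing, seconds with bucketing), copied from the outputs above
measured = {
    "Simple Join": (1.03, 2.24),
    "Join + Aggregation": (1.55, 2.57),
    "Join + Filter + Window": (2.59, 3.27),
}

# Ratio > 1 means the bucketed run was faster; < 1 means it was slower
ratios = {
    name: round(without_s / with_s, 2)
    for name, (without_s, with_s) in measured.items()
}

for name, ratio in ratios.items():
    verdict = "faster" if ratio > 1 else "slower"
    print(f"{name:25s} {ratio:.2f}x ({verdict} with bucketing)")
```

All three ratios come out below 1, matching the `Speedup` lines in the captured output, so on this dataset the generic ranges in the table should be read as expectations for much larger tables rather than as results.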

---

## 🎓 **8. DECISION TREE - WHEN TO USE WHAT?**

In [10]:
print("="*80)
print("🎯 DECISION TREE: Choosing the Right Strategy")
print("="*80)

print("""
QUESTION 1: Which column do your queries usually filter on?
├── Date/Time → Use PARTITIONING by date
│   └── partitionBy("year", "month", "day")
│
├── Country/Region → Use PARTITIONING by country
│   └── partitionBy("country")
│
├── Category/Type → Use PARTITIONING by category
│   └── partitionBy("category")
│
└── Multiple columns → Use MULTI-LEVEL PARTITIONING
    └── partitionBy("country", "year", "month")

QUESTION 2: Do you frequently join two large tables?
├── Yes, frequent joins on the same column
│   └── Use BUCKETING
│       └── bucketBy(20, "join_key")
│
├── Yes, but one table is small (< 100MB)
│   └── Use BROADCAST JOIN (no bucketing needed)
│       └── broadcast(small_df)
│
└── No, joins are infrequent
    └── No bucketing needed

QUESTION 3: Do you have both filters AND joins?
└── Yes → Use BOTH partitioning AND bucketing
    └── partitionBy("date").bucketBy(20, "join_key")

EXAMPLES:

1. LOG DATA (filter by date):
   df.write.partitionBy("year", "month", "day").parquet("logs/")

2. E-COMMERCE ORDERS (filter by date, join with customers):
   df.write \\
     .partitionBy("year", "month") \\
     .bucketBy(20, "customer_id") \\
     .saveAsTable("orders")

3. USER EVENTS (filter by country and date):
   df.write.partitionBy("country", "year", "month").parquet("events/")

4. TRANSACTIONS (join with accounts frequently):
   df.write.bucketBy(50, "account_id").saveAsTable("transactions")
""")
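
The three questions of the decision tree can also be written as a small function, which makes the branching explicit. This is only a sketch of the tree as stated in this lesson: `choose_strategy` is a hypothetical helper, the 100 MB broadcast cut-off and the bucket count of 20 are the lesson's rules of thumb rather than Spark defaults, and the returned strings are reminders, not executable Spark calls.

```python
def choose_strategy(filter_columns, joins_large_tables,
                    smaller_table_mb=None, broadcast_limit_mb=100):
    """Return the layout steps suggested by the decision tree in this lesson."""
    steps = []

    # Question 1: partition by the columns that queries filter on
    if filter_columns:
        columns = ", ".join(repr(column) for column in filter_columns)
        steps.append(f"partitionBy({columns})")

    # Question 2: bucket for frequent large joins, broadcast when one side is small
    if joins_large_tables:
        if smaller_table_mb is not None and smaller_table_mb < broadcast_limit_mb:
            steps.append("broadcast(small_df)")
        else:
            steps.append("bucketBy(20, join_key)")

    # Question 3 falls out naturally: both steps are present when both apply
    return steps or ["no layout optimization"]


print(choose_strategy(["year", "month"], joins_large_tables=True))
print(choose_strategy([], joins_large_tables=True, smaller_table_mb=5))
print(choose_strategy([], joins_large_tables=False))
```

Applied to this notebook's data, the small customers table would fall into the broadcast branch, which is consistent with the `BroadcastHashJoin` operators seen in the plans.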

---

## 🎓 **KEY TAKEAWAYS**

### **✅ What You Learned:**

1. **Partitioning** - Organize data by column values
   - Use for: Filter queries
   - Benefit: Partition pruning
   - Real usage: 90% of cases

2. **Bucketing** - Hash-based distribution
   - Use for: Join queries
   - Benefit: No shuffle
   - Real usage: 10% of cases

3. **Combine Both** - Maximum performance
   - Partition by filter columns
   - Bucket by join columns
   - Real usage: 5% of cases (large scale)

### **📊 Quick Reference:**

```python
# Partitioning
df.write.partitionBy("year", "month").parquet(path)

# Bucketing
df.write.bucketBy(20, "customer_id").saveAsTable("table")

# Both
df.write \
  .partitionBy("year", "month") \
  .bucketBy(20, "customer_id") \
  .saveAsTable("table")
```
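
To observe a bucketed join on data as small as this lesson's, automatic broadcast joins would have to be switched off first; otherwise Spark tends to choose `BroadcastHashJoin`, as in the plans earlier in the notebook. A minimal sketch follows, assuming a hypothetical `apply_conf` helper. The two keys are standard Spark SQL settings, and `_FakeSession` is only a stand-in so the snippet runs without a cluster.

```python
# Settings for a bucketed-join experiment (values are strings, as Spark conf expects)
BUCKETED_JOIN_DEMO_CONF = {
    # -1 disables size-based automatic broadcast joins
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    # Bucketing is enabled by default; set explicitly to make the intent visible
    "spark.sql.sources.bucketing.enabled": "true",
}


def apply_conf(spark, settings):
    """Apply each setting through spark.conf.set and return the keys that were set."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return sorted(settings)


class _FakeConf:
    """Stand-in for spark.conf so the sketch runs without Spark installed."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


class _FakeSession:
    """Stand-in for a SparkSession exposing only `.conf`."""

    def __init__(self):
        self.conf = _FakeConf()


fake = _FakeSession()
applied = apply_conf(fake, BUCKETED_JOIN_DEMO_CONF)
print(applied)
```

In the notebook, `apply_conf(spark, BUCKETED_JOIN_DEMO_CONF)` would be called before re-running the join tests, and the `explain()` output would then need to be checked to confirm that the broadcast join is gone and that the bucketed scan no longer reports `Bucketed: false`.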

### **🚀 Next:** Day 4 - Lesson 2: Caching & Persistence

---

In [11]:
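# Cleanup
# The bucketed datasets were written with saveAsTable, so they are catalog tables
# rather than temp views; dropTempView would leave them in place.
# For tables created with an explicit "path" option, DROP TABLE removes the
# catalog entry and leaves the files under s3a://warehouse/.
for table_name in ["orders_bucketed", "customers_bucketed", "orders_optimized"]:
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")

spark.stop()
print("✅ Spark session stopped")
print("\n🎉 DAY 4 - LESSON 1 COMPLETED!")
print("\n💡 Remember:")
print("   - Partition for FILTER queries (90% cases)")
print("   - Bucket for JOIN queries (10% cases)")
print("   - Combine both for maximum performance (5% cases)")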

✅ Spark session stopped

🎉 DAY 4 - LESSON 1 COMPLETED!

💡 Remember:
   - Partition for FILTER queries (90% cases)
   - Bucket for JOIN queries (10% cases)
   - Combine both for maximum performance (5% cases)
