# 📊 AGGREGATIONS WITH PYSPARK

---

## 📋 **DAY 3 - LESSON 2: AGGREGATIONS**

### **🎯 OBJECTIVES:**

1. **Basic Aggregations** - count, sum, avg, min, max
2. **GroupBy Aggregations** - Single and multiple groups
3. **Multiple Aggregations** - agg() with multiple functions
4. **Pivot Tables** - Transform rows to columns
5. **Rollup** - Hierarchical aggregations
6. **Cube** - Multi-dimensional aggregations
7. **Window Aggregations** - Running totals, moving averages
8. **Advanced Patterns** - Custom aggregations, percentiles

---

## 🔧 **SETUP SPARK SESSION**

In [1]:
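from pyspark.sql import SparkSession
# NOTE: this wildcard import shadows Python built-ins such as sum, min, max and round
# with Spark column functions; every later cell in this notebook relies on that.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
from datetime import datetime, timedelta
import pandas as pd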

spark = SparkSession.builder \
    .appName("Aggregations") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/09 15:46:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session Created
Spark Version: 3.5.1
Master: spark://spark-master:7077


---

## 📊 **1. CREATE SAMPLE DATASET**

Create a more complex sales dataset for practicing aggregations.

In [2]:
# Create comprehensive sales data
sales_data = [
    # USA Sales
    ("ORD001", "2024-01-15", "USA", "Electronics", "Laptop", 1200.0, 2, "Online", "John Doe"),
    ("ORD002", "2024-01-15", "USA", "Electronics", "Phone", 800.0, 3, "Store", "Jane Smith"),
    ("ORD003", "2024-01-16", "USA", "Clothing", "Shirt", 50.0, 5, "Online", "Bob Johnson"),
    ("ORD004", "2024-01-16", "USA", "Electronics", "Tablet", 600.0, 1, "Online", "John Doe"),
    ("ORD005", "2024-01-17", "USA", "Clothing", "Pants", 80.0, 3, "Store", "Alice Brown"),
    
    # UK Sales
    ("ORD006", "2024-01-15", "UK", "Electronics", "Laptop", 1200.0, 1, "Online", "Charlie Wilson"),
    ("ORD007", "2024-01-16", "UK", "Books", "Novel", 20.0, 10, "Store", "David Lee"),
    ("ORD008", "2024-01-16", "UK", "Electronics", "Phone", 800.0, 2, "Online", "Eve Davis"),
    ("ORD009", "2024-01-17", "UK", "Clothing", "Jacket", 150.0, 2, "Store", "Frank Miller"),
    ("ORD010", "2024-01-17", "UK", "Books", "Textbook", 60.0, 3, "Online", "Grace Lee"),
    
    # Canada Sales
    ("ORD011", "2024-01-15", "Canada", "Electronics", "Laptop", 1200.0, 1, "Online", "Henry Taylor"),
    ("ORD012", "2024-01-16", "Canada", "Clothing", "Shoes", 120.0, 2, "Store", "Ivy Anderson"),
    ("ORD013", "2024-01-16", "Canada", "Books", "Magazine", 10.0, 5, "Online", "Jack Thomas"),
    ("ORD014", "2024-01-17", "Canada", "Electronics", "Tablet", 600.0, 2, "Online", "Karen Jackson"),
    ("ORD015", "2024-01-17", "Canada", "Clothing", "Dress", 100.0, 1, "Store", "Leo White"),
    
    # More USA Sales (different dates)
    ("ORD016", "2024-01-18", "USA", "Electronics", "Headphones", 200.0, 4, "Online", "Mia Harris"),
    ("ORD017", "2024-01-18", "USA", "Books", "Comic", 15.0, 8, "Store", "Noah Martin"),
    ("ORD018", "2024-01-19", "USA", "Clothing", "Hat", 30.0, 6, "Online", "Olivia Garcia"),
    ("ORD019", "2024-01-19", "USA", "Electronics", "Camera", 1500.0, 1, "Store", "Paul Martinez"),
    ("ORD020", "2024-01-20", "USA", "Books", "Cookbook", 35.0, 4, "Online", "Quinn Robinson"),
    
    # More UK Sales
    ("ORD021", "2024-01-18", "UK", "Electronics", "Mouse", 50.0, 10, "Online", "Rachel Clark"),
    ("ORD022", "2024-01-18", "UK", "Clothing", "Scarf", 40.0, 5, "Store", "Sam Rodriguez"),
    ("ORD023", "2024-01-19", "UK", "Books", "Biography", 25.0, 6, "Online", "Tina Lewis"),
    ("ORD024", "2024-01-19", "UK", "Electronics", "Keyboard", 100.0, 3, "Store", "Uma Walker"),
    ("ORD025", "2024-01-20", "UK", "Clothing", "Gloves", 25.0, 8, "Online", "Victor Hall"),
]

schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("order_date", StringType(), True),
    StructField("country", StringType(), True),
    StructField("category", StringType(), True),
    StructField("product", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("channel", StringType(), True),
    StructField("customer_name", StringType(), True)
])

df = spark.createDataFrame(sales_data, schema)

# Add calculated columns
df = df.withColumn("total_amount", col("price") * col("quantity")) \
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \
    .withColumn("year", year(col("order_date"))) \
    .withColumn("month", month(col("order_date"))) \
    .withColumn("day", dayofmonth(col("order_date")))

print("📊 SALES DATASET:")
df.show(25, truncate=False)
print(f"\nTotal rows: {df.count()}")
df.printSchema()

📊 SALES DATASET:


                                                                                

+--------+----------+-------+-----------+----------+------+--------+-------+--------------+------------+----+-----+---+
|order_id|order_date|country|category   |product   |price |quantity|channel|customer_name |total_amount|year|month|day|
+--------+----------+-------+-----------+----------+------+--------+-------+--------------+------------+----+-----+---+
|ORD001  |2024-01-15|USA    |Electronics|Laptop    |1200.0|2       |Online |John Doe      |2400.0      |2024|1    |15 |
|ORD002  |2024-01-15|USA    |Electronics|Phone     |800.0 |3       |Store  |Jane Smith    |2400.0      |2024|1    |15 |
|ORD003  |2024-01-16|USA    |Clothing   |Shirt     |50.0  |5       |Online |Bob Johnson   |250.0       |2024|1    |16 |
|ORD004  |2024-01-16|USA    |Electronics|Tablet    |600.0 |1       |Online |John Doe      |600.0       |2024|1    |16 |
|ORD005  |2024-01-17|USA    |Clothing   |Pants     |80.0  |3       |Store  |Alice Brown   |240.0       |2024|1    |17 |
|ORD006  |2024-01-15|UK     |Electronics




Total rows: 25
root
 |-- order_id: string (nullable = true)
 |-- order_date: date (nullable = true)
 |-- country: string (nullable = true)
 |-- category: string (nullable = true)
 |-- product: string (nullable = true)
 |-- price: double (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- channel: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)



                                                                                

---

## 📈 **2. BASIC AGGREGATIONS**

The basic aggregation functions.

In [3]:
# 2.1 Single aggregations
print("🔹 Single aggregations:")

# Count
total_orders = df.count()
print(f"Total orders: {total_orders}")

# Sum
total_revenue = df.select(sum("total_amount")).collect()[0][0]
print(f"Total revenue: ${total_revenue:,.2f}")

# Average
avg_order_value = df.select(avg("total_amount")).collect()[0][0]
print(f"Average order value: ${avg_order_value:,.2f}")

# Min/Max
min_order = df.select(min("total_amount")).collect()[0][0]
max_order = df.select(max("total_amount")).collect()[0][0]
print(f"Min order: ${min_order:,.2f}")
print(f"Max order: ${max_order:,.2f}")

# 2.2 Multiple aggregations at once
print("\n🔹 Multiple aggregations:")
summary = df.select(
    count("*").alias("total_orders"),
    sum("total_amount").alias("total_revenue"),
    avg("total_amount").alias("avg_order_value"),
    min("total_amount").alias("min_order"),
    max("total_amount").alias("max_order"),
    stddev("total_amount").alias("stddev_order"),
    variance("total_amount").alias("variance_order")
)

summary.show()

# 2.3 Count distinct
print("\n🔹 Count distinct:")
distinct_counts = df.select(
    countDistinct("country").alias("num_countries"),
    countDistinct("category").alias("num_categories"),
    countDistinct("product").alias("num_products"),
    countDistinct("customer_name").alias("num_customers")
)

distinct_counts.show()

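# 2.4 Approximate count distinct (faster on large datasets, at the cost of a small error;
# the optional rsd argument sets the maximum relative standard deviation, default 0.05)
print("\n🔹 Approximate count distinct:")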
approx_counts = df.select(
    approx_count_distinct("customer_name").alias("approx_customers"),
    countDistinct("customer_name").alias("exact_customers")
)

approx_counts.show()

🔹 Single aggregations:
Total orders: 25
Total revenue: $16,250.00
Average order value: $650.00
Min order: $50.00
Max order: $2,400.00

🔹 Multiple aggregations:
+------------+-------------+---------------+---------+---------+-----------------+--------------+
|total_orders|total_revenue|avg_order_value|min_order|max_order|     stddev_order|variance_order|
+------------+-------------+---------------+---------+---------+-----------------+--------------+
|          25|      16250.0|          650.0|     50.0|   2400.0|705.6025793603649|      497875.0|
+------------+-------------+---------------+---------+---------+-----------------+--------------+


🔹 Count distinct:


26/01/09 15:46:37 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------------+--------------+------------+-------------+
|num_countries|num_categories|num_products|num_customers|
+-------------+--------------+------------+-------------+
|            3|             3|          21|           24|
+-------------+--------------+------------+-------------+


🔹 Approximate count distinct:
+----------------+---------------+
|approx_customers|exact_customers|
+----------------+---------------+
|              25|             24|
+----------------+---------------+
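
Spark's `stddev` and `variance` are the sample statistics (aliases of `stddev_samp` and `var_samp`), which divide by n - 1 rather than n. As a sanity check on the summary table above, here is a plain-Python sketch, runnable without Spark, that recomputes both figures from the 25 `total_amount` values. The list literal is typed in by hand from the sample data and is not part of the original notebook.

```python
import math
import statistics

# total_amount (price * quantity) for ORD001..ORD025, in order
totals = [
    2400.0, 2400.0, 250.0, 600.0, 240.0,   # USA, ORD001-005
    1200.0, 200.0, 1600.0, 300.0, 180.0,   # UK, ORD006-010
    1200.0, 240.0, 50.0, 1200.0, 100.0,    # Canada, ORD011-015
    800.0, 120.0, 180.0, 1500.0, 140.0,    # USA, ORD016-020
    500.0, 200.0, 150.0, 300.0, 200.0,     # UK, ORD021-025
]

sample_var = statistics.variance(totals)        # divides by n - 1 (Spark: variance / var_samp)
population_var = statistics.pvariance(totals)   # divides by n     (Spark: var_pop)
sample_std = math.sqrt(sample_var)              # Spark: stddev / stddev_samp

print(sample_var, population_var, sample_std)
```

The sample variance matches `variance_order` (497875.0); the population variance for the same data is 477960.0, so the choice of function matters on small groups.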



---

## 👥 **3. GROUPBY AGGREGATIONS**

Group the data and aggregate within each group.
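
Conceptually, `groupBy(...).agg(...)` keeps one accumulator per key, and a `filter` applied after the aggregation plays the role of SQL `HAVING`. The plain-Python sketch below illustrates that idea only; it is not how Spark executes the query. The `detail` rows are an assumption of this example: country/category revenues pre-aggregated by hand from the sample dataset.

```python
from collections import defaultdict

# (country, category, revenue), pre-aggregated by hand from the sample dataset
detail = [
    ("Canada", "Books", 50.0), ("Canada", "Clothing", 340.0), ("Canada", "Electronics", 2400.0),
    ("UK", "Books", 530.0), ("UK", "Clothing", 700.0), ("UK", "Electronics", 3600.0),
    ("USA", "Books", 260.0), ("USA", "Clothing", 670.0), ("USA", "Electronics", 7700.0),
]

# groupBy("category") + sum: one running total per key
revenue_by_category = defaultdict(float)
for _country, category, revenue in detail:
    revenue_by_category[category] += revenue

# Filtering on the aggregated value (HAVING), not on the input rows (WHERE)
high_value = {c: r for c, r in revenue_by_category.items() if r > 1000}

print(dict(revenue_by_category))
print(high_value)
```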

In [4]:
# 3.1 Simple groupBy
print("🔹 Group by country:")
by_country = df.groupBy("country") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .orderBy(desc("total_revenue"))

by_country.show()

# 3.2 Group by multiple columns
print("\n🔹 Group by country and category:")
by_country_category = df.groupBy("country", "category") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue"),
        avg("total_amount").alias("avg_order_value")
    ) \
    .orderBy("country", desc("total_revenue"))

by_country_category.show()

# 3.3 Multiple aggregations per column
print("\n🔹 Multiple aggregations per column:")
detailed_stats = df.groupBy("category") \
    .agg(
        count("*").alias("num_orders"),
        sum("quantity").alias("total_quantity"),
        sum("total_amount").alias("total_revenue"),
        avg("total_amount").alias("avg_order_value"),
        min("total_amount").alias("min_order"),
        max("total_amount").alias("max_order"),
        stddev("total_amount").alias("stddev_order")
    ) \
    .orderBy(desc("total_revenue"))

detailed_stats.show()

# 3.4 Group by with filtering
print("\n🔹 Group by with filtering (HAVING):")
high_value_categories = df.groupBy("category") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .filter(col("total_revenue") > 1000) \
    .orderBy(desc("total_revenue"))

high_value_categories.show()

# 3.5 Group by date
print("\n🔹 Group by date:")
daily_sales = df.groupBy("order_date") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("daily_revenue"),
        avg("total_amount").alias("avg_order_value")
    ) \
    .orderBy("order_date")

daily_sales.show()

# 3.6 Collect list/set
print("\n🔹 Collect list/set:")
products_by_category = df.groupBy("category") \
    .agg(
        collect_list("product").alias("all_products"),
        collect_set("product").alias("unique_products"),
        count("*").alias("num_orders")
    )

products_by_category.show(truncate=False)

🔹 Group by country:
+-------+----------+-------------+
|country|num_orders|total_revenue|
+-------+----------+-------------+
|    USA|        10|       8630.0|
|     UK|        10|       4830.0|
| Canada|         5|       2790.0|
+-------+----------+-------------+


🔹 Group by country and category:
+-------+-----------+----------+-------------+------------------+
|country|   category|num_orders|total_revenue|   avg_order_value|
+-------+-----------+----------+-------------+------------------+
| Canada|Electronics|         2|       2400.0|            1200.0|
| Canada|   Clothing|         2|        340.0|             170.0|
| Canada|      Books|         1|         50.0|              50.0|
|     UK|Electronics|         4|       3600.0|             900.0|
|     UK|   Clothing|         3|        700.0|233.33333333333334|
|     UK|      Books|         3|        530.0|176.66666666666666|
|    USA|Electronics|         5|       7700.0|            1540.0|
|    USA|   Clothing|         3|  

---

## 🔄 **4. PIVOT TABLES**

Transform rows to columns (like Excel pivot tables)
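
To make the long-to-wide reshaping concrete, here is a plain-Python sketch of what a pivot produces, including the empty cells that `fillna(0)` later replaces. It only illustrates the result shape and is not Spark's implementation. The `long_rows` values are an assumption of this example: daily revenue per country computed by hand from the sample dataset for two dates, one of which has no Canadian orders.

```python
# (order_date, country, revenue) in long format; Canada has no orders on 2024-01-18
long_rows = [
    ("2024-01-17", "USA", 240.0),
    ("2024-01-17", "UK", 480.0),
    ("2024-01-17", "Canada", 1300.0),
    ("2024-01-18", "USA", 920.0),
    ("2024-01-18", "UK", 700.0),
]

# The distinct pivot values become the output columns
columns = sorted({country for _, country, _ in long_rows})

# One output row per date, one cell per pivot value; None marks a missing combination
wide = {}
for date, country, revenue in long_rows:
    row = wide.setdefault(date, dict.fromkeys(columns))
    row[country] = (row[country] or 0.0) + revenue

# Equivalent of .fillna(0): replace missing cells with 0
filled = {
    date: {c: (v if v is not None else 0.0) for c, v in row.items()}
    for date, row in wide.items()
}

print(wide)
print(filled)
```

Passing the pivot values explicitly, as in `pivot("category", categories)`, spares Spark the extra pass it otherwise needs to discover the distinct values.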

In [5]:
# 4.1 Basic pivot
print("🔹 Basic pivot - Revenue by country and category:")
pivot_basic = df.groupBy("country") \
    .pivot("category") \
    .sum("total_amount")

pivot_basic.show()

# 4.2 Pivot with multiple aggregations
print("\n🔹 Pivot with multiple aggregations:")
pivot_multi = df.groupBy("country") \
    .pivot("category") \
    .agg(
        sum("total_amount").alias("revenue"),
        count("*").alias("orders")
    )

pivot_multi.show()

# 4.3 Pivot with specific values (performance optimization)
print("\n🔹 Pivot with specific values:")
categories = ["Electronics", "Clothing", "Books"]
pivot_optimized = df.groupBy("country") \
    .pivot("category", categories) \
    .sum("total_amount")

pivot_optimized.show()

# 4.4 Pivot by date
print("\n🔹 Pivot by date - Daily revenue by country:")
pivot_date = df.groupBy("order_date") \
    .pivot("country") \
    .sum("total_amount") \
    .orderBy("order_date")

pivot_date.show()

# 4.5 Fill null values in pivot
print("\n🔹 Pivot with null handling:")
pivot_filled = df.groupBy("country") \
    .pivot("category") \
    .sum("total_amount") \
    .fillna(0)

pivot_filled.show()

🔹 Basic pivot - Revenue by country and category:
+-------+-----+--------+-----------+
|country|Books|Clothing|Electronics|
+-------+-----+--------+-----------+
|    USA|260.0|   670.0|     7700.0|
|     UK|530.0|   700.0|     3600.0|
| Canada| 50.0|   340.0|     2400.0|
+-------+-----+--------+-----------+


🔹 Pivot with multiple aggregations:
+-------+-------------+------------+----------------+---------------+-------------------+------------------+
|country|Books_revenue|Books_orders|Clothing_revenue|Clothing_orders|Electronics_revenue|Electronics_orders|
+-------+-------------+------------+----------------+---------------+-------------------+------------------+
|    USA|        260.0|           2|           670.0|              3|             7700.0|                 5|
|     UK|        530.0|           3|           700.0|              3|             3600.0|                 4|
| Canada|         50.0|           1|           340.0|              2|             2400.0|              

---

## 📊 **5. ROLLUP - HIERARCHICAL AGGREGATIONS**

Create subtotals and grand totals.
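
A rollup over n columns aggregates once per prefix of the column list, giving n + 1 grouping sets. The plain-Python sketch below enumerates those sets and sums revenue for each; it illustrates the semantics and is not Spark's execution strategy. The helper names `rollup_sets` and `aggregate_sets`, and the hand-aggregated `detail` rows, are assumptions of this example.

```python
from collections import defaultdict

# (country, category, revenue), pre-aggregated by hand from the sample dataset
detail = [
    ("Canada", "Books", 50.0), ("Canada", "Clothing", 340.0), ("Canada", "Electronics", 2400.0),
    ("UK", "Books", 530.0), ("UK", "Clothing", 700.0), ("UK", "Electronics", 3600.0),
    ("USA", "Books", 260.0), ("USA", "Clothing", 670.0), ("USA", "Electronics", 7700.0),
]
COLS = ("country", "category")

def rollup_sets(cols):
    """Grouping sets for rollup: every prefix of cols, from full detail down to the empty set."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def aggregate_sets(rows, cols, grouping_sets):
    """Sum revenue once per grouping set; columns outside the set are reported as None."""
    totals = defaultdict(float)
    for *keys, revenue in rows:
        record = dict(zip(cols, keys))
        for gs in grouping_sets:
            key = tuple(record[c] if c in gs else None for c in cols)
            totals[key] += revenue
    return dict(totals)

sets_rollup = rollup_sets(COLS)
rollup_result = aggregate_sets(detail, COLS, sets_rollup)

print(sets_rollup)          # detail level, country subtotal, grand total
print(len(rollup_result))   # number of output rows
print(rollup_result[("USA", None)], rollup_result[(None, None)])
```

For this dataset the sketch yields 13 rows: 9 detail rows, 3 country subtotals and 1 grand total.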

In [6]:
# 5.1 Basic rollup
print("🔹 Rollup - Country and Category:")
rollup_basic = df.rollup("country", "category") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .orderBy("country", "category")

rollup_basic.show(30)

print("""
📝 ROLLUP EXPLANATION:
- Rows with both country AND category: Detail level
- Rows with country but NULL category: Country subtotal
- Row with NULL country and NULL category: Grand total
""")

# 5.2 Rollup with 3 levels
print("\n🔹 Rollup - Country, Category, Channel:")
rollup_3level = df.rollup("country", "category", "channel") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .orderBy("country", "category", "channel")

rollup_3level.show(50)

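# 5.3 Identify aggregation levels
# Note: isNull() cannot tell a subtotal row from a genuine NULL in the data;
# grouping("country") / grouping_id() make that distinction when the data may contain NULLs.
print("\n🔹 Rollup with level identification:")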
rollup_labeled = df.rollup("country", "category") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .withColumn(
        "level",
        when(col("country").isNull(), "Grand Total")
        .when(col("category").isNull(), "Country Subtotal")
        .otherwise("Detail")
    ) \
    .orderBy("country", "category")

rollup_labeled.show(30)

# 5.4 Filter rollup results
print("\n🔹 Show only subtotals:")
subtotals_only = rollup_labeled.filter(
    (col("level") == "Country Subtotal") | (col("level") == "Grand Total")
)

subtotals_only.show()

🔹 Rollup - Country and Category:
+-------+-----------+----------+-------------+
|country|   category|num_orders|total_revenue|
+-------+-----------+----------+-------------+
|   NULL|       NULL|        25|      16250.0|
| Canada|       NULL|         5|       2790.0|
| Canada|      Books|         1|         50.0|
| Canada|   Clothing|         2|        340.0|
| Canada|Electronics|         2|       2400.0|
|     UK|       NULL|        10|       4830.0|
|     UK|      Books|         3|        530.0|
|     UK|   Clothing|         3|        700.0|
|     UK|Electronics|         4|       3600.0|
|    USA|       NULL|        10|       8630.0|
|    USA|      Books|         2|        260.0|
|    USA|   Clothing|         3|        670.0|
|    USA|Electronics|         5|       7700.0|
+-------+-----------+----------+-------------+


📝 ROLLUP EXPLANATION:
- Rows with both country AND category: Detail level
- Rows with country but NULL category: Country subtotal
- Row with NULL country and NU

---

## 🎲 **6. CUBE - MULTI-DIMENSIONAL AGGREGATIONS**

Create every combination of the group-by columns.
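
A cube over n columns aggregates once per subset of the column list, giving 2^n grouping sets, which is why its output grows much faster than a rollup's. The plain-Python sketch below enumerates the subsets and sums revenue for each; it illustrates the semantics only. The helper names `cube_sets` and `aggregate_sets`, and the hand-aggregated `detail` rows, are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations

# (country, category, revenue), pre-aggregated by hand from the sample dataset
detail = [
    ("Canada", "Books", 50.0), ("Canada", "Clothing", 340.0), ("Canada", "Electronics", 2400.0),
    ("UK", "Books", 530.0), ("UK", "Clothing", 700.0), ("UK", "Electronics", 3600.0),
    ("USA", "Books", 260.0), ("USA", "Clothing", 670.0), ("USA", "Electronics", 7700.0),
]
COLS = ("country", "category")

def cube_sets(cols):
    """Grouping sets for cube: every subset of cols, largest first (2**n sets)."""
    return [combo for r in range(len(cols), -1, -1) for combo in combinations(cols, r)]

def aggregate_sets(rows, cols, grouping_sets):
    """Sum revenue once per grouping set; columns outside the set are reported as None."""
    totals = defaultdict(float)
    for *keys, revenue in rows:
        record = dict(zip(cols, keys))
        for gs in grouping_sets:
            key = tuple(record[c] if c in gs else None for c in cols)
            totals[key] += revenue
    return dict(totals)

sets_cube = cube_sets(COLS)
cube_result = aggregate_sets(detail, COLS, sets_cube)

print(sets_cube)
print(len(cube_result))                               # number of output rows
print(cube_result[(None, "Electronics")])             # category total across all countries
print(len(cube_sets(("country", "category", "channel"))))  # grouping sets for 3 columns
```

For two columns this gives 16 rows: the 13 rollup rows plus 3 category totals across all countries.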

In [7]:
# 6.1 Basic cube
print("🔹 Cube - Country and Category:")
cube_basic = df.cube("country", "category") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .orderBy("country", "category")

cube_basic.show(30)

print("""
📝 CUBE EXPLANATION:
- Rows with both country AND category: Detail level
- Rows with country but NULL category: Country subtotal
- Rows with NULL country but has category: Category subtotal (across all countries)
- Row with NULL country and NULL category: Grand total
""")

# 6.2 Cube with level identification
print("\n🔹 Cube with level identification:")
cube_labeled = df.cube("country", "category") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .withColumn(
        "level",
        when(col("country").isNull() & col("category").isNull(), "Grand Total")
        .when(col("country").isNull(), "Category Total")
        .when(col("category").isNull(), "Country Total")
        .otherwise("Detail")
    ) \
    .orderBy("level", "country", "category")

cube_labeled.show(30)

# 6.3 Cube with 3 dimensions
print("\n🔹 Cube - Country, Category, Channel:")
cube_3d = df.cube("country", "category", "channel") \
    .agg(
        count("*").alias("num_orders"),
        sum("total_amount").alias("total_revenue")
    ) \
    .orderBy("country", "category", "channel")

print(f"Total combinations: {cube_3d.count()}")
cube_3d.show(50)

# 6.4 Compare Rollup vs Cube
print("\n🔹 Comparison - Rollup vs Cube:")
rollup_count = df.rollup("country", "category").count().count()
cube_count = df.cube("country", "category").count().count()

print(f"Rollup combinations: {rollup_count}")
print(f"Cube combinations: {cube_count}")
print(f"\nRollup: Hierarchical (country -> category -> total)")
print(f"Cube: All combinations (country, category, country+category, total)")

🔹 Cube - Country and Category:
+-------+-----------+----------+-------------+
|country|   category|num_orders|total_revenue|
+-------+-----------+----------+-------------+
|   NULL|       NULL|        25|      16250.0|
|   NULL|      Books|         6|        840.0|
|   NULL|   Clothing|         8|       1710.0|
|   NULL|Electronics|        11|      13700.0|
| Canada|       NULL|         5|       2790.0|
| Canada|      Books|         1|         50.0|
| Canada|   Clothing|         2|        340.0|
| Canada|Electronics|         2|       2400.0|
|     UK|       NULL|        10|       4830.0|
|     UK|      Books|         3|        530.0|
|     UK|   Clothing|         3|        700.0|
|     UK|Electronics|         4|       3600.0|
|    USA|       NULL|        10|       8630.0|
|    USA|      Books|         2|        260.0|
|    USA|   Clothing|         3|        670.0|
|    USA|Electronics|         5|       7700.0|
+-------+-----------+----------+-------------+


📝 CUBE EXPLANATION:
-

---

## 🪟 **7. WINDOW AGGREGATIONS**

Aggregations over a window of rows (preview for next lesson)
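
One detail is worth knowing before reading the running totals: when a window has `orderBy` but no explicit frame, Spark uses a range frame from the start of the partition to the current row, so rows that tie on the ordering column share one cumulative value. The plain-Python sketch below contrasts that with a row-by-row frame for Canada's five orders. The function names are illustrative, and the row-based version assumes `order_id` as a tie-breaker, which the notebook's window does not use.

```python
import math

# Canada's orders sorted by (order_date, order_id): (order_id, order_date, total_amount)
canada = [
    ("ORD011", "2024-01-15", 1200.0),
    ("ORD012", "2024-01-16", 240.0),
    ("ORD013", "2024-01-16", 50.0),
    ("ORD014", "2024-01-17", 1200.0),
    ("ORD015", "2024-01-17", 100.0),
]

def running_total_rows(rows):
    """ROWS frame: each row sums everything up to and including itself."""
    out, acc = [], 0.0
    for _, _, amount in rows:
        acc += amount
        out.append(acc)
    return out

def running_total_range(rows):
    """RANGE frame: each row sums every row whose order_date is <= its own,
    so rows sharing a date (peers) receive the same value."""
    return [math.fsum(a for _, d, a in rows if d <= date) for _, date, _ in rows]

by_rows = running_total_rows(canada)
by_range = running_total_range(canada)

print(by_rows)
print(by_range)
```

The range-based list reproduces the repeated values (1490.0 and 2790.0) seen for Canada in the notebook's output. In PySpark, adding `.rowsBetween(Window.unboundedPreceding, Window.currentRow)` together with a tie-breaking sort column gives the row-by-row behaviour.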

In [8]:
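# 7.1 Running total
# With orderBy and no explicit frame, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW, so orders sharing an order_date receive the same running total.
print("🔹 Running total by country:")
windowSpec = Window.partitionBy("country").orderBy("order_date")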

running_total = df.withColumn(
    "running_total",
    sum("total_amount").over(windowSpec)
).select(
    "order_id",
    "order_date",
    "country",
    "total_amount",
    "running_total"
).orderBy("country", "order_date")

running_total.show(20)

# 7.2 Moving average
# rowsBetween counts rows, not calendar days: this frame is the current order plus the
# two preceding orders within the same country.
print("\n🔹 Moving average (3-row window):")
windowSpec3 = Window.partitionBy("country") \
    .orderBy("order_date") \
    .rowsBetween(-2, 0)  # Current row and 2 rows before

moving_avg = df.withColumn(
    "moving_avg_3row",
    avg("total_amount").over(windowSpec3)
).select(
    "order_id",
    "order_date",
    "country",
    "total_amount",
    "moving_avg_3row"
).orderBy("country", "order_date")

moving_avg.show(20)

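# 7.3 Cumulative count
# Uses the same default RANGE frame as 7.1, so orders on the same date share a count;
# use row_number() when a strict 1, 2, 3, ... sequence is needed.
print("\n🔹 Cumulative count:")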
cumulative = df.withColumn(
    "order_number",
    count("*").over(windowSpec)
).select(
    "order_id",
    "order_date",
    "country",
    "total_amount",
    "order_number"
).orderBy("country", "order_date")

cumulative.show(20)

print("\n💡 More window functions in next lesson (03_window_functions.ipynb)!")

🔹 Running total by country:


                                                                                

+--------+----------+-------+------------+-------------+
|order_id|order_date|country|total_amount|running_total|
+--------+----------+-------+------------+-------------+
|  ORD011|2024-01-15| Canada|      1200.0|       1200.0|
|  ORD013|2024-01-16| Canada|        50.0|       1490.0|
|  ORD012|2024-01-16| Canada|       240.0|       1490.0|
|  ORD014|2024-01-17| Canada|      1200.0|       2790.0|
|  ORD015|2024-01-17| Canada|       100.0|       2790.0|
|  ORD006|2024-01-15|     UK|      1200.0|       1200.0|
|  ORD007|2024-01-16|     UK|       200.0|       3000.0|
|  ORD008|2024-01-16|     UK|      1600.0|       3000.0|
|  ORD009|2024-01-17|     UK|       300.0|       3480.0|
|  ORD010|2024-01-17|     UK|       180.0|       3480.0|
|  ORD021|2024-01-18|     UK|       500.0|       4180.0|
|  ORD022|2024-01-18|     UK|       200.0|       4180.0|
|  ORD023|2024-01-19|     UK|       150.0|       4630.0|
|  ORD024|2024-01-19|     UK|       300.0|       4630.0|
|  ORD025|2024-01-20|     UK|  

---

## 🎯 **8. ADVANCED AGGREGATION PATTERNS**
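
One of the patterns in this section, `percentile_approx`, returns a value that actually occurs in the column rather than an interpolated one. The plain-Python sketch below uses the simple nearest-rank rule, which happens to reproduce the percentiles this notebook prints for the sample data. That agreement is an observation on this small dataset, not a description of Spark's approximate algorithm; the function name `nearest_rank` and the hand-typed `totals` list are assumptions of this example.

```python
import math

# total_amount for ORD001..ORD025, typed in by hand from the sample dataset
totals = [
    2400.0, 2400.0, 250.0, 600.0, 240.0,
    1200.0, 200.0, 1600.0, 300.0, 180.0,
    1200.0, 240.0, 50.0, 1200.0, 100.0,
    800.0, 120.0, 180.0, 1500.0, 140.0,
    500.0, 200.0, 150.0, 300.0, 200.0,
]

def nearest_rank(values, p):
    """Nearest-rank percentile: the value at 1-based rank ceil(p * n) in sorted order."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered)) or 1   # rank 0 (p == 0) falls back to the first value
    return ordered[rank - 1]

quartiles = [nearest_rank(totals, p) for p in (0.25, 0.50, 0.75, 0.95)]
print(quartiles)
```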

In [9]:
# 8.1 Percentiles and quantiles
print("🔹 Percentiles:")
percentiles = df.select(
    expr("percentile_approx(total_amount, 0.25)").alias("p25"),
    expr("percentile_approx(total_amount, 0.50)").alias("p50_median"),
    expr("percentile_approx(total_amount, 0.75)").alias("p75"),
    expr("percentile_approx(total_amount, 0.95)").alias("p95")
)

percentiles.show()

# 8.2 Percentiles by group
print("\n🔹 Percentiles by category:")
percentiles_by_category = df.groupBy("category").agg(
    expr("percentile_approx(total_amount, 0.50)").alias("median"),
    expr("percentile_approx(total_amount, 0.95)").alias("p95")
)

percentiles_by_category.show()

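# 8.3 First/Last values
# NOTE: first()/last() return whichever row Spark happens to see first or last in each group;
# without a defined ordering the result is non-deterministic. For the true earliest and latest
# orders, use min("order_date") / max("order_date"), or min_by / max_by for the matching amounts.
print("\n🔹 First and last orders by country:")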
first_last = df.groupBy("country").agg(
    first("order_date").alias("first_order_date"),
    last("order_date").alias("last_order_date"),
    first("total_amount").alias("first_order_amount"),
    last("total_amount").alias("last_order_amount")
)

first_last.show()

# 8.4 Conditional aggregations
print("\n🔹 Conditional aggregations:")
conditional_agg = df.groupBy("country").agg(
    count("*").alias("total_orders"),
    sum(when(col("channel") == "Online", 1).otherwise(0)).alias("online_orders"),
    sum(when(col("channel") == "Store", 1).otherwise(0)).alias("store_orders"),
    sum(when(col("channel") == "Online", col("total_amount")).otherwise(0)).alias("online_revenue"),
    sum(when(col("channel") == "Store", col("total_amount")).otherwise(0)).alias("store_revenue")
)

conditional_agg.show()

# 8.5 Weighted average
print("\n🔹 Weighted average (price weighted by quantity):")
weighted_avg = df.groupBy("category").agg(
    sum(col("price") * col("quantity")).alias("weighted_sum"),
    sum("quantity").alias("total_quantity")
).withColumn(
    "weighted_avg_price",
    col("weighted_sum") / col("total_quantity")
).select("category", "weighted_avg_price")

weighted_avg.show()

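# 8.6 Mode (most frequent value)
# Ties are broken arbitrarily by row_number(); add a second sort key (e.g. "product") for a stable result.
print("\n🔹 Most popular product by category:")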
mode_product = df.groupBy("category", "product").count() \
    .withColumn(
        "rank",
        row_number().over(Window.partitionBy("category").orderBy(desc("count")))
    ) \
    .filter(col("rank") == 1) \
    .select("category", "product", "count") \
    .withColumnRenamed("product", "most_popular_product") \
    .withColumnRenamed("count", "num_orders")

mode_product.show()

🔹 Percentiles:
+-----+----------+------+------+
|  p25|p50_median|   p75|   p95|
+-----+----------+------+------+
|180.0|     250.0|1200.0|2400.0|
+-----+----------+------+------+


🔹 Percentiles by category:
+-----------+------+------+
|   category|median|   p95|
+-----------+------+------+
|Electronics|1200.0|2400.0|
|   Clothing| 200.0| 300.0|
|      Books| 140.0| 200.0|
+-----------+------+------+


🔹 First and last orders by country:
+-------+----------------+---------------+------------------+-----------------+
|country|first_order_date|last_order_date|first_order_amount|last_order_amount|
+-------+----------------+---------------+------------------+-----------------+
|    USA|      2024-01-18|     2024-01-17|             800.0|            240.0|
|     UK|      2024-01-18|     2024-01-17|             500.0|            180.0|
| Canada|      2024-01-16|     2024-01-16|              50.0|            240.0|
+-------+----------------+---------------+------------------+-------

---

## 📊 **9. COMPREHENSIVE SALES REPORT**

Build a complete summary report.
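
The report expresses each group's revenue as a share of the grand total. As a quick arithmetic check, the plain-Python sketch below recomputes the country shares from the per-country revenues. Run it as a standalone script: inside the notebook, `round` refers to the Spark column function because of the wildcard import. The `revenue` dictionary is typed in by hand from the notebook's results.

```python
import math

# Revenue per country, typed in by hand from the notebook's group-by output
revenue = {"USA": 8630.0, "UK": 4830.0, "Canada": 2790.0}

total = math.fsum(revenue.values())

# Share of the grand total, as a percentage rounded to 2 decimals
share = {country: round(r / total * 100, 2) for country, r in revenue.items()}

print(total)
print(share)
```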

In [11]:
# 9.1 Executive Summary
print("="*80)
print("📊 EXECUTIVE SUMMARY")
print("="*80)

exec_summary = df.select(
    count("*").alias("total_orders"),
    countDistinct("customer_name").alias("unique_customers"),
    sum("total_amount").alias("total_revenue"),
    avg("total_amount").alias("avg_order_value"),
    sum("quantity").alias("total_items_sold")
)

exec_summary.show()

# Get the total revenue to compute percentage shares
total_revenue = exec_summary.select("total_revenue").collect()[0]["total_revenue"]

# 9.2 Revenue by Country
print("\n📍 REVENUE BY COUNTRY:")
country_report = df.groupBy("country").agg(
    count("*").alias("orders"),
    sum("total_amount").alias("revenue"),
    avg("total_amount").alias("avg_order_value"),
    sum("quantity").alias("items_sold")
).withColumn(
    "revenue_pct",
    round((col("revenue") / lit(total_revenue)) * 100, 2)
).orderBy(desc("revenue"))

country_report.show()

# 9.3 Category Performance
print("\n📦 CATEGORY PERFORMANCE:")
category_report = df.groupBy("category").agg(
    count("*").alias("orders"),
    sum("total_amount").alias("revenue"),
    avg("total_amount").alias("avg_order_value"),
    countDistinct("product").alias("num_products")
).orderBy(desc("revenue"))

category_report.show()

# 9.4 Channel Performance
print("\n🛒 CHANNEL PERFORMANCE:")
channel_report = df.groupBy("channel").agg(
    count("*").alias("orders"),
    sum("total_amount").alias("revenue"),
    avg("total_amount").alias("avg_order_value")
).withColumn(
    "revenue_pct",
    round((col("revenue") / lit(total_revenue)) * 100, 2)
).orderBy(desc("revenue"))

channel_report.show()

# 9.5 Daily Trend
print("\n📈 DAILY TREND:")
daily_trend = df.groupBy("order_date").agg(
    count("*").alias("orders"),
    sum("total_amount").alias("revenue"),
    avg("total_amount").alias("avg_order_value")
).orderBy("order_date")

daily_trend.show()

# 9.6 Top Products
print("\n🏆 TOP 10 PRODUCTS:")
top_products = df.groupBy("product", "category").agg(
    count("*").alias("orders"),
    sum("quantity").alias("quantity_sold"),
    sum("total_amount").alias("revenue")
).orderBy(desc("revenue")).limit(10)

top_products.show(truncate=False)

# 9.7 Top Customers
print("\n👥 TOP 10 CUSTOMERS:")
top_customers = df.groupBy("customer_name").agg(
    count("*").alias("orders"),
    sum("total_amount").alias("total_spent"),
    avg("total_amount").alias("avg_order_value")
).orderBy(desc("total_spent")).limit(10)

top_customers.show(truncate=False)

print("\n" + "="*80)

📊 EXECUTIVE SUMMARY
+------------+----------------+-------------+---------------+----------------+
|total_orders|unique_customers|total_revenue|avg_order_value|total_items_sold|
+------------+----------------+-------------+---------------+----------------+
|          25|              24|      16250.0|          650.0|              98|
+------------+----------------+-------------+---------------+----------------+


📍 REVENUE BY COUNTRY:
+-------+------+-------+---------------+----------+-----------+
|country|orders|revenue|avg_order_value|items_sold|revenue_pct|
+-------+------+-------+---------------+----------+-----------+
|    USA|    10| 8630.0|          863.0|        37|      53.11|
|     UK|    10| 4830.0|          483.0|        50|      29.72|
| Canada|     5| 2790.0|          558.0|        11|      17.17|
+-------+------+-------+---------------+----------+-----------+


📦 CATEGORY PERFORMANCE:
+-----------+------+-------+------------------+------------+
|   category|orde

---

## 💾 **10. SAVE AGGREGATED DATA**

In [None]:
# Save aggregated reports to MinIO
base_path = "s3a://warehouse/aggregated_reports/"

# Save country report
country_report.write.mode("overwrite").parquet(f"{base_path}country_report/")
print(f"✅ Country report saved to: {base_path}country_report/")

# Save category report
category_report.write.mode("overwrite").parquet(f"{base_path}category_report/")
print(f"✅ Category report saved to: {base_path}category_report/")

# Save daily trend
daily_trend.write.mode("overwrite").partitionBy("order_date").parquet(f"{base_path}daily_trend/")
print(f"✅ Daily trend saved to: {base_path}daily_trend/")

# Save pivot table
pivot_filled.write.mode("overwrite").parquet(f"{base_path}pivot_country_category/")
print(f"✅ Pivot table saved to: {base_path}pivot_country_category/")

# Verify
print("\n✅ VERIFICATION:")
df_verify = spark.read.parquet(f"{base_path}country_report/")
print(f"Country report rows: {df_verify.count()}")
df_verify.show()

---

## 🎓 **KEY TAKEAWAYS**

### **✅ What You Learned:**

1. **Basic Aggregations** - count, sum, avg, min, max, stddev
2. **GroupBy** - Single and multiple columns, filtering
3. **Multiple Aggregations** - agg() with multiple functions
4. **Pivot Tables** - Transform rows to columns
5. **Rollup** - Hierarchical subtotals (country → category → total)
6. **Cube** - All combinations of dimensions
7. **Window Aggregations** - Running totals, moving averages
8. **Advanced Patterns** - Percentiles, conditional agg, weighted avg
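
To make the weighted-average pattern from section 8 concrete, here is a plain-Python worked example for the Books category, showing how the quantity-weighted price differs from the simple mean of prices. The `books` list is typed in by hand from the sample dataset.

```python
import math

# (price, quantity) for the six Books orders: ORD007, ORD010, ORD013, ORD017, ORD020, ORD023
books = [(20.0, 10), (60.0, 3), (10.0, 5), (15.0, 8), (35.0, 4), (25.0, 6)]

weighted_sum = math.fsum(price * qty for price, qty in books)   # total revenue
total_quantity = 0
for _, qty in books:
    total_quantity += qty

weighted_avg_price = weighted_sum / total_quantity               # average price per item sold
simple_avg_price = math.fsum(price for price, _ in books) / len(books)  # average price per order

print(weighted_avg_price, simple_avg_price)
```

The weighted average (about 23.33) is lower than the simple average (27.5) because the cheaper books sold in larger quantities.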

### **📊 Aggregation Comparison:**

| Type | Use Case | Example |
|------|----------|----------|
| **GroupBy** | Standard grouping | Sales by country |
| **Pivot** | Cross-tabulation | Country vs Category matrix |
| **Rollup** | Hierarchical totals | Country → Category → Total |
| **Cube** | All combinations | All possible groupings |
| **Window** | Running calculations | Cumulative sum, moving avg |

### **⚡ Performance Tips:**

1. **Use approx_count_distinct** on large datasets when a small error is acceptable (faster than an exact distinct count)
2. **Specify pivot values** explicitly (better performance)
3. **Filter before aggregating** (reduce data size)
4. **Use broadcast joins** for small lookup tables
5. **Cache intermediate results** if reused
6. **Avoid collect()** on large aggregated data

### **🚀 Next Steps:**
- **Day 3 - Lesson 3:** Window Functions (ranking, lag/lead, percentiles)
- **Day 3 - Lesson 4:** Joins (inner, outer, broadcast, optimization)

---

In [None]:
# Cleanup
spark.stop()
print("✅ Spark session stopped")
print("\n🎉 DAY 3 - LESSON 2 COMPLETED!")