**⭐ 1. What This Pattern Solves**

Computes several aggregate metrics in one groupBy, often over more than one grouping column. This gives a richer summary in a single pass instead of running a separate groupBy per metric and joining the results.

Use cases:

Total and average sales per Customer and Region

Min, max, and count of orders per Product per Month

Multiple KPIs in dashboards

Pre-aggregating multiple measures for ETL pipelines

**⭐ 2. SQL Equivalent**

In [0]:
%sql
SELECT CustomerID, Region,
       SUM(Amount) AS TotalAmount,
       AVG(Amount) AS AvgAmount,
       COUNT(OrderID) AS OrderCount
FROM Orders
GROUP BY CustomerID, Region;

**⭐ 3. Core Idea**

Use groupBy on multiple columns and agg with multiple aggregate functions. Spark computes all the aggregates in the same job, so the data is scanned and shuffled once rather than once per metric, which matters on large production tables.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql import functions as F

df.groupBy("col1", "col2") \
  .agg(
      F.sum("metric1").alias("sum_metric1"),
      F.avg("metric2").alias("avg_metric2"),
      F.count("metric3").alias("count_metric3")
  )

**⭐ 5. Detailed Example**

In [0]:
data = [
    ("Alice", "East", 100, 1),
    ("Alice", "East", 150, 2),
    ("Bob", "West", 200, 3),
    ("Bob", "West", 50, 4)
]

df = spark.createDataFrame(data, ["Customer", "Region", "Amount", "Orders"])

result = df.groupBy("Customer", "Region").agg(
    F.sum("Amount").alias("TotalAmount"),
    F.avg("Amount").alias("AvgAmount"),
    F.count("Orders").alias("OrderCount")
)

result.show()


In [0]:
+--------+------+-----------+---------+----------+
|Customer|Region|TotalAmount|AvgAmount|OrderCount|
+--------+------+-----------+---------+----------+
|   Alice|  East|        250|    125.0|         2|
|     Bob|  West|        250|    125.0|         2|
+--------+------+-----------+---------+----------+


**⭐ 6. Mini Practice Problems**

Compute sum, avg, and count of Sales per Store and Category.

Find min, max, and total Revenue per Region and Month.

Count orders and sum quantities per Product and Customer.

**⭐ 7. Full Data Engineering Problem**

Scenario: You have a Bronze order dataset with OrderID, CustomerID, Region, OrderDate, Amount, Quantity.

Requirement: Silver table with total sales, average sales, and number of orders per Customer and Region.

Steps:

Read Bronze Parquet.

groupBy("CustomerID", "Region").

Compute sum(Amount), avg(Amount), count(OrderID).

Write the result as a Delta table (for example, on S3) for downstream reporting.

This mirrors real ETL pipelines aggregating multiple KPIs in one pass.

**⭐ 8. Time & Space Complexity**

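Time: O(n) in the number of input rows, since each row is read once; the work is split across partitions, with one shuffle to bring rows with the same keys together.

Space: Proportional to the number of distinct grouping-key combinations (high cardinality → more aggregation state and more shuffle memory).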

**⭐ 9. Common Pitfalls**

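Grouping by too many (or very high-cardinality) columns → many small groups, so pre-aggregation helps little and the shuffle becomes the bottleneck.

Forgetting to alias aggregates → auto-generated column names such as sum(Amount) and avg(Amount) that are awkward to reference downstream.

Calling df.select(...) before groupBy(...) and then referencing a column the select dropped or renamed → AnalysisException because the column cannot be resolved.

Ignoring nulls in numeric columns → sum and avg skip nulls and count(col) counts only non-null values, so averages are skewed if a null should really count as 0.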