**⭐ 1. What This Pattern Solves**

Groups rows by one or more key columns and computes aggregates (sum, average, count, min, max) for each group. Used for summarizing, reporting, and pre-aggregating data.

Use cases:

Total sales per customer

Average session duration per day

Maximum/minimum order amount per region

Pre-aggregating before writing to Delta/S3

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- SQL version
SELECT CustomerID, SUM(TotalAmount) AS TotalSpent
FROM Orders
GROUP BY CustomerID;


**⭐ 3. Core Idea**

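groupBy defines the grouping key(s) and returns a GroupedData object; agg specifies the aggregate(s) to compute and returns a DataFrame with one row per distinct key combination. Several aggregates can be computed in a single agg call using functions from pyspark.sql.functions.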

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql import functions as F

# Single or multiple aggregates
df.groupBy("col1", "col2") \
  .agg(
      F.sum("metric1").alias("sum_metric1"),
      F.avg("metric2").alias("avg_metric2")
  )

**⭐ 5. Detailed Example**

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("Alice", "2025-01-01", 100),
    ("Alice", "2025-01-02", 150),
    ("Bob", "2025-01-01", 200),
    ("Bob", "2025-01-02", 50)
]

df = spark.createDataFrame(data, ["Customer", "Date", "Amount"])

# Aggregate
result = df.groupBy("Customer").agg(
    F.sum("Amount").alias("TotalAmount"),
    F.avg("Amount").alias("AvgAmount")
)

result.show()


In [0]:
+--------+-----------+---------+
|Customer|TotalAmount|AvgAmount|
+--------+-----------+---------+
|   Alice|        250|    125.0|
|     Bob|        250|    125.0|
+--------+-----------+---------+


**⭐ 6. Mini Practice Problems**

Find total and average order amount per Region.

Count number of orders per Customer.

Find max and min TransactionAmount per Day.

**⭐ 7. Full Data Engineering Problem**

Scenario: You have a Bronze sales dataset in S3 with CustomerID, OrderDate, Amount. You need to create a Silver aggregate table showing daily total and average sales per customer.

Steps:

Read Bronze CSV/Parquet.

Use groupBy("CustomerID", "OrderDate").

Compute sum and avg.

Write aggregated table to Delta as Silver.

This mirrors a common daily reporting job in data engineering pipelines.

**⭐ 8. Time & Space Complexity**

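Time: Roughly O(n) in the number of rows, since each row is folded into its group's running aggregate once. In practice the shuffle that brings rows with the same key onto the same partition often dominates; Spark reduces it by aggregating partially within each partition first.

Space: Proportional to the number of distinct group keys held in each partition's aggregation buffer. High cardinality → more executor memory needed, and possibly spilling to disk.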
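The cost model can be illustrated with a conceptual sketch in plain Python. This is not Spark's actual implementation; the function names partial_aggregate and merge_partials are hypothetical, and the nested lists simply imitate two partitions. It shows that each row is visited once and that memory grows with the number of distinct keys rather than the number of rows.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Aggregate one partition locally: one pass, one entry per distinct key."""
    totals = defaultdict(int)
    for key, amount in partition:
        totals[key] += amount
    return dict(totals)


def merge_partials(partials):
    """Combine per-partition subtotals, as would happen after the shuffle."""
    merged = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] += subtotal
    return dict(merged)


# Hypothetical data split into two partitions
partitions = [
    [("Alice", 100), ("Bob", 200), ("Alice", 150)],
    [("Bob", 50), ("Alice", 25)],
]

partials = [partial_aggregate(p) for p in partitions]
totals = merge_partials(partials)

# Each dict holds one entry per distinct key, however many rows it absorbed
print(partials)
print(totals)
```

Only the small per-partition dictionaries need to cross the "shuffle" boundary here, which is the intuition behind partial aggregation.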

**⭐ 9. Common Pitfalls**

Forgetting .alias() → auto-generated column names such as sum(Amount), which are awkward to reference downstream.

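Using Python built-ins (e.g., sum) instead of F.sum → the built-ins work on Python iterables, not Spark columns, so the call fails with a TypeError instead of building an aggregate. The reverse also bites: from pyspark.sql.functions import * shadows the built-in sum, min, and max.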

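Grouping on too many high-cardinality columns → a huge number of groups, heavy shuffles, and executor memory pressure. Driver memory becomes a problem only if the large result is then collected to the driver.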

Forgetting to cache when the result is reused by several actions → the aggregation is recomputed each time.