# Advanced Aggregations – CUBE, ROLLUP, Grouping Sets

**Dataset**: `samples.tpch.lineitem`

In this notebook you will:
1. Compute a revenue metric on TPC-H `lineitem`
2. Aggregate with a standard `groupBy`
3. Use `rollup` and `cube` for multi-dimensional analytics
4. Tell subtotal rows from real NULLs with the `grouping` and `grouping_id` functions
5. Pick exact combinations with grouping sets


In [None]:
from pyspark.sql import functions as F

lineitem_df = spark.read.table("samples.tpch.lineitem")

display(lineitem_df.limit(5))
print("Count:", lineitem_df.count())


## 1. Baseline Revenue Calculation

We define revenue per line item as the extended price net of its fractional discount:
- `revenue = l_extendedprice * (1 - l_discount)`
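
As a quick sanity check on the formula above, here is a plain-Python sketch of the same arithmetic for a single line item. The helper name `line_revenue` and the sample values are illustrative assumptions; they are not part of the dataset or of the Spark API.

```python
from decimal import Decimal

def line_revenue(extended_price: Decimal, discount: Decimal) -> Decimal:
    """Revenue for one line item: extended price net of the fractional discount."""
    return extended_price * (Decimal("1") - discount)

# Hypothetical line item: price 1000.00 with a 5% discount (0.05).
print(line_revenue(Decimal("1000.00"), Decimal("0.05")))
```

A discount of 0 leaves the extended price unchanged, which is a handy check when eyeballing the `revenue` column.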


In [None]:
revenue_df = lineitem_df.withColumn(
    "revenue",
    F.col("l_extendedprice") * (1 - F.col("l_discount"))
)

display(revenue_df.select("l_orderkey", "l_linestatus", "l_shipmode", "revenue").limit(10))


## 2. Standard `groupBy`

Example:
- Total revenue by `l_returnflag` and `l_linestatus`


In [None]:
gb_basic_df = (
    revenue_df
    .groupBy("l_returnflag", "l_linestatus")
    .agg(
        F.round(F.sum("revenue"), 2).alias("sum_revenue"),
        F.count("*").alias("row_count")
    )
    .orderBy("l_returnflag", "l_linestatus")
)

display(gb_basic_df)


## 3. `ROLLUP` – Hierarchical Aggregation

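`ROLLUP(a, b)` aggregates over a hierarchy of grouping sets, dropping columns from right to left:
- (a, b): detail rows
- (a, NULL): subtotal for each `a`
- (NULL, NULL): grand total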

We'll roll up by:
- `l_returnflag` (top level)
- `l_linestatus` (lower level)
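
The hierarchy described above can be sketched in plain Python: a rollup over N columns yields N + 1 grouping sets, one for each prefix of the column list. The function name `rollup_sets` is a hypothetical helper for illustration, not a Spark function.

```python
def rollup_sets(cols):
    """Grouping sets produced by ROLLUP: every prefix of cols, longest first."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

# The two rollup columns used in this section.
for grouping_set in rollup_sets(["l_returnflag", "l_linestatus"]):
    print(grouping_set)
```

Column order matters: swapping the two columns would give subtotals per `l_linestatus` instead of per `l_returnflag`.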


In [None]:
rollup_df = (
    revenue_df
    .rollup("l_returnflag", "l_linestatus")
    .agg(
        F.round(F.sum("revenue"), 2).alias("sum_revenue"),
        F.count("*").alias("row_count")
    )
    .orderBy("l_returnflag", "l_linestatus")
)

display(rollup_df)


## 4. `CUBE` – All Combinations of Dimensions

`CUBE(a, b)` aggregates over every subset of the listed columns:
- (a, b): detail rows
- (a, NULL): subtotal for each `a`
- (NULL, b): subtotal for each `b`
- (NULL, NULL): grand total

We'll cube by:
- `l_shipmode`
- `l_returnflag`
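
To make the combinations concrete, the following plain-Python sketch enumerates the grouping sets a cube covers: every subset of the column list, so 2^N sets for N columns. The function name `cube_sets` is a hypothetical helper, not part of PySpark.

```python
from itertools import combinations

def cube_sets(cols):
    """Grouping sets produced by CUBE: every subset of cols, largest first."""
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]

# The two cube columns used in this section.
for grouping_set in cube_sets(["l_shipmode", "l_returnflag"]):
    print(grouping_set)
```

Because the number of sets doubles with each added column, cubes over many dimensions can produce far more output rows than a rollup over the same columns.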


In [None]:
cube_df = (
    revenue_df
    .cube("l_shipmode", "l_returnflag")
    .agg(
        F.round(F.sum("revenue"), 2).alias("sum_revenue"),
        F.count("*").alias("row_count")
    )
    .orderBy("l_shipmode", "l_returnflag")
)

display(cube_df)


## 5. Understanding `grouping` and `grouping_id`

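In rollup/cube output, a NULL in a grouping column is ambiguous. It can mean:
- a real NULL value in the data, or
- a subtotal or grand-total row in which that column was aggregated away.

`grouping(col)` returns 1 if the column is aggregated away in that row and 0 otherwise. `grouping_id(col1, ..., colN)` packs those flags into a single integer, with the first column as the most significant bit; for two columns it equals `grouping(col1) * 2 + grouping(col2)`.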
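
To see how the per-column flags combine into one number, here is a plain-Python sketch of the `grouping_id` bitmask. It assumes Spark's documented bit layout, where the first grouping column is the most significant bit. The local function below only mimics that arithmetic; it is not the PySpark function, and its second parameter `kept_cols` is a hypothetical stand-in for the columns still present in a given output row.

```python
def grouping_id(grouping_cols, kept_cols):
    """Bitmask with one bit per grouping column, first column most significant.

    A bit is 1 when the column is aggregated away (absent from kept_cols).
    """
    gid = 0
    for col in grouping_cols:
        gid = (gid << 1) | (0 if col in kept_cols else 1)
    return gid

dims = ["l_shipmode", "l_returnflag"]
for kept in (dims, ["l_shipmode"], ["l_returnflag"], []):
    print(kept, grouping_id(dims, kept))
```

Filtering on this value is a compact way to pull out one level of the cube, for example only the grand-total row.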


In [None]:
cube_flags_df = (
    revenue_df
    .cube("l_shipmode", "l_returnflag")
    .agg(
        F.round(F.sum("revenue"), 2).alias("sum_revenue"),
        F.count("*").alias("row_count"),
        F.grouping("l_shipmode").alias("g_shipmode"),
        F.grouping("l_returnflag").alias("g_returnflag"),
        F.grouping_id("l_shipmode", "l_returnflag").alias("grouping_id")
    )
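    # grouping_id: 0 = detail rows, 1 = per-shipmode subtotals,
    # 2 = per-returnflag subtotals, 3 = grand total
    .orderBy("grouping_id", "l_shipmode", "l_returnflag")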
)

display(cube_flags_df.limit(50))


## 6. Grouping Sets (Manual Control)

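Grouping sets let you list exactly which combinations to aggregate, rather than taking every set that `ROLLUP` or `CUBE` generates.

Example grouping sets (the same three that `ROLLUP(l_shipmode, l_returnflag)` would produce):
- (l_shipmode, l_returnflag)
- (l_shipmode)
- ()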
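
The semantics of the three sets listed above can be emulated on a few in-memory rows. This is a minimal plain-Python sketch, not how Spark executes the query: the function `aggregate_grouping_sets` and the toy rows are hypothetical, and the sketch assumes the grouping sets are distinct and the data contains no real NULLs.

```python
from collections import defaultdict

def aggregate_grouping_sets(rows, dims, grouping_sets, measure):
    """Sum `measure` once per grouping set; dimensions outside the set become None."""
    totals = defaultdict(float)
    for kept in grouping_sets:
        for row in rows:
            key = tuple(row[d] if d in kept else None for d in dims)
            totals[key] += row[measure]
    return dict(totals)

# Hypothetical toy rows, not taken from samples.tpch.lineitem.
rows = [
    {"l_shipmode": "AIR", "l_returnflag": "N", "revenue": 10.0},
    {"l_shipmode": "AIR", "l_returnflag": "R", "revenue": 5.0},
    {"l_shipmode": "RAIL", "l_returnflag": "N", "revenue": 20.0},
]
dims = ["l_shipmode", "l_returnflag"]
result = aggregate_grouping_sets(
    rows,
    dims,
    grouping_sets=[("l_shipmode", "l_returnflag"), ("l_shipmode",), ()],
    measure="revenue",
)
for key, total in result.items():
    print(key, total)
```

Each input row contributes once to every grouping set, which is why the output mixes detail rows, per-shipmode subtotals, and a single grand total.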


In [None]:
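# pyspark.sql.functions has no `groupingSets`, so this cell uses SQL
# GROUP BY GROUPING SETS through a temporary view (the view name is arbitrary).
# Newer PySpark releases (4.0+) also offer DataFrame.groupingSets.
revenue_df.createOrReplaceTempView("revenue_v")

grouping_sets_df = spark.sql("""
    SELECT
        l_shipmode,
        l_returnflag,
        ROUND(SUM(revenue), 2) AS sum_revenue,
        COUNT(*) AS row_count
    FROM revenue_v
    GROUP BY GROUPING SETS (
        (l_shipmode, l_returnflag),
        (l_shipmode),
        ()
    )
    ORDER BY l_shipmode, l_returnflag
""")

display(grouping_sets_df)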
