## Group iteration & aggregations
By the end of this lecture you will be able to:
- iterate over groups
- get group values
- do multiple aggregations

In [50]:
import polars as pl
import polars.selectors as cs



In [51]:
df = pl.read_csv("../Files/Sample_Superstore.csv")

## Iterating over groups
We can access the `DataFrame` for each group by looping over a `GroupBy` object.

The group key is a `tuple` even when we group by only one column. In the loop below, `category` is therefore a one-element tuple such as `('Furniture',)`. If we want a variable that holds the bare value from the column, we can unpack the tuple in the loop header by writing the first iteration variable as `(category,)`.

In [52]:
obj = df.group_by(['Category'])

In [53]:
type(obj)

polars.dataframe.group_by.GroupBy

In [54]:
obj

<polars.dataframe.group_by.GroupBy at 0x10c941550>

In [55]:
for category, group_df in df.group_by(['Category']):
    display(category)
    display(group_df.head(3))
    print("\n")

('Furniture',)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
4,,"""11-10-2015""",,"""Standard Class""","""SO-20335""","""Sean O'Donnell""","""Consumer""","""United States""","""Fort Lauderdale""","""Florida""",33311,"""South""","""FUR-TA-10000577""","""Furniture""","""Tables""","""Bretford CR4500 Series Slim Re…",957.5775,5,0.45,-383.031






('Technology',)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
8,"""CA-2014-115812""","""09-06-2014""","""14-06-2014""","""Standard Class""","""BH-11710""","""Brosina Hoffman""","""Consumer""",,"""Los Angeles""","""California""",90032,"""West""","""TEC-PH-10002275""","""Technology""","""Phones""","""Mitel 5320 IP Phone VoIP phone""",907.152,6,0.2,90.7152
12,"""CA-2014-115812""","""09-06-2014""","""14-06-2014""","""Standard Class""","""BH-11710""","""Brosina Hoffman""","""Consumer""","""United States""","""Los Angeles""","""California""",90032,"""West""","""TEC-PH-10002033""","""Technology""","""Phones""","""Konftel 250 Conference phone -…",911.424,4,0.2,68.3568
20,"""CA-2014-143336""","""27-08-2014""","""01-09-2014""","""Second Class""","""ZD-21925""","""Zuschuss Donatelli""","""Consumer""","""United States""","""San Francisco""","""California""",94109,"""West""","""TEC-PH-10001949""","""Technology""","""Phones""","""Cisco SPA 501G IP Phone""",213.48,3,0.2,16.011






('Office Supplies',)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714
5,"""US-2015-108966""","""11-10-2015""","""18-10-2015""","""Standard Class""","""SO-20335""","""Sean O'Donnell""","""Consumer""","""United States""",,"""Florida""",33311,"""South""","""OFF-ST-10000760""","""Office Supplies""","""Storage""","""Eldon Fold 'N Roll Cart System""",22.368,2,0.2,2.5164
7,"""CA-2014-115812""","""09-06-2014""","""14-06-2014""","""Standard Class""","""BH-11710""","""Brosina Hoffman""","""Consumer""","""United States""","""Los Angeles""","""California""",90032,"""West""","""OFF-AR-10002833""","""Office Supplies""","""Art""","""Newell 322""",7.28,4,0.0,1.9656






### Types of the group key and group value

In [56]:
for category, group_df in df.group_by(['Category']):
    display(type(category))
    display(type(group_df))
    print("\n")

tuple

polars.dataframe.frame.DataFrame





tuple

polars.dataframe.frame.DataFrame





tuple

polars.dataframe.frame.DataFrame





## Group values
We use `head` to get the first rows in each group.

In this example we return a `DataFrame` with the first 2 rows from each group.

In [None]:
df.group_by("Category").head(2)

We can also use `tail` to get the last rows of each group.

## Calling aggregations directly on `group_by`
We can call some aggregations directly on the `GroupBy` object without using `agg`. These shortcuts apply the aggregation to every non-key column.

In this example we count the number of rows per group, which gives a single column of counts alongside the group key.

In [None]:
df.group_by("Category").len()

The methods we can call on `GroupBy` include:
 - `first` get the first element of each group
 - `last` get the last element of each group
 - `n_unique` get the number of unique elements in each group
 - `len` get the number of rows in each group (`count` is the older, deprecated name for this)
 - `sum` sum the elements in each group
 - `min` get the smallest element in each group
 - `max` get the largest element in each group
 - `mean` get the average of elements in each group
 - `median` get the median in each group
 - `quantile` calculate quantiles in each group

We can also call aggregations directly on a group by in lazy mode, though not all of the methods above are supported there.
 

## Multiple aggregations on the same columns
When we apply different aggregations to the same columns, the outputs need distinct names. We can rename them with the `name.prefix` or `name.suffix` expressions.

In this example we get the `min` and `max` of the floating point columns grouped by `Category`. We then reorder the output so that the aggregations of each column sit next to each other, by sorting the column names inside a `pipe` function.

In [None]:
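group_column = "Category"
(
    df.group_by(group_column)
    .agg(
        # Suffix the output names so the min and max columns do not clash
        pl.col(pl.Float64).min().name.suffix("_min"),
        pl.col(pl.Float64).max().name.suffix("_max"),
    )
    .pipe(
        # Keep the group column first, then sort the remaining column names
        lambda out: out.select([group_column] + sorted(out.columns[1:]))
    )
)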