# Groupby-aggregations 
By the end of this lecture you will be able to:
- do a group-by aggregation
- group by multiple columns
- sort group-by outputs
- group on a sorted column

In [2]:
import polars as pl

In [3]:
df = pl.read_csv("../../Files/Sample_Superstore.csv")


In [4]:
df.head(3)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit,Is_Return
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64,bool
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136,True
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582,True
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714,True


## Group-by and aggregation
In Polars we can group by a column and aggregate the data in other columns with the `group_by.agg` combination.

In this example we group by the `Category` column and count the values in the `Profit` column.

In [11]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").count()
    )
)

Category,Profit
str,u32
"""Technology""",1845
"""Furniture""",2119
,8
"""Office Supplies""",6022


> Why `group_by` and not `groupby`? The Polars API aims to be readable, and one of its conventions is to separate words with `_`.

Almost everything we do after this will be some variation on this basic pattern of `group_by` and `agg`.

Note that we passed an aggregation expression `pl.col("Profit").count()` inside `agg` to get a single value for each group.

Other aggregation expressions such as `min`, `max` and `sum` follow the same pattern.

In [12]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").min()
    )
)

Category,Profit
str,f64
"""Furniture""",-1862.3124
"""Office Supplies""",-3701.8928
,0.777
"""Technology""",-6599.978


In [13]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").max()
    )
)

Category,Profit
str,f64
,240.2649
"""Office Supplies""",4946.37
"""Furniture""",1013.127
"""Technology""",8399.976


In [20]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").sum()
    )
)

Category,Profit
str,f64
"""Office Supplies""",122469.5792
"""Technology""",145125.1584
,592.0532
"""Furniture""",18210.2309


If we instead pass `pl.col("Profit")` to `agg` without an aggregation, the `Profit` column in the output is a `pl.List` column with all the values for each group on each row.


## What happens when we run `group_by.agg`?
While the full workings are more complicated than this, a basic description of the internal flow is:
- when we call `.group_by` Polars creates a `GroupBy` object that captures the group-by parameters (e.g. the columns to group by) but **does not calculate the groups** until a further method (such as `agg`) is called on it
- when we call `agg` on the `GroupBy` object Polars:
    - calculates the groups by getting the row indexes for each group
    - applies the expressions in `agg` to each group
    - joins the outputs of the expressions back to each group key to create the output `DataFrame`
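The three steps inside `agg` can be mimicked in plain Python. This is only an illustrative sketch of the idea, not how Polars is implemented internally; the lists `categories` and `profits` are hypothetical toy data.

```python
from collections import defaultdict

# Hypothetical toy columns
categories = ["Furniture", "Technology", "Furniture", "Technology"]
profits = [10.0, 5.0, -2.0, 7.5]

# Step 1: calculate the groups as the row indexes belonging to each key
groups = defaultdict(list)
for row_index, key in enumerate(categories):
    groups[key].append(row_index)

# Step 2: apply the aggregation (a sum here) to the rows of each group
aggregated = {
    key: sum(profits[i] for i in row_indexes)
    for key, row_indexes in groups.items()
}

# Step 3: join each aggregated value back to its group key to build output rows
output = [
    {"Category": key, "Profit": value}
    for key, value in aggregated.items()
]

print(dict(groups))
print(output)
```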

## Grouping by multiple columns
We can group by multiple columns by passing either a `list` of column names to `group_by` or the column names as separate comma-separated arguments.

In [17]:
(
    df
    .group_by("Category","Region")
    .agg(
        pl.col("Profit").sum()
    )
) 

Category,Region,Profit
str,str,f64
"""Technology""","""South""",19991.8314
"""Technology""","""West""",44303.6496
,"""West""",242.7473
"""Furniture""","""East""",3045.3888
"""Office Supplies""","""West""",52607.3666
…,…,…
"""Furniture""","""West""",11264.6854
,"""South""",5.4432
"""Office Supplies""","""South""",19980.9496
"""Office Supplies""","""Central""",8879.9799


We can also use expressions inside `group_by` - in fact when we pass column names as strings (as above) Polars converts these to expressions internally.

Because `group_by` accepts expressions, we can also group by a transformed column. Here, for example, we group by the `Row_ID` column cast to `pl.Int64` (the column is already `i64` in this dataset, so the cast only illustrates the syntax).

In [15]:
(
    df
    .group_by(pl.col("Row_ID").cast(pl.Int64))
    .agg(
        pl.col("Profit").max()
    )
    .head()
)

Row_ID,Profit
i64,f64
9340,68.9631
9611,13.3056
3430,11.223
4067,0.8988
4213,1.9024


In [21]:
(
    df
    .group_by("Category","Region")
    .agg(
        pl.col("Profit").min()
    )
)

Category,Region,Profit
str,str,f64
,"""West""",2.4824
"""Furniture""","""West""",-814.4832
"""Technology""","""Central""",-1359.992
"""Office Supplies""","""East""",-1049.3406
,"""South""",5.4432
…,…,…
"""Office Supplies""","""West""",-694.2936
"""Office Supplies""","""South""",-1306.5504
"""Technology""","""South""",-3839.9904
"""Technology""","""East""",-6599.978


In [22]:
(
    df
    .group_by("Category","Region")
    .agg(
        pl.col("Profit").sum()
    )
)

Category,Region,Profit
str,str,f64
"""Office Supplies""","""West""",52607.3666
"""Office Supplies""","""Central""",8879.9799
"""Furniture""","""East""",3045.3888
"""Furniture""","""West""",11264.6854
"""Furniture""","""Central""",-2871.0494
…,…,…
,"""Central""",329.7897
"""Office Supplies""","""South""",19980.9496
"""Technology""","""East""",47462.0351
"""Technology""","""West""",44303.6496


## Ordering of the output
We have seen that the output `DataFrame` has a different order each time. This happens because Polars works out the row indexes for the group keys in parallel. This means that Polars:
- splits the group columns into chunks (e.g. first 10 rows in one chunk, second 10 rows in another chunk, etc)
- finds the row indexes within each chunk on a separate thread
- brings the results from different threads back together


We can force the order of the output to match the order the group keys occur in the input with the `maintain_order` argument

In [23]:
(
    df
    .group_by("Category",maintain_order=True)
    .agg(
        pl.col("Profit").mean()
    )
)

Category,Profit
str,f64
"""Furniture""",8.593785
"""Office Supplies""",20.337027
"""Technology""",78.658622
,74.00665


The first row is group `Furniture` because the `Category` in the first row of `df` is `Furniture`, and so on.

Setting `maintain_order=True` affects performance to some extent. We also cannot use the streaming engine for large datasets when `maintain_order=True`.



## Groupby on a list
We can group by a list column just as we can for non-list columns.

First we create a `DataFrame` with a `pl.List` column

In [16]:
list_df = pl.DataFrame(
    {
        "lists": [
            ["a", "b"],
            ["a", "c"],
            ["a", "b"],
        ]
    }
)

df_lists = (
    list_df
    .with_row_index()
)


In [17]:
df_lists

index,lists
u32,list[str]
0,"[""a"", ""b""]"
1,"[""a"", ""c""]"
2,"[""a"", ""b""]"


Then we `group_by` and count the number of occurrences of each list

In [18]:
df_lists.group_by("lists").len()

lists,len
list[str],u32
"[""a"", ""c""]",1
"[""a"", ""b""]",2
