# Groupby-aggregations 
By the end of this lecture you will be able to:
- do a group-by aggregation
- group by multiple columns
- sort group-by outputs
- group on a sorted column

In [3]:
import polars as pl

In [4]:
csv_file = "../../Files/Sample_Superstore.csv"

In [5]:
df = pl.read_csv(csv_file)
df.head(3)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


## Group-by and aggregation
In Polars we can group by a column and aggregate the data in other columns with the `group_by.agg` combination.

In this example we group by `Customer_Name` and take the rounded mean of the `Profit` column.

In [7]:
(
    df
    .group_by("Customer_Name")
    .agg(
        pl.col("Profit").mean().round()
    )
)

Customer_Name,Profit
str,f64
"""Roy Französisch""",25.0
"""Philisse Overcash""",64.0
"""Alan Schoenberger""",55.0
"""Beth Fritzler""",6.0
"""Jill Fjeld""",67.0
…,…
"""Becky Pak""",59.0
"""Anthony O'Donnell""",12.0
"""Rose O'Brian""",-105.0
"""Mark Haberlin""",7.0


> Why `group_by` and not `groupby`? The Polars API aims to be readable, and one of its conventions is to separate words with `_`.

Almost everything we do after this will be some variation on this basic pattern of `group_by` and `agg`.

Note that we passed an aggregation expression `pl.col("Profit").mean()` inside `agg` to get a single value for each group.

Let's see what happens if we don't pass an aggregation expression

In [8]:
(
    df
    .group_by("Customer_Name")
    .agg(
        pl.col("Profit").head(2)
    )
)

Customer_Name,Profit
str,list[f64]
"""Chuck Magee""","[0.1472, 10.626]"
"""Sam Zeldin""","[62.737, 8.88]"
"""Theresa Swint""","[4.5216, -2.2134]"
"""Lisa Ryan""","[-6.9828, -131.994]"
"""Tonja Turnell""","[2.3952, 130.7581]"
…,…
"""Michael Dominguez""","[9.0912, -24.803]"
"""Maureen Fritzler""","[3.792, -459.9875]"
"""Anthony Witt""","[-0.2685, -13.6152]"
"""Arthur Wiediger""","[-46.3946, 3.8272]"


In [19]:
(
    df
    .group_by("Customer_Name")
    .agg(
        pl.col("Profit").max().round()
    )
)


Customer_Name,Profit
str,f64
"""Valerie Mitchum""",337.0
"""Mick Brown""",71.0
"""Lycoris Saunders""",27.0
"""Frank Hawley""",314.0
"""Sandra Flanagan""",91.0
…,…
"""Ted Trevino""",190.0
"""Cassandra Brandow""",268.0
"""Shirley Daniels""",1906.0
"""Don Miller""",108.0


In [22]:
(
    df
    .group_by("Customer_Name")
    .agg(
        pl.col("Profit").sum()
    )
)

Customer_Name,Profit
str,f64
"""Dave Kipp""",536.3935
"""Barry Franz""",-291.3811
"""Steven Cartwright""",1276.6513
"""Paul Knutson""",-798.705
"""Benjamin Patterson""",-197.2695
…,…
"""Bradley Nguyen""",340.7054
"""Irene Maddox""",514.6527
"""Ann Blume""",-274.9604
"""Deanra Eno""",464.4714


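In the `head(2)` example above, the `Profit` column is a `pl.List` column: each row holds the values the expression returned for that group (here the first two). The `max` and `sum` examples are aggregations, so they return a single value per group.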


## What happens when we run `group_by.agg`?
While the full workings are more complicated than this, a basic description of the internal flow is:
- when we call `.group_by`, Polars creates a `GroupBy` object that captures the group-by parameters (e.g. the columns to group by) but **does not calculate the groups** until a further method (such as `agg`) is called on it
- when we call `agg` on the `GroupBy` object, Polars:
    - calculates the groups by getting the row indexes for each group
    - applies the expressions in `agg` to each group
    - joins the outputs of the expressions back to each group to create the output `DataFrame`

## Grouping by multiple columns
We can group by multiple columns by passing either a `list` of column names or the column names as separate comma-separated arguments to `group_by`.

In [9]:
(
    df
    .group_by("Customer_Name","Region")
    .agg(
        pl.col("Profit").mean()
    )
)

Customer_Name,Region,Profit
str,str,f64
"""Dennis Pardue""","""East""",3.3312
"""Cari Schnelling""","""Central""",-2.838
"""Theone Pippenger""","""East""",1.0178
"""Mike Gockenbach""","""Central""",-152.7156
"""Michael Paige""","""South""",-8.032933
…,…,…
"""Meg Tillman""","""East""",37.4875
"""Katherine Hughes""","""East""",6.4182
"""Anne Pryor""","""East""",8.533975
"""Julie Kriz""","""East""",35.8847


We can also use expressions inside `group_by` - in fact when we pass column names as strings (as above) Polars converts these to expressions internally.

Because we can pass expressions to `group_by`, we can also group by a transformed column. Here, for example, we group by the `Row_ID` column cast to `pl.Int64`. `Row_ID` is already `i64` in this dataset, so the cast only illustrates the syntax.

In [11]:
(
    df
    .group_by(pl.col("Row_ID").cast(pl.Int64))
    .agg(
        pl.col("Profit").mean()
    )
    .head()
)

Row_ID,Profit
i64,f64
2504,3.4686
1989,215.1198
4880,2.6068
5279,48.5392
6991,18.6606


In [24]:
(
    df
    .group_by("Customer_Name","Region")
    .agg(
        pl.col("Profit").max()
    )
)

Customer_Name,Region,Profit
str,str,f64
"""Clay Ludtke""","""East""",252.588
"""Jay Fein""","""South""",28.0032
"""Bradley Talbott""","""East""",56.264
"""Ruben Ausman""","""West""",314.2719
"""Benjamin Venier""","""Central""",51.75
…,…,…
"""Mark Van Huff""","""West""",9.2928
"""Kelly Lampkin""","""East""",0.307
"""Darren Powers""","""West""",180.7659
"""Barbara Fisher""","""South""",29.364


In [25]:
(
    df
    .group_by("Customer_Name","Region")
    .agg(
        pl.col("Profit").sum()
    )
)

Customer_Name,Region,Profit
str,str,f64
"""Nora Pelletier""","""West""",38.5009
"""Susan Gilcrest""","""East""",2.94
"""Bradley Drucker""","""Central""",309.0769
"""Herbert Flentye""","""West""",139.5054
"""Todd Sumrall""","""West""",-338.3503
…,…,…
"""Nathan Cano""","""East""",18.5136
"""Gene McClure""","""South""",0.9952
"""Eric Hoffmann""","""South""",9.072
"""Andrew Gjertsen""","""West""",1.7881


## Ordering of the output
We have seen that the output `DataFrame` has a different row order each time we run a query. This happens because Polars works out the row indexes for the group keys in parallel. This means that Polars:
- splits the group columns into chunks (e.g. first 10 rows in one chunk, second 10 rows in another chunk, etc.)
- finds the row indexes within each chunk on a separate thread
- brings the results from different threads back together
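The chunk-and-merge idea can be sketched in plain Python with a thread pool. This is a conceptual illustration only: the chunk size, the helper `rows_per_key` and the merge step are assumptions made for the example, not Polars internals. The sketch merges the chunks in a fixed order, so its result is deterministic, whereas Polars makes no promise about output order unless we ask for one.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy group keys (hypothetical), one per row
keys = ["Ann", "Bob", "Ann", "Cat", "Bob", "Ann"]
chunk_size = 2  # hypothetical chunk size


def rows_per_key(start: int) -> dict[str, list[int]]:
    """Find the row indexes for each key within one chunk."""
    found: dict[str, list[int]] = {}
    for offset, key in enumerate(keys[start:start + chunk_size]):
        found.setdefault(key, []).append(start + offset)
    return found


# Process each chunk on a separate thread
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(rows_per_key, range(0, len(keys), chunk_size)))

# Bring the per-chunk results back together
merged: dict[str, list[int]] = {}
for partial in partials:
    for key, rows in partial.items():
        merged.setdefault(key, []).extend(rows)

print(merged)
```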


We can force the order of the output to match the order the group keys occur in the input with the `maintain_order` argument

In [14]:
(
    df
    .group_by("Customer_Name",maintain_order=True)
    .agg(
        pl.col("Profit").mean()
    )
)

Customer_Name,Profit
str,f64
"""Claire Gute""",33.98688
"""Darrin Van Huff""",-47.464889
"""Sean O'Donnell""",-5.40572
"""Brosina Hoffman""",33.449575
"""Andrew Allen""",36.31895
…,…
"""Carl Jackson""",1.652
"""Roy Skaria""",3.1946
"""Sung Chung""",7.793925
"""Ricardo Emerson""",6.045


The first row is the `Claire Gute` group because the first row of `df` has `Claire Gute` as its `Customer_Name`, and so on.

Setting `maintain_order=True` affects performance to some extent. We also cannot use the streaming engine for large datasets when `maintain_order=True`.



## Groupby on a sorted column

Polars can also use a fast-track algorithm when the group-by column is sorted.

## Groupby on a list
We can group by a list column just as we can a non-list column.

First we create a `DataFrame` with a `pl.List` column

In [16]:
df_lists = (
    pl.DataFrame(
            {
                "lists": [
                    ["a", "b"],
                    ["a", "c"],
                    ["a", "b"],
                ]
            }
    )
    .with_row_index()
)


In [17]:
df_lists

index,lists
u32,list[str]
0,"[""a"", ""b""]"
1,"[""a"", ""c""]"
2,"[""a"", ""b""]"


Then we `group_by` and count the number of occurrences of each list

In [18]:
df_lists.group_by("lists").len()

lists,len
list[str],u32
"[""a"", ""c""]",1
"[""a"", ""b""]",2
