---
title: "Named aggregations"
author: "Damien Martin"
date: "2024-10-28"
categories: [pandas]
---

# Problem

In pandas, we often want to aggregate after a groupby and rename the resulting columns at the same time. Let's see an example of this by getting the minimum and maximum daily temperatures from a temperature dataset:

In [14]:
import pandas as pd

from vega_datasets import data

seattle_temperatures = data('seattle-temps')
seattle_temperatures.head()

,date,temp
0,2010-01-01 00:00:00,39.4
1,2010-01-01 01:00:00,39.2
2,2010-01-01 02:00:00,39.0
3,2010-01-01 03:00:00,38.9
4,2010-01-01 04:00:00,38.8


We will do this the least sophisticated way: make a new column, group by it, and then apply an aggregation:

In [20]:
(
    seattle_temperatures
    .assign(date_only=seattle_temperatures['date'].dt.date)
    .groupby('date_only')
    ['temp']
    .agg(['min', 'max'])
).head()

date_only,min,max
2010-01-01,38.6,43.5
2010-01-02,38.8,43.8
2010-01-03,39.0,44.0
2010-01-04,39.2,44.2
2010-01-05,39.3,44.4


This works, but leaves awkward column names (`min` and `max`), with no reference to what they are the min or max _of_. It gets worse if we are trying to look at multiple columns of data:

In [21]:
# Including the date timestamps as part of the groupby doesn't really make sense here,
# but we include them anyway
(
    seattle_temperatures
    .assign(date_only=seattle_temperatures['date'].dt.date)
    .groupby('date_only')
    .agg({'temp': ['min', 'max'], 'date': 'count'})
).head()

,temp,temp,date
date_only,min,max,count
2010-01-01,38.6,43.5,24
2010-01-02,38.8,43.8,24
2010-01-03,39.0,44.0,24
2010-01-04,39.2,44.2,24
2010-01-05,39.3,44.4,24


Now we have context, but the columns have become a two-level `MultiIndex` that we have to deal with.
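
To make "deal with" concrete, here is a minimal sketch of the extra work the two-level columns create. It uses a hypothetical two-day hourly frame (`readings`) in place of the Seattle data, and flattening by joining the levels with an underscore is just one common workaround, not the only one:

```python
import pandas as pd

# Hypothetical stand-in for the Seattle data: 48 hourly readings over two days.
readings = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=48, freq="h"),
    "temp": [float(t) for t in range(48)],
})

nested = (
    readings
    .assign(date_only=readings["date"].dt.date)
    .groupby("date_only")
    .agg({"temp": ["min", "max"], "date": "count"})
)

# Selecting one column now needs a tuple key, one entry per column level.
daily_min = nested[("temp", "min")]

# Flattening by hand: join the two levels of every column label.
flat = nested.copy()
flat.columns = ["_".join(levels) for levels in flat.columns]
print(flat.columns.tolist())  # ['temp_min', 'temp_max', 'date_count']
```

This works, but the names are still generated mechanically rather than chosen by us.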

# Solution

Use the `agg(new_column_name=(column_name, aggfunc), ...)` syntax to rename the columns and keep them at a single level.
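
As a small sketch of the syntax in isolation, on a hypothetical toy frame (`toy`, with made-up `day` and `temp` columns): each keyword becomes an output column, and its value is a `(column, aggfunc)` pair. pandas also provides `pd.NamedAgg`, a namedtuple with those same two fields, for when we want them spelled out.

```python
import pandas as pd

# Hypothetical toy data: three readings on each of two days.
toy = pd.DataFrame({
    "day": ["mon", "mon", "mon", "tue", "tue", "tue"],
    "temp": [38.6, 41.0, 43.5, 38.8, 40.2, 43.8],
})

# Keyword = output column name; value = (input column, aggregation).
with_tuples = toy.groupby("day").agg(
    high=("temp", "max"),
    low=("temp", "min"),
)

# The same aggregation with the fields named explicitly.
with_namedagg = toy.groupby("day").agg(
    high=pd.NamedAgg(column="temp", aggfunc="max"),
    low=pd.NamedAgg(column="temp", aggfunc="min"),
)

print(with_tuples.columns.tolist())  # ['high', 'low']
```

Both forms give the same single-level result, with the columns in the order the keywords were written.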

# Examples

Solving the first aggregation (min and max temperatures per day):

In [24]:
(
    seattle_temperatures
    .assign(date_only=seattle_temperatures['date'].dt.date)
    .groupby('date_only')
    .agg(daily_high_temp=('temp', 'max'), daily_low_temp=('temp', 'min'))
).head()

date_only,daily_high_temp,daily_low_temp
2010-01-01,43.5,38.6
2010-01-02,43.8,38.8
2010-01-03,44.0,39.0
2010-01-04,44.2,39.2
2010-01-05,44.4,39.3
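
If we select the single column first, as in the original `['temp']` version, the grouped Series accepts a shorter form of the same syntax: the keyword is still the new name, but the value is just the aggregation, since there is only one column it could apply to. A sketch on a hypothetical two-day hourly frame (`readings`) standing in for the Seattle data:

```python
import pandas as pd

# Hypothetical stand-in for the Seattle data: 48 hourly readings over two days.
readings = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=48, freq="h"),
    "temp": [float(t) for t in range(48)],
})

daily = (
    readings
    .assign(date_only=readings["date"].dt.date)
    .groupby("date_only")["temp"]
    # On a grouped Series, no column name is needed in the value.
    .agg(daily_high_temp="max", daily_low_temp="min")
)

print(daily.columns.tolist())  # ['daily_high_temp', 'daily_low_temp']
```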


The second aggregation is solved similarly:

In [25]:
(
    seattle_temperatures
    .assign(date_only=seattle_temperatures['date'].dt.date)
    .groupby('date_only')
    .agg(
        daily_high_temp=('temp', 'max'),
        daily_low_temp=('temp', 'min'),
        num_measurements=('date', 'count'),
    )
).head()

date_only,daily_high_temp,daily_low_temp,num_measurements
2010-01-01,43.5,38.6,24
2010-01-02,43.8,38.8,24
2010-01-03,44.0,39.0,24
2010-01-04,44.2,39.2,24
2010-01-05,44.4,39.3,24


Finally, we can also use `pd.Grouper` to avoid creating an intermediate column, provided we make the date column the index:

In [27]:
(
    seattle_temperatures
    .set_index('date')
    .groupby(pd.Grouper(freq='1D'))
    .agg(daily_high_temp=('temp', 'max'), daily_low_temp=('temp', 'min'))
).head()

date,daily_high_temp,daily_low_temp
2010-01-01,43.5,38.6
2010-01-02,43.8,38.8
2010-01-03,44.0,39.0
2010-01-04,44.2,39.2
2010-01-05,44.4,39.3
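
If we would rather not move `date` into the index either, `pd.Grouper` also accepts a `key` argument naming the column to group on. A minimal sketch, again using a hypothetical two-day hourly frame (`readings`) rather than the Seattle data:

```python
import pandas as pd

# Hypothetical stand-in for the Seattle data: 48 hourly readings over two days.
readings = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=48, freq="h"),
    "temp": [float(t) for t in range(48)],
})

daily = (
    readings
    # key= names the datetime column, so no set_index step is needed.
    .groupby(pd.Grouper(key="date", freq="D"))
    .agg(daily_high_temp=("temp", "max"), daily_low_temp=("temp", "min"))
)

print(daily.index.name)  # date
```

One difference to keep in mind: grouping with `pd.Grouper` yields a `DatetimeIndex` of timestamps, whereas the `date_only` approach yields an index of plain Python `date` objects.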
