---
title: "Pandas: Transform like behavior using data from multiple columns"
author: "Damien Martin"
date: "2024-10-28"
categories: [pandas]
---

# The problem

We have a couple of different patterns for aggregating data collections:

- `groupby([group_cols]).apply`: groups the rows, then applies an arbitrary function to each group. The resulting dataframe has a different index (the grouping variable).
- `groupby([group_cols])['column'].transform`: groups the rows, applies a function to the `column` data within each group, and returns the result for every row, i.e. the returned index is the same as the original (rows are preserved).

For example, suppose we have the grades of students from different houses in a potions class:

In [2]:
#| echo: False
import pandas as pd

grades = pd.DataFrame(
    [{'name': 'Harry', 'house': 'Griffindor', 'class': 'potions', 'grade': 68},
    {'name': 'Hermionie', 'house': 'Griffindor', 'class': 'potions', 'grade': 99},
    {'name': 'Ron', 'house': 'Griffindor', 'class': 'potions', 'grade': 34},
    {'name': 'Neville', 'house': 'Griffindor', 'class': 'potions', 'grade': 55},
    {'name': 'Dracro', 'house': 'Slytherin', 'class': 'potions', 'grade': 85},
    {'name': 'Goyle', 'house': 'Slytherin', 'class': 'potions', 'grade': 70},
    {'name': 'Crabble', 'house': 'Slytherin', 'class': 'potions', 'grade': 75},
    {'name': 'Daphne', 'house': 'Slytherin', 'class': 'potions', 'grade': 88},]
)
grades

|   | name | house | class | grade |
|---|------|-------|-------|-------|
| 0 | Harry | Griffindor | potions | 68 |
| 1 | Hermionie | Griffindor | potions | 99 |
| 2 | Ron | Griffindor | potions | 34 |
| 3 | Neville | Griffindor | potions | 55 |
| 4 | Dracro | Slytherin | potions | 85 |
| 5 | Goyle | Slytherin | potions | 70 |
| 6 | Crabble | Slytherin | potions | 75 |
| 7 | Daphne | Slytherin | potions | 88 |


`apply` creates a new index, with one entry per house:

In [4]:
grades.groupby('house')['grade'].apply(lambda x: x.mean())

house
Griffindor    64.0
Slytherin     79.5
Name: grade, dtype: float64
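
To make the index difference concrete, here is a minimal sketch that rebuilds only the `house` and `grade` columns (values copied from the table above) and inspects the result of `apply`. The names `house_means` and `misaligned` are illustrative. Because pandas aligns on index labels when assigning a Series as a column, the per-house result cannot simply be attached back to the rows:

```python
import pandas as pd

# Only the two columns needed here; values copied from the grades table.
grades = pd.DataFrame({
    "house": ["Griffindor"] * 4 + ["Slytherin"] * 4,
    "grade": [68, 99, 34, 55, 85, 70, 75, 88],
})

house_means = grades.groupby("house")["grade"].apply(lambda x: x.mean())

# One entry per group, labelled by house rather than by row number.
print(list(house_means.index))
print(len(house_means), "groups vs", len(grades), "rows")

# Column assignment aligns on index labels. The house labels never match
# the row labels 0..7, so every value in the new column is NaN.
misaligned = grades.assign(house_mean=house_means)
print(misaligned["house_mean"].isna().all())
```

This is the gap that `transform` fills.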

Whereas `transform` returns one value per original row, so we can assign the result as a new column:

In [6]:
grades['house_mean'] = grades.groupby('house')['grade'].transform(lambda x: x.mean())
grades

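|   | name | house | class | grade | house_mean |
|---|------|-------|-------|-------|------------|
| 0 | Harry | Griffindor | potions | 68 | 64.0 |
| 1 | Hermionie | Griffindor | potions | 99 | 64.0 |
| 2 | Ron | Griffindor | potions | 34 | 64.0 |
| 3 | Neville | Griffindor | potions | 55 | 64.0 |
| 4 | Dracro | Slytherin | potions | 85 | 79.5 |
| 5 | Goyle | Slytherin | potions | 70 | 79.5 |
| 6 | Crabble | Slytherin | potions | 75 | 79.5 |
| 7 | Daphne | Slytherin | potions | 88 | 79.5 |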
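
One way to think about what `transform` does with an aggregating function is "aggregate per group, then broadcast the result back onto each row". The sketch below illustrates that equivalence; it is not a description of how pandas implements it internally. It rebuilds only the `house` and `grade` columns, and `via_transform` and `via_map` are illustrative names:

```python
import pandas as pd

# Only the two columns needed here; values copied from the grades table.
grades = pd.DataFrame({
    "house": ["Griffindor"] * 4 + ["Slytherin"] * 4,
    "grade": [68, 99, 34, 55, 85, 70, 75, 88],
})

# Row-preserving result straight from transform (string alias for the mean).
via_transform = grades.groupby("house")["grade"].transform("mean")

# Same values built by hand: aggregate once per house, then look up each
# row's house in the aggregated Series.
house_means = grades.groupby("house")["grade"].mean()
via_map = grades["house"].map(house_means)

# Both are indexed 0..7 and hold the same per-row values.
print(via_transform.index.equals(via_map.index))
print((via_transform.to_numpy() == via_map.to_numpy()).all())
```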


Repeating the house mean on every row isn't very useful on its own, but we could turn it into something like a z-score within each house:

In [10]:
grades['z_score_within_house'] = grades.groupby('house')['grade'].transform(lambda x: (x-x.mean())/x.std())
grades

|   | name | house | class | grade | house_mean | z_score | z_score_within_house |
|---|------|-------|-------|-------|------------|---------|----------------------|
| 0 | Harry | Griffindor | potions | 68 | 64.0 | 0.146977 | 0.146977 |
| 1 | Hermionie | Griffindor | potions | 99 | 64.0 | 1.286046 | 1.286046 |
| 2 | Ron | Griffindor | potions | 34 | 64.0 | -1.102326 | -1.102326 |
| 3 | Neville | Griffindor | potions | 55 | 64.0 | -0.330698 | -0.330698 |
| 4 | Dracro | Slytherin | potions | 85 | 79.5 | 0.65273 | 0.65273 |
| 5 | Goyle | Slytherin | potions | 70 | 79.5 | -1.127443 | -1.127443 |
| 6 | Crabble | Slytherin | potions | 75 | 79.5 | -0.534052 | -0.534052 |
| 7 | Daphne | Slytherin | potions | 88 | 79.5 | 1.008764 | 1.008764 |
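
As a sanity check on the z-score, the sketch below recomputes it with a named helper (`z_score` is a hypothetical name, not from the original notebook) and confirms that each house ends up with mean 0 and standard deviation 1. Note that `Series.std` defaults to the sample standard deviation (`ddof=1`), which is what the values in the table reflect:

```python
import pandas as pd

# Only the two columns needed here; values copied from the grades table.
grades = pd.DataFrame({
    "house": ["Griffindor"] * 4 + ["Slytherin"] * 4,
    "grade": [68, 99, 34, 55, 85, 70, 75, 88],
})


def z_score(x: pd.Series) -> pd.Series:
    """Standardise one group's values using the sample std (ddof=1)."""
    return (x - x.mean()) / x.std()


grades["z_score_within_house"] = (
    grades.groupby("house")["grade"].transform(z_score)
)

# Within each house the z-scores should average to 0 with unit std,
# up to floating-point error.
check = grades.groupby("house")["z_score_within_house"].agg(["mean", "std"])
print(check.round(6))
```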
