# Aggregate, apply, and groupby operations
Lukas Jarosch

## The titanic dataset
For this chapter, we will use the "titanic" dataset, which contains data about the passengers on the Titanic, including whether they survived.

In [1]:
import pandas as pd

df = pd.read_csv("../data/titanic_data.csv")
df

,survived,class,sex,age,siblings/spouses,parents/children,fare,who,embark_town,alone
0,False,3,male,22.0,1,0,7.2500,man,Southampton,False
1,True,1,female,38.0,1,0,71.2833,woman,Cherbourg,False
2,True,3,female,26.0,0,0,7.9250,woman,Southampton,True
3,True,1,female,35.0,1,0,53.1000,woman,Southampton,False
4,False,3,male,35.0,0,0,8.0500,man,Southampton,True
...,...,...,...,...,...,...,...,...,...,...
886,False,2,male,27.0,0,0,13.0000,man,Southampton,True
887,True,1,female,19.0,0,0,30.0000,woman,Southampton,True
888,False,3,female,,1,2,23.4500,woman,Southampton,False
889,True,1,male,26.0,0,0,30.0000,man,Cherbourg,True



## Aggregation
### Basic aggregation functions
We often want to compute summary statistics on our data. For that purpose, pandas provides several aggregation methods; the most important ones are listed below:


 Method | Description 
---------|----------
 `count()` | Total number of non-NaN items
 `mean()`, `median()` | Mean and median
 `min()`, `max()` | Minimum and maximum
 `std()`, `var()` | Standard deviation and variance
 `sum()` | Sum of all items

You can use these aggregation functions on Series and DataFrame objects. In the latter case, pandas will apply the functions to each column individually.

In [2]:
# demo of different functions on a Series
ages = df["age"]

print(ages.count(), ages.mean(), ages.min(), ages.max(), sep="\n")

714
29.69911764705882
0.42
80.0


In [3]:
# demo on the full DataFrame
df.count()

survived            891
class               891
sex                 891
age                 714
siblings/spouses    891
parents/children    891
fare                891
who                 891
embark_town         889
alone               891
dtype: int64

In [4]:
# many functions only work on numeric columns
numeric_cols = ["class", "age", "siblings/spouses", "parents/children", "fare"]

df[numeric_cols].mean()

class                2.308642
age                 29.699118
siblings/spouses     0.523008
parents/children     0.381594
fare                32.204208
dtype: float64

In [5]:
df[numeric_cols].max()

class                 3.0000
age                  80.0000
siblings/spouses      8.0000
parents/children      6.0000
fare                512.3292
dtype: float64
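
The difference between the `count()` results and the total number of rows comes from missing values. The following is a minimal sketch of how the aggregation methods treat NaN, using a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the titanic data). It also shows `numeric_only=True`, which is one way to restrict an aggregation to numeric columns without listing them by hand.

```python
import numpy as np
import pandas as pd

# Small stand-in for the titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
})

# count() only counts non-NaN entries, len() counts every row
n_valid = toy["age"].count()
n_rows = len(toy["age"])

# mean() skips NaN by default, so it divides by 3 here, not by 4
mean_default = toy["age"].mean()

# with skipna=False a single NaN makes the whole result NaN
mean_strict = toy["age"].mean(skipna=False)

# numeric_only=True drops non-numeric columns such as "sex"
numeric_means = toy.mean(numeric_only=True)

print(n_valid, n_rows, mean_default, mean_strict, sep="\n")
print(numeric_means)
```

Skipping NaN is usually what you want for summary statistics, but it means that two columns of the same DataFrame can be averaged over different numbers of rows.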

### `.agg()` method
It is also possible to compute multiple summary statistics in one call, using the `.agg()` method.

In [6]:
# .agg() on a series
df["fare"].agg(["min", "max", "mean", "median"])

min         0.000000
max       512.329200
mean       32.204208
median     14.454200
Name: fare, dtype: float64

In [7]:
# .agg() on a dataframe
df[numeric_cols].agg(["min", "max", "mean", "median"])

,class,age,siblings/spouses,parents/children,fare
min,1.0,0.42,0.0,0.0,0.0
max,3.0,80.0,8.0,6.0,512.3292
mean,2.308642,29.699118,0.523008,0.381594,32.204208
median,3.0,28.0,0.0,0.0,14.4542


It is even possible to specify which aggregation function should be used for each column by passing a dictionary that maps column names to function names.

In [8]:
df.agg({"age": "mean", "fare": "median"})

age     29.699118
fare    14.454200
dtype: float64
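
The dictionary values do not have to be single function names. As a minimal sketch on a made-up DataFrame (`toy` is an assumption for illustration), each column can also be mapped to a list of functions. The result is then a DataFrame with one row per function, and NaN wherever a function was not requested for a column.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# a list of function names per column gives one row per function
summary = toy.agg({"age": ["min", "max"], "fare": ["mean"]})
print(summary)
```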

## Apply
### `.apply()` on DataFrames
Pandas also provides the very flexible `.apply()` method for modifying and summarizing data. On a DataFrame, `.apply()` takes a function that receives a Series as input and calls it once for each column.

In [9]:
# example with a custom mean function (returns a single value)
def my_mean(series):
    # divide by the number of non-NaN values, because sum() skips NaN
    return series.sum() / series.count()

df[numeric_cols].apply(my_mean)

class                2.308642
age                 29.699118
siblings/spouses     0.523008
parents/children     0.381594
fare                32.204208
dtype: float64

In [10]:
# example with a function that converts all string columns to upper case
# (returns a Series of values)
str_cols = ["sex", "who", "embark_town"]

def to_upper(series):
    return series.str.upper()

df[str_cols].apply(to_upper)

,sex,who,embark_town
0,MALE,MAN,SOUTHAMPTON
1,FEMALE,WOMAN,CHERBOURG
2,FEMALE,WOMAN,SOUTHAMPTON
3,FEMALE,WOMAN,SOUTHAMPTON
4,MALE,MAN,SOUTHAMPTON
...,...,...,...
886,MALE,MAN,SOUTHAMPTON
887,FEMALE,WOMAN,SOUTHAMPTON
888,FEMALE,WOMAN,SOUTHAMPTON
889,MALE,MAN,CHERBOURG
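
`.apply()` can also work row by row instead of column by column when `axis=1` is passed. The sketch below uses a made-up DataFrame, and `family_size` is a hypothetical helper introduced only for this illustration. Each call receives one row as a Series indexed by column name.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "siblings/spouses": [1, 0, 3],
    "parents/children": [0, 0, 1],
})

def family_size(row):
    # row is a Series holding the values of one passenger
    return row["siblings/spouses"] + row["parents/children"] + 1

# axis=1 calls the function once per row instead of once per column
sizes = toy.apply(family_size, axis=1)
print(sizes.tolist())
```

For simple arithmetic like this, adding the two columns directly (`toy["siblings/spouses"] + toy["parents/children"] + 1`) gives the same result and is typically faster. Row-wise `.apply()` is mainly useful when the logic cannot easily be written with column operations.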


### `.apply()` on Series
If you use `.apply()` on a Series, the function is called once for each value, so it should work with single values instead of Series objects. Below, we will use `.apply()` on a Series to create a new column that converts numerical age into age groups.

In [11]:
def get_age_group(age):
    # return NaN if age is NaN
    if pd.isna(age):
        return age
    elif age < 3:
        return "baby"
    elif age < 18:
        return "minor"
    elif age < 60:
        return "adult"
    else:
        return "senior"

# create a new Series with age groups
age_groups = df["age"].apply(get_age_group)

# add it as a dataframe column
df["age group"] = age_groups

df

,survived,class,sex,age,siblings/spouses,parents/children,fare,who,embark_town,alone,age group
0,False,3,male,22.0,1,0,7.2500,man,Southampton,False,adult
1,True,1,female,38.0,1,0,71.2833,woman,Cherbourg,False,adult
2,True,3,female,26.0,0,0,7.9250,woman,Southampton,True,adult
3,True,1,female,35.0,1,0,53.1000,woman,Southampton,False,adult
4,False,3,male,35.0,0,0,8.0500,man,Southampton,True,adult
...,...,...,...,...,...,...,...,...,...,...,...
886,False,2,male,27.0,0,0,13.0000,man,Southampton,True,adult
887,True,1,female,19.0,0,0,30.0000,woman,Southampton,True,adult
888,False,3,female,,1,2,23.4500,woman,Southampton,False,
889,True,1,male,26.0,0,0,30.0000,man,Cherbourg,True,adult
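
As an alternative sketch, the same binning can be expressed with `pd.cut`, which assigns values to intervals without a hand-written function. The bin edges below mirror the thresholds in `get_age_group`, and the sample ages are made up for illustration. With `right=False` each interval includes its left edge and excludes its right edge, so an age of exactly 18 falls into "adult".

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary values and one missing value
ages = pd.Series([0.42, 2.0, 3.0, 17.0, 18.0, 59.0, 60.0, 80.0, np.nan])

binned = pd.cut(
    ages,
    bins=[0, 3, 18, 60, np.inf],             # [0, 3), [3, 18), [18, 60), [60, inf)
    labels=["baby", "minor", "adult", "senior"],
    right=False,                             # intervals are closed on the left
)
print(binned.tolist())
```

Note that `pd.cut` returns a categorical Series rather than plain strings, and missing ages stay missing.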


## The groupby method
### Basic grouping
Very often in data analysis, we are interested in computing separate summary statistics for different groups rather than on the whole dataframe. For example, we might be interested in the average fare for each passenger class or the average survival rate by age group. Such operations usually follow the **split-apply-combine** principle:
1. **split** your data into different groups
2. **apply** some summary function to each group separately
3. **combine** the results into one DataFrame again

![title](../img/split_apply_combine.png)

Image source: https://miro.medium.com/max/1400/1*w2oGdXv5btEMxAkAsz8fbg.png
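
To make the three steps concrete, they can be spelled out by hand with boolean indexing. This is only an illustrative sketch on a made-up DataFrame (`toy` and its values are assumptions), not how you would normally write it.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "class": [1, 3, 1, 2, 3, 3],
    "fare": [80.0, 8.0, 60.0, 20.0, 7.0, 9.0],
})

results = {}
for label in sorted(toy["class"].unique()):
    group = toy[toy["class"] == label]       # split: rows of one class
    results[label] = group["fare"].mean()    # apply: summary per group

# combine: collect the per-group results in one Series
mean_fare = pd.Series(results, name="fare")
mean_fare.index.name = "class"
print(mean_fare)
```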

In pandas, this is handled by the `.groupby()` method. As a first step, we can group our data by passenger class with the following syntax:

In [12]:
grouped = df.groupby("class")
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc5d0dae6d0>

Grouping in pandas is *lazy*: the GroupBy object does not compute anything until we call a method on it. To see how it will split our data, we can iterate over it, which yields each group label together with the rows that belong to that group.

In [13]:
for group, data in grouped:
    print(group)
    display(data)

1


,survived,class,sex,age,siblings/spouses,parents/children,fare,who,embark_town,alone,age group
1,True,1,female,38.0,1,0,71.2833,woman,Cherbourg,False,adult
3,True,1,female,35.0,1,0,53.1000,woman,Southampton,False,adult
6,False,1,male,54.0,0,0,51.8625,man,Southampton,True,adult
11,True,1,female,58.0,0,0,26.5500,woman,Southampton,True,adult
23,True,1,male,28.0,0,0,35.5000,man,Southampton,True,adult
...,...,...,...,...,...,...,...,...,...,...,...
871,True,1,female,47.0,1,1,52.5542,woman,Southampton,False,adult
872,False,1,male,33.0,0,0,5.0000,man,Southampton,True,adult
879,True,1,female,56.0,0,1,83.1583,woman,Cherbourg,False,adult
887,True,1,female,19.0,0,0,30.0000,woman,Southampton,True,adult


2


,survived,class,sex,age,siblings/spouses,parents/children,fare,who,embark_town,alone,age group
9,True,2,female,14.0,1,0,30.0708,child,Cherbourg,False,minor
15,True,2,female,55.0,0,0,16.0000,woman,Southampton,True,adult
17,True,2,male,,0,0,13.0000,man,Southampton,True,
20,False,2,male,35.0,0,0,26.0000,man,Southampton,True,adult
21,True,2,male,34.0,0,0,13.0000,man,Southampton,True,adult
...,...,...,...,...,...,...,...,...,...,...,...
866,True,2,female,27.0,1,0,13.8583,woman,Cherbourg,False,adult
874,True,2,female,28.0,1,0,24.0000,woman,Cherbourg,False,adult
880,True,2,female,25.0,0,1,26.0000,woman,Southampton,False,adult
883,False,2,male,28.0,0,0,10.5000,man,Southampton,True,adult


3


,survived,class,sex,age,siblings/spouses,parents/children,fare,who,embark_town,alone,age group
0,False,3,male,22.0,1,0,7.2500,man,Southampton,False,adult
2,True,3,female,26.0,0,0,7.9250,woman,Southampton,True,adult
4,False,3,male,35.0,0,0,8.0500,man,Southampton,True,adult
5,False,3,male,,0,0,8.4583,man,Queenstown,True,
7,False,3,male,2.0,3,1,21.0750,child,Southampton,False,baby
...,...,...,...,...,...,...,...,...,...,...,...
882,False,3,female,22.0,0,0,10.5167,woman,Southampton,True,adult
884,False,3,male,25.0,0,0,7.0500,man,Southampton,True,adult
885,False,3,female,39.0,0,5,29.1250,woman,Queenstown,False,adult
888,False,3,female,,1,2,23.4500,woman,Southampton,False,


The output shows that GroupBy splits our data based on the passenger class. In practice, you will rarely need to iterate over a GroupBy object explicitly; instead, you call summary functions directly on it.

In [14]:
grouped.count()

class,survived,sex,age,siblings/spouses,parents/children,fare,who,embark_town,alone,age group
1,216,216,186,216,216,216,216,214,216,186
2,184,184,173,184,184,184,184,184,184,173
3,491,491,355,491,491,491,491,491,491,355


This returns a new DataFrame with the group labels as the new index, and thereby completes the split-apply-combine scheme. It is also possible, and often more useful, to calculate the summary statistic only on certain columns.

In [15]:
# get the average fare by passenger class
df.groupby("class")["fare"].mean()

class
1    84.154687
2    20.662183
3    13.675550
Name: fare, dtype: float64

In [16]:
# use .agg() to get multiple summary statistics
df.groupby("class")["fare"].agg(["mean", "median"])

class,mean,median
1,84.154687,60.2875
2,20.662183,14.25
3,13.67555,8.05
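
When several columns need different statistics per group, pandas also supports "named aggregation", where each keyword argument to `.agg()` names an output column and maps it to a `(column, function)` pair. The sketch below uses a made-up DataFrame, and the output names `mean_fare`, `max_fare`, and `survival_rate` are chosen only for illustration. Taking the mean of a boolean column gives the fraction of True values, so it can serve as a survival rate.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "class": [1, 3, 1, 2, 3, 3],
    "fare": [80.0, 8.0, 60.0, 20.0, 7.0, 9.0],
    "survived": [True, False, True, True, False, True],
})

summary = toy.groupby("class").agg(
    mean_fare=("fare", "mean"),
    max_fare=("fare", "max"),
    survival_rate=("survived", "mean"),  # share of True values per group
)
print(summary)
```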


### Grouping on multiple columns
It is also possible to use multiple column identifiers for grouping. This will then create a separate group for each unique combination of column values.

In [17]:
df.groupby(["class", "age group"])[["age", "fare"]].mean()

class,age group,age,fare
1,adult,37.353503,89.447957
1,baby,1.46,151.55
1,minor,14.3,99.38583
1,senior,64.764706,60.033335
2,adult,32.592466,20.869349
2,baby,1.19,27.179171
2,minor,9.0,25.43125
2,senior,64.5,17.625
3,adult,28.926471,11.115852
3,baby,1.394667,23.525553


By default, this will put the group labels into a multi-level index. We won't cover how to work with such indices in this course, but a nice tutorial can be found in the [official pandas documentation](https://pandas.pydata.org/docs/user_guide/advanced.html). It is also possible to disable the creation of a multi-index and instead put the group labels into regular columns by passing `as_index=False` to `.groupby()`.

In [18]:
df.groupby(["class", "age group"], as_index=False)[["age", "fare"]].mean()

,class,age group,age,fare
0,1,adult,37.353503,89.447957
1,1,baby,1.46,151.55
2,1,minor,14.3,99.38583
3,1,senior,64.764706,60.033335
4,2,adult,32.592466,20.869349
5,2,baby,1.19,27.179171
6,2,minor,9.0,25.43125
7,2,senior,64.5,17.625
8,3,adult,28.926471,11.115852
9,3,baby,1.394667,23.525553
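
The difference between the two index layouts can be checked on a small example. This is a minimal sketch on a made-up DataFrame (`toy` is an assumption, and "sex" stands in as the second grouping column). It also shows that calling `.reset_index()` on the multi-index result is another way to get the group labels back as regular columns.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "class": [1, 1, 2, 2, 2],
    "sex": ["male", "female", "male", "male", "female"],
    "fare": [80.0, 60.0, 20.0, 10.0, 30.0],
})

keys = ["class", "sex"]

# default: group labels end up in a two-level index
multi = toy.groupby(keys)[["fare"]].mean()

# as_index=False: group labels stay regular columns
flat = toy.groupby(keys, as_index=False)[["fare"]].mean()

# reset_index() moves the index levels back into columns afterwards
flat_again = multi.reset_index()

print(multi.index.nlevels)
print(list(flat.columns))
```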
