## Groupby in Pandas

In [1]:
import pandas as pd
import numpy as np

In [2]:
students = pd.DataFrame({
    "student": ["Alice", "Bob", "Caro", "Dan", "Eve"],
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
    "class": ["A", "B", "A", "B", "A"]
})

students

,student,gender,score,class
0,Alice,F,88,A
1,Bob,M,72,B
2,Caro,F,91,A
3,Dan,M,85,B
4,Eve,F,77,A


groupby is a three-step process:

Split → Apply → Combine

1. Split the data into groups

2. Apply a function to each group

3. Combine the results into a new object
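
The three steps above can be sketched by hand to see what groupby does behind the scenes. This is only an illustration on a cut-down copy of the `students` table; the names `pieces`, `manual_means` and `builtin_means` are made up for this example.

```python
import pandas as pd

# cut-down copy of the students table (only the columns needed here)
students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
})

# 1. Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
# 2. Apply: compute the mean score of each piece
pieces = {key: group["score"].mean() for key, group in students.groupby("gender")}

# 3. Combine: assemble the per-group results into a single Series
manual_means = pd.Series(pieces, name="score").rename_axis("gender")

# the usual one-liner performs all three steps at once
builtin_means = students.groupby("gender")["score"].mean()
```

Both `manual_means` and `builtin_means` hold one mean score per gender.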

In [3]:
students.groupby("gender") # creates a GroupBy object; nothing is computed until an aggregation is applied

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7b24746bb800>

## Applying an aggregation function to a groupby object

### Single Column, Single Aggregation

In [4]:
# using the built-in aggregation methods (mean, median, std, var, sum, etc.)
mean_score_per_gender = students.groupby("gender")["score"].mean() # results in a Series
mean_score_per_gender

gender,score
F,85.333333
M,78.5


In [5]:
# using the .agg method
mean_score_per_gender1 = students.groupby("gender")["score"].agg("mean")
mean_score_per_gender == mean_score_per_gender1 # check for equality

gender,score
F,True
M,True
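
The `==` comparison above returns one boolean per group. When a single True/False answer is wanted, `Series.equals` is one option; the sketch below rebuilds the two results on a cut-down copy of the data, and the names `via_method`, `via_agg`, `elementwise` and `same` are made up for this example.

```python
import pandas as pd

students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
})

via_method = students.groupby("gender")["score"].mean()
via_agg = students.groupby("gender")["score"].agg("mean")

# == compares element by element and returns a boolean Series
elementwise = via_method == via_agg

# .equals collapses the comparison into a single boolean
same = via_method.equals(via_agg)
```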


Aggregating a single column of a groupby object returns a pandas Series indexed by the grouping column. Resetting the index turns that Series into a DataFrame with two columns.

In [6]:
# the gender is now the index
mean_score_per_gender["F"]

np.float64(85.33333333333333)

In [7]:
type(mean_score_per_gender1.reset_index()) # resetting index results in a DataFrame, with "gender" being the first column
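
As an alternative to calling reset_index afterwards, `groupby` accepts `as_index=False`, which keeps the grouping column as a regular column. A minimal sketch on a cut-down copy of the data (the names `by_reset` and `by_flag` are made up for this example):

```python
import pandas as pd

students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
})

# aggregate first, then move the index into a column
by_reset = students.groupby("gender")["score"].mean().reset_index()

# ask groupby to keep "gender" as a column from the start
by_flag = students.groupby("gender", as_index=False)["score"].mean()
```

Both results are DataFrames with the columns `gender` and `score`.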

### Single Column, Multiple Aggregations

Pass a list of appropriate functions to the .agg method.

In [8]:
students.groupby("gender")["score"].agg(["mean", "min", "max"]) # results in a DataFrame

gender,mean,min,max
F,85.333333,77,91
M,78.5,72,85
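
The list passed to .agg can also contain your own functions alongside the string names. The sketch below assumes a hypothetical helper, `score_range`, that is not part of pandas; its name becomes the output column label.

```python
import pandas as pd

students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
})

def score_range(s):
    """Hypothetical helper: spread between the highest and lowest score in a group."""
    return s.max() - s.min()

# mix a built-in aggregation (by name) with a custom function
summary = students.groupby("gender")["score"].agg(["mean", score_range])
```

Here `summary` has one row per gender and the columns `mean` and `score_range`.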


### Multiple Columns, Single Aggregation

This involves passing a list of column names to .groupby() and calling a single aggregation function.

In [9]:
# using builtin aggregation methods (mean, max, etc)
students.groupby(["gender", "class"])["score"].max() # results in a Series with a two-level MultiIndex (gender, class)

gender,class,score
F,A,91
M,B,85


In [10]:
# using .agg
students.groupby(["gender", "class"])["score"].agg("max")

gender,class,score
F,A,91
M,B,85
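
Because the result is indexed by two levels, a single value is looked up with a tuple of keys, and one level can be pivoted into columns with unstack. A minimal sketch on a cut-down copy of the data (the names `max_scores` and `wide` are made up for this example):

```python
import pandas as pd

students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
    "class": ["A", "B", "A", "B", "A"],
})

max_scores = students.groupby(["gender", "class"])["score"].max()

# look up one (gender, class) combination with a tuple
top_f_a = max_scores.loc[("F", "A")]

# move the "class" level into columns; combinations that never occur become NaN
wide = max_scores.unstack("class")
```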


In [11]:
# Resetting index will yield a DataFrame with three columns, with gender and class being the first and second columns.
students.groupby(["gender", "class"])["score"].mean().reset_index()

,gender,class,score
0,F,A,85.333333
1,M,B,78.5


### Multiple Columns, Different Aggregations (One Function per Column)

In [12]:
# using a dictionary
# the keys are column names while the values are function names
students.groupby("gender").agg({
    "score": "mean",
    "class": "count"
})

gender,score,class
F,85.333333,3
M,78.5,2


In [13]:
# using named aggregation (keyword arguments holding tuples)
students.groupby("gender").agg(
    score_mean = ("score", "mean"),
    class_count = ("class", "count")
) # syntax is output_column_name = (column_to_aggregate, aggregation_function)

gender,score_mean,class_count
F,85.333333,3
M,78.5,2


### Multiple Aggregations for Multiple Columns

In [14]:
# using a dictionary of lists
students.groupby("gender").agg({
    "score": ["mean", "min", "max"],
    "class": "count"
})

,score,score,score,class
gender,mean,min,max,count
F,85.333333,77,91,3
M,78.5,72,85,2
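
A dictionary of lists produces two-level column labels such as (score, mean). If flat names are easier to work with, one possible approach is to join the two levels with an underscore; the name `result` is made up for this example.

```python
import pandas as pd

students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
    "class": ["A", "B", "A", "B", "A"],
})

result = students.groupby("gender").agg({
    "score": ["mean", "min", "max"],
    "class": "count"
})

# each column label is a tuple like ("score", "mean"); join it into "score_mean"
result.columns = ["_".join(col) for col in result.columns]
```

After this, `result` has the columns `score_mean`, `score_min`, `score_max` and `class_count`.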


## Filtering Groups with .filter

groupby().filter() works like boolean indexing, but at the group level rather than the row level: boolean indexing decides row by row, while filter() decides group by group. When you do:

In [None]:
# df.groupby("gender").filter(func)

Pandas does:

1. Split the DataFrame into groups by gender

2. Pass each group (as a DataFrame) to func

3. If func(group) returns True, keep the entire group

4. If it returns False, drop the entire group

So the function must return a single boolean per group, not a Series or DataFrame of booleans.

In [18]:
# keep only the gender groups whose mean score is at least 80
students.groupby("gender").filter(lambda x: x["score"].mean() >= 80)

,student,gender,score,class
0,Alice,F,88,A
2,Caro,F,91,A
4,Eve,F,77,A
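
To connect filter() back to ordinary boolean indexing, the same rows can be selected by broadcasting each group's mean to its rows with transform and then masking. This is only a sketch of an equivalent approach; the names `group_means`, `kept_by_mask` and `kept_by_filter` are made up for this example.

```python
import pandas as pd

students = pd.DataFrame({
    "student": ["Alice", "Bob", "Caro", "Dan", "Eve"],
    "gender": ["F", "M", "F", "M", "F"],
    "score": [88, 72, 91, 85, 77],
})

# transform returns one value per row: the mean score of that row's group
group_means = students.groupby("gender")["score"].transform("mean")

# row-level boolean mask built from a group-level statistic
kept_by_mask = students[group_means >= 80]

# group-level decision made by filter
kept_by_filter = students.groupby("gender").filter(lambda g: g["score"].mean() >= 80)
```

Both selections keep the rows for Alice, Caro and Eve, with their original index labels.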


## Mini-Exercise

Given:

In [19]:
cities = pd.DataFrame({
    "city": ["Lagos", "Abuja", "Lagos", "Ibadan", "Abuja", "Lagos"],
    "sales": [100, 80, 120, 90, 70, 110],
    "profit": [20, 15, 25, 18, 10, 22]
})
cities

,city,sales,profit
0,Lagos,100,20
1,Abuja,80,15
2,Lagos,120,25
3,Ibadan,90,18
4,Abuja,70,10
5,Lagos,110,22


Find:

1. Total sales per city

2. Mean profit per city

3. Total sales and max profit per city

4. Mean and max profit per city

5. Same as #4, but with clean column names

In [20]:
# total sales per city
cities.groupby("city")["sales"].sum()

city,sales
Abuja,150
Ibadan,90
Lagos,330


In [21]:
# mean profit per city
cities.groupby("city")["profit"].mean()

city,profit
Abuja,12.5
Ibadan,18.0
Lagos,22.333333


In [22]:
# total sales and max profit per city
cities.groupby("city").agg({
    "sales": "sum",
    "profit": "max"
})

city,sales,profit
Abuja,150,15
Ibadan,90,18
Lagos,330,25


In [23]:
# mean and max profit per city
cities.groupby("city").agg({
    "profit": ["mean", "max"]
})

,profit,profit
city,mean,max
Abuja,12.5,15
Ibadan,18.0,18
Lagos,22.333333,25


In [24]:
# mean and max profit per city with clean column names
cities.groupby("city").agg(
    mean_profit = ("profit", "mean"),
    max_profit = ("profit", "max")
)

city,mean_profit,max_profit
Abuja,12.5,15
Ibadan,18.0,18
Lagos,22.333333,25
