# Pandas Toolkit Series (Part 2) - Aggregation, GroupBy & Joins

**INST447 Data Source and Manipulation**  
Author: **Wei Ai** (aiwei@umd.edu)  
University of Maryland

---

In Part 1, we covered the fundamentals of data selection, filtering, and transformation. Today we'll dive into more powerful operations:

- **Aggregation**: Summarizing data with statistics and counts
- **GroupBy Operations**: Split-Apply-Combine analysis patterns  
- **Joins & Merges**: Combining data from multiple sources

We'll continue using our personal flight history dataset to explore these concepts with realistic examples.

In [1]:
import pandas as pd
import numpy as np

# Recreate our flight history dataset from Part 1
flights_data = [
    ("2024-01-15", "UA1247", "BWI", "ORD", 651, "B737", "12A", 289.50, 15),
    ("2024-01-22", "DL456", "ORD", "LAX", 1745, "A321", "8F", 425.00, None),
    ("2024-02-08", "WN2891", "LAX", "PHX", 370, "B737", "", 149.99, 0),
    ("2024-02-10", "WN1055", "PHX", "DEN", 602, "B737", "15C", None, 45),
    ("2024-03-05", "AA892", "DEN", "DFW", 663, "B737", "21B", 198.75, None),
    ("2024-03-12", "UA634", "DFW", "IAD", 1216, "B777", "9A", 345.25, 12),
    ("2024-04-20", "B61840", "IAD", "BOS", 429, "", "11D", 179.50, 0),
    ("2024-05-15", "DL1123", "BOS", "ATL", 946, "A220", "4A", 267.00, 25),
    ("2024-05-18", "DL2967", "ATL", "MIA", 594, "B737", "", None, 8),
    ("2024-06-02", "AA1456", "MIA", "LGA", 1095, "A321", "18F", 312.80, None)
]

columns = ["flight_date", "flight_number", "origin", "destination",
          "distance", "aircraft", "seat", "price", "delay_min"]

flights = pd.DataFrame(flights_data, columns=columns)

# Add airline code extraction (from Part 1)
flights['airline_code'] = flights.flight_number.str[:2]

flights.head()

Unnamed: 0,flight_date,flight_number,origin,destination,distance,aircraft,seat,price,delay_min,airline_code
0,2024-01-15,UA1247,BWI,ORD,651,B737,12A,289.5,15.0,UA
1,2024-01-22,DL456,ORD,LAX,1745,A321,8F,425.0,,DL
2,2024-02-08,WN2891,LAX,PHX,370,B737,,149.99,0.0,WN
3,2024-02-10,WN1055,PHX,DEN,602,B737,15C,,45.0,WN
4,2024-03-05,AA892,DEN,DFW,663,B737,21B,198.75,,AA


In [2]:
# Create our airline lookup table - this will be crucial for joins!
airlines_data = [
    ("UA", "United Airlines"),
    ("DL", "Delta Air Lines"),
    ("WN", "Southwest Airlines"),
    ("AA", "American Airlines"),
    ("AS", "Alaska Airlines")  # Note: We don't have AS flights
]

airlines = pd.DataFrame(airlines_data, columns=["code", "airline_name"])
airlines

Unnamed: 0,code,airline_name
0,UA,United Airlines
1,DL,Delta Air Lines
2,WN,Southwest Airlines
3,AA,American Airlines
4,AS,Alaska Airlines


## 1. Basic Aggregation & Summary Statistics

Before we dive into grouping, let's understand how to summarize our entire dataset.

In [None]:
# Overall dataset summary
flights.describe()

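Unnamed: 0,distance,price,delay_min
count,10.0,8.0,7.0
mean,831.1,270.97375,15.0
std,422.939568,92.249878,15.853496
min,370.0,149.99,0.0
25%,596.0,193.9375,4.0
50%,657.0,278.25,12.0
75%,1057.75,320.9125,20.0
max,1745.0,425.0,45.0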

**Why do we only see numeric columns?** By default, `.describe()` summarizes only the numeric columns of a DataFrame that has any. Pass `include='all'` to cover text columns as well.

In [7]:
# Get summary for all columns (including text)
flights.describe(include='all')

Unnamed: 0,flight_date,flight_number,origin,destination,distance,aircraft,seat,price,delay_min,airline_code
count,10,10,10,10,10.0,10,10.0,8.0,7.0,10
unique,10,10,10,10,,5,9.0,,,5
top,2024-01-15,UA1247,BWI,ORD,,B737,,,,DL
freq,1,1,1,1,,5,2.0,,,3
mean,,,,,831.1,,,270.97375,15.0,
std,,,,,422.939568,,,92.249878,15.853496,
min,,,,,370.0,,,149.99,0.0,
25%,,,,,596.0,,,193.9375,4.0,
50%,,,,,657.0,,,278.25,12.0,
75%,,,,,1057.75,,,320.9125,20.0,
max,,,,,1745.0,,,425.0,45.0,


In [9]:
flights.airline_code.unique()

array(['UA', 'DL', 'WN', 'AA', 'B6'], dtype=object)

In [10]:
# Individual aggregation functions
print(f"Total flights: {len(flights)}")
print(f"Total miles flown: {flights.distance.sum()}")
print(f"Average flight distance: {flights.distance.mean()} miles")
print(f"Longest flight: {flights.distance.max()} miles")
print(f"Shortest flight: {flights.distance.min()} miles")

Total flights: 10
Total miles flown: 8311
Average flight distance: 831.1 miles
Longest flight: 1745 miles
Shortest flight: 370 miles


### Working with Missing Data in Aggregations

Notice how pandas handles missing values (`NaN`) in aggregations:

In [11]:
# Check for missing data
flights.isnull()

Unnamed: 0,flight_date,flight_number,origin,destination,distance,aircraft,seat,price,delay_min,airline_code
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,True,False
5,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,True,False,False
9,False,False,False,False,False,False,False,False,True,False


In [12]:
# How does missing data affect our calculations?
print(f"Count of price values: {flights.price.count()}")
print(f"Total rows: {len(flights)}")
print(f"Average price (ignoring NaN): ${flights.price.mean():.2f}")
print(f"Average delay (ignoring NaN): {flights.delay_min.mean():.1f} minutes")

# "Ignoring NaN" means missing values are skipped: the price mean divides
# by the 8 known prices, not by all 10 rows.

Count of price values: 8
Total rows: 10
Average price (ignoring NaN): $270.97
Average delay (ignoring NaN): 15.0 minutes
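
To make "ignoring NaN" concrete, here is a minimal sketch that rebuilds only the `price` column from the dataset above and compares three ways of computing the mean. The variable names (`mean_skipna`, `mean_strict`, `mean_filled`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Prices copied from the flights dataset above (None becomes NaN)
price = pd.Series([289.50, 425.00, 149.99, None, 198.75,
                   345.25, 179.50, 267.00, None, 312.80], name="price")

# Default behavior: NaN values are skipped, so the mean is the
# sum of the known prices divided by the number of known prices
mean_skipna = price.mean()
manual_mean = price.sum() / price.count()

# skipna=False propagates NaN instead of silently skipping it
mean_strict = price.mean(skipna=False)

# Filling NaN with 0 treats unknown prices as free flights,
# which pulls the average down
mean_filled = price.fillna(0).mean()

print(f"mean (default, skipna=True): {mean_skipna:.2f}")
print(f"sum / count:                 {manual_mean:.2f}")
print(f"mean (skipna=False):         {mean_strict}")
print(f"mean after fillna(0):        {mean_filled:.2f}")
```

Which of these is appropriate depends on what a missing value means in your data: an unknown price is not the same thing as a price of zero.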


### Aggregations for Categorical Data

**For non-numeric data, we use different aggregation methods:**

In [13]:
# Unique values and counts
print("Unique departure airports:")
print(flights.origin.unique())
print("\nUnique destination airports:")
print(flights.destination.unique())

Unique departure airports:
['BWI' 'ORD' 'LAX' 'PHX' 'DEN' 'DFW' 'IAD' 'BOS' 'ATL' 'MIA']

Unique destination airports:
['ORD' 'LAX' 'PHX' 'DEN' 'DFW' 'IAD' 'BOS' 'ATL' 'MIA' 'LGA']


In [14]:
# Value counts - extremely useful for categorical analysis
flights.airline_code.value_counts()

airline_code
DL    3
UA    2
WN    2
AA    2
B6    1
Name: count, dtype: int64

In [15]:
# Which airports do I fly from most?
flights.origin.value_counts()

origin
BWI    1
ORD    1
LAX    1
PHX    1
DEN    1
DFW    1
IAD    1
BOS    1
ATL    1
MIA    1
Name: count, dtype: int64
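
Two optional arguments of `value_counts()` are often useful for categorical summaries. The sketch below rebuilds the `airline_code` and `delay_min` columns from the dataset above so it runs on its own; `shares`, `delay_counts`, and `n_missing` are illustrative names.

```python
import pandas as pd

# Columns copied from the flights dataset above
airline_code = pd.Series(["UA", "DL", "WN", "WN", "AA", "UA", "B6", "DL", "DL", "AA"],
                         name="airline_code")
delay_min = pd.Series([15, None, 0, 45, None, 12, 0, 25, 8, None], name="delay_min")

# normalize=True returns each airline's share of flights instead of raw counts
shares = airline_code.value_counts(normalize=True)

# dropna=False adds an entry that counts the missing values,
# which value_counts() leaves out by default
delay_counts = delay_min.value_counts(dropna=False)
n_missing = delay_counts[delay_counts.index.isna()].iloc[0]

print(shares)
print(f"Flights with no recorded delay: {n_missing}")
```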

## 2. GroupBy Operations - The Split-Apply-Combine Pattern

**GroupBy is one of the most powerful features in pandas.** It follows a "Split-Apply-Combine" approach:
1. **Split** the data into groups based on some criteria
2. **Apply** a function to each group independently  
3. **Combine** the results into a data structure
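
The three steps can be sketched by hand before handing the work to `groupby`. This is only an illustration on a two-column copy of the flights data (`flights_small` is a name introduced here), not how pandas implements grouping internally.

```python
import pandas as pd

# Airline code and distance copied from the flights dataset above
flights_small = pd.DataFrame({
    "airline_code": ["UA", "DL", "WN", "WN", "AA", "UA", "B6", "DL", "DL", "AA"],
    "distance": [651, 1745, 370, 602, 663, 1216, 429, 946, 594, 1095],
})

# 1. Split: one sub-DataFrame per airline
pieces = {
    code: flights_small[flights_small.airline_code == code]
    for code in sorted(flights_small.airline_code.unique())
}

# 2. Apply: compute the total distance within each piece
totals = {code: piece.distance.sum() for code, piece in pieces.items()}

# 3. Combine: collect the per-group results into a single Series
manual_result = pd.Series(totals, name="distance")

# groupby performs the same three steps in one expression
groupby_result = flights_small.groupby("airline_code").distance.sum()

print(manual_result)
```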

Let's start with simple examples and build complexity.

In [17]:
# Basic groupby - group by airline
airline_groups = flights.groupby('airline_code')
airline_groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10c83dc10>

In [19]:
airline_groups.groups.keys()

dict_keys(['AA', 'B6', 'DL', 'UA', 'WN'])

In [20]:
# What does a group look like?
for airline, group_data in airline_groups:
    print(airline)
    print(group_data[['flight_number', 'origin', 'destination', 'distance']])

AA
  flight_number origin destination  distance
4         AA892    DEN         DFW       663
9        AA1456    MIA         LGA      1095
B6
  flight_number origin destination  distance
6        B61840    IAD         BOS       429
DL
  flight_number origin destination  distance
1         DL456    ORD         LAX      1745
7        DL1123    BOS         ATL       946
8        DL2967    ATL         MIA       594
UA
  flight_number origin destination  distance
0        UA1247    BWI         ORD       651
5         UA634    DFW         IAD      1216
WN
  flight_number origin destination  distance
2        WN2891    LAX         PHX       370
3        WN1055    PHX         DEN       602


### GroupBy + Aggregation: The Real Power

In [21]:
# Count flights by airline
flights.groupby('airline_code').size()

airline_code
AA    2
B6    1
DL    3
UA    2
WN    2
dtype: int64

In [None]:
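# Total distance flown per airline (the result is a Series indexed by airline_code)
flight_distances = flights.groupby('airline_code').distance.sum()
# Try appending .reset_index() to turn airline_code back into a regular column.
# Rename columns after each manipulation so you remember what they hold (reproducible).
flight_distances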

airline_code
AA    1758
B6     429
DL    3285
UA    1867
WN     972
Name: distance, dtype: int64

In [None]:
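# Column names only exist on a DataFrame, so convert the grouped Series first
flight_distances = flights.groupby('airline_code').distance.sum().reset_index()

# One way to change column names: assign the complete list
flight_distances.columns = ['airline_code', 'distance']

# Another way: rename one column with an {old: new} mapping, without retyping every name.
# inplace=True modifies flight_distances directly instead of returning a new DataFrame.
flight_distances.rename(columns={'distance': 'distance_2'}, inplace=True)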

In [23]:
# Multiple aggregations at once
airline_stats = flights.groupby('airline_code').agg({
    'distance': ['count', 'sum', 'mean', 'max'],
    'price': ['mean', 'min', 'max'],
    'delay_min': 'mean'
})

airline_stats

Unnamed: 0_level_0,distance,distance,distance,distance,price,price,price,delay_min
Unnamed: 0_level_1,count,sum,mean,max,mean,min,max,mean
airline_code,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
AA,2,1758,879.0,1095,255.775,198.75,312.8,
B6,1,429,429.0,429,179.5,179.5,179.5,0.0
DL,3,3285,1095.0,1745,346.0,267.0,425.0,16.5
UA,2,1867,933.5,1216,317.375,289.5,345.25,13.5
WN,2,972,486.0,602,149.99,149.99,149.99,22.5


**The result above has a multi-level column structure.** Let's clean it up:

In [24]:
airline_stats.reset_index().columns

MultiIndex([('airline_code',      ''),
            (    'distance', 'count'),
            (    'distance',   'sum'),
            (    'distance',  'mean'),
            (    'distance',   'max'),
            (       'price',  'mean'),
            (       'price',   'min'),
            (       'price',   'max'),
            (   'delay_min',  'mean')],
           )

In [25]:
# Flatten the column names and reset index
airline_stats.columns = ['_'.join(col).strip() for col in airline_stats.columns]
airline_stats = airline_stats.reset_index()
airline_stats

Unnamed: 0,airline_code,distance_count,distance_sum,distance_mean,distance_max,price_mean,price_min,price_max,delay_min_mean
0,AA,2,1758,879.0,1095,255.775,198.75,312.8,
1,B6,1,429,429.0,429,179.5,179.5,179.5,0.0
2,DL,3,3285,1095.0,1745,346.0,267.0,425.0,16.5
3,UA,2,1867,933.5,1216,317.375,289.5,345.25,13.5
4,WN,2,972,486.0,602,149.99,149.99,149.99,22.5


In [26]:
# Rename columns to be more readable
airline_stats.columns = ['airline_code', 'flight_count', 'total_distance', 'avg_distance', 'max_distance',
                        'avg_price', 'min_price', 'max_price', 'avg_delay']
airline_stats

Unnamed: 0,airline_code,flight_count,total_distance,avg_distance,max_distance,avg_price,min_price,max_price,avg_delay
0,AA,2,1758,879.0,1095,255.775,198.75,312.8,
1,B6,1,429,429.0,429,179.5,179.5,179.5,0.0
2,DL,3,3285,1095.0,1745,346.0,267.0,425.0,16.5
3,UA,2,1867,933.5,1216,317.375,289.5,345.25,13.5
4,WN,2,972,486.0,602,149.99,149.99,149.99,22.5
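
As an alternative to flattening and then renaming, pandas also supports named aggregation, where each output column is declared as `new_name=(column, function)`. The sketch below uses a three-column copy of the flights data (`flights_small` and `airline_summary` are names introduced here) and produces readable column names in one step.

```python
import pandas as pd

# Columns copied from the flights dataset above
flights_small = pd.DataFrame({
    "airline_code": ["UA", "DL", "WN", "WN", "AA", "UA", "B6", "DL", "DL", "AA"],
    "distance": [651, 1745, 370, 602, 663, 1216, 429, 946, 594, 1095],
    "price": [289.50, 425.00, 149.99, None, 198.75,
              345.25, 179.50, 267.00, None, 312.80],
})

# Each keyword becomes an output column: name=(input column, aggregation)
airline_summary = (
    flights_small
    .groupby("airline_code")
    .agg(
        flight_count=("distance", "count"),
        total_distance=("distance", "sum"),
        avg_price=("price", "mean"),
    )
    .reset_index()
)

print(airline_summary)
```

Because the result has ordinary single-level columns, no flattening step is needed afterwards.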


### Multiple Column Grouping

**What if we want to group by multiple criteria?**

In [27]:
# Group by airline and origin airport
multi_group = flights.groupby(['airline_code', 'origin']).size().reset_index(name='flight_count')
multi_group

Unnamed: 0,airline_code,origin,flight_count
0,AA,DEN,1
1,AA,MIA,1
2,B6,IAD,1
3,DL,ATL,1
4,DL,BOS,1
5,DL,ORD,1
6,UA,BWI,1
7,UA,DFW,1
8,WN,LAX,1
9,WN,PHX,1


In [28]:
# Which airline-airport combinations are most common?
multi_group.sort_values('flight_count', ascending=False)

Unnamed: 0,airline_code,origin,flight_count
0,AA,DEN,1
1,AA,MIA,1
2,B6,IAD,1
3,DL,ATL,1
4,DL,BOS,1
5,DL,ORD,1
6,UA,BWI,1
7,UA,DFW,1
8,WN,LAX,1
9,WN,PHX,1


### Custom Aggregation Functions (not covered)

**What if we need more complex analysis than built-in functions provide?**
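
Although this topic is not covered in detail here, a minimal sketch may help: `.agg()` accepts any function that reduces a group's values to a single number. The helper `distance_range` and the output column names below are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Columns copied from the flights dataset above
flights_small = pd.DataFrame({
    "airline_code": ["UA", "DL", "WN", "WN", "AA", "UA", "B6", "DL", "DL", "AA"],
    "distance": [651, 1745, 370, 602, 663, 1216, 429, 946, 594, 1095],
    "delay_min": [15, None, 0, 45, None, 12, 0, 25, 8, None],
})


def distance_range(distances):
    """Gap between the longest and shortest flight in one group (hypothetical helper)."""
    return distances.max() - distances.min()


custom_stats = flights_small.groupby("airline_code").agg(
    # A named function applied to each airline's distances
    distance_range=("distance", distance_range),
    # A lambda counting flights with a positive recorded delay (NaN compares as False)
    delayed_flights=("delay_min", lambda d: (d > 0).sum()),
)

print(custom_stats)
```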

## 3. Joins and Merges - Combining Data from Multiple Sources

**Real-world data analysis often requires combining information from multiple datasets.** Let's enhance our flight analysis by adding airline information.

In [29]:
# Inner Join (default) - only keep rows that match in both datasets
flights_with_airlines = flights.merge(
    airlines,
    left_on='airline_code',
    right_on='code',
    how='inner'
)

In [30]:
flights_with_airlines

Unnamed: 0,flight_date,flight_number,origin,destination,distance,aircraft,seat,price,delay_min,airline_code,code,airline_name
0,2024-01-15,UA1247,BWI,ORD,651,B737,12A,289.5,15.0,UA,UA,United Airlines
1,2024-01-22,DL456,ORD,LAX,1745,A321,8F,425.0,,DL,DL,Delta Air Lines
2,2024-02-08,WN2891,LAX,PHX,370,B737,,149.99,0.0,WN,WN,Southwest Airlines
3,2024-02-10,WN1055,PHX,DEN,602,B737,15C,,45.0,WN,WN,Southwest Airlines
4,2024-03-05,AA892,DEN,DFW,663,B737,21B,198.75,,AA,AA,American Airlines
5,2024-03-12,UA634,DFW,IAD,1216,B777,9A,345.25,12.0,UA,UA,United Airlines
6,2024-05-15,DL1123,BOS,ATL,946,A220,4A,267.0,25.0,DL,DL,Delta Air Lines
7,2024-05-18,DL2967,ATL,MIA,594,B737,,,8.0,DL,DL,Delta Air Lines
8,2024-06-02,AA1456,MIA,LGA,1095,A321,18F,312.8,,AA,AA,American Airlines


### Types of Joins

Let's first compare the airline codes that appear in each dataset:

In [None]:
print("Airline codes in flights:")
print(sorted(flights.airline_code.unique()))
print("\nAirline codes in lookup table:")
print(sorted(airlines.code.unique()))

Airline codes in flights:
['AA', 'B6', 'DL', 'UA', 'WN']

Airline codes in lookup table:
['AA', 'AS', 'DL', 'UA', 'WN']


In [32]:
print(f"Original flights: {len(flights)}")
print(f"After inner join: {len(flights_with_airlines)}")
flights_with_airlines[['flight_number', 'airline_code', 'airline_name']].head()

Original flights: 10
After inner join: 9


Unnamed: 0,flight_number,airline_code,airline_name
0,UA1247,UA,United Airlines
1,DL456,DL,Delta Air Lines
2,WN2891,WN,Southwest Airlines
3,WN1055,WN,Southwest Airlines
4,AA892,AA,American Airlines


In [33]:
# Left Join - keep all flights, even if no airline info exists
flights_left_join = flights.merge(
    airlines,
    left_on='airline_code',
    right_on='code',
    how='left'
)

print(f"After left join: {len(flights_left_join)}")
print("Any missing airline names?")
print(flights_left_join['airline_name'].isnull().sum())

After left join: 10
Any missing airline names?
1


In [34]:
# Right Join - keep all airlines, even if no flights exist
airlines_right_join = flights.merge(
    airlines,
    left_on='airline_code',
    right_on='code',
    how='right'
)

print(f"After right join: {len(airlines_right_join)}")
print("Airlines with no flights:")
print(airlines_right_join[airlines_right_join['flight_number'].isnull()]['airline_name'].tolist())

After right join: 10
Airlines with no flights:
['Alaska Airlines']


In [35]:
# Outer Join - keep everything from both datasets
flights_outer_join = flights.merge(
    airlines,
    left_on='airline_code',
    right_on='code',
    how='outer'
)

print(f"After outer join: {len(flights_outer_join)}")
print(f"Rows with missing flight data: {flights_outer_join['flight_number'].isnull().sum()}")
print(f"Rows with missing airline data: {flights_outer_join['airline_name'].isnull().sum()}")

After outer join: 11
Rows with missing flight data: 1
Rows with missing airline data: 1


### Practical Join Usage

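**An inner join only fits our analysis if we are willing to lose unmatched flights: the B6 flight has no entry in the lookup table, so the inner join above keeps 9 of our 10 flights. To keep every flight, use a left join and handle the missing airline name afterwards.**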
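
One way to check which rows found a match is the `indicator` argument of `merge`, optionally combined with `validate` to guard against duplicate keys in the lookup table. The sketch below uses a four-flight excerpt of the data (`flights_small`, `checked`, and `unmatched` are names introduced here).

```python
import pandas as pd

# A four-flight excerpt of the flights dataset above
flights_small = pd.DataFrame({
    "flight_number": ["UA1247", "DL456", "WN2891", "B61840"],
    "airline_code": ["UA", "DL", "WN", "B6"],
})

# The airline lookup table from the top of this notebook
airlines = pd.DataFrame({
    "code": ["UA", "DL", "WN", "AA", "AS"],
    "airline_name": ["United Airlines", "Delta Air Lines", "Southwest Airlines",
                     "American Airlines", "Alaska Airlines"],
})

checked = flights_small.merge(
    airlines,
    left_on="airline_code",
    right_on="code",
    how="left",
    indicator=True,          # adds a _merge column: 'both', 'left_only' or 'right_only'
    validate="many_to_one",  # raises an error if a code is duplicated in the lookup table
)

# Flights whose airline code has no entry in the lookup table
unmatched = checked.loc[checked["_merge"] == "left_only", "flight_number"].tolist()
print(unmatched)
```

Running a check like this before choosing `how='inner'` shows exactly which rows the inner join would drop.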

### Handling Column Name Conflicts

**What if both datasets have columns with the same name?**

In [None]:
# Create a dataset with conflicting column names to demonstrate
airport_info = pd.DataFrame({
    'code': ['BWI', 'ORD', 'LAX', 'BOS'],
    'name': ['Baltimore-Washington', 'Chicago O\'Hare', 'Los Angeles', 'Boston Logan'],
    'distance': [25, 18, 12, 5]  # Distance from city center - conflicts with flight distance!
})

# Join and see what happens. Both sides have 'code' and 'distance' columns, so without
# `suffixes` pandas appends the defaults '_x' (left) and '_y' (right) to those names.
flight_with_airport = flights_with_airlines.merge(
    airport_info,
    left_on='origin',
    right_on='code',
    # suffixes=('', '_airport')  # Uncomment to keep the left names and tag the right-side conflicts
)

# With the suffixes above, only conflicting columns are renamed ('name' stays as is):
# flight_with_airport[['flight_number', 'origin', 'name', 'distance', 'distance_airport']].head()

In [None]:
flight_with_airport
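
To see the effect of `suffixes` side by side, the sketch below merges a three-flight excerpt with the airport table twice, once with the defaults and once with custom suffixes. `flights_small`, `default_cols`, and `custom_cols` are names introduced here for illustration.

```python
import pandas as pd

# A three-flight excerpt of the flights dataset above
flights_small = pd.DataFrame({
    "flight_number": ["UA1247", "DL456", "WN2891"],
    "origin": ["BWI", "ORD", "LAX"],
    "distance": [651, 1745, 370],  # flight distance in miles
})

airport_info = pd.DataFrame({
    "code": ["BWI", "ORD", "LAX", "BOS"],
    "name": ["Baltimore-Washington", "Chicago O'Hare", "Los Angeles", "Boston Logan"],
    "distance": [25, 18, 12, 5],  # distance from city center
})

# Default suffixes: the conflicting 'distance' columns become distance_x / distance_y
default_cols = flights_small.merge(
    airport_info, left_on="origin", right_on="code"
).columns.tolist()

# Custom suffixes: keep the left name unchanged and tag only the right-side column
custom_cols = flights_small.merge(
    airport_info, left_on="origin", right_on="code", suffixes=("", "_airport")
).columns.tolist()

print(default_cols)
print(custom_cols)
```

Only columns that exist on both sides receive a suffix, which is why `name` keeps its original label in both results.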