In [None]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Discussion 4

# Grouping, Pivoting, and Merging


### EECS 398: Practical Data Science, Winter 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/wn25">github.com/practicaldsc/wn25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/69737/discussion/5943734) </small>
    
</div>

### Agenda 📆

- The `groupby` method.
- Pivot tables.
- `merge` and types of merges.
- Worksheet 📝.

### Today's Dataset 🏈〽️

- Today, we're going to be working with a dataset on the past 100 years of Michigan Football.<br><small>**Source**: www.mgoblue.com.</small>

- Like last week, let's store our dataset in the `df` variable. 

In [None]:
df = pd.read_csv('data/michigan_football.csv')
df

In [None]:
df.columns

### Read the DataFrame internals guide!

- We've posted a guide, [**DataFrame Internals**](https://practicaldsc.org/guides/data-wrangling/df-internals/), that you should read.

- It covers various topics, like:
    - How to add columns to DataFrames without side effects.
    - How to write functions that manipulate DataFrames without side effects.
    - How to use Series methods, like `mean` and `describe`, on DataFrames.
    
    <br>
    
    and more!

### The `groupby` method

- **Key idea**: If the information we want involves performing some operation separately **for each group**, we use `groupby`.

- For instance, if we want to find the average number of points Michigan scored **each year**, we use `groupby('year')`.

In [None]:
df.groupby('year')['UM_score'].mean()

- <code style="background: none"><strong><span style="color: #0066cc">df.groupby('year')</span></strong><span style="color: #999">['UM_score'].mean()</span></code><br><small>First, we need to tell `pandas` which column we want to group by. Since we're grouping by `'year'`, the remainder of the calculations will be done separately for each `'year'`.</small>

- <code style="background: none"><span style="color: #999">df.groupby('year')</span><strong><span style="color: #0066cc">['UM_score']</span></strong><span style="color: #999">.mean()</span></code><br><small>Then, select the other column(s) that you want to aggregate. Here, we want to calculate the mean `'UM_score'`, so that's what we select.</small>

- <code style="background: none"><span style="color: #999">df.groupby('year')['UM_score']</span><strong><span style="color: #0066cc">.mean()</span></strong></code><br><small>Finally, we use an aggregation method. This says: for each `'year'`, compute the mean `'UM_score'`.</small>

<div class="alert alert-danger">
    
#### Reference Slide

### The split, apply, combine process
</div>

- Under the hood, `groupby` follows 3 steps: **split**, **apply**, and **combine**.

- We start by **splitting** rows into distinct groups.<br>When we use `df.groupby('year')`, the groups are created by separating rows based on `'year'` values.

In [None]:
# The 2024 group and the 2023 group.
# df.groupby('year') creates a separate group for each 'year'.
group_2024 = df.loc[df['year'] == 2024]
group_2023 = df.loc[df['year'] == 2023]
dfs_side_by_side(group_2024.head(), group_2023.head())

- Then, we **apply** the operation to each group, individually.

In [None]:
print("2024 result:", group_2024['UM_score'].mean())
print("2023 result:", group_2023['UM_score'].mean())
print("and so on...")

- Finally, we **combine** the results of the operation on each group into a single Series or DataFrame (depending on the operation).
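- As a minimal sketch, the three steps above can be carried out by hand on a tiny, made-up DataFrame and compared with what `groupby` returns. The DataFrame `toy` and its values are hypothetical; only the column names mirror the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the scores are made up.
toy = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024, 2024],
    'UM_score': [30, 34, 27, 10, 17],
})

# Split: one sub-DataFrame per distinct 'year'.
groups = {year: toy.loc[toy['year'] == year] for year in toy['year'].unique()}

# Apply: compute the mean 'UM_score' within each group.
means = {year: group['UM_score'].mean() for year, group in groups.items()}

# Combine: collect the per-group results into a single Series.
by_hand = pd.Series(means, name='UM_score').rename_axis('year').sort_index()

# groupby performs all three steps in one expression.
with_groupby = toy.groupby('year')['UM_score'].mean()

print(by_hand)
print(with_groupby)
```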

In [None]:
df.groupby('year')['UM_score'].mean().sort_index(ascending=False)

- In the expression below, `mean` is an **aggregation method**.<br>It takes many `'UM_score'` values for a particular `'year'` and combines them into a single value.

<div class="alert alert-danger">
    
#### Reference Slide

### Aggregating
</div>

In [None]:
df.groupby('year')['UM_score'].mean()

- Sometimes, we want to perform:
    - An aggregation that doesn't have a built-in implementation.
    - Different aggregation methods on different columns.

- The `aggregate` method, or `agg` for short, is the general-purpose tool for performing aggregations.

In [None]:
# Finds the average score, and the win percentage, for home vs. away games.
(
    df
    .groupby('venue')
    .agg({
        'UM_score': 'mean', # equivalent to df.groupby('venue')['UM_score'].mean()
        # the parameter `s` here is a Series of the 'result' values for each group.
        'result': lambda s: (s == 'W').sum() / len(s) 
    })
)

- 📢 Michigan performs significantly better at home, both in terms of points scored and win percentage!
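- To see the dictionary form of `agg` with known inputs, here is a minimal, self-contained sketch on a made-up five-game DataFrame. The data in `toy` is hypothetical, and `win_fraction` is an illustrative helper name used in place of the `lambda`; `(s == 'W').mean()` computes the same fraction as `(s == 'W').sum() / len(s)`.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    'venue': ['Home', 'Home', 'Home', 'Away', 'Away'],
    'UM_score': [35, 28, 21, 14, 24],
    'result': ['W', 'W', 'L', 'L', 'W'],
})

def win_fraction(s):
    # `s` is the Series of 'result' values for one group.
    return (s == 'W').mean()

# One aggregation per column: a built-in by name, and a custom function.
summary = toy.groupby('venue').agg({
    'UM_score': 'mean',
    'result': win_fraction,
})
print(summary)
```

Note that the output column is still called `'result'`, even though it now holds a win fraction rather than `'W'`/`'L'` strings.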

### Filtering

- So far, the only operation we've applied to each group is some sort of aggregation (e.g. computing the `mean` score for each `'year'`).

- Instead of aggregating, perhaps we want to only keep groups that satisfy a certain condition.<br>To do so, we use the `filter` method instead of any aggregation method.

- For instance, if we want to keep only the `'year'`s in which Michigan won at least 12 games:

In [None]:
def has_at_least_twelve_wins(group):
    # has_at_least_twelve_wins is called separately for each 'year'.
    # Each time it's called, the `group` argument is a DataFrame.
    return (group['result'] == 'W').sum() >= 12

twelveplus_wins = df.groupby('year').filter(has_at_least_twelve_wins)
twelveplus_wins

In [None]:
twelveplus_wins['year'].unique()
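
- Unlike an aggregation, `filter` returns the original rows of every group that passes the test, not one row per group. The sketch below illustrates this on made-up data; `toy` is hypothetical, and the threshold is lowered to 2 wins so that the example stays small.

```python
import pandas as pd

# Hypothetical results for three seasons.
toy = pd.DataFrame({
    'year': [2022, 2022, 2023, 2023, 2023, 2024, 2024],
    'result': ['W', 'L', 'W', 'W', 'W', 'W', 'W'],
})

def has_at_least_two_wins(group):
    # Called once per 'year'; must return a single True or False.
    return (group['result'] == 'W').sum() >= 2

kept = toy.groupby('year').filter(has_at_least_two_wins)

# 2022 has only one win, so both of its rows are dropped.
# Every row from 2023 and 2024 is kept, with its original index label.
print(kept)
print(kept['year'].unique())
```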

### Pivot tables

- It's possible to use `groupby` on two columns at once, e.g. `df.groupby(['year', 'venue'])`.<br>The resulting operations are performed separately for every **unique combination** of values in the specified columns.<br><small>See Lecture 5 for examples. We can also group on three or more columns!</small>

In [None]:
# The index here is a MultiIndex.
df.groupby(['year', 'venue'])['UM_score'].mean()

- Rather than grouping on two columns at once, we can create a pivot table, which also shows the result of aggregating by two columns at once but in a more human-readable way.

In [None]:
df.pivot_table(index='year', columns='venue', values='UM_score', aggfunc='mean')

- The DataFrame above has:
    - One row for every unique value in `index`.
    - One column for every unique value in `columns`.
    - Values determined by applying `aggfunc` on values in `values`.

- Notice how the value **`12.50`** at **`(1924, 'Away')`** in our MultiIndex `groupby` result matches the corresponding value in our pivot table! Michigan scored 12.5 points away on average in 1924.
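- As a rough sketch of how the two results relate, the example below builds both from a made-up DataFrame and reshapes the MultiIndex `groupby` result with `unstack` so that it lines up with the pivot table. The data in `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical scores; only the column names mirror the real dataset.
toy = pd.DataFrame({
    'year': [2023, 2023, 2023, 2024, 2024, 2024],
    'venue': ['Home', 'Home', 'Away', 'Home', 'Away', 'Away'],
    'UM_score': [30, 40, 20, 27, 10, 24],
})

# Long form: one value per (year, venue) pair, stored under a MultiIndex.
long_form = toy.groupby(['year', 'venue'])['UM_score'].mean()

# Moving the 'venue' level of the index into the columns gives the wide form.
wide_from_groupby = long_form.unstack('venue')

# pivot_table produces the wide form directly.
wide_from_pivot = toy.pivot_table(
    index='year', columns='venue', values='UM_score', aggfunc='mean'
)

print(wide_from_groupby)
print(wide_from_pivot)
```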

### Merging

- We merge DataFrames when we have two sources of information **about the same individuals** that are **linked by one or more common columns**. <br>The common column(s) are called the **join key**.<br><small>In SQL, we call this operation a `JOIN`.</small>

- When we merge DataFrames, we have to specify _how_ to merge the DataFrames, in the (likely) event that there are rows in the two DataFrames that don't match.

<center><img src="imgs/all-joins.png" width=400></img></center>

- To help us visualize these operations, we'll redefine `df` to be a smaller subset of our original DataFrame.

In [None]:
# Smaller game DataFrame
df = pd.DataFrame({
    "year": [2024, 2024, 2023, 2023, 2022],
    "opponent": ["Washington", "Alabama", "Washington", "Michigan State", "Oregon"],
    "venue": ["Home", "Away", "Away", "Home", "Home"],
    "result": ["L", "W", "L", "W", "L"]
})

# Big Ten Teams DataFrame (5 rows)
big_10 = pd.DataFrame({
    "team": ["Minnesota", "Michigan State", "Washington", "Ohio State", "Oregon"],
    "coach": ["P.J. Fleck", "Mel Tucker", "Kalen DeBoer", "Ryan Day", "Dan Lanning"],
    "location": ["Minneapolis, MN", "East Lansing, MI", "Seattle, WA", "Columbus, OH", "Eugene, OR"]
})

dfs_side_by_side(df, big_10)


- An inner merge keeps only the rows where the keys match in both DataFrames. In this case, it keeps the games where Michigan played teams from the Big 10 (based on the `opponent` column in `df` and the `team` column in `big_10`). Notice how the resulting DataFrame includes all columns from both DataFrames, but only for rows where there's a match.


In [None]:
df.merge(big_10, left_on="opponent", right_on="team", how="inner")
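
- To compare the other merge types, the sketch below repeats the same merge with each value of `how` and counts the resulting rows. It uses the same rows as the small DataFrames above, trimmed to a few columns; the name `games` is used here in place of `df` only to keep the example self-contained.

```python
import pandas as pd

# Same rows as the small DataFrames above, with fewer columns.
games = pd.DataFrame({
    "year": [2024, 2024, 2023, 2023, 2022],
    "opponent": ["Washington", "Alabama", "Washington", "Michigan State", "Oregon"],
})
big_10 = pd.DataFrame({
    "team": ["Minnesota", "Michigan State", "Washington", "Ohio State", "Oregon"],
    "location": ["Minneapolis, MN", "East Lansing, MI", "Seattle, WA", "Columbus, OH", "Eugene, OR"],
})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = games.merge(big_10, left_on="opponent", right_on="team", how=how)
    row_counts[how] = merged.shape[0]

print(row_counts)  # {'inner': 4, 'left': 5, 'right': 6, 'outer': 7}
```

The left merge adds the unmatched Alabama game, the right merge adds the unmatched Minnesota and Ohio State rows, and the outer merge keeps all three. Washington appears in two games, so its `big_10` row is repeated in every result.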

### Merging Activity

<div class="alert alert-success">
    <h3>Activity</h3>

 Consider the following DataFrame, `evs`, which has 32 rows total.
 <center><div><img src="imgs/evs.png" width="800"></img></div></center>

 Also consider the following pivot table, which contains the distribution of `'BodyStyle'` for every `'Brand'` in `evs` other than Nissan.
 
 `evs.pivot_table(index='Brand', columns='BodyStyle', values='BodyStyle', aggfunc='count')`
 <center><div><img src="imgs/evs-pivot.png" width="600"></img></div></center>
Suppose we ran the following lines of code: 

`tesla = evs[evs.get("Brand") == "Tesla"]`

`bmw = evs[evs.get("Brand") == "BMW"]`

`audi = evs[evs.get("Brand") == "Audi"]`
<br><br>

**Question**: How many rows does the DataFrame `combo` have?

`combo = tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle")`
<br><br>

**Hint**: When we don't specify the `how` argument, `merge` defaults to an inner join.
</div>

<details>
    
<summary>Click <b>here</b> to see the answer to the previous activity.</summary>

### **Answer**: 35

First, we need to determine the number of rows in `tesla.merge(bmw, on="BodyStyle")`, and then determine the number of rows in `combo`. For the purposes of the solution, **let's use `temp` to refer to the first merged DataFrame, `tesla.merge(bmw, on="BodyStyle")`**.

### Step 1: Merging Tesla and BMW (`temp`)

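When merging two DataFrames with an inner join, the resulting DataFrame (`temp`) contains one row for every pair of rows, one from each DataFrame, whose values in the join column match; rows without a match are excluded. In this case, the join column is `BodyStyle`, so a body style that appears $a$ times in one DataFrame and $b$ times in the other contributes $a \times b$ rows.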

#### Matches Between Tesla and BMW:
- **SUV**: Tesla has $4$ rows for "SUV", and BMW has $1$ row for "SUV".  
  $$ 4 \times 1 = 4 \text{ SUV rows in } temp $$
- **Sedan**: Tesla has $3$ rows for "Sedan", and BMW has $1$ row for "Sedan".  
  $$ 3 \times 1 = 3 \text{ Sedan rows in } temp $$

Thus, the total number of rows in `temp` is:
$$ 4 + 3 = 7 \text{ rows} $$

### Step 2: Merging `temp` with Audi (`combo`)

Now, we merge `temp` with the Audi DataFrame, again on `BodyStyle`. Each row in `temp` matches with rows in Audi based on the same "BodyStyle".

#### Matches Between `temp` and Audi:
- **SUV**: `temp` has $4$ rows for "SUV", and Audi has $8$ rows for "SUV".  
  $$ 4 \times 8 = 32 \text{ SUV rows in } combo $$
- **Sedan**: `temp` has $3$ rows for "Sedan", and Audi has $1$ row for "Sedan".  
  $$ 3 \times 1 = 3 \text{ Sedan rows in } combo $$

Thus, the total number of rows in `combo` is:
$$ 32 + 3 = 35 $$

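**Note**: You may notice that 35 is what you get by multiplying the Tesla, BMW, and Audi counts within each of the "SUV" and "Sedan" columns of the pivot table, and then adding up the two products: $4 \times 1 \times 8 + 3 \times 1 \times 1 = 35$.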
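
As a quick check of the counting argument, the sketch below rebuilds stand-ins for `tesla`, `bmw`, and `audi` and runs the same two merges. The rows are hypothetical: only the "SUV" and "Sedan" counts are taken from the solution above, every other column of `evs` is left out, and `make_brand` is an illustrative helper, not part of the activity.

```python
import pandas as pd

def make_brand(brand, n_suv, n_sedan):
    # Hypothetical rows: only the SUV and Sedan counts come from the solution.
    return pd.DataFrame({
        'Brand': brand,
        'BodyStyle': ['SUV'] * n_suv + ['Sedan'] * n_sedan,
    })

tesla = make_brand('Tesla', 4, 3)
bmw = make_brand('BMW', 1, 1)
audi = make_brand('Audi', 8, 1)

# how defaults to 'inner', so each merge keeps one row per matching pair.
temp = tesla.merge(bmw, on='BodyStyle')
combo = temp.merge(audi, on='BodyStyle')

print(temp.shape[0], combo.shape[0])  # 7 35
```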

</details>

## Worksheet 📝

---