In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw03.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 3

# Grouping, Pivoting, and Merging

### EECS 398-003: Practical Data Science, Fall 2024

#### Due Thursday, September 19th at 11:59PM
    
</div>

## Instructions

Welcome to Homework 3! In this homework, you will practice core DataFrame methods introduced in Lectures 5 and 6 – grouping, merging, and pivoting, in particular. See the [Readings section of the Resources tab on the course website](https://practicaldsc.org/resources/#readings) for supplemental resources.

You are given six slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/fa24/). The [⚙️ Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps. Once you're done, you'll submit your completed notebook to Gradescope.

Please start early and submit often. You can submit as many times as you'd like to Gradescope, and we'll grade your **most recent** submission. Remember that the public `grader.check` tests in your notebook are not comprehensive, and that your work will also be graded on hidden test cases on Gradescope after the submission deadline.

This homework is worth a total of **54 points**, 45 of which come from the autograder, **and 9 of which are manually graded by us** (Questions 1.3, 6.1, and 6.2). The number of points each question is worth is listed at the start of each question. **The four parts of the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the import cell below, plus the cell at the top of the notebook that imports and initializes `otter`.

<a name='like-dataframe'>

</a>

<div class="alert alert-warning" markdown="1">
    
**Note**: Throughout this homework, you'll see statements like this frequently:

<blockquote>Complete the implementation of the function ____, which takes in a DataFrame <code>df</code> like <code>other_df</code> and _____.</blockquote>

What this means is that you should assume that `df` has the same number of columns as `other_df`, with the same column titles and data types, but potentially a different number of rows in a different order, with a potentially different index. You should always also assume that `df` has at least one row.

We have you implement functions like this to prevent you from hard-coding your answers to one specific dataset.

</div>
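One way to check that a function works on DataFrames "like" a given one is to call it on shuffled or sampled versions of the input. The sketch below uses a made-up DataFrame `toy` and a hypothetical function `total_by_key`; neither is part of this homework.

```python
import pandas as pd

# Hypothetical function and data, for illustration only.
def total_by_key(df):
    # Sum 'value' within each 'key'; groupby sorts the keys by default.
    return df.groupby("key")["value"].sum()

toy = pd.DataFrame({"key": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

# Same rows in a different order, with a reordered index.
shuffled = toy.sample(frac=1, random_state=0)
# Fewer rows than the original.
subset = toy.sample(2, random_state=0)

full = total_by_key(toy)
same_after_shuffle = total_by_key(shuffled).equals(full)
runs_on_subset = total_by_key(subset)
```

If a function hard-codes row positions or index labels, a check like this will usually expose the problem.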

<div class="alert alert-danger" markdown="1">

You **cannot** use any `for`-loops on this homework, and may lose points in certain questions for doing so!

</div>

In [None]:
import pandas as pd
import numpy as np

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

## Part 1: Presidential Elections 🇺🇸

---

The 2024 presidential election is on November 5th, and we're just a few days removed from the second presidential debate – a memorable one, to say the least.

<center><img src="imgs/debate.jpg" width=400>

</center>

<center><small>Kamala Harris (left) and Donald Trump (right), the Democratic and Republican candidates<br>in the 2024 presidential election, respectively.</small>
</center>

<br>

In this first part of the homework, we'll familiarize ourselves with two equally important facets of American society – how the DataFrame `groupby` method works and how presidential elections work. We'll gain this familiarity by working with voting data from the past 12 elections, starting from 1976 and going through 2020, when our current president was elected.

If you're not super familiar with the American political system, don't worry: [this brief article (along with the supplementary poster)](https://kz.usembassy.gov/summary-of-the-u-s-presidential-election-process/) has all of the context you need.

Run the cell below to load in a DataFrame, `votes`.

In [None]:
votes = pd.read_csv('data/elections/historical_votes.csv')
votes

Each row of `votes` tells us the number of votes for a particular presidential `'party'` and `'candidate'` in a particular `'state'` and `'year'`. For instance:
- The first row tells us that in 1976, 659170 voters in Alabama voted for Jimmy Carter, the Democrat `'candidate'`.
- The second-to-last row tells us that in 2020, 5768 voters in Wyoming voted for Jo Jorgensen, the Libertarian `'candidate'`.

Note that each party has only one presidential candidate in a given year. That means there is only one row in `votes` for any combination of `'year'`, `'state'`, and `'party'`.
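A claim like "one row per combination of columns" can be verified with the `duplicated` method. The sketch below runs on a small invented DataFrame (`toy`), not on `votes`, and deliberately includes one repeated combination so the check has something to find.

```python
import pandas as pd

# Invented data: the last two rows share the same (year, state, party).
toy = pd.DataFrame({
    "year": [2000, 2000, 2004, 2004],
    "state": ["A", "A", "A", "A"],
    "party": ["P", "Q", "P", "P"],
    "votes": [1, 2, 3, 4],
})

keys = ["year", "state", "party"]

# duplicated marks every row whose key combination appeared in an earlier row.
has_dupes = bool(toy.duplicated(subset=keys).any())

# Dropping the repeated row leaves a DataFrame with unique combinations.
first_three_unique = not toy.iloc[:3].duplicated(subset=keys).any()
```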

Let's get started!

### Question 1: Counting Votes 🗳️

In Question 1, you'll answer some preliminary questions to familiarize yourself with the dataset. **Don't** hard-code your answers; use `pandas` code to find them programmatically.

#### Question 1.1   <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Assign `michigan_total_2020` to the total number of votes cast in Michigan in the 2020 election. Your answer should be an integer.

In [None]:
michigan_total_2020 = ...
michigan_total_2020

In [None]:
grader.check("q01_1")

#### Question 1.2   <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Assign `votes_per_year` to a Series, indexed by `'year'`, containing the total number of votes cast each `'year'`. The Series should be sorted by the index (i.e. `'year'`) in ascending order.

Example behavior is given below.

```python
>>> votes_per_year.loc[1996]
95486860

>>> votes_per_year.iloc[3]
91496698
```
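The two lookups above differ in kind: `.loc` selects by index label, while `.iloc` selects by integer position. Here is a minimal sketch on an invented Series, `s`, whose values have nothing to do with `votes`.

```python
import pandas as pd

# Invented Series with integer labels that are not in sorted order.
s = pd.Series([5, 7, 9], index=[2001, 1999, 2000])

# Sorting by the index puts the labels in ascending order: 1999, 2000, 2001.
s_sorted = s.sort_index()

by_label = s_sorted.loc[2000]    # the value stored under the label 2000
by_position = s_sorted.iloc[0]   # the first value after sorting (label 1999)
```

With an integer index, the two are easy to confuse: `s_sorted.loc[0]` would raise a `KeyError` here, since no label equals 0.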

In [None]:
votes_per_year = ...
votes_per_year

In [None]:
grader.check("q01_2")

If you answered Question 1.2 correctly, then below, you should see a line chart depicting the number of votes cast each `'year'`.

In [None]:
(
    votes_per_year
    .plot(kind='line', title='Votes Cast in the Presidential Election Each Year')
    .update_layout(xaxis_title='Year', yaxis_title='Votes', showlegend=False)
)

Without looking at the data, one might guess that the number of people who vote each election only increases, since the population of the US increases considerably from year-to-year. However, a variety of factors play a role in determining voter _turnout_.

<!-- BEGIN QUESTION -->

#### Question 1.3   <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Identify **two** interesting takeaways from the graph above. Do a little bit of research on _why_ those historical events may have occurred, and write 1-2 sentences per takeaway about your findings (so 2-4 sentences total). You can use Google or ChatGPT to do your research, but you **must** write your answers in your own words. As a data scientist, you'll need to do this a lot – identify trends in data and try to make sense of them.

For example, one interesting takeaway – **which you cannot use** – is that fewer people voted in 1996 than in 1992. Some [research](https://www.csmonitor.com/1996/1016/101696.us.us.1.html) shows that this was likely because the economy, and the nation more generally, were relatively stable, meaning the public was generally happy with the job being done by the incumbent, leading to a lower sense of urgency to vote and change the status quo.

---

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

#### Question 1.4   <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Assign `rep_per_state` to a Series, indexed by `'state'`, containing the number of votes cast for the Republican `'party'` in each `'state'` in 2020. The Series should be sorted by number of votes in descending order, and **should only contain information for the 10 `'state'`s with the most votes for the Republican `'party'`**.

Example behavior is given below.

```python
>>> rep_per_state.shape[0]
10

>>> rep_per_state.loc['MICHIGAN']
2649852

>>> rep_per_state.iloc[-1]
2446891
```

In [None]:
rep_per_state = ...
rep_per_state

In [None]:
grader.check("q01_4")

### Question 2: Winners and Losers 🏆

In Question 1, we explored the number of individuals that voted in different periods of time and in different regions. But we didn't really attempt to find _who won_ the most votes in any particular year. That's what we'll work towards now.

In Question 2, your solutions will be more complex than they were in Question 1. You may find yourself using `groupby` on multiple columns, or even `groupby` multiple times, to solve a single subpart. Expect to have to create custom aggregation methods and use other grouping-related methods from [Lecture 6](https://practicaldsc.org/resources/lectures/lec06/lec06-filled.html). Think one step at a time, and don't just write a bunch of code and then run it – run your cells frequently to _understand_ what they're doing!
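As a warm-up, here is a sketch of the two ideas mentioned above: grouping on multiple columns, then grouping again with a custom aggregation function. The DataFrame `sales_toy` and the function `spread` are made up for illustration.

```python
import pandas as pd

# Invented data, unrelated to the homework's datasets.
sales_toy = pd.DataFrame({
    "region": ["east", "east", "east", "west", "west"],
    "item": ["pen", "pen", "ink", "pen", "ink"],
    "units": [3, 4, 10, 1, 6],
})

# Group on two columns at once; reset_index turns the group keys back into columns.
per_pair = sales_toy.groupby(["region", "item"])["units"].sum().reset_index()

# A custom aggregation: takes in a Series of values for one group, returns one number.
def spread(values):
    return values.max() - values.min()

# Group a second time, now applying the custom aggregation.
per_region = per_pair.groupby("region")["units"].agg(spread)
```

Running each step on its own and looking at the intermediate result (`per_pair` here) is a good way to see what each `groupby` is doing.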

#### Question 2.1   <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `votes_per_year_party`, which takes in a DataFrame `df` like `votes`, and returns a DataFrame with three columns: `'year'`, `'party'`, and `'votes'`, the latter of which contains the total number of votes cast for every unique combination of `'year'` and `'party'` in `df`.

As an example, a random subset of the rows in `votes_per_year_party(votes)` are given below, though note that `votes_per_year_party(votes)` should have many more rows than below. And, remember [from the top of the assignment](#like-dataframe) that `votes_per_year_party` needs to work on other DataFrames like `votes`, not just `votes` itself.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>year</th>
      <th>party</th>
      <th>votes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>56</th>
      <td>1984</td>
      <td>DEMOCRAT</td>
      <td>37449813</td>
    </tr>
    <tr>
      <th>230</th>
      <td>2004</td>
      <td>POPULIST</td>
      <td>23094</td>
    </tr>
    <tr>
      <th>238</th>
      <td>2004</td>
      <td>SOCIALIST WORKERS</td>
      <td>7493</td>
    </tr>
    <tr>
      <th>175</th>
      <td>2000</td>
      <td>DEMOCRAT</td>
      <td>49662314</td>
    </tr>
    <tr>
      <th>355</th>
      <td>2016</td>
      <td>WORKERS WORLD PARTY</td>
      <td>3519</td>
    </tr>
  </tbody>
</table>

<center>A random sample of the rows in <code>votes_per_year_party(votes)</code>.</center>

<br>
The index and order of the resulting DataFrame do not matter.

In [None]:
def votes_per_year_party(df):
    ...

# Feel free to change this input to make sure your function works correctly.
# A good strategy is to make sure it works when you call it on a random subset of votes,
# e.g. votes_per_year_party(votes.sample(100)).
votes_per_year_party(votes)

In [None]:
grader.check("q02_1")

#### Question 2.2   <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `above_threshold`, which takes in a DataFrame `df` like `votes` and a positive integer `threshold`, and returns a **list** containing the years in which at least `threshold` total votes were cast in the election. The returned list should be sorted in ascending order. Example behavior is given below.

```python
>>> above_threshold(votes, 125_000_000)
[2008, 2012, 2016, 2020]

>>> above_threshold(votes, 200_000_000)
[]
```

In [None]:
def above_threshold(df, threshold):
    ...

# Feel free to change this input to make sure your function works correctly.
above_threshold(votes, 125_000_000)

In [None]:
grader.check("q02_2")

#### Question 2.3   <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Complete the implementation of the function `popular_vote_winners`, which takes in a DataFrame `df` like `votes`, and returns a DataFrame, indexed by `'year'`, with three columns:
- `'party'`, which contains the name of the `'party'` who won the most votes across all `'state'`s that `'year'`.
- `'votes'`, which contains the total number of votes won by the `'party'` who won the most votes across all `'state'`s that `'year'`.
- `'vote_prop'`, which contains the proportion of votes won by the `'party'` who won the most votes across all `'state'`s that `'year'`. This is the fraction:

$$\text{vote prop} = \frac{\text{total number of votes cast for the party with the most votes this year}}{\text{total number of votes cast this year}}$$

The resulting DataFrame should be sorted by the index (i.e. `'year'`) in ascending order. Example behavior is given below.

```python
# In other words, the last two rows of popular_vote_winners(votes) should look like the example DataFrame below.
>>> popular_vote_winners(votes).tail(2)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>party</th>
      <th>votes</th>
      <th>vote_prop</th>
    </tr>
    <tr>
      <th>year</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2016</th>
      <td>DEMOCRAT</td>
      <td>65677168</td>
      <td>0.485258</td>
    </tr>
    <tr>
      <th>2020</th>
      <td>DEMOCRAT</td>
      <td>81268908</td>
      <td>0.513616</td>
    </tr>
  </tbody>
</table>


***Hint***: We defined a helper function that takes in a DataFrame that only has rows for a particular `'year'`, and returns the `'party'`, number of `'votes'`, and `'vote_prop'` earned by the `'party'` with the most total votes in that DataFrame. One possible solution is to create such a helper function yourself, and then use the `apply` method on a `DataFrameGroupBy` object with that helper function as the input. See the [documentation for the `apply` method](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html) for examples.
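To illustrate the pattern in the hint on unrelated data, the sketch below applies a helper function to each group and returns one row of summary values per group. The DataFrame `scores` and the helper `top_scorer` are invented; selecting the non-grouping columns before calling `apply` is one possible choice, not a requirement.

```python
import pandas as pd

# Invented data, unrelated to votes.
scores = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "player": ["ann", "bo", "cy", "di", "ed"],
    "points": [10, 30, 5, 5, 40],
})

def top_scorer(group):
    # group is a DataFrame containing the rows for a single team.
    best = group.sort_values("points", ascending=False).iloc[0]
    return pd.Series({
        "player": best["player"],
        "points": best["points"],
        "share": best["points"] / group["points"].sum(),
    })

# apply calls top_scorer once per team; returning a Series from the helper
# produces a DataFrame with one row per team and one column per Series label.
result = scores.groupby("team")[["player", "points"]].apply(top_scorer)
```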

In [None]:
def popular_vote_winners(df):
    ...

# Feel free to change this input to make sure your function works correctly.
popular_vote_winners(votes)

In [None]:
grader.check("q02_3")

### Question 3: The Electoral College 🗺️

Let's take a look at the last two rows of `popular_vote_winners(votes)` once again:

In [None]:
popular_vote_winners(votes).tail(2)

If you answered Question 2.3 correctly, you'll see above that in 2016, the Democratic `'party'` won the most votes of any `'party'`, with 48.5\% of the total vote. But, the Democratic `'party'` **did not** actually win the 2016 presidential election – the Republican `'party'` did.

The reason for this is the Electoral College, which is explained in sufficient detail in the [article that was linked before](https://kz.usembassy.gov/summary-of-the-u-s-presidential-election-process/). In short, there are 538 Electoral College votes total, and each `'state'` (plus Washington D.C., which is treated like a `'state'` for the purposes of the Electoral College) is assigned some number of Electoral College votes.

The `'candidate'` that wins the most votes in a particular `'state'` wins **all** of the Electoral College votes assigned to that `'state'`*. For example, in 2020, the Democratic `'party'` only won 50.6% of the vote in Michigan, but because this was more than any other `'party'` won in Michigan, the Democratic `'party'` took all 16 Electoral College votes assigned to Michigan.

So, in the 2016 election, even though the Democratic `'party'` won more votes overall – i.e. they won the "popular vote" – they won fewer Electoral College votes, and so they lost the election to the Republicans. To win the election, a `'party'` needs to win at least 270 of the 538 Electoral College votes. (Why 270? $\frac{538}{2} = 269$, so if both `'party'`s won 269 votes, there would be a tie.)
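The arithmetic behind the 270 threshold can be written out in a few lines. The variable names below are chosen for illustration.

```python
# 538 is the total number of Electoral College votes stated above.
TOTAL_ELECTORAL_VOTES = 538

# An even split gives each of two parties exactly half.
tie_share = TOTAL_ELECTORAL_VOTES // 2

# One more than half is the smallest count that guarantees a win.
votes_to_win = tie_share + 1
```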

_*Caveat: This is not exactly how the Electoral College works in Nebraska and Maine, but for simplicity, we will assume that these two `'state'`s work the same way as all other `'state'`s, in that they give all of their Electoral College votes to the `'candidate'` that won the most votes in their `'state'`._

Run the cell below to load in a DataFrame, `ec`, which contains the number of Electoral College votes assigned to each `'state'` in 2020. (Note that the number of Electoral College votes assigned to each `'state'` changes every 10 years, when the US conducts the Census. The number of Electoral College votes per `'state'` is different in 2024 than it was in 2020.)

In [None]:
ec = pd.read_csv('data/elections/electoral_college.csv')
ec

Right now, this DataFrame is separate from the `votes` DataFrame that we've been working with. We'll combine it for you here, but you'll get your own hands-on practice with the DataFrame `merge` method in the next part of the homework.

Run the cell below to define a new DataFrame, `combined`, which results from merging the rows in `votes` specific just to 2020 with `ec`.

In [None]:
combined = (
    votes[votes['year'] == 2020]
    .merge(ec, left_on='state_ab', right_on='Abb_State')
    [['state', 'state_ab', 'candidate', 'party', 'votes', 'Electoral_College_Votes']]
)
combined
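The `left_on` and `right_on` arguments used above are needed when the key columns have different names in the two DataFrames. Here is a small sketch with invented DataFrames `left` and `right`; by default, `merge` performs an inner join, so only keys present in both survive.

```python
import pandas as pd

# Invented data; the key column is named differently in each DataFrame.
left = pd.DataFrame({"code": ["AA", "BB", "CC"], "count": [1, 2, 3]})
right = pd.DataFrame({"Abbrev": ["AA", "BB", "DD"], "weight": [10, 20, 40]})

# 'CC' has no match in right and 'DD' has no match in left, so both are dropped.
merged = left.merge(right, left_on="code", right_on="Abbrev")
```

Both key columns appear in the result, which is why the cell above selects a subset of columns after merging.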

Now, `combined` contains enough information to determine the Electoral College winner in a particular election. **We'll focus just on 2020**, since that's the only year we've loaded in Electoral College vote allocations for.

#### Question 3.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `ec_results_per_state`, which takes in a DataFrame `df` like `combined` and returns a DataFrame with one row per `'state'` in `df` and 5 columns:
- `'state'`, the name of the `'state'`.
- `'state_ab'`, the abbreviation of `'state'`.
- `'party'`, the party that won the most votes in `'state'` in 2020.
- `'votes'`, the number of (actual, human) votes won by `'party'` in `'state'` in 2020.
- `'Electoral_College_Votes'`, the number of Electoral College votes assigned to `'state'` in 2020.

The resulting DataFrame should be sorted by `'state'` in ascending order.

Example behavior is given below.

```python
>>> ec_results_per_state(combined).tail(2)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>state</th>
      <th>state_ab</th>
      <th>party</th>
      <th>votes</th>
      <th>Electoral_College_Votes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>49</th>
      <td>WISCONSIN</td>
      <td>WI</td>
      <td>DEMOCRAT</td>
      <td>1630866</td>
      <td>10</td>
    </tr>
    <tr>
      <th>50</th>
      <td>WYOMING</td>
      <td>WY</td>
      <td>REPUBLICAN</td>
      <td>193559</td>
      <td>3</td>
    </tr>
  </tbody>
</table>

Remember, `ec_results_per_state` will be tested on other DataFrames like `combined`, not just `combined` itself!

In [None]:
def ec_results_per_state(df):
    ...

# Feel free to change this input to make sure your function works correctly.
ec_results_per_state(combined)

In [None]:
grader.check("q03_1")

#### Question 3.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">1 Point</div>

Complete the implementation of the function `ec_totals`, which takes in a DataFrame `df` like `combined` and returns a Series, indexed by `'party'`, containing the total number of Electoral College votes won by each `'party'`.

In [None]:
def ec_totals(df):
    ...

# Feel free to change this input to make sure your function works correctly.
ec_totals(combined)

In [None]:
grader.check("q03_2")

#### Question 3.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Now's the (arguably) fun part. To wrap up our foray into voting data, we will create a choropleth, a kind of map that colors different regions in different colors.

Specifically, complete the implementation of the function `draw_choropleth`, which takes in the DataFrame `combined` (no other DataFrame) and returns a `plotly` figure object containing a [choropleth](https://en.wikipedia.org/wiki/Choropleth_map) of the United States in which each `'state'` is colored either blue or red, depending on whether the Democratic `'party'` or Republican `'party'` won the most votes in that `'state'` in 2020.

An example of what the graph `draw_choropleth(combined)` should look like is below.

<center>

<img src="imgs/repl.png" width=600>
    
</center>

Some added guidance and requirements:
- The [`plotly` choropleth documentation](https://plotly.com/python/choropleth-maps/) is excellent. It has many examples, which you can use to tweak several aspects of your plot.<br><br>
- Your plot must have the abbreviation of each `'state'` plotted on top of the `'state'`, along with the number of Electoral College votes assigned to that `'state'` in 2020. Use the `figure` `.add_scattergeo` method to do this after you create the rest of your choropleth; you'll find examples online of how this works.
    - You'll also need to figure out how to create a Series of strings, in which each string contains both the abbreviation of a `'state'` and its number of Electoral College votes.
    - To prevent the map from getting too crowded, we hid the annotations for New Hampshire, Connecticut, Rhode Island, Washington D.C., Maryland, and Delaware; you don't have to do this, but it's a good idea.<br><br>
- Your plot must use one color for the Democratic `'party'` and one color for the Republican `'party'`. The colors you choose do not matter, except for the fact that **you cannot use the default colors that `px.choropleth` uses**. (If you want to use the politically accurate colors, as we did above, the Republicans use red and the Democrats use blue.) You'll need to either read the documentation or do some Googling to figure out how to change colors for different categories, but as a hint, these can be set using an argument to `px.choropleth` (i.e. you don't need to use `fig.update_layout`).<br><br>
- **Question 3.3 has no hidden tests, so as long as you pass the public tests here, you'll receive full credit for it.** Everything that isn't mentioned above but is in our example plot (e.g. a text color of white instead of black, or a different font) is optional – but make your plot as pretty as you can!
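As a hint for the labeling requirement above, string columns can be combined with numeric columns element-wise once the numbers are converted to strings. The sketch below uses an invented DataFrame `toy`; the column names and the `"AA (3)"` label format are assumptions, not the format used in our example plot.

```python
import pandas as pd

# Invented data; column names are placeholders.
toy = pd.DataFrame({"abbr": ["AA", "BB"], "ev": [3, 11]})

# astype(str) converts the integers to strings so that + concatenates element-wise.
labels = toy["abbr"] + " (" + toy["ev"].astype(str) + ")"
```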

<br>

***Note***: You _may_ need to use a `for`-loop or list comprehension in your implementation of `draw_choropleth` to hide particular labels, and that's okay. (Not for any other question on this homework, though!)

In [None]:
def draw_choropleth(combined):
    ...

# When you're ready to submit your homework, please comment the line below out;
# otherwise, we won't be able to manually grade your work.
draw_choropleth(combined)

In [None]:
grader.check("q03_3")

Nice work! You're now well equipped to create your own political choropleths.

If you want a challenge, once you've finished Homework 3, see if you can adjust the choropleth so that the intensity (darkness) of each `'state'`'s color depends on the proportion of its votes that went to the winning `'party'`. For example, since the Democratic `'party'` won 92% of the vote in Washington D.C. but only 50.6% of the vote in Michigan, Washington D.C. should appear much darker blue than Michigan.

## Part 2: Combining Data 🫂

---

In Part 2 of this homework, you'll practice combining multiple DataFrames together. You'll want to review our treatment of the `merge` method from Lecture 6.

### Question 4: Paw Patrol 🐾

In this question, you'll analyze data from a veterinarian clinic in Michigan. The datasets contain several types of information from the clinic, including its customers (pet owners), pets, available procedures, and procedure history. The column names are self-explanatory. These DataFrames are provided to you:
-  `owners` stores the customer information, where every `'OwnerID'` is unique (verify this yourself).
-  `pets` stores the pet information. Each pet belongs to a customer in `owners`.
-  `procedure_detail` contains a catalog of procedures that are offered by the clinic.
-  `procedure_history` has procedure records. Most procedures were given to a pet in `pets`.

We define each DataFrame below and show the first five rows of each DataFrame. **Do not** modify any of these DataFrames directly!

In [None]:
owners = pd.read_csv('data/pets/owners.csv')
owners.head()

In [None]:
pets = pd.read_csv('data/pets/pets.csv')
pets.head()

In [None]:
procedure_detail = pd.read_csv('data/pets/procedures_details.csv')
procedure_detail.head()

In [None]:
procedure_history = pd.read_csv('data/pets/procedures_history.csv')
procedure_history.head()

Each of the following three parts asks you to answer a particular question about the data by implementing a function.

**Note**: Unlike in Parts 1 or 3, when we say (for example) that a function takes in DataFrames `a` and `b`, you can assume that your function will only ever be called on `a` and `b` exactly, not other DataFrames "like" `a` and `b`.

#### Question 4.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

> What is the most popular `'ProcedureType'` amongst all pets in the `pets` DataFrame? 

Complete the implementation of the function `most_popular_procedure`, which takes in two DataFrames, `pets` and `procedure_history`, and returns the name of the most popular `'ProcedureType'` (among all pets in `pets`) as a string.

Note that some pets are registered but haven't had any procedures performed. Also, some pets that have had procedures done are not registered in `pets`.

Here, you can assume that the DataFrames given to `most_popular_procedure` are `pets` and `procedure_history` exactly as they're defined in your notebook.
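When keys are missing on one side or the other, as described above, the type of merge determines which rows survive. The sketch below uses invented DataFrames `registered` and `visits` (not the clinic data) and the `indicator=True` argument of `merge` to show where each row came from.

```python
import pandas as pd

# Invented data: pets 2 and 3 have no visits, and pet 4 has a visit but is not registered.
registered = pd.DataFrame({"pet_id": [1, 2, 3], "name": ["Rex", "Tom", "Ivy"]})
visits = pd.DataFrame({"pet_id": [1, 1, 4], "kind": ["checkup", "vaccine", "checkup"]})

# Inner merge (the default): keeps only pet_ids found in both DataFrames.
inner = registered.merge(visits, on="pet_id")

# Outer merge: keeps everything; the '_merge' column says which side each row came from.
outer = registered.merge(visits, on="pet_id", how="outer", indicator=True)
source_counts = outer["_merge"].value_counts()
```

Which type of merge is appropriate depends on which rows the question asks you to keep.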

In [None]:
def most_popular_procedure(pets, procedure_history):
    ...

most_popular_procedure(pets, procedure_history)

In [None]:
grader.check("q04_1")

#### Question 4.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

> What is the name of each customer's pet(s)?

Complete the implementation of the function `pet_name_by_owner`, which takes in two DataFrames, `owners` and `pets`, and returns a Series whose index contains owner first names, and whose values are pet names as **strings**. If an owner has multiple pets, the value corresponding to that owner should instead be a **list of pet names as strings**.

Note that owner first names are not necessarily unique, and so the Series you return will not necessarily have a unique index.
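To see what a non-unique index means in practice, consider the invented Series `s` below: looking up a repeated label with `.loc` returns every matching entry rather than a single value.

```python
import pandas as pd

# Invented Series in which the label 'ann' appears twice.
s = pd.Series(["x", "y", "z"], index=["ann", "bob", "ann"])

index_is_unique = s.index.is_unique   # False, since 'ann' repeats
ann_values = s.loc["ann"]             # a Series with two entries
bob_value = s.loc["bob"]              # a single string
```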

In [None]:
def pet_name_by_owner(owners, pets):
    ...

pet_name_by_owner(owners, pets)

In [None]:
grader.check("q04_2")

#### Question 4.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Note that the `owners` DataFrame has a `'City'` column, describing the city in which each pet owner and their pets live.

> How much did each city spend in total on procedures?

Complete the implementation of the function `total_cost_per_city`, which takes in four DataFrames, `owners`, `pets`, `procedure_history`, and `procedure_detail`, and returns a Series indexed by `'City'` that describes the total amount that each city has spent on pets' procedures.

Some guidance:
- At some point, you may have to `merge` on multiple columns.
- Some owners may have never visited the veterinarian clinic in their city. This means some cities may have zero operational costs. **These cities should still appear in your output!**
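Both pieces of guidance can be seen on toy data. In the sketch below, the DataFrames `customers`, `orders`, and `prices` and all of their column names are invented: a price depends on two columns at once, and one customer has no orders at all.

```python
import pandas as pd

# Invented data, unrelated to the clinic DataFrames.
customers = pd.DataFrame({"customer": ["u1", "u2", "u3"], "town": ["X", "Y", "Z"]})
orders = pd.DataFrame({
    "customer": ["u1", "u1", "u2"],
    "kind": ["a", "b", "a"],
    "size": ["s", "s", "l"],
})
prices = pd.DataFrame({
    "kind": ["a", "a", "b"],
    "size": ["s", "l", "s"],
    "price": [10, 20, 5],
})

# The price is identified by the pair (kind, size), so merge on both columns.
priced = orders.merge(prices, on=["kind", "size"])

# A left merge keeps u3, who has no orders; their price is missing (NaN).
# By default, sum skips missing values, so town Z ends up with a total of 0.
totals = (
    customers
    .merge(priced, on="customer", how="left")
    .groupby("town")["price"]
    .sum()
)
```

With an inner merge in the last step, town `'Z'` would not appear in `totals` at all.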

In [None]:
def total_cost_per_city(owners, pets, procedure_history, procedure_detail):
    ...

total_cost_per_city(owners, pets, procedure_history, procedure_detail)

In [None]:
grader.check("q04_3")

## Part 3: Pivot Tables 🕺

---

In Part 3 of this homework, you'll get better at using the DataFrame `pivot_table` method. Recall from [Lecture 6](https://practicaldsc.org/resources/lectures/lec06/lec06-filled.html#Pivot-tables-using-the-pivot_table-method) that a pivot table allows you to aggregate the entries in a DataFrame based on two categorical columns. If it helps, [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html) is another great resource that provides an overview of `pivot_table` with many examples from the Titanic dataset.
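As a quick refresher, here is a sketch of `pivot_table` on an invented DataFrame, `rain`. One categorical column becomes the index, the other becomes the columns, and each cell aggregates the matching rows.

```python
import pandas as pd

# Invented data: city 'Y' has no row for the 'fall' season.
rain = pd.DataFrame({
    "city": ["X", "X", "X", "Y"],
    "season": ["fall", "fall", "spring", "spring"],
    "mm": [10, 5, 7, 2],
})

# One row per city, one column per season, with the total 'mm' in each cell.
table = rain.pivot_table(index="city", columns="season", values="mm", aggfunc="sum")
```

Combinations that never occur in the data, like `('Y', 'fall')` here, show up as missing values in the result.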

### Question 5: Summarizing Sales 💰

In this question, you'll analyze sales data for a (hypothetical) franchise with several locations in Metro Detroit. Each row tells us the amount (`'Total'`) that a customer (`'Name'`) spent at a particular location (`'Store'`) on a particular date (`'Date'`).

In [None]:
sales = pd.read_csv('data/sales.csv')
sales.head()

Before starting the question, do some preliminary analyses of your own. Try and answer questions like:
- How many rows are in `sales`?
- How many unique customers are there?
- What are the possible values of `'Store'`?
- What are the earliest and latest `'Date'` values across all rows?

<!-- **We have provided outlines for the DataFrames you need to create in this question, but yours may have a different number of rows and columns and different values.** -->

In [None]:
# Explore here.

#### Question 5.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `average_per_customer`, which takes in a DataFrame `df` **like** `sales` and returns a DataFrame, indexed by `'Name'`, with a single column, `'Average Transaction'`, which contains the average transaction price for each customer in `df`. The resulting DataFrame should be sorted by the index (`'Name'`) in ascending order.

Example behavior is given below.

```python
>>> average_per_customer(sales).head(3)
```

<table border="1" class="dataframe" style="text-align: left;">
  <thead>
    <tr>
      <th></th>
      <th>Average Transaction</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Gill</th>
      <td>1534.616162</td>
    </tr>
    <tr>
      <th>Hoffmeyer</th>
      <td>1529.581395</td>
    </tr>
    <tr>
      <th>Junior</th>
      <td>1392.861111</td>
    </tr>
  </tbody>
</table>


***Note***: You may be able to implement `average_per_customer` without using `pivot_table`, and that's totally fine.

In [None]:
def average_per_customer(df):
    ...

# Feel free to change this input to make sure your function works correctly.
# Remember that we may test your function on inputs like sales,
# e.g. sales.sample(100), so make sure it works there too!
average_per_customer(sales)

In [None]:
grader.check("q05_1")

#### Question 5.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

Complete the implementation of the function `store_and_customer`, which takes in a DataFrame `df` like `sales` and returns a DataFrame, indexed by `'Name'`, that has one column for each `'Store'`. The DataFrame should describe the total amount each customer spent at each `'Store'`. If a particular customer didn't spend any money at a particular store, you will have missing values; **don't** fill these in.

Example behavior is given below. We've intentionally hidden the true values that this DataFrame should contain, but the structure of your DataFrame – when called on the full `sales` DataFrame – should be the same as below.

```python
>>> store_and_customer(sales)
```

<table border="1" class="dataframe" style="text-align: left">
  <thead>
    <tr style="text-align: right;">
      <th>Store</th>
      <th>12 Oaks</th>
      <th>Birch Run</th>
      <th>Briarwood</th>
      <th>Great Lakes</th>
      <th>Oakland</th>
      <th>Somerset</th>
      <th>Westland</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Gill</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Hoffmeyer</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Junior</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Kheterpal</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Li</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Pratapa</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Rampure</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Rex</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Uppalapati</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <th>Zhuang</th>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
      <td>-</td>
    </tr>
  </tbody>
</table>

In [None]:
def store_and_customer(df):
    ...

# Feel free to change this input to make sure your function works correctly.
store_and_customer(sales)

In [None]:
grader.check("q05_2")

#### Question 5.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Complete the implementation of the function `transactions_per_store`, which takes in a DataFrame `df` like `sales` and returns a DataFrame, indexed by **both** `'Store'` and `'Name'`, that contains the number of transactions made per `'Date'` at each location by each customer. Replace `NaN`s with 0s, and don't reset the index after pivoting. The order of the rows and columns doesn't matter.

Example behavior is given below.

```python
# For instance, this is saying that
# Rampure made two transactions at Somerset on
# August 13, 2024.
>>> transactions_per_store(sales.head(8))
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Date</th>
      <th>01.14.2024</th>
      <th>01.15.2023</th>
      <th>04.09.2024</th>
      <th>08.05.2023</th>
      <th>08.13.2024</th>
      <th>09.25.2023</th>
      <th>10.24.2023</th>
    </tr>
    <tr>
      <th>Store</th>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="3" valign="top">12 Oaks</th>
      <th>Pratapa</th>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Rex</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>Zhuang</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Briarwood</th>
      <th>Kheterpal</th>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">Somerset</th>
      <th>Kheterpal</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Rampure</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>2</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Westland</th>
      <th>Pratapa</th>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

In [None]:
def transactions_per_store(df):
    ...

# Feel free to change this input to make sure your function works correctly.
transactions_per_store(sales.head(8))

In [None]:
grader.check("q05_3")

#### Question 5.4 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `total_per_month`, which takes in a DataFrame `df` like `sales` and returns a DataFrame, indexed by **both** `'Store'` and `'Name'`, that contains the total **amount** spent per **month** at each location by each customer. Replace `NaN`s with 0s, and don't reset the index after pivoting. The order of the rows and columns doesn't matter.

Example behavior is given below.

```python
>>> total_per_month(sales.head(8))
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Month</th>
      <th>April</th>
      <th>August</th>
      <th>January</th>
      <th>October</th>
      <th>September</th>
    </tr>
    <tr>
      <th>Store</th>
      <th>Name</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="3" valign="top">12 Oaks</th>
      <th>Pratapa</th>
      <td>0</td>
      <td>0</td>
      <td>1845</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Rex</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>2503</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Zhuang</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>346</td>
    </tr>
    <tr>
      <th>Briarwood</th>
      <th>Kheterpal</th>
      <td>392</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th rowspan="2" valign="top">Somerset</th>
      <th>Kheterpal</th>
      <td>0</td>
      <td>2781</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Rampure</th>
      <td>0</td>
      <td>2529</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>Westland</th>
      <th>Pratapa</th>
      <td>0</td>
      <td>0</td>
      <td>199</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

***Hint***: At no point should you need to manually parse `'Date'`s and manually map from 1 to January, 2 to February, and so on. Look into the [`pd.to_datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function. Once you use it, a single StackOverflow post has the one-line solution to extracting months.
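
To get a feel for `pd.to_datetime`, here is a minimal sketch on a couple of made-up date strings. The `raw` Series and its values are hypothetical, and the `format` string assumes a month.day.year layout; check how the dates in `sales` are actually written before reusing it.

```python
import pandas as pd

# Hypothetical date strings in a month.day.year layout.
raw = pd.Series(['03.07.2021', '11.30.2022'])

# format= spells out the layout explicitly, so '03.07.2021'
# is read as March 7, 2021 rather than July 3, 2021.
parsed = pd.to_datetime(raw, format='%m.%d.%Y')
parsed
```

The result is a Series with a datetime dtype, whose date components are available through the `.dt` accessor.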

In [None]:
def total_per_month(df):
    df = df.copy() # Leave this here – you'll probably need to make destructive modifications to the input.
    ...

# Feel free to change this input to make sure your function works correctly.
total_per_month(sales.head(8))

In [None]:
grader.check("q05_4")

## Part 4: Simpson's Paradox 🧐

---

In this class, we're not just teaching you how to wrangle DataFrames, but also how to think critically with data. In this final part of the homework, you'll explore a statistical phenomenon that appears when working with aggregated data – Simpson's paradox. **You won't write any code in this part – instead, you'll type out some math.**

First, let's walk through an illustrative example. Consider two students, Lisa and Bart, who just finished their first three semesters at Michigan. They both took a different number of classes in Winter 2023, Fall 2023, and Winter 2024. **Each semester, Lisa had a higher GPA than Bart, but overall, Bart has a higher GPA. How is this possible? 🤔**

Run this cell to create example DataFrames that contain each student's grades.

In [None]:
lisa = pd.DataFrame([[20, 46], [18, 54], [5, 20]],
    columns=['Credits', 'Grade Points Earned'], 
    index=['WI23', 'FA23', 'WI24'],
)
lisa.columns.name = 'Lisa' # This allows us to see the name "Lisa" in the top left of the DataFrame.

bart = pd.DataFrame([[5, 10], [5, 13.5], [22, 81.4]],
    columns=['Credits', 'Grade Points Earned'], 
    index=['WI23', 'FA23', 'WI24'],
)
bart.columns.name = 'Bart'

In [None]:
lisa

In [None]:
bart

The number of "grade points" earned for a course is:

$$\text{number of credits} \cdot \text{grade (out of 4)}$$

For instance, an A- in a 4-credit course earns $3.7 \cdot 4 = 14.8$ grade points. Your GPA, then, is the **weighted** average of your course grades, where the weight of each course grade is the number of credits the course is worth.
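
To make the weighting concrete, here is a small sketch that computes a GPA from a made-up list of courses. The `courses` list is hypothetical and is not part of the Lisa and Bart example.

```python
# Hypothetical courses for one semester: (credits, grade out of 4).
courses = [(4, 3.7), (3, 4.0), (1, 2.0)]

# Grade points for a course are credits times grade.
grade_points = sum(credits * grade for credits, grade in courses)
total_credits = sum(credits for credits, _ in courses)

# The GPA is total grade points divided by total credits.
gpa = grade_points / total_credits

# For comparison, the unweighted mean treats every course equally.
unweighted_mean = sum(grade for _, grade in courses) / len(courses)

round(gpa, 2), round(unweighted_mean, 2)
```

Here the GPA is about 3.6, while the unweighted mean of the three grades is about 3.23, because the 1-credit course with the low grade counts for less in the weighted average.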

In our example data, Lisa has a higher GPA in all three semesters:

In [None]:
semesterly_gpas = pd.DataFrame({
    "Lisa's Semester GPA": lisa['Grade Points Earned'] / lisa['Credits'],
    "Bart's Semester GPA": bart['Grade Points Earned'] / bart['Credits'],
})

semesterly_gpas

But, overall, Bart has a higher GPA:

In [None]:
lisa['Grade Points Earned'].sum() / lisa['Credits'].sum()

In [None]:
# Higher than above!
bart['Grade Points Earned'].sum() / bart['Credits'].sum()

How did this happen? Let's take a look at all of our information together:

In [None]:
(
    semesterly_gpas
    .assign(Lisa_Units=lisa['Credits'], Bart_Units=bart['Credits']) 
    .iloc[:, [0, 2, 1, 3]]
)

When both students performed poorly, Lisa took more credits than Bart, **which brought down 📉 Lisa's overall average**. On the other hand, when both students performed well, Bart took more credits than Lisa, **which brought up 📈 Bart's overall average**.
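
Written out as weighted averages, using the credits and semester GPAs from the `lisa` and `bart` DataFrames above, the two overall GPAs work out roughly as follows:

$$\text{Lisa's overall GPA} = \frac{20 \cdot 2.3 + 18 \cdot 3.0 + 5 \cdot 4.0}{20 + 18 + 5} = \frac{120}{43} \approx 2.79$$

$$\text{Bart's overall GPA} = \frac{5 \cdot 2.0 + 5 \cdot 2.7 + 22 \cdot 3.7}{5 + 5 + 22} = \frac{104.9}{32} \approx 3.28$$

Lisa's lowest semester GPA carries her largest weight (20 of 43 credits), while Bart's highest semester GPA carries his largest weight (22 of 32 credits).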

This phenomenon is known as Simpson's paradox. Specifically, Simpson's paradox is when **grouped and ungrouped data show opposing trends**. It's named after Edward H. Simpson, a statistician, not Lisa or Bart Simpson. It typically occurs when there is a hidden factor (i.e. a confounder) within the data that influences results.

If you'd like to read more about Simpson's paradox, [here's a great article](https://statisticsbyjim.com/basics/simpsons-paradox/).

But now, it's time for your task.

### Question 6: Save the Pets 🐶

Kyle is a veterinarian. Below, you'll find information about some of the dogs in his care, separated by district and breed.

<table style="border-collapse: collapse; width: 500; text-align: left;">
  <thead>
    <tr>
      <th colspan="2"></th>
      <th colspan="2" style="text-align: center; font-weight: bold; padding: 10px;">Golden Retriever</th>
      <th colspan="2" style="text-align: center; font-weight: bold; padding: 10px;">German Shepherd</th>
    </tr>
    <tr>
      <th></th>
      <th></th>
      <th style="text-align: center; font-weight: bold; padding: 10px;">Mean Weight</th>
      <th style="text-align: center; font-weight: bold; padding: 10px;">Count</th>
      <th style="text-align: center; font-weight: bold; padding: 10px;">Mean Weight</th>
      <th style="text-align: center; font-weight: bold; padding: 10px;">Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="font-weight: bold; padding: 10px;">District 1</td>
      <td></td>
      <td style="text-align: center; padding: 10px;">30</td>
      <td style="text-align: center; padding: 10px;">4</td>
      <td style="text-align: center; padding: 10px;">20</td>
      <td style="text-align: center; padding: 10px;">3</td>
    </tr>
    <tr>
      <td style="font-weight: bold; padding: 10px;">District 2</td>
      <td></td>
      <td style="text-align: center; padding: 10px;">45</td>
      <td style="text-align: center; padding: 10px;">1</td>
      <td style="text-align: center; padding: 10px;">$a$</td>
      <td style="text-align: center; padding: 10px;">$b$</td>
    </tr>
  </tbody>
</table>


<!-- BEGIN QUESTION -->

#### Question 6.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div>

What is the mean weight of all Golden Retrievers in Kyle's care? Type your final answer in the cell below, and **show your work**. While not required, this is a good opportunity to learn how to use [LaTeX](https://pages.uoregon.edu/torrence/391/labs/LaTeX-cheat-sheet.pdf). For example, here's how we might format a formula (double-click this cell to see how we did it):

$$3 + \frac{2}{5}$$

---

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

#### Question 6.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Find **integers** $a$ and $b$ such that Simpson's paradox occurs in this specific way:
- The mean weight of Golden Retrievers in District 1 is **greater than** the mean weight of German Shepherds in District 1, and
- The mean weight of Golden Retrievers in District 2 is **greater than** the mean weight of German Shepherds in District 2, but
- The mean weight of Golden Retrievers overall is **less than** the mean weight of German Shepherds overall.

There are infinitely many solutions; give a solution with the **smallest possible value of $a$**. If you still find that there are many possible values of $b$, then give the smallest possible value of $b$. Again, **show your work**.

---

_Type your answer here, replacing this text._

---

## Finish Line 🏁

Congratulations! You're ready to submit Homework 3.

To submit your homework:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under "Homework 3".
5. Stick around while the Gradescope autograder grades your work. Make sure you see that all **public tests** have passed on Gradescope. **Remember that homeworks have hidden tests, and you will not see your scores on them until a few days after the deadline!**
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()