In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 4

# Exploratory Data Analysis and Missing Values

### EECS 398-003: Practical Data Science, Fall 2024

#### Due Thursday, September 26th at 11:59PM
    
</div>

## Instructions

Welcome to Homework 4! In this homework, you will practice exploratory data analysis. That is, you'll learn how to take messy data, clean it for analysis, draw meaningful visualizations from it, and impute missing values. See the [Readings section of the Resources tab on the course website](https://practicaldsc.org/resources/#readings) for supplemental resources.

You are given six slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/fa24/). The [⚙️ Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps. Once you're done, you'll submit your completed notebook to Gradescope.

Please start early and submit often. You can submit as many times as you'd like to Gradescope, and we'll grade your **most recent** submission. Remember that the public `grader.check` tests in your notebook are not comprehensive, and that your work will also be graded on hidden test cases on Gradescope after the submission deadline.

This homework is worth a total of **40 points**, 31 of which come from the autograder, **and 9 of which are manually graded by us** (Questions 5.1 and 5.2). The number of points each question is worth is listed at the start of each question. **The three parts of the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

<a name='like-dataframe'>

</a>

<div class="alert alert-warning" markdown="1">
    
**Note**: Throughout this homework, you'll see statements like this frequently:

<blockquote>Complete the implementation of the function ____, which takes in a DataFrame <code>df</code> like <code>other_df</code> and _____.</blockquote>

What this means is that you should assume that `df` has the same number of columns as `other_df`, with the same column titles and data types, but potentially a different number of rows in a different order, with a potentially different index. You should always also assume that `df` has at least one row.

We have you implement functions like this to prevent you from hard-coding your answers to one specific dataset.

</div>
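
To make the "like" convention concrete, here is a small sketch on made-up data. The DataFrame `toy` and the function `mean_of_x` are hypothetical and not part of the assignment. The point is only that a function written against column names keeps working on a shuffled subset with a different index.

```python
import pandas as pd

# Hypothetical data; not one of the homework's datasets.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "b", "a", "b"]})

def mean_of_x(df):
    # Relies only on the column name, not on row count, row order, or index.
    return df["x"].mean()

# A DataFrame "like" toy: same columns and dtypes, fewer rows, shuffled index.
toy_like = toy.sample(3, random_state=0)

full_result = mean_of_x(toy)
sample_result = mean_of_x(toy_like)
```

Calling your own functions on a random sample in this way is a quick check that nothing is hard-coded to one particular dataset.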

<div class="alert alert-danger" markdown="1">

`for`-loops are **allowed** in Questions 3 and 7. `for`-loops are **not allowed** in Questions 1, 2, 4, 5, and 6.

</div>

To get started, run the **two** import cells below, plus the cell at the top of the notebook that imports and initializes `otter`. The first cell below installs `kaleido`, a package we'll need for Part 2 that wasn't included in the `pds` conda environment.

In [None]:
!pip install -U kaleido

In [None]:
import pandas as pd
import numpy as np

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

from IPython.display import display, Image

def save_and_show(fig, path):
    # Write a static image of the figure to disk, then display that image inline.
    plotly.io.write_image(fig, path, width=1200)
    display(Image(path))

## Part 1: LendingClub Returns 💰

---

In this part, we'll continue working with the LendingClub dataset from [Lecture 7](https://practicaldsc.org/resources/lectures/lec07/lec07-filled.html) and [Lecture 8](https://practicaldsc.org/resources/lectures/lec08/lec08-filled.html). Run the cell below to load in the file `'data/loans.csv'` as a DataFrame and clean it the way that we did in lecture.

In [None]:
def clean_term_column(df):
    return df.assign(
        term=df['term'].str.split().str[0].astype(int)
    )
    
def clean_date_column(df):
    return (
        df
        .assign(date=pd.to_datetime(df['issue_d'], format='%b-%Y'))
        .drop(columns=['issue_d'])
    )

loans = (
    pd.read_csv('data/loans.csv')
    .pipe(clean_term_column)
    .pipe(clean_date_column)
)

As a refresher, each row of the dataset corresponds to a different loan that the LendingClub approved and paid out. This is only a sample of all loans the LendingClub ever gave out, and remember, each row corresponds to an actually approved loan, **not** a loan application.

Some of the key columns are:
- `'loan_amnt' (float)`: The amount of the loan, or how much the borrower borrowed.
- `'issue_d' (str)`: The date on which the loan was issued.
- `'term' (str)`: The length of the loan, that is, the amount of time the borrower has to pay the loan back.
- `'int_rate' (float)`: The interest rate the borrower will pay on their loan amount.
- `'fico_range_low' (float)`: The borrower's credit score at the time of their application.

In lecture, we drew visualizations to uncover a few key patterns in the data. We saw that:
- Interest rates tend to be higher for 60 month loans than 36 month loans.
- Borrowers with larger debt-to-income (DTI) ratios tend to receive higher interest rates than borrowers with lower DTIs.

One feature we didn't use very much in our preliminary analyses was borrowers' credit scores. Both interest rate (`'int_rate'`) and credit score (`'fico_range_low'`) are numerical features, so to look at the relationship between them, we can use a scatter plot:

In [None]:
px.scatter(loans, x='fico_range_low', y='int_rate',
           labels={'fico_range_low': 'Credit Score', 'int_rate': 'Interest Rate (%)'},
           title='Interest Rate vs. Credit Score')

There's a lot of overplotting here, meaning that many points are being plotted on top of one another. It does indeed seem that as credit scores increase, interest rates tend to decrease on average, but perhaps there's a better way to visualize this information.


One idea is to place credit scores into categories by **binning** them. According to [Experian](https://www.experian.com/blogs/ask-experian/credit-education/score-basics/what-is-a-good-credit-score/#s1), one of the three major credit bureaus in the US, FICO credit scores are described qualitatively as follows:

| Score | Category |
|---|---|
| 580 - 669 | Fair |
| 670 - 739 | Good |
| 740 - 799 | Very Good |
| 800 - 850 | Excellent |

There is actually also a bin below fair, named "poor" with a range of 300-579, but since `loans` doesn't have any poor credit scores, we'll exclude them from our exploration here. Note that while the `dtype` of `'fico_range_low'` is `float`, credit scores are actually integers.

Once we place credit scores into bins, we can visualize the distribution of interest rates separately for each credit score bin. Here, that would allow us to draw four separate distributions of interest rates – one for the fair group, one for the good group, one for the very good group, and one for the excellent group. Each of those four distributions is a **numerical distribution**, and we have several tools for visualizing numerical distributions, including histograms, box plots, and violin plots.

### Question 1: Boxing Day 🥊 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Complete the implementation of the function `create_boxplot`, which takes in a DataFrame `df` like `loans` and returns a `plotly` figure object containing a **box plot describing the distribution of interest rates, separately for each of the four credit score bins described above, and separately for the two loan lengths**. Here's an example of the plot you'll need to create:

<center><img src="imgs/example-q1.png" width=60%></center>

To create your figure, you'll use the `px.box` function and provide several arguments. This [`plotly` article](https://plotly.com/python/box-plots/) will be extremely helpful.

Before using `px.box`, though, you'll need to place credit scores into bins. There's a `pandas` function that will be helpful here. **Make sure the bins match those in our example plot exactly – inclusive of the left endpoint and exclusive of the right endpoint.** You'll need to hard-code these when creating your plot. You can assume that nobody has an exact credit score of 850. Once you've binned scores, you'll need to convert your Series of bin assignments to strings so that they can be used on the $x$-axis of a `px.box` figure.
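
As a rough illustration of left-inclusive binning, the sketch below runs `pd.cut` on a handful of made-up scores. The `scores` Series is invented, and the bin edges are an assumption read off the table above, so check them against the example plot before relying on them.

```python
import pandas as pd

# Made-up scores, not taken from loans.
scores = pd.Series([580.0, 669.0, 670.0, 739.0, 740.0, 805.0])

# right=False makes each bin include its left edge and exclude its right edge,
# so 669 and 670 land in different bins.
bins = pd.cut(scores, bins=[580, 670, 740, 800, 850], right=False)

# Convert the categorical bin labels to plain strings.
bin_labels = bins.astype(str)
```

With the default `right=True`, a score of exactly 670 would instead fall in the bin that ends at 670, which is why the `right` argument matters here.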

Some additional guidance:
- Make sure your axis labels, legend labels, and title are the same as ours.
- You **must** change the colors of the two terms from the default colors to something else. We chose purple and gold. To do this, you'll need to manually specify what color you want for the `36` group and the `60` group; it is fine to hard-code these two term lengths when creating your plot.

Note that unlike in previous plotting questions, there **are** hidden tests, but they just check that the values in your plot are correct; the formatting is checked in the public tests. Remember that we will test `create_boxplot` on random samples of `loans`!

<div class="alert alert-danger">

Once you're done implementing `create_boxplot`, please comment out the line at the bottom of the cell below that says `create_boxplot(loans)`; otherwise, we won't be able to manually grade your work!

</div>

In [None]:
def create_boxplot(df):
    ...

# When you're ready to submit your homework, please comment the line below out;
# otherwise, we won't be able to manually grade your work.
create_boxplot(loans)

In [None]:
grader.check("q01")

If you created your box plot correctly, you should have seen a few things:
- As borrowers' credit scores increase, both the median and variance in interest rates tend to decrease.
- Across the spectrum of credit scores, 60 month loans tend to have higher interest rates than 36 month loans, which we'd seen before in our plots in lecture.

Good to know!

Another factor that lenders like the LendingClub may look at when deciding whether or not to approve loans is the amount of **disposable income** potential borrowers have. With more disposable income, borrowers may be more likely to make their payments on time and less likely to **default** on their loans.

Disposable income is defined as:

$$
\text{Disposable Income} = \text{Gross Income} - \text{Federal Income Tax} - \text{State Income Tax} 
$$

The `'annual_inc'` column in `loans` contains each borrower's gross income – that is, their income before taxes are removed. But it doesn't contain any information about the taxes that each borrower owes on their income.

Run the cell below to define a DataFrame named `state_taxes_raw` that contains the tax brackets for each state in 2023 ([source](https://taxfoundation.org/data/all/state/state-income-tax-rates-2023/)).

In [None]:
state_taxes_raw_path = 'data/state_taxes_raw.csv'
state_taxes_raw = pd.read_csv(state_taxes_raw_path)
state_taxes_raw.head()

Above, this is saying that Alabama has three state income tax brackets:
- Any income between \\$0 and \\$500 is taxed at 2%.
- Any income between \\$500 and \\$3000 is taxed at 4%.
- Any income above \\$3000 is taxed at 5%.
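
To see what these brackets mean in practice, consider an income of \\$4000 in Alabama: the first \\$500 is taxed at 2%, the next \\$2500 at 4%, and the remaining \\$1000 at 5%, for a total of \\$10 + \\$100 + \\$50 = \\$160. The sketch below reproduces that arithmetic. The helper `bracket_tax` is hypothetical (it is not the `tax_owed` function from Homework 1), and the list-of-pairs representation is just one convenient choice.

```python
# Alabama's brackets from the list above, written as (rate, lower limit) pairs.
alabama = [(0.02, 0), (0.04, 500), (0.05, 3000)]

def bracket_tax(income, brackets):
    """Hypothetical helper: tax owed under marginal brackets."""
    total = 0.0
    # Pair each bracket with the lower limit of the next one, which acts as its
    # upper limit; the last bracket has no upper limit.
    upper_limits = [lower for _, lower in brackets[1:]] + [float("inf")]
    for (rate, lower), upper in zip(brackets, upper_limits):
        # Only the slice of income that falls between lower and upper is taxed at this rate.
        taxed_amount = max(0, min(income, upper) - lower)
        total += rate * taxed_amount
    return round(total, 2)

tax_on_4000 = bracket_tax(4000, alabama)  # 10 + 100 + 50
```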

These tax brackets are a mess! There are full rows of `NaN` values, not every row has a `'State'` filled in, and more. The brackets need more work to be compatible with, say, the `tax_owed` function we wrote in Homework 1. **In the next question, your job will be to reformat the brackets in `state_taxes_raw` into a more useful format**. We won't actually go through the full process of computing disposable incomes, to be clear; you're just responsible for doing the cleaning.

### Question 2: Death and Taxes 💵 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">7 Points</div>


Complete the implementation of the function `state_brackets`, which takes in a DataFrame `df` like `state_taxes_raw`. It should return a DataFrame, indexed by `'State'`, with a single column, `'bracket_list'`, that contains the tax brackets for each `'State'` as **lists of tuples** in the form `(tax_rate, bracket_lower_limit)`.

Example behavior is given below.

```python
>>> state_brackets(state_taxes_raw).head(4)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>bracket_list</th>
    </tr>
    <tr>
      <th>State</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Ala.</th>
      <td>[(0.02, 0), (0.04, 500), (0.05, 3000)]</td>
    </tr>
    <tr>
      <th>Alaska</th>
      <td>[(0.0, 0)]</td>
    </tr>
    <tr>
      <th>Ariz.</th>
      <td>[(0.02, 0)]</td>
    </tr>
    <tr>
      <th>Ark.</th>
      <td>[(0.02, 0), (0.04, 4300), (0.05, 8500)]</td>
    </tr>
  </tbody>
</table>

Some added guidance:

- In the returned DataFrame, `'State'`s should appear in the same order in which they appear in the input DataFrame `df`. 

- In each list of tuples in the returned `'bracket_list'` column, the tax rates should be proportions, stored as floats (e.g. `0.02` instead of `2` to represent 2%), and the income amounts should be stored as integers. Round the tax rate proportions to two decimal places to avoid any floating point precision errors, and make sure to correctly account for states with no state income tax (like Alaska) as we did in the example.
- If you're a little stuck on where to start, try to first clean the input DataFrame so it looks like the following DataFrame, and then figure out how to correctly group it:

    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>State</th>
          <th>Rate</th>
          <th>Lower Limit</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>Ala.</td>
          <td>0.02</td>
          <td>0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>Ala.</td>
          <td>0.04</td>
          <td>500</td>
        </tr>
        <tr>
          <th>2</th>
          <td>Ala.</td>
          <td>0.05</td>
          <td>3000</td>
        </tr>
        <tr>
          <th>4</th>
          <td>Alaska</td>
          <td>0.00</td>
          <td>0</td>
        </tr>
        <tr>
          <th>6</th>
          <td>Ariz.</td>
          <td>0.02</td>
          <td>0</td>
        </tr>
      </tbody>
    </table>

- Make sure that any cleaning steps you perform are done within `state_brackets`, not directly to the `state_taxes_raw` DataFrame outside your function.
- We will call your function on subsets of `state_taxes_raw`, but only subsets that are still well-formed, i.e. where all of the rows for a particular `'State'` are in order and formatted the same way they are in `state_taxes_raw` (no randomly sampled subsets).
- You'll need to do some research to identify useful functions and methods (for instance, how to fill in missing values in a particular way). You may also need to `zip` ([documentation link](https://docs.python.org/3.3/library/functions.html#zip)) 🤐 at some point.
- Don't forget that you can apply a function to every **row** of a DataFrame by using the DataFrame `apply` method with `axis=1`. `lambda` functions are your friend! 🫂
- This is quite an involved problem. Try to organize your work as best as you can, for instance, **by using many helper functions** and the `pipe` DataFrame method. **Nothing about this question requires the use of a `for`-loop, so don't use one!**
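
If `zip` or filling in missing values in a particular way is new to you, the toy sketch below shows both on invented data. The column names and values are made up, and forward-filling is only one of several ways to fill missing values; whether these are the right tools for your solution is for you to decide.

```python
import pandas as pd
import numpy as np

# Made-up table in which a group label appears only on the first row of each group.
toy = pd.DataFrame({
    "group": ["A", np.nan, "B"],
    "rate": [0.1, 0.2, 0.3],
    "limit": [0, 100, 0],
})

# Forward-fill propagates the last seen label down into the missing rows.
filled = toy.assign(group=toy["group"].ffill())

# zip pairs up elements position by position, producing tuples.
pairs = list(zip(filled["rate"], filled["limit"]))
```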

In [None]:
def state_brackets(df):
    ...

# If you've completed the question correctly,
# the first four rows below should match those in the
# example above.
state_brackets(state_taxes_raw)

In [None]:
grader.check("q02")

## Part 2: Clean It Up 🧹

---

### Question 3: Reading Malformed CSVs <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Up until now, most of the data we've had to work with was presented to us in a nice CSV file that we could call `pd.read_csv` on with no issues. But that won't always be the case! Sometimes there will be errors or problematic formatting.

`'data/malformed.csv'` is a file of comma-separated values, containing the following fields:

- `'first' (str)`: First name of person.
- `'last' (str)`: Last name of person.
- `'weight' (float)`: Weight of person (lbs).
- `'height' (float)`: Height of person (in).
- `'geo' (str)`: Location of person; comma-separated latitude/longitude.


Unfortunately, the entries contain errors in the placement of commas (`,`) and quotes (`"`) that cause `pd.read_csv` to fail to parse the file with the default settings. Don't believe us? Try using `pd.read_csv` on `'data/malformed.csv'` and look at what happens.

As a result, instead of using `pd.read_csv`, you must read in the file manually using Python's built-in `open` function.

Complete the implementation of the function `parse_malformed`, which takes in a string, `fp`, containing the path to a file, and returns a parsed, properly-typed DataFrame with the information in the corresponding file.

Example behavior is given below.

```python
>>> parse_malformed('data/malformed.csv').head(3)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>first</th>
      <th>last</th>
      <th>weight</th>
      <th>height</th>
      <th>geo</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Julia</td>
      <td>Wagner</td>
      <td>142.0</td>
      <td>86.0</td>
      <td>39.8,15.4</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Angelica</td>
      <td>Rija</td>
      <td>155.0</td>
      <td>56.0</td>
      <td>38.2,-71.7</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Tyler</td>
      <td>Micajah</td>
      <td>116.0</td>
      <td>73.0</td>
      <td>38.0,6.9</td>
    </tr>
  </tbody>
</table>

Some guidance:
- The only kinds of issues your function needs to handle are comma and quote misplacements; don't try to find any other issues with the CSV.
- You should assume that `'data/malformed.csv'` is a sample of a larger file that has the same sorts of errors, but potentially in different lines. For example, `'data/malformed.csv'` has an unnecessary quote `"` in line 4, but your function may be called on another CSV that has a perfectly fine line 4 but an unnecessary quote on some other line.
- So, **don't** implement `parse_malformed` assuming that the commas and quotes are mispositioned on specific lines; rather, implement `parse_malformed` such that it can handle these issues on every single line they appear in. A good way to proceed is to open `'data/malformed.csv'` and look carefully at the comma and quote placements.
- **You can** use a `for`-loop in this question.
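
For a sense of what reading a file manually can look like, here is a minimal sketch on an invented two-column file. The file contents, the temporary file, and the particular cleanup rule are all assumptions for illustration. The errors in `'data/malformed.csv'` may need different handling, so inspect that file before borrowing anything from this.

```python
import os
import tempfile

# Made-up file contents with a stray quote and a trailing comma; not the homework file.
raw_text = 'name,score\n"Ann,3.5\nBob,4.0,\n'

# Write the text to a temporary file so that open() has something to read.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(raw_text)
    tmp_path = tmp.name

rows = []
with open(tmp_path) as f:
    header = f.readline().strip().split(",")
    for line in f:
        # Drop quotes, split on commas, and discard empty pieces left by extra commas.
        pieces = [p for p in line.strip().replace('"', "").split(",") if p != ""]
        rows.append(pieces)

# Clean up the temporary file.
os.remove(tmp_path)
```

Every value read this way is a string, so numeric columns still need an explicit type conversion afterwards.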

In [None]:
def parse_malformed(fp):
    ...

# If you've completed the question correctly,
# the first three rows below should match those in the
# example above.
# Remember that we will call your function on
# other similarly-formatted CSVs!
parse_malformed('data/malformed.csv')

In [None]:
grader.check("q03")

### Question 4: High Potential Individuals 📈 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

In 2022, the United Kingdom 🇬🇧 announced a new ["High Potential Individual" visa](https://www.gov.uk/high-potential-individual-visa/eligibility), which allows graduates of universities ranked in the Top 50 globally to move to the UK without a job lined up. This visa has been a subject of much debate, in part due to how much rankings play a role. Don't worry – the University of Michigan is on the list!

In the next two questions, you will clean and analyze a dataset of university rankings, collected from [here](https://www.kaggle.com/datasets/mylesoneill/world-university-rankings?datasetId=) (though we have pre-processed and modified the original dataset for the purposes of this question).

Our version of the dataset is stored in `'data/universities.csv'`. We load it in as a DataFrame, `universities_raw`, below.

In [None]:
universities_raw = pd.read_csv('data/universities.csv')
universities_raw.head()

Here is what the columns of `universities_raw` contain:

- `'world_rank'`: World rank of the institution.
- `'institution'`: Name of the institution.
- `'national_rank'`: Rank within the nation, formatted as `'country, rank'`.
- `'quality_of_education'`: Rank by quality of education.
- `'alumni_employment'`: Rank by alumni employment.
- `'quality_of_faculty'`: Rank by quality of faculty.
- `'publications'`: Rank by publications.
- `'influence'`: Rank by influence.
- `'citations'`: Rank by number of citations.
- `'broad_impact'`: Rank by broad impact.
- `'patents'`: Rank by number of patents.
- `'score'`: Overall score of the institution, out of 100.
- `'control'`: Whether the university is public or private.
- `'city'`: City in which the institution is located.
- `'state'`: State in which the institution is located.

There are (still) a few aspects of the dataset we need to clean before it's ready for analysis.

Complete the implementation of the function `clean_universities`, which takes in a DataFrame `df` like `universities_raw` and returns a cleaned DataFrame, cleaned according to the following specifications:

- Some `'institution'` names contain `'\n'` characters (e.g. `'University of Michigan\nAnn Arbor'`). Replace all instances of `'\n'` with `', '` (a comma and a space) in the `'institution'` column.

- Change the data type of the `'broad_impact'` column to `int`.

- Split `'national_rank'` into two columns, `'nation'` and `'national_rank_cleaned'`, where:
    - `'nation'` is the country (or its dependency) indicated in the first part of `'national_rank'`. 
        - Note that there are **3** countries that appear under different names for different schools. For all 3 of these countries, you should pick **the name that is longer** and use that name for every occurrence of the country. One of the 3 countries is **`'Czech Republic'`**, which also appears as **`'Czechia'`** – since these refer to the same country and `'Czech Republic'` is longer, all instances of either name should be replaced with `'Czech Republic'`. You need to find the other 2 countries on your own. 
        - These are the only 3 country names you need to handle.
    - `'national_rank_cleaned'` is the integer in the latter part of `'national_rank'`. Make sure that the values in this column are stored as integers. 
    - Don't include the original `'national_rank'` column in the output DataFrame.
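
The string operations involved here can be tried out on a toy example first. In the sketch below, the DataFrame and its column names are invented; only the `'Czechia'` to `'Czech Republic'` mapping comes from the specification above.

```python
import pandas as pd

# Made-up data; column names are illustrative only.
toy = pd.DataFrame({
    "name": ["North\nCampus", "South College"],
    "place_rank": ["Canada, 3", "Peru, 12"],
})

# Replace newline characters inside strings (literal match, not a regex).
names = toy["name"].str.replace("\n", ", ", regex=False)

# Split one string column into two columns on the ', ' separator.
parts = toy["place_rank"].str.split(", ", expand=True)
place = parts[0]
rank = parts[1].astype(int)

# Map one spelling of a value onto another; unlisted values are left alone.
countries = pd.Series(["Czechia", "Czech Republic", "Peru"])
unified = countries.replace({"Czechia": "Czech Republic"})
```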

Example behavior is given below.

```python
>>> clean_universities(universities_raw).loc[[18], ['institution', 'nation', 'national_rank_cleaned']]
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>institution</th>
      <th>nation</th>
      <th>national_rank_cleaned</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>18</th>
      <td>University of Michigan, Ann Arbor</td>
      <td>United States</td>
      <td>15</td>
    </tr>
  </tbody>
</table>

In [None]:
def clean_universities(df):
    ...

# Feel free to change this input to make sure your function works correctly.
# A good strategy is to make sure it works when you call it on a random subset of universities_raw,
# e.g. clean_universities(universities_raw.sample(100)).
clean_universities(universities_raw)

In [None]:
grader.check("q04")

Once you're done with Question 4, run the cell below to define a new DataFrame, `universities`, that we'll use in Question 5.

In [None]:
universities = clean_universities(universities_raw)
universities.head()

### Question 5: University of Practical Data Science 🏫

Now that we have a cleaned DataFrame, `universities`, we can use it for analysis! Note that it's still not perfectly clean. Try and find the top 20 public `'institution'`s in the United States. What do you notice?

In [None]:
# Try and answer the question above here. It's not required, but you should explore – 
# we promise you'll find it interesting!
...

#### Question 5.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Many of the top `'institution'`s in the world, like Harvard University and the Massachusetts Institute of Technology, don't have the substring `'University of'` in their name. But many do, like University of Cambridge and, of course, University of Michigan, Ann Arbor. One may wonder, which schools have higher `'score'`s on average – schools with `'University of'` in their name, or schools without?

Complete the implementation of the function `plot_university_of`, which takes in a DataFrame `df` like `universities` and returns a **`plotly` Figure object** depicting the distribution of `'score'`s, separately for `'institution'`s with and without `'University of'` in their name.

Some guidance:
- Names like "Delft University of Technology" count too – so `'University of'` doesn't have to be at the very start of the name. Similarly, if `'University Of Ann Arbor'` (capital O in "Of") was an `'institution'` in `df`, it should also count in the `'University of'` category.
- Your plot should show us at least _some_ of the `'score'`s for individual `'institution'`s directly. For example, in addition to some distributional information, we need to be able to see – perhaps by hovering – that Harvard University's `'score'` is 100.
- Make sure your axis labels, legend labels, and title are all chosen manually by you: don't just use the defaults.
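
The matching rule in the first bullet above (anywhere in the name, ignoring capitalization) can be sketched with `str.contains`. The `names` Series below is a made-up sample built from the names mentioned in this question, including the hypothetical `'University Of Ann Arbor'`.

```python
import pandas as pd

names = pd.Series([
    "University of Cambridge",
    "Delft University of Technology",
    "University Of Ann Arbor",
    "Harvard University",
])

# case=False ignores capitalization; the match can occur anywhere in the string.
has_university_of = names.str.contains("university of", case=False)
```

The first three names match and the last does not, since "Harvard University" has no "of" after "University".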

<div class="alert alert-danger">

Unlike other plots we've had you create in this class, this one will be manually graded, **not** autograded! To make sure we can grade your work correctly, once you're done implementing `plot_university_of`:

1. Comment out the line at the bottom of the cell below, that just has `plot_university_of(universities)`.
2. Run the cell that calls the function `save_and_show`. You should see a static (i.e. not interactive) version of your plot. This is to be expected. (This is what we'll use to grade your work.)

</div>

In [None]:
def plot_university_of(df):
    ...

# When you're ready to submit your homework, please comment the line below out;
# otherwise, we won't be able to manually grade your work.
plot_university_of(universities)

In [None]:
# Run this cell, don't change anything.
save_and_show(plot_university_of(universities), 'imgs/q05_1.png')

#### Question 5.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

One of the most important skills a data scientist needs to master is the ability to _think_ of questions. That's what you'll need to do here.

**Your job is to think of an interesting question you might want to answer using the data `universities`, and then draw a visualization to help answer it.** Your visualization must involve at least one numerical feature and at least one categorical feature (potentially, one that you construct by taking an existing numerical feature and "binning" it, as we did in Question 1). The distribution of a single categorical feature is not _interesting_ enough, i.e. your question needs to be more sophisticated than "Which states have the most public universities?" with an accompanying bar chart.

Complete the implementation of the function `custom_vis`, which takes in a DataFrame `df` like `universities` and returns a **`plotly` Figure object** containing your visualization. **Then, in the cell beneath it, tell us the question you tried to answer, and give us 1-2 sentences describing your takeaways.**

This question will mostly be graded on effort. Meet the requirements above and you'll get full credit. That said, take pride in your work, and try your best to make a really special plot that others won't. Are you really proud of what you made? [**Post it in this Ed post**](https://edstem.org/us/courses/61012/discussion/5320217).

<div class="alert alert-danger">

Like the plot above, this one will be manually graded, **not** autograded! To make sure we can grade your work correctly, once you're done implementing `custom_vis`:

1. Comment out the line at the bottom of the cell below, that just has `custom_vis(universities)`.
2. Run the cell that calls the function `save_and_show`. You should see a static (i.e. not interactive) version of your plot. This is to be expected. (This is what we'll use to grade your work.)

Here, remember to provide your analysis of your plot in the cell below the one in which `save_and_show` is run.

</div>

In [None]:
def custom_vis(df):
    ...

# When you're ready to submit your homework, please comment the line below out;
# otherwise, we won't be able to manually grade your work.
custom_vis(universities)

In [None]:
# Run this cell, don't change anything.
save_and_show(custom_vis(universities), 'imgs/q05_2.png')

_Type your answer here, replacing this text._

## Part 3: What's Missing? 👀

---

In [Lecture 8](https://practicaldsc.org/resources/lectures/lec08/lec08-filled.html), we learned about different **imputation strategies** for filling in missing values. Here, we'll take things a step further and develop a few more related strategies.

### Question 6: Conditioning on a Numerical Feature 🧍📏 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

In Lecture 8, you learned how to perform mean imputation conditionally on a **categorical** column: impute with the mean for each group. That is, for each distinct value of the **categorical** column, there is a single imputed value.
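
As a reminder of what that looks like, here is a minimal sketch of group-wise mean imputation on invented data. The column names `'category'` and `'value'` are placeholders, and this is one common way to write it rather than necessarily the exact code from lecture.

```python
import pandas as pd
import numpy as np

# Made-up data with one missing value in each group.
toy = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b"],
    "value": [1.0, 3.0, np.nan, 10.0, np.nan],
})

# Within each group, fill missing values with that group's mean of the observed values.
imputed = (
    toy.groupby("category")["value"]
    .transform(lambda s: s.fillna(s.mean()))
)
```

The missing value in group `'a'` becomes 2.0 (the mean of 1.0 and 3.0), and the missing value in group `'b'` becomes 10.0.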

Here, you will perform single-valued imputation by conditioning on a **numerical** column. This is not something we learned how to do in class. 

You will work with a version of the `heights` DataFrame from class, now called `new_heights`, that has a `'father'` column and a single `'child'` column. The `'child'` column has missing values. To impute the `'child'` column, transform the `'father'` column into a categorical column by binning the values of `'father'` into [quartiles](https://en.wikipedia.org/wiki/Quartile). Once this is done, you can impute `'child'` as in lecture (and described above).

In [None]:
new_heights = pd.read_csv('data/missing_heights.csv')
new_heights.head()

Complete the implementation of the function `cond_single_imputation`, which takes in a DataFrame `df` like `new_heights` with columns `'father'` and `'child'` (where `'child'` has missing values) and performs a single-valued mean imputation of the `'child'` column, conditional on `'father'`. `cond_single_imputation` should return a **Series** with just the imputed `'child'` heights.

Example behavior is given below.

```python
>>> cond_single_imputation(new_heights).head(5)
0    68.083871
1    68.083871
2    69.000000
3    69.000000
4    73.500000
```

Some guidance:
- There's likely a new `pandas` function you needed to research and use in Question 1. If you use that here, you'll be able to write a two-line solution.
- If you use that function, you'll likely want to convert the column of group labels to type `str` (or `object`); otherwise, you'll run into a `FutureWarning`.

In [None]:
def cond_single_imputation(df):
    ...

# Feel free to change this input to make sure your function works correctly.
cond_single_imputation(new_heights)

In [None]:
grader.check("q06")

### Question 7: Advanced Probabilistic Imputation 🎲 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

In [Lecture 8](https://practicaldsc.org/resources/lectures/lec08/lec08-filled.html), you learned how to impute a quantitative column by sampling from the observed values. **One problem with this technique is that the imputation will never generate imputed values that weren't already in the dataset.** For example, 57, 57.5, and 59 are values in the `'child'` column of `new_heights` while 58 is not. Thus, any imputation done by sampling from the observed values in the `'child'` column will not be able to generate a height of 58, even though 58 is clearly a plausible height.
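The limitation is easy to see on a tiny example. The sketch below uses a hypothetical Series `s` containing only the three heights mentioned above plus two missing entries; it illustrates the idea of sampling from observed values and is not necessarily the exact code from lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical heights with two missing entries.
s = pd.Series([57.0, 57.5, 59.0, np.nan, np.nan])
observed = s.dropna().to_numpy()

# Fill each missing entry with a random draw from the observed values.
imputed = s.copy()
imputed[s.isna()] = np.random.choice(observed, size=s.isna().sum())

# Every imputed value is one of 57, 57.5, or 59, so 58 can never appear.
print(set(imputed) <= set(observed))  # True
```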

To keep things simple, you will impute the `'child'` column **unconditionally** from the distribution of `'child'` heights present in the dataset. This means that you will use the values present in `'child'` to impute missing values, without looking at other columns.

An approach to quantitative imputation that overcomes the limitation mentioned above is as follows:
- Create a histogram of observed `'child'` heights, using 10 bins.
- Use the histogram to generate a new `'child'` height to impute with, somewhere within the observed range of `'child'` heights:
    - The probability that a generated `'child'` height comes from a given bin is equal to the proportion of observed `'child'` values in that bin. For example, if 20\% of observed `'child'` heights fall within a particular bin, then there's a 20\% chance we select that bin to draw a new `'child'` height from.
    - Once we select a bin, any value within that bin's left and right endpoints is equally likely to be drawn. For example, if we selected the bin [12, 14), we could choose any real number between 12 (inclusive) and 14 (exclusive) to fill in as our missing `'child'` height; `np.random.uniform` can help us pick that number between 12 and 14.
    
Let's illustrate this approach with an example. Let `demo` be the array of 10 numbers defined below.

```python
demo = np.array([10, 11, 11, 13, 14, 14, 13.5, 14, 15, 16])
```

- The first step is creating a histogram of `demo`. Note that with this example dataset, we will use 4 bins, **but you will be using 10 bins** in your imputation process.

<center><img src='imgs/histogram-demo.png' width=700></center>

- Note that in your implementation, you don't actually need to draw a histogram – instead, use `np.histogram`. (Play around with it to figure out how it works.)
- In the histogram above, we see that $\frac{3}{10}$ of values lie in the [10, 12) bin, $\frac{2}{10}$ of values lie in the [12, 14) bin, $\frac{4}{10}$ of values lie in the [14, 16) bin, and $\frac{1}{10}$ of values lie in the [16, 18] bin.
- Next, we need to pick a bin at random. There's a 30\% chance we pick the [10, 12) bin, a 20\% chance we pick the [12, 14) bin, a 40\% chance we pick the [14, 16) bin, and a 10\% chance we pick the [16, 18] bin. `np.random.choice` will be helpful in picking a bin at random.
- Once we pick a bin, we pick a number **uniformly at random** from within the bin. For instance, suppose we randomly chose the [14, 16) bin in the previous step. We then must select a (real) number between 14 and 16 uniformly at random; as mentioned above, `np.random.uniform` can help us here.
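The steps above can be sketched on `demo` as follows. This assumes the pictured histogram uses the bin edges 10, 12, 14, 16, and 18, as the bin descriptions suggest, so the edges are passed explicitly; the variable names are illustrative only.

```python
import numpy as np

demo = np.array([10, 11, 11, 13, 14, 14, 13.5, 14, 15, 16])

# Explicit edges reproduce the bins [10, 12), [12, 14), [14, 16), [16, 18].
# np.histogram returns the count in each bin and the array of bin edges.
counts, edges = np.histogram(demo, bins=[10, 12, 14, 16, 18])
proportions = counts / counts.sum()
print(counts.tolist())  # [3, 2, 4, 1]

# Pick a bin index, with probability equal to that bin's proportion.
i = np.random.choice(len(counts), p=proportions)

# Draw a value uniformly at random between the chosen bin's endpoints.
value = np.random.uniform(edges[i], edges[i + 1])
print(edges[i], edges[i + 1], value)
```

Note that passing an integer instead, such as `bins=4`, asks `np.histogram` for equal-width bins spanning the minimum to the maximum of the data, which for `demo` would give different edges than the ones pictured.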

Complete the implementation of the function `impute_height_quant`, which takes in a Series `s` like `new_heights['child']`, in which some values are missing, and returns a Series in which the missing values are imputed using the histogram scheme above. The length of the returned Series should be the same as the length of `s`.

Example behavior is given below.

```python
>>> impute_height_quant(new_heights['child']).head(5)
0    69.283202            # Randomly chosen!
1    69.208524            # Randomly chosen!
2    69.000000
3    69.000000
4    73.500000
```

Some guidance:
- You _can_ use a `for`-loop if needed.
- As always, it's good practice to define a helper function!

In [None]:
def impute_height_quant(s):
    ...

# Feel free to change this input to make sure your function works correctly.
impute_height_quant(new_heights['child'])

In [None]:
grader.check("q07")

## Finish Line 🏁

Congratulations! You're ready to submit Homework 4.

To submit your homework:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under "Homework 4".
5. Stick around while the Gradescope autograder grades your work. Make sure you see that all **public tests** have passed on Gradescope. **Remember that homeworks have hidden tests, whose scores you will not see until a few days after the deadline!**
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()