In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 4

# Exploratory Data Analysis and Missing Values

### EECS 398: Practical Data Science, Winter 2025

#### Due Tuesday, February 11th at 11:59PM
    
</div>

## Instructions

Welcome to Homework 4! In this homework, you will practice exploratory data analysis. That is, you'll learn how to take messy data, clean it for analysis, draw meaningful visualizations from it, and impute missing values. You'll also practice scraping messy data from the internet.

You are given 8 slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/wn25/). The [Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps. Once you're done, you'll submit your completed notebook to Gradescope.

Please start early and submit often. You can submit as many times as you'd like to Gradescope, and we'll grade your **most recent** submission. Remember that the public `grader.check` tests in your notebook are not comprehensive, and that your work will also be graded on hidden test cases on Gradescope after the submission deadline.

This homework is worth a total of **46 points**, 37 of which come from the autograder, **and 9 of which are manually graded by us** (Questions 4.1 and 4.2). The number of points each question is worth is listed at the start of each question. **Most questions in the assignment are independent, so feel free to move around if you get stuck**. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

<div class="alert alert-success" markdown="1">

### Submission

Unlike previous homeworks, you will only submit your Homework 4 notebook; you will **not** submit a separate PDF with answers to the written questions. (As mentioned in Question 6, you will also upload an HTML file you create manually and a few images.)

In Question 4, you are asked to create visualizations that we will manually grade; we will look at your visualizations directly in your notebook after you submit it. To ensure that we can properly grade your notebook, make sure that your visualizations are visible on Gradescope after you submit.

The details of how to ensure your visualizations are submitted correctly are described in this [**video**](https://www.loom.com/share/58287a89121545fdbd0131d22c2d9c94?sid=10e6190f-4988-40d8-b999-c63c30101693). **Please watch it before submitting!**

</div>


<a name='like-dataframe'>

</a>

<div class="alert alert-warning" markdown="1">

### Functions
    
**Note**: Throughout this homework, like in Homework 3, you'll see statements like this frequently:

<blockquote>Complete the implementation of the function ____, which takes in a DataFrame <code>df</code> like <code>other_df</code> and _____.</blockquote>

What this means is that you should assume that `df` has the same number of columns as `other_df`, with the same column titles and data types, but potentially a different number of rows in a different order, with a potentially different index. You should always also assume that `df` has at least one row.

We have you implement functions like this to prevent you from hard-coding your answers to one specific dataset.

</div>

To get started, run the import cell below, plus the cell at the top of the notebook that imports and initializes `otter`. 

In [None]:
import pandas as pd
import numpy as np

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

import os
import requests
from bs4 import BeautifulSoup

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

from IPython.display import display, Image, IFrame, HTML

def save_and_show(fig, path):
    # Write a static image of the figure to `path`, then display that image inline.
    plotly.io.write_image(fig, path, width=1200)
    display(Image(path))

## Question 1: LendingClub Returns 💰 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div> 

---

In this question, we'll continue working with the LendingClub dataset from [Lecture 6](https://practicaldsc.org/resources/lectures/lec06/lec06-filled.html) and [Lecture 7](https://practicaldsc.org/resources/lectures/lec07/lec07-filled.html). Run the cell below to load in the file `'data/loans.csv'` as a DataFrame and clean it the way that we did in lecture.

In [None]:
def clean_term_column(df):
    return df.assign(
        term=df['term'].str.split().str[0].astype(int)
    )
    
def clean_date_column(df):
    return (
        df
        .assign(date=pd.to_datetime(df['issue_d'], format='%b-%Y'))
        .drop(columns=['issue_d'])
    )

loans = (
    pd.read_csv('data/loans.csv')
    .pipe(clean_term_column)
    .pipe(clean_date_column)
)

As a refresher, each row of the dataset corresponds to a different loan that the LendingClub approved and paid out. This is only a sample of all loans the LendingClub ever gave out, and remember, each row corresponds to an actually approved loan, **not** a loan application.

Some of the key columns are:
- `'loan_amnt' (float)`: The amount of the loan, or how much the borrower borrowed.
- `'issue_d' (str)`: The date on which the loan was issued.
- `'term' (str)`: The length of the loan, that is, the amount of time the borrower has to pay the loan back.
- `'int_rate' (float)`: The interest rate the borrower will pay on their loan amount.
- `'fico_range_low' (float)`: The borrower's credit score at the time of their application.

In lecture, we drew visualizations to uncover a few key patterns in the data. We saw that:
- Interest rates tend to be higher for 60 month loans than 36 month loans.
- Borrowers with larger debt-to-income (DTI) ratios tend to receive higher interest rates than borrowers with lower DTIs.

One feature we didn't use very much in our preliminary analyses was borrowers' credit scores. Both interest rate (`'int_rate'`) and credit score (`'fico_range_low'`) are numerical features, so to look at the relationship between them, we can use a scatter plot:

In [None]:
px.scatter(loans, x='fico_range_low', y='int_rate',
           labels={'fico_range_low': 'Credit Score', 'int_rate': 'Interest Rate (%)'},
           title='Interest Rate vs. Credit Score')

There's a lot of overplotting here, meaning that many points are being plotted on top of one another. It does indeed seem that as credit scores increase, interest rates tend to decrease on average, but perhaps there's a better way to visualize this information.

One solution is to **jitter** the data, which involves adding a tiny amount of random noise to the horizontal ($x$) position of each point. We're adding noise to the horizontal positions only because it seems that the vertical positions ($y$) already are quite spread out, while the horizontal positions are at fixed intervals, specifically, every 5 units on the $x$-axis.

In [None]:
px.scatter(loans.assign(fico_jittered=loans['fico_range_low'] + np.random.normal(0, 1, size=loans.shape[0])), 
           x='fico_jittered', y='int_rate',
           labels={'fico_jittered': 'Credit Score', 'int_rate': 'Interest Rate (%)'},
           title='Interest Rate vs. Credit Score',
           opacity=0.5)

This is a _little_ better. Still, another idea is to place credit scores into categories by **binning** them. According to [Experian](https://www.experian.com/blogs/ask-experian/credit-education/score-basics/what-is-a-good-credit-score/#s1), one of the three major credit bureaus in the US, FICO credit scores are described qualitatively as follows:

| Score | Category |
|---|---|
| 580 - 669 | Fair |
| 670 - 739 | Good |
| 740 - 799 | Very Good |
| 800 - 850 | Excellent |

There is actually also a bin below fair, named "poor" with a range of 300-579, but since `loans` doesn't have any poor credit scores, we'll exclude them from our exploration here. Note that while the `dtype` of `'fico_range_low'` is `float`, credit scores are actually integers.

Once we place credit scores into bins, we can visualize the distribution of interest rates separately for each credit score bin. Here, that would allow us to draw four separate distributions of interest rates – one for the fair group, one for the good group, one for the very good group, and one for the excellent group. Each of those four distributions is a **numerical distribution**, and we have several tools for visualizing numerical distributions, including histograms, box plots, and violin plots.

Complete the implementation of the function `create_boxplot`, which takes in a DataFrame `df` like `loans` and returns a `plotly` figure object containing a **box plot describing the distribution of interest rates, separately for each of the four credit score bins described above, and separately for the two loan lengths**. Here's an example of the plot you'll need to create:

<center><img src="imgs/example-q1.png" width=60%></center>

To create your figure, you'll use the `px.box` function and provide several arguments. This [`plotly` article](https://plotly.com/python/box-plots/) will be extremely helpful.

Before using `px.box`, though, you'll need to place credit scores into bins. There's a `pandas` function that will be helpful here. **Make sure the bins match those in our example plot exactly – inclusive of the left endpoint and exclusive of the right endpoint.** You'll need to hard-code these when creating your plot. You can assume that nobody has an exact credit score of 850. Once you've binned scores, you'll need to convert your Series of bin assignments to strings so that they can be used on the $x$-axis of a `px.box` figure.
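
To make the endpoint convention concrete, here is a minimal sketch of left-inclusive, right-exclusive binning on a toy Series. The `scores` values are made up for illustration, and the bin edges are an assumption read off the Experian table above, so check them against the example plot before relying on them.

```python
import pandas as pd

# Hypothetical toy scores; these are NOT taken from `loans`.
scores = pd.Series([580.0, 669.0, 670.0, 739.0, 740.0, 805.0])

# right=False closes each bin on the left and opens it on the right,
# so 670 lands in the second bin rather than the first.
binned = pd.cut(scores, bins=[580, 670, 740, 800, 850], right=False)

# Convert the categorical bin assignments to plain strings.
labels = binned.astype(str)
print(labels.tolist())
```

Note that with `right=False`, a value equal to the last edge (here, 850) would fall outside every bin, which is why the assumption that nobody has a score of exactly 850 matters.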

Some additional guidance:
- Make sure your axis labels, legend labels, and title are the same as ours.
- You **must** change the colors of the two terms from the default colors to something else. We chose purple and gold. To do this, you'll need to manually specify what color you want for the `36` group and the `60` group; it is fine to hard-code these two term lengths when creating your plot.
- **Don't** use a `for`-loop.

Note that unlike in previous plotting questions, there **are** hidden tests, but they just check that the values in your plot are correct; the formatting is checked in the public tests. Remember that we will test `create_boxplot` on random samples of `loans`!

<div class="alert alert-danger">

Once you're done implementing `create_boxplot`, please comment out the line at the bottom of the cell below that says `create_boxplot(loans)`; otherwise, we won't be able to manually grade your work!

</div>

In [None]:
def create_boxplot(df):
    ...

# When you're ready to submit your homework, please comment the line below out;
# otherwise, we won't be able to manually grade your work.
create_boxplot(loans)

In [None]:
grader.check("q01")

If you created your box plot correctly, you should have seen a few things:
- As borrowers' credit scores increase, both the median and variance in interest rates tend to decrease.
- Across the spectrum of credit scores, 60 month loans tend to have higher interest rates than 36 month loans, which we'd seen before in our plots in lecture.

Good to know! Before you move forward, run the cell below and watch the video that appears. It shows you how to correctly submit your notebook.

In [None]:
IFrame(
    src="https://www.loom.com/embed/58287a89121545fdbd0131d22c2d9c94?sid=7b3a8240-0568-40b3-9d61-9f4bb78971ba",
    width=600,
    height=340
)

## Question 2: Reading Malformed CSVs <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

---

Up until now, most of the data we've had to work with was presented to us in a nice `.csv` file that we could call `pd.read_csv` on with no issues. But that won't always be the case! Sometimes there will be errors or problematic formatting.

`'data/malformed.csv'` is a file of comma-separated values, containing the following fields:

- `'first' (str)`: First name of person.
- `'last' (str)`: Last name of person.
- `'weight' (float)`: Weight of person (lbs).
- `'height' (float)`: Height of person (in).
- `'geo' (str)`: Location of person; comma-separated latitude/longitude.


Unfortunately, the entries contain errors in the placement of commas (`,`) and quotes (`"`) that cause `pd.read_csv` to fail to parse the file with the default settings. Don't believe us? Try using `pd.read_csv` on `'data/malformed.csv'` and look at what happens.

As a result, instead of using `pd.read_csv`, you must read in the file manually using Python's built-in `open` function.
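
As a rough illustration of what reading a file manually looks like, the sketch below writes a small, well-formed file and then parses it line by line with `open`. The file name `toy.csv` and its columns are hypothetical, and the sketch deliberately does not handle misplaced commas or quotes; that part is left to you.

```python
import os
import tempfile
import pandas as pd

# Hypothetical, well-formed file created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w") as f:
    f.write("name,score\nAva,91.5\nBen,78.0\n")

rows = []
with open(path) as f:
    # The first line holds the column names.
    header = f.readline().strip().split(",")
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(line.split(","))

# Every field is read as a string, so numeric columns must be converted explicitly.
toy = pd.DataFrame(rows, columns=header).astype({"score": float})
print(toy.dtypes)
```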

Complete the implementation of the function `parse_malformed`, which takes in a string, `fp`, containing the path to a file, and returns a parsed, properly-typed DataFrame with the information in the corresponding file.

Example behavior is given below.

```python
>>> parse_malformed('data/malformed.csv').head(3)
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>first</th>
      <th>last</th>
      <th>weight</th>
      <th>height</th>
      <th>geo</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Julia</td>
      <td>Wagner</td>
      <td>142.0</td>
      <td>86.0</td>
      <td>39.8,15.4</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Angelica</td>
      <td>Rija</td>
      <td>155.0</td>
      <td>56.0</td>
      <td>38.2,-71.7</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Tyler</td>
      <td>Micajah</td>
      <td>116.0</td>
      <td>73.0</td>
      <td>38.0,6.9</td>
    </tr>
  </tbody>
</table>

Some guidance:
- The only kinds of issues your function needs to handle are comma and quote misplacements; don't try to find any other issues with the CSV.
- You should assume that `'data/malformed.csv'` is a sample of a larger file that has the same sorts of errors, but potentially in different lines. For example, `'data/malformed.csv'` has an unnecessary quote `"` in line 4, but your function may be called on another CSV that has a perfectly fine line 4 but an unnecessary quote on some other line.
- So, **don't** implement `parse_malformed` assuming that the commas and quotes are mispositioned on specific lines; rather, implement `parse_malformed` such that it can handle these issues on every single line they appear in. A good way to proceed is to open `'data/malformed.csv'` and look carefully at the comma and quote placements.
- **You can** use a `for`-loop.

In [None]:
def parse_malformed(fp):
    ...

# If you've completed the question correctly,
# the first three rows below should match those in the example above.
# Remember that we will call your function on other similarly-formatted CSVs!
parse_malformed('data/malformed.csv')

In [None]:
grader.check("q02")

## Questions 3-4: University of Practical Data Science 🏫

---

### Question 3: High Potential Individuals 📈 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>


In 2022, the United Kingdom 🇬🇧 announced a new ["High Potential Individual" visa](https://www.gov.uk/high-potential-individual-visa/eligibility), which allows graduates of universities ranked in the Top 50 globally to move to the UK without a job lined up. This visa has been a subject of much debate, in part due to how much rankings play a role. Don't worry – the University of Michigan is on the list! 〽️

In the next two questions, you will clean and analyze a dataset of university rankings, collected from [here](https://www.kaggle.com/datasets/mylesoneill/world-university-rankings?datasetId=), though we have pre-processed and modified the original dataset for the purposes of this question.

Our version of the dataset is stored in `'data/universities.csv'`. We load it in as a DataFrame, `universities_raw`, below.

In [None]:
universities_raw = pd.read_csv('data/universities.csv')
universities_raw.head()

Here is what the columns of `universities_raw` contain:

- `'world_rank'`: World rank of the institution.
- `'institution'`: Name of the institution.
- `'national_rank'`: Rank within the nation, formatted as `'country, rank'`.
- `'quality_of_education'`: Rank by quality of education.
- `'alumni_employment'`: Rank by alumni employment.
- `'quality_of_faculty'`: Rank by quality of faculty.
- `'publications'`: Rank by publications.
- `'influence'`: Rank by influence.
- `'citations'`: Rank by number of citations.
- `'broad_impact'`: Rank by broad impact.
- `'patents'`: Rank by number of patents.
- `'score'`: Overall score of the institution, out of 100.
- `'control'`: Whether the university is public or private.
- `'city'`: City in which the institution is located.
- `'state'`: State in which the institution is located.

There are (still) a few aspects of the dataset we need to clean before it's ready for analysis.

Complete the implementation of the function `clean_universities`, which takes in a DataFrame `df` like `universities_raw` and returns a cleaned DataFrame, cleaned according to the following specifications:

- Some `'institution'` names contain `'\n'` characters (e.g. `'University of Michigan\nAnn Arbor'`). Replace all instances of `'\n'` with `', '` (a comma and a space) in the `'institution'` column.

- Change the data type of the `'broad_impact'` column to `int`.

- Split `'national_rank'` into two columns, `'nation'` and `'national_rank_cleaned'`, where:
    - `'nation'` is the country (or its dependency) indicated in the first part of `'national_rank'`. 
        - Note that there are **3** countries that appear under different names for different schools. For all 3 of these countries, you should pick **the name that is longer** and use that name for every occurrence of the country. One of the 3 countries is **`'Czech Republic'`**, which also appears as **`'Czechia'`** – since these refer to the same country and `'Czech Republic'` is longer, all instances of either name should be replaced with `'Czech Republic'`. You need to find the other 2 countries on your own. 
        - These are the only 3 country names you need to handle.
    - `'national_rank_cleaned'` is the integer in the latter part of `'national_rank'`. Make sure that the values in this column are stored as integers. 
    - Don't include the original `'national_rank'` column in the output DataFrame.
   
- You may spot a few instances where the values in the `'national_rank'` column don't make sense. Ignore these issues; only fix the issues that are mentioned above.
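
The splitting and renaming steps above follow a common `pandas` pattern, sketched here on a toy DataFrame. The column names `label`, `group`, and `number`, and the values in them, are hypothetical and chosen only for illustration; this is one possible approach, not the required one.

```python
import pandas as pd

# Hypothetical toy data with a combined 'name, integer' column.
toy = pd.DataFrame({"label": ["Red, 3", "Crimson, 12", "Blue, 7"]})

# expand=True returns a DataFrame with one column per piece of the split.
parts = toy["label"].str.split(", ", expand=True)

out = (
    toy
    .assign(
        # Map one spelling onto another so each group has a single name.
        group=parts[0].replace({"Red": "Crimson"}),
        # The split pieces are strings, so convert the number explicitly.
        number=parts[1].astype(int),
    )
    .drop(columns=["label"])
)
print(out)
```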

Example behavior is given below.

```python
>>> clean_universities(universities_raw).loc[[18], ['institution', 'nation', 'national_rank_cleaned']]
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>institution</th>
      <th>nation</th>
      <th>national_rank_cleaned</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>18</th>
      <td>University of Michigan, Ann Arbor</td>
      <td>United States</td>
      <td>15</td>
    </tr>
  </tbody>
</table>

In [None]:
def clean_universities(df):
    ...

# Feel free to change this input to make sure your function works correctly.
# A good strategy is to make sure it works when you call it on a random subset of universities_raw,
# e.g. clean_universities(universities_raw.sample(100)).
clean_universities(universities_raw)

In [None]:
grader.check("q03")

Once you've done Question 3, run the cell below to define a new DataFrame, `universities`, that we'll use in Question 4.

In [None]:
universities = clean_universities(universities_raw)
universities.head()

### Question 4: Exploring University Rankings 📊
---

Now that we have a cleaned DataFrame, `universities`, we can use it for analysis! Note that it's still not perfectly clean. Try to find the top 20 public `'institution'`s in the United States. What do you notice?

In [None]:
# Try and answer the question above here. It's not required, but you should explore – 
# we promise you'll find it interesting!
...

#### Question 4.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Many of the top `'institution'`s in the world, like Harvard University and the Massachusetts Institute of Technology, don't have the substring `'University of'` in their name. But many do, like University of Cambridge and, of course, University of Michigan, Ann Arbor. One may wonder, which schools have higher `'score'`s on average – schools with `'University of'` in their name, or schools without?

Complete the implementation of the function `plot_university_of`, which takes in a DataFrame `df` like `universities` and returns a **`plotly` Figure object** depicting the distribution of `'score'`s, separately for `'institution'`s with and without `'University of'` in the name.

Some guidance:
- Names like "Delft University of Technology" count too – so `'University of'` doesn't have to be at the very start of the name. Similarly, if `'University Of Ann Arbor'` (capital O in "Of") was an `'institution'` in `df`, it should also count in the `'University of'` category.
- Your plot should show us at least _some_ of the `'score'`s for individual `'institution'`s directly. For example, in addition to some distributional information, we need to be able to see – perhaps by hovering – that Harvard University's `'score'` is 100.
- Make sure your axis labels, legend labels, and title are all chosen manually by you: don't just use the defaults.
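
The matching rule in the first bullet above (anywhere in the name, regardless of capitalization) can be sketched on a toy Series as follows. The four names come from the examples in this question; lowercasing before matching is just one possible way to ignore case.

```python
import pandas as pd

names = pd.Series([
    "Harvard University",
    "University of Cambridge",
    "Delft University of Technology",
    "University Of Ann Arbor",
])

# Lowercase every name first so that 'Of' and 'of' are treated identically,
# then look for the substring anywhere in the name.
has_university_of = names.str.lower().str.contains("university of")
print(has_university_of.tolist())
```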

<div class="alert alert-danger">

Unlike other plots we've had you create in this class, this one will be manually graded, **not** autograded! To make sure we can grade your work correctly, once you're done implementing `plot_university_of`:

1. Comment out the line at the bottom of the cell below, that just has `plot_university_of(universities)`.
2. Run the cell that calls the function `save_and_show`. You should see a static (i.e. not interactive) version of your plot. This is to be expected. This is what we'll use to grade your work on Gradescope. **Make sure that static graph is visible in your notebook before submitting!**

</div>

In [None]:
def plot_university_of(df):
    ...

# When you're ready to submit your homework, please comment the line below out;
# otherwise, we won't be able to manually grade your work.
plot_university_of(universities)

In [None]:
# Run this cell, don't change anything.  
# Make sure this static graph is visible in your notebook before submitting!
save_and_show(plot_university_of(universities), 'imgs/q04_01.png')

#### Question 4.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

One of the most important skills a data scientist needs to master is the ability to _think_ of questions. That's what you'll need to do here.

**Your job is to think of an interesting question you might want to answer using the data `universities`, and then draw a visualization to help answer it.** Your visualization must involve at least one numerical feature and at least one categorical feature (potentially, one that you construct by taking an existing numerical feature and "binning" it, as we did in Question 1). The distribution of a single categorical feature is not _interesting_ enough, i.e. your question needs to be more sophisticated than "Which states have the most public universities?" with an accompanying bar chart.

Complete the implementation of the function `custom_vis`, which takes in a DataFrame `df` like `universities` and returns a **`plotly` Figure object** containing your visualization. **Then, in the cell beneath it, tell us the question you tried to answer, and give us 1-2 sentences describing your takeaways.**

This question will mostly be graded on effort. Meet the requirements above and you'll get full credit. That said, take pride in your work, and try your best to make a really special plot that others won't. Are you really proud of what you made? [**Post it in this Ed post**](https://edstem.org/us/courses/61012/discussion/5320217).

<div class="alert alert-danger">

Like the plot above, this one will be manually graded, **not** autograded! To make sure we can grade your work correctly, once you're done implementing `custom_vis`:

1. Comment out the line at the bottom of the cell below, that just has `custom_vis(universities)`.
2. Run the cell that calls the function `save_and_show`. You should see a static (i.e. not interactive) version of your plot. This is to be expected. This is what we'll use to grade your work on Gradescope. **Make sure that static graph is visible in your notebook before submitting!**

Here, remember to provide your analysis of your plot in the cell below the one in which `save_and_show` is run.

</div>

In [None]:
def custom_vis(df):
    ...

# When you're ready to submit your homework, please comment the line below out;
# otherwise, we won't be able to manually grade your work.
custom_vis(universities)

In [None]:
# Run this cell, don't change anything.  
# Make sure this static graph is visible in your notebook before submitting!
save_and_show(custom_vis(universities), 'imgs/q04_02.png')

## Question 5: Advanced Probabilistic Imputation 🎲 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

---

In this question, you will work with a version of the heights DataFrame from class, now called `new_heights`, that has a `'father'` column and a single `'child'` column.

In [None]:
new_heights = pd.read_csv('data/missing_heights.csv')
new_heights.head()

In [Lecture 7](https://practicaldsc.org/resources/lectures/lec07/lec07-filled.html#Idea:-Probabilistic-imputation), you learned how to impute a quantitative column by sampling from the observed values. **One problem with this technique is that the imputation will never generate imputed values that weren't already in the dataset.** For example, 57, 57.5, and 59 are values in the `'child'` column of `new_heights` while 58 is not. Thus, any imputation done by sampling from the observed values in the `'child'` column will not be able to generate a height of 58, even though it's clearly a reasonable value to occur in the dataset.

To keep things simple, you will impute the `'child'` column **unconditionally** from the distribution of `'child'` heights present in the dataset. This means that you will use the values present in `'child'` to impute missing values, without looking at other columns.

An approach to quantitative imputation that overcomes the limitation mentioned above is as follows:
- Create a histogram of observed `'child'` heights, using 10 bins.
- Use the histogram to generate a new `'child'` height to impute with, somewhere within the observed range of `'child'` heights:
    - The probability a generated `'child'` height will come from a given bin is equal to the proportion of observed values in that bin. For example, if 20\% of observed `'child'` heights fall within a particular bin, then there's a 20\% chance we select that bin to draw a new `'child'` height from.
    - Once we select a bin, any value within that bin's left and right endpoints is equally likely to be drawn. For example, if we selected the bin [12, 14), we could choose any real number between 12 (inclusive) and 14 (exclusive) to fill in as our missing `'child'` height; `np.random.uniform` can help us pick that number between 12 and 14.
    
Let's illustrate this approach with an example. Let `demo` be the array of 10 numbers defined below.

```python
demo = np.array([10, 11, 11, 13, 14, 14, 13.5, 14, 15, 16])
```

- The first step is creating a histogram of `demo`. Note that with this example dataset, we will use 4 bins, **but you will be using 10 bins** in your imputation process.

<center><img src='imgs/histogram-demo.png' width=700></center>

- Note that in your implementation, you don't actually need to draw a histogram – instead, use `np.histogram`. (Play around with it to figure out how it works.)
- In the histogram above, we see that $\frac{3}{10}$ of values lie in the [10, 12) bin, $\frac{2}{10}$ of values lie in the [12, 14) bin, $\frac{4}{10}$ of values lie in the [14, 16) bin, and $\frac{1}{10}$ of values lie in the [16, 18] bin.
- Next, we need to pick a bin at random. There's a 30\% chance we pick the [10, 12) bin, a 20\% chance we pick the [12, 14) bin, a 40\% chance we pick the [14, 16) bin, and a 10\% chance we pick the [16, 18] bin. `np.random.choice` will be helpful in picking a bin at random.
- Once we pick a bin, we pick a number **uniformly at random** from within the bin. For instance, suppose we randomly chose the [14, 16) bin in the previous step. We then must select a (real) number between 14 and 16 uniformly at random; as mentioned above, `np.random.uniform` can help us here.
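
The steps above can be sketched for a single draw from `demo` as follows. Passing explicit bin edges is an assumption made here so that the bins match the ones described above; in your own implementation you would ask for 10 bins instead, and you would still need to decide how to fill every missing value in a Series.

```python
import numpy as np

demo = np.array([10, 11, 11, 13, 14, 14, 13.5, 14, 15, 16])

# Explicit edges (an assumption for this demo) reproduce the bins
# [10, 12), [12, 14), [14, 16), and [16, 18].
counts, edges = np.histogram(demo, bins=[10, 12, 14, 16, 18])

# Proportion of observed values in each bin.
probs = counts / counts.sum()

# Step 1: pick a bin index, weighted by the proportions.
bin_idx = np.random.choice(len(counts), p=probs)

# Step 2: pick a value uniformly at random between that bin's endpoints.
value = np.random.uniform(edges[bin_idx], edges[bin_idx + 1])
print(counts, probs, value)
```

Note that `np.histogram` returns one more edge than it returns counts, which is why `edges[bin_idx + 1]` is always a valid right endpoint.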

Complete the implementation of the function `impute_height_quant`, which takes in a Series `s` like `new_heights['child']`, in which some values are missing, and returns a Series in which the missing values are imputed using the histogram scheme above. The length of the returned Series should be the same as the length of `s`.

Example behavior is given below.

```python
>>> impute_height_quant(new_heights['child']).head(5)
0    69.283202            # Randomly chosen!
1    69.208524            # Randomly chosen!
2    69.000000
3    69.000000
4    73.500000
```

Some guidance:
- As always, it's good practice to break your work down into smaller steps. This might involve defining helper functions as needed.
- **You can** use a `for`-loop.

In [None]:
def impute_height_quant(s):
    ...

# Feel free to change this input to make sure your function works correctly.
impute_height_quant(new_heights['child'])

In [None]:
grader.check("q05")

## Question 6: Practice with HTML Tags 📎 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>
---

In Question 7, you'll spend plenty of time parsing HTML source code. But before you get your hands dirty trying to extract information from HTML written by other people, it is a good idea to write some basic HTML code yourself. This exercise will help you better understand how the code in a `.html` file is structured.

For this question, you'll create a very basic `.html` file, named `hw04_q06.html`, that satisfies the following conditions:

- It must have `<title>` and `<head>` tags.
- It must also have `<body>` tags. Within the `<body>` tags, it must have:
    - At least two headers.
    - At least three images.
        - At least one image must be a local file.
        - At least one image must be linked to an online source.
        - At least one image must have default text for when it cannot be displayed.
    - At least three references (hyperlinks) to different web pages.
    - At least one table with two rows and two columns.
    

Make sure to save your file as `hw04_q06.html`. **When submitting this homework to Gradescope, make sure to also upload `hw04_q06.html` along with the _local_ image that you embedded in your site.** You can upload multiple files to Gradescope at a time.
   

Some guidance: 

- You can write and view basic HTML with a Jupyter Notebook, using either a Markdown cell or by using the `HTML` function that we've imported at the top of the notebook (which takes in a string of HTML and renders it).
- If you write your HTML code within a Jupyter Notebook, you should later copy your code into a text editor and save it with the `.html` extension. You could also write your HTML in a text editor directly.
- Be sure to open your final `.html` file in a browser and make sure it looks correct on its own.
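
If you draft your HTML as a Python string in the notebook, one possible way to save it to disk is sketched below. The page content, the file name `scratch_page.html`, and the use of a temporary directory are all placeholders chosen for this sketch; your actual submission must be named `hw04_q06.html` and must satisfy the conditions listed above.

```python
from pathlib import Path
import tempfile

# A deliberately tiny placeholder page; it does NOT satisfy the question's requirements.
html_source = """<!DOCTYPE html>
<html>
  <head><title>Scratch page</title></head>
  <body><p>Hello!</p></body>
</html>
"""

# Write to a temporary directory so that this sketch doesn't overwrite your real work.
out_path = Path(tempfile.gettempdir()) / "scratch_page.html"
out_path.write_text(html_source, encoding="utf-8")
```

After saving, open the resulting file in a browser to confirm that it renders correctly on its own.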

In [None]:
grader.check("q06")

## Question 7: Scraping an Online Bookstore 📚
---

Browse through the following fake online bookstore: http://books.toscrape.com/. This website is meant for toying with scraping.

By the end of this question, you'll scrape this website, collecting data on all the books that have:
- **_at least_ a four-star rating**, and
- **a price _strictly_ less than £50**, and 
- **belong to specific categories** (more details below). 

This is a multi-step question, which we've broken into several sub-questions to help you organize your work.

### Question 7.1 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Complete the implementation of the function `extract_book_links`, which takes in the content of a page that contains book listings as a **string of HTML**, and returns a **list** of URLs of book-specific pages for all books with:
- **_at least_ a four-star rating**, and
- **a price _strictly_ less than £50**.

Example behavior is given below.

```python
>>> out = extract_book_links(open('data/products.html', encoding='utf-8').read())
>>> len(out)
6

>>> out[1]
'scarlet-the-lunar-chronicles-2_218/index.html'

>>> out[-1]
'ready-player-one_209/index.html'
```

Some guidance:
- The URLs should appear in the order in which they appear in the string of HTML. Additionally, the URLs shouldn't contain the base URL, i.e. `'http://books.toscrape.com/catalogue/'`. The base URL should be prepended to the URLs when you actually make the requests in Question 7.3.
- Throughout this question, you should use the "Inspect" tool in your browser to view the source code of the pages you're trying to scrape. The public tests for this question are run on the file `data/products.html`, but your code should also work on any page of book listings from https://books.toscrape.com/, e.g. https://books.toscrape.com/catalogue/page-3.html. So, to test your work, you may want to request a few specific pages **outside** of your function; `extract_book_links` itself should not make any requests.
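
The two filtering conditions can be checked separately from the HTML parsing. Below is a minimal sketch that assumes you have already extracted a book's rating as a word (e.g. `'Four'`) and its price as text (e.g. `'£22.60'`). The helper name `passes_filter` and the `RATING_WORDS` dictionary are hypothetical, and how you pull these two strings out of the HTML is left to you.

```python
# Hypothetical mapping from rating words to numbers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def passes_filter(rating_word, price_text):
    """Return True if the rating is at least four stars and the price is under 50."""
    rating = RATING_WORDS[rating_word]
    # Keep only digits and the decimal point, so any currency symbols are ignored.
    price = float("".join(ch for ch in price_text if ch.isdigit() or ch == "."))
    return rating >= 4 and price < 50

passes_filter("Four", "£22.60")   # meets both conditions
passes_filter("Five", "£50.00")   # fails: the price must be strictly less than 50
```

Stripping non-numeric characters before converting to `float` is one way to stay robust if the currency symbol is decoded strangely.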

In [None]:
def extract_book_links(text):
    ...

# Feel free to change this input to make sure your function works correctly.
extract_book_links(open('data/products.html', encoding='utf-8').read())

The examples given below are to help you further test your function. To test each one, navigate to the website using the URL in the `requests.get` function, and verify that your function returns the correct book links according to the criteria listed in the problem statement.

In [None]:
res = requests.get('https://books.toscrape.com/catalogue/category/books/music_14/index.html')
res.encoding = 'utf-8'
extract_book_links(res.text)

In [None]:
res = requests.get('https://books.toscrape.com/catalogue/category/books/science_22/index.html')
res.encoding = 'utf-8'
extract_book_links(res.text)

In [None]:
res = requests.get('https://books.toscrape.com/catalogue/category/books/humor_30/index.html')
res.encoding = 'utf-8'
extract_book_links(res.text)

In [None]:
grader.check("q07_01")

### Question 7.2 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Complete the implementation of the function `get_product_info`, which takes in the content of a single book-specific page as a **string of HTML**, and a list `categories` of book categories. If the input book is in the list of `categories`, `get_product_info` should return a **dictionary** corresponding to a row in the DataFrame in the image above (where the keys are the column names and the values are the row values). If the input book is not in the list of `categories`, return `None`.

Example behavior is given below.

```python
>>> html_string = open('data/Frankenstein.html', encoding='utf-8').read()
>>> out = get_product_info(html_string, ['Default'])
>>> type(out)
dict

>>> out.keys()
dict_keys(['UPC', 'Product Type', 'Price (excl. tax)', 'Price (incl. tax)', 'Tax', 'Availability', 'Number of reviews', 'Category', 'Rating', 'Description', 'Title'])

>>> out['Rating']
'Two'

>>> out['Price (incl. tax)']
'£38.00'
```

Some guidance:
- The public tests for this question are run on the file `data/Frankenstein.html`, but your code should also work on any individual book's page from https://books.toscrape.com/, e.g. https://books.toscrape.com/catalogue/sharp-objects_997/index.html. So, to test your work, you may want to request a few specific pages **outside** of your function; `get_product_info` itself should not make any requests.
- Don't worry about the types of the values in your returned dictionary. That is, it's fine if your `'Number of reviews'` value is not stored as type `int`, and it's fine if your price values (e.g. `'Price (incl. tax)'`) are not stored as type `float`.
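
The "dictionary or `None`" return contract can be sketched independently of any parsing. The helper `filter_by_category` and the hand-written dictionary below are hypothetical stand-ins; in your solution, the dictionary would be built from the page's HTML and would contain all of the keys shown in the example above.

```python
def filter_by_category(info, categories):
    """Return `info` unchanged if its category is wanted; otherwise return None."""
    if info["Category"] in categories:
        return info
    return None

# Hypothetical, hand-written dictionary standing in for parsed page content.
toy_info = {"Title": "Some Book", "Category": "Default", "Rating": "Two"}

kept = filter_by_category(toy_info, ["Default"])     # the dictionary itself
dropped = filter_by_category(toy_info, ["Romance"])  # None
```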

In [None]:
def get_product_info(text, categories):
    ...

# Feel free to change this input to make sure your function works correctly.
get_product_info(open('data/Frankenstein.html', encoding='utf-8').read(), ['Default'])

The examples given below are to help you further test your function. To test each one, navigate to the website using the URL in the `requests.get` function, and verify that your function returns the correct product information according to the criteria listed in the problem statement.

In [None]:
res = requests.get('https://books.toscrape.com/catalogue/sharp-objects_997/index.html')
res.encoding = 'utf-8'
get_product_info(res.text, ['Mystery'])

In [None]:
res = requests.get('https://books.toscrape.com/catalogue/olio_984/index.html')
res.encoding = 'utf-8'
get_product_info(res.text, ['Poetry'])

In [None]:
res = requests.get('https://books.toscrape.com/catalogue/layered-baking-building-and-styling-spectacular-cakes_904/index.html')
res.encoding = 'utf-8'
get_product_info(res.text, ['Food and Drink'])

In [None]:
grader.check("q07_02")

### Question 7.3 <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Finally, put everything together. Complete the implementation of the function `scrape_books`, which takes in an integer `k` and a list `categories` of book categories. `scrape_books` should use `requests` to scrape the first `k` pages of the bookstore and return a DataFrame of only the books that have:
- **_at least_ a four-star rating**, and
- **a price _strictly_ less than £50**, and
- **a category that is in the list `categories`**.

Example behavior is given below.

```python
>>> scrape_books(5, ['Default', 'Romance'])
```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>UPC</th>
      <th>Product Type</th>
      <th>Price (excl. tax)</th>
      <th>Price (incl. tax)</th>
      <th>Tax</th>
      <th>Availability</th>
      <th>Number of reviews</th>
      <th>Category</th>
      <th>Rating</th>
      <th>Description</th>
      <th>Title</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>e10e1e165dc8be4a</td>
      <td>Books</td>
      <td>Â£22.60</td>
      <td>Â£22.60</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Default</td>
      <td>Four</td>
      <td>For readers of Laura Hillenbrand's Seabiscuit ...</td>
      <td>The Boys in the Boat: Nine Americans and Their...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>c2e46a2ee3b4a322</td>
      <td>Books</td>
      <td>Â£25.27</td>
      <td>Â£25.27</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Romance</td>
      <td>Five</td>
      <td>A Michelin two-star chef at twenty-eight, Viol...</td>
      <td>Chase Me (Paris Nights #2)</td>
    </tr>
    <tr>
      <th>2</th>
      <td>00bfed9e18bb36f3</td>
      <td>Books</td>
      <td>Â£34.53</td>
      <td>Â£34.53</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Romance</td>
      <td>Five</td>
      <td>No matter how busy he keeps himself, successfu...</td>
      <td>Black Dust</td>
    </tr>
    <tr>
      <th>3</th>
      <td>8c9e6bf2467d740d</td>
      <td>Books</td>
      <td>Â£20.59</td>
      <td>Â£20.59</td>
      <td>Â£0.00</td>
      <td>In stock (16 available)</td>
      <td>0</td>
      <td>Default</td>
      <td>Five</td>
      <td>Slay Procrastination, Distraction, and Overwhe...</td>
      <td>The Inefficiency Assassin: Time Management Tac...</td>
    </tr>
  </tbody>
</table>


<br>

Some guidance:

- The first page of the bookstore is at http://books.toscrape.com/catalogue/page-1.html. Subsequent pages can be found by clicking the "Next" button at the bottom of the page. Look at how the URLs change each time you navigate to a new page; think about how to use [f-strings](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals) (or some other string formatting technique) to generate these URLs.
- **`scrape_books` should run in under 180 seconds on the entire bookstore (`k = 50`). `scrape_books` is also the only function that should make `GET` requests; the other two functions parse already-existing HTML.**
- It's fine if your price columns contain symbols other than `'£'` (such as the `'Â'` in the example above).
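
As a small illustration of the URL pattern described above, the sketch below builds listing-page URLs with an f-string and turns a relative book link (like the ones returned in Question 7.1) into a full URL. The names `CATALOGUE_URL` and `page_url` are hypothetical, and no requests are made here.

```python
from urllib.parse import urljoin

# Base URL of the catalogue, as given in Question 7.1.
CATALOGUE_URL = "http://books.toscrape.com/catalogue/"

def page_url(page_number):
    """Build the URL of the listing page with the given number."""
    return f"{CATALOGUE_URL}page-{page_number}.html"

# URLs of the first three listing pages.
page_urls = [page_url(i) for i in range(1, 4)]

# Turn a relative book link into a full URL that can be requested.
book_url = urljoin(CATALOGUE_URL, "ready-player-one_209/index.html")
```

Plain string concatenation would also work for `book_url`; `urljoin` is just one convenient option from the standard library.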

In [None]:
def scrape_books(k, categories):
    ...
    
# Feel free to change this input to make sure your function works correctly.
scrape_books(5, ['Default', 'Romance'])

In [None]:
grader.check("q07_03")

## Finish Line 🏁

Congratulations! You're ready to submit Homework 4. Remember that unlike in previous homeworks, you will only submit your Homework 4 notebook; you will **not** submit a separate PDF with answers to the written questions. (As mentioned in Question 6, you will also upload the HTML file you create manually, along with the local image you embedded in it.)

In Question 4, you are asked to create visualizations that we will manually grade; we will look at your visualizations directly in your notebook after you submit it. To ensure that we can properly grade your notebook, make sure that your visualizations are visible on Gradescope after you submit.

The details of how to ensure your visualizations are submitted correctly are described in this [**video**](https://www.loom.com/share/58287a89121545fdbd0131d22c2d9c94?sid=10e6190f-4988-40d8-b999-c63c30101693). **Please watch it before submitting!**


To submit your homework:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
1. Read through the notebook to make sure everything is fine and all tests passed. Then, save the notebook.
1. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under "Homework 4".
1. For Question 6, make sure to save your file as `hw04_q06.html`. **When submitting to Gradescope, make sure to also upload `hw04_q06.html` along with the _local_ image that you embedded in your site.** You can upload multiple files to Gradescope at a time.
1. Stick around while the Gradescope autograder grades your work. Make sure you see that all **public tests** have passed on Gradescope. **Remember that homeworks have hidden tests, which you will not see your scores on until a few days after the deadline!**
1. Also make sure that you can see your graphs for Question 4 directly on Gradescope.
1. Check that you have a confirmation email from Gradescope and save it as proof of your submission.