In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("hw09.ipynb")

<div class="alert alert-success" markdown="1">

#### Homework 9

# Multiple Linear Regression and Feature Engineering

### EECS 398-003: Practical Data Science, Fall 2024

#### Due Monday, November 11th at 11:59PM (note the later deadline!)
    
</div>

## Instructions

Welcome to Homework 9! In this homework, you'll gain a better understanding of how the normal equations and multiple regression work, and learn how to create new features for model building, both using `pandas` and `sklearn`. Only content through Lecture 18 is necessary, though parts of Question 3 touch on ideas from Lecture 19. See the [Readings section of the Resources tab on the course website](https://practicaldsc.org/resources/#readings) for supplemental resources.

You are given **eight** slip days throughout the semester to extend deadlines. See the [Syllabus](https://practicaldsc.org/syllabus) for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

To access this notebook, you'll need to clone our [public GitHub repository](https://github.com/practicaldsc/fa24/). The [⚙️ Environment Setup](https://practicaldsc.org/env-setup) page on the course website walks you through the necessary steps.
<div class="alert alert-warning" markdown="1">

<div class="alert alert-warning">
This homework features a mix of autograded programming questions and manually-graded questions.
    
- Questions 1, 2, and 3.4 are **manually graded**, like in Homework 8, and say **[Written ✏️]** in the title. For these questions, **do not write your answers in this notebook**! Instead, like in Homework 8, write **all** of your answers to the written questions in this homework in a separate PDF. You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in. Submit this separate PDF to the **Homework 9 (Questions 1, 2, and 3.4; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

- Questions 3, 4, and 5 (except 3.4) are **fully autograded**, and say **[Autograded 💻]** in the title. For these questions, all you need to do is write your code in this notebook, run the local `grader.check` tests, and submit to the **Homework 9 (Questions 3-5; autograder problems)** assignment on Gradescope to have your code graded by the hidden autograder. This is the same workflow you followed in Homeworks 1-5 and Homework 8.

Your Homework 9 submission time will be the **later** of your two individual submissions.
</div>
</div>

**Make sure to show your work for all written questions! Answers without work shown may not receive full credit.**

    
This homework is worth a total of **62 points**, 26 of which are manually graded and 36 of which come from the autograder. The number of points each question is worth is listed at the start of each question. **All questions in the assignment are independent, so feel free to move around if you get stuck**, but keep in mind that you'll need to submit this homework twice – one submission for your written problems, and one submission for your autograded problems. Tip: if you're using Jupyter Lab, you can see a Table of Contents for the notebook by going to View > Table of Contents.

To get started, run the cell below, plus the cell at the top of the notebook that imports and initializes `otter`. 

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')

import plotly
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

# Preferred styles
pio.templates["pds"] = go.layout.Template(
    layout=dict(
        margin=dict(l=30, r=30, t=30, b=30),
        autosize=True,
        width=600,
        height=400,
        xaxis=dict(showgrid=True),
        yaxis=dict(showgrid=True),
        title=dict(x=0.5, xanchor="center"),
    )
)
pio.templates.default = "simple_white+pds"

# Use plotly as default plotting engine
pd.options.plotting.backend = "plotly"

## Question 1: Sums of Residuals 🤔

---

In this problem, we will prove that the sum of the residuals of a fit regression model is 0.

We define the $i$th **residual** to be the difference between the actual and predicted values for individual $i$ in our dataset, when the predictions are made using a regression model whose coefficients $w_0^*$ and $w_1^*$ (or, for multiple linear regression models, $w_0^*$, $w_1^*$, $w_2^*$, ..., $w_d^*$) are all optimal.

In other words, the $i$th residual $e_i$ is: $$e_i = y_i - H^*(x_i)$$

We use the letter $e$ for residuals because residuals are also known as errors.
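
As a quick numerical illustration of this definition (not a proof), the sketch below fits a linear model with an intercept to a small dataset and computes each residual. The data values and variable names are made up for illustration and are not part of the homework.

```python
import numpy as np

# Hypothetical toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution w* = (w0*, w1*), which satisfies the normal equations.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# The i-th residual is e_i = y_i - H*(x_i).
residuals = y - X @ w_star
print(residuals)
print(residuals.sum())  # approximately 0, up to floating-point error
```

Seeing a sum near 0 for one dataset only suggests the result; the questions below ask you to show it holds in general.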

We'll get to the proof soon, but first, a warmup.

<!-- BEGIN QUESTION -->

### Question 1.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
Suppose $\vec{1} \in \mathbb{R}^n$ is a vector containing the value 1 for each element, i.e. $\vec{1} = \begin{bmatrix} 1 \\ 1 \\ ... \\ 1 \end{bmatrix}$.

For any other vector $\vec{b} = \begin{bmatrix} b_1 \\ b_2 \\ ... \\ b_n \end{bmatrix}$, what is the value of $\vec{1}^T \vec{b}$, i.e. what is the dot product of $\vec{1}$ and $\vec{b}$?

DONE

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>
Back to the main problem at hand.

Consider the typical multiple regression scenario where our hypothesis function has an intercept term:
$$H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}$$

Note that another way of writing the $i$th residual, $e_i = y_i - H^*(x_i)$, is:

$$e_i = (\vec{y} - X \vec{w}^*)_i$$

Here, $X$ is a $n \times (d + 1)$ design matrix, $\vec{y} \in \mathbb{R}^n$ is an observation vector, and $\vec{w} \in \mathbb{R}^{(d+1)}$ is the parameter vector. We'll use $\vec{w}^*$ to denote the optimal parameter vector, or the one that satisfies the normal equations. $(\vec{y} - X \vec{w}^*)_i$ is referring to element $i$ of the vector $\vec{y} - X \vec{w}^*$.

Using facts about $\vec{w}^*$ we learned in Lectures 16 and 17, prove that for multiple linear regression models with an intercept term, the sum of the residuals is 0. That is, prove that:$$\sum_{i = 1}^n e_i = 0$$

*Hint: Refer to the derivation of $\vec{w}^*$ in Lecture 16. How did we define $X$? Your proof should not be very long.*

DONE 

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.3 [Written ✏️]  <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>
Now suppose our hypothesis function does not have an intercept term, but is otherwise linear with multiple features: $$H(\vec{x}) = w_1 x^{(1)} + w_2 x^{(2)} + ... + w_d x^{(d)}$$

- Is it still guaranteed that $\displaystyle\sum_{i = 1}^n e_i = 0$? Why or why not?
- Is it still possible that $\displaystyle\sum_{i = 1}^n e_i = 0$? If you believe the answer is yes, come up with a simple example where a linear hypothesis function without an intercept has residuals that sum to 0. If you believe the answer is no, state why not.

DONE 

<!-- END QUESTION -->

## Question 2: Real Estate 🏡

---

You are given a dataset containing information on recently sold houses in Ann Arbor, including:

- square footage
- number of bedrooms
- number of bathrooms
- year the house was built
- asking price, or how much the house was originally listed for, before negotiations
- sale price, or how much the house actually sold for, after negotiations

The table below shows the first few rows of the data set. Note that since you don't have the full dataset, you cannot answer the questions that follow based on calculations; you must answer conceptually.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>House</th>
      <th>Square Feet</th>
      <th>Bedrooms</th>
      <th>Bathrooms</th>
      <th>Year</th>
      <th>Asking Price</th>
      <th>Sale Price</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>1247</td>
      <td>3</td>
      <td>3</td>
      <td>2005</td>
      <td>500,000</td>
      <td>494,000</td>
    </tr>
    <tr>
      <td>2</td>
      <td>1670</td>
      <td>3</td>
      <td>2</td>
      <td>1927</td>
      <td>1,000,000</td>
      <td>985,000</td>
    </tr>
    <tr>
      <td>3</td>
      <td>716</td>
      <td>1</td>
      <td>1</td>
      <td>1993</td>
      <td>335,000</td>
      <td>333,850</td>
    </tr>
    <tr>
      <td>4</td>
      <td>1600</td>
      <td>4</td>
      <td>2</td>
      <td>1962</td>
      <td>830,000</td>
      <td>815,000</td>
    </tr>
    <tr>
      <td>5</td>
      <td>2635</td>
      <td>4</td>
      <td>3</td>
      <td>1993</td>
      <td>1,250,000</td>
      <td>1,250,000</td>
    </tr>
    <tr>
      <td>&#8943;</td> <!-- ellipsis -->
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
      <td>&#8943;</td>
    </tr>
  </tbody>
</table>


<!-- BEGIN QUESTION -->

### Question 2.1 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>
First, suppose we fit a multiple linear regression model to predict the sale price of a house given all five of the other variables.  Which feature would you expect to have the largest magnitude weight? Why? (Remember that the weight of a feature is the value of $w^*$ for that feature.)

Then, suppose we standardize each variable separately. (Recall, to standardize a column $x_1, x_2, ..., x_n$, we replace each value $x_i$ with $\frac{x_i - \bar{x}}{\sigma_x}$.) Suppose we fit another multiple linear regression model to predict the sale price of a house given all five of the other standardized variables. Now, which feature would you expect to have the largest magnitude weight? Why?
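
For reference, the standardization formula can be sketched in `pandas` as follows. This uses only the five `Square Feet` values visible in the table, and it assumes $\sigma_x$ is the population standard deviation (`ddof=0`); the variable names are illustrative.

```python
import pandas as pd

# The 'Square Feet' values from the first five rows of the table.
sqft = pd.Series([1247, 1670, 716, 1600, 2635], name="Square Feet")

# Standardize: subtract the mean, then divide by the standard deviation.
# ddof=0 (population standard deviation) is an assumption made here.
standardized = (sqft - sqft.mean()) / sqft.std(ddof=0)

print(standardized.mean())       # approximately 0
print(standardized.std(ddof=0))  # approximately 1
```

After standardizing, every column has mean 0 and standard deviation 1, regardless of its original units.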

DONE

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.2 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Suppose we fit a multiple linear regression model to predict the sale price of a house given all five of the other variables in their original, unstandardized form. Suppose the weight for the Year feature is $\alpha$.

Now, suppose we replace Year with a new feature, Age, which is 0 if the house was built in 2024, 1 if the house was built in 2023, 2 if the house was built in 2022, and so on. If we fit a new multiple linear regression model on all five variables, but using Age instead of Year, what will the weight for the Age feature be, in terms of $\alpha$?

DONE

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.3 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Now, suppose we fit a multiple linear regression model to predict the sale price of a house given all five of the other features, plus a new sixth feature named $\text{Rooms}$, which is the total number of bedrooms and bathrooms in the house. Will our new regression model with an added sixth feature make better predictions than the models we fit in Questions 2.1 or 2.2?

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Now, suppose we fit two multiple linear regression models to predict the sale price of a house.

The first uses the features $\text{Square Feet}$ and $\text{Bathrooms}$:

$$H_\gamma(\vec x) = \gamma_0 + \gamma_1 \cdot \text{Square Feet} +\gamma_2 \cdot \text{Bathrooms}$$

The second model uses the  features $\text{Square Feet}$ and $\text{Bathrooms}$ and a new seventh feature named $\text{Length of Street Name}$, which is the number of letters in the name of the street that the house is on:

$$H_\lambda(\vec x) = \lambda_0 + \lambda_1 \cdot \text{Square Feet} +\lambda_2 \cdot \text{Bathrooms} + \lambda_3 \cdot \text{Length of Street Name}$$

Let $\text{TMSE}$ refer to the "training" mean squared error, that is, the mean squared error of a hypothesis function **on the same dataset we used to fit it**. (Through Lecture 18, we just referred to this idea as MSE.)

Argue why $\text{TMSE}(H_\lambda) \leq \text{TMSE}(H_\gamma)$.

<!-- END QUESTION -->

## Question 3: Play Ball ⚾️

---

In this question, you'll get a feel for the process of creating new features from existing ones and how to _think_ about model generalizability, an idea we'll see more in Lecture 19.

<br>

As we discussed in Lecture 18, a numerical-to-numerical transformation results from taking the values in some numerical column $x_1, x_2, ..., x_n$ and applying some function $f$ to each value, to produce a new set of numbers $f(x_1), f(x_2), ..., f(x_n)$. These **transformed** values, $f(x_1), f(x_2), ..., f(x_n)$, can then either be used as a feature, or as the target ($y$) variable.

A common goal of applying a numerical-to-numerical transformation is to modify the data from a complicated, non-linear relationship into a **linear** relationship. Linear relationships are easy to understand and are well-described using linear models.

However, non-linear growth is common in real-world datasets. Sometimes this growth is by a **fixed power** and sometimes it is **exponential**. The transformation functions, $f$, that turn these types of growth linear are **root** and **log** transformations respectively. (Generally, it is more difficult to determine which transformation is appropriate for a given dataset, though the [Tukey-Mosteller bulge diagram](https://freakonometrics.hypotheses.org/files/2014/06/Selection_005.png) from Lectures 17 and 18 is useful.)
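
To see why these transformations linearize, here is a minimal sketch on noiseless, made-up data. The constants (3, 0.5, 4) are arbitrary choices for illustration, and `np.polyfit` is used only to check that the transformed values lie on a line.

```python
import numpy as np

x = np.linspace(1, 10, 50)

# Exponential growth: log(y) = log(3) + 0.5 * x is linear in x.
y_exp = 3 * np.exp(0.5 * x)
slope_log, intercept_log = np.polyfit(x, np.log(y_exp), deg=1)

# Fixed-power growth: sqrt(y) = 2 * x is linear in x.
y_pow = 4 * x ** 2
slope_root, intercept_root = np.polyfit(x, np.sqrt(y_pow), deg=1)

print(slope_log, np.exp(intercept_log))  # approximately 0.5 and 3
print(slope_root, intercept_root)        # approximately 2 and 0
```

With real, noisy data the fit will not be exact, which is why the examples that follow rely on residual plots to judge linearity.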

Let's start by looking at some examples of transformations.

### Example 1

Run the cell below to generate a scatter plot.

In [3]:
# By setting a seed, we guarantee that we will see the same results each time we run this cell.
np.random.seed(23)

# Generates a random scatter plot
x = np.arange(1, 101) + np.random.normal(0, 0.5, 100)
y = 2 * ((x + np.random.normal(0, 1, 100)) ** 2) + np.abs(x) * np.random.normal(0, 30, 100)
df_1 = pd.DataFrame().assign(x=x, y=y)

px.scatter(df_1, x='x', y='y', trendline="ols", trendline_color_override="#ff7f0e")

It doesn't appear to be the case that `'x'` and `'y'` are linearly associated here, and they aren't – there is a **quadratic** relationship between them. 

One way we may be able to notice this is a **residual plot**, where we visualize the residuals (or errors), $e_i = y_i - H^*(x_i)$, as defined in Question 1. Note that if we were to create a **residual plot** based on the data above, there would be a pattern – the residuals for the smallest and largest values of `'x'` would mostly be positive, and the residuals for values of `'x'` in the middle would mostly be negative. Patterns in a residual plot imply that the relationship between the two variables is non-linear.

Let's take a look at the residual plot, using a helper function defined below. This function fits a `LinearRegression` model to `'x'` and `'y'`, adds a `'residuals'` column to the `df`, and plots that against the predictions `'pred'`. Note that it's equally valid to plot the residuals against `'x'`: to do that, change `x = 'pred'` to `x = x` in the call to `px.scatter` below. You'll see the trend is the same, but the x-axis will have different numbers. That's because `'pred'` is just a linear transformation of `'x'`.

In [4]:
# Feel free to use this function directly to help you answer Question 3.1.
def create_residual_plot(df, x, y):
    df = df.copy()
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(df[[x]], df[y])
    df['pred'] = model.predict(df[[x]])
    df[f'{y} residuals'] = df[y] - model.predict(df[[x]])
    return px.scatter(df, x='pred', y=f'{y} residuals', trendline='ols', trendline_color_override='red')

create_residual_plot(df_1, 'x', 'y')

To linearize the relationship, we can take the square root of each `'y'` value:

In [5]:
df_1['root y'] = np.sqrt(df_1['y'])

px.scatter(df_1, x='x', y='root y', trendline="ols", trendline_color_override="#ff7f0e")

That looks much better!

### Example 2

Run the cell below to generate another scatter plot.

In [6]:
# By setting a seed, we guarantee that we will see the same results each time we run this cell
np.random.seed(32)

# Generates a different random scatter plot
x = np.linspace(2, 5, 100)
y = 10 * (np.e ** x) + np.abs(x) * np.random.normal(0, 5, 100) + np.random.normal(0, 30, 100)
df_2 = pd.DataFrame().assign(x=x, y=y)

px.scatter(df_2, x='x', y='y', trendline="ols", trendline_color_override="#ff7f0e")

Again, the relationship between `'x'` and `'y'` is not quite linear. Let's try the square root transformation we tried in Example 1:

In [7]:
df_2['root y'] = np.sqrt(df_2['y'])

px.scatter(df_2, x='x', y='root y', trendline="ols", trendline_color_override="#ff7f0e")

Hmm... the relationship certainly looks _more_ linear than before, but still not quite linear. Let's look at the residual plot:

In [8]:
create_residual_plot(df_2, 'x', 'root y')

There is clearly a pattern in the residual plot. Let's instead try another transformation for the `'y'` values – $\log$.

In [9]:
df_2['log y'] = np.log(df_2['y'])

px.scatter(df_2, x='x', y='log y', trendline="ols", trendline_color_override="#ff7f0e")

That looks much better! We can verify that the residual plot has no "patterns":

In [10]:
create_residual_plot(df_2, 'x', 'log y')

Note – there is still evidence of **heteroscedasticity**, or "uneven spread", in this scatter plot, but the relationship is as close to linear as we'll get.

Now that we've learned how to perform transformations with example datasets, it's your job to apply these ideas to a real dataset. Below, we load in a dataset that describes the [number of home runs in the MLB per year](https://www.mlb.com/glossary/standard-stats/home-run). The relationship between the two variables, `'Year'` and `'Homeruns'`, is not linear.

In [11]:
homeruns = pd.read_csv('data/homeruns.csv')
homeruns.head()

Unnamed: 0,Year,Homeruns
0,1900,254
1,1901,455
2,1902,354
3,1903,335
4,1904,331


In [12]:
homeruns.plot(kind='scatter', x='Year', y='Homeruns')

**Throughout this entire question**, suppose we're modeling `'Homeruns'` as a function of `'Year'`, i.e. `'Homeruns'` is the $y$ variable and `'Year'` is the $x$.

### Question 3.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

**Your first job is to determine what the appropriate transformation to apply to the `'Homeruns'` column is, in order to linearize the relationship.** Specifically, try out the transformations below, and then draw and examine residual plots to identify which numerical-to-numerical transformation is best.

While you'll have to write a bunch of code, this is a multiple-choice question. Assign `best_transformation` to either 1, 2, 3, or 4, with the value corresponding to one of the following choices:

1. Square root transformation.
2. Log transformation.
3. Both work the same.
4. Neither gives a transformation revealing a linear relationship.

If you find that both residual plots have some sort of pattern, choose the transformation whose residual plot has a constant vertical spread. There is one clearly correct answer.

In [13]:
homeruns["log"] = np.log(homeruns["Homeruns"])
homeruns["sqrt"] = np.sqrt(homeruns["Homeruns"])

In [14]:
homeruns.plot(kind='scatter', x='Year', y='sqrt')

In [15]:
create_residual_plot(homeruns, 'Year', 'sqrt')

In [16]:
homeruns.plot(kind='scatter', x='Year', y='log')

In [17]:
create_residual_plot(homeruns, 'Year', 'log')

In [18]:
best_transformation = 1
best_transformation

1

In [19]:
grader.check("q03_01")

### Question 3.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Recall, our goal in this question is to model `'Homeruns'` as a function of `'Year'`. In the previous part, we had you apply a numerical-to-numerical transformation to `'Homeruns'`, which is the $y$ variable.

In this part, you'll be required to engineer new quantitative features **of your own choosing**, all based on transformations of the $x$ variable, `'Year'`.

Complete the implementation of the function `fit_model_and_return_predictions`, which takes in:
- `X`, a DataFrame with a single column of `'Year'` values from `homeruns`, and
- `y`, an array or Series with a sequence of `'Homeruns'` values from `homeruns`.

`fit_model_and_return_predictions` should:
- Create new numerical features by applying various transformations to the values in `X['Year']` (look at the "Polynomial regression" example from Lecture 18 for inspiration – you don't need to use the `apply` method to apply a transformation),
- Fit an `sklearn` `LinearRegression` object using your custom design matrix as the `X` argument and our passed-in `y` as the `y` argument, and
- **Return an array of predictions** that result from calling the `predict` method on the fit model, using your custom design matrix as the `X` argument.

For example, suppose our `fit_model_and_return_predictions` function creates polynomial features of degrees 2 through 10, and adds no other new features. Example behavior of `fit_model_and_return_predictions` may then be as follows:

```python
>>> fit_model_and_return_predictions(homeruns[['Year']], homeruns['Homeruns'])[:5]
array([165.07808666, 300.52105073, 174.28363771, 288.87689757, 395.065479  ])
```

A plot of the predictions returned by `fit_model_and_return_predictions` might then look like:

<center><img src="imgs/fit-model.png" width=500></center>

Is this a "good" model? Sure, it has a low training MSE, but it's not likely to generalize well to unseen $x$-values – in this case, future `'Year'`.

**You can create your features however you'd like!** Don't just use our example of using polynomial features of degrees 2 to 10. Try, intuitively, to come up with a fit hypothesis function that _you think_ is likely to generalize well to future `'Year'`s for which we don't know the number of `'Homeruns'`. We will formalize how to develop models that generalize well in the coming lectures.

All we can autograde in this question are your resulting predictions – practically, we have no way of knowing how you come up with them. Other than what's described above, here are the only added requirements of your function:

- It should be able to take in a **subset** of the rows in `homeruns`, and should do all calculations (feature creation, fitting, predicting) using that subset. So, this should work too:
    ```python
        >>> fit_model_and_return_predictions(homeruns.head()[['Year']], homeruns.head()['Homeruns'])
        
    ```
    Note that in `fit_model_and_return_predictions`, the `X` data used to fit the model is always the same as the data used to make predictions. In other cases, this is not necessarily how it works – after all, we typically build models using historical data and use them to make predictions about future data – but this is how we'll use and test `fit_model_and_return_predictions`.

- The array that `fit_model_and_return_predictions` returns should be **deterministic**. That is, if `fit_model_and_return_predictions` is called twice with the exact same inputs `X` and `y`, the output should not change.
- The mean squared error of the predictions, when called on `X=homeruns[['Year']]` and `y=homeruns['Homeruns']`, should be **between 100,000 and 200,000**. Yes, it's possible to achieve a mean squared error of less than 100,000, but such a model is likely **overfitting** significantly to the data. (In fact, in Homework 8, you learned how to build models with 0 MSE, using Lagrange Interpolation! **Don't do that here – try and build more general-purpose models.**)

In [20]:
def MSE(pred, y):
    # Mean squared error between the predictions and the actual values.
    return np.mean((y - pred) ** 2)


In [21]:
from sklearn.linear_model import LinearRegression

def fit_model_and_return_predictions(X, y):
    X = X.copy()
    # Below, create your features and design matrix,
    # instantiate a LinearRegression object,
    # fit it, and then call model.predict on it.
    X["X^2"] = X["Year"]**2
    X["X^3"] = X["Year"]**3
    X["X^4"] = X["Year"]**4
    model = LinearRegression()
    model.fit(X, y)
    return model.predict(X)

# Feel free to change this input to make sure your function works correctly.
preds = fit_model_and_return_predictions(homeruns[['Year']], homeruns['Homeruns'])
print(MSE(preds,homeruns['Homeruns']))


# The code below draws a graph of your
# fit hypothesis function's predictions.
fig = homeruns.plot(kind='scatter', x='Year', y='Homeruns')

fig.add_trace(go.Scatter(
    x=homeruns['Year'],
    y=preds,
    mode='lines',
    line=dict(width=4),
    name='Fit Model'
))

191902.32551333544


In [22]:
grader.check("q03_02")

### Question 3.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">5 Points</div>

Above, you had to manually create features that resulted in a hypothesis function that fit the data well (but not too well). You may wonder, is there a way to do this automatically?

One _kind-of_ solution is to use **Nearest Neighbors Regression**. In nearest neighbors regression, to evaluate the hypothesis function $H^*$ on the input $x_\text{new}$:

1. First, choose a value of $k$. Sometimes, this is called **$k$-Nearest Neighbors Regression, or $k$-NN Regression**.
1. Then:
    1. Find the $k$ points in the original dataset whose $x$-values are closest to $x_\text{new}$ in terms of absolute difference, $|x_i - x_\text{new}|$ (note that since we're essentially dealing with just a single $x$ feature, using squared distance would also work the same way).
    1. Return the mean of the $y$-values corresponding to the $k$ points found in the step above.

For example, suppose our original dataset is:

| x | y |
| --- | --- |
| 10 | 5 |
| 11 | 17 |
| 12 | 26 |
| 19 | -5 |
| 25 | 3 |

Suppose we choose $k = 3$, and suppose we want to predict the $y$-value for $x_\text{new} = 20$. Then:
- The $k = 3$ points with the closest $x$-values are $(19, -5)$, $(25, 3)$, and $(12, 26)$.
- The mean of the $y$-values of the three points above is $\frac{-5 + 3 + 26}{3} = 8$.
- So, we predict a $y$-value of 8 for input $x_\text{new} = 20$.
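
The arithmetic in this example can be traced step by step with NumPy. This is only a sketch of the worked example, not the required `create_knn_regressor` implementation; the variable names are illustrative.

```python
import numpy as np

x = np.array([10, 11, 12, 19, 25])
y = np.array([5, 17, 26, -5, 3])
k = 3
x_new = 20

# Absolute distance from each x-value to x_new: [10, 9, 8, 1, 5].
distances = np.abs(x - x_new)

# Positions of the k smallest distances, closest first.
nearest = np.argsort(distances)[:k]

# Average the y-values of those k neighbors.
prediction = y[nearest].mean()

print(x[nearest])  # [19 25 12]
print(prediction)  # 8.0
```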

This is a regression technique, because it allows us to predict real-valued outputs. However, it is different from linear regression in that it is **non-parametric** – there are no **parameters** $w_0^*, w_1^*, ...$ that we're solving for in order to make our predictions.

We can choose $k$ to be whatever we want it to be, but some values of $k$ are "better" in some sense than others. We'll explore this idea in Question 3.4, when we tie things back into the `homeruns` dataset.

**Your job is to** complete the implementation of the function `create_knn_regressor`, which takes in:
- `x`, a 1D array/Series of $x$-values, e.g. `homeruns['Year']`,
- `y`, a 1D array/Series of $y$-values, e.g. `homeruns['Homeruns']`, and
- `k`, a positive integer corresponding to the value of $k$ (where `k <= len(x)`).

`create_knn_regressor` should return a **function** that can take in a single number `x_new` and return the predicted $y$-value for the input `x_new`, according to the process outlined above.

Example behavior is given below.

```python
>>> regressor = create_knn_regressor(x=np.array([10, 11, 12, 19, 25]),
...                                  y=np.array([5, 17, 26, -5, 3]),
...                                  k=3)
>>> regressor(20)
8.0
```

Some guidance:
- The bulk of the work in this question is in understanding how Nearest Neighbors Regression works. Our implementation is very short (5 lines total).
- **You're not allowed to use `sklearn` here**, but don't forget to use what you know about `pandas` DataFrames! You shouldn't use a `for`-loop.
- Don't worry about cases in which there are ties in distance (e.g. if $k = 3$ but there are 4 points that are all equidistant from $x_\text{new}$ such that they are all the closest); our tests are written in a way that will not penalize your handling of this situation if it's different from ours.

In [23]:
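def create_knn_regressor(x, y, k):
    # Convert to NumPy arrays so that the indexing below is positional,
    # even if x and y are Series with a non-default index.
    x = np.array(x)
    y = np.array(y)
    def reg(new):
        # Distance from every x-value to the new input.
        dis = np.abs(x - new)
        # Positions of the points, sorted from closest to farthest.
        idx = np.argsort(dis)
        # Mean of the y-values of the k closest points.
        return np.mean(y[idx[:k]])
    return reg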

# Feel free to change these inputs to make sure your function works correctly.
# It's a good idea to test out create_knn_regressor on the homeruns dataset!
regressor = create_knn_regressor(x=np.array([10, 11, 12, 19, 25]),
                                 y=np.array([5, 17, 26, -5, 3]),
                                 k=3)
regressor(20)

8.0

In [24]:
grader.check("q03_03")

Once you've implemented `create_knn_regressor`, run the cell below to see an **interactive** widget that will allow you to choose different values of $k$ and see the resulting $k$-NN regressor plotted on top of the `homeruns` dataset.

In [25]:
import ipywidgets as widgets
from IPython.display import display, clear_output

def plot_given_k(k):
    x = homeruns['Year']
    y = homeruns['Homeruns']
    regressor = create_knn_regressor(x, y, k)
    preds = [regressor(xi) for xi in x]

    fig = px.scatter(x=x, y=y).update_layout(xaxis_title='Year', yaxis_title='Homeruns', title=f'Fit kNN Model with k={k}')

    return fig.add_trace(go.Scatter(
        x=x,
        y=preds,
        mode='lines',
        line=dict(width=4),
        name='Fit Model'
    ))

widgets.interact(plot_given_k, k=widgets.IntSlider(min=1, max=140, step=1, value=5));


Try different values of $k$ – what do you notice?

<!-- BEGIN QUESTION -->

### Question 3.4 [Written ✏️] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div>

Play around with the interactive cell above. Then, comment on the following points **in your PDF writeup, along with your answers to the rest of the written problems in this homework (that is, _not_ in this notebook)**:
1. When $k = 1$, what does the resulting fit model look like, and how does it relate to models we've seen in earlier lectures/homeworks?
2. When $k = 140$, what does the resulting fit model look like, and how does it relate to models we've seen in earlier lectures/homeworks?
3. Which value of $k$ do you _feel_ best captures the trend in the data, and why? (Just give a one sentence intuitive answer – no calculations needed.)

DONE 

<!-- END QUESTION -->

## Question 4: Diamond Pricing 💎

---

In this next section, you will pretend you are a jewelry appraiser and predict the prices of diamonds given several standard characteristics of diamonds.

You will use linear regression to predict prices, while improving the quality of your predictions using **feature engineering**. Since this question is supposed to help you understand feature engineering, **you will be building these features from scratch, instead of using built-in `sklearn` methods**.

The `diamonds` dataset is accessible via `seaborn` (with `sns.load_dataset('diamonds')`), but we've skipped that step and loaded it for you below. The DataFrame has 53940 rows and 10 columns:

|column|description|unique values or range|
|---|---|---|
|`'carat'`|weight of the diamond in carats (each carat is 0.2 grams)| 0.2 - 5.01 |
|`'cut'`|quality of the cut | Fair, Good, Very Good, Premium, Ideal |
|`'color'`|diamond colour | J (worst, near colorless), I, H, G, F, E, D (best, absolute colorless) |
|`'clarity'`|a measurement of how clear the diamond is | I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best) |
|`'depth'`|total depth percentage, computed as z / mean(x, y) = 2 * z / (x + y) | 43 - 79 |
|`'table'`|width of top of diamond relative to widest point | 43 - 95 |
|`'price'`|price in US dollars | \\$326 - \\$18,823 USD |
|`'x'`|length in mm | 0 - 10.74 |
|`'y'`|width in mm | 0 - 58.9 | 
|`'z'`|depth in mm | 0 - 31.8 |

If you want to learn more about how diamonds are measured, refer to [this page by the American Gem Society](https://www.americangemsociety.org/4cs-of-diamonds/).

In [26]:
diamonds = pd.read_csv('data/diamonds.csv')
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


### Question 4.1 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Every categorical variable in the dataset is an ordinal column, meaning that there is an inherent order that we can use to sort the values in the column. An **ordinal encoding** is a feature transformation that maps the values in an ordinal column to non-negative integers in a way that preserves the order of the column values. For instance, an ordinal encoding for Freshman, Sophomore, Junior, Senior is 0, 1, 2, 3.
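
As a minimal sketch of the idea (on a hypothetical toy Series, not the `diamonds` data), an ordering can be turned into an ordinal encoding by mapping each value to its position in the ordering. The names `standing`, `order`, and `mapping` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy column and its ordering, from lowest to highest.
standing = pd.Series(["Junior", "Freshman", "Senior", "Sophomore", "Junior"])
order = ["Freshman", "Sophomore", "Junior", "Senior"]

# Map each category to its position in the ordering, counting from 0.
mapping = {value: rank for rank, value in enumerate(order)}
encoded = standing.map(mapping)

print(encoded.tolist())  # [2, 0, 3, 1, 2]
```

Because the integers follow the ordering, comparisons such as `encoded > 1` line up with "Junior or above".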

Complete the implementation of the function `create_ordinal`, which takes in a DataFrame `df` like `diamonds` and returns a DataFrame of ordinal features only with names of the form `'ordinal_<col>'`, where `'<col>'` is the original categorical column name. For instance, the `'ordinal_color'` column should consist of values from 0 to 6, where 0 refers to `'J'` and 6 refers to `'D'`. **In all cases, start counting from 0.**

Example behavior is given below.

```python
>>> create_ordinal(diamonds.head(5))
```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>ordinal_cut</th>
      <th>ordinal_clarity</th>
      <th>ordinal_color</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>4</td>
      <td>1</td>
      <td>5</td>
    </tr>
    <tr>
      <th>1</th>
      <td>3</td>
      <td>2</td>
      <td>5</td>
    </tr>
    <tr>
      <th>2</th>
      <td>1</td>
      <td>4</td>
      <td>5</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3</td>
      <td>3</td>
      <td>1</td>
    </tr>
    <tr>
      <th>4</th>
      <td>1</td>
      <td>1</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

Some guidance:
- Remember, you're only permitted to use `pandas` operations. You might want to create a helper function that takes in a single column and an ordering for that column.
- Don't include non-ordinal features in the returned DataFrame. That is, if there are only three columns in `diamonds` that are ordinal, `create_ordinal` should return a DataFrame with three columns.
- The orderings for each of the ordinal columns are displayed in the data dictionary above (in the `'unique values or range'` column).

In [27]:
def create_ordinal(df):
    dfc = df.copy()
    dfc = dfc[["cut", "color", "clarity"]]
    # Use .map rather than .replace: it yields integer columns directly and avoids
    # the downcasting FutureWarning that .replace raises in recent pandas versions.
    dfc["ordinal_cut"] = dfc["cut"].map({"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4})
    dfc["ordinal_color"] = dfc["color"].map({"J": 0, "I": 1, "H": 2, "G": 3, "F": 4, "E": 5, "D": 6})
    dfc["ordinal_clarity"] = dfc["clarity"].map({"I1": 0, "SI2": 1, "SI1": 2, "VS2": 3, "VS1": 4, "VVS2": 5, "VVS1": 6, "IF": 7})
    
    dfc = dfc.drop(columns=["cut", "color", "clarity"])
    return dfc
    
# Feel free to change this input to make sure your function works correctly.
create_ordinal(diamonds)

Unnamed: 0,ordinal_cut,ordinal_color,ordinal_clarity
0,4,5,1
1,3,5,2
2,1,5,4
3,3,1,3
4,1,0,1
...,...,...,...
53935,4,6,2
53936,1,6,2
53937,2,6,2
53938,3,2,1


In [28]:
grader.check("q04_01")

### Question 4.2 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Even though the categorical variables in the dataset are ordinal, we can still treat them as nominal by forgetting their order. To treat the categorical variables in our dataset as nominal, we might **one hot encode** them. 
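
To make the idea concrete, here is a small sketch of one hot encoding on a hypothetical four-row column (the `toy` DataFrame is invented for illustration). Each distinct value gets its own 0/1 column, built with a vectorized comparison rather than a loop over rows.

```python
import pandas as pd

# Hypothetical toy data; not the real diamonds DataFrame.
toy = pd.DataFrame({"cut": ["Ideal", "Good", "Ideal", "Fair"]})

# One column per distinct value: 1 where the row has that value, 0 otherwise.
one_hot = pd.DataFrame({
    f"one_hot_cut_{val}": (toy["cut"] == val).astype(int)
    for val in toy["cut"].unique()
})

print(one_hot)
```

Every row has exactly one 1 across the resulting columns, which is a quick sanity check for any one hot encoding that keeps all categories.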

Complete the implementation of the function `create_one_hot`, which takes in a DataFrame `df` like `diamonds` and returns a DataFrame of one hot encoded features with names of the form `'one_hot_<col>_<val>'`, where `'<col>'` is the original categorical column name, and `'<val>'` is the value found in the categorical column `'<col>'`. For instance, one of your column names will be `'one_hot_color_J'`.

Example behavior is given below, for a subset of features.

```python
>>> out = create_one_hot(diamonds)
>>> out.loc[:5, out.columns.str.contains('cut')]
```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>one_hot_cut_Ideal</th>
      <th>one_hot_cut_Premium</th>
      <th>one_hot_cut_Good</th>
      <th>one_hot_cut_Very Good</th>
      <th>one_hot_cut_Fair</th>
      <th>one_hot_color_E</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>2</th>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <th>5</th>
      <td>0</td>
      <td>0</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>


Some guidance:
- Only include one-hot-encoded columns in the DataFrame that `create_one_hot` returns.
- Create a helper function that creates the one-hot encoding for a single column. **Do not** use `sklearn` or `pd.get_dummies` for this question!
- As per usual, write an efficient implementation. You may use a `for`-loop over **columns**, but not over rows. And the order of **columns** does not matter.
- In lecture, we discussed the fact that for statistical reasons, we often drop a single one hot encoded column per original categorical feature. **Do not drop** any one hot encoded columns here!

In [29]:
def create_one_hot(df):
    dfc = df.copy()
    dfc = dfc[["cut", "color", "clarity"]]
    for col in dfc.columns:
        for val in dfc[col].unique():
            dfc[f'one_hot_{col}_{val}'] = (dfc[col] == val).astype(int)
            
    dfc = dfc.drop(columns=["cut", "color", "clarity"])
    return dfc

    
    
# Feel free to change this input to make sure your function works correctly.
create_one_hot(diamonds)

Unnamed: 0,one_hot_cut_Ideal,one_hot_cut_Premium,one_hot_cut_Good,one_hot_cut_Very Good,one_hot_cut_Fair,one_hot_color_E,one_hot_color_I,one_hot_color_J,one_hot_color_H,one_hot_color_F,one_hot_color_G,one_hot_color_D,one_hot_clarity_SI2,one_hot_clarity_SI1,one_hot_clarity_VS1,one_hot_clarity_VS2,one_hot_clarity_VVS2,one_hot_clarity_VVS1,one_hot_clarity_I1,one_hot_clarity_IF
0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53935,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
53936,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
53937,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
53938,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0


In [30]:
grader.check("q04_02")

### Question 4.3 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

Similar to the one hot encoding case, you can replace a value in a nominal column with the proportion of times that value appears in the column. For instance, if a column consists of the values `['a', 'b', 'a', 'c']`, then the proportion-encoded column is `[0.5, 0.25, 0.5, 0.25]`.  This might be a reasonable approach to predicting the price of a diamond, as you might expect **rarer attributes to be considered more valuable** than common ones.
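
The example in the paragraph above can be checked with a short sketch. It assumes one possible approach (computing normalized counts, then mapping each value to its proportion); other `pandas` approaches would work equally well.

```python
import pandas as pd

col = pd.Series(["a", "b", "a", "c"])

# Proportion of rows taken up by each distinct value, indexed by value.
proportions = col.value_counts(normalize=True)

# Replace each value with its proportion by looking it up in `proportions`.
encoded = col.map(proportions)

print(encoded.tolist())  # [0.5, 0.25, 0.5, 0.25]
```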

Complete the implementation of the function `create_proportions`, which takes in a DataFrame `df` like `diamonds` and returns a DataFrame of proportion-encoded features with names of the form `'proportion_<col>'`, where `'<col>'` is the original categorical column name.

Example behavior is given below.

```python
>>> create_proportions(diamonds).head(5)
```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>proportion_cut</th>
      <th>proportion_color</th>
      <th>proportion_clarity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.399537</td>
      <td>0.181628</td>
      <td>0.170449</td>
    </tr>
    <tr>
      <th>1</th>
      <td>0.255673</td>
      <td>0.181628</td>
      <td>0.242214</td>
    </tr>
    <tr>
      <th>2</th>
      <td>0.090953</td>
      <td>0.181628</td>
      <td>0.151483</td>
    </tr>
    <tr>
      <th>3</th>
      <td>0.255673</td>
      <td>0.100519</td>
      <td>0.227253</td>
    </tr>
    <tr>
      <th>4</th>
      <td>0.090953</td>
      <td>0.052058</td>
      <td>0.170449</td>
    </tr>
  </tbody>
</table>


In [31]:
def create_proportions(df):
    dfc = df.copy()
    dfc = dfc[["cut", "color", "clarity"]]
    for col in dfc.columns:
        dfc[f'proportion_{col}'] = dfc[col].value_counts(normalize=True)[dfc[col]].values
    dfc = dfc.drop(columns=["cut", "color", "clarity"])
    return dfc
    
# Feel free to change this input to make sure your function works correctly.
create_proportions(diamonds)

Unnamed: 0,proportion_cut,proportion_color,proportion_clarity
0,0.399537,0.181628,0.170449
1,0.255673,0.181628,0.242214
2,0.090953,0.181628,0.151483
3,0.255673,0.100519,0.227253
4,0.090953,0.052058,0.170449
...,...,...,...
53935,0.399537,0.125603,0.242214
53936,0.090953,0.125603,0.242214
53937,0.223990,0.125603,0.242214
53938,0.255673,0.153949,0.170449


In [32]:
grader.check("q04_03")

### Question 4.4 [Autograded 💻] <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">4 Points</div>

As we looked at in depth in Question 3, linear regression doesn't capture non-linear relationships between variables. However, you can create features that encode such dependencies **before** fitting your regression model. Creating polynomial features is one way to do this.

For instance, the diamonds dataset contains `'x'`, `'y'`, and `'z'` dimensions for each stone. However, different combinations of size may be more valuable than others: a "deep and wide" diamond might be considered more valuable than a shallow, but "long and wide" diamond.

Complete the implementation of the function `create_quadratics`, which takes in a DataFrame `df` like `diamonds` and returns a DataFrame of quadratic features of the form `'<col1> * <col2>'`, where `'<col1>'` and `'<col2>'` are the original quantitative columns.
- The output DataFrame should contain a column for every distinct pair of quantitative columns in `df` (aside from `'price'`, which should be left out as it is what we are predicting).
- For instance, one of the columns in the returned DataFrame should be named either `'carat * x'` or `'x * carat'`; the order of column names is not important.

Example behavior is given below. 


```python
>>> out = create_quadratics(diamonds)
>>> out.loc[:5, out.columns.str.contains('carat')]
```
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>carat * depth</th>
      <th>carat * table</th>
      <th>carat * x</th>
      <th>carat * y</th>
      <th>carat * z</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>14.145</td>
      <td>12.65</td>
      <td>0.9085</td>
      <td>0.9154</td>
      <td>0.5589</td>
    </tr>
    <tr>
      <th>1</th>
      <td>12.558</td>
      <td>12.81</td>
      <td>0.8169</td>
      <td>0.8064</td>
      <td>0.4851</td>
    </tr>
    <tr>
      <th>2</th>
      <td>13.087</td>
      <td>14.95</td>
      <td>0.9315</td>
      <td>0.9361</td>
      <td>0.5313</td>
    </tr>
    <tr>
      <th>3</th>
      <td>18.096</td>
      <td>16.82</td>
      <td>1.2180</td>
      <td>1.2267</td>
      <td>0.7627</td>
    </tr>
    <tr>
      <th>4</th>
      <td>19.623</td>
      <td>17.98</td>
      <td>1.3454</td>
      <td>1.3485</td>
      <td>0.8525</td>
    </tr>
    <tr>
      <th>5</th>
      <td>15.072</td>
      <td>13.68</td>
      <td>0.9456</td>
      <td>0.9504</td>
      <td>0.5952</td>
    </tr>
  </tbody>
</table>


Some guidance:
- Again, **do not** use `sklearn` for this question! 
- Try finding all pairs of quantitative columns efficiently; don't use a nested loop (hint: think back to `SimpleLAD` from Homework 8). Our solution contains just a single `for`-loop, over pairs of columns.
- The columns of the resulting DataFrame may be in any order.
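
As a rough sketch of the pairing idea on a hypothetical three-column DataFrame (`toy` is invented for illustration), `itertools.combinations` yields each unordered pair of columns exactly once, so a single loop over pairs is enough.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data with three quantitative columns.
toy = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})

# One product column per distinct pair of columns (no nested loop needed).
products = pd.DataFrame({
    f"{a} * {b}": toy[a] * toy[b]
    for a, b in combinations(toy.columns, 2)
})

print(list(products.columns))  # ['x * y', 'x * z', 'y * z']
```

With $n$ quantitative columns there are $n(n-1)/2$ pairs, so the six non-price quantitative columns in `diamonds` give 15 product columns.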

In [33]:
from itertools import combinations 

def create_quadratics(df):
    dfc = df.copy()
    dfc = dfc.drop(columns=['price'])
    num = dfc.columns[dfc.dtypes != 'object']
    pairs = list(combinations(num, 2))
    res = pd.DataFrame()
    for pair in pairs:
        res[f'{pair[0]} * {pair[1]}'] = dfc[pair[0]] * dfc[pair[1]]
    return res
    
   
    
# Feel free to change this input to make sure your function works correctly.
create_quadratics(diamonds)

Unnamed: 0,carat * depth,carat * table,carat * x,carat * y,carat * z,depth * table,depth * x,depth * y,depth * z,table * x,table * y,table * z,x * y,x * z,y * z
0,14.145,12.65,0.9085,0.9154,0.5589,3382.5,242.925,244.770,149.445,217.25,218.90,133.65,15.7210,9.5985,9.6714
1,12.558,12.81,0.8169,0.8064,0.4851,3647.8,232.622,229.632,138.138,237.29,234.24,140.91,14.9376,8.9859,8.8704
2,13.087,14.95,0.9315,0.9361,0.5313,3698.5,230.445,231.583,131.439,263.25,264.55,150.15,16.4835,9.3555,9.4017
3,18.096,16.82,1.2180,1.2267,0.7627,3619.2,262.080,263.952,164.112,243.60,245.34,152.54,17.7660,11.0460,11.1249
4,19.623,17.98,1.3454,1.3485,0.8525,3671.4,274.722,275.355,174.075,251.72,252.30,159.50,18.8790,11.9350,11.9625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53935,43.776,41.04,4.1400,4.1472,2.5200,3465.6,349.600,350.208,212.800,327.75,328.32,199.50,33.1200,20.1250,20.1600
53936,45.432,39.60,4.0968,4.1400,2.5992,3470.5,359.039,362.825,227.791,312.95,316.25,198.55,32.7175,20.5409,20.7575
53937,43.960,42.00,3.9620,3.9760,2.4920,3768.0,355.448,356.704,223.568,339.60,340.80,213.60,32.1488,20.1496,20.2208
53938,52.460,49.88,5.2890,5.2632,3.2164,3538.0,375.150,373.320,228.140,356.70,354.96,216.92,37.6380,23.0010,22.8888


In [34]:
grader.check("q04_04")

This homework is already quite long, and the focus of Question 4 was on having you develop the features yourself, not necessarily use them in prediction tasks. So, we won't require you to _fit_ any models using the features you've created. That said, you **should** try and experiment.

## Question 5: Feature Engineering with `sklearn` 🧠

---

In this final question, you will use `sklearn`'s transformers and estimators for feature engineering. While everything you do with `sklearn` is possible to do with `pandas`, `sklearn` transformers enable you to couple your feature engineering with your modeling. This will allow you to more quickly build and assess your models in `sklearn`.

Specifically, you will implement a `TransformDiamonds` class that has the three methods described below. In the starter code, there is a skeleton for `TransformDiamonds`, complete with an initializer.

Each of the methods you implement in the `TransformDiamonds` class should take in a DataFrame, initialize a specific `sklearn` transformer object (like `Binarizer` or `FunctionTransformer`), and use the transformer to transform columns from the input DataFrame. You should **not** use DataFrame methods like `apply` in this problem.

<br>

### 1. `transform_carat` <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div> [Autograded 💻]

We call a diamond **large** if its weight is strictly greater than 1 carat. We want to **binarize** weights, so that they are 1 for large diamonds and 0 for small diamonds. This method takes in a DataFrame `df` like `diamonds` and returns a binarized **array** of weights. Use a `Binarizer` object as your transformer.

Additional guidance:
- `transform_carat` should return an array, not a Series, because `sklearn` thinks in terms of `np.ndarray`s, not DataFrames.
- The implementation of `transform_carat` should only take two lines.
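
For intuition, here is a small sketch of how `Binarizer` treats values around its threshold, using a hypothetical array of weights rather than the real `'carat'` column. Values strictly greater than the threshold become 1, and everything else becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical weights, shaped as a 2D array with a single column.
carats = np.array([[0.5], [1.0], [1.01], [2.3]])

# Exactly 1.0 is not strictly greater than the threshold, so it maps to 0.
binarized = Binarizer(threshold=1.0).fit_transform(carats)

print(binarized.ravel())  # [0. 0. 1. 1.]
```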

<br>

### 2. `transform_to_quantile` <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">2 Points</div> [Autograded 💻]

Here, transform the `'carat'` column so that each diamond's weight in carats is replaced with the **percentile** at which its weight lies among all diamonds. This method takes in a DataFrame `df` like `diamonds` and returns an array containing the percentiles of the weight of each diamond, amongst all diamonds in `self.data`. This array should consist of proportions between 0 and 1; for instance, 0.65 will refer to the 65th percentile. The relevant transformer is `QuantileTransformer`. 

Additional guidance:

- Unlike with `Binarizer`, you need to `fit` your `QuantileTransformer` before calling `transform` on the input DataFrame `df`. 
    - You should `fit` your transformer on the DataFrame `self.data`, but you should only `transform` the `df` that is passed to `transform_to_quantile`. 
    - Note that these two DataFrames, `self.data` and `df`, don't have to be the same! For instance, if we fit a `QuantileTransformer` using just the first 1000 rows of `diamonds`, and then `transform` the entire `diamonds` DataFrame, your `transform_to_quantile` method should still work.
- When initializing your `QuantileTransformer`, use `n_quantiles=100` and `random_state=98`. The `random_state` argument **is necessary** because `QuantileTransformer` is non-deterministic, meaning that it potentially outputs different results each time it's called on the same input. Read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html) for more details, and **don't forget this step!**
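
The fit-on-one-dataset, transform-another pattern can be sketched as follows. The arrays here are synthetic stand-ins (an assumption for illustration): `reference` plays the role of `self.data[['carat']]` and `new_values` plays the role of `df[['carat']]`.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the weights the transformer is fit on.
rng = np.random.default_rng(0)
reference = rng.uniform(0.2, 5.0, size=(1000, 1))

# Synthetic stand-in for the (possibly different) weights to transform.
new_values = np.array([[0.2], [1.0], [5.0]])

# Learn the quantiles from `reference` only, then apply them to `new_values`.
qt = QuantileTransformer(n_quantiles=100, random_state=98).fit(reference)
quantiles = qt.transform(new_values)

print(quantiles.ravel())
```

The returned values are proportions between 0 and 1, and larger weights never receive a smaller quantile than smaller weights.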

<br>

### 3. `transform_to_depth_pct` <div style="display:inline-block; vertical-align: middle; padding:7px 7px; font-size:10px; font-weight:light; color:white; background-color:#e84c4a; border-radius:7px; text-align:left;">3 Points</div> [Autograded 💻]

Recall from Question 4 that the "depth percentage" of a diamond is defined as:
$$\text{Depth Pct.} = 100\% \cdot \frac{2z}{x + y}$$
where $x$, $y$, and $z$ come from the `'x'`, `'y'`, and `'z'` columns in `diamonds`.

Let's suppose that for some reason we don't have access to the `'depth'` column in `diamonds`, and instead need to recreate it just by looking at the `'x'`, `'y'`, and `'z'` columns. 

This method takes in a DataFrame `df` like `diamonds` and returns an array consisting of the depth percentages of each diamond. Percentages should be between 0 and 100. The relevant transformer is `FunctionTransformer`.

Additional guidance:
- To use `FunctionTransformer`, you will need to define your own function that takes in a 2D array and returns a single array.
- Ignore division-by-zero errors or warnings, and leave `np.nan`s as is.
- To verify your work, compare your outputted array to the actual `'depth'` column in `diamonds`. Most – but not all – of the values should be the same.
- It may seem like `FunctionTransformer` is totally unnecessary, since we can compute depth percentages using broadcasting directly. However, as we will see in lecture, transformers can be **pipelined** with other processing steps which greatly simplifies our code.
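
A minimal sketch of the `FunctionTransformer` pattern, assuming a 2D array whose columns are `x`, `y`, and `z` in that order. The helper name `depth_pct` is hypothetical, and the two rows are copied from the first two rows of `diamonds` shown earlier.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def depth_pct(X):
    # Assumes the columns of X are x, y, z, in that order.
    return 100 * 2 * X[:, 2] / (X[:, 0] + X[:, 1])

# x, y, z values from the first two rows of diamonds.
xyz = np.array([[3.95, 3.98, 2.43],
                [3.89, 3.84, 2.31]])

# FunctionTransformer wraps a plain function in sklearn's transformer interface.
pct = FunctionTransformer(depth_pct).fit_transform(xyz)

print(pct)
```

For the first row this gives roughly 61.3, close to but not identical to the recorded `'depth'` of 61.5.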

The three test cells at the bottom of this section will test each method independently.

In [35]:
from sklearn.preprocessing import Binarizer, QuantileTransformer, FunctionTransformer

class TransformDiamonds(object):
    
    def __init__(self, diamonds):
        self.data = diamonds
        
    def transform_carat(self, df):
        transformer = Binarizer(threshold=1.0).fit(df[['carat']])
        return transformer.transform(df[['carat']]) 
          
    def transform_to_quantile(self, df):
        # Don't forget to use random_state=98!
        # Fit on self.data, then transform only the DataFrame that was passed in.
        transformer = QuantileTransformer(n_quantiles=100, random_state=98).fit(self.data[['carat']])
        return transformer.transform(df[['carat']])
        
    def transform_to_depth_pct(self, df):
        transformer = FunctionTransformer(lambda X: 100 * (2 * X[:, 2]) / (X[:, 0] + X[:, 1]))
        return transformer.transform(df[["x", "y", "z"]].values)
    

        

In [36]:
grader.check("q05_test_transform_carat")

In [37]:
grader.check("q05_transform_to_quantile")

In [38]:
grader.check("q05_transform_to_depth_pct")

## Finish Line 🏁

Congratulations! You're ready to submit Homework 9. **Remember, you need to submit Homework 9 twice**:

### To submit the manually graded problems (Questions 1, 2, and 3.4; marked [Written ✏️])

- Make sure your answers **are not** in this notebook, but rather in a separate PDF.
    - You can create this PDF either digitally, using your tablet or using [Overleaf + LaTeX](https://overleaf.com) (or some other sort of digital document), or by writing your answers on a piece of paper and scanning them in.
- Submit this separate PDF to the **Homework 9 (Questions 1, 2, and 3.4; written problems)** assignment on Gradescope, and **make sure to correctly select the pages associated with each question**!

### To submit the autograded problems (Questions 3-5; marked [Autograded 💻])

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope under "Homework 9 (Questions 3-5; autograded problems)". Make sure your notebook is still named `hw09.ipynb` and the name has not been changed.
5. Stick around while the Gradescope autograder grades your work.
6. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

Your Homework 9 submission time will be the **later** of your two individual submissions.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [39]:
grader.check_all()

q03_01 results: All test cases passed!

q03_02 results: All test cases passed!

q03_03 results: All test cases passed!

q04_01 results: All test cases passed!

q04_02 results: All test cases passed!

q04_03 results: All test cases passed!

q04_04 results: All test cases passed!

q05_test_transform_carat results: All test cases passed!

q05_transform_to_depth_pct results: All test cases passed!

q05_transform_to_quantile results: All test cases passed!