In [None]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Lecture 16

# Feature Engineering

### EECS 398: Practical Data Science, Winter 2025

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/wn25">github.com/practicaldsc/wn25</a> • 📣 See latest announcements [**here on Ed**](https://edstem.org/us/courses/69737/discussion/5943734) </small>
    
</div>

### Agenda 📆

- Recap: Multiple linear regression.
- Feature engineering.
- Numerical-to-numerical transformations 🧬.
- The modeling recipe, revisited.
- `StandardScaler` and standardized regression coefficients.

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Remember that you can always ask questions anonymously at the link above!

</div>

## Recap: Multiple linear regression

---

### The general problem

-  We have $n$ data points, $\left({ \vec x_1}, {y_1}\right), \left({ \vec x_2}, {y_2}\right),  \ldots, \left({ \vec x_n}, {y_n}\right)$,
where each $ \vec x_i$ is a feature vector of $d$ features:
$${\vec{x_i}} = \begin{bmatrix} 
{x^{(1)}_i} \\ {x^{(2)}_i} \\ \vdots \\ {x^{(d)}_i}
\end{bmatrix}$$	   

-  We want to find a good linear hypothesis function:

$$H({\vec x_i}) = w_0 + w_1 { x_i^{(1)}} + w_2 { x_i^{(2)}} + \ldots + w_d { x_i^{(d)}} = \vec w \cdot \text{Aug}({ \vec x_i})$$

- Specifically, we want to find the optimal parameters, $w_0^*$, $w_1^*$, ..., $w_d^*$, that minimize mean squared error:

$$\begin{align*} R_\text{sq}(\vec w) &= \frac{1}{n} \sum_{i = 1}^n (y_i - H(\vec x_i))^2 \\ &=
\frac{1}{n} \sum_{i = 1}^n \left( y_i - (w_0 + w_1 { x_i^{(1)}} + w_2 { x_i^{(2)}} + \ldots + w_d { x_i^{(d)}})\right)^2 
\\ &= \frac{1}{n} \sum_{i = 1}^n \left(y_i - \text{Aug}(\vec x_i) \cdot \vec{w} \right)^2 \\ &= \frac{1}{n} \lVert \vec y - X \vec w \rVert^2 \end{align*}$$

### The general solution

- Define the **design matrix** $ X \in \mathbb{R}^{n \times (d + 1)}$ and **observation vector** $\vec y \in \mathbb{R}^n$:

$${ X=  \begin{bmatrix}  
{1} & { x^{(1)}_1} & { x^{(2)}_1} & \dots & { x^{(d)}_1} \\\\
{ 1} & { x^{(1)}_2} & { x^{(2)}_2} & \dots & { x^{(d)}_2} \\\\
\vdots & \vdots & \vdots  &  & \vdots \\\\
{ 1} & { x^{(1)}_n} & { x^{(2)}_n} & \dots & { x^{(d)}_n}
\end{bmatrix} = \begin{bmatrix} 
       \text{Aug}({\vec{x_1}})^T \\\\
       \text{Aug}({\vec{x_2}})^T \\\\
       \vdots \\\\
       \text{Aug}({\vec{x_n}})^T
   \end{bmatrix}} \qquad { \vec y = \begin{bmatrix} { y_1} \\ { y_2} \\ \vdots \\ { y_n} \end{bmatrix}}$$

- Then, solve the **normal equations** to find the optimal parameter vector, $\vec{w}^*$:

$$X^TX \vec w^* = X^T \vec y$$

- The $\vec w^*$ that satisfies the equations above minimizes mean squared error, $R_\text{sq}(\vec w)$.

- If $X^TX$ is invertible, then:

$$\boxed{\vec w^* = (X^TX)^{-1}X^T \vec y}$$

- `sklearn` can compute $\vec w^*$ automatically, as we will see again shortly.
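
- As a minimal sketch of the steps above, the cell below builds a design matrix and solves the normal equations with `numpy`. The data is synthetic and the variable names (`features`, `w_star`, `mse`) are assumptions made for illustration; none of it comes from the lecture's datasets.

```python
import numpy as np

# Synthetic data (an assumption for illustration): n = 50 points, d = 2 features.
rng = np.random.default_rng(0)
n, d = 50, 2
features = rng.normal(size=(n, d))
y = 3 + 2 * features[:, 0] - 1 * features[:, 1] + rng.normal(scale=0.1, size=n)

# Design matrix X: a column of 1s followed by the d feature columns.
X = np.column_stack([np.ones(n), features])

# Solve the normal equations X^T X w = X^T y for w.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Mean squared error of the fitted hypothesis function.
mse = np.mean((y - X @ w_star) ** 2)
```

- Using `np.linalg.solve` avoids explicitly forming $(X^TX)^{-1}$, though it still requires $X^TX$ to be invertible.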

## Feature engineering ⚙️

---

### The goal of feature engineering

- **Feature engineering** is the act of finding **transformations** that turn raw data into effective **quantitative variables**.<br><small>Put simply: feature engineering is creating new features using existing features.</small>

- **Example**: One hot encoding.

<center><img src="imgs/one-hot.png" width=40%></center>

- **Example**: Numerical-to-numerical transformations.

<center><img src="imgs/quant-scale.png" width=40%></center>

### One hot encoding

- One hot encoding is a transformation that turns a categorical feature into several binary features.

<center><img src="imgs/one-hot.png" width=40%></center>

- Suppose a column has $N$ unique values, $A_1$, $A_2$, ..., $A_N$. For each unique value $A_i$, we define the following **feature function**:

$$\phi_i(x) = \left\{\begin{array}{ll}1 & {\rm if\ } x == A_i \\ 0 &  {\rm if\ } x\neq A_i \\ \end{array}\right. $$

- Note that 1 means "yes" and 0 means "no".

- One hot encoding is also called "dummy encoding", and $\phi_i(x)$ may also be referred to as an "indicator variable".
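
- The feature functions $\phi_i$ can be written out directly in code. This is only a sketch: `make_indicator` is a hypothetical helper and `days` is a made-up column, not the commute times data.

```python
import numpy as np

def make_indicator(a_i):
    """Return the feature function phi_i for the category value a_i."""
    def phi(x):
        return 1 if x == a_i else 0
    return phi

days = ['Mon', 'Tue', 'Mon', 'Fri', 'Wed']   # hypothetical categorical column
unique_values = sorted(set(days))            # ['Fri', 'Mon', 'Tue', 'Wed']

# One feature function per unique value.
phis = {a: make_indicator(a) for a in unique_values}

# Each row is the one hot encoding of one entry of `days`.
encoded = np.array([[phis[a](x) for a in unique_values] for x in days])
```

- Every row of `encoded` contains exactly one 1, in the column of the value that row takes on.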

Run the cells below to set up the next slide.

In [None]:
df = pd.read_csv('data/commute-times.csv')
df['day_of_month'] = pd.to_datetime(df['date']).dt.day
df.head()

### Example: One hot encoding `'day'`

- For each unique value of `'day'` in our dataset, we must create a column for just that `'day'`.

In [None]:
df.head()

In [None]:
df['day'].value_counts()

In [None]:
(df['day'] == 'Tue').astype(int) 

In [None]:
for val in df['day'].unique():
    df[f'day == {val}'] = (df['day'] == val).astype(int)

In [None]:
df.loc[:, df.columns.str.contains('day')] 

### Using `'day'` as a feature, along with `'departure_hour'` and `'day_of_month'`

- Now that we've converted `'day'` to a numerical variable, we can use it as input in a regression model. Here's the model we'll try to fit:

$$\begin{align*}\text{pred. commute time}_i = w_0 &+ w_1 \cdot \text{departure hour}_i \\ &+ w_2 \cdot \text{day of month}_i \\ &+ w_3 \cdot \text{day$_i$ == Mon} \\ 
&+ w_4 \cdot \text{day$_i$ == Tue} \\ 
&+ w_5 \cdot \text{day$_i$ == Wed} \\ 
&+ w_6 \cdot \text{day$_i$ == Thu} \end{align*}$$

- **Subtlety**: Since there are only 5 values of `'day'`, we don't need to include $\text{day}_i \text{ == Fri}$ as a feature. We know it's Friday if $\text{day}_i \text{ == Mon}$, $\text{day}_i \text{ == Tue}$, ... are all 0.<br><small>More on this next class!</small>
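
- To see why the fifth indicator column is redundant, the sketch below compares the rank of a design matrix that includes all five indicators with one that drops Friday. The `days` array is made up for illustration.

```python
import numpy as np

categories = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
days = np.array(categories * 4)   # hypothetical data: 20 rows, 4 per day

# One indicator column per category (all five).
all_five = np.column_stack([(days == c).astype(int) for c in categories])
ones = np.ones((len(days), 1))

X_full = np.hstack([ones, all_five])          # intercept + 5 indicators: 6 columns
X_drop = np.hstack([ones, all_five[:, :4]])   # intercept + 4 indicators: 5 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)
```

- Here `X_full` has 6 columns but rank 5, because the five indicator columns sum to the column of all 1s. `X_drop` has 5 columns and rank 5, so its columns are linearly independent.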

In [None]:
X_for_ohe = df[['departure_hour', 
                'day_of_month',
                'day == Mon',
                'day == Tue',
                'day == Wed',
                'day == Thu']]
X_for_ohe

In [None]:
from sklearn.linear_model import LinearRegression
model_with_ohe = LinearRegression()
model_with_ohe.fit(X=X_for_ohe, y=df['minutes'])

- The following cell gives us our $w^*$s:

In [None]:
model_with_ohe.intercept_, model_with_ohe.coef_

- Thus, our trained linear model to predict commute time given `'departure_hour'`, `'day_of_month'`, and `'day'` (Mon, Tue, Wed, or Thu) is:

$$\begin{align*}\text{pred. commute time}_i = 134 &- 8.42 \cdot \text{departure hour}_i \\ &- 0.03 \cdot \text{day of month}_i \\ 
&+ 5.09 \cdot \text{day$_i$ == Mon} \\ 
&+ 16.38 \cdot \text{day$_i$ == Tue} \\ 
&+ 5.12 \cdot \text{day$_i$ == Wed} \\ 
&+ 11.5 \cdot \text{day$_i$ == Thu} \end{align*}$$

### Visualizing our latest model

- Our trained linear model to predict commute time given `'departure_hour'`, `'day_of_month'`, and `'day'` (Mon, Tue, Wed, or Thu) is:

$$\begin{align*}\text{pred. commute time}_i = 134 &- 8.42 \cdot \text{departure hour}_i \\ &- 0.03 \cdot \text{day of month}_i \\ 
&+ 5.09 \cdot \text{day$_i$ == Mon} \\ 
&+ 16.38 \cdot \text{day$_i$ == Tue} \\ 
&+ 5.12 \cdot \text{day$_i$ == Wed} \\ 
&+ 11.5 \cdot \text{day$_i$ == Thu} \end{align*}$$

- Since we have 6 features here, we'd need 7 dimensions to graph our model.

- But, as we see in Homework 7, Question 5, our model is really a collection of **five parallel planes** in 3D, all with slightly different $z$-intercepts!

- If we want to visualize in 2D, we need to pick a single feature to place on the $x$-axis.

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['departure_hour'], y=df['minutes'], 
                         mode='markers', name='Original Data'))
fig.add_trace(go.Scatter(x=df['departure_hour'], y=model_with_ohe.predict(X_for_ohe), 
                         mode='markers', name='Predicted Commute Times using Departure Hour, <br>Day of Month, and Day of Week'))
fig.update_layout(showlegend=True, title='Commute Time vs. Departure Hour',
                  xaxis_title='Departure Hour', yaxis_title='Minutes', width=1000)

- Despite being a linear model, why **doesn't** this model **look** like a straight line?

Run the cells below to set up the next slide.

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
# Multiple linear regression model.
model_multiple = LinearRegression()
model_multiple.fit(X=df[['departure_hour', 'day_of_month']], y=df['minutes'])
mse_dict = {}
mse_dict['departure_hour + day_of_month'] = mean_squared_error(df['minutes'], model_multiple.predict(df[['departure_hour', 'day_of_month']]))

In [None]:
# Simple linear model.
model_simple = LinearRegression()
model_simple.fit(X=df[['departure_hour']], y=df['minutes'])
mse_dict['departure_hour'] = mean_squared_error(df['minutes'], model_simple.predict(df[['departure_hour']]))

In [None]:
# Constant model.
model_constant = df['minutes'].mean()
mse_dict['constant'] = mean_squared_error(df['minutes'], np.ones(df.shape[0]) * model_constant)

### Comparing our latest model to earlier models

- Let's see how the inclusion of the day of the week impacts the quality of our predictions.

In [None]:
mse_dict['departure_hour + day_of_month + ohe day'] = mean_squared_error(
    df['minutes'],
    model_with_ohe.predict(X_for_ohe)
)
pd.Series(mse_dict).plot(kind='barh', title='Mean Squared Error')

- Adding the day of the week decreased our MSE **significantly**!

### Reflection

In [None]:
df.head()

- We've one hot encoded `'day'`, but it required a `for`-loop.

- Is there a way we could have encoded it without a `for`-loop?

- Yes, using `sklearn.preprocessing`'s `OneHotEncoder`. More on this soon!
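
- As an aside, `pandas` also offers a loop-free option, `pd.get_dummies`. This is not the `OneHotEncoder` approach the course uses; it is a sketch on a made-up DataFrame, `df_demo`, with the `prefix` and `prefix_sep` arguments chosen to mimic the column names created above.

```python
import pandas as pd

df_demo = pd.DataFrame({'day': ['Mon', 'Tue', 'Mon', 'Fri', 'Wed']})  # hypothetical data

# One 0/1 column per unique value of 'day', with no explicit loop.
dummies = pd.get_dummies(df_demo['day'], prefix='day', prefix_sep=' == ', dtype=int)
```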

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
Remember that you can always ask questions anonymously at the link above!

</div>

## Numerical-to-numerical transformations 🧬

---

### Example: Horsepower 🚗

- The following dataset, built into the `seaborn` plotting library, contains various information about (older) cars.

In [None]:
import seaborn as sns
mpg = sns.load_dataset('mpg').dropna()
mpg.head()

- We really do mean old:

In [None]:
mpg['model_year'].value_counts()

- Let's investigate the relationship between `'horsepower'` and `'mpg'`.

### The relationship between `'horsepower'` and `'mpg'`

In [None]:
px.scatter(mpg, x='horsepower', y='mpg')

- It appears that there is a negative association between `'horsepower'` and `'mpg'`, though it's not quite linear.

- Let's try and fit a simple linear model that uses `'horsepower'` to predict `'mpg'` and see what happens.

### Predicting `'mpg'` using `'horsepower'`

In [None]:
car_model = LinearRegression()
car_model.fit(mpg[['horsepower']], mpg['mpg'])

- What do our predictions look like?

In [None]:
hp_points = pd.DataFrame({'horsepower': [25, 225]})
fig = px.scatter(mpg, x='horsepower', y='mpg')
fig.add_trace(go.Scatter(
    x=hp_points['horsepower'],
    y=car_model.predict(hp_points),
    mode='lines',
    name='Predicted MPG using Horsepower'
))

- Our regression line doesn't capture the curvature in the relationship between `'horsepower'` and `'mpg'`.

In [None]:
# As a baseline:
mean_squared_error(mpg['mpg'], car_model.predict(mpg[['horsepower']]))

### Linear in the parameters

-  **Using linear regression**, we can fit hypothesis functions like:
    $$
    H(x_i) = w_0 + w_1x_i+w_2x_i^2
    \qquad \qquad 
    H(\vec x_i) = w_1e^{-\left(x_i^{(1)}\right)^2} + w_2 \cos(x_i^{(2)}+\pi) +w_3 \frac{\log \left(2x_i^{(3)}\right)}{x_i^{(2)}}
    $$
    
    <br>
    <small>This includes all polynomials, for example. These are all <b>linear combinations of (just) features</b>.</small>

- For any of the above examples, we **could** express our model as $\vec w \cdot \text{Aug} (\vec x_i)$, for some carefully chosen feature vector $\vec x_i$, <br>and that's all that `LinearRegression` in `sklearn` needs.<br><small>What we put in the `X` argument to `model.fit` is up to us!</small>

-  Using linear regression, we **can't** fit hypothesis functions like:
    $$
    H(x_i) = w_0 + e^{w_1 x_i}
    \qquad \qquad 
    H(\vec x_i) = w_0 + \sin (w_1 x_i^{(1)} + w_2 x_i^{(2)})
    $$
    <br><small>These are <b>not</b> linear combinations of just features.</small>

-  We can have any number of parameters, as long as our hypothesis function is **linear in the parameters**, or linear when we think of it as a function of the parameters.

$$H(\vec x_i) = w_0 + w_1 f_1(\vec x_i) + w_2 f_2(\vec x_i) + ... + w_d f_d(\vec x_i)$$
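
- For instance, to fit $H(x_i) = w_0 + w_1 x_i + w_2 x_i^2$, we can take $f_1(x) = x$ and $f_2(x) = x^2$ and run ordinary least squares on the resulting design matrix. The sketch below does this on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data generated from a quadratic plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)

# Design matrix with columns 1, f_1(x) = x, and f_2(x) = x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least squares fit; the model is nonlinear in x but linear in w.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ w_star
```

- The predictions trace out a parabola, yet the fitting procedure is exactly the one used for a straight line; only the columns of `X` changed.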

### Linearization

- The [Tukey Mosteller Bulge Diagram](https://sites.stat.washington.edu/pds/stat423/Documents/LectureNotes/notes.423.ch4.pdf) helps us pick which **numerical-to-numerical** transformations to apply to data in order to **linearize** it.<br><small>Alternative interpretation: it helps us determine which features to create.</small>

<center><img src="imgs/bulge.png" width=400></center>

- **Why**? We're working with linear models. The more linear our data looks in terms of its features, the better we'll be able to model the data.

In [None]:
fig

- Here, the bottom-left quadrant appears to match the shape of the scatter plot between `'horsepower'` and `'mpg'` the best – let's try taking the `log` of `'horsepower'` (the $x$ variable).

In [None]:
mpg['log hp'] = np.log(mpg['horsepower'])

- What does our data look like now?

In [None]:
px.scatter(mpg, x='log hp', y='mpg')

### Predicting `'mpg'` using `log('horsepower')`

- Let's fit another linear model.

In [None]:
car_model_log = LinearRegression()
car_model_log.fit(mpg[['log hp']], mpg['mpg'])

- Note that implicitly, we defined the following design matrix:

$$X = \begin{bmatrix} 1 & \log(x_1) \\ 1 & \log(x_2) \\ \vdots & \vdots \\ 1 & \log(x_n) \end{bmatrix}$$

- What do our predictions look like now?

In [None]:
log_hp_points = pd.DataFrame({'log hp': [3.7, 5.5]})
fig = px.scatter(mpg, x='log hp', y='mpg')
fig.add_trace(go.Scatter(
    x=log_hp_points['log hp'],
    y=car_model_log.predict(log_hp_points),
    mode='lines',
    name='Predicted MPG using log(Horsepower)'
))

- The fit looks a bit better! How about the MSE?

In [None]:
# Using log hp:
mean_squared_error(mpg['mpg'], car_model_log.predict(mpg[['log hp']]))

In [None]:
# Using hp, from before:
mean_squared_error(mpg['mpg'], car_model.predict(mpg[['horsepower']]))

- Also a bit better!

- What do our predictions look like on the original, non-transformed scatter plot? Let's see:

In [None]:
fig = px.scatter(mpg, x='horsepower', y='mpg')
fig.add_trace(
    go.Scatter(
        x=mpg['horsepower'], 
        y=car_model_log.intercept_ + car_model_log.coef_[0] * np.log(mpg['horsepower']),  
        mode='markers', name='Predicted MPG using log(Horsepower)'
    )
)
fig

- Our predictions that used $\log(\text{Horsepower})$ as an input don't fall on a straight line. We shouldn't expect them to; the orange dots come from:

$$\text{predicted MPG}_i = 108.70 - 18.582 \cdot \log(\text{Horsepower}_i)$$

In [None]:
car_model_log.intercept_, car_model_log.coef_

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>

Which hypothesis function is **not** linear in the parameters?

- A. $H(\vec{x}_i) = w_1 (x_i^{(1)} x_i^{(2)}) + \frac{w_2}{x_i^{(1)}} \sin \left( x_i^{(2)} \right)$
- B. $H(\vec{x}_i) = 2^{w_1} x_i^{(1)}$
- C. $H(\vec{x}_i) = \vec{w} \cdot \text{Aug}(\vec{x}_i)$
- D. $H(\vec{x}_i) = w_1 \cos (x_i^{(1)}) + w_2 2^{x_i^{(2)} \log x_i^{(3)}}$
- E. More than one of the above.

</div>

### How do we fit hypothesis functions that aren't linear in the parameters?

-  Suppose we want to fit the hypothesis function:

$$H(x_i) = w_0 e^{w_1 x_i}$$

- This is **not** linear in terms of $w_0$ and $w_1$, so our results for linear regression don't apply.

-  **Possible solution**: Try to transform the above equation so that it **is** linear in some other parameters, by applying an operation to both sides.

- See the attached Reference Slide for more details.

<div class="alert alert-danger">

#### Reference Slide

### Transformations
    
</div>

$$H(x_i) = w_0 e^{w_1 x_i}$$

- Suppose we take the $\log$ of both sides of the equation.

$$\log H(x_i) = \log (w_0 e^{w_1x_i})$$

- Then, using properties of logarithms, we have:

$$\log H(x_i) = \underbrace{\log(w_0)}_{\text{this is just a constant!}} + w_1 x_i$$

- **Solution**: Create a new hypothesis function, $T(x_i)$, with parameters $b_0$ and $b_1$, where $T(x_i) = b_0 + b_1 x_i$.

-  This hypothesis function is related to $H(x_i)$ by the relationship $T(x_i) = \log H(x_i)$.

-  $\vec{b}$ is related to $\vec{w}$ by $b_0 = \log w_0$ and $b_1 = w_1$.

-  Our new observation vector, $\vec{z}$, is $\begin{bmatrix} \log y_1 \\ \log y_2 \\ \vdots \\ \log y_n \end{bmatrix}$.

-  $T(x_i) = b_0 + b_1x_i$ is linear in its parameters, $b_0$ and $b_1$.

-  Use the solution to the normal equations to find $\vec{b}^*$, and the relationship between $\vec{b}$ and $\vec{w}$ to find $\vec{w}^*$.
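
- The steps above can be sketched in code as follows. The data is synthetic (generated from $w_0 = 2$, $w_1 = 0.7$ with multiplicative noise so that every $y_i$ is positive), and all variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data from H(x) = w0 * exp(w1 * x), with multiplicative noise.
rng = np.random.default_rng(2)
x = np.linspace(0, 4, 30)
true_w0, true_w1 = 2.0, 0.7
y = true_w0 * np.exp(true_w1 * x) * np.exp(rng.normal(scale=0.05, size=x.size))

# New observation vector z_i = log(y_i); design matrix for T(x) = b0 + b1 * x.
z = np.log(y)
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations for b*.
b_star = np.linalg.solve(X.T @ X, X.T @ z)

# Undo the transformation: b0 = log(w0) and b1 = w1.
w0_star = np.exp(b_star[0])
w1_star = b_star[1]
```

- One caveat: this minimizes squared error between $T(x_i)$ and $\log y_i$, which is not necessarily the same as minimizing squared error between $H(x_i)$ and $y_i$.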

## The modeling recipe, revisited

---

### The original modeling recipe, from Lecture 11

1. Choose a model.

2. Choose a loss function.

3. Minimize average loss (empirical risk) to find optimal model parameters, $\vec w^*$.

### The updated modeling recipe

0. Create, or engineer, features to best reflect the "meaning" behind data.<br><small>Recently, we've done this with one hot encoding and numerical-to-numerical transformations.</small>

1. Choose a model.<br><small>Recently, we've used the simple/multiple linear regression model.</small>

2. Choose a loss function.<br><small>Recently, we've mostly used squared loss.</small>

3. Minimize average loss (empirical risk) to find optimal model parameters, $\vec{w}^*$.<br><small>Originally, we had to use calculus or linear algebra to minimize empirical risk, but more recently we've just used `model.fit`. This step is also called **fitting the model to the data**.</small>

4. Evaluate the performance of the model in relation to other models.

- **We can do all of the above directly in `sklearn`!**

### `preprocessing` and `linear_model`s

- For the **feature engineering** step of the modeling pipeline, we will use `sklearn`'s [`preprocessing`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module.

<br><br><br><br>

- For the **model creation** step of the modeling pipeline, we will use `sklearn`'s [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module, as we've already seen. `linear_model.LinearRegression` is an example of an **estimator** class.

<br><br><br><br>

### Transformer classes

- **Transformers** take in "raw" data and output "processed" data. They are used for **creating features**.<br><small>These are not directly related to "transformers" in large language models and neural networks.</small>

- Transformers, like most relevant features of `sklearn`, are **classes**, not functions, meaning you need to instantiate them and call their methods.

- Today, we'll introduce one transformer class, `StandardScaler`. We'll look at how to write code to use it, and also discuss some of the underlying statistical nuances.

- Next class, we'll learn about another transformer class, `OneHotEncoder`, and we'll see how to chain transformers and estimators together into larger **Pipelines**.

## `StandardScaler` and standardized regression coefficients

---

### Example: Predicting sales 📈

- To illustrate our first transformer class, we'll introduce a new dataset.

In [None]:
sales = pd.read_csv('data/sales.csv')
sales.head()

- For each of 26 stores, we have:
    -  net sales, 
    -  square feet, 
    -  inventory,
    -  advertising expenditure, 
    -  district size, and
    -  number of competing stores.

- Our goal is to predict `'net_sales'` as a function of other features.

### An initial model

In [None]:
sales.head()

- No transformations are _needed_ to predict `'net_sales'`.

In [None]:
sales_model = LinearRegression()
sales_model.fit(X=sales.iloc[:, 1:], y=sales.iloc[:, 0])

- Suppose we're interested in learning **how** the various features impact `'net_sales'`, rather than just predicting `'net_sales'` for a new store. We'd then look at the coefficients.

In [None]:
sales_model.coef_

In [None]:
coefs = pd.DataFrame().assign(
    column=sales.columns[1:],
    original_coef=sales_model.coef_,
).set_index('column')
coefs.plot(kind='barh', title='Original Coefficients')

- What do you notice?

In [None]:
sales.iloc[:, 1:]

### Thought experiment

- Consider the white point in the scatter plot below.

<center><img src="imgs/std-example.png" width=600></center>

- Which class is it more "similar" to – <b><span style="color:blue">blue</span></b> or <b><span style="color:orange">orange</span></b>?

- Intuitively, the answer may be <b><span style="color:blue">blue</span></b>, but take a close look at the scale of the axes!<br>The <b><span style="color:orange">orange</span></b> point is much closer to the white point than the <b><span style="color:blue">blue</span></b> points are.

### Standardization

- When we standardize two or more features, we bring them to the **same scale**.

-  Recall: to standardize a feature $x_1, x_2, ..., x_n$, we use the formula:
    $$z(x_i) = \frac{x_i - \bar{x}}{\sigma_x}$$

-  Example: 1, 7, 7, 9.
    -  Mean: $\frac{1 + 7 + 7 + 9}{4} = \frac{24}{4} = 6$.
    -  Standard deviation:

    $$\text{SD} = \sqrt{\frac{1}{4} \left( (1-6)^2 + (7-6)^2 + (7-6)^2 + (9-6)^2 \right)} = \sqrt{\frac{1}{4} \cdot 36} = 3$$
    -  Standardized data: 

    $$1 \mapsto \frac{1-6}{3} = \boxed{-\frac{5}{3}} \qquad 7 \mapsto \frac{7-6}{3} = \boxed{\frac{1}{3}} \qquad 7 \mapsto \boxed{\frac{1}{3}} \qquad 9 \mapsto \frac{9-6}{3} = \boxed{1}$$
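
- The example above can be checked with `numpy`. Note that `np.std` divides by $n$ by default (`ddof=0`), matching the formula used here.

```python
import numpy as np

x = np.array([1, 7, 7, 9])

mean = x.mean()
sd = x.std()            # population SD: divides by n, not n - 1
z = (x - mean) / sd     # standardized values
```

- After standardizing, `z` has mean 0 and SD 1, which is true of any standardized feature.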


### Pre- and post- standardization

- **Before** we standardize both axes:

<center><img src="imgs/std-example.png" width=500></center>

- **After** we standardize both axes:

<center><img src="imgs/std-example-2.png" width=500></center>

- After we standardize, our features are measured on the same scales, so **they can be compared directly**.

### Which features are most "important"?

-  The most important feature is **not necessarily** the feature whose coefficient has the largest magnitude, because different features may be on different scales.

In [None]:
coefs.plot(kind='barh', title='Original Coefficients')

In [None]:
sales.iloc[:, 1:]

- Here, `'inventory'` values are much larger than `'sq_ft'` values, which means that the coefficient for `'inventory'` will inherently be smaller, even if `'inventory'` is a more important feature than `'sq_ft'`.<br><small>Intuition: if the values themselves are larger, you need to multiply them by smaller coefficients to get the same predictions!</small>

- **Solution**: If you care about the interpretability of the resulting coefficients, **standardize** each feature before performing regression.
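
- The intuition about scale can be checked on synthetic data: if we express the same feature in a unit that makes its values 1000 times larger, its coefficient becomes 1000 times smaller and the predictions stay the same. The data and the helper `fit` below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x_small = rng.normal(size=n)    # a feature in its original units
x_large = 1000 * x_small        # the same feature, in units that make values 1000x larger
y = 5 + 4 * x_small + rng.normal(scale=0.5, size=n)

def fit(feature, y):
    """Least squares fit of y on an intercept and a single feature."""
    X = np.column_stack([np.ones(len(feature)), feature])
    return np.linalg.lstsq(X, y, rcond=None)[0]

w_small = fit(x_small, y)   # [intercept, slope] using the original units
w_large = fit(x_large, y)   # [intercept, slope] using the rescaled feature
```

- The slope in `w_large` is the slope in `w_small` divided by 1000, even though the feature is exactly as informative in both cases.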

### Example transformer: `StandardScaler`

- `StandardScaler` **standardizes** data using the mean and standard deviation of the data.

$$z(x_i) = \frac{x_i - \bar{x}}{\sigma_x}$$

- First, we need to import the relevant class from `sklearn.preprocessing`.<br><small>It's best practice to import just the relevant classes you need from `sklearn`.</small>

In [None]:
from sklearn.preprocessing import StandardScaler

- Like an estimator, we need to instantiate **and fit** our `StandardScaler` instance before it can transform anything.<br><small>Here, "fitting" the transformer involves computing and saving the mean and SD of each column.</small>

In [None]:
stdscaler = StandardScaler()

In [None]:
# Doesn't work! Need to fit first.
stdscaler.transform(sales.iloc[:, 1:])

In [None]:
# This is like saying "determine the mean and SD of each column in sales, 
# other than the 'net_sales' column".
stdscaler.fit(sales.iloc[:, 1:])

- Now, we can standardize any dataset, using the mean and standard deviation of the columns in `sales.iloc[:, 1:]`. Typical usage is to fit a transformer on a sample and use that already-fit transformer to transform future data.

In [None]:
stdscaler.transform([[5, 300, 10, 15, 6]])

In [None]:
stdscaler.transform(sales.iloc[:, 1:].tail(5))

- We can peek under the hood and see what it computed!

In [None]:
stdscaler.mean_

In [None]:
# These are variances; the SDs are np.sqrt(stdscaler.var_), also stored in stdscaler.scale_.
stdscaler.var_

- If needed, the `fit_transform` method will fit the transformer and then transform the data in one go.

In [None]:
new_scaler = StandardScaler()

In [None]:
new_scaler.fit_transform(sales.iloc[:, 1:].tail(5))

- Why are the values above different from the values in `stdscaler.transform(sales.iloc[:, 1:].tail(5))`?

### Interpreting standardized regression coefficients

- Now that we have a technique for standardizing the feature columns of `sales`, let's fit a new regression object.

In [None]:
sales_model_std = LinearRegression()
sales_model_std.fit(X=stdscaler.transform(sales.iloc[:, 1:]),
                    y=sales.iloc[:, 0])

- Let's now look at the resulting coefficients, and compare them to the coefficients before we standardized.

In [None]:
pd.DataFrame().assign(
    column=sales.columns[1:],
    original_coef=sales_model.coef_,
    standardized_coef=sales_model_std.coef_
).set_index('column').plot(kind='barh', barmode='group', title='Standardized and Original Coefficients')

- Did the performance of the resulting model change?

In [None]:
mean_squared_error(sales.iloc[:, 0],
                   sales_model.predict(sales.iloc[:, 1:]))

In [None]:
mean_squared_error(sales.iloc[:, 0],
                   sales_model_std.predict(stdscaler.transform(sales.iloc[:, 1:])))

- **No!**<br><small>The column space (the span of the columns) of the design matrix did not change, so the predictions did not change. It's just the coefficients that changed.</small>

### Key takeaways

-  The result of standardizing each feature (separately!) is that all of the features are on the same scale.
    -  There's no need to standardize the outcome (`'net_sales'` here), since it's not being compared to anything.
    - Also, we can't standardize the column of all 1s.

-  Then, solve the normal equations. The resulting $w_0^*, w_1^*, \ldots, w_d^*$ are called the **standardized regression coefficients**.


-  Standardized regression coefficients can be directly compared to one another.

- As we saw on the previous slide, standardizing each feature **does not** change the MSE of the resulting hypothesis function!
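
- These takeaways can be verified on synthetic data. In the sketch below (made-up features on very different scales, hypothetical helper `fit_predict`), the MSE is identical before and after standardizing, and each standardized coefficient equals the original coefficient multiplied by that feature's SD.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Two hypothetical features on very different scales.
features = np.column_stack([rng.normal(10, 2, n), rng.normal(5000, 800, n)])
y = 1 + 3 * features[:, 0] + 0.02 * features[:, 1] + rng.normal(scale=1.0, size=n)

def fit_predict(F, y):
    """Least squares fit with an intercept; returns (w, predictions)."""
    X = np.column_stack([np.ones(len(F)), F])
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return w, X @ w

# Standardize each feature column separately.
sd = features.std(axis=0)
standardized = (features - features.mean(axis=0)) / sd

w_orig, pred_orig = fit_predict(features, y)
w_std, pred_std = fit_predict(standardized, y)

mse_orig = np.mean((y - pred_orig) ** 2)
mse_std = np.mean((y - pred_std) ** 2)
```

- The data is set up so that the second feature has the smaller original coefficient but the larger standardized coefficient, which is the situation the `'inventory'` discussion warns about.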

### `StandardScaler` summary

|Property|Example|Description|
|---|---|---|
|Initialize with parameters| `stdscaler = StandardScaler()` | z-score the data (no parameters) |
|Fit the transformer| `stdscaler.fit(X)` | Compute the mean and SD of `X`|
|Transform data in a dataset | `feat = stdscaler.transform(X_new)` | z-score `X_new` with mean and SD of `X`|
|Fit and transform| `stdscaler.fit_transform(X)` | Compute the mean and SD of `X`, then z-score `X`|
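
- Here is the workflow in the table as one self-contained sketch. `X_train` and `X_new` are made-up arrays, not the `sales` data; the first column reuses the 1, 7, 7, 9 example from earlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 4 rows, 2 feature columns.
X_train = np.array([[1.0, 100.0],
                    [7.0, 300.0],
                    [7.0, 300.0],
                    [9.0, 500.0]])
X_new = np.array([[6.0, 300.0]])   # "future" data, not used for fitting

scaler = StandardScaler()
scaler.fit(X_train)                   # computes and stores each column's mean and variance

Z_train = scaler.transform(X_train)   # z-scores using the stored mean and SD
Z_new = scaler.transform(X_new)       # new data is z-scored with X_train's mean and SD

# Fitting and transforming in one step gives the same result as fit, then transform.
Z_train_again = StandardScaler().fit_transform(X_train)
```

- Since `X_new` happens to equal the column means of `X_train`, it is transformed to a row of zeros.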

### What's next?

- How does `OneHotEncoder` work?

- Even though we have a `StandardScaler` transformer object, to actually standardize our features AND make predictions, we need to:
    - Manually instantiate a `StandardScaler` object, and then `fit` it.
    - Create a new design matrix by taking the result of calling `transform` on the `StandardScaler` object and concatenating other relevant numerical columns.
    - Manually instantiate a `LinearRegression` object, and then `fit` it using the result of the above step.

- As we build more and more sophisticated models, it will be challenging to keep track of all of these individual steps ourselves.

- As such, we often build **Pipelines**.
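
- As a brief preview, and only as a sketch, `sklearn`'s `make_pipeline` can chain a `StandardScaler` and a `LinearRegression` so that a single `fit` call performs both steps. The data here is synthetic, and the details are left for next class.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two hypothetical features on very different scales.
rng = np.random.default_rng(5)
X = np.column_stack([rng.normal(10, 2, 60), rng.normal(5000, 800, 60)])
y = 1 + 3 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(size=60)

# Standardize, then regress, as one object.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)
predictions = pipe.predict(X)

# make_pipeline names each step after its lowercased class name.
standardized_coefs = pipe.named_steps['linearregression'].coef_
```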