# Practical Lab: Understanding Feature Scaling and Optimizing Learning Rates

When working with multiple features in machine learning, two critical factors affect training performance: the learning rate and feature scaling. This lab explores these concepts hands-on.

## Learning Objectives

By completing this lab, you will:
- Apply gradient descent to datasets with multiple input features
- Experiment with different learning rates and observe their effects on convergence
- Understand why feature scaling matters for optimization
- Implement z-score normalization to improve training efficiency
- Build on the multi-variable regression functions from previous work

## Setup and Dependencies

We'll use NumPy for numerical computations and matplotlib for visualization. The lab also includes helper functions for loading data and running gradient descent.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import load_house_data, run_gradient_descent
from lab_utils_multi import norm_plot, plt_equal_scale, plot_cost_i_w
from lab_utils_common import dlc
np.set_printoptions(precision=2)
plt.style.use('./deeplearning.mplstyle')

## Mathematical Notation Reference

Throughout this notebook, we use the following notation:

|Symbol | Meaning | Python Variable |
|:------|:--------|:----------------|
| $a$ | scalar value (not bold) | |
| $\mathbf{a}$ | vector (bold lowercase) | |
| $\mathbf{A}$ | matrix (bold uppercase) | |
| **For Regression Problems:** | | |
|  $\mathbf{X}$ | matrix of training examples | `X_train` |   
|  $\mathbf{y}$  | vector of target values | `y_train` |
|  $\mathbf{x}^{(i)}$, $y^{(i)}$ | $i^{th}$ training example and its target | `X[i]`, `y[i]`|
| $m$ | total number of training examples | `m`|
| $n$ | number of input features | `n`|
|  $\mathbf{w}$  |  weight parameters (one per feature) | `w` |
|  $b$ |  bias parameter | `b` |     
| $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ | Model prediction: $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)}+b$ | `f_wb` | 
|$\frac{\partial J(\mathbf{w},b)}{\partial w_j}$| Partial derivative of cost with respect to weight $w_j$ |`dj_dw[j]`| 
|$\frac{\partial J(\mathbf{w},b)}{\partial b}$| Partial derivative of cost with respect to bias $b$| `dj_db`|

# The Housing Price Prediction Challenge

Let's tackle a practical problem: predicting house prices based on multiple characteristics. Our dataset includes houses with four features:
- Size in square feet
- Number of bedrooms
- Number of floors
- Age in years

Our goal is to build a linear regression model that can predict the price of any house. For instance, if someone asks "What's a fair price for a 1200 sqft house with 3 bedrooms, 1 floor, and 40 years old?", our model should provide an estimate.

Note: Unlike earlier exercises that used 1000s of sqft, this dataset uses actual square footage.

## Sample Data

Here's a glimpse of our training data:

| Size (sqft) | Bedrooms  | Floors | Age (years) | Price ($1000s) |   
| ----------- | --------- | ------ | ----------- | -------------- |  
| 952         | 2         | 1      | 65          | 271.5          |  
| 1244        | 3         | 2      | 64          | 232            |  
| 1947        | 3         | 2      | 17          | 509.8          |  
| ...         | ...       | ...    | ...         | ...            |


In [None]:
# load the dataset
X_train, y_train = load_house_data()
X_features = ['size(sqft)','bedrooms','floors','age']

Let's visualize our data to understand which features might be most predictive of price.

In [None]:
fig,ax=plt.subplots(1, 4, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("Price (1000's)")
plt.show()

From the scatter plots above, we can see different patterns:
- **Size**: Clear positive correlation - larger homes cost more
- **Bedrooms & Floors**: Weaker relationship with price
- **Age**: Newer homes (lower age values) tend to have higher prices
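To put rough numbers on these visual impressions, one option is the Pearson correlation between each feature and the price. The sketch below is only an illustration and not part of the lab's helper code: `feature_target_correlations` is a hypothetical name, and the small arrays are made-up values rather than the housing data.

```python
import numpy as np

def feature_target_correlations(X, y):
    """Pearson correlation of each column of X with y (hypothetical helper)."""
    # np.corrcoef on two 1-D arrays returns a 2x2 matrix; entry [0, 1] is the correlation
    return np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# made-up toy data: column 0 rises with y, column 1 falls with y
X_toy = np.array([[1.0, 9.0], [2.0, 7.0], [3.0, 5.0], [4.0, 3.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])
corr_toy = feature_target_correlations(X_toy, y_toy)
print(corr_toy)
```

In the notebook, calling the same function on `X_train` and `y_train` would return one value per entry of `X_features`. Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one.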

<a name="toc_15456_5"></a>
## Multi-Feature Gradient Descent

Gradient descent for multiple variables follows this update rule (iterating until convergence):

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline\;
& w_j := w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1}  \; & \text{for j = 0..n-1}\newline
&b\ \ := b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}  \newline \rbrace
\end{align*}$$

where $n$ represents the feature count, and all parameters ($w_j$ and $b$) update simultaneously.

The partial derivatives are computed as:

$$
\begin{align}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{2}  \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3}
\end{align}
$$

Where:
* $m$ is the number of training examples
    
*  $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ represents our model's prediction, while $y^{(i)}$ is the actual target
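Equations (1) to (3) translate fairly directly into NumPy. The following is a minimal sketch under stated assumptions, not the implementation behind `run_gradient_descent`: the function names `compute_gradient` and `gradient_descent` and the tiny dataset are chosen here for illustration.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Equations (2) and (3): gradient of the cost with respect to w and b."""
    m = X.shape[0]
    err = X @ w + b - y          # f_wb(x^(i)) - y^(i) for every example, shape (m,)
    dj_dw = (X.T @ err) / m      # equation (2), shape (n,)
    dj_db = np.sum(err) / m      # equation (3), scalar
    return dj_dw, dj_db

def gradient_descent(X, y, alpha, num_iters):
    """Equation (1): repeated simultaneous updates of w and b, starting from zeros."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)  # both gradients use the old w and b
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# made-up data generated from y = 2*x0 + 3*x1 + 1
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([1.0, 3.0, 4.0, 6.0])
w_demo, b_demo = gradient_descent(X_demo, y_demo, alpha=0.1, num_iters=5000)
print(w_demo, b_demo)
```

On this toy data the loop recovers weights close to 2 and 3 and a bias close to 1. Computing both gradients before changing either `w` or `b` is what makes the update simultaneous.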


## The Learning Rate: Finding the Sweet Spot
<figure>
    <img src="./images/C1_W2_Lab06_learningrate.PNG" style="width:1200px;" >
</figure>

The learning rate $\alpha$ is a hyperparameter that controls how big a step we take during each iteration of gradient descent. It's shared across all parameters and can dramatically affect training:

- **Too large**: The algorithm overshoots and may diverge
- **Too small**: Training is slow but stable
- **Just right**: Efficient convergence to the optimal solution
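These regimes can be reproduced on a one-parameter toy problem before moving to the housing data. The sketch below assumes the cost $J(w) = w^2$ (gradient $2w$, minimum at $w = 0$); the function name and the three $\alpha$ values are chosen for this toy problem only and are unrelated to the rates used on the housing data.

```python
def run_toy_descent(alpha, num_iters=10, w_start=1.0):
    """Gradient descent on J(w) = w**2, whose gradient is 2*w."""
    w = w_start
    w_history = [w]
    for _ in range(num_iters):
        w = w - alpha * 2 * w   # same update rule as equation (1), with one parameter
        w_history.append(w)
    return w_history

too_large = run_toy_descent(alpha=1.1)    # each step multiplies w by (1 - 2.2) = -1.2
oscillating = run_toy_descent(alpha=0.9)  # each step multiplies w by -0.8
smooth = run_toy_descent(alpha=0.1)       # each step multiplies w by 0.8

for name, w_history in [("too_large", too_large),
                        ("oscillating", oscillating),
                        ("smooth", smooth)]:
    print(f"{name:12s} final |w| = {abs(w_history[-1]):.4f}")
```

With $\alpha = 1.1$ the distance from the minimum grows every step, with $\alpha = 0.9$ the parameter jumps across the minimum but still closes in on it, and with $\alpha = 0.1$ it approaches from one side without crossing.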

Let's experiment with different learning rates on our housing dataset to see these effects in action.

### Experiment 1: $\alpha$ = 9.9e-7 (Too High)

In [None]:
#set alpha to 9.9e-7
_, _, hist = run_gradient_descent(X_train, y_train, 10, alpha = 9.9e-7)

Notice that the cost is *increasing* instead of decreasing! This is a clear sign that our learning rate is too large - the algorithm is overshooting the minimum and diverging rather than converging.

In [None]:
plot_cost_i_w(X_train, y_train, hist)

The visualization shows what's happening: the parameter $w_0$ (right plot) oscillates wildly, overshooting the optimal value with each iteration. This causes the cost (left plot) to increase rather than decrease toward the minimum.

Note: Although all four weights and the bias are updated simultaneously, this visualization shows only $w_0$ to illustrate the concept. Because of this simplification, the blue and orange lines may appear slightly offset.


### Experiment 2: $\alpha$ = 9e-7 (Better)

Let's reduce the learning rate slightly and observe the improvement.

In [None]:
#set alpha to 9e-7
_,_,hist = run_gradient_descent(X_train, y_train, 10, alpha = 9e-7)

Good news! The cost is now consistently decreasing throughout the training, indicating our learning rate is in a workable range.

In [None]:
plot_cost_i_w(X_train, y_train, hist)

The left plot confirms cost is decreasing properly. The right plot shows $w_0$ still oscillates around the optimal value, but now the cost decreases with each step rather than increasing. Notice how `dj_dw[0]` alternates sign as `w[0]` jumps across the minimum.

With this learning rate, gradient descent will eventually converge, though the oscillation makes it inefficient. Try increasing the iteration count to see the full convergence behavior.

### Experiment 3: $\alpha$ = 1e-7 (Smaller, No Oscillation)

What happens if we reduce the learning rate even further?

In [None]:
#set alpha to 1e-7
_,_,hist = run_gradient_descent(X_train, y_train, 10, alpha = 1e-7)

Excellent! The cost continues to decrease steadily, confirming $\alpha$ is in a safe range.

In [None]:
plot_cost_i_w(X_train,y_train,hist)

The left plot shows steady cost reduction. The right plot shows $w_0$ approaching the minimum from one side without crossing it: `dj_dw[0]` stays negative throughout, so every step moves $w_0$ in the same direction. With this learning rate, gradient descent converges reliably, although each step is smaller than in Experiment 2.

## Why Feature Scaling Matters
<figure>
    <img src="./images/C1_W2_Lab06_featurescalingheader.PNG" style="width:1200px;" >
</figure>

Our experiments revealed that even with a well-chosen learning rate, convergence is sluggish. Why? The answer lies in feature scaling.

### The Problem with Unscaled Features

Look at our housing data:
- Size ranges from ~900 to 2000 sqft
- Number of bedrooms ranges from 1 to 5

These vastly different scales cause gradient descent to take inefficient, zig-zagging steps. A small change in the weight on size shifts predictions a lot, because that weight is multiplied by values in the thousands, while an equally small change in the weight on bedrooms barely moves them.
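The imbalance shows up directly in the gradient. As a rough illustration, the sketch below evaluates equation (2) at all-zero parameters using only the three rows from the Sample Data table, so the exact numbers will differ from those for the full dataset.

```python
import numpy as np

# the three sample rows from the Sample Data table (size, bedrooms, floors, age)
X_sample = np.array([[952.0, 2.0, 1.0, 65.0],
                     [1244.0, 3.0, 2.0, 64.0],
                     [1947.0, 3.0, 2.0, 17.0]])
y_sample = np.array([271.5, 232.0, 509.8])   # price in $1000s

w = np.zeros(4)   # start from all-zero parameters
b = 0.0
err = X_sample @ w + b - y_sample                  # prediction error per example
dj_dw = (X_sample.T @ err) / X_sample.shape[0]     # equation (2) for every feature

print("dj_dw =", dj_dw)
print("|dj_dw[0] / dj_dw[1]| =", abs(dj_dw[0] / dj_dw[1]))
```

On these three rows the gradient for the size weight is several hundred times larger in magnitude than the gradient for the bedrooms weight, even though both share the same error term and the same $\alpha$.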

### The Solution: Z-Score Normalization

Z-score normalization (also called standardization) transforms each feature to have:
- Mean ($\mu$) = 0
- Standard deviation ($\sigma$) = 1

The formula:
$$x^{(i)}_j = \frac{x^{(i)}_j - \mu_j}{\sigma_j}$$

where:
- $\mu_j$ is the mean of feature $j$
- $\sigma_j$ is the standard deviation of feature $j$

After normalization, all features are on a comparable scale, allowing gradient descent to converge much faster.

<details>
<summary>
    <font size='3', color='darkgreen'><b>Technical Deep-Dive: Why Feature Scaling Works</b></font>
</summary>

Let's examine what happens during training with $\alpha$ = 9e-7 more closely:

<figure>
    <img src="./images/C1_W2_Lab06_ShortRun.PNG" style="width:1200px;" >
</figure>

In the early iterations above, notice that $w_0$ (corresponding to size) changes dramatically while other parameters barely budge. This is because $w_0$ has a much larger gradient.

Here's what a very long training run looks like (this can take hours!):

<figure>
    <img src="./images/C1_W2_Lab06_LongRun.PNG" style="width:1200px;" >
</figure>
    
Observe how cost decreases rapidly at first, then crawls. $w_0$ quickly reaches its optimal value (notice `dj_dw0` becomes tiny), while $w_1$, $w_2$, and $w_3$ take much longer to converge.

**Why does this happen?**

<figure>
    <center> <img src="./images/C1_W2_Lab06_scale.PNG"   ></center>
</figure>   

The gradient descent update rule multiplies the error by each feature value:
- $\alpha$ is the same for all parameters
- The error term is shared across all updates
- But the gradient for each $w_j$ is scaled by that feature's value $x_j$

Since house size is typically 1000+ while bedroom count is 2-4, $w_0$'s gradient is hundreds of times larger than $w_1$'s gradient. This creates uneven updates.

Feature scaling solves this by normalizing all features to similar ranges.
</details>

### Normalization Techniques

Several methods exist for feature normalization:

1. **Min-Max Scaling**: Rescales each feature to the [0, 1] range
   - Formula: $x_j := \dfrac{x_j - \min_j}{\max_j - \min_j}$

2. **Mean Normalization**: Centers each feature around zero, with values between -1 and 1
   - Formula: $x_j := \dfrac{x_j - \mu_j}{\max_j - \min_j}$

3. **Z-Score Normalization** (what we'll use): Standardizes to mean=0, std=1

We'll focus on z-score normalization as it's particularly effective for gradient descent.
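For comparison, the first two techniques can be sketched in a few lines of NumPy. The function names and the toy array below are illustrative assumptions, not part of the lab's utilities, and neither function guards against a constant column (where the range is zero).

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to the [0, 1] range."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def mean_normalize(X):
    """Center each column on zero and divide by its range."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# made-up values: a size-like column and a bedroom-like column
X_toy = np.array([[1000.0, 2.0], [1500.0, 3.0], [2000.0, 5.0]])
X_minmax = min_max_scale(X_toy)
X_meannorm = mean_normalize(X_toy)
print(X_minmax)
print(X_meannorm)
```

Both methods depend on the column minimum and maximum, so a single extreme value changes the scaling of every other example in that column.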


### Implementing Z-Score Normalization

Z-score normalization transforms features so they all have mean = 0 and standard deviation = 1.

The transformation formula:
$$x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j} \tag{4}$$ 

where:
- $j$ identifies which feature (column) in $\mathbf{X}$
- $\mu_j$ is the mean of feature $j$ across all examples
- $\sigma_j$ is the standard deviation of feature $j$

Computing the statistics:
$$
\begin{align}
\mu_j &= \frac{1}{m} \sum_{i=0}^{m-1} x^{(i)}_j \tag{5}\\
\sigma^2_j &= \frac{1}{m} \sum_{i=0}^{m-1} (x^{(i)}_j - \mu_j)^2  \tag{6}
\end{align}
$$

>**Critical Implementation Note:** Save the mean and standard deviation values you compute during training! When making predictions on new data, you must normalize using these same statistics (not new ones calculated from the test data). Otherwise, your model's predictions will be incorrect.

Let's implement this:

In [None]:
def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    
    Args:
      X (ndarray (m,n))     : input data, m examples, n features
      
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
      mu (ndarray (n,))     : mean of each feature
      sigma (ndarray (n,))  : standard deviation of each feature
    """
    # find the mean of each column/feature
    mu     = np.mean(X, axis=0)                 # mu will have shape (n,)
    # find the standard deviation of each column/feature
    sigma  = np.std(X, axis=0)                  # sigma will have shape (n,)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma      

    return (X_norm, mu, sigma)
 
# check our work (optional, requires scikit-learn)
# from sklearn.preprocessing import scale
# scale(X_train, axis=0, with_mean=True, with_std=True, copy=True)
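A quick way to sanity-check the function without extra dependencies is to confirm that every normalized column has mean 0 and standard deviation 1. The sketch below repeats the same computation under a different, hypothetical name (`zscore_columns`) so that it runs on its own, and it uses made-up values rather than the housing data.

```python
import numpy as np

def zscore_columns(X):
    """Equations (4)-(6): subtract each column's mean, divide by its standard deviation."""
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)    # population std (divides by m), matching equation (6)
    return (X - mu) / sigma, mu, sigma

# made-up values, not the housing data
X_check = np.array([[1000.0, 2.0], [1500.0, 3.0], [2000.0, 5.0], [1200.0, 4.0]])
X_check_norm, mu_check, sigma_check = zscore_columns(X_check)

print("column means after normalization:", X_check_norm.mean(axis=0))
print("column stds after normalization :", X_check_norm.std(axis=0))
```

The means come out as zero up to floating-point rounding, and the standard deviations as one. Note that `np.std` divides by $m$ by default, which agrees with equation (6).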

Let's visualize what z-score normalization does to our data step-by-step.

In [None]:
mu     = np.mean(X_train,axis=0)   
sigma  = np.std(X_train,axis=0) 
X_mean = (X_train - mu)
X_norm = (X_train - mu)/sigma      

fig,ax=plt.subplots(1, 3, figsize=(12, 3))
ax[0].scatter(X_train[:,0], X_train[:,3])
ax[0].set_xlabel(X_features[0]); ax[0].set_ylabel(X_features[3]);
ax[0].set_title("unnormalized")
ax[0].axis('equal')

ax[1].scatter(X_mean[:,0], X_mean[:,3])
ax[1].set_xlabel(X_features[0]); ax[1].set_ylabel(X_features[3]);
ax[1].set_title(r"X - $\mu$")
ax[1].axis('equal')

ax[2].scatter(X_norm[:,0], X_norm[:,3])
ax[2].set_xlabel(X_features[0]); ax[2].set_ylabel(X_features[3]);
ax[2].set_title(r"Z-score normalized")
ax[2].axis('equal')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
fig.suptitle("distribution of features before, during, after normalization")
plt.show()

The visualization above shows how normalization transforms two features ("size" and "age") with equal axis scaling:

- **Left (Original)**: Notice how "size(sqft)" has a much wider range than "age". The data is stretched horizontally.
- **Middle (After centering)**: Subtracting the mean ($\mu$) shifts both features to center around zero. "Size" is now clearly centered at zero, though "age" is harder to see due to scale.
- **Right (After full normalization)**: Dividing by standard deviation ($\sigma$) brings both features to the same scale. Now they're both centered at zero with comparable variance.

This transformation makes gradient descent much more efficient!

Now let's apply normalization to our entire dataset and examine the results.

In [None]:
# normalize the original features
X_norm, X_mu, X_sigma = zscore_normalize_features(X_train)
print(f"X_mu = {X_mu}, \nX_sigma = {X_sigma}")
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X_train,axis=0)}")   
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")

Notice the dramatic improvement! Before normalization, the peak-to-peak range of the size column is in the thousands while the other columns span far less; after normalization, every column spans only a few units.

In [None]:
fig,ax=plt.subplots(1, 4, figsize=(12, 3))
for i in range(len(ax)):
    norm_plot(ax[i],X_train[:,i],)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("count");
fig.suptitle("distribution of features before normalization")
plt.show()
fig,ax=plt.subplots(1,4,figsize=(12,3))
for i in range(len(ax)):
    norm_plot(ax[i],X_norm[:,i],)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("count"); 
fig.suptitle("distribution of features after normalization")

plt.show()

The histograms above show a key insight: after normalization, all features share similar x-axis ranges (roughly -2 to +2) and are centered around zero. This uniform scaling is exactly what gradient descent needs for efficient convergence.

Time to test gradient descent on our normalized data!

Pay attention to the **much larger learning rate** we can now use (0.1 vs 1e-7). This dramatically accelerates training.

In [None]:
w_norm, b_norm, hist = run_gradient_descent(X_norm, y_train, 1000, 1.0e-1)

Impressive results! With normalized features, we achieved excellent accuracy in just 1000 iterations - **orders of magnitude faster** than before. The tiny gradients at the end confirm convergence.

A learning rate of 0.1 works great for normalized features. Let's visualize how well our model predicts by comparing predictions to actual prices. (Note: predictions use normalized features, but the plot shows original feature values for interpretability.)

In [None]:
#predict target using normalized features
m = X_norm.shape[0]
yp = np.zeros(m)
for i in range(m):
    yp[i] = np.dot(X_norm[i], w_norm) + b_norm

# plot predictions and targets versus original features
fig,ax=plt.subplots(1,4,figsize=(12, 3),sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:,i],y_train, label = 'target')
    ax[i].set_xlabel(X_features[i])
    ax[i].scatter(X_train[:,i],yp,color=dlc["dlorange"], label = 'predict')
ax[0].set_ylabel("Price"); ax[0].legend();
fig.suptitle("target versus prediction using z-score normalized model")
plt.show()

The predictions look excellent! A few observations:
- With multiple features, we can't visualize everything in a single plot - hence the four separate plots.
- Remember: these predictions use normalized features. Any new data must be normalized using the **same** mean and standard deviation from training.

## Making Predictions on New Data

The whole purpose of building this model is to predict prices for houses not in our dataset. Let's predict the price of a house with:
- 1200 sqft
- 3 bedrooms
- 1 floor
- 40 years old

**Critical reminder:** You must normalize this new data using the same $\mu$ and $\sigma$ computed from the training set!

In [None]:
# First, normalize our example.
x_house = np.array([1200, 3, 1, 40])
x_house_norm = (x_house - X_mu) / X_sigma
print(x_house_norm)
x_house_predict = np.dot(x_house_norm, w_norm) + b_norm
print(f" predicted price of a house with 1200 sqft, 3 bedrooms, 1 floor, 40 years old = ${x_house_predict*1000:0.0f}")
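Because forgetting the normalization step is an easy mistake, one option is to bundle normalization and prediction into a single function. This is a sketch only: `predict_from_raw` is a hypothetical helper, and the parameter values below are placeholders for illustration, not values learned from the housing data.

```python
import numpy as np

def predict_from_raw(x_raw, mu, sigma, w, b):
    """Normalize raw features with the training-set mu and sigma, then apply the linear model."""
    x_norm = (x_raw - mu) / sigma
    return np.dot(x_norm, w) + b

# placeholder values for illustration only; in the notebook these would be
# X_mu, X_sigma, w_norm and b_norm from the cells above
mu_demo = np.array([1400.0, 3.0])
sigma_demo = np.array([400.0, 1.0])
w_demo = np.array([100.0, -20.0])
b_demo = 360.0

price_demo = predict_from_raw(np.array([1800.0, 2.0]), mu_demo, sigma_demo, w_demo, b_demo)
print(price_demo)
```

The same function also accepts a 2-D array of several houses, since broadcasting applies `mu` and `sigma` to every row and `np.dot` then returns one prediction per row.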

## Visualizing the Cost Function Landscape

<img align="left" src="./images/C1_W2_Lab06_contours.PNG"   style="width:240px;" >

Cost contour plots provide another perspective on why feature scaling matters. When features have mismatched scales, the cost function becomes highly asymmetric.

The plots below compare cost contours before and after normalization:
- **Left**: Before normalization - the contour plot of $w_0$ (size) vs $w_1$ (bedrooms) is so stretched that you can barely see the contour curves. The extreme asymmetry means gradient descent makes uneven progress.
- **Right**: After normalization - the cost contours are much more circular and symmetric. This allows gradient descent to make balanced progress on all parameters simultaneously.

This geometric insight explains why normalized features converge so much faster!
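The stretching of the contours can also be quantified. Differentiating equation (2) once more with respect to $w_k$ gives the curvature $\frac{1}{m}\sum_i x_j^{(i)} x_k^{(i)}$, and the ratio of the largest to the smallest eigenvalue of that matrix (its condition number) measures how elongated the contours are. The sketch below is an illustration under simplifying assumptions: it uses only the size and bedrooms columns of the three rows from the Sample Data table, holds the bias fixed, and uses names (`cost_curvature`, `cond_raw`, `cond_scaled`) invented here.

```python
import numpy as np

# size and bedrooms from the three rows in the Sample Data table
X_raw = np.array([[952.0, 2.0], [1244.0, 3.0], [1947.0, 3.0]])
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # z-score normalization

def cost_curvature(X):
    """Second derivatives of the squared-error cost with respect to w (bias held fixed)."""
    return (X.T @ X) / X.shape[0]

cond_raw = np.linalg.cond(cost_curvature(X_raw))
cond_scaled = np.linalg.cond(cost_curvature(X_scaled))
print(f"condition number, raw features       : {cond_raw:.3g}")
print(f"condition number, normalized features: {cond_scaled:.3g}")
```

A condition number near 1 corresponds to nearly circular contours. On these rows the raw features give a ratio in the millions, while the normalized features give a single-digit ratio.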


In [None]:
plt_equal_scale(X_train, X_norm, y_train)


## Summary and Key Takeaways

In this lab, you've learned essential techniques for training multi-feature regression models:

1. **Learning Rate Selection**: Experimented with different $\alpha$ values and saw how:
   - Too large → divergence
   - Too small → slow convergence
   - Just right → efficient training

2. **Feature Scaling Power**: Discovered that z-score normalization:
   - Enables much larger learning rates (0.1 vs 1e-7)
   - Accelerates convergence by orders of magnitude
   - Creates balanced parameter updates

3. **Practical Implementation**: Built working code that normalizes features and makes accurate predictions

These techniques are fundamental to training effective machine learning models!

## Data Attribution

The housing dataset used in this lab is derived from the [Ames Housing dataset](http://jse.amstat.org/v19n3/decock.pdf) compiled by Dean De Cock for educational purposes in data science.