In [1]:
import numpy as np

# Chapter 2: Visualizing Gradient Descent

Now that you've learned how gradient descent works, it's time to put your knowledge into action :-)

We're generating a new synthetic dataset using *b = 0.5* and *w = -3* for a **linear regression with a single feature (x)**:

$$
\Large
y = b + w x
$$

You'll implement the **five steps** of gradient descent in order to **learn these parameters** from the data.

## Data Generation

In [2]:
true_b = .5
true_w = -3
N = 100

# Data Generation
np.random.seed(42)
x = np.random.rand(N, 1)
epsilon = (.1 * np.random.randn(N, 1))
y = true_b + true_w * x + epsilon

# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)

# Uses first 80 random indices for train
train_idx = idx[:int(N*.8)]
# Uses the remaining indices for validation
val_idx = idx[int(N*.8):]

# Generates train and validation sets
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

## Step 0: Random Initialization

The first step - actually, the zeroth step - is the *random initialization* of the parameters. Using NumPy's `random.rand` method, which draws uniform samples from [0, 1), you should write code to initialize both *b* and *w*:

In [11]:
# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)

b = np.random.rand(1)
w = np.random.rand(1)

print(b, w)

[0.37454012] [0.95071431]
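The cell above draws the initial values with `np.random.rand`, which samples uniformly from [0, 1). Its sibling `np.random.randn` samples from a standard normal distribution instead, so it can also produce negative starting values. The sketch below only illustrates that difference; the names `uniform_draws` and `normal_draws` are made up for this example.

```python
import numpy as np

np.random.seed(42)
uniform_draws = np.random.rand(10_000)   # uniform over [0, 1)
normal_draws = np.random.randn(10_000)   # standard normal: mean 0, std 1

# Uniform draws never leave [0, 1)
print(uniform_draws.min(), uniform_draws.max())
# Normal draws are centered at 0 with a spread of about 1
print(normal_draws.mean(), normal_draws.std())
# Roughly half of the normal draws are negative
print((normal_draws < 0).mean())
```

Either choice works as a starting point for this problem, since gradient descent moves the parameters away from their initial values anyway.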


## Step 1: Compute Model's Predictions

The first step (for real) is the **forward pass**, that is, the **predictions** of the model. Our model is a linear regression with a single feature (x), and its parameters are *b* and *w*. You should write code to generate predictions (yhat):

In [12]:
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w*x_train
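The forward pass relies on NumPy broadcasting: *b* and *w* have shape `(1,)`, while the features have shape `(80, 1)`, so a single expression produces one prediction per data point. A minimal sketch of this, using a hypothetical stand-in array `x_demo` with the same shape as `x_train`:

```python
import numpy as np

np.random.seed(42)
x_demo = np.random.rand(80, 1)   # stand-in for x_train: 80 points, 1 feature
b = np.random.rand(1)
w = np.random.rand(1)

# Broadcasting stretches b and w across all 80 rows
yhat = b + w * x_demo
print(b.shape, w.shape, x_demo.shape, yhat.shape)

# The same computation written one data point at a time
yhat_loop = np.array([[b[0] + w[0] * xi[0]] for xi in x_demo])
print(np.allclose(yhat, yhat_loop))
```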

## Step 2: Compute the Mean Squared Error (MSE) Loss

Since our model is a linear regression, the appropriate loss is the **Mean Squared Error (MSE)** loss:

$$
\Large
error_i = \hat{y_i} - y_i
\\
\Large
loss = \frac{1}{N}\sum_{i=1}^N{error_i^2}
$$

For each data point (i) in our training set, you should write code to compute the difference between the model's predictions (yhat) and the actual values (y_train), and use the errors of all N data points to compute the loss:

Note: DO NOT use loops!

In [13]:
error = yhat - y_train
loss = (error**2).mean()

In [14]:
# Added by KA
print(error)

[[ 2.98961138]
 [ 0.05998564]
 [ 3.09783772]
 [-0.05270525]
 [ 2.75726298]
 [ 0.14384401]
 [ 0.79165467]
 [ 1.20061059]
 [ 1.69884829]
 [ 3.80414561]
 [ 0.3601495 ]
 [ 2.99349823]
 [ 3.02532782]
 [ 2.64114613]
 [ 0.01409053]
 [ 0.3858017 ]
 [ 0.51289238]
 [-0.13567664]
 [ 3.69526617]
 [ 2.28962965]
 [ 2.72210829]
 [ 2.39173846]
 [ 3.50974555]
 [ 2.47541577]
 [ 0.96168611]
 [ 0.8178027 ]
 [ 0.2648168 ]
 [ 2.71999435]
 [ 3.63076038]
 [ 2.08615767]
 [ 2.16735959]
 [ 1.90696647]
 [ 0.58317086]
 [ 0.97876877]
 [ 3.67350174]
 [ 0.66210454]
 [ 0.50225444]
 [ 3.34838163]
 [ 1.34553642]
 [ 1.17145474]
 [ 3.07093847]
 [-0.04377753]
 [ 3.21625919]
 [ 1.88786302]
 [ 3.04915365]
 [ 3.28332658]
 [ 0.33037908]
 [ 0.14516264]
 [ 2.2311294 ]
 [ 2.59698542]
 [ 2.50268721]
 [-0.10388294]
 [ 2.33022316]
 [ 3.06649678]
 [ 1.58330214]
 [ 1.08053806]
 [ 3.01861055]
 [ 2.90472331]
 [ 1.94771043]
 [ 0.69766564]
 [ 3.66044147]
 [ 1.02820882]
 [ 0.39953498]
 [ 0.9657421 ]
 [ 2.64208225]
 [ 0.40374081]
 [ 1.26142

In [15]:
# Added by KA
print(loss)

4.495325075114491
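To see why no loop is needed, the sketch below computes the MSE twice: once by accumulating squared errors point by point, and once with the vectorized expression. For brevity it uses all 100 generated points rather than the 80-point training split, so its loss value will differ slightly from the one printed above; `loss_loop` and `loss_vec` are names introduced only for this comparison.

```python
import numpy as np

# Regenerate the data and the initial parameters (full dataset, no split)
np.random.seed(42)
x = np.random.rand(100, 1)
y = 0.5 - 3 * x + 0.1 * np.random.randn(100, 1)
np.random.seed(42)
b = np.random.rand(1)
w = np.random.rand(1)
yhat = b + w * x

# Loop version: add up the squared errors, then divide by N
total = 0.0
for i in range(len(y)):
    total += (yhat[i, 0] - y[i, 0]) ** 2
loss_loop = total / len(y)

# Vectorized version: one expression over the whole array
loss_vec = ((yhat - y) ** 2).mean()

print(loss_loop, loss_vec)
```

Both versions agree up to floating-point rounding; the vectorized one is shorter and typically much faster on large arrays.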


## Step 3: Compute the Gradients

Later on, PyTorch's autograd will take care of this step, so you won't have to compute any derivatives yourself. For now, the gradients are computed manually in the cell below.

You should *still* understand what the gradients *mean*, though.
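As a sketch of where the gradient expressions come from, apply the chain rule to the MSE loss, using the model's prediction for each point (*b* plus *w* times the feature) inside the error term:

$$
\Large
\frac{\partial\, loss}{\partial b} = \frac{1}{N}\sum_{i=1}^N{2 \cdot error_i \cdot \frac{\partial\, error_i}{\partial b}} = 2 \cdot \frac{1}{N}\sum_{i=1}^N{error_i}
\\
\Large
\frac{\partial\, loss}{\partial w} = \frac{1}{N}\sum_{i=1}^N{2 \cdot error_i \cdot \frac{\partial\, error_i}{\partial w}} = 2 \cdot \frac{1}{N}\sum_{i=1}^N{x_i \cdot error_i}
$$

The derivative of the error with respect to *b* is 1, and with respect to *w* it is the feature value, which is why the second expression carries the extra factor.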

In [16]:
# Step 3 - Computes gradients for both "b" and "w" parameters
b_grad = 2 * error.mean()
w_grad = 2 * (x_train * error).mean()
print(b_grad, w_grad)

3.464425403927533 2.3835657220659554


The gradients above indicate that:
- for a tiny increase in the value of the parameter *b*, the loss will increase roughly 3.5 times as much
- for a tiny increase in the value of the parameter *w*, the loss will increase roughly 2.4 times as much
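This interpretation can be checked numerically: nudge one parameter by a tiny amount, measure how much the loss changes, and divide by the nudge. The sketch below does that with a forward finite difference. It is an illustration only: the helper `mse`, the step `delta`, and the `*_fd` names are assumptions of this example, and it uses all 100 points instead of the training split, so the values differ slightly from the cell above.

```python
import numpy as np

np.random.seed(42)
x = np.random.rand(100, 1)
y = 0.5 - 3 * x + 0.1 * np.random.randn(100, 1)

def mse(b, w):
    """Mean squared error of the line b + w*x on the data above."""
    return ((b + w * x - y) ** 2).mean()

# Initial values printed in Step 0
b, w = 0.37454012, 0.95071431

# Analytical gradients, same formulas as in Step 3
error = b + w * x - y
b_grad = 2 * error.mean()
w_grad = 2 * (x * error).mean()

# Finite differences: change in loss divided by a tiny change in the parameter
delta = 1e-6
b_grad_fd = (mse(b + delta, w) - mse(b, w)) / delta
w_grad_fd = (mse(b, w + delta) - mse(b, w)) / delta

print(b_grad, b_grad_fd)
print(w_grad, w_grad_fd)
```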

## Step 4: Update the Parameters

The fourth step is the **parameter update** - you should write code that uses the gradients and a learning rate (set to 0.1) to update the parameters:

In [17]:
# Sets learning rate - this is "eta", the Greek letter that looks like an "n"
lr = 0.1

# Step 4 - Updates parameters using gradients and the 
# learning rate
b = b - lr*b_grad
w = w - lr*w_grad

print(b, w)

[0.02809758] [0.71235773]
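A quick way to confirm that an update moved the parameters in a useful direction is to compare the loss before and after it. The following sketch performs a single update from the initial values and prints both losses. It uses all 100 points rather than the training split, and the names `b_new`, `w_new`, `loss_before`, and `loss_after` are introduced only here.

```python
import numpy as np

np.random.seed(42)
x = np.random.rand(100, 1)
y = 0.5 - 3 * x + 0.1 * np.random.randn(100, 1)

# Initial values printed in Step 0
b, w = 0.37454012, 0.95071431
lr = 0.1

# Loss at the starting point
error = b + w * x - y
loss_before = (error ** 2).mean()

# One gradient descent update: step against the gradient
b_new = b - lr * 2 * error.mean()
w_new = w - lr * 2 * (x * error).mean()

# Loss after the update
loss_after = ((b_new + w_new * x - y) ** 2).mean()
print(loss_before, loss_after)
```

The minus sign in the update is what makes the loss go down: both gradients are positive here, so both parameters are decreased.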


## Step 5: Rinse and Repeat!

The last step consists of putting the other steps together and organizing them inside a loop. Write code to fill in the blanks in the loop below:

In [19]:
# Step 0 - Initializes parameters "b" and "w" randomly
np.random.seed(42)

b = np.random.rand(1)
w = np.random.rand(1)

lr = 0.1

for epoch in range(1000):
    # Step 1: Forward pass
    yhat = b + w*x_train
    
    # Step 2: Compute MSE loss
    error = yhat - y_train
    loss = (error**2).mean()
    
    # Step 3: Compute the gradients
    b_grad = 2 * error.mean()
    w_grad = 2 * (x_train * error).mean()

    # Step 4: Update the parameters
    b = b - lr*b_grad
    w = w - lr*w_grad
    
print(b, w)
print(loss)

[0.52354035] [-3.03103474]
0.008044657695553976
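The validation split created during data generation is not used by the loop above. One possible use, sketched below under the assumption that the same MSE is a reasonable metric for held-out data, is to evaluate the trained parameters on `x_val` and `y_val`. The list `train_losses` and the variable `val_loss` are hypothetical names added for this example.

```python
import numpy as np

# Rebuild the dataset and the 80/20 split as in the Data Generation cell
true_b, true_w, N = 0.5, -3, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + 0.1 * np.random.randn(N, 1)
idx = np.arange(N)
np.random.shuffle(idx)
train_idx, val_idx = idx[:int(N * .8)], idx[int(N * .8):]
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

# Same training loop, additionally recording the loss at every epoch
np.random.seed(42)
b = np.random.rand(1)
w = np.random.rand(1)
lr = 0.1
train_losses = []
for epoch in range(1000):
    error = b + w * x_train - y_train
    train_losses.append((error ** 2).mean())
    b = b - lr * 2 * error.mean()
    w = w - lr * 2 * (x_train * error).mean()

# MSE on the 20 points the model never saw during training
val_loss = ((b + w * x_val - y_val) ** 2).mean()
print(train_losses[0], train_losses[-1])
print(val_loss)
```

A validation loss of the same order as the final training loss suggests the fitted line generalizes to unseen points from the same distribution.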


Congratulations! Your model learned values for both *b* and *w* that are **really close** to their true values. They will never be a perfect match, though, because of the *noise* we added to the synthetic data (and that's always present in real-world data!).
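For a linear regression, the best-fitting parameters can also be obtained directly with least squares, which gives a reference point for the values found by gradient descent. The sketch below uses `np.linalg.lstsq` on the training split; this is a cross-check and not part of the five steps, and the names `X_design`, `b_ls`, and `w_ls` are introduced only for this example.

```python
import numpy as np

# Rebuild the dataset and the training split as in the Data Generation cell
true_b, true_w, N = 0.5, -3, 100
np.random.seed(42)
x = np.random.rand(N, 1)
y = true_b + true_w * x + 0.1 * np.random.randn(N, 1)
idx = np.arange(N)
np.random.shuffle(idx)
train_idx = idx[:int(N * .8)]
x_train, y_train = x[train_idx], y[train_idx]

# Design matrix: a column of ones (for b) next to the feature column (for w)
X_design = np.hstack([np.ones_like(x_train), x_train])

# Least-squares solution of X_design @ [b, w] = y_train
coef, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)
b_ls, w_ls = coef.ravel()
print(b_ls, w_ls)
```

If gradient descent has converged, its parameters should land very near this solution, and both should sit close to, but not exactly on, the true values because of the noise.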