# Calculation example for Stochastic Gradient Descent and Linear Regression

In [41]:
# You can do the exercise on paper! Read through the exercise carefully. The calculations start at section 2.2.

## 1) Prepare a dataset for the linear regression

### 1.1) Example dataset

In [42]:
# Example dataset (x values)
X = [
    [8, 2],
    [2, 8],
    [1, 4],
    [3, 6],
    [9, 4],
    [6, 5]
]

# Dependent variable on X (output of the model)
# first row in X corresponds to first row in y and so forth
y = [
    4,
    -14,
    -7,
    -9,
    1,
    -4
]

### 1.2) Based on the dataset, we assume that a Linear Regression model fits our data best

In [43]:
# The formula for linear regression with two independent input variables is as follows

$y_{hat}= w_1x_1 + w_2x_2 + b$

### 1.3) We need a function to evaluate the model outcome - Loss function

In [44]:
# For this example we define our loss function to be the 'Mean Squared Error' function.
# Keep in mind that we are not bound to this choice and that we can use 'whatever' function we want.
# The function should be representative of the model's performance (How good is my model?)

$$L = MSE = \frac{1}{n}\sum_{i=1}^n (y_{i_{hat}}-y_{i})^2$$

In [45]:
# In general for classical approaches: The lower the loss the better the model!
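
To make the formula concrete, the following sketch evaluates the MSE over all six rows of the example dataset, using the initial parameter values from section 2.1. The helper names `predict` and `mse_loss` are illustrative assumptions and not part of the exercise.

```python
# Dataset from section 1.1
X = [[8, 2], [2, 8], [1, 4], [3, 6], [9, 4], [6, 5]]
y = [4, -14, -7, -9, 1, -4]

def predict(row, w_1, w_2, b):
    """Linear model y_hat = w_1*x_1 + w_2*x_2 + b for one data row."""
    x_1, x_2 = row
    return w_1 * x_1 + w_2 * x_2 + b

def mse_loss(X, y, w_1, w_2, b):
    """Mean of the squared differences between prediction and target."""
    squared_errors = [(predict(row, w_1, w_2, b) - y_i) ** 2 for row, y_i in zip(X, y)]
    return sum(squared_errors) / len(y)

# Initial parameter values from section 2.1
loss_initial = mse_loss(X, y, w_1=-0.5, w_2=0.02, b=0)
print(f"MSE with the initial parameters: {loss_initial:.4f}")
```

Note that the step-by-step calculation in section 2 uses only one data row at a time (n = 1), whereas this sketch averages over the whole dataset.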

### 1.4) How can we minimize the loss of our model? - SGD

In [46]:
# This is where Stochastic Gradient Descent comes into play!
# We need to calculate the gradients of our model parameters (w_1, w_2, b) and update their values!

# Calculation follows!

## 2) How to use all the information from above?

### 2.1) Initialize random values for

$w_1, w_2, b$

e. g.

$w_1 = -0.5$, $w_2 = 0.02$, $b = 0$

In [47]:
# The parameter value initialization is only done once (at the beginning)
# We need these random values because we don't know anything about them at the beginning

# Initialised random parameter values
w_1 = -0.5
w_2 = 0.02
b = 0

### 2.2) Use the parameter values and one data row of the dataset and calculate the outputs

In [48]:
# Use for example the first row of X as x_1 and x_2 values and
# calculate the output with the initially set parameter values w_1, w_2, b

In [49]:
# TODO !

# First iteration
x_1, x_2 = X[0]
y_i = y[0]

# Compute the predicted output
y_hat = w_1 * x_1 + w_2 * x_2 + b

print(f"First iteration: x_1={x_1}, x_2={x_2}, y={y_i}")
print(f"Output with w_1={w_1}, w_2={w_2}, b={b} is {y_hat}")

First iteration: x_1=8, x_2=2, y=4
Output with w_1=-0.5, w_2=0.02, b=0 is -3.96


### 2.3) Calculate the loss of the model - How good is our model

In [50]:
# Now we want to find out how good our model is with respect to our defined loss function
# Therefore, calculate the loss of our model based on the loss function defined above
# Use the matching y from the data (e. g. if you use the first row X[0] then you have to use y[0])

In [51]:
# TODO !
n = 1  # only one data row is used in this first iteration

loss = 1/n * (y_hat - y_i) ** 2

print(f"Loss: {loss}")

Loss: 63.3616


### 2.4) Calculate the gradients of the learning parameters (w_1, w_2 and b)

In [52]:
# The gradients can be calculated using the partial derivatives with respect to the parameters

In [53]:
# We start with the derivative of our loss function because this is our evaluation function.
# The loss function tells us how well our model performs
# We want to update our parameters based on our quality measurement function (loss function)

In [54]:
# Take the partial derivative of the loss function with respect to y_hat (you can write the end result in code)

# TODO !
dL_dyhat = 2 * (y_hat - y_i)

$$\frac{\partial L}{\partial y_{hat}} = 2(y_{hat}-y)$$
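
As a brief justification of this result: for a single data row ($n = 1$) the loss reduces to one squared term, so the derivative follows from the power rule, with the inner derivative of $(y_{hat}-y)$ with respect to $y_{hat}$ being 1.

$$L = (y_{hat}-y)^2 \quad\Rightarrow\quad \frac{\partial L}{\partial y_{hat}} = 2(y_{hat}-y) * 1 = 2(y_{hat}-y)$$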

In [55]:
# Take the partial derivative of the model function with respect to every parameter...

$y_{hat}= w_1x_1 + w_2x_2 + b$

In [56]:
# for example for w_1... where x_1 is the first entry (column) of one data row

$$\frac{\partial y_{hat}}{\partial w_1} = x_1$$

In [57]:
# do this for w_1, w_2 and b using, for example, the first data row X[0]

In [58]:
# TODO !
dyhat_dw1 = x_1
dyhat_dw2 = x_2
dyhat_db = 1

In [59]:
# we need to combine the derivative of the loss function with the derivatives of the model to calculate the final gradients
# this can be done with the chain rule!
# that means tracing back from the output and finding the impact of every single parameter
# the chain rule simply multiplies the partial derivatives from the end of the model (loss fn) back to the parameter we are looking for
# the chain rule for the parameter w_1 is as follows:

$$\frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial y_{hat}} * \frac{\partial y_{hat}}{\partial w_1}$$
$$\frac{\partial L}{\partial w_{1}} = 2(y_{hat}-y) * x_1$$
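
One way to convince yourself that the chain-rule result is correct is to compare it with a numerical approximation of the gradient. The sketch below does this for $w_1$ using the first data row and the initial parameter values; the helper `loss_single` and the step size `h` are assumptions made for this illustration.

```python
# First data row and initial parameters from the exercise
x_1, x_2, y_i = 8, 2, 4
w_1, w_2, b = -0.5, 0.02, 0.0

def loss_single(w_1, w_2, b):
    """Squared error for the single data row (n = 1)."""
    y_hat = w_1 * x_1 + w_2 * x_2 + b
    return (y_hat - y_i) ** 2

# Analytic gradient from the chain rule: 2 * (y_hat - y) * x_1
y_hat = w_1 * x_1 + w_2 * x_2 + b
analytic = 2 * (y_hat - y_i) * x_1

# Numerical gradient via a central difference (h is an arbitrary small step)
h = 1e-5
numeric = (loss_single(w_1 + h, w_2, b) - loss_single(w_1 - h, w_2, b)) / (2 * h)

print(f"analytic: {analytic:.4f}, numeric: {numeric:.4f}")
```

Both values should agree up to floating-point error. The same check works for $w_2$ and $b$ by shifting the corresponding argument instead.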

In [60]:
# calculate the gradient for all parameters w_1, w_2 and b

In [61]:
# TODO !
dL_dw1 = dL_dyhat * dyhat_dw1
dL_dw2 = dL_dyhat * dyhat_dw2
dL_db = dL_dyhat * dyhat_db
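
For checking a calculation done on paper, these are the values obtained with the first data row ($x_1 = 8$, $x_2 = 2$, $y = 4$) and the prediction $y_{hat} = -3.96$ from section 2.2:

$$\frac{\partial L}{\partial y_{hat}} = 2(-3.96 - 4) = -15.92$$
$$\frac{\partial L}{\partial w_1} = -15.92 * 8 = -127.36$$
$$\frac{\partial L}{\partial w_2} = -15.92 * 2 = -31.84$$
$$\frac{\partial L}{\partial b} = -15.92 * 1 = -15.92$$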

### 2.5) Apply the parameter update rule

$$\theta := \theta - \eta\nabla_{\theta}L(\theta)$$

In [62]:
# For w_1 this looks as follows

$$w_1 := w_1 - \eta \space(\frac{\partial L}{\partial y_{hat}} * \frac{\partial y_{hat}}{\partial w_1})$$

In [63]:
# Let's set the learning rate eta to 0.05

$$\eta = 0.05$$

In [64]:
# calculate the new value for all parameters w_1, w_2, b

In [65]:
# TODO !
learning_rate = 0.05

# New values
w_1 -= learning_rate * dL_dw1
w_2 -= learning_rate * dL_dw2
b -= learning_rate * dL_db

print(f'Updated parameters: w1 = {w_1}, w2 = {w_2}, b = {b}')


Updated parameters: w1 = 5.868, w2 = 1.612, b = 0.796
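
It is worth checking whether this single update actually reduced the loss on the row it was computed from. The sketch below repeats the step for two learning rates; the helper `step_and_losses` is an illustrative assumption and not part of the exercise. With $\eta = 0.05$ the step overshoots and the loss on this row increases, while the smaller value $\eta = 0.005$ used in section 3 reduces it.

```python
def step_and_losses(learning_rate, x_1=8, x_2=2, y_i=4, w_1=-0.5, w_2=0.02, b=0.0):
    """Do one SGD step on a single row; return the loss before and after the step."""
    y_hat = w_1 * x_1 + w_2 * x_2 + b
    loss_before = (y_hat - y_i) ** 2

    # Gradients via the chain rule, followed by the parameter update
    dL_dyhat = 2 * (y_hat - y_i)
    w_1 -= learning_rate * dL_dyhat * x_1
    w_2 -= learning_rate * dL_dyhat * x_2
    b -= learning_rate * dL_dyhat * 1

    # Re-evaluate the same row with the updated parameters
    y_hat_new = w_1 * x_1 + w_2 * x_2 + b
    loss_after = (y_hat_new - y_i) ** 2
    return loss_before, loss_after

for eta in (0.05, 0.005):
    before, after = step_and_losses(eta)
    print(f"eta={eta}: loss before = {before:.4f}, loss after = {after:.4f}")
```

This only looks at one row and one step, so it says nothing definitive about convergence on the whole dataset. It does illustrate why the choice of learning rate matters.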


## 3) Iterate this procedure with every data row from X and y to update your model

In [71]:
# One epoch uses every data row exactly once
# The number of epochs defines how many times we iterate over all data rows
import numpy as np

# Initialised values
w_1 = -0.05
w_2 = 0.02
b = 0
learning_rate = 0.005
epochs = 1000 # number of passes over the whole dataset

# Convert to numpy array
X = np.array(X)
y = np.array(y)


# Loss function (MSE)
def mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def sgd(X, y, w_1, w_2, b, learning_rate, epochs):
    for epoch in range(epochs):
        for i in range(len(y)):
            x_1, x_2 = X[i]
            y_i = y[i]

            # Predicted output
            y_hat = w_1 * x_1 + w_2 * x_2 + b

            # Compute gradients
            # - partial derivative of the loss function
            dL_dyhat = 2 * (y_hat - y_i)

            # - partial derivatives of the model
            dyhat_dw1 = x_1
            dyhat_dw2 = x_2
            dyhat_db = 1

            # - gradients
            dL_dw1 = dL_dyhat * dyhat_dw1
            dL_dw2 = dL_dyhat * dyhat_dw2
            dL_db = dL_dyhat * dyhat_db

            # Update parameters
            w_1 -= learning_rate * dL_dw1
            w_2 -= learning_rate * dL_dw2
            b -= learning_rate * dL_db
        
        y_pred = w_1 * X[:, 0] + w_2 * X[:, 1] + b
        loss = mse(y, y_pred)

    return w_1, w_2, b, loss

# Train the model
w_1, w_2, b, loss = sgd(X, y, w_1, w_2, b, learning_rate, epochs)
print(f'Final parameters: w1 = {w_1}, w2 = {w_2}, b = {b}')
print(f"Loss: {loss}")


Final parameters: w1 = 1.0004744293882963, w2 = -1.9990895337170753, b = -0.0075577599328742326
Loss: 2.778292840029211e-06
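
As a sanity check on this result, the same model can be fitted in closed form with ordinary least squares. This is a different technique from SGD and is shown only for comparison. The sketch assumes NumPy's `np.linalg.lstsq` applied to the data matrix extended by a column of ones for the bias; the names `A`, `coef` and the `_ls` suffixes are illustrative. The example data satisfies $y = x_1 - 2x_2$ exactly, so the SGD parameters above should approach $w_1 = 1$, $w_2 = -2$, $b = 0$.

```python
import numpy as np

# Dataset from section 1.1
X = np.array([[8, 2], [2, 8], [1, 4], [3, 6], [9, 4], [6, 5]], dtype=float)
y = np.array([4, -14, -7, -9, 1, -4], dtype=float)

# Append a column of ones so the last coefficient acts as the bias b
A = np.column_stack([X, np.ones(len(y))])

# Least-squares solution of A @ [w_1, w_2, b] = y
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
w_1_ls, w_2_ls, b_ls = coef

print(f"Least squares: w1 = {w_1_ls:.4f}, w2 = {w_2_ls:.4f}, b = {b_ls:.4f}")
```

The remaining gap between the SGD result and these values, most visible in `b`, shrinks further with more epochs.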
