# Gradient Descent

Gradient Descent adjusts the coefficients $\theta$ iteratively towards the minimum of the loss function. To determine the direction for adjusting $\theta$, we use the partial derivatives of the loss function with respect to $\theta$. An advantage of Gradient Descent is that the approach is applicable to any differentiable loss function.

# Update Rule

The basis for Gradient Descent is an update rule that adjusts the coefficients towards the minimum. We use a general update rule, in which we update $\theta$ with the formula below. In this formula, $\alpha$ is the 'learning rate', $m$ is the number of training samples, $X$ is the input matrix and $y$ is the target vector.

$$ \theta := \theta - \frac{\alpha}{m} X^T \cdot (X \cdot \theta - y) $$
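To see where this rule comes from, here is a short sketch of the derivation. It assumes the loss is the mean squared error with the common $\frac{1}{2m}$ scaling; that choice is an assumption, since the loss function is not written out in this notebook.

$$ J(\theta) = \frac{1}{2m} (X \cdot \theta - y)^T \cdot (X \cdot \theta - y) $$

$$ \nabla_\theta J(\theta) = \frac{1}{m} X^T \cdot (X \cdot \theta - y) $$

$$ \theta := \theta - \alpha \nabla_\theta J(\theta) = \theta - \frac{\alpha}{m} X^T \cdot (X \cdot \theta - y) $$

Under that assumption, the update rule is simply a step of size $\alpha$ against the gradient of the loss.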

# Data

For starters, we read the Wine data set like before, which gives us an input matrix X and a target vector y.

In [1]:
from ml import *
data = wines_quality_alcohol(bias=True, column_y = True)
X = data.train_X
y = data.train_y

# Model

The next thing we need is a vector that contains the parameters for our linear function. In Machine Learning, by default we use $\theta$ (pronounced theta) to hold our parameters. In the background, we use the `numpy` library to create matrices to hold these numbers.

#### Look at the example from the lecture and enter code at the ... to create an array with 2 rows and 1 column that contains zeros. The result should look like below.

`array([[0.],
       [0.]])`

In [3]:
# create a column vector with 2 zeros: 2 rows, 1 column
𝜃 = np.zeros( (2,1) )
𝜃

array([[0.],
       [0.]])

Then $X \cdot \theta$ gives the estimates in a column vector. In this case we get only zeros, because $\theta_0 = \theta_1 = 0$, and therefore the product with any input values is zero. We still need to train the model to find the best values for $\theta$.

In [4]:
X @ 𝜃

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])
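To make the shapes concrete, here is a small sketch with non-zero coefficients. The names `X_toy` and `theta_toy` and all numbers are made up for illustration; they are not taken from the Wine data set.

```python
import numpy as np

# Hypothetical toy data: 3 samples, a bias column of ones plus one feature
X_toy = np.array([[1.0, 9.4],
                  [1.0, 10.5],
                  [1.0, 12.0]])

# Illustrative coefficients (not trained): intercept 2.0, slope 0.5
theta_toy = np.array([[2.0],
                      [0.5]])

# (3, 2) @ (2, 1) -> (3, 1): one estimate per sample
estimates = X_toy @ theta_toy
print(estimates.shape)
print(estimates)
```

Each row of the result is $\theta_0 + \theta_1 x$ for one sample, which is why the input matrix needs the bias column of ones.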

# Training

We now repeatedly apply the vectorized update rule given above, so that the values of $\theta$ converge towards the optimal values.

The 'learning rate' $\alpha$ scales the gradient, so that each iteration takes only a small step in the right direction.

In [5]:
# learning rate
𝛼 = 0.01

In [6]:
# apply the update rule 100,000 times; len(X) is the number of training samples m
for i in range(100000):
    𝜃 = 𝜃 - 𝛼 * X.T @ (X @ 𝜃 - y) / len(X)
print(𝜃)

[[1.82168849]
 [0.36672844]]
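The same loop can be wrapped in a function that also records the loss, which makes it easier to check that the training actually converges. This is a minimal sketch on synthetic data, not the Wine set; the function name `gradient_descent`, the variables `X_syn` and `y_syn`, and the chosen learning rate and iteration count are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Repeat the vectorized update rule; return theta and the loss per iteration."""
    m = len(X)
    theta = np.zeros((X.shape[1], 1))
    losses = []
    for _ in range(iterations):
        error = X @ theta - y                      # residuals, shape (m, 1)
        losses.append(float((error ** 2).mean()))  # mean squared error before the step
        theta = theta - alpha * X.T @ error / m    # the update rule
    return theta, losses

# Synthetic stand-in data: y = 1 + 2x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))
X_syn = np.hstack([np.ones_like(x), x])            # bias column of ones, then the feature
y_syn = 1 + 2 * x + rng.normal(0, 0.1, size=(200, 1))

theta_syn, losses = gradient_descent(X_syn, y_syn, alpha=0.1, iterations=5000)
print(theta_syn.ravel())       # should end up close to the true values 1 and 2
print(losses[0], losses[-1])   # the loss should have dropped substantially
```

Plotting or printing `losses` is a simple way to judge whether the learning rate is reasonable: a loss that keeps shrinking suggests the steps are small enough, while a growing loss suggests $\alpha$ is too large.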


In [7]:
y_predict = data.train_X @ 𝜃

#### Now compute the mean squared error by using the function `mean_squared_error` on `y_predict` and `data.train_y`. What do you think is in `data.train_y` and what is in `y_predict`? What does the mean squared error tell you?

In [8]:
mean_squared_error( y_predict, data.train_y )

0.5132594452398507
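To see what this number represents, the mean squared error can also be computed by hand. The sketch below assumes that `mean_squared_error` averages the squared differences between estimates and targets, as the scikit-learn function of the same name does; the helper `mse` and the toy values are hypothetical.

```python
import numpy as np

def mse(y_predict, y_true):
    """Mean of the squared differences between estimates and targets."""
    return float(np.mean((y_predict - y_true) ** 2))

# Hypothetical targets and estimates for three samples
y_true_toy = np.array([[5.0], [6.0], [7.0]])
y_pred_toy = np.array([[5.5], [6.0], [6.0]])

# Differences 0.5, 0.0 and -1.0 give squares 0.25, 0.0 and 1.0, so the mean is 1.25 / 3
print(mse(y_pred_toy, y_true_toy))
```

Because the differences are squared, large errors weigh more heavily than small ones, and the result is expressed in squared units of the target.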

When running the code above, you may notice a few things:

- We get an estimate of the optimal coefficients instead of the exact solution.
- It takes a few seconds to converge, which indicates that it may be less efficient for large data sets.
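The first point can be illustrated by comparing Gradient Descent with an exact least-squares solver. This is a sketch on synthetic data, assuming the mean squared error as the loss; it uses `np.linalg.lstsq` as the exact reference, and the data, learning rate and iteration counts are chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in data: y = 3 - x plus a little noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(100, 1))
X_syn = np.hstack([np.ones_like(x), x])   # bias column of ones, then the feature
y_syn = 3 - x + rng.normal(0, 0.1, size=(100, 1))

# Exact least-squares coefficients as a reference
theta_exact, *_ = np.linalg.lstsq(X_syn, y_syn, rcond=None)

alpha = 0.1
theta_gd = np.zeros((2, 1))

# A short run: the estimate is still some distance from the exact solution
for _ in range(200):
    theta_gd = theta_gd - alpha * X_syn.T @ (X_syn @ theta_gd - y_syn) / len(X_syn)
gap_short = float(np.abs(theta_gd - theta_exact).max())

# A much longer run: the estimate moves closer to the exact solution
for _ in range(20000):
    theta_gd = theta_gd - alpha * X_syn.T @ (X_syn @ theta_gd - y_syn) / len(X_syn)
gap_long = float(np.abs(theta_gd - theta_exact).max())

print(gap_short, gap_long)
```

The gap shrinks as the number of iterations grows, which is the trade-off behind both observations: more iterations give a better estimate but cost more time.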