# Gradient descent

To solve the optimization problem

$$
    \mathcal L(\boldsymbol w) \to \min\limits_{\boldsymbol w}
$$

do the following steps:

1. initialize $\boldsymbol w$ with random values (e.g., from $\mathcal N(0, 1)$)
2. choose **tolerance** $\varepsilon > 0$ and **learning rate** $\eta > 0$
3. while $\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon$ do the **gradient step**

    $$
    \boldsymbol w := \boldsymbol w - \eta\nabla\mathcal L(\boldsymbol w)
    $$
4. return $\boldsymbol w$


If the condition $\Vert \nabla\mathcal L(\boldsymbol w) \Vert > \varepsilon$ keeps holding for too long, the loop in step 3 is stopped after a fixed maximum number of iterations `max_iter`.
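The steps above can be sketched for an arbitrary differentiable loss as follows. This is only a minimal illustration: the function name `gradient_descent`, its default parameter values and the toy quadratic loss are assumptions made for this example.

```python
import numpy as np

def gradient_descent(grad, w0, learning_rate=0.1, tol=1e-6, max_iter=10_000):
    """Run gradient descent given a function `grad` that returns the gradient.

    Returns the final point and the number of gradient steps taken.
    """
    w = np.asarray(w0, dtype=float).copy()
    for n_steps in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) <= tol:   # stopping rule from step 3
            return w, n_steps
        w = w - learning_rate * g      # gradient step
    return w, max_iter                 # tolerance was not reached

# Toy problem (assumed for illustration): L(w) = 0.5 * ||w - c||^2,
# whose gradient is w - c and whose minimizer is c.
c = np.array([1.0, -2.0, 3.0])
rng = np.random.default_rng(0)
w_min, n_steps = gradient_descent(lambda w: w - c, rng.normal(size=3))
print(w_min, n_steps)
```

Returning the number of steps makes it easy to tell whether the tolerance was reached or the iteration limit stopped the loop.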

For ordinary linear regression the loss function is proportional to 

$$
    \mathcal L(\boldsymbol w) = \frac 12\Vert\boldsymbol {Xw} - \boldsymbol y \Vert_2^2,
$$

therefore, $\nabla \mathcal L(\boldsymbol w) = \boldsymbol X^\top(\boldsymbol{Xw} - \boldsymbol y)$.
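One way to gain confidence in this gradient formula is to compare it with central finite differences of the loss. The sketch below does this on small random data; the helper names `loss` and `grad`, the `*_chk` variables and the step `h` are assumptions chosen for illustration.

```python
import numpy as np

def loss(w, X, y):
    """0.5 * ||Xw - y||^2."""
    r = X.dot(w) - y
    return 0.5 * r.dot(r)

def grad(w, X, y):
    """Analytic gradient X^T (Xw - y)."""
    return X.T.dot(X.dot(w) - y)

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(20, 3))
y_chk = rng.normal(size=20)
w_chk = rng.normal(size=3)

h = 1e-6
# Central difference along each coordinate direction e.
numeric = np.array([
    (loss(w_chk + h * e, X_chk, y_chk) - loss(w_chk - h * e, X_chk, y_chk)) / (2 * h)
    for e in np.eye(3)
])
max_abs_err = np.max(np.abs(numeric - grad(w_chk, X_chk, y_chk)))
print(max_abs_err)
```

Because the loss is quadratic, the central difference has no truncation error, so the remaining discrepancy should come from floating-point rounding only.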

In [231]:
import numpy as np

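def linear_regression_gd(X, y, learning_rate=0.01, tol=1e-3, max_iter=10000):
    # Random initialization from N(0, 1)
    w = np.random.normal(size=X.shape[1])
    for i in range(max_iter):
        # Gradient of 0.5 * ||Xw - y||^2
        gradient = X.T.dot(X.dot(w) - y)
        # Stop as soon as the gradient norm is within the tolerance
        if np.linalg.norm(gradient) <= tol:
            return w
        # Gradient step
        w = w - learning_rate * gradient
    print("max_iter exceeded")
    return w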

## Compare with analytic solution

As we know, the minimum is attained at the point

$$
     \widehat{\boldsymbol w} = (\boldsymbol X^\top \boldsymbol X)^{-1} \boldsymbol X^\top \boldsymbol y
$$

Now create a random matrix $\boldsymbol X$ and compare these optimal weights with the result of numerical optimization.

In [134]:
n = 50
d = 5
X = 10 * np.random.randn(n, d)
y = 10 * np.random.rand(n) - 5
w_opt = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
w_num = linear_regression_gd(X, y, learning_rate=0.0001, tol=1e-5)
print("Analytic solution:", w_opt)
print("Numeric solution:", w_num)
print("Norm of difference:", np.linalg.norm(w_opt - w_num))

Analytic solution: [ 0.07440762 -0.03416464  0.05172848 -0.01814075  0.00409011]
Numeric solution: [ 0.07440762 -0.03416464  0.05172848 -0.01814075  0.00409011]
Norm of difference: 2.1578831624068235e-09
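As a side note, the analytic solution can also be computed without forming the inverse explicitly. The sketch below uses `np.linalg.lstsq` on freshly generated data of the same shape as above; the variable names `X_demo`, `y_demo` and the fixed seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = 10 * rng.standard_normal((50, 5))
y_demo = 10 * rng.random(50) - 5

# Normal equations with an explicit inverse, as in the cell above
w_inv = np.linalg.inv(X_demo.T.dot(X_demo)).dot(X_demo.T.dot(y_demo))
# Least-squares solver; it returns (solution, residuals, rank, singular values)
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

diff_norm = np.linalg.norm(w_inv - w_lstsq)
print(diff_norm)
```

For a well-conditioned matrix both approaches should agree up to rounding error; a least-squares solver is usually the more robust choice when $\boldsymbol X^\top \boldsymbol X$ is close to singular.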


## Try on some real dataset

Load the diabetes dataset:

In [163]:
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, y = data.data, data.target
X.shape, y.shape

((442, 10), (442,))

The condition number of $\boldsymbol X^\top \boldsymbol X$ is quite small:

In [113]:
np.linalg.cond((X.T.dot(X)))

470.0779993588608
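The condition number is relevant here because, for the quadratic loss above, gradient descent with a constant learning rate converges only if $\eta < 2 / \lambda_{\max}(\boldsymbol X^\top \boldsymbol X)$, and the larger the ratio $\lambda_{\max} / \lambda_{\min}$, the slower the convergence. A minimal sketch of how both quantities could be computed is given below; the helper `step_size_info` and the random test matrix are assumptions for illustration.

```python
import numpy as np

def step_size_info(X):
    """Return (condition number of X^T X, upper bound 2 / lambda_max on the learning rate)."""
    eigvals = np.linalg.eigvalsh(X.T.dot(X))  # eigenvalues in ascending order
    lam_min, lam_max = eigvals[0], eigvals[-1]
    return lam_max / lam_min, 2.0 / lam_max

rng = np.random.default_rng(0)
X_demo = 10 * rng.standard_normal((50, 5))
cond, max_lr = step_size_info(X_demo)
print("condition number:", cond)
print("learning rate must stay below:", max_lr)
```

For a symmetric positive definite matrix the eigenvalue ratio coincides with the value returned by `np.linalg.cond`, which uses the 2-norm by default.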

In [136]:
data.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [137]:
data.target

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [138]:
from sklearn.linear_model import LinearRegression
LR = LinearRegression(fit_intercept=False)
LR.fit(X, y)

In [139]:
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))

intercept: 0.0
coefficients: [ -10.0098663  -239.81564367  519.84592005  324.3846455  -792.17563855
  476.73902101  101.04326794  177.06323767  751.27369956   67.62669218]
r-score: -3.3852947912492795
MSE: 26004.29335112887
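The negative score is not an error. The `score` method of `LinearRegression` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$, which drops below zero whenever the model fits worse than a constant prediction equal to the mean of the targets. That can easily happen without an intercept when the targets are far from zero. The sketch below reproduces the effect on toy numbers; the helper `r2_score_manual` and the data are assumptions for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_toy = np.array([150.0, 152.0, 154.0])
r2_mean = r2_score_manual(y_toy, np.full(3, 152.0))  # always predict the mean
r2_zero = r2_score_manual(y_toy, np.zeros(3))        # ignore the offset entirely
print(r2_mean, r2_zero)
```

Predicting the mean gives a score of zero, while predictions that miss the offset of the targets give a strongly negative score.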


In [153]:
w = linear_regression_gd(X, y, learning_rate=1e-3, max_iter=1000000)
print("numerical weights:", w)
print("MSE:", np.mean((X.dot(w) - y)**2))

max_iter exceeded
numerical weights: [ -10.00919583 -239.81489142  519.84761495  324.38398331 -792.02977464
  476.62328039  100.97803062  177.04461952  751.21934985   67.62722868]
MSE: 26004.293351946846


In [154]:
np.linalg.norm(w - LR.coef_)

0.20550722930804544

Add an intercept:

In [164]:
LR = LinearRegression(fit_intercept=True)
LR.fit(X, y)

In [165]:
print("intercept:", LR.intercept_)
print("coefficients:", LR.coef_)
print("r-score:", LR.score(X, y))
print("MSE:", np.mean((LR.predict(X) - y) ** 2))

intercept: 152.13348416289597
coefficients: [ -10.0098663  -239.81564367  519.84592005  324.3846455  -792.17563855
  476.73902101  101.04326794  177.06323767  751.27369956   67.62669218]
r-score: 0.5177484222203498
MSE: 2859.6963475867506


Use gradient descent to find weights close to optimal:

In [166]:
X_with_ones = np.hstack([np.ones(X.shape[0])[:, None], X])

In [176]:
w = linear_regression_gd(X_with_ones, y, learning_rate=1e-3, max_iter=1000000)
print("bias:", w[0])
print("numerical weights:", w[1:])
print("MSE:", np.mean((X_with_ones.dot(w) - y)**2))

max_iter exceeded
bias: 152.13348416289597
numerical weights: [ -10.00919514 -239.81489065  519.84761669  324.38398263 -792.02962482
  476.62316151  100.97796361  177.04460039  751.21929402   67.62722923]
MSE: 2859.6963484064113


In [177]:
np.linalg.norm(w[1:] - LR.coef_)

0.2057183048623492