# Linear regression

## Normal equation

We have $\textbf{x}^{(i)} \in \mathbb{R}^p$, $y_i \in \mathbb{R}$ and $\textbf{w} \in \mathbb{R}^p$.

We would like to minimize $\sum_{i=1}^n (y_i - \textbf{w}^T \textbf{x}^{(i)})^2$

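Let $\textbf{X} \in \mathbb{R}^{n \times p}$ be the matrix whose $i$-th row is ${\textbf{x}^{(i)}}^T$ and $\textbf{y} \in \mathbb{R}^n$ the vector of targets. In matrix notation the objective is $(\textbf{y} - \textbf{X} \textbf{w})^T (\textbf{y} - \textbf{X} \textbf{w})$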
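
As a quick sanity check, the sketch below compares the summation form and the matrix form of the objective on random data. The helper names `cost_sum` and `cost_matrix`, the `_demo` variables, and the data sizes are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def cost_sum(X, y, w):
    """Sum of squared residuals, accumulated one sample at a time."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        total += (y_i - np.dot(w, x_i)) ** 2
    return total

def cost_matrix(X, y, w):
    """Same cost written as (y - Xw)^T (y - Xw)."""
    r = y - X @ w
    return r @ r

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5, 3))
y_demo = rng.standard_normal(5)
w_demo = rng.standard_normal(3)

# Both forms should agree up to floating-point rounding.
print(np.isclose(cost_sum(X_demo, y_demo, w_demo),
                 cost_matrix(X_demo, y_demo, w_demo)))
```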

Distribute the first term to get:

$(\textbf{y} - \textbf{X} \textbf{w})^T \textbf{y} - (\textbf{y} - \textbf{X} \textbf{w})^T \textbf{X} \textbf{w}$

$(A + B)^T = A^T + B^T$ yields:

$(\textbf{y}^T - (\textbf{X} \textbf{w})^T) \textbf{y} - (\textbf{y}^T - (\textbf{X} \textbf{w})^T) \textbf{X} \textbf{w}$

Distribute terms again:

$\textbf{y}^T \textbf{y} - (\textbf{X} \textbf{w})^T \textbf{y} - \textbf{y}^T \textbf{X} \textbf{w} + (\textbf{X} \textbf{w})^T \textbf{X} \textbf{w}$

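$\textbf{y}^T \textbf{X} \textbf{w}$ is a scalar, so it equals its own transpose $(\textbf{X} \textbf{w})^T \textbf{y}$ (the identity $A^T B = B^T A$ holds for vectors, not for matrices in general):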

$\textbf{y}^T \textbf{y} - (\textbf{X} \textbf{w})^T \textbf{y} - (\textbf{X} \textbf{w})^T \textbf{y} + (\textbf{X} \textbf{w})^T \textbf{X} \textbf{w}$

Using $(AB)^T = B^T A^T$ and collecting terms:

$\textbf{y}^T \textbf{y} - 2 \textbf{w}^T \textbf{X}^T \textbf{y} + \textbf{w}^T \textbf{X}^T \textbf{X} \textbf{w}$

Take the derivative with respect to $\textbf{w}$, using $\frac{\partial \textbf{w}^T \textbf{A}}{\partial \textbf{w}} = \textbf{A}$ and, for a symmetric matrix $\textbf{A}$, $\frac{\partial \textbf{w}^T \textbf{A} \textbf{w}}{\partial \textbf{w}} = 2 \textbf{A} \textbf{w}$ (here $\textbf{X}^T \textbf{X}$ is symmetric), and set it to 0:

$- 2 \textbf{X}^T \textbf{y} + 2 \textbf{X}^T \textbf{X} \textbf{w} = 0$
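
The gradient expression can be checked numerically. The following is a minimal sketch that compares it against central finite differences on random data; the function names `sse`, `sse_grad` and `numeric_grad`, the step size `eps`, and the tolerances are assumptions chosen for illustration.

```python
import numpy as np

def sse(X, y, w):
    """Sum of squared errors (y - Xw)^T (y - Xw)."""
    r = y - X @ w
    return r @ r

def sse_grad(X, y, w):
    """Analytic gradient: -2 X^T y + 2 X^T X w."""
    return -2 * X.T @ y + 2 * X.T @ X @ w

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X_chk = rng.standard_normal((20, 3))
y_chk = rng.standard_normal(20)
w_chk = rng.standard_normal(3)

analytic = sse_grad(X_chk, y_chk, w_chk)
numeric = numeric_grad(lambda v: sse(X_chk, y_chk, v), w_chk)
print(np.allclose(analytic, numeric, rtol=1e-4, atol=1e-4))
```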

Rearranging:

$\textbf{X}^T \textbf{X} \textbf{w} = \textbf{X}^T \textbf{y}$

If $\textbf{X}^T \textbf{X}$ is invertible, then:

$\textbf{w} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y}$

In [1]:
import numpy as np

np.random.seed(0)
n = 256
p = 2
X = np.random.randn(n, p)
w = np.random.randn(p, 1)
b = np.random.randn(1, 1)
y = np.dot(X, w) + b

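# Append a column of ones so the bias b is estimated as an extra weight.
X1 = np.concatenate((X, np.ones((n, 1))), axis=1)
# Normal equation: wb = (X1^T X1)^{-1} X1^T y
wb_hat = np.dot(np.linalg.inv(np.dot(X1.T, X1)), np.dot(X1.T, y))
# Split the solution back into weights (first p rows) and bias (last row).
w_hat_eq = wb_hat[:p, :]
b_hat_eq = wb_hat[p, 0]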

print(w_hat_eq)
print(b_hat_eq)
np.testing.assert_almost_equal(w_hat_eq, w)
np.testing.assert_almost_equal(b_hat_eq, b)

[[-1.33221165]
 [-1.96862469]]
-0.660056320134083
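
Forming the explicit inverse mirrors the formula, but solving the linear system directly is usually preferred numerically. Below is a minimal sketch of two alternatives: `np.linalg.solve` applied to the normal equations and `np.linalg.lstsq` applied to the original least-squares problem. The synthetic data is regenerated so the block is self-contained, and the `_demo`, `w_true` and `b_true` names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo, p_demo = 256, 2
X_demo = rng.standard_normal((n_demo, p_demo))
w_true = rng.standard_normal((p_demo, 1))
b_true = rng.standard_normal()
y_demo = X_demo @ w_true + b_true  # noise-free targets

# Append a column of ones so the bias is estimated as the last weight.
X1_demo = np.hstack([X_demo, np.ones((n_demo, 1))])

# Option 1: solve X1^T X1 wb = X1^T y without forming the inverse.
wb_solve = np.linalg.solve(X1_demo.T @ X1_demo, X1_demo.T @ y_demo)

# Option 2: least-squares solver applied to X1 wb ~= y directly.
wb_lstsq, *_ = np.linalg.lstsq(X1_demo, y_demo, rcond=None)

print(np.allclose(wb_solve, wb_lstsq))
```

Both routes should recover the generating parameters here because the targets contain no noise.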


## SGD

### Gradient of the cost function

#### Unvectorized

$\textbf{x}^{(i)} \in \mathbb{R}^p$

$y_i \in \mathbb{R}$

$\textbf{w} \in \mathbb{R}^p$

$b \in \mathbb{R}$

$J(w, b) = \frac{1}{n} \sum_{i=1}^n \left(\textbf{w}^T \textbf{x}^{(i)} + b - y_i \right)^2 = \frac{1}{n} \sum_{i=1}^n \left(w_1 x_1^{(i)} + \dots + w_p x_p^{(i)} + b - y_i \right)^2$

$\frac{\partial J}{\partial w_j} = \frac{2}{n} \sum_{i=1}^n \left(\textbf{w}^T \textbf{x}^{(i)} + b - y_i \right) x_j^{(i)}$

$\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^n \left(\textbf{w}^T \textbf{x}^{(i)} + b - y_i \right)$
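
The partial derivatives of $J$ can be computed literally with two nested loops. The sketch below does so, using as residual the term inside the square in $J$, that is $\textbf{w}^T \textbf{x}^{(i)} + b - y_i$. The function name `grad_loops` and the `_demo` data are assumptions for illustration.

```python
import numpy as np

def grad_loops(X, y, w, b):
    """Gradient of J(w, b) = (1/n) * sum_i (w^T x_i + b - y_i)^2 via explicit loops."""
    n, p = X.shape
    dw = np.zeros(p)
    db = 0.0
    for i in range(n):
        # Residual of sample i: the term inside the square in J.
        r_i = np.dot(w, X[i]) + b - y[i]
        for j in range(p):
            dw[j] += 2.0 / n * r_i * X[i, j]
        db += 2.0 / n * r_i
    return dw, db

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((10, 3))
w_demo = rng.standard_normal(3)
b_demo = 0.5
y_demo = X_demo @ w_demo + b_demo  # noise-free targets

# At the parameters that generated the data, the gradient vanishes.
dw_demo, db_demo = grad_loops(X_demo, y_demo, w_demo, b_demo)
print(np.allclose(dw_demo, 0.0), np.isclose(db_demo, 0.0))
```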

#### Vectorized

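Write $\textbf{1} \in \mathbb{R}^n$ for the all-ones vector. Following the same steps as in the "Normal equation" section, with $\textbf{y}$ replaced by $\textbf{y} - b \textbf{1}$ and the extra factor $\frac{1}{n}$:

$\frac{\partial J}{\partial \textbf{w}} = \frac{2}{n} \left( -\textbf{X}^T (\textbf{y} - b \textbf{1}) + \textbf{X}^T \textbf{X} \textbf{w} \right) = \frac{2}{n} \textbf{X}^T (\textbf{X}\textbf{w} + b \textbf{1} - \textbf{y})$

$\frac{\partial J}{\partial b} = \frac{2}{n} \textbf{1}^T (\textbf{X}\textbf{w} + b \textbf{1} - \textbf{y})$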
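
In NumPy the vectorized gradient is a couple of matrix operations. The sketch below includes the bias term in the residual and uses column-vector shapes like the rest of this notebook. The function name `grad_vectorized` and the `_demo` data are assumptions for illustration.

```python
import numpy as np

def grad_vectorized(X, y, w, b):
    """Gradient of the mean squared error with respect to w (p, 1) and b (scalar)."""
    n = X.shape[0]
    residual = X @ w + b - y            # shape (n, 1); b broadcasts over rows
    dw = (2.0 / n) * (X.T @ residual)   # shape (p, 1)
    db = 2.0 * residual.mean()          # equals (2/n) * sum of residuals
    return dw, db

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((10, 3))
y_demo = rng.standard_normal((10, 1))
w_demo = rng.standard_normal((3, 1))
b_demo = 0.5

dw_demo, db_demo = grad_vectorized(X_demo, y_demo, w_demo, b_demo)
print(dw_demo.shape)
```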


### Implementation

In [13]:
def train_sgd(X, y):
    n, p = X.shape

    np.random.seed(0)
    w = np.random.randn(p, 1) * 0.01
    b = np.zeros((1, 1))

    lr = 0.01
    batch_size = 8
    num_epochs = 100
    for _ in range(num_epochs):
        start = 0
        end = batch_size
        while end <= n:
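            X_batch = X[start:end, :]
            y_batch = y[start:end, :]
            # Forward pass and gradients use the current mini-batch only.
            y_hat = np.dot(X_batch, w) + b
            mse = np.mean((y_batch - y_hat)**2)  # batch loss, kept for monitoring
            dw = (2. / batch_size) * np.dot(X_batch.T, y_hat - y_batch)
            db = 2. * np.mean(y_hat - y_batch)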
            w = w - lr * dw
            b = b - lr * db
            start = end
            end = start + batch_size
    return w, b

w_hat_sgd, b_hat_sgd = train_sgd(X, y)

print(w_hat_sgd)
print(b_hat_sgd)
np.testing.assert_almost_equal(w_hat_sgd, w)
np.testing.assert_almost_equal(b_hat_sgd, b)

[[-1.33221165]
 [-1.96862469]]
[[-0.66005632]]
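
The training loop above steps through the batch indices in the same fixed order every epoch. A common variant reshuffles the samples at the start of each epoch. The sketch below is one way to write it, not the notebook's method: the function name `train_sgd_shuffled`, its default hyperparameters, and the use of `np.random.default_rng` are all assumptions made for this illustration.

```python
import numpy as np

def train_sgd_shuffled(X, y, lr=0.01, batch_size=8, num_epochs=100, seed=0):
    """Mini-batch SGD for linear regression with per-epoch shuffling."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((p, 1)) * 0.01
    b = 0.0
    for _ in range(num_epochs):
        order = rng.permutation(n)  # new sample order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            residual = X_batch @ w + b - y_batch
            # Gradients of the mean squared error on this batch.
            dw = (2.0 / len(idx)) * (X_batch.T @ residual)
            db = 2.0 * residual.mean()
            w = w - lr * dw
            b = b - lr * db
    return w, b

rng = np.random.default_rng(1)
X_demo = rng.standard_normal((256, 2))
w_true = rng.standard_normal((2, 1))
b_true = rng.standard_normal()
y_demo = X_demo @ w_true + b_true  # noise-free targets

w_fit, b_fit = train_sgd_shuffled(X_demo, y_demo)
print(np.allclose(w_fit, w_true, atol=1e-6), np.isclose(b_fit, b_true, atol=1e-6))
```

Because the synthetic targets contain no noise, every mini-batch gradient is zero at the generating parameters, so the iterates can settle on them rather than hover around them.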

