# Chapter.03 Regression
---

3.2.6. Recursive least squares<br>
Recursive least squares (RLS) answers the question: what if additional training examples become available after a model has already been fitted? Instead of re-solving the whole problem from scratch, RLS solves the least squares problem recursively. That is, it finds the new regression model

$$ 
\mathbf{w}(n) = \arg\min_{\mathbf{w}} 
\left|\left|
    \begin{bmatrix}
        \mathbf{y}(1) \\
        \mathbf{y}(2) \\
        \vdots \\
        \mathbf{y}(n) \\
    \end{bmatrix}
    -
    \begin{bmatrix}
        X(1) \\
        X(2) \\
        \vdots \\
        X(n) \\
    \end{bmatrix} \mathbf{w}
\right|\right| ^2
$$

in terms of the new training sample $ (X(n), \, \mathbf{y}(n)) $ and the previous model<br>

$$ 
\mathbf{w}(n-1) = \arg\min_{\mathbf{w}} 
\left| \left|
    \begin{bmatrix}
        \mathbf{y}(1) \\
        \mathbf{y}(2) \\
        \vdots \\
        \mathbf{y}(n-1) \\
    \end{bmatrix}
    -
    \begin{bmatrix}
        X(1) \\
        X(2) \\
        \vdots \\
        X(n-1) \\
    \end{bmatrix} \mathbf{w}
\right| \right| ^2
$$

<br>

Suppose $ P(i) = \left( \begin{bmatrix} X(1) \\ \vdots \\ X(i) \end{bmatrix}^T \begin{bmatrix} X(1) \\ \vdots \\ X(i) \end{bmatrix} \right)^{-1}. $ Then<br>
$ ^\forall i \in \mathbb{N}, \quad \mathbf{w}(i) = \mathbf{w}(i-1) + P(i) X^T(i) ( \mathbf{y}(i) - X(i) \mathbf{w}(i-1) ) $<br>
$ \qquad \text{where} \qquad P(i) = P(i-1) - P(i-1) X^T(i) (I + X(i)P(i-1)X^T(i))^{-1} X(i)P(i-1) $<br><br>

<strong>Proof.</strong><br>
[PDF file (Too long)](./res/ch03/note_recursive_linear_regression.pdf)  $\blacksquare$<br><br>
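
The recursion can be checked numerically. The following is a minimal NumPy sketch, not the derivation from the PDF: it assumes $P(i)$ is the inverse of the accumulated Gram matrix, initialises $\mathbf{w}$ and $P$ from the first batch with ordinary least squares, and compares the recursive result with a batch solve over all data. The helper name `rls_update` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def rls_update(w, P, X_new, y_new):
    """One recursive least squares step for a new batch (X_new, y_new)."""
    # Matrix inversion lemma: update P without re-inverting the full Gram matrix
    S = np.eye(X_new.shape[0]) + X_new @ P @ X_new.T
    P_new = P - P @ X_new.T @ np.linalg.solve(S, X_new @ P)
    # Correct the previous model by the prediction error on the new batch
    w_new = w + P_new @ X_new.T @ (y_new - X_new @ w)
    return w_new, P_new

# Synthetic data (assumption): three batches X(1), X(2), X(3) with m = 3 features
rng = np.random.default_rng(0)
m = 3
w_true = np.array([1.0, -2.0, 0.5])
batches = [rng.normal(size=(n_i, m)) for n_i in (5, 4, 6)]
targets = [X @ w_true + 0.1 * rng.normal(size=X.shape[0]) for X in batches]

# Initialise from the first batch with ordinary least squares
X0, y0 = batches[0], targets[0]
P = np.linalg.inv(X0.T @ X0)
w = P @ X0.T @ y0

# Fold in the remaining batches one at a time
for X_i, y_i in zip(batches[1:], targets[1:]):
    w, P = rls_update(w, P, X_i, y_i)

# Batch least squares on all data, for comparison
X_all = np.vstack(batches)
y_all = np.concatenate(targets)
w_batch, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
```

Each update only inverts an $N_i \times N_i$ matrix, which is where the $N_{max}$ in the complexity of the method comes from.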

3.2.7. Regularized least squares<br>

Consider the following error function :<br>
$$ J(\mathbf{w}) = E(\mathbf{w}) + \lambda R(\mathbf{w}) $$
- $E$ is called the error term and $R$ is called the regularization term.
- $\lambda$ is called the regularization coefficient.

In the least squares context, the regularized problem is
$$ \min_{\mathbf{w}} \{ J(\mathbf{w}) = \frac{1}{2}|| \mathbf{y} - X \mathbf{w} ||^2 + \frac{\lambda}{2} R(\mathbf{w}) \} $$

- Regularization makes the model less prone to overfitting.
- Usually the regularizer is a norm of the weight vector $\mathbf{w}$:
$$ R(\mathbf{w}) = || \mathbf{w} ||_{q}^{q} $$

The cost function with a more general regularizer is 
$$ J(\mathbf{w}) = \frac{1}{2} \sum_{i = 1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \frac{\lambda}{2} \sum_{j = 1}^{m} |w_j|^q $$

<img src="./res/ch03/fig_3_4.png" width="600" height="200"><br>

<div align="center">
  Figure.3.2.2
</div>


<br>

The figure above shows that the Lasso ($q = 1$) tends to produce sparser solutions (most of the $w_i$ are zero or nearly zero) than a quadratic regularizer ($q = 2$). The regularized problem can also be viewed as an optimization problem with a constraint (cf. the Lagrange dual problem).
<br><br>

Quadratic regularized LS : 
$$ \min_{\mathbf{w}} \{ J(\mathbf{w}) = \frac{1}{2} || \mathbf{y} - X\mathbf{w} ||^2 + \frac{\lambda}{2} || \mathbf{w} ||^2 \} $$

Regularized LS solution : 
$$ \mathbf{w} = (X^TX + \lambda I )^{-1} X^T \mathbf{y} $$

- $ X^T X + \lambda I $ is invertible for any $\lambda > 0$, even if $X^T X$ is not invertible.
- Regularized LS can be used regardless of $N$ and $m$.
<br>

<strong>Proof.</strong><br>
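Setting $\nabla J(\mathbf{w}) = -X^T(\mathbf{y} - X\mathbf{w}) + \lambda \mathbf{w} = 0$ gives $(X^TX + \lambda I)\mathbf{w} = X^T\mathbf{y}$. Since $X^TX$ is positive semidefinite, $X^TX + \lambda I$ is positive definite for $\lambda > 0$ and hence invertible. $\blacksquare$<br><br>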
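
As a small numerical illustration of the two bullets above, the sketch below solves the quadratic regularized problem with fewer samples than features, so that $X^TX$ is singular while $X^TX + \lambda I$ is not. The function name `ridge_fit`, the data, and the value of `lam` are assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(0)
N, m = 4, 6                      # N < m, so X^T X (m x m) has rank at most N
X = rng.normal(size=(N, m))
y = rng.normal(size=N)
lam = 0.1

rank = np.linalg.matrix_rank(X.T @ X)   # smaller than m: X^T X is singular
w = ridge_fit(X, y, lam)

# Gradient of J at the solution; it should vanish up to rounding error
grad = -X.T @ (y - X @ w) + lam * w
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a common numerical choice; both correspond to the same formula.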

$q$-norm regularized LS : 
$$ \min_{\mathbf{w}} \{ J(\mathbf{w}) = \frac{1}{2} || \mathbf{y} - X\mathbf{w} ||^2 + \frac{\lambda}{2} || \mathbf{w} ||_q^q \} $$

Solution : for $q \neq 2$ there is in general no closed-form solution, so the problem is solved numerically. For $q = 2$ it reduces to the quadratic case above.

- The above optimization problem is convex for $q \ge 1$.
- It can be solved efficiently via <strong>convex optimization techniques,</strong> e.g. the interior point method, at a computational complexity of $O((\max\{N, m\})^3)$ : [Interior point method](https://en.wikipedia.org/wiki/Interior-point_method)
- Gradient descent based methods can also be used.
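
As one example of a gradient descent based method, the sketch below handles the case $q = 1$ with ISTA (iterative soft thresholding), a proximal gradient method. This is a technique chosen here for illustration, not one named in these notes. It takes a gradient step on the error term and then applies soft thresholding, which is the proximal operator of $\frac{\lambda}{2} ||\mathbf{w}||_1$. The step size $1 / ||X||_2^2$, the iteration count, and the synthetic data are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, y, w, lam):
    """J(w) = 1/2 ||y - Xw||^2 + lam/2 ||w||_1."""
    return 0.5 * np.sum((y - X @ w) ** 2) + 0.5 * lam * np.sum(np.abs(w))

def lasso_ista(X, y, lam, n_iter=500):
    # Step size 1 / (largest eigenvalue of X^T X) keeps the iteration stable
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ w)                # gradient of the error term
        w = soft_threshold(w - step * grad, step * lam / 2.0)
    return w

# Synthetic data (assumption): only 3 of the 10 true weights are non-zero
rng = np.random.default_rng(0)
N, m = 50, 10
w_true = np.zeros(m)
w_true[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(N, m))
y = X @ w_true + 0.1 * rng.normal(size=N)

lam = 10.0
w_lasso = lasso_ista(X, y, lam)
n_zero = int(np.sum(w_lasso == 0.0))             # number of exactly-zero weights
```

Soft thresholding sets small entries exactly to zero, which is one way to see why the $q = 1$ regularizer favours sparse solutions.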

3.2.8. Comparisons<br>
Let $L$ be the number of iterations in the gradient descent based methods, and let $N_{max} = \max_i N_i$, where $X(i)$ is $N_i \times m$.

| Batch gradient descent | Stochastic gradient descent |     Least squares     |  Recursive least squares  | 
|------------------------|-----------------------------|-----------------------|---------------------------|
| $O(LNm)$               | $O(Lm)$                     | $O((\max\{N, m\})^3)$ |$O((\max\{N_{max}, m\})^3)$|
<br>

- All the methods yield the same optimal performance
- The computational complexity differs depending on the situation
- If the initial point can be chosen to be very close to the optimal point, then the gradient descent methods are the most efficient
- Otherwise, the least squares may be more efficient
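
The first bullet can be checked numerically for batch gradient descent. The sketch below is a minimal comparison under assumed synthetic data: it runs $L$ gradient steps with step size $1 / ||X||_2^2$ (an assumed choice that guarantees convergence for this quadratic cost) and compares the result with the direct least squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 100, 3
X = rng.normal(size=(N, m))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# Least squares: one direct solve
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch gradient descent: each of the L iterations costs O(N m)
L = 300
step = 1.0 / np.linalg.norm(X, 2) ** 2
w_gd = np.zeros(m)
for _ in range(L):
    grad = -X.T @ (y - X @ w_gd)     # gradient of 1/2 ||y - Xw||^2
    w_gd = w_gd - step * grad

# Distance between the two solutions
gap = np.linalg.norm(w_gd - w_ls)
```

How many iterations are needed depends on the starting point and on the conditioning of $X^TX$, which is why the cheaper method differs from case to case.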

3.2.9. Linear regression with basis functions<br>
$$ \hat{y}_i = \mathbf{w}^T \mathbf{\phi}(\mathbf{x}_i) \quad \text{where} \quad \mathbf{\phi} = [\phi_1, \cdots, \phi_m]^T $$
- The components $\phi_i$ of $\mathbf{\phi}(\mathbf{x})$ are known as basis functions.
- Linear basis functions : $^\forall i, \,\ \phi_i(\mathbf{x}) = x_i$
- Polynomial basis functions : $ \phi_i(x) = x^i $<br>
    These are global functions (a small change in $x$ affects all basis functions).
- Gaussian basis functions : $ \phi_i(x) = \exp \{ - \frac{(x- \mu_i)^2}{2s^2} \} $<br>
    These are local (a small change in $x$ only affects nearby basis functions).
- Sigmoid basis functions : $ \phi_i(x) = \sigma(\frac{x-\mu_i}{s}) $<br>
    These are also local. $\mu_i$ and $s$ control the location and scale (slope).

<br><br>
The cost function is the sum-of-squares error 
$$ E(\mathbf{w}) = \frac{1}{2} \sum_{i = 1}^{N} \{ y_i - \mathbf{w}^T \mathbf{\phi}(\mathbf{x}_i) \}^2 $$

Solution : 
$$ 
\mathbf{w}_{\text{LS}} = \Phi^+ \mathbf{y} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}
\qquad \text{where} \qquad 
\Phi = 
\begin{bmatrix}
\phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \cdots & \phi_m(\mathbf{x}_1) \\
\phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \cdots & \phi_m(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \cdots & \phi_m(\mathbf{x}_N) \\
\end{bmatrix}_{N \times m}
$$
- In practice, $N > m$.
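
To make the construction of $\Phi$ concrete, here is a minimal sketch with Gaussian basis functions on one-dimensional inputs. The centers $\mu_j$, the width $s$, the target function, and the helper name `gaussian_design_matrix` are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_design_matrix(x, centers, s):
    """Phi[i, j] = exp(-(x_i - mu_j)^2 / (2 s^2)), shape (N, m)."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2))

# Synthetic 1-D regression data (assumption): noisy samples of a sine curve
rng = np.random.default_rng(0)
N, m = 50, 6
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)

centers = np.linspace(0.0, 1.0, m)   # mu_1, ..., mu_m spread over the input range
s = 0.2                              # common width of the basis functions

Phi = gaussian_design_matrix(x, centers, s)

# w_LS = Phi^+ y, with Phi^+ the Moore-Penrose pseudo-inverse
w_ls = np.linalg.pinv(Phi) @ y
y_hat = Phi @ w_ls

# At the least squares solution the residual is orthogonal to the columns of Phi
normal_eq_residual = Phi.T @ (y - y_hat)
```

The model stays linear in $\mathbf{w}$ even though it is nonlinear in $x$, so the same least squares machinery applies with $X$ replaced by $\Phi$.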

3.2.10. Proper step size<br>
When the number of training samples is very large$(N \rightarrow \infty)$, the least squares solution becomes
