# Chapter.03 Regression
---

### 3.1. Regressive and approximated models
3.1.1. General regressive model

<img src="./res/ch03/fig_3_1.png" width="700" height="200"><br>
<div align="center">
  Figure.3.1.1
</div>
<br>

3.1.2. Approximated model

<img src="./res/ch03/fig_3_2.png" width="700" height="110"><br>
<div align="center">
  Figure.3.1.2
</div>
<br>

The goal is to find a function $\hat{f}(\mathbf{x}; D)$ that approximates the true function $f(\mathbf{x})$ as closely as possible in terms of the mean squared error (MSE) between them, using a learning algorithm applied to a training dataset (sample) $D$.

### 3.2. Linear Regression
3.2.1. Linearly approximated model<br>

<img src="./res/ch03/fig_3_3.png" width="850" height="130"><br>
<div align="center">
  Figure.3.1.3
</div>
<br>
Linear regression techniques aim to find a linear function $\hat{f}(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ that approximates the true function $f(\mathbf{x})$ as closely as possible in terms of the mean squared error (MSE) between them, based on a training dataset (sample).
<br>

3.2.2. Hypothesis<br>
$$ \hat{y} = \sum_{i=1}^{m-1} w_i x_i + b = \mathbf{w}^T \mathbf{x} \quad \text{where} \quad \mathbf{w} = [w_1, \,\ \cdots, \,\ w_{m-1}, \,\ b]^T, \,\ \mathbf{x} = [x_1, \,\ \cdots, \,\ x_{m-1}, \,\ 1]^T $$
- $y$ : Target (or label)
- $\hat{y}$ : Output of model
- $w_i$ : Weights
- $b$ : Bias
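
The hypothesis above can be sketched in NumPy as follows. The helper names `augment` and `predict` and the example numbers are hypothetical; the only idea taken from the text is that appending a constant 1 to $\mathbf{x}$ lets the bias $b$ sit in the last entry of $\mathbf{w}$.

```python
import numpy as np

def augment(X_raw):
    """Append a constant 1 to every input row so the bias becomes the last weight."""
    X_raw = np.atleast_2d(X_raw)
    ones = np.ones((X_raw.shape[0], 1))
    return np.hstack([X_raw, ones])

def predict(w, X_raw):
    """Evaluate y_hat = w^T x for each row of X_raw (bias folded into w)."""
    return augment(X_raw) @ w

# Hypothetical example: weights [2, -1] and bias 0.5, so m = 3
w = np.array([2.0, -1.0, 0.5])
y_hat = predict(w, np.array([[1.0, 3.0]]))  # 2*1 - 1*3 + 0.5 = -0.5
```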

3.2.3. Linear regression problem<br>
Given the training set, optimize the parameters $\mathbf{w}$ to minimize the least-squares error :
$$ \min_{\mathbf{w}} \{ J(\mathbf{w}) = \frac{1}{2} \sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2 \} $$
$J(\mathbf{w})$ is a convex quadratic function.

3.2.4. Learning algorithm : A numerical approach<br>
1) Gradient descent algorithm

$$ \mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} $$
$\alpha$ is the learning rate.<br>

2) Gradient calculation

$$ 
\begin{align*}
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} &= \frac{\partial}{\partial \mathbf{w}} \frac{1}{2} \sum_i (\mathbf{w}^T \mathbf{x}_i - y_i)^2 \\
                                                   &= \sum_i (\mathbf{w}^T \mathbf{x}_i - y_i) \frac{\partial}{\partial \mathbf{w}} (\mathbf{w}^T \mathbf{x}_i - y_i) \\
                                                   &= \sum_i (\mathbf{w}^T \mathbf{x}_i - y_i) \mathbf{x}_i \\
\end{align*}                                          
$$<br>
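
A quick way to gain confidence in this gradient is to compare it with finite differences. The sketch below assumes the examples $\mathbf{x}_i^T$ are stacked as the rows of a matrix `X`, so the sum becomes `X.T @ (X @ w - y)`. The random data and the function names are hypothetical.

```python
import numpy as np

def cost(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - w^T x_i)^2, with the x_i as rows of X."""
    r = X @ w - y
    return 0.5 * r @ r

def grad(w, X, y):
    """dJ/dw = sum_i (w^T x_i - y_i) x_i, written as X^T (Xw - y)."""
    return X.T @ (X @ w - y)

# Hypothetical random data, used only to check the formula numerically
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (cost(w + eps * e, X, y) - cost(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_diff = np.max(np.abs(numeric - grad(w, X, y)))
```

If the analytic gradient is right, `max_abs_diff` should be close to zero (limited only by floating-point rounding).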

3) Batch learning algorithm<br>
Perform each gradient descent step over the whole training set.<br>
<strong>Repeat until convergence : </strong>
$$ \mathbf{w} \leftarrow \mathbf{w} - \alpha \sum_i (\mathbf{w}^T \mathbf{x}_i - y_i) \mathbf{x}_i $$
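
A minimal sketch of the batch update, assuming NumPy. The stopping rule (step norm below `tol`), the default `alpha`, and the synthetic noiseless data are assumptions made for illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, tol=1e-10, max_iter=10_000):
    """Repeat w <- w - alpha * sum_i (w^T x_i - y_i) x_i until the step is tiny."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        step = alpha * (X.T @ (X @ w - y))  # gradient over the whole training set
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w

# Hypothetical noiseless data generated from y = 2*x1 - 3*x2 + 1
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])  # last column is the bias input
w_true = np.array([2.0, -3.0, 1.0])
y = X @ w_true
w_hat = batch_gradient_descent(X, y)
```

Because the gradient is a sum over all examples, a suitable `alpha` shrinks as the training set grows; dividing the gradient by $N$ is a common way to make the choice less sensitive to the dataset size.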

4) Online learning algorithm<br>
Perform each gradient descent step on a single training example.<br>
<strong>Repeat until convergence : </strong>
$$
\begin{align*}
\text{For } \,\ i &= 1 \,\ \text{ to } \,\  N :  \\
& \mathbf{w} \leftarrow \mathbf{w} - \alpha (\mathbf{w}^T \mathbf{x}_i - y_i) \mathbf{x}_i 
\end{align*}
$$
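
The online rule can be sketched as below, with one update per training example instead of a sum over the whole set. The function name `lms`, the fixed number of epochs, and the noiseless synthetic data are assumptions. The data is noiseless so that the exact weights are known and recoverable.

```python
import numpy as np

def lms(X, y, alpha=0.05, epochs=200):
    """Online update: w <- w - alpha * (w^T x_i - y_i) * x_i, one example at a time."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = w @ x_i - y_i      # error on this single example
            w = w - alpha * error * x_i
    return w

# Hypothetical noiseless data generated from y = 2*x1 - 3*x2 + 1
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])
w_true = np.array([2.0, -3.0, 1.0])
y = X @ w_true
w_online = lms(X, y)
```

With noisy targets and a constant learning rate, the online iterate generally keeps fluctuating around the least-squares solution rather than settling exactly on it; a decaying learning rate is a common remedy.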

<br><br>
With a suitably small learning rate, these algorithms converge to the global optimum, because $J(\mathbf{w})$ is convex and quadratic.<br>
The size of each update depends on the error: a small error gives a small update, and a large error gives a large update.<br>
This is called the Widrow-Hoff (or LMS) learning rule.<br><br>

With a large learning rate, the updates are unstable (zigzagging).<br>
With a small learning rate, convergence is slow.<br>
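
The effect of the learning rate can be seen on a toy problem. The sketch below uses a hypothetical 1-D dataset with $X^T X = 5$ and true weight $w^\* = 3$, for which the batch update multiplies the error $w - w^\*$ by $(1 - 5\alpha)$ at every step.

```python
import numpy as np

def error_history(alpha, steps=20):
    """Track |w - w*| for batch gradient descent on a 1-D toy problem."""
    X = np.array([[1.0], [2.0]])   # hypothetical inputs, X^T X = 5
    y = np.array([3.0, 6.0])       # generated by w* = 3
    w = np.zeros(1)
    history = []
    for _ in range(steps):
        w = w - alpha * (X.T @ (X @ w - y))
        history.append(abs(w[0] - 3.0))
    return history

slow = error_history(alpha=0.01)      # error factor 0.95 per step: slow convergence
fast = error_history(alpha=0.1)       # error factor 0.5 per step
unstable = error_history(alpha=0.5)   # error factor -1.5 per step: sign flips and grows
```

For this quadratic cost, the batch update is stable only when $\alpha < 2 / \lambda_{\max}(X^T X)$, which is $0.4$ in this toy example.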

3.2.5. Learning algorithm : Least squares (One-shot learning approach)<br>
Let's rewrite the cost function in a compact form as
$$
\begin{align*}
J(\mathbf{w}) &= \frac{1}{2} \sum_{i = 1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 \\
              &= \frac{1}{2} 
\begin{bmatrix}
y_1 - \mathbf{x}_1^T \mathbf{w} & y_2 - \mathbf{x}_2^T \mathbf{w} & \cdots & y_N - \mathbf{x}_N^T \mathbf{w}
\end{bmatrix}
\begin{bmatrix}
y_1 - \mathbf{x}_1^T \mathbf{w} \\
y_2 - \mathbf{x}_2^T \mathbf{w} \\
 \vdots \\
y_N - \mathbf{x}_N^T \mathbf{w} \\
\end{bmatrix} \\
             &= \frac{1}{2} (\mathbf{y} - X \mathbf{w})^T (\mathbf{y} - X \mathbf{w}) \\
             &= \frac{1}{2} || \mathbf{y} - X \mathbf{w} ||^2
\end{align*}
$$

- $\mathbf{y} = [y_1, \,\ \cdots, \,\ y_N]^T \in \mathbb{R}^{N \times 1}$
- $X = [\mathbf{x}_1, \,\ \cdots, \,\ \mathbf{x}_N]^T \in \mathbb{R}^{N \times m}$
- $m$ : Order of the model (number of parameters, including the bias)
- $N$ : Number of training examples
<br>

Thus, this problem can be considered as the ordinary least squares problem : 
$$ \min_{\mathbf{w}} \{ J(\mathbf{w}) = \frac{1}{2} || \mathbf{y} - X \mathbf{w} ||^2 \} $$

<strong>Case 1 : Exact and unique solution when </strong> $ N = m = \text{Rank}(X) $<br>
$$ \mathbf{w} = X^{-1} \mathbf{y} $$
When $N = m$ and $X$ is full-rank, zero error can be achieved: a simple linear model reproduces every training target exactly. However, this ideal case of $N = m$ rarely occurs in practice (not a practical case).
<br><br>
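
A small sketch of Case 1, assuming NumPy and a hypothetical $3 \times 3$ full-rank $X$. It uses `np.linalg.solve` rather than forming $X^{-1}$ explicitly, which is the usual numerical practice.

```python
import numpy as np

# Hypothetical square, full-rank design matrix (N = m = 3); last column is the bias input
X = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

w = np.linalg.solve(X, y)   # solves X w = y without computing the inverse of X
residual = y - X @ w        # zero up to rounding, since the fit is exact
```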

<strong>Case 2 : Over-determined case when </strong> $ N > m = \text{Rank}(X) $<br>
General case (most practical case). In general, no $\mathbf{w}$ satisfies the equation $\mathbf{y} = X\mathbf{w}$ exactly.<br>
Instead, try to minimize the error:
$$
\begin{align*}
J(\mathbf{w}) &= \frac{1}{2} || \mathbf{y} - X \mathbf{w} ||^2 \\
              &= \frac{1}{2} (\mathbf{y} - X \mathbf{w})^T (\mathbf{y} - X \mathbf{w}) \\
              &= \frac{1}{2} (\mathbf{y}^T \mathbf{y} - 2 \mathbf{y}^T X \mathbf{w} + \mathbf{w}^T X^T X \mathbf{w}) \\
\frac{\partial}{\partial \mathbf{w}} J(\mathbf{w}) &= \frac{1}{2} \frac{\partial}{\partial \mathbf{w}} ( \mathbf{y}^T \mathbf{y} - 2 \mathbf{y}^T X \mathbf{w} + \mathbf{w}^T X^T X \mathbf{w} ) \\
              &= X^T X \mathbf{w} - X^T \mathbf{y} \\
\end{align*}
$$

$$ \frac{\partial}{\partial \mathbf{w}} J(\mathbf{w}) = \mathbf{0} \quad \Longleftrightarrow \quad X^T X \mathbf{w} = X^T \mathbf{y} \quad \text{(Normal equation)}$$
LS solution : $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$
<br><br>
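
The LS solution can be sketched in NumPy as below. The noisy synthetic data is hypothetical. The normal equation is solved as a linear system instead of inverting $X^T X$, and the result is compared with `np.linalg.lstsq`, which minimizes $||\mathbf{y} - X\mathbf{w}||^2$ directly.

```python
import numpy as np

# Hypothetical over-determined data: N = 100 examples, m = 3 parameters
rng = np.random.default_rng(0)
N, m = 100, 3
X = np.hstack([rng.normal(size=(N, m - 1)), np.ones((N, 1))])
y = X @ np.array([2.0, -3.0, 1.0]) + 0.1 * rng.normal(size=N)

# Normal equation: X^T X w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver for comparison
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient X^T X w - X^T y should vanish at the LS solution
grad_at_solution = X.T @ (X @ w_normal - y)
```

When $X^T X$ is ill-conditioned, `np.linalg.lstsq` is usually the safer choice of the two.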

<strong>Case 3 : Under-determined case when </strong> $ m > N = \text{Rank}(X) $<br>
There are infinitely many solutions to the equation $\mathbf{y} = X \mathbf{w}$, each of which yields zero error. Among them, choose the particular solution with the minimum norm :
$$ \min_{\mathbf{w}} \frac{1}{2} || \mathbf{w} ||^2 \qquad \text{s.t.} \qquad \mathbf{y} = X \mathbf{w} $$

__CHECK_POINT__ : Solve this with the method of Lagrange multipliers.
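
The Lagrange multiplier step for the minimum-norm problem can be sketched as follows. The multiplier vector $\boldsymbol{\lambda} \in \mathbb{R}^{N}$ and the Lagrangian $L$ are symbols introduced here for illustration. The derivation relies on $\text{Rank}(X) = N$, which makes $X X^T$ invertible.

$$
\begin{align*}
L(\mathbf{w}, \boldsymbol{\lambda}) &= \frac{1}{2} || \mathbf{w} ||^2 + \boldsymbol{\lambda}^T (\mathbf{y} - X \mathbf{w}) \\
\frac{\partial L}{\partial \mathbf{w}} &= \mathbf{w} - X^T \boldsymbol{\lambda} = \mathbf{0} \quad \Longrightarrow \quad \mathbf{w} = X^T \boldsymbol{\lambda} \\
\mathbf{y} &= X \mathbf{w} = X X^T \boldsymbol{\lambda} \quad \Longrightarrow \quad \boldsymbol{\lambda} = (X X^T)^{-1} \mathbf{y} \\
\mathbf{w} &= X^T (X X^T)^{-1} \mathbf{y}
\end{align*}
$$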


Case 4 : When $X X^T$ or $X^T X$ is not invertible, use the Moore-Penrose pseudoinverse.
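
A minimal NumPy sketch of the pseudoinverse, using hypothetical matrices. For a full-row-rank under-determined $X$, `np.linalg.pinv(X) @ y` should agree with the standard minimum-norm closed form $X^T (X X^T)^{-1} \mathbf{y}$. For a rank-deficient $X$, where neither $X X^T$ nor $X^T X$ can be inverted, the pseudoinverse still returns the minimum-norm least-squares solution.

```python
import numpy as np

# Hypothetical under-determined system: N = 2 examples, m = 4 parameters
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])

w_pinv = np.linalg.pinv(X) @ y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # X^T (X X^T)^{-1} y

# Rank-deficient case: two identical columns, so X^T X and X X^T are both singular
X_def = np.array([[1.0, 1.0],
                  [2.0, 2.0],
                  [3.0, 3.0]])
y_def = np.array([2.0, 4.0, 6.0])
w_def = np.linalg.pinv(X_def) @ y_def   # minimum-norm choice among w1 + w2 = 2, i.e. [1, 1]
```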


Geometric interpretation of LS

### 3.3. Bayesian Regression

### 3.4. Logistic and Softmax Regression

### 3.5. $k$-Nearest Neighbors($k$-NN) Regression