# Chapter.03 Regression
---

### 3.4. Logistic and softmax regression
3.4.1. Logistic regression : hypothesis<br>
To estimate the probability of a Bernoulli random variable (a binary label) $y \in \{0, 1\}$, let
$$ \widehat{\Pr}(y = 1) = f(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) \quad \text{where} \,\ \sigma(z) = \frac{1}{1 + \exp(-z)} $$

- $w_j$ & $b$ : weights & bias
- $\mathbf{w} = [b, \,\ w_1, \,\ w_2, \,\ \cdots]^T $ 
- $\mathbf{x} = [1, \,\ x_1, \,\ x_2, \,\ \cdots]^T $
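
As a minimal sketch of the hypothesis above, the following pure-Python snippet evaluates $\sigma(\mathbf{w}^T \mathbf{x})$ for one augmented input. The function names `sigmoid` and `predict_proba` and all numeric values are illustrative assumptions, not part of these notes.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), branched to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    """Estimated Pr(y = 1) for x = [1, x_1, ...] and w = [b, w_1, ...]."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

w = [0.0, 1.0, -1.0]     # hypothetical parameters [b, w_1, w_2]
x = [1.0, 2.0, 2.0]      # augmented input: the leading 1 multiplies the bias
p = predict_proba(w, x)  # w^T x = 0 + 2 - 2 = 0, so p = sigma(0) = 0.5
```

Prepending the constant 1 to $\mathbf{x}$ lets the bias be handled as an ordinary weight, which keeps the later update rules in a single vector form.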

Since $y$ is discrete, the conditional probability mass function is
$$
\begin{align*}
p(y = 1 | \mathbf{x} ; \mathbf{w}) &= \sigma(\mathbf{w}^T \mathbf{x}) \\
p(y = 0 | \mathbf{x} ; \mathbf{w}) &= 1 - \sigma(\mathbf{w}^T \mathbf{x}) \\
\end{align*}
$$

Both cases can be written as a single expression:

$$ p(y | \mathbf{x} ; \mathbf{w}) = [\sigma(\mathbf{w}^T \mathbf{x})]^y [1 - \sigma(\mathbf{w}^T \mathbf{x})]^{1-y} $$

If we assume that the training examples are i.i.d., the log-likelihood is

$$
\begin{align*}
L(\mathbf{w}) &= \log p(\mathbf{y} | X; \mathbf{w}) = \log \prod_i p(y_i | \mathbf{x}_i ; \mathbf{w}) \\
              &= \log \prod_i [\sigma(\mathbf{w}^T \mathbf{x}_i)]^{y_i} [1 - \sigma(\mathbf{w}^T \mathbf{x}_i)]^{1-y_i} \\
              &= \sum_i y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1 - y_i) \log (1 - \sigma(\mathbf{w}^T \mathbf{x}_i)) \\
\end{align*}
$$
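
The last line of the derivation translates directly into code. The sketch below is one possible implementation; the name `log_likelihood` and the tiny data set are assumptions made for illustration, and it assumes every $\sigma(\mathbf{w}^T \mathbf{x}_i)$ lies strictly between 0 and 1 so that both logarithms are defined.

```python
import math

def sigmoid(z):
    """Logistic function, branched to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(w, X, y):
    """Sum over i of y_i log(sigma_i) + (1 - y_i) log(1 - sigma_i)."""
    total = 0.0
    for xi, yi in zip(X, y):
        s = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += yi * math.log(s) + (1 - yi) * math.log(1.0 - s)
    return total

# Hypothetical data: each row is an augmented input [1, x_1]
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]

ll_zero = log_likelihood([0.0, 0.0], X, y)    # every sigma_i is 0.5, so L = 3 * log(0.5)
ll_better = log_likelihood([0.0, 1.0], X, y)  # weights that agree with the labels
```

Here `ll_better` is larger than `ll_zero`, which is the sense in which the second parameter vector explains the labels better.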

Given the training set, we learn the parameters by maximizing the log-likelihood function, which is concave in $\mathbf{w}$:
$$ \max_\mathbf{w} L(\mathbf{w}) $$

3.4.2. Logistic regression : learning based on gradient ascent algorithm<br>

$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} $$

For a single training example $(\mathbf{x}, y)$, the gradient of its log-likelihood term is

$$
\begin{align*}
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} &= (y \frac{1}{\sigma(\mathbf{w}^T \mathbf{x})} - (1-y) \frac{1}{1 - \sigma(\mathbf{w}^T \mathbf{x})}) \frac{\partial}{\partial \mathbf{w}} \sigma(\mathbf{w}^T \mathbf{x}) \qquad (\because \,\ \text{Chain rule}) \\
                                                   &= (y \frac{1}{\sigma(\mathbf{w}^T \mathbf{x})} - (1-y) \frac{1}{1 - \sigma(\mathbf{w}^T \mathbf{x})}) \sigma(\mathbf{w}^T \mathbf{x}) (1 - \sigma(\mathbf{w}^T \mathbf{x})) \frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T \mathbf{x} \qquad (\because \,\ \text{Chain rule}) \\
                                                   &= (y(1 - \sigma(\mathbf{w}^T \mathbf{x})) - (1-y) \sigma(\mathbf{w}^T \mathbf{x}))\mathbf{x} \\
                                                   &= (y - \sigma(\mathbf{w}^T \mathbf{x}))\mathbf{x} \qquad (\text{same form as the LMS learning rule})
\end{align*}
$$
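
One way to sanity-check the closed form $(y - \sigma(\mathbf{w}^T \mathbf{x}))\mathbf{x}$ is to compare it with a central finite-difference approximation of the per-example log-likelihood. The sketch below does this; the function names, the test point, and the step size `eps` are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, branched to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_lik_one(w, x, y):
    """Log-likelihood term of a single example (x, y)."""
    s = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return y * math.log(s) + (1 - y) * math.log(1.0 - s)

def grad_one(w, x, y):
    """Analytic gradient (y - sigma(w^T x)) x from the derivation."""
    s = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return [(y - s) * xj for xj in x]

# Hypothetical test point
w = [0.2, -0.4, 0.7]
x = [1.0, 1.5, -0.5]
y = 1

eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus = list(w)
    w_minus = list(w)
    w_plus[j] += eps
    w_minus[j] -= eps
    # Central difference approximates the j-th partial derivative
    numeric.append((log_lik_one(w_plus, x, y) - log_lik_one(w_minus, x, y)) / (2 * eps))

analytic = grad_one(w, x, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

A small `max_diff` indicates that the analytic gradient and the numerical estimate agree at this point.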

Therefore, the batch learning update, which sums the gradient over all training examples, is 
$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha \sum_i (y_i - \sigma(\mathbf{w}^T \mathbf{x}_i)) \mathbf{x}_i $$

The online learning update, which uses one example $(\mathbf{x}_i, y_i)$ at a time, is 
$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha (y_i - \sigma(\mathbf{w}^T \mathbf{x}_i)) \mathbf{x}_i $$
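
The two update forms can be sketched as follows. This is an illustrative implementation under assumptions: the names `batch_step` and `online_step`, the learning rate, the iteration count, and the small data set (chosen to be not linearly separable so that the weights stay bounded) are not from these notes.

```python
import math

def sigmoid(z):
    """Logistic function, branched to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def batch_step(w, X, y, alpha):
    """One gradient ascent step using the gradient summed over all examples."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            grad[j] += err * xj
    return [wj + alpha * gj for wj, gj in zip(w, grad)]

def online_step(w, xi, yi, alpha):
    """One gradient ascent step using a single example (xi, yi)."""
    err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
    return [wj + alpha * err * xj for wj, xj in zip(w, xi)]

# Hypothetical data: rows are augmented inputs [1, x_1]
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1]

w_batch = [0.0, 0.0]
for _ in range(200):
    w_batch = batch_step(w_batch, X, y, alpha=0.1)

w_online = [0.0, 0.0]
for _ in range(200):                 # 200 passes over the data
    for xi, yi in zip(X, y):
        w_online = online_step(w_online, xi, yi, alpha=0.1)
```

The batch version touches every example before changing $\mathbf{w}$, while the online version changes $\mathbf{w}$ after each example, so its trajectory depends on the order in which the examples are visited.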

3.4.3. Logistic regression : learning via Iteratively Reweighted Least Squares (IRLS) based on the Newton-Raphson method<br>
The Newton-Raphson method is described in ___Calculus: Early Transcendentals, 8th Ed., p. 345___. <br>
It is an iterative algorithm for finding a solution of $f(x) = 0$. <br>
In the same way, we can use it to find a solution of $\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{0} $.

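Linearizing a vector-valued function $\mathbf{f}$ around the current estimate $\mathbf{w}_0$ and setting the linearization to zero gives the next estimate $\mathbf{w}_1$:

$$ \mathbf{y} = [\nabla \mathbf{f}(\mathbf{w}_0)]^T (\mathbf{w} - \mathbf{w}_0) + \mathbf{f}(\mathbf{w}_0) = \mathbf{0} $$

$$ \Rightarrow \quad \mathbf{w}_1 = \mathbf{w}_0 - ([\nabla \mathbf{f}(\mathbf{w}_0)]^T)^{-1} \mathbf{f}(\mathbf{w}_0) $$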

Let $ \mathbf{f}(\mathbf{w}) = \nabla L(\mathbf{w}) $, so that the Jacobian of $\mathbf{f}$ is the Hessian $\mathbf{H}(\mathbf{w})$ of $L$.

$$ \mathbf{y} = \mathbf{H}(\mathbf{w}_0) (\mathbf{w} - \mathbf{w}_0) + \nabla L(\mathbf{w}_0) = \mathbf{0} $$ 

$$ \therefore \quad \mathbf{w}_1 = \mathbf{w}_0 - \mathbf{H}(\mathbf{w}_0)^{-1} \nabla L(\mathbf{w}_0) $$
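
For the logistic regression log-likelihood, both quantities in this update have closed forms, which is where the name "reweighted least squares" comes from. The following is the standard result written in matrix form; the symbols $X$ (matrix whose rows are $\mathbf{x}_i^T$), $\boldsymbol{\sigma}$ (vector of $\sigma_i = \sigma(\mathbf{w}^T \mathbf{x}_i)$) and $S$ are notation assumed here for illustration.

$$
\begin{align*}
\nabla L(\mathbf{w}) &= \sum_i (y_i - \sigma_i) \mathbf{x}_i = X^T (\mathbf{y} - \boldsymbol{\sigma}) \\
\mathbf{H}(\mathbf{w}) &= - \sum_i \sigma_i (1 - \sigma_i) \mathbf{x}_i \mathbf{x}_i^T = - X^T S X \quad \text{where} \,\ S = \mathrm{diag}(\sigma_i (1 - \sigma_i)) \\
\mathbf{w}_1 &= \mathbf{w}_0 + (X^T S X)^{-1} X^T (\mathbf{y} - \boldsymbol{\sigma})
\end{align*}
$$

Each iteration therefore solves a weighted least-squares problem whose weights $S$ are recomputed from the current $\mathbf{w}_0$.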

The same update can also be derived from the Taylor series. <br>
The Taylor series of an infinitely differentiable function $f : \mathbb{R}^d \rightarrow \mathbb{R} $ around the point $(a_1, a_2, \cdots, a_d)$ is  

$$ T_f(x_1, x_2, \cdots, x_d) = \sum_{n_1 = 0}^{\infty} \sum_{n_2 = 0}^{\infty} \cdots \sum_{n_d = 0}^{\infty} \frac{(x_1 - a_1)^{n_1} (x_2 - a_2)^{n_2} \cdots (x_d - a_d)^{n_d}}{n_1 ! n_2 ! \cdots n_d !} (\frac{\partial^{n_1 + n_2 + \cdots + n_d} f}{\partial x_1^{n_1} \partial x_2^{n_2} \cdots \partial x_d^{n_d}}) (a_1, a_2, \cdots, a_d) $$

Truncating this series after the second-order terms gives the second-order Taylor expansion of $L$:
$$ L(\mathbf{w} + \Delta \mathbf{w}) \approx L(\mathbf{w}) + [\nabla L(\mathbf{w})]^T \Delta \mathbf{w} + \frac{1}{2} \Delta \mathbf{w}^T \mathbf{H}(\mathbf{w}) \Delta\mathbf{w} $$

$$ \frac{\partial}{\partial \Delta \mathbf{w}}L(\mathbf{w} + \Delta \mathbf{w}) \approx \nabla L(\mathbf{w}) + \mathbf{H}(\mathbf{w}) \Delta \mathbf{w} = \mathbf{0} $$

$$ \Delta \mathbf{w} = - \mathbf{H}(\mathbf{w})^{-1} \nabla L (\mathbf{w}) $$

$$ \therefore \quad \mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}(\mathbf{w})^{-1} \nabla L(\mathbf{w}) $$
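
The update $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}(\mathbf{w})^{-1} \nabla L(\mathbf{w})$ can be sketched for a two-parameter model $\mathbf{w} = [b, w_1]^T$ as follows. The sketch assumes the standard logistic regression forms $\nabla L = \sum_i (y_i - \sigma_i)\mathbf{x}_i$ and $\mathbf{H} = -\sum_i \sigma_i (1 - \sigma_i) \mathbf{x}_i \mathbf{x}_i^T$, uses the closed-form inverse of a 2x2 matrix instead of a general solver, and runs on a hypothetical data set that is not linearly separable; all names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function, branched to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def grad_and_hessian(w, X, y):
    """Gradient and Hessian of the log-likelihood for w = [b, w_1]."""
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(X, y):
        s = sigmoid(w[0] * xi[0] + w[1] * xi[1])
        for j in range(2):
            g[j] += (yi - s) * xi[j]
            for k in range(2):
                H[j][k] -= s * (1.0 - s) * xi[j] * xi[k]
    return g, H

def newton_step(w, X, y):
    """One update w <- w - H^{-1} grad, using the explicit 2x2 inverse."""
    g, H = grad_and_hessian(w, X, y)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    d0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det   # first entry of H^{-1} g
    d1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det   # second entry of H^{-1} g
    return [w[0] - d0, w[1] - d1]

# Hypothetical data: rows are augmented inputs [1, x_1]
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1]

w = [0.0, 0.0]
for _ in range(10):
    w = newton_step(w, X, y)

g_final, _ = grad_and_hessian(w, X, y)
grad_norm = math.sqrt(sum(gj * gj for gj in g_final))
```

On this small problem a handful of iterations is enough to drive `grad_norm` close to zero, whereas gradient ascent with a fixed step size typically needs many more updates.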

- Generally speaking, Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior that sometimes characterizes gradient descent.
- However, for Newton's method to work as a minimization method, the Hessian $\mathbf{H}(\mathbf{w})$ has to be a __positive definite matrix__ for all $\mathbf{w}$; when maximizing $L(\mathbf{w})$ as above, the corresponding requirement is that $-\mathbf{H}(\mathbf{w})$ is positive definite.
- Unfortunately, in general, there is no guarantee that $\mathbf{H}(\mathbf{w})$ is positive definite at every iteration of the algorithm.
- If the Hessian $\mathbf{H}(\mathbf{w})$ is not positive definite, modification of Newton's method is necessary. (Bertsekas, 1995)
- In any event, a major limitation of Newton's method is its computational complexity: every iteration has to form the Hessian and solve a linear system with it.
