# Chapter.03 Regression
---

### 3.4. Logistic and softmax regression
3.4.1. Logistic regression : hypothesis<br>
To estimate the probability of a Bernoulli random variable (i.e., a binary label) $y \in \{0, 1\}$, let
$$ \widehat{\Pr}(y = 1) = f(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) \quad \text{where} \,\ \sigma(x) = \frac{1}{1 + \exp(-x)} $$

- $w_j$ & $b$ : weights & bias
- $\mathbf{w} = [b, \,\ w_1, \,\ w_2, \,\ \cdots]^T $ 
- $\mathbf{x} = [1, \,\ x_1, \,\ x_2, \,\ \cdots]^T $

Since $y$ is discrete, the probability mass function is 
$$
\begin{align*}
p(y = 1 | \mathbf{x} ; \mathbf{w}) &= \sigma(\mathbf{w}^T \mathbf{x}) \\
p(y = 0 | \mathbf{x} ; \mathbf{w}) &= 1 - \sigma(\mathbf{w}^T \mathbf{x}) \\
\end{align*}
$$

$$ p(y | \mathbf{x} ; \mathbf{w}) = [\sigma(\mathbf{w}^T \mathbf{x})]^y [1 - \sigma(\mathbf{w}^T \mathbf{x})]^{1-y} $$

If we assume that the training examples are i.i.d., the log-likelihood is 

$$
\begin{align*}
L(\mathbf{w}) &= \log p(\mathbf{y} | X; \mathbf{w}) = \log \prod_i p(y_i | \mathbf{x}_i ; \mathbf{w}) \\
              &= \log \prod_i [\sigma(\mathbf{w}^T \mathbf{x}_i)]^{y_i} [1 - \sigma(\mathbf{w}^T \mathbf{x}_i)]^{1-y_i} \\
              &= \sum_i y_i \log \sigma(\mathbf{w}^T \mathbf{x}_i) + (1 - y_i) \log (1 - \sigma(\mathbf{w}^T \mathbf{x}_i)) \\
\end{align*}
$$

Given the training set, we learn the parameters by maximizing the log-likelihood function:
$$ \max_\mathbf{w} L(\mathbf{w}) \quad  \text{which is concave.}$$
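
The log-likelihood above can be evaluated directly. The following is a minimal NumPy sketch, not a reference implementation: the function names and the toy data are assumptions made for illustration, and the first column of `X` is the constant 1 so that the bias is part of $\mathbf{w}$.

```python
import numpy as np

def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, y):
    """L(w) = sum_i y_i log s_i + (1 - y_i) log(1 - s_i), with s_i = sigma(w^T x_i)."""
    s = sigmoid(X @ w)
    return np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

# Hypothetical toy data: each row is x_i = [1, x_i(1)].
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

L_zero = log_likelihood(np.zeros(2), X, y)           # every s_i equals 0.5
L_good = log_likelihood(np.array([0.0, 1.0]), X, y)  # weights that fit the labels better
print(L_zero, L_good)
```

With $\mathbf{w} = \mathbf{0}$ every prediction is 0.5, so the value is $N \log 0.5$; weights that agree with the labels give a larger (less negative) value.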

3.4.2. Logistic regression : learning based on gradient ascent algorithm<br>

$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} $$

$$
\begin{align*}
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} &= (y \frac{1}{\sigma(\mathbf{w}^T \mathbf{x})} - (1-y) \frac{1}{1 - \sigma(\mathbf{w}^T \mathbf{x})}) \frac{\partial}{\partial \mathbf{w}} \sigma(\mathbf{w}^T \mathbf{x}) \qquad (\because \,\ \text{Chain rule}) \\
                                                   &= (y \frac{1}{\sigma(\mathbf{w}^T \mathbf{x})} - (1-y) \frac{1}{1 - \sigma(\mathbf{w}^T \mathbf{x})}) \sigma(\mathbf{w}^T \mathbf{x}) (1 - \sigma(\mathbf{w}^T \mathbf{x})) \frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T \mathbf{x} \qquad (\because \,\ \text{Chain rule}) \\
                                                   &= (y(1 - \sigma(\mathbf{w}^T \mathbf{x})) - (1-y) \sigma(\mathbf{w}^T \mathbf{x}))\mathbf{x} \\
                                                   &= (y - \sigma(\mathbf{w}^T \mathbf{x}))\mathbf{x} \qquad (\text{LMS learning rule})
\end{align*}
$$

Therefore, summing over the training examples, the batch learning update is 
$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha \sum_i (y_i - \sigma(\mathbf{w}^T \mathbf{x}_i)) \mathbf{x}_i $$

and the online learning update, applied to one example at a time, is 
$$ \mathbf{w} \leftarrow \mathbf{w} + \alpha (y_i - \sigma(\mathbf{w}^T \mathbf{x}_i)) \mathbf{x}_i $$
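
The two update rules can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the step size `alpha`, the iteration counts, the zero initialization, and the toy data are all hypothetical choices, and each update adds the scaled gradient to the current $\mathbf{w}$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient(w, X, y):
    """Batch gradient of the log-likelihood: sum_i (y_i - sigma(w^T x_i)) x_i."""
    return X.T @ (y - sigmoid(X @ w))

def batch_gradient_ascent(X, y, alpha=0.1, n_iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w + alpha * gradient(w, X, y)       # one step uses all examples
    return w

def online_gradient_ascent(X, y, alpha=0.1, n_epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):              # one step per example
            w = w + alpha * (y_i - sigmoid(w @ x_i)) * x_i
    return w

# Hypothetical, non-separable toy data; rows are x_i = [1, x_i(1)].
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
              [1.0, 1.0], [1.0, -0.5], [1.0, 2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w_batch = batch_gradient_ascent(X, y)
w_online = online_gradient_ascent(X, y)
print(w_batch, w_online)
```

With a constant step size the online variant only hovers around the maximizer, whereas the batch variant drives the gradient toward zero on this data.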

3.4.3. Logistic regression : learning via Iteratively Reweighted Least Squares (IRLS) based on the Newton-Raphson method<br>
The Newton-Raphson method is described in ___Calculus Early Transcendentals 8th Ed p. 345___. <br>
It is an iterative algorithm for finding a solution of $f(x) = 0$. <br>
In the same way, we can find a solution of $\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{0} $ by linearizing around the current estimate $\mathbf{w}_0$ and setting the linearization to zero.

$$ \mathbf{y} = [\nabla \mathbf{f}(\mathbf{w}_0)]^T (\mathbf{w} - \mathbf{w}_0) + \mathbf{f}(\mathbf{w}_0) = \mathbf{0} $$

$$ \Rightarrow \quad \mathbf{w}_1 = \mathbf{w}_0 - ([\nabla \mathbf{f}(\mathbf{w}_0)]^T)^{-1} \mathbf{f}(\mathbf{w}_0) $$

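Let $ \mathbf{f}(\mathbf{w}) = \nabla L(\mathbf{w}) $, so that $ \nabla \mathbf{f}(\mathbf{w}) = \mathbf{H}(\mathbf{w}) $, the (symmetric) Hessian of $L$.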

$$ \mathbf{y} = \mathbf{H}(\mathbf{w}_0) (\mathbf{w} - \mathbf{w}_0) + \nabla L(\mathbf{w}_0) = \mathbf{0} $$ 

$$ \therefore \quad \mathbf{w}_1 = \mathbf{w}_0 - \mathbf{H}(\mathbf{w}_0)^{-1} \nabla L(\mathbf{w}_0) $$

The same update can also be derived from the Taylor series. <br>
The Taylor series of an infinitely differentiable function $f : \mathbb{R}^d \rightarrow \mathbb{R} $ about the point $(a_1, a_2, \cdots, a_d)$ is  

$$ T_f(x_1, x_2, \cdots, x_d) = \sum_{n_1 = 0}^{\infty} \sum_{n_2 = 0}^{\infty} \cdots \sum_{n_d = 0}^{\infty} \frac{(x_1 - a_1)^{n_1} (x_2 - a_2)^{n_2} \cdots (x_d - a_d)^{n_d}}{n_1 ! n_2 ! \cdots n_d !} (\frac{\partial^{n_1 + n_2 + \cdots + n_d} f}{\partial x_1^{n_1} \partial x_2^{n_2} \cdots \partial x_d^{n_d}}) (a_1, a_2, \cdots, a_d) $$

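Truncating this series after the second-order terms gives the second-order Taylor expansion of $L$; setting its derivative with respect to $\Delta \mathbf{w}$ to zero yields the update.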
$$ L(\mathbf{w} + \Delta \mathbf{w}) \approx L(\mathbf{w}) + [\nabla L(\mathbf{w})]^T \Delta \mathbf{w} + \frac{1}{2} \Delta \mathbf{w}^T \mathbf{H}(\mathbf{w}) \Delta\mathbf{w} $$

$$ \frac{\partial}{\partial \Delta \mathbf{w}}L(\mathbf{w} + \Delta \mathbf{w}) \approx \nabla L(\mathbf{w}) + \mathbf{H}(\mathbf{w}) \Delta \mathbf{w} = \mathbf{0} $$

$$ \Delta \mathbf{w} = - \mathbf{H}(\mathbf{w})^{-1} \nabla L (\mathbf{w}) $$

$$ \therefore \quad \mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}(\mathbf{w})^{-1} \nabla L(\mathbf{w}) $$

- Generally speaking, Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior that sometimes characterizes gradient descent.
- However, for Newton's method to work as a minimization method, the Hessian of the objective has to be a __positive definite matrix__ for all $\mathbf{w}$. Here we maximize $L$, so the corresponding requirement is that $-\mathbf{H}(\mathbf{w})$ is positive definite, i.e. $\mathbf{H}(\mathbf{w})$ is negative definite.
- Unfortunately, in general, there is no guarantee that this definiteness condition holds at every iteration of the algorithm.
- If the condition does not hold, modification of Newton's method is necessary. (Bertsekas, 1995)
- In any event, a major limitation of Newton’s method is its computational complexity.

$$ \mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}(\mathbf{w})^{-1} \nabla L(\mathbf{w}) = \mathbf{w} + (X^T S X)^{-1} X^T (\mathbf{y} - \mathbf{z}) $$

$ \text{where} \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \,\ \mathbf{z} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}, \,\ z_i = \sigma(\mathbf{w}^T \mathbf{x}_i), \,\ X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix} = \begin{bmatrix}
 1 & x_1(1) & x_1(2) & \cdots \\
 1 & x_2(1) & x_2(2) & \cdots \\ 
 \vdots & \vdots & \vdots & \ddots \\
 1 & x_N(1) & x_N(2) & \cdots \\
 \end{bmatrix}, \,\ \mathbf{x}_i = \begin{bmatrix} 1 \\ x_i(1) \\ x_i(2) \\ \vdots \end{bmatrix} $<br>
$ S = \mathrm{diag}(z_1(1-z_1), \cdots, z_N(1-z_N)) = 
\begin{bmatrix}
z_1(1 - z_1) & 0 & \cdots & 0 \\
 0 & z_2(1 - z_2) & \cdots & 0 \\ 
 \vdots & \vdots & \ddots & \vdots \\
 0 & 0 & \cdots & z_N(1 - z_N) \\
 \end{bmatrix} $<br><br>

 <strong>Proof.</strong><br>
 [PDF File (too long)](./res/ch03/note_IRLS.pdf)

$$
\begin{align*}
\mathbf{w}_{new} &= \mathbf{w}_{old} + (X^T S X)^{-1} X^T (\mathbf{y} - \mathbf{z}) \\
                 &= (X^T S X)^{-1} \{X^T S X \mathbf{w}_{old} + X^T (\mathbf{y} - \mathbf{z}) \} \\
                 &= (X^T S X)^{-1} X^T S \{X \mathbf{w}_{old} + S^{-1} (\mathbf{y} - \mathbf{z}) \} \\
                 &= (X^T S X)^{-1} X^T S \mathbf{b} \quad \text{where} \quad \mathbf{b} = X \mathbf{w}_{old} + S^{-1} (\mathbf{y} - \mathbf{z})
\end{align*}  
$$

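The above form is the solution of a weighted least squares problem with weight matrix $S$ and target $\mathbf{b}$. Since both $S$ and $\mathbf{b}$ depend on $\mathbf{w}_{old}$, the problem is re-solved at every iteration, hence the name IRLS.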
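
The Newton update written as a weighted least squares solve can be sketched as below. This is a minimal illustration under assumptions that are not part of the notes: a fixed number of iterations, zero initialization, hypothetical toy data that is not linearly separable (so the maximizer is finite), and no safeguard for $z_i(1 - z_i)$ getting close to zero.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(X, y, n_iters=10):
    """Newton-Raphson for logistic regression in weighted least squares form."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = sigmoid(X @ w)
        s = z * (1.0 - z)              # diagonal of S
        b = X @ w + (y - z) / s        # working target b = X w + S^{-1} (y - z)
        XtS = X.T * s                  # X^T S without forming the N x N matrix
        w = np.linalg.solve(XtS @ X, XtS @ b)   # w = (X^T S X)^{-1} X^T S b
    return w

# Hypothetical, non-separable toy data; rows are x_i = [1, x_i(1)].
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
              [1.0, 1.0], [1.0, -0.5], [1.0, 2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w_irls = irls(X, y)
grad_norm = np.linalg.norm(X.T @ (y - sigmoid(X @ w_irls)))
print(w_irls, grad_norm)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse; on perfectly separable data the weights would grow without bound and this sketch would need regularization or a stopping rule.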

3.4.4. Logistic regression : Binary classification<br>

- Classification rule : 
$$ f(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) \underset{y = 0}{\overset{y = 1}\gtrless} 0.5 $$

Logistic classification can be considered a linear classification, because $\sigma$ is monotonically increasing and $\sigma(0) = 0.5$:

$$ f(\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) \underset{y = 0}{\overset{y = 1} \gtrless} 0.5 \quad \Leftrightarrow \quad g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} \underset{y = 0}{\overset{y = 1} \gtrless} 0 $$

$$ g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} = \log \frac{p(y = 1 | \mathbf{x}, \mathbf{w})}{p(y = 0 | \mathbf{x}, \mathbf{w})} \underset{y = 0}{\overset{y = 1} \gtrless} 0 $$

If the data are generated from the logistic model, then logistic classification is optimal in the maximum likelihood (ML) sense.
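
The equivalence between thresholding the probability at 0.5 and thresholding the linear score at 0 can be checked numerically. The weights and inputs below are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([-0.5, 2.0])                    # hypothetical weights [b, w_1]
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 0.5], [1.0, 3.0]])

g = X @ w                                    # linear score g(x) = w^T x
p = sigmoid(g)                               # estimated Pr(y = 1 | x)

pred_from_prob = (p > 0.5).astype(int)       # rule on the probability
pred_from_score = (g > 0).astype(int)        # rule on the linear score
log_odds = np.log(p / (1.0 - p))             # equals g up to rounding
print(pred_from_prob, pred_from_score)
```

Both rules give the same labels, and the log-odds recovers the linear score, which is why the decision boundary is the hyperplane $\mathbf{w}^T \mathbf{x} = 0$.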

3.4.5. Softmax regression : Overview<br>
Softmax regression generalizes logistic regression to $K > 2$ classes. By Bayes' rule, the class posterior is 
$$ p(C_k | \mathbf{x}) = \frac{p(\mathbf{x} | C_k)p(C_k)}{\sum_j p(\mathbf{x} | C_j)p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)} \quad \text{where} \quad a_k = \ln(p(\mathbf{x}|C_k)p(C_k))  $$

When $K = 2$, the softmax function becomes the logistic function
$$ \frac{\exp(a_1)}{\exp(a_1) + \exp(a_2)} = \frac{1}{1 + \exp(-(a_1 - a_2))} = \frac{1}{1 + \exp(-a)} = \sigma(a) \quad \text{where} \quad a = a_1 - a_2 $$
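
A quick numerical check of the $K = 2$ case, with hypothetical values for $a_1$ and $a_2$: the first softmax output equals the logistic function evaluated at the difference $a_1 - a_2$.

```python
import numpy as np

def softmax(a):
    """Softmax of a 1-D array; the shift by max(a) does not change the result."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

a1, a2 = 1.3, -0.4                              # hypothetical activations
p = softmax(np.array([a1, a2]))
sig = 1.0 / (1.0 + np.exp(-(a1 - a2)))          # sigma(a_1 - a_2)
print(p[0], sig)
```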

3.4.6. Softmax regression : Hypothesis<br>
To estimate the probabilities of $K$ labels, $y \in \{1, \cdots, K\}$

$$ f(\mathbf{x}) = \begin{bmatrix} p(y = 1| \mathbf{x}, W) \\ p(y = 2| \mathbf{x}, W) \\ \vdots \\ p(y = K| \mathbf{x}, W) \end{bmatrix} = \frac{1}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x})} \begin{bmatrix} \exp(\mathbf{w}_1^T \mathbf{x}) \\ \exp(\mathbf{w}_2^T \mathbf{x}) \\ \vdots \\  \exp(\mathbf{w}_K^T \mathbf{x}) \end{bmatrix}$$

The log-likelihood function is 

$$ L(W) = \sum_i \sum_{k = 1}^{K} \mathbf{1}(y_i = k) \log \frac{\exp(\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)} \qquad \text{(negative of the cross-entropy loss)}$$

Given the training set, we learn the parameters by maximizing the log-likelihood function:
$$ \max_W L(W) \qquad \text{(Concave)} $$
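
The softmax log-likelihood can be sketched in NumPy as follows. Several things here are assumptions for illustration only: `W` stores $\mathbf{w}_k$ as its $k$-th column, labels are 0-based (`0..K-1`) rather than `1..K`, the data are made up, and the log-sum-exp shift is a standard numerical-stability device that is not part of the notes.

```python
import numpy as np

def log_softmax_rows(A):
    """Row-wise log softmax; subtracting the row maximum avoids overflow."""
    A = A - A.max(axis=1, keepdims=True)
    return A - np.log(np.exp(A).sum(axis=1, keepdims=True))

def softmax_log_likelihood(W, X, y):
    """L(W) = sum_i log p(y_i | x_i, W), with W of shape (d, K)."""
    logp = log_softmax_rows(X @ W)               # logp[i, k] = log p(y = k | x_i, W)
    return logp[np.arange(len(y)), y].sum()      # pick the entry of the true class

# Hypothetical toy data: N = 4 examples, d = 2 features (incl. bias), K = 3 classes.
X = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 2.0], [1.0, 0.3]])
y = np.array([0, 1, 2, 1])

L_zero = softmax_log_likelihood(np.zeros((2, 3)), X, y)   # uniform predictions
print(L_zero)
```

With $W = 0$ every class has probability $1/K$, so the value is $N \log(1/K)$, a convenient sanity check.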

3.4.7. Softmax regression : Derivative of softmax function<br>
$$ \pi(z_k) = \frac{e^{z_k}}{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}}} $$

$ \text{If} \,\ k = j : $<br>
$$
\begin{align*}
\frac{\partial \pi(z_k)}{\partial z_j} &= \frac{\partial}{\partial z_j} \frac{e^{z_k}}{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}}} \\
                                       &= \frac{e^{z_k}(\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}} - e^{z_j})}{(\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}})^2} \\
                                       &= \frac{e^{z_k}}{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}}} \frac{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}} - e^{z_j}}{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}}} \\
                                       &= \pi(z_k) (1 - \pi(z_j)) \\
\end{align*}
$$

$ \text{If} \,\ k \neq j : $<br>
$$
\begin{align*}
\frac{\partial \pi(z_k)}{\partial z_j} &= \frac{\partial}{\partial z_j} \frac{e^{z_k}}{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}}} \\
                                       &= \frac{0 - e^{z_k} e^{z_j}}{(\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}})^2} \\
                                       &= - \frac{e^{z_k}}{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}}} \frac{e^{z_j}}{\sum_{k^\prime = 1}^{K} e^{z_{k^\prime}}} \\
                                       &= - \pi(z_k) \pi(z_j)
\end{align*}
$$
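
Both cases can be combined into the Jacobian matrix $J_{kj} = \pi(z_k)(\mathbf{1}_{k = j} - \pi(z_j))$, that is $J = \mathrm{diag}(\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T$. The sketch below compares this closed form with central finite differences; the test point `z` and the step `eps` are arbitrary choices made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[k, j] = d pi(z_k) / d z_j = pi_k * (1[k == j] - pi_j)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([0.2, -1.0, 0.7])       # arbitrary test point
J = softmax_jacobian(z)

# Central finite differences: column j approximates the derivative w.r.t. z_j.
eps = 1e-6
J_num = np.column_stack([
    (softmax(z + eps * e_j) - softmax(z - eps * e_j)) / (2 * eps)
    for e_j in np.eye(len(z))
])
print(np.max(np.abs(J - J_num)))
```

Each column of $J$ sums to zero, which reflects the constraint that the softmax outputs always sum to one.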

3.4.8. Softmax regression : learning based on gradient ascent algorithm<br>
$$ \mathbf{w}_k \leftarrow \mathbf{w}_k + \alpha \frac{\partial L(W)}{\partial \mathbf{w}_k}, \quad k = 1,2,\cdots, K $$
$$ \text{where} \quad \frac{\partial L(W)}{\partial \mathbf{w}_k} = \sum_i \mathbf{x}_i [\mathbf{1}(y_i = k) - \frac{\exp(\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)}] \quad \text{(LMS learning rule)}$$

<strong>Proof.</strong><br>
$$
\begin{align*}
\frac{\partial L(W)}{\partial \mathbf{w}_l} &= \sum_i \sum_{k = 1}^{K} \mathbf{1}(y_i = k) \frac{\partial}{\partial \mathbf{w}_l} \log \frac{\exp (\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)} \\
                                            &= \sum_i \sum_{k = 1}^{K} \mathbf{1}(y_i = k) \frac{1}{ \frac{\exp (\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp (\mathbf{w}_j^T \mathbf{x}_i)}} \frac{\partial}{\partial \mathbf{w}_l} \frac{\exp(\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)} \\
                                            &= \sum_i \sum_{k = 1}^{K} \mathbf{1}(y_i = k) \frac{1}{ \frac{\exp (\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp (\mathbf{w}_j^T \mathbf{x}_i)}} \frac{\exp(\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)} (\mathbf{1}_{k = l} - \frac{\exp (\mathbf{w}_l^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)}) \frac{\partial(\mathbf{w}_l^T \mathbf{x}_i)}{\partial \mathbf{w}_l} \\
                                            &= \sum_i \sum_{k = 1}^{K} \mathbf{1}(y_i = k) (\mathbf{1}_{k = l} - \frac{\exp (\mathbf{w}_l^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)}) \mathbf{x}_i \\
\end{align*}
$$

$$ 
\therefore \quad \frac{\partial L(W)}{\partial \mathbf{w}_l} = \sum_{i} \mathbf{x}_i \left[ \mathbf{1}(y_i = l) - \frac{\exp (\mathbf{w}_l^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)} \right] \qquad \left( \because \,\ \sum_{k = 1}^{K} \mathbf{1}(y_i = k) \mathbf{1}_{k = l} = \mathbf{1}(y_i = l), \,\ \sum_{k = 1}^{K} \mathbf{1}(y_i = k) = 1 \right)
$$

Batch learning :

$$ \mathbf{w}_k \leftarrow \mathbf{w}_k + \alpha \sum_{i} \mathbf{x}_i \left[ \mathbf{1}(y_i = k) - \frac{\exp (\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)} \right] $$

Online learning :

$$ \mathbf{w}_k \leftarrow \mathbf{w}_k + \alpha \mathbf{x}_i \left[ \mathbf{1}(y_i = k) - \frac{\exp (\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x}_i)} \right] $$
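
The batch update can be written for all $K$ weight vectors at once. The sketch below is an illustration under assumptions: `W` holds $\mathbf{w}_k$ as its $k$-th column, labels are 0-based, and the step size, iteration count, and one-dimensional toy data are hypothetical.

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gradient(W, X, Y):
    """Column k is sum_i x_i [1(y_i = k) - p(y = k | x_i, W)]."""
    return X.T @ (Y - softmax_rows(X @ W))

def fit_softmax(X, y, K, alpha=0.1, n_iters=2000):
    Y = np.eye(K)[y]                          # one-hot rows: Y[i, k] = 1(y_i = k)
    W = np.zeros((X.shape[1], K))
    for _ in range(n_iters):
        W = W + alpha * gradient(W, X, Y)     # batch gradient ascent step
    return W

# Hypothetical toy data: three classes ordered along one feature, plus a bias column.
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 0.0],
              [1.0, 0.3], [1.0, 1.8], [1.0, 2.2]])
y = np.array([0, 0, 1, 1, 2, 2])

W = fit_softmax(X, y, K=3)
pred = np.argmax(X @ W, axis=1)
print(pred)
```

The online rule is the same step with the sum replaced by a single example. Because the gradient columns sum to zero across classes, the sum of the $\mathbf{w}_k$ stays at its initial value throughout training.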

3.4.9. Softmax regression : learning via Iteratively Reweighted Least Squares (IRLS) based on the Newton-Raphson method<br>

$$ \mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \nabla_\mathbf{w} L(W) $$

$ \text{where} $ $ \mathbf{H} = \begin{bmatrix}
\mathbf{H}_{11} & \mathbf{H}_{12} & \cdots  & \mathbf{H}_{1K} \\
\mathbf{H}_{21} & \mathbf{H}_{22} & \cdots  & \mathbf{H}_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{H}_{K1} & \mathbf{H}_{K2} & \cdots  & \mathbf{H}_{KK} \\
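\end{bmatrix}, $ $ \,\ \mathbf{H}_{kj} \triangleq \frac{\partial^2 L(W)}{\partial \mathbf{w}_k \partial \mathbf{w}_j} = - \sum_i z_{ki} (\mathbf{1}_{k = j} - z_{ji}) \mathbf{x}_i \mathbf{x}_i^T, \,\ z_{ki} = \frac{\exp(\mathbf{w}_k^T \mathbf{x}_i)}{\sum_{j^\prime = 1}^{K} \exp(\mathbf{w}_{j^\prime}^T \mathbf{x}_i)}, $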

$$ 
\nabla_\mathbf{w} L(W) = \left[ \frac{\partial L(W)}{\partial \mathbf{w}_1} \,\ \frac{\partial L(W)}{\partial \mathbf{w}_2} \,\ \cdots \,\ \frac{\partial L(W)}{\partial \mathbf{w}_K}  \right]^T,  \,\
\frac{\partial L(W)}{\partial \mathbf{w}_k} = \sum_i \mathbf{x}_i [\mathbf{1}(y_i = k) - z_{ki}] = X^T (\mathbf{y}_k - \mathbf{z}_k)
$$

$ \text{where} \quad \mathbf{y}_k = [\mathbf{1}(y_1 = k), \cdots, \mathbf{1}(y_N = k)]^T, \,\ \mathbf{z}_k = [z_{k1}, \cdots, z_{kN}]^T $

<strong>Proof.</strong><br>
It can be derived in the same way as in 3.4.3. $\blacksquare$
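
The block structure of the gradient and Hessian can be assembled as in the sketch below. It is an illustration only: `W` with $\mathbf{w}_k$ as columns, 0-based labels, and the toy values are assumptions. The sketch also checks two properties that matter in practice: the Hessian is negative semidefinite, and it is singular, because adding the same vector to every $\mathbf{w}_k$ leaves the softmax probabilities unchanged. A plain inverse therefore does not exist without an extra constraint or regularization.

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gradient_and_hessian(W, X, y, K):
    d = X.shape[1]
    Z = softmax_rows(X @ W)                      # Z[i, k] = z_ki
    Y = np.eye(K)[y]                             # Y[i, k] = 1(y_i = k)
    grad = (X.T @ (Y - Z)).T.reshape(-1)         # stacked [dL/dw_1; ...; dL/dw_K]
    H = np.zeros((K * d, K * d))
    for k in range(K):
        for j in range(K):
            coef = Z[:, k] * (float(k == j) - Z[:, j])     # z_ki (1[k == j] - z_ji)
            H[k*d:(k+1)*d, j*d:(j+1)*d] = -(X.T * coef) @ X  # -sum_i coef_i x_i x_i^T
    return grad, H

# Hypothetical toy values: d = 2 (incl. bias), K = 3 classes.
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 0.0],
              [1.0, 0.3], [1.0, 1.8], [1.0, 2.2]])
y = np.array([0, 0, 1, 1, 2, 2])
W = np.array([[0.1, -0.2, 0.3], [0.5, 0.0, -0.4]])

grad, H = gradient_and_hessian(W, X, y, K=3)
max_eig = np.linalg.eigvalsh(H).max()            # should not be positive
null_dir = np.tile([1.0, 2.0], 3)                # same shift applied to every w_k
print(max_eig, np.abs(H @ null_dir).max())
```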

3.4.10. Softmax regression : Multi-Class classification via softmax regression<br>
Classification rule :

$$ \hat{y} = \underset{k \in \{1, \cdots , K\}}{\arg \max} \,\ p(y = k | \mathbf{x}, W) = \underset{k \in \{1, \cdots , K\}}{\arg \max} \,\ \frac{\exp(\mathbf{w}_k^T \mathbf{x})}{\sum_{j = 1}^{K} \exp(\mathbf{w}_j^T \mathbf{x})} = \underset{k \in \{1, \cdots , K\}}{\arg \max} \,\ \mathbf{w}_k^T \mathbf{x} $$
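
Because the softmax denominator is shared by all classes and $\exp$ is increasing, the predicted class is simply the one with the largest linear score. A minimal sketch, with hypothetical weights (columns of `W` are the $\mathbf{w}_k$) and 0-based class indices:

```python
import numpy as np

def predict(W, X):
    """Return argmax_k w_k^T x for each row x of X (0-based class index)."""
    return np.argmax(X @ W, axis=1)

def predict_proba(W, X):
    """Softmax probabilities, one row per example."""
    A = X @ W
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

W = np.array([[0.0, 1.0, 0.0],        # hypothetical biases b_k
              [-1.0, 0.0, 1.0]])      # hypothetical slopes
X = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 2.0]])

pred = predict(W, X)
pred_via_proba = np.argmax(predict_proba(W, X), axis=1)
print(pred, pred_via_proba)
```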