# Softmax regression

<a href="https://nbviewer.jupyter.org/github/hongjiaherng/ML-Collections/blob/main/just4funml/notes/note_softmax_reg.ipynb" 
   target="_parent">
   <img align="left" 
      src="https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg" 
      width="109" height="20">
</a>

### 1. Brief explanation

`Softmax regression`, also called `multinomial logistic regression`, generalizes logistic regression to perform multiclass classification directly. Regular logistic regression, by contrast, is a binary classifier and needs a strategy such as one-versus-all to handle more than two classes.
<br><br>

Given an instance $x$, the model computes a score $s_k(x)$ for each class $k$. The `softmax function` is then applied to the scores to obtain the probability that $x$ belongs to each class. Finally, the model assigns $x$ to the class with the highest probability.

### 2. Model hypothesis

All formulas in this part are for a single training example.

Define:<br>
$\mathbf{x} = 
\begin{bmatrix}
x_0 \\
x_1 \\
\vdots \\
x_n \\
\end{bmatrix}$ &nbsp;&nbsp;&nbsp;
$\theta^{(k)} =
\begin{bmatrix}
\theta^{(k)}_0 \\
\theta^{(k)}_1 \\
\vdots \\
\theta^{(k)}_n \\
\end{bmatrix}
$<br>
$
\Theta =
\begin{bmatrix}
\theta^{(1)}_0 & \theta^{(2)}_0 & ... & \theta^{(K)}_0 \\
\theta^{(1)}_1 & \theta^{(2)}_1 & ... & \theta^{(K)}_1 \\
\theta^{(1)}_2 & \theta^{(2)}_2 & ... & \theta^{(K)}_2 \\
\vdots & \vdots & \ddots & \vdots \\
\theta^{(1)}_n & \theta^{(2)}_n & ... & \theta^{(K)}_n \\
\end{bmatrix} =
\left[\begin{array}{cccc}| & | & | & | \\
\theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \\
| & | & | & |
\end{array}\right]
$
<br><br>
where<br>
$n$ = number of features, <br>
$K$ = number of classes, <br>
$\mathbf{x} \in \mathbb{R}^{(n+1)\times1}$ &nbsp;, (an $(n+1)$-dimensional vector that includes the bias term $x_0$)<br>
$\Theta \in \mathbb{R}^{(n+1)\times K}$


a. ***Softmax score for class k of instance $x$ (aka logit)*** 
    
- Compute this for every class $k$, where $k = 1, 2, ..., K$
<br><br>
$s_k(\mathbf{x}) = \mathbf{x}^T\theta^{(k)}$

<br>where <br>
$s_k(\mathbf{x}) \in \mathbb{R}$, <br>
$\mathbf{x}$ = feature vector of an instance, <br>
$\theta^{(k)}$ = weight vector of class $k$ (the same for every instance)
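
As a rough illustration of the score formula, the sketch below evaluates $s_k(\mathbf{x})$ for one instance with NumPy. The sizes ($n = 2$, $K = 3$), the numbers in `x` and `Theta`, and the helper name `class_score` are all made up for this example; they are not part of the original notes.

```python
import numpy as np

# Hypothetical toy values: n = 2 features, K = 3 classes.
x = np.array([1.0, 2.0, -1.0])          # x_0 = 1.0 plays the role of the bias feature
Theta = np.array([[0.5, 0.0, -0.5],     # row j holds theta_j^(1), ..., theta_j^(K)
                  [1.0, 0.5, 0.0],
                  [0.0, 1.0, 2.0]])

def class_score(x, Theta, k):
    """Score s_k(x) = x^T theta^(k) for a class k in 1..K."""
    theta_k = Theta[:, k - 1]            # column k-1 is theta^(k) (NumPy is 0-based)
    return float(x @ theta_k)

K = Theta.shape[1]
scores = np.array([class_score(x, Theta, k) for k in range(1, K + 1)])
print(scores)                            # same values as x @ Theta
```

Looping over `k` mirrors the per-class formula; in practice the single product `x @ Theta` gives all $K$ scores at once.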

b. ***Softmax function*** 
- Compute this for every class $k$, where $k = 1, 2, ..., K$
<br><br>
$\hat{p}_k = \sigma(s(\mathbf{x}))_k = \dfrac{\exp(s_k(\mathbf{x}))}{\sum\limits_{j=1}^{K}{\exp(s_j(\mathbf{x}))}}$

where <br>
$K$ = number of classes, <br>
$\sum\limits_{j=1}^{K}{\exp(s_j(\mathbf{x}))}$ = the sum of the exponentials of the softmax scores over all classes for instance $\mathbf{x}$
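
The softmax formula can be sketched in a few lines of NumPy. One assumption goes beyond the formula as written: the code subtracts the largest score before exponentiating. This is a common numerical-stability trick that leaves the result unchanged, because the shift cancels between numerator and denominator. The function name `softmax` and the sample scores are illustrative only.

```python
import numpy as np

def softmax(scores):
    """Map a vector of K scores to K probabilities that sum to 1."""
    shifted = scores - np.max(scores)    # stability shift; cancels in the ratio
    exps = np.exp(shifted)
    return exps / np.sum(exps)

p_hat = softmax(np.array([2.0, 1.0, 0.0]))
print(p_hat, p_hat.sum())
```

Without the shift, very large scores (for example 1000) would overflow in `np.exp`; with it, the largest exponent is always 0.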

c. ***Softmax regression classifier prediction***
- This function finds the largest entry in the probability vector and returns the corresponding class number $k$
<br><br>
$\hat{y} = \underset{k}{\operatorname{argmax}} \; \langle \hat{p}_1, \hat{p}_2, ..., \hat{p}_K \rangle$

where <br>
$\hat{y}$ = predicted class number for instance $\mathbf{x}$
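
A small sketch of the prediction step, assuming the probabilities are already available as a NumPy array (the values below are invented). Since `np.argmax` returns a 0-based index while the classes here are numbered $1, ..., K$, the code adds 1.

```python
import numpy as np

p_hat = np.array([0.1, 0.7, 0.2])   # hypothetical probabilities for K = 3 classes
y_hat = int(np.argmax(p_hat)) + 1   # shift the 0-based index to a class number in 1..K
print(y_hat)
```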

### 3. Cost function

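Cross-entropy cost without regularization:

$J(\Theta) = - \frac{1}{m} \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)}) $

With $\ell_1$ regularization:

$J(\Theta) = - \frac{1}{m} \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)}) + C \sum\limits_{j=1}^{n} \sum\limits_{k=1}^{K} |\theta_j^{(k)}| $

With $\ell_2$ regularization:

$J(\Theta) = - \frac{1}{m} \sum\limits_{i=1}^{m} \sum\limits_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)}) + \frac{C}{2} \sum\limits_{j=1}^{n} \sum\limits_{k=1}^{K} (\theta_j^{(k)})^2 $

where <br>
$m$ = number of training examples, <br>
$y_k^{(i)}$ = 1 if the $i$-th instance belongs to class $k$, otherwise 0, <br>
$C$ = regularization strength (the bias terms $\theta_0^{(k)}$ are not penalized, since the sum starts at $j = 1$)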
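
The cost formulas above can be sketched as one NumPy function. It assumes the targets are one-hot encoded in an $m \times K$ matrix `Y`, reads the second and third formulas as $\ell_1$ and $\ell_2$ penalties, and adds a small `eps` inside the logarithm to avoid `log(0)`. The name `cross_entropy_cost`, the `penalty` argument, and `eps` are choices made for this illustration rather than anything defined in the notes.

```python
import numpy as np

def cross_entropy_cost(P_hat, Y, Theta=None, C=0.0, penalty=None):
    """Mean cross-entropy over m examples, optionally plus an l1 or l2 penalty.

    P_hat : (m, K) predicted probabilities
    Y     : (m, K) one-hot targets
    Theta : (n+1, K) parameters; row 0 (bias) is left out of the penalty
    """
    m = P_hat.shape[0]
    eps = 1e-12                                   # assumption: guards against log(0)
    cost = -np.sum(Y * np.log(P_hat + eps)) / m
    if penalty == "l1":
        cost += C * np.sum(np.abs(Theta[1:]))     # sum over j = 1..n, k = 1..K
    elif penalty == "l2":
        cost += (C / 2) * np.sum(Theta[1:] ** 2)
    return cost

# Hypothetical data: m = 2 examples, K = 3 classes.
P_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
base = cross_entropy_cost(P_hat, Y)
print(base)
```

Because `Y` is one-hot, only the log-probability of the true class contributes for each example, so `base` equals $-(\log 0.7 + \log 0.8)/2$ here.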

### 4. Extending to $m$ training examples

$\mathbf{X} =
\begin{bmatrix}
 --- (\mathbf{x}^{(1)})^T ---\\ 
 --- (\mathbf{x}^{(2)})^T ---\\ 
 \vdots \\
 --- (\mathbf{x}^{(m)})^T ---\\ 
\end{bmatrix}$

$
\Theta =
\left[\begin{array}{cccc}| & | & | & | \\
\theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \\
| & | & | & |
\end{array}\right]$

a. ***Compute softmax score / logits***

$
S(\mathbf{X}) = 
\mathbf{X} \cdot \Theta =
\begin{bmatrix}
 {(\mathbf{x}^{(1)})}^T\cdot\theta^{(1)}  & {(\mathbf{x}^{(1)})}^T\cdot\theta^{(2)} & ... & {(\mathbf{x}^{(1)})}^T\cdot\theta^{(K)}\\ 
 {(\mathbf{x}^{(2)})}^T\cdot\theta^{(1)}  & {(\mathbf{x}^{(2)})}^T\cdot\theta^{(2)} & ... & {(\mathbf{x}^{(2)})}^T\cdot\theta^{(K)}\\ 
 {(\mathbf{x}^{(3)})}^T\cdot\theta^{(1)}  & {(\mathbf{x}^{(3)})}^T\cdot\theta^{(2)} & ... & {(\mathbf{x}^{(3)})}^T\cdot\theta^{(K)}\\ 
 \vdots  & \vdots  & \ddots & \vdots\\
 {(\mathbf{x}^{(m)})}^T\cdot\theta^{(1)}  & {(\mathbf{x}^{(m)})}^T\cdot\theta^{(2)} & ... & {(\mathbf{x}^{(m)})}^T\cdot\theta^{(K)}\\ 
\end{bmatrix} =
\begin{bmatrix}
 s_1(\mathbf{x}^{(1)}) & s_2(\mathbf{x}^{(1)}) & ... & s_K(\mathbf{x}^{(1)})\\ 
 s_1(\mathbf{x}^{(2)}) & s_2(\mathbf{x}^{(2)}) & ... & s_K(\mathbf{x}^{(2)})\\ 
 s_1(\mathbf{x}^{(3)}) & s_2(\mathbf{x}^{(3)}) & ... & s_K(\mathbf{x}^{(3)})\\ 
 \vdots  & \vdots  & \ddots & \vdots\\
 s_1(\mathbf{x}^{(m)}) & s_2(\mathbf{x}^{(m)}) & ... & s_K(\mathbf{x}^{(m)})\\ 
\end{bmatrix}
$
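
To make the shapes concrete, here is a sketch of the batched score computation. The sizes and the random data are placeholders, and prepending a column of ones to form $\mathbf{X}$ is an assumption about how the bias feature $x_0$ is supplied.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 4, 2, 3                                   # hypothetical sizes
features = rng.normal(size=(m, n))
X = np.hstack([np.ones((m, 1)), features])          # prepend the bias column x_0 = 1
Theta = rng.normal(size=(n + 1, K))

S = X @ Theta                                       # (m, n+1) @ (n+1, K) -> (m, K)
print(S.shape)
```

Entry `S[i, k]` is the dot product of row `i` of `X` with column `k` of `Theta`, which matches the element-wise form of the matrix above.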

b. ***Apply `Softmax function` to logits***

$
\hat{\mathbf{P}} =
\exp(S(\mathbf{X})) \div \operatorname{rowSum}(\exp(S(\mathbf{X}))) \\   = 
\begin{bmatrix}
 \exp(s_1(\mathbf{x}^{(1)})) & 
 \exp(s_2(\mathbf{x}^{(1)})) & ... & 
 \exp(s_K(\mathbf{x}^{(1)})) \\ 
 \exp(s_1(\mathbf{x}^{(2)})) & 
 \exp(s_2(\mathbf{x}^{(2)})) & ... & 
 \exp(s_K(\mathbf{x}^{(2)})) \\ 
 \exp(s_1(\mathbf{x}^{(3)})) & 
 \exp(s_2(\mathbf{x}^{(3)})) & ... & 
 \exp(s_K(\mathbf{x}^{(3)})) \\ 
 \vdots & \vdots  & \ddots & \vdots\\
 \exp(s_1(\mathbf{x}^{(m)})) & 
 \exp(s_2(\mathbf{x}^{(m)})) & ... & 
 \exp(s_K(\mathbf{x}^{(m)})) \\ 
\end{bmatrix} \div
\begin{bmatrix}
\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(1)})) \\
\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(2)})) \\
\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(3)})) \\
\vdots \\
\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(m)})) \\
\end{bmatrix} \\    =
\begin{bmatrix}
 \frac{\exp(s_1(\mathbf{x}^{(1)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(1)}))} & 
 \frac{\exp(s_2(\mathbf{x}^{(1)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(1)}))} & ... & 
 \frac{\exp(s_K(\mathbf{x}^{(1)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(1)}))} \\ 
 \frac{\exp(s_1(\mathbf{x}^{(2)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(2)}))} & 
 \frac{\exp(s_2(\mathbf{x}^{(2)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(2)}))} & ... & 
 \frac{\exp(s_K(\mathbf{x}^{(2)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(2)}))} \\ 
 \frac{\exp(s_1(\mathbf{x}^{(3)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(3)}))} & 
 \frac{\exp(s_2(\mathbf{x}^{(3)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(3)}))} & ... & 
 \frac{\exp(s_K(\mathbf{x}^{(3)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(3)}))} \\ 
 \vdots  & \vdots  & \ddots & \vdots\\
 \frac{\exp(s_1(\mathbf{x}^{(m)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(m)}))} & 
 \frac{\exp(s_2(\mathbf{x}^{(m)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(m)}))} & ... & 
 \frac{\exp(s_K(\mathbf{x}^{(m)}))}{\sum\limits_{j=1}^{K}  \exp(s_j(\mathbf{x}^{(m)}))} \\ 
\end{bmatrix} \\    =
\begin{bmatrix}
 \hat{{p}}_1^{(1)} & \hat{{p}}_2^{(1)} & ... & \hat{{p}}_K^{(1)} \\ 
 \hat{{p}}_1^{(2)} & \hat{{p}}_2^{(2)} & ... & \hat{{p}}_K^{(2)} \\ 
 \vdots  & \vdots  & \ddots & \vdots\\ 
 \hat{{p}}_1^{(m)} & \hat{{p}}_2^{(m)} & ... & \hat{{p}}_K^{(m)} \\ 
\end{bmatrix}
$
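
The row-by-row division above maps onto NumPy broadcasting. In the sketch below the sum runs over the class axis (`axis=1`), and `keepdims=True` keeps it as an $m \times 1$ column so that each row is divided by its own total. The max-subtraction is an added numerical-stability step that does not appear in the formula, and `softmax_rows` is a name chosen for this example.

```python
import numpy as np

def softmax_rows(S):
    """Apply softmax to each row of an (m, K) logit matrix."""
    shifted = S - S.max(axis=1, keepdims=True)      # per-row shift; cancels in the ratio
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)   # (m, K) / (m, 1) broadcasts per row

S = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
P_hat = softmax_rows(S)
print(P_hat.sum(axis=1))
```

Summing over `axis=0` instead would normalize each class across the examples, which is not what the formula describes.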

c. ***Make predictions by choosing the class with the highest probability***

$
\hat{y} = \underset{k}{\operatorname{argmax}} (\hat{\mathbf{P}}) \\ = 
\underset{k}{\operatorname{argmax}} (\begin{bmatrix}
 --- & \hat{{p}}^{(1)} & --- \\ 
 --- & \hat{{p}}^{(2)} & --- \\ 
 & \vdots & \\ 
 --- & \hat{{p}}^{(m)} & --- \\ 
\end{bmatrix}) \\ =
\begin{bmatrix}
\hat{y}^{(1)} \\
\hat{y}^{(2)} \\
\vdots \\
\hat{y}^{(m)} \\
\end{bmatrix}
$
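
Steps a to c can be chained into one prediction routine, sketched below under a few assumptions: `X` already contains the bias column, classes are numbered from 1, and the parameter values are hand-picked toy numbers. The function name `predict` is hypothetical.

```python
import numpy as np

def predict(X, Theta):
    """Return 1-based class labels for each row of X (bias column included)."""
    S = X @ Theta                                    # a. logits, shape (m, K)
    S = S - S.max(axis=1, keepdims=True)             # stability shift (not in the formulas)
    exps = np.exp(S)
    P_hat = exps / exps.sum(axis=1, keepdims=True)   # b. row-wise softmax
    return np.argmax(P_hat, axis=1) + 1              # c. most probable class per row

# Hypothetical toy data: m = 3 instances, n = 2 features, K = 3 classes.
X = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 2.0],
              [1.0, -1.0, -1.0]])
Theta = np.array([[0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
y_hat = predict(X, Theta)
print(y_hat)
```

Since softmax preserves the ordering of the scores within a row, taking the argmax of the logits directly would give the same labels; the softmax step matters when the probabilities themselves are needed.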