# Softmax Logistic Regression
Multinomial logistic regression handles more than two classes; each sample still belongs to exactly one class.

## The Model



## Implementation of the Feed-Forward Pass

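```python
import numpy as np

def softmax(zet):
    # Subtract each row's maximum for numerical stability. This leaves the
    # result unchanged and, unlike `zet -= ...`, does not modify the input.
    zet = zet - np.max(zet, axis=1, keepdims=True)
    exp_zet = np.exp(zet)
    return exp_zet / np.sum(exp_zet, axis=1, keepdims=True)

Z = Xb.dot(W)   # logits, one row per sample
Y = softmax(Z)  # class probabilities, each row sums to 1
```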
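The subtraction of the maximum inside `softmax` is there for numerical stability. The following is a minimal sketch of why it matters; the names `softmax_naive` and `softmax_stable` and the example logits are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def softmax_naive(z):
    # Direct translation of the formula; np.exp overflows for large logits.
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_stable(z):
    # Subtract each row's maximum first. Softmax is invariant to adding a
    # constant to every entry of a row, so the result is unchanged.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

Z_small = np.array([[1.0, 2.0, 3.0]])
Z_large = Z_small + 1000.0  # same logits, shifted by a constant

# exp(1001) overflows to inf, so the naive version returns nan (inf / inf).
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(Z_large)

stable_small = softmax_stable(Z_small)
stable_large = softmax_stable(Z_large)  # identical to stable_small
```

Both stable results agree because shifting all logits of a row by the same constant cancels in the ratio.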

## The Loss Function
Multinomial logistic regression has a slightly different loss function than binary logistic regression because it uses the softmax rather than the sigmoid. For all training samples $\{x_n\}$ and classes $k$, define the target indicator:
$t_{nk} = 1$ if the target for sample $n$ is class $k$, and $t_{nk} = 0$ otherwise.
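The indicator $t_{nk}$ is a one-hot encoding of the class labels. A minimal sketch of building the target matrix, assuming integer labels in `0..K-1` (the helper `one_hot` and the example labels are hypothetical):

```python
import numpy as np

def one_hot(labels, num_classes):
    # Row n gets a single 1 in the column of its class label.
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

labels = np.array([0, 2, 1, 2])      # hypothetical integer class labels
T = one_hot(labels, num_classes=3)   # shape (4, 3), one 1 per row
```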

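The likelihood of the targets under the model is the product of the probabilities assigned to the correct classes:

$$L_{\mathbf{W}, b} \stackrel{\text{def}}{=} \prod_{n=1}^N \prod_{k=1}^K y_{nk}^{t_{nk}}$$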

The loss to be minimised is the negative log-likelihood, also called the categorical cross-entropy loss:

$$L_{CE}\stackrel{\text{def}}{=} - \sum_{n=1}^N \sum_{k=1}^K t_{nk} \log y_{nk} $$

The more wrong $y_{nk}$ is, the larger the loss. Consider a single sample:

* Exactly right (100\% probability on correct target) $\; \rightarrow \; L_{CE}=-1\cdot\log(1)=0$
* 50\% probability on correct target $\; \rightarrow \; L_{CE}=-1\cdot\log(0.5)\approx 0.693$
* 25\% probability on correct target $\; \rightarrow \; L_{CE}=-1\cdot\log(0.25)\approx 1.386$
* 0\% probability on correct target $\; \rightarrow \; L_{CE}=-1\cdot\log(0)\rightarrow\infty$
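The single-sample values above can be reproduced numerically. This is a small sketch, assuming a one-hot target matrix `T` and a probability matrix `Y` of the same shape; the function name `cross_entropy` and the way the remaining probability is split are illustrative choices.

```python
import numpy as np

def cross_entropy(T, Y):
    # L_CE = -sum_n sum_k t_nk * log(y_nk). Only entries with t_nk = 1
    # contribute, so we index them directly.
    return -np.sum(np.log(Y[T == 1]))

T = np.array([[1.0, 0.0, 0.0]])   # one sample whose correct class is 0

losses = {}
for p in (1.0, 0.5, 0.25):
    rest = (1.0 - p) / 2.0        # split remaining mass over the other classes
    Y = np.array([[p, rest, rest]])
    losses[p] = cross_entropy(T, Y)
```

The loss depends only on the probability given to the correct class, not on how the remaining mass is distributed.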


Equivalently, maximise the log-likelihood:

$$J = \sum_{n=1}^N \sum_{k=1}^K t_{nk} \log y_{nk}$$

## Optimisation
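There is no closed-form solution for the optimal $W$ in softmax logistic regression, so we find the partial derivatives of $J$ with respect to each $W_{dk}$ and optimise iteratively. Because $J$ is maximised, the update is gradient ascent on $J$, which is the same as gradient descent on $L_{CE} = -J$.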

$$z=W^Tx$$
$$y=\mathrm{softmax}(z)$$
$$y_{nk} = \frac{e^{z_{nk}}}{\sum_{k'=1}^K e^{z_{nk'}}}$$

$$J = \sum_{n=1}^N \sum_{k=1}^K t_{nk} \log y_{nk}$$

Partial derivative with respect to the weight $W_{dk}$ that connects $x_d$ with class $k$:
$$\frac{\partial J}{\partial W_{dk}} = \sum_{n=1}^N \sum_{i=1}^K \frac{\partial J_{ni}}{\partial y_{ni}} \frac{\partial y_{ni}}{\partial z_{nk}} \frac{\partial z_{nk}}{\partial W_{dk}}$$

$$J_{ni} = t_{ni} \log y_{ni}$$
$$\frac{\partial J_{ni}}{\partial y_{ni}} = \frac{t_{ni}}{y_{ni}}$$

$$\begin{aligned}
y_{ni} &=\frac{e^{z_{ni}}}{\sum_j e^{z_{nj}}}\\
&= e^{z_{ni}}\left(\sum_j e^{z_{nj}}\right)^{-1}
\end{aligned}$$

$$\begin{aligned}
\frac{\partial y_{ni}}{\partial z_{nk}} &= (-1)  e^{z_{ni}} \left(\sum_j e^{z_{nj}}\right)^{-2}  e^{z_{nk}} \; \; \; \; \text{if} \; i\neq k\\
&= (-1) \frac{e^{z_{ni}}}{\sum_j e^{z_{nj}}} \frac{e^{z_{nk}}}{\sum_j e^{z_{nj}}}\\
&= -y_{ni}y_{nk}\\
\end{aligned}$$

$$\begin{aligned}
\frac{\partial y_{ni}}{\partial z_{nk}} &=  e^{z_{nk}} \left(\sum_j e^{z_{nj}}\right)^{-1} -  e^{z_{nk}} \left(\sum_j e^{z_{nj}}\right)^{-2} e^{z_{nk}} \; \; \; \; \text{if} \; i= k\\
&= y_{nk} - y_{nk}^2\\
&= y_{ni}(1-y_{nk}) \; \; \; \; \text{since} \; i=k\\
\end{aligned}$$

with
$$\begin{aligned}
\delta_{ki} &= 1 \; \; \; \; \text{if} \; i= k \\
\delta_{ki} &= 0 \; \; \; \; \text{if} \; i\neq k
\end{aligned}$$
the expression simplifies to
$$\frac{\partial y_{ni}}{\partial z_{nk}} = y_{ni}(\delta_{ki}-y_{nk})$$
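As a sanity check on this result, the analytic softmax Jacobian for a single sample can be compared with a finite-difference estimate. This is a sketch only; the helper names and the example logits are assumptions made for illustration.

```python
import numpy as np

def softmax_vec(z):
    # Softmax of a single logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # Entry (i, k) is y_i * (delta_ik - y_k), i.e. diag(y) - y y^T.
    y = softmax_vec(z)
    return np.diag(y) - np.outer(y, y)

def numerical_jacobian(z, eps=1e-6):
    # Central differences: column k holds d y / d z_k.
    J = np.zeros((z.size, z.size))
    for k in range(z.size):
        step = np.zeros_like(z)
        step[k] = eps
        J[:, k] = (softmax_vec(z + step) - softmax_vec(z - step)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])   # arbitrary logits for one sample
max_abs_diff = np.max(np.abs(softmax_jacobian(z) - numerical_jacobian(z)))
```

If the derivation holds, `max_abs_diff` should be close to zero, limited only by floating-point error in the finite differences.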


$$z_{nk} = W_{:,k}^{T}x_n$$
$$\frac{\partial z_{nk}}{\partial W_{dk}} = x_{nd}$$

Combine and simplify, using $\sum_{i=1}^K t_{ni} = 1$ (each sample has exactly one target class) in the last step:
$$\begin{aligned}
\frac{\partial J}{\partial W_{dk}} &= \sum_{n=1}^N \sum_{i=1}^K \frac{t_{ni}}{y_{ni}} y_{ni}(\delta_{ki}-y_{nk}) x_{nd}\\
&= \sum_{n=1}^N \sum_{i=1}^K t_{ni}(\delta_{ki}-y_{nk}) x_{nd}\\
&= \sum_{n=1}^N x_{nd}\left( \sum_{i=1}^K t_{ni}\delta_{ki}- \sum_{i=1}^K t_{ni} y_{nk} \right)\\
&= \sum_{n=1}^N \left(t_{nk} - y_{nk}\right) x_{nd}
\end{aligned}$$
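The final expression can be checked numerically against finite differences of $J$. The sketch below uses small random data; the sizes, the seed, and the helper names are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(W, X, T):
    # J = sum_n sum_k t_nk * log(y_nk)
    return np.sum(T * np.log(softmax(X @ W)))

rng = np.random.default_rng(0)
N, D, K = 6, 4, 3                           # hypothetical sizes
X = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]   # one-hot targets
W = rng.normal(size=(D, K))

# Analytic gradient: dJ/dW_dk = sum_n (t_nk - y_nk) * x_nd
Y = softmax(X @ W)
grad_analytic = np.einsum("nd,nk->dk", X, T - Y)

# Numerical gradient by central differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(W)
for d in range(D):
    for k in range(K):
        step = np.zeros_like(W)
        step[d, k] = eps
        grad_numeric[d, k] = (log_likelihood(W + step, X, T)
                              - log_likelihood(W - step, X, T)) / (2 * eps)

max_abs_diff = np.max(np.abs(grad_analytic - grad_numeric))
```

A small `max_abs_diff` indicates that the derived gradient matches the slope of $J$ in every weight direction.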

In matrix form:
$$\nabla J=X^T(T-Y)$$

In code:
```python
X.T.dot(T-Y)
```
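Putting the pieces together, a training loop might look like the sketch below. It assumes full-batch gradient ascent on $J$, a bias handled by appending a column of ones, and synthetic toy data; the learning rate and number of epochs are untuned guesses, not values from the original notebook.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 150, 2, 3

# Hypothetical toy data: three Gaussian blobs, one per class
centres = np.array([[0.0, 3.0], [-3.0, -2.0], [3.0, -2.0]])
labels = rng.integers(0, K, size=N)
X = centres[labels] + rng.normal(size=(N, D))
X = np.hstack([X, np.ones((N, 1))])   # bias column (assumed convention)
T = np.eye(K)[labels]                 # one-hot targets

W = np.zeros((D + 1, K))
learning_rate = 1e-3                  # assumed value, not tuned
history = []
for epoch in range(500):
    Y = softmax(X @ W)
    history.append(np.sum(T * np.log(Y)))   # log-likelihood J
    W += learning_rate * X.T.dot(T - Y)     # ascent: step along +grad J

accuracy = np.mean(np.argmax(softmax(X @ W), axis=1) == labels)
```

The update adds the gradient because $J$ is being maximised; minimising $L_{CE}$ instead would subtract the gradient of $L_{CE}$, which gives the same step.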