# ELLE - Elastic Learning

## Basics (can be ignored)

In [ ]:
import numpy as np

### Softmax and Cross-Entropy

We use the softmax output function to compute class probabilities $p_k$ for each example. In a supervised classification setting with $K$ classes, the probability that an example $x$ belongs to class $C_k$ is given a priori by $y_k = P(C_k|x)$. We measure the deviation of the predicted values $p_k$ from the target values $y_k$ by means of the cross-entropy error $L(x,y)$. The derivatives of the cross-entropy error with respect to the inputs of the softmax are given below. For implementation purposes, it is much more efficient to use the closed-form derivative $\frac{\partial L(softmax(x),y)}{\partial x}$ directly than to compute the gradient $\frac{\partial L(p, y)}{\partial p}$ and multiply it with the Jacobian $\frac{\partial softmax(x)}{\partial x}$, since the closed form avoids building the $K \times K$ Jacobian.

$$p_k(\mathbf{x}) = softmax_k(\mathbf{x}) = \frac{e^{x_k}}{\sum_j e^{x_j}}$$

if $k = i$:

\begin{align}
\frac{\partial p_k}{\partial x_i} &= \frac{e^{x_k}}{\sum_j e^{x_j}} + e^{x_k} \cdot \left(- \frac{1}{(\sum_j e^{x_j})^2}\right) \cdot e^{x_i} \\
 &= \frac{e^{x_k}}{\sum_j e^{x_j}} - \frac{e^{x_k}}{\sum_j e^{x_j}} \cdot \frac{e^{x_i}}{\sum_j e^{x_j}} \\
 &= p_k - p_k p_i\\
 &= p_k (1 - p_i)
\end{align}

if $k \ne i$:

\begin{align}
\frac{\partial p_k}{\partial x_i} &= e^{x_k} \cdot \left(- \frac{1}{(\sum_j e^{x_j})^2}\right) \cdot e^{x_i} \\
 &= -p_k p_i
\end{align}
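
The two cases above combine into the matrix $\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$. The following is a minimal sketch, not part of the original notebook, that builds this Jacobian with NumPy and compares it against central finite differences. The function names `softmax` and `softmax_jacobian`, the test vector and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_jacobian(x):
    # J[k, i] = d p_k / d x_i = p_k * (delta_ki - p_i)
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

# Arbitrary example input (assumption, chosen only for illustration).
x = np.asarray([1.0, 2.0, 3.0])
J = softmax_jacobian(x)

# Central finite differences as an independent check of the analytic Jacobian.
eps = 1e-6
J_num = np.zeros((x.size, x.size))
for i in range(x.size):
    d = np.zeros(x.size)
    d[i] = eps
    J_num[:, i] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-8))
```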
$$L(\mathbf{x}, \mathbf{y}) = - \sum_k y_k \cdot \log p_k(\mathbf{x})$$
\begin{align}
\frac{\partial L}{\partial x_i} &= - \sum_k y_k \cdot \frac{1}{p_k} \cdot \frac{\partial p_k}{\partial x_i} \\
 &= - y_i (1 - p_i) + \sum_{k \ne i} y_k p_i \\
 &= - y_i + \sum_k y_k p_i \\
 &= - y_i + p_i \cdot \underbrace{\sum_k y_k}_{= 1} \\
 &= p_i - y_i
\end{align}
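
As a sanity check of the result $p_i - y_i$, the sketch below compares the closed-form gradient with the explicit chain-rule product of $\frac{\partial L}{\partial p}$ and the softmax Jacobian. It is an illustration under stated assumptions rather than the notebook's own implementation: the function names, the input `x` and the one-hot target `y` are made up for the example, and the closed form relies on $\sum_k y_k = 1$.

```python
import numpy as np

def softmax(x):
    # Shift by the maximum for numerical stability; the output is unchanged.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def cross_entropy(x, y):
    # L(x, y) = -sum_k y_k * log p_k(x)
    return -np.sum(y * np.log(softmax(x)))

def cross_entropy_grad(x, y):
    # Closed form dL/dx = p - y (valid when the targets sum to 1).
    return softmax(x) - y

def cross_entropy_grad_chain(x, y):
    # Explicit chain rule: dL/dx_i = sum_k (dL/dp_k) * (dp_k/dx_i)
    p = softmax(x)
    dL_dp = -y / p
    jac = np.diag(p) - np.outer(p, p)  # jac[k, i] = dp_k/dx_i
    return jac.T.dot(dL_dp)

# Illustrative input and one-hot target (assumptions).
x = np.asarray([0.5, -1.0, 2.0])
y = np.asarray([0.0, 0.0, 1.0])

g_closed = cross_entropy_grad(x, y)
g_chain = cross_entropy_grad_chain(x, y)
print(np.allclose(g_closed, g_chain))
```

The closed form needs a single softmax evaluation and a subtraction, whereas the chain-rule version builds the full $K \times K$ Jacobian first.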

In [ ]:
a = np.asarray([4,5,6])
print(a)

In [ ]:
b = np.asarray([7,10,9])
print(b)