# Softmax

The softmax function is commonly used as the final layer in a neural network and in 'softmax' or multinomial logistic regression.

Recall from your experience with logistic regression that the output lies in the interval [0, 1] and can be interpreted as the probability that an observation belongs to class 1 of a binary label.

Multinomial logistic regression is a generalization of logistic regression that allows the classifier to predict the probability of membership in each of several classes, labeled 0 through k here.

Imagine a set of logistic regressions, one per class, each predicting membership in its own class in {0...k}:


$$ f(x) = \begin{Bmatrix}
            Pr(Y = 0) = sigmoid(\theta_{0} * X) \\
            Pr(Y = 1) = sigmoid(\theta_{1} * X) \\
            Pr(Y = 2) = sigmoid(\theta_{2} * X) \\
            Pr(Y = k) = sigmoid(\theta_{k} * X)
           \end{Bmatrix}$$
           
An obvious problem with this method is that each logistic regression produces its output independently of the others, so the outputs need not sum to 1. This keeps us from using them as an overall probability distribution over the classes. Softmax squeezes the outputs of all these logistic regressions so that they sum to 1 and can be used as overall class membership probabilities.

$$ Pr(Y=k) =softmax \begin{Bmatrix}
            Pr(Y = 0) = sigmoid(\theta_{0} * X) \\
            Pr(Y = 1) = sigmoid(\theta_{1} * X) \\
            Pr(Y = 2) = sigmoid(\theta_{2} * X) \\
            Pr(Y = k) = sigmoid(\theta_{k} * X)
           \end{Bmatrix}$$
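To make the independence problem concrete, here is a minimal sketch. The parameter matrix `theta` and the observation `x` are made-up values chosen only for illustration; they do not come from any fitted model.

```python
import numpy as np

def sigmoid(t):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical parameters: one row of theta per class (3 classes, 2 features).
theta = np.array([[ 1.0, -0.5],
                  [ 0.3,  0.8],
                  [-1.2,  0.4]])
x = np.array([2.0, 1.0])  # a single made-up observation

scores = theta @ x             # one linear score per class
independent = sigmoid(scores)  # each entry lies in (0, 1) on its own
total = independent.sum()      # nothing forces this to equal 1

print(independent, total)
```

Each entry of `independent` is a valid probability by itself, but the entries taken together are not a distribution over the three classes, because their sum is not constrained.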
           
           
Mathematically, softmax looks like this:

$$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} $$ for j = 1...K, where z is the vector of K outputs being normalized.
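
As a quick worked example (the vector is made up for illustration), take z = (1, 2, 3). Then e^1 ≈ 2.718, e^2 ≈ 7.389 and e^3 ≈ 20.086, which sum to about 30.193, so:

$$ \mathrm{softmax}(z) \approx \left( \frac{2.718}{30.193}, \frac{7.389}{30.193}, \frac{20.086}{30.193} \right) \approx (0.090, 0.245, 0.665) $$

The three values sum to 1, and the largest input receives the largest share.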



Let's pretend we have the outputs from 5 logistic regressions that look like this:
           


In [7]:
import numpy as np
import math
z = np.array([0.9, 0.8, 0.2, 0.1, 0.5])

In [5]:
sum(z)

2.5

These numbers obviously don't add up to 1. I would interpret this as the first two regressions (0.9 and 0.8) strongly believing an observation is a member of their classes, the third and fourth (0.2 and 0.1) strongly believing against class membership, and the fifth (0.5) being undecided.

Let's apply softmax to these outputs:

In [8]:

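def softmax(z):
    # Exponentiate each output so every term is positive
    z_exp = [math.exp(x) for x in z]
    # Normalizing constant: the sum of all exponentiated outputs
    sum_z_exp = sum(z_exp)
    # Divide each term by the sum; round to 3 decimals for display
    probs = [round(i / sum_z_exp, 3) for i in z_exp]
    return probs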

In [9]:
softmax(z)

[0.284, 0.257, 0.141, 0.128, 0.19]

In [10]:
sum(softmax(z))

1.0
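
The list-based version above is fine for small inputs. As a hedged alternative, here is a vectorized sketch in NumPy that also subtracts the largest input before exponentiating. That shift is a common trick rather than something the example above requires: it leaves the result unchanged mathematically but avoids overflow when inputs are large. The name `softmax_stable` is my own and is not a library function.

```python
import numpy as np

def softmax_stable(z):
    """Softmax of a 1-D array, shifted by its max to avoid overflow."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()      # largest entry becomes 0, so exp() <= 1
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([0.9, 0.8, 0.2, 0.1, 0.5])
probs = softmax_stable(z)      # unrounded version of the result above

# Inputs this large would overflow a naive exp(); the shifted form does not.
big = softmax_stable(np.array([1000.0, 1001.0, 1002.0]))

print(np.round(probs, 3), big)
```

Rounding is left to the caller here, so the returned values keep full precision and can be reused in later computations.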

Some neural networks, such as those used in vision tasks, predict membership across thousands of classes. Using softmax as the last layer squeezes the network's outputs into a single probability distribution, which tells us the probability that an observation belongs to each class.
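
To illustrate the many-class case, the sketch below applies softmax to 1000 scores and lists the five most probable classes. The scores are random numbers standing in for the outputs of a real network, so the class indices it prints carry no meaning.

```python
import numpy as np

# Assumption: 1000 random scores stand in for a real network's outputs.
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)

# Softmax over all classes at once.
exp_scores = np.exp(scores)
probs = exp_scores / exp_scores.sum()

# Indices of the five most probable classes, most probable first.
top5 = np.argsort(probs)[::-1][:5]
for class_id in top5:
    print(f"class {class_id}: p = {probs[class_id]:.4f}")
```

Because all 1000 probabilities share a total of 1, even the top class may receive a small probability when the scores are close together.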