## Softmax

Softmax is a way of turning a model's raw class scores into probabilities in classification.

A deep learning model, or a regression with several possible outcomes, does not directly "choose" which outcome is being predicted. Rather, it estimates a probability for each outcome and then picks the most probable one.

So if you are predicting digits, you are actually predicting the probability that the writing you are analyzing is each possible digit (0, 1, 2, ...) and then choosing the answer with the highest probability.

Softmax is the mathematical way we implement this.

Let's index the classes by $k$. Assuming we are classifying an iris flower, there are 3 possible values of $k$, one per class: Virginica, Versicolor, and Setosa.

For each instance $\mathbf{x}$ of the training data, we compute a linear score for each class $k$, using that class's own parameter vector $\boldsymbol{\theta}^{(k)}$:
$$ s_k(\mathbf{x}) = \mathbf{x}^\intercal\boldsymbol{\theta}^{(k)} $$
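
As a minimal sketch of this scoring step, the snippet below computes one score per class for a single instance. The values in `Theta` and `x` are made up for illustration (they are not fitted to the iris data), and the bias term is left out to match the formula above.

```python
import numpy as np

# Hypothetical parameters: one column theta^(k) per class
# (2 features, K = 3 classes). These are NOT fitted values.
Theta = np.array([[ 1.0, 0.5, -1.0],
                  [-2.0, 0.0,  2.0]])

# A single instance with two features (illustrative values).
x = np.array([5.0, 2.0])

# s_k(x) = x^T theta^(k), computed for all classes at once.
scores = x @ Theta
print(scores)
```

Stacking the per-class parameter vectors as columns lets a single matrix product return the whole score vector $\mathbf{s}(\mathbf{x})$.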

This gives each potential class a score. However, these scores are just logits (unnormalized log-probabilities), so they will not add up to 1. We want the probability that each class is correct, and those probabilities should add up to 1 across all classes. This is what the softmax function does.

$$ \hat{P}_k = \sigma(\mathbf{s}(\mathbf{x}))_k = \frac{\exp(s_k(\mathbf{x}))}{\sum^{K}_{j=1}\exp(s_j(\mathbf{x}))} $$

Each score is exponentiated and then divided by the sum of all the exponentiated scores, which makes every value positive and makes them add up to 1, like "percents".

* $K$ is the number of classes
* $\mathbf{s}(\mathbf{x})$ is the vector of scores for each class for the instance $\mathbf{x}$
* $\sigma(\mathbf{s}(\mathbf{x}))_k$ is the predicted probability that $\mathbf{x}$ belongs to class $k$, given $\mathbf{s}(\mathbf{x})$
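
The softmax formula can be sketched in a few lines of NumPy. The function name `softmax` and the example scores are assumptions for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula itself; it leaves the result unchanged because the same factor cancels in the numerator and denominator.

```python
import numpy as np

def softmax(scores):
    """Map a 1-D vector of class scores to probabilities that sum to 1."""
    # Shifting by the max does not change the result but avoids overflow in exp.
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([1.0, 2.5, -1.0])  # illustrative scores, not from a fitted model
probs = softmax(scores)
print(probs, probs.sum())
```

Note that the ordering of the classes is preserved: the class with the largest score also gets the largest probability.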

So the Softmax regression classifier prediction is
$$ \hat{y} = \underset{k}{\operatorname{argmax}} \ \sigma(\mathbf{s}(\mathbf{x}))_k $$

$\operatorname{argmax}$ is a way of saying "return the argument that maximizes the function", rather than the maximum value itself.

So this says "return the $k$ that maximizes the predicted probability that $\mathbf{x}$ belongs to class $k$". In other words, which class is the highest probability choice?
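
A short sketch of the prediction step, using `np.argmax` on a made-up probability vector (the numbers and the `class_names` list are illustrative assumptions, not model output):

```python
import numpy as np

class_names = ["setosa", "versicolor", "virginica"]
probs = np.array([0.10, 0.25, 0.65])  # illustrative probabilities

k_hat = int(np.argmax(probs))  # index of the largest probability
p_max = float(np.max(probs))   # the largest probability itself

print(k_hat, class_names[k_hat], p_max)
```

Here `np.argmax` returns the class index, while `np.max` returns the probability at that index.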

## Loss Function

The loss function used to train models that output softmax probabilities is cross entropy loss. It penalizes the model when it assigns a low probability to the true class.
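
For a single instance whose true class is $y$, cross entropy reduces to $-\log \hat{P}_y$, and the training loss is usually the mean over all instances. The sketch below assumes integer class labels; the function name `cross_entropy` and the example arrays are hypothetical.

```python
import numpy as np

def cross_entropy(probs, y_true):
    """Mean cross entropy for predicted probabilities and integer labels.

    probs:  array of shape (m, K), each row sums to 1
    y_true: array of shape (m,) with class indices in 0..K-1
    """
    m = probs.shape[0]
    # Pick the predicted probability of the true class for each instance.
    p_true = probs[np.arange(m), y_true]
    return -np.mean(np.log(p_true))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
y_true = np.array([0, 2])
loss = cross_entropy(probs, y_true)
print(loss)
```

The loss is 0 only when the model puts probability 1 on the true class, and it grows as that probability shrinks.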

## Implementing

In [8]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()

In [9]:
# Let's look at the famous flower classification example
list(iris.keys())

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename']

In [14]:
# Use only petal length and petal width (columns 2 and 3) as features
X = iris['data'][:, (2,3)]
X[:5]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

In [15]:
y = iris['target']
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [16]:
from sklearn.linear_model import LogisticRegression
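# Older scikit-learn versions default to one-vs-rest for LogisticRegression, so we
# request multinomial (softmax) regression explicitly, with a solver that supports
# it such as 'lbfgs'. Recent releases deprecate the multi_class parameter.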
softmax_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10)
softmax_reg.fit(X, y)

LogisticRegression(C=10, multi_class='multinomial')

In [24]:
# Let's test a petal 5 cm long and 2 cm wide.
prediction = softmax_reg.predict([[5,2]])
prediction

array([2])

In [21]:
iris['target_names'][2]

'virginica'

It is "virginica". But we can also see all the probabilities after the softmax function is applied (and they should add to 1).

In [27]:
softmax_reg.predict_proba([[5,2]])[0]

array([6.38014896e-07, 5.74929995e-02, 9.42506362e-01])

Virginica = 94.25% 
Versicolor = 5.75%

In [28]:
dict(zip(iris['target_names'],softmax_reg.predict_proba([[5,2]])[0]))

{'setosa': 6.380148956081825e-07,
 'versicolor': 0.05749299952558009,
 'virginica': 0.9425063624595242}
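
As a quick sanity check, the probabilities printed above can be copied into a small script to confirm that they add up to 1 (up to rounding in the printed digits) and that the largest one corresponds to the predicted class. The `target_names` list is written out by hand here so the snippet runs without scikit-learn.

```python
import numpy as np

# Probabilities copied from the predict_proba output shown above.
probs = np.array([6.38014896e-07, 5.74929995e-02, 9.42506362e-01])
target_names = ["setosa", "versicolor", "virginica"]

total = probs.sum()                            # should be very close to 1
best = target_names[int(np.argmax(probs))]     # most probable class
print(total, best)
```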