## Softmax Regression (Multi-Class Logistic Regression)

So far, our classification problems have been binary: the outcome was 0 or 1, cat or not a cat. A generalization of logistic regression, called Softmax regression, lets us make predictions on multi-class problems.

$C = $ number of classes (the distinct categories we want to predict).

Let us say that $C=4$. In this case, we build a NN whose output layer has 4 units, one per class, so $n^{[L]} = 4 = C$. Each unit in the output layer gives the probability of one class given the input $X$, i.e. $P(\text{Class} \mid X)$. The output $\hat{y}$ therefore has dimension $(4,1)$.

In the final layer of the NN, we compute the linear part of the layer as usual, $z^{[L]} = w^{[L]} a^{[L-1]}+ b^{[L]}$. We then apply the **Softmax Activation Function** to $z^{[L]}$.
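As a rough illustration of the shapes involved in the linear step, here is a minimal NumPy sketch. The previous layer size `n_prev = 3`, the random initialization, and the variable names are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

C = 4        # number of classes, so n^{[L]} = 4
n_prev = 3   # hypothetical size of layer L-1, chosen only for illustration

W_L = rng.standard_normal((C, n_prev)) * 0.01   # weights of the output layer
b_L = np.zeros((C, 1))                          # biases of the output layer
a_prev = rng.standard_normal((n_prev, 1))       # stand-in for a^{[L-1]}

# Linear part of the final layer: one pre-activation per class
z_L = W_L @ a_prev + b_L
print(z_L.shape)   # (4, 1)
```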

Softmax Activation Function: 

- First, we compute a temporary variable $t$ by exponentiating $z^{[L]}$:
$$ t = e^{z^{[L]}} $$
    - This is an element-wise operation in Python, so $t$ has the same dimension as $z^{[L]}$, here $(4,1)$.
- After computing $t$, we normalize it so that its entries sum to 1 ($t$ is a vector):
$$  a^{[L]} =  \frac{e^{z^{[L]}}}{\sum_{j=1}^{C} t_j } $$
    - For a particular unit $i$ in the final layer, this reads:
$$ a_i^{[L]} =  \frac{ t_i }{\sum_{j=1}^{C} t_j } $$

The output $a_i^{[L]}$ is the predicted probability of class $i$. The entries of $a^{[L]} = \hat{y}$ are non-negative and sum to 1.
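The two steps above (exponentiate, then normalize) can be sketched in NumPy as follows. The input values are illustrative, and subtracting the column maximum before exponentiating is an added numerical safeguard that is not part of the notes; it leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Softmax over the class axis (axis 0) of a (C, m) array of pre-activations."""
    # Shifting by the per-column max avoids overflow in np.exp and
    # cancels out in the ratio, so the probabilities are unaffected.
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

# Illustrative z^{[L]} for C = 4 classes and a single example: shape (4, 1)
z_L = np.array([[5.0], [2.0], [-1.0], [3.0]])
a_L = softmax(z_L)

print(a_L.ravel())   # approximately [0.842 0.042 0.002 0.114]
print(a_L.sum())     # 1.0 up to floating-point rounding
```

The largest entry of $z^{[L]}$ receives the largest probability, but every class keeps a non-zero share.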

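Note that when softmax is applied directly to the input, with no hidden layers, the decision boundary between any two classes is linear: class $i$ is preferred over class $k$ exactly when $z_i > z_k$, and both are linear functions of the input. With hidden layers, the network can learn non-linear boundaries.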

## Training Softmax Classifier

A hard max classifier produces an output of the following form: the class with the largest $z^{[L]}$ gets probability 1, and all other classes get probability 0. Softmax is a gentler mapping from $z^{[L]}$ to probabilities.
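For contrast with softmax, a hard max can be sketched as below. The function name `hardmax` and the input values are chosen for illustration only.

```python
import numpy as np

def hardmax(z):
    """Put 1 at the largest entry of each column of a (C, m) array, 0 elsewhere."""
    out = np.zeros_like(z, dtype=float)
    # Row index of the maximum in every column, paired with the column index
    out[np.argmax(z, axis=0), np.arange(z.shape[1])] = 1.0
    return out

z = np.array([[5.0], [2.0], [-1.0], [3.0]])
h = hardmax(z)
print(h.ravel())   # [1. 0. 0. 0.]
```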

Softmax Regression generalizes logistic regression to $C$ classes rather than just two. With $C = 2$ it reduces to logistic regression.
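The link to logistic regression can be checked numerically: for two classes, the softmax probability of the first class equals the sigmoid of the difference $z_1 - z_2$. The sketch below uses made-up values and hypothetical helper names.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([[1.5], [-0.5]])             # C = 2, one example
p_softmax = softmax(z)[0, 0]              # softmax probability of the first class
p_sigmoid = sigmoid(z[0, 0] - z[1, 0])    # logistic regression on z_1 - z_2

print(np.isclose(p_softmax, p_sigmoid))   # True
```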

**Loss Function**

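$$ L(\hat{y},\ y) = -\sum_{j=1}^{C} y_j \log\hat{y}_j $$

Here $C = 4$ as in the example above. Since $y$ is a one-hot vector, only the term for the true class is non-zero, so minimizing the loss pushes the predicted probability of that class towards 1.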
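A minimal sketch of this loss for a single example, assuming $y$ is a one-hot column vector. The prediction values are made up for illustration.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Loss for one example; y_hat and y have shape (C, 1) and y is one-hot."""
    return float(-np.sum(y * np.log(y_hat)))

y = np.array([[0.0], [1.0], [0.0], [0.0]])       # the true class is the second one
y_hat = np.array([[0.3], [0.2], [0.1], [0.4]])   # illustrative prediction

loss = cross_entropy_loss(y_hat, y)
print(loss)   # -log(0.2), about 1.609
```

Only the predicted probability of the true class (0.2 here) contributes to the result.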

**Cost Function**

$$  J(w^{[1]}, b^{[1]}, \ldots) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) $$
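The cost is the average of the per-example losses. A vectorized sketch, assuming predictions and one-hot labels are stacked as columns of $(C, m)$ arrays (the numbers are illustrative):

```python
import numpy as np

def cost(Y_hat, Y):
    """Average cross-entropy over m examples; Y_hat and Y have shape (C, m)."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(Y_hat)) / m)

# Two examples (columns) with C = 4 classes
Y = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])
Y_hat = np.array([[0.3, 0.5],
                  [0.2, 0.25],
                  [0.1, 0.125],
                  [0.4, 0.125]])

J = cost(Y_hat, Y)
print(J)   # (-log(0.2) - log(0.5)) / 2, about 1.151
```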

We can use gradient descent to minimize the cost function, as we have been doing before.
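To make this concrete, here is a small gradient descent sketch for softmax regression with no hidden layers. The toy data, the learning rate, and the iteration count are all assumptions chosen for illustration, not values from the notes.

```python
import numpy as np

def softmax(Z):
    T = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return T / np.sum(T, axis=0, keepdims=True)

def cost(A, Y):
    return float(-np.sum(Y * np.log(A)) / Y.shape[1])

# Toy data (made up): 2 features, m = 6 examples, C = 3 classes
X = np.array([[0.0, 0.2, 1.0, 1.2, -1.0, -1.2],
              [1.0, 1.2, -1.0, -0.8, -1.0, -1.2]])
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[:, labels]              # one-hot labels, shape (3, 6)

W = np.zeros((3, 2))
b = np.zeros((3, 1))
learning_rate = 0.1
m = X.shape[1]

J_initial = cost(softmax(W @ X + b), Y)
for _ in range(500):
    A = softmax(W @ X + b)            # forward pass
    dZ = (A - Y) / m                  # gradient of the cost w.r.t. Z
    W -= learning_rate * (dZ @ X.T)   # gradient step on the weights
    b -= learning_rate * np.sum(dZ, axis=1, keepdims=True)
J_final = cost(softmax(W @ X + b), Y)

print(J_initial, J_final)             # the cost should go down
```

With all parameters initialized to zero, every class starts at probability $1/3$, so the initial cost is $\log 3$.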

**Back-Prop** 

$$ \frac{ \partial J }{ \partial z^{[L]} } = dz^{[L]} = \hat y - y $$
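One way to gain confidence in this expression is to compare it with a finite-difference estimate of the gradient of the loss for a single example. The sketch below does that; the step size `eps`, the tolerance, and the input values are assumptions for the example.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

def loss(z, y):
    return float(-np.sum(y * np.log(softmax(z))))

z = np.array([[5.0], [2.0], [-1.0], [3.0]])
y = np.array([[0.0], [1.0], [0.0], [0.0]])

# Closed-form gradient from the formula above
analytic = softmax(z) - y

# Central finite differences, one coordinate of z at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    step = np.zeros_like(z)
    step[i, 0] = eps
    numeric[i, 0] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

grad_matches = np.allclose(analytic, numeric, atol=1e-6)
print(grad_matches)   # True
```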