## Softmax Regression (AKA Multinomial Logistic Regression)

Softmax regression is a generalization of logistic regression to multi-class classification, where the goal is to predict one of several possible classes for a given input. It extends binary logistic regression by using the softmax function, which normalizes the model's outputs so that the predicted probabilities sum to 1 across all classes.

In Softmax Regression, the model computes a score for each class, usually as a linear function of the input features. These scores are then passed through the softmax function to produce a probability for each class. The softmax function exponentiates each score and then divides by the sum of all the exponentiated scores, so that the probabilities over all classes sum to 1. Mathematically, the softmax function is defined as:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

where z is the vector of class scores, i is the index of the class whose probability is being computed, and j runs over all classes.
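
The formula above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `softmax` and the example scores are assumptions, and subtracting the maximum score is a common numerical-stability trick that is not part of the definition above (it leaves the result unchanged because the shift cancels in the ratio).

```python
import numpy as np

def softmax(z):
    """Map a 1-D vector of class scores to a vector of probabilities."""
    z = np.asarray(z, dtype=float)
    # Shifting by the max does not change the result but avoids overflow in exp.
    shifted = z - np.max(z)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for three classes
probs = softmax(scores)
print(probs)        # three probabilities, largest for the first class
print(probs.sum())  # 1.0 up to floating-point rounding
```

Because the exponential is monotonic, the class with the highest score always receives the highest probability.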

The class with the highest probability is then chosen as the prediction for the input data. During training, the model learns the optimal weights and biases for the linear functions by minimizing a loss function, typically the cross-entropy loss, which measures the discrepancy between the predicted probabilities and the true class labels.

Softmax Regression is widely used in various applications, such as image classification, natural language processing, and recommender systems, where the goal is to classify an input into one of multiple possible classes.

### Implementing Softmax
- C = number of classes you are attempting to identify
- If you were classifying cats, dogs, and chicks, C would be equal to 4, because you also need an "other" class for anything that is not a cat, dog, or chick
- You would then build your network so it has four output nodes (instead of the single node you would use for cat vs. no cat)
- In this case, the shape of Yhat for a single example would be (4,1)
- To determine which class the sample belongs to, look at the values of Yhat for that example, find the highest one, and use the label associated with that output node
- **IMPORTANT** the values in the Yhat vector must sum to 1.0, so the network is effectively distributing probability across the classes the image could belong to
- In order to implement this, you will build what is called a Softmax layer
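
As a small sketch of the prediction step described in the list above: the label order and the probability values below are made up for illustration, and any fixed mapping from output node to label would work the same way.

```python
import numpy as np

# Hypothetical label order for the four output nodes.
labels = ["other", "cat", "dog", "chick"]

# Hypothetical network output for one example, shape (4, 1); the values sum to 1.
y_hat = np.array([[0.05], [0.10], [0.70], [0.15]])

# argmax over the class axis gives the index of the most probable class.
predicted_index = int(np.argmax(y_hat, axis=0)[0])
predicted_label = labels[predicted_index]
print(predicted_label)  # dog
```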

### Building a Softmax Layer
- The softmax layer is defined by the use of a specific activation function
- t = e^(Z[l]), where the exponential is applied element-wise
- t will be a (4,1) vector just like Z[l]
- a[l] = t / np.sum(t)
- a[l] is also a (4,1) vector, where a[l][i] = t[i] / np.sum(t)
- **IMPORTANT** what is unusual about this activation function is that it takes a vector as input and outputs a vector, because of the normalization across classes. The other activation functions we have discussed so far map a single real number to a single real number.
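
The steps above might look like this for column-shaped inputs. This is a sketch under some assumptions: the name `softmax_activation` is hypothetical, the (C, m) layout (one column per example) extends the (4,1) case to a batch, and the max shift is an added stability measure rather than part of the steps listed.

```python
import numpy as np

def softmax_activation(Z):
    """Softmax over the class axis (rows) of Z, where Z has shape (C, m)."""
    # Shift each column by its own max for numerical stability (an addition
    # to the steps above; it does not change the output).
    t = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    # Normalize each column so its entries sum to 1.
    return t / np.sum(t, axis=0, keepdims=True)

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])  # hypothetical Z[l], shape (4, 1)
A = softmax_activation(Z)
print(A.shape)  # (4, 1)
print(A.sum())  # 1.0 up to floating-point rounding
```

Using `keepdims=True` keeps the sums as a (1, m) row, so the division broadcasts column by column and the same function works for a single example or a whole batch.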

### Training a Network that Uses a Softmax Layer
The loss function commonly used for softmax regression is the cross-entropy loss (also called the negative log-likelihood). Given the true probability distribution y and the predicted probability distribution y_hat produced by the softmax function, the cross-entropy loss is calculated as follows:

L(y, y_hat) = -Σ_i y_i * log(y_hat_i)

Here, the sum runs over all classes i. In the context of classification problems, y is usually a one-hot encoded vector representing the true class label, and y_hat is the predicted probability distribution over the classes obtained by applying the softmax function on the output of the last layer of the neural network.

For a single data point with a one-hot label, every term in the sum except the one for the true class is zero, so the loss reduces to:

L = -log(y_hat_c)

where y_hat_c is the predicted probability of the true class c.

To compute the loss for a dataset, average the per-example losses across all data points in the dataset.
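
A minimal sketch of the averaged cross-entropy loss, assuming one-hot labels `Y` and predicted probabilities `Y_hat` stored with shape (C, m), one column per example. The function name `cross_entropy`, the small `eps` guard against log(0), and the example values are all assumptions made for illustration.

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Average cross-entropy loss; Y and Y_hat both have shape (C, m)."""
    # eps is an assumed safeguard so that log never receives exactly 0.
    per_example = -np.sum(Y * np.log(Y_hat + eps), axis=0)
    return np.mean(per_example)

# Hypothetical data: 3 classes, 2 examples.
# Example 0 belongs to class 1, example 1 belongs to class 0.
Y = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 0.0]])
Y_hat = np.array([[0.2, 0.5],
                  [0.7, 0.3],
                  [0.1, 0.2]])

loss = cross_entropy(Y, Y_hat)
print(loss)  # average of -log(0.7) and -log(0.5)
```

With one-hot labels, only the predicted probability of the true class contributes for each example, which matches the single-data-point form L = -log(y_hat_c).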