# SoftMax regression

From regression to classification

In [3]:
import torch
import torch.nn as nn

Rather than predicting quantities, we often want to classify things.

**Example**: Is this email spam or not? Is there a cat in this image?

**For binary classification**, a single output unit with a sigmoid applied to it is enough.

**Why don't we have 2 output units?**

In [5]:
sigmoid = nn.Sigmoid()

<center>
    <img src="images/sigmoid.png" height="70%" width="70%"/>
</center>

To make a prediction, apply a threshold (usually 0.5) to the sigmoid output to decide which category is predicted.
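As a minimal sketch of the thresholding step, assuming a model with a single output unit (the `logits` values below are made up for illustration):

```python
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()

# Hypothetical raw scores (logits) from a model with a single output unit.
logits = torch.tensor([-2.0, 0.0, 0.3, 4.0])

probs = sigmoid(logits)        # probability of the positive class, in (0, 1)
preds = (probs > 0.5).long()   # 1 = positive class, 0 = negative class
```

With a strict `>` comparison, a probability of exactly 0.5 falls in the negative class; which side the boundary belongs to is a convention.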

For binary classification with one output unit, we use the **binary cross-entropy** loss.

In [7]:
criterion = torch.nn.BCELoss()
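A short usage sketch, with made-up logits and targets: `nn.BCELoss` expects probabilities (outputs that have already gone through a sigmoid) and float targets, while `nn.BCEWithLogitsLoss` takes the raw logits and applies the sigmoid internally.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

logits = torch.tensor([1.5, -0.5, 2.0])   # hypothetical model outputs
targets = torch.tensor([1.0, 0.0, 1.0])   # BCELoss expects float targets

probs = torch.sigmoid(logits)             # BCELoss works on probabilities
loss = criterion(probs, targets)

# Same quantity computed directly from the logits.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
```

Both values should agree up to floating-point error; the logits-based variant is generally the more numerically stable of the two.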

When we have more than two classes, we need an encoding for the labels.

This encoding should not impose any order on the classes:
if for **{dog, cat, bird, fish}** we assigned $y \in \{1, 2, 3, 4\}$, we would give an **order** and a **value** to each class. We don't want that!

The usual way to represent categorical data is the *one-hot encoding*.

It is a vector with as many components as we have categories.

The component corresponding to a particular instance's category is set to 1
and all other components are set to 0.

$$y \in \{(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)\}.$$
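One way to build such vectors in PyTorch is `torch.nn.functional.one_hot`. The class list and the label indices below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

classes = ["dog", "cat", "bird", "fish"]

# Integer class indices for three hypothetical examples: dog, cat, fish.
labels = torch.tensor([0, 1, 3])

# One row per example, one column per class.
one_hot = F.one_hot(labels, num_classes=len(classes))
```

Each row contains a single 1 at the position of the example's class and 0 everywhere else.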

To estimate the conditional probabilities of all the possible classes, we need a model with one output per class

<center><img src="images/softmaxreg.svg" height="70%" width="70%"/></center>
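A minimal sketch of such a model is a single linear layer with one output per class. The sizes `num_features` and `num_classes` are hypothetical:

```python
import torch
import torch.nn as nn

num_features, num_classes = 8, 4   # hypothetical sizes

# One linear layer: every class gets its own output unit.
model = nn.Linear(num_features, num_classes)

x = torch.randn(5, num_features)   # a batch of 5 examples
outputs = model(x)                 # shape (5, 4): one score per class and example
```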

Suppose that the entire dataset $\{\mathbf{X}, \mathbf{Y}\}$ has $n$ examples,
where the example indexed by $i$
consists of a feature vector $\mathbf{x}^{(i)}$ and a one-hot label vector $\mathbf{y}^{(i)}$.
We can compare the estimates with reality
by checking how probable the actual classes are
according to our model, given the features:

$$
P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
$$

**If $P(\mathbf{Y} \mid \mathbf{X}) = 1$ we have a perfect model!** 
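To make the product concrete, here is a small sketch with made-up predicted probabilities for three examples over four classes. The likelihood is the product of the probabilities assigned to the true classes:

```python
import torch

# Hypothetical predicted probabilities (each row sums to 1).
probs = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.6, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])
labels = torch.tensor([0, 1, 3])   # true class index of each example

# Probability the model assigns to the true class of each example.
p_true = probs[torch.arange(len(labels)), labels]

# Likelihood of the whole dataset: product over examples.
likelihood = p_true.prod()
```

The likelihood only reaches 1 when every true class receives probability 1.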

We want to *maximize* the likelihood. However, when training a neural network, we want a loss we can *minimize*.

Minimizing the **negative log-likelihood** is equivalent to maximizing the likelihood, because the logarithm is monotonically increasing.

This loss is called the **cross-entropy loss**.
It takes the raw outputs of the last layer (the **logits**) and the ground-truth labels, transforms the logits into probabilities, and compares them with the target.

In [13]:
criterion = nn.CrossEntropyLoss()
prediction = torch.randn(3, 5)  # per-class logits: 3 examples, 5 classes
target = torch.empty(3, dtype=torch.long).random_(5)  # true class index of each example, in [0, 5)
output = criterion(prediction, target)

**Tip**: `CrossEntropyLoss` lets you assign a weight to each class through its `weight` argument, which can be useful if your dataset is **imbalanced**.
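A sketch of the class-weight option, assuming 5 classes and weights chosen by hand (in practice they are often derived from class frequencies):

```python
import torch
import torch.nn as nn

# Hypothetical weights: rarer classes get a larger weight.
class_weights = torch.tensor([1.0, 1.0, 2.0, 5.0, 1.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

torch.manual_seed(0)
prediction = torch.randn(3, 5)       # logits for 3 examples, 5 classes
target = torch.tensor([3, 0, 2])     # true class indices

loss = criterion(prediction, target)
```

With the default `reduction="mean"`, the result is a weighted average: each example's loss is multiplied by the weight of its true class, and the sum is divided by the sum of those weights.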

**For a multi-class classification problem**: our model outputs unbounded scalars, but we want probabilities.
These scalars are called **logits**.

To transform a vector of **logits** into a probability vector, we use the **SoftMax** function

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}. $$
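The formula can be written out by hand and compared with PyTorch's built-in `torch.softmax`. The helper name `manual_softmax` and the input values are illustrative; subtracting the row maximum is a common stability trick that does not change the result:

```python
import torch

def manual_softmax(o: torch.Tensor) -> torch.Tensor:
    """Row-wise softmax for a (batch, classes) tensor of logits."""
    # Shifting by the row maximum avoids overflow in exp() and leaves
    # the ratio exp(o_j) / sum_k exp(o_k) unchanged.
    o = o - o.max(dim=1, keepdim=True).values
    exp_o = torch.exp(o)
    return exp_o / exp_o.sum(dim=1, keepdim=True)

logits = torch.tensor([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])
y_hat = manual_softmax(logits)
```

Every row of `y_hat` is non-negative and sums to 1, and equal logits (second row) give a uniform distribution.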

We need a loss function capable of measuring the quality of our predicted probabilities.

We rely on **maximum likelihood** estimation.

**Softmax** provides a vector $\hat{\mathbf{y}}$,
which we can interpret as estimated conditional probabilities
of each class given any input $\mathbf{x}$, e.g.,
$\hat{y}_1$ = $P(y=\text{cat} \mid \mathbf{x})$.

In [4]:
softmax = nn.Softmax(dim=1) # what is dim 0, what is dim 1?
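To see what `dim` changes, here is a small sketch assuming the usual layout of one row per example and one column per class:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])   # shape (batch=2, classes=3)

# dim=1 normalizes across classes: each row (example) sums to 1.
per_example = nn.Softmax(dim=1)(logits)

# dim=0 normalizes across the batch: each column sums to 1,
# which is rarely what you want for classification.
per_column = nn.Softmax(dim=0)(logits)
```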

# ⚠️ `CrossEntropyLoss` applies a (log-)softmax itself, so feed it raw logits. If your model already outputs log-probabilities (e.g. after `nn.LogSoftmax`), use the negative log-likelihood loss ⚠️

In [12]:
criterion = nn.NLLLoss()
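A quick check of how the two losses relate, using random logits and made-up targets: `nn.NLLLoss` expects log-probabilities, so pairing it with `nn.LogSoftmax` should reproduce `nn.CrossEntropyLoss` applied to the raw logits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 5)           # 3 examples, 5 classes
target = torch.tensor([1, 0, 4])     # true class indices

# Cross-entropy directly on the logits.
ce = nn.CrossEntropyLoss()(logits, target)

# Same thing in two steps: log-softmax, then negative log-likelihood.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target)
```

The two values agree up to floating-point error. Passing plain probabilities (not their logarithm) to `nn.NLLLoss` runs without an error but gives a different, wrong value.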

When you want to predict, you simply take the index with the maximum probability as the category

In [9]:
a = torch.randn(4, 4)
torch.max(a, dim=1) # torch.max returns the maximum value and the corresponding index

torch.return_types.max(
values=tensor([ 2.4403,  1.3832,  1.8918, -0.1694]),
indices=tensor([2, 0, 1, 0]))
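When only the predicted class is needed, `argmax` returns the index directly. The logits below are made up; since softmax preserves the ordering of the scores, the prediction can be read from the logits or from the probabilities.

```python
import torch

# Hypothetical logits for 2 examples over 3 classes.
logits = torch.tensor([[0.1, 2.0, -1.0],
                       [1.5, 0.2, 0.3]])

probs = torch.softmax(logits, dim=1)

pred_from_probs = probs.argmax(dim=1)    # index of the most probable class
pred_from_logits = logits.argmax(dim=1)  # same index, without the softmax
```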

Minimizing the **negative log-likelihood** is equivalent to maximizing the likelihood:

$$
-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),
$$

where, for any pair of a label $\mathbf{y}$ and a model prediction $\hat{\mathbf{y}}$ over $q$ classes,
the loss function $l$ is

$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$

This loss is called the **cross-entropy loss**
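As a sanity check, the formula for $l$ can be computed by hand from one-hot labels and softmax probabilities, then compared with `nn.CrossEntropyLoss`. The logits and targets are arbitrary illustrative values, and `reduction="sum"` is used to match the sum over examples in the negative log-likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 4)           # n = 3 examples, q = 4 classes
target = torch.tensor([2, 0, 3])     # true class indices

y = F.one_hot(target, num_classes=4).float()   # one-hot labels
y_hat = torch.softmax(logits, dim=1)           # predicted probabilities

# l(y, y_hat) = -sum_j y_j * log(y_hat_j), one value per example.
per_example = -(y * torch.log(y_hat)).sum(dim=1)

# Negative log-likelihood of the dataset: sum over the examples.
nll = per_example.sum()

# PyTorch's version, taking logits and class indices directly.
builtin = nn.CrossEntropyLoss(reduction="sum")(logits, target)
```

Because $\mathbf{y}$ is one-hot, only the term of the true class survives in each sum, so the loss reduces to minus the log-probability of the correct class.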