# SoftMax regression

From regression to classification

In [1]:
import torch
import torch.nn as nn

Rather than predicting quantities, we often want to classify things.

**Examples**: classify an email as spam or not, decide whether there is a cat in an image, etc.

**For binary classification**, the model has a single output unit with a sigmoid applied to it.

**Why don't we have 2 output units?**
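
One way to see why a single unit is enough: with two classes the probabilities must sum to 1, so the second one carries no extra information. The plain-Python sketch below checks that a sigmoid on one logit `z` gives the same probability as a softmax over the two logits `(z, 0)`. The helper names `sigmoid` and `softmax2` are illustrative only, not PyTorch APIs.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z_pos, z_neg):
    """Probability of the positive class under a two-output softmax."""
    e_pos, e_neg = math.exp(z_pos), math.exp(z_neg)
    return e_pos / (e_pos + e_neg)

for z in (-2.0, 0.0, 1.5):
    p_one_unit = sigmoid(z)
    p_two_units = softmax2(z, 0.0)  # second logit fixed to 0
    print(f"z={z:+.1f}  sigmoid={p_one_unit:.6f}  softmax={p_two_units:.6f}")
```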

In [2]:
sigmoid = nn.Sigmoid()

<center>
    <img src="images/sigmoid.png" height="70%" width="70%"/>
</center>

To predict, apply a threshold (usually 0.5) to the sigmoid output to decide which category is predicted.
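
As a minimal sketch of the thresholding step (plain Python, with a hypothetical helper `predict_label` and made-up logits): the sigmoid turns the logit into a probability, and the prediction is 1 when that probability reaches the threshold.

```python
import math

def predict_label(logit, threshold=0.5):
    """Return 1 if sigmoid(logit) >= threshold, else 0."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    return int(probability >= threshold)

logits = [-1.2, 0.0, 0.3, 2.5]  # made-up model outputs
labels = [predict_label(z) for z in logits]
print(labels)  # [0, 1, 1, 1]
```

With a threshold of 0.5 this is the same as checking whether the logit is non-negative, since the sigmoid of 0 is exactly 0.5.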

For binary classification with one output unit, we use the **binary cross-entropy** loss. Note that `nn.BCELoss` expects probabilities, i.e. the output of the sigmoid, not raw scores.

In [3]:
criterion = torch.nn.BCELoss()
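
To make the arithmetic behind the binary cross-entropy concrete, here is a plain-Python sketch of the mean of $-(y \log p + (1 - y) \log(1 - p))$ over a batch. The function name `binary_cross_entropy` and the numbers are made up for illustration. The inputs are probabilities, as they would be after a sigmoid.

```python
import math

def binary_cross_entropy(probabilities, targets):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over the batch."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)

probabilities = [0.9, 0.2, 0.6]  # made-up sigmoid outputs
targets = [1, 0, 1]              # ground-truth labels
loss = binary_cross_entropy(probabilities, targets)
print(f"{loss:.4f}")
```

Confident and correct predictions (0.9 for a positive, 0.2 for a negative) contribute little to the loss, while the hesitant 0.6 contributes the most.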

When we have more than two classes, we represent them using an encoding.

This encoding ensures that there is no order in the representation.
If, for **{dog, cat, bird, fish}**, we assigned $y \in \{1, 2, 3, 4\}$, we would impose an **order** and a **value** on each class. We don't want that!

The usual way to represent categorical data is the *one-hot encoding*.

It is a vector with as many components as we have categories.

The component corresponding to a particular instance's category is set to 1,
and all other components are set to 0.

$$y \in \{(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)\}.$$
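
A minimal sketch of one-hot encoding in plain Python, assuming the class order dog, cat, bird, fish (the helper `one_hot` is illustrative):

```python
classes = ["dog", "cat", "bird", "fish"]

def one_hot(label, classes):
    """Vector with a 1 at the position of `label` and 0 elsewhere."""
    return [1 if name == label else 0 for name in classes]

print(one_hot("bird", classes))  # [0, 0, 1, 0]
```

In the PyTorch cells of this notebook the targets are passed as class indices (here `2` for bird) rather than as explicit one-hot vectors. Both carry the same information.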

To estimate the conditional probabilities of all the possible classes, we need a model with one output per class

<center><img src="images/softmaxreg.svg" height="70%" width="70%"/></center>

Suppose that the entire dataset $\{\mathbf{X}, \mathbf{Y}\}$ has $n$ examples,
where the example indexed by $i$
consists of a feature vector $\mathbf{x}^{(i)}$ and a one-hot label vector $\mathbf{y}^{(i)}$.
We can compare the estimates with reality
by checking how probable the actual classes are
according to our model, given the features:

$$
P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).
$$

**If $P(\mathbf{Y} \mid \mathbf{X}) = 1$ we have a perfect model!** 

We want to *maximize* the likelihood. However, when training a neural network, we want a loss that we can *minimize*.

Minimizing the **negative log-likelihood** is equivalent to maximizing the likelihood, because the logarithm is monotonically increasing.
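
The sketch below, with made-up probabilities, illustrates both the equivalence and a practical reason for working in log space: a product of many probabilities underflows to zero in floating point, while the sum of negative logs stays well behaved.

```python
import math

# Hypothetical probabilities a model assigns to the correct class of 4 examples.
true_class_probs = [0.9, 0.7, 0.8, 0.6]

likelihood = math.prod(true_class_probs)           # product over examples
nll = sum(-math.log(p) for p in true_class_probs)  # negative log-likelihood

# exp(-nll) recovers the likelihood, so both rank models identically.
print(likelihood, math.exp(-nll))

# With many examples the raw product underflows, the sum of logs does not.
many_probs = [0.5] * 2000
print(math.prod(many_probs))                # 0.0
print(sum(-math.log(p) for p in many_probs))
```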

This loss is called the **cross-entropy loss**.
It takes the outputs of the last layer (called **logits**) and the ground truth, transforms the logits into probabilities, and compares them with the target.

In [8]:
criterion = nn.CrossEntropyLoss()
prediction = torch.randn(3, 5)  # per-class logits: 3 examples, 5 classes
target = torch.empty(3, dtype=torch.long).random_(5)  # one class index in [0, 4] per example
print(target)
output = criterion(prediction, target)
print(prediction)

tensor([4, 4, 1])
tensor([[ 0.9347, -1.3496,  0.6083,  0.1674,  0.9082],
        [-0.7064, -1.2840, -1.0878, -0.3773,  0.9027],
        [ 1.6767,  0.5550,  0.2900, -2.2211, -0.1674]])


**Tip**: `nn.CrossEntropyLoss` lets you assign a weight to each class (through its `weight` argument), which can be useful if your dataset is **imbalanced**.
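
To give an idea of what class weights do, here is a plain-Python sketch of a weighted cross-entropy. It assumes the normalization described in the PyTorch documentation for the default `reduction='mean'` (the weighted sum is divided by the sum of the weights of the targets). The function name and the data are made up.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights):
    """Weighted mean of -log p[target], normalized by the weights used."""
    weighted_losses = [
        class_weights[t] * -math.log(p[t]) for p, t in zip(probs, targets)
    ]
    total_weight = sum(class_weights[t] for t in targets)
    return sum(weighted_losses) / total_weight

# Made-up predicted probabilities for 3 examples over 2 classes.
probs = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
targets = [0, 0, 1]  # class 1 is the rare class here

uniform = weighted_cross_entropy(probs, targets, [1.0, 1.0])
rare_upweighted = weighted_cross_entropy(probs, targets, [1.0, 3.0])
print(f"uniform={uniform:.4f}  rare_upweighted={rare_upweighted:.4f}")
```

Raising the weight of the rare class makes errors on that class count more in the averaged loss.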

**For a multi-class classification problem**: our model outputs one real-valued score per class, but we want probabilities.
These scores are called **logits**.

To transform a vector of **logits** into a probability vector, we use the **SoftMax** function

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})\quad \text{where}\quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}. $$
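
The formula can be sketched in a few lines of plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability trick added here as an assumption. It does not change the result because the common factor cancels in the ratio.

```python
import math

def softmax(logits):
    """exp(o_j) / sum_k exp(o_k), computed with a max-shift for stability."""
    shift = max(logits)
    exps = [math.exp(o - shift) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up logits
print(probs, sum(probs))

# Large logits would overflow math.exp without the shift.
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```

The outputs are non-negative, sum to 1, and keep the ordering of the logits.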

We need a loss function capable of measuring the quality of our predicted probabilities.

We rely on **maximum likelihood** estimation.

**Softmax** provides a vector $\hat{\mathbf{y}}$,
which we can interpret as the estimated conditional probabilities
of each class given an input $\mathbf{x}$, e.g.,
$\hat{y}_1 = P(y=\text{cat} \mid \mathbf{x})$.

In [5]:
softmax = nn.Softmax(dim=1) # what is dim 0, what is dim 1?
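
To illustrate the question in the comment, the plain-Python sketch below mimics a tensor of shape (batch, classes) with nested lists. Normalizing along dim 1 makes each row (one example) sum to 1, which is what we want for class probabilities. Normalizing along dim 0 would make each column sum to 1 across the batch instead. The helper `softmax` is illustrative.

```python
import math

def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# 2 examples (rows), 3 classes (columns); made-up logits.
logits = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 1.0]]

# Like dim=1: normalize within each row, i.e. across classes.
over_classes = [softmax(row) for row in logits]

# Like dim=0: normalize within each column, i.e. across the batch.
columns = [softmax(list(col)) for col in zip(*logits)]
over_batch = [list(row) for row in zip(*columns)]

print([sum(row) for row in over_classes])     # each row sums to 1
print([sum(col) for col in zip(*over_batch)])  # each column sums to 1
```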

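# ⚠️ `nn.CrossEntropyLoss` applies a (log-)softmax itself, so feed it raw logits. If your model already outputs log-probabilities (e.g. through `nn.LogSoftmax`), use the negative log-likelihood loss `nn.NLLLoss` instead ⚠️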

In [6]:
criterion = nn.NLLLoss()
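
The relationship between the two losses can be sketched in plain Python: cross-entropy on logits equals the negative log-likelihood applied to log-softmax outputs. The helpers below only mirror the names of the PyTorch modules and are not the library's implementation. The logits and targets are made up.

```python
import math

def log_softmax(logits):
    """log(softmax(logits)), computed stably with a max-shift."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(o - shift) for o in logits))
    return [o - log_norm for o in logits]

def nll_loss(log_probs, targets):
    """Mean of -log_prob[target]; expects log-probabilities as input."""
    return -sum(row[t] for row, t in zip(log_probs, targets)) / len(targets)

def cross_entropy(logits, targets):
    """Cross-entropy on raw logits: log-softmax followed by NLL."""
    return nll_loss([log_softmax(row) for row in logits], targets)

logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # 2 examples, 3 classes
targets = [0, 2]                               # class indices
print(cross_entropy(logits, targets))
```

Applying a softmax in the model and then feeding the result to a cross-entropy on logits would normalize twice, which is the mistake the warning above is about.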

To predict, simply take the index with the maximum probability as the predicted category.

In [7]:
a = torch.randn(4, 4)
torch.max(a, dim=1)  # torch.max returns the maximum values and their indices along dim 1

torch.return_types.max(
values=tensor([1.2424, 0.5465, 0.5827, 0.4447]),
indices=tensor([3, 3, 3, 2]))
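
The same prediction rule in plain Python, as a small sketch with made-up probabilities and an illustrative helper `argmax`:

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

probabilities = [[0.1, 0.7, 0.2],   # example 1: class 1 is most likely
                 [0.5, 0.3, 0.2]]   # example 2: class 0 is most likely
predicted = [argmax(row) for row in probabilities]
print(predicted)  # [1, 0]
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same predicted class.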

Minimizing the **negative log-likelihood** is equivalent to maximizing the likelihood:

$$
-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
= \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),
$$

where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes,
the loss function $l$ is

$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
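
Since $\mathbf{y}$ is one-hot, every term of the sum vanishes except the one for the true class, so the loss reduces to $-\log \hat{y}_j$ for the correct $j$. A plain-Python sketch with made-up numbers:

```python
import math

def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_j y_j * log(y_hat_j) for a one-hot label y."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat))

y = [0, 1, 0, 0]              # one-hot label: the true class is index 1
y_hat = [0.1, 0.7, 0.1, 0.1]  # made-up predicted probabilities
loss = cross_entropy(y, y_hat)
print(loss, -math.log(0.7))   # both values agree
```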

This loss is called the **cross-entropy loss**

In [12]:
model = nn.Sequential(nn.Linear(4, 3))
criterion = nn.CrossEntropyLoss()
input_features = torch.randn(10, 4)
targets = torch.empty(10, dtype=torch.long).random_(3)
model(input_features)

tensor([[-0.6341,  0.3860,  0.5390],
        [-0.5015,  0.6337,  0.6364],
        [-0.3050,  0.1935,  0.8646],
        [-0.6423,  0.7129, -0.4183],
        [-0.2103, -0.6932,  1.1270],
        [-0.3464,  0.6114,  0.1044],
        [-0.2144, -0.3100,  0.3348],
        [-0.0124, -0.6519, -0.2480],
        [-0.9427,  2.0335,  1.3515],
        [-0.5265,  0.5125,  0.9804]], grad_fn=<AddmmBackward0>)

In [14]:
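optimizer = torch.optim.SGD(model.parameters(), lr=3e-3)
epochs = 100
for idx_epoch in range(epochs):
    optimizer.zero_grad()  # reset the gradients from the previous step
    Y_pred = model(input_features)  # forward pass: logits of shape (10, 3)
    loss = criterion(Y_pred, targets)  # cross-entropy between logits and class indices
    loss.backward()  # backpropagate to compute the gradients
    optimizer.step()  # update the parameters
    print(f'loss now {loss.item()}')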

loss now 1.3080755472183228
loss now 1.3068199157714844
loss now 1.3055660724639893
loss now 1.3043138980865479
loss now 1.303063154220581
loss now 1.3018144369125366
loss now 1.300567388534546
loss now 1.2993220090866089
loss now 1.2980785369873047
loss now 1.2968366146087646
loss now 1.2955964803695679
loss now 1.294358253479004
loss now 1.293121576309204
loss now 1.291886806488037
loss now 1.2906538248062134
loss now 1.2894222736358643
loss now 1.2881925106048584
loss now 1.2869646549224854
loss now 1.285738468170166
loss now 1.2845139503479004
loss now 1.2832914590835571
loss now 1.2820703983306885
loss now 1.280851125717163
loss now 1.279633641242981
loss now 1.2784178256988525
loss now 1.277203917503357
loss now 1.2759915590286255
loss now 1.2747809886932373
loss now 1.273572325706482
loss now 1.2723652124404907
loss now 1.2711598873138428
loss now 1.269956350326538
loss now 1.2687546014785767
loss now 1.2675542831420898
loss now 1.266356110572815
loss now 1.2651593685150146
loss