In [None]:
import torch
from torch import Tensor

# Tutorial 1b: Softmax Function

**Question:** To have the logistic regressor output probabilities, they need to be processed through a softmax layer. Implement a softmax layer yourself. What numerical issues may arise in this layer? How can you solve them? Use the testing code to confirm you implemented it correctly.

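1. The numerical issue in the softmax layer is overflow: for large logits, `exp` exceeds the largest representable floating-point value and returns `inf`, so the division becomes `inf / inf` and the output is `nan`. For very negative logits, `exp` can likewise underflow to 0.
2. To overcome this problem, we shift the logits by subtracting their maximum value, that is, `new_logits = logits - max(logits)`. Softmax is invariant to adding the same constant to every logit, so the result is unchanged, but every exponent is now at most 0 and `exp` can no longer overflow. After this step we compute the softmax using the usual formula, as implemented below.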
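The two points above can be checked on a tiny example. The following is a minimal sketch, assuming the default `float32` dtype; the variable names are illustrative and not part of the tutorial code.

```python
import torch

x = torch.tensor([100.0, 101.0, 102.0])  # float32 by default

# In float32, exp overflows to inf once its input exceeds roughly 88.7.
naive_exp = torch.exp(x)
naive = naive_exp / naive_exp.sum()  # inf / inf gives nan

# Subtracting the maximum leaves softmax unchanged mathematically,
# but keeps every exponent at or below 0.
shifted_exp = torch.exp(x - x.max())
stable = shifted_exp / shifted_exp.sum()

print(naive)
print(stable)
```

The shifted version should agree with `torch.softmax(x, dim=0)` up to floating-point tolerance.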

In [None]:
logits = torch.rand((1, 20)) + 100

In [None]:
def bad_softmax(x: Tensor) -> Tensor:
    # Naive softmax: exp overflows to inf for large inputs, so the result is nan.
    return torch.exp(x) / torch.sum(torch.exp(x), dim=-1, keepdim=True)

In [None]:
bad_softmax(logits)

tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])

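In this section we provide two implementations of `good_softmax`. The first subtracts the maximum logit before exponentiating; the second computes the log-softmax first and then exponentiates the result. Because the second definition overrides the first, the outputs shown below come from the second.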

In [None]:
def good_softmax(x: Tensor) -> Tensor:
    ###########################################################################
    # TODO: Implement a more stable way to compute softmax                    #
    # Subtract the largest logit so every exponent is <= 0 and exp cannot overflow.
    # The max and the sum run over all elements, so x should hold a single row of logits.
    x_max = torch.max(x)
    shifted = x - x_max
    softmax = torch.exp(shifted) / torch.sum(torch.exp(shifted))
    ###########################################################################
    return softmax

In [None]:
def good_softmax(x: Tensor) -> Tensor:
    ###########################################################################
    # TODO: Implement a more stable way to compute softmax                    #
    # Log-space variant: compute log-softmax via the log-sum-exp trick, then exponentiate.
    x = x.squeeze(0)
    x_max = torch.max(x)
    log_softmax = x - x_max - torch.log(torch.sum(torch.exp(x - x_max)))
    ###########################################################################
    return torch.exp(log_softmax)

In [None]:
good_softmax(logits)

tensor([0.0435, 0.0468, 0.0488, 0.0460, 0.0393, 0.0772, 0.0323, 0.0719, 0.0408,
        0.0331, 0.0360, 0.0570, 0.0843, 0.0677, 0.0347, 0.0323, 0.0738, 0.0459,
        0.0468, 0.0418])

In [None]:
torch.sum(good_softmax(logits))

tensor(1.)
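The tutorial's testing code is not shown here, so as a stand-in one can compare against PyTorch's built-in `torch.softmax`. The sketch below uses a hypothetical helper `stable_softmax` (not part of the tutorial) that applies the same max-subtraction trick along a chosen dimension, so it also handles a batch of rows.

```python
import torch
from torch import Tensor


def stable_softmax(x: Tensor, dim: int = -1) -> Tensor:
    # Hypothetical helper: same trick as good_softmax, applied along `dim`.
    x_max = x.max(dim=dim, keepdim=True).values
    exps = torch.exp(x - x_max)
    return exps / exps.sum(dim=dim, keepdim=True)


torch.manual_seed(0)
batch = torch.rand((4, 20)) + 100  # four rows of large logits

out = stable_softmax(batch)
reference = torch.softmax(batch, dim=-1)

print(torch.allclose(out, reference, atol=1e-6))
print(out.sum(dim=-1))  # each row should sum to approximately 1
```

Using `keepdim=True` keeps the reduced dimension so the subtraction and division broadcast row by row.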

Because of numerical issues like the one you just experienced, PyTorch code typically uses a `LogSoftmax` layer, which returns log-probabilities directly instead of probabilities.
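To illustrate why working in log-space helps, the sketch below compares taking `log` of a softmax output with `torch.nn.LogSoftmax` on logits that are far apart. The example values are chosen only for illustration and assume `float32`.

```python
import torch

logits = torch.tensor([[0.0, -200.0]])

# Softmax followed by log: the second probability underflows to 0,
# so its logarithm is -inf.
probs = torch.softmax(logits, dim=-1)
log_of_probs = torch.log(probs)

# LogSoftmax computes the log-probabilities directly and stays finite.
log_probs = torch.nn.LogSoftmax(dim=-1)(logits)

# Log-probabilities pair with NLLLoss, which expects them as input.
target = torch.tensor([1])
loss = torch.nn.NLLLoss()(log_probs, target)

print(log_of_probs)
print(log_probs)
print(loss)
```

With the naive route, the loss for class 1 would be infinite; with `LogSoftmax` it stays a finite number that can be backpropagated.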

**Question [optional]:** PyTorch automatically computes the backpropagation gradient of a module for you. However, it can be instructive to derive and implement your own backward function. Try and implement the backward function for your softmax module and confirm that it is correct.
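One possible approach to the optional question is sketched below. It assumes softmax is taken over the last dimension and uses the Jacobian `ds_i/dx_j = s_i * (delta_ij - s_j)`, which gives the input gradient `s * (g - sum(g * s))` for an upstream gradient `g`. The class name `SoftmaxFn` is hypothetical; `torch.autograd.Function` and `torch.autograd.gradcheck` are standard PyTorch utilities.

```python
import torch
from torch import Tensor


class SoftmaxFn(torch.autograd.Function):
    """Hypothetical softmax over the last dimension with a hand-written backward."""

    @staticmethod
    def forward(ctx, x: Tensor) -> Tensor:
        shifted = x - x.max(dim=-1, keepdim=True).values
        exps = torch.exp(shifted)
        s = exps / exps.sum(dim=-1, keepdim=True)
        ctx.save_for_backward(s)
        return s

    @staticmethod
    def backward(ctx, grad_output: Tensor) -> Tensor:
        (s,) = ctx.saved_tensors
        # Vector-Jacobian product: grad_x = s * (g - sum_k g_k * s_k)
        inner = (grad_output * s).sum(dim=-1, keepdim=True)
        return s * (grad_output - inner)


torch.manual_seed(0)
# gradcheck compares against finite differences and needs double precision.
x = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
ok = torch.autograd.gradcheck(SoftmaxFn.apply, (x,))

# Cross-check against autograd through the built-in softmax.
weights = torch.randn(3, 5, dtype=torch.float64)
(g_custom,) = torch.autograd.grad((SoftmaxFn.apply(x) * weights).sum(), x)
(g_ref,) = torch.autograd.grad((torch.softmax(x, dim=-1) * weights).sum(), x)

print(ok, torch.allclose(g_custom, g_ref))
```

Saving the forward output `s` rather than the input avoids recomputing the exponentials in the backward pass.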