# Softmax, part 1

Task: practice using the `softmax` function.

**Why**: The softmax is a building block that is used throughout machine learning, statistics, data modeling, and even statistical physics. This activity is designed to help you get comfortable with how it works at both a high and a low level.

**Note**: Although "softmax" is the conventional name in machine learning, you may also see it called "soft *arg* max". The [Wikipedia article](https://en.wikipedia.org/w/index.php?title=Softmax_function&oldid=1065998663) has a good explanation.

## Setup

In [2]:
import torch
from torch import tensor
import ipywidgets as widgets
import matplotlib.pyplot as plt
%matplotlib inline

## Task

The following function defines `softmax` using PyTorch's built-in functionality.

In [3]:
def softmax_torch(x):
    return torch.softmax(x, axis=0)

Let's try it on an example tensor.

In [4]:
x = tensor([1., 2., 3.])
softmax_torch(x)

tensor([0.0900, 0.2447, 0.6652])

1. Start by playing with the interactive widget below. Describe the outputs when:

    1. All of the inputs are the same.
    2. One input is much bigger than the others.
    3. One input is much smaller than the others.

Finally, describe the input that gives the largest possible value for output 1.

In [5]:
r = 2.0
@widgets.interact(x0=(-r, r), x1=(-r, r), x2=(-r, r))
def show_softmax(x0, x1, x2):
    x = tensor([x0, x1, x2])
    xs = softmax_torch(x)
    plt.barh([2, 1, 0], xs)
    plt.xlim(0, 1)
    plt.yticks([2, 1, 0], ['output 0', 'output 1', 'output 2'])
    plt.ylabel("softmax(x)")
    return xs


A. When all the inputs are the same, the total of 1 is divided evenly among the outputs, so the result is [0.3333, 0.3333, 0.3333].  
B. When one input is much larger than the others, its output takes most of the total and the remaining outputs share what is left; if the other inputs are equal to each other, they get equal shares. With x0 = 2 and x1 = x2 = -2, the result is [0.9647, 0.0177, 0.0177], which sums to 1 up to rounding.  
C. When one input is much smaller than the other two, the outputs still sum to 1: the output for the smallest input is close to zero, and the two larger inputs split the rest (evenly when they are equal). With x2 = -2 and x0 = x1 = 2, the result is [0.4955, 0.4955, 0.0091].  
D. Output 1 is largest when x1 is at its largest value and x0 and x2 are at their smallest values. This makes sense because the outputs must sum to 1, so shrinking the other inputs leaves more of the total for output 1. The same holds for any of the outputs.
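
The slider cases described above can also be checked without the widget. The following is a minimal plain-Python sketch: it uses `math.exp` instead of PyTorch, and the helper name `softmax_list` is hypothetical, not part of the notebook. It evaluates the three situations at the slider limits of ±2.

```python
import math

def softmax_list(values):
    """Softmax of a list of floats (illustrative helper, not from the notebook)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

equal = softmax_list([0.5, 0.5, 0.5])       # all inputs the same
one_big = softmax_list([2.0, -2.0, -2.0])   # x0 much bigger than the others
one_small = softmax_list([2.0, 2.0, -2.0])  # x2 much smaller than the others

for name, probs in [("equal", equal), ("one big", one_big), ("one small", one_small)]:
    print(name, [round(p, 4) for p in probs], "sum =", round(sum(probs), 4))
```

Any three equal inputs should give the same uniform result, not only zeros.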

2. Fill in the following function to implement softmax yourself:

In [6]:
def softmax(xx):
    # Exponentiate x so all numbers are positive.
    expos = xx.exp()
    assert expos.min() >= 0
    # Normalize (divide by the sum).
    return expos / expos.sum()

3. Evaluate `softmax(x)` and verify that it is close to the `softmax_torch(x)` you evaluated above.

In [7]:
softmax(x)

tensor([0.0900, 0.2447, 0.6652])

4. Evaluate `softmax_torch(__)` for each of the following expressions. Observe how each output relates to `softmax_torch(x)`.

- `x + 1`
- `x - 100`
- `x - x.max()`
- `x * 0.5`
- `x * 3.0`

In [8]:
softmax_torch(x * 3)

tensor([0.0024, 0.0473, 0.9503])

softmax_torch(x + 1) = tensor([0.0900, 0.2447, 0.6652]) --> same as softmax_torch(x)  
softmax_torch(x - 100) = tensor([0.0900, 0.2447, 0.6652]) --> same as softmax_torch(x)  
softmax_torch(x - x.max()) = tensor([0.0900, 0.2447, 0.6652]) --> same as softmax_torch(x)  
softmax_torch(x * 0.5) = tensor([0.1863, 0.3072, 0.5065]) --> different: flatter, closer to uniform  
softmax_torch(x * 3.0) = tensor([0.0024, 0.0473, 0.9503]) --> different: sharper, more weight on the largest input  
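
To compare the shifted and scaled inputs side by side, the sketch below repeats the experiment in plain Python. The helper `softmax_list` is hypothetical, and the arithmetic is done in 64-bit floats rather than PyTorch's default 32-bit, so tiny differences in the last digits are possible.

```python
import math

def softmax_list(values):
    """Softmax of a list of floats (illustrative helper, not from the notebook)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 3.0]
base = softmax_list(x)

shifted = softmax_list([v - 100 for v in x])  # subtract a constant from every input
halved = softmax_list([v * 0.5 for v in x])   # shrink the gaps between inputs
tripled = softmax_list([v * 3.0 for v in x])  # widen the gaps between inputs

for name, probs in [("x", base), ("x - 100", shifted),
                    ("x * 0.5", halved), ("x * 3.0", tripled)]:
    print(f"{name:8s}", [round(p, 4) for p in probs])
```

Shifting leaves the result unchanged, while scaling changes how much of the total goes to the largest input.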

5. *Numerical issues*. Assign `x2 = 50 * x`. Try `softmax(x2)` and observe that the result includes the dreaded `nan` -- "not a number". Something went wrong. **Evaluate the first mathematical operation in `softmax`** for this particularly problematic input. You should see another kind of abnormal value.

In [9]:
x2 = 50 * x
softmax(x2)

tensor([0., nan, nan])
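
The `nan` values come from the division step, after `exp` has already overflowed. As a rough illustration in plain Python, the sketch below checks which exponentials fit in a 32-bit float and then simulates the overflow with `float("inf")`. Python floats are 64-bit, so the overflow is simulated here rather than produced by `exp`; the constant `FLOAT32_MAX` is the standard IEEE 754 single-precision limit.

```python
import math

FLOAT32_MAX = 3.4028235e38  # largest finite 32-bit float (IEEE 754 single precision)

# exp(50) still fits in float32, but exp(100) and exp(150) do not,
# so a float32 tensor stores them as inf.
fits = [math.exp(v) <= FLOAT32_MAX for v in (50.0, 100.0, 150.0)]
print(fits)

inf = float("inf")
expos = [math.exp(50.0), inf, inf]   # simulated float32 result of exp
total = sum(expos)                   # inf
result = [e / total for e in expos]  # finite / inf = 0.0, inf / inf = nan
print(result)
```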

In [10]:
softmax(x + 1)

tensor([0.0900, 0.2447, 0.6652])

6. *Fixing numerical issues*. Now try `softmax(x2 - 150.0)`. Observe that you now get valid numbers. Also observe how the constant we subtracted relates to the value of `x2`.

In [10]:
softmax(x2 - 150.0)

tensor([3.7835e-44, 1.9287e-22, 1.0000e+00])

In [11]:
x2

tensor([ 50., 100., 150.])

7. Copy your `softmax` implementation to a new function, `softmax_stable`, and change it so that it subtracts `xx.max()` before exponentiating. (Don't use any in-place operations.) Verify that `softmax_stable(x2)` now works, and obtains the same result as `softmax_torch(x2)`.

In [12]:
def softmax_stable(xx):
    # Subtract max and then exponentiate x so all numbers are positive.
    expos = (xx - xx.max()).exp()
    assert expos.min() >= 0
    # Normalize (divide by the sum).
    return expos / expos.sum()

In [13]:
softmax_torch(x2)

tensor([3.7835e-44, 1.9287e-22, 1.0000e+00])

In [14]:
softmax_stable(x2)

tensor([3.7835e-44, 1.9287e-22, 1.0000e+00])

## Analysis

Consider the following situation:

In [15]:
x2 = tensor([1., 0.,])
x3 = x2 - 1
x3

tensor([ 0., -1.])

In [16]:
x4 = x2 * 2
x4

tensor([2., 0.])

1. Are `softmax(x2)` and `softmax(x3)` the same or different? How could you tell without having to evaluate them?


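`softmax(x2)` and `softmax(x3)` are the same. `x3` is `x2` with a constant subtracted from every element, and softmax is unchanged when the same constant is added to or subtracted from all inputs: the common factor `exp(-1)` appears in both the numerator and the sum, so it cancels.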

2. Are `softmax(x2)` and `softmax(x4)` the same or different? How could you tell without having to evaluate them?


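`softmax(x2)` and `softmax(x4)` are different. `x4` multiplies every element by 2, which doubles the gap between the two inputs (from 1 to 2). Softmax depends on those gaps, so the larger input gets a bigger share of the total.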
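
One way to check both answers is to note that a two-element softmax depends only on the difference between its inputs: the first output equals `1 / (1 + exp(-(a - b)))`. The plain-Python sketch below compares `x2`, `x3`, and `x4` on that basis. The helper names are hypothetical and the values are plain lists rather than tensors.

```python
import math

def softmax_list(values):
    """Softmax of a list of floats (illustrative helper, not from the notebook)."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def first_output_from_gap(a, b):
    # For two inputs, the first softmax output is a function of a - b only.
    return 1.0 / (1.0 + math.exp(-(a - b)))

x2 = [1.0, 0.0]
x3 = [0.0, -1.0]  # x2 - 1: the gap is still 1
x4 = [2.0, 0.0]   # x2 * 2: the gap doubles to 2

for name, v in [("x2", x2), ("x3", x3), ("x4", x4)]:
    print(name, round(softmax_list(v)[0], 4), round(first_output_from_gap(*v), 4))
```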

3. Explain why `softmax(x2)` failed in Task \#5 (where `x2 = 50 * x`).

`softmax(x2)` failed because `exp(100)` and `exp(150)` are too large to represent as 32-bit floats, so they overflow to `inf`. The sum is then also `inf`, and dividing `inf` by `inf` gives `nan`.

4. Use your observations in \#1-2 above to explain why `softmax_stable` still gives the correct answer even though we changed the input.

`softmax_stable` still gives the correct answer because subtracting `xx.max()` subtracts the same constant from every input and, as observed in \#1, that leaves the softmax output unchanged.

5. Explain why `softmax_stable` doesn't give us infinity or Not A Number anymore.

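`softmax_stable` doesn't output `inf` or `nan` because, after subtracting the maximum, the largest input is 0 and all the others are negative. Every exponential is therefore between 0 and 1, so nothing overflows, and the sum is at least 1, so the division is always well defined.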
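
The same effect can be reproduced in plain Python, as a sketch. Python floats are 64-bit, so `math.exp` only overflows for arguments above roughly 709, and it raises `OverflowError` instead of returning `inf` as PyTorch does. The inputs here are therefore larger than the notebook's, and the function names are hypothetical.

```python
import math

def softmax_naive(values):
    exps = [math.exp(v) for v in values]  # raises OverflowError if a value is too large
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable_list(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]  # every exponent is <= 0
    total = sum(exps)                           # at least 1, from the max element
    return [e / total for e in exps]

big = [500.0, 1000.0, 1500.0]

try:
    softmax_naive(big)
    naive_ok = True
except OverflowError:
    naive_ok = False

stable = softmax_stable_list(big)
print("naive succeeded:", naive_ok)
print("stable result:", stable)
```

Very negative exponents underflow quietly to 0, which is harmless here because the sum still contains the term `exp(0) = 1`.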

## Extension *optional*

Try to prove your observation in Analysis \#1 by symbolically simplifying the expression `softmax(logits + c)` and seeing if you can get `softmax(logits)`. Remember that `softmax(x) = exp(x) / exp(x).sum()` and `exp(a + b) = exp(a)exp(b)`.
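
One possible route through the algebra is sketched below, written per component. The notation is an assumption made here, not taken from the notebook: $x_i$ stands for the $i$-th logit and $c$ for the constant added to every logit.

```latex
\begin{aligned}
\operatorname{softmax}(x + c)_i
  &= \frac{\exp(x_i + c)}{\sum_j \exp(x_j + c)} \\
  &= \frac{\exp(x_i)\exp(c)}{\exp(c)\sum_j \exp(x_j)} \\
  &= \frac{\exp(x_i)}{\sum_j \exp(x_j)}
   = \operatorname{softmax}(x)_i .
\end{aligned}
```

The factor $\exp(c)$ does not depend on the summation index $j$, which is why it can be pulled out of the sum and cancelled.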