# The Softmax Algorithm

The epsilon-Greedy algorithm has an obvious weakness: it explores options completely at random, without any concern for their merits. For example, in one scenario (call it Scenario A), you might have two arms, one of which rewards you 10% of the time and the other 13% of the time. In Scenario B, the two arms might reward you 10% of the time and 99% of the time. In both scenarios, the probability that the epsilon-Greedy algorithm explores the worse arm is exactly the same (it’s `epsilon / 2`), even though the inferior arm in Scenario B is, in relative terms, much worse than the inferior arm in Scenario A.

This is a problem for several reasons:
- If the difference in reward rates between two arms is small, you’ll need to explore a lot more often than 10% of the time to correctly determine which of the two options is actually better.
- In contrast, if the difference is large, you need to explore a lot less than 10% of the time to correctly identify the better of the two options. In that case, you’ll end up losing a lot of reward by exploring an unambiguously inferior option. When we first described the epsilon-Greedy algorithm, we said that we wouldn’t set `epsilon = 1.0` precisely so that we wouldn’t waste time on inferior options. But if the difference between two arms is large enough, we still waste time on inferior options, simply because the epsilon-Greedy algorithm always explores completely at random.

Putting these two points together, it seems clear that there’s a qualitative property missing from the epsilon-Greedy algorithm. We need to make our bandit algorithm care about the known differences between the estimated values of the arms when it decides which arm to explore. We need structured exploration rather than the haphazard exploration that the epsilon-Greedy algorithm provides.
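To see how blind this exploration is, here is a minimal simulation sketch. It assumes a simple epsilon-Greedy selection rule (exploit with probability `1 - epsilon`, otherwise pick an arm uniformly at random), `epsilon = 0.1`, and estimated values that already equal the true reward rates of Scenario A and Scenario B. The helper names are hypothetical, chosen only for illustration.

```python
import random

def epsilon_greedy_select(values, epsilon, rng):
    # Exploit the best-looking arm with probability 1 - epsilon,
    # otherwise explore by picking any arm uniformly at random.
    if rng.random() > epsilon:
        return max(range(len(values)), key=lambda i: values[i])
    return rng.randrange(len(values))

def worse_arm_frequency(values, epsilon, n_draws=100_000, seed=0):
    # Fraction of draws in which the arm with the lowest estimate is chosen
    rng = random.Random(seed)
    worse = min(range(len(values)), key=lambda i: values[i])
    hits = sum(
        epsilon_greedy_select(values, epsilon, rng) == worse
        for _ in range(n_draws)
    )
    return hits / n_draws

# Assumption: the estimates equal the true rates from the two scenarios
freq_a = worse_arm_frequency([0.10, 0.13], epsilon=0.1)  # Scenario A
freq_b = worse_arm_frequency([0.10, 0.99], epsilon=0.1)  # Scenario B
print(freq_a, freq_b)
```

Both frequencies come out close to `epsilon / 2 = 0.05`, even though the worse arm in Scenario B is far less promising than the worse arm in Scenario A.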

The first algorithm we’ll describe that takes this structural information into account is called the Softmax algorithm. The Softmax algorithm tries to cope with arms differing in estimated value by explicitly incorporating information about the reward rates of the available arms into its method for choosing which arm to select when it explores.

You can get an initial intuition for how the Softmax algorithm handles this problem by imagining that you choose each arm in proportion to its estimated value. Suppose that you have two arms, A and B. Now imagine that, based on your past experiences, these two arms have had two different rates of success: `rA` and `rB`. With those assumptions, the most naive possible implementation of a Softmax-like algorithm would have you choose Arm A with probability `rA / (rA + rB)` and Arm B with probability `rB / (rA + rB)`.

In code this would look like:

In [1]:
import random

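def categorical_draw(probs):
    # Draw index i with probability probs[i] by walking the cumulative sum
    z = random.random()
    cum_prob = 0.0
    for i, prob in enumerate(probs):
        cum_prob += prob
        if cum_prob > z:
            return i
    # Fallback in case floating-point rounding leaves cum_prob just below z
    return len(probs) - 1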

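def select_arm(self):
    # Method of the algorithm object; self.values holds each arm's estimated reward rate
    z = sum(self.values)
    if z == 0:
        # No arm has earned a reward yet, so choose uniformly instead of dividing by zero
        return random.randrange(len(self.values))
    probs = [v / z for v in self.values]
    return categorical_draw(probs)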
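As a quick check of what proportional selection does, the sketch below applies the same `v / z` rule to the estimated rates from Scenario A and Scenario B. The function name `proportional_probs` is hypothetical, and the rates are assumed to be known exactly.

```python
def proportional_probs(values):
    # Each arm's probability is its value divided by the sum of all values
    total = sum(values)
    return [v / total for v in values]

probs_a = proportional_probs([0.10, 0.13])  # Scenario A
probs_b = proportional_probs([0.10, 0.99])  # Scenario B
print(probs_a)
print(probs_b)
```

Unlike epsilon-Greedy, this rule picks the weak arm in Scenario B far less often (about 9% of the time) than the weak arm in Scenario A (about 43% of the time).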

In practice, this very naive algorithm isn’t something people actually use. To reconstruct the algorithm people actually use, we need to make two changes to it.

First, we will calculate a different scale for reward rates by exponentiating our estimates of `rA` and `rB`. Using this new scale, we will choose Arm A with probability `exp(rA) / (exp(rA) + exp(rB))` and Arm B with probability `exp(rB) / (exp(rA) + exp(rB))`. This naive exponential rescaling has the virtue of not behaving strangely if someone used negative numbers as rates of success, since the call to `exp` will turn any negative numbers into positive numbers and ensure that negative numbers in the denominator of these fractions can’t cancel out any positive numbers found there.
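The following sketch contrasts the two rescalings on a pair of estimates that includes a negative value. The values `-0.5` and `1.0` are made up for illustration, and the helper names are hypothetical.

```python
import math

def proportional_probs(values):
    # Naive rule: each value divided by the sum of all values
    total = sum(values)
    return [v / total for v in values]

def exp_probs(values):
    # Exponentiate first so that every weight is strictly positive
    weights = [math.exp(v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

values = [-0.5, 1.0]                # illustrative estimates, one negative
naive = proportional_probs(values)  # [-1.0, 2.0]: not valid probabilities
rescaled = exp_probs(values)        # both entries lie in (0, 1) and sum to 1
print(naive)
print(rescaled)
```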

More importantly, this exponentiation trick brings us very close to the full Softmax algorithm. In fact, plain exponential rescaling is what you get from the Softmax algorithm if you hardcode one of the configurable parameters that the standard Softmax algorithm possesses. This additional parameter is a different sort of scaling factor than the exponentiation we just introduced.

This new type of scaling factor is typically called a temperature parameter based on an analogy with physics in which systems at high temperatures tend to behave randomly, while they take on more structure at low temperatures. In fact, the full Softmax algorithm is closely related to a concept called the Boltzmann distribution in physics, which is used to describe how groups of particles behave.

We’ll call this new temperature parameter `tau`. We introduce `tau` to produce the following new algorithm:
- At time T, select one of the two arms with probabilities computed as follows:
    - `exp(rA / tau) / (exp(rA / tau) + exp(rB / tau))`
    - `exp(rB / tau) / (exp(rA / tau) + exp(rB / tau))`
    
- For whichever arm you picked, update your estimate of the mean using the same update rule we used for the epsilon-Greedy algorithm.
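Putting the two steps together, one possible sketch of the algorithm in Python follows. It generalizes the two-arm formulas to any number of arms and assumes that the update rule is the running average of observed rewards. The class layout, the `seed` argument, and the max-subtraction for numerical stability are additions for illustration rather than part of the description above.

```python
import math
import random

class Softmax:
    def __init__(self, tau, n_arms, seed=None):
        self.tau = tau
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # estimated mean reward per arm
        self.rng = random.Random(seed)

    def probabilities(self):
        # exp(v / tau) normalized over all arms. Subtracting the max value
        # leaves the probabilities unchanged but keeps exp() from overflowing.
        top = max(self.values)
        weights = [math.exp((v - top) / self.tau) for v in self.values]
        total = sum(weights)
        return [w / total for w in weights]

    def select_arm(self):
        arms = range(len(self.values))
        return self.rng.choices(arms, weights=self.probabilities())[0]

    def update(self, chosen_arm, reward):
        # Assumed update rule: incremental running mean of the rewards
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        self.values[chosen_arm] += (reward - self.values[chosen_arm]) / n

# Usage on two simulated Bernoulli arms (the rates are illustrative)
algo = Softmax(tau=0.1, n_arms=2, seed=1)
true_rates = [0.2, 0.6]
env = random.Random(2)
for _ in range(1000):
    arm = algo.select_arm()
    reward = 1.0 if env.random() < true_rates[arm] else 0.0
    algo.update(arm, reward)
print(algo.counts, algo.values)
```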

In [12]:
import numpy as np

# Arm A succeeds one third as often as Arm B.
success_rate_arm_a = 0.2
success_rate_arm_b = 0.6
temp = 0.1  # the temperature parameter, called tau in the text

In [13]:
# Prob of selecting Arm A
np.exp(success_rate_arm_a / temp) / (np.exp(success_rate_arm_a / temp) + np.exp(success_rate_arm_b / temp))

0.017986209962091573

In [14]:
# Prob of selecting Arm B
np.exp(success_rate_arm_b / temp) / (np.exp(success_rate_arm_a / temp) + np.exp(success_rate_arm_b / temp))

0.9820137900379083
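To get a feel for what the temperature does, the sketch below repeats the calculation for Arm A at several temperatures. The helper `prob_arm_a` is hypothetical, the temperature values are chosen arbitrarily, and the expression inside it is an algebraic rearrangement of the two-arm formula (divide numerator and denominator by `exp(rA / tau)`).

```python
import math

def prob_arm_a(r_a, r_b, tau):
    # Two-arm Softmax probability of choosing Arm A, rewritten as
    # 1 / (1 + exp((r_b - r_a) / tau))
    return 1.0 / (1.0 + math.exp((r_b - r_a) / tau))

for tau in [0.05, 0.1, 1.0, 10.0]:
    print(tau, prob_arm_a(0.2, 0.6, tau))
```

As `tau` grows, the probability of choosing Arm A approaches 0.5, which amounts to nearly random exploration. As `tau` shrinks, Arm A is almost never chosen.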