# Action Selection Strategies Comparison

We will compare the following action selection strategies:
1. ε-greedy
1. Optimistic Initial Estimates
1. Softmax
1. Upper-Confidence-Bound

#### Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import random
import math

#### Parameters

In [None]:
# Environment Parameters
initial_winning_probs = [0.1, 0.3, 0.6, 0.4, 0.1, 0.4, 0.69, 0.71, 0.5, 0.2]
k = np.size(initial_winning_probs)

# Comparison Parameters
iterations_for_average = 100
number_of_plays = 1000

In [None]:
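def play_bandid(action: int):
    """Play bandit `action` ten times and return the number of wins (0 to 10)."""
    times_won = 0
    for _ in range(10):
        # Each play is won with the winning probability of the chosen action
        if random.random() < initial_winning_probs[action]:
            times_won += 1
    return times_won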
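
Each call to `play_bandid` performs ten independent win/lose trials, so the reward follows a binomial distribution with $n=10$ and $p$ equal to the winning probability of the chosen action. The following is a minimal sketch of that equivalence. The names `rng` and `play_bandit_binomial` are illustrative and not part of this notebook, and the seed is only there to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

def play_bandit_binomial(p, size=None):
    # Number of wins in 10 independent plays, each won with probability p
    return rng.binomial(10, p, size=size)

# Sample many rewards for the best action (p = 0.71)
samples = play_bandit_binomial(0.71, size=100_000)
print(samples.mean())  # should be close to 10 * 0.71
```

The expected reward of an action is therefore ten times its winning probability, which is the scale on which all value estimates below live.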

# ε-greedy
### Idea
The *ε-greedy* strategy combines exploration and exploitation. The agent maintains a list with the average reward of each action.
Usually the agent chooses the action with the highest estimated value (exploitation).
With probability $ε$ it instead chooses a random action to explore the environment (exploration).


### Parameter
- $ε$ is the probability of choosing a random action instead of the action with the highest estimated value.


### Formula

\begin{equation}
A_t = \left\{
    \begin{array}{ll}
        \underset{a}{\operatorname{argmax}}\big[Q_t(a)\big] & \text{with probability } 1-ε \\
        \text{a random action} & \text{with probability } ε
    \end{array}
\right.
\end{equation}

- $Q_t(a)$ is the estimated value of action $a$ at time step $t$.
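
The estimate $Q_t(a)$ can be kept as a running average of the rewards seen for action $a$ without storing past rewards, using the update $Q \leftarrow Q + (R - Q)/n$. The sketch below checks that this incremental update gives the same result as the ordinary mean. The helper `incremental_mean` and the reward values are made up for illustration.

```python
import numpy as np

def incremental_mean(rewards):
    """Running average via Q <- Q + (R - Q) / n (illustrative helper)."""
    q = 0.0
    for n, reward in enumerate(rewards, start=1):
        q += (reward - q) / n
    return q

example_rewards = [3, 7, 5, 6]  # made-up rewards for a single action
print(incremental_mean(example_rewards), np.mean(example_rewards))
```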

In [None]:
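def learn_epsilon_greedy(epsilon, number_of_plays=number_of_plays):
    mean_rewards = [0]   # running mean of all rewards received so far
    q = np.zeros(k)      # estimated value Q(a) of each action
    count = np.zeros(k)  # number of times each action has been selected
    for t in range(1, number_of_plays):
        if random.random() < epsilon:
            a = np.random.randint(k)  # explore: uniformly random action
        else:
            a = np.argmax(q)          # exploit: action with the highest estimate
        reward = play_bandid(a)
        count[a] += 1
        mean_reward = mean_rewards[-1] + ((reward - mean_rewards[-1]) / t)
        mean_rewards.append(mean_reward)
        # Incremental sample-average update of the selected action's estimate
        q[a] = q[a] + (reward - q[a]) / count[a]
    return mean_rewards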

In [None]:
def epsilon_greedy_plot():
    # Mean reward per ε, averaged over `iterations_for_average` independent runs
    epsilon_greedy_epsilons = [1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1]
    epsilon_greedy_averages = [
        np.average([np.average(learn_epsilon_greedy(epsilon)) for _ in range(iterations_for_average)])
        for epsilon in epsilon_greedy_epsilons
    ]
    return epsilon_greedy_epsilons, epsilon_greedy_averages

In [None]:
plt.figure(figsize=(10, 5))
line1, = plt.plot(learn_epsilon_greedy(0.05), label="ε=0.05")
line2, = plt.plot(learn_epsilon_greedy(0.1), label="ε=0.1")
line3, = plt.plot(learn_epsilon_greedy(0.2), label="ε=0.2")
plt.legend()
plt.title('Different Exploration Rates (ε)')
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
epsilon_greedy_epsilons, epsilon_greedy_averages = epsilon_greedy_plot()
plt.plot(epsilon_greedy_epsilons, epsilon_greedy_averages, "r-", label="ε-greedy (ε)")
plt.legend()
plt.semilogx(base=2)
plt.xticks(
    [1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1], 
    ['1/128', '1/64', '1/32', '1/16', '1/8', '1/4', '1/2', '1'])
plt.ylabel(f'Average reward over first {number_of_plays} steps')
plt.title(f'Comparison ε values \n(averaging over {iterations_for_average} executions)')
plt.show()

# Optimistic Initial Estimates
### Idea
With *Optimistic Initial Estimates* we try to encourage the agent to explore actions. This works even in combination with a plain greedy approach. The idea is to initialize all $Q(a)$ with values that lie in the range of the mean rewards or even slightly above.

As a result, the agent picks an action and is then probably disappointed by the reward, because the initial estimate was very likely higher than the reward actually received. The estimate for that action drops, so in the next iteration the agent is likely to pick a different action. The result is an agent that is encouraged to try different actions in the beginning, before the value estimates converge.

### Parameter
- $initial\:value$ is used for initializing the estimates $Q_0(a)$. This parameter should usually lie in the range of the mean rewards or slightly above.

### Formula
$$\forall a \in A: Q_0(a) = initial\:value$$
- $A$ is the set of all actions the agent can pick.
- $Q_0(a)$ is the estimated value of action $a$ at time step $0$.
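
To see why optimistic estimates drive exploration, consider a small sketch with deterministic rewards. This simplification is for illustration only, since the bandits in this notebook are random. The helper `greedy_picks` and the values in `fixed_rewards` are hypothetical.

```python
import numpy as np

def greedy_picks(initial_value, rewards, steps, alpha=0.1):
    """Return the actions a purely greedy agent picks (illustrative helper)."""
    q = np.full(len(rewards), initial_value, dtype=float)
    picks = []
    for _ in range(steps):
        a = int(np.argmax(q))               # greedy choice, ties go to the lowest index
        picks.append(a)
        q[a] += alpha * (rewards[a] - q[a])  # constant step-size update
    return picks

fixed_rewards = [2.0, 5.0, 3.0]  # made-up deterministic rewards
optimistic = greedy_picks(10.0, fixed_rewards, steps=3)
pessimistic = greedy_picks(0.0, fixed_rewards, steps=3)
print(optimistic)   # every action is tried once
print(pessimistic)  # the agent stays with the first action
```

With an initial value of 10 every reward is a disappointment, so the estimate of the chosen action drops below the untouched ones and the agent moves on. With an initial value of 0 the first reward already lifts action 0 above the rest, and the greedy agent never leaves it.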

In [None]:
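def learn_weighted_average_with_initial_values(initial_q_values, number_of_plays = number_of_plays):
    mean_rewards = [0]
    alpha = 0.1  # constant step size for the value estimates
    q = np.full(k, initial_q_values, dtype=float)  # optimistic initial estimates Q_0(a)
    count = np.zeros(k)
    for t in range(1, number_of_plays):
        a = np.argmax(q)  # purely greedy selection (ε=0)
        reward = play_bandid(a)
        count[a] += 1
        # Note: unlike the other strategies, which track the running mean (step size 1/t),
        # this tracks an exponential moving average of the rewards with step size 1/100.
        mean_reward = mean_rewards[-1] + 1/100 * (reward - mean_rewards[-1])
        mean_rewards.append(mean_reward)
        q[a] = q[a] + alpha * (reward - q[a])
    return mean_rewards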

In [None]:
def optimistic_values_plot():
    # Mean reward per initial value, averaged over `iterations_for_average` independent runs
    epsilon_optimistic_initializations = [1/2, 1, 2, 4, 8, 10]
    epsilon_optimistic_averages = [
        np.average([np.average(learn_weighted_average_with_initial_values(initial_value))
                    for _ in range(iterations_for_average)])
        for initial_value in epsilon_optimistic_initializations
    ]
    return epsilon_optimistic_initializations, epsilon_optimistic_averages

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(learn_weighted_average_with_initial_values(10), label="initial_values=10, α=0.1")
plt.plot(learn_weighted_average_with_initial_values(8), label="initial_values=8, α=0.1")
plt.plot(learn_weighted_average_with_initial_values(4), label="initial_values=4, α=0.1")
plt.plot(learn_weighted_average_with_initial_values(1), label="initial_values=1, α=0.1")
plt.legend()
plt.title('Different initial values (greedy action selection)')
plt.show()

In [None]:
plt.figure(figsize=(10, 5))

epsilon_optimistic_initializations, epsilon_optimistic_averages = optimistic_values_plot()

plt.plot(epsilon_optimistic_initializations, epsilon_optimistic_averages, "k-", label="optimistic initialization (initial value, α=0.1)")
plt.legend()
plt.semilogx(base=2)
plt.xticks(
    [1/2, 1, 2, 4, 8, 10], 
    ['1/2', '1', '2', '4', '8', '10'])
plt.ylabel(f'Average reward over first {number_of_plays} steps')
plt.title(f'Comparison initial action values \n(averaging over {iterations_for_average} executions)')
plt.show()

# Softmax
### Idea
The idea behind *softmax* is to favour actions that perform well and to penalize actions that perform poorly. *Softmax* creates a probability distribution over the actions in which actions with a high estimated value receive a high probability.

### Parameter
- $τ$ is called the temperature. A higher temperature leads to more similar probabilities for actions.

### Formula
$$\pi_t(a) = Pr\{A_t = a\} = \frac{e^{Q_t(a)/τ}}{\sum\limits_{i=1}^{n} e^{Q_t(i)/τ}}$$
- $n$ is the number of actions
- $Q_t(a)$ is the estimated value of action $a$ at time step $t$.
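
The effect of the temperature can be illustrated with a short sketch. The function `softmax_probs` and the estimates in `q_example` are made up for this example. The function subtracts the maximum before exponentiating, which is a common stability trick and does not change the resulting probabilities.

```python
import numpy as np

def softmax_probs(q, tau):
    """Softmax over value estimates q with temperature tau (illustrative helper)."""
    z = (q - np.max(q)) / tau  # shift by the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

q_example = np.array([1.0, 4.0, 7.0])  # made-up value estimates
for tau in (0.25, 1.0, 4.0):
    print(tau, np.round(softmax_probs(q_example, tau), 3))
```

A low temperature concentrates almost all probability on the action with the highest estimate, while a high temperature moves the distribution towards uniform.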

In [None]:
def softmax(q, tau):
    # Shift by the maximum for numerical stability; the probabilities are unchanged.
    exp_q = np.exp((q - np.max(q)) / tau)
    return exp_q / np.sum(exp_q)

In [None]:
def learn_softmax(tau, number_of_plays = number_of_plays):
    mean_rewards = [0]
    q = np.zeros(k)
    count = np.zeros(k)
    for t in range(1, number_of_plays):
        pi_t = softmax(q, tau)
        a = np.random.choice(np.arange(k), p=pi_t)
        reward = play_bandid(a)
        count[a] += 1
        mean_reward = mean_rewards[-1] + ((reward - mean_rewards[-1]) / t)
        mean_rewards.append(mean_reward)
        q[a] = q[a] + (reward - q[a]) / count[a]
    return mean_rewards

In [None]:
plt.figure(figsize=(10, 5))
line1, = plt.plot(learn_softmax(0.8, 2000), label="τ=0.8")
line2, = plt.plot(learn_softmax(1.0, 2000), label="τ=1.0")
line3, = plt.plot(learn_softmax(1.2, 2000), label="τ=1.2")
plt.legend()
plt.title('Softmax Action Selection with different Temperatures (τ)')
plt.show()

In [None]:
def softmax_plot():
    # Mean reward per τ, averaged over `iterations_for_average` independent runs
    softmax_taus = [1/16, 1/8, 1/4, 1/2, 3/4, 1, 5/4, 3/2, 2, 4]
    softmax_averages = [
        np.average([np.average(learn_softmax(tau)) for _ in range(iterations_for_average)])
        for tau in softmax_taus
    ]
    return softmax_taus, softmax_averages

In [None]:
plt.figure(figsize=(10, 5))

softmax_taus, softmax_averages = softmax_plot()
plt.plot(softmax_taus, softmax_averages, "c-", label="Softmax (τ)")

plt.legend()
plt.semilogx(base=2)
plt.xticks(
    softmax_taus, 
    ['1/16', '1/8', '1/4', '1/2', '0.75', '1', '1.25', '1.5', '2', '4'])
plt.ylabel(f'Average reward over first {number_of_plays} steps')
plt.title(f'Comparison τ values \n(averaging over {iterations_for_average} executions)')
plt.show()

# Upper-Confidence-Bound
### Idea
With the *Upper-Confidence-Bound* strategy we encourage the agent to explore actions that have rarely been tried so far. As with the other strategies, the balance between exploration and exploitation is controlled by a hyperparameter.

### Parameter
- $c$ is a confidence value that controls the level of exploration. This parameter should be in the same range as the mean reward $Q_t(a)$ because otherwise its impact would be too small and the likelihood of exploration would decrease.

### Formula
$$A_t = \underset{a}{\operatorname{argmax}}\Bigg[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\Bigg]$$
- $Q_t(a)$ is the estimated value of action $a$ at time step $t$.
- $N_t(a)$ is the number of times that action $a$ has been selected, prior to time $t$.
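
A small sketch can make the exploration bonus concrete. Three actions share the same estimate but have been selected a different number of times. The helper `ucb_scores` and all numbers are made up for illustration.

```python
import numpy as np

def ucb_scores(q, counts, t, c):
    """Value estimate plus exploration bonus for each action (illustrative helper)."""
    return q + c * np.sqrt(np.log(t) / counts)

q_example = np.array([5.0, 5.0, 5.0])   # identical value estimates
counts_example = np.array([1, 10, 100])  # how often each action was selected
scores = ucb_scores(q_example, counts_example, t=111, c=2.0)
print(scores, int(np.argmax(scores)))
```

The action that has been selected least often receives the largest bonus and is picked next. With $c=0$ the bonus vanishes and the rule reduces to greedy selection.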

In [None]:
def upper_confidence_bound(q, c, t, count):
    return np.argmax(q + c * np.sqrt(np.log(t) / count))

In [None]:
def learn_uppper_confidence_bound(c, number_of_plays = number_of_plays):
    mean_rewards = [0]
    q = np.zeros(k)
    # Counts start at 1 to avoid a division by zero in the exploration bonus.
    # Side effect: the sample average below behaves as if every action had
    # one initial observation with reward 0.
    count = np.full(k, 1)
    for t in range(1, number_of_plays):
        a = upper_confidence_bound(q, c, t, count)
        reward = play_bandid(a)
        count[a] += 1
        mean_reward = mean_rewards[-1] + ((reward - mean_rewards[-1]) / t)
        mean_rewards.append(mean_reward)
        q[a] = q[a] + ((reward - q[a]) / count[a])
    return mean_rewards

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(learn_uppper_confidence_bound(4.0), label="c=4.0")
plt.plot(learn_uppper_confidence_bound(3.0), label="c=3.0")
plt.plot(learn_uppper_confidence_bound(2.0), label="c=2.0")
plt.plot(learn_uppper_confidence_bound(1.0), label="c=1.0")
plt.legend()
plt.title('Different c values')
plt.show()

In [None]:
def ucb_plot():
    # Mean reward per c, averaged over `iterations_for_average` independent runs
    ucb_cs = [1/2, 1, 2, 4, 8, 10]
    ucb_averages = [
        np.average([np.average(learn_uppper_confidence_bound(c)) for _ in range(iterations_for_average)])
        for c in ucb_cs
    ]
    return ucb_cs, ucb_averages

In [None]:
plt.figure(figsize=(10, 5))

ucb_cs, ucb_averages = ucb_plot()

plt.plot(ucb_cs, ucb_averages, "k-", label="UCB (c)")
plt.legend()
plt.semilogx(base=2)
plt.xticks(
    [1/2, 1, 2, 4, 8, 10], 
    ['1/2', '1', '2', '4', '8', '10'])
plt.ylabel(f'Average reward over first {number_of_plays} steps')
plt.title(f'Comparison c values \n(averaging over {iterations_for_average} executions)')
plt.show()

# Comparison Action Selection Strategies

In [None]:
plt.figure(figsize=(10, 5))

# Epsilon-greedy
epsilon_greedy_epsilons, epsilon_greedy_averages = epsilon_greedy_plot()
plt.plot(epsilon_greedy_epsilons, epsilon_greedy_averages, "r-", label="ε-greedy (ε)")

# Weighted Average with Optimistic Initialization
epsilon_optimistic_initializations, epsilon_optimistic_averages = optimistic_values_plot()
plt.plot(epsilon_optimistic_initializations, epsilon_optimistic_averages, "k-", label="optimistic initialization (initial value, α=0.1, ε=0)")

# UCB
ucb_cs, ucb_averages = ucb_plot()
plt.plot(ucb_cs, ucb_averages, "b-", label="UCB (c)")

# Softmax
softmax_taus, softmax_averages = softmax_plot()
plt.plot(softmax_taus, softmax_averages, "g-", label="Softmax (τ)")

plt.legend()
plt.semilogx(base=2)
plt.xticks(
    [1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 10], 
    ['1/128', '1/64', '1/32', '1/16', '1/8', '1/4', '1/2', '1', '2', '4', '8', '10'])
plt.ylabel(f'Average reward over first {number_of_plays} steps')
plt.title(f'Comparison Action Selection Strategies \n(averaging over {iterations_for_average} executions)')
plt.show()

# Performance / Interpretation

<span style="color:red; font-weight:bold">ε-greedy</span>

The *ε-greedy* strategy performs acceptably in the range between $ε=\frac{1}{16}$ and $ε=\frac{1}{4}$. With lower ε values the share of exploration is too low, so the valuable actions are found only later. With higher ε values the agent learns the *real* action values more quickly, but it keeps choosing random actions throughout the run, which leads to a lower average reward.

The best performance is achieved with $ε=\frac{1}{8}$. With that configuration the agent strikes a good balance between exploitation and exploration.

The agent reaches an average reward of $4$ with $ε=1$. In this case the agent chooses a random action at each step. $4$ is the average of all winning probabilities ($0.4$) multiplied by $10$, the number of plays per step.
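
The value of 4 for a uniformly random policy can be checked directly from the environment parameters. The sketch below repeats the winning probabilities from the parameter cell. The variable names are chosen for this example only.

```python
import numpy as np

# Winning probabilities copied from the environment parameters
winning_probs = [0.1, 0.3, 0.6, 0.4, 0.1, 0.4, 0.69, 0.71, 0.5, 0.2]
plays_per_step = 10  # each step plays the chosen bandit ten times

# Expected reward per step when every action is chosen equally often
expected_random_reward = plays_per_step * np.mean(winning_probs)
# Expected reward per step when always choosing the best action
best_expected_reward = plays_per_step * max(winning_probs)
print(expected_random_reward, best_expected_reward)
```

The gap between the two numbers is the most that any strategy can gain over random play in this environment.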

**Domain of definition** In this environment the $ε$ values should be in the range between $\frac{1}{16}$ and $\frac{1}{4}$.

<span style="color:black; font-weight:bold">Optimistic Initial Estimates</span>

The *Optimistic Initial Estimates* strategy starts to perform reasonably with $initial\:value=4$. The exploration portion increases with $initial\:value$, which makes the agent more likely to pick an action that has not been picked often yet.

The highest average reward is achieved with $initial\:value=8$. One possible reason is that this value is close to the best reward of all bandits. In that case it is quite likely that the agent has picked every action at least once and has an estimate for each action.

As $initial\:value$ tends to infinity, the expected average reward is 4 (the average reward of all bandits) because the agent picks each action equally often.

**Domain of definition** The $initial\:value$ should be in the range of the expected best reward of the bandits.

<span style="color:green; font-weight:bold">Softmax</span>

The *Softmax* strategy performs reasonably with $τ$ between $0.75$ and $1.5$. Lower $τ$ values reduce exploration: the agent commits early to the actions that currently look best, so more steps are needed to find the truly best actions. With higher $τ$ values the probabilities are too similar, and poorly performing actions are not penalized enough. The best performance is reached with $τ=1$.

**Domain of definition** In this environment the temperature $τ$ should be between $0.75$ and $1.5$.

<span style="color:blue; font-weight:bold">Upper-Confidence-Bound</span>

The *Upper-Confidence-Bound* strategy starts to perform reasonably with $c=2$. The exploration portion increases with $c$, which makes the agent more likely to pick an action that has not been picked often yet.

The highest average reward is achieved with $c=4$. One possible reason is that this value is roughly equal to the average reward of all bandits.

As $c$ increases beyond that point and approaches the best-case reward of 10, the exploration portion gets too high, which leads to almost random action selection. As $c$ tends to infinity, the expected average reward is 4 (the average reward of all bandits) because the agent picks each action equally often.

**Domain of definition** The $c$ parameter should be in the range of the expected average reward of the bandits.