# Implementing Softmax Exploration

Now, let's implement softmax exploration to find the best arm.

# Reference

Deep Reinforcement Learning with Python, by Sudharsan Ravichandiran

In [3]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

In [4]:
from bandits import BanditTwoArmedHighLowFixed
env = BanditTwoArmedHighLowFixed()

In [5]:
print(env.p_dist)

[0.8, 0.2]


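Next, let's initialize the variables:

In [6]:
#number of times each arm has been pulled
count = np.zeros(2)

In [7]:
#sum of the rewards obtained from each arm
sum_rewards = np.zeros(2)

In [8]:
#average reward of each arm
Q = np.zeros(2)

In [9]:
#number of rounds to play
num_rounds = 100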

## Defining the softmax exploration function

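Softmax exploration selects arm $a$ at round $t$ with the following probability, where `T` is the temperature:

$$P_t(a) = \frac{\exp(Q_t(a)/T)}{\sum_{i=1}^{n} \exp(Q_t(i)/T)}$$

Here $Q_t(a)$ is the current average reward of arm $a$ and $n$ is the number of arms. When `T` is high, the probabilities are nearly equal and all arms are explored; when `T` is low, most of the probability goes to the arm with the highest average reward.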

In [10]:
def softmax(T):
    
    #compute the probability of each arm based on the above equation
    denom = sum([np.exp(i/T) for i in Q]) 
    probs = [np.exp(i/T)/denom for i in Q]
    
    #select the arm based on the computed probability distribution of arms
    arm = np.random.choice(env.action_space.n, p=probs)
    
    return arm
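
To see how the temperature shapes the distribution, here is a minimal sketch that returns the probabilities instead of sampling an arm. The helper `softmax_probs` and the values in `q_example` are illustrative assumptions and are not part of the notebook above; subtracting the maximum before exponentiating is a common guard against overflow and does not change the result.

```python
import numpy as np

def softmax_probs(q_values, T):
    #hypothetical helper (not in the original notebook):
    #returns the softmax probabilities rather than a sampled arm
    q = np.asarray(q_values, dtype=float)

    #shifting by the max leaves the probabilities unchanged
    #but keeps np.exp from overflowing when T is small
    z = (q - q.max())/T
    exp_z = np.exp(z)

    return exp_z/exp_z.sum()

#illustrative average rewards for two arms (assumed values)
q_example = [0.8, 0.2]

#compare a high, a moderate, and a low temperature
for T_example in [50, 1, 0.1]:
    print(T_example, softmax_probs(q_example, T_example))
```

At the highest temperature the two probabilities should come out close to 0.5 each, and at the lowest almost all of the probability should sit on the first arm.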

Let's begin by setting the temperature `T` to a high number, say 50:

In [11]:
T = 50

In [12]:
for i in range(num_rounds):
    
    #select the arm based on the softmax exploration method
    arm = softmax(T)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm) 

    #increment the count of the arm by 1
    count[arm] += 1
    
    #update the sum of rewards of the arm
    sum_rewards[arm]+=reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]
    
    #reduce the temperature
    T = T*0.99
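
The loop above multiplies the temperature by 0.99 in every round. As a rough check of how far that decay actually goes, the sketch below computes the schedule directly, using the notebook's starting temperature, decay factor, and number of rounds; the names `T0`, `decay`, `temps`, and `final_T` are introduced here only for illustration.

```python
import numpy as np

#values taken from the notebook: initial temperature, decay factor, rounds
T0, decay, num_rounds = 50, 0.99, 100

#temperature used in each round: T0, T0*decay, T0*decay**2, ...
temps = T0 * decay ** np.arange(num_rounds)

#temperature left after the last update in the loop
final_T = T0 * decay ** num_rounds

print(temps[0], temps[-1], final_T)
```

With these settings the temperature ends at roughly 18, which is still much larger than the rewards (between 0 and 1), so arm selection stays close to uniform for the whole run. A lower starting temperature, a faster decay, or more rounds would shift the agent toward exploitation sooner.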

In [13]:
print(Q)

[0.8        0.25454545]


In [14]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
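
Because arms are sampled at random, the exact values of `Q` will differ from run to run. For readers without `gym_bandits` installed, the following is a self-contained sketch of the same procedure. It assumes that arm `i` pays a reward of 1 with probability `p_dist[i]` and 0 otherwise, which may not match the environment exactly; the function `run_softmax_bandit` and its parameters are hypothetical names, and the running average is written in its incremental form rather than as `sum_rewards/count`.

```python
import numpy as np

def run_softmax_bandit(p_dist, num_rounds=100, T=50.0, decay=0.99, seed=0):
    #hypothetical stand-in for the gym environment: arm i is assumed to
    #pay a reward of 1 with probability p_dist[i] and 0 otherwise
    rng = np.random.default_rng(seed)
    n_arms = len(p_dist)
    count = np.zeros(n_arms)
    Q = np.zeros(n_arms)

    for _ in range(num_rounds):
        #softmax probabilities over the current average rewards
        exp_q = np.exp((Q - Q.max())/T)
        probs = exp_q/exp_q.sum()

        #sample an arm and draw its reward
        arm = rng.choice(n_arms, p=probs)
        reward = float(rng.random() < p_dist[arm])

        #incremental update of the average, equivalent to sum_rewards/count
        count[arm] += 1
        Q[arm] += (reward - Q[arm])/count[arm]

        #reduce the temperature
        T = T*decay

    return Q, count

Q_sim, count_sim = run_softmax_bandit([0.8, 0.2])
print(Q_sim, count_sim)
print('The optimal arm is arm {}'.format(np.argmax(Q_sim)+1))
```

Passing a different `seed` gives a different sequence of pulls, which is a simple way to check how stable the estimated averages are.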
