# Implementing UCB

Now, let's implement the upper confidence bound (UCB) algorithm to find the best arm of a two-armed bandit.


# Reference: 
    
Deep Reinforcement Learning with Python

By: Sudharsan Ravichandiran

In [14]:
import gym
import gym_bandits
import numpy as np

## Creating the bandit environment

In [15]:
from bandits import BanditTwoArmedHighLowFixed
env = BanditTwoArmedHighLowFixed()

In [16]:
print(env.p_dist)

[0.8, 0.2]


In [17]:
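#number of times each arm has been pulled
count = np.zeros(2)

In [18]:
#sum of the rewards obtained from each arm
sum_rewards = np.zeros(2)

In [19]:
#average reward (Q value) of each arm
Q = np.zeros(2)

In [20]:
#total number of rounds to play
num_rounds = 1000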

## Defining the UCB function

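Now, we define the `UCB` function, which returns the arm with the highest upper confidence bound (UCB):

$$ \text{UCB}(a) = Q(a) + \sqrt{\frac{2 \log(t)}{N(a)}} \tag{1} $$

Here, $Q(a)$ is the average reward of arm $a$, $N(a)$ is the number of times arm $a$ has been pulled, and $t$ is the total number of pulls so far.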
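To make equation (1) concrete, here is a small sketch that evaluates it for two arms. The statistics in `Q_example` and `N_example` are made-up illustrative values, not results from the environment. The point is that a rarely pulled arm receives a large exploration bonus and can end up with the highest UCB even though its average reward is lower.

```python
import numpy as np

# Hypothetical statistics after 100 pulls (illustrative values only)
Q_example = np.array([0.8, 0.5])   # average reward of each arm
N_example = np.array([95, 5])      # number of times each arm was pulled
t = N_example.sum()                # total number of pulls so far

# Exploration bonus from equation (1): shrinks as an arm is pulled more often
bonus = np.sqrt(2 * np.log(t) / N_example)

# Upper confidence bound of each arm
ucb_example = Q_example + bonus

print(bonus)
print(ucb_example)
print(np.argmax(ucb_example))
```

With these numbers the second arm (index 1) is selected: its bonus is more than four times that of the first arm, which outweighs the gap in average reward.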

In [21]:
def UCB(i):
    
    #initialize the numpy array for storing the UCB of all the arms
    ucb = np.zeros(2)
    
    #before computing the UCB, we explore all the arms at least once, so for the first 2 rounds,
    #we directly select the arm corresponding to the round number
    if i < 2:
        return i
    
    #from round 2 onwards, every arm has been pulled at least once, so we compute the UCB of
    #all the arms as specified in equation (1) and return the arm which has the highest UCB
    else:
        for arm in range(2):
            ucb[arm] = Q[arm] + np.sqrt((2*np.log(np.sum(count))) / count[arm])
        return np.argmax(ucb)
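The same selection rule can also be written without the loop and without global variables. The sketch below is one possible variant, not the notebook's own function: `ucb_select` is a hypothetical helper that takes the Q values and pull counts as arguments and checks for untried arms directly instead of relying on the round number.

```python
import numpy as np

def ucb_select(Q, count):
    """Return the index of the arm with the highest UCB (hypothetical helper)."""
    Q = np.asarray(Q, dtype=float)
    count = np.asarray(count, dtype=float)

    # Pull any arm that has never been tried first; this also avoids dividing by zero
    untried = np.flatnonzero(count == 0)
    if untried.size > 0:
        return int(untried[0])

    # Equation (1), computed for all arms at once
    t = count.sum()
    return int(np.argmax(Q + np.sqrt(2 * np.log(t) / count)))

# Example calls with made-up statistics
print(ucb_select([0.0, 0.0], [0, 0]))
print(ucb_select([0.8, 0.2], [500, 500]))
```

Passing the statistics in explicitly makes the function easy to test on its own and lets it work for any number of arms.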

In [22]:
for i in range(num_rounds):
    
    #select the arm based on the UCB method
    arm = UCB(i)

    #pull the arm and store the reward and next state information
    next_state, reward, done, info = env.step(arm) 

    #increment the count of the arm by 1
    count[arm] += 1
    
    #update the sum of rewards of the arm
    sum_rewards[arm] += reward

    #update the average reward of the arm
    Q[arm] = sum_rewards[arm]/count[arm]
    

In [23]:
print(Q)

[0.81697342 0.13636364]


In [24]:
print('The optimal arm is arm {}'.format(np.argmax(Q)+1))

The optimal arm is arm 1
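If `gym_bandits` is not installed, the experiment can be approximated with a stand-in bandit. The sketch below assumes that each arm pays a reward of 1 with the probability given in `p_dist` and 0 otherwise; that reward model, the `sim_` variable names, and the fixed seed are assumptions made for illustration, so the numbers will not match the output above exactly.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
p_dist = np.array([0.8, 0.2])         # assumed win probability of each arm
n_arms = len(p_dist)

sim_count = np.zeros(n_arms)          # pulls per arm
sim_sum_rewards = np.zeros(n_arms)    # total reward per arm
sim_Q = np.zeros(n_arms)              # average reward per arm

for i in range(1000):
    if i < n_arms:
        # pull every arm once before using equation (1)
        arm = i
    else:
        bonus = np.sqrt(2 * np.log(sim_count.sum()) / sim_count)
        arm = int(np.argmax(sim_Q + bonus))

    # stand-in for env.step(arm): Bernoulli reward (assumption)
    reward = float(rng.random() < p_dist[arm])

    sim_count[arm] += 1
    sim_sum_rewards[arm] += reward
    sim_Q[arm] = sim_sum_rewards[arm] / sim_count[arm]

print(sim_count)
print(sim_Q)
```

Printing the pull counts alongside the Q values shows how UCB concentrates most pulls on the better arm while still sampling the other arm occasionally.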
