## Upper Confidence Bound (UCB)

- Reinforcement learning, including concepts such as agent, environment, policy, action, state, and reward.
- The multi-armed bandit problem
- Exploration and exploitation
- When deciding which action to take next (here, which ad to show), an agent faces a trade-off between exploration and exploitation. Exploration means choosing actions that the agent has not tried yet or has tried only a few times so far. Exploitation means choosing the next action based on its experience so far, that is, the action with the best observed rewards.
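- Real-time process
- Minimum number of rounds
- Deterministic: given the same history of rewards, it always selects the same ad
- Requires an update at every round: the selection counts and reward sums must be refreshed before the next choice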
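
The selection rule used by the implementation further down in this notebook can be written compactly as follows. This is a sketch read off from that code, not a general statement about every UCB variant; the symbols $N_i(n)$, $R_i(n)$, $\bar{r}_i(n)$ and $\Delta_i(n)$ are notation introduced here for illustration.

```latex
\bar{r}_i(n) = \frac{R_i(n)}{N_i(n)}, \qquad
\Delta_i(n) = \sqrt{\frac{3}{2}\,\frac{\ln n}{N_i(n)}}, \qquad
\mathrm{UCB}_i(n) = \bar{r}_i(n) + \Delta_i(n)
```

Here $n$ is the 1-based round number (the code writes `n + 1` because its loop counter starts at 0), $N_i(n)$ is the number of times ad $i$ has been selected before round $n$, and $R_i(n)$ is the sum of its rewards so far. At each round the ad with the largest $\mathrm{UCB}_i(n)$ is selected, and an ad that has never been selected is treated as having an infinite bound.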


### Importing the libraries


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


### Importing the dataset


In [None]:
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')


### Implementing UCB


In [None]:
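import math

N = 10000  # number of rounds (one dataset row per round)
d = 10     # number of ads (one dataset column per ad)
ads_selected = []                 # ad chosen at each round
numbers_of_selections = [0] * d   # times each ad has been selected so far
sums_of_rewards = [0] * d         # cumulative reward collected by each ad
total_reward = 0

for n in range(0, N):
    ad = 0
    max_upper_bound = 0
    for i in range(0, d):
        if numbers_of_selections[i] > 0:
            # Average reward plus confidence interval for ad i
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            delta_i = math.sqrt(3/2 * math.log(n + 1) /
                                numbers_of_selections[i])
            upper_bound = average_reward + delta_i
        else:
            # An ad that was never selected gets an infinite bound,
            # so every ad is tried once before the bounds are compared
            upper_bound = math.inf
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    # Record the choice and update the statistics of the selected ad
    ads_selected.append(ad)
    numbers_of_selections[ad] = numbers_of_selections[ad] + 1
    reward = dataset.values[n, ad]
    sums_of_rewards[ad] = sums_of_rewards[ad] + reward
    total_reward = total_reward + reward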
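
The loop above depends on `Ads_CTR_Optimisation.csv` being available. As a minimal sketch for trying the same selection rule without that file, the version below wraps the logic in a function and runs it on simulated 0/1 rewards. The function name `run_ucb`, the click probabilities, and the simulated matrix are all assumptions made for illustration and are not part of the original notebook or dataset.

```python
import math
import numpy as np


def run_ucb(rewards):
    """Run the UCB rule over a (rounds x arms) reward matrix.

    Returns the list of selected arms, the per-arm selection counts,
    and the total reward collected.
    """
    n_rounds, n_arms = rewards.shape
    counts = [0] * n_arms      # times each arm has been selected
    sums = [0.0] * n_arms      # cumulative reward per arm
    selections = []
    total = 0.0
    for n in range(n_rounds):
        best_arm, best_bound = 0, 0.0
        for i in range(n_arms):
            if counts[i] > 0:
                average = sums[i] / counts[i]
                delta = math.sqrt(1.5 * math.log(n + 1) / counts[i])
                bound = average + delta
            else:
                bound = math.inf  # force one initial try of every arm
            if bound > best_bound:
                best_arm, best_bound = i, bound
        selections.append(best_arm)
        counts[best_arm] += 1
        reward = rewards[n, best_arm]
        sums[best_arm] += reward
        total += reward
    return selections, counts, total


# Hypothetical click probabilities for 5 simulated ads (not from the real dataset).
rng = np.random.default_rng(0)
click_probs = np.array([0.05, 0.10, 0.30, 0.08, 0.12])
simulated = (rng.random((5000, click_probs.size)) < click_probs).astype(int)

selections, counts, total = run_ucb(simulated)
print('counts:', counts)
print('total reward:', total)
```

With these assumed probabilities the third ad (index 2) has the highest click rate, so its count should dominate once every ad has been tried a few times.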


In [None]:
print('ads_selected:', ads_selected)
print('numbers_of_selections:', numbers_of_selections)
print('sums_of_rewards:', sums_of_rewards)
print('total_reward:', total_reward)


### Visualizing the results


In [None]:
plt.hist(ads_selected)
plt.title('Histogram of ads selections')
plt.xlabel('Ads')
plt.ylabel('Number of times each ad was selected')
plt.show()
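
The histogram can also be summarised numerically, which is handy when the plot is hard to read at a glance. The sketch below uses `np.bincount` on a short, made-up selection log; `example_selected` and `n_ads` are hypothetical stand-ins for `ads_selected` and `d` from the cells above.

```python
import numpy as np

# Hypothetical selection log for 3 ads; in the notebook this would be `ads_selected`.
example_selected = [0, 2, 2, 1, 2, 2, 0, 2]
n_ads = 3

# Count how often each ad index occurs; minlength keeps ads that were never selected.
counts = np.bincount(example_selected, minlength=n_ads)
shares = counts / counts.sum()
best_ad = int(np.argmax(counts))

for ad_index, (count, share) in enumerate(zip(counts, shares)):
    print(f'ad {ad_index}: selected {count} times ({share:.1%})')
print('most selected ad:', best_ad)
```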
