# Finding the best advertisement banner using bandits

# Reference

Deep Reinforcement Learning with Python, by Sudharsan Ravichandiran

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns

%matplotlib inline
plt.style.use('ggplot')

## Creating a dataset

Now, let's create a dataset. It has five columns, one per advertisement banner, and
1,000,000 rows. Each value is either 0 or 1, indicating whether the advertisement banner
was clicked (1) or not clicked (0) by the user:

In [44]:
df = pd.DataFrame()
for i in range(5):
    df['Banner_type_'+str(i)] = np.random.randint(0,2,1000000)

Let's look at the first few rows of our dataset:

In [45]:
df.head()

|   | Banner_type_0 | Banner_type_1 | Banner_type_2 | Banner_type_3 | Banner_type_4 |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 1 |


As we can observe, we have five advertisement banners (0 to 4), and each row holds
values of 0 or 1 indicating whether the banner was clicked (1) or not clicked (0).
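
Because each column is drawn with `np.random.randint(0, 2, ...)`, every banner should have an empirical click rate close to 0.5. A minimal sketch of that sanity check follows; the seed, the smaller row count, and the names `demo_df` and `click_rates` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Assumption: fixed seed and only 10,000 rows, to keep the check fast and repeatable
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {f"Banner_type_{i}": rng.integers(0, 2, size=10_000) for i in range(5)}
)

# The mean of a 0/1 column is the fraction of rows in which the banner was clicked
click_rates = demo_df.mean()
print(click_rates)
```

If the rates are all near 0.5, no banner is truly better than the others, which is worth keeping in mind when reading the final result.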

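In [46]:
# number of rounds; one dataset row is used per round
num_iterations = 1000000

In [47]:
# number of banners (arms)
num_banner = 5

In [48]:
# number of times each banner has been selected
count = np.zeros(num_banner)

In [49]:
# total reward (clicks) collected by each banner
sum_rewards = np.zeros(num_banner)

In [50]:
# estimated average reward of each banner
Q = np.zeros(num_banner)

In [51]:
# banner chosen in each round
banner_selected = []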

## Define the epsilon-greedy method

Now, let's define the epsilon-greedy method. We draw a random value from a uniform
distribution over [0, 1). If the value is less than epsilon, we select a banner at random
(exploration); otherwise, we select the banner with the highest average reward so far
(exploitation):

In [52]:
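def epsilon_greedy_policy(epsilon):
    # explore: with probability epsilon, pick any banner uniformly at random
    if np.random.uniform(0,1) < epsilon:
        return np.random.choice(num_banner)
    # exploit: otherwise pick the banner with the highest estimated reward
    else:
        return np.argmax(Q)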
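
To see how epsilon controls the balance between exploration and exploitation, the two extreme settings can be compared. This is only a sketch: the function `epsilon_greedy`, which takes the estimates and a random generator as explicit arguments instead of reading the global `Q`, and the values in `q_demo` are hypothetical.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random arm with probability epsilon, else the arm with the highest estimate."""
    if rng.uniform(0, 1) < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

rng = np.random.default_rng(0)
q_demo = np.array([0.1, 0.7, 0.3])  # hypothetical reward estimates for three arms

# epsilon = 0: never explores, so the arm with the highest estimate is always chosen
greedy_picks = [epsilon_greedy(q_demo, 0.0, rng) for _ in range(100)]

# epsilon = 1: always explores, so every arm gets chosen sooner or later
random_picks = [epsilon_greedy(q_demo, 1.0, rng) for _ in range(1000)]

print(set(greedy_picks), sorted(set(random_picks)))
```

Values of epsilon between these extremes, such as 0.5, mix the two behaviours.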

In [53]:
#for each iteration
for i in range(num_iterations):
    
    #select the banner using the epsilon-greedy policy
    banner = epsilon_greedy_policy(0.5)
    
    #get the reward of the banner
    reward = df.values[i, banner]
    
    #increment the counter
    count[banner] += 1
    
    #store the sum of rewards
    sum_rewards[banner]+=reward
    
    #compute the average reward
    Q[banner] = sum_rewards[banner]/count[banner]
    
    #store the banner to the banner selected list
    banner_selected.append(banner)
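
The loop above recomputes each average as `sum_rewards[banner]/count[banner]`. The same average can also be maintained incrementally, without storing the running sum, using the update Q_n = Q_(n-1) + (r_n - Q_(n-1)) / n. The sketch below checks that the two forms agree for a single banner; the reward sequence and the names `q_incremental` and `q_batch` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=1000)  # hypothetical 0/1 rewards for one banner

q_incremental = 0.0
for n, r in enumerate(rewards, start=1):
    # move the estimate toward the new reward by a step of size 1/n
    q_incremental += (r - q_incremental) / n

# average computed the same way as in the loop above: sum of rewards / count
q_batch = rewards.sum() / len(rewards)

print(q_incremental, q_batch)
```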

In [54]:
print( 'The optimal banner is banner {}'.format(np.argmax(Q)))

The optimal banner is banner 3


In [55]:
Q

array([0.50082829, 0.49931147, 0.50070491, 0.50205183, 0.50025567])
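
The estimates above are all close to 0.5, which is what uniformly random clicks produce, so the gap between the "optimal" banner and the rest is likely just sampling noise. As a sketch of how the same procedure behaves when one banner really is better, the loop can be rerun on simulated clicks with unequal click rates. The rates in `true_rates`, the seed, the step count, and all `demo_`-prefixed names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = np.array([0.20, 0.30, 0.50, 0.40, 0.25])  # hypothetical click rates
n_steps, epsilon = 20_000, 0.5
n_arms = len(true_rates)

# simulated click table: entry [t, a] is 1 with probability true_rates[a]
demo_rewards = (rng.random((n_steps, n_arms)) < true_rates).astype(int)

demo_count = np.zeros(n_arms)
demo_sum = np.zeros(n_arms)
demo_Q = np.zeros(n_arms)

for t in range(n_steps):
    # epsilon-greedy selection, as in the notebook
    if rng.uniform(0, 1) < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(demo_Q))
    demo_count[arm] += 1
    demo_sum[arm] += demo_rewards[t, arm]
    demo_Q[arm] = demo_sum[arm] / demo_count[arm]

best = int(np.argmax(demo_Q))
print(best, demo_Q.round(3))
```

With rates this far apart, the estimates should settle near the simulated rates and the selected banner should be the one with the highest rate.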