# Week 6 - AI Lab

Author: Khushee Kapoor

Registration Number: 200968052

The agent's problem is to determine which ad to display to a user at each time step so as to maximize the number of clicks.

- State: The state of the system is each ad's historical click-through rate (CTR).

- Action: The action space consists of the ads that can be displayed; at each time step the agent picks one of them by either exploring or exploiting.

- Reward: The reward is 1 if the user clicks on the displayed ad at a given time step and 0 otherwise, so maximizing the total reward maximizes the CTR.

- Environment: There are 10 different ads to choose from and the goal is to maximize the total number of clicks received by the company. At each time step, the MAB agent must select one of the ads to display to the user. After the ad is displayed, the agent observes whether or not the user clicks on the ad.

- Policy: The MAB agent has to learn a policy that maps the current state of the environment (i.e., which ads have been shown and clicked on in the past) to a decision (i.e., which ad to display next). The policy should take into account the uncertainty in the click probabilities of the ads and the expected reward of choosing each ad. The goal of the MAB agent is to learn the optimal policy that maximizes the total number of clicks received from the users over a given period of time.

The MAB agent's objective is to learn the true CTR of each ad while minimizing the regret, which is the difference between the expected number of clicks obtained by displaying the best ad and the expected number of clicks obtained by displaying the chosen ad at each time step. The MAB agent must balance the exploration of less-known ads to learn their CTRs with the exploitation of the ads that are known to have higher CTRs to maximize the total number of clicks.
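The regret described above can be made concrete with a small sketch. The click probabilities in `true_ctr` and the sequence `chosen` are made-up values for illustration, not taken from the dataset, and `expected_regret` is a hypothetical helper name.

```python
import numpy as np

def expected_regret(true_ctr, chosen):
    """Cumulative expected regret of a sequence of ad choices.

    true_ctr: click probability of each ad (assumed known here, for illustration).
    chosen:   index of the ad displayed at each time step.
    """
    true_ctr = np.asarray(true_ctr, dtype=float)
    chosen = np.asarray(chosen, dtype=int)
    best = true_ctr.max()                 # expected clicks per step of the best ad
    per_step = best - true_ctr[chosen]    # expected loss at each time step
    return np.cumsum(per_step)

true_ctr = [0.10, 0.25, 0.05]   # made-up click probabilities
chosen = [0, 2, 1, 1, 1]        # made-up sequence of displayed ads
regret = expected_regret(true_ctr, chosen)
print(regret)
```

The regret stops growing once the agent keeps choosing the best ad, which is what a good policy should achieve as early as possible.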

To start, we first import the following libraries:

- NumPy: for mathematical computations
- Pandas: for data manipulation

In [None]:
# importing the libraries
import numpy as np
import pandas as pd

Next, we import the Ads Optimization Data using the read_csv() function from the Pandas library.

In [None]:
# reading the data
df = pd.read_csv('Ads_Optimisation.csv')
df.head()

Unnamed: 0,Ad 1,Ad 2,Ad 3,Ad 4,Ad 5,Ad 6,Ad 7,Ad 8,Ad 9,Ad 10
0,1,0,0,0,1,0,0,0,1,0
1,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0


## Epsilon Greedy

For the epsilon greedy approach, we first define the reward function, which returns the reward for a given action at a given timestep.

In [None]:
# defining the reward function for epsilon greedy
def reward(t, A):
  ad = 'Ad {}'.format(A+1) # actions are 0-indexed, ad columns are 1-indexed
  return df.at[t, ad] # 1 if the ad was clicked at timestep t, else 0

Next, we implement the epsilon greedy algorithm with eps = 0.01. We use dictionaries to store the estimated value of each action and the number of times each action has been selected.

In [None]:
# initializing the parameters
timesteps = 2000
k = 10
e = 0.01

# drawing one random number per timestep to decide between exploration and exploitation
probs = np.random.random(size=timesteps)

# initializing total rewards
tot_R = 0

# dictionaries to store the value estimates (Q) and selection counts (N)
Q = {}
N = {}

# initializing value estimates and selection counts to 0
for i in range(k):
  Q[i] = 0
  N[i] = 0

# epsilon greedy algorithm
for t in range(timesteps):
  if probs[t]>e:
    A = max(zip(Q.values(), Q.keys()))[1] # exploitation
  else:
    A = np.random.randint(k) # exploration

  # updating values of parameters
  R = reward(t, A)
  N[A] = N[A] + 1
  Q[A] = Q[A] + (R-Q[A])/N[A]
  tot_R += R # computing total rewards

print('Total Rewards: ', tot_R)

Total Rewards:  383


Next, we repeat the experiment with eps = 0.3, again using dictionaries to store the estimated value of each action and the number of times each action has been selected.

In [None]:
# initializing the parameters
timesteps = 2000
k = 10
e = 0.3

# drawing one random number per timestep to decide between exploration and exploitation
probs = np.random.random(size=timesteps)

# initializing total rewards
tot_R = 0

# dictionaries to store the value estimates (Q) and selection counts (N)
Q = {}
N = {}

# initializing value estimates and selection counts to 0
for i in range(k):
  Q[i] = 0
  N[i] = 0

# epsilon greedy algorithm
for t in range(timesteps):
  if probs[t]>e:
    A = max(zip(Q.values(), Q.keys()))[1] # exploitation
  else:
    A = np.random.randint(k) # exploration

  # updating values of parameters
  R = reward(t, A)
  N[A] = N[A] + 1
  Q[A] = Q[A] + (R-Q[A])/N[A]
  tot_R += R # computing total rewards

print('Total Rewards: ', tot_R)

Total Rewards:  436


As we can see, in these runs the setting with the higher value of epsilon achieved the higher cumulative reward (436 versus 383 clicks). A higher epsilon means more exploration: although exploratory steps give sub-optimal payouts initially, the agent is more likely to discover an arm with a higher reward and exploit it in the long run. Because no random seed is fixed, the exact totals vary from run to run.
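Since the totals above come from single runs, one way to check the comparison is to average over several seeds. The sketch below is an illustration under stated assumptions: it uses a synthetic Bernoulli bandit with made-up click probabilities instead of `Ads_Optimisation.csv`, `run_eps_greedy` is a hypothetical helper, and ties are broken towards the lowest index (the loops above break ties towards the highest index).

```python
import numpy as np

def run_eps_greedy(true_ctr, eps, timesteps, seed):
    """Run epsilon greedy on a synthetic Bernoulli bandit; return total clicks."""
    rng = np.random.default_rng(seed)
    k = len(true_ctr)
    Q = np.zeros(k)              # value estimates
    N = np.zeros(k, dtype=int)   # selection counts
    total = 0
    for _ in range(timesteps):
        if rng.random() > eps:
            A = int(np.argmax(Q))        # exploitation (ties go to the lowest index)
        else:
            A = int(rng.integers(k))     # exploration
        R = int(rng.random() < true_ctr[A])  # simulated click: 1 or 0
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]        # sample-average update
        total += R
    return total

true_ctr = np.array([0.05, 0.10, 0.20, 0.15])  # made-up click probabilities
results = {}
for eps in (0.01, 0.3):
    totals = [run_eps_greedy(true_ctr, eps, 2000, seed) for seed in range(20)]
    results[eps] = totals
    print(eps, np.mean(totals), np.std(totals))
```

Comparing the mean and spread across seeds gives a steadier picture than a single total, though the outcome still depends on the assumed click probabilities.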

The ε-greedy approach estimates the value of an action using the sample-average method, i.e. the mean of the rewards received for that action. This estimate has high variance when an action has been tried only a few times, so convergence to the true values can be slow. With a fixed ε, as in the runs above, the agent explores at the same rate throughout; a common refinement is to explore more at the beginning to reduce uncertainty and to decay ε so that the agent mostly exploits the best estimated action later.
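The update `Q[A] = Q[A] + (R-Q[A])/N[A]` used in the loops above is an incremental way of computing this sample average. The short check below illustrates the equivalence on synthetic 0/1 rewards for a single ad; the data are generated for illustration and are not from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=50)  # synthetic 0/1 clicks for one ad

q, n = 0.0, 0
for r in rewards:
    n += 1
    q += (r - q) / n  # incremental update, same rule as Q[A] in the loops above

batch_mean = rewards.mean()  # ordinary average over all rewards
print(q, batch_mean)
```

The incremental form needs only the current estimate and the count, so the agent does not have to store the full reward history of each ad.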

## Upper Confidence Bound

For the upper confidence bound approach, we first define the reward function, which returns the reward for a given action at a given timestep.

In [None]:
# defining the reward function for upper confidence bound approach
def reward(t, A):
  ad = 'Ad {}'.format(A+1) # actions are 0-indexed, ad columns are 1-indexed
  return df.at[t, ad] # 1 if the ad was clicked at timestep t, else 0

Next, we implement the upper confidence bound algorithm with c=1.5. 

In [None]:
# initializing the parameters of the algorithm
# at t=0, np.log(0) evaluates to -inf; suppress the resulting NumPy warnings
np.seterr(divide='ignore', invalid='ignore')
timesteps = 2000
k = 10
c = 1.5

# initializing total rewards to 0
tot_R = 0

# initializing dictionaries
N = {}
R = {}

for i in range(k):
  N[i] = 1 # to avoid zero divide error
  R[i] = 0

# implementing the ucb algorithm
for t in range(timesteps):
  q = {}
  delta = {}
  ucb = {}

  for i in range(k):
    q[i] = R[i]/N[i] # average reward till timestep t
    delta[i] = np.sqrt(c*np.log(t)/N[i]) # exploration bonus; nan at t=0 because np.log(0) is -inf
    ucb[i] = q[i] + delta[i]
  
  # selecting the action with the highest upper confidence bound
  A = max(zip(ucb.values(), ucb.keys()))[1]
  r = reward(t, A) # looking up the reward once

  # updating values of parameters
  N[A] = N[A] + 1
  R[A] = R[A] + r
  tot_R += r # computing total reward

print('Total Rewards: ', tot_R)

Total Rewards:  338


The upper confidence bound (UCB) approach selects actions using the average reward plus an uncertainty term. The uncertainty term is large for actions that have been selected only a few times and shrinks as an action is selected more often, so exploration is directed towards the actions whose values are least certain. This is intended to give more stable value estimates and faster convergence to the optimal action; in this run, however, UCB collected 338 clicks, fewer than either epsilon greedy run (383 and 436).
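For comparison, a commonly used form of the rule selects the action maximizing `Q + c * sqrt(ln t / N)`, with `c` outside the square root, and plays every arm once before applying the bonus so that no count is zero. The sketch below follows that form, which differs from the cell above (there `c` sits inside the square root and counts start at 1). It runs on a synthetic Bernoulli bandit with made-up click probabilities, and `run_ucb` is a hypothetical helper name.

```python
import numpy as np

def run_ucb(true_ctr, c, timesteps, seed):
    """Run a UCB rule on a synthetic Bernoulli bandit; return (total clicks, counts)."""
    rng = np.random.default_rng(seed)
    k = len(true_ctr)
    Q = np.zeros(k)              # value estimates
    N = np.zeros(k, dtype=int)   # selection counts
    total = 0
    for t in range(1, timesteps + 1):
        if t <= k:
            A = t - 1  # play each arm once so every count is non-zero
        else:
            bonus = c * np.sqrt(np.log(t) / N)  # larger for rarely selected arms
            A = int(np.argmax(Q + bonus))
        R = int(rng.random() < true_ctr[A])  # simulated click: 1 or 0
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]            # sample-average update
        total += R
    return total, N

true_ctr = np.array([0.05, 0.10, 0.20, 0.15])  # made-up click probabilities
total, counts = run_ucb(true_ctr, c=1.5, timesteps=2000, seed=0)
print(total, counts)
```

Starting the loop at `t = 1` and playing each arm once avoids both `log(0)` and division by zero, so no NumPy warnings need to be suppressed in this variant.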