# DSE 3260
## Week - 6
### Reg. No - 200968216
#### Pratinav Seth 

#### Q1. MAB agent formulation:
##### The agent must decide which ad to display to a user at each time step so as to maximize the number of clicks on the webpage.

The problem can be defined as:

 -- There are 10 different ads to choose from, and at each time step, the MAB agent must decide which ad to display to the user. 

 -- The goal is to maximize the total number of clicks obtained from users over a specified time horizon. 

 -- Each ad has an unknown click-through rate (CTR) that represents the probability of a user clicking on that ad. 
 
 -- The MAB agent's objective is to learn the true CTR of each ad while minimizing the regret, which is the difference between the expected number of clicks obtained by displaying the best ad and the expected number of clicks obtained by displaying the chosen ad at each time step. 
 
 -- The MAB agent must balance the exploration of less-known ads to learn their CTRs with the exploitation of the ads that are known to have higher CTRs to maximize the total number of clicks.
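
The regret definition above can be made concrete with a small sketch. The CTR values and the sequence of choices below are made up for illustration (the true CTRs are unknown in the actual problem), and `expected_regret` is a hypothetical helper, not part of the assignment code.

```python
import numpy as np

def expected_regret(true_ctrs, chosen_ads):
    """Expected cumulative regret of a sequence of displayed ads.

    true_ctrs  : per-ad click probabilities (hypothetical; unknown in practice)
    chosen_ads : index of the ad displayed at each time step
    """
    true_ctrs = np.asarray(true_ctrs, dtype=float)
    best_ctr = true_ctrs.max()
    # Per-step regret is the gap between the best CTR and the chosen ad's CTR.
    per_step = best_ctr - true_ctrs[np.asarray(chosen_ads)]
    return np.cumsum(per_step)

# Illustrative values only: three ads with made-up CTRs.
ctrs = [0.05, 0.20, 0.10]
choices = [0, 1, 2, 1, 1]
regret = expected_regret(ctrs, choices)
print(regret)
```

Regret grows only on the steps where a suboptimal ad is shown, so an agent that settles on the best ad quickly accumulates little regret.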

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random

ads_clicks = pd.read_csv('/content/Ads_Optimisation.csv')

In [12]:
num_ads = len(ads_clicks.columns)
print("Number of ads - " + str(num_ads))

Number of ads - 10


#### Q2. Total rewards after 2000 time steps using the ε-greedy algorithm for ε = 0.01 and ε = 0.3

In [14]:
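def epsilon_greedy(epsilon, rewards):
    if random.uniform(0, 1) < epsilon:
        # Explore: pick an ad uniformly at random.
        ad = random.randint(0, num_ads - 1)
    else:
        # Exploit: pick the ad with the largest entry in `rewards`.
        # Note: `rewards` holds cumulative clicks per ad, not per-ad averages.
        ad = np.argmax(rewards)
    return ad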

In [15]:
rewards = [0] * num_ads
total_rewards_01 = []
total_rewards_03 = []

Iterating the ε-greedy algorithm for 2000 time steps using ε=0.01 and ε=0.3

In [16]:
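for t in range(2000):
    # Note: both ε settings read and update the same `rewards` list, so the
    # running totals below mix clicks from the ε=0.01 and ε=0.3 selections.
    ad_01 = epsilon_greedy(0.01, rewards)
    ad_03 = epsilon_greedy(0.3, rewards)
    reward = ads_clicks.iloc[t, ad_01]  # click (0/1) for the ε=0.01 choice
    rewards[ad_01] = rewards[ad_01] + reward
    total_rewards_01.append(sum(rewards))
    reward = ads_clicks.iloc[t, ad_03]  # click (0/1) for the ε=0.3 choice
    rewards[ad_03] = rewards[ad_03] + reward
    total_rewards_03.append(sum(rewards))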

In [18]:
print("Total rewards for ε - 0.01: ", total_rewards_01[-1])
print("Total rewards for ε - 0.3: ", total_rewards_03[-1])

Total rewards for ε - 0.01:  650
Total rewards for ε - 0.3:  650
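
In the loop above, both ε values share one `rewards` list, which is why the two totals coincide. A minimal sketch of a variant that keeps separate state per run and uses sample-average estimates is given below. The function `run_epsilon_greedy`, the seed, and the synthetic click matrix are assumptions for illustration; the numbers it prints are not results for the real dataset.

```python
import numpy as np

def run_epsilon_greedy(clicks, epsilon, steps, seed=0):
    """Run ε-greedy with sample-average estimates on a 0/1 click matrix.

    clicks : array of shape (rows, num_ads); row t holds the click outcome
             of every ad at time step t (same layout as the CSV above)
    """
    rng = np.random.default_rng(seed)
    num_ads = clicks.shape[1]
    q = np.zeros(num_ads)   # estimated CTR per ad (sample average)
    n = np.zeros(num_ads)   # number of times each ad was shown
    total = 0
    for t in range(steps):
        if rng.random() < epsilon:
            ad = int(rng.integers(num_ads))   # explore
        else:
            ad = int(np.argmax(q))            # exploit
        reward = clicks[t, ad]
        n[ad] += 1
        q[ad] += (reward - q[ad]) / n[ad]     # incremental mean update
        total += reward
    return total, q, n

# Synthetic stand-in for the dataset (hypothetical CTRs, not the real CSV).
data_rng = np.random.default_rng(42)
fake_ctrs = np.linspace(0.05, 0.30, 10)
fake_clicks = (data_rng.random((2000, 10)) < fake_ctrs).astype(int)

for eps in (0.01, 0.3):
    total, q, n = run_epsilon_greedy(fake_clicks, eps, steps=2000)
    print(f"eps={eps}: total clicks = {total}")
```

With the notebook's data, `ads_clicks.to_numpy()` could be passed as `clicks`; each call then starts from fresh estimates, so the totals for ε = 0.01 and ε = 0.3 are computed independently.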


#### Q3. Compute the total rewards after 2000 time steps using the Upper-Confidence-Bound action method for c = 1.5

In [19]:
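def ucb(rewards, n, t, c=1.5):
    # Show every ad once, in index order, before applying the UCB formula
    # (avoids dividing by zero for ads that have not been shown yet).
    untried = np.flatnonzero(n == 0)
    if untried.size > 0:
        return untried[0]
    average_rewards = rewards / n  # sample-average estimate per ad
    ucb_values = average_rewards + c * np.sqrt(np.log(t + 1) / n)
    return np.argmax(ucb_values)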

In [20]:
rewards = np.zeros(num_ads)
n = np.zeros(num_ads)
total_rewards = []

In [22]:
for t in range(2000):
    ad = ucb(rewards, n, t, c=1.5)
    reward = ads_clicks.iloc[t, ad]
    rewards[ad] = rewards[ad] + reward
    n[ad] = n[ad] + 1
    total_rewards.append(sum(rewards))

print("Total rewards for c=1.5: ", total_rewards[-1])

Total rewards for c=1.5:  763.0


#### How the estimated action value compares to the optimal action in the ε-greedy strategy:

The ε-greedy strategy estimates each action's value with the sample-average method: the estimate for an action is the mean of the rewards received when that action was chosen. With few samples, this estimate has high variance, so it may be slow to approach the true action value. ε-greedy handles this by exploring with a fixed probability ε at every step, choosing an action uniformly at random regardless of how well each action is already known, and otherwise exploiting the action with the highest estimate. A small ε (0.01) therefore risks locking onto a suboptimal ad early, while a large ε (0.3) keeps spending about 30% of the steps on random ads even after the best one has been identified.
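
The variance of the sample-average estimate can be illustrated with a short simulation. This is a sketch under assumed values: the click probability of 0.2 and the helper `estimate_spread` are hypothetical and unrelated to the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = 0.2  # hypothetical click probability

def estimate_spread(num_samples, repeats=2000):
    """Std. dev. of the sample-average CTR estimate across repeated experiments."""
    clicks = rng.random((repeats, num_samples)) < true_ctr
    return clicks.mean(axis=1).std()

for k in (5, 50, 500):
    # Standard error of a Bernoulli mean: sqrt(p * (1 - p) / k).
    theory = np.sqrt(true_ctr * (1 - true_ctr) / k)
    print(f"n={k:3d}: simulated spread {estimate_spread(k):.3f}, theory {theory:.3f}")
```

The spread shrinks roughly as one over the square root of the number of samples, which is why estimates for rarely shown ads are unreliable.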



#### How the estimated action value compares to the optimal action in the UCB approach:

The Upper-Confidence-Bound (UCB) strategy, on the other hand, ranks actions by an upper confidence bound: the sample-average reward plus an uncertainty term that depends on how often the action has been chosen.

The uncertainty term is large for actions that have been chosen only a few times and shrinks as an action is chosen more often, so rarely tried actions receive higher priority until their estimates become reliable. Exploration is thus directed toward the actions whose values are still uncertain rather than spread uniformly at random, which typically lets UCB settle on the optimal action with fewer wasted steps than ε-greedy.
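
The effect of the uncertainty term can be seen in a small worked example. The averages and counts below are made up, `ucb_scores` is a hypothetical helper, and `t` here is the number of steps taken so far (the notebook code uses `np.log(t + 1)` with a zero-based `t`).

```python
import numpy as np

def ucb_scores(avg_reward, counts, t, c=1.5):
    """Return (score, bonus) per action, with score = average + c * sqrt(ln(t) / n)."""
    counts = np.asarray(counts, dtype=float)
    bonus = c * np.sqrt(np.log(t) / counts)
    return np.asarray(avg_reward, dtype=float) + bonus, bonus

# Made-up numbers: ad 0 looks better on average but has been shown often;
# ad 1 looks worse but has barely been tried.
scores, bonus = ucb_scores(avg_reward=[0.20, 0.15], counts=[400, 10], t=410)
print("bonus :", bonus)
print("scores:", scores)
print("chosen:", int(np.argmax(scores)))
```

Ad 1 is selected despite its lower average because its bonus is much larger; as its count grows, the bonus shrinks and the choice falls back to the ad with the better average.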


