# <span style="color:#4CAF50; font-size:30px;">📘 What We Are Doing in This Notebook</span>

---

## <span style="color:#2196F3; font-weight:bold;">1. Simulating the 10-Armed Bandit Testbed:</span>
We are implementing a **bandit problem** with **10 arms**, where:
- Each arm will have a true value **$q^*(a)$** sampled from a Gaussian distribution with **mean 0** and **variance 1**.
- Rewards for selecting an arm **a** will be sampled from a Gaussian distribution with **mean $q^*(a)$** and **variance 1**.
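
The two sampling steps above can be sketched in a few lines. This is only an illustration under assumptions: it uses a local `np.random.default_rng` generator and the variable names `q_star`, `arm`, and `samples`, none of which are required by the rest of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, an assumption for this sketch

k = 10
q_star = rng.normal(0.0, 1.0, size=k)   # true action values, q*(a) ~ N(0, 1)
arm = int(np.argmax(q_star))            # the optimal arm of this particular problem

# Draw many rewards for one arm: R ~ N(q*(arm), 1)
samples = rng.normal(q_star[arm], 1.0, size=100_000)

print(f"q*({arm})     = {q_star[arm]:.3f}")
print(f"sample mean = {samples.mean():.3f}")  # should be close to q*(arm)
print(f"sample var  = {samples.var():.3f}")   # should be close to 1
```

With enough draws, the sample mean of an arm's rewards approaches its true value, which is exactly what the sample-average method relies on.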

---

## <span style="color:#FFC107; font-weight:bold;">2. Setting Up the Testbed:</span>
### Steps:
1. **Generate 2000 independent bandit problems.**  
2. **Use the sample-average method** (with incremental updates) to learn action-value estimates.  
3. Simulate learning over **1000 timesteps** and analyze the performance of the following algorithms:
   - **Greedy**  
   - **ϵ-greedy** (ϵ = 0.1 and ϵ = 0.01)  

---

## <span style="color:#FF5722; font-weight:bold;">3. Running the UCB Algorithm:</span>
### Steps:
1. **Run the UCB algorithm** on the **ten-armed testbed**.  
2. Compare its performance with **ϵ-greedy (ϵ = 0.1)**.  
3. Plot and reproduce the following **learning curves**, averaged over **2000 bandits** for **1000 timesteps**:
   - **Average Reward vs. Time**
   - **% Optimal Actions vs. Time**

---

## <span style="color:#9C27B0; font-weight:bold;">4. Learning Curves:</span>
For each algorithm, we will generate the following plots:
- **Average Reward vs. Time:**  
  Tracks the **average reward** obtained over time.  
- **% Optimal Actions vs. Time:**  
  Tracks the **percentage of times the optimal action** is chosen over time.

---

# <span style="color:brown; font-weight:bold;">📌Imports</span>

In [None]:
# import necessary libraries here
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(63)

# <span style="color:#3F51B5; font-weight:bold;">🎯 Objective:</span>
Plotting the following results:
1. **Average Reward vs. Number of Steps**  
2. **% Optimal Action vs. Number of Steps**

---

# <span style="color:#FF6347; font-size:30px;">🔨 Building the Bandit Object</span>
---

We start by constructing the **Bandit Object**, which represents one bandit problem: it holds the true value of each arm and returns a noisy reward when an arm is pulled.

---

In [None]:
# Defining bandit object
class Bandit(object):
    def __init__(self, arms=10):
        self.arms = arms
        self.q_star = np.random.normal(0, 1, self.arms)  # true action values

    def get_reward(self, action):
        return np.random.normal(self.q_star[action], 1)  # rewards sampled from N(q*(a), 1)

# <span style="color:#4682B4; font-size:30px;">🤖 Defining the Multiarm Bandit for Greedy Approach</span>

This implementation simulates the **multi-arm bandit problem** using the **epsilon-greedy (ϵ-greedy)** method. Here's a step-by-step explanation:

---

### **🔍 Key Parameters**
1. **`bandit`**: The multi-arm bandit environment containing:
   - `arms`: Number of arms.
   - `q_star`: True action values $q^*(a)$, i.e. the mean reward of each arm.
   - `get_reward(action)`: Method to simulate the reward for a chosen arm.
   
2. **`timesteps`**: The total number of plays or time steps.

3. **`epsilon (ϵ)`**: The probability of choosing a random (exploratory) action. 
   - With probability $1 - \epsilon$, we select the arm with the **highest estimated value** (exploitation).
   - With probability $\epsilon$, we select a **random arm** (exploration).

---

### **📐 Algorithm**
1. **Initialization**:
   - `q_estimates`: An array of zeros to store the estimated reward for each arm.
   - `action_counts`: An array of zeros to track the number of times each arm is selected.
   - `rewards`: An array to record rewards received at each time step.
   - `optimal_actions`: An array to record whether the **optimal arm** was selected at each time step.

2. **Identifying the Optimal Arm**:
   - $$ \text{optimal\_action} = \arg\max_a q^*(a) $$
     This finds the arm with the highest true value (expected reward).

3. **Game Simulation (for `t` in `timesteps`)**:
   - **Exploration vs. Exploitation**:
     - Generate a random number $p \sim U(0, 1)$.
     - If $p < \epsilon$: **Exploration** → Choose a random arm.
     - Otherwise: **Exploitation** → Choose the arm with the highest estimated reward:
       $$ \text{action} = \arg\max(q_{\text{estimates}}) $$

   - **Receive Reward**:
     - Simulate the reward for the selected arm using:
       $$ \text{reward} = \text{bandit.get\_reward(action)} $$

   - **Track Optimal Action**:
     - If the chosen action matches the optimal arm:
       $$ \text{optimal\_actions}[t] = 1 $$
     - Otherwise, set it to 0.

   - **Update Counts and Estimates**:
     - Increment the count for the chosen arm:
       $$ \text{action\_counts}[a] = \text{action\_counts}[a] + 1 $$
     - Update the reward estimate using the incremental formula:
       $$ q_{\text{estimates}}[a] = q_{\text{estimates}}[a] + \frac{\text{reward} - q_{\text{estimates}}[a]}{\text{action\_counts}[a]} $$

4. **Return Results**:
   - **`rewards`**: A time series of rewards at each timestep.
   - **`optimal_actions`**: A time series indicating whether the optimal action was selected.

---

### **🧮 Mathematical Formulation**

#### **Incremental Update Formula**:
The estimated reward for arm $a$ is updated as:
$$ q_{\text{estimates}}[a] = q_{\text{estimates}}[a] + \frac{\text{reward} - q_{\text{estimates}}[a]}{\text{action\_counts}[a]} $$
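
As a quick check of the formula above, the incremental update should reproduce the ordinary sample mean without storing past rewards. A minimal sketch, assuming a hypothetical reward stream for a single arm (the names `rewards_one_arm`, `q`, and `n` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
rewards_one_arm = rng.normal(0.5, 1.0, size=1000)  # hypothetical rewards for one arm

q, n = 0.0, 0
for r in rewards_one_arm:
    n += 1
    q += (r - q) / n   # incremental sample-average update

print(q, rewards_one_arm.mean())  # the two values agree up to rounding error
```

The update needs only the current estimate and the count, so memory use stays constant no matter how many times an arm is played.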

#### **Action Selection**:
1. Exploitation:
   $$ \text{action} = \arg\max(q_{\text{estimates}}) $$
2. Exploration:
   $$ \text{action} = \text{randomly chosen arm} $$
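
One detail worth noting: `np.argmax` returns the first index when several estimates are tied, so with all-zero initial estimates the first greedy pick is always arm 0. The sketch below shows one possible variation that breaks ties at random. The helper `select_action` is hypothetical and is not the function used elsewhere in this notebook.

```python
import numpy as np

def select_action(q_estimates, epsilon, rng):
    """Hypothetical helper: epsilon-greedy choice with random tie-breaking."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))            # explore: uniform random arm
    best = np.flatnonzero(q_estimates == q_estimates.max())   # all arms tied for the maximum
    return int(rng.choice(best))                              # exploit: random arm among the best

rng = np.random.default_rng(2)
q = np.zeros(10)  # all estimates tied, as at the start of a game
picks = [select_action(q, epsilon=0.0, rng=rng) for _ in range(1000)]
print(sorted(set(picks)))  # arms that were chosen at least once
```

Whether tie-breaking matters in practice depends on the setup; here ties mostly occur in the first few steps, before the estimates diverge.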

---

In [None]:
# Defining Multiarm bandit game using epsilon greedy approach
def BanditGame(bandit, timesteps, epsilon=0):
    arms = bandit.arms
    q_estimates = np.zeros(arms)
    action_counts = np.zeros(arms)
    rewards = np.zeros(timesteps)
    optimal_actions = np.zeros(timesteps)

    optimal_action = np.argmax(bandit.q_star)

    for t in range(timesteps):
        p = np.random.rand()
        if p < epsilon:
            action = np.random.choice(arms)
        else:
            action = np.argmax(q_estimates)

        reward = bandit.get_reward(action)
        rewards[t] = reward

        if action == optimal_action:
            optimal_actions[t] = 1

        action_counts[action] += 1
        q_estimates[action] += (reward - q_estimates[action]) / action_counts[action]

    return rewards, optimal_actions


## <span style="color:#FFD700; font-size:30px;">⚙️ Simulating the Game with Parameters</span>
---

This section explores the simulation with different **parameters**, such as exploration strategies.

---

### **🎯 Insights**

- **Balancing Exploration and Exploitation**:
  - The parameter $ϵ$ controls the trade-off:
    - High $ϵ$: Encourages exploration.
    - Low $ϵ$: Favors exploitation.

- **Optimal Action Tracking**:
  - Tracks how often the best action (with the highest $q^*(a)$) is selected.

This approach helps evaluate how well the ϵ-greedy algorithm performs over multiple time steps and compares exploration vs. exploitation strategies.


In [None]:
# Simulating game with different parameters
def SimulateBanditGame(n_games, arms, timesteps, epsilons):
    avg_rewards = {epsilon: np.zeros(timesteps) for epsilon in epsilons}
    avg_optimal_actions = {epsilon: np.zeros(timesteps) for epsilon in epsilons}

    for _ in range(n_games):
        bandit = Bandit(arms)
        for epsilon in epsilons:
            rewards, optimal_actions = BanditGame(bandit, timesteps, epsilon)
            avg_rewards[epsilon] += rewards
            avg_optimal_actions[epsilon] += optimal_actions

    for epsilon in epsilons:
        avg_rewards[epsilon] /= n_games
        avg_optimal_actions[epsilon] /= n_games

    return avg_rewards, avg_optimal_actions


## <span style="color:#8A2BE2; font-size:30px;">🎮 Playing Game for Greedy Approach and 📊 Analyzing Output</span>
---

Run the simulation with the **Greedy algorithm** (ϵ = 0) and the two **ϵ-greedy** variants, and analyze results like average rewards and optimal actions.

---


In [None]:
# Playing the Game
def PlayBanditGame(n_games, arms, timesteps, epsilons):
    avg_rewards, avg_optimal_actions = SimulateBanditGame(n_games, arms, timesteps, epsilons)
    steps = np.arange(timesteps)
    # Adjusting figure size for better clarity
    plt.figure(figsize=(10, 10))
    
    # Defining colors and line styles for a visually appealing distinction
    colors = {0: '#2E8B57', 0.01: '#FF4500', 0.1: '#1E90FF'}
    linestyles = {0: 'dotted', 0.01: 'dashed', 0.1: 'solid'}
    
    # Plotting Average Rewards
    plt.subplot(2, 1, 1)
    for epsilon, rewards in avg_rewards.items():
        label = 'Greedy' if epsilon == 0 else f'ε = {epsilon}'
        plt.plot(
            steps,
            rewards,
            label=label,
            color=colors[epsilon],
            linestyle=linestyles[epsilon],
            linewidth=2
        )
    plt.xlabel('Steps', fontsize=14, fontweight='bold', color='#333')
    plt.ylabel('Average Reward', fontsize=14, fontweight='bold', color='#333')
    plt.legend(fontsize=12)
    plt.title('Average Rewards vs. Steps', fontsize=16, fontweight='bold', color='#111')
    plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)
    
    # Plotting % Optimal Actions
    plt.subplot(2, 1, 2)
    for epsilon, optimal_action in avg_optimal_actions.items():
        label = 'Greedy' if epsilon == 0 else f'ε = {epsilon}'
        plt.plot(
            steps,
            optimal_action * 100,
            label=label,
            color=colors[epsilon],
            linestyle=linestyles[epsilon],
            linewidth=2
        )
    plt.xlabel('Steps', fontsize=14, fontweight='bold', color='#333')
    plt.ylabel('% Optimal Actions', fontsize=14, fontweight='bold', color='#333')
    plt.legend(fontsize=12)
    plt.title('% Optimal Actions vs. Steps', fontsize=16, fontweight='bold', color='#111')
    plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)
    
    # Tight layout and display
    plt.tight_layout(pad=3)
    plt.show()

In [None]:
#GamePlay function call
epsilons = [0,0.01,0.1]
n_games = 2000
arms = 10
timesteps=1000

PlayBanditGame(n_games,arms,timesteps,epsilons)

# <span style="color:#FF6347; font-size:30px;">📈 Analysis of Performance for Greedy and ϵ-Greedy Methods</span>
---

### **🟢 Upper Graph: Expected Reward vs. Experience**

- The greedy method starts strong, improving faster initially compared to other methods.
- **Key Observation**: Despite the early advantage, the greedy method plateaus at a **lower performance level** over time.
- **Why?** The greedy method struggles in the long term because it often gets stuck with **suboptimal actions**, failing to explore other possibilities.

---

### **🔵 Lower Graph: % Optimal Actions vs. Experience**

- The greedy method settles on the optimal action in only about **one-third** of the tasks.
- In the remaining two-thirds, it prematurely dismisses the optimal action because of **unlucky early reward samples** and never revisits it.

---

### **🌟 Advantages of ϵ-Greedy Methods**

1. The **ϵ = 0.1** method:
   - Explores more effectively, finding the optimal action earlier.
   - **Limitation**: Because it keeps exploring 10% of the time, it can select the optimal action at most about **91% of the time** ($1 - \epsilon + \epsilon/10$).

2. The **ϵ = 0.01** method:
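   - Improves more slowly, but would eventually perform **better in the long run** than ϵ = 0.1 on both measures, since it wastes fewer plays on exploration; 1000 steps may not be enough to see it overtake.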
   - Strikes a balance between exploration and exploitation.
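
The ceiling on % optimal actions follows from a short calculation: with probability $1 - \epsilon$ the agent plays its greedy arm, and with probability $\epsilon$ it picks uniformly among the $k$ arms. Even if the greedy arm is the optimal one, the optimal action is chosen with probability at most $1 - \epsilon + \epsilon/k$. A minimal sketch (the helper name `max_optimal_fraction` is illustrative, not part of the notebook):

```python
def max_optimal_fraction(epsilon, k=10):
    """Upper bound on P(optimal action) for epsilon-greedy with k arms."""
    return (1 - epsilon) + epsilon / k

for eps in (0.1, 0.01):
    print(f"epsilon={eps}: at most {max_optimal_fraction(eps):.1%}")
```

For 10 arms this gives a ceiling of 91% for ϵ = 0.1 and 99.1% for ϵ = 0.01, which is why the smaller ϵ has the higher long-run potential.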

---

# <span style="color:#20B2AA; font-size:30px;">🔨 Creating Multiarm Bandit for UCB</span>

This implementation simulates the **multi-arm bandit problem** using the **Upper Confidence Bound (UCB)** algorithm. It dynamically balances exploration and exploitation by incorporating confidence intervals in action selection.

---

### **🔍 Key Parameters**
1. **`bandit`**: The multi-arm bandit environment containing:
   - `arms`: Number of arms.
   - `q_star`: True action values $q^*(a)$, i.e. the mean reward of each arm.
   - `get_reward(action)`: Method to simulate the reward for a chosen arm.
   
2. **`timesteps`**: The total number of plays or time steps.

3. **`c`**: A parameter controlling the degree of exploration. 
   - Larger $c$ values encourage exploration by widening the confidence bounds.
   - Smaller $c$ values focus on exploitation.

---

### **📐 Algorithm**
1. **Initialization**:
   - `Q_estimates`: An array of zeros to store the estimated reward for each arm.
   - `action_counts`: An array of zeros to track the number of times each arm is selected.
   - `rewards`: An array to record rewards received at each time step.

2. **Game Simulation (for `t` in `timesteps`)**:
   - **Arm Selection**:
     - In the first `arms` steps, each arm is played once to gather initial information:
       $$ \text{action} = t \quad \text{for } t < \text{arms} $$
     - Afterward, the UCB formula is used to select the arm with the highest **Upper Confidence Bound**:
       $$ \text{action} = \arg\max_a \left[ Q_{\text{estimates}}[a] + c \cdot \sqrt{\frac{\log(t+1)}{\text{action\_counts}[a]}} \right] $$


   - **Receive Reward**:
     - Simulate the reward for the selected arm:
       $$ \text{reward} = \text{bandit.get\_reward(action)} $$

   - **Update Counts and Estimates**:
     - Increment the count for the chosen arm:
       $$ \text{action\_counts}[a] = \text{action\_counts}[a] + 1 $$
     - Update the reward estimate using the incremental formula:
       $$ Q_{\text{estimates}}[a] = Q_{\text{estimates}}[a] + \frac{\text{reward} - Q_{\text{estimates}}[a]}{\text{action\_counts}[a]} $$

3. **Return Results**:
   - **`rewards`**: A time series of rewards at each timestep.

---

### **🧮 Mathematical Formulation**

#### **UCB Formula**:
The **Upper Confidence Bound** is calculated as:
$$ Q_{\text{estimates}}[a] + c \cdot \sqrt{\frac{\log(t+1)}{\text{action\_counts}[a]}} $$
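
To see how the bonus term can override the raw estimates, here is a small worked example with made-up numbers. The estimates `Q`, the counts `N`, and the step index `t` are hypothetical values chosen for illustration only.

```python
import numpy as np

Q = np.array([0.8, 0.5, 0.6])   # hypothetical value estimates
N = np.array([50, 2, 10])       # hypothetical play counts (62 plays so far)
t, c = 62, 2                    # 0-based step index as in the loop; c as in the default

bonus = c * np.sqrt(np.log(t + 1) / N)
ucb = Q + bonus

for a in range(len(Q)):
    print(f"arm {a}: Q={Q[a]:.2f}  bonus={bonus[a]:.2f}  UCB={ucb[a]:.2f}")
print("greedy pick:", int(np.argmax(Q)), " UCB pick:", int(np.argmax(ucb)))
```

Here the greedy choice is arm 0, but arm 1 has been played only twice, so its bound is the largest and UCB selects it.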

#### **Incremental Update**:
The estimated reward for arm $a$ is updated as:
$$ Q_{\text{estimates}}[a] = Q_{\text{estimates}}[a] + \frac{\text{reward} - Q_{\text{estimates}}[a]}{\text{action\_counts}[a]} $$

---

### **🎯 Insights**

- **Exploration vs. Exploitation**:
  - The UCB formula adds a **confidence interval** term to the reward estimate.
  - Arms with lower play counts are assigned wider confidence intervals, encouraging exploration.

- **Parameter `c`**:
  - A higher $c$ value encourages more exploration by widening the confidence bounds.
  - A lower $c$ value focuses on exploitation by narrowing the bounds.

- **Performance**:
  - UCB often performs better than $\epsilon$-greedy on this testbed because its exploration bonus shrinks for arms that have been played many times, so exploration is directed at uncertain arms instead of being spread uniformly at random.

This approach helps evaluate how well the UCB algorithm performs over multiple time steps by efficiently balancing exploration and exploitation.


In [None]:
# Defining the UCB bandit game
def UCBBanditGame(bandit, timesteps, c=2):
    arms = bandit.arms
    Q_estimates = np.zeros(arms)
    action_counts = np.zeros(arms)
    rewards = np.zeros(timesteps)

    for t in range(timesteps):
        if t < arms:
            # Play each arm once so every count is nonzero before the bonus is used
            action = t
        else:
            # Estimate plus exploration bonus; the bonus shrinks as an arm is played more
            action = np.argmax(Q_estimates + c * np.sqrt(np.log(t + 1) / action_counts))

        reward = bandit.get_reward(action)
        rewards[t] = reward

        # Incremental sample-average update
        action_counts[action] += 1
        Q_estimates[action] += (reward - Q_estimates[action]) / action_counts[action]

    return rewards
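
`UCBBanditGame` returns rewards only, so it cannot produce a **% Optimal Actions** curve by itself. The sketch below shows one way this could be tracked. The function `ucb_with_optimal` is a hypothetical variant, and the `Bandit` class is repeated in compact form only so that the example runs standalone; the smaller `n_games` and `timesteps` are chosen to keep it quick.

```python
import numpy as np

class Bandit:
    """Compact copy of the notebook's Bandit so this sketch is self-contained."""
    def __init__(self, arms=10):
        self.arms = arms
        self.q_star = np.random.normal(0, 1, arms)

    def get_reward(self, action):
        return np.random.normal(self.q_star[action], 1)

def ucb_with_optimal(bandit, timesteps, c=2):
    """Hypothetical variant of UCBBanditGame that also records optimal picks."""
    arms = bandit.arms
    Q = np.zeros(arms)
    N = np.zeros(arms)
    rewards = np.zeros(timesteps)
    optimal = np.zeros(timesteps)
    best = np.argmax(bandit.q_star)

    for t in range(timesteps):
        if t < arms:
            action = t  # play each arm once first
        else:
            action = np.argmax(Q + c * np.sqrt(np.log(t + 1) / N))
        reward = bandit.get_reward(action)
        rewards[t] = reward
        optimal[t] = (action == best)  # 1.0 if the optimal arm was chosen
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]
    return rewards, optimal

np.random.seed(0)
n_games, timesteps = 200, 500
pct = np.zeros(timesteps)
for _ in range(n_games):
    _, opt = ucb_with_optimal(Bandit(10), timesteps)
    pct += opt
pct *= 100 / n_games  # convert counts to percentages
print(f"% optimal over the last 100 steps: {pct[-100:].mean():.1f}")
```

The resulting `pct` array can be plotted against the step index in the same way as the ϵ-greedy curves.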


## <span style="color:#FF4500; font-size:30px;">⚙️ Running UCB Simulations and 📊 Analyzing Output</span>
---

Simulate the **UCB algorithm** and compare its performance with other methods like **ϵ-greedy**.

---


In [None]:
# Play the UCB vs. epsilon-greedy game
def PlayUCBEpsilon(n_games,arms,timesteps,c,epsilon):
    avg_rewards = { epsilon : np.zeros(timesteps) }
    avg_rewards['UCB']=np.zeros(timesteps)
    
    for _ in range (n_games):
    
        bandit=Bandit(arms)
        rewards,_ = BanditGame(bandit,timesteps,epsilon)
        avg_rewards[epsilon]+= rewards
        
        rewardsUCB = UCBBanditGame(bandit,timesteps,c)
        avg_rewards['UCB']+= rewardsUCB
    
    for key in avg_rewards.keys():
        avg_rewards[key] /= n_games

    steps = np.arange(timesteps)

    # Setting up the figure
    plt.figure(figsize=(10, 8))
    
    # Plotting Average Rewards on a single set of axes
    for key, rewards in avg_rewards.items():
        # Distinguishing UCB with a bold color and other methods with grey
        label = f'ε = {key}' if key != 'UCB' else f'UCB (c={c})'
        color = 'grey' if key != 'UCB' else '#1E90FF'
        linestyle = 'solid' if key == 'UCB' else 'dashed'
        plt.plot(
            steps,
            rewards,
            label=label,
            color=color,
            linestyle=linestyle,
            linewidth=2 if key == 'UCB' else 1.5
        )
    
    # Adding labels, legend, and title
    plt.xlabel('Steps', fontsize=14, fontweight='bold', color='#333')
    plt.ylabel('Average Reward', fontsize=14, fontweight='bold', color='#333')
    plt.legend(fontsize=12, loc='lower right')
    plt.title('Average Rewards vs. Steps', fontsize=16, fontweight='bold', color='#111')
    
    # Adding grid for better readability
    plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)
    
    # Adjust layout for spacing
    plt.tight_layout()
    plt.show()


In [None]:
#GamePlay
PlayUCBEpsilon(n_games=2000,arms=10,timesteps=1000,c=2,epsilon=0.1)



## <span style="color:#4682B4; font-size:30px;">🚀 Insights on UCB (c=2) vs. ϵ-Greedy (ϵ=0.1)</span>
---

### **🔎 Initial Phase (First 10 Plays)**

- **UCB (c=2)** plays **each arm exactly once**, gathering a first reward sample from every arm.
- During this phase, the **average reward remains close to zero** for UCB, because the true values $q^*(a)$ are drawn with mean 0 and no arm is favored yet.
- Conversely, **ϵ-greedy (ϵ=0.1)** starts **outperforming UCB**, leveraging its early exploitation advantage.

---

### **📊 Mid-Phase (Around 11th Play)**

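- **Performance Spike**: 
   - After one play of each arm, all counts equal 1, so every arm gets the same exploration bonus and UCB simply picks the arm with the **highest estimated reward** so far.
   - That arm tends to have an above-average true value, which produces a jump in average reward on the 11th play.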
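
The behavior around the 11th play can be checked directly. This is a sketch under assumptions: it simulates only the state after the first ten plays, uses a local generator, and the names `first` and `bonus` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
k, c = 10, 2
q_star = rng.normal(0, 1, k)
Q = rng.normal(q_star, 1)   # one sampled reward per arm after the first k plays
N = np.ones(k)              # every arm has been played exactly once

t = k                       # 0-based index of the 11th play
ucb = Q + c * np.sqrt(np.log(t + 1) / N)
first = int(np.argmax(ucb))
print(first == int(np.argmax(Q)))  # equal bonuses cannot change the ranking

# After that play, the chosen arm's bonus shrinks relative to all the others.
N[first] += 1
bonus = c * np.sqrt(np.log(t + 2) / N)
print(bonus[first] < np.delete(bonus, first).min())
```

Because the chosen arm's bonus drops right after the 11th play, the other arms become relatively more attractive on the following steps, which is consistent with the spike being short-lived.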

---

### **🌟 Long-Term Advantage of UCB**

- UCB eventually surpasses ϵ-greedy in **average reward** because it:
  1. **Reduces exploration** over time.
  2. Focuses more effectively on the **optimal arm**, leveraging its confidence-based strategy.
- Compared to ϵ-greedy, UCB achieves **higher long-term rewards** while maintaining efficient exploration.

---


