# Epsilon-Greedy Bandit Algorithm

## Installation
First, install the required dependencies with pinned versions for reproducibility.

In [None]:
!pip install numpy==1.24.3 matplotlib==3.7.2

## Overview

The **Epsilon-Greedy** algorithm is one of the simplest and most widely used strategies for balancing exploration and exploitation in multi-armed bandit problems. It provides an elegant solution to the fundamental trade-off: should we exploit what we know works well, or explore to discover potentially better options?

### The Exploration-Exploitation Dilemma

In reinforcement learning, we face a constant dilemma:
- **Exploitation**: Choose the action that currently appears best based on our estimates
- **Exploration**: Try other actions to gather more information and potentially discover better options

Pure exploitation can lead to suboptimal behavior if our initial estimates are wrong. Pure exploration wastes opportunities by not leveraging what we've learned. Epsilon-greedy provides a simple probabilistic solution.

### How It Works

At each time step $t$, the algorithm:
- With probability $\epsilon$: **Explores** by choosing an action uniformly at random from all available actions
- With probability $1-\epsilon$: **Exploits** by choosing the action with the highest estimated value $Q_t(a)$

This means that if $\epsilon = 0.1$, the agent takes an exploration step 10% of the time and a greedy step 90% of the time. Because an exploration step can still land on the greedy action, that action is selected slightly more often than $1-\epsilon$ overall.

### Action Selection Rule

The action selection at time $t$ follows:

$$A_t = \begin{cases}
\text{uniformly random action} & \text{with probability } \epsilon \\
\arg\max_a Q_t(a) & \text{with probability } 1-\epsilon
\end{cases}$$

Where $Q_t(a)$ represents our current estimate of the value of action $a$.
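
The selection rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation inside `deepmind_bandits`: the helper name `epsilon_greedy_action` and its `rng` argument are hypothetical, and the exploration branch is assumed to draw uniformly over every action (the common textbook variant).

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick an action index from the estimated values `q_values`.

    Hypothetical helper for illustration; exploration is assumed to be
    uniform over all actions, including the current greedy one.
    """
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly at random
        return int(rng.integers(len(q_values)))
    # Exploit: break ties between equally good actions at random
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q = [0.2, 1.5, -0.3, 0.0]
picks = [epsilon_greedy_action(q, epsilon=0.1, rng=rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(f"Greedy action chosen {greedy_share:.1%} of the time")
```

Under these assumptions, with four actions and $\epsilon = 0.1$ the greedy action should be picked in roughly $1 - \epsilon + \epsilon/4 = 92.5\%$ of steps.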

### Value Update Rule

The action-value estimates are updated using incremental sample averaging:

$$Q_{t+1}(a) = Q_t(a) + \frac{1}{N_t(a)}\left[R_t - Q_t(a)\right]$$

This can be rewritten as:

$$Q_{t+1}(a) = Q_t(a) + \alpha_t(a) \left[R_t - Q_t(a)\right]$$

where $\alpha_t(a) = \frac{1}{N_t(a)}$ is the step size.

**Interpretation:**
- $Q_t(a)$: Current estimate of action $a$'s value
- $N_t(a)$: Number of times action $a$ has been selected, counting the current selection (so the update is a running average over all of its samples)
- $R_t$: Reward received from the most recent selection
- $R_t - Q_t(a)$: Prediction error (how much we were wrong)
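
As a quick check of the update rule, the sketch below applies it to a stream of simulated rewards and compares the result with the ordinary batch mean. The reward stream (mean 2.0, standard deviation 0.2) and the variable names are illustrative assumptions, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated rewards for a single action (illustrative values)
rewards = rng.normal(loc=2.0, scale=0.2, size=500)

q, n = 0.0, 0
for r in rewards:
    n += 1                 # N_t(a): count includes the current selection
    q += (r - q) / n       # Q <- Q + (1/N) * (R - Q)

# The incremental estimate agrees with the batch mean up to rounding error
print(np.isclose(q, rewards.mean()))
```

The incremental form needs only the current estimate and a counter, so no reward history has to be stored.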

### Algorithm Parameters

- **`epsilon` ($\epsilon$)**: Exploration rate, typically between 0.01 and 0.1
  - Smaller values: More exploitation and less reward spent on random actions, but a higher risk of staying on a suboptimal action for a long time
  - Larger values: More exploration, so the best action tends to be found sooner, but more reward is lost to random actions afterward
  
- **`initial_q`**: Initial Q-value estimates
  - Setting high initial values encourages early exploration ("optimistic initialization")
  - Setting to zero is neutral
  
- **`alfa` ($\alpha$)**: Optional fixed step size
  - If `None`: Uses sample average (decreasing step size $1/N_t(a)$)
  - If fixed: Constant step size, gives more weight to recent rewards (useful for non-stationary problems)
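
To illustrate why a constant step size can help in non-stationary problems, the sketch below tracks a single arm whose mean jumps halfway through. The arm, the jump from 1.0 to 3.0, and the value `alpha = 0.1` are hypothetical choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical non-stationary arm: the mean jumps from 1.0 to 3.0 halfway through
rewards = np.concatenate([
    rng.normal(1.0, 0.1, size=500),
    rng.normal(3.0, 0.1, size=500),
])

q_avg, q_const, alpha = 0.0, 0.0, 0.1
for n, r in enumerate(rewards, start=1):
    q_avg += (r - q_avg) / n            # sample average: step size 1/n
    q_const += alpha * (r - q_const)    # constant step size

print(f"sample average: {q_avg:.2f}, constant step: {q_const:.2f}")
```

The sample average ends near 2.0, the mean over the whole history, while the constant-step estimate ends near the current mean of 3.0 because older rewards are discounted geometrically.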

### Advantages and Limitations

**Advantages:**
- Simple to implement and understand
- Computationally efficient
- Guaranteed to try every action infinitely often in the limit (for any fixed $\epsilon > 0$)
- Works well in practice for many problems

**Limitations:**
- Explores uniformly at random (doesn't prioritize promising actions)
- Fixed $\epsilon$ doesn't adapt over time
- May waste time on obviously bad actions
- No uncertainty modeling

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from deepmind_bandits import EpsilonGreedyAlgorithm, GaussianBandits, BanditDataAnalyzer

# Set random seed for reproducibility
np.random.seed(42)

## Environment Setup

We create a Gaussian bandit environment with 4 actions. Each action has a true mean reward (unknown to the agent) and some noise (standard deviation).

**Environment Design:**
- Action 0: Low mean (1.0), low noise - consistently mediocre
- Action 1: High mean (2.0), moderate noise - **optimal action**
- Action 2: Negative mean (-1.0), low noise - clearly bad
- Action 3: Zero mean (0.0), high noise - unpredictable

The agent must learn through interaction which action is best.
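
For readers who want to see what such an environment does, here is a small stand-in written from scratch. It is a sketch under assumptions: the class name `SimpleGaussianBandit` is hypothetical, and it only mirrors the `num_arms` attribute and `pull_arm` method used in this notebook. It is not the source of `GaussianBandits`.

```python
import numpy as np

class SimpleGaussianBandit:
    """Hypothetical stand-in for a Gaussian bandit environment."""

    def __init__(self, means, stds, seed=None):
        self.means = np.asarray(means, dtype=float)
        self.stds = np.asarray(stds, dtype=float)
        self.num_arms = len(self.means)
        self._rng = np.random.default_rng(seed)

    def pull_arm(self, action):
        # One noisy reward drawn from a normal distribution for this arm
        return float(self._rng.normal(self.means[action], self.stds[action]))

demo_env = SimpleGaussianBandit([1.0, 2.0, -1.0, 0.0], [0.1, 0.2, 0.1, 0.3], seed=0)
samples = np.array([[demo_env.pull_arm(a) for _ in range(2000)] for a in range(4)])
print("Empirical means:", samples.mean(axis=1).round(2))
```

With enough pulls, the empirical mean of each arm approaches its true mean, which is exactly what the agent's Q-values try to estimate.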

In [None]:
# Create bandit environment with Gaussian reward distributions
means = [1.0, 2.0, -1.0, 0.0]
stds = [0.1, 0.2, 0.1, 0.3]
env = GaussianBandits(means, stds)
num_actions = env.num_arms

print(f"Number of actions: {num_actions}")
print(f"True mean rewards: {means}")
print(f"Reward noise (stds): {stds}")
print(f"\nOptimal action: {np.argmax(means)} with mean reward = {max(means):.2f}")
print(f"\nGoal: Learn to select action {np.argmax(means)} most often through trial and error!")

## Agent Initialization

We create an epsilon-greedy agent and a data analyzer to track its performance over time.

**Agent Configuration:**
- All Q-values initialized to 0 (neutral, no initial bias)
- Exploration rate $\epsilon = 0.1$ (10% random exploration)
- Uses sample-average updates (step size = 1/N)

**Analyzer:**
- Tracks rewards, actions, and regret at each step
- Computes cumulative metrics
- Generates visualization plots

In [None]:
# Create epsilon-greedy agent
agent = EpsilonGreedyAlgorithm(num_actions=num_actions, initial_q=0.0)
agent.epsilon = 0.1  # 10% exploration rate

# Create analyzer for tracking performance metrics
analyzer = BanditDataAnalyzer(means, num_actions)

print("Agent Configuration:")
print(f"  Epsilon (exploration rate): {agent.epsilon}")
print(f"  Initial Q-values: {agent.Q}")
print(f"  Initial action counts: {agent.N}")
print("\nThe agent starts with no knowledge - all Q-values are 0!")

## Training Loop

Now we run the main learning loop for 1000 time steps. At each step:

1. **Select**: Agent chooses action using epsilon-greedy policy
2. **Execute**: Action is taken in environment, reward is observed
3. **Update**: Q-value estimate is updated based on observed reward
4. **Track**: Performance metrics are recorded

Over time, the agent should:
- Build accurate estimates of each action's value
- Increasingly select the optimal action (action 1)
- Minimize cumulative regret

In [None]:
T = 1000  # Number of time steps

for t in range(T):
    # Agent selects action using epsilon-greedy policy
    action = agent.select_action_greedy()
    
    # Execute action and observe reward from environment
    reward = env.pull_arm(action)
    
    # Update agent's Q-value estimates
    agent.update_values(action, reward)
    
    # Track performance for analysis
    analyzer.update_and_analyze(action, reward)

print(f"Training completed: {T} time steps\n")
print("Final Q-value estimates:")
for a in range(num_actions):
    error = abs(agent.Q[a] - means[a])
    print(f"  Action {a}: Q = {agent.Q[a]:.3f} (true = {means[a]:.2f}, error = {error:.3f})")
    
print(f"\nAction selection counts: {agent.N}")
print(f"Total selections: {sum(agent.N)} (should equal {T})")

## Results Analysis

Let's analyze how well the epsilon-greedy agent performed. We'll look at:
- **Q-value convergence**: Did estimates converge to true values?
- **Action selection**: Did we find and exploit the optimal action?
- **Regret**: How much reward did we lose by not always picking optimally?
- **Cumulative reward**: Total reward accumulated over time

In [None]:
# Finalize analysis (compute cumulative metrics)
analyzer.finalize_analysis()

### Q-Value Progression Over Time

This plot shows how the agent's Q-value estimates evolved during learning:
- **Solid colored lines**: Q-values for each action over time
- **Dashed lines**: True mean rewards (ground truth)
- **Red arrows**: When the agent switches between actions

**What to look for:**
- Q-values should converge toward the true means (dashed lines)
- More frequently selected actions have smoother estimates
- Early exploration causes more switching (red arrows)

In [None]:
analyzer.plot_Qvalue()

### Cumulative Regret

**Regret** measures the difference between the reward we *could* have gotten (always picking optimal) and what we *actually* got.

At each step $t$:
- Instantaneous regret = $r^* - r_t$, where $r^*$ is the optimal mean reward and $r_t$ is the reward obtained at step $t$
- Cumulative regret = sum of all instantaneous regrets up to time $t$
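
A tiny worked example of these two quantities is sketched below. It assumes regret is measured against the true mean of the chosen action (often called pseudo-regret); `BanditDataAnalyzer` may compute it differently, for example from realized rewards. The action log is made up for illustration.

```python
import numpy as np

true_means = np.array([1.0, 2.0, -1.0, 0.0])
# Hypothetical log of chosen actions over eight steps
actions = np.array([0, 2, 1, 1, 3, 1, 1, 1])

# Gap between the best mean and the mean of the action taken at each step
instant_regret = true_means.max() - true_means[actions]
cumulative_regret = np.cumsum(instant_regret)
print(cumulative_regret)  # [1. 4. 4. 4. 6. 6. 6. 6.]
```

The curve is flat whenever the optimal action (action 1) is chosen and steps up by the size of the gap otherwise.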

**What to look for:**
- **Slope**: Should decrease over time as we learn
- **Final value**: Lower is better (less total regret)
- **Shape**: Linear growth means we're not learning; sublinear (flattening) means we're improving

For epsilon-greedy, regret grows *linearly* in the long run because we always explore with probability $\epsilon$.
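
The long-run slope can be estimated with a short calculation. The sketch assumes that exploration steps pick uniformly among all actions and that greedy steps already pick the best action, so it describes the rate after learning has settled, not the early phase.

```python
import numpy as np

true_means = np.array([1.0, 2.0, -1.0, 0.0])
eps = 0.1

# Gap between the best arm and each arm's true mean: [1, 0, 3, 2]
gaps = true_means.max() - true_means
# Only exploration steps incur regret, and each arm is equally likely on those steps
regret_per_step = eps * gaps.mean()
print(f"Long-run regret per step: {regret_per_step:.3f}")
print(f"Over 1000 steps: about {1000 * regret_per_step:.0f}")
```

For this environment the estimate is 0.15 per step, or about 150 over 1000 steps. The measured cumulative regret can differ because of early mistakes and random variation.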

In [None]:
analyzer.plot_regret()

### Cumulative Reward

This shows the total reward accumulated over time:
- **Black line**: Total cumulative reward across all actions
- **Dashed lines**: Cumulative reward per action

**What to look for:**
- Steeper slope = higher reward rate
- The optimal action's line should dominate
- Total reward should increase steadily

In [None]:
analyzer.plot_cumulative_reward()

## Performance Summary and Analysis

Let's quantify how well the agent learned:

In [None]:
print("=" * 60)
print("EPSILON-GREEDY PERFORMANCE SUMMARY")
print("=" * 60)

print("\nExperiment Parameters:")
print(f"  Total time steps: {T}")
print(f"  Exploration rate (epsilon): {agent.epsilon}")
print(f"  Expected exploration steps: ~{int(T * agent.epsilon)} ({agent.epsilon * 100:.0f}%)")

print("\nLearned Q-Values vs True Means:")
for a in range(num_actions):
    q_val = agent.Q[a]
    true_mean = means[a]
    error = abs(q_val - true_mean)
    count = agent.N[a]
    pct = 100 * count / T
    marker = " ← OPTIMAL" if a == np.argmax(means) else ""
    print(f"  Action {a}: Q={q_val:6.3f}, True={true_mean:5.2f}, Error={error:.3f} | "
          f"Selected {count:4d} times ({pct:5.1f}%){marker}")

optimal_action = np.argmax(means)
optimal_selections = agent.N[optimal_action]
optimal_pct = 100 * optimal_selections / T

print("\nOptimal Action Performance:")
print(f"  Optimal action: {optimal_action} (true mean = {means[optimal_action]:.2f})")
print(f"  Times selected: {optimal_selections}/{T} ({optimal_pct:.1f}%)")
print(f"  Share of greedy steps (1-ε): {100*(1-agent.epsilon):.1f}%")

if optimal_pct >= 100 * (1 - agent.epsilon) - 5:  # within 5 percentage points of 1-ε
    print("  ✓ Successfully learned to exploit optimal action!")
else:
    print("  ⚠ Could improve - not yet converged to optimal policy")

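print("\nRegret Analysis:")
print(f"  Final cumulative regret: {analyzer.regret[-1]:.2f}")
print(f"  Average regret per step: {analyzer.regret[-1]/T:.3f}")
# Long-run rate: epsilon times the average gap to the best mean, which assumes
# exploration is uniform over all actions and greedy steps pick the optimal action
expected_regret_rate = agent.epsilon * (means[optimal_action] - np.mean(means))
print(f"  Theoretical regret rate: ~{expected_regret_rate:.3f} per step")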

print("\n" + "=" * 60)

## Key Takeaways

**What we learned:**
1. ✅ Epsilon-greedy successfully balances exploration and exploitation
2. ✅ Q-value estimates converge to true means with enough samples  
3. ✅ The optimal action is identified and selected most often
4. ⚠️ Continuous exploration (fixed $\epsilon$) causes linear regret growth

**Improvements to consider:**
- **Decaying epsilon**: Reduce $\epsilon$ over time (e.g., $\epsilon_t = \epsilon_0/t$)
- **Optimistic initialization**: Start with high Q-values to encourage early exploration
- **UCB**: Use uncertainty-based exploration instead of random
- **Thompson Sampling**: Bayesian approach that naturally balances exploration/exploitation
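
As one illustration of the first idea, here is a self-contained sketch of epsilon-greedy with the $\epsilon_t = \epsilon_0/t$ schedule. It does not use `deepmind_bandits`: the helper `decayed_epsilon`, the optional floor `epsilon_min`, and the inline bandit simulation are all assumptions made for this example.

```python
import numpy as np

def decayed_epsilon(t, epsilon_0=1.0, epsilon_min=0.0):
    """Hypothetical schedule: epsilon_0 / t for step t >= 1, floored at epsilon_min."""
    return max(epsilon_min, epsilon_0 / t)

rng = np.random.default_rng(3)
true_means = np.array([1.0, 2.0, -1.0, 0.0])
true_stds = np.array([0.1, 0.2, 0.1, 0.3])
q_est = np.zeros(4)
counts = np.zeros(4, dtype=int)

for t in range(1, 1001):
    if rng.random() < decayed_epsilon(t):
        a = int(rng.integers(4))       # explore uniformly over all actions
    else:
        a = int(np.argmax(q_est))      # exploit current estimates
    r = rng.normal(true_means[a], true_stds[a])
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a]   # sample-average update

print("Selections per action:", counts)
```

A $1/t$ schedule explores only a handful of times in 1000 steps, so some runs may stop exploring before the best action has been tried. A slower decay or a small positive `epsilon_min` is a common way to trade that risk against long-run regret.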

**When to use epsilon-greedy:**
- Simple baseline for bandit problems
- When computational efficiency matters
- When you want interpretable, predictable behavior
- As a starting point before trying more sophisticated methods

---

## Exercises

### Exercise 1: Experiment with Different Reward Distributions

Try modifying the bandit environment with different reward distributions. How does the epsilon-greedy algorithm perform when:
- The optimal action has high variance?
- Multiple actions have similar mean rewards?
- The worst action has very high variance?

Modify the `means` and `stds` arrays and re-run the training loop to observe the behavior.

<details>
    <summary>Click here for hint</summary>

```python
# Example 1: High variance optimal action
means_high_var = [1.0, 2.5, 1.2, 0.5]
stds_high_var = [0.1, 1.5, 0.1, 0.1]  # Optimal action has high noise!

# Example 2: Similar mean rewards (harder to distinguish)
means_similar = [1.0, 1.2, 1.1, 0.9]
stds_similar = [0.1, 0.1, 0.1, 0.1]

# Example 3: High variance bad action
means_trap = [1.5, 2.0, 0.5, -1.0]
stds_trap = [0.1, 0.1, 0.1, 3.0]  # Bad action has huge variance

# Create new environment
env_new = GaussianBandits(means_high_var, stds_high_var)

# Reset and retrain agent
agent_new = EpsilonGreedyAlgorithm(num_actions=len(means_high_var), initial_q=0.0)
agent_new.epsilon = 0.1
analyzer_new = BanditDataAnalyzer(means_high_var, len(means_high_var))

# Run training loop
for t in range(1000):
    action = agent_new.select_action_greedy()
    reward = env_new.pull_arm(action)
    agent_new.update_values(action, reward)
    analyzer_new.update_and_analyze(action, reward)

# Analyze results
analyzer_new.finalize_analysis()
analyzer_new.plot_Qvalue()
analyzer_new.plot_regret()
```

</details>

### Exercise 2: Compare Different Epsilon Values

The exploration rate $\epsilon$ controls the exploration-exploitation trade-off. Experiment with different values:
- Low epsilon (e.g., 0.01): More exploitation
- Medium epsilon (e.g., 0.1): Balanced approach
- High epsilon (e.g., 0.3): More exploration

Compare their performance in terms of regret, convergence speed, and optimal action selection rate.

<details>
    <summary>Click here for hint</summary>

```python
# Test different epsilon values
epsilon_values = [0.01, 0.05, 0.1, 0.2, 0.3]
results = {}

# Use the original environment
env_test = GaussianBandits(means, stds)

for eps in epsilon_values:
    # Create fresh agent with this epsilon
    agent_test = EpsilonGreedyAlgorithm(num_actions=num_actions, initial_q=0.0)
    agent_test.epsilon = eps
    analyzer_test = BanditDataAnalyzer(means, num_actions)
    
    # Train
    for t in range(1000):
        action = agent_test.select_action_greedy()
        reward = env_test.pull_arm(action)
        agent_test.update_values(action, reward)
        analyzer_test.update_and_analyze(action, reward)
    
    analyzer_test.finalize_analysis()
    
    # Store results
    results[eps] = {
        'regret': analyzer_test.regret[-1],
        'optimal_pct': 100 * agent_test.N[np.argmax(means)] / 1000,
        'Q_values': agent_test.Q.copy()
    }
    
    print(f"ε={eps:.2f}: Regret={results[eps]['regret']:.2f}, "
          f"Optimal%={results[eps]['optimal_pct']:.1f}%")

# Plot comparison
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.bar([str(e) for e in epsilon_values], [results[e]['regret'] for e in epsilon_values])
plt.xlabel('Epsilon')
plt.ylabel('Cumulative Regret')
plt.title('Regret vs Epsilon')

plt.subplot(1, 2, 2)
plt.bar([str(e) for e in epsilon_values], [results[e]['optimal_pct'] for e in epsilon_values])
plt.axhline(y=100, color='r', linestyle='--', label='Perfect (100%)')
plt.xlabel('Epsilon')
plt.ylabel('Optimal Action %')
plt.title('Optimal Action Selection vs Epsilon')
plt.legend()
plt.tight_layout()
plt.show()
```

</details>