# Week 1: Introduction to Reinforcement Learning
## Lecture Examples

**Course:** Reinforcement Learning - Continuing Education  
**Institution:** Zurich University of Applied Sciences  

---

## Learning Objectives
- Understand the basic RL paradigm: agent, environment, actions, rewards
- Grasp the exploration vs exploitation tradeoff
- Get to know Multi-Armed Bandits as the simplest RL problem

## 1. The Reinforcement Learning Paradigm

In Reinforcement Learning, an **agent** interacts with an **environment**:
1. Agent observes the current **state**
2. Agent selects an **action**
3. Environment provides a **reward** and new state
4. Agent learns to maximize cumulative rewards
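
The four steps above can be sketched as a loop. This is only a minimal illustration, not a standard API: `ToyEnv`, `run_loop`, `matching_policy`, and `random_policy` are hypothetical names invented here, and the learning in step 4 is left out so that only the interaction pattern is visible.

```python
import numpy as np


class ToyEnv:
    """Hypothetical two-state environment, invented for illustration only."""

    def __init__(self, rng):
        self.rng = rng
        self.state = 0

    def step(self, action):
        # Reward 1.0 if the action matches the current state, else 0.0
        reward = 1.0 if action == self.state else 0.0
        # The next state is drawn uniformly from {0, 1}
        self.state = int(self.rng.integers(2))
        return reward, self.state


def run_loop(policy, n_steps=100, seed=0):
    """Run the observe/act/reward cycle and return the cumulative reward."""
    rng = np.random.default_rng(seed)
    env = ToyEnv(rng)
    state = env.state                      # 1. observe the current state
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy(state, rng)        # 2. select an action
        reward, state = env.step(action)   # 3. receive reward and new state
        total_reward += reward             # 4. (a learning agent would update here)
    return total_reward


def matching_policy(state, rng):
    """Always pick the action equal to the observed state."""
    return state


def random_policy(state, rng):
    """Ignore the state and pick one of the two actions uniformly."""
    return int(rng.integers(2))


print("Matching policy:", run_loop(matching_policy))
print("Random policy:  ", run_loop(random_policy))
```

In this toy setting the matching policy collects a reward on every step, while the random policy collects one on roughly half of them. A learning agent would start out like the random policy and move toward the matching one.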

### Key Difference from Supervised Learning
- **Supervised Learning:** Given correct labels, minimize error
- **Reinforcement Learning:** Only get reward signals, must discover good actions through trial and error

## 2. Multi-Armed Bandits: The Simplest RL Problem

Imagine a casino with multiple slot machines ("one-armed bandits"). Each machine has:
- A hidden probability distribution of rewards
- Different expected payouts

**Your Goal:** Maximize total winnings over many plays

**The Challenge:** You don't know which machine is best! You must:
- **Explore:** Try different machines to learn about them
- **Exploit:** Play the machine you currently think is best

This is called the **Exploration-Exploitation Tradeoff**.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

## 3. Example: A Simple 3-Armed Bandit

In [None]:
class SimpleBandit:
    """A simple multi-armed bandit with Gaussian reward distributions."""
    
    def __init__(self, true_means):
        """
        Args:
            true_means: List of true mean rewards for each arm
        """
        self.true_means = np.array(true_means)
        self.n_arms = len(true_means)
        
    def pull(self, arm):
        """Pull an arm and get a reward (mean + Gaussian noise)."""
        reward = self.true_means[arm] + np.random.randn()
        return reward

# Create a bandit with 3 arms
# True means: [1.0, 2.5, 1.5] - Arm 1 (zero-indexed) is the best!
bandit = SimpleBandit([1.0, 2.5, 1.5])

print("True mean rewards:")
for i, mean in enumerate(bandit.true_means):
    print(f"  Arm {i}: {mean:.1f}")

In [None]:
# Let's pull each arm a few times to see the randomness
print("\nSample rewards from pulling each arm 5 times:")
for arm in range(bandit.n_arms):
    rewards = [bandit.pull(arm) for _ in range(5)]
    print(f"Arm {arm}: {[f'{r:.2f}' for r in rewards]}")
    print(f"  Sample mean: {np.mean(rewards):.2f} (true mean: {bandit.true_means[arm]:.2f})")

## 4. Strategy 1: Pure Exploitation (Greedy)

Always choose the arm with the highest estimated value.

**Problem:** What if our initial estimates are wrong? We might get stuck with a suboptimal arm!
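
To get a feeling for how often this happens, one can repeat a greedy run many times and count the runs that end up favoring a suboptimal arm. The sketch below assumes the same arm means as in Section 3 (`[1.0, 2.5, 1.5]`) and unit-variance Gaussian noise. The function names, the number of runs, and the use of `np.random.default_rng` are choices made for this illustration only.

```python
import numpy as np


def greedy_most_pulled_arm(true_means, n_steps, rng):
    """Run one greedy episode and return the index of the most-pulled arm."""
    true_means = np.asarray(true_means, dtype=float)
    n_arms = len(true_means)

    # Pull each arm once: the first noisy reward becomes the initial estimate
    values = true_means + rng.standard_normal(n_arms)
    counts = np.ones(n_arms)

    for _ in range(n_arms, n_steps):
        arm = int(np.argmax(values))                         # pure exploitation
        reward = true_means[arm] + rng.standard_normal()
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental average

    return int(np.argmax(counts))


def fraction_stuck(true_means=(1.0, 2.5, 1.5), n_runs=500, n_steps=200, seed=0):
    """Fraction of greedy runs whose most-pulled arm is not the best arm."""
    rng = np.random.default_rng(seed)
    best_arm = int(np.argmax(true_means))
    stuck = sum(
        greedy_most_pulled_arm(true_means, n_steps, rng) != best_arm
        for _ in range(n_runs)
    )
    return stuck / n_runs


print(f"Fraction of greedy runs stuck on a suboptimal arm: {fraction_stuck():.2f}")
```

A run gets stuck when the best arm's first reward happens to be low: its estimate then stays below that of another arm, and greedy never pulls it again to correct the estimate.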

In [None]:
def greedy_strategy(n_steps=1000):
    """Pure exploitation: always pick the current best arm."""
    bandit = SimpleBandit([1.0, 2.5, 1.5])
    
    # Initialize
    n_arms = bandit.n_arms
    counts = np.zeros(n_arms)  # How many times each arm was pulled
    values = np.zeros(n_arms)  # Estimated value of each arm
    rewards = []               # Track rewards over time
    
    # Pull each arm once to initialize
    for arm in range(n_arms):
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] = reward
        rewards.append(reward)
    
    # Greedy selection for remaining steps
    for step in range(n_arms, n_steps):
        # Pick the arm with highest estimated value
        arm = np.argmax(values)
        
        # Pull the arm and observe reward
        reward = bandit.pull(arm)
        
        # Update estimates
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # Incremental average
        
        rewards.append(reward)
    
    return rewards, counts, values

rewards_greedy, counts_greedy, values_greedy = greedy_strategy(1000)

print("Greedy Strategy Results:")
print(f"Average reward: {np.mean(rewards_greedy):.3f}")
print(f"\nArm selection counts: {counts_greedy}")
print(f"Estimated values: {values_greedy}")
print(f"True values: {bandit.true_means}")
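
The update `values[arm] += (reward - values[arm]) / counts[arm]` keeps a running mean without storing past rewards: after the n-th reward r_n, the new estimate is Q_n = Q_{n-1} + (r_n - Q_{n-1}) / n. As a quick sanity check, the sketch below compares this update with the ordinary sample mean. The helper name `incremental_mean` and the sample data are assumptions made for this example.

```python
import numpy as np


def incremental_mean(samples):
    """Compute the mean one sample at a time, as in the bandit value update."""
    estimate = 0.0
    for n, reward in enumerate(samples, start=1):
        # Move the estimate toward the new reward by a step of size 1/n
        estimate += (reward - estimate) / n
    return estimate


rng = np.random.default_rng(0)
samples = 2.5 + rng.standard_normal(1000)  # noisy rewards around a mean of 2.5

print(f"Incremental estimate: {incremental_mean(samples):.6f}")
print(f"Batch sample mean:    {samples.mean():.6f}")
```

Both numbers agree up to floating-point rounding, so the incremental form only saves memory and does not change the estimate.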

## 5. Strategy 2: Epsilon-Greedy

Balance exploration and exploitation:
- With probability ε: **Explore** (pick a random arm)
- With probability 1-ε: **Exploit** (pick the arm with the highest estimated value)

Common values: ε = 0.1 or ε = 0.05
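
A fixed ε also puts a ceiling on the long-run reward: even after the estimates have converged and the greedy choice is the truly best arm, a fraction ε of the pulls is still spent on a uniformly random arm. The short calculation below works this out for the arm means used in this notebook (`[1.0, 2.5, 1.5]`). The helper name `epsilon_greedy_ceiling` is an assumption made for this example, and it presumes that exploration picks uniformly among all arms, including the best one.

```python
import numpy as np


def epsilon_greedy_ceiling(true_means, epsilon):
    """Expected per-step reward once the greedy choice is the truly best arm."""
    true_means = np.asarray(true_means, dtype=float)
    exploit_part = (1 - epsilon) * true_means.max()  # greedy pulls hit the best arm
    explore_part = epsilon * true_means.mean()       # random pulls average over all arms
    return exploit_part + explore_part


for eps in (0.1, 0.05):
    ceiling = epsilon_greedy_ceiling([1.0, 2.5, 1.5], eps)
    print(f"epsilon={eps}: long-run expected reward = {ceiling:.3f}")
# epsilon=0.1: long-run expected reward = 2.417
# epsilon=0.05: long-run expected reward = 2.458
```

A smaller ε raises the ceiling toward the optimal 2.5 but also slows down how quickly a bad early estimate gets corrected.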

In [None]:
def epsilon_greedy_strategy(epsilon=0.1, n_steps=1000):
    """Epsilon-greedy: explore with probability epsilon, exploit otherwise."""
    bandit = SimpleBandit([1.0, 2.5, 1.5])
    
    n_arms = bandit.n_arms
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)
    rewards = []
    
    for step in range(n_steps):
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            # Explore: random action
            arm = np.random.randint(n_arms)
        else:
            # Exploit: best action
            arm = np.argmax(values)
        
        # Pull arm and observe reward
        reward = bandit.pull(arm)
        
        # Update estimates
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        
        rewards.append(reward)
    
    return rewards, counts, values

rewards_eps, counts_eps, values_eps = epsilon_greedy_strategy(epsilon=0.1, n_steps=1000)

print("Epsilon-Greedy Strategy Results (ε=0.1):")
print(f"Average reward: {np.mean(rewards_eps):.3f}")
print(f"\nArm selection counts: {counts_eps}")
print(f"Estimated values: {values_eps}")
print(f"True values: {bandit.true_means}")

## 6. Comparing Strategies

In [None]:
# Run multiple experiments
n_experiments = 100
n_steps = 1000

avg_rewards_greedy = np.zeros(n_steps)
avg_rewards_eps = np.zeros(n_steps)

for _ in range(n_experiments):
    rewards_g, _, _ = greedy_strategy(n_steps)
    rewards_e, _, _ = epsilon_greedy_strategy(epsilon=0.1, n_steps=n_steps)
    
    avg_rewards_greedy += rewards_g
    avg_rewards_eps += rewards_e

avg_rewards_greedy /= n_experiments
avg_rewards_eps /= n_experiments

# Calculate cumulative averages
cumavg_greedy = np.cumsum(avg_rewards_greedy) / np.arange(1, n_steps + 1)
cumavg_eps = np.cumsum(avg_rewards_eps) / np.arange(1, n_steps + 1)

# Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(avg_rewards_greedy, alpha=0.7, label='Greedy')
plt.plot(avg_rewards_eps, alpha=0.7, label='Epsilon-Greedy (ε=0.1)')
plt.axhline(y=2.5, color='r', linestyle='--', label='Optimal (Arm 1)', alpha=0.5)
plt.xlabel('Step')
plt.ylabel('Average Reward')
plt.title('Average Reward per Step')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(cumavg_greedy, label='Greedy')
plt.plot(cumavg_eps, label='Epsilon-Greedy (ε=0.1)')
plt.axhline(y=2.5, color='r', linestyle='--', label='Optimal', alpha=0.5)
plt.xlabel('Step')
plt.ylabel('Cumulative Average Reward')
plt.title('Cumulative Average Reward')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nFinal cumulative average reward:")
print(f"  Greedy: {cumavg_greedy[-1]:.3f}")
print(f"  Epsilon-Greedy: {cumavg_eps[-1]:.3f}")
print("  Optimal: 2.500")

## 7. Key Takeaways

1. **RL is about trial-and-error learning** through interaction with an environment
2. **Exploration vs Exploitation** is fundamental - we need both!
3. **Multi-Armed Bandits** are the simplest RL problem (no state transitions)
4. **Greedy strategies** can get stuck with suboptimal choices
5. **Epsilon-greedy** provides a simple but effective way to balance exploration and exploitation

---

## Next Steps
In the lab assignment, you'll implement more sophisticated exploration strategies and analyze their performance!