# **✨Monte Carlo Methods**

## Table of Contents

1. [**Introduction to Reinforcement Learning Paradigms**](#1-introduction-to-reinforcement-learning-paradigms)
   - [1.1 Model-Based Learning Recap](#11-model-based-learning-recap)
   - [1.2 Model-Free Learning Fundamentals](#12-model-free-learning-fundamentals)

2. [**Monte Carlo Methods in Reinforcement Learning**](#2-monte-carlo-methods-in-reinforcement-learning)
   - [2.1 Core Concepts and Definitions](#21-core-concepts-and-definitions)
   - [2.2 Episode Collection and Q-Value Estimation](#22-episode-collection-and-q-value-estimation)
   - [2.3 Custom Grid World Environment Example](#23-custom-grid-world-environment-example)
   - [2.4 First-Visit vs. Every-Visit Monte Carlo Methods](#24-first-visit-vs-every-visit-monte-carlo-methods)
   - [2.5 Complete Code Implementation](#25-complete-code-implementation)



# 1. ⭐**Introduction to Reinforcement Learning Paradigms**

## 1.1 ✔️Model-Based Reinforcement Learning (Recap)

### 🎯 **Core Concept**

**Model-based learning** assumes you have **complete knowledge** of how the environment works - like having the rulebook for a game before you play.[1][4]

- **Environment dynamics are known**: You understand $P(s'|s,a)$ (transition probabilities) and $R(s,a)$ (reward function)
- **No trial-and-error needed**: Can calculate optimal actions mathematically
- **Planning-based approach**: Think first, act later[2]


### 🧮 **Mathematical Foundations**

When you know the environment model, you can predict:
- **Next state**: Given current state $s$ and action $a$, what's the probability of reaching state $s'$?
- **Expected reward**: What reward do you get for taking action $a$ in state $s$?

This knowledge enables **dynamic programming** techniques for finding optimal policies.[4]



### 🔧 **Core Algorithms**

#### **🔖 Policy Iteration** 
*"Improve the policy step by step"*

**Process**: Initialize policy → Evaluate policy → Improve policy → Repeat until optimal

##### **1. Policy Evaluation** (How good is my current policy?)

**Full Formula**:
$$V^{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^{\pi}(s')]$$

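**Simplified Approach** (deterministic transitions, where each $(s,a)$ leads to a single next state $s'$ with reward $r$):
- **Iterative updates**: $V(s) \leftarrow \sum_a \pi(a|s) [r + \gamma V(s')]$
- **Inner sum disappears**: with only one possible $s'$ per action, the sum over $s'$ collapses to a single term
- **Stop when values converge**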

##### **2. Policy Improvement** (Make the policy better)

**Full Formula**:
$$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^{\pi}(s')]$$

**Simplified Approach**:
- **Act greedily**: Choose action with highest expected value
- $\pi'(s) = \arg\max_a Q^{\pi}(s,a)$ where, for deterministic transitions, $Q^{\pi}(s,a) = r + \gamma V^{\pi}(s')$
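
The evaluate-then-improve loop can be sketched on a toy problem. The following is a minimal illustration, assuming a hypothetical 3-state deterministic chain (the arrays `NEXT_STATE` and `REWARD`, and all function names, are made up for this example and are not part of the original notes). Because transitions are deterministic, the sum over $s'$ reduces to a single term.

```python
import numpy as np

# Hypothetical deterministic chain (assumption): states 0, 1, 2; state 2 is terminal.
# Actions: 0 = left, 1 = right. NEXT_STATE[s, a] and REWARD[s, a] define the model.
NEXT_STATE = np.array([[0, 1], [0, 2], [2, 2]])
REWARD = np.array([[-1.0, -1.0], [-1.0, 10.0], [0.0, 0.0]])
GAMMA = 0.9

def evaluate_policy_dp(policy, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy."""
    V = np.zeros(len(policy))
    while True:
        delta = 0.0
        for s, a in enumerate(policy):
            v_new = REWARD[s, a] + GAMMA * V[NEXT_STATE[s, a]]
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def improve_policy(V):
    """Greedy improvement: pick the action with the highest one-step lookahead value."""
    Q = REWARD + GAMMA * V[NEXT_STATE]
    return np.argmax(Q, axis=1)

def policy_iteration():
    policy = np.zeros(NEXT_STATE.shape[0], dtype=int)  # start with "always left"
    while True:
        V = evaluate_policy_dp(policy)
        new_policy = improve_policy(V)
        if np.array_equal(new_policy, policy):  # policy is stable, so stop
            return policy, V
        policy = new_policy

policy, V = policy_iteration()
print(policy, V)  # the agent should learn to move right in states 0 and 1
```

The loop stops as soon as the greedy policy no longer changes, which is the usual stopping rule for policy iteration on small tabular problems.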



#### **🔖 Value Iteration**
*"Find the best value for each state directly"*

**Process**: Initialize values → Update values by selecting best actions → Repeat until optimal

##### **Full Formula**:
$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]$$

##### **Simplified Understanding**:
- **One-step lookahead**: For each state, try all actions and pick the best
- **Direct optimization**: No separate policy - values directly give you optimal actions
- **Policy extraction**: $\pi(s) = \arg\max_a Q(s,a)$
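
As a rough sketch of the update above, the snippet below runs value iteration on a small stochastic MDP. The transition tensor `P`, the reward table `R`, and the function name `value_iteration` are illustrative assumptions, not taken from the notes. Since each row of `P` sums to 1, $R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')$ is the same quantity as the bracketed expression in the full formula.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (assumption for illustration).
# P[s, a, s2] = probability of landing in s2 after taking action a in state s.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0                     # left in state 0: stay put
P[0, 1, 1], P[0, 1, 0] = 0.8, 0.2    # right in state 0: may slip and stay
P[1, 0, 0] = 1.0                     # left in state 1: back to state 0
P[1, 1, 2], P[1, 1, 1] = 0.8, 0.2    # right in state 1: may slip and stay
P[2, :, 2] = 1.0                     # state 2 is absorbing
R = np.array([[-1.0, -1.0], [-1.0, 10.0], [0.0, 0.0]])  # R[s, a]

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeat the one-step lookahead until the values stop changing."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)      # Q[s, a] for every state-action pair
        V_new = Q.max(axis=1)        # best action value in each state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V, policy = value_iteration(P, R)
print(V, policy)
```

No explicit policy is stored during the iterations; it is read off once at the end with `argmax`, which mirrors the "policy extraction" step.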



### 📊 **Key Variables Explained**

| Symbol | Meaning | Simple Explanation |
|--------|---------|-------------------|
| $V^{\pi}(s)$ | Value of state $s$ under policy $\pi$ | "How good is this state if I follow my current strategy?" |
| $\pi(a\|s)$ | Probability of taking action $a$ in state $s$ | "How likely am I to choose this action here?" |
| $\gamma$ | Discount factor (0 ≤ γ ≤ 1) | "How much do I care about future rewards vs immediate ones?" |
| $P(s'\|s,a)$ | Transition probability | "If I do this action here, where will I end up?" |
| $R(s,a)$ | Reward function | "What reward do I get for this action in this state?" |



### ⚡ **Why Model-Based Learning Matters**

#### **Advantages**:
- **⚡ Sample Efficiency**: No need for trial-and-error - can solve mathematically
- **🎯 Computational Efficiency**: Planning with a known model avoids the cost of collecting experience, although each planning sweep can itself be expensive in large state spaces
- **📈 Theoretical Guarantees**: Provable convergence to optimal policies
- **🔄 Quick Adaptation**: Can immediately adjust to goal changes

#### **Limitations**:[1]
- **🤔 Model Complexity**: Real environments are often too complex to model accurately
- **❓ Unknown Dynamics**: Many real-world scenarios don't provide transition probabilities
- **⚠️ Model Errors**: Wrong model leads to suboptimal policies
- **🔄 Non-Stationary**: Environments that change over time break the model



### 🎮 **When to Use Model-Based vs Model-Free**

| **Use Model-Based When**[4] | **Use Model-Free When** |
|------------------------------|---------------------------|
| Environment rules are known | Environment is complex/unknown |
| Sample efficiency is critical | Can afford many interactions |
| Planning is computationally feasible | Environment changes frequently |
| **Examples**: Chess, Grid worlds | **Examples**: Video games, Robotics |



### 🚀 **Practical Takeaways**

1. **Start Simple**: If you can model the environment, model-based is often faster.
2. **Know Your Limits**: Complex real-world problems usually need model-free approaches.
3. **Hybrid Approaches**: Many modern systems combine both methods.
4. **Simplified Formulas**: Use iterative updates instead of complex summations for easier implementation.

## 1.2 ✔️Model-Free Learning Fundamentals

### 🎯 **Core Concept**

**Model-free learning** is like learning to ride a bike by actually riding it - no instruction manual needed, just **trial and error**.

- **Experience-based**: Learn from sequences of $(s, a, r, s')$ tuples (state, action, reward, next state)
- **No environment model required**: Don't need to know $P(s'|s,a)$ or $R(s,a)$ in advance
- **Direct interaction**: Agent learns by doing, not by thinking



### 🧮 **Mathematical Foundations (Simplified)**

Instead of complex environment models, model-free methods use **direct experience**:

- **Experience tuple**: $(s_t, a_t, r_{t+1}, s_{t+1})$ - "I was here, did this, got this reward, ended up there"
- **Value estimation**: Learn $V(s)$ or $Q(s,a)$ directly from observed rewards
- **Policy learning**: Improve actions based on **actual outcomes**, not predictions



### 🔧 **Core Algorithm Categories**

#### **🔖 Value-Based Methods**
*"Learn how good each action is, then pick the best one"*

##### **Q-Learning** (Most Popular):
**Simplified Formula**:
$$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$

**In Plain English**:
- **Update rule**: "My estimate + learning_rate × (what_actually_happened - my_estimate)"
- **No model needed**: Just observe $(s, a, r, s')$ and update
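- **Policy**: After learning, act greedily with $\pi(s) = \arg\max_a Q(s,a)$; during training, mix in exploration (for example, ε-greedy) so that all actions keep being tried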
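
The update rule can be written as a few lines of tabular code. This is a minimal sketch, assuming integer state and action indices and a NumPy array as the Q-table; the function name `q_learning_update` and the `done` flag are hypothetical choices for this example.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # At a terminal transition there is no future value to bootstrap from
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Usage: a 2-state, 2-action table, one observed transition (s=0, a=1, r=1.0, s'=1)
Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, done=False)
print(Q[0, 1])
```

Only the visited entry `Q[s, a]` changes, which is why the method needs many interactions before the whole table is accurate.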

##### **Deep Q-Networks (DQN)**:
- **Same Q-learning principle** but uses neural networks for complex state spaces
- **Handles high-dimensional inputs** like images (Atari games)
- **Experience replay**: Learn from past experiences multiple times



#### **🔖 Policy-Based Methods**
*"Learn the strategy directly, skip the value estimation"*

##### **REINFORCE Algorithm**:
**Simplified Formula**:
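$$\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot G]$$

Here $G$ is the return (cumulative reward) observed after taking action $a$ in state $s$.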

**In Plain English**:
- **Direct policy optimization**: Adjust policy parameters to maximize rewards
- **Monte Carlo approach**: Use complete episode returns
- **No value function needed**: Just improve policy based on episode outcomes
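
To make the gradient concrete, here is a minimal sketch of one REINFORCE update. It assumes a single-state (bandit-style) softmax policy with one logit per action, which is a simplification chosen to keep the example short; the names `softmax` and `reinforce_step` are hypothetical. For a softmax policy, the gradient of $\log \pi_\theta(a)$ with respect to the logits is the one-hot vector for $a$ minus the action probabilities.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def reinforce_step(theta, action, G, lr=0.1):
    """One REINFORCE update for a single-state softmax policy.

    theta:  logits, one per action (assumed parameterization)
    action: index of the action that was taken
    G:      return observed after taking that action
    """
    grad_log_pi = -softmax(theta)
    grad_log_pi[action] += 1.0          # one-hot(action) - pi
    return theta + lr * G * grad_log_pi

theta = np.zeros(2)
theta = reinforce_step(theta, action=0, G=1.0)
print(softmax(theta))  # probability of action 0 rises after a positive return
```

A positive return pushes probability toward the action taken, while a negative return pushes it away.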



#### **🔖 Actor-Critic Methods**
*"Best of both worlds: Learn values AND policy"*

**Two Components**:
- **Actor**: Learns the policy $\pi(a|s)$ (what to do)
- **Critic**: Learns the value function $V(s)$ (how good is this state)

**Popular Algorithms**: A2C, A3C, PPO



### 📊 **Key Characteristics & Advantages**

#### **✅ Why Model-Free Works**:

| **Advantage** | **Explanation** |
|---------------|----------------|
| **🌍 Real-World Ready** | No need to model complex environments (robotics, games) |
| **🔄 Adapts Naturally** | Handles changing environments without remodeling |
| **🎯 Robust to Uncertainty** | Works even when environment dynamics are unknown |
| **📈 Scalable** | Handles high-dimensional state spaces better |

#### **⚠️ Challenges**:[3]

| **Challenge** | **Impact** |
|---------------|------------|
| **⏰ Sample Inefficiency** | Needs many interactions to learn well |
| **🎲 High Variance** | Learning can be unstable and noisy |
| **⚖️ Exploration-Exploitation** | Hard to balance trying new things vs using known good actions |
| **🔧 Hyperparameter Sensitivity** | Performance depends heavily on tuning |


### 🎮 **Real-World Applications**

#### **Perfect for Model-Free**:
- **🤖 Autonomous Navigation**: Self-driving cars in traffic
- **🎯 Game Playing**: Video games such as Atari (DQN) and Dota 2 (OpenAI Five); board-game systems like AlphaGo additionally plan with the known game rules
- **💰 Financial Trading**: Stock market strategies
- **☁️ Cloud Computing**: Resource allocation and load balancing
- **🏭 Robotics**: Manipulation tasks in unstructured environments

***

### 🆚 **Model-Free vs Model-Based Comparison**

| **Aspect** | **Model-Free** | **Model-Based** |
|------------|----------------|-----------------|
| **Learning Method** | Trial and error | Mathematical planning |
| **Environment Knowledge** | None required | Complete model needed |
| **Sample Efficiency** | Low (needs many samples) | High (few samples needed) |
| **Real-World Suitability** | Excellent | Limited to known environments |
| **Computational Cost** | Low per step | High (planning cost) |
| **Robustness** | High (adapts to changes) | Low (breaks with model errors) |



### 🧠 **Simplified Algorithm Comparison**

#### **Q-Learning Example**:
```
1. Start with random Q-values
2. Take action, observe reward
3. Update: Q(s,a) += α[r + γ×max(Q(s',a')) - Q(s,a)]
4. Repeat until convergence
```

#### **Policy Gradient Example**:
```
1. Start with random policy
2. Run episode, collect rewards
3. Update: Increase probability of good actions
4. Repeat until optimal
```



### 🚀 **Practical Takeaways**

#### **Choose Model-Free When**:
- ✅ Environment is **complex or unknown**
- ✅ Environment **changes over time**
- ✅ You can afford **many interactions**
- ✅ **Robustness to model errors** is more important than sample efficiency
- ✅ Working with **high-dimensional** problems

#### **Implementation Tips**:
1. **Start with Q-Learning** for discrete problems
2. **Use DQN** for complex state spaces
3. **Try Actor-Critic** for continuous actions
4. **Focus on exploration strategies** early in learning
5. **Monitor sample efficiency** - if too slow, consider model-based approaches



### 💡 **The Big Picture**

Model-free reinforcement learning is like **learning to drive in real traffic** rather than studying traffic rules in a classroom. It's messier, takes longer, but ultimately produces agents that can handle **real-world complexity and uncertainty**.

# 2. ⭐**Monte Carlo Methods in Reinforcement Learning**

## 2.1 Core Concepts and Definitions

**Core Concept:**
- Monte Carlo methods
    - Model-free techniques
    - Estimate Q-values based on episodes

### Expanded Explanation:

**Monte Carlo (MC) methods** form a class of model-free reinforcement learning algorithms that estimate action-value functions $Q(s,a)$ by averaging returns observed across multiple episodes.

#### Mathematical Foundations:

**The Concept of Return:**
The **return** (denoted as $G_t$) represents the cumulative discounted reward from time step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Where:
- $R_{t+k+1}$ = Reward received at time step $t+k+1$
- $\gamma$ = Discount factor (0 ≤ γ ≤ 1)
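
The definition also has a recursive form, which is what makes a backward pass over an episode possible. A short worked derivation (the reward values are an illustrative assumption, not data from these notes):

$$G_t = R_{t+1} + \gamma\,(R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}$$

For an episode with rewards $R_1 = 1$, $R_2 = 0$, $R_3 = 2$, discount $\gamma = 0.5$, and termination after $R_3$ (so $G_3 = 0$):

$$G_2 = 2 + 0.5 \cdot 0 = 2, \quad G_1 = 0 + 0.5 \cdot 2 = 1, \quad G_0 = 1 + 0.5 \cdot 1 = 1.5$$

Direct summation gives the same result: $G_0 = 1 + 0.5 \cdot 0 + 0.25 \cdot 2 = 1.5$.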

**Q-Value Estimation:**
MC methods estimate $Q(s,a)$ as the expected return following visits to state-action pair $(s,a)$:
$$Q(s,a) = \mathbb{E}[G_t | S_t = s, A_t = a]$$

#### Why Monte Carlo Methods Matter:
- **Unbiased Estimates**: Converge to true Q-values with infinite episodes
- **Simple Implementation**: Straightforward averaging of observed returns
- **No Bootstrapping**: Don't rely on estimates of other values

#### Long-term Consequences:
MC methods require **complete episodes** to update values, making them:
- Suitable for episodic tasks with clear terminal states
- Less efficient for continuing tasks or very long episodes
- Memory-intensive for storing complete episode histories

## 2.2 Episode Collection and Q-Value Estimation

**Core Concept:** *Collect random episodes → estimate Q-values using MC → derive a greedy (improved) policy*

### Advanced Implementation Details:

#### Core Algorithm:
**Step 1: Episode Generation**
- Initialize environment to random starting state
- Follow policy (initially random) until episode termination
- Record sequence: $(S_0, A_0, R_1), (S_1, A_1, R_2), \ldots, (S_{T-1}, A_{T-1}, R_T)$

**Step 2: Return Calculation**
- For each time step $t$, compute: $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$
- Calculate returns working backwards from terminal state

**Step 3: Q-Value Updates**
- For each $(s,a)$ pair visited, update the running average: $Q(s,a) \leftarrow \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G_i(s,a)$

Where:
- $N(s,a)$ = Number of times $(s,a)$ has been visited
- $G_i(s,a)$ = Return from $i$-th visit to $(s,a)$
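
The average in Step 3 does not require storing every return. A minimal sketch, assuming the returns for a single $(s,a)$ pair arrive one at a time (the function name `incremental_mean` is hypothetical): the update $Q_n = Q_{n-1} + \frac{1}{n}(G_n - Q_{n-1})$ produces the same value as the batch average.

```python
def incremental_mean(returns):
    """Running average of returns for one (s, a) pair, updated one sample at a time."""
    q, n = 0.0, 0
    for g in returns:
        n += 1
        q += (g - q) / n   # move the estimate toward the new sample by 1/n
    return q

samples = [5, 8, 5]
print(incremental_mean(samples), sum(samples) / len(samples))
```

The array-based implementation in section 2.5 keeps a sum and a count instead, which is equivalent; the incremental form is handy when Q-values must be usable after every episode.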

## 2.3 Custom Grid World Environment Example

**Core Concept:** *A custom grid world with states 0-5, plus episode data listing State, Action, Reward, and Return.*

### Expanded Explanation:

#### Environment Setup:
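The custom grid world consists of 6 states (0-5). The transitions in the example episodes (Right from state 3 reaches 4, Up from state 4 reaches 1) are consistent with a 2×3 layout with rows 0-1-2 and 3-4-5. In this environment: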
- Each state allows multiple actions (Left, Down, Right, Up)
- Actions result in immediate rewards (can be negative)
- Episodes terminate when reaching specific goal states

#### Example Episode Analysis:

**Episode 1 Data:**
| State | Action | Reward | Return |
|-------|--------|--------|--------|
| 3     | Right  | -2     | 5      |
| 4     | Left   | -1     | 7      |
| 3     | Right  | -2     | 8      |
| 4     | Right  | 10     | 10     |

**Episode 2 Data:**
| State | Action | Reward | Return |
|-------|--------|--------|--------|
| 3     | Right  | -2     | 5      |
| 4     | Up     | -1     | 7      |
| 1     | Down   | -2     | 8      |
| 4     | Right  | 10     | 10     |

#### Return Calculation Process:
- **Working Backwards**: Calculate returns from episode end to start
- **Cumulative Rewards**: Each return sums the current reward and all later rewards in the episode; the tables above use $\gamma = 1$, so the third row gives $-2 + 10 = 8$
- **State-Action Visits**: Track which $(s,a)$ pairs occur and their associated returns
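
The Return column of Episode 1 can be reproduced with a short backward pass. This is a small sketch, assuming $\gamma = 1$ (which is what the tabulated numbers imply); the helper name `returns_from_rewards` is hypothetical.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Compute G_t for every step by walking the reward list backwards."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g      # G_t = R_{t+1} + gamma * G_{t+1}
        out.append(g)
    out.reverse()              # restore chronological order
    return out

episode_1_rewards = [-2, -1, -2, 10]
print(returns_from_rewards(episode_1_rewards))  # [5.0, 7.0, 8.0, 10.0]
```

Episode 2 has the same reward sequence, so its Return column is identical even though the visited states differ.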

## 2.4 `First-Visit` vs. `Every-Visit` Monte Carlo Methods

- **Core Concept:**
  - Q(3, right) - first-visit Monte Carlo
    - Average the first visit to (s,a) within each episode
  - Q(3, right) - every-visit Monte Carlo
    - Average every visit to (s,a) within each episode

### Expanded Explanation:

#### First-Visit Monte Carlo:
**Definition:** Only the **first occurrence** of each state-action pair $(s,a)$ within an episode contributes to the average return.

**Rule:**
- If $(s,a)$ appears multiple times in an episode, only the return following its first occurrence is used
- Each episode therefore contributes at most one sample per $(s,a)$ pair

- **Example:**
    - For $Q(3, \text{Right})$ with first-visit:
        - Episode 1: First visit return = 5
        - Episode 2: First visit return = 5  
        - **Average: $(5 + 5)/2 = 5$**

#### Every-Visit Monte Carlo:
**Definition:** **Every occurrence** of state-action pair $(s,a)$ within episodes contributes to the average.

**Rule:**
- Multiple visits to $(s,a)$ within the same episode all contribute their returns
- This yields more samples per episode and potentially faster convergence

- **Example:**
    - For $Q(3, \text{Right})$ with every-visit:
        - Episode 1: Returns = 5, 8 (two visits)
        - Episode 2: Return = 5 (one visit)
        - **Average: $(5 + 8 + 5)/3 = 6$**
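
Both averages can be checked directly against the two example episodes. The sketch below is a minimal illustration, assuming $\gamma = 1$ and encoding each step as a `(state, action, reward)` tuple; the function name `mc_estimate` and the `EPISODES` constant are hypothetical.

```python
from collections import defaultdict

# The two example episodes from section 2.3 as (state, action, reward) steps
EPISODES = [
    [(3, "Right", -2), (4, "Left", -1), (3, "Right", -2), (4, "Right", 10)],
    [(3, "Right", -2), (4, "Up", -1), (1, "Down", -2), (4, "Right", 10)],
]

def mc_estimate(episodes, first_visit, gamma=1.0):
    """Average returns per (state, action), using first-visit or every-visit sampling."""
    samples = defaultdict(list)
    for episode in episodes:
        # Backward pass: return following each step
        g, returns = 0.0, []
        for _, _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()

        seen = set()
        for (state, action, _), g_t in zip(episode, returns):
            if first_visit and (state, action) in seen:
                continue  # skip repeat visits within the same episode
            seen.add((state, action))
            samples[(state, action)].append(g_t)
    return {sa: sum(gs) / len(gs) for sa, gs in samples.items()}

print(mc_estimate(EPISODES, first_visit=True)[(3, "Right")])   # 5.0
print(mc_estimate(EPISODES, first_visit=False)[(3, "Right")])  # 6.0
```

The only difference between the two variants is the `seen` check, which matches the one-line difference between the two implementations in section 2.5.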

#### Why the Difference Matters:
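**First-Visit Characteristics:**
- **Unbiased estimates** of true Q-values, since each episode contributes one independent sample per $(s,a)$
- **Fewer samples**: at most one per $(s,a)$ pair per episode
- **Cleaner theoretical analysis**

**Every-Visit Characteristics:**
- **Biased estimates** for a finite number of episodes (especially early in learning), with the bias vanishing as the number of episodes grows
- **More samples** per episode, although returns taken from the same episode are correlated
- **Potentially faster convergence** in practice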

## 2.5 Complete Code Implementation

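```python
def generate_episode():
    """Roll out one episode with uniformly random actions.

    Compact version: relies on a module-level `env`.
    """
    episode = []
    state, info = env.reset()
    terminated = truncated = False
    # Stop on either signal: `truncated` is set when a time limit is reached
    while not (terminated or truncated):
        action = env.action_space.sample()
        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state

    return episode
```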

```python
import numpy as np
import gymnasium as gym

def generate_episode(env, policy=None, max_steps=1000):
    """Generate a single episode using given policy or random actions."""
    episode = []
    state, info = env.reset()
    terminated = False
    truncated = False
    step_count = 0

    while not terminated and not truncated and step_count < max_steps:
        # Use provided policy or random action selection
        if policy is not None:
            action = policy.get(state, env.action_space.sample())
        else:
            action = env.action_space.sample()

        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        step_count += 1

    return episode

def calculate_returns(episode, gamma=1.0):
    """Calculate discounted returns for each step in episode."""
    returns = []
    G = 0

    # Work backwards through episode: G_t = R_{t+1} + gamma * G_{t+1}
    for _, _, reward in reversed(episode):
        G = reward + gamma * G
        returns.append(G)

    # Restore chronological order so returns[i] matches episode[i]
    returns.reverse()
    return returns
```

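```python
def first_visit_mc(num_episodes):
    # Compact version: relies on module-level `num_states`, `num_actions`,
    # and the zero-argument `generate_episode()` from the compact cell above
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))
    for i in range(num_episodes):
        episode = generate_episode()
        visited_states_actions = set()
        for j, (state, action, reward) in enumerate(episode):
            # Count only the first occurrence of (state, action) in this episode
            if (state, action) not in visited_states_actions:
                # Undiscounted return (gamma = 1) from step j to the end
                returns_sum[state, action] += sum([x[2] for x in episode[j:]])
                returns_count[state, action] += 1
                visited_states_actions.add((state, action))
    # Average only where at least one return was recorded
    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q
```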

```python
def first_visit_mc(env, num_episodes, num_states, num_actions, gamma=1.0):
    """First-visit Monte Carlo for estimating Q-values."""
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))

    for episode_num in range(num_episodes):
        # Generate episode
        episode = generate_episode(env)
        returns = calculate_returns(episode, gamma)

        # Track first visits only
        visited_state_actions = set()

        for (state, action, _), G in zip(episode, returns):
            if (state, action) not in visited_state_actions:
                returns_sum[state, action] += G
                returns_count[state, action] += 1
                visited_state_actions.add((state, action))

        # Print progress periodically
        if (episode_num + 1) % 100 == 0:
            print(f"Completed {episode_num + 1}/{num_episodes} episodes")

    # Calculate final Q-values (avoid division by zero)
    nonzero_counts = returns_count > 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]

    return Q

def every_visit_mc(env, num_episodes, num_states, num_actions, gamma=1.0):
    """Every-visit Monte Carlo for estimating Q-values."""
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))

    for episode_num in range(num_episodes):
        episode = generate_episode(env)
        returns = calculate_returns(episode, gamma)

        # Update for every visit (no set tracking needed)
        for (state, action, _), G in zip(episode, returns):
            returns_sum[state, action] += G
            returns_count[state, action] += 1

        if (episode_num + 1) % 100 == 0:
            print(f"Completed {episode_num + 1}/{num_episodes} episodes")

    nonzero_counts = returns_count > 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]

    return Q
```

#### Policy Derivation

```python
def get_policy():
    # Compact version: reads the module-level `Q` table and `num_states`
    policy = {state: np.argmax(Q[state]) for state in range(num_states)}
    return policy
```

```python
def get_policy(Q, num_states):
    """Derive greedy policy from Q-values."""
    policy = {}
    for state in range(num_states):
        # Select action with highest Q-value
        best_action = np.argmax(Q[state])
        policy[state] = best_action
    return policy

def evaluate_policy(env, policy, num_eval_episodes=100):
    """Evaluate policy performance over multiple episodes."""
    total_rewards = []

    for _ in range(num_eval_episodes):
        episode_reward = 0
        state, info = env.reset()
        terminated = False
        truncated = False

        while not terminated and not truncated:
            action = policy.get(state, env.action_space.sample())
            state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward

        total_rewards.append(episode_reward)

    return {
        'mean_reward': np.mean(total_rewards),
        'std_reward': np.std(total_rewards),
        'min_reward': np.min(total_rewards),
        'max_reward': np.max(total_rewards)
    }
```

```python
# Environment setup
env = gym.make('FrozenLake-v1', is_slippery=False)
num_states = env.observation_space.n
num_actions = env.action_space.n

# Run Monte Carlo methods
print("Running First-Visit Monte Carlo...")
Q_first = first_visit_mc(env, 1000, num_states, num_actions)
policy_first = get_policy(Q_first, num_states)

print("Running Every-Visit Monte Carlo...")
Q_every = every_visit_mc(env, 1000, num_states, num_actions)
policy_every = get_policy(Q_every, num_states)

# Evaluate policies
print("Evaluating First-Visit Policy...")
eval_first = evaluate_policy(env, policy_first)
print(f"First-visit policy: {policy_first}")
print(f"Performance: {eval_first}")

print("Evaluating Every-Visit Policy...")
eval_every = evaluate_policy(env, policy_every)
print(f"Every-visit policy: {policy_every}")
print(f"Performance: {eval_every}")

env.close()
```

Running First-Visit Monte Carlo...
Completed 100/1000 episodes
Completed 200/1000 episodes
Completed 300/1000 episodes
Completed 400/1000 episodes
Completed 500/1000 episodes
Completed 600/1000 episodes
Completed 700/1000 episodes
Completed 800/1000 episodes
Completed 900/1000 episodes
Completed 1000/1000 episodes
Running Every-Visit Monte Carlo...
Completed 100/1000 episodes
Completed 200/1000 episodes
Completed 300/1000 episodes
Completed 400/1000 episodes
Completed 500/1000 episodes
Completed 600/1000 episodes
Completed 700/1000 episodes
Completed 800/1000 episodes
Completed 900/1000 episodes
Completed 1000/1000 episodes
Evaluating First-Visit Policy...
First-visit policy: {0: np.int64(1), 1: np.int64(2), 2: np.int64(1), 3: np.int64(0), 4: np.int64(1), 5: np.int64(0), 6: np.int64(1), 7: np.int64(0), 8: np.int64(2), 9: np.int64(1), 10: np.int64(1), 11: np.int64(0), 12: np.int64(0), 13: np.int64(2), 14: np.int64(2), 15: np.int64(0)}
Performance: {'mean_reward': np.float64(1.0), 'std_r