# **✨Monte Carlo Methods**

## Table of Contents

1. [**Introduction to Reinforcement Learning Paradigms**](#1-introduction-to-reinforcement-learning-paradigms)
   - [1.1 Model-Based Learning Recap](#11-model-based-learning-recap)
   - [1.2 Model-Free Learning Fundamentals](#12-model-free-learning-fundamentals)
   - [1.3 Model-Free $\rightarrow$ On-Policy & Off-Policy](#13-model-free--on-policy--off-policy)

2. [**Monte Carlo Methods in Reinforcement Learning**](#2-monte-carlo-methods-in-reinforcement-learning)
   - [2.1 Core Concepts and Definitions](#21-core-concepts-and-definitions)
   - [2.2 Episode Collection and Q-Value Estimation](#22-episode-collection-and-q-value-estimation)
   - [2.3 Custom Grid World Environment Example](#23-custom-grid-world-environment-example)
   - [2.4 First-Visit vs. Every-Visit Monte Carlo Methods](#24-first-visit-vs-every-visit-monte-carlo-methods)
   - [2.5 Complete Code Implementation](#25-complete-code-implementation)



# 1. ⭐**Introduction to Reinforcement Learning Paradigms**

## 1.1 ✔️**Model-Based** Learning Recap

### 🎯 **Core Concept**

**Model-based learning** assumes you have **complete knowledge** of how the environment works, like having the rulebook for a game before you play.
- **Environment dynamics are known**: You understand $P(s'|s,a)$ (transition probabilities) and $R(s,a)$ (reward function)
- **No trial-and-error needed**: Can calculate optimal actions mathematically
- **Planning-based approach**: Think first, act later


### 🧮 **Mathematical Foundations**

When you know the environment model, you can predict:
- **Next state**: Given current state $s$ and action $a$, what's the probability of reaching state $s'$?
- **Expected reward**: What reward do you get for taking action $a$ in state $s$?

This knowledge enables **dynamic programming** techniques for finding optimal policies.


### 🔧 **Core Algorithms**

#### **🔖 Policy Iteration**
*"Improve the policy step by step"*

**Process**: Initialize policy → Evaluate policy → Improve policy → Repeat until optimal

##### **1. Policy Evaluation** (How good is my current policy?)

**Full Formula**:
$$V^{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^{\pi}(s')]$$

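**Simplified Approach** (valid when transitions are deterministic, so action $a$ in state $s$ always yields reward $r$ and next state $s'$):
- **Iterative updates**: $V(s) \leftarrow \sum_a \pi(a|s) [r + \gamma V(s')]$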
- **Stop when values converge**

##### **2. Policy Improvement** (Make the policy better)

**Full Formula**:
$$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V^{\pi}(s')]$$

**Simplified Approach**:
- **Act greedily**: Choose the action with the highest expected value
- $\pi'(s) = \arg\max_a Q^{\pi}(s,a)$, where $Q^{\pi}(s,a) = r + \gamma V^{\pi}(s')$ for deterministic transitions
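The evaluate-then-improve loop can be sketched on a tiny problem. The two-state MDP below (the arrays `P` and `R`, and the discount `gamma = 0.9`) is a made-up example for illustration, not something defined in these notes; the helper names are hypothetical too.

```python
import numpy as np

# Hypothetical toy MDP (an assumption for illustration): 2 states, 2 actions.
# P[s, a, s2] is the transition probability, R[s, a] the immediate reward.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from state 0: action 0 stays, action 1 goes to state 1
    [[1.0, 0.0], [0.0, 1.0]],   # from state 1: action 0 goes to state 0, action 1 stays
])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_evaluation(policy, P, R, gamma, tol=1e-10):
    """Iterate the Bellman expectation backup until V stops changing."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)               # Q[s, a] under the current V
        V_new = np.sum(policy * Q, axis=1)    # average over pi(a|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(V, P, R, gamma):
    """Return the greedy policy as a one-hot matrix pi[s, a]."""
    Q = R + gamma * (P @ V)
    return np.eye(P.shape[1])[np.argmax(Q, axis=1)]

policy = np.full((2, 2), 0.5)   # start from the uniform random policy
while True:
    V = policy_evaluation(policy, P, R, gamma)
    new_policy = policy_improvement(V, P, R, gamma)
    if np.array_equal(new_policy, policy):
        break                   # policy is stable, so it is optimal for this MDP
    policy = new_policy

print(np.argmax(policy, axis=1))   # greedy action per state
print(np.round(V, 3))              # value of each state under the final policy
```

In this toy problem the loop settles on action 1 in both states, with values close to 19 and 20, since staying in state 1 earns 2 per step forever: $2 / (1 - 0.9) = 20$.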



#### **🔖 Value Iteration**
*"Find the best value for each state directly"*

**Process**: Initialize values → Update values by selecting best actions → Repeat until optimal

##### **Full Formula**:
$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]$$

##### **Simplified Understanding**:
- **One-step lookahead**: For each state, try all actions and pick the best
- **Direct optimization**: No separate policy - values directly give you optimal actions
- **Policy extraction**: $\pi(s) = \arg\max_a Q(s,a)$
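A minimal sketch of the value-iteration backup, assuming a small hypothetical MDP stored as NumPy arrays (`P`, `R`, and the function name `value_iteration` are illustrative choices, not part of the original notes):

```python
import numpy as np

# Hypothetical toy MDP (an assumption for illustration): 2 states, 2 actions.
# P[s, a, s2] is the transition probability, R[s, a] the immediate reward.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # from state 0: action 0 stays, action 1 goes to state 1
    [[1.0, 0.0], [0.0, 1.0]],   # from state 1: action 0 goes to state 0, action 1 stays
])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)      # one-step lookahead for every (s, a)
        V_new = Q.max(axis=1)        # keep the best action's value
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and the greedy policy
        V = V_new

V_star, pi_star = value_iteration(P, R)
print(np.round(V_star, 3), pi_star)
```

Unlike policy iteration, no explicit policy is stored during the loop; the greedy policy is read off the final lookahead.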



### 📊 **Key Variables Explained**

| Symbol | Meaning | Simple Explanation |
|--------|---------|-------------------|
| $V^{\pi}(s)$ | Value of state $s$ under policy $\pi$ | "How good is this state if I follow my current strategy?" |
| $\pi(a\|s)$ | Probability of taking action $a$ in state $s$ | "How likely am I to choose this action here?" |
| $\gamma$ | Discount factor ($0 \le \gamma \le 1$) | "How much do I care about future rewards vs immediate ones?" |
| $P(s'\|s,a)$ | Transition probability | "If I do this action here, where will I end up?" |
| $R(s,a)$ | Reward function | "What reward do I get for this action in this state?" |



### ⚡ **Why Model-Based Learning Matters**

#### **Advantages**:
- **⚡ Sample Efficiency**: No need for trial-and-error - can solve mathematically
- **🎯 Computational Efficiency**: Planning is faster than learning through experience  
- **📈 Theoretical Guarantees**: Provable convergence to optimal policies
- **🔄 Quick Adaptation**: Can immediately adjust to goal changes

#### **Limitations**:
- **🤔 Model Complexity**: Real environments are often too complex to model accurately
- **❓ Unknown Dynamics**: Many real-world scenarios don't provide transition probabilities
- **⚠️ Model Errors**: Wrong model leads to suboptimal policies
- **🔄 Non-Stationary**: Environments that change over time break the model



### 🎮 **When to Use Model-Based vs Model-Free**

| **Use Model-Based When** | **Use Model-Free When** |
|------------------------------|---------------------------|
| Environment rules are known | Environment is complex/unknown |
| Sample efficiency is critical | Can afford many interactions |
| Planning is computationally feasible | Environment changes frequently |
| **Examples**: Chess, Grid worlds | **Examples**: Video games, Robotics |



### 🚀 **Practical Takeaways**

1. **Start Simple**: If you can model the environment, model-based is often faster.
2. **Know Your Limits**: Complex real-world problems usually need model-free approaches.
3. **Hybrid Approaches**: Many modern systems combine both methods.
4. **Simplified Formulas**: Use iterative updates instead of complex summations for easier implementation.

## 1.2 ✔️**Model-Free** Learning Fundamentals

### 🎯 **Core Concept**

**Model-free learning** is like learning to ride a bike by actually riding it: no instruction manual needed, just **trial and error**.

- **Experience-based**: Learn from sequences of $(s, a, r, s')$ tuples (`state`, `action`, `reward`, `next_state`)
- **No environment model required**: Don't need to know $P(s'|s,a)$ or $R(s,a)$ in advance
- **Direct interaction**: Agent learns by doing, not by thinking



### 🧮 **Mathematical Foundations (Simplified)**

Instead of complex environment models, model-free methods use **direct experience**:

- **Experience tuple**: $(s_t, a_t, r_{t+1}, s_{t+1})$ - "I was here, did this, got this reward, ended up there"
- **Value estimation**: Learn $V(s)$ or $Q(s,a)$ directly from observed rewards
- **Policy learning**: Improve actions based on **actual outcomes**, not predictions



### 🔧 **Core Algorithm Categories**

#### **🔖 Value-Based Methods**
*"Learn how good each action is, then pick the best one"*

##### **⭐Q-Learning** (Most Popular):
**Simplified Formula**:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

**In Plain English**:
- **Update rule**: "New estimate = old estimate + learning_rate × (target - old estimate)", where the target is $r + \gamma \max_{a'} Q(s',a')$
- **No model needed**: Just observe $(s, a, r, s')$ and update
- **Policy**: After learning, act greedily with $\pi(s) = \arg\max_a Q(s,a)$; during learning, some exploration (e.g. $\epsilon$-greedy) is still needed
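A single tabular update can be written in a few lines. This is a minimal sketch: the function name `q_learning_update`, the table sizes, and the sample numbers are assumptions chosen for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # A terminal transition has no future value to bootstrap from
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((3, 2))      # 3 states, 2 actions (illustrative sizes)
Q[1] = [0.5, 2.0]         # pretend state 1 already has some estimates
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(td, Q[0, 1])        # TD error is about 2.8, new Q[0, 1] is about 0.28
```

Here the target is $1 + 0.9 \times 2 = 2.8$, and the estimate moves 10% of the way from 0 towards it.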

##### **⭐Deep Q-Networks (DQN)**:
- **Same Q-learning principle** but uses neural networks for complex state spaces
- **Handles high-dimensional inputs** like images (Atari games)
- **Experience replay**: Learn from past experiences multiple times
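Experience replay can be sketched as a bounded buffer with uniform sampling. The class below is a simplified illustration (the name `ReplayBuffer` and its methods are hypothetical, not the API of any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new experience
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(s=i, a=0, r=0.0, s_next=i + 1, done=False)

print(len(buf))        # 3: only the most recent transitions are kept
print(buf.sample(2))   # two transitions drawn at random
```

Sampling at random, rather than replaying transitions in order, reduces the correlation between consecutive training samples.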



#### **🔖 Policy-Based Methods**
*"Learn the strategy directly, skip the value estimation"*

##### **REINFORCE Algorithm**:
**Simplified Formula**:
$$\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot R]$$

**In Plain English**:
- **Direct policy optimization**: Adjust policy parameters to maximize rewards
- **Monte Carlo approach**: Use complete episode returns
- **No value function needed**: Just improve policy based on episode outcomes



#### **🔖 Actor-Critic Methods**
*"Best of both worlds: Learn values AND policy"*

**Two Components**:
- **Actor**: Learns the `policy` $\pi(a|s)$ (what to do)
- **Critic**: Learns the `value function` $V(s)$ (how good is this state)

**Popular Algorithms**: 
- **A2C** – (`Advantage Actor-Critic`) A synchronous actor-critic method that uses the advantage function to reduce variance in policy gradient updates. All workers collect experience in parallel and update the model together.

- **A3C** – (`Asynchronous Advantage Actor-Critic`) The asynchronous counterpart of A2C, in which multiple workers run in parallel and update the global model independently. Running workers in parallel decorrelates their experiences, which helps training stability.

- **PPO** – (`Proximal Policy Optimization`) A policy gradient method that uses a clipped objective to prevent large updates, making training more stable and sample-efficient. It's widely used in modern RL applications due to its simplicity and robustness.



### 📊 **Key Characteristics & Advantages**

#### **✅ Why Model-Free Works**:

| **Advantage** | **Explanation** |
|---------------|----------------|
| **🌍 Real-World Ready** | No need to model complex environments (robotics, games) |
| **🔄 Adapts Naturally** | Handles changing environments without remodeling |
| **🎯 Robust to Uncertainty** | Works even when environment dynamics are unknown |
| **📈 Scalable** | Handles high-dimensional state spaces better |

#### **⚠️ Challenges**:

| **Challenge** | **Impact** |
|---------------|------------|
| **⏰ Sample Inefficiency** | Needs many interactions to learn well |
| **🎲 High Variance** | Learning can be unstable and noisy |
| **⚖️ Exploration-Exploitation** | Hard to balance trying new things vs using known good actions |
| **🔧 Hyperparameter Sensitivity** | Performance depends heavily on tuning |


### 🎮 **Real-World Applications**

#### **Perfect for Model-Free**:
- **🤖 Autonomous Navigation**: Self-driving cars in traffic
- **🎯 Game Playing**: Video games such as Atari (DQN) and Dota 2 (OpenAI Five); board-game systems like AlphaGo also combine learned values and policies with tree search over the known rules
- **💰 Financial Trading**: Stock market strategies
- **☁️ Cloud Computing**: Resource allocation and load balancing
- **🏭 Robotics**: Manipulation tasks in unstructured environments

***

### 🆚 **Model-Free vs Model-Based Comparison**

| **Aspect** | **Model-Free** | **Model-Based** |
|------------|----------------|-----------------|
| **Learning Method** | Trial and error from experience | Planning with learned/given model |
| **Environment Knowledge** | None required | Requires environment model (can be learned/approximate) |
| **Sample Efficiency** | Low (needs many real samples) | High (leverages model simulations) |
| **Real-World Suitability** | Good when samples are cheap | Better when real samples are expensive/risky |
| **Computational Cost** | Cheap action selection once trained, costly training | Planning can be costly at every decision step |
| **Robustness** | Robust to model errors, sensitive to distribution shift | Sensitive to model errors, can handle uncertainty well |
| **Convergence** | Guaranteed under certain conditions | Depends on model accuracy |
| **Exploration** | Direct exploration in environment | Can explore safely in model |
| **Interpretability** | Policy decisions less transparent | Planning process more interpretable |


### 🧠 **Simplified Algorithm Comparison**

#### **Q-Learning Example**:

1. Start with random Q-values
2. Take action, observe reward
3. Update: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
4. Repeat until convergence


#### **Policy Gradient Example**:

1. Start with random policy
2. Run episode, collect rewards
3. Update: Increase probability of good actions
4. Repeat until optimal
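The "increase probability of good actions" step can be made concrete with a softmax policy on a two-armed bandit. Everything here is an illustrative assumption: the reward means, the learning rate, and the one-step "episodes" are chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # one preference per action
true_means = np.array([0.0, 1.0])    # hypothetical bandit: action 1 pays more
alpha = 0.1                          # learning rate

def softmax(x):
    z = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                       # sample an action from the policy
    reward = true_means[a] + rng.normal(scale=0.1)   # noisy reward
    # For a softmax policy, grad of log pi(a) w.r.t. theta is one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * reward * grad_log_pi            # REINFORCE-style update

print(softmax(theta))   # probability mass should concentrate on action 1
```

Actions followed by high rewards get their log-probability pushed up in proportion to the reward, which is the core of the policy-gradient update.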




### 🚀 **Practical Takeaways**

#### **Choose Model-Free When**:
- ✅ Environment is **complex or unknown**
- ✅ Environment **changes over time**
- ✅ You can afford **many interactions**
- ✅ **Avoiding model error** matters more than sample efficiency
- ✅ Working with **high-dimensional** problems

#### **Implementation Tips**:
1. **Start with Q-Learning** for discrete problems
2. **Use DQN** for complex state spaces
3. **Try Actor-Critic** for continuous actions
4. **Focus on exploration strategies** early in learning
5. **Monitor sample efficiency** - if too slow, consider model-based approaches

## 1.3 ✨**Model-Free** $\rightarrow$ `On-Policy` & `Off-Policy`

### On-policy vs off-policy
- **On-policy**: 
  - Learns and improves the same policy that collects the data (behavior policy = target policy).
  - Agent learns by using the same strategy it's trying to improve. Like learning to drive by actually driving the car yourself.
    - `SARSA`
    - `A2C/A3C`
    - `PPO`
    - `Monte Carlo policy gradient` (REINFORCE).
  - **On-policy analogy**: You improve your recipe by cooking it yourself each time
  - **Real-world applications**: Fine-tuning language models such as ChatGPT with PPO, where a reward model trained on human feedback scores fresh samples from the current policy; robot learning, where safety requires predictable policy improvements; and game AI, where direct policy optimization works well.

- **Off-policy**: 
  - Learns about a different target policy using data from a behavior policy (including replay of past data).
  - Agent learns by observing different strategies or past experiences. Like learning to drive by watching driving videos or using a driving simulator with recorded data.
    - `Q-learning`
    - `DQN`
    - `DDPG`
    - `TD3`
    - `SAC`
  - **Off-policy analogy**: You improve your recipe by studying cooking videos, past attempts, and other chefs' techniques
  - **Real-world applications**: Netflix recommendations trained on massive logged user interactions, autonomous driving using years of driving data, and trading algorithms learning from historical market data.

### Practical Differences 
- **Stability vs Efficiency**  
  - **On-policy** methods are more stable because they learn directly from the current behavior.  
  - **Off-policy** methods are more efficient with data because they reuse old experiences (via replay buffers), but this can make learning less stable.

- **Data Usage**  
  - **On-policy** discards past experiences after updates and uses only fresh data.  
  - **Off-policy** keeps and reuses past experiences, even from older or different policies, to learn better and faster.

- **Corrections for Learning**  
  - Off-policy methods often use techniques like importance sampling to adjust for differences between the behavior policy (that generated the data) and the target policy (being learned).
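As a rough illustration of the importance-sampling correction, the weight for a trajectory is the product of per-step probability ratios between the target policy $\pi$ and the behavior policy $b$. The probabilities and the return below are made-up numbers, and `importance_weight` is a hypothetical helper:

```python
import numpy as np

def importance_weight(pi_probs, b_probs):
    """rho = prod_t pi(a_t|s_t) / b(a_t|s_t) for the actions actually taken."""
    return float(np.prod(np.asarray(pi_probs) / np.asarray(b_probs)))

# Hypothetical 3-step trajectory: probability each policy assigned to the action taken
pi_probs = [0.9, 0.8, 1.0]   # target policy (mostly greedy)
b_probs = [0.5, 0.5, 0.5]    # behavior policy (uniform over two actions)
G = 10.0                     # return observed while following the behavior policy

rho = importance_weight(pi_probs, b_probs)
print(rho, rho * G)          # weight is about 5.76, weighted return about 57.6
```

Averaging such weighted returns over many trajectories gives an estimate of the expected return under the target policy. If the target policy would never have taken one of the actions, the weight is zero and the trajectory contributes nothing.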


### Quick comparison table

| Algorithm | Method family | Policy type |
|---|---|---|
| `SARSA` | Value-based TD control | On-policy |
| `Expected SARSA` | Value-based TD control | On-policy |
| `PPO` | Policy-gradient actor-critic | On-policy  |
| `REINFORCE` | Policy gradient | On-policy |
| `A2C/A3C` | Actor-critic | On-policy |
| `Q-learning` | Value-based TD control | Off-policy |
| `DQN` (incl. Double DQN/Rainbow) | Deep value-based | Off-policy|
| `DDPG` | Deterministic actor-critic | Off-policy  |
| `TD3` | Deterministic actor-critic | Off-policy  |
| `SAC` | Stochastic actor-critic (entropy) | Off-policy |

### When to Use Each (Simple Explanation)

- **Use On-policy methods (like PPO, A2C)** when:  
  - You need **stable** and **predictable learning**.  
  - The environment allows you to **reset or collect fresh data easily**.  
  - Safety or reliable behavior is important (e.g., training robots, fine-tuning language models).

- **Use Off-policy methods (like DQN, TD3, SAC)** when:  
  - **Collecting new data is expensive or slow**.  
  - You have access to **large amounts of past recorded data**.  
  - You want to **make the most out of available data** by reusing it (e.g., recommendation systems, autonomous driving, financial trading).


# 2. ⭐**Monte Carlo Methods in Reinforcement Learning**

## 2.1 Core Concepts and Definitions

### Monte Carlo Methods in Reinforcement Learning

- Monte Carlo (MC) methods are **model-free**, meaning they do not need to know how the environment works (no model of transition probabilities or rewards).
- They estimate the value of actions $Q(s,a)$ by **averaging the total rewards** (returns) collected over many complete episodes.



### Core Idea

- The **return** $G_t$ at a time step $t$ is the total discounted reward received from that time onward:
  $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
  - $R_{t+k+1}$ is the reward at step $t+k+1$
  - $\gamma$ (between 0 and 1) is the discount factor, controlling how future rewards are valued

- MC methods estimate the action-value function $Q(s,a)$ as the average $G_t$ observed whenever the agent visits state $s$ and takes action $a$:
    $$Q(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$$
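To see how $\gamma$ shapes the return, here is a small sketch that evaluates the sum directly for a finished episode. The reward list is an arbitrary example and `discounted_return` is an illustrative helper name:

```python
def discounted_return(rewards, gamma):
    """G_0 for an episode whose rewards are [R_1, R_2, ..., R_T]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))   # 3.0  (all rewards count equally)
print(discounted_return(rewards, gamma=0.5))   # 1.75 (1 + 0.5 + 0.25)
print(discounted_return(rewards, gamma=0.0))   # 1.0  (only the immediate reward)
```

A value of $\gamma$ near 1 makes the agent far-sighted, while $\gamma = 0$ makes it care only about the next reward.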



### Why Monte Carlo Methods Are Useful

- With first-visit averaging they provide **unbiased estimates** of the $Q$ values of the policy that generated the episodes, and the averages converge to the true values as the number of episodes grows.
- They are **simple to implement** because they update values by averaging actual observed outcomes.
- They do **not rely on bootstrapping** (unlike Temporal Difference methods), so they don't update estimates using other estimated values.



### Limitations and Practical Considerations

- MC methods need **complete episodes** to compute returns: they update values only after an episode ends.
- Therefore, they work best on **episodic tasks** with clear end points, like games or robot tasks with defined goals.
- They can be less efficient and more memory-heavy for **long or continuing tasks** because they store full histories of rewards and actions for each episode.



### Real-world Example

Some simple real-world examples of Monte Carlo methods in reinforcement learning:

- **Robot Navigation**: A robot explores a room trying different paths until it reaches a goal. After each complete trip, the robot looks back at the total rewards (like how fast or obstacle-free the path was) to improve its route choices. Over many trips, it learns the best paths without needing to know the map in advance.

- **Game Playing (e.g., Chess or Go)**: An AI plays many complete games. After each game ends, it reviews moves made and the final result (win/loss), updating how valuable each move was. This helps improve strategy over time by averaging results from many full games.

- **Inventory Management**: A system manages stock levels in a warehouse. It runs through full cycles of ordering and sales (episodes) and updates decisions on how much stock to keep based on cumulative profit or loss measured at the end of each cycle.

## 2.2 Episode Collection and Q-Value Estimation

### Core Concept  
- Collect random episodes by interacting with the environment.  
- Estimate Q-values $Q(s,a)$ using Monte Carlo (MC) by averaging returns after episodes.  
- Update policy towards the optimal by improving Q-values over time.


### Advanced Steps in Monte Carlo Reinforcement Learning

#### Step 1: Episode Generation  
- Start from a random initial state in the environment.  
- Follow the current policy (usually starting random) until the episode finishes.  
- Record the sequence of states, actions, and rewards:  
  
  $$(S_0, A_0, R_1), (S_1, A_1, R_2), \ldots, (S_{T-1}, A_{T-1}, R_T)$$

#### Step 2: Return Calculation  
- For each time step $t$ in the episode, compute the return $G_t$ as the discounted sum of rewards from $t+1$ to the end $T$:  
  
  $$G_t = \sum_{k=t+1}^{T} \gamma^{k-(t+1)} R_k$$

- Calculate returns by working backwards from the episode's terminal state.
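Working backwards is efficient because of a one-step recursion that follows directly from the definition of $G_t$, taking $G_T = 0$ at the terminal step:

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \\
    &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots + \gamma^{T-t-2} R_T \right) \\
    &= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$

Once $G_{t+1}$ is known, each earlier return costs only one multiplication and one addition.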

#### Step 3: Q-Value Update  
- For every visited state-action pair $(s, a)$, update the estimate $Q(s, a)$ by averaging all returns $G_i$ observed on the $N(s,a)$ visits so far:  

  $$Q(s, a) \gets \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} G_i(s, a)$$

- Here,  
  - $N(s,a)$ is the count of times $(s,a)$ occurred.  
  - $G_i(s,a)$ is the return observed at the $i^{th}$ visit to $(s,a)$.


This procedure repeats over many episodes, gradually improving the estimates of $Q$-values and thus the policy.
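The average in Step 3 does not require storing every return. A running mean gives the same result; the sketch below assumes a 6-state, 4-action table and that index 2 stands for the action Right (both are illustrative assumptions).

```python
import numpy as np

def incremental_update(Q, N, s, a, G):
    """Fold one new return into the running average for (s, a)."""
    N[s, a] += 1
    Q[s, a] += (G - Q[s, a]) / N[s, a]   # new_mean = old_mean + (x - old_mean) / n

Q = np.zeros((6, 4))   # table sizes are illustrative
N = np.zeros((6, 4))
observed = [5.0, 8.0, 5.0]               # three returns seen for the same pair
for G in observed:
    incremental_update(Q, N, s=3, a=2, G=G)

print(Q[3, 2], np.mean(observed))        # both are 6.0
```

The estimate moves from 5 to 6.5 to 6, matching the batch average $(5 + 8 + 5)/3$ at every step where both are defined.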

## 2.3 Custom Grid World Environment Example

### Core Concept
- A custom grid world example shows states 0 to 5 with episode data displaying State, Action, Reward, and Return at each step.

### Expanded Explanation

#### Environment Setup  
- The grid world has 6 states (0 to 5) arranged in a specific layout.  
- Each state lets the agent choose among actions: Left, Down, Right, Up.  
- Actions lead to immediate rewards (which can be positive or negative).  
- Episodes end when the agent reaches certain goal states.



#### Example Episodes  

**Episode 1 Data:**

| State | Action | Reward | Return |
|-------|--------|--------|--------|
| 3     | Right  | -2     | 5      |
| 4     | Left   | -1     | 7      |
| 3     | Right  | -2     | 8      |
| 4     | Right  | 10     | 10     |

**Episode 2 Data:**

| State | Action | Reward | Return |
|-------|--------|--------|--------|
| 3     | Right  | -2     | 5      |
| 4     | Up     | -1     | 7      |
| 1     | Down   | -2     | 8      |
| 4     | Right  | 10     | 10     |

#### Return Calculation Process

- The return is calculated **backwards from the episode's end**, summing rewards step-by-step.  
- Each return is the cumulative sum of all future rewards from that time step onward; the tables use $\gamma = 1$ (no discounting), e.g. the first row's return is $-2 - 1 - 2 + 10 = 5$.  
- Keep track of every state-action pair $(s, a)$ visited and associate these with their returns for updating value estimates.
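The Return column of the tables can be reproduced with a single backward pass over the Reward column. This is a small sketch with an illustrative helper name, using $\gamma = 1$ to match the tables:

```python
def returns_backward(rewards, gamma=1.0):
    """Return [G_0, G_1, ...] computed in one backward pass."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
        returns.append(G)
    returns.reverse()            # restore chronological order
    return returns

episode_rewards = [-2, -1, -2, 10]           # Reward column of either example episode
print(returns_backward(episode_rewards))     # [5.0, 7.0, 8.0, 10.0]
```

Both example episodes share the same reward sequence, which is why their Return columns are identical even though the visited states differ.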

## 2.4 `First-Visit` vs. `Every-Visit` Monte Carlo Methods

- **Core Concept:**
  - $Q(3, \text{Right})$ with first-visit Monte Carlo
    - Average the return from the first visit to $(s,a)$ within each episode
  - $Q(3, \text{Right})$ with every-visit Monte Carlo
    - Average the returns from every visit to $(s,a)$ within each episode

### Expanded Explanation:

#### First-Visit Monte Carlo:
**Definition:** Only the **first occurrence** of each state-action pair $(s,a)$ within an episode contributes to the average return.

**Key Properties:**
- If $(s,a)$ appears multiple times in an episode, only the return from its first occurrence is used
- Each episode contributes at most one sample per $(s,a)$ pair

- **Example:**
    - For $Q(3, \text{Right})$ with first-visit:
        - Episode 1: First visit return = 5
        - Episode 2: First visit return = 5  
        - **Average: $(5 + 5)/2 = 5$**

#### Every-Visit Monte Carlo:
**Definition:** **Every occurrence** of state-action pair $(s,a)$ within episodes contributes to the average.

**Key Properties:**
- Multiple visits within the same episode all contribute
- More samples per episode, potentially faster convergence

- **Example:**
    - For $Q(3, \text{Right})$ with every-visit:
        - Episode 1: Returns = 5, 8 (two visits)
        - Episode 2: Return = 5 (one visit)
        - **Average: $(5 + 8 + 5)/3 = 6$**

#### Why the Difference Matters:
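**First-Visit Characteristics:**
- **Unbiased estimates** of true Q-values, since each episode contributes at most one sample per $(s,a)$
- **Fewer samples** per episode
- **Cleaner theoretical analysis**

**Every-Visit Characteristics:**
- **Biased estimates** for a finite number of episodes, because returns from the same episode are correlated; the bias vanishes as the number of episodes grows
- **More samples** per episode
- **Potentially faster convergence** in practice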
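The two averages for $Q(3, \text{Right})$ can be checked directly against the example episodes. The sketch below copies the (state, action, return) rows from the two tables; the function name `mc_estimate` is an illustrative choice.

```python
from collections import defaultdict

# (state, action, return) rows copied from the two example episodes
episode_1 = [(3, "Right", 5), (4, "Left", 7), (3, "Right", 8), (4, "Right", 10)]
episode_2 = [(3, "Right", 5), (4, "Up", 7), (1, "Down", 8), (4, "Right", 10)]

def mc_estimate(episodes, first_visit):
    """Average returns per (state, action), using first or every visit."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        seen = set()
        for state, action, G in episode:
            if first_visit and (state, action) in seen:
                continue                      # skip repeat visits within this episode
            seen.add((state, action))
            totals[(state, action)] += G
            counts[(state, action)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}

episodes = [episode_1, episode_2]
print(mc_estimate(episodes, first_visit=True)[(3, "Right")])    # 5.0
print(mc_estimate(episodes, first_visit=False)[(3, "Right")])   # 6.0
```

The only difference between the two variants is the `seen` check, which drops the second visit to (3, Right) in Episode 1 and its return of 8.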

## 2.5 Complete Code Implementation

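```python
# Minimal version: relies on a global `env` (created in the setup cell below)
def generate_episode():
    episode = []
    state, info = env.reset()
    terminated = truncated = False
    # Stop on a terminal state or on a time-limit truncation
    while not (terminated or truncated):
        action = env.action_space.sample()  # uniformly random action
        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state

    return episode
```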

```python
import numpy as np
import gymnasium as gym

def generate_episode(env, policy=None, max_steps=1000):
    """Generate a single episode using given policy or random actions."""
    episode = []
    state, info = env.reset()
    terminated = False
    truncated = False
    step_count = 0

    while not terminated and not truncated and step_count < max_steps:
        # Use provided policy (dict: state -> action) or random action selection
        if policy is not None:
            action = policy.get(state, env.action_space.sample())
        else:
            action = env.action_space.sample()

        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        step_count += 1

    return episode

def calculate_returns(episode, gamma=1.0):
    """Calculate discounted returns for each step in episode."""
    returns = []
    G = 0

    # Work backwards through episode: G_t = R_{t+1} + gamma * G_{t+1}
    for _, _, reward in reversed(episode):
        G = reward + gamma * G
        returns.append(G)

    returns.reverse()  # restore chronological order
    return returns
```

```python
# Minimal version: relies on global `env`, `num_states`, and `num_actions`
def first_visit_mc(num_episodes):
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))
    for i in range(num_episodes):
        episode = generate_episode()
        visited_states_actions = set()
        for j, (state, action, reward) in enumerate(episode):
            # Only the first occurrence of (state, action) in this episode counts
            if (state, action) not in visited_states_actions:
                # Undiscounted return: sum of rewards from step j to the end
                returns_sum[state, action] += sum(x[2] for x in episode[j:])
                returns_count[state, action] += 1
                visited_states_actions.add((state, action))
    # Average only over pairs that were actually visited
    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q
```

```python
def first_visit_mc(env, num_episodes, num_states, num_actions, gamma=1.0):
    """First-visit Monte Carlo for estimating Q-values."""
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))

    for episode_num in range(num_episodes):
        # Generate episode
        episode = generate_episode(env)
        returns = calculate_returns(episode, gamma)

        # Track first visits only
        visited_state_actions = set()

        for (state, action, reward), G in zip(episode, returns):
            if (state, action) not in visited_state_actions:
                returns_sum[state, action] += G
                returns_count[state, action] += 1
                visited_state_actions.add((state, action))

        # Print progress periodically
        if (episode_num + 1) % 100 == 0:
            print(f"Completed {episode_num + 1}/{num_episodes} episodes")

    # Calculate final Q-values (avoid division by zero)
    nonzero_counts = returns_count > 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]

    return Q

def every_visit_mc(env, num_episodes, num_states, num_actions, gamma=1.0):
    """Every-visit Monte Carlo for estimating Q-values."""
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))

    for episode_num in range(num_episodes):
        episode = generate_episode(env)
        returns = calculate_returns(episode, gamma)

        # Update for every visit (no set tracking needed)
        for (state, action, reward), G in zip(episode, returns):
            returns_sum[state, action] += G
            returns_count[state, action] += 1

        if (episode_num + 1) % 100 == 0:
            print(f"Completed {episode_num + 1}/{num_episodes} episodes")

    nonzero_counts = returns_count > 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]

    return Q
```

#### Policy Derivation

```python
# Minimal version: relies on a global Q-table `Q` and `num_states`
def get_policy():
    policy = {state: np.argmax(Q[state]) for state in range(num_states)}
    return policy
```

```python
def get_policy(Q, num_states):
    """Derive greedy policy from Q-values."""
    policy = {}
    for state in range(num_states):
        # Select action with highest Q-value
        best_action = np.argmax(Q[state])
        policy[state] = best_action
    return policy

def evaluate_policy(env, policy, num_eval_episodes=100):
    """Evaluate policy performance over multiple episodes."""
    total_rewards = []

    for _ in range(num_eval_episodes):
        episode_reward = 0
        state, info = env.reset()
        terminated = False
        truncated = False

        while not terminated and not truncated:
            action = policy.get(state, env.action_space.sample())
            state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward

        total_rewards.append(episode_reward)

    return {
        'mean_reward': np.mean(total_rewards),
        'std_reward': np.std(total_rewards),
        'min_reward': np.min(total_rewards),
        'max_reward': np.max(total_rewards)
    }
```

```python
# Environment setup
env = gym.make('FrozenLake-v1', is_slippery=False)
num_states = env.observation_space.n
num_actions = env.action_space.n

# Run Monte Carlo methods
print("Running First-Visit Monte Carlo...")
Q_first = first_visit_mc(env, 1000, num_states, num_actions)
policy_first = get_policy(Q_first, num_states)

print("Running Every-Visit Monte Carlo...")
Q_every = every_visit_mc(env, 1000, num_states, num_actions)
policy_every = get_policy(Q_every, num_states)

# Evaluate policies
print("Evaluating First-Visit Policy...")
eval_first = evaluate_policy(env, policy_first)
print(f"First-visit policy: {policy_first}")
print(f"Performance: {eval_first}")

print("Evaluating Every-Visit Policy...")
eval_every = evaluate_policy(env, policy_every)
print(f"Every-visit policy: {policy_every}")
print(f"Performance: {eval_every}")

env.close()
```

Running First-Visit Monte Carlo...
Completed 100/1000 episodes
Completed 200/1000 episodes
Completed 300/1000 episodes
Completed 400/1000 episodes
Completed 500/1000 episodes
Completed 600/1000 episodes
Completed 700/1000 episodes
Completed 800/1000 episodes
Completed 900/1000 episodes
Completed 1000/1000 episodes
Running Every-Visit Monte Carlo...
Completed 100/1000 episodes
Completed 200/1000 episodes
Completed 300/1000 episodes
Completed 400/1000 episodes
Completed 500/1000 episodes
Completed 600/1000 episodes
Completed 700/1000 episodes
Completed 800/1000 episodes
Completed 900/1000 episodes
Completed 1000/1000 episodes
Evaluating First-Visit Policy...
First-visit policy: {0: np.int64(1), 1: np.int64(2), 2: np.int64(1), 3: np.int64(0), 4: np.int64(1), 5: np.int64(0), 6: np.int64(1), 7: np.int64(0), 8: np.int64(2), 9: np.int64(1), 10: np.int64(1), 11: np.int64(0), 12: np.int64(0), 13: np.int64(2), 14: np.int64(2), 15: np.int64(0)}
Performance: {'mean_reward': np.float64(1.0), 'std_r