# **Summary of `05` & `06` Notebooks**

## **📑 Table of Contents**

1. [Part A: Monte Carlo Methods](#Part-A:-Monte-Carlo-Methods)
2. [Part B: Temporal Difference Learning (SARSA)](#Part-B:-Temporal-Difference-Learning-(SARSA))
3. [Part C: Q-Learning](#Part-C:-Q-Learning)

## **Part A: Monte Carlo Methods**

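- Model-free: learns from sampled experience, with no model of the environment's transitions or rewards
- Estimates Q-values by averaging the returns observed over complete episodes
- Two main approaches: first-visit MC (counts only the first occurrence of each state-action pair in an episode) and every-visit MC (counts every occurrence)
- Requires episodic tasks, because a return is only known once the episode has ended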
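
The difference between the two approaches can be sketched on a single hand-written episode. The helper `mc_returns` and the toy episode below are hypothetical names introduced only for illustration; returns are undiscounted, matching the implementation in this section.

```python
from collections import defaultdict

def mc_returns(episode, first_visit=True):
    """Collect undiscounted returns per (state, action) pair from one episode."""
    returns = defaultdict(list)
    seen = set()
    for t, (state, action, _reward) in enumerate(episode):
        if first_visit and (state, action) in seen:
            continue  # first-visit MC ignores repeat occurrences
        seen.add((state, action))
        # Return from step t onwards = sum of the remaining rewards
        returns[(state, action)].append(sum(r for _, _, r in episode[t:]))
    return dict(returns)

# Hypothetical episode in which the pair (state=0, action=1) occurs twice
toy_episode = [(0, 1, -1), (1, 0, -1), (0, 1, -1), (2, 1, 10)]

first = mc_returns(toy_episode, first_visit=True)
every = mc_returns(toy_episode, first_visit=False)
print(first[(0, 1)])  # [7]
print(every[(0, 1)])  # [7, 9]
```

First-visit MC records one return for the repeated pair, while every-visit MC records one return per occurrence.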

In [2]:
# ==================================================================================================
# PART A: MONTE CARLO METHODS - EPISODE GENERATION
# ==================================================================================================
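import numpy as np
import gymnasium as gym

# The functions below rely on env, num_states and num_actions.
# Assumption: the same FrozenLake setup as in Part B is used here.
env = gym.make("FrozenLake-v1", is_slippery=False)
num_states = env.observation_space.n
num_actions = env.action_space.n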

def generate_episode():
    """Generate one complete episode using uniformly random actions."""
    episode = []
    state, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # Random action selection
        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        # Stop on a terminal state or when the time limit truncates the episode
        done = terminated or truncated
    return episode

# Example episode structure: [(state, action, reward), ...]

# ==================================================================================================
# FIRST-VISIT MONTE CARLO IMPLEMENTATION
# ==================================================================================================
def first_visit_mc(num_episodes):
    """First-visit Monte Carlo method for Q-value estimation"""
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))
    
    for i in range(num_episodes):
        episode = generate_episode()
        visited_states_actions = set()  # Track first visits only
        
        for j, (state, action, reward) in enumerate(episode):
            if (state, action) not in visited_states_actions:
                # Undiscounted return (gamma = 1) from this step onwards
                returns_sum[state, action] += sum([x[2] for x in episode[j:]])
                returns_count[state, action] += 1
                visited_states_actions.add((state, action))
    
    # Calculate Q-values as average returns
    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q 
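
The implementation above re-sums the remaining rewards at every step and implicitly uses a discount factor of 1. As a minimal sketch, the returns for all steps can also be computed in a single backward pass that supports discounting; `discounted_returns` is a hypothetical helper, not part of the original notebooks.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, computed in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0, 0, 1], gamma=1.0))  # [1.0, 1.0, 1.0]
print(discounted_returns([0, 0, 1], gamma=0.5))  # [0.25, 0.5, 1.0]
```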


## **Part B: Temporal Difference Learning (SARSA)**

- TD learning updates the Q-table after every step, without waiting for the episode to end
- SARSA is an on-policy method (it learns the value of the policy that actually selects its actions)
- Better suited than Monte Carlo to long or non-terminating tasks
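
Because SARSA evaluates the policy it follows, the quality of the learned Q-table depends on how actions are chosen. The training loop in this section samples actions uniformly at random; a common alternative is epsilon-greedy selection. The sketch below is an assumption for illustration (the function `epsilon_greedy` and the demo table are not from the original notebooks).

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon explore, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # uniform random action
    return int(np.argmax(Q[state]))           # greedy action

rng = np.random.default_rng(0)
Q_demo = np.zeros((4, 2))   # hypothetical 4-state, 2-action table
Q_demo[0] = [0.2, 0.8]

greedy_action = epsilon_greedy(Q_demo, 0, epsilon=0.0, rng=rng)
explore_action = epsilon_greedy(Q_demo, 0, epsilon=1.0, rng=rng)
print(greedy_action)  # 1
```

With `epsilon=0.0` the choice is always greedy; with `epsilon=1.0` it is always random, which corresponds to the behavior used in the loop in this section.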

In [3]:
# ==================================================================================================
# PART B: SARSA IMPLEMENTATION
# ==================================================================================================
# Environment setup for SARSA
env = gym.make("FrozenLake-v1", is_slippery=False)
num_states = env.observation_space.n
num_actions = env.action_space.n

# SARSA parameters
Q = np.zeros((num_states, num_actions))
alpha = 0.1      # Learning rate
gamma = 1.0      # Discount factor
num_episodes = 1000

def update_q_table_sarsa(state, action, reward, next_state, next_action):
    """SARSA update rule: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)]"""
    old_value = Q[state, action]
    next_value = Q[next_state, next_action]
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_value)


# ==================================================================================================
# SARSA TRAINING LOOP
# ==================================================================================================
for episode in range(num_episodes):
    state, info = env.reset()
    action = env.action_space.sample()  # Initial action selection
    done = False
    
    while not done:
        # Take action and observe next state and reward
        next_state, reward, terminated, truncated, info = env.step(action)
        next_action = env.action_space.sample()  # Next action selection
        
        # SARSA update
        update_q_table_sarsa(state, action, reward, next_state, next_action)
        
        # Move to next state-action pair
        state, action = next_state, next_action
        # End on a terminal state or on time-limit truncation
        done = terminated or truncated

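# get_policy is not defined anywhere in this summary; the version below is an
# assumed minimal helper that picks the greedy action for each state.
def get_policy(Q):
    """Return {state: greedy action} for a Q-table of shape (num_states, num_actions)."""
    return {state: int(np.argmax(Q[state])) for state in range(Q.shape[0])}

# Extract the greedy policy from the learned Q-table
policy_sarsa = get_policy(Q)
print("SARSA Policy:", policy_sarsa)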

## **Part C: Q-Learning**

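- Q-learning is an off-policy method (it learns the optimal policy regardless of which actions are actually taken)
- Updates Q-values using the maximum Q-value available in the next state
- The update target does not depend on the behavior policy, so it can learn from exploratory, even fully random, actions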
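
The only difference from SARSA is the update target. The sketch below applies both rules to the same transition; all numbers are made up for illustration.

```python
import numpy as np

alpha, gamma = 0.1, 1.0
old_value = 0.4                            # current Q(s, a)
reward = 0.0
q_next = np.array([0.0, 0.5, 0.2, 0.9])    # hypothetical Q(s', ·)
next_action = 2                            # action the behavior policy takes in s'

# SARSA bootstraps from the action actually taken next
sarsa_target = reward + gamma * q_next[next_action]
# Q-learning bootstraps from the best action in the next state
qlearning_target = reward + gamma * np.max(q_next)

sarsa_new = old_value + alpha * (sarsa_target - old_value)
qlearning_new = old_value + alpha * (qlearning_target - old_value)

print(f"SARSA:      {sarsa_new:.2f}")      # SARSA:      0.38
print(f"Q-learning: {qlearning_new:.2f}")  # Q-learning: 0.45
```

Here the exploratory next action lowers the SARSA estimate, while the Q-learning estimate moves toward the value of the best next action.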

In [4]:
# ==================================================================================================
# PART C: Q-LEARNING IMPLEMENTATION
# ==================================================================================================
# Environment setup for Q-learning
env = gym.make("FrozenLake-v1", is_slippery=True)
num_episodes = 1000
alpha = 0.1
gamma = 1.0

num_states, num_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((num_states, num_actions))
reward_per_random_episode = []

def update_q_table_qlearning(state, action, reward, new_state):
    """Q-learning update rule: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]"""
    old_value = Q[state, action]
    next_max = max(Q[new_state])  # Maximum Q-value for next state
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# ==================================================================================================
# Q-LEARNING TRAINING LOOP
# ==================================================================================================
for episode in range(num_episodes):
    state, info = env.reset()
    done = False
    episode_reward = 0
    
    while not done:
        # Random action selection for exploration
        action = env.action_space.sample()
        
        # Take action and observe new state and reward
        new_state, reward, terminated, truncated, info = env.step(action)
        
        # Q-learning update
        update_q_table_qlearning(state, action, reward, new_state)
        
        episode_reward += reward
        state = new_state
        # End on a terminal state or on time-limit truncation
        done = terminated or truncated
    
    reward_per_random_episode.append(episode_reward)


# ==================================================================================================
# POLICY EVALUATION
# ==================================================================================================
# Extract the learned (greedy) policy: best action per state.
# Computed inline with np.argmax, so no external helper is needed.
policy_qlearning = {state: int(np.argmax(Q[state])) for state in range(num_states)}
reward_per_learned_episode = []

# Test learned policy performance
for episode in range(num_episodes):
    state, info = env.reset()
    done = False
    episode_reward = 0
    
    while not done:
        # Select best action based on learned Q-table
        action = policy_qlearning[state]
        
        # Take action and observe new state
        new_state, reward, terminated, truncated, info = env.step(action)
        state = new_state
        episode_reward += reward
        # End on a terminal state or on time-limit truncation
        done = terminated or truncated
    
    reward_per_learned_episode.append(episode_reward)

# Performance comparison
avg_random_reward = np.mean(reward_per_random_episode)
avg_learned_reward = np.mean(reward_per_learned_episode)

print(f"Average Random Policy Reward: {avg_random_reward:.3f}")
print(f"Average Learned Policy Reward: {avg_learned_reward:.3f}")
# Guard against division by zero when the random policy never reached the goal
if avg_random_reward > 0:
    print(f"Performance Improvement: {(avg_learned_reward/avg_random_reward - 1)*100:.1f}%")

## **Key Differences Summary**



**Monte Carlo vs. TD Learning:**
- **Monte Carlo**: Updates after complete episodes, requires episodic tasks
- **TD Learning**: Updates at each step, works with continuing tasks
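
Written as incremental updates, the contrast is in the target each method moves toward. This is a sketch in standard notation, where $G_t$ is the return observed from step $t$ to the end of the episode at step $T$; the Monte Carlo code in Part A averages returns, which corresponds to $\alpha = 1/N$ with $\gamma = 1$.

$$
\text{Monte Carlo:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ G_t - Q(s_t, a_t) \right], \qquad G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}
$$

$$
\text{TD (SARSA):}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
$$

The Monte Carlo target needs every reward up to step $T$, whereas the TD target needs only the next reward and the current estimate for the next state-action pair.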

**SARSA vs. Q-Learning:**
- **SARSA**: On-policy, learns policy being followed
- **Q-Learning**: Off-policy, learns optimal policy independent of behavior

**Use Cases:**
- **Monte Carlo**: Short episodic tasks with clear endpoints
- **SARSA**: When you want to learn the policy you're actually following
- **Q-Learning**: When you want to find the optimal policy regardless of exploration strategy