# **Temporal Difference Learning**



## Table of Contents

1. **Temporal Difference Learning**
   - 3.1 TD Learning vs. Monte Carlo Comparison
   - 3.2 SARSA Algorithm (On-Policy TD)
   - 3.3 Q-Learning Algorithm (Off-Policy TD)
   - 3.4 Policy Derivation and Evaluation
  
2. **Educational Enhancement and Practice**
   - 4.1 Learning Objectives
   - 4.2 Common Misconceptions
   - 4.3 Troubleshooting Guide
   - 4.4 Advanced Topics for Further Study

***

## 1. **Temporal Difference Learning**

### 3.1 TD Learning vs. Monte Carlo Comparison

**Core Concept from PDF:**
*"TD learning vs. Monte Carlo

TD learning:
- Model-free
- Estimate Q-table based on interaction  
- Update Q-table each step within episode
- Suitable for tasks with long/indefinite episodes

Monte Carlo:
- Model-free
- Estimate Q-table based on interaction
- Update Q-table when at least one episode done  
- Suitable for short episodic tasks"*

### Expanded Explanation:

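**Temporal Difference (TD) learning** combines ideas from Monte Carlo methods and dynamic programming: like Monte Carlo, it learns from sampled experience without a model of the environment; like dynamic programming, it updates estimates from other estimates (bootstrapping) instead of waiting for the final outcome of an episode.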

#### Core Philosophical Differences:

**Temporal Difference Learning:**
- **Bootstrapping**: Uses estimates to update other estimates
- **Online Learning**: Updates occur after each time step
- **Sample Efficiency**: Can learn from incomplete episodes
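- **Lower Variance**: One-step targets depend on a single reward and transition, so updates fluctuate less than full Monte Carlo returns, at the cost of some bias from the bootstrapped estimate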

**Mathematical Foundations:**
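TD methods update estimates using the **TD error** $\delta_t$:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

The estimate is then moved a fraction $\alpha$ of the way toward the target:
$$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$

Where:
- $R_{t+1}$ = Immediate reward
- $\gamma V(S_{t+1})$ = Discounted value of next state
- $V(S_t)$ = Current estimate of state value
- $\alpha$ = Learning rate (step size)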
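
To make the TD error concrete, the sketch below applies a single TD(0) update, $V(S_t) \leftarrow V(S_t) + \alpha \delta_t$, to a small value table. The function name `td0_update`, the state labels, and the numbers are illustrative assumptions rather than part of the source material.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update in place and return the TD error."""
    # A terminal next state contributes no future value.
    next_value = 0.0 if terminal else V[next_state]
    td_error = reward + gamma * next_value - V[state]
    V[state] += alpha * td_error
    return td_error

# Hypothetical value estimates for two states.
V = {"A": 0.5, "B": 1.0}
delta = td0_update(V, "A", reward=0.0, next_state="B", alpha=0.1, gamma=0.9)
print(f"TD error: {delta:.2f}, updated V(A): {V['A']:.2f}")
```

Here the TD error is $0 + 0.9 \times 1.0 - 0.5 = 0.4$, so $V(A)$ moves from 0.5 to 0.54, one tenth of the way toward the target.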

#### Why TD Learning Matters:
**Advantages over Monte Carlo:**
- **Continuing Tasks**: Works with non-episodic environments
- **Faster Learning**: Updates immediately, not waiting for episode completion
- **Memory Efficient**: No need to store complete episodes
- **Online Adaptation**: Adjusts to environment changes in real-time
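
The timing difference behind these advantages can be sketched on one recorded episode. Monte Carlo targets are full returns that can only be computed once the episode has ended, while each one-step TD target is available as soon as its transition is observed. The helper names, the 3-step episode, and the value estimates below are hypothetical.

```python
def mc_returns(rewards, gamma):
    """Discounted return G_t for every step, computed backwards after the episode ends."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def td_targets(rewards, next_values, gamma):
    """One-step TD targets; each needs only its own reward and V(S_{t+1})."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]

rewards = [0.0, 0.0, 1.0]       # hypothetical 3-step episode
next_values = [0.2, 0.6, 0.0]   # current estimates V(S_{t+1}); the terminal value is 0
gamma = 0.9

print("MC targets:", mc_returns(rewards, gamma))
print("TD targets:", td_targets(rewards, next_values, gamma))
```

The Monte Carlo targets depend only on observed rewards, whereas the TD targets depend on the current (possibly inaccurate) value estimates, which is where the bias of TD methods comes from.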

**Examples of Applications:**
- **Scenario 1**: Robot navigation where episodes may be very long or undefined
- **Scenario 2**: Financial trading where markets operate continuously  
- **Scenario 3**: Game playing where episodes can last hours or days

#### Long-term Consequences:
- TD methods form the foundation for advanced algorithms (Q-learning, Actor-Critic)
- Enable practical RL applications in real-world scenarios
- Provide theoretical insights into the bias-variance tradeoff in learning

***

### 3.2 SARSA Algorithm (On-Policy TD)

**Core Concept from PDF:**
*"SARSA
TD algorithm
On-policy method: adjusts strategy based on taken actions

SARSA update rule
α: learning rate
γ: discount factor  
Both between 0 and 1"*

### Expanded Explanation:

**SARSA (State-Action-Reward-State-Action)** is an on-policy TD control algorithm that learns Q-values for the policy it is currently following.

#### Mathematical Formulation:
**SARSA Update Rule:**
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

Where:
- $\alpha$ = Learning rate (0 < α ≤ 1)
- $\gamma$ = Discount factor (0 ≤ γ ≤ 1)  
- $R_{t+1}$ = Reward received after taking action $A_t$ in state $S_t$
- $Q(S_{t+1}, A_{t+1})$ = Q-value of the actual next action taken

#### Advanced Implementation Details:

**Core Algorithm Steps:**
- **Step 1**: Initialize Q-table and select initial action
- **Step 2**: Take action, observe reward and next state
- **Step 3**: Select next action using current policy
- **Step 4**: Update Q-value using actual next action
- **Step 5**: Move to next state-action pair and repeat

#### Why SARSA is On-Policy:
**On-Policy Characteristics:**
- **Policy Evaluation**: Learns Q-values for the policy being followed
- **Conservative Updates**: Updates based on actions actually taken
- **Safe Exploration**: Accounts for exploratory actions in value estimates
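- **Convergence Guarantees**: In the tabular setting, converges to the optimal Q-values provided every state-action pair keeps being visited, the policy becomes greedy in the limit (for example, through decaying ε), and the learning rate decays appropriately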

#### Original Code from PDF: SARSA Initialization
```python
env = gym.make("FrozenLake", is_slippery=False)  
num_states = env.observation_space.n 
num_actions = env.action_space.n 
 
Q = np.zeros((num_states, num_actions))  
alpha = 0.1 
gamma = 1 
num_episodes = 1000
```

#### Original Code from PDF: SARSA Loop
```python
for episode in range(num_episodes):  
    state, info = env.reset() 
    action = env.action_space.sample()  
    terminated = False 
    while not terminated: 
        next_state, reward, terminated, truncated, info = env.step(action)  
        next_action = env.action_space.sample()  
        update_q_table(state, action, reward, next_state, next_action) 
        state, action = next_state, next_action
```

#### Original Code from PDF: SARSA Update
```python
def update_q_table(state, action, reward, next_state, next_action):  
    old_value = Q[state, action] 
    next_value = Q[next_state, next_action]  
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_value)
```
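
The PDF code writes the update as a weighted average, while the update rule earlier in this section uses the TD-error form. The two are algebraically identical, since $(1-\alpha)Q + \alpha T = Q + \alpha(T - Q)$ for any target $T$. A minimal check, with hypothetical function names and numbers:

```python
def weighted_average_update(old_value, target, alpha):
    """Form used in the PDF code: blend the old value with the new target."""
    return (1 - alpha) * old_value + alpha * target

def td_error_update(old_value, target, alpha):
    """Form used in the update rule: step toward the target by alpha * TD error."""
    return old_value + alpha * (target - old_value)

# Hypothetical numbers: Q(s, a) = 0.5, reward = 1.0, gamma = 0.9, Q(s', a') = 0.2
old_value, alpha = 0.5, 0.1
target = 1.0 + 0.9 * 0.2

print(weighted_average_update(old_value, target, alpha))
print(td_error_update(old_value, target, alpha))
```

Both calls give the same result up to floating-point rounding, so either form can be used interchangeably.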

#### Completed/Enhanced Version:
```python
import numpy as np
import gymnasium as gym

def epsilon_greedy_policy(Q, state, epsilon=0.1):
    """Epsilon-greedy action selection for exploration."""
    if np.random.random() < epsilon:
        return np.random.choice(len(Q[state]))  # Random action
    else:
        return np.argmax(Q[state])  # Greedy action

def sarsa_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Complete SARSA learning implementation."""
    num_states = env.observation_space.n
    num_actions = env.action_space.n
    
    # Initialize Q-table
    Q = np.zeros((num_states, num_actions))
    episode_rewards = []
    
    for episode in range(num_episodes):
        # Initialize episode
        state, info = env.reset()
        action = epsilon_greedy_policy(Q, state, epsilon)
        episode_reward = 0
        terminated = False
        truncated = False
        
        while not terminated and not truncated:
            # Take action and observe results
            next_state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            
            # Select next action using current policy (ignored if terminated)
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)

            if terminated:
                # True terminal state: no future value to bootstrap from
                td_target = reward
            else:
                # Non-terminal step, including time-limit truncation: bootstrap
                td_target = reward + gamma * Q[next_state, next_action]

            # SARSA update
            td_error = td_target - Q[state, action]
            Q[state, action] += alpha * td_error

            # Move to next state-action pair
            state, action = next_state, next_action
        
        episode_rewards.append(episode_reward)
        
        # Decay epsilon for exploration
        if epsilon > 0.01:
            epsilon *= 0.995
            
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode + 1}, Average Reward: {avg_reward:.2f}, Epsilon: {epsilon:.3f}")
    
    return Q, episode_rewards

# Usage example
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode=None)
Q_sarsa, rewards = sarsa_learning(env, 1000)

# Extract policy
policy_sarsa = {state: np.argmax(Q_sarsa[state]) for state in range(env.observation_space.n)}
print(f"SARSA Policy: {policy_sarsa}")
env.close()
```

**Code Explanation:**
- **`epsilon_greedy_policy`**: Epsilon-greedy action selection balancing exploration and exploitation
- **Setup in `sarsa_learning`**: Q-table initialization and per-episode reward tracking
- **Inner `while` loop**: Steps the environment and carries the state-action pair forward
- **`td_error` lines**: SARSA update rule implemented through the TD error
- **End of each episode**: Epsilon decay (down to roughly 0.01) and a progress report every 100 episodes

***

### 3.3 Q-Learning Algorithm (Off-Policy TD)

**Core Concept from PDF:**
*"Q-learning vs. SARSA
SARSA: Updates based on taken action, On-policy learner
Q-learning: Updates independent of taken actions, Off-policy learner

Introduction to Q-learning
Stands for quality learning
Model-free technique  
Learns optimal Q-table by interaction"*

### Expanded Explanation:

**Q-learning** is an off-policy TD control algorithm that learns the optimal action-value function independent of the policy being followed.

#### Mathematical Formulation:
**Q-learning Update Rule:**
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)]$$

Where:
- $\max_{a} Q(S_{t+1}, a)$ = Maximum Q-value over all possible actions in next state
- This differs from SARSA which uses $Q(S_{t+1}, A_{t+1})$ (actual next action)
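
The difference between the two targets can be sketched on a single transition. The helper names `sarsa_target` and `q_learning_target`, the 2-state table, and the numbers are illustrative assumptions; plain nested lists stand in for the NumPy Q-table used elsewhere in this section.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: uses the action actually selected in the next state."""
    return reward + gamma * Q[next_state][next_action]

def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: uses the best action in the next state, whatever is taken."""
    return reward + gamma * max(Q[next_state])

# Hypothetical table: 2 states x 2 actions
Q = [[0.0, 0.0],
     [0.2, 0.8]]
reward, next_state, gamma = 0.0, 1, 0.9
next_action = 0  # suppose exploration picked the worse action in the next state

print("SARSA target:     ", sarsa_target(Q, reward, next_state, next_action, gamma))
print("Q-learning target:", q_learning_target(Q, reward, next_state, gamma))
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps, which is why SARSA's estimates reflect the cost of exploration and Q-learning's do not.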

#### Why Q-Learning is Off-Policy:
**Off-Policy Characteristics:**
- **Policy Independence**: Updates target the optimal policy regardless of behavior policy
- **Optimistic Updates**: Always considers the best possible next action
- **Learning Under Exploration**: Can learn the optimal policy's values while following an exploratory behavior policy (this does not by itself imply faster convergence than SARSA)
- **Separation of Concerns**: Behavior policy handles exploration, target policy handles exploitation

#### Original Code from PDF: Q-learning Implementation
```python
env = gym.make("FrozenLake", is_slippery=True)  
 
num_episodes = 1000 
alpha = 0.1 
gamma = 1 
 
num_states, num_actions = env.observation_space.n, env.action_space.n 
Q = np.zeros((num_states, num_actions)) 

reward_per_random_episode = []
```

#### Original Code from PDF: Q-learning Loop
```python
for episode in range(num_episodes): 
    state, info = env.reset() 
    terminated = False 
    episode_reward = 0 
 
    while not terminated:  
        action = env.action_space.sample()  
        new_state, reward, terminated, truncated, info = env.step(action)  
        update_q_table(state, action, reward, new_state)  # reward is required by the update
        episode_reward += reward 
        state = new_state 
    reward_per_random_episode.append(episode_reward)
```

#### Original Code from PDF: Q-learning Update
```python
def update_q_table(state, action, reward, new_state):
    old_value = Q[state, action]
    next_max = max(Q[new_state])
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
```

#### Completed/Enhanced Version:
```python
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

def q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Complete Q-learning implementation with exploration strategy."""
    num_states = env.observation_space.n
    num_actions = env.action_space.n
    
    # Initialize Q-table
    Q = np.zeros((num_states, num_actions))
    episode_rewards = []
    
    for episode in range(num_episodes):
        state, info = env.reset()
        episode_reward = 0
        terminated = False
        truncated = False
        
        while not terminated and not truncated:
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])
            
            # Take action and observe results
            new_state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            
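            # Q-learning update (off-policy)
            if not terminated:
                # Bootstrap from the best next action; a time-limit truncation
                # is not a true terminal state, so it bootstraps as well
                next_max = np.max(Q[new_state])
                td_error = reward + gamma * next_max - Q[state, action]
            else:
                # True terminal state: no future value
                td_error = reward - Q[state, action]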
            
            Q[state, action] += alpha * td_error
            state = new_state
        
        episode_rewards.append(episode_reward)
        
        # Decay exploration
        if epsilon > 0.01:
            epsilon *= 0.995
            
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode + 1}, Average Reward: {avg_reward:.2f}")
    
    return Q, episode_rewards

def compare_policies(env, Q_random, Q_learned, num_eval_episodes=100):
    """Compare random policy vs learned policy performance."""
    
    # Random policy evaluation
    random_rewards = []
    for _ in range(num_eval_episodes):
        state, info = env.reset()
        episode_reward = 0
        terminated = False
        truncated = False
        
        while not terminated and not truncated:
            action = env.action_space.sample()
            state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
        random_rewards.append(episode_reward)
    
    # Learned policy evaluation
    learned_rewards = []
    policy = {state: np.argmax(Q_learned[state]) for state in range(env.observation_space.n)}
    
    for _ in range(num_eval_episodes):
        state, info = env.reset()
        episode_reward = 0
        terminated = False
        truncated = False
        
        while not terminated and not truncated:
            action = policy[state]
            state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
        learned_rewards.append(episode_reward)
    
    return {
        'random_mean': np.mean(random_rewards),
        'learned_mean': np.mean(learned_rewards),
        'random_std': np.std(random_rewards),
        'learned_std': np.std(learned_rewards),
        'improvement': np.mean(learned_rewards) - np.mean(random_rewards)
    }

# Usage example
env = gym.make("FrozenLake-v1", is_slippery=True, render_mode=None)

# Train Q-learning agent
print("Training Q-learning agent...")
Q_learned, training_rewards = q_learning(env, 1000)

# Extract learned policy
learned_policy = {state: np.argmax(Q_learned[state]) for state in range(env.observation_space.n)}
print(f"Learned Policy: {learned_policy}")

# Compare policies
comparison = compare_policies(env, None, Q_learned)
print(f"\nPolicy Comparison:")
print(f"Random Policy Average Reward: {comparison['random_mean']:.3f} (±{comparison['random_std']:.3f})")
print(f"Learned Policy Average Reward: {comparison['learned_mean']:.3f} (±{comparison['learned_std']:.3f})")
print(f"Improvement: {comparison['improvement']:.3f}")

env.close()
```

**Code Explanation:**
- **Setup in `q_learning`**: Q-table initialization and per-episode reward tracking
- **Inner `while` loop**: Epsilon-greedy action selection followed by an environment step
- **`td_error` lines**: Q-learning update rule using the max over next-state actions
- **End of each episode**: Exploration decay and a progress report every 100 episodes
- **`compare_policies`**: Runs a random policy and the greedy learned policy for the same number of episodes and reports mean reward, standard deviation, and improvement

***

### 3.4 Policy Derivation and Evaluation

**Core Concept from PDF:**
*"Using the policy
reward_per_learned_episode = [] 
policy = get_policy()  
for episode in range(num_episodes): 
    state, info = env.reset() 
    terminated = False 
    episode_reward = 0 
    while not terminated: 
        action = policy[state] 
        new_state, reward, terminated, truncated, info = env.step(action) 
        state = new_state 
        episode_reward += reward  
    reward_per_learned_episode.append(episode_reward)"*

**Core Concept from PDF:**
*"Q-learning evaluation
avg_random_reward = np.mean(reward_per_random_episode)
avg_learned_reward = np.mean(reward_per_learned_episode) 

plt.bar(['Random Policy', 'Learned Policy'], 
        [avg_random_reward, avg_learned_reward], 
        color=['blue', 'green']) 

plt.title('Average Reward per Episode') 
plt.ylabel('Average Reward') 
plt.show()"*

### Expanded Explanation:

**Policy derivation** transforms learned Q-values into actionable decision-making strategies, while **policy evaluation** quantifies the effectiveness of these strategies.

#### Advanced Policy Derivation Techniques:

**Greedy Policy Extraction:**
$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

**Boltzmann (Softmax) Policy:**
$$\pi(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$

Where $\tau > 0$ is the temperature parameter: a large $\tau$ makes action choice nearly uniform (more exploration), while $\tau \to 0$ approaches the greedy policy.
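
A minimal sketch of the Boltzmann policy for one state's Q-values follows. The function names are hypothetical, and subtracting the maximum Q-value before exponentiating is an added numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random

def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values with temperature tau (must be positive)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Shift by the max so the largest exponent is 0 and exp() cannot overflow.
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_boltzmann_action(q_values, tau=1.0, rng=random):
    """Draw an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q_values = [1.0, 2.0, 0.5]  # hypothetical Q-values for one state
for tau in (0.1, 1.0, 100.0):
    probs = boltzmann_probabilities(q_values, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

With a small temperature almost all probability mass falls on the highest-valued action; with a large temperature the distribution approaches uniform.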

#### Complete Evaluation Framework:

```python
def comprehensive_evaluation(env, Q_table, num_eval_episodes=200):
    """Comprehensive policy evaluation with multiple metrics."""
    
    # Extract greedy policy
    policy = {state: np.argmax(Q_table[state]) for state in range(env.observation_space.n)}
    
    # Performance metrics
    episode_rewards = []
    episode_lengths = []
    success_rate = 0
    
    for episode in range(num_eval_episodes):
        state, info = env.reset()
        episode_reward = 0
        episode_length = 0
        terminated = False
        truncated = False
        
        while not terminated and not truncated and episode_length < 200:
            action = policy[state]
            next_state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            episode_length += 1
            state = next_state
        
        episode_rewards.append(episode_reward)
        episode_lengths.append(episode_length)
        
        # Success if positive reward received
        if episode_reward > 0:
            success_rate += 1
    
    return {
        'mean_reward': np.mean(episode_rewards),
        'std_reward': np.std(episode_rewards),
        'mean_length': np.mean(episode_lengths),
        'success_rate': success_rate / num_eval_episodes,
        'policy': policy
    }
```

***


## 2. **Educational Enhancement and Practice**

### 4.1 Learning Objectives

After completing this material, learners should be able to:

**Core Understanding:**
- Distinguish between model-based and model-free reinforcement learning approaches
- Explain the fundamental principles of Monte Carlo methods in RL contexts
- Understand the difference between first-visit and every-visit Monte Carlo estimation
- Comprehend temporal difference learning and its advantages over Monte Carlo methods

**Technical Implementation:**
- Implement Monte Carlo algorithms for episodic tasks
- Code SARSA and Q-learning algorithms from scratch
- Design epsilon-greedy exploration strategies
- Evaluate and compare different RL policies quantitatively

**Advanced Concepts:**
- Analyze the bias-variance tradeoffs in different RL algorithms
- Choose appropriate algorithms based on problem characteristics
- Debug common issues in RL implementations
- Extend basic algorithms with advanced techniques

***

### 4.2 Common Misconceptions

#### **Misconception 1**: "Monte Carlo methods are always better than TD methods"
**Reality**: Monte Carlo provides unbiased estimates but requires complete episodes and has higher variance. TD methods work for continuing tasks and have lower variance but may have bias.

#### **Misconception 2**: "Q-learning always converges faster than SARSA"
**Reality**: Convergence speed depends on exploration strategy, environment characteristics, and hyperparameter settings. SARSA may be safer in stochastic environments.

#### **Misconception 3**: "Higher learning rates always lead to faster learning"
**Reality**: Learning rates that are too high can cause instability and prevent convergence. The optimal learning rate depends on the problem and algorithm.
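
The effect of the learning rate can be sketched on a one-state problem with a noisy reward. The setup below (a reward of 1 or 0 with equal probability, the function name `track_estimate`, and the two α values) is an illustrative assumption, not an experiment from the source.

```python
import random
import statistics

def track_estimate(alpha, num_updates=2000, seed=0):
    """Run Q <- Q + alpha * (reward - Q) on a noisy reward; record Q after each update."""
    rng = random.Random(seed)
    estimate, history = 0.0, []
    for _ in range(num_updates):
        reward = 1.0 if rng.random() < 0.5 else 0.0  # true expected reward is 0.5
        estimate += alpha * (reward - estimate)
        history.append(estimate)
    return history

for alpha in (0.05, 0.9):
    tail = track_estimate(alpha)[-1000:]  # ignore the initial transient
    print(f"alpha={alpha}: mean={statistics.mean(tail):.3f}, "
          f"spread={statistics.pstdev(tail):.3f}")
```

With the larger α the estimate reacts quickly but keeps jumping with every reward, so it never settles near 0.5; the smaller α is slower to move but ends up with a much tighter estimate.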

#### **Misconception 4**: "Random exploration is sufficient for all RL problems"  
**Reality**: Advanced exploration strategies (curiosity-driven, count-based, Thompson sampling) often significantly outperform random exploration.

***

### 4.3 Troubleshooting Guide

#### **Problem**: Q-values not converging
**Solutions:**
- Reduce learning rate α
- Ensure sufficient exploration (check ε value)
- Check that the environment is stationary (its dynamics and rewards do not change during training)
- Increase number of training episodes

#### **Problem**: Policy performs poorly despite training
**Solutions:**
- Check reward signal design
- Verify environment termination conditions
- Ensure adequate state representation
- Review exploration-exploitation balance

#### **Problem**: Training is too slow
**Solutions:**
- Increase learning rate (carefully)
- Implement experience replay for sample efficiency
- Use function approximation for large state spaces
- Optimize episode generation code

#### **Problem**: Results are not reproducible
**Solutions:**
- Set random seeds for environment and algorithms
- Use deterministic environment settings when appropriate
- Record hyperparameters and training procedures
- Implement proper logging and checkpointing
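
A minimal seeding sketch follows, assuming a Gymnasium-style environment whose `reset` accepts a `seed` argument and whose `action_space` has a `seed` method. The helper name `seed_run` is hypothetical. Code that draws from NumPy's global generator, as the epsilon-greedy helpers in this material do, would also need `np.random.seed(seed)`.

```python
import random

def seed_run(env, seed):
    """Seed Python's RNG and a Gymnasium-style environment; return the first observation."""
    random.seed(seed)                         # Python-level randomness
    env.action_space.seed(seed)               # makes env.action_space.sample() repeatable
    observation, info = env.reset(seed=seed)  # seeds the environment's own RNG
    # Assumption: if NumPy's global RNG is used, also call np.random.seed(seed) here.
    return observation
```

A typical call would be `state = seed_run(gym.make("FrozenLake-v1"), 42)` at the start of training; later calls to `env.reset()` without a seed then continue from the seeded generator.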

***

### 4.4 Advanced Topics for Further Study

#### **Function Approximation**
- Linear function approximation
- Deep Q-Networks (DQN)
- Policy gradient methods
- Actor-Critic algorithms

#### **Advanced Exploration**
- Upper Confidence Bound (UCB) exploration
- Thompson sampling
- Curiosity-driven exploration
- Count-based exploration bonuses

#### **Multi-Agent Reinforcement Learning**
- Independent learners
- Centralized training, decentralized execution
- Game-theoretic approaches
- Cooperative vs competitive settings

#### **Real-World Applications**
- Robotics control
- Autonomous systems
- Financial trading
- Resource allocation
- Game playing AI

**Recommended Practice Exercises:**
1. Implement Monte Carlo and TD methods on GridWorld environments
2. Compare SARSA vs Q-learning on stochastic environments
3. Experiment with different exploration strategies and learning rates  
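4. Apply the algorithms to Gymnasium (formerly OpenAI Gym) environments such as CartPole and MountainCar, discretizing their continuous observations so that a Q-table can be used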
5. Implement experience replay for improved sample efficiency