# Actor-Critic Methods: Combining Policy and Value Learning

## Overview
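Actor-Critic methods combine a policy-gradient learner (the actor) with a learned value function (the critic). The critic's estimates replace the Monte Carlo return in the policy gradient, which reduces variance at the cost of some bias, while learning stays on-policy.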

### Learning Objectives
1. Understand the actor-critic architecture
2. Learn how value functions reduce variance
3. Implement the Actor-Critic algorithm
4. Compare with REINFORCE

## 1. Actor-Critic Architecture

### Two Components

**Actor (Policy Network)**
- Learns the policy π_θ(a|s)
- Updated using policy gradient
- Selects actions

**Critic (Value Network)**
- Learns the value function V_φ(s)
- Updated using temporal difference (TD) learning
- Provides baseline for variance reduction

### Mathematical Framework

Policy gradient with advantage:
$$\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) A(s,a)]$$

where the advantage is estimated as:
$$A(s,a) \approx r + \gamma V_\phi(s') - V_\phi(s)$$

This is the **Temporal Difference (TD) error** or **TD residual**.
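
As a short supporting derivation (using the standard notation $V^\pi$ and $Q^\pi$ for the true value functions of the current policy, which the text above does not introduce), the TD error can stand in for the advantage because, if the critic were exact, its conditional expectation would equal the advantage:

$$\mathbb{E}[r + \gamma V^\pi(s') \mid s, a] = Q^\pi(s,a)$$

$$\mathbb{E}[r + \gamma V^\pi(s') - V^\pi(s) \mid s, a] = Q^\pi(s,a) - V^\pi(s) = A^\pi(s,a)$$

With a learned $V_\phi$ in place of $V^\pi$ the equality holds only approximately, which is the source of the bias discussed later.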

## 2. Temporal Difference Learning

### TD Error

The TD error measures the gap between the critic's current prediction $V(s_t)$ and the one-step bootstrapped target $r_t + \gamma V(s_{t+1})$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
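
A small numeric sketch of the formula above may help. The helper name `td_error`, its `done` flag, and the example numbers are illustrative assumptions, not part of the original text; the terminal case assumes the value of a terminal state is zero.

```python
def td_error(reward, value, next_value, gamma=0.99, done=False):
    """One-step TD error: r + gamma * V(s') - V(s), with V(s') = 0 at terminal states."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value

# Outcome better than predicted: positive TD error (about 0.485)
delta_pos = td_error(reward=1.0, value=2.0, next_value=1.5)
# Outcome worse than predicted: negative TD error (about -0.515)
delta_neg = td_error(reward=0.0, value=2.0, next_value=1.5)
# Terminal transition: no bootstrapping, so the error is reward - value
delta_end = td_error(reward=1.0, value=2.0, next_value=1.5, done=True)

print(delta_pos, delta_neg, delta_end)
```

A positive value says the transition went better than the critic expected, a negative value says it went worse.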

### Advantages of TD Learning
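1. **Lower variance**: Each update depends on a single sampled transition rather than on the return of a full trajectory
2. **Online learning**: Can update after each step
3. **Bootstrapping**: Builds the target from the current estimate of $V(s_{t+1})$ instead of waiting for the full return, which introduces bias while that estimate is inaccurate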

### Critic Update

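The critic is trained to minimize the squared TD error:

$$L_{critic} = \mathbb{E}[(\delta_t)^2]$$

$$\phi \leftarrow \phi - \beta \nabla_\phi (\delta_t)^2$$

In practice the target $r_t + \gamma V_\phi(s_{t+1})$ is treated as a constant (a semi-gradient update), so only $V_\phi(s_t)$ is differentiated. Absorbing the factor of 2 into $\beta$ gives:

$$\phi \leftarrow \phi + \beta \, \delta_t \nabla_\phi V_\phi(s_t)$$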
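
To make the critic update concrete, here is a minimal tabular sketch, assuming a five-state random walk (start in the middle, step left or right with equal probability, reward 1 only when exiting on the right). In the tabular case the gradient of the value with respect to its own table entry is 1, so the update reduces to `values[s] += beta * delta`. The environment, the function name `td0_random_walk`, and the hyperparameters are illustrative assumptions.

```python
import random

def td0_random_walk(num_episodes=2000, beta=0.05, gamma=1.0, seed=0):
    """Estimate state values of a 5-state random walk with one-step TD updates."""
    rng = random.Random(seed)
    n_states = 5
    values = [0.5] * n_states  # tabular critic, one entry per state

    for _episode in range(num_episodes):
        s = n_states // 2  # start in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= n_states
            reward = 1.0 if s_next >= n_states else 0.0

            # TD error; terminal states contribute no bootstrapped value
            next_value = 0.0 if done else values[s_next]
            delta = reward + gamma * next_value - values[s]

            # Critic update, applied online after every step
            values[s] += beta * delta

            if done:
                break
            s = s_next
    return values

values = td0_random_walk()
true_values = [(i + 1) / 6 for i in range(5)]
print("estimated:", [round(v, 2) for v in values])
print("true:     ", [round(v, 2) for v in true_values])
```

For this walk the true values are 1/6 through 5/6, so the estimates should land near them, with some residual noise because the step size is constant.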

## 3. Actor-Critic Algorithm

### Pseudocode

```
Initialize policy π_θ and value function V_φ
for episode = 1 to num_episodes:
    state ← env.reset()
    for t = 0 to T:
        # Actor: sample action
        action ~ π_θ(·|state)
        
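        # Environment step
        next_state, reward, done ← env.step(action)
        
        # Critic: compute TD error (no bootstrapping from a terminal state)
        target ← reward if done else reward + γV_φ(next_state)
        δ ← target - V_φ(state)
        
        # Critic update (semi-gradient: target held constant)
        φ ← φ + β δ ∇_φ V_φ(state)
        
        # Actor update
        θ ← θ + α δ ∇_θ log π_θ(action|state)
        
        if done: break
        state ← next_state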
```
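
The pseudocode can be turned into a runnable tabular sketch. The version below assumes a hypothetical four-state corridor (start at the left end, reward 1 on stepping past the right end), a softmax policy over per-state action preferences, and hand-picked step sizes; none of these come from the original text. It is meant to show the shape of the loop, not a tuned implementation.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_corridor(num_episodes=2000, alpha=0.5, beta=0.2, gamma=0.9,
                   n_states=4, max_steps=100, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(n_states)]  # actor: preferences for (left, right)
    values = [0.0] * n_states                      # critic: V(s)

    for _episode in range(num_episodes):
        s = 0
        for _step in range(max_steps):
            # Actor: sample an action (0 = left, 1 = right)
            probs = softmax(theta[s])
            a = 0 if rng.random() < probs[0] else 1

            # Environment step: the left wall is reflecting, the right exit is terminal
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == n_states
            reward = 1.0 if done else 0.0

            # Critic: TD error, then semi-gradient update
            next_value = 0.0 if done else values[s_next]
            delta = reward + gamma * next_value - values[s]
            values[s] += beta * delta

            # Actor: grad of log softmax w.r.t. preference b is 1[b == a] - pi(b|s)
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * delta * grad_log

            if done:
                break
            s = s_next
    return theta, values

theta, values = train_corridor()
p_right = [softmax(t)[1] for t in theta]
print("P(right | s):", [round(p, 3) for p in p_right])
print("V(s):        ", [round(v, 3) for v in values])
```

After training, the probability of moving right should be high in every state and the value estimates should grow toward the goal.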

## 4. Variance Reduction Analysis

### Comparison: REINFORCE vs Actor-Critic

**REINFORCE**
- Uses full trajectory return: $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$
- High variance (depends on entire trajectory)
- Unbiased estimate

**Actor-Critic**
- Uses TD error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
- Lower variance (one-step lookahead)
- Biased estimate (depends on value function accuracy)

### Bias-Variance Trade-off

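- **More bootstrapping (one-step TD)**: Lower variance and typically faster learning, but the gradient is biased whenever the critic is inaccurate, which can steer the policy toward a suboptimal solution
- **Less bootstrapping (Monte Carlo returns)**: Unbiased, but higher variance, so more samples are typically needed
- Actor-Critic sits between these extremes; how good the balance is depends on the quality of the critic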

In [None]:
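import numpy as np

# Illustrate the spread of REINFORCE returns vs. TD errors on one hand-written
# trajectory. This is a spread across time steps, not the variance of the
# gradient estimator across sampled trajectories.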

# Simulate trajectory
rewards = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 1.0])
values = np.array([2.5, 2.0, 1.5, 0.5, 1.5, 1.0, 0.0])  # V(s_t) for each state
gamma = 0.99

# REINFORCE: use full returns
returns = np.zeros(len(rewards))
cumulative = 0
for t in reversed(range(len(rewards))):
    cumulative = rewards[t] + gamma * cumulative
    returns[t] = cumulative

# Actor-Critic: use TD errors
td_errors = np.zeros(len(rewards))
for t in range(len(rewards)):
    td_errors[t] = rewards[t] + gamma * values[t+1] - values[t]

print("Rewards:", rewards)
print("\nREINFORCE (full returns):")
print(f"  Returns: {returns}")
print(f"  Variance across time steps: {np.var(returns):.4f}")

print("\nActor-Critic (TD errors):")
print(f"  TD errors: {td_errors}")
print(f"  Variance across time steps: {np.var(td_errors):.4f}")

# Spread of the learning signal within this single trajectory, not estimator variance
print(f"\nSpread reduction: {(1 - np.var(td_errors)/np.var(returns))*100:.1f}%")
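
The cell above compares values across time steps of one fixed trajectory. A complementary sketch, under simplifying assumptions, estimates the variance of the two learning signals across many sampled trajectories. It assumes a hypothetical process in which every step pays reward 1 with probability `p` regardless of state, so the exact value of each time step is known in closed form. The function name and all parameters are illustrative.

```python
import random
import statistics

def compare_estimator_variance(num_trajectories=5000, horizon=20,
                               gamma=0.99, p=0.5, seed=0):
    rng = random.Random(seed)

    # Exact value of time step t: expected discounted sum of the remaining rewards
    true_values = [0.0] * (horizon + 1)
    for t in reversed(range(horizon)):
        true_values[t] = p + gamma * true_values[t + 1]

    mc_signals, td_signals = [], []
    for _ in range(num_trajectories):
        rewards = [1.0 if rng.random() < p else 0.0 for _ in range(horizon)]

        # REINFORCE-style signal at t = 0: full return minus the baseline V(s_0)
        g0 = sum(gamma ** k * r for k, r in enumerate(rewards))
        mc_signals.append(g0 - true_values[0])

        # Actor-critic signal at t = 0: one-step TD error with the exact critic
        td_signals.append(rewards[0] + gamma * true_values[1] - true_values[0])

    return statistics.pvariance(mc_signals), statistics.pvariance(td_signals)

mc_var, td_var = compare_estimator_variance()
print(f"Variance of return-based signal: {mc_var:.3f}")
print(f"Variance of TD-error signal:     {td_var:.3f}")
```

Because the critic here is exact, both signals have mean zero and only the variance differs. With a learned critic the TD signal would also carry bias, which this sketch does not model.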

## 5. Implementation

### Actor Network

In [None]:
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Policy network (actor)"""
    def __init__(self, state_dim=4, action_dim=2, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, state):
        return self.net(state)
    
    def get_action_and_log_prob(self, state):
        if state.dim() == 1:
            state = state.unsqueeze(0)
        
        logits = self.forward(state)
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        
        return action.squeeze(), log_prob.squeeze()

print("Actor network defined successfully!")
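
To show what `Categorical(logits=logits)` does inside `get_action_and_log_prob`, the following dependency-free sketch normalizes the logits with a log-softmax, samples an action from the resulting probabilities, and returns the log-probability of that action. The helper names `log_softmax` and `sample_action` are illustrative, and this is a plain-Python approximation of the behavior rather than the library's implementation.

```python
import math
import random

def log_softmax(logits):
    """Log-probabilities from unnormalized logits, computed stably."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]

def sample_action(logits, rng):
    """Sample an action index and return it with its log-probability."""
    log_probs = log_softmax(logits)
    probs = [math.exp(lp) for lp in log_probs]
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return action, log_probs[action]

rng = random.Random(0)
action, log_prob = sample_action([2.0, 0.0], rng)
print("action:", action, "log_prob:", round(log_prob, 4))
```

The log-probability is the quantity the actor update multiplies by the TD error.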

### Critic Network

In [None]:
class Critic(nn.Module):
    """Value function network (critic)"""
    def __init__(self, state_dim=4, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state):
        if state.dim() == 1:
            state = state.unsqueeze(0)
        return self.net(state)

print("Critic network defined successfully!")

### Training Step

In [None]:
def actor_critic_step(actor, critic, state, action, reward, next_state, done, 
                      actor_optimizer, critic_optimizer, gamma=0.99):
    """
    Perform one Actor-Critic training step.
    
    Args:
        actor: Policy network
        critic: Value network
        state: Current state
        action: Action taken
        reward: Reward received
        next_state: Next state
        done: Whether episode is done
        actor_optimizer: Optimizer for actor
        critic_optimizer: Optimizer for critic
        gamma: Discount factor
    """
    state_tensor = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    next_state_tensor = torch.as_tensor(next_state, dtype=torch.float32).unsqueeze(0)
    
    # Critic prediction for the current state (gradients must flow through this,
    # so it is computed outside torch.no_grad())
    current_value = critic(state_tensor)
    
    # TD target is held constant (semi-gradient), so compute it without gradients
    with torch.no_grad():
        next_value = torch.zeros_like(current_value) if done else critic(next_state_tensor)
        td_target = reward + gamma * next_value
    
    # Critic loss and update
    critic_loss = torch.nn.functional.mse_loss(current_value, td_target)
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    
    # Actor: use the TD error as the advantage estimate (detached from the critic)
    td_error = (td_target - current_value).detach()
    
    logits = actor(state_tensor)
    dist = Categorical(logits=logits)
    log_prob = dist.log_prob(torch.as_tensor(action))
    
    # Reduce to a scalar so backward() and item() work regardless of shape
    actor_loss = -(log_prob * td_error.squeeze()).sum()
    
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    
    return actor_loss.item(), critic_loss.item()

print("Actor-Critic training step defined successfully!")
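
The sign convention in the actor loss, `-log_prob * td_error`, can be checked by hand for a softmax policy. The sketch below, written without PyTorch and using the hypothetical helper `actor_step`, applies one gradient-descent step directly to the logits and compares the probability of the chosen action before and after. The logits, learning rate, and TD errors are made-up values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_step(logits, action, td_error, lr=0.1):
    """One gradient-descent step on loss = -log pi(action) * td_error w.r.t. the logits."""
    probs = softmax(logits)
    # d(loss)/d(logit_b) = -td_error * (1[b == action] - pi_b), so descent adds the term below
    return [
        x + lr * td_error * ((1.0 if b == action else 0.0) - probs[b])
        for b, x in enumerate(logits)
    ]

before = softmax([0.0, 0.0])[1]
after_pos = softmax(actor_step([0.0, 0.0], action=1, td_error=+0.5))[1]
after_neg = softmax(actor_step([0.0, 0.0], action=1, td_error=-0.5))[1]
print(before, after_pos, after_neg)
```

A positive TD error raises the probability of the action that was taken and a negative one lowers it, which matches the intent of the actor update.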

## 6. Advantages of Actor-Critic

### vs. REINFORCE
1. **Lower variance**: Uses the TD error instead of full returns
2. **Often faster convergence**: Lower-variance gradients typically need fewer samples, provided the critic is reasonably accurate
3. **Online learning**: Can update after each step
4. **Better stability**: The learned value function acts as a state-dependent baseline

### vs. Pure Value-Based Methods
1. **Direct policy optimization**: Learns policy directly
2. **Continuous actions**: Handles continuous action spaces naturally
3. **Stochastic policies**: Can learn exploratory policies
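4. **Convergence guarantees**: Policy-gradient updates follow the gradient of a well-defined objective, so local convergence results are available under suitable conditions, whereas value-based methods combined with function approximation can diverge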

## 7. Summary

### Key Concepts
1. **Actor-Critic**: Combines policy gradient with value learning
2. **TD Learning**: Uses one-step lookahead for lower variance
3. **Advantage Function**: Measures relative quality of actions
4. **Bias-Variance Trade-off**: Bootstrapping from the critic lowers variance relative to REINFORCE at the cost of bias from critic error

### Next Steps
- Implement A2C with Generalized Advantage Estimation (GAE)
- Explore parallel variants (A3C)
- Study trust region methods (PPO, TRPO)