# Policy Gradient Methods: Fundamentals

## Overview
This notebook introduces the fundamental concepts of policy gradient methods in reinforcement learning.

### Learning Objectives
1. Understand the policy gradient theorem
2. Learn the difference between on-policy and off-policy methods
3. Implement a simple policy gradient algorithm
4. Understand the role of baselines in variance reduction

## 1. The Policy Gradient Theorem

### Mathematical Foundation

The policy gradient theorem states that, with states and actions sampled by running the current policy $\pi_\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) Q(s,a)]$$

where:
- $J(\theta)$ is the objective function (expected return)
- $\pi_\theta(a|s)$ is the policy parameterized by $\theta$
- $Q(s,a)$ is the action-value function of the current policy $\pi_\theta$
- $\nabla_\theta \log \pi_\theta(a|s)$ is the score function

### Key Insight
The gradient of the expected return can be expressed as an expectation of the product of:
1. **Score function**: $\nabla_\theta \log \pi_\theta(a|s)$ - the direction in parameter space that increases the probability of action $a$ in state $s$
2. **Return signal**: $Q(s,a)$ - how good the action is

This allows us to use gradient ascent to directly optimize the policy!
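
To make the update concrete, here is a minimal sketch of gradient ascent on a three-armed bandit with a softmax policy. The bandit, its mean rewards, the step size, and the helper names `softmax`, `score`, and `run_bandit` are all assumptions chosen for illustration. There is a single state, so the sampled reward stands in for $Q(s,a)$.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def score(theta, action):
    """Gradient of log pi_theta(action) w.r.t. the logits of a softmax policy."""
    grad = -softmax(theta)
    grad[action] += 1.0          # one_hot(action) - pi_theta
    return grad

def run_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Stochastic gradient ascent on the expected reward of a bandit (illustrative)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(mean_rewards))
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choice(len(theta), p=probs)          # sample an action from the policy
        r = mean_rewards[a] + rng.normal(0.0, 0.1)   # noisy reward for that action
        theta += lr * score(theta, a) * r            # score function times return signal
    return softmax(theta)

# Hypothetical mean rewards; the last arm is the best one
final_probs = run_bandit(np.array([-1.0, 0.0, 1.0]))
print("Final action probabilities:", np.round(final_probs, 3))
```

Each update uses a single sampled action, so individual steps are noisy, but on average they follow the gradient of the expected reward and shift probability toward the better arm.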

## 2. Why Policy Gradient Methods?

### Advantages
- **Direct optimization**: Directly optimize the policy instead of value function
- **Continuous actions**: Handle continuous action spaces naturally
- **Stochastic policies**: Can learn stochastic (exploratory) policies
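- **Convergence guarantees**: Under suitable conditions (e.g., a smooth policy parameterization and appropriately decaying step sizes), gradient ascent converges to a local optimum, not necessarily a global one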

### Disadvantages
- **High variance**: Requires many samples for stable estimates
- **Sample inefficiency**: Vanilla methods use only on-policy data, so samples are discarded after each policy update
- **Slow convergence**: Can be slow in high-dimensional spaces
- **Hyperparameter sensitivity**: Sensitive to learning rate and other hyperparameters

## 3. On-Policy vs Off-Policy

### On-Policy Methods
- Learn from data generated by the current policy
- Must collect new data after each policy update
- Examples: REINFORCE, Actor-Critic, A3C
- **Advantage**: Stable, direct policy optimization
- **Disadvantage**: Sample inefficient

### Off-Policy Methods
- Learn from data generated by any policy (behavior policy)
- Can reuse old data (experience replay)
- Examples: Q-Learning, DQN
- **Advantage**: Sample efficient
- **Disadvantage**: Can be unstable; off-policy policy gradient estimators additionally need a correction such as importance sampling (Q-Learning and DQN do not)

**Policy gradient methods are typically on-policy**, which is why they require frequent data collection.
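
The importance sampling correction mentioned above can be illustrated with a small sketch. It estimates the expected reward of a target policy from actions drawn by a different behavior policy, reweighting each sample by the probability ratio. The rewards and both policies are made-up numbers for a single-state problem, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

rewards = np.array([1.0, 2.0, 4.0])     # hypothetical reward of each action
behavior = np.array([0.5, 0.3, 0.2])    # policy that generated the data
target = np.array([0.1, 0.2, 0.7])      # policy we want to evaluate

# Data collected under the behavior policy
actions = rng.choice(3, size=50_000, p=behavior)
sampled_rewards = rewards[actions]

# Importance weights: target probability divided by behavior probability
weights = target[actions] / behavior[actions]

naive_estimate = sampled_rewards.mean()             # estimates the behavior policy's value
is_estimate = (weights * sampled_rewards).mean()    # estimates the target policy's value
true_value = float(target @ rewards)

print(f"True value of target policy:  {true_value:.3f}")
print(f"Naive average of old data:    {naive_estimate:.3f}")
print(f"Importance-weighted estimate: {is_estimate:.3f}")
```

The weights can be large when the target policy favors actions the behavior policy rarely took, which is one source of the instability noted for off-policy learning.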

## 4. The Role of Baselines

### Problem: High Variance
Using the raw sampled return $G_t$ (the return collected from time step $t$ onward) as the signal gives an unbiased but high-variance estimate:

$$\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a_t|s_t) G_t]$$
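
As a sketch of how $G_t$ might be computed from one episode, the function below accumulates rewards backward in time. The discount factor `gamma` and the function name `discounted_returns` are assumptions for illustration; the text above does not specify whether returns are discounted.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Work backward so each G_t reuses G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical three-step episode; with gamma = 0.5 the returns are 1.5, 1.0, 2.0
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
```

Setting `gamma=1.0` gives the undiscounted sum of future rewards.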

### Solution: Subtract a Baseline
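We can subtract any baseline $b(s_t)$ that depends only on the state, not on the action:

$$\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a_t|s_t) (G_t - b(s_t))]$$

The expected gradient is unchanged (the baseline term has zero expectation), but a well-chosen baseline can reduce the variance of the estimate considerably. A poorly chosen one can also increase it.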
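
To see why the baseline term vanishes, here is a short derivation for a fixed state $s$, assuming a discrete action space for concreteness (for continuous actions the sum becomes an integral):

$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s)\, b(s)] = b(s) \sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0$$

The step from the second to the third expression uses $\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta$. The argument relies on $b(s)$ not depending on the action; otherwise it could not be pulled out of the sum.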

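### Choosing a Baseline
A common and effective choice of baseline is the state-value function:

$$b(s) = V(s) = \mathbb{E}[G_t | s_t = s]$$

Strictly speaking, the variance-minimizing baseline weights the returns by the squared norm of the score function, but $V(s)$ is usually close to it and much easier to estimate.

Since $\mathbb{E}[G_t | s_t = s, a_t = a] = Q(s,a)$, the term $G_t - V(s_t)$ is a sample estimate of the **advantage function**: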

$$A(s,a) = Q(s,a) - V(s)$$

which measures how much better action $a$ is compared to the average action in state $s$.
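
A small numeric sketch of the definition, using made-up action values and policy probabilities for a single state (none of these numbers come from the notebook):

```python
import numpy as np

q_values = np.array([1.0, 2.0, 4.0])   # hypothetical Q(s, a) for three actions
policy = np.array([0.2, 0.5, 0.3])     # hypothetical pi(a|s)

# V(s) is the policy-weighted average of Q(s, a)
v = float(policy @ q_values)

# A(s, a) = Q(s, a) - V(s)
advantages = q_values - v

print(f"V(s) = {v:.2f}")
print(f"A(s, a) = {advantages}")
print(f"Expected advantage under the policy: {policy @ advantages:.2e}")
```

Actions with positive advantage are better than what the policy does on average in that state, and the policy-weighted average of the advantages is zero by construction.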

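```python
import numpy as np

# Demonstrate variance reduction with a baseline on a two-armed bandit.
# Note: subtracting a constant from the returns leaves the variance of the
# returns themselves unchanged; what shrinks is the variance of the
# gradient estimate score * (return - baseline).
rng = np.random.default_rng(42)

p = 0.5        # pi(a=1) for a sigmoid policy with a single logit theta
n = 10_000     # number of sampled actions
actions = (rng.random(n) < p).astype(float)

# Simulated returns: both arms share a large offset, arm 1 is slightly better
returns = np.where(actions == 1.0, 14.0, 13.0) + rng.normal(0.0, 1.0, size=n)
baseline = np.mean(returns)  # Simple baseline: mean return

# Score function of a Bernoulli(sigmoid(theta)) policy: d/dtheta log pi(a) = a - p
score = actions - p

# Per-sample gradient estimates without and with the baseline
grad_raw = score * returns
grad_base = score * (returns - baseline)

print(f"Baseline (mean return): {baseline:.2f}")
print(f"Gradient estimate without baseline: {grad_raw.mean():.3f}")
print(f"Gradient estimate with baseline:    {grad_base.mean():.3f}")
print(f"\nVariance without baseline: {np.var(grad_raw):.2f}")
print(f"Variance with baseline: {np.var(grad_base):.2f}")
print(f"Variance reduction: {(1 - np.var(grad_base)/np.var(grad_raw))*100:.1f}%")
```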

## 5. Summary

### Key Concepts
1. **Policy Gradient Theorem**: Provides a way to compute gradients of the expected return
2. **On-Policy Learning**: Policy gradient methods learn from current policy data
3. **Baselines**: A state-dependent baseline can reduce variance without introducing bias
4. **Advantage Function**: Measures relative quality of actions

### Next Steps
- Implement REINFORCE algorithm
- Add value function baseline
- Explore Actor-Critic methods