# Actor-Critic Tutorial

The Actor-Critic method is a popular approach in reinforcement learning that combines the benefits of both policy-based and value-based methods. It consists of two main components: the actor, which updates the policy, and the critic, which evaluates the policy by estimating the value function.

## Key Concepts

1. **Policy ($\pi_{\theta}(a|s)$)**: A mapping from states to probability distributions over actions, parameterized by $\theta$. $\pi_{\theta}(a|s)$ is the probability of taking action $a$ in state $s$.
2. **Value Function ($V^{\pi}(s)$)**: The expected return starting from state $s$ and following policy $\pi$.
3. **Action-Value Function ($Q^{\pi}(s, a)$)**: The expected return starting from state $s$, taking action $a$, and then following policy $\pi$.
4. **Advantage Function ($A^{\pi}(s, a)$)**: Measures how much better taking action $a$ in state $s$ is than acting according to $\pi$ on average in that state. It is defined as $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$.
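
To make the relationship between $V^{\pi}$, $Q^{\pi}$, and $A^{\pi}$ concrete, here is a minimal sketch for a single state with three actions. The Q-values and action probabilities are made-up numbers chosen for illustration, not results from any experiment. It relies on the standard identity $V^{\pi}(s) = \sum_a \pi(a|s) Q^{\pi}(s, a)$.

```python
# Illustrative single-state example; all numbers are hypothetical.
q_values = [1.0, 2.0, 4.0]        # Q^pi(s, a) for actions 0, 1, 2
policy_probs = [0.5, 0.25, 0.25]  # pi(a | s); must sum to 1

# V^pi(s) is the policy-weighted average of Q^pi(s, a).
state_value = sum(p * q for p, q in zip(policy_probs, q_values))

# A^pi(s, a) = Q^pi(s, a) - V^pi(s)
advantages = [q - state_value for q in q_values]

# Under the policy, the advantages average out to zero.
mean_advantage = sum(p * a for p, a in zip(policy_probs, advantages))

print(state_value, advantages, mean_advantage)
```

Actions with a positive advantage are better than the policy's average behavior in that state, and actions with a negative advantage are worse.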

## Mathematical Background

### Objective Function

The objective in reinforcement learning is to maximize the expected discounted return from the start state $s_0$:

$$
J(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right]
$$

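where $\gamma \in [0, 1)$ is the discount factor, $r(s_t, a_t)$ is the reward received at time step $t$ (abbreviated $r_t$ where convenient), and the expectation is taken over trajectories generated by following $\pi_{\theta}$.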
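
As a small illustration of the sum inside the expectation, the sketch below computes the discounted return of one finite reward sequence. Truncating the infinite sum to a finite episode is an assumption made here for the example, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Averaging this quantity over many trajectories sampled from $\pi_{\theta}$ would give a Monte Carlo estimate of $J(\pi_{\theta})$.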

### Policy Gradient

The policy gradient theorem provides the gradient of the objective function with respect to the policy parameters $\theta$:

$$
\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) Q^{\pi_{\theta}}(s_t, a_t) \right]
$$

In Actor-Critic methods, the true action-value function $Q^{\pi_{\theta}}(s_t, a_t)$ is unknown, so it is replaced by an estimate derived from the critic.
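
The following sketch evaluates the policy gradient expression exactly for a toy case: a single state, a tabular softmax policy whose parameters are the action logits, and known Q-values. The softmax parameterization, the Q-values, and the helper names are assumptions for illustration. For a softmax policy, the score function is $\partial \log \pi_{\theta}(a|s) / \partial \theta_b = \mathbb{1}[a = b] - \pi_{\theta}(b|s)$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the logits (softmax policy)."""
    probs = softmax(logits)
    return [(1.0 if b == action else 0.0) - p for b, p in enumerate(probs)]

def exact_policy_gradient(logits, q_values):
    """E_{a ~ pi}[grad log pi(a) * Q(a)], computed by summing over all actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, (prob, q) in enumerate(zip(probs, q_values)):
        score = grad_log_prob(logits, action)
        for b in range(len(logits)):
            grad[b] += prob * score[b] * q
    return grad

# Hypothetical Q-values: action 1 is better, so its logit gets a positive gradient.
gradient = exact_policy_gradient([0.0, 0.0], [1.0, 3.0])
print(gradient)
```

In practice the expectation is not summed exactly but estimated from sampled actions, which is where the variance of the estimate becomes a concern.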

### Critic

The critic estimates the value function $V^{\pi_{\theta}}(s)$ with a parameterized approximation $V_{\phi}(s)$, which is used to estimate the advantage function. The parameters $\phi$ are updated by minimizing the squared temporal difference (TD) error, where the TD error is:

$$
\delta_t = r(s_t, a_t) + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)
$$

The loss function for the critic is:

$$
L(\phi) = \mathbb{E}_{(s_t, r_t, s_{t+1}) \sim \pi_{\theta}} \left[ \delta_t^2 \right]
$$

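Treating the bootstrapped target $r(s_t, a_t) + \gamma V_{\phi}(s_{t+1})$ as a constant (a semi-gradient), the gradient of the loss function with respect to $\phi$ is: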

$$
\nabla_{\phi} L(\phi) = -2 \mathbb{E}_{(s_t, r_t, s_{t+1}) \sim \pi_{\theta}} \left[ \delta_t \nabla_{\phi} V_{\phi}(s_t) \right]
$$
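
A single critic update can be sketched as follows, assuming a linear value function $V_{\phi}(s) = \phi^{\top} x(s)$ over a feature vector $x(s)$. The linear form, the feature vectors, and the function names are assumptions for illustration. The step is a single-sample version of the gradient above, with the target held fixed.

```python
def value(phi, features):
    """Linear value estimate V_phi(s) = phi . x(s) (assumed parameterization)."""
    return sum(w * x for w, x in zip(phi, features))

def critic_step(phi, features, reward, next_features, gamma, lr):
    """One semi-gradient TD(0) step. Returns (new_phi, td_error)."""
    td_error = reward + gamma * value(phi, next_features) - value(phi, features)
    # Single-sample gradient of delta^2 with the target held fixed:
    #   -2 * delta * grad_phi V_phi(s_t) = -2 * delta * x(s_t)
    # Gradient descent therefore adds 2 * lr * delta * x(s_t).
    new_phi = [w + lr * 2.0 * td_error * x for w, x in zip(phi, features)]
    return new_phi, td_error

# One-hot features for two states; the transition s0 -> s1 yields reward 1.
phi, delta = critic_step([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], gamma=0.9, lr=0.1)
print(phi, delta)
```

Only the weights active in $x(s_t)$ move, because the target term involving $s_{t+1}$ is not differentiated.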

### Actor

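The actor updates the policy parameters $\theta$ in the direction suggested by the advantage function. Subtracting the state-dependent baseline $V^{\pi_{\theta}}(s_t)$ from $Q^{\pi_{\theta}}(s_t, a_t)$ does not change the expected gradient but typically reduces its variance, so the policy gradient can be written with the advantage function: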

$$
\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) A^{\pi_{\theta}}(s_t, a_t) \right]
$$

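When $V_{\phi}$ is close to $V^{\pi_{\theta}}$, the expected TD error given $(s_t, a_t)$ is approximately the advantage $A^{\pi_{\theta}}(s_t, a_t)$, so $\delta_t$ can serve as a single-sample estimate of the advantage function: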

$$
\nabla_{\theta} J(\pi_{\theta}) \approx \mathbb{E}_{\pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \delta_t \right]
$$
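
A single actor update from one sampled transition might look like the sketch below. It assumes a tabular softmax policy whose parameters are the logits of the current state. That parameterization and the name `actor_step` are illustrative choices, not something fixed by the method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_step(logits, action, td_error, lr):
    """Gradient ascent on log pi(action) weighted by the TD error."""
    probs = softmax(logits)
    new_logits = []
    for b, (theta_b, prob_b) in enumerate(zip(logits, probs)):
        # Softmax score function: d log pi(action) / d theta_b
        score = (1.0 if b == action else 0.0) - prob_b
        new_logits.append(theta_b + lr * td_error * score)
    return new_logits

# A positive TD error makes the sampled action (action 0) more likely.
updated = actor_step([0.0, 0.0], action=0, td_error=1.0, lr=0.1)
print(updated, softmax(updated))
```

A negative TD error has the opposite effect: the sampled action becomes less likely and the probability mass shifts to the other actions.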

### Actor-Critic Algorithm

The Actor-Critic algorithm iteratively updates the actor and the critic as follows:

1. **Initialize**: Initialize policy parameters $\theta$ and value function parameters $\phi$.
2. **Repeat**:
    - Collect a transition $(s_t, a_t, r_t, s_{t+1})$ by following policy $\pi_{\theta}$, where $r_t = r(s_t, a_t)$.
    - Compute TD error:
      $$
      \delta_t = r(s_t, a_t) + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)
      $$
    - Update critic by minimizing the loss function:
      $$
      \phi \leftarrow \phi - \alpha_c \nabla_{\phi} L(\phi)
      $$
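    - Update actor by ascending the policy gradient, estimated from the sampled transition as $\nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, \delta_t$: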
      $$
      \theta \leftarrow \theta + \alpha_a \nabla_{\theta} J(\pi_{\theta})
      $$
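
The loop above can be sketched end to end on a toy problem. Everything environment-specific here is an assumption for illustration: a three-state corridor where action 1 moves right, action 0 moves left, and stepping right from the last state ends the episode with reward 1. The actor is a table of softmax logits and the critic is a table of state values. Terminal states are given value 0, and the hyperparameters are arbitrary.

```python
import math
import random

N_STATES, N_ACTIONS = 3, 2  # hypothetical corridor: states 0..2, actions left/right

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(state, action):
    """Toy environment. Returns (next_state, reward, done)."""
    if action == 1:                          # move right
        if state == N_STATES - 1:
            return state, 1.0, True          # leaving the right end ends the episode
        return state + 1, 0.0, False
    return max(state - 1, 0), 0.0, False     # move left, blocked at state 0

def train(episodes=2000, gamma=0.9, lr_actor=0.1, lr_critic=0.05,
          max_steps=100, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # actor: tabular logits
    v = [0.0] * N_STATES                                   # critic: tabular V_phi
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Collect one transition by following the current policy.
            probs = softmax(theta[state])
            action = rng.choices(range(N_ACTIONS), weights=probs)[0]
            next_state, reward, done = step(state, action)
            # TD error; the value after termination is taken to be 0.
            bootstrap = 0.0 if done else v[next_state]
            delta = reward + gamma * bootstrap - v[state]
            # Critic: phi <- phi - lr_c * grad L, with grad L = -2 * delta * grad V.
            v[state] += 2.0 * lr_critic * delta
            # Actor: theta <- theta + lr_a * delta * grad log pi(a | s).
            for b in range(N_ACTIONS):
                indicator = 1.0 if b == action else 0.0
                theta[state][b] += lr_actor * delta * (indicator - probs[b])
            if done:
                break
            state = next_state
    return theta, v

theta, v = train()
right_probs = [softmax(row)[1] for row in theta]
print("P(right) per state:", right_probs)
print("V per state:", v)
```

This variant updates after every transition rather than after a batch, which keeps the sketch short. Batched updates or separate function approximators for the actor and critic follow the same structure.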

## Conclusion

The Actor-Critic method combines the strengths of policy-based and value-based approaches. The actor updates the policy directly, while the critic estimates how good the taken actions were. That estimate gives the actor a lower-variance learning signal than raw returns would.
