# **Actor Critic Methods**
REINFORCE works, but because it relies on Monte Carlo sampling to estimate the return (it uses entire episodes to compute it), the policy gradient estimate has high variance.

Recall that the policy gradient estimate gives the direction of steepest ascent in return: it tells us how to adjust the policy weights so that actions leading to good returns become more probable. Monte Carlo variance, a topic we will study further in this unit, slows training because we need a large number of samples to average it out.

The high variance in policy gradient estimation with REINFORCE and Monte Carlo sampling comes from how we estimate the return for each episode. Let's break it down with an example:

**Imagine you're training a robot to walk.**

-   **Episode 1:** The robot takes a few wobbly steps, falls, and gets a low return (say, -10).
-   **Episode 2:** The robot walks confidently for a longer distance and gets a high return (say, 20).

Based on these two episodes, REINFORCE uses Monte Carlo sampling to estimate the overall "goodness" of different actions.

**Problem:** The return varies significantly between episodes: -10 in the first and 20 in the second. This fluctuation isn't only about how well the policy performs; it can also come from random factors such as randomness in the environment or small changes in the robot's initial position.

![enter image description here](https://live.staticflickr.com/65535/53437272436_54159d7a60_c.jpg)

**The consequence of this variance:**

-   **Unclear direction for policy update:** When the returns change wildly, it's hard to tell which actions led to the good outcome in Episode 2 and which contributed to the failure in Episode 1. This ambiguity makes it difficult to update the policy weights reliably.
-   **Slow training:** To reduce the influence of this noise, we need to average the returns over many episodes. This means collecting more data, which slows training.

One way to mitigate the variance is to **use a large number of trajectories, hoping that the variance introduced in any one trajectory will be reduced in aggregate and provide a "true" estimation of the return.**

However, increasing the batch size significantly  **reduces sample efficiency**. So we need to find additional mechanisms to reduce the variance.
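To make this trade-off concrete, here is a minimal sketch. It assumes a hypothetical `sample_return` that draws episode returns from a Gaussian with mean 5 and standard deviation 15; these are stand-in numbers, not measurements from a real environment. The script estimates how the variance of the averaged return changes with the number of trajectories in a batch.

```python
import random
import statistics


def sample_return(rng):
    """Hypothetical noisy episode return (assumed Gaussian: mean 5, std 15)."""
    return rng.gauss(5.0, 15.0)


def batch_estimate(rng, batch_size):
    """Monte Carlo estimate of the expected return from one batch of episodes."""
    returns = [sample_return(rng) for _ in range(batch_size)]
    return statistics.fmean(returns)


def estimator_variance(batch_size, n_repeats=2000, seed=0):
    """Empirical variance of the batch estimate over many independent batches."""
    rng = random.Random(seed)
    estimates = [batch_estimate(rng, batch_size) for _ in range(n_repeats)]
    return statistics.pvariance(estimates)


for n in (1, 10, 100):
    print(f"batch size {n:>3}: variance of the estimate = {estimator_variance(n):.2f}")
```

For independent samples, the variance of the average falls roughly as 1/N, so each tenfold reduction in variance costs about ten times as many episodes. That cost is what the loss of sample efficiency refers to.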

Today, we will explore Actor-Critic methods, a hybrid architecture that integrates both value-based and policy-based approaches. This combination aims to enhance training stability by mitigating variance. The Actor component governs the agent's behavior, employing a Policy-Based method. On the other hand, the Critic assesses the quality of the chosen action, utilizing a Value-Based method. This dual-component system works in tandem to provide a more balanced and stable training process.
Specifically, we will look at **Advantage Actor Critic (A2C)**.

**Advantage Actor Critic (A2C)**

The Actor-Critic algorithm relies on two networks, the actor and the critic, working together on the same problem. The actor network selects an action at each time step, while the critic network evaluates the quality of that choice, for example as the Q-value of the state-action pair. As the critic learns to tell good situations from bad ones, it gives feedback to the actor. The actor uses this feedback to steer the agent toward favorable states and away from unfavorable ones.

Imagine you're training a self-driving car to navigate city streets. The car is the **actor**, making decisions about which lane to be in, when to turn, and how fast to drive. But how do you know if it's making good decisions? That's where the **critic** comes in.

**The Critic:**

-   Think of the critic as a backseat driver constantly assessing the car's performance. It observes the current traffic situation, speed, and surroundings and assigns a **value** to the current state. This value reflects how good the car's current position is based on reaching the destination safely and efficiently.

**The Actor:**

-   Based on the critic's feedback, the actor adjusts its driving strategy. If the critic assigns a low value because the car is stuck in heavy traffic, the actor might choose to change lanes or find an alternate route. Conversely, if the critic rewards the car for smoothly cruising on an open highway, the actor will likely continue its current behavior.

**The Dance:**

-   The actor and critic work in a continuous loop. The actor takes an action, the critic evaluates the outcome, and the actor learns from the feedback to improve its future decisions. The critic also updates how it evaluates, so that its feedback is more accurate next time. This ongoing collaboration helps the car navigate the complex and dynamic environment of city streets.

In the Actor-Critic framework, we aim to learn two function approximations:

1. A policy denoted as **πθ(s)**, controlled by the actor, which dictates how our agent should act given a state s.

2. A value function represented as **q̂w(s, a)**, managed by the critic, which evaluates the quality of the action taken in a particular state (s, a).

These two function approximations work together to improve learning. The policy learned by the actor is shaped by the feedback from the critic's value function. This lets the algorithm iteratively refine its decision-making based on both the current state and the quality of the actions taken.
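As a rough illustration of the two function approximations, here is a minimal tabular sketch. It assumes a tiny problem with 3 states and 2 actions, a softmax policy over per-state action preferences for the actor, and a lookup table for the critic. The names `theta`, `w`, `policy`, `q_hat`, and `sample_action` are illustrative choices; a real implementation would usually use neural networks instead.

```python
import math
import random

N_STATES, N_ACTIONS = 3, 2  # assumed sizes for this toy example

# Actor parameters theta: one preference per (state, action) pair.
theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
# Critic parameters w: one action-value estimate per (state, action) pair.
w = [[0.0] * N_ACTIONS for _ in range(N_STATES)]


def policy(theta, s):
    """Actor: action probabilities for state s (softmax over preferences)."""
    m = max(theta[s])  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in theta[s]]
    total = sum(exps)
    return [e / total for e in exps]


def q_hat(w, s, a):
    """Critic: estimated value of taking action a in state s."""
    return w[s][a]


def sample_action(theta, s, rng):
    """Draw an action from the actor's distribution for state s."""
    return rng.choices(range(N_ACTIONS), weights=policy(theta, s), k=1)[0]


rng = random.Random(0)
action = sample_action(theta, 0, rng)
print(policy(theta, 0), action, q_hat(w, 0, action))
```

With all parameters at zero, the actor is uniform over actions and the critic rates every pair as 0; training changes both tables.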

## The Actor-Critic Process
At every timestep, t, we receive the current state, denoted as St, from the environment. This state is then fed as input to both our Actor and Critic networks.

The Actor network processes the state and produces an action, represented as At, through our policy.

![enter image description here](https://live.staticflickr.com/65535/53437723810_fc9a89ef2e_c.jpg)

The Critic also takes that action as input and, using the current state St and the action At, calculates the value of taking that specific action in that particular state, referred to as the Q-value.

![enter image description here](https://live.staticflickr.com/65535/53437628864_4257f3e926_c.jpg)

The action At executed in the environment yields a new state, denoted as St+1, and a corresponding reward, represented as Rt+1.

![enter image description here](https://live.staticflickr.com/65535/53437746620_729467f0b3_c.jpg)

The Actor updates its policy parameters using the Q-value.

![enter image description here](https://live.staticflickr.com/65535/53437475158_a0c9d675a7_c.jpg)

The Actor subsequently generates the next action to be taken at time t+1, denoted as At+1, based on the new state St+1.

Following this, the Critic proceeds to update its value parameters, adapting its understanding of the state-action values based on the feedback received.

![enter image description here](https://live.staticflickr.com/65535/53437340646_e78f0f2094_z.jpg)
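The steps above can be sketched as a small tabular loop. This is a minimal sketch under stated assumptions, not a reference implementation. The environment is a hypothetical 4-state chain with the goal at state 3. The actor is a softmax policy over a preference table `theta`, and the critic is a Q-table `w`. The step sizes `ALPHA` and `BETA` and the discount `GAMMA` are arbitrary illustrative values.

```python
import math
import random

N_STATES, N_ACTIONS, GOAL = 4, 2, 3   # hypothetical chain: states 0..3, goal at 3
GAMMA, ALPHA, BETA = 0.9, 0.1, 0.2    # discount, actor step size, critic step size


def policy(theta, s):
    """Softmax policy over the tabular action preferences of state s."""
    m = max(theta[s])
    exps = [math.exp(p - m) for p in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]


def env_step(s, a):
    """Toy dynamics: action 1 moves right, action 0 moves left (floor at 0)."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL


def actor_update(theta, w, s, a):
    """theta += alpha * q_hat(s, a) * grad log pi(a | s), for a softmax policy."""
    probs = policy(theta, s)
    q = w[s][a]
    for b in range(N_ACTIONS):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += ALPHA * q * grad_log


def critic_update(w, s, a, r, s_next, a_next, done):
    """Move q_hat(s, a) toward the TD target r + gamma * q_hat(s', a')."""
    target = r if done else r + GAMMA * w[s_next][a_next]
    w[s][a] += BETA * (target - w[s][a])


def train(n_episodes=300, max_steps=50, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    w = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_episodes):
        s = 0
        a = rng.choices(range(N_ACTIONS), weights=policy(theta, s))[0]
        for _ in range(max_steps):
            s_next, r, done = env_step(s, a)      # environment returns S_{t+1}, R_{t+1}
            actor_update(theta, w, s, a)          # actor uses the critic's Q-value
            a_next = rng.choices(range(N_ACTIONS), weights=policy(theta, s_next))[0]
            critic_update(w, s, a, r, s_next, a_next, done)  # critic updates last
            if done:
                break
            s, a = s_next, a_next
    return theta, w


theta, w = train()
print("P(right | state 0):", round(policy(theta, 0)[1], 3))
print("q_hat(state 2, right):", round(w[2][1], 3))
```

The order inside the inner loop mirrors the text: act, observe the new state and reward, update the actor with the current Q-value, pick the next action, then update the critic.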


## Advantage in Actor-Critic (A2C)
Imagine you're training a robot chef to make the perfect omelette. The robot (the **actor**) flips, whisks, and folds, while a food critic (the **critic**) judges each step. But wait, the critic just gives overall scores ("great omelette!" or "disaster!"). How can the robot learn what specific actions led to those verdicts?

Enter the **Advantage Actor-Critic (A2C)**, a smarter critic that provides more helpful feedback. Here's how it works:

**1. The Critic Gets Specific:** A2C uses two value functions:

-   **State value (V(s)):** Estimates the overall goodness of being in a certain state, regardless of the next action.
-   **Action-value function (Q(s, a)):** Estimates the goodness of taking a specific action (a) in a given state (s).

So, instead of just saying "great!", the critic now says "that flip added 5 points to the omelette score!"

**2. Advantage: The Key Difference:** Now, A2C calculates the **advantage**:
![enter image description here](https://live.staticflickr.com/65535/53437636491_4bef158ea5_z.jpg)
The advantage tells the actor how much **better** taking a specific action (a) is than **what the policy achieves on average in the current state** (represented by V(s)).
In simpler terms, this function calculates the additional reward obtained by taking a specific action in a given state, compared to the average reward expected in that state.

The extra reward is what goes beyond the expected value of the state. If **A(s,a) > 0**, our action performs better than the average value of that state, and the gradient pushes the policy toward that action. Conversely, if **A(s,a) < 0**, our action performs worse than the state's average value, and the gradient pushes the policy away from it.
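A small worked example may help here. It assumes made-up Q-values and policy probabilities for a single state with three actions, and uses the standard relation that V(s) is the policy-weighted average of Q(s, a). The function names are illustrative.

```python
def state_value(probs, q_values):
    """V(s): average of Q(s, a) weighted by the policy's action probabilities."""
    return sum(p * q for p, q in zip(probs, q_values))


def advantages(probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action in the state."""
    v = state_value(probs, q_values)
    return [q - v for q in q_values]


# Assumed numbers for one state with three actions.
probs = [0.5, 0.25, 0.25]
q_values = [2.0, 6.0, 0.0]

print(state_value(probs, q_values))  # 2.5
print(advantages(probs, q_values))   # [-0.5, 3.5, -2.5]
```

Only the second action has a positive advantage, so it is the only one whose probability the update would increase. Even the first action, with a positive Q-value, is pushed down because it is below the state's average.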

Implementing this advantage function poses a challenge because it requires two value functions, Q(s,a) and V(s). The good news is that we can use the TD (Temporal Difference) error as an estimator of the advantage function, which requires only the state-value function V(s).

![enter image description here](https://live.staticflickr.com/65535/53438060845_5264363b17_z.jpg)
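As a minimal sketch of this estimator, the TD error for one transition can be computed from the reward and the critic's state values alone: r + γ·V(s') − V(s). The function name, the `done` flag for terminal transitions, and the example numbers are assumptions made for illustration.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error, r + gamma * V(s') - V(s), used as an advantage estimate.

    When the transition ends the episode, there is no next state to bootstrap from.
    """
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# Assumed numbers: reward 1, V(s) = 2, V(s') = 3, discount 0.5.
print(td_advantage(1.0, 2.0, 3.0, gamma=0.5))             # 0.5
print(td_advantage(1.0, 2.0, 3.0, gamma=0.5, done=True))  # -1.0
```

With this estimator the critic only needs to learn V(s), so a single value network is enough.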